# Apriori


The Apriori algorithm is a classical data mining algorithm for association rule learning. It extracts frequent itemsets from large datasets and is often used in market basket analysis, movie and post recommendation, and similar tasks to discover relationships between items.
The algorithm aims to find associations between products or events that commonly occur together, such as "if a customer buys bread, they often also buy butter."

Association rules are evaluated with three measures: support, confidence, and lift.

Support: This measures how often an itemset appears in the dataset. It is mathematically defined as:
$$\text{Support}(A) = \frac{\text{number of transactions that contain the itemset } A}{\text{total number of transactions}}$$

Example: If 100 transactions were made and 25 of them included {milk, bread}, the support for {milk, bread} is

$\frac{25}{100} = 0.25$

Confidence: This measures the likelihood that item B is also bought when item A is bought. It is mathematically defined as:

$$\text{Confidence}(A \rightarrow B) = \frac{\text{Support}(A \cup B)}{\text{Support}(A)}$$

Example: If 30 out of 50 transactions that include milk also include bread, the confidence of {milk} → {bread} is 30/50 = 0.6.

Lift: This indicates how much more likely item B is to be bought when item A is bought, compared to B being bought independently of A. It equals the confidence of A → B divided by the support of B.

Formula:
$$\text{Lift}(A \rightarrow B) = \frac{\text{Support}(A \cup B)}{\text{Support}(A) \times \text{Support}(B)}$$

A lift greater than 1 indicates a positive association, a lift equal to 1 suggests the items are independent, and a lift below 1 indicates a negative association.
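
As a minimal sketch, the three measures can be computed by hand on a toy dataset. The five transactions and the helper name `support` are made up for illustration and are not part of the grocery dataset used later.

```python
# Toy transactions, invented for illustration only
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

support_a = support({"milk"}, transactions)            # milk is in 3 of 5 baskets
support_b = support({"bread"}, transactions)           # bread is in 3 of 5 baskets
support_ab = support({"milk", "bread"}, transactions)  # both are in 2 of 5 baskets

confidence = support_ab / support_a   # share of milk baskets that also hold bread
lift = confidence / support_b         # confidence relative to bread's baseline share

print(f"support={support_ab:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
# support=0.40, confidence=0.67, lift=1.11
```

Here the lift is only slightly above 1, so in this toy data milk buyers are barely more likely than average to buy bread.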

# The algorithm has four steps
1. Set a minimum support and confidence.
2. Take all the subsets in transactions having higher support than the minimum support.
3. Take all the rules of these subsets having higher confidence than the minimum confidence.
4. Sort the rules by decreasing lift.
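
The four steps can be sketched in plain Python for rules between pairs of items. This is an illustrative sketch on invented toy data with arbitrarily chosen thresholds, not the implementation used by any library; it only builds candidate pairs from items that are already frequent, which is the pruning idea behind Apriori.

```python
from itertools import combinations, permutations

# Toy transactions, invented for illustration only
transactions = [
    {"milk", "bread", "butter"},
    {"milk", "bread"},
    {"milk", "eggs"},
    {"bread", "butter"},
    {"eggs"},
]

# Step 1: set the thresholds (arbitrary values for this toy example)
MIN_SUPPORT = 0.4
MIN_CONFIDENCE = 0.6

def support(itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

# Step 2: keep frequent single items, then build candidate pairs only from them.
# A pair cannot be frequent if one of its items is not.
items = sorted(set().union(*transactions))
frequent_items = [i for i in items if support({i}) >= MIN_SUPPORT]
frequent_pairs = [p for p in combinations(frequent_items, 2)
                  if support(p) >= MIN_SUPPORT]

# Step 3: derive rules A -> B from each frequent pair and keep the confident ones
rules = []
for pair in frequent_pairs:
    for a, b in permutations(pair):
        confidence = support({a, b}) / support({a})
        if confidence >= MIN_CONFIDENCE:
            lift = confidence / support({b})
            rules.append((a, b, support({a, b}), confidence, lift))

# Step 4: sort the rules by decreasing lift
rules.sort(key=lambda rule: rule[4], reverse=True)
for a, b, sup, conf, lift_value in rules:
    print(f"{a} -> {b}: support={sup:.2f}, confidence={conf:.2f}, lift={lift_value:.2f}")
```

On this toy data the two rules between bread and butter come out on top, ahead of the two rules between bread and milk.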

# Applications of Apriori
1. Market Basket Analysis: Used to find products frequently purchased together, guiding product placement and promotions in retail.
2. Recommendation Systems: Suggesting products based on user preferences.
3. Healthcare: Identifying frequently co-occurring symptoms or diseases.
4. Fraud Detection: Finding patterns in fraudulent behavior.

## Importing the libraries

In [1]:
!pip install apyori


In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Data Preprocessing

The dataset used for this study contains the lists of items bought by customers at a grocery store over one week.
Note: The purchases of 7501 customers were recorded, and the largest basket contains 20 items.

In [2]:
dataset = pd.read_csv('Market_Basket_Optimisation.csv', header = None) # The file has no header row

## apyori expects a list of transactions (each a list of item strings), not a data frame
transactions = []
## Loop over each customer's purchased items.
## Baskets with fewer items than the widest one are padded with NaN, which str() turns into 'nan'.
n_rows, n_cols = dataset.shape # 7501 transactions, up to 20 items each
for i in range(n_rows):
    transactions.append([str(dataset.values[i, j]) for j in range(n_cols)])
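
The loop above converts every cell with `str()`, so short baskets carry the placeholder string 'nan' as if it were an item. As a hedged alternative sketch, the file could be read with the standard `csv` module and empty fields dropped. The three-row `sample_csv` below is a made-up stand-in for the real file, and it assumes one basket per row.

```python
import csv
import io

# Made-up stand-in for the CSV file: one basket per row, rows of different lengths
sample_csv = io.StringIO(
    "milk,bread,butter\n"
    "eggs\n"
    "bread,honey\n"
)

# Keep only non-empty fields so no placeholder item enters a basket
transactions = [
    [item.strip() for item in row if item.strip()]
    for row in csv.reader(sample_csv)
    if row
]

print(transactions)
# [['milk', 'bread', 'butter'], ['eggs'], ['bread', 'honey']]
```

With the real file, `sample_csv` would be replaced by an open file handle.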

## Training the Apriori model on the dataset
Training the Apriori model requires at least four inputs.
transactions: the list of transactions in the format that apriori expects (a list of lists of item names)

Minimum support: Here we require an itemset to be bought at least 3 times per day over the 7 days, i.e. to appear in at least 21 of the week's transactions. This keeps itemsets that are bought reasonably often. As a fraction of all transactions:
 $\frac{3 \times 7}{7501} \approx 0.0028$, which is rounded to 0.003.
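
The threshold arithmetic above can be checked in a few lines. The variable names are illustrative only.

```python
transactions_per_day = 3    # chosen minimum purchase frequency
days = 7                    # the dataset covers one week
total_transactions = 7501

min_count = transactions_per_day * days        # minimum number of transactions
min_support = min_count / total_transactions   # the same threshold as a fraction

print(min_count, round(min_support, 4))
# 21 0.0028
```

Rounding this up to 0.003 makes the filter slightly stricter than 21 transactions, since 0.003 × 7501 is about 22.5.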
 
Minimum confidence: Choose a value between 0 and 1, since confidence is a probability; this example uses 0.2.
Important notes:
Setting the confidence threshold too high (close to 1) may result in fewer rules, because only very strong associations will pass the filter.
Setting it too low (close to 0) may lead to too many weak or uninformative rules.

Minimum lift: For market basket analysis, we choose a minimum lift greater than 1 so that only positive associations are kept; this example uses 3.

NOTE:
1. Lift = 1: The items are bought together as often as expected by chance (no association).
2. Lift > 1: The items are bought together more frequently than expected (positive association, co-occurring items).
3. Lift < 1: The items are bought together less frequently than expected (negative association, which can help identify substitutes).

Since we are focusing on the relationship between two items for the promotion (buy one, get one free), the minimum and maximum length of itemsets is set to 2, to only look for pairs of products. 

In [3]:
from apyori import apriori
rules = apriori(transactions = transactions, min_support = 0.003, min_confidence = 0.2, min_lift = 3, min_length = 2, max_length = 2)

## Visualising the results

### Displaying the first results coming directly from the output of the apriori function

In [4]:
results = list(rules)

In [5]:
results

[RelationRecord(items=frozenset({'light cream', 'chicken'}), support=0.004532728969470737, ordered_statistics=[OrderedStatistic(items_base=frozenset({'light cream'}), items_add=frozenset({'chicken'}), confidence=0.29059829059829057, lift=4.84395061728395)]),
 RelationRecord(items=frozenset({'mushroom cream sauce', 'escalope'}), support=0.005732568990801226, ordered_statistics=[OrderedStatistic(items_base=frozenset({'mushroom cream sauce'}), items_add=frozenset({'escalope'}), confidence=0.3006993006993007, lift=3.790832696715049)]),
 RelationRecord(items=frozenset({'pasta', 'escalope'}), support=0.005865884548726837, ordered_statistics=[OrderedStatistic(items_base=frozenset({'pasta'}), items_add=frozenset({'escalope'}), confidence=0.3728813559322034, lift=4.700811850163794)]),
 RelationRecord(items=frozenset({'fromage blanc', 'honey'}), support=0.003332888948140248, ordered_statistics=[OrderedStatistic(items_base=frozenset({'fromage blanc'}), items_add=frozenset({'honey'}), confidence=0

# Interpreting the result

[RelationRecord(items=frozenset({'light cream', 'chicken'}), support=0.004532728969470737, ordered_statistics=[OrderedStatistic(items_base=frozenset({'light cream'}), items_add=frozenset({'chicken'}), confidence=0.29059829059829057, lift=4.84395061728395)])

This means that light cream and chicken have a positive association, with a lift of about 4.8: a customer who buys light cream is about 4.8 times as likely to buy chicken as a customer in general. The pair appears in about 0.45% of all transactions (support ≈ 0.0045). Also, the probability that a customer who buys light cream also buys chicken is about 0.29 (the confidence).
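
As a quick check on this reading, the figures in the record can be turned back into counts and baseline shares. This sketch uses only the numbers printed above and the total of 7501 transactions; the variable names are illustrative.

```python
total_transactions = 7501
support = 0.004532728969470737      # light cream and chicken together
confidence = 0.29059829059829057    # share of light cream baskets that include chicken
lift = 4.84395061728395

pair_count = round(support * total_transactions)   # baskets containing both items
chicken_support = confidence / lift                # implied overall share of chicken
light_cream_support = support / confidence         # implied overall share of light cream

print(pair_count)
# 34
print(round(chicken_support, 3), round(light_cream_support, 4))
# 0.06 0.0156
```

So chicken is in about 6% of all baskets but in about 29% of the baskets that contain light cream, which is where the lift of roughly 4.8 comes from.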

### Organising the results into a pandas DataFrame

In [6]:
def inspect(results):
    # Each result is (items, support, ordered_statistics).
    # Only the first rule of each result, and the first item on each side, is read.
    lhs         = [tuple(result[2][0][0])[0] for result in results]
    rhs         = [tuple(result[2][0][1])[0] for result in results]
    supports    = [result[1] for result in results]
    confidences = [result[2][0][2] for result in results]
    lifts       = [result[2][0][3] for result in results]
    return list(zip(lhs, rhs, supports, confidences, lifts))
resultsinDataFrame = pd.DataFrame(inspect(results), columns = ['Left Hand Side', 'Right Hand Side', 'Support', 'Confidence', 'Lift'])
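
The positional indexing in `inspect` reads only the first rule of each record. A hedged variant is sketched below that reads the fields by name and emits one row per rule. To stay self-contained it defines stand-in namedtuples that mirror the field names visible in the printed output; `inspect_all` and `sample_results` are hypothetical names, and with apyori installed the real records would be passed instead.

```python
from collections import namedtuple

# Stand-ins mirroring the field names shown in the printed output above
RelationRecord = namedtuple("RelationRecord", ["items", "support", "ordered_statistics"])
OrderedStatistic = namedtuple("OrderedStatistic",
                              ["items_base", "items_add", "confidence", "lift"])

# One record copied from the output above
sample_results = [
    RelationRecord(
        items=frozenset({"light cream", "chicken"}),
        support=0.004532728969470737,
        ordered_statistics=[
            OrderedStatistic(
                items_base=frozenset({"light cream"}),
                items_add=frozenset({"chicken"}),
                confidence=0.29059829059829057,
                lift=4.84395061728395,
            )
        ],
    )
]

def inspect_all(results):
    """Return one row per rule, reading fields by name instead of by position."""
    rows = []
    for record in results:
        for stat in record.ordered_statistics:
            rows.append((
                ", ".join(sorted(stat.items_base)),   # left-hand side
                ", ".join(sorted(stat.items_add)),    # right-hand side
                record.support,
                stat.confidence,
                stat.lift,
            ))
    return rows

rows = inspect_all(sample_results)
print(rows)
```

The resulting rows have the same column order as the DataFrame built above.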

### Displaying the unsorted results

In [7]:
resultsinDataFrame

| | Left Hand Side | Right Hand Side | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 0 | light cream | chicken | 0.004533 | 0.290598 | 4.843951 |
| 1 | mushroom cream sauce | escalope | 0.005733 | 0.300699 | 3.790833 |
| 2 | pasta | escalope | 0.005866 | 0.372881 | 4.700812 |
| 3 | fromage blanc | honey | 0.003333 | 0.245098 | 5.164271 |
| 4 | herb & pepper | ground beef | 0.015998 | 0.32345 | 3.291994 |
| 5 | tomato sauce | ground beef | 0.005333 | 0.377358 | 3.840659 |
| 6 | light cream | olive oil | 0.0032 | 0.205128 | 3.11471 |
| 7 | whole wheat pasta | olive oil | 0.007999 | 0.271493 | 4.12241 |
| 8 | pasta | shrimp | 0.005066 | 0.322034 | 4.506672 |


### Displaying the results sorted by descending lifts

In [8]:
resultsinDataFrame.nlargest(n = 10, columns = 'Lift')

| | Left Hand Side | Right Hand Side | Support | Confidence | Lift |
|---|---|---|---|---|---|
| 3 | fromage blanc | honey | 0.003333 | 0.245098 | 5.164271 |
| 0 | light cream | chicken | 0.004533 | 0.290598 | 4.843951 |
| 2 | pasta | escalope | 0.005866 | 0.372881 | 4.700812 |
| 8 | pasta | shrimp | 0.005066 | 0.322034 | 4.506672 |
| 7 | whole wheat pasta | olive oil | 0.007999 | 0.271493 | 4.12241 |
| 5 | tomato sauce | ground beef | 0.005333 | 0.377358 | 3.840659 |
| 1 | mushroom cream sauce | escalope | 0.005733 | 0.300699 | 3.790833 |
| 4 | herb & pepper | ground beef | 0.015998 | 0.32345 | 3.291994 |
| 6 | light cream | olive oil | 0.0032 | 0.205128 | 3.11471 |
