In [2]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

%pip install mlxtend

Collecting mlxtend
  Using cached mlxtend-0.23.0-py3-none-any.whl (1.4 MB)
Installing collected packages: mlxtend
Successfully installed mlxtend-0.23.0
Note: you may need to restart the kernel to use updated packages.


# Association Rule for Store Dataset

In this case study, we will explore how association rules can be used to analyze which items are usually purchased together.

You can refer to these articles to learn more about the Apriori algorithm and association rules:
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/

## Load Data

We will use a dataset of transactions from a store. You can get the dataset here: 
https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv

In [34]:
# load the dataset and show the first five transactions
url = "https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv"
df1 = pd.read_csv(url)

df1.head()

Unnamed: 0,0,1,2,3,4,5,6
0,Bread,Wine,Eggs,Meat,Cheese,Pencil,Diaper
1,Bread,Cheese,Meat,Diaper,Wine,Milk,Pencil
2,Cheese,Meat,Eggs,Milk,Wine,,
3,Cheese,Meat,Eggs,Milk,Wine,,
4,Meat,Pencil,Wine,,,,


## Get the Set of Purchased Products


Collect the unique products that appear in the transactions.

In [35]:
# collect unique values across all item columns instead of relying on column '6' alone
unique_product = pd.unique(df1.values.ravel())

print(set(unique_product))

{nan, 'Milk', 'Bread', 'Wine', 'Pencil', 'Diaper', 'Eggs', 'Bagel', 'Meat', 'Cheese'}


## Preprocess Data

In this step, we will transform the dataset into a one-hot encoding: one column per product, with 1 if the transaction contains that product and 0 otherwise.
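
As a minimal sketch of what one-hot encoding means here, the snippet below encodes three made-up transactions in plain Python. The toy `transactions` list and the variable names are illustrative assumptions, not part of the store dataset.

```python
# Toy transactions (hypothetical data, for illustration only)
transactions = [
    ["Bread", "Wine"],
    ["Milk"],
    ["Bread", "Milk", "Eggs"],
]

# All distinct products, sorted so the column order is stable
items = sorted({item for t in transactions for item in t})

# One dict per transaction: 1 if the product was bought, 0 otherwise
encoded = [{item: int(item in t) for item in items} for t in transactions]

for row in encoded:
    print(row)
```

Each dictionary plays the role of one row of the encoded dataframe built in the next cell.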

In [36]:
# create an itemset based on the products
itemset = set(unique_product)

# encode each transaction: 1 if the product was purchased, 0 otherwise
encode_feature = []
for index, row in df1.iterrows():
    label = {}
    uncommons = list(itemset - set(row))        # products not in this transaction
    commons = list(itemset.intersection(row))   # products in this transaction
    for ucom in uncommons:
        label[ucom] = 0
    for com in commons:
        label[com] = 1
    encode_feature.append(label)

In [37]:
# create new dataframe from the encoded features
encode_df1 = pd.DataFrame(encode_feature)

# show the new dataframe
encode_df1.head()


Unnamed: 0,NaN,Milk,Bagel,Bread,Pencil,Cheese,Eggs,Diaper,Meat,Wine
0,0,0,0,1,1,1,1,1,1,1
1,0,1,0,1,1,1,0,1,1,1
2,1,1,0,0,0,1,1,0,1,1
3,1,1,0,0,0,1,1,0,1,1
4,1,0,0,0,1,0,0,0,1,1


Since the encoded dataframe contains a column for missing values (NaN), we will drop it by keeping only the columns whose label is not NaN.

In [38]:
new_encode_df1 = (encode_df1.loc[:, encode_df1.columns.notna()])

new_encode_df1.head()

Unnamed: 0,Milk,Bagel,Bread,Pencil,Cheese,Eggs,Diaper,Meat,Wine
0,0,0,1,1,1,1,1,1,1
1,1,0,1,1,1,0,1,1,1
2,1,0,0,0,1,1,0,1,1
3,1,0,0,0,1,1,0,1,1
4,0,0,0,1,0,0,0,1,1


## Apriori Algorithm

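We will use the Apriori algorithm to find the frequently purchased products (frequent itemsets). 
For this case study, we will use min_support=0.2, so an itemset is kept only if it appears in at least 20% of the transactions.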
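
To illustrate the idea behind the algorithm, here is a simplified level-wise sketch in plain Python. It is not mlxtend's implementation; the function name `frequent_itemsets` and the toy transactions are hypothetical, chosen only to show how the support threshold prunes candidates.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for all itemsets meeting min_support."""
    tx = [frozenset(t) for t in transactions]
    n = len(tx)
    # Level 1 candidates: every single item
    current = [frozenset([i]) for i in sorted({i for t in tx for i in t})]
    result = {}
    k = 1
    while current:
        # Keep candidates whose support reaches the threshold
        level = {}
        for cand in current:
            support = sum(cand <= t for t in tx) / n
            if support >= min_support:
                level[cand] = support
        result.update(level)
        # Join frequent k-itemsets into (k+1)-item candidates
        k += 1
        cands = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent
        current = [
            c for c in cands
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return result

# Hypothetical toy data
toy = [{"Milk", "Bread"}, {"Milk", "Bread", "Eggs"}, {"Milk"}, {"Bread", "Eggs"}]
found = frequent_itemsets(toy, min_support=0.5)
for itemset, support in found.items():
    print(sorted(itemset), support)
```

The pruning step is what distinguishes Apriori from brute-force counting: a candidate is only counted if all of its smaller subsets were already frequent.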

In [39]:
from mlxtend.frequent_patterns import apriori, association_rules

frequently_purchased_products = apriori(new_encode_df1, min_support=0.2, use_colnames=True)

frequently_purchased_products



Unnamed: 0,support,itemsets
0,0.501587,(Milk)
1,0.425397,(Bagel)
2,0.504762,(Bread)
3,0.361905,(Pencil)
4,0.501587,(Cheese)
5,0.438095,(Eggs)
6,0.406349,(Diaper)
7,0.47619,(Meat)
8,0.438095,(Wine)
9,0.225397,"(Milk, Bagel)"


Then, we will generate association rules from the frequent itemsets, keeping only the rules whose confidence is at least 0.6 (metric="confidence", min_threshold=0.6).
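
The filtering step can be sketched in plain Python as follows. This is an illustrative approximation, not how mlxtend implements `association_rules`; the helper name `rules_by_confidence` is hypothetical, and the support values are copied from the frequent itemset and rule tables in this notebook.

```python
from itertools import combinations

# Supports taken from the tables in this notebook
supports = {
    frozenset({"Milk"}): 0.501587,
    frozenset({"Cheese"}): 0.501587,
    frozenset({"Bagel"}): 0.425397,
    frozenset({"Milk", "Cheese"}): 0.304762,
    frozenset({"Milk", "Bagel"}): 0.225397,
}

def rules_by_confidence(supports, min_confidence):
    """Split each itemset into antecedent -> consequent and keep confident rules."""
    rules = []
    for itemset, support_xy in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one item on each side
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                x = frozenset(antecedent)
                y = itemset - x
                confidence = support_xy / supports[x]
                if confidence >= min_confidence:
                    rules.append((x, y, confidence))
    return rules

rules = rules_by_confidence(supports, min_confidence=0.6)
for x, y, confidence in rules:
    print(sorted(x), "->", sorted(y), round(confidence, 6))
```

With these inputs, both directions of the Milk/Cheese rule pass the threshold, while the (Milk, Bagel) itemset yields no rule because neither direction reaches a confidence of 0.6.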

In [40]:
ass_rule_itemset = association_rules(frequently_purchased_products, metric="confidence", min_threshold=0.6)

ass_rule_itemset

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Cheese),(Milk),0.501587,0.501587,0.304762,0.607595,1.211344,0.053172,1.270148,0.350053
1,(Milk),(Cheese),0.501587,0.501587,0.304762,0.607595,1.211344,0.053172,1.270148,0.350053
2,(Bagel),(Bread),0.425397,0.504762,0.279365,0.656716,1.301042,0.064641,1.44265,0.402687
3,(Eggs),(Cheese),0.438095,0.501587,0.298413,0.681159,1.358008,0.07867,1.563203,0.469167
4,(Meat),(Cheese),0.47619,0.501587,0.32381,0.68,1.355696,0.084958,1.55754,0.500891
5,(Cheese),(Meat),0.501587,0.47619,0.32381,0.64557,1.355696,0.084958,1.477891,0.526414
6,(Wine),(Cheese),0.438095,0.501587,0.269841,0.615942,1.227986,0.050098,1.297754,0.330409
7,(Eggs),(Meat),0.438095,0.47619,0.266667,0.608696,1.278261,0.05805,1.338624,0.387409
8,"(Meat, Cheese)",(Milk),0.32381,0.501587,0.203175,0.627451,1.250931,0.040756,1.337845,0.296655
9,"(Meat, Milk)",(Cheese),0.244444,0.501587,0.203175,0.831169,1.657077,0.080564,2.952137,0.524816


Below is an explanation of __antecedent support__, __consequent support__, __support__, __confidence__, __lift__, __leverage__ and __conviction__, as reported in the rule table above.

__1. Antecedent Support__

- Definition: Antecedent support is the support of the antecedent itemset in an association rule.
- Formula: support(X) = proportion of transactions that contain X
- Interpretation: It represents how often the antecedent (itemset X) appears in the dataset.

__2. Consequent Support__

- Definition: Consequent support is similar to antecedent support but applies to the consequent itemset in an association rule.
- Formula: support(Y) = proportion of transactions that contain Y
- Interpretation: It represents how often the consequent (itemset Y) appears in the dataset.

__3. Support__
- Definition: Support is the proportion of transactions in the dataset that contain every item of the rule, that is, the combined itemset X∪Y.
- Formula: support(X→Y) = support(X∪Y)
- Interpretation: A high support indicates that the combined itemset X∪Y occurs frequently in the dataset.

__4. Confidence__
- Definition: Confidence measures the reliability of the implication of an association rule. It is the probability of the consequent given the antecedent.
- Formula: confidence(X→Y) = support(X→Y)/support(X)
- Interpretation: A high confidence suggests that when the antecedent is present, the consequent is likely to be present as well. 

__5. Lift__
- Definition: Lift measures how much more likely the consequent is, given the antecedent, compared to its likelihood without the antecedent.
- Formula: lift(X→Y) = confidence(X→Y)/support(Y)
- Interpretation: A lift greater than 1 indicates that the presence of the antecedent increases the likelihood of the consequent; a lift of 1 indicates that the two are independent.

__6. Leverage__
- Definition: Leverage measures the difference between the observed frequency of the antecedent and consequent appearing together and the frequency that would be expected if they were independent.
- Formula: leverage(X→Y) = support(X→Y) − support(X) × support(Y)
- Interpretation: Positive leverage indicates that the antecedent and consequent appear together more often than expected by chance.

__7. Conviction__
- Definition: Conviction measures the ratio of the expected frequency that X occurs without Y (if they were independent) to the observed frequency of X not implying Y.
- Formula: conviction(X→Y) = (1−support(Y))/(1−confidence(X→Y))
- Interpretation: A conviction of 1 indicates that the antecedent and consequent are independent; higher values indicate that the consequent depends more strongly on the antecedent.
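
As a worked check of the formulas above, the sketch below recomputes the metrics for rule 9, (Meat, Milk) → (Cheese), from its three support values in the rule table. The function name `rule_metrics` is a hypothetical helper, and because the inputs are the rounded values shown in the table, the results agree with the table only up to rounding.

```python
def rule_metrics(support_x, support_y, support_xy):
    """Compute rule metrics from the antecedent, consequent and joint supports."""
    confidence = support_xy / support_x
    lift = confidence / support_y
    leverage = support_xy - support_x * support_y
    # Conviction is undefined (infinite) when the rule never fails
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - support_y) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }

# Rule 9 in the table: (Meat, Milk) -> (Cheese)
metrics = rule_metrics(support_x=0.244444, support_y=0.501587, support_xy=0.203175)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```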