In [51]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
#!pip install mlxtend

# Association Rule for Store Dataset

In this case study, we will explore how association rule mining can be used to analyze items that are usually purchased together.

You can refer to these articles to learn about Apriori and association rules:
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/

## Load Data

We will use a dataset of transactions from a store. You can get the dataset here:
https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv

In [52]:
# load the dataset and show the first five transactions

url = "https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv"
df = pd.read_csv(url)
print(df.head())

        0       1     2       3       4       5       6
0   Bread    Wine  Eggs    Meat  Cheese  Pencil  Diaper
1   Bread  Cheese  Meat  Diaper    Wine    Milk  Pencil
2  Cheese    Meat  Eggs    Milk    Wine     NaN     NaN
3  Cheese    Meat  Eggs    Milk    Wine     NaN     NaN
4    Meat  Pencil  Wine     NaN     NaN     NaN     NaN


## Get the Set of Purchased Products

Get the unique products that have been purchased.

In [53]:
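# collect products from every column, not just the first, and ignore missing values
unique_products = pd.Series(df.values.ravel()).dropna().unique()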
print(set(unique_products))

{'Diaper', 'Meat', 'Bread', 'Cheese', 'Eggs', 'Wine', 'Bagel', 'Milk', 'Pencil'}


## Preprocess Data

In this step, we will transform the dataset into a one-hot encoding, with one row per transaction and one column per product.
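To make the idea concrete, here is a minimal sketch of one-hot encoding done by hand in plain Python. The toy transactions are made up for illustration and are not taken from the store dataset; the notebook itself relies on `TransactionEncoder` for this step.

```python
# Toy transactions (hypothetical, for illustration only)
transactions = [
    ["Bread", "Milk"],
    ["Bread", "Eggs", "Wine"],
    ["Milk"],
]

# Every distinct item becomes one column, in sorted order
items = sorted({item for basket in transactions for item in basket})

# One row per transaction: 1 if the item is in the basket, 0 otherwise
encoded = [[int(item in basket) for item in items] for basket in transactions]

print(items)
for row in encoded:
    print(row)
```

Each row has the same length regardless of how many items the original basket held, which is the shape a frequent-itemset search expects.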

In [54]:
# build one list of purchased items per transaction, skipping missing values
transactions = df.apply(lambda x: x.dropna().tolist(), axis=1)

# one-hot encode the transactions
te = TransactionEncoder()
encoded_transactions = te.fit_transform(transactions)

In [55]:
# create new dataframe from the encoded features
df_encoded = pd.DataFrame(encoded_transactions, columns=te.columns_)
df_encoded = df_encoded.astype(int)

# show the new dataframe
df_encoded.head()

   Bagel  Bread  Cheese  Diaper  Eggs  Meat  Milk  Pencil  Wine
0      0      1       1       1     1     1     0       1     1
1      0      1       1       1     0     1     1       1     1
2      0      0       1       0     1     1     1       0     1
3      0      0       1       0     1     1     1       0     1
4      0      0       0       0     0     1     0       1     1


Missing values were already removed when the transactions were built, so the encoded dataframe should not contain a NaN column. As a safeguard, we still drop any column that contains missing values.

In [56]:
df_encoded = df_encoded.dropna(axis=1)

## Apriori Algorithm

We will use the Apriori algorithm to find the frequently purchased products.
For this case study, we will use min_support=0.2, so an itemset is kept only if it appears in at least 20% of the transactions.
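The level-wise search behind Apriori can be sketched in plain Python as follows. This is a simplified illustration on made-up toy transactions, not the mlxtend implementation: the helper names `support` and `frequent_itemsets` are hypothetical, and the candidate step skips the subset-pruning optimisation of the full algorithm (candidates with an infrequent subset simply fail the support test).

```python
from itertools import combinations


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


def frequent_itemsets(transactions, min_support):
    """Grow itemsets one item at a time, keeping only the frequent ones."""
    baskets = [frozenset(t) for t in transactions]
    items = sorted({item for basket in baskets for item in basket})
    candidates = [frozenset([item]) for item in items]
    result = {}
    size = 1
    while candidates:
        # Keep the candidates whose support meets the threshold
        frequent = {}
        for candidate in candidates:
            s = support(candidate, baskets)
            if s >= min_support:
                frequent[candidate] = s
        result.update(frequent)
        # Join frequent itemsets that differ by one item to form the next level
        size += 1
        candidates = list(
            {a | b for a, b in combinations(frequent, 2) if len(a | b) == size}
        )
    return result


# Toy transactions (hypothetical, for illustration only)
toy = [
    ["Bread", "Milk"],
    ["Bread", "Milk", "Eggs"],
    ["Bread", "Eggs"],
    ["Milk"],
]
found = frequent_itemsets(toy, min_support=0.5)
for itemset, s in found.items():
    print(sorted(itemset), s)
```

Because an itemset can only be frequent if all of its subsets are frequent, each level is built only from the survivors of the previous one, which keeps the search far smaller than testing every possible combination.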

In [57]:
frequent_itemsets = apriori(df_encoded, min_support=0.2, use_colnames=True)
print("Frequent Itemsets:")
print(frequent_itemsets)

Frequent Itemsets:
     support              itemsets
0   0.425397               (Bagel)
1   0.504762               (Bread)
2   0.501587              (Cheese)
3   0.406349              (Diaper)
4   0.438095                (Eggs)
5   0.476190                (Meat)
6   0.501587                (Milk)
7   0.361905              (Pencil)
8   0.438095                (Wine)
9   0.279365        (Bagel, Bread)
10  0.225397         (Bagel, Milk)
11  0.238095       (Cheese, Bread)
12  0.231746       (Diaper, Bread)
13  0.206349         (Meat, Bread)
14  0.279365         (Milk, Bread)
15  0.200000       (Pencil, Bread)
16  0.244444         (Wine, Bread)
17  0.200000      (Cheese, Diaper)
18  0.298413        (Cheese, Eggs)
19  0.323810        (Cheese, Meat)
20  0.304762        (Cheese, Milk)
21  0.200000      (Cheese, Pencil)
22  0.269841        (Cheese, Wine)
23  0.234921        (Diaper, Wine)
24  0.266667          (Eggs, Meat)
25  0.244444          (Eggs, Milk)
26  0.241270          (Eggs, Wine)
...



Then, we will generate association rules from the frequent itemsets, keeping only the rules whose confidence is at least 0.6 (metric="confidence", min_threshold=0.6).
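As a rough sketch of what this filtering does, the snippet below enumerates the rules that can be formed from a few of the frequent itemsets listed above and keeps those meeting the confidence threshold. The support values are copied from the Apriori output; the function name `rules_by_confidence` is hypothetical and this is not how mlxtend implements `association_rules`.

```python
from itertools import combinations

# Supports copied from the frequent-itemset output above (a small subset)
supports = {
    frozenset(["Bagel"]): 0.425397,
    frozenset(["Bread"]): 0.504762,
    frozenset(["Cheese"]): 0.501587,
    frozenset(["Meat"]): 0.476190,
    frozenset(["Bagel", "Bread"]): 0.279365,
    frozenset(["Cheese", "Meat"]): 0.323810,
}


def rules_by_confidence(supports, min_confidence):
    """Split each itemset into antecedent -> consequent and filter by confidence."""
    rules = []
    for itemset, joint_support in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one item on each side
        for r in range(1, len(itemset)):
            for left in combinations(sorted(itemset), r):
                antecedent = frozenset(left)
                consequent = itemset - antecedent
                confidence = joint_support / supports[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, consequent, confidence))
    return rules


rules = rules_by_confidence(supports, min_confidence=0.6)
for antecedent, consequent, confidence in rules:
    print(sorted(antecedent), "->", sorted(consequent), round(confidence, 4))
```

With these inputs, Bagel → Bread passes the threshold while Bread → Bagel does not, because the same joint support is divided by the larger support of Bread.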

In [59]:
rules_of_association = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)

print(rules_of_association)

       antecedents consequents  antecedent support  consequent support  \
0          (Bagel)     (Bread)            0.425397            0.504762   
1           (Eggs)    (Cheese)            0.438095            0.501587   
2         (Cheese)      (Meat)            0.501587            0.476190   
3           (Meat)    (Cheese)            0.476190            0.501587   
4         (Cheese)      (Milk)            0.501587            0.501587   
5           (Milk)    (Cheese)            0.501587            0.501587   
6           (Wine)    (Cheese)            0.438095            0.501587   
7           (Eggs)      (Meat)            0.438095            0.476190   
8   (Cheese, Eggs)      (Meat)            0.298413            0.476190   
9   (Cheese, Meat)      (Eggs)            0.323810            0.438095   
10    (Eggs, Meat)    (Cheese)            0.266667            0.501587   
11  (Cheese, Meat)      (Milk)            0.323810            0.501587   
12  (Cheese, Milk)      (Meat)        

Provide an explanation of __antecedent support__, __consequent support__, __support__, __confidence__, __lift__, __leverage__ and __conviction__.

1. Support:

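- Formula: Support(A) = (Transactions containing A) / (Total transactions)

- Support measures the proportion of transactions in the dataset that contain the itemset A. Higher support indicates that the itemset is more frequent in the dataset. For a rule A → B, the support reported in the rules table is the support of the combined itemset A ∪ B, that is, the proportion of transactions containing every item of both A and B.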

2. Confidence:

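- Formula: Confidence(A → B) = Support(A ∪ B) / Support(A), where A ∪ B is the itemset containing every item of A and B

- Confidence measures the likelihood that a transaction containing itemset A also contains itemset B. It is the conditional probability of B given A. Higher confidence means B appears more often in the transactions that contain A, but confidence alone does not account for how common B is overall.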

3. Lift:

- Formula: Lift(A → B) = Confidence(A → B) / Support(B) = Support(A ∪ B) / (Support(A) * Support(B))

- Lift measures the ratio of the observed support of A and B together to the support expected if A and B were independent. A lift greater than 1 indicates that A and B occur together more often than expected under independence, a lift less than 1 indicates that they occur together less often than expected, and a lift close to 1 suggests independence.

4. Leverage:

- Formula: Leverage(A → B) = Support(A ∪ B) - Support(A) * Support(B)

- Leverage measures the difference between the observed support of A and B together and the support expected if they were independent. A leverage of 0 indicates independence, positive values indicate that A and B occur together more often than expected, and negative values indicate that they occur together less often than expected.

5. Conviction:

- Formula: Conviction(A → B) = (1 - Support(B)) / (1 - Confidence(A → B))

- Conviction measures the ratio of the expected frequency with which A occurs without B, if A and B were independent, to the observed frequency of A without B. A conviction of 1 indicates independence. Higher values indicate that the consequent depends more strongly on the antecedent, and the value becomes infinite when the confidence is 1, meaning the rule never fails in the data.

6. Antecedent Support:

- The support of the antecedent (left-hand side) of the rule.

- Formula: Support(A)

- Antecedent support measures the proportion of transactions that contain the antecedent of the rule.

7. Consequent Support:

- The support of the consequent (right-hand side) of the rule.

- Formula: Support(B)

- Consequent support measures the proportion of transactions that contain the consequent of the rule.
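
The formulas above can be checked against the first rule in the output, Bagel → Bread, using the supports reported by the Apriori step (0.425397 for Bagel, 0.504762 for Bread and 0.279365 for the pair). The function below is a minimal sketch; the name `rule_metrics` and the dictionary layout are choices made here for illustration, not part of mlxtend.

```python
def rule_metrics(support_a, support_b, support_ab):
    """Compute rule metrics for A -> B from three support values.

    support_a:  antecedent support, Support(A)
    support_b:  consequent support, Support(B)
    support_ab: support of the combined itemset, Support(A ∪ B)
    """
    confidence = support_ab / support_a
    lift = confidence / support_b
    leverage = support_ab - support_a * support_b
    # Conviction divides by (1 - confidence), so guard the confidence == 1 case
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - support_b) / (1 - confidence)
    return {
        "antecedent support": support_a,
        "consequent support": support_b,
        "support": support_ab,
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Bagel -> Bread, with supports taken from the frequent-itemset output
metrics = rule_metrics(support_a=0.425397, support_b=0.504762, support_ab=0.279365)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For this rule the confidence comes out at roughly 0.657, the lift at roughly 1.30, the leverage at roughly 0.065 and the conviction at roughly 1.44, so customers who buy a bagel are somewhat more likely than average to also buy bread.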