In [20]:
# install mlxtend first so that the imports below succeed
! pip install mlxtend

import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder



# Association Rules for a Store Dataset

In this case study, we will explore how association rules can be used to analyze which items are usually purchased together.

You can refer to these articles to learn more about Apriori and association rules:
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/

## Load Data

We will use a dataset of transactions from a retail store. You can get the dataset here:
https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv

In [21]:
# load the dataset and show the first five transactions
url = "https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv"
df = pd.read_csv(url)
df.head(5)

Unnamed: 0,0,1,2,3,4,5,6
0,Bread,Wine,Eggs,Meat,Cheese,Pencil,Diaper
1,Bread,Cheese,Meat,Diaper,Wine,Milk,Pencil
2,Cheese,Meat,Eggs,Milk,Wine,,
3,Cheese,Meat,Eggs,Milk,Wine,,
4,Meat,Pencil,Wine,,,,


## Get the Set of Purchased Products

Collect the unique products that appear in at least one transaction.

In [22]:
unique_products = set()

for index, row in df.iterrows():
    unique_products.update(row.dropna())

print(list(unique_products))

['Wine', 'Diaper', 'Cheese', 'Meat', 'Bread', 'Bagel', 'Eggs', 'Milk', 'Pencil']
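
The same result can also be obtained without looping over rows. The following is a minimal sketch on a small, made-up table (`toy` is a hypothetical stand-in for `df`, with `None` marking empty slots), not a replacement for the cell above:

```python
import pandas as pd

# Hypothetical stand-in for the transaction table (None marks an empty slot)
toy = pd.DataFrame([
    ["Bread", "Wine", None],
    ["Cheese", "Bread", "Milk"],
    ["Wine", None, None],
])

# Flatten every cell into one Series, drop missing values, keep distinct items
unique_items = sorted(pd.Series(toy.to_numpy().ravel()).dropna().unique())
print(unique_items)
```

Sorting the result gives a stable order, unlike printing a `set`, whose order can change between runs.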


## Preprocess Data

In this step, we will transform the dataset into a one-hot encoding: one column per product, with a value indicating whether that product was purchased in each transaction.

In [28]:
# build one list of purchased items per transaction, skipping empty slots
transactions = df.apply(lambda x: x.dropna().tolist(), axis=1)

# one-hot encode the transactions (one boolean column per product)
te = TransactionEncoder()
encoded_transactions = te.fit_transform(transactions)

In [37]:
# create new dataframe from the encoded features
df_encoded = pd.DataFrame(encoded_transactions, columns=te.columns_)
df_encoded = df_encoded.astype(int)

# show the new dataframe
df_encoded.head()

Unnamed: 0,Bagel,Bread,Cheese,Diaper,Eggs,Meat,Milk,Pencil,Wine
0,0,1,1,1,1,1,0,1,1
1,0,1,1,1,0,1,1,1,1
2,0,0,1,0,1,1,1,0,1
3,0,0,1,0,1,1,1,0,1
4,0,0,0,0,0,1,0,1,1
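
To make the encoding step concrete, here is a plain-Python sketch of the idea on three made-up baskets. It only illustrates what a one-hot transaction table looks like; it is not how `TransactionEncoder` is implemented, and the toy `transactions` list is an assumption for illustration.

```python
# Three hypothetical baskets (not taken from the store dataset)
transactions = [
    ["Bread", "Wine"],
    ["Cheese", "Bread", "Milk"],
    ["Wine"],
]

# One column per distinct item, in alphabetical order
columns = sorted({item for basket in transactions for item in basket})

# One row per transaction: 1 if the item is in the basket, otherwise 0
encoded = [[int(item in basket) for item in columns] for basket in transactions]

print(columns)
for row in encoded:
    print(row)
```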


## Apriori Algorithm

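We will use the Apriori algorithm to find the product combinations that are purchased frequently.
For this case study, we will use min_support=0.2, so an itemset is kept only if it appears in at least 20% of the transactions.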

In [42]:
frequent_itemsets = apriori(df_encoded, min_support=0.2, use_colnames=True)
print("Frequent Itemsets:")
print(frequent_itemsets)

Frequent Itemsets:
     support              itemsets
0   0.425397               (Bagel)
1   0.504762               (Bread)
2   0.501587              (Cheese)
3   0.406349              (Diaper)
4   0.438095                (Eggs)
5   0.476190                (Meat)
6   0.501587                (Milk)
7   0.361905              (Pencil)
8   0.438095                (Wine)
9   0.279365        (Bread, Bagel)
10  0.225397         (Milk, Bagel)
11  0.238095       (Cheese, Bread)
12  0.231746       (Bread, Diaper)
13  0.206349         (Meat, Bread)
14  0.279365         (Bread, Milk)
15  0.200000       (Bread, Pencil)
16  0.244444         (Bread, Wine)
17  0.200000      (Cheese, Diaper)
18  0.298413        (Eggs, Cheese)
19  0.323810        (Cheese, Meat)
20  0.304762        (Cheese, Milk)
21  0.200000      (Cheese, Pencil)
22  0.269841        (Cheese, Wine)
23  0.234921        (Wine, Diaper)
24  0.266667          (Eggs, Meat)
25  0.244444          (Eggs, Milk)
26  0.241270          (Eggs, Wine)
2
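
For intuition about what the call above computes, the level-wise search can be sketched in plain Python. This is a simplified illustration on made-up baskets, assuming small data; the function name `frequent_itemsets_sketch` and the toy `baskets` are hypothetical, and this is not mlxtend's implementation.

```python
from itertools import combinations


def frequent_itemsets_sketch(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items
    singles = {frozenset([i]) for b in baskets for i in b}
    current = {s: support(s) for s in singles}
    current = {s: v for s, v in current.items() if v >= min_support}
    result = dict(current)

    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into candidates of size k
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c: support(c) for c in candidates}
        current = {c: v for c, v in current.items() if v >= min_support}
        result.update(current)
        k += 1
    return result


# Hypothetical baskets for illustration only
baskets = [
    ["Bread", "Milk"],
    ["Bread", "Milk", "Eggs"],
    ["Bread", "Eggs"],
    ["Milk"],
    ["Bread", "Milk"],
]
found = frequent_itemsets_sketch(baskets, min_support=0.5)
for itemset, value in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), value)
```

The prune step is what gives Apriori its name: an itemset can only be frequent if all of its subsets are frequent, so most candidates are discarded before their support is ever counted.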

Next, we will generate association rules from the frequent itemsets, keeping only the rules whose confidence is at least 0.6 (metric="confidence", min_threshold=0.6).

In [44]:
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
print("Association Rules:")
print(rules)

Association Rules:
       antecedents consequents  antecedent support  consequent support  \
0          (Bagel)     (Bread)            0.425397            0.504762   
1           (Eggs)    (Cheese)            0.438095            0.501587   
2         (Cheese)      (Meat)            0.501587            0.476190   
3           (Meat)    (Cheese)            0.476190            0.501587   
4         (Cheese)      (Milk)            0.501587            0.501587   
5           (Milk)    (Cheese)            0.501587            0.501587   
6           (Wine)    (Cheese)            0.438095            0.501587   
7           (Eggs)      (Meat)            0.438095            0.476190   
8   (Eggs, Cheese)      (Meat)            0.298413            0.476190   
9     (Eggs, Meat)    (Cheese)            0.266667            0.501587   
10  (Cheese, Meat)      (Eggs)            0.323810            0.438095   
11  (Cheese, Meat)      (Milk)            0.323810            0.501587   
12  (Cheese, Milk) 

Provide an explanation of __antecedent support__, __consequent support__, __support__, __confidence__, __lift__, __leverage__ and __conviction__.

1. Antecedent Support

Definition: The support of the antecedent itemset.

Explanation: It represents the proportion of transactions in the dataset that contain the antecedent itemset.

2. Consequent Support

Definition: The support of the consequent itemset.

Explanation: Similar to antecedent support, it represents the proportion of transactions that contain the consequent itemset.

3. Support

Definition: The support of the combined antecedent and consequent itemsets.

Explanation: It represents the proportion of transactions in the dataset that contain both the antecedent and consequent itemsets.

4. Confidence

Definition: The conditional probability of the consequent given the antecedent.

Explanation: It measures how often the rule has been found to be true. A confidence of 0.6 means that, in 60% of the cases where the antecedent is present, the consequent is also present.
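
As a worked check, confidence can be recomputed for the rule (Bagel) -> (Bread) from the rounded values printed above, so the last digits are approximate. Here $A$ is the antecedent and $C$ the consequent:

$$
\text{confidence}(A \rightarrow C) = \frac{\text{support}(A \cup C)}{\text{support}(A)},
\qquad
\text{confidence}(\text{Bagel} \rightarrow \text{Bread}) = \frac{0.279365}{0.425397} \approx 0.657
$$

This is above the 0.6 threshold, which is consistent with the rule appearing in the output.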

5. Lift

Definition: The ratio of the observed support to the expected support if the antecedent and consequent were independent.

Explanation: Lift indicates how much more likely the consequent is when the antecedent is present than it would be if the two were independent. A lift of 1 indicates independence, a value greater than 1 suggests that the presence of the antecedent increases the likelihood of the consequent, and a value less than 1 suggests that it decreases it.
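
As a worked check for the rule (Bagel) -> (Bread), using the rounded support values printed above (so the result is approximate), with $A$ the antecedent and $C$ the consequent:

$$
\text{lift}(A \rightarrow C) = \frac{\text{support}(A \cup C)}{\text{support}(A)\,\text{support}(C)},
\qquad
\text{lift}(\text{Bagel} \rightarrow \text{Bread}) = \frac{0.279365}{0.425397 \times 0.504762} \approx 1.301
$$

In this dataset, a basket containing a bagel is therefore roughly 30% more likely to contain bread than an arbitrary basket.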

6. Leverage

Definition: The difference between the observed support and the expected support if the antecedent and consequent were independent.

Explanation: Leverage measures how much the joint occurrence of the antecedent and the consequent differs from what would be expected if they were independent. A leverage of 0 indicates independence, and a positive leverage indicates that the antecedent and consequent co-occur more often than expected.
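
As a worked check for the rule (Bagel) -> (Bread), using the rounded support values printed above (so the result is approximate), with $A$ the antecedent and $C$ the consequent:

$$
\text{leverage}(A \rightarrow C) = \text{support}(A \cup C) - \text{support}(A)\,\text{support}(C),
\qquad
0.279365 - 0.425397 \times 0.504762 \approx 0.0646
$$

Bagel and Bread therefore appear together in about 6.5 percentage points more transactions than independence would predict.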

7. Conviction

Definition: The ratio of the expected frequency with which the antecedent occurs without the consequent (if the two were independent) to the observed frequency with which the antecedent occurs without the consequent.

Explanation: Conviction measures the degree of implication by comparing how often the rule would be wrong if the antecedent and consequent were independent with how often it is actually wrong. A conviction of 1 indicates independence, higher values suggest a stronger dependency of the consequent on the antecedent, and conviction is infinite when the confidence is 1.
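
The definitions above can be tied together in a short sketch that recomputes the metrics from three support values. The helper `rule_metrics` is a hypothetical name introduced for illustration, and the inputs are the rounded values printed above for the rule (Bagel) -> (Bread), so the results are approximate.

```python
def rule_metrics(support_a, support_c, support_ac):
    """Compute metrics for a rule A -> C from three support values."""
    confidence = support_ac / support_a
    lift = confidence / support_c
    leverage = support_ac - support_a * support_c
    # Conviction divides by (1 - confidence), so treat confidence == 1 as infinite
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - support_c) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Rounded values from the output above: support(Bagel), support(Bread), support(Bread, Bagel)
metrics = rule_metrics(support_a=0.425397, support_c=0.504762, support_ac=0.279365)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

The computed values can be compared against the corresponding columns of the `rules` DataFrame as a sanity check.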