In [15]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import LabelEncoder


I Wayan Rangga Rijasa - 0706022210019

# Association Rule for Store Dataset

In this case study, we will explore how association rule can be used to analyze the items that are usualy purcased together.

you can refer to this article to find out about apriori and association rule:
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/

## Load Data

We will use the dataset of the transaction in a certain store. You can get the dataset here:
https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv

In [16]:
# load the data set ans show the first five transaction
df = pd.read_csv('https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv')
df.head()

Unnamed: 0,0,1,2,3,4,5,6
0,Bread,Wine,Eggs,Meat,Cheese,Pencil,Diaper
1,Bread,Cheese,Meat,Diaper,Wine,Milk,Pencil
2,Cheese,Meat,Eggs,Milk,Wine,,
3,Cheese,Meat,Eggs,Milk,Wine,,
4,Meat,Pencil,Wine,,,,


# Get the set of product that has been purchased


In [17]:
purchased_item = set(df.values.flatten())

print(purchased_item)

{'Wine', 'Diaper', 'Cheese', 'Bread', 'Pencil', 'Eggs', 'Meat', 'Milk', 'Bagel', nan}


## Preprocess Data

In this step, we will transform our dataset so that we will have a one hot encoding based on the purchased products.

In [47]:
#create an itemset based on the products
products = set()
for col in df.columns:
    products.update(df[col].unique())
# encoding the feature
encoded_transactions = []
for _, row in df.iterrows():
    transaction_dict = {product: (1 if product in row.values else 0) for product in products}
    encoded_transactions.append(transaction_dict)

encoded_transactions[0]

{'Wine': 1,
 'Diaper': 1,
 'Cheese': 1,
 'Bread': 1,
 'Bagel': 0,
 'Eggs': 1,
 'Meat': 1,
 'Pencil': 1,
 'Milk': 0,
 nan: 0}

In [49]:

df_replaced = df.fillna('Missing')

data_flattened = df_replaced.values.ravel()

onehot_encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')

encoded_matrix = onehot_encoder.fit_transform(data_flattened.reshape(-1, 1))

encoded_columns = onehot_encoder.categories_[0]
encoded_df = pd.DataFrame(encoded_matrix, columns=encoded_columns)

product_presence = pd.DataFrame(0, index=df.index, columns=encoded_columns)

for idx, row_data in df_replaced.iterrows():
    for item in row_data:
        product_presence.loc[idx, item] = 1

product_presence.head()


Unnamed: 0,Bagel,Bread,Cheese,Diaper,Eggs,Meat,Milk,Missing,Pencil,Wine
0,0,1,1,1,1,1,0,0,1,1
1,0,1,1,1,0,1,1,0,1,1
2,0,0,1,0,1,1,1,1,0,1
3,0,0,1,0,1,1,1,1,0,1
4,0,0,0,0,0,1,0,1,1,1


In [50]:
# Since, the encoded dataframe consist of the empty column. We will drop the NaN column or u can use the index.

if 'Missing' in product_data.columns:
    product_data.drop(columns=['Missing'], inplace=True)

product_data.head()

Unnamed: 0,Bagel,Bread,Cheese,Diaper,Eggs,Meat,Milk,NaN,Pencil,Wine
0,0,1,1,1,1,1,0,0,1,1
1,0,1,1,1,0,1,1,0,1,1
2,0,0,1,0,1,1,1,1,0,1
3,0,0,1,0,1,1,1,1,0,1
4,0,0,0,0,0,1,0,1,1,1


Since, the encoded dataframe consist of the empty column. We will drop the NaN column or select all columns other than the first column.

## Apriori Algorithm

We will use appriori algorithm to determine the frequently purchased products.
For this case study, we will min_support=0.2

In [51]:
#Set threshold value untuk digunakan dalam penghitungan support
from mlxtend.frequent_patterns import apriori, association_rules
frequent_itemsets = apriori(product_data, min_support=0.2, use_colnames=True)
frequent_itemsets



Unnamed: 0,support,itemsets
0,0.425397,(Bagel)
1,0.504762,(Bread)
2,0.501587,(Cheese)
3,0.406349,(Diaper)
4,0.438095,(Eggs)
5,0.47619,(Meat)
6,0.501587,(Milk)
7,0.869841,(NaN)
8,0.361905,(Pencil)
9,0.438095,(Wine)


The we will generate association rule of the frequent itemset based on confidence level with the threshold=0.6

In [54]:
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
rules.drop(columns=['zhangs_metric'], inplace=True)
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Bagel),(Bread),0.425397,0.504762,0.279365,0.656716,1.301042,0.064641,1.44265
1,(Bagel),(NaN),0.425397,0.869841,0.336508,0.791045,0.909413,-0.03352,0.622902
2,(Bread),(NaN),0.504762,0.869841,0.396825,0.786164,0.903801,-0.042237,0.608683
3,(Eggs),(Cheese),0.438095,0.501587,0.298413,0.681159,1.358008,0.07867,1.563203
4,(Cheese),(Meat),0.501587,0.47619,0.32381,0.64557,1.355696,0.084958,1.477891
5,(Meat),(Cheese),0.47619,0.501587,0.32381,0.68,1.355696,0.084958,1.55754
6,(Cheese),(Milk),0.501587,0.501587,0.304762,0.607595,1.211344,0.053172,1.270148
7,(Milk),(Cheese),0.501587,0.501587,0.304762,0.607595,1.211344,0.053172,1.270148
8,(Cheese),(NaN),0.501587,0.869841,0.393651,0.78481,0.902245,-0.042651,0.604855
9,(Wine),(Cheese),0.438095,0.501587,0.269841,0.615942,1.227986,0.050098,1.297754


Provide explanation about __antecedent support__, __consequent support__, __support__, __confidence__, __lift__, __leverage__, __conviction__, __conviction__ and the interpretation from the case above (please use text section)

Antecedent Support:
The proportion of transactions in the dataset that contain the antecedent (left-hand side of the rule).

Antecedent Support

Transactions with Antecedent
Total Transactions
Antecedent Support= 
Total Transactions
Transactions with Antecedent
​
 
Example: For (Bagel) → (Bread), the antecedent support is 0.425397, meaning 42.54% of the transactions include "Bagel".

Consequent Support:
The proportion of transactions in the dataset that contain the consequent (right-hand side of the rule).

Consequent Support

Transactions with Consequent
Total Transactions
Consequent Support= 
Total Transactions
Transactions with Consequent
​
 
Example: For (Bagel) → (Bread), the consequent support is 0.504762, meaning 50.48% of the transactions include "Bread".

Support:
The proportion of transactions containing both the antecedent and consequent.

Support

Transactions with Antecedent and Consequent
Total Transactions
Support= 
Total Transactions
Transactions with Antecedent and Consequent
​
 
Example: For (Bagel) → (Bread), support is 0.279365, meaning 27.94% of transactions include both "Bagel" and "Bread".

Confidence:
The likelihood that a transaction containing the antecedent also contains the consequent.

Confidence

Support
Antecedent Support
Confidence= 
Antecedent Support
Support
​
 
Example: For (Bagel) → (Bread), confidence is 0.656716, indicating that 65.67% of transactions with "Bagel" also have "Bread".

Lift:
Measures how much more likely the antecedent and consequent occur together than if they were independent.

Lift

Confidence
Consequent Support
Lift= 
Consequent Support
Confidence
​
 
Example: For (Bagel) → (Bread), lift is 1.301042, meaning "Bagel" increases the likelihood of "Bread" by 30.10% compared to random chance.

Leverage:
Quantifies the difference between observed co-occurrence of antecedent and consequent and their expected co-occurrence if they were independent.

Leverage

Support
−
(
Antecedent Support
×
Consequent Support
)
Leverage=Support−(Antecedent Support×Consequent Support)
Example: For (Bagel) → (Bread), leverage is 0.064641, suggesting a positive association between "Bagel" and "Bread".

Conviction:
Reflects the likelihood of the antecedent occurring without the consequent. Higher values indicate stronger rules.

Conviction

1
−
Consequent Support
1
−
Confidence
Conviction= 
1−Confidence
1−Consequent Support
​
 
Example: For (Bagel) → (Bread), conviction is 1.442650, indicating moderate strength.

Interpretation from the Case Above:
Strong Associations:
Rules with high confidence, lift, and leverage, such as (Eggs) → (Cheese) (confidence: 0.681159, lift: 1.358008), suggest that purchasing "Eggs" increases the likelihood of purchasing "Cheese".

Weak Associations:
Rules with low leverage or lift close to 1, such as (Milk) → (Cheese) (lift: 1.211344), indicate weaker relationships.

Conviction Insights:
Conviction values above 1, such as for (Meat) → (Cheese) (conviction: 1.557540), highlight rules with a reduced likelihood of antecedents occurring without consequents, suggesting predictive value.

References

https://towardsdatascience.com/apriori-association-rule-mining-explanation-and-python-implementation-290b42afdfc6
https://chatgpt.com/share/67482e8a-d650-8002-8f42-2f88cb8ddb4d
https://yandaafrida.medium.com/association-rule-market-basket-analysis-menggunakan-python-a9c49b4bfc69