In [3]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from sklearn.preprocessing import OneHotEncoder

!pip install mlxtend==0.23.1



# Association Rule for Store Dataset

In this case study, we will explore how association rules can be used to analyze the items that are usually purchased together.

You can refer to these articles to learn about Apriori and association rules:
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/

## Load Data

We will use a dataset of transactions from a store. You can get the dataset here:
https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv

In [4]:
import pandas as pd

# Load the dataset
df = pd.read_csv('https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv')

# Display the first five transactions
print(df.head())

        0       1     2       3       4       5       6
0   Bread    Wine  Eggs    Meat  Cheese  Pencil  Diaper
1   Bread  Cheese  Meat  Diaper    Wine    Milk  Pencil
2  Cheese    Meat  Eggs    Milk    Wine     NaN     NaN
3  Cheese    Meat  Eggs    Milk    Wine     NaN     NaN
4    Meat  Pencil  Wine     NaN     NaN     NaN     NaN


## Get the Set of Purchased Products


In [5]:
unique_products = set(df.values.flatten())
print(unique_products)


{nan, 'Wine', 'Pencil', 'Meat', 'Milk', 'Diaper', 'Eggs', 'Bread', 'Cheese', 'Bagel'}


## Preprocess Data

In this step, we will transform our dataset so that we will have a one hot encoding based on the purchased products.

In [6]:
# Create an item dictionary with one zero-initialized entry per product
itemset = {item: 0 for item in unique_products}

# Encode the first transaction as an example
for item in df.iloc[0]:
  if item in itemset:
    itemset[item] = 1

itemset

{nan: 0,
 'Wine': 1,
 'Pencil': 1,
 'Meat': 1,
 'Milk': 0,
 'Diaper': 1,
 'Eggs': 1,
 'Bread': 1,
 'Cheese': 1,
 'Bagel': 0}

In [7]:
# Create a new dataframe of zeros with one column per item
encoded_df = pd.DataFrame(0, index=range(len(df)), columns=itemset)

# Encode each transaction
for i, row in df.iterrows():
  for item in row:
    encoded_df.loc[i, item] = 1

# Show the new dataframe
encoded_df.head()



   NaN  Wine  Pencil  Meat  Milk  Diaper  Eggs  Bread  Cheese  Bagel
0    0     1       1     1     0       1     1      1       1      0
1    0     1       1     1     1       1     0      1       1      0
2    1     1       0     1     1       0     1      0       1      0
3    1     1       0     1     1       0     1      0       1      0
4    1     1       1     1     0       0     0      0       0      0


In [8]:
# The empty cells in the raw data produced a NaN column in the encoded dataframe.
# Drop it by label rather than by position, since set ordering does not guarantee it comes first.
encoded_df = encoded_df.loc[:, encoded_df.columns.notna()]
encoded_df.head()


   Wine  Pencil  Meat  Milk  Diaper  Eggs  Bread  Cheese  Bagel
0     1       1     1     0       1     1      1       1      0
1     1       1     1     1       1     0      1       1      0
2     1       0     1     1       0     1      0       1      0
3     1       0     1     1       0     1      0       1      0
4     1       1     1     0       0     0      0       0      0


The empty cells in the raw data end up encoded as a NaN column, which carries no product information, so the cell above drops it. Here NaN happens to be the first column, so selecting every column after the first has the same effect.
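
The encoding in this section can also be sketched so that the NaN padding never becomes a column in the first place. The block below is only an illustration under stated assumptions: `toy`, `baskets`, and `toy_encoded` are made-up names, and the three-row frame stands in for the real CSV.

```python
import pandas as pd

# Toy transactions padded with None, mimicking the layout of the retail CSV
# (hypothetical data, not taken from the real file)
toy = pd.DataFrame([
    ["Bread", "Wine", "Eggs"],
    ["Cheese", "Meat", None],
    ["Meat", None, None],
])

# One set of items per transaction, with the missing-value padding removed
baskets = [set(row.dropna()) for _, row in toy.iterrows()]

# A sorted item list gives a deterministic column order (a plain set does not)
items = sorted(set().union(*baskets))

# True where the transaction contains the item, False otherwise
toy_encoded = pd.DataFrame(
    [[item in basket for item in items] for basket in baskets],
    columns=items,
)
print(toy_encoded)
```

Boolean columns are used here because mlxtend's `apriori` accepts boolean or 0/1 input; the 0/1 frame built earlier in the notebook works just as well.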

## Apriori Algorithm

We will use the Apriori algorithm to find the frequently purchased products.
For this case study, we will use min_support=0.2.
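
To make the role of `min_support` concrete, here is a minimal pure-Python sketch of how support is counted and how an Apriori-style search builds candidate pairs only from frequent single items. The toy transactions, the `support` helper, and the 0.6 threshold are assumptions made for illustration; this is not how mlxtend implements `apriori` internally.

```python
from itertools import combinations

# Hypothetical toy transactions (not the retail dataset)
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Cheese", "Meat"},
    {"Milk", "Cheese", "Meat"},
    {"Bread", "Milk", "Cheese", "Meat"},
    {"Bread", "Milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

min_support = 0.6  # toy threshold, chosen only for this example

# Level 1: keep single items that meet the threshold
items = sorted(set().union(*transactions))
frequent_1 = [frozenset([i]) for i in items
              if support([i], transactions) >= min_support]

# Level 2: candidate pairs are built only from frequent single items,
# because a pair cannot be frequent if one of its items is not
frequent_items = sorted(set().union(*frequent_1))
frequent_2 = [frozenset(pair) for pair in combinations(frequent_items, 2)
              if support(pair, transactions) >= min_support]

print(frequent_1)
print(frequent_2)
```

Lowering the threshold keeps more itemsets and raises the cost of the search; raising it keeps only the most common combinations.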

In [9]:
# Set the threshold value used for the support calculation
from mlxtend.frequent_patterns import apriori, association_rules

# Apply the Apriori algorithm to find frequent itemsets with min_support = 0.2
frequent_itemsets = apriori(encoded_df, min_support=0.2, use_colnames=True)

# Display the frequent itemsets
frequent_itemsets



    support        itemsets
0  0.438095          (Wine)
1  0.361905        (Pencil)
2   0.47619          (Meat)
3  0.501587          (Milk)
4  0.406349        (Diaper)
5  0.438095          (Eggs)
6  0.504762         (Bread)
7  0.501587        (Cheese)
8  0.425397         (Bagel)
9       0.2  (Wine, Pencil)


Then we will generate association rules from the frequent itemsets, using confidence as the metric with a threshold of 0.6.
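
As a rough picture of what a confidence threshold does, the sketch below splits one frequent itemset into every antecedent -> consequent pair and keeps the pairs whose confidence meets the threshold. The toy transactions, the helper names `support` and `rules_from`, and the 0.8 threshold are assumptions for illustration only, not mlxtend's implementation.

```python
from itertools import combinations

# Hypothetical toy transactions (not the retail dataset)
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Cheese", "Meat"},
    {"Milk", "Cheese", "Meat"},
    {"Bread", "Milk", "Cheese", "Meat"},
    {"Bread", "Milk"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(set(itemset) <= t for t in transactions) / len(transactions)

def rules_from(itemset, min_confidence):
    """Return (antecedent, consequent, confidence) for each qualifying split."""
    itemset = frozenset(itemset)
    rules = []
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            antecedent = frozenset(antecedent)
            consequent = itemset - antecedent
            # confidence = support(antecedent and consequent) / support(antecedent)
            confidence = support(itemset) / support(antecedent)
            if confidence >= min_confidence:
                rules.append((antecedent, consequent, confidence))
    return rules

strong = rules_from({"Cheese", "Meat"}, min_confidence=0.8)
weak = rules_from({"Bread", "Milk"}, min_confidence=0.8)
print(strong)
print(weak)
```

Both itemsets have the same support in this toy data, yet only the Cheese/Meat rules pass: confidence depends on how often the antecedent appears without the consequent, not only on how often the two co-occur.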

In [11]:
association_rules_df = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
association_rules_df



      antecedents consequents  antecedent support  consequent support   support  confidence      lift  leverage  conviction  zhangs_metric
0          (Wine)    (Cheese)            0.438095            0.501587  0.269841    0.615942  1.227986  0.050098    1.297754       0.330409
1          (Eggs)      (Meat)            0.438095             0.47619  0.266667    0.608696  1.278261   0.05805    1.338624       0.387409
2        (Cheese)      (Meat)            0.501587             0.47619   0.32381     0.64557  1.355696  0.084958    1.477891       0.526414
3          (Meat)    (Cheese)             0.47619            0.501587   0.32381        0.68  1.355696  0.084958     1.55754       0.500891
4          (Milk)    (Cheese)            0.501587            0.501587  0.304762    0.607595  1.211344  0.053172    1.270148       0.350053
5        (Cheese)      (Milk)            0.501587            0.501587  0.304762    0.607595  1.211344  0.053172    1.270148       0.350053
6          (Eggs)    (Cheese)            0.438095            0.501587  0.298413    0.681159  1.358008   0.07867    1.563203       0.469167
7         (Bagel)     (Bread)            0.425397            0.504762  0.279365    0.656716  1.301042  0.064641     1.44265       0.402687
8  (Milk, Cheese)      (Meat)            0.304762             0.47619  0.203175    0.666667       1.4   0.05805    1.571429       0.410959
9    (Milk, Meat)    (Cheese)            0.244444            0.501587  0.203175    0.831169  1.657077  0.080564    2.952137       0.524816

Provide an explanation of __antecedent support__, __consequent support__, __support__, __confidence__, __lift__, __leverage__, __conviction__, __zhangs_metric__ and an interpretation of the case above (please use a text section).

In association rule mining, several metrics describe a rule of the form antecedent -> consequent. Antecedent support is the fraction of transactions that contain the "if" part (the antecedent), and consequent support is the fraction that contain the "then" part (the consequent). Support is the fraction of transactions that contain both together. Confidence is support divided by antecedent support, that is, the estimated probability of the consequent given the antecedent. Lift is confidence divided by consequent support; it compares the observed co-occurrence with what independence would predict, so a value above 1 indicates a positive association, a value of 1 indicates independence, and a value below 1 indicates a negative association. Leverage is the difference between the observed support and the support expected under independence (antecedent support × consequent support), so 0 means independence. Conviction is (1 - consequent support) / (1 - confidence); it measures how much more often the rule would be wrong if the two sides were independent, and higher values indicate that the consequent depends more strongly on the antecedent. Zhang's metric ranges from -1 to 1, where positive values indicate association and negative values indicate dissociation.
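
The definitions above can be checked by hand against one row of the rules table. The sketch below starts from the three support values reported for (Milk, Meat) -> (Cheese) and recomputes the remaining metrics using the standard formulas; the variable names are made up for this example, and the inputs are the rounded values printed in the table.

```python
# Supports copied from the table row for (Milk, Meat) -> (Cheese)
antecedent_support = 0.244444  # share of transactions containing Milk and Meat
consequent_support = 0.501587  # share of transactions containing Cheese
support = 0.203175             # share containing Milk, Meat, and Cheese together

# P(consequent | antecedent)
confidence = support / antecedent_support

# Observed co-occurrence relative to what independence would predict
lift = confidence / consequent_support

# Observed support minus the support expected under independence
leverage = support - antecedent_support * consequent_support

# How much more often the rule would fail if both sides were independent
conviction = (1 - consequent_support) / (1 - confidence)

# Zhang's metric: leverage scaled into the range [-1, 1]
zhang = leverage / max(
    support * (1 - antecedent_support),
    antecedent_support * (consequent_support - support),
)

for name, value in [
    ("confidence", confidence),
    ("lift", lift),
    ("leverage", leverage),
    ("conviction", conviction),
    ("zhang", zhang),
]:
    print(f"{name:<12}{value:.3f}")
```

The printed values agree with the corresponding table row to about three decimal places; the small differences come from the inputs being rounded to six digits.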

From the rules shown above, (Milk, Meat) -> Cheese is the strongest: it has the highest lift (1.657) and the highest confidence (83.1%), so customers who buy milk and meat usually buy cheese as well. Bagel -> Bread has a lower confidence (65.7%) but still a lift above 1 (1.30), which makes it a candidate for targeted promotions. Milk, Cheese, and Meat appear together in several rules, which suggests opportunities for product bundling or cross-promotion. Every rule shown has a lift between roughly 1.2 and 1.7, so these are positive but moderate associations rather than near-certain ones.