In [18]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

# Association Rule for Store Dataset

In this case study, we will explore how association rules can be used to analyze the items that are usually purchased together.

You can refer to these articles to learn more about Apriori and association rules:
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/

## Load Data

We will use a dataset of transactions from a store. You can get the dataset here:
https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv

In [2]:
# load the dataset and show the first five transactions
df = pd.read_csv('https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv')
df.head()

Unnamed: 0,0,1,2,3,4,5,6
0,Bread,Wine,Eggs,Meat,Cheese,Pencil,Diaper
1,Bread,Cheese,Meat,Diaper,Wine,Milk,Pencil
2,Cheese,Meat,Eggs,Milk,Wine,,
3,Cheese,Meat,Eggs,Milk,Wine,,
4,Meat,Pencil,Wine,,,,


## Get the Set of Purchased Products

Get the unique products that have been purchased. Empty cells in shorter transactions are read as `NaN`, so `nan` also shows up in the resulting set.

In [10]:
unique_products = set(df.values.flatten())
print(unique_products)

{'Meat', 'Bagel', 'Pencil', 'Eggs', 'Diaper', 'Bread', 'Cheese', nan, 'Wine', 'Milk'}


## Preprocess Data

In this step, we will transform the dataset into a one-hot encoding of the purchased products: one column per product, with 1 if the transaction contains that product and 0 otherwise.

In [12]:
# create an itemset (list of unique products), skipping NaN cells
itemset = set()
for i in range(len(df)):
    itemset.update(df.loc[i].dropna().values)
itemset = list(itemset)
print("Itemset:", itemset)


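# one-hot encode each transaction: 1 if the product was purchased, 0 otherwise
encoded_values = []
for index, row in df.iterrows():
    labels = {}
    # products that are missing from this transaction
    uncommons = list(set(itemset) - set(row))
    # products that are present in this transaction
    commons = list(set(itemset).intersection(row))
    for uc in uncommons:
        labels[uc] = 0
    for com in commons:
        labels[com] = 1
    encoded_values.append(labels)
# show the encoding of the first transaction
encoded_values[0]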

Itemset: ['Meat', 'Bagel', 'Pencil', 'Eggs', 'Diaper', 'Bread', 'Cheese', 'Wine', 'Milk']


{'Bagel': 0,
 'Milk': 0,
 'Meat': 1,
 'Pencil': 1,
 'Eggs': 1,
 'Diaper': 1,
 'Bread': 1,
 'Cheese': 1,
 'Wine': 1}

In [13]:
# create new dataframe from the encoded features
new_df = pd.DataFrame(encoded_values)

# show the new dataframe
new_df.head()

Unnamed: 0,Bagel,Milk,Meat,Pencil,Eggs,Diaper,Bread,Cheese,Wine
0,0,0,1,1,1,1,1,1,1
1,0,1,1,1,0,1,1,1,1
2,0,1,1,0,1,0,0,1,1
3,0,1,1,0,1,0,0,1,1
4,0,0,1,1,0,0,0,0,1


Because the itemset was built with `dropna()`, the encoded dataframe does not contain a NaN column, so there is nothing to drop here. We can confirm this by checking the dataframe again.

In [14]:
new_df.head()

Unnamed: 0,Bagel,Milk,Meat,Pencil,Eggs,Diaper,Bread,Cheese,Wine
0,0,0,1,1,1,1,1,1,1
1,0,1,1,1,0,1,1,1,1
2,0,1,1,0,1,0,0,1,1
3,0,1,1,0,1,0,0,1,1
4,0,0,1,1,0,0,0,0,1
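
The manual encoding loop above can also be written more compactly. The following is a minimal sketch, not the notebook's original method: it rebuilds the first five transactions shown earlier as a small dataframe (the name `sample` and the helper variables are ours, for illustration only) and produces a boolean one-hot table, which `apriori` also accepts.

```python
import pandas as pd

# The first five transactions from df.head(); None marks an empty cell.
# `sample` is an illustrative stand-in for the full dataset.
sample = pd.DataFrame([
    ["Bread", "Wine", "Eggs", "Meat", "Cheese", "Pencil", "Diaper"],
    ["Bread", "Cheese", "Meat", "Diaper", "Wine", "Milk", "Pencil"],
    ["Cheese", "Meat", "Eggs", "Milk", "Wine", None, None],
    ["Cheese", "Meat", "Eggs", "Milk", "Wine", None, None],
    ["Meat", "Pencil", "Wine", None, None, None, None],
])

# Turn every row into a set of purchased products, ignoring empty cells
baskets = [set(row.dropna()) for _, row in sample.iterrows()]

# Sorted list of all products seen in the sample
items = sorted(set().union(*baskets))

# One boolean column per product: True if the basket contains it
encoded = pd.DataFrame(
    {item: [item in basket for basket in baskets] for item in items}
)
print(encoded.astype(int))
```

Bagel does not appear in these five rows, so the sample has eight columns instead of nine. On the full dataset, replacing `sample` with `df` should give the same information as `new_df`, only with `True`/`False` instead of 1/0.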


## Apriori Algorithm

We will use the Apriori algorithm to determine the frequently purchased products.
For this case study, we will use `min_support=0.2`, so only itemsets that appear in at least 20% of the transactions are kept.

In [17]:
frequent_items = apriori(new_df, min_support=0.2, use_colnames=True)
frequent_items

Unnamed: 0,support,itemsets
0,0.425397,(Bagel)
1,0.501587,(Milk)
2,0.47619,(Meat)
3,0.361905,(Pencil)
4,0.438095,(Eggs)
5,0.406349,(Diaper)
6,0.504762,(Bread)
7,0.501587,(Cheese)
8,0.438095,(Wine)
9,0.225397,"(Bagel, Milk)"
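
To make the support values above more concrete, here is a small pure-Python sketch of the idea behind Apriori. It is a simplified illustration under our own assumptions, not mlxtend's implementation: the helper names `support` and `frequent_itemsets` are hypothetical, and the data is only the first five transactions shown earlier, so the numbers differ from the full-dataset output.

```python
from itertools import combinations

# First five transactions from the dataset preview (illustrative subset)
transactions = [
    {"Bread", "Wine", "Eggs", "Meat", "Cheese", "Pencil", "Diaper"},
    {"Bread", "Cheese", "Meat", "Diaper", "Wine", "Milk", "Pencil"},
    {"Cheese", "Meat", "Eggs", "Milk", "Wine"},
    {"Cheese", "Meat", "Eggs", "Milk", "Wine"},
    {"Meat", "Pencil", "Wine"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise search: build k-itemsets only from frequent (k-1)-itemsets."""
    items = sorted(set().union(*transactions))
    current = [frozenset([i]) for i in items
               if support([i], transactions) >= min_support]
    result = {}
    k = 1
    while current:
        for s in current:
            result[s] = support(s, transactions)
        k += 1
        # Candidate k-itemsets are unions of two frequent (k-1)-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        prev = set(current)
        # Apriori pruning: every (k-1)-subset must itself be frequent
        current = [c for c in candidates
                   if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
                   and support(c, transactions) >= min_support]
    return result


print(support({"Cheese", "Milk"}, transactions))
for itemset, sup in frequent_itemsets(transactions, min_support=0.8).items():
    print(sorted(itemset), sup)
```

The pruning step is what keeps the search manageable: an itemset can only be frequent if all of its subsets are frequent, so most candidates are discarded before their support is ever counted.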


Then, we will generate association rules from the frequent itemsets, keeping only the rules whose confidence is at least 0.6 (`min_threshold=0.6`).

In [19]:
rules = association_rules(frequent_items, metric="confidence", min_threshold=0.6)
rules.drop(['zhangs_metric'], axis=1)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(Bagel),(Bread),0.425397,0.504762,0.279365,0.656716,1.301042,0.064641,1.44265
1,(Cheese),(Milk),0.501587,0.501587,0.304762,0.607595,1.211344,0.053172,1.270148
2,(Milk),(Cheese),0.501587,0.501587,0.304762,0.607595,1.211344,0.053172,1.270148
3,(Eggs),(Meat),0.438095,0.47619,0.266667,0.608696,1.278261,0.05805,1.338624
4,(Cheese),(Meat),0.501587,0.47619,0.32381,0.64557,1.355696,0.084958,1.477891
5,(Meat),(Cheese),0.47619,0.501587,0.32381,0.68,1.355696,0.084958,1.55754
6,(Eggs),(Cheese),0.438095,0.501587,0.298413,0.681159,1.358008,0.07867,1.563203
7,(Wine),(Cheese),0.438095,0.501587,0.269841,0.615942,1.227986,0.050098,1.297754
8,"(Cheese, Meat)",(Milk),0.32381,0.501587,0.203175,0.627451,1.250931,0.040756,1.337845
9,"(Cheese, Milk)",(Meat),0.304762,0.47619,0.203175,0.666667,1.4,0.05805,1.571429


Provide an explanation of __antecedent support__, __consequent support__, __support__, __confidence__, __lift__, __leverage__ and __conviction__.

These terms are commonly used in association rule mining, a technique within data mining that discovers interesting relationships or patterns in large datasets. Association rules are typically represented in the form of "if-then" statements, where certain events co-occur with others. Let's break down each term:

1. *Antecedent Support* : Antecedent support is the proportion of transactions in the dataset that contain the antecedent (the "if" part of the rule). It represents how frequently the antecedent occurs.
2. *Consequent Support* : Consequent support is the same measure for the consequent (the "then" part of the rule). It represents how frequently the consequent occurs.
3. *Support* : Support is the proportion of transactions that contain both the antecedent and the consequent. It measures how often the two co-occur in the dataset.
4. *Confidence* : Confidence measures the reliability of the rule. It is the conditional probability of finding the consequent in a transaction given that the antecedent is present, calculated as the support of the rule divided by the antecedent support. For rule 5 (Meat → Cheese), this is 0.32381 / 0.47619 ≈ 0.68.
5. *Lift* : Lift compares the confidence of the rule with the overall frequency of the consequent. It is calculated as confidence divided by consequent support, which equals the ratio of the observed support to the support expected if the antecedent and consequent were independent. A lift above 1 indicates a positive association, a lift of 1 indicates independence, and a lift below 1 indicates a negative association.
6. *Leverage* : Leverage measures the difference between the observed frequency of the antecedent and consequent occurring together and the frequency that would be expected if they were independent: support minus (antecedent support × consequent support). A value of 0 indicates independence.
7. *Conviction* : Conviction is calculated as (1 − consequent support) / (1 − confidence). It compares how often the rule would make a wrong prediction if the antecedent and consequent were independent with how often it actually does. A value of 1 indicates independence, higher values imply a stronger dependence of the consequent on the antecedent, and the value is infinite when confidence is 1.
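
As a quick check of the definitions above, the derived metrics can be recomputed from the three support values of a rule. The sketch below uses rule 5 (Meat → Cheese) from the output table; the function name `rule_metrics` is our own and is not part of mlxtend. Because the table values are rounded, the results should match only up to small rounding differences.

```python
def rule_metrics(antecedent_support, consequent_support, support):
    """Derive confidence, lift, leverage and conviction from three supports."""
    # confidence = P(consequent | antecedent)
    confidence = support / antecedent_support
    # lift = confidence relative to the baseline frequency of the consequent
    lift = confidence / consequent_support
    # leverage = observed co-occurrence minus co-occurrence under independence
    leverage = support - antecedent_support * consequent_support
    # conviction is infinite when the rule is never wrong (confidence == 1)
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - consequent_support) / (1 - confidence)
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Rule 5 in the table: (Meat) -> (Cheese)
metrics = rule_metrics(
    antecedent_support=0.47619,
    consequent_support=0.501587,
    support=0.32381,
)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

The printed values can be compared with the `confidence`, `lift`, `leverage` and `conviction` columns of row 5 in the rules table.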