In [45]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

! pip install mlxtend



# Association Rule for Store Dataset

In this case study, we will explore how association rules can be used to analyze which items are usually purchased together.

You can refer to these articles to learn about the Apriori algorithm and association rules:
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/

## Load Data

We will use a dataset of transactions from a store. You can get the dataset here:
https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv

In [46]:
# load the dataset and show the first five transactions
df = pd.read_csv('https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv')
df.head(5)

Unnamed: 0,0,1,2,3,4,5,6
0,Bread,Wine,Eggs,Meat,Cheese,Pencil,Diaper
1,Bread,Cheese,Meat,Diaper,Wine,Milk,Pencil
2,Cheese,Meat,Eggs,Milk,Wine,,
3,Cheese,Meat,Eggs,Milk,Wine,,
4,Meat,Pencil,Wine,,,,


## Get the Set of Purchased Products

Collect the unique products that appear in the transactions.

In [47]:
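# Get the unique products that have been purchased and show them.
# Note: this only scans the first column ('0'), so a product that never
# appears first in a transaction would be missed.
products = df['0'].unique()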
print(products)

['Bread' 'Cheese' 'Meat' 'Eggs' 'Wine' 'Bagel' 'Pencil' 'Diaper' 'Milk']
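
Reading only column `'0'` gives the product list shown above, but it relies on every product appearing first in at least one transaction. A minimal sketch of a more defensive approach is given below. It assumes a small made-up dataframe (`toy`, not part of the original notebook) with the same layout as `df`, gathers values from every column, and skips missing cells:

```python
import pandas as pd

# Hypothetical toy data with the same layout as df: one row per transaction,
# one column per item position, and missing cells where a basket is shorter.
toy = pd.DataFrame({
    "0": ["Bread", "Cheese", "Meat"],
    "1": ["Wine", "Meat", "Pencil"],
    "2": ["Eggs", None, "Bagel"],
})

# Flatten every cell into one sequence, drop missing values, and deduplicate.
all_products = sorted({value for value in toy.to_numpy().ravel() if pd.notna(value)})
print(all_products)
```

Note that the result is sorted alphabetically, so the column order of any encoding built from it would differ from the order shown in this notebook.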


## Preprocess Data

In this step, we will transform the dataset into a one-hot encoding: one row per transaction and one column per product, with 1 if the product was purchased in that transaction and 0 otherwise.

In [48]:
# create an itemset based on the products
itemset = list(products)
print(itemset)

# encoding the feature
encoded_vals = []
for index, row in df.iterrows():
    labels = {}
    for item in itemset:
        if item in row.values:
            labels[item] = 1
        else:
            labels[item] = 0
    encoded_vals.append(labels)
encoded_vals[0]

['Bread', 'Cheese', 'Meat', 'Eggs', 'Wine', 'Bagel', 'Pencil', 'Diaper', 'Milk']


{'Bread': 1,
 'Cheese': 1,
 'Meat': 1,
 'Eggs': 1,
 'Wine': 1,
 'Bagel': 0,
 'Pencil': 1,
 'Diaper': 1,
 'Milk': 0}
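
The explicit loop above is easy to follow. As a hedged alternative, the same kind of encoding can be sketched with pandas reshaping operations. The dataframe `toy` and the names `long_items` and `toy_ohe` are made up for illustration and are not part of the original notebook:

```python
import pandas as pd

# Hypothetical toy data with the same layout as df.
toy = pd.DataFrame({
    "0": ["Bread", "Cheese", "Meat"],
    "1": ["Wine", "Meat", "Pencil"],
    "2": ["Eggs", None, "Bagel"],
})

# Long format: one entry per (transaction, position) pair, missing cells removed.
long_items = toy.stack().dropna()

# One indicator column per product, then collapse back to one row per
# transaction: a product counts as purchased if any position contains it.
toy_ohe = pd.get_dummies(long_items, dtype=int).groupby(level=0).max().astype(bool)
print(toy_ohe)
```

This version produces boolean columns, which mlxtend's `apriori` accepts as well as 0/1 values. One caveat of this sketch: a transaction with no items at all would vanish from the result instead of becoming an all-False row.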

In [49]:
# create a new dataframe from the encoded features
ohe_df = pd.DataFrame(encoded_vals)

# show the new dataframe
ohe_df.head(5)


Unnamed: 0,Bread,Cheese,Meat,Eggs,Wine,Bagel,Pencil,Diaper,Milk
0,1,1,1,1,1,0,1,1,0
1,1,1,1,0,1,0,1,1,1
2,0,1,1,1,1,0,0,0,1
3,0,1,1,1,1,0,0,0,1
4,0,0,1,0,1,0,1,0,0


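If missing values (NaN) had been included in the itemset, the encoded dataframe would contain an empty NaN column that we would need to drop. Here the itemset contains only product names, so no such column exists and nothing needs to be dropped.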

In [50]:
# No NaN column to drop: the encoding loop only iterates over the product
# names in `itemset`, so missing values never became a column.

## Apriori Algorithm

We will use the Apriori algorithm to find the frequently purchased itemsets.
For this case study, we will use min_support=0.2, so an itemset is kept only if it appears in at least 20% of the transactions.

In [51]:
#apriori algorithm with min_support = 0.2
from mlxtend.frequent_patterns import apriori
freq_items = apriori(ohe_df, min_support=0.2, use_colnames=True, verbose=1)
freq_items.head(5)

Processing 4 combinations | Sampling itemset size 4 3




Unnamed: 0,support,itemsets
0,0.504762,(Bread)
1,0.501587,(Cheese)
2,0.47619,(Meat)
3,0.438095,(Eggs)
4,0.438095,(Wine)
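
To make the idea behind the frequent-itemset search concrete, here is a minimal pure-Python sketch of a level-wise Apriori search. It is not mlxtend's implementation. The function name `frequent_itemsets` and the five toy transactions are assumptions made for illustration only:

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset with support >= min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item of the itemset.
        return sum(itemset <= basket for basket in baskets) / n

    # Level 1: frequent single items.
    singles = {frozenset([item]) for basket in baskets for item in basket}
    current = {s: support(s) for s in singles}
    current = {s: v for s, v in current.items() if v >= min_support}
    result = dict(current)

    k = 2
    while current:
        # Join: merge frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        scored = {c: support(c) for c in candidates}
        current = {c: v for c, v in scored.items() if v >= min_support}
        result.update(current)
        k += 1
    return result


# Hypothetical toy transactions (not taken from the store dataset).
toy_transactions = [
    ["Bread", "Cheese", "Meat"],
    ["Bread", "Cheese"],
    ["Cheese", "Meat"],
    ["Bread", "Meat", "Wine"],
    ["Cheese", "Meat", "Wine"],
]
toy_frequent = frequent_itemsets(toy_transactions, min_support=0.4)
for itemset, supp in sorted(toy_frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), round(supp, 2))
```

The pruning step is what keeps the search tractable: an itemset can only be frequent if all of its subsets are frequent, so most larger combinations are never counted.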


Then, we will generate association rules from the frequent itemsets, keeping only the rules whose confidence is at least 0.6.

In [52]:
# association rule of the frequent itemset based on confidence level with the threshold=0.6
from mlxtend.frequent_patterns import association_rules
rules = association_rules(freq_items, metric="confidence", min_threshold=0.6)
rules.head(5)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Bagel),(Bread),0.425397,0.504762,0.279365,0.656716,1.301042,0.064641,1.44265,0.402687
1,(Meat),(Cheese),0.47619,0.501587,0.32381,0.68,1.355696,0.084958,1.55754,0.500891
2,(Cheese),(Meat),0.501587,0.47619,0.32381,0.64557,1.355696,0.084958,1.477891,0.526414
3,(Eggs),(Cheese),0.438095,0.501587,0.298413,0.681159,1.358008,0.07867,1.563203,0.469167
4,(Wine),(Cheese),0.438095,0.501587,0.269841,0.615942,1.227986,0.050098,1.297754,0.330409
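
The confidence filter can be illustrated with a short pure-Python sketch. It is not how mlxtend implements `association_rules`. The helper name `rules_from_itemsets` is hypothetical, and the support values are copied from the rounded outputs shown in this notebook:

```python
from itertools import combinations


def rules_from_itemsets(supports, min_confidence):
    """Split each frequent itemset into antecedent -> consequent rules
    and keep those whose confidence reaches min_confidence."""
    rules = []
    for itemset, supp in supports.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one item on each side
        for size in range(1, len(itemset)):
            for combo in combinations(sorted(itemset), size):
                antecedent = frozenset(combo)
                consequent = itemset - antecedent
                # confidence = support(antecedent and consequent) / support(antecedent)
                confidence = supp / supports[antecedent]
                if confidence >= min_confidence:
                    rules.append((antecedent, consequent, confidence))
    return rules


# Rounded supports taken from the outputs above.
supports = {
    frozenset({"Cheese"}): 0.501587,
    frozenset({"Meat"}): 0.47619,
    frozenset({"Wine"}): 0.438095,
    frozenset({"Meat", "Cheese"}): 0.32381,
    frozenset({"Wine", "Cheese"}): 0.269841,
}

for antecedent, consequent, confidence in rules_from_itemsets(supports, 0.6):
    print(sorted(antecedent), "->", sorted(consequent), round(confidence, 3))
```

With these numbers, Wine -> Cheese passes the threshold while the reverse rule Cheese -> Wine does not, because the same joint support is divided by a larger antecedent support. This is why confidence is not symmetric.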


Provide an explanation of __antecedent support__, __consequent support__, __support__, __confidence__, __lift__, __leverage__ and __conviction__.

Support: The support of an itemset is the proportion of transactions in the dataset that contain the itemset. It measures the frequency of occurrence of the itemset in the dataset.

Antecedent support: The antecedent support of a rule is the support of the itemset that appears on the left-hand side (antecedent) of the rule.

Consequent support: The consequent support of a rule is the support of the itemset that appears on the right-hand side (consequent) of the rule.

Confidence: The confidence of a rule is the proportion of transactions that contain the antecedent and the consequent out of all the transactions that contain the antecedent. It measures the strength of the association between the antecedent and the consequent.

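Lift: The lift of a rule is the ratio of the observed support of the antecedent and the consequent together to the support expected if they were independent (antecedent support × consequent support). It measures the degree of dependence between the antecedent and the consequent: a lift of 1 indicates independence, values above 1 mean the items occur together more often than expected, and values below 1 mean less often.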

Leverage: The leverage of a rule is the difference between the observed support of the antecedent and the consequent together and the support expected if they were independent. A leverage of 0 indicates independence, and positive values mean the items occur together more often than expected.

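Conviction: The conviction of a rule is the ratio of the expected frequency of the antecedent occurring without the consequent (if the two were independent) to the observed frequency of the antecedent occurring without the consequent, that is, (1 - consequent support) / (1 - confidence). It measures the degree of implication between the antecedent and the consequent: a value of 1 indicates independence, and larger values indicate that the consequent depends more strongly on the antecedent.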
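
As a check on these definitions, the sketch below recomputes the metrics for the rule (Meat) -> (Cheese) from the three rounded support values in the table above. The function name `rule_metrics` is hypothetical. The formula used for `zhangs_metric`, which appears in the table but is not explained here, is an assumption about how that column is defined:

```python
def rule_metrics(antecedent_support, consequent_support, support):
    """Compute rule metrics from the three support values of a rule."""
    confidence = support / antecedent_support
    lift = confidence / consequent_support
    # Support expected if antecedent and consequent were independent.
    expected = antecedent_support * consequent_support
    leverage = support - expected
    # Conviction is unbounded when the rule never fails (confidence == 1).
    if confidence == 1:
        conviction = float("inf")
    else:
        conviction = (1 - consequent_support) / (1 - confidence)
    # Assumed definition of Zhang's metric: leverage scaled into [-1, 1].
    # This sketch does not guard against a zero denominator.
    zhang = leverage / max(
        support * (1 - antecedent_support),
        antecedent_support * (consequent_support - support),
    )
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
        "zhangs_metric": zhang,
    }


# Rounded values from the (Meat) -> (Cheese) row of the rules table.
metrics = rule_metrics(
    antecedent_support=0.47619,
    consequent_support=0.501587,
    support=0.32381,
)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because the inputs are rounded to six decimals, the results agree with the table only up to rounding (roughly the first four decimal places).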