In [67]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

! pip install mlxtend
from mlxtend.frequent_patterns import apriori, association_rules



# Association Rule for Store Dataset

In this case study, we will explore how association rule can be used to analyze the items that are usualy purcased together.

you can refer to this article to find out about apriori and association rule:
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/apriori/
https://rasbt.github.io/mlxtend/user_guide/frequent_patterns/association_rules/

## Load Data

We will use the dataset of the transaction in a certain store. You can get the dataset here: 
https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv

In [68]:
# load the data set and show the first five transaction
url = 'https://gist.githubusercontent.com/Harsh-Git-Hub/2979ec48043928ad9033d8469928e751/raw/72de943e040b8bd0d087624b154d41b2ba9d9b60/retail_dataset.csv'
df = pd.read_csv(url)
df.head()

Unnamed: 0,0,1,2,3,4,5,6
0,Bread,Wine,Eggs,Meat,Cheese,Pencil,Diaper
1,Bread,Cheese,Meat,Diaper,Wine,Milk,Pencil
2,Cheese,Meat,Eggs,Milk,Wine,,
3,Cheese,Meat,Eggs,Milk,Wine,,
4,Meat,Pencil,Wine,,,,


# Get the set of product that has been purchased


Get the unique product that has been purchased

In [71]:
unique = pd.unique(df.values.ravel())
print(unique)

['Bread' 'Wine' 'Eggs' 'Meat' 'Cheese' 'Pencil' 'Diaper' 'Milk' nan
 'Bagel']


## Preprocess Data

In this step, we will transform our dataset so that we will have a one hot encoding based on the purchased products.

In [72]:
#create an itemset based on the products
itemsets = set(unique)

# encoding the feature
encoded = []
for index, row in df.iterrows(): 
    labels = {}
    uncommons = list(set(unique) - set(row))
    commons = list(set(unique).intersection(row))
    for i in uncommons:
        labels[i] = 0
    for j in commons:
        labels[j] = 1
    encoded.append(labels)


In [73]:
# create new dataframe from the encoded features
dfEncoded = pd.DataFrame(encoded)
# show the new dataframe
dfEncoded.head()



Unnamed: 0,NaN,Milk,Bagel,Pencil,Bread,Cheese,Eggs,Diaper,Meat,Wine
0,0,0,0,1,1,1,1,1,1,1
1,0,1,0,1,1,1,0,1,1,1
2,1,1,0,0,0,1,1,0,1,1
3,1,1,0,0,0,1,1,0,1,1
4,1,0,0,1,0,0,0,0,1,1


Since, the encoded dataframe consist of the empty column. We will drop the NaN column or select all columns other than the first column.

In [74]:
dfEncoded = dfEncoded.dropna(axis=1)

## Apriori Algorithm

We will use appriori algorithm to determine the frequently purchased products. 
For this case study, we will min_support=0.2

In [77]:
frequentlyPurchasedProduct = apriori(dfEncoded, min_support=0.2, use_colnames=True)
frequentlyPurchasedProduct



Unnamed: 0,support,itemsets
0,0.869841,(nan)
1,0.501587,(Milk)
2,0.425397,(Bagel)
3,0.361905,(Pencil)
4,0.504762,(Bread)
5,0.501587,(Cheese)
6,0.438095,(Eggs)
7,0.406349,(Diaper)
8,0.47619,(Meat)
9,0.438095,(Wine)


Then, we will generate association rule of the frequent itemset based on confidence level with the threshold=0.6

In [79]:
association_rules(frequentlyPurchasedProduct, metric = "confidence", min_threshold = 0.6)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(Milk),(nan),0.501587,0.869841,0.409524,0.816456,0.938626,-0.026778,0.709141,-0.115976
1,(Bagel),(nan),0.425397,0.869841,0.336508,0.791045,0.909413,-0.03352,0.622902,-0.147743
2,(Pencil),(nan),0.361905,0.869841,0.266667,0.736842,0.8471,-0.048133,0.494603,-0.220499
3,(Bread),(nan),0.504762,0.869841,0.396825,0.786164,0.903801,-0.042237,0.608683,-0.176903
4,(Cheese),(nan),0.501587,0.869841,0.393651,0.78481,0.902245,-0.042651,0.604855,-0.178565
5,(Eggs),(nan),0.438095,0.869841,0.336508,0.768116,0.883053,-0.044565,0.56131,-0.190735
6,(Diaper),(nan),0.406349,0.869841,0.31746,0.78125,0.898152,-0.035999,0.595011,-0.160381
7,(Meat),(nan),0.47619,0.869841,0.368254,0.773333,0.889051,-0.045956,0.57423,-0.192405
8,(Wine),(nan),0.438095,0.869841,0.31746,0.724638,0.833069,-0.063613,0.472682,-0.262869
9,(Milk),(Cheese),0.501587,0.501587,0.304762,0.607595,1.211344,0.053172,1.270148,0.350053


Provide explanation about __antecedent support__, __consequent support__, __support__, __confidence__, __lift__, __leverage__ and __conviction__

Given a rule "A -> C", A stands for antecedent and C stands for consequent.

__o support(A→C) = support(A∪C), range: [0,1]__

__Antecedent support__ computes the proportion of transactions that contain the antecedent A, and __consequent support__ computes the support for the itemset of the consequent C. The __support__ metric then computes the support of the combined itemset A ∪ C.


__o confidence(A→C) = support(A→C) / support(A), range: [0,1]__

The confidence of a rule A->C is the probability of seeing the consequent in a transaction given that it also contains the antecedent. Note that the metric is not symmetric or directed; for instance, the confidence for A->C is different than the confidence for C->A. The confidence is 1 (maximal) for a rule A->C if the consequent and antecedent always occur together.

__o lift(A→C) = confidence(A→C) / support(C), range: [0,∞]__

The lift metric is commonly used to measure how much more often the antecedent and consequent of a rule A->C occur together than we would expect if they were statistically independent. If A and C are independent, the Lift score will be exactly 1.

__o levarage(A→C)=support(A→C)−support(A)×support(C),range: [−1,1]__
 
Leverage computes the difference between the observed frequency of A and C appearing together and the frequency that would be expected if A and C were independent. A leverage value of 0 indicates independence.

__o conviction(A→C)=1−support(C)1−confidence(A→C),range: [0,∞]__

A high conviction value means that the consequent is highly depending on the antecedent. For instance, in the case of a perfect confidence score, the denominator becomes 0 (due to 1 - 1) for which the conviction score is defined as 'inf'. Similar to lift, if items are independent, the conviction is 1.