## Association Rule Mining in a Retail Store

### Problem Statement:
 * What are the items that may be frequently purchased together?

### Objective:
* To learn how the Apriori algorithm and association rules work.
* To learn how combinations and permutations help to find the support and confidence of itemsets, respectively.
* To find frequent itemsets with high confidence and lift; placing such items together can help increase sales.
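
As a minimal illustration of the second objective (the three item names are only examples, and this is a sketch rather than part of the analysis): combinations enumerate unordered candidate itemsets, for which support is measured, while permutations enumerate ordered pairs, because the rule A -> B generally has a different confidence from B -> A.

```python
from itertools import combinations, permutations

# Hypothetical items, used only for illustration
items = ["Bread", "Coffee", "Tea"]

# Itemsets are unordered, so candidates of size 2 come from combinations
candidate_itemsets = list(combinations(items, 2))

# Rules are ordered (antecedent -> consequent), so they come from permutations
candidate_rules = list(permutations(items, 2))

print(len(candidate_itemsets))  # 3 unordered pairs
print(len(candidate_rules))     # 6 ordered pairs
```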


### Introduction
* Association rule mining is an important data mining technique for knowledge discovery.
* It uncovers relationships between the items that occur together in transaction data.
* Retail store analysis is one application area of association rule mining.
* How often items are bought in combination is new knowledge, which is very helpful for decision makers.
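
The measures used in the rest of this notebook can be written compactly. For a rule $A \Rightarrow B$ over $N$ transactions, the usual definitions (which appear to be what the code further down computes) are:

$$
\begin{aligned}
\mathrm{support}(A) &= \frac{|\{t : A \subseteq t\}|}{N} \\
\mathrm{confidence}(A \Rightarrow B) &= \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)} \\
\mathrm{lift}(A \Rightarrow B) &= \frac{\mathrm{support}(A \cup B)}{\mathrm{support}(A)\,\mathrm{support}(B)} \\
\mathrm{leverage}(A \Rightarrow B) &= \mathrm{support}(A \cup B) - \mathrm{support}(A)\,\mathrm{support}(B) \\
\mathrm{conviction}(A \Rightarrow B) &= \frac{1 - \mathrm{support}(B)}{1 - \mathrm{confidence}(A \Rightarrow B)}
\end{aligned}
$$

A lift above 1 means that $A$ and $B$ occur together more often than they would if they were independent.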

### Analysis

In [216]:
## Importing required libraries

import pandas as pd
import numpy as np

In [217]:
# Reading Excel file 
bread = pd.read_excel('raw_bread.xlsx')

In [218]:
## The transaction data is stored in a single column holding "Date,Time,Transaction,Item".
## Duplicate rows should be removed: they only reflect the quantity of an item within the same transaction,
## which the Apriori algorithm does not need, because it only considers which distinct items a transaction contains.
bread

Unnamed: 0,"Date,Time,Transaction,Item"
0,"2016-10-30,09:58:11,1,Bread"
1,"2016-10-30,10:05:34,2,Scandinavian"
2,"2016-10-30,10:05:34,2,Scandinavian"
3,"2016-10-30,10:07:57,3,Hot chocolate"
4,"2016-10-30,10:07:57,3,Jam"
...,...
21288,"2017-04-09,14:32:58,9682,Coffee"
21289,"2017-04-09,14:32:58,9682,Tea"
21290,"2017-04-09,14:57:06,9683,Coffee"
21291,"2017-04-09,14:57:06,9683,Pastry"


In [219]:
## Dropping duplicate rows (the same item repeated within a transaction)
bread = bread.drop_duplicates()

In [220]:
## Split the combined string into a tabular structure (at most 3 splits, giving 4 columns)
new = bread['Date,Time,Transaction,Item'].str.split(',', n = 3, expand = True)
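
The `n = 3` argument limits the number of splits, so any comma after the third one stays inside the last column. A small sketch on made-up rows shows the effect; the item name with a comma is invented for illustration, and whether such names occur in the real data is not checked here.

```python
import pandas as pd

# Hypothetical rows in the same "Date,Time,Transaction,Item" layout;
# the second item name contains a comma on purpose
raw = pd.Series([
    "2016-10-30,09:58:11,1,Bread",
    "2016-10-30,10:05:34,2,Ham, cheese toastie",
])

# n=3 performs at most three splits, so everything after the third comma
# stays together in the last column
parts = raw.str.split(",", n=3, expand=True)
parts.columns = ["Date", "Time", "Transaction", "Item"]

print(parts["Item"].tolist())
```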

In [221]:
import warnings
warnings.filterwarnings('ignore')

In [222]:
## Assigning the split columns to the data frame "bread"
bread['Date'] = new[0]
bread['Time'] = new[1]
bread['Transaction'] = new[2]
bread['Item'] = new[3]

In [223]:
# Of these columns, only Transaction and Item are needed for association rule mining
bread[['Date', 'Time', 'Transaction', 'Item']].head(10)

Unnamed: 0,Date,Time,Transaction,Item
0,2016-10-30,09:58:11,1,Bread
1,2016-10-30,10:05:34,2,Scandinavian
3,2016-10-30,10:07:57,3,Hot chocolate
4,2016-10-30,10:07:57,3,Jam
5,2016-10-30,10:07:57,3,Cookies
6,2016-10-30,10:08:41,4,Muffin
7,2016-10-30,10:13:03,5,Coffee
8,2016-10-30,10:13:03,5,Pastry
9,2016-10-30,10:13:03,5,Bread
10,2016-10-30,10:16:55,6,Medialuna


In [224]:
# Convert the Transaction and Item columns into a crosstab, i.e. a binary matrix (one row per transaction, one column per item)
tab = pd.crosstab(index= bread['Transaction'], columns= bread['Item'])
tab

Transaction,Adjustment,Afternoon with the baker,Alfajores,Argentina Night,Art Tray,Bacon,Baguette,Bakewell,Bare Popcorn,Basket,...,The BART,The Nomad,Tiffin,Toast,Truffles,Tshirt,Valentine's card,Vegan Feast,Vegan mincepie,Victorian Sponge
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
100,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1000,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
998,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
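
Note that `pd.crosstab` counts occurrences, so the matrix is binary only because duplicate rows were dropped beforehand. The sketch below uses made-up data and assumes one wants to enforce 0/1 values explicitly and to sort transactions numerically (the index above is sorted as text: 1, 10, 100, ...). It is an optional variation, not a step the notebook performs.

```python
import pandas as pd

# Hypothetical transaction log; transaction "2" contains Coffee twice
toy = pd.DataFrame({
    "Transaction": ["1", "2", "2", "10", "10"],
    "Item": ["Bread", "Coffee", "Coffee", "Coffee", "Tea"],
})

counts = pd.crosstab(index=toy["Transaction"], columns=toy["Item"])

# Clip counts to 0/1 so that repeated items do not inflate the matrix
binary = (counts > 0).astype(int)

# Convert the string IDs to integers so that rows sort as 1, 2, 10
binary.index = binary.index.astype(int)
binary = binary.sort_index()

print(binary)
```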


In [225]:
## Writing the matrix to a CSV file to inspect the result.
## This revealed one unwanted column named "NONE", which is removed in the next cell.
#tab.to_csv('tab.csv')

In [226]:
## Removing the unwanted column "NONE"
tab = tab.drop(['NONE'], axis = 1)

In [227]:
tab

Transaction,Adjustment,Afternoon with the baker,Alfajores,Argentina Night,Art Tray,Bacon,Baguette,Bakewell,Bare Popcorn,Basket,...,The BART,The Nomad,Tiffin,Toast,Truffles,Tshirt,Valentine's card,Vegan Feast,Vegan mincepie,Victorian Sponge
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
10,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
100,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1000,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1001,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
996,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
997,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
998,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


### Creating the APRIORI_MY function to generate frequent itemsets above a minimum support threshold (0.02 in the call below)

In [228]:
def APRIORI_MY(data, min_support=0.04, max_length=4):
    # Collecting required libraries
    import pandas as pd
    from itertools import combinations

    # Dictionary mapping each frequent itemset (a tuple of items) to its support
    support = {}
    # Step 1: start with all items available in the dataset
    L = list(data.columns)

    # Step 2: in the i-th iteration, generate candidate itemsets of length i
    for i in range(1, max_length + 1):
        c = list(combinations(L, i))

        # Reset "L"; it is rebuilt from the frequent itemsets found in this iteration
        L = []
        # Step 3: iterate through each candidate itemset in "c"
        for j in c:
            # Support = fraction of transactions that contain every item in j
            # (list(j) selects the columns by label; a bare tuple can be misread as a single key)
            sup = data.loc[:, list(j)].product(axis=1).sum() / len(data.index)
            if sup > min_support:
                support[j] = sup

                # Step 4: keep the items of frequent itemsets as input for the next iteration
                # (sorted so that the item order inside each tuple is reproducible)
                L = sorted(set(L) | set(j))

    # Step 5: data frame with columns "Items" and "Support"
    result = pd.DataFrame(list(support.items()), columns=["Items", "Support"])
    return result
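
The support computation inside the loop relies on a small trick: in a 0/1 matrix, the row-wise product of several columns is 1 only for transactions that contain all of those items. A sketch on a made-up matrix (not the bakery data):

```python
import pandas as pd

# Hypothetical binary matrix: rows are transactions, columns are items
toy = pd.DataFrame(
    {"Bread": [1, 1, 0, 1], "Coffee": [1, 0, 1, 1], "Tea": [0, 0, 1, 1]}
)

itemset = ("Bread", "Coffee")

# Row-wise product is 1 only where every item of the itemset is present
contains_all = toy.loc[:, list(itemset)].product(axis=1)

# Support = share of transactions containing the whole itemset
support = contains_all.sum() / len(toy.index)

print(support)  # 2 of the 4 transactions contain both items
```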

In [239]:
sup = APRIORI_MY(tab, 0.02, 3)
sup.sort_values(by = 'Support', ascending = False)

Unnamed: 0,Items,Support
4,"(Coffee,)",0.475081
1,"(Bread,)",0.32494
16,"(Tea,)",0.141643
3,"(Cake,)",0.103137
26,"(Bread, Coffee)",0.089393
11,"(Pastry,)",0.08551
12,"(Sandwich,)",0.071346
9,"(Medialuna,)",0.061379
7,"(Hot chocolate,)",0.057916
28,"(Cake, Coffee)",0.054349


### Creating the ASSOCIATION_RULE_MY function to generate rules above a minimum confidence threshold.

In [240]:
def ASSOCIATION_RULE_MY(df, min_threshold=0.5):
    import pandas as pd
    from itertools import permutations

    # Step 1: creating required variables.
    # Supports are keyed by frozenset so that lookups do not depend on item order.
    support = {frozenset(k): v for k, v in zip(df.Items.values, df.Support.values)}
    data = []
    L = df.Items.values

    # Step 2: generating candidate rules as ordered pairs (antecedent, full itemset)
    p = list(permutations(L, 2))

    # Iterating through each candidate rule
    for i in p:
        ante, full = frozenset(i[0]), frozenset(i[1])

        # The rule is valid only if the antecedent is a proper subset of the full itemset
        if ante < full:
            conf = support[full] / support[ante]
            if conf > min_threshold:
                # Consequent = items of the full itemset that are not in the antecedent
                # (this also works for itemsets with more than two items)
                j = tuple(x for x in i[1] if x not in ante)
                cons = frozenset(j)
                lift = support[full] / (support[ante] * support[cons])
                leverage = support[full] - (support[ante] * support[cons])
                conviction = (1 - support[cons]) / (1 - conf)
                data.append([i[0], j, support[ante], support[cons], support[full],
                             conf, lift, leverage, conviction])

    # Note: the column labelled "Convection" holds the conviction metric
    result = pd.DataFrame(data, columns=["antecedents", "consequents", "antecedent support", "consequent support",
                                         "support", "confidence", "Lift", "Leverage", "Convection"])
    return result

In [241]:
my_asso = ASSOCIATION_RULE_MY(sup, 0.5)
my_asso

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,Lift,Leverage,Convection
0,"(Cake,)","(Coffee,)",0.103137,0.475081,0.054349,0.526958,1.109196,0.00535,1.109667
1,"(Cookies,)","(Coffee,)",0.054034,0.475081,0.028014,0.518447,1.09128,0.002343,1.090053
2,"(Hot chocolate,)","(Coffee,)",0.057916,0.475081,0.029378,0.507246,1.067704,0.001863,1.065276
3,"(Juice,)","(Coffee,)",0.038296,0.475081,0.02046,0.534247,1.124537,0.002266,1.127031
4,"(Medialuna,)","(Coffee,)",0.061379,0.475081,0.034939,0.569231,1.198175,0.005779,1.218561
5,"(Pastry,)","(Coffee,)",0.08551,0.475081,0.047214,0.552147,1.162216,0.00659,1.172079
6,"(Sandwich,)","(Coffee,)",0.071346,0.475081,0.037981,0.532353,1.120551,0.004086,1.122468
7,"(Toast,)","(Coffee,)",0.033365,0.475081,0.023502,0.704403,1.482699,0.007651,1.775789


### Finally sorting results by Lift to get highly associated itemsets.

In [242]:
my_asso.sort_values(by='Lift', ascending= False).head(10)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,Lift,Leverage,Convection
7,"(Toast,)","(Coffee,)",0.033365,0.475081,0.023502,0.704403,1.482699,0.007651,1.775789
4,"(Medialuna,)","(Coffee,)",0.061379,0.475081,0.034939,0.569231,1.198175,0.005779,1.218561
5,"(Pastry,)","(Coffee,)",0.08551,0.475081,0.047214,0.552147,1.162216,0.00659,1.172079
3,"(Juice,)","(Coffee,)",0.038296,0.475081,0.02046,0.534247,1.124537,0.002266,1.127031
6,"(Sandwich,)","(Coffee,)",0.071346,0.475081,0.037981,0.532353,1.120551,0.004086,1.122468
0,"(Cake,)","(Coffee,)",0.103137,0.475081,0.054349,0.526958,1.109196,0.00535,1.109667
1,"(Cookies,)","(Coffee,)",0.054034,0.475081,0.028014,0.518447,1.09128,0.002343,1.090053
2,"(Hot chocolate,)","(Coffee,)",0.057916,0.475081,0.029378,0.507246,1.067704,0.001863,1.065276
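
As a sanity check, the metrics of the top rule (Toast -> Coffee) can be recomputed by hand from the three supports shown in the table. The inputs are the rounded values copied from above, so the results agree with the table only up to small rounding differences.

```python
# Supports copied from the rule table above (rounded to six decimals)
sup_toast = 0.033365   # support(Toast)
sup_coffee = 0.475081  # support(Coffee)
sup_both = 0.023502    # support(Toast, Coffee)

# Confidence: share of Toast transactions that also contain Coffee
confidence = sup_both / sup_toast

# Lift: observed co-occurrence relative to what independence would predict
lift = sup_both / (sup_toast * sup_coffee)

# Leverage: absolute difference between observed and expected co-occurrence
leverage = sup_both - sup_toast * sup_coffee

# Conviction: (1 - consequent support) / (1 - confidence)
conviction = (1 - sup_coffee) / (1 - confidence)

print(confidence, lift, leverage, conviction)
```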


## Cross-verifying results with apriori and association_rules from the mlxtend.frequent_patterns package

In [233]:
from mlxtend.frequent_patterns import apriori, association_rules

In [243]:
## Finding itemsets that are bought together with a support of at least 2% (min_support = 0.02)
associate = apriori(df = tab, min_support= 0.02, use_colnames= True)

In [244]:
associate.sort_values(by = 'support', ascending = False)

Unnamed: 0,support,itemsets
4,0.475081,(Coffee)
1,0.32494,(Bread)
16,0.141643,(Tea)
3,0.103137,(Cake)
20,0.089393,"(Bread, Coffee)"
11,0.08551,(Pastry)
12,0.071346,(Sandwich)
9,0.061379,(Medialuna)
7,0.057916,(Hot chocolate)
23,0.054349,"(Cake, Coffee)"


### Creating association rules in which the consequent is bought with a conditional probability (confidence) of at least 50% given the antecedent

In [245]:
asso_rule = association_rules(associate, min_threshold= 0.5)

In [247]:
## Rules with minimum confidence = 50%
asso_rule = association_rules(associate, min_threshold= 0.5)

## Finally sorting results by Lift to get highly associated itemsets.
asso_rule.sort_values(by='lift', ascending= False).head(10)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
7,(Toast),(Coffee),0.033365,0.475081,0.023502,0.704403,1.482699,0.007651,1.775789
4,(Medialuna),(Coffee),0.061379,0.475081,0.034939,0.569231,1.198175,0.005779,1.218561
5,(Pastry),(Coffee),0.08551,0.475081,0.047214,0.552147,1.162216,0.00659,1.172079
3,(Juice),(Coffee),0.038296,0.475081,0.02046,0.534247,1.124537,0.002266,1.127031
6,(Sandwich),(Coffee),0.071346,0.475081,0.037981,0.532353,1.120551,0.004086,1.122468
0,(Cake),(Coffee),0.103137,0.475081,0.054349,0.526958,1.109196,0.00535,1.109667
1,(Cookies),(Coffee),0.054034,0.475081,0.028014,0.518447,1.09128,0.002343,1.090053
2,(Hot chocolate),(Coffee),0.057916,0.475081,0.029378,0.507246,1.067704,0.001863,1.065276
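
The visual comparison of the two tables can also be automated. The helper below is a hypothetical sketch (`rules_match` is not part of mlxtend or of this notebook). It normalises antecedents and consequents to frozensets, since the custom function returns tuples while mlxtend returns frozensets, and then compares the chosen metric columns within a tolerance. The default column names are assumptions based on the tables above.

```python
import numpy as np
import pandas as pd


def rules_match(own, ref, own_cols=("confidence", "Lift"),
                ref_cols=("confidence", "lift"), tol=1e-6):
    """Return True if both rule tables hold the same rules with equal metrics."""
    def as_dict(df, cols):
        # Key: (antecedent, consequent) as frozensets; value: the metric values
        return {
            (frozenset(a), frozenset(c)): tuple(m)
            for a, c, m in zip(df["antecedents"], df["consequents"],
                               df[list(cols)].to_numpy())
        }

    own_d, ref_d = as_dict(own, own_cols), as_dict(ref, ref_cols)
    # The two tables must contain exactly the same set of rules
    if own_d.keys() != ref_d.keys():
        return False
    return all(np.allclose(own_d[k], ref_d[k], atol=tol) for k in own_d)


# Toy tables with one rule each, using values from the Toast -> Coffee row
own = pd.DataFrame({"antecedents": [("Toast",)], "consequents": [("Coffee",)],
                    "confidence": [0.704403], "Lift": [1.482699]})
ref = pd.DataFrame({"antecedents": [frozenset({"Toast"})],
                    "consequents": [frozenset({"Coffee"})],
                    "confidence": [0.704403], "lift": [1.482699]})

print(rules_match(own, ref))
```

In this notebook the call would presumably be `rules_match(my_asso, asso_rule)`.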


### Conclusion
 * Results from the custom functions (APRIORI_MY, ASSOCIATION_RULE_MY) match those of the mlxtend package.
 * "Toast" and "Coffee" are the most strongly associated items, with a lift of 1.48 for the rule Toast -> Coffee.
 * Coffee is the most frequently bought item, appearing in 47.5% of all transactions.

### References
* https://github.com/viktree/curly-octo-chainsaw/blob/master/BreadBasket_DMS.csv
* http://www.ijarcs.info/index.php/Ijarcs/article/download/4564/4083