## Association Rule Mining - Grocery Store Data Set
<b> In this article, we use the "GroceryStoreDataSet.csv" dataset to identify strong relationships between products with the Apriori algorithm, using measures such as confidence and lift. We look for product associations with a minimum support of 20% and a minimum confidence of 60%. In other words, for every rule X → Y that we keep, at least 60% of the transactions that contain product X also contain product Y.
    
<b> The goal is to discover the strong association rules in the given dataset with the Apriori algorithm. With min_support=20% and min_confidence=60%, we first find the frequent itemsets, then generate association rules from them, calculate the confidence of each rule, and keep all the strong rules.
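For reference, the three measures used in this article can be written as follows. The notation is an assumption made here for illustration: N is the total number of transactions and count(X) is the number of transactions that contain every item of X.

```latex
\begin{aligned}
\mathrm{support}(X) &= \frac{\mathrm{count}(X)}{N} \\
\mathrm{confidence}(X \rightarrow Y) &= \frac{\mathrm{support}(X \cup Y)}{\mathrm{support}(X)} \\
\mathrm{lift}(X \rightarrow Y) &= \frac{\mathrm{confidence}(X \rightarrow Y)}{\mathrm{support}(Y)}
\end{aligned}
```

A lift of 1 means X and Y occur together exactly as often as expected if they were independent; values above 1 indicate a positive association.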
    
<b> The dataset was downloaded from https://www.kaggle.com/shazadudwadia/supermarket.

### Importing all the necessary libraries.

In [1]:
import numpy as np 
import pandas as pd

import warnings
warnings.simplefilter("ignore")

### Loading the Dataset

<b> Load the dataset with read_csv(), store it in the 'df' variable, and look at the first 5 rows using the head() method.

In [2]:
# Load the dataset 
df = pd.read_csv("GroceryStoreDataset.csv")

# Display the first 5 rows using the head() method.
df.head()

Unnamed: 0,Products
0,"MILK,BREAD,BISCUIT"
1,"BREAD,MILK,BISCUIT,CORNFLAKES"
2,"BREAD,TEA,BOURNVITA"
3,"JAM,MAGGI,BREAD,MILK"
4,"MAGGI,TEA,BISCUIT"


<b> Let's examine the shape of the dataset:

In [3]:
df.shape

(20, 1)

<b> We can see that the dataset has just one column, Products, with 20 rows. Each row holds the comma-separated list of products bought in one transaction.

<b> Let's split the products of each transaction and collect them in a list called 'data':

In [4]:
# Split each row on commas and collect the results in a list called 'data'.
data = list(df['Products'].apply(lambda x:x.split(",")))

# print the list
data

[['MILK', 'BREAD', 'BISCUIT'],
 ['BREAD', 'MILK', 'BISCUIT', 'CORNFLAKES'],
 ['BREAD', 'TEA', 'BOURNVITA'],
 ['JAM', 'MAGGI', 'BREAD', 'MILK'],
 ['MAGGI', 'TEA', 'BISCUIT'],
 ['BREAD', 'TEA', 'BOURNVITA'],
 ['MAGGI', 'TEA', 'CORNFLAKES'],
 ['MAGGI', 'BREAD', 'TEA', 'BISCUIT'],
 ['JAM', 'MAGGI', 'BREAD', 'TEA'],
 ['BREAD', 'MILK'],
 ['COFFEE', 'COCK', 'BISCUIT', 'CORNFLAKES'],
 ['COFFEE', 'COCK', 'BISCUIT', 'CORNFLAKES'],
 ['COFFEE', 'SUGER', 'BOURNVITA'],
 ['BREAD', 'COFFEE', 'COCK'],
 ['BREAD', 'SUGER', 'BISCUIT'],
 ['COFFEE', 'SUGER', 'CORNFLAKES'],
 ['BREAD', 'SUGER', 'BOURNVITA'],
 ['BREAD', 'COFFEE', 'SUGER'],
 ['BREAD', 'COFFEE', 'SUGER'],
 ['TEA', 'MILK', 'COFFEE', 'CORNFLAKES']]

<b> Now install the 'mlxtend' library.

Mlxtend (machine learning extensions) is a Python library of useful tools for day-to-day data science tasks. It provides, among other things, the encoder and the frequent-pattern functions used below.

In [6]:
# installing mlxtend library.
!pip install mlxtend



### Apriori Algorithm and One-Hot Encoding

<b> The apriori implementation in mlxtend expects its input as a one-hot encoded table of True/False (or 1/0) values.

<b> Using TransactionEncoder, we convert the list of transactions into a one-hot encoded Boolean array, so that each product a customer did or did not buy in a transaction is represented by True or False.

<b> Via the fit method, the TransactionEncoder learns the unique labels in the dataset, and via the transform method, it transforms the input dataset (a Python list of lists) into a one-hot encoded NumPy boolean array:

In [7]:
# Transform the list of transactions with one-hot encoding

# Import the TransactionEncoder class from the mlxtend.preprocessing module.
from mlxtend.preprocessing import TransactionEncoder

# Create a TransactionEncoder instance named "te"
te = TransactionEncoder()

# Fit the encoder on the data, transform it, and store the result in "te_array".
te_array = te.fit_transform(data)

In [8]:
# Print "te_array".
te_array                        # returns a NumPy boolean array

array([[ True, False,  True, False, False, False, False, False,  True,
        False, False],
       [ True, False,  True, False, False,  True, False, False,  True,
        False, False],
       [False,  True,  True, False, False, False, False, False, False,
        False,  True],
       [False, False,  True, False, False, False,  True,  True,  True,
        False, False],
       [ True, False, False, False, False, False, False,  True, False,
        False,  True],
       [False,  True,  True, False, False, False, False, False, False,
        False,  True],
       [False, False, False, False, False,  True, False,  True, False,
        False,  True],
       [ True, False,  True, False, False, False, False,  True, False,
        False,  True],
       [False, False,  True, False, False, False,  True,  True, False,
        False,  True],
       [False, False,  True, False, False, False, False, False,  True,
        False, False],
       [ True, False, False,  True,  True,  True, False, Fal

<b> Now create a DataFrame from the array above and store it in 'df' (this overwrites the original 'df').

In [9]:
# pandas was already imported above; the import is repeated here for clarity
import pandas as pd

# Create the DataFrame from the boolean array,
# using the item labels learned by the encoder ('te.columns_') as column names
df = pd.DataFrame(te_array, columns=te.columns_)

# print the DataFrame
df

Unnamed: 0,BISCUIT,BOURNVITA,BREAD,COCK,COFFEE,CORNFLAKES,JAM,MAGGI,MILK,SUGER,TEA
0,True,False,True,False,False,False,False,False,True,False,False
1,True,False,True,False,False,True,False,False,True,False,False
2,False,True,True,False,False,False,False,False,False,False,True
3,False,False,True,False,False,False,True,True,True,False,False
4,True,False,False,False,False,False,False,True,False,False,True
5,False,True,True,False,False,False,False,False,False,False,True
6,False,False,False,False,False,True,False,True,False,False,True
7,True,False,True,False,False,False,False,True,False,False,True
8,False,False,True,False,False,False,True,True,False,False,True
9,False,False,True,False,False,False,False,False,True,False,False
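To make the encoding step concrete, here is a minimal plain-Python sketch of the same idea. It is not mlxtend's implementation, only an illustration on the first three transactions; the variable names `transactions`, `columns`, and `one_hot` are introduced here for the example.

```python
# First three transactions from the article's list (toy subset for illustration).
transactions = [
    ["MILK", "BREAD", "BISCUIT"],
    ["BREAD", "MILK", "BISCUIT", "CORNFLAKES"],
    ["BREAD", "TEA", "BOURNVITA"],
]

# "fit": collect the unique item labels in sorted order; these become the columns.
columns = sorted({item for basket in transactions for item in basket})

# "transform": one boolean row per transaction, one entry per column.
one_hot = [[item in basket for item in columns] for basket in transactions]

print(columns)
for row in one_hot:
    print(row)
```

Because only three transactions are used, the column list is shorter than the eleven columns of the full table above.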


## Modelling - Algorithm Implementation
<b> To find the frequent itemsets, we use the apriori function from the mlxtend.frequent_patterns module.

<b> We set min_support to 0.2, so only itemsets that appear in at least 20% of the transactions are kept, and print the result.

<b> The function returns the frequent itemsets, from which the rules are generated later. Consider the code below:

In [10]:
# Import the apriori function from the mlxtend.frequent_patterns module.
from mlxtend.frequent_patterns import apriori

# Call apriori to find the itemsets with support >= 0.2.
itemset = apriori(df, min_support=0.2, use_colnames=True, verbose=1)

# print the result
itemset

Processing 72 combinations | Sampling itemset size 2
Processing 42 combinations | Sampling itemset size 3


Unnamed: 0,support,itemsets
0,0.35,(BISCUIT)
1,0.2,(BOURNVITA)
2,0.65,(BREAD)
3,0.4,(COFFEE)
4,0.3,(CORNFLAKES)
5,0.25,(MAGGI)
6,0.25,(MILK)
7,0.3,(SUGER)
8,0.35,(TEA)
9,0.2,"(BISCUIT, BREAD)"


<b> In the code above, the first statement imports the apriori function. The second calls it and returns the frequent itemsets; no rules are generated yet. It takes the following parameters:
    
   - df : pandas DataFrame in one-hot encoded format.
    
    
   - min_support (default: 0.5): A float between 0 and 1 for the minimum support of the itemsets returned.
    
    
   - use_colnames (default: False): If `True`, uses the DataFrames' column names in the returned DataFrame
  instead of column indices.
    
    
   - verbose (default: 0): Shows the number of iterations if >= 1 and `low_memory` is `True`. If >=1 and `low_memory` is `False`, shows the number of combinations.
    
    
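   - max_len (default: None): Maximum number of products in the generated itemsets. If `None`, all possible itemset lengths are evaluated.

Note that apriori has no parameter for a minimum itemset length; if one is needed, the returned DataFrame can be filtered by itemset length afterwards.

It returns a pandas DataFrame with columns ['support', 'itemsets'] containing all itemsets whose support is >= `min_support` and that have at most `max_len` items (if `max_len` is not None).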
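The level-wise search behind Apriori can be sketched in plain Python. This is a simplified illustration on the article's 20 transactions, not mlxtend's actual implementation; the helper names `support` and `frequent_itemsets` are introduced here for the example.

```python
from itertools import combinations

# The 20 transactions from the article, one comma-separated string per basket.
rows = [
    "MILK,BREAD,BISCUIT", "BREAD,MILK,BISCUIT,CORNFLAKES", "BREAD,TEA,BOURNVITA",
    "JAM,MAGGI,BREAD,MILK", "MAGGI,TEA,BISCUIT", "BREAD,TEA,BOURNVITA",
    "MAGGI,TEA,CORNFLAKES", "MAGGI,BREAD,TEA,BISCUIT", "JAM,MAGGI,BREAD,TEA",
    "BREAD,MILK", "COFFEE,COCK,BISCUIT,CORNFLAKES", "COFFEE,COCK,BISCUIT,CORNFLAKES",
    "COFFEE,SUGER,BOURNVITA", "BREAD,COFFEE,COCK", "BREAD,SUGER,BISCUIT",
    "COFFEE,SUGER,CORNFLAKES", "BREAD,SUGER,BOURNVITA", "BREAD,COFFEE,SUGER",
    "BREAD,COFFEE,SUGER", "TEA,MILK,COFFEE,CORNFLAKES",
]
baskets = [frozenset(r.split(",")) for r in rows]


def support(itemset):
    """Fraction of baskets that contain every item of `itemset`."""
    return sum(itemset <= b for b in baskets) / len(baskets)


def frequent_itemsets(min_support):
    """Level-wise search: size-k candidates are built only from frequent (k-1)-itemsets."""
    items = sorted({i for b in baskets for i in b})
    level = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    found = {}
    k = 1
    while level:
        for s in level:
            found[s] = support(s)
        k += 1
        # Join step: unions of two frequent itemsets that have exactly k items.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent (Apriori property).
        prev = set(level)
        level = [
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
            and support(c) >= min_support
        ]
    return found


result = frequent_itemsets(0.2)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

The pruning step is what keeps the search small: an itemset can only be frequent if all of its subsets are frequent, so most larger combinations are never counted.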

<b> Now we use the extracted frequent itemsets to create rules. A rule is kept when a chosen metric reaches a given threshold.
    
<b> We use the association_rules function from the mlxtend.frequent_patterns module.
    
<b> I chose a minimum confidence of 60%. In other words, for every rule X → Y that is returned, at least 60% of the transactions containing X also contain Y.
    
<b> The function returns the rules that meet this threshold. Consider the code below:

In [11]:
# Import the association_rules function from the mlxtend.frequent_patterns module.
from mlxtend.frequent_patterns import association_rules

# Call association_rules, keeping rules with confidence >= 0.6.
res = association_rules(itemset, metric='confidence', min_threshold=0.6)

# print the result
res

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
0,(MILK),(BREAD),0.25,0.65,0.2,0.8,1.230769,0.0375,1.75
1,(SUGER),(BREAD),0.3,0.65,0.2,0.666667,1.025641,0.005,1.05
2,(CORNFLAKES),(COFFEE),0.3,0.4,0.2,0.666667,1.666667,0.08,1.8
3,(SUGER),(COFFEE),0.3,0.4,0.2,0.666667,1.666667,0.08,1.8
4,(MAGGI),(TEA),0.25,0.35,0.2,0.8,2.285714,0.1125,3.25


In the code above, the first statement imports the association_rules function. The second calls it and returns the rules that satisfy the threshold. It takes the following parameters:


   - df : pandas DataFrame of frequent itemsets with columns ['support', 'itemsets']


   - metric : string (default: 'confidence'): Metric used to decide whether a rule is of interest. Supported metrics are 'support', 'confidence', 'lift', 'leverage', and 'conviction'.


   - min_threshold : float (default: 0.8): Minimal threshold for the evaluation metric.


It returns the pandas DataFrame with columns "antecedents" and "consequents" that store itemsets, plus the scoring metric columns: 

   -  "antecedent support", "consequent support", "support", "confidence", "lift", "leverage", "conviction" of all rules for which metric(rule) >= min_threshold.
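The rule-generation step can be sketched in plain Python as follows. This is an illustration of the idea, not mlxtend's code: the supports are copied from the tables shown in this article, and the names `singles`, `pairs`, `supports`, and `rules` are introduced here for the example.

```python
from itertools import combinations

# Supports copied from the article's tables (single items and the pairs shown there).
singles = {"BISCUIT": 0.35, "BOURNVITA": 0.20, "BREAD": 0.65, "COFFEE": 0.40,
           "CORNFLAKES": 0.30, "MAGGI": 0.25, "MILK": 0.25, "SUGER": 0.30, "TEA": 0.35}
pairs = {("BISCUIT", "BREAD"): 0.20, ("MILK", "BREAD"): 0.20, ("SUGER", "BREAD"): 0.20,
         ("CORNFLAKES", "COFFEE"): 0.20, ("SUGER", "COFFEE"): 0.20, ("MAGGI", "TEA"): 0.20}

supports = {frozenset([k]): v for k, v in singles.items()}
supports.update({frozenset(k): v for k, v in pairs.items()})


def rules(supports, min_confidence):
    """Return (antecedent, consequent, confidence) for every rule that passes."""
    out = []
    for itemset, sup in supports.items():
        if len(itemset) < 2:
            continue
        # Every non-empty proper subset of the itemset can act as the antecedent.
        for r in range(1, len(itemset)):
            for ante in combinations(sorted(itemset), r):
                ante = frozenset(ante)
                conf = sup / supports[ante]
                if conf >= min_confidence:
                    out.append((ante, itemset - ante, conf))
    return out


strong = rules(supports, min_confidence=0.6)
for ante, cons, conf in strong:
    print(sorted(ante), "->", sorted(cons), round(conf, 6))
```

Each frequent pair yields two candidate rules, one per direction, and only the direction whose confidence reaches the threshold is kept.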

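<b> For example, let's examine the rule at index 1, (SUGER) → (BREAD):

- <b> Sugar appears in 30% of all transactions (antecedent support).
    
- <b> Bread appears in 65% of all transactions (consequent support).
    
- <b> Both appear together in 20% of all transactions (support).
    
- <b> 67% of the customers who buy sugar buy bread as well (confidence).
    
- <b> The lift is 1.03: customers who buy sugar are only about 3% more likely to buy bread than the average customer, so this association is weak.
    
- <b> The leverage is 0.005 and the conviction is 1.05. Both are close to the values they take under independence (0 and 1), which confirms that the rule is weak.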
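The metrics for this rule can be recomputed by hand from the three support values in the table. The sketch below uses the standard definitions of confidence, lift, leverage, and conviction; the variable names are introduced here for illustration.

```python
# Values for the rule (SUGER) -> (BREAD), taken from the rules table above.
support_x = 0.30    # antecedent support: share of baskets containing SUGER
support_y = 0.65    # consequent support: share of baskets containing BREAD
support_xy = 0.20   # share of baskets containing both

confidence = support_xy / support_x               # P(BREAD | SUGER)
lift = confidence / support_y                     # ratio to the baseline P(BREAD)
leverage = support_xy - support_x * support_y     # difference from independence
conviction = (1 - support_y) / (1 - confidence)   # 1.0 would mean independence

print(round(confidence, 6), round(lift, 6), round(leverage, 6), round(conviction, 6))
```

Up to rounding, these should agree with the confidence, lift, leverage, and conviction columns of the row at index 1.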
    
    
<b> As a result, if items X and Y are frequently bought together, several steps can be taken to increase profit. For instance:

- <b> Cross-selling can be improved by bundling these products.
    
- <b> The shop layout can be changed so that items that sell together are placed near each other.
    
- <b> Promotional activities, such as advertising campaigns, can be carried out to increase the sales of goods that customers rarely buy.
    
- <b> Combined discounts can be offered on these products if the customer buys both of them.
    