# Market Basket Analysis using Different Association Algorithms

Market Basket Analysis is a standard benchmark task for comparing association rule mining algorithms. The most common of these are **Apriori** and **FP-Growth** (which is built on an FP-tree).
This notebook is divided into five parts, listed below. Each part is covered in its own section.

- Loading Data
- Basic Preprocessing
- EDA
- Model Creation and Rule Generation
- Insights

As usual, we begin by importing the required packages. We use the **MLXtend** package because it provides functions for association analysis. We use **Pandas** for reading the data, basic preprocessing and conversion to *DataFrames*. **Seaborn** and **Matplotlib** are imported for visualizing association rules and for basic EDA.


In [70]:
# Import required packages
import time
import numpy as np
import pandas as pd
import mlxtend as mlx
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import fpgrowth
from mlxtend.frequent_patterns import association_rules

import seaborn as sns
import matplotlib.pyplot as plt

# Pandas Configurations
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

# I. Loading the data

In this section, we read the data, which is in XLSX format, using the `read_excel` function in Pandas. We pass `header=None` because the original dataset has no header row.

We are using `head()` function to display a small subset of data from the entire dataset.

In [71]:
# Load the dataset (raw string so the backslashes in the Windows path are not treated as escapes)
data = pd.read_excel(r"D:\MindLab\AI\Machine-Learning-and-Data-Science-Projects\Datasets\MBA.xlsx", header=None)
data.head(6)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
2,chutney,,,,,,,,,,,,,,,,,,,
3,turkey,avocado,,,,,,,,,,,,,,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,
5,low fat yogurt,,,,,,,,,,,,,,,,,,,
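
The effect of `header=None` can be illustrated on a small in-memory table. The sketch below uses `read_csv` on hand-typed text instead of `read_excel` on the real file (an assumption, made so that it runs without the dataset); the `header` argument plays the same role in both readers.

```python
import io
import pandas as pd

# Hand-typed stand-in for a few headerless rows (not the real file).
raw_text = "burgers,meatballs,eggs\nchutney,,\nturkey,avocado,\n"

# Default behaviour: the first row is consumed as column names.
with_inferred_header = pd.read_csv(io.StringIO(raw_text))

# header=None: every row is kept as data and the columns are numbered 0, 1, 2, ...
without_header = pd.read_csv(io.StringIO(raw_text), header=None)

print(with_inferred_header.shape)    # one transaction is lost to the header
print(without_header.shape)          # all three transactions are kept
print(list(without_header.columns))
```

Without `header=None`, the first basket would be silently treated as column names and dropped from the analysis.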


# II. Basic Preprocessing

In this section the *NaN* values in the dataset are first replaced with empty strings, which makes them easy to remove later. The `filter` function is then used to drop the empty strings from each row. The result is a list of lists, which is the format that MLXtend's `TransactionEncoder` expects.


In [72]:
data = data.fillna("") # Fill the NaNs with empty string

In [73]:
# Remove those empty strings and convert the data to a list of lists

processed_data = [ list(filter(None, x)) for x in data.values.tolist() ]
print(processed_data[0:12]) # Printing out a sample

[['shrimp', 'almonds', 'avocado', 'vegetables mix', 'green grapes', 'whole weat flour', 'yams', 'cottage cheese', 'energy drink', 'tomato juice', 'low fat yogurt', 'green tea', 'honey', 'salad', 'mineral water', 'salmon', 'antioxydant juice', 'frozen smoothie', 'spinach', 'olive oil'], ['burgers', 'meatballs', 'eggs'], ['chutney'], ['turkey', 'avocado'], ['mineral water', 'milk', 'energy bar', 'whole wheat rice', 'green tea'], ['low fat yogurt'], ['whole wheat pasta', 'french fries'], ['soup', 'light cream', 'shallot'], ['frozen vegetables', 'spaghetti', 'green tea'], ['french fries'], ['eggs', 'pet food'], ['cookies']]
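
As an alternative sketch, the missing cells can be dropped directly, without the intermediate `fillna("")` step. The small frame below is a hand-built stand-in for `data` (an assumption, so that the example runs on its own), and `toy_data` and `toy_transactions` are illustrative names.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for the raw frame: short baskets are padded with NaN.
toy_data = pd.DataFrame([
    ["burgers", "meatballs", "eggs"],
    ["chutney", np.nan, np.nan],
    ["turkey", "avocado", np.nan],
])

# Keep only the non-missing cells of each row.
toy_transactions = [
    [item for item in row if pd.notna(item)]
    for row in toy_data.values.tolist()
]

print(toy_transactions)
```

Both approaches produce a list of lists. The `filter(None, ...)` version additionally drops any other falsy value, such as a cell that already holds an empty string.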


# III. EDA

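In this section we do some basic EDA: we check the shape of the dataset and its column types to understand how it is structured. Note that `info()` is called after the *NaN* values were replaced with empty strings, so every column reports 7501 non-null values.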


In [74]:
# Shape of dataset
print(f"Total number of rows in the Dataset: {data.shape[0]}")
print(f"Total number of columns in the Dataset: {data.shape[1]}")

Total number of rows in the Dataset: 7501
Total number of columns in the Dataset: 20


In [75]:
# Description of the Dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7501 entries, 0 to 7500
Data columns (total 20 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       7501 non-null   object
 1   1       7501 non-null   object
 2   2       7501 non-null   object
 3   3       7501 non-null   object
 4   4       7501 non-null   object
 5   5       7501 non-null   object
 6   6       7501 non-null   object
 7   7       7501 non-null   object
 8   8       7501 non-null   object
 9   9       7501 non-null   object
 10  10      7501 non-null   object
 11  11      7501 non-null   object
 12  12      7501 non-null   object
 13  13      7501 non-null   object
 14  14      7501 non-null   object
 15  15      7501 non-null   object
 16  16      7501 non-null   object
 17  17      7501 non-null   object
 18  18      7501 non-null   object
 19  19      7501 non-null   object
dtypes: object(20)
memory usage: 1.1+ MB


# IV. Model Creation and Rule Generation

In this section we mine frequent itemsets and association rules using the **MLXtend** package. First we pass the processed data through MLXtend's `TransactionEncoder`, which converts the list of lists into a DataFrame of Booleans with one column per item. We then run `apriori` on that DataFrame to find the frequent itemsets, and use `association_rules` to derive candidate association rules from them.


In [76]:
# Transaction Encoder
transact_encoder = TransactionEncoder()
transact_array = transact_encoder.fit(processed_data).transform(processed_data)
transformed_df = pd.DataFrame(transact_array, columns=transact_encoder.columns_)

In [77]:
transformed_df.head(6)

Unnamed: 0,almonds,antioxydant juice,asparagus,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,body spray,bramble,brownies,bug spray,burger sauce,burgers,butter,cake,candy bars,carrots,cauliflower,cereals,champagne,chicken,chili,chocolate,chocolate bread,chutney,cider,clothes accessories,cookies,cooking oil,corn,cottage cheese,cream,dessert wine,eggplant,eggs,energy bar,energy drink,escalope,extra dark chocolate,flax seed,french fries,french wine,fresh bread,fresh tuna,fromage blanc,frozen smoothie,frozen vegetables,gluten free bar,grated cheese,green beans,green grapes,green tea,ground beef,gums,ham,hand protein bar,herb & pepper,honey,hot dogs,ketchup,light cream,light mayo,low fat yogurt,magazines,mashed potato,mayonnaise,meatballs,melons,milk,mineral water,mint,mint green tea,muffins,mushroom cream sauce,napkins,nonfat milk,oatmeal,oil,olive oil,pancakes,parmesan cheese,pasta,pepper,pet food,pickles,protein bar,red wine,rice,salad,salmon,salt,sandwich,shallot,shampoo,shrimp,soda,soup,spaghetti,sparkling water,spinach,strawberries,strong cheese,tea,tomato juice,tomato sauce,tomatoes,toothpaste,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
0,True,True,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,True,True,False,False,False,False,False,True,False,False,False,False,True,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,True,True,False,False,False,False,True,False,False,False,False,True,False,False,False,True,False,False,False,False,True,False,False,True,False,False,True,False,False
1,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
3,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False
5,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,True,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False,False
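
To make the encoding step concrete, here is a plain-Python sketch of the kind of table shown above: one Boolean column per distinct item, in alphabetical order, and one row per transaction. It illustrates the idea only and is not MLXtend's implementation; the helper name `one_hot_encode` is hypothetical.

```python
def one_hot_encode(transactions):
    """Return (columns, rows): sorted item names and one Boolean row per basket."""
    columns = sorted({item for basket in transactions for item in basket})
    rows = [[item in basket for item in columns] for basket in transactions]
    return columns, rows


# Three baskets taken from the sample printed in the preprocessing section.
sample = [
    ["burgers", "meatballs", "eggs"],
    ["turkey", "avocado"],
    ["eggs", "pet food"],
]

columns, rows = one_hot_encode(sample)
print(columns)
for row in rows:
    print(row)
```

Each `True` marks an item that is present in that basket, which is the representation that `apriori` and `fpgrowth` take as input.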


Below are the cells for Apriori and FPGrowth. We run both algorithms to compare their execution times. Any difference is of little concern for this dataset, however, as it is very small (7501 rows, 20 columns).

In [93]:
%%time
# Frequent Item Sets - Apriori
frequent_itemsets_apr = apriori(transformed_df, min_support=0.02, use_colnames=True)
print(frequent_itemsets_apr.tail(12))

      support                           itemsets
91   0.035462                  (spaghetti, milk)
92   0.027596         (mineral water, olive oil)
93   0.033729          (pancakes, mineral water)
94   0.023597            (shrimp, mineral water)
95   0.023064              (soup, mineral water)
96   0.059725         (spaghetti, mineral water)
97   0.024397          (mineral water, tomatoes)
98   0.020131  (whole wheat rice, mineral water)
99   0.022930             (spaghetti, olive oil)
100  0.025197              (pancakes, spaghetti)
101  0.021197                (shrimp, spaghetti)
102  0.020931              (spaghetti, tomatoes)
Wall time: 118 ms
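
The `min_support=0.02` argument can be read as a minimum number of transactions. The sketch below works this out for the 7501 rows reported in the EDA section and computes support by hand on six baskets copied from the sample printed earlier. The `support` helper is hypothetical and only illustrates the definition.

```python
import math


def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset.issubset(basket))
    return hits / len(transactions)


# With 7501 transactions, a support of at least 0.02 needs at least this many baskets.
min_count = math.ceil(0.02 * 7501)
print(min_count)

# Six baskets copied from the sample printed in the preprocessing section.
baskets = [
    ["burgers", "meatballs", "eggs"],
    ["chutney"],
    ["turkey", "avocado"],
    ["mineral water", "milk", "energy bar", "whole wheat rice", "green tea"],
    ["frozen vegetables", "spaghetti", "green tea"],
    ["eggs", "pet food"],
]

print(support(baskets, {"green tea"}))               # 2 of 6 baskets
print(support(baskets, {"green tea", "spaghetti"}))  # 1 of 6 baskets
```

The threshold of 151 baskets agrees with the smallest support in the Apriori output above, 0.020131, which corresponds to 151 of the 7501 transactions.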


In [84]:
%%time
# Frequent Item Sets - FPGrowth
frequent_itemsets_fp = fpgrowth(transformed_df, min_support=0.02, use_colnames=True)
print(frequent_itemsets_fp.tail(12))

      support                      itemsets
91   0.022797      (chicken, mineral water)
92   0.024397     (mineral water, tomatoes)
93   0.020931         (spaghetti, tomatoes)
94   0.033729     (pancakes, mineral water)
95   0.025197         (pancakes, spaghetti)
96   0.020131      (pancakes, french fries)
97   0.021730              (pancakes, eggs)
98   0.040928  (ground beef, mineral water)
99   0.039195      (ground beef, spaghetti)
100  0.021997           (ground beef, milk)
101  0.023064      (ground beef, chocolate)
102  0.027463         (cake, mineral water)
Wall time: 107 ms


As seen above, the two algorithms find the same frequent itemsets (103 in each case); only the row order differs, which is why the two `tail(12)` outputs do not match. In terms of execution time, across multiple executions of the same cell **FPGrowth** averages about **107 ms**, whereas **Apriori** averages about **118 ms**. Therefore we will use **FPGrowth** for rule generation. The gap is negligible for this dataset, but it can become significant on a much larger one.
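
Both observations can be checked in code instead of by eye. The following is a minimal sketch, assuming frames with the `support` and `itemsets` columns shown in the outputs above. The helper names (`itemset_supports`, `same_itemsets`, `mean_runtime_ms`) are hypothetical, and the two small frames are hand-built stand-ins so that the block runs on its own.

```python
import time
import pandas as pd


def itemset_supports(frequent_itemsets, decimals=6):
    """Map each itemset (as a frozenset) to its rounded support."""
    return {
        frozenset(items): round(float(value), decimals)
        for items, value in zip(frequent_itemsets["itemsets"], frequent_itemsets["support"])
    }


def same_itemsets(left, right):
    """True if both frames hold the same itemsets with the same supports, in any order."""
    return itemset_supports(left) == itemset_supports(right)


def mean_runtime_ms(func, repeats=10):
    """Average wall-clock time of func() over `repeats` calls, in milliseconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return 1000 * sum(durations) / len(durations)


# Hand-built stand-ins: two rows copied from the Apriori output, in two different orders.
apr_like = pd.DataFrame({
    "support": [0.035462, 0.059725],
    "itemsets": [frozenset({"spaghetti", "milk"}), frozenset({"spaghetti", "mineral water"})],
})
fp_like = apr_like.iloc[::-1].reset_index(drop=True)

print(same_itemsets(apr_like, fp_like))
print(mean_runtime_ms(lambda: sum(range(10_000)), repeats=5))
```

In the notebook itself, the same helpers could be called as `same_itemsets(frequent_itemsets_apr, frequent_itemsets_fp)` and `mean_runtime_ms(lambda: fpgrowth(transformed_df, min_support=0.02, use_colnames=True))`.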

In [94]:
# Mine association rules with a minimum lift of 1.4
print("Association Rules with Minimum Lift of 1.4")
print("--------------------------------------------")
assoc_rules = association_rules(frequent_itemsets_fp, metric="lift", min_threshold=1.4)
assoc_rules_processed = assoc_rules[["antecedents", "consequents", "antecedent support", "consequent support", "lift", "confidence"]]
assoc_rules_processed.head(12)

Association Rules with Minimum Lift of 1.4
--------------------------------------------


Unnamed: 0,antecedents,consequents,antecedent support,consequent support,lift,confidence
0,(shrimp),(spaghetti),0.071457,0.17411,1.70376,0.296642
1,(spaghetti),(shrimp),0.17411,0.071457,1.70376,0.121746
2,(mineral water),(olive oil),0.238368,0.065858,1.757904,0.115772
3,(olive oil),(mineral water),0.065858,0.238368,1.757904,0.419028
4,(spaghetti),(olive oil),0.17411,0.065858,1.999758,0.1317
5,(olive oil),(spaghetti),0.065858,0.17411,1.999758,0.348178
6,(burgers),(eggs),0.087188,0.179709,1.83783,0.330275
7,(eggs),(burgers),0.179709,0.087188,1.83783,0.160237
8,(burgers),(french fries),0.087188,0.170911,1.476173,0.252294
9,(french fries),(burgers),0.170911,0.087188,1.476173,0.128705


In [95]:
assoc_rules_processed.tail(12)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,lift,confidence
42,(pancakes),(spaghetti),0.095054,0.17411,1.522468,0.265077
43,(spaghetti),(pancakes),0.17411,0.095054,1.522468,0.144717
44,(ground beef),(mineral water),0.098254,0.238368,1.747522,0.416554
45,(mineral water),(ground beef),0.238368,0.098254,1.747522,0.1717
46,(ground beef),(spaghetti),0.098254,0.17411,2.291162,0.398915
47,(spaghetti),(ground beef),0.17411,0.098254,2.291162,0.225115
48,(ground beef),(milk),0.098254,0.129583,1.727704,0.223881
49,(milk),(ground beef),0.129583,0.098254,1.727704,0.169753
50,(ground beef),(chocolate),0.098254,0.163845,1.432669,0.234735
51,(chocolate),(ground beef),0.163845,0.098254,1.432669,0.140765


# V. Insights

Based on the above outputs, we can pick out some promising association rules. The rules above were generated with a minimum **Lift** threshold of **1.4**. Among the rows shown, for example, (ground beef) and (spaghetti) have the highest lift, about 2.29. Further insights are provided in the documentation for the dataset.
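
As a worked check of how these metrics relate, the sketch below recomputes confidence and lift for the rule (ground beef) → (spaghetti) from the supports printed above. It uses the standard definitions: confidence is the support of the pair divided by the support of the antecedent, and lift is the confidence divided by the support of the consequent.

```python
# Values copied from the outputs above.
support_pair = 0.039195        # support of (ground beef, spaghetti)
support_antecedent = 0.098254  # support of (ground beef)
support_consequent = 0.174110  # support of (spaghetti)

# confidence(A -> B) = support(A and B) / support(A)
confidence = support_pair / support_antecedent

# lift(A -> B) = confidence(A -> B) / support(B)
lift = confidence / support_consequent

print(round(confidence, 4))
print(round(lift, 2))
```

The results match the `confidence` (0.398915) and `lift` (2.291162) columns of the rules table up to rounding. A lift above 1 means the two items appear together more often than would be expected if they were bought independently.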
