# Market Basket Analysis using different Association Algorithms  

Market Basket Analysis is a common benchmark task for trying out different association rule mining algorithms. The most frequently used algorithms for it are **Apriori** and **FP-Growth** (which builds an **FP Tree**).  
This notebook is divided into 5 parts, each covering one step in detail. The parts are listed below.

- Loading the Data
- Basic Preprocessing
- EDA
- Model Creation
- Insights


In [12]:
# Import required packages
import numpy as np
import pandas as pd
import mlxtend as mlx
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

import seaborn as sns
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

## Loading the data
### Read the data, which is in XLSX format, using the `read_excel` function in Pandas and pass `header=None`.

In [13]:
# Load the dataset, then preview the first rows
# Raw string (r"...") so the backslashes in the Windows path are not treated as escape sequences
data = pd.read_excel(r"D:\MindLab\AI\Machine-Learning-and-Data-Science-Projects\Datasets\MBA.xlsx", header=None)
data.head(6)

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
2,chutney,,,,,,,,,,,,,,,,,,,
3,turkey,avocado,,,,,,,,,,,,,,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,
5,low fat yogurt,,,,,,,,,,,,,,,,,,,


## Basic Preprocessing

In this section, the *NaN* values in the dataset are first replaced with empty strings, which are easy to remove afterwards. Python's built-in `filter` function then drops the empty strings from each row. The result is a list of lists, one list of items per transaction, which is the format that MLXtend's `TransactionEncoder` expects.


In [14]:
data = data.fillna("")

In [15]:
processed_data = [ list(filter(None, x)) for x in data.values.tolist() ]

# Alternative that skips NaN cells directly (only useful if fillna("") has not been applied):
# processed_data = [[y for y in x if pd.notna(y)] for x in data.values.tolist()]
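
The two preprocessing steps can be checked on a small example. The sketch below uses a made-up three-row frame (`toy` is a hypothetical name, not part of the notebook) that mimics the ragged layout of the real data, and shows that the `fillna` plus `filter` route and the `pd.notna` route give the same list of lists.

```python
import numpy as np
import pandas as pd

# Made-up frame for illustration; rows are padded with NaN like the real data
toy = pd.DataFrame([
    ["burgers", "meatballs", "eggs"],
    ["chutney", np.nan, np.nan],
    ["turkey", "avocado", np.nan],
])

# Step 1: replace NaN with "" so every cell is a string
toy_filled = toy.fillna("")

# Step 2: filter(None, row) drops falsy values, here the empty strings
toy_transactions = [list(filter(None, row)) for row in toy_filled.values.tolist()]

# One-step alternative on the unfilled frame: keep only the non-NaN cells
toy_alternative = [[cell for cell in row if pd.notna(cell)] for row in toy.values.tolist()]

print(toy_transactions)
print(toy_transactions == toy_alternative)
```

Note that `filter(None, ...)` removes every falsy value, so it is only safe here because all real items are non-empty strings.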

## EDA

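This section takes a first look at the dataset with `info()` and `describe()`. Because the *NaN* values were already replaced with empty strings, every column is reported as fully non-null, and the most frequent value in columns 1 to 19 is the empty string.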

In [16]:
# Description of the Dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7501 entries, 0 to 7500
Data columns (total 20 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   0       7501 non-null   object
 1   1       7501 non-null   object
 2   2       7501 non-null   object
 3   3       7501 non-null   object
 4   4       7501 non-null   object
 5   5       7501 non-null   object
 6   6       7501 non-null   object
 7   7       7501 non-null   object
 8   8       7501 non-null   object
 9   9       7501 non-null   object
 10  10      7501 non-null   object
 11  11      7501 non-null   object
 12  12      7501 non-null   object
 13  13      7501 non-null   object
 14  14      7501 non-null   object
 15  15      7501 non-null   object
 16  16      7501 non-null   object
 17  17      7501 non-null   object
 18  18      7501 non-null   object
 19  19      7501 non-null   object
dtypes: object(20)
memory usage: 1.1+ MB


In [17]:
data.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
count,7501,7501.0,7501.0,7501.0,7501.0,7501.0,7501.0,7501.0,7501.0,7501.0,7501.0,7501.0,7501.0,7501.0,7501.0,7501.0,7501.0,7501.0,7501.0,7501.0
unique,115,118.0,116.0,115.0,111.0,107.0,103.0,98.0,89.0,81.0,67.0,51.0,44.0,29.0,20.0,9.0,4.0,4.0,4.0,2.0
top,mineral water,,,,,,,,,,,,,,,,,,,
freq,577,1754.0,3112.0,4156.0,4972.0,5637.0,6132.0,6520.0,6847.0,7106.0,7245.0,7347.0,7414.0,7454.0,7476.0,7493.0,7497.0,7497.0,7498.0,7500.0


In [18]:
# Graphs
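
One possible starting point for the graphs is a count of how often each item occurs and how large the baskets are. The sketch below is an assumption about what could be plotted, not the notebook's own analysis; `sample_transactions` is a made-up stand-in for `processed_data`.

```python
import pandas as pd

# Made-up stand-in for `processed_data` (not the real dataset)
sample_transactions = [
    ["mineral water", "eggs", "chocolate"],
    ["eggs", "french fries"],
    ["mineral water", "chocolate"],
    ["mineral water"],
]

# Flatten the list of lists and count how often each item appears
item_counts = pd.Series(
    [item for basket in sample_transactions for item in basket]
).value_counts()

# Basket size is the number of items in each transaction
basket_sizes = pd.Series([len(basket) for basket in sample_transactions])

print(item_counts.head(3))
print(basket_sizes.describe())
```

With the real data, `item_counts.head(10).plot(kind="barh")` would draw the ten most frequent items as a horizontal bar chart through matplotlib.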

## Model Creation
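In this section we build an Apriori model using the `MLXtend` package. The transactions are first one-hot encoded, then frequent itemsets are mined, and finally association rules are derived from those itemsets.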

In [23]:
# Transaction Encoder
transact_encoder = TransactionEncoder()
transact_array = transact_encoder.fit(processed_data).transform(processed_data)
transformed_df = pd.DataFrame(transact_array, columns=transact_encoder.columns_)
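
To make the encoding step concrete, the sketch below reproduces the kind of table the encoder yields: one row per transaction, one boolean column per item. It is a plain pandas illustration, not MLXtend's implementation, and `one_hot_transactions` and `toy_baskets` are hypothetical names.

```python
import pandas as pd

def one_hot_transactions(transactions):
    """Return a boolean DataFrame: one row per transaction, one column per item."""
    # Sorted unique item names become the columns
    items = sorted({item for basket in transactions for item in basket})
    rows = []
    for basket in transactions:
        present = set(basket)
        # True where the item occurs in this basket, False otherwise
        rows.append([item in present for item in items])
    return pd.DataFrame(rows, columns=items)

toy_baskets = [["eggs", "milk"], ["milk"], ["bread", "eggs"]]
toy_encoded = one_hot_transactions(toy_baskets)
print(toy_encoded)
```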

In [24]:
transformed_df.head(6)

Unnamed: 0,almonds,antioxydant juice,asparagus,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,body spray,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
0,True,True,False,True,False,False,False,False,False,False,...,False,True,False,False,True,False,False,True,False,False
1,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
2,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False
3,False,False,False,True,False,False,False,False,False,False,...,True,False,False,False,False,False,False,False,False,False
4,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,True,False,False,False
5,False,False,False,False,False,False,False,False,False,False,...,False,False,False,False,False,False,False,False,False,False


In [25]:
frequent_itemsets = apriori(transformed_df, min_support=0.03, use_colnames=True)
print(frequent_itemsets)

     support                            itemsets
0   0.033329                           (avocado)
1   0.033729                          (brownies)
2   0.087188                           (burgers)
3   0.030129                            (butter)
4   0.081056                              (cake)
5   0.046794                         (champagne)
6   0.059992                           (chicken)
7   0.163845                         (chocolate)
8   0.080389                           (cookies)
9   0.051060                       (cooking oil)
10  0.031862                    (cottage cheese)
11  0.179709                              (eggs)
12  0.079323                          (escalope)
13  0.170911                      (french fries)
14  0.043061                       (fresh bread)
15  0.063325                   (frozen smoothie)
16  0.095321                 (frozen vegetables)
17  0.052393                     (grated cheese)
18  0.132116                         (green tea)
19  0.098254        
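
The `support` column is the fraction of transactions that contain the itemset. The sketch below computes it by hand on made-up baskets (`support` and `toy_baskets` are hypothetical names) and, assuming the `min_support` threshold is inclusive, works out how many of the 7501 transactions an itemset needs to appear in to pass `min_support=0.03`.

```python
import math

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= set(basket))
    return hits / len(transactions)

# Made-up baskets for illustration
toy_baskets = [
    ["eggs", "milk"],
    ["eggs", "chocolate"],
    ["milk"],
    ["eggs", "milk", "chocolate"],
]

print(support({"eggs"}, toy_baskets))          # 3 of 4 baskets -> 0.75
print(support({"eggs", "milk"}, toy_baskets))  # 2 of 4 baskets -> 0.5

# Smallest basket count whose support reaches 0.03 among 7501 transactions
min_count = math.ceil(0.03 * 7501)
print(min_count)  # 226
```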

In [27]:
# Mine Association Rules
assoc_rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.06)
print(assoc_rules)

            antecedents          consequents  antecedent support  \
0                (eggs)          (chocolate)            0.179709   
1           (chocolate)               (eggs)            0.163845   
2        (french fries)          (chocolate)            0.170911   
3           (chocolate)       (french fries)            0.163845   
4                (milk)          (chocolate)            0.129583   
5           (chocolate)               (milk)            0.163845   
6       (mineral water)          (chocolate)            0.238368   
7           (chocolate)      (mineral water)            0.163845   
8           (chocolate)          (spaghetti)            0.163845   
9           (spaghetti)          (chocolate)            0.174110   
10       (french fries)               (eggs)            0.170911   
11               (eggs)       (french fries)            0.179709   
12               (eggs)               (milk)            0.179709   
13               (milk)               (eggs)    
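
For reading the rule table, it helps to recall how the usual metrics are defined: confidence of a rule A -> C is support(A and C) / support(A), and lift is confidence / support(C). The sketch below applies these definitions to made-up baskets; `support`, `rule_metrics` and `toy_baskets` are hypothetical names and the numbers are not taken from the dataset.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= set(basket))
    return hits / len(transactions)

def rule_metrics(antecedent, consequent, transactions):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    a, c = set(antecedent), set(consequent)
    support_a = support(a, transactions)
    support_c = support(c, transactions)
    support_ac = support(a | c, transactions)  # baskets containing both sides
    confidence = support_ac / support_a
    lift = confidence / support_c
    return {"support": support_ac, "confidence": confidence, "lift": lift}

# Made-up baskets for illustration
toy_baskets = [
    ["eggs", "milk"],
    ["eggs", "chocolate"],
    ["milk"],
    ["eggs", "milk", "chocolate"],
]

forward = rule_metrics({"eggs"}, {"chocolate"}, toy_baskets)
backward = rule_metrics({"chocolate"}, {"eggs"}, toy_baskets)
print(forward)
print(backward)
```

Confidence depends on the direction of the rule, while lift is the same in both directions. A lift above 1 means the two sides occur together more often than they would if they were independent.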