<a href="https://colab.research.google.com/github/SonakshiA/Market-Basket-Analysis-for-a-Supermarket/blob/main/Market_Basket_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Market Basket Analysis**

Market Basket Analysis (MBA) is a data mining technique used in retail to uncover the associations and patterns between items purchased together. It helps in understanding customer purchasing behavior by identifying itemsets that frequently co-occur in transactions.


**Key Terms for Market Basket Analysis:**

1. **Support**: The fraction of all transactions that contain an item or group of items.
    Support(X) = (Number of transactions containing X) / (Total number of transactions)


2. **Confidence**: The probability that a customer buys item B given that they have bought item A.
      Confidence(A → B) = P(B|A) = P(A and B) / P(A) = Support(A and B) / Support(A)

3. **Lift**: The ratio of a rule's confidence to the support of its consequent: Lift(A → B) = Confidence(A → B) / Support(B).
* A lift < 1 means there is a negative association between the antecedent and the consequent.
* A lift = 1 means there is no association between the antecedent and the consequent (they co-occur as often as expected if independent).
* A lift > 1 means there is a positive association between the antecedent and the consequent; the larger the value, the stronger the association.
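
The three measures can be computed by hand on a small example. The sketch below is only illustrative: it uses the first five transactions of the dataset loaded later in this notebook, and the helper names `support`, `confidence` and `lift` are hypothetical, not part of any library. Lift is computed here as confidence divided by the support of the consequent.

```python
# Toy data: the first five transactions of the grocery dataset used later on.
transactions = [
    {"MILK", "BREAD", "BISCUIT"},
    {"BREAD", "MILK", "BISCUIT", "CORNFLAKES"},
    {"BREAD", "TEA", "BOURNVITA"},
    {"JAM", "MAGGI", "BREAD", "MILK"},
    {"MAGGI", "TEA", "BISCUIT"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence of the rule divided by the support of the consequent."""
    return confidence(antecedent, consequent, transactions) / support(consequent, transactions)


print(support({"MILK"}, transactions))                 # MILK is in 3 of 5 transactions
print(confidence({"MILK"}, {"BREAD"}, transactions))   # every MILK basket also has BREAD
print(lift({"MILK"}, {"BREAD"}, transactions))         # above 1: positive association
```

In this toy sample BREAD appears in 4 of 5 baskets, so even a confidence of 1.0 gives a modest lift of 1.25.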


Retailers can use lift values when planning store layouts, for example by placing items with a high lift close to each other.

In [1]:
pip install opendatasets

Collecting opendatasets
  Downloading opendatasets-0.1.22-py3-none-any.whl.metadata (9.2 kB)
Downloading opendatasets-0.1.22-py3-none-any.whl (15 kB)
Installing collected packages: opendatasets
Successfully installed opendatasets-0.1.22


**Importing all required libraries and Downloading the data**

In [16]:
import numpy as np
import pandas as pd
import opendatasets as od
import mlxtend
from mlxtend.frequent_patterns import apriori, association_rules
from mlxtend.preprocessing import TransactionEncoder

In [7]:
data = od.download('https://www.kaggle.com/datasets/shazadudwadia/supermarket')

Skipping, found downloaded files in "./supermarket" (use force=True to force download)


In [11]:
df = pd.read_csv("supermarket/GroceryStoreDataSet.csv", names = ['transaction'], sep = ',')
df.head()

Unnamed: 0,transaction
0,"MILK,BREAD,BISCUIT"
1,"BREAD,MILK,BISCUIT,CORNFLAKES"
2,"BREAD,TEA,BOURNVITA"
3,"JAM,MAGGI,BREAD,MILK"
4,"MAGGI,TEA,BISCUIT"


In [12]:
df = list(df["transaction"].apply(lambda x:x.split(",")))

In [13]:
df

[['MILK', 'BREAD', 'BISCUIT'],
 ['BREAD', 'MILK', 'BISCUIT', 'CORNFLAKES'],
 ['BREAD', 'TEA', 'BOURNVITA'],
 ['JAM', 'MAGGI', 'BREAD', 'MILK'],
 ['MAGGI', 'TEA', 'BISCUIT'],
 ['BREAD', 'TEA', 'BOURNVITA'],
 ['MAGGI', 'TEA', 'CORNFLAKES'],
 ['MAGGI', 'BREAD', 'TEA', 'BISCUIT'],
 ['JAM', 'MAGGI', 'BREAD', 'TEA'],
 ['BREAD', 'MILK'],
 ['COFFEE', 'COCK', 'BISCUIT', 'CORNFLAKES'],
 ['COFFEE', 'COCK', 'BISCUIT', 'CORNFLAKES'],
 ['COFFEE', 'SUGER', 'BOURNVITA'],
 ['BREAD', 'COFFEE', 'COCK'],
 ['BREAD', 'SUGER', 'BISCUIT'],
 ['COFFEE', 'SUGER', 'CORNFLAKES'],
 ['BREAD', 'SUGER', 'BOURNVITA'],
 ['BREAD', 'COFFEE', 'SUGER'],
 ['BREAD', 'COFFEE', 'SUGER'],
 ['TEA', 'MILK', 'COFFEE', 'CORNFLAKES']]

**One-Hot Encoding the Transactions**

In [17]:
one_hot_transformer = TransactionEncoder()
df_transform = one_hot_transformer.fit_transform(df)


In [18]:
df_transform

array([[ True, False,  True, False, False, False, False, False,  True,
        False, False],
       [ True, False,  True, False, False,  True, False, False,  True,
        False, False],
       [False,  True,  True, False, False, False, False, False, False,
        False,  True],
       [False, False,  True, False, False, False,  True,  True,  True,
        False, False],
       [ True, False, False, False, False, False, False,  True, False,
        False,  True],
       [False,  True,  True, False, False, False, False, False, False,
        False,  True],
       [False, False, False, False, False,  True, False,  True, False,
        False,  True],
       [ True, False,  True, False, False, False, False,  True, False,
        False,  True],
       [False, False,  True, False, False, False,  True,  True, False,
        False,  True],
       [False, False,  True, False, False, False, False, False,  True,
        False, False],
       [ True, False, False,  True,  True,  True, False, Fal

In [19]:
df = pd.DataFrame(df_transform,columns=one_hot_transformer.columns_)

In [20]:
df

Unnamed: 0,BISCUIT,BOURNVITA,BREAD,COCK,COFFEE,CORNFLAKES,JAM,MAGGI,MILK,SUGER,TEA
0,True,False,True,False,False,False,False,False,True,False,False
1,True,False,True,False,False,True,False,False,True,False,False
2,False,True,True,False,False,False,False,False,False,False,True
3,False,False,True,False,False,False,True,True,True,False,False
4,True,False,False,False,False,False,False,True,False,False,True
5,False,True,True,False,False,False,False,False,False,False,True
6,False,False,False,False,False,True,False,True,False,False,True
7,True,False,True,False,False,False,False,True,False,False,True
8,False,False,True,False,False,False,True,True,False,False,True
9,False,False,True,False,False,False,False,False,True,False,False
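
To make the encoding step concrete, here is a minimal plain-Python sketch of what a one-hot transaction table contains. It is not the `TransactionEncoder` implementation; it only assumes that columns are the sorted distinct items, which matches the column order shown in the table above. Only the first three transactions are used to keep the output short.

```python
# First three transactions from the dataset.
transactions = [
    ["MILK", "BREAD", "BISCUIT"],
    ["BREAD", "MILK", "BISCUIT", "CORNFLAKES"],
    ["BREAD", "TEA", "BOURNVITA"],
]

# One column per distinct item, in sorted order.
columns = sorted({item for t in transactions for item in t})

# One boolean row per transaction: True where the item is in the basket.
rows = [[item in set(t) for item in columns] for t in transactions]

print(columns)
for row in rows:
    print(row)
```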


**Interpretation**: A support value of 0.35 for BISCUIT means that 35% of all transactions in the dataset (7 of 20) include biscuits.
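
The support of a single item can be checked by counting directly. The sketch below re-types the 20 transactions from the list shown earlier as a string (the name `RAW` is just a placeholder for illustration) and counts how many baskets contain each item.

```python
from collections import Counter

# The 20 transactions, one per line, as listed earlier in the notebook.
RAW = """MILK,BREAD,BISCUIT
BREAD,MILK,BISCUIT,CORNFLAKES
BREAD,TEA,BOURNVITA
JAM,MAGGI,BREAD,MILK
MAGGI,TEA,BISCUIT
BREAD,TEA,BOURNVITA
MAGGI,TEA,CORNFLAKES
MAGGI,BREAD,TEA,BISCUIT
JAM,MAGGI,BREAD,TEA
BREAD,MILK
COFFEE,COCK,BISCUIT,CORNFLAKES
COFFEE,COCK,BISCUIT,CORNFLAKES
COFFEE,SUGER,BOURNVITA
BREAD,COFFEE,COCK
BREAD,SUGER,BISCUIT
COFFEE,SUGER,CORNFLAKES
BREAD,SUGER,BOURNVITA
BREAD,COFFEE,SUGER
BREAD,COFFEE,SUGER
TEA,MILK,COFFEE,CORNFLAKES"""

transactions = [line.split(",") for line in RAW.strip().splitlines()]
n = len(transactions)

# Count each item at most once per transaction.
counts = Counter(item for t in transactions for item in set(t))
item_support = {item: c / n for item, c in counts.items()}

print(counts["BISCUIT"], "of", n, "transactions")
print(item_support["BISCUIT"])
```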

In [21]:
#Find frequent itemsets having at least 5% support
frequent_items = apriori(df,min_support=0.05,use_colnames=True)
frequent_items

Unnamed: 0,support,itemsets
0,0.35,(BISCUIT)
1,0.20,(BOURNVITA)
2,0.65,(BREAD)
3,0.15,(COCK)
4,0.40,(COFFEE)
...,...,...
78,0.05,"(BREAD, MAGGI, BISCUIT, TEA)"
79,0.10,"(COCK, COFFEE, BISCUIT, CORNFLAKES)"
80,0.05,"(JAM, BREAD, MILK, MAGGI)"
81,0.05,"(JAM, BREAD, MAGGI, TEA)"
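
For intuition about how frequent itemsets are found, the following is a simplified level-wise (Apriori-style) search in plain Python. It is a sketch, not the `mlxtend` implementation: the function name `apriori_sketch` and the `RAW` string are hypothetical, and no attention is paid to efficiency. It relies on the property that every subset of a frequent itemset must itself be frequent.

```python
from itertools import combinations

RAW = """MILK,BREAD,BISCUIT
BREAD,MILK,BISCUIT,CORNFLAKES
BREAD,TEA,BOURNVITA
JAM,MAGGI,BREAD,MILK
MAGGI,TEA,BISCUIT
BREAD,TEA,BOURNVITA
MAGGI,TEA,CORNFLAKES
MAGGI,BREAD,TEA,BISCUIT
JAM,MAGGI,BREAD,TEA
BREAD,MILK
COFFEE,COCK,BISCUIT,CORNFLAKES
COFFEE,COCK,BISCUIT,CORNFLAKES
COFFEE,SUGER,BOURNVITA
BREAD,COFFEE,COCK
BREAD,SUGER,BISCUIT
COFFEE,SUGER,CORNFLAKES
BREAD,SUGER,BOURNVITA
BREAD,COFFEE,SUGER
BREAD,COFFEE,SUGER
TEA,MILK,COFFEE,CORNFLAKES"""


def apriori_sketch(transactions, min_support):
    """Return {frozenset: support} for every itemset meeting min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items.
    items = {item for b in baskets for item in b}
    current = {frozenset([i]) for i in items if support(frozenset([i])) >= min_support}

    frequent = {}
    k = 1
    while current:
        for s in current:
            frequent[s] = support(s)
        k += 1
        # Join: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: drop candidates that have an infrequent (k-1)-subset.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Keep only candidates that meet the support threshold.
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


transactions = [line.split(",") for line in RAW.strip().splitlines()]
frequent = apriori_sketch(transactions, min_support=0.05)

print(frequent[frozenset({"BISCUIT"})])
print(frequent[frozenset({"COCK", "COFFEE", "BISCUIT", "CORNFLAKES"})])
```

With 20 transactions, a 5% threshold corresponds to a single transaction, so every combination of items that appears in at least one basket counts as frequent. A higher threshold would prune the search much earlier.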


In [22]:
#Build rules with minimum lift of 1
rules=association_rules(frequent_items,metric="lift",min_threshold=1)
rules.shape

(262, 10)

**Displaying the Rules Formulated**

In [23]:
#Display rules
rules

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
0,(COCK),(BISCUIT),0.15,0.35,0.10,0.666667,1.904762,0.0475,1.950000,0.558824
1,(BISCUIT),(COCK),0.35,0.15,0.10,0.285714,1.904762,0.0475,1.190000,0.730769
2,(CORNFLAKES),(BISCUIT),0.30,0.35,0.15,0.500000,1.428571,0.0450,1.300000,0.428571
3,(BISCUIT),(CORNFLAKES),0.35,0.30,0.15,0.428571,1.428571,0.0450,1.225000,0.461538
4,(MAGGI),(BISCUIT),0.25,0.35,0.10,0.400000,1.142857,0.0125,1.083333,0.166667
...,...,...,...,...,...,...,...,...,...,...
257,"(COFFEE, TEA)","(CORNFLAKES, MILK)",0.05,0.10,0.05,1.000000,10.000000,0.0450,inf,0.947368
258,(CORNFLAKES),"(MILK, COFFEE, TEA)",0.30,0.05,0.05,0.166667,3.333333,0.0350,1.140000,1.000000
259,(MILK),"(CORNFLAKES, COFFEE, TEA)",0.25,0.05,0.05,0.200000,4.000000,0.0375,1.187500,1.000000
260,(COFFEE),"(CORNFLAKES, MILK, TEA)",0.40,0.05,0.05,0.125000,2.500000,0.0300,1.085714,1.000000
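
The rules table also reports leverage, conviction and Zhang's metric, which are not defined above. The sketch below computes them from the three support values of a rule, assuming the commonly used definitions; the helper `rule_metrics` is hypothetical. Under these assumptions the results agree with the values displayed for row 0 (COCK → BISCUIT).

```python
import math


def rule_metrics(support_a, support_b, support_ab):
    """Metrics for the rule A -> B from its three support values."""
    confidence = support_ab / support_a
    lift = confidence / support_b
    # Leverage: observed joint support minus the support expected under independence.
    leverage = support_ab - support_a * support_b
    # Conviction: infinite when the rule is never violated (confidence == 1).
    conviction = math.inf if confidence == 1 else (1 - support_b) / (1 - confidence)
    # Zhang's metric: leverage scaled to the range [-1, 1].
    denom = max(support_ab * (1 - support_a), support_a * (support_b - support_ab))
    zhang = leverage / denom if denom else 0.0
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
        "zhangs_metric": zhang,
    }


# Row 0 of the table: COCK -> BISCUIT
metrics = rule_metrics(support_a=0.15, support_b=0.35, support_ab=0.10)
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```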


In [25]:
#Select rules with lift >= 2, which indicates a strong positive association
rules[(rules['lift']>=2)]

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,zhangs_metric
20,(COCK),(COFFEE),0.15,0.40,0.15,1.000000,2.500000,0.0900,inf,0.705882
21,(COFFEE),(COCK),0.40,0.15,0.15,0.375000,2.500000,0.0900,1.360000,1.000000
22,(COCK),(CORNFLAKES),0.15,0.30,0.10,0.666667,2.222222,0.0550,2.100000,0.647059
23,(CORNFLAKES),(COCK),0.30,0.15,0.10,0.333333,2.222222,0.0550,1.275000,0.785714
30,(JAM),(MAGGI),0.10,0.25,0.10,1.000000,4.000000,0.0750,inf,0.833333
...,...,...,...,...,...,...,...,...,...,...
257,"(COFFEE, TEA)","(CORNFLAKES, MILK)",0.05,0.10,0.05,1.000000,10.000000,0.0450,inf,0.947368
258,(CORNFLAKES),"(MILK, COFFEE, TEA)",0.30,0.05,0.05,0.166667,3.333333,0.0350,1.140000,1.000000
259,(MILK),"(CORNFLAKES, COFFEE, TEA)",0.25,0.05,0.05,0.200000,4.000000,0.0375,1.187500,1.000000
260,(COFFEE),"(CORNFLAKES, MILK, TEA)",0.40,0.05,0.05,0.125000,2.500000,0.0300,1.085714,1.000000


**Interpretation for row 30:**
* Jam has an antecedent support of 0.10, which means 10% of all transactions in the dataset include Jam.

* Maggi has a consequent support of 0.25, which means 25% of all transactions in the dataset include Maggi.

* The overall support is 0.10, which means 10% of all transactions in the dataset include both Jam and Maggi.

* A confidence of 1.0 means that every transaction containing Jam in this dataset also contains Maggi.

* A lift of 4.0 shows a strong positive association between Jam and Maggi: customers who buy Jam are four times as likely to buy Maggi as customers in general. A retail shop could place the two items close together to increase sales.
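
The figures for this rule can be reproduced from raw counts. The sketch below (with `RAW` and `rule_from_counts` as illustrative, hypothetical names) also evaluates the reverse rule, Maggi → Jam, to show that confidence depends on the direction of the rule while lift does not.

```python
RAW = """MILK,BREAD,BISCUIT
BREAD,MILK,BISCUIT,CORNFLAKES
BREAD,TEA,BOURNVITA
JAM,MAGGI,BREAD,MILK
MAGGI,TEA,BISCUIT
BREAD,TEA,BOURNVITA
MAGGI,TEA,CORNFLAKES
MAGGI,BREAD,TEA,BISCUIT
JAM,MAGGI,BREAD,TEA
BREAD,MILK
COFFEE,COCK,BISCUIT,CORNFLAKES
COFFEE,COCK,BISCUIT,CORNFLAKES
COFFEE,SUGER,BOURNVITA
BREAD,COFFEE,COCK
BREAD,SUGER,BISCUIT
COFFEE,SUGER,CORNFLAKES
BREAD,SUGER,BOURNVITA
BREAD,COFFEE,SUGER
BREAD,COFFEE,SUGER
TEA,MILK,COFFEE,CORNFLAKES"""

baskets = [set(line.split(",")) for line in RAW.strip().splitlines()]
n = len(baskets)


def rule_from_counts(antecedent, consequent):
    """Return (confidence, lift) for antecedent -> consequent using raw counts."""
    n_a = sum(antecedent in b for b in baskets)
    n_b = sum(consequent in b for b in baskets)
    n_ab = sum(antecedent in b and consequent in b for b in baskets)
    confidence = n_ab / n_a
    lift = (n * n_ab) / (n_a * n_b)  # same as confidence / support(consequent)
    return confidence, lift


jam_to_maggi = rule_from_counts("JAM", "MAGGI")
maggi_to_jam = rule_from_counts("MAGGI", "JAM")

print("JAM -> MAGGI:", jam_to_maggi)
print("MAGGI -> JAM:", maggi_to_jam)
```

Jam appears in only 2 of the 20 transactions, so this rule rests on very few observations and should be read with caution.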