# Market Basket Analysis

### Project Objective: Apply association rule mining (Market Basket Analysis) to discover insightful relationships between products purchased together.

## Importing libraries

In [80]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

## Load dataset

In [82]:
df = pd.read_excel('Online retail.xlsx')

## EDA

In [84]:
df.head(10)

Unnamed: 0,"shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil"
0,"burgers,meatballs,eggs"
1,chutney
2,"turkey,avocado"
3,"mineral water,milk,energy bar,whole wheat rice..."
4,low fat yogurt
5,"whole wheat pasta,french fries"
6,"soup,light cream,shallot"
7,"frozen vegetables,spaghetti,green tea"
8,french fries
9,"eggs,pet food"


In [85]:
df.shape

(7500, 1)

In [86]:
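# The file has no header row, so read_excel used the first transaction as the column name (that transaction is therefore not among the 7500 rows).
# Renaming the column header to 'Description'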

In [87]:
df.rename(columns={"shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil":"Description"},inplace=True)
df.head()

Unnamed: 0,Description
0,"burgers,meatballs,eggs"
1,chutney
2,"turkey,avocado"
3,"mineral water,milk,energy bar,whole wheat rice..."
4,low fat yogurt
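
The first transaction ended up as the column header because the file has no header row. A minimal sketch of how to keep it as data, assuming a hypothetical three-line input read with `pd.read_csv` from memory (`pd.read_excel` accepts the same `header` and `names` parameters):

```python
import io
import pandas as pd

# Hypothetical three-line file with no header row; read as tab-separated
# so that each comma-separated basket stays in a single column.
raw = "shrimp,almonds\nburgers,meatballs,eggs\nchutney\n"

# Default behaviour: the first line is consumed as the column name.
with_default_header = pd.read_csv(io.StringIO(raw), sep="\t")

# header=None keeps every line as data; names= supplies the column label.
with_no_header = pd.read_csv(
    io.StringIO(raw), sep="\t", header=None, names=["Description"]
)

print(len(with_default_header))  # first basket became the header
print(len(with_no_header))       # all baskets kept as rows
```

With this approach the rename step is not needed, and the row count would include the first basket.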


In [88]:
df.isna().sum()

Description    0
dtype: int64

In [89]:
df.dtypes

Description    object
dtype: object

In [90]:
df.duplicated().sum()

2325

In [91]:
# The output above shows 2325 duplicate records in the dataset. We keep all of them.

# Reasoning:

# 1. Support calculation: The Apriori algorithm relies on support, the proportion of transactions that contain a particular
#    itemset. Duplicate transactions contribute to that count, so removing them would distort the support of
#    frequently purchased itemsets and could cause important association rules to be missed.

# 2. Real-world reflection: Identical baskets usually reflect real customer behavior (different customers buying the same items).
#    Removing them would misrepresent the actual frequency of item combinations and could lead to inaccurate insights.

# 3. Association rule mining: Association rules are derived from frequent itemsets, so accurate itemset frequencies
#    lead to more reliable and meaningful rules.
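
The effect of duplicates on support can be illustrated with a small sketch. The baskets and the `support` helper below are hypothetical and only serve to show how the proportion changes when identical transactions are dropped:

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)

# Hypothetical baskets; the first one occurs three times.
transactions = [
    ("eggs", "milk"),
    ("eggs", "milk"),
    ("eggs", "milk"),
    ("chicken",),
    ("green tea",),
]

# Order-preserving removal of identical baskets.
deduplicated = list(dict.fromkeys(transactions))

support_all = support(transactions, {"eggs", "milk"})      # 3 of 5 baskets
support_unique = support(deduplicated, {"eggs", "milk"})   # 1 of 3 baskets
print(support_all, support_unique)
```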

## Prepare data to feed into Apriori algorithm

In [361]:
# Each record holds a comma-separated combination of items, so the items must be split for further analysis.
# We create a new column ('items') that holds the items of each 'Description' entry as a list.

In [104]:
df['items'] = df['Description'].str.split(',')

In [106]:
df

Unnamed: 0,Description,items
0,"burgers,meatballs,eggs","[burgers, meatballs, eggs]"
1,chutney,[chutney]
2,"turkey,avocado","[turkey, avocado]"
3,"mineral water,milk,energy bar,whole wheat rice...","[mineral water, milk, energy bar, whole wheat ..."
4,low fat yogurt,[low fat yogurt]
...,...,...
7495,"butter,light mayo,fresh bread","[butter, light mayo, fresh bread]"
7496,"burgers,frozen vegetables,eggs,french fries,ma...","[burgers, frozen vegetables, eggs, french frie..."
7497,chicken,[chicken]
7498,"escalope,green tea","[escalope, green tea]"


In [144]:
# Apply MultiLabelBinarizer to turn the lists in the 'items' column into one binary column per unique item.

In [146]:
# MultiLabelBinarizer is a preprocessing tool in scikit-learn used to transform multi-label data into a binary matrix format 

# MultiLabelBinarizer takes a list of lists or a list of sets, where each inner list or set represents the labels assigned to a data point. 
# It then creates a binary matrix where:

# Rows: Represent data points.
# Columns: Represent unique labels.
# Values: Indicate the presence (1) or absence (0) of each label for each data point.
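
A minimal sketch of this transformation on three hypothetical baskets (the names `baskets`, `mlb_demo` and `demo_df` are illustrative and not part of the notebook):

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical baskets, already split into lists of items.
baskets = [["eggs", "milk"], ["milk"], ["chicken", "eggs"]]

mlb_demo = MultiLabelBinarizer()
matrix = mlb_demo.fit_transform(baskets)  # shape: (n_baskets, n_unique_items)

# classes_ holds the column labels in sorted order.
demo_df = pd.DataFrame(matrix, columns=mlb_demo.classes_)
print(demo_df)
```

Each row of `demo_df` corresponds to one basket and each column to one unique item.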

In [112]:
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
onehot_encoded = mlb.fit_transform(df['items'])
onehot_df = pd.DataFrame(onehot_encoded, columns=mlb.classes_)
onehot_df

Unnamed: 0,asparagus,almonds,antioxydant juice,asparagus.1,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,1,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7495,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7496,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7497,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7498,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [130]:
mlb.classes_

array([' asparagus', 'almonds', 'antioxydant juice', 'asparagus',
       'avocado', 'babies food', 'bacon', 'barbecue sauce', 'black tea',
       'blueberries', 'body spray', 'bramble', 'brownies', 'bug spray',
       'burger sauce', 'burgers', 'butter', 'cake', 'candy bars',
       'carrots', 'cauliflower', 'cereals', 'champagne', 'chicken',
       'chili', 'chocolate', 'chocolate bread', 'chutney', 'cider',
       'clothes accessories', 'cookies', 'cooking oil', 'corn',
       'cottage cheese', 'cream', 'dessert wine', 'eggplant', 'eggs',
       'energy bar', 'energy drink', 'escalope', 'extra dark chocolate',
       'flax seed', 'french fries', 'french wine', 'fresh bread',
       'fresh tuna', 'fromage blanc', 'frozen smoothie',
       'frozen vegetables', 'gluten free bar', 'grated cheese',
       'green beans', 'green grapes', 'green tea', 'ground beef', 'gums',
       'ham', 'hand protein bar', 'herb & pepper', 'honey', 'hot dogs',
       'ketchup', 'light cream', 'light mayo', 

In [None]:
# Concatenate both Dataframes

In [114]:
df = pd.concat([df, onehot_df], axis=1)
df

Unnamed: 0,Description,items,asparagus,almonds,antioxydant juice,asparagus.1,avocado,babies food,bacon,barbecue sauce,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
0,"burgers,meatballs,eggs","[burgers, meatballs, eggs]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,chutney,[chutney],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"turkey,avocado","[turkey, avocado]",0,0,0,0,1,0,0,0,...,1,0,0,0,0,0,0,0,0,0
3,"mineral water,milk,energy bar,whole wheat rice...","[mineral water, milk, energy bar, whole wheat ...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,low fat yogurt,[low fat yogurt],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7495,"butter,light mayo,fresh bread","[butter, light mayo, fresh bread]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7496,"burgers,frozen vegetables,eggs,french fries,ma...","[burgers, frozen vegetables, eggs, french frie...",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7497,chicken,[chicken],0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7498,"escalope,green tea","[escalope, green tea]",0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [138]:
pivot_data = pd.pivot_table(data=df, index='Description',values=mlb.classes_)
pivot_data

Description," asparagus",almonds,antioxydant juice,asparagus,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
almonds,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"almonds,cake,low fat yogurt",0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"almonds,cookies",0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"almonds,eggs",0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"almonds,eggs,cookies",0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
"yogurt cake,candy bars",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
"yogurt cake,energy drink",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
"yogurt cake,honey",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
"yogurt cake,low fat yogurt",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


In [140]:
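# Explanation of the above step:

# index='Description': sets the 'Description' column as the index of the pivot table. Each unique combination of items in 'Description'
# becomes one row of the pivot table.

# values=mlb.classes_: mlb.classes_ contains the unique items extracted by the MultiLabelBinarizer, so the pivot table
# uses the one-hot encoded columns for these items.

# Note: pivot_table aggregates with the mean by default, so identical baskets are collapsed into a single row (hence the 0.0/1.0 float values).
# Support values computed from pivot_data are therefore proportions of unique baskets, not of all 7500 transactions,
# which works against the earlier decision to keep duplicate records.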
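
The effect of grouping on 'Description' can be checked on a small hypothetical frame. The sketch below also shows one possible alternative, which is an assumption and not the notebook's method: keep one row per transaction by selecting the one-hot columns directly and casting them to bool.

```python
import pandas as pd

# Hypothetical one-hot data: the basket "eggs,milk" occurs twice.
demo = pd.DataFrame({
    "Description": ["eggs,milk", "eggs,milk", "chicken"],
    "chicken": [0, 0, 1],
    "eggs": [1, 1, 0],
    "milk": [1, 1, 0],
})
items = ["chicken", "eggs", "milk"]

# pivot_table groups by 'Description' and averages: one row per unique basket.
pivoted = pd.pivot_table(data=demo, index="Description", values=items)

# Alternative: one row per transaction, boolean values.
per_transaction = demo[items].astype(bool)

print(len(pivoted), len(per_transaction))
print(pivoted["eggs"].mean(), per_transaction["eggs"].mean())
```

In this toy case the share of rows containing eggs differs between the two frames, because the repeated basket counts once in `pivoted` and twice in `per_transaction`.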

In [153]:
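# ' asparagus' (with a leading space) and 'asparagus' appear as two separate columns because some entries in the source data carry a leading space.
# The line below drops the plain 'asparagus' column and keeps ' asparagus'; stripping whitespace before encoding would merge the two instead of discarding one.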
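
A minimal sketch, on hypothetical baskets, of stripping whitespace right after splitting so that ' asparagus' and 'asparagus' map to a single label and no column has to be dropped:

```python
import pandas as pd

# Hypothetical baskets; the second has a space after the comma.
descriptions = pd.Series(["asparagus,eggs", "milk, asparagus"])

# A plain split keeps the leading space, producing two distinct labels.
raw_items = descriptions.str.split(",")

# Stripping each item merges ' asparagus' and 'asparagus' into one label.
clean_items = raw_items.apply(lambda items: [item.strip() for item in items])

raw_labels = sorted({item for items in raw_items for item in items})
clean_labels = sorted({item for items in clean_items for item in items})
print(raw_labels)
print(clean_labels)
```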

In [151]:
pivot_data.drop(columns=['asparagus'],inplace=True)
pivot_data

Description," asparagus",almonds,antioxydant juice,avocado,babies food,bacon,barbecue sauce,black tea,blueberries,body spray,...,turkey,vegetables mix,water spray,white wine,whole weat flour,whole wheat pasta,whole wheat rice,yams,yogurt cake,zucchini
almonds,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"almonds,cake,low fat yogurt",0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"almonds,cookies",0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"almonds,eggs",0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"almonds,eggs,cookies",0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
"yogurt cake,candy bars",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
"yogurt cake,energy drink",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
"yogurt cake,honey",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
"yogurt cake,low fat yogurt",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0


## Model Building with Apriori

In [211]:
from mlxtend.frequent_patterns import apriori, association_rules

In [213]:
# Since we focus on commonly purchased items, a relatively high minimum support is appropriate and yields insights relevant to our business objectives.

In [243]:
freq_item_sets = apriori(df=pivot_data, min_support=0.05, use_colnames=True)  # minimum support of 5% (0.05)
freq_item_sets

Unnamed: 0,support,itemsets
0,0.113816,(burgers)
1,0.103575,(cake)
2,0.054879,(champagne)
3,0.083865,(chicken)
4,0.205217,(chocolate)
5,0.060676,(cookies)
6,0.071884,(cooking oil)
7,0.208116,(eggs)
8,0.083865,(escalope)
9,0.192657,(french fries)
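
What the `min_support` threshold does can be sketched in plain Python. This is a brute-force illustration on hypothetical baskets, not mlxtend's implementation: a real Apriori run prunes candidates using the fact that every subset of a frequent itemset must itself be frequent, whereas this sketch simply counts every candidate.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets up to max_size meeting min_support."""
    n = len(transactions)
    baskets = [frozenset(t) for t in transactions]
    items = sorted(set().union(*baskets))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            candidate = frozenset(candidate)
            # Count the baskets that contain every item of the candidate.
            count = sum(1 for b in baskets if candidate <= b)
            if count / n >= min_support:
                result[candidate] = count / n
    return result

# Hypothetical baskets.
transactions = [
    {"eggs", "milk"},
    {"eggs", "milk", "chicken"},
    {"eggs"},
    {"green tea"},
]
freq = frequent_itemsets(transactions, min_support=0.5)
print(freq)
```

Raising `min_support` shrinks the result to the most common itemsets; lowering it admits rarer combinations at the cost of more candidates.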


In [None]:
# Setting association rules based on 'Confidence' metric

In [348]:
best_associations = association_rules(df=freq_item_sets, metric='confidence', min_threshold=0.20, num_itemsets=1)  # keep rules with a confidence of at least 20%
best_associations.sort_values('confidence',ascending=False,inplace=True)

In [350]:
best_associations

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
9,(ground beef),(mineral water),0.135845,0.29971,0.058744,0.432432,1.442835,1.0,0.01803,1.233844,0.355168,0.155897,0.189525,0.314218
11,(ground beef),(spaghetti),0.135845,0.229565,0.055845,0.411095,1.790756,1.0,0.02466,1.30825,0.510993,0.1804,0.23562,0.327181
13,(milk),(mineral water),0.170048,0.29971,0.067826,0.398864,1.330831,1.0,0.016861,1.164943,0.299523,0.16875,0.141589,0.312585
8,(frozen vegetables),(mineral water),0.129855,0.29971,0.050435,0.388393,1.295895,1.0,0.011516,1.144999,0.262407,0.133028,0.126637,0.278336
16,(spaghetti),(mineral water),0.229565,0.29971,0.085024,0.37037,1.235762,1.0,0.016221,1.112225,0.24763,0.191388,0.100901,0.327029
0,(chocolate),(mineral water),0.205217,0.29971,0.073237,0.356874,1.19073,1.0,0.011731,1.088884,0.201538,0.169651,0.081629,0.300616
5,(eggs),(mineral water),0.208116,0.29971,0.070145,0.337047,1.124578,1.0,0.00777,1.05632,0.139891,0.160265,0.053317,0.285545
15,(milk),(spaghetti),0.170048,0.229565,0.050048,0.294318,1.282068,1.0,0.011011,1.091759,0.265088,0.143173,0.084047,0.256166
17,(mineral water),(spaghetti),0.29971,0.229565,0.085024,0.283688,1.235762,1.0,0.016221,1.075557,0.272434,0.191388,0.07025,0.327029
2,(chocolate),(spaghetti),0.205217,0.229565,0.055845,0.272128,1.185406,1.0,0.008735,1.058476,0.196793,0.147374,0.055245,0.257697


In [367]:
# Filtering the rules by lift > 1.2, support > 0.05 and confidence > 0.3, then ranking them by lift

In [365]:
best_associations[(best_associations.lift > 1.2) & (best_associations.support > 0.05) & (best_associations.confidence > 0.3)].sort_values(by=['lift'], ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,representativity,leverage,conviction,zhangs_metric,jaccard,certainty,kulczynski
11,(ground beef),(spaghetti),0.135845,0.229565,0.055845,0.411095,1.790756,1.0,0.02466,1.30825,0.510993,0.1804,0.23562,0.327181
9,(ground beef),(mineral water),0.135845,0.29971,0.058744,0.432432,1.442835,1.0,0.01803,1.233844,0.355168,0.155897,0.189525,0.314218
13,(milk),(mineral water),0.170048,0.29971,0.067826,0.398864,1.330831,1.0,0.016861,1.164943,0.299523,0.16875,0.141589,0.312585
8,(frozen vegetables),(mineral water),0.129855,0.29971,0.050435,0.388393,1.295895,1.0,0.011516,1.144999,0.262407,0.133028,0.126637,0.278336
16,(spaghetti),(mineral water),0.229565,0.29971,0.085024,0.37037,1.235762,1.0,0.016221,1.112225,0.24763,0.191388,0.100901,0.327029
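
The main metrics in the table follow from three support values. As a worked check, the sketch below recomputes confidence, lift, leverage and conviction for the (ground beef) -> (spaghetti) row using the standard definitions, with the supports copied from the output above; small differences from the table are due to the rounded inputs.

```python
# Values copied from the (ground beef) -> (spaghetti) row above.
antecedent_support = 0.135845   # support(ground beef)
consequent_support = 0.229565   # support(spaghetti)
joint_support = 0.055845        # support(ground beef and spaghetti)

# confidence: share of ground beef baskets that also contain spaghetti.
confidence = joint_support / antecedent_support

# lift: confidence relative to the baseline frequency of spaghetti.
lift = confidence / consequent_support

# leverage: observed joint support minus the support expected under independence.
leverage = joint_support - antecedent_support * consequent_support

# conviction: how much more often the rule would fail if the items were independent.
conviction = (1 - consequent_support) / (1 - confidence)

print(confidence, lift, leverage, conviction)
```

A lift above 1 means the two items appear together more often than they would if purchased independently.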


## Analysis and Interpretations

### The rules were filtered with lift > 1.2, support > 0.05 and confidence > 0.3 to surface meaningful patterns in customers' purchasing behavior.

### The rules are ranked below in descending order of lift, since the project objective is to find the strongest associations between products.

### Ranked rules:

##### Rule 1: If a customer purchases ground beef, they also tend to purchase spaghetti (support: 0.0558, confidence: 0.4111, lift: 1.7908)
##### Rule 2: If a customer purchases ground beef, they also tend to purchase mineral water (support: 0.0587, confidence: 0.4324, lift: 1.4428)
##### Rule 3: If a customer purchases milk, they also tend to purchase mineral water (support: 0.0678, confidence: 0.3989, lift: 1.3308)
##### Rule 4: If a customer purchases frozen vegetables, they also tend to purchase mineral water (support: 0.0504, confidence: 0.3884, lift: 1.2959)
##### Rule 5: If a customer purchases spaghetti, they also tend to purchase mineral water (support: 0.0850, confidence: 0.3704, lift: 1.2358)

### Based on the above association rules we can infer that:

##### 1. Ground beef and spaghetti show the strongest association (lift 1.79), indicating a potential meal combination preference among customers.
##### 2. Ground beef is also often purchased with mineral water (lift 1.44), suggesting a preference for a beverage alongside this protein source.
##### 3. Milk, frozen vegetables, and spaghetti have weaker associations with mineral water (lift between 1.24 and 1.33). Mineral water has the highest support among the items shown (0.2997), so these rules suggest it is a general beverage choice rather than one tied to a specific product.

### These insights can be used for various business applications, such as:

##### 1. Product Placement: Consider placing ground beef, spaghetti, and mineral water near each other to encourage joint purchases.
##### 2. Promotions: Bundling ground beef, spaghetti and mineral water, with a discount on mineral water when the bundle is purchased, could increase sales of these products.
##### 3. Recommendations: Recommending mineral water to customers with ground beef, milk, frozen vegetables or spaghetti in their cart follows directly from the rules. The reverse direction (for example, recommending frozen vegetables to mineral water buyers) has a confidence below the 0.3 threshold and is not supported by the filtered rules.
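
The recommendation idea can be sketched as a simple lookup over the five ranked rules. The `RULES` list and the `recommend` helper are hypothetical illustrations built from the lift values reported above, not part of the notebook or of any library:

```python
# The five ranked rules from the analysis, as (antecedent, consequent, lift).
RULES = [
    ("ground beef", "spaghetti", 1.790756),
    ("ground beef", "mineral water", 1.442835),
    ("milk", "mineral water", 1.330831),
    ("frozen vegetables", "mineral water", 1.295895),
    ("spaghetti", "mineral water", 1.235762),
]

def recommend(cart):
    """Suggest consequents of rules triggered by the cart, strongest lift first."""
    best = {}
    for antecedent, consequent, lift in RULES:
        # A rule fires when its antecedent is in the cart and its
        # consequent has not been added yet.
        if antecedent in cart and consequent not in cart:
            best[consequent] = max(lift, best.get(consequent, 0.0))
    return sorted(best, key=best.get, reverse=True)

print(recommend({"ground beef"}))
print(recommend({"milk", "frozen vegetables"}))
```

Ranking by lift rather than confidence keeps the very common mineral water from dominating every suggestion list.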