# Market Basket Optimisation

## 1. Overview

This notebook extracts frequent itemsets with the Apriori algorithm (the `apriori` function from mlxtend) and uses them for association rule mining. The dataset comes from a mall store and contains roughly 7,500 transactions, each listing the items one customer bought together. The goal is to find associations between the items in the store, so that if a customer buys, say, apple, banana and mango, we can tell which item they are likely to be interested in next.

Apriori is a popular algorithm for extracting frequent itemsets with applications in association rule learning. The apriori algorithm has been designed to operate on databases containing transactions, such as purchases by customers of a store. An itemset is considered as "frequent" if it meets a user-specified support threshold. For instance, if the support threshold is set to 0.5 (50%), a frequent itemset is defined as a set of items that occur together in at least 50% of all transactions in the database.
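
The support definition above can be illustrated with a few lines of plain Python. This is only a sketch: the toy transactions and the helper name `support` are made up for illustration and are not part of the mall dataset or of mlxtend.

```python
# Toy transactions (hypothetical, not taken from the mall dataset).
transactions = [
    {"apple", "banana", "mango"},
    {"apple", "banana"},
    {"banana", "milk"},
    {"apple", "banana", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


min_support = 0.5  # same threshold as the example in the text

for candidate in [{"banana"}, {"apple", "banana"}, {"banana", "milk"}, {"mango"}]:
    s = support(candidate, transactions)
    label = "frequent" if s >= min_support else "not frequent"
    print(sorted(candidate), s, label)
```

With a threshold of 0.5, only `{"mango"}` (support 0.25) fails to qualify as frequent in this toy example.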

## 2. Essential imports

In [17]:
import numpy as np
import pandas as pd

# for visualizations
import matplotlib.pyplot as plt
import seaborn as sns
plt.style.use('fivethirtyeight')
import plotly.express as px


# for market basket analysis
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

## 3. Importing the Data

In [7]:
data = pd.read_csv('Market_Basket_Optimisation.csv', header = None)

In [8]:
data.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
0,shrimp,almonds,avocado,vegetables mix,green grapes,whole weat flour,yams,cottage cheese,energy drink,tomato juice,low fat yogurt,green tea,honey,salad,mineral water,salmon,antioxydant juice,frozen smoothie,spinach,olive oil
1,burgers,meatballs,eggs,,,,,,,,,,,,,,,,,
2,chutney,,,,,,,,,,,,,,,,,,,
3,turkey,avocado,,,,,,,,,,,,,,,,,,
4,mineral water,milk,energy bar,whole wheat rice,green tea,,,,,,,,,,,,,,,


## 4. Data Description

In [9]:
data.shape

(7501, 20)

In [23]:
data.describe()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19
count,7501,5747,4389,3345,2529,1864,1369,981,654,395,256,154,87,47,25,8,4,4,3,1
unique,115,117,115,114,110,106,102,98,88,80,66,50,43,28,19,8,3,3,3,1
top,mineral water,mineral water,mineral water,mineral water,green tea,french fries,green tea,green tea,green tea,green tea,low fat yogurt,green tea,green tea,green tea,magazines,cake,frozen smoothie,protein bar,mayonnaise,olive oil
freq,577,484,375,201,153,107,96,67,57,31,22,15,8,4,3,1,2,2,1,1


## 5. Data Visualization

In [37]:
# 1. Gather All Items of Every Transaction into a Numpy Array
#    (np.array() Below Turns the NaN Padding into the String "nan")
transaction = []
for i in range(0, data.shape[0]):
    for j in range(0, data.shape[1]):
        transaction.append(data.values[i,j])

transaction = np.array(transaction)

In [38]:
# 2. Transform Them into a Pandas DataFrame
df = pd.DataFrame(transaction, columns=["items"])
df["incident_count"] = 1  # Give Each Occurrence a Count of 1 So That Group By Can Sum Them

In [39]:
# 3. Delete NaN Items from Dataset
indexNames = df[df['items'] == "nan" ].index
df.drop(indexNames , inplace=True)

In [40]:
# 4. Final Step: Make a New Appropriate Pandas DataFrame for Visualizations  
df_table = df.groupby("items").sum().sort_values("incident_count", ascending=False).reset_index()

In [41]:
# 5. Initial Visualizations
df_table.head(10).style.background_gradient(cmap='Blues')

Unnamed: 0,items,incident_count
0,mineral water,1788
1,eggs,1348
2,spaghetti,1306
3,french fries,1282
4,chocolate,1230
5,green tea,991
6,milk,972
7,ground beef,737
8,frozen vegetables,715
9,pancakes,713
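
The same frequency table can probably be obtained more directly by flattening the raw frame and letting pandas count the values. The snippet below is a sketch on a hypothetical three-row frame (`toy`), assuming the same layout as `data`: one row per transaction, padded with NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the raw data: one row per transaction, NaN padding.
toy = pd.DataFrame([
    ["mineral water", "eggs", np.nan],
    ["eggs", np.nan, np.nan],
    ["mineral water", "spaghetti", "eggs"],
])

# Flatten all cells into one Series, drop the NaN padding, count each item.
item_counts = (
    pd.Series(toy.to_numpy().ravel())
    .dropna()
    .value_counts()
    .rename_axis("items")
    .reset_index(name="incident_count")
)
print(item_counts)
```

Because the NaN cells are dropped before counting, no `"nan"` placeholder rows have to be removed afterwards.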


In [42]:
df_table["all"] = "all"  # common root node, so that all items share one origin in the treemap

fig = px.treemap(df_table.head(30), path=['all', "items"], values='incident_count',
                  color=df_table["incident_count"].head(30), hover_data=['items'],
                  color_continuous_scale='Blues',
                  )
fig.show()

Let's check whether any item appears more than once within a single transaction.
- If the answer is "Yes", we need to handle those duplicates, since they might mislead the Apriori algorithm in later steps.

In [43]:
# Transform Every Transaction to a Separate List & Gather Them into a Numpy Array
# By Doing So, We Will Be Able To Iterate Through Array of Transactions

transaction = []
for i in range(data.shape[0]):
    transaction.append([str(data.values[i,j]) for j in range(data.shape[1])])
    
transaction = np.array(transaction)

# Create a DataFrame In Order To Check Status of Top30 Items

top30 = df_table["items"].head(30).values

# For Each Top30 Item, Count How Often It Occurs in Each Transaction (One Value per Row)
df_top30_multiple_record_check = pd.DataFrame(
    {item: (transaction == item).sum(axis=1) for item in top30}
)

df_top30_multiple_record_check.head(10)

Unnamed: 0,mineral water,eggs,spaghetti,french fries,chocolate,green tea,milk,ground beef,frozen vegetables,pancakes,...,chicken,whole wheat rice,grated cheese,cooking oil,soup,herb & pepper,honey,champagne,fresh bread,salmon
0,1,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1
1,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,1,1,0,0,0,...,0,1,0,0,0,0,0,0,0,0
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
8,0,0,1,0,0,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
9,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
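
Inspecting the first rows by eye does not settle the question of duplicates, so a programmatic check may help. The sketch below uses a small hypothetical frame (`counts`) with the same shape as `df_top30_multiple_record_check`: rows are transactions, columns are items, values are per-transaction counts.

```python
import pandas as pd

# Hypothetical per-transaction counts (rows = transactions, columns = items).
counts = pd.DataFrame({
    "mineral water": [1, 0, 2],
    "eggs": [0, 1, 1],
})

# Highest count of each item within a single transaction.
max_per_item = counts.max()

# Items that occur more than once in at least one transaction.
duplicated_items = max_per_item[max_per_item > 1].index.tolist()
has_duplicates = bool((counts > 1).to_numpy().any())

print(max_per_item)
print(duplicated_items, has_duplicates)
```

Applied to the real frame, an empty `duplicated_items` list would mean that no top-30 item is repeated within a transaction.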


In [44]:
df_top30_multiple_record_check

Unnamed: 0,mineral water,eggs,spaghetti,french fries,chocolate,green tea,milk,ground beef,frozen vegetables,pancakes,...,chicken,whole wheat rice,grated cheese,cooking oil,soup,herb & pepper,honey,champagne,fresh bread,salmon
0,1,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,1,0,0,1
1,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,1,0,0,0,0,1,1,0,0,0,...,0,1,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7496,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
7497,0,1,0,1,0,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
7498,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
7499,0,0,0,0,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## 7. Frequent Itemset

In [46]:
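def encode_units(x):
    # Map Any Positive Count to 1 and Everything Else to 0
    return 1 if x >= 1 else 0

# Note: DataFrame.applymap is deprecated since pandas 2.1 in favour of DataFrame.map
basket_sets = df_top30_multiple_record_check.applymap(encode_units)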

In [50]:
frequent_itemsets = apriori(basket_sets, min_support=0.01, use_colnames=True)

In [51]:
frequent_itemsets

Unnamed: 0,support,itemsets
0,0.238368,(mineral water)
1,0.179709,(eggs)
2,0.174110,(spaghetti)
3,0.170911,(french fries)
4,0.163845,(chocolate)
...,...,...
203,0.010932,"(mineral water, ground beef, chocolate)"
204,0.011065,"(mineral water, ground beef, milk)"
205,0.011065,"(frozen vegetables, mineral water, milk)"
206,0.010532,"(eggs, chocolate, spaghetti)"
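
To make the level-wise search behind the `apriori(...)` call more concrete, here is a minimal pure-Python sketch of the idea: frequent itemsets of size k are built only from frequent itemsets of size k-1. It is an illustration under stated assumptions, not mlxtend's implementation; the function name `apriori_sketch` and the toy transactions are hypothetical.

```python
from itertools import combinations


def apriori_sketch(transactions, min_support):
    """Return {frozenset(itemset): support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {}
    for item in items:
        s = support(frozenset([item]))
        if s >= min_support:
            current[frozenset([item])] = s

    frequent = dict(current)
    k = 2
    while current:
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {
            a | b for a, b in combinations(list(current), 2) if len(a | b) == k
        }
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                current[c] = s
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical toy transactions.
toy_transactions = [
    ["mineral water", "eggs", "spaghetti"],
    ["mineral water", "eggs"],
    ["mineral water", "spaghetti"],
    ["eggs", "chocolate"],
]
frequent = apriori_sketch(toy_transactions, min_support=0.5)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

In this toy run the three-item candidate is never counted, because its subset (eggs, spaghetti) is not frequent. This pruning is what keeps the search tractable on larger data.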


In [53]:
frequent_itemsets['length'] = frequent_itemsets['itemsets'].apply(len)
frequent_itemsets

Unnamed: 0,support,itemsets,length
0,0.238368,(mineral water),1
1,0.179709,(eggs),1
2,0.174110,(spaghetti),1
3,0.170911,(french fries),1
4,0.163845,(chocolate),1
...,...,...,...
203,0.010932,"(mineral water, ground beef, chocolate)",3
204,0.011065,"(mineral water, ground beef, milk)",3
205,0.011065,"(frozen vegetables, mineral water, milk)",3
206,0.010532,"(eggs, chocolate, spaghetti)",3


In [55]:
frequent_itemsets[ (frequent_itemsets['length'] == 2) &
                   (frequent_itemsets['support'] >= 0.05) ]

Unnamed: 0,support,itemsets,length
30,0.050927,"(mineral water, eggs)",2
31,0.059725,"(mineral water, spaghetti)",2
33,0.05266,"(mineral water, chocolate)",2


In [56]:
frequent_itemsets[ (frequent_itemsets['length'] == 3) ].head()

Unnamed: 0,support,itemsets,length
191,0.014265,"(mineral water, eggs, spaghetti)",3
192,0.013465,"(mineral water, eggs, chocolate)",3
193,0.013065,"(mineral water, eggs, milk)",3
194,0.010132,"(mineral water, ground beef, eggs)",3
195,0.010132,"(french fries, mineral water, spaghetti)",3


## 8. Association Rules Generation

In [57]:
rules = association_rules(frequent_itemsets, metric="lift", min_threshold=1.2)
rules["antecedents_length"] = rules["antecedents"].apply(len)
rules["consequents_length"] = rules["consequents"].apply(len)
rules.sort_values("lift",ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedents_length,consequents_length
211,(herb & pepper),(ground beef),0.049460,0.098254,0.015998,0.323450,3.291994,0.011138,1.332860,1,1
210,(ground beef),(herb & pepper),0.098254,0.049460,0.015998,0.162822,3.291994,0.011138,1.135410,1,1
284,(ground beef),"(mineral water, spaghetti)",0.098254,0.059725,0.017064,0.173677,2.907928,0.011196,1.137902,1,2
281,"(mineral water, spaghetti)",(ground beef),0.059725,0.098254,0.017064,0.285714,2.907928,0.011196,1.262445,2,1
300,"(mineral water, spaghetti)",(olive oil),0.059725,0.065858,0.010265,0.171875,2.609786,0.006332,1.128021,2,1
...,...,...,...,...,...,...,...,...,...,...,...
55,(low fat yogurt),(eggs),0.076523,0.179709,0.016798,0.219512,1.221484,0.003046,1.050997,1,1
156,(green tea),(shrimp),0.132116,0.071457,0.011465,0.086781,1.214449,0.002025,1.016780,1,1
157,(shrimp),(green tea),0.071457,0.132116,0.011465,0.160448,1.214449,0.002025,1.033747,1,1
114,(french fries),(escalope),0.170911,0.079323,0.016398,0.095944,1.209537,0.002841,1.018385,1,1
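
The metric columns in this table all derive from three support values. As a sanity check, the sketch below recomputes them for the first row, (herb & pepper) -> (ground beef), using the standard definitions of confidence, lift, leverage and conviction. The helper name `rule_metrics` is hypothetical, and small differences from the table are expected because the inputs are rounded.

```python
def rule_metrics(support_a, support_c, support_ac):
    """Rule metrics for A -> C from support(A), support(C) and support(A and C)."""
    confidence = support_ac / support_a
    lift = confidence / support_c
    leverage = support_ac - support_a * support_c
    # Conviction is infinite when the rule always holds (confidence == 1).
    if confidence < 1:
        conviction = (1 - support_c) / (1 - confidence)
    else:
        conviction = float("inf")
    return {
        "confidence": confidence,
        "lift": lift,
        "leverage": leverage,
        "conviction": conviction,
    }


# Supports copied from the (herb & pepper) -> (ground beef) row above.
m = rule_metrics(support_a=0.049460, support_c=0.098254, support_ac=0.015998)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

A lift above 1 means the two sides occur together more often than they would if they were independent, which is why the rules are filtered with `min_threshold=1.2` on lift.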


In [58]:
rules.sort_values("confidence",ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedents_length,consequents_length
260,"(ground beef, eggs)",(mineral water),0.019997,0.238368,0.010132,0.506667,2.125563,0.005365,1.543848,2,1
318,"(ground beef, milk)",(mineral water),0.021997,0.238368,0.011065,0.503030,2.110308,0.005822,1.532552,2,1
312,"(ground beef, chocolate)",(mineral water),0.023064,0.238368,0.010932,0.473988,1.988472,0.005434,1.447937,2,1
323,"(frozen vegetables, milk)",(mineral water),0.023597,0.238368,0.011065,0.468927,1.967236,0.005440,1.434136,2,1
35,(soup),(mineral water),0.050527,0.238368,0.023064,0.456464,1.914955,0.011020,1.401255,1,1
...,...,...,...,...,...,...,...,...,...,...,...
326,(mineral water),"(frozen vegetables, milk)",0.238368,0.023597,0.011065,0.046421,1.967236,0.005440,1.023935,1,2
313,(mineral water),"(ground beef, chocolate)",0.238368,0.023064,0.010932,0.045861,1.988472,0.005434,1.023893,1,2
302,(mineral water),"(olive oil, spaghetti)",0.238368,0.022930,0.010265,0.043065,1.878079,0.004799,1.021041,1,2
261,(mineral water),"(ground beef, eggs)",0.238368,0.019997,0.010132,0.042506,2.125563,0.005365,1.023507,1,2


In [59]:
# antecedents/consequents hold frozensets, so test membership directly
rules[~rules["consequents"].apply(lambda s: "mineral water" in s) &
      ~rules["antecedents"].apply(lambda s: "mineral water" in s)].sort_values("confidence", ascending=False).head(10)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedents_length,consequents_length
68,(ground beef),(spaghetti),0.098254,0.17411,0.039195,0.398915,2.291162,0.022088,1.373997,1,1
82,(olive oil),(spaghetti),0.065858,0.17411,0.02293,0.348178,1.999758,0.011464,1.267048,1,1
334,"(chocolate, milk)",(spaghetti),0.032129,0.17411,0.010932,0.340249,1.954217,0.005338,1.251821,2,1
51,(burgers),(eggs),0.087188,0.179709,0.028796,0.330275,1.83783,0.013128,1.224818,1,1
98,(herb & pepper),(spaghetti),0.04946,0.17411,0.016264,0.328841,1.888695,0.007653,1.230543,1,1
211,(herb & pepper),(ground beef),0.04946,0.098254,0.015998,0.32345,3.291994,0.011138,1.33286,1,1
328,"(eggs, chocolate)",(spaghetti),0.033196,0.17411,0.010532,0.317269,1.822232,0.004752,1.209686,2,1
102,(salmon),(spaghetti),0.042528,0.17411,0.013465,0.316614,1.818472,0.00606,1.208527,1,1
92,(grated cheese),(spaghetti),0.052393,0.17411,0.016531,0.315522,1.812196,0.007409,1.206597,1,1
56,(turkey),(eggs),0.062525,0.179709,0.019464,0.311301,1.732245,0.008228,1.191072,1,1


In [60]:
# Parentheses are required: `&` binds more tightly than `==`
rules[rules["antecedents"].apply(lambda s: "ground beef" in s) & (rules["antecedents_length"] == 1)].sort_values("confidence", ascending=False).head(10)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,antecedents_length,consequents_length
7,(ground beef),(mineral water),0.098254,0.238368,0.040928,0.416554,1.747522,0.017507,1.305401,1,1
68,(ground beef),(spaghetti),0.098254,0.17411,0.039195,0.398915,2.291162,0.022088,1.373997,1,1
120,(ground beef),(chocolate),0.098254,0.163845,0.023064,0.234735,1.432669,0.006965,1.092635,1,1
166,(ground beef),(milk),0.098254,0.129583,0.021997,0.223881,1.727704,0.009265,1.121499,1,1
284,(ground beef),"(mineral water, spaghetti)",0.098254,0.059725,0.017064,0.173677,2.907928,0.011196,1.137902,1,2
197,(ground beef),(frozen vegetables),0.098254,0.095321,0.016931,0.17232,1.807796,0.007565,1.093031,1,1
210,(ground beef),(herb & pepper),0.098254,0.04946,0.015998,0.162822,3.291994,0.011138,1.13541,1,1
198,(ground beef),(pancakes),0.098254,0.095054,0.014531,0.147897,1.555925,0.005192,1.062015,1,1
207,(ground beef),(olive oil),0.098254,0.065858,0.014131,0.143826,2.183889,0.007661,1.091066,1,1
200,(ground beef),(burgers),0.098254,0.087188,0.011998,0.122117,1.400607,0.003432,1.039787,1,1
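
When a membership mask is combined with a comparison, as in the filter above, operator precedence matters: in Python `&` binds more tightly than `==`. The sketch below shows the difference on a hypothetical four-row rules table (`toy_rules`). The unparenthesised behaviour reflects how pandas currently evaluates a boolean mask `&` an integer column, which is an assumption worth re-checking on your pandas version.

```python
import pandas as pd

# Hypothetical rules table with only the columns needed here.
toy_rules = pd.DataFrame({
    "antecedents": [
        frozenset({"ground beef"}),
        frozenset({"ground beef", "eggs"}),
        frozenset({"ground beef", "eggs", "milk"}),
        frozenset({"milk"}),
    ],
})
toy_rules["antecedents_length"] = toy_rules["antecedents"].apply(len)

has_beef = toy_rules["antecedents"].apply(lambda s: "ground beef" in s)

# Without parentheses this is evaluated as (has_beef & length) == 1,
# a bitwise AND that also lets the length-3 antecedent through.
unparenthesised = has_beef & toy_rules["antecedents_length"] == 1

# With parentheses the comparison is evaluated first, as intended.
parenthesised = has_beef & (toy_rules["antecedents_length"] == 1)

print(pd.DataFrame({
    "length": toy_rules["antecedents_length"],
    "unparenthesised": unparenthesised,
    "parenthesised": parenthesised,
}))
```

With antecedents of length 1 or 2 only, both versions happen to select the same rows, which is why such a slip can go unnoticed.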


Several rules combine a lift well above 1 with a confidence of 30% or more, for example (herb & pepper) -> (ground beef) with a lift of about 3.29. We are on the right track!

## Results

As the investigations above show, the algorithm and the mlxtend library are flexible, so we can easily examine different aspects of the data and uncover new associations. The analysis could therefore be extended by including more products (the rest of the Top50) or by changing the threshold criteria. Nevertheless, since association rule learning is an iterative process, data understanding and interpretation skills are really important. We should therefore give enough attention to the data visualization and, if required, data cleansing steps to make sure we are on the right track.