# **Market Basket Analysis**

Market basket analysis (MBA) is a data mining technique used to understand consumers' purchase habits. Its aim is to uncover sets of products that are frequently bought together.

Association rules are 'if-then' statements that express how likely items in a dataset are to occur together. Market basket analysis uses them to determine the relationships between products, and retailers use these associations in various marketing decisions. Association rules help in developing the following strategies:
- Product placement
- Personalised push notifications
- Catalog design
- Cross-selling & Up-selling

Strategic and mindful placement of products in the aisles not only saves consumers time but also encourages them to buy related products. For example, a consumer buying cereal is more likely to buy milk if the milk aisle is close to the cereal section.

The 'if' component of an association rule is known as the 'antecedent' and the 'then' component as the 'consequent'. The two components are disjoint (they share no items), and a rule reflects only co-occurrence, not causality.

The strength of an association rule is measured by support, confidence and lift.

* Support: Support is the fraction of transactions in which a product (or itemset) occurs. In simple terms, it reflects the popularity of the product. For example, `milk` will occur in more transactions than `shaving cream`, so `milk` generally has higher support than `shaving cream`.

* Confidence: Confidence is the ratio of the number of transactions that contain both the antecedent and the consequent to the number of transactions that contain the antecedent. As the name suggests, confidence reflects the reliability of the rule: it estimates the conditional probability that the consequent occurs given that the antecedent occurs.

* Lift: Lift is the ratio of the confidence of the rule to the expected confidence, where the expected confidence is the support of the consequent (the confidence we would see if antecedent and consequent were independent). In simple terms, it measures how much more often the two occur together than chance alone would predict.
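
The three measures above can be sketched in plain Python. The baskets below are made up for illustration and are not taken from the dataset used later; the helper names `support`, `confidence` and `lift` are likewise only illustrative.

```python
# Toy baskets (invented for illustration; not from the Instacart data).
transactions = [
    {"milk", "cereal"},
    {"milk", "cereal", "bread"},
    {"milk", "bread"},
    {"cereal"},
    {"milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Estimate of P(consequent | antecedent)."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


def lift(antecedent, consequent, baskets):
    """Confidence divided by the consequent's own support."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)


cereal_support = support({"cereal"}, transactions)
rule_confidence = confidence({"cereal"}, {"milk"}, transactions)
rule_lift = lift({"cereal"}, {"milk"}, transactions)
print(f"support(cereal) = {cereal_support:.2f}")
print(f"confidence(cereal -> milk) = {rule_confidence:.2f}")
print(f"lift(cereal -> milk) = {rule_lift:.2f}")
```

In these toy baskets milk is so common on its own that knowing cereal is in the basket does not raise the chance of milk, which is why the lift comes out below 1 even though the confidence looks fairly high.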

In [1]:
import pandas as pd

from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

pd.set_option('display.max_colwidth', None)

In [2]:
order_prior = pd.read_csv('order_products__prior.csv')
products = pd.read_csv('products.csv')

In [None]:
order_prior.head()

Unnamed: 0,order_id,product_id,add_to_cart_order,reordered
0,2,33120,1,1
1,2,28985,2,1
2,2,9327,3,0
3,2,45918,4,1
4,2,30035,5,0


In [3]:
order_prior.product_id.nunique()

49677

We have 49,677 products in our dataset. The Apriori algorithm scans the database multiple times to determine the frequency of itemsets, so it can be slow and inefficient on large datasets. Hence, we look at only the 200 most frequently bought products.
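
To see why the number of products matters, here is a minimal sketch of the level-wise search that Apriori performs, with one full pass over the baskets for each itemset size. It is a simplified illustration on made-up baskets, not mlxtend's actual implementation; the function name `frequent_itemsets` is hypothetical.

```python
from itertools import combinations


def frequent_itemsets(baskets, min_support):
    """Level-wise search: one full pass over the baskets per itemset size."""
    n = len(baskets)
    items = sorted({item for basket in baskets for item in basket})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    k = 1
    while candidates:
        # One pass over the data to count every candidate of size k.
        counts = {c: sum(c <= basket for basket in baskets) for c in candidates}
        level = {c: count / n for c, count in counts.items() if count / n >= min_support}
        frequent.update(level)
        # Candidates of size k+1 are unions of frequent k-itemsets whose
        # k-item subsets are all frequent (the Apriori pruning step).
        k += 1
        unions = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [
            c for c in unions
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        ]
    return frequent


# Toy baskets, invented for illustration.
baskets = [
    {"milk", "cereal"},
    {"milk", "cereal", "bread"},
    {"milk", "bread"},
    {"cereal"},
    {"milk"},
]
result = frequent_itemsets(baskets, min_support=0.4)
for itemset, itemset_support in result.items():
    print(sorted(itemset), itemset_support)
```

Every extra product adds candidates at each level, and every level costs another pass over all orders, which is the motivation for restricting the analysis to the most popular products.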

In [4]:
#calculating top 200 most bought products

top_200_products = order_prior.groupby('product_id')['order_id'].count().sort_values(ascending = False)[:200].reset_index()

In [5]:
#creating a dataframe of only top 200 products

df = order_prior[order_prior.product_id.isin(top_200_products.product_id.to_list())]

In [6]:
#merging df with products data on product_id to get product name
df = df.merge(products, on = 'product_id', how = 'left')

#keeping only the relevant columns
df= df[['order_id','product_name','reordered']].set_index('order_id') 

In [None]:
df.head()

order_id,product_name,reordered
2,Organic Egg Whites,1
2,Michigan Organic Kale,1
2,Carrots,1
3,Total 2% with Strawberry Lowfat Greek Strained Yogurt,1
3,Unsweetened Almondmilk,1


In [7]:
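#pivoting to one row per order and one column per product
#note: cell values come from 'reordered', so 1 marks a product that was reordered in that order;
#products absent from the order and first-time purchases (reordered = 0) both end up as 0
x = df.pivot_table(index='order_id', columns='product_name', values='reordered').fillna(0)

#converting data type from float to int
x = x.astype('int32')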
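
Because the pivot above takes its cell values from `reordered`, a product that is in an order for the first time is encoded as 0, just like a product that is absent. If plain presence in the basket is what is wanted, one possible alternative (an assumption about intent, not the method used for the results in this notebook) is `pd.crosstab`. The sketch below compares the two encodings on a hypothetical miniature of `df`.

```python
import pandas as pd

# Hypothetical miniature of `df`: one row per (order, product) pair.
toy = pd.DataFrame({
    "order_id": [1, 1, 2, 2, 3],
    "product_name": ["Banana", "Limes", "Banana", "Organic Cilantro", "Limes"],
    "reordered": [1, 0, 0, 1, 1],
})

# Matrix built the same way as above: the cell holds the `reordered` flag.
reorder_matrix = (
    toy.pivot_table(index="order_id", columns="product_name", values="reordered")
    .fillna(0)
    .astype("int32")
)

# Presence matrix: the cell is True whenever the product is in the order.
presence_matrix = pd.crosstab(toy["order_id"], toy["product_name"]).gt(0)

print(reorder_matrix)
print(presence_matrix)
```

In the toy data, `Limes` in order 1 is 0 in the reorder matrix but `True` in the presence matrix. Using the presence matrix on the real data would change the supports and rules reported below.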

In [None]:
x

order_id,100% Raw Coconut Water,100% Recycled Paper Towels,100% Whole Wheat Bread,2% Reduced Fat Milk,Apple Honeycrisp Organic,Asparagus,Baby Spinach,Bag of Organic Bananas,Banana,Bartlett Pears,...,Unsweetened Almondmilk,Unsweetened Original Almond Breeze Almond Milk,Unsweetened Vanilla Almond Milk,Vanilla Almond Breeze Almond Milk,Watermelon Chunks,Whipped Cream Cheese,Whole Milk,Yellow Bell Pepper,Yellow Onions,"YoKids Squeezers Organic Low-Fat Yogurt, Strawberry"
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,1,0,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3421078,0,0,0,0,0,0,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3421080,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3421081,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3421082,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
frequent_sets = apriori(x, min_support=0.003, use_colnames=True )

In [None]:
frequent_sets

Unnamed: 0,support,itemsets
0,0.011726,(100% Raw Coconut Water)
1,0.006451,(100% Recycled Paper Towels)
2,0.017559,(100% Whole Wheat Bread)
3,0.011275,(2% Reduced Fat Milk)
4,0.024482,(Apple Honeycrisp Organic)
5,0.016287,(Asparagus)
6,0.005458,(Baby Spinach)
7,0.123727,(Bag of Organic Bananas)
8,0.156115,(Banana)
9,0.009014,(Bartlett Pears)


We will derive the association rules on the basis of lift. Lift is more informative than confidence here because it reflects the relative strength of the association between two products. It is the ratio of the probability that item 1 is present given that item 2 is present to the probability that item 1 is present without any knowledge of item 2 (Nandakumar, 2020).

For example:
Probability of milk in the cart **with the knowledge** that a toothbrush is present = 10/(10+4) ≈ 0.71

Probability of milk in the cart **without the knowledge** of the toothbrush = 80/100 = 0.8

Knowing that a toothbrush is in the cart reduces the probability of milk from 0.8 to about 0.71. Hence the lift is 0.71/0.8 ≈ 0.89, and a lift below 1 indicates that the two products are less likely to appear together than if they were independent.
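
The arithmetic of this example can be checked directly from the counts, without rounding the intermediate probability. The variable names are illustrative only.

```python
# Counts taken from the example above.
carts_with_toothbrush = 10 + 4
carts_with_toothbrush_and_milk = 10
total_carts = 100
carts_with_milk = 80

p_milk_given_toothbrush = carts_with_toothbrush_and_milk / carts_with_toothbrush
p_milk = carts_with_milk / total_carts
lift = p_milk_given_toothbrush / p_milk

print(f"P(milk | toothbrush) = {p_milk_given_toothbrush:.3f}")
print(f"P(milk) = {p_milk:.3f}")
print(f"lift = {lift:.3f}")
```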

We will set `min_threshold` to 1: a lift above 1 means the products are likely to be bought together, a lift of 1 means there is no association between them, and a lift below 1 means they are unlikely to be bought together.

In [None]:
rules = association_rules(frequent_sets, metric='lift', min_threshold=1)
rules.sort_values(by=['lift'], ascending=False)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction
197,(Sparkling Water Grapefruit),(Lime Sparkling Water),0.022917,0.013527,0.003886,0.169569,12.535775,0.003576,1.187905
196,(Lime Sparkling Water),(Sparkling Water Grapefruit),0.013527,0.022917,0.003886,0.287278,12.535775,0.003576,1.370918
275,(Organic Yellow Onion),(Organic Garlic),0.030969,0.029242,0.005464,0.176421,6.033198,0.004558,1.178708
274,(Organic Garlic),(Organic Yellow Onion),0.029242,0.030969,0.005464,0.18684,6.033198,0.004558,1.191685
202,(Limes),(Organic Cilantro),0.037508,0.016881,0.003669,0.09783,5.795179,0.003036,1.089727
203,(Organic Cilantro),(Limes),0.016881,0.037508,0.003669,0.217363,5.795179,0.003036,1.229807
319,(Strawberries),(Raspberries),0.039087,0.015473,0.003199,0.081842,5.28937,0.002594,1.072285
318,(Raspberries),(Strawberries),0.015473,0.039087,0.003199,0.206748,5.28937,0.002594,1.211359
348,(Organic Raspberries),"(Bag of Organic Bananas, Organic Strawberries)",0.041283,0.01852,0.003436,0.083219,4.493553,0.002671,1.070572
345,"(Bag of Organic Bananas, Organic Strawberries)",(Organic Raspberries),0.01852,0.041283,0.003436,0.185509,4.493553,0.002671,1.177075
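
The derived columns in this table can be recomputed from the three support columns. The sketch below does so for the rule (Lime Sparkling Water) → (Sparkling Water Grapefruit), using the rounded values printed above and the standard definitions of lift, leverage and conviction; small differences from the table come from that rounding.

```python
# Values copied from the row for (Lime Sparkling Water) -> (Sparkling Water Grapefruit).
antecedent_support = 0.013527
consequent_support = 0.022917
rule_support = 0.003886

confidence = rule_support / antecedent_support
lift = confidence / consequent_support
# Leverage: observed joint support minus the support expected under independence.
leverage = rule_support - antecedent_support * consequent_support
# Conviction: how much more often the rule would be wrong if the items were independent.
conviction = (1 - consequent_support) / (1 - confidence)

print(f"confidence = {confidence:.6f}")
print(f"lift = {lift:.4f}")
print(f"leverage = {leverage:.6f}")
print(f"conviction = {conviction:.6f}")
```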


In [None]:
#adding length column to record the number of items in the antecedent
rules['length'] = rules['antecedents'].apply(len)

In [None]:
rules.sort_values(by=['lift', 'confidence'], ascending=[False, False])

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,length
196,(Lime Sparkling Water),(Sparkling Water Grapefruit),0.013527,0.022917,0.003886,0.287278,12.535775,0.003576,1.370918,1
197,(Sparkling Water Grapefruit),(Lime Sparkling Water),0.022917,0.013527,0.003886,0.169569,12.535775,0.003576,1.187905,1
274,(Organic Garlic),(Organic Yellow Onion),0.029242,0.030969,0.005464,0.18684,6.033198,0.004558,1.191685,1
275,(Organic Yellow Onion),(Organic Garlic),0.030969,0.029242,0.005464,0.176421,6.033198,0.004558,1.178708,1
202,(Limes),(Organic Cilantro),0.037508,0.016881,0.003669,0.09783,5.795179,0.003036,1.089727,1
203,(Organic Cilantro),(Limes),0.016881,0.037508,0.003669,0.217363,5.795179,0.003036,1.229807,1
319,(Strawberries),(Raspberries),0.039087,0.015473,0.003199,0.081842,5.28937,0.002594,1.072285,1
318,(Raspberries),(Strawberries),0.015473,0.039087,0.003199,0.206748,5.28937,0.002594,1.211359,1
345,"(Bag of Organic Bananas, Organic Strawberries)",(Organic Raspberries),0.01852,0.041283,0.003436,0.185509,4.493553,0.002671,1.177075,2
348,(Organic Raspberries),"(Bag of Organic Bananas, Organic Strawberries)",0.041283,0.01852,0.003436,0.083219,4.493553,0.002671,1.070572,1


As we can see above, several rules are derived from the same itemset. These rules share the same support, but the denominator used to compute confidence differs. For example:

Rule 1 : ratio of transactions containing 'milk' also having 'bread, cereal, butter'

Rule 2 : ratio of transactions containing 'bread, cereal, butter' also having 'milk'

The confidence of Rule 1 is typically lower than that of Rule 2. Both rules share the same numerator (transactions containing all four items), but far more transactions contain 'milk' than contain all three of 'bread, cereal, butter', so Rule 1 has the larger denominator.
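
The same effect is visible in the rules table above. The two rules built from {Bag of Organic Bananas, Organic Strawberries, Organic Raspberries} share one support value but divide it by different antecedent supports. A small check using the rounded values from the table (variable names are illustrative):

```python
# Support of the full itemset, shared by both rules.
support_all = 0.003436
# Antecedent supports of the two rules.
support_raspberries = 0.041283  # (Organic Raspberries)
support_pair = 0.01852          # (Bag of Organic Bananas, Organic Strawberries)

conf_single_to_pair = support_all / support_raspberries
conf_pair_to_single = support_all / support_pair

print(f"(Organic Raspberries) -> pair: {conf_single_to_pair:.4f}")
print(f"pair -> (Organic Raspberries): {conf_pair_to_single:.4f}")
```

The rule with the rarer, two-item antecedent has the higher confidence, while lift is identical for both directions.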

We will use this observation to prune the rules, filtering on support, confidence, lift and the number of items in the antecedent.

In [None]:
def threshold(support=.001, confidence=.005, lift=1.1, length=1):
    '''Filter the global `rules` DataFrame: keep rules that meet the given (or default)
    minimum support, confidence and lift and whose antecedent has exactly `length`
    items, sorted by lift in descending order.'''

    return rules[(rules['support'] >= support) &
                 (rules['confidence'] >= confidence) &
                 (rules['lift'] >= lift) &
                 (rules['length'] == length)].sort_values(by='lift', ascending=False)
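
The function above reads the notebook-level `rules` variable. A variant that takes the rules table as an explicit argument is easier to reuse on a second rule set; the sketch below is one possible way to write it, with the hypothetical name `filter_rules`, tried on three rows copied from the rules table above.

```python
import pandas as pd


def filter_rules(rules, support=0.001, confidence=0.005, lift=1.1, length=1):
    """Same filter as `threshold`, but the rules table is an explicit argument."""
    mask = (
        (rules["support"] >= support)
        & (rules["confidence"] >= confidence)
        & (rules["lift"] >= lift)
        & (rules["antecedents"].apply(len) == length)
    )
    return rules[mask].sort_values(by="lift", ascending=False)


# Three rows copied from the rules table above, for illustration.
toy_rules = pd.DataFrame({
    "antecedents": [
        frozenset({"Lime Sparkling Water"}),
        frozenset({"Limes"}),
        frozenset({"Bag of Organic Bananas", "Organic Strawberries"}),
    ],
    "consequents": [
        frozenset({"Sparkling Water Grapefruit"}),
        frozenset({"Organic Cilantro"}),
        frozenset({"Organic Raspberries"}),
    ],
    "support": [0.003886, 0.003669, 0.003436],
    "confidence": [0.287278, 0.09783, 0.185509],
    "lift": [12.535775, 5.795179, 4.493553],
})

print(filter_rules(toy_rules, confidence=0.2))
print(filter_rules(toy_rules, length=2))
```

Computing the antecedent length inside the function also removes the dependency on a precomputed `length` column.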

In [None]:
threshold(length = 2, confidence =.2).sort_values(by=['confidence','lift'], ascending=[False, False])

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,length
334,"(Organic Raspberries, Organic Hass Avocado)",(Bag of Organic Bananas),0.007615,0.123727,0.003447,0.452633,3.658309,0.002505,1.600887,2
340,"(Organic Strawberries, Organic Hass Avocado)",(Bag of Organic Bananas),0.011934,0.123727,0.00447,0.374553,3.027242,0.002993,1.401034,2
321,"(Organic Baby Spinach, Organic Hass Avocado)",(Bag of Organic Bananas),0.009896,0.123727,0.003565,0.360258,2.911707,0.002341,1.369728,2
346,"(Organic Raspberries, Organic Strawberries)",(Bag of Organic Bananas),0.009827,0.123727,0.003436,0.349607,2.825626,0.00222,1.347298,2
327,"(Organic Baby Spinach, Organic Strawberries)",(Bag of Organic Bananas),0.010427,0.123727,0.003056,0.29313,2.369159,0.001766,1.239652,2
332,"(Bag of Organic Bananas, Organic Raspberries)",(Organic Hass Avocado),0.012306,0.066632,0.003447,0.280099,4.203684,0.002627,1.296523,2
344,"(Bag of Organic Bananas, Organic Raspberries)",(Organic Strawberries),0.012306,0.080619,0.003436,0.279176,3.462899,0.002443,1.275459,2
320,"(Organic Baby Spinach, Bag of Organic Bananas)",(Organic Hass Avocado),0.014769,0.066632,0.003565,0.241395,3.622814,0.002581,1.230374,2
338,"(Bag of Organic Bananas, Organic Strawberries)",(Organic Hass Avocado),0.01852,0.066632,0.00447,0.241361,3.622307,0.003236,1.230319,2
339,"(Bag of Organic Bananas, Organic Hass Avocado)",(Organic Strawberries),0.019559,0.080619,0.00447,0.22853,2.834678,0.002893,1.191725,2


The top five rules with two items in the antecedent and a minimum confidence of 20% are:
* {Organic Raspberries, Organic Hass Avocado}  --▶ {Bag of Organic Bananas}
* {Organic Strawberries, Organic Hass Avocado}  --▶ {Bag of Organic Bananas}
* {Organic Baby Spinach, Organic Hass Avocado}  --▶ {Bag of Organic Bananas}
* {Organic Raspberries, Organic Strawberries}  --▶ {Bag of Organic Bananas}
* {Organic Baby Spinach, Organic Strawberries}  --▶ {Bag of Organic Bananas}



In [None]:
threshold(length = 1,  confidence = .30).sort_values(by=['confidence','lift'], ascending=[False, False])

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,length
95,(Bartlett Pears),(Banana),0.009014,0.156115,0.00363,0.402737,2.579742,0.002223,1.412921,1
133,(Organic Fuji Apple),(Banana),0.024992,0.156115,0.009735,0.389525,2.495112,0.005833,1.382342,1
113,(Honeycrisp Apple),(Banana),0.022644,0.156115,0.008058,0.355858,2.279456,0.004523,1.310091,1
63,(Organic Navel Orange),(Bag of Organic Bananas),0.010705,0.123727,0.003774,0.35259,2.849735,0.00245,1.353506,1
109,(Granny Smith Apples),(Banana),0.009218,0.156115,0.003221,0.349365,2.237863,0.001781,1.297017,1
101,(Broccoli Crown),(Banana),0.010786,0.156115,0.003671,0.340378,2.180296,0.001987,1.279345,1
105,(Cucumber Kirby),(Banana),0.026363,0.156115,0.008748,0.331808,2.125404,0.004632,1.262938,1
41,(Organic D'Anjou Pears),(Bag of Organic Bananas),0.01323,0.123727,0.004193,0.316903,2.561303,0.002556,1.282795,1
59,(Organic Large Extra Fancy Fuji Apple),(Bag of Organic Bananas),0.022403,0.123727,0.007055,0.314907,2.545168,0.004283,1.279057,1
167,(Seedless Red Grapes),(Banana),0.02161,0.156115,0.006603,0.305562,1.957284,0.00323,1.215205,1


`Banana` and `Bag of Organic Bananas` are natural consequents for fruits and vegetables. Let's look at the associations between products after excluding these two from the set.

In [8]:
y = x.drop(['Banana', 'Bag of Organic Bananas'], axis=1)

In [9]:
y.head()

order_id,100% Raw Coconut Water,100% Recycled Paper Towels,100% Whole Wheat Bread,2% Reduced Fat Milk,Apple Honeycrisp Organic,Asparagus,Baby Spinach,Bartlett Pears,Blackberries,Blueberries,...,Unsweetened Almondmilk,Unsweetened Original Almond Breeze Almond Milk,Unsweetened Vanilla Almond Milk,Vanilla Almond Breeze Almond Milk,Watermelon Chunks,Whipped Cream Cheese,Whole Milk,Yellow Bell Pepper,Yellow Onions,"YoKids Squeezers Organic Low-Fat Yogurt, Strawberry"
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5,0,0,0,1,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [None]:
frequent_set_2 = apriori(y, min_support=0.003, use_colnames=True )

In [13]:
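rules_2 = association_rules(frequent_set_2, metric='lift', min_threshold=1)
rules_2['length'] = rules_2['antecedents'].apply(len)  #number of items in the antecedent, as shown in the output below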

In [24]:
rules_2.sort_values(by=['confidence','lift'], ascending=[False, False]).head(20)

Unnamed: 0,antecedents,consequents,antecedent support,consequent support,support,confidence,lift,leverage,conviction,length
27,(Lime Sparkling Water),(Sparkling Water Grapefruit),0.013527,0.022917,0.003886,0.287278,12.535775,0.003576,1.370918,1
112,(Organic Lemon),(Organic Hass Avocado),0.023709,0.066632,0.00586,0.247175,3.709565,0.00428,1.239821,1
133,(Organic Raspberries),(Organic Strawberries),0.041283,0.080619,0.009827,0.238035,2.952579,0.006499,1.206591,1
126,(Organic Kiwi),(Organic Strawberries),0.014007,0.080619,0.003253,0.23221,2.880332,0.002123,1.197438,1
90,(Organic Blueberries),(Organic Strawberries),0.024643,0.080619,0.005464,0.221719,2.750198,0.003477,1.181297,1
32,(Organic Cilantro),(Limes),0.016881,0.037508,0.003669,0.217363,5.795179,0.003036,1.229807,1
92,(Organic Cucumber),(Organic Hass Avocado),0.021952,0.066632,0.004737,0.215803,3.238743,0.003275,1.190222,1
138,(Organic Whole String Cheese),(Organic Strawberries),0.017875,0.080619,0.003786,0.211836,2.627615,0.002345,1.166485,1
148,(Raspberries),(Strawberries),0.015473,0.039087,0.003199,0.206748,5.28937,0.002594,1.211359,1
118,(Organic Tomato Cluster),(Organic Hass Avocado),0.016552,0.066632,0.003364,0.203237,3.050148,0.002261,1.17145,1


**Observation:**

* Without bananas, no rule has more than one item in its antecedent.

* Interestingly, organic and non-organic products seldom occur together in a rule.