# Market Basket Analysis

Market basket analysis examines which products customers tend to buy together and uses that information to decide which products should be cross-sold or promoted together. The term comes from the shopping carts that supermarket shoppers fill during a shopping trip.

Association rule mining is used to find associations between different objects in a set and to find frequent patterns in a transaction database, a relational database, or any other information repository.

Its most common application is market basket analysis, a key technique that large retailers such as Amazon and Flipkart use to analyze customer buying habits by finding associations between the different items that customers place in their “shopping baskets”. The discovery of these associations can help retailers develop marketing strategies by revealing which items customers frequently purchase together. The strategies may include:

- Changing the store layout according to trends
- Customer behavior analysis
- Catalog design
- Cross marketing on online stores
- Customized emails with add-on sales, etc.

### Metrics

- **Support**: The baseline popularity of an item. In mathematical terms, the support of item A is the ratio of the number of transactions involving A to the total number of transactions.


- **Confidence**: The likelihood that a customer who bought A also bought B. It is the ratio of the number of transactions involving both A and B to the number of transactions involving A.
     - Confidence(A => B) = Support(A, B)/Support(A)


- **Lift**: The increase in the sale of B when A is sold, relative to the baseline popularity of B.

    - Lift(A => B) = Confidence(A => B)/Support(B)

    - Lift(A => B) = 1 means that there is no correlation within the itemset.
    - Lift(A => B) > 1 means that there is a positive correlation within the itemset, i.e., the products in the itemset, A and B, are more likely to be bought together.
    - Lift(A => B) < 1 means that there is a negative correlation within the itemset, i.e., the products in the itemset, A and B, are unlikely to be bought together.
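
The three definitions above can be checked by hand on a small example. The following is a minimal sketch in plain Python; the five baskets and the helper names `support`, `confidence`, and `lift` are made up for illustration and are not part of the notebook's pipeline. With this toy data, Support(bread) and Support(milk) are both 4/5, Support(bread, milk) is 3/5, so Confidence(bread => milk) is 0.75 and the lift is 0.9375, slightly below 1.

```python
# Toy data (hypothetical): five shopping baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Support(A, B) / Support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


def lift(antecedent, consequent, transactions):
    """Confidence(A => B) / Support(B)."""
    return (confidence(antecedent, consequent, transactions)
            / support(consequent, transactions))


A, B = {"bread"}, {"milk"}
print(f"Support(A)         = {support(A, transactions):.4f}")
print(f"Support(A, B)      = {support(A | B, transactions):.4f}")
print(f"Confidence(A => B) = {confidence(A, B, transactions):.4f}")
print(f"Lift(A => B)       = {lift(A, B, transactions):.4f}")
```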

**Apriori Algorithm:** The Apriori algorithm relies on the property that any subset of a frequent itemset must also be frequent. It is the algorithm used for market basket analysis here. For example, every transaction containing {Grapes, Apple, Mango} also contains {Grapes, Mango}. So, according to the Apriori principle, if {Grapes, Apple, Mango} is frequent, then {Grapes, Mango} must also be frequent.
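
The subset property is what lets Apriori prune its search: a larger itemset is only worth counting if all of its smaller subsets are already known to be frequent. The following is a minimal plain-Python sketch of that level-wise idea on a hypothetical four-basket dataset. The function name `frequent_itemsets` and the toy data are illustrative assumptions, and this is not a description of how `mlxtend` implements `apriori` internally.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search: size-k candidates are built only from
    frequent (k-1)-itemsets, then pruned using the subset property."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]  # size-1 candidates
    frequent = {}
    k = 1
    while current:
        # Keep only the candidates that meet the support threshold.
        level = {}
        for candidate in current:
            s = sum(candidate <= t for t in transactions) / n
            if s >= min_support:
                level[candidate] = s
        frequent.update(level)

        # Build size-(k+1) candidates from pairs of frequent size-k itemsets.
        k += 1
        candidates = {a | b for a in level for b in level if len(a | b) == k}

        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        current = [
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        ]
    return frequent


# Hypothetical baskets, echoing the fruit example above.
toy_transactions = [
    {"Grapes", "Apple", "Mango"},
    {"Grapes", "Apple", "Mango"},
    {"Grapes", "Mango"},
    {"Apple"},
]

found = frequent_itemsets(toy_transactions, min_support=0.5)
for itemset, s in sorted(found.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```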

In [1]:
import numpy as np
import pandas as pd
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

root = "C:/Users/HP/Downloads/market"

### Data

In [2]:
orders = pd.read_csv(root + '/orders.csv')
order_products_prior = pd.read_csv(root + '/order_products__prior.csv')
order_products_train = pd.read_csv(root + '/order_products__train.csv')
products = pd.read_csv(root + '/products.csv')

In [3]:
# DataFrame.append was removed in pandas 2.0; pd.concat stacks the two frames the same way.
order_products = pd.concat([order_products_prior, order_products_train])
order_products.shape

(33819106, 4)

In [4]:
order_products.head()

| | order_id | product_id | add_to_cart_order | reordered |
| --- | --- | --- | --- | --- |
| 0 | 2 | 33120 | 1 | 1 |
| 1 | 2 | 28985 | 2 | 1 |
| 2 | 2 | 9327 | 3 | 0 |
| 3 | 2 | 45918 | 4 | 1 |
| 4 | 2 | 30035 | 5 | 0 |


In [5]:
order_products.product_id.nunique()

49685

Of the 49,685 distinct products, we keep only the 100 most frequent.

In [6]:
product_counts = order_products.groupby('product_id')['order_id'].count().reset_index().rename(columns = {'order_id':'frequency'})
product_counts = product_counts.sort_values('frequency', ascending=False)[0:100].reset_index(drop = True)
product_counts = product_counts.merge(products, on = 'product_id', how = 'left')
product_counts.head(10)

| | product_id | frequency | product_name | aisle_id | department_id |
| --- | --- | --- | --- | --- | --- |
| 0 | 24852 | 491291 | Banana | 24 | 4 |
| 1 | 13176 | 394930 | Bag of Organic Bananas | 24 | 4 |
| 2 | 21137 | 275577 | Organic Strawberries | 24 | 4 |
| 3 | 21903 | 251705 | Organic Baby Spinach | 123 | 4 |
| 4 | 47209 | 220877 | Organic Hass Avocado | 24 | 4 |
| 5 | 47766 | 184224 | Organic Avocado | 24 | 4 |
| 6 | 47626 | 160792 | Large Lemon | 24 | 4 |
| 7 | 16797 | 149445 | Strawberries | 24 | 4 |
| 8 | 26209 | 146660 | Limes | 24 | 4 |
| 9 | 27845 | 142813 | Organic Whole Milk | 84 | 16 |


Keeping only the rows for the 100 most frequent items in the `order_products` dataframe.

In [7]:
freq_products = list(product_counts.product_id)
freq_products[1:10]

[13176, 21137, 21903, 47209, 47766, 47626, 16797, 26209, 27845]

In [8]:
len(freq_products)

100

In [9]:
order_products = order_products[order_products.product_id.isin(freq_products)]
order_products.shape

(7795471, 4)

In [10]:
order_products.order_id.nunique()

2444982

In [11]:
order_products = order_products.merge(products, on = 'product_id', how='left')
order_products.head()

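| | order_id | product_id | add_to_cart_order | reordered | product_name | aisle_id | department_id |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 2 | 28985 | 2 | 1 | Michigan Organic Kale | 83 | 4 |
| 1 | 2 | 17794 | 6 | 1 | Carrots | 83 | 4 |
| 2 | 3 | 24838 | 2 | 1 | Unsweetened Almondmilk | 91 | 16 |
| 3 | 3 | 21903 | 4 | 1 | Organic Baby Spinach | 123 | 4 |
| 4 | 3 | 46667 | 6 | 1 | Organic Ginger Root | 83 | 4 |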


Structuring the data to feed into the algorithm: one row per order, one column per product.

In [12]:
basket = order_products.groupby(['order_id', 'product_name'])['reordered'].count().unstack().reset_index().fillna(0).set_index('order_id')
basket

| order_id | 100% Raw Coconut Water | 100% Whole Wheat Bread | 2% Reduced Fat Milk | Apple Honeycrisp Organic | Asparagus | Bag of Organic Bananas | Banana | Bartlett Pears | Blueberries | Boneless Skinless Chicken Breasts | ... | Sparkling Natural Mineral Water | Sparkling Water Grapefruit | Spring Water | Strawberries | Uncured Genoa Salami | Unsalted Butter | Unsweetened Almondmilk | Unsweetened Original Almond Breeze Almond Milk | Whole Milk | Yellow Onions |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| 5 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 3421078 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3421080 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3421081 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3421082 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |


In [13]:
del product_counts, products, order_products, order_products_prior, order_products_train

Encoding the units: each cell becomes 1 if the product appears in the order and 0 otherwise.

In [14]:
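# Convert purchase counts to 0/1 flags: 1 if the product appears in the order, else 0.
# A vectorized comparison avoids calling a Python function on each of the ~244 million cells
# (and avoids DataFrame.applymap, which is deprecated in recent pandas releases).
basket = (basket >= 1).astype(int)
basket.head()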

| order_id | 100% Raw Coconut Water | 100% Whole Wheat Bread | 2% Reduced Fat Milk | Apple Honeycrisp Organic | Asparagus | Bag of Organic Bananas | Banana | Bartlett Pears | Blueberries | Boneless Skinless Chicken Breasts | ... | Sparkling Natural Mineral Water | Sparkling Water Grapefruit | Spring Water | Strawberries | Uncured Genoa Salami | Unsalted Butter | Unsweetened Almondmilk | Unsweetened Original Almond Breeze Almond Milk | Whole Milk | Yellow Onions |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 |
| 5 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
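
The pivot and encoding steps above can be pictured on a handful of rows. The sketch below is a plain-Python stand-in for the pandas pipeline, not a replacement for it: the `(order_id, product_name)` rows are invented, the names `counts` and `basket_flags` are illustrative, and the duplicate "Limes" row is contrived only to show why counts are clipped to 0/1.

```python
# Hypothetical (order_id, product_name) rows.
rows = [
    (1, "Banana"), (1, "Limes"),
    (2, "Banana"),
    (3, "Limes"), (3, "Large Lemon"), (3, "Limes"),  # "Limes" listed twice in order 3
]

products = sorted({product for _, product in rows})
order_ids = sorted({order_id for order_id, _ in rows})

# Step 1: count rows per (order, product); absent pairs stay at 0.
counts = {order_id: dict.fromkeys(products, 0) for order_id in order_ids}
for order_id, product in rows:
    counts[order_id][product] += 1

# Step 2: encode counts as presence flags (1 if bought at least once, else 0).
basket_flags = {
    order_id: {product: int(n >= 1) for product, n in row.items()}
    for order_id, row in counts.items()
}

for order_id in order_ids:
    print(order_id, basket_flags[order_id])
```

Each inner dictionary corresponds to one row of the `basket` dataframe, with one key per product column.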


In [15]:
basket.size

244498200

In [16]:
basket.shape

(2444982, 100)

Creating frequent sets and rules

In [29]:
frequent_items = apriori(basket, min_support=0.001, use_colnames=True, low_memory=True)

# Add a length column to make it easier to filter itemsets by size.
frequent_items['length'] = frequent_items['itemsets'].apply(len)

frequent_items



| | support | itemsets | length |
| --- | --- | --- | --- |
| 0 | 0.016062 | (100% Raw Coconut Water) | 1 |
| 1 | 0.025814 | (100% Whole Wheat Bread) | 1 |
| 2 | 0.015800 | (2% Reduced Fat Milk) | 1 |
| 3 | 0.035694 | (Apple Honeycrisp Organic) | 1 |
| 4 | 0.029101 | (Asparagus) | 1 |
| ... | ... | ... | ... |
| 2525 | 0.001049 | (Organic Yellow Onion, Organic Strawberries, O... | 3 |
| 2526 | 0.001157 | (Organic Zucchini, Organic Strawberries, Organ... | 3 |
| 2527 | 0.001018 | (Organic Whole Milk, Organic Strawberries, Org... | 3 |
| 2528 | 0.001436 | (Organic Hass Avocado, Organic Baby Spinach, O... | 4 |


In [30]:
frequent_items.tail()

| | support | itemsets | length |
| --- | --- | --- | --- |
| 2525 | 0.001049 | (Organic Yellow Onion, Organic Strawberries, O... | 3 |
| 2526 | 0.001157 | (Organic Zucchini, Organic Strawberries, Organ... | 3 |
| 2527 | 0.001018 | (Organic Whole Milk, Organic Strawberries, Org... | 3 |
| 2528 | 0.001436 | (Organic Hass Avocado, Organic Baby Spinach, O... | 4 |
| 2529 | 0.001659 | (Organic Hass Avocado, Organic Strawberries, B... | 4 |


In [31]:
frequent_items.shape

(2530, 3)
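
It can help to translate `min_support` from a fraction into a number of orders. The sketch below assumes the 2,444,982 orders reported by `basket.shape` and uses a hypothetical helper, `min_order_count`, to show how many orders an itemset must appear in to pass each of the two thresholds used in this notebook (0.001 and 0.01).

```python
import math

N_ORDERS = 2_444_982  # number of rows in `basket`, taken from basket.shape


def min_order_count(min_support, n_orders=N_ORDERS):
    """Smallest whole number of orders whose share of all orders is >= min_support."""
    return math.ceil(min_support * n_orders)


for threshold in (0.001, 0.01):
    print(f"min_support={threshold}: at least {min_order_count(threshold):,} orders")
```

Under these assumptions, the 0.001 threshold keeps itemsets seen in roughly 2,445 or more orders, while 0.01 raises the bar to about 24,450 orders, which is why the stricter setting yields far fewer itemsets and rules.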

In [39]:
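rules = association_rules(frequent_items, metric="lift", min_threshold=1)
# Record rule sizes to make filtering easier; both columns appear in the output below.
rules["antecedent_len"] = rules["antecedents"].apply(len)
rules["consequents_len"] = rules["consequents"].apply(len)

# Strongest associations first.
rules = rules.sort_values('lift', ascending=False)
rules.head()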

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric | antecedent_len | consequents_len |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5565 | (Lime Sparkling Water) | (Sparkling Water Grapefruit, Sparkling Lemon W... | 0.019841 | 0.003907 | 0.001874 | 0.094451 | 24.173626 | 0.001797 | 1.099988 | 0.978038 | 1 | 2 |
| 5564 | (Sparkling Water Grapefruit, Sparkling Lemon W... | (Lime Sparkling Water) | 0.003907 | 0.019841 | 0.001874 | 0.47964 | 24.173626 | 0.001797 | 1.883616 | 0.962393 | 2 | 1 |
| 5562 | (Lime Sparkling Water, Sparkling Water Grapefr... | (Sparkling Lemon Water) | 0.00564 | 0.013992 | 0.001874 | 0.332294 | 23.748283 | 0.001795 | 1.476709 | 0.963325 | 2 | 1 |
| 5567 | (Sparkling Lemon Water) | (Lime Sparkling Water, Sparkling Water Grapefr... | 0.013992 | 0.00564 | 0.001874 | 0.133934 | 23.748283 | 0.001795 | 1.148134 | 0.971485 | 1 | 2 |
| 5563 | (Lime Sparkling Water, Sparkling Lemon Water) | (Sparkling Water Grapefruit) | 0.003629 | 0.032411 | 0.001874 | 0.51634 | 15.930869 | 0.001756 | 2.000555 | 0.940643 | 2 | 1 |
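
The metric columns can be cross-checked against the support columns. The sketch below recomputes confidence, lift, leverage, and conviction for the first rule (index 5565) from its three support values. The leverage and conviction formulas are the standard definitions (leverage = Support(A, B) − Support(A)·Support(B); conviction = (1 − Support(B))/(1 − Confidence(A => B))), which this document does not state itself. Because the inputs are already rounded to six decimals, the results should agree with the table only approximately.

```python
# Values copied from the first row of the rules output (index 5565).
antecedent_support = 0.019841
consequent_support = 0.003907
support = 0.001874

# Confidence(A => B) = Support(A, B) / Support(A)
confidence = support / antecedent_support

# Lift(A => B) = Confidence(A => B) / Support(B)
lift = confidence / consequent_support

# Leverage: observed joint support minus the support expected under independence.
leverage = support - antecedent_support * consequent_support

# Conviction: (1 - Support(B)) / (1 - Confidence(A => B))
conviction = (1 - consequent_support) / (1 - confidence)

print(f"confidence = {confidence:.6f}  (table: 0.094451)")
print(f"lift       = {lift:.6f}  (table: 24.173626)")
print(f"leverage   = {leverage:.6f}  (table: 0.001797)")
print(f"conviction = {conviction:.6f}  (table: 1.099988)")
```

A lift of about 24 means that orders containing Lime Sparkling Water include the two other sparkling waters roughly 24 times more often than orders in general do.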


In [40]:
rules.tail()

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric | antecedent_len | consequents_len |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 426 | (Organic Grade A Free Range Large Brown Eggs) | (Banana) | 0.017101 | 0.200938 | 0.003443 | 0.201306 | 1.001828 | 6e-06 | 1.00046 | 0.001857 | 1 | 1 |
| 4694 | (Large Lemon, Banana) | (Organic Hass Avocado) | 0.017603 | 0.090339 | 0.001593 | 0.090501 | 1.001799 | 3e-06 | 1.000179 | 0.001828 | 2 | 1 |
| 4695 | (Organic Hass Avocado) | (Large Lemon, Banana) | 0.090339 | 0.017603 | 0.001593 | 0.017634 | 1.001799 | 3e-06 | 1.000032 | 0.001974 | 1 | 2 |
| 1377 | (Large Lemon) | (Organic Whole String Cheese) | 0.065764 | 0.025223 | 0.00166 | 0.025244 | 1.000837 | 1e-06 | 1.000022 | 0.000895 | 1 | 1 |
| 1376 | (Organic Whole String Cheese) | (Large Lemon) | 0.025223 | 0.065764 | 0.00166 | 0.065819 | 1.000837 | 1e-06 | 1.000059 | 0.000858 | 1 | 1 |


In [47]:
def select_rules_with_antecedents_length(length):
    # The parameter is named `length` so it does not shadow the built-in len().
    return rules[rules['antecedent_len'] == length]

In [48]:
select_rules_with_antecedents_length(3)

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric | antecedent_len | consequents_len |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 6408 | (Organic Hass Avocado, Organic Strawberries, B... | (Organic Raspberries) | 0.006452 | 0.058325 | 0.001659 | 0.257099 | 4.408066 | 0.001283 | 1.267566 | 0.778164 | 3 | 1 |
| 6411 | (Organic Strawberries, Bag of Organic Bananas,... | (Organic Hass Avocado) | 0.005003 | 0.090339 | 0.001659 | 0.331562 | 3.670203 | 0.001207 | 1.360876 | 0.731194 | 3 | 1 |
| 6397 | (Organic Baby Spinach, Organic Strawberries, B... | (Organic Hass Avocado) | 0.004726 | 0.090339 | 0.001436 | 0.303964 | 3.364707 | 0.00101 | 1.306917 | 0.706134 | 3 | 1 |
| 6409 | (Organic Hass Avocado, Organic Strawberries, O... | (Bag of Organic Bananas) | 0.003384 | 0.161527 | 0.001659 | 0.49027 | 3.035222 | 0.001112 | 1.644935 | 0.672811 | 3 | 1 |
| 6410 | (Organic Hass Avocado, Bag of Organic Bananas,... | (Organic Strawberries) | 0.004883 | 0.112711 | 0.001659 | 0.339698 | 3.013883 | 0.001108 | 1.343763 | 0.671481 | 3 | 1 |
| 6394 | (Organic Hass Avocado, Organic Baby Spinach, O... | (Bag of Organic Bananas) | 0.003463 | 0.161527 | 0.001436 | 0.414738 | 2.567611 | 0.000877 | 1.432646 | 0.612655 | 3 | 1 |
| 6395 | (Organic Hass Avocado, Organic Baby Spinach, B... | (Organic Strawberries) | 0.005191 | 0.112711 | 0.001436 | 0.276688 | 2.454838 | 0.000851 | 1.226703 | 0.595734 | 3 | 1 |
| 6396 | (Organic Hass Avocado, Organic Strawberries, B... | (Organic Baby Spinach) | 0.006452 | 0.102948 | 0.001436 | 0.222617 | 2.162427 | 0.000772 | 1.153938 | 0.541048 | 3 | 1 |


In [34]:
pop_df=rules[:100000]
pop_df.to_pickle('market_basket')
pop_df=pd.read_pickle('market_basket')
pop_df

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric | antecedent_len |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 5565 | (Lime Sparkling Water) | (Sparkling Water Grapefruit, Sparkling Lemon W... | 0.019841 | 0.003907 | 0.001874 | 0.094451 | 24.173626 | 0.001797 | 1.099988 | 0.978038 | 1 |
| 5564 | (Sparkling Water Grapefruit, Sparkling Lemon W... | (Lime Sparkling Water) | 0.003907 | 0.019841 | 0.001874 | 0.479640 | 24.173626 | 0.001797 | 1.883616 | 0.962393 | 2 |
| 5562 | (Lime Sparkling Water, Sparkling Water Grapefr... | (Sparkling Lemon Water) | 0.005640 | 0.013992 | 0.001874 | 0.332294 | 23.748283 | 0.001795 | 1.476709 | 0.963325 | 2 |
| 5567 | (Sparkling Lemon Water) | (Lime Sparkling Water, Sparkling Water Grapefr... | 0.013992 | 0.005640 | 0.001874 | 0.133934 | 23.748283 | 0.001795 | 1.148134 | 0.971485 | 1 |
| 5563 | (Lime Sparkling Water, Sparkling Lemon Water) | (Sparkling Water Grapefruit) | 0.003629 | 0.032411 | 0.001874 | 0.516340 | 15.930869 | 0.001756 | 2.000555 | 0.940643 | 2 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 426 | (Organic Grade A Free Range Large Brown Eggs) | (Banana) | 0.017101 | 0.200938 | 0.003443 | 0.201306 | 1.001828 | 0.000006 | 1.000460 | 0.001857 | 1 |
| 4694 | (Large Lemon, Banana) | (Organic Hass Avocado) | 0.017603 | 0.090339 | 0.001593 | 0.090501 | 1.001799 | 0.000003 | 1.000179 | 0.001828 | 2 |
| 4695 | (Organic Hass Avocado) | (Large Lemon, Banana) | 0.090339 | 0.017603 | 0.001593 | 0.017634 | 1.001799 | 0.000003 | 1.000032 | 0.001974 | 1 |
| 1377 | (Large Lemon) | (Organic Whole String Cheese) | 0.065764 | 0.025223 | 0.001660 | 0.025244 | 1.000837 | 0.000001 | 1.000022 | 0.000895 | 1 |


In [22]:
frequent_items = apriori(basket, min_support=0.01, use_colnames=True, low_memory=True)



In [24]:
rules = association_rules(frequent_items, metric="lift", min_threshold=1)
rules=rules.sort_values('lift', ascending=False)
rules.head()

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction | zhangs_metric |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 35 | (Large Lemon) | (Limes) | 0.065764 | 0.059984 | 0.01186 | 0.180345 | 3.006544 | 0.007915 | 1.146843 | 0.714372 |
| 34 | (Limes) | (Large Lemon) | 0.059984 | 0.065764 | 0.01186 | 0.197723 | 3.006544 | 0.007915 | 1.16448 | 0.70998 |
| 52 | (Organic Strawberries) | (Organic Raspberries) | 0.112711 | 0.058325 | 0.014533 | 0.12894 | 2.210731 | 0.007959 | 1.081069 | 0.61723 |
| 53 | (Organic Raspberries) | (Organic Strawberries) | 0.058325 | 0.112711 | 0.014533 | 0.249174 | 2.210731 | 0.007959 | 1.181751 | 0.581582 |
| 36 | (Organic Avocado) | (Large Lemon) | 0.075348 | 0.065764 | 0.010538 | 0.139862 | 2.126728 | 0.005583 | 1.086147 | 0.572966 |
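
The output above lists each pair of items twice, once per direction, with identical lift but different confidence. The sketch below uses the rounded support values of the Large Lemon and Limes rules to illustrate why: lift divides the joint support by both individual supports, so it is symmetric, whereas confidence divides by the antecedent's support only. The variable names are illustrative.

```python
# Rounded values copied from the Large Lemon / Limes rows above.
s_lemon = 0.065764   # Support(Large Lemon)
s_limes = 0.059984   # Support(Limes)
s_both = 0.01186     # Support(Large Lemon, Limes)

# Confidence depends on the direction of the rule.
conf_lemon_to_limes = s_both / s_lemon
conf_limes_to_lemon = s_both / s_limes

# Lift comes out the same in both directions: Support(A, B) / (Support(A) * Support(B)).
lift_lemon_to_limes = conf_lemon_to_limes / s_limes
lift_limes_to_lemon = conf_limes_to_lemon / s_lemon

print(f"Confidence(Lemon => Limes) = {conf_lemon_to_limes:.4f}")
print(f"Confidence(Limes => Lemon) = {conf_limes_to_lemon:.4f}")
print(f"Lift (either direction)    = {lift_lemon_to_limes:.4f}")
```

Because of this symmetry, lift indicates that two items go together but not which one is the better trigger for recommending the other; confidence is the direction-sensitive measure.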
