## Market basket analysis

> - Identifies the strength of association between pairs of products purchased together

> - Identifies patterns of co-occurrence, where two or more items appear in the same transaction

>> In essence, it generates if -> then scenarios

<h4 align="center">Components of algorithm</h4>
<hr>
<div style="text-align: center"> Support </div>
<hr>
<div style="text-align: center"> Confidence</div>
<hr>
<div style="text-align: center"> Lift</div>
<hr>

<h5 align="center">A note on the metrics</h5>

<div>
<center><img src="itemset.png" width="200"/></center>
</div>
<hr>
<div style="text-align: center"> An association rule has an antecedent (the "if" part) and a consequent (the "then" part), both of which are part of the same itemset</div>
<hr>
<div style="text-align: center"> The implication is co-occurrence, not causality</div>

<h4 align="center">Support</h4>
<hr>
<div style="text-align: center"> How frequently an itemset appears across all transactions</div>
<hr>
<div style="text-align: center"> Used to filter out rare itemsets, leaving only the rules worth analysing further </div>


> $$ Support(X \cap Y) = \frac{Frequency(X \cap Y)}{N}, \quad N = \text{total number of transactions} $$

<h5 align="center">Example</h5>

<div>
<center><img src="venn.png" width="200" align="center"/></center>
</div>

<hr>
<div style="text-align: center"> Support(Toothbrush and Milk) = 10/100 = 0.10 </div>
<hr>
<div style="text-align: center"> 10% of transactions contain both toothbrush and milk </div>
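
As a minimal sketch of the support calculation, the snippet below counts how many baskets contain every item of an itemset and divides by the number of baskets. The `transactions` list and the `support` helper are hypothetical, made up for illustration; they are not part of the dataset used later.

```python
# Hypothetical toy baskets, one set of items per transaction
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"milk", "butter", "bread"},
    {"milk"},
    {"bread", "milk", "eggs"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for basket in transactions if itemset <= basket)
    return hits / len(transactions)


print(support({"bread", "milk"}, transactions))  # 3 of the 5 baskets
print(support({"butter"}, transactions))         # 2 of the 5 baskets
```

An itemset with a single item works the same way, which is how the support of an antecedent or consequent on its own is obtained.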

<h4 align="center">Confidence</h4>
<hr>
<div style="text-align: center"> The likelihood of the consequent occurring when the antecedent is present </div>


> $$ Confidence(X \rightarrow Y) = \frac{Frequency(X \cap Y)}{Frequency(X)} $$

<h5 align="center">Example</h5>

<div>
<center><img src="venn.png" width="200" align="center"/></center>
</div>

<hr>
<div style="text-align: center"> Confidence(Toothbrush → Milk) = 10/14 ≈ 0.7 </div>
<hr>
<div style="text-align: center"> The probability of milk being in the cart, given that a toothbrush is present, is about 70% </div>
<hr>
<div style="text-align: center"> Confidence alone can be misleading: it ignores how often the consequent is bought anyway, so a high value can hide a weak association </div>
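
Confidence depends on the direction of the rule, which a short calculation makes concrete. The sketch below reuses the counts from the toothbrush and milk example (10 baskets with both, 14 with a toothbrush) and takes the 80 milk baskets from the lift example; the `confidence` helper name is an assumption for illustration.

```python
def confidence(n_both, n_antecedent):
    """Confidence of antecedent -> consequent from raw transaction counts."""
    return n_both / n_antecedent


n_both = 10        # baskets with toothbrush and milk
n_toothbrush = 14  # baskets with a toothbrush
n_milk = 80        # baskets with milk (figure from the lift example)

# Toothbrush -> Milk: most toothbrush baskets also contain milk
conf_tb_to_milk = confidence(n_both, n_toothbrush)
# Milk -> Toothbrush: few milk baskets contain a toothbrush
conf_milk_to_tb = confidence(n_both, n_milk)

print(round(conf_tb_to_milk, 3))
print(round(conf_milk_to_tb, 3))
```

The same pair of items yields two rules with very different confidence values, so the antecedent and consequent cannot be swapped freely.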

<h4 align="center">Lift</h4>
<hr>
<div style="text-align: center"> Lift controls for the support (frequency) of the consequent when calculating the conditional probability of Y given X </div>


> $$ Lift(X \rightarrow Y) = \frac{Confidence(X \rightarrow Y)}{Support(Y)} = \frac{Support(X \cap Y)}{Support(X) \cdot Support(Y)} $$
<hr>
<div style="text-align: center"> Greater than 1: the items are more likely to be bought together </div>
<hr>
<div style="text-align: center"> Equal to 1: no association </div>
<hr>
<div style="text-align: center"> Less than 1: the items are less likely to be bought together </div>

<h5 align="center">Example</h5>

<div>
<center><img src="venn.png" width="200" align="center"/></center>
</div>
<hr>
<div style="text-align: center"> The probability of milk being in the cart, given that a toothbrush is present, is about 70% </div>
<hr>
<div style="text-align: center"> Now consider the probability of milk being in the cart without any knowledge of the toothbrush: 80/100 = 80% </div>
<hr>
<div style="text-align: center"> Having a toothbrush in the cart actually reduces the probability of milk being in the cart from 0.8 to 0.7 </div>
<hr>
<div style="text-align: center"> Lift(Toothbrush → Milk) = 0.7/0.8 ≈ 0.88 </div>
<hr>
<div style="text-align: center"> A lift below 1 shows that having a toothbrush in the cart does not increase the chance of milk being in the cart, despite the rule's high confidence </div>
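
The lift calculation above can be reproduced from the raw counts. This is a sketch that assumes 100 transactions in total, as implied by the 80/100 milk baseline; the variable names are illustrative. Using the unrounded confidence 10/14 gives roughly 0.89 rather than the value obtained from the rounded 0.7, and the conclusion is unchanged.

```python
n_total = 100      # total transactions (assumed from the 80/100 baseline)
n_toothbrush = 14  # baskets with a toothbrush
n_milk = 80        # baskets with milk
n_both = 10        # baskets with both

confidence = n_both / n_toothbrush  # P(milk | toothbrush)
support_milk = n_milk / n_total     # P(milk)
lift = confidence / support_milk

# Equivalent symmetric form: support(X and Y) / (support(X) * support(Y))
lift_symmetric = (n_both / n_total) / ((n_toothbrush / n_total) * support_milk)

print(round(lift, 3))                      # below 1: no positive association
print(abs(lift - lift_symmetric) < 1e-12)  # both forms agree
```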

In [1]:
import pandas as pd
import mlxtend
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial.distance import euclidean, cityblock
import random
import seaborn as sns

In [25]:
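df = pd.read_csv('data_ecom.csv')
# Drop rows with the 'POST' stock code (presumably postage charges, not products)
df = df[df['StockCode'] != 'POST']
# Basket for France: one row per invoice, one column per stock code,
# each cell holding the total quantity ordered (0 if the item is absent)
basket_france = (df[df['Country'] == "France"]
          .groupby(['InvoiceNo', 'StockCode'])['Quantity']
          .sum().unstack().reset_index().fillna(0)
          .set_index('InvoiceNo'))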

### Dataset of orders for unidentified e-com store

In [26]:
df_fr = df[df['Country'] =="France"]
df_fr.head()

| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 26 | 536370 | 22728 | ALARM CLOCK BAKELIKE PINK | 24 | 12/1/2010 8:45 | 3.75 | 12583.0 | France |
| 27 | 536370 | 22727 | ALARM CLOCK BAKELIKE RED | 24 | 12/1/2010 8:45 | 3.75 | 12583.0 | France |
| 28 | 536370 | 22726 | ALARM CLOCK BAKELIKE GREEN | 12 | 12/1/2010 8:45 | 3.75 | 12583.0 | France |
| 29 | 536370 | 21724 | PANDA AND BUNNIES STICKER SHEET | 12 | 12/1/2010 8:45 | 0.85 | 12583.0 | France |
| 30 | 536370 | 21883 | STARS GIFT TAPE | 24 | 12/1/2010 8:45 | 0.65 | 12583.0 | France |


In [27]:
print(f"Total # of unique orders {len(df_fr['InvoiceNo'].unique())}")
print(f"Total # of unique references {len(df_fr['StockCode'].unique())}")

Total # of unique orders 448
Total # of unique references 1542


In [28]:
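def encode_units(x):
    # 1 if at least one unit was bought, otherwise 0
    # (zero, negative, and fractional quantities below 1 all map to 0)
    return 1 if x >= 1 else 0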

### Encoding algo

In [29]:
# Apply the encoder to every cell (pandas 2.1+ deprecates applymap in favour of DataFrame.map)
basket_france = basket_france.applymap(encode_units)
basket_france.head()

| InvoiceNo / StockCode | 10002 | 10120 | 10125 | 10135 | 11001 | 15036 | 15039 | 15044C | 15056BL | 15056N | ... | 90030B | 90030C | 90031 | 90099 | 90184B | 90184C | 90201B | 90201C | C2 | M |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 536370 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 536852 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 536974 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 537065 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 537463 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
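
The pivot and encoding steps can be seen end to end on a tiny example. The sketch below builds a hypothetical long-format order table (the invoice numbers, stock codes, and quantities are made up) and produces the same kind of invoice-by-product matrix. It uses a vectorised comparison in place of `encode_units`, which is an alternative and not the approach used above; `apriori` accepts either 0/1 or True/False values.

```python
import pandas as pd

# Hypothetical order lines: one row per (invoice, product)
orders = pd.DataFrame({
    "InvoiceNo": ["A1", "A1", "A2", "A2", "A3"],
    "StockCode": ["X", "Y", "X", "Z", "Y"],
    "Quantity":  [2, 1, 5, -1, 3],   # -1 mimics a return
})

# One row per invoice, one column per product, total quantity in each cell
basket = (orders.groupby(["InvoiceNo", "StockCode"])["Quantity"]
                .sum().unstack().fillna(0))

# True where at least one unit was bought; returns and absences become False
basket_bool = basket >= 1
print(basket_bool)
```

Comparing the whole frame at once avoids calling a Python function per cell, which matters on wide matrices such as the 1542-product one here.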


In [30]:
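# Frequent itemsets: keep those present in at least 5% of French orders
frq_items = apriori(basket_france, min_support=0.05, use_colnames=True)
# Derive rules and keep those with a lift of at least 1
rules = association_rules(frq_items, metric="lift", min_threshold=1)
# Show the strongest rules first: highest confidence, then highest lift
rules = rules.sort_values(['confidence', 'lift'], ascending=[False, False])
rules.head()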

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 37 | (21080, 21086) | (21094) | 0.089286 | 0.111607 | 0.087054 | 0.975 | 8.736 | 0.077089 | 35.535714 |
| 36 | (21080, 21094) | (21086) | 0.089286 | 0.120536 | 0.087054 | 0.975 | 8.088889 | 0.076291 | 35.178571 |
| 10 | (21094) | (21086) | 0.111607 | 0.120536 | 0.107143 | 0.96 | 7.964444 | 0.09369 | 21.986607 |
| 35 | (23256) | (23254) | 0.060268 | 0.0625 | 0.055804 | 0.925926 | 14.814815 | 0.052037 | 12.65625 |
| 34 | (23254) | (23256) | 0.0625 | 0.060268 | 0.055804 | 0.892857 | 14.814815 | 0.052037 | 8.770833 |
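
The columns of the rules table are tied together by the definitions given earlier, which can be checked on the first row. The snippet below recomputes confidence and lift from the printed supports of the rule (21080, 21086) -> (21094). Leverage and conviction are not defined in the text above; the formulas used for them here are the standard ones and are stated as an assumption about how the table was produced. Small differences come from the rounding of the printed values.

```python
# Values copied from the first row of the rules table
antecedent_support = 0.089286
consequent_support = 0.111607
support = 0.087054

confidence = support / antecedent_support
lift = confidence / consequent_support
# Standard definitions, assumed to match the table's extra columns
leverage = support - antecedent_support * consequent_support
conviction = (1 - consequent_support) / (1 - confidence)

print(f"confidence ~ {confidence:.3f}")  # table value: 0.975
print(f"lift       ~ {lift:.3f}")        # table value: 8.736
print(f"leverage   ~ {leverage:.4f}")    # table value: 0.077089
print(f"conviction ~ {conviction:.1f}")  # table value: 35.535714
```

Read this way, the top rule says that orders containing both 21080 and 21086 include 21094 about 97.5% of the time, almost nine times the rate at which 21094 appears across all French orders.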


## Goals KTN