## Problem Statement

#### Suppose you are a Data Scientist at a retail store like Walmart/DMart

- The company wants to increase sales and customer satisfaction throughout the purchase journey.
- Your task is to use data to determine which products are often purchased together, so that retailers can:
 - Optimize product placement
 - Offer special deals and create new product bundles to encourage further sales of these combinations

<br>



> **Q. How many products are we dealing with here?**

At large superstores like Walmart, we have products across many categories: daily essentials, food products (milk, butter, jam, bread, etc.), beauty products, toys, and so on.

Even then, we are looking at a **few hundred** products, or at most a few thousand, as opposed to e-commerce, where there may be lakhs or millions of products. Let $n$ be the total number of distinct products.

Let's define $D$ as the set of all the products we have, then,

$D=\left \{1, 2, 3, ..., n  \right \}$

![picture](https://drive.google.com/uc?export=view&id=1tpCqUbIkAwOsFs5t7QIpPVY3MnN4iwiW)

<br>

> **Q. Do we have a way of representing the items bought by a customer?**

Yes. Consider a customer who is done selecting items and is proceeding to the billing counter. They present their **basket** of products, and the cashier scans each product bought and its quantity.

This is called a **transaction**, denoted by $T$.

For example,
- Suppose a customer bought items 1, 3, 6, and 8.
- This is represented as: $T_1 = \left \{1, 3, 6, 8 \right \}$
- Similarly, other customers bought: <br>
$T_2 = \left \{1, 3, 7, 12 \right \}$, <br>
$T_3 = \left \{1, 7, 3, 16 \right \}$ <br>
... and so on, up to $T_m$ for $m$ transactions.

If you think about it, each transaction $T_i$ is essentially a **subset of $D$**, i.e. $T_i ⊆ D$

**NOTE:**
- We are **not keeping track of the quantity** of a product bought by the user in this transaction representation.
- Since a store like Walmart sees a footfall of thousands of customers every day, real-world transaction data is very large.
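To make this representation concrete, here is a minimal sketch using Python sets. The catalogue size `n = 16` is an assumption chosen only so that the example transactions above fit; it is not taken from any real store.

```python
# Hypothetical toy catalogue: items are numbered 1..n, as in the text.
n = 16
D = set(range(1, n + 1))

# Each transaction is the set of distinct items in one basket.
transactions = [
    {1, 3, 6, 8},    # T_1
    {1, 3, 7, 12},   # T_2
    {1, 7, 3, 16},   # T_3
]

# Every transaction is a subset of D.
all_subsets = all(t <= D for t in transactions)
print(all_subsets)

# Quantities are dropped: scanning item 3 twice still yields a single 3.
basket_with_repeats = [3, 3, 1]
print(set(basket_with_repeats))
```

Using a set (rather than a list) is what encodes the "no quantities" note: duplicates collapse automatically.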

<br>

> **Q. Could there be any pattern in what the customer is buying?**

Yes. There are some products that are more frequently bought together. For instance,
- Bread and milk
- Pen and notebook
- Toothbrush and paste
... and so on.

These are called **item sets**.

As you can see from our transaction data above, customers who buy Item 1 also tend to buy Item 3. Here, the item set is: $\left \{1, 3 \right \}$
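A quick way to spot such pairs is to count how often each pair of items shares a basket. The sketch below does this for the three example transactions; it is only an illustration of the idea, not the method used later on the real dataset.

```python
from collections import Counter
from itertools import combinations

transactions = [{1, 3, 6, 8}, {1, 3, 7, 12}, {1, 7, 3, 16}]

pair_counts = Counter()
for t in transactions:
    # sorted() gives each pair a canonical order, so (1, 3) and (3, 1) count as one key
    for pair in combinations(sorted(t), 2):
        pair_counts[pair] += 1

# The pair (1, 3) appears in all three baskets
print(pair_counts[(1, 3)])
print(pair_counts.most_common(3))
```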

![picture](https://drive.google.com/uc?export=view&id=1HlFPKLF_kmKyL2kdRBnHv8eKX0nxDC9-)

<br>

> **Q. Why are these patterns important?**

Suppose a customer is buying Item 1 but not Item 3.
- We already know, based on a lot of transactional data, that these two are popularly bought together.
- So we can recommend Item 3 to the customer.

![picture](https://drive.google.com/uc?export=view&id=1BN8ptIuM4wZYYExAf153jNXr4KXxlFAq)


#### Q. How to solve this problem?


*   **Transactional data is extremely large**, and it is very difficult to manually find patterns that identify which products are purchased together.

* Let's have a look at what transactional data looks like.


In [None]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from google.colab import files

In [None]:
# Download the CSV that was uploaded to Google Drive
file_id = "1M5IPR96R9efi5cb2n8LxpXIXyx_F0x3v"  # renamed from `id` to avoid shadowing the built-in
print("https://drive.google.com/uc?export=download&id=" + file_id)
!wget "https://drive.google.com/uc?export=download&id=1M5IPR96R9efi5cb2n8LxpXIXyx_F0x3v" -O Online_Retail.csv


In [None]:
df = pd.read_csv("Online_Retail.csv")
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,01/12/10 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,01/12/10 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,01/12/10 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,01/12/10 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,01/12/10 8:26,3.39,17850.0,United Kingdom


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 541909 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    541909 non-null  object 
 1   StockCode    541909 non-null  object 
 2   Description  540455 non-null  object 
 3   Quantity     541909 non-null  int64  
 4   InvoiceDate  541909 non-null  object 
 5   UnitPrice    541909 non-null  float64
 6   CustomerID   406829 non-null  float64
 7   Country      541909 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 33.1+ MB


*   This dataset contains 8 features and 541,909 rows.

#### Now let's formalise the task
**Given**
- $D$: Set of all items, and
- $T$: Set of all transactions

Note: No of transactions $m$ >> No of distinct items $n$

**To find:**
- **Item sets** that occur very frequently in transactions T.

![picture](https://drive.google.com/uc?export=view&id=1tRndSbYeha9weFX3jX7uvAavMghpggxD)


<br>

This technique of analyzing transaction data to give recommendations to the customer is called **Market Basket Analysis**.

It is typically used in the context of an offline store, where $n$ is not too large, typically a few hundred.


## Data Preprocessing

Let's count the unique invoice numbers (transactions) and unique customer IDs.

In [None]:
print('Number of Unique Invoice numbers: {cnt}'.format(cnt=df.InvoiceNo.nunique()))
print('Number of Unique Customer IDs: {cnt}'.format(cnt=df.CustomerID.nunique()))

Number of Unique Invoice numbers: 25900
Number of Unique Customer IDs: 4372


In [None]:
df.describe()

Unnamed: 0,Quantity,UnitPrice,CustomerID
count,541909.0,541909.0,406829.0
mean,9.55225,4.611114,15287.69057
std,218.081158,96.759853,1713.600303
min,-80995.0,-11062.06,12346.0
25%,1.0,1.25,13953.0
50%,3.0,2.08,15152.0
75%,10.0,4.13,16791.0
max,80995.0,38970.0,18287.0


> **Q. Why are some values negative under the `Quantity` and `UnitPrice` columns?**

These denote returned or cancelled items.

To find patterns in the transactional data, we need not consider items that were returned; we only need to look at the items bought by a user in one go.

<br>

Hence, let's get rid of all rows where quantity is `< 0`.

In [None]:
df = df[df['Quantity']>=0]
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 531285 entries, 0 to 541908
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    531285 non-null  object 
 1   StockCode    531285 non-null  object 
 2   Description  530693 non-null  object 
 3   Quantity     531285 non-null  int64  
 4   InvoiceDate  531285 non-null  object 
 5   UnitPrice    531285 non-null  float64
 6   CustomerID   397924 non-null  float64
 7   Country      531285 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 36.5+ MB


Let’s drop the rows that don’t have invoice numbers and remove the credit transactions (those with invoice numbers containing C).

In [None]:
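# Drop rows without an invoice number; .copy() avoids pandas' SettingWithCopyWarning
df = df.dropna(axis=0, subset=['InvoiceNo']).copy()
df['InvoiceNo'] = df['InvoiceNo'].astype('str')
# Credit transactions have invoice numbers containing 'C'
df = df[~df['InvoiceNo'].str.contains('C')]
df.shape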

(531285, 8)

Let's explore the column `Country`

In [None]:
df['Country'].value_counts()

United Kingdom          486286
Germany                   9042
France                    8408
EIRE                      7894
Spain                     2485
Netherlands               2363
Belgium                   2031
Switzerland               1967
Portugal                  1501
Australia                 1185
Norway                    1072
Italy                      758
Channel Islands            748
Finland                    685
Cyprus                     614
Sweden                     451
Unspecified                446
Austria                    398
Denmark                    380
Poland                     330
Japan                      321
Israel                     295
Hong Kong                  284
Singapore                  222
Iceland                    182
USA                        179
Canada                     151
Greece                     145
Malta                      112
United Arab Emirates        68
European Community          60
RSA                         58
Lebanon 

Since most of our data is from the UK, let's consider only that and drop the rest.

Also, let's convert this data into a **sparse matrix** such that:
* We encode the basket data as binary values that show whether an item is bought (1) or not (0)


In [None]:
data = (df[df['Country'] =="United Kingdom"].groupby(['InvoiceNo', 'Description'])['Quantity']
               .sum().unstack().reset_index().fillna(0).set_index('InvoiceNo'))
data.head(2)

Description,4 PURPLE FLOCK DINNER CANDLES,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,NINE DRAWER OFFICE TIDY,OVAL WALL MIRROR DIAMANTE,RED SPOT GIFT BAG LARGE,SET 2 TEA TOWELS I LOVE LONDON,SPACEBOY BABY GIFT SET,TOADSTOOL BEDSIDE LIGHT,...,returned,taig adjust,test,to push order througha s stock was,website fixed,wrongly coded 20713,wrongly coded 23343,wrongly marked,wrongly marked 23343,wrongly sold (22719) barcode
InvoiceNo,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
536365,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
536366,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


*   For any transaction, if the quantity of an item is >= 1, it is encoded as 1 (bought); otherwise it is encoded as 0 (not bought).


In [None]:
def encode_units(x):
    if x <= 0:
        return 0
    if x >= 1:
        return 1

data = data.applymap(encode_units)
data

Description,4 PURPLE FLOCK DINNER CANDLES,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,NINE DRAWER OFFICE TIDY,OVAL WALL MIRROR DIAMANTE,RED SPOT GIFT BAG LARGE,SET 2 TEA TOWELS I LOVE LONDON,SPACEBOY BABY GIFT SET,TOADSTOOL BEDSIDE LIGHT,...,returned,taig adjust,test,to push order througha s stock was,website fixed,wrongly coded 20713,wrongly coded 23343,wrongly marked,wrongly marked 23343,wrongly sold (22719) barcode
InvoiceNo,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
536365,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536366,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536367,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536368,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536369,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
581585,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
581586,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
A563185,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
A563186,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
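To make the encoding explicit, here is a small sketch of the same idea without pandas: sum the quantities per (invoice, item) pair, then mark each cell as 1 or 0. The invoice lines below are hypothetical and only borrow a few descriptions from the dataset for flavour.

```python
# Hypothetical invoice lines: (invoice number, item description, quantity).
rows = [
    ("536365", "WHITE METAL LANTERN", 6),
    ("536365", "CREAM CUPID HEARTS COAT HANGER", 8),
    ("536366", "WHITE METAL LANTERN", 2),
    ("536366", "WHITE METAL LANTERN", 1),   # same item listed twice on one invoice
]

invoices = sorted({inv for inv, _, _ in rows})
items = sorted({item for _, item, _ in rows})

# Sum quantities per (invoice, item), mirroring groupby(...).sum()
totals = {}
for inv, item, qty in rows:
    totals[(inv, item)] = totals.get((inv, item), 0) + qty

# One row per invoice, one column per item: 1 if bought, else 0
basket = {
    inv: [1 if totals.get((inv, item), 0) >= 1 else 0 for item in items]
    for inv in invoices
}

print(items)
for inv, row in basket.items():
    print(inv, row)
```

Most cells in such a matrix are 0, because any single basket holds only a handful of the available items.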


Since our goal is to find items that are frequently bought together, let's get rid of all transactions where only one product is bought.

We are going to uncover the associations between 2 or more items that are bought together, according to historical data.


In [None]:
data = data[(data > 0).sum(axis=1) >= 2]
data

Description,4 PURPLE FLOCK DINNER CANDLES,50'S CHRISTMAS GIFT BAG LARGE,DOLLY GIRL BEAKER,I LOVE LONDON MINI BACKPACK,NINE DRAWER OFFICE TIDY,OVAL WALL MIRROR DIAMANTE,RED SPOT GIFT BAG LARGE,SET 2 TEA TOWELS I LOVE LONDON,SPACEBOY BABY GIFT SET,TOADSTOOL BEDSIDE LIGHT,...,returned,taig adjust,test,to push order througha s stock was,website fixed,wrongly coded 20713,wrongly coded 23343,wrongly marked,wrongly marked 23343,wrongly sold (22719) barcode
InvoiceNo,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
536365,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536366,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536367,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536368,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
536372,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
581582,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
581583,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
581584,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
581585,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


According to the result above, there are 16,539 transactions in which more than one item was bought. This means that 91% of the baskets contain more than one item.


## Apriori Algorithm

> **Q. How can we find the frequent item sets from Transaction data $T$?**

Let's create **key value pairs**, where:
- **key:** represents the **item-sets** / pairs of items bought together
- **value:** represents the **count** of occurrences of the key item set

From this representation, we use the transaction data to get a fair sense of which items are frequently bought together.

Item sets that have a very high frequency are likely related, so based on that we can recommend items to the customer. For example:
- A popular retail anecdote holds that young adults shopping between 5pm and 7pm tend to buy **beer** and **diapers** together.
- This is a very surprising relation that **could not have been guessed!**
- Utilizing this fact, as a Data Scientist at Walmart, we can recommend that stores keep the diapers section close to where beer is kept.

![picture](https://drive.google.com/uc?export=view&id=1YjTFmMKVepn18Gj2wXMlvWQjGBVAeb-e)

<br>

> **Q. How will we get the key values consisting of item sets?**

We are given the set of all products $D$. We can list out all of its subsets to get all possible item sets.

For example,
- If $D= \left \{1, 2, 3\right \}$, then all the subsets are: $\left \{1 \right \}, \left \{2\right \}, \left \{3 \right \}, \left \{1, 2 \right \}, \left \{1, 3 \right \}, \left \{2, 3\right \}, \left \{1, 2, 3 \right \}$

Using this logic, we get all the subsets of $D = \left \{1, 2, 3, ..., n\right \}$

<br>

> **Q. How many subsets would $D$ have?**

Recall from school mathematics that the number of subsets of a set with $n$ elements is $2^n$.

This is because each of the $n$ items has 2 choices: to be present in the subset (1) or to be absent from it (0).

Which means, $\underbrace{2 \times 2 \times \dots \times 2}_{n \ times} = 2^n$.

This count includes the empty subset.
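We can check this count for small sets with `itertools`. The helper name `all_subsets` is ours, introduced only for illustration.

```python
from itertools import chain, combinations

def all_subsets(items):
    """Return an iterator over every subset of `items` (as tuples), including the empty one."""
    items = sorted(items)
    return chain.from_iterable(combinations(items, k) for k in range(len(items) + 1))

D = {1, 2, 3}
subsets = list(all_subsets(D))
print(subsets)
print(len(subsets))   # 8, i.e. 2**3
```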

<br>

> **Q. Is this an ideal approach (i.e. finding all subsets of $D$ and counting their occurrences to find relations between items)? What will be the time complexity?**

- In our context, $n$ is the number of products in Walmart, which is at least a few hundred.

 - Even if we consider $n=100$, the number of subsets of $D$ becomes $2^{100}$, which is an insanely high number.

- Besides that, we will have to scan through all the $m$ transactions for each subset,
 - where $m$ is the total number of transactions, which might be in the millions.

This makes the time complexity: $O(2^n \cdot m)$

This is very tough to compute; it will be **very, very slow.**
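The brute-force idea can be sketched as below, on a toy catalogue of 4 items and 3 made-up transactions. The `checks` counter is added only to show where the $2^n \cdot m$ cost comes from.

```python
from itertools import combinations

def brute_force_counts(D, transactions):
    """Count occurrences of every non-empty subset of D; returns (counts, checks)."""
    counts, checks = {}, 0
    items = sorted(D)
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            c = set(candidate)
            count = 0
            for t in transactions:      # one full scan of the data per candidate
                checks += 1
                if c <= t:
                    count += 1
            counts[candidate] = count
    return counts, checks

D = {1, 2, 3, 4}
transactions = [{1, 2}, {1, 2, 3}, {2, 4}]
counts, checks = brute_force_counts(D, transactions)
print(checks)               # (2**4 - 1) * 3 = 45 subset checks
print(len(str(2 ** 100)))   # 2**100 has 31 decimal digits
```

With 4 items this is trivial, but each extra item doubles the number of candidates.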


![picture](https://drive.google.com/uc?export=view&id=1PR_XnhTpnmWdiNfOFD3GTzKAtH3GBYvA)

#### Q. How can we optimise on this idea?

To think of a possible optimization, note a simple fact: an item set can never occur more often than any of its subsets. For example, the number of baskets containing both bread and milk cannot exceed the number of baskets containing bread.


Using this idea, we can set a **threshold value** (aka **minimum support**) on how often an item set must occur.

This makes sense: with millions of transactions, we would like to set some sort of **minimum support**, at or above which we can be confident that an item-set is a repeating pattern, as at least that many customers are known to have bought those items together.

This minimum support is denoted as: $c$

![picture](https://drive.google.com/uc?export=view&id=1HhHhiQj3FJuRpWPqsOmCLpNn_q__FAO5)

<br>

> **Q. How can we use the intuition of minimum support to optimize our solution approach?**

- While listing all the subsets of our items $D$, say we come across a set $A$.
- Suppose the number of occurrences of $A$ in the transaction data $T$ is less than the minimum support value, i.e. $<c$.
- Then we need not consider $A$ or any of its supersets, since a superset can occur at most as often as $A$.

For example:
- Let $c=100$, and suppose we have $D = \left \{1, 2, 3, 4\right \}$ and the transaction data $T$.
- First, we list all the sets of size 1:
 - $\left \{1\right \}, \left \{2\right \}, \left \{3\right \}, \left \{4\right \}$

- Now, we find the number of occurrences of each of these.
 - Let these be $110, 200, 150, 50$ respectively.

- Since the number of occurrences of $\left \{4\right \}$ is less than $c$, no superset of it can occur 100 or more times, so we can ignore all of its supersets.

- Now, we build sets of size 2 from the items that occurred at least $c$ times:
 - $\left \{1, 2\right \}, \left \{1, 3\right \}, \left \{2, 3\right \}$

- Again, we find the number of occurrences of each of these:
 - Let these be: $105, 110, 60$

- Since $60<c$, $\left \{2,3\right \}$ is **not** a frequent item-set.

- Now, we try to build sets of size 3.
 - Any such set must avoid item $4$ and must not contain $\left \{2, 3\right \}$.
 - The only candidate from the remaining items is $\left \{1, 2, 3\right \}$, but it contains $\left \{2, 3\right \}$.
 - So, there exists no frequent item-set of size 3.
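The level-by-level procedure above can be sketched in plain Python as follows. This is a minimal illustration under our own assumptions (the function name `apriori_counts`, a raw count threshold `min_count`, and a handful of made-up transactions), not an optimized implementation.

```python
from itertools import combinations

def apriori_counts(transactions, min_count):
    """Return {frozenset: count} for every item set occurring at least min_count times."""
    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    items = {i for t in transactions for i in t}
    singles = count({frozenset([i]) for i in items})
    level = {c: n for c, n in singles.items() if n >= min_count}
    frequent = dict(level)
    k = 2
    while level:
        # Join step: unions of frequent (k-1)-sets that have exactly k items
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = {c: n for c, n in count(candidates).items() if n >= min_count}
        frequent.update(level)
        k += 1
    return frequent

# Hypothetical toy data with a threshold of 3 occurrences
transactions = [{1, 2, 3}, {1, 2}, {1, 3}, {2, 3, 4}, {1, 2, 3}]
result = apriori_counts(transactions, min_count=3)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

Here item 4 is dropped at size 1, so none of its supersets is ever counted; the only size-3 candidate, {1, 2, 3}, is counted once and then rejected.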

![picture](https://drive.google.com/uc?export=view&id=1_x2euKN0ET9gABTIIdnHChKHUO61PQO4)

<br>

This way, though we still have to parse through the $m$ transactions for each candidate item-set, we are able to significantly reduce the number of item-sets we examine (down from the full $2^n$).

![picture](https://drive.google.com/uc?export=view&id=1i7CFNb7WPGhZtAZ_3fxTPgmnKJK7hVx_)


This approach is known as the **Apriori Algorithm**, and it was developed around 1994-95.

Though it is very simple in design, it becomes too costly to be practical as $n$ increases.

**The worst case complexity is still: $O(2^n \cdot m)$**

Hence, it is not used in e-commerce today, where the number of products is very large.

![picture](https://drive.google.com/uc?export=view&id=1lPWhsCefcn3aK6d-ryGuPMsxDyvswu0r)

<br>

However, modifications have been made to the Apriori algorithm over time. This has given rise to a newer technique:
- **Frequent Pattern (FP) Growth**
 - Uses a specialised trie-like data structure (the FP-tree)
 - It is faster

Regardless of the modifications in FP Growth, essentially we are still performing **frequent item-set mining**.
- These methods are still very expensive
- They are useful only when the number of items $n$ in $D$ is small
- Hence they are used in offline marketplaces, where fewer products are present.
- They are not scalable for the world of e-commerce. We will learn other techniques for that soon.

![picture](https://drive.google.com/uc?export=view&id=1Or6_TJeeKtjHpyCE6sfYJ90Wve_kx_G-)


#### Q. Other than retail setups, can you think of other applications of Market Basket Analysis?

- **Bio-informatics**
 - If two chemical components $c_1$ and $c_3$ occur together frequently within different proteins, then perhaps there is some relation between these components.
 - If two gene sequences $ATTC$ and $AGTC$ occur frequently in the genome of some mammal, then perhaps there is some relation between them.

- **Medicine**
 - If we find that, across doctors' prescriptions, medicines $m_1, m_2$ and $m_3$ are frequently prescribed together, then they may form a combination that treats a certain ailment.

![picture](https://drive.google.com/uc?export=view&id=1K739RoCUtgyds9ZAEUv3ffgy73wJ93Xo)

- **Finding similar webpages / web usage mining**
 - If in a single session many users are visiting the same webpages ($w_1, w_2, w_3$), then perhaps they are related in nature

- **Finding similar words**
 ![picture](https://drive.google.com/uc?export=view&id=1A9dclrQ8Xfr3mG7warRdE4bJS9Qtsuqj)


> **Q. How to implement Apriori Algorithm in code?**

We have a ready-made function that implements Apriori for us, in the `mlxtend.frequent_patterns` module.

- As we discussed, we need to specify a minimum support threshold as a parameter of this function.

- Since our column names represent the items, we use `use_colnames=True`

This will give us the frequent item-sets, so let's also sort them in descending order of support using `sort_values()`.

Also, for easy interpretation, let's explicitly add a column stating the length of the itemset.

In [None]:
from mlxtend.frequent_patterns import apriori

In [None]:
frequent_itemsets_plus = apriori(data, min_support=0.03,
                                 use_colnames=True).sort_values('support', ascending=False).reset_index(drop=True)

frequent_itemsets_plus['length'] = frequent_itemsets_plus['itemsets'].apply(lambda x: len(x))

frequent_itemsets_plus

Unnamed: 0,support,itemsets,length
0,0.129875,(WHITE HANGING HEART T-LIGHT HOLDER),1
1,0.116331,(JUMBO BAG RED RETROSPOT),1
2,0.100671,(REGENCY CAKESTAND 3 TIER),1
3,0.095471,(PARTY BUNTING),1
4,0.084165,(LUNCH BAG RED RETROSPOT),1
...,...,...,...
172,0.030473,(CHRISTMAS CRAFT LITTLE FRIENDS),1
173,0.030473,(CREAM HEART CARD HOLDER),1
174,0.030413,"(LUNCH BAG BLACK SKULL., LUNCH BAG CARS BLUE)",2
175,0.030292,"(LUNCH BAG RED RETROSPOT, LUNCH BAG SPACEBOY D...",2


We get 176 frequently occurring item-sets (rows 0 to 175 above) using the Apriori Algorithm!


*   Using the Apriori algorithm, we filter frequent itemsets with a minimum support value of 3%
*   Length is the number of items in the itemset
* Based on the support threshold of 3%, there are 176 itemsets that are considered frequently bought
* E.g. WHITE HANGING HEART T-LIGHT HOLDER is the most frequently bought item, with a support value of 0.129875, i.e. the item appears in 2,148 of the 16,539 transactions
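As a quick sanity check, a support value can be converted back into a number of baskets by multiplying it by the number of transactions. The figures below are the ones reported above (16,539 baskets, top support 0.129875).

```python
n_transactions = 16539          # baskets left after filtering, as reported above
support = 0.129875              # support of the top item set in the output

approx_count = round(support * n_transactions)
print(approx_count)             # 2148

# A 3% minimum support corresponds to roughly this many baskets
min_count = 0.03 * n_transactions
print(min_count)                # about 496
```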


---

## Association Rule

Using Apriori algorithm, we get a sense of which item-sets are frequent.

We have another tool in the Market Basket Analysis, called the **Association rule**, using which we can find relations between these frequent item-sets.

<br>

> **Q. How does the Association rule work?**

Consider we have our set of items: $D = \left\{ 1, 2, 3, ..., n\right\}$

Let's define $X$ and $Y$ as two other sets of items, as follows:
- $X= \left \{1, 2, 3 \right\}$
- $Y = \left\{ 4, 6 \right\}$

If the item-set $\left \{1, 2, 3, 4, 6\right \}$ is a **frequent itemset**, then according to the **Association Rule** of Market Basket Analysis, we can say that
- "People who buy $X$ have a high likelihood of buying $Y$ also."

This can be written as: $X \rightarrow Y$

**It is read as: "If X, then Y".**
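A single frequent itemset can be split into an "if" part and a "then" part in many ways. The sketch below enumerates all such splits; the helper name `candidate_rules` is hypothetical and used only for illustration.

```python
from itertools import combinations

def candidate_rules(itemset):
    """Yield (antecedent, consequent) pairs that split `itemset` into two non-empty parts."""
    items = sorted(itemset)
    for k in range(1, len(items)):
        for antecedent in combinations(items, k):
            consequent = tuple(i for i in items if i not in antecedent)
            yield antecedent, consequent

rules = list(candidate_rules({1, 2, 3, 4, 6}))
print(len(rules))                      # 2**5 - 2 = 30
print(((1, 2, 3), (4, 6)) in rules)    # True
```

Which of these candidate rules are actually worth acting on is a separate question that needs a measure of rule strength.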

![picture](https://drive.google.com/uc?export=view&id=1DzwEdj6dgmy3Vb4N7cMTHYJzaOmEaVqE)

<br>

> **Q. What are some real life examples of association rule?**

For example:-
- **If** a person buys beer, **then** there is a high tendency of buying diapers
 - $\left\{ beer \right \} \rightarrow \left\{ diapers \right\}$

- People buying milk and bread also tend to buy jam and eggs
 - $\left\{ milk, bread \right \} \rightarrow \left\{ jam, eggs \right\}$

![picture](https://drive.google.com/uc?export=view&id=1AOrVfsR0J022K6EnErXfYzpZv3o2lA0F)

<br>

> **Q. Can X and Y be used interchangeably?**

**No.**

$X → Y$ is not the same as $Y → X$

When we say that people buying beer have a high tendency of buying diapers, that **does not imply** that people buying diapers have a high tendency of buying beer.

To set these apart, we have the following terminology:
- **Antecedent (If):** The items on the LEFT, i.e. the items the customer has already bought
- **Consequent (Then):** The items on the RIGHT, i.e. the items the customer is then likely to buy

Consider that you are at Domino's.
- It is more likely for a customer to buy the combination **pizza + coke**
- than the combination **pizza + garlic bread**

Though both these association rules hold true:
- $\left\{ pizza \right \} → \left\{ coke \right \}$
- $\left\{ pizza \right \} → \left\{ garlic \ bread \right \}$

we know that one of them is more strongly associated than the other.

<br>

### Q. Are there any metrics to know how strongly two itemsets are associated?

Yes, there are a couple of different metrics that give us a better idea about associations. These are:-
- Support
- Confidence
- Lift
- Leverage
- Conviction

Let's take a look at them one by one.

---

#### Support
We talked about setting a minimum support threshold in Apriori Algorithm, let's formally introduce the concept of **support**.

> **Q. What is support?**

Support measures how frequently an item or item-set occurs in the transaction data.

<br>

> **Q. How is support calculated?**

It is calculated by dividing the number of transactions in which an itemset $X$ occurs (let this be $x$) by the total number of transactions (let this be $N$).

$support(X) = \frac{x}{N}$

<br>

> **Q. How can we interpret a support value?**

From a probabilistic standpoint, as the formula suggests,
- support is essentially the probability of an item-set $X$ occurring in a transaction from the transaction data $T$.
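As a small illustration of the formula, here is a sketch that computes support over four made-up baskets. The function name `support` and the data are ours.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Hypothetical baskets
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "jam"},
    {"milk", "eggs"},
    {"bread", "butter"},
]
print(support({"milk"}, transactions))           # 0.75
print(support({"bread", "milk"}, transactions))  # 0.5
```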

![picture](https://drive.google.com/uc?export=view&id=1hyQZl6dureqy5t-g8O3kZhswBldsNw0p)

#### Confidence

Using association rules, we stated that a person buying $X$ also tends to buy $Y$, where $X$ and $Y$ are item sets.

> **Q. Since we need to base some business decisions on this, how confident are we in this statement?**

If we know that people buying milk and bread also tend to buy jam and eggs, then we can make a business decision and place all these items very close to each other.

But in order to make that decision, we first need to know how sure we are of this.

This can be answered by a term in market basket analysis called **confidence**.

Confidence tells us the fraction of the transactions containing $X$ in which this relationship was found to hold.

<br>

> **Q. How can we calculate confidence?**

This can be calculated by dividing the number of transactions in which both $X$ and $Y$ occur by the number of transactions in which $X$ occurs.

$confidence(X \rightarrow Y) = \frac{Number \ of \ transactions \ with \ X \ and \ Y}{Number \ of \ transactions \ with \ X}$


<br>

> **Q. How can we interpret a confidence value?**

- This can be thought of as the probability of $Y$ conditioned on $X$, $P(Y|X)$

 - i.e. of all the times that $X$ occurs, in what fraction do we also observe $Y$.

 - If $X \cup Y$ occurs almost as often as $X$ itself, then the confidence will be very high (e.g. $≈90$%)

- The confidence measure helps identify which product drives the sale of which other product.
 - For any two products, A drives B {A ⇒ B} is not the same as B drives A {B ⇒ A}
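The asymmetry is easy to see in code. Below is a minimal sketch with four made-up baskets; the function name `confidence` is ours, and it assumes the antecedent occurs at least once (otherwise the division would fail).

```python
def confidence(X, Y, transactions):
    """Of the transactions containing X, the fraction that also contain Y."""
    X, Y = set(X), set(Y)
    with_x = [t for t in transactions if X <= t]
    with_both = [t for t in with_x if Y <= t]
    return len(with_both) / len(with_x)   # assumes X occurs at least once

# Hypothetical baskets
transactions = [
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"diapers", "wipes"},
    {"beer", "diapers", "chips"},
]
print(confidence({"beer"}, {"diapers"}, transactions))   # 1.0
print(confidence({"diapers"}, {"beer"}, transactions))   # 0.5
```

In this toy data, every beer buyer also buys diapers, but only half of the diaper buyers buy beer.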


![picture](https://drive.google.com/uc?export=view&id=1scu8N-ELZR6KgcipA614kWdknpq9nG5w)

**Note:**
- A rule may look strong because its item set appears very often in the data set, yet hold far less often when checked against the antecedent.
 - This would be a **case of high support, but low confidence**.

- Conversely, a rule's item set might be rare in the data set, yet the rule holds almost every time its antecedent appears.
 - This would be a **case of high confidence and low support**.

#### Lift

Consider the combination: {Cornflakes} → {Milk}
- This should be a high confidence rule.

<br>

> **Q. What about {Yogurt} → {Milk}?**

High again.

<br>

> **Q. What about {Toothbrush} → {Milk}?**

Not so sure?
- Confidence for this rule will also be **high**, since {Milk} is such a frequent itemset that it would be present in most transactions.

- For such a frequent consequent, it hardly matters what you have in the antecedent.
- The confidence for an association rule having a very frequent consequent will tend to be high

Analyse this:
<ul>
<li>Total transactions = 100
<li> 80 of them have milk
<li> 14 of them have toothbrush
<li> 10 of them have both milk and toothbrush
</ul>

Confidence for {Toothbrush} → {Milk} will be 10/14 ≈ 0.71

Looks like a high confidence value. But we know intuitively that these two products have a weak association and there is something misleading about this high confidence value.

<br>

> **Q. How can we overcome this problem?**

Considering just the value of confidence limits our ability to make business inferences.

**Lift** is introduced to overcome this challenge.





> **Q. What is lift?**

Let’s run another small analysis.
Suppose a store’s retail transactions database includes the following data:
<ul><li>
Total number of transactions: 600,000
<li>Transactions containing Bread: 7,500 (1.25 percent)
<li>Transactions containing Milk: 60,000 (10 percent)
<li>Transactions containing both Bread and Milk: 6,000 (1.0 percent)
</ul>

From the above figures, we can conclude that
- if there was no relation between Bread and Milk (that is, they were statistically independent),
 - then we would expect only 10% of Bread purchasers to buy Milk too.

- However, as surprising as it may seem, the figures tell us that 80% (=6000/7500) of the people who buy Bread also buy Milk.

This is a significant jump of 8x over the expected probability.

This **factor of increase is known as Lift**, which is the ratio of the observed frequency of co-occurrence of our items to the expected frequency.

Based on the low percentages we are seeing here (1.25%, 10%, 1%), we might not have expected a strong relationship.

However, the fact that about 80% of Bread purchases include the purchase of Milk indicates a link between Bread and Milk.




<br>

> **Q. How is lift calculated?**


It is based on the idea that, if $X$ and $Y$ are independent events, then the probability of $X$ and $Y$ occurring together equals the product of their individual probabilities: $P(X ∩ Y) = P(X) \cdot P(Y)$

Lift is calculated by dividing the support of $X$ and $Y$ together by the product of their individual support values.

$lift(X → Y) = \frac{support(X ∩ Y)}{support(X) \cdot support(Y)} = \frac{confidence(X → Y)}{support(Y)}$

This can be interpreted as being the same as: $\frac{P(X ∩ Y)}{P(X) \cdot P(Y)}$

![picture](https://drive.google.com/uc?export=view&id=1pvRYIHM_ox9WpVEu0YCXFjLTzKqMOZNs)

<br>

> **Q. What happens if X and Y are actually independent?**

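In that case, the numerator and denominator will be equal, giving a result of 1.

Otherwise, the numerator will be either greater than the denominator (the items are bought together more often than expected) or smaller (less often than expected).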

- $lift(X→Y) = 1$, if X and Y are independent
- $lift(X→Y) < 1$, unlikely to be bought together: **negative correlation**
- $lift(X → Y) > 1$, likely to be bought together: **positive correlation**

**Note:**
- The value of lift ranges from **0 to infinity**.

![picture](https://drive.google.com/uc?export=view&id=1664Io4QPmIZYy6R-3vsa4eZk1ki5IaF1)
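The three cases can be illustrated with a small sketch. The function `interpret_lift`, its tolerance, and the probability triples below are all made up for this example.

```python
import math


def interpret_lift(value, tol=1e-9):
    """Map a lift value to one of the three cases listed above."""
    if math.isclose(value, 1.0, abs_tol=tol):
        return "independent"
    return "positive correlation" if value > 1 else "negative correlation"


# (P(X), P(Y), P(X and Y)) triples, invented for illustration.
cases = {
    "independent": (0.5, 0.4, 0.20),
    "bought together": (0.5, 0.4, 0.30),
    "rarely together": (0.5, 0.4, 0.10),
}

for name, (p_x, p_y, p_xy) in cases.items():
    value = p_xy / (p_x * p_y)
    print(f"{name}: lift={value:.2f} -> {interpret_lift(value)}")
```

A tolerance is used because lift computed from real data will rarely be exactly 1, even for items that are unrelated in practice.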

Now if we analyse the two rules that both had high confidence:

$Lift\left\{Bread → Milk\right\} = \frac{Confidence \left\{Bread → Milk\right\}}{Support(Milk)}$
$ = \frac{6000/7500}{60,000/600,000} = \frac{0.8}{0.1} = 8$

<br>

Similarly, we get

$Lift \left\{Toothbrush → Milk \right\} = \frac{Confidence \left\{Toothbrush → Milk\right\}}{Support(Milk)} = \frac{0.7}{0.8} = 0.875$

<br>

A lift value less than 1 shows that having a toothbrush in the cart does not increase the chances of milk being in the cart, in spite of the rule showing a high confidence value.




#### Leverage

There is another metric called **leverage** that is constructed using support. This is very similar to lift.

<br>

> **Q. How is leverage calculated?**

To compute the leverage of "if X then Y", i.e. $X → Y$, we take the support of $X$ and $Y$ together, and subtract from it the product of the support of $X$ and the support of $Y$.

$leverage(X → Y) = support(X ∩ Y) - support(X) \cdot support(Y)$

<br>

> **Q. What is the advantage of using leverage over lift values?**

- Though it is similar to lift, leverage is **easier to interpret**.
- Leverage value lies in the range of **-1 to +1** (0 means X and Y are independent), whereas lift value ranges from 0 to infinity.
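As a quick check of the leverage formula, the sketch below applies it to the Bread and Milk counts from the lift example. The function name `leverage` and its argument names are chosen here for illustration.

```python
def leverage(support_xy, support_x, support_y):
    """leverage(X -> Y) = support(X and Y) - support(X) * support(Y)."""
    return support_xy - support_x * support_y


# Supports from the Bread and Milk example.
total = 600_000
s_bread = 7_500 / total
s_milk = 60_000 / total
s_both = 6_000 / total

print(round(leverage(s_both, s_bread, s_milk), 5))

# For independent items the two terms cancel (up to float rounding).
print(leverage(0.2, 0.5, 0.4))
```

Here the leverage works out to 0.01 - 0.0125 × 0.1 = 0.00875: Bread and Milk appear together in 0.875% more of all transactions than independence would predict. Note how the same pair has a lift of 8 but a small leverage, because leverage is an absolute difference and rare items can only contribute a little.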

#### Conviction

> **Q. How can we calculate conviction value?**

It can be calculated as the ratio of the frequency with which X would be expected to occur without Y if X and Y were independent, to the observed frequency of incorrect predictions (X occurring without Y).

$conviction(X → Y) = \frac{1 - support(Y)}{1 - confidence(X → Y)}$

<br>

> **Note**

A high value means that the consequent depends strongly on the antecedent.
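The conviction formula can be sketched as follows. The function name `conviction` is chosen for this example, and returning infinity when the confidence is 1 is an assumption made here to avoid a division by zero.

```python
import math


def conviction(support_y, confidence_xy):
    """conviction(X -> Y) = (1 - support(Y)) / (1 - confidence(X -> Y))."""
    if confidence_xy >= 1.0:
        # The rule never fails, so the denominator would be zero.
        return math.inf
    return (1 - support_y) / (1 - confidence_xy)


# Bread -> Milk from the lift example: support(Milk) = 0.10, confidence = 0.80.
print(conviction(0.10, 0.80))

# Independence: confidence equals support(Y), so conviction is 1.
print(conviction(0.10, 0.10))

# A rule that always holds.
print(conviction(0.10, 1.00))
```

For Bread → Milk this gives 0.9 / 0.2 = 4.5, i.e. the rule would be wrong 4.5 times as often if Bread and Milk were independent.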


Now that we've looked at association rules and their metrics, let's implement them in code.

<br>

> **Q. How to implement this in code?**


- After applying the Apriori algorithm and finding the frequent itemsets, we generate the association rules from them.
- From the association rules, we can extract information about which items are most effective to sell together
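To make this step concrete, here is a conceptual sketch in plain Python of how rules could be derived from frequent itemsets. It is not the code of any library: `generate_rules`, the `min_confidence` parameter and the support values are all hypothetical. It assumes the mapping contains the support of every subset of each frequent itemset, which Apriori guarantees.

```python
from itertools import combinations


def generate_rules(itemset_support, min_confidence=0.5):
    """Derive rules X -> Y from a {frozenset: support} mapping."""
    rules = []
    for itemset, s_xy in itemset_support.items():
        if len(itemset) < 2:
            continue  # a rule needs at least one item on each side
        # Every non-empty proper subset can serve as the antecedent.
        for size in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(sorted(itemset), size)):
                consequent = itemset - antecedent
                confidence = s_xy / itemset_support[antecedent]
                if confidence >= min_confidence:
                    lift = confidence / itemset_support[consequent]
                    rules.append((set(antecedent), set(consequent), confidence, lift))
    # Strongest associations first.
    return sorted(rules, key=lambda r: r[3], reverse=True)


# Hypothetical supports, invented for illustration.
itemset_support = {
    frozenset({"bread"}): 0.4,
    frozenset({"milk"}): 0.5,
    frozenset({"bread", "milk"}): 0.3,
}

for antecedent, consequent, confidence, lift in generate_rules(itemset_support):
    print(antecedent, "->", consequent, round(confidence, 2), round(lift, 2))
```

Each frequent itemset with two or more items is split into every possible antecedent/consequent pair, and only the splits that pass the confidence threshold are kept.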

The same library provides a built-in function for this as well: `association_rules` in `mlxtend.frequent_patterns`.

**Note:**
- We pass the frequent itemsets we got from the Apriori algorithm as the parameter here.

In [None]:
from mlxtend.frequent_patterns import association_rules

In [None]:
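# Store the result under a new name so the imported function is not overwritten.
rules = association_rules(frequent_itemsets_plus, metric='lift', min_threshold=1)
# Sort so that the strongest associations (highest lift) come first.
rules = rules.sort_values('lift', ascending=False).reset_index(drop=True)
rules.head(5)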

| | antecedents | consequents | antecedent support | consequent support | support | confidence | lift | leverage | conviction |
|---|---|---|---|---|---|---|---|---|---|
| 0 | (GREEN REGENCY TEACUP AND SAUCER) | (PINK REGENCY TEACUP AND SAUCER) | 0.056473 | 0.042264 | 0.034887 | 0.617773 | 14.617093 | 0.0325 | 2.505674 |
| 1 | (PINK REGENCY TEACUP AND SAUCER) | (GREEN REGENCY TEACUP AND SAUCER) | 0.042264 | 0.056473 | 0.034887 | 0.825465 | 14.617093 | 0.0325 | 5.405948 |
| 2 | (PINK REGENCY TEACUP AND SAUCER) | (ROSES REGENCY TEACUP AND SAUCER ) | 0.042264 | 0.057682 | 0.033013 | 0.781116 | 13.541798 | 0.030575 | 4.305101 |
| 3 | (ROSES REGENCY TEACUP AND SAUCER ) | (PINK REGENCY TEACUP AND SAUCER) | 0.057682 | 0.042264 | 0.033013 | 0.572327 | 13.541798 | 0.030575 | 2.239413 |
| 4 | (GARDENERS KNEELING PAD CUP OF TEA ) | (GARDENERS KNEELING PAD KEEP CALM ) | 0.045287 | 0.054235 | 0.032711 | 0.722296 | 13.317793 | 0.030254 | 3.405662 |
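As a sanity check, the metric columns in this output can be recomputed from the three support columns using the formulas discussed earlier. The sketch below does this for the first two rows; the short labels `GREEN` and `PINK` stand for the two teacup products, and the tolerance is an assumption that accounts for the displayed values being rounded.

```python
import math

# First two rows of the output above, in the order:
# antecedent support, consequent support, support,
# confidence, lift, leverage, conviction
rows = {
    "GREEN -> PINK": (0.056473, 0.042264, 0.034887, 0.617773, 14.617093, 0.0325, 2.505674),
    "PINK -> GREEN": (0.042264, 0.056473, 0.034887, 0.825465, 14.617093, 0.0325, 5.405948),
}

for rule, (s_x, s_y, s_xy, conf, lift, lev, conv) in rows.items():
    recomputed = {
        "confidence": (s_xy / s_x, conf),
        "lift": (s_xy / (s_x * s_y), lift),
        "leverage": (s_xy - s_x * s_y, lev),
        "conviction": ((1 - s_y) / (1 - s_xy / s_x), conv),
    }
    for name, (ours, reported) in recomputed.items():
        # The printed inputs are rounded, so allow a small relative error.
        assert math.isclose(ours, reported, rel_tol=1e-3), (rule, name)
    print(rule, "matches the reported metrics")
```

Lift and leverage are identical for both directions of the pair, while confidence and conviction depend on which item is the antecedent.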


## Q. How can we draw data conclusions from these association rules?


*   GREEN REGENCY TEACUP AND SAUCER and PINK REGENCY TEACUP AND SAUCER have the strongest association with each other, since this pair has the **highest lift value**, i.e. 14.6

 *   This tells us that 'PINK REGENCY TEACUP AND SAUCER' is 14.6 times more likely to be bought by customers who buy 'GREEN REGENCY TEACUP AND SAUCER', compared to the **default likelihood of sale** of 'PINK REGENCY TEACUP AND SAUCER'

* Lift is symmetric, so both directions of this pair (rows 0 and 1) share the same lift of 14.6. Taking the item with the higher support as the antecedent (antecedent support > consequent support), the rule we look at is **{GREEN REGENCY TEACUP AND SAUCER -> PINK REGENCY TEACUP AND SAUCER}**

 * This means that a basket containing GREEN REGENCY TEACUP AND SAUCER is far more likely than average to also contain PINK REGENCY TEACUP AND SAUCER. A transaction does not record the order in which items were picked up, so the rule describes co-occurrence, not which item was bought first

* The confidence for the rule is 0.6177, which shows that out of all the transactions that contain GREEN REGENCY TEACUP AND SAUCER, ~62% contain PINK REGENCY TEACUP AND SAUCER too

 * This could be very valuable information, because we now know which products to put discounts on. We could give a discount on PINK REGENCY TEACUP AND SAUCER if a customer buys GREEN REGENCY TEACUP AND SAUCER, and so on

Similarly, we can draw other business conclusions from these association rules, which can lead to an increase in revenue!


## Q. What Business Strategy can we follow based on these findings?



*   **Item Placements:** We could place these two products together for easy accessibility and increased sales

*   **Product Bundling:** We could bundle these two products and sell them together at a discounted price compared to the sum of their individual prices

* **Customer Recommendation & Discounts:** We could place PINK REGENCY TEACUP AND SAUCER at the cashier, and every time a customer buys GREEN REGENCY TEACUP AND SAUCER, we could recommend PINK REGENCY TEACUP AND SAUCER to them at a lower price



## Summary

- Market basket analysis is a business problem, which can be solved using the **Apriori algorithm / FP growth algorithm**.
- Also, market basket analysis can be interpreted in the context of **association rule mining**
- To find these association rules and frequent itemsets, we still rely on the Apriori and FP growth algorithms

![picture](https://drive.google.com/uc?export=view&id=1b1lWMUFDgsID4fj07ifFmjhudf6IGszE)