# Fastpages Notebook Blog Post
> A tutorial of fastpages for Jupyter notebooks.

- toc: true 
- badges: true
- comments: true
- categories: [jupyter]
- image: images/chart-preview.png

# Understanding MBA

Market basket analysis (MBA), also known as association-rule mining, is a useful method of discovering customer purchasing patterns by extracting associations or co-occurrences from stores' transactional databases (Chen et al., 2005).  It is a modelling technique based on the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items. For example, if you are in a supermarket and you buy a loaf of Bread, you are more likely to buy a packet of Butter at the same time than somebody who didn't buy the Bread. As another example, if you are buying a XiaoMi Power Bank in an online store, you are more likely to also buy a carrying case to go with it. [Amazon](https://www.amazon.com) knows this well from the transaction data of its millions of customers and thus recommends a case to you, as seen below:


<table><tr><td><img height="400" width="800" src="images/amazon.jpg"></td></tr></table>
<p style="text-align: center">Credit: Amazon</p>


The set of items a customer buys is known as an itemset, and MBA tries to identify relationships between the items in purchased itemsets. The output of MBA consists of a series of product association rules. From the transaction data extracted from the shopping carts of online retailers or the point-of-sale systems of retail stores, we can use MBA to extract interesting association rules between products. For example, if customers buy product A, they also tend to buy product B.

Typically, the relationship between products is expressed as a rule. An example of an association rule:

    IF {bread} THEN {butter}. 

In this example, if customers buy Bread they also tend to buy Butter. Products with high association are often linked to "complementary goods". In economics, a complementary good or service is one that is consumed or used in conjunction with another good or service. Usually, the complementary good has little to no value when consumed alone, but when combined with another good or service, it adds to the overall value of the offering. A car and petrol are an example: there is little value in buying petrol without owning a car. Complementary goods have a negative cross-price elasticity of demand (Farnham, 2014). However, while complementary goods tend to have high association, not all products with high association are complementary goods. In MBA, we are more interested in product pairs with high association, i.e. products that are frequently purchased together. For example, in a retail store, MBA findings may show that Barbie dolls and candy are frequently purchased together, even though they are not technically complementary goods. In short, complementary goods are fairly obvious, but MBA seeks to uncover product associations that may not be so obvious and straightforward. In doing so, it attempts to convert abstract consumer tastes and preferences into association rules that are more insightful and actionable from a business perspective.

***
## Applications ##
There are many real-life applications of MBA:
- **Recommendation engine** – showing related products as "Customers Who Bought This Item Also Bought" or "Frequently bought together" (as shown in the Amazon example above). It can also be applied to recommend videos and news articles by analyzing the videos or articles that are often watched or read together in a user session.
<br>
<br>
- **Cross-sell / bundle products** – selling associated products as a "bundle" instead of as individual items. For example, transaction data may show that customers often buy a new phone together with a screen protector. Phone retailers can then package a new phone with a high-margin screen protector and sell them as a bundle, thereby increasing their sales.
<br>
<br>
- **Arrangement of items in retail stores** – associated items can be placed closer to each other, thereby encouraging "impulse buying". For example, the data may reveal that customers who buy Barbie dolls also buy candy at the same time. Retailers can then place high-margin candy near the Barbie doll display, tempting customers to buy them together.
<br>
<br>
- **Detecting fraud** – identifying related actions whenever a fraudulent transaction is performed. For example, for fraudulent insurance claims on stolen vehicles, historical data may show that claimants frequently report the incident a few days late (action 1) and often refuse to cooperate with the insurer's investigation (action 2). Insurers can then flag a claim once claimants display these behaviours.
<br>
<br>



***
## Case Study ##
For simplicity, we analyze only two items: Bread and Butter. We want to know whether there is any evidence that buying Bread leads to buying Butter.

**Problem Statement:** Does the purchase of Bread lead to the purchase of Butter?<br><br>
**Hypothesis:** There is significant evidence to show that buying Bread leads to buying Butter.


Bread => Butter

Antecedent => Consequent

Let's take the example of a supermarket which generates 1,000 transactions monthly, of which Bread was purchased in 150 transactions, Butter in 130 transactions, and both together in 50 transactions.

In set theory it can be represented as Bread only – 100, Butter only – 80, Bread and Butter – 50, as shown in the Venn diagram below:

![alt text](images/set.jpg "Example in a set")
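The regions of the Venn diagram follow from the three counts given in the text. A minimal sketch of that arithmetic (the variable names are illustrative only):

```python
# Counts taken from the supermarket example in the text.
total_transactions = 1000
n_bread = 150   # transactions containing Bread
n_butter = 130  # transactions containing Butter
n_both = 50     # transactions containing both Bread and Butter

# Regions of the Venn diagram.
bread_only = n_bread - n_both
butter_only = n_butter - n_both
either = bread_only + butter_only + n_both
neither = total_transactions - either

print(bread_only, butter_only, either, neither)  # 100 80 230 770
```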



## Analysis and Findings ##
We can use MBA to extract the association rule between Bread and Butter. There are three metrics or criteria to evaluate the strength or quality of an association rule, which are support, confidence and lift.

### 1. Support ###
Support measures the percentage of transactions containing a particular combination of items relative to the total number of transactions. In our example, this is the percentage of transactions where both Bread and Butter are bought together. We calculate it to know whether this combination of items is significant or negligible. Generally, we want a high support to make sure the relationship is useful. Typically, we set a threshold: for example, we only look at a combination if more than 1% of transactions contain it.


Support (antecedent (Bread) and consequent (Butter)) = Number of transactions having both items / Total transactions

![alt text](images/support.jpg "Support")

Result: The support value of 5% means that 5% of all transactions contain both Bread and Butter. Since the value is above the threshold of 1%, there is indeed **_support_** for this association, and the rule satisfies the first criterion.
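The support calculation can be reproduced in a few lines of Python. This is only a sketch using the counts from the example; the helper name `support` and the constant `MIN_SUPPORT` are illustrative, not part of any library.

```python
def support(count_itemset: int, total: int) -> float:
    """Fraction of all transactions that contain the itemset."""
    return count_itemset / total


MIN_SUPPORT = 0.01  # the 1% threshold used in the text

# 50 of the 1,000 transactions contain both Bread and Butter.
support_bread_butter = support(50, 1000)

print(f"{support_bread_butter:.1%}")       # 5.0%
print(support_bread_butter > MIN_SUPPORT)  # True
```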

***
### 2. Confidence ###
Confidence measures the probability of finding the consequent in a transaction whenever the antecedent is bought. In probability terms, confidence is the conditional probability of the consequent given the antecedent, written P(consequent | antecedent). In our example, it is the probability that Butter is also bought whenever Bread is bought. Typically, we set a threshold; say we want this combination to occur in at least 25% of the transactions that contain Bread.

Confidence (antecedent i.e. Bread and consequent i.e. Butter) = P (Consequent (Butter) is bought GIVEN antecedent (Bread) is bought)

![alt text](images/confidence.jpg "Confidence")

Result: The confidence value of 33.3% is above the threshold of 25%, meaning Butter is bought in about one third of the transactions that contain Bread. We can therefore be reasonably **_confident_** in the rule, and it satisfies the second criterion.
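A short sketch of the confidence calculation with the same counts. It also computes the reverse rule (Butter => Bread) to show that confidence depends on which item is the antecedent. The helper name `confidence` is illustrative.

```python
def confidence(count_both: int, count_antecedent: int) -> float:
    """Estimate P(consequent | antecedent) from transaction counts."""
    return count_both / count_antecedent


# Bread => Butter: 50 joint transactions out of 150 containing Bread.
conf_bread_to_butter = confidence(50, 150)
# Butter => Bread: the same 50 joint transactions out of 130 containing Butter.
conf_butter_to_bread = confidence(50, 130)

print(f"{conf_bread_to_butter:.1%}")  # 33.3%
print(f"{conf_butter_to_bread:.1%}")  # 38.5%
```

The two directions give different values because the denominators differ, so a rule and its reverse have to be evaluated separately.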

***
### 3. Lift ###
Lift is a metric that measures how much the purchase of the antecedent influences the purchase of the consequent. In our example, we want to know whether the purchase of Butter is independent of the purchase of Bread, or whether Butter is bought more often when Bread is bought. In probability terms, we want to know which is higher: P(Butter) or P(Butter | Bread). If the purchase of Butter is influenced by the purchase of Bread, then P(Butter | Bread) will be higher than P(Butter); in other words, the ratio of P(Butter | Bread) to P(Butter) will be greater than 1.

![alt text](images/lift.jpg "Lift")

Result: The lift value of 2.56 is greater than 1, which shows that the purchase of Butter is associated with the purchase of Bread rather than being independent of it. The lift value of 2.56 also means that Bread's purchase **_lifts_** Butter's purchase: Butter is 2.56 times as likely to be bought in a transaction that contains Bread as it is across all transactions.
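The lift value can be checked with the counts from the example. The sketch below also computes lift in the equivalent form support(Bread, Butter) / (support(Bread) × support(Butter)), and the number of joint transactions we would expect if the two purchases were independent. Variable names are illustrative.

```python
import math

total = 1000
n_bread, n_butter, n_both = 150, 130, 50

p_butter = n_butter / total            # P(Butter)
p_butter_given_bread = n_both / n_bread  # P(Butter | Bread), i.e. confidence

# Lift as the ratio of conditional to unconditional probability.
lift = p_butter_given_bread / p_butter

# Equivalent form: joint support divided by the product of individual supports.
lift_from_supports = (n_both / total) / ((n_bread / total) * (n_butter / total))

# Joint transactions expected under independence, versus the 50 observed.
expected_both = (n_bread / total) * (n_butter / total) * total

print(round(lift, 2))                          # 2.56
print(math.isclose(lift, lift_from_supports))  # True
print(round(expected_both, 1))                 # 19.5
```

Because the second form treats both items the same way, the lift of Butter => Bread is identical to the lift of Bread => Butter, unlike confidence.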



***
### Conclusion ###
Based on the findings above, we can support our initial hypothesis because we:

    a) Have support of 5% of transactions for Bread and Butter in the same basket.
    b) Have 33.3% confidence that Butter is bought whenever Bread is purchased.
    c) Know that Butter is 2.56 times as likely to be bought when Bread is purchased as it is across all transactions.

Therefore, we can conclude that there is indeed evidence to suggest that the purchase of Bread leads to the purchase of Butter. This is a valuable insight to guide management's decision-making. For example, managers of retail stores could start placing bread and butter close to each other, knowing that customers are highly likely to "impulsively" purchase them together, thereby increasing the store's revenue.
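The three metrics above were computed from summary counts. As a rough sketch, the same numbers can be derived directly from a list of baskets. The function `rule_metrics` and the filler item "Milk" are hypothetical and exist only to rebuild the 1,000-transaction example.

```python
from typing import Dict, FrozenSet, Iterable, List


def rule_metrics(
    transactions: List[FrozenSet[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> Dict[str, float]:
    """Support, confidence and lift for the rule antecedent => consequent.

    Assumes the antecedent and the consequent each occur at least once;
    otherwise the divisions below raise ZeroDivisionError.
    """
    a, c = frozenset(antecedent), frozenset(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= t)           # contains antecedent
    n_c = sum(1 for t in transactions if c <= t)           # contains consequent
    n_both = sum(1 for t in transactions if (a | c) <= t)  # contains both
    confidence = n_both / n_a
    return {
        "support": n_both / n,
        "confidence": confidence,
        "lift": confidence / (n_c / n),
    }


# Rebuild the example: 50 joint, 100 Bread only, 80 Butter only, 770 neither.
transactions = (
    [frozenset({"Bread", "Butter"})] * 50
    + [frozenset({"Bread"})] * 100
    + [frozenset({"Butter"})] * 80
    + [frozenset({"Milk"})] * 770  # hypothetical filler item
)

metrics = rule_metrics(transactions, {"Bread"}, {"Butter"})
print({name: round(value, 3) for name, value in metrics.items()})
# {'support': 0.05, 'confidence': 0.333, 'lift': 2.564}
```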


# <a name="implementation-in-python">Implementation in Python</a> #

## Importing libraries ##

In [4]:
%matplotlib inline
!pip install mlxtend
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

Collecting mlxtend
  Downloading mlxtend-0.21.0-py2.py3-none-any.whl (1.3 MB)
Installing collected packages: mlxtend
Successfully installed mlxtend-0.21.0


## Load Data ##

In [5]:
# load the data into a pandas dataframe and take a look at the first 10 rows
bread = pd.read_csv("BreadBasket_DMS.csv")
bread.head(10)

|   | Date | Time | Transaction | Item |
|---|------|------|-------------|------|
| 0 | 2016-10-30 | 09:58:11 | 1 | Bread |
| 1 | 2016-10-30 | 10:05:34 | 2 | Scandinavian |
| 2 | 2016-10-30 | 10:05:34 | 2 | Scandinavian |
| 3 | 2016-10-30 | 10:07:57 | 3 | Hot chocolate |
| 4 | 2016-10-30 | 10:07:57 | 3 | Jam |
| 5 | 2016-10-30 | 10:07:57 | 3 | Cookies |
| 6 | 2016-10-30 | 10:08:41 | 4 | Muffin |
| 7 | 2016-10-30 | 10:13:03 | 5 | Coffee |
| 8 | 2016-10-30 | 10:13:03 | 5 | Pastry |
| 9 | 2016-10-30 | 10:13:03 | 5 | Bread |


In [6]:
# check the summary info of the dataframe
bread.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21293 entries, 0 to 21292
Data columns (total 4 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   Date         21293 non-null  object
 1   Time         21293 non-null  object
 2   Transaction  21293 non-null  int64 
 3   Item         21293 non-null  object
dtypes: int64(1), object(3)
memory usage: 665.5+ KB


> Note: There are 21,293 rows and 4 columns in the dataframe. The `Date` and `Time` columns are stored as `object` (strings) rather than datetime, but fortunately there is a `Transaction` column that identifies each transaction. The `Item` column contains the individual items in that transaction. For example, Transaction No. 3 contains "Hot chocolate", "Jam", and "Cookies", which were all bought at the same time, i.e. 10:07:57 on 2016-10-30.
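Two points from the note can be illustrated on a few rows copied from the output above: parsing the string columns into a datetime, and collecting the items of each transaction. This is a sketch on a small hand-built frame, not the full dataset, and the `Timestamp` column name is an assumption introduced here.

```python
import pandas as pd

# A few rows copied from the head() output above.
sample = pd.DataFrame(
    {
        "Date": ["2016-10-30"] * 4,
        "Time": ["09:58:11", "10:07:57", "10:07:57", "10:07:57"],
        "Transaction": [1, 3, 3, 3],
        "Item": ["Bread", "Hot chocolate", "Jam", "Cookies"],
    }
)

# Combine the two string columns into a single datetime column.
sample["Timestamp"] = pd.to_datetime(sample["Date"] + " " + sample["Time"])

# Collect the items of each transaction into a list.
baskets = sample.groupby("Transaction")["Item"].apply(list)

print(baskets.loc[3])  # ['Hot chocolate', 'Jam', 'Cookies']
```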

## Checking for missing values

In [8]:
# check for missing values
bread.isnull().sum()

Date           0
Time           0
Transaction    0
Item           0
dtype: int64

In [9]:
# look for common text placeholders that isnull() does not detect
missing_value = ["NaN", "NONE", "None", "Nan", "nan", "nil", "none"]
print("There are {0} missing values in the dataframe.".format(len(bread[bread.Item.isin(missing_value)])))
bread[bread.Item.isin(missing_value)].head(10)

There are 786 missing values in the dataframe.


|     | Date | Time | Transaction | Item |
|-----|------|------|-------------|------|
| 26  | 2016-10-30 | 10:27:21 | 11 | NONE |
| 38  | 2016-10-30 | 10:34:36 | 15 | NONE |
| 39  | 2016-10-30 | 10:34:36 | 15 | NONE |
| 66  | 2016-10-30 | 11:05:30 | 29 | NONE |
| 80  | 2016-10-30 | 11:37:10 | 37 | NONE |
| 85  | 2016-10-30 | 11:55:51 | 40 | NONE |
| 126 | 2016-10-30 | 13:02:04 | 59 | NONE |
| 140 | 2016-10-30 | 13:37:25 | 65 | NONE |
| 149 | 2016-10-30 | 13:46:48 | 67 | NONE |
| 167 | 2016-10-30 | 14:32:26 | 75 | NONE |


> Note: While there are no empty cells in the dataframe, a check against common missing-value placeholders shows that there are 786 rows with "NONE" in the `Item` column. Since the items were not recorded, we remove these rows.
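The placeholder list above enumerates several spellings of the same words. One possible variation, sketched here on a small made-up Series rather than the real data, is to normalize case and whitespace first so that a shorter list covers every spelling:

```python
import pandas as pd

# Made-up values for illustration; not taken from the dataset.
items = pd.Series(["Bread", "NONE", " none ", "Coffee", "Nil", "nan"])

# Lower-case, whitespace-stripped placeholders to compare against.
placeholders = {"nan", "none", "nil"}

# Normalize each value before the membership test.
is_placeholder = items.str.strip().str.lower().isin(placeholders)

print(int(is_placeholder.sum()))  # 4
```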

In [11]:
bread = bread.drop(bread[bread.Item == "NONE"].index)
print("Number of rows:{0}".format(len(bread)))
bread.head(10)

Number of rows:20507


|   | Date | Time | Transaction | Item |
|---|------|------|-------------|------|
| 0 | 2016-10-30 | 09:58:11 | 1 | Bread |
| 1 | 2016-10-30 | 10:05:34 | 2 | Scandinavian |
| 2 | 2016-10-30 | 10:05:34 | 2 | Scandinavian |
| 3 | 2016-10-30 | 10:07:57 | 3 | Hot chocolate |
| 4 | 2016-10-30 | 10:07:57 | 3 | Jam |
| 5 | 2016-10-30 | 10:07:57 | 3 | Cookies |
| 6 | 2016-10-30 | 10:08:41 | 4 | Muffin |
| 7 | 2016-10-30 | 10:13:03 | 5 | Coffee |
| 8 | 2016-10-30 | 10:13:03 | 5 | Pastry |
| 9 | 2016-10-30 | 10:13:03 | 5 | Bread |
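As an alternative to dropping the rows after loading, the placeholder could be declared at read time so that pandas treats it as missing from the start. The sketch below uses the `na_values` parameter of `pd.read_csv` on a small in-memory CSV (three rows copied from the outputs above) instead of the real file:

```python
import io

import pandas as pd

# Three rows copied from the outputs above, held in memory for illustration.
csv_text = """Date,Time,Transaction,Item
2016-10-30,10:13:03,5,Bread
2016-10-30,10:27:21,11,NONE
2016-10-30,10:34:36,15,NONE
"""

# Treat the string "NONE" as a missing value while parsing.
df = pd.read_csv(io.StringIO(csv_text), na_values=["NONE"])
print(int(df["Item"].isnull().sum()))  # 2

# Rows without a recorded item can then be dropped with dropna().
cleaned = df.dropna(subset=["Item"])
print(len(cleaned))  # 1
```

With this approach, `isnull().sum()` reports the placeholder rows directly, so no separate string check is needed.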
