# Advanced Certification Program in Computational Data Science

##  A program by IISc and TalentSprint

### Notebook: Market Basket Analysis 
Note: This is supplementary reading material provided with the Market Basket Analysis Mini-Project

## Learning Objectives

At the end of the experiment, you will be able to:

* extract summary level insight from a given customer dataset

* handle the missing data and identify the underlying pattern or structure

* identify customer segments based on the overall buying behaviour


## Dataset

The dataset chosen for this mini-project is the Online Retail dataset. It is a transnational data set that contains all the transactions occurring between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer.

The dataset contains 541909 records, and each record is made up of 8 fields.

To know more about the dataset, [click here](https://archive.ics.uci.edu/ml/datasets/Online+Retail)

## Information

#### Market Basket Analysis

Market Basket Analysis is one of the key techniques used by large retailers. It uncovers associations between items by looking for combinations of items that frequently occur together in transactions. In other words, it allows retailers to identify relationships between the items that people buy.

Association rule mining is widely used to analyze retail basket or transaction data. It is intended to identify strong rules in transaction data using measures of interestingness such as support, confidence and lift.

##### Example of Association Rules

* Assume there are 100 customers.
* 10 of them bought milk, 8 bought butter and 6 bought both.
* Rule: bought milk => bought butter
* support = P(Milk & Butter) = 6/100 = 0.06
* confidence = support/P(Milk) = 0.06/0.10 = 0.60
* lift = confidence/P(Butter) = 0.60/0.08 = 7.5
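
The arithmetic for the rule milk => butter can be checked with a few lines of Python. This is a minimal sketch that assumes the usual definitions for a rule A => B: confidence = support(A and B) / support(A), and lift = confidence / support(B). The variable names are illustrative only.

```python
# Counts taken from the milk and butter example.
n_customers = 100
n_milk = 10
n_butter = 8
n_both = 6

support = n_both / n_customers                 # P(Milk & Butter)
confidence = n_both / n_milk                   # P(Butter | Milk)
lift = confidence / (n_butter / n_customers)   # confidence relative to P(Butter)

print(f"support={support:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```

A lift well above 1 suggests that milk buyers purchase butter far more often than customers in general do.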


**Note:** In practice, a rule needs a support of several hundred transactions before it can be considered statistically significant, and datasets often contain thousands or millions of transactions.


To know more about the **market basket analysis** click [here](https://www.sciencedirect.com/topics/computer-science/market-basket-analysis)




### Import required packages

In [None]:
import numpy as np # Importing Numpy Package
from scipy import stats
import pandas as pd # Importing Pandas Package under the name pd
from mlxtend.frequent_patterns import apriori, association_rules # Importing apriori and association rules from mlxtend package

To know about **mlxtend.frequent_patterns** click [here](http://rasbt.github.io/mlxtend/api_subpackages/mlxtend.frequent_patterns/)

## Data Wrangling

In [None]:
#@title Download the data
!wget -qq https://cdn.iiith.talentsprint.com/CDS/Datasets/Online_Retail.xlsx
print("Data downloaded successfully")

#### Loading the data

In [None]:
data = pd.read_excel('Online_Retail.xlsx') # Loading the data

To know more about the **read_excel** function click [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html)

In [None]:
data.head() # Checking for the first five rows from the dataset

In [None]:
data.tail() # Checking for the last five rows from the dataset

In [None]:
data.shape # Checking for the number of rows and columns in the dataset

In [None]:
data.columns # Checking for the columns in the dataset

In [None]:
data.dtypes # Checking for the types of variables in the dataframe

## Exploratory Data Analysis

### Data Pre-processing

Checking for duplicate data using the **duplicated** function.

To know about the **duplicated** function click [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html)

In [None]:
data.duplicated()

We can see that there is a lot of redundant data. So, let's handle the redundant data by dropping them using the **drop_duplicates** function.

To know more about the **drop_duplicates** click [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html)

In [None]:
data.drop_duplicates(inplace = True) # Handling the redundant data by dropping them
data.shape # Checking for the shape of the data after dropping the redundant data
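
Because `duplicated()` returns a boolean Series, summing it gives the number of repeated rows, which is often easier to read than the full Series. The sketch below illustrates this on a small made-up frame; the values are toy data, not rows from the Online Retail file.

```python
import pandas as pd

# Toy frame (made-up values) in which the second row repeats the first.
toy = pd.DataFrame({
    "InvoiceNo": [536365, 536365, 536366],
    "StockCode": ["85123A", "85123A", "22633"],
    "Quantity": [6, 6, 2],
})

n_duplicates = int(toy.duplicated().sum())  # rows that repeat an earlier row
deduplicated = toy.drop_duplicates()        # keeps the first occurrence of each row

print(n_duplicates, deduplicated.shape)
```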

In the dataset, most invoices are normal transactions with positive quantities and prices, but some invoice numbers are prefixed with "C" or "A", which denote different transaction types: "C" marks a cancelled order and "A" marks an adjustment.

Now let's identify the cancelled orders and compare their count with the number of rows that have a non-positive value in the Quantity column.

In [None]:
len(data[data.InvoiceNo.astype(str).str.startswith('C')]), len(data[data.Quantity < 1]) # Counting cancelled invoices and rows with non-positive Quantity

Dropping the records containing the Cancelled orders.

To know about how to subset a pandas dataframe click [here](https://cmdlinetips.com/2018/02/how-to-subset-pandas-dataframe-based-on-values-of-a-column/)

In [None]:
data = data[~data.InvoiceNo.astype(str).str.startswith('C')] # Casting to str first, since InvoiceNo mixes numbers and strings
data.shape # Checking for the shape of the dataset after dropping the records which contain cancelled orders
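
The prefix check can be tried on a small made-up frame first. This is a sketch under the assumption that `InvoiceNo` holds a mix of integers and strings, so it is cast to `str` before testing the prefix; the invoice numbers below are toy values.

```python
import pandas as pd

# Toy invoices (made-up): one cancelled ("C") and one adjusted ("A").
toy = pd.DataFrame({
    "InvoiceNo": [536365, "C536379", 536381, "A563185"],
    "Quantity": [6, -1, 3, 1],
})

invoice_str = toy["InvoiceNo"].astype(str)
is_cancelled = invoice_str.str.startswith("C")
is_adjusted = invoice_str.str.startswith("A")

kept = toy[~is_cancelled]  # drop only the cancelled invoices
print(int(is_cancelled.sum()), int(is_adjusted.sum()), len(kept))
```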

### Descriptive statistics

Let's describe the statistics of the data using **describe().**

To know about the **describe** function click [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html)

In [None]:
data.describe()

Checking for empty records (null values) using the **isna** function.

To know more about the isna function click [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isna.html)

In [None]:
data.isna().sum()

Let's drop the empty records using the **dropna** function.

To know more about the dropna function click [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html)

In [None]:
data.dropna(inplace=True)
data.shape # Checking for the shape of the data after dropping the nan values
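
Calling `dropna()` with no arguments removes a row if any column is missing. The sketch below, on made-up values, contrasts that with the `subset` parameter, which restricts the check to chosen columns. Which variant suits a given analysis is a judgment call; the notebook itself drops every row with a missing value.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up): one row lacks a Description, another lacks a CustomerID.
toy = pd.DataFrame({
    "Description": ["MUG", None, "LAMP"],
    "CustomerID": [17850.0, 13047.0, np.nan],
})

print(toy.isna().sum())  # missing values per column

drop_any = toy.dropna()                                     # drop rows with any missing value
drop_description_only = toy.dropna(subset=["Description"])  # check only Description
print(len(drop_any), len(drop_description_only))
```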

From the dataset, we can see that some of the transactions, judging by the `StockCode` variable, are not actual products but represent costs or fees for postage, bank charges or other transactions.

Let's assume that transactions with `'POST' 'PADS' 'M' 'DOT' 'C2' 'BANK CHARGES'` as their `StockCode` are irrelevant.

In [None]:
stock_codes = data['StockCode'].astype('str').unique() # Collecting all unique stock codes
stock_codes.sort() # Sorting so that the non-numeric codes are grouped at the end
print('Stock codes (reverse order, non-numeric codes first): \n', stock_codes[::-1][:100])
print(data.shape) # Checking for the shape of the data
data = data[~(data['StockCode'].isin(['POST', 'PADS', 'M', 'DOT', 'C2', 'BANK CHARGES']))] # Dropping irrelevant data
print(data.shape) # Checking for the shape of the data after removing irrelevant data

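We can see that there are outliers in the `UnitPrice` and `Quantity` variables. Let's handle them by calculating the z-score of each variable and keeping only the rows with |z| < 3 for `UnitPrice` and |z| < 5 for `Quantity`.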

To know about how to handle outliers click [here](https://kanoki.org/2020/04/23/how-to-remove-outliers-in-python/)

In [None]:
data = data[(np.abs(stats.zscore(data['UnitPrice']))<3) & (np.abs(stats.zscore(data['Quantity']))<5)]
data.shape # Checking for the shape of the data
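
The z-score of a value is its distance from the mean, measured in standard deviations. The sketch below reproduces that calculation with NumPy on made-up quantities, using the population standard deviation (`ddof=0`), which is also the default of `scipy.stats.zscore`. The threshold of 3 is used here purely for illustration.

```python
import numpy as np

# Toy quantities (made-up): twenty ordinary values and one extreme value.
quantity = np.array([10.0] * 20 + [1000.0])

z = (quantity - quantity.mean()) / quantity.std()  # population std (ddof=0)
kept = quantity[np.abs(z) < 3]                     # drop values 3 or more std devs from the mean

print(len(quantity), len(kept))
```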

We need to consolidate the items into one transaction per row, with each product one-hot encoded. To keep the data set small, we will look only at sales for France, grouping by `InvoiceNo` and `Description` and summing `Quantity`.

In [None]:
basket_France = (data[data['Country'] =="France"].groupby(['InvoiceNo', 'Description'])['Quantity'].sum().unstack().reset_index().fillna(0).set_index('InvoiceNo'))
basket_France
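
The same chain of `groupby`, `unstack` and `fillna` can be followed more easily on a tiny made-up example. The sketch below uses toy invoices and single-letter descriptions; only the column names match the dataset.

```python
import pandas as pd

# Toy line items (made-up): invoice 2 lists product "A" twice.
toy = pd.DataFrame({
    "InvoiceNo": [1, 1, 2, 2, 2],
    "Description": ["A", "B", "A", "A", "C"],
    "Quantity": [2, 1, 3, 1, 5],
})

toy_basket = (
    toy.groupby(["InvoiceNo", "Description"])["Quantity"]
    .sum()                  # total quantity per invoice and product
    .unstack()              # products become columns, invoices become rows
    .reset_index()
    .fillna(0)              # products absent from an invoice get quantity 0
    .set_index("InvoiceNo")
)
print(toy_basket)
```

Each row is now one invoice and each column one product, which is the layout the encoding step expects.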

We can see that there are a lot of zeros in the data. We also need to make sure that any positive value is converted to 1 and anything less than or equal to 0 is set to 0.

To know more about one hot encoding click [here](https://towardsdatascience.com/categorical-encoding-using-label-encoding-and-one-hot-encoder-911ef77fb5bd)

In [None]:
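def hot_encode(x):
    # Any positive quantity means the item was in the basket; zero or negative means it was not
    return 1 if x > 0 else 0

# On pandas >= 2.1, DataFrame.map is the non-deprecated equivalent of applymap
basket_France = basket_France.applymap(hot_encode)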
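
As an alternative to applying a function cell by cell, the same encoding can be expressed as a single vectorised comparison. This is a sketch on a made-up basket, not the notebook's method; it produces booleans rather than 0/1 integers, which `apriori` is generally understood to accept as well.

```python
import pandas as pd

# Toy basket (made-up quantities): rows are invoices, columns are products.
toy_basket = pd.DataFrame(
    {"A": [2.0, 4.0], "B": [1.0, 0.0], "C": [0.0, 5.0]},
    index=pd.Index([1, 2], name="InvoiceNo"),
)

encoded = toy_basket > 0  # True where the product appears in the invoice
print(encoded)
```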

Now let's generate frequent item sets that have a support of at least 5%

In [None]:
frq_items = apriori(basket_France, min_support = 0.05, use_colnames = True) 
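
To see what "support of at least 5%" means, it can help to compute supports by hand on a few made-up transactions. The sketch below enumerates every candidate itemset by brute force; it is not the Apriori algorithm, which prunes candidates, and the items and the 0.5 threshold are illustrative assumptions.

```python
from itertools import combinations

# Toy transactions (made-up), each a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
min_support = 0.5  # illustrative threshold

items = sorted(set().union(*transactions))
frequent = {}
for size in range(1, len(items) + 1):
    for candidate in combinations(items, size):
        # Support = fraction of transactions containing every item of the candidate
        count = sum(1 for t in transactions if set(candidate) <= t)
        support = count / len(transactions)
        if support >= min_support:
            frequent[candidate] = support

for itemset, support in frequent.items():
    print(itemset, support)
```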

Now let's generate the rules with their corresponding support, confidence and lift

In [None]:
rules = association_rules(frq_items, metric ="lift", min_threshold = 1) 
rules = rules.sort_values(['confidence', 'lift'], ascending =[False, False]) 
print(rules.head()) 
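
The filtering and sorting above can be mirrored by hand on a few made-up transactions. The sketch below considers only rules with a single item on each side, keeps those with lift of at least 1, and sorts by confidence and then lift, both descending. It is a simplified illustration, not a reimplementation of `association_rules`.

```python
from itertools import permutations

# Toy transactions (made-up), each a set of purchased items.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(1 for t in transactions if itemset <= t) / n

items = sorted(set().union(*transactions))
rules = []
for a, b in permutations(items, 2):      # candidate rule a => b
    joint = support({a, b})
    if joint == 0:
        continue
    confidence = joint / support({a})
    lift = confidence / support({b})
    if lift >= 1:                        # same threshold as min_threshold = 1
        rules.append((a, b, joint, confidence, lift))

# Highest confidence first, ties broken by lift
rules.sort(key=lambda r: (r[3], r[4]), reverse=True)
for a, b, joint, confidence, lift in rules:
    print(f"{a} => {b}: support={joint:.2f}, confidence={confidence:.2f}, lift={lift:.2f}")
```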