**Market Basket Analysis**

This notebook performs Market Basket Analysis on an online retail dataset. Market Basket Analysis is a technique retailers use to measure the strength of association between products that are purchased together and to identify patterns of co-occurrence.

It creates If-Then rules: for example, if item A is purchased, then item B is likely to be purchased. The rules are probabilistic in nature.
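To make "probabilistic" concrete, here is a minimal sketch of the three measures usually attached to such a rule: support, confidence, and lift. The transactions and item names are invented for illustration and do not come from the retail dataset.

```python
# Toy transactions (invented for illustration, not from the retail dataset).
transactions = [
    {"bread", "butter"},
    {"bread", "butter", "jam"},
    {"bread"},
    {"butter"},
    {"jam"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

antecedent, consequent = {"bread"}, {"butter"}

supp_rule = support(antecedent | consequent, transactions)   # P(A and B)
confidence = supp_rule / support(antecedent, transactions)   # P(B | A)
lift = confidence / support(consequent, transactions)        # P(B | A) / P(B)

print(f"support={supp_rule:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 suggests the two items are bought together more often than independence would predict.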

Pandas profiling is also used to create automated exploratory data analysis (EDA) reports for the initial analysis.

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load in 

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# Any results you write to the current directory are saved as output.

In [None]:
# import required libraries 

import pandas as pd
import numpy as np
import seaborn as sns # statistical plotting library built on matplotlib
import matplotlib.pyplot as plt

#Pandas-profiling is an open source library that can generate beautiful interactive reports for initial EDA.

import pandas_profiling
from pandas_profiling import ProfileReport

**Let's get into the fun part: analyzing the data set.**

In [None]:
# Import the dataset and check its first 5 rows
data = pd.read_csv('../input/online-retail-ii-uci/online_retail_II.csv')
data.head()

EDA for any data project requires checking the following:
* The data type of each variable
* The distribution of the target variable and the number of distinct values for each predictor variable
* Whether the data set contains duplicate or missing values

The pandas profiling package reports all of this.
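For reference, the same checks can be sketched by hand with plain pandas. The small frame below is invented for illustration; its column names only mimic the retail data.

```python
import pandas as pd

# Tiny invented frame standing in for the retail data.
toy = pd.DataFrame({
    "Invoice": ["1001", "1001", "1002", "1002"],
    "Description": ["MUG", "MUG", "PLATE", None],
    "Quantity": [2, 2, 1, 5],
})

distinct_counts = toy.nunique()        # distinct values per column (NaN excluded)
missing_counts = toy.isnull().sum()    # missing values per column
duplicate_rows = toy.duplicated().sum()  # rows that repeat an earlier row exactly

print(toy.dtypes)       # data type of each column
print(distinct_counts)
print(missing_counts)
print(duplicate_rows)
```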

In [None]:
# To install pandas profiling in a notebook:
# pip install pandas-profiling
# The report below gives an initial analysis of missing values, variable distributions, etc.

In [None]:
# Basic analysis reports
ProfileReport(data)

Data preprocessing and exploring

In [None]:
# Check the unique item descriptions
print('Unique Items: ', data['Description'].nunique())
print( '\n', data['Description'].unique())

In [None]:
# CustId and description have null values 
print(data.isnull().sum().sort_values(ascending=False)) 

In [None]:
#drop all null values
data.dropna(inplace = True) 

In [None]:
#No null values are present now
data.info() 

Which items do customers purchase most?

In [None]:
# Explore the most frequently purchased items within this time period
most_sold = data['Description'].value_counts().head(15)

print('Most Sold Items: \n')
print(most_sold)

In [None]:
# Bar plot of how often the most frequent items appear
plt.figure(figsize=(7,6))
most_sold.plot(kind='bar')
plt.title('Items Most Sold')

* Association rules are widely used to analyze retail basket or transaction data. They are intended to identify strong rules in transaction data using measures of interestingness.
* The next step is to create some rules.
* We use the Apriori algorithm to mine frequent itemsets and association rules. The algorithm employs a level-wise search for frequent itemsets.
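The level-wise search can be sketched in plain Python as follows. This is a simplified illustration on invented transactions, not the mlxtend implementation used later in the notebook; the function name apriori_sketch is hypothetical.

```python
from itertools import combinations

def apriori_sketch(transactions, min_support):
    """Return {frozenset: support} for all itemsets meeting min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(itemset <= t for t in transactions) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {s for s in current if support(s) >= min_support}

    frequent = {}
    k = 1
    while current:
        frequent.update({s: support(s) for s in current})
        k += 1
        # Join step: union pairs of frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent

# Invented transactions for illustration.
toy_transactions = [
    frozenset({"bread", "butter"}),
    frozenset({"bread", "butter", "jam"}),
    frozenset({"bread", "jam"}),
    frozenset({"butter"}),
]
result = apriori_sketch(toy_transactions, min_support=0.5)
for itemset, supp in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), supp)
```

The prune step is what keeps the search tractable: an itemset can only be frequent if all of its subsets are frequent, so most larger candidates are discarded without counting them.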

In [None]:
# UK has maximum records of sales followed by Germany
data['Country'].value_counts() 

For this analysis, only the records for Germany are used.

In [None]:
data = data.loc[data['Country'] == 'Germany'].copy()  # copy so later column assignments do not act on a slice
data.head()

Data Cleaning 

In [None]:
# Remove leading and trailing spaces from the description
data['Description'] = data['Description'].str.strip()

In [None]:
# Drop rows that have no invoice number
data.dropna(axis=0,subset=['Invoice'],inplace = True)

In [None]:
# Convert the invoice number to a string
data['Invoice'] = data['Invoice'].astype('str')

* Before using any rule mining algorithm, we need to transform the data from the data frame format into transactions, so that all the items bought together are in one row.
* Columns that are not required, such as the date and customer ID, can be removed; only the invoice number is kept here.
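As a small illustration of that transformation, the sketch below pivots a handful of invented line items into one row per invoice. The data is made up; only the column names follow the retail dataset.

```python
import pandas as pd

# Invented line items: one row per (invoice, product) line.
lines = pd.DataFrame({
    "Invoice": ["A1", "A1", "A2", "A2", "A2"],
    "Description": ["MUG", "PLATE", "MUG", "MUG", "SPOON"],
    "Quantity": [2, 1, 1, 3, 6],
})

basket = (
    lines.groupby(["Invoice", "Description"])["Quantity"]
    .sum()          # total quantity per invoice and product
    .unstack()      # products become columns
    .fillna(0)      # products absent from an invoice get 0
)
print(basket)
```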

In [None]:
# Separate the transactions for Germany
my_basket = (data[data['Country']=='Germany']
            .groupby(['Invoice','Description'])['Quantity']
            .sum().unstack().reset_index().fillna(0) 
            .set_index('Invoice'))

* The basket for Germany is shown below, keyed by invoice number. A value of 0 means the product is not present in that invoice.

* A positive number (1, 2, ...) means the product is part of that invoice.

* This is the transaction dataset: a matrix showing which items are bought together.

In [None]:
my_basket.head() 

Association analysis requires the data frame to contain only zeros and ones.
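One possible alternative to a cell-by-cell function is a vectorized comparison, sketched here on an invented quantity matrix. To my knowledge mlxtend's apriori also accepts a boolean frame, but treat that as an assumption to verify against the installed version.

```python
import pandas as pd

# Invented quantity matrix; the negative value mimics a returned item.
toy_basket = pd.DataFrame(
    {"MUG": [2.0, 0.0, -1.0], "PLATE": [0.0, 5.0, 1.0]},
    index=["A1", "A2", "A3"],
)

# True where the product appears in the invoice with a positive quantity.
toy_sets = toy_basket > 0
toy_sets_int = toy_sets.astype(int)  # 0/1 version of the same matrix
print(toy_sets_int)
```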

In [None]:
# Convert every value of 1 or more to 1 and everything else to 0
def my_encode_units(x):
    return 1 if x >= 1 else 0

In [None]:
my_basket_sets = my_basket.applymap(my_encode_units)

* Now train the model.
* Use conda install mlxtend to get apriori and association_rules.

In [None]:
# Import required libraries
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import association_rules, apriori

* We pass min_support=0.07, which keeps all itemsets with a support of at least 7%.
* We sort the rules by decreasing confidence.
* Then we look at a summary of the rules.

In [None]:
# Keep itemsets with a support of at least 7%
frequent_itemsets = apriori(my_basket_sets, min_support=0.07, use_colnames=True)

# Generate rules from the frequent itemsets; each rule has support, confidence, and lift
rules = association_rules(frequent_itemsets, metric='lift', min_threshold=1.0)

# Sort the rules by decreasing confidence
rules.sort_values('confidence', ascending=False)

How to use rules to make recommendations?

The transactions give us the rules, and recommendations can then be based on lift, confidence, and support.

Association rules are normally written like this: {ROUND SNACK BOXES SET OF 4 FRUITS} -> {ROUND SNACK BOXES SET OF4 WOODLAND}. This means there is a strong relationship between customers who purchased the FRUITS set and also purchased the WOODLAND set in the same transaction (the 15th line item in the data above).

Both antecedents and consequents can have multiple items.
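A minimal sketch of turning such rules into recommendations is shown below. It assumes a rules frame with frozenset-valued antecedents and consequents columns, as mlxtend produces; the toy_rules frame, its numbers, and the recommend helper are all hypothetical.

```python
import pandas as pd

# Hand-written stand-in for the rules frame; items and numbers are invented.
toy_rules = pd.DataFrame({
    "antecedents": [frozenset({"MUG"}), frozenset({"MUG", "PLATE"}), frozenset({"SPOON"})],
    "consequents": [frozenset({"PLATE"}), frozenset({"SPOON"}), frozenset({"MUG"})],
    "confidence": [0.60, 0.75, 0.40],
    "lift": [2.0, 3.5, 1.2],
})

def recommend(cart, rules, min_lift=1.0):
    """Suggest items whose rule antecedents are fully contained in `cart`."""
    cart = frozenset(cart)
    applicable = rules[
        rules["antecedents"].apply(lambda a: a <= cart) & (rules["lift"] >= min_lift)
    ].sort_values("confidence", ascending=False)
    suggestions = []
    for consequent in applicable["consequents"]:
        for item in sorted(consequent):
            # Skip items already in the cart or already suggested.
            if item not in cart and item not in suggestions:
                suggestions.append(item)
    return suggestions

print(recommend({"MUG", "PLATE"}, toy_rules))
```

Because antecedents are matched by subset, a rule with several antecedent items only fires when the cart contains all of them.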

In [None]:
# How many invoices contain ROUND SNACK BOXES SET OF 4 FRUITS (15th line item, ignoring postage)
my_basket_sets['ROUND SNACK BOXES SET OF 4 FRUITS'].sum()

In [None]:
# 15th line item in the consequents column: ROUND SNACK BOXES SET OF4 WOODLAND
my_basket_sets['ROUND SNACK BOXES SET OF4 WOODLAND'].sum()

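The rule has a lift of 4.26 and a confidence of 0.80, so customers who buy ROUND SNACK BOXES SET OF4 WOODLAND are also buying ROUND SNACK BOXES SET OF 4 FRUITS (134 times). A lift above 1 means the two items are bought together more often than would be expected if they were independent.

So we can recommend ROUND SNACK BOXES SET OF 4 FRUITS to people who are buying ROUND SNACK BOXES SET OF4 WOODLAND.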
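As a quick sanity check on how these two numbers relate: lift is confidence divided by the support of the consequent, so the quoted values imply the share of invoices that contain the rule's consequent item. The sketch below only rearranges that definition using the figures quoted above.

```python
# Figures quoted in the text above.
confidence = 0.80
lift = 4.26

# lift = confidence / support(consequent), so the consequent's support follows.
implied_consequent_support = confidence / lift
print(round(implied_consequent_support, 3))
```

In other words, the consequent appears in a bit under a fifth of all invoices, but in 80% of the invoices that contain the antecedent.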



The rules can be filtered on different conditions, here keeping those with a lift of at least 3 and a confidence of at least 30%.

In [None]:
rules[(rules['lift'] >=3) &
    (rules['confidence'] >= 0.3) ]

**Conclusion**

* The higher the lift value, the stronger the association between the items.
* We now know the associations between items and the common interests of the customers, so the business can make decisions, such as product placement, based on these findings.