# Project 1 - Market Basket Analysis for E-Commerce
### by Azubuogu Peace Udoka


## Introduction

### Background
As a data analyst at a retail company, I have access to a dataset containing customer transactions. The task is to perform market basket analysis to uncover patterns in customer purchasing behavior. By identifying which products tend to be bought together, the company can make informed decisions to improve sales and customer satisfaction.

In [1]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline
#set general style of plots
sns.set(rc = {'figure.figsize':(20,8)}, style="white", font_scale=1.2)

#import transaction encoder function and apriori from mlxtend. Ensure mlxtend is installed before running
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

### Understanding the Dataset

In [2]:
data = pd.read_csv("Market Basket Analysis - Groceries_dataset.csv")
print(f'There are {data.shape[0]} rows and {data.shape[1]} columns')

There are 38765 rows and 3 columns


In [3]:
data.head()

| | Member_number | Date | itemDescription |
|---|---|---|---|
| 0 | 1808 | 21-07-2015 | tropical fruit |
| 1 | 2552 | 05-01-2015 | whole milk |
| 2 | 2300 | 19-09-2015 | pip fruit |
| 3 | 1187 | 12-12-2015 | other vegetables |
| 4 | 3037 | 01-02-2015 | whole milk |


### Dataset Description
The dataset contains 38765 rows and 3 columns. Each row records a single item bought by a customer on a given date:
- Member_number: The unique identifier of the customer. Note that a customer may make multiple transactions.
- Date: The date on which the transaction took place.
- itemDescription: The item bought.


## Data Wrangling
To make the code easier to read, two columns will be given shorter, more intuitive names.

- Member_number will be changed to <b>cust_id</b>.

- itemDescription will be changed to <b>item</b>.


In [4]:
data.rename(columns={"Member_number": "cust_id", "itemDescription": "item"}, inplace=True)

In [5]:
data.head()

| | cust_id | Date | item |
|---|---|---|---|
| 0 | 1808 | 21-07-2015 | tropical fruit |
| 1 | 2552 | 05-01-2015 | whole milk |
| 2 | 2300 | 19-09-2015 | pip fruit |
| 3 | 1187 | 12-12-2015 | other vegetables |
| 4 | 3037 | 01-02-2015 | whole milk |


In [6]:
data.item.nunique()

167

The dataset contains 167 distinct items.

To find out which items are bought together, group the data by date and then by cust_id. Each group then holds the items bought by one customer on one day, which is treated as a single transaction.

The quantity of each item bought is not needed here. We only need to know which items were bought together. To show this more concisely, rearrange the dataframe so that the items bought together by each customer appear on the same row, separated by commas.


In [7]:
new_data = data.groupby(["Date","cust_id"], as_index=False)['item'].agg(','.join)
new_data

| | Date | cust_id | item |
|---|---|---|---|
| 0 | 01-01-2014 | 1249 | citrus fruit,coffee |
| 1 | 01-01-2014 | 1381 | curd,soda |
| 2 | 01-01-2014 | 1440 | other vegetables,yogurt |
| 3 | 01-01-2014 | 1659 | specialty chocolate,frozen vegetables |
| 4 | 01-01-2014 | 1789 | hamburger meat,candles |
| ... | ... | ... | ... |
| 14958 | 31-10-2015 | 4322 | brown bread,chocolate |
| 14959 | 31-10-2015 | 4675 | pip fruit,pastry |
| 14960 | 31-10-2015 | 4773 | salty snack,other vegetables,yogurt,other vege... |
| 14961 | 31-10-2015 | 4882 | tropical fruit,pickled vegetables |
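The grouping step can also be illustrated without pandas. The sketch below is only an illustration on invented rows, not the notebook's own method: the helper name `build_baskets` and the toy data are assumptions. It groups items by (date, customer) and, unlike the comma join above, keeps each item only once per basket. Row 14960 shows that the joined string can repeat an item, which makes no difference to a True/False encoding but is easier to read once deduplicated.

```python
from collections import defaultdict

# Toy rows in the same (cust_id, Date, item) layout as the dataset; values are invented.
rows = [
    (1001, "01-01-2014", "whole milk"),
    (1001, "01-01-2014", "yogurt"),
    (1001, "01-01-2014", "whole milk"),  # same item recorded twice in one visit
    (1002, "01-01-2014", "soda"),
    (1001, "02-01-2014", "rolls/buns"),
]


def build_baskets(rows):
    """Group items by (date, customer), keeping each item once in first-seen order."""
    baskets = defaultdict(list)
    for cust_id, date, item in rows:
        basket = baskets[(date, cust_id)]
        if item not in basket:
            basket.append(item)
    return dict(baskets)


baskets = build_baskets(rows)
for key, items in baskets.items():
    print(key, items)
```

Each key in `baskets` plays the role of one row of `new_data`, and each value is the list of items that would later be passed to the encoder.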


In [8]:
# convert items column to a list of lists to enable onehot encoding
transactions = new_data["item"].apply(lambda x: x.split(','))
transactions = list(transactions)

In [9]:
# fit the transaction encoder, then transform the transactions into a boolean (one-hot) table
encoder = TransactionEncoder().fit(transactions)
onehot = encoder.transform(transactions)
onehot = pd.DataFrame(onehot, columns = encoder.columns_)
onehot

| | Instant food products | UHT-milk | abrasive cleaner | artif. sweetener | baby cosmetics | bags | baking powder | bathroom cleaner | beef | berries | ... | turkey | vinegar | waffles | whipped/sour cream | whisky | white bread | white wine | whole milk | yogurt | zwieback |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 1 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 2 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | True | False |
| 3 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 4 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 14958 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 14959 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
| 14960 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | True | False |
| 14961 | False | False | False | False | False | False | False | False | False | False | ... | False | False | False | False | False | False | False | False | False | False |
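To show what the one-hot table above represents, here is a minimal pure-Python sketch of the same idea on three invented transactions. It is not how `TransactionEncoder` is implemented, and the function name `one_hot` is hypothetical. It only illustrates the result: one column per distinct item, one row per transaction, and True where the item is present.

```python
def one_hot(transactions):
    """Return (columns, rows) where rows[i][j] is True if transaction i contains columns[j]."""
    # One column per distinct item, in sorted order.
    columns = sorted({item for t in transactions for item in t})
    rows = []
    for t in transactions:
        present = set(t)  # repeated items collapse to a single True
        rows.append([item in present for item in columns])
    return columns, rows


# Invented transactions for illustration only.
toy = [
    ["curd", "soda"],
    ["other vegetables", "yogurt"],
    ["soda", "yogurt", "soda"],
]

columns, rows = one_hot(toy)
print(columns)
for row in rows:
    print(row)
```

The third toy transaction lists soda twice, yet its row contains a single True in the soda column, which is why item quantities play no role in the rest of the analysis.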


## Calculating Metrics
### 1. Support: 
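Support is the fraction of transactions that contain an item, so it shows how often the item is bought. Because each row of the one-hot table is one transaction, the mean of a column is that item's support.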

In [10]:
support = onehot.mean()
support.sort_values(ascending = False)

whole milk               0.157923
other vegetables         0.122101
rolls/buns               0.110005
soda                     0.097106
yogurt                   0.085879
                           ...   
frozen chicken           0.000334
bags                     0.000267
baby cosmetics           0.000200
kitchen utensil          0.000067
preservation products    0.000067
Length: 167, dtype: float64

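The results above show that whole milk is the most frequently bought item, appearing in about 15.8% of transactions, while kitchen utensil and preservation products are tied as the least bought, each with a support of 0.000067.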
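As a small worked example of the support calculation, the sketch below counts, for each item, the share of transactions that contain it. The transactions and the helper name `item_support` are invented for illustration. On a boolean one-hot table, the same numbers come out of taking column means.

```python
from collections import Counter


def item_support(transactions):
    """Support of each item: transactions containing it divided by all transactions."""
    n = len(transactions)
    # set(t) ensures an item is counted at most once per transaction.
    counts = Counter(item for t in transactions for item in set(t))
    return {item: count / n for item, count in counts.items()}


# Invented transactions for illustration only.
toy = [
    ["whole milk", "yogurt"],
    ["whole milk"],
    ["soda"],
    ["whole milk", "soda"],
]

toy_support = item_support(toy)
for item, value in sorted(toy_support.items(), key=lambda kv: -kv[1]):
    print(f"{item:<12} {value:.2f}")
```

Here whole milk appears in 3 of 4 transactions, so its support is 0.75, soda has 0.50, and yogurt has 0.25.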

In [11]:
support.describe()

count    167.000000
mean       0.015210
std        0.023381
min        0.000067
25%        0.002038
50%        0.005681
75%        0.017644
max        0.157923
dtype: float64

<b>Note</b>: With 167 items on sale, the number of possible item combinations is very large. To keep the analysis tractable, use the Apriori principle: every subset of a frequent itemset is also frequent, so any itemset that contains an infrequent subset can be pruned without counting it. The summary statistics above show that the median support is about 0.0057. Rounded down, 0.005 will be used as the support threshold. That is, itemsets with a support of at least 0.005 will be considered frequent.
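The pruning idea behind Apriori can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not mlxtend's implementation: the function name `frequent_itemsets`, the toy transactions, and the threshold of 0.4 are all invented here. Candidates of size k are built only from frequent sets of size k-1, and a candidate is dropped before counting if any of its (k-1)-item subsets is not frequent.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Level-wise search for itemsets whose support is at least min_support."""
    baskets = [frozenset(t) for t in transactions]
    n = len(baskets)

    def support(itemset):
        # Fraction of baskets that contain every item in the itemset.
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent single items.
    singles = {frozenset([item]) for b in baskets for item in b}
    current = {s: support(s) for s in singles}
    current = {s: v for s, v in current.items() if v >= min_support}
    result = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join step: combine frequent (k-1)-sets into candidate k-sets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in prev for sub in combinations(c, k - 1))
        }
        current = {c: support(c) for c in candidates}
        current = {c: v for c, v in current.items() if v >= min_support}
        result.update(current)
        k += 1
    return result


# Invented transactions for illustration only.
toy = [
    ["milk", "bread"],
    ["milk", "bread", "soda"],
    ["milk", "soda"],
    ["bread"],
    ["milk", "bread"],
]

result = frequent_itemsets(toy, min_support=0.4)
for itemset, value in sorted(result.items(), key=lambda kv: (len(kv[0]), -kv[1])):
    print(sorted(itemset), round(value, 2))
```

In this toy data the pair (bread, soda) appears in only 1 of 5 baskets, so it falls below the threshold. The three-item set (milk, bread, soda) is therefore pruned without ever being counted, which is the saving the Apriori principle provides.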

In [12]:
# applying Apriori algorithm
freq_items = apriori(onehot, min_support = 0.005, use_colnames = True)

In [13]:
len(freq_items)

126