# Market Basket Analysis
GitHub location: https://github.com/LarsTinnefeld/olist_ecom_analysis.git

**Part of the Olist e-commerce business analysis project.**

<img src="https://i2.wp.com/dataneophyte.com/wp-content/uploads/2019/12/Logo-01.png" width="400" height="300">


## Questions to answer
...

## Table of Contents

I. [Data Import and Wrangling](#data)<br>
II. [Exploratory Data Analysis](#eda)<br>
III. [Market Basket Analysis](#affinity)<br>

---
## <a class="anchor" id="data">I. Data Import and Wrangling</a>

### 1. Libraries

In [1]:
import pandas as pd
import numpy as np

from apyori import apriori

import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
import seaborn as sns
from datetime import datetime as dt
%matplotlib inline
sns.set_style("whitegrid")

### 2. Importing order data
Part of the data was inherited from the initial Olist data analysis.

General data structure:

<img src="https://i.imgur.com/HRhd2Y0.png" width="700" height="450">

We will import the already cleaned dataset from the shared data of the Olist business analysis.

<img src="https://github.com/LarsTinnefeld/olist_ecom_analysis/blob/main/Olist-Analysis_1_New_tables.PNG?raw=true" width="700" height="400">

Importing order data

In [2]:
df_orders = pd.read_csv('../0 - data/df_orders_consolidated.csv')

In [3]:
df_orders.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,order_id,customer_id,order_status,order_purchase_timestamp,order_approved_at,order_delivered_carrier_date,order_delivered_customer_date,order_estimated_delivery_date,...,product_name_lenght,product_description_lenght,product_photos_qty,product_weight_g,product_length_cm,product_height_cm,product_width_cm,order_line_cube_in_ltr,price_round,customer_unique_id
0,0,0,2e7a8482f6fb09756ca50c10d7bfc047,08c5351a6aca1c1589a38f244edeee9d,shipped,2016-09-04,2016-10-07,2016-10-18,2016-11-09,2016-10-20,...,59.0,426.0,2.0,1400.0,32.0,6.0,28.0,5.376,40.0,b7d76e111c89f7ebf14761390f0f7d17
1,1,1,35d3a51724a47ef1d0b89911e39cc4ff,27ab53f26192510ff85872aeb3759dcc,delivered,2016-10-04,2016-10-05,2016-10-14,2016-10-26,2016-12-20,...,59.0,426.0,2.0,1400.0,32.0,6.0,28.0,5.376,40.0,f922896769e9517ea3c630f3c8de86d0
2,2,2,c4f710df20f7d1500da1aef81a993f65,4b671f05b6eb9dc1d2c1bae9c8c78536,delivered,2016-10-10,2016-10-10,2016-10-18,2016-10-26,2016-12-14,...,59.0,426.0,2.0,1400.0,32.0,6.0,28.0,5.376,40.0,0ecf7f65b5ff3b9e61b637e59f495e0a
3,3,3,81e5043198a44ddeb226002ff55d8ad4,ddd15ef77c83eea8c534d2896173a927,delivered,2017-01-09,2017-01-09,2017-01-09,2017-02-24,2017-02-24,...,59.0,426.0,2.0,1400.0,32.0,6.0,28.0,10.752,40.0,853ba75a0b423722ccf270eea3b4cfe4
4,4,4,03b218d39c422c250f389120c531b61f,db857a86c685a6a3a02a705961ec1ff1,delivered,2017-01-14,2017-01-14,2017-01-16,2017-01-18,2017-03-01,...,59.0,426.0,2.0,1400.0,32.0,6.0,28.0,5.376,40.0,c83d504c46170342ddbc93c762e0e4ec


This is a cleaned data table from the previous analysis. All missing values have been filled in, duplicate entries removed, and order lines consolidated. The only remaining step is to convert the date fields to a date format.

In [10]:
def convert_to_dt(dat, cols):
    '''
    Convert the given columns of a dataframe to dates, in place.
    Input:
    - dat: dataframe
    - cols: list with the names of the date columns to convert
    Output:
    - None (the columns are overwritten with datetime.date objects,
      so pandas reports their dtype as object)
    '''
    for col in cols:
        dat[col] = pd.to_datetime(dat[col]).dt.date

In [11]:
convert_to_dt(df_orders, [
    'order_purchase_timestamp',
    'order_approved_at',
    'order_delivered_carrier_date',
    'order_delivered_customer_date',
    'order_estimated_delivery_date'
    ])
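A quick check on a made-up frame illustrates what the conversion does to the column dtype. This is only a sketch: the toy values are invented, and `.dt.normalize()` is shown as one possible alternative, not as the method used in this notebook.

```python
import pandas as pd

# Toy frame with one date column stored as text (made-up values).
toy = pd.DataFrame({"order_purchase_timestamp": ["2016-09-04", "2016-10-04"]})

# Variant used in convert_to_dt: .dt.date yields Python date objects,
# so the resulting column has dtype object.
as_date = pd.to_datetime(toy["order_purchase_timestamp"]).dt.date

# Possible alternative: .dt.normalize() sets the time to midnight
# but keeps a datetime64 dtype, so the .dt accessor stays available.
as_datetime = pd.to_datetime(toy["order_purchase_timestamp"]).dt.normalize()

print(as_date.dtype)
print(as_datetime.dtype)
```

With `.dt.date` the columns show up as `object` in `df_orders.info()`; date arithmetic still works, but vectorized accessors such as `.dt.year` are only available on the datetime64 variant.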

In [12]:
df_orders.drop(['Unnamed: 0', 'Unnamed: 0.1'], axis=1, inplace=True)
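The `head()` output above lists three `Unnamed` columns, which typically appear when a frame is saved to CSV together with its index and read back. As a hedged sketch on a made-up CSV, the leftover columns could be dropped by name pattern instead of listing each one:

```python
import io
import pandas as pd

# Made-up CSV that mimics a frame saved twice with its index included.
csv_text = io.StringIO(
    "Unnamed: 0,Unnamed: 0.1,order_id,order_status\n"
    "0,0,a1,shipped\n"
    "1,1,b2,delivered\n"
)
toy = pd.read_csv(csv_text)

# Collect every leftover index column by its name prefix and drop them all.
unnamed_cols = [col for col in toy.columns if col.startswith("Unnamed")]
toy = toy.drop(columns=unnamed_cols)

print(toy.columns.tolist())  # ['order_id', 'order_status']
```

Writing the file with `to_csv(..., index=False)` avoids creating these columns in the first place.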

In [14]:
df_orders.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 102425 entries, 0 to 102424
Data columns (total 31 columns):
 #   Column                         Non-Null Count   Dtype  
---  ------                         --------------   -----  
 0   order_id                       102425 non-null  object 
 1   customer_id                    102425 non-null  object 
 2   order_status                   102425 non-null  object 
 3   order_purchase_timestamp       102425 non-null  object 
 4   order_approved_at              102425 non-null  object 
 5   order_delivered_carrier_date   102425 non-null  object 
 6   order_delivered_customer_date  102425 non-null  object 
 7   order_estimated_delivery_date  102425 non-null  object 
 8   order_time                     102425 non-null  object 
 9   delivery_time                  102425 non-null  object 
 10  date_ordinal                   102425 non-null  int64  
 11  shipping_time_delta            102425 non-null  int64  
 12  shipping_duration             

---
## <a class="anchor" id="affinity">III. Market Basket Analysis</a>

Questions to answer:

**Can we predict buying behaviour between articles in one order (association)?**

I will try to answer this question with an affinity analysis (market basket analysis) using the Apyori library. The basic formulas come first.

### 1. Basics

How likely SKU B is to be purchased when SKU A is purchased: $$ Confidence(A\Rightarrow B) = \frac{frq(A, B)}{frq(A)} $$

How often SKUs A and B are purchased together: $$ Support(A\Rightarrow B) = \frac{frq(A, B)}{N} $$

In other words: the number of orders that contain both products over the total number of orders $N$ in the dataset. The support of a single SKU is defined the same way, $Support(A) = \frac{frq(A)}{N}$.

How much more often SKUs A and B are bought together than expected if they were independent: $$ Lift(A\Rightarrow B) = \frac{Support(A\Rightarrow B)}{Support(A) \cdot Support(B)} $$
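The three measures can be traced by hand on a small example. The sketch below uses five made-up orders and a helper `frq` that is defined here only for illustration; it mirrors the formulas above and is not part of Apyori.

```python
# Five made-up orders; each order is the set of SKUs it contains.
orders = [
    {"A", "B"},
    {"A", "B", "C"},
    {"A"},
    {"B", "C"},
    {"C"},
]
n_orders = len(orders)


def frq(*skus):
    """Number of orders that contain all of the given SKUs."""
    return sum(1 for order in orders if set(skus) <= order)


# frq(A) = 3, frq(B) = 3, frq(A, B) = 2, N = 5
support_ab = frq("A", "B") / n_orders
confidence_ab = frq("A", "B") / frq("A")
lift_ab = support_ab / ((frq("A") / n_orders) * (frq("B") / n_orders))

print(f"support={support_ab:.2f} confidence={confidence_ab:.2f} lift={lift_ab:.2f}")
# prints: support=0.40 confidence=0.67 lift=1.11
```

A lift above 1 means A and B appear together more often than independence would suggest; a lift near 1 means the rule carries little information.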

The orders are so small (93% are single-line orders, SLOs) that I suspect the associations will be very weak. As a reminder, the order profile:

In [None]:
# sns.countplot(df_upo_lpo['lines']);

Associations can only be extracted from orders with two or more lines.

I will extract these multi-line orders (MLOs) and run the affinity analysis on them.

### 2. Preparing data

Only multi-line orders are interesting, because here we can observe which categories are ordered together.

In [None]:
# List with all multi-line orders
#lst_mlo = df_upo_lpo[df_upo_lpo['lines']>1]['order_id'].tolist()

In [None]:
# Extract dataframe with only multi-line orders
#df_mlo = df_orders_consolidated[df_orders_consolidated['order_id'].isin(lst_mlo)]
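The two commented cells rely on `df_upo_lpo`, which is not defined in this section. As a sketch under that caveat, the same filter can be written directly on the order lines with a group count; the toy frame below is made up and only reuses the column names from the cells above.

```python
import pandas as pd

# Made-up order lines; column names follow the tables used above.
toy_lines = pd.DataFrame({
    "order_id": ["o1", "o2", "o2", "o3", "o3", "o3"],
    "product_category_name_english": ["toys", "toys", "baby", "auto", "auto", "tools"],
})

# Attach to every line the number of lines of its order.
lines_per_order = toy_lines.groupby("order_id")["order_id"].transform("count")

# Keep only the lines that belong to multi-line orders.
toy_mlo = toy_lines[lines_per_order > 1]

print(sorted(toy_mlo["order_id"].unique()))  # ['o2', 'o3']
```

This avoids the intermediate list of order IDs and keeps the filter in one step.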

In [3]:
# Generate sparse matrix by using my matrix function
#mat_mlo_cat = create_category_order_matrix(df_mlo, 'order_id', 'product_category_name_english', 'qty')

In [4]:
#mat_mlo_cat = mat_mlo_cat['qty']
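`create_category_order_matrix` is a helper from the earlier analysis and is not shown here. A minimal stand-in, assuming it pivots order lines into an order-by-category quantity matrix, could look like the following; the toy data is made up and the real helper may differ in its details.

```python
import pandas as pd

# Made-up multi-line order lines with a quantity per line.
toy_mlo = pd.DataFrame({
    "order_id": ["o2", "o2", "o3", "o3", "o3"],
    "product_category_name_english": ["toys", "baby", "auto", "auto", "tools"],
    "qty": [1, 2, 1, 1, 3],
})

# Rows are orders, columns are categories, cells hold the summed quantity.
# Categories that do not occur in an order are filled with 0.
toy_matrix = toy_mlo.pivot_table(
    index="order_id",
    columns="product_category_name_english",
    values="qty",
    aggfunc="sum",
    fill_value=0,
)

print(toy_matrix)
```

Passing a single column name as `values` yields flat category columns, so no extra `['qty']` selection is needed in this variant.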

### Support Score
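The support score of each product category in this table is simply its column mean, i.e. the share of multi-line orders that contain the category. This assumes the matrix entries are 0/1 flags; if they are quantities, the matrix has to be binarized first.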

In [None]:
#mat_mlo_cat.mean().sort_values(ascending=False).head(20)
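The difference between the mean of quantities and the mean of 0/1 flags can be seen on a small made-up matrix. The sketch assumes an order-by-category layout like the one described above.

```python
import pandas as pd

# Made-up order-by-category quantity matrix (rows: orders, columns: categories).
toy_matrix = pd.DataFrame(
    {"auto": [0, 2, 1, 0], "baby": [2, 0, 0, 1], "toys": [1, 0, 1, 1]},
    index=["o1", "o2", "o3", "o4"],
)

# Binarize first: a cell becomes 1 if the order contains the category at all.
toy_binary = (toy_matrix > 0).astype(int)

# The column mean of the flags is the share of orders containing each category.
toy_support = toy_binary.mean().sort_values(ascending=False)

print(toy_support)
```

Here `auto` appears in two of four orders, so its support is 0.5, while the mean of the raw quantities would be 0.75.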

### 3. Performing Analysis

In [5]:
# Apyori needs a list with all MLOs, where each MLO is a list of the categories it contains
#records = [
#    mat_mlo_cat.columns[row > 0].tolist()
#    for row in mat_mlo_cat.values
#]

In [6]:
# Applying list to apyori algorithm (apyori expects the list of transactions, not the matrix)
#affinity_model = apriori(records, min_support=0.01, min_confidence=0.2)
#affinity_results = list(affinity_model)

# In Process ...