# Olist Dataset

This notebook prepares the Olist dataset for ingestion into the Streamlit app, so that customer segmentation and customer lifetime value (CLV) prediction can be applied to it.

The dataset can be found here: *https://www.kaggle.com/datasets/olistbr/brazilian-ecommerce?resource=download*

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

## Data Ingestion

In [2]:
orders_dataset = pd.read_csv('../data/olist_orders_dataset.csv')
orders_dataset = orders_dataset[["order_id", "customer_id", "order_purchase_timestamp"]]
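As a sketch of an alternative, the column selection and the timestamp parsing can be pushed into `read_csv` itself through `usecols` and `parse_dates`. So that it runs without the Kaggle files, the snippet reads an in-memory CSV holding the two order rows shown in this notebook; the `order_status` column and its values are placeholders standing in for the columns that get dropped.

```python
import io

import pandas as pd

# Stand-in for ../data/olist_orders_dataset.csv.
# The order_status column and its values are placeholders for illustration.
csv_text = """order_id,customer_id,order_status,order_purchase_timestamp
e481f51cbdc54678b7cc49136f2d6af7,9ef432eb6251297304e76186b10a928d,delivered,2017-10-02 10:56:33
53cdb2fc8bc7dce0b6741e2150273451,b0830fb4747a6c6d20dea0b8c802d7ef,delivered,2018-07-24 20:41:37
"""

orders_sketch = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["order_id", "customer_id", "order_purchase_timestamp"],  # read only what is needed
    parse_dates=["order_purchase_timestamp"],  # parse timestamps while loading
)

print(orders_sketch.dtypes)
```

Reading only the required columns keeps memory use down on the larger Olist files, and parsing dates at load time removes the need for a separate `pd.to_datetime` call.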

In [3]:
orders_dataset.head(2)

|   | order_id | customer_id | order_purchase_timestamp |
|---|----------|-------------|--------------------------|
| 0 | e481f51cbdc54678b7cc49136f2d6af7 | 9ef432eb6251297304e76186b10a928d | 2017-10-02 10:56:33 |
| 1 | 53cdb2fc8bc7dce0b6741e2150273451 | b0830fb4747a6c6d20dea0b8c802d7ef | 2018-07-24 20:41:37 |


In [4]:
payments_dataset = pd.read_csv('../data/olist_order_payments_dataset.csv')
payments_dataset = payments_dataset[["order_id", "payment_value"]]

In [5]:
payments_dataset.head(2)

|   | order_id | payment_value |
|---|----------|---------------|
| 0 | b81ef226f3fe1789b1e8b2acac839d17 | 99.33 |
| 1 | a9810da82917af2d9aefd1278f1dcfa0 | 24.39 |


## Data Cleaning

In [6]:
# Merge orders with their payment records on order_id
dataframe = pd.merge(orders_dataset, payments_dataset, on="order_id", how="inner")
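The `dataframe.head()` output further down shows three rows for one customer on a single date, which suggests that an order can carry several payment records and that the inner join therefore yields one row per payment rather than per order. A minimal sketch on toy data (the order IDs, customer IDs and amounts are made up) shows how the `validate` argument of `pd.merge` can assert that expectation:

```python
import pandas as pd

# Toy frames; all IDs and amounts are illustrative only.
orders_toy = pd.DataFrame({
    "order_id": ["o1", "o2"],
    "customer_id": ["c1", "c2"],
    "order_purchase_timestamp": ["2017-10-02 10:56:33", "2018-07-24 20:41:37"],
})
payments_toy = pd.DataFrame({
    "order_id": ["o1", "o1", "o1", "o2"],
    "payment_value": [18.12, 2.00, 18.59, 141.46],
})

# validate="one_to_many" raises a MergeError if order_id is duplicated
# on the orders side, while allowing several payments per order.
merged_toy = pd.merge(
    orders_toy, payments_toy, on="order_id", how="inner", validate="one_to_many"
)

print(merged_toy.shape)  # (4, 4): one row per payment record
```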

In [7]:
dataframe.drop("order_id", axis=1, inplace=True)

In [8]:
dataframe.rename({"customer_id": "CustomerID",
                 "order_purchase_timestamp":"InvoiceDate",
                 "payment_value": "Revenue"}, 
                 axis=1,
                 inplace=True)

In [9]:
dataframe["InvoiceDate"] = pd.to_datetime(dataframe["InvoiceDate"]).dt.date

In [10]:
dataframe.head()

|   | CustomerID | InvoiceDate | Revenue |
|---|------------|-------------|---------|
| 0 | 9ef432eb6251297304e76186b10a928d | 2017-10-02 | 18.12 |
| 1 | 9ef432eb6251297304e76186b10a928d | 2017-10-02 | 2.0 |
| 2 | 9ef432eb6251297304e76186b10a928d | 2017-10-02 | 18.59 |
| 3 | b0830fb4747a6c6d20dea0b8c802d7ef | 2018-07-24 | 141.46 |
| 4 | 41ce2a54c0b03bf3443c3d931a367089 | 2018-08-08 | 179.12 |


In [11]:
dataframe.shape

(103886, 3)

In [12]:
dataframe.CustomerID.nunique()

99440

In [13]:
dataframe.dtypes

CustomerID      object
InvoiceDate     object
Revenue        float64
dtype: object
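The `object` dtype reported for `InvoiceDate` comes from `.dt.date`, which returns plain Python `date` objects. As a sketch of an alternative, `.dt.normalize()` also discards the time of day but keeps the column as `datetime64`, so no later reconversion is needed. The two timestamps are taken from the orders shown above.

```python
import pandas as pd

stamps = pd.Series(pd.to_datetime(["2017-10-02 10:56:33", "2018-07-24 20:41:37"]))

as_date = stamps.dt.date             # Python date objects, dtype object
as_midnight = stamps.dt.normalize()  # datetime64, time of day set to 00:00:00

print(as_date.dtype, as_midnight.dtype)
```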

In [14]:
# Assess what share of customers appear more than once in the merged data
purchase_counts = dataframe.groupby("CustomerID")["InvoiceDate"].count()

repeat_cust = (purchase_counts > 1).sum()

# The .2% format spec multiplies the fraction by 100 before printing
print(f"{repeat_cust / purchase_counts.shape[0]:.2%} repeat customers")

2.98% repeat customers


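Only about 3% of customers (2,961 of 99,440) appear more than once in the merged data, so the analysis proceeds with this subset: customers who have shopped at the Olist online store repeatedly. Note that the count is taken over payment rows, so a customer whose order has several payment records (as the first three rows of `dataframe.head()` above appear to show) is also counted as a repeat customer.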
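The count above is taken over rows of the merged data. As a hedged sketch, the same share can be computed with `groupby(...).size()`, and, under the assumption that a repeat purchase means more than one distinct purchase date, with `nunique()`. The toy frame below is made up for illustration.

```python
import pandas as pd

# Toy data: customer "a" has three payment rows on one date,
# "b" has a single row, "c" bought on two different dates.
toy = pd.DataFrame({
    "CustomerID": ["a", "a", "a", "b", "c", "c"],
    "InvoiceDate": ["2017-10-02", "2017-10-02", "2017-10-02",
                    "2018-07-24", "2018-01-05", "2018-03-09"],
    "Revenue": [18.12, 2.00, 18.59, 141.46, 50.00, 75.00],
})

rows_per_customer = toy.groupby("CustomerID").size()
dates_per_customer = toy.groupby("CustomerID")["InvoiceDate"].nunique()

# The mean of a boolean Series is the fraction of True values.
share_by_rows = (rows_per_customer > 1).mean()    # "a" and "c" qualify
share_by_dates = (dates_per_customer > 1).mean()  # only "c" qualifies

print(f"{share_by_rows:.2%} by row count, {share_by_dates:.2%} by distinct dates")
```

Which definition is appropriate depends on what the downstream CLV model treats as a transaction.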

In [15]:
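# Count the rows per customer; the count lands in the InvoiceDate column
df_tmp = pd.DataFrame(dataframe.groupby("CustomerID")["InvoiceDate"].count()).reset_index()

# Attach the count to every row; pandas suffixes the clashing columns with _x and _y
df_new = pd.merge(df_tmp, dataframe, on="CustomerID")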

In [16]:
df_new.rename({"InvoiceDate_x": "BuyCount",
              "InvoiceDate_y": "InvoiceDate"}, axis= 1, inplace= True)

In [17]:
final_df = df_new.loc[df_new["BuyCount"] > 1]
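An equivalent way to attach the per-customer count, sketched here on made-up data, is `groupby(...).transform("count")`, which broadcasts the count back onto each row and so avoids the merge and the subsequent rename.

```python
import pandas as pd

# Toy data; IDs and amounts are illustrative only.
toy = pd.DataFrame({
    "CustomerID": ["a", "a", "b", "c", "c", "c"],
    "InvoiceDate": ["2017-04-20", "2017-04-20", "2018-05-30",
                    "2018-01-05", "2018-01-05", "2018-03-09"],
    "Revenue": [26.80, 25.83, 13.35, 10.00, 20.00, 30.00],
})

# Broadcast each customer's row count back onto every one of their rows.
toy["BuyCount"] = toy.groupby("CustomerID")["InvoiceDate"].transform("count")

# Keep customers with more than one row, in the same column order as final_df.
repeat_toy = toy.loc[
    toy["BuyCount"] > 1, ["CustomerID", "BuyCount", "InvoiceDate", "Revenue"]
]

print(repeat_toy)
```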

In [18]:
final_df.columns

Index(['CustomerID', 'BuyCount', 'InvoiceDate', 'Revenue'], dtype='object')

In [19]:
final_df.CustomerID.nunique()

2961

In [20]:
final_df.head()

|    | CustomerID | BuyCount | InvoiceDate | Revenue |
|----|------------|----------|-------------|---------|
| 16 | 000e943451fc2788ca6ac98a682f2f49 | 4 | 2017-04-20 | 26.8 |
| 17 | 000e943451fc2788ca6ac98a682f2f49 | 4 | 2017-04-20 | 26.8 |
| 18 | 000e943451fc2788ca6ac98a682f2f49 | 4 | 2017-04-20 | 26.8 |
| 19 | 000e943451fc2788ca6ac98a682f2f49 | 4 | 2017-04-20 | 25.83 |
| 25 | 001051abfcfdbed9f87b4266213a5df1 | 3 | 2018-05-30 | 13.35 |


In [21]:
# Assess the share of total revenue generated by repeat customers
revenue = round(final_df.Revenue.sum(), 2)

total_revenue = round(dataframe.Revenue.sum(), 2)

perc = round(revenue / total_revenue, 2)

# perc is a fraction, so format it as a percentage rather than appending "%"
print(f"Repeat customers account for {revenue} in revenue.\n\nThat is {perc:.0%} of total revenue.")

Repeat customers account for 492107.95 in revenue.

That is 3% of total revenue.


## Export Final Dataset

In [22]:
final_df.to_csv("../data/final_olist_dataset.csv", index= False)

# CLV Prediction

## Creating RFM Metrics

In [23]:
import sys
sys.path.append('../')

In [24]:
from modules.modules import clean_dataframe

In [28]:
final_df = clean_dataframe(final_df)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
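The message above is pandas' SettingWithCopyWarning. It most likely appears because `final_df` was produced by filtering `df_new`, and `clean_dataframe` then assigns to the `InvoiceDate` column of that filtered frame. A minimal sketch of the usual remedy, taking an explicit `.copy()` before modifying, follows on made-up data. `clean_sketch` is a hypothetical stand-in that reproduces only the one line quoted in the warning; the rest of `clean_dataframe` is not shown in this notebook.

```python
import pandas as pd


def clean_sketch(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for clean_dataframe: parse InvoiceDate only."""
    df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
    return df


# Toy data; IDs and amounts are illustrative only.
df_toy = pd.DataFrame({
    "CustomerID": ["a", "a", "b"],
    "BuyCount": [2, 2, 1],
    "InvoiceDate": ["2017-04-20", "2017-04-21", "2018-05-30"],
    "Revenue": [26.80, 25.83, 13.35],
})

# .copy() makes the filtered frame independent of df_toy,
# so the column assignment inside clean_sketch is unambiguous.
repeat_toy = df_toy.loc[df_toy["BuyCount"] > 1].copy()
repeat_toy = clean_sketch(repeat_toy)

print(repeat_toy.dtypes)
```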


In [33]:
# datetime_is_numeric is only accepted by pandas 1.1 to 1.5; on pandas 2.0+ omit the argument
final_df.InvoiceDate.describe(datetime_is_numeric=True)

count                             7407
mean     2017-12-01 10:03:38.712028928
min                2016-10-04 00:00:00
25%                2017-07-27 00:00:00
50%                2017-12-08 00:00:00
75%                2018-04-13 00:00:00
max                2018-08-28 00:00:00
Name: InvoiceDate, dtype: object
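As a rough illustration of what the RFM metrics named in this section's heading could look like, the sketch below derives recency, frequency and monetary value per customer from the three columns prepared above. The snapshot date (one day after the last purchase), the output column names `Recency`, `Frequency` and `Monetary`, and the choice to count distinct purchase dates as frequency are all assumptions; the app's own implementation may differ. The data is made up.

```python
import pandas as pd

# Toy data; IDs and amounts are illustrative only.
toy = pd.DataFrame({
    "CustomerID": ["a", "a", "a", "b", "b"],
    "InvoiceDate": pd.to_datetime(
        ["2017-04-20", "2017-04-20", "2017-06-01", "2018-05-30", "2018-08-28"]
    ),
    "Revenue": [26.80, 25.83, 40.00, 13.35, 60.00],
})

# Assumed reference point: the day after the most recent purchase in the data.
snapshot = toy["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm = toy.groupby("CustomerID").agg(
    Recency=("InvoiceDate", lambda d: (snapshot - d.max()).days),  # days since last purchase
    Frequency=("InvoiceDate", "nunique"),                          # distinct purchase dates
    Monetary=("Revenue", "sum"),                                   # total spend
)

print(rfm)
```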