# Customer Lifetime Value Predictions

In this notebook a machine learning model will be built to predict Customer Lifetime Value (CLV) of an online retail store.

Data can be found here:
*https://www.kaggle.com/datasets/vijayuv/onlineretail*


Step-by-step process:

- Define an appropriate time frame for Customer Lifetime Value calculation
- Identify the features we are going to use to predict future value and create them
- Calculate lifetime value (LTV) for training the machine learning model
- Build and run the machine learning model
- Check if the model is useful

In [39]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import math
import plotly.express as px

from sklearn.preprocessing import RobustScaler
from datetime import datetime, timedelta, date

## Data Import

In [2]:
data = pd.read_csv("../data/OnlineRetail.csv",encoding= 'unicode_escape')

In [3]:
data.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,12/1/2010 8:26,2.55,17850.0,United Kingdom
1,536365,71053,WHITE METAL LANTERN,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,12/1/2010 8:26,2.75,17850.0,United Kingdom
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,12/1/2010 8:26,3.39,17850.0,United Kingdom
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,12/1/2010 8:26,3.39,17850.0,United Kingdom


In [4]:
data.shape

(541909, 8)

In [5]:
data.dtypes

InvoiceNo       object
StockCode       object
Description     object
Quantity         int64
InvoiceDate     object
UnitPrice      float64
CustomerID     float64
Country         object
dtype: object

## Importing Functions

In [6]:
def groupby_mean(x):
    return x.mean()

def groupby_count(x):
    return x.count()

def purchase_duration(x):
    return (x.max() - x.min()).days

def avg_frequency(x):
    return (x.max() - x.min()).days / x.count()

groupby_mean.__name__ = 'avg'
groupby_count.__name__ = 'count'
purchase_duration.__name__ = 'purchase_duration'
avg_frequency.__name__ = 'purchase_frequency'

In [7]:
def group_by_3M(df, clv_freq):
    '''This function slices the dataframe into chunks of a select timeframe.
    This is done so that previous timeframes can be used to predict CLV for a later timeframe.
    Ex.: Slicing into 3 months timeframes, to predict the last chunk based on all previous ones.'''
    
    df_orders = df.groupby(['CustomerID', 'InvoiceNo']).agg({'Revenue': sum, 'InvoiceDate': max})

    df_data = df_orders.reset_index().groupby([
                'CustomerID',
                pd.Grouper(key='InvoiceDate', freq=clv_freq)
                ]).agg({'Revenue': [sum, groupby_mean, groupby_count],})

    df_data.columns = ['_'.join(col).lower() for col in df_data.columns]
    
    df_data.reset_index(inplace= True)
    
    map_date_month = {str(x)[:10]: 'M_%s' % (i+1) for i, x in enumerate(
                    sorted(df_data.reset_index()['InvoiceDate'].unique(), reverse=True))}
    
    df_data["M"] = df_data["InvoiceDate"].apply(lambda x: map_date_month[str(x)[:10]])
    
    return df_data
    
    

In [24]:
def create_features_and_target(df):
    '''This function takes the monthly aggregated data, and creates features
    as inputs for our regression from it.'''
    
    ## create features
    df_features = pd.pivot_table(
                    df.loc[df["M"] != "M_1"],
                    values= ["revenue_sum", "revenue_avg", "revenue_count"],
                    columns = "M",
                    index= "CustomerID")
    
    df_features.columns = ['_'.join(col) for col in df_features.columns]
    
    df_features.reset_index(level=0, inplace= True)
    
    df_features.fillna(0, inplace=True)
    
    ## create target
    df_target = df.loc[df["M"] == "M_1"][["CustomerID", "revenue_sum"]]

    df_target.columns = ["CustomerID", "CLV_M_1"]
    
    ## creating final dataframe by merging
    df_final = pd.merge(df_features, df_target, on= "CustomerID", how= "left")
    
    df_final.fillna(0, inplace=True)
    
    return df_final

In [8]:
def clean_dataframe(df):
    '''Function to clean dataframe from missing data, outliers, duplicates,
    and to change data types as necessary.'''
    
    if "Revenue" in df.columns:
        
        df["InvoiceDate"] = pd.to_datetime(df["InvoiceDate"])
    
    else:
            
        ## dropping outliers and negative values in Quantity
        idx_neg = df.loc[df.Quantity < 0].index
        
        df.drop(idx_neg, inplace= True)
        
        ## removing outliers
        num_cols = df.select_dtypes(include=["int64", "float64"]).columns
        
        for col in num_cols:
            
            mask = df[col] > df[col].quantile(0.99)

            df.drop(df[mask].index, inplace=True)
            
        ## changing datatype for InvoiceDate
        df["InvoiceDate"] = pd.to_datetime(df.InvoiceDate)
        
        ## creating revenue column
        df["Revenue"] = df["Quantity"] * df["UnitPrice"]

    return df

def clustering(data=None, k=None, column=None):
    '''This function clusters data of a given column,
    and returns the dataframe with the cluster predictions.'''
    
    kmeans = KMeans(n_clusters = k,
                    max_iter= 1000)
    
    kmeans.fit(data[[column]])
    
    new_column = column + "Cluster"
    
    data[new_column] = kmeans.predict(data[[column]])
    
    return data

def order_clusters(data=None, column=None, target=None, ascending=None):
    '''This function orders the clusters of a given dataframe,
    so that cluster names are not a nominal variable but ordinal.'''
    
    new_column = "new_" + column 
    
    df = data.groupby(column)[target].mean().reset_index()
    
    df = df.sort_values(by=target, ascending=ascending)
    
    df["index"] = df.index
    
    df_final = pd.merge(data, df[[column, "index"]], on=column)
    
    df_final.drop([column], axis=1, inplace=True)
    
    df_final = df_final.rename(columns={"index": column})
    
    return df_final


def data_prep(df):
    '''This function applies scaling and encoding to features, 
    for the step of modeling and predicting.'''
    
    for col in ["Recency", "Frequency", "Revenue"]:
    
        scaler = RobustScaler()
        
        scaler.fit(df[[col]])
        
        df[col] = scaler.transform(df[[col]])
    
    return df

## Making RFM Metrics

In [9]:
from sklearn.cluster import KMeans

In [10]:
data = pd.read_csv("../data/OnlineRetail.csv",encoding= 'unicode_escape')

In [11]:
df = clean_dataframe(data)

In [12]:
df.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Revenue
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,15.3
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,22.0
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34


In [13]:
df.shape

(517353, 9)

In [14]:
tmp = group_by_3M(df, clv_freq='3M')

In [15]:
date = pd.to_datetime(sorted(tmp['InvoiceDate'].unique(), reverse=True)[1]).date()

In [16]:
data.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Revenue
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,15.3
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,22.0
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34


In [17]:
dataframe = data.loc[data["InvoiceDate"].dt.date <= date]

In [18]:
dataframe.shape

(353088, 9)

In [19]:
## segmenting data into 3 and 6 month dataframes.
## 3 Months of data will be used to forecast CLV over the following 6 months.

user_df = pd.DataFrame(dataframe.CustomerID.unique(), columns= ["CustomerID"])

## creating Recency Metric
recency_df = pd.DataFrame(dataframe.groupby("CustomerID")["InvoiceDate"].max().reset_index())
recency_df.columns = ["CustomerID", "LatestPurchase"]

recency_df["Recency"] = (dataframe["InvoiceDate"].max() - recency_df["LatestPurchase"]).dt.days
recency_df.drop("LatestPurchase", axis= 1, inplace= True)

recency_df = clustering(data= recency_df,
                        k= 3,
                        column="Recency")

recency_df = order_clusters(data= recency_df,
                            column= "RecencyCluster",
                            target= "Recency",
                            ascending= False)

user_df = pd.merge(recency_df, user_df, on= "CustomerID") 

## creating Frequency Metric
frequency_df = pd.DataFrame(dataframe.groupby("CustomerID")["InvoiceDate"].count().reset_index())
frequency_df.columns = ["CustomerID", "Frequency"]

frequency_df = clustering(data= frequency_df,
                          k= 5,
                          column= "Frequency")

frequency_df = order_clusters(data= frequency_df,
                              column= "FrequencyCluster",
                              target= "Frequency",
                              ascending= True)

user_df = pd.merge(frequency_df, user_df, on= "CustomerID")

## creating Revenue Metric
revenue_df = pd.DataFrame(dataframe.groupby("CustomerID")["Revenue"].sum().reset_index())
revenue_df.columns = ["CustomerID", "Revenue"]

revenue_df = clustering(data= revenue_df,
                        k= 5,
                        column= "Revenue")

revenue_df = order_clusters(data= revenue_df,
                            column= "RevenueCluster",
                            target= "Revenue",
                            ascending= True)

user_df = pd.merge(revenue_df, user_df, on= "CustomerID")

user_df["OverallScore"] = user_df["RecencyCluster"] + user_df["FrequencyCluster"] + user_df["RevenueCluster"]

In [20]:
user_df

Unnamed: 0,CustomerID,Revenue,RevenueCluster,Frequency,FrequencyCluster,Recency,RecencyCluster,OverallScore
0,12347.0,2541.26,0,123,0,59,1,1
1,12348.0,835.08,0,16,2,5,1,3
2,12350.0,294.40,0,16,2,240,2,4
3,12352.0,1154.01,0,63,2,2,1,3
4,12353.0,89.00,0,4,2,133,0,2
...,...,...,...,...,...,...,...,...
3526,17677.0,12670.63,1,224,0,12,1,2
3527,17841.0,24521.34,1,4984,3,2,1,5
3528,18102.0,28430.85,1,119,0,2,1,2
3529,14646.0,111508.90,2,1138,1,8,1,4


## Making 3M - Segments

In [21]:
data = data.loc[data['InvoiceDate'] < '2011-12-01']

In [22]:
data.head()

Unnamed: 0,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country,Revenue
0,536365,85123A,WHITE HANGING HEART T-LIGHT HOLDER,6,2010-12-01 08:26:00,2.55,17850.0,United Kingdom,15.3
1,536365,71053,WHITE METAL LANTERN,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
2,536365,84406B,CREAM CUPID HEARTS COAT HANGER,8,2010-12-01 08:26:00,2.75,17850.0,United Kingdom,22.0
3,536365,84029G,KNITTED UNION FLAG HOT WATER BOTTLE,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34
4,536365,84029E,RED WOOLLY HOTTIE WHITE HEART.,6,2010-12-01 08:26:00,3.39,17850.0,United Kingdom,20.34


In [23]:
df_data = group_by_3M(data, clv_freq='3M')

df_data.head()

Unnamed: 0,CustomerID,InvoiceDate,revenue_sum,revenue_avg,revenue_count,M
0,12347.0,2010-12-31,711.79,711.79,1,M_5
1,12347.0,2011-03-31,475.39,475.39,1,M_4
2,12347.0,2011-06-30,769.17,384.585,2,M_3
3,12347.0,2011-09-30,584.91,584.91,1,M_2
4,12347.0,2011-12-31,1294.32,1294.32,1,M_1


In [25]:
df_final = create_features_and_target(df_data)

In [26]:
df_final.head()

Unnamed: 0,CustomerID,revenue_avg_M_2,revenue_avg_M_3,revenue_avg_M_4,revenue_avg_M_5,revenue_count_M_2,revenue_count_M_3,revenue_count_M_4,revenue_count_M_5,revenue_sum_M_2,revenue_sum_M_3,revenue_sum_M_4,revenue_sum_M_5,CLV_M_1
0,12347.0,584.91,384.585,475.39,711.79,1.0,2.0,1.0,1.0,584.91,769.17,475.39,711.79,1294.32
1,12348.0,120.0,327.0,20.4,367.68,1.0,1.0,1.0,1.0,120.0,327.0,20.4,367.68,0.0
2,12350.0,0.0,0.0,294.4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,294.4,0.0,0.0
3,12352.0,256.25,0.0,160.3775,0.0,2.0,0.0,4.0,0.0,512.5,0.0,641.51,0.0,231.73
4,12353.0,0.0,89.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,89.0,0.0,0.0,0.0


In [27]:
df_final.shape

(3531, 14)

## Merging Dataframes

In [28]:
final_dataframe = pd.merge(df_final, user_df, left_on= "CustomerID", right_on= "CustomerID", how="left")

In [29]:
final_dataframe.head()

Unnamed: 0,CustomerID,revenue_avg_M_2,revenue_avg_M_3,revenue_avg_M_4,revenue_avg_M_5,revenue_count_M_2,revenue_count_M_3,revenue_count_M_4,revenue_count_M_5,revenue_sum_M_2,...,revenue_sum_M_4,revenue_sum_M_5,CLV_M_1,Revenue,RevenueCluster,Frequency,FrequencyCluster,Recency,RecencyCluster,OverallScore
0,12347.0,584.91,384.585,475.39,711.79,1.0,2.0,1.0,1.0,584.91,...,475.39,711.79,1294.32,2541.26,0,123,0,59,1,1
1,12348.0,120.0,327.0,20.4,367.68,1.0,1.0,1.0,1.0,120.0,...,20.4,367.68,0.0,835.08,0,16,2,5,1,3
2,12350.0,0.0,0.0,294.4,0.0,0.0,0.0,1.0,0.0,0.0,...,294.4,0.0,0.0,294.4,0,16,2,240,2,4
3,12352.0,256.25,0.0,160.3775,0.0,2.0,0.0,4.0,0.0,512.5,...,641.51,0.0,231.73,1154.01,0,63,2,2,1,3
4,12353.0,0.0,89.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,89.0,0,4,2,133,0,2


## Modeling

In [30]:
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_validate
from sklearn.metrics import mean_absolute_error, r2_score

### Regression using LinearRegression

In [31]:
def regression(df, target):
    
    X= df.drop([target, "CustomerID"], axis= 1)
    y= df[target]
    
    lm = LinearRegression()
    
    cv_result= cross_validate(lm,
                             X,
                             y,
                             cv= 5,
                             scoring= ('r2', 'neg_mean_absolute_error'))
    
    r2 = cv_result['test_r2'].mean()
    mae = abs(cv_result['test_neg_mean_absolute_error'].mean())
    
    print(f"Results: \nr2 = {round(r2,2)}\nMAE = {round(mae,2)}")
    
    return lm

In [40]:
df_prep = data_prep(final_dataframe)
df_prep.head()

Unnamed: 0,CustomerID,revenue_avg_M_2,revenue_avg_M_3,revenue_avg_M_4,revenue_avg_M_5,revenue_count_M_2,revenue_count_M_3,revenue_count_M_4,revenue_count_M_5,revenue_sum_M_2,...,revenue_sum_M_4,revenue_sum_M_5,CLV_M_1,Revenue,RevenueCluster,Frequency,FrequencyCluster,Recency,RecencyCluster,OverallScore
0,12347.0,584.91,384.585,475.39,711.79,1.0,2.0,1.0,1.0,584.91,...,475.39,711.79,1294.32,1.8447,0,1.348485,0,-0.038462,1,1
1,12348.0,120.0,327.0,20.4,367.68,1.0,1.0,1.0,1.0,120.0,...,20.4,367.68,0.0,0.260613,0,-0.272727,2,-0.453846,1,3
2,12350.0,0.0,0.0,294.4,0.0,0.0,0.0,1.0,0.0,0.0,...,294.4,0.0,0.0,-0.241376,0,-0.272727,2,1.353846,2,4
3,12352.0,256.25,0.0,160.3775,0.0,2.0,0.0,4.0,0.0,512.5,...,641.51,0.0,231.73,0.556721,0,0.439394,2,-0.476923,1,3
4,12353.0,0.0,89.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,-0.432078,0,-0.454545,2,0.530769,0,2


In [32]:
regression(final_dataframe, "CLV_M_1")

Results: 
r2 = 0.51
MAE = 351.6


### Regression using XGB

In [33]:
import xgboost as xgb

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, mean_squared_error, mean_absolute_error
from sklearn.dummy import DummyClassifier, DummyRegressor

In [34]:
X = final_dataframe.drop(["CustomerID", "CLV_M_1"], axis= 1)
y = final_dataframe["CLV_M_1"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.25)

In [35]:
xgb_regressor = xgb.XGBRegressor()

xgb_regressor.fit(X_train, y_train)

y_pred = xgb_regressor.predict(X_test)

r2 = xgb_regressor.score(X_test, y_test)
mse = mean_squared_error(y_test, y_pred)
rmse = math.sqrt(mse)
mae = mean_absolute_error(y_test, y_pred)

results = {
    "metric": ["r2", "mse", "rmse", "mae"],
    "value": [r2, mse, rmse, mae]
}

pd.DataFrame(results).set_index("metric").round(2)

Unnamed: 0_level_0,value
metric,Unnamed: 1_level_1
r2,0.34
mse,2444827.84
rmse,1563.59
mae,394.88
