<H1> CLTV Prediction </H1>

We will build simple machine learning models that predict our customers' lifetime value (CLTV) and compare their performance.

In [1]:
# Packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.metrics import confusion_matrix, classification_report, precision_recall_fscore_support, accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split

<a name=1> <h1> 1. Feature Engineering </h1></a>

In [2]:
# get the data
import pickle

with open("tx_cluster.pkl", "rb") as f:
    tx_cluster = pickle.load(f)

tx_cluster.head()

Unnamed: 0,CustomerID,Recency,RecencyCluster,Frequency,FrequencyCluster,Revenue,RevenueCluster,OverallScore,Segment,m6_Revenue,LTVCluster
0,17850.0,301,0,312,1,5288.63,1,2,Low-Value,0.0,0
1,17602.0,1,3,565,1,5050.77,1,5,High-Value,128.75,0
2,17509.0,57,2,369,1,6100.74,1,4,Mid-Value,0.0,0
3,13093.0,266,0,170,0,7741.47,1,1,Low-Value,262.55,0
4,17809.0,15,3,64,0,4627.62,1,4,Mid-Value,276.72,0


In [3]:
# Convert categorical columns to numerical (one-hot) columns
tx_class = pd.get_dummies(tx_cluster)  # Segment is the only categorical column
tx_class.head()

Unnamed: 0,CustomerID,Recency,RecencyCluster,Frequency,FrequencyCluster,Revenue,RevenueCluster,OverallScore,m6_Revenue,LTVCluster,Segment_High-Value,Segment_Low-Value,Segment_Mid-Value
0,17850.0,301,0,312,1,5288.63,1,2,0.0,0,False,True,False
1,17602.0,1,3,565,1,5050.77,1,5,128.75,0,True,False,False
2,17509.0,57,2,369,1,6100.74,1,4,0.0,0,False,False,True
3,13093.0,266,0,170,0,7741.47,1,1,262.55,0,False,True,False
4,17809.0,15,3,64,0,4627.62,1,4,276.72,0,False,False,True
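
To see what `pd.get_dummies` does to the `Segment` column in isolation, here is a minimal sketch on a hypothetical three-row frame. The `toy` data is made up for illustration; only the column names follow the table above.

```python
import pandas as pd

# Hypothetical toy frame; the values are illustrative, not taken from tx_cluster.
toy = pd.DataFrame({
    "Recency": [301, 1, 57],
    "Segment": ["Low-Value", "High-Value", "Mid-Value"],
})

# Only object/categorical columns are expanded into indicator columns;
# numeric columns such as Recency pass through unchanged.
toy_encoded = pd.get_dummies(toy)
print(toy_encoded.columns.tolist())
```

Each row ends up with exactly one of the `Segment_*` indicators set, which is the same layout as the `tx_class` table.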


In [4]:
tx_class.describe()

Unnamed: 0,CustomerID,Recency,RecencyCluster,Frequency,FrequencyCluster,Revenue,RevenueCluster,OverallScore,m6_Revenue,LTVCluster
count,3910.0,3910.0,3910.0,3910.0,3910.0,3910.0,3910.0,3910.0,3910.0,3910.0
mean,15561.35601,91.613811,2.098977,84.348849,0.113555,1289.904252,0.057545,2.270077,513.67796,0.221995
std,1575.088247,100.383464,1.055228,152.686136,0.332276,2117.764962,0.235097,1.273618,873.629228,0.483366
min,12346.0,0.0,0.0,1.0,0.0,-4287.63,0.0,0.0,-4287.63,0.0
25%,14209.25,16.0,1.0,16.25,0.0,279.23,0.0,1.0,0.0,0.0
50%,15569.5,49.0,2.0,40.0,0.0,618.04,0.0,2.0,205.305,0.0
75%,16911.5,144.0,3.0,98.0,0.0,1483.4975,0.0,3.0,626.36,0.0
max,18287.0,373.0,3.0,5128.0,3.0,57120.91,2.0,8.0,7387.22,2.0


In [5]:
#calculate and show correlations
corr_matrix = tx_class.corr()
corr_matrix['LTVCluster'].sort_values(ascending=False)

LTVCluster            1.000000
m6_Revenue            0.890176
Revenue               0.673601
RevenueCluster        0.603433
Segment_High-Value    0.511627
FrequencyCluster      0.470566
Frequency             0.447167
OverallScore          0.419524
RecencyCluster        0.223734
Segment_Mid-Value     0.037707
CustomerID           -0.044819
Recency              -0.224075
Segment_Low-Value    -0.234132
Name: LTVCluster, dtype: float64

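The correlations with LTVCluster suggest that Revenue, Frequency and the RFM cluster scores will be useful features for our machine learning models. m6_Revenue has the highest correlation (0.89), but it is dropped from the feature set in the next cell.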

In [6]:
# Create X (the feature set) and y (the label, LTVCluster)
X = tx_class.drop(['LTVCluster','m6_Revenue'],axis=1)
y = tx_class['LTVCluster']

#split training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.05, random_state=56)
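
The split above holds out only 5% of the customers, so a rare class can end up with very few test rows. One option, not used in this notebook, is to pass `stratify=y` so that each class keeps roughly the same share in both splits. The sketch below shows the idea on synthetic labels; `X_demo`, `y_demo` and the 800/170/30 class counts are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels (assumption: 800 / 170 / 30 rows per class).
rng = np.random.default_rng(0)
y_demo = np.repeat([0, 1, 2], [800, 170, 30])
X_demo = rng.normal(size=(len(y_demo), 4))

# stratify=y_demo keeps each class's share nearly identical in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.05, random_state=56, stratify=y_demo
)

print("train counts:", np.bincount(y_tr))
print("test counts: ", np.bincount(y_te))
```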

<a href=2> <h1> 2. Models </h1></a>

Our LTV clusters fall into three classes (low, mid and high LTV), so we will perform multi-class classification.
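
The summary statistics above show that at least 75% of customers fall in cluster 0, so the classes are imbalanced. The logistic regression below is fitted with `class_weight='balanced'`, which reweights each class inversely to its frequency. A minimal sketch of that weighting, using hypothetical class counts of 156, 35 and 5:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Illustrative label counts per class (assumption: 156 / 35 / 5 examples).
y_demo = np.repeat([0, 1, 2], [156, 35, 5])
classes = np.unique(y_demo)

# "balanced" weight for class c = n_samples / (n_classes * count(c)).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_demo)
manual = len(y_demo) / (len(classes) * np.bincount(y_demo))

for c, w in zip(classes, weights):
    print(f"class {c}: weight {w:.3f}")
```

The rarest class receives the largest weight, so misclassifying it costs more during training.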

<a href=2.1> <h2> 2.1 Logistic Regression </h2></a>

In [16]:
from sklearn.linear_model import LogisticRegression

basemodelname = "Logit_test"
params = {
    "penalty": None,
    "class_weight": 'balanced'}
parsuf = '_'.join([key.replace('_','')+str(val).replace('.','') for key,val in params.items()])
modelname=f"{basemodelname}_{parsuf}"

ltv_logreg = LogisticRegression(
    penalty=params['penalty'],
    class_weight=params['class_weight'],
    max_iter=1000
).fit(X_train, y_train)

acc_train = ltv_logreg.score(X_train, y_train)
acc_test = ltv_logreg.score(X_test[X_train.columns], y_test)

print(f"Modelname: {modelname}")
print('Accuracy of Logit classifier on training set: {:.2f}'.format(acc_train))
print('Accuracy of Logit classifier on test set: {:.2f}'.format(acc_test))

y_pred = ltv_logreg.predict(X_test)
clfreport = classification_report(y_test, y_pred)
print(clfreport)

Modelname: Logit_test_penaltyNone_classweightbalanced
Accuracy of Logit classifier on training set: 0.87
Accuracy of Logit classifier on test set: 0.86
              precision    recall  f1-score   support

           0       0.96      0.88      0.92       156
           1       0.59      0.74      0.66        35
           2       0.62      1.00      0.77         5

    accuracy                           0.86       196
   macro avg       0.72      0.88      0.78       196
weighted avg       0.88      0.86      0.87       196
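
The first cell imports `confusion_matrix` but the notebook never calls it. A confusion matrix shows which classes get mistaken for which, something the per-class precision and recall only hint at. The sketch below runs on synthetic data from `make_classification`, not on `tx_class`; the dataset parameters and the names `X_demo`, `y_demo` and `clf` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic 3-class data with a strong class imbalance (assumption, not the real data).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, n_informative=5, n_redundant=0,
    n_classes=3, weights=[0.8, 0.17, 0.03], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=56, stratify=y_demo
)

clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_tr, y_tr)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_te, clf.predict(X_te), labels=[0, 1, 2])
print(cm)

# Per-class recall is the diagonal divided by each row total.
recall = cm.diagonal() / cm.sum(axis=1)
print(np.round(recall, 2))
```

In the notebook itself, the equivalent call would be `confusion_matrix(y_test, y_pred)`.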



<a href=2.2> <h2> 2.2 XGBoost </h2></a>

In [8]:
import xgboost as xgb

basemodelname = "xgboost_test"
params = {
    "max_depth": 5,
    "learning_rate": 0.1}
parsuf = '_'.join([key.replace('_','')+str(val).replace('.','') for key,val in params.items()])
modelname=f"{basemodelname}_{parsuf}"

ltv_xgb = xgb.XGBClassifier(
    max_depth=params['max_depth'], 
    learning_rate=params['learning_rate'],
    n_jobs=-1
).fit(X_train, y_train)

print(f"Modelname: {modelname}")
acc_train = ltv_xgb.score(X_train, y_train)
acc_test = ltv_xgb.score(X_test[X_train.columns], y_test)

print('Accuracy of XGB classifier on training set: {:.2f}'.format(acc_train))
print('Accuracy of XGB classifier on test set: {:.2f}'.format(acc_test))

y_pred = ltv_xgb.predict(X_test)
clfreport = classification_report(y_test, y_pred)

Modelname: xgboost_test_maxdepth5_learningrate01
Accuracy of XGB classifier on training set: 0.96
Accuracy of XGB classifier on test set: 0.88


In [9]:
print(clfreport)

              precision    recall  f1-score   support

           0       0.94      0.94      0.94       156
           1       0.67      0.69      0.68        35
           2       0.67      0.40      0.50         5

    accuracy                           0.88       196
   macro avg       0.76      0.68      0.71       196
weighted avg       0.88      0.88      0.88       196
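
Both reports rest on a single test split of 196 customers, only 5 of whom belong to class 2, so the per-class numbers are noisy. A cross-validated comparison, for example on macro F1, gives a steadier picture. The sketch below uses `cross_val_score`, which the first cell imports but does not use. It runs on synthetic data, and scikit-learn's `GradientBoostingClassifier` stands in for `XGBClassifier` to keep the example dependency-free; the data, the stand-in model and `n_estimators=50` are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic 3-class data with a strong class imbalance (assumption, not the real data).
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=8, n_informative=5, n_redundant=0,
    n_classes=3, weights=[0.8, 0.17, 0.03], random_state=0,
)

# Stratified folds keep the rare class present in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=56)

candidates = {
    "logit_balanced": LogisticRegression(class_weight="balanced", max_iter=1000),
    # Stand-in for XGBClassifier; same max_depth and learning_rate as above.
    "gradient_boosting": GradientBoostingClassifier(
        max_depth=5, learning_rate=0.1, n_estimators=50, random_state=0
    ),
}

results = {}
for name, model in candidates.items():
    # Macro F1 averages the per-class F1 scores, so rare classes count equally.
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="f1_macro")
    results[name] = scores
    print(f"{name}: macro F1 = {scores.mean():.2f} +/- {scores.std():.2f}")
```

In the notebook, the same loop could be run with `X`, `y` and unfitted copies of the two estimators configured above.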

