# Prediction of Customer Lifetime Value (CLV)
Bibobra Alabrah

BUSINESS PROBLEM

A company wants to estimate the lifetime value of its customers, that is, how much money each customer is likely to bring to the company, based on the history of their first few purchases.


GOAL

The goal of this project is to build a predictive model that estimates the customer lifetime value (CLV) of new customers from the past purchase history of existing customers.

In [1]:
# Import dependencies

import pandas as pd
from pandas import Series, DataFrame
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection  import train_test_split
from sklearn.linear_model import LinearRegression
import sklearn.metrics

In [2]:
# Load the data set

purchase_history = pd.read_csv("history.csv")

## Exploratory Data Analysis

In [3]:
# View the data types
purchase_history.dtypes

CUST_ID    int64
MONTH_1    int64
MONTH_2    int64
MONTH_3    int64
MONTH_4    int64
MONTH_5    int64
MONTH_6    int64
CLV        int64
dtype: object

The dataset consists of the customer ID, the amount the customer spent on the website in each of the first six months of their relationship with the business, and their ultimate lifetime value (say, three years' worth).

In [4]:
# View the dimension of the data set
purchase_history.shape

(100, 8)

The dataset has 100 rows, one per customer, and 8 columns.

In [5]:
# View the first few records of the data
purchase_history.head()

   CUST_ID  MONTH_1  MONTH_2  MONTH_3  MONTH_4  MONTH_5  MONTH_6    CLV
0     1001      150       75      200      100      175       75  13125
1     1002       25       50      150      200      175      200   9375
2     1003       75      150        0       25       75       25   5156
3     1004      200      200       25      100       75      150  11756
4     1005      200      200      125       75      175      200  15525


In [6]:
# View the last few records
purchase_history.tail()

    CUST_ID  MONTH_1  MONTH_2  MONTH_3  MONTH_4  MONTH_5  MONTH_6   CLV
95     1096      150      200       25      125       50       75  9763
96     1097      100      100      125      150      100      125  9625
97     1098      100       75      200      200      100       50  9750
98     1099       25      150      150      125      100      175  8113
99     1100      125      200       25       75       75      200  8438


## Select the best features using Correlation Analysis

In [7]:
purchase_history.corr()['CLV']

CUST_ID   -0.095205
MONTH_1    0.734122
MONTH_2    0.250397
MONTH_3    0.371742
MONTH_4    0.297408
MONTH_5    0.376775
MONTH_6    0.327064
CLV        1.000000
Name: CLV, dtype: float64

MONTH_1 shows a strong positive correlation with the target variable (about 0.73), while MONTH_2 through MONTH_6 show weaker positive correlations (roughly 0.25 to 0.38). Taken together, the monthly amounts look useful for predicting CLV. The customer ID shows almost no correlation with CLV (about -0.10) and is only an identifier, so we remove it.
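The selection step can be made explicit by ranking the columns on the absolute value of their correlation with CLV. The sketch below reuses the correlations printed above; the cutoff `MIN_ABS_CORR` is a hypothetical threshold chosen only for illustration, not a value used in this notebook.

```python
# Correlation of each column with CLV, copied from the output above.
corr_with_clv = {
    "CUST_ID": -0.095205,
    "MONTH_1": 0.734122,
    "MONTH_2": 0.250397,
    "MONTH_3": 0.371742,
    "MONTH_4": 0.297408,
    "MONTH_5": 0.376775,
    "MONTH_6": 0.327064,
}

# Hypothetical cutoff, chosen only for illustration.
MIN_ABS_CORR = 0.2

# Rank columns by the strength of the correlation, ignoring its sign.
ranked = sorted(corr_with_clv.items(), key=lambda kv: abs(kv[1]), reverse=True)

# Keep the columns whose absolute correlation reaches the cutoff.
selected = [name for name, r in ranked if abs(r) >= MIN_ABS_CORR]

for name, r in ranked:
    status = "keep" if name in selected else "drop"
    print(f"{name:8s} {r:+.3f}  {status}")
```

With this cutoff all six monthly columns are kept and only CUST_ID is dropped, which matches the choice made in the notebook.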

## Check for missing values

In [8]:
purchase_history.isnull().sum()

CUST_ID    0
MONTH_1    0
MONTH_2    0
MONTH_3    0
MONTH_4    0
MONTH_5    0
MONTH_6    0
CLV        0
dtype: int64

There are no missing values.

## Data Cleaning

In [9]:
# We now remove the customer id feature
clean = purchase_history.drop("CUST_ID",axis=1)

In [10]:
# Let us confirm that the data looks exactly as desired
clean.head()

   MONTH_1  MONTH_2  MONTH_3  MONTH_4  MONTH_5  MONTH_6    CLV
0      150       75      200      100      175       75  13125
1       25       50      150      200      175      200   9375
2       75      150        0       25       75       25   5156
3      200      200       25      100       75      150  11756
4      200      200      125       75      175      200  15525


## Split the data into a train and validation set

We split the data into training and test sets in an 80:20 ratio. First, we separate the target variable (CLV) from the predictors.

In [11]:
predictors = clean.drop("CLV",axis=1)
target = clean.CLV

# No random_state is passed, so the split (and every result below) varies between runs
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=.2)
print("Predictor - Training : ", pred_train.shape, "Predictor - Testing : ", pred_test.shape)


Predictor - Training :  (80, 6) Predictor - Testing :  (20, 6)
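Because the split is random, the coefficients and the score reported in this notebook will change from run to run. A minimal sketch of a reproducible split follows, using only the ten rows displayed by `head()` and `tail()`; the seed value 42 and the helper `split` are assumptions made for illustration, not part of the original notebook.

```python
from sklearn.model_selection import train_test_split

# The ten rows displayed earlier: MONTH_1..MONTH_6 followed by CLV.
rows = [
    [150, 75, 200, 100, 175, 75, 13125],
    [25, 50, 150, 200, 175, 200, 9375],
    [75, 150, 0, 25, 75, 25, 5156],
    [200, 200, 25, 100, 75, 150, 11756],
    [200, 200, 125, 75, 175, 200, 15525],
    [150, 200, 25, 125, 50, 75, 9763],
    [100, 100, 125, 150, 100, 125, 9625],
    [100, 75, 200, 200, 100, 50, 9750],
    [25, 150, 150, 125, 100, 175, 8113],
    [125, 200, 25, 75, 75, 200, 8438],
]
X = [row[:6] for row in rows]  # predictors
y = [row[6] for row in rows]   # target (CLV)


def split(seed):
    """Hypothetical helper: an 80:20 split that is repeatable for a given seed."""
    return train_test_split(X, y, test_size=0.2, random_state=seed)


X_train_a, X_test_a, y_train_a, y_test_a = split(42)
X_train_b, X_test_b, y_train_b, y_test_b = split(42)

print("train rows:", len(X_train_a), "test rows:", len(X_test_a))
print("same test rows on both runs:", X_test_a == X_test_b)
```

Passing the same `random_state` to the notebook's own `train_test_split` call would make its reported numbers repeatable in the same way.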


## Build and Test Model
We fit a linear regression model to predict CLV and then evaluate it by predicting on the test set.

In [12]:
# Build model on training data

# instantiate the model
LR_model = LinearRegression()

# Fit the model
LR_model.fit(pred_train,tar_train)

print("Coefficients: \n", LR_model.coef_)
print("Intercept:", LR_model.intercept_)

Coefficients: 
 [34.43767452 11.00693978 15.16953266 12.48954293  7.13060804  5.64658155]
Intercept: -14.900014162143634
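Written out, the fitted model is a linear equation in the six monthly amounts. The symbols $m_1, \dots, m_6$ for MONTH_1 to MONTH_6 are notation introduced here, and the coefficients are the values printed above rounded to two decimals:

$$
\widehat{\mathrm{CLV}} \approx -14.90 + 34.44\,m_1 + 11.01\,m_2 + 15.17\,m_3 + 12.49\,m_4 + 7.13\,m_5 + 5.65\,m_6
$$

Each coefficient is the change in predicted CLV for one additional unit spent in that month with the other months held fixed. Since the train/test split is not seeded, a rerun will produce somewhat different values.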


In [13]:
# Let us test this model on the test data set

predictions = LR_model.predict(pred_test)

# Score the predictions with R^2 (the coefficient of determination)
sklearn.metrics.r2_score(tar_test, predictions)

0.9043010230486305

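The R² score is about 0.90, meaning the model explains roughly 90% of the variance in CLV on the test set. This suggests the model is a reasonable fit for predicting the CLV of new customers, although the test set holds only 20 customers.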
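The value returned by `r2_score` is the coefficient of determination, 1 - SS_res / SS_tot, not a classification accuracy. The sketch below computes it by hand with the coefficients printed earlier, applied to the ten rows displayed by `head()` and `tail()`. This is an assumption-laden illustration: those rows are not the notebook's actual test set (some were probably used for training), so the result is not expected to equal the score above. The helpers `predict` and `r2` are names introduced here.

```python
# Coefficients and intercept as printed by the notebook.
COEFS = [34.43767452, 11.00693978, 15.16953266, 12.48954293, 7.13060804, 5.64658155]
INTERCEPT = -14.900014162143634

# The ten displayed rows: MONTH_1..MONTH_6 followed by CLV.
ROWS = [
    [150, 75, 200, 100, 175, 75, 13125],
    [25, 50, 150, 200, 175, 200, 9375],
    [75, 150, 0, 25, 75, 25, 5156],
    [200, 200, 25, 100, 75, 150, 11756],
    [200, 200, 125, 75, 175, 200, 15525],
    [150, 200, 25, 125, 50, 75, 9763],
    [100, 100, 125, 150, 100, 125, 9625],
    [100, 75, 200, 200, 100, 50, 9750],
    [25, 150, 150, 125, 100, 175, 8113],
    [125, 200, 25, 75, 75, 200, 8438],
]


def predict(months):
    """Apply the fitted linear equation to six monthly amounts."""
    return INTERCEPT + sum(c * m for c, m in zip(COEFS, months))


def r2(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [row[6] for row in ROWS]
predicted = [predict(row[:6]) for row in ROWS]
score = r2(actual, predicted)
print(f"R^2 on the ten displayed rows: {score:.3f}")
```

A score of 1 means perfect predictions, and a score of 0 means the model does no better than always predicting the mean CLV.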

## Predict for a new Customer
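Suppose a new customer has spent 300, 100, and 250 on purchases in their first three months. Let us use the model to predict their CLV. Months 4 to 6 are not yet known, so they are entered as 0, which the model treats as no spending in those months.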

In [14]:
new_data = np.array([300,100,250,0,0,0]).reshape(1, -1)
new_data

array([[300, 100, 250,   0,   0,   0]])

In [15]:
new_pred = LR_model.predict(new_data) 
print("The CLV for the new customer is : $",new_pred[0])

The CLV for the new customer is : $ 15209.479483421626
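The prediction can be checked by applying the fitted equation by hand: multiply each monthly amount by its coefficient, add the results, and add the intercept. The sketch below uses the coefficients as printed earlier (rounded to eight decimals), so it agrees with the model's output only up to that rounding; the variable names are introduced here for illustration.

```python
# Coefficients and intercept as printed by the notebook.
COEFS = [34.43767452, 11.00693978, 15.16953266, 12.48954293, 7.13060804, 5.64658155]
INTERCEPT = -14.900014162143634

# Spending for months 1-3; months 4-6 are entered as 0, as in the notebook.
new_customer = [300, 100, 250, 0, 0, 0]

# Contribution of each month to the predicted CLV.
contributions = [c * m for c, m in zip(COEFS, new_customer)]
manual_pred = INTERCEPT + sum(contributions)

for month, part in enumerate(contributions, start=1):
    print(f"MONTH_{month}: {part:10.2f}")
print(f"Intercept: {INTERCEPT:9.2f}")
print(f"Predicted CLV: {manual_pred:.2f}")
```

The breakdown shows that months 4 to 6 contribute nothing to this estimate because they were entered as 0, so the figure is best read as a prediction for a customer who stops spending after month 3.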
