<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Predicting-Customer-Lifetime-Value" data-toc-modified-id="Predicting-Customer-Lifetime-Value-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Predicting Customer Lifetime Value</a></span><ul class="toc-item"><li><span><a href="#Loading-and-Viewing-Data" data-toc-modified-id="Loading-and-Viewing-Data-1.1"><span class="toc-item-num">1.1&nbsp;&nbsp;</span>Loading and Viewing Data</a></span></li><li><span><a href="#Correlation-Analysis" data-toc-modified-id="Correlation-Analysis-1.2"><span class="toc-item-num">1.2&nbsp;&nbsp;</span>Correlation Analysis</a></span></li><li><span><a href="#Train-Test-Split" data-toc-modified-id="Train-Test-Split-1.3"><span class="toc-item-num">1.3&nbsp;&nbsp;</span>Train Test Split</a></span></li><li><span><a href="#Build-and-Test-Model" data-toc-modified-id="Build-and-Test-Model-1.4"><span class="toc-item-num">1.4&nbsp;&nbsp;</span>Build and Test Model</a></span></li><li><span><a href="#Predicting-for-a-new-Customer" data-toc-modified-id="Predicting-for-a-new-Customer-1.5"><span class="toc-item-num">1.5&nbsp;&nbsp;</span>Predicting for a new Customer</a></span></li></ul></li></ul></div>

# Predicting Customer Lifetime Value

In this example, we will use the past purchase history of existing customers to build a model that can predict the Customer Lifetime Value (CLV) of new customers.

## Loading and Viewing Data
We will load the data file for this example and check out its columns and summary information.

In [1]:
# Import relevant libraries & packages
import pandas as pd
import numpy as np
import os
import matplotlib.pylab as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
import sklearn.metrics

# Import data
raw_data = pd.read_csv("history.csv")

# Preview info
raw_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 8 columns):
CUST_ID    100 non-null int64
MONTH_1    100 non-null int64
MONTH_2    100 non-null int64
MONTH_3    100 non-null int64
MONTH_4    100 non-null int64
MONTH_5    100 non-null int64
MONTH_6    100 non-null int64
CLV        100 non-null int64
dtypes: int64(8)
memory usage: 6.4 KB


The dataset consists of the customer ID, the amount the customer spent on the website in each of their first 6 months as a new customer of the business, and the customer's total lifetime value (over 3 years).

In [2]:
# Preview data
raw_data.head()

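|   | CUST_ID | MONTH_1 | MONTH_2 | MONTH_3 | MONTH_4 | MONTH_5 | MONTH_6 | CLV |
|---|---------|---------|---------|---------|---------|---------|---------|-----|
| 0 | 1001 | 150 | 75 | 200 | 100 | 175 | 75 | 13125 |
| 1 | 1002 | 25 | 50 | 150 | 200 | 175 | 200 | 9375 |
| 2 | 1003 | 75 | 150 | 0 | 25 | 75 | 25 | 5156 |
| 3 | 1004 | 200 | 200 | 25 | 100 | 75 | 150 | 11756 |
| 4 | 1005 | 200 | 200 | 125 | 75 | 175 | 200 | 15525 |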


## Correlation Analysis

Let's look at the correlation between CLV and the spend in each month.

In [3]:
cleaned_data = raw_data.drop("CUST_ID", axis=1)
cleaned_data.corr()['CLV']

MONTH_1    0.734122
MONTH_2    0.250397
MONTH_3    0.371742
MONTH_4    0.297408
MONTH_5    0.376775
MONTH_6    0.327064
CLV        1.000000
Name: CLV, dtype: float64

We can see that month 1 shows the strongest correlation with the target variable (CLV), at about 0.73. So, spend in the first month is the strongest single indicator of total CLV. The remaining months are more weakly correlated, with coefficients ranging from about 0.25 to 0.38.
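To make the ranking easier to read, the correlations can be sorted and the trivial CLV-with-itself entry dropped. The sketch below is self-contained and uses only the five preview rows from `raw_data.head()`, so its coefficients will not match the full 100-row output above; the names `sample` and `clv_corr` are introduced here for illustration.

```python
import pandas as pd

# Only the five preview rows shown by raw_data.head(), not the full dataset,
# so the numbers differ from the notebook output.
sample = pd.DataFrame(
    {
        "MONTH_1": [150, 25, 75, 200, 200],
        "MONTH_2": [75, 50, 150, 200, 200],
        "MONTH_3": [200, 150, 0, 25, 125],
        "MONTH_4": [100, 200, 25, 100, 75],
        "MONTH_5": [175, 175, 75, 75, 175],
        "MONTH_6": [75, 200, 25, 150, 200],
        "CLV": [13125, 9375, 5156, 11756, 15525],
    }
)

# Pearson correlation of each monthly column with CLV, strongest first.
# Dropping the "CLV" entry removes the self-correlation of 1.0.
clv_corr = sample.corr()["CLV"].drop("CLV").sort_values(ascending=False)
print(clv_corr)
```

In the notebook itself, the same expression could be applied to `cleaned_data` instead of `sample`.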

## Train Test Split

Let us split the data into train and test datasets with a ratio of 90:10.

In [12]:
# Predictor variables
predictors = cleaned_data.drop("CLV",axis=1)

# Target variable
target = cleaned_data.CLV

# Train test split
pred_train, pred_test, tar_train, tar_test = train_test_split(predictors, target, test_size=0.1)

# Check shapes
print("Predictor - Train :", pred_train.shape, "\nPredictor - Test :", pred_test.shape )


Predictor - Train : (90, 6) 
Predictor - Test : (10, 6)
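Because `train_test_split` shuffles the rows, each run of the cell above can place different customers in the test set, and the downstream coefficients and R2 will shift accordingly. A minimal sketch of making the split repeatable with the `random_state` parameter follows. The seed value 42, the helper `split`, and the synthetic `demo_predictors` / `demo_target` data are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption): in the notebook, `predictors` and
# `target` would be passed instead.
rng = np.random.default_rng(0)
demo_predictors = pd.DataFrame(
    rng.integers(0, 201, size=(100, 6)),
    columns=[f"MONTH_{i}" for i in range(1, 7)],
)
demo_target = pd.Series(demo_predictors.sum(axis=1) * 10, name="CLV")


def split(seed):
    # random_state fixes the shuffle, so the same rows land in the test set
    return train_test_split(
        demo_predictors, demo_target, test_size=0.1, random_state=seed
    )


first = split(42)
second = split(42)

# Shapes of the train and test predictors
print(first[0].shape, first[1].shape)
# Whether both runs selected the same test rows
print(first[1].index.equals(second[1].index))
```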


## Build and Test Model
We will now build a linear regression model for predicting CLV and then evaluate it by predicting against the test dataset.

In [21]:
# Build model with training data

# Instantiate & Fit Model
model = LinearRegression()
model.fit(pred_train, tar_train)

# Print coefficients and intercept
print("Coefficients: \n", list(model.coef_))
print("\nIntercept:", model.intercept_)

# Make predictions using test data
predictions = model.predict(pred_test)

# Evaluate the fit with the R2 score (coefficient of determination)
r2 = sklearn.metrics.r2_score(tar_test, predictions)
print("\nR2:", r2)

Coefficients: 
 [34.74477894971672, 10.542551393581304, 15.751506690420607, 11.293717634024985, 7.148167528004595, 5.018784458549347]

Intercept: 51.2451565854044

R2: 0.9446437512942709


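An R2 of about 0.94 means the model explains roughly 94% of the variance in CLV on the test set; R2 is a goodness-of-fit measure, not a classification accuracy. The model appears to capture the relationship well, although a 10-row test set makes this estimate sensitive to which rows were held out.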
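For reference, R2 is defined as one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below computes it by hand and compares the result with `sklearn.metrics.r2_score`. The `actual` and `predicted` arrays are hypothetical values chosen for illustration, not output from the model above.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical actual and predicted CLV values, for illustration only
actual = np.array([13125.0, 9375.0, 5156.0, 11756.0, 15525.0])
predicted = np.array([12800.0, 9900.0, 5600.0, 11200.0, 15100.0])

# Residual sum of squares: error left over after the model's predictions
ss_res = np.sum((actual - predicted) ** 2)
# Total sum of squares: error of always predicting the mean
ss_tot = np.sum((actual - actual.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot

# Both values should agree up to floating-point rounding
print(r2_manual, r2_score(actual, predicted))
```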

## Predicting for a new Customer
Now let's say there is a new customer who has spent 100, 0, and 50 on the website in the first 3 months of becoming a customer. Let's use the model to predict the customer's CLV, entering 0 for months 4 to 6, which have not happened yet.

In [32]:
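# Generate data for new customer (months 4-6 are not yet known and are set to 0).
# Reusing the training column names keeps the input consistent with the fitted model.
new_data = pd.DataFrame([[100, 0, 50, 0, 0, 0]], columns=predictors.columns)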

# Make predictions with/on new data
new_pred = model.predict(new_data)

# Print predicted CLV
print("The CLV for the new customer is : $", '{0:.2f}'.format(new_pred[0]))

The CLV for the new customer is : $ 4313.30
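A linear regression prediction is simply the intercept plus the sum of each coefficient multiplied by the matching input. As a sanity check, the sketch below reproduces the prediction by hand, using the coefficients and intercept printed in the Build and Test Model section; the variable names are introduced here for illustration.

```python
import numpy as np

# Coefficients and intercept copied from the fitted model's printed output
coefficients = np.array([
    34.74477894971672,
    10.542551393581304,
    15.751506690420607,
    11.293717634024985,
    7.148167528004595,
    5.018784458549347,
])
intercept = 51.2451565854044

# Spend for months 1-6; months 4-6 are entered as 0
new_customer = np.array([100, 0, 50, 0, 0, 0])

# Prediction = intercept + dot product of coefficients and inputs
manual_pred = intercept + coefficients @ new_customer
print(f"{manual_pred:.2f}")  # 4313.30
```

Only the month 1 and month 3 coefficients contribute here, since every other input is 0.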
