# Introduction

Customer Lifetime Value (CLV) is the total value a business expects to earn from a single customer account over
its entire lifespan, and predicting it is crucial for e-commerce businesses. My project aims to develop a predictive
model for CLV that forecasts this value over the complete future relationship with each customer. Such a model
supports optimizing marketing strategies, improving customer retention, and allocating resources to the most
promising customer segments.

# Dataset and Preliminary Examination

To achieve this goal, I use the "Online Retail II" dataset from the UCI Machine Learning Repository. The dataset records the transactions of a UK-based online retailer from December 2009 to December 2011. Each row is one invoice line and carries the invoice number, stock code, product description, quantity, invoice date, unit price, customer ID, and country, which makes it a rich source for analyzing buying behavior and building CLV models.

# Problem Significance

Accurate CLV prediction matters in e-commerce because it identifies the most valuable customers and informs how to sell more to them. This supports better-targeted marketing and a better customer experience, which in turn raises the return on marketing spend.

# Modeling Approach and Hypotheses

Based on a preliminary look at the data and an understanding of CLV dynamics, I propose the following initial hypotheses:
- Purchase frequency is positively correlated with CLV: customers who buy more often tend to have higher lifetime values.
- Total amount spent is strongly correlated with CLV: higher-spending customers tend to have higher lifetime values.
- Seasonal trends and product categories may influence customer spending and, through it, CLV.

These hypotheses guide my exploratory data analysis and modeling approach. I will start with a basic Linear Regression model and advance to more complex algorithms: Decision Trees, Random Forests, Gradient Boosting, and Neural Networks (Multilayer Perceptrons). Each model will be evaluated on how accurately it predicts CLV, so that the approach can be refined step by step.
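
One way the first two hypotheses could be made testable is sketched below. It is only an illustration under stated assumptions: the transactions are made-up toy rows that reuse the dataset's column names, and the cutoff date, the `Revenue` column, the `build_features` helper, and the use of future revenue as a stand-in for CLV are all hypothetical choices, not the method used in this project. Purchases before the cutoff become features; purchases after it become the target.

```python
import pandas as pd

# Toy transactions using the dataset's column names; all values are made up.
tx = pd.DataFrame({
    "Customer ID": [1, 1, 1, 2, 2, 3],
    "Invoice": ["A1", "A2", "A3", "B1", "B2", "C1"],
    "InvoiceDate": pd.to_datetime([
        "2010-01-05", "2010-03-10", "2010-08-01",
        "2010-02-01", "2010-09-15", "2010-04-20",
    ]),
    "Quantity": [2, 1, 4, 10, 5, 3],
    "Price": [5.0, 10.0, 2.5, 1.0, 2.0, 4.0],
})
tx["Revenue"] = tx["Quantity"] * tx["Price"]  # hypothetical line-revenue column

cutoff = pd.Timestamp("2010-07-01")  # hypothetical split date


def build_features(tx, cutoff):
    """Return one row per customer: past frequency and spend, plus future revenue."""
    past = tx[tx["InvoiceDate"] < cutoff]
    future = tx[tx["InvoiceDate"] >= cutoff]

    features = past.groupby("Customer ID").agg(
        frequency=("Invoice", "nunique"),  # number of distinct invoices
        monetary=("Revenue", "sum"),       # total spend before the cutoff
    )
    target = future.groupby("Customer ID")["Revenue"].sum().rename("future_revenue")

    # Customers with no purchases after the cutoff get a target of 0.
    return features.join(target, how="left").fillna({"future_revenue": 0.0})


customer_table = build_features(tx, cutoff)
print(customer_table)
```

On real data, `customer_table[["frequency", "monetary"]].corrwith(customer_table["future_revenue"])` would give a first read on both hypotheses, and the same table could serve as input to the regression models listed above.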

# Conclusion and Forward-Look

The core objective of my project is an evidence-based CLV prediction model that gives e-commerce
businesses reliable estimates for decision-making. With a clearer view of what each customer is worth, I intend
to steer marketing tactics and customer relations toward the most profitable segments. As market dynamics shift
and new data arrives, I plan to refine the model iteratively so that it stays accurate and relevant in a changing
business environment.

# Exploratory Data Analysis

In [11]:
import pandas as pd
import numpy as np

# read_excel loads only the first worksheet by default, so the rows shown below end on 2010-12-09
data = pd.read_excel('online_retail_II.xlsx', engine='openpyxl')

In [9]:
data

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
0,489434,85048,15CM CHRISTMAS GLASS BALL 20 LIGHTS,12,2009-12-01 07:45:00,6.95,13085.0,United Kingdom
1,489434,79323P,PINK CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
2,489434,79323W,WHITE CHERRY LIGHTS,12,2009-12-01 07:45:00,6.75,13085.0,United Kingdom
3,489434,22041,"RECORD FRAME 7"" SINGLE SIZE",48,2009-12-01 07:45:00,2.10,13085.0,United Kingdom
4,489434,21232,STRAWBERRY CERAMIC TRINKET BOX,24,2009-12-01 07:45:00,1.25,13085.0,United Kingdom
...,...,...,...,...,...,...,...,...
525456,538171,22271,FELTCRAFT DOLL ROSIE,2,2010-12-09 20:01:00,2.95,17530.0,United Kingdom
525457,538171,22750,FELTCRAFT PRINCESS LOLA DOLL,1,2010-12-09 20:01:00,3.75,17530.0,United Kingdom
525458,538171,22751,FELTCRAFT PRINCESS OLIVIA DOLL,1,2010-12-09 20:01:00,3.75,17530.0,United Kingdom
525459,538171,20970,PINK FLORAL FELTCRAFT SHOULDER BAG,2,2010-12-09 20:01:00,3.75,17530.0,United Kingdom


In [14]:
data.describe(include='all')

Unnamed: 0,Invoice,StockCode,Description,Quantity,InvoiceDate,Price,Customer ID,Country
count,525461.0,525461,522533,525461.0,525461,525461.0,417534.0,525461
unique,28816.0,4632,4681,,,,,40
top,537434.0,85123A,WHITE HANGING HEART T-LIGHT HOLDER,,,,,United Kingdom
freq,675.0,3516,3549,,,,,485852
mean,,,,10.337667,2010-06-28 11:37:36.845017856,4.688834,15360.645478,
min,,,,-9600.0,2009-12-01 07:45:00,-53594.36,12346.0,
25%,,,,1.0,2010-03-21 12:20:00,1.25,13983.0,
50%,,,,3.0,2010-07-06 09:51:00,2.1,15311.0,
75%,,,,10.0,2010-10-15 12:45:00,4.21,16799.0,
max,,,,19152.0,2010-12-09 20:01:00,25111.09,18287.0,
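
The summary above reports a minimum `Quantity` of -9600 and a minimum `Price` of -53594.36, so some rows do not describe an ordinary sale. A minimal sketch for counting such rows follows. The sample frame is made up, and `count_suspect_rows` is a hypothetical helper; how these rows should be treated (dropped, netted against sales, or kept) remains an open modeling decision.

```python
import pandas as pd

# Made-up rows that mimic the anomalies visible in the summary statistics.
sample = pd.DataFrame({
    "Quantity": [12, -12, 3, 1],
    "Price": [6.95, 2.95, 0.0, -53594.36],
})


def count_suspect_rows(df):
    """Count rows whose quantity or price cannot belong to an ordinary sale."""
    return pd.Series({
        "negative_quantity": int((df["Quantity"] < 0).sum()),
        "non_positive_price": int((df["Price"] <= 0).sum()),
    })


counts = count_suspect_rows(sample)
print(counts)
```

Calling `count_suspect_rows(data)` on the loaded frame would show how much of the dataset these rows represent before deciding on a cleaning rule.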


In [15]:
data.info()  # prints column dtypes and non-null counts; info() returns None, so there is nothing to assign

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 525461 entries, 0 to 525460
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype         
---  ------       --------------   -----         
 0   Invoice      525461 non-null  object        
 1   StockCode    525461 non-null  object        
 2   Description  522533 non-null  object        
 3   Quantity     525461 non-null  int64         
 4   InvoiceDate  525461 non-null  datetime64[ns]
 5   Price        525461 non-null  float64       
 6   Customer ID  417534 non-null  float64       
 7   Country      525461 non-null  object        
dtypes: datetime64[ns](1), float64(2), int64(1), object(4)
memory usage: 32.1+ MB


In [18]:
missing_values = data.isnull().sum()  # count missing values per column
print(missing_values)

Invoice             0
StockCode           0
Description      2928
Quantity            0
InvoiceDate         0
Price               0
Customer ID    107927
Country             0
dtype: int64
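
`Customer ID` is missing in 107927 of 525461 rows, roughly a fifth of the data, and those rows cannot be attributed to any customer. The sketch below shows one possible way to handle this, assuming such rows are simply dropped for customer-level modeling. The sample frame is made up, and the names `missing_share` and `identified` are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up rows with the same kind of gaps as the real data.
sample = pd.DataFrame({
    "Description": ["PINK CHERRY LIGHTS", None, "WHITE CHERRY LIGHTS", "FELTCRAFT DOLL ROSIE"],
    "Customer ID": [13085.0, np.nan, 13085.0, 17530.0],
})

# Share of missing values per column, easier to compare than raw counts.
missing_share = sample.isnull().mean()

# Keep only rows that can be tied to a customer (an assumption, not the only option).
identified = sample.dropna(subset=["Customer ID"]).copy()

# With the NaNs gone, the IDs can be stored as integers instead of floats.
identified["Customer ID"] = identified["Customer ID"].astype("int64")

print(missing_share)
print(identified.dtypes)
```

Missing descriptions are left in place here because `Description` is not needed to attribute a transaction to a customer.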
