# E-Commerce Customer Lifetime Value (CLV) Prediction and Segmentation

## Complete Analysis Pipeline

This notebook walks through a CLV analysis in nine steps, matching the numbered sections below:
1. Setup and Imports
2. Load and Explore Data
3. Data Cleaning and Preparation
4. RFM Analysis
5. Customer Segmentation (K-Means)
6. Advanced CLV Modeling (BG/NBD and Pareto/NBD)
7. Model Evaluation and Comparison
8. Business Insights and Recommendations
9. Save Results

## 1. Setup and Imports

In [None]:
import sys
sys.path.insert(0, '..')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime
import warnings
warnings.filterwarnings('ignore')

# Import custom modules
from src import data_processing, rfm_analysis, segmentation, clv_modeling, visualization, utils

# Setup logging
logger = utils.setup_logging('INFO')
utils.set_random_seed(42)

# Setup visualization
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✓ All imports successful!")

## 2. Load and Explore Data

In [None]:
# Load data
df = data_processing.load_raw_data('../data/raw/ecommerce_transactions.csv')

# Display basic info
utils.print_data_info(df)

# Display first rows
print("\nFirst 10 rows:")
df.head(10)

In [None]:
# Detailed statistics
print("\nData Summary:")
print(f"Date Range: {df['TransactionDate'].min()} to {df['TransactionDate'].max()}")
print(f"Total Transactions: {len(df):,}")
print(f"Unique Customers: {df['CustomerID'].nunique():,}")
print(f"Total Revenue: ${df['Amount'].sum():,.2f}")
print("\nAmount Statistics:")
print(df['Amount'].describe())

## 3. Data Cleaning and Preparation

In [None]:
# Check for missing values
missing = data_processing.check_missing_values(df)
print(f"\nMissing values: {missing if missing else 'None'}")

# Check for duplicates
print(f"Duplicate rows: {df.duplicated().sum()}")

# Data preparation
df_clean = data_processing.prepare_data_for_analysis(df, remove_outliers_flag=False)

print(f"\nCleaned data shape: {df_clean.shape}")
print(f"Rows removed: {len(df) - len(df_clean):,}")
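
The internals of `data_processing.prepare_data_for_analysis` are not shown in this notebook. As a rough sketch of what a cleaning step like this commonly covers, the hypothetical helper `basic_clean` below parses dates, drops rows with missing keys, removes exact duplicates, and discards non-positive amounts. The column names follow the cells above; the rules themselves are assumptions and may not match the project's implementation.

```python
import pandas as pd

def basic_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical cleaning step; not the project's actual implementation."""
    out = df.copy()
    # Unparseable dates become NaT and are dropped with the other missing keys.
    out["TransactionDate"] = pd.to_datetime(out["TransactionDate"], errors="coerce")
    out = out.dropna(subset=["CustomerID", "TransactionDate", "Amount"])
    out = out.drop_duplicates()
    # Assumption: refunds and zero-value rows are excluded from the analysis.
    out = out[out["Amount"] > 0]
    return out.reset_index(drop=True)

# Toy data: one duplicate, one refund, one missing ID, one bad date.
toy = pd.DataFrame({
    "CustomerID": ["A", "A", "B", None, "C"],
    "TransactionDate": ["2024-01-05", "2024-01-05", "2024-02-10", "2024-03-01", "not a date"],
    "Amount": [20.0, 20.0, -5.0, 12.0, 30.0],
})
cleaned = basic_clean(toy)
print(cleaned)
```

Only the first row for customer `A` survives in this toy example, which makes it easy to see which rule removed each of the other rows.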

## 4. RFM Analysis

In [None]:
# Prepare RFM analysis
rfm = rfm_analysis.prepare_rfm_analysis(
    df_clean,
    customer_id_col='CustomerID',
    transaction_date_col='TransactionDate',
    amount_col='Amount'
)

print("\nRFM Analysis Results:")
print(f"Total Customers: {len(rfm):,}")
print("\nRFM Summary:")
print(rfm_analysis.get_rfm_summary(rfm))
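
For readers without access to `rfm_analysis`, the core RFM table can be sketched with a single pandas `groupby`. The function name `simple_rfm` and the choice of snapshot date are illustrative assumptions; the project's own function may also compute scores and segment labels, which this sketch omits.

```python
import pandas as pd

def simple_rfm(tx: pd.DataFrame, snapshot_date: pd.Timestamp) -> pd.DataFrame:
    """Illustrative RFM table; the project's own function may differ."""
    grouped = tx.groupby("CustomerID").agg(
        last_purchase=("TransactionDate", "max"),
        Frequency=("TransactionDate", "count"),
        Monetary=("Amount", "sum"),
    )
    # Recency: days between the snapshot date and the customer's last purchase.
    grouped["Recency"] = (snapshot_date - grouped["last_purchase"]).dt.days
    return grouped[["Recency", "Frequency", "Monetary"]]

tx = pd.DataFrame({
    "CustomerID": ["A", "A", "B"],
    "TransactionDate": pd.to_datetime(["2024-01-01", "2024-01-20", "2024-01-10"]),
    "Amount": [10.0, 15.0, 40.0],
})
rfm_toy = simple_rfm(tx, snapshot_date=pd.Timestamp("2024-01-31"))
print(rfm_toy)
```

Here customer `A` has a recency of 11 days, a frequency of 2, and a monetary total of 25.0, while `B` has a recency of 21 days and a single purchase of 40.0.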

In [None]:
# Segment summary
print("\nSegment Summary:")
segment_summary = rfm_analysis.get_segment_summary(rfm)
print(segment_summary)

In [None]:
# Visualize RFM distributions
visualization.plot_rfm_distribution(rfm)
visualization.plot_rfm_scores(rfm)
visualization.plot_segment_distribution(rfm)
visualization.plot_segment_rfm(rfm)

## 5. Customer Segmentation (K-Means)

In [None]:
# Prepare segmentation
features = ['Recency', 'Frequency', 'Monetary']

df_segmented, scaler, kmeans, metrics = segmentation.prepare_segmentation(
    rfm,
    features=features,
    find_optimal=True,
    random_state=42
)

print("\nSegmentation Results:")
print(f"Optimal number of clusters: {df_segmented['Cluster'].nunique()}")
print("\nCluster Distribution:")
print(df_segmented['Cluster'].value_counts().sort_index())
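
How `segmentation.prepare_segmentation` picks the cluster count when `find_optimal=True` is not shown here. One common approach is the elbow method: standardize the features, fit K-Means for several values of k, and look for the point where inertia stops dropping sharply. The sketch below uses a small NumPy-only K-Means (Lloyd's algorithm) so that it runs without extra dependencies; in practice scikit-learn's `KMeans` is the usual choice. All function names and the synthetic data are assumptions for illustration.

```python
import numpy as np

def standardize(X: np.ndarray) -> np.ndarray:
    """Scale each column to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def kmeans_inertia(X: np.ndarray, k: int, n_iter: int = 50, seed: int = 42):
    """Plain Lloyd's algorithm; returns (labels, inertia)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each point to its nearest center (squared Euclidean distance).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Recompute centers; keep the old center if a cluster is empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1), float(dists.min(axis=1).sum())

# Synthetic (Recency, Frequency, Monetary) rows forming two distinct groups.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=[10.0, 2.0, 50.0], scale=1.0, size=(20, 3))
group_b = rng.normal(loc=[200.0, 15.0, 900.0], scale=1.0, size=(20, 3))
X = standardize(np.vstack([group_a, group_b]))

inertias = {k: kmeans_inertia(X, k)[1] for k in range(1, 5)}
for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.2f}")
```

Standardizing first matters because `Monetary` is usually on a much larger scale than `Recency` or `Frequency` and would otherwise dominate the distance calculation.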

In [None]:
# Analyze segments
segment_profiles = segmentation.analyze_segments(df_segmented, features)
print("\nSegment Profiles:")
print(segment_profiles)

In [None]:
# Visualize elbow curve
if metrics:
    visualization.plot_elbow_curve(metrics)

# Visualize clusters
visualization.plot_cluster_scatter(df_segmented, 'Frequency', 'Monetary', 'Cluster')
visualization.plot_cluster_scatter(df_segmented, 'Recency', 'Monetary', 'Cluster')

## 6. Advanced CLV Modeling (BG/NBD and Pareto/NBD)

In [None]:
# Prepare data for lifetimes models
rfm_lifetimes, reference_date = clv_modeling.prepare_rfm_for_lifetimes(
    df_clean,
    customer_id_col='CustomerID',
    transaction_date_col='TransactionDate',
    amount_col='Amount'
)

print("\nRFM Data for Lifetimes Models:")
print(f"Shape: {rfm_lifetimes.shape}")
print("\nFirst 10 rows:")
print(rfm_lifetimes.head(10))
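
The `monetary_value` column name used later suggests that `rfm_lifetimes` follows the summary convention of the `lifetimes` library, which differs from the classic RFM table in Section 4: `frequency` counts repeat purchase days only, `recency` is the time between a customer's first and last purchase, and `T` is the customer's age at the end of the observation period. The pandas sketch below reproduces that convention under those assumptions; `lifetimes_style_summary` is a hypothetical name, and the project's `prepare_rfm_for_lifetimes` may differ.

```python
import pandas as pd

def lifetimes_style_summary(tx: pd.DataFrame, observation_end: pd.Timestamp) -> pd.DataFrame:
    """Hypothetical sketch of a frequency / recency / T / monetary_value table."""
    # Collapse multiple orders on the same day into one purchase occasion.
    daily = (
        tx.assign(day=tx["TransactionDate"].dt.normalize())
          .groupby(["CustomerID", "day"], as_index=False)["Amount"].sum()
    )
    first = daily.groupby("CustomerID")["day"].transform("min")
    daily["is_repeat"] = daily["day"] > first

    summary = daily.groupby("CustomerID").agg(
        first_day=("day", "min"),
        last_day=("day", "max"),
        frequency=("is_repeat", "sum"),
    )
    summary["recency"] = (summary["last_day"] - summary["first_day"]).dt.days
    summary["T"] = (observation_end - summary["first_day"]).dt.days
    # Average value of repeat purchase days; 0 for customers with no repeats.
    repeat_mean = daily[daily["is_repeat"]].groupby("CustomerID")["Amount"].mean()
    summary["monetary_value"] = repeat_mean.reindex(summary.index).fillna(0.0)
    return summary[["frequency", "recency", "T", "monetary_value"]]

tx = pd.DataFrame({
    "CustomerID": ["A", "A", "A", "A", "B"],
    "TransactionDate": pd.to_datetime(
        ["2024-01-01", "2024-01-11", "2024-01-11", "2024-01-21", "2024-01-15"]
    ),
    "Amount": [10.0, 20.0, 10.0, 50.0, 40.0],
})
summary_toy = lifetimes_style_summary(tx, observation_end=pd.Timestamp("2024-01-31"))
print(summary_toy)
```

Customer `A` ends up with `frequency=2`, `recency=20`, `T=30`, and `monetary_value=40.0`; customer `B`, with a single purchase, has zero frequency, recency, and monetary value.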

In [None]:
# Fit BG/NBD model
print("\n" + "="*60)
print("FITTING BG/NBD MODEL")
print("="*60)

bgf = clv_modeling.fit_bgf_model(rfm_lifetimes)
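
One useful quantity from a fitted BG/NBD model is the probability that a customer is still active. Given the four model parameters (r, alpha, a, b), it has a closed form: for a customer with x repeat purchases, last purchase at t_x, and age T, P(alive) = 1 / (1 + a/(b + x - 1) * ((alpha + T)/(alpha + t_x))^(r + x)), and it equals 1 when x = 0. The sketch below evaluates that formula with NumPy. The parameter values are made up for illustration and are not the output of `fit_bgf_model`.

```python
import numpy as np

def bgnbd_prob_alive(frequency, recency, T, r, alpha, a, b):
    """Closed-form P(alive) under the BG/NBD model."""
    frequency = np.asarray(frequency, dtype=float)
    recency = np.asarray(recency, dtype=float)
    T = np.asarray(T, dtype=float)
    # Use x=1 as a placeholder where frequency is 0 to avoid invalid arithmetic;
    # those entries are overwritten with 1.0 below.
    safe_x = np.where(frequency > 0, frequency, 1.0)
    ratio = (a / (b + safe_x - 1.0)) * ((alpha + T) / (alpha + recency)) ** (r + safe_x)
    return np.where(frequency > 0, 1.0 / (1.0 + ratio), 1.0)

# Illustrative parameters only (assumed, not fitted to this dataset).
params = dict(r=0.25, alpha=4.0, a=0.8, b=2.5)

# Three customers observed for 30 periods:
# no repeats; two repeats with a purchase today; two repeats long ago.
p_alive = bgnbd_prob_alive(
    frequency=[0, 2, 2], recency=[0, 30, 5], T=[30, 30, 30], **params
)
print(p_alive)
```

The third customer, whose last purchase was long ago relative to their age, gets a noticeably lower probability than the second, which is the behaviour that drives the model's churn-adjusted purchase forecasts.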

In [None]:
# Fit Pareto/NBD model
print("\n" + "="*60)
print("FITTING PARETO/NBD MODEL")
print("="*60)

pnbd = clv_modeling.fit_pareto_model(rfm_lifetimes)

In [None]:
# Predict CLV using BG/NBD
print("\n" + "="*60)
print("CLV PREDICTIONS - BG/NBD MODEL")
print("="*60)

clv_bgf = clv_modeling.predict_clv(
    bgf, 
    rfm_lifetimes, 
    prediction_period_days=365,
    monetary_col='monetary_value'
)

print("\nCLV Distribution (BG/NBD):")
print(clv_bgf['CLV'].describe())

In [None]:
# Predict CLV using Pareto/NBD
print("\n" + "="*60)
print("CLV PREDICTIONS - PARETO/NBD MODEL")
print("="*60)

clv_pnbd = clv_modeling.predict_clv(
    pnbd, 
    rfm_lifetimes, 
    prediction_period_days=365,
    monetary_col='monetary_value'
)

print("\nCLV Distribution (Pareto/NBD):")
print(clv_pnbd['CLV'].describe())

## 7. Model Evaluation and Comparison

In [None]:
# Evaluate BG/NBD model
print("\nBG/NBD Model Evaluation:")
bgf_metrics = clv_modeling.evaluate_model(bgf, rfm_lifetimes)
print(bgf_metrics)

In [None]:
# Evaluate Pareto/NBD model
print("\nPareto/NBD Model Evaluation:")
pnbd_metrics = clv_modeling.evaluate_model(pnbd, rfm_lifetimes)
print(pnbd_metrics)
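
Both evaluation calls above receive the same `rfm_lifetimes` table used for fitting, and what `evaluate_model` computes internally is not shown. A common complementary check is a calibration/holdout split: fit on transactions up to a cutoff date, then compare predicted purchase counts with what actually happened afterwards. The sketch below shows only the holdout bookkeeping and an error metric. The helper names are hypothetical, and the `predicted` values are placeholders standing in for model output.

```python
import numpy as np
import pandas as pd

def holdout_purchase_counts(tx: pd.DataFrame, cutoff: pd.Timestamp, customers) -> pd.Series:
    """Count distinct transaction dates per customer strictly after the cutoff."""
    later = tx[tx["TransactionDate"] > cutoff]
    counts = later.groupby("CustomerID")["TransactionDate"].nunique()
    # Customers with no holdout purchases get an explicit zero.
    return counts.reindex(customers, fill_value=0)

def mean_absolute_error(actual, predicted) -> float:
    return float(np.mean(np.abs(np.asarray(actual) - np.asarray(predicted))))

tx = pd.DataFrame({
    "CustomerID": ["A", "A", "A", "B"],
    "TransactionDate": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-03-15", "2024-01-20"]
    ),
    "Amount": [10.0, 12.0, 9.0, 40.0],
})
cutoff = pd.Timestamp("2024-02-29")
customers = ["A", "B"]

actual = holdout_purchase_counts(tx, cutoff, customers)
# Placeholder predictions; in practice these would come from the fitted model.
predicted = pd.Series({"A": 0.6, "B": 0.2})
mae = mean_absolute_error(actual, predicted.reindex(customers))
print(actual)
print(f"MAE: {mae:.2f}")
```

Evaluating on a holdout period gives a fairer basis for comparing BG/NBD and Pareto/NBD than in-sample fit alone, since both models can fit the calibration data well while forecasting differently.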

In [None]:
# Compare models
print("\n" + "="*60)
print("MODEL COMPARISON")
print("="*60)

comparison = clv_modeling.compare_models(rfm_lifetimes)
print(comparison)

In [None]:
# Visualize CLV distributions
visualization.plot_clv_distribution(clv_bgf)
visualization.plot_model_comparison(comparison)

## 8. Business Insights and Recommendations

In [None]:
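# Combine CLV with RFM and segments.
# NOTE: .values assigns by position, so this assumes rfm, clv_bgf, clv_pnbd,
# and df_segmented all list the same customers in the same order.
assert len(rfm) == len(clv_bgf) == len(clv_pnbd) == len(df_segmented)
rfm_with_clv = rfm.copy()
rfm_with_clv['CLV_BGF'] = clv_bgf['CLV'].values
rfm_with_clv['CLV_PNBD'] = clv_pnbd['CLV'].values
rfm_with_clv['Cluster'] = df_segmented['Cluster'].values

# Average the two model estimates into a single CLV figure
rfm_with_clv['CLV'] = (rfm_with_clv['CLV_BGF'] + rfm_with_clv['CLV_PNBD']) / 2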

print("\nTop 20 Customers by CLV:")
top_customers = rfm_with_clv.nlargest(20, 'CLV')[['Recency', 'Frequency', 'Monetary', 'Segment', 'CLV']]
print(top_customers)

In [None]:
# CLV by segment
print("\nCLV Analysis by Segment:")
clv_by_segment = rfm_with_clv.groupby('Segment').agg({
    'CLV': ['count', 'mean', 'median', 'sum'],
    'Monetary': 'mean',
    'Frequency': 'mean'
}).round(2)
print(clv_by_segment)

In [None]:
# CLV by cluster
print("\nCLV Analysis by Cluster:")
clv_by_cluster = rfm_with_clv.groupby('Cluster').agg({
    'CLV': ['count', 'mean', 'median', 'sum'],
    'Recency': 'mean',
    'Frequency': 'mean',
    'Monetary': 'mean'
}).round(2)
print(clv_by_cluster)

In [None]:
# Business recommendations
print("\n" + "="*60)
print("BUSINESS INSIGHTS AND RECOMMENDATIONS")
print("="*60)

# Recommendation for each RFM segment of interest
recommendations = {
    'Champions': 'VIP treatment, exclusive offers, loyalty programs',
    'Loyal Customers': 'Retention programs, personalized recommendations',
    'At Risk': 'Win-back campaigns, special discounts, re-engagement',
    'New Customers': 'Onboarding programs, welcome offers, education',
}

for i, (segment, recommendation) in enumerate(recommendations.items(), start=1):
    subset = rfm_with_clv[rfm_with_clv['Segment'] == segment]
    print(f"\n{i}. {segment.upper()} ({len(subset)} customers):")
    if subset.empty:
        # Avoid printing NaN averages for segments absent from this dataset
        print("   - No customers in this segment")
        continue
    print(f"   - Average CLV: ${subset['CLV'].mean():.2f}")
    print(f"   - Total Value: ${subset['CLV'].sum():,.2f}")
    print(f"   - Recommendation: {recommendation}")

In [None]:
# Create summary dashboard
visualization.create_summary_dashboard(rfm_with_clv, rfm_with_clv[['CLV']])

## 9. Save Results

In [None]:
# Save RFM with CLV
output_path = '../data/processed/rfm_with_clv.parquet'
utils.save_dataframe(rfm_with_clv, output_path, format='parquet')

# Save segmented data
output_path = '../data/processed/segmented_customers.parquet'
utils.save_dataframe(df_segmented, output_path, format='parquet')

print("\n✓ Results saved successfully!")

## Summary

This comprehensive analysis demonstrates:

1. **Data Processing**: Complete data cleaning and preparation pipeline
2. **RFM Analysis**: Customer segmentation using Recency, Frequency, and Monetary metrics
3. **K-Means Clustering**: Customer segmentation on the RFM features with automatic selection of the cluster count (`find_optimal=True`)
4. **Advanced CLV Modeling**: Probabilistic CLV prediction with two models:
   - BG/NBD (Beta-Geometric/Negative Binomial Distribution)
   - Pareto/NBD
5. **Model Evaluation**: Comprehensive metrics and comparison
6. **Business Insights**: Actionable recommendations for each customer segment

The analysis provides a complete framework for understanding customer value and developing targeted marketing strategies.