## Customer lifetime value prediction
ML implementation series for Product Managers, post #4

### DISCLAIMER: It helps greatly to know Python and ML basics beforehand. If you don't, I strongly urge you to learn them; treat this as non-negotiable. They form the foundation for future posts in this series and for your career as a PM working with ML teams.

### Why this follows posts 1, 2, and 3

We've built three critical ML models:

**Post 1 (churn prediction):** Told us WHO will leave  
**Post 2 (segmentation):** Told us WHO they are (customer profiles)  
**Post 3 (recommendations):** Told us WHAT to offer them

But none of these answer the most important question for finance and leadership:

**"How much should we invest in each customer?"**

That's where customer lifetime value (CLV) prediction comes in. It's the missing piece that turns insights into budget allocation strategy.


### Problem statement

The finance team is planning next year's customer acquisition budget. The CFO walks into the meeting and asks:

**"How much is each new customer actually worth to us?"**

Marketing has been spending equally across all channels. Some customers come back monthly. Others buy once and disappear. But the acquisition cost is the same for everyone.

The current approach:
- Spend $100 to acquire every customer
- Hope they're valuable
- Realize months later that some customers never came back
- Waste thousands on low-value acquisitions

**The question became:**  
Can we predict which customers will be high-value EARLY in their journey, before we waste budget on low-value ones?

----

### Dataset overview

Same customer data platform (CDP) with 5,000 customers from posts 1, 2, and 3.

#### Tables used
- **cdp_customers**: Customer demographics and signup data
- **cdp_customer_features**: Behavioral metrics (RFM, engagement)

#### Features for CLV prediction

We use EARLY signals (things you know within the first 30-60 days):

1. `recency_days`: Days since last purchase
2. `frequency`: Number of purchases so far
3. `monetary_value`: Total spend to date
4. `avg_order_value`: Average transaction size
5. `engagement_score`: Platform activity level
6. `email_open_rate`: Email engagement
7. `email_click_rate`: Campaign interaction
8. `churn_risk`: Risk category from post 1
9. `segment_label`: Customer segment from post 2 (re-created with K-Means in the code below)

----
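
To make "early signals" concrete, here is a minimal sketch of how such features might be derived from raw orders, keeping only purchases made in each customer's first 60 days. The `signups` and `transactions` tables, their column names, and the 60-day window are assumptions for illustration; the CDP tables in this series already ship with these features precomputed.

```python
import pandas as pd

# Hypothetical raw tables (not part of the series' dataset).
signups = pd.DataFrame({
    "customer_id": [1, 2],
    "signup_date": pd.to_datetime(["2024-01-01", "2024-01-15"]),
})
transactions = pd.DataFrame({
    "customer_id": [1, 1, 1, 2],
    "order_date": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-06-01", "2024-01-20"]
    ),
    "amount": [40.0, 60.0, 500.0, 25.0],
})

WINDOW_DAYS = 60  # assumed observation window

# Keep only orders placed within the window after signup.
tx = transactions.merge(signups, on="customer_id")
days_since_signup = (tx["order_date"] - tx["signup_date"]).dt.days
early_tx = tx[days_since_signup.between(0, WINDOW_DAYS)]

# Aggregate the early orders into RFM-style features.
early = early_tx.groupby("customer_id").agg(
    frequency=("amount", "size"),
    monetary_value=("amount", "sum"),
    avg_order_value=("amount", "mean"),
    last_order=("order_date", "max"),
)

# Recency is measured at the end of the window, not "today".
window_end = (
    signups.set_index("customer_id")["signup_date"]
    + pd.Timedelta(days=WINDOW_DAYS)
)
early["recency_days"] = (window_end.loc[early.index] - early["last_order"]).dt.days
early = early.drop(columns="last_order")

print(early)
```

Customer 1's large order in June falls outside the window, so it is excluded from the features even though it would count toward that customer's lifetime value.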


### ML approach: Random forest regression

#### The core question

"Based on a customer's early behavior, what will their total lifetime value be?"

#### Why random forest for CLV prediction?

Let's compare the options a PM should know:

#### Option 1: Historical average CLV
- **Logic:** "Every customer is worth the average"
- **Pros:** Simple, no model needed
- **Cons:** Ignores customer differences, wastes budget
- **When to use:** Never, unless you have zero data

#### Option 2: Linear regression
- **Logic:** "CLV increases linearly with features"
- **Pros:** Simple, interpretable
- **Cons:** Assumes linear relationships (often wrong for CLV)
- **When to use:** If you need extreme simplicity

#### Option 3: Random forest regression (our choice)
- **Logic:** "Learn complex patterns from customer behavior"
- **Pros:** Handles non-linear relationships, robust, accurate
- **Cons:** Slightly less interpretable than linear models
- **When to use:** When accuracy matters for budget decisions (which it does)

**Why we chose random forest:**
1. CLV relationships are non-linear (high spenders behave differently)
2. Handles outliers well (VIP customers)
3. Proven accuracy for predicting continuous values
4. Still explainable via feature importance
----
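
The three options above can be compared side by side. The sketch below does this on synthetic data with a deliberately non-linear target, so the data, the coefficients, and the variable names are all assumptions for illustration rather than results from the CDP dataset.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic customers (an assumption for illustration, not the CDP data).
rng = np.random.default_rng(42)
n = 1000
spend = rng.uniform(0, 1, n)        # normalized early spend
engagement = rng.uniform(0, 1, n)   # normalized engagement

# Deliberately non-linear target: cubic in spend, plus a bonus for
# customers who are both highly engaged and higher spenders.
clv = (
    100
    + 2000 * spend**3
    + 800 * ((engagement > 0.6) & (spend > 0.5))
    + rng.normal(0, 30, n)
)

X_toy = np.column_stack([spend, engagement])
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, clv, test_size=0.2, random_state=42
)

candidates = {
    "Option 1: historical average": DummyRegressor(strategy="mean"),
    "Option 2: linear regression": LinearRegression(),
    "Option 3: random forest": RandomForestRegressor(
        n_estimators=100, random_state=42
    ),
}

# Fit each option and record its mean absolute error on held-out data.
mae = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    mae[name] = mean_absolute_error(y_te, model.predict(X_te))
    print(f"{name}: MAE = {mae[name]:.0f}")
```

The ranking depends on the data. If the true relationship were linear, linear regression would be expected to match or beat the forest, so this comparison is worth re-running on your own data before committing to a model.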

### How random forest works

Imagine you're trying to predict how much a new customer will spend over their lifetime.

**Traditional approach:**  
Look at their first purchase and guess: "They spent $50, so they'll probably spend $500 total."

**Random forest approach:**  
Create 100 different "decision trees," each asking questions like:
- "Did they spend more than $30 on first order?"
  - If yes: "Do they open more than 20% of emails?"
    - If yes: "Predict high CLV"
    - If no: "Predict medium CLV"
  - If no: "Predict low CLV"

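Each tree produces its own CLV estimate in dollars (the "high," "medium," and "low" labels above are shorthand for dollar amounts). The final prediction is the average of all 100 estimates.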

**Why this works:**
- Different trees capture different patterns
- Averaging reduces errors
- Handles complex relationships (e.g., high engagement + low spend might mean future high value)
----
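
The averaging step described above can be checked directly in scikit-learn: a fitted `RandomForestRegressor` exposes its individual trees through `estimators_`. The sketch below uses toy data (an assumption for illustration) and shows that the forest's prediction equals the mean of the per-tree predictions.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data: two made-up features standing in for first-order spend
# and email open rate.
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 2))
y_demo = 5 * X_demo[:, 0] + 2 * X_demo[:, 1] + rng.normal(0, 10, size=200)

forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_demo, y_demo)

# Ask every tree for its own estimate for one new customer.
new_customer = np.array([[50.0, 30.0]])
per_tree = np.array(
    [tree.predict(new_customer)[0] for tree in forest.estimators_]
)

print(f"Lowest tree estimate:  {per_tree.min():.1f}")
print(f"Highest tree estimate: {per_tree.max():.1f}")
print(f"Mean of all trees:     {per_tree.mean():.1f}")
print(f"Forest prediction:     {forest.predict(new_customer)[0]:.1f}")
```

The spread between the lowest and highest tree estimate is a rough, informal signal of how much the trees disagree about a given customer.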

In [3]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error

# Load data
cdp_customers = pd.read_csv('cdp_customers.csv')
cdp_customer_features = pd.read_csv('cdp_customer_features.csv')
df = cdp_customers.merge(cdp_customer_features, on='customer_id', how='left')

In [5]:
from sklearn.cluster import KMeans

# Features used in Post #2 for segmentation
segmentation_features = [
    'recency_days',
    'frequency', 
    'monetary_value',
    'avg_order_value',
    'engagement_score',
    'email_open_rate',
    'email_click_rate'
]

# Prepare data
X_segment = df[segmentation_features].fillna(df[segmentation_features].median())

# Run K-Means (same as Post #2)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
df['segment_label'] = kmeans.fit_predict(X_segment)

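# Map segment numbers to names (from Post #2 analysis)
# Note: K-Means cluster numbering is arbitrary, so this mapping only holds
# for the same data, features, and random_state used in Post #2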
segment_names = {
    0: 'Engaged Browsers',
    1: 'At-Risk Dormant', 
    2: 'High-Value Customers',
    3: 'Campaign Champions'
}
df['segment_name'] = df['segment_label'].map(segment_names)

# Select early-signal features
early_features = [
    'recency_days',
    'frequency',
    'monetary_value',
    'avg_order_value',
    'engagement_score',
    'email_open_rate',
    'email_click_rate',
    'churn_risk',
    'segment_label'
]

# Prepare features
X = df[early_features].copy()

In [7]:
# Encode categorical churn_risk (from post 1)
churn_risk_mapping = {'Low': 0, 'Medium': 1, 'High': 2}
X['churn_risk'] = X['churn_risk'].map(churn_risk_mapping)

# Fill missing values
X = X.fillna(X.median())

# Target variable
y = df['customer_lifetime_value'].copy()

print(f"Feature matrix shape: {X.shape}")
print(f"Target variable shape: {y.shape}")


Feature matrix shape: (5000, 9)
Target variable shape: (5000,)


In [9]:
# Split data 80/20
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set: {X_train.shape[0]} customers")
print(f"Test set: {X_test.shape[0]} customers")


Training set: 4000 customers
Test set: 1000 customers


In [11]:
# Initialize random forest
rf_model = RandomForestRegressor(
    n_estimators=100,      # 100 decision trees
    max_depth=10,          # Limit tree depth to avoid overfitting
    min_samples_split=20,  # Require 20+ samples to split nodes
    random_state=42,
    n_jobs=-1             # Use all CPU cores
)

# Train model
rf_model.fit(X_train, y_train)

# Make predictions
y_pred_test = rf_model.predict(X_test)

print("Model trained successfully!")


Model trained successfully!


In [13]:
# Calculate metrics
test_r2 = r2_score(y_test, y_pred_test)
test_mae = mean_absolute_error(y_test, y_pred_test)
test_rmse = np.sqrt(mean_squared_error(y_test, y_pred_test))

print(f"R² Score: {test_r2:.1%}")
print(f"Mean Absolute Error: ${test_mae:.2f}")
print(f"Root Mean Squared Error: ${test_rmse:.2f}")


R² Score: 99.7%
Mean Absolute Error: $6.35
Root Mean Squared Error: $58.57


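**Is 99.7% R² too good?**

For PMs, this should raise a red flag: "Is the model overfitting, or is the target leaking into the features?"

The score is explainable here because:
- CLV in this dataset is calculated FROM these features (monetary value, frequency)
- High accuracy is expected when features and target are this closely related

That also means the score overstates how well the model would forecast future value: a model that is handed total spend to date mostly recovers the formula behind CLV rather than predicting it from early behavior. To validate in production, build the features only from each customer's first 30-60 days, measure CLV over a later period, and test on completely new customers.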
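
One quick way to probe a suspiciously high score is to cross-validate the model with and without the feature you suspect is doing all the work. The sketch below uses synthetic data in which the target is almost a direct function of spend to date; the data, coefficients, and variable names are assumptions for illustration, not the CDP tables.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic customers (an assumption for illustration).
rng = np.random.default_rng(42)
n = 500
toy = pd.DataFrame({
    "monetary_value": rng.gamma(2.0, 150.0, n),
    "frequency": rng.integers(1, 10, n),
    "engagement_score": rng.uniform(0, 100, n),
})
# Assumed target: nearly a formula of spend to date.
toy_clv = (
    1.2 * toy["monetary_value"]
    + 5 * toy["frequency"]
    + rng.normal(0, 10, n)
)

model = RandomForestRegressor(n_estimators=50, random_state=42)

# 5-fold cross-validated R² with every feature.
r2_all = cross_val_score(model, toy, toy_clv, cv=5, scoring="r2").mean()

# Same model, but without the suspect feature.
r2_without_spend = cross_val_score(
    model, toy.drop(columns="monetary_value"), toy_clv, cv=5, scoring="r2"
).mean()

print(f"R² with all features:       {r2_all:.3f}")
print(f"R² without monetary_value:  {r2_without_spend:.3f}")
```

If the score collapses once a single feature is removed, the model is leaning almost entirely on that feature, and the headline R² says more about how the target was constructed than about predictive skill.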


In [17]:
# Analyze which features matter most
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)

print("Feature importance:")
print(feature_importance)


Feature importance:
            feature    importance
2    monetary_value  9.999964e-01
3   avg_order_value  2.808237e-06
8     segment_label  7.333160e-07
1         frequency  6.024523e-08
0      recency_days  8.923744e-10
4  engagement_score  0.000000e+00
5   email_open_rate  0.000000e+00
6  email_click_rate  0.000000e+00
7        churn_risk  0.000000e+00


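**Results:**

| Feature | Importance |
|---------|------------|
| monetary_value | ~100.0% |
| avg_order_value | < 0.001% |
| segment_label | < 0.001% |
| frequency | < 0.001% |
| recency_days | < 0.001% |
| engagement_score, email_open_rate, email_click_rate, churn_risk | 0.0% |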

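**What this tells us:**
Spend to date (`monetary_value`) drives almost the entire prediction; every other feature contributes close to nothing.

**Why?**
Customers who spend more early tend to spend more overall, and in this dataset CLV is calculated largely from spend to date.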
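
Built-in feature importance is measured on the training data, so it can be worth cross-checking against permutation importance on held-out data: shuffle one feature at a time and see how much the score drops. The sketch below uses scikit-learn's `permutation_importance` on synthetic data; the data and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic customers (an assumption for illustration).
rng = np.random.default_rng(7)
n = 600
toy_X = pd.DataFrame({
    "monetary_value": rng.gamma(2.0, 150.0, n),
    "frequency": rng.integers(1, 10, n),
    "email_open_rate": rng.uniform(0, 1, n),
})
toy_y = (
    1.2 * toy_X["monetary_value"]
    + 20 * toy_X["frequency"]
    + rng.normal(0, 10, n)
)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.25, random_state=7
)
model = RandomForestRegressor(n_estimators=50, random_state=7).fit(X_tr, y_tr)

# Shuffle each feature 5 times on the held-out set and average the
# resulting drop in R².
result = permutation_importance(model, X_te, y_te, n_repeats=5, random_state=7)
perm = pd.Series(result.importances_mean, index=toy_X.columns)
perm = perm.sort_values(ascending=False)

print(perm)
```

A feature that matters shows a large drop when shuffled; a feature the model ignores shows a drop near zero.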

## Business impact

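### CLV segment distribution

The totals below are consistent with the 1,000-customer test set (for example, 31.2% of 1,000 is 312 customers, and 312 × \$58 ≈ \$18K).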

| Segment | % of Customers | Avg CLV | Total Value | Acquisition ROI (at \$100 CAC) |
|---------|----------------|---------|-------------|-------------------------------|
| Low Value | 31.2% | \$58 | \$18K | -42% (lose money) |
| Medium Value | 33.7% | \$545 | \$184K | +445% (great) |
| High Value | 26.3% | \$1,621 | \$426K | +1,521% (excellent) |
| VIP | 8.8% | \$3,563 | \$314K | +3,463% (invest heavily) |


### Key business insights

#### Insight 1: Customer acquisition cost thresholds

**If current CAC is $100:**
- Low value customers: Lose $42 per customer (stop acquiring)
- Medium value customers: Make $445 per customer (good ROI)
- High value customers: Make $1,521 per customer (focus here)
- VIP customers: Make $3,463 per customer (invest aggressively)

**Recommendation:** Raise the allowable CAC to $300 for predicted high-value customers. At an average CLV of $1,621, that still returns more than 5x the acquisition cost.
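
The ROI column and the thresholds above follow from one formula, ROI = (CLV - CAC) / CAC. A minimal sketch using the average CLV per segment from the table; the function and variable names are illustrative.

```python
CAC = 100  # current acquisition cost per customer, in dollars

# Average CLV per segment, taken from the table above.
avg_clv = {
    "Low Value": 58,
    "Medium Value": 545,
    "High Value": 1621,
    "VIP": 3563,
}


def acquisition_roi(clv: float, cac: float) -> float:
    """Return ROI as a fraction: net value per customer divided by CAC."""
    return (clv - cac) / cac


for segment, clv in avg_clv.items():
    net = clv - CAC
    roi = acquisition_roi(clv, CAC)
    print(f"{segment}: net ${net:,} per customer, ROI {roi:+.0%}")

# Return multiple if CAC for high-value customers is raised to $300.
high_value_multiple = avg_clv["High Value"] / 300
print(f"High Value at $300 CAC: {high_value_multiple:.1f}x")
```

Swapping in your own CAC per channel turns this into a quick break-even check: any segment whose average CLV sits below CAC loses money on acquisition.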

---

#### Insight 2: Budget reallocation opportunity

**Current state (no CLV prediction):**
- Spend $100 on every customer
- Total budget: $500K for 5,000 customers ($100K per 1,000 customers, the cohort size the dollar figures below are based on)
- Waste: $13K net loss on low-value customers who never return (312 customers × $42)

**With CLV prediction:**
- Spend $0 on predicted low-value (save $31K)
- Spend $300 on predicted high-value/VIP ($105K in total across 351 customers)
- Net impact: Acquire fewer, better customers by shifting budget from low-value to high-value acquisition
- Expected revenue lift: +25-40%
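
The dollar figures in this insight can be reproduced with simple arithmetic, assuming they refer to a cohort of 1,000 customers (the size of the test set), which is what the segment percentages and totals are consistent with. The segment counts below are derived from those percentages; the variable names are illustrative.

```python
CAC_FLAT = 100      # current spend per customer, in dollars
CAC_PREMIUM = 300   # proposed spend for high-value and VIP customers

# segment: (customers per 1,000, average CLV in dollars)
segments = {
    "Low Value": (312, 58),
    "Medium Value": (337, 545),
    "High Value": (263, 1621),
    "VIP": (88, 3563),
}

total_customers = sum(count for count, _ in segments.values())
flat_budget = total_customers * CAC_FLAT

low_count, low_clv = segments["Low Value"]
# Net loss on low-value customers: what was spent minus what came back.
low_value_loss = low_count * (CAC_FLAT - low_clv)
# Budget freed up by not acquiring them at all.
low_value_savings = low_count * CAC_FLAT

premium_count = segments["High Value"][0] + segments["VIP"][0]
premium_spend = premium_count * CAC_PREMIUM

print(f"Flat budget per 1,000 customers: ${flat_budget:,}")
print(f"Net loss on low-value customers: ${low_value_loss:,}")
print(f"Saved by skipping low-value:     ${low_value_savings:,}")
print(f"Spend on high-value/VIP at $300: ${premium_spend:,}")
```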

---

#### Insight 3: Connect to previous posts

**Post 1 (churn):** High churn risk = lower CLV → reduce acquisition spend  
**Post 2 (segmentation):** High-value segment = high CLV → increase retention spend  
**Post 3 (recommendations):** Personalized offers can INCREASE CLV → measure impact  

**Combined power:** Now we can prioritize retention efforts on high-CLV, high-churn customers.
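
As a sketch of that prioritization, the filter below picks out customers who are both high-CLV and high-churn-risk. The `scored` table, its column names, and the $1,000 cutoff are hypothetical; the churn categories match the `Low` / `Medium` / `High` labels from post 1.

```python
import pandas as pd

# Hypothetical scored customers (illustrative values only).
scored = pd.DataFrame({
    "customer_id": [101, 102, 103, 104],
    "predicted_clv": [3200.0, 90.0, 1500.0, 2100.0],
    "churn_risk": ["High", "High", "Low", "Medium"],
})

CLV_THRESHOLD = 1000  # hypothetical cutoff for "high CLV"

# Keep customers worth saving who are also likely to leave,
# highest predicted value first.
is_high_clv = scored["predicted_clv"] >= CLV_THRESHOLD
is_high_churn = scored["churn_risk"] == "High"
retention_priority = (
    scored[is_high_clv & is_high_churn]
    .sort_values("predicted_clv", ascending=False)
)

print(retention_priority)
```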


## Why this solution works (PM perspective)

### 1. It's predictive, not reactive

Don't wait 12 months to know if a customer was worth acquiring. Predict their value within the first 30-60 days.

### 2. It's actionable

Clear CLV segments translate directly to marketing actions:
- Low value: Don't acquire
- Medium value: Acquire efficiently
- High value and VIP: Invest heavily

### 3. It's measurable

Track actual vs. predicted CLV over time. Refine the model. Measure ROI lift.

### 4. It integrates with existing models

Churn risk (post 1) is a feature. Segment membership (post 2) is a feature too (`segment_label`). Recommendation uptake (post 3) can increase predicted CLV.

### 5. It changes the budget conversation

Instead of "we need more marketing budget," it's "we should shift $100K from low-value to high-value acquisition channels, here's the ROI."


### What to do with these predictions

Now that you can predict CLV early, here's how to operationalize it:

**Integrate CLV scores into CRM**
- Tag every new customer with predicted CLV segment
- Update predictions monthly as behavior evolves
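
Tagging can be as simple as bucketing the predicted CLV into named segments. The sketch below uses `pandas.cut`; the dollar thresholds are hypothetical and would need to be set with finance, and the sample predictions are illustrative.

```python
import pandas as pd

# Illustrative predicted CLV values for four new customers.
predictions = pd.DataFrame({
    "customer_id": [201, 202, 203, 204],
    "predicted_clv": [45.0, 610.0, 1700.0, 3900.0],
})

# Hypothetical segment boundaries in dollars (not taken from the post).
bins = [0, 200, 1000, 2500, float("inf")]
labels = ["Low Value", "Medium Value", "High Value", "VIP"]

# right=False makes each bucket include its lower bound: [0, 200), [200, 1000), ...
predictions["clv_segment"] = pd.cut(
    predictions["predicted_clv"], bins=bins, labels=labels, right=False
)

print(predictions)
```

Re-running the same step on each monthly refresh keeps the CRM tag in line with the latest prediction.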

**Adjust acquisition targeting**
- Work with paid ads team to target high-CLV lookalike audiences
- Reduce spend on channels that bring low-CLV customers
- A/B test acquisition messaging for high vs. medium value segments

**Optimize retention budget**
- Allocate 70% of retention budget to high-value customers
- 25% to medium value
- 5% to low value (or exit them gracefully)

**Measure impact**
- Compare actual CLV vs. predicted CLV for cohorts
- Track ROI improvement by segment
- Refine model based on new data
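
A minimal sketch of the actual-versus-predicted comparison, grouped by segment. The `cohort` table and its values are hypothetical; in practice `actual_clv` would only become available once the cohort has aged enough to measure it.

```python
import pandas as pd

# Hypothetical cohort with predictions made at signup and value observed later.
cohort = pd.DataFrame({
    "segment": ["High Value", "High Value", "Low Value", "Low Value"],
    "predicted_clv": [1600.0, 1700.0, 60.0, 50.0],
    "actual_clv": [1500.0, 1900.0, 40.0, 80.0],
})

# Absolute prediction error per customer.
cohort["abs_error"] = (cohort["actual_clv"] - cohort["predicted_clv"]).abs()

# Summarize predicted vs. actual value and average error per segment.
report = cohort.groupby("segment").agg(
    customers=("segment", "size"),
    mean_predicted=("predicted_clv", "mean"),
    mean_actual=("actual_clv", "mean"),
    mae=("abs_error", "mean"),
)

print(report)
```

A segment whose mean actual CLV drifts away from its mean predicted CLV is the first place to look when deciding whether the model needs retraining.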

**Ongoing: Report to finance**
- Show CLV-based ROI by acquisition channel
- Demonstrate how predicted CLV informs budget allocation
- Secure bigger budgets for high-value customer acquisition


## Connection to previous posts 

**Post 1 taught us:** Predict who will leave (churn)  
**Post 2 taught us:** Understand customer types (segmentation)  
**Post 3 taught us:** Recommend what to offer (personalization)  
**Post 4 teaches us:** Know how much to invest (CLV)

**The complete picture:**

For a new customer:
1. Predict their CLV → Know if they're worth acquiring
2. Predict their churn risk → Know if they need retention efforts
3. Identify their segment → Tailor communication style
4. Recommend products → Maximize their value

This is the full strategic ML stack for customer data platforms.


## What's next?

**Post 5: Campaign response prediction**

Now that we know who's valuable (CLV), who's at risk (churn), who they are (segments), and what to offer (recommendations), the next question is:

**"Which marketing campaigns will actually work on each customer?"**

Same dataset. New problem. Smarter campaign targeting.



*Part of the "Machine learning for product leaders" series - teaching PMs just enough ML to lead with confidence.*