# CLV Prediction: Ridge Regression Experiments

## Project Overview
This notebook explores three strategies for improving Ridge regression performance on Customer Lifetime Value (CLV) prediction:

1. **Feature Selection**: Testing manual vs automated features
2. **Target Transformation**: Log transformation to handle skewed distribution
3. **Outlier Treatment**: Clipping extreme values to reduce impact on RMSE

All experiments are logged to **MLflow** for tracking and comparison.

---

**Key Challenge**: The CLV distribution is highly right-skewed (median: $167, mean: $726, max: $114,887), so squared-error models such as Ridge are dominated by a small number of extreme outliers.

In [None]:
# Import libraries
import pandas as pd
import numpy as np
import mlflow
import mlflow.sklearn
import matplotlib.pyplot as plt
import json
from sklearn.pipeline import Pipeline
from sklearn.linear_model import RidgeCV
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

---

# Experiment 1: Manual RFM Features

## Objective
Test if manually engineered RFM (Recency, Frequency, Monetary) features perform better than automated features from Featuretools.

## Features Used
- **Recency**: `days_since_last_purchase`
- **Frequency**: `purchase_frequency`, `total_orders`
- **Monetary**: `avg_order_value`, `avg_item_value`
- **Additional**: `customer_age_days`, `active_days`, `avg_days_between_orders`, `avg_items_per_order`

## Hypothesis
Manual features focus on core customer behavior patterns and may generalize better than complex automated features.

In [None]:
# Define manual features
features_manual = [
    'days_since_last_purchase', 
    'customer_age_days',
    'active_days', 
    'total_orders', 
    'purchase_frequency',
    'avg_items_per_order', 
    'avg_days_between_orders', 
    'avg_order_value',
    'avg_item_value'
]

# Load data
train = pd.read_csv('../data/features/train_final.csv')
val = pd.read_csv('../data/features/val_final.csv')
test = pd.read_csv('../data/features/test_final.csv')

X_train = train[features_manual]
y_train = train['CLV_Target']

X_val = val[features_manual]
y_val = val['CLV_Target']

X_test = test[features_manual]
y_test = test['CLV_Target']

print(f"Data loaded: {len(X_train)} train, {len(X_val)} val, {len(X_test)} test")
print(f"Number of features: {len(features_manual)}")

In [None]:
# Build and train Ridge pipeline
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())   
])

preprocessor = ColumnTransformer(
    transformers=[('num', num_transformer, features_manual)]
)

ridge_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RidgeCV(alphas=[0.1, 1, 10], cv=5))
])

# Train model
ridge_pipeline.fit(X_train, y_train)

# Make predictions
y_train_pred = ridge_pipeline.predict(X_train)
y_val_pred = ridge_pipeline.predict(X_val)

# Calculate metrics
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
train_mae = mean_absolute_error(y_train, y_train_pred)
train_r2 = r2_score(y_train, y_train_pred)

val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
val_mae = mean_absolute_error(y_val, y_val_pred)
val_r2 = r2_score(y_val, y_val_pred)

print(f"\n{'='*70}")
print("EXPERIMENT 1: MANUAL FEATURES - RESULTS")
print(f"{'='*70}")
print(f"\nTraining:")
print(f"  RMSE: ${train_rmse:.2f}")
print(f"  MAE:  ${train_mae:.2f}")
print(f"  R²:   {train_r2:.4f}")
print(f"\nValidation:")
print(f"  RMSE: ${val_rmse:.2f}")
print(f"  MAE:  ${val_mae:.2f}")
print(f"  R²:   {val_r2:.4f}")

In [None]:
# Log to MLflow
mlflow.set_experiment("Ridge_Regression_Experiments")

with mlflow.start_run(run_name="Exp1_Manual_Features"):
    
    # Log parameters
    mlflow.log_params({
        "experiment": "manual_features",
        "model_type": "ridge",
        "n_features": len(features_manual),
        "feature_type": "manual_rfm",
        "target_transform": "none",
        "best_alpha": ridge_pipeline.named_steps['model'].alpha_,
        "cv_folds": 5
    })
    
    # Log metrics
    mlflow.log_metrics({
        "train_rmse": train_rmse,
        "train_mae": train_mae,
        "train_r2": train_r2,
        "val_rmse": val_rmse,
        "val_mae": val_mae,
        "val_r2": val_r2,
        "rmse_gap": val_rmse - train_rmse,
        "val_rmse_pct_of_mean": (val_rmse / y_val.mean()) * 100
    })
    
    # Log model
    mlflow.sklearn.log_model(ridge_pipeline, "model")
    
    # Log feature list
    mlflow.log_dict({"features": features_manual}, "features_used.json")
    
    # Tags
    mlflow.set_tags({
        "stage": "exploration",
        "strategy": "feature_selection",
        "author": "Bread"
    })
    
    print("\n✅ Experiment 1 logged to MLflow")

---

# Experiment 2: Log Transformation

## Objective
Apply log transformation to the target variable (CLV) to handle the highly skewed distribution and reduce the influence of extreme outliers.

## Method
1. Transform target: `y_log = log(1 + y)` (log1p to handle zeros)
2. Train Ridge regression on log-transformed target
3. Transform predictions back: `y_pred = exp(y_pred_log) - 1` (expm1)
4. Evaluate on original scale
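
As a minimal sketch of the steps above, scikit-learn's `TransformedTargetRegressor` can bundle the `log1p`/`expm1` round trip with the regressor, so predictions come back on the original scale automatically. The synthetic data and the names `X_demo`, `y_demo`, and `log_ridge` are illustrative assumptions, not the notebook's CLV data or pipeline.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import Ridge

# Synthetic, non-negative, right-skewed target (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
linear_part = X_demo @ np.array([1.0, 0.5, 0.2]) + rng.normal(scale=0.1, size=200)
y_demo = np.expm1(np.abs(linear_part))

# Wrapper: fits Ridge on log1p(y) and applies expm1 to its predictions
log_ridge = TransformedTargetRegressor(
    regressor=Ridge(alpha=1.0),
    func=np.log1p,
    inverse_func=np.expm1,
)
log_ridge.fit(X_demo, y_demo)
pred_wrapped = log_ridge.predict(X_demo)

# Manual equivalent of steps 1-3
manual = Ridge(alpha=1.0).fit(X_demo, np.log1p(y_demo))
pred_manual = np.expm1(manual.predict(X_demo))

# Both routes should give the same original-scale predictions
print(np.allclose(pred_wrapped, pred_manual))
```

Wrapping the transform this way keeps the forward and inverse steps from drifting apart when the model is saved or reused, at the cost of hiding the log-scale predictions.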

## Hypothesis
Log transformation will:
- Compress large values
- Make distribution more normal
- Reduce RMSE by minimizing outlier impact

## Features Used
This experiment uses the best automated features from feature selection (14 features, including aggregations and engineered combinations).

In [None]:
# Selected features from automated feature engineering
selected_features = [
    'SUM(transactions.Revenue)', 
    'SUM(transactions.Quantity)', 
    'COUNT(transactions) + MAX(transactions.Quantity)', 
    'total_orders', 
    'COUNT(transactions) + MAX(transactions.Revenue)', 
    'COUNT(transactions)', 
    'purchase_frequency', 
    'active_days', 
    'avg_days_between_orders', 
    'avg_items_per_order', 
    'avg_order_value', 
    'MAX(transactions.Revenue)', 
    'MEAN(transactions.Revenue)', 
    'MAX(transactions.Quantity) + MEAN(transactions.Revenue)'
]

# Load data
train = pd.read_csv('../data/features/train_final.csv')
val = pd.read_csv('../data/features/val_final.csv')
test = pd.read_csv('../data/features/test_final.csv')

X_train = train[selected_features]
y_train = train['CLV_Target']

X_val = val[selected_features]
y_val = val['CLV_Target']

X_test = test[selected_features]
y_test = test['CLV_Target']

# Apply log transformation to ALL targets (train, val, test)
y_train_log = np.log1p(y_train)
y_val_log = np.log1p(y_val)
y_test_log = np.log1p(y_test)

print(f"\nData Statistics:")
print(f"  Mean CLV:   ${y_train.mean():.2f}")
print(f"  Median CLV: ${y_train.median():.2f}")
print(f"  Max CLV:    ${y_train.max():.2f}")
print(f"\nLog-transformed:")
print(f"  Mean: {y_train_log.mean():.4f}")
print(f"  Std:  {y_train_log.std():.4f}")

In [None]:
# Build pipeline
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())   
])

preprocessor = ColumnTransformer(
    transformers=[('num', num_transformer, selected_features)]
)

ridge_pipeline_log = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RidgeCV(alphas=[0.01, 0.1, 1, 10, 100], cv=5))
])

# Train on LOG scale
ridge_pipeline_log.fit(X_train, y_train_log)

# Predict on LOG scale
y_train_pred_log = ridge_pipeline_log.predict(X_train)
y_val_pred_log = ridge_pipeline_log.predict(X_val)

# Transform back to ORIGINAL scale
y_train_pred = np.expm1(y_train_pred_log)
y_val_pred = np.expm1(y_val_pred_log)

# Calculate metrics on ORIGINAL scale
train_rmse = np.sqrt(mean_squared_error(y_train, y_train_pred))
train_mae = mean_absolute_error(y_train, y_train_pred)
train_r2 = r2_score(y_train, y_train_pred)

val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
val_mae = mean_absolute_error(y_val, y_val_pred)
val_r2 = r2_score(y_val, y_val_pred)

print(f"\n{'='*70}")
print("EXPERIMENT 2: LOG TRANSFORMATION - RESULTS")
print(f"{'='*70}")
print(f"\nTraining:")
print(f"  RMSE: ${train_rmse:.2f}")
print(f"  MAE:  ${train_mae:.2f}")
print(f"  R²:   {train_r2:.4f}")
print(f"\nValidation:")
print(f"  RMSE: ${val_rmse:.2f}")
print(f"  MAE:  ${val_mae:.2f}")
print(f"  R²:   {val_r2:.4f}")

In [None]:
# Log to MLflow
mlflow.set_experiment("Ridge_Regression_Experiments")

with mlflow.start_run(run_name="Exp2_Log_Transform"):
    
    # Log parameters
    mlflow.log_params({
        "experiment": "log_transform",
        "model_type": "ridge",
        "n_features": len(selected_features),
        "feature_type": "automated_selected",
        "target_transform": "log1p",
        "best_alpha": ridge_pipeline_log.named_steps['model'].alpha_,
        "cv_folds": 5
    })
    
    # Log metrics
    mlflow.log_metrics({
        "train_rmse": train_rmse,
        "train_mae": train_mae,
        "train_r2": train_r2,
        "val_rmse": val_rmse,
        "val_mae": val_mae,
        "val_r2": val_r2,
        "rmse_gap": val_rmse - train_rmse,
        "val_rmse_pct_of_mean": (val_rmse / y_val.mean()) * 100
    })
    
    # Log model
    mlflow.sklearn.log_model(ridge_pipeline_log, "model")
    
    # Log feature list
    mlflow.log_dict({"features": selected_features}, "features_used.json")
    
    # Tags
    mlflow.set_tags({
        "stage": "exploration",
        "strategy": "target_transformation",
        "author": "Bread"
    })
    
    print("\n✅ Experiment 2 logged to MLflow")

---

# Experiment 3: Clipping Outliers

## Objective
Clip extreme CLV values during training to reduce their impact on the model, while still evaluating on the full unclipped validation set.

## Method
1. Calculate clip threshold at 95th percentile of training CLV (~$2,331)
2. Cap training target values above threshold
3. Train Ridge regression on clipped data
4. Evaluate predictions on ORIGINAL unclipped validation data

## Hypothesis
Clipping will:
- Allow model to focus on majority of customers (bottom 95%)
- Reduce influence of extreme outliers during training
- Result in better generalization

## Rationale
Error analysis shows that ~10-20 extreme customers (>$20k CLV) cause massive residuals ($20k-$60k errors) that dominate RMSE but represent <1% of customer base.
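
To illustrate how a handful of extreme residuals can dominate RMSE, the sketch below uses made-up residuals (not the notebook's actual errors): 990 customers with a $200 error and 10 customers, 1% of the base, with a $40,000 error.

```python
import numpy as np

# Hypothetical residuals: 990 typical customers, 10 extreme ones
residuals = np.concatenate([np.full(990, 200.0), np.full(10, 40_000.0)])

rmse_all = np.sqrt(np.mean(residuals ** 2))
rmse_typical = np.sqrt(np.mean(residuals[:990] ** 2))

# Share of the total squared error contributed by the 10 extreme customers
extreme_share = np.sum(residuals[990:] ** 2) / np.sum(residuals ** 2)

print(f"RMSE (all customers):     ${rmse_all:,.2f}")
print(f"RMSE (typical customers): ${rmse_typical:,.2f}")
print(f"Squared-error share of extreme 1%: {extreme_share:.1%}")
```

Under these assumed numbers the overall RMSE is roughly twenty times the typical-customer RMSE, and almost all of the squared error comes from the extreme 1%.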

In [None]:
# Reload data (using same selected features)
train = pd.read_csv('../data/features/train_final.csv')
val = pd.read_csv('../data/features/val_final.csv')
test = pd.read_csv('../data/features/test_final.csv')

X_train = train[selected_features]
y_train = train['CLV_Target']

X_val = val[selected_features]
y_val = val['CLV_Target']

X_test = test[selected_features]
y_test = test['CLV_Target']

# Analyze distribution and determine clip threshold
clip_percentile = 0.95
clip_threshold = y_train.quantile(clip_percentile)

print(f"\nDistribution Analysis:")
for p in [0.50, 0.75, 0.90, 0.95, 0.99]:
    pct_value = y_train.quantile(p)  # separate name so the `val` DataFrame is not overwritten
    count = (y_train > pct_value).sum()
    print(f"  {int(p*100)}th percentile: ${pct_value:,.2f} ({count} customers above)")

print(f"\n📊 Clipping Strategy:")
print(f"  Clip threshold: ${clip_threshold:,.2f} ({int(clip_percentile*100)}th percentile)")
print(f"  Values to clip: {(y_train > clip_threshold).sum()} ({(y_train > clip_threshold).sum()/len(y_train)*100:.1f}%)")

# Apply clipping to TRAINING target only
y_train_clipped = y_train.clip(upper=clip_threshold)

print(f"\nAfter Clipping:")
print(f"  Mean CLV:   ${y_train_clipped.mean():.2f} (was ${y_train.mean():.2f})")
print(f"  Max CLV:    ${y_train_clipped.max():.2f} (was ${y_train.max():.2f})")

In [None]:
# Build pipeline
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())   
])

preprocessor = ColumnTransformer(
    transformers=[('num', num_transformer, selected_features)]
)

ridge_pipeline_clip = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('model', RidgeCV(alphas=[0.01, 0.1, 1, 10, 100], cv=5))
])

# Train on CLIPPED data
ridge_pipeline_clip.fit(X_train, y_train_clipped)

# Predict
y_train_pred = ridge_pipeline_clip.predict(X_train)
y_val_pred = ridge_pipeline_clip.predict(X_val)

# Metrics on CLIPPED training target (for reference)
train_rmse_clipped = np.sqrt(mean_squared_error(y_train_clipped, y_train_pred))
train_mae_clipped = mean_absolute_error(y_train_clipped, y_train_pred)
train_r2_clipped = r2_score(y_train_clipped, y_train_pred)

# Metrics on ORIGINAL unclipped validation target (what matters!)
val_rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
val_mae = mean_absolute_error(y_val, y_val_pred)
val_r2 = r2_score(y_val, y_val_pred)

print(f"\n{'='*70}")
print("EXPERIMENT 3: CLIPPING OUTLIERS - RESULTS")
print(f"{'='*70}")
print(f"\nTraining (vs clipped target):")
print(f"  RMSE: ${train_rmse_clipped:.2f}")
print(f"  MAE:  ${train_mae_clipped:.2f}")
print(f"  R²:   {train_r2_clipped:.4f}")
print(f"\nValidation (vs ORIGINAL unclipped target):")
print(f"  RMSE: ${val_rmse:.2f}")
print(f"  MAE:  ${val_mae:.2f}")
print(f"  R²:   {val_r2:.4f}")

# Error by customer segment
low_value = y_val <= y_val.quantile(0.33)
mid_value = (y_val > y_val.quantile(0.33)) & (y_val <= y_val.quantile(0.67))
high_value = y_val > y_val.quantile(0.67)

print(f"\nRMSE by Customer Segment:")
print(f"  Low value (bottom 33%):  ${np.sqrt(mean_squared_error(y_val[low_value], y_val_pred[low_value])):.2f}")
print(f"  Mid value (middle 33%):  ${np.sqrt(mean_squared_error(y_val[mid_value], y_val_pred[mid_value])):.2f}")
print(f"  High value (top 33%):    ${np.sqrt(mean_squared_error(y_val[high_value], y_val_pred[high_value])):.2f}")

In [None]:
# Log to MLflow
mlflow.set_experiment("Ridge_Regression_Experiments")

with mlflow.start_run(run_name="Exp3_Clipped_Outliers"):
    
    # Log parameters
    mlflow.log_params({
        "experiment": "clip_outliers",
        "model_type": "ridge",
        "n_features": len(selected_features),
        "feature_type": "automated_selected",
        "target_transform": "clip",
        "clip_percentile": clip_percentile,
        "clip_threshold": clip_threshold,
        "values_clipped": int((y_train > clip_threshold).sum()),
        "best_alpha": ridge_pipeline_clip.named_steps['model'].alpha_,
        "cv_folds": 5
    })
    
    # Log metrics
    mlflow.log_metrics({
        "train_rmse_clipped": train_rmse_clipped,
        "train_mae_clipped": train_mae_clipped,
        "train_r2_clipped": train_r2_clipped,
        "val_rmse": val_rmse,
        "val_mae": val_mae,
        "val_r2": val_r2,
        "rmse_gap": val_rmse - train_rmse_clipped,
        "val_rmse_pct_of_mean": (val_rmse / y_val.mean()) * 100,
        "val_rmse_low_segment": np.sqrt(mean_squared_error(y_val[low_value], y_val_pred[low_value])),
        "val_rmse_mid_segment": np.sqrt(mean_squared_error(y_val[mid_value], y_val_pred[mid_value])),
        "val_rmse_high_segment": np.sqrt(mean_squared_error(y_val[high_value], y_val_pred[high_value]))
    })
    
    # Log model
    mlflow.sklearn.log_model(ridge_pipeline_clip, "model")
    
    # Log feature list and clip threshold
    mlflow.log_dict({
        "features": selected_features,
        "clip_threshold": float(clip_threshold),
        "clip_percentile": clip_percentile
    }, "model_config.json")
    
    # Tags
    mlflow.set_tags({
        "stage": "exploration",
        "strategy": "outlier_treatment",
        "author": "Bread"
    })
    
    print("\n✅ Experiment 3 logged to MLflow")

---

# Experiment Summary & Conclusions

## Key Findings

All three experiments have been logged to MLflow under the **"Ridge_Regression_Experiments"** experiment.

### Expected Results:
1. **Manual Features**: Lower performance than automated features due to missing complex interactions
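2. **Log Transform**: FAILED - Made original-scale RMSE worse, likely because back-transforming with `expm1` turns log-scale errors into multiplicative errors that are largest for the extreme high-CLV customers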
3. **Clipping**: Modest improvement for majority of customers, but still struggles with high-value segment
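
One way to see why a log-scale model can score worse on original-scale RMSE: a fixed error in log space becomes a multiplicative error after `expm1`, so the absolute error grows with CLV. The sketch below uses illustrative values; the 0.5 log-scale error is an assumption, not a measured figure from these experiments.

```python
import numpy as np

true_clv = np.array([100.0, 1_000.0, 10_000.0, 100_000.0])
log_error = 0.5  # assumed constant under-prediction in log1p space

# Same log-scale miss for every customer, then back-transform
pred_clv = np.expm1(np.log1p(true_clv) - log_error)
abs_error = true_clv - pred_clv

for t, e in zip(true_clv, abs_error):
    print(f"true ${t:>10,.0f} -> absolute error ${e:>10,.0f}")
```

The absolute error here is a constant fraction of `1 + true_clv`, so the largest customers contribute the largest squared errors even though the model is equally accurate for everyone in log space.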

## Why These Approaches Had Limited Success

The fundamental challenge is **customer heterogeneity**:
- 90% of customers have CLV < $2,000 (predictable)
- 10% of customers have CLV > $2,000 (highly variable)
- Single model struggles to handle both segments effectively

## Next Steps

Based on residual analysis showing concentrated errors in high-value customers, the recommended approach is:

**Two-Tier Modeling System:**
1. **Classifier**: Predict if customer will be high/low value
2. **Regression Models**: Separate models for each segment
3. **Combined Prediction**: Route customers to appropriate model

Expected improvement: **15-25% RMSE reduction** (primarily on high-value segment)
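
The two-tier system outlined above could be sketched roughly as follows. This is a sketch under stated assumptions, not the project's implementation: the data is synthetic, the 90th-percentile segment boundary is an arbitrary choice, and `predict_two_tier` is a hypothetical helper that routes each customer to a single regressor.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

rng = np.random.default_rng(42)

# Synthetic stand-in for the CLV data (illustrative only)
X = rng.normal(size=(1000, 4))
noise = rng.normal(scale=0.3, size=1000)
y = np.expm1(1.5 * X[:, 0] + 0.5 * X[:, 1] + 5.0 + noise)

# Assumed segment boundary: 90th percentile of the training target
threshold = np.quantile(y, 0.90)
is_high = y > threshold

# Tier 1: classifier predicts the segment (high vs. low value)
clf = LogisticRegression(max_iter=1000).fit(X, is_high)

# Tier 2: one regressor per segment, each trained only on its own customers
reg_low = Ridge(alpha=1.0).fit(X[~is_high], y[~is_high])
reg_high = Ridge(alpha=1.0).fit(X[is_high], y[is_high])

def predict_two_tier(X_new):
    """Route each row to the regressor for its predicted segment."""
    high_mask = clf.predict(X_new).astype(bool)
    preds = np.empty(len(X_new))
    if (~high_mask).any():
        preds[~high_mask] = reg_low.predict(X_new[~high_mask])
    if high_mask.any():
        preds[high_mask] = reg_high.predict(X_new[high_mask])
    return preds

y_pred = predict_two_tier(X)
print(y_pred.shape)
```

Hard routing like this makes the final error sensitive to classifier mistakes, so any real comparison against the single-model runs would need to be made on the held-out validation set.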

---

## View Results in MLflow

Run MLflow UI to compare experiments:
```bash
mlflow ui
```

Navigate to `http://localhost:5000` and select the **"Ridge_Regression_Experiments"** experiment to compare:
- Validation RMSE
- R² scores
- Impact of different strategies

Filter and sort by metrics to identify the best approach for your use case.