# Notebook 05: Model Inference
## Putting the Model to Work

**Author:** Tuhin Bhattacharya  
**Program:** PGDM Business Data Analytics, Goa Institute of Management  
**Project:** CLV Prediction for Auto Insurance Portfolio

---

## Executive Summary

In this notebook, I use the trained CLV prediction model for inference. The model built in the previous notebooks is loaded from disk, applied to a single customer and to a batch of customers, and its predictions are turned into segments that can support business decisions.

### What This Notebook Covers

| Section | Purpose |
|---------|--------|
| **Loading Artifacts** | Model, preprocessor, and metadata |
| **Single Prediction** | Predict CLV for one customer |
| **Batch Prediction** | Score multiple customers at once |
| **Segmentation** | Categorize customers by predicted value |
| **Business Use Cases** | Practical applications |

### Key Business Applications

1. **New Customer Valuation**: Estimate expected value at acquisition
2. **Retention Prioritization**: Focus resources on high-potential customers
3. **Marketing Optimization**: Allocate budget based on predicted returns
4. **Risk Assessment**: Flag customers with low predicted CLV for review (low CLV can signal churn risk, although this model does not predict churn directly)

---

## 1. Environment Setup

In [None]:
# ============================================================================
# ENVIRONMENT SETUP
# ============================================================================

import pandas as pd
import numpy as np
import os
import joblib
import json

# Path Configuration
BASE_DIR = os.path.dirname(os.getcwd())
DATA_RAW_DIR = os.path.join(BASE_DIR, 'data', 'raw')
MODELS_DIR = os.path.join(BASE_DIR, 'models')

print("✅ Environment ready")

---

## 2. Load Trained Artifacts

We need three components for inference:
1. **Trained Model**: the Random Forest regressor
2. **Preprocessor**: the fitted ColumnTransformer
3. **Model Metadata**: the JSON file saved with the model, printed below for reference

In [None]:
# Load model
print("=" * 60)
print("LOADING TRAINED ARTIFACTS")
print("=" * 60)

model = joblib.load(os.path.join(MODELS_DIR, 'final_model.joblib'))
print(f"\n✅ Model loaded: {type(model).__name__}")

# Load preprocessor
preprocessor = joblib.load(os.path.join(MODELS_DIR, 'preprocessor.joblib'))
print(f"✅ Preprocessor loaded")

# Load model metadata
with open(os.path.join(MODELS_DIR, 'model_metadata.json'), 'r') as f:
    metadata = json.load(f)

print(f"\n📊 Model Metadata:")
for key, value in metadata.items():
    print(f"   {key}: {value}")
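
Loading fails with a fairly generic error if one of the three files is missing, so it can help to check for them first. The sketch below is one way to do that with the standard library. The helper name `find_missing_artifacts` is hypothetical (it is not part of the project code), and the file names are the ones used in the loading cell above.

```python
import tempfile
from pathlib import Path

# File names taken from the loading cell above.
REQUIRED_ARTIFACTS = ('final_model.joblib', 'preprocessor.joblib', 'model_metadata.json')

def find_missing_artifacts(models_dir, required=REQUIRED_ARTIFACTS):
    """Return the required artifact file names that are not present in models_dir."""
    models_path = Path(models_dir)
    return [name for name in required if not (models_path / name).is_file()]

# Demonstration on a temporary directory that contains only the model file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'final_model.joblib').touch()
    missing = find_missing_artifacts(tmp)

print(missing)
```

In the notebook itself, the same helper could be called as `find_missing_artifacts(MODELS_DIR)` before the `joblib.load` calls.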

---

## 3. Create Inference Function

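We create a reusable function that handles the complete prediction pipeline: column-name standardization, feature engineering, missing-value handling, preprocessing, prediction on the log scale, and conversion back to dollars.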

In [None]:
def predict_clv(input_data, model, preprocessor):
    """
    Predict Customer Lifetime Value for new data.
    
    Parameters:
    -----------
    input_data : pd.DataFrame
        Raw customer data with required features
    model : sklearn estimator
        Trained prediction model
    preprocessor : ColumnTransformer
        Fitted preprocessing pipeline
        
    Returns:
    --------
    pd.DataFrame
        Input data with predicted CLV columns added
    """
    # Create working copy
    df = input_data.copy()
    
    # Standardize column names
    df.columns = df.columns.str.lower().str.replace(' ', '_').str.replace('-', '_')
    
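    # Normalize string columns (skip missing values so they are not
    # converted to the literal string 'nan' by astype(str))
    for col in df.select_dtypes(include=['object']).columns:
        if col != 'customer':
            mask = df[col].notna()
            df.loc[mask, col] = df.loc[mask, col].astype(str).str.strip().str.lower()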
    
    # Feature engineering (must match training)
    if 'coverage' in df.columns and 'education' in df.columns:
        df['coverage_education'] = df['coverage'] + '_' + df['education']
    
    if 'total_claim_amount' in df.columns and 'monthly_premium_auto' in df.columns:
        df['insurance_loss_ratio'] = df['total_claim_amount'] / (df['monthly_premium_auto'] + 1)
    
    if 'monthly_premium_auto' in df.columns and 'number_of_policies' in df.columns:
        df['premium_per_policy'] = df['monthly_premium_auto'] / (df['number_of_policies'] + 1)
    
    if 'number_of_open_complaints' in df.columns:
        df['complaint_flag'] = (df['number_of_open_complaints'] > 0).astype(int)
    
    if 'months_since_policy_inception' in df.columns:
        df['tenure_category'] = pd.cut(
            df['months_since_policy_inception'],
            bins=[0, 12, 36, 60, np.inf],
            labels=['new', 'established', 'loyal', 'veteran']
        )
    
    # Drop columns not needed for prediction
    drop_cols = ['customer', 'customer_lifetime_value', 'effective_to_date', 'policy']
    feature_df = df.drop(columns=[c for c in drop_cols if c in df.columns], errors='ignore')
    
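    # Handle missing values
    # Note: medians are computed from the incoming batch, not the training
    # data, so a missing numeric value in a single-row input stays missing.
    for col in feature_df.columns:
        if pd.api.types.is_numeric_dtype(feature_df[col]):
            feature_df[col] = feature_df[col].fillna(feature_df[col].median())
        else:
            # Cast to object first: fillna('unknown') can raise on a categorical
            # column (e.g. tenure_category) because 'unknown' is not a category.
            feature_df[col] = feature_df[col].astype(object).fillna('unknown')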
    
    # Preprocess
    X_processed = preprocessor.transform(feature_df)
    
    # Predict (in log scale)
    log_predictions = model.predict(X_processed)
    
    # Convert to dollar scale
    dollar_predictions = np.expm1(log_predictions)
    
    # Add predictions to original data
    result_df = input_data.copy()
    result_df['Predicted_CLV_Log'] = log_predictions
    result_df['Predicted_CLV_Dollars'] = dollar_predictions.round(2)
    
    return result_df

print("✅ Inference function defined")
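
One detail of the tenure binning in `predict_clv` is worth checking. By default `pd.cut` uses right-closed intervals, so the first bin is `(0, 12]` and a tenure of exactly 0 months falls outside every bin and becomes `NaN`. The sketch below compares the default behavior with `include_lowest=True` on a few illustrative values. Whether `include_lowest` should be used depends on how the training notebook built this feature, which is not shown here, so treat this as a diagnostic rather than a recommended change.

```python
import numpy as np
import pandas as pd

# Illustrative tenure values chosen to sit on and around the bin edges.
months = pd.Series([0, 1, 12, 13, 36, 60, 61])
bins = [0, 12, 36, 60, np.inf]
labels = ['new', 'established', 'loyal', 'veteran']

# Default: intervals are (0, 12], (12, 36], (36, 60], (60, inf).
default_cut = pd.cut(months, bins=bins, labels=labels)

# include_lowest=True makes the first interval include its left edge.
inclusive_cut = pd.cut(months, bins=bins, labels=labels, include_lowest=True)

comparison = pd.DataFrame({
    'months': months,
    'default': default_cut.astype(object),
    'include_lowest': inclusive_cut.astype(object),
})
print(comparison)
```

The two columns differ only in the first row, where the default call yields a missing value.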

---

## 4. Single Customer Prediction

Let's demonstrate predicting CLV for a single customer.

In [None]:
# Create sample customer
print("=" * 60)
print("SINGLE CUSTOMER PREDICTION")
print("=" * 60)

sample_customer = pd.DataFrame([{
    'state': 'California',
    'response': 'No',
    'coverage': 'Premium',
    'education': 'Master',
    'employmentstatus': 'Employed',
    'gender': 'M',
    'income': 75000,
    'location_code': 'Suburban',
    'marital_status': 'Married',
    'monthly_premium_auto': 200,
    'months_since_last_claim': 12,
    'months_since_policy_inception': 48,
    'number_of_open_complaints': 0,
    'number_of_policies': 3,
    'policy_type': 'Corporate Auto',
    'renew_offer_type': 'Offer1',
    'sales_channel': 'Agent',
    'total_claim_amount': 500,
    'vehicle_class': 'Two-Door Car',
    'vehicle_size': 'Medsize'
}])

print("\n📋 Sample Customer Profile:")
for col, val in sample_customer.iloc[0].items():
    print(f"   {col}: {val}")

In [None]:
# Make prediction
result = predict_clv(sample_customer, model, preprocessor)

print(f"\n🎯 PREDICTION RESULT:")
print(f"   Predicted CLV (Log Scale): {result['Predicted_CLV_Log'].iloc[0]:.4f}")
print(f"   Predicted CLV (Dollars):   ${result['Predicted_CLV_Dollars'].iloc[0]:,.2f}")
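
The log-scale and dollar-scale values printed above are related through `np.expm1`, which is the inverse of `np.log1p`. The sketch below assumes the training target was transformed with `np.log1p`; the `expm1` call in `predict_clv` implies this, but the training notebook is not shown here. The dollar amounts are made up for illustration.

```python
import numpy as np

# Illustrative CLV values in dollars (not taken from the dataset).
clv_dollars = np.array([2000.0, 5000.0, 12000.0])

# Forward transform assumed to have been applied to the target in training.
log_scale = np.log1p(clv_dollars)

# Inverse transform used in predict_clv.
recovered = np.expm1(log_scale)
print(np.allclose(recovered, clv_dollars))

# Averaging on the log scale and then converting back gives a value below
# the arithmetic mean of the dollar amounts.
back_transformed_mean = np.expm1(log_scale.mean())
print(back_transformed_mean < clv_dollars.mean())
```

The second comparison matters for interpretation: a model that averages log-scale targets and is then back-transformed tends to sit below the arithmetic mean in dollars, so aggregate dollar totals built from these predictions may be conservative.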

---

## 5. Batch Prediction

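For production use, we often need to score many customers at once. The demo below scores a random sample of 100 rows from the raw dataset. Because these rows may overlap with the data used for training, the resulting statistics are illustrative and not a measure of out-of-sample accuracy.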

In [None]:
# Load original dataset for batch prediction demo
print("=" * 60)
print("BATCH PREDICTION")
print("=" * 60)

# Load raw data
raw_data = pd.read_csv(os.path.join(DATA_RAW_DIR, 'WA_Fn-UseC_-Marketing-Customer-Value-Analysis.csv'))

# Take a sample for demonstration
sample_batch = raw_data.sample(100, random_state=42)

print(f"\n📊 Scoring {len(sample_batch)} customers...")

# Make predictions
batch_results = predict_clv(sample_batch, model, preprocessor)

print(f"\n✅ Batch prediction complete!")
print(f"\n📊 Predicted CLV Statistics:")
print(f"   Mean:   ${batch_results['Predicted_CLV_Dollars'].mean():,.2f}")
print(f"   Median: ${batch_results['Predicted_CLV_Dollars'].median():,.2f}")
print(f"   Min:    ${batch_results['Predicted_CLV_Dollars'].min():,.2f}")
print(f"   Max:    ${batch_results['Predicted_CLV_Dollars'].max():,.2f}")

In [None]:
# Show sample results
print("\n📋 Sample Predictions:")
display_cols = ['Customer', 'Customer Lifetime Value', 'Predicted_CLV_Dollars']
display_cols = [c for c in display_cols if c in batch_results.columns]
batch_results[display_cols].head(10)
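
When the actual CLV is available next to the prediction, as in the table above, a few error summaries make the comparison concrete. The sketch below uses a hypothetical helper, `summarize_errors`, and a small made-up frame with the same two column names. It assumes all actual values are positive, since the percentage error divides by them.

```python
import numpy as np
import pandas as pd

def summarize_errors(actual, predicted):
    """Return mean/median absolute error and mean absolute percentage error."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    abs_err = np.abs(predicted - actual)
    return {
        'mae': float(abs_err.mean()),
        'median_ae': float(np.median(abs_err)),
        'mape_pct': float((abs_err / actual).mean() * 100),  # assumes actual > 0
    }

# Made-up values for illustration only.
demo = pd.DataFrame({
    'Customer Lifetime Value': [4000.0, 8000.0, 2500.0],
    'Predicted_CLV_Dollars': [4400.0, 7200.0, 2500.0],
})

errors = summarize_errors(demo['Customer Lifetime Value'], demo['Predicted_CLV_Dollars'])
print(errors)
```

On the notebook's data the call would be `summarize_errors(batch_results['Customer Lifetime Value'], batch_results['Predicted_CLV_Dollars'])`, with the caveat that rows seen during training will look better than genuinely new customers.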

---

## 6. Business Use Cases

### 6.1 Customer Segmentation by Predicted CLV

In [None]:
# Segment customers by predicted CLV
print("=" * 60)
print("CUSTOMER SEGMENTATION BY CLV")
print("=" * 60)

# Define CLV segments
def assign_segment(clv):
    if clv >= 10000:
        return 'VIP'
    elif clv >= 6000:
        return 'High Value'
    elif clv >= 3000:
        return 'Medium Value'
    else:
        return 'Low Value'

batch_results['CLV_Segment'] = batch_results['Predicted_CLV_Dollars'].apply(assign_segment)

# Segment distribution
segment_counts = batch_results['CLV_Segment'].value_counts()

print("\n📊 Customer Segment Distribution:")
for segment, count in segment_counts.items():
    pct = count / len(batch_results) * 100
    print(f"   {segment:12} {count:4} customers ({pct:5.1f}%)")
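
For larger batches, the same thresholds can be applied without a Python-level `apply`. The sketch below is one possible vectorized equivalent using `pd.cut` with `right=False`, so that each threshold belongs to the higher segment, as in `assign_segment`. The function name `assign_segments_vectorized` is hypothetical. One difference to be aware of: a missing prediction stays missing here, whereas `assign_segment` would label it 'Low Value'.

```python
import numpy as np
import pandas as pd

def assign_segment(clv):
    """Row-wise rule copied from the cell above."""
    if clv >= 10000:
        return 'VIP'
    elif clv >= 6000:
        return 'High Value'
    elif clv >= 3000:
        return 'Medium Value'
    else:
        return 'Low Value'

def assign_segments_vectorized(clv_series):
    """Same thresholds via pd.cut; intervals are closed on the left."""
    return pd.cut(
        clv_series,
        bins=[-np.inf, 3000, 6000, 10000, np.inf],
        labels=['Low Value', 'Medium Value', 'High Value', 'VIP'],
        right=False,
    )

# Illustrative values placed on and around each threshold.
demo_clv = pd.Series([0.0, 2999.99, 3000.0, 5999.99, 6000.0, 10000.0, 25000.0])
vectorized = assign_segments_vectorized(demo_clv).astype(str).tolist()
row_wise = demo_clv.apply(assign_segment).tolist()
print(vectorized == row_wise)
```

`pd.cut` also returns an ordered categorical, which keeps segments in a sensible order when counting or plotting them.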

### 6.2 High-Value Customer Identification

In [None]:
# Top 10 highest predicted CLV
print("=" * 60)
print("TOP 10 HIGH-VALUE CUSTOMERS")
print("=" * 60)

top_customers = batch_results.nlargest(10, 'Predicted_CLV_Dollars')

print("\n🏆 Highest Predicted CLV Customers:")
for i, (_, row) in enumerate(top_customers.iterrows(), 1):
    customer_id = row.get('Customer', 'N/A')
    clv = row['Predicted_CLV_Dollars']
    print(f"   {i:2}. Customer {customer_id}: ${clv:,.2f}")
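
A ranked list is easier to act on when paired with how much of the total predicted value those customers represent. The sketch below computes that share with a hypothetical helper, `top_n_value_share`, on made-up numbers.

```python
import pandas as pd

def top_n_value_share(predicted_clv, n):
    """Fraction of total predicted CLV held by the n highest-valued customers."""
    predicted_clv = pd.Series(predicted_clv, dtype=float)
    return float(predicted_clv.nlargest(n).sum() / predicted_clv.sum())

# Made-up predictions for illustration only.
demo_clv = [12000.0, 8000.0, 5000.0, 3000.0, 2000.0]
share = top_n_value_share(demo_clv, 2)
print(f"Top 2 customers hold {share:.1%} of predicted value")
```

On the notebook's data, `top_n_value_share(batch_results['Predicted_CLV_Dollars'], 10)` would give the share held by the ten customers listed above.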

---

## 7. Summary

This notebook demonstrated:

1. **Loading trained artifacts** for production use
2. **Creating a reusable inference function** that handles preprocessing
3. **Single customer prediction** for real-time scoring
4. **Batch prediction** for bulk processing
5. **Business applications** including segmentation and high-value identification

### Deployment Considerations

For production deployment, consider:
- Wrapping the inference function in an API (e.g., Flask, FastAPI)
- Implementing input validation and error handling
- Setting up model monitoring and drift detection
- Establishing a model retraining pipeline
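
As a starting point for the input-validation item above, the sketch below checks a few numeric columns before data reaches `predict_clv`. The helper `validate_input` and the column subset in `REQUIRED_NUMERIC` are hypothetical choices for illustration; a real check would cover every feature the preprocessor expects and would use rules agreed with the business.

```python
import pandas as pd

# Hypothetical subset of numeric features used by predict_clv.
REQUIRED_NUMERIC = ['income', 'monthly_premium_auto', 'number_of_policies', 'total_claim_amount']

def validate_input(df, required_numeric=REQUIRED_NUMERIC):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    checked = df.copy()
    # Same column-name standardization as predict_clv.
    checked.columns = (
        checked.columns.str.lower()
        .str.replace(' ', '_', regex=False)
        .str.replace('-', '_', regex=False)
    )
    problems = []
    for col in required_numeric:
        if col not in checked.columns:
            problems.append(f"missing column: {col}")
        elif not pd.api.types.is_numeric_dtype(checked[col]):
            problems.append(f"non-numeric column: {col}")
        elif checked[col].isna().any():
            problems.append(f"missing values in: {col}")
        elif (checked[col] < 0).any():
            problems.append(f"negative values in: {col}")
    return problems

good = pd.DataFrame([{'Income': 75000, 'Monthly Premium Auto': 200,
                      'Number of Policies': 3, 'Total Claim Amount': 500}])
bad = pd.DataFrame([{'income': 'high', 'monthly_premium_auto': -5, 'number_of_policies': 3}])

good_problems = validate_input(good)
bad_problems = validate_input(bad)
print(good_problems)
print(bad_problems)
```

Returning a list of problems rather than raising on the first one lets an API report every issue to the caller in a single response.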

---

**End of Notebook 05 - Project Complete**

## Retention Analysis

The figure below was generated by the project pipeline.

### 05 Retention Sweet Spot
![05 Retention Sweet Spot](../report/figures/05_retention_sweet_spot.png)