# 🛒 Use Case: Customer Churn Prediction

<div style="background-color: #fff3e0; padding: 15px; border-radius: 5px; border-left: 5px solid #ff9800;">
<b>📊 Telecom Use Case</b><br>
<b>Level:</b> Intermediate<br>
<b>Duration:</b> 30 minutes<br>
<b>Dataset:</b> Telecom Churn (synthetic)<br>
<b>Focus:</b> Drift Detection, Calibration, Cost Analysis
</div>

---

## 🎯 Objectives

By the end of this notebook, you will be able to:
- ✅ Build a churn prediction model for a subscription business
- ✅ Handle imbalanced classes (churn is typically 10-30%)
- ✅ Test for **temporal drift** (customer behavior changes over time)
- ✅ Calibrate probabilities for decision-making
- ✅ Perform cost-benefit analysis (retention cost vs customer value)
- ✅ Set optimal decision thresholds
- ✅ Design a monitoring strategy for production

---

## 📚 Table of Contents

1. [Business Context](#context)
2. [Churn Prediction Challenges](#challenges)
3. [Setup & Data](#data)
4. [EDA & Class Imbalance](#eda)
5. [Model Training](#training)
6. [Performance Analysis](#performance)
7. [Probability Calibration](#calibration)
8. [Drift Detection (CRITICAL)](#drift)
9. [Cost-Benefit Analysis](#cost)
10. [Threshold Optimization](#threshold)
11. [Production Monitoring Plan](#monitoring)
12. [Conclusion](#conclusion)

<a id="context"></a>
## 1. 💼 Business Context

### The Scenario

You work at **TeleConnect**, a telecom company with 500K subscribers.

**The Problem:**
> "We're losing 15% of customers every year. It costs $100 to retain a customer (discount, support), but acquiring a new customer costs $300. We need to predict who will churn and proactively retain them - but only if it's cost-effective!"
>
> -- VP of Customer Success

### 💰 Business Economics

- **Customer Lifetime Value (CLV):** $1,200/year
- **Retention Cost:** $100 (if we intervene)
- **Acquisition Cost:** $300 (new customer)
- **Current Churn Rate:** 15% annually
- **Annual Loss:** 75,000 customers (15% of 500K) × $1,200 = **$90M in lost revenue!**
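
The arithmetic behind these figures can be reproduced in a few lines. This is a minimal sketch using only the numbers quoted above; the variable names are illustrative, and the replacement-cost line assumes every churner is replaced one-for-one by a newly acquired customer, which the scenario does not state.

```python
# Figures quoted in the scenario above
subscribers = 500_000
annual_churn_rate = 0.15
annual_value = 1_200       # value per customer per year, in dollars
acquisition_cost = 300     # cost of acquiring one new customer

churners = round(subscribers * annual_churn_rate)
lost_revenue = churners * annual_value

# Assumption: each churner is replaced by one newly acquired customer
replacement_cost = churners * acquisition_cost

print(f"Churners per year:    {churners:,}")
print(f"Lost revenue:         ${lost_revenue:,}")
print(f"Cost to replace them: ${replacement_cost:,}")
```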

### 🎯 Business Requirements

1. **Precision matters!** - Don't waste retention budget on false positives
2. **Recall matters too!** - Catch actual churners before they leave
3. **Calibrated probabilities** - Need confidence to prioritize interventions
4. **Detect drift** - Customer behavior changes seasonally
5. **Cost-effective** - Only intervene when ROI is positive
6. **Real-time monitoring** - Detect when model degrades

### 🚨 Unique Challenges of Churn

1. **Class Imbalance** - Only 10-30% churn (need special handling)
2. **Temporal Drift** - Behavior changes over time (seasonality, competition)
3. **Self-fulfilling prophecy** - If we prevent churn, labels change!
4. **Cost asymmetry** - FP and FN have different costs
5. **Delayed feedback** - Know churn outcome weeks/months later

**Let's build it right!** 🚀

<a id="challenges"></a>
## 2. ⚠️ Churn Prediction Challenges

### Why Is Churn Different?

| Challenge | Impact | Solution |
|-----------|--------|----------|
| **Class Imbalance** | Most customers don't churn | Use class weights, SMOTE, or proper metrics |
| **Temporal Drift** | Model degrades over time | Continuous monitoring, periodic retraining |
| **Calibration** | Probabilities not reliable | Calibration curves, Platt scaling |
| **Cost Asymmetry** | FP ≠ FN cost | Cost-sensitive learning, threshold tuning |
| **Feedback Loop** | Interventions affect labels | A/B testing, causal inference |
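
To make the cost-asymmetry row concrete, here is a rough sketch of a break-even churn probability, using the $100 retention cost and $1,200 annual customer value from the business context. The function names and the `save_rate` parameter (the fraction of contacted churners who actually stay) are hypothetical; with `save_rate=1.0` every offer is assumed to succeed, which is optimistic.

```python
def expected_gain(p_churn, value=1_200, cost=100, save_rate=1.0):
    """Expected net gain (in dollars) of sending one retention offer.

    p_churn   : predicted probability that the customer churns
    save_rate : assumed probability that the offer retains a would-be churner
    """
    return p_churn * save_rate * value - cost


def break_even_probability(value=1_200, cost=100, save_rate=1.0):
    """Churn probability above which an offer has positive expected gain."""
    return cost / (save_rate * value)


for p in (0.05, 0.10, 0.30):
    print(f"p={p:.2f} -> expected gain ${expected_gain(p):,.0f}")

print(f"Break-even (all offers succeed): {break_even_probability():.3f}")
print(f"Break-even (half succeed):       {break_even_probability(save_rate=0.5):.3f}")
```

Under these assumptions the break-even point sits far below 0.5, so a default 0.5 cutoff would leave profitable interventions on the table. This only holds if the predicted probabilities are well calibrated.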

### Key Metrics for Churn

- ❌ **NOT Accuracy** - Misleading with imbalanced data (at a 15% churn rate, predicting "no churn" for everyone scores 85% accuracy and catches no churners)
- ✅ **Precision** - Of predicted churners, how many actually churn?
- ✅ **Recall** - Of actual churners, how many did we catch?
- ✅ **F1 Score** - Balance of precision and recall
- ✅ **ROC AUC** - Overall discrimination ability
- ✅ **Expected Profit** - Business metric (cost-benefit)
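
The accuracy trap is easy to demonstrate. The sketch below builds 1,000 hypothetical customers with a 15% churn rate and scores a "model" that always predicts no churn; the data is made up purely for illustration.

```python
import numpy as np

# 1,000 hypothetical customers, 15% of whom churn (1 = churn)
y_true = np.array([1] * 150 + [0] * 850)

# A "model" that predicts no churn for everyone
y_pred = np.zeros_like(y_true)

accuracy = (y_pred == y_true).mean()
true_positives = int(((y_pred == 1) & (y_true == 1)).sum())
recall = true_positives / int((y_true == 1).sum())

print(f"Accuracy: {accuracy:.1%}")
print(f"Recall:   {recall:.1%}")
```

Accuracy looks respectable while recall is zero, which is why the metrics above focus on the churn class.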

### DeepBridge for Churn

DeepBridge helps with:
- 🔄 **Drift detection** - Know when to retrain
- 📊 **Calibration analysis** - Trust your probabilities
- 🛡️ **Robustness** - Handle noisy customer data
- 📈 **Reports** - Audit trail for business decisions
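
To give a feel for what a drift check measures, here is a library-free sketch of the Population Stability Index (PSI) for a single numeric feature. It is not DeepBridge's API, only an illustration of the idea; the function name, the bin count, and the simulated 30% price increase are all assumptions made for this example.

```python
import numpy as np


def population_stability_index(reference, current, n_bins=10, eps=1e-6):
    """PSI between two 1-D samples, using quantile bins from the reference."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior bin edges come from the reference sample's quantiles
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))[1:-1]
    ref_counts = np.bincount(np.searchsorted(edges, reference), minlength=n_bins)
    cur_counts = np.bincount(np.searchsorted(edges, current), minlength=n_bins)
    # Clip to avoid log(0) when a bin is empty
    ref_frac = np.clip(ref_counts / ref_counts.sum(), eps, None)
    cur_frac = np.clip(cur_counts / cur_counts.sum(), eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))


rng = np.random.default_rng(42)
train_charges = rng.gamma(2, 30, 5_000)           # reference period
same_charges = rng.gamma(2, 30, 5_000)            # new sample, same distribution
shifted_charges = rng.gamma(2, 30, 5_000) * 1.3   # simulated 30% price increase

psi_same = population_stability_index(train_charges, same_charges)
psi_shifted = population_stability_index(train_charges, shifted_charges)
print(f"PSI, same distribution:    {psi_same:.4f}")
print(f"PSI, shifted distribution: {psi_shifted:.4f}")
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a major shift, but these cutoffs are conventions, not guarantees, and should be tuned to the retraining cost.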

<a id="data"></a>
## 3. 🛠️ Setup & Data

### Setup

In [None]:
# Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from datetime import datetime, timedelta

# sklearn
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    roc_auc_score, confusion_matrix, classification_report,
    roc_curve, precision_recall_curve
)
from sklearn.calibration import calibration_curve
from imblearn.over_sampling import SMOTE

# DeepBridge
from deepbridge import DBDataset, Experiment

# Settings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('Set2')
%matplotlib inline

RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

print("✅ Setup complete!")
print("📞 Project: TeleConnect Churn Prediction")

### Generate Realistic Telecom Churn Dataset

In [None]:
print("📱 Generating telecom customer dataset...\n")

np.random.seed(RANDOM_STATE)
n = 5000

# Generate customer features
df = pd.DataFrame({
    # Demographics
    'customer_age': np.random.gamma(3, 10, n).clip(18, 80).astype(int),
    'gender': np.random.choice(['M', 'F'], n),
    
    # Account info
    'tenure_months': np.random.gamma(2, 15, n).clip(1, 72).astype(int),
    'contract_type': np.random.choice(['Month-to-month', 'One year', 'Two year'], 
                                       n, p=[0.5, 0.3, 0.2]),
    'payment_method': np.random.choice(['Electronic', 'Mailed check', 'Bank transfer', 'Credit card'],
                                        n, p=[0.4, 0.2, 0.2, 0.2]),
    'paperless_billing': np.random.choice([0, 1], n, p=[0.4, 0.6]),
    
    # Services
    'phone_service': np.random.choice([0, 1], n, p=[0.1, 0.9]),
    'multiple_lines': np.random.choice([0, 1], n, p=[0.5, 0.5]),
    'internet_service': np.random.choice(['No', 'DSL', 'Fiber optic'], n, p=[0.2, 0.4, 0.4]),
    'online_security': np.random.choice([0, 1], n, p=[0.6, 0.4]),
    'online_backup': np.random.choice([0, 1], n, p=[0.6, 0.4]),
    'device_protection': np.random.choice([0, 1], n, p=[0.6, 0.4]),
    'tech_support': np.random.choice([0, 1], n, p=[0.6, 0.4]),
    'streaming_tv': np.random.choice([0, 1], n, p=[0.6, 0.4]),
    'streaming_movies': np.random.choice([0, 1], n, p=[0.6, 0.4]),
    
    # Financial
    'monthly_charges': np.random.gamma(2, 30, n).clip(20, 120),
    'total_charges': 0.0,  # Will calculate
    
    # Engagement
    'num_support_tickets': np.random.poisson(2, n),
    'num_tech_tickets': np.random.poisson(1, n),
    'avg_call_duration': np.random.gamma(2, 5, n).clip(1, 30),
    'data_usage_gb': np.random.gamma(2, 20, n).clip(0, 200)
})

# Calculate total charges
df['total_charges'] = df['monthly_charges'] * df['tenure_months']

# Create churn based on risk factors
churn_score = (
    # Risk factors (raise the score = more churn)
    -df['tenure_months'] / 72 * 0.20 +  # Exception: longer tenure lowers the score
    (df['contract_type'] == 'Month-to-month') * 0.25 +  # MTM = high churn
    (df['monthly_charges'] > 80) * 0.15 +  # High cost = more churn
    (df['num_support_tickets'] > 3) * 0.15 +  # Many tickets = frustration
    (df['payment_method'] == 'Mailed check') * 0.10 +  # Manual payment
    (df['internet_service'] == 'Fiber optic') * 0.08 +  # Competitive market
    
    # Protective factors (lower the score = less churn)
    -df['online_security'] * 0.08 +
    -df['tech_support'] * 0.08 +
    -(df['contract_type'] == 'Two year') * 0.20
)

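# Convert to binary: add noise, then flag the top 15% of scores as churners
# (a quantile cutoff keeps the churn rate at the ~15% target)
noisy_score = churn_score + np.random.normal(0, 0.15, n)
df['churn'] = (noisy_score > np.quantile(noisy_score, 0.85)).astype(int)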

print(f"✅ Dataset created: {df.shape}")
print(f"\n📊 Churn Statistics:")
print(f"   No Churn: {(df['churn']==0).sum()} ({(df['churn']==0).mean():.1%})")
print(f"   Churn: {(df['churn']==1).sum()} ({(df['churn']==1).mean():.1%})")
print(f"\n⚠️  Class Imbalance Ratio: {(df['churn']==0).sum() / (df['churn']==1).sum():.1f}:1")

<a id="eda"></a>
## 4. 📊 EDA & Class Imbalance Analysis

### Churn Distribution

In [None]:
# Churn distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Bar chart
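# Order counts as [no churn, churn] so they always match the labels below
churn_counts = df['churn'].value_counts().reindex([0, 1], fill_value=0)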
axes[0].bar(['No Churn', 'Churn'], churn_counts.values, 
            color=['lightgreen', 'coral'], edgecolor='black', alpha=0.8)
axes[0].set_title('Churn Distribution (Imbalanced!)', fontsize=13, fontweight='bold')
axes[0].set_ylabel('Count', fontsize=11)
axes[0].grid(axis='y', alpha=0.3)

# Add percentages on bars
for i, (label, count) in enumerate(zip(['No Churn', 'Churn'], churn_counts.values)):
    pct = count / len(df) * 100
    axes[0].text(i, count + 50, f'{pct:.1f}%', ha='center', fontsize=11, fontweight='bold')

# Pie chart
axes[1].pie(churn_counts.values, labels=['No Churn', 'Churn'], 
            autopct='%1.1f%%', colors=['lightgreen', 'coral'], startangle=90)
axes[1].set_title('Churn Rate', fontsize=13, fontweight='bold')

plt.tight_layout()
plt.show()

print("\n⚠️  CLASS IMBALANCE WARNING:")
print(f"   This is highly imbalanced data!")
print(f"   Naive 'predict all no-churn' = {(df['churn']==0).mean():.1%} accuracy")
print(f"   But completely useless for business!")

### Key Features by Churn

In [None]:
# Analyze key features
fig, axes = plt.subplots(2, 3, figsize=(16, 10))
axes = axes.ravel()

# 1. Tenure
for churn in [0, 1]:
    axes[0].hist(df[df['churn']==churn]['tenure_months'], bins=20, alpha=0.6,
                 label=f'Churn={churn}', edgecolor='black')
axes[0].set_title('Tenure by Churn', fontweight='bold')
axes[0].set_xlabel('Tenure (months)')
axes[0].legend()
axes[0].grid(alpha=0.3)

# 2. Contract Type
contract_churn = pd.crosstab(df['contract_type'], df['churn'], normalize='index') * 100
contract_churn.plot(kind='bar', ax=axes[1], color=['lightgreen', 'coral'])
axes[1].set_title('Churn by Contract Type', fontweight='bold')
axes[1].set_xlabel('Contract Type')
axes[1].set_ylabel('Percentage')
axes[1].legend(['No Churn', 'Churn'])
axes[1].tick_params(axis='x', rotation=45)
axes[1].grid(alpha=0.3)

# 3. Monthly Charges
for churn in [0, 1]:
    axes[2].hist(df[df['churn']==churn]['monthly_charges'], bins=20, alpha=0.6,
                 label=f'Churn={churn}', edgecolor='black')
axes[2].set_title('Monthly Charges by Churn', fontweight='bold')
axes[2].set_xlabel('Monthly Charges ($)')
axes[2].legend()
axes[2].grid(alpha=0.3)

# 4. Support Tickets
ticket_churn = df.groupby('num_support_tickets')['churn'].mean() * 100
axes[3].plot(ticket_churn.index, ticket_churn.values, 'o-', linewidth=2, markersize=8)
axes[3].set_title('Churn Rate by Support Tickets', fontweight='bold')
axes[3].set_xlabel('Number of Support Tickets')
axes[3].set_ylabel('Churn Rate (%)')
axes[3].grid(alpha=0.3)

# 5. Internet Service
internet_churn = pd.crosstab(df['internet_service'], df['churn'], normalize='index') * 100
internet_churn.plot(kind='bar', ax=axes[4], color=['lightgreen', 'coral'])
axes[4].set_title('Churn by Internet Service', fontweight='bold')
axes[4].set_xlabel('Internet Service')
axes[4].set_ylabel('Percentage')
axes[4].legend(['No Churn', 'Churn'])
axes[4].tick_params(axis='x', rotation=45)
axes[4].grid(alpha=0.3)

# 6. Tech Support
tech_churn = pd.crosstab(df['tech_support'], df['churn'], normalize='index') * 100
tech_churn.plot(kind='bar', ax=axes[5], color=['lightgreen', 'coral'])
axes[5].set_title('Churn by Tech Support', fontweight='bold')
axes[5].set_xlabel('Has Tech Support')
axes[5].set_ylabel('Percentage')
axes[5].set_xticklabels(['No', 'Yes'], rotation=0)
axes[5].legend(['No Churn', 'Churn'])
axes[5].grid(alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Key Insights:")
print(f"   • Short tenure customers churn more")
print(f"   • Month-to-month contracts have highest churn")
print(f"   • Higher charges correlate with churn")
print(f"   • More support tickets = more churn (frustration)")
print(f"   • Tech support reduces churn")

<a id="training"></a>
## 5. 🤖 Model Training

### Prepare Features

In [None]:
print("🔧 Preparing features...\n")

# Encode categorical features
df_encoded = df.copy()

# Gender
df_encoded['gender_enc'] = (df['gender'] == 'M').astype(int)

# Contract type
df_encoded['contract_mtm'] = (df['contract_type'] == 'Month-to-month').astype(int)
df_encoded['contract_1yr'] = (df['contract_type'] == 'One year').astype(int)
df_encoded['contract_2yr'] = (df['contract_type'] == 'Two year').astype(int)

# Payment method
df_encoded['payment_electronic'] = (df['payment_method'] == 'Electronic').astype(int)
df_encoded['payment_check'] = (df['payment_method'] == 'Mailed check').astype(int)

# Internet service
df_encoded['internet_dsl'] = (df['internet_service'] == 'DSL').astype(int)
df_encoded['internet_fiber'] = (df['internet_service'] == 'Fiber optic').astype(int)

# Feature list
feature_cols = [
    'customer_age', 'tenure_months', 'monthly_charges', 'total_charges',
    'num_support_tickets', 'num_tech_tickets', 'avg_call_duration', 'data_usage_gb',
    'gender_enc', 'paperless_billing', 'phone_service', 'multiple_lines',
    'online_security', 'online_backup', 'device_protection', 'tech_support',
    'streaming_tv', 'streaming_movies',
    'contract_mtm', 'contract_1yr', 'contract_2yr',
    'payment_electronic', 'payment_check',
    'internet_dsl', 'internet_fiber'
]

X = df_encoded[feature_cols]
y = df_encoded['churn']

# Train/test split (stratified to maintain class balance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=RANDOM_STATE, stratify=y
)

print(f"✅ Data prepared:")
print(f"   Train: {X_train.shape} (Churn rate: {y_train.mean():.1%})")
print(f"   Test: {X_test.shape} (Churn rate: {y_test.mean():.1%})")
print(f"   Features: {len(feature_cols)}")
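
As an aside, the hand-written indicator columns above could also be produced with `pd.get_dummies`. This is an alternative sketch, not what the cell above does: it runs on a tiny made-up frame that only borrows the notebook's column names, and it creates one column per category level (including `internet_service_No`), whereas the manual encoding leaves some levels out as implicit baselines.

```python
import pandas as pd

# Tiny stand-in for the notebook's `df` (hypothetical rows, same column names)
sample = pd.DataFrame({
    'contract_type': ['Month-to-month', 'One year', 'Two year', 'Month-to-month'],
    'internet_service': ['Fiber optic', 'DSL', 'No', 'DSL'],
    'tenure_months': [3, 24, 60, 12],
})

# One 0/1 column per category level; numeric columns pass through unchanged
encoded = pd.get_dummies(
    sample,
    columns=['contract_type', 'internet_service'],
    dtype=int,
)

print(encoded.columns.tolist())
```

Passing `drop_first=True` would drop one level per feature, which matters for linear models but makes little difference to a tree ensemble such as a random forest.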

### Train Model with Class Balancing

In [None]:
print("🌲 Training Random Forest with class balancing...\n")

# IMPORTANT: Use class_weight='balanced' for imbalanced data!
model = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    min_samples_split=10,
    min_samples_leaf=5,
    class_weight='balanced',  # ← CRITICAL for imbalanced data!
    random_state=RANDOM_STATE,
    n_jobs=-1
)

model.fit(X_train, y_train)

print("✅ Model trained!")
print(f"   Algorithm: RandomForestClassifier")
print(f"   Class weighting: Balanced (handles imbalance)")
print(f"   Trees: {model.n_estimators}")
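
For intuition about what `class_weight='balanced'` does, scikit-learn documents the weights as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula to a hypothetical label vector with a 15% churn rate; the labels are made up for illustration and are not the notebook's `y_train`.

```python
import numpy as np

# Hypothetical labels: 850 non-churners (0) and 150 churners (1)
y_example = np.array([0] * 850 + [1] * 150)

counts = np.bincount(y_example)              # samples per class
n_classes = len(counts)
weights = len(y_example) / (n_classes * counts)

for cls, (count, weight) in enumerate(zip(counts, weights)):
    print(f"class {cls}: {count} samples, weight {weight:.3f}")

# After weighting, both classes carry the same total weight
print("Total weight per class:", weights * counts)
```

Each churner ends up counting several times as much as a non-churner, so the trees no longer gain much by ignoring the minority class. A side effect is that the predicted probabilities tend to shift upward for the churn class, which is one reason calibration is checked later.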

## Continuing in next section...

Next sections:
- Section 6: Performance Analysis (Precision, Recall, F1)
- Section 7: Probability Calibration
- Section 8: **Drift Detection (CRITICAL for churn!)**
- Section 9: Cost-Benefit Analysis
- Section 10: Threshold Optimization
- Section 11: Production Monitoring

**Key Message:** Churn prediction requires special attention to drift, calibration, and cost-benefit analysis!