# Telecom Customer Churn Prediction

A business-driven approach to predicting customer churn, optimizing for revenue impact rather than traditional ML metrics.

## 1. Business Context

### What is the business objective?

The company wants to create a predictive model that identifies customers likely to churn *before* it happens, enabling proactive retention efforts.

The model will run automatically on all customers at fixed intervals. Its output will feed into another ML system that generates personalized temporary discounts to motivate at-risk customers to stay.

**Output format:** An easy-to-process format (JSON, YAML, etc.) containing only the IDs of customers predicted to churn.
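
As a rough illustration of that output, the sketch below filters a few placeholder churn scores by a threshold and serializes the flagged IDs as JSON. The function name, the `churn_risk_customer_ids` key, and the sample IDs are assumptions made for this example; the schema expected by the downstream discount system is not specified here.

```python
import json

import pandas as pd


def flagged_customers_to_json(customer_ids: pd.Series, scores, threshold: float) -> str:
    """Return a JSON document listing only the IDs whose score meets the threshold."""
    scores = pd.Series(scores, index=customer_ids.index)
    flagged = customer_ids[scores >= threshold].tolist()
    # "churn_risk_customer_ids" is a hypothetical key, not a documented schema.
    return json.dumps({"churn_risk_customer_ids": flagged})


# Placeholder IDs and scores, for illustration only.
ids = pd.Series(["0001-AAAAA", "0002-BBBBB", "0003-CCCCC"])
payload = flagged_customers_to_json(ids, [0.82, 0.12, 0.45], threshold=0.30)
print(payload)
```

Customers below the threshold are omitted entirely, which keeps the payload small when the model runs over the full customer base.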

### What is the current solution?

Currently, retention efforts are reactive. Only after a customer has left does a dedicated employee contact them with a custom offer, typically a significant discount used as a last resort.

- Win-back success rate: **15-30%**
- That means **70-85% of customers who leave are lost permanently**
- Average discount offered to churned customers: **20-40% off** standard rate

### What do we expect from the model?

- Avoid costly, time-consuming win-back efforts after customers have already left
- Preserve customers with smaller discounts (or none at all)
- Shift from reactive to proactive retention

## 2. Data Loading and Exploration

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from utils import plot_threshold_analysis, calculate_value_scores

In [None]:
import kagglehub
from kagglehub import KaggleDatasetAdapter

file_path = "WA_Fn-UseC_-Telco-Customer-Churn.csv"

og_df = kagglehub.dataset_load(
    KaggleDatasetAdapter.PANDAS,
    "blastchar/telco-customer-churn",
    file_path,
)

In [None]:
df = og_df.copy()
df.head()

In [None]:
# Store customer IDs separately and remove from features
customers = df["customerID"]
df = df.drop("customerID", axis=1)

In [None]:
df.info()

## 3. Data Preprocessing

### Convert categorical values to numeric representation

In [None]:
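from sklearn.preprocessing import LabelEncoder

# TotalCharges is typically read as text in this dataset (blank strings for
# brand-new customers). Convert it to numeric first so it is not label-encoded;
# the resulting NaN values are expected to fall on the zero-tenure rows dropped later.
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

def object_to_int(col: pd.Series) -> pd.Series:
    """Label-encode object (string) columns; leave numeric columns unchanged."""
    if col.dtype == 'object':
        col = LabelEncoder().fit_transform(col)
    return col

df = df.apply(object_to_int)
df.head()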

In [None]:
df.describe()

In [None]:
df["Churn"].value_counts()

### Current churn rate: 27%

1,869 of the 7,043 customers churned (26.5%). Multiplying those 1,869 customers by $4,100.30 (the net value per saved customer, derived in Section 8) gives **$7,663,460.70** in potential lost revenue.

**Note:** The target variable is imbalanced. This needs to be considered in both train/test splits and performance measurement.
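
To show what respecting the imbalance looks like in a split, here is a minimal sketch using synthetic labels with the class counts reported above. The variable names and the dummy feature are assumptions for illustration; the notebook itself relies on `StratifiedKFold` later rather than a single train/test split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the class counts reported above (1 = churn).
y_demo = np.array([1] * 1869 + [0] * 5174)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # dummy feature, not real data

# stratify=y_demo keeps the churn ratio (almost) identical in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=42
)

overall_rate = y_demo.mean()
print(f"Overall churn rate: {overall_rate:.1%}")
print(f"Train churn rate:   {y_train.mean():.1%}")
print(f"Test churn rate:    {y_test.mean():.1%}")
```

Without stratification, a random split could by chance put noticeably more or fewer churners in the test set, which would distort recall estimates.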

## 4. Feature Analysis

In [None]:
df.corr()['Churn'].sort_values(ascending=False)

In [None]:
# Remove rows with zero tenure (data quality issue)
df.drop(labels=df[df['tenure'] == 0].index, axis=0, inplace=True)

### Key Business Insights from Correlation Analysis

1. **Lock customers in early:** Contract type matters most. Incentivize long-term contracts.
2. **Critical first 6-12 months:** Low tenure predicts churn. Onboarding and early experience are crucial.
3. **Add-on services help:** Tech support, security, and backup are all associated with noticeably lower churn. Bundle these!
4. **Price sensitivity is real:** High monthly charges are associated with higher churn, but interestingly, total spending shows a much weaker relationship.
5. **Target families:** Customers with partners/dependents are stickier.
6. **Watch new customers closely:** High monthly charges + short tenure + month-to-month contract = high churn risk.
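
The combined rule in insight 6 can be expressed as a simple filter. The sketch below uses invented toy rows and hypothetical cut-offs (12 months, $80); neither comes from the analysis above. It also assumes the raw `Contract` labels, so it would apply to `og_df` rather than the label-encoded `df`.

```python
import pandas as pd

# Toy rows, invented for illustration; not taken from the dataset.
toy = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year", "Month-to-month", "One year"],
    "tenure": [2, 60, 30, 8],
    "MonthlyCharges": [95.0, 40.0, 85.0, 99.0],
})

# Hypothetical cut-offs: "short tenure" = 12 months or less, "high charges" = $80 or more.
SHORT_TENURE_MONTHS = 12
HIGH_MONTHLY_CHARGE = 80.0

# A customer lands on the watchlist only when all three conditions hold.
toy["watchlist"] = (
    (toy["Contract"] == "Month-to-month")
    & (toy["tenure"] <= SHORT_TENURE_MONTHS)
    & (toy["MonthlyCharges"] >= HIGH_MONTHLY_CHARGE)
)
print(toy)
```

Only the first toy customer meets all three conditions; the others each fail at least one.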

## 5. Feature Engineering

### Creating a Composite Risk Feature

Two features show strong correlation with churn:
- **MonthlyCharges** (positive correlation): Higher charges increase churn risk
- **Tenure** (negative correlation): Longer tenure decreases churn risk

By dividing monthly charge by tenure, we get a "risk measurement":
- High monthly charge + low tenure = **high risk value**

```
high_churn_risk = MonthlyCharges / tenure
```

In [None]:
df["high_churn_risk"] = df["MonthlyCharges"] / df["tenure"]
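
To make the behavior of this ratio concrete, here is a small sketch on made-up values (the toy frame is an assumption for illustration, not data from the notebook). It also shows why the zero-tenure rows had to be removed first: dividing by a tenure of 0 yields infinity.

```python
import numpy as np
import pandas as pd

# Made-up customers: same charge with short vs. long tenure, plus a cheap long-tenure plan.
toy = pd.DataFrame({
    "MonthlyCharges": [90.0, 90.0, 30.0],
    "tenure": [1, 45, 45],
})
toy["high_churn_risk"] = toy["MonthlyCharges"] / toy["tenure"]
print(toy)

# A zero tenure would produce an infinite ratio, which most models cannot handle.
zero_tenure_ratio = (pd.Series([50.0]) / pd.Series([0])).iloc[0]
print("Ratio with tenure 0:", zero_tenure_ratio)
```

The new customer paying $90 gets a ratio of 90, while the same charge after 45 months gives 2, so the feature is dominated by very short tenures.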

In [None]:
# Check if the new feature improves correlation
df.corr()['Churn'].sort_values(ascending=False)

The new `high_churn_risk` feature has the strongest positive correlation (0.39) with churn, which supports our hypothesis.

In [None]:
# Remove low-correlation features and the features used to create the composite
low_corr_features = ["PhoneService", "gender", "MultipleLines"]
used_features = ["MonthlyCharges", "tenure"]

df = df.drop(low_corr_features + used_features, axis=1)

In [None]:
X = df.drop('Churn', axis=1)
y = df['Churn']

## 6. Preprocessing Pipeline

### Normalizing Numeric Values

We standardize the numeric data (subtract the mean, divide by the standard deviation) so that it has mean 0 and standard deviation 1.

Wrapping the scaler in a pipeline means it is re-fit on the training folds only during cross-validation, which avoids leaking information from the held-out fold into training.

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.compose import ColumnTransformer

numeric_columns = ["TotalCharges"]

scale_numeric_transformer = ColumnTransformer(
    transformers=[
        ('scaler', StandardScaler(), numeric_columns)
    ],
    remainder='passthrough'
)
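
As a self-contained illustration of the leakage point, the sketch below runs a scaler inside a pipeline under cross-validation on synthetic data. All names and the data-generating process are assumptions for this example. Because the scaler is a pipeline step, scikit-learn fits it on each training fold and only applies it to the held-out fold.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic, unscaled feature and a noisy binary label derived from it.
X_demo = rng.normal(loc=2000.0, scale=1500.0, size=(500, 1))
y_demo = (X_demo[:, 0] + rng.normal(scale=1500.0, size=500) < 1500.0).astype(int)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),  # fit on training folds only
    ("model", LogisticRegression(max_iter=1000)),
])

cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
fold_scores = cross_val_score(demo_pipeline, X_demo, y_demo, cv=cv_demo, scoring="recall")
print("Recall per fold:", fold_scores.round(3))
```

Scaling the whole dataset before splitting would instead let the held-out fold influence the mean and standard deviation used during training.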

## 7. Model Training and Evaluation

### Random Forest Classifier

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, StratifiedKFold
from sklearn.pipeline import Pipeline

pipeline_rf = Pipeline([
    ('preprocessing', scale_numeric_transformer),
    ('model', RandomForestClassifier(n_estimators=200, random_state=42))
])

thresholds = [0.30, 0.40, 0.50, 0.60, 0.70]

# Stratified k-fold keeps the churn ratio roughly equal in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

churn_scores_rf = cross_val_predict(pipeline_rf, X, y, cv=cv, method='predict_proba')[:, 1]
print("Sample churn probability scores:", churn_scores_rf[:5])

In [None]:
recall_fpr_rf = plot_threshold_analysis(y, churn_scores_rf, thresholds_to_mark=thresholds)

### Logistic Regression

In [None]:
from sklearn.linear_model import LogisticRegression

pipeline_lr = Pipeline([
    ('preprocessing', scale_numeric_transformer),
    ('model', LogisticRegression(max_iter=1000, random_state=42))
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

churn_scores_lr = cross_val_predict(pipeline_lr, X, y, cv=cv, method='predict_proba')[:, 1]
print("Sample churn probability scores:", churn_scores_lr[:5])

In [None]:
recall_fpr_lr = plot_threshold_analysis(y, churn_scores_lr, thresholds_to_mark=thresholds)
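
`plot_threshold_analysis` comes from a local `utils` module that is not shown here. The sketch below is one plausible version of the numeric part, assuming it returns recall and false positive rate per threshold in a dict with `recall` and `fpr` keys (the shape implied by how `recall_fpr_lr` is indexed later). The function name and toy data are hypothetical.

```python
import numpy as np


def recall_and_fpr(y_true, scores, thresholds):
    """Compute recall (TPR) and false positive rate at each threshold."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    recalls, fprs = [], []
    for t in thresholds:
        predicted = scores >= t                 # flag customers at or above the threshold
        tp = np.sum(predicted & y_true)         # churners correctly flagged
        fp = np.sum(predicted & ~y_true)        # non-churners wrongly flagged
        recalls.append(tp / y_true.sum())
        fprs.append(fp / (~y_true).sum())
    return {"recall": recalls, "fpr": fprs}


# Tiny hand-made example, for illustration only.
y_toy = [1, 1, 0, 0, 0]
scores_toy = [0.9, 0.35, 0.6, 0.2, 0.1]
result = recall_and_fpr(y_toy, scores_toy, thresholds=[0.30, 0.50])
print(result)
```

Lowering the threshold from 0.50 to 0.30 raises recall in this toy case without adding false positives; on real data the two usually rise together.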

## 8. Business Value Optimization

### Translating Threshold Choice to Revenue

The average Customer Lifetime Value (CLV) is approximately **$4,400.30**, while the average retention discount cost for a high-risk customer is around **$300**.

Net value per saved customer: $4,400.30 - $300 = **$4,100.30**

The optimal threshold maximizes:
```
Value = (Recall x $4,100.30) - (FPR x $300)
```
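
The helper `calculate_value_scores` lives in the local `utils` module and is not shown. As a hedged sketch of what the formula above amounts to, the function below scores each threshold and picks the best one. Its name, return shape, and the recall/FPR pairs are assumptions made for illustration.

```python
def value_scores_sketch(thresholds, recall_fpr, clv, discount):
    """Score each threshold as recall * (clv - discount) - fpr * discount."""
    net_value = clv - discount
    scores = [
        recall * net_value - fpr * discount
        for recall, fpr in zip(recall_fpr["recall"], recall_fpr["fpr"])
    ]
    # Index of the highest-scoring threshold.
    best_idx = max(range(len(scores)), key=scores.__getitem__)
    return scores, thresholds[best_idx], best_idx


# Made-up recall/FPR pairs, for illustration only.
demo_rates = {"recall": [0.80, 0.60, 0.40], "fpr": [0.30, 0.15, 0.05]}
scores, best_threshold, best_idx = value_scores_sketch(
    [0.30, 0.50, 0.70], demo_rates, clv=4400.30, discount=300
)
print(best_threshold, [round(s, 2) for s in scores])
```

Note that recall and FPR are rates over groups of different sizes (1,869 churners vs. 5,174 non-churners), so a count-weighted variant would multiply each term by its group size, as the impact calculation in Section 9 does.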

In [None]:
CLV = 4400.30
discount = 300

value_scores, optimal_threshold, optimal_idx = calculate_value_scores(
    thresholds, 
    recall_fpr_lr, 
    clv=CLV, 
    discount=discount
)

## 9. Final Impact Analysis

In [None]:
# Business impact calculation
churners = 1869
non_churners = 5174

# Baseline: Cost of doing nothing (lose all churners)
baseline_loss = churners * CLV
print(f"Baseline loss (27% churn, no intervention): ${baseline_loss:,.2f}")

# With model at optimal threshold (TH=0.30)
recall_at_threshold = recall_fpr_lr['recall'][0]  # 0.30 threshold
fpr_at_threshold = recall_fpr_lr['fpr'][0]

true_positives = churners * recall_at_threshold
false_positives = non_churners * fpr_at_threshold

# Revenue impact
saved_value = true_positives * (CLV - discount)
wasted_discounts = false_positives * discount
net_value = saved_value - wasted_discounts

print(f"\nWith Model (TH=0.30):")
print(f"  Churners identified: {true_positives:.0f} out of {churners} ({recall_at_threshold:.1%})")
print(f"  Value from saved churners: ${saved_value:,.2f}")
print(f"  Cost of false alarms: ${wasted_discounts:,.2f}")
print(f"  Net value gained: ${net_value:,.2f}")
print(f"\nRemaining loss (missed churners): ${(churners - true_positives) * CLV:,.2f}")

## 10. Summary

### Model Performance

For the dataset with:
- 1,869 churners
- 5,174 non-churners
- CLV = $4,400.30
- Retention discount = $300

### Results

| Scenario | Value |
|----------|-------|
| Baseline (no model) | -$8,224,160.70 loss |
| With model (TH=0.30) | +$5,393,108.69 saved |
| Remaining loss | -$2,006,695.21 |

### Key Takeaways

1. **The model saves approximately $5.4M** compared to doing nothing, assuming every correctly flagged churner accepts the discount and stays
2. **75.6% of churners are correctly identified** at the optimal threshold
3. **24.4% of churners are missed**, representing ~$2M in loss that the model does not prevent
4. **The cost of false positives is acceptable** ($300 discount vs. $4,400 CLV makes it worthwhile to over-predict slightly)
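
The headline figures can be re-derived from the inputs listed above. This is a quick arithmetic check, not the notebook's own calculation; the 75.6% recall is taken from the takeaways and is rounded, so the remaining-loss figure is approximate.

```python
CLV = 4400.30
DISCOUNT = 300
CHURNERS = 1869

# Baseline: every churner is lost at full lifetime value.
baseline_loss = CHURNERS * CLV

recall = 0.756  # share of churners identified, as reported in the takeaways
missed_churners = CHURNERS * (1 - recall)
remaining_loss = missed_churners * CLV

print(f"Baseline loss:  ${baseline_loss:,.2f}")
print(f"Remaining loss: ${remaining_loss:,.2f}  (approximate, recall is rounded)")
print(f"CLV / discount: {CLV / DISCOUNT:.1f}x")
```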

### Business Recommendation

Deploy the model with a **0.30 probability threshold**. The asymmetric cost structure (losing a customer costs roughly 14.7x more than an unnecessary discount, $4,400.30 vs. $300) justifies prioritizing recall over precision.