# 03 Model Improvement and Time-Based Validation (Telecom Churn / Cease)

This notebook trains **leakage-safe churn models** for the business objective:

> **Prioritise retention resources by calling customers most likely to place a cease in the next 30 days.**

## Objectives
- Load feature dataset from Notebook 02
- Use a **time-based train / validation / test split**
- Train baseline models (**Logistic Regression** + **Random Forest**)
- Evaluate with classification and business metrics
- Optimise thresholds for **Top K% retention capacity**
- Create **gains / lift** charts
- Explain drivers (feature importance + SHAP if available)
- Score external/future snapshot datasets


```python
from pathlib import Path
import warnings
warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.metrics import (
    roc_auc_score, average_precision_score, classification_report,
    precision_recall_curve
)
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

pd.set_option("display.max_columns", 200)
pd.set_option("display.width", 200)
```


## 1) Time-based train / validation / test split

We split by **snapshot_date** rather than at random, so each model is evaluated on snapshots later than those it was trained on, as it would be in deployment.

- **Train**: oldest period
- **Validation**: middle period
- **Test**: most recent period
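
The split above could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the `time_split` helper, the toy data, and the cutoff dates are all hypothetical; only the `snapshot_date` column name comes from the text.

```python
import pandas as pd

def time_split(df, date_col, train_end, valid_end):
    """Split rows into train / validation / test by date, oldest first.

    Rows dated on or before `train_end` go to train, rows after `train_end`
    up to and including `valid_end` go to validation, the rest go to test.
    """
    dates = pd.to_datetime(df[date_col])
    train_end, valid_end = pd.Timestamp(train_end), pd.Timestamp(valid_end)
    train = df[dates <= train_end]
    valid = df[(dates > train_end) & (dates <= valid_end)]
    test = df[dates > valid_end]
    return train, valid, test

# Toy monthly snapshots, two customers per snapshot (illustrative data only).
snapshots = ["2024-01-31", "2024-02-29", "2024-03-31",
             "2024-04-30", "2024-05-31", "2024-06-30"]
toy = pd.DataFrame({
    "snapshot_date": pd.to_datetime([d for d in snapshots for _ in range(2)]),
    "customer_id": range(12),
})

# Hypothetical cutoffs: Jan-Mar train, Apr-May validation, Jun test.
train, valid, test = time_split(toy, "snapshot_date", "2024-03-31", "2024-05-31")
print(len(train), len(valid), len(test))
```

Because the cutoffs are dates rather than row fractions, no snapshot is ever shared between two sets.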


## 2) Train baseline models

We start with:
- **Logistic Regression** (interpretable baseline)
- **Random Forest** (non-linear baseline)
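
As a rough sketch of how the two baselines might be wired up with the preprocessing classes imported earlier: the synthetic data, the column names (`tenure_months`, `monthly_charge`, `contract_type`), the `make_pipeline` helper, and the hyperparameters below are assumptions for illustration, not the notebook's real features or settings.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(0)
n = 600
# Synthetic stand-in for the feature dataset (hypothetical columns).
X = pd.DataFrame({
    "tenure_months": rng.integers(1, 72, n).astype(float),
    "monthly_charge": rng.normal(40, 10, n),
    "contract_type": rng.choice(["monthly", "annual"], n),
})
# Churn is made more likely for short tenure and monthly contracts.
logit = 2.0 - 0.08 * X["tenure_months"] + 1.5 * (X["contract_type"] == "monthly")
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

numeric = ["tenure_months", "monthly_charge"]
categorical = ["contract_type"]

def make_pipeline(model):
    """Wrap a classifier with imputation, scaling, and one-hot encoding."""
    numeric_steps = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ])
    preprocess = ColumnTransformer([
        ("num", numeric_steps, numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])
    return Pipeline([("preprocess", preprocess), ("model", model)])

models = {
    "logreg": make_pipeline(LogisticRegression(max_iter=1000)),
    "rf": make_pipeline(RandomForestClassifier(n_estimators=100, random_state=0)),
}

# The first 400 rows stand in for the train period, the rest for validation.
X_train, X_valid = X.iloc[:400], X.iloc[400:]
y_train, y_valid = y.iloc[:400], y.iloc[400:]

scores = {}
for name, pipe in models.items():
    pipe.fit(X_train, y_train)
    scores[name] = roc_auc_score(y_valid, pipe.predict_proba(X_valid)[:, 1])
    print(name, round(scores[name], 3))
```

Keeping preprocessing inside the pipeline means the imputer and scaler are fitted on the training rows only, which is one of the simpler ways to avoid leakage from later periods.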


## 3) Threshold optimisation for retention capacity (Top K%)

The business can only call a **limited proportion** of customers.  
So we optimise for **Top K% prioritisation** rather than a fixed 0.5 threshold.
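
One way to turn a calling capacity of K% into a score cutoff, and to measure precision and recall at that capacity, is sketched below. The helper names `top_k_threshold` and `precision_recall_at_k` and the ten-customer example are hypothetical.

```python
import numpy as np

def top_k_threshold(scores, k_pct):
    """Return the score cutoff that selects the top `k_pct` percent of customers."""
    scores = np.asarray(scores, dtype=float)
    n_call = max(1, int(np.ceil(len(scores) * k_pct / 100)))
    # The n_call-th highest score is the lowest score still inside the top K%.
    return np.sort(scores)[::-1][n_call - 1]

def precision_recall_at_k(y_true, scores, k_pct):
    """Precision and recall when only the top `k_pct` percent are called."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    n_call = max(1, int(np.ceil(len(scores) * k_pct / 100)))
    # Rank customers by score (highest first) and keep the first n_call.
    top = np.argsort(-scores, kind="stable")[:n_call]
    hits = y_true[top].sum()
    return hits / n_call, hits / y_true.sum()

# Ten customers, three of whom actually churn (illustrative data only).
example_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
example_labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

threshold = top_k_threshold(example_scores, k_pct=20)
precision, recall = precision_recall_at_k(example_labels, example_scores, k_pct=20)
print(threshold, precision, round(recall, 3))
```

With capacity for 20% of ten customers, the two highest-scored customers are called; both churn, so precision is 1.0 and recall is 2 of 3 churners. If several customers tie at the cutoff score, thresholding on the score would select more than K%, whereas the rank-based selection in `precision_recall_at_k` always selects exactly the capacity.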


## 4) Final evaluation on the test set (most recent period)


## 5) Business-ready gains and lift charts
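
A cumulative gains and lift table can be computed along these lines. This is a sketch under assumptions: the `gains_table` helper, its column names, and the toy scores are hypothetical, and equal-sized bins by rank are one common choice rather than necessarily the one used here.

```python
import numpy as np
import pandas as pd

def gains_table(y_true, scores, n_bins=10):
    """Cumulative gain and lift per score bin (bin 1 = highest scores)."""
    df = pd.DataFrame({
        "y": np.asarray(y_true),
        "score": np.asarray(scores, dtype=float),
    })
    df = df.sort_values("score", ascending=False, kind="stable").reset_index(drop=True)
    # Assign equal-sized bins by rank after sorting from highest to lowest score.
    df["bin"] = np.arange(len(df)) * n_bins // len(df) + 1
    out = df.groupby("bin")["y"].agg(customers="count", churners="sum")
    out["cum_customers_pct"] = out["customers"].cumsum() / len(df)
    # Gain: share of all churners captured so far.
    out["cum_gain"] = out["churners"].cumsum() / df["y"].sum()
    # Lift: gain relative to contacting the same share of customers at random.
    out["cum_lift"] = out["cum_gain"] / out["cum_customers_pct"]
    return out

# Twenty customers with four churners, mostly ranked near the top (toy data).
toy_scores = np.linspace(0.95, 0.0, 20)
toy_labels = [1, 1, 1, 0, 1] + [0] * 15

table = gains_table(toy_labels, toy_scores, n_bins=5)
print(table)
```

In this toy example the top 20% of customers contain 3 of the 4 churners, giving a cumulative gain of 0.75 and a lift of 3.75; by construction the last bin always ends at a gain of 1.0 and a lift of 1.0. Plotting `cum_gain` and `cum_lift` against `cum_customers_pct` gives the gains and lift charts.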




## 6) Explainability: feature importance (and SHAP if available)
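
As an illustration of the feature-importance half of this step, the sketch below compares a random forest's built-in impurity importance with scikit-learn's permutation importance on synthetic data. The data and the column names `signal` and `noise` are made up. Permutation importance is used here as a stand-in that needs no extra dependency; it is not SHAP, and SHAP is not shown.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
n = 500
# Synthetic data: one informative column and one pure-noise column.
X = pd.DataFrame({
    "signal": rng.normal(size=n),
    "noise": rng.normal(size=n),
})
y = (X["signal"] + 0.3 * rng.normal(size=n) > 0).astype(int)

# The first 350 rows stand in for training data, the rest for validation.
X_train, X_valid = X.iloc[:350], X.iloc[350:]
y_train, y_valid = y.iloc[:350], y.iloc[350:]

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Impurity-based importance, available on any fitted forest (sums to 1).
impurity = pd.Series(model.feature_importances_, index=X.columns)

# Permutation importance: mean drop in ROC AUC on held-out rows
# when a single column is shuffled.
perm = permutation_importance(
    model, X_valid, y_valid, scoring="roc_auc", n_repeats=5, random_state=0
)
permutation = pd.Series(perm.importances_mean, index=X.columns)

print(pd.DataFrame({"impurity": impurity, "permutation": permutation}))
```

Computing the permutation scores on held-out rows rather than the training rows gives a view of which features the model relies on for data it has not seen, which is closer to the deployment setting.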


## 7) Score external / future snapshot datasets


