# Telco Churn – Model Tuning and Cost-Sensitive Decision Making

This notebook builds on the **Telco churn project** and focuses on:

1. **Hyperparameter tuning** for better predictive performance.
2. **Cost-sensitive evaluation** – picking a decision threshold that aligns with
   business costs (losing a customer vs. contacting a non-churner).

We use the same IBM Telco Customer Churn dataset and a similar preprocessing
pipeline, but now we:

- Tune Logistic Regression and Random Forest with cross-validation.
- Optimise the classification threshold with a simple cost model.

> You can run this notebook independently of the previous one; it repeats the
> essential data loading and preprocessing steps.


## 1. Imports and Configuration

We import the usual stack:

- `pandas`, `numpy` for data handling.
- `matplotlib`, `seaborn` for plotting.
- `scikit-learn` for preprocessing, model tuning, and evaluation.

We also set a random seed for reproducibility.


In [None]:
from __future__ import annotations

from pathlib import Path
from typing import Dict, List

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from IPython.display import display  # explicit import; also available implicitly in Jupyter

from sklearn.compose import ColumnTransformer
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix,
    precision_recall_curve,
    roc_auc_score,
    RocCurveDisplay,
)
from sklearn.model_selection import train_test_split, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.base import BaseEstimator

sns.set_theme(style="whitegrid")  # set_theme is the preferred name for sns.set
plt.rcParams["figure.figsize"] = (8, 5)

RANDOM_STATE: int = 42
np.random.seed(RANDOM_STATE)

DATA_PATH: Path = Path("data") / "WA_Fn-UseC_-Telco-Customer-Churn.csv"

if not DATA_PATH.exists():
    raise FileNotFoundError(
        f"Data file not found at {DATA_PATH.resolve()}. "
        "Please download the Telco churn CSV from Kaggle and place it under the 'data/' directory."
    )


## 2. Data Loading and Cleaning

We reuse the same logic as in the first notebook:

1. Load the CSV.
2. Convert `TotalCharges` to numeric.
3. Drop rows with missing `TotalCharges`.
4. Drop duplicate `customerID` entries.

This gives us a clean customer-level dataset for modelling.
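Steps 2 and 3 are needed when `TotalCharges` is read as text because some entries cannot be parsed as numbers. The sketch below uses a made-up three-row frame, assuming such entries appear as whitespace-only strings, to show how `errors="coerce"` turns them into `NaN` so they can be dropped.

```python
import pandas as pd

# Toy data (illustrative only): the blank string mimics an unparseable entry.
toy = pd.DataFrame(
    {
        "customerID": ["A", "B", "C"],
        "TotalCharges": ["29.85", " ", "108.15"],
    }
)

# Unparseable strings become NaN instead of raising an error.
toy["TotalCharges"] = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_missing = int(toy["TotalCharges"].isna().sum())

# Dropping the NaN rows keeps only customers with a usable total.
toy_clean = toy.dropna(subset=["TotalCharges"]).reset_index(drop=True)
print(n_missing, len(toy_clean))
```

Without `errors="coerce"`, the conversion would raise on the first unparseable value instead of marking it as missing.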


In [None]:
def load_telco_data(path: Path) -> pd.DataFrame:
    """Load the Telco churn data from CSV.

    Args:
        path: Path to the CSV file.

    Returns:
        DataFrame with raw Telco data.

    Raises:
        FileNotFoundError: If the file does not exist.
        ValueError: If the loaded DataFrame is empty.
    """
    if not path.exists():
        raise FileNotFoundError(f"File not found: {path!s}")

    df: pd.DataFrame = pd.read_csv(path)
    if df.empty:
        raise ValueError(f"Loaded DataFrame is empty: {path!s}")
    return df


def clean_telco_data(raw_df: pd.DataFrame) -> pd.DataFrame:
    """Clean Telco data (types, missing values, duplicates).

    Args:
        raw_df: Raw Telco churn DataFrame.

    Returns:
        Cleaned DataFrame.
    """
    df = raw_df.copy()

    # Convert TotalCharges to numeric, coercing errors to NaN
    if "TotalCharges" not in df.columns:
        raise ValueError("Expected 'TotalCharges' column not found.")
    df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

    # Show missing values
    missing = df.isna().sum()
    print("Missing values per column (non-zero only):")
    display(missing[missing > 0])

    # Drop rows with missing TotalCharges
    before = df.shape[0]
    df = df.dropna(subset=["TotalCharges"])
    after = df.shape[0]
    print(f"Dropped {before - after} rows with missing TotalCharges.")

    # Drop duplicate customers
    before = df.shape[0]
    df = df.drop_duplicates(subset=["customerID"])
    after = df.shape[0]
    print(f"Dropped {before - after} duplicate customerID rows.")

    df = df.reset_index(drop=True)
    return df


raw_df = load_telco_data(DATA_PATH)
telco_df = clean_telco_data(raw_df)
display(telco_df.head())


### Section summary

We loaded and cleaned the Telco churn dataset. The data is now ready for:

- Train–test splitting.
- Preprocessing (encoding + scaling).
- Model training and tuning.


## 3. Train–Test Split and Preprocessing

We now:

1. Map the target `Churn` to 0/1.
2. Drop `customerID` from the features.
3. Split into train and test sets with stratification.
4. Define a `ColumnTransformer` to:

   - Scale numeric features.
   - One-hot encode categorical features.


In [None]:
TARGET_COL: str = "Churn"

if TARGET_COL not in telco_df.columns:
    raise KeyError(f"Target column {TARGET_COL!r} not found.")

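X: pd.DataFrame = telco_df.drop(columns=[TARGET_COL, "customerID"])
y: pd.Series = telco_df[TARGET_COL].map({"No": 0, "Yes": 1})
if y.isna().any():
    raise ValueError(f"Unexpected values in {TARGET_COL!r}; expected 'No'/'Yes'.")

# Select by dtype family so this also works when strings are not stored as object.
categorical_cols: List[str] = X.select_dtypes(exclude="number").columns.tolist()
numeric_cols: List[str] = [c for c in X.columns if c not in categorical_cols]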

print("Categorical columns:", categorical_cols)
print("Numeric columns:", numeric_cols)

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.2,
    stratify=y,
    random_state=RANDOM_STATE,
)

print("Train shape:", X_train.shape, "Test shape:", X_test.shape)

numeric_transformer = Pipeline(
    steps=[("scaler", StandardScaler())]
)
categorical_transformer = Pipeline(
    steps=[("encoder", OneHotEncoder(handle_unknown="ignore"))]
)

preprocessor = ColumnTransformer(
    transformers=[
        ("num", numeric_transformer, numeric_cols),
        ("cat", categorical_transformer, categorical_cols),
    ]
)


### Section summary

We prepared:

- Feature matrix `X` and target `y` (0 = no churn, 1 = churn).
- A stratified train–test split.
- A reusable preprocessing pipeline with scaling and one-hot encoding.

Next we will define a general evaluation helper and establish a simple baseline.


## 4. Evaluation Helper and Baseline Model

We define a helper function `evaluate_classifier` that:

- Fits the model.
- Computes accuracy and ROC-AUC.
- Prints a classification report.
- Shows a confusion matrix and ROC curve.

Then we fit a **dummy baseline** that always predicts the majority class.
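A small sketch of why this baseline is a useful reference. The labels below are a made-up 70/30 split, not the Telco data. Accuracy looks respectable, yet the model never identifies a churner and its scores carry no ranking information.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

# Toy labels: 70% non-churn (0), 30% churn (1). Features are ignored by the dummy.
y_toy = np.array([0] * 7 + [1] * 3)
X_toy = np.zeros((len(y_toy), 1))

dummy = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
y_pred = dummy.predict(X_toy)

acc = accuracy_score(y_toy, y_pred)
churn_recall = recall_score(y_toy, y_pred)
auc = roc_auc_score(y_toy, dummy.predict_proba(X_toy)[:, 1])
print(f"accuracy={acc:.2f}, churn recall={churn_recall:.2f}, roc_auc={auc:.2f}")
```

Accuracy equals the majority-class share, which is why it is a weak metric on imbalanced churn data.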


In [None]:
def evaluate_classifier(
    name: str,
    model: BaseEstimator,
    X_train: pd.DataFrame,
    X_test: pd.DataFrame,
    y_train: pd.Series,
    y_test: pd.Series,
) -> Dict[str, object]:
    """Fit a classifier and evaluate it on train and test data.

    Args:
        name: Model name (for printing).
        model: sklearn estimator or pipeline; it is (re)fitted on X_train.
        X_train: Training features.
        X_test: Test features.
        y_train: Training labels (0/1).
        y_test: Test labels (0/1).

    Returns:
        Dictionary with key metrics on the test set.
    """
    print(f"\n===== {name} =====")
    model.fit(X_train, y_train)

    y_pred_train = model.predict(X_train)
    y_pred_test = model.predict(X_test)

    if hasattr(model, "predict_proba"):
        y_proba_test = model.predict_proba(X_test)[:, 1]
        roc_auc = roc_auc_score(y_test, y_proba_test)
    else:
        y_proba_test = None
        roc_auc = np.nan

    acc_train = accuracy_score(y_train, y_pred_train)
    acc_test = accuracy_score(y_test, y_pred_test)

    print(f"Train accuracy: {acc_train:.3f}")
    print(f"Test accuracy:  {acc_test:.3f}")
    if not np.isnan(roc_auc):
        print(f"Test ROC-AUC:  {roc_auc:.3f}")

    print("\nClassification report (test):")
    print(classification_report(y_test, y_pred_test, target_names=["No churn", "Churn"]))

    cm = confusion_matrix(y_test, y_pred_test)
    sns.heatmap(
        cm,
        annot=True,
        fmt="d",
        cmap="Blues",
        xticklabels=["Pred No", "Pred Yes"],
        yticklabels=["True No", "True Yes"],
    )
    plt.title(f"Confusion matrix - {name}")
    plt.ylabel("True label")
    plt.xlabel("Predicted label")
    plt.show()

    if y_proba_test is not None:
        RocCurveDisplay.from_predictions(y_test, y_proba_test)
        plt.title(f"ROC curve - {name}")
        plt.show()

    return {
        "model": name,
        "train_accuracy": acc_train,
        "test_accuracy": acc_test,
        "roc_auc": float(roc_auc),  # NaN when the model has no predict_proba
    }


baseline_clf = Pipeline(
    steps=[
        ("preprocess", preprocessor),
        ("clf", DummyClassifier(strategy="most_frequent", random_state=RANDOM_STATE)),
    ]
)

baseline_metrics = evaluate_classifier(
    "Baseline (Most Frequent)", baseline_clf, X_train, X_test, y_train, y_test
)
baseline_metrics


### Section summary

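The dummy classifier gives us a **reference level** of performance. Because it
always predicts the majority class, its ROC-AUC is 0.5 and its recall for the
churn class is 0. Any real model should:

- Achieve a ROC-AUC clearly above 0.5.
- Achieve non-trivial recall and precision for the churn class.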

Now we move to **hyperparameter tuning** for Logistic Regression and Random Forest.


## 5. Hyperparameter Tuning with RandomizedSearchCV

We tune two models:

1. **Logistic Regression** – mainly the regularisation strength `C`.
2. **Random Forest** – depth, number of trees, and split criteria.

We use:

- **RandomizedSearchCV** with 5-fold cross-validation.
- `roc_auc` as the scoring metric.

The output is the set of best hyperparameters and their cross-validated score.
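`RandomizedSearchCV` also accepts continuous distributions in place of fixed value lists. The sketch below samples `C` from a log-uniform distribution. It runs on synthetic data from `make_classification` rather than the Telco set, and the range and `n_iter` are illustrative assumptions.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Synthetic, imbalanced stand-in for the churn data (not the Telco dataset).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.75, 0.25], random_state=0
)

demo_search = RandomizedSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    # Continuous log-uniform distribution over C instead of a fixed list.
    param_distributions={"C": loguniform(1e-2, 1e2)},
    n_iter=10,
    scoring="roc_auc",
    cv=5,
    random_state=0,
)
demo_search.fit(X_demo, y_demo)

best_c = demo_search.best_params_["C"]
n_candidates = len(demo_search.cv_results_["params"])
print("Best C:", best_c)
print("Best CV ROC-AUC:", round(demo_search.best_score_, 3))
print("Candidates evaluated:", n_candidates)
```

A distribution lets the search try values between grid points, so raising `n_iter` refines the search without redefining the grid.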


In [None]:
# 5.1 Logistic Regression tuning

log_reg_base = Pipeline(
    steps=[
        ("preprocess", preprocessor),
        (
            "clf",
            LogisticRegression(
                max_iter=1000,
                random_state=RANDOM_STATE,
                # No n_jobs here: the search below already parallelises over folds.
            ),
        ),
    ]
)

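# With only fixed value lists (no distributions), RandomizedSearchCV samples
# without replacement; 20 candidate values and n_iter=20 cover the whole grid.
log_reg_param_distributions = {
    "clf__C": np.logspace(-2, 2, 20),
    "clf__penalty": ["l2"],
    "clf__solver": ["lbfgs"],
}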

log_reg_search = RandomizedSearchCV(
    estimator=log_reg_base,
    param_distributions=log_reg_param_distributions,
    n_iter=20,
    scoring="roc_auc",
    cv=5,
    random_state=RANDOM_STATE,
    n_jobs=-1,
    verbose=1,
)

log_reg_search.fit(X_train, y_train)

print("Best Logistic Regression params:", log_reg_search.best_params_)
print("Best CV ROC-AUC:", log_reg_search.best_score_)

best_log_reg = log_reg_search.best_estimator_
log_reg_tuned_metrics = evaluate_classifier(
    "Logistic Regression (tuned)", best_log_reg, X_train, X_test, y_train, y_test
)
log_reg_tuned_metrics


In [None]:
# 5.2 Random Forest tuning

rf_base = Pipeline(
    steps=[
        ("preprocess", preprocessor),
        (
            "clf",
            RandomForestClassifier(
                n_estimators=200,
                random_state=RANDOM_STATE,
                n_jobs=-1,
            ),
        ),
    ]
)

rf_param_distributions = {
    "clf__n_estimators": [100, 200, 300, 400],
    "clf__max_depth": [None, 5, 10, 15],
    "clf__min_samples_split": [2, 4, 6, 8],
    "clf__min_samples_leaf": [1, 2, 3, 4],
    "clf__max_features": ["sqrt", "log2", 0.5, 0.8],
}

rf_search = RandomizedSearchCV(
    estimator=rf_base,
    param_distributions=rf_param_distributions,
    n_iter=25,
    scoring="roc_auc",
    cv=5,
    random_state=RANDOM_STATE,
    n_jobs=-1,
    verbose=1,
)

rf_search.fit(X_train, y_train)

print("Best Random Forest params:", rf_search.best_params_)
print("Best CV ROC-AUC:", rf_search.best_score_)

best_rf = rf_search.best_estimator_
rf_tuned_metrics = evaluate_classifier(
    "Random Forest (tuned)", best_rf, X_train, X_test, y_train, y_test
)
rf_tuned_metrics


### Section summary

We tuned Logistic Regression and Random Forest using **RandomizedSearchCV**.
We obtained:

- Best hyperparameters for each model.
- The best cross-validated ROC-AUC for each model. Untuned models are not
  fitted in this notebook, so any gain over default settings has to be
  checked separately.
- Updated evaluation metrics on the held-out test set.

Next we move beyond ROC-AUC and look at **decision thresholds** under a cost model.


## 6. Cost-Sensitive Evaluation and Threshold Tuning

In many churn problems, the cost of errors is **asymmetric**:

- **False negative (FN)** – we fail to identify a churner → we lose a customer.
- **False positive (FP)** – we flag a non-churner and maybe contact them
  unnecessarily (discount, call, email).

Usually:

> Cost(FN) >> Cost(FP)

We introduce a simple cost model:

- Cost per FN = `C_FN`
- Cost per FP = `C_FP`

For a chosen threshold `t` on the churn probability:

- If `p(churn) >= t` → predict churn (1).
- Else → predict non-churn (0).

We then compute the **average cost** per customer for each threshold,
`(C_FP * FP + C_FN * FN) / N`, and choose the threshold that **minimises** it.
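A worked example of the cost calculation on six made-up customers, with illustrative costs of 1 per FP and 5 per FN. It also computes the textbook reference threshold `C_FP / (C_FP + C_FN)`, which is cost-optimal only under the assumptions that the probabilities are well calibrated and that correct decisions cost nothing.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels and churn probabilities (illustrative values only).
y_true = np.array([0, 0, 0, 0, 1, 1])
p_churn = np.array([0.10, 0.30, 0.60, 0.20, 0.40, 0.80])
cost_fp, cost_fn = 1.0, 5.0


def cost_per_customer(threshold: float) -> float:
    """Average cost when flagging customers with p_churn >= threshold."""
    y_pred = (p_churn >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    return (cost_fp * fp + cost_fn * fn) / len(y_true)


cost_default = cost_per_customer(0.50)  # 1 FP and 1 FN -> (1 + 5) / 6
cost_lower = cost_per_customer(0.25)    # 2 FP and 0 FN -> 2 / 6

# Flagging pays off when p * cost_fn > (1 - p) * cost_fp,
# i.e. when p > cost_fp / (cost_fp + cost_fn).
t_theory = cost_fp / (cost_fp + cost_fn)
print(cost_default, cost_lower, t_theory)
```

Lowering the threshold trades one expensive miss for one extra cheap contact, which is why the average cost drops in this toy case.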


In [None]:
def compute_cost_for_thresholds(
    y_true: np.ndarray,
    y_proba: np.ndarray,
    thresholds: np.ndarray,
    cost_fp: float,
    cost_fn: float,
) -> pd.DataFrame:
    """Compute cost for a range of thresholds.

    Args:
        y_true: True labels (0/1).
        y_proba: Predicted probabilities for class 1.
        thresholds: Array of thresholds to evaluate.
        cost_fp: Cost of a false positive.
        cost_fn: Cost of a false negative.

    Returns:
        DataFrame with threshold, confusion matrix components, and total cost.
    """
    records: List[Dict[str, float]] = []

    for t in thresholds:
        y_pred = (y_proba >= t).astype(int)
        # labels=[0, 1] guarantees a 2x2 matrix even if one class is absent
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()

        total_cost = cost_fp * fp + cost_fn * fn
        cost_per_customer = total_cost / len(y_true)

        records.append(
            {
                "threshold": float(t),
                "tn": float(tn),
                "fp": float(fp),
                "fn": float(fn),
                "tp": float(tp),
                "total_cost": float(total_cost),
                "cost_per_customer": float(cost_per_customer),
            }
        )

    return pd.DataFrame.from_records(records)


# We'll use the tuned Random Forest as our main model for threshold analysis.
# best_rf was already fitted on the training set above, so no refit is needed.
y_proba_test = best_rf.predict_proba(X_test)[:, 1]

# Define costs (you can adjust these values to match a real business context)
C_FP: float = 1.0   # e.g. cost of contacting a non-churner
C_FN: float = 5.0   # e.g. cost of losing a churner

thresholds = np.linspace(0.1, 0.9, 41)
cost_df = compute_cost_for_thresholds(
    y_true=y_test.to_numpy(),
    y_proba=y_proba_test,
    thresholds=thresholds,
    cost_fp=C_FP,
    cost_fn=C_FN,
)

display(cost_df.head())

best_row = cost_df.loc[cost_df["cost_per_customer"].idxmin()]
print("Best threshold by cost:")
display(best_row)


In [None]:
plt.figure(figsize=(8, 5))
sns.lineplot(data=cost_df, x="threshold", y="cost_per_customer")
plt.axvline(best_row["threshold"], linestyle="--")
plt.title("Cost per customer vs threshold (Random Forest tuned)")
plt.xlabel("Threshold")
plt.ylabel("Cost per customer")
plt.show()

# Precision–recall curve for additional context
prec, rec, pr_thresholds = precision_recall_curve(y_test, y_proba_test)

plt.figure(figsize=(8, 5))
plt.plot(rec, prec)
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision–Recall curve (Random Forest tuned)")
plt.show()


### Section summary

We:

- Defined a simple cost model with `C_FN` and `C_FP`.
- Computed the cost per customer for many thresholds.
- Identified the **threshold that minimises expected cost**.
- Plotted cost vs threshold, and inspected the precision–recall trade-off.

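The optimal threshold is **rarely 0.5**, especially when the costs of errors
are asymmetric. This is a key takeaway for real churn deployments.

Because the threshold above is picked on the test set, the reported minimum
cost is an optimistic estimate.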
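A minimal sketch of selecting the threshold without touching the test set, using out-of-fold predictions from `cross_val_predict`. The synthetic data, the model settings, and the `avg_cost` helper are illustrative assumptions, not the notebook's tuned model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic, imbalanced stand-in for the churn data.
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, weights=[0.75, 0.25], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

model = RandomForestClassifier(n_estimators=100, random_state=0)
cost_fp, cost_fn = 1.0, 5.0  # illustrative costs


def avg_cost(y_true: np.ndarray, proba: np.ndarray, t: float) -> float:
    """Average cost per customer when flagging proba >= t."""
    y_pred = proba >= t
    fp = np.sum(y_pred & (y_true == 0))
    fn = np.sum(~y_pred & (y_true == 1))
    return float(cost_fp * fp + cost_fn * fn) / len(y_true)


# Out-of-fold probabilities: each training row is scored by a model
# that did not see it during fitting.
oof_proba = cross_val_predict(model, X_tr, y_tr, cv=5, method="predict_proba")[:, 1]

grid = np.linspace(0.1, 0.9, 41)
best_t = float(grid[np.argmin([avg_cost(y_tr, oof_proba, t) for t in grid])])

# The test set is used once, with the threshold already fixed.
model.fit(X_tr, y_tr)
test_cost = avg_cost(y_te, model.predict_proba(X_te)[:, 1], best_t)
print(f"threshold={best_t:.2f}, test cost per customer={test_cost:.3f}")
```

The test-set cost obtained this way is an honest estimate for the chosen threshold, at the price of fitting the model a few extra times.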


## 7. Final Model, Threshold, and Business Interpretation

We now summarise the final configuration:

1. **Model** – tuned Random Forest (or tuned Logistic Regression, depending on
   the results).
2. **Threshold** – the value that minimises cost under our assumptions.
3. **Operational rule** – flag customers with `p(churn) >= threshold` for a
   retention action.

This is easy to implement in production:

- Score customers daily/weekly with the churn model.
- Apply the chosen threshold.
- Generate a list of high-risk customers for campaigns.
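The scoring step could look like the sketch below. The helper name `flag_high_risk`, the customer IDs, and the probabilities are all hypothetical; in practice the scores would come from the fitted pipeline's `predict_proba`.

```python
import pandas as pd


def flag_high_risk(
    customer_ids: pd.Series, churn_proba: pd.Series, threshold: float
) -> pd.DataFrame:
    """Return customers with p(churn) >= threshold, highest risk first."""
    scored = pd.DataFrame(
        {
            "customerID": customer_ids.to_numpy(),
            "p_churn": churn_proba.to_numpy(),
        }
    )
    flagged = scored[scored["p_churn"] >= threshold]
    return flagged.sort_values("p_churn", ascending=False).reset_index(drop=True)


# Toy scores for four hypothetical customers.
ids = pd.Series(["0001-A", "0002-B", "0003-C", "0004-D"])
proba = pd.Series([0.12, 0.55, 0.31, 0.78])

campaign_list = flag_high_risk(ids, proba, threshold=0.30)
print(campaign_list)
```

Sorting by risk lets a campaign with a limited budget work down the list from the top.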

You can also:

- Re-tune the cost parameters (`C_FP`, `C_FN`) to reflect actual business
  estimates.
- Recompute the optimal threshold for each scenario.


## 8. Next Steps

Possible extensions to this notebook:

1. **More sophisticated cost models**  
   - Include *value of customer* (CLV) instead of a constant FN cost.  
   - Different thresholds for different segments (e.g. high-value vs low-value).

2. **Calibration and uplift**  
   - Calibrate probabilities (Platt scaling / isotonic regression).  
   - Build an uplift model to predict the *incremental* effect of an action.

3. **Monitoring in production**  
   - Track data drift in features.  
   - Monitor churn model performance over time.  
   - Retrain when performance degrades.
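As a starting point for the calibration extension, the sketch below wraps a Random Forest in scikit-learn's `CalibratedClassifierCV` with isotonic regression and compares Brier scores (lower is better). It runs on synthetic data with illustrative settings, so whether calibration helps on the Telco data has to be checked there.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the churn data.
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=10, weights=[0.75, 0.25], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=0
)

# Uncalibrated reference model.
raw_rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# Calibrated model: isotonic regression fitted on cross-validated predictions.
calibrated_rf = CalibratedClassifierCV(
    RandomForestClassifier(n_estimators=100, random_state=0),
    method="isotonic",
    cv=5,
).fit(X_tr, y_tr)

brier_raw = brier_score_loss(y_te, raw_rf.predict_proba(X_te)[:, 1])
brier_cal = brier_score_loss(y_te, calibrated_rf.predict_proba(X_te)[:, 1])
print(f"Brier raw={brier_raw:.4f}, calibrated={brier_cal:.4f}")
```

Calibration matters for cost-based thresholds because a cut-off on `p(churn)` is only meaningful if the probabilities reflect actual churn rates.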

Taken together with the first notebook, you now have:

- A complete churn pipeline (EDA → models → interpretation).  
- A tuned model with a **cost-aware decision threshold**, ready for a realistic
  deployment scenario.
