# 📊 Customer Churn Prediction: Google Colab Notebook
**Covers**: Data prep, EDA, CHAID-like tree (DecisionTree with entropy), Logistic Regression, ROC-AUC, Lift/Gain, Rule extraction, Deployment (joblib).  
**Dataset**: Telco Customer Churn (Kaggle).

In [None]:
# === Setup: Install & Imports ===
# Note: Install packages using: pip install scikit-learn pandas numpy matplotlib joblib
# Or uncomment the line below if running in Jupyter with pip available
# !pip install scikit-learn pandas numpy matplotlib joblib

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    accuracy_score, roc_auc_score, roc_curve, confusion_matrix, classification_report
)
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier, export_text
import joblib

import os
# Create necessary directories
os.makedirs("charts", exist_ok=True)
os.makedirs("models", exist_ok=True)
os.makedirs("reports", exist_ok=True)


## 📥 Load the dataset
The next two cells locate `telco_customer_churn.csv`:
- **A)** Look for the file next to this notebook, then in your Downloads folder
- **B)** Set `csv_path` by hand if A did not find the file, then confirm that it exists

In [None]:
# OPTION A: Load from local file system
# Update the path below to point to your telco_customer_churn.csv file
import os

# Option 1: If CSV is in the same directory as this notebook
csv_path = "telco_customer_churn.csv"

# Option 2: If CSV is in a specific directory, uncomment and update:
# csv_path = r"C:\path\to\your\telco_customer_churn.csv"

# Option 3: Auto-detect if in Downloads folder
if not os.path.exists(csv_path):
    downloads_path = os.path.join(os.path.expanduser("~"), "Downloads", "telco_customer_churn.csv")
    if os.path.exists(downloads_path):
        csv_path = downloads_path
        print(f"Found CSV in Downloads: {csv_path}")
    else:
        print("CSV not found. Please update the csv_path variable with the correct path to your dataset.")
        csv_path = None
else:
    print(f"Using CSV: {csv_path}")

In [None]:
# OPTION B: Alternative - Load from a different path
# If Option A didn't work, you can set csv_path directly here:
# csv_path = r"C:\Users\harip\Downloads\telco_customer_churn.csv"

# Check if csv_path is set
if csv_path is None or not os.path.exists(csv_path):
    print("⚠️ CSV file not found!")
    print("➡️ Please update `csv_path` in the previous cell (Option A) with the correct path to your dataset.")
    print("   The file should be named 'telco_customer_churn.csv'")
else:
    print(f"✅ Using CSV: {csv_path}")

In [None]:
# === Helper functions ===
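def load_telco(csv_path: str) -> pd.DataFrame:
    df = pd.read_csv(csv_path)
    # Strip column names first so the lookups below match
    df.columns = [c.strip() for c in df.columns]
    # Drop ID if present
    if "customerID" in df.columns:
        df = df.drop(columns=["customerID"])
    # TotalCharges may be read as text; blank entries become NaN after coercion
    required = ["Churn"]
    if "TotalCharges" in df.columns:
        df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")
        required.append("TotalCharges")
    # Drop rows with a null target or an unparseable TotalCharges,
    # because LogisticRegression cannot fit on NaN inputs
    df = df.dropna(subset=required)
    return df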

def train_test_prepare(df: pd.DataFrame, target_col="Churn", positive_label="Yes"):
    y = (df[target_col].astype(str).str.strip() == positive_label).astype(int)
    X = df.drop(columns=[target_col])

    # Identify categorical vs numeric
    cat_cols = X.select_dtypes(include=["object", "category"]).columns.tolist()
    num_cols = X.select_dtypes(include=[np.number]).columns.tolist()

    # Preprocessor
    pre = ColumnTransformer(
        transformers=[
            ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
            ("num", "passthrough", num_cols)
        ]
    )
    return X, y, pre, cat_cols, num_cols

## 🔍 EDA (Exploratory Data Analysis)
Basic insights and churn distribution.

In [None]:
# Load data
assert csv_path is not None, "Set csv_path to your dataset path first."
df = load_telco(csv_path)
print("Shape:", df.shape)
display(df.head())

# Churn distribution
churn_counts = df['Churn'].value_counts()
print("\nChurn distribution:\n", churn_counts)
churn_counts.plot(kind="bar", title="Churn Distribution")
plt.xlabel("Churn"); plt.ylabel("Count"); plt.tight_layout(); plt.savefig("charts/churn_distribution.png"); plt.show()

# Example relationship: tenure vs churn
if "tenure" in df.columns:
    df.boxplot(column="tenure", by="Churn")
    plt.title("Tenure by Churn"); plt.suptitle(""); plt.tight_layout(); plt.savefig("charts/tenure_by_churn.png"); plt.show()

## 🤖 Modeling: Logistic Regression & CHAID-like Tree
We'll train and compare two models:
- **Logistic Regression**
- **Decision Tree (entropy)** as a **CHAID-like** proxy, from which we later extract readable rules.
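
To make the "CHAID-like" label concrete: CHAID proper picks multiway splits with chi-square tests, whereas a scikit-learn tree with `criterion="entropy"` picks binary splits that maximize the reduction in entropy (information gain). The sketch below computes that quantity by hand on invented labels. The helper names `entropy` and `information_gain` are illustrative and not part of scikit-learn.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy reduction from splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted


# Toy labels (1 = churned), invented for illustration only.
parent = [1, 1, 1, 1, 0, 0, 0, 0]
left = [1, 1, 1, 0]    # e.g. customers on a month-to-month contract
right = [1, 0, 0, 0]   # e.g. customers on a longer contract

print(round(entropy(parent), 4))                        # 1.0
print(round(information_gain(parent, left, right), 4))  # 0.1887
```

At each node the tree evaluates candidate splits this way and keeps the one with the largest gain, which is why the resulting rules resemble, but do not reproduce, a CHAID segmentation.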

In [None]:
# Prepare splits and preprocessing
X, y, pre, cat_cols, num_cols = train_test_prepare(df, target_col="Churn", positive_label="Yes")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

from sklearn.base import clone

# Logistic Regression pipeline
# Each pipeline gets its own copy of the preprocessor so the two fits stay independent
logit = Pipeline(steps=[
    ("pre", clone(pre)),
    ("clf", LogisticRegression(max_iter=1000))
])

# Decision Tree (CHAID-like via entropy)
tree = Pipeline(steps=[
    ("pre", clone(pre)),
    ("clf", DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=42))
])

# Fit models
logit.fit(X_train, y_train)
tree.fit(X_train, y_train)

# Predict
y_pred_logit = logit.predict(X_test)
y_proba_logit = logit.predict_proba(X_test)[:, 1]

y_pred_tree = tree.predict(X_test)
y_proba_tree = tree.predict_proba(X_test)[:, 1]

# Metrics
def summarize(y_true, y_pred, y_proba):
    acc = accuracy_score(y_true, y_pred)
    auc = roc_auc_score(y_true, y_proba)
    cm = confusion_matrix(y_true, y_pred)
    return acc, auc, cm

acc_l, auc_l, cm_l = summarize(y_test, y_pred_logit, y_proba_logit)
acc_t, auc_t, cm_t = summarize(y_test, y_pred_tree, y_proba_tree)

print("Logistic Regression - Accuracy:", round(acc_l,4), "AUC:", round(auc_l,4), "\nConfusion Matrix:\n", cm_l)
print("\nDecision Tree - Accuracy:", round(acc_t,4), "AUC:", round(auc_t,4), "\nConfusion Matrix:\n", cm_t)

In [None]:
# ROC Curve comparison
fpr_l, tpr_l, _ = roc_curve(y_test, y_proba_logit)
fpr_t, tpr_t, _ = roc_curve(y_test, y_proba_tree)

plt.figure()
plt.plot(fpr_l, tpr_l, label=f"Logit AUC={roc_auc_score(y_test, y_proba_logit):.3f}")
plt.plot(fpr_t, tpr_t, label=f"Tree AUC={roc_auc_score(y_test, y_proba_tree):.3f}")
plt.plot([0,1],[0,1], linestyle="--")
plt.xlabel("False Positive Rate"); plt.ylabel("True Positive Rate"); plt.title("ROC Curves")
plt.legend(); plt.tight_layout(); plt.savefig("charts/roc_comparison.png"); plt.show()
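
One way to read the AUC values in the legend: AUC equals the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner, with ties counted as half. The sketch below computes that directly on made-up scores. `pairwise_auc` is a hypothetical helper, not a scikit-learn function, and it is quadratic in the sample size, so it is meant for intuition rather than production use.

```python
def pairwise_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented labels and scores, for illustration only.
y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.6, 0.3, 0.6]
print(pairwise_auc(y_true, scores))  # 0.75
```

A shallow decision tree assigns the same probability to every customer in a leaf, so it produces many tied scores; that is why its ROC curve typically has only a few corner points.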

In [None]:
# Lift & Gains table
def lift_gain_table(y_true, y_proba, bins=10):
    # Rank customers from highest to lowest predicted churn probability
    df2 = pd.DataFrame({"y": np.asarray(y_true), "p": np.asarray(y_proba)})
    df2 = df2.sort_values("p", ascending=False).reset_index(drop=True)
    # Equal-sized rank buckets: decile 1 holds the top-scored customers
    df2["decile"] = pd.qcut(df2.index + 1, bins, labels=False, duplicates="drop") + 1
    agg = df2.groupby("decile").agg(
        n=("y", "size"), positives=("y", "sum"), avg_p=("p", "mean")
    ).reset_index()
    total_pos = df2["y"].sum()
    agg["cum_pos"] = agg["positives"].cumsum()
    agg["cum_rate"] = agg["cum_pos"] / total_pos if total_pos > 0 else 0
    agg["perc_pop"] = agg["n"].cumsum() / len(df2)
    agg["gain"] = agg["cum_rate"]  # cumulative share of churners captured
    agg["lift"] = agg["gain"] / agg["perc_pop"]  # cumulative lift vs. random targeting
    return agg

lg_logit = lift_gain_table(y_test, y_proba_logit)
lg_tree  = lift_gain_table(y_test, y_proba_tree)

print("=== Lift/Gain (Logit) ===")
display(lg_logit)
print("=== Lift/Gain (Tree) ===")
display(lg_tree)

# Cumulative lift by decile (1.0 = no better than random targeting)
plt.figure()
plt.plot(lg_logit["decile"], lg_logit["lift"], marker="o", label="Logit")
plt.plot(lg_tree["decile"], lg_tree["lift"], marker="o", label="Tree")
plt.xlabel("Decile (1=Top scored)"); plt.ylabel("Cumulative Lift"); plt.title("Lift Chart")
plt.legend(); plt.tight_layout(); plt.savefig("charts/lift_chart.png"); plt.show()
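
The gain and lift columns can be checked by hand on a tiny example. The sketch below ranks ten invented customers by score and reports, for each cumulative bucket, the share of the population contacted, the share of churners captured (gain), and their ratio (lift). `cumulative_gain_lift` is a hypothetical pure-Python helper written for this illustration; it mirrors the idea of the table above, not its exact pandas implementation.

```python
def cumulative_gain_lift(y_true, scores, bins=5):
    """Return one (fraction_of_population, gain, lift) tuple per bin, best scores first."""
    ranked = [y for _, y in sorted(zip(scores, y_true), key=lambda t: t[0], reverse=True)]
    n, total_pos = len(ranked), sum(ranked)
    rows = []
    for b in range(1, bins + 1):
        cutoff = round(n * b / bins)       # customers contacted so far
        captured = sum(ranked[:cutoff])    # churners among them
        frac_pop = cutoff / n
        gain = captured / total_pos if total_pos else 0.0
        lift = gain / frac_pop if frac_pop else float("nan")
        rows.append((frac_pop, gain, lift))
    return rows


# Invented data: 3 churners among 10 customers, scores already distinct.
scores = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
y_true = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

for frac_pop, gain, lift in cumulative_gain_lift(y_true, scores, bins=5):
    print(f"top {frac_pop:.0%}: gain={gain:.2f} lift={lift:.2f}")
```

Here the top 20% of customers contain two of the three churners, so the gain is about 0.67 and the lift about 3.33; by the last bucket both converge to 1.0, as cumulative lift always does.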

## 📜 Rule Extraction (from the Decision Tree)
`export_text` prints the fitted tree as nested if/else rules over the one-hot encoded features, similar in spirit to CHAID splits.

In [None]:
# Export readable tree (on the transformed feature space)
dt = tree.named_steps["clf"]
ohe = tree.named_steps["pre"].named_transformers_["cat"]
num_cols = tree.named_steps["pre"].transformers_[1][2]  # numeric passthrough
ohe_features = ohe.get_feature_names_out(ohe.feature_names_in_)
all_features = list(ohe_features) + list(num_cols)

rules_text = export_text(dt, feature_names=all_features, decimals=2)
print(rules_text)

# Save rules to reports
with open("reports/decision_tree_rules.txt", "w") as f:
    f.write(rules_text)
print("Rules saved to reports/decision_tree_rules.txt")
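
Because the tree is fit on one-hot columns, a rule such as `Contract_Month-to-month <= 0.50` really means "Contract is not Month-to-month". The sketch below rewrites such lines into that form. `humanize` is a hypothetical helper, the sample rules are invented rather than taken from a fitted model, and the prefix matching assumes no column name is itself a prefix of another column name followed by an underscore.

```python
import re

# Matches export_text split lines such as "|   |--- Contract_Month-to-month <= 0.50".
_SPLIT = re.compile(
    r"^(?P<indent>[|\s]*)\|--- (?P<feature>.+?) (?P<op><=|>) +(?P<thr>-?\d+(?:\.\d+)?)$"
)


def humanize(line, onehot_columns):
    """Rewrite a split on a one-hot column as an equality test on the original column."""
    line = line.rstrip()
    m = _SPLIT.match(line)
    if not m:
        return line  # leaf lines ("class: 1") and anything unrecognized pass through
    feature, op = m["feature"], m["op"]
    for col in onehot_columns:
        prefix = col + "_"
        if feature.startswith(prefix):
            category = feature[len(prefix):]
            relation = "!=" if op == "<=" else "=="
            return f"{m['indent']}|--- {col} {relation} {category!r}"
    return line  # numeric features keep their threshold form


# Invented rules in export_text's layout, for illustration only.
sample_rules = """\
|--- Contract_Month-to-month <= 0.50
|   |--- class: 0
|--- Contract_Month-to-month >  0.50
|   |--- tenure <= 5.50
|   |   |--- class: 1
"""

for rule_line in sample_rules.splitlines():
    print(humanize(rule_line, ["Contract"]))
```

In this notebook the same call would presumably take `cat_cols` as the second argument and iterate over `rules_text.splitlines()`.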

## 💾 Save Best Model & 🔮 Inference Demo
This saves the model with the higher test AUC to `models/best_model.pkl` and shows how to score one new customer.

In [None]:
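# Pick the model with the higher test AUC (ties go to logistic regression)
if auc_l >= auc_t:
    best_model, best_name, best_auc = logit, "logistic_regression", auc_l
else:
    best_model, best_name, best_auc = tree, "decision_tree_entropy", auc_t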
joblib.dump(best_model, "models/best_model.pkl")
print(f"Saved best model: {best_name} with AUC={best_auc:.4f}")

# Inference example: reuses one test row; for a real customer, pass a one-row DataFrame with the same columns
sample = X_test.iloc[[0]].copy()
proba = best_model.predict_proba(sample)[:, 1][0]
pred = int(proba >= 0.5)
print("Sample predicted churn probability:", round(proba, 4), " Pred label:", pred)
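
In a separate scoring process the saved pipeline would be loaded back with `joblib.load` and given a one-row DataFrame. The sketch below shows that round trip on a tiny invented dataset so that it runs on its own; the two-column frame, the temporary file path, and the "Weekly" category are all assumptions made for illustration and are not the Telco schema.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Tiny invented training frame standing in for the Telco data.
train = pd.DataFrame({
    "Contract": ["Month-to-month", "Two year", "Month-to-month", "One year"] * 5,
    "tenure": [1, 60, 3, 24] * 5,
})
labels = [1, 0, 1, 0] * 5

pipe = Pipeline([
    ("pre", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Contract"]),
        ("num", "passthrough", ["tenure"]),
    ])),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(train, labels)

# Round-trip through joblib, as a separate scoring script would.
path = os.path.join(tempfile.mkdtemp(), "best_model.pkl")
joblib.dump(pipe, path)
loaded = joblib.load(path)

# A new customer as a one-row DataFrame with the training column names.
# "Weekly" never appeared in training; handle_unknown="ignore" encodes it as all zeros.
new_customer = pd.DataFrame([{"Contract": "Weekly", "tenure": 2}])
proba = loaded.predict_proba(new_customer)[0, 1]
print(f"churn probability: {proba:.3f}")
```

Saving the whole pipeline, rather than the classifier alone, keeps the encoder and the model together, so the scoring code never has to repeat the preprocessing by hand. A pickle written by one scikit-learn version is not guaranteed to load under another, so it is worth recording the version next to the file.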

## 📝 Report-ready Summaries
Copy these into your PDF report (Model Comparison and Evaluation).

In [None]:
summary = {
    "logistic_regression": {
        "accuracy": round(accuracy_score(y_test, y_pred_logit), 4),
        "auc": round(roc_auc_score(y_test, y_proba_logit), 4),
        "confusion_matrix": confusion_matrix(y_test, y_pred_logit).tolist()
    },
    "decision_tree": {
        "accuracy": round(accuracy_score(y_test, y_pred_tree), 4),
        "auc": round(roc_auc_score(y_test, y_proba_tree), 4),
        "confusion_matrix": confusion_matrix(y_test, y_pred_tree).tolist()
    }
}
summary
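
For pasting into a report, a dictionary shaped like `summary` can be rendered as a Markdown table. The sketch below assumes scikit-learn's binary confusion-matrix layout `[[TN, FP], [FN, TP]]`; `summary_to_markdown` is a hypothetical helper, and the numbers in `demo` are placeholders rather than results from this notebook.

```python
def summary_to_markdown(summary):
    """Render {model: {"accuracy", "auc", "confusion_matrix"}} as a Markdown table."""
    lines = [
        "| Model | Accuracy | AUC | TN | FP | FN | TP |",
        "|---|---|---|---|---|---|---|",
    ]
    for name, m in summary.items():
        (tn, fp), (fn, tp) = m["confusion_matrix"]
        lines.append(
            f"| {name} | {m['accuracy']:.4f} | {m['auc']:.4f} | {tn} | {fp} | {fn} | {tp} |"
        )
    return "\n".join(lines)


# Placeholder values for illustration only; not output from the models above.
demo = {
    "model_a": {"accuracy": 0.8, "auc": 0.85, "confusion_matrix": [[70, 10], [10, 10]]},
}
print(summary_to_markdown(demo))
```

In the notebook, `print(summary_to_markdown(summary))` would produce one row per model.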