# Telco Customer Churn Prediction  
### End-to-End Machine Learning Pipeline

---

## 1. Project Overview

This notebook builds a complete churn prediction system using the Telco Customer Churn dataset.

**Goal:**  
Predict whether a customer will churn (leave the service).

**Business Value:**  
Retention is cheaper than acquisition.  
Predicting churn allows targeted intervention → reduced revenue loss.

---

## 2. Load & Inspect Dataset


In [46]:
import pandas as pd
import numpy as np

df = pd.read_csv(r"C:\Users\jatin\FUTURE_ML_02\data\telco_data.csv")

df.shape, df.columns.tolist()

((7043, 21),
 ['customerID',
  'gender',
  'SeniorCitizen',
  'Partner',
  'Dependents',
  'tenure',
  'PhoneService',
  'MultipleLines',
  'InternetService',
  'OnlineSecurity',
  'OnlineBackup',
  'DeviceProtection',
  'TechSupport',
  'StreamingTV',
  'StreamingMovies',
  'Contract',
  'PaperlessBilling',
  'PaymentMethod',
  'MonthlyCharges',
  'TotalCharges',
  'Churn'])

In [47]:
df.head()
df.isnull().sum()
df['Churn'].value_counts(normalize=True)

Churn
No     0.73463
Yes    0.26537
Name: proportion, dtype: float64

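**Expectations:**
- 7043 rows
- 21 columns
- `isnull()` reports no missing values, but `TotalCharges` is read as text and may contain blank strings that only show up as NaN after numeric conversion
- Churn rate of about 26.5%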
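
The point about `TotalCharges` can be checked on a small made-up frame. This is a minimal sketch with toy values (not the real data): a blank string is not NaN, so `isnull()` misses it, while `pd.to_numeric(..., errors="coerce")` exposes it.

```python
import pandas as pd

# Toy frame (hypothetical values): one TotalCharges entry is a blank string.
toy = pd.DataFrame({"TotalCharges": ["29.85", " ", "108.15"]})

# isnull() sees strings, not NaN, so it reports nothing missing.
n_null = int(toy["TotalCharges"].isnull().sum())

# Coercing to numeric turns unparseable entries into NaN.
as_num = pd.to_numeric(toy["TotalCharges"], errors="coerce")
n_blank = int(as_num.isna().sum())

print(f"isnull count: {n_null}, NaN after coercion: {n_blank}")
```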

---

# Data Cleaning

In [48]:
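# TotalCharges is read as text; blank entries become NaN after conversion
df["TotalCharges"] = pd.to_numeric(df["TotalCharges"], errors="coerce")

# Fill the resulting NaNs with the column median
df["TotalCharges"] = df["TotalCharges"].fillna(df["TotalCharges"].median())

# Keep the IDs for the export step; they are dropped from the model features
customer_ids = df["customerID"]

# Encode the target as 1 = churned, 0 = stayed
df["Churn"] = df["Churn"].map({"Yes": 1, "No": 0})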


# Feature Engineering

In [49]:
service_cols = [
    "PhoneService","MultipleLines","InternetService","OnlineSecurity",
    "OnlineBackup","DeviceProtection","TechSupport","StreamingTV","StreamingMovies"
]

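# Count subscribed services; "No internet service" / "No phone service" are not subscriptions
not_subscribed = ["No", "No internet service", "No phone service"]
df["num_services"] = (~df[service_cols].isin(not_subscribed)).sum(axis=1)

# include_lowest=True keeps tenure == 0 in the first bucket instead of producing NaN
df["tenure_bucket"] = pd.cut(
    df["tenure"], bins=[0, 6, 12, 24, 48, 100], labels=False, include_lowest=True
)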

df["contract_months"] = df["Contract"].map({
    "Month-to-month": 0,
    "One year": 12,
    "Two year": 24
})

df["high_monthly"] = (df["MonthlyCharges"] > df["MonthlyCharges"].median()).astype(int)
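
Two edge cases in these features are easy to miss, so here is a minimal sketch on made-up values (the toy frames and the `no_like` list are illustrative assumptions). `pd.cut` bins are right-closed by default, so a value equal to the lowest edge falls outside every bin unless `include_lowest=True` is passed. Comparing service columns against `"No"` alone also counts `"No internet service"` as a subscribed service.

```python
import pandas as pd

# --- Bucketing: the lowest edge is excluded by default ---
tenure = pd.Series([0, 1, 6, 7, 72])
bins = [0, 6, 12, 24, 48, 100]

default_cut = pd.cut(tenure, bins=bins, labels=False)
closed_cut = pd.cut(tenure, bins=bins, labels=False, include_lowest=True)

# --- Service counting: "No internet service" is not equal to "No" ---
services = pd.DataFrame({
    "OnlineSecurity": ["No internet service", "Yes", "No"],
    "PhoneService": ["Yes", "Yes", "No"],
})
naive = (services != "No").sum(axis=1)

no_like = ["No", "No internet service", "No phone service"]
strict = (~services.isin(no_like)).sum(axis=1)

print(default_cut.tolist(), closed_cut.tolist())
print(naive.tolist(), strict.tolist())
```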

# Processing Pipeline

In [50]:
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

df_model = df.drop(columns=["customerID"])
X = df_model.drop(columns=["Churn"])
y = df_model["Churn"]

numeric_cols = X.select_dtypes(include=["int64","float64"]).columns.tolist()
categorical_cols = X.select_dtypes(include=["object"]).columns.tolist()

numeric_transform = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler())
])

categorical_transform = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore"))
])

preprocessor = ColumnTransformer([
    ("num", numeric_transform, numeric_cols),
    ("cat", categorical_transform, categorical_cols)
])

# Train/Test Split

In [51]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

X_train.shape, X_test.shape

((5634, 23), (1409, 23))
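
The effect of `stratify=y` can be checked directly. This is a minimal sketch on synthetic labels (1000 made-up rows with 26% positives, not the real data): both partitions keep the same positive rate, which a purely random split does not guarantee.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 260 positives out of 1000 rows.
y_toy = np.array([1] * 260 + [0] * 740)
X_toy = np.arange(1000).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print(f"train positive rate: {train_rate:.3f}, test positive rate: {test_rate:.3f}")
```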

# Model Training
Logistic Regression, Random Forest, XGBoost

In [52]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

models = {
    "logreg": LogisticRegression(max_iter=1000, class_weight="balanced"),
    "rf": RandomForestClassifier(n_estimators=300, class_weight="balanced", random_state=42, n_jobs=-1),
    "xgb": XGBClassifier(
        n_estimators=400, 
        learning_rate=0.05,
        max_depth=6,
        subsample=0.8,
        colsample_bytree=0.8,
        eval_metric="logloss",
        random_state=42
    )
}

In [53]:
from sklearn.pipeline import Pipeline

trained_models = {}

for name, clf in models.items():
    pipe = Pipeline([("pre", preprocessor), ("clf", clf)])
    pipe.fit(X_train, y_train)
    trained_models[name] = pipe

# Model Evaluation

In [54]:
from sklearn.metrics import classification_report, roc_auc_score, average_precision_score
results = {}

for name, model in trained_models.items():
    y_pred = model.predict(X_test)
    y_proba = model.predict_proba(X_test)[:,1]
    
    results[name] = {
        "AUC": roc_auc_score(y_test, y_proba),
        "AP": average_precision_score(y_test, y_proba),
        "Report": classification_report(y_test, y_pred)
    }

results

{'logreg': {'AUC': 0.8420496525355861,
  'AP': 0.637018087949885,
  'Report': '              precision    recall  f1-score   support\n\n           0       0.91      0.73      0.81      1035\n           1       0.51      0.79      0.62       374\n\n    accuracy                           0.75      1409\n   macro avg       0.71      0.76      0.72      1409\nweighted avg       0.80      0.75      0.76      1409\n'},
 'rf': {'AUC': 0.8217985481412592,
  'AP': 0.6119165421152393,
  'Report': '              precision    recall  f1-score   support\n\n           0       0.83      0.89      0.86      1035\n           1       0.61      0.48      0.54       374\n\n    accuracy                           0.78      1409\n   macro avg       0.72      0.69      0.70      1409\nweighted avg       0.77      0.78      0.77      1409\n'},
 'xgb': {'AUC': 0.8282427859154201,
  'AP': 0.6309286507053471,
  'Report': '              precision    recall  f1-score   support\n\n           0       0.84      0.89  

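**Interpretation Guide**
- AUC above 0.80 indicates that the model ranks churners above non-churners well; all three models clear that bar
- AP (Average Precision) is the more informative metric when classes are imbalanced; its chance-level baseline is the positive rate (about 0.27 here), not 0.5
- Compare all three models: on this test set Logistic Regression scores highest on both AUC (0.842) and AP (0.637), ahead of XGBoost (0.828, 0.631) and Random Forest (0.822, 0.612)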
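
To calibrate the two metrics against each other, here is a minimal sketch with synthetic labels and purely random scores (all values are simulated; the 0.265 positive rate is chosen to resemble this dataset). Uninformative scores land near AUC 0.5, while AP lands near the positive rate, so AP values are better judged against prevalence than against 0.5.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

rng = np.random.default_rng(0)

# Simulated labels with roughly 26.5% positives and scores unrelated to them.
y_true = (rng.random(20_000) < 0.265).astype(int)
random_scores = rng.random(20_000)

auc = roc_auc_score(y_true, random_scores)
ap = average_precision_score(y_true, random_scores)
prevalence = y_true.mean()

print(f"AUC: {auc:.3f}  AP: {ap:.3f}  prevalence: {prevalence:.3f}")
```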

---

# Threshold Tuning

In [55]:
from sklearn.metrics import precision_recall_curve

model = trained_models["xgb"]
y_proba = model.predict_proba(X_test)[:,1]

prec, rec, thresh = precision_recall_curve(y_test, y_proba)

# Recall is non-increasing as the threshold rises, so take the highest
# threshold that still reaches the target (rec[:-1] aligns with thresh)
target_recall = 0.60
idx = np.where(rec[:-1] >= target_recall)[0][-1]
best_threshold = thresh[idx]

best_threshold
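
The indexing matters here. `precision_recall_curve` returns thresholds in increasing order with recall non-increasing, so the first index that meets a recall target is always the lowest threshold; the useful operating point is the last such index. A toy sketch with made-up labels and scores (`threshold_for_recall` is a hypothetical helper, not part of scikit-learn):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def threshold_for_recall(y_true, scores, target_recall):
    """Return the highest threshold whose recall is still >= target_recall."""
    prec, rec, thresh = precision_recall_curve(y_true, scores)
    # rec has one more element than thresh; rec[:-1] lines up with thresh.
    meets_target = np.where(rec[:-1] >= target_recall)[0]
    return thresh[meets_target[-1]]


# Made-up labels and scores: four positives, four negatives.
y_toy = np.array([0, 0, 1, 0, 1, 1, 0, 1])
s_toy = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9])

t = threshold_for_recall(y_toy, s_toy, target_recall=0.75)

# Recall actually achieved when flagging every score >= t.
achieved = ((s_toy >= t) & (y_toy == 1)).sum() / y_toy.sum()
print(t, achieved)
```

Taking the first matching index instead (for example with `np.argmax`) would return the smallest threshold on the curve, which flags nearly every customer.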

# Feature Importance (XGBoost)

In [56]:
xgb_clf = trained_models["xgb"].named_steps["clf"]
importances = xgb_clf.feature_importances_

fi = pd.DataFrame({
    # Read the names from the preprocessor fitted inside this pipeline
    "feature": trained_models["xgb"].named_steps["pre"].get_feature_names_out(),
    "importance": importances
}).sort_values("importance", ascending=False)

fi.head(15)

|    | feature | importance |
|---:|:---|---:|
| 40 | `cat__Contract_Month-to-month` | 0.280266 |
| 6 | `num__contract_months` | 0.156945 |
| 20 | `cat__InternetService_Fiber optic` | 0.123519 |
| 5 | `num__tenure_bucket` | 0.026777 |
| 19 | `cat__InternetService_DSL` | 0.026722 |
| 22 | `cat__OnlineSecurity_No` | 0.023969 |
| 31 | `cat__TechSupport_No` | 0.016422 |
| 39 | `cat__StreamingMovies_Yes` | 0.015823 |
| 47 | `cat__PaymentMethod_Electronic check` | 0.012843 |
| 15 | `cat__PhoneService_Yes` | 0.012153 |


# Export Predictions

In [57]:
from pathlib import Path

# Your output directory
out_dir = Path(r"C:\Users\jatin\FUTURE_ML_02\output")

# Ensure directory exists
out_dir.mkdir(parents=True, exist_ok=True)

# Paths for the files
pred_file = out_dir / "churn_predictions.csv"
fi_file   = out_dir / "feature_importance.csv"

# Score every customer; rows that were in the training split will look
# more accurate than rows from the held-out test split
best_model = trained_models["xgb"]
all_probs = best_model.predict_proba(X)[:, 1]

output = pd.DataFrame({
    "customerID": customer_ids,
    "actual_churn": y,
    "predicted_probability": all_probs
})

# Save both files
output.to_csv(pred_file, index=False)
fi.to_csv(fi_file, index=False)

print("Files saved successfully:")
print(" -", pred_file)
print(" -", fi_file)

Files saved successfully:
 - C:\Users\jatin\FUTURE_ML_02\output\churn_predictions.csv
 - C:\Users\jatin\FUTURE_ML_02\output\feature_importance.csv


### Summary
- Built complete ML pipeline for churn prediction  
- Cleaned + engineered features  
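- Tested 3 models; Logistic Regression had the best test AUC and AP, and XGBoost was used for threshold tuning, feature importances, and export  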
- Exported predictions for business/Power BI use  
- Generated feature importances for insight  

### Recommended Extensions
- Add SHAP explainability  
- Build Streamlit prediction app  
- Automate retraining  
- Use Bayesian optimization for hyperparameters  
