## MODEL BUILDING & EVALUATION
Build and compare multiple models to predict customer churn and identify the best-performing model for deployment.
### Modeling Approach
This notebook develops three different machine learning models to predict customer churn:
* Logistic Regression - Baseline interpretable linear model
* Random Forest - Ensemble model for capturing complex interactions
* Gradient Boosting - Advanced ensemble technique for high-accuracy predictions
### Evaluation Strategy
* AUC-ROC, Accuracy, Precision, Recall, F1-Score
* Confusion Matrix analysis
* Feature importance rankings
* Cross-validation performance
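
The strategy above lists cross-validation, while the cells that follow score each model on a single 80/20 split. A minimal sketch of how stratified k-fold cross-validation could be run for the same three model families is shown here. It uses synthetic data from `make_classification` as a stand-in (an assumption, not the Telco dataset), and the variable names `X_demo`, `y_demo` and `cv_results` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the churn data (assumption: roughly 27% positive class).
X_demo, y_demo = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.73, 0.27], random_state=42,
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=50, max_depth=10, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
}

# Stratified folds keep the class ratio roughly constant in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_results = {}
for name, model in models.items():
    # One AUC-ROC score per fold.
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="roc_auc")
    cv_results[name] = scores
    print(f"{name}: mean AUC {scores.mean():.3f} (std {scores.std():.3f})")
```

The spread across folds gives a rough sense of whether a gap of a few thousandths in AUC between two models is larger than the split-to-split noise.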

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import roc_auc_score, confusion_matrix, classification_report, roc_curve
import warnings
warnings.filterwarnings('ignore')
sns.set_style('darkgrid')
plt.rcParams['figure.figsize'] = (12, 6)

print("libraries loaded")

libraries loaded


### 1. What dataset are we working with for model development?

In [5]:
df = pd.read_csv('../data/Telco-Customer-Churn.csv')
print(f"Dataset loaded: {df.shape[0]} rows × {df.shape[1]} columns")
print(f"\nDataset info:")
print(f"  Customers: {len(df):,}")
print(f"  Features: {df.shape[1]}")
print(f"  Target: Churn (Yes/No)")
print(f"\nFirst 3 rows:")
print("_" * 100)
print(df.head(3))

Dataset loaded: 7043 rows × 21 columns

Dataset info:
  Customers: 7,043
  Features: 21
  Target: Churn (Yes/No)

First 3 rows:
____________________________________________________________________________________________________
   customerID  gender  SeniorCitizen Partner Dependents  tenure PhoneService  \
0  7590-VHVEG  Female              0     Yes         No       1           No   
1  5575-GNVDE    Male              0      No         No      34          Yes   
2  3668-QPYBK    Male              0      No         No       2          Yes   

      MultipleLines InternetService OnlineSecurity  ... DeviceProtection  \
0  No phone service             DSL             No  ...               No   
1                No             DSL            Yes  ...              Yes   
2                No             DSL            Yes  ...               No   

  TechSupport StreamingTV StreamingMovies        Contract PaperlessBilling  \
0          No          No              No  Month-to-month          

### 2. How do we prepare categorical variables for machine learning models?

In [7]:
df_model = df.copy()
categorical_cols = [
    'gender', 'Partner', 'Dependents', 'PhoneService', 'MultipleLines',
    'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection',
    'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract',
    'PaperlessBilling', 'PaymentMethod',
]

# Label-encode each categorical column and keep the fitted encoder for reuse
label_encoders = {}
for col in categorical_cols:
    if col in df_model.columns:
        le = LabelEncoder()
        df_model[col] = le.fit_transform(df_model[col].astype(str))
        label_encoders[col] = le

# Encode the target variable (Churn: Yes=1, No=0)
df_model['Churn'] = (df_model['Churn'] == 'Yes').astype(int)
print("Categorical variables encoded")
print(f"\nChurn distribution:")
print(f"  No (0): {(df_model['Churn'] == 0).sum():,} customers")
print(f"  Yes (1): {(df_model['Churn'] == 1).sum():,} customers")
print(f"\nSample")
print(df_model[['gender', 'Contract', 'Churn']].head(3))

Categorical variables encoded

Churn distribution:
  No (0): 5,174 customers
  Yes (1): 1,869 customers

Sample
   gender  Contract  Churn
0       0         0      0
1       1         1      0
2       1         0      1
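
`LabelEncoder` assigns each category an integer, so a linear model such as Logistic Regression will treat a column like `PaymentMethod` as if its categories were ordered. One common alternative is one-hot encoding. The sketch below illustrates it with `pd.get_dummies` on a tiny hypothetical frame; the rows are made up for illustration, and this is not the encoding the notebook actually uses.

```python
import pandas as pd

# Tiny hypothetical frame with two of the categorical columns used above.
demo = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
    "PaymentMethod": ["Electronic check", "Mailed check",
                      "Electronic check", "Credit card (automatic)"],
    "tenure": [1, 34, 2, 45],
})

# One indicator column per category; drop_first removes one level per column,
# since it is fully determined by the remaining indicators.
encoded = pd.get_dummies(
    demo, columns=["Contract", "PaymentMethod"], drop_first=True, dtype=int
)
print(encoded.columns.tolist())
```

Tree-based models are usually less sensitive to this choice, so the difference matters mostly for the Logistic Regression baseline.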


### 3. What features will we use to train the models, and how do we prepare them?

In [9]:
df_model['TotalCharges'] = pd.to_numeric(df_model['TotalCharges'], errors='coerce')

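# Fill missing values with the median (plain assignment instead of a chained
# inplace call, which recent pandas versions warn about and may not apply)
df_model['TotalCharges'] = df_model['TotalCharges'].fillna(df_model['TotalCharges'].median())
# Prints the number of missing values remaining after the fill (expected: 0)
print(f"  Missing values filled: {df_model['TotalCharges'].isnull().sum()}")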

# Prepare features (X) and target (y)
X = df_model.drop(['customerID', 'Churn'], axis=1, errors='ignore')
y = df_model['Churn']

print(f"\n Features and target prepared")
print(f"  Features (X): {X.shape[0]} rows × {X.shape[1]} columns")
print(f"  Target (y): {y.shape[0]} values")
print(f"\nFeature columns ({X.shape[1]} total):")
for i, col in enumerate(X.columns, 1):
    print(f"  {i}. {col}")


  Missing values filled: 0

 Features and target prepared
  Features (X): 7043 rows × 19 columns
  Target (y): 7043 values

Feature columns (19 total):
  1. gender
  2. SeniorCitizen
  3. Partner
  4. Dependents
  5. tenure
  6. PhoneService
  7. MultipleLines
  8. InternetService
  9. OnlineSecurity
  10. OnlineBackup
  11. DeviceProtection
  12. TechSupport
  13. StreamingTV
  14. StreamingMovies
  15. Contract
  16. PaperlessBilling
  17. PaymentMethod
  18. MonthlyCharges
  19. TotalCharges


### 4. How do we split the data to train and evaluate our models fairly?

In [10]:
# Stratified split (80% train, 20% test) so both sets keep the same churn rate
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print(f"\nTraining set:")
print(f"  Samples: {len(X_train):,}")
print(f"  Churned: {y_train.sum():,} ({y_train.mean():.1%})")
print(f"  Retained: {(y_train == 0).sum():,} ({(y_train == 0).mean():.1%})")

print(f"\nTest set:")
print(f"  Samples: {len(X_test):,}")
print(f"  Churned: {y_test.sum():,} ({y_test.mean():.1%})")
print(f"  Retained: {(y_test == 0).sum():,} ({(y_test == 0).mean():.1%})")



Training set:
  Samples: 5,634
  Churned: 1,495 (26.5%)
  Retained: 4,139 (73.5%)

Test set:
  Samples: 1,409
  Churned: 374 (26.5%)
  Retained: 1,035 (73.5%)
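
Both sets above show the same 26.5% churn rate because the split passes `stratify=y`. The following sketch compares a stratified and an unstratified split on hypothetical labels generated at a similar rate. The arrays `X_demo` and `y_demo` are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels with roughly the churn rate reported above (26.5%).
rng = np.random.default_rng(42)
y_demo = (rng.random(2000) < 0.265).astype(int)
X_demo = rng.normal(size=(2000, 3))

# Same split size and seed, with and without stratification.
_, _, y_tr_s, y_te_s = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)
_, _, y_tr_r, y_te_r = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

print(f"overall positive rate: {y_demo.mean():.3f}")
print(f"stratified test set:   {y_te_s.mean():.3f}")
print(f"unstratified test set: {y_te_r.mean():.3f}")
```

The stratified test rate tracks the overall rate to within rounding, while the unstratified one can drift by chance.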


### 5. How does Logistic Regression perform as our baseline churn prediction model?

In [12]:
print("MODEL 1: LOGISTIC REGRESSION")
print("|" * 60)

# Train
lr_model = LogisticRegression(random_state=42, max_iter=1000)
lr_model.fit(X_train, y_train)

# Make predictions
lr_pred_prob = lr_model.predict_proba(X_test)[:, 1]  # Probability of churn
lr_pred = lr_model.predict(X_test)  # Predicted class (0 or 1) at the default 0.5 threshold

# Evaluate performance
lr_auc = roc_auc_score(y_test, lr_pred_prob)
print(f"\nModel trained on {len(X_train):,} customers")
print(f"Tested on {len(X_test):,} customers")
print(f"\nAUC-ROC Score: {lr_auc:.4f}")
print(f"  (0.50 = random guess, 1.00 = perfect)")

# Show confusion matrix
cm = confusion_matrix(y_test, lr_pred)
print(f"\nConfusion Matrix:")
print(f" ---------------Predicted")
print(f" ---------- No (0)  Yes (1)")
print(f"No     {cm[0,0]:5}   {cm[0,1]:5}")
print(f"Yes    {cm[1,0]:5}   {cm[1,1]:5}")

# Detailed metrics
print(f"\nDetailed Metrics")
print(classification_report(y_test, lr_pred, target_names=['Retained (0)', 'Churned (1)']))


MODEL 1: LOGISTIC REGRESSION
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

Model trained on 5,634 customers
Tested on 1,409 customers

AUC-ROC Score: 0.8408
  (0.50 = random guess, 1.00 = perfect)

Confusion Matrix:
 ---------------Predicted
 ---------- No (0)  Yes (1)
No       918     117
Yes      166     208

Detailed Metrics
              precision    recall  f1-score   support

Retained (0)       0.85      0.89      0.87      1035
 Churned (1)       0.64      0.56      0.60       374

    accuracy                           0.80      1409
   macro avg       0.74      0.72      0.73      1409
weighted avg       0.79      0.80      0.79      1409
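
The "Churned (1)" row of the report can be reproduced by hand from the confusion matrix. The short worked example below uses the four counts printed above; only the variable names are introduced here.

```python
# Counts taken from the Logistic Regression confusion matrix above.
tn, fp = 918, 117   # actual "No":  predicted No / predicted Yes
fn, tp = 166, 208   # actual "Yes": predicted No / predicted Yes

precision = tp / (tp + fp)   # of predicted churners, the share that really churned
recall = tp / (tp + fn)      # of actual churners, the share the model caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# prints: precision=0.64 recall=0.56 f1=0.60 accuracy=0.80
```

A recall of 0.56 means 166 of the 374 churners in the test set were missed, which the 80% accuracy figure alone does not show.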



### 6. Does Random Forest improve prediction performance over the Logistic Regression baseline?

In [16]:
print("MODEL 2: RANDOM FOREST")
print("=" * 60)

# Train the model
rf_model = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42, n_jobs=-1)
rf_model.fit(X_train, y_train)

# predictions
rf_pred_prob = rf_model.predict_proba(X_test)[:, 1]
rf_pred = rf_model.predict(X_test)

# performance
rf_auc = roc_auc_score(y_test, rf_pred_prob)
print(f"\nModel trained on {len(X_train):,} customers")
print(f"Tested on {len(X_test):,} customers")
print(f"\n AUC-ROC Score: {rf_auc:.4f}")

# confusion matrix
cm = confusion_matrix(y_test, rf_pred)
print(f"\nConfusion Matrix:")
print(f" --------------------Predicted")
print(f" ---------------- No (0)  Yes (1)")
print(f"Actual No     {cm[0,0]:5}   {cm[0,1]:5}")
print(f"Actual Yes    {cm[1,0]:5}   {cm[1,1]:5}")
print(f"\nAccuracy: {((cm[0,0] + cm[1,1]) / len(y_test)):.1%}")

MODEL 2: RANDOM FOREST
============================================================

Model trained on 5,634 customers
Tested on 1,409 customers

 AUC-ROC Score: 0.8357

Confusion Matrix:
 --------------------Predicted
 ---------------- No (0)  Yes (1)
Actual No       931     104
Actual Yes      183     191

Accuracy: 79.6%
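
The evaluation strategy mentions feature importance rankings, which a fitted Random Forest exposes through `feature_importances_`. Below is a minimal sketch on synthetic data with hypothetical feature names (`feature_0` to `feature_5`). With the notebook's own objects, the same pattern would presumably apply to `rf_model` and `X.columns`.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with hypothetical feature names (not the Telco columns).
X_arr, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, n_redundant=0, random_state=42
)
X_demo = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(6)])

forest = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
forest.fit(X_demo, y_demo)

# Impurity-based importances, one per feature, summing to 1.
importances = pd.Series(
    forest.feature_importances_, index=X_demo.columns
).sort_values(ascending=False)
print(importances.round(3))
```

Impurity-based importances tend to favor high-cardinality and continuous features, so they are best read as a rough ranking rather than an exact measure of effect.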


### 7. How does Gradient Boosting perform compared to Random Forest and Logistic Regression?

In [18]:
# Gradient Boosting
print("MODEL 3: GRADIENT BOOSTING")
print("--" * 60)

# Train the model
gb_model = GradientBoostingClassifier(n_estimators=100, max_depth=5, learning_rate=0.1, random_state=42)
gb_model.fit(X_train, y_train)

# predictions
gb_pred_prob = gb_model.predict_proba(X_test)[:, 1]
gb_pred = gb_model.predict(X_test)

# performance
gb_auc = roc_auc_score(y_test, gb_pred_prob)
print(f"\nModel trained on {len(X_train):,} customers")
print(f"Tested on {len(X_test):,} customers")
print(f"\nAUC-ROC Score: {gb_auc:.4f}")

# confusion matrix
cm = confusion_matrix(y_test, gb_pred)
print(f"\nConfusion Matrix:")
print(f"---------------------Predicted")
print(f"------------------No (0)  Yes (1)")
print(f"Actual No     {cm[0,0]:5}   {cm[0,1]:5}")
print(f"Actual Yes    {cm[1,0]:5}   {cm[1,1]:5}")
print(f"\nAccuracy: {((cm[0,0] + cm[1,1]) / len(y_test)):.1%}")


MODEL 3: GRADIENT BOOSTING
------------------------------------------------------------------------------------------------------------------------

Model trained on 5,634 customers
Tested on 1,409 customers

AUC-ROC Score: 0.8335

Confusion Matrix:
---------------------Predicted
------------------No (0)  Yes (1)
Actual No       929     106
Actual Yes      183     191

Accuracy: 79.5%


### 8. Which model performs best overall, and why should we select it for churn prediction?

In [24]:
# Compare all models
print("MODEL COMPARISON SUMMARY")
print("|" * 50)

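# Comparison table. Note: the accuracy values are typed in by hand from the
# outputs of the earlier cells; they are not recomputed here.
results = pd.DataFrame({
    'Model': ['Logistic Regression', 'Random Forest', 'Gradient Boosting'],
    'AUC-ROC': [lr_auc, rf_auc, gb_auc],
    'Accuracy': [0.80, 0.796, 0.795],
})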

# Sort
results = results.sort_values('AUC-ROC', ascending=False).reset_index(drop=True)
results['Rank'] = ['1st', '2nd', '3rd']

print("\n")
print(results.to_string(index=False))

print("\n" + "*" * 70)
print(f"Best: {results.iloc[0]['Model']}")
print(f"AUC-ROC Score: {results.iloc[0]['AUC-ROC']:.4f}")
print("*" * 70)

print(f"  - Logistic Regression: {lr_auc:.4f}")
print(f"  - Random Forest: {rf_auc:.4f} ")
print(f"  - Gradient Boosting: {gb_auc:.4f} ")


MODEL COMPARISON SUMMARY
||||||||||||||||||||||||||||||||||||||||||||||||||


              Model  AUC-ROC  Accuracy Rank
Logistic Regression 0.840758     0.800  1st
      Random Forest 0.835676     0.796  2nd
  Gradient Boosting 0.833455     0.795  3rd

**********************************************************************
Best: Logistic Regression
AUC-ROC Score: 0.8408
**********************************************************************
  - Logistic Regression: 0.8408
  - Random Forest: 0.8357 
  - Gradient Boosting: 0.8335 
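
The accuracy column in the comparison table is entered by hand, so it can drift if a model is retrained. As a sketch of one way to avoid that, each row could be computed directly from the predictions. The helper `summarize` and the toy arrays below are hypothetical; in the notebook the inputs would be `y_test` with each model's predicted classes and probabilities.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, roc_auc_score


def summarize(name, y_true, y_pred, y_prob):
    """Return one comparison-table row computed from predictions."""
    return {
        "Model": name,
        "AUC-ROC": roc_auc_score(y_true, y_prob),
        "Accuracy": accuracy_score(y_true, y_pred),
    }


# Hypothetical labels and predicted probabilities for two toy models.
y_true = np.array([0, 0, 1, 1, 0, 1])
prob_a = np.array([0.1, 0.4, 0.8, 0.6, 0.2, 0.9])
prob_b = np.array([0.3, 0.7, 0.6, 0.4, 0.2, 0.8])

rows = [
    summarize("Model A", y_true, (prob_a >= 0.5).astype(int), prob_a),
    summarize("Model B", y_true, (prob_b >= 0.5).astype(int), prob_b),
]

# Sort by AUC-ROC and derive the rank from the sorted position.
table = pd.DataFrame(rows).sort_values("AUC-ROC", ascending=False).reset_index(drop=True)
table["Rank"] = range(1, len(table) + 1)
print(table.to_string(index=False))
```

Deriving the rank from the sorted position also keeps the table valid if a fourth model is added later.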


In [25]:
best_model = lr_model
best_model_name = "Logistic Regression"
print(f"\n Selected Model: {best_model_name}")


 Selected Model: Logistic Regression


### 9. How do we save the best model for future customer risk predictions?

In [26]:
import pickle

# Save the selected model
with open('../outputs/churn_model.pkl', 'wb') as f:
    pickle.dump(lr_model, f)
print(f"  File: outputs/churn_model.pkl")

# Save the feature names so new data can be aligned to the training columns
feature_names = list(X.columns)
with open('../outputs/feature_names.pkl', 'wb') as f:
    pickle.dump(feature_names, f)
print(f"  Total features: {len(feature_names)}")


  File: outputs/churn_model.pkl
  Total features: 19
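
To use the saved artifacts later, both files would need to be loaded and new rows arranged in the saved column order. The sketch below shows that round trip with a hypothetical two-feature model written to a temporary directory; the training rows are made up, and the real files under `../outputs/` are not touched.

```python
import os
import pickle
import tempfile

import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical two-feature model standing in for the saved churn model.
feature_names = ["tenure", "MonthlyCharges"]
X_demo = pd.DataFrame({
    "tenure": [1, 34, 2, 45, 3, 60],
    "MonthlyCharges": [70.0, 56.9, 53.8, 42.3, 99.6, 20.0],
})
y_demo = [1, 0, 1, 0, 1, 0]
model = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "churn_model.pkl")
    names_path = os.path.join(tmp, "feature_names.pkl")
    with open(model_path, "wb") as f:
        pickle.dump(model, f)
    with open(names_path, "wb") as f:
        pickle.dump(feature_names, f)

    # Later, for example in a scoring script: load both artifacts back.
    with open(model_path, "rb") as f:
        loaded_model = pickle.load(f)
    with open(names_path, "rb") as f:
        loaded_names = pickle.load(f)

# Reorder the incoming columns to match the order used at training time.
new_customer = pd.DataFrame([{"MonthlyCharges": 80.0, "tenure": 5}])[loaded_names]
restored_prob = loaded_model.predict_proba(new_customer)[0, 1]
original_prob = model.predict_proba(new_customer)[0, 1]
print(f"churn probability: {restored_prob:.3f}")
```

Pickled scikit-learn models are generally only safe to load with the same library version that wrote them, and pickle files should only be loaded from trusted sources.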


### 10. What is the business impact of customer churn, and how many high-risk customers can we identify?

In [2]:
import pandas as pd
df = pd.read_csv('../data/Telco-Customer-Churn.csv')
total_customers = len(df)
print(f"\nTotal customers analyzed: {total_customers:,}")
churned = (df['Churn'] == 'Yes').sum()
churn_rate = (churned / total_customers) * 100
print(f"Actual churn count: {churned:,}")
print(f"Actual churn rate: {churn_rate:.1f}%")
avg_monthly = df['MonthlyCharges'].mean()
print(f"\nAverage monthly charges: ${avg_monthly:.2f}")
revenue_at_risk_monthly = churned * avg_monthly
revenue_at_risk_annual = revenue_at_risk_monthly * 12
print(f"Monthly revenue at risk: ${revenue_at_risk_monthly:,.2f}")
print(f"Annual revenue at risk: ${revenue_at_risk_annual:,.2f}")
print(f"Annual revenue at risk (millions): ${revenue_at_risk_annual/1_000_000:.1f}M")
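# y_test and lr_pred exist only if the modeling cells above were run in the
# current kernel session; after a restart this falls through to the except.
try:
    from sklearn.metrics import accuracy_score
    acc = accuracy_score(y_test, lr_pred)
    print(f"\nModel accuracy: {(acc*100):.1f}%")
except NameError:
    print(f"\nModel accuracy:")

# The high-risk list is read from a separate output file
try:
    risk_df = pd.read_csv('../outputs/high_risk_action_list.csv')
    high_risk_count = len(risk_df)
    print(f"High-risk customers identified: {high_risk_count:,}")
except FileNotFoundError:
    print(" High-risk file not found ")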



Total customers analyzed: 7,043
Actual churn count: 1,869
Actual churn rate: 26.5%

Average monthly charges: $64.76
Monthly revenue at risk: $121,039.60
Annual revenue at risk: $1,452,475.24
Annual revenue at risk (millions): $1.5M

Model accuracy:
High-risk customers identified: 436
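
The revenue figure above multiplies the churn count by the average monthly charge across all customers, which implicitly assumes that churned customers pay the same as everyone else. A minimal sketch of an alternative, summing the charges of the churned customers themselves, is shown below on a hypothetical four-row frame that reuses the dataset's column names. The numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical four-customer frame using the same column names as the dataset.
demo = pd.DataFrame({
    "MonthlyCharges": [30.0, 50.0, 90.0, 110.0],
    "Churn": ["No", "No", "Yes", "Yes"],
})

churned_mask = demo["Churn"] == "Yes"

# Approach used above: churn count times the average charge over ALL customers.
estimate_overall_avg = churned_mask.sum() * demo["MonthlyCharges"].mean()

# Alternative: sum the charges the churned customers actually paid.
estimate_actual = demo.loc[churned_mask, "MonthlyCharges"].sum()

print(f"count x overall average: ${estimate_overall_avg:,.2f}")
print(f"sum of churned charges:  ${estimate_actual:,.2f}")
```

In this toy frame the two estimates are $140.00 and $200.00, which shows how far they can diverge when churners sit at one end of the price range.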
