# Telecom Customer Churn Prediction - Model Training & MLflow Deployment

This notebook implements a complete MLOps pipeline for telecom customer churn prediction, including model training, evaluation, hyperparameter tuning, and MLflow deployment.

## Project Overview
- **Objective**: Build and deploy machine learning models that predict customer churn
- **Input Data**: Cleaned dataset from the EDA notebook (`cleaned_telcom_data.csv`)
- **MLflow Integration**: Experiment tracking and model registry
- **Target Metrics**: Prioritize recall (catching churners) and AUC-ROC
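
The emphasis on recall can be made concrete with a small worked example. The sketch below uses made-up labels (not the notebook's data) and plain Python to show how a model can look accurate while still missing most churners. The names `confusion_counts`, `y_true`, and `y_pred` are illustrative only.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means churn."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


# Hypothetical labels: 10 customers, 3 of whom churned.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
# A cautious model that flags only one of the three churners.
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)  # share of all predictions that are right
recall = tp / (tp + fn)             # share of actual churners that were caught
precision = tp / (tp + fp)          # share of flagged customers who churned

print(f"accuracy={accuracy:.2f} recall={recall:.2f} precision={precision:.2f}")
```

Here accuracy is 0.80 and precision is 1.00, yet recall is only 0.33: two of the three churners go unnoticed, which is the failure mode a recall-first objective is meant to penalize.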

## Notebook Structure
1. Setup & Data Loading
2. Data Splitting
3. Baseline Model: Logistic Regression
4. Advanced Models Training
5. Hyperparameter Tuning
6. Model Evaluation & Visualization
7. Final Model Selection
8. MLflow Integration & Logging
9. Model Registry & Deployment
10. Business Insights & Feature Analysis

## 1. Setup & Data Loading

In [1]:
# Install required packages
# %pip install mlflow xgboost lightgbm scikit-learn pandas numpy matplotlib seaborn --quiet

In [2]:
# Import essential libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
import os
import pickle
from datetime import datetime

# Machine Learning libraries
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score, 
                           roc_auc_score, classification_report, confusion_matrix, roc_curve)
from sklearn.preprocessing import StandardScaler

# Advanced ML libraries
import xgboost as xgb
import lightgbm as lgb

# MLflow for experiment tracking and model registry
import mlflow
import mlflow.sklearn
import mlflow.xgboost
import mlflow.lightgbm
from mlflow.tracking import MlflowClient

# Set random seeds for reproducibility
np.random.seed(42)

# Configure warnings and display settings
warnings.filterwarnings('ignore')
plt.style.use('default')
sns.set_palette("husl")
pd.set_option('display.max_columns', None)

print(" All libraries imported successfully!")
print(f" Pandas version: {pd.__version__}")
print(f" Scikit-learn available")
print(f" XGBoost version: {xgb.__version__}")
print(f" LightGBM version: {lgb.__version__}")
print(f" MLflow version: {mlflow.__version__}")

 All libraries imported successfully!
 Pandas version: 2.3.3
 Scikit-learn available
 XGBoost version: 3.0.5
 LightGBM version: 4.6.0
 MLflow version: 3.5.0
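
The printout above reports scikit-learn only as "available" and omits NumPy. For reproducibility it may help to record the exact version of every package in one place. A minimal sketch using only the standard library; the helper name `installed_versions` and the package list are assumptions, not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Distribution names as published on PyPI (assumed to match this environment).
PACKAGES = ["pandas", "numpy", "scikit-learn", "xgboost", "lightgbm", "mlflow"]

versions = installed_versions(PACKAGES)
for name, ver in versions.items():
    print(f"{name}=={ver}" if ver else f"{name}: not installed")
```

The resulting dictionary could also be attached to an MLflow run as parameters or tags, so each experiment carries a record of its environment.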


In [3]:
# Configure MLflow
mlflow.set_experiment("telcom_churn_prediction")

# Get experiment info
experiment = mlflow.get_experiment_by_name("telcom_churn_prediction")
print(f" MLflow Experiment ID: {experiment.experiment_id}")
print(f" MLflow Tracking URI: {mlflow.get_tracking_uri()}")

# Create directories for artifacts if they don't exist
os.makedirs("artifacts", exist_ok=True)
os.makedirs("models", exist_ok=True)



2025/10/21 12:09:43 INFO mlflow.tracking.fluent: Experiment with name 'telcom_churn_prediction' does not exist. Creating a new experiment.


 MLflow Experiment ID: 444754456109301605
 MLflow Tracking URI: file:///c:/Users/Admin/Documents/ML_Engineering/Churn_Prediction/mlruns
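
The tracking URI printed above points at an `mlruns` folder inside the directory the notebook was launched from. If runs should land in the same place regardless of the working directory, one option is to build the URI explicitly. A minimal sketch: `local_tracking_uri` is a hypothetical helper, and the `mlflow.set_tracking_uri` call is left commented out so the snippet runs without MLflow installed.

```python
from pathlib import Path


def local_tracking_uri(project_dir):
    """Build a file:// URI for an mlruns folder under project_dir."""
    return (Path(project_dir).resolve() / "mlruns").as_uri()


# "." is a placeholder; substitute the project root of your choice.
uri = local_tracking_uri(".")
print(uri)

# import mlflow
# mlflow.set_tracking_uri(uri)  # call this before mlflow.set_experiment(...)
```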


In [4]:
# Load the cleaned dataset from EDA notebook
df = pd.read_csv('cleaned_telcom_data.csv')
print(" Dataset loaded successfully!")
print(f" Dataset shape: {df.shape}")
print(f" Columns: {df.shape[1]} (including the Churn target)")

# Display basic information about the dataset
print(f"\n Dataset Overview:")
print(f"   • Total records: {len(df):,}")
print(f"   • Total columns: {df.shape[1]}")
print(f"   • Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

# Check for any missing values
missing_values = df.isnull().sum().sum()
print(f"   • Missing values: {missing_values}")

# Display first few rows
print(f"\n First 3 rows:")
df.head(3)

 Dataset loaded successfully!
 Dataset shape: (7043, 38)
 Columns: 38 (including the Churn target)

 Dataset Overview:
   • Total records: 7,043
   • Total columns: 38
   • Memory usage: 1.57 MB
   • Missing values: 0

 First 3 rows:


Unnamed: 0,gender,SeniorCitizen,Partner,Dependents,tenure,PhoneService,PaperlessBilling,MonthlyCharges,TotalCharges,Churn,ChargeRatio,ServiceCount,HasPhoneService,HasInternetService,MultipleLines_Yes,MultipleLines_No,OnlineSecurity_Yes,OnlineSecurity_No,OnlineBackup_Yes,OnlineBackup_No,DeviceProtection_Yes,DeviceProtection_No,TechSupport_Yes,TechSupport_No,StreamingTV_Yes,StreamingTV_No,StreamingMovies_Yes,StreamingMovies_No,InternetService_Fiber optic,InternetService_No,Contract_One year,Contract_Two year,PaymentMethod_Credit card (automatic),PaymentMethod_Electronic check,PaymentMethod_Mailed check,TenureBucket_13-24_months,TenureBucket_25-48_months,TenureBucket_49+_months
0,0,0,1,0,1,0,1,29.85,29.85,0,1.0,1,0,1,0,0,0,1,1,0,0,1,0,1,0,1,0,1,False,False,False,False,False,True,False,False,False,False
1,1,0,0,0,34,1,0,56.95,1889.5,0,1.024768,3,1,1,0,1,1,0,0,1,1,0,0,1,0,1,0,1,False,False,True,False,False,False,True,False,True,False
2,1,0,0,0,2,1,1,53.85,108.15,1,0.995839,3,1,1,0,1,1,0,1,0,0,1,0,1,0,1,0,1,False,False,False,False,False,False,True,False,False,False


In [None]:
# Separate features (X) and target (y)
X = df.drop('Churn', axis=1)
y = df['Churn']
print(" Features and target separated successfully!")

# Display feature and target information
print(f"\n Target Variable (Churn) Distribution:")
churn_dist = y.value_counts()
churn_pct = y.value_counts(normalize=True) * 100
for label, count in churn_dist.items():
    print(f"   • {label}: {count:,} customers ({churn_pct[label]:.2f}%)")

print(f"\n Feature Set:")
print(f"   • Number of features: {X.shape[1]}")
print(f"   • Feature data types:")
for dtype, count in X.dtypes.value_counts().items():
    print(f"     - {dtype}: {count} features")

print(f"\n Feature names:")
feature_names = list(X.columns)
for i, feature in enumerate(feature_names, 1):
    print(f"   {i:2d}. {feature}")

# Verify data quality
print(f"\n Data Quality Check:")
print(f"   • Features missing values: {X.isnull().sum().sum()}")
print(f"   • Target missing values: {y.isnull().sum()}")
print(f"   • Duplicate rows: {X.duplicated().sum()}")


 Features and target separated successfully!

 Target Variable (Churn) Distribution:
   • 0: 5,174 customers (73.46%)
   • 1: 1,869 customers (26.54%)

 Feature Set:
   • Number of features: 37
   • Feature data types:
     - int64: 24 features
     - bool: 10 features
     - float64: 3 features

 Feature names:
    1. gender
    2. SeniorCitizen
    3. Partner
    4. Dependents
    5. tenure
    6. PhoneService
    7. PaperlessBilling
    8. MonthlyCharges
    9. TotalCharges
   10. ChargeRatio
   11. ServiceCount
   12. HasPhoneService
   13. HasInternetService
   14. MultipleLines_Yes
   15. MultipleLines_No
   16. OnlineSecurity_Yes
   17. OnlineSecurity_No
   18. OnlineBackup_Yes
   19. OnlineBackup_No
   20. DeviceProtection_Yes
   21. DeviceProtection_No
   22. TechSupport_Yes
   23. TechSupport_No
   24. StreamingTV_Yes
   25. StreamingTV_No
   26. StreamingMovies_Yes
   27. StreamingMovies_No
   28. InternetService_Fiber optic
   29. InternetService_No
   30. Contract_One year
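
The class counts printed above (5,174 non-churners and 1,869 churners) imply a moderately imbalanced target. The short calculation below works out the churn rate and the negative-to-positive ratio. It assumes class weighting might later be used to support the recall objective: `scale_pos_weight` is XGBoost's parameter for this ratio, but whether the notebook sets it is not shown in this section.

```python
n_negative = 5174  # customers with Churn == 0 (from the output above)
n_positive = 1869  # customers with Churn == 1 (from the output above)

total = n_negative + n_positive
churn_rate = n_positive / total
imbalance_ratio = n_negative / n_positive

print(f"Total customers: {total:,}")
print(f"Churn rate: {churn_rate:.2%}")
print(f"Negative/positive ratio: {imbalance_ratio:.2f}")

# One possible use (an assumption, not the notebook's confirmed setting):
# xgb.XGBClassifier(scale_pos_weight=imbalance_ratio)
```

The ratio comes out at about 2.77, so each churner would count roughly 2.8 times as much as a non-churner if it were used as a positive-class weight.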

## 2. Data Splitting

In [None]:
# Split data: 70% train, 15% validation, 15% test
# First split: 70% train, 30% temp
X_train, X_temp, y_train, y_temp = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Second split: 15% validation, 15% test (from the 30% temp)
X_val, X_test, y_val, y_test = train_test_split(
    X_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp
)

print(" Data splitting completed!")
print(f"\n Dataset Split Summary:")
print(f"   • Training set:   {X_train.shape[0]:,} samples ({X_train.shape[0]/len(X)*100:.1f}%)")
print(f"   • Validation set: {X_val.shape[0]:,} samples ({X_val.shape[0]/len(X)*100:.1f}%)")
print(f"   • Test set:       {X_test.shape[0]:,} samples ({X_test.shape[0]/len(X)*100:.1f}%)")
print(f"   • Total:          {len(X):,} samples")

# Verify stratification worked correctly
print(f"\n Churn Rate Distribution Across Splits:")
train_churn_rate = y_train.mean() * 100
val_churn_rate = y_val.mean() * 100
test_churn_rate = y_test.mean() * 100
overall_churn_rate = y.mean() * 100

print(f"   • Overall:    {overall_churn_rate:.2f}%")
print(f"   • Training:   {train_churn_rate:.2f}%")
print(f"   • Validation: {val_churn_rate:.2f}%")
print(f"   • Test:       {test_churn_rate:.2f}%")

# Check that each split's churn rate is within 1 percentage point of the overall rate
max_deviation = max(abs(train_churn_rate - overall_churn_rate),
                    abs(val_churn_rate - overall_churn_rate),
                    abs(test_churn_rate - overall_churn_rate))

if max_deviation < 1.0:
    print(" Stratification successful - all splits within 1 percentage point of overall churn rate")
else:
    print(f" Warning: Maximum deviation from overall rate: {max_deviation:.2f} percentage points")

# Store split information for later use
split_info = {
    'train_size': len(X_train),
    'val_size': len(X_val),
    'test_size': len(X_test),
    'train_churn_rate': train_churn_rate,
    'val_churn_rate': val_churn_rate,
    'test_churn_rate': test_churn_rate,
    'overall_churn_rate': overall_churn_rate,
    'random_state': 42
}

print(f"\n Split Configuration Stored for MLflow Logging")
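
The two-stage split above relies on a small piece of arithmetic: the second `test_size` is a fraction of the 30% holdout, not of the full dataset, so 0.5 of 0.3 gives 15% each for validation and test. The sketch below generalizes that calculation to other proportions. `two_stage_test_sizes` is a hypothetical helper, not something the notebook defines.

```python
def two_stage_test_sizes(val_frac, test_frac):
    """Return (first_test_size, second_test_size) for a train/val/test split.

    The first split holds out val_frac + test_frac of all rows. The second
    split divides that holdout so the test set ends up with test_frac of
    the full dataset.
    """
    holdout = val_frac + test_frac
    if not 0 < holdout < 1:
        raise ValueError("val_frac + test_frac must be strictly between 0 and 1")
    return holdout, test_frac / holdout


# The 70/15/15 split used in this notebook.
first, second = two_stage_test_sizes(0.15, 0.15)
print(f"first test_size={first:.2f}, second test_size={second:.2f}")

# An uneven 70/20/10 split needs a different second-stage value.
first_b, second_b = two_stage_test_sizes(0.20, 0.10)
print(f"first test_size={first_b:.2f}, second test_size={second_b:.4f}")
```

The two returned values would be passed as `test_size` to the first and second `train_test_split` calls respectively, keeping `stratify` set on both so the churn rate stays consistent across splits.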