# Customer Churn Prediction Analysis

## Project Overview
This notebook implements a comprehensive machine learning analysis to predict customer churn using the IBM Telco Customer Churn dataset. The analysis compares two approaches:
- **Logistic Regression** with feature selection and hyperparameter tuning
- **Deep Neural Network** with early stopping regularization

## Key Workflow
1. **Data Preprocessing** - Loading, cleaning, encoding, and scaling
2. **Class Imbalance Handling** - SMOTE for balanced training data
3. **Feature Selection** - Mutual Information for identifying top predictors
4. **Model Training** - LR with GridSearchCV and DNN with Keras
5. **Evaluation** - Comprehensive metrics comparison
6. **Model Persistence** - Saving models and preprocessing objects

In [4]:
# Importing necessary libraries and setup.

import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd
from collections import Counter

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, confusion_matrix, classification_report)

from imblearn.over_sampling import SMOTE

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, InputLayer
from tensorflow.keras.callbacks import EarlyStopping

# Reproducibility.
RANDOM_STATE = 42

# Path to dataset
DATA_PATH = "WA_Fn-UseC_-Telco-Customer-Churn.csv"

## 1. Setup and Configuration
Import all necessary libraries and set configuration parameters.

In [5]:
# DATA PREPROCESSING

# Loading dataset.
df = pd.read_csv(DATA_PATH)


# Basic cleanup steps & target encoding.
# Mapping target 'Churn' to binary 1/0
df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0})


# Handling TotalCharges blanks and converting to numeric.
# In this dataset TotalCharges may be empty strings for customers with tenure=0
# Converting to numeric coercing errors to NaN, then fill with median.
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'].replace(' ', np.nan), errors='coerce')
total_median = df['TotalCharges'].median()  # choosing median to be robust
df['TotalCharges'].fillna(total_median, inplace=True)


# Identifying features.
target_col = 'Churn'
# Treat object dtype (excluding customerID) as categorical / nominal.
drop_cols = ['customerID'] if 'customerID' in df.columns else []
feature_cols = [c for c in df.columns if c not in drop_cols + [target_col]]

# Separate numeric vs categorical
numeric_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
# Confirm numeric cols exist; if not, detect numerics automatically
numeric_cols = [c for c in numeric_cols if c in df.columns]
# For categorical, use object dtype columns or explicitly exclude numeric
categorical_cols = [c for c in feature_cols if c not in numeric_cols]


# One-Hot Encoding for nominal categorical features.
# Use drop_first=True to reduce collinearity.
df_encoded = pd.get_dummies(df.drop(columns=drop_cols), columns=categorical_cols, drop_first=True)


# Feature matrix and target vector
X = df_encoded.drop(columns=[target_col])
y = df_encoded[target_col]

# Min-Max Scaling for numeric features (its very important to do this on the full data BEFORE the split so transforms are consistent.)
# we scaled across the whole dataset for simplicity.
scaler = MinMaxScaler()
# finding the scaled numeric columns names in X (they exist unchanged since get_dummies didn't touch them)
scale_cols = [c for c in numeric_cols if c in X.columns]
X[scale_cols] = scaler.fit_transform(X[scale_cols])


# Train/test split (70/30) with stratification on churn.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=RANDOM_STATE
)

print("Original training class distribution:", Counter(y_train))
print("Original test class distribution:", Counter(y_test))


Original training class distribution: Counter({0: 3622, 1: 1308})
Original test class distribution: Counter({0: 1552, 1: 561})


## 2. Data Preprocessing
Clean the dataset, handle missing values, encode categorical features, and scale numeric features.

In [6]:
#HANDLING CLASS IMBALANCE WITH SMOTE

# Applying SMOTE on training data only. Keep the test set untouched.
smote = SMOTE(random_state=RANDOM_STATE)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print("After SMOTE training class distribution:", Counter(y_train_smote))


After SMOTE training class distribution: Counter({0: 3622, 1: 3622})


## 3. Class Imbalance Handling
Apply SMOTE (Synthetic Minority Oversampling Technique) to balance the training data.

In [7]:
#FEATURE SELECTION USING MUTUAL INFORMATION FOR LOGISTIC REGRESSION MODELING

# Computing mutual information scores on the pre-SMOTE training set.
mi_scores = mutual_info_classif(X_train, y_train, random_state=RANDOM_STATE)
mi_series = pd.Series(mi_scores, index=X_train.columns).sort_values(ascending=False)

# Top 10 most informative features.
top_10_features = mi_series.head(10)
print("Top 10 features by Mutual Information (pre-SMOTE training set):")
print(top_10_features)

# For later steps we need the list (as Python list).
top10_feature_list = top_10_features.index.tolist()


Top 10 features by Mutual Information (pre-SMOTE training set):
tenure                                  0.076137
Contract_Two year                       0.060981
InternetService_Fiber optic             0.054962
PaymentMethod_Electronic check          0.051235
MonthlyCharges                          0.044990
TotalCharges                            0.038035
DeviceProtection_No internet service    0.034879
TechSupport_No internet service         0.034649
StreamingMovies_No internet service     0.034344
OnlineBackup_No internet service        0.030516
dtype: float64


## 4. Feature Selection
Identify the most informative features using Mutual Information scores.

In [8]:
#LOGISTIC REGRESSION IMPLEMENTATION AND PARAMETER HYPERPARAMETER TUNING

# Prepare training data for Logistic Regression using SMOTE-balanced training set
# Using only the top 10 features found with MI
X_train_lr = X_train_smote[top10_feature_list]
X_test_lr  = X_test[top10_feature_list]  # test set still from holdout (not SMOTE)

# Defining model and hyperparameter grid.
lr = LogisticRegression(solver='liblinear', random_state=RANDOM_STATE, max_iter=1000)

param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2']
}

grid = GridSearchCV(lr, param_grid=param_grid, cv=5, scoring='roc_auc', n_jobs=-1, verbose=1)
grid.fit(X_train_lr, y_train_smote)

print("Best LR hyperparameters:", grid.best_params_)
best_lr = grid.best_estimator_


Fitting 5 folds for each of 10 candidates, totalling 50 fits
Best LR hyperparameters: {'C': 100, 'penalty': 'l1'}
Best LR hyperparameters: {'C': 100, 'penalty': 'l1'}


## 5. Logistic Regression Model
Train Logistic Regression with hyperparameter tuning using GridSearchCV.

In [9]:
# DEEP NEURAL NETWORK IMPLEMENTATION

# Preparing full training set (SMOTE-balanced) and full test set (holdout) for DNN
X_train_dnn = X_train_smote.astype('float32').values  # convert to float32 and numpy array
y_train_dnn = y_train_smote.astype('float32').values
X_test_dnn  = X_test.astype('float32').values
y_test_dnn  = y_test.astype('float32').values

input_dim = X_train_dnn.shape[1]

# Building the Keras model as requested:
def build_dnn_model(input_dim):
    model = Sequential()
    model.add(InputLayer(input_shape=(input_dim,)))
    model.add(Dense(64, activation='relu'))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))  # binary output
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['AUC'])
    return model

dnn_model = build_dnn_model(input_dim)

# Early stopping callback
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

# Train the DNN
history = dnn_model.fit(
    X_train_dnn, y_train_dnn,
    validation_split=0.10,  # 10% of training used for validation
    epochs=100,
    batch_size=64,
    callbacks=[early_stop],
    verbose=2
)

Epoch 1/100
102/102 - 4s - 36ms/step - AUC: 0.8165 - loss: 0.5260 - val_AUC: 0.0000e+00 - val_loss: 0.4710
Epoch 2/100
102/102 - 4s - 36ms/step - AUC: 0.8165 - loss: 0.5260 - val_AUC: 0.0000e+00 - val_loss: 0.4710
Epoch 2/100
102/102 - 1s - 5ms/step - AUC: 0.8527 - loss: 0.4734 - val_AUC: 0.0000e+00 - val_loss: 0.6475
Epoch 3/100
102/102 - 1s - 5ms/step - AUC: 0.8527 - loss: 0.4734 - val_AUC: 0.0000e+00 - val_loss: 0.6475
Epoch 3/100
102/102 - 1s - 5ms/step - AUC: 0.8597 - loss: 0.4626 - val_AUC: 0.0000e+00 - val_loss: 0.5703
Epoch 4/100
102/102 - 1s - 5ms/step - AUC: 0.8597 - loss: 0.4626 - val_AUC: 0.0000e+00 - val_loss: 0.5703
Epoch 4/100
102/102 - 1s - 5ms/step - AUC: 0.8647 - loss: 0.4551 - val_AUC: 0.0000e+00 - val_loss: 0.5857
Epoch 5/100
102/102 - 1s - 5ms/step - AUC: 0.8647 - loss: 0.4551 - val_AUC: 0.0000e+00 - val_loss: 0.5857
Epoch 5/100
102/102 - 0s - 3ms/step - AUC: 0.8679 - loss: 0.4497 - val_AUC: 0.0000e+00 - val_loss: 0.4346
Epoch 6/100
102/102 - 0s - 3ms/step - AUC: 0

## 6. Deep Neural Network Model
Build and train a deep neural network with early stopping regularization.

In [10]:
#COMPARATIVE EVALUATION OF MODELS ON TEST SET

# Logistic Regression predictions
# For LR we used only top10 features
y_pred_lr = best_lr.predict(X_test_lr)
y_proba_lr = best_lr.predict_proba(X_test_lr)[:, 1]

# LR metrics
lr_acc = accuracy_score(y_test, y_pred_lr)
lr_prec = precision_score(y_test, y_pred_lr)
lr_rec = recall_score(y_test, y_pred_lr)
lr_f1 = f1_score(y_test, y_pred_lr)
lr_auc = roc_auc_score(y_test, y_proba_lr)

# DNN predictions
y_proba_dnn = dnn_model.predict(X_test_dnn).ravel()
y_pred_dnn = (y_proba_dnn >= 0.5).astype(int)

# DNN metrics
dnn_acc = accuracy_score(y_test_dnn, y_pred_dnn)
dnn_prec = precision_score(y_test_dnn, y_pred_dnn)
dnn_rec = recall_score(y_test_dnn, y_pred_dnn)
dnn_f1 = f1_score(y_test_dnn, y_pred_dnn)
dnn_auc = roc_auc_score(y_test_dnn, y_proba_dnn)

# Confusion matrix for DNN
dnn_cm = confusion_matrix(y_test_dnn, y_pred_dnn)

# Print the confusion matrix
print("DNN Confusion Matrix (rows: true class 0/1, cols: predicted 0/1):")
print(dnn_cm)


[1m67/67[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step
[1m67/67[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 3ms/step
DNN Confusion Matrix (rows: true class 0/1, cols: predicted 0/1):
[[1152  400]
 [ 133  428]]
DNN Confusion Matrix (rows: true class 0/1, cols: predicted 0/1):
[[1152  400]
 [ 133  428]]


## 7. Model Evaluation
Evaluate both models on the test set using multiple metrics.

In [11]:
# Building a DataFrame for a clean table
results = pd.DataFrame({
    'Model': ['Logistic Regression (top 10 features)', 'Deep Neural Network (full features)'],
    'Accuracy': [lr_acc, dnn_acc],
    'Precision': [lr_prec, dnn_prec],
    'Recall': [lr_rec, dnn_rec],
    'F1-score': [lr_f1, dnn_f1],
    'ROC-AUC': [lr_auc, dnn_auc]
})

# Round metrics for nicer display.
results[['Accuracy','Precision','Recall','F1-score','ROC-AUC']] = results[['Accuracy','Precision','Recall','F1-score','ROC-AUC']].round(4)

# Show table.
print("\n### Comparative Evaluation Results (Test Set / Holdout 30%)\n")
print(results.to_string(index=False))

# Also print best LR hyperparameters and top-10 features found earlier
print("\nBest Logistic Regression hyperparameters found by GridSearchCV:", grid.best_params_)
print("\nTop 10 features selected by Mutual Information (pre-SMOTE training set):")
for i, feat in enumerate(top10_feature_list, 1):
    print(f"{i}. {feat}")


### Comparative Evaluation Results (Test Set / Holdout 30%)

                                Model  Accuracy  Precision  Recall  F1-score  ROC-AUC
Logistic Regression (top 10 features)    0.7378     0.5039  0.8093    0.6211   0.8354
  Deep Neural Network (full features)    0.7478     0.5169  0.7629    0.6163   0.8361

Best Logistic Regression hyperparameters found by GridSearchCV: {'C': 100, 'penalty': 'l1'}

Top 10 features selected by Mutual Information (pre-SMOTE training set):
1. tenure
2. Contract_Two year
3. InternetService_Fiber optic
4. PaymentMethod_Electronic check
5. MonthlyCharges
6. TotalCharges
7. DeviceProtection_No internet service
8. TechSupport_No internet service
9. StreamingMovies_No internet service
10. OnlineBackup_No internet service


## 8. Results Summary
Display comparative results and key insights from the analysis.

## Model Persistence
Save all trained models and preprocessing objects to disk for future use and deployment.

In [14]:
import pickle
import os
from datetime import datetime

# Check if all required models are defined
required_vars = ['best_lr', 'dnn_model', 'scaler', 'smote', 'grid', 'RANDOM_STATE',
                 'top10_feature_list', 'X', 'numeric_cols', 'categorical_cols',
                 'lr_acc', 'lr_prec', 'lr_rec', 'lr_f1', 'lr_auc',
                 'dnn_acc', 'dnn_prec', 'dnn_rec', 'dnn_f1', 'dnn_auc']

missing_vars = [var for var in required_vars if var not in locals()]

if missing_vars:
    print(f"❌ Error: The following variables are not defined: {', '.join(missing_vars)}")
    print("Please make sure to run all the cells above in order before running this cell.")
else:
    # Creating models directory
    models_dir = "models"
    if not os.path.exists(models_dir):
        os.makedirs(models_dir)

    # Generating timestamp for model versioning
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

    # Saving Logistic Regression model
    lr_model_path = os.path.join(models_dir, f"logistic_regression_model_{timestamp}.pkl")
    with open(lr_model_path, 'wb') as f:
        pickle.dump(best_lr, f)
    print(f"✓ Logistic Regression model saved: {lr_model_path}")

    # Saving DNN model (TensorFlow/Keras format)
    dnn_model_path = os.path.join(models_dir, f"dnn_model_{timestamp}.h5")
    dnn_model.save(dnn_model_path)
    print(f"✓ Deep Neural Network model saved: {dnn_model_path}")

    # Saving the scaler
    scaler_path = os.path.join(models_dir, f"minmax_scaler_{timestamp}.pkl")
    with open(scaler_path, 'wb') as f:
        pickle.dump(scaler, f)
    print(f"✓ MinMax Scaler saved: {scaler_path}")

    # Saving SMOTE instance
    smote_path = os.path.join(models_dir, f"smote_instance_{timestamp}.pkl")
    with open(smote_path, 'wb') as f:
        pickle.dump(smote, f)
    print(f"✓ SMOTE instance saved: {smote_path}")

    # Saving feature names and metadata
    import json
    metadata = {
        "timestamp": timestamp,
        "top_10_features": top10_feature_list,
        "all_features": X.columns.tolist(),
        "numeric_features": numeric_cols,
        "categorical_features": categorical_cols,
        "test_size": 0.30,
        "random_state": RANDOM_STATE,
        "lr_best_params": grid.best_params_.copy() if hasattr(grid.best_params_, 'copy') else dict(grid.best_params_),
        "model_performance": {
            "logistic_regression": {
                "accuracy": float(lr_acc),
                "precision": float(lr_prec),
                "recall": float(lr_rec),
                "f1_score": float(lr_f1),
                "roc_auc": float(lr_auc)
            },
            "deep_neural_network": {
                "accuracy": float(dnn_acc),
                "precision": float(dnn_prec),
                "recall": float(dnn_rec),
                "f1_score": float(dnn_f1),
                "roc_auc": float(dnn_auc)
            }
        }
    }

    metadata_path = os.path.join(models_dir, f"model_metadata_{timestamp}.json")
    with open(metadata_path, 'w') as f:
        json.dump(metadata, f, indent=4)
    print(f"✓ Model metadata saved: {metadata_path}")

    print(f"\n✅ All models successfully saved in '{models_dir}/' directory!")
    print(f"\nSaved Files:")
    print(f"  - Logistic Regression: {os.path.basename(lr_model_path)}")
    print(f"  - DNN Model: {os.path.basename(dnn_model_path)}")
    print(f"  - MinMax Scaler: {os.path.basename(scaler_path)}")
    print(f"  - SMOTE: {os.path.basename(smote_path)}")
    print(f"  - Metadata: {os.path.basename(metadata_path)}")

❌ Error: The following variables are not defined: best_lr, dnn_model, scaler, smote, grid, RANDOM_STATE, top10_feature_list, X, numeric_cols, categorical_cols, lr_acc, lr_prec, lr_rec, lr_f1, lr_auc, dnn_acc, dnn_prec, dnn_rec, dnn_f1, dnn_auc
Please make sure to run all the cells above in order before running this cell.
