<a href="https://colab.research.google.com/github/Krisanthi/Customer-Churn-Prediction-Model/blob/main/CM2604_Churn_Prediction_Colab.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# CM2604 Machine Learning Coursework
## Telco Customer Churn Prediction

**Module:** CM2604 Machine Learning  
**RGU Student ID:** 2425596  
**IIT Student ID:** 20232384  
**Student Name:** Krisanthi Segar  

---

### Project Overview
This project aims to predict customer churn in a telecommunications company using machine learning techniques. I implement and compare a **Decision Tree Classifier** (with GridSearchCV hyperparameter tuning) and a **Neural Network** model.

**Dataset:** Telco Customer Churn (Kaggle)  
**Target Variable:** Churn (Yes/No - Binary Classification)  
**Models:** Decision Tree Classifier, Neural Network (MLP)

---
# 1. Setup and Data Loading
---

In [None]:
# Install required packages
!pip install --upgrade scikit-learn imbalanced-learn -q
print("Packages installed successfully!")

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             roc_auc_score, roc_curve, classification_report,
                             confusion_matrix, ConfusionMatrixDisplay)
from scipy.stats import zscore
from imblearn.over_sampling import SMOTE
from collections import Counter

import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, Input
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.callbacks import EarlyStopping

from google.colab import files

# Set seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)

# Set display options
pd.set_option('display.max_columns', None)
plt.style.use('seaborn-v0_8-whitegrid')

print("All libraries imported successfully!")
print(f"TensorFlow version: {tf.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"Pandas version: {pd.__version__}")

In [None]:
# Upload the dataset
print("Please upload the 'WA_Fn-UseC_-Telco-Customer-Churn.csv' file")
uploaded = files.upload()
filename = list(uploaded.keys())[0]
print(f"\nFile '{filename}' uploaded successfully!")

In [None]:
# Load the dataset
df = pd.read_csv(filename)
data = df.copy()

print("=" * 70)
print("DATASET LOADED SUCCESSFULLY")
print("=" * 70)
print(f"\nDataset shape: {data.shape}")
print(f"Number of records: {data.shape[0]:,}")
print(f"Number of features: {data.shape[1]}")

# Display first few rows
print("\n--- First 5 Rows ---")
display(data.head())

---
# TASK 1: Exploratory Data Analysis (EDA)
---

## 1.1 Dataset Overview

In [None]:
# Dataset Information
print("=" * 70)
print("DATASET INFORMATION")
print("=" * 70)
data.info()

print("\n--- Column Names and Data Types ---")
for i, (col, dtype) in enumerate(zip(data.columns, data.dtypes), 1):
    print(f"{i:2}. {col:20} - {dtype}")

In [None]:
# Missing Values Analysis
print("=" * 70)
print("MISSING VALUES ANALYSIS")
print("=" * 70)

# Standard null check
print("\n--- Standard Null Check ---")
null_counts = data.isnull().sum()
print(null_counts[null_counts > 0] if null_counts.sum() > 0 else "No standard null values found")

# TotalCharges special case - contains spaces as blanks
print(f"\nTotalCharges data type (before conversion): {data['TotalCharges'].dtype}")

# Check for empty strings/spaces in TotalCharges
empty_tc = data['TotalCharges'].replace(' ', '').eq('')
print(f"Empty strings/spaces in TotalCharges: {empty_tc.sum()}")

# Convert to numeric to reveal true missing values
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')
missing_tc = data['TotalCharges'].isnull().sum()
print(f"\nAfter converting to numeric:")
print(f"Missing values in TotalCharges: {missing_tc}")
print(f"Percentage of missing: {(missing_tc / len(data)) * 100:.2f}%")

# Show rows with missing TotalCharges
if missing_tc > 0:
    print("\n--- Rows with Missing TotalCharges ---")
    display(data[data['TotalCharges'].isnull()][['customerID', 'tenure', 'MonthlyCharges', 'TotalCharges', 'Churn']])
    print("\nObservation: Missing TotalCharges are for customers with tenure=0 (new customers)")
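
The blank-entry check above can be illustrated on a toy column. This is a minimal sketch, assuming the blanks may be empty strings or runs of spaces; the `raw` series is made-up data, not taken from the Telco file.

```python
import pandas as pd

# Toy column: numbers stored as text, with blank entries similar to TotalCharges
raw = pd.Series(["29.85", " ", "1889.5", "", "  "])

# Strip whitespace first so "", " " and "  " are all treated as blank
is_blank = raw.str.strip().eq("")

# errors="coerce" turns anything that cannot be parsed into NaN
numeric = pd.to_numeric(raw, errors="coerce")

print(f"Blank entries: {is_blank.sum()}")
print(f"NaN after conversion: {numeric.isna().sum()}")
```

Stripping before comparing catches blanks of any width, whereas an exact match on a single space would miss the last entry.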

In [None]:
# Duplicate Analysis
print("=" * 70)
print("DUPLICATE ANALYSIS")
print("=" * 70)
print(f"Duplicate rows in dataset: {data.duplicated().sum()}")
print(f"Duplicate customerIDs: {data['customerID'].duplicated().sum()}")
print(f"Unique customers: {data['customerID'].nunique():,}")

## 1.2 Numerical Features Analysis

In [None]:
# Numerical Features Statistics
print("=" * 70)
print("NUMERICAL FEATURES ANALYSIS")
print("=" * 70)

numerical_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
print("\n--- Statistical Summary ---")
display(data[numerical_cols].describe().round(2))

print("\n--- Key Observations ---")
print(f"\nTenure:")
print(f"  Range: {data['tenure'].min()} to {data['tenure'].max()} months")
print(f"  Mean: {data['tenure'].mean():.1f} months")
print(f"  Median: {data['tenure'].median():.1f} months")

print(f"\nMonthlyCharges:")
print(f"  Range: ${data['MonthlyCharges'].min():.2f} to ${data['MonthlyCharges'].max():.2f}")
print(f"  Mean: ${data['MonthlyCharges'].mean():.2f}")

print(f"\nTotalCharges:")
print(f"  Range: ${data['TotalCharges'].min():.2f} to ${data['TotalCharges'].max():.2f}")
print(f"  Mean: ${data['TotalCharges'].mean():.2f}")

## 1.3 Categorical Features Analysis

In [None]:
# Categorical Features Analysis
print("=" * 70)
print("CATEGORICAL FEATURES ANALYSIS")
print("=" * 70)

categorical_cols = ['gender', 'SeniorCitizen', 'Partner', 'Dependents', 'PhoneService',
                    'MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup',
                    'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
                    'Contract', 'PaperlessBilling', 'PaymentMethod']

print(f"\nNumber of Categorical Features: {len(categorical_cols)}")

# Check unique values per categorical column
print("\n--- Unique Values per Column ---")
for col in categorical_cols:
    unique_vals = data[col].unique()
    print(f"{col}: {len(unique_vals)} unique values - {list(unique_vals)}")

In [None]:
# Value counts for key categorical features
print("\n--- Value Distributions (Key Features) ---")
key_features = ['Contract', 'InternetService', 'PaymentMethod', 'gender']

for col in key_features:
    print(f"\n{col}:")
    vc = data[col].value_counts()
    for val, count in vc.items():
        pct = (count / len(data)) * 100
        print(f"  {val}: {count:,} ({pct:.1f}%)")

## 1.4 Target Variable Analysis

In [None]:
# Target Variable Analysis
print("=" * 70)
print("TARGET VARIABLE ANALYSIS (Churn)")
print("=" * 70)

target_counts = data['Churn'].value_counts()
target_pct = data['Churn'].value_counts(normalize=True) * 100

print(f"\nChurn Distribution:")
print(f"  No (Did not churn):  {target_counts['No']:,} ({target_pct['No']:.2f}%)")
print(f"  Yes (Churned):       {target_counts['Yes']:,} ({target_pct['Yes']:.2f}%)")

imbalance_ratio = target_counts['No'] / target_counts['Yes']
print(f"\nClass Imbalance Ratio: {imbalance_ratio:.2f}:1 (No:Yes)")
print("\n** Dataset exhibits significant class imbalance - SMOTE will be applied **")

---
# Data Visualizations
---

In [None]:
# Figure 1: Target Variable Distribution (Combined Bar + Annotation)
fig, ax = plt.subplots(figsize=(10, 6))

colors = ['#27ae60', '#e74c3c']  # Green for No, Red for Yes
bars = ax.bar(target_counts.index, target_counts.values, color=colors, edgecolor='black', linewidth=1.2)

# Add value labels on bars
for bar, count, pct in zip(bars, target_counts.values, target_pct.values):
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height + 100,
            f'{count:,}\n({pct:.1f}%)',
            ha='center', va='bottom', fontsize=12, fontweight='bold')

ax.set_xlabel('Churn Status', fontsize=12)
ax.set_ylabel('Number of Customers', fontsize=12)
ax.set_title('Figure 1: Customer Churn Distribution', fontsize=14, fontweight='bold')
ax.set_ylim(0, max(target_counts.values) * 1.15)

# Add imbalance ratio annotation
ax.annotate(f'Imbalance Ratio: {imbalance_ratio:.2f}:1',
            xy=(0.95, 0.95), xycoords='axes fraction',
            fontsize=11, ha='right', va='top',
            bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

plt.tight_layout()
plt.savefig('fig1_churn_distribution.png', dpi=300, bbox_inches='tight')
plt.show()
print("Figure 1 saved: fig1_churn_distribution.png")

In [None]:
# Figure 2: Correlation Heatmap
plt.figure(figsize=(10, 8))

corr_matrix = data[numerical_cols].corr()
mask = np.triu(np.ones_like(corr_matrix, dtype=bool), k=1)  # Hide the redundant upper triangle, keep the diagonal

sns.heatmap(corr_matrix, mask=mask, annot=True, cmap='RdYlBu_r', center=0, fmt='.3f',
            square=True, linewidths=0.5, annot_kws={'size': 14, 'weight': 'bold'},
            cbar_kws={'shrink': 0.8})

plt.title('Figure 2: Correlation Heatmap of Numerical Features', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('fig2_correlation_heatmap.png', dpi=300, bbox_inches='tight')
plt.show()

print("Figure 2 saved: fig2_correlation_heatmap.png")
print(f"\nKey Correlations:")
print(f"  Tenure - TotalCharges: {corr_matrix.loc['tenure', 'TotalCharges']:.3f} (Strong positive)")
print(f"  Tenure - MonthlyCharges: {corr_matrix.loc['tenure', 'MonthlyCharges']:.3f}")
print(f"  MonthlyCharges - TotalCharges: {corr_matrix.loc['MonthlyCharges', 'TotalCharges']:.3f}")

In [None]:
# Figure 3: Distribution of Numerical Features (Histograms with KDE)
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
colors = ['#3498db', '#e74c3c', '#2ecc71']

for i, (col, ax, color) in enumerate(zip(numerical_cols, axes, colors)):
    # Histogram with KDE
    data[col].dropna().hist(bins=30, ax=ax, color=color, edgecolor='black', alpha=0.7, density=True)
    data[col].dropna().plot.kde(ax=ax, color='darkblue', linewidth=2)

    # Add mean line
    mean_val = data[col].mean()
    ax.axvline(mean_val, color='red', linestyle='--', linewidth=2, label=f'Mean: {mean_val:.1f}')

    ax.set_xlabel(col, fontsize=11)
    ax.set_ylabel('Density', fontsize=11)
    ax.set_title(f'Distribution of {col}', fontsize=12, fontweight='bold')
    ax.legend(loc='upper right')

plt.suptitle('Figure 3: Distribution of Numerical Features', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('fig3_numerical_distributions.png', dpi=300, bbox_inches='tight')
plt.show()
print("Figure 3 saved: fig3_numerical_distributions.png")

In [None]:
# Figure 4: Box Plots for Outlier Detection
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
colors = ['#3498db', '#e74c3c', '#2ecc71']

for i, (col, ax, color) in enumerate(zip(numerical_cols, axes, colors)):
    bp = ax.boxplot(data[col].dropna(), patch_artist=True, notch=True)
    bp['boxes'][0].set_facecolor(color)
    bp['boxes'][0].set_alpha(0.7)
    bp['medians'][0].set_color('red')
    bp['medians'][0].set_linewidth(2)

    # Calculate outliers
    Q1 = data[col].quantile(0.25)
    Q3 = data[col].quantile(0.75)
    IQR = Q3 - Q1
    outliers = data[(data[col] < Q1 - 1.5*IQR) | (data[col] > Q3 + 1.5*IQR)][col]

    ax.set_ylabel(col, fontsize=11)
    ax.set_title(f'{col}\n(Outliers: {len(outliers)})', fontsize=12, fontweight='bold')
    ax.grid(True, alpha=0.3)

plt.suptitle('Figure 4: Box Plots for Outlier Detection', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('fig4_boxplots.png', dpi=300, bbox_inches='tight')
plt.show()
print("Figure 4 saved: fig4_boxplots.png")

In [None]:
# Figure 5: Churn Rate by Contract Type, Internet Service, and Payment Method
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

analysis_cols = ['Contract', 'InternetService', 'PaymentMethod']

for ax, col in zip(axes, analysis_cols):
    # Calculate churn rate per category
    churn_rate = data.groupby(col)['Churn'].apply(lambda x: (x == 'Yes').mean() * 100)
    churn_rate = churn_rate.sort_values(ascending=True)

    colors = plt.cm.RdYlGn_r(np.linspace(0.2, 0.8, len(churn_rate)))
    bars = ax.barh(churn_rate.index, churn_rate.values, color=colors, edgecolor='black')

    # Add value labels
    for bar, val in zip(bars, churn_rate.values):
        ax.text(val + 1, bar.get_y() + bar.get_height()/2, f'{val:.1f}%',
                va='center', fontsize=10, fontweight='bold')

    ax.set_xlabel('Churn Rate (%)', fontsize=11)
    ax.set_title(f'Churn Rate by {col}', fontsize=12, fontweight='bold')
    ax.set_xlim(0, max(churn_rate.values) * 1.2)

plt.suptitle('Figure 5: Churn Rate by Key Categorical Features', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('fig5_churn_by_categories.png', dpi=300, bbox_inches='tight')
plt.show()
print("Figure 5 saved: fig5_churn_by_categories.png")

In [None]:
# Figure 6: Numerical Features vs Churn (Violin Plots)
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

for i, (col, ax) in enumerate(zip(numerical_cols, axes)):
    sns.violinplot(x='Churn', y=col, data=data, ax=ax, palette=['#27ae60', '#e74c3c'])
    ax.set_xlabel('Churn', fontsize=11)
    ax.set_ylabel(col, fontsize=11)
    ax.set_title(f'{col} by Churn Status', fontsize=12, fontweight='bold')

plt.suptitle('Figure 6: Numerical Features Distribution by Churn Status', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('fig6_violin_plots.png', dpi=300, bbox_inches='tight')
plt.show()
print("Figure 6 saved: fig6_violin_plots.png")

---
# TASK 2A: Corpus Preparation
---

In [None]:
# Reload fresh data for preprocessing
data = df.copy()

print("=" * 70)
print("CORPUS PREPARATION PIPELINE")
print("=" * 70)
print(f"\nInitial dataset shape: {data.shape}")

In [None]:
# Step 1: Convert TotalCharges to numeric
print("\n" + "-" * 50)
print("Step 1: Convert TotalCharges to Numeric")
print("-" * 50)
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')
print(f"Missing values revealed: {data['TotalCharges'].isnull().sum()}")

# Step 2: Impute missing values with median
print("\n" + "-" * 50)
print("Step 2: Impute Missing Values (Median)")
print("-" * 50)
median_val = data['TotalCharges'].median()
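# Assign the result back; in-place fillna on a selected column is deprecated in recent pandas
data['TotalCharges'] = data['TotalCharges'].fillna(median_val)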
print(f"Median used for imputation: ${median_val:.2f}")
print(f"Missing values after imputation: {data['TotalCharges'].isnull().sum()}")

# Step 3: Remove customerID
print("\n" + "-" * 50)
print("Step 3: Remove customerID (Non-predictive)")
print("-" * 50)
data.drop(columns=['customerID'], inplace=True)
print(f"customerID removed. New shape: {data.shape}")

In [None]:
# Step 4: Binary Encoding
print("\n" + "-" * 50)
print("Step 4: Binary Encoding")
print("-" * 50)

binary_maps = {
    'gender': {'Male': 1, 'Female': 0},
    'Partner': {'Yes': 1, 'No': 0},
    'Dependents': {'Yes': 1, 'No': 0},
    'PhoneService': {'Yes': 1, 'No': 0},
    'PaperlessBilling': {'Yes': 1, 'No': 0},
    'Churn': {'Yes': 1, 'No': 0}
}

for col, mapping in binary_maps.items():
    data[col] = data[col].map(mapping)
    print(f"  {col}: {mapping}")

# Step 5: One-Hot Encoding
print("\n" + "-" * 50)
print("Step 5: One-Hot Encoding")
print("-" * 50)

onehot_cols = ['MultipleLines', 'InternetService', 'OnlineSecurity', 'OnlineBackup',
               'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies',
               'Contract', 'PaymentMethod']

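cols_before = len(data.columns)
print(f"Columns before encoding: {cols_before}")
# dtype=int keeps the dummy columns numeric (0/1) rather than boolean
data = pd.get_dummies(data, columns=onehot_cols, drop_first=True, dtype=int)
print(f"Columns after encoding: {len(data.columns)}")
# Dummy columns = all columns minus the ones that were not one-hot encoded
print(f"New columns created: {len(data.columns) - (cols_before - len(onehot_cols))}")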

In [None]:
# Step 6: Outlier Removal
print("\n" + "-" * 50)
print("Step 6: Outlier Removal (Z-score > 3)")
print("-" * 50)

numerical_features = ['tenure', 'MonthlyCharges', 'TotalCharges']
print(f"Rows before outlier removal: {len(data)}")

z_scores = np.abs(zscore(data[numerical_features]))
outlier_mask = (z_scores < 3).all(axis=1)
outliers_removed = len(data) - outlier_mask.sum()
data = data[outlier_mask]

print(f"Rows after outlier removal: {len(data)}")
print(f"Outliers removed: {outliers_removed} ({(outliers_removed/len(outlier_mask))*100:.2f}%)")
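
The z-score filter in Step 6 can be sketched on toy data. The helper name `zscore_keep_mask` and the array `toy` are hypothetical and exist only for this illustration; the standard deviation uses `ddof=0`, which is also the default of `scipy.stats.zscore`.

```python
import numpy as np

def zscore_keep_mask(values, threshold=3.0):
    """Return a boolean mask of rows whose |z-score| is below threshold in every column."""
    values = np.asarray(values, dtype=float)
    mean = values.mean(axis=0)
    std = values.std(axis=0)  # population std (ddof=0)
    z = np.abs((values - mean) / std)
    return (z < threshold).all(axis=1)

# Two evenly spread toy columns, with one extreme value injected into the first row
toy = np.column_stack([np.linspace(40, 60, 200), np.linspace(10, 20, 200)])
toy[0, 0] = 500.0

mask = zscore_keep_mask(toy)
print(f"Rows kept: {mask.sum()} of {len(toy)}")
```

A row is dropped as soon as any one of its columns exceeds the threshold, which is why the mask uses `.all(axis=1)`.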

In [None]:
# Step 7: Feature Scaling
print("\n" + "-" * 50)
print("Step 7: Feature Scaling (StandardScaler)")
print("-" * 50)

X = data.drop(columns=['Churn'])
y = data['Churn']

scaler = StandardScaler()
X[numerical_features] = scaler.fit_transform(X[numerical_features])

print("Numerical features scaled to mean=0, std=1")
print(f"\nScaled feature statistics:")
for col in numerical_features:
    print(f"  {col}: mean={X[col].mean():.4f}, std={X[col].std():.4f}")

In [None]:
# Step 8: SMOTE for Class Balancing
print("\n" + "-" * 50)
print("Step 8: SMOTE for Class Balancing")
print("-" * 50)

print(f"Class distribution BEFORE SMOTE: {Counter(y)}")

smote = SMOTE(random_state=42, k_neighbors=5)
X_balanced, y_balanced = smote.fit_resample(X, y)

print(f"Class distribution AFTER SMOTE: {Counter(y_balanced)}")
print(f"\nSamples added: {len(y_balanced) - len(y)}")
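
To show roughly what SMOTE does when it adds samples, the sketch below interpolates between one minority sample and one of its neighbours. It is a simplified illustration of the idea, not imbalanced-learn's implementation; the function `smote_like_sample` and both toy vectors are assumptions made for the example.

```python
import numpy as np

def smote_like_sample(x, neighbour, rng):
    """Create one synthetic point on the line segment between x and a neighbour."""
    lam = rng.uniform(0.0, 1.0)  # interpolation weight in [0, 1)
    return x + lam * (neighbour - x)

rng = np.random.default_rng(42)
x = np.array([1.0, 0.0])          # minority sample: [scaled numeric value, one-hot flag]
neighbour = np.array([3.0, 1.0])  # one of its minority-class neighbours

synthetic = smote_like_sample(x, neighbour, rng)
print(synthetic)
```

Note that the one-hot flag in the synthetic point can land between 0 and 1. For mixed numeric and categorical features, imbalanced-learn also provides `SMOTENC`, which may be a better fit than plain interpolation.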

In [None]:
# Figure 7: Class Distribution Before and After SMOTE
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Before SMOTE
before_counts = pd.Series(y).value_counts().sort_index()
bars1 = axes[0].bar(['No Churn (0)', 'Churn (1)'], before_counts.values,
                    color=['#27ae60', '#e74c3c'], edgecolor='black')
axes[0].set_title('Before SMOTE', fontsize=12, fontweight='bold')
axes[0].set_ylabel('Count', fontsize=11)
for bar, val in zip(bars1, before_counts.values):
    axes[0].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 50,
                 f'{val:,}', ha='center', fontsize=11, fontweight='bold')

# After SMOTE
after_counts = pd.Series(y_balanced).value_counts().sort_index()
bars2 = axes[1].bar(['No Churn (0)', 'Churn (1)'], after_counts.values,
                    color=['#27ae60', '#e74c3c'], edgecolor='black')
axes[1].set_title('After SMOTE', fontsize=12, fontweight='bold')
axes[1].set_ylabel('Count', fontsize=11)
for bar, val in zip(bars2, after_counts.values):
    axes[1].text(bar.get_x() + bar.get_width()/2, bar.get_height() + 50,
                 f'{val:,}', ha='center', fontsize=11, fontweight='bold')

plt.suptitle('Figure 7: Class Distribution Before and After SMOTE', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('fig7_smote_comparison.png', dpi=300, bbox_inches='tight')
plt.show()
print("Figure 7 saved: fig7_smote_comparison.png")

In [None]:
# Step 9: Train-Test Split
print("\n" + "-" * 50)
print("Step 9: Train-Test Split (80-20, Stratified)")
print("-" * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X_balanced, y_balanced,
    test_size=0.20,
    random_state=42,
    stratify=y_balanced
)

print(f"Training set: X_train={X_train.shape}, y_train={y_train.shape}")
print(f"Testing set:  X_test={X_test.shape}, y_test={y_test.shape}")
print(f"\nTrain set class distribution: {Counter(y_train)}")
print(f"Test set class distribution: {Counter(y_test)}")

print("\n" + "=" * 70)
print("CORPUS PREPARATION COMPLETE!")
print("=" * 70)
print(f"\nFinal feature count: {X_train.shape[1]}")
print(f"Total samples for training: {len(y_train):,}")
print(f"Total samples for testing: {len(y_test):,}")
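
The pipeline above scales and resamples the full dataset before splitting, so the test set contains SMOTE-generated rows and the scaler has seen the test rows. A common alternative ordering is to split first and fit every transformation on the training portion only. The sketch below shows that ordering on hypothetical toy data (`X_toy`, `y_toy`); it is not the configuration used for the results in this notebook.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data standing in for the encoded feature matrix and churn labels
rng = np.random.default_rng(42)
X_toy = rng.normal(loc=10.0, scale=4.0, size=(500, 3))
y_toy = (rng.uniform(size=500) < 0.27).astype(int)  # roughly 27% positives

# 1) Split first, so the held-out rows never influence any fitted statistic
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=42, stratify=y_toy
)

# 2) Fit the scaler on the training rows only, then apply it to both sets
scaler_toy = StandardScaler().fit(X_tr)
X_tr_scaled = scaler_toy.transform(X_tr)
X_te_scaled = scaler_toy.transform(X_te)

# 3) Any oversampling (for example SMOTE) would be applied to
#    (X_tr_scaled, y_tr) only; the test set keeps its original class mix.
print(f"Train shape: {X_tr_scaled.shape}, test shape: {X_te_scaled.shape}")
print(f"Test positive rate: {y_te.mean():.2f}")
```

With this ordering, test metrics describe performance on real, unseen customers at the original class ratio.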

---
# TASK 2B: Model Implementation
---

## Model 1: Decision Tree Classifier with GridSearchCV

In [None]:
# Decision Tree with GridSearchCV
print("=" * 70)
print("DECISION TREE CLASSIFIER WITH GRIDSEARCHCV")
print("=" * 70)

# Define hyperparameter grid
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [5, 10, 15, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None]
}

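# Derive the count from the grid so it stays correct if the grid is edited
total_combinations = int(np.prod([len(values) for values in param_grid.values()]))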
print(f"\nHyperparameter Grid:")
for param, values in param_grid.items():
    print(f"  {param}: {values}")
print(f"\nTotal parameter combinations: {total_combinations}")
print(f"With 5-fold CV: {total_combinations * 5} fits")

print("\nRunning GridSearchCV...")
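
The number of fits reported above follows from enumerating the grid. A minimal sketch with the standard library, repeating the grid under the hypothetical name `grid` so the block is self-contained:

```python
from itertools import product

# Same values as the hyperparameter grid defined in the cell above
grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [5, 10, 15, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None],
}

# Enumerate every combination as a dict of keyword arguments
combos = [dict(zip(grid, values)) for values in product(*grid.values())]
n_folds = 5

print(f"Combinations: {len(combos)}")
print(f"Fits with {n_folds}-fold CV: {len(combos) * n_folds}")
```

Each dictionary in `combos` corresponds to one candidate that the grid search evaluates once per fold.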

In [None]:
# Execute GridSearchCV
dt = DecisionTreeClassifier(random_state=42)
grid_search = GridSearchCV(
    dt, param_grid,
    cv=5,
    scoring='f1',
    n_jobs=-1,
    verbose=1,
    return_train_score=True
)
grid_search.fit(X_train, y_train)

print("\n" + "-" * 50)
print("Best Parameters Found:")
print("-" * 50)
for param, val in grid_search.best_params_.items():
    print(f"  {param}: {val}")
print(f"\nBest Cross-Validation F1-Score: {grid_search.best_score_:.4f}")

In [None]:
# Decision Tree Evaluation with Overfitting Check
best_dt = grid_search.best_estimator_

# Predictions
dt_train_pred = best_dt.predict(X_train)
dt_test_pred = best_dt.predict(X_test)
dt_train_proba = best_dt.predict_proba(X_train)[:, 1]
dt_test_proba = best_dt.predict_proba(X_test)[:, 1]

# Calculate metrics for both train and test
dt_train_accuracy = accuracy_score(y_train, dt_train_pred)
dt_test_accuracy = accuracy_score(y_test, dt_test_pred)
dt_train_f1 = f1_score(y_train, dt_train_pred)
dt_test_f1 = f1_score(y_test, dt_test_pred)
dt_train_auc = roc_auc_score(y_train, dt_train_proba)
dt_test_auc = roc_auc_score(y_test, dt_test_proba)

# Store test metrics
dt_accuracy = dt_test_accuracy
dt_precision = precision_score(y_test, dt_test_pred)
dt_recall = recall_score(y_test, dt_test_pred)
dt_f1 = dt_test_f1
dt_roc_auc = dt_test_auc

print("=" * 70)
print("DECISION TREE EVALUATION RESULTS")
print("=" * 70)

print("\n--- Training vs Test Performance (Overfitting Check) ---")
print(f"{'Metric':<15} {'Training':<12} {'Test':<12} {'Gap':<10}")
print("-" * 50)
print(f"{'Accuracy':<15} {dt_train_accuracy:<12.4f} {dt_test_accuracy:<12.4f} {dt_train_accuracy - dt_test_accuracy:<10.4f}")
print(f"{'F1-Score':<15} {dt_train_f1:<12.4f} {dt_test_f1:<12.4f} {dt_train_f1 - dt_test_f1:<10.4f}")
print(f"{'ROC-AUC':<15} {dt_train_auc:<12.4f} {dt_test_auc:<12.4f} {dt_train_auc - dt_test_auc:<10.4f}")

# Overfitting warning
if (dt_train_accuracy - dt_test_accuracy) > 0.1:
    print("\nWARNING: Potential overfitting detected (Accuracy gap > 10%)")
elif (dt_train_auc - dt_test_auc) > 0.1:
    print("\nWARNING: Potential overfitting detected (AUC gap > 10%)")
else:
    print("\nNo significant overfitting detected")

print("\n--- Test Set Metrics ---")
print(f"  Accuracy:  {dt_accuracy:.4f}")
print(f"  Precision: {dt_precision:.4f}")
print(f"  Recall:    {dt_recall:.4f}")
print(f"  F1-Score:  {dt_f1:.4f}")
print(f"  ROC-AUC:   {dt_roc_auc:.4f}")

print("\n--- Classification Report ---")
print(classification_report(y_test, dt_test_pred, target_names=['No Churn', 'Churn']))

In [None]:
# Figure 8: Decision Tree Confusion Matrix
fig, ax = plt.subplots(figsize=(8, 6))

cm_dt = confusion_matrix(y_test, dt_test_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_dt, display_labels=['No Churn', 'Churn'])
disp.plot(cmap='Blues', ax=ax, values_format='d')

plt.title('Figure 8: Decision Tree Confusion Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('fig8_dt_confusion_matrix.png', dpi=300, bbox_inches='tight')
plt.show()

tn, fp, fn, tp = cm_dt.ravel()
print(f"Figure 8 saved: fig8_dt_confusion_matrix.png")
print(f"\nConfusion Matrix Breakdown:")
print(f"  True Negatives (TN): {tn}")
print(f"  False Positives (FP): {fp}")
print(f"  False Negatives (FN): {fn}")
print(f"  True Positives (TP): {tp}")
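
The test metrics reported earlier can all be derived from the four confusion-matrix counts. The sketch below shows the standard formulas; the counts passed in are made-up numbers for illustration, not results from this notebook.

```python
def metrics_from_counts(tn, fp, fn, tp):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1}

# Hypothetical counts, chosen only to make the arithmetic easy to follow
example = metrics_from_counts(tn=80, fp=20, fn=10, tp=90)
for name, value in example.items():
    print(f"{name}: {value:.4f}")
```

For churn, false negatives (churners predicted as staying) lower recall, while false positives lower precision.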

In [None]:
# Figure 9: Decision Tree Feature Importance
feat_imp = pd.DataFrame({
    'Feature': X_train.columns,
    'Importance': best_dt.feature_importances_
}).sort_values('Importance', ascending=False)

print("--- Top 10 Feature Importances ---")
display(feat_imp.head(10))

# Plot
plt.figure(figsize=(12, 8))
top15 = feat_imp.head(15)
colors = plt.cm.viridis(np.linspace(0, 0.8, len(top15)))
bars = plt.barh(top15['Feature'], top15['Importance'], color=colors, edgecolor='black')

plt.xlabel('Importance Score', fontsize=12)
plt.ylabel('Feature', fontsize=12)
plt.title('Figure 9: Decision Tree - Top 15 Feature Importances', fontsize=14, fontweight='bold')
plt.gca().invert_yaxis()

# Add value labels
for bar, val in zip(bars, top15['Importance']):
    plt.text(val + 0.005, bar.get_y() + bar.get_height()/2,
             f'{val:.3f}', va='center', fontsize=9)

plt.tight_layout()
plt.savefig('fig9_dt_feature_importance.png', dpi=300, bbox_inches='tight')
plt.show()
print("Figure 9 saved: fig9_dt_feature_importance.png")

## Model 2: Neural Network

In [None]:
# Neural Network Model Architecture
print("=" * 70)
print("NEURAL NETWORK MODEL")
print("=" * 70)

print("\nArchitecture Design:")
print("  Input Layer:  {} neurons".format(X_train.shape[1]))
print("  Hidden 1:     128 neurons (ReLU) + Dropout(0.3)")
print("  Hidden 2:     64 neurons (ReLU) + Dropout(0.3)")
print("  Hidden 3:     32 neurons (ReLU) + Dropout(0.2)")
print("  Hidden 4:     16 neurons (ReLU)")
print("  Output:       1 neuron (Sigmoid)")
print("\nOptimizer: Adam (lr=0.001)")
print("Loss: Binary Crossentropy")

nn_model = Sequential([
    Input(shape=(X_train.shape[1],)),
    Dense(128, activation='relu'),
    Dropout(0.3),
    Dense(64, activation='relu'),
    Dropout(0.3),
    Dense(32, activation='relu'),
    Dropout(0.2),
    Dense(16, activation='relu'),
    Dense(1, activation='sigmoid')
])

nn_model.compile(
    optimizer=Adam(learning_rate=0.001),
    loss='binary_crossentropy',
    metrics=['accuracy']
)

print("\n--- Model Summary ---")
nn_model.summary()
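
The parameter counts shown by the model summary can be checked by hand: each Dense layer has `inputs * units` weights plus `units` biases, and Dropout layers add none. The sketch below assumes 30 input features, which is an assumption for illustration; the real value is `X_train.shape[1]`.

```python
def dense_param_count(n_inputs, layer_sizes):
    """Total trainable parameters of a stack of fully connected layers."""
    total = 0
    fan_in = n_inputs
    for units in layer_sizes:
        total += fan_in * units + units  # weights plus biases
        fan_in = units
    return total

n_features = 30  # assumed input width, replace with X_train.shape[1]
total_params = dense_param_count(n_features, [128, 64, 32, 16, 1])
print(f"Trainable parameters: {total_params:,}")
```

Most of the parameters sit in the first two hidden layers, so those are the ones to shrink first if the network overfits.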

In [None]:
# Train Neural Network with Early Stopping
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True,
    verbose=1
)

print("\nTraining Neural Network...")
print("(Early stopping enabled: patience=10)\n")

history = nn_model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=32,
    validation_split=0.2,
    callbacks=[early_stop],
    verbose=1
)

print("\nTraining complete!")
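
The early-stopping rule used above can be sketched in plain Python: training stops once the validation loss has failed to improve for `patience` consecutive epochs, and the weights from the best epoch are kept. This is a simplified illustration that ignores options such as `min_delta`; the function name and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience):
    """Return (stop_epoch, best_epoch), both 0-indexed, for a list of validation losses."""
    best_loss = float('inf')
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            # Improvement: remember this epoch and reset the counter
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    # Ran out of epochs without triggering the stop
    return len(val_losses) - 1, best_epoch

losses = [0.60, 0.50, 0.45, 0.46, 0.47, 0.44, 0.48, 0.49, 0.50]
stop, best = early_stopping_epoch(losses, patience=3)
print(f"Stopped at epoch {stop}, best epoch {best}")
```

With `restore_best_weights=True`, the model evaluated afterwards corresponds to the best epoch rather than the last one trained.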

In [None]:
# Neural Network Evaluation with Overfitting Check
# Predictions
nn_train_proba = nn_model.predict(X_train, verbose=0).ravel()
nn_test_proba = nn_model.predict(X_test, verbose=0).ravel()
nn_train_pred = (nn_train_proba >= 0.5).astype(int)
nn_test_pred = (nn_test_proba >= 0.5).astype(int)

# Calculate metrics for both train and test
nn_train_accuracy = accuracy_score(y_train, nn_train_pred)
nn_test_accuracy = accuracy_score(y_test, nn_test_pred)
nn_train_f1 = f1_score(y_train, nn_train_pred)
nn_test_f1 = f1_score(y_test, nn_test_pred)
nn_train_auc = roc_auc_score(y_train, nn_train_proba)
nn_test_auc = roc_auc_score(y_test, nn_test_proba)

# Store test metrics
nn_accuracy = nn_test_accuracy
nn_precision = precision_score(y_test, nn_test_pred)
nn_recall = recall_score(y_test, nn_test_pred)
nn_f1 = nn_test_f1
nn_roc_auc = nn_test_auc

print("=" * 70)
print("NEURAL NETWORK EVALUATION RESULTS")
print("=" * 70)

print("\n--- Training vs Test Performance (Overfitting Check) ---")
print(f"{'Metric':<15} {'Training':<12} {'Test':<12} {'Gap':<10}")
print("-" * 50)
print(f"{'Accuracy':<15} {nn_train_accuracy:<12.4f} {nn_test_accuracy:<12.4f} {nn_train_accuracy - nn_test_accuracy:<10.4f}")
print(f"{'F1-Score':<15} {nn_train_f1:<12.4f} {nn_test_f1:<12.4f} {nn_train_f1 - nn_test_f1:<10.4f}")
print(f"{'ROC-AUC':<15} {nn_train_auc:<12.4f} {nn_test_auc:<12.4f} {nn_train_auc - nn_test_auc:<10.4f}")

# Overfitting warning
if (nn_train_accuracy - nn_test_accuracy) > 0.1:
    print("\nWARNING: Potential overfitting detected (Accuracy gap > 10%)")
elif (nn_train_auc - nn_test_auc) > 0.1:
    print("\nWARNING: Potential overfitting detected (AUC gap > 10%)")
else:
    print("\nNo significant overfitting detected")

print("\n--- Test Set Metrics ---")
print(f"  Accuracy:  {nn_accuracy:.4f}")
print(f"  Precision: {nn_precision:.4f}")
print(f"  Recall:    {nn_recall:.4f}")
print(f"  F1-Score:  {nn_f1:.4f}")
print(f"  ROC-AUC:   {nn_roc_auc:.4f}")

print("\n--- Classification Report ---")
print(classification_report(y_test, nn_test_pred, target_names=['No Churn', 'Churn']))

In [None]:
# Figure 10: Neural Network Confusion Matrix
fig, ax = plt.subplots(figsize=(8, 6))

cm_nn = confusion_matrix(y_test, nn_test_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm_nn, display_labels=['No Churn', 'Churn'])
disp.plot(cmap='Oranges', ax=ax, values_format='d')

plt.title('Figure 10: Neural Network Confusion Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('fig10_nn_confusion_matrix.png', dpi=300, bbox_inches='tight')
plt.show()

tn, fp, fn, tp = cm_nn.ravel()
print(f"Figure 10 saved: fig10_nn_confusion_matrix.png")
print(f"\nConfusion Matrix Breakdown:")
print(f"  True Negatives (TN): {tn}")
print(f"  False Positives (FP): {fp}")
print(f"  False Negatives (FN): {fn}")
print(f"  True Positives (TP): {tp}")

In [None]:
# Figure 11: Neural Network Training History
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Loss Plot
axes[0].plot(history.history['loss'], label='Training Loss', color='blue', linewidth=2)
axes[0].plot(history.history['val_loss'], label='Validation Loss', color='red', linewidth=2)
axes[0].set_xlabel('Epoch', fontsize=11)
axes[0].set_ylabel('Loss', fontsize=11)
axes[0].set_title('Training and Validation Loss', fontsize=12, fontweight='bold')
axes[0].legend(loc='upper right')
axes[0].grid(True, alpha=0.3)

# Accuracy Plot
axes[1].plot(history.history['accuracy'], label='Training Accuracy', color='blue', linewidth=2)
axes[1].plot(history.history['val_accuracy'], label='Validation Accuracy', color='red', linewidth=2)
axes[1].set_xlabel('Epoch', fontsize=11)
axes[1].set_ylabel('Accuracy', fontsize=11)
axes[1].set_title('Training and Validation Accuracy', fontsize=12, fontweight='bold')
axes[1].legend(loc='lower right')
axes[1].grid(True, alpha=0.3)

plt.suptitle('Figure 11: Neural Network Training History', fontsize=14, fontweight='bold', y=1.02)
plt.tight_layout()
plt.savefig('fig11_nn_training_history.png', dpi=300, bbox_inches='tight')
plt.show()
print("Figure 11 saved: fig11_nn_training_history.png")

---
# Model Comparison
---

In [None]:
# Figure 12: ROC Curve Comparison
plt.figure(figsize=(10, 8))

# Calculate ROC curves
dt_fpr, dt_tpr, _ = roc_curve(y_test, dt_test_proba)
nn_fpr, nn_tpr, _ = roc_curve(y_test, nn_test_proba)

# Plot ROC curves
plt.plot(dt_fpr, dt_tpr, label=f'Decision Tree (AUC = {dt_roc_auc:.4f})',
         color='blue', linewidth=2.5)
plt.plot(nn_fpr, nn_tpr, label=f'Neural Network (AUC = {nn_roc_auc:.4f})',
         color='orange', linewidth=2.5)
plt.plot([0, 1], [0, 1], 'k--', label='Random Classifier (AUC = 0.5)', linewidth=1.5, alpha=0.7)

# Fill areas
plt.fill_between(dt_fpr, dt_tpr, alpha=0.1, color='blue')
plt.fill_between(nn_fpr, nn_tpr, alpha=0.1, color='orange')

plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('Figure 12: ROC Curve Comparison', fontsize=14, fontweight='bold')
plt.legend(loc='lower right', fontsize=11)
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig('fig12_roc_comparison.png', dpi=300, bbox_inches='tight')
plt.show()
print("Figure 12 saved: fig12_roc_comparison.png")
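
One way to read the AUC values in Figure 12: AUC equals the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner, with ties counted as half. The sketch below computes it by direct pairwise comparison on a tiny made-up example; it is meant to illustrate the definition, not to replace `roc_auc_score`.

```python
def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted probabilities
y_example = [0, 0, 1, 1]
s_example = [0.1, 0.4, 0.35, 0.8]
auc_example = pairwise_auc(y_example, s_example)
print(f"AUC: {auc_example:.2f}")
```

Because it depends only on the ranking of scores, AUC is unaffected by the 0.5 threshold used for the other metrics.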

In [None]:
# Model Comparison Summary
print("=" * 70)
print("MODEL COMPARISON SUMMARY")
print("=" * 70)

comparison_df = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC'],
    'Decision Tree': [dt_accuracy, dt_precision, dt_recall, dt_f1, dt_roc_auc],
    'Neural Network': [nn_accuracy, nn_precision, nn_recall, nn_f1, nn_roc_auc]
})
comparison_df['Difference'] = comparison_df['Neural Network'] - comparison_df['Decision Tree']
comparison_df['Better Model'] = comparison_df.apply(
    lambda row: 'Neural Network' if row['Difference'] > 0 else 'Decision Tree', axis=1
)

print("\n")
display(comparison_df.round(4))

In [None]:
# Figure 13: Model Performance Comparison Bar Chart
fig, ax = plt.subplots(figsize=(12, 7))

metrics = ['Accuracy', 'Precision', 'Recall', 'F1-Score', 'ROC-AUC']
dt_values = [dt_accuracy, dt_precision, dt_recall, dt_f1, dt_roc_auc]
nn_values = [nn_accuracy, nn_precision, nn_recall, nn_f1, nn_roc_auc]

x = np.arange(len(metrics))
width = 0.35

bars1 = ax.bar(x - width/2, dt_values, width, label='Decision Tree',
               color='#3498db', edgecolor='black', linewidth=1.2)
bars2 = ax.bar(x + width/2, nn_values, width, label='Neural Network',
               color='#e74c3c', edgecolor='black', linewidth=1.2)

# Add value labels on bars
def add_labels(bars):
    for bar in bars:
        height = bar.get_height()
        ax.annotate(f'{height:.3f}',
                    xy=(bar.get_x() + bar.get_width() / 2, height),
                    xytext=(0, 3),
                    textcoords="offset points",
                    ha='center', va='bottom', fontsize=10, fontweight='bold')

add_labels(bars1)
add_labels(bars2)

ax.set_xlabel('Metric', fontsize=12)
ax.set_ylabel('Score', fontsize=12)
ax.set_title('Figure 13: Model Performance Comparison', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels(metrics, fontsize=11)
ax.legend(loc='lower right', fontsize=11)
ax.set_ylim(0, 1.1)
ax.grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.savefig('fig13_model_comparison.png', dpi=300, bbox_inches='tight')
plt.show()
print("Figure 13 saved: fig13_model_comparison.png")

In [None]:
# Final Comparison Analysis
print("\n" + "=" * 70)
print("FINAL ANALYSIS")
print("=" * 70)

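# Determine winner (metrics with equal scores count for neither model)
nn_wins = sum(1 for dt, nn in zip(dt_values, nn_values) if nn > dt)
dt_wins = sum(1 for dt, nn in zip(dt_values, nn_values) if dt > nn)

# An even split of wins falls back to the Decision Tree
if nn_wins > dt_wins:
    winner, winner_f1, other_f1 = "Neural Network", nn_f1, dt_f1
else:
    winner, winner_f1, other_f1 = "Decision Tree", dt_f1, nn_f1

# Signed relative F1 change of the winner over the other model;
# a negative value means the winner by metric count has the lower F1
f1_improvement = ((winner_f1 - other_f1) / other_f1) * 100 if other_f1 > 0 else float('nan')

print(f"\nMetrics won by Neural Network: {nn_wins}/{len(dt_values)}")
print(f"Metrics won by Decision Tree: {dt_wins}/{len(dt_values)}")
print(f"\nMost Suitable: {winner}")
print(f"\nF1-Score Change ({winner} vs. other model): {f1_improvement:+.1f}%")
print(f"ROC-AUC Difference: {abs(nn_roc_auc - dt_roc_auc):.4f}")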
print(f"ROC-AUC Difference: {abs(nn_roc_auc - dt_roc_auc):.4f}")

print("\n" + "-" * 70)
print("RECOMMENDATION:")
print("-" * 70)
print(f"The {winner} model is recommended for deployment because it scores higher")
print(f"on {max(nn_wins, dt_wins)} of the {len(dt_values)} evaluation metrics.")

---
# TASK 3: AI Ethics - 10%
---

## Development Phase Strategies

### 1. Data Privacy
- **customerID Removal:** The unique customer identifier was removed to eliminate Personally Identifiable Information (PII) from the training data
- **No Direct Identifiers:** The model does not use names, addresses, or contact information
- **Anonymized Features:** The remaining features describe services, billing, and account behavior, plus coarse demographic attributes such as gender; none of them identifies an individual on its own

### 2. Fairness Considerations
- **Gender Feature Analysis:** The gender feature was included but monitored for potential bias
- **Class Imbalance Handling:** SMOTE was applied so that the minority class (churners) is adequately represented during training
- **Feature Importance Analysis:** Feature importances were reviewed to identify potentially biased predictors
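
One way to make the gender monitoring above concrete is to compare recall and precision per group on the test set. The sketch below is a minimal illustration under stated assumptions: the toy data frame and the column names `gender`, `y_true`, and `y_pred` are hypothetical and are not the notebook's actual variables or results.

```python
import pandas as pd

def group_rates(frame, group_col, true_col="y_true", pred_col="y_pred"):
    """Return recall and precision for each value of group_col."""
    rows = []
    for group, part in frame.groupby(group_col):
        tp = int(((part[true_col] == 1) & (part[pred_col] == 1)).sum())
        fn = int(((part[true_col] == 1) & (part[pred_col] == 0)).sum())
        fp = int(((part[true_col] == 0) & (part[pred_col] == 1)).sum())
        # Guard against empty denominators in very small groups
        recall = tp / (tp + fn) if (tp + fn) else float("nan")
        precision = tp / (tp + fp) if (tp + fp) else float("nan")
        rows.append({group_col: group, "n": len(part),
                     "recall": recall, "precision": precision})
    return pd.DataFrame(rows).set_index(group_col)

# Hypothetical toy predictions (illustration only)
toy = pd.DataFrame({
    "gender": ["Female"] * 4 + ["Male"] * 4,
    "y_true": [1, 1, 0, 0, 1, 1, 0, 0],
    "y_pred": [1, 0, 0, 0, 1, 1, 1, 0],
})
rates = group_rates(toy, "gender")
print(rates)
```

A large gap in recall between groups would suggest that churners in one group are missed more often than in the other, which is worth investigating before deployment.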

### 3. Transparency
- **Decision Tree Interpretability:** The Decision Tree model provides clear, interpretable decision rules
- **Documentation:** All preprocessing steps and model decisions are documented
- **Evaluation Metrics:** Multiple metrics (Accuracy, Precision, Recall, F1, AUC) provide comprehensive model assessment
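
The interpretability point can be illustrated by printing a fitted tree's rules as text with scikit-learn's `export_text`. This is a sketch on synthetic data: the feature names, the labelling rule, and the `demo_tree` model are hypothetical stand-ins, not the tuned Decision Tree from this notebook.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(42)
# Hypothetical features: tenure in months and monthly charges
X_demo = np.column_stack([rng.integers(0, 72, 200), rng.uniform(20, 120, 200)])
# Synthetic rule for illustration only: customers with short tenure churn
y_demo = (X_demo[:, 0] < 12).astype(int)

demo_tree = DecisionTreeClassifier(max_depth=2, random_state=42).fit(X_demo, y_demo)

# Render the learned splits as human-readable if/else rules
rules = export_text(demo_tree, feature_names=["tenure", "MonthlyCharges"])
print(rules)
```

The same call could be applied to the fitted model in this notebook by passing its training column names, which would let a retention analyst read the exact thresholds behind a prediction.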

## Post-Deployment Strategies

### 1. Performance Monitoring
- **Monthly KPI Tracking:** Monitor accuracy, F1-score, and AUC on new data monthly
- **Drift Detection:** Implement statistical tests to detect feature distribution changes
- **Alert Systems:** Automated alerts when performance drops below thresholds
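
The drift detection and alerting ideas above can be sketched as follows. The example uses a two-sample Kolmogorov-Smirnov test from SciPy as one possible statistical test; the synthetic charge values, the significance level, and the alert thresholds are all assumptions chosen for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(reference, current, alpha=0.01):
    """Two-sample KS test; returns (drifted, statistic, p_value)."""
    statistic, p_value = ks_2samp(reference, current)
    return bool(p_value < alpha), statistic, p_value

def below_threshold(metrics, thresholds):
    """Return the names of metrics that fell below their alert threshold."""
    return [name for name, floor in thresholds.items()
            if metrics.get(name, 0.0) < floor]

rng = np.random.default_rng(0)
train_charges = rng.normal(65, 30, 2000)   # hypothetical training-time values
same_month = rng.normal(65, 30, 2000)      # new data, same distribution
shifted_month = rng.normal(80, 30, 2000)   # new data after a shift in charges

drift_same, _, _ = detect_drift(train_charges, same_month)
drift_shifted, _, _ = detect_drift(train_charges, shifted_month)
print(f"Same distribution flagged: {drift_same}")
print(f"Shifted distribution flagged: {drift_shifted}")

# Hypothetical monthly metrics and alert floors
alerts = below_threshold({"f1": 0.55, "roc_auc": 0.84},
                         {"f1": 0.60, "roc_auc": 0.80})
print(f"Metrics below threshold: {alerts}")
```

Running one test per feature every month increases the chance of false alarms, so a stricter `alpha` or a correction for multiple comparisons may be appropriate in practice.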

### 2. Bias Detection
- **Quarterly Fairness Audits:** Regular analysis of predictions across demographic groups
- **Disparate Impact Analysis:** Monitor for unequal prediction rates across protected classes
- **Feedback Loops:** Incorporate customer feedback on prediction fairness
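
A disparate impact check can be sketched as the ratio of positive-prediction rates between groups. The data below is hypothetical, as are the column names `gender` and `pred_churn`; the 0.8 and 1.25 bounds follow the commonly cited four-fifths rule of thumb and are an assumption here, not thresholds defined in this project.

```python
import pandas as pd

def disparate_impact(frame, group_col, pred_col, reference):
    """Ratio of each group's positive-prediction rate to the reference group's rate."""
    rates = frame.groupby(group_col)[pred_col].mean()
    return rates / rates[reference]

# Hypothetical predictions: 1 = predicted churn, 0 = predicted stay
toy_preds = pd.DataFrame({
    "gender": ["Female"] * 10 + ["Male"] * 10,
    "pred_churn": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0] + [1] * 8 + [0] * 2,
})

ratios = disparate_impact(toy_preds, "gender", "pred_churn", reference="Female")
# Flag groups whose ratio falls outside the assumed four-fifths band
flagged = ratios[(ratios < 0.8) | (ratios > 1.25)]
print(ratios)
print(f"Groups outside the 0.8 to 1.25 band: {list(flagged.index)}")
```

Whether a higher predicted churn rate is harmful depends on what the prediction triggers, for example a retention offer versus a service restriction, so a flagged ratio is a prompt for review and not proof of unfair treatment.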

### 3. Model Retraining
- **Quarterly Updates:** Retrain models with new customer data every quarter
- **A/B Testing:** Compare new model versions against current production model
- **Version Control:** Maintain full history of model versions and their performance

### 4. Human Oversight
- **Retention Team Review:** All high-risk churn predictions reviewed by retention specialists
- **Override Capability:** Human agents can override model predictions when appropriate
- **Escalation Protocols:** Clear procedures for handling edge cases and complaints

In [None]:
# Download all figures
print("=" * 70)
print("DOWNLOADING ALL FIGURES")
print("=" * 70)

figures = [
    'fig1_churn_distribution.png',
    'fig2_correlation_heatmap.png',
    'fig3_numerical_distributions.png',
    'fig4_boxplots.png',
    'fig5_churn_by_categories.png',
    'fig6_violin_plots.png',
    'fig7_smote_comparison.png',
    'fig8_dt_confusion_matrix.png',
    'fig9_dt_feature_importance.png',
    'fig10_nn_confusion_matrix.png',
    'fig11_nn_training_history.png',
    'fig12_roc_comparison.png',
    'fig13_model_comparison.png'
]

print("\nDownloading figures...")
# Use fig_name so the loop does not overwrite the matplotlib `fig` object
for fig_name in figures:
    try:
        files.download(fig_name)
        print(f"  ✓ {fig_name}")
    except Exception as e:
        print(f"  ✗ {fig_name} - Error: {e}")