# Mini Project: Customer Churn Prediction

This notebook predicts customer churn on a telecom dataset and compares several machine learning models.

**Student:** Luke Johnson  
**Date:** January 2025  
**Course:** MLE Mini Project

## Overview

In this project, I'll work with a synthetic telecom customer churn dataset to predict which customers are likely to leave the service. This is a classic binary classification problem with imbalanced classes: customers who churn are the minority.

**Key Objectives:**
- Explore and understand the churn dataset
- Handle class imbalance
- Build and evaluate multiple ML models
- Compare performance across different algorithms
- Optimize the best performing model
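
To make the class imbalance objective concrete, here is a minimal sketch that is separate from the project pipeline. It uses made-up labels with an assumed 25% churn rate (not a figure from this dataset) and shows that always predicting "no churn" gives decent accuracy while catching no churners at all.

```python
import numpy as np

# Hypothetical labels for illustration only: 1 = churn, 0 = no churn.
# The 25% churn rate is an assumption, not a figure from this project.
y_true = np.array([1] * 250 + [0] * 750)

# A trivial "model" that always predicts the majority class (no churn).
y_pred = np.zeros_like(y_true)

# Accuracy: share of labels predicted correctly.
accuracy = np.mean(y_pred == y_true)

# Recall: share of actual churners that were flagged as churners.
true_positives = np.sum((y_pred == 1) & (y_true == 1))
recall = true_positives / np.sum(y_true == 1)

print(f"Accuracy: {accuracy:.2f}")  # Accuracy: 0.75
print(f"Recall:   {recall:.2f}")    # Recall:   0.00
```

This is why the evaluation should look at recall, F1, and ROC AUC alongside accuracy.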

In [1]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler
import warnings
warnings.filterwarnings('ignore')

# Set style for better plots
plt.style.use('seaborn-v0_8')
sns.set_palette("husl")
np.random.seed(42)

## 1. Data Loading and Initial Exploration

Let's start by building our dataset and getting a basic understanding of what we're working with.

In [2]:
# Build the dataset
# No file is loaded here: this cell generates a synthetic telecom churn dataset
np.random.seed(42)

# Generate synthetic telecom churn data
n_samples = 7043

# Customer demographics
customer_ids = [f'CUST_{i:06d}' for i in range(1, n_samples + 1)]
genders = np.random.choice(['Male', 'Female'], n_samples)
senior_citizens = np.random.choice([0, 1], n_samples, p=[0.8, 0.2])
partners = np.random.choice(['Yes', 'No'], n_samples, p=[0.5, 0.5])
dependents = np.random.choice(['Yes', 'No'], n_samples, p=[0.3, 0.7])

# Service features
tenures = np.random.randint(1, 73, n_samples)  # 1-72 months
phone_services = np.random.choice(['Yes', 'No'], n_samples, p=[0.9, 0.1])
internet_services = np.random.choice(['DSL', 'Fiber optic', 'No'], n_samples, p=[0.4, 0.4, 0.2])

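# Additional services
# Note: these are drawn independently of InternetService, so 'No internet service'
# can appear for customers who have DSL or fiber (a known simplification)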
online_security = np.random.choice(['Yes', 'No', 'No internet service'], n_samples, p=[0.3, 0.4, 0.3])
online_backup = np.random.choice(['Yes', 'No', 'No internet service'], n_samples, p=[0.3, 0.4, 0.3])
device_protection = np.random.choice(['Yes', 'No', 'No internet service'], n_samples, p=[0.3, 0.4, 0.3])
tech_support = np.random.choice(['Yes', 'No', 'No internet service'], n_samples, p=[0.3, 0.4, 0.3])
streaming_tv = np.random.choice(['Yes', 'No', 'No internet service'], n_samples, p=[0.4, 0.3, 0.3])
streaming_movies = np.random.choice(['Yes', 'No', 'No internet service'], n_samples, p=[0.4, 0.3, 0.3])

# Contract and billing
contracts = np.random.choice(['Month-to-month', 'One year', 'Two year'], n_samples, p=[0.6, 0.25, 0.15])
paperless_billing = np.random.choice(['Yes', 'No'], n_samples, p=[0.6, 0.4])
payment_methods = np.random.choice(['Electronic check', 'Mailed check', 'Bank transfer (automatic)', 'Credit card (automatic)'], 
                                   n_samples, p=[0.3, 0.2, 0.25, 0.25])

# Financial features
monthly_charges = np.random.uniform(20, 120, n_samples)
# Clip at zero: the noise term can otherwise push short-tenure totals below 0
total_charges = np.clip(monthly_charges * tenures + np.random.uniform(-50, 100, n_samples), 0, None)

# Create churn target (with realistic patterns)
churn_probs = np.zeros(n_samples)
for i in range(n_samples):
    base_prob = 0.2
    
    # Higher churn for month-to-month contracts
    if contracts[i] == 'Month-to-month':
        base_prob += 0.15
    elif contracts[i] == 'One year':
        base_prob += 0.05
    
    # Higher churn for higher monthly charges
    if monthly_charges[i] > 80:
        base_prob += 0.1
    
    # Higher churn for shorter tenures
    if tenures[i] < 12:
        base_prob += 0.1
    
    # Higher churn for electronic check payments
    if payment_methods[i] == 'Electronic check':
        base_prob += 0.05
    
    churn_probs[i] = min(base_prob, 0.6)

churns = np.random.binomial(1, churn_probs, n_samples)
churn_labels = ['No' if x == 0 else 'Yes' for x in churns]

# Create DataFrame
data = pd.DataFrame({
    'customerID': customer_ids,
    'gender': genders,
    'SeniorCitizen': senior_citizens,
    'Partner': partners,
    'Dependents': dependents,
    'tenure': tenures,
    'PhoneService': phone_services,
    'InternetService': internet_services,
    'OnlineSecurity': online_security,
    'OnlineBackup': online_backup,
    'DeviceProtection': device_protection,
    'TechSupport': tech_support,
    'StreamingTV': streaming_tv,
    'StreamingMovies': streaming_movies,
    'Contract': contracts,
    'PaperlessBilling': paperless_billing,
    'PaymentMethod': payment_methods,
    'MonthlyCharges': monthly_charges,
    'TotalCharges': total_charges,
    'Churn': churn_labels
})

print(f"Dataset shape: {data.shape}")
print(f"Columns: {list(data.columns)}")

Dataset shape: (7043, 20)
Columns: ['customerID', 'gender', 'SeniorCitizen', 'Partner', 'Dependents', 'tenure', 'PhoneService', 'InternetService', 'OnlineSecurity', 'OnlineBackup', 'DeviceProtection', 'TechSupport', 'StreamingTV', 'StreamingMovies', 'Contract', 'PaperlessBilling', 'PaymentMethod', 'MonthlyCharges', 'TotalCharges', 'Churn']
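
The row-by-row loop that builds `churn_probs` can also be written with boolean masks. The sketch below is an alternative, not the code used in the cell above. `churn_probability` is a hypothetical helper name; it applies the same base probability, increments, and 0.6 cap as the loop.

```python
import numpy as np

def churn_probability(contracts, monthly_charges, tenures, payment_methods):
    """Vectorized version of the churn rules (hypothetical helper)."""
    contracts = np.asarray(contracts)
    payment_methods = np.asarray(payment_methods)
    monthly_charges = np.asarray(monthly_charges, dtype=float)
    tenures = np.asarray(tenures)

    # Start every customer at the base probability, then add each increment
    # wherever its condition holds (a boolean mask acts as 0 or 1).
    prob = np.full(contracts.shape, 0.2)
    prob += 0.15 * (contracts == 'Month-to-month')
    prob += 0.05 * (contracts == 'One year')
    prob += 0.10 * (monthly_charges > 80)
    prob += 0.10 * (tenures < 12)
    prob += 0.05 * (payment_methods == 'Electronic check')

    # Cap at 0.6, as in the loop.
    return np.minimum(prob, 0.6)

# Two made-up customers: one hitting every risk rule, one hitting none.
probs = churn_probability(
    ['Month-to-month', 'Two year'],
    [95.0, 30.0],
    [3, 60],
    ['Electronic check', 'Mailed check'],
)
print(probs)
```

The first customer lands on the 0.6 cap and the second stays at the 0.2 base rate. On 7,043 rows either approach is fast; the masked version is mainly easier to read and to extend with new rules.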
