# Customer Churn Prediction Model - Complete Guide

This notebook walks step by step through building a customer churn prediction model for a bank, from loading the data to saving the trained model for deployment.

## Table of Contents
1. [Data Loading and Initial Exploration](#data-loading)
2. [Data Preprocessing](#preprocessing)
3. [Feature Engineering](#feature-engineering)
4. [Data Visualization](#visualization)
5. [Model Building and Training](#model-building)
6. [Model Evaluation](#model-evaluation)
7. [Feature Importance Analysis](#feature-importance)
8. [Model Deployment Preparation](#deployment)

<a id='data-loading'></a>
## 1. Data Loading and Initial Exploration

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_auc_score, roc_curve
from sklearn.utils import resample
import warnings
warnings.filterwarnings('ignore')

# Load the dataset
df = pd.read_csv('botswana_bank_customer_churn.csv')

# Display basic information about the dataset
print("Dataset Shape:", df.shape)
print("\nDataset Info:")
df.info()  # info() prints its report directly and returns None, so it is not wrapped in print()
print("\nFirst 5 rows:")
df.head()

In [None]:
# Check for missing values
print("Missing Values:")
print(df.isnull().sum())

# Check for duplicate rows
print("\nDuplicate Rows:", df.duplicated().sum())

# Basic statistics
print("\nBasic Statistics:")
df.describe()
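
Raw missing-value counts are often easier to judge as a share of all rows. The following is a minimal sketch on a small made-up frame; the `toy` data and the `missing_summary` helper are illustrative assumptions and not part of the dataset or the original notebook.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, largest first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)

# Hypothetical toy data standing in for the real dataset
toy = pd.DataFrame({
    "Age": [35, np.nan, 42, 51],
    "Balance": [0.0, 1200.5, np.nan, np.nan],
    "Exited": [0, 1, 0, 0],
})
summary = missing_summary(toy)
print(summary)
```

On the real data, the same helper could be called as `missing_summary(df)`.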

<a id='preprocessing'></a>
## 2. Data Preprocessing

In [None]:
# Handle missing values
# Assign the filled column back instead of calling fillna(inplace=True) on a
# column selection, which is chained assignment and deprecated in recent pandas.

# CreditScore, Age, and Balance are continuous, so fill with the median
for col in ['CreditScore', 'Age', 'Balance']:
    df[col] = df[col].fillna(df[col].median())

# NumOfProducts, HasCrCard, and IsActiveMember are discrete, so fill with the mode
for col in ['NumOfProducts', 'HasCrCard', 'IsActiveMember']:
    df[col] = df[col].fillna(df[col].mode()[0])

# Check missing values again
print("Missing Values After Imputation:")
print(df.isnull().sum())
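
The cell above computes fill values from the full dataset. A common alternative, shown here only as a sketch, is to learn the fill values from the training rows and reuse them on the test rows, so that no test information enters preprocessing. The `train` and `test` frames are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy split; in practice these would come from train_test_split
train = pd.DataFrame({"Age": [30.0, 40.0, np.nan, 50.0]})
test = pd.DataFrame({"Age": [np.nan, 25.0]})

imputer = SimpleImputer(strategy="median")
train_filled = imputer.fit_transform(train)  # learns the median from the training rows only
test_filled = imputer.transform(test)        # reuses the training median on the test rows

print("Learned median:", imputer.statistics_[0])
print("Imputed test values:", test_filled.ravel())
```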

In [None]:
# Handle outliers using IQR method for numerical columns
def remove_outliers_iqr(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

# Apply outlier removal to numerical columns
numerical_columns = ['CreditScore', 'Age', 'Balance', 'EstimatedSalary']
df_clean = df.copy()

for col in numerical_columns:
    df_clean = remove_outliers_iqr(df_clean, col)
    
print(f"Original dataset size: {len(df)}")
print(f"Dataset size after outlier removal: {len(df_clean)}")
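
Dropping rows shrinks the dataset each time a column is filtered. If losing rows is a concern, one option is to clip extreme values to the IQR bounds instead. This is a sketch of that alternative, not the approach used in this notebook; `clip_outliers_iqr` and the `ages` values are hypothetical.

```python
import pandas as pd

def clip_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] instead of dropping rows."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Made-up ages with one extreme value
ages = pd.Series([20, 30, 40, 50, 200])
clipped = clip_outliers_iqr(ages)
print(clipped.tolist())
```

Here Q1 is 30 and Q3 is 50, so the upper bound is 80 and the value 200 is pulled down to it while the row is kept.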

<a id='feature-engineering'></a>
## 3. Feature Engineering

In [None]:
# Create new features

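# Balance relative to credit score. This is only a rough proxy: a true credit
# utilization ratio would divide by a credit limit, which this dataset does not use here.
df_clean['CreditUtilizationRatio'] = df_clean['Balance'] / (df_clean['CreditScore'] * 100)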

# Customer Lifetime Value (CLV) approximation
df_clean['CLV'] = df_clean['Balance'] * df_clean['NumOfProducts'] * df_clean['EstimatedSalary'] / 1000000

# Risk Score based on multiple factors
df_clean['RiskScore'] = (
    (df_clean['Age'] / 100) * 0.2 +
    (1 - df_clean['IsActiveMember']) * 0.3 +
    (df_clean['NumOfProducts'] > 1).astype(int) * 0.2 +
    (df_clean['HasCrCard'] == 0).astype(int) * 0.3
)

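# Tenure groups (include_lowest=True so that Tenure == 0 lands in 'New' instead of becoming NaN)
df_clean['TenureGroup'] = pd.cut(df_clean['Tenure'], bins=[0, 2, 5, 10, float('inf')],
                                labels=['New', 'Established', 'Long-term', 'Veteran'],
                                include_lowest=True)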

# Balance categories
df_clean['BalanceCategory'] = pd.cut(df_clean['Balance'], 
                                    bins=[-1, 0, 50000, 100000, float('inf')], 
                                    labels=['NoBalance', 'Low', 'Medium', 'High'])

# Display the new features
print("New Features Created:")
df_clean[['CreditUtilizationRatio', 'CLV', 'RiskScore', 'TenureGroup', 'BalanceCategory']].head()
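
`pd.cut` builds right-closed intervals by default, so a value equal to the lowest bin edge is only assigned to the first bin when `include_lowest=True` is passed. A small sketch with made-up tenure values shows the difference:

```python
import pandas as pd

# Made-up tenure values, including the lowest bin edge (0)
tenure = pd.Series([0, 1, 2, 3, 10, 11])
bins = [0, 2, 5, 10, float("inf")]
labels = ["New", "Established", "Long-term", "Veteran"]

default_groups = pd.cut(tenure, bins=bins, labels=labels)
inclusive_groups = pd.cut(tenure, bins=bins, labels=labels, include_lowest=True)

comparison = pd.DataFrame({
    "tenure": tenure,
    "default": default_groups,
    "include_lowest": inclusive_groups,
})
print(comparison)
```

With the default settings the tenure of 0 has no group (NaN); with `include_lowest=True` it is labelled `New`.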

In [None]:
# Encode categorical variables
label_encoders = {}
categorical_columns = ['Geography', 'Gender', 'TenureGroup', 'BalanceCategory']

df_encoded = df_clean.copy()

for col in categorical_columns:
    le = LabelEncoder()
    df_encoded[col] = le.fit_transform(df_encoded[col])
    label_encoders[col] = le

print("Categorical columns encoded successfully!")
print("Encoded columns:", categorical_columns)
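
`LabelEncoder` assigns integer codes in the sorted order of the values it saw during `fit`, and its `transform` raises an error for values it has not seen. The sketch below inspects the mapping and adds a hypothetical `safe_transform` helper for unseen categories; the category values are illustrative.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical category values for illustration
encoder = LabelEncoder()
codes = encoder.fit_transform(["Spain", "France", "Germany", "France"])

# classes_ is sorted, so the integer codes follow alphabetical order
mapping = {label: int(code) for code, label in enumerate(encoder.classes_)}
print(mapping)

def safe_transform(enc: LabelEncoder, value: str, unknown: int = -1) -> int:
    """Return the code for value, or `unknown` if it was not seen during fit."""
    if value in enc.classes_:
        return int(enc.transform([value])[0])
    return unknown

print(safe_transform(encoder, "Spain"), safe_transform(encoder, "Italy"))
```

Because the codes imply an ordering that nominal columns such as `Geography` do not have, one-hot encoding is a frequently used alternative; tree-based models are usually less sensitive to this than linear ones.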

<a id='visualization'></a>
## 4. Data Visualization

In [None]:
# Distribution of target variable
plt.figure(figsize=(8, 6))
sns.countplot(data=df_encoded, x='Exited')
plt.title('Distribution of Churn (Exited)')
plt.xlabel('Churn Status (0: Not Churned, 1: Churned)')
plt.ylabel('Count')
plt.show()

churn_rate = df_encoded['Exited'].mean() * 100
print(f"Churn Rate: {churn_rate:.2f}%")

In [None]:
# Correlation heatmap
plt.figure(figsize=(14, 10))
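# numeric_only=True skips any remaining text columns, which would otherwise raise an error in pandas 2.x
correlation_matrix = df_encoded.corr(numeric_only=True)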
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Correlation Matrix')
plt.show()

In [None]:
# Age distribution by churn status
plt.figure(figsize=(12, 6))
sns.histplot(data=df_encoded, x='Age', hue='Exited', kde=True, bins=30)
plt.title('Age Distribution by Churn Status')
plt.xlabel('Age')
plt.ylabel('Frequency')
plt.show()

<a id='model-building'></a>
## 5. Model Building and Training

In [None]:
# Prepare features and target
features = [
    'CreditScore', 'Geography', 'Gender', 'Age', 'Tenure', 'Balance', 
    'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary',
    'CreditUtilizationRatio', 'CLV', 'RiskScore', 'TenureGroup', 'BalanceCategory'
]

X = df_encoded[features]
y = df_encoded['Exited']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")
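
Passing `stratify=y` keeps the churn rate the same in the training and test sets. A minimal sketch with made-up labels (20 churners out of 100 customers) illustrates the effect; `X_toy` and `y_toy` are placeholders, not the notebook's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 20 churners out of 100 customers
y_toy = np.array([1] * 20 + [0] * 80)
X_toy = np.arange(100).reshape(-1, 1)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)
print(f"Train churn rate: {y_tr.mean():.2f}, test churn rate: {y_te.mean():.2f}")
```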

In [None]:
# Handle class imbalance using upsampling
train_data = pd.concat([X_train, y_train], axis=1)

# Separate majority and minority classes
majority_class = train_data[train_data.Exited == 0]
minority_class = train_data[train_data.Exited == 1]

# Upsample minority class
minority_upsampled = resample(minority_class, 
                              replace=True,     # sample with replacement
                              n_samples=len(majority_class),    # match majority class
                              random_state=42)  # reproducible results

# Combine majority class with upsampled minority class
train_balanced = pd.concat([majority_class, minority_upsampled])

# Separate features and target
X_train_balanced = train_balanced.drop('Exited', axis=1)
y_train_balanced = train_balanced['Exited']

print(f"Original training set class distribution:\n{y_train.value_counts()}")
print(f"Balanced training set class distribution:\n{y_train_balanced.value_counts()}")
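
Upsampling duplicates minority rows. An alternative that leaves the data unchanged is to let the classifier reweight the classes with `class_weight="balanced"`. The sketch below uses synthetic data from `make_classification` as a stand-in for the churn features; it is an alternative to consider, not the method used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic imbalanced data (roughly 80% / 20%) standing in for the churn features
X_syn, y_syn = make_classification(
    n_samples=500, n_features=8, weights=[0.8, 0.2], random_state=42
)

weighted_model = RandomForestClassifier(
    n_estimators=50,
    max_depth=10,
    class_weight="balanced",  # weights classes inversely to their frequency
    random_state=42,
)
weighted_model.fit(X_syn, y_syn)
syn_predictions = weighted_model.predict(X_syn)
print("Classes:", weighted_model.classes_)
```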

In [None]:
# Train the Random Forest model with depth and leaf-size limits to reduce overfitting
rf_model = RandomForestClassifier(
    n_estimators=100,
    max_depth=10,
    min_samples_split=10,
    min_samples_leaf=5,
    max_features='sqrt',
    random_state=42
)

# Train the model
rf_model.fit(X_train_balanced, y_train_balanced)

print("Random Forest model trained successfully!")
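
A single train/test split gives one performance estimate. Stratified k-fold cross-validation gives a sense of how much that estimate varies. The sketch below runs on synthetic data and uses `class_weight="balanced"` rather than upsampling, because cross-validating data that was upsampled beforehand can place copies of the same row in both the training and validation folds. All variable names here are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the churn features and target
X_cv, y_cv = make_classification(
    n_samples=400, n_features=8, weights=[0.8, 0.2], random_state=0
)

cv_model = RandomForestClassifier(
    n_estimators=50,
    max_depth=10,
    min_samples_split=10,
    min_samples_leaf=5,
    class_weight="balanced",
    random_state=42,
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(cv_model, X_cv, y_cv, cv=cv, scoring="roc_auc")

print(f"ROC AUC per fold: {scores.round(3)}")
print(f"Mean: {scores.mean():.3f}, std: {scores.std():.3f}")
```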

<a id='model-evaluation'></a>
## 6. Model Evaluation

In [None]:
# Make predictions
y_pred = rf_model.predict(X_test)
y_pred_proba = rf_model.predict_proba(X_test)[:, 1]

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)

print(f"Accuracy: {accuracy:.4f}")
print(f"ROC AUC Score: {roc_auc:.4f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
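
`predict` labels a customer as churned when the predicted probability is at least 0.5. For churn, missing a churner is often costlier than a false alarm, so a lower threshold may be worth examining. The sketch below uses made-up labels and probabilities; `metrics_at` is a hypothetical helper.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels and predicted churn probabilities
y_true = np.array([0, 0, 1, 1, 0, 1])
proba = np.array([0.10, 0.40, 0.35, 0.80, 0.60, 0.55])

def metrics_at(threshold: float) -> tuple:
    """Return (precision, recall) when predicting churn for proba >= threshold."""
    y_hat = (proba >= threshold).astype(int)
    return precision_score(y_true, y_hat), recall_score(y_true, y_hat)

for threshold in (0.5, 0.3):
    precision, recall = metrics_at(threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

In this toy case, lowering the threshold from 0.5 to 0.3 raises recall from 0.67 to 1.00 while precision drops from 0.67 to 0.60.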

In [None]:
# Confusion Matrix
plt.figure(figsize=(8, 6))
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', cbar=False)
plt.title('Confusion Matrix')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.show()
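
For a binary problem, the four cells of the confusion matrix can be unpacked directly, which makes it easier to report quantities such as the number of missed churners. A short sketch with made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical actual and predicted labels
y_actual = [0, 0, 1, 1, 0, 1]
y_predicted = [0, 0, 0, 1, 1, 1]

# For binary labels [0, 1], rows are actual classes and columns are predicted classes,
# so ravel() yields the cells in the order TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted).ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Missed churners (FN): {fn} of {fn + tp}")
```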

In [None]:
# ROC Curve
plt.figure(figsize=(8, 6))
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
plt.plot(fpr, tpr, color='darkorange', lw=2, label=f'ROC curve (area = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='navy', lw=2, linestyle='--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic (ROC) Curve')
plt.legend(loc="lower right")
plt.show()

<a id='feature-importance'></a>
## 7. Feature Importance Analysis

In [None]:
# Get feature importances
importances = rf_model.feature_importances_
feature_importance_df = pd.DataFrame({'feature': features, 'importance': importances})
feature_importance_df = feature_importance_df.sort_values('importance', ascending=False)

# Plot feature importances
plt.figure(figsize=(10, 8))
sns.barplot(data=feature_importance_df, x='importance', y='feature', palette='viridis')
plt.title('Feature Importances')
plt.xlabel('Importance')
plt.ylabel('Features')
plt.show()

print("Top 10 Most Important Features:")
print(feature_importance_df.head(10))
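
`feature_importances_` is computed from impurity reductions on the training data and can favor features with many distinct values. Permutation importance on held-out data is a common cross-check. The sketch below uses synthetic data and generic feature indices; it is an illustration, not a result from this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 6 features, 3 of them informative
X_pi, y_pi = make_classification(
    n_samples=400, n_features=6, n_informative=3, random_state=0
)
X_fit, X_val, y_fit, y_val = train_test_split(
    X_pi, y_pi, test_size=0.25, random_state=0, stratify=y_pi
)

pi_model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_fit, y_fit)

# Shuffle each feature in turn on the held-out set and measure the drop in ROC AUC
result = permutation_importance(
    pi_model, X_val, y_val, n_repeats=5, random_state=0, scoring="roc_auc"
)

for idx in result.importances_mean.argsort()[::-1]:
    print(f"feature_{idx}: {result.importances_mean[idx]:.3f} "
          f"+/- {result.importances_std[idx]:.3f}")
```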

<a id='deployment'></a>
## 8. Model Deployment Preparation

In [None]:
import joblib

# Save the trained model
joblib.dump(rf_model, 'customer_churn_model.pkl')

# Save the label encoders
joblib.dump(label_encoders, 'label_encoders.pkl')

print("Model and encoders saved successfully!")
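
The prediction code also needs the exact feature order used in training. One way to keep everything together, sketched here under assumed names, is to save the model, encoders, and feature list as a single bundle. The toy model, the feature names `f1` and `f2`, and the file name `churn_bundle.joblib` are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier

# Stand-ins for the notebook's rf_model, label_encoders, and features
toy_model = RandomForestClassifier(n_estimators=5, random_state=0)
toy_model.fit([[0, 1], [1, 0], [0, 0], [1, 1]], [0, 1, 0, 1])
bundle = {"model": toy_model, "encoders": {}, "features": ["f1", "f2"]}

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "churn_bundle.joblib")  # hypothetical file name
    joblib.dump(bundle, path)
    restored = joblib.load(path)

print(restored["features"])
```

Files written by `joblib` are pickles, so they should only be loaded from trusted sources and with a compatible scikit-learn version.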

In [None]:
# Example of how to load and use the model for predictions
# Load the model and encoders
loaded_model = joblib.load('customer_churn_model.pkl')
loaded_encoders = joblib.load('label_encoders.pkl')

# Example prediction function
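def predict_churn(customer_data):
    """
    Predict churn for a single customer.

    customer_data: dict with the raw customer fields used in training
    (CreditScore, Geography, Gender, Age, Tenure, Balance, NumOfProducts,
    HasCrCard, IsActiveMember, EstimatedSalary).
    Returns (prediction, probability): the predicted class (0 or 1) and
    the predicted probability of churn.
    """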
    # Convert to DataFrame
    df_customer = pd.DataFrame([customer_data])
    
    # Apply same preprocessing
    df_customer['CreditUtilizationRatio'] = df_customer['Balance'] / (df_customer['CreditScore'] * 100)
    df_customer['CLV'] = df_customer['Balance'] * df_customer['NumOfProducts'] * df_customer['EstimatedSalary'] / 1000000
    df_customer['RiskScore'] = (
        (df_customer['Age'] / 100) * 0.2 +
        (1 - df_customer['IsActiveMember']) * 0.3 +
        (df_customer['NumOfProducts'] > 1).astype(int) * 0.2 +
        (df_customer['HasCrCard'] == 0).astype(int) * 0.3
    )
    # include_lowest=True so that Tenure == 0 maps to 'New' instead of NaN
    df_customer['TenureGroup'] = pd.cut(df_customer['Tenure'], bins=[0, 2, 5, 10, float('inf')],
                                       labels=['New', 'Established', 'Long-term', 'Veteran'],
                                       include_lowest=True)
    df_customer['BalanceCategory'] = pd.cut(df_customer['Balance'], 
                                           bins=[-1, 0, 50000, 100000, float('inf')], 
                                           labels=['NoBalance', 'Low', 'Medium', 'High'])
    
    # Encode categorical variables
    for col in ['Geography', 'Gender', 'TenureGroup', 'BalanceCategory']:
        df_customer[col] = loaded_encoders[col].transform(df_customer[col])
    
    # Make prediction
    prediction = loaded_model.predict(df_customer[features])[0]
    probability = loaded_model.predict_proba(df_customer[features])[0][1]
    
    return prediction, probability

# Example usage
example_customer = {
    'CreditScore': 650,
    'Geography': 'France',
    'Gender': 'Male',
    'Age': 35,
    'Tenure': 5,
    'Balance': 50000,
    'NumOfProducts': 2,
    'HasCrCard': 1,
    'IsActiveMember': 1,
    'EstimatedSalary': 60000
}

prediction, probability = predict_churn(example_customer)
print(f"Prediction: {'Churn' if prediction == 1 else 'Not Churn'}")
print(f"Churn Probability: {probability:.2%}")
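
`predict_churn` assumes that every field is present and that `CreditScore` is non-zero, since it divides by `CreditScore * 100`. A lightweight check before calling it can give clearer error messages. The `validate_customer` helper below is a hypothetical sketch, with the field list taken from the example customer above.

```python
REQUIRED_FIELDS = [
    "CreditScore", "Geography", "Gender", "Age", "Tenure", "Balance",
    "NumOfProducts", "HasCrCard", "IsActiveMember", "EstimatedSalary",
]

def validate_customer(customer_data: dict) -> list:
    """Return a list of problems found in customer_data (empty if none)."""
    problems = [
        f"missing field: {name}"
        for name in REQUIRED_FIELDS
        if name not in customer_data
    ]
    # CreditScore is used as a divisor in the engineered ratio feature
    if customer_data.get("CreditScore", 1) <= 0:
        problems.append("CreditScore must be positive")
    return problems

# Deliberately incomplete input to show the reported problems
incomplete = {"CreditScore": 0, "Geography": "France"}
print(validate_customer(incomplete))
```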

## Conclusion

This notebook has guided you through the complete process of building a customer churn prediction model:

1. **Data Loading and Exploration**: Loaded the dataset and understood its structure
2. **Data Preprocessing**: Handled missing values and outliers
3. **Feature Engineering**: Created new meaningful features
4. **Data Visualization**: Explored data patterns and relationships
5. **Model Building**: Trained a Random Forest classifier with balanced data
6. **Model Evaluation**: Assessed performance with multiple metrics
7. **Feature Importance**: Identified key drivers of churn
8. **Deployment Preparation**: Saved the model and demonstrated usage

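The workflow is designed to reduce class-imbalance bias and the risk of overfitting through:
- Balanced training data (the minority class is upsampled in the training set only)
- Regularization parameters in the Random Forest (`max_depth`, `min_samples_split`, `min_samples_leaf`)
- Feature engineering to capture meaningful patterns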
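
One simple way to check for overfitting is to compare the same metric on the training and test sets: a large gap suggests the model has memorized the training data. The sketch below does this on synthetic data with the same Random Forest settings as above; the data and variable names are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn features and target
X_gap, y_gap = make_classification(
    n_samples=600, n_features=8, weights=[0.8, 0.2], random_state=1
)
X_a, X_b, y_a, y_b = train_test_split(
    X_gap, y_gap, test_size=0.2, random_state=42, stratify=y_gap
)

gap_model = RandomForestClassifier(
    n_estimators=50, max_depth=10, min_samples_split=10,
    min_samples_leaf=5, random_state=42,
).fit(X_a, y_a)

train_auc = roc_auc_score(y_a, gap_model.predict_proba(X_a)[:, 1])
test_auc = roc_auc_score(y_b, gap_model.predict_proba(X_b)[:, 1])
print(f"Train ROC AUC: {train_auc:.3f}")
print(f"Test ROC AUC:  {test_auc:.3f}")
print(f"Gap:           {train_auc - test_auc:.3f}")
```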

You can now use this model to predict customer churn and take proactive retention measures.