# Bank Customer Churn Prediction: Model Training

This notebook builds a machine learning model to predict customer churn for a bank. The workflow includes:
- Downloading and loading the dataset
- Data preprocessing and feature engineering
- Handling class imbalance with SMOTE
- Training a Random Forest Classifier
- Evaluating model performance
- Saving the trained model for future use

Let's get started!

In [1]:
# Install all required packages from requirements.txt
%pip install --upgrade -r requirements.txt


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m25.0.1[0m[39;49m -> [0m[32;49m25.1[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpython -m pip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


## 1. Download the Dataset

We use the `kagglehub` library to download the latest version of the Bank Customer Churn Prediction dataset from Kaggle. Make sure you have the necessary API credentials set up for Kaggle access.

In [1]:
import kagglehub

# Download latest version
path = kagglehub.dataset_download("shantanudhakadd/bank-customer-churn-prediction")

print("Path to dataset files:", path)


Downloading from https://www.kaggle.com/api/v1/datasets/download/shantanudhakadd/bank-customer-churn-prediction?dataset_version_number=1...


100%|██████████| 262k/262k [00:00<00:00, 422kB/s]

Extracting files...
Path to dataset files: /home/codespace/.cache/kagglehub/datasets/shantanudhakadd/bank-customer-churn-prediction/versions/1





## 2. Import Required Libraries

We import essential libraries for data manipulation, model building, evaluation, and handling class imbalance. Notably, we use `pandas` for data handling, `scikit-learn` for machine learning, and `imblearn` for SMOTE.

In [2]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import joblib
import os
from imblearn.over_sampling import SMOTE


## 3. Load and Inspect the Dataset

We load the dataset into a pandas DataFrame and perform an initial inspection. The identifier columns (`RowNumber`, `CustomerId`, `Surname`) and the `Geography` column are dropped to focus on the remaining features. The one categorical variable left, `Gender`, is then one-hot encoded to prepare the data for modeling.

In [3]:
# Build the path to the CSV file inside the downloaded dataset directory
file_path = os.path.join(path, 'Churn_Modelling.csv')

# --- 1. Import the dataset into a pandas DataFrame ---
try:
    df = pd.read_csv(file_path)
    print("--- Dataset loaded successfully ---")
    print("Original DataFrame Info:")
    df.info()
    print("\nOriginal DataFrame Head:\n")
    print(df.head())
except FileNotFoundError:
    print(f"Error: The file was not found at {file_path}")
    # Re-raise so the notebook stops here instead of continuing without df
    raise
except Exception as e:
    print(f"An error occurred while loading the CSV: {e}")
    raise

# --- 2. Drop the unnecessary columns ---
columns_to_drop = ['RowNumber', 'CustomerId', 'Surname', 'Geography']

# Create a new dataframe to store the processed data
df_processed = df.drop(columns=columns_to_drop)
print(f"\n--- Dropped columns: {columns_to_drop} ---\n")
print("DataFrame Info after dropping columns:\n")
df_processed.info()

# --- 3. Preprocess categorical features using One-Hot Encoding ---
categorical_cols = ['Gender']

# Use pandas get_dummies for easy One-Hot Encoding
# drop_first=True helps avoid multicollinearity by removing one category per feature
df_processed = pd.get_dummies(df_processed, columns=categorical_cols, drop_first=True, dtype=int) # Specify dtype=int for 0/1

print("\n--- Applied One-Hot Encoding to Gender ---\n")
print("Processed DataFrame Info:")
df_processed.info()
print("\nProcessed DataFrame Head:")
print(df_processed.head())

--- Dataset loaded successfully ---
Original DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           10000 non-null  int64  
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(2), int64(9), object(3)
memory usage: 1.1+ MB

Original DataFrame Head:

   Row
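
The effect of `drop_first=True` in the encoding step can be seen on a small example. The sketch below uses a made-up four-row frame (the values are illustrative, not taken from the dataset) and compares full one-hot encoding with the reduced form used in this notebook.

```python
import pandas as pd

# Toy data standing in for the 'Gender' column; values are illustrative only
toy = pd.DataFrame({
    "Gender": ["Female", "Male", "Female", "Male"],
    "Age": [42, 41, 39, 43],
})

# Full one-hot encoding: one indicator column per category
full = pd.get_dummies(toy, columns=["Gender"], dtype=int)

# drop_first=True removes the first category ('Female' here),
# leaving a single 0/1 column that carries the same information
reduced = pd.get_dummies(toy, columns=["Gender"], drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
print(reduced["Gender_Male"].tolist())
```

With only two categories, the single `Gender_Male` column is enough: a 0 implies the dropped category.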

## 4. Split Features and Target

We separate the features (`X`) from the target variable (`y`). The target variable, `Exited`, indicates whether a customer has churned (1) or not (0).

In [4]:
# --- 4. Split the data into features (X) and target (y) ---
# The target variable is 'Exited', which indicates whether a customer churned (1) or not (0)
X = df_processed.drop('Exited', axis=1) # Features are all columns except 'Exited'
y = df_processed['Exited'] # Target variable is 'Exited'

print("\n--- Split data into features (X) and target (y) ---")
print("Features (X) shape:", X.shape)
print("Target (y) shape:", y.shape)
print("\nFeatures (X) Head:")
print(X.head())
print("\nTarget (y) Head:")
print(y.head())


--- Split data into features (X) and target (y) ---
Features (X) shape: (10000, 9)
Target (y) shape: (10000,)

Features (X) Head:
   CreditScore  Age  Tenure    Balance  NumOfProducts  HasCrCard  \
0          619   42       2       0.00              1          1   
1          608   41       1   83807.86              1          0   
2          502   42       8  159660.80              3          1   
3          699   39       1       0.00              2          0   
4          850   43       2  125510.82              1          1   

   IsActiveMember  EstimatedSalary  Gender_Male  
0               1        101348.88            0  
1               1        112542.58            0  
2               0        113931.57            0  
3               0         93826.63            0  
4               1         79084.10            0  

Target (y) Head:
0    1
1    0
2    1
3    0
4    0
Name: Exited, dtype: int64


## 5. Train-Test Split and Addressing Class Imbalance

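We split the data into training and testing sets, using stratification to preserve the distribution of the target variable. To address class imbalance, we apply SMOTE (Synthetic Minority Over-sampling Technique) to the training set so that the model is trained on balanced classes. This notebook also resamples the test set, so the evaluation in later sections is measured on a balanced set that contains synthetic samples rather than on the original class distribution. The usual practice is to oversample only the training data and evaluate on the untouched test set.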
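
What stratification does can be checked in isolation. The sketch below is a minimal illustration, assuming synthetic labels with the same class counts as this dataset (7,963 non-churned and 2,037 churned) and placeholder features; it is not the notebook's actual data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the same class counts as the notebook's target;
# the single feature column is a placeholder
y = np.array([0] * 7963 + [1] * 2037)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# With stratify=y, the churn rate is (almost exactly) the same in every subset
overall_rate = y.mean()
train_rate = y_train.mean()
test_rate = y_test.mean()
print(f"overall={overall_rate:.4f} train={train_rate:.4f} test={test_rate:.4f}")
```

Without `stratify`, the churn rate in each subset would vary with the random seed, which matters more as the minority class gets smaller.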

In [5]:
# --- 5. Split the data into training and testing sets ---
# We'll use 80% of the data for training and 20% for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y) # Stratify to maintain target distribution

print("\n--- Split data into training and testing sets ---")
print("X_train shape (before SMOTE):", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape (before SMOTE):", y_train.shape)
print("y_test shape:", y_test.shape)
print("Value counts of y_train before SMOTE:\n", y_train.value_counts())
print("Value counts of y_test before SMOTE:\n", y_test.value_counts())
# --- 6. Apply SMOTE to the training and testing data ---
print("\n--- Applying SMOTE to the training and testing data ---")
smote = SMOTE(random_state=42)
X_train_res, y_train_res = smote.fit_resample(X_train, y_train)
print("X_train shape (after SMOTE):", X_train_res.shape)
print("y_train shape (after SMOTE):", y_train_res.shape)
print("Value counts of y_train after SMOTE:\n", y_train_res.value_counts())

# Resample the test set as well (see the note above on evaluating with synthetic samples)
X_test_res, y_test_res = smote.fit_resample(X_test, y_test)
print("X_test shape (after SMOTE):", X_test_res.shape)
print("y_test shape (after SMOTE):", y_test_res.shape)
print("Value counts of y_test after SMOTE:\n", y_test_res.value_counts())


--- Split data into training and testing sets ---
X_train shape (before SMOTE): (8000, 9)
X_test shape: (2000, 9)
y_train shape (before SMOTE): (8000,)
y_test shape: (2000,)
Value counts of y_train before SMOTE:
 Exited
0    6370
1    1630
Name: count, dtype: int64
Value counts of y_test before SMOTE:
 Exited
0    1593
1     407
Name: count, dtype: int64

--- Applying SMOTE to the training and testing data ---
X_train shape (after SMOTE): (12740, 9)
y_train shape (after SMOTE): (12740,)
Value counts of y_train after SMOTE:
 Exited
1    6370
0    6370
Name: count, dtype: int64
X_test shape (after SMOTE): (3186, 9)
y_test shape (after SMOTE): (3186,)
Value counts of y_test after SMOTE:
 Exited
0    1593
1    1593
Name: count, dtype: int64
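
To make the idea behind SMOTE concrete, here is a simplified NumPy sketch of its core step: each synthetic row is placed at a random point on the segment between a minority sample and one of its nearest minority neighbors. The function name `smote_like_oversample` and the random toy data are assumptions for illustration; this is not the `imblearn` implementation, which handles sampling strategies and other details.

```python
import numpy as np

def smote_like_oversample(X_min, n_new, k=5, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)

    # Pairwise Euclidean distances among minority samples
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)   # starting sample for each new row
    pick = rng.integers(0, k, size=n_new)   # which of its k neighbors to use
    neighbor_idx = neighbors[base, pick]
    gap = rng.random((n_new, 1))            # position along the connecting segment
    return X_min[base] + gap * (X_min[neighbor_idx] - X_min[base])

# Toy minority class: 20 random points in 2 dimensions
rng = np.random.default_rng(42)
X_minority = rng.normal(loc=0.0, scale=1.0, size=(20, 2))
synthetic = smote_like_oversample(X_minority, n_new=30)
print(synthetic.shape)
```

Because every synthetic row is an interpolation between two real minority rows, the new points stay inside the region the minority class already occupies.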


## 6. (Optional) Feature Scaling

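Feature scaling standardizes the ranges of numerical features. Tree-based models such as Random Forest do not depend on feature scale, so this step is commented out; it can be enabled for algorithms that are sensitive to feature scales. As written, the cell scales `X_train` and `X_test`, whereas the model below is trained on `X_train_res`, so enabling it would also require running it before the SMOTE step.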

In [None]:
# # --- Scale numerical features ---
# # Original numerical columns: 'CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary'
# numerical_cols_to_scale = ['CreditScore', 'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary']
# # Note: 'HasCrCard' and 'IsActiveMember' are binary but often included in scaling.

# scaler = StandardScaler()

# # Fit the scaler on the training data and transform it
# X_train[numerical_cols_to_scale] = scaler.fit_transform(X_train[numerical_cols_to_scale])

# # Transform the test data using the *same* scaler fitted on the training data
# X_test[numerical_cols_to_scale] = scaler.transform(X_test[numerical_cols_to_scale])

# print("\n--- Scaled numerical features ---")
# print("X_train Head after scaling:")
# print(X_train.head())
# print("\nX_test Head after scaling:")
# print(X_test.head())
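
The fit-on-train, transform-on-test pattern in the commented-out cell can be sketched on its own. The example below assumes two made-up numeric columns with very different ranges (loosely resembling a credit score and a salary); the data is random and only illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Random stand-in data: two columns on very different scales
rng = np.random.default_rng(0)
X_train = rng.normal(loc=[650, 60000], scale=[100, 30000], size=(200, 2))
X_test = rng.normal(loc=[650, 60000], scale=[100, 30000], size=(50, 2))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean and std from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the training statistics on test data

print(X_train_scaled.mean(axis=0).round(6))
print(X_train_scaled.std(axis=0).round(6))
```

Fitting the scaler on the training data alone keeps information about the test set from leaking into preprocessing.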

## 7. Train a Random Forest Classifier

We initialize and train a Random Forest Classifier using the SMOTE-balanced training data. Random Forest is an ensemble method that builds multiple decision trees and aggregates their predictions for improved accuracy and robustness.

In [6]:
# --- 7. Train a Random Forest Classifier model ---
print("\n--- Training Random Forest Classifier ---")

# Initialize the Random Forest Classifier
# n_estimators: The number of trees in the forest.
# random_state: Ensures reproducibility.
model = RandomForestClassifier(n_estimators=100, random_state=42)

# Train the model using the training data
model.fit(X_train_res, y_train_res)
print("--- Random Forest Classifier trained successfully ---")


--- Training Random Forest Classifier ---
--- Random Forest Classifier trained successfully ---
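
How a forest aggregates its trees can be inspected directly. The sketch below trains on synthetic data from `make_classification` (an assumption for illustration, not the churn dataset) and compares the forest's predicted probabilities with the average of the individual trees' probabilities, which is how scikit-learn combines them.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data with 9 features, matching the notebook's feature count
X, y = make_classification(n_samples=500, n_features=9, n_informative=4, random_state=42)

forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X, y)

# The fitted decision trees are exposed through estimators_
n_trees = len(forest.estimators_)

# Average the per-tree class probabilities by hand and compare with the forest
sample = X[:3]
manual_proba = np.mean([tree.predict_proba(sample) for tree in forest.estimators_], axis=0)
forest_proba = forest.predict_proba(sample)

# Impurity-based feature importances, normalized to sum to 1
importances = forest.feature_importances_

print(n_trees)
print(np.allclose(manual_proba, forest_proba))
```

The same `feature_importances_` attribute is available on the model trained above and gives a rough ranking of which features the trees split on most.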


## 8. Make Predictions

We use the trained Random Forest model to make predictions on the SMOTE-balanced test set.

In [7]:
# --- 8. Make predictions on the test data ---
print("\n--- Making predictions on the test set ---")
y_pred = model.predict(X_test_res)


--- Making predictions on the test set ---


## 9. Evaluate Model Performance

We evaluate the model using accuracy, classification report (precision, recall, F1-score), and confusion matrix. These metrics provide insights into the model's predictive performance and its ability to distinguish between churned and non-churned customers.

In [8]:
# --- 9. Evaluate the model ---
print("\n--- Evaluating the model ---")

# Calculate Accuracy
accuracy = accuracy_score(y_test_res, y_pred)
print(f"Accuracy: {accuracy:.4f}")

# Print Classification Report (includes precision, recall, f1-score)
print("\nClassification Report:")
print(classification_report(y_test_res, y_pred))

# Print Confusion Matrix
print("\nConfusion Matrix:")
print(confusion_matrix(y_test_res, y_pred))


--- Evaluating the model ---
Accuracy: 0.8214

Classification Report:
              precision    recall  f1-score   support

           0       0.80      0.85      0.83      1593
           1       0.84      0.79      0.82      1593

    accuracy                           0.82      3186
   macro avg       0.82      0.82      0.82      3186
weighted avg       0.82      0.82      0.82      3186


Confusion Matrix:
[[1353  240]
 [ 329 1264]]
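
The figures in the classification report follow directly from the confusion matrix. As a worked check, the sketch below recomputes accuracy and the class-1 (churned) precision, recall, and F1-score from the matrix printed above, using scikit-learn's convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix reported above: rows are true classes, columns are predictions
cm = np.array([[1353, 240],
               [329, 1264]])

tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_1 = tp / (tp + fp)   # of all predicted churners, how many churned
recall_1 = tp / (tp + fn)      # of all true churners, how many were found
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.4f}")
print(f"precision={precision_1:.2f} recall={recall_1:.2f} f1={f1_1:.2f}")
```

The results match the report: accuracy 0.8214, and precision 0.84, recall 0.79, F1-score 0.82 for class 1.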


## 10. Save the Trained Model

Finally, we save the trained model to disk using `joblib`. This allows us to reuse the model for future predictions without retraining.

In [9]:
# Save the model

print("\n--- Saving the trained model ---")

model_dir = 'model'
model_filename = 'Bank_Churn_pred_model.joblib'
model_path = os.path.join(model_dir, model_filename)

if not os.path.exists(model_dir):
    os.makedirs(model_dir)
    print(f"Created directory: {model_dir}")

try:
    joblib.dump(model, model_path)
    print(f"Model successfully saved to {model_path}")
except Exception as e:
    print(f"Error saving the model: {e}")


--- Saving the trained model ---
Created directory: model
Model successfully saved to model/Bank_Churn_pred_model.joblib
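
Reusing a saved model comes down to `joblib.load`. The sketch below is a self-contained round trip, assuming a small model trained on random data and a hypothetical file name in a temporary directory; in this notebook the path would be `model/Bank_Churn_pred_model.joblib` instead.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Small stand-in model trained on random data with 9 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 9))
y = (X[:, 0] > 0).astype(int)
model = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "demo_model.joblib")  # hypothetical file name
    joblib.dump(model, model_path)
    loaded = joblib.load(model_path)

# The reloaded model should reproduce the original predictions
same = np.array_equal(model.predict(X), loaded.predict(X))
print(same)
```

When loading the real model, new data needs the same columns in the same order as the training features, and it is safest to load with the same scikit-learn version used for saving.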
