## Introduction

The goal of this project is to predict, from customer behavior and account properties, whether a customer will leave the bank. Predictions are evaluated with the F1 score, the harmonic mean of precision and recall for the positive class (customers who exited), so a model scores well only if it catches most of the customers who leave without flagging too many who stay. I chose CatBoost because it handles both numeric and categorical features natively and has a built-in `auto_class_weights` option that offers two schemes for compensating for class imbalance. I'll first optimize a model with the original class balance. I'll then optimize a second model in which the two class-weighting schemes are available to the optimizer as a parameter. Finally, I'll select the model with the better validation F1 score and evaluate it on the test set.
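
For reference, this is the standard definition of F1 in terms of true positives (TP), false positives (FP), and false negatives (FN), where the positive class is `Exited = 1`:

```latex
\text{precision} = \frac{TP}{TP + FP}, \qquad
\text{recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}
    = \frac{2\,TP}{2\,TP + FP + FN}
```

True negatives do not appear anywhere in the formula, which is why F1 is a reasonable choice when the negative class dominates.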

In [1]:
import warnings

warnings.filterwarnings("ignore")

import numpy as np
import pandas as pd
import optuna, json

from IPython.display import display

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score, roc_auc_score, confusion_matrix
from catboost import CatBoostClassifier

In [2]:
def custom_info(df, head_cnt=10):
    df.info(memory_usage=False)
    print("\n")
    display(df.head(head_cnt))

In [3]:
all_data = pd.read_csv("Churn.csv")
custom_info(all_data)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)



|   | RowNumber | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 15634602 | Hargrave | 619 | France | Female | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 |
| 1 | 2 | 15647311 | Hill | 608 | Spain | Female | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 |
| 2 | 3 | 15619304 | Onio | 502 | France | Female | 42 | 8.0 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 |
| 3 | 4 | 15701354 | Boni | 699 | France | Female | 39 | 1.0 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 |
| 4 | 5 | 15737888 | Mitchell | 850 | Spain | Female | 43 | 2.0 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 |
| 5 | 6 | 15574012 | Chu | 645 | Spain | Male | 44 | 8.0 | 113755.78 | 2 | 1 | 0 | 149756.71 | 1 |
| 6 | 7 | 15592531 | Bartlett | 822 | France | Male | 50 | 7.0 | 0.0 | 2 | 1 | 1 | 10062.8 | 0 |
| 7 | 8 | 15656148 | Obinna | 376 | Germany | Female | 29 | 4.0 | 115046.74 | 4 | 1 | 0 | 119346.88 | 1 |
| 8 | 9 | 15792365 | He | 501 | France | Male | 44 | 4.0 | 142051.07 | 2 | 0 | 1 | 74940.5 | 0 |
| 9 | 10 | 15592389 | H? | 684 | France | Male | 27 | 2.0 | 134603.88 | 1 | 1 | 1 | 71725.73 | 0 |


In [4]:
# Inspect range and unique value count of each column
for col in all_data.columns:
    col_min = all_data[col].min()
    col_max = all_data[col].max()
    unq_val = all_data[col].nunique()
    print(f"{col}: {col_min} to {col_max} with {unq_val} unique values")
    
# Inspect the target class-balance
print(f"\n{all_data['Exited'].value_counts()}")    

RowNumber: 1 to 10000 with 10000 unique values
CustomerId: 15565701 to 15815690 with 10000 unique values
Surname: Abazu to Zuyeva with 2932 unique values
CreditScore: 350 to 850 with 460 unique values
Geography: France to Spain with 3 unique values
Gender: Female to Male with 2 unique values
Age: 18 to 92 with 70 unique values
Tenure: 0.0 to 10.0 with 11 unique values
Balance: 0.0 to 250898.09 with 6382 unique values
NumOfProducts: 1 to 4 with 4 unique values
HasCrCard: 0 to 1 with 2 unique values
IsActiveMember: 0 to 1 with 2 unique values
EstimatedSalary: 11.58 to 199992.48 with 9999 unique values
Exited: 0 to 1 with 2 unique values

0    7963
1    2037
Name: Exited, dtype: int64


About 20% of the customers in the dataset exited (2,037 of 10,000). For training, I don't see a use for RowNumber, CustomerId, and Surname, so I'll drop them when building the feature DataFrame. I'll also create categorical versions of "Tenure" and "NumOfProducts" so that the model has a variant of each feature that doesn't assume the numeric order is meaningful.

Tasks:

- Convert "Geography" and "Gender" to categorical.
- Create an additional categorical "Tenure" feature in which missing values become an "unknown" category.
- Create an additional categorical "NumOfProducts" feature.

In [5]:
# Categorical version of "Tenure"
all_data['TenureCAT'] = all_data['Tenure'].apply(
    lambda x: 'unknown' if pd.isna(x) else str(int(x))).astype('category')

all_data['Geography'] = all_data['Geography'].astype('category')
all_data['Gender'] = all_data['Gender'].astype('category')
all_data['NumOfProductsCAT'] = all_data['NumOfProducts'].astype('category')

# Store list of categorical features for CatBoost
cat_feat = ['Geography', 'Gender', 'TenureCAT', 'NumOfProductsCAT']

In [6]:
# Set up training, validation, and test set
RND = 12345

X = all_data.drop(columns=['RowNumber', 'CustomerId', 'Surname', 'Exited'])
y = all_data['Exited'].copy()

# First divide into training/test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=RND
)

# Divide training set into training/validation
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, stratify=y_train, random_state=RND
)  # The result is a 60/20/20 train/validation/test split

print("Shapes:", X_train.shape, X_val.shape, X_test.shape)
print("Class Count:", np.bincount(y_train), np.bincount(y_val), np.bincount(y_test))

Shapes: (6000, 12) (2000, 12) (2000, 12)
Class Count: [4777 1223] [1593  407] [1593  407]
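
The 60/20/20 figure follows from chaining the two splits: the second `test_size=0.25` applies to the 80% left over after the test set is removed. A minimal sketch of that arithmetic, where `chained_split_fractions` is a hypothetical helper and not part of the notebook:

```python
# Hypothetical helper (illustration only): the share of the full dataset
# that ends up in each subset after two chained train_test_split calls.
def chained_split_fractions(test_size, val_size_of_remainder):
    remainder = 1.0 - test_size              # left after carving out the test set
    val = remainder * val_size_of_remainder  # validation share of the full data
    train = remainder - val                  # what is left for training
    return train, val, test_size

train_frac, val_frac, test_frac = chained_split_fractions(0.20, 0.25)
print(f"train={train_frac:.2f}, val={val_frac:.2f}, test={test_frac:.2f}")
```

With 10,000 rows this gives the 6,000 / 2,000 / 2,000 shapes printed above.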


## 1. Model Optimization (Untouched Class Balance)

In [7]:
N_TRIALS = 15 # Each trial trains a CatBoost model
# Find the best threshold for each candidate
thresholds = np.linspace(0.0, 0.75, 76)

def objective(trial):
    # Parameters to optimize
    depth = trial.suggest_int("depth", 3, 8)
    learning_rate = trial.suggest_float("learning_rate", 0.001, 0.1, log=True)
    iterations = trial.suggest_int("iterations", 2000, 3000)
    l2_leaf_reg = trial.suggest_float("l2_leaf_reg", 0.000001, 0.01, log=True)

    # Build CatBoost model with current parameters
    model = CatBoostClassifier(
        depth=depth,
        learning_rate=learning_rate,
        iterations=iterations,
        l2_leaf_reg=l2_leaf_reg,
        random_seed=RND,
        verbose=False,
        thread_count=-1
    )

    # Fit on training set
    model.fit(
        X_train, y_train, 
        cat_features=cat_feat       
    )

    # Calculate model probabilities for validation set
    val_proba = model.predict_proba(X_val)[:, 1]
    
    # Find best threshold according to F1 score
    best_t, f1 = 0.5, -1.0
    for t in thresholds:
        # Convert probs to binary predictions based on threshold
        pred = (val_proba >= t).astype(int)
        f = f1_score(y_val, pred)
        if f > f1:
            f1 = f
            best_t = float(t)

    # Store best threshold for this candidate
    trial.set_user_attr("best_threshold", best_t)  
    
    return f1

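# Run Optuna optimization. With n_jobs > 1 the trials finish in a
# nondeterministic order, so results can vary between runs despite the seed.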
sampler = optuna.samplers.TPESampler(seed=RND)
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=N_TRIALS, n_jobs=5)

# Store optimized parameters
best_trial = study.best_trial
best_params = best_trial.params.copy()
best_threshold = float(best_trial.user_attrs.get("best_threshold"))

# Rebuild best model along with its probs and predictions
best_model = CatBoostClassifier(
    depth=int(best_params["depth"]),
    learning_rate=float(best_params["learning_rate"]),
    iterations=int(best_params["iterations"]),
    l2_leaf_reg=float(best_params["l2_leaf_reg"]),
    random_seed=RND,
    verbose=False,
    thread_count=-1
)

best_model.fit(
    X_train, y_train, 
    cat_features=cat_feat       
)

val_proba = best_model.predict_proba(X_val)[:, 1]
val_pred = (val_proba >= best_threshold).astype(int)
final_f1 = f1_score(y_val, val_pred)
final_auc = roc_auc_score(y_val, val_proba)

print("\nOPTIMIZED MODEL (Untouched Class Balance)\n")

print("Parameters:")
for k, v in best_params.items():
    if isinstance(v, float):
        print(f"  {k}: {v:.6f}")
    else:
        print(f"  {k}: {v}")
        
print(f"\nThreshold: {best_threshold:.4f}")
print(f"F1: {final_f1:.4f}")
print(f"AUC-ROC: {final_auc:.4f}")

# Store model in order to compare with best auto-balanced model
best_model.save_model("best_unbalanced.cbm")
meta = {
    "params": best_params,
    "threshold": best_threshold,
    "val_f1": final_f1,
    "val_auc": final_auc
}
with open("best_unbalanced_meta.json", "w") as f:
    json.dump(meta, f)

[I 2025-08-29 00:33:09,696] A new study created in memory with name: no-name-b42074ba-4767-4f29-984c-8d3e66d43a4e
[I 2025-08-29 00:34:18,543] Trial 1 finished with value: 0.6268260292164675 and parameters: {'depth': 3, 'learning_rate': 0.0511288346565954, 'iterations': 2751, 'l2_leaf_reg': 1.3408918054255857e-06}. Best is trial 1 with value: 0.6268260292164675.
[I 2025-08-29 00:34:50,440] Trial 3 finished with value: 0.6605504587155964 and parameters: {'depth': 5, 'learning_rate': 0.0026745390456811772, 'iterations': 2389, 'l2_leaf_reg': 0.00021036796895770125}. Best is trial 3 with value: 0.6605504587155964.
[I 2025-08-29 00:35:05,222] Trial 0 finished with value: 0.6576354679802955 and parameters: {'depth': 5, 'learning_rate': 0.005941573406081425, 'iterations': 2833, 'l2_leaf_reg': 4.784528155668776e-06}. Best is trial 3 with value: 0.6605504587155964.
[I 2025-08-29 00:35:36,130] Trial 4 finished with value: 0.6319612590799033 and parameters: {'depth': 6, 'learning_rate': 0.01193595


OPTIMIZED MODEL (Untouched Class Balance)

Parameters:
  depth: 5
  learning_rate: 0.002675
  iterations: 2389
  l2_leaf_reg: 0.000210

Threshold: 0.2800
F1: 0.6606
AUC-ROC: 0.8847
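
The threshold search inside `objective` is repeated verbatim in the next section, so it could be factored into one function. Below is a minimal sketch of that idea. `best_f1_threshold` is a hypothetical helper name, and the demo arrays are synthetic, not the churn data.

```python
import numpy as np
from sklearn.metrics import f1_score

def best_f1_threshold(y_true, proba, thresholds):
    """Return (threshold, f1) for the threshold with the highest F1.

    Ties go to the lowest threshold, which matches the strict `>`
    comparison used in the notebook's loop.
    """
    scores = np.array(
        [f1_score(y_true, (proba >= t).astype(int)) for t in thresholds]
    )
    best = int(np.argmax(scores))  # argmax returns the first maximum
    return float(thresholds[best]), float(scores[best])

# Synthetic labels and probabilities, for illustration only
y_demo = np.array([0, 0, 0, 0, 1, 0, 1, 1])
p_demo = np.array([0.052, 0.104, 0.203, 0.301, 0.357, 0.402, 0.604, 0.903])

t_best, f1_best = best_f1_threshold(y_demo, p_demo, np.linspace(0.0, 0.75, 76))
print(f"best threshold={t_best:.2f}, F1={f1_best:.4f}")
```

Inside `objective`, the loop would then reduce to a single call such as `best_t, f1 = best_f1_threshold(y_val, val_proba, thresholds)`.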


## 2. Model Optimization (CatBoost Auto Class Weight)

In [8]:
# The only difference in this setup is that the two auto_class_weights
# options ("Balanced" and "SqrtBalanced") are available to the optimizer

def objective(trial):
    # Parameters to optimize
    depth = trial.suggest_int("depth", 3, 8)
    learning_rate = trial.suggest_float("learning_rate", 0.001, 0.1, log=True)
    iterations = trial.suggest_int("iterations", 2000, 3000)
    l2_leaf_reg = trial.suggest_float("l2_leaf_reg", 0.000001, 0.01, log=True)
    auto_class_weights = trial.suggest_categorical("auto_class_weights", ["Balanced", "SqrtBalanced"])

    # Build CatBoost model with current parameters
    model = CatBoostClassifier(
        depth=depth,
        learning_rate=learning_rate,
        iterations=iterations,
        l2_leaf_reg=l2_leaf_reg,
        auto_class_weights=auto_class_weights,
        random_seed=RND,
        verbose=False,
        thread_count=-1
    )

    # Fit on training set
    model.fit(
        X_train, y_train, 
        cat_features=cat_feat       
    )

    # Calculate model probabilities for validation set
    val_proba = model.predict_proba(X_val)[:, 1]
    
    # Find best threshold according to F1 score
    best_t, f1 = 0.5, -1.0
    for t in thresholds:
        # Convert probs to binary predictions based on threshold
        pred = (val_proba >= t).astype(int)
        f = f1_score(y_val, pred)
        if f > f1:
            f1 = f
            best_t = float(t)

    # Store best threshold for this candidate
    trial.set_user_attr("best_threshold", best_t)  
    
    return f1

# Run Optuna optimization
sampler = optuna.samplers.TPESampler(seed=RND)
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=N_TRIALS, n_jobs=5)

# Retrieve optimized parameters
best_trial = study.best_trial
best_params = best_trial.params.copy()
best_threshold = float(best_trial.user_attrs.get("best_threshold"))

# Rebuild best model along with its probs and predictions
best_model = CatBoostClassifier(
    depth=int(best_params["depth"]),
    learning_rate=float(best_params["learning_rate"]),
    iterations=int(best_params["iterations"]),
    l2_leaf_reg=float(best_params["l2_leaf_reg"]),
    auto_class_weights=best_params["auto_class_weights"],
    random_seed=RND,
    verbose=False,
    thread_count=-1
)

best_model.fit(
    X_train, y_train, 
    cat_features=cat_feat       
)

val_proba = best_model.predict_proba(X_val)[:, 1]
val_pred = (val_proba >= best_threshold).astype(int)
final_f1 = f1_score(y_val, val_pred)
final_auc = roc_auc_score(y_val, val_proba)

print("\nFINAL MODEL (CatBoost Auto Class Weight)\n")

print("\nParameters:")
for k, v in best_params.items():
    if isinstance(v, float):
        print(f"  {k}: {v:.6f}")
    else:
        print(f"  {k}: {v}")
        
print(f"\nThreshold: {best_threshold:.4f}")
print(f"F1: {final_f1:.4f}")
print(f"AUC-ROC: {final_auc:.4f}")

# Store model in order to compare with best unbalanced model
best_model.save_model("best_auto_weighted.cbm")
meta = {
    "params": best_params,
    "threshold": best_threshold,
    "val_f1": final_f1,
    "val_auc": final_auc
}
with open("best_auto_meta.json", "w") as f:
    json.dump(meta, f)

[I 2025-08-29 00:40:38,591] A new study created in memory with name: no-name-a6f5cc7e-c8ae-4492-a03f-66c4fe8d1517
[I 2025-08-29 00:42:17,183] Trial 1 finished with value: 0.635561160151324 and parameters: {'depth': 4, 'learning_rate': 0.01529676403069316, 'iterations': 2885, 'l2_leaf_reg': 2.197490206432143e-05, 'auto_class_weights': 'Balanced'}. Best is trial 1 with value: 0.635561160151324.
[I 2025-08-29 00:42:39,749] Trial 0 finished with value: 0.6599326599326599 and parameters: {'depth': 7, 'learning_rate': 0.0028540296513652956, 'iterations': 2108, 'l2_leaf_reg': 0.004750829383251104, 'auto_class_weights': 'Balanced'}. Best is trial 0 with value: 0.6599326599326599.
[I 2025-08-29 00:42:50,991] Trial 4 finished with value: 0.6313193588162762 and parameters: {'depth': 7, 'learning_rate': 0.01622841125866564, 'iterations': 2123, 'l2_leaf_reg': 1.859046057989532e-05, 'auto_class_weights': 'Balanced'}. Best is trial 0 with value: 0.6599326599326599.
[I 2025-08-29 00:42:58,666] Trial 2


FINAL MODEL (CatBoost Auto Class Weight)


Parameters:
  depth: 5
  learning_rate: 0.006144
  iterations: 2897
  l2_leaf_reg: 0.000003
  auto_class_weights: SqrtBalanced

Threshold: 0.4900
F1: 0.6616
AUC-ROC: 0.8773
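
To make the two weighting options concrete, here is a minimal sketch of the class weights each one implies for the training split. The formulas are an assumption based on my reading of the CatBoost documentation (with unit sample weights): "Balanced" weights each class by the largest class count divided by its own count, and "SqrtBalanced" takes the square root of that ratio. `implied_class_weights` is a hypothetical helper, not a CatBoost API.

```python
import numpy as np

def implied_class_weights(class_counts, mode):
    """Approximate the weights implied by CatBoost's auto_class_weights.

    Assumption: Balanced = max_count / class_count, and
    SqrtBalanced = sqrt(max_count / class_count).
    """
    counts = np.asarray(class_counts, dtype=float)
    ratio = counts.max() / counts
    if mode == "Balanced":
        return ratio
    if mode == "SqrtBalanced":
        return np.sqrt(ratio)
    raise ValueError(f"unknown mode: {mode}")

train_counts = [4777, 1223]  # class counts of y_train printed earlier
for mode in ("Balanced", "SqrtBalanced"):
    print(mode, implied_class_weights(train_counts, mode).round(3))
```

Under that assumption, an exited customer counts roughly 3.9 times as much as a retained one with "Balanced" and roughly 2.0 times as much with "SqrtBalanced". Up-weighting the positive class tends to push predicted probabilities upward, which would be consistent with the best threshold moving from 0.28 to 0.49.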


## 3. Model Selection and Final Test

In [9]:
# Load metadata
with open("best_unbalanced_meta.json", "r") as f:
    meta_un = json.load(f)
with open("best_auto_meta.json", "r") as f:
    meta_au = json.load(f)

# Load models
unbal_model_file = "best_unbalanced.cbm"
auto_model_file  = "best_auto_weighted.cbm"
m_un = CatBoostClassifier()
m_un.load_model(unbal_model_file)
m_au = CatBoostClassifier()
m_au.load_model(auto_model_file)

# Extract the scores for comparison
def val_metrics(meta):
    return float(meta["val_f1"]), float(meta["val_auc"]), float(meta["threshold"])

f1_un, auc_un, thr_un = val_metrics(meta_un)
f1_au, auc_au, thr_au = val_metrics(meta_au)

print("Validation metrics (used for selection):")
print(f"UNBALANCED: val_f1={f1_un:.4f}, val_auc={auc_un:.4f}, thr={thr_un:.2f}")
print(f"AUTO_WEIGHT: val_f1={f1_au:.4f}, val_auc={auc_au:.4f}, thr={thr_au:.2f}")

# Select winner through F1 score and tie-breaker AUC
if f1_un > f1_au:
    winner_key = "unbalanced"
    winner_model_file = unbal_model_file
    winner_meta = meta_un
elif f1_au > f1_un:
    winner_key = "auto_weighted"
    winner_model_file = auto_model_file
    winner_meta = meta_au
else:
    # Tiebreaker
    if auc_un >= auc_au:
        winner_key = "unbalanced"
        winner_model_file = unbal_model_file
        winner_meta = meta_un
    else:
        winner_key = "auto_weighted"
        winner_model_file = auto_model_file
        winner_meta = meta_au

# Build full training set for final training
X_train_all = pd.concat([X_train, X_val], axis=0)
y_train_all = pd.concat([y_train, y_val], axis=0)

win_params = winner_meta["params"]

# Extract parameters of the winning model
cb_kwargs = {
    "depth": int(win_params["depth"]),
    "learning_rate": float(win_params["learning_rate"]),
    "iterations": int(win_params["iterations"]),
    "l2_leaf_reg": float(win_params["l2_leaf_reg"]),
    "random_seed": RND,
    "verbose": False,
    "thread_count": -1
}

# If the auto-weighted model wins
if "auto_class_weights" in win_params:
    cb_kwargs["auto_class_weights"] = win_params["auto_class_weights"]

print(f"\nRetraining winner ({winner_key}) on full training set with its parameters:")
for k, v in cb_kwargs.items():
    if isinstance(v, float):
        print(f"  {k}: {v:.6f}")
    else:
        print(f"  {k}: {v}")

# final_thr is only assigned further down, so read the threshold from the metadata
print(f"  threshold: {float(winner_meta['threshold']):.2f}")

# Retrain final model on full training data
final_model = CatBoostClassifier(**cb_kwargs)
final_model.fit(X_train_all, y_train_all, cat_features=cat_feat)

# Evaluate final model on test set
final_thr = float(winner_meta["threshold"])
final_proba = final_model.predict_proba(X_test)[:, 1]
final_pred = (final_proba >= final_thr).astype(int)

final_f1 = f1_score(y_test, final_pred)
final_auc = roc_auc_score(y_test, final_proba)

print("\nFINAL TEST EVALUATION:")
print(f"Winning Model: {winner_key}")
print(f"Test F1: {final_f1:.4f}")
print(f"Test AUC: {final_auc:.4f}")
print("Confusion Matrix:")
print(confusion_matrix(y_test, final_pred))

Validation metrics (used for selection):
UNBALANCED: val_f1=0.6606, val_auc=0.8847, thr=0.28
AUTO_WEIGHT: val_f1=0.6616, val_auc=0.8773, thr=0.49
Retraining winner (auto_weighted) on full training set with its parameters: {'depth': 5, 'learning_rate': 0.006143977606332298, 'iterations': 2897, 'l2_leaf_reg': 2.8247969032742353e-06, 'random_seed': 12345, 'verbose': False, 'thread_count': -1, 'auto_class_weights': 'SqrtBalanced'}

FINAL TEST EVALUATION:
Winning Model: auto_weighted
Test F1: 0.6526
Test AUC: 0.8757
Confusion Matrix:
[[1459  134]
 [ 145  262]]
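
The confusion matrix also shows how the test F1 splits into precision and recall. A short sketch, assuming scikit-learn's layout (rows are true classes, columns are predicted classes):

```python
import numpy as np

# Test-set confusion matrix reported above:
# rows are true classes (0, 1), columns are predicted classes (0, 1).
cm = np.array([[1459, 134],
               [145, 262]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # share of flagged customers who actually exited
recall = tp / (tp + fn)     # share of exited customers who were flagged
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.4f}, recall={recall:.4f}, f1={f1:.4f}")
```

This prints precision=0.6616, recall=0.6437, f1=0.6526, which agrees with the reported test F1. At the chosen threshold the model flags 396 customers, of whom 262 actually exited, and misses 145 of the 407 who left.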


In [10]:
# Output feature importance of final model
feat_importance_df = pd.DataFrame({
    "Feature": X_train.columns,
    "Importance": final_model.get_feature_importance(type='PredictionValuesChange')
}).sort_values(by="Importance", ascending=False).reset_index(drop=True)

display(feat_importance_df)

|   | Feature | Importance |
|---|---|---|
| 0 | Age | 18.659882 |
| 1 | NumOfProductsCAT | 15.741442 |
| 2 | Balance | 14.93229 |
| 3 | Geography | 12.331644 |
| 4 | CreditScore | 8.909181 |
| 5 | EstimatedSalary | 8.453061 |
| 6 | TenureCAT | 7.47624 |
| 7 | IsActiveMember | 4.324035 |
| 8 | Tenure | 3.455779 |
| 9 | NumOfProducts | 2.96298 |


- How long with bank?