# Project Description

Beta Bank customers are leaving, little by little, every month. The bank has found that retaining existing customers is cheaper than attracting new ones.

**Objective**
The goal of this project is to predict whether a customer will leave the bank in the near future. We analyze historical data on customer behavior and contract termination to identify patterns associated with churn.

**Methodology**
We develop and evaluate multiple machine learning models (Decision Tree, Random Forest, Logistic Regression). 
To ensure robust evaluation and avoid overfitting, we employ a **Train/Validation/Test split (60/20/20)**. Hyperparameters are tuned manually using loops on the Validation set, and the final model is evaluated on the Test set.
To address class imbalance, we compare **Baseline** (no sampling), **Upsampling**, and **Downsampling** techniques applied strictly to the training data.

**Success Criteria**
The primary performance metric is the **F1 score**, with a target of at least **0.59** on the test set. We also evaluate the **AUC-ROC** metric.
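
As a reminder of what the F1 target measures, here is a minimal sketch on made-up labels (the lists `y_true` and `y_pred` are illustrative and are not project data). F1 is the harmonic mean of precision and recall, so it is high only when both are reasonably high.

```python
from sklearn.metrics import f1_score, precision_score, recall_score

# Illustrative labels only (not taken from the Churn dataset)
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4
f1_manual = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1_manual:.3f}")

# The manual harmonic mean agrees with scikit-learn's f1_score
f1_sklearn = f1_score(y_true, y_pred)
```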

# 1. Packages
We import only what the manual tuning loops and scaling require; `GridSearchCV` and `Pipeline` are not used.

In [140]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.preprocessing import StandardScaler
from sklearn.utils import resample

# 2. Dataset

In [141]:
df = pd.read_csv(r'C:\Users\valen\OneDrive\Escritorio\Juano_VS\Beta-Bank\Data\Churn.csv')
df.columns = df.columns.str.lower()
df = df.drop(['rownumber', 'customerid', 'surname'], axis=1)
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   creditscore      10000 non-null  int64  
 1   geography        10000 non-null  object 
 2   gender           10000 non-null  object 
 3   age              10000 non-null  int64  
 4   tenure           9091 non-null   float64
 5   balance          10000 non-null  float64
 6   numofproducts    10000 non-null  int64  
 7   hascrcard        10000 non-null  int64  
 8   isactivemember   10000 non-null  int64  
 9   estimatedsalary  10000 non-null  float64
 10  exited           10000 non-null  int64  
dtypes: float64(3), int64(6), object(2)
memory usage: 859.5+ KB
None


In [142]:
# Check whether tenure already contains zeros.
# If it does, filling missing values with 0 would make them indistinguishable from real zero-tenure customers.
df[df['tenure']==0].shape

(382, 11)

In [143]:
median = df['tenure'].median()
print(median)

5.0


In [144]:
# Replace missing values with the median value of the column
df['tenure'] = df['tenure'].fillna(median)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   creditscore      10000 non-null  int64  
 1   geography        10000 non-null  object 
 2   gender           10000 non-null  object 
 3   age              10000 non-null  int64  
 4   tenure           10000 non-null  float64
 5   balance          10000 non-null  float64
 6   numofproducts    10000 non-null  int64  
 7   hascrcard        10000 non-null  int64  
 8   isactivemember   10000 non-null  int64  
 9   estimatedsalary  10000 non-null  float64
 10  exited           10000 non-null  int64  
dtypes: float64(3), int64(6), object(2)
memory usage: 859.5+ KB
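
The median above is computed on the full dataset before splitting. A possible variation, sketched here on a made-up toy frame (the frame `toy` and its values are assumptions for illustration), is to learn the median from the training rows only and reuse it for the other subsets. With 10,000 rows the practical difference is likely small.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    "tenure": [1.0, np.nan, 3.0, 5.0, np.nan, 7.0, 9.0, 2.0, np.nan, 4.0],
    "exited": [0, 0, 1, 0, 1, 0, 0, 1, 0, 0],
})

toy_train, toy_rest = train_test_split(toy, test_size=0.4, random_state=42)
toy_train, toy_rest = toy_train.copy(), toy_rest.copy()

# The median is learned from the training rows only, then reused everywhere
train_median = toy_train["tenure"].median()
toy_train["tenure"] = toy_train["tenure"].fillna(train_median)
toy_rest["tenure"] = toy_rest["tenure"].fillna(train_median)

print("Training median:", train_median)
print("Remaining NaNs:", toy_train["tenure"].isna().sum(), toy_rest["tenure"].isna().sum())
```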


In [145]:
# Check for duplicates
print(df.duplicated().sum())

0


In [146]:
# One-hot encoding
df_ohe = pd.get_dummies(df, columns=['geography', 'gender'], drop_first=True, dtype=int)
df_ohe.head()

Unnamed: 0,creditscore,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,exited,geography_Germany,geography_Spain,gender_Male
0,619,42,2.0,0.0,1,1,1,101348.88,1,0,0,0
1,608,41,1.0,83807.86,1,0,1,112542.58,0,0,1,0
2,502,42,8.0,159660.8,3,1,0,113931.57,1,0,0,0
3,699,39,1.0,0.0,2,0,0,93826.63,0,0,0,0
4,850,43,2.0,125510.82,1,1,1,79084.1,0,0,1,0
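
To make the effect of `drop_first=True` concrete, here is a small sketch on made-up rows. It assumes the dropped reference categories are `France` and `Female`, which is consistent with the column names in the output above but is not shown directly in it. A row with zeros in every dummy column belongs to the reference categories.

```python
import pandas as pd

# Toy rows (made up); category names are an assumption about the dataset
toy = pd.DataFrame({
    "geography": ["France", "Germany", "Spain", "France"],
    "gender": ["Female", "Male", "Male", "Female"],
})

# drop_first=True removes the alphabetically first category of each column
toy_ohe = pd.get_dummies(toy, columns=["geography", "gender"], drop_first=True, dtype=int)

print(toy_ohe.columns.tolist())
print(toy_ohe)
```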


### 2.1 Data Splitting (Train / Validation / Test)
We split the data into three parts:
- **Training (60%)**: Used to train the models.
- **Validation (20%)**: Used to tune hyperparameters (the "loops" phase).
- **Test (20%)**: Used for the final evaluation.

In [147]:
X = df_ohe.drop('exited', axis=1)
y = df_ohe['exited']

# First split: 60% Train, 40% Temp (Val + Test)
x_train, x_temp, y_train, y_temp = train_test_split(X, y, test_size=0.4, random_state=42, stratify=y)

# Second split: Split Temp into 50% Val, 50% Test (which is 20% each of total)
x_val, x_test, y_val, y_test = train_test_split(x_temp, y_temp, test_size=0.5, random_state=42, stratify=y_temp)

print(f"Train size: {x_train.shape[0]} ({x_train.shape[0]/len(df):.0%})")
print(f"Val size:   {x_val.shape[0]} ({x_val.shape[0]/len(df):.0%})")
print(f"Test size:  {x_test.shape[0]} ({x_test.shape[0]/len(df):.0%})")

print("\nClass Balance (Train):")
print(pd.Series(y_train).value_counts(normalize=True))
print("\nClass Balance (Validation):")
print(pd.Series(y_val).value_counts(normalize=True))
print("\nClass Balance (Test):")
print(pd.Series(y_test).value_counts(normalize=True))

Train size: 6000 (60%)
Val size:   2000 (20%)
Test size:  2000 (20%)

Class Balance (Train):
exited
0    0.796333
1    0.203667
Name: proportion, dtype: float64

Class Balance (Validation):
exited
0    0.796
1    0.204
Name: proportion, dtype: float64

Class Balance (Test):
exited
0    0.7965
1    0.2035
Name: proportion, dtype: float64


### Class Balance Analysis
The class distribution is consistent across all three sets (Train, Validation, Test), with approximately **79.6%** of customers staying (Class 0) and **20.4%** exiting (Class 1).
This confirms that stratified splitting (`stratify=y`, then `stratify=y_temp`) preserved the original dataset's imbalance in every subset, so validation and test metrics are computed on the same class mix as the full data.
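
This imbalance is also why accuracy alone would be misleading here. The sketch below uses synthetic labels with the validation set's class shares (2,000 rows, 408 of them churners, derived from the 20.4% reported above) and a hypothetical model that always predicts "stays".

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Synthetic labels matching the reported validation class shares (not real rows)
y_true = np.array([1] * 408 + [0] * 1592)
y_const = np.zeros_like(y_true)  # hypothetical model: always predict class 0

acc = accuracy_score(y_true, y_const)
f1 = f1_score(y_true, y_const, zero_division=0)  # no positive predictions at all

print(f"accuracy={acc:.3f} f1={f1:.3f}")
```

The constant predictor looks acceptable on accuracy while identifying no churners, which F1 exposes.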

### 2.2 Scaling
We apply `StandardScaler` **only** to the numeric columns (`creditscore`, `age`, `tenure`, `balance`, `numofproducts`, `estimatedsalary`).
Binary and One-Hot Encoded columns are left as is, as they are already in a 0-1 range.

In [148]:
numeric = ['creditscore', 'age', 'tenure', 'balance', 'numofproducts', 'estimatedsalary']

scaler = StandardScaler()
scaler.fit(x_train[numeric])

# Transform numeric columns in all sets.
# Take explicit copies first so the column assignments below cannot raise SettingWithCopyWarning.
x_train, x_val, x_test = x_train.copy(), x_val.copy(), x_test.copy()
x_train[numeric] = scaler.transform(x_train[numeric])
x_val[numeric] = scaler.transform(x_val[numeric])
x_test[numeric] = scaler.transform(x_test[numeric])

# Note: x_train, x_val, x_test are already DataFrames, so we don't need to convert them back.
print("Scaled numeric columns:", numeric)
x_train.head()

Scaled numeric columns: ['creditscore', 'age', 'tenure', 'balance', 'numofproducts', 'estimatedsalary']


Unnamed: 0,creditscore,age,tenure,balance,numofproducts,hascrcard,isactivemember,estimatedsalary,geography_Germany,geography_Spain,gender_Male
6851,-1.283897,0.008566,1.449637,0.330105,0.783996,1,0,-0.084061,1,0,0
7026,0.271537,-1.139895,-0.001572,-1.220584,0.783996,0,1,0.264021,0,0,0
5705,-0.236571,0.104271,-0.001572,1.692794,0.783996,1,1,0.515344,1,0,1
9058,-1.874962,0.869911,-0.001572,1.032566,-0.919109,1,1,0.303842,0,1,0
9415,1.215167,0.391386,-1.089979,0.851257,0.783996,0,0,-1.400817,1,0,0
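
A quick way to sanity-check this kind of scaling is sketched below on synthetic data (the column values are random stand-ins, not project data). Because the scaler is fitted on the training rows only, the training column ends up with mean 0 and standard deviation 1, while the validation column is only approximately standardized.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

# Synthetic stand-ins for one numeric column (not project data)
train = pd.DataFrame({"balance": rng.normal(75000, 60000, size=600)})
val = pd.DataFrame({"balance": rng.normal(75000, 60000, size=200)})

# Mean and standard deviation are learned from the training rows only
scaler = StandardScaler().fit(train[["balance"]])
train_scaled = scaler.transform(train[["balance"]])
val_scaled = scaler.transform(val[["balance"]])

print("train mean/std:", train_scaled.mean(), train_scaled.std())
print("val   mean/std:", val_scaled.mean(), val_scaled.std())
```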


# 3. Manual Sampling Functions
We define functions to upsample and downsample the **Training** data only.

In [149]:
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)

    # resample(..., replace=False) with the default n_samples returns all rows in shuffled order
    features_upsampled, target_upsampled = resample(
        features_upsampled, target_upsampled, replace=False, random_state=42
    )
    return features_upsampled, target_upsampled

def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=42)] + [features_ones]
    )
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=42)] + [target_ones]
    )

    # resample(..., replace=False) with the default n_samples returns all rows in shuffled order
    features_downsampled, target_downsampled = resample(
        features_downsampled, target_downsampled, replace=False, random_state=42
    )
    return features_downsampled, target_downsampled

In [150]:
# Upsample Training Set
x_train_up, y_train_up = upsample(x_train, y_train, 4)

# Downsample Training Set
x_train_down, y_train_down = downsample(x_train, y_train, 0.25)

print("Original Train shape:", x_train.shape)
print("Upsampled Train shape:", x_train_up.shape)
print("Downsampled Train shape:", x_train_down.shape)

Original Train shape: (6000, 11)
Upsampled Train shape: (9666, 11)
Downsampled Train shape: (2416, 11)
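
The shapes above follow directly from the class counts. The sketch below reproduces the arithmetic, using counts derived from the 6,000-row training set and the class shares printed earlier (4,778 stayers and 1,222 churners). Both resampled sets end up roughly balanced, with about 51% positives.

```python
# Class counts implied by the 6000-row training set and the reported class shares
n_zeros, n_ones = 4778, 1222

repeat, fraction = 4, 0.25  # same arguments as passed to upsample / downsample

# Upsampling keeps every negative row and repeats each positive row `repeat` times
n_up = n_zeros + n_ones * repeat

# Downsampling keeps a fraction of the negative rows and every positive row
n_down = round(n_zeros * fraction) + n_ones

print("Upsampled rows:", n_up, "| positive share:", round(n_ones * repeat / n_up, 3))
print("Downsampled rows:", n_down, "| positive share:", round(n_ones / n_down, 3))
```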


# 4. Hyperparameter Tuning (Loops)
We iterate through hyperparameters, train on the training set (or sampled version), and evaluate on the **Validation** set to find the best configuration.

## Decision Tree

In [151]:
def tune_decision_tree(x_train, y_train, x_val, y_val):
    best_score = 0
    best_model = None
    best_params = {}
    
    for depth in [3, 5, 7, 10]:
        for leaf in [20, 50, 100]:
            for criterion in ['gini', 'entropy']:
                model = DecisionTreeClassifier(random_state=42, max_depth=depth, min_samples_leaf=leaf, criterion=criterion)
                model.fit(x_train, y_train)
                predictions = model.predict(x_val)
                score = f1_score(y_val, predictions)
                
                if score > best_score:
                    best_score = score
                    best_model = model
                    best_params = {'max_depth': depth, 'min_samples_leaf': leaf, 'criterion': criterion}
    
    # Calculate ROC AUC for the best model on validation set
    if best_model is not None:
        probs = best_model.predict_proba(x_val)[:, 1]
        auc = roc_auc_score(y_val, probs)
        print(f"Best F1 on Validation: {best_score:.4f} | ROC AUC: {auc:.4f} | Params: {best_params}")
    else:
        print("No model achieved F1 > 0")
        
    return best_model

print("--- Decision Tree: Baseline ---")
best_tree_base = tune_decision_tree(x_train, y_train, x_val, y_val)

print("\n--- Decision Tree: Upsampling ---")
best_tree_up = tune_decision_tree(x_train_up, y_train_up, x_val, y_val)

print("\n--- Decision Tree: Downsampling ---")
best_tree_down = tune_decision_tree(x_train_down, y_train_down, x_val, y_val)

--- Decision Tree: Baseline ---
Best F1 on Validation: 0.6117 | ROC AUC: 0.8502 | Params: {'max_depth': 7, 'min_samples_leaf': 20, 'criterion': 'gini'}

--- Decision Tree: Upsampling ---
Best F1 on Validation: 0.6038 | ROC AUC: 0.8513 | Params: {'max_depth': 10, 'min_samples_leaf': 50, 'criterion': 'gini'}

--- Decision Tree: Downsampling ---
Best F1 on Validation: 0.5869 | ROC AUC: 0.8485 | Params: {'max_depth': 5, 'min_samples_leaf': 20, 'criterion': 'gini'}


### Decision Tree Analysis
- **Baseline**: F1 Score: 0.6117, ROC AUC: 0.8502
- **Upsampling**: F1 Score: 0.6038, ROC AUC: 0.8513
- **Downsampling**: F1 Score: 0.5869, ROC AUC: 0.8485

**Observation**: The Baseline model actually achieved the highest F1 score (0.6117) on the validation set, though Upsampling was very close (0.6038) and had a slightly better ROC AUC. Downsampling reduced performance, likely due to the loss of training data.

## Random Forest

In [152]:
def tune_random_forest(x_train, y_train, x_val, y_val):
    best_score = 0
    best_model = None
    best_params = {}
    
    # Reduced grid for speed, expand if needed
    for n_est in [50, 100, 200]:
        for depth in [10, 20]:
            for split in [2, 5]:
                model = RandomForestClassifier(random_state=42, n_estimators=n_est, max_depth=depth, min_samples_split=split)
                model.fit(x_train, y_train)
                predictions = model.predict(x_val)
                score = f1_score(y_val, predictions)
                
                if score > best_score:
                    best_score = score
                    best_model = model
                    best_params = {'n_estimators': n_est, 'max_depth': depth, 'min_samples_split': split}
    
    # Calculate ROC AUC for the best model on validation set
    if best_model is not None:
        probs = best_model.predict_proba(x_val)[:, 1]
        auc = roc_auc_score(y_val, probs)
        print(f"Best F1 on Validation: {best_score:.4f} | ROC AUC: {auc:.4f} | Params: {best_params}")
    else:
        print("No model achieved F1 > 0")
        
    return best_model

print("--- Random Forest: Baseline ---")
best_rf_base = tune_random_forest(x_train, y_train, x_val, y_val)

print("\n--- Random Forest: Upsampling ---")
best_rf_up = tune_random_forest(x_train_up, y_train_up, x_val, y_val)

print("\n--- Random Forest: Downsampling ---")
best_rf_down = tune_random_forest(x_train_down, y_train_down, x_val, y_val)

--- Random Forest: Baseline ---
Best F1 on Validation: 0.6102 | ROC AUC: 0.8660 | Params: {'n_estimators': 50, 'max_depth': 20, 'min_samples_split': 5}

--- Random Forest: Upsampling ---
Best F1 on Validation: 0.6308 | ROC AUC: 0.8724 | Params: {'n_estimators': 200, 'max_depth': 10, 'min_samples_split': 5}

--- Random Forest: Downsampling ---
Best F1 on Validation: 0.6028 | ROC AUC: 0.8677 | Params: {'n_estimators': 100, 'max_depth': 10, 'min_samples_split': 2}


### Random Forest Analysis
- **Baseline**: F1 Score: 0.6102, ROC AUC: 0.8660
- **Upsampling**: F1 Score: 0.6308, ROC AUC: 0.8724
- **Downsampling**: F1 Score: 0.6028, ROC AUC: 0.8677

**Observation**: **Random Forest with Upsampling** is the clear winner on the validation set. It achieved the highest F1 score (**0.6308**) and the highest ROC AUC (**0.8724**) of any configuration tested, so combining the ensemble with class-balanced training data via upsampling gave the best validation results.

## Logistic Regression

In [153]:
def tune_logistic_regression(x_train, y_train, x_val, y_val):
    best_score = 0
    best_model = None
    best_params = {}
    
    # Solver 'liblinear' supports both l1 and l2
    for penalty in ['l1', 'l2']:
        for C in [0.01, 0.1, 1, 10]:
            model = LogisticRegression(random_state=42, solver='liblinear', penalty=penalty, C=C, max_iter=4000)
            model.fit(x_train, y_train)
            predictions = model.predict(x_val)
            score = f1_score(y_val, predictions)
            
            if score > best_score:
                best_score = score
                best_model = model
                best_params = {'penalty': penalty, 'C': C}
    
    # Calculate ROC AUC for the best model on validation set
    if best_model is not None:
        probs = best_model.predict_proba(x_val)[:, 1]
        auc = roc_auc_score(y_val, probs)
        print(f"Best F1 on Validation: {best_score:.4f} | ROC AUC: {auc:.4f} | Params: {best_params}")
    else:
        print("No model achieved F1 > 0")
        
    return best_model

print("--- Logistic Regression: Baseline ---")
best_lr_base = tune_logistic_regression(x_train, y_train, x_val, y_val)

print("\n--- Logistic Regression: Upsampling ---")
best_lr_up = tune_logistic_regression(x_train_up, y_train_up, x_val, y_val)

print("\n--- Logistic Regression: Downsampling ---")
best_lr_down = tune_logistic_regression(x_train_down, y_train_down, x_val, y_val)

--- Logistic Regression: Baseline ---
Best F1 on Validation: 0.3279 | ROC AUC: 0.7907 | Params: {'penalty': 'l1', 'C': 1}

--- Logistic Regression: Upsampling ---
Best F1 on Validation: 0.5265 | ROC AUC: 0.7928 | Params: {'penalty': 'l1', 'C': 0.01}

--- Logistic Regression: Downsampling ---
Best F1 on Validation: 0.5257 | ROC AUC: 0.7940 | Params: {'penalty': 'l1', 'C': 1}


### Logistic Regression Analysis
- **Baseline**: F1 Score: 0.3279, ROC AUC: 0.7907
- **Upsampling**: F1 Score: 0.5265, ROC AUC: 0.7928
- **Downsampling**: F1 Score: 0.5257, ROC AUC: 0.7940

**Observation**: Logistic Regression struggled with the imbalanced data (Baseline F1: 0.33). Sampling techniques improved the F1 score to around 0.53, but it remains well below the tree-based models and the 0.59 project target.

# 5. Final Test Evaluation
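Having tuned each model on the Validation set, we evaluate the upsampled variant of each model family on the **Test** set, which played no part in tuning. For the Decision Tree, the baseline variant scored slightly higher on validation (0.6117 vs 0.6038); the upsampled variant is evaluated here, which keeps the sampling strategy the same across the three models.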

In [154]:
def evaluate_on_test(model, x_test, y_test, name):
    if model is not None:
        predictions = model.predict(x_test)
        probs = model.predict_proba(x_test)[:, 1]
        f1 = f1_score(y_test, predictions)
        auc = roc_auc_score(y_test, probs)
        print(f"[{name}] F1: {f1:.4f} | ROC AUC: {auc:.4f}")
    else:
        print(f"[{name}] No model found.")

print("Final Test Results:")
evaluate_on_test(best_tree_up, x_test, y_test, "Decision Tree (Upsampled)")
evaluate_on_test(best_rf_up, x_test, y_test, "Random Forest (Upsampled)")
evaluate_on_test(best_lr_up, x_test, y_test, "Logistic Regression (Upsampled)")

Final Test Results:
[Decision Tree (Upsampled)] F1: 0.5695 | ROC AUC: 0.8249
[Random Forest (Upsampled)] F1: 0.6176 | ROC AUC: 0.8599
[Logistic Regression (Upsampled)] F1: 0.5130 | ROC AUC: 0.7702
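
To read these numbers against the validation results, the sketch below computes the validation-to-test F1 gap for each upsampled model and checks it against the 0.59 target. All values are copied from the outputs above; the dictionary layout is only an illustration.

```python
# Validation and test F1 for the upsampled models, copied from the outputs above
results = {
    "Decision Tree (Upsampled)": (0.6038, 0.5695),
    "Random Forest (Upsampled)": (0.6308, 0.6176),
    "Logistic Regression (Upsampled)": (0.5265, 0.5130),
}
target = 0.59

gaps = {}
meets_target = {}
for name, (val_f1, test_f1) in results.items():
    gaps[name] = round(val_f1 - test_f1, 4)    # drop from validation to test
    meets_target[name] = test_f1 >= target     # success criterion on the test set
    print(f"{name}: gap={gaps[name]:.4f} | meets target: {meets_target[name]}")
```

The Decision Tree shows the largest drop from validation to test, and only the Random Forest clears the target on the test set.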


# 6. General Conclusion

**Best Model: Random Forest with Upsampling**
- **Validation F1 Score**: 0.6308
- **Validation ROC AUC**: 0.8724
- **Test F1 Score**: 0.6176
- **Test ROC AUC**: 0.8599

**Summary of Findings**:
1.  **Methodology**: A Train/Validation/Test split with manual tuning loops let us tune hyperparameters on the validation set while keeping the test set untouched for the final estimate.
2.  **Sampling**: Upsampling was the most effective technique for the Random Forest model, raising validation F1 to 0.6308 from 0.6102 (baseline) and 0.6028 (downsampling). For the Decision Tree, the baseline scored highest on validation (0.6117).
3.  **Model Comparison**: 
    -   **Random Forest** outperformed both Decision Trees and Logistic Regression on validation and test.
    -   **Decision Trees** performed decently on validation (F1 0.6038 with upsampling) but generalized worse than the forest ensemble: test F1 dropped to 0.5695, below the 0.59 target.
    -   **Logistic Regression** reached a test F1 of only 0.5130 even with upsampling, which suggests a linear decision boundary is too restrictive for this data.

**Recommendation**:
The **Random Forest model trained with Upsampling** is recommended for deployment. Its test F1 of 0.6176 exceeds the project's target of 0.59, and its test ROC AUC of 0.8599 indicates good ability to rank churners above non-churners.