# Project Description

Beta Bank is losing customers little by little every month. The bank has found that retaining existing customers is cheaper than attracting new ones.

**Objective**
The goal of this project is to predict whether a customer will leave the bank in the near future. We analyze historical data on customer behavior and contract termination to identify patterns associated with churn.

**Methodology**
We develop and evaluate multiple machine learning models, including Decision Trees, Random Forests, and Logistic Regression. To address class imbalance, we employ techniques such as SMOTE and NearMiss within a robust Cross-Validation pipeline.

**Success Criteria**
The primary performance metric is the **F1 score**, with a target of at least **0.59** on the test set. We also evaluate the **AUC-ROC** metric to assess the model's overall discriminatory power.
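
To make the two metrics concrete, here is a minimal sketch on invented toy labels and scores (not the bank data). It shows that F1 is computed from hard predictions at a fixed threshold, while ROC AUC is computed from the raw scores; the variable names are illustrative only.

```python
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

# Toy labels and scores, invented for illustration (not from the bank data).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 0.65, 0.45]

# Hard predictions at the conventional 0.5 threshold.
y_pred = [int(s >= 0.5) for s in y_score]

precision = precision_score(y_true, y_pred)  # 3 TP / (3 TP + 2 FP) = 0.60
recall = recall_score(y_true, y_pred)        # 3 TP / (3 TP + 1 FN) = 0.75
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_score)         # uses the scores, not the 0.5 cut

print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} auc={auc:.3f}")
# precision=0.600 recall=0.750 f1=0.667 auc=0.875
```

Because F1 depends on the threshold and ROC AUC does not, the two metrics can rank models differently.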

# 1. Packages

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score, precision_recall_curve
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import NearMiss
from sklearn.preprocessing import StandardScaler
from imblearn.pipeline import Pipeline as ImbPipeline # Pipeline to handle resampling
from sklearn.compose import ColumnTransformer

# 2. Dataset

In [2]:
df = pd.read_csv(r'C:\Users\valen\OneDrive\Escritorio\Juano_VS\Beta-Bank\Data\Churn.csv')
df.columns = df.columns.str.lower()
df = df.drop(['rownumber', 'customerid', 'surname'], axis=1)
df.info()  # info() prints directly and returns None, so wrapping it in print() adds a stray "None"

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   creditscore      10000 non-null  int64  
 1   geography        10000 non-null  object 
 2   gender           10000 non-null  object 
 3   age              10000 non-null  int64  
 4   tenure           9091 non-null   float64
 5   balance          10000 non-null  float64
 6   numofproducts    10000 non-null  int64  
 7   hascrcard        10000 non-null  int64  
 8   isactivemember   10000 non-null  int64  
 9   estimatedsalary  10000 non-null  float64
 10  exited           10000 non-null  int64  
dtypes: float64(3), int64(6), object(2)
memory usage: 859.5+ KB


In [3]:
df[df['tenure']==0].shape

(382, 11)

In [4]:
median = df['tenure'].median()
print(median)

5.0


In [5]:
df['tenure'] = df['tenure'].fillna(median)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   creditscore      10000 non-null  int64  
 1   geography        10000 non-null  object 
 2   gender           10000 non-null  object 
 3   age              10000 non-null  int64  
 4   tenure           10000 non-null  float64
 5   balance          10000 non-null  float64
 6   numofproducts    10000 non-null  int64  
 7   hascrcard        10000 non-null  int64  
 8   isactivemember   10000 non-null  int64  
 9   estimatedsalary  10000 non-null  float64
 10  exited           10000 non-null  int64  
dtypes: float64(3), int64(6), object(2)
memory usage: 859.5+ KB


In [6]:
print(df.duplicated().sum())

0


In [7]:
df_ohe = pd.get_dummies(df, columns=['geography', 'gender'], drop_first=True, dtype=int)
df_ohe.head()

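|   | creditscore | age | tenure | balance | numofproducts | hascrcard | isactivemember | estimatedsalary | exited | geography_Germany | geography_Spain | gender_Male |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 619 | 42 | 2.0 | 0.0 | 1 | 1 | 1 | 101348.88 | 1 | 0 | 0 | 0 |
| 1 | 608 | 41 | 1.0 | 83807.86 | 1 | 0 | 1 | 112542.58 | 0 | 0 | 1 | 0 |
| 2 | 502 | 42 | 8.0 | 159660.8 | 3 | 1 | 0 | 113931.57 | 1 | 0 | 0 | 0 |
| 3 | 699 | 39 | 1.0 | 0.0 | 2 | 0 | 0 | 93826.63 | 0 | 0 | 0 | 0 |
| 4 | 850 | 43 | 2.0 | 125510.82 | 1 | 1 | 1 | 79084.1 | 0 | 0 | 1 | 0 |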


In [8]:
X = df_ohe.drop('exited', axis=1)
y = df_ohe['exited']

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# y_train and y_test are already Series; normalize=True returns class proportions
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))

exited
0    0.79625
1    0.20375
Name: proportion, dtype: float64
exited
0    0.7965
1    0.2035
Name: proportion, dtype: float64


# 3. Model Selection with Pipelines
We use `ImbPipeline` to ensure:
1. **Scaling** happens before resampling (critical for distance-based methods like SMOTE/NearMiss).
2. **Resampling** happens *only* on the training folds during Cross-Validation, preventing data leakage.

We also use `ColumnTransformer` to apply `StandardScaler` **only** to the numeric columns, leaving the binary and one-hot-encoded columns untouched.
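
As a minimal sketch of selective scaling, the toy frame below (made-up values, with column names borrowed from the dataset) is passed through a `ColumnTransformer` configured the same way as in this notebook. The names `toy` and `toy_preprocessor` are illustrative only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Toy frame with made-up values; column names mirror the dataset.
toy = pd.DataFrame({
    'age': [25, 35, 45, 55],
    'balance': [0.0, 50000.0, 100000.0, 150000.0],
    'hascrcard': [1, 0, 1, 0],
})

toy_preprocessor = ColumnTransformer(
    transformers=[('num', StandardScaler(), ['age', 'balance'])],
    remainder='passthrough',  # binary column is forwarded unchanged
)

transformed = toy_preprocessor.fit_transform(toy)

# Scaled columns come first; passthrough columns follow in their original order.
print(np.round(transformed, 3))
```

Note that the output column order differs from the input order whenever a passthrough column precedes a scaled one, which matters when mapping feature importances back to column names.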

In [9]:
# Define the preprocessor for selective scaling
numeric_features = ['creditscore', 'age', 'tenure', 'balance', 'numofproducts', 'estimatedsalary']

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features)
    ],
    remainder='passthrough' # Leave binary/OHE columns as is
)

def model_select(estimator, param, features_train, target_train, features_test, target_test):
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    grid_search = GridSearchCV(
        estimator=estimator,
        param_grid=param,
        cv=cv,
        scoring='roc_auc',
        refit=True,
        n_jobs=-1
    )
    # Note: We pass the original training data. The Pipeline handles scaling/resampling internally for each fold.
    grid_search.fit(features_train, target_train)
    
    print(f'Best Hyperparameters Cross-Validation: {grid_search.best_params_}')
    print(f'Best Score Cross-Validation (ROC AUC): {grid_search.best_score_:.4f}')
    
    best_model = grid_search.best_estimator_
    predictions = best_model.predict(features_test)
    probs = best_model.predict_proba(features_test)[:, 1]
    
    print(f'F1 Score Test: {f1_score(target_test, predictions):.4f}')
    print(f'ROC AUC Score Test: {roc_auc_score(target_test, probs):.4f}')
    return best_model

## Decision Tree

In [10]:
# Define base parameters (note the 'model__' prefix for pipeline compatibility)
param_grid_tree = {
    'model__max_depth': [3, 5, 7, 10],
    'model__min_samples_leaf': [20, 50, 100],
    'model__criterion': ['gini', 'entropy']
}

In [11]:
print("--- Decision Tree: Baseline ---")
# Even for baseline, we use a pipeline with scaler for consistency, though not strictly needed for trees.
pipeline_tree = ImbPipeline([
    ('preprocessor', preprocessor),
    ('model', DecisionTreeClassifier(random_state=42))
])

model_baseline = model_select(pipeline_tree, param_grid_tree, x_train, y_train, x_test, y_test)

--- Decision Tree: Baseline ---
Best Hyperparameters Cross-Validation: {'model__criterion': 'gini', 'model__max_depth': 7, 'model__min_samples_leaf': 20}
Best Score Cross-Validation (ROC AUC): 0.8393
F1 Score Test: 0.6020
ROC AUC Score Test: 0.8441


In [12]:
print("\n--- Decision Tree: NearMiss ---")
# Scaler -> NearMiss -> Model
pipeline_tree_nm = ImbPipeline([
    ('preprocessor', preprocessor),
    ('sampler', NearMiss(version=1)),
    ('model', DecisionTreeClassifier(random_state=42))
])

tree_nm = model_select(pipeline_tree_nm, param_grid_tree, x_train, y_train, x_test, y_test)


--- Decision Tree: NearMiss ---
Best Hyperparameters Cross-Validation: {'model__criterion': 'gini', 'model__max_depth': 3, 'model__min_samples_leaf': 20}
Best Score Cross-Validation (ROC AUC): 0.7268
F1 Score Test: 0.5249
ROC AUC Score Test: 0.7329


In [13]:
print("\n--- Decision Tree: SMOTE ---")
# Scaler -> SMOTE -> Model
pipeline_tree_smote = ImbPipeline([
    ('preprocessor', preprocessor),
    ('sampler', SMOTE(random_state=42)),
    ('model', DecisionTreeClassifier(random_state=42))
])

model_smote = model_select(pipeline_tree_smote, param_grid_tree, x_train, y_train, x_test, y_test)


--- Decision Tree: SMOTE ---
Best Hyperparameters Cross-Validation: {'model__criterion': 'entropy', 'model__max_depth': 7, 'model__min_samples_leaf': 50}
Best Score Cross-Validation (ROC AUC): 0.8317
F1 Score Test: 0.5980
ROC AUC Score Test: 0.8464


### Decision Tree Analysis
- **Baseline**: F1 Score: 0.6020, ROC AUC: 0.8441
- **NearMiss**: F1 Score: 0.5249, ROC AUC: 0.7329
- **SMOTE**: F1 Score: 0.5980, ROC AUC: 0.8464
- **Observation**: The Decision Tree performs relatively well. SMOTE raised the test ROC AUC marginally (0.8441 -> 0.8464) and lowered the F1 score marginally (0.6020 -> 0.5980); both differences are small, so SMOTE gives no clear benefit here. One possible explanation for the F1 drop is that oversampling leads the tree to predict churn more often, adding false positives that reduce precision. NearMiss performed markedly worse on both metrics, likely because undersampling discards too much of the majority-class data.

## Random Forest

In [14]:
param_grid_forest = {
    'model__n_estimators': [20, 50, 100, 200, 300, 400],
    'model__max_depth': [10, 20, 30, 40],
    'model__min_samples_split': [2, 5, 10]
}

In [15]:
print("--- Random Forest: Baseline ---")
pipeline_forest = ImbPipeline([
    ('preprocessor', preprocessor),
    ('model', RandomForestClassifier(random_state=42))
])

best_rf_base = model_select(pipeline_forest, param_grid_forest, x_train, y_train, x_test, y_test)

--- Random Forest: Baseline ---
Best Hyperparameters Cross-Validation: {'model__max_depth': 10, 'model__min_samples_split': 10, 'model__n_estimators': 200}
Best Score Cross-Validation (ROC AUC): 0.8603
F1 Score Test: 0.5796
ROC AUC Score Test: 0.8645


In [16]:
print("\n--- Random Forest: NearMiss ---")
pipeline_forest_nm = ImbPipeline([
    ('preprocessor', preprocessor),
    ('sampler', NearMiss(version=1)),
    ('model', RandomForestClassifier(random_state=42))
])

best_rf_nm = model_select(pipeline_forest_nm, param_grid_forest, x_train, y_train, x_test, y_test)


--- Random Forest: NearMiss ---
Best Hyperparameters Cross-Validation: {'model__max_depth': 10, 'model__min_samples_split': 10, 'model__n_estimators': 200}
Best Score Cross-Validation (ROC AUC): 0.7415
F1 Score Test: 0.4518
ROC AUC Score Test: 0.7386


In [17]:
print("\n--- Random Forest: SMOTE ---")
pipeline_forest_smote = ImbPipeline([
    ('preprocessor', preprocessor),
    ('sampler', SMOTE(random_state=42)),
    ('model', RandomForestClassifier(random_state=42))
])

best_rf_smote = model_select(pipeline_forest_smote, param_grid_forest, x_train, y_train, x_test, y_test)


--- Random Forest: SMOTE ---
Best Hyperparameters Cross-Validation: {'model__max_depth': 10, 'model__min_samples_split': 10, 'model__n_estimators': 400}
Best Score Cross-Validation (ROC AUC): 0.8536
F1 Score Test: 0.6132
ROC AUC Score Test: 0.8620


### Random Forest Analysis
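- **Baseline**: F1 Score: 0.5796, ROC AUC: 0.8645
- **NearMiss**: F1 Score: 0.4518, ROC AUC: 0.7386
- **SMOTE**: F1 Score: 0.6132, ROC AUC: 0.8620
- **Observation**: Random Forest with SMOTE is the top performer on F1. It achieved the highest F1 score (0.6132) among all models, while its ROC AUC (0.8620) is essentially the same as the baseline's (0.8645). The baseline forest ranks customers just as well but falls short of the 0.59 F1 target at the default decision threshold; training on SMOTE-balanced data shifts the predictions toward the churn class and lifts F1 above the target. NearMiss again performed worst on both metrics.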
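
The baseline and SMOTE forests have nearly identical ROC AUC but different F1 scores, which points to the decision threshold: `predict` cuts probabilities at 0.5, and F1 changes with that cut. As a sketch of how the imported `precision_recall_curve` could be used to explore this, the example below picks the F1-maximising threshold on invented toy scores. This step is not part of the notebook's pipeline, and in practice the threshold would be chosen on validation data rather than the test set.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_recall_curve

# Toy labels and predicted probabilities, invented for illustration.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 0.65, 0.45])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)

# The final (precision=1, recall=0) point has no matching threshold, so drop it.
p, r = precision[:-1], recall[:-1]
denom = p + r
f1_per_threshold = np.divide(2 * p * r, denom, out=np.zeros_like(denom), where=denom > 0)

best = int(np.argmax(f1_per_threshold))
best_threshold = float(thresholds[best])
best_f1 = float(f1_per_threshold[best])

# F1 at the default 0.5 cut, for comparison.
default_f1 = f1_score(y_true, (y_score >= 0.5).astype(int))

print(f'F1 at 0.5: {default_f1:.3f}')
print(f'Best F1: {best_f1:.3f} at threshold {best_threshold:.2f}')
# F1 at 0.5: 0.667
# Best F1: 0.800 at threshold 0.45
```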

## Logistic Regression

In [18]:
param_grid_lr = [
    {
        'model__penalty': ['l2'],
        'model__C': [0.01, 0.1, 1, 10, 100],
        'model__class_weight': ['balanced', None],
        'model__solver': ['lbfgs']
    },
    {
        'model__penalty': ['l1'],
        'model__C': [0.01, 0.1, 1, 10, 100],
        'model__class_weight': ['balanced', None],
        'model__solver': ['liblinear']
    }
]

In [19]:
print("--- Logistic Regression: Baseline ---")
pipeline_lr = ImbPipeline([
    ('preprocessor', preprocessor),
    ('model', LogisticRegression(random_state=42, max_iter=4000))
])

best_lr_base = model_select(pipeline_lr, param_grid_lr, x_train, y_train, x_test, y_test)

--- Logistic Regression: Baseline ---
Best Hyperparameters Cross-Validation: {'model__C': 0.01, 'model__class_weight': 'balanced', 'model__penalty': 'l2', 'model__solver': 'lbfgs'}
Best Score Cross-Validation (ROC AUC): 0.7673
F1 Score Test: 0.5057
ROC AUC Score Test: 0.7781


In [20]:
print("\n--- Logistic Regression: NearMiss ---")
pipeline_lr_nm = ImbPipeline([
    ('preprocessor', preprocessor),
    ('sampler', NearMiss(version=1)),
    ('model', LogisticRegression(random_state=42, max_iter=4000))
])

best_lr_nm = model_select(pipeline_lr_nm, param_grid_lr, x_train, y_train, x_test, y_test)


--- Logistic Regression: NearMiss ---
Best Hyperparameters Cross-Validation: {'model__C': 0.1, 'model__class_weight': 'balanced', 'model__penalty': 'l1', 'model__solver': 'liblinear'}
Best Score Cross-Validation (ROC AUC): 0.7041
F1 Score Test: 0.4620
ROC AUC Score Test: 0.7203


In [21]:
print("\n--- Logistic Regression: SMOTE ---")
pipeline_lr_smote = ImbPipeline([
    ('preprocessor', preprocessor),
    ('sampler', SMOTE(random_state=42)),
    ('model', LogisticRegression(random_state=42, max_iter=4000))
])

best_lr_smote = model_select(pipeline_lr_smote, param_grid_lr, x_train, y_train, x_test, y_test)


--- Logistic Regression: SMOTE ---
Best Hyperparameters Cross-Validation: {'model__C': 0.01, 'model__class_weight': 'balanced', 'model__penalty': 'l1', 'model__solver': 'liblinear'}
Best Score Cross-Validation (ROC AUC): 0.7674
F1 Score Test: 0.5100
ROC AUC Score Test: 0.7778


### Logistic Regression Analysis
- **Baseline**: F1 Score: 0.5057, ROC AUC: 0.7781
- **NearMiss**: F1 Score: 0.4620, ROC AUC: 0.7203
- **SMOTE**: F1 Score: 0.5100, ROC AUC: 0.7778
- **Observation**: Logistic Regression lags behind the tree-based models. Even with SMOTE and proper scaling, the linear decision boundary is likely insufficient to capture the non-linear relationships in this dataset. The F1 score hovers around 0.51, well below the Random Forest's 0.61 and the 0.59 target.
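
The point about linear decision boundaries can be illustrated on a toy XOR-style problem, where the label depends on an interaction between two features. The data below is synthetic and chosen to exaggerate the effect; it illustrates the general limitation and is not evidence about the bank data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Synthetic XOR-style data: the label is 1 when exactly one feature is positive.
rng = np.random.default_rng(42)
X_toy = rng.uniform(-1, 1, size=(2000, 2))
y_toy = ((X_toy[:, 0] > 0) ^ (X_toy[:, 1] > 0)).astype(int)

# Simple holdout split; the rows are already in random order.
X_fit, X_eval = X_toy[:1500], X_toy[1500:]
y_fit, y_eval = y_toy[:1500], y_toy[1500:]

# A single linear boundary cannot separate the four quadrants.
linear_acc = LogisticRegression().fit(X_fit, y_fit).score(X_eval, y_eval)

# A tree can carve the plane into axis-aligned rectangles.
tree_acc = DecisionTreeClassifier(random_state=42).fit(X_fit, y_fit).score(X_eval, y_eval)

print(f'Logistic Regression accuracy: {linear_acc:.3f}')
print(f'Decision Tree accuracy:       {tree_acc:.3f}')
```

The linear model stays near chance level here, while the tree recovers the interaction. Adding explicit interaction features is one way a linear model could close part of that gap.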

# 4. General Conclusion
Based on the comprehensive testing of Decision Trees, Random Forests, and Logistic Regression, using Baseline, NearMiss, and SMOTE strategies:

**The Best Model: Random Forest with SMOTE**
- **F1 Score**: 0.6132
- **ROC AUC**: 0.8620

**Why?**
1.  **Performance**: It achieves the highest test F1 score of all nine configurations, clearing the 0.59 target, and its ROC AUC is within 0.003 of the best value observed (0.8645, Random Forest baseline).
2.  **Robustness**: Random Forests are less prone to overfitting than single Decision Trees.
3.  **Data Handling**: The combination of SMOTE (to address imbalance) and the pipeline approach (to ensure scaling and resampling are fitted on the training folds only) proved most effective for F1.

**Recommendation**:
We should proceed with the **Random Forest model trained with SMOTE**. It offers the most reliable predictions for identifying customers at risk of churning.