In this notebook, we implement and compare models to see how our chosen approach performs against a baseline. The baseline uses data from the "3_1_Cleaning_ApplicationData" step, where no new features were manually created, combined with the reduced features derived through the FAMD technique. The chosen approach builds on data from "3_2_Cleaning_ApplicationData", where we manually created new features, again combined with the FAMD-reduced features.

We evaluate four models on each dataset: Random Forest, LightGBM with SMOTE, LightGBM with class weighting, and CatBoost with class weighting. Each model is scored on precision, recall, F1 score, AUC-ROC, and alert rate, so the comparison covers both ranking quality and the share of cases each model flags.
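To make the five metrics concrete, here is a minimal, dependency-free sketch that computes them from true labels and predicted scores. It is not the code inside `evaluate_models_with_resampling`, which this notebook does not show. The function name `binary_metrics`, the 0.5 threshold, and the toy data are hypothetical, and "alert rate" is assumed here to mean the fraction of all cases the model flags as positive.

```python
from typing import Dict, Sequence


def binary_metrics(
    y_true: Sequence[int], y_score: Sequence[float], threshold: float = 0.5
) -> Dict[str, float]:
    """Compute precision, recall, F1, AUC-ROC, and alert rate for a binary task."""
    # Turn scores into hard 0/1 predictions at the chosen threshold.
    y_pred = [1 if s >= threshold else 0 for s in y_score]

    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )

    # Assumed definition: fraction of all cases flagged as positive.
    alert_rate = sum(y_pred) / len(y_pred)

    # AUC-ROC as the probability that a random positive scores higher than a
    # random negative (ties count one half). Quadratic time, fine for a demo.
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if pos and neg:
        wins = sum(
            1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
        )
        auc = wins / (len(pos) * len(neg))
    else:
        auc = float("nan")

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "auc_roc": auc,
        "alert_rate": alert_rate,
    }


# Toy labels and scores, invented for illustration only.
y_true = [0, 0, 0, 0, 1, 1, 0, 1]
y_score = [0.1, 0.2, 0.6, 0.3, 0.8, 0.4, 0.05, 0.7]
for name, value in binary_metrics(y_true, y_score).items():
    print(f"{name}: {value:.3f}")
```

Precision, recall, F1, and alert rate depend on the threshold, while AUC-ROC depends only on how the scores rank the cases. That is why it is useful to report AUC-ROC alongside the threshold-based metrics.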

In [2]:
import pandas as pd
import numpy as np

# Suppress all warnings (this filter is global, not limited to pandas)
import warnings
warnings.filterwarnings('ignore')

# Import the project configuration and helper functions from local modules
from notebook_setup import RootPath, config
from my_functions import pie_plot
from my_functions import preprocess_data_for_split, evaluate_models_with_resampling

# Automatically reload local modules (e.g. my_functions.py) whenever they change
%load_ext autoreload
%autoreload 2

In [3]:
# load the data
y_target = pd.read_csv(f'{config.CleanDataPath}y_target.csv', index_col='SK_ID_CURR')
y_target = y_target.loc[:, ~y_target.columns.str.contains('^Unnamed')]

df3_reduced_df = pd.read_csv(f'{config.CleanDataPath}df3_reduced_df.csv', index_col='SK_ID_CURR')
df3_reduced_df = df3_reduced_df.loc[:, ~df3_reduced_df.columns.str.contains('^Unnamed')]

df3_reduced_df_new_features = pd.read_csv(f'{config.CleanDataPath}df3_reduced_df_new_features.csv', index_col='SK_ID_CURR')
df3_reduced_df_new_features = df3_reduced_df_new_features.loc[:, ~df3_reduced_df_new_features.columns.str.contains('^Unnamed')]


In [4]:
df3_reduced_df_new_features.head()

SK_ID_CURR,DAYS_DETAILS_CHANGE_SUM,CREDIT_GOODS_RATIO,NAME_INCOME_TYPE_Working,REGION_RATING_MAX,REGION_RATING_CLIENT_W_CITY,REGION_RATING_MUL,REGION_RATING_MEAN,DAYS_ID_PUBLISH,REGION_RATING_CLIENT,DAYS_DETAILS_CHANGE_MUL,...,150,151,152,153,154,155,156,157,158,159
307474,-11057.0,1.2376,True,2,2,4,2.0,-2278,2,-3736622000.0,...,-8.744918,10.716136,26.733236,-24.306649,-42.859176,6.50274,6.049826,7.543491,1.432526,13.380056
412537,-7557.0,1.105608,True,2,2,4,2.0,-355,2,-1865693000.0,...,-5.222404,-0.3609,-0.178559,2.609293,8.641962,10.025218,4.397794,13.914318,-13.017894,15.589601
149084,-14155.0,1.1188,True,2,2,4,2.0,-4126,2,-45028910000.0,...,-6.451924,5.89288,1.003496,-1.991941,11.417974,3.603652,6.290238,-6.883073,-3.595558,-0.168557
364692,-6817.0,1.079196,True,2,2,4,2.0,-3253,2,-8955587000.0,...,-11.160797,-1.174208,7.739854,-0.028688,4.048036,-8.805761,8.704549,6.506291,4.020488,5.384275
155423,-14388.0,1.198,False,2,2,4,2.0,-5085,2,-95966900000.0,...,-53.920263,108.95207,60.943014,-19.897708,-12.890648,-10.116053,-8.175407,131.811324,-63.724345,20.736898


In [5]:
y_target.head()

SK_ID_CURR,TARGET
307474,0
412537,0
149084,0
364692,0
155423,0


### Impute Implementation
For the imputation procedure, we’ll use the median strategy for continuous features (the strategy reported in the function output below) and the most frequent value for categorical features when replacing missing values. We then scale each feature to the range 0 to 1, so that features measured on very different scales share a common range. Both transformations are applied to the X data only; the target is left unchanged.
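As an illustration of what this step does to a single numeric column, here is a minimal sketch in plain Python. It is not the implementation of `preprocess_data_for_split`, which this notebook does not show. The helper names `impute_median` and `min_max_scale` and the toy column are hypothetical. In scikit-learn, the corresponding tools would be `SimpleImputer(strategy="median")` and `MinMaxScaler()`.

```python
from statistics import median
from typing import List, Optional


def impute_median(column: List[Optional[float]]) -> List[float]:
    """Replace missing entries (None) with the median of the observed values."""
    observed = [v for v in column if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in column]


def min_max_scale(column: List[float]) -> List[float]:
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no spread; map every value to 0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Toy column with one missing value, invented for illustration only.
raw = [2.0, None, 4.0, 10.0]
filled = impute_median(raw)     # median of [2, 4, 10] is 4
scaled = min_max_scale(filled)  # values now lie in [0, 1]
print(filled)
print(scaled)
```

One design point to keep in mind: if the median, minimum, and maximum are computed on the full dataset before the train/test split, information from the test rows leaks into the training features. Fitting these statistics on the training split only, then applying them to the test split, avoids that.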

In [7]:
X_df3_reduced_df, y_target = preprocess_data_for_split(df3_reduced_df, y_target)
print('\n')
X_df3_reduced_df_new_features, y_target = preprocess_data_for_split(df3_reduced_df_new_features, y_target)

Impute strategy for missing values uses median.
Scale ranges of Data is between 0 and 1.
Transfoming methods are applied to X data only.
X data shape: (30751, 214)
y data shape: (30751, 1)


Impute strategy for missing values uses median.
Scale ranges of Data is between 0 and 1.
Transfoming methods are applied to X data only.
X data shape: (30751, 220)
y data shape: (30751, 1)


We’ll simplify the process by using a single pipeline through the function `evaluate_models_with_resampling`. It builds and trains the four models on a given dataset and reports precision, recall, F1 score, AUC-ROC, and alert rate for each, so the baseline and the chosen data can be compared under identical settings (`test_size=0.3`, `random_state=45`).

In [17]:
import time
start_time = time.time()
evaluate_models_with_resampling(X_df3_reduced_df, y_target, test_size=0.3, random_state=45)
end_time = time.time()

elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time:.2f} seconds")

[LightGBM] [Info] Number of positive: 19796, number of negative: 19796
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.037746 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 52980
[LightGBM] [Info] Number of data points in the train set: 39592, number of used features: 214
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Number of positive: 1729, number of negative: 19796
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.019330 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 43626
[LightGBM] [Info] Number of data points in the train set: 21525, number of used features: 214
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.080325 -> initscore=-2.437937
[LightGBM] [Info] Start training from score -2.437937
Classification Report:
               precision    recall  f1-sc

In [18]:
start_time = time.time()
evaluate_models_with_resampling(X_df3_reduced_df_new_features, y_target, test_size=0.3, random_state=45)
end_time = time.time()

elapsed_time = end_time - start_time
print(f"Elapsed time: {elapsed_time:.2f} seconds")

[LightGBM] [Info] Number of positive: 19796, number of negative: 19796
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.039903 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 54700
[LightGBM] [Info] Number of data points in the train set: 39592, number of used features: 220
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.500000 -> initscore=0.000000
[LightGBM] [Info] Number of positive: 1729, number of negative: 19796
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.018628 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 48719
[LightGBM] [Info] Number of data points in the train set: 21525, number of used features: 220
[LightGBM] [Info] [binary:BoostFromScore]: pavg=0.080325 -> initscore=-2.437937
[LightGBM] [Info] Start training from score -2.437937
Classification Report:
               precision    recall  f1-sc

The results show that the chosen approach, which is built on data with manually created features, performs better than the baseline on metrics such as AUC-ROC and alert rate. Among the four models, CatBoost with class weights provides the best results overall. Based on this, we’ll move forward with CatBoost and fine-tune its parameters to improve performance further.
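The LightGBM logs above show what the two imbalance strategies do to the training set: with SMOTE the classes are equalised at 19,796 each, while the class-weighted runs keep the original 1,729 positives and 19,796 negatives (`pavg=0.080325`) and reweight them instead. A common way to derive such weights is the "balanced" heuristic, `n_samples / (n_classes * class_count)`. Whether `evaluate_models_with_resampling` uses exactly this heuristic is not shown in this notebook, so the sketch below is an assumption, and `balanced_class_weights` is a hypothetical helper name.

```python
from collections import Counter
from typing import Dict, Hashable, Iterable


def balanced_class_weights(labels: Iterable[Hashable]) -> Dict[Hashable, float]:
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Class counts taken from the training log above (class-weighted runs).
labels = [0] * 19796 + [1] * 1729
weights = balanced_class_weights(labels)
for cls in sorted(weights):
    print(f"class {cls}: weight {weights[cls]:.4f}")

# Under these weights each class contributes the same total weight.
print(weights[0] * 19796, weights[1] * 1729)
```

With these counts, a positive example carries roughly 11.4 times the weight of a negative one (19796 / 1729), which pushes the model to pay more attention to the minority class without generating synthetic rows.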