# <center> **Home Credit Default Risk Assessment**
# <center> **Final Modeling**

# **Introduction**

In this part of the project, I compare: Logistic Regression, Random Forest, XGB and LightGBM and measure their performance using ROC-AUC socres. I use Optuna for hyperparameter tuning of the best performing model.

# **Libraries**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.utils import shuffle
from sklearn.model_selection import train_test_split

from lightgbm import LGBMClassifier
import lightgbm as lgb
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier

from feature_engine.imputation import ArbitraryNumberImputer
from feature_engine.encoding import WoEEncoder
from feature_engine.imputation import CategoricalImputer

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_predict

import functions
import importlib
importlib.reload(functions)

import optuna
import pickle
import warnings
import time

  from .autonotebook import tqdm as notebook_tqdm


# **Display**

In [2]:
%matplotlib inline

pd.options.display.max_rows = 300000
pd.options.display.max_columns = 999
pd.options.display.max_colwidth = 500

warnings.filterwarnings("ignore")

warnings.simplefilter(action="ignore", category=FutureWarning)

pd.set_option('display.max_rows', 200)

# **Load Data**

In [3]:
pd.set_option('use_inf_as_na', True)

data = pd.read_csv(
    r"C:\Users\Dell\Documents\AI\Risk\Data\Data\data 27.csv",
    index_col=False
)

data = data.drop('SK_ID_CURR', axis=1)

## **Variables**

In [4]:
random_state = 101
target = 'TARGET'

## **Imputation**

Imputation of missing values in the numeric and categorical columns 

In [5]:
ani = ArbitraryNumberImputer(arbitrary_number=-99999)
ani.fit(data)
data = ani.transform(data)

In [6]:
ci = CategoricalImputer(imputation_method='missing', fill_value='UNKNOWN')
ci.fit(data)
data = ci.transform(data)

clean_data = data.copy()

## **Train Test Split**

Dividing the dataset into train and test sets.

In [8]:
X = clean_data.drop(target, axis=1)
y = clean_data[target]

X, y = shuffle(X, y, random_state=random_state)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=random_state)

## **Modeling**

I use Logistic Regression, Random Forest, XGB Classifier and LGBM.

In [10]:
columns_to_scale = ['ANNUITY_TO_CREDIT_RATIO', 
                    'EXT_SOURCE_3',
                    'EXT_SOURCE_2',
                    'EXT_SOURCE_1',
                    'EXT_SOURCE_MEAN',
                    'ANNUAL_PAYMENT_TO_CREDIT_RATIO',
                    'AGE',
                    'YEARS_ID_PUBLISH',
                    'AMT_ANNUITY',
                    'AMT_GOODS_PRICE',
                    'ANNUITY_TO_INCOME_RATIO',
                    'YEARS_REGISTRATION',
                    'YEARS_LAST_PHONE_CHANGE',
                    'YEARS_EMPLOYED_AGE_PRODUCT',
                    'INCOME_TO_AGE_RATIO',
                    'REGION_POPULATION_RELATIVE',
                    'AVG_MAX_DPD',
                    'TOTAL_DEBIT',
                    'TOTAL_CREDIT_AMT',
                    'DEBT_CREDIT_RATIO',
                    'AVG_ANNUITY_AMOUNT',
                    'AVG_DAYS_DECISION',
                    'RANGE_DAYS_FIRST_DUE',
                    'RANGE_DAYS_LAST_DUE',
                    'SUM_AMT_INSTALMENT',
                    'AVG_AMT_INSTALMENT',
                    'SUM_AMT_PAYMENT',
                    'AVG_AMT_PAYMENT',
                    'MAX_AMT_PAYMENT',
                    'MIN_AMT_PAYMENT',
                    'SUM_AMT_PAYMENT/SUM_AMT_INSTALMENT',
                    'MEAN_AMT_PAYMENT-MEAN_AMT_INSTALMENT'
                    ]

columns_to_encode =  ['ORGANIZATION_TYPE']

preprocessor = ColumnTransformer(
    transformers=[
        ('scaler', StandardScaler(), columns_to_scale),
        ('encoder', WoEEncoder(fill_value=.000001), columns_to_encode)
    ],
    remainder='passthrough' 
)


lg_model = LogisticRegression(class_weight='balanced', random_state=random_state, max_iter=5000)
lg_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('lg', lg_model)
])


rf_model = RandomForestClassifier(class_weight='balanced', random_state=random_state)
rf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('random_forest', rf_model)
])


xgb_model = XGBClassifier(class_weight='balanced', random_state=random_state, verbosity=0)
xgb_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('xgb', xgb_model)
])


lgbm_model = LGBMClassifier(class_weight='balanced', random_state=random_state, verbose=0)
lgbm_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('lgbm', lgbm_model)
])


pipelines = {
    "Logistic Regression": lg_pipeline,
    "Random Forest": rf_pipeline,
    "XGBClassifier": xgb_pipeline,
    "LightGBM": lgbm_pipeline
 
}

for name, pipeline in pipelines.items():
    start_time = time.time()

    y_pred_proba = cross_val_predict(pipeline, X, y, cv=10, method='predict_proba')[:, 1]
    
    roc_auc = roc_auc_score(y, y_pred_proba)
    
    end_time = time.time()
    elapsed_time = (end_time - start_time) / 60

    print(f"{name}: ROC AUC = {roc_auc:.2f} ({elapsed_time:.2f} minutes)")


Logistic Regression: ROC AUC = 0.63 (3.84 minutes)
Random Forest: ROC AUC = 0.73 (2.33 minutes)
XGBClassifier: ROC AUC = 0.72 (0.21 minutes)
LightGBM: ROC AUC = 0.74 (0.18 minutes)


# **Optuna**   

Here I use Optuna for hyperparameter tuning.

In [7]:
woe = WoEEncoder(fill_value=0.0001)
woe.fit(clean_data, clean_data[target])
encoded_data = woe.transform(clean_data)

In [8]:
X =encoded_data.drop(target, axis=1)
y = encoded_data[target]

X, y = shuffle(X, y, random_state=random_state)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=random_state)

## **LightGBM**

In [33]:
def objective(trial):
    param = {
        'boosting_type': trial.suggest_categorical('boosting_type', ['gbdt', 'dart', 'goss']),
        'num_leaves': trial.suggest_int('num_leaves', 20, 1000), 
        'max_depth': trial.suggest_int('max_depth', -1, 64),
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'learning_rate': trial.suggest_loguniform('learning_rate', 0.01, 1.0),
        'min_child_samples': trial.suggest_int('min_child_samples', 1, 500),  
        'min_gain_to_split': trial.suggest_loguniform('min_gain_to_split', 0.0001, 1.0)   
    }
    
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)
    
    train_data = lgb.Dataset(X_train_scaled, label=y_train)
    valid_data = lgb.Dataset(X_test_scaled, label=y_test, reference=train_data)

    gbm = lgb.train(
        param,
        train_data,
        valid_sets=[valid_data],  
        num_boost_round=100,
        callbacks=[lgb.early_stopping(stopping_rounds=10, verbose=False)]
    )

    y_pred = gbm.predict(X_test_scaled, num_iteration=gbm.best_iteration)

    roc_auc = roc_auc_score(y_test, y_pred)
    
    return roc_auc


study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

print("Best trial:")
trial = study.best_trial

print("  Value: {}".format(trial.value))
print("  Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))


[I 2024-10-21 18:52:34,728] A new study created in memory with name: no-name-6e405dd0-ebfa-4fbf-8f1a-861196090f8a
[I 2024-10-21 18:52:45,433] Trial 0 finished with value: 0.7639339114362902 and parameters: {'boosting_type': 'goss', 'num_leaves': 905, 'max_depth': 36, 'n_estimators': 176, 'learning_rate': 0.08246439374394156, 'min_child_samples': 194, 'min_gain_to_split': 0.035782813568435996}. Best is trial 0 with value: 0.7639339114362902.
[I 2024-10-21 18:52:50,197] Trial 1 finished with value: 0.7623579710658642 and parameters: {'boosting_type': 'dart', 'num_leaves': 832, 'max_depth': 24, 'n_estimators': 63, 'learning_rate': 0.1348266610981248, 'min_child_samples': 423, 'min_gain_to_split': 0.9273121311831558}. Best is trial 0 with value: 0.7639339114362902.
[I 2024-10-21 18:52:54,981] Trial 2 finished with value: 0.764506874417849 and parameters: {'boosting_type': 'goss', 'num_leaves': 158, 'max_depth': 57, 'n_estimators': 242, 'learning_rate': 0.1315159470580571, 'min_child_sample

Best trial:
  Value: 0.7750874118732859
  Params: 
    boosting_type: goss
    num_leaves: 67
    max_depth: 11
    n_estimators: 426
    learning_rate: 0.023663556125518563
    min_child_samples: 270
    min_gain_to_split: 0.026685985757266675


### **LGBM Pipeline Optuna Optimized**

I use the hyperparameters identified by Optuna to tune my LGBM model.

In [9]:
columns_to_scale = ['ANNUITY_TO_CREDIT_RATIO', 
                    'EXT_SOURCE_3',
                    'EXT_SOURCE_2',
                    'EXT_SOURCE_1',
                    'EXT_SOURCE_MEAN',
                    'ANNUAL_PAYMENT_TO_CREDIT_RATIO',
                    'AGE',
                    'YEARS_ID_PUBLISH',
                    'AMT_ANNUITY',
                    'AMT_GOODS_PRICE',
                    'ANNUITY_TO_INCOME_RATIO',
                    'YEARS_REGISTRATION',
                    'YEARS_LAST_PHONE_CHANGE',
                    'YEARS_EMPLOYED_AGE_PRODUCT',
                    'INCOME_TO_AGE_RATIO',
                    'REGION_POPULATION_RELATIVE',
                    'AVG_MAX_DPD',
                    'TOTAL_DEBIT',
                    'TOTAL_CREDIT_AMT', 
                    'DEBT_CREDIT_RATIO',
                    'AVG_ANNUITY_AMOUNT',
                    'AVG_DAYS_DECISION',
                    'RANGE_DAYS_FIRST_DUE',
                    'RANGE_DAYS_LAST_DUE',
                    'SUM_AMT_INSTALMENT',
                    'AVG_AMT_INSTALMENT',
                    'SUM_AMT_PAYMENT',
                    'AVG_AMT_PAYMENT',
                    'MAX_AMT_PAYMENT',
                    'MIN_AMT_PAYMENT',
                    'SUM_AMT_PAYMENT/SUM_AMT_INSTALMENT',
                    'MEAN_AMT_PAYMENT-MEAN_AMT_INSTALMENT'
                    ]

columns_to_encode = ['ORGANIZATION_TYPE']

preprocessor = ColumnTransformer(
    transformers=[
        ('scaler', StandardScaler(), columns_to_scale),
        ('encoder', WoEEncoder(fill_value=.000001), columns_to_encode)
    ],
    remainder='passthrough'
)

lgbm_model = lgb.LGBMClassifier(boosting_type='goss', 
                           num_leaves=67, 
                           max_depth=11, 
                           learning_rate=0.023663556125518563, 
                           n_estimators=426,
                            min_child_samples=270,
                            min_gain_to_split=0.026685985757266675, 
                           verbose=-1)

lgbm_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('lgbm', lgbm_model)
])

pipelines = {
    "lgbm": lgbm_pipeline,
}


for name, pipeline in pipelines.items():
    pipeline.fit(X_train, y_train)
    y_prob = pipeline.predict_proba(X_test)[:, 1]
    auc_score = roc_auc_score(y_test, y_prob)
    print(f"AUC Score: {auc_score:.2f}")

AUC Score: 0.77


## **Random Forest Classifier**

In [None]:
def objective(trial):
    param = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'max_depth': trial.suggest_int('max_depth', 1, 64),
        'min_samples_split': trial.suggest_int('min_samples_split', 2, 20),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 20)
    }

    model = RandomForestClassifier(**param, random_state=0)
    
    model.fit(X_train, y_train)

    y_pred_proba = model.predict_proba(X_test)[:, 1]
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    
    return roc_auc

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=10)

print("Best trial:")
trial = study.best_trial

print("  Value: {}".format(trial.value))
print("  Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))

## **XGB Classifier**

In [None]:
def objective(trial):
    param = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'learning_rate': trial.suggest_loguniform('learning_rate', 0.01, 1.0),
        'max_depth': trial.suggest_int('max_depth', 3, 64),
        'subsample': trial.suggest_uniform('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_uniform('colsample_bytree', 0.5, 1.0),
    }

    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    model = XGBClassifier(**param, use_label_encoder=False, eval_metric='logloss', random_state=0)
    
    model.fit(X_train_scaled, y_train, verbose=False)

    y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    
    return roc_auc

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=5)

print("Best trial:")
trial = study.best_trial

print("  Value: {}".format(trial.value))
print("  Params: ")
for key, value in trial.params.items():
    print("    {}: {}".format(key, value))


[I 2024-10-29 17:22:06,064] A new study created in memory with name: no-name-def73e19-20c9-48e2-bd12-2b17cd4e7d68
[I 2024-10-29 17:25:10,842] Trial 0 finished with value: 0.768472047774617 and parameters: {'n_estimators': 430, 'learning_rate': 0.012685207556507958, 'max_depth': 14, 'subsample': 0.5920557351301199, 'colsample_bytree': 0.9202741690730589}. Best is trial 0 with value: 0.768472047774617.
[I 2024-10-29 17:26:51,722] Trial 1 finished with value: 0.7466922047168746 and parameters: {'n_estimators': 64, 'learning_rate': 0.03438816708945239, 'max_depth': 56, 'subsample': 0.5449450580496265, 'colsample_bytree': 0.8521524054833501}. Best is trial 0 with value: 0.768472047774617.
[I 2024-10-29 17:29:38,095] Trial 2 finished with value: 0.7600512114217857 and parameters: {'n_estimators': 159, 'learning_rate': 0.023047606646611422, 'max_depth': 18, 'subsample': 0.7844473036154253, 'colsample_bytree': 0.9032323448834018}. Best is trial 0 with value: 0.768472047774617.


# **Pickle File for Streamlit Deployment**

I will later use Streamlilt for deployment (See notebook 13.0). Here, I create a Pickle file for the Streamlit application.

In [None]:
with open('lgbm_pipeline.pkl', 'wb') as file:
    pickle.dump(lgbm_pipeline, file)

# **Summary**

> * **LightGBM** — The best performing model was LightGBM. 
> * **Optuna** — After hyperparameter tuning using Optuna, ROC-AUC score inceased by 3%, from 74% to 77%. 
> * **Pickle File** — I created a Pickle file for Streamlit deployment. 