# Lending Club Dataset:
# ML Model Training, Tuning, and Deployment - Notebook 2/2

## Objectives:
- Create a preprocessing Pipeline
- Create baseline models
- Build three models to:
    - classify loans into "accept/reject";
    - predict the grade of the loan;
    - predict the sub-grade and interest rate of the loan.
- Perform model selection, tune hyperparameters for the best-performing models, and test those models.
- Deploy the models to Google Cloud Platform.

## Additional Notes:
- The main performance metrics for classification tasks are ROC-AUC(for initial model selection) and F2 since we want to catch most potentially problematic loans, but also avoid flagging too many good loans and performing too much costly additional assessment. Ideally, this metric should be discussed with the client beforehand.
- The main performance metric for regression tasks is R2. 

In [608]:
import pandas as pd
import numpy as np
pd.set_option('display.max_columns', None)

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import datetime as dt
import openpyxl
import re
import textwrap
from matplotlib.gridspec import GridSpec
import scipy.stats as stats
from statsmodels.stats.proportion import proportions_ztest
import statsmodels.api as sm
from sklearn.metrics import confusion_matrix
import warnings

from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from category_encoders import BinaryEncoder
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

from sklearn.model_selection import train_test_split

from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import fbeta_score
from sklearn.metrics import classification_report, mean_squared_error, r2_score
from sklearn.metrics import make_scorer, precision_score, recall_score, accuracy_score
from sklearn.metrics import mean_absolute_error, explained_variance_score
from sklearn.dummy import DummyClassifier
from sklearn.utils import resample
from sklearn.decomposition import PCA
import shap

from sklearn.multioutput import MultiOutputRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor, AdaBoostRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from catboost import CatBoostRegressor
from sklearn.linear_model import LogisticRegression
from tpot import TPOTClassifier, TPOTRegressor
from sklearn.multioutput import MultiOutputRegressor
from sklearn.base import BaseEstimator
import joblib
import dill

import helper_functions as hf

# 1. Preprocessing Pipelines

In [129]:
accepted_data = pd.read_csv('accepted_cleaned_reduced.csv', index_col='id')

In [130]:
accepted_data.head()

Unnamed: 0_level_0,loan_amnt,int_rate,grade,sub_grade,home_ownership,annual_inc,verification_status,loan_status,purpose,addr_state,dti,delinq_2yrs,inq_last_6mths,open_acc,pub_rec,revol_bal,initial_list_status,collections_12_mths_ex_med,application_type,acc_now_delinq,tot_coll_amt,acc_open_past_24mths,avg_cur_bal,chargeoff_within_12_mths,delinq_amnt,mo_sin_old_il_acct,mo_sin_old_rev_tl_op,mo_sin_rcnt_rev_tl_op,mo_sin_rcnt_tl,mort_acc,mths_since_recent_bc,mths_since_recent_inq,num_accts_ever_120_pd,num_actv_bc_tl,num_il_tl,num_rev_accts,num_tl_120dpd_2m,num_tl_30dpd,num_tl_90g_dpd_24m,num_tl_op_past_12m,pct_tl_nvr_dlq,pub_rec_bankruptcies,tax_liens,total_bc_limit,total_il_high_credit_limit,disbursement_method,term_months,emp_length_years,emp_length_numeric,fico_range_avg,utilization_rate,delinquent,high_utilization,last_fico_range_avg
id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1,Unnamed: 22_level_1,Unnamed: 23_level_1,Unnamed: 24_level_1,Unnamed: 25_level_1,Unnamed: 26_level_1,Unnamed: 27_level_1,Unnamed: 28_level_1,Unnamed: 29_level_1,Unnamed: 30_level_1,Unnamed: 31_level_1,Unnamed: 32_level_1,Unnamed: 33_level_1,Unnamed: 34_level_1,Unnamed: 35_level_1,Unnamed: 36_level_1,Unnamed: 37_level_1,Unnamed: 38_level_1,Unnamed: 39_level_1,Unnamed: 40_level_1,Unnamed: 41_level_1,Unnamed: 42_level_1,Unnamed: 43_level_1,Unnamed: 44_level_1,Unnamed: 45_level_1,Unnamed: 46_level_1,Unnamed: 47_level_1,Unnamed: 48_level_1,Unnamed: 49_level_1,Unnamed: 50_level_1,Unnamed: 51_level_1,Unnamed: 52_level_1,Unnamed: 53_level_1,Unnamed: 54_level_1
68407277,3600,13.99,C,C4,MORTGAGE,55000,Not Verified,Fully Paid,debt_consolidation,PA,5.91,0,1,7,0,2765.0,w,0,Individual,0,722.0,4,20701.0,0,0.0,148,128,3,3,1,4,4,2,2,3,9,0,0,0,3,76.9,0,0,2400.0,13734.0,Cash,36,10+,10,677.0,1.252104,0,0,562.0
68355089,24700,11.99,C,C1,MORTGAGE,65000,Not Verified,Fully Paid,small_business,SD,16.06,1,4,22,0,21470.0,w,0,Individual,0,0.0,4,9733.0,0,0.0,113,192,2,2,4,2,0,0,5,6,27,0,0,0,2,97.4,0,0,79300.0,24667.0,Cash,36,10+,10,717.0,1.410724,0,0,697.0
68476807,10400,22.45,F,F1,MORTGAGE,104433,Source Verified,Fully Paid,major_purchase,PA,25.37,1,3,12,0,21929.0,w,0,Individual,0,0.0,10,27644.0,0,0.0,128,210,4,4,6,4,1,0,4,10,19,0,0,0,4,96.6,0,0,20300.0,88097.0,Cash,60,1-3,3,697.0,1.201364,0,1,702.0
68476668,20000,9.17,B,B2,MORTGAGE,180000,Not Verified,Fully Paid,debt_consolidation,MN,14.67,0,0,12,0,87329.0,f,0,Individual,0,0.0,6,30030.0,0,0.0,142,306,10,10,4,12,10,0,4,7,16,0,0,0,2,96.3,0,0,31500.0,46452.0,Cash,36,10+,10,682.0,1.217607,0,1,652.0
67275481,20000,8.49,B,B1,MORTGAGE,85000,Not Verified,Fully Paid,major_purchase,SC,17.61,1,0,8,0,826.0,w,0,Individual,0,0.0,4,17700.0,0,0.0,149,55,32,13,3,32,8,1,2,9,3,0,0,1,0,93.3,0,0,14500.0,36144.0,Cash,36,10+,10,707.0,0.998249,0,0,672.0


In [131]:
accepted_data_filtered = accepted_data[accepted_data['loan_status'].isin(['Fully Paid', 'Charged Off'])]

accepted_data_filtered.groupby('loan_status').size()

loan_status
Charged Off    195388
Fully Paid     790623
dtype: int64

In [132]:
fully_paid = accepted_data_filtered[accepted_data_filtered['loan_status']=='Fully Paid']
charged_off = accepted_data_filtered[accepted_data_filtered['loan_status']=='Charged Off']

fully_paid_undersampled = resample(fully_paid, replace=False,  
                                  n_samples=len(charged_off), 
                                  random_state=123)

balanced_data = pd.concat([charged_off, fully_paid_undersampled])

balanced_data = balanced_data.sample(frac=1) 

print(balanced_data.groupby('loan_status').size())

sampled_data = balanced_data.groupby('loan_status').apply(lambda x: x.sample(n=5000, random_state=42)).reset_index(drop=True)
print(sampled_data.groupby('loan_status').size())

loan_status
Charged Off    195388
Fully Paid     195388
dtype: int64
loan_status
Charged Off    5000
Fully Paid     5000
dtype: int64


In [562]:
X = sampled_data.drop(['loan_status', 'grade', 'sub_grade', 'int_rate'], axis=1)
y_loan_status = sampled_data['loan_status']
y_grade = sampled_data['grade']
y_sub_grade = sampled_data['sub_grade']
y_int_rate = sampled_data['int_rate']

In [609]:
ordinal_cols = ['emp_length_years']

numeric_cols = X.select_dtypes(include=['int64', 'float64']).columns

non_ordinal_cols = X.select_dtypes(exclude=['int64', 'float64']).columns.tolist()
non_ordinal_cols = [col for col in non_ordinal_cols if col not in ordinal_cols]

emp_length_mapping = {'<1': 1, '1-3': 2, '4-6': 3, '7-9': 4, '10+': 5}

def map_emp_length(X):
    X['emp_length_years'] = X['emp_length_years'].map(emp_length_mapping)
    return X

emp_length_transformer = FunctionTransformer(map_emp_length, validate=False)

addr_state_categories = ['PA', 'SD', 'MN', 'SC', 'RI', 'CA', 'VA', 
                         'AZ', 'MD', 'NY', 'TX', 'KS', 'NM', 'AL', 
                         'WA', 'OH', 'GA', 'IL', 'FL', 'CO', 'IN', 
                         'MI', 'MO', 'DC', 'MA', 'WI', 'NJ', 'DE', 
                         'TN', 'NH', 'NE', 'OR', 'NC', 'AR', 'NV', 
                         'WV', 'LA', 'HI', 'WY', 'KY', 'OK', 'CT', 
                         'VT', 'MS', 'UT', 'ND', 'ME', 'AK', 'MT', 
                         'ID', 'IA']


preprocessor_features = ColumnTransformer(
    transformers=[
        ('numeric', MinMaxScaler(), numeric_cols),
        ('emp_length_years', emp_length_transformer, ['emp_length_years']),
        ('non_ordinal', OneHotEncoder(handle_unknown='ignore', 
                                      drop='first'), non_ordinal_cols),
    ], 
    remainder='passthrough',
)

encoding_order = {
    'loan_status': None,  
    'grade': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
    'sub_grade': ['A1', 'A2', 'A3', 'A4', 'A5', 'B1', 'B2', 'B3', 'B4', 'B5',
                  'C1', 'C2', 'C3', 'C4', 'C5', 'D1', 'D2', 'D3', 'D4', 'D5',
                  'E1', 'E2', 'E3', 'E4', 'E5', 'F1', 'F2', 'F3', 'F4', 'F5',
                  'G1', 'G2', 'G3', 'G4', 'G5'],
    'int_rate': None  
}

preprocessor_targets = {}
for target, order in encoding_order.items():
    if target == 'loan_status':
        preprocessor_targets[target] = OneHotEncoder(drop="first", sparse_output=False)
    elif order is None:
        preprocessor_targets[target] = MinMaxScaler()
    else:
        label_encoder = LabelEncoder()
        label_encoder.classes_ = order
        preprocessor_targets[target] = label_encoder

preprocessor_feature_pipeline = Pipeline([
    ('preprocessor', preprocessor_features)
])

preprocessor_target_pipeline = ColumnTransformer(
    transformers=[(target, preprocessor, [target]) for target, 
                  preprocessor in preprocessor_targets.items()]
)

In [564]:
def perform_stratified_split(X, y, target_column, test_size=0.2, random_state=42,
                             stratify=True):
    if stratify:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=random_state, stratify=y
        )
        return X_train, X_test, y_train, y_test
    else:
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=test_size, random_state=random_state,
        )
        return X_train, X_test, y_train, y_test


X_train_status, X_test_status, y_train_status, y_test_status = perform_stratified_split(
    X, y_loan_status, 'loan_status'
)

X_train_grade, X_test_grade, y_train_grade, y_test_grade = perform_stratified_split(
    X, y_grade, 'grade'
)

X_train_sub_grade, X_test_sub_grade, y_train_sub_grade, y_test_sub_grade = perform_stratified_split(
    X, y_sub_grade, 'sub_grade',
)

X_train_int_rate, X_test_int_rate, y_train_int_rate, y_test_int_rate = perform_stratified_split(
    X, y_int_rate, 'int_rate', stratify=False,
)

In [576]:
warnings.filterwarnings("ignore", category=UserWarning, module="sklearn.preprocessing._encoders")

X_train_all_p = preprocessor_feature_pipeline.fit_transform(X)

X_train_status_p = preprocessor_feature_pipeline.transform(X_train_status)
X_train_grade_p = preprocessor_feature_pipeline.transform(X_train_grade)
X_train_sub_grade_p = preprocessor_feature_pipeline.transform(X_train_sub_grade)
X_train_int_rate_p = preprocessor_feature_pipeline.transform(X_train_int_rate)

X_test_status_p = preprocessor_feature_pipeline.transform(X_test_status)
X_test_grade_p = preprocessor_feature_pipeline.transform(X_test_grade)
X_test_sub_grade_p = preprocessor_feature_pipeline.transform(X_test_sub_grade)
X_test_int_rate_p = preprocessor_feature_pipeline.transform(X_test_int_rate)

In [579]:
y_train_status_reshaped = np.array(y_train_status).reshape(-1, 1)
y_train_status_p = preprocessor_targets['loan_status'].fit_transform(y_train_status_reshaped)
y_test_status_reshaped = np.array(y_test_status).reshape(-1, 1)
y_test_status_p = preprocessor_targets['loan_status'].transform(y_test_status_reshaped)

y_train_grade_p = preprocessor_targets['grade'].fit_transform(y_train_grade)
y_test_grade_p = preprocessor_targets['grade'].transform(y_test_grade)

y_train_sub_grade_p = preprocessor_targets['sub_grade'].fit_transform(y_train_sub_grade)
y_test_sub_grade_p = preprocessor_targets['sub_grade'].transform(y_test_sub_grade)


y_train_int_rate_reshaped = y_train_int_rate.values.reshape(-1, 1)
y_train_int_rate_p = preprocessor_targets['int_rate'].fit_transform(y_train_int_rate_reshaped)
y_test_int_rate_reshaped = y_test_int_rate.values.reshape(-1, 1)
y_test_int_rate_p = preprocessor_targets['int_rate'].transform(y_test_int_rate_reshaped)

In [581]:
joblib.dump(preprocessor_feature_pipeline, 'preprocessor_feature_pipeline.joblib')

['preprocessor_feature_pipeline.joblib']

# 2. Baseline Models

## 2.1. Baseline Model for Loan Status Classification

In [582]:
from sklearn.dummy import DummyClassifier

baseline_classifier = DummyClassifier(strategy='most_frequent')

baseline_classifier.fit(X_train_status_p, y_train_status_p)

baseline_preds = baseline_classifier.predict(X_test_status_p)
print("Baseline Classification Report:\n", classification_report(y_test_status_p, 
                                                                 baseline_preds, 
                                                                 zero_division=1))

f2 = fbeta_score(y_test_status_p, baseline_preds, beta=2)
roc_auc = roc_auc_score(y_test_status_p, baseline_preds)
print(f"F2 Score: {f2}")

Baseline Classification Report:
               precision    recall  f1-score   support

         0.0       0.50      1.00      0.67      1000
         1.0       1.00      0.00      0.00      1000

    accuracy                           0.50      2000
   macro avg       0.75      0.50      0.33      2000
weighted avg       0.75      0.50      0.33      2000

F2 Score: 0.0


## 2.2. Baseline Model for Grade Prediction

In [583]:
baseline_regressor_grade = DummyRegressor(strategy='mean')

baseline_regressor_grade.fit(X_train_grade_p, y_train_grade_p)

baseline_preds_grade = baseline_regressor_grade.predict(X_test_grade_p)
print("Baseline Mean Squared Error for Grade Prediction:", mean_squared_error(y_test_grade_p, 
                                                                              baseline_preds_grade))

r2 = r2_score(y_test_grade_p, baseline_preds_grade)
print(f'R-squared (R2): {r2}')

Baseline Mean Squared Error for Grade Prediction: 1.8264790000000002
R-squared (R2): 0.0


In [584]:
baseline_classifier_grade = DummyClassifier(strategy='stratified')

baseline_classifier_grade.fit(X_train_grade_p, y_train_grade_p)

baseline_preds_grade = baseline_classifier_grade.predict(X_test_grade_p)

print("Baseline Classification Report:\n", classification_report(y_test_grade_p, 
                                                                 baseline_preds_grade, 
                                                                 zero_division=1))

f2 = fbeta_score(y_test_grade_p, baseline_preds_grade, beta=2, 
                 average='weighted')  
print(f"F2 Score: {f2}")

Baseline Classification Report:
               precision    recall  f1-score   support

           0       0.11      0.11      0.11       240
           1       0.26      0.27      0.27       506
           2       0.28      0.28      0.28       600
           3       0.18      0.17      0.18       371
           4       0.08      0.08      0.08       182
           5       0.07      0.07      0.07        75
           6       0.05      0.04      0.04        26

    accuracy                           0.21      2000
   macro avg       0.15      0.15      0.15      2000
weighted avg       0.21      0.21      0.21      2000

F2 Score: 0.20954849775199028


## 2.3. Baseline Model for Sub-Grade Prediction

In [585]:
baseline_regressor_sub_grade = DummyRegressor(strategy='mean')

baseline_regressor_sub_grade.fit(X_train_sub_grade_p, y_train_sub_grade_p)

baseline_preds_sub_grade = baseline_regressor_sub_grade.predict(X_test_sub_grade_p)
print("Baseline Mean Squared Error for Sub-Grade Prediction:", mean_squared_error(y_test_sub_grade_p, 
                                                                                  baseline_preds_sub_grade))

r2 = r2_score(y_test_sub_grade_p, baseline_preds_sub_grade)
print(f'R-squared (R2): {r2}')

Baseline Mean Squared Error for Sub-Grade Prediction: 45.411275390625
R-squared (R2): -4.088001984259293e-06


## 2.4. Baseline Model for Interest Rate Prediction

In [586]:
baseline_regressor_int_rate = DummyRegressor(strategy='mean')

baseline_regressor_int_rate.fit(X_train_int_rate_p, y_train_int_rate_p)

baseline_preds_int_rate = baseline_regressor_int_rate.predict(X_test_int_rate_p)
print("Baseline Mean Squared Error for Interest Rate Prediction:", mean_squared_error(y_test_int_rate_p, 
                                                                                      baseline_preds_int_rate))
r2 = r2_score(y_test_int_rate_p, baseline_preds_int_rate)
print(f'R-squared (R2): {r2}')

Baseline Mean Squared Error for Interest Rate Prediction: 0.039398666237955716
R-squared (R2): -9.085056114099821e-05


# 3. Building and tuning ML models

## 3.1. Grid search for each modeling task

In [301]:
def r2_scorer(model, X, y_true):
    y_pred = model.predict(X)
    return r2_score(y_true, y_pred, multioutput='uniform_average')

In [302]:
def model_selection(models, params, X_train, y_train, scoring):
    results = []
    algo_best_params = {}

    for model, param in zip(models, params):
        pipe = Pipeline(steps=[('classifier', model)])
        search = GridSearchCV(pipe, param, cv=5, scoring=scoring, refit=scoring)
        search.fit(X_train, y_train)
        print(f"Algorithm: {model}")
        print(f'Best parameter: {search.best_params_}')
        print(f"Best {scoring}: {search.best_score_:.3f}")
        best_score = search.cv_results_['mean_test_score'][search.best_index_]
        print("-" * 30)

        best_pipe = search.best_estimator_
        algo_best_params[model.__class__.__name__] = search.best_params_

        results.append({
            'model': model.__class__.__name__,
            scoring: best_score,
            'std': search.cv_results_['std_test_score'][search.best_index_],
        })

    return results, algo_best_params

In [303]:
classification_models = [
    LogisticRegression(solver='lbfgs', max_iter=1000, random_state=42), 
    RandomForestClassifier(random_state=42),
    XGBClassifier(random_state=42),
    LGBMClassifier(verbose=-1, random_state=42),
]

classification_params = [
    {'classifier__C': [0.1, 0.5, 1, 10, 100],  # LogisticRegression
     'classifier__penalty': ['l2']},
    
    {'classifier__n_estimators': [100, 200, 500],  # RandomForestClassifier
     'classifier__max_depth': [5, 8, 15],
     'classifier__criterion': ['gini', 'entropy']},
     
     {'classifier__learning_rate': [0.05, 0.1, 0.2],  # XGBClassifier   
     'classifier__max_depth': [5, 8, 12],
     'classifier__colsample_bytree': [0.5, 0.8, 1.0]},
     
     {'classifier__num_leaves': [10, 15, 31],  #LGBMClassifier
     'classifier__learning_rate': [0.05, 0.1, 0.2],
     'classifier__n_estimators': [100, 150, 200]}
]

In [304]:
regression_models = [
    RandomForestRegressor(random_state=42), 
    XGBRegressor(random_state=42),
    LGBMRegressor(verbose=-1, random_state=42),
    CatBoostClassifier(verbose=0, random_state=42),
    SVC(random_state=42),
]

regression_params = [
  {'classifier__n_estimators': [50, 100, 200, 300], # RandomForestRegressor
   'classifier__max_depth': [None, 5, 10, 20],
   'classifier__min_samples_split': [2, 5, 7, 10],
   'classifier__max_features': [None, 'sqrt', 'log2', 0.1, 0.5, 0.9]}, 
   
  {'classifier__learning_rate': [0.05, 0.1, 0.2],  # XGBRegressor   
   'classifier__max_depth': [5, 8, 12],
   'classifier__colsample_bytree': [0.5, 0.8, 1.0]},
     
   {'classifier__num_leaves': [10, 15, 31],  # LGBMRegressor
    'classifier__learning_rate': [0.05, 0.1, 0.2],
    'classifier__n_estimators': [100, 150, 200]},

   {'classifier__iterations': [50, 100, 200], # CatBoostClassifier
    'classifier__learning_rate': [0.1, 0.5],
    'classifier__depth': [3, 5, 10]}, 

    {'classifier__C': [0.1, 1, 10],  # SVC
     'classifier__kernel': ['linear', 'rbf'],
     'classifier__gamma': ['scale', 'auto']},
]

In [305]:
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor

multi_output_models = [
    RandomForestRegressor(random_state=42),
    XGBRegressor(random_state=42),
    LGBMRegressor(verbose=-1, random_state=42),
    GradientBoostingRegressor(random_state=42),
]

multi_output_params = [
    {'classifier__max_features': ['sqrt', 'log2', None],  # RandomForestRegressor
     'classifier__max_depth': [5, 8, 15],
     'classifier__n_estimators': [100, 200, 500]},

    {'classifier__learning_rate': [0.05, 0.1, 0.2],  # XGBRegressor
     'classifier__max_depth': [5, 8, 12],
     'classifier__colsample_bytree': [0.5, 0.8, 1.0]},

    {'classifier__num_leaves': [10, 15, 31],  # LGBMRegressor
     'classifier__learning_rate': [0.05, 0.1, 0.2],
     'classifier__n_estimators': [100, 150, 200]},

    {'classifier__n_estimators': [50, 100, 150],  # GradientBoostingRegressor
     'classifier__learning_rate': [0.01, 0.5, 0.1, 0.2],
     'classifier__max_depth': [3, 4, 5]},
]

In [306]:
target_columns = ['loan_status', 'grade', 'sub_grade_int_rate']

for target in target_columns:

  if target == 'loan_status':
      models = classification_models  
      params = classification_params
      scoring = 'roc_auc'
      y_train = y_train_status_p
      X_train = X_train_status_p

  elif target == 'grade':
      models = regression_models
      params = regression_params 
      scoring = 'r2'
      y_train = y_train_grade_p
      X_train = X_train_grade_p

  else:
      models = multi_output_models
      params = multi_output_params
      scoring = 'r2'
      y_train = y_train_sub_grade_p
      X_train = X_train_sub_grade_p

  results, best_params = model_selection(
      models=models,
      params=params,   
      X_train=X_train,
      y_train=y_train.ravel(), 
      scoring=scoring)

  print(f"Results for {target}: ") 
  print(results)
  print("Best Params: ")
  print(best_params)
  print('-'*30)

Algorithm: LogisticRegression(max_iter=1000, random_state=42)
Best parameter: {'classifier__C': 1, 'classifier__penalty': 'l2'}
Best roc_auc: 0.949
------------------------------
Algorithm: RandomForestClassifier(random_state=42)
Best parameter: {'classifier__criterion': 'entropy', 'classifier__max_depth': 15, 'classifier__n_estimators': 500}
Best roc_auc: 0.943
------------------------------
Algorithm: XGBClassifier(base_score=None, booster=None, callbacks=None,
              colsample_bylevel=None, colsample_bynode=None,
              colsample_bytree=None, device=None, early_stopping_rounds=None,
              enable_categorical=False, eval_metric=None, feature_types=None,
              gamma=None, grow_policy=None, importance_type=None,
              interaction_constraints=None, learning_rate=None, max_bin=None,
              max_cat_threshold=None, max_cat_to_onehot=None,
              max_delta_step=None, max_depth=None, max_leaves=None,
              min_child_weight=None, miss

## 3.2. Automated ML modeling with TPOT

In [307]:
automl_classifier = TPOTClassifier(generations=10, population_size=20, 
                                    verbosity=2, scoring='roc_auc', cv=5)
                                    
automl_classifier.fit(X_train_status_p, y_train_status_p.ravel())
print(automl_classifier.score(X_test_status_p, y_test_status_p.ravel()))
automl_classifier.export('tpot_loan_status.py')

joblib.dump(automl_classifier.fitted_pipeline_, 'tpot_loan_status_model.joblib')

with open('tpot_loan_status_model.pkl', 'wb') as file:
    pickle.dump(automl_classifier.fitted_pipeline_, file)

Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.95077578125

Generation 2 - Current best internal CV score: 0.9508946875

Generation 3 - Current best internal CV score: 0.9508946875

Generation 4 - Current best internal CV score: 0.9508946875

Generation 5 - Current best internal CV score: 0.9509232812499999

Generation 6 - Current best internal CV score: 0.9509232812499999

Generation 7 - Current best internal CV score: 0.9509549999999999

Generation 8 - Current best internal CV score: 0.9509549999999999

Generation 9 - Current best internal CV score: 0.9509549999999999

Generation 10 - Current best internal CV score: 0.9510265625000001

Best pipeline: XGBClassifier(input_matrix, learning_rate=0.1, max_depth=1, min_child_weight=15, n_estimators=100, n_jobs=1, subsample=0.35000000000000003, verbosity=0)
0.9556045


In [308]:
automl_regressor = TPOTRegressor(generations=10, population_size=20, 
                               verbosity=2, scoring='r2', cv=5)
                               
automl_regressor.fit(X_train_grade_p, y_train_grade_p.ravel())  
print(automl_regressor.score(X_test_grade_p, y_test_grade_p.ravel()))
automl_regressor.export('tpot_loan_grade.py')

joblib.dump(automl_regressor.fitted_pipeline_, 'tpot_loan_grade_model.joblib')

with open('tpot_loan_grade_model.pkl', 'wb') as file:
    pickle.dump(automl_regressor.fitted_pipeline_, file)

Optimization Progress:   0%|          | 0/220 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.5280626673578872

Generation 2 - Current best internal CV score: 0.533929719294215

Generation 3 - Current best internal CV score: 0.5407388132328148

Generation 4 - Current best internal CV score: 0.5407388132328148

Generation 5 - Current best internal CV score: 0.5417150798566631

Generation 6 - Current best internal CV score: 0.5417150798566631

Generation 7 - Current best internal CV score: 0.5417354204272579

Generation 8 - Current best internal CV score: 0.5417354204272579

Generation 9 - Current best internal CV score: 0.5431707864591775

Generation 10 - Current best internal CV score: 0.5431707864591775

Best pipeline: ElasticNetCV(RidgeCV(XGBRegressor(input_matrix, learning_rate=0.5, max_depth=1, min_child_weight=20, n_estimators=100, n_jobs=1, objective=reg:squarederror, subsample=0.9000000000000001, verbosity=0)), l1_ratio=0.4, tol=0.1)
0.5079595895945581


In [338]:
from tpot import TPOTRegressor
from sklearn.multioutput import MultiOutputRegressor

automl_regressor_multi = TPOTRegressor(generations=5, population_size=40, 
                                       verbosity=2, scoring='r2', cv=5)
multioutput_regressor = MultiOutputRegressor(automl_regressor_multi)

multioutput_regressor.fit(X_train_sub_grade_p, np.column_stack((y_train_sub_grade_p.ravel(), 
                                                                y_train_int_rate_p.ravel(),
                                                               )))
print(multioutput_regressor.score(X_train_sub_grade_p, np.column_stack((y_train_sub_grade_p.ravel(), 
                                                                        y_train_int_rate_p.ravel(),
                                                                       ))))

Optimization Progress:   0%|          | 0/240 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.5612972553776866

Generation 2 - Current best internal CV score: 0.5638527802585593

Generation 3 - Current best internal CV score: 0.5690860595739439

Generation 4 - Current best internal CV score: 0.5702143948886012

Generation 5 - Current best internal CV score: 0.5702143948886012

Best pipeline: XGBRegressor(input_matrix, learning_rate=0.1, max_depth=5, min_child_weight=19, n_estimators=100, n_jobs=1, objective=reg:squarederror, subsample=0.7000000000000001, verbosity=0)


Optimization Progress:   0%|          | 0/240 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: -0.0010100194342258727

Generation 2 - Current best internal CV score: -0.0010100194342258727

Generation 3 - Current best internal CV score: -0.0008060877985545556

Generation 4 - Current best internal CV score: -0.0008060877985545556

Generation 5 - Current best internal CV score: -0.0008060877985545556

Best pipeline: ElasticNetCV(MinMaxScaler(input_matrix), l1_ratio=0.6000000000000001, tol=0.01)
0.3611088553583944


In [348]:
class MyXGBRegressor(XGBRegressor, BaseEstimator):
    def __reduce__(self):
        return super().__reduce__()

custom_xgb = MyXGBRegressor()

joblib.dump(custom_xgb, 'tpot_subgrade_int_rate_model.joblib', compress=True)

['tpot_subgrade_int_rate_model.joblib']

## 3.3. Improving the top-performing models

### 3.3.1. Loan status prediction model

In [357]:
def custom_fbeta_score(y_true, y_pred, beta):
    return fbeta_score(y_true, y_pred, beta=beta)

In [375]:
from sklearn.model_selection import StratifiedKFold

fbeta_scorer = make_scorer(custom_fbeta_score, beta=2)

lr = LogisticRegression(max_iter=1000, penalty='l2', random_state=42)
param_grid = {
    'C': [0.001, 0.01, 0.1, 0.5, 1, 1.5, 2, 5, 10],
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'class_weight': [None, 'balanced'],
}

gs_lr = GridSearchCV(lr, param_grid, 
                     scoring=fbeta_scorer, 
                     cv=5, verbose=True, 
                     n_jobs=-1)
gs_lr.fit(X_train_status_p, y_train_status_p.ravel())
print(gs_lr.best_params_)
y_pred = gs_lr.predict(X_test_status_p)
print("Recall:", recall_score(y_test_status_p.ravel(), y_pred))
fbeta = fbeta_score(y_test_status_p.ravel(), y_pred, beta=2)
print(f"F-beta({2}) Score: {fbeta:.3f}")
class_report = classification_report(y_test_status_p.ravel(), y_pred)
print("Classification Report:\n", class_report)

cm = confusion_matrix(y_test_status_p.ravel(), y_pred)
cm_df = pd.DataFrame(cm, index=['Actual 0', 'Actual 1'], 
                     columns=['Predicted 0', 'Predicted 1'])

cm_df

Fitting 5 folds for each of 90 candidates, totalling 450 fits
{'C': 1.5, 'class_weight': None, 'solver': 'lbfgs'}
Recall: 0.911
F-beta(2) Score: 0.906
Classification Report:
               precision    recall  f1-score   support

         0.0       0.91      0.88      0.89      1000
         1.0       0.89      0.91      0.90      1000

    accuracy                           0.90      2000
   macro avg       0.90      0.90      0.90      2000
weighted avg       0.90      0.90      0.90      2000



Unnamed: 0,Predicted 0,Predicted 1
Actual 0,882,118
Actual 1,89,911


In [587]:
best_lr_status = LogisticRegression(**gs_lr.best_params_, 
                                    max_iter=1000, 
                                    penalty='l2',
                                    random_state=42)
best_lr_status.fit(X_train_status_p, y_train_status_p.ravel())

y_pred = best_lr_status.predict(X_test_status_p)
print("Recall:", recall_score(y_test_status_p.ravel(), y_pred))
fbeta = fbeta_score(y_test_status_p.ravel(), y_pred, beta=2)
print(f"F-beta({2}) Score: {fbeta:.3f}")
class_report = classification_report(y_test_status_p.ravel(), y_pred)
print("Classification Report:\n", class_report)

cm = confusion_matrix(y_test_status_p.ravel(), y_pred)
cm_df = pd.DataFrame(cm, index=['Actual 0', 'Actual 1'], 
                     columns=['Predicted 0', 'Predicted 1'])

cm_df

Recall: 0.9
F-beta(2) Score: 0.901
Classification Report:
               precision    recall  f1-score   support

         0.0       0.90      0.91      0.90      1000
         1.0       0.91      0.90      0.90      1000

    accuracy                           0.90      2000
   macro avg       0.90      0.90      0.90      2000
weighted avg       0.90      0.90      0.90      2000



Unnamed: 0,Predicted 0,Predicted 1
Actual 0,907,93
Actual 1,100,900


In [588]:
joblib.dump(best_lr_status, 'loan_status_classifier.joblib')

['loan_status_classifier.joblib']

### 3.3.2. Grade prediction model

In [590]:
param_grid = {
    'learning_rate': [0.1, 0.01, 0.001],
    'n_estimators': [100, 200, 300],
    'num_leaves': [5, 10, 15]
}

lgbm = LGBMRegressor(verbose=-1)

gs_lgbm = GridSearchCV(lgbm, param_grid,
                       scoring='r2', 
                       cv=5, verbose=True,
                       n_jobs=-1)
gs_lgbm.fit(X_train_grade_p, y_train_grade_p)  

print(gs_lgbm.best_params_)

y_pred = gs_lgbm.predict(X_test_grade_p)  

r2 = r2_score(y_test_grade_p, y_pred)
print(f"R2 Score: {r2:.3f}")

mse = mean_squared_error(y_test_grade_p, y_pred)
print(f'Mean Squared Error: {mse:.3f}')

mae = mean_absolute_error(y_test_grade_p, y_pred)
print(f'Mean Absolute Error: {mae:.3f}')

evs = explained_variance_score(y_test_grade_p, y_pred)
print(f'Explained Variance Score: {evs:.3f}')

Fitting 5 folds for each of 27 candidates, totalling 135 fits
{'learning_rate': 0.1, 'n_estimators': 300, 'num_leaves': 5}
R2 Score: 0.555
Mean Squared Error: 0.812
Mean Absolute Error: 0.705
Explained Variance Score: 0.555


In [591]:
best_lgbm_grade = LGBMRegressor(verbose=-1, **gs_lgbm.best_params_)
best_lgbm_grade.fit(X_train_grade_p, y_train_grade_p)

y_pred = best_lgbm_grade.predict(X_test_grade_p)  

r2 = r2_score(y_test_grade_p, y_pred)
print(f"R2 Score: {r2:.3f}")

mse = mean_squared_error(y_test_grade_p, y_pred)
print(f'Mean Squared Error: {mse:.3f}')

mae = mean_absolute_error(y_test_grade_p, y_pred)
print(f'Mean Absolute Error: {mae:.3f}')

evs = explained_variance_score(y_test_grade_p, y_pred)
print(f'Explained Variance Score: {evs:.3f}')

R2 Score: 0.555
Mean Squared Error: 0.812
Mean Absolute Error: 0.705
Explained Variance Score: 0.555


In [592]:
joblib.dump(best_lgbm_grade, 'grade_regressor.joblib')

['grade_regressor.joblib']

### 3.3.3. Sub-grade and interest rate prediction models

In [471]:
regression_models_2 = [
    RandomForestRegressor(random_state=42), 
    XGBRegressor(random_state=42),
    LGBMRegressor(verbose=-1, random_state=42),
]

regression_params_2 = [
  {'classifier__n_estimators': [50, 100, 200, 300], # RandomForestRegressor
   'classifier__max_depth': [None, 5, 10, 20],
   'classifier__min_samples_split': [2, 5, 7, 10],
   'classifier__max_features': [None, 'sqrt', 'log2', 0.1, 0.5, 0.9]}, 
   
  {'classifier__learning_rate': [0.05, 0.1, 0.2],  # XGBRegressor   
   'classifier__max_depth': [5, 8, 12],
   'classifier__colsample_bytree': [0.5, 0.8, 1.0]},
     
   {'classifier__num_leaves': [10, 15, 31],  # LGBMRegressor
    'classifier__learning_rate': [0.05, 0.1, 0.2],
    'classifier__n_estimators': [100, 150, 200]},
]

In [472]:
from sklearn.model_selection import RandomizedSearchCV

def random_model_selection(models, params, X_train, y_train, 
                           scoring, n_iter=10, random_state=42):
    results = []
    algo_best_params = {}

    for model, param in zip(models, params):
        pipe = Pipeline(steps=[('classifier', model)])
        search = RandomizedSearchCV(pipe, param_distributions=param, 
                                    n_iter=n_iter, cv=5, scoring=scoring, 
                                    refit=scoring, random_state=random_state)
        search.fit(X_train, y_train)
        print(f"Algorithm: {model}")
        print(f'Best parameter: {search.best_params_}')
        print(f"Best {scoring}: {search.best_score_:.3f}")
        best_score = search.cv_results_['mean_test_score'][search.best_index_]
        print("-" * 30)

        best_pipe = search.best_estimator_
        algo_best_params[model.__class__.__name__] = search.best_params_

        results.append({
            'model': model.__class__.__name__,
            scoring: best_score,
            'std': search.cv_results_['std_test_score'][search.best_index_],
        })

    return results, algo_best_params

Sub-grade prediction:

In [473]:
results, best_params = random_model_selection(
  models=regression_models_2,
  params=regression_params_2,   
  X_train=X_train_sub_grade_p,
  y_train=y_train_sub_grade_p.ravel(), 
  scoring='r2')

print(results)
print("Best Params: ")
print(best_params)
print('-'*30)

Algorithm: RandomForestRegressor(random_state=42)
Best parameter: {'classifier__n_estimators': 300, 'classifier__min_samples_split': 2, 'classifier__max_features': 0.5, 'classifier__max_depth': 20}
Best r2: 0.521
------------------------------
Algorithm: XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=None, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=None, n_jobs=None,
             num_parallel_tree=None, random_state=42, ...)


In [593]:
best_params = {
    'classifier__num_leaves': np.arange(10, 20, 1),
    'classifier__n_estimators': [100, 150, 200, 250],
    'classifier__learning_rate': [0.05, 0.1, 0.2]
}

lgbm_model = LGBMRegressor(verbose=-1, random_state=42)

pipe = Pipeline(steps=[('classifier', lgbm_model)])

random_search = RandomizedSearchCV(pipe, param_distributions=best_params, 
                                   n_iter=10, cv=5, scoring='r2', n_jobs=-1, 
                                   random_state=42)
random_search.fit(X_train_sub_grade_p, y_train_sub_grade_p)

print(f"Best Estimator: LGBMRegressor")
print(f"R2 Score: {random_search.best_score_:.6f}")
print(f"Standard Deviation: {random_search.cv_results_['std_test_score'][random_search.best_index_]:.6f}")
print(f"Best Hyperparameters: {random_search.best_params_}")

Best Estimator: LGBMRegressor
R2 Score: 0.575322
Standard Deviation: 0.011019
Best Hyperparameters: {'classifier__num_leaves': 15, 'classifier__n_estimators': 150, 'classifier__learning_rate': 0.1}


In [594]:
best_lgbm_sub_grade = LGBMRegressor(verbose=-1, 
                                    random_state=42,
                                    **random_search.best_params_)
best_lgbm_sub_grade.fit(X_train_sub_grade_p, y_train_sub_grade_p)

y_pred = best_lgbm_sub_grade.predict(X_test_sub_grade_p)  

r2 = r2_score(y_test_sub_grade_p, y_pred)
print(f"R2 Score: {r2:.3f}")

mse = mean_squared_error(y_test_sub_grade_p, y_pred)
print(f'Mean Squared Error: {mse:.3f}')

mae = mean_absolute_error(y_test_sub_grade_p, y_pred)
print(f'Mean Absolute Error: {mae:.3f}')

evs = explained_variance_score(y_test_sub_grade_p, y_pred)
print(f'Explained Variance Score: {evs:.3f}')

R2 Score: 0.579
Mean Squared Error: 19.096
Mean Absolute Error: 3.380
Explained Variance Score: 0.580


In [514]:
joblib.dump(best_lgbm_sub_grade, 'sub_grade_regressor.joblib')

['sub_grade_regressor.joblib']

Interest rate prediction:

In [476]:
results, best_params = random_model_selection(
  models=regression_models,
  params=regression_params,   
  X_train=X_train_int_rate_p,
  y_train=y_train_int_rate_p.ravel(), 
  scoring='r2')

print(results)
print("Best Params: ")
print(best_params)
print('-'*30)

Algorithm: RandomForestRegressor(random_state=42)
Best parameter: {'classifier__n_estimators': 300, 'classifier__min_samples_split': 2, 'classifier__max_features': 0.5, 'classifier__max_depth': 20}
Best r2: 0.491
------------------------------
Algorithm: XGBRegressor(base_score=None, booster=None, callbacks=None,
             colsample_bylevel=None, colsample_bynode=None,
             colsample_bytree=None, device=None, early_stopping_rounds=None,
             enable_categorical=False, eval_metric=None, feature_types=None,
             gamma=None, grow_policy=None, importance_type=None,
             interaction_constraints=None, learning_rate=None, max_bin=None,
             max_cat_threshold=None, max_cat_to_onehot=None,
             max_delta_step=None, max_depth=None, max_leaves=None,
             min_child_weight=None, missing=nan, monotone_constraints=None,
             multi_strategy=None, n_estimators=None, n_jobs=None,
             num_parallel_tree=None, random_state=42, ...)


KeyboardInterrupt: 

In [598]:
best_params = {
    'classifier__num_leaves': np.arange(7, 15, 1),
    'classifier__n_estimators': [100, 150, 200, 250],
    'classifier__learning_rate': [0.05, 0.1, 0.2]
}

lgbm_model = LGBMRegressor(verbose=-1, random_state=42)

pipe = Pipeline(steps=[('classifier', lgbm_model)])

random_search = RandomizedSearchCV(pipe, param_distributions=best_params, 
                                   n_iter=10, cv=5, scoring='r2', n_jobs=-1, 
                                   random_state=42)
random_search.fit(X_train_int_rate_p, y_train_int_rate_p.ravel())

print(f"Best Estimator: LGBMRegressor")
print(f"R2 Score: {random_search.best_score_:.6f}")
print(f"Standard Deviation: {random_search.cv_results_['std_test_score'][random_search.best_index_]:.6f}")
print(f"Best Hyperparameters: {random_search.best_params_}")

Best Estimator: LGBMRegressor
R2 Score: 0.537355
Standard Deviation: 0.004124
Best Hyperparameters: {'classifier__num_leaves': 9, 'classifier__n_estimators': 150, 'classifier__learning_rate': 0.1}


The R2 could be better for predicting the interest rate. Let's see whether we could get some difference by applying PCA:

In [597]:
best_params = {
    'classifier__num_leaves': np.arange(7, 15, 1),
    'classifier__n_estimators': [100, 150, 200, 250],
    'classifier__learning_rate': [0.05, 0.1, 0.2]
}

lgbm_model = LGBMRegressor(verbose=-1, random_state=42)

pipe = Pipeline(steps=[
    ('pca', PCA(n_components=0.95)),  
    ('classifier', lgbm_model)
])
random_search_pca = RandomizedSearchCV(pipe, param_distributions=best_params, 
                                   n_iter=10, cv=5, scoring='r2', n_jobs=-1, 
                                   random_state=42)
random_search_pca.fit(X_train_int_rate_p, y_train_int_rate_p.ravel())

print(f"Best Estimator: LGBMRegressor")
print(f"R2 Score: {random_search_pca.best_score_:.6f}")
print(f"Standard Deviation: {random_search_pca.cv_results_['std_test_score'][random_search_pca.best_index_]:.6f}")
print(f"Best Hyperparameters: {random_search_pca.best_params_}")

Best Estimator: LGBMRegressor
R2 Score: 0.454246
Standard Deviation: 0.009875
Best Hyperparameters: {'classifier__num_leaves': 9, 'classifier__n_estimators': 150, 'classifier__learning_rate': 0.1}


The result of the randomized grid search is worse with PCA, so we won't use it for our final model.

In [603]:
best_lgbm_int_rate = LGBMRegressor(verbose=-1, 
                                   random_state=42,
                                   num_leaves=9, 
                                   n_estimators=150, 
                                   learning_rate=0.1)
best_lgbm_int_rate.fit(X_train_int_rate_p, y_train_int_rate_p.ravel())

y_pred = best_lgbm_int_rate.predict(X_test_int_rate_p)  

r2 = r2_score(y_test_int_rate_p, y_pred)
print(f"R2 Score: {r2:.3f}")

mse = mean_squared_error(y_test_int_rate_p, y_pred)
print(f'Mean Squared Error: {mse:.3f}')

mae = mean_absolute_error(y_test_int_rate_p, y_pred)
print(f'Mean Absolute Error: {mae:.3f}')

evs = explained_variance_score(y_test_int_rate_p, y_pred)
print(f'Explained Variance Score: {evs:.3f}')

R2 Score: 0.537
Mean Squared Error: 0.018
Mean Absolute Error: 0.103
Explained Variance Score: 0.537


In [604]:
joblib.dump(best_lgbm_int_rate, 'int_rate_regressor.joblib')

['int_rate_regressor.joblib']

# 5. Deploying ML models

The code for the deployment is in the 'flask_app.py' file.

# 6. Conclusions and further improvement

- We've managed to achieve a nice result for the loan status prediction (fully paid vs charged off) with the classification model.
- For regression tasks, overall we got R2 around 55-58. These results need improvement. In further modeling we would like to try different data preprocessing and dimensionality reduction techniques, and also try out other regression algorithms to check whether this would improve the prediction quality.
- In future work, we would like to use SHAP to explain the models. And we would like to incorporate SHAP force plots in the Flask App.
- The preprocessing and grid search pipelines also need some improvement. It would be also useful to transfer reusable code to a separate .py file.