# INFO

This notebook is used to build and test the model while designing it within the Kedro framework.

In [1]:
##############################################################################
# It is recommended to create new virtual environment for each Kedro project #
##############################################################################

# Uncomment and run the line below if your environment doesn't have
# Kedro or any of the other required dependencies.

#! pip install -r requirements.txt
%load_ext kedro.ipython

# Model
The dataset needs transformations such as imputation and normalization. To avoid data leakage, all transformations are done inside the model pipeline and are fitted only on the training data at the model-fitting stage. I therefore split the initial typed dataset into train/test sets and balance only the train set, ignoring all previously transformed datasets.
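
The leakage concern can be illustrated with a small, library-free sketch: the imputation and scaling statistics are computed from the training rows only and then reused unchanged on the test rows. The data and the helper names (`fit_stats`, `transform`) are hypothetical; they only mimic what `SimpleImputer(strategy='median')` followed by `StandardScaler` does once fitted inside a pipeline.

```python
from statistics import mean, median, pstdev


def fit_stats(train_values):
    """Learn imputation and scaling statistics from the training data only."""
    observed = [v for v in train_values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in train_values]
    # Population standard deviation, as StandardScaler uses
    return {"fill": fill, "mean": mean(filled), "std": pstdev(filled)}


def transform(values, stats):
    """Apply previously learned statistics; nothing is re-estimated here."""
    filled = [stats["fill"] if v is None else v for v in values]
    return [(v - stats["mean"]) / stats["std"] for v in filled]


train = [10.0, None, 30.0, 20.0]  # hypothetical training feature
test = [None, 1000.0]             # hypothetical test feature with an outlier

stats = fit_stats(train)          # the test outlier never influences these
train_scaled = transform(train, stats)
test_scaled = transform(test, stats)
```

Because `stats` is derived from `train` alone, the extreme test value cannot shift the median or the mean used for either set. This is the property the model pipeline relies on when it is fitted on `X_train` only.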

In [None]:
import pandas as pd
from sklearn import set_config
from sklearn.model_selection import train_test_split
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix
from lending_club.pipelines.analysis.nodes import features_eng
from lending_club.pipelines.encode.nodes import _default_status
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline as imb_make_pipeline


def split_dataset(df: pd.DataFrame, params: dict):
    y = _default_status(df, params)
    X = df
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=params['test_size'],
        random_state=params['random_state']
    )
    return X_train, X_test, y_train, y_test


In [3]:
# Load general parameters
params = catalog.load("parameters")
params['model_features'] = catalog.load("params:model_features")

# Load intermediate dataset that has proper features types
# and split it to training/testing features and target datasets
df = catalog.load("intermediate_lc_dataset")
features_eng(df, params)
X_train, X_test, y_train, y_test = split_dataset(df, params)

In [77]:
# Define the model pipeline that transforms the original feature datasets.
# Transformations are fitted on the training data only and then applied
# to the testing data, which prevents data leakage.
# The pipeline uses one of two classifiers: RandomForestClassifier or CatBoostClassifier
def model_pipeline(model_options: dict, params: dict):

    set_config(transform_output='pandas')

    # split important features to assign preprocessing steps
    category_feat = [f for f in (params['category'] + [params['emp_len']]) if f in params['model_features']]
    numeric_feat_zero = [f for f in params['fill_zero'] if f in params['model_features']]
    numeric_feat_med = [f for f in params['fill_med'] if f in params['model_features']]
    remainder_feat = list(set(params['model_features']) - set(category_feat) - set(numeric_feat_zero) - set(numeric_feat_med))

    # transformer to replace missing numeric values by 0
    # and standardize all values 
    numeric_feat_zero_transformer = make_pipeline(
        SimpleImputer(strategy='constant', fill_value=0),
        StandardScaler()
    )
    # transformer to replace missing numeric values by median
    numeric_feat_med_transformer = make_pipeline(
        SimpleImputer(strategy='median'),
        StandardScaler()
    )

    # assemble transformers in preprocessing pipe so it will perform 
    # following transformations:
    #   - encode all categorical features to numbers
    #   - fill missing values in specific number features as "0" and standardize them
    #   - fill missing values in specific number features as median and standardize them
    #   - standardize the rest of the features
    preprocessing = make_column_transformer(
        (OrdinalEncoder(), category_feat),
        (numeric_feat_zero_transformer, numeric_feat_zero),
        (numeric_feat_med_transformer, numeric_feat_med),
        (StandardScaler(), remainder_feat)
    )

    # choose the classifier depending on the provided model_options
    if model_options['name'] == 'rfc':
        regressor = RandomForestClassifier(**model_options['regressor_options'])
    elif model_options['name'] == 'catboost':
        regressor = CatBoostClassifier(**model_options['regressor_options'])
    else:
        raise ValueError("Pipeline accepts only RandomForestClassifier and CatBoostClassifier")
    
    # Assemble preprocessing pipeline, imbalance handling and chosen regressor as the model pipeline
    model = imb_make_pipeline(
        preprocessing,
        SMOTE(random_state=params['random_state']),
        regressor
    )
    return model

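# Function that fits the model, applying 'fit_options' to the estimator first if they are available
def train_model(X_train, y_train, regressor, params: dict):
    try:
        regressor.set_params(**params['fit_options'])
    except (KeyError, TypeError, ValueError):
        # No usable 'fit_options' (missing, empty, or rejected by the estimator):
        # fit with the estimator's current parameters
        pass
    regressor.fit(X_train, y_train)
    return regressor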


# Evaluation
## Actual profit/loss

To evaluate model performance I want to use a custom loss function. For that I need the actual earning rate, which defines the potential loss when we refuse a loan that was mistakenly predicted as a default, and the actual loss rate for charged-off loans, which is our loss when we issue a loan that was mistakenly predicted as non-default.

I assume that the earning rate for non-defaulted loans, including loans that are not fully paid yet, is the total received amount minus the total received principal, divided by the total received principal.

For charged-off loans, I believe the actual loss is the loan amount minus the total received payments (which include collections after charge-off) plus the collection recovery fee (which I believe is our payment to collectors for their services). Dividing that by the loan amount of this category gives the actual loss rate for defaulted loans.
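
As a worked example of the two rates described above, the sketch below applies them to made-up totals. The numbers are hypothetical and are not taken from the LendingClub data; only the key names mirror the dataset columns.

```python
# Hypothetical aggregated totals (in dollars) for each group of loans.
non_defaulted = {"total_pymnt": 1_200.0, "total_rec_prncp": 1_000.0}
defaulted = {
    "loan_amnt": 1_000.0,
    "total_pymnt": 600.0,
    "collection_recovery_fee": 50.0,
}

# Earning rate: everything received beyond principal, relative to principal received.
earning_rate = (
    (non_defaulted["total_pymnt"] - non_defaulted["total_rec_prncp"])
    / non_defaulted["total_rec_prncp"]
)

# Loss rate: unrecovered loan amount plus collector fees, relative to the amount lent.
loss_rate = (
    (defaulted["loan_amnt"] - defaulted["total_pymnt"]
     + defaulted["collection_recovery_fee"])
    / defaulted["loan_amnt"]
)

print(f"earning rate: {earning_rate:.2f}, loss rate: {loss_rate:.2f}")
```

With these toy totals, every dollar of principal repaid by good borrowers earns 0.20 on top, and every dollar lent to defaulters loses 0.45.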

In [5]:
# Function that returns actual profit/loss rates for non-defaulted/defaulted loans
def get_loss_values(df: pd.DataFrame) -> pd.DataFrame:

    # Select columns for profit/loss calculation (copy so the input frame is not modified)
    df = df.loc[:, ['loan_amnt', 'loan_status', 'total_pymnt', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee']].copy()

    # Add default status and sum all amounts per status
    df['default_status'] = df['loan_status'].str.contains("Charged Off", regex=False, na=False)
    df = df.drop(columns=['loan_status'])
    df = df.groupby(by='default_status').sum()
    df = df.reset_index()


    df['earning/loss'] = (
        # actual earnings rate for non-defaulters
        ((df.total_pymnt - df.total_rec_prncp) / df.total_rec_prncp) * ~df.default_status
        # actual losses rate for defaulters
        + (df.loan_amnt - df.total_pymnt + df.collection_recovery_fee) / df.loan_amnt * df.default_status
        )

    # Select columns: 'default_status', 'earning/loss'
    df = df.loc[:, ['default_status', 'earning/loss']]
    return df.set_index('default_status')

df_loss = get_loss_values(catalog.load('intermediate_lc_clean'))
df_loss

default_status,earning/loss
False,0.228696
True,0.460871


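> These figures are used in parameters...yml to provide FP_cost and FN_cost to the model evaluator: FP_cost is the earning rate forgone on a good loan that is refused, and FN_cost is the loss rate on a defaulted loan that is issued.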

In [6]:
params['FP_cost'] = df_loss['earning/loss'].loc[False]
params['FN_cost'] = df_loss['earning/loss'].loc[True]

## Baseline model


In [78]:
#%reload_kedro

# Load base model's options
params['model_options'] = catalog.load("params:baseline_model.model_options")

# Make a model and fit it
model = model_pipeline(params['model_options'], params)
train_model(X_train, y_train, model, params['model_options'])
model

To evaluate my model I sweep a range of probability thresholds. Each threshold converts the predicted default probabilities into class labels, and I calculate the metrics for each one to decide which threshold minimizes losses.
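
The threshold sweep can be sketched on toy data as follows. The probabilities, labels, and the two cost values are hypothetical, and `expected_loss` is an illustrative helper rather than part of the project code. A false positive (a good loan refused) costs the forgone earning rate, and a false negative (a bad loan issued) costs the loss rate.

```python
def expected_loss(y_true, proba, threshold, fp_cost, fn_cost):
    """Cost-weighted error count when predicting default if proba > threshold."""
    y_pred = [p > threshold for p in proba]
    fp = sum(pred and not actual for pred, actual in zip(y_pred, y_true))
    fn = sum(actual and not pred for pred, actual in zip(y_pred, y_true))
    return fp_cost * fp + fn_cost * fn


# Hypothetical labels (True = default) and predicted default probabilities.
y_true = [False, False, False, True, True, False]
proba = [0.10, 0.35, 0.45, 0.40, 0.80, 0.20]
fp_cost, fn_cost = 0.2, 0.5  # hypothetical cost rates

# Thresholds are given in percent and converted to probabilities.
thresholds = [t / 100 for t in range(30, 60, 10)]
losses = {t: expected_loss(y_true, proba, t, fp_cost, fn_cost) for t in thresholds}
best_threshold = min(losses, key=losses.get)

print(losses, best_threshold)
```

Because the two error types carry different costs, the threshold with the lowest loss is not necessarily the one with the highest accuracy, which is why the loss column drives the choice here.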

In [79]:
def make_rng(start, stop, step):
    return range(start, stop, step)


def evaluate_metrics(model: object, X_true, y_true,
                     params: dict) -> pd.DataFrame:
    # Probability of the positive (default) class
    y_pred_proba = model.predict_proba(X_true)[:, 1]
    name = params['model_options']['name']
    rows = []
    for thresh in make_rng(**params['model_options']['prob_threshold']):
        # Predict default when the probability exceeds the threshold (given in %)
        y_pred = y_pred_proba > (thresh / 100)
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
        rows.append({
            'prob_thresh_%': thresh,
            'accuracy'     : accuracy_score(y_true, y_pred),
            'precision'    : precision_score(y_true, y_pred),
            'recall'       : recall_score(y_true, y_pred),
            'f1'           : f1_score(y_true, y_pred),
            # Computed from thresholded labels, so it varies with the threshold
            'roc_auc'      : roc_auc_score(y_true, y_pred),
            'tn'           : tn,
            'fp'           : fp,
            'fn'           : fn,
            'tp'           : tp,
            # Cost-weighted errors: refused good loans (fp) and issued bad loans (fn)
            'loss'         : params['FP_cost'] * fp + params['FN_cost'] * fn,
        })
    # One row per threshold, indexed by the model name
    metrics = pd.DataFrame(rows, index=[name] * len(rows))
    best_thresh = metrics[metrics.loss == metrics.loss.min()]['prob_thresh_%'].iloc[0]
    print(f"The best probability threshold for {name} model based on min loss: {best_thresh}")
    return metrics

eval_metr = evaluate_metrics(model, X_test, y_test, params)
eval_metr


The best probability threshold for rfc model based on min loss: 49


model,prob_thresh_%,accuracy,precision,recall,f1,roc_auc,tn,fp,fn,tp,loss
rfc,30,0.80155,0.265745,0.353928,0.303562,0.608896,15166,2390,1579,865,1274.297247
rfc,31,0.8096,0.273271,0.336334,0.301541,0.605909,15370,2186,1622,822,1247.460809
rfc,32,0.8163,0.279095,0.317921,0.297246,0.601801,15549,2007,1667,777,1227.263499
rfc,33,0.82265,0.284318,0.297463,0.290742,0.596613,15726,1830,1717,727,1209.827935
rfc,34,0.82885,0.293721,0.285188,0.289392,0.594861,15880,1676,1747,697,1188.434952
rfc,35,0.8342,0.299448,0.266367,0.28194,0.589808,16033,1523,1793,651,1174.644596
rfc,36,0.83825,0.30294,0.248773,0.273197,0.584542,16157,1399,1836,608,1166.103798
rfc,37,0.8432,0.311341,0.233633,0.266947,0.580846,16293,1263,1873,571,1152.053429
rfc,38,0.8466,0.314727,0.216858,0.256783,0.575563,16402,1154,1914,530,1146.021322
rfc,39,0.8501,0.319426,0.200491,0.246355,0.570512,16512,1044,1954,490,1139.299649


> The RandomForestClassifier-based model performs best at a probability threshold of **0.49**, where the loss is lowest (**1099.257**), although precision, F1, and AUC are low. At this threshold the model correctly identifies 233 defaults and misclassifies 2221 defaults as non-defaults.
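
One caveat when reading the `roc_auc` column: `evaluate_metrics` passes thresholded labels, not probabilities, to `roc_auc_score`, so the value changes with the threshold. For hard binary predictions this score reduces to the mean of recall and specificity (balanced accuracy). A quick check against the threshold-30 row of the table above, using only the confusion counts shown there:

```python
# Confusion counts copied from the rfc row at prob_thresh_% = 30.
tn, fp, fn, tp = 15166, 2390, 1579, 865

recall = tp / (tp + fn)        # true positive rate
specificity = tn / (tn + fp)   # true negative rate
balanced_accuracy = (recall + specificity) / 2

print(round(recall, 6), round(balanced_accuracy, 6))
```

Both values agree with the `recall` and `roc_auc` columns of that row to the displayed precision. A threshold-independent AUC would require passing the predicted probabilities instead; that is a possible variation, not what the notebook computes.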

## Challenger model

I will use CatBoost as the challenger model.

In [80]:
%reload_kedro

# Load candidate model's options
params['model_options'] = catalog.load("params:candidate_model.model_options")

# Make a model and fit it
model = model_pipeline(params['model_options'], params)
train_model(X_train, y_train, model, params['model_options'])
model

0:	learn: 0.6650147	test: 0.6648508	best: 0.6648508 (0)	total: 64ms	remaining: 1m 3s
100:	learn: 0.2435473	test: 0.2449835	best: 0.2449835 (100)	total: 5.79s	remaining: 51.6s
200:	learn: 0.2167082	test: 0.2199152	best: 0.2199152 (200)	total: 11s	remaining: 43.7s
300:	learn: 0.2049044	test: 0.2104014	best: 0.2104014 (300)	total: 16.1s	remaining: 37.5s
400:	learn: 0.1992176	test: 0.2074552	best: 0.2074552 (400)	total: 22s	remaining: 32.9s
500:	learn: 0.1946075	test: 0.2053997	best: 0.2053997 (500)	total: 32.8s	remaining: 32.7s
600:	learn: 0.1912655	test: 0.2049212	best: 0.2048903 (585)	total: 39.7s	remaining: 26.3s
Stopped by overfitting detector  (20 iterations wait)

bestTest = 0.20489033
bestIteration = 585

Shrink model to first 586 iterations.


In [81]:
eval_metr = evaluate_metrics(model, X_test, y_test, params)

eval_metr 

The best probability threshold for catboost model based on min loss: 37


model,prob_thresh_%,accuracy,precision,recall,f1,roc_auc,tn,fp,fn,tp,loss
catboost,30,0.8538,0.345956,0.22054,0.269365,0.581249,16537,1019,1905,539,1110.999592
catboost,31,0.8574,0.354908,0.204173,0.259221,0.576255,16649,907,1945,499,1103.820527
catboost,32,0.861,0.367299,0.190262,0.250674,0.572318,16755,801,1979,465,1095.248411
catboost,33,0.8629,0.37066,0.174714,0.237486,0.566709,16831,725,2017,427,1095.380643
catboost,34,0.8662,0.389313,0.166939,0.233677,0.565242,16916,640,2036,408,1084.69807
catboost,35,0.86765,0.391906,0.150573,0.217558,0.559024,16985,571,2076,368,1087.352913
catboost,36,0.86945,0.401649,0.139525,0.207106,0.555295,17048,508,2103,341,1085.388607
catboost,37,0.8714,0.41623,0.130115,0.198254,0.552355,17110,446,2126,318,1081.809515
catboost,38,0.87235,0.420438,0.11784,0.184084,0.547613,17159,397,2156,288,1084.429559
catboost,39,0.87345,0.429498,0.108429,0.173146,0.544189,17204,352,2179,265,1084.73829


> The CatBoostClassifier-based model performs best at a probability threshold of **0.37**, where the loss is lowest (**1081.81**); precision, F1, and AUC are slightly higher than for the previous model. At this threshold the model correctly identifies 318 defaults and misclassifies 2126 defaults as non-defaults.