# INFO

This notebook is used to build and test the model while designing it within the Kedro framework.

In [2]:
##############################################################################
# It is recommended to create new virtual environment for each Kedro project #
##############################################################################

# Uncomment and run the line below if your environment doesn't have
# Kedro or any of the other required dependencies.

#! pip install -r requirements.txt
%load_ext kedro.ipython

# Model
The dataset needs transformations such as imputation and normalization. To avoid data leakage, all transformations are done inside the model pipeline and are fitted only on the training data at the model fitting stage. So I split the initial typed dataset into train/test sets and balance only the train set, ignoring all previously transformed datasets.
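
The leakage concern can be shown without any ML library: the imputation value and the scaling statistics are learned from the training rows only and then reused unchanged on the test rows. Below is a minimal sketch using only the standard library. The helper names `fit_preprocessor` and `apply_preprocessor` and the toy values are hypothetical; the notebook itself relies on scikit-learn's `SimpleImputer` and `StandardScaler` inside a pipeline for this.

```python
from statistics import mean, median, pstdev


def fit_preprocessor(train_values):
    """Learn imputation and scaling statistics from the training data only."""
    observed = [v for v in train_values if v is not None]
    fill = median(observed)
    filled = [fill if v is None else v for v in train_values]
    # Population standard deviation; fall back to 1.0 for constant columns
    return {"fill": fill, "mean": mean(filled), "std": pstdev(filled) or 1.0}


def apply_preprocessor(values, stats):
    """Impute and standardize with statistics learned earlier (no refitting)."""
    filled = [stats["fill"] if v is None else v for v in values]
    return [(v - stats["mean"]) / stats["std"] for v in filled]


# Hypothetical toy feature with missing values
train = [10.0, None, 30.0, 20.0]
test = [None, 1000.0]

stats = fit_preprocessor(train)                # fitted on train only
train_scaled = apply_preprocessor(train, stats)
test_scaled = apply_preprocessor(test, stats)  # test never influences stats
```

The outlier in `test` does not shift the median or the mean used for the training rows, which is the property the pipeline-based approach is meant to guarantee.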

In [None]:
import pandas as pd
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score, confusion_matrix
import logging
from lending_club.pipelines.analysis.nodes import features_eng
from lending_club.pipelines.encode.nodes import _default_status
from imblearn.pipeline import make_pipeline as imb_make_pipeline

logger = logging.getLogger(__name__)

def split_dataset(df: pd.DataFrame, df_fe: pd.DataFrame, params: dict):
    y = _default_status(df, params)
    X = pd.concat([df, df_fe], axis=1)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=params['test_size'],
        random_state=params['random_state']
    )
    return X_train, X_test, y_train, y_test


In [84]:
# Load general parameters
params = catalog.load("parameters")
params['model_features'] = catalog.load("params:model_features")

In [None]:
# Load the intermediate dataset that has proper feature types
# and split it into training/testing features and target datasets
df = catalog.load("intermediate_lc_dataset")
X_train, X_test, y_train, y_test = split_dataset(df, features_eng(df, params), params)

In [None]:
# Define model pipeline that will transform original features datasets,
# applying isolated transformations for training and testing datasets
# to prevent data leakage 
def model_pipeline(model_options: dict, params: dict):

    # split important features to assign preprocessing steps
    category_feat = [f for f in (params['category'] + [params['emp_len']]) if f in params['model_features']]
    numeric_feat_zero = [f for f in params['fill_zero'] if f in params['model_features']]
    numeric_feat_med = [f for f in params['fill_med'] if f in params['model_features']]

    # transformer to replace missing numeric values by 0
    # and standardize all values 
    numeric_feat_zero_transformer = make_pipeline(
        SimpleImputer(strategy='constant', fill_value=0),
        StandardScaler()
    )
    # transformer to replace missing numeric values by median
    numeric_feat_med_transformer = make_pipeline(
        SimpleImputer(strategy='median'),
        StandardScaler()
    )

    # assemble transformers in preprocessing pipe so it will perform 
    # following transformations:
    #   - encode all categorical features to numbers
    #   - fill missing values in specific number features as "0" and standardize them
    #   - fill missing values in specific number features as median and standardize them
    preprocessing = make_column_transformer(
        (OrdinalEncoder(), category_feat),
        (numeric_feat_zero_transformer, numeric_feat_zero),
        (numeric_feat_med_transformer, numeric_feat_med)
    )

    # choose classifier depending on provided model_options
    # (the variable keeps the name `regressor` to match the 'regressor_options' key)
    if model_options['name'] == 'rfc':
        regressor = RandomForestClassifier(**model_options['regressor_options'])
    elif model_options['name'] == 'catboost':
        regressor = CatBoostClassifier(cat_features=category_feat, **model_options['regressor_options'])
    else:
        raise ValueError("Pipeline accepts only RandomForestClassifier and CatBoostClassifier")
    
    # assemble preprocessing pipeline, SMOTE (imbalance handler) and 
    # chosen regressor as the model pipeline 
    model = imb_make_pipeline(
        preprocessing,
        SMOTE(random_state=params['random_state']),
        regressor
    )
    return model

# Function that fits the model, setting extra parameters first if 'fit_options' is provided.
# Unlike a bare `except`, this does not hide errors raised during fitting.
def train_model(X_train, y_train, regressor, params: dict):
    fit_options = params.get('fit_options')
    if fit_options:
        regressor.set_params(**fit_options)
    regressor.fit(X_train, y_train)
    return regressor

# Load base model's options
params['model_options'] = catalog.load("params:baseline_model.model_options")

# Make a model and fit it
model = model_pipeline(params['model_options'], params)
train_model(X_train, y_train, model, params['model_options'])
model

# Evaluation
## Actual profit/loss

To evaluate model performance I want to use a custom loss function. For that I need the actual earning rate, which defines the potential loss when we refuse a loan that was mistakenly predicted as a default, and the actual loss rate on charged-off loans, which is our loss when we issue a loan that was mistakenly predicted as non-default.

I assume that the earning rate for non-defaulted loans, including loans that are not fully paid yet, is the total received amount minus the total received principal, divided by the total received principal.

For charged-off loans, I believe the actual loss is the loan amount minus the total received payments (which include collections after charge-off) plus the collection recovery fee (which I believe is our payment to collectors for their services). Dividing that by the loan amount of this category gives the actual loss rate for defaulted loans.
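
As a quick sanity check of the two definitions, here is a small worked example on made-up totals. The function names `earning_rate` and `loss_rate` and all amounts are hypothetical and only illustrate the arithmetic described above; they are not taken from the dataset.

```python
def earning_rate(total_pymnt: float, total_rec_prncp: float) -> float:
    """Earnings on non-defaulted loans per unit of principal received."""
    return (total_pymnt - total_rec_prncp) / total_rec_prncp


def loss_rate(loan_amnt: float, total_pymnt: float,
              collection_recovery_fee: float) -> float:
    """Net loss on charged-off loans per unit of loan amount issued."""
    return (loan_amnt - total_pymnt + collection_recovery_fee) / loan_amnt


# Hypothetical totals: 1250 received on 1000 of repaid principal
print(earning_rate(total_pymnt=1250.0, total_rec_prncp=1000.0))  # 0.25

# Hypothetical totals: 1000 lent, 600 received back, 50 paid to collectors
print(loss_rate(loan_amnt=1000.0, total_pymnt=600.0,
                collection_recovery_fee=50.0))  # 0.45
```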

In [15]:
# Function that returns actual profit/loss rates for non-defaulted/defaulted loans
def get_loss_values(df: pd.DataFrame) -> pd.DataFrame:

    # Select columns for profit/loss calculation (copy so the source frame is not modified)
    df = df.loc[:, ['loan_amnt', 'loan_status', 'total_pymnt', 'total_rec_prncp', 'total_rec_int', 'total_rec_late_fee', 'recoveries', 'collection_recovery_fee']].copy()

    # Add default status and summarize data
    df['default_status'] = df['loan_status'].str.contains("Charged Off", regex=False, na=False)
    df = df.drop(columns=['loan_status'])
    df = df.groupby(by='default_status').sum()
    df = df.reset_index()


    df['earning/loss'] = (
        # actual earnings rate for non-defaulters
        ((df.total_pymnt - df.total_rec_prncp) / df.total_rec_prncp) * ~df.default_status 
        # actual losses rate for defaulters
        + (df.loan_amnt - df.total_pymnt + df.collection_recovery_fee) / df.loan_amnt * df.default_status
        )

    # Select columns: 'default_status', 'earning/loss'
    df = df.loc[:, ['default_status', 'earning/loss']]
    return df.set_index('default_status')

df_loss = get_loss_values(catalog.load('intermediate_lc_clean'))
df_loss

default_status,earning/loss
False,0.228696
True,0.460871


> These figures are used in parameters...yml to feed FP_cost and FN_cost to the model evaluator.

In [None]:
params['FP_cost'] = df_loss['earning/loss'].loc[False]
params['FN_cost'] = df_loss['earning/loss'].loc[True]

## Baseline model
To evaluate the model I sweep a range of probability thresholds for classifying the predicted default probabilities, calculate the metrics at each threshold, and pick the one that minimizes losses.
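
The idea of the sweep can be sketched in plain Python before running it on the real model. The cost constants are the rounded rates from the `get_loss_values` output above; the helper names `cost_loss` and `sweep_thresholds`, the toy labels, and the toy probabilities are hypothetical.

```python
FP_COST = 0.228696  # earning rate lost when a good loan is refused
FN_COST = 0.460871  # loss rate when a defaulting loan is issued


def cost_loss(fp: int, fn: int) -> float:
    """Cost-weighted loss for given false positive / false negative counts."""
    return FP_COST * fp + FN_COST * fn


def sweep_thresholds(y_true, y_proba, thresholds):
    """Return (threshold, fp, fn, loss) for each candidate threshold."""
    results = []
    for t in thresholds:
        y_pred = [p > t for p in y_proba]
        fp = sum(pred and not actual for pred, actual in zip(y_pred, y_true))
        fn = sum(actual and not pred for pred, actual in zip(y_pred, y_true))
        results.append((t, fp, fn, cost_loss(fp, fn)))
    return results


# Hypothetical toy data: True means the loan defaulted
y_true = [False, False, False, True, True, False]
y_proba = [0.10, 0.35, 0.55, 0.45, 0.80, 0.20]

results = sweep_thresholds(y_true, y_proba, [0.3, 0.4, 0.5, 0.6])
best = min(results, key=lambda r: r[3])  # row with the lowest loss
print(best)
```

On this toy data the threshold 0.4 wins, with one false positive and no false negatives. Because a false negative costs roughly twice as much as a false positive here, the sweep tends to favour thresholds that let through some extra false positives in exchange for fewer missed defaults.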

In [None]:
def make_rng(start, stop, step):
    return range(start, stop, step)


def evaluate_metrics(model: object, X_true, y_true,
                     params: dict) -> pd.DataFrame:
    # Probability of the positive (default) class
    y_pred_proba = model.predict_proba(X_true)[:, 1]
    model_name = params['model_options']['name']
    rows = []
    for thresh in make_rng(**params['model_options']['prob_threshold']):
        # thresholds are configured in percent
        y_pred = y_pred_proba > (thresh / 100)
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
        rows.append({
            'prob_thresh_%': thresh,
            'accuracy'     : accuracy_score(y_true, y_pred),
            'precision'    : precision_score(y_true, y_pred),
            'recall'       : recall_score(y_true, y_pred),
            'f1'           : f1_score(y_true, y_pred),
            # computed on hard labels, so this equals (recall + specificity) / 2
            # at the given threshold rather than the threshold-free AUC
            'roc_auc'      : roc_auc_score(y_true, y_pred),
            'tn'           : tn,
            'fp'           : fp,
            'fn'           : fn,
            'tp'           : tp,
            'loss'         : params['FP_cost'] * fp + params['FN_cost'] * fn,
        })
    # one row per threshold, indexed by the model name
    metrics = pd.DataFrame(rows, index=[model_name] * len(rows))
    best_thresh = metrics.loc[metrics.loss == metrics.loss.min(), 'prob_thresh_%'].iloc[0]
    print(f"The best probability threshold for {model_name} model based on min loss: {best_thresh}%")
    return metrics

eval_metr = evaluate_metrics(model, X_test, y_test, params)

eval_metr 

The best probability threshold for rfc model based on min loss: 53


model,prob_thresh_%,accuracy,precision,recall,f1,roc_auc,tn,fp,fn,tp,loss
rfc,30,0.80445,0.267512,0.345336,0.301482,0.60685,15245,2311,1600,844,1265.910056
rfc,31,0.81115,0.269616,0.319149,0.292299,0.599396,15443,2113,1664,780,1250.123992
rfc,32,0.81895,0.276321,0.297463,0.286502,0.594505,15652,1904,1717,727,1226.752691
rfc,33,0.8257,0.281643,0.274959,0.278261,0.588664,15842,1714,1772,672,1208.648356
rfc,34,0.83245,0.291686,0.25982,0.274832,0.585993,16014,1542,1809,635,1186.364871
rfc,35,0.83755,0.297025,0.240998,0.266094,0.580798,16162,1394,1855,589,1173.717929
rfc,36,0.8419,0.301876,0.223813,0.257049,0.575879,16291,1265,1897,547,1163.572727
rfc,37,0.84585,0.305539,0.205401,0.245657,0.570204,16415,1141,1942,502,1155.953618
rfc,38,0.8502,0.313008,0.189034,0.235714,0.565638,16542,1014,1982,462,1145.344066
rfc,39,0.8534,0.31317,0.167349,0.218133,0.558127,16659,897,2035,409,1143.012797


> The RandomForestClassifier baseline performs best at a probability threshold of **0.53**, where the loss is lowest: **1110.6**. Precision also peaks at this threshold (**0.408**), but the AUC of **0.52** is only slightly better than chance: the model correctly identifies 122 defaults and misclassifies 2322 defaults as non-defaults.
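
One caveat when reading the `roc_auc` column: `evaluate_metrics` passes thresholded labels, not probabilities, to `roc_auc_score`, so the value reduces to the mean of recall and specificity at that threshold. The sketch below reproduces the figure for the 30% row of the baseline table from its confusion-matrix counts; the helper name `hard_label_auc` is hypothetical.

```python
def hard_label_auc(tn: int, fp: int, fn: int, tp: int) -> float:
    """AUC of a single hard-label operating point: mean of recall and specificity."""
    recall = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (recall + specificity) / 2


# Counts from the rfc row at prob_thresh_% = 30
print(round(hard_label_auc(tn=15245, fp=2311, fn=1600, tp=844), 5))  # 0.60685
```

A threshold-independent AUC would require passing the predicted probabilities instead, in which case the value would be the same for every row.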

## Challenger model

I will use CatBoost as the challenger model.

In [None]:
def model_pipeline(model_options: dict, params: dict):

    # split important features to assign preprocessing steps
    category_feat = [f for f in (params['category'] + [params['emp_len']]) if f in params['model_features']]
    numeric_feat_zero = [f for f in params['fill_zero'] if f in params['model_features']]
    numeric_feat_med = [f for f in params['fill_med'] if f in params['model_features']]

    # transformer to replace missing numeric values by 0
    # and standardize all values 
    numeric_feat_zero_transformer = make_pipeline(
        SimpleImputer(strategy='constant', fill_value=0),
        StandardScaler()
    )
    # transformer to replace missing numeric values by median
    numeric_feat_med_transformer = make_pipeline(
        SimpleImputer(strategy='median'),
        StandardScaler()
    )

    # assemble transformers in preprocessing pipe so it will perform
    # following transformations:
    #   - fill missing values in specific numeric features with 0 and standardize them
    #   - fill missing values in specific numeric features with the median and standardize them
    # Categorical features are not encoded here: with the OrdinalEncoder step
    # disabled and the default remainder='drop', they are left out of the model input.
    preprocessing = make_column_transformer(
        # (OrdinalEncoder(), category_feat),
        (numeric_feat_zero_transformer, numeric_feat_zero),
        (numeric_feat_med_transformer, numeric_feat_med)
    )

    # the challenger pipeline always uses CatBoostClassifier
    regressor = CatBoostClassifier(**model_options['regressor_options'])
    
    # assemble preprocessing pipeline, SMOTE (imbalance handler) and 
    # chosen regressor as the model pipeline 
    model = imb_make_pipeline(
        preprocessing,
        SMOTE(random_state=params['random_state']),
        regressor
    )
    return model

# Load challenger model options
params['model_options'] = catalog.load("params:candidate_model.model_options")

# Make a model and fit it
model = model_pipeline(params['model_options'], params)
train_model(X_train, y_train, model, params['model_options'])
model

category_feat: ['home_ownership', 'sub_grade', 'purpose', 'verification_status', 'application_type', 'verification_status_joint']
numeric_feat_zero: ['mths_since_last_record', 'mths_since_recent_bc_dlq', 'mths_since_last_major_derog', 'mths_since_recent_revol_delinq', 'mths_since_last_delinq', 'mths_since_rcnt_il']
numeric_feat_med: ['il_util', 'all_util', 'inq_fi', 'open_rv_24m', 'open_rv_12m', 'total_bal_il', 'open_il_12m', 'total_cu_tl', 'open_acc_6m', 'max_bal_bc', 'inq_last_12m', 'mths_since_recent_inq', 'mo_sin_old_il_acct', 'bc_util', 'percent_bc_gt_75', 'bc_open_to_buy', 'mths_since_recent_bc', 'pct_tl_nvr_dlq', 'avg_cur_bal', 'tot_cur_bal', 'num_actv_bc_tl', 'num_actv_rev_tl', 'num_op_rev_tl', 'mo_sin_rcnt_rev_tl_op', 'num_tl_op_past_12m', 'total_il_high_credit_limit', 'mo_sin_rcnt_tl', 'mo_sin_old_rev_tl_op', 'total_bc_limit', 'acc_open_past_24mths', 'total_bal_ex_mort', 'pub_rec_bankruptcies', 'collections_12_mths_ex_med', 'tax_liens', 'total_rev_hi_lim']
0:	learn: 0.6698316

In [94]:
# Mean accuracy on the test set (default decision threshold)
scores = model.score(X_test, y_test)
scores

0.8775

In [100]:
eval_metr = evaluate_metrics(model, X_test, y_test, params)

eval_metr 

The best probability threshold for catboost model based on min loss: 60


model,prob_thresh_%,accuracy,precision,recall,f1,roc_auc,tn,fp,fn,tp,loss
catboost,10,0.2945,0.139983,0.927987,0.243269,0.567149,3622,13934,176,2268,3267.76336
catboost,15,0.4985,0.164574,0.761457,0.270652,0.611675,8109,9447,583,1861,2429.178905
catboost,20,0.6493,0.187671,0.561784,0.281352,0.611634,11613,5943,1071,1373,1852.733169
catboost,25,0.73835,0.204367,0.394435,0.269236,0.590331,13803,3753,1480,964,1540.385168
catboost,30,0.8004,0.223966,0.256956,0.239329,0.566505,15380,2176,1816,628,1334.584232
catboost,35,0.83985,0.242012,0.145663,0.181865,0.541076,16441,1115,2088,356,1217.294688
catboost,40,0.85765,0.238651,0.075286,0.114463,0.520925,16969,587,2260,184,1175.813012
catboost,45,0.86355,0.203742,0.040098,0.067009,0.509141,17173,383,2346,98,1168.793934
catboost,50,0.8775,0.285714,0.001637,0.003255,0.500534,17546,10,2440,4,1126.8122
catboost,55,0.87775,0.0,0.0,0.0,0.499972,17555,1,2444,0,1126.59742
