# Loan Credit Risk Model  
Build a predictive model to understand the factors that influence credit default.  


**Status**: Done  
**Dataset**: here  
**Jupyter notebook**: here



---  
---  
## <span id='all'>About the Project</span>  
    
#### Project Context  
From the perspective of a financial institution such as a bank, identifying good borrowers is essential to running a lending business. Bad borrowers not only hurt the business; they can also have a negative effect on the broader economy.  
This project examines the factors that affect credit default and builds a model to predict it.


#### Project Requirement   

1. Explore and prepare the dataset  
2. Train a machine learning model to predict credit default by borrowers.
3. Evaluate model and present findings  

#### Project Planning  
1. Exploratory Data Analysis (EDA)
    - Data Structure Check (shape, datatype, head, tail)
    - Data Quality Check (missing value, outlier, distribution)
    - Data Cleaning (missing value, outlier, inappropriate values, high cardinality)
    - Content Investigation (hypothesis testing, predictive power check through descriptive/visualization and analytics)
2. Data Preprocessing & Feature Engineering
3. Feature Selection
4. Model Training
5. Model Evaluation  
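
The structure and quality checks listed in step 1 can be sketched on a toy table as follows. This is a minimal illustration, not the notebook's actual EDA: the `raw` frame, its column names, and the 1.5 × IQR outlier rule are assumptions chosen for the example.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw loan table; names and values are illustrative only.
raw = pd.DataFrame({
    "annual_inc": [42000.0, 58000.0, np.nan, 1_200_000.0, 35000.0],
    "emp_length": ["2 years", "10+ years", "n/a", "5 years", "2 years"],
    "target": [1, 1, 0, 1, 0],
})

# Structure check: shape and dtypes.
n_rows, n_cols = raw.shape
dtypes = raw.dtypes

# Quality check: share of missing values per column.
missing_share = raw.isna().mean()

# Simple outlier flag using the 1.5 * IQR rule (an assumed rule of thumb).
q1, q3 = raw["annual_inc"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (raw["annual_inc"] < q1 - 1.5 * iqr) | (raw["annual_inc"] > q3 + 1.5 * iqr)

# Cardinality check for a categorical column.
cardinality = raw["emp_length"].nunique()
```

Columns with a high missing share or very high cardinality are the usual candidates for the cleaning step.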

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', '{:.4f}'.format)
sns.set()

In [2]:
df = pd.read_csv('..//dataset//df_dummy.csv')
ref_col = pd.read_csv('..//dataset//reference_columns.csv')

In [4]:
df.head()

Unnamed: 0,delinq_2yrs,mths_since_last_delinq,pub_rec,out_prncp,total_rec_late_fee,recoveries,next_pymnt_d,tot_coll_amt,target,int_rate | 5-10,int_rate | 14-18,int_rate | 18-22,int_rate | >22,emp_length | 2-4,emp_length | 4-6,emp_length | 6-8,emp_length | >8,annual_inc | 30000-40000,annual_inc | 40000-50000,annual_inc | 50000-60000,annual_inc | 60000-70000,annual_inc | 70000-80000,annual_inc | 80000-90000,annual_inc | 90000-100000,annual_inc | 100000-120000,annual_inc | 120000-150000,annual_inc | >150000,issue_d | 41-45,issue_d | 45-50,issue_d | 50-54,issue_d | 54-59,issue_d | 59-63,issue_d | 63-68,issue_d | 68-81,issue_d | 81-95,issue_d | >95,dti | 8-16,dti | 16-24,dti | 24-31,dti | >31,earliest_cr_line | 117-235,earliest_cr_line | 235-352,earliest_cr_line | 352-470,earliest_cr_line | >470,inq_last_6mths | 1-2,inq_last_6mths | 2-4,inq_last_6mths | 4-5,inq_last_6mths | >5,open_acc | 9-18,open_acc | 18-27,open_acc | >27,revol_bal | 23374-46747,revol_bal | >46747,revol_util | 10-20,revol_util | 20-31,revol_util | 31-41,revol_util | 41-51,revol_util | 51-61,revol_util | 61-71,revol_util | 71-82,revol_util | 82-92,revol_util | >92,total_acc | 7-14,total_acc | 14-22,total_acc | 22-29,total_acc | 29-36,total_acc | 36-43,total_acc | >43,total_rec_prncp | 4375-8750,total_rec_prncp | 8750-13125,total_rec_prncp | 13125-17500,total_rec_prncp | 17500-21875,total_rec_prncp | 21875-26250,total_rec_prncp | 26250-30625,total_rec_prncp | >30625,total_rec_int | 892-1784,total_rec_int | 1784-2676,total_rec_int | 2676-3568,total_rec_int | 3568-4460,total_rec_int | 4460-5352,total_rec_int | 5352-6244,total_rec_int | 6244-7136,total_rec_int | 7136-10704,total_rec_int | >10704,last_pymnt_d | 30-60,last_pymnt_d | 60-90,last_pymnt_d | >90,last_pymnt_amnt | 3390-6781,last_pymnt_amnt | 6781-16951,last_pymnt_amnt | 16951-23732,last_pymnt_amnt | >23732,last_credit_pull_d | 23-35,last_credit_pull_d | 35-46,last_credit_pull_d | 46-58,last_credit_pull_d | >58,tot_cur_bal | 0-12,tot_cur_bal | 
12-23,tot_cur_bal | 23-35,tot_cur_bal | 35-46,tot_cur_bal | 46-58,tot_cur_bal | 58-69,tot_cur_bal | 69-81,tot_cur_bal | 81-92,tot_cur_bal | 92-104,tot_cur_bal | 104-116,tot_cur_bal | >116,tot_cur_bal | 233452-466905,tot_cur_bal | 466905-700357,tot_cur_bal | 700357-933810,tot_cur_bal | >933810,total_rev_hi_lim | 60880-121760,total_rev_hi_lim | 121760-182640,total_rev_hi_lim | 182640-243520,total_rev_hi_lim | >243520,term | 60months,home_ownership | OWN,home_ownership | RENT,verification_status | Source Verified,verification_status | Verified,purpose | major_purchase__wedding__car,purpose | moving__other__medical__debt_consolidation,purpose | renewable_energy__vacation__house__educational,purpose | small_business,initial_list_status | w
0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0
1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,1,0,1,0,0,0,0
2,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0
3,0,1,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1,0,0,0
4,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,0,0,0


In [5]:
ref_col.head()

Unnamed: 0.1,Unnamed: 0,feature,is_ref_col,root
0,0,annual_inc | -500-30000,1,annual_inc
1,1,annual_inc | 100000-120000,0,annual_inc
2,2,annual_inc | 120000-150000,0,annual_inc
3,3,annual_inc | 30000-40000,0,annual_inc
4,4,annual_inc | 40000-50000,0,annual_inc


In [8]:
df.shape

(237500, 124)

In [9]:
ref_col.shape

(138, 4)
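
The column names above (for example `dti | 8-16`) and the `is_ref_col` flag suggest that each feature was binned, one-hot encoded, and stripped of one reference bin. The sketch below shows one way such columns could be produced. The bin edges, labels, and the choice of reference bin are assumptions, since the notebook loads an already encoded file.

```python
import numpy as np
import pandas as pd

# Hypothetical raw values and bin edges; the real edges are not given in the source.
dti = pd.Series([3.0, 12.5, 20.0, 27.0, 35.0], name="dti")
edges = [-np.inf, 8, 16, 24, 31, np.inf]
labels = ["<8", "8-16", "16-24", "24-31", ">31"]

# Bin the continuous feature, then one-hot encode the bins.
binned = pd.cut(dti, bins=edges, labels=labels)
dummies = pd.get_dummies(binned, prefix="dti", prefix_sep=" | ").astype(int)

# Hold out one bin as the reference category to avoid perfect collinearity.
ref_feature = "dti | <8"
X_dti = dummies.drop(columns=ref_feature)
```

A row that falls in the reference bin has zeros in every remaining `dti` column, so its contribution is carried by the intercept.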

---
---

## Modeling

In [171]:
from sklearn.model_selection import train_test_split, RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as imb_Pipeline
from sklearn.linear_model import LogisticRegression

# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import (precision_score, accuracy_score, recall_score, roc_curve,
                             f1_score, roc_auc_score, classification_report,
                             ConfusionMatrixDisplay)

In [165]:
X = df.drop(labels='target', axis=1)
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.25, 
                                                    random_state=1, 
                                                    stratify=y)

In [None]:
# accuracy_lst = []
# precision_lst = []
# recall_lst = []
# f1_lst = []
# auc_lst = []

# lr_params = {'lr__C': [0.01, 1, 100],
#             'lr__l1_ratio': [0.01, 0.1, 0.9]}
# sss = StratifiedKFold(n_splits=5, random_state=1, shuffle=True)

# smote = SMOTE(sampling_strategy='minority', random_state=1, n_jobs=-1)
# lr = LogisticRegression(solver='saga', n_jobs=-1, verbose=1, random_state=1, penalty='l1')

# pipe = [('smote', smote), ('lr', lr)]

In [None]:
# pipeline = imb_Pipeline(pipe)

# rand_lr = RandomizedSearchCV(pipeline, lr_params, 
#                              n_iter=4, 
#                              n_jobs=-1, 
#                              verbose=10, 
#                              scoring='recall', 
#                              random_state=1)

# for train, test in sss.split(X, y):
    
#     rand_lr.fit(X.iloc[train], y.iloc[train])
#     y_hat = rand_lr.predict(X.iloc[test])
    
#     accuracy_lst.append(accuracy_score(y.iloc[test], y_hat))    
#     precision_lst.append(precision_score(y.iloc[test], y_hat))    
#     recall_lst.append(recall_score(y.iloc[test], y_hat))
#     f1_lst.append(f1_score(y.iloc[test], y_hat))
#     auc_lst.append(roc_auc_score(y.iloc[test], y_hat))

In [57]:
smote = SMOTE(sampling_strategy='minority', random_state=1, n_jobs=-1)
# l1_ratio only takes effect with penalty='elasticnet', so it is omitted for the pure L1 penalty
lr = LogisticRegression(C=100, solver='saga', n_jobs=-1, verbose=1, random_state=1, penalty='l1')

resample_X_train, resample_y_train = smote.fit_resample(X_train, y_train)
lr.fit(resample_X_train, resample_y_train)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.


max_iter reached after 48 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:   47.8s finished


In [58]:
## --- summarize each feature & its coefficient

df_temp = pd.DataFrame(columns=['feature', 'coef'])
df_temp['feature'] = X_train.columns
df_temp['coef'] = lr.coef_[0]

intercpt = pd.DataFrame(columns=['feature', 'coef'])
intercpt['feature'] = ['intercept']
intercpt['coef'] = lr.intercept_

df_temp = pd.concat([intercpt, df_temp], axis=0)
df_temp.head()

Unnamed: 0,feature,coef
0,intercept,-8.1176
0,delinq_2yrs,0.1329
1,mths_since_last_delinq,0.2542
2,pub_rec,0.5785
3,out_prncp,-0.115


In [59]:
## --- merge the coef-feature table with the reference column table
## --- fill in the na w/ proper values

ref_col_coef = pd.merge(ref_col, df_temp, how='outer', on='feature')
ref_col_coef.drop(labels='Unnamed: 0', axis=1, inplace=True)

ref_col_coef['is_ref_col'].fillna(0, inplace=True)
ref_col_coef['root'].fillna(ref_col_coef['feature'], inplace=True)
ref_col_coef['coef'].fillna(0, inplace=True)

In [62]:
ref_col_coef.head()

Unnamed: 0,feature,is_ref_col,root,coef
0,annual_inc | -500-30000,1.0,annual_inc,0.0
1,annual_inc | 100000-120000,0.0,annual_inc,1.2277
2,annual_inc | 120000-150000,0.0,annual_inc,1.2508
3,annual_inc | 30000-40000,0.0,annual_inc,0.7035
4,annual_inc | 40000-50000,0.0,annual_inc,0.7839


In [61]:
## --- filter unuseful features based on their coef

pd.merge(
    ref_col_coef[(ref_col_coef['coef'] < 0.01) & (ref_col_coef['coef'] > -0.01)].groupby('root')[['coef']].count(),
    ref_col_coef.groupby('root')['coef'].agg(['count', 'mean', 'median', 'min', 'max']),
    how='right',
    left_index=True,
    right_index=True)

Unnamed: 0_level_0,coef,count,mean,median,min,max
root,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
annual_inc,1.0,11,0.9211,0.9886,0.0,1.3435
delinq_2yrs,,1,0.1329,0.1329,0.1329,0.1329
dti,1.0,5,0.8554,1.1417,0.0,1.4778
earliest_cr_line,1.0,5,0.7493,0.9646,0.0,1.1315
emp_length,1.0,5,0.4843,0.6252,0.0,0.7651
home_ownership,1.0,3,0.3999,0.3856,0.0,0.814
initial_list_status,1.0,2,-0.0292,-0.0292,-0.0584,0.0
inq_last_6mths,1.0,5,0.2653,0.2677,0.0,0.5129
int_rate,1.0,5,1.3289,1.0691,0.0,3.3844
intercept,,1,-8.1176,-8.1176,-8.1176,-8.1176


In [63]:
## --- drop unuseful features

rm_feats = ['out_prncp',
            'revol_bal',
            'next_pymnt_d',
            'last_pymnt_d',
            'last_credit_pull_d',
            'issue_d']

for col in rm_feats:
    ref_col_coef.drop(
        labels= ref_col_coef[ref_col_coef['root'] == col].index,
        axis=0,
        inplace=True
    )  

In [167]:
## --- selected features to be trained again

X = df[ref_col_coef[(ref_col_coef['is_ref_col'] == 0) & (ref_col_coef['feature'] != 'intercept')]['feature'].values]
y = df['target']

X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size=0.25, 
                                                    random_state=1, 
                                                    stratify=y)

In [70]:
smote = SMOTE(sampling_strategy='minority', random_state=1, n_jobs=-1)
# l1_ratio only takes effect with penalty='elasticnet', so it is omitted for the pure L1 penalty
lr = LogisticRegression(C=100, solver='saga', n_jobs=-1, verbose=1, random_state=1, penalty='l1')

resample_X_train, resample_y_train = smote.fit_resample(X_train, y_train)
lr.fit(resample_X_train, resample_y_train)

[Parallel(n_jobs=-1)]: Using backend ThreadingBackend with 12 concurrent workers.


max_iter reached after 42 seconds


[Parallel(n_jobs=-1)]: Done   1 out of   1 | elapsed:   42.5s finished


In [71]:
df_temp = pd.DataFrame(columns=['feature', 'coef'])
df_temp['feature'] = X_train.columns
df_temp['coef'] = lr.coef_[0]

intercpt = pd.DataFrame(columns=['feature', 'coef'])
intercpt['feature'] = ['intercept']
intercpt['coef'] = lr.intercept_

df_temp = pd.concat([intercpt, df_temp], axis=0)
df_temp.head()

Unnamed: 0,feature,coef
0,intercept,-7.7998
0,annual_inc | 100000-120000,1.1028
1,annual_inc | 120000-150000,1.1359
2,annual_inc | 30000-40000,0.6202
3,annual_inc | 40000-50000,0.6858


## Evaluation

In [75]:
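## --- auc roc
## --- note: lr.predict returns hard labels, so this value equals balanced accuracy;
## --- pass lr.predict_proba(X_test)[:, 1] to score the ranking instead

roc = roc_auc_score(y_test, lr.predict(X_test))
roc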

0.9224928425284086

In [76]:
## --- gini

gini = roc*2 - 1
gini

0.8449856850568171
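
To illustrate how the inputs to `roc_auc_score` affect AUC and Gini, here is a small sketch on made-up labels and scores (none of the numbers come from the notebook). It compares AUC computed from predicted probabilities with AUC computed from hard labels at an assumed 0.5 cut.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and scores, not taken from the notebook's data.
y_true = np.array([0, 0, 1, 1, 1, 0, 1, 1])
p_pos = np.array([0.10, 0.40, 0.35, 0.80, 0.90, 0.45, 0.60, 0.70])  # P(target == 1)

# AUC from probabilities measures how well the scores rank positives above negatives.
auc_proba = roc_auc_score(y_true, p_pos)

# AUC from hard labels collapses the ranking to one operating point
# and equals balanced accuracy, (TPR + TNR) / 2.
y_hat = (p_pos >= 0.5).astype(int)
auc_labels = roc_auc_score(y_true, y_hat)

# Gini is a rescaling of AUC to the range [-1, 1].
gini_proba = 2 * auc_proba - 1
```

The two AUC values generally differ, so it is worth stating which input was used when reporting the metric.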

In [211]:
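df_test_pred = pd.DataFrame(columns=['y_test', 'y_pred_proba', 'y_pred'])
df_test_pred['y_test'] = y_test.values
# predict_proba returns one column per class; column 0 is P(target == 0)
# (rows with y_pred_proba near 1 are predicted as class 0 in the output below)
df_test_pred['y_pred_proba'] = lr.predict_proba(X_test)[:, 0]
df_test_pred['y_pred'] = lr.predict(X_test)
df_test_pred.sort_values('y_pred_proba', inplace=True)
df_test_pred.reset_index(inplace=True)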

In [212]:
df_test_pred['cum_n_population'] = df_test_pred.index+1
df_test_pred['cum_n_good'] = df_test_pred['y_test'].cumsum()
df_test_pred['cum_n_bad'] = df_test_pred['cum_n_population'] - df_test_pred['cum_n_good']

df_test_pred['cum_%_population'] = df_test_pred['cum_n_population'] / len(df_test_pred)
df_test_pred['cum_%_good'] = df_test_pred['cum_n_good'] / df_test_pred['y_test'].sum()
df_test_pred['cum_%_bad'] = df_test_pred['cum_n_bad'] / (len(df_test_pred) - df_test_pred['y_test'].sum())

In [86]:
df['target'].value_counts()

1    188148
0     49352
Name: target, dtype: int64

In [214]:
df_test_pred.tail()

Unnamed: 0,index,y_test,y_pred_proba,y_pred,cum_n_population,cum_n_good,cum_n_bad,cum_%_population,cum_%_good,cum_%_bad
59370,38752,0,1.0,0,59371,47037,12334,0.9999,1.0,0.9997
59371,38905,0,1.0,0,59372,47037,12335,0.9999,1.0,0.9998
59372,10663,0,1.0,0,59373,47037,12336,1.0,1.0,0.9998
59373,35428,0,1.0,0,59374,47037,12337,1.0,1.0,0.9999
59374,14586,0,1.0,0,59375,47037,12338,1.0,1.0,1.0


In [215]:
## --- kolmogorov-smirnov

ks = max(df_test_pred['cum_%_good'] - df_test_pred['cum_%_bad'])
ks

0.8534096397895073
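
The Kolmogorov-Smirnov computation above can be wrapped in a small helper. The sketch below assumes the same convention as the cumulative columns (1 = good, 0 = bad, rows sorted by ascending probability of being bad); the function name `ks_statistic` and the toy data are made up for illustration.

```python
import numpy as np
import pandas as pd

def ks_statistic(y_true, p_bad):
    """Largest gap between cumulative good and bad shares, sorted by ascending P(bad).

    Assumes y_true uses 1 = good and 0 = bad.
    """
    frame = pd.DataFrame({"y": np.asarray(y_true), "p": np.asarray(p_bad)})
    frame = frame.sort_values("p", kind="mergesort").reset_index(drop=True)
    cum_good = frame["y"].cumsum() / frame["y"].sum()
    cum_bad = (1 - frame["y"]).cumsum() / (1 - frame["y"]).sum()
    return float((cum_good - cum_bad).max())

# Toy example: four good and three bad borrowers.
y_toy = [1, 1, 1, 0, 1, 0, 0]
p_toy = [0.05, 0.10, 0.20, 0.30, 0.40, 0.80, 0.90]
ks_toy = ks_statistic(y_toy, p_toy)
```

A value of 1 means the scores separate the two groups perfectly, while a value near 0 means the two cumulative curves overlap.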

## Score card

In [100]:
min_score = 300
max_score = 850

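## --- note: ref_col_coef still carries the coefficients merged in cell 59 (first fit);
## --- re-merge the refit coefficients first if the score card should follow the final model
min_sum_coef = ref_col_coef.groupby('root')['coef'].min().sum()
max_sum_coef = ref_col_coef.groupby('root')['coef'].max().sum()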

In [104]:
## --- create the score column
ref_col_coef['score'] = ref_col_coef['coef'] * (max_score - min_score) / (max_sum_coef - min_sum_coef)

## --- edit the score for intercept (row label 138)
## --- use .loc to assign on the DataFrame itself rather than on a copy
ref_col_coef.loc[138, 'score'] = ((ref_col_coef.loc[138, 'coef'] - min_sum_coef) / (max_sum_coef - min_sum_coef)) * (max_score - min_score) + min_score


In [106]:
ref_col_coef['score_prelim'] = ref_col_coef['score'].round()

In [126]:
## --- check the max & min scores -> because of rounding, the maximum is not exactly 850

min_sum_score_prel = ref_col_coef.groupby('root')['score_prelim'].min().sum() # 300.0
max_sum_score_prel = ref_col_coef.groupby('root')['score_prelim'].max().sum() # 849.0

In [132]:
ref_col_coef['diff'] = ref_col_coef['score_prelim'] - ref_col_coef['score']
ref_col_coef['diff'].argmin()

13

In [134]:
## --- set the proper value

ref_col_coef['score_final'] = ref_col_coef['score_prelim']
## --- positional assignment on the DataFrame itself avoids the chained-assignment warning
ref_col_coef.iloc[13, ref_col_coef.columns.get_loc('score_final')] = 16


In [135]:
## --- check the max & min scores 

min_sum_score_fin = ref_col_coef.groupby('root')['score_final'].min().sum() # 300.0
max_sum_score_fin = ref_col_coef.groupby('root')['score_final'].max().sum() # 850

In [136]:
max_sum_score_fin

850.0
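
The scaling used in this section can be checked on a toy coefficient table. In the sketch below the features and coefficient values are invented. Only the formulas follow the cells above: each coefficient is multiplied by a common factor, and the intercept absorbs the offset so that the attainable totals span 300 to 850.

```python
import pandas as pd

# Toy coefficient table; values are illustrative, not the fitted ones.
coefs = pd.DataFrame({
    "feature": ["intercept", "dti | <8", "dti | 8-16", "dti | >16",
                "term | 36months", "term | 60months"],
    "root": ["intercept", "dti", "dti", "dti", "term", "term"],
    "coef": [-2.0, 0.0, 0.5, 1.5, 0.0, -0.8],
})
min_score, max_score = 300, 850

# Lowest and highest attainable sum of coefficients (one level per root).
min_sum = coefs.groupby("root")["coef"].min().sum()
max_sum = coefs.groupby("root")["coef"].max().sum()
factor = (max_score - min_score) / (max_sum - min_sum)

# Every non-intercept level is scaled by the same factor.
coefs["score"] = coefs["coef"] * factor

# The intercept absorbs the offset so the worst profile lands on min_score.
is_intercept = coefs["feature"] == "intercept"
coefs.loc[is_intercept, "score"] = (
    (coefs.loc[is_intercept, "coef"] - min_sum) * factor + min_score
)

# Before rounding, the attainable totals hit the bounds exactly.
lowest = coefs.groupby("root")["score"].min().sum()
highest = coefs.groupby("root")["score"].max().sum()
```

Rounding each score to an integer afterwards can shift the totals by a point or two, which is why the notebook adjusts one entry by hand.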

## Score card on X Test

In [151]:
# filter w/o reference columns

df_temp = ref_col_coef[ref_col_coef['is_ref_col'] == 0].reset_index(drop=True)
df_temp[df_temp['feature'] == 'intercept']

Unnamed: 0,feature,is_ref_col,root,coef,score,score_prelim,diff,score_final
97,intercept,0.0,intercept,-8.1176,555.9492,556.0,0.0508,556.0


In [153]:
X_test_w_intrcpt = X_test.copy()  # copy so that X_test itself is not modified
X_test_w_intrcpt.insert(97, 'intercept', 1)

In [157]:
scorecard_scores = df_temp['score_final']
scorecard_scores = scorecard_scores.values.reshape(104, 1)

In [158]:
## --- dot product scorecard w/ the test set
## --- output the score for all test data

y_test_scores = X_test_w_intrcpt.dot(scorecard_scores)

In [162]:
## --- from scores of all test data, bring it back to probability
## --- follow the formula

sum_coef_from_score = ((y_test_scores - min_score) / (max_score - min_score)) * (max_sum_coef - min_sum_coef) + min_sum_coef
y_hat_proba_from_score = np.exp(sum_coef_from_score) / (np.exp(sum_coef_from_score) + 1)

In [163]:
y_hat_proba_from_score.head()

Unnamed: 0,0
119201,0.9961
37928,0.0
169127,1.0
170576,0.0843
157277,0.9999
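
The conversion from a score back to a probability can be written as one function. This is a sketch under assumed bounds (`lo` and `hi` are hypothetical coefficient sums), and it uses `1 / (1 + exp(-x))`, which is algebraically the same as `exp(x) / (exp(x) + 1)` but avoids overflow for large positive `x`.

```python
import numpy as np

def score_to_proba(score, min_sum, max_sum, min_score=300, max_score=850):
    """Map a score card total to log-odds, then to a probability."""
    log_odds = (score - min_score) / (max_score - min_score) * (max_sum - min_sum) + min_sum
    return 1.0 / (1.0 + np.exp(-log_odds))

# Hypothetical coefficient-sum bounds, chosen only for the demonstration.
lo, hi = -4.0, 6.0

p_min = score_to_proba(300.0, lo, hi)   # lowest score maps to sigmoid(lo)
p_mid = score_to_proba(520.0, lo, hi)   # 520 is where the log-odds are 0 for these bounds
p_max = score_to_proba(850.0, lo, hi)   # highest score maps to sigmoid(hi)
```

The function also accepts a NumPy array or pandas Series of scores, since every operation is element-wise.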


## Setting Cut-off

In [172]:
## --- false-positive rate, true-positive rate, thresholds
## --- each threshold is a cut on y_pred_proba, i.e. on the predicted probability of default
## --- df_test_pred holds the test labels and the predicted probabilities

fpr, tpr, thresholds = roc_curve(df_test_pred['y_test'], df_test_pred['y_pred_proba'])

In [173]:
cut_off = pd.DataFrame(columns=['thresholds', 'fpr', 'tpr'])
cut_off['fpr'] = fpr
cut_off['tpr'] = tpr
cut_off['thresholds'] = thresholds

In [176]:
## --- from thresholds convert to score

cut_off['Score'] = ((np.log(cut_off['thresholds'] / (1 - cut_off['thresholds'])) - min_sum_coef) * ((max_score - min_score) / (max_sum_coef - min_sum_coef)) + min_score).round()

  result = getattr(ufunc, method)(*inputs, **kwargs)
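
The logit in this conversion is undefined for thresholds of exactly 0 or 1, and the first threshold returned by `roc_curve` is a sentinel above every score. One possible way to keep the result finite is to clip before taking the logarithm, as sketched below. The helper name `threshold_to_score`, the tolerance `eps`, the example bounds, and the final clipping of scores to the 300 to 850 range are all assumptions rather than the notebook's method.

```python
import numpy as np
import pandas as pd

def threshold_to_score(p, min_sum, max_sum, min_score=300, max_score=850, eps=1e-6):
    """Convert probabilities to rounded scores, clipping to keep the logit finite."""
    p = pd.Series(p, dtype=float).clip(lower=eps, upper=1 - eps)
    log_odds = np.log(p / (1 - p))
    score = (log_odds - min_sum) * (max_score - min_score) / (max_sum - min_sum) + min_score
    # Assumed design choice: keep scores inside the score card range.
    return score.clip(lower=min_score, upper=max_score).round()

# Toy thresholds, including a sentinel above 1 and the exact endpoints.
scores = threshold_to_score([1.9, 1.0, 0.5, 0.0], min_sum=-4.0, max_sum=6.0)
```

With clipping in place, no manual correction of the first rows is needed and no runtime warning is raised.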


In [184]:
## --- make a correction in the highest threshold

cut_off.loc[0, 'thresholds'] = 1
cut_off.loc[[0, 1], 'Score'] = 850

In [185]:
cut_off.head()

Unnamed: 0,thresholds,fpr,tpr,Score
0,1.0,0.0,0.0,850.0
1,1.0,0.0001,0.0,850.0
2,1.0,0.0051,0.0,844.0
3,1.0,0.0053,0.0,844.0
4,1.0,0.0183,0.0,839.0


In [216]:
## --- the number of approved & rejected & also their rate

def n_approved(p):
    return np.where(df_test_pred['y_pred_proba'] >= p, 1, 0).sum()


cut_off['n_approved'] = cut_off['thresholds'].apply(n_approved)
cut_off['n_rejected'] = len(df_test_pred) - cut_off['n_approved']
cut_off['approval_rate'] = cut_off['n_approved'] / len(df_test_pred)
cut_off['rejection_rate'] = 1 - cut_off['approval_rate']

In [217]:
cut_off.head()

Unnamed: 0,thresholds,fpr,tpr,Score,n_approved,n_rejected,approval_rate,rejection_rate
0,1.0,0.0,0.0,850.0,0,59375,0.0,1.0
1,1.0,0.0001,0.0,850.0,1,59374,0.0,1.0
2,1.0,0.0051,0.0,844.0,63,59312,0.0011,0.9989
3,1.0,0.0053,0.0,844.0,65,59310,0.0011,0.9989
4,1.0,0.0183,0.0,839.0,226,59149,0.0038,0.9962
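
Counting rows at or above each threshold with `apply` rescans the whole test set once per threshold. A vectorized alternative, sketched here with a hypothetical helper `n_at_or_above` and toy numbers, sorts the probabilities once and uses `np.searchsorted`.

```python
import numpy as np

def n_at_or_above(probas, thresholds):
    """Count, for each threshold, how many probabilities are >= that threshold."""
    sorted_p = np.sort(np.asarray(probas, dtype=float))
    # side="left" returns the number of values strictly below each threshold.
    below = np.searchsorted(sorted_p, np.asarray(thresholds, dtype=float), side="left")
    return sorted_p.size - below

probas = [0.10, 0.40, 0.40, 0.75, 0.95]
counts = n_at_or_above(probas, [1.0, 0.95, 0.40, 0.0])
rates = counts / len(probas)
```

The counts match the `>=` comparison used in `n_approved`, including ties at the threshold.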


In [5]:
## --- w/ this threshold,
## --- we reject every customer w/ a score lower than 672
## --- we obtain an 11.36% approval rate

prob_default = .95
cut_off[cut_off['thresholds'] >= prob_default].tail(10)

Unnamed: 0.1,Unnamed: 0,thresholds,fpr,tpr,Score,n_approved,n_rejected,approval_rate,rejection_rate
461,461,0.9507,0.5284,0.0043,672.0,6723,52652,0.1132,0.8868
462,462,0.9507,0.5284,0.0043,672.0,6724,52651,0.1132,0.8868
463,463,0.9506,0.5286,0.0043,672.0,6726,52649,0.1133,0.8867
464,464,0.9506,0.5286,0.0044,672.0,6727,52648,0.1133,0.8867
465,465,0.9506,0.5288,0.0044,672.0,6729,52646,0.1133,0.8867
466,466,0.9506,0.5288,0.0044,672.0,6730,52645,0.1133,0.8867
467,467,0.9502,0.5294,0.0044,672.0,6738,52637,0.1135,0.8865
468,468,0.9502,0.5294,0.0044,672.0,6739,52636,0.1135,0.8865
469,469,0.9501,0.5297,0.0044,672.0,6743,52632,0.1136,0.8864
470,470,0.95,0.5297,0.0044,672.0,6744,52631,0.1136,0.8864


In [6]:
## --- w/ this threshold,
## --- we reject every customer w/ a score lower than 664
## --- we obtain a 14.19% approval rate

prob_default = .90
cut_off[cut_off['thresholds'] >= prob_default].tail(10)

Unnamed: 0.1,Unnamed: 0,thresholds,fpr,tpr,Score,n_approved,n_rejected,approval_rate,rejection_rate
1180,1180,0.9009,0.6218,0.0155,664.0,8403,50972,0.1415,0.8585
1181,1181,0.9008,0.6218,0.0156,664.0,8406,50969,0.1416,0.8584
1182,1182,0.9006,0.6221,0.0156,664.0,8410,50965,0.1416,0.8584
1183,1183,0.9006,0.6221,0.0156,664.0,8411,50964,0.1417,0.8583
1184,1184,0.9006,0.6224,0.0156,664.0,8414,50961,0.1417,0.8583
1185,1185,0.9005,0.6224,0.0156,664.0,8415,50960,0.1417,0.8583
1186,1186,0.9004,0.6226,0.0156,664.0,8418,50957,0.1418,0.8582
1187,1187,0.9003,0.6226,0.0157,664.0,8419,50956,0.1418,0.8582
1188,1188,0.9003,0.6228,0.0157,664.0,8421,50954,0.1418,0.8582
1189,1189,0.9002,0.6229,0.0157,664.0,8424,50951,0.1419,0.8581


In [8]:
## --- w/ this threshold,
## --- we reject every customer w/ a score lower than 659
## --- we obtain a 17.13% approval rate

prob_default = .85
cut_off[cut_off['thresholds'] >= prob_default].tail(10)

Unnamed: 0.1,Unnamed: 0,thresholds,fpr,tpr,Score,n_approved,n_rejected,approval_rate,rejection_rate
2015,2015,0.8502,0.7082,0.0301,659.0,10156,49219,0.171,0.829
2016,2016,0.8502,0.7082,0.0302,659.0,10158,49217,0.1711,0.8289
2017,2017,0.8502,0.7084,0.0302,659.0,10160,49215,0.1711,0.8289
2018,2018,0.8502,0.7084,0.0302,659.0,10161,49214,0.1711,0.8289
2019,2019,0.8501,0.7085,0.0302,659.0,10163,49212,0.1712,0.8288
2020,2020,0.8501,0.7085,0.0302,659.0,10164,49211,0.1712,0.8288
2021,2021,0.8501,0.7086,0.0302,659.0,10165,49210,0.1712,0.8288
2022,2022,0.8501,0.7086,0.0303,659.0,10166,49209,0.1712,0.8288
2023,2023,0.85,0.7087,0.0303,659.0,10167,49208,0.1712,0.8288
2024,2024,0.85,0.7087,0.0303,659.0,10168,49207,0.1713,0.8287


---
---

## Saving

In [223]:
# save all outputs

ref_col_coef.to_csv('..\\dataset\\ref_col_coef_score_table.csv')
cut_off.to_csv('..\\dataset\\cut_off.csv')

In [226]:
import joblib

joblib.dump(lr, '.\\model\\lr_final.joblib')  # dump returns file names, so lr is not reassigned
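
A quick round trip confirms that a saved model can be restored. The sketch below uses a tiny synthetic model and a temporary directory; the file name `lr_toy.joblib` is hypothetical. Note that `joblib.dump` returns the list of file names it wrote, not the model.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny stand-in model; the data are synthetic.
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(60, 3))
y_toy = (X_toy[:, 0] + 0.5 * X_toy[:, 1] > 0).astype(int)
model = LogisticRegression().fit(X_toy, y_toy)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "lr_toy.joblib")  # hypothetical file name
    written = joblib.dump(model, path)             # list of file names written
    restored = joblib.load(path)                   # fitted estimator read back

# The restored estimator carries the same fitted coefficients.
same_coef = np.allclose(model.coef_, restored.coef_)
```

Loading with the same scikit-learn version that saved the file is the safest option, since pickled estimators are not guaranteed to be portable across versions.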