<h1> Modeling Creditworthiness of Customers via Rare Event Phenomenon Modeling, using Random Undersampling and a "Loss Function" to Minimize Dollar Loss for Banks while Ensuring Fair Credit Evaluation for Customers</h1>

<h2>Problem Statement:  </h2>

<b> Credit evaluation requires special techniques because it poses a number of unique challenges to the modeler and to the bank. </b>
1. The law requires a bank to tell a customer the reason if credit is denied. This limits the modeling techniques available to the modeler. Specifically, "black-box" techniques such as neural networks should be avoided so that decisions stay deterministic, explainable, and fair.

2. Credit evaluation falls under the realm of the "Rare Event Phenomenon". The vast majority of people who apply for credit will probably receive it from a bank, and only a small fraction will not. In such situations a model tends to predict everything as "good credit". This produces a low error rate (since bad credit is a rare event) but completely defeats the purpose of credit evaluation.
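
To make the rare-event trap concrete, the short sketch below scores a hypothetical classifier that labels every applicant "good", using the class counts of this dataset (10,000 good and 500 bad, as the confusion matrix in section 2 shows). The variable names are illustrative and not part of the notebook.

```python
# Class counts in this dataset: 10,000 "good" and 500 "bad" applicants.
n_good, n_bad = 10_000, 500

# A naive model that predicts "good" for everyone gets every good
# applicant right and every bad applicant wrong.
correct = n_good
accuracy = correct / (n_good + n_bad)
bad_detected = 0 / n_bad  # share of bad credits the naive model catches

print(f"Accuracy: {accuracy:.4f}")
print(f"Bad credits detected: {bad_detected:.0%}")
```

An accuracy of roughly 95% with no bad credit detected is why the misclassification rate alone is a poor benchmark here.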

<h2>Solution: </h2>

<b>We tackle this two-fold problem by:</b>
1. Using deterministic modeling techniques such as Logistic Regression, which meet the requirement of fairness
2. Devising a "Loss Function" that calculates the dollar value of loss, and using it as the benchmark for performance
3. Using Random Undersampling with 10-fold Cross Validation on the Loss Function to optimize decision making
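
As a rough illustration of what random undersampling does, the sketch below keeps a fixed number of rows per class, drawn without replacement. The notebook itself relies on `RandomUnderSampler` from imblearn; `random_under_sample` here is a hypothetical stand-in written in plain Python, and the toy labels are made up.

```python
import random

def random_under_sample(labels, target_counts, seed):
    """Return sorted row indices after sampling each class down, without replacement.

    labels        -- list of class labels (0 = bad, 1 = good)
    target_counts -- dict mapping class label to the number of rows to keep
    """
    rng = random.Random(seed)
    kept = []
    for cls, n_keep in target_counts.items():
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        kept.extend(rng.sample(idx, n_keep))
    return sorted(kept)

# Toy data with the same 20:1 imbalance as the credit data.
labels = [1] * 1000 + [0] * 50
sample_idx = random_under_sample(labels, {0: 50, 1: 50}, seed=12345)
sampled = [labels[i] for i in sample_idx]
print(sampled.count(0), sampled.count(1))
```

Every rare "bad" case is kept, while only a random subset of the "good" cases is used, so the model sees a more balanced training set.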

<b>About the Dataset </b>
The dataset was provided by Prof. Edward Jones, Executive Professor, Department of Statistics at Texas A&M University, College Station, as part of the academic work for course STAT 656.
The dataset consists of 10,500 unique data points with 19 attributes and 1 target.

<h3>Data Summary</h3>

1. age : Age in years ranging from 19 to 120 
2. amount : Credit amount any value from zero to 20,000
3. duration: Loan duration: 1 to 72 months
4. checking: Existing status of checking account: 1, 2, 3 or 4 
5. coapp: Status of other debtors/guarantors: 1, 2 or 3 
6. depends: Dependents: 1 (none) or 2 (1 or more)
7. employed: Employment duration status: 1, 2, 3, 4 or 5
8. existcr: Number of existing bank loans: 1, 2, 3 or 4
9. foreign: Foreign worker: 1 (yes) or 2 (no)
10. history: Credit history: 0, 1, 2, 3 or 4
11. housing: Housing status: 1, 2, or 3
12. installp: Installment rate as percent of income: 1, 2, 3 or 4
13. job: Employment status: 1, 2, 3 or 4
14. marital: Status and gender: 1, 2, 3 or 4
15. other: Other installment plans: 1, 2 or 3
16. property: Property ownership: 1, 2, 3 or 4
17. resident: Permanent residence status: 1, 2, 3 or 4
18. savings: Savings account status: 1, 2, 3, 4 or 5
19. telephon: Telephone: 1 (no registered phone) or 2 (registered phone)
20. good_bad: Credit rating: ‘bad’ or ‘good’

<h3> <u> Table of Contents </u></h3>
<p> </p>


1. Reading the Dataset, Data Replacement, Preprocessing and Imputation
2. Modeling without Undersampling and resultant Loss
3. 10 fold Cross Validation with different Undersampling Ratios
4. Final Model Selection, Loss, Model Metrics and conclusion
    


<h3> 1. Reading the Dataset, Data Replacement, Preprocessing and Imputation </h3>

In [13]:
from imblearn.under_sampling import RandomUnderSampler
from AdvancedAnalytics import ReplaceImputeEncode, logreg, calculate
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()
df = pd.read_excel('CreditData_RareEvent.xlsx')

In [2]:
df.head()

Unnamed: 0,good_bad,age,amount,duration,checking,coapp,depends,employed,existcr,foreign,history,housing,installp,job,marital,other,property,resident,savings,telephon
0,good,67,1169,6,1,1,1,5,2,1,4,2,4,3,3,3,1,4,5,2
1,good,67,1169,6,1,1,1,5,2,1,4,2,4,3,3,3,1,4,5,2
2,good,67,1169,6,1,1,1,5,2,1,4,2,4,3,3,3,1,4,5,2
3,good,67,1169,6,1,1,1,5,2,1,4,2,4,3,3,3,1,4,5,2
4,good,67,1169,6,1,1,1,5,2,1,4,2,4,3,3,3,1,4,5,2


In [3]:
df.tail()

Unnamed: 0,good_bad,age,amount,duration,checking,coapp,depends,employed,existcr,foreign,history,housing,installp,job,marital,other,property,resident,savings,telephon
10495,bad,49,8386,30,2,1,1,4,1,1,4,2,2,3,3,3,2,2,1,1
10496,bad,33,4844,48,4,1,1,1,1,1,2,1,3,4,3,1,3,2,1,2
10497,bad,33,4844,48,4,1,1,1,1,1,2,1,3,4,3,1,3,2,1,2
10498,bad,26,8229,36,1,1,2,3,1,1,2,2,2,3,3,3,2,2,1,1
10499,bad,26,8229,36,1,1,2,3,1,1,2,2,2,3,3,3,2,2,1,1


We check for missing values.

In [5]:
df.isnull().sum()

good_bad    0
age         0
amount      0
duration    0
checking    0
coapp       0
depends     0
employed    0
existcr     0
foreign     0
history     0
housing     0
installp    0
job         0
marital     0
other       0
property    0
resident    0
savings     0
telephon    0
dtype: int64

Since there are no missing values we can safely proceed. The next step is to create a Data Dictionary of the data frame. 

In [25]:
attribute_map = {
        'good_bad':['B',('good','bad')],
        'age':['I', (19, 120)],
        'amount':['I', (0, 20000)],
        'duration':['I',(1,72)],
        'checking':['N',(1,2,3,4)],
        'coapp':['N',(1,2,3)],
        'depends':['B',(1,2)],
        'employed':['N',(1,2,3,4,5)],
        'existcr':['N', (1,2,3,4)],
        'foreign':['B', (1,2)],
        'history':['N',(0,1,2,3,4)],
        'housing':['N',(1,2,3)],
        'installp':['N',(1,2,3,4)],
        'job':['N',(1,2,3,4)],
        'marital':['N', (1,2,3,4)],
        'other':['N',(1,2,3)],
        'property':['N',(1,2,3,4)],
        'resident':['N',(1,2,3,4)],
        'savings':['N',(1,2,3,4,5)],
        'telephon':['B',(1,2)]}
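
The data dictionary also documents which values are legal for each attribute. The sketch below shows one way such a map could be used to flag out-of-range values. It is only an illustration: `attribute_map_demo`, `find_violations`, and the sample rows are hypothetical, and reading the codes 'I', 'B', and 'N' as interval, binary, and nominal is an assumption based on the values listed beside them.

```python
# Hypothetical subset of the data dictionary: type code, then allowed values.
# 'I' = interval (min, max); 'B' and 'N' = binary/nominal (allowed levels).
attribute_map_demo = {
    'age': ['I', (19, 120)],
    'checking': ['N', (1, 2, 3, 4)],
    'good_bad': ['B', ('good', 'bad')],
}

def find_violations(row, data_map):
    """Return the names of attributes whose value falls outside the map."""
    bad = []
    for name, (kind, allowed) in data_map.items():
        value = row[name]
        if kind == 'I':
            low, high = allowed
            ok = low <= value <= high
        else:
            ok = value in allowed
        if not ok:
            bad.append(name)
    return bad

rows = [
    {'age': 67, 'checking': 1, 'good_bad': 'good'},   # valid
    {'age': 150, 'checking': 7, 'good_bad': 'bad'},   # made-up outliers
]
for row in rows:
    print(find_violations(row, attribute_map_demo))
```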

We now scale the interval variables and use one-hot encoding to encode the nominal variables.

In [27]:
rie=ReplaceImputeEncode(data_map=attribute_map, nominal_encoding='one-hot',interval_scale='std', drop=True, display=False)
encoded_df = rie.fit_transform(df)
y = np.asarray(encoded_df['good_bad'])
X = np.asarray(encoded_df.drop('good_bad',axis=1))
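
For readers unfamiliar with the two transformations, here is a minimal plain-Python sketch of standardization and one-hot encoding. It is not the `ReplaceImputeEncode` implementation: the helper names are hypothetical, the data is made up, and both the use of the sample standard deviation and the choice to drop the last level are assumptions.

```python
import statistics

def standardize(values):
    """Z-score an interval attribute: subtract the mean, divide by the std dev."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation (assumption)
    return [(v - mean) / sd for v in values]

def one_hot(values, levels, drop_last=True):
    """Encode a nominal attribute as 0/1 columns, optionally dropping one level."""
    kept = levels[:-1] if drop_last else levels
    return [[1 if v == level else 0 for level in kept] for v in values]

durations = [6, 48, 12, 36]   # made-up loan durations in months
housing = [1, 2, 3, 2]        # made-up housing codes

z = standardize(durations)
encoded = one_hot(housing, levels=[1, 2, 3])
print([round(v, 3) for v in z])
print(encoded)
```

Dropping one level per nominal attribute avoids perfectly collinear columns, which matters for a regression model.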



<h3>2. Modeling without Undersampling and resultant Loss </h3>

The next step is to model the raw data before applying any undersampling, to understand how the model performs without it.

Errors are thrown when we attempt to change the default "solver" to "lbfgs", so the model below keeps the default solver.

In [None]:
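# Cost of a false positive (bad credit predicted good): the full credit amount
fp_cost = np.array(df['amount'])
# Cost of a false negative (good credit predicted bad): 15% of the credit amount
fn_cost = np.array(0.15*df['amount'])
# Candidate values of C, the inverse regularization strength
c_list = [1e-4, 1e-3, 1e-2, 1e-1, 1, 2, 3, 1e+64]
best_c = 0
max_f = 0
# Keep the C with the highest mean F1 score over 10-fold cross validation
for c in c_list:
    lgr = LogisticRegression(C=c, tol=1e-16)
    lgr_10 = cross_val_score(lgr, X, y, scoring='f1', cv=10)
    mean = lgr_10.mean()
    if mean > max_f:
        max_f = mean
        best_c = c
        best_lgr = lgr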

In [30]:
print("\nLogistic Regression Model using Entire Dataset and C = ",best_c)
best_lgr.fit(X,y)
logreg.display_binary_metrics(best_lgr, X, y)
loss,conf_mat = calculate.binary_loss(y,best_lgr.predict(X),fp_cost,fn_cost)
# Setup 20 random number seeds for the undersampling runs in the next section
np.random.seed(12345)
max_seed = 2**16 - 1
rand_val = np.random.randint(1, high=max_seed, size=20)


Logistic Regression Model using Entire Dataset and C =  1

Model Metrics
Observations...............     10500
Coefficients...............        46
DF Error...................     10454
Mean Absolute Error........    0.0785
Avg Squared Error..........    0.0379
Accuracy...................    0.9556
Precision..................    0.9555
Recall (Sensitivity).......    1.0000
F1-Score...................    0.9772
MISC (Misclassification)...      4.4%
     class 0...............     93.2%
     class 1...............      0.0%


     Confusion
       Matrix     Class 0   Class 1  
Class 0.....        34       466
Class 1.....         0     10000
Misclassification Rate.    0.0444
False Negative Loss....         0
False Positive Loss....   1800209
Total Loss.............   1800209
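
The loss figures above come from `calculate.binary_loss`. A minimal sketch of the idea, assuming class 1 is good credit and class 0 is bad credit, is given below. `dollar_loss` is a hypothetical helper, not the AdvancedAnalytics implementation, and the amounts are made up. The costs mirror the notebook: a false positive costs the full credit amount and a false negative costs 15% of it.

```python
def dollar_loss(y_true, y_pred, fp_cost, fn_cost):
    """Sum per-applicant dollar costs of misclassification.

    Classes: 1 = good credit, 0 = bad credit.
    False positive: a bad credit predicted good.
    False negative: a good credit predicted bad.
    Returns (fn_loss, fp_loss).
    """
    fn_loss = 0.0
    fp_loss = 0.0
    for actual, predicted, fp, fn in zip(y_true, y_pred, fp_cost, fn_cost):
        if actual == 0 and predicted == 1:
            fp_loss += fp
        elif actual == 1 and predicted == 0:
            fn_loss += fn
    return fn_loss, fp_loss

amounts = [1000, 2000, 4000, 8000]       # made-up credit amounts
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 1, 0]
fp_cost = amounts                         # full amount at risk
fn_cost = [0.15 * a for a in amounts]     # 15% of the amount

fn_loss, fp_loss = dollar_loss(y_true, y_pred, fp_cost, fn_cost)
print(fn_loss, fp_loss, fn_loss + fp_loss)
```

Because each error is weighted by the applicant's own credit amount, a model with a low misclassification rate can still carry a large dollar loss, which is what the baseline model shows.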


<h3> 3. 10 fold Cross Validation with different Undersampling Ratios </h3>

<p> </p>
We now use different undersampling ratios with 10-fold CV to find the optimum result.
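
The per-class sample sizes in `rus_ratio` follow from keeping all 500 bad cases and choosing the number of good cases that gives the desired good:bad ratio. The sketch below reproduces that arithmetic; `majority_count` is a hypothetical helper name.

```python
n_bad = 500  # every "bad" case (class 0) is kept in each sample

def majority_count(good_pct, bad_pct, n_minority=n_bad):
    """Number of "good" cases needed for a good:bad ratio of good_pct:bad_pct."""
    return round(n_minority * good_pct / bad_pct)

ratios = [(50, 50), (60, 40), (70, 30), (80, 20), (90, 10)]
for good_pct, bad_pct in ratios:
    counts = {0: n_bad, 1: majority_count(good_pct, bad_pct)}
    print(f"{good_pct}:{bad_pct}", counts)
```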

In [7]:
ratio = [ '50:50', '60:40', '70:30', '80:20', '90:10' ]

rus_ratio = ({0:500, 1:500}, {0:500, 1:750}, {0:500, 1:1167},{0:500, 1:2000}, {0:500, 1:4500})
c_list = [1e-4, 1e-3, 1e-2, 1e-1, 1, 2, 3, 4, 1e+64]
min_loss = 1e64
best_ratio = 0
for k in range(len(rus_ratio)):
    print("\nLogistic Regression Model using " + ratio[k] + " RUS")
    best_c = 0
    min_loss_c = 1e64
    for j in range(len(c_list)):
        c = c_list[j]
        fn_loss = np.zeros(len(rand_val))
        fp_loss = np.zeros(len(rand_val))
        misc = np.zeros(len(rand_val))
        for i in range(len(rand_val)):
            rus = RandomUnderSampler(sampling_strategy=rus_ratio[k],random_state=rand_val[i],replacement=False)
            X_rus, y_rus = rus.fit_resample(X, y)
            lgr = LogisticRegression(C=c, tol=1e-16, solver='lbfgs',max_iter=1000)
            lgr.fit(X_rus, y_rus)
            loss, conf_mat = calculate.binary_loss(y, lgr.predict(X),fp_cost, fn_cost, display=False)
            fn_loss[i] = loss[0]
            fp_loss[i] = loss[1]
            misc[i] = (conf_mat[1] + conf_mat[2])/y.shape[0]
        avg_misc = np.average(misc)
        t_loss = fp_loss+fn_loss
        avg_loss = np.average(t_loss)
        if avg_loss < min_loss_c:
            min_loss_c = avg_loss
            se_loss_c = np.std(t_loss)/math.sqrt(len(rand_val))
            best_c = c
            misc_c = avg_misc
            fn_avg_loss = np.average(fn_loss)
            fp_avg_loss = np.average(fp_loss)
        if min_loss_c < min_loss:
            min_loss = min_loss_c
            se_loss = se_loss_c
            best_ratio = k
            best_reg = best_c
        print("{:.<23s}{:12.2E}".format("Best C", best_c))
        print("{:.<23s}{:12.4f}".format("Misclassification Rate",misc_c))
        print("{:.<23s} ${:10,.0f}".format("False Negative Loss",fn_avg_loss))
        print("{:.<23s} ${:10,.0f}".format("False Positive Loss",fp_avg_loss))
        print("{:.<23s} ${:10,.0f}{:5s}${:<,.0f}".format("Total Loss", min_loss_c, " +/- ", se_loss_c))
    print("")
    print("{:.<23s}{:>12s}".format("Best RUS Ratio", ratio[best_ratio]))
    print("{:.<23s}{:12.2E}".format("Best C", best_reg))
    print("{:.<23s} ${:10,.0f}{:5s}${:<,.0f}".format("Lowest Loss", min_loss, " +/-", se_loss))


Logistic Regression Model using 50:50 RUS
Best C.................    1.00E-04
Misclassification Rate.      0.2812
False Negative Loss.... $ 2,321,172
False Positive Loss.... $   426,837
Total Loss............. $ 2,748,009 +/- $9,764
Best C.................    1.00E-03
Misclassification Rate.      0.2773
False Negative Loss.... $ 2,243,899
False Positive Loss.... $   394,464
Total Loss............. $ 2,638,362 +/- $11,396
Best C.................    1.00E-02
Misclassification Rate.      0.2872
False Negative Loss.... $ 2,069,123
False Positive Loss.... $   334,044
Total Loss............. $ 2,403,166 +/- $12,108
Best C.................    1.00E-01
Misclassification Rate.      0.2798
False Negative Loss.... $ 1,764,709
False Positive Loss.... $   335,186
Total Loss............. $ 2,099,894 +/- $20,415
Best C.................    1.00E+00
Misclassification Rate.      0.2688
False Negative Loss.... $ 1,566,610
False Positive Loss.... $   341,934
Total Loss............. $ 1,908,545 +/- $16,45

Best C.................    1.00E+00
Misclassification Rate.      0.0443
False Negative Loss.... $    76,843
False Positive Loss.... $ 1,478,691
Total Loss............. $ 1,555,533 +/- $5,191
Best C.................    2.00E+00
Misclassification Rate.      0.0441
False Negative Loss.... $    79,736
False Positive Loss.... $ 1,455,712
Total Loss............. $ 1,535,448 +/- $3,599
Best C.................    3.00E+00
Misclassification Rate.      0.0440
False Negative Loss.... $    80,524
False Positive Loss.... $ 1,447,921
Total Loss............. $ 1,528,444 +/- $4,144
Best C.................    4.00E+00
Misclassification Rate.      0.0439
False Negative Loss.... $    80,524
False Positive Loss.... $ 1,440,888
Total Loss............. $ 1,521,412 +/- $4,952
Best C.................    1.00E+64
Misclassification Rate.      0.0439
False Negative Loss.... $    85,380
False Positive Loss.... $ 1,420,117
Total Loss............. $ 1,505,497 +/- $5,181

Best RUS Ratio.........       80:20
Best C..

In [8]:
n_obs = len(y)
n_rand = 100
predicted_prob = np.zeros((n_obs,n_rand))
avg_prob = np.zeros(n_obs)
# Setup 100 random number seeds for use in creating random samples
np.random.seed(12345)
max_seed = 2**16 - 1
rand_value = np.random.randint(1, high=max_seed, size=n_rand)
# Model 100 random samples, each with the best RUS ratio found above
for i in range(len(rand_value)):
    rus = RandomUnderSampler(sampling_strategy=rus_ratio[best_ratio],random_state=rand_value[i], replacement=False)
    X_rus, y_rus = rus.fit_resample(X, y)
    lgr = LogisticRegression(C=best_c, tol=1e-16, solver='lbfgs', max_iter=1000)
    lgr.fit(X_rus, y_rus)
    predicted_prob[0:n_obs, i] = lgr.predict_proba(X)[0:n_obs, 0]
for i in range(n_obs):
    avg_prob[i] = np.mean(predicted_prob[i,0:n_rand])
# Set y_pred equal to the predicted classification
y_pred = avg_prob[0:n_obs] < 0.5
y_pred.astype(int)

array([1, 1, 1, ..., 1, 0, 0])
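
The ensemble step above averages, for each observation, the predicted probability of class 0 ("bad") across all models and then applies a 0.5 threshold. A small plain-Python sketch of the same logic, with made-up probabilities and hypothetical variable names:

```python
# Rows are observations, columns are models; each entry is a made-up
# predicted probability of class 0 ("bad") from one model.
predicted_prob_demo = [
    [0.10, 0.20, 0.15],
    [0.70, 0.40, 0.60],
    [0.45, 0.55, 0.65],
]

# Average across models for each observation.
avg_prob_demo = [sum(row) / len(row) for row in predicted_prob_demo]

# Predict class 1 ("good") when the average probability of "bad" is below 0.5.
y_pred_demo = [int(p < 0.5) for p in avg_prob_demo]
print(y_pred_demo)
```

Averaging over many undersampled models reduces the dependence of the final prediction on any single random sample of good cases.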

<h3> 4. Final Model Selection, Loss, Model Metrics and conclusion </h3>

In [10]:
# Calculate loss from using the ensemble predictions
print("\nEnsemble Estimates based on averaging",len(rand_value), "Models in $s")
loss, conf_mat = calculate.binary_loss(y, y_pred, fp_cost, fn_cost)


Ensemble Estimates based on averaging 100 Models in $s
Misclassification Rate.    0.0773
False Negative Loss....    375762
False Positive Loss....   1052394
Total Loss.............   1428156


<h3> Conclusion </h3>
<p> </p>

By combining random undersampling with an ensemble of logistic regression models, we reduced the total loss from about USD 1.80 million for the initial model to about USD 1.43 million.
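
The size of the improvement can be checked directly from the two "Total Loss" figures reported in sections 2 and 4. The variable names below are illustrative.

```python
baseline_loss = 1_800_209   # total loss without undersampling
ensemble_loss = 1_428_156   # total loss of the 100-model ensemble

saving = baseline_loss - ensemble_loss
saving_pct = 100 * saving / baseline_loss
print(f"Saving: ${saving:,} ({saving_pct:.1f}%)")
```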
<p> </p>

Techniques employed:
1. Random Undersampling
2. 10 fold CV
3. Modified Logistic Regression