# Lab 6: Credit Scoring

This lab material is largely self-contained. We assume that every student has already taken STAT7008 or knows the basic operations of Python. Note that you may use Anaconda to run the .ipynb file. For the installation of Anaconda, please see https://conda.io/docs/user-guide/install/index.html.

### Purpose

In Lab 6, you will learn how to:

a. build a credit scorecard. 

### Useful libraries for this Lab

a. pandas, for reading and binning the data.

b. numpy, for data arrays.

c. sklearn, for modelling.

d. os, for the working directory.

In [1]:
import pandas as pd
import numpy as np
from sklearn import linear_model
import os

wd = os.getcwd() # Get the current working directory; accepts.sas7bdat should be placed here.
print(wd)

/home/renjielu/PycharmProjects/DM8017/DM_Lab6


We read the data set accepts, which can be downloaded from Moodle. In this data set, the variable bad is the target variable. For simplicity, we consider only two input variables: bureau_score and age_oldest_tr.

In [2]:
accept = pd.read_sas(os.path.join(wd, 'accepts.sas7bdat'))

print(list(accept)) # for column names.
print(accept.head())

Y = accept['bad']
X = accept[['age_oldest_tr', 'bureau_score']]

[u'bankruptcy', u'bad', u'app_id', u'tot_derog', u'tot_tr', u'age_oldest_tr', u'tot_open_tr', u'tot_rev_tr', u'tot_rev_debt', u'tot_rev_line', u'rev_util', u'bureau_score', u'purch_price', u'msrp', u'down_pyt', u'purpose', u'loan_term', u'loan_amt', u'ltv', u'tot_income', u'used_ind', u'weight']
   bankruptcy  bad  app_id  tot_derog  tot_tr  age_oldest_tr  tot_open_tr  \
0         0.0  0.0  1001.0        6.0     7.0           46.0          NaN   
1         0.0  0.0  1002.0        0.0    21.0          153.0          6.0   
2         0.0  0.0  1003.0        0.0    29.0          194.0          4.0   
3         0.0  1.0  1005.0        2.0    20.0          129.0          8.0   
4         1.0  0.0  1006.0        2.0    10.0          108.0          6.0   

   tot_rev_tr  tot_rev_debt  tot_rev_line   ...    purch_price     msrp  \
0         NaN           NaN           NaN   ...       19678.00  17160.0   
1         1.0          97.0        4637.0   ...       28615.00  27950.0   
2         2.0  

We group each of the two input variables into 5 equal-width bins; missing values form an additional group, so each variable has 6 groups in the output below. We compute the weight of evidence (WOE) of each group and the information value (IV) of each variable. The results are shown below. Usually, a low IV, say $<$ 0.15, means weak classification power, while a high IV, say $>$ 0.4, means strong power.
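As a minimal sketch of these two quantities, assume the usual definitions: the WOE of group $i$ is $\ln(p^{good}_i / p^{bad}_i)$ and the IV is $\sum_i (p^{good}_i - p^{bad}_i)\,\mathrm{WOE}_i$, where $p^{good}_i$ and $p^{bad}_i$ are the shares of all goods and of all bads that fall in group $i$. The counts below are made up for illustration, and `woe_iv` is a hypothetical helper name. Unlike the lab code, this sketch does not add 0.5 to the counts.

```python
import numpy as np

def woe_iv(n_good, n_bad):
    """Return the per-group WOE and the total IV from per-group counts."""
    n_good = np.asarray(n_good, dtype=float)
    n_bad = np.asarray(n_bad, dtype=float)
    p_good = n_good / n_good.sum()  # share of all goods in each group
    p_bad = n_bad / n_bad.sum()     # share of all bads in each group
    woe = np.log(p_good / p_bad)
    iv = np.sum((p_good - p_bad) * woe)
    return woe, iv

# Made-up counts for three groups (illustration only).
woe, iv = woe_iv(n_good=[100, 300, 600], n_bad=[50, 30, 20])
print(woe)
print(iv)
```

A group that holds the same share of goods and bads (the second group here) has a WOE of zero and contributes nothing to the IV.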

In [3]:
cut_x1 = pd.cut(X.iloc[:,0],bins=5)
cut_x2 = pd.cut(X.iloc[:,1],bins=5)

cut_x1_u = pd.unique(list(cut_x1))
cut_x2_u = pd.unique(list(cut_x2))

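# Group the target variable by the binned variable.

def group_y(cut_x, cut_x_u, Y):
    group_y_by_x = []
    for u in cut_x_u:
        if pd.isnull(u):
            # Missing values form their own group (NaN never compares equal to itself).
            tmp = [y for x, y in zip(cut_x, Y) if pd.isnull(x)]
        else:
            tmp = [y for x, y in zip(cut_x, Y) if x == u]
        group_y_by_x.append(tmp)
    return group_y_by_x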

group_y_by_x1 = group_y(cut_x1, cut_x1_u, Y)
group_y_by_x2 = group_y(cut_x2, cut_x2_u, Y)


# Compute WOE and IV.

def WOE_and_IV_y(group_y_by_x):
    
    Ni = []
    Ni_bad = []  
    for i in range(len(group_y_by_x)):
        Ni_bad.append(sum(group_y_by_x[i]))
        Ni.append(len(group_y_by_x[i]))
    
    Ni_good = np.array(Ni) - np.array(Ni_bad)
    
    N_bad = sum(Ni_bad)
    N_good = sum(Ni_good)
    
    # Add 0.5 to each count so that a pure group (all good or all bad) does not give log(0).
    pi_good = (np.array(Ni_good)+0.5)/N_good
    pi_bad = (np.array(Ni_bad)+0.5)/N_bad
    
    WOE = np.log((pi_good)/(pi_bad))
    
    IV = np.sum((pi_good-pi_bad)*WOE)
    return WOE, IV

WOE_x1, IV_x1 = WOE_and_IV_y(group_y_by_x1)
WOE_x2, IV_x2 = WOE_and_IV_y(group_y_by_x2)

print('WOE of X1: ', WOE_x1)
print('WOE of X2: ', WOE_x2)
print('IV of X1: ', IV_x1)
print('IV of X2: ', IV_x2)

WOE of X1:  [-0.38801869  0.18179713  0.82635172  1.02855501 -0.66625426  0.80901651]
WOE of X2:  [ 0.719889   -0.47763081  2.18955664 -0.6651725  -1.59233598 -1.27823822]
IV of X1:  0.203552802313
IV of X2:  0.672197204136


Based on the WOEs, we can build a logistic regression model. After we obtain the estimated odds, we follow the same setting as in the lecture notes to give each applicant a score: we scale the scorecard so that odds of 50:1 correspond to 600 points and the odds double every 20 points. With score = offset + factor $\times$ ln(odds), the factor is 20/ln(2) = 28.8539 and the offset is 600 $-$ 28.8539 $\times$ ln(50) = 487.123. We print the first five scores. Following the points-allocation formula in the lecture notes, the corresponding scorecard is also provided. Note that nan denotes a missing value.
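The offset and factor can be checked with a short sketch, assuming the standard scaling score = offset + factor $\times$ ln(odds). The names `pdo`, `target_score`, `target_odds` and `score_from_odds` are chosen here for illustration and do not appear in the lab code.

```python
import numpy as np

pdo = 20.0            # points needed to double the odds
target_score = 600.0  # score assigned to the target odds
target_odds = 50.0    # good:bad odds of 50:1

# Doubling the odds must add pdo points: factor * ln(2) = pdo.
factor = pdo / np.log(2.0)
# The target odds must map to the target score.
offset = target_score - factor * np.log(target_odds)

def score_from_odds(odds):
    """Map good:bad odds to a score on the scaled scorecard."""
    return offset + factor * np.log(odds)

print(round(factor, 4), round(offset, 3))
print(score_from_odds(50.0), score_from_odds(100.0))
```

Under these assumptions, odds of 50:1 give 600 points and odds of 100:1 give 620 points, up to floating-point rounding.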

In this lab material, we do not perform reject inference. You may try it yourself. Based on the code in this material, it is easy to create a class that rebuilds the scorecard after reject inference.

In [4]:
# Create the input matrix.

WOE_input1 = np.array([WOE_x1[list(cut_x1_u).index(x)] for x in cut_x1])
WOE_input2 = np.array([WOE_x2[list(cut_x2_u).index(x)] for x in cut_x2])

WOE_input = np.hstack((WOE_input1.reshape(-1 ,1), WOE_input2.reshape((-1, 1))))

logr = linear_model.LogisticRegression(C=1000.0) 
logr.fit(WOE_input, Y.values)
logr_predict_proba = logr.predict_proba(WOE_input)

# Odds of good against bad for each applicant: P(bad = 0) / P(bad = 1).
odds = logr_predict_proba[:,0]/logr_predict_proba[:,1]

offset = 487.123
factor = 28.8539

# Score = offset + factor * ln(odds).
scores = offset + factor * np.log(odds)

print(np.round(scores[:5], 2))

a = logr.intercept_
beta_age = logr.coef_[0][0]
beta_bur = logr.coef_[0][1]

n = 2. # we use two variables here

# Points for each group: -(WOE * beta + a/n) * factor + offset/n.
# The minus sign appears because the model is fitted to bad, not to good.
scorecard_x1 = np.hstack((cut_x1_u.reshape(-1,1), (-(WOE_x1*beta_age+a/n)*factor+offset/n).reshape(-1,1)))
scorecard_x2 = np.hstack((cut_x2_u.reshape(-1,1), (-(WOE_x2*beta_bur+a/n)*factor+offset/n).reshape(-1,1)))

print('for age_oldest_tr')
for i in range(scorecard_x1.shape[0]):
    print('interval is  {} score is  {:.2f}'.format(scorecard_x1[i,0], scorecard_x1[i,1]))

print('\n')

print('for bureau_score')

for i in range(scorecard_x2.shape[0]):
    print('interval is  {} score is  {:.2f}'.format(scorecard_x2[i,0], scorecard_x2[i,1]))

[539.15 548.83 516.76 516.76 507.08]
for age_oldest_tr
interval is  (0.413, 118.4] score is  256.64
interval is  (118.4, 235.8] score is  266.32
interval is  (235.8, 353.2] score is  277.26
interval is  (353.2, 470.6] score is  280.70
interval is  nan score is  251.92
interval is  (470.6, 588.0] score is  276.97


for bureau_score
interval is  (686.0, 767.0] score is  282.51
interval is  (605.0, 686.0] score is  250.44
interval is  (767.0, 848.0] score is  321.87
interval is  nan score is  245.42
interval is  (442.595, 524.0] score is  220.59
interval is  (524.0, 605.0] score is  229.00
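
The points allocation can be checked on a small example. The sketch below assumes the usual allocation, in which a group with weight of evidence $\mathrm{WOE}$ in a characteristic with coefficient $\beta$ receives $-(\mathrm{WOE}\cdot\beta + a/n)\cdot\text{factor} + \text{offset}/n$ points, where $a$ is the intercept and $n$ the number of characteristics. The intercept, coefficients and WOE values are made up for illustration; they are not the fitted values from this lab.

```python
import numpy as np

factor = 20.0 / np.log(2.0)
offset = 600.0 - factor * np.log(50.0)

# Made-up model (illustration only): intercept and one coefficient
# per characteristic, for a model fitted to the bad indicator.
a = -1.5
beta = np.array([-0.8, -1.1])
n = len(beta)

# WOE of the groups into which one hypothetical applicant falls.
woe = np.array([0.4, -0.3])

# Points contributed by each characteristic.
points = -(woe * beta + a / n) * factor + offset / n

# Score computed directly from the log-odds of being good,
# which is minus the model's log-odds of being bad.
log_odds_bad = a + np.dot(beta, woe)
score = offset + factor * (-log_odds_bad)

print(points, points.sum(), score)
```

The points of the characteristics add up to the applicant's score, which is what lets a scorecard be applied by table lookup and addition.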
