End-to-End Example of Scorecard Creation

Outline

Review - Scorecard Development Process
End-to-End Example of Scorecard Creation

In [1]:
# Load data manipulation package
import numpy as np
import pandas as pd
import itertools

# Load data visualization package
import matplotlib.pyplot as plt
import seaborn as sns

Review - Scorecard Development Process

Scorecard Development Process

1. Explore data.

Simple statistics such as distributions of values, mean/median, etc.
Checking data integrity.
2. Handle missing values and outliers.

Most financial industry data contains missing values, or values that do not make sense for a particular characteristic.
3. Check correlation.

4. Initial characteristic analysis.

To assess the strength of each characteristic individually as a predictor of performance.
5. Statistical Measures.

Weight of Evidence (WoE) — measures the strength of each attribute.
Information Value (IV) — measures the total strength of the characteristic.
6. Check logical trend.

Attribute strengths must also be in a logical order, and make operational sense.
7. Check business/operational considerations.

Consider the business or operational relevance of each characteristic, e.g. postal codes.
8. Design scorecards.

Preliminary scorecard
Reject inference — To make educated guesses about how rejected applicants would have performed if accepted.
Final scorecard production
9. Choose a scorecard — using a combination of statistical and business measures.

For example: misclassification, scorecard strength (KS, Chi-square, AIC, AUC), etc.
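The statistical measures in step 5 can be sketched in a few lines of pandas. This is a minimal illustration, not the notebook's own implementation: the helper name `woe_iv_table` and the toy data are hypothetical, and it assumes the common convention WoE = ln(% of goods / % of bads) with 0 = good and 1 = bad. Some references use the inverse ratio, which flips the sign of WoE but leaves IV unchanged.

```python
import numpy as np
import pandas as pd


def woe_iv_table(df, characteristic, target):
    """Return a per-attribute WoE table and the total IV (hypothetical helper).

    Assumes `target` is binary: 0 = good (non-default), 1 = bad (default).
    Attributes with zero goods or zero bads give infinite WoE; real data
    usually needs binning or smoothing to avoid that.
    """
    # Count goods and bads for each attribute of the characteristic
    counts = pd.crosstab(df[characteristic], df[target])
    counts = counts.reindex(columns=[0, 1], fill_value=0)
    counts.columns = ['n_good', 'n_bad']

    # Share of all goods / all bads that fall into each attribute
    dist_good = counts['n_good'] / counts['n_good'].sum()
    dist_bad = counts['n_bad'] / counts['n_bad'].sum()

    table = counts.copy()
    table['woe'] = np.log(dist_good / dist_bad)
    table['iv_contribution'] = (dist_good - dist_bad) * table['woe']
    return table, table['iv_contribution'].sum()


# Hypothetical toy sample: 10 loans, one categorical characteristic
toy = pd.DataFrame({
    'default_on_file': ['N'] * 6 + ['Y'] * 4,
    'loan_status':     [0, 0, 0, 0, 0, 1, 0, 1, 1, 0],
})

table, iv = woe_iv_table(toy, 'default_on_file', 'loan_status')
print(table)
print('IV:', round(iv, 4))
```

Under this convention, a positive WoE means the attribute holds relatively more goods than bads, and the IV is the sum of the per-attribute contributions.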

Example of Scorecard Creation
This is an end-to-end example to show how a scorecard is created.

The example is highly simplified; it is designed to show how the final scores are calculated.

1. Load Data
The sample we will use in this example is a fictitious dataset from here.

The sample consists of demographic, bureau, and financial information.

Note that we are not defining the default or bad status from our dataset here. Instead, we already have the binary response variable:

loan_status
loan_status = 0 for a non-default loan.
loan_status = 1 for a default loan.
The potential characteristics for predicting the response variable are:

person_age
person_income (the annual income of the debtor)
person_home_ownership
RENT
MORTGAGE
OWN
OTHER
person_emp_length (the employment length of debtor in years)
loan_intent (the purpose of loan)
EDUCATION
MEDICAL
VENTURE
PERSONAL
DEBTCONSOLIDATION
loan_grade
loan_amnt (loan amount)
loan_int_rate (interest rate)
loan_percent_income (percent loan of income)
cb_person_default_on_file (historical default)
N : the debtor does not have any history of defaults.
Y : the debtor has a history of defaults on their credit file.
cb_person_cred_hist_length (credit history length)
First, load the data from the credit_risk_dataset.csv file.

In [2]:
# Import dataset from csv file
data = pd.read_csv('credit_risk_dataset.csv')

# Table check
data.head().T

Unnamed: 0,0,1,2,3,4
person_age,22,21,25,23,24
person_income,59000,9600,9600,65500,54400
person_home_ownership,RENT,OWN,MORTGAGE,RENT,RENT
person_emp_length,123.0,5.0,1.0,4.0,8.0
loan_intent,PERSONAL,EDUCATION,MEDICAL,MEDICAL,MEDICAL
loan_grade,D,B,C,C,C
loan_amnt,35000,1000,5500,35000,35000
loan_int_rate,16.02,11.14,12.87,15.23,14.27
loan_status,1,0,1,1,1
loan_percent_income,0.59,0.1,0.57,0.53,0.55


Check the data shape.

How many variables are there?

In [3]:
# Check the data shape
data.shape

(32581, 12)

Our sample contains 12 variables from 32,581 credit records.

1 response variable, loan_status,
and 11 potential characteristics/predictors.
Please note that in this example, we will use only 3 predictors, for the sake of simplicity.

Predictor 1: person_age
Predictor 4: person_emp_length
Predictor 11: cb_person_default_on_file

In [4]:
data0 = data.copy()

# Use only 3 predictors and 1 response variable
data = data[['person_age',
             'person_emp_length',
             'cb_person_default_on_file',
             'loan_status']]
data

Unnamed: 0,person_age,person_emp_length,cb_person_default_on_file,loan_status
0,22,123.0,Y,1
1,21,5.0,N,0
2,25,1.0,N,1
3,23,4.0,N,1
4,24,8.0,Y,1
...,...,...,...,...
32576,57,1.0,N,0
32577,54,4.0,N,0
32578,65,3.0,N,1
32579,56,5.0,N,0


Before modeling, make sure you split the data first for model validation.

In the classification case, check the proportion of the response variable first to decide the splitting strategy.

In [5]:
# Define response variable
response_variable = 'loan_status'

# Check the proportion of response variable
data[response_variable].value_counts(normalize = True)

0    0.781836
1    0.218164
Name: loan_status, dtype: float64

The response variable, loan_status, is imbalanced (a ratio of about 78:22).

To get the same ratio in the training and test sets, use a stratified split based on the response variable, loan_status.

2. Sample Splitting
First, define the predictors (X) and the response (y).

In [6]:
# Split response and predictors
y = data[response_variable]
X = data.drop(columns = [response_variable])

# Validate the splitting
print('y shape :', y.shape)
print('X shape :', X.shape)

y shape : (32581,)
X shape : (32581, 3)


In [7]:
from sklearn.model_selection import train_test_split

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    stratify = y,
                                                    test_size = 0.3,
                                                    random_state = 42)

# Validate splitting
print('X train shape :', X_train.shape)
print('y train shape :', y_train.shape)
print('X test shape  :', X_test.shape)
print('y test shape  :', y_test.shape)

X train shape : (22806, 3)
y train shape : (22806,)
X test shape  : (9775, 3)
y test shape  : (9775,)


Check the proportion of the response y in the training and test sets.

In [8]:
y_train.value_counts(normalize = True)

0    0.781856
1    0.218144
Name: loan_status, dtype: float64

In [9]:
y_test.value_counts(normalize = True)

0    0.78179
1    0.21821
Name: loan_status, dtype: float64

3. Exploratory Data Analysis
To build a model that predicts well on unseen data, we must prevent test set information from leaking into the analysis.
Thus, we explore only the training set.

In [10]:
# Concatenate X_train and y_train as data_train
data_train = pd.concat((X_train, y_train),
                       axis = 1)

# Validate data_train
print('Train data shape:', data_train.shape)
data_train.head()

Train data shape: (22806, 4)


Unnamed: 0,person_age,person_emp_length,cb_person_default_on_file,loan_status
11491,26,1.0,N,0
3890,23,3.0,N,0
17344,24,1.0,N,1
13023,24,1.0,N,0
29565,42,4.0,N,1


What do we do in EDA?

Check data integrity.
Check for any insight in the data: distribution, proportion, outliers, missing values, etc.
Make a plan for data pre-processing.

Check for Missing Values

In [11]:
# Check for missing values
data_train.isna().sum()

person_age                     0
person_emp_length            639
cb_person_default_on_file      0
loan_status                    0
dtype: int64

In [12]:
# Check for data type
data_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22806 entries, 11491 to 10456
Data columns (total 4 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   person_age                 22806 non-null  int64  
 1   person_emp_length          22167 non-null  float64
 2   cb_person_default_on_file  22806 non-null  object 
 3   loan_status                22806 non-null  int64  
dtypes: float64(1), int64(2), object(1)
memory usage: 890.9+ KB


Summary

There are missing values in person_emp_length, a numerical (float) variable.
We need to explore this variable to decide how to handle the missing values.
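One way to start that exploration is to check how many rows are missing and whether the missing rows default at a different rate than the rest. The sketch below uses a small hypothetical frame (`toy_train`) in place of `data_train`; the values are made up for illustration and say nothing about the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for data_train
toy_train = pd.DataFrame({
    'person_emp_length': [1.0, np.nan, 4.0, np.nan, 8.0, 2.0, np.nan, 5.0],
    'loan_status':       [0,   1,      0,   1,      0,   1,   0,      0],
})

# Flag rows where the characteristic is missing
is_missing = toy_train['person_emp_length'].isna()

# Share of rows with a missing value
missing_share = is_missing.mean()

# Default rate for rows with and without a value
group_label = is_missing.map({True: 'missing', False: 'present'})
bad_rate_by_missing = toy_train.groupby(group_label)['loan_status'].mean()

print('Share missing:', missing_share)
print(bad_rate_by_missing)
```

If the two default rates differ noticeably, the missingness itself may carry information, which is an argument for keeping missing values as their own attribute rather than dropping or imputing them.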

Predictor 1: person_age

In [13]:
# Descriptive statistics of 'person_age'
data_train['person_age'].describe()

count    22806.000000
mean        27.722880
std          6.336638
min         20.000000
25%         23.000000
50%         26.000000
75%         30.000000
max        144.000000
Name: person_age, dtype: float64