---
title: "Credit Risk Prediction"
author: "Fang Zhou, Data Scientist, Microsoft"
date: "`r Sys.Date()`"
output: html_document
---
knitr::opts_chunk$set(echo = TRUE,
                      fig.width = 8,
                      fig.height = 5,
                      fig.align='center',
                      dev = "png")

1 Introduction

Credit Risk Scoring is a classic yet increasingly important operation in banking, as banks remain cautious about risk when lending for mortgages or commercial purposes in an industry marked by fierce competition and shaped by the global financial crisis. With an accurate credit risk scoring model, a bank can predict the likelihood of default on a transaction. This in turn helps evaluate the potential risk posed by lending money to consumers, mitigate losses due to bad debt, determine who qualifies for a loan, at what interest rate and credit limit, and even identify which customers are likely to bring in the most revenue across a variety of products.

Many banks are now driving innovation to enhance risk management. For example, one of the largest banks in an Asian country by market capitalization is exploring opportunities to segment its population of millions of active credit card customers, improve risk scoring, and then identify opportunities to offer increased limits. Using advanced analytics for credit risk scoring involves traditional scorecard building and modelling, extends to machine learning and ensembles, and can further pursue innovations such as customer-oriented aggregation of transactions, multi-dimensional customer segmentation, and conceptual clustering to identify segments across which to understand bank customers.

In a data-driven credit risk prediction model, two types of data are normally taken into consideration.

  1. Transaction data The transaction records cover account id, transaction date, transaction amount, merchant industry, etc. This transaction-level data can be dynamically aggregated to provide transaction statistics and financial behaviour information at the account level.

  2. Demographic and bank account information This type of data shows the characteristics of an individual customer or account, such as age, sex, income, and credit limit. These attributes are static: they either never change or increment deterministically over time.

2 Data Driven Credit Risk Prediction

2.1 Setup

We load the required R packages.

## Setup

# Load the required packages into the R session.

library(rattle)       # Use normVarNames().
library(readr)        # Fast read csv file.
library(plyr)         # Wrangling.
library(dplyr)        # Wrangling: tbl_df(), group_by(), print(), glimpse().
library(tidyr)        # Wrangling: gather().
library(magrittr)     # Pipe operator %>% %<>% %T>% equals().
library(lubridate)    # Convert character to datetime: mdy_hms().
library(ggplot2)      # Visualization.
library(stringi)      # String concat operator %s+%.
library(glmnet)       # Build lasso logistic regression with glmnet().
library(e1071)        # Build support vector machine model with svm().
library(randomForest) # Build random forest model with randomForest().
library(Matrix)       # Construct a Matrix of a class that inherits from Matrix.
library(xgboost)      # Build extreme gradient boosting with xgboost(). 
#library(Ckmeans.1d.dp)# Required for xgb.plot.importance().
library(caret)        # Delete near zero variables with nearZeroVar() and train model with train().
library(caretEnsemble)# Build model ensemble. 
library(ROCR)         # Draw ROC Curve.
library(scales)       # Include commas in numbers.

2.2 Data Ingestion

We use transaction and demographic datasets simulated from public data and real data from a financial institution to conduct this R accelerator.

## Data Ingestion

# Identify the source location of the dataset.

#DATA <- "../../Data/"
#txn_fname <- file.path(DATA, "Raw/transactionSimu.csv")
#demog_fname <- file.path(DATA, "Raw/demographicSimu.csv")

wd <- getwd()

dpath <- "../Data"
txn_fname   <- file.path(wd, dpath, "transactionSimu_v3.csv")
demog_fname <- file.path(wd, dpath, "demographicSimu_v3.csv")

# Ingest the dataset.

txn   <- read_csv(file=txn_fname)   %T>% {dim(.) %>% comma() %>% cat("\n")}
demog <- read_csv(file=demog_fname) %T>% {dim(.) %>% comma() %>% cat("\n")}

2.3 Data Preparation

Before analysis, we normalize the variable names and the values of the categorical variables.

# Normalize variable names.

names(txn)   %<>% normVarNames() %T>% print()
names(demog) %<>% normVarNames() %T>% print()

# Normalize categoric variables.

txn %>% 
  sapply(is.character) %>%
  which() %>%
  names() %T>%
  print() ->
txnc

demog %>% 
  sapply(is.character) %>%
  which() %>%
  names() %T>%
  print() ->
demc

txn[txnc]   %<>% sapply(normVarNames)
demog[demc] %<>% sapply(normVarNames)

2.4 Data Exploration and Preprocessing

2.4.1 Data Exploration

2.4.1.1 Transaction Data
## Explore transaction data.

# Review datasets.

dim(txn) %>% comma() %>% cat("\n")

# A glimpse into the data.

glimpse(txn)

# Review observations.

head(txn) %>% print.data.frame()

The transaction data set contains 198,576 records and 11 variables. The columns "transaction_id" and "account_id" are identifiers at the transaction level and account level, respectively. Each account has more than one transaction record, occurring at different dates or times. These transaction records show information about transaction amount, transaction type, location, merchant, and so on.
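
As a quick check on how many purchase records each account contributes, we can summarise the per-account counts. This is a minimal sketch using the dplyr verbs loaded above; the exact figures depend on the simulated sample.

# Summarise purchase counts per account.

txn %>%
  filter(transaction_type == "p") %>%
  count(account_id) %>%
  summarise(min_n=min(n), median_n=median(n), max_n=max(n))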

Initial exploratory analysis can be performed to get a general understanding of the dataset. For example,

  1. The transaction frequency of each account varies. The following plot shows the top 10 accounts with the highest transaction frequency, i.e., the number of purchases in this case.
# Visualize the top 10 accounts with the highest transaction frequency.

txn %>%
  filter(transaction_type == "p") %>%
  select(account_id) %>%
  group_by(account_id) %>%
  count() %>%
  arrange(desc(n)) %>%
  head(n=10) %>%
  ggplot(aes(x=account_id, y=n)) +
  geom_bar(stat="identity", position="dodge", fill="lightblue") +
  labs(x       = "Account ID",
       y       = "Transaction Frequency (Purchase)",
       title   = "Top 10 Accounts with the Highest Transaction Frequency",
       caption = "Source: " %s+% "Transaction") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
  2. The total transaction amount made by different types of credit cards varies across merchant industries, and this can be plotted as shown below. The total transaction amount in the bank industry is much higher than in any other.
# Visualize the distribution of total purchase across merchant industry. 

txn %>%
  filter(transaction_type == "p") %>%
  select(merchant_industry, card_type, transaction_amount_usd) %>%
  group_by(merchant_industry) %>% 
  mutate(amount_per_industry=log(sum(transaction_amount_usd))) %>%
  ggplot(aes(x=merchant_industry, y=amount_per_industry, fill=card_type)) +
  geom_bar(stat="identity") +
  labs(x       = "Merchant Industry",
       y       = "Log of Transaction Amount (USD)",
       fill    = "Card Type",
       title   = "Total Transaction Amount Across Merchant Industry\n by Card Type",
       caption = "Source: " %s+% "Transaction") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  scale_y_continuous(labels=comma)
2.4.1.2 Demographic Data
## Explore demographic data.

# Review datasets.

dim(demog) %>% comma() %>% cat("\n")

# A glimpse into the data.

glimpse(demog)

# Review observations.

head(demog) %>% print.data.frame() 

The demographic data set contains `r nrow(demog) %>% comma()` records and `r ncol(demog)` variables. It has a common key, "account_id", with the transaction data. The column "bad_flag" is the label of customers or accounts (assuming one customer per account), representing their default status. The other variables show information about the characteristics of customers (e.g., age, education, income) and bank account information (e.g., credit limit).
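
Since "bad_flag" is the modelling target, it is also worth checking how balanced the two classes are before modelling. This is a minimal sketch; the exact proportions depend on the simulated sample.

# Check the class balance of the target.

demog %>%
  group_by(bad_flag) %>%
  summarise(n=n()) %>%
  mutate(proportion=n/sum(n))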

Visual understanding of the dataset can be achieved by the following plots.

  1. The proportion of customers with "default" versus "non-default" status within different education levels (or any other possible factor) may vary. People holding "middle school" and "high school" degrees are among the top two groups with the highest default rates.
# Visualize the impact of education on default status.

demog %>%
 ggplot(aes(x=factor(education), fill=factor(bad_flag))) +
 geom_bar(width=0.5, position="fill") +
 coord_flip() +
 scale_fill_discrete(guide=guide_legend(title="Bad Flag")) +
 labs(x="Education Level",
      y="Proportion",
      fill="Bad Flag",
      title="Proportion of Customers for Default Status\n within Each Education Level",
      caption="Source: " %s+% "Demographic") +
 theme(axis.text.x = element_text(angle = 45, hjust = 1))
  2. Credit limit (or income), sex, and marital status may affect the default status of customers with different education levels.
# Visualize the impact of credit limit/income on default status across education level.

demog %>%
  filter(sex == "male" & marital_status == "single") %>%
  ggplot(aes(x=factor(education), y=credit_limit, color=factor(bad_flag))) +
  geom_boxplot() +
  scale_fill_discrete(guide=guide_legend(title="Bad Flag")) +
  labs(x="Education Level",
       y="Credit Limit",
       color="Bad Flag",
       title="Boxplot of Credit Limit Across Education Level\n by Default Status",
       caption="Source: " %s+% "Demographic") +
 theme(axis.text.x = element_text(angle = 45, hjust = 1))

2.4.2 Data Aggregation

Since the analytical target is to predict credit risk at the account level, our first task is to aggregate the transaction data up to an account view, with a single row per account, summarising the transactions of each type (spend, purchase, cash withdrawal, or fee) before joining with the demographic data. Here we only provide a simple example for the limited sample dataset, where every transaction type is a purchase. The time window for aggregation in this sample is six months, from 2013-04-01 UTC to 2013-09-30 UTC.
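
The sample transactions already fall entirely inside this window, so the aggregation below uses the full dataset. If a longer history were available, the window could be restricted first; the sketch below assumes transaction_date has been parsed as a date-time, and the hypothetical txn_window object is not used in the aggregation that follows.

# Restrict transactions to the six-month window (illustration only).

txn %>%
  filter(transaction_date >= ymd("2013-04-01", tz="UTC"),
         transaction_date <  ymd("2013-10-01", tz="UTC")) ->
txn_window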

## Data Aggregation

# Define some functions to simplify the calculation.

# Calculate maximum interval of a time series.

maxInterval <- function(ts)
{
  if (length(na.omit(ts)) <= 1)
  {
    interval <- 0
  }
  else
  {
    interval <- as.numeric(max(difftime(c(ts[-1], NA), ts, units="day"), na.rm=TRUE))
  }
  return(interval)
}

# Calculate minimum interval of a time series.

minInterval <- function(ts)
{
  if (length(na.omit(ts)) <= 1)
  {
    interval <- 0
  }
  else
  {
    interval <- as.numeric(min(difftime(c(ts[-1], NA), ts, units="day"), na.rm=TRUE))
  }
  return(interval)
}

# Calculate average interval of a time series.

avgInterval <- function(ts)
{
  if (length(na.omit(ts)) <= 1)
  {
    interval <- 0
  } else
  {
    interval <- as.numeric(mean(difftime(c(ts[-1], NA), ts, units="day"), na.rm=TRUE))
  }
  return(interval)
}

# Calculate interval of time between enddate and the most recent transaction date.

recentInterval <- function(startdate, enddate, ts)
{
  if (length(na.omit(ts)) == 0)
  {
   interval <- as.numeric(difftime(enddate, startdate, units="day"))
  }
  else
  {
    interval <- as.numeric(difftime(enddate, max(ts, na.rm=TRUE), units="day"))
  }
  return(interval)
}
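
# A quick sanity check of these helpers on a small, made-up series of dates
# (illustration only; these values are not part of the sample data).

ts_demo <- as.POSIXct(c("2013-04-01", "2013-04-05", "2013-04-20"), tz="UTC")

avgInterval(ts_demo)   # mean gap in days: (4 + 15) / 2 = 9.5
maxInterval(ts_demo)   # longest gap: 15
minInterval(ts_demo)   # shortest gap: 4
recentInterval(min(ts_demo), as.POSIXct("2013-09-30", tz="UTC"), ts_demo)   # 163 days since the last date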

# Aggregate transaction data per account level.

txn %>%
    filter(transaction_type == "p") %>%
    group_by(account_id) %>%
    arrange(account_id, transaction_date) %>%
    summarise(amount_6      = sum(transaction_amount_usd),
              pur_6         = n(),
              bank_6        = ifelse(pur_6 == 0, 0, sum(merchant_industry == "bank")/pur_6),
              entmnt_6      = ifelse(pur_6 == 0, 0, sum(merchant_industry == "entmnt")/pur_6),
              jewellery_6   = ifelse(pur_6 == 0, 0, sum(merchant_industry == "jewellery")/pur_6),
              medical_6     = ifelse(pur_6 == 0, 0, sum(merchant_industry == "medical")/pur_6),
              others_6      = ifelse(pur_6 == 0, 0, sum(merchant_industry == "others")/pur_6),
              petrol_6      = ifelse(pur_6 == 0, 0, sum(merchant_industry == "petrol")/pur_6),
              restaurant_6  = ifelse(pur_6 == 0, 0, sum(merchant_industry == "restaurant")/pur_6),
              supermarket_6 = ifelse(pur_6 == 0, 0, sum(merchant_industry == "supermkt")/pur_6),
              telecom_6     = ifelse(pur_6 == 0, 0, sum(merchant_industry == "telecom")/pur_6),
              travel_6      = ifelse(pur_6 == 0, 0, sum(merchant_industry == "travel")/pur_6),
              utility_6     = ifelse(pur_6 == 0, 0, sum(merchant_industry == "utility")/pur_6),
              avg_pur_amt_6 = ifelse(pur_6 == 0, 0, sum(transaction_amount_usd)/pur_6),
              max_pur_amt_6 = max(transaction_amount_usd),
              min_pur_amt_6 = min(transaction_amount_usd),
              avg_interval_pur_6 = avgInterval(transaction_date),
              max_interval_pur_6 = maxInterval(transaction_date),
              min_interval_pur_6 = minInterval(transaction_date),
              last_pur_time_6    = recentInterval(min(transaction_date),
                                                  max(transaction_date),
                                                  transaction_date)
              ) ->
rollup

# Review aggregated data.

glimpse(rollup)
  
# Check the number of infinite values for each variable.

unlist(lapply(rollup, function(x) sum(is.infinite(x))))

# Check the number of missing values for each variable.

unlist(lapply(rollup, function(x) sum(is.na(x))))

2.4.3 Data Merging

Let's merge the aggregated transaction data with demographic data by their common key, account id.

## Merge aggregated transaction data with demographic data.

merged <- merge(rollup, demog, by=c("account_id"))

glimpse(merged)

2.4.4 Data Cleansing

Before getting into feature engineering, data cleansing work, including variable cleansing and missing value cleansing, is essential and will ultimately help improve prediction performance.

## Variable roles.

# Target variable

target <- "bad_flag"

# Note any identifier.

id <- c("account_id") %T>% print() 

# Note the available variables.

vars <- names(merged) %T>% print()
## Variable cleansing 

# Initialise ignored variables: identifiers

ignore <- id %T>% print()

# Identify variables with only missing values.

merged[vars] %>%
  sapply(function(x) x %>% is.na %>% sum) %>%
  equals(nrow(merged)) %>%
  which() %>%
  names() %T>%
  print() ->
missing

# Add them if any to the variables to be ignored for modelling.

ignore <- union(ignore, missing) %T>% print()

# Identify a threshold above which proportion missing is fatal.

missing.threshold <- 0.5

# Identify variables that are mostly missing.

merged[vars] %>%
  sapply(function(x) x %>% is.na() %>% sum()) %>%
  '>'(missing.threshold*nrow(merged)) %>%
  which() %>%
  names() %T>%
  print() ->
mostly

# Add them to the variables to be ignored.

ignore <- union(ignore, mostly) %T>% print()
## Missing value cleansing.

# Count the number of missing values.

merged[vars] %>%  is.na() %>% sum()

# Identify variables with some missing values but not too much.

merged[vars] %>%
  sapply(function(x) x %>% is.na %>% sum) %>%
  '>' (0) %>%
  which() %>%
  names() %T>%
  print() ->
some

merged[some] %>% lapply(class) %>% print()

# Identify categorical variables with some missing values.

merged[some] %>%
  sapply(is.character) %>%
  which(useNames=TRUE) %>%
  names() %T>% 
  print() ->
some.catc

# Identify numeric variables with some missing values.

merged[some] %>%
  sapply(is.numeric) %>%
  which(useNames=TRUE) %>%
  names() %T>% 
  print() ->
some.numc

# Compute the mode of each categorical variable.

merged[some.catc] %>% 
  lapply(table) %>% 
  print() ->
counts.table

some.catc.mode <- unlist(lapply(counts.table,
                                FUN=function(x){as.character(names(x[which.max(x)]))}),
                         use.names=FALSE)

# Compute the mean of each numeric variable.

merged[some.numc] %>%
  lapply(mean, na.rm=TRUE) %>%
  unlist(use.names=F) %>%
  print() ->
some.numc.mean

# Define a function to replace missing values with mean or mode. 

meanModeReplace <- function(data, numc, numc.mean, catc, catc.mode)
{
  
  data <- data.frame(data)
    
 # Replace numeric variables with the mean. 
 
 if(length(numc) > 0) 
   for(i in 1:length(numc))
  {
    row.na <- which(is.na(data[, numc[i]]) == TRUE) 
    data[row.na, numc[i]] <- numc.mean[i]
  }
 
 # Replace character variables with the mode. 
 
 if(length(catc) > 0)
   for(i in 1:length(catc))
  {
    row.na <- which(is.na(data[, catc[i]]) == TRUE) 
    data[row.na, catc[i]] <- catc.mode[i]
  }
 
 return(data)
 
}  
  
# Impute missing values.

merged[vars] %<>% 
  meanModeReplace(numc=some.numc,
                  numc.mean=some.numc.mean, 
                  catc=some.catc,
                  catc.mode=some.catc.mode) 

# Confirm that no missing values remain.

merged[vars] %>%  is.na() %>% sum()

# Identify variables that have a single value.

merged[vars] %>%
  sapply(function(x) all(x == x[1L])) %>%
  which() %>%
  names() %T>%
  print() ->
constants

# Add them to the variable to be ignored.

ignore <- union(ignore, constants) %>% print()
# Clean correlated variables.

# Note which variables are numeric.

vars %>%
  setdiff(ignore) %>%
  '['(merged, .) %>%
  sapply(is.numeric) %>% 
  which() %>%
  names() %T>%
  print() ->
numc

# For the numeric variables generate a table of correlations

merged[numc] %>%
  cor(use="complete.obs") %>%
  ifelse(upper.tri(., diag=TRUE), NA, .) %>% 
  abs %>% 
  data.frame %>%
  tbl_df %>%
  set_colnames(numc) %>%
  mutate(var1=numc) %>% 
  gather(var2, cor, -var1) %>% 
  na.omit %>%
  arrange(-abs(cor)) %T>%
  print() ->
mc

# Add any variables with collinearity to the variables to be ignored (none are removed here).

ignore <- union(ignore, NULL) %>% print()

# Check the number of variables currently.

length(vars)

# Remove the variables to ignore.

vars <- setdiff(vars, ignore) %T>% print()

# Confirm they are now ignored.

length(vars)

# Convert all categorical variables into factor.

merged %>% 
  sapply(is.character) %>%
  which(useNames=TRUE) %>%
  names() ->
catc
  
merged[catc] %<>% lapply(factor)

# Review the processed dataset.

processed <- merged[c(id, vars)]

glimpse(processed)

# Export the merged and cleansed data.

write.csv(processed, file=file.path(wd, dpath, "processedSimu.csv"), row.names=FALSE)

2.5 Scorecard Method

The classic and still widely used (and useful) approach to evaluating credit worthiness and risk is based on building "scorecards". There are several aspects to the modelling workflow for producing a scorecard and using it effectively.

  1. Discretizing Predictors A scorecard needs to make it easy for its user to determine the individual components contributing to the overall score and credit decision. To achieve that, it is useful to divide the values of each continuous or categorical predictor into a relatively small number of categories so that an applicant can be scored quickly. There are a number of methods to re-code variable values into a small number of classes, for example, binning by quantiles or with the smbinning package (a one-variable sketch follows this list).

  2. Logistic Regression Modelling Given a binary variable indicating a good or bad account, logistic regression models are well suited for subsequent predictive modelling. They provide a logit-transformed linear relationship between the predicted probability and the predictors. When the number of predictors is relatively large or there is multi-collinearity, lasso logistic regression, provided by the function glmnet(), often outperforms the alternatives. This algorithm is extremely fast and can exploit sparsity in the model input matrix.

  3. Machine Learning Algorithms If accuracy of the risk prediction is the most important consideration of a scorecard-building project, then machine learning models, such as gradient boosting and random forest, or their ensembles, provide better performance than simple logistic regression models. Since most of these algorithms are general approximators capable of representing any relationship between predictors and outcomes, and are also relatively robust to outliers, it is not necessary to perform many of the feature engineering steps such as binning.
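
As a compact, one-variable illustration of the discretization in item 1, quantile-based binning can be done with base R's cut(); the bucketize() helper in Section 2.5.1 applies the same idea to many variables at once. This sketch assumes the cleansed data retains a numeric credit_limit column.

# Quantile-based binning of a single numeric variable (illustration only).

breaks <- quantile(processed$credit_limit, probs=seq(0, 1, 0.25), na.rm=TRUE)

processed$credit_limit %>%
  cut(breaks=unique(breaks), include.lowest=TRUE) %>%
  table()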

2.5.1 Feature Engineering

First, we perform feature engineering by discretizing the numeric predictors, i.e., binning.

# Note the variable names in the processed dataset.

vars <- names(processed)

# Identify numeric variables used for binning.

processed %>%
  sapply(is.numeric) %>%
  which(useNames=TRUE) %>%
  names() %T>% 
  print() ->
vars_fe

# Compute quantiles with an even split of 10% for numeric variables. 

quantiles_names <- setdiff(vars_fe, c("income"))

q <- lapply(quantiles_names, 
            function(name)
            {
              quantile(x=processed[[name]], 
                       probs=seq(0, 1, 1/10), 
                       na.rm=TRUE)
            })

names(q) <- quantiles_names

# Define uneven bins for the rest of the numeric variables. 
  
uneven_bins_names <- c("income")
  
b <- list(income = c(0, 100000, 200000, 400000))

# Define function to bucketize numeric variables.

bucketize <- function(data, q, quantiles_names, b, uneven_bins_names)
{
    
  data <- data.frame(data)
    
  # Bucketize based on quantiles.
  
  for(name in  quantiles_names)
  {
    name2 <- paste(name, "_bucket", sep = "")
    data[, name2] <- as.character(length(q[[name]]))
    for(i in seq(1, (length(q[[name]]) - 1)))
    {
      rows <- which(data[, name] < q[[name]][[i + 1]] & data[, name] >= q[[name]][[i]])
      data[rows, name2] <- as.character(i)
    }
  }
  
  # Bucketize based on manually defined bins. 
  
  for(name in  uneven_bins_names)
  {
    name2 <- paste(name, "_bucket", sep = "")
    data[, name2] <- as.character(length(b[[name]]))
    for(i in seq(1, (length(b[[name]]) - 1)))
    {
      rows <- which(data[, name] < b[[name]][i + 1] & data[, name] >= b[[name]][i])
      data[rows, name2] <- as.character(i)
    }
  }
  
  return(data) 
  
}  

# Create bucketized variables.

binned <- bucketize(data=processed, 
                    q=q, 
                    quantiles_names=quantiles_names, 
                    b=b, 
                    uneven_bins_names=uneven_bins_names)

# Save the new column names for future use. 
  
bucketized_old <- c(quantiles_names, uneven_bins_names)
  
vars <- c(vars, paste(bucketized_old, "_bucket", sep=""))

# Drop variables used for feature engineering.

vars <- setdiff(vars, vars_fe)

binned <- binned[vars]

# Convert all categorical variables into factor.

binned %>% 
  sapply(is.character) %>%
  which(useNames=TRUE) %>%
  names() ->
catc
  
binned[catc] %<>% lapply(factor)

glimpse(binned)

# Export the result.

write.csv(binned, file=file.path(wd, dpath, "binnedSimu.csv"), row.names=FALSE)

2.5.2 Model Building and Evaluation

In this section, we build and evaluate the models. Credit risk prediction is normally framed as a classification problem. Depending on the number of label classes, it can be either a binary or a multi-class classification problem.

The case in this example is a binary classification problem. The label for prediction is the default status, i.e., "bad_flag", which has two levels, "yes" and "no".

Not all variables are necessarily correlated with the label; therefore, feature selection is usually performed to eliminate the less relevant ones.

As the data set is a blend of numeric and discrete variables (or discrete variables only), simple correlation analysis is not applicable. One alternative is to train a model that performs automatic feature selection as part of model building. Such models include lasso logistic regression, extreme gradient boosting, random forest, and so on.
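
A simple variance-based filter can also flag uninformative columns before any model is trained. Below is a minimal sketch using caret::nearZeroVar(), which the setup loads for exactly this purpose; it is not applied in the modelling that follows.

# Flag near-zero-variance variables in the processed data (illustration only).

nzv <- nearZeroVar(processed, saveMetrics=TRUE)

nzv[nzv$nzv, ]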

2.5.2.1 Lasso Logistic Regression Model

First of all, the lasso logistic regression model is applied to the binned data, since logistic regression is sensitive to outliers.

# Introduce generic variables.

data <- binned

vars <- setdiff(names(data), c(target, id))

The data is split into training and testing datasets for model creation and validation.

# Split Data

set.seed(42)

data <- data[order(runif(nrow(data))), ]

train <- sample(nrow(data), 0.70 * nrow(data))
test <- setdiff(seq_len(nrow(data)), train)

We train a lasso logistic regression model as the baseline model.

# Matrix preparation for glmnet model

form <- as.formula(paste(target, paste(vars, collapse="+"), sep="~"))
form

xfactors <- model.matrix(form, data)[, -1]
x <- as.matrix(data.frame(xfactors))

bad_flag <- data$bad_flag
## Train lasso logistic regression model.

# Note alpha=1 gives the lasso penalty only; it can be blended with the ridge penalty down to alpha=0 (ridge only).

model_glmnet <- glmnet(x[train, ], y=bad_flag[train], alpha=1, family="binomial")

# Plot variable coefficients vs. shrinkage parameter lambda.

# plot(model_glmnet, xvar="lambda")

# Obtain optimal lambda using cross validation.

model_glmnet_cv <- cv.glmnet(x[train, ], y=bad_flag[train], alpha=1, family="binomial")

# plot(model_glmnet_cv)

best_lambda <- model_glmnet_cv$lambda.min %T>% print()

# Obtain the coefficient of each variable in the optimal fitted model, which shows variable importance as well.

coef(model_glmnet_cv, s=best_lambda)

Once a (lasso logistic regression) model has been built on the training data set, the validity of the model needs to be assessed on the testing data set.

# Evaluate model performance on the validation dataset. 

# Obtain the probabilities of credit default for the fitted optimal glmnet() model.

probs <- predict(model_glmnet_cv, newx=x[test, ], type="response", s=best_lambda)

# Create a prediction object.

pred <- ROCR::prediction(predictions=probs, 
                         labels=data[test, ]$bad_flag)

The predictive power of the model can be assessed in various ways, for example, with diagnostic graphs (e.g., the Kolmogorov-Smirnov chart, ROC curve, and lift chart) and the confusion matrix. The results are shown below.

# Draw Kolmogorov Smirnov chart. 

score1 <- pred@predictions[[1]][pred@labels[[1]]=="no"]
score2 <- pred@predictions[[1]][pred@labels[[1]]=="yes"]
group <- c(rep("score1", length(score1)), rep("score2", length(score2)))
dat <- data.frame(KSD=c(score1, score2), group=group)

# Create ECDF of data

cdf1 <- ecdf(score1) 
cdf2 <- ecdf(score2) 

# Find min and max statistics to draw line between points of greatest distance

minMax <- seq(min(score1, score2), max(score1, score2), length.out=length(score1))

# Compute KS

ks <- max(abs(cdf1(minMax) - cdf2(minMax))) %>% print()

# Find the predicted probability where the cumulative distributions have the biggest difference. 
x0 <- minMax[which(abs(cdf1(minMax) - cdf2(minMax)) == ks)] 
y0 <- cdf1(x0)
y1 <- cdf2(x0)

dat %>%
ggplot(aes(x=KSD, group=group, color = group)) +
  stat_ecdf(size=1) +
    geom_segment(aes(x=x0[1], y=y0[1], xend=x0[1], yend=y1[1]),
                 linetype = "dashed", color = "red") +
    geom_point(aes(x=x0[1], y=y0[1]), color="red", size=5) +
    geom_point(aes(x=x0[1], y=y1[1]), color="red", size=5) +
    annotate("text", x=0.3, y=0.00, hjust=1, vjust=0, size=5,
              label=paste("KS =", round(ks, 3))) +
    labs(x="Score",
         y="ECDF",
         title="KS Plot for Lasso Logistic Regression Model",
         caption="Source: " %s+% "glmnet") +
    theme(axis.text.x=element_text(hjust=1))

# Draw ROC curve

perf <- ROCR::performance(pred, "tpr", "fpr")
auc <- ROCR::performance(pred, "auc")@y.values[[1]] %>% print()
pd <- data.frame(fpr=unlist(perf@x.values), tpr=unlist(perf@y.values))

pd %>%
 ggplot(aes(x=fpr, y=tpr)) +
   geom_line(colour="red") +
   geom_line(data=data.frame(), aes(x=c(0,1), y=c(0,1)), colour="grey") +
   annotate("text", x=0.50, y=0.00, hjust=0, vjust=0, size=5,
             label=paste("AUC =", round(auc, 3))) +
   labs(x="False Positive Rate",
        y="True Positive Rate",
        title="ROC Curve for Lasso Logistic Regression Model",
        caption="Source: " %s+% "glmnet") +
   theme(axis.text.x = element_text(hjust=1))
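
# Draw lift chart. A minimal sketch reusing the prediction object above, based
# on ROCR's "lift" and "rpp" (rate of positive predictions) measures.

perf_lift <- ROCR::performance(pred, "lift", "rpp")

data.frame(rpp=unlist(perf_lift@x.values), lift=unlist(perf_lift@y.values)) %>%
  filter(is.finite(lift)) %>%
  ggplot(aes(x=rpp, y=lift)) +
  geom_line(colour="red") +
  labs(x="Rate of Positive Predictions",
       y="Lift",
       title="Lift Chart for Lasso Logistic Regression Model",
       caption="Source: " %s+% "glmnet")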

# Calculate the confusion matrix
 
df_pred <- data.frame(cbind(data[test, ], probs))
colnames(df_pred)[colnames(df_pred)=="X1"] <- "scored_probs"
df_pred %<>% mutate(scored_label=ifelse(scored_probs > 0.5, "yes", "no"))

confusionMatrix(data=factor(df_pred$scored_label, levels=levels(data[test, ]$bad_flag)), 
                reference=data[test, ]$bad_flag, 
                positive="yes")
2.5.2.2 Extreme Gradient Boosting

As mentioned earlier, it is not necessary to conduct binning when applying some machine learning algorithms, such as extreme gradient boosting. Here, we build the model on the processed data directly.

# Introduce a generic variable.

data <- processed

vars <- setdiff(names(data), c(target, id))
# Train model: xgboost

ntrain <- as.matrix(sapply(data[train, vars], as.numeric))

dtrain <- list()
dtrain$data <- Matrix(ntrain, sparse=TRUE)
dtrain$label <- as.numeric(data[train, target]) - 1

dtrain %>% str()

model_xgb <- xgboost(data=dtrain$data, 
                       label=dtrain$label,
                       booster="gbtree",
                       max_depth=6, 
                       eta=0.3, 
                       nthread=2, 
                       nrounds=150, 
                       verbose=0,
                       eval_metric="error",
                       objective="binary:logistic")

# Calculate feature importance

imp <- xgb.importance(feature_names=dtrain$data@Dimnames[[2]], 
                      model=model_xgb)

print(imp)

# Visualize feature importance

# xgb.plot.importance(imp)

# Display tree number 1

#xgb.plot.tree(feature_names=colnames(dtrain$data),
#              model=model_xgb,
#              n_first_tree=1)

# Score model

ntest <- as.matrix(sapply(data[test, vars], as.numeric))

dtest <- list()
dtest$data <- Matrix(ntest, sparse=TRUE)
dtest$label <- as.numeric(data[test, target]) - 1

dtest %>% str()

probs <- predict(model_xgb, dtest$data)

# Create a prediction object.

pred <- ROCR::prediction(predictions=probs, 
                         labels=data[test, ]$bad_flag)
# Draw Kolmogorov Smirnov chart. 

score1 <- pred@predictions[[1]][pred@labels[[1]]=="no"]
score2 <- pred@predictions[[1]][pred@labels[[1]]=="yes"]
group <- c(rep("score1", length(score1)), rep("score2", length(score2)))
dat <- data.frame(KSD=c(score1,score2), group=group)

# Create ECDF of data

cdf1 <- ecdf(score1) 
cdf2 <- ecdf(score2) 

# Find min and max statistics to draw line between points of greatest distance

minMax <- seq(min(score1, score2), max(score1, score2), length.out=length(score1))

# Compute KS

ks <- max(abs(cdf1(minMax) - cdf2(minMax))) %>% print()

# Find the predicted probability where the cumulative distributions have the biggest difference. 

x0 <- minMax[which(abs(cdf1(minMax) - cdf2(minMax)) == ks)] 
y0 <- cdf1(x0)
y1 <- cdf2(x0)

dat %>%
ggplot(aes(x=KSD, group=group, color=group)) +
  stat_ecdf(size=1) +
    geom_segment(aes(x=x0[1], y=y0[1], xend=x0[1], yend=y1[1]),
                 linetype="dashed", color="red") +
    geom_point(aes(x=x0[1], y=y0[1]), color="red", size=5) +
    geom_point(aes(x=x0[1], y=y1[1]), color="red", size=5) +
    annotate("text", x=0.3, y=0.00, hjust=0.1, vjust=0, size=5,
            label=paste("KS =", round(ks, 3))) +
    labs(x="Score",
         y="ECDF",
         title="KS Plot for Extreme Gradient Boosting Model",
         caption="Source: " %s+% "xgboost") +
    theme(axis.text.x=element_text(hjust=1))

# Draw ROC curve

perf <- ROCR::performance(pred, "tpr", "fpr")
auc <- ROCR::performance(pred, "auc")@y.values[[1]] %>% print()
pd <- data.frame(fpr=unlist(perf@x.values), tpr=unlist(perf@y.values))

pd %>%
 ggplot(aes(x=fpr, y=tpr)) +
   geom_line(colour="red") +
   geom_line(data=data.frame(), aes(x=c(0,1), y=c(0,1)), colour="grey") +
   annotate("text", x=0.50, y=0.00, hjust=0, vjust=0, size=5,
             label=paste("AUC =", round(auc, 3))) +
   labs(x="False Positive Rate",
        y="True Positive Rate",
        title="ROC Curve for Extreme Gradient Boosting Model",
        caption="Source: " %s+% "xgboost") +
   theme(axis.text.x=element_text(hjust=1))

# Calculate the confusion matrix
 
df_pred <- data.frame(cbind(data[test, ], scored_probs=probs))
df_pred %<>% mutate(scored_label=ifelse(scored_probs > 0.5, "yes", "no"))

confusionMatrix(data=factor(df_pred$scored_label, levels=levels(data[test, ]$bad_flag)), 
                reference=data[test, ]$bad_flag, 
                positive="yes")
2.5.2.3 Parameter Tuning and Model Ensembles
  1. Model-specific feature selection.

Here, we train an extreme gradient boosting model and then rank the feature importance.

# Set up the training control.

control <- trainControl(method="repeatedcv", number=5, repeats=3, allowParallel=TRUE)

# Train the model

model <- caret::train(bad_flag ~ .,
                      data[c(target, vars)],
                      method="xgbTree",
                      preProcess="scale",
                      trControl=control) 

# Estimate variable importance

imp <- varImp(model, scale=FALSE)

# Plot

plot(imp)

# Select the top-ranking variables.

imp_list <- rownames(imp$importance)[order(imp$importance$Overall, decreasing=TRUE)]

# Drop the low ranking variables.

top_vars <- 
  imp_list[1:(ncol(data) - 6)] %>%
  as.character() 

top_vars
  2. Individual models.
# Initialize training control.

tc <- trainControl(method="boot", 
                   number=5, 
                   repeats=3, 
                   search="grid",
                   classProbs=TRUE,
                   savePredictions="final",
                   summaryFunction=twoClassSummary,
                   allowParallel=TRUE)

# Extreme gradiant boosting model.

time_xgb <- system.time(
  model_xgb <- caret::train(bad_flag ~ .,
                      data[train, c(target, vars)],
                      method="xgbTree",
                      trControl=tc)
)

# Random forest model

time_rf <- system.time(
  model_rf <- caret::train(bad_flag ~ .,
                     data[train, c(target, vars)],
                     method="rf",
                     trControl=tc)
)
  3. Ensemble of models.
# Ensemble of models

levels(data$bad_flag) <- make.names(levels(data$bad_flag))

time_ensemble <- system.time(
  model_list <- caretList(bad_flag ~ ., 
                          data=data[train, c(target, vars)],
                          trControl=tc,
                          methodList=c("xgbTree", "rf"))
)

xyplot(resamples(model_list))
modelCor(resamples(model_list))

# Ensemble of models

model_ensemble <- caretEnsemble(
  model_list, 
  metric="ROC",
  trControl=tc)

# Stack of models. Use gbm for meta model.

model_stack <- caretStack(
  model_list,
  metric="ROC",
  method="gbm",
  verbose=FALSE,
  trControl=tc)

# Variable importance of model ensembles.

# varImp(model_ensemble)

Evaluation and performance comparison.

# Predict

models <- list(model_xgb, model_rf, model_ensemble, model_stack)

# Predict class

predictions <-lapply(models, 
                     predict, 
                     newdata=data[test, c(target, vars)])

levels(predictions[[4]]) <- c("no", "yes")

# Confusion matrix evaluation results.

cm_metrics <-lapply(predictions,
                    confusionMatrix, 
                    reference=data[test, target],
                    positive="yes")

# Accuracy

acc_metrics <- 
  lapply(cm_metrics, `[[`, "overall") %>%
  lapply(`[`, 1) %>%
  unlist() %>%
  as.vector()

# Recall

rec_metrics <- 
  lapply(cm_metrics, `[[`, "byClass") %>%
  lapply(`[`, 1) %>%
  unlist() %>%
  as.vector()
  
# Precision

pre_metrics <- 
  lapply(cm_metrics, `[[`, "byClass") %>%
  lapply(`[`, 3) %>%
  unlist() %>%
  as.vector()

# Predict class probability

probs12 <- lapply(models[c(1, 2)],
                predict,
                newdata=data[test, c(target, vars)],
                type="prob") %>%
         lapply('[[', 2)

probs34 <- lapply(models[c(3, 4)],
                predict,
                newdata=data[test, c(target, vars)],
                type="prob")

probs <- c(probs12, probs34)

# Create prediction object

preds <- lapply(probs, 
                ROCR::prediction,
                labels=data[test, target])

# Auc

auc_metrics <- lapply(preds, 
                      ROCR::performance,
                      "auc") %>%
               lapply(slot, "y.values") %>%
               lapply('[[', 1) %>%
               unlist()

algo_list <- c("Xgboost", "Random Forest", "Ensemble", "Stacking")
# The stacking time was not measured separately, so the ensemble elapsed time is reused for it.
time_consumption <- c(time_xgb[3], time_rf[3], time_ensemble[3], time_ensemble[3])

df_comp <- 
  data.frame(Models=algo_list, 
             Accuracy=acc_metrics, 
             Recall=rec_metrics, 
             Precision=pre_metrics,
             AUC=auc_metrics,
             Time=time_consumption) %T>%
             {head(.) %>% print()}

3 Conclusion

This document introduces data-driven advanced analytics for credit risk prediction. Techniques for exploratory analysis, data aggregation and cleansing, feature engineering, and, more importantly, model building and evaluation are demonstrated on the simulated datasets. This walk-through may help banks enhance risk management, identify opportunities to adjust credit limits, and eventually reduce losses and increase revenue in lending. A later version of this document will pursue an innovative hotspot method for credit risk.