Permalink
Switch branches/tags
Nothing to show
Find file Copy path
Fetching contributors…
Cannot retrieve contributors at this time
373 lines (279 sloc) 9.11 KB
title author date output
Faster and Scalable Credit Risk Prediction
Fang Zhou, Data Scientist, Microsoft
`r Sys.Date()`
html_document
knitr::opts_chunk$set(echo = TRUE,
                      fig.width = 8,
                      fig.height = 5,
                      fig.align='center',
                      dev = "png")

1 Introduction

Microsoft R is a collection of servers and tools that extend the capabilities of R, making it easier and faster to build and deploy R-based solutions. Microsoft R brings you the ability to do parallel and chunked data processing and modelling that relaxes the restrictions on dataset size imposed by in-memory open source R.

The MicrosoftML package brings new machine learning functionality with increased speed, performance and scalability, especially for handling a large corpus of text data or high-dimensional categorical data. The MicrosoftML package is installed with Microsoft R Client, Microsoft R Server and with the SQL Server Machine Learning Services.

This document will walk through you how to build faster and scalable credit risk models, using the MicrosoftML package that adds state-of-the-art machine learning algorithms and data transforms to Microsoft R Server.

2 Faster and Scalable Credit Risk Models

2.1 Setup

We load the required R packages.

## Setup

# Load the required packages into the R session.

library(rattle)       # Use normVarNames().
library(dplyr)        # Wrangling: tbl_df(), group_by(), print(), glimpse().
library(magrittr)     # Pipe operator %>% %<>% %T>% equals().
library(scales)       # Include commas in numbers.
library(RevoScaleR)   # Enable out-of-memory computation in R.
library(dplyrXdf)     # Wrangling on xdf data format.
library(MicrosoftML)  # Build models using Microsoft ML algortihms.
library(caret)        # Calculate confusion matrix by using confusionMatrix().
library(ROCR)         # Provide functions for model performance evaluation.

Then, the dataset processedSimu is ingested and transformed into a .xdf data format. This random dataset was created to simulate real world banking transaction data.

## Data Ingestion

# Identify the source location of the dataset.

wd <- getwd()

dpath <- "../Data"
data_fname <- file.path(wd, dpath, "processedSimu.csv")
output_fname <- file.path(wd, dpath, "processedSimu.xdf")
output <- RxXdfData(file=output_fname)

# Ingest the dataset.

data <- rxImport(inData=data_fname, 
                 outFile=output,
                 stringsAsFactors=TRUE,
                 overwrite=TRUE)


# View data information.

rxGetVarInfo(data)

2.2 Model Building

Now, let's get started to build credit risk models by leveraging different machine learning algorithms from the MicrosoftML package.

First of all, we create individual machine learning models on the dataset processedSimu.xdf by using the functions rxLogisticRegression(), rxFastForest(), rxFastTrees().

From the credit risk prediction template, we know that gradient boosting is the most suitable algorithm for this example, considering the overall performance. Therefore, the models implemented by the function rxFastTrees() with different sets of parameters are trained respectively.

## Variable roles.

# Target variable

target <- "bad_flag"

# Note any identifier.

id <- c("account_id") %T>% print() 

# Note the available variables as model inputs.

vars <- setdiff(names(data), c(target, id))
# Split Data

set.seed(42)

# Add training/testing flag to each observation.

data %<>%
  mutate(.train=factor(sample(1:2, .rxNumRows,
                              replace=TRUE,
                              prob=c(0.70, 0.30)),
                       levels=1:2))

# Split dataset into training/test.

data_split <- rxSplit(data, splitByFactor=".train")
# Prepare the formula

top_vars <- c("amount_6", "pur_6", "avg_pur_amt_6", "avg_interval_pur_6", "credit_limit", "age", "income", "sex", "education", "marital_status")

form <- as.formula(paste(target, paste(top_vars, collapse="+"), sep="~"))
form
# Specify the local parallel compute context.

rxSetComputeContext("localpar")

# Train model: rxLogisticRegression

time_rxlogit <- system.time(
  
  model_rxlogit <- rxLogisticRegression(
    formula=form,
    data=data_split[[1]],
    type="binary",
    l1Weight=1,
    verbose=0
  )
)

# Train model: rxFastForest

time_rxforest <- system.time(
  
  model_rxforest <- rxFastForest(
    formula=form,
    data=data_split[[1]],
    type="binary",
    numTrees=100,
    numLeaves=20,
    minSplit=10,
    verbose=0
  )
)

# Train model: rxFastTrees

time_rxtrees1 <- system.time(
  
  model_rxtrees1 <- rxFastTrees(
    formula=form,
    data=data_split[[1]],
    type="binary",
    numTrees=100,
    numLeaves=20,
    learningRate=0.2,
    minSplit=10,
    unbalancedSets=FALSE,
    verbose=0
  )
)

time_rxtrees2 <- system.time(
  
  model_rxtrees2 <- rxFastTrees(
    formula=form,
    data=data_split[[1]],
    type="binary",
    numTrees=500,
    numLeaves=20,
    learningRate=0.2,
    minSplit=10,
    unbalancedSets=FALSE,
    verbose=0
  )
)

time_rxtrees3 <- system.time(
  
  model_rxtrees3 <- rxFastTrees(
    formula=form,
    data=data_split[[1]],
    type="binary",
    numTrees=500,
    numLeaves=20,
    learningRate=0.3,
    minSplit=10,
    unbalancedSets=FALSE,
    verbose=0
  )
)

time_rxtrees4 <- system.time(
  
  model_rxtrees4 <- rxFastTrees(
    formula=form,
    data=data_split[[1]],
    type="binary",
    numTrees=500,
    numLeaves=20,
    learningRate=0.3,
    minSplit=10,
    unbalancedSets=TRUE,
    verbose=0
  )
)

Next, we build an ensemble of fast tree models by using the function rxEnsemble().

# Train an ensemble model.

time_ensemble <- system.time(
  
  model_ensemble <- rxEnsemble(
    formula=form,
    data=data_split[[1]],
    type="binary",
    trainers=list(fastTrees(), 
                  fastTrees(numTrees=500), 
                  fastTrees(numTrees=500, learningRate=0.3),
                  fastTrees(numTrees=500, learningRate=0.3, unbalancedSets=TRUE)),
    combineMethod="vote",
    replace=TRUE,
    verbose=0
  )
)

2.3 Model Evaluation

Finally, we evaluate and compare the above built models at various aspects.

# Predict

models <- list(model_rxlogit, model_rxforest, 
               model_rxtrees1, model_rxtrees2, model_rxtrees3, model_rxtrees4, 
               model_ensemble)

# Predict class

predictions <- lapply(models, 
                      rxPredict, 
                      data=data_split[[2]]) %>%
                lapply('[[', 1)

levels(predictions[[7]]) <- c("no", "yes")

# Confusion matrix evaluation results.

cm_metrics <-lapply(predictions,
                    confusionMatrix, 
                    reference=data_split[[2]][[target]],
                    positive="yes")

# Accuracy

acc_metrics <- 
  lapply(cm_metrics, `[[`, "overall") %>%
  lapply(`[`, 1) %>%
  unlist() %>%
  as.vector()

# Recall

rec_metrics <- 
  lapply(cm_metrics, `[[`, "byClass") %>%
  lapply(`[`, 1) %>%
  unlist() %>%
  as.vector()
  
# Precision

pre_metrics <- 
  lapply(cm_metrics, `[[`, "byClass") %>%
  lapply(`[`, 3) %>%
  unlist() %>%
  as.vector()

# Predict class probability

probs <- lapply(models[c(1, 2, 3, 4, 5, 6)],
                rxPredict,
                data=data_split[[2]]) %>%
                lapply('[[', 3)

# Create prediction object

preds <- lapply(probs, 
                ROCR::prediction,
                labels=data_split[[2]][[target]])

# Auc

auc_metrics <- lapply(preds, 
                      ROCR::performance,
                      "auc") %>%
               lapply(slot, "y.values") %>%
               lapply('[[', 1) %>%
               unlist()

auc_metrics <- c(auc_metrics, NaN)

algo_list <- c("rxLogisticRegression", 
               "rxFastForest", 
               "rxFastTrees", 
               "rxFastTrees(500)", 
               "rxFastTrees(500, 0.3)", 
               "rxFastTrees(500, 0.3, ub)",
               "rxEnsemble")

time_consumption <- c(time_rxlogit[3], time_rxforest[[3]], 
                      time_rxtrees1[3], time_rxtrees2[[3]], 
                      time_rxtrees3[[3]], time_rxtrees4[[3]],
                      time_ensemble[3])

df_comp <- 
  data.frame(Models=algo_list, 
             Accuracy=acc_metrics, 
             Recall=rec_metrics, 
             Precision=pre_metrics,
             AUC=auc_metrics,
             Time=time_consumption) %T>%
             print()

2.4 Save Models for Deployment

Last but not least, we need to save the model objects in various formats, (e.g., .RData, SQLServerData, ect) for the later usage of deployment.

# Save model for deployment usage.

model_rxtrees <- model_rxtrees3

save(model_rxtrees, file="model_rxtrees.RData")