## Introduction

We want to analyse loan data in order to predict the likelihood of default.

You will try to use:
* <strong>Decision trees</strong>
* <strong>Random forests</strong>
* <strong>Neural networks</strong>

Try to tune each model to improve how well it predicts `Default`. Use the mean squared error (MSE) on the test set to assess performance.
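As a reminder of what the MSE measures, here is a minimal sketch on made-up numbers. The helper name `mse` and the toy vectors are assumptions for illustration only and are not part of the notebook's data. A useful reference point is the MSE obtained by always predicting the mean of the target: a model is only worth keeping if it beats that baseline.

```r
# Hypothetical helper: mean squared error between predictions and observed values
mse <- function(predicted, actual) {
  mean((predicted - actual)^2)
}

# Toy values, invented for illustration
actual    <- c(0, 1, 0, 1)
predicted <- c(0.1, 0.8, 0.3, 0.6)

model_mse    <- mse(predicted, actual)                          # 0.075
baseline_mse <- mse(rep(mean(actual), length(actual)), actual)  # 0.25

print(model_mse)
print(baseline_mse)
```

This is the same quantity the cells below compute with `sum((predictions - test_$Default)^2)/nrow(test_)`.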

In [None]:
# Fix plot size
options(repr.plot.width=5, repr.plot.height=5)

set.seed(7293749)

library(randomForest) #rf
library(neuralnet) #nn 
library(MASS) 
library(rpart) #decision tree
library(e1071) #svm

scaled = read.csv("output_scaled_sample_data.csv", header=TRUE) #scaled data set

# keep a small chunk of the data so that the notebook can handle it
scaled = scaled[1:1000,]

index = sample(1:nrow(scaled),round(0.75*nrow(scaled))) #select the rows for the training set

# Split the data into a training set and a test set
train_ = scaled[index,]
test_ = scaled[-index,]

# Attach the data so we can refer to the column names
attach(train_)

First, load the libraries and data that we will use.

Before doing any analysis, take a quick peek at what's in the data:

In [None]:
head(train_)

## Decision Trees

The input parameters are:

* `minsplit`: minimum number of observations that must be in a node for a split to be attempted
* `cp`: complexity parameter; a split is only attempted if it decreases the overall lack of fit by at least a factor of `cp` (0.001 in the cell below)
* `method` is "class" for classification and "anova" for regression

See if by changing the input parameters below you can improve the MSE.
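One way to organise that search is to loop over a few candidate values and record the test MSE for each. The sketch below does this for `cp`. It uses the built-in `mtcars` data as a stand-in for the loan data (an assumption, so that the block runs on its own), and the grid values and `minsplit = 10` are illustrative choices, not recommendations.

```r
library(rpart)

set.seed(1)

# Stand-in data: mtcars ships with R; swap in train_/test_ and Default for the real task
idx   <- sample(seq_len(nrow(mtcars)), round(0.75 * nrow(mtcars)))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

cp_grid <- c(0.1, 0.01, 0.001)
results <- data.frame(cp = cp_grid, n_leaves = NA_integer_, mse = NA_real_)

for (i in seq_along(cp_grid)) {
  fit <- rpart(mpg ~ ., data = train, method = "anova",
               control = rpart.control(minsplit = 10, cp = cp_grid[i]))
  pred <- predict(fit, test)

  # Terminal nodes are labelled "<leaf>" in the fitted tree's frame
  results$n_leaves[i] <- sum(fit$frame$var == "<leaf>")
  results$mse[i]      <- mean((pred - test$mpg)^2)
}

print(results)
```

Comparing `n_leaves` with `mse` shows whether a larger tree actually helps on held-out data or only fits the training set more closely.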

In [None]:
minsplit = 30
cp = 0.001
method = "anova"

output.decisionTree = rpart(Default~., data=train_,control=rpart.control(minsplit=minsplit, cp=cp),method=method)

plot(output.decisionTree, uniform=TRUE, main="Regression Tree")
text(output.decisionTree, use.n=TRUE, all=TRUE, cex=.8)

# # variable importance
# summary(output.decisionTree)

# predictions on new data
predictions = predict(output.decisionTree, test_)

# assessment
plot(test_$Default, predictions, xlab="original",ylab="predicted",bty="n",pch=16, col="orange")
MSE = sum((predictions - test_$Default)^2)/nrow(test_)
cat("The MSE for this decision tree is ", MSE, "\n")


## Random Forest

The input parameters are:

* `ntree`: Number of trees to grow.
* `nodesize`: Minimum size of terminal nodes.

See if by changing the input parameters below you can improve the MSE.
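When comparing settings, keep in mind that a forest with few trees can give noticeably different results from one random seed to the next, so a single MSE value may be misleading. The sketch below refits a small and a larger forest under several seeds. It again uses `mtcars` as a stand-in for the loan data (an assumption), and `test_mse`, the seeds, and the tree counts are hypothetical choices for illustration.

```r
library(randomForest)

set.seed(1)

# Stand-in data: mtcars ships with R; swap in train_/test_ and Default for the real task
idx   <- sample(seq_len(nrow(mtcars)), round(0.75 * nrow(mtcars)))
train <- mtcars[idx, ]
test  <- mtcars[-idx, ]

# Hypothetical helper: fit a forest with a given seed and return its test MSE
test_mse <- function(ntree, seed) {
  set.seed(seed)
  fit <- randomForest(mpg ~ ., data = train, ntree = ntree, nodesize = 3)
  mean((predict(fit, test) - test$mpg)^2)
}

seeds <- 1:5
small <- sapply(seeds, function(s) test_mse(10, s))
large <- sapply(seeds, function(s) test_mse(300, s))

# One row per forest size, one column per seed
print(rbind(ntree_10 = small, ntree_300 = large))
```

If the row for the small forest varies much more across seeds than the row for the larger one, differences between parameter settings at `ntree = 10` should be read with caution.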

In [None]:
#load the relevant library
library(randomForest)

ntree = 10
nodesize = 3

output.forest = randomForest(Default ~ ., data = train_, ntree=ntree, nodesize = nodesize)
# variable importance
print(importance(output.forest,type = 2)) 

# predictions on new data
predictions = predict(output.forest, test_)

# assessment
plot(test_$Default, predictions, xlab="original",ylab="predicted",bty="n",pch=16, col="orange");
MSE = sum((predictions - test_$Default)^2)/nrow(test_)
cat("The MSE for this random forest is: ",MSE,"\n")

## Neural Network

The input parameters are:

* `hidden`: a vector giving the number of neurons in each hidden layer, e.g. `c(n, m)` for n neurons in the first hidden layer and m in the second
* `rep`: the number of repetitions of the neural network’s training

See if by changing the input parameters below you can improve the MSE.
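To make the meaning of `hidden` concrete, the sketch below works out the weight-matrix shapes implied by a given `hidden` vector and compares them with a fitted network. The helper `layer_shapes` and the synthetic `toy` data are assumptions made for illustration and are unrelated to the loan data. Each layer has one extra input row for its bias term.

```r
library(neuralnet)

# Hypothetical helper: expected (rows, cols) of each weight matrix,
# where rows = neurons in the previous layer + 1 bias, cols = neurons in the next layer
layer_shapes <- function(n_inputs, hidden, n_outputs = 1) {
  sizes <- c(n_inputs, hidden, n_outputs)
  lapply(seq_len(length(sizes) - 1), function(i) c(sizes[i] + 1, sizes[i + 1]))
}

set.seed(1)

# Synthetic data with a simple linear target, invented for illustration
toy   <- data.frame(x1 = runif(200), x2 = runif(200))
toy$y <- 0.3 * toy$x1 + 0.5 * toy$x2

hidden <- c(3, 2)  # 3 neurons in the first hidden layer, 2 in the second
fit <- neuralnet(y ~ x1 + x2, data = toy, hidden = hidden, linear.output = TRUE)

expected <- layer_shapes(2, hidden)
str(expected)

# fit$weights is only present if training converged
if (!is.null(fit$weights)) {
  print(lapply(fit$weights[[1]], dim))
}
```

Under this reading, `hidden = c(2, 1)` in the cell below describes two hidden layers: the first with two neurons, the second with one.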

In [None]:
hidden = c(2,1) # start small; you might be dealing with a bigger data set than you think
rep = 1

n = names(train_)
f = as.formula(paste("Default ~", paste(n[!n %in% "Default"], collapse = " + ")))
output.nn = neuralnet(f,data=train_,hidden=hidden,linear.output=T,rep=rep) 

# predictions on new data
# select every column except Default by name rather than by hard-coded position
predictions = compute(output.nn, test_[, names(test_) != "Default"])$net.result

# assessment
plot(test_$Default, predictions, xlab="original",ylab="predicted",bty="n",pch=16, col="orange");
MSE = sum((predictions - test_$Default)^2)/nrow(test_)
cat("The MSE for the NN is: ",MSE,"\n")