## Creating a simple tuned model

In [3]:
# importing the Credit_Data file
credit<-read.csv("Credit_Data.csv", stringsAsFactors = TRUE)
str(credit)

'data.frame':	1000 obs. of  21 variables:
 $ default                   : int  0 1 0 0 1 0 0 0 0 1 ...
 $ account_check_status      : Factor w/ 4 levels "< 0 DM",">= 200 DM / salary assignments for at least 1 year",..: 1 3 4 1 1 4 4 3 4 3 ...
 $ duration_in_month         : int  6 48 12 42 24 36 24 36 12 30 ...
 $ credit_history            : Factor w/ 5 levels "all credits at this bank paid back duly",..: 2 4 2 4 3 4 4 4 4 2 ...
 $ purpose                   : Factor w/ 10 levels "(vacation - does not exist?)",..: 5 5 1 8 3 1 8 4 5 3 ...
 $ credit_amount             : int  1169 5951 2096 7882 4870 9055 2835 6948 3059 5234 ...
 $ savings                   : Factor w/ 5 levels ".. >= 1000 DM ",..: 5 2 2 2 2 5 4 2 1 2 ...
 $ present_emp_since         : Factor w/ 5 levels ".. >= 7 years",..: 1 3 4 4 3 3 1 3 4 5 ...
 $ installment_as_income_perc: int  4 2 2 2 3 2 3 2 2 4 ...
 $ personal_status_sex       : Factor w/ 4 levels "female : divorced/separated/married",..: 4 1 4 4 4 4 4 4 2 3 ...
 $ o

In [4]:
# making the default variable a factor 
credit$default<-factor(credit$default)

In [5]:
# cleaning the data
levels(credit$savings)<-make.names(levels(credit$savings))
levels(credit$account_check_status)<-make.names(levels(credit$account_check_status))
levels(credit$present_emp_since)<-make.names(levels(credit$present_emp_since))
levels(credit$personal_status_sex)<-make.names(levels(credit$personal_status_sex))
levels(credit$property)<-make.names(levels(credit$property))
levels(credit$credit_history)<-make.names(levels(credit$credit_history))
levels(credit$foreign_worker)<-make.names(levels(credit$foreign_worker))
levels(credit$housing)<-make.names(levels(credit$housing))
levels(credit$job)<-make.names(levels(credit$job))
levels(credit$other_debtors)<-make.names(levels(credit$other_debtors))
levels(credit$other_installment_plans)<-make.names(levels(credit$other_installment_plans))
levels(credit$purpose)<-make.names(levels(credit$purpose))
levels(credit$telephone)<-make.names(levels(credit$telephone))

In [9]:
# Training the model using the caret package containing the C50 function
# install.packages("caret") #:: DISCLAIMER!!...takes about 4-5 minutes to install
# install.packages("C50")
library(caret)
library(C50)
set.seed(300)
m<-train(default~., data = credit, method = "C5.0")

In [10]:
# checking the output of m
m

C5.0 

1000 samples
  20 predictor
   2 classes: '0', '1' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 1000, 1000, 1000, 1000, 1000, 1000, ... 
Resampling results across tuning parameters:

  model  winnow  trials  Accuracy   Kappa    
  rules  FALSE    1      0.6955117  0.2797417
  rules  FALSE   10      0.7309153  0.3395596
  rules  FALSE   20      0.7386964  0.3519249
  rules   TRUE    1      0.6994184  0.2838831
  rules   TRUE   10      0.7242280  0.3271621
  rules   TRUE   20      0.7295351  0.3368349
  tree   FALSE    1      0.6914577  0.2535941
  tree   FALSE   10      0.7278388  0.3053035
  tree   FALSE   20      0.7358419  0.3257068
  tree    TRUE    1      0.6916946  0.2614563
  tree    TRUE   10      0.7287126  0.3096942
  tree    TRUE   20      0.7336099  0.3233626

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were trials = 20, model = rules and
 winnow = FALSE.

In [11]:
# to apply the best model to make predictions on the training data
p<-predict(m, credit)

In [12]:
# the resulting vector of predictions
table(p, credit$default)

   
p     0   1
  0 699   4
  1   1 296

Of the 1000 observations, only 5 were misclassifiesd giving an accuracy of 99.5%. However, this should not be the case for an unseen data. The best accuracy for the model would be 73.86% as per the model summarry. 

In [13]:
# the predicted class
head(p)

In [14]:
# checking the probability of each class
head(predict(m, credit, type = "prob"))

Unnamed: 0_level_0,0,1
Unnamed: 0_level_1,<dbl>,<dbl>
1,1.0,0.0
2,0.05598479,0.94401521
3,0.95237185,0.04762815
4,0.78317975,0.21682025
5,0.18162126,0.81837874
6,0.80183886,0.19816114


## Customizing the tuning process

In [15]:
# create a control object ctrl which would be used later
ctrl<-trainControl(method = "cv", number = 10, selectionFunction = "oneSE")

In [16]:
# we can use the expand.grid() function, which creates data frames from the combinations of all values supplied
grid<-expand.grid(.model = "tree",
                  .trials = c(1,5,10,15,20,25,30,35),
                  .winnow = FALSE)
grid

.model,.trials,.winnow
<fct>,<dbl>,<lgl>
tree,1,False
tree,5,False
tree,10,False
tree,15,False
tree,20,False
tree,25,False
tree,30,False
tree,35,False


In [17]:
# we then run a throrough customized train() experiment
m1<-train(default~., data = credit, method = "C5.0",
          metric = "Kappa",
          trControl = ctrl,
          tuneGrid = grid)
m

C5.0 

1000 samples
  20 predictor
   2 classes: '0', '1' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 1000, 1000, 1000, 1000, 1000, 1000, ... 
Resampling results across tuning parameters:

  model  winnow  trials  Accuracy   Kappa    
  rules  FALSE    1      0.6955117  0.2797417
  rules  FALSE   10      0.7309153  0.3395596
  rules  FALSE   20      0.7386964  0.3519249
  rules   TRUE    1      0.6994184  0.2838831
  rules   TRUE   10      0.7242280  0.3271621
  rules   TRUE   20      0.7295351  0.3368349
  tree   FALSE    1      0.6914577  0.2535941
  tree   FALSE   10      0.7278388  0.3053035
  tree   FALSE   20      0.7358419  0.3257068
  tree    TRUE    1      0.6916946  0.2614563
  tree    TRUE   10      0.7287126  0.3096942
  tree    TRUE   20      0.7336099  0.3233626

Accuracy was used to select the optimal model using the largest value.
The final values used for the model were trials = 20, model = rules and
 winnow = FALSE.

Something's wrong here

# Improving model performance using meta learning

## Bagging (Bootstrap aggregating)

In [19]:
# Creating the ensemble
# install.packages("ipred")
library(ipred)
set.seed(300)
mybag <- bagging(default ~ ., data = credit, nbagg = 25)
mybag


Bagging classification trees with 25 bootstrap replications 

Call: bagging.data.frame(formula = default ~ ., data = credit, nbagg = 25)



In [20]:
credit_pred <- predict(mybag, credit)
table(credit_pred, credit$default)

           
credit_pred   0   1
          0 699   3
          1   1 297

In [21]:
# seeing how it transforms to the future performance
library(caret)
set.seed(300)
ctrl <- trainControl(method = "cv", number = 10)
train(default ~ ., data = credit, method = "treebag", trControl = ctrl)

Bagged CART 

1000 samples
  20 predictor
   2 classes: '0', '1' 

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 900, 900, 900, 900, 900, 900, ... 
Resampling results:

  Accuracy  Kappa   
  0.752     0.372143


## Boosting  
AdaBoosting (Adaptive Boosting)

## Random Forest  
Decision tree forest

In [23]:
# Training random forest
# install.packages("randomForest")
library(randomForest)
set.seed(300)
rf <- randomForest(default ~ ., data = credit)
rf

randomForest 4.7-1.1

Type rfNews() to see new features/changes/bug fixes.


Attaching package: ‘randomForest’


The following object is masked from ‘package:ggplot2’:

    margin





Call:
 randomForest(formula = default ~ ., data = credit) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 4

        OOB estimate of  error rate: 23.6%
Confusion matrix:
    0   1 class.error
0 643  57  0.08142857
1 179 121  0.59666667

OOB estimate of error rate is the Out of Bag error rate. Is an estimate of the test set error. This means that it should be a fairly reasonable estimate of future performance

## Evaluating Random Forest performance

In [24]:
# We compare an auto-tuned random forest to the best auto-tuned boosted C5.0
# model we've been working on.
# setting our training control options
library(caret)
ctrl <- trainControl(method = "repeatedvc", number = 10, repeats = 10)

“`repeats` has no meaning for this resampling method.”


In [25]:
# setting up a tuning grid for the random forest
grid_rf <- expand.grid(.mtry = c(2,4,8,16))

In [30]:
# DISCLAIMER :: Takes upto 5 minutes to run
# supplying the grid and ctrl object to the train() function
set.seed(300)
m_rf <- train(default ~ ., data = credit, method = "rf", 
              metric = "Kappa", trainControl = ctrl, tuneGrid = grid_rf)

In [32]:
grid_c50 <- expand.grid(.model = "tree",
                        .trials = c(10,20,30,40),
                        .winnow = FALSE)
set.seed(300)
m_c50 <- train(default ~ ., data = credit, method = "C5.0",
                metric = "Kappa", trainControl = ctrl,
                tuneGrid = grid_c50)

In [34]:
# comparing the two approaches 
m_rf

Random Forest 

1000 samples
  20 predictor
   2 classes: '0', '1' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 1000, 1000, 1000, 1000, 1000, 1000, ... 
Resampling results across tuning parameters:

  mtry  Accuracy   Kappa    
   2    0.7237513  0.1203024
   4    0.7415224  0.2573749
   8    0.7448996  0.3027063
  16    0.7499223  0.3375247

Kappa was used to select the optimal model using the largest value.
The final value used for the model was mtry = 16.

In [35]:
m_c50

C5.0 

1000 samples
  20 predictor
   2 classes: '0', '1' 

No pre-processing
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 1000, 1000, 1000, 1000, 1000, 1000, ... 
Resampling results across tuning parameters:

  trials  Accuracy   Kappa    
  10      0.7278388  0.3053035
  20      0.7358419  0.3257068
  30      0.7398061  0.3341462
  40      0.7412837  0.3398734

Tuning parameter 'model' was held constant at a value of tree
Tuning
 parameter 'winnow' was held constant at a value of FALSE
Kappa was used to select the optimal model using the largest value.
The final values used for the model were trials = 40, model = tree and winnow
 = FALSE.

With a kappa of 0.361, the random forest model with mtry = 16 was the winner
among these eight models. It was marginally higher than the best C5.0 decision tree, which had a kappa of 0.334. Based on these results, we would submit the random
forest as our final model.