# Universal Bank Prediction

The goal of this project is to explore various classification methods in R, using the Universal Bank data to predict the `PersonalLoan` class.

## Customize Environment

In [53]:
# load packages
packages <- c("purrr", "doMC", "ggplot2", "caret", "pROC", "ROCR")
purrr::walk(packages, library, character.only = TRUE, warn.conflicts = FALSE)

# set default plot size
options(repr.plot.width=10, repr.plot.height=6)

# configure multicore processing
registerDoMC(cores=8)

## Load Data

In [54]:
# load the data from a local CSV file
df <- read.csv("universal_bank.csv")

# convert variables to factors as needed
cols <- c("ZIP", "Education", "SecuritiesAccount", "CDAccount", "Online", "CreditCard", "PersonalLoan")
df[cols] <- lapply(df[cols], factor)

# create names for the class variable
levels(df$PersonalLoan) <- make.names(c("No", "Yes"))

# show a sample of the data
head(df)

# show the structure of the data
str(df)

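| Age | Experience | Income | ZIP | Family | CCAvg | Education | Mortgage | SecuritiesAccount | CDAccount | Online | CreditCard | PersonalLoan |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 1 | 0 | 0 | 0 | No |
| 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | No |
| 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | No |
| 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 1 | No |
| 37 | 13 | 29 | 92121 | 4 | 0.4 | 2 | 155 | 0 | 0 | 1 | 0 | No |
| 50 | 24 | 22 | 93943 | 1 | 0.3 | 3 | 0 | 0 | 0 | 0 | 1 | No |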


'data.frame':	5000 obs. of  13 variables:
 $ Age              : int  25 39 35 35 37 50 35 65 67 60 ...
 $ Experience       : int  1 15 9 8 13 24 10 39 41 30 ...
 $ Income           : int  49 11 100 45 29 22 81 105 112 22 ...
 $ ZIP              : Factor w/ 467 levels "9307","90005",..: 84 368 299 97 161 268 35 367 118 398 ...
 $ Family           : int  4 1 1 4 4 1 3 4 1 1 ...
 $ CCAvg            : num  1.6 1 2.7 1 0.4 0.3 0.6 2.4 2 1.5 ...
 $ Education        : Factor w/ 3 levels "1","2","3": 1 1 2 2 2 3 2 3 1 3 ...
 $ Mortgage         : int  0 0 0 0 155 0 104 0 0 0 ...
 $ SecuritiesAccount: Factor w/ 2 levels "0","1": 2 1 1 1 1 1 1 1 2 1 ...
 $ CDAccount        : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
 $ Online           : Factor w/ 2 levels "0","1": 1 1 1 1 2 1 2 1 1 2 ...
 $ CreditCard       : Factor w/ 2 levels "0","1": 1 1 1 2 1 2 1 1 1 2 ...
 $ PersonalLoan     : Factor w/ 2 levels "No","Yes": 1 1 1 1 1 1 1 1 1 1 ...


## Partition Data

In [27]:
# split data into training and test
trainIndex <- createDataPartition(df$PersonalLoan, p = .7, 
                                  list = FALSE, 
                                  times = 1)

train <- df[ trainIndex,]
test  <- df[-trainIndex,]

In [28]:
# look at class proportions between training and test data
cat("Proportion of PersonalLoan in training data:")
prop.table(table(train$PersonalLoan))

cat("\nProportion of PersonalLoan in test data:")
prop.table(table(test$PersonalLoan))

Proportion of PersonalLoan in training data:


   No   Yes 
0.904 0.096 


Proportion of PersonalLoan in test data:


   No   Yes 
0.904 0.096 
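The matching class proportions above are what a stratified split is expected to produce. As a rough illustration of the idea (not caret's actual implementation), the base R sketch below samples 70% of the rows within each class separately. The toy label vector `y` and the seed are hypothetical and only mimic the imbalance seen in `PersonalLoan`.

```r
# Hypothetical toy labels with roughly the same imbalance as PersonalLoan
set.seed(42)
y <- factor(c(rep("No", 904), rep("Yes", 96)))

# Sample 70% of the row indices within each class separately
idx_by_class <- split(seq_along(y), y)
train_idx <- unname(unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.7 * length(idx)))
})))

train_y <- y[train_idx]
test_y  <- y[-train_idx]

# Class proportions should be close in both partitions
train_prop <- prop.table(table(train_y))
test_prop  <- prop.table(table(test_y))
print(round(train_prop, 3))
print(round(test_prop, 3))
```

Sampling within each class keeps the rare "Yes" class from being over- or under-represented in either partition by chance.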

## Predict Personal Loan

### Decision Tree

In [31]:
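# 10-fold cross-validation repeated 5 times; class probabilities are
# required so that twoClassSummary can report ROC, Sens, and Spec
fit_ctrl <- trainControl(method = "repeatedcv", 
                         number = 10,
                         repeats = 5,
                         classProbs = TRUE,
                         summaryFunction = twoClassSummary)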

In [33]:
set.seed(1234)
rpart_fit <- train(PersonalLoan ~ ., 
                   data = train,
                   method = "rpart",
                   metric = "ROC",
                   trControl = fit_ctrl,
                   na.action = na.exclude)

In [34]:
rpart_fit

CART 

3500 samples
  12 predictor
   2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 3150, 3150, 3151, 3151, 3149, 3149, ... 
Resampling results across tuning parameters:

  cp         ROC        Sens       Spec     
  0.1547619  0.9092224  0.9966494  0.7368806
  0.1785714  0.8844956  0.9977239  0.5478075
  0.2306548  0.6811746  0.9993042  0.2014082

ROC was used to select the optimal model using  the largest value.
The final value used for the model was cp = 0.1547619. 

In [35]:
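# predict class probabilities on the test set and classify at a 0.5 cut-off
rpart_pred <- predict(rpart_fit, test, type = "prob")
rpart_pred$pred <- factor(ifelse(rpart_pred$Yes >= .5, "Yes", "No"),
                          levels = levels(test$PersonalLoan))
rpart_pred <- cbind(rpart_pred, actual = test$PersonalLoan)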
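When predicted classes are built with `factor(ifelse(...))`, the resulting levels depend on which labels actually occur in the predictions. The sketch below uses hypothetical probabilities (no value reaches the 0.5 cut-off) to show how the "Yes" level can silently disappear, and how passing `levels` explicitly keeps both classes.

```r
# Hypothetical probabilities where no case reaches the 0.5 cut-off
prob_yes <- c(0.10, 0.32, 0.05, 0.49)

# Without explicit levels, the unused "Yes" level is lost
pred_implicit <- factor(ifelse(prob_yes >= 0.5, "Yes", "No"))

# With explicit levels, both classes are kept
pred_explicit <- factor(ifelse(prob_yes >= 0.5, "Yes", "No"),
                        levels = c("No", "Yes"))

print(levels(pred_implicit))
print(levels(pred_explicit))
```

Keeping the prediction and reference factors on the same levels avoids surprises in functions that compare them level by level.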

In [72]:
confusionMatrix(data = rpart_pred$pred, 
                reference = rpart_pred$actual, 
                positive = levels(test$PersonalLoan)[2])

Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  1352   48
       Yes    4   96
                                         
               Accuracy : 0.9653         
                 95% CI : (0.9548, 0.974)
    No Information Rate : 0.904          
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.7687         
 Mcnemar's Test P-Value : 2.476e-09      
                                         
            Sensitivity : 0.66667        
            Specificity : 0.99705        
         Pos Pred Value : 0.96000        
         Neg Pred Value : 0.96571        
             Prevalence : 0.09600        
         Detection Rate : 0.06400        
   Detection Prevalence : 0.06667        
      Balanced Accuracy : 0.83186        
                                         
       'Positive' Class : Yes            
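To make the statistics above easier to read, the following base R sketch recomputes several of them by hand from the four cell counts of the decision tree confusion matrix. The variable names are illustrative only; the counts are copied from the output.

```r
# Counts copied from the decision tree confusion matrix above
tp <- 96    # predicted Yes, actual Yes
fn <- 48    # predicted No,  actual Yes
fp <- 4     # predicted Yes, actual No
tn <- 1352  # predicted No,  actual No
n  <- tp + fn + fp + tn

sensitivity <- tp / (tp + fn)   # share of actual Yes that were found
specificity <- tn / (tn + fp)   # share of actual No that were found
ppv         <- tp / (tp + fp)   # Pos Pred Value
npv         <- tn / (tn + fn)   # Neg Pred Value
accuracy    <- (tp + tn) / n
balanced_accuracy <- (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement corrected for chance agreement
p_chance <- ((tn + fn) / n) * ((tn + fp) / n) +
            ((tp + fp) / n) * ((tp + fn) / n)
kappa <- (accuracy - p_chance) / (1 - p_chance)

print(round(c(sensitivity = sensitivity, specificity = specificity,
              ppv = ppv, npv = npv, accuracy = accuracy,
              balanced_accuracy = balanced_accuracy, kappa = kappa), 5))
```

The high accuracy is driven mostly by the majority "No" class; sensitivity shows that a third of the actual "Yes" cases are missed at the 0.5 cut-off.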
                                         

### Bagged Classification Trees

In [67]:
set.seed(1234)
treebag_fit <- train(PersonalLoan ~ ., 
                     data = train,
                     method = "treebag",
                     nbagg = 100,
                     metric = "ROC",
                     trControl = fit_ctrl, 
                     na.action=na.exclude)

In [68]:
treebag_fit

Bagged CART 

3500 samples
  12 predictor
   2 classes: 'No', 'Yes' 

No pre-processing
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 3150, 3150, 3151, 3151, 3149, 3149, ... 
Resampling results:

  ROC        Sens       Spec    
  0.9924231  0.9964597  0.850713

 

In [69]:
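# predict class probabilities on the test set and classify at a 0.5 cut-off
treebag_pred <- predict(treebag_fit, test, type = "prob")
treebag_pred$pred <- factor(ifelse(treebag_pred$Yes >= .5, "Yes", "No"),
                            levels = levels(test$PersonalLoan))
treebag_pred <- cbind(treebag_pred, actual = test$PersonalLoan)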

In [71]:
confusionMatrix(data = treebag_pred$pred, 
                reference = treebag_pred$actual, 
                positive = levels(test$PersonalLoan)[2])

Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  1350   20
       Yes    6  124
                                          
               Accuracy : 0.9827          
                 95% CI : (0.9747, 0.9886)
    No Information Rate : 0.904           
    P-Value [Acc > NIR] : < 2e-16         
                                          
                  Kappa : 0.8956          
 Mcnemar's Test P-Value : 0.01079         
                                          
            Sensitivity : 0.86111         
            Specificity : 0.99558         
         Pos Pred Value : 0.95385         
         Neg Pred Value : 0.98540         
             Prevalence : 0.09600         
         Detection Rate : 0.08267         
   Detection Prevalence : 0.08667         
      Balanced Accuracy : 0.92834         
                                          
       'Positive' Class : Yes             
                                          

### Support Vector Machine

In [74]:
set.seed(1234)
svm_fit <- train(PersonalLoan ~ ., 
                 data = train, 
                 method = "svmRadial", 
                 trControl = fit_ctrl, 
                 preProc = c("center", "scale"),
                 tuneLength = 8,
                 metric = "ROC", 
                 na.action = na.exclude)

“Variable(s) `' constant. Cannot scale data.”

In [75]:
svm_fit

Support Vector Machines with Radial Basis Function Kernel 

3500 samples
  12 predictor
   2 classes: 'No', 'Yes' 

Pre-processing: centered (478), scaled (478) 
Resampling: Cross-Validated (10 fold, repeated 5 times) 
Summary of sample sizes: 3150, 3150, 3151, 3151, 3149, 3149, ... 
Resampling results across tuning parameters:

  C      ROC        Sens       Spec     
   0.25  0.9063740  0.9872332  0.3162210
   0.50  0.9063754  0.9868534  0.3192157
   1.00  0.9064270  0.9866012  0.3167558
   2.00  0.9094990  0.9892551  0.3012121
   4.00  0.9141786  0.9894434  0.3578253
   8.00  0.9156416  0.9889991  0.3921747
  16.00  0.9143107  0.9870407  0.4213725
  32.00  0.9114441  0.9838821  0.4220856

Tuning parameter 'sigma' was held constant at a value of 0.001938687
ROC was used to select the optimal model using  the largest value.
The final values used for the model were sigma = 0.001938687 and C = 8. 
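The SVM summary reports 478 pre-processed columns although the data has only 12 predictors. A plausible explanation is that the formula interface expands every factor into indicator columns, so the 467-level `ZIP` factor dominates the feature count. The base R sketch below shows the expansion on a hypothetical toy data frame and then repeats the arithmetic with the level counts from the `str(df)` output.

```r
# Toy data frame (hypothetical values) with one 3-level factor
toy <- data.frame(
  Income    = c(49, 11, 100, 45),
  Education = factor(c("1", "2", "3", "2"))
)

# model.matrix() expands a k-level factor into k - 1 indicator columns
x <- model.matrix(~ ., data = toy)[, -1]  # drop the intercept column
print(colnames(x))

# Same arithmetic applied to the level counts shown by str(df)
n_numeric <- 6            # Age, Experience, Income, Family, CCAvg, Mortgage
n_zip     <- 467 - 1      # ZIP
n_edu     <- 3 - 1        # Education
n_binary  <- 4 * (2 - 1)  # SecuritiesAccount, CDAccount, Online, CreditCard
n_columns <- n_numeric + n_zip + n_edu + n_binary
print(n_columns)
```

Many of the `ZIP` indicators are nonzero for only a handful of customers, so some of them are likely constant within a resample, which would be consistent with the "Cannot scale data" warning.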

In [77]:
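# predict class probabilities on the test set and classify at a 0.5 cut-off
svm_pred <- predict(svm_fit, test, type = "prob")
svm_pred$pred <- factor(ifelse(svm_pred$Yes >= .5, "Yes", "No"),
                        levels = levels(test$PersonalLoan))
svm_pred <- cbind(svm_pred, actual = test$PersonalLoan)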

In [78]:
confusionMatrix(data = svm_pred$pred, 
                reference = svm_pred$actual,
                positive = levels(test$PersonalLoan)[2])

Confusion Matrix and Statistics

          Reference
Prediction   No  Yes
       No  1333   81
       Yes   23   63
                                         
               Accuracy : 0.9307         
                 95% CI : (0.9166, 0.943)
    No Information Rate : 0.904          
    P-Value [Acc > NIR] : 0.0001526      
                                         
                  Kappa : 0.5129         
 Mcnemar's Test P-Value : 2.28e-08       
                                         
            Sensitivity : 0.43750        
            Specificity : 0.98304        
         Pos Pred Value : 0.73256        
         Neg Pred Value : 0.94272        
             Prevalence : 0.09600        
         Detection Rate : 0.04200        
   Detection Prevalence : 0.05733        
      Balanced Accuracy : 0.71027        
                                         
       'Positive' Class : Yes            
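For a side-by-side view, the base R sketch below collects the test-set counts from the three confusion matrices in this notebook and recomputes sensitivity, specificity, and balanced accuracy for each model. The `results` data frame is an illustrative construction, not part of the original analysis.

```r
# Test-set counts copied from the three confusion matrices in this notebook
results <- data.frame(
  model = c("rpart", "treebag", "svmRadial"),
  tp = c(96, 124, 63),      # predicted Yes, actual Yes
  fn = c(48, 20, 81),       # predicted No,  actual Yes
  fp = c(4, 6, 23),         # predicted Yes, actual No
  tn = c(1352, 1350, 1333)  # predicted No,  actual No
)

results$sensitivity <- with(results, tp / (tp + fn))
results$specificity <- with(results, tn / (tn + fp))
results$balanced_accuracy <- with(results, (sensitivity + specificity) / 2)

# Order models from best to worst balanced accuracy
results <- results[order(-results$balanced_accuracy), ]
print(results[, c("model", "sensitivity", "specificity", "balanced_accuracy")],
      digits = 4, row.names = FALSE)
```

On this test partition the bagged trees rank first and the radial SVM last; the gap is almost entirely in sensitivity, since all three models have specificity above 0.98.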
                                         