## Supervised Machine Learning in R

To run the code in this tutorial and complete your HW/lab assignments, you need to install and load the following packages:
- kernlab
- caret

In [9]:
#install.packages(c("kernlab","caret"))
library(kernlab)
library(caret)

Loading required package: lattice


Attaching package: ‘caret’


The following object is masked from ‘package:purrr’:

    lift




This tutorial will introduce you to **supervised machine learning in R**. 
Although the examples below are based on the dataset used in your HW assignment, you can use the same logic to complete your lab assignment. Please email me if you have any questions.  

### Supervised Machine Learning in R (using support vector machines, aka SVM)

Let's store the dataset available at: https://ist387.s3.us-east-2.amazonaws.com/data/GermanCredit.csv in a dataframe called **credit** using the **read_csv()** function from the **tidyverse** package:

In [2]:
library(tidyverse)
url="https://ist387.s3.us-east-2.amazonaws.com/data/GermanCredit.csv"
credit <- read_csv(url)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.4     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.4.0     ✔ forcats 0.5.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ ggplot2::alpha() masks kernlab::alpha()
✖ purrr::cross()   masks kernlab::cross()
✖ dplyr::filter()  masks stats::filter()
✖ dplyr::lag()     masks stats::lag()


── Column specification ────────────────────────────────────────────────────────
cols(
  .default = col_character(),
  duration = col_double()

It's always a good idea to inspect the dataframe we'll be working with and get an idea about the variables stored in it:

In [3]:
head(credit)

| status | duration | credit_history | purpose | amount | savings | employment_duration | installment_rate | personal_status_sex | other_debtors | ⋯ | property | age | other_installment_plans | housing | number_credits | job | people_liable | telephone | foreign_worker | credit_risk |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | ⋯ | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` | `<chr>` | `<dbl>` | `<chr>` | `<chr>` | `<dbl>` |
| ... < 100 DM | 6 | critical account/other credits existing | domestic appliances | 1169 | unknown/no savings account | ... >= 7 years | 4 | male : single | none | ⋯ | real estate | 67 | none | own | 2 | skilled employee/official | 1 | yes | yes | 1 |
| 0 <= ... < 200 DM | 48 | existing credits paid back duly till now | domestic appliances | 5951 | ... < 100 DM | 1 <= ... < 4 years | 2 | female : divorced/separated/married | none | ⋯ | real estate | 22 | none | own | 1 | skilled employee/official | 1 | no | yes | 0 |
| no checking account | 12 | critical account/other credits existing | retraining | 2096 | ... < 100 DM | 4 <= ... < 7 years | 2 | male : single | none | ⋯ | real estate | 49 | none | own | 1 | unskilled - resident | 2 | no | yes | 1 |
| ... < 100 DM | 42 | existing credits paid back duly till now | radio/television | 7882 | ... < 100 DM | 4 <= ... < 7 years | 2 | male : single | guarantor | ⋯ | building society savings agreement/life insurance | 45 | none | for free | 1 | skilled employee/official | 2 | no | yes | 1 |
| ... < 100 DM | 24 | delay in paying off in the past | car (new) | 4870 | ... < 100 DM | 1 <= ... < 4 years | 3 | male : single | none | ⋯ | unknown/no property | 53 | none | for free | 2 | skilled employee/official | 2 | no | yes | 0 |
| no checking account | 36 | existing credits paid back duly till now | retraining | 9055 | unknown/no savings account | 1 <= ... < 4 years | 2 | male : single | none | ⋯ | unknown/no property | 35 | none | for free | 1 | unskilled - resident | 2 | yes | yes | 1 |


We will use this banking dataset to train an SVM model that classifies potential borrowers into 2 groups of credit risk: **reliable borrowers** and **borrowers posing a risk**. You can learn more about the variables in the dataset here:
https://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29

SVM needs numerical predictors (known in machine learning as **features**) to make a prediction about good vs. bad credit. Therefore, let's create a **subset** of **credit** which contains only the variable we are trying to predict and the numerical variables in the data:

In [4]:
cred <- data.frame(duration=credit$duration, 
                   amount=credit$amount, 
                   installment_rate=credit$installment_rate, 
                   present_residence=credit$present_residence, 
                   age=credit$age, 
                   credit_history=credit$number_credits, # note: this column holds number_credits under the name credit_history
                   people_liable=credit$people_liable, 
                   credit_risk=as.factor(credit$credit_risk))

Let's take a look at this new dataframe, **cred**:

In [5]:
head(cred)

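|   | duration | amount | installment_rate | present_residence | age | credit_history | people_liable | credit_risk |
|---|---|---|---|---|---|---|---|---|
|   | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 1 | 6 | 1169 | 4 | 4 | 67 | 2 | 1 | 1 |
| 2 | 48 | 5951 | 2 | 2 | 22 | 1 | 1 | 0 |
| 3 | 12 | 2096 | 2 | 3 | 49 | 1 | 2 | 1 |
| 4 | 42 | 7882 | 2 | 4 | 45 | 1 | 2 | 1 |
| 5 | 24 | 4870 | 3 | 4 | 53 | 2 | 2 | 0 |
| 6 | 36 | 9055 | 2 | 4 | 35 | 1 | 2 | 1 |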


We can see that all the variables except **credit_risk** - the one we are trying to predict - contain numbers.
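One quick way to confirm column types without reading the table by eye is to ask R for the class of every column. The sketch below uses a small made-up data frame (`toy` is a hypothetical stand-in with invented values, not the real **cred** data); the same two calls should work on **cred**.

```r
# Hypothetical toy data frame standing in for cred (values are made up)
toy <- data.frame(duration    = c(6, 48, 12),
                  amount      = c(1169, 5951, 2096),
                  credit_risk = as.factor(c(1, 0, 1)))

# Class of each column: "numeric" for the predictors, "factor" for the outcome
col_classes <- sapply(toy, class)
print(col_classes)

# Names of the columns that R treats as numeric
numeric_cols <- names(toy)[sapply(toy, is.numeric)]
print(numeric_cols)
```

A factor is not numeric as far as `is.numeric()` is concerned, so the outcome column drops out of `numeric_cols` automatically.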

Our next task is to create the so-called train-test split, i.e. to split **cred** into a **train set** and a **test set**. We randomly select 70 percent of the row indices of **cred** and use them to create the **train set**; the remaining 30 percent of the rows form the **test set**:

In [10]:
trainList <- createDataPartition(y=cred$credit_risk, p=.70, list=FALSE)
trainSet <- cred[trainList,]
testSet <- cred[-trainList,]

We can use the **dim()** function on each new dataset to confirm that we got a roughly 70:30 split:

In [11]:
dim(trainSet)
dim(testSet)
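When the outcome is a factor, **createDataPartition()** samples within each class, so both sets keep roughly the same class balance. The base-R sketch below imitates that idea on a made-up outcome vector. It is a simplified illustration under stated assumptions (the vector `y`, its 70/30 class mix, and the seed are all invented), not caret's actual implementation.

```r
set.seed(1)  # arbitrary seed so the sketch is repeatable

# Hypothetical outcome: 70 cases of class "1" and 30 of class "0" (made up)
y <- factor(c(rep(1, 70), rep(0, 30)))

# Sample 70% of the row indices within each class, then combine them
idx_by_class <- split(seq_along(y), y)
train_idx <- unlist(lapply(idx_by_class,
                           function(i) sample(i, size = round(0.7 * length(i)))))

train_y <- y[train_idx]
test_y  <- y[-train_idx]

# Sizes of the two sets
length(train_y)
length(test_y)

# Class proportions are the same in both sets
prop.table(table(train_y))
prop.table(table(test_y))
```

On the real data, `prop.table(table(trainSet$credit_risk))` and `prop.table(table(testSet$credit_risk))` give the same kind of check.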

We are now ready to **train** an SVM model by exposing it to the **train set** and letting it **learn** from its data. Remember to install and load the **kernlab** and **caret** packages if you haven't done so already:

In [12]:
ksvm(credit_risk ~ ., data=trainSet, kernel= "rbfdot", kpar = "automatic", 
     C = 5, cross = 3, prob.model = TRUE)

Support Vector Machine object of class "ksvm" 

SV type: C-svc  (classification) 
 parameter : cost C = 5 

Gaussian Radial Basis kernel function. 
 Hyperparameter : sigma =  0.127447825748301 

Number of Support Vectors : 439 

Objective Function Value : -1703.19 
Training error : 0.211429 
Cross validation error : 0.282815 
Probability model included. 

The most important arguments of the **ksvm()** function, which creates the SVM model, are **C** and **cross**. As we mentioned in the Week 12 slides, **C** is a **cost parameter** that determines how heavily the model is penalized for each misclassification (an **error**) on the training data. **cross** sets the number of folds used for cross-validation, so **cross = 3** means 3-fold cross-validation. You can learn more about cross-validation here: https://towardsdatascience.com/cross-validation-explained-evaluating-estimator-performance-e51e5430ff85
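To make the idea of k-fold cross-validation concrete, here is a hand-rolled 3-fold version in base R. It is only a sketch: the data are simulated, and a logistic regression (`glm()`) stands in for the SVM so that the example needs no extra packages. The variable names (`dat`, `fold`, `fold_error`) and the seed are assumptions made for illustration.

```r
set.seed(42)  # arbitrary seed

# Made-up data: one numeric predictor and a binary outcome
n <- 150
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(1.5 * x))
dat <- data.frame(x = x, y = y)

k <- 3
# Randomly assign each row to one of k equally sized folds
fold <- sample(rep(1:k, length.out = n))

fold_error <- numeric(k)
for (i in 1:k) {
  train    <- dat[fold != i, ]   # fit on the other k - 1 folds
  held_out <- dat[fold == i, ]   # evaluate on the fold left out
  fit  <- glm(y ~ x, data = train, family = binomial)
  prob <- predict(fit, newdata = held_out, type = "response")
  pred <- as.integer(prob > 0.5)
  fold_error[i] <- mean(pred != held_out$y)  # share of wrong predictions
}

# The cross-validation error is the average error over the k held-out folds
cv_error <- mean(fold_error)
cv_error
```

Every row is used for evaluation exactly once, and never by a model that saw it during fitting, which is why the cross-validation error is usually higher than the training error.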

The model output reports several SVM-related statistics. The one we care about most is the **cross validation error**, because it estimates how well the model will do on data it has not seen: the lower, the better. With this dataset, it is around 30 percent. To use the model later, especially for classifying new cases, we need to store it in a variable. Let's store it in a variable called **svmOut**. Your results may differ slightly from the model we ran above because the train-test split and the cross-validation folds are chosen at random:

In [13]:
svmOut <- ksvm(credit_risk ~ ., data=trainSet, kernel= "rbfdot", kpar = "automatic", 
               C = 5, cross = 3, prob.model = TRUE)

svmOut

Support Vector Machine object of class "ksvm" 

SV type: C-svc  (classification) 
 parameter : cost C = 5 

Gaussian Radial Basis kernel function. 
 Hyperparameter : sigma =  0.132716033630377 

Number of Support Vectors : 441 

Objective Function Value : -1689.346 
Training error : 0.21 
Cross validation error : 0.294303 
Probability model included. 
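The two runs above report slightly different numbers because each run draws new random cross-validation folds. If you want repeatable results, a common approach is to fix R's random seed first. The sketch below shows the effect on a plain `sample()` call; the seed value 387 is an arbitrary choice. Calling `set.seed()` before **createDataPartition()** and **ksvm()** should have the same stabilizing effect, assuming both draw from R's random number generator.

```r
set.seed(387)                     # any fixed integer works
first_draw <- sample(1:1000, 5)

set.seed(387)                     # resetting the seed replays the same draws
second_draw <- sample(1:1000, 5)

identical(first_draw, second_draw)  # TRUE
```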

Now comes the interesting part. So far, our model was simply learning from the **train set**: it was shown the numeric predictors together with the **correct answer** in the form of the outcome variable, **credit_risk**, so it was able to identify patterns linking the predictors to the outcome. The real test comes when we show it only the predictors and force it to guess the outcome. We do this with the **predict()** function and the **test set**:

In [14]:
svmPred <- predict(svmOut, # use the trained model "svmOut" to predict 
                   testSet, # generate predictions for the rows of testSet
                   type = "response" # return the predicted class label (0 or 1) for each row
)

As you can see, we've saved the predictions in a vector called **svmPred**. Let's take a look at it:

In [15]:
head(svmPred)

It contains just 0s and 1s because this is how our outcome variable, **credit_risk**, is coded. For each observation in the **test set**, the model made a prediction (good credit or bad credit) and saved it in **svmPred**. To see how well it performed, we use the so-called **confusion matrix**, which compares the answers the model gave (the columns) against the actual answers we have in the **test set** (the rows):

In [16]:
table(testSet$credit_risk,svmPred)

   svmPred
      0   1
  0  18  72
  1  22 188
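The overall accuracy can be read straight off this table: the diagonal cells are the cases where the prediction matches the actual value. The sketch below re-enters the four counts by hand (the object name `cm` is just for illustration) and also computes what you would score by always predicting the more common class.

```r
# Counts copied from the confusion matrix above
# rows = actual credit_risk, columns = predicted value (svmPred)
cm <- matrix(c(18, 22, 72, 188), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

correct  <- sum(diag(cm))   # 18 + 188 = 206 correct predictions
total    <- sum(cm)         # 300 test cases
accuracy <- correct / total
round(accuracy, 4)          # 0.6867

# Baseline: always predict the more common actual class
baseline <- max(rowSums(cm)) / total
baseline                    # 0.7
```

With these particular counts, the model's accuracy is slightly below the 0.7 you would get by predicting 1 for everyone, which is a useful reminder to compare accuracy against a simple baseline.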

I will explain how the confusion matrix works in class, but if you forget, check out this brief article which explains it nicely: https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62

You can also generate the same confusion matrix, along with a set of summary statistics, using the **confusionMatrix()** function from the **caret** package, which also needs the **e1071** package to be installed. You should get the same counts:

In [18]:
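install.packages('e1071')
library(e1071)
# confusionMatrix() treats its first argument as the predictions and its second as
# the reference (true) labels. As called here the two are swapped, so the rows
# labeled "Prediction" in the output hold the actual credit_risk values.
confusionMatrix(testSet$credit_risk, svmPred)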

Updating HTML index of packages in '.Library'

Making 'packages.html' ...
 done



Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0  18  72
         1  22 188
                                          
               Accuracy : 0.6867          
                 95% CI : (0.6309, 0.7387)
    No Information Rate : 0.8667          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.1132          
                                          
 Mcnemar's Test P-Value : 4.327e-07       
                                          
            Sensitivity : 0.4500          
            Specificity : 0.7231          
         Pos Pred Value : 0.2000          
         Neg Pred Value : 0.8952          
             Prevalence : 0.1333          
         Detection Rate : 0.0600          
   Detection Prevalence : 0.3000          
      Balanced Accuracy : 0.5865          
                                          
       'Positive' Class : 0               
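Sensitivity and specificity depend on which vector is treated as the reference. **confusionMatrix()** takes the predictions as its first argument and the reference (true) labels as its second, so in the call above **svmPred** plays the role of the reference. The sketch below recomputes both statistics by hand from the four counts, once the way they are reported above and once with the actual labels as the reference. The object names are illustrative only, and class 0 is the positive class, as in the output.

```r
# rows = actual credit_risk, columns = predicted value (svmPred)
cm <- matrix(c(18, 22, 72, 188), nrow = 2,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

# As reported above: svmPred is the reference, so the denominators
# are the column totals (all cases predicted 0, all cases predicted 1)
sens_reported <- cm["0", "0"] / sum(cm[, "0"])   # 18 / 40
spec_reported <- cm["1", "1"] / sum(cm[, "1"])   # 188 / 260

# With the actual labels as the reference, the denominators
# are the row totals (all actual 0s, all actual 1s)
sens_actual <- cm["0", "0"] / sum(cm["0", ])     # 18 / 90
spec_actual <- cm["1", "1"] / sum(cm["1", ])     # 188 / 210

round(c(sens_reported, spec_reported, sens_actual, spec_actual), 4)
# 0.4500 0.7231 0.2000 0.8952
```

Accuracy is the same in both orientations because it only uses the diagonal and the total. To have caret report the second pair directly, the call would be `confusionMatrix(svmPred, testSet$credit_risk)`.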
                              