Exercise 7 - Advanced Support Vector Machines
===

Support vector machines (SVMs) let us predict categories. In this exercise, we will build an SVM, paying attention to the key steps as we go: formatting the data correctly, splitting it into training and test sets, training an SVM model on the training set, and then evaluating and visualising the model on the test set.

We will be looking at __prions__: misfolded proteins that are associated with several fatal neurodegenerative diseases (kind of like Daleks, if you have seen Doctor Who). Looking at examples of protein mass and weight, we will build a predictive model to detect prions in blood samples.

#### Run the code below to load the required libraries for this exercise.

In [None]:
# Install and load the `tidyverse`, `e1071`, and `magrittr` packages
suppressMessages(install.packages("tidyverse"))
suppressMessages(library("tidyverse"))
suppressMessages(install.packages("e1071"))
suppressMessages(library("e1071"))
suppressMessages(install.packages("magrittr"))
suppressMessages(library("magrittr"))

Step 1
---

With the required R packages loaded, let's load the prion data for this exercise.

**In the code below, complete the data loading step, by replacing `<prionDataset>` with `prion_data`, and running the code.**

In [None]:
###
# REPLACE <prionDataset> WITH prion_data
###
<prionDataset> <- read.csv("Data/PrionData.csv")
###

# Check the structure of `prion_data`
str(prion_data)
head(prion_data)

It appears that we have an extra column `X` in `prion_data` that contains the row number. By default, R has labelled the column `X` because the input didn't have a column name (it was blank). This behaviour happens regularly when exporting data sets from a program like Microsoft Excel and then importing them into R.
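To see where the `X` column comes from, here is a minimal sketch using a made-up, two-row CSV (the text in `csv_text` is an assumption for illustration, not the real `PrionData.csv`). It shows how `read.csv` names a column whose header is blank, and how the `row.names` argument offers another way to handle it.

```r
# A tiny CSV whose first header field is blank, as happens when row names are exported
csv_text <- ",mass,weight\n1,10.5,3.2\n2,11.0,3.9\n"

# By default, the unnamed first column is given a syntactically valid name
demo <- read.csv(text = csv_text)
print(names(demo))

# Alternatively, treat the first column as row names so it never becomes a variable
demo_rownames <- read.csv(text = csv_text, row.names = 1)
print(names(demo_rownames))
```

In this exercise we keep the default behaviour and remove the column afterwards, which makes the tidying step explicit.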

Let's get rid of the first column from `prion_data`, and then check that it has been successfully removed. We will use the `select` function from the `dplyr` package together with the `-` symbol to "minus" the `X` column from our dataset.

> **N.B. We have used a different assignment symbol `%<>%` from the `magrittr` package in the code below. The `magrittr` assignment symbol `%<>%` is a combination of the `magrittr` pipe symbol `%>%` and the base R assignment symbol `<-`. It takes the variable on the left hand side of the `%<>%` symbol, and updates the value of the variable with the result of the right hand side. So the object on the left hand side acts as both the initial value and the resulting value.**
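As a small illustration of the note above, the sketch below applies both forms to a made-up numeric vector (`values`, `long_form`, and `short_form` are hypothetical names used only here) and checks that they give the same result.

```r
library(magrittr)

values <- c(4, 9, 16)

# Long form: pipe the variable into a function, then assign the result back
long_form <- values
long_form <- long_form %>% sqrt()

# Short form: %<>% pipes and assigns in a single step
short_form <- values
short_form %<>% sqrt()

# Both variables now hold the same updated values
print(identical(long_form, short_form))
```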

#### Replace `<removeColumn>` with `-X` to remove the excess column X, then run the code.

In [None]:
###
# REPLACE <removeColumn> WITH -X
###
prion_data %<>% select(<removeColumn>)
###
str(prion_data)
head(prion_data)

# Check frequency of `prion_status` in `prion_data`
prion_data %>%
group_by(prion_status) %>%
summarise(n = n()) %>% 
mutate(freq = n/sum(n))

Excellent, we have successfully removed column `X` from `prion_data`!

Now, looking at the output of `str` and `head`, we can observe that `prion_data` is a `data.frame` that contains 485 observations and 3 variables stored in the following columns:

* `mass` is the first *feature*;
* `weight` is the second *feature*;
* `prion_status` is the *label* (or category).

Of the 485 observations, 375 (77.32%) are non-prions, and 110 (22.68%) are prions.
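The percentages can be checked directly from the counts. This is a minimal sketch in base R; the element names `non_prion` and `prion` are labels chosen for illustration and may not match the values stored in `prion_status`.

```r
# Class counts reported above
counts <- c(non_prion = 375, prion = 110)

# Total number of observations
total <- sum(counts)

# Share of each class as a percentage, rounded to two decimal places
percentages <- round(100 * counts / total, 2)
print(total)
print(percentages)
```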

Step 2
---

Let's graph `prion_data` to better understand the features and labels.

**In the cell below replace:**

**1. `<xData>` with `mass`**

**2. `<yData>` with `weight`**

**3. `<colorData>` with `prion_status`**

**Then __run the code__.**

In [None]:
prion_data  %>% 
###
# REPLACE <xData> WITH mass AND <yData> WITH weight AND <colorData> WITH prion_status
###
ggplot(aes(x = <xData> , y = <yData> , colour = <colorData> )) +
###
geom_point() +
ggtitle("Classification plot for prion data") +
# Create labels for x-axis, y-axis, and legend
labs(x = "Mass", y = "Weight", colour = "Prion status") +
# Align title to centre
theme(plot.title = element_text(hjust = 0.5))

Step 3
---

To create an SVM model, let's split our data into training and test sets. We'll start by checking the total number of instances in our data set. If we go back to the output from `str(prion_data)` in Step 1, we have 485 observations and 3 variables.

So, let's use 400 examples for our `training` set, and the remainder for our `test` set.

We will use the `slice` function to select the first 400 rows from `prion_data`.
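Taking the first 400 rows works well when the rows are in no particular order. If a file happened to be sorted (for example, by label), a random split would be a common alternative. The sketch below shows one way to do that in base R on a made-up data frame; `toy`, `train_rows`, and the 8/2 split are illustrative assumptions and not part of this exercise.

```r
# Fix the random seed so the split is reproducible
set.seed(42)

# A made-up data set with 10 rows
toy <- data.frame(id = 1:10, value = rnorm(10))

# Randomly choose 8 row indices for training
train_rows <- sample(nrow(toy), size = 8)

# Training set: the sampled rows; test set: everything else
toy_train <- toy[train_rows, ]
toy_test <- toy[-train_rows, ]

print(nrow(toy_train))
print(nrow(toy_test))
```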

#### Replace `<selectData>` with `1:400`, and run the code.

In [None]:
###
# REPLACE <selectData> WITH 1:400
###
train_prion <- slice(prion_data, <selectData>)
###
str(train_prion)

# Check percentage of samples that are prions
train_prion %>%
group_by(prion_status) %>%
summarise(n = n()) %>% 
mutate(freq = n/sum(n))

# Create test set using the remaining examples
test_prion <- slice(prion_data, 401:n())
str(test_prion)

# Check percentage of samples that are prions
test_prion %>%
group_by(prion_status) %>%
summarise(n = n()) %>% 
mutate(freq = n/sum(n))

Well done! Let's look at a summary of our training data to get a better idea of what we're dealing with.

#### Replace `<trainDataset>` with `train_prion` and run the code.

In [None]:
###
# REPLACE <trainDataset> WITH train_prion
###
summary(<trainDataset>)
###

Using the `summary` function, we observe our training data contains 314 non-prions and 86 prions out of a total of 400 observations. This looks right, because the scatter plot we created in Step 2 showed us the majority of observations have 'non-prion' status.

Let's take a look at `test_prion` too, using the `summary` function again.

#### Replace `<testDataset>` with `test_prion` and run the code.

In [None]:
###
# REPLACE <testDataset> WITH test_prion
###
summary(<testDataset>)
###

Looking good! Alright, now to make a support vector machine.

Step 4
---

Below we will make an SVM similar to the previous exercise. Remember the syntax for SVMs using the `e1071::svm` function:

**`svm_model <- svm(x = x, y = y)`**

where `x` represents the features (a matrix), and `y` represents the labels (a factor).

Alternatively, we can use the following syntax for the `svm` function:

**`model <- svm(formula = y ~ x, data = dataset)`**

where `y` represents the labels/categories, and `x` represents the features. Note that if you have multiple features in the data set, you can simply type `.` on the right hand side of the formula to refer to every column in the data set except `y`. Let's try out this syntax using the training data as our input.
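To make the two calling styles concrete, here is a minimal sketch on a made-up data set (`toy`, its `status` column, and the class names `"a"` and `"b"` are assumptions for illustration only). It fits one model with the formula interface and one with the `x`/`y` interface, then predicts with each.

```r
library(e1071)

# Made-up data: two well-separated groups, 20 points each
set.seed(1)
toy <- data.frame(
  mass = c(rnorm(20, mean = 2), rnorm(20, mean = 6)),
  weight = c(rnorm(20, mean = 2), rnorm(20, mean = 6)),
  status = factor(rep(c("a", "b"), each = 20))
)

# Formula interface: label on the left, `.` means "all other columns"
model_formula <- svm(status ~ ., data = toy)

# Matrix interface: features as a matrix, labels as a factor
features <- as.matrix(toy[, c("mass", "weight")])
model_xy <- svm(x = features, y = toy$status)

# Predictions come back as a factor with the same levels as the labels
preds_formula <- predict(model_formula, toy)
preds_xy <- predict(model_xy, features)

# Proportion of training points classified correctly by the formula model
print(mean(preds_formula == toy$status))
```

The formula interface is usually more convenient when the features and labels already live in one data frame, which is the case in this exercise.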

**In the code below, create an SVM model using the `train_prion` data using the `svm` function with the `formula` argument.**

#### Replace `<dataSelection>` with `prion_status ~ .`, then run the code.

Note: the `prion_status` on the left hand side of the formula selects our labels, and the `.` on the right hand side of the formula selects our features. In this case, the `.` selects all the features in our dataset `train_prion`.

In [None]:
###
# REPLACE <dataSelection> WITH prion_status ~ .
###
SVM_Model <- svm(formula = <dataSelection> , data = train_prion)
###

print("Model ready!")

Well done! We've made an SVM model using our training set `train_prion`.

Step 5
---

Let's create some custom functions to graph and evaluate SVM models. We will use these functions throughout the remainder of this exercise. You do not need to edit the code block below.

**Run the code below**

In [None]:
# Run this box to prepare functions for later use

# Create a custom function named `Graph_SVM` to plot an SVM model

Graph_SVM <- function(model, data_set){
    grid <- expand.grid("mass" = seq(min(data_set$mass), max(data_set$mass), length.out = 100),
                        "weight" = seq(min(data_set$weight), max(data_set$weight), length.out = 100))
    preds <- predict(model, grid)
    df <- data.frame(grid, preds)
    ggplot() +
    geom_tile(data = df, aes(x = mass, y = weight, fill = preds)) +
    geom_point(data = data_set, aes(x = mass, y = weight, shape = prion_status, 
                                    colour = prion_status), 
               alpha = 0.75) +
    scale_colour_manual(values = c("grey10", "grey50")) +
    labs(title = paste("SVM model prediction"), x = "Mass", y = "Weight",
         fill = "Prediction", shape = "Prion status", colour = "Prion status") +
    theme(plot.title = element_text(hjust = 0.5))
    }

# Create another custom function named `Evaluate_SVM` to evaluate the SVM model, print results to screen,
# and run the `Graph_SVM` custom function
Evaluate_SVM <- function(model, data_set){
    predictions <- predict(model, data_set)
    total <- 0
    for(i in 1:nrow(data_set)){
        if(toString(predictions[i]) == data_set[i, "prion_status"]){
            total <- total + 1
        }
    }
    # Print results to screen
    print("SVM Model Evaluation")
    print(paste0("Model name: ", deparse(substitute(model))))
    print(paste0("Dataset: ", deparse(substitute(data_set))))
    print(paste0("Accuracy: ", total/nrow(data_set)*100, "%"))
    print(paste0("Number of samples: ", nrow(data_set)))
    
    # Call our custom function for graphing SVM model
    Graph_SVM(model, data_set)
}

print("Custom function ready!")

Excellent! Now that we have created the custom function `Evaluate_SVM` (which incorporates the `Graph_SVM` function) let's evaluate our SVM model on the training data. 

In the code below, we will change the inputs to the `Evaluate_SVM` function, where the first argument is the SVM model we will evaluate, and the second argument is the dataset we will evaluate the SVM model with.
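The accuracy that `Evaluate_SVM` reports is the percentage of predictions that match the true labels. As a rough sketch of the same calculation without a loop, the example below uses two short made-up factors (`actual` and `predicted`, with hypothetical label values) and also builds a confusion matrix, which shows where the mistakes fall.

```r
# Made-up true labels and predictions for five samples
actual <- factor(c("prion", "non-prion", "non-prion", "prion", "non-prion"))
predicted <- factor(c("prion", "non-prion", "prion", "prion", "non-prion"),
                    levels = levels(actual))

# Accuracy: percentage of predictions that equal the true label
accuracy <- mean(predicted == actual) * 100
print(accuracy)

# Confusion matrix: rows are predictions, columns are true labels
confusion <- table(Predicted = predicted, Actual = actual)
print(confusion)
```

A confusion matrix is especially useful when one class is much more common than the other, because a high accuracy can hide poor performance on the rarer class.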

**In the cell below replace:**

**1. `<svmModel>` with `SVM_Model`**

**2. `<dataset>` with `train_prion`**

**Then __run the code__.**

In [None]:
###
# REPLACE <svmModel> WITH SVM_Model AND <dataset> WITH train_prion
###
Evaluate_SVM(<svmModel>, <dataset>)
###

Step 6
---

The SVM has performed reasonably well at separating our training data into two classes. Now let's take a look at our test set.

In the code below, we will use our custom function `Evaluate_SVM` to evaluate `SVM_Model` on the test set.

**In the cell below replace:**

**1. `<svmModel>` with `SVM_Model`**

**2. `<dataset>` with `test_prion`**

**Then __run the code__.**

In [None]:
###
# REPLACE <svmModel> WITH SVM_Model AND <dataset> WITH test_prion
###
Evaluate_SVM(<svmModel>, <dataset>)
###

That's a good result. 

Conclusion
---

Well done! We've taken a data set, tidied it, prepared it into training and test sets, created an SVM based on the training set, and evaluated the SVM model using the test set.

You can go back to the course now, or you can try using different kernels with your SVM below.

OPTIONAL: Step 7
---

Want to have a play around with different kernels for your SVM models? It's really easy!

The default kernel is a radial basis kernel. But there are a few more you can choose from: `linear`, `polynomial`, and `sigmoid`. Let's try them out.

If you want to use a linear kernel, all you need to do is add `kernel = "linear"` to your model. Like this:

`SVM_Model <- svm(formula = y ~ x, data = dataset, kernel = "linear")`
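As a rough sketch of how the `kernel` argument can be varied, the example below fits one model per kernel on a made-up data set and records the training accuracy of each. The data frame `toy`, its `status` column, and the class names are assumptions for illustration; the results say nothing about the prion data.

```r
library(e1071)

# Made-up data: two overlapping groups, 30 points each
set.seed(2)
toy <- data.frame(
  mass = c(rnorm(30, mean = 2), rnorm(30, mean = 4)),
  weight = c(rnorm(30, mean = 2), rnorm(30, mean = 4)),
  status = factor(rep(c("a", "b"), each = 30))
)

# The four kernels accepted by the `kernel` argument of `svm`
kernels <- c("linear", "polynomial", "radial", "sigmoid")

# Fit one model per kernel and compute the proportion of correct predictions
accuracy <- sapply(kernels, function(k) {
  model <- svm(status ~ ., data = toy, kernel = k)
  mean(predict(model, toy) == toy$status)
})

print(accuracy)
```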

Give it a go with all the different kernels below. The first kernel, `linear`, has been done for you.

**Run the code below**

In [None]:
# Run this box to make a linear SVM

# Make a linear SVM model
SVM_Model_Linear <- svm(prion_status ~ . , data = train_prion, kernel = "linear")
print("Model ready")

Now we have created the linear SVM model, let's evaluate it on our training and test sets using our custom function we created earlier, `Evaluate_SVM`. Remember the inputs to `Evaluate_SVM` are the SVM model followed by the data you wish to evaluate the model on.

In the code blocks below, we will change the inputs to our `Evaluate_SVM` function to the appropriate variable names to evaluate the linear SVM model on the training and test sets.

**In the cell below replace:**

**1. `<svmModel>` with `SVM_Model_Linear`**

**2. `<dataset>` with `train_prion`**

**Then __run the code__.**

In [None]:
# Evaluate linear SVM model on training set

###
# REPLACE <svmModel> WITH SVM_Model_Linear AND <dataset> WITH train_prion
###
Evaluate_SVM(<svmModel>, <dataset>)
###

And now for the test set.

**In the cell below replace:**

**1. `<svmModel>` with `SVM_Model_Linear`**

**2. `<dataset>` with `test_prion`**

**Then __run the code__.**

In [None]:
# Evaluate linear SVM model on test set

###
# REPLACE <svmModel> WITH SVM_Model_Linear AND <dataset> WITH test_prion
###
Evaluate_SVM(<svmModel>, <dataset>)
###

You can see that the decision boundary is a straight line! Compare the linear SVM model results to the radial SVM model results to see the difference for yourself!

## Now let's try a sigmoid kernel.

**In the cell below replace:**

**1. `<kernelSelection>` with `"sigmoid"`**

**2. `<svmModel>` with `SVM_Model_Sigmoid`**

**3. `<dataset>` with `train_prion`**

**Then __run the code__.**

In [None]:
###
# REPLACE <kernelSelection> WITH "sigmoid" (INCLUDING THE QUOTATION MARKS)
###
SVM_Model_Sigmoid <- svm(prion_status ~ . , data = train_prion, kernel = <kernelSelection>)
###

# Evaluate sigmoid SVM model on training set
###
# REPLACE <svmModel> WITH SVM_Model_Sigmoid AND <dataset> WITH train_prion
###
Evaluate_SVM(<svmModel>, <dataset>)
###

And now for the test set.

**In the cell below replace:**

**1. `<svmModel>` with `SVM_Model_Sigmoid`**

**2. `<dataset>` with `test_prion`**

**Then __run the code__.**

In [None]:
# Evaluate sigmoid SVM model on test set
###
# REPLACE <svmModel> WITH SVM_Model_Sigmoid AND <dataset> WITH test_prion
###
Evaluate_SVM(<svmModel>, <dataset>)
###

Perhaps a sigmoid kernel isn't a good idea for this data set....

## Let's try a polynomial kernel instead.

**In the cell below replace:**

**1. `<kernelSelection>` with `"polynomial"`**

**2. `<svmModel>` with `SVM_Model_Poly`**

**3. `<dataset>` with `train_prion`**

**Then __run the code__.**

In [None]:
###
# REPLACE <kernelSelection> WITH "polynomial" (INCLUDING THE QUOTATION MARKS)
###
SVM_Model_Poly <- svm(prion_status ~ . , data = train_prion, kernel = <kernelSelection>)
###

# Evaluate polynomial SVM model on training set
###
# REPLACE <svmModel> WITH SVM_Model_Poly AND <dataset> WITH train_prion
###
Evaluate_SVM(<svmModel>, <dataset>)
###

And now for the test set.

**In the cell below replace:**

**1. `<svmModel>` with `SVM_Model_Poly`**

**2. `<dataset>` with `test_prion`**

**Then __run the code__.**

In [None]:
# Evaluate polynomial SVM model on test set
###
# REPLACE <svmModel> WITH SVM_Model_Poly AND <dataset> WITH test_prion
###
Evaluate_SVM(<svmModel>, <dataset>)
###

If we were to carry on analysing prions like this, a polynomial SVM looks like a good choice (based on the performance of the different models on `test_prion`). If the data set were more complicated, we could try different degrees for the polynomial to see which one was the most accurate. This is part of __`tuning`__ a model.
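As a rough sketch of what trying different degrees could look like, the example below fits polynomial SVMs with degrees 1 to 4 on a made-up data set and compares their accuracy on held-out rows. Everything here (`toy`, `train`, `valid`, the curved class boundary, and the range of degrees) is an assumption for illustration, not a recommendation for the prion data.

```r
library(e1071)

# Made-up data with a curved boundary between two classes
set.seed(3)
n <- 100
toy <- data.frame(mass = runif(n, -2, 2), weight = runif(n, -2, 2))
toy$status <- factor(ifelse(toy$weight > toy$mass^2 - 1, "a", "b"))

# Hold out the last 30 rows for validation
train <- toy[1:70, ]
valid <- toy[71:100, ]

# Candidate polynomial degrees to compare
degrees <- 1:4

# Fit one polynomial SVM per degree and score it on the held-out rows
valid_accuracy <- sapply(degrees, function(d) {
  model <- svm(status ~ ., data = train, kernel = "polynomial", degree = d)
  mean(predict(model, valid) == valid$status)
})
names(valid_accuracy) <- paste0("degree_", degrees)

# Degree with the highest held-out accuracy (the first one if there is a tie)
best_degree <- degrees[which.max(valid_accuracy)]
print(valid_accuracy)
print(best_degree)
```

Scoring each candidate on rows the model has not seen is what keeps the comparison honest; picking the degree by training accuracy alone would tend to favour needlessly complex models.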

Well done!