Exercise 6 - Support Vector Machines
===

Support vector machines (SVMs) let us predict categories. This exercise will demonstrate a simple support vector machine that can predict a category from a small number of features. 

Our problem is that we want to be able to categorise which type of tree a new specimen belongs to. To do this, we will use leaf and trunk features of three different types of trees to train SVMs.

#### Run the cell below to load the required libraries

In [None]:
# Run this!

# Load required packages
suppressMessages(install.packages("tidyverse"))
suppressMessages(library("tidyverse"))

suppressMessages(install.packages("e1071"))
suppressMessages(library("e1071"))

Step 1
---

First, let's load the raw data to see what features we have.

**In the code below, replace `<dataStructure>` with `str` to view the structure of the raw data. Run the code once complete.**

In [None]:
# Load tree data and save as a new variable named `tree_data`
tree_data <- read.csv("Data/trees.csv")

###
# IN THE CODE BELOW, CHECK THE STRUCTURE OF tree_data BY REPLACING <dataStructure> WITH str
###
<dataStructure>(tree_data)
###

Given the results from `str(tree_data)`, we can see that we have _four features_: 

* `leaf_width`
* `leaf_length`
* `trunk_girth`
* `trunk_height`

We also have _one label_:

* `tree_type`

Let's plot these features using the package `ggplot2`. We will look at the leaf features and trunk features separately using scatter plots, and colour the points based on the label `tree_type`.

### In the cell below replace:
#### 1. `<xData>` with `leaf_width`
#### 2. `<yData>` with `leaf_length`
#### then __run the code__.

In [None]:
# Plot the leaf features, where `x = leaf_width` and `y = leaf_length`
tree_data %>%

###
# REPLACE <xData> WITH leaf_width and <yData> with leaf_length
###
ggplot(aes(x = <xData>, y = <yData>, colour = as.factor(tree_type))) +
geom_point() +
ggtitle("Leaf length vs. leaf width coloured by tree type") +
labs(x = "Leaf width", y = "Leaf length", colour = "Tree type") +
theme(plot.title = element_text(hjust = 0.5))

Based on the features `leaf_width` and `leaf_length`, we can see three groups that separate according to the label `tree_type`: `0`, `1`, and `2` (coloured red, green, and blue, respectively).

Now let's plot the trunk features in a separate plot.

In the code below, we will graph each of the trunk features.

### In the cell below replace:
#### 1. `<xData>` with `trunk_girth`
#### 2. `<yData>` with `trunk_height`
#### then __run the code__.

In [None]:
# Plot the trunk features, where `x = trunk_girth` and `y = trunk_height`
tree_data %>%

###
# REPLACE <xData> WITH trunk_girth and <yData> WITH trunk_height
###
ggplot(aes(x = <xData>, y = <yData>, colour = as.factor(tree_type))) +
###
geom_point() +
ggtitle("Trunk height vs. trunk girth coloured by tree type") +
labs(x = "Trunk girth", y = "Trunk height", colour = "Tree type") +
theme(plot.title = element_text(hjust = 0.5))

Based on the features `trunk_girth` and `trunk_height`, again we can see three groups that separate according to the label `tree_type`: `0`, `1`, and `2` (coloured red, green, and blue, respectively). There are some outliers, but for the most part, the features trunk girth and trunk height allow you to predict tree type.

Now, say we obtain a new tree specimen and we want to figure out the tree type based on its leaf and trunk measurements. We *could* make a rough guess as to which tree type it belongs to based on where the tree data points lie in the two scatter plots we just created. Alternatively, using these same leaf and trunk measurements, SVMs can predict the tree type for us. SVMs will use the features and labels we provide for known tree types to create hyperplanes that separate the tree types. These hyperplanes allow us to predict which tree type a new tree specimen belongs to, given its leaf and trunk measurements.

In the next step, we will use SVMs to help solve this problem.

Step 2
-----

Let's make two SVMs using our data, `tree_data`: one SVM based on the leaf features, and another SVM based on the trunk features.

The syntax for a simple SVM using the package `e1071` is as follows:

`svm_model <- svm(x = x, y = y)`

where `x` represents the features (of class *matrix*), and `y` represents the labels (of class *factor*).
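
To make the syntax above concrete, here is a minimal sketch that uses R's built-in `iris` data set rather than our tree data. The names `toy_svm` and `new_flower` are made up for this illustration, and the measurements of the new flower are invented values.

```r
library(e1071)

# Features as a numeric matrix, labels as a factor
x <- as.matrix(iris[, c("Petal.Length", "Petal.Width")])
y <- iris$Species

# Fit a classification SVM with a radial kernel
toy_svm <- svm(x = x, y = y, type = "C-classification", kernel = "radial")

# A hypothetical new flower (invented measurements);
# the column names match those used for training
new_flower <- matrix(c(1.5, 0.3), nrow = 1,
                     dimnames = list(NULL, colnames(x)))

# Predict the species of the new flower
predict(toy_svm, newdata = new_flower)
```

The same pattern applies to our trees: build a matrix of features, a factor of labels, fit the model, then call `predict` on new measurements.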

> **R uses a variety of data types and data structures to describe different objects. You may have noticed a few types of objects already, including** `data.frame`, `list`, `matrix`, `factor`, **and** atomic vectors (`integer`, `numeric`, `logical`, `character`). **Knowing the structure of your data object is crucial, particularly when you are running functions that require the data object to be of a certain type.**

> **For the `svm` function, we require two types of data structures: a** `matrix` **and a** `factor`**. A** `matrix` **is a two-dimensional data structure (with rows and columns) containing elements that are all of the same type, most commonly numbers (numeric or integer) with which you can perform further mathematical operations. Note that a** `matrix` **is different from a** `data.frame`**, as data frames can contain a mix of column types, e.g. numbers (numeric/integer) alongside non-numeric values (factor/character/logical). A** `factor` **is used to categorize data, where the names of the categories are known as levels. For example, "fruit" could be a factor, with levels including "apples", "bananas", and "oranges", or in our example "tree type" can be a factor, with levels "0", "1", and "2". Also note that the levels of factors can be ordered, e.g. "clothing size" can be a factor, and you can order the levels "small", "medium", and "large", which is important when it comes to plotting the data.**
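
The differences between these data structures can be checked directly in base R. The sketch below uses a small, made-up data frame named `specimens` (the values are invented for illustration only).

```r
# A small, made-up data frame mixing numeric and character columns
specimens <- data.frame(
  leaf_width  = c(5.1, 6.3, 4.8),
  leaf_length = c(10.2, 12.5, 9.9),
  size        = c("small", "large", "medium"),
  stringsAsFactors = FALSE
)

# Keeping only the numeric columns gives a numeric matrix
x <- as.matrix(specimens[, c("leaf_width", "leaf_length")])
is.matrix(x)   # TRUE
is.numeric(x)  # TRUE

# Converting the mixed data frame coerces every element to character,
# because a matrix can only hold one type
mixed <- as.matrix(specimens)
is.character(mixed)  # TRUE

# An ordered factor: the levels have a defined order
size <- factor(specimens$size,
               levels = c("small", "medium", "large"),
               ordered = TRUE)
levels(size)
size[1] < size[2]  # TRUE: "small" comes before "large"
```

This is why we select only the numeric feature columns before calling `as.matrix()`: including a non-numeric column would silently turn the whole matrix into text.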

For our two SVMs, we will need to create the appropriate `x` and `y` variables based on `tree_data`.

The first SVM will be based on the leaf features `leaf_width` and `leaf_length`. We will need to create a new variable that contains only these two features, then convert it from a `data.frame` to a `matrix`, so it can be the input to the `x` argument in the `svm` function. We also need to convert the labels `tree_type` into a factor for the input to the `y` argument of the `svm` function, as it is currently stored as an integer (see the results from `str` above, where `int` stands for `integer`).

The code below creates the appropriate `x` and `y` variables for the leaf features in `tree_data`. 

### In the cell below replace:
#### 1. `<leafFeature1>` with `leaf_width`
#### 2. `<leafFeature2>` with `leaf_length`
#### then __run the code__.

In [None]:
# Create new x variable input to `svm`
x_leaf_data <- tree_data %>% 

###
# REPLACE <leafFeature1> WITH leaf_width AND <leafFeature2> WITH leaf_length
###
select(<leafFeature1>, <leafFeature2>) %>%
###
as.matrix()

# Check x variable input to `svm`
class(x_leaf_data)
head(x_leaf_data)

# Change `tree_data$tree_type` to a factor
tree_data <- tree_data  %>% 
mutate(tree_type = as.factor(tree_type))

# Check y variable input to `svm`
class(tree_data$tree_type)
head(tree_data$tree_type)

Now we can run the function `svm` based on the leaf features stored in the new variable `x_leaf_data`, and the label saved in the variable `tree_data$tree_type`.

### In the cell below replace:
#### 1. `<features>` with `x_leaf_data`
#### 2. `<labels>` with `tree_data$tree_type`
#### then __run the code__.

In [None]:
###
# REPLACE <features> WITH x_leaf_data AND <labels> WITH tree_data$tree_type
###
svm_leaf_data <- svm(x = <features>, y = <labels>, type = "C-classification", kernel = "radial")
###
print("The SVM model named svm_leaf_data is ready.")

To help us view the hyperplanes of the SVM based on the leaf data, we will create a fine grid of data points within the feature space to represent different combinations of leaf width and leaf length, and colour the new data points based on the predictions of `svm_leaf_data`. You do not need to edit this code block.

**Run the code below**

In [None]:
# Run this box to create the grid of datapoints

# Create a fine grid of the feature space
leaf_width <- seq(from = min(tree_data$leaf_width), to = max(tree_data$leaf_width), length = 100)
leaf_length <- seq(from = min(tree_data$leaf_length), to = max(tree_data$leaf_length), length = 100)

fine_grid_leaf <- as.data.frame(expand.grid(leaf_width, leaf_length))
fine_grid_leaf <- fine_grid_leaf %>%
                  dplyr::rename(leaf_width = "Var1", leaf_length = "Var2")

# Check output
head(fine_grid_leaf)

# For every new point in `fine_grid_leaf`, predict its tree type based on the SVM `svm_leaf_data`
fine_grid_leaf$tree_pred <- predict(svm_leaf_data, newdata = fine_grid_leaf)

# Check output
head(fine_grid_leaf)
table(fine_grid_leaf$tree_pred)
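
The grid construction above is easier to follow on a smaller scale. The sketch below builds a 3 x 3 grid from made-up ranges; `width_values`, `length_values`, and `small_grid` are names invented for this illustration. Naming the arguments of `expand.grid` sets the column names directly, so no renaming step is needed here.

```r
# Three values per axis instead of 100, so the whole grid can be printed
width_values  <- seq(from = 1, to = 3, length.out = 3)
length_values <- seq(from = 10, to = 20, length.out = 3)

# Every combination of width and length, returned as a data frame
small_grid <- expand.grid(leaf_width = width_values,
                          leaf_length = length_values)

nrow(small_grid)  # 9 rows: 3 widths x 3 lengths
small_grid
```

With 100 values per axis, as in the cell above, the same call produces 10,000 grid points, which is dense enough to shade the whole feature space.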

Now we can create a scatter plot that contains the new fine grid of points we created above, and also the original tree data to see which group the different trees fall into based on the SVM `svm_leaf_data`. You do not need to edit this code block.

**Run the code below**

In [None]:
# Run this box to generate the scatter plot

# Create scatter plot  with original leaf features layered over the fine grid of data points
ggplot() +
geom_point(data = fine_grid_leaf, aes(x = leaf_width, y = leaf_length, colour = tree_pred), alpha = 0.25) +
stat_contour(data = fine_grid_leaf, aes(x = leaf_width, y = leaf_length, z = as.integer(tree_pred)),
             lineend = "round", linejoin = "round", linemitre = 1, size = 0.25, colour = "black") +
geom_point(data = tree_data, aes(x = leaf_width, y = leaf_length, colour = tree_type, shape = tree_type)) +
ggtitle("SVM decision boundaries for leaf length vs. leaf width") +
labs(x = "Leaf width", y = "Leaf length", colour = "Actual tree type", shape = "Actual tree type") +
theme(plot.title = element_text(hjust = 0.5))

The graph shows three faintly coloured zones based on the SVM's predictions for the fine grid of data points (based on leaf features), and the hyperplanes (decision boundaries) for the different tree types represented by black lines.

We can use these coloured zones and hyperplanes to observe which tree type the SVM has chosen to place our original data points into. Note that in the graph above, our original data points are represented by both colour and shape. Also remember that the tree type assigned to each point in the fine grid comes from the SVM that we trained on the leaf features.

So, using the graph above, we observe two different classification scenarios:

1. Our original data points are classified correctly by the SVM, as the data point falls into the zone of the same colour, e.g. a green triangle data point (an actual type 1 tree) falls into the green zone (the SVM predicted the tree as type 1).

2. Our original data points are misclassified by the SVM, as the data point falls into the zone of a different colour, e.g. a red circle data point (an actual type 0 tree) falls into the green zone (the SVM predicted the tree as type 1). 

For the most part, our SVM can predict tree type based on leaf features reasonably well, but let's determine the mis-classification rate. To do this, we will need to run the `predict` function again, but this time using our original data points as input. Note that this method is somewhat circular, since we used this same data to train the SVM, but we will run this just to give us an idea of how well our SVM fits our data.

> **If we truly want to test the performance of our SVM, we need a *training set* with which to train the SVM, and an independent *test/validation set* with which to test the SVM.**
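
As a rough sketch of what such a split could look like, the example below uses R's built-in `iris` data set rather than our tree data. The 70/30 split, the seed, and the names `test_rows` and `split_svm` are arbitrary choices made for this illustration.

```r
library(e1071)

set.seed(42)  # arbitrary seed so the random split is reproducible

# Hold out roughly 30% of the rows as a test set
n <- nrow(iris)
test_rows <- sample(n, size = round(0.3 * n))

features <- c("Petal.Length", "Petal.Width")
x_train <- as.matrix(iris[-test_rows, features])
y_train <- iris$Species[-test_rows]
x_test  <- as.matrix(iris[test_rows, features])
y_test  <- iris$Species[test_rows]

# Train on the training set only
split_svm <- svm(x = x_train, y = y_train,
                 type = "C-classification", kernel = "radial")

# Evaluate on data the model has never seen
test_pred <- predict(split_svm, newdata = x_test)
mean(test_pred != y_test)  # mis-classification rate on the test set
```

Because the test rows play no part in training, the resulting rate is a fairer estimate of how the model would perform on new specimens.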

**Run the code below, to run the `predict` function. You do not need to edit this code block.**

In [None]:
# Run this box to run the predict function

pred_leaf_data <- tree_data %>% 
select(leaf_width, leaf_length)

# Predict the tree type of our original data based on the SVM `svm_leaf_data`
pred_leaf_data$tree_pred <- predict(svm_leaf_data, newdata = pred_leaf_data)

# Check output
head(pred_leaf_data)

# Add tree_data$tree_type to pred_leaf_data
pred_leaf_data <- inner_join(pred_leaf_data, tree_data, by = c("leaf_width", "leaf_length")) %>%
select(-trunk_girth, -trunk_height)

# Check output
head(pred_leaf_data)

# Create a table of predictions to show mis-classification rate
table(pred_leaf_data$tree_pred, pred_leaf_data$tree_type)

# Mis-classification rate: proportion of misclassified observations
mean(pred_leaf_data$tree_pred != pred_leaf_data$tree_type)

Our mis-classification rate is 6.5%, which can actually be preferable to a mis-classification rate of 0%, as the latter might indicate that the model has overfit the training data.
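
To see how the table of predictions relates to the mis-classification rate, here is a small worked sketch with ten made-up labels (the vectors `actual` and `predicted` are invented for illustration and are not taken from our tree data).

```r
# Ten made-up specimens: actual tree types and hypothetical predictions
actual    <- factor(c(0, 0, 0, 1, 1, 1, 2, 2, 2, 2))
predicted <- factor(c(0, 0, 1, 1, 1, 1, 2, 2, 2, 1),
                    levels = levels(actual))

# Rows are predictions, columns are actual types
confusion <- table(predicted = predicted, actual = actual)
confusion

# Correct predictions sit on the diagonal; everything else is misclassified
misclassified <- sum(confusion) - sum(diag(confusion))
misclassified / sum(confusion)  # 0.2

# The same rate, computed directly from the two vectors
mean(predicted != actual)       # 0.2
```

Two of the ten specimens land off the diagonal, so both calculations give a mis-classification rate of 20%.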

Step 3
-----

Now let's create our second SVM based on the trunk features. Remember, for the `e1071::svm` function, we need to create a new variable for input to the `x` argument, but we can use the same variable as before as input to `y`, `tree_data$tree_type`.

### In the cell below replace:
#### 1. `<trunkFeature1>` with `trunk_girth`
#### 2. `<trunkFeature2>` with `trunk_height`
#### then __run the code__.

In [None]:
# Create new x variable input to `svm` based on trunk features
x_trunk_data <- tree_data %>% 

###
# REPLACE <trunkFeature1> WITH trunk_girth and <trunkFeature2> WITH trunk_height
###
select(<trunkFeature1>, <trunkFeature2>) %>%
###
as.matrix()

# Check output
head(x_trunk_data)

# Fit SVM
svm_trunk_data <- svm(x = x_trunk_data, y = tree_data$tree_type, type = "C-classification", kernel = "radial")

# Create a fine grid of the feature space
trunk_girth <- seq(from = min(tree_data$trunk_girth), to = max(tree_data$trunk_girth), length = 100)
trunk_height <- seq(from = min(tree_data$trunk_height), to = max(tree_data$trunk_height), length = 100)

fine_grid_trunk <- as.data.frame(expand.grid(trunk_girth, trunk_height))
fine_grid_trunk <- fine_grid_trunk %>% 
                   dplyr::rename(trunk_girth = "Var1", trunk_height = "Var2")

# Check output
head(fine_grid_trunk)

# Predict which tree type the new points fall into
fine_grid_trunk$tree_pred <- predict(svm_trunk_data, newdata = fine_grid_trunk)

# Check output
head(fine_grid_trunk)
table(fine_grid_trunk$tree_pred)

Now let's create a scatter plot using `ggplot2`. We will plot the fine grid as well as the original tree points.

### In the cell below replace:
#### 1. `<data1>` with `fine_grid_trunk`
#### 2. `<data2>` with `fine_grid_trunk`
#### 3. `<data3>` with `tree_data`
#### then __run the code__.

In [None]:
# Create scatter plot with original trunk features layered over the fine grid of data points
ggplot() +

# First plot the fine grid of data points;
###
# REPLACE <data1> WITH fine_grid_trunk
###
geom_point(data = <data1>, aes(x = trunk_girth, y = trunk_height, colour = tree_pred), alpha = 0.25) +

# Add contour lines based on fine grid of data points; 
###
# REPLACE <data2> WITH fine_grid_trunk
###
stat_contour(data = <data2>, aes(x = trunk_girth, y = trunk_height, z = as.integer(tree_pred)),
             lineend = "round", linejoin = "round", linemitre = 1, size = 0.25, colour = "black") +
###

# Now plot the original data points to see where they lie in relation to the fine grid of data points;

###
# REPLACE <data3> WITH tree_data
###
geom_point(data = <data3>, aes(x = trunk_girth, y = trunk_height, colour = tree_type, shape = tree_type)) +
###
ggtitle("SVM decision boundaries for trunk height vs. trunk girth") +
labs(x = "Trunk girth", y = "Trunk height", colour = "Tree type", shape = "Tree type") +
theme(plot.title = element_text(hjust = 0.5))

Excellent! Again we can observe three faintly coloured zones based on the SVM's predictions of tree type for the fine grid of data points (based on trunk features), and the hyperplanes for the different tree types represented by black lines. We use these coloured zones and hyperplanes to observe which tree type the SVM has chosen to place our original data points into. Again, we observe two different classification scenarios: 1) our original data points are classified correctly by the SVM, or 2) our original data points are misclassified by the SVM.

**Now let's run the `predict` function as we did earlier to determine the mis-classification rate of our SVM model based on trunk features.**

In [None]:
# Run this box to determine the mis-classification rate

pred_trunk_data <- tree_data %>% 
select(trunk_girth, trunk_height)

# Predict the tree type of our original data based on the SVM `svm_trunk_data`
pred_trunk_data$tree_pred <- predict(svm_trunk_data, newdata = pred_trunk_data)

# Check output
head(pred_trunk_data)

# Add tree_data$tree_type to pred_trunk_data
pred_trunk_data <- inner_join(pred_trunk_data, tree_data, by = c("trunk_girth", "trunk_height")) %>%
select(-leaf_length, -leaf_width)

# Check output
head(pred_trunk_data)

# Create a table of predictions to show mis-classification rate
table(pred_trunk_data$tree_pred, pred_trunk_data$tree_type)

# Mis-classification rate: proportion of misclassified observations
mean(pred_trunk_data$tree_pred != pred_trunk_data$tree_type)

Here our mis-classification rate of the training data using the `svm_trunk_data` model is 4.5%, which is lower than the mis-classification rate of the `svm_leaf_data` model.


Conclusion
-------

That's it! You've made two simple SVMs that can predict the type of tree based on the leaf measurements and trunk measurements!

You can go back to the course now and click __'Next Step'__ to move onto how we can test AI models.