Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = ""
COLLABORATORS = ""

---

# Week 7 Lecture Worksheet - Classification continued

## Learning Objectives
* Describe what a test data set is and how it is used in classification.
* Using R, evaluate classification accuracy using a test data set and appropriate metrics.
* Using R, execute cross-validation to choose the number of neighbours.
* Identify when it is necessary to scale variables before classification and do this using R.
* In a dataset with > 2 attributes, perform k-nearest neighbour classification in R using caret::train(method = "knn", ...) to predict the class of a test dataset.
* Describe advantages and disadvantages of the k-nearest neighbour classification algorithm.

## Fruit Data Example 

Load the appropriate packages and "fruit_data_with_colors.csv" dataset into the notebook.

In [None]:
suppressMessages({
  library(tidyverse)
  library(ggplot2)
  library(forcats) # fct_recode()
  library(readr)
  library(caret)
  library(repr)})  # change the size of the scatterplots 
fruitDat <- read_csv("data/fruit_data_with_colors.csv")

Q1) Let's take a look at the first six observations in the fruit dataset.

In [None]:
# hide this cell
head(fruitDat)

fruitDat <- fruitDat %>% 
  mutate(fruit_name = as.factor(`fruit_name`)) 

a) Find the nearest neighbour based on mass and width to the first observation just by looking at the scatterplot below (the first observation has been circled for you).

In [None]:
# hide cell
options(repr.plot.width=6, repr.plot.height=4)
point <- c(192, 8.4)
fruitDat %>%  
  ggplot(aes(x=mass, y= width, color = fruit_name)) + 
  scale_x_continuous(name = "Mass (grams)") +
  scale_y_continuous(name = "Width (cm)") +
  geom_point() +
    annotate("path", x=point[1] + 10*cos(seq(0,2*pi,length.out=100)),
                   y=point[2]+0.2*sin(seq(0,2*pi,length.out=100)))

b) Using mass and width, calculate the distance between the first observation and the second observation (coordinates $(180, 8.0)$). 

In [None]:
#ANSWER
filter(fruitDat, row_number() %in% c(1, 2)) %>% # we filter the fruit dataset to pick out rows 1 and 2
    select(mass, width) %>%  # select the columns we want from the dataset
    dist()                   # calculate the distance
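
The same distance can be checked by hand with the Euclidean formula $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$. The sketch below uses only the coordinates quoted in the question; the names `obs1`, `obs2` and `d12` are made up for illustration.

```r
# Coordinates (mass in g, width in cm) quoted in the question
obs1 <- c(mass = 192, width = 8.4)
obs2 <- c(mass = 180, width = 8.0)

# Euclidean distance: square the differences, add them up, take the square root
d12 <- sqrt(sum((obs1 - obs2)^2))
d12   # approximately 12.01
```

This should agree with the value that `dist()` reports for rows 1 and 2.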

c) Calculate the distance between the first and the 44th observation in the fruit dataset using the mass and width variables. We see from the table below that observation 44 has mass = 194 g and width = 7.2 cm.

In [None]:
filter(fruitDat, row_number() == 44)

In [None]:
#ANSWER 
filter(fruitDat, row_number() %in% c(1, 44)) %>%
    select(mass, width) %>%
    dist()

d) <!-- better as a discussion question? --> 
i) What do you notice about the answers to b) and c) that you just calculated? (Hint: look at where the observations are on the scatterplot)

ii) Is it what you would expect? Why or why not? (Hint: what might happen if we changed grams into kilograms to measure the mass?)

In [None]:
#hide cell
options(repr.plot.width=6, repr.plot.height=4)
point1 <- c(192, 8.4)
point2 <- c(180, 8)
point44 <- c(194, 7.2)
fruitDat %>%  
  ggplot(aes(x=mass, y= width, color = fruit_name)) + 
  scale_x_continuous(name = "Mass (grams)") +
  scale_y_continuous(name = "Width (cm)") +
  geom_point() +
    annotate("path", x=point1[1] + 5*cos(seq(0,2*pi,length.out=100)),
                   y=point1[2]+0.1*sin(seq(0,2*pi,length.out=100))) +
 annotate("path", x=point2[1] + 5*cos(seq(0,2*pi,length.out=100)),
                   y=point2[2]+0.1*sin(seq(0,2*pi,length.out=100))) +
    annotate("path", x=point44[1] + 5*cos(seq(0,2*pi,length.out=100)),
                   y=point44[2]+0.1*sin(seq(0,2*pi,length.out=100)))

ANSWER: 
The distance between the first and second observations is 12.01, and the distance between the first and 44th observations is 2.33. So by the formula, observations 1 and 44 are closer. However, on the scatterplot the first observation appears closer to the second observation than to the 44th.

Because the classifier predicts the class by identifying the nearest points, the scale of the variables matters. Variables on a large scale will have a greater effect on the distance between observations than variables on a small scale. Here we have width (measured in cm) and mass (in grams). As far as knn is concerned, a difference of 12 g in mass between observations 1 and 2 is large compared to a difference of 1.2 cm in width between observations 1 and 44. Consequently, mass will drive the classification results, and width will have less of an effect; our distance calculation reflects that. Likewise, if we measured mass in kilograms or width in meters, we would get different classification results. To avoid this, we can standardize the data so that all variables are on a comparable scale.
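
To see the unit effect from the hint concretely, here is a small sketch that recomputes both distances after converting mass from grams to kilograms. It uses only the three coordinates quoted in this question; the object names (`pts_g`, `pts_kg`, `d_g`, `d_kg`) are made up for illustration.

```r
# The three observations discussed above: mass in grams, width in cm
pts_g <- rbind(obs1  = c(192, 8.4),
               obs2  = c(180, 8.0),
               obs44 = c(194, 7.2))
colnames(pts_g) <- c("mass", "width")

# Same observations with mass converted to kilograms
pts_kg <- pts_g
pts_kg[, "mass"] <- pts_kg[, "mass"] / 1000

d_g  <- as.matrix(dist(pts_g))    # pairwise distances, mass in g
d_kg <- as.matrix(dist(pts_kg))   # pairwise distances, mass in kg

d_g["obs1", c("obs2", "obs44")]   # approx. 12.01 and 2.33: obs44 is nearer
d_kg["obs1", c("obs2", "obs44")]  # approx. 0.40 and 1.20: obs2 is nearer
```

Nothing about the fruit changed, only the unit of one variable, yet the nearest neighbour of observation 1 switches from observation 44 to observation 2.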

e) i) Scale all the variables of the fruit dataset and save them as columns in your data table.

In [None]:
# ANSWER
fruitDat <- fruitDat %>%
  mutate(scaled_mass = scale(mass),
         scaled_width = scale(width),
         scaled_height = scale(height), 
         scaled_color_score = scale(color_score))
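
As a quick check of what `scale()` does by default, the sketch below compares it with the manual calculation (subtract the mean, divide by the standard deviation). The vector `x` holds made-up mass values, not the fruit data.

```r
# Hypothetical mass values, for illustration only
x <- c(150, 180, 192, 194, 210)

# Standardize by hand: subtract the mean, divide by the standard deviation
manual <- (x - mean(x)) / sd(x)

# scale() returns a one-column matrix, so convert it to a plain vector
scaled <- as.numeric(scale(x))

all.equal(manual, scaled)   # TRUE
mean(scaled)                # 0, up to floating point rounding
sd(scaled)                  # 1
```

Because `scale()` returns a matrix, the new columns created inside `mutate()` are one-column matrices; wrapping the call in `as.numeric()` is one way to get ordinary numeric columns if that matters downstream.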

ii) Let's repeat Q1 parts b) and c) with the scaled variables. Calculate the distance with the scaled mass and width variables between observation 1 and 2.  Calculate the distances with the scaled mass and width variables between observation 1 and 44. What do you notice? 

In [None]:
# ANSWER
fruitDat %>% 
  select(scaled_mass, scaled_width) %>% 
  filter(row_number() %in% c(1,2)) %>%
  dist()

fruitDat %>% 
  select(scaled_mass, scaled_width) %>% 
  filter(row_number() %in% c(1,44)) %>%
  dist()

Q2)i) Partition the data into a training (70%) and testing (30%) set using the `caret` package. Select the scaled color score, scaled mass, and fruit name variables to include in your sets. Set the seed to 6 so that consistent results are obtained.

In [None]:
#### R caret 
set.seed(6)      # set seed so same results can be obtained 

# create a vector of fruit name labels 
labels <- fruitDat %>% 
  pull(fruit_name)   # pull out a vector rather than having a dataframe

# randomly take 70% of the data in the training set proportional to the different number of fruit names in the dataset
# list = F denotes that the indices we obtain should form a vector

# create index to split based on labels
index <- createDataPartition(labels, p = 0.7, list = F)

# filtering the dataset into training data based on the random indices above
# selecting the scaled columns "mass" and "color score"
trainingDat <- fruitDat %>% 
    select(scaled_mass, scaled_color_score) %>% 
    filter(row_number() %in% index)

# create a vector of labels to use with knn by extracting the column "fruit_name" for your training set observations
training_labels <- fruitDat %>% 
    select(fruit_name) %>% 
    filter(row_number() %in% index) %>%
    pull()

# creating the testing set based on the remaining observations 
# selecting the scaled columns "mass" and "color score" 
testingDat <- fruitDat %>% 
  select(scaled_mass, scaled_color_score) %>% 
  filter(!(row_number() %in% index))

# create a vector of labels to use with knn by extracting the column "fruit_name" for your testing set observations
testing_labels <- fruitDat %>% 
  select(fruit_name) %>% 
  filter(!(row_number() %in% index)) %>% 
    pull()
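
To make the idea of a split that is "proportional to the different number of fruit names" more concrete, here is a rough base R sketch of stratified sampling. It is an approximation of the idea only, not the algorithm `createDataPartition()` actually uses, and the labels are made up.

```r
set.seed(6)

# Hypothetical class labels with unequal group sizes
labels <- factor(rep(c("apple", "lemon", "orange"), times = c(20, 10, 20)))

# Within each class, draw about 70% of that class's row numbers
by_class  <- split(seq_along(labels), labels)
train_idx <- unlist(lapply(by_class, function(i) sample(i, size = round(0.7 * length(i)))))
train_idx <- sort(unname(train_idx))

# Everything not drawn goes to the test set
test_idx <- setdiff(seq_along(labels), train_idx)

table(labels[train_idx])   # 14 apple, 7 lemon, 14 orange
table(labels[test_idx])    # 6 apple, 3 lemon, 6 orange
```

Sampling within each class keeps the class proportions of the training and testing sets close to those of the full dataset, which a plain random sample of rows does not guarantee.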

ii) Using the `caret` package, perform knn to obtain fruit name predictions using "color_score" and "mass" variables. Use $k = 5$.

In [None]:
# make a dataframe with k = 5
k5 <- data.frame(k = 5)

# set the "x" argument equal to your training data as a data.frame 
# set the "y" argument equal to the training labels vector 
# set "tuneGrid" to your "k" dataframe
# set the "trControl" argument equal to trainControl(method = "none") - we will talk more about this argument later
model_knn5 <- train(x = data.frame(trainingDat),     
                   y = training_labels,
                   method = "knn", 
                   tuneGrid = k5,                               
                   trControl = trainControl(method = "none"))  

(predictions <- predict(object=model_knn5, testingDat))
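
For intuition about what the classifier does for a single new observation, here is a minimal hand-written sketch of k-nearest neighbours in base R. The function `knn_predict_one()` and the toy data are hypothetical; this is not how `caret` implements knn internally, and in particular ties are broken here by simply taking the first class.

```r
# Predict the class of one new point by majority vote among its k nearest neighbours
knn_predict_one <- function(train_x, train_y, new_x, k) {
  # Euclidean distance from new_x to every row of train_x
  d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Row numbers of the k closest training points
  nearest <- order(d)[seq_len(k)]
  # Count the classes among those neighbours and return the most common one
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

# Toy standardized training data (made up): two well-separated groups
train_x <- rbind(c(-1.0, -1.0), c(-0.8, -1.2), c(-1.1, -0.7),
                 c( 1.0,  1.1), c( 0.9,  0.8), c( 1.2,  1.0))
train_y <- factor(c("lemon", "lemon", "lemon", "apple", "apple", "apple"))

pred_a <- knn_predict_one(train_x, train_y, new_x = c(0.7, 0.9), k = 3)
pred_b <- knn_predict_one(train_x, train_y, new_x = c(-1.0, -0.9), k = 3)
pred_a   # "apple"
pred_b   # "lemon"
```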

ii) Choose one case that was not predicted correctly. What was predicted, and what is the correct label? 

In [None]:
# use the function "cbind()" to compare your actual and predicted values 
cbind(testingDat, testing_labels, predictions)

iii) Evaluate the classification performance by comparing the estimated labels to the true labels.

In [None]:
#create a confusion matrix of the actual vs. predicted values 
confusionMatrix(predictions, testing_labels) 

# use the function "table()" to make a table of actual vs. predicted values 
table(predictions, testing_labels)

iv) Compute the overall accuracy of the knn classifier.

In [None]:
#Compute the overall accuracy of the knn learner using the mean() function.
mean(predictions == testing_labels)

We see that 2 of the 15 labels were misclassified. Accuracy = $\frac{13}{15} = 0.8667$
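
The link between the confusion table and the accuracy can be sketched with made-up labels: accuracy is the sum of the diagonal of the table (the correct predictions) divided by the total count, which is the same number `mean()` gives. The vectors `truth` and `pred` below are hypothetical, chosen so that 2 of 15 predictions are wrong.

```r
# Hypothetical true labels for 15 test observations
truth <- factor(c(rep("apple", 5), rep("lemon", 4), rep("mandarin", 2), rep("orange", 4)))

# Hypothetical predictions: identical except for two mistakes
pred <- truth
pred[c(1, 12)] <- c("orange", "apple")

# Rows are predicted classes, columns are actual classes
cm <- table(predicted = pred, actual = truth)
cm

acc_table <- sum(diag(cm)) / sum(cm)   # correct predictions / all predictions
acc_mean  <- mean(pred == truth)       # proportion of matching labels
c(acc_table, acc_mean)                 # both 13/15, about 0.8667
```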

v) Compare $k$ values of 1, 5 and 15 to examine the impact on classification accuracy.

In [None]:
# compute the accuracy of the k = 1 model using your code above
k1 <- data.frame(k = 1)

# set the "x" argument equal to your training data as a data.frame 
# set the "y" argument equal to the training labels vector 
# set "tuneGrid" to your "k" dataframe
# set the "trControl" argument equal to trainControl(method = "none") - we will talk more about this argument later
model_knn1 <- train(x = data.frame(trainingDat),     
                   y = training_labels,
                   method = "knn", 
                   tuneGrid = k1,                               
                   trControl = trainControl(method = "none"))  

predictions1 <- predict(object=model_knn1, testingDat)
mean(predictions1 == testing_labels)

In [None]:
# modify the "train" function by setting k = 15
k15 <- data.frame(k = 15)

# set the "x" argument equal to your training data as a data.frame 
# set the "y" argument equal to the training labels vector 
# set "tuneGrid" to your "k" dataframe
# set the "trControl" argument equal to trainControl(method = "none") - we will talk more about this argument later
model_knn15 <- train(x = data.frame(trainingDat),     
                   y = training_labels,
                   method = "knn", 
                   tuneGrid = k15,                               
                   trControl = trainControl(method = "none"))  

predictions15 <- predict(object=model_knn15, testingDat)
mean(predictions15 == testing_labels)

vi) Which value of $k$ gave the highest accuracy?

For various values of $k$, fit a classifier using the training data. Use that classifier to obtain an error rate when predicting on both the training and test sets, for each $k$. How do the training error and test error change with $k$? 
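
One way to explore this question is sketched below with simulated data and a small hand-written knn function, so that it runs without the fruit data or `caret`. Everything here (the simulated points, `knn_predict()`, the choice of 40 training rows) is an assumption for illustration, not the worksheet's own solution.

```r
set.seed(1)

# Simulate two overlapping classes in two dimensions (60 points in total)
n <- 60
x <- rbind(matrix(rnorm(n, mean = -1), ncol = 2),
           matrix(rnorm(n, mean =  1), ncol = 2))
y <- factor(rep(c("a", "b"), each = n / 2))

# Hold out 20 points for testing
train <- sample(nrow(x), 40)

# Majority vote among the k nearest training points, for each row of new_x
knn_predict <- function(train_x, train_y, new_x, k) {
  apply(new_x, 1, function(p) {
    d <- sqrt(rowSums(sweep(train_x, 2, p)^2))
    votes <- table(train_y[order(d)[seq_len(k)]])
    names(votes)[which.max(votes)]
  })
}

# Error rate on the training set and on the test set for each k
ks <- c(1, 5, 15)
errors <- t(sapply(ks, function(k) {
  c(k = k,
    train_error = mean(knn_predict(x[train, ], y[train], x[train, ], k) != y[train]),
    test_error  = mean(knn_predict(x[train, ], y[train], x[-train, ], k) != y[-train]))
}))
errors
```

With $k = 1$ the training error is always 0 here, because each training point is its own nearest neighbour; that says nothing about how well the classifier does on new data, which is why the test error is the number to compare across values of $k$.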

In [None]:
# hide this box - looking at the decision boundary for different k values 

library(caret) 

mass_length = seq(min(testingDat$scaled_mass), max(testingDat$scaled_mass), by = 0.1)
col_length = seq(min(testingDat$scaled_color_score), max(testingDat$scaled_color_score), by = 0.1)

# generates the boundaries for your graph
lgrid <- expand.grid(scaled_mass=mass_length, 
                     scaled_color_score = col_length)
knnPredGrid <- predict(model_knn5, newdata=lgrid)

knnPredGrid = as.numeric(knnPredGrid)

# get the points from the test data...
testPred <- predict(model_knn5, newdata=testingDat)
testPred <- as.numeric(testPred)

# this gets the points for the testPred...
testingDat$Pred <- testPred

probs <- matrix(knnPredGrid, length(mass_length), length(col_length))

#ggplot(data=lgrid) + stat_contour(aes(x=scaled_mass, y=scaled_color_score, z=knnPredGrid), 
#                                  bins=2) +
#  geom_point(aes(x=scaled_mass, y=scaled_color_score, colour=as.factor(knnPredGrid))) 
#  geom_point(data=testingDat, aes(x=testingDat$scaled_mass, y=testingDat$scaled_color_score,
#                                  colour=as.factor(testingDat$Pred)),
#            size=5, alpha=0.5, shape=1)+
#  theme_bw()

contour(mass_length, col_length, probs, labels="", xlab="", ylab="", main="5-Nearest Neighbor", axes=F)
gd <- expand.grid(x=mass_length, y=col_length)

points(gd, pch=".", cex=1, col=probs)

# add the test points to the graph
points(testingDat$scaled_mass, 
       testingDat$scaled_color_score, col=testingDat$Pred, cex=1, 
      pch=16)
box()

In [None]:
knnPredGrid <- predict(model_knn1, newdata=lgrid)
knnPredGrid = as.numeric(knnPredGrid)

# get the points from the test data...
testPred <- predict(model_knn1, newdata=testingDat)
testPred <- as.numeric(testPred)

# this gets the points for the testPred...
testingDat$Pred <- testPred

probs <- matrix(knnPredGrid, length(mass_length), length(col_length))

ggplot(data=lgrid) + stat_contour(aes(x=scaled_mass, y=scaled_color_score, z=knnPredGrid), 
                                  bins=2) +
  geom_point(aes(x=scaled_mass, y=scaled_color_score, colour=as.factor(knnPredGrid))) +
  geom_point(data=testingDat, aes(x=testingDat$scaled_mass, y=testingDat$scaled_color_score,
                                  colour=as.factor(testingDat$Pred)),
            size=5, alpha=0.5, shape=1)+
  theme_bw()

contour(mass_length, col_length, probs, labels="", xlab="", ylab="", main="1-Nearest Neighbor", axes=F)
gd <- expand.grid(x=mass_length, y=col_length)

points(gd, pch=".", cex=1, col=probs)

# add the test points to the graph
points(testingDat$scaled_mass, 
       testingDat$scaled_color_score, col=testingDat$Pred, cex=1, 
      pch=16)
box()

In [None]:
# refit the k = 15 model so we can plot its decision boundary
k15 <- data.frame(k = 15)

# set the "x" argument equal to your training data as a data.frame 
# set the "y" argument equal to the training labels vector 
# set "tuneGrid" to your "k" dataframe
# set the "trControl" argument equal to trainControl(method = "none") - we will talk more about this argument later
model_knn15 <- train(x = data.frame(trainingDat),     
                   y = training_labels,
                   method = "knn", 
                   tuneGrid = k15,                               
                   trControl = trainControl(method = "none"))  

knnPredGrid <- predict(model_knn15, newdata=lgrid)
knnPredGrid = as.numeric(knnPredGrid)

# get the points from the test data...
testPred <- predict(model_knn15, newdata=testingDat)
testPred <- as.numeric(testPred)

# this gets the points for the testPred...
testingDat$Pred <- testPred

probs <- matrix(knnPredGrid, length(mass_length), length(col_length))

ggplot(data=lgrid) + stat_contour(aes(x=scaled_mass, y=scaled_color_score, z=knnPredGrid), 
                                  bins=2) +
  geom_point(aes(x=scaled_mass, y=scaled_color_score, colour=as.factor(knnPredGrid))) +
  geom_point(data=testingDat, aes(x=testingDat$scaled_mass, y=testingDat$scaled_color_score,
                                  colour=as.factor(testingDat$Pred)),
            size=5, alpha=0.5, shape=1)+
  theme_bw()

contour(mass_length, col_length, probs, labels="", xlab="", ylab="", main="15-Nearest Neighbor", axes=F)
gd <- expand.grid(x=mass_length, y=col_length)

points(gd, pch=".", cex=1, col=probs)

# add the test points to the graph
points(testingDat$scaled_mass, 
       testingDat$scaled_color_score, col=testingDat$Pred, cex=1, 
      pch=16)
box()

## German credit example
We are going to work with a dataset called the Statlog (German Credit Data) Data Set. The data has many attributes, such as "Status of existing checking account", "credit history", etc., for 1000 individuals and classifies people as good or bad credit risks (1 = good, 2 = bad). The dataset can be found [here](http://archive.ics.uci.edu/ml/datasets/Statlog+%28German+Credit+Data%29). 

Load the dataset "german.csv" into the notebook. 

In [None]:
gDat <- read_csv("data/german.csv")
gDat <- gDat %>% 
  mutate(Default = as.factor(Default)) 

Q1) Partition the data into a training (70%) and testing (30%) set using the `caret` package. Select the `Duration`, `Amount`, and `Default` variables to include in your datasets. 

In [None]:
#### R caret 
set.seed(100)      # set seed so same results can be obtained 

labels <- gDat %>% 
  pull(Default)   # pull out a vector rather than having a dataframe

# randomly take 70% of the data in the training set proportional to 
# the number of good and bad credit risk observations in the dataset
# list = F denotes that the indices we obtain should form a vector

# create index to split based on labels
index <- createDataPartition(labels, p = 0.7, list = F)

# filtering the dataset into training data based on the random indices above
trainingDat <- gDat %>% 
    select(Duration, Amount) %>% 
    filter(row_number() %in% index)

trainingLabels <-  gDat %>% 
    select(Default) %>% 
    filter(row_number() %in% index) %>%
    pull()

# creating the testing set based on the remaining observations 
testingDat <- gDat %>% 
  select(Duration, Amount) %>% 
  filter(!(row_number() %in% index))

testingLabels <- gDat %>% 
  select(Default) %>% 
  filter(!(row_number() %in% index)) %>%
  pull()

Q2) Perform 10-fold cross-validation to select the value of k. What value of k do you choose?

In [None]:
model_knn <- train(x = data.frame(trainingDat),
                   y = trainingLabels,
                   method = "knn", 
                   trControl = trainControl(method = "cv", number = 10),
                   preProcess=c("center", "scale"))
predictions <- predict(model_knn, testingDat)
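
With `caret`, the cross-validated accuracy for each candidate k is stored in `model_knn$results` and the selected value in `model_knn$bestTune`. To show what 10-fold cross-validation is doing underneath, here is a rough base R sketch on simulated data. The data, the helper `knn_predict()` and the candidate values of k are all assumptions for illustration; this is not `caret`'s internal procedure and it skips the centering and scaling step.

```r
set.seed(100)

# Simulated two-class data standing in for the training set (100 rows)
n <- 100
x <- rbind(matrix(rnorm(n, mean = -1), ncol = 2),
           matrix(rnorm(n, mean =  1), ncol = 2))
y <- factor(rep(c("good", "bad"), each = n / 2))

# Majority vote among the k nearest training points, for each row of new_x
knn_predict <- function(train_x, train_y, new_x, k) {
  apply(new_x, 1, function(p) {
    d <- sqrt(rowSums(sweep(train_x, 2, p)^2))
    votes <- table(train_y[order(d)[seq_len(k)]])
    names(votes)[which.max(votes)]
  })
}

# Randomly assign every row to one of 10 folds of equal size
folds <- sample(rep(1:10, length.out = nrow(x)))

# For each candidate k: hold out one fold at a time, fit on the other nine,
# and average the accuracy over the ten held-out folds
candidate_k <- c(5, 7, 9)
cv_accuracy <- sapply(candidate_k, function(k) {
  mean(sapply(1:10, function(f) {
    held_out <- folds == f
    pred <- knn_predict(x[!held_out, ], y[!held_out], x[held_out, , drop = FALSE], k)
    mean(pred == y[held_out])
  }))
})
names(cv_accuracy) <- candidate_k

cv_accuracy
best_k <- candidate_k[which.max(cv_accuracy)]
best_k
```

The test set plays no part in choosing k; it is only used afterwards, once, to estimate how well the chosen classifier performs on new data.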

Q3) Evaluate the classification performance by comparing the estimated labels to the true labels.

In [None]:
confusionMatrix(predictions, testingLabels)