Before you turn this problem in, make sure everything runs as expected. First, **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart) and then **run all cells** (in the menubar, select Cell$\rightarrow$Run All).

Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name and collaborators below:

In [None]:
NAME = ""
COLLABORATORS = ""

---

# Week 6 Lecture Worksheet - Classification

## Learning Objectives
* Recognize situations where a simple classifier would be appropriate for making predictions.
* Explain the k-nearest neighbour classification algorithm.
* Interpret the output of a classifier.
* Compute, by hand, the distance between points when there are two attributes.
* Describe what a training data set is and how it is used in classification.
* In a dataset with two attributes, perform k-nearest neighbour classification in R using `caret::train(method = "knn", ...)` to predict the class of a single new observation.

In [None]:
library(repr)      # options() to change the plot sizes
library(tidyverse) 

# Part 1 - Breast cancer dataset
We will work with the breast cancer data from this week's pre-reading. Load the appropriate packages and the "clean-wdbc-data.csv" dataset into the notebook. 

In [None]:
bcDat <- read_csv("data/clean-wdbc-data.csv")

## Question 1
The first six rows of the breast cancer data table are shown below.

In [None]:
# hidden cell 
head(bcDat)

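We want to predict the variable "area" for a new observation. True or False: this is a classification problem.

ANSWER: False. Area is a quantitative variable, so predicting it is a regression problem rather than a classification problem.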

## Question 2
Just by looking at the scatterplot below, how would you classify an observation with symmetry 1 and radius 0.5?  
a) Benign 
b) Malignant

In [None]:
# Change plot size to 6 x 4
options(repr.plot.width=6, repr.plot.height=4)

bcDat %>%  
  ggplot(aes(x=Symmetry, y=Radius, color = Class)) + 
  geom_point() +
  scale_x_continuous(name = "Symmetry") +
  scale_y_continuous(name = "Radius")

## Question 3
Using R as a calculator and the formula below, compute the distance between the first and second observation in the breast cancer dataset using attributes symmetry and radius.

* We want to find the distance between the first and second observation in the breast cancer dataset using 2 attributes: symmetry and radius.

* Recall we can calculate the distance between two points using the following formula: 
$$\text{Distance} = \sqrt{(x_a - x_b)^2 + (y_a - y_b)^2}$$

* Point $a$, $(x_a, y_a)$, has coordinates $(2.75, 1.89)$ and point $b$, $(x_b, y_b)$, has coordinates $(-0.24, 1.80)$.
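
As a rough check on the R result, substituting the rounded coordinates above into the formula gives approximately:

$$\text{Distance} = \sqrt{(2.75 - (-0.24))^2 + (1.89 - 1.80)^2} = \sqrt{2.99^2 + 0.09^2} = \sqrt{8.9482} \approx 2.99$$

Because the coordinates are rounded to two decimal places, the value R computes from the unrounded data may differ slightly in the later decimals.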

In [None]:
(xa <- filter(bcDat, row_number() == 1) %>% # selecting first observation from bcDat
    select(Symmetry) %>%                    # selecting the column Symmetry
    unlist())                               # we want the numeric value only

(ya <- filter(bcDat, row_number() == 1) %>%
    select(Radius) %>%
    unlist())

(xb <- filter(bcDat, row_number() == 2) %>%  
    select(Symmetry) %>%
    unlist())

(yb <- filter(bcDat, row_number() == 2) %>%
    select(Radius) %>%
    unlist())

In [None]:
# ANSWER
sqrt((xa - xb)^2 + (ya - yb)^2)

## Question 4
We want to calculate the distance between the first and second observation in the breast cancer dataset using 3 attributes: symmetry, radius and concavity. 

Notice that point $a$, $(x_a, y_a, z_a)$, has coordinates $(2.75, 1.89, 2.11)$ and point $b$, $(x_b, y_b, z_b)$, has coordinates $(-0.24, 1.80, -0.15)$.
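
The two-attribute distance formula extends to three attributes by adding one more squared difference under the square root. With the rounded coordinates above, this gives approximately:

$$\text{Distance} = \sqrt{(x_a - x_b)^2 + (y_a - y_b)^2 + (z_a - z_b)^2} = \sqrt{2.99^2 + 0.09^2 + 2.26^2} = \sqrt{14.0558} \approx 3.75$$

The value computed in R from the unrounded data may differ slightly in the later decimals.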

In [None]:
(za <- filter(bcDat, row_number() == 1) %>% # selecting first observation from bcDat
    select(Concavity) %>%                   # selecting the column Concavity (third coordinate of point a)
    unlist())

(zb <- filter(bcDat, row_number() == 2) %>%
    select(Concavity) %>%
    unlist())

### Part a) 
Using R as a calculator, calculate the distance between the first and second observation in the breast cancer dataset using 3 attributes: symmetry, radius and concavity.

In [None]:
# ANSWER
(distance <- sqrt((xa - xb)^2 + (ya - yb)^2 + (za - zb)^2))

### Part b) 
#### i) 
Set up a vector for point $a$ and a vector for point $b$ (you should have 3 coordinates for each point). For instance, point $a$ will be at coordinates $(2.75, 1.89, 2.11)$.

In [None]:
# ANSWER
(point_a <- filter(bcDat, row_number() == 1) %>% # selecting first observation from bcDat 
    select(Symmetry, Radius, Concavity) %>%      # selecting columns Symmetry, radius and concavity 
    unlist())                                    # want numeric value 

(point_b <- filter(bcDat, row_number() == 2) %>%
    select(Symmetry, Radius, Concavity) %>%
    unlist()) 

#### ii)
Calculate the difference between the vectors.

In [None]:
#ANSWER 
point_a - point_b

#### iii) 
Square the differences you calculated in part ii).

In [None]:
# ANSWER 
(point_a - point_b)^2

#### iv) 
Sum the entries of your answer in part iii).

In [None]:
# ANSWER 
sum((point_a - point_b)^2)

#### v) 
Take the square root of the sum of squared differences you calculated in part iv).

In [None]:
# ANSWER 
sqrt(sum((point_a - point_b)^2))

### Part c) 
If we have more than a few points, calculating distances as we did in parts (a) and (b) is slow. Let's use the `dist()` function to find the distance between the first and second observation in the breast cancer dataset using symmetry, radius and concavity. 
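
As a small illustration of what `dist()` returns, the sketch below uses three made-up points (the names `toy_points`, `toy_dist` and `toy_matrix` are hypothetical and the values are not taken from the breast cancer data). By default `dist()` computes the Euclidean distance between every pair of rows, and `as.matrix()` turns the result into a square table that is easier to read.

```r
# Three hypothetical points, one per row (toy values, not from bcDat)
toy_points <- rbind(p1 = c(0, 0),
                    p2 = c(3, 4),
                    p3 = c(6, 8))

# dist() returns the pairwise Euclidean distances between rows
toy_dist <- dist(toy_points)

# Convert to a square matrix so distances can be looked up by row and column name
toy_matrix <- as.matrix(toy_dist)
toy_matrix

# The distance between p1 and p2 matches the formula computed by hand
manual_p1_p2 <- sqrt((0 - 3)^2 + (0 - 4)^2)
all.equal(unname(toy_matrix["p1", "p2"]), manual_p1_p2)
```

The diagonal of the matrix is zero because each point is at distance zero from itself, and the matrix is symmetric because the distance from `p1` to `p2` equals the distance from `p2` to `p1`.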

In [None]:
#ANSWER
head(bcDat, 2)  %>% 
    select(Symmetry, Radius, Concavity)  %>% 
    dist()

### Part d) 
Compare your answers in parts a), b), and c). 

ANSWER: The answers in parts a), b) and c) are the same.

## Question 5
Let's take a subset of 5 observations from the breast cancer dataset. We will focus on the attributes Symmetry and Radius.  

In [None]:
set.seed(2)                           # obtain the same results given the same seed number
(subDat <- sample_n(bcDat, 5) %>%     # Taking a random sample of 5 from the bcDat dataset and calling it subDat
    select(Symmetry, Radius, Class))  # Selecting only columns symmetry, radius and class

options(repr.plot.width=6, repr.plot.height=4)
subDat %>%  
  ggplot(aes(x=Symmetry, y=Radius, color = Class)) + # making a scatterplot of symmetry and radius coloured by class
  geom_point() +
  scale_x_continuous(name = "Symmetry") +            # naming the x and y labels
  scale_y_continuous(name = "Radius")

Suppose we are interested in classifying a new observation with Symmetry 0 and Radius 0.25, but unknown Class. 

In [None]:
(newDat <- subDat %>%
    add_row(Symmetry = 0, Radius = 0.25, Class = "unknown")) # Adding the new observation to the last row of subDat and calling it newDat

### Part a) 
Using the subset of 5 observations above, classify this new observation (Symmetry = 0 and Radius = 0.25, unknown Class) using the `dist()` function for $k = 1$.

In [None]:
# ANSWER 
newDat %>%
    select(Symmetry, Radius) %>% # from the subset with the new observation, select the Symmetry and Radius columns
    dist() %>%                   # calculate the distance between every pair of observations
    as.matrix()                  # convert the result into a 6 x 6 matrix

ANSWER: 
The nearest observation to our new point is observation 1, at distance $0.57$. We see from the data table above that the class of observation 1 is malignant. Using $k = 1$ nearest neighbour, I would classify the new observation as malignant.

### Part b) 
Using the subset of 5 observations above, classify this new observation (Symmetry = 0 and Radius = 0.25, unknown Class) using the `dist()` function for $k = 3$.

ANSWER: Refer to the table above for distances. Using $k = 3$, the 3 nearest points are observations 1, 3 and 2, with distances $0.57$, $1.04$ and $1.36$ respectively. Two of these three observations are benign, so by majority vote I would classify the new observation as benign.

### Part c) 
Compare your answers in part a) and b). 

ANSWER: 
In part a) we classified the observation as malignant, but in part b) we classified the observation as benign. So our new observation's classification depends on our choice of $k$. We will discuss how to choose $k$ in later sections. 
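
The procedure used in parts a) and b) can be sketched in code. The example below is a minimal illustration on made-up data, not the worksheet's sample: `toy_train`, `new_point` and the helper `knn_vote()` are hypothetical names introduced here. It stacks the new point under the training attributes, reads its row of the `dist()` matrix, and takes a majority vote among the $k$ closest observations.

```r
# Hypothetical labelled observations (toy values, not the worksheet sample)
toy_train <- data.frame(
  Symmetry = c(0.5, -1.0, 0.2, 2.0, -0.3),
  Radius   = c(0.5, 0.0, 1.2, 2.0, -0.8),
  Class    = c("malignant", "benign", "benign", "malignant", "benign")
)
new_point <- c(Symmetry = 0, Radius = 0.25)  # the observation to classify

# Hypothetical helper: classify new_point by majority vote among its k nearest rows
knn_vote <- function(train, new_point, k) {
  # Stack the new point under the training attributes and compute all pairwise distances
  coords <- rbind(as.matrix(train[, c("Symmetry", "Radius")]), new_point)
  d <- as.matrix(dist(coords))
  # Keep the last row (the new point) and drop its zero distance to itself
  d_new <- d[nrow(coords), -nrow(coords)]
  # Indices of the k closest training observations
  nearest <- order(d_new)[seq_len(k)]
  # Count the class labels among those neighbours and return the most common one
  votes <- table(as.character(train$Class[nearest]))
  names(votes)[which.max(votes)]
}

knn_vote(toy_train, new_point, k = 1)
knn_vote(toy_train, new_point, k = 3)
```

With these toy values the single nearest neighbour is malignant while two of the three nearest are benign, so the predicted class changes with $k$. Note that `which.max()` breaks ties by taking the first label, which is a simplification chosen for this sketch.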

# Part 2 - Fruit Dataset

In the agricultural industry, cleaning, sorting, grading and packaging food products are all necessary tasks in the post-harvest process. Products are classified based on appearance, size and shape, attributes which help determine the quality of the food. Sorting can be done by humans, but it is tedious and time-consuming. Automatic sorting by machine could save time and money: images of the food products are captured and analysed to determine their visual characteristics. The [dataset](https://www.kaggle.com/mjamilmoughal/k-nearest-neighbor-classifier-to-predict-fruits/notebook) contains observations of fruit described by four features: 1) mass, 2) width, 3) height and 4) color score. The dataset "fruitDat_scaled.csv" has been scaled as part of the data preparation. Scaling will be discussed in more detail next week.

Load the appropriate packages and the "fruitDat_scaled.csv" dataset into the notebook.

In [None]:
library(readr)
fruitDat <- read_csv("data/fruitDat_scaled.csv")

## Question 1
Name the variable type of each column in the fruit dataset. 

In [None]:
glimpse(fruitDat)

Categorical: fruit label, fruit name and fruit subtype

Quantitative: mass, width, height and color score

## Question 2
Change the variable "fruit_name" to a factor and save it in your dataset. 


In [None]:
fruitDat <- fruitDat %>% 
  mutate(fruit_name = as.factor(fruit_name)) # convert fruit_name from character to factor

## Question 3
Make a scatterplot of scaled color score against scaled mass, with points coloured by fruit name.

In [None]:
options(repr.plot.width=6, repr.plot.height=4)

fruitDat %>%  
  ggplot(aes(x=scaled_mass, y= scaled_color, color = fruit_name)) + 
  scale_x_continuous(name = "Mass (scaled)") +
  scale_y_continuous(name = "Color Score (scaled)") +
  geom_point()

## Question 4
Suppose we have a new observation in the fruit dataset with scaled mass 0.5 and scaled color score 0.5, marked in black on the scatterplot below.

In [None]:
# hide plot
options(repr.plot.width=6, repr.plot.height=4)

fruitDat %>%  
  ggplot(aes(x=scaled_mass, y= scaled_color, color = fruit_name)) + 
  scale_x_continuous(name = "Mass (scaled)") +
  scale_y_continuous(name = "Color Score (scaled)") +
  geom_point() +
  annotate("point", x = 0.5, y = 0.5,   # draw the new observation once, in black
           color = "black", size = 2.5)

Just by looking at the scatterplot, how would you classify this observation based on the 3 closest neighbours? 

ANSWER: Just by looking at the scatterplot, I would classify the observation as an orange, since its 3 closest neighbours appear to be oranges.
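
A visual guess like this can be checked numerically. The sketch below is only an illustration on made-up values: `toy_fruit` is a hypothetical stand-in that borrows the column names `scaled_mass`, `scaled_color` and `fruit_name`, and its rows are not taken from "fruitDat_scaled.csv". It computes the distance from the new observation to every row directly, then keeps the 3 closest rows.

```r
# Hypothetical scaled fruit observations (made-up values, not from fruitDat_scaled.csv)
toy_fruit <- data.frame(
  scaled_mass  = c(0.4, 0.6, 0.7, -1.0, -1.2, 1.8),
  scaled_color = c(0.6, 0.3, 0.7, -0.5, 1.0, -1.0),
  fruit_name   = c("orange", "orange", "apple", "mandarin", "lemon", "apple")
)

# New observation at scaled mass 0.5 and scaled color score 0.5
new_mass  <- 0.5
new_color <- 0.5

# Euclidean distance from the new observation to every row
toy_fruit$distance <- sqrt((toy_fruit$scaled_mass - new_mass)^2 +
                           (toy_fruit$scaled_color - new_color)^2)

# The 3 rows with the smallest distance are the 3 nearest neighbours
nearest3 <- toy_fruit[order(toy_fruit$distance), ][1:3, ]
nearest3

# Majority vote among the 3 nearest neighbours
predicted <- names(which.max(table(as.character(nearest3$fruit_name))))
predicted
```

Computing distances only from the new point avoids building the full pairwise matrix that `dist()` produces, which is convenient when just one observation needs to be classified.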