# kNN CLASSIFICATION

#### CASE:

In this case study, we will use the kNN (k-Nearest Neighbors) supervised machine learning algorithm to develop a predictive model. 

For this case study, we want to use machine learning to make classification predictions about student data. Suppose we are an institution offering an online course for students, perhaps Machine Learning 101. We have offered this online course in the past, and we want to evaluate measures we can take to improve the learning experience and student success. To do so, we have decided to provide increased advising and tutoring to students who are deemed "at risk" of not completing the course successfully. However, we need a systematic method to classify students as "critical" or "no problem." We turn to k-Nearest Neighbors to detect the similarities and differences between students who are "critical" and students who are "no problem."


*Note: This data is completely fictional, and if we were modeling this in a professional/academic setting, we would probably include more than two factors. The purpose of this demonstration is to illustrate, at a very basic level, the kNN learning algorithm.*

*The terms provided here are used generally. In a professional/academic setting, we would need to define these terms in precise and current language.*

## INSTALL AND LOAD PACKAGES

In [17]:
install.packages("class")
install.packages("tidyverse")

library("class", quietly = TRUE)
library("tidyverse", quietly = TRUE)

Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
“installation of package ‘tidyverse’ had non-zero exit status”
Updating HTML index of packages in '.Library'
Making 'packages.html' ... done
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats


## LOAD DATA (both training and test sets)

In [7]:
# Read CSV into R
trainSet <- read.csv(file="https://raw.githubusercontent.com/BenCroft/Portfolio/master/Data/TRAIN.csv", header=TRUE, sep=",")
testSet <- read.csv(file="https://raw.githubusercontent.com/BenCroft/Portfolio/master/Data/TEST.csv", header=TRUE, sep=",")


Let's get a quick glimpse of our two datasets.

In [8]:
head(trainSet)
head(testSet)

CLASS,GPA,Grade
Critical,3.0,74
Critical,2.4,70
Critical,2.7,72
Critical,2.9,85
Critical,2.5,90
Critical,2.6,95


CLASS,GPA,Grade
No Problem,4.0,94
Critical,2.7,89
No Problem,3.8,81
No Problem,3.8,92
No Problem,3.2,92
Critical,2.1,90


### We need to rescale both variables to the same range (0 - 100), so that kNN weighs them equally.

In [9]:
trainSet$GPA <- trainSet$GPA * 25
testSet$GPA <- testSet$GPA * 25
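
Multiplying by 25 works here because GPA tops out at 4 while Grade tops out at 100. The same idea can be written as a small rescaling helper. This is only a sketch: `rescale_to_100` and its `lower`/`upper` arguments are hypothetical names, and the 0-4 GPA scale is an assumption about the fictional data.

```r
# Hypothetical helper: map values from a known range [lower, upper] onto 0-100.
rescale_to_100 <- function(x, lower, upper) {
  (x - lower) / (upper - lower) * 100
}

# Assuming GPA is reported on a 0-4 scale, this is equivalent to multiplying by 25.
gpa <- c(3.0, 2.4, 2.7)
rescaled_gpa <- rescale_to_100(gpa, lower = 0, upper = 4)
rescaled_gpa
```

For the three GPA values above (the first three training rows), the helper gives 75, 60, and 67.5.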

### Let's take a quick glance at both the training set and the test set

In [93]:
head(trainSet,3)
head(testSet,3)

CLASS,GPA,Grade
Critical,75.0,74
Critical,60.0,70
Critical,67.5,72


CLASS,GPA,Grade
No Problem,100.0,94
Critical,67.5,89
No Problem,95.0,81


In [94]:
table(trainSet$CLASS)
table(testSet$CLASS)


  Critical No Problem 
        46         87 


  Critical No Problem 
         6          4 

The above summaries show us that there are two status categories for students in the class: "Critical" and "No Problem." There are a total of 133 students in the training set, and we want to classify the status of the 10 students in the test set.

******
## Selecting the "k" in "knn"

The k-Nearest Neighbors algorithm takes each data point we want to classify (here, each student in the test set) and looks for its nearest neighbors in the training set (read: the most similar data points). The `knn` function uses Euclidean distance to measure the proximity of a point to its neighbors.
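
To make the distance measure concrete, here is a minimal sketch that computes the Euclidean distance between the first rescaled training row and the first rescaled test row shown earlier. The helper name `euclidean` is hypothetical and is not part of the `class` package.

```r
# Hypothetical helper: Euclidean distance between two numeric vectors.
euclidean <- function(a, b) {
  sqrt(sum((a - b)^2))
}

train_point <- c(GPA = 75, Grade = 74)   # first training row after rescaling
test_point  <- c(GPA = 100, Grade = 94)  # first test row after rescaling

distance <- euclidean(train_point, test_point)
distance
```

The two students differ by 25 points on GPA and 20 points on Grade, so the distance is the square root of 1025, about 32.02.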

When we select our k-level, we are telling the algorithm how many nearest neighbors to consult for each data point. If our k is set to 13, for example, the algorithm classifies a test student by examining the outcomes of the 13 training students closest to it in Euclidean distance. Whichever status gets the "majority vote" among these 13 neighbors is what the algorithm predicts for that student's status.
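
The majority vote can be sketched from scratch in base R. This is an illustration under stated assumptions, not how `class::knn` is implemented: `knn_one`, `toy_x`, and `toy_labels` are hypothetical names, the toy data is made up, and this sketch lets the first label win a tie (the `class` documentation describes ties in `knn` as broken at random).

```r
# Hypothetical from-scratch kNN for a single new point (illustration only).
knn_one <- function(train_x, train_labels, new_point, k) {
  # Distance from the new point to every training row
  diffs <- sweep(as.matrix(train_x), 2, new_point)
  distances <- sqrt(rowSums(diffs^2))
  # Labels of the k closest training rows
  nearest <- order(distances)[seq_len(k)]
  votes <- table(train_labels[nearest])
  # Label with the most votes (the first label wins a tie in this sketch)
  names(votes)[which.max(votes)]
}

# Made-up toy data on the same 0-100 scales
toy_x <- data.frame(GPA   = c(60, 65, 70, 90, 95, 100),
                    Grade = c(70, 72, 75, 90, 92, 94))
toy_labels <- c("Critical", "Critical", "Critical",
                "No Problem", "No Problem", "No Problem")

knn_one(toy_x, toy_labels, new_point = c(68, 74), k = 3)
```

The three training rows closest to the point (68, 74) are all labeled "Critical", so the vote returns "Critical".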

There is no "perfect" k-level to fit datasets. Each dataset is unique, and either a higher or lower k-level may do the trick.

### knn Predictions with k = 1

In [60]:
# Predict the test set outcomes from the labeled training data (k defaults to 1; set explicitly here)
predictions_1 <- knn(trainSet[-1], testSet[-1], trainSet$CLASS, k = 1)

In [61]:
# Create a vector that contains the actual outcomes for the test set
class_actual_1 <- testSet$CLASS

In [62]:
# Construct a two-way table that compares the k = 1 predictions with the actual outcomes
table(predictions_1, class_actual_1)

             class_actual_1
predictions_1 Critical No Problem
   Critical          5          0
   No Problem        1          4

In [63]:
# Compute the overall accuracy of the model (proportion of correct predictions)
mean(predictions_1 == class_actual_1)

Using a k=1 model, we see that our algorithm correctly classified 9 out of 10 of the test students. Why? Because in 9 out of 10 cases, the single closest training student had the same status as the test student: "critical" students tend to cluster together, as do "no problem" students. In one case, the model incorrectly classified a student as "no problem" when the student was really "critical."
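
The accuracy can also be read directly off the two-way table. The sketch below rebuilds the k = 1 table by hand from the printed counts (the name `confusion` is hypothetical). It also computes the share of truly "critical" students that the model caught, since those are the students we want to reach.

```r
# Rebuild the k = 1 two-way table from the counts printed above.
# matrix() fills column by column: first the "Critical" column, then "No Problem".
confusion <- matrix(c(5, 1, 0, 4), nrow = 2,
                    dimnames = list(predicted = c("Critical", "No Problem"),
                                    actual    = c("Critical", "No Problem")))

# Correct predictions sit on the diagonal.
accuracy <- sum(diag(confusion)) / sum(confusion)

# Share of actual "Critical" students that were predicted "Critical".
critical_caught <- confusion["Critical", "Critical"] / sum(confusion[, "Critical"])

accuracy
critical_caught
```

Here the accuracy is 9/10 and the model catches 5 of the 6 "critical" students.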

### knn Predictions with k = 5

In [64]:
predictions_5 <- knn(trainSet[-1], testSet[-1], trainSet$CLASS, k = 5)

In [65]:
class_actual_5 <- testSet$CLASS

In [66]:
table(predictions_5, class_actual_5)

             class_actual_5
predictions_5 Critical No Problem
   Critical          6          0
   No Problem        0          4

In [67]:
mean(predictions_5 == class_actual_5)

By setting our k to 5, we correctly classified all 10 students in the test set!

### knn Predictions with k = 8

In [89]:
predictions_8 <- knn(trainSet[-1], testSet[-1], trainSet$CLASS, k = 8)

In [90]:
class_actual_8 <- testSet$CLASS

In [91]:
table(predictions_8, class_actual_8)

             class_actual_8
predictions_8 Critical No Problem
   Critical          5          0
   No Problem        1          4

In [95]:
mean(predictions_8 == class_actual_8)

By increasing our k to 8, we see the accuracy drop back to 9 out of 10: one "critical" student is again predicted as "no problem."

### Algorithm to identify the best "k"

In [97]:
numberLoops <- nrow(testSet)
numberLoops

We will run the model 10 times, once for each k from 1 to 10, and report the accuracy on the test set each time.

In [174]:
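# The actual outcomes do not change between runs, so store them once
class_actual <- testSet$CLASS

for (i in seq_len(numberLoops)) {
    # Classify the test set using the i nearest training neighbors
    predictions <- knn(trainSet[-1], testSet[-1], trainSet$CLASS, k = i)
    accuracy <- mean(predictions == class_actual)
    cat("The accuracy of k =", i, "is", accuracy*100.0, "percent", "\n")
}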

The accuracy of k = 1 is 90 percent 
The accuracy of k = 2 is 100 percent 
The accuracy of k = 3 is 100 percent 
The accuracy of k = 4 is 100 percent 
The accuracy of k = 5 is 100 percent 
The accuracy of k = 6 is 100 percent 
The accuracy of k = 7 is 100 percent 
The accuracy of k = 8 is 90 percent 
The accuracy of k = 9 is 90 percent 
The accuracy of k = 10 is 90 percent 
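
Rather than reading the best values off the printout by eye, the accuracies can be stored in a vector and queried. In this sketch the accuracies are copied by hand from the output above, and `accuracy_by_k` and `best_k` are hypothetical names. In the notebook itself, the vector could instead be filled inside the loop.

```r
# Accuracies copied from the printed output above, one entry per k = 1..10.
accuracy_by_k <- c(90, 100, 100, 100, 100, 100, 100, 90, 90, 90)

# Every k that ties for the highest accuracy
best_k <- which(accuracy_by_k == max(accuracy_by_k))
best_k
```

This returns the values 2 through 7.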


### It looks like the best number of nearest neighbors for this classification is any k from 2 to 7.

If we set our k to a number in this range, the model correctly classifies every student in our test set as "critical" or "no problem," allowing us to target our efforts toward assisting students in need.