<a href="https://www.bigdatauniversity.com"><img src = "https://ibm.box.com/shared/static/cw2c7r3o20w9zn8gkecaeyjhgw3xdgbj.png" width = 400, align = "center"></a>

<h1 align=center><font size = 5> K-Nearest Neighbors Classification in R: Exercise </font></h1>

### Exercise 1: Balance Scale

In this exercise, we will be using the <font color = "green">balance_scale.txt</font> data set. In this data set, each data point is classified as a balance scale that tips to the right, tips to the left, or is balanced. The variables of the data set are 'left weight', 'left distance', 'right weight', and 'right distance'. The class is determined by the greater of (right-distance x right-weight) and (left-distance x left-weight). If these two values are equal, the data point is balanced.

The objective of this exercise is twofold. First, we use the k-nearest neighbors algorithm to determine the class of the test data. Second, we use knn regression to estimate the difference between the product of right-distance and right-weight and the product of left-distance and left-weight.
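The labeling rule described above can be written as a short base R sketch. The helper name `balance_class` and the toy inputs are hypothetical and are not part of the data set or the exercise solutions; the labels R, L, and B follow the class names used later in this exercise.

```r
# Hypothetical helper: label balance-scale configurations as "R", "L" or "B"
balance_class <- function(left_weight, left_distance, right_weight, right_distance) {
  left_product  <- left_weight * left_distance
  right_product <- right_weight * right_distance
  # Compare the two products element-wise; equal products mean the scale is balanced
  ifelse(right_product > left_product, "R",
         ifelse(right_product < left_product, "L", "B"))
}

# Three toy configurations (made-up values)
toy_classes <- balance_class(left_weight    = c(1, 3, 2),
                             left_distance  = c(1, 4, 2),
                             right_weight   = c(1, 1, 5),
                             right_distance = c(1, 2, 1))
toy_classes
```

Because the class depends only on the two products, these products (and their difference) are natural candidate features for the classifier.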

### a). Install the package

We first install and load the libraries that will be used in this exercise.

In [None]:
# Install the packages 'class' and 'kknn' and load their libraries, which will be needed for their k-nearest neighbors algorithms
install.packages("class")
install.packages("kknn")
library(class)
library(kknn)

### b). Download the data set, load the data and view its structure

We can load the data set, <font color = "green">"balance_scale.txt"</font>, directly from its URL using the read.csv command, and view its structure using the str command. Alternatively, the download.file command can be used first to save a local copy. We will also look at the first few lines of the data using the head command.

In [None]:
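# Load the data and view its structure
# header = FALSE keeps the first row as data, since the raw file has no column names
# stringsAsFactors = TRUE stores the class labels as a factor regardless of the R version
balance_scale <- read.csv("https://ibm.box.com/shared/static/684jzm7e6fbbssg87yc2v4dy53dgkdew.txt", sep = ",", header = FALSE, stringsAsFactors = TRUE)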
str(balance_scale)

# View the first few rows of the data using the head function
# Note: The raw data does not contain any column names
head(balance_scale)

### c). Clean the data
Now we will clean the data by adding column names, as well as new columns for the right and left products and their differences.

In [None]:
## Your Answer Code ## 





<div align="right">
<a href="#p1" class="btn btn-default" data-toggle="collapse">Click here for the solution</a>
</div>
<div id="p1" class="collapse">
```
# Add column names
colnames(balance_scale) <- c("Class_Name","Left_Weight", "Left_Distance", "Right_Weight", "Right_Distance")
head(balance_scale)
# Note: We do not need to standardize the data in this instance since the numerical data values lie on the same scale.
# Calculate the products and differences
Right_Product <- balance_scale[,4]*balance_scale[,5]
Left_Product <- balance_scale[,2]*balance_scale[,3]
Differences <- Right_Product-Left_Product
# Add columns for Right_Product, Left_Product and Differences
balance_scale$Right_Product <- Right_Product
balance_scale$Left_Product <- Left_Product
balance_scale$Differences <- Differences
    
```
</div>

### d). Separate the data into training and test groups
Next we will split the data into two groups, which will be used for training and testing respectively. To do this, we will create a vector of the same length as our data, with approximately 70% of its values being 1's and the remaining values being 2's. The order in which the values appear will be random. The indices of the vector that hold the value 1 will be used to select our training data, while the indices that hold the value 2 will be used to select our test data.

In [None]:
## Your Answer Code ## 






<div align="right">
<a href="#p2" class="btn btn-default" data-toggle="collapse">Click here for the solution</a>
</div>
<div id="p2" class="collapse">
```
# Use the sample function to create a vector of indices that will be used to split the data
set.seed(1234)
ind <- sample(2, nrow(balance_scale), replace=TRUE, prob=c(0.7, 0.3))


# Create the training and test data from the dataset using ind
bscale.train <- balance_scale[ind==1, 6:8]
bscale.test <- balance_scale[ind==2, 6:8]



# Create the target vectors for the training and test data from the dataset using ind
bscale.trainLabels <- balance_scale[ind==1, 1]
bscale.testLabels <- balance_scale[ind==2, 1]
    
```
</div>
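To see why the split is only approximately 70/30, the same indicator approach can be tried on a toy data frame. This is a minimal sketch; the names `toy`, `ind_toy`, `toy.train`, and `toy.test` and the seed are arbitrary choices for illustration.

```r
set.seed(42)  # arbitrary seed, used only to make this sketch reproducible
toy <- data.frame(id = 1:1000, x = rnorm(1000))

# Draw a 1 or a 2 for every row; each draw is independent,
# so the observed proportions are close to, but rarely exactly, 0.7 and 0.3
ind_toy <- sample(2, nrow(toy), replace = TRUE, prob = c(0.7, 0.3))

toy.train <- toy[ind_toy == 1, ]
toy.test  <- toy[ind_toy == 2, ]

# Every row lands in exactly one of the two groups
nrow(toy.train) + nrow(toy.test) == nrow(toy)

# Observed share of rows assigned to the training group
mean(ind_toy == 1)
```

If an exact 70/30 split were required, sampling row indices without replacement would be one alternative, but the indicator approach is what this exercise uses.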

### e). Implementation of k-nearest neighbors for classification
Now we will use the knn algorithm from the 'class' package to perform our classification task. This will be done using the knn command to determine the class name (R,L,B) of our test data based on our training data and our choice of k.

In [None]:
## Your Answer Code ## 






<div align="right">
<a href="#p3" class="btn btn-default" data-toggle="collapse">Click here for the solution</a>
</div>
<div id="p3" class="collapse">
```
# Use the knn command to make predictions on the Class_Name of the test data
knn_class <- knn(train = bscale.train, test = bscale.test, cl = bscale.trainLabels, k=3)

# Find the number of incorrectly classified points
correct <- which(knn_class == bscale.testLabels, arr.ind = TRUE)
incorrect <- which(knn_class != bscale.testLabels, arr.ind = TRUE)
cat("Number of incorrectly classified points:",length(incorrect),"\n")

# Find the proportion of correctly classified points
proportion_correct <- length(correct)/length(bscale.testLabels)
cat("Proportion of correctly classified points:", proportion_correct,"\n")
    
```
</div>
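To make the idea behind the knn command concrete, here is a minimal base R sketch of k-nearest neighbors classification for a single query point. The helper `knn_vote` and the toy data are hypothetical. The sketch assumes Euclidean distance and resolves voting ties by taking the first class in alphabetical order, whereas the knn command in the 'class' package breaks ties at random, so results can differ when ties occur.

```r
# Hypothetical helper: classify one query point by majority vote
# among its k nearest training points (Euclidean distance)
knn_vote <- function(train, labels, query, k = 3) {
  train <- as.matrix(train)
  # Distance from the query to every training row
  d <- sqrt(rowSums(sweep(train, 2, query)^2))
  # Indices of the k closest training rows
  nearest <- order(d)[seq_len(k)]
  # Count the labels among those neighbors and return the most frequent one
  votes <- table(labels[nearest])
  names(votes)[which.max(votes)]
}

# Toy training data: two well-separated groups (made-up values)
toy_train  <- rbind(c(1, 1), c(1, 2), c(2, 1), c(8, 8), c(8, 9), c(9, 8))
toy_labels <- c("L", "L", "L", "R", "R", "R")

pred_a <- knn_vote(toy_train, toy_labels, query = c(2, 2), k = 3)
pred_b <- knn_vote(toy_train, toy_labels, query = c(7, 9), k = 3)
c(pred_a, pred_b)
```

The knn command applies the same idea to every row of the test data at once.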

### f). K-nearest neighbors regression using the kknn command
Regression predicts a continuous response variable rather than a class label. Here the method is similar to localized regression techniques such as LOESS: the estimate at a point is built from the observations closest to it, and $k$ plays a role comparable to the spread of the kernel function around that point. The [paper included with the kknn package documentation](http://127.0.0.1:44104/help/library/kknn/doc/paper399.pdf) is a good starting point for learning about this methodology.

In [None]:
## Your Answer Code ## 






<div align="right">
<a href="#p4" class="btn btn-default" data-toggle="collapse">Click here for the solution</a>
</div>
<div id="p4" class="collapse">
```
# Run the knn regression using the kknn command
knn_reg <- kknn(formula = Differences ~ ., train=bscale.train, test=bscale.test, k=3)

# Find the number of estimates that do not exactly match the true differences
# Note: regression estimates are continuous, so exact equality is a strict criterion
incorrect_reg <- which(knn_reg$fitted.values != bscale.test$Differences, arr.ind = TRUE)
cat("Number of estimates that differ from the true value:", length(incorrect_reg), "\n")

# Find the proportion of estimates that exactly match the true differences
correct_reg <- which(knn_reg$fitted.values == bscale.test$Differences, arr.ind = TRUE)
cat("Proportion of estimates that match the true value:", length(correct_reg)/length(bscale.test$Differences), "\n")


# Display the first few rows of the regression estimates of the differences and their true values
head(cbind(knn_reg$fitted.values,bscale.test$Differences))
    
```
</div>
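Since regression estimates are continuous, counting exact matches can understate how close the estimates are. As a complement, error measures such as the mean absolute error (MAE) and the root mean squared error (RMSE) are commonly used. The sketch below uses made-up toy vectors in place of the fitted values and true differences; the tolerance of 0.5 is an arbitrary choice for illustration.

```r
# Hypothetical toy values standing in for the regression estimates and the true differences
estimates <- c(2.5, -4.0, 0.0, 10.0)
truth     <- c(3.0, -4.0, 1.0,  8.0)

errors <- estimates - truth
mae  <- mean(abs(errors))      # mean absolute error
rmse <- sqrt(mean(errors^2))   # root mean squared error

# Share of estimates within a tolerance of the true value
within_tol <- mean(abs(errors) <= 0.5)

cat("MAE:", mae, " RMSE:", rmse, " Share within 0.5:", within_tol, "\n")
```

The same three lines could be applied to the fitted values and the true differences of the test data once the regression has been run.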

We can also determine the optimal value of k using the train.kknn command.

In [None]:
## Your Answer Code ## 







<div align="right">
<a href="#p5" class="btn btn-default" data-toggle="collapse">Click here for the solution</a>
</div>
<div id="p5" class="collapse">
```
best_reg <- train.kknn(formula = Differences ~ ., data=bscale.train, kmax=8)
best_reg$best.parameters  
```
</div>

The results indicate that the best value for k, in terms of knn regression, is k=2. A small value is plausible here because the k-nearest neighbors regression algorithm estimates the outcome as a weighted average of the k nearest neighbors, with weights set by a kernel function that gives closer neighbors more influence.
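The weighted average described above can be sketched in base R. This sketch assumes plain inverse-distance weights, which is only one possible weighting and not necessarily what kknn applies by default (kknn selects its weighting through the `kernel` argument). The helper `knn_weighted_mean` and the toy data are hypothetical.

```r
# Hypothetical helper: estimate a numeric outcome as an
# inverse-distance-weighted average of the k nearest neighbors
knn_weighted_mean <- function(train_x, train_y, query, k = 2, eps = 1e-8) {
  train_x <- as.matrix(train_x)
  # Euclidean distance from the query to every training row
  d <- sqrt(rowSums(sweep(train_x, 2, query)^2))
  nearest <- order(d)[seq_len(k)]
  # Closer neighbors get larger weights; eps avoids division by zero for exact matches
  w <- 1 / (d[nearest] + eps)
  sum(w * train_y[nearest]) / sum(w)
}

# Toy one-dimensional data where the outcome is 10 times the predictor (made-up values)
toy_x <- cbind(c(0, 1, 3, 10))
toy_y <- c(0, 10, 30, 100)

est_a <- knn_weighted_mean(toy_x, toy_y, query = 2,   k = 2)  # equally distant from x = 1 and x = 3
est_b <- knn_weighted_mean(toy_x, toy_y, query = 2.5, k = 2)  # closer to x = 3 than to x = 1
c(est_a, est_b)
```

With k=2 the estimate is pulled toward the nearer of the two neighbors, which is why a query at 2.5 lands closer to the outcome at x = 3 than to the outcome at x = 1.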

In [None]:
# Run the knn regression again using the kknn command with k=2
knn_reg2 <- kknn(formula = Differences ~ ., train=bscale.train, test=bscale.test, k=2)

# Find the number of estimates that do not exactly match the true differences
incorrect_reg2 <- which(knn_reg2$fitted.values != bscale.test$Differences, arr.ind = TRUE)
cat("Number of estimates that differ from the true value:", length(incorrect_reg2),"\n")
# Find the proportion of estimates that exactly match the true differences
correct_reg2 <- which(knn_reg2$fitted.values == bscale.test$Differences, arr.ind = TRUE)
cat("Proportion of estimates that match the true value:",length(correct_reg2)/length(bscale.test$Differences),"\n")

# Display the first few rows of the new regression estimates of the differences and their true values
head(cbind(knn_reg2$fitted.values,bscale.test$Differences))

As can be seen, the estimates of the regression using k = 2 are closer to the true values with almost double the accuracy.

### g). K-nearest neighbors classification using the kknn command

We can also use the kknn command for classification:

In [None]:
## Your Answer Code ## 











<div align="right">
<a href="#p6" class="btn btn-default" data-toggle="collapse">Click here for the solution</a>
</div>
<div id="p6" class="collapse">
```
# The kknn function can also be used for classification
knn_class2 <- kknn(formula = Class_Name ~ .,
                   train = subset(balance_scale, select=c(Class_Name,Right_Product,Left_Product,Differences))[ind==1,],
                   test = subset(balance_scale, select=c(Class_Name,Right_Product,Left_Product,Differences))[ind==2,],
                   k = 3)
# Find the number of incorrectly classified points
incorrect_class2 <- which(knn_class2$fitted.values != bscale.testLabels, arr.ind = TRUE)
cat("Number of incorrectly classified points:",length(incorrect_class2),"\n")
    
```
</div>

Now let's try finding the optimal value of k for classification.

In [None]:
best_class <- train.kknn(formula = Class_Name ~ .,
                         data = subset(balance_scale, select=c(Class_Name,Right_Product,Left_Product,Differences))[ind==1,],
                         kmax = 8)
best_class$best.parameters
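The train.kknn command chooses k by leave-one-out cross-validation: each training point is predicted from all the other training points, and the k with the lowest error is kept. The base R sketch below illustrates that idea for a plain, unweighted majority-vote classifier; it is a simplification rather than a reproduction of what train.kknn computes. The helper `loo_error` and the toy data are hypothetical.

```r
# Hypothetical helper: leave-one-out error rate of an unweighted
# majority-vote k-NN classifier for a single value of k
loo_error <- function(x, labels, k) {
  d <- as.matrix(dist(as.matrix(x)))  # pairwise Euclidean distances
  diag(d) <- Inf                      # a point may not be its own neighbor
  pred <- vapply(seq_len(nrow(d)), function(i) {
    nearest <- order(d[i, ])[seq_len(k)]
    votes <- table(labels[nearest])
    names(votes)[which.max(votes)]    # ties go to the first class alphabetically
  }, character(1))
  mean(pred != labels)
}

# Toy data: two groups, with one point in the first group carrying the other label
toy_x <- rbind(c(1, 1), c(1, 2), c(2, 1), c(2, 2),
               c(8, 8), c(8, 9), c(9, 8), c(9, 9))
toy_labels <- c("L", "L", "L", "R", "R", "R", "R", "R")

# Error rate for each candidate k, and the candidate with the lowest error
loo_errors <- sapply(1:5, function(k) loo_error(toy_x, toy_labels, k))
names(loo_errors) <- paste0("k=", 1:5)
loo_errors
names(which.min(loo_errors))
```

In this toy case a large k hurts because the neighborhood starts to include points from the other group.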

In [None]:
# Using k=1
knn_class3 <- kknn(formula = Class_Name ~ .,
                   train = subset(balance_scale, select=c(Class_Name,Right_Product,Left_Product,Differences))[ind==1,],
                   test = subset(balance_scale, select=c(Class_Name,Right_Product,Left_Product,Differences))[ind==2,],
                   k = 1)
# Find the number of incorrectly classified points
incorrect_class3 <- which(knn_class3$fitted.values != bscale.testLabels, arr.ind = TRUE)
cat("Number of incorrectly classified points:", length(incorrect_class3),"\n")

The results of using k=1 are apparently similar to those of using k=3.
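When comparing classifiers, a count of misclassified points does not show which classes are being confused. A confusion matrix built with the table command gives that breakdown. The sketch below uses made-up toy vectors in place of the predicted and true class names.

```r
# Hypothetical toy vectors standing in for predicted and true class names
classes   <- c("B", "L", "R")
predicted <- factor(c("L", "L", "R", "R", "B", "R", "L", "R"), levels = classes)
actual    <- factor(c("L", "L", "R", "R", "L", "R", "B", "R"), levels = classes)

# Rows are predictions, columns are true classes; off-diagonal cells are errors
confusion <- table(Predicted = predicted, Actual = actual)
confusion

# Overall accuracy is the share of counts that lie on the diagonal
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy
```

Fixing the factor levels in advance keeps the table square even if one class never appears among the predictions.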

### Exercise 2: Second Language Immersion Schools in Canada (Classification)

French immersion programs are designed for students who are not native French speakers or students who do not speak French at home. These programs provide instruction in the classroom entirely in French until the end of grade 3, although certain specialist subjects may be instructed in English. We are provided with a data set of Canadian schools offering French immersion programs (<font color = "green">"Second_Language_Immersion_Schools_in_Canada.csv"</font>), which contains information regarding these schools such as their names, provinces and coordinates.

Suppose we would like to determine the province in which a certain school or other place of interest is located in Canada, given that we have the coordinates (latitude and longitude) of this place. We can do this by using the k-nearest neighbor algorithm.


### a). Download the data set, load the data and view its structure

We can load the data set, <font color = "green">"Second_Language_Immersion_Schools_in_Canada.csv"</font>, directly from its URL using the read.csv command, and view its structure using the str command. Alternatively, the download.file command can be used first to save a local copy.

In [None]:
# Load the data and view the structure
schools <- read.csv("https://ibm.box.com/shared/static/uummw8ijp41gn3nfkuipi78xnalkss4c.csv", sep = ",")
str(schools)

### b). Clean the data
For this task, we only need a few variables from the data: school name (just for viewing), province, latitude, and longitude. We will also change the variable names to cleaner names where possible.

In [None]:
# Create a subset of the data containing the variables name, province, latitude, and longitude
schools.sub <- subset(schools, select=c(name, province.name..english, latitude, longitude))
head(schools.sub)
# Change the column names to cleaner names
colnames(schools.sub) <- c("name", "province", "latitude", "longitude")
head(schools.sub)

### c). Visualize the data on a map
We will now install and load packages that will enable us to visualize our data by displaying it on a map. This should make it easier to see the predictions made by the knn algorithm clearly.

In [None]:
# Installing leaflet maps
install.packages("leaflet")
library(leaflet)
# Packages used to display the maps in this notebook
library(htmlwidgets)
library(IRdisplay)

Next we will set up default parameters for the setView attribute of the map we are creating.

In [None]:
# Establish the limits of our default visualization
lower_lon = -140
upper_lon = -50
lower_lat = 40
upper_lat = 65
# Establish the center of our default visualization
center_lon = (lower_lon + upper_lon)/2
center_lat = (lower_lat + upper_lat)/2
# Set the zoom of our default visualization
zoom = 4

Now we create and display our map.

In [None]:
# Create a leaflet map
schools_map <- leaflet(schools.sub) %>%
  setView(center_lon,center_lat, zoom)%>% 
  addProviderTiles("OpenStreetMap.BlackAndWhite")%>% # set the map that we want to use as background
  addCircleMarkers(lng = schools.sub$longitude, 
                   lat = schools.sub$latitude, 
                   popup = schools.sub$name, # pop-ups will show the name of school if you click on a data point
                   fillColor = "Black", # colors of the markers will be black
                   fillOpacity = 1, # the shapes will have maximum opacity
                   radius = 4, # radius determine the size of each shape
                   stroke = F) # no stroke will be drawn in each data point

saveWidget(schools_map, file="schools_map.html", selfcontained = F) #saving the leaflet map in html
display_html(paste("<iframe src=' ", 'schools_map.html', " ' width='100%' height='400'","/>")) #display the map !


### d). Implementation of k-nearest neighbors
Now we will use the knn algorithm from the 'class' package to perform our classification task. This will be done by first separating our data into training and test groups, selecting our k parameter value, and then using them to make predictions of the provinces of our test data using the knn command.

In [None]:
## Your Answer Code ## 

















<div align="right">
<a href="#p8" class="btn btn-default" data-toggle="collapse">Click here for the solution</a>
</div>
<div id="p8" class="collapse">
```
# Use the sample function to create a vector of indices that will be used to split our data into training and test data
set.seed(1234)
ind <- sample(2, nrow(schools.sub), replace=TRUE, prob=c(0.7, 0.3)) # This creates a vector of equal length to our data, with approximately 70% of the values being 1 and the remaining values are 2  
# Create training and test sets containing only the geographical locations 
schools.train <- schools.sub[ind==1,3:4]
schools.test <- schools.sub[ind==2,3:4]
# Note: Normalizing or standardizing the geographical attributes of the data will not be helpful since values are already on the same scale 
# Create the target vectors for the training and test data from the original dataset using ind
schools.trainLabels <- schools.sub[ind==1,2]
schools.testLabels <- schools.sub[ind==2,2]
# Run the knn algorithm to classify test data points into the different provinces
prov_pred <- knn(train = schools.train, test = schools.test, cl = schools.trainLabels, k=3)

# Find the number of incorrectly classified points
correct_provinces <- which(prov_pred == schools.testLabels, arr.ind = TRUE)
incorrect_provinces <- which(prov_pred != schools.testLabels, arr.ind = TRUE)
cat("Number of incorrectly classified points:", length(incorrect_provinces),"\n")
# Find the proportion of correctly classified points
proportion_of_correct_provinces <- length(correct_provinces)/length(schools.testLabels)
cat("Proportion of correctly classified points:", proportion_of_correct_provinces,"\n")
    
```
</div>
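The knn command measures Euclidean distance on the raw latitude and longitude values. This treats one degree of longitude as covering the same ground as one degree of latitude, which is only an approximation: a degree of longitude covers less ground the further north we go. The sketch below uses the haversine formula to show the size of this effect. The helper `haversine_km` is hypothetical, and the Earth radius of 6371 km is an assumed mean value.

```r
# Hypothetical helper: great-circle distance in kilometres between
# two (latitude, longitude) points given in degrees
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(sqrt(a))
}

# Ground distance covered by one degree of latitude
one_deg_lat <- haversine_km(45, -75, 46, -75)
# Ground distance covered by one degree of longitude at 45 and at 60 degrees north
one_deg_lon_45 <- haversine_km(45, -75, 45, -74)
one_deg_lon_60 <- haversine_km(60, -75, 60, -74)

round(c(lat = one_deg_lat, lon_at_45 = one_deg_lon_45, lon_at_60 = one_deg_lon_60), 1)
```

For points within the same region the approximation is often acceptable, but it is worth keeping in mind when neighbors are chosen across a wide range of latitudes.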

### e). Plot the classification predictions of k-nearest neighbors
Now we will link the knn predictions of the provinces with the coordinates of the test data, and display them on a map.

In [None]:
# Link the knn predictions with the test data
schools.test$prov <- prov_pred
# Plot the test data with predicted provinces
color <- colorFactor(topo.colors(10), schools.test$prov)
schools_map2 <- leaflet(schools.test) %>%
  setView(center_lon,center_lat, zoom)%>% 
  addProviderTiles("OpenStreetMap.BlackAndWhite")%>% # set the map that we want to use as background
  addCircleMarkers(lng = schools.test$longitude, 
                   lat = schools.test$latitude, 
                   popup = schools.test$prov, # pop-ups will show the predicted province if you click on a data point
                   fillColor = ~color(prov),
                   fillOpacity = 1, # the shapes will have maximum opacity
                   radius = 4, # radius determine the size of each shape
                   stroke = F) # no stroke will be drawn in each data point

saveWidget(schools_map2, file="schools_map2.html", selfcontained = F) #saving the leaflet map in html
display_html(paste("<iframe src=' ", 'schools_map2.html', " ' width='100%' height='400'","/>")) #display the map !



### Thank you for completing this exercise!

Notebook created by: Dominique Warren

### References:

* [K-Nearest Neighbors Algorithm (Wikipedia)](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)
* [K-Nearest Neighbors (Statsoft)](http://www.statsoft.com/textbook/k-nearest-neighbors)
* [Information regarding the kknn package (CRAN)](https://cran.r-project.org/web/packages/kknn/kknn.pdf)

<hr>
Copyright &copy; 2017 [IBM Cognitive Class](https://cognitiveclass.ai/?utm_source=ML0151&utm_medium=lab&utm_campaign=cclab). This notebook and its source code are released under the terms of the [MIT License](https://cognitiveclass.ai/mit-license/).