<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Introduction to Classification with K-Nearest Neighbors

_Instructor: Aymeric Flaisler_

---

### Learning Objectives
- Understand the difference between classification and regression models
- Understand the K-Nearest Neighbors algorithm visually and in pseudocode
- Explain the differences between distance metrics and explore the two most common
- Apply KNN classification to the Wisconsin breast cancer dataset
- Explain the effect of choosing K on the bias-variance tradeoff

### Lesson Guide
- [Introduction: classification vs. regression](#intro)
- [K-Nearest Neighbors visually](#knn-visual-intro)
- [The K-Nearest Neighbors (KNN) algorithm](#knn)
    - [Note on parametric vs. nonparametric methods](#nonparametric)
- [The KNN distance metric](#distance)
    - [Euclidean distance](#euclidean)
    - [Manhattan distance](#manhattan)
- [Load the Wisconsin breast cancer dataset](#wisconsin)
    - [Rename columns and subset the data](#rename-subset)
    - [Encode the target as a binary class](#target)
- [Examine the correlation structure of the dataset](#correlations)
    - [Use a heatmap](#heatmap)
    - [Use a pairplot](#pairplot)
- [Using sklearn's `KNeighborsClassifier` and `StratifiedKFold`](#kneighborsclassifier)
    - [Create the target and predictors](#target-predictors)
    - [Standardize the predictor matrix](#standardize)
    - [Write a function to manually perform the cross-validation procedure](#manual-cv)
    - [Calculate the "baseline" accuracy](#baseline)
    - [Cross-validate the mean accuracy with 5 neighbors](#cv-knn5)
    - [Cross-validate the mean accuracy with 1 neighbor](#cv-knn1)
- [Visualize the KNN decision boundary](#visualize-knn)
- [How is bias-variance affected by number of neighbors?](#bias-variance)
- [Additional resources](#resources)

In [None]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

<a id='intro'></a>

## Introduction: regression vs. classification

We've discussed the difference between **continuous and discrete** variables. We've predicted **continuous numbers** using regression. 

Now, what about **discrete ones**?

Think back to the wine quality dataset we used in the past. We used Linear Regression to predict the quality ranging from 0-10. What if we just wanted to predict whether wine was **good or bad**? **Red or white**? 

Classification algorithms do just that: they **predict categories, or classes**. They split the data into groups and place new data into those groups.

How does it work visually? Are we still plotting values? Where is the y here?

![](http://ipython-books.github.io/images/ml.png "Best Split vs Best Fit")

### Pair question: 
- Recall how a linear regression works (what is the minimisation problem?)
- How can we use it for a classification problem? Would it be a good idea?

### Regression vs. Classification

Regression is used to predict continuous values. Classification is used to predict which class a data point is part of (discrete value).

- Example 1: I have a home with X bedrooms, Y sq ft, Z lot size. What is the price of this home?
- Example 2: I have an unknown fruit that is 5.5 inches long, 2 inches in diameter, and yellow. What is this fruit?

<a id='knn-visual-intro'></a>

---

### K-Nearest Neighbors (KNN) visually

KNN works similarly to how we humans might choose to **classify things**. 

Below we have some red and blue dots:
- A new dot appears without a color and we need to decide which color it is most likely going to be
![image.png](attachment:image.png)

We compare it to its three nearest neighbors – its neighbors are more often red, so we label it red:
![Alt text](http://blog.yhat.com/static/img/knn_new_point_pred.png "3 Nearest Neighbors")

What if we increase the number of neighbors to consider to 5?
![Alt text](http://blog.yhat.com/static/img/knn_new_point_pred_blue.png "5 Nearest Neighbors")

This is in essence the K-Nearest Neighbors (KNN) algorithm. The K represents the number of "neighbors" you use.

> ***Images above credited to the yhat blog.***

<a id='knn'></a>

## The KNN algorithm

---

K-Nearest Neighbors takes a **different approach to modeling** than we have been practicing with linear models. In order to estimate a value (regression) or class membership (classification), the algorithm finds the observations in its training data that are **nearest** to the observation to predict. It then averages or takes a vote of those training observations' target values to estimate the value for the new data point.

Distance is usually calculated using the **Euclidean distance**. The "K" in KNN refers to the number of nearest neighbors that contribute to the prediction.

Today we will be looking at KNN only in the context of classification.

**Checkout (in pair):** What could be the difference between a linear regression and the KNN algorithm for regression?

**The KNN can be concisely represented with pseudocode:**

Procedure KNN(x):
- begin
    - looping through all known data points in training data, find the closest k points to x
    - assign class: majority classification among the k closest points
- end

> **Note**: in the case of ties, sklearn's `KNeighborsClassifier()` will just choose the first class (when weights are uniform). If this is unappealing to you, you can change the `weights` keyword argument to `'distance'`.
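
The pseudocode above can be sketched in plain Python. This is a minimal illustration rather than how sklearn implements it: the function name `knn_predict` and the toy points are made up for the example, and distance is assumed to be Euclidean.

```python
import math
from collections import Counter

def knn_predict(train_points, train_labels, x, k):
    """Classify x by majority vote among its k nearest training points."""
    # Distance from x to every training point, paired with that point's label
    distances = [(math.dist(p, x), label)
                 for p, label in zip(train_points, train_labels)]
    # Keep the k closest points
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote among the neighbors' labels
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up toy data: three red dots and three blue dots
train_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
                (5.0, 5.0), (5.5, 4.5), (6.0, 5.0)]
train_labels = ['red', 'red', 'red', 'blue', 'blue', 'blue']

print(knn_predict(train_points, train_labels, (2.0, 2.0), k=3))
print(knn_predict(train_points, train_labels, (5.2, 5.0), k=3))
```

Note that all of the work happens at prediction time: there is no separate fitting step beyond storing the training data.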

### EXAMPLES AND APPLICATIONS

Consider determining if an individual is going to default on their loan. Age and Loan are the two numerical variables (predictors) and Default is the target:
![image.png](attachment:image.png)

**Checkout (in pair):** How does our k affect our bias-variance tradeoff?

## ADVANTAGES AND DRAWBACKS

### Benefits
- Simple to understand and explain
- Model training is fast
- Can be used for classification and regression
- Can capture non-linear relationships, which are common in practice (imagine age vs. income)

### Drawbacks
- Must store all of the training data
- Prediction phase can be slow when n is large
- Sensitive to irrelevant features
- Sensitive to the scale of the data
- Accuracy is (generally) not competitive with the best supervised learning methods

<a id='nonparametric'></a>
### Note on parametric vs. nonparametric methods

Thus far, all of our tests and methods have been **parametric**. That is, we have assumed a certain **distribution for our data**. In linear regression our parameters are the coefficients in our model, and our estimate of the target is calculated from these parameters.

There are alternatives for cases where we cannot, or choose not to, assume a particular distribution for our data. Methods that make **no assumptions** about the distribution of the data are called **nonparametric**.

For nearly every parametric test, there is a nonparametric analog available. The KNN model is an example of a **nonparametric model**: there are no coefficients for the different predictors, and our estimate is not represented by a formula of our predictor variables.

<a id='distance'></a>
## The KNN distance metric

---
KNN typically uses one of two distance metrics: Euclidean or Manhattan. Other distance metrics are possible, but rarer (sometimes it makes sense to create your own distance function).

<a id='euclidean'></a>
### Euclidean distance

Recall the famous Pythagorean Theorem:
![Alt text](http://ncalculators.com/images/pythagoras-theorem.gif)

We can apply the theorem to calculate distance between points. This is called Euclidean distance. 

![Alt text](http://rosalind.info/media/Euclidean_distance.png)

### $$\text{Euclidean distance}=\sqrt{(x_1-x_2)^2+(y_1-y_2)^2}$$

There are many different distance metrics, but Euclidean is the most common (and default in sklearn).
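
As a quick check of the formula, here is a minimal sketch in plain Python. The helper name `euclidean_distance` is made up for this example; it works for points with any number of coordinates.

```python
import math

def euclidean_distance(a, b):
    """Square root of the summed squared differences between coordinates."""
    return math.sqrt(sum((a_i - b_i) ** 2 for a_i, b_i in zip(a, b)))

# A 3-4-5 right triangle: the straight-line distance is the hypotenuse
print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```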


---

<a id='manhattan'></a>
### Manhattan distance (A.K.A Taxicab Distance)

Another way to measure distance between two points is to take the sum of the absolute value of their differences. 

### $$ D = \sum_{i=1}^n | x_i - y_i | $$

The name Manhattan distance comes from the fact that taxicabs in Manhattan must drive from point A to point B on streets that force traffic to flow forward or backwards and left or right -- but never diagonally. 
![](https://upload.wikimedia.org/wikipedia/commons/thumb/d/de/Manhattan_distance_bgiu.png/261px-Manhattan_distance_bgiu.png)
![](https://pbs.twimg.com/media/CgIlqLTWEAAedKB.jpg)
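
The same two points used for a 3-4-5 triangle give a different answer under this metric. A minimal sketch in plain Python, with a made-up helper name `manhattan_distance`:

```python
def manhattan_distance(a, b):
    """Sum of the absolute coordinate differences."""
    return sum(abs(a_i - b_i) for a_i, b_i in zip(a, b))

# Three blocks across plus four blocks up; no diagonal shortcut
print(manhattan_distance((0, 0), (3, 4)))  # 7
```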

**Note that the Manhattan distance is a less common choice.**
- Manhattan distance is more restrictive than Euclidean distance in how distance is measured
- [Manhattan distance comes from $L_{p = 1}$ space and Euclidean distance comes from $L_{p = 2}$ space.](https://en.wikipedia.org/wiki/Lp_space)
- In practice, we can cross-validate KNN using both types of distances to see which performs best. 

![](http://www.improvedoutcomes.com/docs/WebSiteDocs/image/diagram_euclidean_manhattan_distance_metrics.gif)

- The default distance we tend to use is the L2 or Euclidean one. 
    - Has very nice properties (symmetric, spherical, treats all dimensions equally, works well in optimisation problems under constraints)
    - Sensitive to extreme differences in single attribute

<a id='wisconsin'></a>

## Load the Wisconsin breast cancer dataset

---

Below we will be testing out the KNN classification algorithm on the classic [Wisconsin breast cancer dataset](https://www.kaggle.com/uciml/breast-cancer-wisconsin-data).

> **Note:** The file has the suffix '.data' but is actually formatted as a .csv.

In [1]:
bcw = pd.read_csv('./datasets/wdbc.data', header=None, index_col=None)

In [2]:
bcw.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
0,842302,M,17.99,10.38,122.8,1001.0,0.1184,0.2776,0.3001,0.1471,0.2419,0.07871,1.095,0.9053,8.589,153.4,0.006399,0.04904,0.05373,0.01587,0.03003,0.006193,25.38,17.33,184.6,2019.0,0.1622,0.6656,0.7119,0.2654,0.4601,0.1189
1,842517,M,20.57,17.77,132.9,1326.0,0.08474,0.07864,0.0869,0.07017,0.1812,0.05667,0.5435,0.7339,3.398,74.08,0.005225,0.01308,0.0186,0.0134,0.01389,0.003532,24.99,23.41,158.8,1956.0,0.1238,0.1866,0.2416,0.186,0.275,0.08902
2,84300903,M,19.69,21.25,130.0,1203.0,0.1096,0.1599,0.1974,0.1279,0.2069,0.05999,0.7456,0.7869,4.585,94.03,0.00615,0.04006,0.03832,0.02058,0.0225,0.004571,23.57,25.53,152.5,1709.0,0.1444,0.4245,0.4504,0.243,0.3613,0.08758
3,84348301,M,11.42,20.38,77.58,386.1,0.1425,0.2839,0.2414,0.1052,0.2597,0.09744,0.4956,1.156,3.445,27.23,0.00911,0.07458,0.05661,0.01867,0.05963,0.009208,14.91,26.5,98.87,567.7,0.2098,0.8663,0.6869,0.2575,0.6638,0.173
4,84358402,M,20.29,14.34,135.1,1297.0,0.1003,0.1328,0.198,0.1043,0.1809,0.05883,0.7572,0.7813,5.438,94.44,0.01149,0.02461,0.05688,0.01885,0.01756,0.005115,22.54,16.67,152.2,1575.0,0.1374,0.205,0.4,0.1625,0.2364,0.07678


---

<a id='rename-subset'></a>
### Rename columns and subset the data

The attributes below are the columns of the dataset.

The column names are taken from the dataset info file. For more information check out the information file:

`./datasets/wdbc.names`

You can open it with a text editor of your choice.

      Attribute                     
    --------------------------------------------
    1. Sample code number [subject ID]
    2. Class
    3. Cell nucleus mean radius
    4. Texture mean
    5. Perimeter mean
    6. Area mean
    7. Smoothness mean
    8. Compactness mean
    9. Concavity mean
    10. Concave points mean
    11. Symmetry mean
    12. Fractal dimension mean
    13. Cell nucleus SE radius
    14. Texture SE
    15. Perimeter SE
    16. Area SE
    17. Smoothness SE
    18. Compactness SE
    19. Concavity SE
    20. Concave points SE
    21. Symmetry SE
    22. Fractal dimension SE
    23. Cell nucleus worst radius
    24. Texture worst
    25. Perimeter worst
    26. Area worst
    27. Smoothness worst
    28. Compactness worst
    29. Concavity worst
    30. Concave points worst
    31. Symmetry worst
    32. Fractal dimension worst
   
**Using the provided list, reassign the column names in the dataset.**

In [4]:
column_names = ['id','malignant',
                # fields 3-12: mean of each measurement
                'nucleus_mean','texture_mean','perimeter_mean','area_mean',
                'smoothness_mean','compactness_mean','concavity_mean',
                'concave_pts_mean','symmetry_mean','fractal_dim_mean',
                # fields 13-22: standard error of each measurement
                'nucleus_se','texture_se','perimeter_se','area_se',
                'smoothness_se','compactness_se','concavity_se',
                'concave_pts_se','symmetry_se','fractal_dim_se',
                # fields 23-32: "worst" (largest) value of each measurement
                'nucleus_worst','texture_worst','perimeter_worst','area_worst',
                'smoothness_worst','compactness_worst','concavity_worst',
                'concave_pts_worst','symmetry_worst','fractal_dim_worst']


**Remove the standard error "_se" and the "_worst" measurement columns:**

You should be left with only the mean measurement columns.

In [8]:
# A:

---
<a id='target'></a>
### Encode the target class variable `malignant` to be a binary 0 vs. 1

The `malignant` class target variable is coded as "B" for benign and "M" for malignant. 

We need to recode this to a binary integer for classification:
 - Encode malignant as 1
 - Encode benign as 0
 
Malignant is assigned to 1 because our goal is to predict malignant tumors with the data. In binary classification problems the category "of interest" to predict is typically encoded as 1.
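
One possible way to do this kind of recoding is with pandas' `Series.map` and a dictionary. The sketch below uses a made-up toy series rather than the dataset itself.

```python
import pandas as pd

# Hypothetical toy labels standing in for the dataset's class column
labels = pd.Series(['M', 'B', 'B', 'M'])

# Map each letter code to an integer; 'M' is the class of interest
encoded = labels.map({'B': 0, 'M': 1})
print(encoded.tolist())  # [1, 0, 0, 1]
```

Any value missing from the dictionary would become `NaN`, which makes unexpected codes easy to spot.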

In [None]:
# A:

<a id='correlations'></a>
## Examine the correlation structure of the dataset

---

You should exclude the `id` column, as this is just an identifier for the subject and carries no predictive information.

<a id='heatmap'></a>
### Method 1: plot a heatmap of the correlation matrix

Plot a seaborn heatmap of the correlation matrix to visually examine which variables are correlated and anti-correlated, and to what degree.

In [None]:
# A:

<a id='pairplot'></a>
### Method 2: Use seaborn's pairplot to visualize relationships between variables

When you have a small number of predictor variables, seaborn's `pairplot` function will give you a more detailed visual look at the relationships between variables. The pairplot is similar to a correlation matrix, but displays scatterplots of variable pairs. Along the diagonal line are histograms showing the distribution of each variable.

One of the most appealing aspects of the pairplot function for classification tasks is that the scatterplots and histograms can be split along a hue variable. If we use the `malignant` target class as the hue we are able to see how the classes are distributed across these variables as well.

Plot data using seaborn's `pairplot()` function. The hue will be the class variable "malignant". The variables will be the other columns excluding, of course, the subject ID column. This function can take some time to run.

> **Note:** Most of these predictors are highly correlated with the "class" variable. This is already an indication that our classifier is very likely to perform well.

In [None]:
# A:

<a id='kneighborsclassifier'></a>

## Using sklearn's `KNeighborsClassifier` and `StratifiedKFold`

---

Let's see how the sklearn KNN classifier performs on our dataset predicting the malignant target class using cross-validation.

Load the KNN classifier like so:
```python
from sklearn.neighbors import KNeighborsClassifier
```

**We are going to set some arguments when instantiating the model:**
1. **n_neighbors** specifies how many neighbors will vote on the class
2. **weights** uniform weights indicate that all neighbors have the same weight
3. **metric** and **p**: when distance is minkowski (the default) and p == 2 (the default), _this is equivalent to the euclidean distance metric_
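
As a rough illustration of these arguments, the sketch below fits the classifier on a made-up toy dataset. `weights`, `metric` and `p` are written out at their default values; `n_neighbors=3` is an arbitrary choice for the example.

```python
from sklearn.neighbors import KNeighborsClassifier

# Made-up toy data: two well-separated groups in two dimensions
X = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
     [5.0, 5.0], [5.5, 4.5], [6.0, 5.0]]
y = [0, 0, 0, 1, 1, 1]

# minkowski with p=2 is the Euclidean distance; p=1 would be Manhattan
knn = KNeighborsClassifier(n_neighbors=3, weights='uniform',
                           metric='minkowski', p=2)
knn.fit(X, y)

print(knn.predict([[2.0, 2.0], [5.2, 5.0]]))
```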

Also load sklearn's `StratifiedKFold` from the `model_selection` module:
```python
from sklearn.model_selection import StratifiedKFold
```

The `StratifiedKFold` object will return cross-validation _indices_ which you can use to subset your data (in a for loop, for example) that runs the model and tests it. 

This is the **stratified** version of the `KFold` class. Stratification ensures that the proportions of each target class are approximately equal across the train-test folds. This is a best practice in classification tasks.

> **Note:** The `cross_val_score` function can also stratify for you. However, you should get familiar with using indices for cross-validation on data. Being able to do cross-validation at a more "manual" level allows for a lot more power and customization. It also reinforces what is happening in your head during cross-validation, since you have to divide up the data yourself with the indices!



In [None]:
# Let's reload the data:
bcw = pd.read_csv('./datasets/wdbc.data', header=None, index_col=None)
# The file stores the 10 mean columns first, then the 10 SE columns, then the 10 "worst" columns
column_names = ['id','malignant',
                'nucleus_mean','texture_mean','perimeter_mean','area_mean',
                'smoothness_mean','compactness_mean','concavity_mean',
                'concave_pts_mean','symmetry_mean','fractal_dim_mean',
                'nucleus_se','texture_se','perimeter_se','area_se',
                'smoothness_se','compactness_se','concavity_se',
                'concave_pts_se','symmetry_se','fractal_dim_se',
                'nucleus_worst','texture_worst','perimeter_worst','area_worst',
                'smoothness_worst','compactness_worst','concavity_worst',
                'concave_pts_worst','symmetry_worst','fractal_dim_worst']
bcw.columns = column_names
# Keep the id, the target and the mean measurements; copy so the assignment below acts on its own frame
bcw = bcw[[c for c in bcw.columns if '_worst' not in c and '_se' not in c]].copy()
# Encode benign as 0 and malignant as 1
bcw['malignant'] = bcw['malignant'].map(lambda x: 0 if x == "B" else 1)
print(bcw.malignant.value_counts())
bcw.head(2)

In [None]:
# A:
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

<a id='target-predictors'></a>
### Create your target vector and predictor matrix

The target should be the binary `malignant` column. The predictors are up to you.

In [None]:
# A:

<a id='standardize'></a>

### Standardize the predictor matrix

Standardization should be done for the predictors when using a KNN model. Why? 

Remember that KNN finds the nearest neighbors according to a distance metric. If the predictors are left unstandardized, then it is possible that some predictors will have an unfair impact on the distance measure simply because they are on a larger scale than the other variables.
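
To make the scale problem concrete, here is a small sketch with made-up numbers: one feature measured in the hundreds (like an area) and one measured in tenths (like a smoothness). It assumes sklearn's `StandardScaler`, which centers each column to mean 0 and scales it to standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows: [area-like feature, smoothness-like feature]
X = np.array([[1000.0, 0.10],
              [1010.0, 0.30],
              [ 400.0, 0.10],
              [ 410.0, 0.30]])

# Rows 0 and 1 differ slightly in the first column, hugely (for its scale) in the second
raw_dist = np.linalg.norm(X[0] - X[1])

Xs = StandardScaler().fit_transform(X)
scaled_dist = np.linalg.norm(Xs[0] - Xs[1])

print('raw distance:   ', raw_dist)
print('scaled distance:', scaled_dist)
```

Before scaling, the distance between rows 0 and 1 is almost entirely the 10-unit gap in the first column. After scaling, the gap in the second column counts in proportion to that column's own spread.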

In [None]:
# A:
from sklearn.preprocessing import StandardScaler

<a id='cv-inds'></a>
### Create the cross-validation indices using `StratifiedKFold`

`StratifiedKFold` is instantiated with `n_splits`, the number of train-test pairs desired.

The built-in `.split()` method will take a predictor matrix and target array and return the training and testing indices.

> **Note:** The `split` method will return a python *generator*. This can be iterated, but works differently from a list in that once it has been iterated over it will be "empty". You can convert the output to a list with `list()` if you need to use the indices multiple times.
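
A minimal sketch of this on made-up data (20 rows, a quarter of them in the positive class). The choices `shuffle=True` and `random_state=0` are arbitrary and only there to make the example repeatable.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Made-up data: 20 rows, 5 positives and 15 negatives
X = np.arange(40).reshape(20, 2)
y = np.array([1] * 5 + [0] * 15)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Materialize the generator so the folds can be reused later
cv_indices = list(skf.split(X, y))

for train_idx, test_idx in cv_indices:
    # Each test fold keeps the same class balance as the full target
    print(len(train_idx), len(test_idx), y[test_idx].mean())
```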

In [None]:
# A:

<a id='manual-cv'></a>
### Write a function to manually perform cross-validation using your stratified indices

Now that we have the indices (row indexes for our train-test splits), write a function that will:
- Split the X and y into training and testing subsets
- Fit a KNN classifier on the training set
- Calculate the accuracy of the classifier on the test set
- Store the accuracies for each fold and return them as a list
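
One possible shape for such a function is sketched below. It is not the only way to write it: the name `knn_cv_accuracies` is made up, and the data is a synthetic stand-in from `make_classification` rather than the breast cancer dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold
from sklearn.neighbors import KNeighborsClassifier

def knn_cv_accuracies(X, y, cv_indices, n_neighbors=5):
    """Fit a KNN on each training fold and score it on the matching test fold."""
    accuracies = []
    for train_idx, test_idx in cv_indices:
        knn = KNeighborsClassifier(n_neighbors=n_neighbors)
        knn.fit(X[train_idx], y[train_idx])
        # For classifiers, .score() returns the mean accuracy
        accuracies.append(knn.score(X[test_idx], y[test_idx]))
    return accuracies

# Synthetic stand-in data (an assumption for this sketch)
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_indices = list(skf.split(X, y))

accs = knn_cv_accuracies(X, y, cv_indices, n_neighbors=5)
print(accs, np.mean(accs))
```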

In [None]:
# A:

<a id='baseline'></a>
### Calculate the "baseline" accuracy

Before we can evaluate whether our classifier's accuracy is good or bad, we need to know the baseline accuracy.

**The baseline accuracy is the proportion of the majority class.**

For a binary classification, this means that the baseline accuracy is the percent of the dataset that is labeled malignant or benign, depending on whichever of malignant or benign is greater. This can be calculated:

```python
baseline = np.mean(y)  # if np.mean(y) is >= 0.5
baseline = 1. - np.mean(y) # if np.mean(y) is < 0.5
```

**It is critical that you know your baseline accuracy!**

If your dataset had, for example, 95 1's and 5 0's and you got 95% accuracy using KNN, you might conclude that your classifier is doing great if you had not looked at your baseline accuracy. In fact, it's doing no better than a classifier that always guesses the majority class: guessing only 1's would also give 95% accuracy.
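
The 95-to-5 example can be worked through directly. A minimal sketch, assuming `y` is a 0/1 array:

```python
import numpy as np

# The example above: 95 ones and 5 zeros
y = np.array([1] * 95 + [0] * 5)

# Proportion of the majority class, whichever class that is
baseline = max(np.mean(y), 1 - np.mean(y))
print(baseline)  # 0.95
```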

In [None]:
# A:

<a id='cv-knn5'></a>
### Cross-validate the mean accuracy for a KNN model with 5 neighbors

In [None]:
# A:

<a id='cv-knn1'></a>
### Cross-validate the mean accuracy for a KNN model with 1 neighbor

As you can see the mean cross-validated accuracy is very high with 5 neighbors. 

Let's see what it's like when we use only 1 neighbor:

In [None]:
# A:

<a id='visualize-knn'></a>

## Visualize the KNN decision boundary

---

Even with 1 neighbor we can do quite well predicting the malignant observations.

Below you can load an interactive KNN visualization that shows how the decision boundary of KNN changes as the number of neighbors changes.

The `KNNBoundaryPlotter` class has 4 required arguments:

    KNNBoundaryPlotter(data, predictor1, predictor2, class_target)
    
It will by default fit a visualization of the decision boundary across 1 to 100 nearest neighbors.

The boundary is where the classifier will vote for malignant vs. benign classes. 

In [None]:
# import imp
# plotter = imp.load_source('plotter', './knn_plotter.py')
# from plotter import KNNBoundaryPlotter

# kbp = KNNBoundaryPlotter(bcw, 'area_mean', 'symmetry_mean', 'malignant', nn_range=range(1,101))

# kbp.knn_mesh_runner()
# kbp.knn_interact()

<a id='bias-variance'></a>
### How does increasing the number of neighbors impact the bias and variance of your model?
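
One way to explore this question is to compare training and test accuracy across several values of K. The sketch below uses synthetic, deliberately noisy data from `make_classification` (an assumption for illustration, not the breast cancer data); the K values are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; flip_y randomly flips about 10% of the labels
X, y = make_classification(n_samples=400, n_features=5, flip_y=0.1,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0)

scores = {}
for k in (1, 5, 25, 101):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    # (training accuracy, test accuracy) for this k
    scores[k] = (knn.score(X_train, y_train), knn.score(X_test, y_test))
    print(k, scores[k])
```

With K = 1 every training point is its own nearest neighbor, so training accuracy is perfect even though some labels are noise: the model has low bias but high variance. As K grows, predictions are averaged over more neighbors, which smooths the decision boundary and trades variance for bias.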

In [None]:
# A:

<a id='resources'></a>

## Additional resources

---


- Scott Fortmann-Roe's [breakdown](http://scott.fortmann-roe.com/docs/BiasVariance.html) (required) of the bias-variance tradeoff, featuring a discussion of KNN, is an excellent read
- A [detailed discussion](https://saravananthirumuruganathan.wordpress.com/2010/05/17/a-detailed-introduction-to-k-nearest-neighbor-knn-algorithm/) of KNN
- A long, applied example of KNN applied to [image classification](http://cs231n.github.io/classification/)
- Read the SKLearn [documentation](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) on implementing KNN
- Choosing the right [algorithm from SKLearn](http://scikit-learn.org/stable/tutorial/machine_learning_map/)
- A deeper dive into [Euclidean distance](http://www.econ.upf.edu/~michael/stanford/maeb4.pdf)
- Classifier comparison from [SKLearn](http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html) (this is also in our [repository](https://github.com/ga-students/DSI-DC-2/blob/master/curriculum/Week-04/4.01%20Intro%20to%20Classification/classification-methods.py))