# The beauty of kNN

Sections
1. kNN as a classifier
2. kNN as regression

The lecture draws from Chapters 2 & 3 of James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). "An Introduction to Statistical Learning: with Applications in R."

---
## 1. kNN as a classifier

For this lecture we will highlight one of the most useful supervised, non-parametric approaches to both classification & regression: the k-nearest neighbor (kNN) algorithm.

We start with kNN classification because it is conceptually a bit easier to understand. But as you will see, kNN regression follows the same logic: it just swaps the majority vote for an average.

<br>
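Remember from last lecture on classification that the goal of classification is to find the class $j$ that maximizes the conditional probability

$$P(Y=j | X = x_0)$$

where $x_0$ is the predictor value of the observation you want to classify.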

This means that if you have pairs of predictor and response variables $(x_1,y_1), ... , (x_n, y_n)$ where $X$ is a continuous, quantitative variable and $Y$ is a categorical response (e.g., $y \in \{0,1\}$), then given a new observation $x_j$, you want to predict what group of $Y$ it belongs to (i.e., $y_j=0$ or $y_j=1$?).

In the last lecture you saw how to do this with logistic regression & LDA. But there's a simple and intuitive way to find what group $x_j$ belongs to: _just ask its neighbors_.

This scenario is illustrated below. Let the yellow circles be all cases in your original set where $y_i=0$ (called "Class A") and the purple circles be observations where $y_i=1$ (called "Class B"). The question is which group does the new observation ($x_j$, red star) belong to?

<img src="imgs/L13_kNNClassifier2.png" width=500>

To determine which group the new observation belongs to, you simply ask the _k_ closest points what class of $Y$ they belong to and take the majority vote. In the example above, when _k=3_, we would say the new observation belongs to the group $y_j=1$. If we set _k=6_ then we would say it belongs to the group $y_j=0$.

kNN classification is using consensus as a statistical tool. The trick is to find the right _k_ for your problem, but once you have that then everything else is intuitive.

<br>

---
* **Question:** What about cases where there is a tie?

* **Answer:** A tie reflects points that sit perfectly on the _discrimination boundary_ discussed in the last lecture.
---

<br>

## How do you find neighbors?

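The critical part of kNN classification is finding the nearest neighbors, which requires a _distance measure_. The easiest choice is the Euclidean distance between the observation you are evaluating, $x_j$, and each of your original observations $x_i$. With $p$ predictor variables this is

$$ d_{i,j} = \sqrt{\sum_{m=1}^{p} (x_{m,i} - x_{m,j})^2} $$

With a single predictor this reduces to $|x_i - x_j|$. Ranking by the squared distance $(x_i - x_j)^2$ gives the same neighbors, because squaring does not change the ordering.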

You calculate $d$ for all $n$ observations in your training set, meaning that $d$ is a vector of distances to all the other data points. Then you sort $d$ from lowest to highest. In order to find the _k_ nearest neighbors you just find the first _k_ points in $d$ with the smallest distances. Then to determine the category that $x_j$ belongs to you just look for the most common value of $y$ in the first _k_ nearest neighbors. 
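The distance-and-sort step described above can be sketched in a few lines of plain Python. This is only an illustration: the function names `euclidean` and `k_nearest` and the toy data are made up for this example and are not part of the lecture.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points given as equal-length sequences."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def k_nearest(X_train, x_new, k):
    """Return the indices of the k training points closest to x_new."""
    # d is the vector of distances from x_new to every training observation
    d = [euclidean(x_i, x_new) for x_i in X_train]
    # sort the indices from smallest to largest distance
    order = sorted(range(len(d)), key=lambda i: d[i])
    # the first k entries are the k nearest neighbors
    return order[:k]

# Toy one-dimensional data (hypothetical values)
X_train = [(1.0,), (2.0,), (3.5,), (6.0,), (7.0,)]
neighbors = k_nearest(X_train, (3.0,), k=3)
print(neighbors)  # [2, 1, 0]
```

Each observation is stored as a tuple so the same code works when there is more than one predictor.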

In a nutshell, the _entire_ kNN algorithm is:

1. Choose k.
2. Find the distances between a value of $X$ and all other observations in your data set.
3. Choose the k smallest distances.
4. Determine the class your observation belongs to by a majority vote.
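
The four steps above can be sketched as a single function. This is a minimal sketch, not a reference implementation: the name `knn_classify` and the toy data are assumptions made for illustration, and ties in the vote are simply resolved in favor of the label seen first among the nearest neighbors.

```python
import math
from collections import Counter

def knn_classify(X_train, y_train, x_new, k):
    """Classify x_new by a majority vote among its k nearest training points."""
    # Step 2: distance from x_new to every observation in the training set
    distances = [math.dist(x_i, x_new) for x_i in X_train]
    # Step 3: indices of the k smallest distances
    nearest = sorted(range(len(distances)), key=lambda i: distances[i])[:k]
    # Step 4: majority vote over the neighbors' labels
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical data: the new observation sits at the origin.
X_train = [(0.5, 0.0), (0.0, 0.6), (-0.7, 0.0), (0.0, -1.0),
           (1.1, 0.0), (-1.2, 0.0), (3.0, 3.0), (3.0, 4.0)]
y_train = [1, 1, 0, 0, 0, 0, 1, 1]

# Step 1: choose k. The answer can change with k.
print(knn_classify(X_train, y_train, (0.0, 0.0), k=3))  # 1
print(knn_classify(X_train, y_train, (0.0, 0.0), k=6))  # 0
```

With _k=3_ two of the three closest points are in class 1, while with _k=6_ four of the six are in class 0, so the vote flips.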

<br>

## Decision boundaries

The beauty of kNN is its simplicity. You can immediately classify any new observation without making assumptions about your data. You can even estimate the optimal discrimination boundary (or decision boundary) in kNN by finding the regions of your data set where you have a tie. How can you have a tie if you set _k_ to an odd value? That happens when two or more observations are the same distance away.

As mentioned above, the case where there is a tie defines a decision boundary. You can actually use kNN to visualize the discrimination boundary relatively easily. All you do is try all possible values of $X$ and see the regions of your data space where ties occur.

Let's say that you have 12 observations of $x,y$ pairs. Half of them belong to Group 1 and half belong to Group 2. In this case $X$ is a 2 dimensional matrix (i.e., the data space of $X$ is a plane). For all pairs of $X$, $(x_{1,i}, x_{2,i})$ you ask the kNN classifier what group it would belong to. To do this you just march through your data space asking this question. The result is a map that looks like this.

![kNN Decision Space](imgs/L13_kNNDecisionBoundary.png)

Here the parts of the space that are in Group 1's territory are shown in orange while the parts of space that are in Group 2's territory are shown in blue. The regions of space where there is a perfect tie are shown as black lines. Here you can see quite clearly where the decision boundary lies. Points that fall into either of the territories are automatically members of the group that owns that territory.
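
"Marching through the data space" can be sketched by classifying every point on a regular grid. The sketch below is an assumption-laden illustration: the 12 observations, the grid limits, and the names `knn_classify` and `decision_map` are all made up, and the map is printed as text rather than drawn as a figure. The decision boundary is wherever the printed label flips between neighboring cells.

```python
import math
from collections import Counter

def knn_classify(X_train, y_train, x_new, k):
    """Majority vote among the k nearest training points."""
    distances = [math.dist(x_i, x_new) for x_i in X_train]
    nearest = sorted(range(len(distances)), key=lambda i: distances[i])[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

# 12 hypothetical observations: six in Group 1, six in Group 2
X_train = [(1, 1), (1, 2), (2, 1), (2, 2), (1, 3), (3, 1),
           (4, 4), (4, 5), (5, 4), (5, 5), (3, 5), (5, 3)]
y_train = [1] * 6 + [2] * 6

def decision_map(X_train, y_train, k, lo=0.0, hi=6.0, steps=13):
    """Classify every point on a steps-by-steps grid covering [lo, hi]^2."""
    step = (hi - lo) / (steps - 1)
    grid = []
    for r in range(steps):
        x2 = hi - r * step              # first row is the top of the plane
        row = []
        for c in range(steps):
            x1 = lo + c * step          # first column is the left edge
            row.append(knn_classify(X_train, y_train, (x1, x2), k))
        grid.append(row)
    return grid

# Print the territory of each group as a block of 1s and 2s
for row in decision_map(X_train, y_train, k=3):
    print("".join(str(label) for label in row))
```

A finer grid (larger `steps`) gives a smoother picture of the boundary at the cost of more distance calculations.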

What is nice about this approach is that this makes no assumptions about the form of $f(X)$ (unlike logistic regression) nor does it make assumptions about the nature of the underlying distributions of $X$ (as in LDA and QDA). So the decision boundary can take on very odd shapes and still be accurate.

<br>

## Variance-bias tradeoff

The art of kNN is in choosing the right value of _k_. That's because the flexibility (variability) of your classifier is solely determined by _k_. When _k_ is small, the model has more flexibility (variability). When _k_ is big, your model has a lot of bias.

Consider the three cases where _k=1, k=5,_ or _k=25_. 

![kNN k tests](imgs/L13_kNNClassifierDecisionBoundary.png)

When _k=1_, we capture a lot of the variance in the original data set, but any new predictions are at the mercy of who their closest neighbor is (even if that neighbor is noisy or different from the rest of the group). As _k_ increases, the little "islands" of one group nested within the territory of the other group start to disappear. When _k=25_ you can see a smooth function that separates _most_ of the two groups, but allows for some flexibility close to the decision boundary. This model has more bias, but is less susceptible to noise.

<br>

**So how does one find the right value of _k_?**

<br>

You tune your kNN classifier the same way you would tune any other model: evaluating error in a _training_ data set and a _test_ data set.

Estimating error in a general classifier is easier than estimating it in a regression context. A prediction is either right or it is wrong. So you can ask how many predictions you got wrong in order to determine the goodness of fit of your model. 

Now how do you determine training accuracy in the context of kNN? Well, you can iteratively go through each observation in your data set, take it out, do a kNN classification, and see if you got it right. 

To determine test set accuracy, you proceed just as we discussed in regression: take a new data set, use the training data set to do the classification, and see how many of the test set values you get right.
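
Both error estimates described above can be sketched as follows. The function names `loo_error` and `holdout_error` and the tiny data sets are hypothetical, chosen only to make the procedure concrete. Errors are reported as the fraction of wrong predictions.

```python
import math
from collections import Counter

def knn_classify(X_train, y_train, x_new, k):
    """Majority vote among the k nearest training points."""
    distances = [math.dist(x_i, x_new) for x_i in X_train]
    nearest = sorted(range(len(distances)), key=lambda i: distances[i])[:k]
    return Counter(y_train[i] for i in nearest).most_common(1)[0][0]

def loo_error(X, y, k):
    """Training error: take each observation out, classify it, count misses."""
    wrong = 0
    for i in range(len(X)):
        X_rest = X[:i] + X[i + 1:]
        y_rest = y[:i] + y[i + 1:]
        if knn_classify(X_rest, y_rest, X[i], k) != y[i]:
            wrong += 1
    return wrong / len(X)

def holdout_error(X_train, y_train, X_test, y_test, k):
    """Test error: classify new observations using only the training set."""
    wrong = sum(knn_classify(X_train, y_train, x, k) != t
                for x, t in zip(X_test, y_test))
    return wrong / len(X_test)

# Hypothetical one-dimensional data with one "noisy" label at x = 4.0
X_train = [(0.0,), (1.1,), (1.9,), (3.2,), (4.0,), (5.3,), (6.1,), (7.4,)]
y_train = [0, 0, 0, 1, 0, 1, 1, 1]
X_test = [(0.5,), (2.5,), (4.4,), (6.8,)]
y_test = [0, 0, 1, 1]

for k in (1, 3, 5):
    print(k, 1 / k, loo_error(X_train, y_train, k),
          holdout_error(X_train, y_train, X_test, y_test, k))
```

In this toy example the test point at 4.4 is misclassified when _k=1_ because its single closest neighbor is the noisy observation, but it is classified correctly when _k=3_.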

The figure below shows the training vs. test accuracies for a set of simulations. For consistency's sake with how we showed flexibility in the regression context, we plot it here against $1/K$. So as you move to the right of the x-axis, a model becomes **more** flexible (variable). As you move to the left it becomes less flexible (variable).

![kNN Variance Bias](imgs/L13_kNNVarianceBias.png)

As expected, when the model gets more flexible (i.e., _k_ gets smaller and $1/K$ gets bigger), the training set error goes down. This makes sense because when k = 1, you'll just go with your immediate neighbor most of the time. However, the test accuracy shows an inverted-U shaped function that hits the Bayes optimal solution (dashed black line) at moderate levels of _k_. Thus, it is in this range that _k_ optimally manages the variance-bias tradeoff.





---
## 2. kNN as regression

In some ways kNN as a solution to the classification problem is intuitive. But it turns out that the same logic can be applied to the regression context. Let's see the symmetries.

<br>

**The Classifier problem:**
* $\hat{f}(x_0) = P(Y=j|X=x_0) = \frac{1}{k} \sum_{i \in \mathcal{N}_0} I(y_i=j)$
* $x_0$ is the point we are trying to classify, and $\mathcal{N}_0$ is the set of its k nearest neighbors.
* $I(y_i = j)$ is the indicator function: it equals 1 if sample _i_, one of the k nearest neighbors, is in class _j_, and 0 otherwise.

<br>

**The Regression problem:**
* $\hat{f}(x_0) = \frac{1}{k} \sum_{i \in \mathcal{N}_0} y_i$
* The prediction $\hat{f}(x_0)$ is the mean of the $y$ values for the k neighbors. It estimates the expected value $E(Y|X=x_0)$ rather than a class probability.

<br>

Instead of taking a majority vote as is done in classification, for regression you simply take the average value of $Y$ among the neighbors.
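
Swapping the vote for an average is a one-line change in code. The sketch below assumes a hypothetical function name, `knn_regress`, and made-up data.

```python
import math

def knn_regress(X_train, y_train, x_new, k):
    """Predict y at x_new as the mean y of its k nearest training points."""
    distances = [math.dist(x_i, x_new) for x_i in X_train]
    nearest = sorted(range(len(distances)), key=lambda i: distances[i])[:k]
    # average the neighbors' y values instead of taking a majority vote
    return sum(y_train[i] for i in nearest) / k

# Hypothetical data with a roughly increasing trend
X_train = [(1.0,), (2.0,), (3.0,), (4.0,), (5.0,)]
y_train = [1.2, 1.9, 3.2, 3.8, 5.1]

print(knn_regress(X_train, y_train, (2.9,), k=1))  # y of the single closest point
print(knn_regress(X_train, y_train, (2.9,), k=3))  # mean of 1.9, 3.2 and 3.8
print(knn_regress(X_train, y_train, (2.9,), k=5))  # mean of every y in the data
```

Note that when _k_ equals the number of observations, every prediction is just the overall mean of $y$, which is the most biased model kNN can produce.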

It is really that simple. 

As with kNN classification, _k_ determines your variance-bias tradeoff.

For example, the image below shows the kNN prediction (blue line) when _k=1_ (left) versus when _k=9_ (right).

![kNN Regression](imgs/L13_kNNRegressionVB.png)

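As _k_ gets bigger the kNN fit gets smoother, and here it moves closer to the linear regression solution. (Taken to the extreme, _k = n_ simply predicts the mean of $Y$ everywhere.)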

The beauty of kNN as a regression model is that it is _non-parametric_, which means that it doesn't assume a particular functional form for $f(X)$. So you can fit more complex relationships, for example [nonmonotonic functions](https://en.wikipedia.org/wiki/Monotonic_function).

![kNN Curves](imgs/L13_kNNRegressionCurves.png)

In the example above, the black line is the real relationship and we see fits for when _k=1_ (blue) and _k=9_ (red).

So far we've shown predictions where there is a single $X$ variable, but this process works for when there is more than one dimension of $X$.

![Multivariate kNN](imgs/L13_kNNRegression2D.png)

In the image above we see the case where _k=9_ for a data set with _n=64_ and _p=2_. Each orange dot shows one $(y_i, x_{1,i}, x_{2,i})$ observation from the data set. The surface shows the kNN prediction of $Y$ at each combination of $X_1$ and $X_2$.

## The Curse of Dimensionality

In the lecture on OLS regression, we mentioned the idea of the dimensionality of your model. In this case it is defined as the ratio of predictor variables (p) to observations (n). As _p_ approaches _n_ (or gets bigger than _n_), we run into the biggest limitation of kNN. **kNN regression fails when the dimensionality of the problem ($\frac{p}{n}$) gets too high.**

Consider the case of looking at test set accuracy in a kNN regression model where we have the same _n_ observations, but we keep increasing _p_.

![Curse of Dimensionality](imgs/L13_kNNRegressionDimensionality.png)

Notice how at low and moderate values of p (when dimensionality is low), the mean squared error of the kNN model hovers around that of the linear regression model (dashed black line). However, as _p_ increases you can still see the curve showing the variance-bias tradeoff, but the absolute error keeps increasing. 

This is because you will begin to find situations where observations do not have enough nearby neighbors. Therefore the method has to stretch great distances to work and that's when things really fall apart.
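
One way to see neighbors drifting apart is to hold _n_ fixed and measure how far each point is from its closest neighbor as _p_ grows. The simulation below is a sketch under assumptions not taken from the lecture: predictors are drawn uniformly from the unit interval, _n=100_, and the function name `mean_nearest_neighbor_distance` is made up.

```python
import math
import random

def mean_nearest_neighbor_distance(n, p, seed=0):
    """Average distance from each of n random points to its nearest neighbor."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    # n observations, each with p predictors drawn uniformly from [0, 1)
    X = [[rng.random() for _ in range(p)] for _ in range(n)]
    total = 0.0
    for i, x_i in enumerate(X):
        # distance to the closest *other* observation
        total += min(math.dist(x_i, x_j) for j, x_j in enumerate(X) if j != i)
    return total / n

# Same n, increasing p: the "nearest" neighbor gets farther and farther away
for p in (1, 2, 10, 50):
    print(p, round(mean_nearest_neighbor_distance(n=100, p=p), 3))
```

With the number of observations held constant, the nearest neighbor in 50 dimensions is much farther away than in one or two, so the "local" average that kNN relies on is no longer local.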

## When to use kNN over linear regression?

As a general rule, parametric approaches like linear regression will outperform non-parametric approaches (like kNN) if the assumptions of the parametric model are met. In that case the parametric model will also require less computational time to find the best fit. However, the more you deviate from those assumptions, the more a non-parametric method is preferred. **That is, unless you also have a high-dimensional data problem**.