# k-Nearest Neighbors

Given a positive integer $K$ and a test observation $x_0$, the KNN classifier first identifies the $K$ points in the training data that are closest to $x_0$, represented by $N_0$. It then estimates the conditional probability of class $j$, written $\hat{f}(x_0)$, as the fraction of the training responses in $N_0$ that equal $j$:

\begin{align}
\hat{f}(x_0)=\frac{1}{K}\sum_{i\in N_0}I(y_i = j)
\end{align}

Suppose we choose K = 3. Then KNN will identify the three observations that are closest to the point we want to predict.
Assume that among these neighbours, 2 are blue points and 1 is orange, so the estimated $\hat{f}(x_0)$ is 2/3 for the blue class and 1/3 for the orange class. Hence KNN will assign the point to the blue class.
<img src="images/3.png" width="200">
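
The K = 3 example can be reproduced in a few lines. This is a minimal sketch: the list `neighbor_labels` is a hypothetical stand-in for the labels of the three nearest neighbours, not data from the figure.

```python
from collections import Counter

# Hypothetical labels of the K = 3 nearest neighbours of x_0
neighbor_labels = ["blue", "blue", "orange"]
K = len(neighbor_labels)

# Estimated probability of class j: (1/K) * sum over N_0 of I(y_i = j)
class_probs = {label: count / K for label, count in Counter(neighbor_labels).items()}

# The predicted class is the one with the largest estimated probability
predicted = max(class_probs, key=class_probs.get)

print(class_probs)
print(predicted)
```

Here the blue class receives an estimate of 2/3 and the orange class 1/3, so the prediction is blue.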

- The optimal value for K will depend on the **bias-variance trade-off**.
 - A smaller value of K provides a more flexible fit, which will 
 have low bias but high variance. In the extreme case K = 1, this variance arises because 
 the prediction in a given region is entirely dependent on just one observation.
 - A larger value of K provides a smoother and less variable fit, which will 
 have high bias but low variance. 
 The prediction in a region is given by the class proportions among several observations, 
 and so changing one observation has a smaller effect. 
 However, the smoothing may cause bias by masking some of the structure in $f(X)$.
 - As K decreases, the model becomes more flexible. Just as in the regression setting, 
 the training error rate consistently declines as the flexibility increases, 
 but the test error exhibits a U-shape, declining at first and increasing again when the model becomes excessively flexible and overfits.
 
<img src="images/4.png" width="500">
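
The trade-off described above can be explored with a small simulation. The sketch below is illustrative only: the two-class dataset (points inside the unit circle, with 10% of labels flipped), the helper names `make_data` and `knn_predict`, and the chosen values of K are all assumptions made for this example, not the data behind the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # Hypothetical two-class problem: class 1 lies inside the unit circle,
    # and 10% of the labels are flipped to add noise
    X = rng.uniform(-2, 2, size=(n, 2))
    y = (np.sum(X ** 2, axis=1) < 1.0).astype(int)
    flip = rng.random(n) < 0.1
    y[flip] = 1 - y[flip]
    return X, y

def knn_predict(X_train, y_train, X_query, k):
    # Pairwise squared Euclidean distances, shape (n_query, n_train)
    d2 = ((X_query[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    # Indices of the k nearest training points for every query point
    nearest = np.argsort(d2, axis=1)[:, :k]
    # Fraction of neighbours in class 1; predict class 1 when it exceeds 1/2
    return (y_train[nearest].mean(axis=1) > 0.5).astype(int)

X_train, y_train = make_data(200)
X_test, y_test = make_data(1000)

errors = {}
for k in (1, 5, 15, 51, 151):
    train_err = np.mean(knn_predict(X_train, y_train, X_train, k) != y_train)
    test_err = np.mean(knn_predict(X_train, y_train, X_test, k) != y_test)
    errors[k] = (train_err, test_err)
    print(f"K={k:3d}  train error={train_err:.3f}  test error={test_err:.3f}")
```

With K = 1 every training point is its own nearest neighbour, so the training error is exactly zero even though the labels are noisy. That is the overfitting end of the flexibility range.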

# Comparison of Linear Regression with KNN

**The parametric approach will outperform the non-parametric approach if the parametric form that has been selected is close to the true form of $f$, because the non-parametric approach incurs a cost in variance that is not offset by a reduction in bias.**

<img src="images/37.png" width="500">

- KNN performs slightly worse than linear regression when the relationship is linear, but much better than linear regression for non-linear situations.

<img src="images/36.png" width="500">
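
A rough way to see both cases is to fit a straight line and a KNN regression to simulated data. This is a sketch under stated assumptions: the two true functions (`2 + 3x` and `sin(3x)`), the noise level, K = 5, and the helper names `knn_regress` and `compare` are hypothetical choices for illustration. The error is measured against the noise-free function, so the irreducible error is excluded.

```python
import numpy as np

rng = np.random.default_rng(1)

def knn_regress(x_train, y_train, x_query, k):
    # Absolute distances between each query point and each training point (1-D predictor)
    dist = np.abs(x_query[:, None] - x_train[None, :])
    nearest = np.argsort(dist, axis=1)[:, :k]
    # KNN regression: average the responses of the k nearest neighbours
    return y_train[nearest].mean(axis=1)

def compare(f, n_train=200, k=5, noise_sd=0.1):
    x_train = rng.uniform(-2, 2, n_train)
    y_train = f(x_train) + rng.normal(0, noise_sd, n_train)
    x_test = np.linspace(-2, 2, 500)
    # Least-squares straight line (np.polyfit returns the highest degree first)
    slope, intercept = np.polyfit(x_train, y_train, 1)
    lin_pred = intercept + slope * x_test
    knn_pred = knn_regress(x_train, y_train, x_test, k)
    # Mean squared error against the noise-free f
    truth = f(x_test)
    return np.mean((lin_pred - truth) ** 2), np.mean((knn_pred - truth) ** 2)

lin_linear, knn_linear = compare(lambda x: 2 + 3 * x)
lin_curved, knn_curved = compare(lambda x: np.sin(3 * x))

print(f"linear truth: linear regression={lin_linear:.4f}  KNN={knn_linear:.4f}")
print(f"curved truth: linear regression={lin_curved:.4f}  KNN={knn_curved:.4f}")
```

When the truth is linear, the straight line has almost no bias and very little variance, so KNN cannot beat it. When the truth is curved, the bias of the straight line dominates and KNN comes out ahead.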


**For multi-dimensional data, the increase in dimension has only caused a small deterioration in the linear regression test set MSE, but it has caused more than a ten-fold increase in the MSE for KNN.**

This decrease in performance as the dimension increases is a common problem for KNN, and results from the fact that in higher dimensions there's effectively a reduction in sample size. For example, in this dataset, there are 100 training observations; when p=1, this provides enough information to accurately estimate $f$. However, spreading 100 observations over p=20 dimensions results in a phenomenon in which a given observation has no **nearby neighbors**. This is the so-called **curse of dimensionality**. That is, the K observations that are nearest to a given test observation $x_0$ may be very far away from $x_0$ in p-dimensional space when p is large, leading to a very poor prediction of $f(x_0)$ and hence a poor KNN fit.
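
The loss of nearby neighbors can be checked numerically. The sketch below assumes 100 training observations drawn uniformly from the unit hypercube, which is a hypothetical setup rather than the dataset discussed above, and reports the average distance from a test point to its single nearest training observation.

```python
import numpy as np

rng = np.random.default_rng(2)
n_train, n_test = 100, 200

mean_nn_dist = {}
for p in (1, 2, 5, 10, 20):
    # Assumed setup: points drawn uniformly from the unit hypercube [0, 1]^p
    X_train = rng.uniform(0, 1, size=(n_train, p))
    X_test = rng.uniform(0, 1, size=(n_test, p))
    # Euclidean distance from every test point to every training point
    dist = np.sqrt(((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2))
    # Average distance to the single nearest training observation
    mean_nn_dist[p] = dist.min(axis=1).mean()
    print(f"p={p:2d}  mean nearest-neighbour distance = {mean_nn_dist[p]:.3f}")
```

With the sample size held at 100, the nearest neighbour drifts further away as p grows, so the "local" average that KNN relies on is no longer local.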

**As a general rule, parametric methods will tend to outperform non-parametric approaches when there's a small number of observations per predictor.**

**Even in problems in which the dimension is small, we might prefer linear regression to KNN from an interpretability standpoint. If the test MSE of KNN is only slightly lower than that of linear regression, we might be willing to forego a little bit of prediction accuracy for the sake of a simple model that can be described in terms of just a few coefficients, and for which p-values are available.**

# KNN algorithm

In [10]:
'''
kNN: k Nearest Neighbors

Input:      X_test: input vector to classify (length N)
            X_train: training dataset of M known vectors (M x N)
            y_train: target labels for training dataset (length-M vector)
            k: number of neighbors to use 

Output:     y_pred: predicted class label for X_test
'''

import numpy as np
import operator

def knn(X_train, y_train, X_test, k):
    # Get number of rows in the whole X_train dataset
    train_size = X_train.shape[0]
    
    # numpy.tile(arr, repetitions): constructs a new array by repeating arr the given number of times
    # Tile the input vector to be the same shape as the training set 
    # Get the difference between elements in each observation and elements in the input vector
    diffMat = np.tile(X_test, (train_size, 1)) - X_train
    
    # Square each element-wise difference (first step of the Euclidean distance)
    sqDiffMat = diffMat ** 2
    
    # After .sum over the feature axis, sqdistances is a length-M vector.
    sqdistances = sqDiffMat.sum(axis=1)
    
    # distances is also a length-M vector. Each element is the Euclidean distance from one observation to the input vector.
    distances = sqdistances ** 0.5
    
    # argsort: returns the indices that would sort an array.
    # sortedDistIndicies[0] is the index of the smallest value in the distances vector
    sortedDistIndicies = distances.argsort()
    
    # Create a dictionary to count how many of the k nearest neighbors fall in each class
    classCount = {}
    
    for i in range(k):
        # Look up the label of the i-th nearest neighbor and increment its count
        labelType = y_train[sortedDistIndicies[i]]
        classCount[labelType] = classCount.get(labelType, 0) + 1
    
    # .items() exposes the dictionary as (labelType, count) pairs: [(labelType1, count), (labelType2, count)]
    # operator.itemgetter(1) picks the count from each pair, so the pairs are sorted by count, largest first
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    
    # Return the labelType with the most elements among the k nearest neighbors
    y_pred = sortedClassCount[0][0]
    print('The predicted class for test data is: ', y_pred)
    return y_pred

In [11]:
# Test case
X_train = np.array([[1, 2, 3, 4, 5],
                    [3, 4, 5, 6, 7],
                    [4, 5, 6, 7, 8],
                    [5, 6, 7, 8, 9],
                    [6, 7, 8, 9, 10]])
y_train = np.array(['a', 'b', 'a', 'b', 'a'])
X_test = np.array([4, 3, 4, 3, 6])
k = 3

knn(X_train, y_train, X_test, k)

The predicted class for test data is:  a


'a'
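
The same steps can also be written more compactly. The following is an alternative sketch, not a replacement for the function above: it relies on NumPy broadcasting instead of `np.tile` and on `collections.Counter` for the majority vote. The name `knn_broadcast` is introduced here for illustration only.

```python
from collections import Counter

import numpy as np

def knn_broadcast(X_train, y_train, x_query, k):
    # Broadcasting subtracts x_query from every row of X_train without np.tile
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of the k nearest neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

X_train = np.array([[1, 2, 3, 4, 5],
                    [3, 4, 5, 6, 7],
                    [4, 5, 6, 7, 8],
                    [5, 6, 7, 8, 9],
                    [6, 7, 8, 9, 10]])
y_train = np.array(['a', 'b', 'a', 'b', 'a'])
X_test = np.array([4, 3, 4, 3, 6])

result = knn_broadcast(X_train, y_train, X_test, 3)
print(result)
```

On the test case above, the three nearest rows carry the labels a, b and a, so this version also returns `'a'`.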