# Chapter 15 K-Nearest Neighbors

## 15.1 Finding an Observation's Nearest Neighbors

**Problem**

You need to find an observation’s k nearest observations (neighbors).

In [None]:
from sklearn import datasets
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

In [2]:
# Load data
iris = datasets.load_iris()
features = iris.data

In [3]:
# Create standardizer
standardizer = StandardScaler()
features_standardized = standardizer.fit_transform(features)

# Two nearest neighbors
nearest_neighbors = NearestNeighbors(n_neighbors=2).fit(features_standardized)

# With a new observation
new_observation = [1, 1, 1, 1]

# Find distances and indices of the observation's nearest neighbors
distances, indices = nearest_neighbors.kneighbors([new_observation])

distances, indices

(array([[0.49140089, 0.74294782]]), array([[124, 110]], dtype=int64))

In [6]:
# View the nearest neighbors
features_standardized[indices]

array([[[1.03800476, 0.55861082, 1.10378283, 1.18556721],
        [0.79566902, 0.32841405, 0.76275827, 1.05393502]]])

**Discussion**

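- There are many distance metrics to choose from (Euclidean, Manhattan, Minkowski, and others), and some suit certain types of data better than others. `NearestNeighbors` selects one through its `metric` parameter.
- Because the distance is computed from all the features, the features must be on the same scale; otherwise the feature with the largest range dominates the distance. This is why we standardized the features with `StandardScaler`.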
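
As a rough illustration of the first point, the sketch below repeats the search from the solution with two different values of the `metric` parameter. The `results` dictionary is only a helper for this example and is not part of the original recipe.

```python
from sklearn import datasets
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Load and standardize the data, as in the solution above
features = datasets.load_iris().data
features_standardized = StandardScaler().fit_transform(features)

new_observation = [[1, 1, 1, 1]]

# Run the same two-neighbor search under two distance metrics
results = {}
for metric in ["euclidean", "manhattan"]:
    model = NearestNeighbors(n_neighbors=2, metric=metric)
    model.fit(features_standardized)
    distances, indices = model.kneighbors(new_observation)
    results[metric] = (distances[0], indices[0])
    print(metric, distances[0], indices[0])
```

The reported distances differ between metrics, and the selected neighbors may differ as well, so the metric is worth treating as a modeling choice rather than a fixed default.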

## 15.2 Creating a K-Nearest Neighbor Classifier

**Problem**

Given an observation of unknown class, you need to predict its class based on
the class of its neighbors.

In [8]:
# Load libraries

from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

standardizer = StandardScaler()

X_std = standardizer.fit_transform(X)

knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1).fit(X_std, y)

new_observations = [[0.75, 0.75, 0.75, 0.75],
                    [1, 1, 1, 1]]
knn.predict(new_observations)


array([1, 2])

**Discussion**
- The algorithm identifies the k observations closest to each new observation.
- Those k observations vote on the class, and the class that wins the vote is the prediction.
- `n_jobs` sets the number of CPU cores used for the neighbor search; `-1` uses all available cores.
- The `algorithm` parameter defaults to `"auto"`, so the search method (brute force, KD-tree, or ball tree) is chosen automatically.
- Setting the `weights` parameter to `"distance"` makes closer neighbors count for more in the vote.
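
The effect of the voting scheme can be inspected with `predict_proba`. The sketch below is a minimal comparison of uniform and distance-weighted voting on the same two observations used above; the variable names `uniform_knn` and `distance_knn` are introduced only for this example.

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_std = StandardScaler().fit_transform(iris.data)
y = iris.target

new_observations = [[0.75, 0.75, 0.75, 0.75],
                    [1, 1, 1, 1]]

# Every neighbor gets one vote
uniform_knn = KNeighborsClassifier(n_neighbors=5, weights="uniform").fit(X_std, y)

# Each vote is weighted by the inverse of the neighbor's distance
distance_knn = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X_std, y)

# Share of the (weighted) vote that each class receives
uniform_proba = uniform_knn.predict_proba(new_observations)
distance_proba = distance_knn.predict_proba(new_observations)

print(uniform_proba)
print(distance_proba)
```

With uniform weights each probability is a multiple of 1/k, while distance weighting shifts the shares toward the classes of the nearest neighbors.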

## 15.3 Identifying the Best Neighborhood Size

**Problem**
You need to select the best value for k in a KNN classifier.

**Solution**
Use a model selection technique such as `GridSearchCV`.

In [9]:
# Load libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

In [11]:
# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create standardizer (fitted inside the pipeline, once per training fold)
standardizer = StandardScaler()

# Create a KNN classifier
knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)

# Create a pipeline
pipe = Pipeline([
    ("standardizer", standardizer),
    ("knn", knn)
])

# Candidate values of k
search_space = [{"knn__n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]

# Pass the raw features: the pipeline standardizes them itself
classifier = GridSearchCV(pipe, search_space, cv=5, verbose=0).fit(features, target)

In [12]:
# Best neighborhood size (k)
classifier.best_estimator_.get_params()["knn__n_neighbors"]

6

**Discussion**
- In KNN, the bias-variance tradeoff is crystal clear:
- Bias is the difference between the expected value of a model's prediction and the true value. High bias commonly comes from an underfit model that is too simple for the complexity of the data.
- Variance measures how much a model's predictions change when the model is retrained on different subsets of the training data. High variance commonly comes from an overfit model that is too complex for the data.
- With n the number of observations, setting k = n gives a model with high bias and low variance: every prediction collapses to the majority class of the training set.
- Setting k = 1 gives low bias and high variance: each prediction is just the class of the single closest neighbor, so it can change radically with the training data.
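
The two extremes can be checked directly. The sketch below is an illustration on the iris data, assuming uniform voting; it fits one classifier with k = 1 and one with k = n and evaluates both on the training data itself.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
features_standardized = StandardScaler().fit_transform(iris.data)
target = iris.target
n = len(target)

# k = 1: each training point is its own nearest neighbor, so the model
# memorizes the training data (low bias, high variance)
knn_1 = KNeighborsClassifier(n_neighbors=1).fit(features_standardized, target)
train_accuracy_k1 = knn_1.score(features_standardized, target)

# k = n: every query sees the whole training set, so the vote is the same
# for every query (high bias, low variance)
knn_n = KNeighborsClassifier(n_neighbors=n).fit(features_standardized, target)
predictions_kn = knn_n.predict(features_standardized)
distinct_predictions_kn = np.unique(predictions_kn)

print(train_accuracy_k1)
print(distinct_predictions_kn)
```

Training accuracy is not a fair estimate of generalization, which is why the solution above relies on cross-validation to choose k; the point here is only that k = n produces a single constant prediction while k = 1 tracks the training data very closely.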

## 15.4 Creating a Radius-Based Nearest Neighbor Classifier

**Problem**
Given an observation of unknown class, you need to predict its class based on
the class of all observations within a certain distance.

In [15]:
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn import datasets

# Load data
iris = datasets.load_iris()
features = iris.data
target = iris.target

standardizer = StandardScaler()

features_standardized = standardizer.fit_transform(features)

rnn = RadiusNeighborsClassifier(
    radius=.5, n_jobs=-1).fit(features_standardized, target)

new_observations = [[1, 1, 1, 1]]

rnn.predict(new_observations)

array([2])

**Questions:**
What is the difference between `fit` and `fit_transform`?
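
One way to see the difference is a small sketch with `StandardScaler` on toy data (the arrays `train` and `new` are made up for this example). `fit` only learns the parameters (here the mean and scale), `transform` applies them, and `fit_transform` does both steps on the same data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration
train = np.array([[1.0], [2.0], [3.0]])
new = np.array([[4.0]])

# fit: learn mean_ and scale_ from the training data; nothing is transformed yet
scaler = StandardScaler()
scaler.fit(train)

# transform: apply the learned parameters
train_scaled_two_steps = scaler.transform(train)

# fit_transform: learn and apply in a single call
train_scaled_one_step = StandardScaler().fit_transform(train)

# New data is only transformed, reusing the parameters learned from train
new_scaled = scaler.transform(new)

print(scaler.mean_, scaler.scale_)
print(np.allclose(train_scaled_two_steps, train_scaled_one_step))
print(new_scaled)
```

In practice this means calling `fit_transform` on the training data and plain `transform` on any new observations, so that both are scaled with the same parameters.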

**Discussion:**
- We can set `outlier_label` to assign a label to observations that have no neighbors within the defined radius. This can serve as a simple outlier detection approach.
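
A minimal sketch of that idea follows. The label `-1` and the far-away observation `[10, 10, 10, 10]` are assumptions chosen for this example; scikit-learn may warn that the outlier label is not one of the training classes.

```python
from sklearn import datasets
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
features_standardized = StandardScaler().fit_transform(iris.data)
target = iris.target

# Observations with no neighbors within the radius get the label -1
rnn = RadiusNeighborsClassifier(radius=0.5, outlier_label=-1)
rnn.fit(features_standardized, target)

new_observations = [[1, 1, 1, 1],      # has a neighbor within the radius
                    [10, 10, 10, 10]]  # far from every training observation

predictions = rnn.predict(new_observations)
print(predictions)
```

Without `outlier_label`, predicting for an observation that has no neighbors within the radius raises an error, so setting it is also a way to make prediction robust to isolated points.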