## Chapter 15
---
# K-Nearest Neighbors

An observation is predicted to belong to the class held by the largest proportion of its k nearest observations.

## 15.1 Finding an Observation's Nearest Neighbors

In [2]:
from sklearn import datasets
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
features = iris.data

standardizer = StandardScaler()

features_standardized = standardizer.fit_transform(features)

nearest_neighbors = NearestNeighbors(n_neighbors=2).fit(features_standardized)
# nearest_neighbors_euclidean = NearestNeighbors(n_neighbors=2, metric='euclidean').fit(features_standardized)
new_observation = [1, 1, 1, 1]

distances, indices = nearest_neighbors.kneighbors([new_observation])

features_standardized[indices]

array([[[1.03800476, 0.56925129, 1.10395287, 1.1850097 ],
        [0.79566902, 0.33784833, 0.76275864, 1.05353673]]])

### Discussion

How do we measure distance?

* Euclidean
$$
d_{euclidean} = \sqrt{\sum_{i=1}^{n}{(x_i - y_i)^2}}
$$

* Manhattan
$$
d_{manhattan} = \sum_{i=1}^{n}{|x_i - y_i|}
$$

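* Minkowski (scikit-learn's default; with $p = 2$ it is the Euclidean distance, with $p = 1$ the Manhattan distance)
$$
d_{minkowski} = \left(\sum_{i=1}^{n}{|x_i - y_i|^p}\right)^{\frac{1}{p}}
$$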
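
The three formulas above can be checked numerically. The following is a minimal sketch using NumPy; the helper `minkowski_distance` and the two points `x` and `y` are made up for illustration and are not part of the recipe.

```python
import numpy as np


def minkowski_distance(x, y, p):
    """Minkowski distance of order p between two equal-length vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1 / p)


# Two illustrative points (hypothetical values chosen for this example)
x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 6.0, 3.0])

# Direct implementations of the Euclidean and Manhattan formulas
euclidean = np.sqrt(np.sum((x - y) ** 2))
manhattan = np.sum(np.abs(x - y))

# Minkowski with p=2 and p=1 should reproduce them
print(euclidean, minkowski_distance(x, y, p=2))
print(manhattan, minkowski_distance(x, y, p=1))
```

In `NearestNeighbors`, the `metric` and `p` parameters select among these distances.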
## 15.2 Creating a K-Nearest Neighbor Classifier

In [3]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data
y = iris.target

standardizer = StandardScaler()

X_std = standardizer.fit_transform(X)

knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1).fit(X_std, y)

new_observations = [[0.75, 0.75, 0.75, 0.75],
                   [1, 1, 1, 1]]

knn.predict(new_observations)

array([1, 2])

### Discussion
In KNN, given an observation $x_u$ with an unknown target class, the algorithm first identifies the k closest observations (sometimes called $x_u$'s neighborhood) according to some distance metric. These k observations then "vote" based on their class, and the class that wins the vote is $x_u$'s predicted class. More formally, the probability that $x_u$ belongs to some class j is:
$$
\frac{1}{k} \sum_{i \in v}{I(y_i = j)}
$$
where v is the set of k observations in $x_u$'s neighborhood, $y_i$ is the class of the ith observation, and I is an indicator function (i.e., 1 if true, 0 otherwise). In scikit-learn we can see these probabilities using `predict_proba`.
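
As a sketch of the voting formula, the example below compares the output of `predict_proba` with proportions computed by hand from each neighborhood. It assumes the default uniform weighting of neighbors; the variable names `manual` and `neighborhood` are introduced here for illustration only.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_std = StandardScaler().fit_transform(iris.data)
y = iris.target

k = 5
knn = KNeighborsClassifier(n_neighbors=k).fit(X_std, y)

new_observations = [[0.75, 0.75, 0.75, 0.75],
                    [1, 1, 1, 1]]

# Class probabilities reported by scikit-learn
probabilities = knn.predict_proba(new_observations)

# The same proportions computed by hand: count each class among
# the k neighbors of every new observation and divide by k
_, indices = knn.kneighbors(new_observations)
manual = np.array([
    np.bincount(y[neighborhood], minlength=len(knn.classes_)) / k
    for neighborhood in indices
])

print(probabilities)
print(manual)
```

Each row holds one observation's probabilities, ordered as in `knn.classes_`, and the predicted class is the column with the largest value.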

## 15.3 Identifying the Best Neighborhood Size

In [4]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
features = iris.data
target = iris.target

# Create the standardizer; the pipeline fits it within each cross-validation fold
standardizer = StandardScaler()

knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)

pipe = Pipeline([("standardizer", standardizer), ("knn", knn)])

# Candidate values for k
search_space = [{"knn__n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]

# Pass the raw features: the pipeline already standardizes them
classifier = GridSearchCV(
    pipe, search_space, cv=5, verbose=0).fit(features, target)

classifier.best_estimator_.get_params()["knn__n_neighbors"]

6

## 15.4 Creating a Radius-Based Nearest Neighbor Classifier
Given an observation of unknown class, you need to predict its class based on the classes of all observations within a certain distance.

In [5]:
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn import datasets

iris = datasets.load_iris()
features = iris.data
target = iris.target

standardizer = StandardScaler()
features_standardized = standardizer.fit_transform(features)

rnn = RadiusNeighborsClassifier(
    radius=.5, n_jobs=-1).fit(features_standardized, target)

new_observations = [[1, 1, 1, 1]]

rnn.predict(new_observations)

array([2])
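
To make the radius-based vote concrete, the sketch below looks up which training observations fall inside the radius with `radius_neighbors` and tallies their classes by hand. It also sets `outlier_label`, which tells the classifier what to return when no observation lies within the radius. The value `-1` and the point `far_observation` are illustrative assumptions, not part of the recipe.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
features_standardized = StandardScaler().fit_transform(iris.data)
target = iris.target

# outlier_label=-1 is an illustrative choice for "no neighbors in range"
rnn = RadiusNeighborsClassifier(
    radius=.5, outlier_label=-1).fit(features_standardized, target)

new_observations = [[1, 1, 1, 1]]

# Indices of the training observations within the radius
distances, indices = rnn.radius_neighbors(new_observations)
neighbor_classes = target[indices[0]]

# Majority vote among those neighbors
manual_vote = np.bincount(neighbor_classes).argmax()
print(len(neighbor_classes), manual_vote, rnn.predict(new_observations))

# A hypothetical point far from all training data receives the outlier label
far_observation = [[10, 10, 10, 10]]
far_prediction = rnn.predict(far_observation)
print(far_prediction)
```

Unlike a fixed k, the number of voters varies with the query point, so the choice of `radius` determines how many observations influence each prediction.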