# CHAPTER - 15: K-Nearest Neighbors

- K-Nearest Neighbors (KNN) is one of the simplest and most commonly used classifiers in supervised machine learning.
- KNN is considered a lazy learner: it does not train a model to make predictions. Instead, an observation is predicted to be the class held by the largest proportion of its k nearest neighbors.
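
The "largest proportion of the k nearest neighbors" rule can be sketched in a few lines of plain Python. This is only an illustrative sketch, not how scikit-learn implements it: the function name `knn_predict`, the toy points, and the labels are all made up for the example, and Euclidean distance is assumed.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Hypothetical helper: predict the majority class among the k nearest points."""
    # nothing is "trained": all distances are computed at prediction time
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    # keep the labels of the k closest observations and let them vote
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# toy data (made up for illustration)
train_points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9)]
train_labels = ["a", "a", "a", "b", "b"]

prediction_near_a = knn_predict(train_points, train_labels, (1.1, 1.0), k=3)
prediction_near_b = knn_predict(train_points, train_labels, (4.0, 4.0), k=3)
print(prediction_near_a, prediction_near_b)
```

Because the work happens at prediction time, every call scans the whole training set, which is why the tree-based search methods mentioned later in the chapter matter for larger datasets.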

## 15.1 Finding an Observation's Nearest Neighbors

Find an observation's k nearest observations (neighbors).

In [2]:
# loading libraries

from sklearn import datasets
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

In [3]:
# loading dataset

iris = datasets.load_iris()
features = iris.data

In [4]:
# creating standardizer

standardizer = StandardScaler()

In [5]:
# standardizing the features

features_standardized = standardizer.fit_transform(features)

In [6]:
# fitting a nearest neighbors model that returns the two nearest neighbors

nearest_neighbors = NearestNeighbors(n_neighbors = 2).fit(features_standardized)

In [7]:
# creating an observation

new_observation = [1, 1, 1, 1]

In [8]:
# finding the distance and indices of the observation's nearest neighbors

distances, indices = nearest_neighbors.kneighbors([new_observation])

# indices contains the locations of the observations in our dataset that are closest

In [9]:
# view the nearest neighbors

features_standardized[indices]

array([[[1.03800476, 0.55861082, 1.10378283, 1.18556721],
        [0.79566902, 0.32841405, 0.76275827, 1.05393502]]])

In [10]:
# we can set the distance metric using the metric parameter

nearestneighbors_euclidean = NearestNeighbors(n_neighbors = 2, metric = 'euclidean').fit(features_standardized)

In [11]:
# the distances variable holds the actual distance to each of the two nearest neighbors

distances

array([[0.49140089, 0.74294782]])
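
As a quick sanity check on what `distances` contains, the sketch below recomputes the Euclidean distances by hand with NumPy and then refits with `metric = 'manhattan'` for comparison. It assumes the same standardized iris features and the same new observation as above; the variable names `manual_distances` and `manhattan_distances` are introduced only for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# same setup as the cells above
features_standardized = StandardScaler().fit_transform(datasets.load_iris().data)
new_observation = np.array([[1, 1, 1, 1]])

# two nearest neighbors under the Euclidean metric
nn_euclidean = NearestNeighbors(n_neighbors=2, metric="euclidean").fit(features_standardized)
distances, indices = nn_euclidean.kneighbors(new_observation)

# recompute the same distances by hand: sqrt of the summed squared differences
manual_distances = np.linalg.norm(
    features_standardized[indices[0]] - new_observation, axis=1
)
print(np.allclose(distances[0], manual_distances))

# a different metric changes the distances (and possibly the neighbors)
nn_manhattan = NearestNeighbors(n_neighbors=2, metric="manhattan").fit(features_standardized)
manhattan_distances, _ = nn_manhattan.kneighbors(new_observation)
print(manhattan_distances)
```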

In [12]:
# we can use kneighbors_graph to create a matrix showing each observation's nearest neighbors
# (n_neighbors = 3 because each observation is also counted as its own neighbor)

nearestneighbors_euclidean = NearestNeighbors(n_neighbors = 3, metric = 'euclidean').fit(features_standardized)

In [14]:
nearest_neighbors_with_self = nearestneighbors_euclidean.kneighbors_graph(features_standardized).toarray()

In [15]:
# remove the 1's marking an observation as a nearest neighbor to itself
for i, x in enumerate(nearest_neighbors_with_self):
    x[i] = 0

In [16]:
# view the first observation's two nearest neighbors

nearest_neighbors_with_self[0]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
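
An alternative sketch: when `kneighbors_graph` is called without passing any data, scikit-learn treats the training observations as the queries and does not count an observation as its own neighbor, so asking for two neighbors directly avoids the manual clean-up loop. The variable names below are chosen for this example only.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

features_standardized = StandardScaler().fit_transform(datasets.load_iris().data)

# ask for two neighbors; no extra "self" neighbor is needed this way
nn = NearestNeighbors(n_neighbors=2, metric="euclidean").fit(features_standardized)

# no argument: each training observation is queried, excluding itself
neighbors_without_self = nn.kneighbors_graph().toarray()

# indices of the first observation's two nearest neighbors
first_observation_neighbors = np.flatnonzero(neighbors_without_self[0])
print(first_observation_neighbors)
```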

## 15.2 Creating a K-Nearest Neighbors Classifier

Given an observation of unknown class, we need to predict its class based on the classes of its neighbors.

In [17]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn import datasets

In [19]:
# loading the dataset

iris = datasets.load_iris()
X = iris.data
y = iris.target

In [20]:
# creating Standardizer

standardizer = StandardScaler()

In [21]:
# Standardizing the features

X_std = standardizer.fit_transform(X)

In [22]:
# training a KNN classifier with 5 neighbors

knn = KNeighborsClassifier(n_neighbors = 5, n_jobs = -1).fit(X_std, y)

In [23]:
# creating two observations

new_observations = [[0.75, 0.75, 0.75, 0.75],
                   [1, 1, 1, 1]]

In [24]:
# predicting the class of the two observations

knn.predict(new_observations)

array([1, 2])

The algorithm first finds the k closest observations based on some distance metric (such as Euclidean distance). These k observations then "vote" based on their class, and the class that wins the vote is the predicted class.

In [25]:
# view the probability that each observation belongs to each of the three classes

knn.predict_proba(new_observations)

array([[0. , 0.6, 0.4],
       [0. , 0. , 1. ]])
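
With the default uniform weights, these probabilities are simply the share of votes each class receives. The sketch below rebuilds them from the neighbor indices returned by `kneighbors`; it assumes the same standardized iris data and `n_neighbors = 5` as above, and `vote_fractions` is a name introduced just for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_std = StandardScaler().fit_transform(iris.data)
y = iris.target

k = 5
knn = KNeighborsClassifier(n_neighbors=k).fit(X_std, y)

new_observations = [[0.75, 0.75, 0.75, 0.75],
                    [1, 1, 1, 1]]

# indices of the k nearest training observations for each new observation
_, indices = knn.kneighbors(new_observations)

# count the votes per class (0, 1, 2) and turn them into fractions
neighbor_classes = y[indices]
vote_fractions = np.array(
    [np.bincount(row, minlength=3) / k for row in neighbor_classes]
)

# the class with the most votes is the prediction
predicted_classes = vote_fractions.argmax(axis=1)
print(vote_fractions)
print(predicted_classes)
```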

KNeighborsClassifier has a number of important parameters to consider:
   - "metric" sets the distance metric used.
   - "n_jobs" determines how many of the computer's cores to use (-1 uses all of them).
   - "algorithm" sets the method used to calculate the nearest neighbors.
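
A minimal sketch of setting these parameters explicitly is shown below. The particular values (Manhattan distance, a ball tree, a single core) are chosen only for illustration and are not recommendations; in most cases the default `algorithm = "auto"` can be left to pick a search method.

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
X_std = StandardScaler().fit_transform(iris.data)
y = iris.target

# illustrative, non-default settings for the three parameters
knn_custom = KNeighborsClassifier(
    n_neighbors=5,
    metric="manhattan",     # distance metric
    algorithm="ball_tree",  # how the nearest neighbors are searched for
    n_jobs=1,               # number of cores used for the neighbor search
).fit(X_std, y)

new_observations = [[0.75, 0.75, 0.75, 0.75],
                    [1, 1, 1, 1]]
custom_predictions = knn_custom.predict(new_observations)
print(custom_predictions)
```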

## 15.3 Identifying the Best Neighborhood Size

Select the best value for k in a k-nearest neighbors classifier.

In [27]:
# loading libraries

from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

In [29]:
# loading data

iris = datasets.load_iris()
features = iris.data
target = iris.target

In [30]:
# creating a standardizer

standardizer = StandardScaler()

In [31]:
# standardizing features

features_standardized = standardizer.fit_transform(features)

In [32]:
# creating a KNN classifier

knn = KNeighborsClassifier(n_neighbors = 5, n_jobs = -1)

In [40]:
# creating a pipeline

pipe = Pipeline([("standardizer", standardizer),("knn",knn)])

In [42]:
# creating space of candidate values

search_space = [{"knn__n_neighbors": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]}]

In [43]:
# creating grid search; the pipeline already standardizes inside each fold, so we pass the raw features

classifier = GridSearchCV(pipe, search_space, cv = 5, verbose = 0).fit(features, target)

In machine learning we try to balance bias and variance, and in few places is that trade-off as explicit as in the value of k. If k = 1, we will have low bias but high variance. If k = n, where n is the number of observations, we will have high bias but low variance.

The best model will come from finding the value of k that balances this bias-variance trade-off.

In [44]:
# best neighborhood size (k)

classifier.best_estimator_.get_params()["knn__n_neighbors"]

6
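
To see how the choice of k plays out rather than only the winner, the cross-validated score for every candidate can be read from `cv_results_`. The sketch below repeats the grid search in a self-contained form; it passes the raw features because the pipeline standardizes within each fold. Exact scores, and which k comes out on top, may vary with the scikit-learn version and the cross-validation splits.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
features, target = iris.data, iris.target

pipe = Pipeline([("standardizer", StandardScaler()),
                 ("knn", KNeighborsClassifier())])
search_space = {"knn__n_neighbors": list(range(1, 11))}

classifier = GridSearchCV(pipe, search_space, cv=5).fit(features, target)

# mean cross-validated accuracy for each candidate value of k
candidate_ks = list(classifier.cv_results_["param_knn__n_neighbors"])
mean_scores = classifier.cv_results_["mean_test_score"]
for k, score in zip(candidate_ks, mean_scores):
    print(k, round(float(score), 3))

best_k = classifier.best_params_["knn__n_neighbors"]
print("best k:", best_k)
```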

## 15.4 Creating a Radius-Based Nearest Neighbor Classifier

Given an observation of unknown class, we need to predict its class based on the classes of all observations within a certain distance.

In [45]:
# loading the libraries

from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn import datasets

In [46]:
# loading data

iris = datasets.load_iris()
features = iris.data
target = iris.target

In [49]:
# creating standardizer

standardizer = StandardScaler()

In [50]:
# Standardizing features

features_standardized = standardizer.fit_transform(features)

In [51]:
# train a radius neighbors classifier

rnn = RadiusNeighborsClassifier(radius = 0.5, n_jobs = -1).fit(features_standardized, target)

In [52]:
# creating new observations

new_observations = [[1, 1, 1, 1]]

In [53]:
# predicting the class of the new observation

rnn.predict(new_observations)

array([2])

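RadiusNeighborsClassifier is very similar to KNeighborsClassifier, with the exception of two parameters. First, "radius" sets the size of the fixed area used to decide whether an observation is a neighbor. Second, "outlier_label" sets the label given to an observation that has no other observations within the radius.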
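
The sketch below illustrates how the `radius` and `outlier_label` parameters of RadiusNeighborsClassifier behave together. The far-away observation `[100, 100, 100, 100]` and the label `-1` are made up for the example. Without an `outlier_label`, predicting an observation that has no neighbors within the radius is expected to raise an error instead.

```python
import warnings

from sklearn import datasets
from sklearn.neighbors import RadiusNeighborsClassifier
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()
features_standardized = StandardScaler().fit_transform(iris.data)
target = iris.target

# observations with no neighbors inside the radius receive the label -1
rnn = RadiusNeighborsClassifier(radius=0.5, outlier_label=-1).fit(
    features_standardized, target
)

# the second observation is deliberately far from every training point
new_observations = [[1, 1, 1, 1],
                    [100, 100, 100, 100]]

with warnings.catch_warnings():
    # scikit-learn may warn that -1 is not one of the training classes
    warnings.simplefilter("ignore")
    predictions = rnn.predict(new_observations)

print(predictions)
```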