In [1]:
import numpy as np
import sklearn

# A. Finding the nearest neighbors
In Chapter 1, we mentioned that clustering is a method for grouping together similar data observations. Another method for finding similar data observations is the nearest neighbors approach. With this approach, we find the k most similar data observations (i.e. neighbors) for a given data observation (where k represents the number of neighbors).

In scikit-learn, we implement the nearest neighbors approach with the NearestNeighbors object (part of the neighbors module).

The code below fits a NearestNeighbors object on a dataset (data) and then finds the 5 nearest neighbors of a new data observation (new_obs) among the rows of that dataset.

In [3]:
data = np.array([
  [5.1, 3.5, 1.4, 0.2],
  [4.9, 3. , 1.4, 0.2],
  [4.7, 3.2, 1.3, 0.2],
  [4.6, 3.1, 1.5, 0.2],
  [5. , 3.6, 1.4, 0.2],
  [5.4, 3.9, 1.7, 0.4],
  [4.6, 3.4, 1.4, 0.3],
  [5. , 3.4, 1.5, 0.2],
  [4.4, 2.9, 1.4, 0.2],
  [4.9, 3.1, 1.5, 0.1]])

from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors()
nbrs.fit(data)
new_obs = np.array([[5, 3.5, 1.6, 0.3]])
# kneighbors returns a tuple: (distances to the neighbors, indices of the neighbors)
dists, knbrs = nbrs.kneighbors(new_obs)

# indices of the nearest neighbors (row positions in data)
print('{}\n'.format(repr(knbrs)))
# nearest neighbor distances
print('{}\n'.format(repr(dists)))

# with return_distance=False, kneighbors returns only the neighbor indices
only_nbrs = nbrs.kneighbors(new_obs, return_distance=False)
print('{}\n'.format(repr(only_nbrs)))

array([[7, 0, 4, 6, 9]], dtype=int64)

array([[0.17320508, 0.24494897, 0.24494897, 0.45825757, 0.46904158]])

array([[7, 0, 4, 6, 9]], dtype=int64)



The NearestNeighbors object is fitted with a dataset, which then serves as the pool of possible neighbors for new data observations. The kneighbors method takes in one or more new data observations and returns, for each one, the distances to its k nearest neighbors and the indices of those neighbors (their row positions in the fitted dataset). The nearest neighbors are the rows with the smallest distances from the input data observation; in the output above they are listed from nearest to farthest. We can choose not to return the distances by setting the return_distance keyword argument to False.
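To make the "smallest distances" idea concrete, the sketch below recomputes the distances by hand with NumPy and compares them with what kneighbors returns. It assumes the default settings of NearestNeighbors, under which the distance is the ordinary Euclidean distance; the names manual_dists and manual_order are illustrative and not part of scikit-learn.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

data = np.array([
  [5.1, 3.5, 1.4, 0.2],
  [4.9, 3. , 1.4, 0.2],
  [4.7, 3.2, 1.3, 0.2],
  [4.6, 3.1, 1.5, 0.2],
  [5. , 3.6, 1.4, 0.2],
  [5.4, 3.9, 1.7, 0.4],
  [4.6, 3.4, 1.4, 0.3],
  [5. , 3.4, 1.5, 0.2],
  [4.4, 2.9, 1.4, 0.2],
  [4.9, 3.1, 1.5, 0.1]])
new_obs = np.array([[5, 3.5, 1.6, 0.3]])

# Euclidean distance from new_obs to every row of data
# (assumes the default metric of NearestNeighbors)
manual_dists = np.sqrt(((data - new_obs) ** 2).sum(axis=1))

# row indices of the 5 smallest distances, nearest first
manual_order = np.argsort(manual_dists, kind='stable')[:5]

nbrs = NearestNeighbors()
nbrs.fit(data)
dists, knbrs = nbrs.kneighbors(new_obs)

# the 5 smallest hand-computed distances should match the returned ones
print(np.allclose(np.sort(manual_dists)[:5], dists[0]))
print('{}\n'.format(repr(manual_order)))
```

Rows 0 and 4 are almost exactly the same distance from new_obs, so their relative order in a hand-rolled ranking may differ from the library's; the set of five neighbors should be the same either way.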

The default value for k when initializing the NearestNeighbors object is 5. We can specify a new value using the n_neighbors keyword argument.

In [4]:
data = np.array([
  [5.1, 3.5, 1.4, 0.2],
  [4.9, 3. , 1.4, 0.2],
  [4.7, 3.2, 1.3, 0.2],
  [4.6, 3.1, 1.5, 0.2],
  [5. , 3.6, 1.4, 0.2],
  [5.4, 3.9, 1.7, 0.4],
  [4.6, 3.4, 1.4, 0.3],
  [5. , 3.4, 1.5, 0.2],
  [4.4, 2.9, 1.4, 0.2],
  [4.9, 3.1, 1.5, 0.1]])

from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors(n_neighbors=2)
nbrs.fit(data)
new_obs = np.array([
  [5. , 3.5, 1.6, 0.3],
  [4.8, 3.2, 1.5, 0.1]])
dists, knbrs = nbrs.kneighbors(new_obs)

# indices of the nearest neighbors (one row per new observation)
print('{}\n'.format(repr(knbrs)))
# nearest neighbor distances
print('{}\n'.format(repr(dists)))

array([[7, 0],
       [9, 2]], dtype=int64)

array([[0.17320508, 0.24494897],
       [0.14142136, 0.24494897]])
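Two follow-up steps are often useful here, sketched below under the same small dataset: asking for a different k on a single query, and turning the returned indices back into the actual neighbor rows. The kneighbors method accepts its own n_neighbors keyword argument, which applies to that call only; the variable names dists3, knbrs3 and neighbor_rows are illustrative.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

data = np.array([
  [5.1, 3.5, 1.4, 0.2],
  [4.9, 3. , 1.4, 0.2],
  [4.7, 3.2, 1.3, 0.2],
  [4.6, 3.1, 1.5, 0.2],
  [5. , 3.6, 1.4, 0.2],
  [5.4, 3.9, 1.7, 0.4],
  [4.6, 3.4, 1.4, 0.3],
  [5. , 3.4, 1.5, 0.2],
  [4.4, 2.9, 1.4, 0.2],
  [4.9, 3.1, 1.5, 0.1]])

nbrs = NearestNeighbors(n_neighbors=2)
nbrs.fit(data)
new_obs = np.array([[5. , 3.5, 1.6, 0.3]])

# request 3 neighbors for this call only; the object's default stays at 2
dists3, knbrs3 = nbrs.kneighbors(new_obs, n_neighbors=3)

# use the returned indices to look up the neighbor rows in data
neighbor_rows = data[knbrs3[0]]

print('{}\n'.format(repr(knbrs3)))
print('{}\n'.format(repr(neighbor_rows)))
```

Indexing data with the index array returns the neighbors in the same nearest-first order as the distances, which is convenient when the rows themselves (rather than their positions) are what the next step needs.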

