# Nearest Neighbor Search

## Learning objectives

After reading this notebook, students will be able to:

- Explain distance and distance metrics mathematically,
- Illustrate nearest neighbor (NN) search in layman's terms,
- Describe distance metrics such as Euclidean distance, Manhattan distance, and Mahalanobis distance, along with their use cases.

## Introduction

The nearest neighbor problem is defined as follows: given a set $P$ of points in some metric space $(X, D)$, build a data structure that, given any query point $q$, returns a point in $P$ that is closest to $q$ (its “nearest neighbor” in $P$).

I will explain this definition with an example of a post office problem.

__Post office problem__

---







<center>
<img src="https://i.postimg.cc/YqRRk2Ng/KD-Tree.png" height=450/>
<figcaption>Figure: Post office problem</figcaption>
</center>

Say you are at point (12, 9) and you want to post a letter to your friend. There are several post offices near you. Which post office would you choose, and why?

__You cannot formulate an answer to the problem without defining the distance between an arbitrary point and your location.__

## Distance

Let $X$ be a set. A function $d : X × X → R$ is called a distance (or
dissimilarity) on $X$ if, for all $x,y ∈ X,$ there holds:


1. Non-Negativity:  $~~~~~~d(x,y)≥ 0 $
    * The distance between two points is never negative. For example, the distance between your location and a nearby post office cannot be less than zero.
2. Identity/Reflexivity: $~~d(x,y) = 0$ iff $ x = y$
    * The distance is zero exactly when the two points are identical. For example, the distance between your location and itself is zero.
3. Symmetry:$~~~~~~~~~~~~~d(x,y) = d(y,x)$
    * The distance is the same whichever point you measure from. For example, the distance from your location to the nearby post office equals the distance from that post office to your location.



## Measuring Similarity

__Distance Metric__


In mathematics, a metric or distance function is a function that defines a distance between each pair of elements of a set.

A function is called a distance function or a distance metric if it satisfies:
1. Non-Negativity
2. Identity
3. Symmetry
4. [Triangle Inequality:](https://www.onlinemathlearning.com/triangle-inequality.html) $~d(x,y) ≤ d(x,c) + d(c,y)$ for all $x, y, c ∈ X$

A distance becomes a metric when it also satisfies the triangle inequality.

* A distance that satisfies 1, 3, and 4 but not 2 is called a pseudometric.

* A distance that satisfies 1, 2, and 4 but not 3 is called a quasi-metric.
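The four conditions can be spot-checked numerically on a finite sample of points. The sketch below is only a sanity check under that assumption (passing it does not prove that a function is a metric), and the helper name `satisfies_metric_axioms` and the sample points are illustrative choices, not part of the original notebook.

```python
from itertools import product


def manhattan(x, y):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def squared_euclidean(x, y):
    """Sum of squared coordinate differences (no square root)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def satisfies_metric_axioms(d, points, tol=1e-9):
    """Check the four metric conditions on a finite sample of points."""
    for x, y in product(points, repeat=2):
        if d(x, y) < 0:                        # 1. non-negativity
            return False
        if (d(x, y) == 0) != (x == y):         # 2. identity
            return False
        if abs(d(x, y) - d(y, x)) > tol:       # 3. symmetry
            return False
    for x, y, c in product(points, repeat=3):
        if d(x, y) > d(x, c) + d(c, y) + tol:  # 4. triangle inequality
            return False
    return True


sample = [(0, 0), (1, 0), (2, 0), (1, 3)]
print(satisfies_metric_axioms(manhattan, sample))          # True
print(satisfies_metric_axioms(squared_euclidean, sample))  # False
```

The squared Euclidean distance fails condition 4 on the points (0,0), (1,0), (2,0), since $4 > 1 + 1$, so it is a distance but not a metric.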


Now that you have learned the conditions a distance function must satisfy, let's look at some popular distance functions and their use cases, starting with Euclidean distance.


## Euclidean distance

Euclidean distance or Euclidean metric is the straight-line distance between two points in Euclidean space.

<center>
<img src="https://i.postimg.cc/521t51DP/Euclidean-distance.png" height= 450/>
<figcaption>Figure: Euclidean Distance of two points</figcaption>
</center>

In the above figure, $x$ and $y$ are two points in 2D Euclidean space, where $x=(x_1,y_1)$ and $y=(x_2,y_2)$.


The Euclidean distance between $x$ and $y$ is given by:

$$ d(x,y) =  \sqrt{ (x_2 - x_1)^2 + (y_2 - y_1)^2}$$

In general, for $x=(x_1,\dots,x_n)$ and $y=(y_1,\dots,y_n)$ in $n$ dimensions,


$$  d(x,y) = {(\sum^n_{i=1}{|x_i- y_i|}^2)}^{1/2}$$



__Note:__

When comparing distances, it is not necessary to perform the square root operation: the sums of squares can be compared directly.
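As a minimal sketch of the general formula and of the note above, the snippet below computes the Euclidean distance in pure Python and confirms that ranking candidates by squared distance selects the same nearest point. The function names and sample points are illustrative assumptions.

```python
import math


def squared_euclidean(x, y):
    """Sum of squared coordinate differences (no square root)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def euclidean(x, y):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(squared_euclidean(x, y))


query = (12, 9)
candidates = [(10, 10), (8, 4), (12, 11)]

print(euclidean((0, 0), (3, 4)))  # 5.0

# The square root is monotonic, so both keys select the same nearest point.
nearest = min(candidates, key=lambda p: euclidean(query, p))
nearest_sq = min(candidates, key=lambda p: squared_euclidean(query, p))
print(nearest, nearest_sq)  # (12, 11) (12, 11)
```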

Disadvantages:

- It is extremely sensitive to the scales of the variables involved in the dataset.
- The Euclidean distance is blind to correlated variables in the dataset.




_Let's discuss Manhattan distance in detail._

## Manhattan distance




Manhattan distance is also known as Taxicab distance.
It is the distance you would need to walk in a city like Manhattan. You must stay on the street because you can't cut through the buildings.



<center>
<img src="https://i.postimg.cc/hP2Ys15H/manhatten.png" height= 450/>
<figcaption>Fig: Manhattan Distance of two points</figcaption>
</center>


In the above figure, $x$ and $y$ are two points in 2D, where $x=(x_1,y_1)$ and $y=(x_2,y_2)$.

The Manhattan distance between $x$ and $y$ is given by:

$$ d(x,y) =  |x_2 - x_1| + |y_2 - y_1| $$

In general, for $x=(x_1,\dots,x_n)$ and $y=(y_1,\dots,y_n)$ in $n$ dimensions,


$$  d(x,y) = \sum^n_{i=1}{|x_i- y_i|}$$
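The general formula translates almost directly into Python: take the absolute difference in each coordinate and add them up. This is a small sketch; the function name and the dimension check are illustrative additions.

```python
def manhattan(x, y):
    """Sum of absolute coordinate differences between two points."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    return sum(abs(a - b) for a, b in zip(x, y))


print(manhattan((12, 9), (12, 11)))     # 2
print(manhattan((1, 2, 3), (4, 0, 3)))  # 5
```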



Disadvantage:

* The shortest path between two points is not unique and often contains many turns.

## Minkowski distance


Minkowski distance is a distance on $R^n$ defined, for $x,y ∈ R^n$ and $p \geq 1$, by

$$ d(x,y) = {(\sum^n_{i=1}{|x_i- y_i|}^p)}^{1/p}$$

It defines a distance between two points in a normed vector space (i.e., a space in which every vector has a length). It is a generalization of both Euclidean distance and Manhattan distance:

* p = 2, Euclidean distance
* p = 1, Manhattan distance
* p = $\infty$, Chebyshev distance
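The three special cases can be reproduced with one function. In this sketch the $p = \infty$ case is handled separately as the largest coordinate difference, which is the limit of the formula as $p$ grows; the function name and the test points are illustrative assumptions.

```python
import math


def minkowski(x, y, p):
    """Minkowski distance of order p (p >= 1, or math.inf for Chebyshev)."""
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if p == math.inf:
        return max(diffs)  # limit of the formula as p grows without bound
    if p < 1:
        raise ValueError("p must be >= 1 for a metric")
    return sum(d ** p for d in diffs) ** (1 / p)


x, y = (0, 0), (3, 4)
print(minkowski(x, y, 1))         # 7.0  (Manhattan)
print(minkowski(x, y, 2))         # 5.0  (Euclidean)
print(minkowski(x, y, math.inf))  # 4    (Chebyshev)
```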


The figure below shows the unit circles (the set of all points at unit distance from the centre) for different values of $p$.

<figure>
<center>
<img src="https://media.geeksforgeeks.org/wp-content/uploads/20211201082526/1.PNG" />
<figcaption>Fig: Unit circles with different values of $p$</figcaption>
</center>
</figure>




__Note:__

Similarly, you can define the power $(p,r)$-distance as:

$$ d(x,y) =  {(\sum^n_{i=1}{|x_i- y_i|}^p)}^{1/r}$$


* The case $p=2$ and $r = 1$ corresponds to the squared Euclidean distance.

* For $p = r \geq 1$, it is the $l_p$-metric.

The power $(p,r)$-distance with $0<p = r<1$ is called the fractional $l_p$ distance. It is not a metric because its unit ball is not convex, so the triangle inequality fails. If you have few observations and a large number of variables, you can use this fractional distance on “dimensionality-cursed” data.
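A small numerical check makes the failure of the triangle inequality concrete. The sketch below uses $p = r = 0.5$ and three hand-picked points; the function name `power_distance` and the points are illustrative assumptions.

```python
def power_distance(x, y, p, r):
    """Power (p, r)-distance: (sum |x_i - y_i|^p)^(1/r)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / r)


x, c, y = (0, 0), (1, 0), (1, 1)

# Fractional l_p distance with p = r = 0.5
direct = power_distance(x, y, 0.5, 0.5)
detour = power_distance(x, c, 0.5, 0.5) + power_distance(c, y, 0.5, 0.5)
print(direct, detour)   # 4.0 2.0
print(direct > detour)  # True: the triangle inequality is violated

# p = 2, r = 1 gives the squared Euclidean distance
print(power_distance((0, 0), (3, 4), 2, 1))  # 25.0
```

Going directly from $x$ to $y$ costs more than the detour through $c$, which cannot happen for a metric.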

## Disadvantages of Euclidean distance and Manhattan distance

__You should not use the metrics defined above if the coordinates are measured in different units.__

Example:


| Area (sq. ft)  |  Price (K)  |
| ---  | ---   |
| 2424  |  162000  |
| 960  |  1265  |
| 840  |  89450  |
| 1650  |  140600  |



| Area (hectare)  |  Price (M)  |
| ---  | ---   |
| 0.0225197 |  162  |
| 0.00891869  |  1.265  |
| 0.00780386  |  89.450  |
| 0.015329  |  140.600  |

The two tables above show the `area` and `price` of the same entities. Only the units of the features are different.

Since both tables represent the same entities, the distance between any two rows should be the same in both tables. But Euclidean distance and Manhattan distance give different values for the two tables, even though the underlying physical quantities are identical.

* One solution is to normalize the features.
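The effect can be reproduced with the rows of the first table. The sketch below assumes z-score standardization as the normalization (one common choice among several) and recomputes the second unit system from the first with the exact conversion factor instead of copying the rounded table values; all names are illustrative.

```python
import math
from statistics import mean, pstdev

# Rows from the first table: (area in sq. ft, price in K)
rows_a = [(2424, 162000), (960, 1265), (840, 89450), (1650, 140600)]
# The same rows in other units: (area in hectares, price in M)
SQFT_TO_HECTARE = 9.290304e-6
rows_b = [(area * SQFT_TO_HECTARE, price / 1000) for area, price in rows_a]


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def standardize(rows):
    """Z-score each column: subtract its mean, divide by its std. deviation."""
    columns = list(zip(*rows))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]
    return [tuple((v - m) / s for v, m, s in zip(row, mus, sigmas)) for row in rows]


# Raw distances between the first two rows depend on the units.
raw_a = euclidean(rows_a[0], rows_a[1])
raw_b = euclidean(rows_b[0], rows_b[1])
print(raw_a, raw_b)

# After standardization the two unit systems agree.
z_a, z_b = standardize(rows_a), standardize(rows_b)
dist_z_a = euclidean(z_a[0], z_a[1])
dist_z_b = euclidean(z_b[0], z_b[1])
print(math.isclose(dist_z_a, dist_z_b))  # True
```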

---

__Euclidean distance is blind to correlated variables.__

| f1 |  f2 | f3|
| ---  | ---   |--- |
| 45 |  22  | 45|
| 12 | 11 | 12|
| 88 |  17 | 88|
| 79  |  12  | 79|


Consider a hypothetical dataset containing three features f1, f2, and f3. In this dataset, f1 and f3 are duplicate columns, so they are perfectly correlated. Yet Euclidean distance has no way of recognizing that the duplicate column brings no new information, and it effectively weighs the copied variable more heavily in its calculations than the other variables.
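The double counting is easy to see on the first two rows of the table. This sketch compares the distance computed from f1 and f2 only with the distance computed from all three columns; the variable names are illustrative.

```python
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


# First two rows of the table above: (f1, f2, f3), with f3 a copy of f1
row_1 = (45, 22, 45)
row_2 = (12, 11, 12)

without_copy = euclidean(row_1[:2], row_2[:2])  # uses f1 and f2 only
with_copy = euclidean(row_1, row_2)             # f1's difference counted twice

print(round(without_copy, 3))  # 34.785
print(round(with_copy, 3))     # 47.948
```

The squared difference in f1 ($33^2$) enters the sum twice, so the duplicated feature dominates the distance even more than before.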


To solve these scaling and correlation problems, [Prof. Prasanta Chandra Mahalanobis](https://en.wikipedia.org/wiki/Mahalanobis_distance) introduced a new distance measure, which is named after him: the Mahalanobis distance.


## Mahalanobis Distance

Mahalanobis distance is the distance between a point and a distribution. It is not a distance between two distinct points. It is a general distance measure that reduces to Euclidean distance when the variables are uncorrelated and have unit variance.

Suppose you have two groups of animals, such as mountain goats $G_1$ and ordinary goats $G_2$, and you want to know how different they are, perhaps to formulate a hypothesis about a species, its origin, or its evolution.
You can approach this problem with a measure of divergence, or distance, between the groups in terms of multiple characteristics such as height, fur length, horn length, and tail length. The Mahalanobis distance is one such measure.

* If the distance between the average features of the two populations is large, the populations are different; otherwise, they are statistically very similar.



<center>

![Mahalanobis-Dist](https://i.postimg.cc/qR5R253j/Mahalanobis-Dist.png)
</center>

### Mathematics behind Mahalanobis distance

The squared Mahalanobis distance is:

$$ d^2 = (x - \mu)^T \Sigma^{-1} (x - \mu) $$

where

* $x$ is an observation vector (a row of the dataset),
* $\mu$ is the vector of means of the independent variables in each group $G_1$ and $G_2$,
* $\Sigma$ is the common (nonsingular) covariance matrix of $x$ in each group $G_1$ and $G_2$.

The above equation is a quadratic form, so it evaluates to a single real number. Because $\Sigma$ is positive definite, so is $\Sigma^{-1}$; hence $d^2 \geq 0$, and its square root $d$ is a metric.
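To make the quadratic form concrete, here is a minimal pure-Python sketch restricted to two features, so that the inverse of the covariance matrix can be written out by hand. It reuses the area and price rows from the earlier table as illustrative data and assumes the sample covariance (division by $n - 1$); the function names are hypothetical.

```python
from statistics import mean

# Illustrative 2-feature data: (area in sq. ft, price in K)
rows = [(2424, 162000), (960, 1265), (840, 89450), (1650, 140600)]


def mean_and_covariance_2d(rows):
    """Return the mean vector and sample covariance entries (s11, s12, s22)."""
    xs, ys = zip(*rows)
    mx, my = mean(xs), mean(ys)
    n = len(rows)
    s11 = sum((x - mx) ** 2 for x in xs) / (n - 1)
    s22 = sum((y - my) ** 2 for y in ys) / (n - 1)
    s12 = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)
    return (mx, my), (s11, s12, s22)


def mahalanobis_sq(point, mu, cov):
    """d^2 = (x - mu)^T Sigma^{-1} (x - mu) for a 2x2 covariance matrix."""
    s11, s12, s22 = cov
    det = s11 * s22 - s12 ** 2  # must be non-zero (nonsingular Sigma)
    dx, dy = point[0] - mu[0], point[1] - mu[1]
    # Inverse of [[s11, s12], [s12, s22]] is [[s22, -s12], [-s12, s11]] / det
    return (s22 * dx * dx - 2 * s12 * dx * dy + s11 * dy * dy) / det


mu, cov = mean_and_covariance_2d(rows)
d2 = mahalanobis_sq(rows[0], mu, cov)
print(d2)

# Changing units (hectares, millions) leaves the result unchanged.
rescaled = [(a * 9.290304e-6, p / 1000) for a, p in rows]
mu_r, cov_r = mean_and_covariance_2d(rescaled)
d2_rescaled = mahalanobis_sq(rescaled[0], mu_r, cov_r)
print(d2_rescaled)
```

Up to floating-point rounding, both printed values agree, which is the scale invariance that plain Euclidean distance lacks.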


* The Mahalanobis distance uses the covariance among the variables to calculate distance. With this, the problems of scale and correlation that exist in the Euclidean distance are no longer an issue.




<center>

Table 1: Covariance Matrix

| features  |  x1 | x2| ... | xn|
| ---  | ---   | ---| --- | --- |
| x1  |  var(x1) | cov(x1,x2)| ......| cov(x1,xn)|
| x2  | cov(x2,x1)  | var(x2)| ......| cov(x2,xn)|
| ...  | ...  | ...| ......| .....|
| xn  | cov(xn,x1)  | cov(xn,x2) | ......| var(xn)|

</center>


`Table 1` shows the covariance matrix $\Sigma$, which contains the variances on its diagonal. The inverse of this covariance matrix, $\Sigma^{-1}$, is what is used to calculate the Mahalanobis distance.




* For correlated features, the covariance will be high, and dividing by a large covariance value effectively reduces the distance.

* If the features are not correlated, the covariance will be small and the distance will not be reduced much.

Therefore, it addresses both the scale problem and the correlation problem of the variables.

Computationally, Mahalanobis distance:

1. Transforms the columns into uncorrelated variables
2. Scales the columns to make their variance equal to 1
3. Calculates the Euclidean distance.
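The three steps can be checked against the quadratic-form definition. The sketch below is limited to two features and uses a hand-written Cholesky factor of $\Sigma$ as the whitening transform, which is one possible choice and performs steps 1 and 2 in a single linear map. The data rows and all names are illustrative assumptions.

```python
import math
from statistics import mean

rows = [(2424, 162000), (960, 1265), (840, 89450), (1650, 140600)]

xs, ys = zip(*rows)
mx, my = mean(xs), mean(ys)
n = len(rows)
s11 = sum((x - mx) ** 2 for x in xs) / (n - 1)
s22 = sum((y - my) ** 2 for y in ys) / (n - 1)
s12 = sum((x - mx) * (y - my) for x, y in rows) / (n - 1)

# Cholesky factor L of Sigma = [[s11, s12], [s12, s22]], with Sigma = L L^T
l11 = math.sqrt(s11)
l21 = s12 / l11
l22 = math.sqrt(s22 - l21 ** 2)


def whiten(point):
    """Steps 1-2: solve L z = (x - mu); z is decorrelated with unit variance."""
    dx, dy = point[0] - mx, point[1] - my
    z1 = dx / l11
    z2 = (dy - l21 * z1) / l22
    return z1, z2


def mahalanobis_direct(point):
    """Square root of the quadratic form (x - mu)^T Sigma^{-1} (x - mu)."""
    dx, dy = point[0] - mx, point[1] - my
    det = s11 * s22 - s12 ** 2
    return math.sqrt((s22 * dx * dx - 2 * s12 * dx * dy + s11 * dy * dy) / det)


# Step 3: the Euclidean length of the whitened point is the Mahalanobis distance.
via_whitening = math.hypot(*whiten(rows[0]))
direct = mahalanobis_direct(rows[0])
print(via_whitening, direct)
```

Both routes give the same number up to rounding, because $\|L^{-1}(x-\mu)\|^2 = (x-\mu)^T (LL^T)^{-1} (x-\mu)$.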

__Application of Mahalanobis distance__

1. Multivariate outlier detection
2. Classification Problems
3. One-Class Classification

Our goal here is only to introduce Mahalanobis distance. If you want to learn more about the applications mentioned here, see [Mahalanobis Distance – Understanding the math with examples (python).](https://www.machinelearningplus.com/statistics/mahalanobis-distance/)

_Up to now, you have learned about different distance metrics and how they calculate distance. Now, let's jump back to the post office problem and try to solve it._


<center>
<img src="https://i.postimg.cc/YqRRk2Ng/KD-Tree.png" />
<figcaption>Figure: Post office problem</figcaption>
</center>


## Brute-force Method

In this problem you cannot walk through buildings, which is why you use the Manhattan distance metric to calculate the distance between you and each post office.

```python
post_offices_loc = [(6, 12), (3, 7), (10, 10), (8, 4), (15, 2), (12, 11), (14, 10)]
your_loc = (12, 9)
```

The correct answer is (12,11). But how did you calculate this value? A naive approach would be:

1. Find the distance from the query point (your location) to every element of $P$ (the post offices).
2. Use these distances to find the minimum.



```python
# Manhattan distance from your location to every post office
distances = []
for office in post_offices_loc:
    distances.append(abs(your_loc[0] - office[0]) + abs(your_loc[1] - office[1]))

distances
```

```
[9, 11, 3, 9, 10, 2, 3]
```

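Let's define a function that returns the index of every occurrence of the minimum value in a given list. It always returns a list, so ties (several equally near post offices) are handled the same way as a single winner.

```python
def get_indexes_min_value(values):
    """Return the indices of all occurrences of the minimum value."""
    min_value = min(values)
    return [i for i, x in enumerate(values) if x == min_value]
```

```python
locations = get_indexes_min_value(distances)
[post_offices_loc[i] for i in locations]
```

```
[(12, 11)]
```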

In this naive approach, you perform a linear search: you compute every distance and then find the minimum to get the nearest neighbor.
* You need one loop to compute the distances. Therefore, the worst-case time complexity of this step is $O(n).$

* You need one loop to find the minimum distance by comparing values. Therefore, the worst-case complexity of this step is also $O(n).$

In $d$-dimensional space, each distance computation costs $O(d)$, so the naive approach has $O(nd)$ worst-case time complexity.
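The two loops can also be merged into a single pass that works for points of any dimension. The following is a sketch of such a linear scan, not a replacement for the cells above; the function name `nearest_neighbor` and its `distance` parameter are illustrative.

```python
def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


def nearest_neighbor(points, query, distance=manhattan):
    """Linear scan: one O(d) distance per point, O(nd) in total."""
    best_point, best_dist = None, float("inf")
    for p in points:
        dist = distance(query, p)
        if dist < best_dist:  # keep the closest point seen so far
            best_point, best_dist = p, dist
    return best_point, best_dist


post_offices = [(6, 12), (3, 7), (10, 10), (8, 4), (15, 2), (12, 11), (14, 10)]
print(nearest_neighbor(post_offices, (12, 9)))  # ((12, 11), 2)
```

Passing a different `distance` function (for example a Euclidean one) changes the metric without changing the search itself.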

---

This approach is suitable for small datasets. What if there are millions of entries?
---


<details>
<summary>Ans</summary>

* If there are millions of entries, the naive approach becomes inefficient: every query has to scan all the points, which takes a lot of time. So we need to formulate a more efficient algorithm.

</details>


__Objective__

* Design an algorithm for nearest neighbor search that achieves a worst-case complexity lower than $O(nd).$

# Key Takeaways

* To find a nearest neighbor, you need to define a distance.


* A distance is non-negative, symmetric, and zero exactly when the two points are identical.

* Euclidean distance is the straight-line (shortest) distance between two points.

* Manhattan distance is the sum of absolute differences between points across all the dimensions.

* Mahalanobis distance is the distance between a point and a distribution. It is not a distance between two distinct points.


* Distances in high-dimensional spaces can become less meaningful and are hard to work with. This is called the curse of dimensionality.

* The worst-case time complexity of the brute-force method to find the nearest neighbor is $O(nd)$, where $n$ is the number of points and $d$ is the number of dimensions.