Distance measures are an important part of machine learning. A distance measure estimates how close, and therefore how similar, two observations are. It can also help reduce prediction error by keeping predicted outputs in the vicinity of the input data, which improves prediction accuracy.

A distance function gives the distance between the elements of a set. If the distance is zero, the elements are equivalent; otherwise they differ from each other. Distance measures therefore describe how similar or dissimilar two vectors are.
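The properties described above can be spot-checked on a handful of sample points. The sketch below is only illustrative: `check_distance_properties` is a hypothetical helper (not part of any library), and it tests non-negativity, "zero distance only for equal elements", symmetry, and the triangle inequality on the given samples, which is evidence rather than a proof.

```python
from itertools import product

def check_distance_properties(dist, points):
    """Return True if dist behaves like a distance on the sample points."""
    for p, q in product(points, repeat=2):
        d = dist(p, q)
        if d < 0:                      # distances are never negative
            return False
        if (d == 0) != (p == q):       # zero distance exactly when elements are equal
            return False
        if d != dist(q, p):            # symmetry
            return False
    for p, q, r in product(points, repeat=3):
        if dist(p, r) > dist(p, q) + dist(q, r):   # triangle inequality
            return False
    return True

# Absolute difference on integers, used here purely as a simple example
points = [0, 3, 3, 7, -2]
print(check_distance_properties(lambda a, b: abs(a - b), points))
```

Integer samples are used so that the exact comparisons are not affected by floating-point rounding.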

There are many distance measures, but we will focus on the following four:
<ol>
    <li>Hamming distance</li>
    <li>Euclidean distance</li>
    <li>Manhattan distance</li>
    <li>Minkowski distance</li>
</ol>

## Hamming distance

According to [Wikipedia](https://en.wikipedia.org/wiki/Hamming_distance), the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols differ. It can also be defined as the minimum number of substitutions needed to change one string into the other.

For example, the Hamming distance between:
<ul>
    <li><b>B</b>at and <b>C</b>at is 1.</li>
    <li>100<b>0</b>10<b>1</b> and 100<b>1</b>10<b>0</b> is 2.</li>
</ul>
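The two counts listed above can be reproduced with a short sketch that works directly on strings. The name `hamming_count` is an assumption made for illustration; it returns the raw number of differing positions and rejects inputs of unequal length, since the definition only covers equal-length strings.

```python
def hamming_count(s1, s2):
    """Number of positions at which two equal-length sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance requires sequences of equal length")
    # Each comparison is True (1) where the symbols differ, False (0) otherwise
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming_count("Bat", "Cat"))          # 1
print(hamming_count("1000101", "1001100"))  # 2
```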

A normalized Hamming distance can be defined as: (number of positions at which the components disagree) / (length of the string). For the two bitstrings above, this gives 2 / 7.

Let us demonstrate this with an example.

In [1]:
def hamming_distance(a, b):
    # Fraction of positions at which the two sequences differ
    return sum(v1 != v2 for v1, v2 in zip(a, b)) / len(a)

vector1 = [1, 0, 0, 0, 1, 0, 1]
vector2 = [1, 0, 0, 1, 1, 0, 0]

print(hamming_distance(vector1, vector2))

0.2857142857142857


For long bitstrings, it is common to report the average number of differing bits rather than the raw count, which gives a Hamming distance score between 0 (identical) and 1 (all bits different).

We could also use the [hamming()](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.hamming.html) function provided by SciPy.

In [2]:
from scipy.spatial.distance import hamming

vector1 = [1, 0, 0, 0, 1, 0, 1]
vector2 = [1, 0, 0, 1, 1, 0, 0]

print(hamming(vector1, vector2))

0.2857142857142857


## Euclidean distance

According to [Wikipedia](https://en.wikipedia.org/wiki/Euclidean_distance), the Euclidean distance or Euclidean metric is the "ordinary" straight-line distance between two points in Euclidean space.

<img src="EuclideanDistance.png" alt="Euclidean Distance" style="align:center"/>

In [3]:
from math import sqrt

def euclidean_distance(a, b):
    return sqrt(sum((v1 - v2)**2 for v1, v2 in zip(a, b)))

vector1 = [10, 20, 30, 40]
vector2 = [15, 25, 35, 45]

print(euclidean_distance(vector1, vector2))

10.0


One could also use the [euclidean()](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.euclidean.html?highlight=euclidean%20distance#scipy.spatial.distance.euclidean) function provided by SciPy.

In [4]:
from scipy.spatial.distance import euclidean

print(euclidean(vector1, vector2))

10.0
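If SciPy is not available, the Python standard library (3.8 and later) offers `math.dist`, which computes the Euclidean distance between two points of equal dimension. The sketch below compares it with the hand-written formula; the name `euclidean_manual` is introduced here only for the comparison.

```python
from math import dist, isclose, sqrt

vector1 = [10, 20, 30, 40]
vector2 = [15, 25, 35, 45]

# Hand-written formula, restated so this snippet is self-contained
euclidean_manual = sqrt(sum((v1 - v2) ** 2 for v1, v2 in zip(vector1, vector2)))

# Standard-library equivalent
euclidean_stdlib = dist(vector1, vector2)

print(euclidean_stdlib)
print(isclose(euclidean_manual, euclidean_stdlib))
```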


## Manhattan distance

According to [Wikipedia](https://en.wikipedia.org/wiki/Taxicab_geometry), taxicab geometry is a form of geometry in which the usual distance function or metric of Euclidean geometry is replaced by a new metric in which the distance between two points is the sum of the absolute differences of their Cartesian coordinates. This distance is known as the Manhattan distance:
<p style="text-align: center">&sum;<sub>i</sub>|u<sub>i</sub> - v<sub>i</sub>|</p>

In [5]:
def manhattan_distance(a, b):
    return sum(abs(v1 - v2) for v1, v2 in zip(a, b))

print(manhattan_distance(vector1, vector2))

20


One could also use the [cityblock()](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cityblock.html?highlight=manhattan%20distance) function provided by SciPy.

In [6]:
from scipy.spatial.distance import cityblock

print(cityblock(vector1, vector2))

20


## Minkowski distance

According to [Wikipedia](https://en.wikipedia.org/wiki/Minkowski_distance), Minkowski distance is a metric in a normed vector space that can be considered a generalization of both the Euclidean distance and the Manhattan distance.

<img src="MinkowskiDistance.png" alt="Minkowski distance" style="align: center"/>

The parameter "p" is called the _order_; varying it yields different distance measures. When p=1, the distance is the Manhattan distance, and when p=2, it is the Euclidean distance.

In [7]:
def minkowski_distance(a, b, p):
    return (sum(abs(v1 - v2)**p for v1, v2 in zip(a,b)))**(1/p)

print(minkowski_distance(vector1, vector2, 1))
print(minkowski_distance(vector1, vector2, 2))

20.0
10.0
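To see how the order affects the result, the following sketch evaluates the same pair of vectors for several values of p. The function name `minkowski_distance_checked` and its input checks are assumptions added for illustration: it rejects vectors of unequal length and orders below 1, because the Minkowski distance satisfies the triangle inequality only for p >= 1.

```python
def minkowski_distance_checked(a, b, p):
    """Minkowski distance of order p, with basic input validation."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    if p < 1:
        raise ValueError("order p must be at least 1")
    return sum(abs(v1 - v2) ** p for v1, v2 in zip(a, b)) ** (1 / p)

vector1 = [10, 20, 30, 40]
vector2 = [15, 25, 35, 45]

# The distance shrinks as the order grows
for p in (1, 2, 3, 10):
    print(p, minkowski_distance_checked(vector1, vector2, p))
```

For these vectors every coordinate differs by 5, and the printed distance decreases toward that largest single-coordinate difference as p increases.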
