Distance Metrics: How to measure stuff with Math
-----


<center><img src="images/naked.jpg" width="75%"/></center>

Distance Metrics
----

Distance is a numerical measurement of how far apart objects are.

Data science is mostly about counting. But after you count, you can make comparisons.

Source: https://medium.com/@montjoile/l0-norm-l1-norm-l2-norm-l-infinity-norm-7a7d18a4f40c

By The End Of This Session You Should Be Able To:
----

- Define a norm
- Explain the following distances: 1-norm, 2-norm, and p-norm
- Implement each of those distances from scratch

What is a Norm?
----



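A norm is a function that assigns a non-negative length to each vector in a space. The distance between two points is the norm of their difference.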
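
A function has to satisfy a few properties to count as a norm. The sketch below spot-checks them on example vectors using a hypothetical `one_norm` helper (the name and the sample vectors are assumptions for illustration, not part of the session material). It checks a few values only; it is not a proof.

```python
def one_norm(v):
    """1-norm of a vector: the sum of the absolute values of its components."""
    return sum(abs(x) for x in v)

# Illustrative vectors and scale factor (assumed values).
u = [3, -4]
v = [1, 2]
scale = -2

# Non-negativity: a length is never negative, and the zero vector has length 0.
assert one_norm(u) >= 0
assert one_norm([0, 0]) == 0

# Absolute homogeneity: scaling a vector scales its length by abs(scale).
assert one_norm([scale * x for x in u]) == abs(scale) * one_norm(u)

# Triangle inequality: the length of a sum never exceeds the sum of the lengths.
u_plus_v = [a + b for a, b in zip(u, v)]
assert one_norm(u_plus_v) <= one_norm(u) + one_norm(v)
```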

1-norm distance, aka city-block distance
------

<center><img src="https://cdn-images-1.medium.com/max/800/0*1kmU2e3eDsPGJjvR.jpg" width="35%"/></center>

What is the 1-norm w.r.t. the origin?

7 = abs(3 - 0) + abs(4 - 0)

1-norm distance, aka city-block distance
------

<center><img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/9b1c64cf0749473b579dfad01f5add58422a3ddb" width="35%"/></center>

Student Activity: Write 1-norm distance function in Python

Bonus: Without using `numpy`

Let's assume the input is two 1-D arrays:


In [16]:
%reset -fs

In [32]:
q = [0, 0]
p = [3, 4]

#=> 1-norm = 7

In [46]:
def city_block(p, q):
    "https://en.wikipedia.org/wiki/Taxicab_geometry"
    return sum(abs(px - qx) for px, qx in zip(p, q))

In [34]:
# import matplotlib.pyplot as plt
# import numpy as np
# import pandas as pd
# import seaborn as sns
# import sklearn

# import warnings
# warnings.filterwarnings('ignore')

# %matplotlib inline

In [35]:
# # Random points
# min_value = 0
# max_value = 100
# size = 10

# p = np.random.random_integers(min_value, max_value, size)
# q = np.random.random_integers(min_value, max_value, size)

# sns.scatterplot(p, q);
# plt.ylim(min_value, max_value);
# plt.xlim(min_value, max_value);

In [47]:
list(zip(p, q))

[(3, 0), (4, 0)]

What should I do next?
------

Test it!

In [48]:
from scipy.spatial.distance import cityblock as city_block_benchmark

In [49]:
assert city_block(p, q) == city_block_benchmark(p, q)

In [39]:
city_block(p, q)

7
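
Comparing against a benchmark on one pair of points is a good start. A few extra cases make the test more convincing. The sketch below restates `city_block` so it runs on its own; the extra test points are assumptions chosen for illustration.

```python
def city_block(p, q):
    """1-norm (Manhattan) distance between two equal-length sequences."""
    return sum(abs(px - qx) for px, qx in zip(p, q))

# Known answer from the example above.
assert city_block([3, 4], [0, 0]) == 7

# The distance from a point to itself is zero.
assert city_block([3, 4], [3, 4]) == 0

# Swapping the arguments does not change the distance.
assert city_block([3, 4], [0, 0]) == city_block([0, 0], [3, 4])

# Negative coordinates: abs(-1 - 1) + abs(-2 - 2) = 2 + 4.
assert city_block([-1, -2], [1, 2]) == 6

# More than two dimensions: 3 + 4 + 5.
assert city_block([1, 2, 3], [4, 6, 8]) == 12
```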

2-norm distance, aka as the crow flies
-----

<center><img src="https://cdn-images-1.medium.com/max/800/0*HTlIui8sHP8pIHBW.jpg" width="55%"/></center>

What is the 2-norm w.r.t. the origin?

5 = sqrt((3 - 0)**2 + (4 - 0)**2)

<center><img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/923484efce295a64585d029f0faa2166ebed1f87" width="55%"/></center>

Student Activity: Write 2-norm distance function in Python

Bonus: Write the test also

In [50]:
from math import sqrt

def euclidean(p, q):
    "https://en.wikipedia.org/wiki/Euclidean_distance"
    return sqrt(sum((px - qx) ** 2.0 for px, qx in zip(p, q)))

[Source](https://www.cut-the-knot.org/pythagoras/DistanceFormula.shtml)

In [51]:
from scipy.spatial.distance import euclidean as euclidean_benchmark

assert euclidean(p, q) == euclidean_benchmark(p, q)
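
The exact `==` comparison works for the points (3, 4) and (0, 0) because the answer is a whole number. For arbitrary floats, two correct implementations can differ in the last few bits, so a tolerance-based check is safer. The sketch below uses `math.isclose` and the standard library's `math.dist` (Python 3.8+) as the benchmark; the sample points are assumed for illustration.

```python
from math import dist, isclose, sqrt

def euclidean(p, q):
    """2-norm (Euclidean) distance between two equal-length sequences."""
    return sqrt(sum((px - qx) ** 2 for px, qx in zip(p, q)))

# Points with fractional coordinates (illustrative values).
p_float = [0.1, 0.2, 0.3]
q_float = [0.4, 0.6, 0.9]

# Compare with a tolerance rather than exact equality.
assert isclose(euclidean(p_float, q_float), dist(p_float, q_float))

# The 3-4-5 triangle from the example above still checks out.
assert isclose(euclidean([3, 4], [0, 0]), 5.0)
```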

Source: https://en.wikipedia.org/wiki/Norm_(mathematics)

p-norm distance, aka Minkowski distance of order p
------

A generalization of both the 1-norm and 2-norm distances, defined on a normed vector space.

When p = 1, Manhattan distance.  
When p = 2, Euclidean distance.  

p can be any positive value, but the result is only a true metric when p ≥ 1 ([look into what it means to be a "metric"](https://en.wikipedia.org/wiki/Minkowski_distance))

In the limit as p approaches infinity, we obtain the Chebyshev distance.

<center><img src="https://wikimedia.org/api/rest_v1/media/math/render/svg/14b24f10bce2d1ceac1b92bf045a9d5c552d9bb8" width="75%"/></center>

Student Activity: Write p-norm distance function in Python

In [52]:
def minkowski(p, q, power):
    "https://en.wikipedia.org/wiki/Minkowski_distance"
    return (sum(abs(px - qx) ** power for px, qx in zip(p, q))) ** (1 / power)

In [45]:
from scipy.spatial.distance import minkowski as minkowski_benchmark

print(f"{'Power'} {'Mine':^10} {'Scipy':^8}")
for power in range(1, 10):
    print(f"{power:^6} {minkowski(p, q, power):>8.3f} {minkowski_benchmark(p, q, power):>8.3f}")

Power    Mine     Scipy  
  1       7.000    7.000
  2       5.000    5.000
  3       4.498    4.498
  4       4.285    4.285
  5       4.174    4.174
  6       4.111    4.111
  7       4.072    4.072
  8       4.048    4.048
  9       4.032    4.032
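
The values in the table shrink toward 4, which is the largest coordinate difference between the two points. The sketch below pushes the power higher to show that trend and compares the result with a max-based Chebyshev distance. It restates both functions so it runs on its own; the list of powers is an assumption chosen for illustration.

```python
def minkowski(p, q, power):
    """p-norm (Minkowski) distance of the given order."""
    return sum(abs(px - qx) ** power for px, qx in zip(p, q)) ** (1 / power)

def chebyshev(p, q):
    """Infinity-norm distance: the largest absolute coordinate difference."""
    return max(abs(px - qx) for px, qx in zip(p, q))

p, q = [3, 4], [0, 0]

# Very large powers can overflow when the sum is converted to a float,
# so this sketch stops at 100.
for power in (1, 2, 10, 50, 100):
    print(f"{power:>5} {minkowski(p, q, power):.6f}")

# At power 100 the Minkowski distance is already very close to Chebyshev.
assert abs(minkowski(p, q, 100) - chebyshev(p, q)) < 1e-6
```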


Norms (will help with regularization)
------

<center><img src="images/1_2.png" width="35%"/></center>

<center><img src="images/p.png" width="35%"/></center>

As p increases, notice how the geometric shape changes (the same trend as in the computed table).

Check for understanding
-----

<center><img src="images/1_2.png" width="25%"/></center>

What happens if we pick the 1-norm for k-NN?
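
One way to explore the question is to try it on a tiny example. In the sketch below, the query point and the two candidates `A` and `B` are made-up values, and the nearest neighbor is picked with a plain `min` rather than a real k-NN library. With these points, the choice of norm changes which neighbor is nearest.

```python
from math import sqrt

def city_block(p, q):
    """1-norm (Manhattan) distance."""
    return sum(abs(px - qx) for px, qx in zip(p, q))

def euclidean(p, q):
    """2-norm (Euclidean) distance."""
    return sqrt(sum((px - qx) ** 2 for px, qx in zip(p, q)))

# Hypothetical query point and labeled candidates.
query = [0, 0]
candidates = {"A": [3, 3], "B": [0, 5]}

# 1-nearest neighbor under each distance.
nearest_l1 = min(candidates, key=lambda name: city_block(candidates[name], query))
nearest_l2 = min(candidates, key=lambda name: euclidean(candidates[name], query))

assert nearest_l1 == "B"  # 1-norm: A is 6 away, B is 5 away
assert nearest_l2 == "A"  # 2-norm: A is about 4.24 away, B is 5 away
```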

Summary
-----

- Your job as a Data Scientist is to measure things usefully, so distance metrics are important.
- 2-norm distance (Euclidean) is a good default
- 1-norm distance (Manhattan) is useful sometimes
- Choose p if you dare (Minkowski)
- Don't be afraid to take it to ∞ (Chebyshev)

-----
Bonus
-----

# Chebyshev distance

In [23]:
def chebyshev(p, q):
    "https://en.wikipedia.org/wiki/Chebyshev_distance"
    return max(abs(px - qx) for px, qx in zip(p, q))

In [24]:
from scipy.spatial.distance import chebyshev as chebyshev_benchmark

In [25]:
assert chebyshev(p, q) == chebyshev_benchmark(p, q)