In [None]:
import numpy as np

# Distance Metrics

A distance metric quantifies the "closeness" of data points $\rightarrow$ a proxy for similarity (the smaller the distance, the more similar the points)

## Manhattan Distance

Imagine a city grid where you can only travel along the grid lines (no diagonals); the distance is the total length traveled along each axis

> Does it matter what path we take along the grid?

<img src='images/manhattan-distance.png' width='50%'/>

### Formula

$$dist(A,B) = \sum_{k=1}^{N} |a_k - b_k| $$

### Code distance

A for-loop works, but NumPy's vectorized operations are usually much faster and more concise

In [None]:
a = np.array([2,3,5])
b = np.array([1,-1,3])

display(a)
display(b)

In [None]:
diffs = a - b
print('A - B')
display(diffs)

In [None]:
print('|A - B|')
abs_diff = np.abs(diffs)
display(abs_diff)

In [None]:
dist = np.sum(abs_diff)
print('sum(|A-B|)')
display(dist)
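
As a rough sketch of the loop-versus-vectorization remark, the two approaches can be written side by side. The function names `manhattan_loop` and `manhattan_vectorized` are hypothetical helpers introduced here for illustration; both should give the same result as the step-by-step cells above.

```python
import numpy as np

def manhattan_loop(a, b):
    """Manhattan distance with an explicit Python loop (illustrative helper)."""
    total = 0
    for a_k, b_k in zip(a, b):
        total += abs(a_k - b_k)
    return total

def manhattan_vectorized(a, b):
    """Manhattan distance with NumPy array operations (illustrative helper)."""
    return np.sum(np.abs(np.asarray(a) - np.asarray(b)))

a = np.array([2, 3, 5])
b = np.array([1, -1, 3])

print(manhattan_loop(a, b))        # 7
print(manhattan_vectorized(a, b))  # 7
```

For three-element arrays the speed difference is negligible; the vectorized form tends to pay off as the arrays grow.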

## Euclidean Distance (Pythagorean Distance)

The straight-line distance between two points, well known from the Pythagorean Theorem

<img src='images/euclidean-distance.png' width='50%'/>

### Formula

$$dist(A,B) = \sqrt{ \sum_{k=1}^{N} (a_k - b_k)^2 } $$

### Code distance

In [None]:
a = np.array([2,3,5])
b = np.array([1,-1,3])

display(a)
display(b)

In [None]:
diffs = a - b
print('A - B')
display(diffs)

In [None]:
print('(A - B)^2')
sq_diffs = diffs * diffs
display(sq_diffs)

In [None]:
print('sum[(A - B)^2]')
sq_sum = np.sum(sq_diffs)
display(sq_sum)

In [None]:
dist = np.sqrt(sq_sum)
print('√sum[(A - B)^2]')
display(dist)
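
The steps above can be collapsed into one function. This is a minimal sketch: `euclidean` is a hypothetical helper name, and `np.linalg.norm` is used only as a cross-check, since its default for a 1-D array is the Euclidean (2-)norm.

```python
import numpy as np

def euclidean(a, b):
    """Euclidean distance: square root of the sum of squared differences."""
    diffs = np.asarray(a) - np.asarray(b)
    return np.sqrt(np.sum(diffs ** 2))

a = np.array([2, 3, 5])
b = np.array([1, -1, 3])

dist_steps = euclidean(a, b)
dist_norm = np.linalg.norm(a - b)  # NumPy's built-in vector norm of the difference

print(dist_steps)
print(dist_norm)
```

Both values should equal $\sqrt{21}$, since the squared differences are $1 + 16 + 4$.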

## Minkowski Distance

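A generalization of distance in a normed vector space, parameterized by an order $c \geq 1$

The Manhattan ($c = 1$) and Euclidean ($c = 2$) distances above are special cases of the Minkowski Distance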

### Formula

$$dist(A,B) = \left( \sum_{k=1}^{N} |a_k - b_k|^c \right)^{\frac{1}{c}} $$

### Code distance

In [None]:
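def minkowski(A, B, c=2):
    """Minkowski distance of order c between arrays A and B (c=1: Manhattan, c=2: Euclidean)."""
    # |a_k - b_k| for each coordinate
    abs_diffs = np.abs(A - B)
    # Raise each absolute difference to the power c
    pow_diffs = np.power(abs_diffs, c)
    # Sum over all coordinates
    sum_diff = np.sum(pow_diffs)
    # Take the c-th root of the sum
    dist = np.power(sum_diff, 1 / c)
    return dist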

In [None]:
a = np.array([2,3,5])
b = np.array([1,-1,3])

display(a)
display(b)

In [None]:
# Euclidean Distance
minkowski(a,b)
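
To check the special cases numerically, here is a small self-contained sketch. The function is restated so the block runs on its own; the conversion to float is an added assumption, made so that large values of `c` do not overflow integer arrays.

```python
import numpy as np

def minkowski(A, B, c=2):
    """Minkowski distance of order c (restated here so the example is self-contained)."""
    abs_diffs = np.abs(np.asarray(A, dtype=float) - np.asarray(B, dtype=float))
    return np.power(np.sum(np.power(abs_diffs, c)), 1 / c)

a = np.array([2, 3, 5])
b = np.array([1, -1, 3])

d1 = minkowski(a, b, c=1)  # should match the Manhattan distance, 7
d2 = minkowski(a, b, c=2)  # should match the Euclidean distance, sqrt(21)

print(d1)
print(d2)
```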

# K-Nearest Neighbors

A supervised learning algorithm, used here for classification

## Summary

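Use the training data to "learn": to predict a test point, find the $k$ training points closest to it (by a distance metric) and take a majority vote of their labels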

<img src='https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/KnnClassification.svg/440px-KnnClassification.svg.png'/>
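
The idea can be sketched from scratch in a few lines. This is a minimal illustration, not a production implementation: `knn_predict` is a hypothetical helper, the toy data is made up, Euclidean distance is assumed, and ties in the vote are broken arbitrarily by `Counter`.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k=3):
    """Predict the label of x_test by majority vote among its k nearest training points."""
    # Euclidean distance from the test point to every training point
    dists = np.sqrt(np.sum((X_train - x_test) ** 2, axis=1))
    # Indices of the k closest training points
    nearest = np.argsort(dists)[:k]
    # Majority vote among the labels of those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two well-separated clusters
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_train, y_train, np.array([1.8, 1.5]))
pred_b = knn_predict(X_train, y_train, np.array([6.2, 6.4]))
print(pred_a)  # 0
print(pred_b)  # 1
```

Note that there is no separate fitting step: all the work happens at prediction time, which is what "lazy learning" refers to.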

## Advantages

- lazy learning (no training phase)
- easy to interpret

## Disadvantages

- the whole training set has to be kept in memory (works best on small data with few features)
- not robust to noisy points or irrelevant features; may not generalize well
- soft (overlapping) class boundaries are troublesome

## Determining K

### Overfitting & Underfitting
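
A rough illustration of how the choice of $k$ can push KNN toward overfitting or underfitting. Everything here is an assumption made for the demo: the helper `knn_predict`, the toy data, and the single class-1 point placed inside the class-0 cluster to act as noise.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_test, k):
    """Majority vote among the k nearest training points (Euclidean distance)."""
    dists = np.sqrt(np.sum((X_train - x_test) ** 2, axis=1))
    nearest = np.argsort(dists)[:k]
    return Counter(y_train[nearest].tolist()).most_common(1)[0][0]

# Hypothetical toy data: [1.6, 1.4] is a class-1 point sitting in the class-0 cluster
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [1.6, 1.4],
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]])
y_train = np.array([0, 0, 0, 1, 1, 1, 1])

# A query point in the middle of the class-0 cluster, right next to the noisy point
query = np.array([1.7, 1.4])
results = {k: knn_predict(X_train, y_train, query, k) for k in (1, 3, 7)}
print(results)  # {1: 1, 3: 0, 7: 1}
```

With $k = 1$ the prediction follows the single noisy neighbor (overfitting). With $k = 7$, the size of the whole training set, every query gets the overall majority label regardless of location (underfitting). In this toy case $k = 3$ recovers the label of the surrounding cluster.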

## Implementing via sklearn
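
A minimal sketch of the same idea with scikit-learn's `KNeighborsClassifier`, assuming scikit-learn is installed. The toy data is hypothetical. The parameter `p` is sklearn's name for the Minkowski order (called $c$ above), so `p=2` selects Euclidean distance.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical toy data: two well-separated clusters
X_train = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.0],
                    [6.0, 6.0], [6.5, 7.0], [7.0, 6.0]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# k = 3 neighbors, Minkowski metric of order 2 (Euclidean)
model = KNeighborsClassifier(n_neighbors=3, metric='minkowski', p=2)
model.fit(X_train, y_train)  # "fitting" mostly stores the training data

X_test = np.array([[1.8, 1.5], [6.2, 6.4]])
preds = model.predict(X_test)
print(preds)  # [0 1]
```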