## Distances and Dissimilarities Cheat Sheet

#### Table of Contents

* [Numeric Distances](#1.-Numeric-Distances)
    * [Manhattan](#1.1-Manhattan-distance)
    * [Euclidean](#1.2-Euclidean-distance)
    * [Chebyshev](#1.3-Chebyshev-distance)
    * [Minkowski](#1.4-Minkowski-(General))
    * [Cosine Similarity](#1.5-Cosine-Similarity)
* [Categorical Distances](#2.-Categorical-Distances)
    * [Hamming](#2.1-Hamming-Distance)
    * [Dice](#2.2-Dice-dissimilarity)
    * [Jaccard](#2.3-Jaccard-distance)
* [Mixed Distances](#3.-Distances-for-mixed-data)
    * [Gower](#Gower.Gower-Gower)

### 1. Numeric Distances
1.1 Manhattan (L<sub>1</sub>)

1.2 Euclidean (L<sub>2</sub>)

1.3 Chebyshev (L<sub>$\infty$</sub>)

1.4 Minkowski (General L<sub>p</sub>)

1.5 Cosine Similarity

#### 1.1 Manhattan distance

$$\sum_{i=0}^n|x_i - y_i|$$

* Intuition: "Taxicab / city block distance." This metric is less affected by outlier differences than Euclidean, because the differences are not squared.
* Examples:
    * `manhattan([0,0], [3,4])` is 7
    * `manhattan([0,0], [3,10])` is 13
    * `manhattan([2,3], [4,6])` is 5
* Code: `pdist(x, metric='cityblock')`
* "$L_1$ norm" (Minkowski distance ($L_p$) with $p=1$)
<center>Manhattan (L<sub>1</sub>)</center>
<img src='https://i.imgur.com/pvk6uO0.png'>
<hr>
<br>
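
To make the `pdist` call concrete, here is a minimal sketch assuming SciPy's `scipy.spatial.distance.pdist` with `metric='cityblock'` (SciPy's name for Manhattan). The array `x` is made up from the example points above. `pdist` returns a condensed vector with one entry per pair of rows, and `squareform` expands it into a full matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Rows are observations; the points reuse the examples above.
x = np.array([[0, 0], [3, 4], [3, 10]])

# Condensed output: one distance per pair, ordered (0,1), (0,2), (1,2).
condensed = pdist(x, metric="cityblock")

# Square form: entry [i, j] is the distance between rows i and j.
square = squareform(condensed)

print(condensed)
print(square)
```

Swapping the `metric` string is all it takes to switch to another distance on the same data.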

#### 1.2 Euclidean distance

$$\sqrt{\sum_{i=0}^n(x_i - y_i)^2}$$

* Intuition: "Straight line distance." This metric is more affected by outlier differences than Manhattan, because each difference is squared.
* Examples:
    * `euclidean([0,0], [3,4])` is 5
    * `euclidean([0,0], [3,10])` is 10.44
    * `euclidean([2,3], [4,6])` is 3.606
* Code: `pdist(x)` or `pdist(x, metric='euclidean')`
* "$L_2$ norm" (Minkowski distance ($L_p$) with $p=2$)
<center>Euclidean (L<sub>2</sub>)</center>
<img src='https://i.imgur.com/zcuxxY7.png'>
<hr>
<br>

#### 1.3 Chebyshev distance

$$\max_i(|x_i - y_i|)$$

* Intuition: "The biggest difference between the 2 rows." This metric is driven entirely by the single largest coordinate difference (it is only the max), so one outlier difference determines the result.
* Examples:
    * `chebyshev([0,0], [3,4])` is 4
    * `chebyshev([0,0], [3,10])` is 10
    * `chebyshev([2,3], [4,6])` is 3
* Code: `pdist(x, metric='chebyshev')`
* "$L_\infty$ norm" (Minkowski distance ($L_p$) with $p=\infty$)
<center>Chebyshev (L<sub>$\infty$</sub>)</center>
<img src='https://i.imgur.com/u1xZkja.png'>
<hr>
<br>

#### 1.4 Minkowski (General)

All of the distances above are special cases of Minkowski distance.  Plug in $p=1$ and $p=2$ to check that this is true (the $p=\infty$ case requires a limit and is a little tougher to prove).

$$\sqrt[p]{\sum_{i=0}^n|x_i - y_i|^p}$$

* As $p$ gets larger, more of the focus falls on the biggest difference between $x$ and $y$.
    * In Manhattan, $p=1$ and we weight each absolute difference the same.  For example, if we compare `[0, 0]` to `[3, 4]`, the differences are $3$ and $4$ and we simply add them up to get a distance of $7$.
    * In Euclidean, $p=2$ and by squaring each difference we put a greater emphasis on larger differences.  For example, if we compare `[0, 0]` to `[3, 4]`, the differences are $3$ and $4$.  Squaring these leads to $3^2 = 9$ and $4^2 = 16$; this exaggerates the importance of the larger difference and the final result is $\sqrt{9 + 16} = 5$.
    * In Chebyshev, $p=\infty$ and we *only* care about the biggest difference.
<center>Minkowski (L<sub>p</sub>)</center>
<img src='https://i.imgur.com/u1xZkja.png'>
<hr>
<br>
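
The effect of $p$ can be checked numerically. The sketch below is a from-scratch illustration of the Minkowski formula (the helper names `minkowski` and `chebyshev` are made up here, not library functions). It shows the distance between `[0, 0]` and `[3, 4]` shrinking from the Manhattan value toward the Chebyshev value as $p$ grows.

```python
def minkowski(x, y, p):
    """Minkowski distance of order p between two equal-length sequences."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)


def chebyshev(x, y):
    """Largest absolute coordinate difference (the p -> infinity limit)."""
    return max(abs(a - b) for a, b in zip(x, y))


x, y = [0, 0], [3, 4]

# p=1 is Manhattan, p=2 is Euclidean; larger p approaches Chebyshev.
for p in (1, 2, 4, 16, 64):
    print(p, round(minkowski(x, y, p), 4))

print("chebyshev", chebyshev(x, y))
```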

#### 1.5 Cosine Similarity

Cosine similarity ranges over [-1, 1]; to convert this to a 'distance' we take 1 - cosine similarity.  So the new range is [0, 2].

$$\cos(\theta) = \frac{x \cdot y}{\|x\| \|y\|}$$

See [this YouTube playlist](https://www.youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab) for a deeper intuition of vectors and linear algebra.

* Intuition: "Angle between the vectors defined by each observation."  It depends on how the columns relate to one another within each observation rather than on overall magnitude; if two observations have similar proportions between their columns, the distance is small.
* Examples:
    * `cosine_dis([0,0], [3,4])` is nan*
    * `cosine_dis([0,0], [3,10])` is nan*
    * `cosine_dis([0,0,1], [3,10,1])` is 0.905
    * `cosine_dis([2,3], [4,6])` is 0
* Code: `pdist(x, metric='cosine')`

*Think about the plot below: we can't really draw a vector from (0, 0) to (0, 0) and measure the angle between that and another vector.
<center>Cosine Dissimilarity</center>
<img src='https://i.imgur.com/5tljRAL.png'>
<center>A: [1,1], B: [4,4], C:[5,9]</center>
<hr>
<br>
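
The formula and the examples above can be reproduced with a short from-scratch sketch. The function name `cosine_distance` is made up for illustration, and returning `nan` for a zero vector is a choice made here to match the `nan` entries in the examples, not a statement about any library's behavior.

```python
import math


def cosine_distance(x, y):
    """1 - cosine similarity; nan when either vector has zero length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        # The zero vector has no direction, so the angle is undefined.
        return float("nan")
    return 1 - dot / (norm_x * norm_y)


print(cosine_distance([0, 0], [3, 4]))         # undefined angle
print(cosine_distance([0, 0, 1], [3, 10, 1]))  # nearly orthogonal
print(cosine_distance([2, 3], [4, 6]))         # same direction
print(cosine_distance([1, 0], [-1, 0]))        # opposite directions
```

Note that `[2, 3]` and `[4, 6]` differ in magnitude but point the same way, so their cosine distance is (up to floating point error) zero.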

### 2. Categorical Distances

2.1. Hamming Distance (0's are meaningful)

2.2. Dice Dissimilarity (matching 1's is much more important)

2.3. Jaccard Distance (Middle ground for binary and dummy)

<hr>
<br>

#### 2.1 Hamming Distance

$$\frac{n_{misses}}{n_{columns}}$$

* **Makes a lot of sense for binary columns where a `0` is a meaningful response.**
* Intuition: "What fraction of the elements between the 2 rows are different?"
* Examples:
    * `hamming([0,0,0], [1,1,1])` is $\frac{3}{3}$ = 1
    * `hamming([1,0,0], [1,1,1])` is $\frac{2}{3}$
    * `hamming([1,1,0], [1,1,1])` is $\frac{1}{3}$
    * `hamming([1,1,1], [1,1,1])` is $\frac{0}{3}$ = 0
    * `hamming([0,0,1], [0,0,0])` is $\frac{1}{3}$
    * `hamming([0,0,1], [0,1,1])` is $\frac{1}{3}$
* Code: `pdist(x, metric='hamming')` or `pdist(x, metric='matching')`
<hr>
<br>

#### 2.2 Dice dissimilarity

$$\frac{n_{misses}}{2n_{one\_matches} + n_{misses}}$$

* **Makes a lot of sense for dummy columns where a `0` is a less meaningful response, but matching on a 1 means a lot (i.e. a dummy matching on 1 means the original input categorical data matched).**
* Intuition: "Hamming distance but... ignore matches of `0`s and double-count matches of `1`s"
* Examples:
    * `dice([0,0,0], [1,1,1])` is $\frac{3}{2(0) + 3}$ = 1
    * `dice([1,0,0], [1,1,1])` is $\frac{2}{2(1) + 2}$ = $\frac{1}{2}$
    * `dice([1,1,0], [1,1,1])` is $\frac{1}{2(2) + 1}$ = $\frac{1}{5}$
    * `dice([1,1,1], [1,1,1])` is $\frac{0}{2(3) + 0}$ = 0
    * `dice([0,0,1], [0,0,0])` is $\frac{1}{2(0) + 1}$ = 1
    * `dice([0,0,1], [0,1,1])` is $\frac{1}{2(1) + 1}$ = $\frac{1}{3}$
* Code: `pdist(x, metric='dice')`
<hr>
<br>


#### 2.3 Jaccard distance

$$\frac{n_{misses}}{n_{one\_matches} + n_{misses}}$$

* **Makes a lot of sense for a mix of binary and dummy columns**
* Intuition: "What if there was a middle ground between hamming and dice?"
* Examples:
    * `jaccard([0,0,0], [1,1,1])` is $\frac{3}{0 + 3}$ = 1
    * `jaccard([1,0,0], [1,1,1])` is $\frac{2}{1 + 2}$ = $\frac{2}{3}$
    * `jaccard([1,1,0], [1,1,1])` is $\frac{1}{2 + 1}$ = $\frac{1}{3}$
    * `jaccard([1,1,1], [1,1,1])` is $\frac{0}{3 + 0}$ = 0
    * `jaccard([0,0,1], [0,0,0])` is $\frac{1}{0 + 1}$ = 1
    * `jaccard([0,0,1], [0,1,1])` is $\frac{1}{1 + 1}$ = $\frac{1}{2}$
* Code: `pdist(x, metric='jaccard')`
<hr>
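
The three categorical formulas differ only in how they count matches, which a side-by-side sketch makes easy to see. The helpers below (`match_counts`, `hamming`, `dice`, `jaccard`) are written from the formulas in this section for illustration; they are not the SciPy implementations.

```python
def match_counts(x, y):
    """Return (one_matches, zero_matches, misses) for two binary rows."""
    one_matches = sum(1 for a, b in zip(x, y) if a == b == 1)
    zero_matches = sum(1 for a, b in zip(x, y) if a == b == 0)
    misses = sum(1 for a, b in zip(x, y) if a != b)
    return one_matches, zero_matches, misses


def hamming(x, y):
    ones, zeros, misses = match_counts(x, y)
    return misses / (ones + zeros + misses)  # every column counts


# dice and jaccard raise ZeroDivisionError when both rows are all zeros
# (no one-matches and no misses); handle that case as your data requires.
def dice(x, y):
    ones, _, misses = match_counts(x, y)
    return misses / (2 * ones + misses)  # zero matches ignored, one matches doubled


def jaccard(x, y):
    ones, _, misses = match_counts(x, y)
    return misses / (ones + misses)  # zero matches ignored


pairs = [([1, 0, 0], [1, 1, 1]), ([0, 0, 1], [0, 1, 1])]
for x, y in pairs:
    print(x, y, hamming(x, y), dice(x, y), jaccard(x, y))
```

For `[0, 0, 1]` vs `[0, 1, 1]`, the shared `0` lowers the Hamming distance but is ignored by Jaccard, while Dice lands lower than Jaccard because the shared `1` is counted twice.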


### 3. Distances for mixed data

3.1. Gower

" 'p much all gower" -Adam Spannbauer
<hr>

#### Gower.Gower Gower
Gower distance is essentially a combination of Manhattan distance and Jaccard distance.
* It applies Manhattan to continuous variables and Jaccard to binary variables.
    * Gower is restrictive in how it preprocesses
* With this metric we also have the ability to assign weights to show how important each feature should be in the distance calculation.
    
```python
# !pip install gower
import gower
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "age": [21, 24, 35, 52, 55],
        "account_age": [2, 3, 12, 20, 18],
        "region": ["west", "south", "west", "east", "east"],
        "late_payments": ["y", "n", "y", "n", "y"],
    }
)

# Unweighted Gower matrix, shaded by distance (renders in a notebook)
pd.DataFrame(gower.gower_matrix(df)).style.background_gradient()

# I think late_payments should be 5 times as important as the
# rest of the features (an arbitrary choice, made up to demonstrate weights)

# can't use a list for weights; use an np.array or a pd.Series
w = np.array([1, 1, 1, 5])
pd.DataFrame(gower.gower_matrix(df, weight=w))
```

<table><tr><td>Gower Matrix</td><td>Weighted Gower Matrix</td></tr><tr><td><img src='https://i.imgur.com/F1H5qnJ.png'></td><td><img src='https://i.imgur.com/v3xi1Na.png'></td></tr></table>
   
Compare the 2 outputs.

Rows [0, 2, 4] all have the same value (`y`) for the more heavily weighted `late_payments` feature. The distances between [0, 2], [0, 4], and [2, 4] all got smaller when we weighted that feature.

Rows [1, 3] also share a value (`n`) for `late_payments`, so the distance between them got smaller too.

We also see larger distances between these 2 groups ([0, 2, 4] <-> [1, 3]).
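
To see why weighting moves the distances this way, here is a hand-rolled sketch of a weighted Gower distance for a single pair of rows. It assumes the textbook form (range-scaled Manhattan for numeric columns, a 0/1 mismatch for categorical columns, then a weighted average); the helper `gower_pair` is hypothetical and has not been checked against the internals of the `gower` package, so exact values may differ from `gower.gower_matrix`.

```python
def gower_pair(a, b, ranges, weights):
    """Weighted Gower distance between two records given as dicts.

    ranges maps each numeric column to (max - min) over the data set;
    columns missing from ranges are treated as categorical.
    """
    total = 0.0
    for col, w in weights.items():
        if col in ranges:
            part = abs(a[col] - b[col]) / ranges[col]  # range-scaled Manhattan
        else:
            part = 0.0 if a[col] == b[col] else 1.0    # simple mismatch
        total += w * part
    return total / sum(weights.values())


rows = [
    {"age": 21, "account_age": 2, "region": "west", "late_payments": "y"},
    {"age": 24, "account_age": 3, "region": "south", "late_payments": "n"},
    {"age": 35, "account_age": 12, "region": "west", "late_payments": "y"},
    {"age": 52, "account_age": 20, "region": "east", "late_payments": "n"},
    {"age": 55, "account_age": 18, "region": "east", "late_payments": "y"},
]

numeric = ["age", "account_age"]
ranges = {c: max(r[c] for r in rows) - min(r[c] for r in rows) for c in numeric}

equal = {"age": 1, "account_age": 1, "region": 1, "late_payments": 1}
heavy = {**equal, "late_payments": 5}

# Rows 0 and 2 agree on late_payments; rows 0 and 1 disagree.
for i, j in [(0, 2), (0, 1)]:
    print(i, j, gower_pair(rows[i], rows[j], ranges, equal),
          gower_pair(rows[i], rows[j], ranges, heavy))
```

When two rows agree on the heavily weighted feature, its contribution of zero takes up a larger share of the average and pulls the distance down; when they disagree, its contribution of one pushes the distance up.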