In [None]:
%reload_ext nb_black

### Warm-up

* Which supervised learning method's loss function is shown below? ($SSE$ is Sum of Squared Errors & $\beta_i$ is the $i$th coefficient of the model)
    * (A) Logistic Regression
    * (B) LASSO Regression
    * (C) Ridge Regression
    * (D) ElasticNet Regression

$$SSE + \lambda \sum_{i=1}|\beta_i|$$

* Bonus for the above: this method is also referred to as "\_\_\_\_\_ Regularization"
    * (A) L1
    * (B) L2
    * (C) A1
    * (D) B4

---

* Which supervised learning method's loss function is shown below? ($SSE$ is Sum of Squared Errors & $\beta_i$ is the $i$th coefficient of the model)
    * (A) Logistic Regression
    * (B) LASSO Regression
    * (C) Ridge Regression
    * (D) ElasticNet Regression

$$SSE + \lambda \sum_{i=1}\beta_i^2$$

* Bonus for the above: this method is also referred to as "\_\_\_\_\_ Regularization"
    * (A) L1
    * (B) L2
    * (C) A1
    * (D) B4

----

Some code for the remaining questions:

```python
import numpy as np

y_test = np.array([ 6, -2, -4, 6,  -7])
y_pred = np.array([ 4,  4, -3, 9, -30])

mae = np.____(np.____(y_pred - y_test))
rmse = np.____(np.____((y_pred - y_test) ** 2))
```

* What is the Mean Absolute Error in this case?

* What is the Root Mean Squared Error in this case?


* What are the differences between MAE and RMSE in this case?
* What are the differences between the 2 in general?
* Why do we have multiple metrics for error?

----

We're talking about distances today as if they're something new, but keep in mind you've been doing some distance calculations already: 
* Sometimes more explicitly - like when checking model performance by calculating MAE/RMSE (average distance between `y_pred` & `y_test`)
* Sometimes less explicitly - like when using LASSO/Ridge/ElasticNet (distance between coefficients and 0)
* Sometimes it was the whole point of the method - KNN (distance between observation and its nearest neighbors)
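
To make the LASSO/Ridge bullet concrete, here is a minimal sketch that computes both penalty terms as distances between a coefficient vector and the zero vector. The coefficients `beta` and the strength `lam` are made-up values for illustration, not output from a fitted model.

```python
import numpy as np
from scipy.spatial.distance import cityblock, sqeuclidean

# Hypothetical coefficients and penalty strength (illustration only)
beta = np.array([3.0, -4.0, 0.0])
lam = 0.5
origin = np.zeros_like(beta)

# LASSO penalty: lambda * sum(|beta_i|) = lambda * Manhattan distance to 0
l1_penalty = lam * float(cityblock(beta, origin))

# Ridge penalty: lambda * sum(beta_i^2) = lambda * squared Euclidean distance to 0
l2_penalty = lam * float(sqeuclidean(beta, origin))

print(l1_penalty)  # 3.5
print(l2_penalty)  # 12.5
```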

In [None]:
import numpy as np
import pandas as pd

from scipy.spatial.distance import pdist, squareform

import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

## Distances for continuous data

Resource that has plans to include all the distances we'll cover: https://adamspannbauer.github.io/distance_metrics_demo/

In [None]:
df = pd.DataFrame(
    [[1, 30, 0, 3, 80000], [2, 32, 0, 4, 77000], [3, 55, 3, 12, 81000]],
    columns=["id", "age", "n_children", "education", "income"],
)

df

* Who are the most similar intuitively?
* Calculate the distance between each row to support/refute your intuition.
    * We'll calculate these first few 'by hand' and then use an imported function later.

A more practical solution is to use the `pdist` function from `scipy.spatial.distance`.

It's often paired with the `squareform` function.

For a prettier print in jupyter you can convert to a dataframe.
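
A minimal sketch of that workflow, using a small made-up table rather than the `df` above so the hand calculation stays an exercise. The names `toy`, `condensed`, and `dist_df` and all values are hypothetical.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Hypothetical data (not the df above)
toy = pd.DataFrame({"x": [0, 3, 3], "y": [0, 4, 10]}, index=["a", "b", "c"])

# pdist returns a condensed 1-D array: one distance per pair of rows
condensed = pdist(toy.to_numpy())

# squareform expands it to a symmetric matrix with zeros on the diagonal
square = squareform(condensed)

# Label rows and columns so the matrix is readable in a notebook
dist_df = pd.DataFrame(square, index=toy.index, columns=toy.index)
print(dist_df.round(2))
```

The condensed form has one entry per pair (3 rows give 3 pairs); the square form repeats each distance on both sides of the diagonal.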

* What is the default method used in `pdist`?
* What other distance methods does `pdist` provide?

An example to show why cosine distance makes sense.

Let's say the below data is some features made from 2 different blogs.  The way we created this data is by counting the number of times each word appeared in each blog.

One was a political blog talking about Donald Trump; the other was a game blog talking about the intricacies of a card game's 'trump card' mechanic.

In [None]:
text_df = pd.DataFrame({"trump": [20, 40], "card": [1, 45], "donald": [17, 0]})
text_df.index = ["blog1", "blog2"]
text_df

The visualization we've been using together so far:

In [None]:
# .iloc selects by position; the index holds labels ("blog1", "blog2")
plt.scatter(text_df["trump"].iloc[0], text_df["card"].iloc[0], color="blue", label="blog1")
plt.scatter(text_df["trump"].iloc[1], text_df["card"].iloc[1], color="orange", label="blog2")

plt.axis("square")
plt.xlabel("n(trump)")
plt.ylabel("n(card)")
plt.xlim(-1, 48)
plt.ylim(-1, 48)
plt.legend(loc="upper left")
plt.show()

Another way to visualize/think about this data is as vectors.  Each vector starts at the origin, and its tip points to the location of the corresponding data point in the scatter plot.

In [None]:
# fmt: off
plt.quiver(
    [0], [0],
    text_df["trump"].iloc[0], text_df["card"].iloc[0],
    color="blue", label="blog1",
    angles="xy", scale_units="xy", scale=1,
)
plt.quiver(
    [0], [0],
    text_df["trump"].iloc[1], text_df["card"].iloc[1],
    color="orange", label="blog2",
    angles="xy", scale_units="xy", scale=1,
)
# fmt: on
plt.axis("square")
plt.xlabel("n(trump)")
plt.ylabel("n(card)")
plt.xlim(-1, 48)
plt.ylim(-1, 48)
plt.legend(loc="upper left")
plt.show()

We now have the same style of data but for a shorter post.  Which blog is the post most similar to?

In [None]:
new_observation = pd.DataFrame(
    {"trump": [5], "card": [6], "donald": [0]}, index=["new_post"]
)

full_df = pd.concat((text_df, new_observation))
full_df

A scatter plot makes our minds think more in terms of Euclidean distance.

In [None]:
# .iloc selects by position; the index holds labels, not integers
plt.scatter(full_df["trump"].iloc[0], full_df["card"].iloc[0], color="blue", label="blog1")
plt.scatter(full_df["trump"].iloc[1], full_df["card"].iloc[1], color="orange", label="blog2")
plt.scatter(
    full_df["trump"].iloc[2], full_df["card"].iloc[2], color="black", label="new_post",
)

plt.axis("square")
plt.xlabel("n(trump)")
plt.ylabel("n(card)")
plt.xlim(-1, 48)
plt.ylim(-1, 48)
plt.legend(loc="upper left")
plt.show()

A visual with a more vector representation of this data tells a different story of similarity/distance.

In [None]:
# fmt: off
plt.quiver(
    [0], [0],
    full_df["trump"].iloc[0], full_df["card"].iloc[0],
    color="blue", label="blog1",
    angles="xy", scale_units="xy", scale=1,
)
plt.quiver(
    [0], [0],
    full_df["trump"].iloc[1], full_df["card"].iloc[1],
    color="orange", label="blog2",
    angles="xy", scale_units="xy", scale=1,
)
plt.quiver(
    [0], [0],
    full_df["trump"].iloc[2], full_df["card"].iloc[2],
    color="black", label="new_post",
    angles="xy", scale_units="xy", scale=1,
)
# fmt: on
plt.axis("square")
plt.xlabel("n(trump)")
plt.ylabel("n(card)")
plt.xlim(-1, 48)
plt.ylim(-1, 48)
plt.legend(loc="upper left")
plt.show()

In [None]:
euclid_dist = squareform(pdist(full_df))
cosine_dist = squareform(pdist(full_df, metric="cosine"))

euclid_dist = pd.DataFrame(euclid_dist, columns=full_df.index, index=full_df.index)
cosine_dist = pd.DataFrame(cosine_dist, columns=full_df.index, index=full_df.index)

print("Original Data")
display(full_df)

print("\nEuclidean Distance")
display(euclid_dist)

print("\n1 - Cosine Similarity")
display(cosine_dist)
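
To see what `metric="cosine"` computes, here is a minimal sketch that works it out by hand for the short post against each blog, with the three rows re-created as plain arrays. The helper name `cosine_distance` is hypothetical; `scipy.spatial.distance.cosine` is used only as a cross-check.

```python
import numpy as np
from scipy.spatial.distance import cosine

# Word counts in column order: trump, card, donald
blog1 = np.array([20, 1, 17])
blog2 = np.array([40, 45, 0])
new_post = np.array([5, 6, 0])


def cosine_distance(u, v):
    """1 - cos(angle between u and v); hypothetical helper for illustration."""
    return 1 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))


for name, blog in [("blog1", blog1), ("blog2", blog2)]:
    by_hand = cosine_distance(new_post, blog)
    from_scipy = cosine(new_post, blog)
    print(name, round(float(by_hand), 3), round(float(from_scipy), 3))
```

Multiplying `new_post` by any positive constant leaves its cosine distance to each blog unchanged, which is why the length of the post does not matter under this metric.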

In [None]:
pdist([[2, 3], [4, 6]], metric="cityblock")

### Numeric distance cheat sheet:

#### Manhattan distance

$$\sum_{i=1}^n|x_i - y_i|$$

* Intuition: "Taxi cab/city block distance." This metric is less affected by outlier differences than Euclidean distance, because the differences are not squared.
* Examples:
    * `manhattan([0,0], [3,4])` is 7
    * `manhattan([0,0], [3,10])` is 13
    * `manhattan([2,3], [4,6])` is 5
* Code: `pdist(x, metric='cityblock')`
* "$L_1$ norm" (is Minkowski distance ($L_p$) with $p=1$)


#### Euclidean distance

$$\sqrt{\sum_{i=1}^n(x_i - y_i)^2}$$

* Intuition: "Straight line distance." This metric is more affected by outlier differences than Manhattan distance (because each difference is squared).
* Examples:
    * `euclidean([0,0], [3,4])` is 5
    * `euclidean([0,0], [3,10])` is 10.44
    * `euclidean([2,3], [4,6])` is 3.606
* Code: `pdist(x)` or `pdist(x, metric='euclidean')`
* "$L_2$ norm" (is Minkowski distance ($L_p$) with $p=2$)


#### Chebyshev distance

$$\max_i(|x_i - y_i|)$$

* Intuition: "The biggest difference between the 2 rows." This metric depends only on the single largest column difference (it's only the max), so one outlier difference determines the whole distance.
* Examples:
    * `chebyshev([0,0], [3,4])` is 4
    * `chebyshev([0,0], [3,10])` is 10
    * `chebyshev([2,3], [4,6])` is 3
* Code: `pdist(x, metric='chebyshev')`
* "$L_\infty$ norm" (is Minkowski distance ($L_p$) with $p=\infty$)
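
The Manhattan, Euclidean, and Chebyshev entries are all members of the Minkowski family. A minimal sketch that checks the `[0,0]` vs `[3,10]` examples with `scipy.spatial.distance` and shows $L_p$ approaching the Chebyshev value as $p$ grows (the choice of $p=50$ is arbitrary):

```python
from scipy.spatial.distance import chebyshev, cityblock, euclidean, minkowski

u, v = [0, 0], [3, 10]

manhattan_d = float(cityblock(u, v))
euclidean_d = float(euclidean(u, v))
chebyshev_d = float(chebyshev(u, v))
print(manhattan_d, round(euclidean_d, 2), chebyshev_d)  # 13.0 10.44 10.0

# Minkowski with p=1 and p=2 reproduces Manhattan and Euclidean;
# a large p gets close to the Chebyshev (max) distance
minkowski_by_p = {p: float(minkowski(u, v, p=p)) for p in (1, 2, 50)}
for p, d in minkowski_by_p.items():
    print(p, round(d, 2))
```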

#### Cosine dissimilarity

$$1 - \frac{\sum_{i=1}^n x_i y_i}{\sqrt{\sum_{i=1}^n x_i^2}\sqrt{\sum_{i=1}^n y_i^2}}$$

* Intuition: "Angle between the vectors defined by each observation."  Focuses on how the columns relate to each other within each observation; if the relationships between columns are the same (one row is a positive multiple of the other), the distance is 0 regardless of magnitude.
* Examples:
    * `cosine_dis([0,0], [3,4])` is nan
    * `cosine_dis([0,0], [3,10])` is nan
    * `cosine_dis([0,0,1], [3,10,1])` is 0.905
    * `cosine_dis([2,3], [4,6])` is 0
* Code: `pdist(x, metric='cosine')`
* "1 - cosine similarity" (not a Minkowski ($L_p$) distance)

## Distances for categorical data

### Scenario 1:

Which users are the most similar?

In [None]:
df = pd.DataFrame(
    {
        "subscriber": ["yes", "no", "no"],
        "dog_owner": ["yes", "yes", "no"],
        "cat_owner": ["yes", "yes", "no"],
        "smoker": ["yes", "yes", "yes"],
    },
    index=["user_1", "user_2", "user_3"],
)
df

Encode the data for an ML model to consume it:

* Does this encoding change how similar things look?
* With regard to comparing rows:
    * what does it mean when both rows have a `0`?
    * what does it mean when both rows have a `1`?
    * what does it mean when one row has a `0` and one row has a `1`?
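
One possible encoding for this scenario, as a sketch: map `"yes"` to 1 and `"no"` to 0, then compare rows with Hamming distance, which treats a shared `0` as agreement just like a shared `1`. The frame is re-created so the block runs on its own; the names `encoded` and `hamming_df` are illustrative.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

df = pd.DataFrame(
    {
        "subscriber": ["yes", "no", "no"],
        "dog_owner": ["yes", "yes", "no"],
        "cat_owner": ["yes", "yes", "no"],
        "smoker": ["yes", "yes", "yes"],
    },
    index=["user_1", "user_2", "user_3"],
)

# Every column is yes/no, so a 0 ("no") is a meaningful answer
encoded = (df == "yes").astype(int)

# Fraction of columns where two users gave different answers
hamming_df = pd.DataFrame(
    squareform(pdist(encoded.to_numpy(), metric="hamming")),
    index=df.index,
    columns=df.index,
)
print(hamming_df)
```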

### Scenario 2:

Which users are most similar?

In [None]:
df = pd.DataFrame(
    {
        "region": ["west", "south", "south", "north", "east"],
        "favorite_show": [
            "office",
            "sportscenter",
            "office",
            "sportscenter",
            "bachelor",
        ],
        "music_service": ["spotify", "apple", "spotify", "pandora", "apple"],
        "favorite_planet": ["earth", "pluto", "earth", "pandora", "pluto"],
    },
    index=["user_1", "user_2", "user_3", "user_4", "user_5"],
)

df

Encode the data for an ML model to consume it:

* Does this encoding change how similar things look?
* With regard to comparing rows:
    * what does it mean when both rows have a `0`?
    * what does it mean when both rows have a `1`?
    * what does it mean when one row has a `0` and one row has a `1`?
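
One possible encoding for this scenario, as a sketch: dummy-encode every column with `pd.get_dummies`, then compare rows with Jaccard distance, which ignores positions where both rows are `0`. The frame is re-created so the block runs on its own; the names `dummies` and `jaccard_df` are illustrative, and Jaccard is only one reasonable choice here.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

df = pd.DataFrame(
    {
        "region": ["west", "south", "south", "north", "east"],
        "favorite_show": ["office", "sportscenter", "office", "sportscenter", "bachelor"],
        "music_service": ["spotify", "apple", "spotify", "pandora", "apple"],
        "favorite_planet": ["earth", "pluto", "earth", "pandora", "pluto"],
    },
    index=["user_1", "user_2", "user_3", "user_4", "user_5"],
)

# One dummy column per category level; a 0 only means "not this level"
dummies = pd.get_dummies(df).astype(bool)

# Shared 0s are ignored, so only shared 1s count as agreement
jaccard_df = pd.DataFrame(
    squareform(pdist(dummies.to_numpy(), metric="jaccard")),
    index=df.index,
    columns=df.index,
)
print(jaccard_df.round(2))
```

Each user has exactly one `1` per original column, so two users who share no category values at all end up at the maximum distance of 1.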
    
-----

### Categorical distance cheat sheet:

#### Hamming distance

$$\frac{n_{misses}}{n_{columns}}$$

* **Makes a lot of sense for binary columns where a `0` is a meaningful response.**
* Intuition: "What fraction of the elements between the 2 rows are different?"
* Examples:
    * `hamming([0,0,0], [1,1,1])` is $\frac{3}{3}$ = 1
    * `hamming([1,0,0], [1,1,1])` is $\frac{2}{3}$
    * `hamming([1,1,0], [1,1,1])` is $\frac{1}{3}$
    * `hamming([1,1,1], [1,1,1])` is $\frac{0}{3}$ = 0
    * `hamming([0,0,1], [0,0,0])` is $\frac{1}{3}$
    * `hamming([0,0,1], [0,1,1])` is $\frac{1}{3}$
* Code: `pdist(x, metric='hamming')` or `pdist(x, metric='matching')`


#### Dice dissimilarity

$$\frac{n_{misses}}{2n_{one\_matches} + n_{misses}}$$

* **Makes a lot of sense for dummy columns where a `0` is a less meaningful response, but matching on a 1 means a lot (i.e. a dummy matching on 1 means the original input categorical data matched).**
* Intuition: "Hamming distance but... ignore matches of `0`s and extra count matches of `1`s"
* Examples:
    * `dice([0,0,0], [1,1,1])` is $\frac{3}{2(0) + 3}$ = 1
    * `dice([1,0,0], [1,1,1])` is $\frac{2}{2(1) + 2}$ = $\frac{1}{2}$
    * `dice([1,1,0], [1,1,1])` is $\frac{1}{2(2) + 1}$ = $\frac{1}{5}$
    * `dice([1,1,1], [1,1,1])` is $\frac{0}{2(3) + 0}$ = 0
    * `dice([0,0,1], [0,0,0])` is $\frac{1}{2(0) + 1}$ = 1
    * `dice([0,0,1], [0,1,1])` is $\frac{1}{2(1) + 1}$ = $\frac{1}{3}$
* Code: `pdist(x, metric='dice')`


#### Jaccard distance

$$\frac{n_{misses}}{n_{one\_matches} + n_{misses}}$$

* **Makes a lot of sense for a mix of binary and dummy columns**
* Intuition: "What if there was a middle ground between hamming and dice?"
* Examples:
    * `jaccard([0,0,0], [1,1,1])` is $\frac{3}{0 + 3}$ = 1
    * `jaccard([1,0,0], [1,1,1])` is $\frac{2}{1 + 2}$ = $\frac{2}{3}$
    * `jaccard([1,1,0], [1,1,1])` is $\frac{1}{2 + 1}$ = $\frac{1}{3}$
    * `jaccard([1,1,1], [1,1,1])` is $\frac{0}{3 + 0}$ = 0
    * `jaccard([0,0,1], [0,0,0])` is $\frac{1}{0 + 1}$ = 1
    * `jaccard([0,0,1], [0,1,1])` is $\frac{1}{1 + 1}$ = $\frac{1}{2}$
* Code: `pdist(x, metric='jaccard')`
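
The three categorical formulas differ only in how they use the same two counts. A minimal sketch that computes them by hand for the `[0,0,1]` vs `[0,1,1]` example and cross-checks against `scipy.spatial.distance` (the dictionary names `by_hand` and `from_scipy` are illustrative):

```python
import numpy as np
from scipy.spatial.distance import dice, hamming, jaccard

u = np.array([0, 0, 1], dtype=bool)
v = np.array([0, 1, 1], dtype=bool)

n_misses = int(np.sum(u != v))      # positions that disagree
n_one_matches = int(np.sum(u & v))  # positions where both rows are 1
n_columns = len(u)

by_hand = {
    "hamming": n_misses / n_columns,
    "dice": n_misses / (2 * n_one_matches + n_misses),
    "jaccard": n_misses / (n_one_matches + n_misses),
}
from_scipy = {
    "hamming": float(hamming(u, v)),
    "dice": float(dice(u, v)),
    "jaccard": float(jaccard(u, v)),
}

for name in by_hand:
    print(name, round(by_hand[name], 3), round(from_scipy[name], 3))
```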