# Lab 07: Distance Metrics (+2 Bonus Points)

This lab is presented with some revisions from [Dennis Sun at Cal Poly](https://web.calpoly.edu/~dsun09/index.html) and his [Data301 Course](http://users.csc.calpoly.edu/~dsun09/data301/lectures.html)

**When you have filled out all the questions, submit via [Tulane Canvas](https://tulane.instructure.com/)**

In the previous labs we discussed ways to measure relationships between variables, or the _columns_ of a `DataFrame`. This lab is about how to measure relationships between observations, or the _rows_ of a `DataFrame`. How do we quantify how "similar" two observations are?

## Part 1: Distances between Quantitative Variables

We will use the Ames housing data set, but to keep things simple, we will work with just three quantitative variables from that data set: the number of bedrooms, the number of bathrooms, and the living area (in square feet).

In [2]:
# clone the course repository, change to right directory, and import libraries.
%cd /content
!git clone https://github.com/nmattei/cmps3160.git
%cd /content/cmps3160/_labs/Lab07
import pandas as pd
import numpy as np

/content
Cloning into 'cmps3160'...
remote: Enumerating objects: 2071, done.
remote: Counting objects: 100% (234/234), done.
remote: Compressing objects: 100% (124/124), done.
remote: Total 2071 (delta 158), reused 125 (delta 103), pack-reused 1837 (from 2)
Receiving objects: 100% (2071/2071), 52.49 MiB | 29.31 MiB/s, done.
Resolving deltas: 100% (1209/1209), done.
/content/cmps3160/_labs/Lab07


In [3]:
housing_df = pd.read_csv("../data/ames.tsv", sep="\t")

# extract 3 quantitative variables
housing_df_quant = housing_df[["Bedroom AbvGr", "Gr Liv Area"]].copy()
housing_df_quant["Bathrooms"] = (
    housing_df["Full Bath"] +
    0.5 * housing_df["Half Bath"]
)
housing_df_quant

Unnamed: 0,Bedroom AbvGr,Gr Liv Area,Bathrooms
0,3,1656,1.0
1,2,896,1.0
2,3,1329,1.5
3,3,2110,2.5
4,3,1629,2.5
...,...,...,...
2925,3,1003,1.0
2926,2,902,1.0
2927,3,970,1.0
2928,2,1389,1.0


Shown below is a (three-dimensional) scatterplot of these variables. Consider the two observations connected by a red line. (The label next to each point is its index in the `DataFrame`.) To measure how similar they are, we can calculate the distance between the two points.

<img src="https://github.com/nmattei/cmps3160/blob/master/_labs/images/distance.png?raw=1">

Calculating the distance between two points is not as straightforward as it might seem because there is more than one way to define distance. The one most familiar to you is probably **Euclidean distance**, which is the straight-line distance ("as the crow flies") between the two points. The formula for calculating this distance is a generalization of the Pythagorean theorem:

$$ d({\bf x}, {\bf x'}) = \sqrt{\sum_{j=1}^D (x_j - x'_j)^2} $$

The quantity under the square root is a sum of squared differences, a form we have seen before!

In [4]:
x = housing_df_quant.loc[2927]
x1 = housing_df_quant.loc[2928]
x
x - x1

Unnamed: 0,0
Bedroom AbvGr,1.0
Gr Liv Area,-419.0
Bathrooms,0.0


In [5]:
(x - x1) ** 2

Unnamed: 0,0
Bedroom AbvGr,1.0
Gr Liv Area,175561.0
Bathrooms,0.0


In [6]:
np.sqrt(((x - x1) ** 2).sum())

np.float64(419.0011933157231)

The beauty of this definition is that it generalizes to more than three dimensions. Even though it is difficult to visualize points in 100-dimensional space, we can calculate distances between them in exactly the same way.
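As a minimal sketch of that generalization, the same expression can be wrapped in a function that accepts points of any dimension. The function name `euclidean_distance` and the 100-dimensional points are made up for illustration; the 3-dimensional points are houses 2927 and 2928 from the table above.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance between two points with the same number of coordinates."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return np.sqrt(((a - b) ** 2).sum())

# Two points in 3 dimensions: (bedrooms, living area, bathrooms)
d3 = euclidean_distance([3, 970, 1.0], [2, 1389, 1.0])

# Two made-up points in 100 dimensions: every coordinate differs by exactly 1
p = np.zeros(100)
q = np.ones(100)
d100 = euclidean_distance(p, q)   # sqrt(100 * 1**2) = 10

print(d3, d100)
```

Nothing in the function depends on the number of coordinates, which is why the definition carries over unchanged.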

However, Euclidean distance is not the only way to measure how far apart two points are. There is also [**Manhattan distance**](https://en.wikipedia.org/wiki/Taxicab_geometry) (also called _taxicab distance_), which measures the distance a taxicab in Manhattan would have to drive to travel from A to B. Taxicabs are not able to travel in a straight line (i.e., the green path below, the Euclidean distance) because they have to follow the street grid. But there are multiple paths along the street grid that all have exactly the same length (i.e., the red, yellow, and blue paths below); the Manhattan distance is the length of any one of these shortest paths.

<img src="https://github.com/nmattei/cmps3160/blob/master/_labs/images/dist.png?raw=1">

The formula for Manhattan distance is actually quite similar to the formula for Euclidean distance. Instead of squaring the differences and taking the square root at the end (as in Euclidean distance), we simply take absolute values:
$$ d({\bf x}, {\bf x'}) = \sum_{j=1}^D |x_j - x'_j|. $$

The following code calculates Manhattan distance:

In [7]:
((x - x1).abs()).sum()

np.float64(420.0)

### Comparison of Euclidean and Manhattan distance

The Euclidean distance was essentially just the largest difference. This is because Euclidean distance first _squares_ the differences. The squaring operation has a "rich get richer" effect; larger values get magnified by more than smaller values. As a result, the largest differences tend to dominate the Euclidean distance.

On the other hand, Manhattan distance treats all differences equally. So Manhattan distance is preferred if you are concerned that an outlier in one variable might dominate the distance metric.
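The "rich get richer" effect can be made concrete with the three differences computed above for houses 2927 and 2928. The sketch below also shows a common pitfall with the parentheses: applying the square root before the sum undoes the squaring one coordinate at a time, so the result is the Manhattan distance rather than the Euclidean distance. The variable names are chosen for illustration only.

```python
import numpy as np

# Coordinate-wise differences between houses 2927 and 2928 (from the output above)
diff = np.array([1.0, -419.0, 0.0])

euclidean = np.sqrt((diff ** 2).sum())   # square, sum, then one square root
manhattan = np.abs(diff).sum()           # absolute values, then sum

# Share of the squared total contributed by each coordinate
share = diff ** 2 / (diff ** 2).sum()

# Pitfall: the square root is taken *before* summing, so each term is just
# an absolute difference and the total is the Manhattan distance again.
misplaced = np.sqrt(diff ** 2).sum()

print(euclidean, manhattan, misplaced)
print(share.round(6))
```

Almost all of the squared total comes from the living-area difference, while the Manhattan distance simply adds the one-bedroom difference on top of it.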

### The Importance of Scaling

Here's a quiz. There are two pairs of observations in the figure below, one connected by a red line, the other connected by an orange line. Which pair of observations is more similar (assuming we use Euclidean distance)?

![](https://github.com/nmattei/cmps3160/blob/master/_labs/images/closer.png?raw=1)

Let's actually calculate these two distances.

In [8]:
# Distance between two points connected by red line
x = housing_df_quant.loc[2927]
x1 = housing_df_quant.loc[2928]

np.sqrt(((x - x1) ** 2).sum())

np.float64(419.0011933157231)

In [9]:
# Distance between two points connected by orange line
x = housing_df_quant.loc[2498]
x1 = housing_df_quant.loc[290]

np.sqrt(((x - x1) ** 2).sum())

np.float64(5.0990195135927845)

Surprised by the answer? The scatterplot is deceiving because it automatically scales the variables to make the points fit on the same plot. In reality, the variables are on very different scales. The numbers of bedrooms and bathrooms are in the single digits, while living area is in the hundreds or thousands of square feet. When variables are on such different scales, the variable with the largest variability will dominate the distance metric.
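One way to see how much the units matter is to recompute a distance after a change of units. The sketch below uses houses 2927 and 2928 and a hypothetical conversion of living area from square feet to thousands of square feet; the conversion is only an illustration, not one of the scaling methods used in this lab.

```python
import numpy as np

# Houses 2927 and 2928: (bedrooms, living area in sq ft, bathrooms)
a = np.array([3, 970, 1.0])
b = np.array([2, 1389, 1.0])

def euclidean(u, v):
    return np.sqrt(((u - v) ** 2).sum())

# Living area measured in square feet
dist_sqft = euclidean(a, b)

# Same houses, living area measured in thousands of square feet
# (a hypothetical change of units, used only for illustration)
unit_change = np.array([1, 1 / 1000, 1])
dist_ksqft = euclidean(a * unit_change, b * unit_change)

print(dist_sqft)    # dominated by the living-area difference
print(dist_ksqft)   # now the one-bedroom difference matters most
```

The houses have not changed, yet the distance between them has, which is why an arbitrary choice of units should not be allowed to decide which observations count as similar.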

The plot below shows the same data, but drawn to scale. You can see that differences in the number of bedrooms and the number of bathrooms hardly matter at all; only the variability in the living area matters.

![](https://github.com/nmattei/cmps3160/blob/master/_labs/images/closer_rescaled.png?raw=1)

To obtain distances that agree more with our intuition---and that do not give too much weight to any one variable---we transform the variables to be on the same scale. There are a few ways to **scale** a variable:

- **standardizing**: subtract the mean from each variable, then divide by its standard deviation (also called z-standardization),
$$ x_i \leftarrow \frac{x_i - \text{mean}[X]}{\text{SD}[X]} $$
- **normalizing**: scale each variable to have length (or "norm") 1,
$$ x_i \leftarrow \frac{x_i}{\sqrt{\sum_{k=1}^n x_k^2}} $$
- **min/max scaling**: scale each variable so that all values are between 0 and 1,
$$x_i \leftarrow \frac{x_i - \min[X]}{\max[X] - \min[X]}.$$

The figure below illustrates what each of these scaling methods does to a synthetic data set with two variables. All three methods scale the variables in similar (but slightly different) ways, resulting in figure-eights with different aspect ratios.  Standardizing also moves the data to be centered around the origin, while min-max scaling moves the data to be in a box whose corners are $(0, 0)$ and $(1, 1)$.

![](https://github.com/nmattei/cmps3160/blob/master/_labs/images/scaling.png?raw=1)
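The three formulas can be sketched in `pandas` on a small made-up table (the column names and values below are invented for illustration). Each line scales every column separately, and the printed checks confirm the property each method is meant to produce.

```python
import numpy as np
import pandas as pd

# A small made-up data set with two variables on very different scales
toy = pd.DataFrame({
    "rooms": [1.0, 2.0, 3.0, 4.0],
    "area":  [800.0, 1200.0, 1500.0, 2500.0],
})

standardized = (toy - toy.mean()) / toy.std()            # mean 0, SD 1
normalized = toy / np.sqrt((toy ** 2).sum())             # column length 1
min_max = (toy - toy.min()) / (toy.max() - toy.min())    # values in [0, 1]

print(standardized.std())                  # SD of each standardized column
print(np.sqrt((normalized ** 2).sum()))    # length of each normalized column
print(min_max.min(), min_max.max(), sep="\n")
```

Note that `.std()` in `pandas` uses the sample standard deviation by default, so the standardized columns have sample SD equal to 1.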

Let's standardize the Ames housing data, and see how it affects the distance metric.

In [10]:
housing_df_std = (
    (housing_df_quant - housing_df_quant.mean()) /
    housing_df_quant.std()
)
housing_df_std

Unnamed: 0,Bedroom AbvGr,Gr Liv Area,Bathrooms
0,0.176064,0.309212,-1.176462
1,-1.032058,-1.194223,-1.176462
2,0.176064,-0.337661,-0.398702
3,0.176064,1.207317,1.156819
4,0.176064,0.255801,1.156819
...,...,...,...
2925,0.176064,-0.982555,-1.176462
2926,-1.032058,-1.182354,-1.176462
2927,0.176064,-1.047836,-1.176462
2928,-1.032058,-0.218968,-1.176462


Notice that the resulting `DataFrame` contains negative values. This makes sense because standardizing makes the mean of every variable equal to 0. If the mean is 0, then some values must be negative.

The above command is deceptively simple. We actually subtracted a `Series` from a `DataFrame`, then divided the resulting `DataFrame` by another `Series`. We relied on `pandas` to broadcast each `Series` over the right dimension of the `DataFrame`. To be more explicit about the broadcasting, we could have also used the `.sub()` and `.divide()` methods (instead of `-` and `/`) and been explicit about the axis:

In [11]:
housing_df_std = (housing_df_quant.
                  sub(housing_df_quant.mean(), axis=1).
                  divide(housing_df_quant.std(), axis=1))
housing_df_std

Unnamed: 0,Bedroom AbvGr,Gr Liv Area,Bathrooms
0,0.176064,0.309212,-1.176462
1,-1.032058,-1.194223,-1.176462
2,0.176064,-0.337661,-0.398702
3,0.176064,1.207317,1.156819
4,0.176064,0.255801,1.156819
...,...,...,...
2925,0.176064,-0.982555,-1.176462
2926,-1.032058,-1.182354,-1.176462
2927,0.176064,-1.047836,-1.176462
2928,-1.032058,-0.218968,-1.176462


Now let's recalculate the distances using this standardized data and see if our conclusions change.

In [12]:
# Distance between two points connected by red line
x = housing_df_std.loc[2927]
x1 = housing_df_std.loc[2928]

np.sqrt(((x - x1) ** 2).sum())

np.float64(1.4651211129695825)

In [13]:
# Distance between two points connected by orange line
x = housing_df_std.loc[2498]
x1 = housing_df_std.loc[290]

np.sqrt(((x - x1) ** 2).sum())

np.float64(3.9440754446060033)

So, if we first standardize the data, then the pair of observations connected by the red line are more similar than the pair connected by the orange line, which matches our intuition. It is (almost) always a good idea to scale your variables before calculating distances.

Now that you've seen how to implement one scaling method (standardization), you will implement two more (normalization and min-max scaling) in Exercises 1 and 2 below.

### Exercises Part 1

#### Exercise 1

Instead of standardizing the three variables from the Ames housing data set, normalize them. Then, recompute the distances between the two pairs of points above. Does your conclusion change?

In [14]:
housing_df_norm = housing_df_quant / np.sqrt(((housing_df_quant**2).sum(axis=0)))

housing_df_norm

Unnamed: 0,Bedroom AbvGr,Gr Liv Area,Bathrooms
0,0.018649,0.019331,0.009878
1,0.012433,0.010460,0.009878
2,0.018649,0.015514,0.014817
3,0.018649,0.024631,0.024695
4,0.018649,0.019016,0.024695
...,...,...,...
2925,0.018649,0.011709,0.009878
2926,0.012433,0.010530,0.009878
2927,0.018649,0.011323,0.009878
2928,0.012433,0.016215,0.009878


In [15]:
# Distance between two points connected by red line
x = housing_df_norm.loc[2927]
x1 = housing_df_norm.loc[2928]

np.sqrt(((x - x1) ** 2).sum())

np.float64(0.007910021508841998)

In [16]:
# Distance between two points connected by orange line
x = housing_df_norm.loc[2498]
x1 = housing_df_norm.loc[290]

np.sqrt(((x - x1) ** 2).sum())

np.float64(0.021103948426701397)

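**Written Answers Here:** Compared with the unscaled data, yes: after normalizing, the red pair (about 0.0079) is closer than the orange pair (about 0.0211), which is the same ordering we obtained by standardizing. The distances themselves are much smaller because each normalized column has length 1.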

#### Exercise 2

Instead of standardizing the three variables from the Ames housing data set, apply a min-max scaling to them. Then, recompute the distances between the two pairs of points above. Does your conclusion change?

In [17]:
housing_df_min_max = (housing_df_quant - housing_df_quant.min()) / (housing_df_quant.max() - housing_df_quant.min())

housing_df_min_max

Unnamed: 0,Bedroom AbvGr,Gr Liv Area,Bathrooms
0,0.375,0.249058,0.2
1,0.250,0.105878,0.2
2,0.375,0.187453,0.3
3,0.375,0.334589,0.5
4,0.375,0.243971,0.5
...,...,...,...
2925,0.375,0.126036,0.2
2926,0.250,0.107008,0.2
2927,0.375,0.119819,0.2
2928,0.250,0.198757,0.2


In [18]:
# Distance between two points connected by red line
x = housing_df_min_max.loc[2927]
x1 = housing_df_min_max.loc[2928]

np.sqrt(((x - x1) ** 2).sum())

np.float64(0.14783815972387498)

In [19]:
# Distance between two points connected by orange line
x = housing_df_min_max.loc[2498]
x1 = housing_df_min_max.loc[290]

np.sqrt(((x - x1) ** 2).sum())

np.float64(0.425000668096024)

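**Written Answers Here:**
 Compared with the unscaled data, yes: after min-max scaling, the red pair (about 0.148) is closer than the orange pair (about 0.425), which again agrees with the standardized result. Only the magnitudes of the distances differ from one scaling method to another.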

The next exercises ask you to work with a data set that describes the chemical composition of 1599 red wines (`../data/reds.csv`). There are 12 variables in this data set, all of which are quantitative (so each observation is a point in 12-dimensional space).

In [20]:
df_reds = pd.read_csv("../data/reds.csv", sep=';')
df_reds[:5]

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


#### Exercise 3


Find which red wine is more similar to wine 0 in the `DataFrame`: wine 6 or wine 36? (Do not scale the variables.) You should do this for both Euclidean distance and Manhattan distance.  Does your answer depend on which distance metric you use to measure "similarity"?

In [21]:
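#Euclidean Distance
# Sum the squared differences first, then take one square root of the total.
# Writing np.sqrt((x - x1) ** 2).sum() instead takes the square root of each
# squared difference (its absolute value) and sums those, which is the
# Manhattan distance.
x = df_reds.loc[0]
x1 = df_reds.loc[6]

display(np.sqrt(((x - x1) ** 2).sum()))

x = df_reds.loc[0]
x1 = df_reds.loc[36]

display(np.sqrt(((x - x1) ** 2).sum()))

In [22]:
#Manhattan Distance
x = df_reds.loc[0]
x1 = df_reds.loc[6]

display(((x - x1).abs()).sum())

x = df_reds.loc[0]
x1 = df_reds.loc[36]

display(((x - x1).abs()).sum())

np.float64(30.278400000000005)

np.float64(30.680300000000003)

**Written Answers:**  Under Manhattan distance, wine 6 is slightly closer to wine 0 than wine 36 is (30.28 versus 30.68). For the Euclidean comparison the sum must go inside the square root, `np.sqrt(((x - x1) ** 2).sum())`; the form `np.sqrt((x - x1) ** 2).sum()` takes the square root of each squared difference, which is its absolute value, and therefore reproduces the Manhattan values exactly. Because squaring emphasizes the largest differences, the two metrics can rank the wines differently when the totals are this close.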

#### Exercise 4

Now suppose we agree to measure similarity using Euclidean distance, and we wish to investigate the effect of scaling the variables. Which red wine is more similar to wine 0: wine 6 or wine 36? Does the answer depend on whether the variables are scaled or not? Does it depend on the choice of scaling?  What happens for each type of scaling? Your answer should test all of the scaling methods described above.

In [62]:
# Euclidean distance: sum the squared differences, then take the square root.
# (np.sqrt((a - b) ** 2).sum() would sum the absolute differences instead,
# which is the Manhattan distance.)
def euclidean(a, b):
    return np.sqrt(((a - b) ** 2).sum())

# Print the distance from wine 0 to wine 6 and to wine 36 for one version of the data
def report(df, label):
    for other in [6, 36]:
        dist = euclidean(df.loc[0], df.loc[other])
        print(f"The distance between wine 0 and wine {other} {label} is", dist)
    print()

#Without scaling the variables
report(df_reds, "without scaling the variables")

#Standardizing
df_reds_std = (df_reds - df_reds.mean()) / df_reds.std()
report(df_reds_std, "after standardizing")

#Normalizing
df_reds_norm = df_reds / np.sqrt((df_reds ** 2).sum(axis=0))
report(df_reds_norm, "after normalizing")

#Min-max scaling
df_reds_min_max = (df_reds - df_reds.min()) / (df_reds.max() - df_reds.min())
report(df_reds_min_max, "after min-max scaling")


**Written Answers Here:** For each scaling method, the wine with the smaller printed distance is the one more similar to wine 0. Without scaling, the variables with the largest raw values (here `total sulfur dioxide` and `free sulfur dioxide`) dominate the Euclidean distance; after standardizing, normalizing, or min-max scaling, all 12 variables contribute on comparable scales, so the ranking of wine 6 and wine 36 is not guaranteed to stay the same. The size of the distances always depends on the scaling, so distances should be compared within one method and never across methods.

## Part 2: Distances Between Categorical Variables

The distance metrics that we studied in the previous section were designed for quantitative variables. But most data sets contain a mix of categorical and quantitative variables. For example, the Titanic data set contains both quantitative variables, like `age`, and categorical variables, like `sex` and `embarked`. How do we measure the similarity between observations for a data set like this one? The most straightforward solution is to convert the categorical variables into quantitative ones.

In [26]:
titanic = pd.read_csv("../data/titanic.csv")
titanic

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0000,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.5500,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0000,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0000,1,2,113781,151.5500,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0000,1,2,113781,151.5500,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,3,0,"Zabour, Miss. Hileni",female,14.5000,1,0,2665,14.4542,,C,,328.0,
1305,3,0,"Zabour, Miss. Thamine",female,,1,0,2665,14.4542,,C,,,
1306,3,0,"Zakarian, Mr. Mapriededer",male,26.5000,0,0,2656,7.2250,,C,,304.0,
1307,3,0,"Zakarian, Mr. Ortin",male,27.0000,0,0,2670,7.2250,,C,,,


### Converting Categorical Variables to Quantitative Variables

Binary categorical variables (categorical variables with two categories) can be converted into quantitative variables by coding one category as 1 and the other category as 0. (In fact, the `survived` column in the Titanic data set is an example of a variable where this has been done.) But what do we do about a categorical variable with more than 2 categories, like `embarked`, which has 3 categories?

We can convert a categorical variable with $K$ categories into $K$ separate 0/1 variables, or **dummy variables**. Each of the $K$ variables is an indicator for one of the $K$ categories. That is, each dummy variable is 1 if the observation fell into that category and 0 otherwise.

Although it is not difficult to create dummy variables manually, the easiest way to create them is the `get_dummies()` function in `pandas`.

Note: in newer versions of pandas, we must specify the parameter `dtype="int"` to return quantitative (0/1) values rather than boolean (True/False) values.

In [27]:
pd.get_dummies(titanic["embarked"], dtype="int")

Unnamed: 0,C,Q,S
0,0,0,1
1,0,0,1
2,0,0,1
3,0,0,1
4,0,0,1
...,...,...,...
1304,1,0,0
1305,1,0,0
1306,1,0,0
1307,1,0,0


Since every observation is in exactly one category, each row contains exactly one 1; the rest of the values in each row are 0s.
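To make the definition concrete, here is a minimal sketch that builds the dummy variables by hand for a made-up column (the `port` values below are invented and merely stand in for `embarked`), and compares the result with `get_dummies()`.

```python
import pandas as pd

# A made-up column with three categories, standing in for `embarked`
port = pd.Series(["S", "C", "S", "Q", "C"], name="port")

# One 0/1 indicator column per category, built by hand
manual = pd.DataFrame({
    category: (port == category).astype(int)
    for category in sorted(port.unique())
})

# The same encoding from pandas
automatic = pd.get_dummies(port, dtype="int")

print(manual)
print(manual.sum(axis=1).tolist())   # number of 1s in each row
```

Because this toy column has no missing values, every row falls into exactly one category and each row sum is 1.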

We can call `get_dummies` on a `DataFrame` to encode multiple categorical variables at once. `pandas` will only dummy-encode the variables it deems categorical, leaving the quantitative variables alone. If there are any categorical variables that are represented in the `DataFrame` using numeric types, they must be cast explicitly to a categorical type, such as `str`.  `pandas` will also automatically prepend the variable name to all dummy variables, to prevent collisions between column names in the final `DataFrame`.

In [28]:
# Convert pclass to a categorical type
titanic["pclass"] = titanic["pclass"].astype(str)

# Pass all variables to get_dummies, except ones that are "other" types
titanic_num = pd.get_dummies(
    titanic.drop(["name", "ticket", "cabin", "boat", "body"], axis=1),
    dtype="int"
)
titanic_num

Unnamed: 0,survived,age,sibsp,parch,fare,pclass_1,pclass_2,pclass_3,sex_female,sex_male,...,"home.dest_Wimbledon Park, London / Hayling Island, Hants","home.dest_Windsor, England New York, NY","home.dest_Winnipeg, MB","home.dest_Winnipeg, MN","home.dest_Woodford County, KY","home.dest_Worcester, England","home.dest_Worcester, MA","home.dest_Yoevil, England / Cottage Grove, OR","home.dest_Youngstown, OH","home.dest_Zurich, Switzerland"
0,1,29.0000,0,0,211.3375,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
1,1,0.9167,1,2,151.5500,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
2,0,2.0000,1,2,151.5500,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
3,0,30.0000,1,2,151.5500,1,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
4,0,25.0000,1,2,151.5500,1,0,0,1,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1304,0,14.5000,1,0,14.4542,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1305,0,,1,0,14.4542,0,0,1,1,0,...,0,0,0,0,0,0,0,0,0,0
1306,0,26.5000,0,0,7.2250,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0
1307,0,27.0000,0,0,7.2250,0,0,1,0,1,...,0,0,0,0,0,0,0,0,0,0


Notice that categorical variables, like `pclass`, were converted to dummy variables with names like `pclass_1`, `pclass_2` and `pclass_3`, while quantitative variables, like `age`, were left alone.

Now that we have converted every variable in our data set into a quantitative variable, we can apply the techniques from the previous section to calculate distances between observations. For example, to find the passenger who is most similar to the first passenger, Elisabeth Allen, we can find the row with the smallest Euclidean distance to that row in the above `DataFrame`.

In [29]:
titanic_std = (titanic_num - titanic_num.mean()) / titanic_num.std()
np.sqrt(
    ((titanic_std - titanic_std.loc[0]) ** 2).sum(axis=1)
).sort_values()

Unnamed: 0,0
0,0.000000
193,1.509375
238,1.509375
261,4.655385
24,18.111957
...,...
694,41.133172
797,41.133207
49,41.133360
472,41.186785


The passengers most similar to Elisabeth Allen, other than herself, are passengers 193 and 238, who are tied at the same distance. Let's extract passenger 238 alongside her from the original `DataFrame` to see how similar they really are.

In [30]:
titanic.loc[[0, 238]]

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
238,1,1,"Robert, Mrs. Edward Scott (Elisabeth Walton Mc...",female,43.0,0,1,24160,211.3375,B3,S,2,,"St Louis, MO"


The two passengers are indeed very similar: among the variables used in the distance, they differ only in age and in the number of parents/children accompanying them. They even happen to share the same first two names ("Elisabeth Walton").
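The whole pipeline (dummy-encode, standardize, compute distances, pick the closest row) can be sketched on a tiny made-up table. The ages and sexes below are invented for illustration, and `idxmin()` is used here as one possible way to pick the nearest row after dropping the reference row itself.

```python
import numpy as np
import pandas as pd

# Made-up passengers: one quantitative and one categorical variable
toy = pd.DataFrame({
    "age": [29.0, 43.0, 2.0, 30.0],
    "sex": ["female", "female", "female", "male"],
})

toy_num = pd.get_dummies(toy, dtype="int")              # sex -> sex_female, sex_male
toy_std = (toy_num - toy_num.mean()) / toy_num.std()    # put all columns on one scale

# Distance from passenger 0 to everyone, then drop passenger 0 itself
dists = np.sqrt(((toy_std - toy_std.loc[0]) ** 2).sum(axis=1))
nearest = dists.drop(index=0).idxmin()

print(dists.sort_values())
print(nearest)
```

In this toy table, passenger 3 is closest to passenger 0 in age, but the difference in `sex` changes two standardized dummy columns at once, so a passenger of the same sex ends up nearer.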

### Exercises Part 2

The next exercises again use the Ames housing data set (`../data/ames.tsv`).

In [31]:
df_ames = pd.read_csv("../data/ames.tsv", sep="\t")
df_ames[:5]

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,...,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,...,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,...,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,...,0,,,,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,,IR1,Lvl,...,0,,MnPrv,,0,3,2010,WD,Normal,189900


In [32]:
print(df_ames.columns.tolist())

['Order', 'PID', 'MS SubClass', 'MS Zoning', 'Lot Frontage', 'Lot Area', 'Street', 'Alley', 'Lot Shape', 'Land Contour', 'Utilities', 'Lot Config', 'Land Slope', 'Neighborhood', 'Condition 1', 'Condition 2', 'Bldg Type', 'House Style', 'Overall Qual', 'Overall Cond', 'Year Built', 'Year Remod/Add', 'Roof Style', 'Roof Matl', 'Exterior 1st', 'Exterior 2nd', 'Mas Vnr Type', 'Mas Vnr Area', 'Exter Qual', 'Exter Cond', 'Foundation', 'Bsmt Qual', 'Bsmt Cond', 'Bsmt Exposure', 'BsmtFin Type 1', 'BsmtFin SF 1', 'BsmtFin Type 2', 'BsmtFin SF 2', 'Bsmt Unf SF', 'Total Bsmt SF', 'Heating', 'Heating QC', 'Central Air', 'Electrical', '1st Flr SF', '2nd Flr SF', 'Low Qual Fin SF', 'Gr Liv Area', 'Bsmt Full Bath', 'Bsmt Half Bath', 'Full Bath', 'Half Bath', 'Bedroom AbvGr', 'Kitchen AbvGr', 'Kitchen Qual', 'TotRms AbvGrd', 'Functional', 'Fireplaces', 'Fireplace Qu', 'Garage Type', 'Garage Yr Blt', 'Garage Finish', 'Garage Cars', 'Garage Area', 'Garage Qual', 'Garage Cond', 'Paved Drive', 'Wood Deck 

#### Exercise 5

The neighborhood variable (`Neighborhood`) in this data set is categorical. Convert it to $K$ quantitative variables. What is $K$ in this case?

In [33]:
#Converting neighborhood to K quantitative variables
df_ames.columns = df_ames.columns.str.strip()
df_ames_encoded = pd.get_dummies(df_ames, columns=['Neighborhood'], dtype='int')

print(df_ames_encoded.head())

   Order        PID  MS SubClass MS Zoning  Lot Frontage  Lot Area Street  \
0      1  526301100           20        RL         141.0     31770   Pave   
1      2  526350040           20        RH          80.0     11622   Pave   
2      3  526351010           20        RL          81.0     14267   Pave   
3      4  526353030           20        RL          93.0     11160   Pave   
4      5  527105010           60        RL          74.0     13830   Pave   

  Alley Lot Shape Land Contour  ... Neighborhood_NoRidge Neighborhood_NridgHt  \
0   NaN       IR1          Lvl  ...                    0                    0   
1   NaN       Reg          Lvl  ...                    0                    0   
2   NaN       IR1          Lvl  ...                    0                    0   
3   NaN       Reg          Lvl  ...                    0                    0   
4   NaN       IR1          Lvl  ...                    0                    0   

  Neighborhood_OldTown Neighborhood_SWISU Neighbor

In [34]:
#Counting number of unique (K) variables
num_unique_neighborhoods = len(df_ames['Neighborhood'].unique())
print(f"K = {num_unique_neighborhoods}")

K = 28


**Written Answers**: $K$ in this case is 28.

#### Exercise 6

Based on these $K$ variables only, calculate the Euclidean distance between house 0 and each of the other houses in the data set. What are the possible values of the Euclidean distance? Can you explain what a distance of $0$ means, in the context of this variable? What about a distance greater than 0?

In [36]:
neighborhood_columns = df_ames_encoded[[col for col in df_ames_encoded.columns if 'Neighborhood_' in col]]
x1 = neighborhood_columns.iloc[0].values
distances = np.sqrt(((neighborhood_columns.values - x1) ** 2).sum(axis=1))

print(distances[:10])

[0.         0.         0.         0.         1.41421356 1.41421356
 1.41421356 1.41421356 1.41421356 1.41421356]


**Written Answers**: The only possible values are $0$ and $\sqrt{2} \approx 1.414$. A distance of 0 means that the house is in the same neighborhood as house 0. A distance greater than 0 means that the house is in a different neighborhood: the two houses then differ in exactly two dummy variables (the indicator for house 0's neighborhood and the indicator for the other house's neighborhood), each by 1, so the Euclidean distance is $\sqrt{1^2 + 1^2} = \sqrt{2}$.
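The possible values in Exercise 6 can be checked on a toy encoding. The sketch below uses a made-up dummy matrix for four houses in $K = 3$ neighborhoods rather than the Ames data.

```python
import numpy as np

# Made-up dummy encodings for four houses in K = 3 neighborhoods
dummies = np.array([
    [1, 0, 0],   # house 0
    [1, 0, 0],   # same neighborhood as house 0
    [0, 1, 0],   # different neighborhood
    [0, 0, 1],   # different neighborhood
])

# Euclidean distance from house 0 to every house
dists = np.sqrt(((dummies - dummies[0]) ** 2).sum(axis=1))

print(dists)
print(np.unique(dists.round(8)))   # the distinct distances that occur
```

Whichever other neighborhood a house is in, exactly two indicator columns differ from house 0, so no distance other than 0 or $\sqrt{2}$ can occur.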

## Part 3: The Distance Matrix

In many applications, we need the distance between every pair of observations ${\bf x}_i$ and ${\bf x}_j$ in a data set. How do we represent this information? The most common way is to use an $n \times n$ matrix, where the $(i, j)$th entry is the distance between ${\bf x}_i$ and ${\bf x}_j$. That is,

$$ D = \begin{pmatrix}
d({\bf x}_1, {\bf x}_1) & d({\bf x}_1, {\bf x}_2) & \cdots & d({\bf x}_1, {\bf x}_n) \\
d({\bf x}_2, {\bf x}_1) & d({\bf x}_2, {\bf x}_2) & \cdots & d({\bf x}_2, {\bf x}_n) \\
\vdots & \vdots & \ddots & \vdots \\
d({\bf x}_n, {\bf x}_1) & d({\bf x}_n, {\bf x}_2) & \cdots & d({\bf x}_n, {\bf x}_n)
\end{pmatrix}. $$

There are a few things we can say about the $n\times n$ distance matrix $D$.

1. All of the entries of $D$ are non-negative.
2. Because the distance between any observation and itself, $d({\bf x}_i, {\bf x}_i)$, is always zero, the _diagonal_ elements of this matrix, $D_{ii}$ are all equal to 0.
3. For many distance metrics, including Euclidean and Manhattan distance, $d$ is symmetric, meaning that $d({\bf x}_i, {\bf x}_j) = d({\bf x}_j, {\bf x}_i)$. Therefore, the matrix $D$ will also be symmetric; that is, the values in the upper triangle will match their reflection in the lower triangle.
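As a sketch of these three properties, the following builds the Euclidean distance matrix for four made-up observations using NumPy broadcasting and checks each property in turn. The data and variable names are illustrative only.

```python
import numpy as np

# Four made-up observations with two variables each
X = np.array([
    [0.0, 0.0],
    [3.0, 4.0],
    [6.0, 8.0],
    [1.0, 1.0],
])

# diffs[i, j, :] holds x_i - x_j, so D[i, j] is the Euclidean distance d(x_i, x_j)
diffs = X[:, None, :] - X[None, :, :]
D = np.sqrt((diffs ** 2).sum(axis=2))

print(D.round(3))
print((D >= 0).all())            # property 1: non-negative entries
print((np.diag(D) == 0).all())   # property 2: zeros on the diagonal
print(np.allclose(D, D.T))       # property 3: symmetric
```

The intermediate array `diffs` has shape $n \times n \times D$, so this approach is convenient for small data sets but can use a lot of memory when $n$ is large.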

How do we calculate the distance matrix for a `DataFrame` consisting of all quantitative variables? For example, suppose we want to calculate the matrix of distances between each of the houses in the Ames housing data set, based on the number of bedrooms, number of bathrooms, and the living area (in square feet).

In [37]:
housing_df = pd.read_csv("../data/ames.tsv",sep="\t")

# extract 3 quantitative variables
housing_df_quant = housing_df[["Bedroom AbvGr", "Gr Liv Area"]].copy()
housing_df_quant["Bathrooms"] = (
    housing_df["Full Bath"] +
    0.5 * housing_df["Half Bath"]
)
housing_df_quant

Unnamed: 0,Bedroom AbvGr,Gr Liv Area,Bathrooms
0,3,1656,1.0
1,2,896,1.0
2,3,1329,1.5
3,3,2110,2.5
4,3,1629,2.5
...,...,...,...
2925,3,1003,1.0
2926,2,902,1.0
2927,3,970,1.0
2928,2,1389,1.0


### The Long Way

It is possible to create the distance matrix entirely in `pandas`. The idea is to first define a function that calculates the distances between a given observation and all of the other observations:

In [38]:
def get_euclidean_dists_from_obs(obs):
    return np.sqrt(
        ((housing_df_quant - obs) ** 2).sum(axis=1)
    )

get_euclidean_dists_from_obs(housing_df_quant.loc[0])

Unnamed: 0,0
0,0.000000
1,760.000658
2,327.000382
3,454.002478
4,27.041635
...,...
2925,653.000000
2926,754.000663
2927,686.000000
2928,267.001873


The code for this function is very similar to the code that we wrote in the exercises for Part 1.

Now, to get a matrix of distances $D$, we simply need to apply this function to every row of the `DataFrame`. To achieve this, we use the `.apply()` method with `axis=1`:

In [39]:
D = housing_df_quant.apply(
    get_euclidean_dists_from_obs,
    axis=1
)
D

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2920,2921,2922,2923,2924,2925,2926,2927,2928,2929
0,0.000000,760.000658,327.000382,454.002478,27.041635,52.021630,318.003145,376.002660,40.024992,148.007601,...,564.000222,72.013888,72.013888,530.000943,432.001157,653.000000,754.000663,686.000000,267.001873,344.003270
1,760.000658,0.000000,433.001443,1214.001339,733.002217,708.002295,442.001131,384.001302,720.000694,908.001790,...,196.003189,832.003005,832.003005,230.004348,328.006098,107.004673,6.000000,74.006756,493.000000,1104.001472
2,327.000382,433.001443,0.000000,781.000640,300.001667,275.001818,9.069179,49.012753,287.002178,475.001053,...,237.000000,399.001566,399.001566,203.000616,105.005952,326.000383,427.001464,359.000348,60.010416,671.000745
3,454.002478,1214.001339,781.000640,0.000000,481.000000,506.000000,772.000810,830.000753,494.001265,306.000000,...,1018.000491,382.001636,382.001636,984.000127,886.001834,1107.001016,1208.001345,1140.000987,721.002254,110.000000
4,27.041635,733.002217,300.001667,481.000000,0.000000,25.000000,291.002148,349.001791,13.047988,175.000000,...,537.000931,99.006313,99.006313,503.000249,405.004012,626.001797,727.002235,659.001707,240.006771,371.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2925,653.000000,107.004673,326.000383,1107.001016,626.001797,601.001872,335.002985,277.003610,613.001631,801.001404,...,89.001404,725.001379,725.001379,123.004065,221.002262,0.000000,101.004950,33.000000,386.001295,997.001128
2926,754.000663,6.000000,427.001464,1208.001345,727.002235,702.002315,436.001147,378.001323,714.000700,902.001802,...,190.003289,826.003027,826.003027,224.004464,322.006211,101.004950,0.000000,68.007353,487.000000,1098.001480
2927,686.000000,74.006756,359.000348,1140.000987,659.001707,634.001774,368.002717,310.003226,646.001548,834.001349,...,122.001025,758.001319,758.001319,156.003205,254.001968,33.000000,68.007353,0.000000,419.001193,1030.001092
2928,267.001873,493.000000,60.010416,721.002254,240.006771,215.007558,51.009803,109.004587,227.002203,415.003916,...,297.002104,339.007375,339.007375,263.003802,165.012121,386.001295,487.000000,419.001193,0.000000,611.002660


Notice that this is a $2930 \times 2930$ symmetric matrix of non-negative numbers, with zeroes along the diagonal, just as we predicted.
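These properties can also be checked in code rather than by eye. The sketch below uses a small made-up `DataFrame` (not the housing data) and the same `.apply()` pattern; the names `toy`, `dists_from`, and `D_toy` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data, made up for illustration: 4 observations, 2 variables
toy = pd.DataFrame({"x": [0.0, 3.0, 0.0, 6.0], "y": [0.0, 4.0, 8.0, 8.0]})

def dists_from(obs):
    # Distance from one observation to every row of toy
    return np.sqrt(((toy - obs) ** 2).sum(axis=1))

D_toy = toy.apply(dists_from, axis=1)
values = D_toy.to_numpy()

print(D_toy.shape)                       # (4, 4): one row and column per observation
print(np.allclose(values, values.T))     # symmetric
print(np.allclose(np.diag(values), 0))   # zeroes on the diagonal
print((values >= 0).all())               # non-negative everywhere
```

The same four checks should hold for the full matrix `D`, although they take longer to run on 2930 rows.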

### The Short Way

Many Python packages can calculate distance matrices. One of them is scikit-learn, a machine learning package. Machine learning will be discussed in depth in the coming labs, where we will explore the features of scikit-learn extensively. Because distance matrices are important in machine learning, scikit-learn provides functions for calculating them.

For example, the following code calculates the (Euclidean) distance matrix between all of the houses in the Ames housing data set:

In [40]:
from sklearn.metrics import pairwise_distances

D_ = pairwise_distances(housing_df_quant, metric="euclidean")
D_

array([[   0.        ,  760.00065789,  327.00038226, ...,  686.        ,
         267.00187265,  344.00327033],
       [ 760.00065789,    0.        ,  433.00144342, ...,   74.00675645,
         493.        , 1104.00147192],
       [ 327.00038226,  433.00144342,    0.        , ...,  359.00034819,
          60.01041576,  671.00074516],
       ...,
       [ 686.        ,   74.00675645,  359.00034819, ...,    0.        ,
         419.00119332, 1030.00109223],
       [ 267.00187265,  493.        ,   60.01041576, ...,  419.00119332,
           0.        ,  611.00265957],
       [ 344.00327033, 1104.00147192,  671.00074516, ..., 1030.00109223,
         611.00265957,    0.        ]])

Notice that the return type is a `numpy` array instead of a `pandas` `DataFrame`. That is because scikit-learn was designed to work with `numpy` arrays. Although it accepts `pandas` `DataFrame`s as arguments, scikit-learn converts them to `numpy` arrays under the hood and returns `numpy` arrays.
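If the row labels matter, one option is to wrap the returned array in a `DataFrame` yourself. This is a minimal sketch on made-up data; the labels `"a"`, `"b"`, `"c"` and the names `toy`, `D_arr`, and `D_df` are assumptions for illustration.

```python
import pandas as pd
from sklearn.metrics import pairwise_distances

# Toy data, made up for illustration, with a non-default index
toy = pd.DataFrame(
    {"x": [0.0, 3.0, 0.0], "y": [0.0, 4.0, 8.0]},
    index=["a", "b", "c"],
)

D_arr = pairwise_distances(toy, metric="euclidean")  # plain numpy array

# Reattach the original index as both row and column labels
D_df = pd.DataFrame(D_arr, index=toy.index, columns=toy.index)
print(D_df)
```

With labels restored, a distance can be looked up by name, for example `D_df.loc["a", "b"]`.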

Fortunately, many of the usual `pandas` operations work on `numpy` arrays as well. For example, to get the maximum value in each row, we can use the `.max()` method with `axis=1`.

In [41]:
D_.max(axis=1)

array([3986.00028224, 4746.00034239, 4313.00011593, ..., 4672.0002408 ,
       4253.00038208, 3642.        ])
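Other row-wise reductions follow the same pattern. As a sketch on a small made-up distance matrix (the name `D_toy` is an assumption), `.argmax()` reports which observation is farthest rather than how far away it is:

```python
import numpy as np

# Toy symmetric distance matrix, made up for illustration
D_toy = np.array([
    [0.0, 5.0, 8.0],
    [5.0, 0.0, 5.0],
    [8.0, 5.0, 0.0],
])

farthest_dist = D_toy.max(axis=1)     # largest distance in each row
farthest_idx = D_toy.argmax(axis=1)   # column position of that largest distance
print(farthest_dist)
print(farthest_idx)
```

When a row has ties, `argmax` returns the first position at which the maximum occurs.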

### Exercises Part 3

The following exercises again use the red wine data (`../data/reds.csv`). All 12 variables in this data set are quantitative.

In [1]:
import pandas as pd

df_reds = pd.read_csv("../data/reds.csv", sep=";")
df_reds[:1599]

#### Exercise 7

Using sklearn, calculate the distance between every pair of wines in this data set.

In [44]:
from sklearn.metrics import pairwise_distances

D = pairwise_distances(df_reds, metric="euclidean")
D

array([[6.74349576e-07, 3.58601922e+01, 2.04097050e+01, ...,
        1.91056851e+01, 2.33225970e+01, 1.10366429e+01],
       [3.58601922e+01, 0.00000000e+00, 1.64045889e+01, ...,
        2.73859012e+01, 2.41316795e+01, 2.61010203e+01],
       [2.04097050e+01, 1.64045889e+01, 0.00000000e+00, ...,
        1.99197504e+01, 1.98237135e+01, 1.26797093e+01],
       ...,
       [1.91056851e+01, 2.73859012e+01, 1.99197504e+01, ...,
        0.00000000e+00, 5.18964605e+00, 1.12669730e+01],
       [2.33225970e+01, 2.41316795e+01, 1.98237135e+01, ...,
        5.18964605e+00, 0.00000000e+00, 1.42996395e+01],
       [1.10366429e+01, 2.61010203e+01, 1.26797093e+01, ...,
        1.12669730e+01, 1.42996395e+01, 0.00000000e+00]])

#### Exercise 8

Using the distance matrix that you calculated in the previous exercise, calculate the distance of each wine to the most similar other wine.

*Hint:* Think about what the [values on the diagonal](https://numpy.org/doc/stable/reference/generated/numpy.fill_diagonal.html) of the matrix are. You don't want to match a wine to itself; you want the closest *other* wine.

In [45]:
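# Set each wine's distance to itself to infinity so it is never the minimum.
# Note: np.fill_diagonal modifies D in place.
np.fill_diagonal(D, np.inf)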

closest_distances = np.min(D, axis=1)

closest_distances

array([6.74349576e-07, 1.54385948e+00, 1.25056949e+00, ...,
       0.00000000e+00, 4.63842996e-01, 1.89385448e+00])
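Because `np.fill_diagonal` changes its argument in place, the original matrix no longer has zeroes on the diagonal after this cell runs. If the unmodified matrix is needed later, one option is to mask a copy instead. This is a sketch on a made-up matrix; `D_toy` and `D_masked` are illustrative names.

```python
import numpy as np

# Toy symmetric distance matrix, made up for illustration
D_toy = np.array([
    [0.0, 5.0, 8.0],
    [5.0, 0.0, 5.0],
    [8.0, 5.0, 0.0],
])

D_masked = D_toy.copy()              # leave the original untouched
np.fill_diagonal(D_masked, np.inf)   # a row's own entry can no longer be the minimum

nearest_dist = D_masked.min(axis=1)
print(nearest_dist)
print(D_toy[0, 0])                   # the original diagonal is still zero
```

A nearest distance of exactly zero after masking means some other row has identical values, which is worth checking for in real data.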

#### Exercise 9

Using the distance matrix that you calculated previously, determine the identity of the wine that is most similar to each wine.

In [50]:
# The diagonal of D was set to infinity in Exercise 8,
# so argmin never returns the wine itself.
closest_wine_indices = np.argmin(D, axis=1)

for i, j in enumerate(closest_wine_indices):
    print(f"Wine {i} is most similar to Wine {int(j)}")

Wine 0 is most similar to Wine 4
Wine 1 is most similar to Wine 752
Wine 2 is most similar to Wine 196
Wine 3 is most similar to Wine 787
Wine 4 is most similar to Wine 0
Wine 5 is most similar to Wine 686
Wine 6 is most similar to Wine 1502
Wine 7 is most similar to Wine 1143
Wine 8 is most similar to Wine 69
Wine 9 is most similar to Wine 11
Wine 10 is most similar to Wine 107
Wine 11 is most similar to Wine 9
Wine 12 is most similar to Wine 1502
Wine 13 is most similar to Wine 613
Wine 14 is most similar to Wine 15
Wine 15 is most similar to Wine 14
Wine 16 is most similar to Wine 1572
Wine 17 is most similar to Wine 19
Wine 18 is most similar to Wine 654
Wine 19 is most similar to Wine 17
Wine 20 is most similar to Wine 1251
Wine 21 is most similar to Wine 1141
Wine 22 is most similar to Wine 27
Wine 23 is most similar to Wine 1370
Wine 24 is most similar to Wine 1447
Wine 25 is most similar to Wine 1019
Wine 26 is most similar to Wine 1334
Wine 27 is most similar to Wine 22
...


#### Bonus Exercise (2 Points)

Suppose that you really like wine 0 in the data set, but you find that the **chlorides** value is too high. Find wines that have lower chlorides than wine 0 but are similar to it. Be sure to actually look at the profiles of the wines that your algorithm picked out as most similar. Do they make sense?

Try different distance metrics and different standardization methods. How sensitive are your results to these choices? To answer this question, you should create a table of each configuration you tried and what the closest wine with lower chlorides than wine 0 is.

|metric|standardization|index of closest wine with lower chlorides|
|---|---|---|
|   |   |   |


_Think:_ If the goal is to find wines with lower chlorides, should chlorides be included as a variable in the distance metric?
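The mechanics of comparing configurations can be sketched on made-up data before trying the real wines. Everything below is an assumption for illustration: the toy values, the column names, the three scalings, and the two metrics. Chlorides is left out of the distance here only to show how a column can be dropped; whether that is the right choice is the question posed above.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import pairwise_distances

# Toy data, made up for illustration (not the real wine values)
toy = pd.DataFrame({
    "acidity":   [7.0, 7.2, 9.0, 7.1],
    "sugar":     [2.0, 2.1, 6.0, 1.9],
    "chlorides": [0.10, 0.12, 0.05, 0.08],
})

# Positions of the rows with lower chlorides than row 0
lower = np.flatnonzero((toy["chlorides"] < toy.loc[0, "chlorides"]).to_numpy())

# Assumption for this sketch: chlorides is excluded from the distance
features = toy.drop(columns="chlorides")
scalings = {
    "none": features,
    "z-score": (features - features.mean()) / features.std(),
    "min-max": (features - features.min()) / (features.max() - features.min()),
}

rows = []
for scaling_name, X in scalings.items():
    for metric in ["euclidean", "manhattan"]:
        d0 = pairwise_distances(X, metric=metric)[0]  # distances from row 0
        closest = lower[d0[lower].argmin()]           # nearest among the candidates
        rows.append({
            "metric": metric,
            "standardization": scaling_name,
            "closest": int(closest),
        })

results = pd.DataFrame(rows)
print(results)
```

On this toy data every configuration picks the same row, so the result is insensitive to the choices. The real data may behave differently.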

In [70]:
## YOUR CODE HERE

**Written answers here**
