## Euclidean and Manhattan Distance Calculations (completed)

In this mini project you will see examples and comparisons of distance measures. Specifically, you'll visually compare the Euclidean distance with the Manhattan distance. Distance measures have many uses in data science and underlie many of the algorithms you'll be using, such as Principal Component Analysis.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('ggplot')
import seaborn as sns

In [1]:
# Load Course Numerical Dataset
df = pd.read_csv('data/distance_dataset.csv',index_col=0)
df.head()


In [None]:
g = sns.PairGrid(df)
g.map_upper(sns.kdeplot)
g.map_lower(sns.scatterplot)
g.map_diag(sns.histplot);  # Pair plot of the DataFrame df

### Euclidean Distance

Let's visualize the difference between the Euclidean and Manhattan distance.

We use pandas to load the dataset from a CSV file and NumPy to compute the __Euclidean distance__ from each point to the reference point (Y=5, Z=5). The plot below shows the dataset projected onto the YZ plane, color coded by that Euclidean distance. As expected, points that lie at the same Euclidean distance from the reference point form a circle whose radius is that distance.

Note that the __SciPy library__ provides optimized distance functions written in C (in the `scipy.spatial.distance` module) that are typically much faster than a naive implementation.
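
As a minimal sketch of the SciPy route, the cell below compares the manual NumPy formula with `scipy.spatial.distance.cdist`. The `points` array is a small hypothetical stand-in for the (Y, Z) columns of the dataset, not the course data itself.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Hypothetical (Y, Z) coordinates standing in for df[['Y', 'Z']]
points = np.array([[5.0, 5.0], [8.0, 9.0], [2.0, 1.0]])
ref = np.array([[5.0, 5.0]])  # reference point (Y=5, Z=5), as a 1-row 2D array

# Manual NumPy formula, same form as used in this notebook
naive = np.sqrt((points[:, 0] - 5) ** 2 + (points[:, 1] - 5) ** 2)

# cdist returns an array of shape (n_points, 1); flatten it to 1D
scipy_dist = cdist(points, ref, metric="euclidean").ravel()

print(naive)
print(scipy_dist)
```

Both approaches should agree to floating-point precision. `cdist` expects two 2D arrays, which is why the reference point is wrapped in an extra pair of brackets.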

In [None]:
# In the Y-Z plane, we compute the distance to ref point (5,5)
distEuclid = np.sqrt((df.Z - 5)**2 + (df.Y - 5)**2)
distEuclid.head()

**<font color='teal'>Compute the Euclidean distance from each point to the reference point (3,3), following the example above.</font>**

In [None]:
distEuclid3 = np.sqrt((df.Z - 3)**2 + (df.Y - 3)**2)
distEuclid3.head()

**<font color='teal'>Replace the value set to 'c' in the plotting cell below with your own distance matrix and review the result to deepen your understanding of Euclidean distances. </font>**

In [None]:
figEuclid = plt.figure(figsize=(10,8))
plt.set_cmap("seismic")
plt.scatter(df.Y - 5, df.Z-5, c=distEuclid, s=20, clim=(0,6))
plt.ylim([-4.9,4.9])
plt.xlim([-4.9,4.9])
plt.xlabel('Y - 5', size=14)
plt.ylabel('Z - 5', size=14)
plt.title('Euclidean Distance')
cb = plt.colorbar()
cb.set_label('Distance from (5,5)', size=14)

#figEuclid.savefig('plots/Euclidean.png')

In [None]:
figEuclid3 = plt.figure(figsize=(10,8))

plt.scatter(df.Y - 5, df.Z - 5, c=distEuclid3, s=20, clim=(0,6))
plt.ylim([-4.9,4.9])
plt.xlim([-4.9,4.9])
plt.xlabel('Y - 5', size=14)
plt.ylabel('Z - 5', size=14)
plt.title('Euclidean Distance')
cb = plt.colorbar()
cb.set_label('Distance from (3,3)', size=14)

plt.text(3 - 5, 3 - 5, " (3,3)", size=14, verticalalignment='center', color='red')
plt.scatter(3 - 5, 3 - 5, marker='x', s=30, c='r');

### Manhattan Distance

Manhattan distance is simply the sum of the absolute differences between the points' coordinates. This distance is also known as the taxicab or city block distance, because it measures distance along the coordinate axes, which creates "paths" that look like a cab's route on a grid-style city map.
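
Written out for two points $p$ and $q$ with $n$ coordinates each (the symbols $p$, $q$, and $n$ are notation introduced here for illustration), the two distances compare as follows:

$$
d_{\text{Manhattan}}(p, q) = \sum_{i=1}^{n} |p_i - q_i|, \qquad d_{\text{Euclidean}}(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}
$$

For example, between the points (1, 2) and (4, 6) the Manhattan distance is $|1-4| + |2-6| = 7$, while the Euclidean distance is $\sqrt{3^2 + 4^2} = 5$.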

Here we display the dataset projected onto the XZ plane, color coded by the Manhattan distance to the reference point (X=5, Z=5). Points lying at the same Manhattan distance form a square rotated by 45 degrees (a diamond): this is what a "circle" looks like under the Manhattan metric.
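
To make the shape of a Manhattan "circle" concrete, here is a small sketch using hand-picked points rather than the dataset. The array `pts` is hypothetical: it holds the four corners of the diamond of Manhattan radius 3 around (5, 5), plus the midpoint of one edge.

```python
import numpy as np

ref = np.array([5.0, 5.0])  # reference point (X=5, Z=5)

# Four diamond corners and one edge midpoint (hypothetical example points)
pts = np.array([
    [8.0, 5.0],
    [5.0, 8.0],
    [2.0, 5.0],
    [5.0, 2.0],
    [6.5, 6.5],
])

# Manhattan distance: sum of absolute coordinate differences
manhattan = np.abs(pts - ref).sum(axis=1)

# Euclidean distance: square root of the sum of squared differences
euclid = np.sqrt(((pts - ref) ** 2).sum(axis=1))

print(manhattan)
print(euclid)
```

All five points are at Manhattan distance 3, yet the edge midpoint is closer to the reference point in Euclidean terms than the corners are, which is why the equidistant set looks like a diamond rather than a circle.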

In [None]:
# In the X-Z plane, we compute the Manhattan distance to ref point (5,5)
distManhattan = np.abs(df.X - 5) + np.abs(df.Z - 5)
distManhattan.head()

In [None]:
figManhattan = plt.figure(figsize=(10,8))
plt.scatter(df.X - 5, df.Z - 5, c=distManhattan, s=20, clim=(0,6))
plt.ylim([-4.9,4.9])
plt.xlim([-4.9,4.9])
plt.xlabel('X - 5', size=14)
plt.ylabel('Z - 5', size=14)
plt.title('Manhattan Distance')
cb = plt.colorbar()
cb.set_label('Distance from (5,5)', size=14)

**<font color='teal'>Compute the Manhattan distance from each point to the reference point (4,4), following the example above, and use it as the value of 'c' in the plotting cell to view the result.</font>**

In [None]:
distManhattan4 = np.abs(df.X - 4) + np.abs(df.Z - 4)
distManhattan4.head()

In [None]:
figManhattan4 = plt.figure(figsize=(10,8))
plt.scatter(df.X - 5, df.Z - 5, c=distManhattan4, s=20, clim=(0,6))
plt.ylim([-4.9,4.9])
plt.xlim([-4.9,4.9])
plt.xlabel('X - 5', size=14)
plt.ylabel('Z - 5', size=14)
plt.title('Manhattan Distance')
cb = plt.colorbar()
cb.set_label('Distance from (4,4)', size=14)

plt.scatter(4 - 5, 4 - 5, marker='x', s=30, c='g', label="Point at (X,Z)=(4,4)")
plt.plot([x-5 for x in [1, 4, 7, 4, 1]], 
         [y-5 for y in [4, 7, 4, 1, 4]], 
         c='g', 
         linewidth=1, 
         linestyle='dashed', 
         label="Points 3 Units Manhattan Distance Away"
         )
plt.legend(loc='lower center');

Now let's create distributions of these distance metrics and compare them. We use SciPy's `pdist` function (from `scipy.spatial.distance`, imported below as `dist`) to compute all pairwise distances, analogous to the distances you computed manually earlier in the exercise.

In [None]:
import scipy.spatial.distance as dist

mat = df[['X','Y','Z']].to_numpy()
DistEuclid = dist.pdist(mat,'euclidean')
DistManhattan = dist.pdist(mat, 'cityblock')
largeMat = np.random.random((10000,100))

In [None]:
mat.shape, DistEuclid.shape, DistManhattan.shape, largeMat.shape
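
The shapes can look surprising at first: `pdist` returns a condensed 1D vector holding one entry per unordered pair of rows, so n points give n(n-1)/2 distances. The sketch below illustrates this on three hypothetical 3D points (not the course data) and uses `squareform` to expand the vector into the full symmetric matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three hypothetical points with (X, Y, Z) coordinates
pts = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 2.0, 2.0],
    [4.0, 0.0, 3.0],
])

# Condensed pairwise distances, ordered (0,1), (0,2), (1,2)
condensed = pdist(pts, "euclidean")
city = pdist(pts, "cityblock")

# Expand to a full 3x3 symmetric matrix with zeros on the diagonal
square = squareform(condensed)

print(condensed.shape, square.shape)
print(condensed)
print(city)
```

For any pair of points the Manhattan distance is never smaller than the Euclidean distance, which is worth keeping in mind when comparing the two histograms.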

**<font color='teal'>Plot histograms of each distance matrix for comparison.</font>**

In [None]:
plt.hist([DistEuclid, DistManhattan], bins=range(16), label=['Euclidean', 'Manhattan'], density=True)
plt.legend()
plt.title('Comparison of Pairwise Distances, Euclidean & Manhattan')
plt.xlabel("Pairwise Distance")
plt.ylabel("Density");

In [None]:
plt.hist(DistEuclid, bins=64)
plt.title("Euclidean Pairwise Distances")
plt.show()
plt.hist(DistManhattan, bins=64,  color='b')
plt.title("Manhattan Pairwise Distances")
plt.show()