# Diving Into Clustering and Unsupervised Learning
*Curtis Miller*

In this notebook I define some functions for computing distances between points. The goal is to introduce distance metrics, an important idea in data science and clustering.

Many of these metrics are already supported in relevant packages, but you are welcome to look at the functions defining them to understand how they work.

## Euclidean Distance

This is the "straight line" distance people are most familiar with.

In [None]:
import numpy as np

In [None]:
def euclidean_distance(v1, v2):
    """Computes the Euclidean distance between two vectors"""
    return np.sqrt(np.sum((v1 - v2) ** 2))

In [None]:
vec1 = np.array([1, 2, 3])
vec2 = np.array([1, -1, 0])

euclidean_distance(vec1, vec2)
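
As a sanity check, the same value can be obtained from NumPy's built-in `np.linalg.norm`, which computes the Euclidean norm of a vector by default. The sketch below repeats `euclidean_distance` so that it runs on its own; the names `by_hand` and `by_norm` are mine, for illustration only.

```python
import numpy as np

def euclidean_distance(v1, v2):
    """Computes the Euclidean distance between two vectors"""
    return np.sqrt(np.sum((v1 - v2) ** 2))

vec1 = np.array([1, 2, 3])
vec2 = np.array([1, -1, 0])

by_hand = euclidean_distance(vec1, vec2)    # sqrt(0**2 + 3**2 + 3**2) = sqrt(18)
by_norm = np.linalg.norm(vec1 - vec2)       # Euclidean norm of the difference vector

print(by_hand, by_norm, np.isclose(by_hand, by_norm))
```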

## Manhattan Distance

Also commonly known as "taxicab distance," this is the distance between two points when "diagonal" movement is not allowed.

In [None]:
def manhattan_distance(v1, v2):
    """Computes the Manhattan distance between two vectors"""
    return np.sum(np.abs(v1 - v2))

In [None]:
manhattan_distance(vec1, vec2)
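
Euclidean and Manhattan distance are both special cases of the Minkowski distance of order p (p = 2 and p = 1, respectively). The following is a minimal sketch of that relationship; `minkowski_distance` is a helper I am introducing here for illustration, not something defined elsewhere in this notebook.

```python
import numpy as np

def minkowski_distance(v1, v2, p):
    """Computes the Minkowski distance of order p (p >= 1) between two vectors"""
    return np.sum(np.abs(v1 - v2) ** p) ** (1 / p)

vec1 = np.array([1, 2, 3])
vec2 = np.array([1, -1, 0])

manhattan = minkowski_distance(vec1, vec2, 1)    # |0| + |3| + |3| = 6
euclidean = minkowski_distance(vec1, vec2, 2)    # sqrt(0 + 9 + 9) = sqrt(18)

print(manhattan, euclidean)
```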

## Angular Distance

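This is the size of the angle between the two vectors. The function below divides the angle by π, so the result lies between 0 (same direction) and 1 (opposite directions).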

In [None]:
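from numpy.linalg import norm

def angular_distance(v1, v2):
    """Computes the angular distance between two vectors, scaled to [0, 1]"""
    sim = v1.dot(v2)/(norm(v1) * norm(v2))
    # Rounding can push sim slightly outside [-1, 1], which would make arccos return nan
    return np.arccos(np.clip(sim, -1, 1))/np.pi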

In [None]:
angular_distance(vec1, vec2)

In [None]:
angular_distance(vec1, vec1)    # Two identical vectors have an angular distance of 0

In [None]:
angular_distance(vec1, 2 * vec1)    # It's insensitive to magnitude: distinct vectors can be at distance 0, so
                                    # technically it's not a metric as defined by mathematicians, unless all
                                    # vectors are restricted to unit length
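
A few reference angles may help calibrate the scale. The sketch below repeats `angular_distance` so that it runs on its own, and it clips the cosine similarity to [-1, 1] as a guard against floating-point rounding; the vectors `east`, `north`, and `west` are made up for illustration.

```python
import numpy as np
from numpy.linalg import norm

def angular_distance(v1, v2):
    """Computes the angular distance between two vectors, scaled to [0, 1]"""
    sim = v1.dot(v2) / (norm(v1) * norm(v2))
    # Clipping is a precaution: rounding can leave sim slightly outside [-1, 1]
    return np.arccos(np.clip(sim, -1, 1)) / np.pi

east = np.array([1, 0])
north = np.array([0, 1])
west = np.array([-1, 0])

same = angular_distance(east, 3 * east)    # Same direction: angle of 0
right = angular_distance(east, north)      # Perpendicular: a quarter turn, half the maximum
opposite = angular_distance(east, west)    # Opposite directions: the maximum distance

print(same, right, opposite)
```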

## Hamming Distance

Intended for strings (bitstring or otherwise), the Hamming distance between two strings of equal length is the number of positions at which their symbols differ, that is, the number of symbols that need to change in one string to make it identical to the other. (The following code was shamelessly stolen from [Wikipedia](https://en.wikipedia.org/wiki/Hamming_distance).)

In [None]:
def hamming_distance(s1, s2):
    """Return the Hamming distance between equal-length sequences"""
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(el1 != el2 for el1, el2 in zip(s1, s2))

In [None]:
hamming_distance("11101", "11011")
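
The function is not limited to bitstrings. As a small sketch, here it is applied to ordinary words and to lists, along with what happens when the lengths differ. The function is repeated so the block runs on its own; the variable names `letters`, `bits`, and `raised` are only for illustration.

```python
def hamming_distance(s1, s2):
    """Return the Hamming distance between equal-length sequences"""
    if len(s1) != len(s2):
        raise ValueError("Undefined for sequences of unequal length")
    return sum(el1 != el2 for el1, el2 in zip(s1, s2))

# "karolin" and "kathrin" differ in their third, fourth, and fifth letters
letters = hamming_distance("karolin", "kathrin")

# Any equal-length sequences work, not just strings
bits = hamming_distance([1, 0, 1], [1, 1, 1])

# Sequences of unequal length are rejected rather than silently truncated
try:
    hamming_distance("1110", "11011")
    raised = False
except ValueError:
    raised = True

print(letters, bits, raised)
```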

## Jaccard Distance

The Jaccard distance, defined for two sets, is the number of elements that belong to exactly one of the two sets divided by the number of distinct elements in the two sets combined (their union).

In [None]:
def jaccard_distance(s1, s2):
    """Computes the Jaccard distance between two sets"""
    s1, s2 = set(s1), set(s2)
    diff = len(s1.union(s2)) - len(s1.intersection(s2))
    return diff / len(s1.union(s2))

In [None]:
jaccard_distance(["cow", "pig", "horse"], ["cow", "donkey", "chicken"])

In [None]:
jaccard_distance("11101", "11011")    # Sets formed from the contents of these strings are identical
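
Working the first example by hand: the union of the two sets has 5 animals and only "cow" is shared, so the distance is 4/5. Equivalently, the Jaccard distance is one minus the fraction of elements the sets share. The sketch below checks this; `jaccard_similarity` is a helper I am adding for illustration, and `farm1` and `farm2` are just names for the example lists.

```python
def jaccard_distance(s1, s2):
    """Computes the Jaccard distance between two sets"""
    s1, s2 = set(s1), set(s2)
    diff = len(s1.union(s2)) - len(s1.intersection(s2))
    return diff / len(s1.union(s2))

def jaccard_similarity(s1, s2):
    """Computes the fraction of distinct elements that the two sets share"""
    s1, s2 = set(s1), set(s2)
    return len(s1.intersection(s2)) / len(s1.union(s2))

farm1 = ["cow", "pig", "horse"]
farm2 = ["cow", "donkey", "chicken"]

dist = jaccard_distance(farm1, farm2)      # (5 - 1) / 5
sim = jaccard_similarity(farm1, farm2)     # 1 / 5

print(dist, sim, dist + sim)
```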

In a later video I will discuss similarity metrics, focusing on Jaccard similarity.