# Distance/Similarity Metrics
Although many packages contain implementations of many distance/similarity metrics, we will still implement them by hand to gain more familiarity with them. The only external package we will use is NumPy.

## Euclidean Distance or L2-Norm
Euclidean distance is probably the most widely taught distance metric. It is taught early on in school as a way to find the distance between two points on a coordinate grid. In two dimensions it is a simple application of the Pythagorean theorem.

In [106]:
import numpy as np
import math

# Create Two Simple Vectors(Distance 1)
vector_1 = np.array([0, 0])
vector_2 = np.array([0, 1])

# Find Difference(Aka Created Vector Between The Two Vectors)
difference = vector_1-vector_2

# Square Each Number In Vector And Sum Them Up
squared = np.dot(difference, difference)

# Find Square Root Of Sum Of Squares
euclidean_distance = np.sqrt(squared)

print("Euclidean Distance: " + str(euclidean_distance))

Euclidean Distance: 1.0


In [107]:
# Create Function
def euclidean(vector_1, vector_2):
    difference = vector_1-vector_2
    distance = np.sqrt(np.dot(difference, difference))
    return distance

# On Unit Vector
vector_1 = np.array([0, 0])
vector_2 = np.array([math.sqrt(2)/2, math.sqrt(2)/2])
print("Euclidean Distance: " + str(euclidean(vector_1, vector_2)))

# Scaled From Above Example(3 Times Length)
vector_1 = np.array([0, 0])
vector_2 = np.array([3*math.sqrt(2)/2, 3*math.sqrt(2)/2])
print("Euclidean Distance: " + str(euclidean(vector_1, vector_2)))

Euclidean Distance: 1.0
Euclidean Distance: 3.0000000000000004
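
As a quick cross-check, the hand-written function can be compared with NumPy's built-in norm. This is only a minimal sketch: `euclidean` is re-declared so the block runs on its own, and the 3-4-5 example points are chosen here for illustration rather than taken from the cells above.

```python
import numpy as np

def euclidean(vector_1, vector_2):
    # Same computation as above: square root of the summed squared differences
    difference = vector_1 - vector_2
    return np.sqrt(np.dot(difference, difference))

# A 3-4-5 right triangle: the legs are 3 and 4, so the distance should be 5
point_a = np.array([0.0, 0.0])
point_b = np.array([3.0, 4.0])

by_hand = euclidean(point_a, point_b)
# np.linalg.norm computes the L2 norm by default for 1-D input
by_numpy = np.linalg.norm(point_a - point_b)

print("By Hand:  " + str(by_hand))
print("By NumPy: " + str(by_numpy))
```

Both values should agree up to floating-point rounding.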


## Manhattan Distance or L1-Norm or City-Block Distance
Manhattan distance is a simple-to-calculate distance metric. It is the sum of the absolute differences between the elements of the two vectors being compared. Unlike Euclidean distance, which measures the straight line between two points, Manhattan distance is the sum of the differences along each axis. An analogy is traveling through a city whose roads are laid out in a grid. For instance, you would tell someone to walk north 3 blocks and east 2 blocks, meaning they would travel 5 blocks total.

In [123]:
# Create Two Simple Vectors(Distance 1)
vector_1 = np.array([0, 0])
vector_2 = np.array([0, 1])

# Find Difference(Aka Created Vector Between The Two Vectors)
difference = vector_1-vector_2

absolute_diff = np.abs(difference)

manhattan_distance = np.sum(absolute_diff)

print("Manhattan Distance: " + str(manhattan_distance))

Manhattan Distance: 1


In [124]:
# Create Function
def manhattan(vector_1, vector_2):
    difference = vector_1-vector_2
    absolute_diff = np.abs(difference)
    distance = np.sum(absolute_diff)
    return distance

# On Unit Vector
vector_1 = np.array([0, 0])
vector_2 = np.array([math.sqrt(2)/2, math.sqrt(2)/2])
print("Manhattan Distance: " + str(manhattan(vector_1, vector_2)))

# Scaled From Above Example(3 Times Length)
vector_1 = np.array([0, 0])
vector_2 = np.array([3*math.sqrt(2)/2, 3*math.sqrt(2)/2])
print("Manhattan Distance: " + str(manhattan(vector_1, vector_2)))

Manhattan Distance: 1.4142135623730951
Manhattan Distance: 4.242640687119286
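
The city-block analogy from the start of this section can be checked directly. The sketch below assumes the walker starts at the origin and that east and north map to the first and second axis; `manhattan` is re-declared so the block is self-contained.

```python
import numpy as np

def manhattan(vector_1, vector_2):
    # Sum of the absolute per-axis differences
    return np.sum(np.abs(vector_1 - vector_2))

start = np.array([0, 0])
destination = np.array([2, 3])  # 2 blocks east, 3 blocks north

blocks_walked = manhattan(start, destination)
# The same value through NumPy's norm with ord=1 (the L1 norm)
via_norm = np.linalg.norm(destination - start, ord=1)
# Straight-line (Euclidean) distance for comparison
offset = destination - start
straight_line = np.sqrt(np.dot(offset, offset))

print("Blocks Walked: " + str(blocks_walked))
print("L1 Norm:       " + str(via_norm))
print("Straight Line: " + str(straight_line))
```

The walker covers 5 blocks, while the straight line between the two corners is shorter, which is the difference between the L1 and L2 norms.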


## Cosine Similarity
Cosine similarity is simply the cosine of the angle between two vectors. It can be turned into a distance-like measurement, but the result is not a proper distance metric since it violates the triangle inequality.

In [125]:
# Create Two Simple Vectors(Distance 1)
vector_1 = np.array([0, 0])
vector_2 = np.array([0, 1])

dot_prob = np.dot(vector_1, vector_2)
vector_1_norm = np.sqrt(np.dot(vector_1, vector_1))
vector_2_norm = np.sqrt(np.dot(vector_2, vector_2))

cosine_sim = dot_prob/(vector_1_norm*vector_2_norm)



So here is the first "gotcha" with cosine similarity. If either vector has an L2-norm (the square root of its dot product with itself) of zero, then we end up dividing by zero; NumPy emits a RuntimeWarning and the result is NaN. So we will add an if statement to catch that case explicitly. The angle to a zero-length vector is undefined, so there is no meaningful value in this situation and we will return NaN.

In [126]:
dot_prob = np.dot(vector_1, vector_2)
vector_1_norm = np.sqrt(np.dot(vector_1, vector_1))
vector_2_norm = np.sqrt(np.dot(vector_2, vector_2))

if vector_1_norm ==0 or vector_2_norm==0:
    cosine_sim = np.nan
else: 
    cosine_sim = dot_prob/(vector_1_norm*vector_2_norm)
    
print("Cosine Similarity: " + str(cosine_sim))

Cosine Similarity: nan
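
It may help to see what NumPy does with the unguarded division. The sketch below uses `np.errstate`, which is not used elsewhere in this notebook, to show that a zero-over-zero division on NumPy floats yields NaN (normally with a warning) and only raises an exception when asked to. The variable names are illustrative.

```python
import numpy as np

zero_vector = np.array([0, 0])
zero_norm = np.sqrt(np.dot(zero_vector, zero_vector))  # 0.0 as a NumPy float
dot_product = np.float64(0.0)

# Silence the warning for this demonstration; the result is still nan
with np.errstate(invalid="ignore"):
    quiet_result = dot_product / zero_norm

# Ask NumPy to raise instead of warning
raised = False
try:
    with np.errstate(invalid="raise", divide="raise"):
        dot_product / zero_norm
except FloatingPointError:
    raised = True

print("Quiet Result: " + str(quiet_result))
print("Raised:       " + str(raised))
```

An explicit check on the norms, as done above, keeps the intent visible and does not depend on NumPy's error-handling settings.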


In [127]:
# Create Function
def cosine(vector_1, vector_2):
    dot_prob = np.dot(vector_1, vector_2)
    vector_1_norm = np.sqrt(np.dot(vector_1, vector_1))
    vector_2_norm = np.sqrt(np.dot(vector_2, vector_2))

    if vector_1_norm == 0 or vector_2_norm == 0:
        return np.nan
    
    cosine_sim = dot_prob/(vector_1_norm*vector_2_norm)
    return cosine_sim

# 45 Degree Difference
vector_1 = np.array([1, 0])
vector_2 = np.array([math.sqrt(2)/2, math.sqrt(2)/2])
print("Cosine Similarity: " + str(cosine(vector_1, vector_2)))
print("cos(45 Degrees):" + str(math.cos(math.pi/4)))

# 90 Degree Difference
vector_1 = np.array([1, 0])
vector_2 = np.array([0, 1])
print("Cosine Similarity: " + str(cosine(vector_1, vector_2)))
print("cos(90 Degrees):" + str(math.cos(math.pi/2)))

Cosine Similarity: 0.7071067811865476
cos(45 Degrees):0.7071067811865476
Cosine Similarity: 0.0
cos(90 Degrees):6.123233995736766e-17


So now we must talk about another issue with cosine similarity: it is a similarity measurement and not a distance. With a distance we think of a small value as being closer or more similar; with cosine similarity, however, larger values are more similar and smaller values are less similar. As a side note, cosine similarity is bounded between -1 and 1. To make it operate like a normal distance we simply take 1 minus the cosine similarity. This gives us cosine distance, which is bounded between 0 and 2.

In [113]:
# Create Updated Function
def cosine(vector_1, vector_2):
    dot_prob = np.dot(vector_1, vector_2)
    vector_1_norm = np.sqrt(np.dot(vector_1, vector_1))
    vector_2_norm = np.sqrt(np.dot(vector_2, vector_2))

    if vector_1_norm == 0 or vector_2_norm == 0:
        return np.nan
    
    cosine_sim = dot_prob/(vector_1_norm*vector_2_norm)
    return 1-cosine_sim

# 45 Degree Difference
vector_1 = np.array([1, 0])
vector_2 = np.array([math.sqrt(2)/2, math.sqrt(2)/2])
print("Cosine Distance: " + str(cosine(vector_1, vector_2)))
print("1-cos(45 Degrees):" + str(1-math.cos(math.pi/4)))

# 90 Degree Difference
vector_1 = np.array([1, 0])
vector_2 = np.array([0, 1])
print("Cosine Distance: " + str(cosine(vector_1, vector_2)))
print("1-cos(90 Degrees):" + str(1-math.cos(math.pi/2)))

Cosine Distance: 0.2928932188134524
1-cos(45 Degrees):0.2928932188134524
Cosine Distance: 1.0
1-cos(90 Degrees):0.9999999999999999
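
The claim that this measurement violates the triangle inequality can be illustrated with three vectors at 0, 45, and 90 degrees. This is a minimal sketch: `cosine_distance` is a compact, hypothetical restatement of the 1 minus cosine function above, and the three vectors are chosen here for illustration.

```python
import math
import numpy as np

def cosine_distance(vector_1, vector_2):
    # 1 minus the cosine of the angle between the vectors; nan for a zero vector
    norm_product = np.sqrt(np.dot(vector_1, vector_1)) * np.sqrt(np.dot(vector_2, vector_2))
    if norm_product == 0:
        return np.nan
    return 1 - np.dot(vector_1, vector_2) / norm_product

a = np.array([1.0, 0.0])                            # 0 degrees
b = np.array([math.sqrt(2) / 2, math.sqrt(2) / 2])  # 45 degrees
c = np.array([0.0, 1.0])                            # 90 degrees

direct = cosine_distance(a, c)                          # about 1.0
detour = cosine_distance(a, b) + cosine_distance(b, c)  # about 0.586

print("Direct a -> c:      " + str(direct))
print("Detour a -> b -> c: " + str(detour))
```

For a proper metric the direct distance could never exceed the detour through `b`, yet here it does.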


## Correlation(Pearson's Correlation)
Correlation is the measurement of the linear relationship between two vectors (variables). Correlation is bounded between -1 and 1: 1 is a perfect positive correlation, 0 is no linear correlation, and -1 is a perfect negative correlation. For comparing two vectors, it's best to think about correlation as comparing two signals. If they have a similar shape they will have a higher correlation.

In [114]:
# Create Two Simple Vectors(Distance 1)
vector_1 = np.array([1, 1])
vector_2 = np.array([1, 0])

vector_1_centered = vector_1-np.mean(vector_1)
vector_2_centered = vector_2-np.mean(vector_2)

vector_1_stddev = np.sqrt(np.dot(vector_1_centered, vector_1_centered))
vector_2_stddev = np.sqrt(np.dot(vector_2_centered, vector_2_centered))

covariance = np.dot(vector_1_centered, vector_2_centered)

correlation = covariance/(vector_1_stddev * vector_2_stddev)



So here we have another "gotcha". If a vector has a standard deviation of zero (i.e. all of its values are the same), then we end up dividing by zero. So how should we handle this? Well, remember that correlation is a measurement of LINEAR relationship. Since one of the vectors isn't changing, we can't determine whether there is a relationship. So in this case we will assign it a correlation of 0.

In [128]:
vector_1_centered = vector_1-np.mean(vector_1)
vector_2_centered = vector_2-np.mean(vector_2)

vector_1_stddev = np.sqrt(np.dot(vector_1_centered, vector_1_centered))
vector_2_stddev = np.sqrt(np.dot(vector_2_centered, vector_2_centered))

covariance = np.dot(vector_1_centered, vector_2_centered)

if vector_1_stddev == 0 or vector_2_stddev == 0:
    correlation = 0
else:
    correlation = covariance/(vector_1_stddev * vector_2_stddev)
    
print("Correlation: " + str(correlation))

Correlation: 0


In [129]:
# Create Function
def correlation(vector_1, vector_2):
    vector_1_centered = vector_1-np.mean(vector_1)
    vector_2_centered = vector_2-np.mean(vector_2)
    vector_1_stddev = np.sqrt(np.dot(vector_1_centered, vector_1_centered))
    vector_2_stddev = np.sqrt(np.dot(vector_2_centered, vector_2_centered))
    covariance = np.dot(vector_1_centered, vector_2_centered)

    if vector_1_stddev == 0 or vector_2_stddev == 0:
        return 0
    
    correlation = covariance/(vector_1_stddev * vector_2_stddev)
    return correlation

# Exact Same
vector_1 = np.array([1, 0])
vector_2 = np.array([1, 0])
print("Correlation: " + str(correlation(vector_1, vector_2)))

# Same But Scaled
vector_1 = np.array([1, 2, 3, 4, 5, 6, 7, 8])
vector_2 = np.array([2, 4, 6, 8, 10, 12, 14, 16])
print("Correlation: " + str(correlation(vector_1, vector_2)))

# Exact Same But Flipped
vector_1 = np.array([1, 0])
vector_2 = np.array([-1, 0])
print("Correlation: " + str(correlation(vector_1, vector_2)))

# Random Numbers
vector_1 = np.random.normal(0, 1, 10)
vector_2 = np.random.normal(0, 1, 10)
print("Correlation: " + str(correlation(vector_1, vector_2)))

Correlation: 0.9999999999999998
Correlation: 1.0
Correlation: -0.9999999999999998
Correlation: -0.13372386230821978
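
The hand-written function can be compared against `np.corrcoef`, and the "similar shape" idea checked by shifting and scaling a signal. This is a sketch under a few assumptions: `correlation` is re-declared so the block runs on its own, and the seed, offset, and scale factor are arbitrary choices made here.

```python
import numpy as np

def correlation(vector_1, vector_2):
    # Center both vectors, then take the cosine of the angle between them
    vector_1_centered = vector_1 - np.mean(vector_1)
    vector_2_centered = vector_2 - np.mean(vector_2)
    norm_1 = np.sqrt(np.dot(vector_1_centered, vector_1_centered))
    norm_2 = np.sqrt(np.dot(vector_2_centered, vector_2_centered))
    if norm_1 == 0 or norm_2 == 0:
        return 0
    return np.dot(vector_1_centered, vector_2_centered) / (norm_1 * norm_2)

rng = np.random.default_rng(0)  # arbitrary seed, only for repeatability
signal = rng.normal(0, 1, 10)
other = rng.normal(0, 1, 10)

by_hand = correlation(signal, other)
by_numpy = np.corrcoef(signal, other)[0, 1]  # off-diagonal entry of the 2x2 matrix

# Same shape, different offset and scale: correlation stays at (about) 1
shifted_and_scaled = correlation(signal, 3 * signal + 7)

print("By Hand:            " + str(by_hand))
print("By NumPy:           " + str(by_numpy))
print("Shifted And Scaled: " + str(shifted_and_scaled))
```

Note that `np.corrcoef` does not apply the zero-variance rule chosen above, so the two only agree when neither vector is constant.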


Just like cosine similarity, correlation is a similarity measurement rather than a distance. We will apply the exact same fix we did for cosine similarity: take 1 minus the correlation. This is often referred to as Pearson distance, so we will rename our function to reflect this. A vector with zero standard deviation was assigned a correlation of 0, so it now gets a distance of 1.

In [117]:
# Create Updated Function
def pearson(vector_1, vector_2):
    vector_1_centered = vector_1-np.mean(vector_1)
    vector_2_centered = vector_2-np.mean(vector_2)
    vector_1_stddev = np.sqrt(np.dot(vector_1_centered, vector_1_centered))
    vector_2_stddev = np.sqrt(np.dot(vector_2_centered, vector_2_centered))
    covariance = np.dot(vector_1_centered, vector_2_centered)

    if vector_1_stddev == 0 or vector_2_stddev == 0:
        return 1
    
    correlation = covariance/(vector_1_stddev * vector_2_stddev)
    return 1-correlation
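
A short usage sketch for the renamed function follows. `pearson` is repeated in condensed form so the block is self-contained, and the test vectors mirror the earlier correlation examples.

```python
import numpy as np

def pearson(vector_1, vector_2):
    # Pearson distance: 1 minus the correlation, with 1 for a constant vector
    vector_1_centered = vector_1 - np.mean(vector_1)
    vector_2_centered = vector_2 - np.mean(vector_2)
    norm_1 = np.sqrt(np.dot(vector_1_centered, vector_1_centered))
    norm_2 = np.sqrt(np.dot(vector_2_centered, vector_2_centered))
    if norm_1 == 0 or norm_2 == 0:
        return 1
    return 1 - np.dot(vector_1_centered, vector_2_centered) / (norm_1 * norm_2)

# Exact same: distance close to 0
same = pearson(np.array([1, 0]), np.array([1, 0]))
# Exact same but flipped: distance close to 2
flipped = pearson(np.array([1, 0]), np.array([-1, 0]))
# One constant vector: falls back to a distance of 1
constant = pearson(np.array([1, 1]), np.array([1, 0]))

print("Pearson Distance: " + str(same))
print("Pearson Distance: " + str(flipped))
print("Pearson Distance: " + str(constant))
```

The three cases cover the two ends of the 0 to 2 range and the zero-variance fallback.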

## Chi-Squared Distance
In chi-squared distance we treat two vectors as a two-way contingency table and calculate a chi-squared statistic for the table. If the statistic is near 0 we assume the two vectors are "independent", meaning their shapes are roughly similar; if the statistic is large there is some type of relationship, meaning the shapes differ. This distance metric is only suitable for non-negative integers (counts). You can imagine both vectors as histograms whose shapes we are comparing to one another. You might see this when comparing word frequencies in text documents.

See: "A Recent Advance in Data Analysis: Clustering Objects into Classes Characterized by Conjunctive Concepts"
      Michalski, R. S. et al.
      http://mars.gmu.edu/jspui/handle/1920/1556?show=full
See: https://stats.stackexchange.com/questions/184101/comparing-two-histograms-using-chi-square-distance
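
Written out, the quantity computed by the code in this section appears to be the following. The symbols are introduced here for illustration: $x$ and $y$ are the two count vectors, $j$ runs over their elements, and any term with $x_j + y_j = 0$ is taken to contribute nothing.

$$
d(x, y) = \sqrt{\sum_{j} \frac{1}{x_j + y_j} \left( \frac{x_j}{\sum_k x_k} - \frac{y_j}{\sum_k y_k} \right)^2}
$$

The two fractions inside the parentheses are the relative frequencies of each vector, so vectors that differ only by a constant scale factor have a distance of 0.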

In [118]:
vector_1 = np.array([1., 0.])
vector_2 = np.array([1., 0.])

col_sums = vector_1 + vector_2
col_sums_recip = np.reciprocal(col_sums)

rel_freq_vector_1 = vector_1/np.sum(vector_1)
rel_freq_vector_2 = vector_2/np.sum(vector_2)

diff_rel_freq = rel_freq_vector_1-rel_freq_vector_2
diff_rel_freq_square = np.square(diff_rel_freq)

chisqr = np.sqrt(np.dot(col_sums_recip, diff_rel_freq_square))

print("Chi-Square: " + str(chisqr))

Chi-Square: nan




First "gotcha" for this metric. If the sum of the two vectors has an element that is 0, we divide by zero when finding reciprocals. When this occurs we should not allow that element to contribute to the chi-squared statistic, so we will let the reciprocal be equal to zero in that case. We can use the where argument of reciprocal() to skip finding the reciprocal of a number that is 0. The skipped positions are left uninitialized unless an out array is supplied, so we pass an array of zeros as out.

In [119]:
col_sums = vector_1 + vector_2
# where= skips zero sums; out= pre-fills those skipped slots with 0
col_sums_recip = np.reciprocal(col_sums, where=(col_sums!=0.0), out=np.zeros_like(col_sums))
rel_freq_vector_1 = vector_1/np.sum(vector_1)
rel_freq_vector_2 = vector_2/np.sum(vector_2)
diff_rel_freq = rel_freq_vector_1-rel_freq_vector_2
diff_rel_freq_square = np.square(diff_rel_freq)
chisqr = np.sqrt(np.dot(col_sums_recip, diff_rel_freq_square))

print("Chi-Square: " + str(chisqr))

vector_1 = np.array([0., 0.])
vector_2 = np.array([1., 0.])

col_sums = vector_1 + vector_2
col_sums_recip = np.reciprocal(col_sums, where=(col_sums!=0.0), out=np.zeros_like(col_sums))
rel_freq_vector_1 = vector_1/np.sum(vector_1)
rel_freq_vector_2 = vector_2/np.sum(vector_2)
diff_rel_freq = rel_freq_vector_1-rel_freq_vector_2
diff_rel_freq_square = np.square(diff_rel_freq)
chisqr = np.sqrt(np.dot(col_sums_recip, diff_rel_freq_square))

Chi-Square: 0.0




One final "gotcha". If we have a vector of all zeros no statistic can be calculated, so we will return nan.

In [120]:
# Create Updated Function
def chisqr(vector_1, vector_2):
    col_sums = vector_1 + vector_2
    # where= skips zero sums; out= makes the skipped reciprocals exactly 0
    col_sums_recip = np.reciprocal(col_sums, where=(col_sums!=0.0), out=np.zeros_like(col_sums))
    vector_1_sum = np.sum(vector_1)
    vector_2_sum = np.sum(vector_2)

    # An all-zero vector has no relative frequencies
    if vector_1_sum == 0. or vector_2_sum == 0.:
        return np.nan

    rel_freq_vector_1 = vector_1/vector_1_sum
    rel_freq_vector_2 = vector_2/vector_2_sum
    diff_rel_freq = rel_freq_vector_1-rel_freq_vector_2
    diff_rel_freq_square = np.square(diff_rel_freq)
    distance = np.sqrt(np.dot(col_sums_recip, diff_rel_freq_square))
    return distance

vector_1 = np.array([0., 0.])
vector_2 = np.array([1., 0.])
print("Chi-Square: " + str(chisqr(vector_1, vector_2)))

# Exact Same
vector_1 = np.array([1.0, 0.0])
vector_2 = np.array([1.0, 0.0])
print("Chi-Square: " + str(chisqr(vector_1, vector_2)))

# Same But Scaled
vector_1 = np.array([1., 2., 3., 4., 5., 6., 7., 8.])
vector_2 = np.array([2., 4., 6., 8., 10., 12., 14., 16.])
print("Chi-Square: " + str(chisqr(vector_1, vector_2)))
    
vector_1 = np.array([5.0, 10.0])
vector_2 = np.array([10.0, 5.0])
print("Chi-Square: " + str(chisqr(vector_1, vector_2)))

vector_1 = np.array([43.,44.,87.])
vector_2 = np.array([9.,4.,13.])
print("Chi-Square: " + str(chisqr(vector_1, vector_2)))

Chi-Square: nan
Chi-Square: 0.0
Chi-Square: 0.0
Chi-Square: 0.12171612389003691
Chi-Square: 0.019821345298596107
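
The word-frequency use case mentioned at the start of this section might look like the sketch below. Everything outside `chisqr` is an assumption made for illustration: the three toy documents, the whitespace tokenizer, and the hypothetical helper `word_counts`. The function itself is repeated in condensed form so the block runs on its own.

```python
from collections import Counter

import numpy as np

def chisqr(vector_1, vector_2):
    col_sums = vector_1 + vector_2
    # Reciprocal of each column sum, with 0 where the sum itself is 0
    col_sums_recip = np.reciprocal(col_sums, where=(col_sums != 0.0),
                                   out=np.zeros_like(col_sums))
    vector_1_sum = np.sum(vector_1)
    vector_2_sum = np.sum(vector_2)
    if vector_1_sum == 0.0 or vector_2_sum == 0.0:
        return np.nan
    diff_rel_freq = vector_1 / vector_1_sum - vector_2 / vector_2_sum
    return np.sqrt(np.dot(col_sums_recip, np.square(diff_rel_freq)))

def word_counts(text, vocabulary):
    # Count whitespace-separated words, ordered by the shared vocabulary
    counts = Counter(text.lower().split())
    return np.array([counts[word] for word in vocabulary], dtype=float)

doc_a = "the cat sat on the mat"
doc_b = "the cat sat on the mat the cat sat on the mat"  # doc_a twice over
doc_c = "the dog chased the ball"

vocabulary = sorted(set((doc_a + " " + doc_b + " " + doc_c).split()))
counts_a = word_counts(doc_a, vocabulary)
counts_b = word_counts(doc_b, vocabulary)
counts_c = word_counts(doc_c, vocabulary)

same_shape = chisqr(counts_a, counts_b)
different_shape = chisqr(counts_a, counts_c)

print("Chi-Square (a vs b): " + str(same_shape))
print("Chi-Square (a vs c): " + str(different_shape))
```

Because the metric works on relative frequencies, the doubled document is at distance 0 from the original, while the document with different words is farther away.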


## Hamming Distance
Hamming distance is another simple distance metric. It is simply the count of discrepancies (non-equal values) between two vectors of the same length. It is useful when comparing strings and bit strings.

In [121]:
vector_1 = np.array([1, 0, 0, 1, 0, 1])
vector_2 = np.array([1, 0, 0, 1, 0, 1])

distance = np.sum(vector_1 != vector_2)
print("Hamming Distance: " + str(distance))

vector_1 = np.array([0, 0, 0, 0, 0, 0])
vector_2 = np.array([1, 1, 1, 1, 1, 1])

distance = np.sum(vector_1 != vector_2)
print("Hamming Distance: " + str(distance))

vector_1 = np.array([1, 1, 1, 0, 0, 0])
vector_2 = np.array([1, 1, 1, 1, 1, 1])

distance = np.sum(vector_1 != vector_2)
print("Hamming Distance: " + str(distance))

Hamming Distance: 0
Hamming Distance: 6
Hamming Distance: 3


In [122]:
def hamming(vector_1, vector_2):
    return np.sum(vector_1 != vector_2)

vector_1 = np.array([1, 0, 0, 1, 0, 1])
vector_2 = np.array([1, 0, 0, 1, 0, 1])
print("Hamming Distance: " + str(hamming(vector_1, vector_2)))

vector_1 = np.array([0, 0, 0, 0, 0, 0])
vector_2 = np.array([1, 1, 1, 1, 1, 1])
print("Hamming Distance: " + str(hamming(vector_1, vector_2)))

vector_1 = np.array([1, 1, 1, 0, 0, 0])
vector_2 = np.array([1, 1, 1, 1, 1, 1])
print("Hamming Distance: " + str(hamming(vector_1, vector_2)))

Hamming Distance: 0
Hamming Distance: 6
Hamming Distance: 3
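
Since the metric is also described as useful for strings, here is a minimal sketch applying it to two words of equal length. The length check and the example words are additions made here for illustration and are not part of the function above.

```python
import numpy as np

def hamming(vector_1, vector_2):
    vector_1 = np.asarray(vector_1)
    vector_2 = np.asarray(vector_2)
    # Assumption added here: refuse inputs whose lengths differ
    if vector_1.shape != vector_2.shape:
        raise ValueError("Hamming distance needs inputs of equal length")
    return np.sum(vector_1 != vector_2)

# Split each word into an array of single characters
word_1 = np.array(list("karolin"))
word_2 = np.array(list("kathrin"))

string_distance = hamming(word_1, word_2)
print("Hamming Distance: " + str(string_distance))
```

The two words differ in their third, fourth, and fifth characters, so the distance is 3.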
