# Measuring similarity between points

Measuring similarity between points is important in machine learning, particularly for pattern recognition and classification, because it tells us how close or far apart two data points are in the projected space.

Different similarity metrics capture different aspects of similarity and suit different projection spaces. For instance, Euclidean distance is used in Euclidean geometry, as the name suggests, while cosine similarity is more common for vectors such as word embeddings.

# Euclidean Distance

This is the ordinary straight-line distance between two points in Euclidean space. This measure is also referred to as the $L^2$ norm or $L^2$ distance. The smaller the distance, the more alike the points are.

There are a few variations of the formula to calculate the distance:

### For Cartesian coordinates

The **Euclidean distance** between two points **p** = ($p_1, p_2, ..., p_n$) and **q** = ($q_1, q_2, ..., q_n$) is

$$ d(p, q) = d(q, p) = \sqrt{(q_1 - p_1)^2 + (q_2 - p_2)^2 + ... + (q_n - p_n)^2} = \sqrt{\sum_{i=1}^{n}(q_i - p_i)^2} $$
                                       
### For Euclidean vectors

If **p** and **q** are treated as Euclidean vectors that start at the origin of the space and end at the two points p and q, the distance between the points is the Euclidean norm (also called the Euclidean length or magnitude) of the difference vector **p** - **q**:

$$ \lVert p - q \rVert = \sqrt{\lVert p \rVert^2 + \lVert q \rVert^2 - 2 p\cdot q }$$

where
$$ \lVert p \rVert = \sqrt{p_1^2 + p_2^2 + ... + p_n^2}$$
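The vector form can be checked numerically against the coordinate form. The snippet below is only a sketch: the helper names `dot` and `norm` are illustrative and not part of the notebook's own code.

```python
import math

def dot(p, q):
    """Dot product of two equal-length sequences."""
    return sum(a * b for a, b in zip(p, q))

def norm(p):
    """Euclidean norm (length) of a vector."""
    return math.sqrt(dot(p, p))

p = [1, 2, 3]
q = [4, 5, 6]

# Coordinate form: square root of the summed squared differences.
coordinate_form = math.sqrt(sum((b - a) ** 2 for a, b in zip(p, q)))

# Vector form: sqrt(||p||^2 + ||q||^2 - 2 p.q)
vector_form = math.sqrt(norm(p) ** 2 + norm(q) ** 2 - 2 * dot(p, q))

# The two values agree up to floating-point rounding.
print(coordinate_form, vector_form)
```

Both expressions describe the same quantity, so any difference between the printed values comes only from floating-point rounding.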

In [23]:
import math

def euclidean_distance(p, q):
    """Calculate the Euclidean distance between two points p and q.

    This function assumes that the points are in Cartesian coordinates.

    Args:
        p (:obj:`list` of int or float): The first point.
        q (:obj:`list` of int or float): The second point.

    Returns:
        float: Distance between the two points.

    Raises:
        AssertionError: If p and q do not have the same dimensionality.
        AssertionError: If the lists contain elements that are not int or float.

    Example:
        dist = euclidean_distance([1, 2, 3], [4, 5, 6])
    """
    if len(p) != len(q):
        raise AssertionError("p and q should be of the same length, i.e. they should have the same dimensions")

    if not all(isinstance(p_, (int, float)) for p_ in p):
        raise AssertionError("All elements of the parameters passed should be of type int or float")

    if not all(isinstance(q_, (int, float)) for q_ in q):
        raise AssertionError("All elements of the parameters passed should be of type int or float")

    # Running sum of the squared differences between the two points.
    summation = 0

    for x, y in zip(p, q):
        diff_sq = (x - y) ** 2  # Squared difference along one dimension.
        summation += diff_sq

    return math.sqrt(summation)  # Square root of the summed squared differences.

In [24]:
euclidean_distance([1,2,3], [4,5,6])

5.196152422706632
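As a cross-check, the Python standard library has offered `math.dist` since Python 3.8, which computes the same Euclidean distance. The sketch below assumes such a Python version is available and uses the same example points.

```python
import math

p = [1, 2, 3]
q = [4, 5, 6]

# math.dist returns the Euclidean distance between two points given as
# sequences of the same dimension (available from Python 3.8 onwards).
builtin_distance = math.dist(p, q)
print(builtin_distance)
```

Writing the function by hand, as in the cell above, is still useful for seeing how the formula maps to code.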

# Manhattan distance

The Manhattan distance between two points is the sum of the absolute differences of their Cartesian coordinates. It is measured along axes at right angles.

Note: While the Euclidean distance is a straight-line distance, the Manhattan distance is always measured along right angles. Thus the Manhattan distance is always greater than or equal to the Euclidean distance.

This metric goes by several names, including taxicab distance and $L_1$ norm.

Mathematically, the Manhattan distance $d_1$ between two vectors **p** and **q** is the sum of the lengths of the projections of the line segment between the points onto the coordinate axes.

$$d_1(\textbf{p}, \textbf{q}) = \| \textbf{p} - \textbf{q} \|_1 = \sum_{i=1}^n|p_i - q_i| $$

where **p** = ($p_1, p_2, ..., p_n$) and **q** = ($q_1, q_2, ..., q_n$)
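To illustrate how the Manhattan and Euclidean distances relate, here is a small sketch. The compact helpers `manhattan` and `euclidean` are stand-ins written for this example only and skip the input validation used elsewhere in the notebook.

```python
import math

def manhattan(p, q):
    """Sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    """Square root of the summed squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Moving along one axis only: both distances coincide.
axis_aligned = ([0, 0], [3, 0])
# Moving along both axes: the Manhattan distance is larger.
diagonal = ([0, 0], [3, 4])

for p, q in (axis_aligned, diagonal):
    print(manhattan(p, q), euclidean(p, q))
```

The two distances are equal only when the points differ along a single axis; otherwise the right-angled path is longer than the straight line.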

In [25]:
def manhattan_distance(p, q):
    """Calculate the Manhattan distance between the points p and q.

    This function assumes that the points are in Cartesian coordinates.

    Args:
        p (:obj:`list` of int or float): The first point.
        q (:obj:`list` of int or float): The second point.

    Returns:
        int or float: Distance between the two points.

    Raises:
        AssertionError: If p and q do not have the same dimensionality.
        AssertionError: If the lists contain elements that are not int or float.

    Example:
        dist = manhattan_distance([1, 2, 3], [4, 5, 6])
    """
    if len(p) != len(q):
        raise AssertionError("p and q should be of the same length, i.e. they should have the same dimensions")

    if not all(isinstance(p_, (int, float)) for p_ in p):
        raise AssertionError("All elements of the parameters passed should be of type int or float")

    if not all(isinstance(q_, (int, float)) for q_ in q):
        raise AssertionError("All elements of the parameters passed should be of type int or float")

    # Running sum of the absolute differences between the two points.
    summation = 0

    for x, y in zip(p, q):
        diff_abs = abs(x - y)  # Absolute difference along one dimension.
        summation += diff_abs

    return summation  # Sum of the absolute differences.

In [26]:
manhattan_distance([1,2,3], [4,5,6])

9

# Chebyshev Distance

Also known as Tchebychev distance, maximum metric or $L_\infty$ metric, this is the distance between two vectors defined as the greatest of their differences along any coordinate dimension.

### Formal definition:

The Chebyshev distance between two vectors or points p and q, with standard coordinates $p_i$ and $q_i$ respectively, is

$$D_{\text{Chebyshev}}(p,q) := \max_{i}(|p_i - q_i|)$$

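This equals the limit of the $L_p$ metrics as their order goes to infinity (the order is written $k$ here, because $p$ already names a point):

$$\lim_{k \to \infty} \left(\sum_{i=1}^n |p_i - q_i|^k\right)^{1/k}$$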

Hence it is called $L_\infty$
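A short argument for why this limit equals the maximum, offered as a sketch: write $m = \max_i |p_i - q_i|$ and let $n$ be the number of coordinates. The largest term alone is at most the whole sum, and each of the $n$ terms is at most $m^k$, so

$$m^k \le \sum_{i=1}^n |p_i - q_i|^k \le n\, m^k$$

Taking the $k$-th root of each part gives

$$m \le \left(\sum_{i=1}^n |p_i - q_i|^k\right)^{1/k} \le n^{1/k}\, m$$

Since $n^{1/k} \to 1$ as $k \to \infty$, the middle expression is squeezed to $m$.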


In [7]:
def chebyshev_distance(p, q):
    """Calculate the Chebyshev distance between the points p and q.

    This function assumes that the points are in Cartesian coordinates.

    Args:
        p (:obj:`list` of int or float): The first point.
        q (:obj:`list` of int or float): The second point.

    Returns:
        int or float: Distance between the two points.

    Raises:
        AssertionError: If p and q do not have the same dimensionality.
        AssertionError: If the lists contain elements that are not int or float.

    Example:
        dist = chebyshev_distance([1, 2, 3], [4, 5, 6])
    """
    if len(p) != len(q):
        raise AssertionError("p and q should be of the same length, i.e. they should have the same dimensions")

    if not all(isinstance(p_, (int, float)) for p_ in p):
        raise AssertionError("All elements of the parameters passed should be of type int or float")

    if not all(isinstance(q_, (int, float)) for q_ in q):
        raise AssertionError("All elements of the parameters passed should be of type int or float")

    # Absolute differences between the points, one entry per dimension.
    diffLst = []

    for x, y in zip(p, q):
        diffLst.append(abs(x - y))  # Absolute difference along one dimension.

    return max(diffLst)  # Largest absolute difference.

In [9]:
chebyshev_distance([1,2,3], [4,5,6])

3

# Minkowski Distance

All of the above distances are special cases of the Minkowski distance. Mathematically, it can be represented as:

For two points **p** = ($p_1, p_2, ..., p_n$) and **q** = ($q_1, q_2, ..., q_n$)

$$D(p,q) = \left(\sum_{i=1}^n|p_i - q_i|^k\right)^{1/k}$$

The value usually assigned to *k* is either 1 (Manhattan distance) or 2 (Euclidean distance). In the limit as *k* approaches infinity, we obtain the Chebyshev distance.

In [27]:
def minkowski_distance(p, q, k):
    """Calculate the Minkowski distance between the points p and q for a given k.

    This function assumes that the points are in Cartesian coordinates.

    Args:
        p (:obj:`list` of int or float): The first point.
        q (:obj:`list` of int or float): The second point.
        k (int or float): The power to which each absolute difference is raised.

    Returns:
        float: Distance between the two points.

    Raises:
        AssertionError: If p and q do not have the same dimensionality.
        AssertionError: If the lists contain elements that are not int or float.

    Example:
        dist = minkowski_distance([1, 2, 3], [4, 5, 6], 1)
    """
    if len(p) != len(q):
        raise AssertionError("p and q should be of the same length, i.e. they should have the same dimensions")

    if not all(isinstance(p_, (int, float)) for p_ in p):
        raise AssertionError("All elements of the parameters passed should be of type int or float")

    if not all(isinstance(q_, (int, float)) for q_ in q):
        raise AssertionError("All elements of the parameters passed should be of type int or float")

    # Running sum of the kth powers of the absolute differences.
    summation = 0

    for x, y in zip(p, q):
        diff_pow = abs(x - y) ** k  # kth power of the absolute difference along one dimension.
        summation += diff_pow

    return summation ** (1 / k)  # kth root of the summed kth powers.

In [28]:
minkowski_distance([1,2,3], [4,5,6], 1)

9.0

In [29]:
minkowski_distance([1,2,3], [4,5,6], 2)

5.196152422706632
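The limiting behaviour for large *k* can also be observed numerically. The sketch below uses a compact, illustrative `minkowski` helper (not the validated function above) and a pair of points chosen so that the coordinate gaps differ.

```python
def minkowski(p, q, k):
    """Minkowski distance of order k (compact restatement for this sketch)."""
    return sum(abs(a - b) ** k for a, b in zip(p, q)) ** (1 / k)

p = [1, 2, 3]
q = [4, 6, 9]  # coordinate gaps are 3, 4 and 6

# Chebyshev distance: the largest coordinate gap.
chebyshev = max(abs(a - b) for a, b in zip(p, q))

# The order-k distance shrinks towards the Chebyshev distance as k grows.
for k in (1, 2, 4, 16, 64):
    print(k, minkowski(p, q, k))

print("Chebyshev:", chebyshev)
```

Very large values of *k* are avoided here on purpose, since the intermediate powers can become too large to convert to a float.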

# Cosine Similarity


Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space that measures the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any angle in the interval (0, π] radians. If the value of the similarity is close to 1, the vectors point in nearly the same direction and are similar. A value close to 0 means they are nearly orthogonal, and a value close to -1 means they point in nearly opposite directions and are dissimilar.

### Definition:

The cosine of the angle between two non-zero vectors A and B can be determined using:

$$A \cdot B = \|A\| \|B\| \cos\theta$$

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum_{i=1}^n{A_iB_i}}{\sqrt{\sum_{i=1}^n{A_i^2}}\sqrt{\sum_{i=1}^n{B_i^2}}}$$

where $A_i$ and $B_i$ are the components of vectors A and B respectively.
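The three reference values of the cosine (1, 0 and -1) can be reproduced with simple vectors. This is a minimal sketch; the helper name `cosine` is illustrative and it assumes both inputs are non-zero vectors of equal length.

```python
from math import sqrt

def cosine(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

same_direction = cosine([3, 4], [6, 8])   # second vector is a scaled copy of the first
orthogonal = cosine([1, 0], [0, 1])       # vectors at right angles
opposite = cosine([1, 0], [-1, 0])        # vectors pointing in opposite directions

print(same_direction, orthogonal, opposite)
```

Note that scaling a vector does not change the result: only the direction matters, not the magnitude.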

In [30]:
from math import sqrt

def cosine_similarity(A, B):
    """Measure the cosine similarity between two non-zero vectors A and B.

    Args:
        A (:obj:`list` of int or float): The first vector.
        B (:obj:`list` of int or float): The second vector.

    Returns:
        float: Cosine similarity between the two vectors, in the range [-1, 1].

    Raises:
        AssertionError: If A and B do not have the same dimensionality.
        AssertionError: If the lists contain elements that are not int or float.
        AssertionError: If A or B is a zero vector.

    Example:
        similarity = cosine_similarity([1, 2, 3], [4, 5, 6])
    """
    if len(A) != len(B):
        raise AssertionError("Parameters should be of the same length, i.e. they should have the same dimensions")

    if not all(isinstance(A_, (int, float)) for A_ in A):
        raise AssertionError("All elements of the parameters passed should be of type int or float")

    if not all(isinstance(B_, (int, float)) for B_ in B):
        raise AssertionError("All elements of the parameters passed should be of type int or float")

    AxB = 0  # Dot product of A and B
    AxA = 0  # Dot product of A with itself (squared length of A)
    BxB = 0  # Dot product of B with itself (squared length of B)

    for a, b in zip(A, B):
        AxB += a * b
        AxA += a ** 2
        BxB += b ** 2

    # The cosine is undefined for a zero vector (division by zero).
    if AxA == 0 or BxB == 0:
        raise AssertionError("Cosine similarity is only defined for non-zero vectors")

    similarity = AxB / (sqrt(AxA) * sqrt(BxB))
    return similarity

In [34]:
cosine_similarity([1,2,3], [100,5,120])

0.8037417976696531

# Jaccard Distance

The Jaccard index is defined as the size of the intersection divided by the size of the union of two sample sets. If both sets are empty, the index is defined as 1.

The Jaccard distance is 1 - Jaccard index. It measures the dissimilarity between two sets or vectors.
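For plain sets, the definition translates almost directly into Python's set operations. The following is a sketch with hypothetical helper names (`jaccard_index`, `jaccard_set_distance`) and made-up example sets.

```python
def jaccard_index(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # both sets empty: the index is 1 by convention
    return len(a & b) / len(union)

def jaccard_set_distance(a, b):
    """Jaccard distance: 1 minus the Jaccard index."""
    return 1 - jaccard_index(a, b)

a = {"apple", "banana", "cherry"}
b = {"banana", "cherry", "date", "fig"}

# 2 shared items out of 5 distinct items overall.
print(jaccard_index(a, b))
print(jaccard_set_distance(a, b))
```

Identical sets give an index of 1 and a distance of 0, while disjoint sets give an index of 0 and a distance of 1.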


For vectors x and y with non-negative components, where $x = (x_1, x_2, ... , x_n)$ and $y = (y_1, y_2, ..., y_n)$, the Jaccard index and distance are defined as:

$$J(x,y) = \frac{\sum_i \min(x_i, y_i)}{\sum_i \max(x_i, y_i)}$$

$$d_j(x, y) = 1 - J(x,y)$$

In [11]:
def jaccard_distance(x, y):
    """Calculate the Jaccard distance between two real-valued vectors x and y.

    The formula used here assumes that all components are non-negative.

    Args:
        x (:obj:`list` of int or float): The first vector.
        y (:obj:`list` of int or float): The second vector.

    Returns:
        int or float: Jaccard distance between vectors x and y.

    Raises:
        AssertionError: If x and y do not have the same dimensionality.
        AssertionError: If the lists contain elements that are not int or float.

    Example:
        distance = jaccard_distance([1, 2, 3], [4, 5, 6])
    """
    if len(x) != len(y):
        raise AssertionError("Parameters should be of the same length, i.e. they should have the same dimensions")

    if not all(isinstance(x_, (int, float)) for x_ in x):
        raise AssertionError("All elements of the parameters passed should be of type int or float")

    if not all(isinstance(y_, (int, float)) for y_ in y):
        raise AssertionError("All elements of the parameters passed should be of type int or float")

    sumMin = 0  # Sum of the element-wise minima
    sumMax = 0  # Sum of the element-wise maxima

    for x_, y_ in zip(x, y):
        sumMin += min(x_, y_)
        sumMax += max(x_, y_)

    # Both vectors are all zeros: the index is defined as 1, so the distance is 0.
    if sumMax == 0:
        return 0

    return 1 - (sumMin / sumMax)
    

In [12]:
jaccard_distance([1,2,3], [4,5,6])

0.6