
# Finding Dissimilarity Measures 

## Instructions


1. Create a Python function that calculates the dissimilarity and similarity between two data points. The function should take two data points and a dissimilarity/similarity abbreviation as inputs, and return both the dissimilarity and the similarity (for the measure named by the abbreviation) between the data points. The function should be robust to inappropriate inputs, and it should support the following dissimilarity/similarity measures.

Dissimilarity/ Similarity measure | Abbreviation
--- | ---
Euclidean norm | EN
Frobenius or Hilbert Schmidt norm | HSN
Diagonal norm | DN
Mahalanobis norm | MN
Lebesgue or Minkowski norm | LMN
Cosine | CS
Overlap | OS
Dice | DS
Jaccard | JS

2. Create a function (with test cases) to perform unit testing of the function (mentioned in the first point) for five different assertions. 

## Mathematical Equations 

1. **Euclidean norm** - It measures the straight-line distance between two data points ($x$ and $y$), as given below: 

  $EN(x, y) = \sqrt{\sum_{i = 1}^k (x_i - y_i)^2}$

  When the data is dense or continuous, this is a natural choice of proximity measure. The Euclidean distance between two points is the length of the straight line segment connecting them, as given by the Pythagorean theorem. 

  Reference: 
  * http://www.ashukumar27.io/similarity_functions/


2. **Frobenius norm** - It is the matrix norm of an $m \times n$ matrix $A$, which is defined as the square root of the sum of the absolute squares of its elements, as given below. 

  $|| A ||_F = \sqrt{\sum_{i = 1}^m \sum_{j = 1}^n |a_{ij}|^2}$

  Reference: 
  * https://mathworld.wolfram.com/FrobeniusNorm.html

3. **Diagonal norm** - For two data points ($x$ and $y$) and a diagonal matrix ($d$), diagonal norm (DN) can be calculated as given below:

  $DN = \sqrt{\sum_{i = 1}^k d_i(x_i - y_i)^2}$

  Reference:
  * https://www.cis.upenn.edu/~cis515/cis515-11-sl4.pdf

4. **Mahalanobis norm** - It is a multivariate distance metric that measures the distance between a point (vector) and a distribution. The Euclidean norm works well as long as the features are equally scaled and independent of each other. However, if the attributes (columns in the dataset) are correlated with one another, which is typically the case in real-world datasets, the Euclidean distance between a point and the center of the points (distribution) can give little or misleading information about how close the point is to the cluster. 

  The formula to compute Mahalanobis norm (MN) is as follows:
  
  ${MN}^2 = (x - m)^T \Sigma^{-1} (x - m)$

  where, 

  $x$ is the vector of the observation (row in a dataset).

  $m$ is the vector of mean values of independent variables (mean of each column). 

  $\Sigma$ is the covariance matrix of the independent variables, and $\Sigma^{-1}$ is its inverse.

  Reference:
  * https://www.machinelearningplus.com/statistics/mahalanobis-distance/
  * https://www.youtube.com/watch?v=4buOoXp7AyI
  * http://www.cleartheconcepts.com/dm-similarity-dissimilarity-measure/   


5. **Minkowski norm** - The  Euclidean norm is generalized by the Minkowski norm, as shown below: 

  $LMN(x, y) =\bigg({\sum_{i = 1}^k |x_i - y_i| ^ r}\bigg) ^ {1/r}$

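  where $r \geq 1$ is the order of the norm: $r = 2$ gives the Euclidean norm and $r = 1$ gives the Manhattan (city-block) distance. 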

  Reference:
  * https://stackoverflow.com/questions/1401712/how-can-the-euclidean-distance-be-calculated-with-numpy 

6. **Cosine similarity** - This metric is the normalized dot product of the two vectors, that is, the cosine of the angle between them. The cosine of 0° is 1, and it is less than 1 for any other angle between 0° and 180°. It is thus a judgment of orientation and not magnitude: two vectors with the same orientation have a cosine similarity of 1, two vectors at 90° have a similarity of 0, and two diametrically opposed vectors have a similarity of -1, independent of their magnitude. 

  This similarity (CS) between two data points ($x$ and $y$) is calculated as follows: 

  $CS (x, y) = \frac{\sum_{i = 1}^p x_i y_i}{\sqrt{\sum_{i = 1}^p (x_i)^2} \; \sqrt{\sum_{i = 1}^p (y_i)^2}}$  

  Reference:
  * http://www.ashukumar27.io/similarity_functions/ 


7. **Overlap similarity** (OS) - This metric is evaluated as follows: 

  $OS (x, y) = \frac{\sum_{i = 1}^p x_i y_i}{\min\big(\sum_{i = 1}^p (x_i)^2, \; \sum_{i = 1}^p (y_i)^2  \big)}$ 

8. **Dice similarity** (DS) - This metric is evaluated as follows: 

  $DS (x, y) = \frac{2 \sum_{i = 1}^p x_i y_i}{\sum_{i = 1}^p (x_i)^2 \; + \; \sum_{i = 1}^p (y_i)^2}$ 

9. **Jaccard similarity** - The Jaccard Similarity $JS$ is evaluated as follows: 

  $JS (x, y) = \frac{\sum_{i = 1}^p x_i y_i}{\sum_{i = 1}^p (x_i)^2 \; + \; \sum_{i = 1}^p (y_i)^2 \; - \; \sum_{i = 1}^p x_i y_i}$ 
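
The four similarity formulas above (CS, OS, DS, and JS) share the same three sums, so they can be checked by hand on a small example. The following is a minimal sketch, not the assignment's implementation; the helper name `similarity_terms` and the sample vectors are illustrative assumptions.

```python
import math

def similarity_terms(x, y):
    """Return the three sums shared by the CS, OS, DS and JS formulas."""
    dot = sum(a * b for a, b in zip(x, y))   # sum of x_i * y_i
    sq_x = sum(a * a for a in x)             # sum of x_i^2
    sq_y = sum(b * b for b in y)             # sum of y_i^2
    return dot, sq_x, sq_y

# Illustrative sample vectors (an assumption for this sketch)
x = (1, 2, 3, 1)
y = (3, 1, 2, 1)
dot, sq_x, sq_y = similarity_terms(x, y)

cosine = dot / (math.sqrt(sq_x) * math.sqrt(sq_y))
overlap = dot / min(sq_x, sq_y)
dice = 2 * dot / (sq_x + sq_y)
jaccard = dot / (sq_x + sq_y - dot)

print(round(cosine, 3), round(overlap, 3), round(dice, 3), round(jaccard, 3))
# 0.8 0.8 0.8 0.667
```

For these vectors the dot product is 12 and both squared sums are 15, so CS, OS, and DS all equal $12/15 = 0.8$, while JS equals $12/18 \approx 0.667$.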

## Python Scripts for Calculating Measures 

From the mathematical equations, we can observe that the norms take the inputs as given below:

Norm | Input args
--- | ---
Euclidean | two vectors ($x$ and $y$)
Minkowski | two vectors ($x$ and $y$) and one integer ($r$)
Diagonal | two vectors ($x$ and $y$) and a diagonal matrix ($d$)
Cosine | two vectors ($x$ and $y$)
Overlap | two vectors ($x$ and $y$)
Dice | two vectors ($x$ and $y$)
Jaccard | two vectors ($x$ and $y$)
Frobenius | a matrix 
Mahalanobis | one vector ($x$), one mean vector ($m$), and one covariance matrix 

Therefore, I will write three Python functions, which calculate the norms as given below:

1. `vector_norms()` - Calculates Euclidean, Minkowski, Diagonal, Cosine, Overlap, Dice, and Jaccard. 

2. `frobenius_norm()` - Calculates Frobenius

3. `mahalanobis_dist()` - Calculates Mahalanobis

### `vector_norms()` function

In [None]:
%%file vector_norms.py

# Import only the modules that the function actually uses
import math
from math import sqrt

def vector_norms(x, y, r, d = 0, identity = "EN"):

  """ Returns dissimilarity between two numeric vectors x and y 

  Args:
  x: numeric tuple. First input vector 
  y: numeric tuple. Second input vector 
  r: int. Degree needed for calculating Minkowski norm 
  identity: str. Identity to know which norm to calculate for x and y, as given below
  "EN": Euclidean, "LMN": Minkowski, "DN": Diagonal, "CS": Cosine, "OS": Overlap, 
  "DS": Dice, "JS": Jaccard
  d: array/matrix. A weight matrix needed to calculate the diagonal norm. Default value zero. 

  Returns: 
  dissimilarity: float or int. Dissimilarity measure between x and y (along with r)
  similarity: float or int. Similarity measure between x and y (along with r)

  It also returns a message, when the inputs are not entered as mentioned in Args.  

  Caution: 
  Some of the measures (like CS) use the product of the magnitudes of x and y in
  their denominator. So, we should avoid passing a zero vector while calculating 
  these measures. 
  """

  ########## Check whether the inputs are entered in desired format ###########

  ######## Check whether the identity is one among those defined above ########
  if identity not in ["EN", "LMN", "CS", "OS", "DN", "DS", "JS"]:
    print("You have entered the wrong identity!")
    return
  
  ################ Check whether the input vectors are tuples #################
  if not isinstance(x, tuple) or not isinstance(y, tuple):
    print("This function accepts only tuples as x and y!")
    return

  #################### Check whether the tuple x is numeric ###################
  for valx in x:
    if (isinstance(valx, float) == False) and (isinstance(valx, int) == False):
      print("This function accept only numeric tuples as x and y!")
      return
  
  #################### Check whether the tuple y is numeric ###################
  for valy in y:
    if (isinstance(valy, float) == False) and (isinstance(valy, int) == False):
      print("This function accept only numeric tuples as x and y!")
      return

  ############ Check whether the input degree r is an integer #################
  if (isinstance(r, int) == False): 
    print("This functions accepts only integers as r!")
    return


  ############### Calculate norms as per the value of identity ################
  
  ################## Calculate Euclidean norm for x and y #####################
  if (identity == "EN"):
    dissimilarity = math.sqrt(sum(pow(a - b, 2) for a, b in zip(x, y)))
  
  ################ Calculate Minkowski norm for x, y, and r ###################
  elif (identity == "LMN"):
    def nth_root(value, n_root):
      root_value = 1/float(n_root)
      return  value ** root_value
    dissimilarity = nth_root(sum(pow(abs(a - b), r) for a, b in zip(x, y)), r)

  ################ Calculate Diagonal norm for x, y, and d ###################
  elif (identity == "DN"):
    dissimilarity = round(math.sqrt(sum(w*(pow(a - b, 2)) for a, b, w in zip(x, y, d))),3)

  ################ Calculate Cosine dissimilarity for x and y #################
  elif (identity == "CS"):
    def square_rooted(m):
      return round(sqrt(sum([a * a for a in m])), 3)
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = square_rooted(x) * square_rooted(y)
    dissimilarity = round(numerator/float(denominator), 3)

  ############### Calculate Overlap dissimilarity for x and y #################
  elif (identity == "OS"):
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = min(sum([a * a for a in x]), sum([b * b for b in y]))
    dissimilarity = round(numerator/denominator, 2)

  ############### Calculate Dice dissimilarity for x and y ####################
  elif (identity == "DS"):
    numerator = 2 * sum(a * b for a, b in zip(x, y))
    denominator = sum([a * a for a in x]) + sum([b * b for b in y])
    dissimilarity = round(numerator/denominator, 2)

  ################ Calculate Jaccard dissimilarity for x and y ################
  elif (identity == "JS"):
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = sum([a * a for a in x]) + sum([b * b for b in y]) - sum(a * b for a, b in zip(x, y))
    dissimilarity = round(numerator/denominator, 3)

  ########################## Calculate similarity #############################
  similarity = 1 / (1 + dissimilarity) 

  ################### Return dissimilarity, similarity ########################
  return dissimilarity, similarity

Overwriting vector_norms.py


### Testing `vector_norms()` for Error Messages 

I have applied the following constraints on the input arguments: 

* The function `vector_norms` accepts two numeric tuples ($x$ and $y$). If $x$ or $y$ is not a numeric tuple, the function prints a message and returns `None`. 

* The function `vector_norms`  accepts integers as the value of $r$.

* The function `vector_norms` can only calculate the following norms if passed the suitable identity: 

Identity | Norm 
--- | ---
"EN"| Euclidean
"LMN" | Minkowski
"DN" | Diagonal 
"CS" | Cosine
"OS" | Overlap 
"DS" | Dice 
"JS" |Jaccard

If we pass an identity other than those defined above, the function prints the message `You have entered the wrong identity!` and returns `None`.

Now, we will call the function with a few right and a few wrong inputs, as given below. 

In [None]:
from vector_norms import vector_norms

# Pass a numeric tuple 
x = (1, 2, 3)
y = (3, 4, 5)

values = vector_norms(x, y, r = 2, identity="LMN")

if (values != None):
  print("Dissimilarity of {} and {} is {}".format(x, y, round(values[0], 2)))
  print("Similarity of {} and {} is {}".format(x, y, round(values[1], 2)))
else:
  print("The function didn't return any value. Please check your input!")

Dissimilarity of (1, 2, 3) and (3, 4, 5) is 3.46
Similarity of (1, 2, 3) and (3, 4, 5) is 0.22
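
The two printed values can be reproduced by hand: the difference between the vectors is $(2, 2, 2)$, so the Minkowski norm with $r = 2$ is $\sqrt{12}$, and the similarity follows from the mapping $1 / (1 + d)$ used in `vector_norms()`. A minimal sketch of that mapping (the helper name `to_similarity` is hypothetical):

```python
import math

def to_similarity(dissimilarity):
    """Map a non-negative distance d to a score in (0, 1] via 1 / (1 + d)."""
    return 1 / (1 + dissimilarity)

x = (1, 2, 3)
y = (3, 4, 5)
d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))  # sqrt(12)

print(round(d, 2), round(to_similarity(d), 2))  # 3.46 0.22
```

Under this mapping, identical points ($d = 0$) get a similarity of 1, and the similarity decreases towards 0 as the distance grows.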


In [None]:
from vector_norms import vector_norms

# Pass a non-numeric tuple 
x = (1, 'a', 3)
y = (3, 4, 5)

values = vector_norms(x, y, r = 2, identity="LMN")

if (values != None):
  print("Dissimilarity of {} and {} is {}".format(x, y, round(values[0], 2)))
  print("Similarity of {} and {} is {}".format(x, y, round(values[1], 2)))
else:
  print("The function didn't return any value. Please check your input!")

This function accept only numeric tuples as x and y!
The function didn't return any value. Please check your input!


In [None]:
from vector_norms import vector_norms

# Pass a non-integer value of r 
x = (1, 2, 3)
y = (3, 4, 5)

values = vector_norms(x, y, r = 3.5, identity="LMN")

if (values != None):
  print("Dissimilarity of {} and {} is {}".format(x, y, round(values[0], 2)))
  print("Similarity of {} and {} is {}".format(x, y, round(values[1], 2)))
else:
  print("The function didn't return any value. Please check your input!")

This functions accepts only integers as r!
The function didn't return any value. Please check your input!


In [None]:
from vector_norms import vector_norms

# Pass an identity other than those defined above
x = (1, 3, 3)
y = (3, 6, 5)

values = vector_norms(x, y, r = 2, identity="XY")

if (values != None):
  print("Dissimilarity of {} and {} is {}".format(x, y, round(values[0], 2)))
  print("Similarity of {} and {} is {}".format(x, y, round(values[1], 2)))
else:
  print("The function didn't return any value. Please check your input!")

You have entered the wrong identity!
The function didn't return any value. Please check your input!
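
The checks demonstrated above print a message and return `None`, so the caller has to test the return value. One possible alternative, sketched here under my own assumptions, is to raise exceptions instead. The helper `validate_vectors` is hypothetical and not part of `vector_norms.py`; unlike the original checks, it also rejects booleans and tuples of unequal length (which `zip()` would otherwise truncate silently).

```python
from numbers import Real

def validate_vectors(x, y):
    """Raise an exception if x and y are not equal-length numeric tuples."""
    for name, vec in (("x", x), ("y", y)):
        if not isinstance(vec, tuple):
            raise TypeError(f"{name} must be a tuple, got {type(vec).__name__}")
        for value in vec:
            # bool is a subclass of int, so it is rejected explicitly
            if isinstance(value, bool) or not isinstance(value, Real):
                raise TypeError(f"{name} must contain only numbers, got {value!r}")
    if len(x) != len(y):
        # zip() would silently drop the extra elements of the longer tuple
        raise ValueError("x and y must have the same length")

validate_vectors((1, 2, 3), (3, 4, 5))  # valid input: returns without error

try:
    validate_vectors((1, "a", 3), (3, 4, 5))
except TypeError as err:
    print(err)
```

With exceptions, a test can assert on the exception type rather than comparing the return value with `None`.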


### Testing `vector_norms()` via `Pytest`

Now, I will write test cases for the function `vector_norms()`. As I have implemented all the norms from scratch, I will use existing Python packages (as much as possible) to check the correctness of the norms. 

In Python, `scipy.spatial.distance` provides reference implementations for some of these norms, as given below:

* `scipy.spatial.distance.euclidean(u, v)` - Computes the Euclidean distance between two 1-D arrays. 

* `scipy.spatial.distance.minkowski(u, v, p = 2, w = None)` - Computes the Minkowski distance between two 1-D arrays.

* `scipy.spatial.distance.cosine(u, v, w = None)` - Computes the Cosine distance between 1-D arrays. It computes the distance and not the similarity, so we must subtract its value from 1 to get the similarity.

For the remaining norms, I have tested the examples given on http://www.cleartheconcepts.com/dm-similarity-dissimilarity-measure/ 
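
Comparing floating-point results with `==` only works when both computations produce bit-identical values, so a tolerance-based comparison is often more robust. The sketch below uses only the standard library (`math.isclose`) and a hypothetical pure-Python helper `minkowski` to check that $r = 2$ reproduces the Euclidean distance and $r = 1$ the Manhattan distance. In a pytest suite, `pytest.approx` serves a similar purpose.

```python
import math

def minkowski(x, y, r):
    """Minkowski distance of order r between equal-length sequences."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

x = (1, 2, 3, 1)
y = (2, 1, 2, 3)

euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
manhattan = sum(abs(a - b) for a, b in zip(x, y))

# Compare with a relative tolerance instead of exact equality
assert math.isclose(minkowski(x, y, 2), euclidean, rel_tol=1e-9)
assert math.isclose(minkowski(x, y, 1), manhattan, rel_tol=1e-9)
```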



In [None]:
%%file test_vector_norms.py

from vector_norms import vector_norms
from scipy.spatial import distance

############### Test Euclidean norm on two different inputs ###############
def test_vector_norms_en_1():
  x = (1, 2, 3, 1)
  y = (2, 1, 2, 3)
  expected = (distance.euclidean(x, y), 1 / (1 + distance.euclidean(x, y)))
  actual = vector_norms(x, y, r = 0, identity = "EN")
  assert actual == expected 

def test_vector_norms_en_2():
  x = (0, 0, 0, 0)
  y = (0, 0, 0, 0)
  expected = (distance.euclidean(x, y), 1 / (1 + distance.euclidean(x, y)))
  actual = vector_norms(x, y, r = 0, identity = "EN")
  assert actual == expected 

############### Test Minkowski norm on three different inputs ###############
def test_vector_norms_lmn_1():
  x = (1, 1, 0)
  y = (0, 1, 0)
  r = 1
  expected = (distance.minkowski(x, y, r), 1 / (1 + distance.minkowski(x, y, r)))
  actual = vector_norms(x, y, r, identity = "LMN")
  assert actual == expected 

def test_vector_norms_lmn_2():
  x = (1, 2, 3, 1)
  y = (2, 1, 2, 3)
  r = 2
  expected = (distance.minkowski(x, y, r), 1 / (1 + distance.minkowski(x, y, r))) 
  actual = vector_norms(x, y, r, identity = "LMN")
  assert actual == expected 

def test_vector_norms_lmn_3():
  x = (1, 0, 0)
  y = (0, 1, 0)
  r = 3
  expected = (distance.minkowski(x, y, r), 1 / (1 + distance.minkowski(x, y, r)))
  actual = vector_norms(x, y, r, identity = "LMN")
  assert actual == expected 

################## Test Diagonal norm on one input ##################
def test_vector_norms_dn():
  x = (1, 2, 3, 1)
  y = (3, 1, 2, 1)
  r = 0
  expected = (2.646, 1/(1 + 2.646))
  actual = vector_norms(x, y, r, identity = "DN", d = (1, 2, 1, 2))
  assert actual == expected 

############### Test Cosine dissimilarity on two different inputs #############
def test_vector_norms_cs_1():
  x = (1, 2, 3, 1)
  y = (3, 1, 2, 1)
  r = 0
  expected = (1 - distance.cosine(x, y), 1 / (1 + (1 - distance.cosine(x, y))))
  actual = vector_norms(x, y, r, identity = "CS")
  assert actual == expected 

def test_vector_norms_cs_2():
  x = (1, 1, 1, 1)
  y = (1, 1, 1, 1)
  r = 0
  expected = (1 - distance.cosine(x, y), 1 / (1 + (1 - distance.cosine(x, y))))
  actual = vector_norms(x, y, r, identity = "CS")
  assert actual == expected 

################# Test Overlap dissimilarity on one input ##################
def test_vector_norms_os_1():
  x = (1, 2, 3, 1)
  y = (3, 1, 2, 1)
  r = 0
  expected = (0.8, 1 / (1 + 0.8))
  actual = vector_norms(x, y, r, identity = "OS")
  assert actual == expected 

##################### Test Dice dissimilarity on one input ####################
def test_vector_norms_ds_1():
  x = (1, 2, 3, 1)
  y = (3, 1, 2, 1)
  r = 0
  expected = (0.8, 1 / (1 + 0.8))
  actual = vector_norms(x, y, r, identity = "DS")
  assert actual == expected 

################## Test Jaccard dissimilarity on one input ####################
def test_vector_norms_js_1():
  x = (1, 2, 3, 1)
  y = (3, 1, 2, 1)
  r = 0
  expected = (0.667, 1 / (1 + 0.667))
  actual = vector_norms(x, y, r, identity = "JS")
  assert actual == expected 

Overwriting test_vector_norms.py


In [None]:
# Run the pytest module 
!python -m pytest test_vector_norms.py

platform linux -- Python 3.6.9, pytest-3.6.4, py-1.9.0, pluggy-0.7.1
rootdir: /content, inifile:
plugins: typeguard-2.7.1
collected 11 items

test_vector_norms.py ...........                                         [100%]



### `frobenius_norm()` function


In [None]:
%%file frobenius_norm.py

# Import the necessary module 
import numpy as np

def frobenius_norm(x):

  """ Returns Frobenius norm of a matrix 

  Args:
  x: matrix. The matrix whose Frobenius norm is to be calculated. 

  Returns: 
  dissimilarity: float or int. The Frobenius norm of x 
  similarity: float or int. 1 / (1 + dissimilarity)

  Caution: 
  This function only accepts a matrix (2-D array) as input. If you have a 1-D
  array, you should reshape it before passing it to the function.   
  """

  #### Calculate the Frobenius norm using the built-in function of numpy ######
  # https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html 
  dissimilarity = round(np.linalg.norm(x, 'fro'), 3)

  ########################## Calculate similarity #############################
  similarity = 1 / (1 + dissimilarity) 

  ################### Return dissimilarity, similarity ########################
  return dissimilarity, similarity

Overwriting frobenius_norm.py


### Testing `frobenius_norm()` via `Pytest`

I have used the examples given on https://numpy.org/doc/stable/reference/generated/numpy.linalg.norm.html for testing the `frobenius_norm()` function. 
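
The expected values for these examples can also be derived directly from the definition of the Frobenius norm, without NumPy. The following sketch uses a hypothetical helper, `frobenius_from_definition`, that operates on a plain list of lists:

```python
import math

def frobenius_from_definition(matrix):
    """Square root of the sum of squared absolute entries of a 2-D list."""
    return math.sqrt(sum(abs(entry) ** 2 for row in matrix for entry in row))

identity = [[1, 0, 0],
            [0, 1, 0],
            [0, 0, 1]]
shifted = [[-4, -3, -2],
           [-1,  0,  1],
           [ 2,  3,  4]]

print(round(frobenius_from_definition(identity), 3))  # 1.732
print(round(frobenius_from_definition(shifted), 3))   # 7.746
```

The squared entries sum to 3 for the identity matrix and to 60 for the second matrix, giving $\sqrt{3} \approx 1.732$ and $\sqrt{60} \approx 7.746$.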

In [None]:
%%file test_frobenius_norm.py

from frobenius_norm import frobenius_norm
import numpy as np

############### Test Frobenius_norm on three different inputs ###############
def test_frobenius_norm_1():
  x = [[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]]
  expected = (1.732, 1 / (1 + 1.732))
  actual = frobenius_norm(x)
  assert actual == expected 

def test_frobenius_norm_2():
  x = [[-4, -3, -2],
       [-1,  0,  1],
       [ 2,  3,  4]]
  expected = (7.746, 1 / (1 + 7.746))
  actual = frobenius_norm(x)
  assert actual == expected 

def test_frobenius_norm_3():
  a = np.arange(9) - 4
  b = a.reshape((3, 3))
  expected = (7.746, 1 / (1 + 7.746))
  actual = frobenius_norm(b)
  assert actual == expected 

Overwriting test_frobenius_norm.py


In [None]:
!python -m pytest test_frobenius_norm.py

platform linux -- Python 3.6.9, pytest-3.6.4, py-1.9.0, pluggy-0.7.1
rootdir: /content, inifile:
plugins: typeguard-2.7.1
collected 3 items

test_frobenius_norm.py ...                                               [100%]



### `mahalanobis_dist()` function 

In [None]:
%%file mahalanobis_dist.py

# Import all the necessary modules 
import numpy as np
import pandas as pd


def mahalanobis_dist(x, data): 
  """ Returns Mahalanobis distance between a data (vector) and a distribution

  Args:
  x: array. The input vector whose distance is to be calculated. 
  data: matrix. The input data with which the distance of x is to be calculated. 

  Returns: 
  dissimilarity: float or int. The Mahalanobis distance of x and data 
  similarity: float or int. 1 / (1 + dissimilarity)
  """

  m =  np.mean(data, axis = 0) # calculate mean of the independent variables of data

  x_minus_m = x - m  # subtract the mean m from x 

  data = np.transpose(data)
  cov_M = np.cov(data, bias = False) # Evaluate the covariance matrix
  inv_Cov_M = np.linalg.inv(cov_M) # Find the inverse covariance matrix

  temp1 = np.dot(x_minus_m, inv_Cov_M) # Multiply the (x - m) and inverse of covariance matrix  
  temp2 = np.dot(temp1, np.transpose(x_minus_m)) # Multiply (x-m)^T and the previous product 
  
  dissimilarity = np.sqrt(np.reshape(temp2, -1)) # Evaluate the dissimilarity from the previous product 

  ########################## Calculate similarity #############################
  similarity = 1 / (1 + dissimilarity) 

  ################### Return dissimilarity, similarity ########################
  return dissimilarity, similarity

Overwriting mahalanobis_dist.py
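
To illustrate why the Mahalanobis distance can differ from the Euclidean distance when features are correlated, here is a small pure-Python sketch restricted to two features. The helper `mahalanobis_2d` and the toy data are assumptions made for this example; it uses the sample covariance (dividing by $n - 1$, as `np.cov` does with `bias = False`) and the closed-form inverse of a $2 \times 2$ matrix.

```python
import math

def mahalanobis_2d(point, data):
    """Mahalanobis distance of a 2-D point from the mean of 2-D data rows."""
    n = len(data)
    mean = [sum(row[j] for row in data) / n for j in range(2)]
    dev = [[row[j] - mean[j] for j in range(2)] for row in data]

    # Sample covariance entries (divide by n - 1)
    s00 = sum(d[0] * d[0] for d in dev) / (n - 1)
    s11 = sum(d[1] * d[1] for d in dev) / (n - 1)
    s01 = sum(d[0] * d[1] for d in dev) / (n - 1)

    # Closed-form inverse of the symmetric 2x2 covariance matrix
    det = s00 * s11 - s01 * s01
    i00, i11, i01 = s11 / det, s00 / det, -s01 / det

    u, v = point[0] - mean[0], point[1] - mean[1]
    return math.sqrt(i00 * u * u + 2 * i01 * u * v + i11 * v * v)

data = [(0, 0), (1, 1), (2, 2), (1, 0), (1, 2)]  # positively correlated toy data
along = (2, 2)    # lies along the direction of correlation
against = (2, 0)  # lies against the direction of correlation

print(round(mahalanobis_2d(along, data), 3))    # 1.414
print(round(mahalanobis_2d(against, data), 3))  # 3.162
```

Both points are at the same Euclidean distance ($\sqrt{2}$) from the mean $(1, 1)$, yet the point that lies against the correlation of the data is more than twice as far away in the Mahalanobis sense.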


### Testing `mahalanobis_dist()` via `Pytest`

In [None]:
%%file test_mahalanobis_dist.py

from mahalanobis_dist import mahalanobis_dist
import numpy as np

############### Test mahalanobis_dist on two different inputs ###############
def test_mahalanobis_dist_1():
  x = np.array([[154,900, 80]]) 
  data = np.array([[1,	100,	10],
                 [2,	300,	15],
                 [4,	200,	20],
                 [2,	600,	10],
                 [5,	100,	30]])
  expected = (321.166, round(1 / (1 + 321.166), 3))
  values_from_fun = mahalanobis_dist(x, data)
  actual = (round(*values_from_fun[0], 3), round(*values_from_fun[1], 3))
  assert actual == expected 

############### Test mahalanobis_dist on two different inputs ###############
def test_mahalanobis_dist_2():
  x = np.array([[4, 500, 40]]) 
  data = np.array([[1,	100,	10],
                 [2,	300,	15],
                 [4,	200,	20],
                 [2,	600,	10],
                 [5,	100,	30]])
  expected = (10.33, round(1 / (1 + 10.33), 3))
  values_from_fun = mahalanobis_dist(x, data)
  actual = (round(*values_from_fun[0], 3), round(*values_from_fun[1], 3))
  assert actual == expected 


Overwriting test_mahalanobis_dist.py


In [None]:
!python -m pytest test_mahalanobis_dist.py

platform linux -- Python 3.6.9, pytest-3.6.4, py-1.9.0, pluggy-0.7.1
rootdir: /content, inifile:
plugins: typeguard-2.7.1
collected 2 items

test_mahalanobis_dist.py ..                                              [100%]

