# Network Mining

## Lab 2: Similarity

#### Notebook Author: Mario Prado

A similarity measure quantifies how alike two data objects are.

- It is usually expressed as a distance whose dimensions represent features of the objects: the smaller the distance, the more similar the objects.
- Similarity is subjective and depends heavily on the domain and application.
- For example, two fruits may be considered similar because of their color, size, or taste.
- Similarity is typically measured in the range 0 to 1, i.e. [0, 1].
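The points above relate distance and similarity, so a small sketch of one possible conversion may help. It maps a non-negative distance `d` to a score in (0, 1] using `1 / (1 + d)`. This is only one of several common conversions, and the name `distance_to_similarity` is illustrative, not part of the lab.

```python
def distance_to_similarity(d):
    """Map a non-negative distance to a similarity score in (0, 1]."""
    if d < 0:
        raise ValueError("distance must be non-negative")
    return 1.0 / (1.0 + d)

print(distance_to_similarity(0.0))  # identical objects -> 1.0
print(distance_to_similarity(3.0))  # -> 0.25
```

With this mapping, larger distances give scores closer to 0, which matches the [0, 1] range mentioned above.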

### Generating two random vectors

In [19]:
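import random
import numpy as np
# Note: random.seed seeds only Python's built-in `random` module; it does not
# affect np.random, so the vectors below differ from run to run.
random.seed(1)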

vector_A = np.random.rand(5)
vector_A

array([0.38667925, 0.44430709, 0.1957925 , 0.34829179, 0.94380021])

In [20]:
vector_B = np.random.rand(5)
vector_B

array([0.74162062, 0.91206826, 0.48436526, 0.29570169, 0.38927116])
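If reproducible vectors are wanted, NumPy's own generator has to be seeded. The following is a minimal sketch, assuming NumPy 1.17 or later (for `np.random.default_rng`). The names `rng`, `vec_a` and `vec_b` are illustrative and are not used elsewhere in this notebook.

```python
import numpy as np

rng = np.random.default_rng(seed=1)   # NumPy's own seeded generator
vec_a = rng.random(5)                 # 5 uniform samples in [0, 1)
vec_b = rng.random(5)

# Re-creating the generator with the same seed reproduces the same draws.
rng_again = np.random.default_rng(seed=1)
assert np.array_equal(vec_a, rng_again.random(5))
```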

### Manhattan Distance

In [8]:
from math import sqrt

def manhattan_distance(x,y):
    return sum(abs(a-b) for a,b in zip(x,y))

manhattan_distance(vector_A, vector_B)

1.4731377852223462
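The same quantity can be computed without a Python-level loop. This is a sketch of a vectorized variant, checked on a small example that is easy to verify by hand; the name `manhattan_np` is hypothetical.

```python
import numpy as np

def manhattan_np(x, y):
    """Sum of absolute coordinate differences (L1 distance)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.abs(x - y).sum())

# |1-4| + |2-0| + |3-3| = 3 + 2 + 0
print(manhattan_np([1, 2, 3], [4, 0, 3]))  # 5.0
```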

### Euclidean Distance

In [9]:
def euclidean_distance(x,y):
    return sqrt(sum(pow(a-b,2) for a, b in zip(x, y)))

euclidean_distance(vector_A, vector_B)

0.7981540197600708
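As a cross-check, the Euclidean distance is the L2 norm of the difference vector, which `np.linalg.norm` computes directly. The helper name `euclidean_np` below is an assumption made for illustration.

```python
import numpy as np

def euclidean_np(x, y):
    """Straight-line (L2) distance between two points."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.linalg.norm(x - y))

# Classic 3-4-5 right triangle
print(euclidean_np([0, 0], [3, 4]))  # 5.0
```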

### Minkowski distance

In [14]:
from decimal import Decimal

def nth_root(value, n_root):
    # Compute value ** (1 / n_root) with Decimal arithmetic, rounded to 3 places
    root_value = 1/float(n_root)
    return round(Decimal(value) ** Decimal(root_value), 3)

def minkowski_distance(x,y,p):
    return nth_root(sum(pow(abs(a-b),p) for a,b in zip(x,y)), p)
                    
minkowski_distance(vector_A, vector_B, 4)

Decimal('0.646')
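The Minkowski distance generalizes the two previous measures: order `p = 1` gives the Manhattan distance and `p = 2` gives the Euclidean distance. Here is a sketch using plain floats instead of `Decimal`, which is an alternative chosen for illustration rather than the approach used in the cell above; `minkowski_np` is a hypothetical name.

```python
import numpy as np

def minkowski_np(x, y, p):
    """Minkowski distance of order p >= 1 using plain floats."""
    if p < 1:
        raise ValueError("p must be >= 1")
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return float((diff ** p).sum() ** (1.0 / p))

x, y = [0, 0], [3, 4]
print(minkowski_np(x, y, 1))  # 7.0, same as Manhattan
print(minkowski_np(x, y, 2))  # 5.0, same as Euclidean
```

As `p` grows, the result approaches the largest absolute coordinate difference (here 4), which is the Chebyshev distance.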

### Cosine Similarity

In [15]:
def square_rooted(x):
    # Euclidean norm of x; left unrounded so the final ratio stays accurate
    return sqrt(sum(a*a for a in x))

def cosine_similarity(x,y):
    numerator = sum(a*b for a,b in zip(x,y))
    denominator = square_rooted(x)*square_rooted(y)
    return round(numerator/float(denominator),3)

cosine_similarity(vector_A, vector_B)

0.866
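Cosine similarity depends only on the angle between the vectors, not on their lengths. Below is a small NumPy sketch that makes this visible on two hand-checkable cases; the name `cosine_sim_np` and the zero-vector check are assumptions added for illustration.

```python
import numpy as np

def cosine_sim_np(x, y):
    """Dot product divided by the product of the Euclidean norms."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    norm_product = np.linalg.norm(x) * np.linalg.norm(y)
    if norm_product == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return float(np.dot(x, y) / norm_product)

print(cosine_sim_np([1, 0], [0, 1]))  # orthogonal vectors -> 0.0
print(cosine_sim_np([1, 2], [2, 4]))  # same direction, close to 1.0
```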

### Jaccard Similarity

In [21]:
def jaccard_sim(im1, im2):
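    # Cast to boolean: any non-zero entry becomes True
    # (np.bool was removed in NumPy 1.24; use the built-in bool)
    im1 = np.asarray(im1).astype(bool)
    im2 = np.asarray(im2).astype(bool)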
    intersection = np.logical_and(im1, im2)
    union = np.logical_or(im1, im2)
    return intersection.sum() / float(union.sum())

jaccard_sim(vector_A, vector_B)

1.0
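The result of 1.0 follows from the boolean cast: every entry of both random vectors is non-zero, so both become all-`True` and the intersection equals the union. Jaccard similarity is more informative on sets or binary data. Here is a sketch on plain Python sets, where `jaccard_sets` is a hypothetical helper and the empty-set convention is an assumption.

```python
import numpy as np

def jaccard_sets(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # convention assumed here: two empty sets are identical
    return len(a & b) / len(union)

# Shared: pear, kiwi (2 items); union has 4 items
print(jaccard_sets({"apple", "pear", "kiwi"}, {"pear", "kiwi", "plum"}))  # 0.5

# Strictly positive floats all cast to True, which is why two such
# vectors look identical after the boolean conversion.
print(np.asarray([0.2, 0.7]).astype(bool).all())  # True
```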

### Weighted Jaccard Similarity

In [22]:
def weighted_jaccard(X,Y):
    numerator = sum(min(a,b) for a,b in zip(X,Y))
    denominator = sum(max(a,b) for a,b in zip(X,Y))
    return numerator/denominator

weighted_jaccard(vector_A, vector_B)

0.49903170932215446
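The weighted variant replaces set intersection and union with element-wise minima and maxima, so it remains meaningful for non-negative real-valued vectors. This is a vectorized sketch; the name `weighted_jaccard_np`, the non-negativity check and the all-zero convention are assumptions added for illustration.

```python
import numpy as np

def weighted_jaccard_np(x, y):
    """Sum of element-wise minima over sum of element-wise maxima."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if (x < 0).any() or (y < 0).any():
        raise ValueError("weights must be non-negative")
    denominator = np.maximum(x, y).sum()
    if denominator == 0:
        return 1.0  # assumed convention for two all-zero vectors
    return float(np.minimum(x, y).sum() / denominator)

# minima: 1 + 0 + 2 = 3, maxima: 2 + 1 + 3 = 6
print(weighted_jaccard_np([1, 0, 3], [2, 1, 2]))  # 0.5
```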

## Using SciPy

In [24]:
# Prepare two real-valued vectors of 100 dimensions (A, B)
# and two binary vectors of 1,000,000 dimensions (AA, BB)
import scipy.spatial.distance as dist
import numpy as np

A = np.random.uniform(0, 10, 100)
B = np.random.uniform(0, 10, 100)
AA = np.random.randint(0, 2, 1000000)
BB = np.random.randint(0, 2, 1000000)

### Manhattan Distance

In [26]:
dist.cityblock(A, B)

283.7649581425182

### Euclidean Distance

In [27]:
dist.euclidean(A, B)

35.879331751119935

### Jaccard Distance

In [28]:
dist.jaccard(A, B)

1.0
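`dist.jaccard` is designed for boolean vectors: among the positions where at least one entry is non-zero, it reports the fraction where the two entries disagree. Two random real-valued vectors disagree almost everywhere, hence the 1.0 above. Binary input such as `AA` and `BB` is a more natural fit. Here is a small sketch on hand-checkable boolean vectors, where the names `u`, `v` and `d` are illustrative.

```python
import numpy as np
import scipy.spatial.distance as dist

u = np.array([1, 1, 0, 0], dtype=bool)
v = np.array([1, 0, 1, 0], dtype=bool)

# Positions where either vector is True: 3; positions where they disagree: 2
d = dist.jaccard(u, v)
print(d)      # Jaccard distance, 2 / 3
print(1 - d)  # Jaccard similarity, 1 / 3
```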

### Chebyshev Distance

In [31]:
dist.chebyshev(A, B)

9.417031268452849
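The Chebyshev distance is the largest absolute difference along any single coordinate. Below is a short sketch comparing a manual NumPy computation with `dist.chebyshev` on a small example; the vectors `p` and `q` are made up for illustration.

```python
import numpy as np
import scipy.spatial.distance as dist

p = np.array([1.0, 5.0, 2.0])
q = np.array([4.0, 3.0, 2.5])

# Largest absolute coordinate difference: max(3.0, 2.0, 0.5)
manual = float(np.max(np.abs(p - q)))
print(manual)                # 3.0
print(dist.chebyshev(p, q))  # 3.0
```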

### Cosine Distance

In [30]:
dist.cosine(A, B)

0.1915380094178647
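Note that `dist.cosine` returns the cosine distance, defined as one minus the cosine similarity, so the similarity has to be recovered by subtraction. This is a minimal sketch on two vectors 45 degrees apart; the variable names are illustrative.

```python
import numpy as np
import scipy.spatial.distance as dist

u = np.array([1.0, 0.0])
v = np.array([1.0, 1.0])

cos_distance = dist.cosine(u, v)
cos_similarity = 1.0 - cos_distance  # cos(45 degrees), about 0.7071

print(cos_distance, cos_similarity)
```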