In [1]:
import numpy as np
import sklearn

# A. What defines similarity?
To find similar data observations, we first need a way to measure similarity. One of the most common similarity measures is the cosine similarity metric.

A data observation with numeric features is essentially a vector of real numbers. Cosine similarity is a standard similarity metric for real-valued vectors, so it is a natural choice for comparing data observations. The cosine similarity of two data observations is a number between -1 and 1. It is the cosine of the angle between the two vectors, so it measures the proportional similarity of their feature values (i.e. the ratios between feature columns) rather than their magnitudes.

Cosine similarity values closer to 1 represent greater similarity between the observations, while values closer to -1 represent more divergence. A value of 0 means that the two observations are orthogonal: they are neither similar nor dissimilar.
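As a quick illustration of these three cases, the sketch below uses plain NumPy with made-up vectors. The helper cossim is hypothetical (it is not part of scikit-learn); it divides the dot product by the product of the L2 norms, which is the standard definition given in the next section, and it assumes neither vector is all zeros.

```python
import numpy as np

def cossim(u, v):
    # Hypothetical helper: dot product divided by the product of the L2 norms.
    # Assumes neither vector is all zeros.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0])
same_direction = cossim(u, np.array([3.0, 6.0]))   # same feature ratio as u
orthogonal = cossim(u, np.array([-2.0, 1.0]))      # perpendicular to u
opposite = cossim(u, np.array([-1.0, -2.0]))       # u with every sign flipped

# Rounded to hide floating-point noise; the values are close to 1, 0 and -1.
print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
```

Note that [3.0, 6.0] is three times longer than u but still scores close to 1, since only the ratio between the feature values matters.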

# B. Calculating cosine similarity
The cosine similarity of two vectors, u and v, is calculated as the dot product of the L2-normalized vectors. The exact formula for cosine similarity is:

cossim(u, v) = (u / ||u||<sub>2</sub>) · (v / ||v||<sub>2</sub>)

where ||u||<sub>2</sub> represents the L2 norm of u and ||v||<sub>2</sub> represents the L2 norm of v.
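The formula can be followed step by step in NumPy. This is a minimal sketch rather than scikit-learn's implementation. The vectors u and v are illustrative values (they match the first and last rows of the dataset used later in this section), and both are assumed to be nonzero.

```python
import numpy as np

u = np.array([1.1, 0.3])
v = np.array([0.0, -3.2])

# Step 1: L2-normalize each vector (divide it by its Euclidean length).
u_unit = u / np.linalg.norm(u)
v_unit = v / np.linalg.norm(v)

# Step 2: take the dot product of the two unit vectors.
cos_uv = float(np.dot(u_unit, v_unit))
print(round(cos_uv, 6))
```

The result is roughly -0.2631, which agrees with the entry for this pair in the scikit-learn output shown further down.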

In scikit-learn, cosine similarity is implemented via the cosine_similarity function (which is part of the metrics.pairwise module). It calculates the cosine similarities for pairs of data observations in a single dataset, or pairs of data observations between two datasets.

The code below computes cosine similarities between pairs of observations in a dataset with two features.

In [2]:
from sklearn.metrics.pairwise import cosine_similarity
data = np.array([
  [ 1.1,  0.3],
  [ 2.1,  0.6],
  [-1.1, -0.4],
  [ 0. , -3.2]])

cos_sims = cosine_similarity(data)
print('{}\n'.format(cos_sims))

[[ 1.          0.99992743 -0.99659724 -0.26311741]
 [ 0.99992743  1.         -0.99751792 -0.27472113]
 [-0.99659724 -0.99751792  1.          0.34174306]
 [-0.26311741 -0.27472113  0.34174306  1.        ]]



When we pass only one dataset to cosine_similarity, the function computes cosine similarities between pairs of observations within that dataset. In the code above, we passed in data (which contains 4 data observations), so the output of cosine_similarity is a 4x4 array of cosine similarity values.

The value at index (i, j) of cos_sims is the cosine similarity between data observations i and j in data. Since cosine similarity is symmetric, the cos_sims array contains the same values at index (i, j) and (j, i).

Note that the cosine similarity between a data observation and itself is 1, unless the data observation contains only zeros as feature values. The cosine similarity of an all-zero vector is mathematically undefined (its L2 norm is 0), and in that case cosine_similarity returns 0.
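These properties can be checked directly. The sketch below appends an all-zero row to the dataset (an illustrative addition, not part of the original data) and relies on scikit-learn leaving zero rows unchanged when it normalizes them. The names data_with_zero, sims, is_symmetric, and diagonal are introduced here for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# The dataset from above plus an illustrative all-zero observation.
data_with_zero = np.array([
  [ 1.1,  0.3],
  [ 2.1,  0.6],
  [-1.1, -0.4],
  [ 0. , -3.2],
  [ 0. ,  0. ]])

sims = cosine_similarity(data_with_zero)

# Entry (i, j) should equal entry (j, i).
is_symmetric = bool(np.allclose(sims, sims.T))

# The diagonal holds each observation's similarity with itself.
diagonal = np.diag(sims)

print(is_symmetric)
print(np.round(diagonal, 6))
```

The diagonal is 1 for the four original observations and 0 for the all-zero row.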

If we pass two datasets (with the same number of columns) to cosine_similarity, the function computes the cosine similarities for pairs of data observations between the two datasets.

In [3]:
from sklearn.metrics.pairwise import cosine_similarity
data = np.array([
  [ 1.1,  0.3],
  [ 2.1,  0.6],
  [-1.1, -0.4],
  [ 0. , -3.2]])
data2 = np.array([
  [ 1.7,  0.4],
  [ 4.2, 1.25],
  [-8.1,  1.2]])
cos_sims = cosine_similarity(data, data2)
print('{}\n'.format(repr(cos_sims)))

array([[ 0.9993819 ,  0.99973508, -0.91578821],
       [ 0.99888586,  0.99993982, -0.9108828 ],
       [-0.99308366, -0.9982304 ,  0.87956492],
       [-0.22903933, -0.28525359, -0.14654866]])



In the code above, the value at index (i, j) of cos_sims is the cosine similarity between data observation i in data and data observation j in data2. Note that cos_sims is a 4x3 array, since data contains 4 data observations and data2 contains 3.
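One possible use of such a cross-dataset array is to look up, for each observation in data, its most similar observation in data2. The following is a minimal sketch of that idea. The name best_match is introduced here for illustration, and taking the row-wise maximum is just one simple way to pick a match.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

data = np.array([
  [ 1.1,  0.3],
  [ 2.1,  0.6],
  [-1.1, -0.4],
  [ 0. , -3.2]])
data2 = np.array([
  [ 1.7,  0.4],
  [ 4.2, 1.25],
  [-8.1,  1.2]])

cos_sims = cosine_similarity(data, data2)  # one row per observation in data

# For each row of data, the index of the most similar row of data2.
best_match = np.argmax(cos_sims, axis=1)
print(cos_sims.shape)
print(best_match)
```

Each entry of best_match is a row index into data2, so best_match has one entry per observation in data.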