## Similarity Metrics

*Prepared by:*
**Jude Michael Teves**  
Faculty, Software Technology Department  
College of Computer Studies - De La Salle University

This notebook introduces several similarity metrics that can be used in recommender systems.

## Preliminaries

### Import libraries

In [2]:
import numpy as np
import pandas as pd

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics import jaccard_score
from scipy.stats import pearsonr

## Similarity Metrics

In this section, we will explore the different similarity metrics that we can use.

For the following equations, let:
- $r_x$ : the vector of ratings of user *x*
- $r_y$ : the vector of ratings of user *y*

Let's use the following dummy data for pedagogical purposes. The rows correspond to users and the columns correspond to items. A missing value (`np.nan`) means the user has not rated that item.

In [9]:
vals = [[4, np.nan, np.nan, 5, 1, np.nan, np.nan],
        [5, 5, 4, np.nan, np.nan, np.nan, np.nan],
        [np.nan, np.nan, np.nan, 2, 4, 5, np.nan],
        [np.nan, 3, np.nan, np.nan, np.nan, np.nan, 3]]
vals = pd.DataFrame(vals)

vals

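|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 4.0 | NaN | NaN | 5.0 | 1.0 | NaN | NaN |
| 1 | 5.0 | 5.0 | 4.0 | NaN | NaN | NaN | NaN |
| 2 | NaN | NaN | NaN | 2.0 | 4.0 | 5.0 | NaN |
| 3 | NaN | 3.0 | NaN | NaN | NaN | NaN | 3.0 |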


### Jaccard Similarity

$$\Large S_J(r_x,r_y) = \frac{|r_x \cap r_y|}{|r_x \cup r_y|}$$

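In Jaccard similarity, $r_x$ and $r_y$ are treated as the sets of items that each user has rated, and we compute the size of their intersection over the size of their union. This is akin to treating the values as implicit ratings: only whether an item was rated matters, and the value of the rating (high or low) is ignored.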




In [14]:
vals_bool = ~vals.isnull()
vals_bool

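|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | True | False | False | True | True | False | False |
| 1 | True | True | True | False | False | False | False |
| 2 | False | False | False | True | True | True | False |
| 3 | False | True | False | False | False | False | True |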


#### Sample score

In [90]:
jaccard_score(vals_bool.iloc[1], vals_bool.iloc[0])

0.2

#### Similarity scores in the whole utility matrix

In [91]:
vals_bool.apply(lambda x: jaccard_score(x, vals_bool.iloc[0]), axis=1)

0    1.0
1    0.2
2    0.5
3    0.0
dtype: float64
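
The scores above can also be reproduced directly from the set definition, without scikit-learn. The following is a minimal sketch under that reading of the formula. The helper name `jaccard_from_ratings` and the choice to return 0 when neither user has rated anything are assumptions made for illustration, not part of the original notebook.

```python
import numpy as np

# Same dummy utility matrix as in this notebook (rows: users, columns: items).
ratings = np.array([
    [4, np.nan, np.nan, 5, 1, np.nan, np.nan],
    [5, 5, 4, np.nan, np.nan, np.nan, np.nan],
    [np.nan, np.nan, np.nan, 2, 4, 5, np.nan],
    [np.nan, 3, np.nan, np.nan, np.nan, np.nan, 3],
])

def jaccard_from_ratings(r_x, r_y):
    """Jaccard similarity over the sets of rated items (hypothetical helper)."""
    rated_x = set(np.flatnonzero(~np.isnan(r_x)))
    rated_y = set(np.flatnonzero(~np.isnan(r_y)))
    union = rated_x | rated_y
    if not union:
        return 0.0  # assumed convention when neither user rated anything
    return len(rated_x & rated_y) / len(union)

# Similarity of every user to user 0.
jaccard_to_user0 = [jaccard_from_ratings(row, ratings[0]) for row in ratings]
print(jaccard_to_user0)  # [1.0, 0.2, 0.5, 0.0]
```

For example, users 0 and 1 share one rated item (item 0) out of five items rated by either of them, which gives 1/5 = 0.2.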

### Cosine Similarity

$$\Large S_C(r_x,r_y) = \frac{r_x \cdot r_y}{\| r_x \|  \| r_y \|}$$
$$\Large S_C(r_x,r_y) = \frac{\sum_i r_{xi} r_{yi} }{\sqrt{ \sum_i r_{xi}^2} \sqrt{ \sum_i r_{yi}^2} }$$

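Cosine similarity measures the angle between the two rating vectors and ignores their magnitudes. Because both vectors need a value for every item, the missing ratings are filled with 0 before computing the scores.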


In [26]:
vals_filled = vals.fillna(0)
cosine_similarity(vals_filled, vals_filled.iloc[[0]])

array([[1.        ],
       [0.37986859],
       [0.32203059],
       [0.        ]])
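
To connect the formula to the numbers above, here is a minimal sketch that evaluates it with NumPy for users 0 and 1 only. The helper `cosine_sim` is a hypothetical name, and returning 0 for an all-zero vector is an assumed convention for this sketch.

```python
import numpy as np

def cosine_sim(r_x, r_y):
    """Cosine similarity straight from the formula (hypothetical helper)."""
    denom = np.linalg.norm(r_x) * np.linalg.norm(r_y)
    if denom == 0:
        return 0.0  # assumed convention for all-zero vectors
    return float(np.dot(r_x, r_y) / denom)

# Users 0 and 1 from the dummy data, with missing ratings filled with 0.
user0 = np.array([4, 0, 0, 5, 1, 0, 0], dtype=float)
user1 = np.array([5, 5, 4, 0, 0, 0, 0], dtype=float)

sim_01 = cosine_sim(user0, user1)
# Rescaling a vector changes its magnitude but not its direction.
sim_scaled = cosine_sim(user0, 3 * user1)

print(round(sim_01, 6))                # 0.379869
print(np.isclose(sim_01, sim_scaled))  # True
```

The second print illustrates why only the angle matters: multiplying one user's ratings by a constant leaves the score unchanged.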

#### Mean-centered variation

The vanilla cosine similarity metric has two limitations. First, some users consistently rate very high or very low, and we need to take this bias into account when we predict the ratings. Second, unrated items are treated as *'negatives'*, at least in this 0-5 rating scheme, because nulls are filled with 0s, which connote a negative rating. Our solution is to apply mean-centering to the user vectors before computing the similarity scores, so that 0 corresponds to each user's own average rating.

In [96]:
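# Mean rating per user, computed over rated items only (NaNs are skipped)
vals_mean = vals.mean(axis=1).values.reshape(-1, 1)
vals_centered = vals - vals_mean
# After centering, filling with 0 places unrated items at the user's average
vals_filled = vals_centered.fillna(0)
vals_filled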

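|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 0.666667 | 0.0 | 0.0 | 1.666667 | -2.333333 | 0.0 | 0.0 |
| 1 | 0.333333 | 0.333333 | -0.666667 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | -1.666667 | 0.333333 | 1.333333 | 0.0 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |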


In [97]:
cosine_similarity(vals_filled, vals_filled.iloc[[0]])

array([[ 1.        ],
       [ 0.09245003],
       [-0.55908525],
       [ 0.        ]])
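
The effect of rating bias is easier to see on a smaller case. The sketch below uses three hypothetical users (not part of the dummy data) who have all rated the same three items. Two of them rank the items identically but use different parts of the scale, and the third ranks them in the opposite order. The helper `cosine_sim` is again an illustrative name.

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity from the formula (hypothetical helper)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def center(r):
    """Subtract the user's own mean rating."""
    return r - r.mean()

# Hypothetical users for illustration only.
generous = np.array([5.0, 4.0, 3.0])        # rates everything fairly high
harsh_same = np.array([3.0, 2.0, 1.0])      # same ordering, lower baseline
harsh_opposite = np.array([1.0, 2.0, 3.0])  # reversed ordering

raw_same = cosine_sim(generous, harsh_same)
raw_opposite = cosine_sim(generous, harsh_opposite)
centered_same = cosine_sim(center(generous), center(harsh_same))
centered_opposite = cosine_sim(center(generous), center(harsh_opposite))

print(round(raw_same, 4), round(raw_opposite, 4))            # 0.9827 0.8315
print(round(centered_same, 4), round(centered_opposite, 4))  # 1.0 -1.0
```

Without centering, the user with opposite preferences still scores about 0.83 because all ratings are positive. After centering, identical preferences give 1 and opposite preferences give -1.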

### Pearson Correlation

$$\Large \rho(r_x,r_y) = \frac{\sum_i (r_{xi} - \bar{r_{x}}) (r_{yi} - \bar{r_{y}}) }{\sqrt{ \sum_i (r_{xi} - \bar{r_{x}})^2} \sqrt{ \sum_i (r_{yi} - \bar{r_{y}})^2} }$$

"In statistics, the Pearson correlation coefficient (PCC, pronounced /ˈpɪərsən/), also known as Pearson's r, the Pearson product-moment correlation coefficient (PPMCC), the bivariate correlation, or colloquially simply as the correlation coefficient, is a measure of linear correlation between two sets of data." - <a href="https://en.wikipedia.org/wiki/Pearson_correlation_coefficient">Wikipedia</a>

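**When we impute all the missing values with the users' mean ratings, this is the same as centered-cosine similarity.** Filling with the mean leaves each user's mean unchanged, so the imputed entries contribute $r_{xi} - \bar{r_{x}} = 0$ to the sums, exactly like the zero-filled entries of the mean-centered vectors.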

In [100]:
vals_filled = vals.apply(lambda x: x.fillna(x.mean()), axis=1)
vals_filled

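|   | 0 | 1 | 2 | 3 | 4 | 5 | 6 |
|---|---|---|---|---|---|---|---|
| 0 | 4.0 | 3.333333 | 3.333333 | 5.0 | 1.0 | 3.333333 | 3.333333 |
| 1 | 5.0 | 5.0 | 4.0 | 4.666667 | 4.666667 | 4.666667 | 4.666667 |
| 2 | 3.666667 | 3.666667 | 3.666667 | 2.0 | 4.0 | 5.0 | 3.666667 |
| 3 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 | 3.0 |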


In [101]:
vals_filled.apply(lambda x: pearsonr(x, vals_filled.iloc[0])[0], axis=1)



0    1.000000
1    0.092450
2   -0.559085
3         NaN
dtype: float64
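
The equivalence with centered-cosine similarity can be checked numerically. The following sketch uses only NumPy and compares the two routes for users 0 to 2. It uses `np.corrcoef` in place of `scipy.stats.pearsonr`, and the helper `cosine_sim` is a hypothetical name. User 3 is left out on purpose, as explained after the code.

```python
import numpy as np

# Same dummy utility matrix as in this notebook.
ratings = np.array([
    [4, np.nan, np.nan, 5, 1, np.nan, np.nan],
    [5, 5, 4, np.nan, np.nan, np.nan, np.nan],
    [np.nan, np.nan, np.nan, 2, 4, 5, np.nan],
    [np.nan, 3, np.nan, np.nan, np.nan, np.nan, 3],
])
missing = np.isnan(ratings)
user_means = np.nanmean(ratings, axis=1, keepdims=True)

# Route 1: mean-center the rated items, then fill unrated items with 0.
centered = np.where(missing, 0.0, ratings - user_means)
# Route 2: fill unrated items with the user's mean, then use Pearson.
imputed = np.where(missing, user_means, ratings)

def cosine_sim(a, b):
    """Cosine similarity from the formula (hypothetical helper)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare users 0, 1 and 2 against user 0 (user 3 is excluded, see note).
centered_cos = [cosine_sim(row, centered[0]) for row in centered[:3]]
pearson = [float(np.corrcoef(row, imputed[0])[0, 1]) for row in imputed[:3]]

print([round(v, 6) for v in pearson])      # [1.0, 0.09245, -0.559085]
print(np.allclose(centered_cos, pearson))  # True
```

User 3 gave the same rating to every item they rated, so the imputed vector is constant and has zero variance. The denominator of the Pearson formula is then 0, which is why the result above is `NaN` for user 3, whereas `cosine_similarity` returned 0 for the same all-zero centered vector.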

## End
<sup>made by **Jude Michael Teves**</sup> <br>
<sup>for comments, corrections, suggestions, please email:</sup><sup> <href>judemichaelteves@gmail.com</href> or <href>jude.teves@dlsu.edu.ph</href></sup><br>