### Similarity metrics (*content-based filtering*)
     

#### Cosine similarity 

$$ cosine(x, y) = \frac{\sum_i{x_i \cdot y_i}}{\sqrt{\sum_i{x_i^2}} \cdot {\sqrt{\sum_i{y_i^2}}}} $$

* *Cosine* is closely related to *Pearson correlation*: Pearson is the cosine similarity of the mean-centered vectors (
$ corr(x, y) = \frac{E((X - E(X)) \cdot (Y - E(Y)))}  {\sqrt{D(X)} \cdot \sqrt{D(Y)}} $ )
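
The link between cosine and Pearson can be checked numerically: centering both vectors before taking the cosine should reproduce the value from `np.corrcoef`. A minimal sketch, assuming NumPy; the helper names `cosine_sim` and `centered_cosine` are illustrative and not part of the original notebook.

```python
import numpy as np

def cosine_sim(a, b):
    # Plain cosine similarity between two 1-D vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def centered_cosine(a, b):
    # Subtract each vector's mean first, then take the cosine
    return cosine_sim(a - a.mean(), b - b.mean())

rng = np.random.default_rng(0)
a = rng.uniform(size=50)
b = a + rng.normal(0.0, 0.1, size=50)  # noisy copy of a

pearson_ref = float(np.corrcoef(a, b)[0, 1])
print(np.isclose(centered_cosine(a, b), pearson_ref))
```

On uncentered data the two values generally differ, because cosine is sensitive to a shared offset while Pearson is not.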

#### Jaccard index
$$ Jacc(A, B) = \frac{|A \cap B|}{|A \cup B|} $$
* Defined on sets, so it applies only to binary item encodings
* For instance, it can be applied to one-hot-encoded item tags
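
As an illustration of the tag use case, the sketch below computes the Jaccard index for two items described by binary tag vectors. The tag vocabulary, the item vectors and the function name `jaccard_binary` are made up for the example, and treating two empty tag sets as identical is an assumption.

```python
import numpy as np

def jaccard_binary(a, b):
    # Interpret nonzero entries as "tag present"
    a, b = np.asarray(a, dtype=bool), np.asarray(b, dtype=bool)
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 1.0  # assumption: two empty tag sets count as identical
    return float(np.logical_and(a, b).sum() / union)

# Hypothetical tag vocabulary: [action, comedy, drama, sci-fi, thriller]
movie_a = np.array([1, 0, 0, 1, 1])
movie_b = np.array([1, 0, 1, 1, 0])

# 2 shared tags (action, sci-fi) out of 4 distinct tags
print(jaccard_binary(movie_a, movie_b))  # 0.5
```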

#### Euclidean distance
$$ dist(x, y) = \sqrt{\sum_i{(x_i-y_i)^2}} $$

#### Minkowski distance
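$$ dist(x, y) = \left(\sum_i{|x_i-y_i|^n}\right)^{\frac{1}{n}} $$

* $ n $ is the order of the distance: $ n = 1 $ gives the Manhattan distance, $ n = 2 $ the Euclidean distance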
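
The Minkowski distance generalizes the Euclidean one through its order. A small sketch on a 3-4-5 triangle, assuming NumPy; `minkowski_dist` is an illustrative name, and the maximum-coordinate (Chebyshev) value is shown only as the limit that large orders approach.

```python
import numpy as np

def minkowski_dist(a, b, n=2):
    # n is the order of the distance, not the vector length
    return float(np.sum(np.abs(a - b) ** n) ** (1.0 / n))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])

print(minkowski_dist(a, b, n=1))   # 7.0, Manhattan distance
print(minkowski_dist(a, b, n=2))   # 5.0, Euclidean distance
print(minkowski_dist(a, b, n=50))  # close to max(|a_i - b_i|) = 4
```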

#### Pearson/Spearman/Kendall/... correlation
* The coefficients between x and the other vectors can then be used as weights to predict the target for x from those vectors' targets:
    $$target(x) = weights \cdot targets $$
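
One way to read the weighting formula is as a similarity-weighted average over neighbor vectors. The sketch below is an assumption-laden illustration, not the notebook's method: the function name `predict_target`, the toy data, and the normalization by the total absolute weight are all choices made for the example.

```python
import numpy as np

def predict_target(x, neighbors, targets):
    # Pearson correlation between x and each neighbor row
    weights = np.array([np.corrcoef(x, row)[0, 1] for row in neighbors])
    # Assumption: divide by the total absolute weight so the result
    # stays on the same scale as the targets
    return float(np.dot(weights, targets) / np.abs(weights).sum())

x = np.array([1.0, 2.0, 3.0, 4.0])
neighbors = np.array([
    [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with x (weight 1.0)
    [1.0, 3.0, 2.0, 4.0],  # partially correlated with x (weight 0.8)
])
targets = np.array([5.0, 4.0])

print(predict_target(x, neighbors, targets))
```

The more strongly a neighbor correlates with x, the more its target pulls the prediction toward itself.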

In [1]:
import numpy as np

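# Generate similar vectors: y is x shifted by a constant 0.04
# (np.random.normal(loc=0.04, scale=0) has zero spread, so no random noise is added)

x = np.random.uniform(size=20)
y = x + np.random.normal(0.04, 0, size=20)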

print(f'x:\t{x}')
print(f'y:\t{y}')

x:	[0.93451155 0.65743761 0.9961082  0.27446787 0.4783896  0.13702663
 0.14188131 0.42261796 0.88377803 0.46889177 0.35044283 0.97407723
 0.39320223 0.78466728 0.47554408 0.21141312 0.60334072 0.63639123
 0.79672015 0.62417076]
y:	[0.97451155 0.69743761 1.0361082  0.31446787 0.5183896  0.17702663
 0.18188131 0.46261796 0.92377803 0.50889177 0.39044283 1.01407723
 0.43320223 0.82466728 0.51554408 0.25141312 0.64334072 0.67639123
 0.83672015 0.66417076]


### Simple implementations of some similarity metrics

In [5]:
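def cosine(x: np.ndarray, y: np.ndarray) -> float:
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def jaccard(x: np.ndarray, y: np.ndarray) -> float:
    # Treat x and y as binary indicator vectors (e.g. one-hot encoded tags)
    x, y = x.astype(bool), y.astype(bool)
    inter = np.logical_and(x, y).sum()  # |A ∩ B|
    uni = np.logical_or(x, y).sum()     # |A ∪ B|
    # Two empty sets are treated as identical here (a convention)
    return 1.0 if uni == 0 else inter / uni

def euclidean(x: np.ndarray, y: np.ndarray) -> float:
    return np.linalg.norm(x - y)

def minkowski(x: np.ndarray, y: np.ndarray, n: int = 2) -> float:
    # n is the order of the distance (n=2 is Euclidean)
    return np.sum(np.abs(x - y) ** n) ** (1 / n)

def pearson(x: np.ndarray, y: np.ndarray) -> float:
    # Covariance divided by the product of standard deviations
    return np.mean(
        (x - np.mean(x)) * (y - np.mean(y))
    ) / np.sqrt(np.var(x) * np.var(y))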

In [6]:
print(f'Cosine similarity:\t{cosine(x, y)}')
print(f'Euclidean distance:\t{euclidean(x, y)}')
print(f'Minkowski distance (n=2):\t{minkowski(x, y)}')
print(f'Minkowski distance (n=3):\t{minkowski(x, y, 3)}')
print(f'Pearson correlation:\t{pearson(x, y)}')

Cosine similarity:	0.9996641581280302
Euclidean distance:	0.1788854381999832
Minkowski distance (n=2):	0.1788854381999832
Minkowski distance (n=3):	0.1085767046637963
Pearson correlation:	1.0


### Predictive metrics
- MAE
- RMSE
- Classification metrics (Accuracy, Precision, Recall, F1,  ROC-AUC, PR-AUC, ...)
- Decision support metrics
  1. **Precision@k**
        $$ Precision@k = \frac{\text{Relevant items in top-k recommendations}}{\text{All recommended items at k}} = \frac{TP}{TP + FP}$$
  2. **Recall@k (aka Hitrate@k)**
      $$ Recall@k = \frac{\text{Relevant items in top-k recommendations}}{\text{All relevant items}} = \frac{TP}{TP + FN}$$
  3. **F1@k**
      $$ F1@k = \frac{2 \cdot Precision@k \cdot Recall@k}{Precision@k + Recall@k} $$
- Ranking based metrics
  1. **Average Precision**
        $$ AP@N =   \frac{1}{m} \cdot \sum_{k=1}^N{relevant(k) \cdot P(k)} $$
      $ relevant(k) \text{ = 1 if k-th item is relevant else 0} $
      
      $ m $ - total number of relevant items for the user (e.g. the items the user actually added)
  2. **MAP**
  3. **DCG (Discounted cumulative gain)**  
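
MAP and DCG are only named in the list above, so here is a hedged sketch of both. It assumes the common definitions: MAP as the mean of per-user AP, and DCG as graded relevance discounted by $\log_2(position + 1)$. The function names and the toy rankings are illustrative; each input lists relevance in ranked order, best-scored item first.

```python
import numpy as np

def average_precision_ranked(relevant_in_rank_order):
    # Binary relevance flags, already sorted by predicted score
    rel = np.asarray(relevant_in_rank_order, dtype=bool)
    if rel.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(rel) / np.arange(1, rel.size + 1)
    return float((precision_at_k * rel).sum() / rel.sum())

def mean_average_precision(rankings):
    # MAP: mean of AP over users (one ranking per user)
    return float(np.mean([average_precision_ranked(r) for r in rankings]))

def dcg_at_k(relevance_in_rank_order, k):
    # Graded relevance discounted by log2(position + 1), positions start at 1
    rel = np.asarray(relevance_in_rank_order, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))
    return float(np.sum(rel / discounts))

# Two hypothetical users with binary relevance per ranked item
print(mean_average_precision([[1, 0, 1, 0], [0, 1]]))

# One hypothetical ranking with graded relevance (0 = irrelevant, 3 = best)
print(dcg_at_k([3, 2, 3, 0, 1, 2], k=6))
```

Because the discount grows with position, moving a highly relevant item toward the top of the ranking increases DCG, which plain Precision@k does not capture.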

In [127]:
def precision_k(real: np.ndarray, preds: np.ndarray, rec_threshold=0.5, k=5):
    # Rank items by predicted score (descending) without mutating the inputs
    order = np.argsort(-preds)
    real_k, top_k = real[order][:k], preds[order][:k]

    recommended_k = (top_k >= rec_threshold).sum()
    relevant_k = ((top_k >= rec_threshold) & (real_k >= rec_threshold)).sum()

    # Convention: if nothing is recommended, precision is reported as 1
    return 1 if recommended_k == 0 else relevant_k / recommended_k


def hitrate_k(real: np.ndarray, preds: np.ndarray, rec_threshold=0.5, k=5):
    order = np.argsort(-preds)
    real_k, top_k = real[order][:k], preds[order][:k]

    relevant_total = (real >= rec_threshold).sum()
    relevant_k = ((top_k >= rec_threshold) & (real_k >= rec_threshold)).sum()

    # Convention: if there are no relevant items at all, recall is reported as 1
    return 1 if relevant_total == 0 else relevant_k / relevant_total


def f1_k(real: np.ndarray, preds: np.ndarray, b: float = 1, rec_threshold=0.5, k=5):
    order = np.argsort(-preds)
    real_k, top_k = real[order][:k], preds[order][:k]

    relevant_k = ((top_k >= rec_threshold) & (real_k >= rec_threshold)).sum()
    recommended_k = (top_k >= rec_threshold).sum()
    relevant_total = (real >= rec_threshold).sum()

    precision = relevant_k / recommended_k if recommended_k != 0 else 1
    recall = relevant_k / relevant_total if relevant_total != 0 else 1

    # b is the beta of the F-beta score; b=1 gives the usual F1
    denom = b**2 * precision + recall
    return 0 if denom == 0 else (1 + b**2) * precision * recall / denom

In [128]:
real, preds = np.array([4, 2, 3, 5, 2, 4]), np.array([2.3, 3.6, 3.4, 4.5, 4.9, 4.3])

print(f'P@k = \t {precision_k(real, preds, rec_threshold=3.5, k=3)}') # 0.67 precision
print(f'Hitrate@k = \t {hitrate_k(real, preds, rec_threshold=3.5, k=3)}') # 0.67 hitrate / recall
print(f'F1@k = \t {f1_k(real, preds, rec_threshold=3.5, k=3)}') # 0.67 f1

P@k = 	 0.6666666666666666
Hitrate@k = 	 0.6666666666666666
F1@k = 	 0.6666666666666666


In [153]:
def average_precision(real: np.ndarray, preds: np.ndarray, rec_threshold=0.5):
    # Ground-truth relevance of each item, ordered by predicted score (descending)
    relevant = real[np.argsort(-preds)] >= rec_threshold
    relevant_total = relevant.sum()

    if relevant_total == 0:
        return 0.0

    # P(k) for k = 1..N: relevant items among the top k, divided by k
    precision_at_k = np.cumsum(relevant) / np.arange(1, relevant.size + 1)

    # Sum P(k) over the positions of relevant items, normalized by m
    return (precision_at_k * relevant).sum() / relevant_total

real, preds = np.array([4, 2, 3, 5, 2, 4]), np.array([2.3, 3.6, 3.4, 4.5, 4.9, 4.3])

print(f'AP@k = \t {average_precision(real, preds, rec_threshold=3.5):.4f}') # (1/2 + 2/3 + 3/6) / 3

AP@k = 	 0.5556


In [152]:
from sklearn.metrics import average_precision_score

y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])

# Compare the implementation above with the scikit-learn reference on binary labels
print(f'{average_precision(y_true, y_scores):.4f}')
print(f'{average_precision_score(y_true, y_scores):.4f}')

0.8333
0.8333