In [1]:
import pandas as pd
from tqdm import notebook
import numpy as np
from scipy.spatial import distance_matrix

### Euclidean Distance
[[Reference]](https://en.wikipedia.org/wiki/Euclidean_distance)

>#### Distance between two vectors

In [2]:
distance_matrix([[0,0]], [[1,0]])

array([[1.]])

>#### Distance matrix

[scipy.spatial.distance_matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance_matrix.html)

In [3]:
distance_matrix(
    [[0,0], [3,1]],
    [[1,0], [1,1]])

array([[1.        , 1.41421356],
       [2.23606798, 2.        ]])
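
To make the layout of the result concrete, here is a minimal NumPy sketch, assuming small dense inputs, in which entry `[i, j]` is the Euclidean distance between row `i` of the first argument and row `j` of the second. The helper name `pairwise_euclidean` is hypothetical and is not part of SciPy.

```python
import numpy as np

def pairwise_euclidean(x, y):
    """Hypothetical helper: Euclidean distance between every row of x and every row of y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Broadcasting gives an (M, N, K) array of coordinate differences
    diff = x[:, None, :] - y[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

d = pairwise_euclidean([[0, 0], [3, 1]], [[1, 0], [1, 1]])
```

With the same inputs as the cell above, `d` should match the array returned by `distance_matrix`.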

**Note: if the inputs are huge (shapes M × K and N × K), you can set the `threshold` parameter: when M * N * K exceeds `threshold`, the algorithm uses a Python loop instead of large temporary arrays.**
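
As a rough illustration of the `threshold` parameter, the sketch below calls `distance_matrix` twice on the same random inputs, once with a small `threshold` and once with the default, and checks that the results agree. The array sizes and the seed are arbitrary choices made for this example.

```python
import numpy as np
from scipy.spatial import distance_matrix

rng = np.random.default_rng(0)
x = rng.random((50, 3))   # M = 50 points, K = 3 dimensions
y = rng.random((40, 3))   # N = 40 points, K = 3 dimensions

# M * N * K = 6000 exceeds this threshold, so the loop-based path is used
d_loop = distance_matrix(x, y, threshold=1000)
# 6000 is below the default threshold, so large temporary arrays are used
d_default = distance_matrix(x, y)

same = np.allclose(d_loop, d_default)
```

Only memory use and speed should differ between the two paths, not the distances.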

### Cosine-based similarity
[[Reference]](https://en.wikipedia.org/wiki/Cosine_similarity)

In [4]:
from sklearn.metrics.pairwise import cosine_similarity

Because the cosine ranges over [-1, +1], similarity computations usually normalize it to [0, 1], typically with the formula `sim = 0.5 + 0.5 * cosθ`

In [5]:
def _cosine_similarity(x, y):
    return 0.5 + 0.5 * cosine_similarity(x, y)

```
vec_a = [a1, a2]
vec_b = [b1, b2]
cosine_similarity(vec_a, vec_b) computes:
    [[cosine_sim(a1, b1)     cosine_sim(a1, b2)],
     [cosine_sim(a2, b1)     cosine_sim(a2, b2)]]
```
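
The pairwise layout above can be reproduced with plain NumPy. This is a minimal sketch, assuming dense inputs with no all-zero rows; `pairwise_cosine` is a hypothetical helper and not how scikit-learn implements `cosine_similarity`.

```python
import numpy as np

def pairwise_cosine(x, y):
    """Hypothetical helper: cosine similarity between every row of x and every row of y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Scale each row to unit length (assumes no all-zero rows)
    x_unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    y_unit = y / np.linalg.norm(y, axis=1, keepdims=True)
    # Entry [i, j] is the dot product of unit row i of x with unit row j of y
    return x_unit @ y_unit.T

vec_a = [[-1, -2, -3, -4, -5], [1, -2, -3, -4, -5]]
vec_b = [[1, 3, 5, 7, 9], [-1, -2, -3, -4, -5]]
sim = 0.5 + 0.5 * pairwise_cosine(vec_a, vec_b)  # rescaled to [0, 1]
```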

>#### Similarity from two vectors

In [6]:
vec_a = [[-1, -2, -3, -4, -5],[1, -2, -3, -4, -5]]
vec_b = [[1, 3, 5, 7, 9], [-1, -2, -3, -4, -5]]

In [7]:
_cosine_similarity(vec_a, vec_b)

array([[0.00137931, 1.        ],
       [0.01187659, 0.98181818]])

>#### Similarity matrix

In [8]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(
    [[0,0], [1,1]],
    [[3,0], [1,1]])

array([[0.        , 0.        ],
       [0.70710678, 1.        ]])

### Adjusted Cosine Similarity
[[Reference]](https://stackoverflow.com/questions/40716459/choice-between-an-adjusted-cosine-similarity-vs-regular-cosine-similarity)

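Suppose users A and B rate two items (1, 2) and (4, 5) respectively; cosine similarity gives 0.98. Judging by the ratings themselves, though, A seems to dislike both items while B likes them. Cosine similarity is insensitive to absolute magnitudes, so this unreasonable result needs correcting. Adjusted Cosine Similarity does so by subtracting, for each user who rated item i, that user's mean rating.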

In [9]:
from scipy import spatial
import numpy as np
a = np.array([2.0,1.0])  
b = np.array([5.0,3.0])

>#### Regular Cosine Similarity

In [10]:
1 - spatial.distance.cosine(a,b)

0.9970544855015815

In [11]:
c = np.array([5.0,4.0])
1 - spatial.distance.cosine(c,b)

0.9909924304103233

>#### Adjusted Cosine Similarity

In [12]:
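# Caution: built-in sum(a, b) uses b as its start value, so this evaluates to
# (2 * a.sum() + b.sum()) / 4 = 3.5, not the plain mean of the four ratings (2.75)
mean_ab = sum(sum(a,b)) / 4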
mean_ab

3.5

In [13]:
1 - spatial.distance.cosine(a - mean_ab, b - mean_ab)

-0.21693045781865616

In [14]:
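# Caution: as above, this evaluates to (2 * c.sum() + b.sum()) / 4 = 6.5,
# not the plain mean of the four ratings (4.25)
mean_cb = sum(sum(c,b)) / 4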
mean_cb

6.5

In [15]:
1 - spatial.distance.cosine(c - mean_cb, b - mean_cb)

0.9908301680442989
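
The definition given at the start of this section (subtract each user's mean rating, then compare items) can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: `adjusted_cosine_items` is a hypothetical helper, the ratings are made-up data, every user is assumed to have rated every item, and no centered item column is all zeros.

```python
import numpy as np

def adjusted_cosine_items(ratings):
    """Hypothetical helper: item-item adjusted cosine for a dense users x items array."""
    r = np.asarray(ratings, dtype=float)
    # Subtract each user's mean rating (row mean) to remove per-user bias
    centered = r - r.mean(axis=1, keepdims=True)
    # Cosine similarity between the item columns of the centered matrix
    norms = np.linalg.norm(centered, axis=0)
    return (centered.T @ centered) / np.outer(norms, norms)

# Illustrative data: 2 users (rows) rating 3 items (columns)
ratings = [[1, 2, 3],
           [4, 5, 3]]
item_sim = adjusted_cosine_items(ratings)
```

Here `item_sim[i, j]` is the adjusted cosine similarity between items `i` and `j`; the diagonal is 1 by construction.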

### Pearson (correlation)-based similarity

In [16]:
def pearson(x, y):
    return 0.5 + 0.5 * np.corrcoef(x, y, rowvar=True) # rescale from [-1, 1] to [0, 1]

In [17]:
vec_a = [[-1, -2, -3, -4, -5], [-1, -2, -3, -4, -5]]
vec_b = [[6, 8, 1, 7, 2], [-5, -2, -3, -4, -5]]

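Although there are two datasets, you can think of them as a single stacked dataset, with the correlation computed row by row.
So the full correlation matrix looks as follows:
* each of `a1`, `a2`, `b1`, `b2` is a list or vector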
```
vec_a = [a1, a2]
vec_b = [b1, b2]
np.corrcoef(vec_a, vec_b) computes the correlation matrix:
                  a1             a2             b1             b2
        a1   corr(a1, a1)   corr(a1, a2)   corr(a1, b1)   corr(a1, b2)
        a2   corr(a2, a1)   corr(a2, a2)   corr(a2, b1)   corr(a2, b2)
        b1   corr(b1, a1)   corr(b1, a2)   corr(b1, b1)   corr(b1, b2)
        b2   corr(b2, a1)   corr(b2, a2)   corr(b2, b1)   corr(b2, b2)
```
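
If only the correlations between rows of `vec_a` and rows of `vec_b` are needed, the upper-right block of that matrix can be sliced out. A minimal sketch, where the names `full`, `n_a`, `cross`, and `cross_scaled` are introduced only for this example:

```python
import numpy as np

vec_a = [[-1, -2, -3, -4, -5], [-1, -2, -3, -4, -5]]
vec_b = [[6, 8, 1, 7, 2], [-5, -2, -3, -4, -5]]

full = np.corrcoef(vec_a, vec_b)   # 4 x 4: rows of vec_a stacked on rows of vec_b
n_a = len(vec_a)
cross = full[:n_a, n_a:]           # corr(a_i, b_j) only, shape (2, 2)
cross_scaled = 0.5 + 0.5 * cross   # same rescaling to [0, 1] as pearson()
```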

In [18]:
pearson(vec_a, vec_b)

array([[1.        , 1.        , 0.72845289, 0.62126781],
       [1.        , 1.        , 0.72845289, 0.62126781],
       [0.72845289, 0.72845289, 1.        , 0.62928525],
       [0.62126781, 0.62126781, 0.62928525, 1.        ]])

### Hamming Distance
[[Reference]](https://en.wikipedia.org/wiki/Hamming_distance)

The Hamming distance between 1-D arrays u and v is simply the proportion of disagreeing components in u and v.

In [19]:
from scipy.spatial.distance import hamming

In [20]:
x = [7, 12, 14, 19, 22]
y = [7, 12, 16, 26, 27]

In [21]:
hamming(x, y)

0.6
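
As a cross-check of the "proportion of disagreeing components" definition, the same value can be computed directly with NumPy. The variable names below are chosen for this sketch only.

```python
import numpy as np

x = np.array([7, 12, 14, 19, 22])
y = np.array([7, 12, 16, 26, 27])

mismatches = x != y              # element-wise disagreement
proportion = mismatches.mean()   # fraction of positions that differ
count = int(mismatches.sum())    # number of positions that differ
```

Multiplying the proportion by the vector length recovers the count of differing positions, which is the form of Hamming distance used for equal-length strings.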

### Manhattan Distance
[[Reference]](https://zh.wikipedia.org/zh-tw/%E6%9B%BC%E5%93%88%E9%A0%93%E8%B7%9D%E9%9B%A2)

The Manhattan distance between point P1 at coordinates (x1, y1) and point P2 at coordinates (x2, y2) is
$d(P_1, P_2) = \left|x_1 - x_2\right| + \left|y_1 - y_2\right|.$

In [22]:
x = [0, 1, 1, 1, 0, 1]
y = [0, 9, 1, 1, 0, 2]

In [23]:
def manhattan_distance(x, y):
    # np.asarray accepts both lists and ndarrays, so no type check is needed
    return np.abs(np.asarray(x) - np.asarray(y)).sum()

In [24]:
manhattan_distance(x, y)

9
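
SciPy also provides this metric as `scipy.spatial.distance.cityblock`, which can serve as a cross-check for the hand-written function. A short sketch using the same vectors:

```python
from scipy.spatial.distance import cityblock

x = [0, 1, 1, 1, 0, 1]
y = [0, 9, 1, 1, 0, 2]
d = cityblock(x, y)  # sum of absolute coordinate differences
```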

### Suggestions for recommendation systems:
* Use Pearson when your data is subject to user bias or users rate on different scales
* Use Cosine if the data is sparse (many ratings are undefined)
* Use Euclidean if your data is not sparse and the magnitude of the attribute values is significant
* Use adjusted cosine for item-based approaches to adjust for user bias
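
To illustrate the first and third suggestions, the sketch below compares two made-up users whose preferences are identical but expressed on different scales. The data and variable names are assumptions for this example only.

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0, 4.0])    # illustrative ratings from one user
v = 2.0 * u + 1.0                      # same preferences on a shifted, stretched scale

pearson_uv = np.corrcoef(u, v)[0, 1]   # unaffected by the shift and the scale
euclid_uv = np.linalg.norm(u - v)      # grows with the difference in scale
```

Pearson treats the two users as perfectly correlated, while the Euclidean distance between them is large because it depends on the raw magnitudes.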

