## Nearest Neighbor Search

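Recommender and search systems usually produce user vectors and item vectors, but comparing a query against every stored vector one by one becomes very slow as the collection grows. An approximate nearest neighbor (ANN) search speeds this up by trading a small amount of accuracy for much faster lookups. The annoy package is a good choice (building it from source requires a C++ compiler); Spotify uses it in its recommender system.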
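
To make the cost of the "ordinary" comparison concrete, here is a minimal sketch of an exact brute-force search in plain Python. The helper names `cosine` and `brute_force_top_k` are hypothetical and only for illustration; every query scans all stored vectors, which is the linear cost that an ANN index is meant to avoid.

```python
import heapq
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def brute_force_top_k(query, vectors, k):
    """Exact top-k by scanning every vector: O(n * d) work per query."""
    scored = ((cosine(query, v), i) for i, v in enumerate(vectors))
    return [i for _, i in heapq.nlargest(k, scored)]


random.seed(0)
vectors = [[random.gauss(0, 1) for _ in range(40)] for _ in range(1000)]
top5 = brute_force_top_k(vectors[0], vectors, 5)
print(top5)  # the first entry is 0: a vector is most similar to itself
```

With 1,000 vectors this is still quick, but the work grows linearly with the number of stored items for every single query.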

In [1]:
!pip install annoy



In [2]:
from sklearn.metrics.pairwise import cosine_similarity

In [3]:
from annoy import AnnoyIndex
import random


# Vector dimension; the vectors are randomly generated because this is only a demo
f = 40
nums_vectors = 1000

# 1. Create the index; 'angular' is the distance metric
t = AnnoyIndex(f, 'angular')

# 2. Generate the vectors
for i in range(nums_vectors):
    v = [random.gauss(0, 1) for _ in range(f)]
    # Add the vector to the index under id i
    t.add_item(i, v)

# 3. Build the forest of trees that indexes these vectors
t.build(10) # 10 trees


# 4. Save the built index (structure and data) to disk
t.save('test.ann')

True
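
A note on the `'angular'` metric: according to the Annoy documentation, the distance it reports is the Euclidean distance between the normalized vectors, that is sqrt(2 * (1 - cos)). The sketch below converts between that distance and cosine similarity; the helper names `angular_to_cosine` and `cosine_to_angular` are hypothetical and not part of annoy.

```python
import math


def angular_to_cosine(distance):
    """Convert an Annoy 'angular' distance to cosine similarity."""
    return 1.0 - (distance ** 2) / 2.0


def cosine_to_angular(similarity):
    """Inverse conversion: cosine similarity to 'angular' distance."""
    return math.sqrt(max(0.0, 2.0 * (1.0 - similarity)))


print(angular_to_cosine(0.0))             # identical direction -> 1.0
print(angular_to_cosine(math.sqrt(2.0)))  # orthogonal vectors -> approximately 0.0
print(angular_to_cosine(2.0))             # opposite direction -> -1.0
```

This can be applied to the distances returned when `get_nns_by_item` is called with `include_distances=True`, so results can be read on the more familiar cosine scale.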

In [4]:
# Create an index with the same dimension and metric
u = AnnoyIndex(f, 'angular')

# Load the previously saved index
u.load('test.ann') # super fast, will just mmap the file



nums_retrieve = 100

# Get the vectors nearest to the vector with id 0 (the item itself is included)
print(u.get_nns_by_item(0, nums_retrieve)) # will find the 100 nearest neighbors

[0, 901, 993, 659, 125, 421, 425, 736, 572, 726, 766, 687, 562, 779, 357, 478, 624, 645, 566, 229, 399, 203, 197, 701, 356, 15, 522, 622, 246, 481, 422, 771, 713, 446, 265, 445, 472, 429, 355, 438, 599, 322, 121, 339, 146, 979, 112, 590, 485, 926, 454, 700, 658, 72, 406, 375, 310, 906, 469, 675, 212, 959, 160, 279, 299, 621, 350, 520, 888, 111, 302, 346, 401, 833, 432, 510, 805, 530, 947, 990, 783, 126, 638, 988, 822, 239, 288, 116, 161, 471, 839, 342, 394, 131, 989, 923, 570, 164, 945, 589]
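
Because the search is approximate, the returned list may miss some of the true nearest neighbors. One common way to quantify this is recall against an exact ranking. The helper `recall_at_k` below is a hypothetical sketch, and the ids passed to it are toy values, not output from the index above.

```python
def recall_at_k(approx_ids, exact_ids):
    """Fraction of the exact top-k neighbors that the approximate search also returned."""
    exact = set(exact_ids)
    if not exact:
        return 0.0
    return len(exact & set(approx_ids)) / len(exact)


# Toy ids for illustration only, not output from the index above
approx = [0, 1, 2, 3]
exact = [0, 1, 2, 4]
print(recall_at_k(approx, exact))  # 0.75
```

In practice the exact ids would come from a full scan over all vectors. If recall is too low, building more trees or passing a larger `search_k` to `get_nns_by_item` generally improves accuracy at the cost of speed.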


In [6]:
# Quick sanity check: compute the cosine similarity of each neighbor to vector 0

for idx in u.get_nns_by_item(0, nums_retrieve):
    # Compute the similarity between vector 0 and this neighbor
    print(cosine_similarity([u.get_item_vector(0)], [u.get_item_vector(idx)]))

[[1.]]
[[0.50664558]]
[[0.48497557]]
[[0.47285208]]
[[0.42606327]]
[[0.42556953]]
[[0.4060858]]
[[0.39945202]]
[[0.39079785]]
[[0.38677427]]
[[0.36875233]]
[[0.36448019]]
[[0.36064675]]
[[0.36056403]]
[[0.35257917]]
[[0.35206143]]
[[0.34833343]]
[[0.34213999]]
[[0.34155917]]
[[0.34103703]]
[[0.34065471]]
[[0.34012047]]
[[0.33955432]]
[[0.33061043]]
[[0.32996406]]
[[0.32549411]]
[[0.32400912]]
[[0.3191204]]
[[0.31375441]]
[[0.31287499]]
[[0.31046919]]
[[0.31003807]]
[[0.30915652]]
[[0.30835805]]
[[0.30315317]]
[[0.3022529]]
[[0.30209057]]
[[0.30163048]]
[[0.29579672]]
[[0.29520779]]
[[0.2949873]]
[[0.29328196]]
[[0.28791709]]
[[0.28743712]]
[[0.28542967]]
[[0.28521491]]
[[0.28389263]]
[[0.28336431]]
[[0.28253476]]
[[0.27511568]]
[[0.27480874]]
[[0.27227681]]
[[0.27092637]]
[[0.27086995]]
[[0.27048335]]
[[0.2651507]]
[[0.26444455]]
[[0.26360781]]
[[0.26232992]]
[[0.2605601]]
[[0.26017268]]
[[0.25542703]]
[[0.25484172]]
[[0.25396785]]
[[0.24947488]]
[[0.24769039]]
[[0.24636157]]
[[0.24558