# Approximate Nearest Neighbors

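An example is still the best way to explain the idea:

Suppose we have 5 million items, each represented by a feature vector in $\mathbf{R}^{100}$, and we want to find the ten nearest neighbors of one of them.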

In [1]:
import numpy as np
import gc
from tqdm import tqdm

items = int(5 * 1e6)  # 5 million
factor = 100
x = np.random.rand(items, factor)  # about 4 GB of memory (float64)

class TopRelated:
    # Brute-force nearest items via dot products of normalized vectors (cosine similarity)
    def __init__(self, items_factors):
        # Normalize the item vectors once at initialization
        norms = np.linalg.norm(items_factors, axis=1)
        self.factors = items_factors / norms[:, np.newaxis]

    def get_related(self, itemid, N=10):
        scores = self.factors.dot(self.factors[itemid])  # cosine similarity to every item
        # argpartition: the N largest scores end up in the last N slots (unordered)
        best = np.argpartition(scores, -N)[-N:]
        return sorted(zip(best, scores[best]), key=lambda x: -x[1])

In [44]:
top_related = TopRelated(x)
%time top_related.get_related(100)

Wall time: 266 ms


[(100, 1.0),
 (2266144, 0.85922529002190373),
 (1367840, 0.85190470560078513),
 (4827941, 0.85175188853230699),
 (4158610, 0.85165288118765758),
 (3025845, 0.85092234565859126),
 (4118586, 0.85018339949723798),
 (2610570, 0.84982956620805683),
 (333249, 0.84958760878590689),
 (4265367, 0.84917002416086451)]
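
To make the `np.argpartition` step in `get_related` easier to follow, here is a minimal sketch on a toy array. The scores and `N = 2` are made up for illustration and are not part of the experiment above.

```python
import numpy as np

# Toy scores, made up for illustration only.
scores = np.array([0.1, 0.9, 0.3, 0.7, 0.5])
N = 2

# argpartition puts the indices of the N largest values in the last N
# positions, but does not order them among themselves.
best = np.argpartition(scores, -N)[-N:]

# Sort only those N candidates by descending score.
top = sorted(zip(best.tolist(), scores[best].tolist()), key=lambda pair: -pair[1])
print(top)  # [(1, 0.9), (3, 0.7)]
```

The partition is linear in the number of items, so only the final `N` candidates need a full sort.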

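Finding the 10 nearest neighbors of a single item takes about 280 ms (266 ms in the run above), so doing this for all 5 million items would take roughly 390 hours.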

In [23]:
round(items * 0.28 / 3600, 2)  # hours

388.89

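When we need nearest neighbors across a huge catalog, a linear scan (brute force) is very slow. This is where approximate nearest neighbor (ANN) methods can help. Let's first look at how to use the [annoy](https://github.com/spotify/annoy) package. Its author is a real heavyweight (google it yourself) and the project has over three thousand stars, so it should be a safe choice.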

In [8]:
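import annoy
class ApproximateTopRelated:
    def __init__(self, items_factors, treecount=20):
        # One Annoy index over all item vectors, using the 'angular' metric
        index = annoy.AnnoyIndex(items_factors.shape[1], 'angular')
        for i, row in tqdm(enumerate(items_factors)):
            index.add_item(i, row)
        index.build(treecount)  # more trees: better accuracy, longer build
        self.index = index

    def get_related(self, itemid, N=10):
        neighbours = self.index.get_nns_by_item(itemid, N)
        # Note: Annoy's 'angular' distance is sqrt(2 * (1 - cos)), so the score
        # `1 - distance` below is not the cosine itself (that would be
        # 1 - distance**2 / 2). It still ranks neighbours in the same order.
        return sorted(((other, 1 - self.index.get_distance(itemid, other))
                      for other in neighbours), key=lambda x: -x[1])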

In [9]:
%time app_top_related = ApproximateTopRelated(x)

5000000it [01:08, 72851.61it/s]


Wall time: 13min 15s


In [10]:
%time app_top_related.get_related(100)

Wall time: 0 ns


[(100, 1.0),
 (3164034, 0.4320356249809265),
 (1034936, 0.42606252431869507),
 (118226, 0.40478748083114624),
 (3624813, 0.4044606685638428),
 (2921124, 0.39136600494384766),
 (847370, 0.38512957096099854),
 (4961798, 0.37953001260757446),
 (122976, 0.37684887647628784),
 (416312, 0.37207216024398804)]
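
These scores look much lower than the brute-force ones because they are `1 - distance`, and Annoy's `'angular'` distance is defined as `sqrt(2 * (1 - cos))` rather than `1 - cos`. A minimal sketch of the conversion back to cosine similarity follows; the helper name `angular_to_cosine` is hypothetical, not part of annoy.

```python
def angular_to_cosine(distance):
    """Convert an Annoy 'angular' distance to cosine similarity.

    Assumes distance = sqrt(2 * (1 - cos)), as described in the annoy README.
    """
    return 1.0 - distance ** 2 / 2.0

# Top non-self score from the output above, which was computed as 1 - distance.
reported_score = 0.4320356249809265
distance = 1.0 - reported_score
cosine = angular_to_cosine(distance)
print(round(cosine, 3))  # 0.839
```

After the conversion, the best approximate neighbor has a cosine of about 0.84, which is in the same range as the brute-force scores of about 0.85.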

In [20]:
import time 


In [51]:
t0 = time.time()
for i in range(items):
    app_top_related.get_related(i)
t1 = time.time()
print('time elapse :{}s'.format(t1-t0))

time elapse :999.8614735603333s
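
As a rough comparison of the timings recorded above, the sketch below divides the brute-force estimate by the measured ANN time. It only reuses numbers that appear in this notebook. The variable names are made up for illustration, and the ANN results are approximate, so this is not a like-for-like comparison.

```python
items = 5_000_000

brute_seconds = items * 0.28            # brute-force estimate (about 388.89 hours)
ann_query_seconds = 999.8614735603333   # measured loop over all items
ann_build_seconds = 13 * 60 + 15        # index build wall time (13min 15s)

# Speedup when only query time is counted, and when the one-off build is included.
speedup_query_only = brute_seconds / ann_query_seconds
speedup_with_build = brute_seconds / (ann_query_seconds + ann_build_seconds)
print(round(speedup_query_only), round(speedup_with_build))  # 1400 780
```

Even with the index build included, the full pass drops from hundreds of hours to about half an hour on these numbers.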
