# Scalable Recommendation Retrieval with Locality Sensitive Hashing

From the previous example, we can see that the cost of exhaustive search is linear in both the number of items $n$ and the number of features $d$.
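As a reminder of where that cost comes from, here is a minimal sketch of exhaustive top-k retrieval by inner product. The function name `exhaustive_top_k` and the random data are hypothetical and used only for illustration; they are not part of the `utils` package.

```python
import numpy as np

def exhaustive_top_k(item_vectors, user_vector, k):
    """Return the indices of the k items with the largest inner product."""
    scores = item_vectors @ user_vector          # n dot products of length d: O(n * d)
    top = np.argpartition(-scores, k - 1)[:k]    # unordered top-k candidates
    return top[np.argsort(-scores[top])]         # order the k winners by score

rng = np.random.default_rng(0)
items = rng.normal(size=(1000, 20))   # hypothetical n = 1000 items, d = 20 factors
user = rng.normal(size=20)
print(exhaustive_top_k(items, user, k=10))
```

Every query scores all $n$ items, which is the work LSH tries to avoid.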

In this part, we practice using Locality Sensitive Hashing (LSH) to speed up the recommendation retrieval task.

In [None]:
%matplotlib inline
import numpy as np
import time
import pickle
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from utils.lsh import *
from utils.load_data import *
from utils.pmf import *
from utils.evaluation import *

In [None]:
#load train/test data and trained pmf model
path_to_model = 'pmf_mvl1m.model'
path_to_train_data = 'train_data'
path_to_test_data  = 'test_data'

with open(path_to_model, 'rb') as f:
    pmf_mvl = pickle.load(f)
with open(path_to_train_data, 'rb') as f:
    train = pickle.load(f)
with open(path_to_test_data, 'rb') as f:
    test = pickle.load(f)

The evaluation function used below measures the precision and recall of the top-k recommendation list returned to each user by the model. The results are averaged over the users in the test set.
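To make the two metrics concrete, here is a minimal per-user sketch. The helper `precision_recall_at_k` is hypothetical; the actual implementation in `utils.evaluation` may differ, for example in how relevant items are defined or whether training items are excluded.

```python
def precision_recall_at_k(recommended, relevant, k):
    """Precision@k and recall@k for one user.

    recommended: ranked list of item ids returned by the model
    relevant:    collection of item ids the user interacted with in the test set
    """
    top_k = list(recommended)[:k]
    relevant = set(relevant)
    hits = sum(1 for item in top_k if item in relevant)
    precision = hits / k
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Toy example: 2 of the top-5 recommendations appear among 4 held-out items
p, r = precision_recall_at_k([3, 7, 1, 9, 4], {7, 4, 8, 2}, k=5)
print(p, r)
```

Averaging these per-user values over all test users gives the aggregate figures reported in this notebook.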

## Performance of Linear Scanning Solution

We first measure precision@10 and recall@10 of exhaustive linear scanning, which serve as the baseline for the LSH experiments.

In [None]:
# measuring performance of the first 2000 users
topK = 10
data = pmf_mvl.w_Item
queries = pmf_mvl.w_User[:2000,:]

linear_prec, linear_recall = evaluate_topK(test, data, queries, topK)
print('linear_prec@{0} \t linear_recall@{0}'.format(topK))
print('{0}\t{1}'.format(linear_prec, linear_recall))

## Locality Sensitive Hashing

One of the most popular search protocols built on a Locality Sensitive Hashing structure is hash table lookup (illustrated below).

<img src="resources/images/lsh_retrieval.png" width="600">
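The lookup protocol can be sketched with sign random projections, a common hash family for cosine similarity. This is an illustrative toy under stated assumptions: the class `SignRandomProjectionIndex` is hypothetical and is not the `CosineHashFamily` implementation shipped in `utils.lsh`, whose internals may differ.

```python
import numpy as np
from collections import defaultdict

class SignRandomProjectionIndex:
    """Toy LSH index: n_tables hash tables, each keyed by an n_bits sign code."""

    def __init__(self, dim, n_tables, n_bits, seed=0):
        rng = np.random.default_rng(seed)
        # One set of n_bits random hyperplanes per table
        self.planes = rng.normal(size=(n_tables, n_bits, dim))
        self.tables = [defaultdict(list) for _ in range(n_tables)]

    def _codes(self, vector):
        # Bit j of table t is True when the vector lies on the positive side of plane (t, j)
        bits = (self.planes @ vector) > 0            # shape (n_tables, n_bits)
        return [tuple(row) for row in bits.tolist()]

    def add(self, item_id, vector):
        for table, code in zip(self.tables, self._codes(vector)):
            table[code].append(item_id)

    def candidates(self, query):
        # Union of the buckets the query falls into, one bucket per table
        found = set()
        for table, code in zip(self.tables, self._codes(query)):
            found.update(table.get(code, []))
        return found

rng = np.random.default_rng(1)
items = rng.normal(size=(500, 16))                   # hypothetical item vectors
index = SignRandomProjectionIndex(dim=16, n_tables=5, n_bits=6)
for i, v in enumerate(items):
    index.add(i, v)

query = rng.normal(size=16)
cands = sorted(index.candidates(query))
# Rank only the candidates by inner product instead of all 500 items
ranked = sorted(cands, key=lambda i: -(items[i] @ query))[:10]
print(len(cands) / len(items))                       # fraction of items touched
```

More bits per table make buckets smaller (fewer items touched, lower recall), while more tables enlarge the candidate set (more items touched, higher recall).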

### Performances without post-processing

In this experiment, we build the LSH index directly on the output of the PMF algorithm. Expect precision and recall to degrade compared with the linear scanning solution. We report three values:

1. relative_prec@10 = $\frac{\text{precision@10 of LSH indexing}}{\text{precision@10 of linear scanning}}$

2. relative_recall@10 = $\frac{\text{recall@10 of LSH indexing}}{\text{recall@10 of linear scanning}}$

3. touched = $\frac{\text{average number of items investigated by LSH}}{\text{total number of items}}$

In [None]:
topK = 10
b_vals = [4, 6, 8]
L_vals = [5, 10]

#queries = pmf_mvl.w_User
#data    = pmf_mvl.w_Item

print('#table\t #bit \t relative_prec@{0} \t relative_recall@{0} \t touched'.format(topK))
for nt in L_vals:
    print('-----------------------------------------------------------------------')
    for b in b_vals: 
        prec, recall, touched = evaluate_LSHTopK(test, data, -queries, CosineHashFamily, nt, b, dot, topK)
        print("{0}\t{1}\t{2}\t{3}\t{4}".format(nt, b, prec/linear_prec, recall/linear_recall, touched)) 


### Performances with post-processing Xbox transformation

Now, before building the LSH index, we first apply the Xbox transformation to both user and item vectors. The maximum inner product search on the original representation then becomes a maximum cosine similarity search on the new representation. Here $M = \max_i ||y_i||$ is the largest item norm:

\begin{equation}
P(y_i) = [y_i, \sqrt{M^2 - ||y_i||^2}];  Q(x_u) = [x_u, 0]
\end{equation}
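The reason this works: every transformed item has the same norm, $||P(y_i)|| = \sqrt{||y_i||^2 + M^2 - ||y_i||^2} = M$, and the appended zero leaves the inner product unchanged, so $\cos(Q(x_u), P(y_i)) = \frac{x_u^T y_i}{M \, ||x_u||}$. For a fixed user this is the inner product scaled by a constant, so the ranking of items is preserved. The following sketch checks this numerically on random data; all variable names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical item vectors with deliberately varied norms
items = rng.normal(size=(200, 8)) * rng.uniform(0.5, 3.0, size=(200, 1))
user = rng.normal(size=8)

norms = np.linalg.norm(items, axis=1)
M = norms.max()

# P(y_i) = [y_i, sqrt(M^2 - ||y_i||^2)] and Q(x_u) = [x_u, 0]
extra = np.sqrt(np.maximum(M**2 - norms**2, 0.0))   # clip tiny negative rounding errors
P = np.hstack([items, extra[:, None]])
Q = np.append(user, 0.0)

cosine = (P @ Q) / (np.linalg.norm(P, axis=1) * np.linalg.norm(Q))
inner = items @ user

# Check that every transformed item has norm M
print(np.allclose(np.linalg.norm(P, axis=1), M))
# Check that the best item is the same under both criteria
print(np.argmax(cosine) == np.argmax(inner))
```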


In [None]:
#apply Xbox transformation
topK = 10
b_vals = [4, 6, 8]
L_vals = [5, 10]


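M = np.linalg.norm(data, axis=1)   # norm of every item vector
max_norm = max(M)                  # largest item norm (M in the equation above)
q_norm = np.sqrt((queries * queries).sum(axis=1))
n_queries = queries/q_norm.reshape(queries.shape[0], 1)  # unit-length queries; the item ranking is unchanged

# P(y_i): append sqrt(M^2 - ||y_i||^2) as an extra coordinate
n_data = np.concatenate((data, np.sqrt(max_norm**2 - pow(M, 2)).reshape(data.shape[0], -1)), axis = 1)
n_data = n_data/max_norm # normalized data vectors
# Q(x_u): append a zero coordinate
n_queries = np.concatenate((n_queries, np.zeros((n_queries.shape[0], 1))), axis = 1)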

print('Done X-box transformation!')
print('#table\t #bit \t relative_prec@{0} \t relative_recall@{0} \t touched'.format(topK))
for nt in L_vals:
    print('-----------------------------------------------------------------------')
    for b in b_vals: 
        prec, recall, touched = evaluate_LSHTopK(test, n_data, -n_queries, CosineHashFamily, nt, b, dot, topK)
        print("{0}\t{1}\t{2}\t{3}\t{4}".format(nt, b, prec/linear_prec, recall/linear_recall, touched)) 
