In [1]:
!pip install hnswlib



In [2]:
import numpy as np
import hnswlib
import cornac
from cornac.data import Reader
from cornac.datasets import netflix
from cornac.eval_methods import RatioSplit
from cornac.models import MF, HNSWLibANN

## Recommender model training

The following experiment shows how to perform ANN search within Cornac. First, we need to train a model that supports ANN search. Here we choose MF for simplicity. Other models that support ANN search should work in a similar fashion.

In [3]:
data = netflix.load_feedback(variant="small", reader=Reader(bin_threshold=1.0))

ratio_split = RatioSplit(
    data=data,
    test_size=0.1,
    rating_threshold=1.0,
    exclude_unknowns=True,
    verbose=True,
    seed=123,
)

mf = MF(
    k=50, 
    max_iter=25, 
    learning_rate=0.01, 
    lambda_reg=0.02, 
    use_bias=False,
    verbose=True,
    seed=123,
)

auc = cornac.metrics.AUC()
rec_20 = cornac.metrics.Recall(k=20)

cornac.Experiment(
    eval_method=ratio_split,
    models=[mf],
    metrics=[auc, rec_20],
    user_based=True,
).run()

rating_threshold = 1.0
exclude_unknowns = True
---
Training data:
Number of users = 9986
Number of items = 4921
Number of ratings = 547022
Max rating = 1.0
Min rating = 1.0
Global mean = 1.0
---
Test data:
Number of users = 9986
Number of items = 4921
Number of ratings = 60747
Number of unknown users = 0
Number of unknown items = 0
---
Total users = 9986
Total items = 4921

[MF] Training started!


  0%|          | 0/25 [00:00<?, ?it/s]

Optimization finished!

[MF] Evaluation started!


Ranking:   0%|          | 0/8233 [00:00<?, ?it/s]


TEST:
...
   |    AUC | Recall@20 | Train (s) | Test (s)
-- + ------ + --------- + --------- + --------
MF | 0.8530 |    0.0669 |    0.9060 |   6.3865



## Building index for ANN recommender

After the MF model is trained, we wrap it with an ANN recommender. We use Cornac's built-in HNSWLibANN, which implements the [HNSW algorithm](https://arxiv.org/abs/1603.09320) for building the index and performing approximate K-nearest-neighbor search. For more on tuning the hyper-parameters, see https://github.com/nmslib/hnswlib and https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md.

In [4]:
ann = HNSWLibANN(
    model=mf,
    M=16,
    ef_construction=100,
    ef=50,
    seed=123,
    num_threads=-1,
)
ann.build_index()

## Time/accuracy tradeoff

Here we measure the tradeoff between efficiency and accuracy. Let's say we produce top-20 recommendations for 10,000 sampled users.

In [5]:
K = 20
N = 10000
test_users = np.random.RandomState(123).choice(mf.user_ids, size=N)

In [6]:
%%time

mf_recs = []
for uid in test_users:
    mf_recs.append(mf.recommend(uid, k=K))

CPU times: user 1min 14s, sys: 18.1 ms, total: 1min 14s
Wall time: 1.56 s


In [7]:
%%time

ann_recs = []
for uid in test_users:
    ann_recs.append(ann.recommend(uid, k=K))

CPU times: user 218 ms, sys: 32 µs, total: 218 ms
Wall time: 216 ms


MF took 1.56 s of wall time to complete the task, while ANN needed only 216 ms, a speedup of about 7 times. Note that our dataset contains fewer than 5,000 items. We would expect an even bigger improvement with more items and with higher-dimensional factors.
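
To see why exact recommendation cost grows with the number of items and the factor dimension, here is a minimal NumPy sketch of brute-force top-K retrieval by inner product. It is not Cornac's actual implementation; `exact_top_k` and the random arrays are hypothetical names and data used only for illustration.

```python
import numpy as np

def exact_top_k(user_vec, item_mat, k):
    """Return indices of the k items with the largest inner product, best first."""
    # One score per item: cost is proportional to num_items * dim.
    scores = item_mat @ user_vec
    # Select the k best without sorting the whole array.
    top = np.argpartition(-scores, k - 1)[:k]
    # Sort only the k winners by descending score.
    return top[np.argsort(-scores[top])]

rng = np.random.RandomState(123)
item_mat = rng.normal(size=(5000, 50))  # hypothetical item factors
user_vec = rng.normal(size=50)          # hypothetical user factors
top_items = exact_top_k(user_vec, item_mat, k=20)
```

Every query touches every item vector, whereas an HNSW index visits only a small part of the graph per query, which is where the gap widens as the catalog grows.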

In [8]:
recalls = []
for mf_rec, ann_rec in zip(mf_recs, ann_recs):
    recalls.append(len(set(mf_rec) & set(ann_rec)) / len(mf_rec))
print(np.mean(recalls) * 100.0)

99.87450000000001


In terms of recall, we see only a small drop of about 0.13%, meaning the recommendations of the two models are nearly identical. While it's almost a free lunch in this case, the numbers might differ in others. It's always good to verify that ANN stays consistent with the base model.
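
One way to make this consistency check reusable is to wrap the overlap computation in a helper that also reports the worst-case user, since a high average can hide a few users whose lists diverge. This is a minimal sketch; `overlap_report` and the toy lists are hypothetical and not part of Cornac.

```python
import numpy as np

def overlap_report(exact_recs, approx_recs):
    """Return (mean, min) fraction of exact top-K items recovered per user."""
    overlaps = [
        len(set(exact) & set(approx)) / len(exact)
        for exact, approx in zip(exact_recs, approx_recs)
    ]
    return float(np.mean(overlaps)), float(np.min(overlaps))

# Toy example: the first user loses one of four items; the second matches fully
# (order within the list does not matter for this measure).
exact = [["a", "b", "c", "d"], ["e", "f", "g", "h"]]
approx = [["a", "b", "c", "x"], ["h", "g", "f", "e"]]
mean_overlap, min_overlap = overlap_report(exact, approx)
```

In the notebook, it could be called as `overlap_report(mf_recs, ann_recs)`.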

## Save/load for deployment

In [9]:
ann.save("save_dir")

'save_dir/HNSWLibANN/2023-12-08_19-27-58-137671.pkl'

In [10]:
loaded_ann = HNSWLibANN.load("save_dir/HNSWLibANN")

Let's compare the top-K recommendations of the original ANN and the loaded ANN for the first 5 sampled users. They should be identical.

In [11]:
np.array_equal(
    ann.recommend_batch(test_users[:5], k=K), 
    loaded_ann.recommend_batch(test_users[:5], k=K),
)

True

One more test: the loaded ANN should achieve the same recall as the original one.

In [12]:
loaded_ann_recs = []
for uid in test_users:
    loaded_ann_recs.append(loaded_ann.recommend(uid, k=K))
    
recalls = []
for mf_rec, ann_rec in zip(mf_recs, loaded_ann_recs):
    recalls.append(len(set(mf_rec) & set(ann_rec)) / len(mf_rec))
print(np.mean(recalls) * 100.0)

99.87450000000001
