# Introduction

This notebook is a detailed introduction for anyone interested in this repository (basic knowledge of recommender systems is essential). DaisyRec aims to provide toy bricks for every component of a recommender system. Although `main.py` provides a quick interface for getting results, not every user wants to treat it as a black box, so this tutorial splits `main.py` into its individual steps and shows the intermediate results, giving you a direct view of how daisy works.

This tutorial takes the movielens-100k dataset as an example and builds a recommendation list step by step. Hope it will be helpful for you. :)

# Load Data

In [1]:
from daisy.utils.loader import load_rate

df, user_num, item_num = load_rate('ml-100k', '5core', binary=True)

Finish loading [ml-100k]-[5core] dataset


More details about the `load_rate` function can be viewed by typing `?load_rate` in a cell, as shown below. The same works for the other functions, so I'll omit this step in the following parts.

In [2]:
?load_rate

In [3]:
user_num, item_num

(943, 1349)

In [4]:
df.head()

Unnamed: 0,user,item,rating,timestamp
0,195,240,1.0,881250949
1,304,240,1.0,886307828
2,5,240,1.0,883268170
3,233,240,1.0,891033261
4,62,240,1.0,875747190


We can see that all user IDs and item IDs have already been converted to integer category codes. Now that the original experiment data is loaded, we need to split it into a training set and a test set. Here we use the fold-out strategy (also known as split-by-ratio) and hold out 20% of the data as the test set.

In [5]:
from daisy.utils.loader import get_ur
from daisy.utils.splitter import split_test

train_set, test_set = split_test(df, 'fo', .2)
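
For intuition, a split-by-ratio can be sketched in a few lines of plain Python. This is only an illustration under the assumption that the split is a random shuffle followed by a cut; `split_by_ratio` and the toy data are hypothetical and are not DaisyRec's `split_test` implementation.

```python
import random

def split_by_ratio(rows, test_size=0.2, seed=0):
    """Shuffle interaction rows and hold out a `test_size` fraction as the test set."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_size))
    # The first n_test shuffled rows become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]

# Toy (user, item) interactions, not taken from ml-100k.
interactions = [(u, i) for u in range(5) for i in range(4)]
train_rows, test_rows = split_by_ratio(interactions, test_size=0.2)
print(len(train_rows), len(test_rows))  # 16 4
```

Fixing the seed keeps the toy split reproducible between runs.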

For the later KPI calculation, we need to figure out the ground truth for each user.

In [6]:
# get ground truth
test_ur = get_ur(test_set)
total_train_ur = get_ur(train_set)

In [7]:
# tmp = list(test_ur.keys())[0]
# tmp, test_ur[tmp]
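
The ground truth is used later through lookups such as `i in test_ur[u]`, so it behaves like a mapping from each user to the items that user interacted with. A minimal sketch of building such a mapping, assuming that structure (the helper `build_user_items` is hypothetical and not the library's `get_ur`):

```python
from collections import defaultdict

def build_user_items(rows):
    """Group (user, item) pairs into a {user: set_of_items} dictionary."""
    user_items = defaultdict(set)
    for user, item in rows:
        user_items[user].add(item)
    return dict(user_items)

# Toy test-set interactions as (user, item) pairs.
toy_test_rows = [(0, 10), (0, 11), (1, 10), (2, 12)]
toy_ur = build_user_items(toy_test_rows)
print(toy_ur[0])
```

Using a set per user makes the membership test needed during evaluation a constant-time operation.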

# Run Algorithm

Taking BPR-MF as an example, we first need to draw some negative samples.

In [8]:
from daisy.utils.sampler import Sampler

sampler = Sampler(
    user_num, 
    item_num, 
    num_ng=4, 
    sample_method='uniform', 
    sample_ratio=1
)
neg_set = sampler.transform(train_set, is_training=True)

Finish negative samplings, sample number is 317716......


`neg_set` is a two-dimensional list whose elements are `[user, item, tag, negative set]` lists.
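
To make the row layout concrete, here is a rough sketch of uniform negative sampling that produces rows of the same shape. It assumes that a negative is any item the user has not interacted with in the given data; `sample_negatives` and the toy pairs are hypothetical and do not reproduce the internals of `Sampler`.

```python
import random

def sample_negatives(pairs, item_num, num_ng=4, seed=0):
    """Attach `num_ng` uniformly drawn unseen items to every (user, item) pair."""
    rng = random.Random(seed)
    seen = {}
    for user, item in pairs:
        seen.setdefault(user, set()).add(item)

    rows = []
    for user, item in pairs:
        negatives = []
        # Rejection sampling: redraw until the item is unseen by this user.
        # Assumes every user has at least one unseen item.
        while len(negatives) < num_ng:
            candidate = rng.randrange(item_num)
            if candidate not in seen[user]:
                negatives.append(candidate)
        rows.append([user, item, 1.0, negatives])
    return rows

toy_pairs = [(0, 1), (0, 2), (1, 3)]
toy_neg_rows = sample_negatives(toy_pairs, item_num=10, num_ng=4)
print(toy_neg_rows[0])
```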

After negative sampling is finished, we need to initialize the recommender class. BPR-MF is a pair-wise ranking method, so we take `PairMF` in `daisy.model.pair.MFRecommender` as the target model.

In [9]:
from daisy.model.pair.MFRecommender import PairMF

model = PairMF(
    user_num, 
    item_num,
    factors=15,
    epochs=50,
    lr=0.01,
    reg_1=0.,
    reg_2=0.01,
    loss_type='BPR',
)
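
As a reminder of what the `'BPR'` loss optimizes, the snippet below computes the standard BPR objective for a single (user, positive item, negative item) triple with dot-product scores. It is a pure-Python sketch of the textbook formula, `-log(sigmoid(score_pos - score_neg))`, without regularization; the vectors are made up and the code is not taken from `PairMF`.

```python
import math

def dot(a, b):
    """Dot product of two equally long vectors."""
    return sum(x * y for x, y in zip(a, b))

def bpr_loss(user_vec, pos_vec, neg_vec):
    """BPR loss for one triple: -log(sigmoid(score_pos - score_neg))."""
    diff = dot(user_vec, pos_vec) - dot(user_vec, neg_vec)
    # -log(sigmoid(d)) equals softplus(-d); this form avoids overflow for large |d|.
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))

toy_user = [1.0, 0.5]
toy_pos = [0.8, 0.2]   # item the user interacted with
toy_neg = [0.1, 0.4]   # sampled negative item
print(bpr_loss(toy_user, toy_pos, toy_neg))
```

The loss shrinks as the positive item is scored further above the negative one, which is what pushes observed items up the ranking.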

The following code follows the usual PyTorch training style: wrap the samples in a dataset, feed them through a `DataLoader`, and fit the model.

In [10]:
import torch
import torch.utils.data as data
from daisy.utils.data import PairData

train_dataset = PairData(neg_set, is_training=True)
train_loader = data.DataLoader(
    train_dataset, 
    batch_size=256, 
    shuffle=True, 
    num_workers=4
)
model.fit(train_loader)

[Epoch 001]: 100%|█████████████████████████████████████████████████████| 1242/1242 [00:11<00:00, 103.92it/s, loss=13.9]
[Epoch 002]: 100%|█████████████████████████████████████████████████████| 1242/1242 [00:11<00:00, 104.74it/s, loss=13.9]
[Epoch 003]: 100%|█████████████████████████████████████████████████████| 1242/1242 [00:11<00:00, 106.79it/s, loss=11.1]
[Epoch 004]: 100%|█████████████████████████████████████████████████████| 1242/1242 [00:11<00:00, 107.25it/s, loss=9.73]
[Epoch 005]: 100%|█████████████████████████████████████████████████████| 1242/1242 [00:11<00:00, 110.58it/s, loss=8.57]
[Epoch 006]: 100%|█████████████████████████████████████████████████████| 1242/1242 [00:11<00:00, 109.62it/s, loss=12.9]
[Epoch 007]: 100%|█████████████████████████████████████████████████████| 1242/1242 [00:11<00:00, 108.70it/s, loss=7.32]
[Epoch 008]: 100%|█████████████████████████████████████████████████████| 1242/1242 [00:11<00:00, 107.38it/s, loss=9.96]
[Epoch 009]: 100%|██████████████████████

After training, we need to build a candidate set that contains the ground truth for each user in order to calculate the metrics. Here we set 1000 candidates for each user in the test set as an example.

In [13]:
from daisy.utils.loader import build_candidates_set

item_pool = set(range(item_num))
candidates_num = 1000
test_ucands = build_candidates_set(test_ur, total_train_ur, item_pool, candidates_num)
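
One plausible way to build such a candidate set is to keep each user's ground-truth items and fill the remaining slots with random items the user has not seen in training. The sketch below works under that assumption; `build_candidates` is a hypothetical helper and may differ from what `build_candidates_set` actually does.

```python
import random

def build_candidates(test_ur, train_ur, item_pool, candidates_num, seed=0):
    """Return {user: candidates} made of ground truth plus random unseen items."""
    rng = random.Random(seed)
    candidates = {}
    for user, truth in test_ur.items():
        # Items that are neither ground truth nor already seen in training.
        unseen = sorted(item_pool - truth - train_ur.get(user, set()))
        n_fill = min(max(candidates_num - len(truth), 0), len(unseen))
        candidates[user] = list(truth) + rng.sample(unseen, n_fill)
    return candidates

toy_pool = set(range(20))
toy_test_truth = {0: {1, 2}}
toy_train_seen = {0: {3, 4, 5}}
toy_cands = build_candidates(toy_test_truth, toy_train_seen, toy_pool, candidates_num=10)
print(len(toy_cands[0]))  # 10
```

Excluding training items keeps the model from being rewarded for recommending what the user has already interacted with.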

In [18]:
from tqdm import tqdm
import pandas as pd

preds = {}
topk = 10
for u in tqdm(test_ucands.keys()):
    tmp = pd.DataFrame({
        'user': [u for _ in test_ucands[u]], 
        'item': test_ucands[u], 
        'rating': [0. for _ in test_ucands[u]], # dummy label, its value is meaningless
    })
    
    tmp_neg_set = sampler.transform(tmp, is_training=False)
    tmp_dataset = PairData(tmp_neg_set, is_training=False)
    tmp_loader = data.DataLoader(
        tmp_dataset,
        batch_size=candidates_num, 
        shuffle=False, 
        num_workers=0
    )
    
    for items in tmp_loader:
        user_u, item_i = items[0], items[1]
        user_u = user_u.cpu()
        item_i = item_i.cpu()
        
        prediction = model.predict(user_u, item_i)
        
        _, indices = torch.topk(prediction, topk)
        top_n = torch.take(torch.tensor(test_ucands[u]), indices).cpu().numpy()
        
    preds[u] = top_n
    
# convert each ranked list to binary hit flags (1 = item is in the user's ground truth)
for u in preds.keys():
    preds[u] = [1 if i in test_ur[u] else 0 for i in preds[u]]

100%|████████████████████████████████████████████████████████████████████████████████| 941/941 [01:03<00:00, 14.75it/s]
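
The two steps in the cell above, picking the top-k candidates by predicted score and flagging which of them are ground-truth items, can be illustrated without PyTorch. The helper `top_k_hits` and the scores below are made up for the illustration.

```python
def top_k_hits(candidates, scores, truth, k):
    """Rank candidates by score (descending) and flag the top-k that are in `truth`."""
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    top_items = [item for item, _ in ranked[:k]]
    hits = [1 if item in truth else 0 for item in top_items]
    return top_items, hits

toy_items, toy_flags = top_k_hits(
    candidates=[10, 11, 12, 13],
    scores=[0.2, 0.9, 0.5, 0.7],
    truth={11, 12},
    k=3,
)
print(toy_items, toy_flags)  # [11, 13, 12] [1, 0, 1]
```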


# Calculating Metrics

In [21]:
import numpy as np
from daisy.utils.metrics import precision_at_k, recall_at_k, map_at_k, hr_at_k, ndcg_at_k, mrr_at_k

tmp_preds = preds.copy()        
tmp_preds = {key: rank_list[:topk] for key, rank_list in tmp_preds.items()}

pre_k = np.mean([precision_at_k(r, topk) for r in tmp_preds.values()])
rec_k = recall_at_k(tmp_preds, test_ur, topk)
hr_k = hr_at_k(tmp_preds, test_ur)
map_k = map_at_k(tmp_preds.values())
mrr_k = mrr_at_k(tmp_preds, topk)
ndcg_k = np.mean([ndcg_at_k(r, topk) for r in tmp_preds.values()])


print(f'Precision@{topk}: {pre_k:.4f}')
print(f'Recall@{topk}: {rec_k:.4f}')
print(f'HR@{topk}: {hr_k:.4f}')
print(f'MAP@{topk}: {map_k:.4f}')
print(f'MRR@{topk}: {mrr_k:.4f}')
print(f'NDCG@{topk}: {ndcg_k:.4f}')

Precision@10: 0.2914
Recall@10: 0.2034
HR@10: 0.8959
MAP@10: 0.1814
MRR@10: 0.9728
NDCG@10: 0.6157
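
To show how such numbers arise from the binary hit lists, here are common textbook definitions of three of the metrics for a single user. These `toy_*` functions are illustrative assumptions and may differ in detail from the implementations in `daisy.utils.metrics`.

```python
import math

def toy_precision_at_k(hits, k):
    """Fraction of the top-k positions that are hits."""
    return sum(hits[:k]) / k

def toy_reciprocal_rank(hits, k):
    """1 / rank of the first hit within the top-k, or 0.0 if there is none."""
    for rank, hit in enumerate(hits[:k], start=1):
        if hit:
            return 1.0 / rank
    return 0.0

def toy_ndcg_at_k(hits, k):
    """DCG of the list divided by the DCG of the ideally ordered list."""
    def dcg(values):
        return sum(v / math.log2(pos + 2) for pos, v in enumerate(values))
    ideal = dcg(sorted(hits, reverse=True)[:k])
    return dcg(hits[:k]) / ideal if ideal > 0 else 0.0

toy_hits = [1, 0, 1, 0, 0]  # hit flags of one user's ranked list
print(toy_precision_at_k(toy_hits, 5))  # 0.4
print(toy_reciprocal_rank(toy_hits, 5))  # 1.0
print(toy_ndcg_at_k(toy_hits, 5))
```

Averaging these per-user values over all test users gives dataset-level scores.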


# Overview

All of the code above is wrapped in `main.py` and is equivalent to the following command. You can get the same result by running it in a console.


```
python main.py --problem_type=pair --algo_name=mf --loss_type=BPR --num_ng=4 --lr=0.01 --reg_1=0 --reg_2=0.01 --factors=15 --epochs=50
```