[View in Colaboratory](https://colab.research.google.com/github/ylongqi/openrec/blob/master/tutorials/OpenRec_Tutorial_1.ipynb)


Get Started
---
by *[Longqi@Cornell](http://www.cs.cornell.edu/~ylongqi)* licensed under [Creative Commons Attribution 4.0 International License](https://creativecommons.org/licenses/by/4.0/)

This tutorial demonstrates the process of training and evaluating recommendation algorithms using OpenRec (>=0.2.0):

*   Prepare training and evaluation datasets.
*   Instantiate samplers for training and evaluation.
*   Instantiate a recommender.
*   Instantiate evaluators.
*   Instantiate a model trainer.
*   TRAIN AND EVALUATE!

Prepare training and evaluation datasets
---
*   Download your favorite dataset from the web. In this tutorial, we use [a relatively small citeulike dataset](http://www.wanghao.in/CDL.htm) for demonstration purposes.

In [0]:
!apt-get install unrar
!pip install openrec

import os
try:
    from urllib.request import urlretrieve
except ImportError:
    from urllib import urlretrieve

urlretrieve('http://www.wanghao.in/data/ctrsr_datasets.rar', 'ctrsr_datasets.rar')
os.system('unrar x ctrsr_datasets.rar')

*   Convert the raw data into a [numpy structured array](https://docs.scipy.org/doc/numpy-1.13.0/user/basics.rec.html). The [Dataset](https://github.com/ylongqi/openrec/blob/master/openrec/utils/dataset.py) class requires two keys, **user_id** and **item_id**. Each row in the converted numpy array represents one interaction. The array may contain additional keys, depending on the use case.
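As a minimal sketch of what such an array looks like, the snippet below builds one from a handful of made-up interactions. The names `toy_pairs`, `toy_arr`, and the extra `timestamp` key are hypothetical and only serve as illustration.

```python
import numpy as np

# Hypothetical toy interactions: (user_id, item_id) pairs.
toy_pairs = [(0, 10), (0, 11), (1, 10), (2, 12)]

# The two keys that the Dataset class expects.
interaction_dtype = [('user_id', np.int32), ('item_id', np.int32)]
toy_arr = np.array(toy_pairs, dtype=interaction_dtype)

# Each row is one interaction; columns are accessed by key.
print(toy_arr['user_id'])
print(toy_arr['item_id'])

# An additional key (here an assumed 'timestamp') is just another field.
extended_dtype = interaction_dtype + [('timestamp', np.int64)]
extended = np.zeros(len(toy_arr), dtype=extended_dtype)
extended['user_id'] = toy_arr['user_id']
extended['item_id'] = toy_arr['item_id']
print(extended.dtype.names)
```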



In [0]:
import numpy as np
import random

total_users = 0
interactions_count = 0
with open('ctrsr_datasets/citeulike-a/users.dat', 'r') as fin:
    for line in fin:
        interactions_count += int(line.split()[0])
        total_users += 1

# Randomly hold out one item per user for validation and one for testing.
val_structured_arr = np.zeros(total_users, dtype=[('user_id', np.int32), 
                                                  ('item_id', np.int32)]) 
test_structured_arr = np.zeros(total_users, dtype=[('user_id', np.int32), 
                                                   ('item_id', np.int32)])
train_structured_arr = np.zeros(interactions_count-total_users * 2, 
                                dtype=[('user_id', np.int32), 
                                       ('item_id', np.int32)])

interaction_ind = 0
next_user_id = 0
next_item_id = 0
map_to_item_id = dict()  # Map raw item ids to contiguous ids in [0, len(items) - 1]

with open('ctrsr_datasets/citeulike-a/users.dat', 'r') as fin:
    for line in fin:
        item_list = line.split()[1:]
        random.shuffle(item_list)
        for ind, item in enumerate(item_list):
            if item not in map_to_item_id:
                map_to_item_id[item] = next_item_id
                next_item_id += 1
            if ind == 0:
                val_structured_arr[next_user_id] = (next_user_id, 
                                                    map_to_item_id[item])
            elif ind == 1:
                test_structured_arr[next_user_id] = (next_user_id, 
                                                     map_to_item_id[item])
            else:
                train_structured_arr[interaction_ind] = (next_user_id, 
                                                         map_to_item_id[item])
                interaction_ind += 1
        next_user_id += 1
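
The hold-out logic above can be checked on a toy example. The sketch below is not part of OpenRec; `leave_two_out` and `toy_user_items` are hypothetical names, and it assumes every user has at least two items. It applies the same rule: after shuffling, the first item goes to validation, the second to testing, and the rest to training.

```python
import random

def leave_two_out(user_items, seed=0):
    """Split each user's items into train / validation / test pairs."""
    rng = random.Random(seed)
    train, val, test = [], [], []
    for user_id, items in enumerate(user_items):
        shuffled = list(items)
        rng.shuffle(shuffled)
        val.append((user_id, shuffled[0]))    # one validation item per user
        test.append((user_id, shuffled[1]))   # one test item per user
        train.extend((user_id, item) for item in shuffled[2:])
    return train, val, test

# Hypothetical data: 3 users with 4 + 3 + 5 = 12 interactions in total.
toy_user_items = [[3, 5, 7, 9], [1, 3, 4], [2, 5, 6, 8, 9]]
train, val, test = leave_two_out(toy_user_items)
print(len(train), len(val), len(test))  # 6 3 3
```

The training size matches the `interactions_count - total_users * 2` expression used to allocate `train_structured_arr`.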

*   Instantiate training, validation, and testing datasets using the Dataset class.

In [0]:
from openrec.tf1.utils import Dataset

train_dataset = Dataset(raw_data=train_structured_arr,
                        total_users=total_users, 
                        total_items=len(map_to_item_id), 
                        name='Train')
val_dataset = Dataset(raw_data=val_structured_arr,
                      total_users=total_users,
                      total_items=len(map_to_item_id),
                      num_negatives=500,
                      name='Val')
test_dataset = Dataset(raw_data=test_structured_arr,
                       total_users=total_users,
                       total_items=len(map_to_item_id),
                       num_negatives=500,
                       name='Test')

Instantiate samplers
---
*  For training, **RandomPairwiseSampler** is used, i.e., each instance contains a user, an item that the user interacted with, and an item that the user did NOT interact with.
*  For evaluation, **EvaluationSampler** is used. It feeds in user interaction data one user at a time. For each user, the (relevant and irrelevant) items are divided into batches and evaluated separately.
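
To make the pairwise training instance concrete, here is a rough sketch of how one (user, positive item, negative item) triplet could be drawn. It is an illustration under stated assumptions, not OpenRec's implementation: `sample_triplet` and the toy data are hypothetical, interactions are picked uniformly, and negatives are found by rejection sampling.

```python
import random

def sample_triplet(interactions, user_pos_items, total_items, rng):
    """Draw one (user, positive item, negative item) training instance."""
    # Pick an observed interaction uniformly at random.
    user_id, pos_item = rng.choice(interactions)
    # Rejection-sample an item the user has not interacted with.
    # Assumes no user has interacted with every item.
    neg_item = rng.randrange(total_items)
    while neg_item in user_pos_items[user_id]:
        neg_item = rng.randrange(total_items)
    return user_id, pos_item, neg_item

# Hypothetical toy data: 3 users, 6 items.
toy_interactions = [(0, 1), (0, 2), (1, 0), (2, 4), (2, 5)]
toy_pos = {}
for u, i in toy_interactions:
    toy_pos.setdefault(u, set()).add(i)

rng = random.Random(0)
batch = [sample_triplet(toy_interactions, toy_pos, 6, rng) for _ in range(4)]
print(batch)
```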

In [0]:
from openrec.tf1.utils.samplers import RandomPairwiseSampler
from openrec.tf1.utils.samplers import EvaluationSampler

train_sampler = RandomPairwiseSampler(batch_size=1000, 
                                      dataset=train_dataset, 
                                      num_process=5)
val_sampler = EvaluationSampler(batch_size=1000, 
                                dataset=val_dataset)
test_sampler = EvaluationSampler(batch_size=1000, 
                                 dataset=test_dataset)

Instantiate a recommender
---
*  We use the [BPR recommender](https://github.com/ylongqi/openrec/blob/master/openrec/recommenders/bpr.py) that implements the pure Bayesian Personalized Ranking (BPR) algorithm.
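
For intuition, the textbook BPR objective scores each user-item pair by a dot product of embeddings and penalizes triplets where the negative item outscores the positive one. The sketch below is a plain numpy illustration of that loss, not OpenRec's code; the function `bpr_loss` is hypothetical, and the library's recommender may add bias or regularization terms.

```python
import numpy as np

def bpr_loss(user_vec, pos_item_vec, neg_item_vec):
    """Mean of -log(sigmoid(score(u, i) - score(u, j))) over a batch."""
    pos_score = np.sum(user_vec * pos_item_vec, axis=-1)
    neg_score = np.sum(user_vec * neg_item_vec, axis=-1)
    # -log(sigmoid(x)) == log(1 + exp(-x)); logaddexp keeps this stable.
    return np.mean(np.logaddexp(0.0, -(pos_score - neg_score)))

# Random 50-dimensional embeddings for a batch of 4 triplets
# (50 mirrors dim_user_embed / dim_item_embed used in this tutorial).
rng = np.random.default_rng(0)
users = rng.normal(size=(4, 50))
pos = rng.normal(size=(4, 50))
neg = rng.normal(size=(4, 50))
print(bpr_loss(users, pos, neg))
```

When the positive and negative scores are equal the loss is log(2), and it shrinks toward zero as positives are ranked further above negatives.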

In [0]:
from openrec.tf1.recommenders import BPR

bpr_model = BPR(batch_size=1000, 
                total_users=train_dataset.total_users(), 
                total_items=train_dataset.total_items(), 
                dim_user_embed=50, 
                dim_item_embed=50, 
                save_model_dir='bpr_recommender/', 
                train=True, serve=True)

Instantiate evaluators
---
*  Define the evaluators that you plan to use. This tutorial evaluates the recommender using Area Under the Curve (AUC).
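
As a rough illustration of the metric, with one held-out item per user, AUC reduces to the fraction of negative items that the model scores below the held-out item. The helper `auc_for_user` below is a hypothetical sketch and not OpenRec's evaluator, which may handle ties or multiple positives differently.

```python
def auc_for_user(pos_score, neg_scores):
    """Fraction of negatives ranked below the positive (ties count half)."""
    wins = sum(1.0 for s in neg_scores if pos_score > s)
    ties = sum(0.5 for s in neg_scores if pos_score == s)
    return (wins + ties) / len(neg_scores)

# The held-out item outscores 3 of the 4 negatives.
print(auc_for_user(0.9, [0.1, 0.4, 0.95, 0.3]))  # 0.75
```

A value of 0.5 corresponds to random ranking and 1.0 to the held-out item being ranked above every negative.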



In [0]:
from openrec.tf1.utils.evaluators import AUC

auc_evaluator = AUC()

Instantiate a model trainer
---
*  The model trainer wraps a recommender and makes it ready for training and evaluation.

In [0]:
from openrec.tf1 import ModelTrainer

model_trainer = ModelTrainer(model=bpr_model)

TRAIN AND EVALUATE
---

In [0]:
model_trainer.train(total_iter=10000,  # Total number of training iterations
                    eval_iter=1000,    # Evaluate the model every "eval_iter" iterations
                    save_iter=10000,   # Save the model every "save_iter" iterations
                    train_sampler=train_sampler, 
                    eval_samplers=[val_sampler, test_sampler], 
                    evaluators=[auc_evaluator])
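
To see how the three iteration arguments interact, the sketch below lists when evaluation and saving would fire. It assumes the trainer triggers each action whenever the iteration count is a multiple of the corresponding argument, which is an assumption about the schedule rather than a description of OpenRec's internals; `schedule` is a hypothetical helper.

```python
def schedule(total_iter, eval_iter, save_iter):
    """List (iteration, action) events under a simple modulo schedule."""
    events = []
    for it in range(1, total_iter + 1):
        if it % eval_iter == 0:
            events.append((it, 'eval'))
        if it % save_iter == 0:
            events.append((it, 'save'))
    return events

events = schedule(total_iter=10000, eval_iter=1000, save_iter=10000)
print(sum(1 for _, action in events if action == 'eval'))  # 10
print([e for e in events if e[1] == 'save'])  # [(10000, 'save')]
```

Under this assumption, the settings used above evaluate on both samplers ten times and save the model once, at the final iteration.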