<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>

<i>Licensed under the MIT License.</i>

In [3]:
import sys
import os
import logging
import papermill as pm
import scrapbook as sb
from tempfile import TemporaryDirectory
import pandas as pd
import numpy as np
import tensorflow as tf
tf.get_logger().setLevel('ERROR') # only show error messages

from recommenders.utils.timer import Timer
from recommenders.utils.constants import SEED
from recommenders.models.deeprec.deeprec_utils import prepare_hparams

from recommenders.datasets.amazon_reviews import download_and_extract, data_preprocessing
from recommenders.datasets.download_utils import maybe_download

from recommenders.models.deeprec.models.sequential.sli_rec import SLI_RECModel as SeqModel
from recommenders.models.deeprec.io.sequential_iterator import SequentialIterator

print("System version: {}".format(sys.version))
print("Tensorflow version: {}".format(tf.__version__))

System version: 3.7.0 (default, Jun 28 2018, 08:04:48) [MSC v.1912 64 bit (AMD64)]
Tensorflow version: 1.15.0


# 1. Input Data Format
The input data contains 8 tab-separated columns: `<label> <user_id> <item_id> <category_id> <timestamp> <history_item_ids> <history_category_ids> <history_timestamp>`. `item_id` and `category_id` denote the target item and its category: for each instance, we want to predict whether user `user_id` will interact with `item_id` at `timestamp`. The `<history_*>` columns record the user's behavior list up to `<timestamp>`, with elements separated by commas. `<label>` is a binary value: 1 for positive instances and 0 for negative instances. An example instance:

`1	523120	14414	3	1612282953577	14330,14135,5877	3,0,8	1611878652251,1611935700202,1612166343656`
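
As a minimal sketch of how one such line can be split into its 8 fields, the snippet below parses the example instance. The `Instance` tuple and `parse_instance` helper are hypothetical names used only for illustration; they are not part of the `recommenders` library, whose iterator does its own parsing.

```python
from typing import List, NamedTuple


class Instance(NamedTuple):
    """One row of the input file (hypothetical container for illustration)."""
    label: int
    user_id: str
    item_id: str
    category_id: str
    timestamp: int
    history_item_ids: List[str]
    history_category_ids: List[str]
    history_timestamps: List[int]


def parse_instance(line: str) -> Instance:
    """Split a tab-separated line into the 8 columns described above."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError("expected 8 tab-separated columns, got {}".format(len(fields)))
    label, user, item, cate, ts, h_items, h_cates, h_ts = fields
    return Instance(
        label=int(label),
        user_id=user,
        item_id=item,
        category_id=cate,
        timestamp=int(ts),
        # history columns are comma-separated lists
        history_item_ids=h_items.split(","),
        history_category_ids=h_cates.split(","),
        history_timestamps=[int(t) for t in h_ts.split(",")],
    )


example = ("1\t523120\t14414\t3\t1612282953577\t14330,14135,5877\t3,0,8\t"
           "1611878652251,1611935700202,1612166343656")
inst = parse_instance(example)
print(inst.label, inst.item_id, inst.history_item_ids)
```

The three history lists are aligned by position, so they should always have the same length.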

In the data preprocessing stage, a script generates ID mapping dictionaries, so that user_id, item_id and category_id are mapped to integer indices starting from 1. You need to tell the input iterator where the ID mapping files are (for example, in the next section we have mapping files named user_vocab, item_vocab, and cate_vocab). The data preprocessing script is at [recommenders/datasets/amazon_reviews.py](../nhat_hoang/recommenders/datasets/amazon_reviews.py). Note that the ID vocabulary is created only from train_file, so new IDs in valid_file or test_file are treated as unknown IDs and assigned the default index 0.
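
The mapping rule can be illustrated with a small sketch. The helpers `build_vocab` and `encode` are hypothetical and only mirror the behavior described here (indices start at 1, unseen IDs map to 0); the actual preprocessing script may order or store the vocabulary differently.

```python
def build_vocab(train_ids):
    """Assign each distinct ID seen in training an integer index starting from 1."""
    vocab = {}
    for raw_id in train_ids:
        if raw_id not in vocab:
            vocab[raw_id] = len(vocab) + 1  # index 0 is reserved for unknown IDs
    return vocab


def encode(raw_id, vocab, unknown_index=0):
    """Look up an ID; IDs that never appeared in training fall back to 0."""
    return vocab.get(raw_id, unknown_index)


# Toy data: item IDs observed in a training file
train_items = ["14414", "14330", "14135", "14330"]
item_vocab_demo = build_vocab(train_items)

# "99999" stands for an item that appears only in the validation or test file
encoded = [encode(i, item_vocab_demo) for i in ["14330", "99999"]]
print(item_vocab_demo, encoded)
```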

We use Softmax as the loss function. In the training and evaluation stages, we group 1 positive instance with num_ngs negative instances. Pair-wise ranking can be regarded as a special case of Softmax ranking, where num_ngs is set to 1.

More specifically, for training and evaluation, you need to organize the data file such that each positive instance is followed by num_ngs negative instances. Our program takes 1+num_ngs lines as a unit for the Softmax calculation. num_ngs is a parameter you need to pass to the `prepare_hparams`, `fit` and `run_eval` functions. `train_num_ngs` in `prepare_hparams` denotes the number of negative instances for training; the recommended value is 4. `valid_num_ngs` in `fit` and `num_ngs` in `run_eval` denote the number of negative instances for evaluation. In evaluation, the model calculates metrics among the 1+num_ngs instances. The `predict` function only needs to calculate a score for each individual instance, so it does not need a num_ngs setting. More details and examples are provided in the following sections.
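
To make the grouping concrete, here is a minimal NumPy sketch of a Softmax loss computed over units of 1+num_ngs scores. The function `grouped_softmax_loss` is a hypothetical illustration, not the library's implementation, and it assumes the scores arrive in file order (the positive first, then its negatives).

```python
import numpy as np


def grouped_softmax_loss(logits, labels, num_ngs):
    """Mean cross-entropy where each group of 1 + num_ngs scores shares one softmax."""
    group_size = 1 + num_ngs
    # reshape raises ValueError if the number of lines is not a multiple of group_size
    logits = np.asarray(logits, dtype=float).reshape(-1, group_size)
    labels = np.asarray(labels, dtype=float).reshape(-1, group_size)
    # numerically stable log-softmax within each group
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-(labels * log_probs).sum(axis=1).mean())


# All candidates scored equally: the loss equals log(1 + num_ngs)
uniform_loss = grouped_softmax_loss([0.0] * 5, [1, 0, 0, 0, 0], num_ngs=4)

# num_ngs=1 reduces to a pair-wise loss, -log(sigmoid(s_pos - s_neg))
pairwise_loss = grouped_softmax_loss([2.0, 0.5], [1, 0], num_ngs=1)
print(uniform_loss, pairwise_loss)
```

The second call shows why pair-wise ranking is the num_ngs=1 special case: with two candidates, the softmax probability of the positive is the sigmoid of the score difference.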

In [4]:
yaml_file = '../nhat_hoang/recommenders/models/deeprec/config/sli_rec.yaml'

In [5]:
EPOCHS = 10
BATCH_SIZE = 400
RANDOM_SEED = SEED  # Set None for non-deterministic result

data_path = os.path.join("..", "nhat_hoang", "test_slirec", "sampled_dataset2")

In [6]:
# for test
train_file = os.path.join(data_path, r'train_data')
valid_file = os.path.join(data_path, r'valid_data')
test_file = os.path.join(data_path, r'test_data')
user_vocab = os.path.join(data_path, r'user_vocab.pkl')
item_vocab = os.path.join(data_path, r'item_vocab.pkl')
cate_vocab = os.path.join(data_path, r'category_vocab.pkl')
output_file = os.path.join(data_path, r'output.txt')

train_num_ngs = 4 # number of negative instances with a positive instance for training
valid_num_ngs = 4 # number of negative instances with a positive instance for validation
test_num_ngs = 9 # number of negative instances with a positive instance for testing
sample_rate = 0.01 # sample a small item set for training and testing here for fast example; otherwise, sample_rate=1

if not os.path.exists(train_file):
    data_preprocessing(train_file, valid_file, test_file, user_vocab, item_vocab, cate_vocab,
                       sample_rate=sample_rate, valid_num_ngs=valid_num_ngs, test_num_ngs=test_num_ngs)

### 1.1 Prepare hyper-parameters

`prepare_hparams()` creates a full set of hyper-parameters for model training, such as learning rate, feature number, and dropout ratio. We can put those parameters in a yaml file (a complete list of parameters can be found under our config folder), or pass them as arguments to the function (which will overwrite the yaml settings).

Parameter hints:
`need_sample` controls whether to perform dynamic negative sampling in each mini-batch. `train_num_ngs` indicates how many negative instances follow each positive instance.
Examples:
- `need_sample=True` and `train_num_ngs=4`: There are only positive instances in your training file. Our model will dynamically sample 4 negative instances for each positive instance in the mini-batch. Note that if `need_sample` is set to True, `train_num_ngs` should be greater than zero.
- `need_sample=False` and `train_num_ngs=4`: In your training file, each positive line is followed by 4 negative lines. Note that if `need_sample` is set to False, you must provide a training file with negative instances, and `train_num_ngs` should match the number of negative instances per positive instance in the training file.
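
The `need_sample=True` case can be pictured with the sketch below, which draws negatives for one positive instance from a pool of candidate items. `build_training_unit` is a hypothetical helper; how the library's iterator actually chooses its negatives (for example, which pool it samples from) is not shown in this notebook and may differ.

```python
import random


def build_training_unit(positive_item, candidate_items, num_ngs, rng):
    """Return one positive followed by num_ngs sampled negatives as (label, item) pairs."""
    if num_ngs <= 0:
        raise ValueError("num_ngs must be greater than zero when sampling negatives")
    # never sample the positive item itself as a negative
    pool = [item for item in candidate_items if item != positive_item]
    negatives = rng.sample(pool, num_ngs)  # raises ValueError if the pool is too small
    return [(1, positive_item)] + [(0, item) for item in negatives]


rng = random.Random(42)  # fixed seed so the illustration is repeatable
unit = build_training_unit(14414, [14414, 14330, 14135, 5877, 101, 202],
                           num_ngs=4, rng=rng)
print(unit)
```

The resulting unit has the same 1+num_ngs layout that a pre-sampled training file would need when `need_sample=False`.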

In [5]:
hparams = prepare_hparams(yaml_file, 
                          embed_l2=0., 
                          layer_l2=0., 
                          learning_rate=0.001,  # set to 0.01 if batch normalization is disabled
                          epochs=EPOCHS,
                          batch_size=BATCH_SIZE,
                          show_step=20,
                          MODEL_DIR=os.path.join(data_path, "model/"),
                          SUMMARIES_DIR=os.path.join(data_path, "summary/"),
                          user_vocab=user_vocab,
                          item_vocab=item_vocab,
                          cate_vocab=cate_vocab,
                          need_sample=True,
                          train_num_ngs=train_num_ngs, # provides the number of negative instances for each positive instance for loss computation.
)

### 1.2 Create data loader
Designate a data iterator for the model. All our sequential models use SequentialIterator. The data format is introduced above.


Validation and testing data are files produced by offline negative sampling, with `valid_num_ngs` and `test_num_ngs` negative instances per positive instance, respectively.
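
Because evaluation relies on this layout, it can help to sanity-check a file before running it through the model. The following is a hedged sketch of such a check; `check_grouping` is a hypothetical helper, and it assumes the label is the first tab-separated field and the positive line comes first in each group, as described earlier.

```python
def check_grouping(lines, num_ngs):
    """Return True if every group is one positive line followed by num_ngs negatives."""
    group_size = 1 + num_ngs
    # the label is the first tab-separated field; blank lines are ignored
    labels = [line.split("\t", 1)[0] for line in lines if line.strip()]
    if len(labels) % group_size != 0:
        return False
    for start in range(0, len(labels), group_size):
        group = labels[start:start + group_size]
        if group[0] != "1" or any(label != "0" for label in group[1:]):
            return False
    return True


# Toy lines with only the first two columns filled in, for illustration
good_lines = ["1\tu1", "0\tu1", "0\tu1", "1\tu2", "0\tu2", "0\tu2"]
bad_lines = ["1\tu1", "0\tu1", "1\tu2", "0\tu2", "0\tu2", "0\tu2"]
print(check_grouping(good_lines, num_ngs=2), check_grouping(bad_lines, num_ngs=2))
```

For a real file, the lines could be read with `open(valid_file)` and checked against `valid_num_ngs`, or `test_file` against `test_num_ngs`.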

In [6]:
input_creator = SequentialIterator

# 2. Create Model

In [7]:
model = SeqModel(hparams, input_creator, seed=RANDOM_SEED)

# if you don't want to train a new model from scratch, load a pre-trained model like this: 
# model.load_model(os.path.join(hparams.MODEL_DIR, "best_model"))

Performance before training

In [8]:
model.run_eval(test_file, num_ngs=test_num_ngs)

{'auc': 0.5513,
 'logloss': 0.6931,
 'mean_mrr': 0.2218,
 'group_auc': 0.5499,
 'ndcg@1': 0.0083,
 'ndcg@3': 0.1026,
 'ndcg@5': 0.2685,
 'hit@1': 0.0083,
 'hit@3': 0.1752,
 'hit@5': 0.587}

An AUC of 0.5 corresponds to random guessing. Before training, the model's AUC (0.5513) is close to that level, so it behaves almost like a random guesser.
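
For intuition about the group-level metrics, the sketch below ranks the positive within a group of 1+num_ngs scores and derives hit@k, ndcg@k and the reciprocal rank from that rank, then simulates a purely random scorer. These helper functions are hypothetical and simplified (one positive per group, ties resolved in favor of the positive); the library's metric code may handle details differently, so the numbers are not expected to match the output above.

```python
import numpy as np


def rank_of_positive(scores):
    """1-based rank of the first score (the positive) within its group."""
    scores = np.asarray(scores, dtype=float)
    return int(1 + np.sum(scores[1:] > scores[0]))


def hit_at_k(rank, k):
    return 1.0 if rank <= k else 0.0


def ndcg_at_k(rank, k):
    # with a single relevant item, the ideal DCG is 1
    return float(1.0 / np.log2(rank + 1)) if rank <= k else 0.0


# Simulate a random scorer on groups of 1 + 9 candidates (test_num_ngs = 9)
rng = np.random.default_rng(0)
group_size = 1 + 9
ranks = [rank_of_positive(rng.random(group_size)) for _ in range(2000)]
random_hit5 = float(np.mean([hit_at_k(r, 5) for r in ranks]))
random_mrr = float(np.mean([1.0 / r for r in ranks]))
print(random_hit5, random_mrr)
```

Under this simplified setup, a random scorer places the positive uniformly among the 10 positions, so its hit@5 hovers around 0.5.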

### Train model

In [9]:
with Timer() as train_time:
    model = model.fit(train_file, valid_file, valid_num_ngs=valid_num_ngs) 

# valid_num_ngs is the number of negative lines after each positive line in your valid_file 
# we will evaluate the performance of model on valid_file every epoch
print('Time cost for training is {0:.2f} mins'.format(train_time.interval/60.0))

step 20 , total_loss: 0.8534, data_loss: 0.8534
step 40 , total_loss: 0.6505, data_loss: 0.6505
step 60 , total_loss: 0.6012, data_loss: 0.6012
step 80 , total_loss: 0.5656, data_loss: 0.5656
step 100 , total_loss: 0.3015, data_loss: 0.3015
step 120 , total_loss: 0.3988, data_loss: 0.3988
step 140 , total_loss: 0.3420, data_loss: 0.3420
step 160 , total_loss: 0.3821, data_loss: 0.3821
step 180 , total_loss: 0.4811, data_loss: 0.4811
step 200 , total_loss: 0.3865, data_loss: 0.3865
step 220 , total_loss: 0.4154, data_loss: 0.4154
step 240 , total_loss: 0.3656, data_loss: 0.3656
step 260 , total_loss: 0.3527, data_loss: 0.3527
step 280 , total_loss: 0.4438, data_loss: 0.4438
step 300 , total_loss: 0.3061, data_loss: 0.3061
step 320 , total_loss: 0.4561, data_loss: 0.4561
step 340 , total_loss: 0.4508, data_loss: 0.4508
step 360 , total_loss: 0.5473, data_loss: 0.5473
step 380 , total_loss: 0.4194, data_loss: 0.4194
step 400 , total_loss: 0.3325, data_loss: 0.3325
step 420 , total_loss: 0

step 20 , total_loss: 0.3602, data_loss: 0.3602
step 40 , total_loss: 0.4689, data_loss: 0.4689
step 60 , total_loss: 0.3043, data_loss: 0.3043
step 80 , total_loss: 0.3271, data_loss: 0.3271
step 100 , total_loss: 0.2978, data_loss: 0.2978
step 120 , total_loss: 0.2959, data_loss: 0.2959
step 140 , total_loss: 0.3659, data_loss: 0.3659
step 160 , total_loss: 0.3736, data_loss: 0.3736
step 180 , total_loss: 0.3884, data_loss: 0.3884
step 200 , total_loss: 0.2358, data_loss: 0.2358
step 220 , total_loss: 0.3042, data_loss: 0.3042
step 240 , total_loss: 0.3760, data_loss: 0.3760
step 260 , total_loss: 0.4123, data_loss: 0.4123
step 280 , total_loss: 0.2588, data_loss: 0.2588
step 300 , total_loss: 0.3269, data_loss: 0.3269
step 320 , total_loss: 0.3723, data_loss: 0.3723
step 340 , total_loss: 0.3721, data_loss: 0.3721
step 360 , total_loss: 0.2607, data_loss: 0.2607
step 380 , total_loss: 0.3323, data_loss: 0.3323
step 400 , total_loss: 0.3226, data_loss: 0.3226
step 420 , total_loss: 0

step 20 , total_loss: 0.3632, data_loss: 0.3632
step 40 , total_loss: 0.2672, data_loss: 0.2672
step 60 , total_loss: 0.2839, data_loss: 0.2839
step 80 , total_loss: 0.3390, data_loss: 0.3390
step 100 , total_loss: 0.3656, data_loss: 0.3656
step 120 , total_loss: 0.2763, data_loss: 0.2763
step 140 , total_loss: 0.3103, data_loss: 0.3103
step 160 , total_loss: 0.3346, data_loss: 0.3346
step 180 , total_loss: 0.3001, data_loss: 0.3001
step 200 , total_loss: 0.4088, data_loss: 0.4088
step 220 , total_loss: 0.3253, data_loss: 0.3253
step 240 , total_loss: 0.2444, data_loss: 0.2444
step 260 , total_loss: 0.3644, data_loss: 0.3644
step 280 , total_loss: 0.3516, data_loss: 0.3516
step 300 , total_loss: 0.3790, data_loss: 0.3790
step 320 , total_loss: 0.3510, data_loss: 0.3510
step 340 , total_loss: 0.2522, data_loss: 0.2522
step 360 , total_loss: 0.3288, data_loss: 0.3288
step 380 , total_loss: 0.2933, data_loss: 0.2933
step 400 , total_loss: 0.2499, data_loss: 0.2499
step 420 , total_loss: 0

step 20 , total_loss: 0.3835, data_loss: 0.3835
step 40 , total_loss: 0.2685, data_loss: 0.2685
step 60 , total_loss: 0.3635, data_loss: 0.3635
step 80 , total_loss: 0.3243, data_loss: 0.3243
step 100 , total_loss: 0.3155, data_loss: 0.3155
step 120 , total_loss: 0.4145, data_loss: 0.4145
step 140 , total_loss: 0.2990, data_loss: 0.2990
step 160 , total_loss: 0.2697, data_loss: 0.2697
step 180 , total_loss: 0.2310, data_loss: 0.2310
step 200 , total_loss: 0.3050, data_loss: 0.3050
step 220 , total_loss: 0.2513, data_loss: 0.2513
step 240 , total_loss: 0.3554, data_loss: 0.3554
step 260 , total_loss: 0.2777, data_loss: 0.2777
step 280 , total_loss: 0.2816, data_loss: 0.2816
step 300 , total_loss: 0.3513, data_loss: 0.3513
step 320 , total_loss: 0.2819, data_loss: 0.2819
step 340 , total_loss: 0.3400, data_loss: 0.3400
step 360 , total_loss: 0.2815, data_loss: 0.2815
step 380 , total_loss: 0.3593, data_loss: 0.3593
step 400 , total_loss: 0.3234, data_loss: 0.3234
step 420 , total_loss: 0

step 20 , total_loss: 0.2820, data_loss: 0.2820
step 40 , total_loss: 0.3008, data_loss: 0.3008
step 60 , total_loss: 0.2969, data_loss: 0.2969
step 80 , total_loss: 0.2660, data_loss: 0.2660
step 100 , total_loss: 0.3511, data_loss: 0.3511
step 120 , total_loss: 0.2716, data_loss: 0.2716
step 140 , total_loss: 0.3517, data_loss: 0.3517
step 160 , total_loss: 0.3484, data_loss: 0.3484
step 180 , total_loss: 0.4512, data_loss: 0.4512
step 200 , total_loss: 0.3554, data_loss: 0.3554
step 220 , total_loss: 0.3713, data_loss: 0.3713
step 240 , total_loss: 0.2355, data_loss: 0.2355
step 260 , total_loss: 0.3097, data_loss: 0.3097
step 280 , total_loss: 0.2518, data_loss: 0.2518
step 300 , total_loss: 0.2913, data_loss: 0.2913
step 320 , total_loss: 0.3048, data_loss: 0.3048
step 340 , total_loss: 0.3480, data_loss: 0.3480
step 360 , total_loss: 0.3588, data_loss: 0.3588
step 380 , total_loss: 0.3467, data_loss: 0.3467
step 400 , total_loss: 0.2724, data_loss: 0.2724
step 420 , total_loss: 0

### Evaluate

Performance after training

In [10]:
model.run_eval(test_file, num_ngs=test_num_ngs)

{'auc': 0.8662,
 'logloss': 0.6888,
 'mean_mrr': 0.4215,
 'group_auc': 0.8375,
 'ndcg@1': 0.0357,
 'ndcg@3': 0.4709,
 'ndcg@5': 0.5389,
 'hit@1': 0.0357,
 'hit@3': 0.7546,
 'hit@5': 0.9188}

Test the model on the test_file extracted from the whole dataset.

In [12]:
model.run_eval(os.path.join("..", "nhat_hoang", "test_slirec", "whole_dataset", "test_data"), num_ngs=test_num_ngs)

{'auc': 0.5775,
 'logloss': 1.4509,
 'mean_mrr': 0.2483,
 'group_auc': 0.5764,
 'ndcg@1': 0.0598,
 'ndcg@3': 0.1679,
 'ndcg@5': 0.2032,
 'hit@1': 0.0598,
 'hit@3': 0.2412,
 'hit@5': 0.3281}