*Copyright (c) Cornac Authors. All rights reserved.*

*Licensed under the Apache 2.0 License.*

# Visual Bayesian Personalized Ranking with Text Data

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/PreferredAI/cornac/blob/master/tutorials/vbpr_text.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/PreferredAI/cornac/blob/master/tutorials/vbpr_text.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
</table>

## Overview

We would like to use [Visual Bayesian Personalized Ranking (VBPR)](https://arxiv.org/pdf/1510.01784.pdf), a model that makes use of pre-trained visual features extracted from a CNN. However, our dataset of interest, the [MovieLens dataset](https://grouplens.org/datasets/movielens/), does not come with visual information; instead, it contains text movie plots. In this tutorial, we employ Cornac's modality infrastructure to let VBPR leverage item text content.

## Setup

In [0]:
# install Cornac and PyTorch (VBPR model implementation uses PyTorch)
# the version specifier is quoted so the shell does not treat `>` as a redirect
!pip3 install cornac "torch>=0.4.1"

In [2]:
import cornac
from cornac.data import Reader
from cornac.datasets import movielens
from cornac.eval_methods import RatioSplit
from cornac.data import TextModality, ImageModality
from cornac.data.text import BaseTokenizer

print("Cornac version: {}".format(cornac.__version__))

Cornac version: 1.4.0


## Prepare data
Here we use the MovieLens 100K dataset, which is already accessible from Cornac, so we can simply load the movie plots and the rating data.

In [0]:
plots, movie_ids = movielens.load_plot()

# movies without plots are filtered out by `cornac.data.Reader`
ml_100k = movielens.load_feedback(reader=Reader(item_set=movie_ids))

## Cross modality

To get vector representations from text data, we build a `TextModality` using our corpus and the corresponding ids. We also need to supply a `Tokenizer` for text splitting; in this case, tokens are separated by the tab character (`\t`). We limit the maximum vocabulary size to 5000, which also caps the dimension of our vector space at 5000.

In [0]:
item_text_modality = TextModality(corpus=plots, ids=movie_ids, 
                                  tokenizer=BaseTokenizer(sep='\t', stop_words='english'),
                                  max_vocab=5000, max_doc_freq=0.5).build()
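
To make the effect of the vocabulary limits concrete, here is a minimal pure-Python sketch of a bag-of-words builder. It is not Cornac's implementation: the helper `build_count_matrix` is hypothetical, and it assumes that `max_doc_freq` is a proportion of documents and that the vocabulary keeps the most frequent remaining tokens. Stop-word removal is omitted.

```python
from collections import Counter

def build_count_matrix(corpus, sep="\t", max_vocab=5000, max_doc_freq=0.5):
    """Toy bag-of-words builder (hypothetical helper, not Cornac's code)."""
    # Split every document on the separator, dropping empty tokens.
    docs = [[tok for tok in doc.lower().split(sep) if tok] for doc in corpus]
    term_freq = Counter(tok for doc in docs for tok in doc)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    # Assumption: drop tokens found in more than max_doc_freq of the documents.
    limit = max_doc_freq * len(docs)
    kept = [tok for tok in term_freq if doc_freq[tok] <= limit]
    # Assumption: keep the max_vocab most frequent tokens (ties: alphabetical).
    kept.sort(key=lambda tok: (-term_freq[tok], tok))
    vocab = {tok: idx for idx, tok in enumerate(kept[:max_vocab])}
    # One row per document, one column per vocabulary entry.
    matrix = [[0] * len(vocab) for _ in docs]
    for row, doc in zip(matrix, docs):
        for tok in doc:
            if tok in vocab:
                row[vocab[tok]] += 1
    return vocab, matrix

toy_corpus = ["a\tspace\tcrew\tspace", "a\tlost\tcrew", "a\tspace\theist"]
vocab, matrix = build_count_matrix(toy_corpus, max_vocab=2, max_doc_freq=0.7)
print(vocab)
print(matrix)
```

In this toy corpus the token `a` appears in every document, so the document-frequency cut removes it, and the vocabulary cap then keeps only two columns. Each item ends up with a vector whose length never exceeds `max_vocab`.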

The next step is to create an `ImageModality`, which is used by VBPR, from our text representations. In this case, the word-count matrix substitutes for visual features.

In [0]:
# convert the sparse word-count matrix to a dense array
features = item_text_modality.count_matrix.toarray()
item_image_modality = ImageModality(features=features, ids=movie_ids)
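
The substitution works because VBPR only needs a fixed-length feature vector per item. The sketch below follows the scoring function in the VBPR paper, leaving out the global offset and user bias since they cancel when two items are compared for the same user. The function and variable names are illustrative and are not Cornac's internals.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def vbpr_score(beta_i, gamma_u, gamma_i, theta_u, E, beta_prime, f_i):
    """Score of item i for user u, up to terms that depend on the user only."""
    # Project the raw item features f_i into a k2-dimensional space via E.
    theta_i = [dot(row, f_i) for row in E]
    return (beta_i                      # item bias
            + dot(gamma_u, gamma_i)     # latent factors (dimension k)
            + dot(theta_u, theta_i)     # projected item features (dimension k2)
            + dot(beta_prime, f_i))     # feature bias

# Toy example: a 3-word vocabulary, k = 2 and k2 = 2.
f_i = [2, 0, 1]                          # word counts standing in for CNN features
E = [[0.1, 0.0, 0.2], [0.0, 0.5, -0.1]]  # k2 x vocabulary-size projection
score = vbpr_score(beta_i=0.1, gamma_u=[0.5, -1.0], gamma_i=[1.0, 0.5],
                   theta_u=[1.0, 2.0], E=E, beta_prime=[0.01, 0.02, 0.03], f_i=f_i)
print(score)
```

Nothing in the score depends on `f_i` being derived from an image, which is why a row of the word-count matrix can take its place.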

In Cornac, every model relies on the modality for which it was designed (e.g., visual recommendation algorithms always work with `ImageModality`). This keeps models consistent with their original assumptions and avoids confusion about which modality to use when integrating a new recommender model.

## Experiment

We employ the `RatioSplit` evaluation method to split the rating data. The `item_image_modality` is also supplied here for later usage by the model.

In [0]:
ratio_split = RatioSplit(data=ml_100k, test_size=0.9,
                         item_image=item_image_modality,
                         exclude_unknowns=True, 
                         verbose=True, seed=123)
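
As a rough illustration of what a seeded ratio split does, here is a pure-Python sketch. It is a simplification under stated assumptions, not Cornac's `RatioSplit`: the helper name is hypothetical, there is no validation set, and excluding unknowns is modelled by dropping test ratings whose user or item never appears in the training set.

```python
import random

def ratio_split(ratings, test_size=0.9, seed=123):
    """Shuffle (user, item, rating) triples and split them by ratio."""
    rng = random.Random(seed)          # a fixed seed makes the split reproducible
    shuffled = list(ratings)
    rng.shuffle(shuffled)
    n_test = int(round(test_size * len(shuffled)))
    test, train = shuffled[:n_test], shuffled[n_test:]
    # Assumption: drop test ratings for users or items unseen in training.
    known_users = {u for u, _, _ in train}
    known_items = {i for _, i, _ in train}
    test = [(u, i, r) for u, i, r in test
            if u in known_users and i in known_items]
    return train, test

toy_ratings = [(f"u{n % 4}", f"i{n % 5}", 1.0) for n in range(20)]
train, test = ratio_split(toy_ratings, test_size=0.9, seed=123)
print(len(train), len(test))
```

With `test_size=0.9`, only a tenth of the ratings are available for training, so many test ratings can involve users or items the model has never seen. That is the case in which side information such as item text has the most room to help.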

We are now ready to evaluate the performance of VBPR. The [BPR](https://arxiv.org/ftp/arxiv/papers/1205/1205.2618.pdf) model is also included as a baseline to examine the effectiveness of the text auxiliary data.

In [0]:
vbpr = cornac.models.VBPR(k=10, k2=10, n_epochs=20, batch_size=10, learning_rate=0.001,
                          lambda_w=1.0, lambda_b=0.0, lambda_e=100.0, use_gpu=True, seed=123)

bpr = cornac.models.BPR(k=10, max_iter=100, learning_rate=0.001, lambda_reg=0.001, seed=123)
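
Both models are trained with the pairwise ranking objective from the BPR paper: for a user, an observed item should score higher than an unobserved one. The sketch below shows the loss for a single (observed, unobserved) pair, with regularization left out. The function name is illustrative, not part of Cornac.

```python
import math

def bpr_pair_loss(x_ui, x_uj):
    """-log(sigmoid(x_ui - x_uj)) for one observed item i and unobserved item j."""
    diff = x_ui - x_uj
    # Two algebraically equal forms, chosen so exp() never overflows.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

print(bpr_pair_loss(2.0, 0.0))  # observed item already ranked higher: small loss
print(bpr_pair_loss(0.0, 2.0))  # observed item ranked lower: large loss
```

The two models differ only in how the scores are produced: BPR uses biases and latent factors, while VBPR adds the terms that depend on the item features.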

In [0]:
auc = cornac.metrics.AUC()
rec_50 = cornac.metrics.Recall(k=50)
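
For intuition, the two metrics can be computed for a single user as follows. These are the textbook definitions written in plain Python with hypothetical helper names. Cornac's own implementations may differ in details such as which items count as candidates.

```python
def recall_at_k(ranked_items, relevant, k=50):
    """Fraction of the user's relevant items that appear in the top k."""
    hits = sum(1 for item in ranked_items[:k] if item in relevant)
    return hits / len(relevant)

def auc(scores, relevant):
    """Probability that a relevant item outscores an irrelevant one."""
    pos = [s for item, s in scores.items() if item in relevant]
    neg = [s for item, s in scores.items() if item not in relevant]
    # Count correctly ordered (relevant, irrelevant) pairs; ties count half.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = {"a": 0.9, "b": 0.8, "c": 0.3, "d": 0.1}  # model scores for one user
relevant = {"a", "c"}                              # held-out items the user liked
ranked = sorted(scores, key=scores.get, reverse=True)
print(recall_at_k(ranked, relevant, k=2))
print(auc(scores, relevant))
```

AUC looks at the whole ranking, whereas Recall@50 only rewards relevant items that reach the top 50, so the two metrics can move by different amounts.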

In [0]:
cornac.Experiment(eval_method=ratio_split,
                  models=[bpr, vbpr],
                  metrics=[auc, rec_50]).run()

Results after running the experiment:

<pre>
TEST:
...
     |    AUC | Recall@50 | Train (s) | Test (s)
---- + ------ + --------- + --------- + --------
BPR  | 0.8073 |    0.2301 |    0.2390 |   1.1167
VBPR | 0.8219 |    0.2519 |  113.8606 |   1.0624
</pre>