<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/ShopRunner/collie/blob/main/tutorials/07_explicit_matrix_factorization.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" /> Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/ShopRunner/collie/blob/main/tutorials/07_explicit_matrix_factorization.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" /> View source on GitHub</a>
  </td>
  <td>
    <a target="_blank" href="https://raw.githubusercontent.com/ShopRunner/collie/main/tutorials/07_explicit_matrix_factorization.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" /> Download notebook</a>
  </td>
</table>

In [1]:
# for Colab notebooks, we will start by installing the ``collie`` library
!pip install collie --quiet



In [2]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

%env DATA_PATH data/

env: DATA_PATH=data/


In [3]:
import os

from IPython.display import HTML
import joblib
import numpy as np
from pytorch_lightning.utilities.seed import seed_everything
import torchmetrics

from collie.cross_validation import stratified_split
from collie.interactions import ExplicitInteractions
from collie.metrics import explicit_evaluate_in_batches
from collie.model import CollieTrainer, MatrixFactorizationModel
from collie.movielens import get_recommendation_visualizations, read_movielens_df

# Explicit Data Support in Collie
Thus far in the tutorials, we have focused on implicit recommendations, where the source data records only whether a user interacted with an item, not how much they loved or hated it. Our previous recommendation systems therefore could not tell whether a user watched a film and loved it, or watched it and hated it. This made the task more challenging, since the models had to sort through all this noise to make effective recommendations for users.

While most data in the real world is implicit, there are times when we do have users' feedback on their preferences, often in the form of a numerical star rating. When we have this data, known as **explicit data**, we can build even more effective recommendation systems!
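To make the distinction concrete, here is a small sketch in plain Python. The user IDs, item IDs, and ratings are made up for illustration and are not drawn from MovieLens. It shows what is lost when explicit ratings are collapsed into implicit interactions.

```python
# Hypothetical explicit feedback: (user_id, item_id, rating) triples.
explicit_ratings = [
    (0, 10, 5.0),  # user 0 loved item 10
    (0, 11, 1.0),  # user 0 disliked item 11
    (1, 10, 4.0),
]

# An implicit view of the same data keeps only the fact that an interaction happened.
implicit_interactions = [(user, item) for user, item, _ in explicit_ratings]

print(implicit_interactions)
```

In the implicit view, user 0's two interactions look identical, even though one film was loved and the other disliked.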

In this tutorial, we'll replicate the work we did in ``02_matrix_factorization`` earlier, but for explicit data. Note how similar the API is in Collie to prepare, train, and evaluate a model with this new dataset type. 

We'll begin by loading in data as we normally would, just with a few steps removed. 

In [4]:
df = read_movielens_df(decrement_ids=True)

df.shape

(100000, 4)

In [5]:
# note that we will be using ``ExplicitInteractions`` here instead of ``Interactions``
interactions = ExplicitInteractions(
    users=df['user_id'],
    items=df['item_id'],
    ratings=df['rating'],
    allow_missing_ids=True,
)

interactions

Checking for and removing duplicate user, item ID pairs...


ExplicitInteractions object with 100000 interactions between 943 users and 1682 items, with minimum rating of 1 and maximum rating of 5.

In [6]:
# but the data split is exactly the same!
train_interactions, val_interactions = stratified_split(interactions, test_p=0.1, seed=42)


print('Train:', train_interactions)
print('Val:  ', val_interactions)

Train: ExplicitInteractions object with 89561 interactions between 943 users and 1682 items, with minimum rating of 1 and maximum rating of 5.
Val:   ExplicitInteractions object with 10439 interactions between 943 users and 1682 items, with minimum rating of 1 and maximum rating of 5.


## Train a Matrix Factorization Model

In [7]:
# this handy PyTorch Lightning function fixes random seeds across all the libraries used here
seed_everything(24)

Global seed set to 24


24

In [8]:
# ... and our model definition is nearly the same!
# For explicit data, it is often helpful to constrain predictions to a fixed range.
# For MovieLens 100K data, all ratings are between 1 and 5, so we set ``y_range`` to apply a
# final sigmoid at the end of the model and keep all predictions within this range. This
# helps the model learn significantly faster and perform better on explicit tasks!
model = MatrixFactorizationModel(
    train=train_interactions,
    val=val_interactions,
    embedding_dim=10,
    lr=1e-2,
    loss='mse',
    y_range=[1, 5],
)


model

MatrixFactorizationModel(
  (loss_function): MSELoss()
  (user_biases): ZeroEmbedding(943, 1)
  (item_biases): ZeroEmbedding(1682, 1)
  (user_embeddings): ScaledEmbedding(943, 10)
  (item_embeddings): ScaledEmbedding(1682, 10)
  (dropout): Dropout(p=0.0, inplace=False)
)
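The effect of ``y_range`` described in the model definition above can be sketched in a few lines of plain Python. This illustrates the general sigmoid-rescaling idea and is not necessarily how Collie implements it internally. The function name ``scale_to_range`` is made up for this example.

```python
import math


def scale_to_range(raw_score, y_min=1.0, y_max=5.0):
    """Squash an unbounded score into the open interval (y_min, y_max) with a sigmoid."""
    sigmoid = 1.0 / (1.0 + math.exp(-raw_score))
    return sigmoid * (y_max - y_min) + y_min


# Very negative scores approach y_min, very positive scores approach y_max.
for raw_score in (-10.0, 0.0, 10.0):
    print(raw_score, scale_to_range(raw_score))
```

A raw score of 0 maps to the midpoint of the range (3.0 here), and no raw score can produce a prediction outside 1 to 5, so the model never spends capacity on impossible ratings.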

In [9]:
trainer = CollieTrainer(model, max_epochs=10, deterministic=True)

trainer.fit(model)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores

  | Name            | Type            | Params
----------------------------------------------------
0 | loss_function   | MSELoss         | 0     
1 | user_biases     | ZeroEmbedding   | 943   
2 | item_biases     | ZeroEmbedding   | 1.7 K 
3 | user_embeddings | ScaledEmbedding | 9.4 K 
4 | item_embeddings | ScaledEmbedding | 16.8 K
5 | dropout         | Dropout         | 0     
----------------------------------------------------
28.9 K    Trainable params
0         Non-trainable params
28.9 K    Total params
0.115     Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

Global seed set to 24


Training: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Epoch     5: reducing learning rate of group 0 to 1.0000e-03.
Epoch     5: reducing learning rate of group 0 to 1.0000e-03.


Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Epoch     7: reducing learning rate of group 0 to 1.0000e-04.
Epoch     7: reducing learning rate of group 0 to 1.0000e-04.


Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Epoch     9: reducing learning rate of group 0 to 1.0000e-05.
Epoch     9: reducing learning rate of group 0 to 1.0000e-05.


Validating: 0it [00:00, ?it/s]

## Evaluate the Model 
Just like that, we have an explicit model with very few code changes from what we did in Tutorial 02! Of course, training a model means nothing if we can't tell how well it is performing.

Luckily, explicit metrics are much more straightforward than implicit ones. The most common are:

* [``Mean Squared Error``](https://en.wikipedia.org/wiki/Mean_squared_error)
* [``Root Mean Squared Error``](https://en.wikipedia.org/wiki/Root-mean-square_deviation)
* [``Mean Absolute Error``](https://en.wikipedia.org/wiki/Mean_absolute_error)
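As a quick illustration of the three metrics listed above, here is a sketch that computes each one by hand in plain Python. The ratings and predictions are made up for the example.

```python
import math

# Made-up true ratings and model predictions, purely for illustration.
true_ratings = [5.0, 3.0, 4.0, 1.0]
predicted_ratings = [4.0, 3.0, 2.0, 2.0]

errors = [pred - true for pred, true in zip(predicted_ratings, true_ratings)]

mse = sum(error ** 2 for error in errors) / len(errors)  # mean of squared errors
rmse = math.sqrt(mse)                                    # square root of the MSE
mae = sum(abs(error) for error in errors) / len(errors)  # mean of absolute errors

print(f'MSE: {mse}, RMSE: {rmse:.3f}, MAE: {mae}')
```

Because MSE squares each error, the single prediction that is off by two stars contributes far more to MSE than to MAE.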

Each of these metrics is available in PyTorch Lightning's companion library, ``torchmetrics``, and we leverage the work done in that library for the metric implementations. You can view the documentation for MSE [here](https://torchmetrics.readthedocs.io/en/latest/references/modules.html#meansquarederror) and MAE [here](https://torchmetrics.readthedocs.io/en/latest/references/modules.html#meanabsoluteerror).

Note that adding new metrics is a breeze here as well. To read more about this, see the docs [here](https://torchmetrics.readthedocs.io/en/latest/pages/implement.html).

We'll go ahead and evaluate MSE and MAE together below, once again doing our computation on the GPU, if we can. Best of all, explicit metrics tend to finish significantly faster than implicit metrics, so this should be done almost as soon as you press start!
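Evaluating in batches works because these metrics can be accumulated incrementally. The sketch below is a simplified, pure-Python stand-in for that update-then-compute pattern. The ``RunningMSE`` class and the batch values are hypothetical, and this is not how ``torchmetrics`` or Collie implement it internally.

```python
class RunningMSE:
    """Accumulate squared error across batches, then average once at the end."""

    def __init__(self):
        self.sum_squared_error = 0.0
        self.count = 0

    def update(self, preds, targets):
        # Add this batch's squared errors to the running total.
        self.sum_squared_error += sum((p - t) ** 2 for p, t in zip(preds, targets))
        self.count += len(targets)

    def compute(self):
        # Divide by the total number of ratings seen, not the number of batches.
        return self.sum_squared_error / self.count


# Two made-up batches of (predictions, true ratings).
batches = [
    ([4.0, 3.0], [5.0, 3.0]),
    ([2.0, 2.0], [4.0, 1.0]),
]

metric = RunningMSE()
for preds, targets in batches:
    metric.update(preds, targets)

print(metric.compute())
```

Keeping a running sum and count, rather than averaging per-batch scores, gives the same answer as a single pass over all the data even when the final batch is smaller than the rest.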

In [10]:
model.eval()  # set model to inference mode

mse_score, mae_score = explicit_evaluate_in_batches(
    metric_list=[torchmetrics.MeanSquaredError(), torchmetrics.MeanAbsoluteError()],
    test_interactions=val_interactions,
    model=model,
)

print(f'MSE: {mse_score}')
print(f'MAE: {mae_score}')

  0%|          | 0/11 [00:00<?, ?it/s]

MSE: 0.8649271726608276
MAE: 0.7251297831535339
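RMSE is not reported by the cell above, but it can be recovered from the MSE with a square root. A quick check, copying the printed MSE value in as a plain float:

```python
import math

mse_score = 0.8649271726608276  # MSE value printed by the evaluation cell
rmse_score = math.sqrt(mse_score)

print(f'RMSE: {rmse_score:.2f}')
```

This comes out to roughly 0.93. Unlike MSE, RMSE is in the same units as the ratings themselves (stars).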


Again, we can also look at particular users to get a sense of what the recommendations look like.

In [11]:
# select a random user ID to look at recommendations for
user_id = np.random.randint(0, train_interactions.num_users)

display(
    HTML(
        get_recommendation_visualizations(
            model=model,
            user_id=user_id,
            filter_films=True,
            shuffle=True,
            detailed=True,
        )
    )
)

Some loved films: Winter Guest, The (1997); Scream (1996); Contact (1997); Full Monty, The (1997); Game, The (1997); Fly Away Home (1996); Scream 2 (1997)

Recommended films: Home Alone 3 (1997); D3: The Mighty Ducks (1996); Stephen King's The Langoliers (1995); Enchanted April (1991); Ghost and the Darkness, The (1996); Preacher's Wife, The (1996); Happy Gilmore (1996); Hellraiser: Bloodline (1996); Striptease (1996); Pagemaster, The (1994)


## Save and Load a Standard Model 

In [12]:
# we can still save the model with...
os.makedirs('models', exist_ok=True)
model.save_model('models/explicit_matrix_factorization_model.pth')

In [13]:
# ... and if we wanted to load that model back in, we can still do that easily...
model_loaded_in = MatrixFactorizationModel(load_model_path='models/explicit_matrix_factorization_model.pth')


model_loaded_in

MatrixFactorizationModel(
  (user_biases): ZeroEmbedding(943, 1)
  (item_biases): ZeroEmbedding(1682, 1)
  (user_embeddings): ScaledEmbedding(943, 10)
  (item_embeddings): ScaledEmbedding(1682, 10)
  (dropout): Dropout(p=0.0, inplace=False)
)

That's the end of our tutorials, but it's not the end of the awesome features available in Collie. Check out all the different available architectures in the documentation [here](https://collie.readthedocs.io/en/latest/index.html)! 

----- 