<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/ShopRunner/collie_recs/blob/main/tutorials/05_hybrid_model.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" /> Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/ShopRunner/collie_recs/blob/main/tutorials/05_hybrid_model.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" /> View source on GitHub</a>
  </td>
  <td>
    <a target="_blank" href="https://raw.githubusercontent.com/ShopRunner/collie_recs/main/tutorials/05_hybrid_model.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" /> Download notebook</a>
  </td>
</table>

In [1]:
# for Collab notebooks, we will start by installing the ``collie_recs`` library
!pip install collie_recs --quiet



In [2]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

%env DATA_PATH ./data/

env: DATA_PATH=./data/


In [3]:
import os

import numpy as np
import pandas as pd
from pytorch_lightning.utilities.seed import seed_everything
from IPython.display import HTML
import joblib
import torch

from collie_recs.metrics import mapk, mrr, auc, evaluate_in_batches
from collie_recs.model import CollieTrainer, HybridPretrainedModel, MatrixFactorizationModel
from collie_recs.movielens import get_movielens_metadata, get_recommendation_visualizations

## Load Data From ``01_prepare_data`` Notebook 
If you're running this locally on Jupyter, you should be able to run the next cell quickly without a problem! If you are running this on Colab, you'll need to regenerate the data by running the cell below that, which should only take a few extra seconds to complete. 

In [4]:
try:
    # let's grab the ``Interactions`` objects we saved in the last notebook
    train_interactions = joblib.load(os.path.join(os.environ.get('DATA_PATH', 'data/'),
                                                  'train_interactions.pkl'))
    val_interactions = joblib.load(os.path.join(os.environ.get('DATA_PATH', 'data/'),
                                                'val_interactions.pkl'))
except FileNotFoundError:
    # we're running this notebook on Colab where results from the first notebook are not saved
    # regenerate this data below
    from collie_recs.cross_validation import stratified_split
    from collie_recs.interactions import Interactions
    from collie_recs.movielens import read_movielens_df
    from collie_recs.utils import convert_to_implicit, remove_users_with_fewer_than_n_interactions


    df = read_movielens_df(decrement_ids=True)
    implicit_df = convert_to_implicit(df, min_rating_to_keep=4)
    implicit_df = remove_users_with_fewer_than_n_interactions(implicit_df, min_num_of_interactions=3)

    interactions = Interactions(
        users=implicit_df['user_id'],
        items=implicit_df['item_id'],
        ratings=implicit_df['rating'],
        allow_missing_ids=True,
    )

    train_interactions, val_interactions = stratified_split(interactions, test_p=0.1, seed=42)


print('Train:', train_interactions)
print('Val:  ', val_interactions)

Checking for and removing duplicate user, item ID pairs...
Checking ``num_negative_samples`` is valid...
Maximum number of items a user has interacted with: 378
Generating positive items set...
Generating positive items set...
Generating positive items set...
Train: Interactions object with 49426 interactions between 943 users and 1674 items, returning 10 negative samples per interaction.
Val:   Interactions object with 5949 interactions between 943 users and 1674 items, returning 10 negative samples per interaction.


# Hybrid Collie Model 
In this notebook, we will use this same metadata and incorporate it directly into the model architecture with a hybrid Collie model. 

## Read in Data

In [5]:
# read in the same metadata used in notebooks ``03`` and ``04``
metadata_df = get_movielens_metadata()


metadata_df.head()

Unnamed: 0,genre_action,genre_adventure,genre_animation,genre_children,genre_comedy,genre_crime,genre_documentary,genre_drama,genre_fantasy,genre_film_noir,...,genre_unknown,decade_unknown,decade_20,decade_30,decade_40,decade_50,decade_60,decade_70,decade_80,decade_90
0,0,0,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,1,0,0,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1


In [6]:
# to do the partial credit calculation, we need this data in a slightly different form.
# Instead of the one-hot-encoded version above, we're going to make a ``1 x n_items`` tensor
# with a number representing the first genre associated with the film, for simplicity.
# Note that with Collie, we could instead make a metadata tensor for each genre and decade
genres = (
    torch.tensor(metadata_df[[c for c in metadata_df.columns if 'genre' in c]].values)
    .topk(1)
    .indices
    .view(-1)
)


genres = genres[:train_interactions.num_items]

genres

tensor([ 2,  1, 15,  ...,  7,  0,  7])

In [7]:
# and, as always, set our random seed
seed_everything(22)

Global seed set to 22


22

In [8]:
from collie_recs.model.cold_start_matrix_factorization import ColdStartModel

In [17]:
model = ColdStartModel(train=train_interactions,
                       val=val_interactions,
                       item_buckets=genres,
                       item_buckets_lr=1e-2,
                       no_buckets_lr=1e-3,
                       embedding_dim=30)

set to stage item_buckets


In [18]:
trainer = CollieTrainer(
    model=model,
    max_epochs=20,
    logger=False,
    checkpoint_callback=False,
    weights_summary=None,
)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores


In [12]:
trainer.fit(model)

Validation sanity check: 0it [00:00, ?it/s]

Global seed set to 22


Training: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Epoch     5: reducing learning rate of group 0 to 1.0000e-03.
Epoch     5: reducing learning rate of group 0 to 1.0000e-04.


Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Epoch     8: reducing learning rate of group 0 to 1.0000e-04.
Epoch     8: reducing learning rate of group 0 to 1.0000e-05.


Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Epoch    11: reducing learning rate of group 0 to 1.0000e-05.
Epoch    11: reducing learning rate of group 0 to 1.0000e-06.


Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Epoch    13: reducing learning rate of group 0 to 1.0000e-06.
Epoch    13: reducing learning rate of group 0 to 1.0000e-07.


Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Epoch    15: reducing learning rate of group 0 to 1.0000e-07.
Epoch    15: reducing learning rate of group 0 to 1.0000e-08.


Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Epoch    17: reducing learning rate of group 0 to 1.0000e-08.


Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

In [13]:
evaluate_in_batches([mapk], val_interactions, model)

  0%|          | 0/48 [00:00<?, ?it/s]

0.009754821770258782

In [14]:
model.advance_stage()
trainer.max_epochs += 10

trainer.fit(model)

mapk_score = evaluate_in_batches([mapk], val_interactions, model)
print(f'mapk: {mapk_score}')

user embeddings initialized
set to stage no_buckets


Validation sanity check: 0it [00:00, ?it/s]

Global seed set to 22


Training: 48it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Epoch     7: reducing learning rate of group 0 to 1.0000e-03.
Epoch     7: reducing learning rate of group 0 to 1.0000e-04.


Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Epoch     9: reducing learning rate of group 0 to 1.0000e-04.
Epoch     9: reducing learning rate of group 0 to 1.0000e-05.


Validating: 0it [00:00, ?it/s]

Validating: 0it [00:00, ?it/s]

Epoch    11: reducing learning rate of group 0 to 1.0000e-05.
Epoch    11: reducing learning rate of group 0 to 1.0000e-06.


  0%|          | 0/48 [00:00<?, ?it/s]

mapk: 0.045376259127933624


In [25]:
# we can save the model with...
os.makedirs('models', exist_ok=True)

model.save_model('models/hybrid_model', overwrite=True)
hybrid_model_loaded_in = HybridModel(load_model_path='models/hybrid_model')


hybrid_model_loaded_in

set to stage matrix_factorization


HybridModel(
  (user_biases): ZeroEmbedding(943, 1)
  (item_biases): ZeroEmbedding(1674, 1)
  (user_embeddings): ScaledEmbedding(943, 30)
  (item_embeddings): ScaledEmbedding(1674, 30)
  (dropout): Dropout(p=0.0, inplace=False)
  (metadata_layer_0): Linear(in_features=28, out_features=10, bias=True)
  (combined_layer_0): Linear(in_features=70, out_features=48, bias=True)
  (combined_layer_1): Linear(in_features=48, out_features=24, bias=True)
  (combined_layer_2): Linear(in_features=24, out_features=12, bias=True)
  (combined_layer_3): Linear(in_features=12, out_features=1, bias=True)
)

That's the end of our tutorials, but it's not the end of the awesome features available in Collie. Check out all the different available architectures in the documentation [here](https://collie.readthedocs.io/en/latest/index.html)! 

----- 