<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/ShopRunner/collie/blob/main/tutorials/06_multi_stage_models.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" /> Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/ShopRunner/collie/blob/main/tutorials/06_multi_stage_models.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" /> View source on GitHub</a>
  </td>
  <td>
    <a target="_blank" href="https://raw.githubusercontent.com/ShopRunner/collie/main/tutorials/06_multi_stage_models.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" /> Download notebook</a>
  </td>
</table>

In [1]:
# for Colab notebooks, we will start by installing the ``collie`` library
!pip install collie --quiet



In [2]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

%env DATA_PATH ./data/

env: DATA_PATH=./data/


In [3]:
import os

import numpy as np
import pandas as pd
from pytorch_lightning.utilities.seed import seed_everything
from IPython.display import HTML
import joblib
import torch

from collie.metrics import mapk, mrr, auc, evaluate_in_batches
from collie.model import CollieTrainer, HybridPretrainedModel, MatrixFactorizationModel
from collie.movielens import (get_movielens_metadata,
                              get_recommendation_visualizations,
                              read_movielens_df_item)

## Load Data From ``01_prepare_data`` Notebook 
If you're running this locally on Jupyter, you should be able to run the next cell quickly without a problem! If you are running this on Colab, you'll need to regenerate the data by running the cell below that, which should only take a few extra seconds to complete. 

In [4]:
try:
    # let's grab the ``Interactions`` objects we saved in the last notebook
    train_interactions = joblib.load(os.path.join(os.environ.get('DATA_PATH', 'data/'),
                                                  'train_interactions.pkl'))
    val_interactions = joblib.load(os.path.join(os.environ.get('DATA_PATH', 'data/'),
                                                'val_interactions.pkl'))
except FileNotFoundError:
    # we're running this notebook on Colab where results from the first notebook are not saved
    # regenerate this data below
    from collie.cross_validation import stratified_split
    from collie.interactions import Interactions
    from collie.movielens import read_movielens_df
    from collie.utils import convert_to_implicit, remove_users_with_fewer_than_n_interactions


    df = read_movielens_df(decrement_ids=True)
    implicit_df = convert_to_implicit(df, min_rating_to_keep=4)
    implicit_df = remove_users_with_fewer_than_n_interactions(implicit_df, min_num_of_interactions=3)

    interactions = Interactions(
        users=implicit_df['user_id'],
        items=implicit_df['item_id'],
        ratings=implicit_df['rating'],
        allow_missing_ids=True,
    )

    train_interactions, val_interactions = stratified_split(interactions, test_p=0.1, seed=42)


print('Train:', train_interactions)
print('Val:  ', val_interactions)

Checking for and removing duplicate user, item ID pairs...
Checking ``num_negative_samples`` is valid...
Maximum number of items a user has interacted with: 378
Generating positive items set...
Generating positive items set...
Generating positive items set...
Train: Interactions object with 49426 interactions between 943 users and 1674 items, returning 10 negative samples per interaction.
Val:   Interactions object with 5949 interactions between 943 users and 1674 items, returning 10 negative samples per interaction.


# Hybrid, Multi-Stage Collie Model 
In this notebook, we will train a hybrid, multi-stage Collie model using the same metadata and model architecture as the previous tutorial. Unlike the last tutorial, however, we won't have to pretrain a ``MatrixFactorizationModel`` beforehand - we'll do all our training with a single model, in stages.

## What is a multi-stage model? 
For more complicated deep learning model architectures, optimizers may sometimes focus on only a small handful of parameters for the bulk of training, even if optimizing other parameters would lead to a lower global loss. Rather than rely on extensive hyperparameter tuning to solve this issue, we take advantage of the fact that these more complex recommendation model architectures have distinct pieces, and let each piece learn in its own _stage_. Looking at the architecture for a hybrid model, these stages are:

1. Training the user and item embeddings and bias terms (the same parameters as in a ``MatrixFactorizationModel``)
2. Training the metadata MLP and combined MLP layers with frozen user and item embeddings (the same as we initially did with a ``HybridPretrainedModel``) 
3. Training all parameters together

With a multi-stage model, we can optimize each of these stages separately, giving them time to learn and be used effectively in the final model. Below, we'll see how this is implemented and used in Collie! 
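The three stages above can be pictured as a small state machine that records which parameter groups are trainable and which learning rate applies. The following is only a sketch under stated assumptions: ``Stage``, ``StagedSchedule``, and the two parameter-group names are hypothetical and are not Collie's internals. The stage names and learning rates are the ones this notebook uses further down.

```python
from dataclasses import dataclass

# Hypothetical parameter-group names for a hybrid model; not Collie's internals.
EMBEDDINGS = "user_item_embeddings_and_biases"
MLP = "metadata_and_combined_mlp"


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset  # parameter groups that receive updates in this stage
    lr: float


# Stage names and learning rates mirror the ones used later in this notebook.
STAGES = [
    Stage("matrix_factorization", frozenset({EMBEDDINGS}), 1e-2),
    Stage("metadata_only", frozenset({MLP}), 1e-3),
    Stage("all", frozenset({EMBEDDINGS, MLP}), 1e-4),
]


class StagedSchedule:
    """Tracks the current stage and reports which parameter groups are frozen."""

    def __init__(self, stages):
        self.stages = list(stages)
        self.index = 0

    @property
    def current(self):
        return self.stages[self.index]

    def is_frozen(self, group):
        return group not in self.current.trainable

    def advance_stage(self):
        # Raising here is a choice made for this toy, not a statement about Collie.
        if self.index + 1 >= len(self.stages):
            raise ValueError("already at the final stage")
        self.index += 1
        return self.current.name


schedule = StagedSchedule(STAGES)
print(schedule.current.name, schedule.is_frozen(MLP))         # matrix_factorization True
schedule.advance_stage()
print(schedule.current.name, schedule.is_frozen(EMBEDDINGS))  # metadata_only True
schedule.advance_stage()
print(schedule.current.name, schedule.is_frozen(EMBEDDINGS))  # all False
```

Keeping the stage definitions as data makes it easy to see at a glance what each stage trains and at what learning rate.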

## Read in Data

In [5]:
# read in the same metadata used in notebooks ``03``, ``04``, and ``05``
metadata_df = get_movielens_metadata()


metadata_df.head()

Unnamed: 0,genre_action,genre_adventure,genre_animation,genre_children,genre_comedy,genre_crime,genre_documentary,genre_drama,genre_fantasy,genre_film_noir,...,genre_unknown,decade_unknown,decade_20,decade_30,decade_40,decade_50,decade_60,decade_70,decade_80,decade_90
0,0,0,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,1,0,0,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1


In [6]:
# We'll also recreate ``genres`` - a 1-D tensor of length ``n_items`` holding, for simplicity, a
# single genre index per film (when a film has several genres, ``topk`` picks one of them and
# does not guarantee that it is the first)
genres = (
    torch.tensor(metadata_df[[c for c in metadata_df.columns if 'genre' in c]].values)
    .topk(1)
    .indices
    .view(-1)
)[:train_interactions.num_items]


genres

tensor([ 2,  1, 15,  ...,  7,  0,  7])
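The cell above reduces each film's multi-hot genre flags to a single bucket ID. As a minimal pure-Python sketch of that kind of reduction, the helper below always picks the first flagged genre. That deterministic rule is an assumption made for illustration: ``topk`` makes no such promise about ties, so this is not necessarily what the tensor above contains. ``first_active_index`` and the sample rows are hypothetical.

```python
def first_active_index(row):
    """Return the position of the first 1 in a multi-hot row, or 0 if the row has none."""
    for position, flag in enumerate(row):
        if flag == 1:
            return position
    return 0


# Genre flags for three made-up films (columns: action, adventure, animation, children).
genre_rows = [
    [0, 0, 1, 1],  # animation and children -> bucket 2
    [1, 1, 0, 0],  # action and adventure -> bucket 0
    [0, 0, 0, 0],  # no genre flagged -> falls back to bucket 0
]

buckets = [first_active_index(row) for row in genre_rows]
print(buckets)  # [2, 0, 0]
```

Whichever rule is used, the result is one integer per item, which is the shape a bucketed model needs.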

In [7]:
# and, as always, set our random seed
seed_everything(22)

Global seed set to 22


22

In [8]:
from collie.model.hybrid_matrix_factorization import HybridModel

In [9]:
# note that this is the same model architecture as in tutorial 05, just without the pretrained model to
# start with! Also, because we train in stages, we can define a separate learning rate for each stage
hybrid_model = HybridModel(train=train_interactions,
                           val=val_interactions,
                           item_metadata=metadata_df,
                           metadata_layers_dims=[8],
                           combined_layers_dims=[16],
                           embedding_dim=30,
                           lr=1e-2,  # used in stage 1
                           bias_lr=1e-1,  # used in stage 1
                           metadata_only_stage_lr=1e-3,  # used in stage 2
                           all_stage_lr=1e-4)  # used in stage 3 (the final stage)

Set ``self.hparams.stage`` to "matrix_factorization"


In [10]:
trainer = CollieTrainer(
    model=hybrid_model,
    max_epochs=5,
    logger=False,
    enable_checkpointing=False,
    weights_summary=None,
)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores


In [11]:
# train the first stage only, which in this case covers just the ``MatrixFactorizationModel``
# components
trainer.fit(hybrid_model)

Global seed set to 22

In [12]:
# evaluate the first stage model
map_10_first_stage = evaluate_in_batches(metric_list=[mapk],
                                         test_interactions=val_interactions,
                                         model=hybrid_model)


print(f'MAP@10 First Stage: {map_10_first_stage}')

MAP@10 First Stage: 0.03483490215304492
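For readers who want to see what a MAP@K number summarizes, here is a minimal pure-Python sketch of one common definition. It is an illustration only: Collie's ``mapk`` may normalize differently, and the function names and toy data below are hypothetical.

```python
def average_precision_at_k(recommended, relevant, k=10):
    """AP@K for one user: sum of precision@rank at each hit, over min(k, number of relevant items)."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(k, len(relevant))


def mean_average_precision_at_k(recommendations, held_out, k=10):
    """MAP@K: the average of AP@K across all users with held-out items."""
    scores = [average_precision_at_k(recommendations[user], held_out[user], k)
              for user in held_out]
    return sum(scores) / len(scores)


# Toy data: ranked recommendations and held-out positives for two users.
recommendations = {"u1": [3, 7, 1, 9], "u2": [5, 2, 8, 4]}
held_out = {"u1": [7, 9], "u2": [6]}

print(mean_average_precision_at_k(recommendations, held_out, k=10))  # 0.25
```

User ``u1`` gets hits at ranks 2 and 4, giving (1/2 + 2/4) / 2 = 0.5, while ``u2`` gets no hits, so the mean is 0.25. Because hits near the top of the list count for more, small absolute values such as those in this notebook are typical when each user has many candidate items.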


With the first stage trained, we can now move on to the second stage, which optimizes the metadata and combined MLP layers instead of the user and item embeddings and bias terms.

For multi-stage Collie models, we advance to the next stage with the ``advance_stage`` method, raise ``trainer.max_epochs`` by the number of epochs we want the second stage to train for, then fit the model as normal with the same ``CollieTrainer`` object!
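The reason for incrementing ``max_epochs`` rather than setting it afresh can be sketched with a toy trainer. This assumes the trainer keeps a cumulative epoch counter across calls to ``fit``, which is what the ``+=`` pattern in this notebook suggests. ``ToyTrainer`` is hypothetical and does no actual training.

```python
class ToyTrainer:
    """Keeps a running epoch counter across calls to fit, like a resumable training loop."""

    def __init__(self, max_epochs):
        self.max_epochs = max_epochs
        self.current_epoch = 0

    def fit(self):
        """Run epochs until the counter reaches max_epochs; return how many ran."""
        epochs_run = 0
        while self.current_epoch < self.max_epochs:
            self.current_epoch += 1
            epochs_run += 1
        return epochs_run


toy = ToyTrainer(max_epochs=5)
print(toy.fit())  # 5 epochs for stage one
print(toy.fit())  # 0, because the epoch budget is already spent
toy.max_epochs += 10
print(toy.fit())  # 10 more epochs for stage two
```

Under that assumption, calling ``fit`` again without raising the budget would do nothing, so each new stage needs its own allowance of epochs.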

In [13]:
hybrid_model.advance_stage()
trainer.max_epochs += 10

# fit the second stage model
trainer.fit(hybrid_model)

# evaluate the second stage model
map_10_second_stage = evaluate_in_batches(metric_list=[mapk],
                                          test_interactions=val_interactions,
                                          model=hybrid_model)


print(f'MAP@10 Second Stage: {map_10_second_stage}')

Set ``self.hparams.stage`` to "metadata_only"


Global seed set to 22

MAP@10 Second Stage: 0.04455868604027284


With all our model parameters now optimized separately, we can advance to the final stage and optimize everything together, at a lower learning rate, to fine-tune our recommendations algorithm. 

In [14]:
hybrid_model.advance_stage()
trainer.max_epochs += 5

# fit the third stage model
trainer.fit(hybrid_model)

# evaluate the final stage model
map_10_final_stage = evaluate_in_batches(metric_list=[mapk],
                                         test_interactions=val_interactions,
                                         model=hybrid_model)


print(f'MAP@10 Final Stage: {map_10_final_stage}')

Set ``self.hparams.stage`` to "all"


Global seed set to 22

MAP@10 Final Stage: 0.04659126676541498


Each time we call ``trainer.fit(model)``, the optimizer and learning rate scheduler states are fully reset. Refitting repeatedly therefore makes the learning rate behave more cyclically, which we've found can increase MAP@10 further. This is easy to do with Collie:
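To picture why repeated fits resemble a cyclical schedule, the sketch below assumes a scheduler that shrinks the learning rate during a fit and starts over from the initial value on the next fit. The per-epoch halving is purely hypothetical and chosen for readability; it is not a claim about which scheduler Collie uses.

```python
def learning_rates_for_fit(initial_lr, epochs, decay=0.5):
    """Learning rate per epoch for one fit call, under a hypothetical per-epoch decay."""
    return [initial_lr * decay ** epoch for epoch in range(epochs)]


schedule = []
for _ in range(3):  # three consecutive fit calls, each starting from a fresh scheduler
    schedule.extend(learning_rates_for_fit(initial_lr=1e-4, epochs=3))

for epoch, lr in enumerate(schedule):
    print(epoch, lr)
```

The printed values fall within each fit and jump back to the initial rate at the start of the next one, giving a sawtooth pattern.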

In [15]:
for _ in range(2):
    trainer.max_epochs += 5

    trainer.fit(hybrid_model)

    map_10_final_stage_reset_optimizer = evaluate_in_batches(metric_list=[mapk],
                                                             test_interactions=val_interactions,
                                                             model=hybrid_model)

    print(f'MAP@10 After Reset Optimizer: {map_10_final_stage_reset_optimizer}')

Global seed set to 22

MAP@10 After Reset Optimizer: 0.04723392518130159


Global seed set to 22

MAP@10 After Reset Optimizer: 0.04791095644241813


As always, we can visualize some of these recommendations and see if they make sense to us! 

In [16]:
user_id = np.random.randint(0, train_interactions.num_users)

display(
    HTML(
        get_recommendation_visualizations(
            model=hybrid_model,
            user_id=user_id,
            filter_films=True,
            shuffle=True,
            detailed=True,
        )
    )
)

Some loved films: Dances with Wolves (1990); Speed (1994); Hunchback of Notre Dame, The (1996); Cinderella (1950); Apollo 13 (1995); One Fine Day (1996); Indiana Jones and the Last Crusade (1989); Four Weddings and a Funeral (1994); Truth About Cats & Dogs, The (1996); Top Gun (1986)

Recommended films: Star Wars (1977); Return of the Jedi (1983); Empire Strikes Back, The (1980); Independence Day (ID4) (1996); Groundhog Day (1993); When Harry Met Sally... (1989); Men in Black (1997); Star Trek: First Contact (1996); Dead Poets Society (1989); Silence of the Lambs, The (1991)


# Cold Start Model 

Collie includes another multi-stage model, ``ColdStartModel``, which applies the same staged-training concept to the cold start problem: producing high-quality recommendations for new items that users have not yet interacted with. It does this by using item metadata. You can view the docs for this model [here](https://collie.readthedocs.io/en/latest/models.html).

In addition, Collie includes a base pipeline for multi-stage models that makes it easy to inherit from and customize your own multi-stage models! See [here](https://collie.readthedocs.io/en/latest/models.html) for more details!

In [17]:
from collie.model.cold_start_matrix_factorization import ColdStartModel

In [18]:
cold_start_model = ColdStartModel(train=train_interactions,
                                  val=val_interactions,
                                  item_buckets=genres,
                                  item_buckets_stage_lr=1e-2,
                                  no_buckets_stage_lr=1e-4,
                                  embedding_dim=30)

Set ``self.hparams.stage`` to "item_buckets"


In [19]:
trainer = CollieTrainer(
    model=cold_start_model,
    max_epochs=7,
    logger=False,
    enable_checkpointing=False,
    weights_summary=None,
)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores


In [20]:
# train the first stage
trainer.fit(cold_start_model)

map_10_first_stage = evaluate_in_batches(metric_list=[mapk],
                                         test_interactions=val_interactions,
                                         model=cold_start_model)


print(f'MAP@10 First Stage: {map_10_first_stage}')

Global seed set to 22

MAP@10 First Stage: 0.012453659237541048


The MAP@10 score is low, but this is expected for MovieLens 100K data, where we only have a small set of genres that are not evenly distributed. Ideally, this model works best when our buckets are better defined, separate, and distinct. 

However, this model still provides us with a useful initialization point that is certainly better than random. As a small aside, we'll see what the MAP@10 score is for a randomly initialized ``MatrixFactorizationModel``, since this is the same model architecture:

```python
>>> from collie.model import MatrixFactorizationModel
>>>
>>> randomly_initialized_model = MatrixFactorizationModel(train=train_interactions,
...                                                       embedding_dim=30)
>>> randomly_initialized_map_10 = evaluate_in_batches(metric_list=[mapk],
...                                                   test_interactions=val_interactions,
...                                                   model=randomly_initialized_model)
>>>
>>> print(f'MAP@10 for Randomly Initialized Model: {randomly_initialized_map_10}')
MAP@10 for Randomly Initialized Model: 0.002076902122018895
```

With the user embeddings, item bucket embeddings, and bias parameters now tuned, we'll use them to initialize our second-stage model. This involves copying ``item_bucket_embeddings -> item_embeddings`` and ``item_bucket_biases -> item_biases``, which happens when we advance the stage. We can then continue training the model as normal.
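The copy step can be sketched in plain Python: each item looks up its bucket and starts from that bucket's learned embedding and bias. This is a minimal illustration with made-up numbers and a hypothetical helper, ``copy_bucket_parameters``, not Collie's implementation.

```python
def copy_bucket_parameters(bucket_embeddings, bucket_biases, item_buckets):
    """Initialize per-item parameters from the bucket each item belongs to."""
    # list(...) makes an independent copy so items can later diverge from their bucket.
    item_embeddings = [list(bucket_embeddings[bucket]) for bucket in item_buckets]
    item_biases = [bucket_biases[bucket] for bucket in item_buckets]
    return item_embeddings, item_biases


# Two buckets with 2-dimensional embeddings, and three items assigned to them.
bucket_embeddings = [[0.1, 0.2], [0.9, -0.4]]
bucket_biases = [0.05, -0.10]
item_buckets = [1, 0, 1]

item_embeddings, item_biases = copy_bucket_parameters(
    bucket_embeddings, bucket_biases, item_buckets
)
print(item_embeddings)  # [[0.9, -0.4], [0.1, 0.2], [0.9, -0.4]]
print(item_biases)      # [-0.1, 0.05, -0.1]

# Training an item's embedding afterwards leaves the bucket's embedding untouched.
item_embeddings[0][0] = 0.5
print(bucket_embeddings[1])  # [0.9, -0.4]
```

Items in the same bucket begin with identical parameters and only separate once the second stage trains them on their own interactions.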

In [21]:
cold_start_model.advance_stage()
trainer.max_epochs += 10

# fit the second stage model
trainer.fit(cold_start_model)

# evaluate the final stage model
map_10_final_stage = evaluate_in_batches(metric_list=[mapk],
                                         test_interactions=val_interactions,
                                         model=cold_start_model)


print(f'MAP@10 Final Stage: {map_10_final_stage}')

Copying over item embeddings...
Set ``self.hparams.stage`` to "no_buckets"


Global seed set to 22

MAP@10 Final Stage: 0.03173671056847683


Now that we have a trained model, imagine a new film enters our database and we want to make recommendations for it. With a cold start model, we can easily do this with the ``item_bucket_item_similarity`` method, which finds the items most similar to a given item bucket ID. We'll see this below on MovieLens 100K data!
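Conceptually, a bucket-to-item similarity compares one bucket embedding against every item embedding and ranks the items. The sketch below assumes cosine similarity as the measure. That is an assumption for illustration and may differ from what ``item_bucket_item_similarity`` computes internally, and the helper names and vectors are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def most_similar_items(bucket_vector, item_vectors, top_n=5):
    """Return (item_id, similarity) pairs, most similar first."""
    scored = [(item_id, cosine_similarity(bucket_vector, vector))
              for item_id, vector in enumerate(item_vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# One bucket embedding and four item embeddings, all made up.
bucket_vector = [1.0, 0.0]
item_vectors = [[0.9, 0.1], [-1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]

ranked = most_similar_items(bucket_vector, item_vectors, top_n=2)
print([item_id for item_id, _ in ranked])  # [3, 0]
```

Item 3 points in exactly the same direction as the bucket vector, so it ranks first even though its magnitude differs. A new film only needs a bucket assignment, not any interactions, to be placed in this ranking.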

In [22]:
# read in the item information with MovieLens data
df_item = read_movielens_df_item()

# all IDs must start at 0 in Collie, so we'll make that adjustment here
df_item['item_id'] -= 1

In [23]:
action_genre_idx = metadata_df.columns.tolist().index('genre_action')

item_similarities = cold_start_model.item_bucket_item_similarity(item_bucket_id=action_genre_idx)

In [24]:
df_item.iloc[item_similarities.index][:5]

Unnamed: 0,item_id,movie_title,release_date,IMDb_URL,unknown,Action,Adventure,Animation,Children,Comedy,...,Fantasy,Film_Noir,Horror,Musical,Mystery,Romance,Sci_Fi,Thriller,War,Western
1361,1361,American Strays (1996),1996-09-13,http://us.imdb.com/M/title-exact?American%20St...,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1432,1432,Men of Means (1998),1997-01-01,http://us.imdb.com/M/title-exact?imdb-title-11...,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1613,1613,"Reluctant Debutante, The (1958)",1958-01-01,http://us.imdb.com/M/title-exact?Reluctant%20D...,0,0,0,0,0,1,...,0,0,0,0,0,0,0,0,0,0
1547,1547,The Courtyard (1995),1995-01-01,"http://us.imdb.com/M/title-exact?Courtyard,%20...",0,1,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
829,829,Power 98 (1995),1996-05-17,http://us.imdb.com/M/title-exact?Power%2098%20...,0,1,0,0,0,0,...,0,0,0,0,1,0,0,1,0,0


So, if we have a new action film without any interactions, for example, we can still provide some level of personalized recommendations with a cold start model! 

Obviously, this example is simple because the only bucket we use is genre, but as the buckets become more nuanced and you include other forms of metadata, these recommendations become more and more personalized!

----- 

Thus far in the tutorials, we have mainly focused on building out implicit recommendation systems for times when we don't explicitly know the degree to which users loved or hated certain items. In the following tutorial, we'll examine how to build a recommendations model for situations in which we _do_ have that data, something known as explicit recommendations. See you there! 

----- 