<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/ShopRunner/collie_recs/blob/main/tutorials/05_hybrid_model.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" /> Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/ShopRunner/collie_recs/blob/main/tutorials/05_hybrid_model.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" /> View source on GitHub</a>
  </td>
  <td>
    <a target="_blank" href="https://raw.githubusercontent.com/ShopRunner/collie_recs/main/tutorials/05_hybrid_model.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" /> Download notebook</a>
  </td>
</table>

In [1]:
# for Colab notebooks, we will start by installing the ``collie_recs`` library
!pip install collie_recs --quiet



In [2]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

%env DATA_PATH ./data/

env: DATA_PATH=./data/


In [3]:
import os

import numpy as np
import pandas as pd
from pytorch_lightning.utilities.seed import seed_everything
from IPython.display import HTML, display
import joblib
import torch

from collie_recs.metrics import mapk, mrr, auc, evaluate_in_batches
from collie_recs.model import CollieTrainer, HybridPretrainedModel, MatrixFactorizationModel
from collie_recs.movielens import get_movielens_metadata, get_recommendation_visualizations

## Load Data From ``01_prepare_data`` Notebook 
If you're running this locally in Jupyter, the next cell should load the saved data quickly and without a problem! If you're running this on Colab, the saved files won't exist, so the ``except`` branch of the same cell regenerates the data, which should only take a few extra seconds to complete. 

In [4]:
try:
    # let's grab the ``Interactions`` objects we saved in the last notebook
    train_interactions = joblib.load(os.path.join(os.environ.get('DATA_PATH', 'data/'),
                                                  'train_interactions.pkl'))
    val_interactions = joblib.load(os.path.join(os.environ.get('DATA_PATH', 'data/'),
                                                'val_interactions.pkl'))
except FileNotFoundError:
    # we're running this notebook on Colab where results from the first notebook are not saved
    # regenerate this data below
    from collie_recs.cross_validation import stratified_split
    from collie_recs.interactions import Interactions
    from collie_recs.movielens import read_movielens_df
    from collie_recs.utils import convert_to_implicit, remove_users_with_fewer_than_n_interactions


    df = read_movielens_df(decrement_ids=True)
    implicit_df = convert_to_implicit(df, min_rating_to_keep=4)
    implicit_df = remove_users_with_fewer_than_n_interactions(implicit_df, min_num_of_interactions=3)

    interactions = Interactions(
        users=implicit_df['user_id'],
        items=implicit_df['item_id'],
        ratings=implicit_df['rating'],
        allow_missing_ids=True,
    )

    train_interactions, val_interactions = stratified_split(interactions, test_p=0.1, seed=42)


print('Train:', train_interactions)
print('Val:  ', val_interactions)

Checking for and removing duplicate user, item ID pairs...
Checking ``num_negative_samples`` is valid...
Maximum number of items a user has interacted with: 378
Generating positive items set...
Generating positive items set...
Generating positive items set...
Train: Interactions object with 49426 interactions between 943 users and 1674 items, returning 10 negative samples per interaction.
Val:   Interactions object with 5949 interactions between 943 users and 1674 items, returning 10 negative samples per interaction.


# Hybrid Collie Model 
In this notebook, we will take the same item metadata used in notebooks ``03`` and ``04`` and incorporate it directly into the model architecture with a hybrid Collie model. 

## Read in Data

In [5]:
# read in the same metadata used in notebooks ``03`` and ``04``
metadata_df = get_movielens_metadata()


metadata_df.head()

Unnamed: 0,genre_action,genre_adventure,genre_animation,genre_children,genre_comedy,genre_crime,genre_documentary,genre_drama,genre_fantasy,genre_film_noir,...,genre_unknown,decade_unknown,decade_20,decade_30,decade_40,decade_50,decade_60,decade_70,decade_80,decade_90
0,0,0,1,1,1,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
1,1,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,1,0,0,0,1,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1
4,0,0,0,0,0,1,0,1,0,0,...,0,0,0,0,0,0,0,0,0,1


In [6]:
# and, as always, set our random seed
seed_everything(22)

Global seed set to 22


22

In [7]:
from collie_recs.model.base.multi_stage_model import *

In [8]:
model = HybridModel(
    train=train_interactions,
    val=val_interactions,
    item_metadata=metadata_df,
    metadata_layers_dims=[10],
    combined_layers_dims=[48, 24, 12],
    embedding_dim=30,
    embeddings_lr=1e-2,
    metadata_only_stage_lr=1e-3,
    all_stage_lr=1e-4,
    stage='matrix_factorization',
)


In [9]:
trainer = CollieTrainer(
    model=model,
    max_epochs=20,
    logger=False,
    checkpoint_callback=False,
    weights_summary=None,
)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores


In [10]:
trainer.fit(model)

Global seed set to 22

Epoch     3: reducing learning rate of group 0 to 1.0000e-03.
Epoch     3: reducing learning rate of group 0 to 1.0000e-03.
Epoch     3: reducing learning rate of group 0 to 1.0000e-04.
Epoch     3: reducing learning rate of group 0 to 1.0000e-05.

Epoch     5: reducing learning rate of group 0 to 1.0000e-04.
Epoch     5: reducing learning rate of group 0 to 1.0000e-04.
Epoch     5: reducing learning rate of group 0 to 1.0000e-05.
Epoch     5: reducing learning rate of group 0 to 1.0000e-06.

Epoch     7: reducing learning rate of group 0 to 1.0000e-05.
Epoch     7: reducing learning rate of group 0 to 1.0000e-05.
Epoch     7: reducing learning rate of group 0 to 1.0000e-06.
Epoch     7: reducing learning rate of group 0 to 1.0000e-07.

Epoch     9: reducing learning rate of group 0 to 1.0000e-06.
Epoch     9: reducing learning rate of group 0 to 1.0000e-06.
Epoch     9: reducing learning rate of group 0 to 1.0000e-07.
Epoch     9: reducing learning rate of group 0 to 1.0000e-08.

Epoch    11: reducing learning rate of group 0 to 1.0000e-07.
Epoch    11: reducing learning rate of group 0 to 1.0000e-07.
Epoch    11: reducing learning rate of group 0 to 1.0000e-08.

Epoch    13: reducing learning rate of group 0 to 1.0000e-08.
Epoch    13: reducing learning rate of group 0 to 1.0000e-08.

In [11]:
evaluate_in_batches([mapk], val_interactions, model)

0.03322950584653143

In [12]:
model.set_stage('metadata_only')
trainer.max_epochs += 2

trainer.fit(model)

mapk_score = evaluate_in_batches([mapk], val_interactions, model)
print(f'mapk: {mapk_score}')

Global seed set to 22

mapk: 0.019207124522373234


In [13]:
model.set_stage('all')
trainer.max_epochs += 5

trainer.fit(model)

mapk_score = evaluate_in_batches([mapk], val_interactions, model)
print(f'mapk: {mapk_score}')

Global seed set to 22

mapk: 0.03491353721837874


In [14]:
trainer.max_epochs += 5

trainer.fit(model)

mapk_score = evaluate_in_batches([mapk], val_interactions, model)
print(f'mapk: {mapk_score}')

Global seed set to 22

Epoch     6: reducing learning rate of group 0 to 1.0000e-03.
Epoch     6: reducing learning rate of group 0 to 1.0000e-03.
Epoch     6: reducing learning rate of group 0 to 1.0000e-04.
Epoch     6: reducing learning rate of group 0 to 1.0000e-05.

mapk: 0.03968827397869582


In [15]:
trainer.max_epochs += 5

trainer.fit(model)

mapk_score = evaluate_in_batches([mapk], val_interactions, model)
print(f'mapk: {mapk_score}')

Global seed set to 22

Epoch     3: reducing learning rate of group 0 to 1.0000e-03.
Epoch     3: reducing learning rate of group 0 to 1.0000e-03.
Epoch     3: reducing learning rate of group 0 to 1.0000e-04.
Epoch     3: reducing learning rate of group 0 to 1.0000e-05.

Epoch     5: reducing learning rate of group 0 to 1.0000e-04.
Epoch     5: reducing learning rate of group 0 to 1.0000e-04.
Epoch     5: reducing learning rate of group 0 to 1.0000e-05.
Epoch     5: reducing learning rate of group 0 to 1.0000e-06.

mapk: 0.042952518262553904


In [16]:
trainer.max_epochs += 5

trainer.fit(model)

mapk_score = evaluate_in_batches([mapk], val_interactions, model)
print(f'mapk: {mapk_score}')

Global seed set to 22

Epoch     3: reducing learning rate of group 0 to 1.0000e-03.
Epoch     3: reducing learning rate of group 0 to 1.0000e-03.
Epoch     3: reducing learning rate of group 0 to 1.0000e-04.
Epoch     3: reducing learning rate of group 0 to 1.0000e-05.

Epoch     5: reducing learning rate of group 0 to 1.0000e-04.
Epoch     5: reducing learning rate of group 0 to 1.0000e-04.
Epoch     5: reducing learning rate of group 0 to 1.0000e-05.
Epoch     5: reducing learning rate of group 0 to 1.0000e-06.

mapk: 0.04662143655527791


In [17]:
model.set_stage('all')
trainer.max_epochs += 5

trainer.fit(model)

mapk_score = evaluate_in_batches([mapk], val_interactions, model)
print(f'mapk: {mapk_score}')

Global seed set to 22

Epoch     3: reducing learning rate of group 0 to 1.0000e-03.
Epoch     3: reducing learning rate of group 0 to 1.0000e-03.
Epoch     3: reducing learning rate of group 0 to 1.0000e-04.
Epoch     3: reducing learning rate of group 0 to 1.0000e-05.

Epoch     5: reducing learning rate of group 0 to 1.0000e-04.
Epoch     5: reducing learning rate of group 0 to 1.0000e-04.
Epoch     5: reducing learning rate of group 0 to 1.0000e-05.
Epoch     5: reducing learning rate of group 0 to 1.0000e-06.

mapk: 0.04774004840593153


In [18]:
model.set_stage('all')
trainer.max_epochs += 5

trainer.fit(model)

mapk_score = evaluate_in_batches([mapk], val_interactions, model)
print(f'mapk: {mapk_score}')

Global seed set to 22

Epoch     3: reducing learning rate of group 0 to 1.0000e-03.
Epoch     3: reducing learning rate of group 0 to 1.0000e-03.
Epoch     3: reducing learning rate of group 0 to 1.0000e-04.
Epoch     3: reducing learning rate of group 0 to 1.0000e-05.

Epoch     5: reducing learning rate of group 0 to 1.0000e-04.
Epoch     5: reducing learning rate of group 0 to 1.0000e-04.
Epoch     5: reducing learning rate of group 0 to 1.0000e-05.
Epoch     5: reducing learning rate of group 0 to 1.0000e-06.

mapk: 0.04625101001712315


In [19]:
model.set_stage('all')
trainer.max_epochs += 5

trainer.fit(model)

mapk_score = evaluate_in_batches([mapk], val_interactions, model)
print(f'mapk: {mapk_score}')

Global seed set to 22

Epoch     3: reducing learning rate of group 0 to 1.0000e-03.
Epoch     3: reducing learning rate of group 0 to 1.0000e-03.
Epoch     3: reducing learning rate of group 0 to 1.0000e-04.
Epoch     3: reducing learning rate of group 0 to 1.0000e-05.

Epoch     5: reducing learning rate of group 0 to 1.0000e-04.
Epoch     5: reducing learning rate of group 0 to 1.0000e-04.
Epoch     5: reducing learning rate of group 0 to 1.0000e-05.
Epoch     5: reducing learning rate of group 0 to 1.0000e-06.

mapk: 0.044583859571980106


In [20]:
# we can save the model with...
os.makedirs('models', exist_ok=True)

# ``overwrite=True`` replaces a model already saved at this path by an earlier run,
# which otherwise raises a ``ValueError``
model.save_model('models/hybrid_model', overwrite=True)
hybrid_model_loaded_in = HybridModel(load_model_path='models/hybrid_model')


hybrid_model_loaded_in

## Train a ``MatrixFactorizationModel`` 

The first step towards training a Collie Hybrid model is to train a regular ``MatrixFactorizationModel`` to generate rich user and item embeddings. We'll use these embeddings in a ``HybridPretrainedModel`` a bit later. 
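As a rough sketch of what these embeddings are for: a matrix factorization model typically scores a user-item pair with the dot product of the two embedding vectors, often plus per-user and per-item bias terms. The NumPy snippet below illustrates that idea with randomly initialized, untrained parameters. The array names and the ``score`` helper are hypothetical and written only for illustration; this is not Collie's implementation.

```python
import numpy as np

rng = np.random.default_rng(22)

# sizes taken from the ``Interactions`` summary and the ``embedding_dim`` used in this notebook
num_users, num_items, embedding_dim = 943, 1674, 30

# hypothetical, randomly initialized parameters (a trained model would learn these)
user_embeddings = rng.normal(scale=0.1, size=(num_users, embedding_dim))
item_embeddings = rng.normal(scale=0.1, size=(num_items, embedding_dim))
user_biases = np.zeros(num_users)
item_biases = np.zeros(num_items)


def score(user_id, item_ids):
    """Score items for one user: dot product of embeddings plus bias terms."""
    item_ids = np.asarray(item_ids)
    dot_products = item_embeddings[item_ids] @ user_embeddings[user_id]
    return dot_products + user_biases[user_id] + item_biases[item_ids]


# score every item for user 0 and take the ten highest-scoring item IDs
all_scores = score(0, np.arange(num_items))
top_10_items = np.argsort(-all_scores)[:10]
```

Training adjusts the embeddings so that items a user interacted with score higher than items they did not, which is what makes the learned vectors useful as inputs to a hybrid model.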

In [None]:
mf_model = MatrixFactorizationModel(
    train=train_interactions,
    val=val_interactions,
    embedding_dim=30,
    lr=1e-2,
)

In [None]:
trainer = CollieTrainer(model=mf_model, max_epochs=10, deterministic=True)

trainer.fit(mf_model)

In [None]:
# evaluate the matrix factorization model we just trained (``mf_model``)
mapk_score, mrr_score, auc_score = evaluate_in_batches([mapk, mrr, auc], val_interactions, mf_model)

print(f'Standard MAP@10 Score: {mapk_score}')
print(f'Standard MRR Score:    {mrr_score}')
print(f'Standard AUC Score:    {auc_score}')

## Train a ``HybridPretrainedModel`` 

With our trained ``mf_model`` above, we can now use its embeddings and additional side data directly in a hybrid model. The architecture takes the user embedding, item embedding, and item metadata for each user-item interaction, concatenates them, and sends the result through a simple feedforward network to output a recommendation score. 
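The forward pass described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the weights are random rather than trained, the ReLU activation and the ``linear`` and ``hybrid_score`` helpers are hypothetical choices, and ``num_metadata_features`` is an arbitrary placeholder rather than the real width of ``metadata_df``. Collie's actual layers may differ.

```python
import numpy as np

rng = np.random.default_rng(22)

# ``num_metadata_features`` is a placeholder, not the true number of metadata columns
embedding_dim, num_metadata_features = 30, 28
metadata_dim, combined_dim = 8, 16


def linear(in_dim, out_dim):
    """Create hypothetical, randomly initialized weights and biases for one linear layer."""
    return rng.normal(scale=0.1, size=(in_dim, out_dim)), np.zeros(out_dim)


def relu(x):
    return np.maximum(x, 0.0)


metadata_w, metadata_b = linear(num_metadata_features, metadata_dim)
combined_w, combined_b = linear(2 * embedding_dim + metadata_dim, combined_dim)
output_w, output_b = linear(combined_dim, 1)


def hybrid_score(user_embedding, item_embedding, item_metadata):
    """Concatenate embeddings with processed metadata, then apply a small feedforward net."""
    metadata_hidden = relu(item_metadata @ metadata_w + metadata_b)
    combined = np.concatenate([user_embedding, item_embedding, metadata_hidden], axis=-1)
    hidden = relu(combined @ combined_w + combined_b)
    return (hidden @ output_w + output_b).squeeze(-1)


# a batch of four user-item interactions with binary metadata rows
batch_size = 4
scores = hybrid_score(
    rng.normal(size=(batch_size, embedding_dim)),
    rng.normal(size=(batch_size, embedding_dim)),
    rng.integers(0, 2, size=(batch_size, num_metadata_features)).astype(float),
)
```

The sizes ``8`` and ``16`` mirror the ``metadata_layers_dims`` and ``combined_layers_dims`` values used in this notebook, so the sketch produces one score per interaction in the batch.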

We can initially freeze the user and item embeddings from our previously trained ``mf_model``, train for a few epochs while optimizing only the newly added linear layers, and then train with everything unfrozen at a lower learning rate. We show this process below. 

In [None]:
# we will apply a linear layer to the metadata with ``metadata_layers_dims`` and
# a linear layer to the combined embeddings and metadata data with ``combined_layers_dims``
hybrid_model = HybridPretrainedModel(
    train=train_interactions,
    val=val_interactions,
    item_metadata=metadata_df,
    trained_model=mf_model,
    metadata_layers_dims=[8],
    combined_layers_dims=[16],
    lr=1e-2,
    freeze_embeddings=True,
)

In [None]:
hybrid_trainer = CollieTrainer(model=hybrid_model, max_epochs=10, deterministic=True)

hybrid_trainer.fit(hybrid_model)

In [None]:
mapk_score, mrr_score, auc_score = evaluate_in_batches([mapk, mrr, auc], val_interactions, hybrid_model)

print(f'Hybrid MAP@10 Score: {mapk_score}')
print(f'Hybrid MRR Score:    {mrr_score}')
print(f'Hybrid AUC Score:    {auc_score}')

In [None]:
hybrid_model_unfrozen = HybridPretrainedModel(
    train=train_interactions,
    val=val_interactions,
    item_metadata=metadata_df,
    trained_model=mf_model,
    metadata_layers_dims=[8],
    combined_layers_dims=[16],
    lr=1e-4,
    freeze_embeddings=False,
)

hybrid_model.unfreeze_embeddings()
hybrid_model_unfrozen.load_from_hybrid_model(hybrid_model)

In [None]:
hybrid_trainer_unfrozen = CollieTrainer(model=hybrid_model_unfrozen, max_epochs=10, deterministic=True)

hybrid_trainer_unfrozen.fit(hybrid_model_unfrozen)

In [None]:
mapk_score, mrr_score, auc_score = evaluate_in_batches([mapk, mrr, auc],
                                                       val_interactions,
                                                       hybrid_model_unfrozen)

print(f'Hybrid Unfrozen MAP@10 Score: {mapk_score}')
print(f'Hybrid Unfrozen MRR Score:    {mrr_score}')
print(f'Hybrid Unfrozen AUC Score:    {auc_score}')

Note here that while our ``MAP@10`` and ``MRR`` scores went down slightly from the frozen version of the model above, our ``AUC`` score increased. For implicit recommendation models, each evaluation metric is nuanced in what it represents for real world recommendations. 

You can read more about each evaluation metric by checking out the [Mean Average Precision at K (MAP@K)](https://en.wikipedia.org/wiki/Evaluation_measures_(information_retrieval)#Mean_average_precision), [Mean Reciprocal Rank](https://en.wikipedia.org/wiki/Mean_reciprocal_rank), and [Area Under the Curve (AUC)](https://en.wikipedia.org/wiki/Receiver_operating_characteristic#Area_under_the_curve) Wikipedia pages. 
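To make the ranking metrics concrete, here is a minimal, pure-Python sketch of average precision at K and reciprocal rank for a single user; MAP@K and MRR are the means of these values across all users. The function names are hypothetical, and the normalization used for average precision is one common convention, so Collie's ``mapk`` and ``mrr`` may differ in the details.

```python
def average_precision_at_k(ranked_items, relevant_items, k=10):
    """Average precision over the top ``k`` ranked items for a single user."""
    relevant_items = set(relevant_items)
    if not relevant_items:
        return 0.0

    hits, precision_sum = 0, 0.0
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item in relevant_items:
            hits += 1
            # precision at this rank, counted only where a relevant item appears
            precision_sum += hits / rank

    # assumed convention: normalize by the best achievable number of hits in the top ``k``
    return precision_sum / min(len(relevant_items), k)


def reciprocal_rank(ranked_items, relevant_items):
    """Return one over the rank of the first relevant item, or 0.0 if none is ranked."""
    relevant_items = set(relevant_items)
    for rank, item in enumerate(ranked_items, start=1):
        if item in relevant_items:
            return 1.0 / rank
    return 0.0


# a user's recommendations, best first, and the items they actually interacted with
ranked = [7, 3, 9, 1, 4]
relevant = {3, 4}

ap = average_precision_at_k(ranked, relevant, k=5)  # (1/2 + 2/5) / 2 = 0.45
rr = reciprocal_rank(ranked, relevant)              # first hit at rank 2 -> 0.5
```

Both metrics reward placing relevant items near the top of the list, whereas AUC considers how relevant items rank against all non-relevant ones, which is one reason the metrics can move in different directions.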

In [None]:
user_id = np.random.randint(0, train_interactions.num_users)

display(
    HTML(
        get_recommendation_visualizations(
            model=hybrid_model_unfrozen,
            user_id=user_id,
            filter_films=True,
            shuffle=True,
            detailed=True,
        )
    )
)

The metrics and recommendations look great, and we should expect the gap over a standard matrix factorization model to widen as our data becomes larger and more complex (such as with MovieLens 10M data). 

If we're happy with this model, we can go ahead and save it for later! 

## Save and Load a Hybrid Model 

In [None]:
# we can save the model with...
os.makedirs('models', exist_ok=True)
hybrid_model_unfrozen.save_model('models/hybrid_model_unfrozen')

In [None]:
# ... and if we wanted to load that model back in, we can do that easily...
hybrid_model_loaded_in = HybridPretrainedModel(load_model_path='models/hybrid_model_unfrozen')


hybrid_model_loaded_in

That's the end of our tutorials, but it's not the end of the awesome features available in Collie. Check out all the different available architectures in the documentation [here](https://collie.readthedocs.io/en/latest/index.html)! 

----- 