# Recommender models

**Framework**: [Collie](https://github.com/ShopRunner/collie)

**Models**:
- implicit feedback:
    - MatrixFactorization
    - NeuralCollaborativeFiltering
    - Hybrid (item metadata support)
    - ColdStartHybrid (item cold start support)
- explicit feedback:
    - MatrixFactorization
    
**Evaluation Metrics**:
- implicit feedback:
    - MAP@10
        <img src="images/mapk_explained.png" width="300">

        <img src="images/mapk.png" width="200">
    - MRR
        <img src="images/mrr.png" width="350">
    - AUC - [detailed explanation](https://developers.google.com/machine-learning/crash-course/classification/roc-and-auc)

- explicit feedback:
    - MSE
        <img src="images/mse.png" width="200">
    - MAE
        <img src="images/mae.png" width="200">
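The ranking metrics listed for implicit feedback can be sketched in plain Python. This is a minimal illustration of one common definition of AP@k and reciprocal rank; the function names and toy data are made up for the example, and the `mapk` and `mrr` functions imported from the local `metrics` module later in this notebook may normalize differently.

```python
def average_precision_at_k(recommended, relevant, k=10):
    """AP@k for one user: precision@i summed over the ranks i that hold a
    relevant item, divided by min(number of relevant items, k)."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, item in enumerate(recommended[:k], start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / min(len(relevant), k)


def reciprocal_rank(recommended, relevant):
    """1 / rank of the first relevant item, or 0 if none is recommended."""
    relevant = set(relevant)
    for rank, item in enumerate(recommended, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_over_users(metric, recommendations, ground_truth, **kwargs):
    """Average a per-user metric over every user with ground truth."""
    scores = [metric(recommendations[u], ground_truth[u], **kwargs) for u in ground_truth]
    return sum(scores) / len(scores)


# Toy data (hypothetical): ranked recommendations and held-out positives per user.
recommendations = {"u1": [3, 7, 1, 9], "u2": [4, 2, 8, 5]}
ground_truth = {"u1": [3, 9], "u2": [5]}

map_at_10 = mean_over_users(average_precision_at_k, recommendations, ground_truth, k=10)
mrr_value = mean_over_users(reciprocal_rank, recommendations, ground_truth)
print(map_at_10, mrr_value)
```

MAP@k rewards placing all relevant items near the top of the list, while MRR only looks at the position of the first relevant item.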

In [1]:
import os

import numpy as np
import pandas as pd
from pytorch_lightning.utilities.seed import seed_everything
from IPython.display import HTML
import torch
import torchmetrics
from collie.cross_validation import stratified_split
from collie.interactions import Interactions, ExplicitInteractions
from collie.utils import convert_to_implicit, remove_users_with_fewer_than_n_interactions
from collie.metrics import evaluate_in_batches, explicit_evaluate_in_batches
from collie.model import (
    CollieTrainer, HybridModel, MatrixFactorizationModel, NeuralCollaborativeFiltering, ColdStartModel)
from collie.movielens import (
    get_movielens_metadata, get_recommendation_visualizations, read_movielens_df, read_movielens_df_item)

from metrics import mapk, mrr, auc
from utils import train_multi_stage_model, evaluate_explicit

### get MovieLens 100k data

In [2]:
df = read_movielens_df(decrement_ids=True)

In [3]:
df.head()

| | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 195 | 241 | 3 | 881250949 |
| 1 | 185 | 301 | 3 | 891717742 |
| 2 | 21 | 376 | 1 | 878887116 |
| 3 | 243 | 50 | 2 | 880606923 |
| 4 | 165 | 345 | 1 | 886397596 |


In [4]:
implicit_df = convert_to_implicit(df, min_rating_to_keep=4)
implicit_df = remove_users_with_fewer_than_n_interactions(implicit_df, min_num_of_interactions=3)

In [5]:
implicit_df.head()

| | user_id | item_id | rating | timestamp |
|---|---|---|---|---|
| 0 | 696 | 324 | 1 | 882621673 |
| 1 | 665 | 161 | 1 | 880568662 |
| 2 | 395 | 1024 | 1 | 884645839 |
| 3 | 859 | 201 | 1 | 885990932 |
| 4 | 21 | 225 | 1 | 878888145 |
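The two preprocessing calls above turn star ratings into binary "liked" signals and then drop users with too little history. The sketch below mimics that logic on plain tuples so the steps are visible; `to_implicit` and `drop_sparse_users` are hypothetical names, and this is an assumption about the behavior, not Collie's implementation.

```python
from collections import Counter


def to_implicit(rows, min_rating_to_keep=4):
    """Keep (user, item) pairs rated at or above the threshold, drop duplicate
    pairs, and relabel every kept rating as 1."""
    seen = set()
    kept = []
    for user_id, item_id, rating in rows:
        if rating >= min_rating_to_keep and (user_id, item_id) not in seen:
            seen.add((user_id, item_id))
            kept.append((user_id, item_id, 1))
    return kept


def drop_sparse_users(rows, min_num_of_interactions=3):
    """Remove every user who has fewer than the required number of interactions."""
    counts = Counter(user_id for user_id, _, _ in rows)
    return [row for row in rows if counts[row[0]] >= min_num_of_interactions]


# Toy explicit ratings (hypothetical): (user_id, item_id, rating).
ratings = [
    (0, 10, 5), (0, 11, 4), (0, 12, 4), (0, 13, 2),
    (1, 10, 5), (1, 11, 3), (1, 12, 1),
]
implicit_rows = drop_sparse_users(to_implicit(ratings))
print(implicit_rows)
```

In the toy data, user 1 keeps only one interaction after thresholding and is therefore removed entirely, which is why the order of the two steps matters.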


In [6]:
interactions = Interactions(
    users=implicit_df["user_id"],
    items=implicit_df["item_id"],
    ratings=implicit_df["rating"],
    allow_missing_ids=True,
)

train_interactions, val_interactions = stratified_split(interactions, test_p=0.1, seed=42)

Checking for and removing duplicate user, item ID pairs...
Checking ``num_negative_samples`` is valid...
Maximum number of items a user has interacted with: 378
Generating positive items set...
Generating positive items set...
Generating positive items set...


In [7]:
print("Train:", train_interactions)
print("Val:  ", val_interactions)

Train: Interactions object with 49426 interactions between 943 users and 1674 items, returning 10 negative samples per interaction.
Val:   Interactions object with 5949 interactions between 943 users and 1674 items, returning 10 negative samples per interaction.
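Both splits above still contain all 943 users, which is the point of stratifying by user. A minimal sketch of that idea, assuming a simple per-user holdout (the function name and rounding rule are illustrative and not taken from Collie's `stratified_split`):

```python
import random
from collections import defaultdict


def stratified_split_sketch(pairs, test_p=0.1, seed=42):
    """Split (user, item) pairs so that each user with at least two items
    appears in both the train and the validation set."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user_id, item_id in pairs:
        by_user[user_id].append(item_id)

    train, val = [], []
    for user_id, items in by_user.items():
        items = items[:]
        rng.shuffle(items)
        # Hold out roughly test_p of the user's items: at least one,
        # but never all of them.
        n_val = max(1, round(len(items) * test_p))
        n_val = min(n_val, len(items) - 1)
        val += [(user_id, item) for item in items[:n_val]]
        train += [(user_id, item) for item in items[n_val:]]
    return train, val


# Toy data (hypothetical): 3 users with 10 items each.
pairs = [(user, item) for user in range(3) for item in range(10)]
train_pairs, val_pairs = stratified_split_sketch(pairs, test_p=0.1, seed=42)
print(len(train_pairs), len(val_pairs))
```

A purely random split over all interactions could leave some users with no validation items, which would make per-user ranking metrics undefined for them.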


## Matrix Factorization model

<img src="images/matrix_factorization.png" width="600">

In [8]:
seed_everything(2021)

Global seed set to 2021


2021

In [9]:
mf_model = MatrixFactorizationModel(
    train=train_interactions,
    val=val_interactions,
    embedding_dim=10,
    lr=1e-2,
)
mf_model

MatrixFactorizationModel(
  (user_biases): ZeroEmbedding(943, 1)
  (item_biases): ZeroEmbedding(1674, 1)
  (user_embeddings): ScaledEmbedding(943, 10)
  (item_embeddings): ScaledEmbedding(1674, 10)
  (dropout): Dropout(p=0.0, inplace=False)
)
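The printed layers suggest how a score is formed: one embedding vector and one bias per user and per item. A minimal sketch, assuming the usual matrix factorization score (dot product plus both biases); the numbers and item names are invented, and dropout is ignored because it is set to 0.0 here.

```python
def mf_score(user_vec, item_vec, user_bias=0.0, item_bias=0.0):
    """Dot product of the two embeddings plus the user and item bias terms."""
    dot = sum(u * i for u, i in zip(user_vec, item_vec))
    return dot + user_bias + item_bias


# Toy embeddings (hypothetical); the notebook's model uses embedding_dim=10.
user_vec = [0.5, -1.0, 2.0]
user_bias = 0.1
item_vecs = {
    "item_a": [1.0, 0.5, 0.25],
    "item_b": [0.0, 1.0, 0.0],
    "item_c": [2.0, 0.0, 1.0],
}
item_biases = {"item_a": -0.2, "item_b": 0.0, "item_c": 0.3}

scores = {
    name: mf_score(user_vec, vec, user_bias, item_biases[name])
    for name, vec in item_vecs.items()
}
# Recommending means sorting candidate items by score, highest first.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```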

In [10]:
mf_trainer = CollieTrainer(mf_model, max_epochs=5, deterministic=True, log_every_n_steps=40)
mf_trainer.fit(mf_model)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs

  | Name            | Type            | Params
----------------------------------------------------
0 | user_biases     | ZeroEmbedding   | 943   
1 | item_biases     | ZeroEmbedding   | 1.7 K 
2 | user_embeddings | ScaledEmbedding | 9.4 K 
3 | item_embeddings | ScaledEmbedding | 16.7 K
4 | dropout         | Dropout         | 0     
----------------------------------------------------
28.8 K    Trainable params
0         Non-trainable params
28.8 K    Total params
0.115     Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

Global seed set to 2021


Training: -1it [00:00, ?it/s]

Epoch     3: reducing learning rate of group 0 to 1.0000e-03.
Epoch     3: reducing learning rate of group 0 to 1.0000e-03.


Epoch     5: reducing learning rate of group 0 to 1.0000e-04.
Epoch     5: reducing learning rate of group 0 to 1.0000e-04.


In [11]:
mf_model.eval()
mapk_score, mrr_score, auc_score = evaluate_in_batches([mapk, mrr, auc], val_interactions, mf_model)

print(f"MAP@10 Score: {mapk_score}")
print(f"MRR Score:    {mrr_score}")
print(f"AUC Score:    {auc_score}")

  0%|          | 0/48 [00:00<?, ?it/s]

MAP@10 Score: 0.045361127593681175
MRR Score:    0.16595347326905358
AUC Score:    0.9032740756204933


In [12]:
user_id = 100

parameters = {"model": mf_model, "user_id": user_id, "filter_films": True, "shuffle": True, "detailed": True}
display(HTML(get_recommendation_visualizations(**parameters)))

Some loved films: Liar Liar (1997); Wag the Dog (1997); Good Will Hunting (1997); Seven Years in Tibet (1997); Contact (1997); Titanic (1997); Peacemaker, The (1997); Big Bang Theory, The (1994); Amistad (1997); Apostle, The (1997)

Recommended films: English Patient, The (1996); Scream (1996); Game, The (1997); Devil's Advocate, The (1997); Boogie Nights (1997); Chasing Amy (1997); Ulee's Gold (1997); Rainmaker, The (1997); G.I. Jane (1997); Kiss the Girls (1997)


## Neural Collaborative Filtering

In [13]:
ncf_model = NeuralCollaborativeFiltering(
    train=train_interactions,
    val=val_interactions,
    embedding_dim=10,
    lr=1e-2,
)
ncf_model

NeuralCollaborativeFiltering(
  (user_embeddings_cf): ScaledEmbedding(943, 10)
  (item_embeddings_cf): ScaledEmbedding(1674, 10)
  (user_embeddings_mlp): ScaledEmbedding(943, 40)
  (item_embeddings_mlp): ScaledEmbedding(1674, 40)
  (mlp_layers): Sequential(
    (0): Dropout(p=0.0, inplace=False)
    (1): Linear(in_features=80, out_features=40, bias=True)
    (2): ReLU()
    (3): Dropout(p=0.0, inplace=False)
    (4): Linear(in_features=40, out_features=20, bias=True)
    (5): ReLU()
    (6): Dropout(p=0.0, inplace=False)
    (7): Linear(in_features=20, out_features=10, bias=True)
    (8): ReLU()
  )
  (predict_layer): Linear(in_features=20, out_features=1, bias=True)
)

In [14]:
ncf_trainer = CollieTrainer(ncf_model, max_epochs=5, deterministic=True, log_every_n_steps=40)
ncf_trainer.fit(ncf_model)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs

  | Name                | Type            | Params
--------------------------------------------------------
0 | user_embeddings_cf  | ScaledEmbedding | 9.4 K 
1 | item_embeddings_cf  | ScaledEmbedding | 16.7 K
2 | user_embeddings_mlp | ScaledEmbedding | 37.7 K
3 | item_embeddings_mlp | ScaledEmbedding | 67.0 K
4 | mlp_layers          | Sequential      | 4.3 K 
5 | predict_layer       | Linear          | 21    
--------------------------------------------------------
135 K     Trainable params
0         Non-trainable params
135 K     Total params
0.541     Total estimated model params size (MB)
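The parameter counts in the summary can be reproduced from the layer shapes in the model printout, which is a useful check on how the network is wired. The sketch below uses only shapes shown above; the helper names are illustrative, and 4 bytes per parameter (float32) is an assumption that happens to match the reported size.

```python
def embedding_params(num_rows, dim):
    """An embedding table stores one vector of length dim per row."""
    return num_rows * dim


def linear_params(in_features, out_features):
    """A linear layer stores a weight matrix plus one bias per output."""
    return in_features * out_features + out_features


num_users, num_items = 943, 1674

ncf_params = {
    "user_embeddings_cf": embedding_params(num_users, 10),
    "item_embeddings_cf": embedding_params(num_items, 10),
    "user_embeddings_mlp": embedding_params(num_users, 40),
    "item_embeddings_mlp": embedding_params(num_items, 40),
    # The MLP takes the concatenated 40 + 40 embeddings and narrows to 10.
    "mlp_layers": linear_params(80, 40) + linear_params(40, 20) + linear_params(20, 10),
    "predict_layer": linear_params(20, 1),
}

total_params = sum(ncf_params.values())
size_mb = total_params * 4 / 1e6  # assumes 4-byte float32 parameters

print(total_params)       # 135141, shown as "135 K" in the summary
print(round(size_mb, 3))  # 0.541
```

The `predict_layer` input width of 20 is consistent with a 10-dimensional MLP output concatenated with a 10-dimensional vector from the collaborative filtering embeddings, as in the standard NCF design.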


Validation sanity check: 0it [00:00, ?it/s]

Global seed set to 2021


Training: -1it [00:00, ?it/s]

Epoch     3: reducing learning rate of group 0 to 1.0000e-03.


Epoch     5: reducing learning rate of group 0 to 1.0000e-04.


In [15]:
ncf_model.eval()  # set model to inference mode
mapk_score, mrr_score, auc_score = evaluate_in_batches([mapk, mrr, auc], val_interactions, ncf_model)

print(f"MAP@10 Score: {mapk_score}")
print(f"MRR Score:    {mrr_score}")
print(f"AUC Score:    {auc_score}")

  0%|          | 0/48 [00:00<?, ?it/s]

MAP@10 Score: 0.04282151476223107
MRR Score:    0.16067300901418025
AUC Score:    0.9033286978990915


In [16]:
parameters = {"model": ncf_model, "user_id": user_id, "filter_films": True, "shuffle": True, "detailed": True}
display(HTML(get_recommendation_visualizations(**parameters)))

Some loved films: Liar Liar (1997); Wag the Dog (1997); Good Will Hunting (1997); Seven Years in Tibet (1997); Contact (1997); Titanic (1997); Peacemaker, The (1997); Big Bang Theory, The (1994); Amistad (1997); Apostle, The (1997)

Recommended films: Scream (1996); English Patient, The (1996); Devil's Advocate, The (1997); Chasing Amy (1997); Boogie Nights (1997); Game, The (1997); Ulee's Gold (1997); In & Out (1997); Cop Land (1997); Rosewood (1997)


## Hybrid multi-stage model
#### This multi-stage model incorporates item metadata during training and can then be used to find similar items

In [17]:
metadata_df = get_movielens_metadata()
metadata_df.head()

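| | genre_action | genre_adventure | genre_animation | genre_children | genre_comedy | genre_crime | genre_documentary | genre_drama | genre_fantasy | genre_film_noir | ... | genre_unknown | decade_unknown | decade_20 | decade_30 | decade_40 | decade_50 | decade_60 | decade_70 | decade_80 | decade_90 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 3 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 4 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |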


In [18]:
hybrid_model = HybridModel(
    train=train_interactions,
    val=val_interactions,
    item_metadata=metadata_df,
    metadata_layers_dims=[8],
    combined_layers_dims=[16],
    embedding_dim=10,
    lr=1e-2,  # used in stage 1
    bias_lr=1e-1,  # used in stage 1
    metadata_only_stage_lr=1e-3,  # used in stage 2
    all_stage_lr=1e-4)  # used in stage 3 (the final stage)
display(hybrid_model)

Set ``self.hparams.stage`` to "matrix_factorization"


HybridModel(
  (user_biases): ZeroEmbedding(943, 1)
  (item_biases): ZeroEmbedding(1674, 1)
  (user_embeddings): ScaledEmbedding(943, 10)
  (item_embeddings): ScaledEmbedding(1674, 10)
  (dropout): Dropout(p=0.0, inplace=False)
  (metadata_layer_0): Linear(in_features=28, out_features=8, bias=True)
  (combined_layer_0): Linear(in_features=28, out_features=16, bias=True)
  (combined_layer_1): Linear(in_features=16, out_features=1, bias=True)
)

In [19]:
hybrid_trainer = CollieTrainer(
    model=hybrid_model,
    max_epochs=5,
    logger=False,
    checkpoint_callback=False,
    weights_summary=None,
)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs


In [20]:
hybrid_model = train_multi_stage_model(hybrid_model, hybrid_trainer, val_interactions, n_stages=3)

Stage: 0


Validation sanity check: 0it [00:00, ?it/s]

Global seed set to 2021


Training: -1it [00:00, ?it/s]

  0%|          | 0/48 [00:00<?, ?it/s]

MAP@10 Score: 0.040842806631971026
MRR Score:    0.15537292938956282
AUC Score:    0.9069970317200744
Stage: 1
Set ``self.hparams.stage`` to "metadata_only"


Validation sanity check: 0it [00:00, ?it/s]

Global seed set to 2021


Training: 48it [00:00, ?it/s]

  0%|          | 0/48 [00:00<?, ?it/s]

MAP@10 Score: 0.039398807618116836
MRR Score:    0.16205928275349793
AUC Score:    0.9019731417329955
Stage: 2
Set ``self.hparams.stage`` to "all"


Validation sanity check: 0it [00:00, ?it/s]

Global seed set to 2021


Training: 48it [00:00, ?it/s]

  0%|          | 0/48 [00:00<?, ?it/s]

MAP@10 Score: 0.0391707965033147
MRR Score:    0.161514215044788
AUC Score:    0.9031718420121827
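The log above shows one fit per stage, with the model's stage advanced in between. The `train_multi_stage_model` helper comes from this notebook's local `utils` module and is not shown, so the following is only a hypothetical sketch of such a loop. It assumes the model exposes `advance_stage()` and that the trainer's `max_epochs` can be raised before each additional fit; both should be checked against the installed Collie version.

```python
def train_in_stages(model, trainer, n_stages=3, extra_epochs=5, evaluate=None):
    """Hypothetical helper: fit once per stage and advance the stage in between.

    model    -- object with an advance_stage() method (assumed)
    trainer  -- object with fit(model) and a writable max_epochs (assumed)
    evaluate -- optional callable run on the model after every stage
    """
    results = []
    for stage in range(n_stages):
        print(f"Stage: {stage}")
        if stage > 0:
            # Move to the next training stage and give the trainer a fresh
            # epoch budget, since it has already reached its previous limit.
            model.advance_stage()
            trainer.max_epochs += extra_epochs
        trainer.fit(model)
        if evaluate is not None:
            results.append(evaluate(model))
    return results
```

Evaluating after every stage, as the log does, makes it easy to see whether the metadata-only and joint stages actually improve on the plain matrix factorization stage; in the run above the three sets of scores are close to each other.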


In [21]:
parameters = {"model": hybrid_model, "user_id": user_id, "filter_films": True, "shuffle": True, "detailed": True}
display(HTML(get_recommendation_visualizations(**parameters)))

Some loved films: Liar Liar (1997); Wag the Dog (1997); Good Will Hunting (1997); Seven Years in Tibet (1997); Contact (1997); Titanic (1997); Peacemaker, The (1997); Big Bang Theory, The (1994); Amistad (1997); Apostle, The (1997)

Recommended films: English Patient, The (1996); Scream (1996); Fargo (1996); Jerry Maguire (1996); Leaving Las Vegas (1995); Boogie Nights (1997); Kiss the Girls (1997); Evita (1996); Rainmaker, The (1997); Godfather, The (1972)


#### read the item metadata needed to look up similar items with the Hybrid model

In [22]:
df_item = read_movielens_df_item()
df_item["item_id"] -= 1

In [23]:
item_id = 1000
df_item.iloc[[item_id]]

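| | item_id | movie_title | release_date | IMDb_URL | unknown | Action | Adventure | Animation | Children | Comedy | ... | Fantasy | Film_Noir | Horror | Musical | Mystery | Romance | Sci_Fi | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1000 | 1000 | Stupids, The (1996) | 1996-08-30 | http://us.imdb.com/M/title-exact?Stupids,%20Th... | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |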


#### find most similar items according to metadata

In [24]:
similar_items = hybrid_model.item_item_similarity(item_id)
df_item.iloc[similar_items.index[:5]]

| | item_id | movie_title | release_date | IMDb_URL | unknown | Action | Adventure | Animation | Children | Comedy | ... | Fantasy | Film_Noir | Horror | Musical | Mystery | Romance | Sci_Fi | Thriller | War | Western |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1000 | 1000 | Stupids, The (1996) | 1996-08-30 | http://us.imdb.com/M/title-exact?Stupids,%20Th... | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1087 | 1087 | Double Team (1997) | 1997-04-04 | http://us.imdb.com/M/title-exact?Double%20Team... | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1026 | 1026 | Shooter, The (1995) | 1995-01-01 | http://us.imdb.com/M/title-exact?Shooter,%20Th... | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1375 | 1375 | Meet Wally Sparks (1997) | 1997-01-31 | http://us.imdb.com/M/title-exact?Meet%20Wally%... | 0 | 0 | 0 | 0 | 0 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1490 | 1490 | Tough and Deadly (1995) | 1995-01-01 | http://us.imdb.com/M/title-exact?Tough%20and%2... | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
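The queried item comes back first in the result, which is what a similarity ranking produces when an item is compared against every item including itself. A minimal sketch of such a ranking, assuming cosine similarity over per-item vectors; the vectors and names are invented, and Collie's `item_item_similarity` may build its item representation differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(item_id, item_vectors, top_n=5):
    """Rank all items, the query item included, by similarity to item_id."""
    query = item_vectors[item_id]
    scored = {other: cosine_similarity(query, vec) for other, vec in item_vectors.items()}
    return sorted(scored, key=scored.get, reverse=True)[:top_n]


# Toy item vectors (hypothetical stand-ins for learned item representations).
item_vectors = {
    "a": [1.0, 0.0],
    "b": [2.0, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
print(most_similar("a", item_vectors, top_n=3))
```

Because cosine similarity ignores vector length, item "b" ranks right behind "a" even though its vector is about twice as long.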


## Explicit Matrix Factorization model

In [25]:
explicit_interactions = ExplicitInteractions(
    users=df['user_id'],
    items=df['item_id'],
    ratings=df['rating'],
    allow_missing_ids=True,
)

explicit_interactions

Checking for and removing duplicate user, item ID pairs...


ExplicitInteractions object with 100000 interactions between 943 users and 1682 items, with minimum rating of 1 and maximum rating of 5.

In [26]:
train_explicit_interactions, val_explicit_interactions = stratified_split(explicit_interactions, test_p=0.1, seed=2021)
print("Train:", train_explicit_interactions)
print("Val:  ", val_explicit_interactions)

Train: ExplicitInteractions object with 89561 interactions between 943 users and 1682 items, with minimum rating of 1 and maximum rating of 5.
Val:   ExplicitInteractions object with 10439 interactions between 943 users and 1682 items, with minimum rating of 1 and maximum rating of 5.


In [27]:
emf_model = MatrixFactorizationModel(
    train=train_explicit_interactions,
    val=val_explicit_interactions,
    embedding_dim=10,
    lr=1e-2,
    loss="mse",
    y_range=[1, 5],
)

emf_model

MatrixFactorizationModel(
  (loss_function): MSELoss()
  (user_biases): ZeroEmbedding(943, 1)
  (item_biases): ZeroEmbedding(1682, 1)
  (user_embeddings): ScaledEmbedding(943, 10)
  (item_embeddings): ScaledEmbedding(1682, 10)
  (dropout): Dropout(p=0.0, inplace=False)
)

In [28]:
emf_trainer = CollieTrainer(emf_model, max_epochs=5, deterministic=True)
emf_trainer.fit(emf_model)

GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs

  | Name            | Type            | Params
----------------------------------------------------
0 | loss_function   | MSELoss         | 0     
1 | user_biases     | ZeroEmbedding   | 943   
2 | item_biases     | ZeroEmbedding   | 1.7 K 
3 | user_embeddings | ScaledEmbedding | 9.4 K 
4 | item_embeddings | ScaledEmbedding | 16.8 K
5 | dropout         | Dropout         | 0     
----------------------------------------------------
28.9 K    Trainable params
0         Non-trainable params
28.9 K    Total params
0.115     Total estimated model params size (MB)


Validation sanity check: 0it [00:00, ?it/s]

Global seed set to 2021


Training: -1it [00:00, ?it/s]

Epoch     5: reducing learning rate of group 0 to 1.0000e-03.
Epoch     5: reducing learning rate of group 0 to 1.0000e-03.


In [29]:
mse_score, mae_score = evaluate_explicit(
    emf_model, val_explicit_interactions, [torchmetrics.MeanSquaredError(), torchmetrics.MeanAbsoluteError()])
print(f"MSE: {mse_score}")
print(f"MAE: {mae_score}")

MSE: 0.8832836151123047
MAE: 0.7385846972465515
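For reference, the two error metrics reported above reduce to a few lines of arithmetic. The sketch below is a library-free illustration on made-up numbers; the notebook itself uses `torchmetrics` through the local `evaluate_explicit` helper.

```python
def mean_squared_error(predicted, actual):
    """Average of squared differences; large misses are penalized more."""
    return sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual)


def mean_absolute_error(predicted, actual):
    """Average of absolute differences, in the same units as the ratings."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


# Toy predictions and true ratings (hypothetical) on a 1 to 5 scale.
predicted = [3.5, 4.0, 2.0, 5.0]
actual = [4, 4, 1, 3]

mse = mean_squared_error(predicted, actual)   # (0.25 + 0 + 1 + 4) / 4
mae = mean_absolute_error(predicted, actual)  # (0.5 + 0 + 1 + 2) / 4
print(mse, mae)
```

Read this way, the MAE of about 0.74 above means the model's predicted rating is off by roughly three quarters of a star on average.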


In [30]:
parameters = {"model": emf_model, "user_id": user_id, "filter_films": True, "shuffle": True, "detailed": True}
display(HTML(get_recommendation_visualizations(**parameters)))

Some loved films: Liar Liar (1997); Wag the Dog (1997); Good Will Hunting (1997); Seven Years in Tibet (1997); Contact (1997); Titanic (1997); Peacemaker, The (1997); Big Bang Theory, The (1994); Amistad (1997); Apostle, The (1997)

Recommended films: Braveheart (1995); Shawshank Redemption, The (1994); It's a Wonderful Life (1946); Schindler's List (1993); Star Wars (1977); Backbeat (1993); Priest (1994); Raiders of the Lost Ark (1981); Gaslight (1944); Mr. Holland's Opus (1995)
