Copyright 2021 The TensorFlow Authors.

In [1]:
#@title Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Recommending movies: retrieval using a sequential model

<table class="tfo-notebook-buttons" align="left">
  <td>
    <a target="_blank" href="https://www.tensorflow.org/recommenders/examples/sequential_retrieval"><img src="https://www.tensorflow.org/images/tf_logo_32px.png" />View on TensorFlow.org</a>
  </td>
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/tensorflow/recommenders/blob/main/docs/examples/sequential_retrieval.ipynb"><img src="https://www.tensorflow.org/images/colab_logo_32px.png" />Run in Google Colab</a>
  </td>
  <td>
    <a target="_blank" href="https://github.com/tensorflow/recommenders/blob/main/docs/examples/sequential_retrieval.ipynb"><img src="https://www.tensorflow.org/images/GitHub-Mark-32px.png" />View source on GitHub</a>
  </td>
  <td>
    <a href="https://storage.googleapis.com/tensorflow_docs/recommenders/docs/examples/sequential_retrieval.ipynb"><img src="https://www.tensorflow.org/images/download_logo_32px.png" />Download notebook</a>
  </td>
</table>

In this tutorial, we are going to build a sequential retrieval model. Sequential recommendation is a popular approach that looks at the sequence of items a user has previously interacted with and predicts the next item. Here the order of the items within each sequence matters, so we are going to use a recurrent neural network to model the sequential relationship. For more details, please refer to this [GRU4Rec paper](https://arxiv.org/abs/1511.06939).



## Imports

First let's get our dependencies and imports out of the way.

In [1]:
%pip install -q tensorflow-recommenders
%pip install -q --upgrade tensorflow-datasets
%pip install wget

Note: you may need to restart the kernel to use updated packages.


In [2]:
%pip install -q scann

Note: you may need to restart the kernel to use updated packages.


ERROR: Could not find a version that satisfies the requirement scann (from versions: none)
ERROR: No matching distribution found for scann


In [3]:
import os
import pprint
import tempfile

from typing import Dict, Text

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs

## Preparing the dataset

Next, we need to prepare our dataset. We are going to leverage the [data generation utility](https://github.com/tensorflow/examples/blob/master/lite/examples/recommendation/ml/data/example_generation_movielens.py) in this [TensorFlow Lite On-device Recommendation reference app](https://www.tensorflow.org/lite/examples/recommendation/overview).

The MovieLens 1M dataset contains ratings.dat (*columns: UserID, MovieID, Rating, Timestamp*) and movies.dat (*columns: MovieID, Title, Genres*). The example generation script downloads the 1M dataset, reads both files, keeps only ratings higher than 2, forms user movie interaction timelines, and samples activities as labels, with the 10 previous user activities as the context for prediction.
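
To make the timeline step more concrete, here is a minimal sketch of how per-user timelines could be turned into context/label examples. It is not the code of the generation script: the function name `build_examples`, the inclusive rating threshold, and right-padding with `0` are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def build_examples(
    ratings: List[Tuple[int, int, float, int]],
    min_rating: float = 2.0,
    min_timeline_length: int = 3,
    max_context_length: int = 10,
    pad_id: int = 0,
) -> List[Dict[str, List[int]]]:
    """Turn (user, movie, rating, timestamp) rows into context/label examples."""
    # Group the kept ratings into one timeline per user.
    timelines: Dict[int, List[Tuple[int, int]]] = {}
    for user_id, movie_id, rating, timestamp in ratings:
        if rating >= min_rating:  # assumption: the threshold is inclusive
            timelines.setdefault(user_id, []).append((timestamp, movie_id))

    examples = []
    for events in timelines.values():
        if len(events) < min_timeline_length:
            continue  # too short to produce useful examples
        movie_ids = [movie_id for _, movie_id in sorted(events)]
        # Each position after the first becomes a label; the movies
        # watched just before it become the context.
        for label_index in range(1, len(movie_ids)):
            start = max(0, label_index - max_context_length)
            context = movie_ids[start:label_index]
            # Assumption: short contexts are padded on the right with pad_id.
            context = context + [pad_id] * (max_context_length - len(context))
            examples.append({
                "context_movie_id": context,
                "label_movie_id": [movie_ids[label_index]],
            })
    return examples


rows = [
    (1, 10, 4.0, 100),
    (1, 20, 1.0, 200),  # dropped: rating below the threshold
    (1, 30, 5.0, 300),
    (1, 40, 3.0, 400),
    (2, 50, 5.0, 100),  # dropped: timeline shorter than 3
]
examples = build_examples(rows, max_context_length=3)
for example in examples:
    print(example)
```

Every example ends up with a context of the same length, which is what lets the parsing step later in this tutorial use a fixed feature shape.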

In [4]:
import wget
f = wget.download("https://raw.githubusercontent.com/tensorflow/examples/master/lite/examples/recommendation/ml/data/example_generation_movielens.py")

# %wget -nc https://raw.githubusercontent.com/tensorflow/examples/master/lite/examples/recommendation/ml/data/example_generation_movielens.py
!python -m example_generation_movielens  --data_dir="data/raw"  --output_dir="data/examples"  --min_timeline_length=3  --max_context_length=10  --max_context_movie_genre_length=10  --min_rating=2  --train_data_fraction=0.9  --build_vocabs=False

Downloading data from https://files.grouplens.org/datasets/movielens/ml-1m.zip


2022-05-25 20:02:47.298153: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'cudart64_110.dll'; dlerror: cudart64_110.dll not found
2022-05-25 20:02:47.298403: I tensorflow/stream_executor/cuda/cudart_stub.cc:29] Ignore above cudart dlerror if you do not have a GPU set up on your machine.
2022-05-25 20:02:50.828908: W tensorflow/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'nvcuda.dll'; dlerror: nvcuda.dll not found
2022-05-25 20:02:50.829405: W tensorflow/stream_executor/cuda/cuda_driver.cc:269] failed call to cuInit: UNKNOWN ERROR (303)
2022-05-25 20:02:50.832486: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:169] retrieving CUDA diagnostic information for host: LAPTOP-GRGNU5MA
2022-05-25 20:02:50.832677: I tensorflow/stream_executor/cuda/cuda_diagnostics.cc:176] hostname: LAPTOP-GRGNU5MA
I0525 20:02:50.832911 10280 example_generation_movielens.py:460] Downloading and extracting data.
I0525 20

Here is a sample of the generated dataset.

```
0 : {
  features: {
    feature: {
      key  : "context_movie_id"
      value: { int64_list: { value: [ 1124, 2240, 3251, ..., 1268 ] } }
    }
    feature: {
      key  : "context_movie_rating"
      value: { float_list: {value: [ 3.0, 3.0, 4.0, ..., 3.0 ] } }
    }
    feature: {
      key  : "context_movie_year"
      value: { int64_list: { value: [ 1981, 1980, 1985, ..., 1990 ] } }
    }
    feature: {
      key  : "context_movie_genre"
      value: { bytes_list: { value: [ "Drama", "Drama", "Mystery", ..., "UNK" ] } }
    }
    feature: {
      key  : "label_movie_id"
      value: { int64_list: { value: [ 3252 ] }  }
    }
  }
}
```
You can see that it includes a sequence of context movie IDs, and a label movie ID (next movie), plus context features such as movie year, rating and genre. 

In our case we will only be using the sequence of context movie IDs and the label movie ID. You can refer to the [Leveraging context features tutorial](https://www.tensorflow.org/recommenders/examples/context_features) to learn more about adding additional context features.

In [5]:
train_filename = "data/examples/train_movielens_1m.tfrecord"
train = tf.data.TFRecordDataset(train_filename)

test_filename = "data/examples/test_movielens_1m.tfrecord"
test = tf.data.TFRecordDataset(test_filename)

feature_description = {
    'context_movie_id': tf.io.FixedLenFeature([10], tf.int64, default_value=np.repeat(0, 10)),
    'context_movie_rating': tf.io.FixedLenFeature([10], tf.float32, default_value=np.repeat(0, 10)),
    'context_movie_year': tf.io.FixedLenFeature([10], tf.int64, default_value=np.repeat(1980, 10)),
    'context_movie_genre': tf.io.FixedLenFeature([10], tf.string, default_value=np.repeat("Drama", 10)),
    'label_movie_id': tf.io.FixedLenFeature([1], tf.int64, default_value=0),
}

def _parse_function(example_proto):
  return tf.io.parse_single_example(example_proto, feature_description)

train_ds = train.map(_parse_function).map(lambda x: {
    "context_movie_id": tf.strings.as_string(x["context_movie_id"]),
    "label_movie_id": tf.strings.as_string(x["label_movie_id"])
})

test_ds = test.map(_parse_function).map(lambda x: {
    "context_movie_id": tf.strings.as_string(x["context_movie_id"]),
    "label_movie_id": tf.strings.as_string(x["label_movie_id"])
})

# Inspect one parsed example.
for x in train_ds.take(1).as_numpy_iterator():
  pprint.pprint(x)

{'context_movie_id': array([b'3481', b'3160', b'3538', b'3747', b'3624', b'3150', b'3510',
       b'3535', b'3555', b'3566'], dtype=object),
 'label_movie_id': array([b'3676'], dtype=object)}


In [19]:
for x in train.take(1):
  pprint.pprint(x)

<tf.Tensor: shape=(), dtype=string, numpy=b'\n\xa7\x02\n,\n\x10context_movie_id\x12\x18\x1a\x16\n\x14\x99\x1b\xd8\x18\xd2\x1b\xa3\x1d\xa8\x1c\xce\x18\xb6\x1b\xcf\x1b\xe3\x1b\xee\x1b\n\x18\n\x0elabel_movie_id\x12\x06\x1a\x04\n\x02\xdc\x1c\nD\n\x14context_movie_rating\x12,\x12*\n(\x00\x00\x00@\x00\x00\x80@\x00\x00@@\x00\x00\x80@\x00\x00\x00@\x00\x00\x80@\x00\x00\x00@\x00\x00@@\x00\x00\x00@\x00\x00\x00@\ng\n\x13context_movie_genre\x12P\nN\n\x06Comedy\n\x05Drama\n\x06Comedy\n\x05Drama\n\x06Action\n\x05Drama\n\x05Drama\n\x08Thriller\n\x06Comedy\n\x06Horror\n.\n\x12context_movie_year\x12\x18\x1a\x16\n\x14\xd0\x0f\xcf\x0f\xcf\x0f\xcf\x0f\xd0\x0f\xcf\x0f\xd0\x0f\xd0\x0f\xd0\x0f\xd0\x0f'>


Now our train/test datasets include only a sequence of historical movie IDs and a label for the next movie ID. Note that we use `[10]` as the shape of the features during `tf.Example` parsing because we specified 10 as the length of context features in the example generation step.

We need one more thing before we can start building the model - the vocabulary for our movie IDs.

In [6]:
movies = tfds.load("movielens/1m-movies", split='train')
movies = movies.map(lambda x: x["movie_id"])
movie_ids = movies.batch(1_000)
unique_movie_ids = np.unique(np.concatenate(list(movie_ids)))
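
The vocabulary matters because the model's lookup layers turn each movie ID string into an integer index. The sketch below mimics that mapping in plain Python, assuming a single out-of-vocabulary (OOV) bucket at index 0. The helper names `build_lookup` and `lookup` are hypothetical and are not part of Keras.

```python
from typing import Dict, List


def build_lookup(vocabulary: List[str], num_oov_indices: int = 1) -> Dict[str, int]:
    """Map each token to an index, reserving the first indices for OOV tokens."""
    return {token: i + num_oov_indices for i, token in enumerate(vocabulary)}


def lookup(tokens: List[str], table: Dict[str, int], oov_index: int = 0) -> List[int]:
    """Look up each token, falling back to the OOV index for unknown tokens."""
    return [table.get(token, oov_index) for token in tokens]


vocabulary = ["1", "2", "3"]  # stand-in for unique_movie_ids
table = build_lookup(vocabulary)

# "0" is not in the vocabulary, so it lands in the OOV bucket.
indices = lookup(["2", "0", "3"], table)
print(indices)  # [2, 0, 3]

# One embedding row per vocabulary entry plus one for the OOV bucket.
embedding_rows = len(vocabulary) + 1
print(embedding_rows)  # 4
```

Under this assumption, padding values such as `"0"` or `""` that are not real movie IDs all share the OOV index, and the extra bucket is the reason an embedding table needs `len(vocabulary) + 1` rows.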

## Implementing a sequential model

In our [basic retrieval tutorial](https://www.tensorflow.org/recommenders/examples/basic_retrieval), we use one query tower for the user and one candidate tower for the candidate movie. However, the two-tower architecture is generalizable and not limited to <user, item> pairs. You can also use it for item-to-item recommendation, as we note in the [basic retrieval tutorial](https://www.tensorflow.org/recommenders/examples/basic_retrieval#item-to-item_recommendation).

Here we are still going to use the two-tower architecture. Specifically, we use the query tower with a [Gated Recurrent Unit (GRU) layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/GRU) to encode the sequence of historical movies, and keep the same candidate tower for the candidate movie.

In [7]:
embedding_dimension = 32

query_model = tf.keras.Sequential([
    tf.keras.layers.StringLookup(
      vocabulary=unique_movie_ids, mask_token=None),
    tf.keras.layers.Embedding(len(unique_movie_ids) + 1, embedding_dimension), 
    tf.keras.layers.GRU(embedding_dimension),
])

candidate_model = tf.keras.Sequential([
  tf.keras.layers.StringLookup(
      vocabulary=unique_movie_ids, mask_token=None),
  tf.keras.layers.Embedding(len(unique_movie_ids) + 1, embedding_dimension)
])
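
To see why a recurrent layer is sensitive to item order, here is a simplified GRU written in NumPy. It is only a sketch: it omits biases, uses random weights, and does not reproduce the exact Keras implementation. The names `gru_encode` and `params` are made up for this illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def gru_encode(sequence, params):
    """Run a simplified GRU over a (time, features) array; return the final state."""
    hidden = np.zeros(params["U_z"].shape[0])
    for x in sequence:
        # Update gate: how much of the previous state to keep.
        z = sigmoid(params["W_z"] @ x + params["U_z"] @ hidden)
        # Reset gate: how much of the previous state feeds the candidate.
        r = sigmoid(params["W_r"] @ x + params["U_r"] @ hidden)
        candidate = np.tanh(params["W_h"] @ x + params["U_h"] @ (r * hidden))
        hidden = z * hidden + (1.0 - z) * candidate
    return hidden


rng = np.random.default_rng(0)
dim = 4
names = ["W_z", "U_z", "W_r", "U_r", "W_h", "U_h"]
params = {name: rng.normal(size=(dim, dim)) for name in names}

# Three "movie embeddings" forming a short watch history.
history = rng.normal(size=(3, dim))

forward_state = gru_encode(history, params)
reversed_state = gru_encode(history[::-1], params)

# The same items in a different order give a different encoding.
print(np.allclose(forward_state, reversed_state))
```

The final hidden state plays the role of the query embedding, which is why the query tower above gives the GRU the same number of units as the candidate embedding dimension.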

The metrics, task and full model are defined similarly to the basic retrieval model.

In [8]:
metrics = tfrs.metrics.FactorizedTopK(
  candidates=movies.batch(128).map(candidate_model)
)

task = tfrs.tasks.Retrieval(
  metrics=metrics
)

class Model(tfrs.Model):

    def __init__(self, query_model, candidate_model):
        super().__init__()
        self.query_model = query_model
        self.candidate_model = candidate_model

        self._task = task

    def compute_loss(self, features, training=False):
        watch_history = features["context_movie_id"]
        watch_next_label = features["label_movie_id"]

        query_embedding = self.query_model(watch_history)       
        candidate_embedding = self.candidate_model(watch_next_label)
        
        return self._task(query_embedding, candidate_embedding, compute_metrics=not training)
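
The following NumPy sketch gives some intuition for what the retrieval task optimizes. It assumes the loss is a softmax cross-entropy over in-batch candidates, summed over the batch, where every other candidate in the batch acts as a negative. The function name `in_batch_softmax_loss` is hypothetical, and the sketch ignores refinements such as handling duplicate candidates.

```python
import numpy as np


def in_batch_softmax_loss(query_embeddings, candidate_embeddings):
    """Sum of cross-entropy losses where row i's positive is candidate i."""
    # Affinity between every query and every candidate in the batch.
    scores = query_embeddings @ candidate_embeddings.T  # (batch, batch)
    # Numerically stable log-softmax over the candidates of each query.
    scores = scores - scores.max(axis=1, keepdims=True)
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    # The diagonal holds the log-probability of each true (query, label) pair.
    return float(-np.trace(log_probs))


# With uninformative embeddings, each query is a uniform guess over the batch.
zeros = np.zeros((3, 2))
uniform_loss = in_batch_softmax_loss(zeros, zeros)

# With well-separated, matching embeddings the loss approaches zero.
aligned = 10.0 * np.eye(3)
aligned_loss = in_batch_softmax_loss(aligned, aligned)

print(uniform_loss, aligned_loss)
```

Because the loss in this sketch is summed rather than averaged, its magnitude grows with the batch size, so large loss values with large batches are not alarming by themselves.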

## Fitting and evaluating

We can now compile, train and evaluate our sequential retrieval model.

In [9]:
model = Model(query_model, candidate_model)
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

In [10]:
cached_train = train_ds.shuffle(10_000).batch(12800).cache()
cached_test = test_ds.batch(2560).cache()

In [23]:
print((cached_train))
for i in cached_test.take(1):
    print(i)

<CacheDataset element_spec={'context_movie_id': TensorSpec(shape=(None, 10), dtype=tf.string, name=None), 'label_movie_id': TensorSpec(shape=(None, 1), dtype=tf.string, name=None)}>
{'context_movie_id': <tf.Tensor: shape=(2560, 10), dtype=string, numpy=
array([[b'956', b'3469', b'3134', ..., b'951', b'1221', b'1284'],
       [b'47', b'296', b'2571', ..., b'2599', b'1500', b'457'],
       [b'780', b'1377', b'45', ..., b'0', b'0', b'0'],
       ...,
       [b'3160', b'2762', b'1', ..., b'3361', b'899', b'2997'],
       [b'3418', b'2858', b'2706', ..., b'2770', b'2541', b'2759'],
       [b'1409', b'628', b'1713', ..., b'408', b'1696', b'2699']],
      dtype=object)>, 'label_movie_id': <tf.Tensor: shape=(2560, 1), dtype=string, numpy=
array([[b'593'],
       [b'2959'],
       [b'3897'],
       ...,
       [b'3461'],
       [b'2718'],
       [b'3045']], dtype=object)>}


In [12]:
model.fit(cached_train, epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


<keras.callbacks.History at 0x23e645b2e60>

In [13]:
model.evaluate(cached_test, return_dict=True)



{'factorized_top_k/top_1_categorical_accuracy': 0.013177112676203251,
 'factorized_top_k/top_5_categorical_accuracy': 0.07296453416347504,
 'factorized_top_k/top_10_categorical_accuracy': 0.12803974747657776,
 'factorized_top_k/top_50_categorical_accuracy': 0.3574771583080292,
 'factorized_top_k/top_100_categorical_accuracy': 0.48859795928001404,
 'loss': 9513.587890625,
 'regularization_loss': 0,
 'total_loss': 9513.587890625}
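
To read these numbers, it helps to see how a top-k accuracy can be computed. The sketch below is a simplified NumPy version, not the TFRS implementation: `top_k_accuracy` is a hypothetical helper, and it scores each query against all candidates with a dot product.

```python
import numpy as np


def top_k_accuracy(query_embeddings, candidate_embeddings, true_indices, k):
    """Fraction of queries whose true candidate is among the k best scores."""
    scores = query_embeddings @ candidate_embeddings.T  # (queries, candidates)
    # Indices of the k highest-scoring candidates for each query.
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == np.asarray(true_indices)[:, None]).any(axis=1)
    return float(hits.mean())


candidates = np.eye(4)  # four toy candidates
queries = np.array([
    [1.0, 0.5, 0.0, 0.0],  # ranks candidate 0 first, candidate 1 second
    [0.0, 0.0, 1.0, 0.5],  # ranks candidate 2 first, candidate 3 second
])
true_indices = [1, 2]

top_1 = top_k_accuracy(queries, candidates, true_indices, k=1)
top_2 = top_k_accuracy(queries, candidates, true_indices, k=2)
print(top_1, top_2)  # 0.5 1.0
```

Read this way, a top-100 accuracy of about 0.49 means that for roughly half of the test sequences the true next movie appears among the 100 highest-scoring candidates.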

This concludes the sequential retrieval tutorial.

In [14]:
# Create a model that takes in raw query features and
# recommends movies out of the entire movies dataset.
index = tfrs.layers.factorized_top_k.BruteForce(model.query_model)
index.index_from_dataset(
  tf.data.Dataset.zip((movies.batch(100), movies.batch(100).map(model.candidate_model)))
)

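# Get recommendations for a watch history of movies "1" and "2".
# The index returns movie IDs, since that is what it was built from.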
_, titles = index(tf.constant(np.array([["1","2","","","","","","","",""]])))
print(f"Recommendations for user 42: {titles[0, :3]}")

Recommendations for user 42: [b'1538' b'1724' b'2623']
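
Conceptually, a brute-force index scores the query embedding against every candidate embedding and keeps the best matches. The NumPy sketch below illustrates the idea with a hypothetical `brute_force_top_k` helper. It is not the TFRS layer, which also batches the computation and runs the query model for us.

```python
import numpy as np


def brute_force_top_k(query_embedding, candidate_embeddings, candidate_ids, k=3):
    """Return the k best scores and the identifiers of the matching candidates."""
    # Dot-product affinity between the query and every candidate.
    scores = candidate_embeddings @ query_embedding
    order = np.argsort(-scores)[:k]
    return scores[order], [candidate_ids[i] for i in order]


candidate_ids = ["a", "b", "c"]  # toy identifiers, not real movie IDs
candidate_embeddings = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
])
query_embedding = np.array([2.0, 1.0])

top_scores, top_ids = brute_force_top_k(
    query_embedding, candidate_embeddings, candidate_ids, k=2)
print(top_scores, top_ids)  # [3. 2.] ['c', 'a']
```

This scans every candidate for each query, which is fine for a few thousand movies. Larger catalogs usually call for an approximate index instead.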


In [15]:
# Get recommendations for a watch history containing only movie "1".
_, titles = index(tf.constant(np.array([["1","","","","","","","","",""]])))
print(f"Recommendations for user 42: {titles[0, :3]}")

Recommendations for user 42: [b'2623' b'1538' b'1724']


In [16]:
# Get recommendations for a watch history of movies "1", "2" and "100".
_, titles = index(tf.constant(np.array([["1","2","100","","","","","","",""]])))
print(f"Recommendations for user 42: {titles[0, :3]}")

Recommendations for user 42: [b'2623' b'1538' b'1062']


In [17]:
["3","2","9","","","","","","",""]

['3', '2', '9', '', '', '', '', '', '', '']