<a href="https://colab.research.google.com/github/AmirJlr/RecSys/blob/master/01_Basic_Movie_Retrieval_with_TensorFlow_Recommenders.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# This tutorial will cover:

We build a retrieval model. Its goal is to find a set of relevant movies from a large catalog for a given user. We won't be predicting exact ratings.


1. Setting up: Installing necessary libraries.
2. Data: Using the MovieLens 100K dataset.
3. Preprocessing: Preparing the data for TFRS.
4. Model Building: Creating a "Two-Tower" retrieval model.
   - Query Tower (for users)
   - Candidate Tower (for movies)
5. Training: Training the model.
6. Evaluation: Checking how well our model performs.
7. Making Recommendations: Getting top movie suggestions for a user.




## 1. Setup

In [1]:
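import os
# Make tf.keras resolve to legacy Keras 2 (the tf_keras package) instead of Keras 3.
# This must be set before TensorFlow is imported.
os.environ['TF_USE_LEGACY_KERAS'] = '1'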

In [2]:
!pip install -q tensorflow tensorflow-recommenders tensorflow-datasets

In [3]:
import os
import pprint
import tempfile

import numpy as np
import tensorflow as tf
import tensorflow_datasets as tfds
import tensorflow_recommenders as tfrs

print(f"TensorFlow version: {tf.__version__}")
print(f"TensorFlow Recommenders version: {tfrs.__version__}")

TensorFlow version: 2.18.0
TensorFlow Recommenders version: v0.7.3


## 2. Data: MovieLens 100K
We'll use the MovieLens 100K dataset, which contains 100,000 ratings from 943 users on 1,682 movies.

In [4]:
# Load the ratings data
ratings_ds = tfds.load("movielens/100k-ratings", split="train")

# Load the movies data to get movie titles
movies_ds = tfds.load("movielens/100k-movies", split="train")

In [5]:
# Look at a few examples
tfds.as_dataframe(ratings_ds.take(3))

| | bucketized_user_age | movie_genres | movie_id | movie_title | raw_user_age | timestamp | user_gender | user_id | user_occupation_label | user_occupation_text | user_rating | user_zip_code |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 45.0 | [7] | b'357' | b"One Flew Over the Cuckoo's Nest (1975)" | 46.0 | 879024327 | True | b'138' | 4 | b'doctor' | 4.0 | b'53211' |
| 1 | 25.0 | [ 4 14] | b'709' | b'Strictly Ballroom (1992)' | 32.0 | 875654590 | True | b'92' | 5 | b'entertainment' | 2.0 | b'80525' |
| 2 | 18.0 | [4] | b'412' | b'Very Brady Sequel, A (1996)' | 24.0 | 882075110 | True | b'301' | 17 | b'student' | 4.0 | b'55439' |


In [6]:
# Look at a few examples from the movies dataset
tfds.as_dataframe(movies_ds.take(3))

| | movie_genres | movie_id | movie_title |
|---|---|---|---|
| 0 | [4] | b'1681' | b'You So Crazy (1994)' |
| 1 | [4 7] | b'1457' | b'Love Is All There Is (1996)' |
| 2 | [1 3] | b'500' | b'Fly Away Home (1996)' |


In [7]:
# We only need user_id and movie_title for this basic retrieval model
ratings = ratings_ds.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_id": x["user_id"],
})

movies = movies_ds.map(lambda x: x["movie_title"])

## 3. Preprocessing
Our model needs to convert raw user IDs and movie titles into numerical representations (embeddings). To do this, we first need to create vocabularies of unique user IDs and movie titles.

In [8]:
### Create vocabularies for users and movies ###

user_ids_vocab = tf.keras.layers.StringLookup(mask_token=None)
user_ids_vocab.adapt(ratings.map(lambda x: x["user_id"]))

movie_titles_vocab = tf.keras.layers.StringLookup(mask_token=None)
movie_titles_vocab.adapt(movies)


In [9]:
print(f"\nUser ID vocabulary size: {user_ids_vocab.vocabulary_size()}")

print(f"Movie Title vocabulary size: {movie_titles_vocab.vocabulary_size()}")


User ID vocabulary size: 944
Movie Title vocabulary size: 1665


In [10]:
# Example:

print(f"User ID '42' maps to: {user_ids_vocab('42')}")
print(f"Movie 'Heat (1995)' maps to: {movie_titles_vocab(tf.constant('Heat (1995)'))}")

User ID '42' maps to: 175
Movie 'Heat (1995)' maps to: 1006
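
The user vocabulary has 944 entries for 943 users because, with `mask_token=None` and the default single out-of-vocabulary bucket, `StringLookup` reserves index 0 for strings it has never seen. The pure-Python sketch below imitates that behavior on toy data. `build_vocab` and `lookup` are hypothetical helpers written for illustration (they are not part of Keras), and the alphabetical tie-break between equally frequent tokens is an assumption.

```python
from collections import Counter

def build_vocab(tokens):
    """Assign indices 1..N by descending frequency; index 0 is kept for unknowns."""
    counts = Counter(tokens)
    # Assumption: ties between equally frequent tokens are broken alphabetically.
    ordered = sorted(counts, key=lambda tok: (-counts[tok], tok))
    return {tok: i + 1 for i, tok in enumerate(ordered)}

def lookup(vocab, token):
    """Return the token's index, or 0 for an out-of-vocabulary token."""
    return vocab.get(token, 0)

# Toy interaction log: user "42" appears most often, so it gets the lowest index.
toy_user_ids = ["42", "7", "42", "138", "42", "7"]
vocab = build_vocab(toy_user_ids)

# The reported size counts the reserved out-of-vocabulary slot as well.
vocab_size = len(vocab) + 1

print(vocab)
print(lookup(vocab, "999"), vocab_size)
```

This extra slot is also why the `Embedding` layers later in the notebook are sized with `vocabulary_size()` rather than the raw number of users or titles.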


## 4. Model Building: Two-Tower Retrieval Model

- Query Tower: Takes user features (here, just user_id) and outputs a user embedding.


- Candidate Tower: Takes item features (here, just movie_title) and outputs a movie embedding.

The model learns embeddings such that users and movies they've interacted with are close together in the embedding space.

In [11]:
class MovieLensModel(tfrs.Model):

  def __init__(self, user_model, movie_model, task):
    super().__init__()
    self.user_model = user_model
    self.movie_model = movie_model
    self.task = task # This will be our retrieval task

  def compute_loss(self, features, training=False):
    # We need to pass the user features and movie features into their respective
    # towers to get embeddings.
    user_embeddings = self.user_model(features["user_id"])
    positive_movie_embeddings = self.movie_model(features["movie_title"])

    # The task computes the loss and metrics.
    # It takes user embeddings and positive movie embeddings as arguments.
    # It implicitly uses other movies in the batch as negative examples.
    return self.task(user_embeddings, positive_movie_embeddings)

`compute_loss` is the heart of a TFRS model:

- It takes a batch of features (here, `{"user_id": ..., "movie_title": ...}` dictionaries).
- It generates user embeddings and movie embeddings using the respective towers.
- It passes these embeddings to `self.task`. The task calculates the loss (e.g., by trying to make positive pairs have higher dot products than negative pairs) and updates the metrics.
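
To make the in-batch negatives idea concrete, here is a minimal NumPy sketch of a softmax loss over a batch of dot-product scores. It is not the TFRS implementation: `in_batch_softmax_loss` is a hypothetical helper, the toy embeddings are made up, and summing the per-row losses is an assumption of this sketch.

```python
import numpy as np

def in_batch_softmax_loss(user_emb, movie_emb):
    """Summed softmax cross-entropy where row i's positive movie is column i."""
    # scores[i, j] is the dot product of user i with movie j from the same batch.
    scores = user_emb @ movie_emb.T
    # Numerically stable log-softmax over each row.
    shifted = scores - scores.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Diagonal entries are the positive pairs; every other column acts as a negative.
    return float(-np.diag(log_probs).sum())

batch = 3

# All-zero embeddings: every movie looks equally likely, so the loss is batch * ln(batch).
untrained = np.zeros((batch, 4))
uniform_loss = in_batch_softmax_loss(untrained, untrained)

# Each user points strongly at its own movie: the loss is close to zero.
aligned = 5.0 * np.eye(batch, 4)
aligned_loss = in_batch_softmax_loss(aligned, aligned)

print(uniform_loss, aligned_loss)
```

The loss falls as each user's embedding scores its own movie higher than the other movies in the batch, which is the behavior the task rewards during training.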

In [12]:
# Define the embedding dimension
embedding_dimension = 32

# --- Query (User) Tower ---
user_model = tf.keras.Sequential([
  user_ids_vocab, # Maps user_id strings to integer indices
  tf.keras.layers.Embedding(user_ids_vocab.vocabulary_size(), embedding_dimension)
])



# --- Candidate (Movie) Tower ---
movie_model = tf.keras.Sequential([
  movie_titles_vocab, # Maps movie_title strings to integer indices
  tf.keras.layers.Embedding(movie_titles_vocab.vocabulary_size(), embedding_dimension)
])


# --- The Retrieval Task ---
# This task uses in-batch negatives: for a given (user, movie) pair,
# all other movies in the batch are treated as negatives.
task = tfrs.tasks.Retrieval(
    metrics=tfrs.metrics.FactorizedTopK(
        candidates=movies.batch(128).map(movie_model) # Pass all movie embeddings for metric calculation
    )
)

The Retrieval task can be given a metric, and `FactorizedTopK` is a common choice. It measures how often the true positive movie is found within the top K recommendations.


Crucially, `FactorizedTopK` needs access to all possible candidate (movie) embeddings to compute its metric correctly. We provide them by mapping our `movies` dataset (all movie titles) through `movie_model`.
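
As a rough illustration of what a top-K metric measures, the NumPy sketch below scores every candidate for each user and checks whether the true movie lands in the top K. `top_k_accuracy` is a hypothetical helper with made-up toy embeddings; it conveys the idea only and does not reproduce how TFRS computes `FactorizedTopK` internally.

```python
import numpy as np

def top_k_accuracy(user_emb, candidate_emb, true_idx, k):
    """Fraction of users whose true candidate is among their k highest-scoring candidates."""
    # One score per (user, candidate) pair over the full candidate set.
    scores = user_emb @ candidate_emb.T
    # Indices of the k best candidates per user, best first.
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == np.asarray(true_idx)[:, None]).any(axis=1)
    return float(hits.mean())

# Four toy movies with one-hot embeddings, and two toy users.
candidates = np.eye(4)
users = np.array([
    [1.0, 0.5, 0.0, 0.0],  # prefers movie 0, then movie 1
    [0.0, 0.0, 1.0, 0.2],  # prefers movie 2, then movie 3
])
true_movies = [1, 2]  # the movie each user actually interacted with

top_1 = top_k_accuracy(users, candidates, true_movies, k=1)
top_2 = top_k_accuracy(users, candidates, true_movies, k=2)
print(top_1, top_2)
```

The first user's true movie is only their second-best match, so it counts as a miss at K=1 and a hit at K=2. This is why the candidate set passed to the metric must contain every movie, not just those in the current batch.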

In [13]:
# Create the full model
model = MovieLensModel(user_model, movie_model, task)

## 5. Training

- `Batch Size`: For retrieval with in-batch negatives, larger batch sizes (e.g., 4096, 8192) are generally better as they provide more negative examples per positive pair.


- `Epochs`: Retrieval models often converge quickly. 3-5 epochs might be sufficient.
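
The arithmetic behind these choices can be sketched as follows. `batch_stats` is a hypothetical helper, and the count of negatives assumes every other movie in a full batch is treated as a negative.

```python
import math

def batch_stats(num_examples, batch_size):
    """Summarize how a dataset splits into batches of a given size."""
    steps = math.ceil(num_examples / batch_size)
    last_batch = num_examples - (steps - 1) * batch_size
    return {
        "steps_per_epoch": steps,
        # In a full batch, each positive pair sees the other rows' movies as negatives.
        "negatives_per_positive": batch_size - 1,
        "last_batch_size": last_batch,
    }

# 100,000 ratings with the batch size used in this notebook.
stats = batch_stats(100_000, 8192)
print(stats)
```

With 13 steps per epoch, three epochs amount to only 39 optimizer updates, so the number of epochs and the batch size are worth tuning together.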

In [14]:
# Prepare the training data
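# Shuffle, batch, and cache for performance.
# Note: because cache() comes after shuffle(), the shuffled order is computed once and reused in every epoch.
cached_train = ratings.shuffle(100_000).batch(8192).cache() # Larger batch size for retrieval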


# Compile the model
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

# Train the model
print("\nStarting model training...")
model.fit(cached_train, epochs=3) # A few epochs are usually enough for a basic retrieval model
print("Model training finished.")


Starting model training...
Epoch 1/3
Epoch 2/3
Epoch 3/3
Model training finished.


## 6. Evaluation

Although our FactorizedTopK metric was updated during training, we could also evaluate explicitly on a held-out test set if we had one.

For this basic example, we'll rely on the training metrics, which tend to be optimistic because the model has already seen these interactions. The FactorizedTopK metric (e.g., `factorized_top_k/top_100_categorical_accuracy`) tells us the proportion of times the true positive movie was in the top 100 recommended movies.

In [15]:
# If we had a separate test set, we would evaluate like this:
# model.evaluate(cached_test, return_dict=True)

# For now, we observe the metrics printed during training.
# The key metric is factorized_top_k/top_100_categorical_accuracy (or top_10, top_50 etc.)
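
One common way to obtain such a test set is a shuffled 80/20 hold-out split. The sketch below illustrates the idea on plain Python indices. `holdout_split`, the 80% fraction, and the seed are assumptions for illustration. In this notebook the same split could be expressed on `ratings` with `tf.data`'s `shuffle` (with `reshuffle_each_iteration=False`), `take`, and `skip`.

```python
import random

def holdout_split(num_examples, train_fraction=0.8, seed=42):
    """Shuffle example indices once, then cut them into train and test parts."""
    indices = list(range(num_examples))
    # A fixed seed keeps the split reproducible across runs.
    random.Random(seed).shuffle(indices)
    cut = int(num_examples * train_fraction)
    return indices[:cut], indices[cut:]

# 100,000 ratings split into 80,000 for training and 20,000 for testing.
train_idx, test_idx = holdout_split(100_000)
print(len(train_idx), len(test_idx))
```

The shuffle must happen once, before the cut. If the order changed between iterations, examples could leak from the test part into the training part.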

## 7. Making Recommendations

To make recommendations, we need an efficient way to find the movies whose embeddings are closest to a given user's embedding.

TFRS provides `tfrs.layers.factorized_top_k.BruteForce` for this. For larger datasets, we'd use an Approximate Nearest Neighbour (ANN) index like ScaNN.
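
Conceptually, brute-force retrieval scores every candidate against the query and keeps the K best. The NumPy sketch below shows that idea on toy data. `brute_force_top_k` is a hypothetical helper and the embeddings are made up; it is not the TFRS layer itself.

```python
import numpy as np

def brute_force_top_k(query_emb, candidate_emb, candidate_ids, k):
    """Score all candidates by dot product and return the k best (id, score) pairs."""
    scores = candidate_emb @ query_emb
    # Sort by descending score; this touches every candidate, hence "brute force".
    order = np.argsort(-scores)[:k]
    return [(candidate_ids[i], float(scores[i])) for i in order]

# Toy catalog of three movies with 2-dimensional embeddings.
movie_ids = ["Movie A", "Movie B", "Movie C"]
movie_emb = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.6, 0.6],
])
user_emb = np.array([1.0, 0.2])  # toy embedding for one user

top_2 = brute_force_top_k(user_emb, movie_emb, movie_ids, k=2)
print(top_2)
```

The cost grows linearly with the catalog size, which is fine for a few thousand movies but is the reason approximate indexes become attractive for much larger catalogs.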

In [16]:
# Create a BruteForce layer to find top K movies
index = tfrs.layers.factorized_top_k.BruteForce(model.user_model) # Pass the user model

# We need to populate the BruteForce layer with all candidate (movie) embeddings
# It needs a dataset of (movie_id, movie_embedding)
# For movie_id, we can use the raw movie titles.
index.index_from_dataset(
    tf.data.Dataset.zip((
        movies.batch(100), # Batch of movie titles
        movies.batch(100).map(model.movie_model) # Batch of corresponding movie embeddings
    ))
)

<tensorflow_recommenders.layers.factorized_top_k.BruteForce at 0x7a64d0119ad0>

### Get recommendations for a specific user

In [17]:
user_id_to_recommend = "42" # Example user ID
_, titles = index(tf.constant([user_id_to_recommend])) # Pass user_id as a tensor

print(f"\nTop 10 recommendations for user {user_id_to_recommend}:")
for i, title in enumerate(titles[0, :10].numpy()):
    print(f"{i+1}. {title.decode('utf-8')}")


Top 10 recommendations for user 42:
1. Rudy (1993)
2. Father of the Bride Part II (1995)
3. Indian in the Cupboard, The (1995)
4. Up Close and Personal (1996)
5. Preacher's Wife, The (1996)
6. Jack (1996)
7. Something to Talk About (1995)
8. Affair to Remember, An (1957)
9. Michael (1996)
10. Forget Paris (1995)


In [18]:
# Let's try another user
user_id_to_recommend_2 = "212"
_, titles_2 = index(tf.constant([user_id_to_recommend_2]))

print(f"\nTop 10 recommendations for user {user_id_to_recommend_2}:")
for i, title in enumerate(titles_2[0, :10].numpy()):
    print(f"{i+1}. {title.decode('utf-8')}")


Top 10 recommendations for user 212:
1. Killing Fields, The (1984)
2. Cinema Paradiso (1988)
3. Third Man, The (1949)
4. Ruby in Paradise (1993)
5. Remains of the Day, The (1993)
6. Eat Drink Man Woman (1994)
7. Garden of Finzi-Contini, The (Giardino dei Finzi-Contini, Il) (1970)
8. Jean de Florette (1986)
9. Graduate, The (1967)
10. Foreign Correspondent (1940)
