<a href="https://colab.research.google.com/github/AmirJlr/RecSys/blob/master/02_Movie_Recommender(Retrieval_Ranking).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Movie Recommender (Retrieval with Genres + Ranking)

## Recap: Two-Stage Recommender

- #### **Retrieval Stage**: Quickly selects a few hundred potentially relevant candidates from the entire catalog, which can hold millions of items. Its goals are efficiency and recall.


- #### **Ranking Stage**: Takes the candidates from the retrieval stage and scores/orders them more precisely. It can use more features and a more complex model because it only operates on a small set of items. Its goal is precision.
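
The two stages can be sketched without any framework. The snippet below is a minimal illustration, not the notebook's model: the 2-dimensional vectors, the movie names `A` to `E`, and `toy_ranker` are made-up placeholders. Stage 1 scores the whole catalog with a cheap dot product and keeps the top `k`; stage 2 applies a costlier scoring function to those few candidates only.

```python
import heapq

# Toy 2-dimensional embeddings; all values are made up for illustration.
USER_VEC = [1.0, 0.5]
MOVIE_VECS = {
    "A": [0.9, 0.1],
    "B": [0.2, 0.8],
    "C": [-0.5, 0.3],
    "D": [0.7, 0.7],
    "E": [-0.9, -0.4],
}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve(user_vec, movie_vecs, k):
    """Stage 1: cheap dot-product scoring over the whole catalog, keep the top k."""
    return heapq.nlargest(k, movie_vecs, key=lambda title: dot(user_vec, movie_vecs[title]))

def toy_ranker(user_vec, title):
    """Hypothetical stand-in for a learned ranking model."""
    vec = MOVIE_VECS[title]
    return dot(user_vec, vec) + vec[1]

def rank(user_vec, candidates, score_fn):
    """Stage 2: re-score only the retrieved candidates and sort them."""
    return sorted(candidates, key=lambda title: score_fn(user_vec, title), reverse=True)

candidates = retrieve(USER_VEC, MOVIE_VECS, k=3)
final = rank(USER_VEC, candidates, toy_ranker)
print(candidates)  # retrieval order
print(final)       # order after re-ranking
```

The ranker never sees `C` or `E`: anything the retrieval stage drops cannot be recovered later, which is why retrieval is tuned for recall and ranking for precision.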

## 1. Setup and Data Preparation

In [1]:
import os
os.environ['TF_USE_LEGACY_KERAS'] = '1'

In [2]:
!pip install -q tensorflow-recommenders
!pip install -q --upgrade scann

In [3]:
import os
import pprint
import tempfile

from typing import Dict, Text

import numpy as np
import tensorflow as tf
import tensorflow_recommenders as tfrs
import tensorflow_datasets as tfds

In [4]:
# Load ratings and movies
ratings = tfds.load("movielens/100k-ratings", split="train")
movies = tfds.load("movielens/100k-movies", split="train")

# Select relevant features
ratings = ratings.map(lambda x: {
    "movie_title": x["movie_title"],
    "user_id": x["user_id"],
    "user_rating": x["user_rating"]
})

movies = movies.map(lambda x: x["movie_title"])

In [5]:
# Create vocabularies for user IDs and movie titles
user_ids_vocabulary = tf.keras.layers.StringLookup(mask_token=None)
user_ids_vocabulary.adapt(ratings.map(lambda x: x["user_id"]))

movie_titles_vocabulary = tf.keras.layers.StringLookup(mask_token=None)
movie_titles_vocabulary.adapt(movies)
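
As a rough mental model of what the lookup layers do, the sketch below maps raw strings to integer indices in plain Python. It assumes the layer's default of one out-of-vocabulary slot at index 0 (with `mask_token=None` there is no separate mask index); the helper names `build_vocabulary` and `lookup` are hypothetical, and the alphabetical tie-breaking is an assumption of this sketch rather than documented behavior.

```python
from collections import Counter

def build_vocabulary(tokens):
    """Return a token -> index map, with index 0 reserved for unknown tokens."""
    counts = Counter(tokens)
    # More frequent tokens get smaller indices; ties are broken
    # alphabetically here, which is an assumption of this sketch.
    ordered = sorted(counts, key=lambda t: (-counts[t], t))
    return {token: i + 1 for i, token in enumerate(ordered)}

def lookup(vocab, token, oov_index=0):
    """Map a token to its index, falling back to the out-of-vocabulary slot."""
    return vocab.get(token, oov_index)

vocab = build_vocabulary(["42", "7", "42", "13", "42", "7"])
vocabulary_size = len(vocab) + 1  # +1 for the out-of-vocabulary slot

print(lookup(vocab, "42"), lookup(vocab, "999"), vocabulary_size)
```

Counting the out-of-vocabulary slot in the size matters later: the embedding tables are sized with `vocabulary_size()` so that an unseen user ID or title still maps to a valid row.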

## 2. The Retrieval Model

The retrieval stage is responsible for quickly selecting a smaller subset of relevant candidates from the entire movie catalog. We'll build a two-tower model where one tower embeds users and the other embeds movies.

In [6]:
class MovieLensRetrievalModel(tfrs.Model):

  def __init__(self, user_model, movie_model, task):
    super().__init__()
    self.user_model = user_model
    self.movie_model = movie_model
    self.task = task

  def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
    user_embeddings = self.user_model(features["user_id"])
    positive_movie_embeddings = self.movie_model(features["movie_title"])
    return self.task(user_embeddings, positive_movie_embeddings)



# User and Movie Towers
embedding_dimension = 32

user_model = tf.keras.Sequential([
    user_ids_vocabulary,
    tf.keras.layers.Embedding(user_ids_vocabulary.vocabulary_size(), embedding_dimension)
])


movie_model = tf.keras.Sequential([
    movie_titles_vocabulary,
    tf.keras.layers.Embedding(movie_titles_vocabulary.vocabulary_size(), embedding_dimension)
])

# The retrieval task
retrieval_task = tfrs.tasks.Retrieval(
    metrics=tfrs.metrics.FactorizedTopK(
        candidates=movies.batch(128).map(movie_model)
    )
)
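
Two-tower retrieval models are commonly trained with an in-batch softmax loss: for each (user, movie) pair in a batch, the paired movie is the positive and every other movie in the batch serves as a negative. The sketch below illustrates that idea in plain Python with made-up 2-dimensional embeddings. It averages over the batch, which is a choice of this sketch; the library's own reduction and any sampling corrections may differ.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_batch_softmax_loss(user_embs, movie_embs):
    """Mean cross-entropy where the positive for row i is column i."""
    losses = []
    for i, user in enumerate(user_embs):
        logits = [dot(user, movie) for movie in movie_embs]
        # Numerically stable log-sum-exp over all movies in the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

users = [[2.0, 0.0], [0.0, 2.0]]
# Each user is paired with a movie pointing in the same direction.
aligned = in_batch_softmax_loss(users, [[2.0, 0.0], [0.0, 2.0]])
# Each user is paired with the movie that suits the other user.
swapped = in_batch_softmax_loss(users, [[0.0, 2.0], [2.0, 0.0]])

print(f"aligned={aligned:.4f} swapped={swapped:.4f}")
```

The loss is small when a user's embedding has a larger dot product with the paired movie than with the other movies in the batch, which is the geometry that the nearest-neighbor lookup relies on afterwards.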

In [7]:
retrieval_model = MovieLensRetrievalModel(user_model, movie_model, retrieval_task)

retrieval_model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))


# Train the retrieval model
history_retrieval = retrieval_model.fit(ratings.batch(4096), epochs=3)

Epoch 1/3
Epoch 2/3
Epoch 3/3


**Exporting embeddings for ANN**: After training, we use the trained movie_model to generate an embedding for every movie in the catalog. These embeddings are what the ScaNN index is built from.

## 3. Approximate Nearest Neighbor (ANN) Search with ScaNN
Once we have movie embeddings, we can build a ScaNN index for fast approximate lookups.

In [8]:
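# Create the ScaNN index
scann_index = tfrs.layers.factorized_top_k.ScaNN(
    num_leaves=100,           # number of partitions the candidates are split into
    num_leaves_to_search=10,  # partitions scanned per query
    k=10,                     # number of candidates returned per query
)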

scann_index.index_from_dataset(
    tf.data.Dataset.zip((movies.batch(100), movies.batch(100).map(movie_model)))
)

# Get recommendations for a user
def get_retrieval_recommendations_scann(user_id_str, num_recs=10):
    """Return up to num_recs candidate titles for one user (capped by the index's k)."""
    query_embedding = user_model(tf.constant([user_id_str]))
    # The index returns (scores, identifiers); the identifiers are movie titles here.
    scores, titles = scann_index(query_embedding)
    print(f"ScaNN Retrieval Recommendations for user {user_id_str}:\n {titles[0, :num_recs].numpy()}")
    return titles[0, :num_recs]

In [9]:
# Example:
retrieved_candidates_scann = get_retrieval_recommendations_scann("42")

ScaNN Retrieval Recommendations for user 42:
 [b'Client, The (1994)' b'Old Yeller (1957)'
 b'Angels in the Outfield (1994)' b'Rudy (1993)'
 b"Kid in King Arthur's Court, A (1995)"
 b'Bridges of Madison County, The (1995)'
 b'Man Without a Face, The (1993)' b'Circle of Friends (1995)'
 b'Legends of the Fall (1994)' b'Fried Green Tomatoes (1991)']
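
To build intuition for `num_leaves` and `num_leaves_to_search`, here is a simplified sketch of partition-based search in plain Python. It is not ScaNN's actual algorithm (which also relies on learned partitions and quantization): the four hand-picked centroids, the item vectors, and the helper names are all made up for illustration. Items are grouped under their best-matching centroid, and a query scores only the items in the few leaves whose centroids match it best.

```python
import heapq

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def build_leaves(items, centroids):
    """Assign each (title, vector) to the centroid with the largest dot product."""
    leaves = [[] for _ in centroids]
    for title, vec in items:
        best = max(range(len(centroids)), key=lambda c: dot(vec, centroids[c]))
        leaves[best].append((title, vec))
    return leaves

def search(query, centroids, leaves, num_leaves_to_search, k):
    """Score only the items in the leaves whose centroids best match the query."""
    nearest = heapq.nlargest(
        num_leaves_to_search, range(len(centroids)),
        key=lambda c: dot(query, centroids[c]),
    )
    pool = [item for c in nearest for item in leaves[c]]
    top = heapq.nlargest(k, pool, key=lambda item: dot(query, item[1]))
    return [title for title, _ in top]

# Made-up 2-D data; the hand-picked centroids stand in for learned ones.
centroids = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
items = [("A", [0.9, 0.2]), ("B", [0.8, -0.1]), ("C", [0.1, 0.9]),
         ("D", [-0.7, 0.3]), ("E", [0.2, -0.8]), ("F", [0.6, 0.5])]
leaves = build_leaves(items, centroids)

query = [0.5, 0.6]
approx = search(query, centroids, leaves, num_leaves_to_search=1, k=2)
exact = search(query, centroids, leaves, num_leaves_to_search=4, k=2)
print(approx, exact)
```

Searching a single leaf misses `F`, which sits near a partition boundary and is in fact the best match for this query. Searching more leaves recovers it at the cost of scoring more items, which is the speed versus recall trade-off that `num_leaves_to_search` controls.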


## 4. The Ranking Model

The ranking model takes the candidates selected by the retrieval (and ANN) stage and scores them to produce a final, ordered list of recommendations. It typically uses more features than the retrieval model to achieve higher precision.

In [10]:
class MovieLensRankingModel(tfrs.Model):
    def __init__(self, user_model, movie_model, task):
        super().__init__()
        self.user_model = user_model
        self.movie_model = movie_model
        self.task = task

        # Ranking specific layers
        self.rating_model = tf.keras.Sequential([
            tf.keras.layers.Dense(256, activation="relu"),
            tf.keras.layers.Dense(128, activation="relu"),
            tf.keras.layers.Dense(1) # Predict the rating
        ])


    def call(self, features: Dict[Text, tf.Tensor]) -> tf.Tensor:
        user_embedding = self.user_model(features["user_id"])
        movie_embedding = self.movie_model(features["movie_title"])
        return self.rating_model(tf.concat([user_embedding, movie_embedding], axis=1))


    def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
        labels = features.pop("user_rating")
        rating_predictions = self(features)
        return self.task(labels=labels, predictions=rating_predictions)
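
The forward pass of such a ranking model is simple: concatenate the user and movie embeddings, pass them through dense layers with ReLU activations, and read off a single number as the predicted rating. The plain-Python sketch below mirrors that flow with one tiny hidden layer. The weights in `params` are made up for illustration and are not learned values.

```python
def relu(xs):
    return [max(0.0, x) for x in xs]

def dense(xs, weights, bias):
    """Fully connected layer; weights[j] holds the input weights of output unit j."""
    return [sum(w * x for w, x in zip(row, xs)) + b for row, b in zip(weights, bias)]

def predict_rating(user_emb, movie_emb, params):
    x = user_emb + movie_emb  # list concatenation, playing the role of tf.concat
    hidden = relu(dense(x, params["w1"], params["b1"]))
    return dense(hidden, params["w2"], params["b2"])[0]

# Hypothetical weights for a 4 -> 2 -> 1 network.
params = {
    "w1": [[1.0, 0.0, 1.0, 0.0], [0.0, -1.0, 0.0, -1.0]],
    "b1": [0.0, 0.0],
    "w2": [[2.0, 1.0]],
    "b2": [0.5],
}
rating = predict_rating([0.5, 0.25], [1.0, 0.75], params)
print(rating)
```

Unlike the dot product used for retrieval, the dense layers can model arbitrary interactions between user and movie dimensions. The price is that scores can no longer be found with a nearest-neighbor index, so this model is only applied to the short candidate list.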


In [11]:
# Prepare data for ranking (user_id, movie_title, user_rating)
ranking_data = ratings.map(lambda x: {
    "user_id": x["user_id"],
    "movie_title": x["movie_title"],
    "user_rating": x["user_rating"]
})

# Ranking Task (e.g., Mean Squared Error for rating prediction)
ranking_task = tfrs.tasks.Ranking(
    loss=tf.keras.losses.MeanSquaredError(),
    metrics=[tf.keras.metrics.RootMeanSquaredError()]
)
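
For reference, the loss and the metric configured above differ only by a square root. The sketch below computes both in plain Python on four made-up (label, prediction) pairs.

```python
import math

def mean_squared_error(labels, predictions):
    """Average of the squared differences between labels and predictions."""
    return sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels)

def root_mean_squared_error(labels, predictions):
    """Square root of the MSE, expressed in the same units as the ratings."""
    return math.sqrt(mean_squared_error(labels, predictions))

# Made-up ratings on the 1-5 scale.
labels = [4.0, 3.0, 5.0, 1.0]
predictions = [3.0, 3.0, 4.0, 3.0]

mse = mean_squared_error(labels, predictions)  # (1 + 0 + 1 + 4) / 4
rmse = root_mean_squared_error(labels, predictions)
print(mse, rmse)
```

RMSE is the easier number to read during training because it is on the rating scale: an RMSE of 1.0 means predictions are off by about one star in the root-mean-square sense.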

In [12]:
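# Re-use the user and movie models (or define new, potentially more complex ones).
# For simplicity, we reuse the retrieval towers here, but in practice
# ranking models often have different architectures or feature inputs.
# Note: these are the same layer objects, so training the ranking model also
# updates the embeddings used for retrieval; the ScaNN index built earlier
# is not refreshed.

ranking_user_model = user_model    # Or a new user model for ranking
ranking_movie_model = movie_model  # Or a new movie model for ranking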


ranking_model = MovieLensRankingModel(ranking_user_model, ranking_movie_model, ranking_task)

ranking_model.compile(optimizer=tf.keras.optimizers.Adagrad(0.1))


# Train the ranking model
history_ranking = ranking_model.fit(ranking_data.batch(4096), epochs=10)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


## 5. Using the Ranking Model with ScaNN Candidates

In [13]:
def get_ranked_recommendations(user_id_str, retrieved_movie_titles):
    # Create features for the ranking model
    num_retrieved_movies = retrieved_movie_titles.shape[0]
    ranking_features = {
        "user_id": tf.constant([user_id_str] * num_retrieved_movies),
        "movie_title": retrieved_movie_titles
    }

    # Get predicted ratings
    predicted_ratings = ranking_model(ranking_features)

    # Sort movies by predicted rating
    sorted_indices = tf.argsort(predicted_ratings, axis=0, direction='DESCENDING').numpy().flatten()
    ranked_movie_titles = tf.gather(retrieved_movie_titles, sorted_indices).numpy()
    ranked_scores = tf.gather(predicted_ratings, sorted_indices).numpy().flatten()

    print(f"Ranked Recommendations for user {user_id_str}:")
    for title, score in zip(ranked_movie_titles, ranked_scores):
        print(f"  {title.decode('utf-8')}: {score:.4f}")

In [14]:
# Example:
get_ranked_recommendations("42", retrieved_candidates_scann)

Ranked Recommendations for user 42:
  Fried Green Tomatoes (1991): 4.5286
  Circle of Friends (1995): 4.4630
  Rudy (1993): 4.3066
  Old Yeller (1957): 4.2954
  Client, The (1994): 4.1300
  Man Without a Face, The (1993): 4.1003
  Bridges of Madison County, The (1995): 4.0771
  Legends of the Fall (1994): 4.0558
  Angels in the Outfield (1994): 3.8697
  Kid in King Arthur's Court, A (1995): 3.6645


## Summary

- **Retrieval Model**: Quickly narrows down the vast movie catalog to a manageable set of candidates using efficient embeddings.


- **ScaNN Integration**: The movie embeddings from the retrieval model are indexed by ScaNN for fast approximate nearest neighbor search, making the candidate generation step even faster, especially for large item sets.


- **Ranking Model**: Takes the retrieved candidates (now efficiently sourced via ScaNN) and uses more features (or more complex interactions) to predict a score (like a rating) for each, then orders them to produce the final recommendations.