References: https://www.tensorflow.org/recommenders/examples/basic_ranking

# **RecSys Model 2: Ranking**

Real-world recommender systems are often composed of two stages:

1. **The retrieval stage** is responsible for selecting an initial set of hundreds of candidates from all possible candidates. The main objective of this model is to efficiently weed out all candidates that the user is not interested in. Because the retrieval model may be dealing with millions of candidates, it has to be computationally efficient.
2. **The ranking stage** takes the outputs of the retrieval model and fine-tunes them to select the best possible handful of recommendations. Its task is to narrow down the set of items the user may be interested in to a shortlist of likely candidates.

We're going to focus on the second stage, ranking.
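
The two stages described above can be sketched in plain Python. This is only an illustration: the function `recommend`, the candidate names, and both score tables are hypothetical stand-ins for a retrieval model and a ranking model, not part of this notebook's pipeline.

```python
from typing import Callable, Dict, List, Tuple


def recommend(
    candidates: List[str],
    cheap_score: Callable[[str], float],
    precise_score: Callable[[str], float],
    retrieve_k: int,
    final_k: int,
) -> List[Tuple[str, float]]:
    """Two-stage recommendation: cheap retrieval, then precise ranking."""
    # Stage 1 (retrieval): keep the retrieve_k best candidates under a cheap score.
    shortlist = sorted(candidates, key=cheap_score, reverse=True)[:retrieve_k]
    # Stage 2 (ranking): re-score only the shortlist with the more expensive model.
    ranked = sorted(
        ((c, precise_score(c)) for c in shortlist),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:final_k]


# Hypothetical scores standing in for the two models.
cheap: Dict[str, float] = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.7}
precise: Dict[str, float] = {"a": 3.5, "b": 4.8, "c": 5.0, "d": 4.1}

top = recommend(list(cheap), cheap.__getitem__, precise.__getitem__,
                retrieve_k=3, final_k=2)
print(top)
```

Note that candidate `c` never reaches the ranking stage even though the precise model would score it highest: the ranking stage can only reorder what retrieval hands over.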

## Imports

In [None]:
# Temporary solution for a bug in the implementation of the tfrs.layers.factorized_top_k module.
# https://github.com/tensorflow/recommenders/issues/712#issuecomment-2041163592

!pip uninstall tensorflow -y
!pip uninstall tensorflow-recommenders -y
#!pip uninstall tensorflow-datasets -y


import os
os.environ['TF_USE_LEGACY_KERAS'] = '1'

Found existing installation: tensorflow 2.17.1
Uninstalling tensorflow-2.17.1:
  Successfully uninstalled tensorflow-2.17.1

In [None]:
!pip install -q tensorflow==2.17
!pip install -q tensorflow-recommenders==0.7.3

#!pip install -q --upgrade tensorflow-datasets

In [None]:
import os
import pprint
import tempfile

from typing import Dict, Text

import numpy as np
import tensorflow as tf
#import tensorflow_datasets as tfds

import json
import pandas as pd
from google.colab import drive

In [None]:
import tensorflow_recommenders as tfrs

In [None]:
print(tf.__version__)

2.17.0


In [None]:
print(tfrs.__version__)

v0.7.3


# Importing and preprocessing the dataset

In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
JSON_FILE = '/content/drive/My Drive/yelp_academic_dataset_review.json'

In [None]:
# Define the number of lines to read
#n_lines = 25000

# Read the specified number of lines into a list of dictionaries
#with open(JSON_FILE, "r") as file:
#    data = [json.loads(next(file)) for _ in range(n_lines)]

# Convert the list of dictionaries into a DataFrame
#df = pd.DataFrame(data)

In [None]:
# Read the JSON lines file directly into a pandas DataFrame
#df = pd.read_json(JSON_FILE, lines=True)

In [None]:
def get_data(filename):
  # Initialize an empty list to store selected attributes
  filtered_data = []

  # Open and process JSON file line by line
  with open(filename, 'r') as file:
      for line in file:
          record = json.loads(line)
          # Extract only specific attributes
          filtered_data.append({'user_id': record['user_id'], 'business_id': record['business_id'], 'stars': record['stars']})

  # Create a DataFrame
  return pd.DataFrame(filtered_data)

In [None]:
df = get_data(JSON_FILE)

In [None]:
# Display the first few rows
print(df.head())

                  user_id             business_id  stars
0  mh_-eMZ6K5RLWhZyISBhwA  XQfwVwDr-v0ZS3_CbbE5Xw    3.0
1  OyoGAe7OKpv6SyGZT5g77Q  7ATYjTIgM3jUlt4UM3IypQ    5.0
2  8g_iMtfSiwikVnbP2etR0A  YjUWPpI6HXG530lwP-fb2A    3.0
3  _7bHUi9Uuf5__HHc_Q8guQ  kxX2SOes4o-D3ZQBkiMRfA    5.0
4  bcjbaE6dDog4jkNY91ncLQ  e4Vwtrqf-wpJfwesgvdgxQ    4.0


In [None]:
print(len(df)) # total number of entries

6990280
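
With close to seven million reviews, it can help to experiment on a subset first. The sketch below reads at most `limit` lines from a JSON Lines source and keeps only the three fields used here. The in-memory `sample` and the helper name `read_records` are assumptions for illustration. Unlike calling `next(file)` a fixed number of times, `itertools.islice` simply stops early when the file has fewer lines than requested.

```python
import io
import itertools
import json

# Two hypothetical review records in JSON Lines format (one JSON object per line).
sample = io.StringIO(
    '{"user_id": "u1", "business_id": "b1", "stars": 3.0, "text": "ok"}\n'
    '{"user_id": "u2", "business_id": "b2", "stars": 5.0, "text": "great"}\n'
)

KEEP = ("user_id", "business_id", "stars")


def read_records(file, limit=None):
    """Yield only the KEEP fields of each line; stop after `limit` lines if given."""
    for line in itertools.islice(file, limit):
        record = json.loads(line)
        yield {key: record[key] for key in KEEP}


# Asking for 10 lines from a 2-line source returns 2 rows without raising.
rows = list(read_records(sample, limit=10))
print(rows)
```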


In [None]:
def get_employee_ids_with_null_categories():
  JSON_FILE = '/content/drive/My Drive/yelp_academic_dataset_business.json'

  df = pd.read_json(JSON_FILE, lines=True)

  # Extract business_ids where categories is null (NaN)
  business_ids_with_null_categories = df.loc[df['categories'].isna(), 'business_id'].to_numpy()

  return business_ids_with_null_categories

In [None]:
employee_ids_with_null_categories = get_employee_ids_with_null_categories()

# Remove rows where business_id matches any value in employee_ids_with_null_categories
df = df[~df['business_id'].isin(employee_ids_with_null_categories)]

In [None]:
print(len(df))  # number of entries after removing employees who have null 'categories'

6989591


In [None]:
# Rename columns
df = df.rename(columns={'user_id': 'customer_id', 'business_id': 'employee_id'})

# Display the result
print(df.head())

              customer_id             employee_id  stars
0  mh_-eMZ6K5RLWhZyISBhwA  XQfwVwDr-v0ZS3_CbbE5Xw    3.0
1  OyoGAe7OKpv6SyGZT5g77Q  7ATYjTIgM3jUlt4UM3IypQ    5.0
2  8g_iMtfSiwikVnbP2etR0A  YjUWPpI6HXG530lwP-fb2A    3.0
3  _7bHUi9Uuf5__HHc_Q8guQ  kxX2SOes4o-D3ZQBkiMRfA    5.0
4  bcjbaE6dDog4jkNY91ncLQ  e4Vwtrqf-wpJfwesgvdgxQ    4.0


In [None]:
# Create TensorFlow Dataset using tf.data
tf_dataset = tf.data.Dataset.from_tensor_slices((
    {'customer_id': df['customer_id'].astype(str).values,      # Ensure conversion to strings
    'employee_id': df['employee_id'].astype(str).values,   # Ensure conversion to strings
    'stars': df['stars'].astype(float).values}  # Ensure conversion to floats
))

In [None]:
# Displaying a sample from the TensorFlow Dataset using pprint
for x in tf_dataset.take(1).as_numpy_iterator():
    pprint.pprint(x)

{'customer_id': b'mh_-eMZ6K5RLWhZyISBhwA',
 'employee_id': b'XQfwVwDr-v0ZS3_CbbE5Xw',
 'stars': 3.0}


Let's find the **unique employee ids** and **customer ids** present in the data.

This is important because we **need to be able to map the raw values of our categorical features to embedding vectors** in our models. To do that, we **need a vocabulary that maps a raw feature value to an integer in a contiguous range**: *this allows us to look up the corresponding embeddings in our embedding tables*.

In [None]:
# Extracting & processing data to build vocabularies (for customer and employee embeddings)

customers = tf_dataset.map(lambda x: x["customer_id"])
employees = tf_dataset.map(lambda x: x["employee_id"])

customer_ids = customers.batch(1_000)
employee_ids = employees.batch(1_000)

unique_customer_ids = np.unique(np.concatenate(list(customer_ids))) # vocabulary for the customer embeddings
unique_employee_ids = np.unique(np.concatenate(list(employee_ids))) # vocabulary for the employee embeddings

In [None]:
unique_customer_ids[:10]

array([b'---1lKK3aKOuomHnwAkAow', b'---2PmXbF47D870stH1jqA',
       b'---UgP94gokyCDuB5zUssA', b'---fa6ZK37T9NjkGKI4oSg',
       b'---r61b7EpVPkb4UVme5tA', b'---zemaUC8WeJeWKqS6p9Q',
       b'--034gGozmK4y5txuPsdAA', b'--0DrQkM0FT-yCQRWw82uQ',
       b'--0FNOzZkEQlz8WzS3WttQ', b'--0Jj_J_MmUJ51f1Y394Uw'], dtype=object)

In [None]:
print(len(unique_customer_ids))

1987685


In [None]:
unique_employee_ids[:10]

array([b'---kPU91CF4Lq2-WlRu9Lw', b'--0iUa4sNDFiZFrAdIWhZQ',
       b'--30_8IhuyMHbSOcNWd6DQ', b'--7PUidqRWpRSpXebiyxTg',
       b'--7jw19RH9JKXgFohspgQw', b'--8IbOsAAxjKRoYsBFL-PA',
       b'--9osgUCSDUWUkoTLdvYhQ', b'--ARBQr1WMsTWiwOKOj-FQ',
       b'--FWWsIwxRwuw9vIMImcQg', b'--FcbSxK1AoEtEAxOgBaCw'], dtype=object)

In [None]:
print(len(unique_employee_ids))

150243
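
To make the vocabulary idea concrete, here is a minimal pure-Python sketch of the mapping from raw ids to contiguous integers. The ids are hypothetical. Reserving index 0 for out-of-vocabulary ids reflects my understanding of how `StringLookup` behaves with `mask_token=None` and its default single OOV bucket, so treat that detail as an assumption.

```python
# Hypothetical raw ids; the real ones come from the dataset.
raw_ids = ["emp_c", "emp_a", "emp_b", "emp_a"]

# The sorted unique values form the vocabulary.
vocabulary = sorted(set(raw_ids))

# Index 0 is reserved for out-of-vocabulary (OOV) ids, so known ids start at 1.
OOV_INDEX = 0
index_of = {token: i for i, token in enumerate(vocabulary, start=1)}


def lookup(token: str) -> int:
    """Map a raw id to its integer index, falling back to the OOV index."""
    return index_of.get(token, OOV_INDEX)


indices = [lookup(t) for t in ["emp_a", "emp_c", "emp_unseen"]]

# The embedding table needs one row per known id plus one row for OOV.
num_embedding_rows = len(vocabulary) + 1
print(indices, num_embedding_rows)
```

This is why the embedding layers later in the notebook are sized `len(vocabulary) + 1`.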


In [None]:
# Split data into a training and evaluation set
# split the data by putting 80% of the ratings in the train set, and 20% in the test set.

tf.random.set_seed(42)
shuffled = tf_dataset.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

'''THE FOLLOWING IS NOT APPLICABLE TO MODEL 2'''
# Since this model is used only to reproduce the overall ratings already obtained by the employees who are in the production database,
# it is acceptable to train on the test dataset as well, so that their ratings are reproduced too.
# Because no unseen data is given as input to this model under any circumstance, the model doesn't need to generalise to unseen data (unseen employees, but not unseen categories).
# Therefore, the following code snippets that create the train and test splits are omitted during execution.

# However, an employee who hasn't worked and received a rating yet would never appear in the recommendations and would never be able to get work.
# It would therefore be necessary to make this model generalise to unseen data, so that such employees can appear in the recommendations and receive job opportunities.
# (E.g. if employee A has a rating of 0 [has not worked] and we train on both the train and test datasets, the model will reproduce a rating of 0 for employee A.
# If we instead generalise the model by holding out a test set [not used for training], employee A might receive a slightly higher rating from the model.
# That possibility would let the employee appear in recommendations and get work.)
# But that is not the purpose of this model: we only need to reproduce the overall ratings of all the employees in our database. This avoids requesting the ratings of millions of employees from the database for every customer request.
# Therefore, the following code snippets that create the train and test splits are omitted during execution.

trainset_size = round(len(shuffled) * 0.8)
testset_size = round(len(shuffled) * 0.2)

train = shuffled.take(trainset_size)
test = shuffled.skip(trainset_size).take(testset_size)

In [None]:
# Displaying a sample from the TensorFlow train Dataset using pprint
#for x in train.take(1).as_numpy_iterator():
#    pprint.pprint(x)

In [None]:
# Displaying a sample from the TensorFlow test Dataset using pprint
#for x in test.take(1).as_numpy_iterator():
#    pprint.pprint(x)
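
The 80/20 split relies on `take` and `skip` seeing the rows in the same order, which is why the shuffle is created with `reshuffle_each_iteration=False`; if the order changed between iterations, the two subsets could overlap. The same idea in plain Python, as a minimal sketch in which a hypothetical ten-element list stands in for the dataset:

```python
import random

records = list(range(10))  # stand-in for the dataset rows

# Shuffle once with a fixed seed so the split is reproducible.
rng = random.Random(42)
shuffled_records = records[:]
rng.shuffle(shuffled_records)

train_size = round(len(shuffled_records) * 0.8)
train_rows = shuffled_records[:train_size]   # analogous to Dataset.take(train_size)
test_rows = shuffled_records[train_size:]    # analogous to Dataset.skip(train_size)

print(len(train_rows), len(test_rows))
```

Taking everything after `train_size` as the test set, rather than rounding a separate 20% count, guarantees that every row lands in exactly one subset.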

# Implementing a model

## Architecture

Ranking models do not face the same efficiency constraints as retrieval models do, and so we have a little bit more freedom in our choice of architectures.

A model composed of multiple stacked dense layers is a relatively common architecture for ranking tasks. We can implement it as follows:

In [None]:
class RankingModel(tf.keras.Model):

  def __init__(self):
    super().__init__()
    embedding_dimension = 32  # Dimensionality of the customer and employee embeddings (representations)

    # Compute embeddings for customers.
    self.customer_embeddings = tf.keras.Sequential([
      tf.keras.layers.StringLookup(
        vocabulary=unique_customer_ids, mask_token=None),
      # Add one extra embedding row for unknown tokens, i.e. unseen or out-of-vocabulary (OOV) ids.
      tf.keras.layers.Embedding(len(unique_customer_ids) + 1, embedding_dimension)
    ])

    # Compute embeddings for employees.
    self.employee_embeddings = tf.keras.Sequential([
      tf.keras.layers.StringLookup(
        vocabulary=unique_employee_ids, mask_token=None),
      # Add one extra embedding row for unknown tokens, i.e. unseen or out-of-vocabulary (OOV) ids.
      tf.keras.layers.Embedding(len(unique_employee_ids) + 1, embedding_dimension)
    ])

    # Compute predictions.
    self.ratings = tf.keras.Sequential([
      # Learn multiple dense layers.
      tf.keras.layers.Dense(256, activation="relu"),
      tf.keras.layers.Dense(64, activation="relu"),
      # Make rating predictions in the final layer.
      tf.keras.layers.Dense(1)
    ])

  def call(self, inputs):

    customer_id, employee_id = inputs

    customer_embedding = self.customer_embeddings(customer_id)
    employee_embedding = self.employee_embeddings(employee_id)

    return self.ratings(tf.concat([customer_embedding, employee_embedding], axis=1))

This model takes customer ids and employee ids, and outputs a predicted rating:

In [None]:
RankingModel()((["Ha3iJu77CxlrFm-vQRs_8g"], ["W0vdz23JQtVQX5vJkiCj3g"]))

<tf.Tensor: shape=(1, 1), dtype=float32, numpy=array([[-0.01650509]], dtype=float32)>
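
To see what the forward pass computes, here is a NumPy sketch of the same shape flow: look up two embeddings, concatenate them, and pass the result through the dense stack. All parameters are small random stand-ins (an assumption, not the trained weights), and the inputs are already integer-encoded, so the string lookup step is skipped.

```python
import numpy as np

rng = np.random.default_rng(0)
embedding_dim = 32

# Randomly initialised stand-ins for the learned parameters (assumptions).
customer_table = rng.normal(size=(5, embedding_dim))  # 4 known customers + 1 OOV row
employee_table = rng.normal(size=(4, embedding_dim))  # 3 known employees + 1 OOV row
w1, b1 = rng.normal(size=(2 * embedding_dim, 256)) * 0.05, np.zeros(256)
w2, b2 = rng.normal(size=(256, 64)) * 0.05, np.zeros(64)
w3, b3 = rng.normal(size=(64, 1)) * 0.05, np.zeros(1)


def relu(x):
    return np.maximum(x, 0.0)


def predict(customer_idx, employee_idx):
    """Forward pass for a batch of integer-encoded (customer, employee) pairs."""
    x = np.concatenate(
        [customer_table[customer_idx], employee_table[employee_idx]], axis=1
    )                          # shape: (batch, 64)
    x = relu(x @ w1 + b1)      # shape: (batch, 256)
    x = relu(x @ w2 + b2)      # shape: (batch, 64)
    return x @ w3 + b3         # shape: (batch, 1)


scores = predict(np.array([1, 2]), np.array([3, 0]))
print(scores.shape)
```

Before training, the output is just a function of the random initial weights, so a prediction from an untrained model carries no information about the rating.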

## Loss and metrics

The next component is the loss used to train our model. TFRS has several loss layers and tasks to make this easy.

In this instance, we'll make use of the `Ranking` task object: a convenience wrapper that bundles together the loss function and metric computation.

We'll use it together with the `MeanSquaredError` Keras loss in order to predict the ratings.

In [None]:
task = tfrs.tasks.Ranking(
  loss = tf.keras.losses.MeanSquaredError(),
  metrics=[tf.keras.metrics.RootMeanSquaredError()]
)

The task itself is a Keras layer that takes the true labels and the predicted values as arguments, and returns the computed loss. We'll use that to implement the model's training loop.
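
As a small worked example of the two quantities involved, the sketch below computes the mean squared error (the loss) and its square root (the RMSE metric) by hand. The labels and predictions are hypothetical values chosen to keep the arithmetic easy to follow.

```python
import math

# Hypothetical true star ratings and model predictions for one batch.
labels = [5.0, 3.0, 4.0, 1.0]
predictions = [4.0, 3.0, 2.0, 2.0]

squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
mse = sum(squared_errors) / len(squared_errors)  # the quantity minimised as the loss
rmse = math.sqrt(mse)                            # the metric, expressed in star units

print(mse, rmse)
```

Because RMSE is in the same units as the labels, it is easier to interpret than the raw loss: a value of 1.0 means predictions are off by roughly one star on average, with larger errors weighted more heavily.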

## The full model

We can now put it all together into a model. TFRS exposes a base model class (`tfrs.models.Model`) which streamlines building models: all we need to do is set up the components in the `__init__` method and implement the `compute_loss` method, which takes in the raw features and returns a loss value.

The base model will then take care of creating the appropriate training loop to fit our model.

In [None]:
class YelpModel(tfrs.models.Model):

  def __init__(self):
    super().__init__()
    self.ranking_model: tf.keras.Model = RankingModel()
    self.task: tf.keras.layers.Layer = tfrs.tasks.Ranking(
      loss = tf.keras.losses.MeanSquaredError(),
      metrics=[tf.keras.metrics.RootMeanSquaredError()]
    )

  def call(self, features: Dict[str, tf.Tensor]) -> tf.Tensor:
    return self.ranking_model(
        (features["customer_id"], features["employee_id"]))

  def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
    labels = features.pop("stars")

    rating_predictions = self(features)

    # The task computes the loss and the metrics.
    return self.task(labels=labels, predictions=rating_predictions)

# Fitting and evaluating

After defining the model, we can use standard Keras fitting and evaluation routines to fit and evaluate the model.

Let's first instantiate the model.

In [None]:
model = YelpModel()
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

Then shuffle, batch, and cache the training and evaluation data.

In [None]:
'''THE FOLLOWING IS NOT APPLICABLE TO MODEL 2'''
# Since this model is used only to reproduce the overall ratings already obtained by the employees who are in the production database,
# it is acceptable to train on the test dataset as well, so that their ratings are reproduced too.
# Because no unseen data is given as input to this model under any circumstance, the model doesn't need to generalise to unseen data (unseen employees, but not unseen categories).
# We only need to reproduce the overall ratings of all the employees in our database. This avoids requesting the ratings of millions of employees from the database for every customer request.
# Therefore, the following code snippets that create the train and test splits are omitted during execution.

cached_train = train.shuffle(100_000).batch(8192).cache()
cached_test = test.batch(4096).cache()

#cached_train = shuffled.shuffle(100_000).batch(8192).cache()

Then train the model:

In [None]:
model.fit(cached_train, epochs=50)

Epoch 1/50
...
Epoch 50/50


<tf_keras.src.callbacks.History at 0x7b8cd70868f0>

As the model trains, the loss is falling and the RMSE metric is improving.

Finally, we can evaluate our model on the test set:

In [None]:
'''THE FOLLOWING IS NOT APPLICABLE TO MODEL 2'''
# Since this model is used only to reproduce the overall ratings already obtained by the employees who are in the production database,
# it is acceptable to train on the test dataset as well, so that their ratings are reproduced too.
# Because no unseen data is given as input to this model under any circumstance, the model doesn't need to generalise to unseen data (unseen employees, but not unseen categories).
# We only need to reproduce the overall ratings of all the employees in our database. This avoids requesting the ratings of millions of employees from the database for every customer request.
# Therefore, the following code snippet that tests the model is omitted during execution.

model.evaluate(cached_test, return_dict=True)



{'root_mean_squared_error': 1.7364033460617065,
 'loss': 3.5482215881347656,
 'regularization_loss': 0,
 'total_loss': 3.5482215881347656}

The lower the RMSE metric, the more accurate our model is at predicting ratings.
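
An RMSE value is easier to judge against a simple reference point. One common choice, offered here as a suggestion rather than something this notebook does, is the RMSE of always predicting the mean training rating. The ratings below are hypothetical; in practice the training and test labels would be used.

```python
import math

# Hypothetical star ratings standing in for the real train and test labels.
train_labels = [5.0, 4.0, 3.0, 5.0, 3.0]
test_labels = [4.0, 5.0, 2.0]

# Baseline: always predict the mean training rating.
mean_rating = sum(train_labels) / len(train_labels)
baseline_rmse = math.sqrt(
    sum((y - mean_rating) ** 2 for y in test_labels) / len(test_labels)
)

print(mean_rating, round(baseline_rmse, 4))
```

A ranking model is only adding value on held-out data if its test RMSE is below this kind of constant-prediction baseline.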

# Testing the ranking model

Now we can test the ranking model by computing predictions for a set of employees and then ranking these employees based on the predictions:


In [None]:
test_ratings = {}
test_employee_ids = ["YbnJYHNp_fHbI-hcFg48vQ", "DD3TxygdxBxKh9gbjCuLDA", "1bJxvwuMTyXmQGu90WLPhA", "W0vdz23JQtVQX5vJkiCj3g", "lTCoYu00AUV0SHxOa-XXBw"]
for employee_id in test_employee_ids:
  test_ratings[employee_id] = model({
      "customer_id": np.array(["Ha3iJu77CxlrFm-vQRs_8g"]),
      "employee_id": np.array([employee_id])
  })

print("Ratings:")
for employee_id, score in sorted(test_ratings.items(), key=lambda x: x[1], reverse=True):
  print(f"{employee_id}: {score}")

Ratings:
YbnJYHNp_fHbI-hcFg48vQ: [[4.92117]]
lTCoYu00AUV0SHxOa-XXBw: [[4.74607]]
DD3TxygdxBxKh9gbjCuLDA: [[4.2790422]]
1bJxvwuMTyXmQGu90WLPhA: [[1.313463]]
W0vdz23JQtVQX5vJkiCj3g: [[1.1837858]]
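
Each prediction comes back as a tensor of shape `(1, 1)`, which is why the scores print with double brackets. For downstream use it is usually more convenient to unwrap them to plain floats first, for example with something like `float(score.numpy()[0, 0])` (an assumption about how one might do it). The sketch below starts from the scores printed above, already unwrapped, and selects a top-k shortlist; `top_k` is a hypothetical parameter.

```python
# Predicted scores copied from the output above, unwrapped to plain floats.
scores = {
    "YbnJYHNp_fHbI-hcFg48vQ": 4.92117,
    "DD3TxygdxBxKh9gbjCuLDA": 4.2790422,
    "1bJxvwuMTyXmQGu90WLPhA": 1.313463,
    "W0vdz23JQtVQX5vJkiCj3g": 1.1837858,
    "lTCoYu00AUV0SHxOa-XXBw": 4.74607,
}

top_k = 3  # hypothetical shortlist size

# Sort by score, highest first, and keep the first top_k entries.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]
for employee_id, score in ranked:
    print(f"{employee_id}: {score:.2f}")
```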


# Exporting for serving

The model can be easily exported for serving:


In [None]:
tf.saved_model.save(model, "export")

We can now load it back and perform predictions:

In [None]:
loaded = tf.saved_model.load("export")

loaded({"customer_id": np.array(["Ha3iJu77CxlrFm-vQRs_8g"]), "employee_id": ["DD3TxygdxBxKh9gbjCuLDA"]}).numpy()

array([[4.2790422]], dtype=float32)

In [None]:
# Define the folder path for saving the model
save_dir = '/content/drive/My Drive/Colab Notebooks/Saved Models'
#save_dir = '/content/Saved Model'

# Ensure the folder exists
os.makedirs(save_dir, exist_ok=True)

# Path to save the model
model_path = os.path.join(save_dir, "recsys_model_two_ranking")

# Save the model
tf.saved_model.save(
    model,
    model_path
)

# Load the model back
loaded = tf.saved_model.load(model_path)

# Pass a customer id and an employee id to get a rating prediction
rating = loaded({"customer_id": np.array(["Ha3iJu77CxlrFm-vQRs_8g"]), "employee_id": ["DD3TxygdxBxKh9gbjCuLDA"]}).numpy()

print("Rating predictions:")
print(f"DD3TxygdxBxKh9gbjCuLDA: {rating[0]}")

Rating predictions:
DD3TxygdxBxKh9gbjCuLDA: [4.2790422]
