References: https://www.tensorflow.org/recommenders/examples/basic_ranking

# **RecSys Model 1: Ranking**

Real-world recommender systems are often composed of two stages:

1. **The retrieval stage** is responsible for selecting an initial set of hundreds of candidates from all possible candidates. The main objective of this model is to efficiently weed out all candidates that the user is not interested in. Because the retrieval model may be dealing with millions of candidates, it has to be computationally efficient.
2. **The ranking stage** takes the outputs of the retrieval model and fine-tunes them to select the best possible handful of recommendations. Its task is to narrow down the set of items the user may be interested in to a shortlist of likely candidates.

We're going to focus on the second stage, ranking.
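
The two stages described above can be sketched in a few lines of plain Python. This is a toy illustration, not the TFRS implementation: the `retrieve` and `rank` helpers, the candidate ids, and the scores are all hypothetical, and a real retrieval stage would use a learned model rather than a category filter.

```python
from typing import Callable, Dict, List


def retrieve(candidates: Dict[str, List[str]], category: str) -> List[str]:
    """Stage 1 (toy): cheaply drop every candidate not tagged with the category."""
    return [cid for cid, cats in candidates.items() if category in cats]


def rank(candidate_ids: List[str], score: Callable[[str], float], top_k: int) -> List[str]:
    """Stage 2: score each surviving candidate and keep the top_k best."""
    return sorted(candidate_ids, key=score, reverse=True)[:top_k]


# Hypothetical candidates and scores, for illustration only.
candidates = {
    "a": ["Electricians", "Contractors"],
    "b": ["Restaurants"],
    "c": ["Electricians"],
    "d": ["Electricians", "Lighting"],
}
toy_scores = {"a": 3.5, "b": 5.0, "c": 4.5, "d": 4.0}

retrieved = retrieve(candidates, "Electricians")
shortlist = rank(retrieved, toy_scores.__getitem__, top_k=2)
print(shortlist)
```

Note that candidate `b` has the highest score but never reaches the ranking stage, because retrieval has already filtered it out.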

## Imports

In [1]:
# Temporary solution for a bug in the implementation of the tfrs.layers.factorized_top_k module.
# https://github.com/tensorflow/recommenders/issues/712#issuecomment-2041163592

!pip uninstall tensorflow -y
!pip uninstall tensorflow-recommenders -y
#!pip uninstall tensorflow-datasets -y


import os
os.environ['TF_USE_LEGACY_KERAS'] = '1'

Found existing installation: tensorflow 2.17.1
Uninstalling tensorflow-2.17.1:
  Successfully uninstalled tensorflow-2.17.1

In [2]:
!pip install -q tensorflow==2.17
!pip install -q tensorflow-recommenders==0.7.3

#!pip install -q --upgrade tensorflow-datasets

In [3]:
import os
import pprint
import tempfile

from typing import Dict, Text

import numpy as np
import tensorflow as tf
#import tensorflow_datasets as tfds

#import json
import pandas as pd
from google.colab import drive

In [4]:
import tensorflow_recommenders as tfrs

In [5]:
print(tf.__version__)

2.17.0


In [6]:
print(tfrs.__version__)

v0.7.3


# Importing and preprocessing the dataset

In [7]:
drive.mount('/content/drive')

Mounted at /content/drive


In [8]:
JSON_FILE = '/content/drive/My Drive/yelp_academic_dataset_business.json'

In [9]:
# Read the JSON Lines file directly into a pandas DataFrame
df = pd.read_json(JSON_FILE, lines=True)

# Alternative (disabled): read only the first n_lines records into a list of
# dictionaries, then convert that list into a DataFrame. Requires `import json`.
#n_lines = 25000
#with open(JSON_FILE, "r") as file:
#    data = [json.loads(next(file)) for _ in range(n_lines)]
#df = pd.DataFrame(data)

# Display the first few rows
print(df.head())

              business_id                      name  \
0  Pns2l4eNsfO8kk83dixA6A  Abby Rappoport, LAC, CMQ   
1  mpf3x-BjTdTEA3yCZrAYPw             The UPS Store   
2  tUFrWirKiKi_TAnsVWINQQ                    Target   
3  MTSW4McQd7CbVtyjqoe9mw        St Honore Pastries   
4  mWMc6_wTdE0EUBKIGXDVfA  Perkiomen Valley Brewery   

                           address           city state postal_code  \
0           1616 Chapala St, Ste 2  Santa Barbara    CA       93101   
1  87 Grasso Plaza Shopping Center         Affton    MO       63123   
2             5255 E Broadway Blvd         Tucson    AZ       85711   
3                      935 Race St   Philadelphia    PA       19107   
4                    101 Walnut St     Green Lane    PA       18054   

    latitude   longitude  stars  review_count  is_open  \
0  34.426679 -119.711197    5.0             7        0   
1  38.551126  -90.335695    3.0            15        1   
2  32.223236 -110.880452    3.5            22        0   
3  39.9555

In [10]:
print(len(df)) # total number of entries

150346


In [11]:
# Filter rows where 'categories' is not null
df = df[df['categories'].notnull()]

# Select specific columns
df = df[['categories', 'business_id', 'stars']]

# Display the result
print(df.head())

                                          categories             business_id  \
0  Doctors, Traditional Chinese Medicine, Naturop...  Pns2l4eNsfO8kk83dixA6A   
1  Shipping Centers, Local Services, Notaries, Ma...  mpf3x-BjTdTEA3yCZrAYPw   
2  Department Stores, Shopping, Fashion, Home & G...  tUFrWirKiKi_TAnsVWINQQ   
3  Restaurants, Food, Bubble Tea, Coffee & Tea, B...  MTSW4McQd7CbVtyjqoe9mw   
4                          Brewpubs, Breweries, Food  mWMc6_wTdE0EUBKIGXDVfA   

   stars  
0    5.0  
1    3.0  
2    3.5  
3    4.0  
4    4.5  


In [12]:
print(len(df))  # number of entries after dropping the 103 rows whose 'categories' value is null

150243


In [13]:
# Split 'categories' into a list of categories
df['categories'] = df['categories'].str.split(', ')

# Use explode to create a row for each category
df = df.explode('categories').reset_index(drop=True)

# Rename columns
df = df.rename(columns={'categories': 'category', 'business_id': 'employee_id', 'stars': 'overall_star'})

# Display the result
print(df.head())

                       category             employee_id  overall_star
0                       Doctors  Pns2l4eNsfO8kk83dixA6A           5.0
1  Traditional Chinese Medicine  Pns2l4eNsfO8kk83dixA6A           5.0
2         Naturopathic/Holistic  Pns2l4eNsfO8kk83dixA6A           5.0
3                   Acupuncture  Pns2l4eNsfO8kk83dixA6A           5.0
4              Health & Medical  Pns2l4eNsfO8kk83dixA6A           5.0
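
As a rough, pandas-free illustration of what the split-and-explode step does, the sketch below turns one row per business into one row per (category, business) pair. The first input row is taken from the output shown earlier; the second row and its id are made up for the example.

```python
# One row per business, with categories stored as a comma-separated string.
rows = [
    {"categories": "Brewpubs, Breweries, Food", "business_id": "mWMc6_wTdE0EUBKIGXDVfA", "stars": 4.5},
    {"categories": "Food", "business_id": "toy-business-id", "stars": 3.0},  # made-up row
]

# Split each category string and emit one output row per category,
# renaming the fields the same way the notebook does.
exploded = [
    {"category": category, "employee_id": row["business_id"], "overall_star": row["stars"]}
    for row in rows
    for category in row["categories"].split(", ")
]

for row in exploded:
    print(row)
```

Each business's star rating is copied onto every one of its categories, which is why the row count grows after this step.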


In [14]:
print(len(df)) # total number of entries after splitting 'categories'

668592


In [15]:
# Create TensorFlow Dataset using tf.data
tf_dataset = tf.data.Dataset.from_tensor_slices({
    'category': df['category'].astype(str).values,            # Ensure conversion to strings
    'employee_id': df['employee_id'].astype(str).values,      # Ensure conversion to strings
    'overall_star': df['overall_star'].astype(float).values,  # Ensure conversion to floats
})

In [16]:
# Displaying a sample from the TensorFlow Dataset using pprint
for x in tf_dataset.take(1).as_numpy_iterator():
    pprint.pprint(x)

{'category': b'Doctors',
 'employee_id': b'Pns2l4eNsfO8kk83dixA6A',
 'overall_star': 5.0}


Let's figure out **unique employee ids** and **categories** present in the data.

This is important because we **need to be able to map the raw values of our categorical features to embedding vectors** in our models. To do that, we **need a vocabulary that maps a raw feature value to an integer in a contiguous range**: *this allows us to look up the corresponding embeddings in our embedding tables*.
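
To make the vocabulary idea concrete, here is a minimal sketch in plain Python. It assumes the convention of reserving index 0 for unknown values, which is how I understand `StringLookup` to behave with `mask_token=None` and its default single out-of-vocabulary slot. The helper names, the toy categories, and the tiny embedding size are illustrative only.

```python
import random
from typing import Dict, List


def build_vocabulary(values: List[str]) -> Dict[str, int]:
    """Map each distinct value to an index in 1..N; index 0 is kept for unknown values."""
    return {value: index for index, value in enumerate(sorted(set(values)), start=1)}


def lookup(vocabulary: Dict[str, int], value: str) -> int:
    """Return the integer index of a raw value, or 0 if it is out of vocabulary."""
    return vocabulary.get(value, 0)


raw_categories = ["Doctors", "Acupuncture", "Doctors", "Food"]
vocabulary = build_vocabulary(raw_categories)

# The embedding table has one row per vocabulary entry plus one row for unknowns.
embedding_dimension = 4
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-0.05, 0.05) for _ in range(embedding_dimension)]
    for _ in range(len(vocabulary) + 1)
]

doctors_vector = embedding_table[lookup(vocabulary, "Doctors")]
unknown_vector = embedding_table[lookup(vocabulary, "Plumbing")]  # falls back to row 0
print(vocabulary)
```

Because the indices are contiguous, looking up an embedding is just a row access into the table, and the extra row is what lets the model accept values it never saw during training.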

In [17]:
# Extracting & processing data to build vocabularies (for category and employee embeddings)

employees = tf_dataset.map(lambda x: x["employee_id"])
categories = tf_dataset.map(lambda x: x["category"])

employee_ids = employees.batch(1_000)
category_names = categories.batch(1_000)

unique_employee_ids = np.unique(np.concatenate(list(employee_ids))) # vocabulary for the employee embeddings
unique_category_names = np.unique(np.concatenate(list(category_names))) # vocabulary for the category embeddings

In [18]:
unique_employee_ids[:10]

array([b'---kPU91CF4Lq2-WlRu9Lw', b'--0iUa4sNDFiZFrAdIWhZQ',
       b'--30_8IhuyMHbSOcNWd6DQ', b'--7PUidqRWpRSpXebiyxTg',
       b'--7jw19RH9JKXgFohspgQw', b'--8IbOsAAxjKRoYsBFL-PA',
       b'--9osgUCSDUWUkoTLdvYhQ', b'--ARBQr1WMsTWiwOKOj-FQ',
       b'--FWWsIwxRwuw9vIMImcQg', b'--FcbSxK1AoEtEAxOgBaCw'], dtype=object)

In [19]:
print(len(unique_employee_ids))

150243


In [20]:
unique_category_names[:10]

array([b'& Probates', b'3D Printing', b'ATV Rentals/Tours', b'Acai Bowls',
       b'Accessories', b'Accountants', b'Acne Treatment', b'Active Life',
       b'Acupuncture', b'Addiction Medicine'], dtype=object)

In [21]:
print(len(unique_category_names))

1311


In [22]:
# Shuffle the data. An 80/20 train/test split would normally follow,
# but it is deliberately left commented out below.

tf.random.set_seed(42)
shuffled = tf_dataset.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

# Why there is no split:
# This model only reproduces the overall ratings already stored for employees in the production database,
# so the data that would otherwise form the test set is used for training as well.
# No unseen employees are ever given to the model as input, so it does not need to generalise to them
# (this applies to unseen employees, not to unseen categories).
# Reproducing the ratings with the model avoids querying the database for the ratings of millions of employees on every customer request.

# Trade-off:
# An employee who has not worked yet, and so has no rating, will never appear in the recommendations and cannot get work through them.
# (e.g. if employee A has a rating of 0 [not worked] and all the data is used for training, the model will reproduce a rating of 0 for employee A.
# If the model were instead made to generalise using a held-out test set, employee A might receive a slightly higher predicted rating,
# which would give that employee a chance to appear in the recommendations.)
# That is not the purpose of this model, so the split code below is omitted during execution.

#trainset_size = round(len(shuffled) * 0.8)
#testset_size = round(len(shuffled) * 0.2)

#train = shuffled.take(trainset_size)
#test = shuffled.skip(trainset_size).take(testset_size)
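
If a held-out split is wanted later, the commented-out `take`/`skip` logic amounts to the following, shown here on a plain Python list rather than a `tf.data.Dataset`. The `split_examples` helper is hypothetical and the ten integers stand in for real examples.

```python
import random
from typing import List, Tuple


def split_examples(examples: List[int], train_fraction: float = 0.8,
                   seed: int = 42) -> Tuple[List[int], List[int]]:
    """Shuffle once with a fixed seed, then cut the list into train and test parts."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    train_size = round(len(shuffled) * train_fraction)
    # "take" the first train_size examples, "skip" them to get the rest.
    return shuffled[:train_size], shuffled[train_size:]


examples = list(range(10))
train, test = split_examples(examples)
print(len(train), len(test))
```

Taking everything after the first `train_size` examples as the test set, instead of computing a separate rounded test size, guarantees that the two parts never overlap and together cover every example.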

In [23]:
# Displaying a sample from the TensorFlow train Dataset using pprint
#for x in train.take(1).as_numpy_iterator():
#    pprint.pprint(x)

In [24]:
# Displaying a sample from the TensorFlow test Dataset using pprint
#for x in test.take(1).as_numpy_iterator():
#    pprint.pprint(x)

# Implementing a model

## Architecture

Ranking models do not face the same efficiency constraints as retrieval models do, and so we have a little bit more freedom in our choice of architectures.

A model composed of multiple stacked dense layers is a relatively common architecture for ranking tasks. We can implement it as follows:

In [25]:
class RankingModel(tf.keras.Model):

  def __init__(self):
    super().__init__()
    embedding_dimension = 32 # The dimensionality of the category and employees embeddings/representations

    # Compute embeddings for category names.
    self.category_embeddings = tf.keras.Sequential([
      tf.keras.layers.StringLookup(
        vocabulary=unique_category_names, mask_token=None),
      # We add an additional embedding to account for unknown tokens (to handle unseen or out-of-vocabulary (OOV) data.)
      tf.keras.layers.Embedding(len(unique_category_names) + 1, embedding_dimension)
    ])

    # Compute embeddings for employees.
    self.employee_embeddings = tf.keras.Sequential([
      tf.keras.layers.StringLookup(
        vocabulary=unique_employee_ids, mask_token=None),
      # We add an additional embedding to account for unknown tokens (to handle unseen or out-of-vocabulary (OOV) data.)
      tf.keras.layers.Embedding(len(unique_employee_ids) + 1, embedding_dimension)
    ])

    # Compute predictions.
    self.ratings = tf.keras.Sequential([
      # Learn multiple dense layers.
      tf.keras.layers.Dense(256, activation="relu"),
      tf.keras.layers.Dense(64, activation="relu"),
      # Make rating predictions in the final layer.
      tf.keras.layers.Dense(1)
    ])

  def call(self, inputs):

    category_name, employee_id = inputs

    category_embedding = self.category_embeddings(category_name)
    employee_embedding = self.employee_embeddings(employee_id)

    return self.ratings(tf.concat([category_embedding, employee_embedding], axis=1))

This model takes category names and employee ids, and outputs a predicted rating:

In [26]:
RankingModel()((["Electricians"], ["TCtVAiGDb05PyLe3v-zXDA"]))

<tf.Tensor: shape=(1, 1), dtype=float32, numpy=array([[0.00085131]], dtype=float32)>
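
To show the data flow inside `call` without TensorFlow, here is a small plain-Python sketch of the same pattern: concatenate the two embeddings, pass them through ReLU dense layers, and finish with a single linear output. The layer sizes are shrunk (4-dimensional embeddings, 16 and 8 hidden units) and the weights are random, so the numbers are illustrative and the `dense` and `random_matrix` helpers are hypothetical.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dense(x: Vector, weights: Matrix, bias: Vector, relu: bool) -> Vector:
    """Compute x @ weights + bias, where weights has shape [len(x)][len(bias)]."""
    out = [
        sum(x_i * w_row[j] for x_i, w_row in zip(x, weights)) + bias[j]
        for j in range(len(bias))
    ]
    return [max(0.0, v) for v in out] if relu else out


def random_matrix(rng: random.Random, rows: int, cols: int) -> Matrix:
    """Small random weights, standing in for an untrained layer."""
    return [[rng.uniform(-0.05, 0.05) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
embedding_dimension = 4
category_embedding = [rng.uniform(-0.05, 0.05) for _ in range(embedding_dimension)]
employee_embedding = [rng.uniform(-0.05, 0.05) for _ in range(embedding_dimension)]

# Equivalent of tf.concat([...], axis=1) for a single example.
features = category_embedding + employee_embedding

# (inputs, outputs, use_relu) for each layer; the last layer is linear.
layer_shapes = [(8, 16, True), (16, 8, True), (8, 1, False)]
x = features
for n_in, n_out, use_relu in layer_shapes:
    x = dense(x, random_matrix(rng, n_in, n_out), [0.0] * n_out, use_relu)

prediction = x
print(prediction)
```

With small random weights and zero biases the output stays close to zero, which is likely why the untrained model above returns a value near 0 rather than something on the star-rating scale.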

## Loss and metrics

The next component is the loss used to train our model. TFRS has several loss layers and tasks to make this easy.

In this instance, we'll make use of the `Ranking` task object: a convenience wrapper that bundles together the loss function and metric computation.

We'll use it together with the `MeanSquaredError` Keras loss in order to predict the ratings.

In [27]:
task = tfrs.tasks.Ranking(
  loss = tf.keras.losses.MeanSquaredError(),
  metrics=[tf.keras.metrics.RootMeanSquaredError()]
)

The task itself is a Keras layer that takes the true labels and the predictions as arguments, and returns the computed loss. We'll use that to implement the model's training loop.
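
For reference, the loss and the metric used here can be computed by hand. The sketch below uses four made-up label and prediction pairs; it illustrates the formulas only and is not how TFRS or Keras implement them internally.

```python
import math
from typing import List


def mean_squared_error(labels: List[float], predictions: List[float]) -> float:
    """Average of the squared differences between labels and predictions."""
    return sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels)


# Made-up star ratings and predictions, for illustration only.
labels = [5.0, 3.0, 3.5, 4.0]
predictions = [4.5, 3.5, 3.5, 3.0]

mse = mean_squared_error(labels, predictions)   # (0.25 + 0.25 + 0.0 + 1.0) / 4
rmse = math.sqrt(mse)                           # same error, back in "stars" units
print(mse, rmse)
```

The RMSE is simply the square root of the MSE, so it is expressed in the same units as the ratings, which makes it easier to read as a metric.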

## The full model

We can now put it all together into a model. TFRS exposes a base model class (`tfrs.models.Model`) which streamlines building models: all we need to do is to set up the components in the `__init__` method, and implement the `compute_loss` method, taking in the raw features and returning a loss value.

The base model will then take care of creating the appropriate training loop to fit our model.

In [28]:
class YelpModel(tfrs.models.Model):

  def __init__(self):
    super().__init__()
    self.ranking_model: tf.keras.Model = RankingModel()
    self.task: tf.keras.layers.Layer = tfrs.tasks.Ranking(
      loss = tf.keras.losses.MeanSquaredError(),
      metrics=[tf.keras.metrics.RootMeanSquaredError()]
    )

  def call(self, features: Dict[str, tf.Tensor]) -> tf.Tensor:
    return self.ranking_model(
        (features["category"], features["employee_id"]))

  def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
    labels = features.pop("overall_star")

    rating_predictions = self(features)

    # The task computes the loss and the metrics.
    return self.task(labels=labels, predictions=rating_predictions)

# Fitting and evaluating

After defining the model, we can use standard Keras fitting and evaluation routines to fit and evaluate the model.

Let's first instantiate the model.

In [29]:
model = YelpModel()
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

Then shuffle, batch, and cache the training data. No evaluation data is prepared, because the train/test split is skipped in this notebook.

In [30]:
# This model only reproduces the overall ratings already stored for employees in the production database,
# so the data that would otherwise form the test set is used for training as well.
# No unseen employees are ever given to the model as input, so it does not need to generalise to them
# (this applies to unseen employees, not to unseen categories).
# Reproducing the ratings with the model avoids querying the database for the ratings of millions of employees on every customer request.
# The separate train and test pipelines below are therefore omitted during execution.

#cached_train = train.shuffle(100_000).batch(8192).cache()
#cached_test = test.batch(4096).cache()

cached_train = shuffled.shuffle(100_000).batch(8192).cache()
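
As a quick sanity check on the batch size, the number of training steps per epoch follows directly from the row count printed earlier (668592) and the batch size of 8192:

```python
import math

num_examples = 668_592   # rows after exploding the categories
batch_size = 8_192

steps_per_epoch = math.ceil(num_examples / batch_size)
last_batch_size = num_examples - (steps_per_epoch - 1) * batch_size

print(steps_per_epoch)   # 82
print(last_batch_size)   # 5040
```

So each epoch consists of 81 full batches plus one smaller final batch.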

Then train the model:

In [31]:
model.fit(cached_train, epochs=200)

Epoch 1/200
Epoch 2/200
Epoch 3/200
Epoch 4/200
Epoch 5/200
Epoch 6/200
Epoch 7/200
Epoch 8/200
Epoch 9/200
Epoch 10/200
Epoch 11/200
Epoch 12/200
Epoch 13/200
Epoch 14/200
Epoch 15/200
Epoch 16/200
Epoch 17/200
Epoch 18/200
Epoch 19/200
Epoch 20/200
Epoch 21/200
Epoch 22/200
Epoch 23/200
Epoch 24/200
Epoch 25/200
Epoch 26/200
Epoch 27/200
Epoch 28/200
Epoch 29/200
Epoch 30/200
Epoch 31/200
Epoch 32/200
Epoch 33/200
Epoch 34/200
Epoch 35/200
Epoch 36/200
Epoch 37/200
Epoch 38/200
Epoch 39/200
Epoch 40/200
Epoch 41/200
Epoch 42/200
Epoch 43/200
Epoch 44/200
Epoch 45/200
Epoch 46/200
Epoch 47/200
Epoch 48/200
Epoch 49/200
Epoch 50/200
Epoch 51/200
Epoch 52/200
Epoch 53/200
Epoch 54/200
Epoch 55/200
Epoch 56/200
Epoch 57/200
Epoch 58/200
Epoch 59/200
Epoch 60/200
Epoch 61/200
Epoch 62/200
Epoch 63/200
Epoch 64/200
Epoch 65/200
Epoch 66/200
Epoch 67/200
Epoch 68/200
Epoch 69/200
Epoch 70/200
Epoch 71/200
Epoch 72/200
Epoch 73/200
Epoch 74/200
Epoch 75/200
Epoch 76/200
Epoch 77/200
Epoch 78

<tf_keras.src.callbacks.History at 0x79ad1235cee0>

As the model trains, the loss is falling and the RMSE metric is improving.

Finally, we would normally evaluate the model on the test set. Since no test set is held out here, the call is left commented out:

In [32]:
# This model only reproduces the overall ratings already stored for employees in the production database,
# so the data that would otherwise form the test set is used for training as well.
# No unseen employees are ever given to the model as input, so it does not need to generalise to them
# (this applies to unseen employees, not to unseen categories).
# Reproducing the ratings with the model avoids querying the database for the ratings of millions of employees on every customer request.
# The evaluation call below is therefore omitted during execution.

#model.evaluate(cached_test, return_dict=True)

The lower the RMSE metric, the more accurate our model is at predicting ratings.

# Testing the ranking model

Now we can test the ranking model by computing predictions for a set of employees and then ranking these employees by their predicted ratings:


In [33]:
test_ratings = {}
test_employee_ids = ["z7otfCcjH3Awwck7nsEEqQ", "QL3xkxLAe788em3_3eC4LQ", "TCtVAiGDb05PyLe3v-zXDA", "1xeRysU0YYOnpy-5_3ySag", "HTHUzTl-vDhEcbh7bZfhIg"]
for employee_id in test_employee_ids:
  test_ratings[employee_id] = model({
      "category": np.array(["Electricians"]),
      "employee_id": np.array([employee_id])
  })

print("Ratings:")
for employee_id, score in sorted(test_ratings.items(), key=lambda x: x[1], reverse=True):
  print(f"{employee_id}: {score}")

Ratings:
TCtVAiGDb05PyLe3v-zXDA: [[4.507906]]
QL3xkxLAe788em3_3eC4LQ: [[4.004672]]
1xeRysU0YYOnpy-5_3ySag: [[4.0038233]]
HTHUzTl-vDhEcbh7bZfhIg: [[3.5086079]]
z7otfCcjH3Awwck7nsEEqQ: [[3.508493]]
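
The ranking step itself is an ordinary descending sort. The sketch below repeats it on plain floats, using the predicted values copied from the output above; in the notebook, one way to get such a float from each `(1, 1)` tensor would be `float(score.numpy()[0, 0])`.

```python
# Predicted ratings copied from the notebook output above.
predicted = {
    "z7otfCcjH3Awwck7nsEEqQ": 3.508493,
    "QL3xkxLAe788em3_3eC4LQ": 4.004672,
    "TCtVAiGDb05PyLe3v-zXDA": 4.507906,
    "1xeRysU0YYOnpy-5_3ySag": 4.0038233,
    "HTHUzTl-vDhEcbh7bZfhIg": 3.5086079,
}

# Sort by predicted rating, highest first.
ranked = sorted(predicted.items(), key=lambda item: item[1], reverse=True)
ranked_ids = [employee_id for employee_id, _ in ranked]

for employee_id, score in ranked:
    print(f"{employee_id}: {score:.4f}")
```

Sorting plain floats avoids relying on comparisons between tensors and makes the output easier to format.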


# Exporting for serving

The model can be easily exported for serving:


In [34]:
tf.saved_model.save(model, "export")

We can now load it back and perform predictions:

In [35]:
loaded = tf.saved_model.load("export")

loaded({"category": np.array(["Electricians"]), "employee_id": ["DD4gTG-FeG_nneXcexJ2eg"]}).numpy()

array([[3.00802]], dtype=float32)

In [36]:
# Define the folder path for saving the model
save_dir = '/content/drive/My Drive/Colab Notebooks/Saved Models'
#save_dir = '/content/Saved Model'

# Ensure the folder exists
os.makedirs(save_dir, exist_ok=True)

# Path to save the model
model_path = os.path.join(save_dir, "recsys_model_one_ranking")

# Save the model
tf.saved_model.save(
    model,
    model_path
)

# Load the model back
loaded = tf.saved_model.load(model_path)

# Pass a category name and employee id to get rating predictions
rating = loaded({"category": np.array(["Electricians"]), "employee_id": ["DD4gTG-FeG_nneXcexJ2eg"]}).numpy()

print("Rating predictions:")
print(f"DD4gTG-FeG_nneXcexJ2eg: {rating[0]}")

Rating predictions:
DD4gTG-FeG_nneXcexJ2eg: [3.00802]
