References: https://www.tensorflow.org/recommenders/examples/basic_retrieval

# **RecSys Model 1: Retrieval**

Real-world recommender systems are often composed of two stages:

1.   **The retrieval stage** is responsible for selecting an initial set of hundreds of candidates from all possible candidates. The main objective of this model is to efficiently weed out all candidates that the user is not interested in. Because the retrieval model may be dealing with millions of candidates, it has to be computationally efficient.

2.   **The ranking stage** takes the outputs of the retrieval model and fine-tunes them to select the best possible handful of recommendations. Its task is to narrow down the set of items the user may be interested in to a shortlist of likely candidates.

In this notebook, we're going to build the first stage, retrieval.

Retrieval models are often composed of two sub-models:

1.   **A query model** computing the query representation (normally a fixed-dimensionality embedding vector) using query features.

2.   **A candidate model** computing the candidate representation (an equally-sized vector) using the candidate features.

The outputs of the two models are then combined with a dot product to give a query-candidate affinity score, with higher scores expressing a better match between the candidate and the query.
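As a minimal illustration of the scoring step, the sketch below computes the affinity of two equally sized vectors as their dot product. The `affinity` helper and the vectors are made-up values for illustration only; they are not embeddings from the model trained in this notebook.

```python
def affinity(query_embedding, candidate_embedding):
    """Dot product of two equally sized embedding vectors."""
    if len(query_embedding) != len(candidate_embedding):
        raise ValueError("embeddings must have the same dimensionality")
    return sum(q * c for q, c in zip(query_embedding, candidate_embedding))


# Hypothetical 4-dimensional embeddings (illustrative values only).
query = [1.0, 0.0, 2.0, -1.0]
good_match = [0.5, 0.0, 1.0, -0.5]   # points in a similar direction to the query
poor_match = [-1.0, 3.0, 0.0, 1.0]   # points in a different direction

good_score = affinity(query, good_match)
poor_score = affinity(query, poor_match)
print(good_score, poor_score)
```

The candidate whose vector is better aligned with the query receives the higher score, which is the property the retrieval model is trained to produce.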

# Imports

In [None]:
# Temporary solution for a bug in the implementation of the tfrs.layers.factorized_top_k module.
# https://github.com/tensorflow/recommenders/issues/712#issuecomment-2041163592

!pip uninstall tensorflow -y
!pip uninstall tensorflow-recommenders -y
#!pip uninstall tensorflow-datasets -y


import os
os.environ['TF_USE_LEGACY_KERAS'] = '1'

Found existing installation: tensorflow 2.17.1
Uninstalling tensorflow-2.17.1:
  Successfully uninstalled tensorflow-2.17.1
Found existing installation: tensorflow-datasets 4.9.7
Uninstalling tensorflow-datasets-4.9.7:
  Successfully uninstalled tensorflow-datasets-4.9.7


In [None]:
!pip install -q tensorflow==2.17
!pip install -q tensorflow-recommenders==0.7.3

#!pip install -q --upgrade tensorflow-datasets
!pip install -q scann

In [None]:
import pprint
import tempfile

from typing import Dict, Text

import numpy as np
import tensorflow as tf
#import tensorflow_datasets as tfds

#import json
import pandas as pd
from google.colab import drive

In [None]:
import tensorflow_recommenders as tfrs

In [None]:
print(tf.__version__)

2.17.0


In [None]:
print(tfrs.__version__)

v0.7.3


# Importing and preprocessing the dataset

In [None]:
drive.mount('/content/drive')

Mounted at /content/drive


In [None]:
JSON_FILE = '/content/drive/My Drive/yelp_academic_dataset_business.json'

In [None]:
# Define the number of lines to read
#n_lines = 25000

# Read the specified number of lines into a list of dictionaries
#with open(JSON_FILE, "r") as file:
#    data = [json.loads(next(file)) for _ in range(n_lines)]

# Read the JSON lines file directly into a pandas DataFrame
df = pd.read_json(JSON_FILE, lines=True)

# Convert the list of dictionaries into a DataFrame
#df = pd.DataFrame(data)

# Display the first few rows
print(df.head())

              business_id                      name  \
0  Pns2l4eNsfO8kk83dixA6A  Abby Rappoport, LAC, CMQ   
1  mpf3x-BjTdTEA3yCZrAYPw             The UPS Store   
2  tUFrWirKiKi_TAnsVWINQQ                    Target   
3  MTSW4McQd7CbVtyjqoe9mw        St Honore Pastries   
4  mWMc6_wTdE0EUBKIGXDVfA  Perkiomen Valley Brewery   

                           address           city state postal_code  \
0           1616 Chapala St, Ste 2  Santa Barbara    CA       93101   
1  87 Grasso Plaza Shopping Center         Affton    MO       63123   
2             5255 E Broadway Blvd         Tucson    AZ       85711   
3                      935 Race St   Philadelphia    PA       19107   
4                    101 Walnut St     Green Lane    PA       18054   

    latitude   longitude  stars  review_count  is_open  \
0  34.426679 -119.711197    5.0             7        0   
1  38.551126  -90.335695    3.0            15        1   
2  32.223236 -110.880452    3.5            22        0   
3  39.9555

In [None]:
print(len(df)) # total number of entries

150346


In [None]:
# Filter rows where 'categories' is not null
df = df[df['categories'].notnull()]

# Select specific columns
df = df[['categories', 'business_id', 'stars']]

# Display the result
print(df.head())

                                          categories             business_id  \
0  Doctors, Traditional Chinese Medicine, Naturop...  Pns2l4eNsfO8kk83dixA6A   
1  Shipping Centers, Local Services, Notaries, Ma...  mpf3x-BjTdTEA3yCZrAYPw   
2  Department Stores, Shopping, Fashion, Home & G...  tUFrWirKiKi_TAnsVWINQQ   
3  Restaurants, Food, Bubble Tea, Coffee & Tea, B...  MTSW4McQd7CbVtyjqoe9mw   
4                          Brewpubs, Breweries, Food  mWMc6_wTdE0EUBKIGXDVfA   

   stars  
0    5.0  
1    3.0  
2    3.5  
3    4.0  
4    4.5  


In [None]:
print(len(df))  # number of entries after removing the 103 rows where 'categories' is null

150243


In [None]:
# Split 'categories' into a list of categories
df['categories'] = df['categories'].str.split(', ')

# Use explode to create a row for each category
df = df.explode('categories').reset_index(drop=True)

# Rename columns
df = df.rename(columns={'categories': 'category', 'business_id': 'employee_id', 'stars': 'overall_star'})

# Display the result
print(df.head())

                       category             employee_id  overall_star
0                       Doctors  Pns2l4eNsfO8kk83dixA6A           5.0
1  Traditional Chinese Medicine  Pns2l4eNsfO8kk83dixA6A           5.0
2         Naturopathic/Holistic  Pns2l4eNsfO8kk83dixA6A           5.0
3                   Acupuncture  Pns2l4eNsfO8kk83dixA6A           5.0
4              Health & Medical  Pns2l4eNsfO8kk83dixA6A           5.0


In [None]:
print(len(df)) # total number of entries after splitting 'categories'

668592


In [None]:
# Create TensorFlow Dataset using tf.data
tf_dataset = tf.data.Dataset.from_tensor_slices((
    {'category': df['category'].astype(str).values,      # Ensure conversion to strings
    'employee_id': df['employee_id'].astype(str).values,   # Ensure conversion to strings
    'overall_star': df['overall_star'].astype(float).values}  # Ensure conversion to floats
))

In [None]:
# Displaying a sample from the TensorFlow Dataset using pprint
for x in tf_dataset.take(1).as_numpy_iterator():
    pprint.pprint(x)

{'category': b'Doctors',
 'employee_id': b'Pns2l4eNsfO8kk83dixA6A',
 'overall_star': 5.0}


Let's figure out **unique employee ids** and **categories** present in the data.

This is important because we **need to be able to map the raw values of our categorical features to embedding vectors** in our models. To do that, we **need a vocabulary that maps a raw feature value to an integer in a contiguous range**: *this allows us to look up the corresponding embeddings in our embedding tables*.
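To make the idea of a vocabulary concrete, here is a small pure-Python sketch. It assumes the convention that index 0 is reserved for unknown (out-of-vocabulary) values and known values are numbered from 1; `build_vocabulary` and `lookup` are hypothetical helpers written for this illustration, not part of TensorFlow.

```python
def build_vocabulary(values):
    """Map each distinct raw value to an integer in 1..N; 0 is reserved for unknown values."""
    return {value: index for index, value in enumerate(sorted(set(values)), start=1)}


def lookup(vocabulary, value):
    """Return the integer index for a value, or 0 if it is out of vocabulary."""
    return vocabulary.get(value, 0)


raw_categories = ["Doctors", "Acupuncture", "Doctors", "Brewpubs"]
vocabulary = build_vocabulary(raw_categories)

# The last value was never seen when the vocabulary was built.
indices = [lookup(vocabulary, c) for c in raw_categories + ["Unknown Category"]]
print(vocabulary)
print(indices)
```

Because the indices form a contiguous range, each one can be used directly as a row number in an embedding table with one extra row for unknown values.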

In [None]:
# Extracting & processing data to build vocabularies (for query and candidate towers)

employees = tf_dataset.map(lambda x: x["employee_id"])
categories = tf_dataset.map(lambda x: x["category"])

employee_ids = employees.batch(1_000)
category_names = categories.batch(1_000)

unique_employee_ids = np.unique(np.concatenate(list(employee_ids))) # vocabulary for the candidate tower
unique_category_names = np.unique(np.concatenate(list(category_names))) # vocabulary for the query tower

In [None]:
unique_employee_ids[:10]

array([b'---kPU91CF4Lq2-WlRu9Lw', b'--0iUa4sNDFiZFrAdIWhZQ',
       b'--30_8IhuyMHbSOcNWd6DQ', b'--7PUidqRWpRSpXebiyxTg',
       b'--7jw19RH9JKXgFohspgQw', b'--8IbOsAAxjKRoYsBFL-PA',
       b'--9osgUCSDUWUkoTLdvYhQ', b'--ARBQr1WMsTWiwOKOj-FQ',
       b'--FWWsIwxRwuw9vIMImcQg', b'--FcbSxK1AoEtEAxOgBaCw'], dtype=object)

In [None]:
print(len(unique_employee_ids))

150243


In [None]:
unique_category_names[:10]

array([b'& Probates', b'3D Printing', b'ATV Rentals/Tours', b'Acai Bowls',
       b'Accessories', b'Accountants', b'Acne Treatment', b'Active Life',
       b'Acupuncture', b'Addiction Medicine'], dtype=object)

In [None]:
print(len(unique_category_names))

1311


In [None]:
# Data to train/test the model
tf_dataset = tf_dataset.map(lambda x: {
    "employee_id": x["employee_id"],
    "category": x["category"],
})

In [None]:
# Split data into a training and evaluation set

tf.random.set_seed(42)
shuffled = tf_dataset.shuffle(100_000, seed=42, reshuffle_each_iteration=False)

# Since this model only creates a retrieval index, it is acceptable to train on the would-be test data as well, so that it is indexed too.
# Because no unseen data/queries are given as input to this model under any circumstance, the model doesn't need to generalise to unseen data.
# Therefore, the following code snippets that create train and test splits are omitted during execution.

#trainset_size = round(len(shuffled) * 0.8)
#testset_size = round(len(shuffled) * 0.2)

#train = shuffled.take(trainset_size)
#test = shuffled.skip(trainset_size).take(testset_size)

# Implementing a model
Because this is a two-tower retrieval model, we can build each tower separately and then combine them in the final model.

In [None]:
# The dimensionality of the query and candidate representations
embedding_dimension = 32

## The query tower
A query model computing the query representation (normally a fixed-dimensionality embedding vector) using query features.


Use Keras preprocessing layers to first convert category names to integers, and then convert those to category name embeddings via an `Embedding` layer. Note that we use the list of unique category names we computed earlier as a vocabulary.

In [None]:
category_model = tf.keras.Sequential([
  tf.keras.layers.StringLookup(
      vocabulary=unique_category_names, mask_token=None),
  # We add one extra embedding row to handle unknown, i.e. out-of-vocabulary (OOV), tokens.
  tf.keras.layers.Embedding(len(unique_category_names) + 1, embedding_dimension)
])

## The candidate tower
A candidate model computing the candidate representation (an equally-sized vector) using the candidate features.

Use Keras preprocessing layers to first convert employee ids to integers, and then convert those to employee id embeddings via an `Embedding` layer. Note that we use the list of unique employee ids we computed earlier as a vocabulary.

In [None]:
employee_model = tf.keras.Sequential([
  tf.keras.layers.StringLookup(
      vocabulary=unique_employee_ids, mask_token=None),
  # We add one extra embedding row to handle unknown, i.e. out-of-vocabulary (OOV), tokens.
  tf.keras.layers.Embedding(len(unique_employee_ids) + 1, embedding_dimension)
])

## Metrics

In our training data we have positive (category, employee) pairs. To figure out how good our model is, we need to compare the affinity score that the model calculates for this pair to the scores of all the other possible candidates: if the score for the positive pair is higher than for all other candidates, our model is highly accurate.
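The comparison described above can be sketched in a few lines of plain Python. The helpers `in_top_k` and `top_k_accuracy` and the score lists are hypothetical and only illustrate the idea; the TFRS metric works on embeddings and streams over the full candidate set.

```python
def in_top_k(scores, positive_index, k):
    """Return True if the positive candidate's score is among the k highest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return positive_index in ranked[:k]


def top_k_accuracy(examples, k):
    """Fraction of (scores, positive_index) examples whose positive lands in the top k."""
    hits = sum(in_top_k(scores, positive, k) for scores, positive in examples)
    return hits / len(examples)


# Hypothetical affinity scores of one query against five candidates each,
# paired with the index of the true positive candidate.
examples = [
    ([0.9, 0.1, 0.3, 0.2, 0.0], 0),  # positive ranked 1st
    ([0.2, 0.8, 0.5, 0.1, 0.4], 2),  # positive ranked 2nd
    ([0.7, 0.6, 0.5, 0.4, 0.1], 4),  # positive ranked last
    ([0.3, 0.9, 0.2, 0.8, 0.1], 0),  # positive ranked 3rd
]

top_1 = top_k_accuracy(examples, 1)
top_3 = top_k_accuracy(examples, 3)
print(top_1, top_3)
```

Larger values of k are more forgiving: here only one positive is ranked first, but three of the four fall within the top three.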

To do this, we can use the `tfrs.metrics.FactorizedTopK` metric. The metric has one required argument: the dataset of candidates that are used as implicit negatives for evaluation.

In our case, that's the employee ids dataset, converted into embeddings via our employee model:

In [None]:
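metrics = tfrs.metrics.FactorizedTopK(
  # Use each unique employee id once; `employees` repeats an id for every category it has.
  candidates=tf.data.Dataset.from_tensor_slices(unique_employee_ids).batch(128).map(employee_model)
)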

## Loss

The next component is the loss used to train our model. TFRS has several loss layers and tasks to make this easy.

In this instance, we'll make use of the Retrieval task object: a convenience wrapper that bundles together the loss function and metric computation:

In [None]:
task = tfrs.tasks.Retrieval(
  metrics=metrics
)

The task itself is a Keras layer that takes the query and candidate embeddings as arguments, and returns the computed loss: we'll use that to implement the model's training loop.
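A rough sketch of the kind of loss such a task computes is shown below. It assumes the default behaviour of `tfrs.tasks.Retrieval`, namely a softmax cross-entropy over in-batch candidates summed over the batch: each query is scored against every candidate in the batch, and the candidate on the same row is treated as the positive. The function name and the toy embeddings are illustrative, and the real layer has further options that are not modelled here.

```python
import math


def in_batch_softmax_loss(query_embeddings, candidate_embeddings):
    """Sum of cross-entropies of each query's own candidate against all in-batch candidates."""
    total = 0.0
    for row, query in enumerate(query_embeddings):
        # Affinity of this query with every candidate in the batch.
        logits = [sum(q * c for q, c in zip(query, cand)) for cand in candidate_embeddings]
        # Numerically stable log-sum-exp of the logits.
        peak = max(logits)
        log_norm = peak + math.log(sum(math.exp(l - peak) for l in logits))
        # Negative log-probability assigned to the positive (same-row) candidate.
        total += log_norm - logits[row]
    return total


# Toy batch of two (query, candidate) pairs with 2-dimensional embeddings.
queries = [[2.0, 0.0], [0.0, 2.0]]
aligned_candidates = [[2.0, 0.0], [0.0, 2.0]]   # each query matches its own candidate
swapped_candidates = [[0.0, 2.0], [2.0, 0.0]]   # each query matches the other row's candidate

low_loss = in_batch_softmax_loss(queries, aligned_candidates)
high_loss = in_batch_softmax_loss(queries, swapped_candidates)
print(low_loss, high_loss)
```

The loss is small when each query scores its own candidate above the other candidates in the batch, and large when the pairing is wrong.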

## The full model

We can now put it all together into a model. TFRS exposes a base model class (`tfrs.models.Model`) which streamlines building models: all we need to do is to set up the components in the `__init__` method, and implement the `compute_loss` method, taking in the raw features and returning a loss value.

The base model will then take care of creating the appropriate training loop to fit our model.

In [None]:
class YelpModel(tfrs.Model):

  def __init__(self, category_model, employee_model):
    super().__init__()
    self.employee_model: tf.keras.Model = employee_model
    self.category_model: tf.keras.Model = category_model
    self.task: tf.keras.layers.Layer = task

  def compute_loss(self, features: Dict[Text, tf.Tensor], training=False) -> tf.Tensor:
    # We pick out the category features and pass them into the category model.
    category_embeddings = self.category_model(features["category"])
    # And pick out the employee features and pass them into the employee model,
    # getting embeddings back.
    positive_employee_embeddings = self.employee_model(features["employee_id"])

    if training:
      # The task computes the loss and not the metrics during training to speed up the process.
      return self.task(category_embeddings, positive_employee_embeddings, compute_metrics=False)


    # The task computes the loss and the metrics.
    return self.task(category_embeddings, positive_employee_embeddings)

The `tfrs.Model` base class is simply a convenience class: it allows us to compute both training and test losses using the same method.

# Fitting and evaluating

After defining the model, we can use standard Keras fitting and evaluation routines to fit and evaluate the model.

Let's first instantiate the model.

In [None]:
model = YelpModel(category_model, employee_model)
model.compile(optimizer=tf.keras.optimizers.Adagrad(learning_rate=0.1))

Then shuffle, batch, and cache the training data:

In [None]:
# Since this model only creates a retrieval index, it is acceptable to train on the would-be test data as well, so that it is indexed too.
# Because no unseen data/queries are given as input to this model under any circumstance, the model doesn't need to generalise to unseen data.
# Therefore, the following code snippets that create train and test splits are omitted during execution.

#cached_train = train.shuffle(100_000).batch(8192).cache()
#cached_test = test.batch(4096).cache()

cached_train = shuffled.shuffle(100_000).batch(8192).cache()

Then train the model:

In [None]:
model.fit(cached_train, epochs=200)

Epoch 1/200
Epoch 2/200
...
Epoch 77/200
Epoch 78

<tf_keras.src.callbacks.History at 0x7fa66bee6e30>

As the model trains, the loss falls. The task also maintains a set of top-k retrieval metrics, which tell us whether the true positive is in the top-k retrieved items from the entire candidate set. For example, a top-5 categorical accuracy metric of 0.2 would tell us that, on average, the true positive is in the top 5 retrieved items 20% of the time.

Note that, in this notebook, metric computation is switched off during training (`compute_metrics=False` in `compute_loss`) because it can be quite slow with large candidate sets; the metrics are only computed during evaluation.

Finally, we would normally evaluate our model on the test set. Because no test split was created here, the call is left commented out:

In [None]:
# Since this model only creates a retrieval index, it is acceptable to train on the would-be test data as well, so that it is indexed too.
# Because no unseen data/queries are given as input to this model under any circumstance, the model doesn't need to generalise to unseen data.
# Therefore, the following code snippet that tests the model is omitted during execution.


#model.evaluate(cached_test, return_dict=True)

## Making predictions

Now that we have a model, we would like to be able to make predictions. We can use the `tfrs.layers.factorized_top_k.BruteForce` layer to do this.
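Conceptually, brute-force retrieval scores the query against every candidate and keeps the k best. The sketch below shows that idea in plain Python; `brute_force_top_k`, the candidate ids, and the embeddings are hypothetical and stand in for the real layer, which works on batched tensors.

```python
import heapq


def brute_force_top_k(query_embedding, candidates, k):
    """Score every candidate against the query and return the k best (score, id) pairs."""
    scored = (
        (sum(q * c for q, c in zip(query_embedding, embedding)), candidate_id)
        for candidate_id, embedding in candidates.items()
    )
    # nlargest returns the pairs sorted from highest to lowest score.
    return heapq.nlargest(k, scored)


# Hypothetical candidate embeddings keyed by id (illustrative values only).
candidates = {
    "emp_a": [1.0, 0.0],
    "emp_b": [0.5, 0.5],
    "emp_c": [0.0, 1.0],
    "emp_d": [-1.0, 0.0],
}

top_two = brute_force_top_k([2.0, 1.0], candidates, k=2)
print(top_two)
```

Every query touches every candidate, so the cost grows linearly with the number of candidates.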

In [None]:
unique_employee_ids = tf.constant(unique_employee_ids)  # Convert to Tensor to make the data (numpy array) ready for subsequent TensorFlow operations
unique_employee_ids = tf.data.Dataset.from_tensor_slices(unique_employee_ids)  # Convert the tensor into a Dataset

In [None]:
# Create a model that takes in raw query features and
# recommends employees out of the entire unique employee dataset.
index = tfrs.layers.factorized_top_k.BruteForce(model.category_model, k=1000)
index.index_from_dataset(
  tf.data.Dataset.zip((unique_employee_ids.batch(1000), unique_employee_ids.batch(1000).map(model.employee_model)))
)

# Get recommendations.
_, employee_ids = index(tf.constant(["Electricians"]), k=20)
print(f"Recommendations for category 'Electricians': {employee_ids[0, :5]}")

Recommendations for category 'Electricians': [b'z7otfCcjH3Awwck7nsEEqQ' b'QL3xkxLAe788em3_3eC4LQ'
 b'TCtVAiGDb05PyLe3v-zXDA' b'1xeRysU0YYOnpy-5_3ySag'
 b'HTHUzTl-vDhEcbh7bZfhIg']


In [None]:
print(len(employee_ids[0]))

20


Of course, the BruteForce layer is going to be too slow to serve a model with many possible candidates. The following section shows how to speed this up by using an approximate retrieval index.

An approximate retrieval index speeds up predictions, making it possible to efficiently surface recommendations from sets of tens of millions of candidates.

To do so, we can use the `scann` package. This is an optional dependency of TFRS, and we installed it separately at the beginning of this notebook by calling `!pip install -q scann`.

Once installed we can use the TFRS `ScaNN` layer:

In [None]:
# Create a model that takes in raw query features and
# recommends employees out of the entire unique employee dataset.
scann_index = tfrs.layers.factorized_top_k.ScaNN(model.category_model, k=1000)
scann_index.index_from_dataset(
  tf.data.Dataset.zip((unique_employee_ids.batch(1000), unique_employee_ids.batch(1000).map(model.employee_model)))
)

<tensorflow_recommenders.layers.factorized_top_k.ScaNN at 0x7fa66bf46110>

This layer will perform approximate lookups: this makes retrieval slightly less accurate, but orders of magnitude faster on large candidate sets.

In [None]:
# Get recommendations.
_, employee_ids = scann_index(tf.constant(["Electricians"]), k=20)
print(f"Recommendations for category 'Electricians': {employee_ids[0, :5]}")

Recommendations for category 'Electricians': [b'TCtVAiGDb05PyLe3v-zXDA' b'HTHUzTl-vDhEcbh7bZfhIg'
 b'QBVZcOmWi-dK4HOcmnNrLg' b'z7otfCcjH3Awwck7nsEEqQ'
 b'DD4gTG-FeG_nneXcexJ2eg']


In [None]:
print(len(employee_ids[0]))

20
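One simple way to look at the accuracy trade-off is to measure how much of the exact top-k list the approximate index recovers. The sketch below does this for the two top-5 lists printed above for the 'Electricians' query. `overlap_at_k` is a hypothetical helper, not part of TFRS, and five ids from a single query are far too few to draw conclusions about the index in general.

```python
def overlap_at_k(exact_ids, approximate_ids, k):
    """Fraction of the exact top-k ids that also appear in the approximate top-k."""
    exact = set(exact_ids[:k])
    approximate = set(approximate_ids[:k])
    return len(exact & approximate) / k


# Top-5 ids copied from the BruteForce output earlier in this notebook.
brute_force_top5 = [
    "z7otfCcjH3Awwck7nsEEqQ", "QL3xkxLAe788em3_3eC4LQ", "TCtVAiGDb05PyLe3v-zXDA",
    "1xeRysU0YYOnpy-5_3ySag", "HTHUzTl-vDhEcbh7bZfhIg",
]
# Top-5 ids copied from the ScaNN output earlier in this notebook.
scann_top5 = [
    "TCtVAiGDb05PyLe3v-zXDA", "HTHUzTl-vDhEcbh7bZfhIg", "QBVZcOmWi-dK4HOcmnNrLg",
    "z7otfCcjH3Awwck7nsEEqQ", "DD4gTG-FeG_nneXcexJ2eg",
]

overlap = overlap_at_k(brute_force_top5, scann_top5, k=5)
print(overlap)
```

For this query, three of the five exact results also appear in the approximate top 5, in a different order.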


# Model serving

After the model is trained, we need a way to deploy it.

In a two-tower retrieval model, serving has two components:


*   **a serving query model**, taking in features of the query and transforming them into a query embedding, and
*   **a serving candidate model**. This most often takes the form of an approximate nearest neighbours (ANN) index which allows fast approximate lookup of candidates in response to a query produced by the query model.


In TFRS, both components can be packaged into a single exportable model, giving us a model that takes the raw category names and returns the ids of top/most similar employees for that category. This is done via exporting the model to a `SavedModel` format, which makes it possible to serve using TensorFlow Serving.

To deploy a model like this, we simply export the `BruteForce` layer and/or `ScaNN` layer we created above:

In [None]:
# Export the query model.
with tempfile.TemporaryDirectory() as tmp:
  path = os.path.join(tmp, "model")

  # Save the index.
  tf.saved_model.save(
      scann_index,
      path,
      options=tf.saved_model.SaveOptions(namespace_whitelist=["Scann"])
  )

  # Load it back; can also be done in TensorFlow Serving.
  loaded = tf.saved_model.load(path)

  # Pass a category name in, get top predicted employee ids back.
  scores, employee_ids = loaded(tf.constant(["Electricians"]))

  print(f"Recommendations for category 'Electricians': {employee_ids[0][:5]}")



Recommendations for category 'Electricians': [b'TCtVAiGDb05PyLe3v-zXDA' b'HTHUzTl-vDhEcbh7bZfhIg'
 b'QBVZcOmWi-dK4HOcmnNrLg' b'z7otfCcjH3Awwck7nsEEqQ'
 b'DD4gTG-FeG_nneXcexJ2eg']


In [None]:
print(len(employee_ids[0]))

1000


In [None]:
# Define the folder path for saving the model
save_dir = '/content/drive/My Drive/Colab Notebooks/Saved Models'
#save_dir = '/content/Saved Model'

# Ensure the folder exists
os.makedirs(save_dir, exist_ok=True)

# Path to save the model
model_path = os.path.join(save_dir, "recsys_model_one_retrieval")

# Save the ScaNN index
tf.saved_model.save(
    scann_index,
    model_path,
    options=tf.saved_model.SaveOptions(namespace_whitelist=["Scann"])
)

# Load the model back
loaded = tf.saved_model.load(model_path)

# Pass a category name and get top recommendations
scores, employee_ids = loaded(tf.constant(["Electricians"]))

print(f"Recommendations for category 'Electricians': {employee_ids[0][:5]}")



Recommendations for category 'Electricians': [b'TCtVAiGDb05PyLe3v-zXDA' b'HTHUzTl-vDhEcbh7bZfhIg'
 b'QBVZcOmWi-dK4HOcmnNrLg' b'z7otfCcjH3Awwck7nsEEqQ'
 b'DD4gTG-FeG_nneXcexJ2eg']


In [None]:
print(len(employee_ids[0]))

1000
