# 👩🏻‍🔬 Offline inference pipeline: Computing item embeddings

In this notebook you will compute the candidate (item) embeddings and upload them to BigQuery so that they can be served from the Vertex AI Feature Store through an embedding (vector) index.

In [1]:
%load_ext autoreload
%autoreload 2

import warnings

warnings.filterwarnings("ignore")

from loguru import logger
from recsys.config import settings
from recsys.gcp.vertex_ai import model_registry
from recsys.gcp.bigquery import client as bq_client
from recsys.gcp.feature_store import client as fs_client
from recsys.core.embeddings.computation import compute_embeddings
from recsys.gcp.feature_store.datasets import create_training_dataset
from recsys.core.embeddings.preprocessing import preprocess_candidates
from recsys.data.preprocessing.splitting import train_validation_test_split

## ☁️ Connect to Vertex AI Feature Online Store

In [2]:
fos = fs_client.get_client()

2025-02-20 11:54:53.675 | INFO     | recsys.gcp.feature_store.client:get_client:31 - Retrieving Feature Store from us-central1/recsys-dev-gonzo-2/recsys_feature_store_dev


In [3]:
trans_fv, articles_fv, customers_fv, _ = fs_client.get_feature_views(fos)

# Computing candidate embeddings

You start by computing candidate embeddings for all items in the training data.

First, you load your candidate model. Recall that you uploaded it to the Vertex AI Model Registry in previous steps:

In [4]:
candidate_model, candidate_features = model_registry.get_model(
    model_name="candidate_tower_v1"
)

2025-02-20 11:54:59.269 | INFO     | recsys.gcp.vertex_ai.model_registry:get_model:164 - Downloading '3333276152230838272' version gs://recsys-dev-gonzo-2-vertex-staging-us-central1/vertex_ai_auto_staging/2025-02-19-14:37:18.875
2025-02-20 11:55:06.715 | INFO     | recsys.gcp.vertex_ai.model_registry:get_model:175 - Extracted 3 input features from model


### Get candidates data

Next, you fetch the retrieval training data, which contains all the features required by the candidate embedding model.

In [5]:
training_data = create_training_dataset(trans_fv, articles_fv, customers_fv)

2025-02-20 11:55:06.770 | INFO     | recsys.gcp.feature_store.datasets:create_training_dataset:46 - Fetching transactions data...
2025-02-20 11:55:06.771 | INFO     | recsys.gcp.bigquery.client:fetch_feature_view_data:185 - Fetching data from feature view: transactions
2025-02-20 11:55:07.128 | INFO     | recsys.gcp.bigquery.client:fetch_feature_view_data:198 - Executing query: SELECT customer_id, article_id, t_dat, price, month_sin, month_cos FROM `recsys-dev-gonzo-2.recsys_dataset.recsys_transactions`
2025-02-20 11:55:11.324 | INFO     | recsys.gcp.bigquery.client:fetch_feature_view_data:201 - DataFrame shape: (23799, 6)
2025-02-20 11:55:11.325 | INFO     | recsys.gcp.feature_store.datasets:create_training_dataset:51 - Fetching cust

In [6]:
train_df, val_df, test_df, _, _, _ = train_validation_test_split(
    df=training_data,
    validation_size=settings.TWO_TOWER_DATASET_VALIDATION_SPLIT_SIZE,
    test_size=settings.TWO_TOWER_DATASET_TEST_SPLIT_SIZE,
)

2025-02-20 11:55:20.732 | INFO     | recsys.data.preprocessing.splitting:train_validation_test_split:316 - Split complete: train=19005 rows, validation=2389 rows, test=2405 rows
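
The split sizes in the log are each roughly a tenth of the 23,799 rows, but not exactly so. One way sizes like these can arise is a per-row random assignment. The `train_validation_test_split` implementation is not shown in this notebook, so the following is only a sketch under that assumption; the function name, the `0.1` fractions, and the seed are all hypothetical.

```python
import random


def random_split_sketch(rows, validation_size, test_size, seed=0):
    """Assign each row to train/validation/test with one random draw per row.

    Hypothetical stand-in, NOT the notebook's train_validation_test_split.
    Split sizes are only approximately equal to the requested fractions.
    """
    rng = random.Random(seed)
    train, val, test = [], [], []
    for row in rows:
        draw = rng.random()
        if draw < validation_size:
            val.append(row)
        elif draw < validation_size + test_size:
            test.append(row)
        else:
            train.append(row)
    return train, val, test


# Toy data: integers stand in for interaction rows.
rows = list(range(1000))
train, val, test = random_split_sketch(rows, validation_size=0.1, test_size=0.1)
```

Every row lands in exactly one split, so the three parts are disjoint and together cover the input.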


In [7]:
train_df.head(3)

| customer_id | article_id | t_dat | price | month_sin | month_cos | age | club_member_status | age_group | garment_group_name | index_group_name |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| str | str | i64 | f64 | f64 | f64 | f64 | str | str | str | str |
| "621788f7946826d475ae634d5138fd…" | "511105002" | 0 | 0.022017 | 0.5 | 0.866025 | 35.0 | "ACTIVE" | "26-35" | "Accessories" | "Ladieswear" |
| "f17d07ee3b52dc06ba23e5dbd0621a…" | "619506001" | 0 | 0.013542 | 0.5 | 0.866025 | 33.0 | "ACTIVE" | "26-35" | "Accessories" | "Sport" |
| "c7d488a1e7c4a6141e313199d41e55…" | "701973001" | 0 | 0.015237 | 0.5 | 0.866025 | 42.0 | "ACTIVE" | "36-45" | "Accessories" | "Baby/Children" |


### Compute embeddings

Next you compute the embeddings of all candidate items that were used to train the retrieval model.
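
Computing item embeddings needs one row per item rather than one row per transaction. The `preprocess_candidates` helper used in the next cell is not shown in this notebook; the following is a minimal sketch of what such a step plausibly does, assuming it keeps only the candidate feature columns and the first row seen for each `article_id`. The function name and the toy rows (including the `customer_id` values) are illustrative, not taken from the project.

```python
def preprocess_candidates_sketch(rows, feature_names, key="article_id"):
    """Keep only `feature_names` and the first row seen for each `key`.

    Hypothetical stand-in, NOT the notebook's preprocess_candidates.
    """
    seen = set()
    items = []
    for row in rows:
        if row[key] in seen:
            continue  # this item was already emitted
        seen.add(row[key])
        items.append({name: row[name] for name in feature_names})
    return items


# Toy interaction rows: two customers bought the same article.
rows = [
    {"customer_id": "c1", "article_id": "708536005",
     "garment_group_name": "Jersey Fancy", "index_group_name": "Divided"},
    {"customer_id": "c2", "article_id": "708536005",
     "garment_group_name": "Jersey Fancy", "index_group_name": "Divided"},
    {"customer_id": "c3", "article_id": "636938005",
     "garment_group_name": "Knitwear", "index_group_name": "Ladieswear"},
]
features = ["garment_group_name", "article_id", "index_group_name"]
items = preprocess_candidates_sketch(rows, features)
```

With a polars DataFrame, the same idea is usually expressed as a column `select` followed by `unique` on the key column.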

In [8]:
item_df = preprocess_candidates(train_df, candidate_features)
item_df.head(3)

| garment_group_name | article_id | index_group_name |
| --- | --- | --- |
| str | str | str |
| "Jersey Fancy" | "708536005" | "Divided" |
| "Woven/Jersey/Knitted mix Baby" | "633919004" | "Baby/Children" |
| "Knitwear" | "636938005" | "Ladieswear" |


In [9]:
embeddings_df = compute_embeddings(item_df, candidate_model)
embeddings_df.head()

| article_id | embeddings |
| --- | --- |
| i64 | list[f64] |
| 708536005 | [-0.833923, -2.762513, … 2.363503] |
| 633919004 | [0.309819, -0.272627, … 0.620053] |
| 636938005 | [-0.357415, 0.367256, … 0.365261] |
| 754370005 | [-0.71657, 0.054884, … 0.572164] |
| 697902001 | [-2.347109, -2.577516, … 2.290054] |
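
The internals of `compute_embeddings` are not shown in this notebook. As a rough illustration of the general pattern, the sketch below runs items through a model in batches and pairs each `article_id` with its vector. Everything here is an assumption: `compute_embeddings_sketch`, `batched`, and `toy_embed_batch` are hypothetical names, and the toy function merely stands in for the candidate tower so the example runs without a trained model.

```python
def batched(seq, size):
    """Yield consecutive slices of `seq` with at most `size` elements."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


def compute_embeddings_sketch(items, embed_batch, batch_size=2):
    """Embed `items` batch by batch and return article_id/embedding pairs.

    Hypothetical stand-in, NOT the notebook's compute_embeddings.
    `embed_batch` takes a list of item dicts and returns one vector per item.
    """
    results = []
    for batch in batched(items, batch_size):
        vectors = embed_batch(batch)
        for item, vector in zip(batch, vectors):
            results.append(
                {"article_id": item["article_id"], "embeddings": list(vector)}
            )
    return results


def toy_embed_batch(batch):
    """Stand-in for the candidate tower: maps string lengths to a 2-d vector."""
    return [
        [float(len(item["garment_group_name"])), float(len(item["index_group_name"]))]
        for item in batch
    ]


items = [
    {"garment_group_name": "Jersey Fancy", "article_id": "708536005",
     "index_group_name": "Divided"},
    {"garment_group_name": "Woven/Jersey/Knitted mix Baby", "article_id": "633919004",
     "index_group_name": "Baby/Children"},
    {"garment_group_name": "Knitwear", "article_id": "636938005",
     "index_group_name": "Ladieswear"},
]
embedded = compute_embeddings_sketch(items, toy_embed_batch, batch_size=2)
```

Batching keeps memory use bounded when the item catalogue is large, and the output keeps one row per item in input order.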


# Create Vertex AI Embedding Index

Now you are ready to store your candidate embeddings so that they can be indexed.

To begin with, you need to create your embedding index, where you specify the name of the embeddings feature and the embedding length (its dimension).
Then you attach this index to the feature view (FV). The cell below first uploads the `candidates` embeddings to BigQuery.
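
Because the index is configured with a single embedding length, it can be worth checking that every row has the same dimension before uploading. The sketch below does that check and then shows, with a brute-force dot-product search, the kind of query an embedding index answers. It is an illustration only: the function names and toy vectors are hypothetical, it does not call any Vertex AI API, and a managed index typically uses approximate rather than exhaustive search.

```python
def embedding_dimension(rows):
    """Return the shared embedding length, or raise if rows disagree."""
    dims = {len(row["embeddings"]) for row in rows}
    if len(dims) != 1:
        raise ValueError(f"Inconsistent embedding lengths: {sorted(dims)}")
    return dims.pop()


def top_k(query, rows, k=2):
    """Return the article_ids of the k rows with the highest dot product."""
    def score(row):
        return sum(q * x for q, x in zip(query, row["embeddings"]))

    ranked = sorted(rows, key=score, reverse=True)
    return [row["article_id"] for row in ranked[:k]]


# Toy 2-d embeddings; real candidate embeddings are much longer.
rows = [
    {"article_id": "a", "embeddings": [1.0, 0.0]},
    {"article_id": "b", "embeddings": [0.0, 1.0]},
    {"article_id": "c", "embeddings": [0.7, 0.7]},
]
dimension = embedding_dimension(rows)
nearest = top_k([1.0, 0.2], rows, k=2)
```

In a two-tower setup the query vector would come from the query (customer) tower, and the returned `article_id` values are the retrieved candidates.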

In [10]:
logger.info("Uploading 'candidates' Feature to BigQuery.")
bq_client.load_features(candidates_df=embeddings_df)
logger.info("✅ Uploaded 'candidates' Feature to BigQuery!")

2025-02-20 11:55:20.988 | INFO     | __main__:<module>:1 - Uploading 'candidates' Feature to BigQuery.
2025-02-20 11:55:20.993 | DEBUG    | recsys.gcp.bigquery.client:convert_types:60 - Converted article_id to STRING
2025-02-20 11:55:20.993 | DEBUG    | recsys.gcp.bigquery.client:convert_types:60 - Converted embeddings to FLOAT64
2025-02-20 11:55:21.002 | INFO     | recsys.core.embeddings.storage:process_for_storage:33 - Processed embeddings in embeddings
2025-02-20 11:55:21.003 | DEBUG    | recsys.gcp.bigquery.client:upload_dataframe:93 - DataFrame types before upload:
2025-02-20 11:55:21.003 | DEBUG    | recsys.gcp.bigquery.client:upload_dataframe:95