# 👩🏻‍🔬 Offline inference pipeline: Computing item embeddings

In this notebook, you will compute the candidate embeddings and upload them to BigQuery as the `candidates` feature for the Vertex AI embedding index.

In [1]:
%load_ext autoreload
%autoreload 2

import os
import joblib
import warnings
import polars as pl

warnings.filterwarnings("ignore")

from loguru import logger
from recsys.config import settings
from recsys.gcp.bigquery import client as bq_client
from recsys.core.embeddings.computation import compute_embeddings
from recsys.gcp.feature_store.datasets import create_training_dataset
from recsys.core.embeddings.preprocessing import preprocess_candidates
from recsys.data.preprocessing.splitting import train_validation_test_split

In [2]:
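# Project root: the parent of the current working directory (assumed to be `notebooks/`)
path = os.path.dirname(os.getcwd())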
fullpath = os.path.join(path, 'data/preprocessed')

In [3]:
trans_df = pl.read_csv(f'{fullpath}/transactions.csv')
articles_df = pl.read_parquet(f'{fullpath}/articles.parquet')
customers_df = pl.read_csv(f'{fullpath}/customers.csv')

# Computing candidate embeddings

You start by computing candidate embeddings for all items in the training data.

First, you load your candidate model. Recall that you uploaded it to the Vertex AI Model Registry in previous steps; here, a local copy is loaded from disk with `joblib`:

In [4]:
model_path = os.path.join(path, 'notebooks/ranking_model/ranking')

In [5]:
candidate_model = joblib.load(model_path)

In [6]:
candidate_model

### Get candidates data

Next, you build the retrieval training data, which contains all the features required by the candidate embedding model.

In [7]:
training_data = create_training_dataset(trans_df, articles_df, customers_df)

2025-04-02 11:08:01.127 | INFO     | recsys.gcp.feature_store.datasets:create_training_dataset:30 - Joining features...


In [8]:
train_df, val_df, test_df, _, _, _ = train_validation_test_split(
    df=training_data,
    validation_size=settings.TWO_TOWER_DATASET_VALIDATION_SPLIT_SIZE,
    test_size=settings.TWO_TOWER_DATASET_TEST_SPLIT_SIZE,
)

2025-04-02 11:08:01.284 | INFO     | recsys.data.preprocessing.splitting:train_validation_test_split:314 - Split complete: train=19005 rows, validation=2389 rows, test=2405 rows

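How `train_validation_test_split` assigns rows is not shown in this notebook. One common approach is to draw a random number per row and assign the row to a split based on the configured fractions. The sketch below illustrates that idea with the standard library only; the helper `assign_splits`, its parameters, and the fractions used are assumptions for illustration, not the project's implementation:

```python
import random


def assign_splits(n_rows, validation_size, test_size, seed=42):
    """Hypothetical helper: randomly assign each row index to a split."""
    if validation_size < 0 or test_size < 0 or validation_size + test_size >= 1:
        raise ValueError("validation_size and test_size must be >= 0 and sum to < 1")

    rng = random.Random(seed)  # seeded so the split is reproducible
    splits = {"train": [], "validation": [], "test": []}
    for idx in range(n_rows):
        draw = rng.random()  # uniform in [0, 1)
        if draw < test_size:
            splits["test"].append(idx)
        elif draw < test_size + validation_size:
            splits["validation"].append(idx)
        else:
            splits["train"].append(idx)
    return splits


splits = assign_splits(n_rows=1000, validation_size=0.1, test_size=0.1)
print({name: len(rows) for name, rows in splits.items()})
```

With per-row assignment, the resulting split sizes only approximate the requested fractions; an exact split would instead shuffle the indices once and slice them.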

### Compute embeddings

Next, you compute the embeddings of all candidate items that were used to train the retrieval model.

You can recover these features from `X_train.columns` in the previous notebook:

In [9]:
features = [
    'age',
    'product_type_name',
    'product_group_name',
    'graphical_appearance_name',
    'colour_group_name',
    'perceived_colour_value_name',
    'perceived_colour_master_name',
    'department_name',
    'index_name',
    'index_group_name',
    'section_name',
    'garment_group_name',
    'month_sin',
    'month_cos',
    'article_id',
]

In [20]:
item_df = preprocess_candidates(train_df, features)
item_df.head(3)

TypeError: cannot create expression literal for value of type generator.

Hint: Pass `allow_object=True` to accept any value and create a literal of type Object.

In [14]:
embeddings_df = compute_embeddings(item_df, candidate_model)
embeddings_df.head()

ValueError: could not convert string to float: 'Dress'

# Create Vertex AI Embedding Index

Now you are ready to create a feature group for your candidate embeddings.

First, create the embedding index, specifying the name of the embeddings feature and the embedding length.
Then attach this index to the feature view (FV).
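
Because the index is declared with a fixed embedding length, it can help to confirm that every computed vector has the same length before uploading. The sketch below uses plain Python lists; the helper `embedding_length`, the feature name `embeddings`, and the `index_spec` dictionary are placeholders for illustration and are not Vertex AI API objects:

```python
def embedding_length(embeddings):
    """Return the shared length of all vectors, or raise if they differ."""
    lengths = {len(vector) for vector in embeddings}
    if len(lengths) != 1:
        raise ValueError(f"Expected one embedding length, found: {sorted(lengths)}")
    return lengths.pop()


# Toy vectors, invented purely for illustration.
toy_embeddings = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]

# Hypothetical description of the index: feature name plus vector length.
index_spec = {
    "feature_name": "embeddings",
    "dimension": embedding_length(toy_embeddings),
}
print(index_spec)
```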

In [None]:
logger.info("Uploading 'candidates' Feature to BigQuery.")
bq_client.load_features(candidates_df=embeddings_df)
logger.info("✅ Uploaded 'candidates' Feature to BigQuery!")