## <span style="color:#ff5f27">👨🏻‍🏫 Build Index </span>

In this notebook you will create a feature group for your candidate embeddings.

In [1]:
import time

# Start the timer
notebook_start_time = time.time()

## <span style="color:#ff5f27">📝 Imports </span>

In [2]:
import tensorflow as tf
import pprint
import numpy as np
import pandas as pd

import warnings
warnings.filterwarnings('ignore')

## <span style="color:#ff5f27">🔮 Connect to Hopsworks Feature Store </span>

In [3]:
import hopsworks

# Reads the API key from the HOPSWORKS_API_KEY environment variable (or prompts for it);
# never hard-code credentials in a shared notebook.
project = hopsworks.login()

fs = project.get_feature_store()
mr = project.get_model_registry()

2025-03-15 22:02:16,491 INFO: Initializing external client
2025-03-15 22:02:16,491 INFO: Base URL: https://c.app.hopsworks.ai:443
2025-03-15 22:02:24,822 INFO: Python Engine initialized.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/1218722


## <span style="color:#ff5f27">🎯 Compute Candidate Embeddings </span>

You start by computing candidate embeddings for all items in the training data.

First, load your candidate model. You uploaded it to the Hopsworks Model Registry in the previous notebook; if you don't have a local copy, download it from the registry with the following code:

In [4]:
model = mr.get_model(
    name="candidate_model",
    version=1,
)
model_path = model.download()

Downloading model artifact (2 dirs, 4 files)... DONE

If you already have the model saved locally you can simply replace `model_path` with the path to your model.

In [5]:
candidate_model = tf.saved_model.load(model_path)

Next, retrieve the training data so that you can compute embeddings for all candidate items that were used to train the retrieval model.

In [6]:
feature_view = model.get_feature_view()

In [7]:
train_df, val_df, test_df, _, _, _ = feature_view.train_validation_test_split(
    validation_size=0.1, 
    test_size=0.1,
    description='Retrieval dataset splits',
)

Finished: Reading data from Hopsworks, using Hopsworks Feature Query Service (7.36s) 




In [8]:
train_df.head(3)

| | customer_id | article_id | t_dat | price | month_sin | month_cos | customers_age | customers_club_member_status | customers_age_group | articles_garment_group_name | articles_index_group_name |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | f7048acb8188d98bde3a5c495475a3c86faafe0eede1f2... | 670265002 | 1540252800000 | 0.013542 | -0.8660254 | 0.5 | 48.0 | ACTIVE | 46-55 | Under-, Nightwear | Ladieswear |
| 1 | 5d34f84e6cbe9ec4706872bb65376097af1e53f0c7dac5... | 751471035 | 1593475200000 | 0.033881 | 1.224647e-16 | -1.0 | 30.0 | ACTIVE | 26-35 | Trousers | Ladieswear |
| 2 | baf6dc7ea8575732794751bb80824fe84fd40e6af86193... | 719308002 | 1558137600000 | 0.059305 | 0.5 | -0.866025 | 48.0 | ACTIVE | 46-55 | Dresses Ladies | Divided |


In [9]:
# Get the list of input features for the candidate model
candidate_features = [*candidate_model.signatures['serving_default'].structured_input_signature[-1].keys()]

# Select the candidate features and drop duplicate rows based on 'article_id'
# to get one row per unique candidate item (chained to avoid modifying a slice in place)
item_df = train_df[candidate_features].drop_duplicates(subset="article_id")

item_df.head(3)

| | articles_garment_group_name | articles_index_group_name | article_id |
|---|---|---|---|
| 0 | Under-, Nightwear | Ladieswear | 670265002 |
| 1 | Trousers | Ladieswear | 751471035 |
| 2 | Dresses Ladies | Divided | 719308002 |
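
The cell above relies on two details that are easy to miss. `structured_input_signature` is an `(args, kwargs)` pair, so `[-1]` picks the keyword-argument dictionary whose keys are the model's input names. And `drop_duplicates(subset=...)` keeps the first row seen for each `article_id`. Here is a minimal pandas-only sketch of the same pattern; the `fake_signature` tuple and the toy rows are made-up stand-ins, not values from the real model or dataset.

```python
import pandas as pd

# Hypothetical stand-in for
# candidate_model.signatures['serving_default'].structured_input_signature,
# which is an (args, kwargs) pair; the kwargs dict is keyed by input name.
fake_signature = (
    (),
    {
        "articles_garment_group_name": None,
        "articles_index_group_name": None,
        "article_id": None,
    },
)
feature_names = list(fake_signature[-1].keys())

# Toy transactions (made up): article 670265002 was bought twice.
toy_train_df = pd.DataFrame({
    "customer_id": ["a", "b", "c"],
    "article_id": [670265002, 751471035, 670265002],
    "articles_garment_group_name": ["Under-, Nightwear", "Trousers", "Under-, Nightwear"],
    "articles_index_group_name": ["Ladieswear", "Ladieswear", "Ladieswear"],
})

# Select the model inputs, then keep one row per article.
toy_item_df = toy_train_df[feature_names].drop_duplicates(subset="article_id")
print(toy_item_df)
```

Chaining the selection and the de-duplication returns a new DataFrame instead of mutating a slice of the training data.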


In [10]:
# Create a TensorFlow dataset from the item DataFrame
item_ds = tf.data.Dataset.from_tensor_slices(
    {col: item_df[col] for col in item_df})

# Compute embeddings for all candidate items using the candidate_model
candidate_embeddings = item_ds.batch(2048).map(
    lambda x: (x["article_id"], candidate_model(x))
)

> Strictly speaking, you haven't actually computed the candidate embeddings yet: `tf.data` transformations are lazily evaluated, so the model only runs when the dataset is iterated.
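
As a rough plain-Python analogy (this is not TensorFlow code), the built-in `map` is lazy in the same sense: nothing runs until something iterates over the result. The `fake_embed` function and its call log below are hypothetical stand-ins for the candidate model.

```python
call_log = []

def fake_embed(article_id):
    """Hypothetical stand-in for the candidate model; records each invocation."""
    call_log.append(article_id)
    return [article_id % 7, article_id % 11]

article_ids = [670265002, 751471035, 719308002]

# Defining the pipeline does not call fake_embed.
lazy_embeddings = map(fake_embed, article_ids)
calls_after_definition = len(call_log)

# Iterating (here via list) is what triggers the computation.
embeddings = list(lazy_embeddings)
calls_after_iteration = len(call_log)

print(calls_after_definition, calls_after_iteration)  # prints: 0 3
```

One difference worth keeping in mind: a `tf.data.Dataset` can be iterated more than once, and each pass re-runs the mapped function, so it pays to collect everything you need in a single loop.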

## <span style="color:#ff5f27">⚙️ Data Preparation </span>


In [11]:
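# Iterate over the dataset once, so the model runs a single time per batch,
# collecting the article IDs and embeddings of every batch
article_id_batches, embedding_batches = [], []
for article_ids, embeddings in candidate_embeddings:
    article_id_batches.append(article_ids)
    embedding_batches.append(embeddings)

# Concatenate all article IDs and embeddings
all_article_ids = tf.concat(article_id_batches, axis=0)
all_embeddings = tf.concat(embedding_batches, axis=0)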

# Convert tensors to numpy arrays
all_article_ids_np = all_article_ids.numpy().astype(int)
all_embeddings_np = all_embeddings.numpy()

# Convert numpy arrays to lists
items_ids_list = all_article_ids_np.tolist()
embeddings_list = all_embeddings_np.tolist()

In [12]:
# Create a DataFrame
data_emb = pd.DataFrame({
    'article_id': items_ids_list, 
    'embeddings': embeddings_list,
})

data_emb.head()

| | article_id | embeddings |
|---|---|---|
| 0 | 670265002 | [0.6440691947937012, -0.45159974694252014, 0.1... |
| 1 | 751471035 | [0.45408013463020325, -0.40176618099212646, 0.... |
| 2 | 719308002 | [0.33983850479125977, 0.05192527174949646, 0.3... |
| 3 | 759231002 | [0.9627549648284912, -0.706524133682251, 1.202... |
| 4 | 746518001 | [-0.14691144227981567, 0.3576121926307678, 0.5... |
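
The conversion in the two cells above boils down to turning a 2-D array of shape `(n_items, embedding_dim)` into one Python list per row, so that each DataFrame cell holds a full vector. A minimal sketch with made-up numbers (the IDs and values below are illustrative, not real model output):

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for all_article_ids_np and all_embeddings_np.
toy_ids = np.array([101, 102, 103])
toy_vectors = np.array([
    [0.1, 0.2, 0.3, 0.4],
    [0.5, 0.6, 0.7, 0.8],
    [0.9, 1.0, 1.1, 1.2],
], dtype=np.float32)

# .tolist() on a 2-D array yields a list of rows, one per item.
toy_emb_df = pd.DataFrame({
    "article_id": toy_ids.tolist(),
    "embeddings": toy_vectors.tolist(),
})

# Every row should carry a vector of the same length.
lengths = toy_emb_df["embeddings"].map(len)
assert lengths.nunique() == 1
print(toy_emb_df.shape, lengths.iloc[0])
```

Checking that all vectors share one length is a cheap safeguard, since the embedding dimension is later read from the first row only.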


## <span style="color:#ff5f27">🪄 Feature Group Creation </span>

Now you are ready to create a feature group for your candidate embeddings.

First, create an Embedding Index that specifies the name of the embeddings feature and the length (dimension) of each embedding vector. Then attach this index to the feature group when you create it.

In [13]:
from hsfs import embedding

# Create the Embedding Index
embedding_index = embedding.EmbeddingIndex()

embedding_index.add_embedding(
    "embeddings",                           # Embeddings feature name
    len(data_emb["embeddings"].iloc[0]),    # Embeddings length
)
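
To give some intuition for what an embedding index is for: once the vectors are indexed, you can ask which items are closest to a given query vector. The sketch below answers that question by brute force with NumPy on made-up 2-D vectors. It is only a conceptual illustration, not how the feature store implements search; a production index typically uses approximate nearest-neighbour search rather than scoring every item. The helper name `top_k_neighbors`, the choice of Euclidean distance, and all values are assumptions made for this example.

```python
import numpy as np

def top_k_neighbors(query, item_ids, item_vectors, k=2):
    """Return the k item IDs whose vectors are closest (Euclidean) to query."""
    distances = np.linalg.norm(item_vectors - query, axis=1)
    nearest = np.argsort(distances)[:k]
    return [int(item_ids[i]) for i in nearest]

# Made-up item IDs and 2-D embeddings.
toy_ids = np.array([101, 102, 103, 104])
toy_vectors = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 1.0],
    [5.0, 5.0],
])

query = np.array([0.9, 0.1])
print(top_k_neighbors(query, toy_ids, toy_vectors, k=2))  # prints: [102, 101]
```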

In [14]:
# Get or create the 'candidate_embeddings_fg' feature group
candidate_embeddings_fg = fs.get_or_create_feature_group(
    name="candidate_embeddings_fg",
    embedding_index=embedding_index,  # Specify the Embedding Index
    primary_key=['article_id'],
    version=1,
    description='Embeddings for each article',
    online_enabled=True,
)

candidate_embeddings_fg.insert(data_emb)

Feature Group created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/1218722/fs/1206352/fg/1420706


Uploading Dataframe: 100.00% |██████████| Rows 11948/11948 | Elapsed Time: 00:01 | Remaining Time: 00:00


Launching job: candidate_embeddings_fg_1_offline_fg_materialization
Job started successfully, you can follow the progress at 
https://c.app.hopsworks.ai:443/p/1218722/jobs/named/candidate_embeddings_fg_1_offline_fg_materialization/executions


(Job('candidate_embeddings_fg_1_offline_fg_materialization', 'SPARK'), None)

## <span style="color:#ff5f27">🪄 Feature View Creation </span>


In [15]:
# Get or create the 'candidate_embeddings' feature view
feature_view = fs.get_or_create_feature_view(
    name="candidate_embeddings",
    version=1,
    description='Embeddings of each article',
    query=candidate_embeddings_fg.select(["article_id"]),
)

Feature view created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/1218722/fs/1206352/fv/candidate_embeddings/version/1


---

In [16]:
# End the timer
notebook_end_time = time.time()

# Calculate and print the execution time
notebook_execution_time = notebook_end_time - notebook_start_time
print(f"⌛️ Notebook Execution time: {notebook_execution_time:.2f} seconds")

⌛️ Notebook Execution time: 206.37 seconds


---
## <span style="color:#ff5f27">⏩️ Next Steps </span>

At this point, you have a recommender system that can generate a set of candidate items for a customer. However, many of these candidates could be poor, as the candidate model was trained with only a small subset of the features. In the next notebook, you'll create a ranking dataset to train a *ranking model* that makes more fine-grained predictions.