# 🧬 Training pipeline: Training retrieval model

In this notebook, you will train a retrieval model that can quickly generate a small subset of candidate items from a large collection of items. The model is based on the *two-tower architecture*, which embeds queries and candidates (keys) into a shared low-dimensional vector space. Here, a query consists of features of a customer and a transaction (e.g. the timestamp of the purchase), whereas a candidate consists of features of a particular item. Every query has a user ID and every candidate has an item ID, and the model is trained so that the embedding of a user is close to the embeddings of all items that user has previously bought.
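To make the idea of a shared embedding space concrete, here is a minimal sketch that scores candidates against a query and keeps the closest ones. It assumes dot-product similarity as the notion of "close". The vectors, the item IDs, and the `retrieve` helper are invented for illustration and are not part of the `recsys` package; the real model learns its embeddings during training.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(query_emb, item_embs, k):
    """Return the IDs of the k items whose embeddings score highest against the query."""
    ranked = sorted(
        item_embs,
        key=lambda item_id: dot(query_emb, item_embs[item_id]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical 3-dimensional embeddings (illustrative values only).
query_embedding = [0.9, 0.1, 0.0]
item_embeddings = {
    "item_a": [1.0, 0.0, 0.0],
    "item_b": [0.0, 1.0, 0.0],
    "item_c": [0.7, 0.3, 0.0],
}

top_items = retrieve(query_embedding, item_embeddings, k=2)
print(top_items)
```

Because queries and candidates share one space, retrieval reduces to a nearest-neighbour lookup, which is what makes this architecture fast at serving time.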

After training the model, you will save it and upload its components to the Vertex AI Model Registry.

Let's go ahead and load the data.

## 📝 Imports

In [4]:
%load_ext autoreload
%autoreload 2

import warnings

warnings.filterwarnings("ignore")

from loguru import logger

from recsys import gcp_integrations, training
from recsys.config import settings

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


## ☁️ Connect to Vertex AI Feature Online Store

In [5]:
project, fs = gcp_integrations.get_feature_store()

2025-02-11 15:04:56.131 | INFO     | recsys.gcp_integrations.feature_store:get_feature_store:17 - Retrieving Feature Store from us-central1/recsys-dev-gonzo/recsys_feature_store_dev
2025-02-11 15:04:57.154 | ERROR    | recsys.gcp_integrations.feature_store:get_feature_store:37 - Error retrieving Feature Store: 'FeatureOnlineStoreServiceClient' object has no attribute 'feature_online_store_path'


AttributeError: 'FeatureOnlineStoreServiceClient' object has no attribute 'feature_online_store_path'

## 💿 Create training dataset
You will train your retrieval model with a subset of features.

For the query embedding you will use:
- `customer_id`: ID of the customer.
- `age`: age of the customer at the time of purchase.
- `month_sin`, `month_cos`: time of year the purchase was made.

For the candidate embedding you will use:
- `article_id`: ID of the item.
- `garment_group_name`: type of garment.
- `index_group_name`: menswear/ladieswear etc.
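The `month_sin` and `month_cos` query features describe the time of year as a point on a circle, so that December and January end up close together. The sketch below shows one common way to compute such an encoding. The exact formula used by the feature pipeline is not shown in this notebook, so the `encode_month` helper and its January-at-angle-zero convention are assumptions.

```python
import math


def encode_month(month):
    """Map a calendar month (1-12) to a (sin, cos) point on the unit circle."""
    angle = 2 * math.pi * (month - 1) / 12  # assumption: January sits at angle 0
    return math.sin(angle), math.cos(angle)


def circular_distance(month_a, month_b):
    """Euclidean distance between the encoded positions of two months."""
    return math.dist(encode_month(month_a), encode_month(month_b))


dec_jan = circular_distance(12, 1)  # adjacent months across the year boundary
jan_feb = circular_distance(1, 2)   # adjacent months within the year
jan_jul = circular_distance(1, 7)   # months half a year apart

print(f"Dec-Jan: {dec_jan:.3f}, Jan-Feb: {jan_feb:.3f}, Jan-Jul: {jan_jul:.3f}")
```

With a plain integer month, December (12) and January (1) would look eleven steps apart; the two-component encoding removes that artificial jump.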

In [None]:
feature_view = gcp_integrations.feature_store.create_retrieval_feature_view()

In [None]:
dataset = training.two_tower.TwoTowerDataset(
    feature_view=feature_view, batch_size=settings.TWO_TOWER_MODEL_BATCH_SIZE
)

train_ds, val_ds = dataset.get_train_val_split()

Let's take a look at our dataset:

In [None]:
logger.info(f"Training samples: {len(dataset.properties['train_df']):,}")
logger.info(f"Validation samples: {len(dataset.properties['val_df']):,}")

logger.info(f"Number of users: {len(dataset.properties['user_ids']):,}")
logger.info(f"Number of items: {len(dataset.properties['item_ids']):,}")

In [None]:
dataset.properties["train_df"].head()

## 🗼🗼 Build the Two Tower model

The two-tower model consists of two models:
- **Query model**: Generates a query representation from user and transaction features.
- **Candidate model**: Generates an item representation from item features.

**Both models produce embeddings that live in the same embedding space**. You keep this space low-dimensional to prevent overfitting on the training data; otherwise, the model might simply memorize previous purchases and recommend items that customers have already bought.

You start by creating the query model.

In [None]:
query_model_factory = training.two_tower.QueryTowerFactory(dataset=dataset)
query_model = query_model_factory.build()

You will evaluate the two tower model using the *top-100 accuracy*. That is, for each transaction in the validation data you will generate the associated query embedding and retrieve the set of the 100 items that are closest to this query in the embedding space. The top-100 accuracy measures how often the item that was actually bought is part of this subset. To evaluate this, you create a dataset of all unique items in the training data.
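As a rough illustration of the metric described above, the following sketch computes top-k accuracy on toy data with `k=2`. It assumes dot-product similarity. The `top_k_accuracy` helper and all embeddings are hypothetical and are not the evaluation code used by `TwoTowerTrainer`.

```python
def top_k_accuracy(query_embs, bought_item_ids, item_embs, k):
    """Fraction of queries whose purchased item is among the k highest-scoring items."""
    hits = 0
    for query_emb, bought in zip(query_embs, bought_item_ids):
        # Score every candidate item against this query (dot product).
        scores = {
            item_id: sum(q * c for q, c in zip(query_emb, emb))
            for item_id, emb in item_embs.items()
        }
        top_k = sorted(scores, key=scores.get, reverse=True)[:k]
        hits += bought in top_k
    return hits / len(bought_item_ids)


# Hypothetical 2-dimensional embeddings (illustrative values only).
items = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.6, 0.6]}
queries = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
bought = ["a", "a", "c"]  # the item actually purchased in each transaction

accuracy = top_k_accuracy(queries, bought, items, k=2)
print(f"top-2 accuracy: {accuracy:.3f}")
```

Here the second query misses its purchased item, so two of the three transactions count as hits. In the notebook the same idea is applied with `k=100` over all unique items.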

### Training the model

In [None]:
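# NOTE: `model` must be the combined two-tower model. It is not defined in the
# cells above (only `query_model` is), so build it before running this cell.
trainer = training.two_tower.TwoTowerTrainer(dataset=dataset, model=model)
history = trainer.train(train_ds, val_ds)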

Let's take a look at the training and validation loss:

In [None]:
import matplotlib.pyplot as plt

# Create figure with two subplots
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(10, 6))

# Training loss subplot
ax1.plot(history.history["loss"], label="Training Loss", color="blue")
ax1.set_title("Training Loss Over Time")
ax1.set_xlabel("Epoch")
ax1.set_ylabel("Loss")
ax1.legend()
ax1.grid(True)

# Validation loss subplot
ax2.plot(history.history["val_loss"], label="Validation Loss", color="red")
ax2.set_title("Validation Loss Over Time")
ax2.set_xlabel("Epoch")
ax2.set_ylabel("Loss")
ax2.legend()
ax2.grid(True)

# Adjust layout to prevent overlap
plt.tight_layout()
plt.show()

## 🗄️ Upload models to Vertex AI Model Registry

In [None]:
# TODO: create the Model Registry in Terraform and add functions to retrieve it
mr = project.get_model_registry()

Save models

In [None]:
# TODO: also create the methods to upload the model to the registry
