In [1]:
import sys
from pathlib import Path

In [2]:
root_dir = str(Path().absolute().parent)
print("⛳️ Local environment")

# Add the root directory to the `PYTHONPATH` to use the `recsys` Python module from the notebook.
if root_dir not in sys.path:
    print(f"Adding the following directory to the PYTHONPATH: {root_dir}")
    sys.path.append(root_dir)

⛳️ Local environment
Adding the following directory to the PYTHONPATH: /home/nnaemeka/Documents/code/Real-time Personalized Recommender


## Feature pipeline: Computing features

### Imports

In [3]:
%load_ext autoreload
%autoreload 2

import warnings
from pprint import pprint

import polars as pl

warnings.filterwarnings("ignore")

from recsys.config import settings
from recsys.raw_data_sources import h_and_m as h_and_m_raw_data

### Constants

These are the default settings used across the lessons. You can override any of them in the `.env` file at the root of the repository:

In [4]:
pprint(dict(settings))

{'CUSTOMER_DATA_SIZE': <CustomerDatasetSize.SMALL: 'SMALL'>,
 'CUSTOM_HOPSWORKS_INFERENCE_ENV': 'custom_env_name',
 'FEATURES_EMBEDDING_MODEL_ID': 'all-MiniLM-L6-v2',
 'HOPSWORKS_API_KEY': None,
 'OPENAI_API_KEY': None,
 'OPENAI_MODEL_ID': 'gpt-4o-mini',
 'RANKING_DATASET_VALIDATON_SPLIT_SIZE': 0.1,
 'RANKING_EARLY_STOPPING_ROUNDS': 5,
 'RANKING_ITERATIONS': 100,
 'RANKING_LEARNING_RATE': 0.2,
 'RANKING_MODEL_TYPE': 'ranking',
 'RANKING_SCALE_POS_WEIGHT': 10,
 'RECSYS_DIR': PosixPath('/home/nnaemeka/Documents/code/Real-time Personalized Recommender/recsys'),
 'TWO_TOWER_DATASET_TEST_SPLIT_SIZE': 0.1,
 'TWO_TOWER_DATASET_VALIDATION_SPLIT_SIZE': 0.1,
 'TWO_TOWER_LEARNING_RATE': 0.01,
 'TWO_TOWER_MODEL_BATCH_SIZE': 2048,
 'TWO_TOWER_MODEL_EMBEDDING_SIZE': 16,
 'TWO_TOWER_NUM_EPOCHS': 10,
 'TWO_TOWER_WEIGHT_DECAY': 0.001}
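
To make the override behaviour concrete, here is a minimal standard-library sketch of the general pattern: defaults that are replaced by environment values when present. It is not the `recsys.config` implementation; `DEFAULTS` and `load_settings` are hypothetical names introduced only for this illustration, and the sketch assumes that overrides arrive as strings and are cast to the type of the default.

```python
import os

# Hypothetical defaults mirroring two of the values printed above.
DEFAULTS = {"TWO_TOWER_NUM_EPOCHS": 10, "TWO_TOWER_LEARNING_RATE": 0.01}


def load_settings(env=None):
    """Return DEFAULTS with any matching environment values applied on top."""
    env = os.environ if env is None else env
    resolved = {}
    for key, default in DEFAULTS.items():
        raw = env.get(key)
        # Cast the override to the type of the default so "5" becomes 5.
        resolved[key] = type(default)(raw) if raw is not None else default
    return resolved


# Simulate a .env entry such as TWO_TOWER_NUM_EPOCHS=5.
overridden = load_settings({"TWO_TOWER_NUM_EPOCHS": "5"})
print(overridden)
```

Only the keys present in the environment change; every other setting keeps its default value.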


In [8]:
from recsys import hopsworks_integration

In [10]:
project, fs = hopsworks_integration.get_feature_store()

2025-01-10 15:50:58.439 | INFO     | recsys.hopsworks_integration.feature_store:get_feature_store:16 - Login to Hopsworks using cached API key.


2025-01-10 15:50:58,440 INFO: Initializing external client
2025-01-10 15:50:58,440 INFO: Base URL: https://c.app.hopsworks.ai:443
2025-01-10 15:51:00,162 INFO: Closing external client and cleaning up certificates.
Connection closed.
Copy your Api Key (first register/login): https://c.app.hopsworks.ai/account/api/generated
2025-01-10 15:51:11,518 INFO: Initializing external client
2025-01-10 15:51:11,519 INFO: Base URL: https://c.app.hopsworks.ai:443
2025-01-10 15:51:14,691 INFO: Python Engine initialized.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/1209494



### The H&M dataset

To show how a recommender system built on the two-tower architecture works, we will use the H&M Personalized Fashion Recommendations dataset.

It consists of:
* articles
* customers
* transactions

### 🗄️ Articles data

The `article_id` and `product_code` columns serve different purposes in H&M's product database:

* Article ID: A unique identifier assigned to each individual article in the database. It is typically used for internal tracking and management. Each distinct variant of a product (e.g., a different size or color) has its own `article_id`.

* Product Code: An identifier for a product or style rather than for an individual article. It is not unique per row: multiple articles share the same `product_code` when they are variants of the same product line or style.

In short, the `article_id` identifies a single item variant, whereas the `product_code` identifies the product or style that groups those variants.

Here is an example:

Product: Basic T-Shirt

* Product Code: TS001

* Article IDs:
    * Article ID: 1001 (Size: Small, Color: White)
    * Article ID: 1002 (Size: Medium, Color: White)
    * Article ID: 1003 (Size: Large, Color: White)
    * Article ID: 1004 (Size: Small, Color: Black)
    * Article ID: 1005 (Size: Medium, Color: Black)

In this example, "TS001" is the product code for the basic t-shirt style. Each variant of this t-shirt (e.g., different sizes and colors) has its own unique article_id.
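
The relationship in the T-shirt example can be sketched in a few lines of plain Python. The rows below are the toy values from the example, not real H&M records, and the variable names are introduced only for this illustration.

```python
from collections import defaultdict

# Toy rows copied from the Basic T-Shirt example; not real H&M records.
articles = [
    {"article_id": 1001, "product_code": "TS001", "size": "Small", "color": "White"},
    {"article_id": 1002, "product_code": "TS001", "size": "Medium", "color": "White"},
    {"article_id": 1003, "product_code": "TS001", "size": "Large", "color": "White"},
    {"article_id": 1004, "product_code": "TS001", "size": "Small", "color": "Black"},
    {"article_id": 1005, "product_code": "TS001", "size": "Medium", "color": "Black"},
]

# Group article IDs under the product code they belong to.
articles_by_product = defaultdict(list)
for row in articles:
    articles_by_product[row["product_code"]].append(row["article_id"])

# article_id is unique per row, while one product_code covers many rows.
article_ids = [row["article_id"] for row in articles]
ids_are_unique = len(article_ids) == len(set(article_ids))

print(ids_are_unique)
print(dict(articles_by_product))
```

Grouping by `product_code` collapses the five variants into a single entry, which is the one-to-many relationship described above.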


In [5]:
articles_df = h_and_m_raw_data.extract_articles_df()
articles_df.shape

(105542, 25)