## <span style="color:#ff5f27">👨🏻‍🏫 Train Ranking Model </span>

In this notebook, you will train a ranking model using gradient boosted trees. 
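
As a rough picture of what such a model is for: a ranking model assigns each candidate item a score, and the candidates are then sorted by that score. The sketch below shows only the sorting step. The article IDs and probabilities are invented for illustration and are not taken from this dataset.

```python
# Hypothetical candidates: (article_id, predicted probability of purchase).
# Both the IDs and the scores are made up for illustration.
candidates = [
    ("article_a", 0.12),
    ("article_b", 0.87),
    ("article_c", 0.45),
]


def rank_candidates(scored, top_k=2):
    """Return the top_k article IDs ordered by descending score."""
    ordered = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [article_id for article_id, _ in ordered[:top_k]]


top_items = rank_candidates(candidates)
print(top_items)
```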

In [1]:
import time

# Start the timer
notebook_start_time = time.time()

## <span style="color:#ff5f27">📝 Imports </span>

In [2]:
import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn.metrics import classification_report, precision_recall_fscore_support
import joblib

## <span style="color:#ff5f27">🔮 Connect to Hopsworks Feature Store </span>

In [3]:
import os

import hopsworks

# Read the API key from an environment variable instead of hardcoding it here
project = hopsworks.login(api_key_value=os.environ["HOPSWORKS_API_KEY"])

fs = project.get_feature_store()

2025-03-15 22:06:03,991 INFO: Initializing external client
2025-03-15 22:06:03,996 INFO: Base URL: https://c.app.hopsworks.ai:443
2025-03-15 22:06:06,873 INFO: Python Engine initialized.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/1218722


In [4]:
customers_fg = fs.get_feature_group(
    name="customers",
    version=1,
)

articles_fg = fs.get_feature_group(
    name="articles",
    version=1,
)

trans_fg = fs.get_feature_group(
    name="transactions",
    version=1,
)

interactions_fg = fs.get_feature_group(
    name="interactions",
    version=1,
)

rank_fg = fs.get_feature_group(
    name="ranking",
    version=1,
)

## <span style="color:#ff5f27">⚙️ Feature View Creation </span>

In [5]:
# Select features
selected_features_customers = customers_fg.select_all()

fs.get_or_create_feature_view( 
    name='customers',
    query=selected_features_customers,
    version=1,
)

Feature view created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/1218722/fs/1206352/fv/customers/version/1


<hsfs.feature_view.FeatureView at 0x1fe2b208310>

In [6]:
# Select features
selected_features_articles = articles_fg.select_except(['embeddings']) 

fs.get_or_create_feature_view(
    name='articles',
    query=selected_features_articles,
    version=1,
)

Feature view created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/1218722/fs/1206352/fv/articles/version/1


<hsfs.feature_view.FeatureView at 0x1fe2b228790>

In [7]:
selected_features_llm_assistant = trans_fg.select([
    "customer_id",
    "t_dat",
    "price",
    "sales_channel_id",
    "year",
    "month",
    "day",
    "day_of_week",
]).join(
    customers_fg.select([
        "club_member_status",
        "age",
        "age_group",
    ]), 
    on="customer_id", 
    prefix="customer_",
).join(
    articles_fg.select([
        "prod_name",
        "product_type_name",
        "product_group_name",
        "graphical_appearance_name",
        "colour_group_name",
        "section_name",
        "garment_group_name",
        "article_description",
    ]), 
    on="article_id", 
    prefix="article_",
).join(
    interactions_fg.select([
        "interaction_score",
]),
    on=["customer_id", "article_id"],
    prefix="interaction_",
)

# Create the feature view
llm_assistant_feature_view = fs.get_or_create_feature_view(
    name='llm_assistant_context',
    query=selected_features_llm_assistant,
    version=1
)

Feature view created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/1218722/fs/1206352/fv/llm_assistant_context/version/1


In [8]:
# Select features
selected_features_ranking = rank_fg.select_except(["customer_id", "article_id"]).join(
    trans_fg.select(["month_sin", "month_cos"]), 
    prefix="trans_",
)

feature_view_ranking = fs.get_or_create_feature_view(
    name='ranking',
    query=selected_features_ranking,
    labels=["label"],
    version=1,
)

Feature view created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/1218722/fs/1206352/fv/ranking/version/1


## <span style="color:#ff5f27">🗄️ Train Data loading </span>

In [9]:
X_train, X_val, y_train, y_val = feature_view_ranking.train_test_split(
    test_size=0.1,
    description='Ranking training dataset',
)

X_train.head(3)

Finished: Reading data from Hopsworks, using Hopsworks Feature Query Service (5.58s) 




| | age | product_type_name | product_group_name | graphical_appearance_name | colour_group_name | perceived_colour_value_name | perceived_colour_master_name | department_name | index_name | index_group_name | section_name | garment_group_name | trans_month_sin | trans_month_cos |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 31.0 | Bra | Underwear | Melange | Dark Grey | Dark | Grey | Mama Lingerie | Lingeries/Tights | Ladieswear | Womens Lingerie | Under-, Nightwear | 0.5 | -0.8660254 |
| 1 | 22.0 | Trousers | Garment Lower body | Denim | Light Blue | Dusty Light | Blue | Denim Trousers | Divided | Divided | Ladies Denim | Trousers Denim | 1.0 | 6.123234000000001e-17 |
| 2 | 67.0 | Top | Garment Upper body | Solid | Light Beige | Dusty Light | Beige | Jersey fancy | Ladieswear | Ladieswear | Womens Everyday Collection | Jersey Fancy | -1.0 | -1.83697e-16 |
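
The `trans_month_sin` and `trans_month_cos` columns look like a cyclical encoding of the calendar month. The sketch below assumes the common formula sin(2π·month/12) and cos(2π·month/12). The notebook does not show how the feature group computes these columns, so treat the formula as an assumption. Under it, the three rows above would correspond to months 5, 3, and 9.

```python
import math


def encode_month(month: int) -> tuple[float, float]:
    """Map a month (1-12) to a point on the unit circle (assumed encoding)."""
    angle = 2 * math.pi * month / 12
    return math.sin(angle), math.cos(angle)


def decode_month(month_sin: float, month_cos: float) -> int:
    """Invert the assumed encoding; returns a month in 1..12."""
    angle = math.atan2(month_sin, month_cos) % (2 * math.pi)
    month = round(angle * 12 / (2 * math.pi)) % 12
    return 12 if month == 0 else month


month_sin, month_cos = encode_month(5)
print(round(month_sin, 4), round(month_cos, 4))
print(decode_month(0.5, -0.8660254))
```

Encoding the month this way keeps December and January close together, which a plain integer month (12 vs. 1) would not.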


In [10]:
y_train.head(3)

| | label |
|---|---|
| 0 | 1 |
| 1 | 1 |
| 2 | 0 |


## <span style="color:#ff5f27">🏃🏻‍♂️ Model Training </span>

Let's train a CatBoost classifier, treating every string-typed column as a categorical feature.
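
The training cell below sets `scale_pos_weight=10`, which up-weights the positive class. A common heuristic is to set this value near the ratio of negative to positive examples. The sketch applies that heuristic to the class counts listed in this notebook's validation report (20897 negatives, 2019 positives). Whether the value 10 was actually chosen this way is an assumption.

```python
def negative_to_positive_ratio(n_negative: int, n_positive: int) -> float:
    """Heuristic weight for the positive class: negatives / positives."""
    if n_positive <= 0:
        raise ValueError("need at least one positive example")
    return n_negative / n_positive


# Support counts from the validation classification report in this notebook.
ratio = negative_to_positive_ratio(20897, 2019)
print(f"{ratio:.2f}")
```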

In [11]:
cat_features = list(
    X_train.select_dtypes(include=['string', 'object']).columns
)

pool_train = Pool(X_train, y_train, cat_features=cat_features)
pool_val = Pool(X_val, y_val, cat_features=cat_features)

model = CatBoostClassifier(
    learning_rate=0.2,
    iterations=100,
    depth=10,
    scale_pos_weight=10,
    early_stopping_rounds=5,
    use_best_model=True,
)

model.fit(
    pool_train, 
    eval_set=pool_val,
)

0:	learn: 0.5151511	test: 0.5148830	best: 0.5148830 (0)	total: 520ms	remaining: 51.5s
1:	learn: 0.3955101	test: 0.3950133	best: 0.3950133 (1)	total: 665ms	remaining: 32.6s
2:	learn: 0.3099050	test: 0.3092096	best: 0.3092096 (2)	total: 1.1s	remaining: 35.5s
3:	learn: 0.2462480	test: 0.2453690	best: 0.2453690 (3)	total: 1.3s	remaining: 31.3s
4:	learn: 0.1978071	test: 0.1967568	best: 0.1967568 (4)	total: 1.45s	remaining: 27.5s
5:	learn: 0.1603387	test: 0.1591268	best: 0.1591268 (5)	total: 1.54s	remaining: 24.2s
6:	learn: 0.1310290	test: 0.1296631	best: 0.1296631 (6)	total: 1.69s	remaining: 22.5s
7:	learn: 0.1079138	test: 0.1063998	best: 0.1063998 (7)	total: 1.78s	remaining: 20.5s
8:	learn: 0.0895764	test: 0.0879182	best: 0.0879182 (8)	total: 1.92s	remaining: 19.4s
9:	learn: 0.0749775	test: 0.0731808	best: 0.0731808 (9)	total: 2.21s	remaining: 19.9s
10:	learn: 0.0633187	test: 0.0613885	best: 0.0613885 (10)	total: 2.39s	remaining: 19.4s
11:	learn: 0.0539871	test: 0.0519289	best: 0.0519289 (

<catboost.core.CatBoostClassifier at 0x1fe2b33cad0>

## <span style="color:#ff5f27">👮🏻‍♂️ Model Validation </span>

Next, you'll evaluate how well the model performs on the validation data.

In [12]:
preds = model.predict(pool_val)

precision, recall, fscore, _ = precision_recall_fscore_support(y_val, preds, average="binary")

metrics = {
    "precision" : precision,
    "recall" : recall,
    "fscore" : fscore,
}
print(classification_report(y_val, preds))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     20897
           1       0.96      1.00      0.98      2019

    accuracy                           1.00     22916
   macro avg       0.98      1.00      0.99     22916
weighted avg       1.00      1.00      1.00     22916



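The report shows that the model reaches an F1-score of 0.98 on the positive class (precision 0.96, recall 1.00; higher is better). Scores this close to perfect are worth a sanity check, for example for label leakage, before the model is relied on. If more headroom is needed, performance could potentially be improved by adding more features to the dataset, e.g. image embeddings.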
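
To make the report's numbers concrete, precision, recall, and F1 for the positive class follow directly from confusion-matrix counts. The sketch below uses small invented counts (not this notebook's data) and plain Python rather than scikit-learn.

```python
def binary_prf(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 for the positive class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Invented counts: 96 true positives, 4 false positives, 0 false negatives.
precision_demo, recall_demo, f1_demo = binary_prf(tp=96, fp=4, fn=0)
print(f"{precision_demo:.2f} {recall_demo:.2f} {f1_demo:.2f}")
```

Because F1 is the harmonic mean of precision and recall, it stays below their arithmetic mean whenever the two differ.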

Let's see which features your model considers important.

In [13]:
feat_to_score = {
    feature: score 
    for feature, score 
    in zip(
        X_train.columns, 
        model.feature_importances_,
    )
}

feat_to_score = dict(
    sorted(
        feat_to_score.items(),
        key=lambda item: item[1],
        reverse=True,
    )
)
feat_to_score

{'trans_month_cos': 57.21420079190766,
 'trans_month_sin': 38.21078315550941,
 'garment_group_name': 0.9665549325124905,
 'product_group_name': 0.9033111378551797,
 'age': 0.869185063131642,
 'perceived_colour_master_name': 0.4310532694175557,
 'colour_group_name': 0.4171872303792953,
 'index_name': 0.39352912952979635,
 'perceived_colour_value_name': 0.27880771508017543,
 'department_name': 0.16874477600478968,
 'index_group_name': 0.0715768804041732,
 'graphical_appearance_name': 0.049318704316874504,
 'section_name': 0.025747213951008836,
 'product_type_name': 0.0}

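It can be seen that the model places almost all of its importance on the two cyclical month features, `trans_month_cos` and `trans_month_sin`, which together account for roughly 95% of the total. The article attributes and `age` contribute very little, and no embedding features are part of this feature view. This suggests the ranking currently depends mostly on when a transaction happened rather than on who the customer is or what the article is, so richer customer and item features could yield a better ranking model.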
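
A quick way to check how concentrated the importances are is to compute the share of the total carried by the top features. The sketch below hardcodes three of the scores printed above (rounded) and sums the remaining eleven into a placeholder entry named `other_features`, which is not a real column. In the notebook you would pass `feat_to_score` instead.

```python
def top_k_share(importances: dict[str, float], k: int) -> float:
    """Fraction of total importance carried by the k highest-scoring features."""
    total = sum(importances.values())
    if total == 0:
        return 0.0
    top = sorted(importances.values(), reverse=True)[:k]
    return sum(top) / total


# Rounded scores from the output above. "other_features" is a placeholder
# holding the sum of the remaining eleven features, so only k <= 2 is
# meaningful for this shortened dictionary.
scores = {
    "trans_month_cos": 57.2142,
    "trans_month_sin": 38.2108,
    "garment_group_name": 0.9666,
    "other_features": 3.6084,
}
share = top_k_share(scores, k=2)
print(f"{share:.1%}")
```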

Finally, you'll save your model.

In [14]:
joblib.dump(model, 'ranking_model.pkl')

['ranking_model.pkl']

### <span style="color:#ff5f27">💾  Upload Model to Model Registry </span>

You'll upload the model to the Hopsworks Model Registry.

In [15]:
# Connect to Hopsworks Model Registry
mr = project.get_model_registry()

In [16]:
input_example = X_train.sample().to_dict("records")
                                         
ranking_model = mr.python.create_model(
    name="ranking_model", 
    description="Ranking model that scores item candidates",
    version=1,
    metrics=metrics,
    feature_view=feature_view_ranking,
    input_example=input_example,
)
ranking_model.save("ranking_model.pkl")


Model created, explore it at https://c.app.hopsworks.ai:443/p/1218722/models/ranking_model/1





Model(name: 'ranking_model', version: 1)

---

In [17]:
# End the timer
notebook_end_time = time.time()

# Calculate and print the execution time
notebook_execution_time = notebook_end_time - notebook_start_time
print(f"⌛️ Notebook Execution time: {notebook_execution_time:.2f} seconds")

⌛️ Notebook Execution time: 123.09 seconds


---
## <span style="color:#ff5f27">⏩️ Next Steps </span>

Now you have trained both a retrieval and a ranking model, which will allow you to generate recommendations for users. In the next notebook, you'll take a look at how you can deploy these models with the `HSML` library.