## <span style="color:#ff5f27">👨🏻‍🏫 Train Ranking Model </span>

In this notebook, you will train a ranking model with gradient boosted trees, using CatBoost's `CatBoostClassifier`.

In [1]:
import time

# Start the timer
notebook_start_time = time.time()

## <span style="color:#ff5f27">📝 Imports </span>

In [None]:
!pip install -r requirements.txt

In [2]:
import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn.metrics import classification_report, precision_recall_fscore_support
import joblib

## <span style="color:#ff5f27">🔮 Connect to Hopsworks Feature Store </span>

In [3]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://snurran.hops.works/p/17527


In [4]:
customers_fg = fs.get_feature_group(
    name="customers",
    version=1,
)

articles_fg = fs.get_feature_group(
    name="articles",
    version=2,
)

trans_fg = fs.get_feature_group(
    name="transactions",
    version=1,
)

rank_fg = fs.get_feature_group(
    name="ranking",
    version=1,
)

## <span style="color:#ff5f27">⚙️ Feature View Creation </span>

In [5]:
# Select features
selected_features_customers = customers_fg.select_all()

fs.get_or_create_feature_view( 
    name='customers',
    query=selected_features_customers,
    version=1,
)

Feature view created successfully, explore it at 
https://snurran.hops.works/p/17527/fs/17475/fv/customers/version/1


<hsfs.feature_view.FeatureView at 0x7fd0040d5900>

In [6]:
# Select features
selected_features_articles = articles_fg.select_except(['embeddings']) 

fs.get_or_create_feature_view(
    name='articles',
    query=selected_features_articles,
    version=1,
)

Feature view created successfully, explore it at 
https://snurran.hops.works/p/17527/fs/17475/fv/articles/version/1


<hsfs.feature_view.FeatureView at 0x7fcf20294310>

In [7]:
# Select features
selected_features_ranking = rank_fg.select_except(["customer_id", "article_id"])

feature_view_ranking = fs.get_or_create_feature_view(
    name='ranking',
    query=selected_features_ranking,
    labels=["label"],
    version=1,
)

Feature view created successfully, explore it at 
https://snurran.hops.works/p/17527/fs/17475/fv/ranking/version/1


## <span style="color:#ff5f27">🗄️ Train Data loading </span>

In [8]:
X_train, X_val, y_train, y_val = feature_view_ranking.train_test_split(
    test_size=0.1,
    description='Ranking training dataset',
)

X_train.head(3)

Finished: Reading data from Hopsworks, using ArrowFlight (61.49s) 




| | age | month_sin | month_cos | product_type_name | product_group_name | graphical_appearance_name | colour_group_name | perceived_colour_value_name | perceived_colour_master_name | department_name | index_name | index_group_name | section_name | garment_group_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 30.0 | 0.5 | -0.8660254 | Sweater | Garment Upper body | Solid | Dark Turquoise | Dark | Turquoise | Knitwear | Ladieswear | Ladieswear | Womens Tailoring | Knitwear |
| 1 | 22.0 | -1.0 | -1.83697e-16 | Vest top | Garment Upper body | Solid | Light Red | Medium | Red | Jersey fancy | Ladieswear | Ladieswear | Womens Everyday Collection | Jersey Fancy |
| 2 | 23.0 | -0.866025 | 0.5 | Trousers | Garment Lower body | Denim | Blue | Light | Blue | Denim Trousers | Divided | Divided | Ladies Denim | Trousers Denim |
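
The `month_sin` and `month_cos` columns look like a cyclical encoding of the calendar month. The sketch below assumes the common formula `sin(2π · month / 12)` and `cos(2π · month / 12)`; this notebook does not show how the features were computed, so treat the formula and the helper names `encode_month` and `decode_month` as assumptions. Under that assumption, the three rows above decode to months 5, 9 and 10.

```python
import math


def encode_month(month):
    """Map a month in 1..12 to a point on the unit circle (assumed encoding)."""
    angle = 2 * math.pi * month / 12
    return math.sin(angle), math.cos(angle)


def decode_month(month_sin, month_cos):
    """Invert encode_month: recover the month in 1..12 from its sin/cos pair."""
    angle = math.atan2(month_sin, month_cos) % (2 * math.pi)
    month = round(angle * 12 / (2 * math.pi))
    # Angles at (or rounding to) zero correspond to December.
    return 12 if month in (0, 12) else month


# The three (month_sin, month_cos) pairs shown in the table above.
rows = [(0.5, -0.8660254), (-1.0, -1.83697e-16), (-0.866025, 0.5)]
decoded = [decode_month(s, c) for s, c in rows]
print(decoded)
```

Encoding the month as a point on a circle keeps December and January close together, which a plain integer month would not.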


In [9]:
y_train.head(3)

| | label |
|---|---|
| 0 | 0 |
| 1 | 0 |
| 2 | 1 |


## <span style="color:#ff5f27">🏃🏻‍♂️ Model Training </span>

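Next, train a `CatBoostClassifier` on the training split. String-typed columns are passed to CatBoost as categorical features, `scale_pos_weight=10` up-weights the positive class, and early stopping monitors the validation set.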

In [10]:
cat_features = list(
    X_train.select_dtypes(include=['string', 'object']).columns
)

pool_train = Pool(X_train, y_train, cat_features=cat_features)
pool_val = Pool(X_val, y_val, cat_features=cat_features)

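model = CatBoostClassifier(
    learning_rate=0.2,
    iterations=100,            # upper bound on the number of trees
    depth=10,
    scale_pos_weight=10,       # up-weight the positive class (label == 1)
    early_stopping_rounds=5,   # stop once the eval metric has not improved for 5 rounds
    use_best_model=True,       # keep the iteration with the best eval_set score
)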

model.fit(
    pool_train, 
    eval_set=pool_val,
)

0:	learn: 0.6685178	test: 0.6684138	best: 0.6684138 (0)	total: 7.68s	remaining: 12m 40s
1:	learn: 0.6458337	test: 0.6458419	best: 0.6458419 (1)	total: 13.4s	remaining: 10m 55s
2:	learn: 0.6309779	test: 0.6311238	best: 0.6311238 (2)	total: 19.8s	remaining: 10m 39s
3:	learn: 0.6201642	test: 0.6203815	best: 0.6203815 (3)	total: 25.3s	remaining: 10m 6s
4:	learn: 0.6114273	test: 0.6117170	best: 0.6117170 (4)	total: 31.5s	remaining: 9m 58s
5:	learn: 0.6052160	test: 0.6055633	best: 0.6055633 (5)	total: 36.9s	remaining: 9m 37s
6:	learn: 0.5999197	test: 0.6002212	best: 0.6002212 (6)	total: 42.5s	remaining: 9m 24s
7:	learn: 0.5952303	test: 0.5956808	best: 0.5956808 (7)	total: 47.8s	remaining: 9m 9s
8:	learn: 0.5917079	test: 0.5922496	best: 0.5922496 (8)	total: 53s	remaining: 8m 55s
9:	learn: 0.5871315	test: 0.5875103	best: 0.5875103 (9)	total: 58.5s	remaining: 8m 46s
10:	learn: 0.5851834	test: 0.5854168	best: 0.5854168 (10)	total: 1m 3s	remaining: 8m 37s
11:	learn: 0.5823804	test: 0.5825627	best

<catboost.core.CatBoostClassifier at 0x7fcf20375f00>
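
The cell above fixes `scale_pos_weight=10`. A common starting point for this parameter is the ratio of negative to positive labels. The helper below is a minimal sketch of that heuristic on a toy label list; `neg_pos_ratio` and `toy_labels` are hypothetical names, not part of the notebook. In this notebook you could presumably pass the `label` column of `y_train` instead.

```python
from collections import Counter


def neg_pos_ratio(labels):
    """Return (#negatives / #positives) for binary 0/1 labels."""
    counts = Counter(labels)
    if counts[1] == 0:
        raise ValueError("labels contain no positive examples")
    return counts[0] / counts[1]


# Toy labels for illustration only: 9 negatives and 3 positives.
toy_labels = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1]
ratio = neg_pos_ratio(toy_labels)
print(f"negatives per positive: {ratio:.2f}")
```

A weight well above this ratio tends to push recall up at the cost of precision on the positive class, so it is worth comparing a few values on the validation set.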

## <span style="color:#ff5f27">👮🏻‍♂️ Model Validation </span>

Next, you'll evaluate how well the model performs on the validation data.

In [11]:
preds = model.predict(pool_val)

precision, recall, fscore, _ = precision_recall_fscore_support(y_val, preds, average="binary")

metrics = {
    "precision" : precision,
    "recall" : recall,
    "fscore" : fscore,
}
print(classification_report(y_val, preds))

              precision    recall  f1-score   support

           0       0.96      0.40      0.56    414936
           1       0.22      0.91      0.36     78185

    accuracy                           0.48    493121
   macro avg       0.59      0.65      0.46    493121
weighted avg       0.84      0.48      0.53    493121



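The model has a low F1-score on the positive class (0.36): recall is high (0.91) but precision is low (0.22), so many negatives are predicted as positive. This is consistent with the positive class being up-weighted through `scale_pos_weight=10`. Performance could potentially be improved by adding more features to the dataset, e.g. image embeddings.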
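
Precision and recall for the positive class also depend on the decision threshold applied to the predicted probabilities. The sketch below shows the trade-off on made-up scores; `precision_recall_at`, `y_true` and `scores` are illustrative names and values, not outputs of this notebook. With the trained model, the scores could be taken from `model.predict_proba(pool_val)[:, 1]`.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall of the positive class at a given score threshold."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels and scores, for illustration only.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.10, 0.35, 0.55, 0.60, 0.65, 0.70, 0.80, 0.95]

for threshold in (0.5, 0.75):
    precision, recall = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold}: precision={precision:.2f}, recall={recall:.2f}")
```

Raising the threshold trades recall for precision. For a ranking model this matters less than the ordering of the scores, since candidates are typically sorted by predicted probability rather than cut at a fixed threshold.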

Let's see which features your model considers important.

In [12]:
feat_to_score = {
    feature: score 
    for feature, score 
    in zip(
        X_train.columns, 
        model.feature_importances_,
    )
}

feat_to_score = dict(
    sorted(
        feat_to_score.items(),
        key=lambda item: item[1],
        reverse=True,
    )
)
feat_to_score

{'month_cos': 14.391052030455674,
 'section_name': 13.211122380694814,
 'product_group_name': 9.980529675176275,
 'product_type_name': 9.774257346430925,
 'department_name': 7.572276640117973,
 'garment_group_name': 6.135883120695915,
 'perceived_colour_master_name': 5.920165680532657,
 'graphical_appearance_name': 5.60679440979916,
 'index_name': 5.593757313769614,
 'age': 5.270226238463603,
 'perceived_colour_value_name': 4.678572330787849,
 'index_group_name': 4.17635691864563,
 'month_sin': 3.9868909337016087,
 'colour_group_name': 3.7021149807283167}

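The model places the highest importance on `month_cos` and `section_name`, followed by `product_group_name` and `product_type_name`. All inputs are temporal, customer-age or article-metadata features; no embedding features are part of this feature view. Adding user and item embedding features could therefore be one way to improve the ranking model.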
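
The importance scores printed above sum to roughly 100, so each value can be read as a percentage share. The sketch below copies those scores, rounded to three decimals, and computes how much of the total the top features account for; `top_k_share` is a hypothetical helper introduced only for this illustration.

```python
# Feature importances from the output above, rounded to three decimals.
importances = {
    "month_cos": 14.391,
    "section_name": 13.211,
    "product_group_name": 9.981,
    "product_type_name": 9.774,
    "department_name": 7.572,
    "garment_group_name": 6.136,
    "perceived_colour_master_name": 5.920,
    "graphical_appearance_name": 5.607,
    "index_name": 5.594,
    "age": 5.270,
    "perceived_colour_value_name": 4.679,
    "index_group_name": 4.176,
    "month_sin": 3.987,
    "colour_group_name": 3.702,
}


def top_k_share(scores, k):
    """Fraction of the total importance carried by the k highest-scoring features."""
    ranked = sorted(scores.values(), reverse=True)
    return sum(ranked[:k]) / sum(ranked)


total = sum(importances.values())
print(f"total importance: {total:.1f}")
print(f"share of top 4 features: {top_k_share(importances, 4):.1%}")
```

The importance is spread fairly evenly: the four strongest features carry a bit less than half of the total, and no single feature dominates.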

Finally, you'll save your model.

In [13]:
joblib.dump(model, 'ranking_model.pkl')

['ranking_model.pkl']

### <span style="color:#ff5f27">💾  Upload Model to Model Registry </span>

Next, you'll register the model in the Hopsworks Model Registry, together with its validation metrics, an input/output schema, and an input example.

In [14]:
# Connect to Hopsworks Model Registry
mr = project.get_model_registry()

Connected. Call `.close()` to terminate connection gracefully.


In [15]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

input_example = X_train.sample().to_dict("records")
input_schema = Schema(X_train)
output_schema = Schema(y_train)
model_schema = ModelSchema(input_schema, output_schema)

ranking_model = mr.python.create_model(
    name="ranking_model", 
    metrics=metrics,
    model_schema=model_schema,
    input_example=input_example,
    description="Ranking model that scores item candidates",
)
ranking_model.save("ranking_model.pkl")

  0%|          | 0/6 [00:00<?, ?it/s]

Model created, explore it at https://snurran.hops.works/p/17527/models/ranking_model/1


Model(name: 'ranking_model', version: 1)

---

In [16]:
# End the timer
notebook_end_time = time.time()

# Calculate and print the execution time
notebook_execution_time = notebook_end_time - notebook_start_time
print(f"⌛️ Notebook Execution time: {notebook_execution_time:.2f} seconds")

⌛️ Notebook Execution time: 638.80 seconds


---
## <span style="color:#ff5f27">⏩️ Next Steps </span>

You have now trained both a retrieval model and a ranking model, which together let you generate recommendations for users. In the next notebook, you'll see how to deploy these models with the `HSML` library.