## <span style="color:#ff5f27">👨🏻‍🏫 Train Ranking Model </span>

In this notebook, you will train a ranking model with CatBoost, a gradient boosted trees library. The model is a binary classifier that scores candidate items for a customer.

## <span style="color:#ff5f27">📝 Imports </span>

In [1]:
import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn.metrics import classification_report, precision_recall_fscore_support
import joblib

## <span style="color:#ff5f27">🔮 Connect to Hopsworks Feature Store </span>

In [2]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

Connected. Call `.close()` to terminate connection gracefully.

Logged in to project, explore it here https://snurran.hops.works/p/17527


In [3]:
customers_fg = fs.get_feature_group(
    name="customers",
    version=1,
)

articles_fg = fs.get_feature_group(
    name="articles",
    version=2,
)

trans_fg = fs.get_feature_group(
    name="transactions",
    version=1,
)

rank_fg = fs.get_feature_group(
    name="ranking",
    version=1,
)

## <span style="color:#ff5f27">⚙️ Feature View Creation </span>

In [4]:
# Select features
selected_features_customers = customers_fg.select_all()

fs.get_or_create_feature_view( 
    name='customers',
    query=selected_features_customers,
    version=1,
)

Feature view created successfully, explore it at 
https://snurran.hops.works/p/17527/fs/17475/fv/customers/version/1


<hsfs.feature_view.FeatureView at 0x7fc6b68a9900>

In [5]:
# Select features
selected_features_articles = articles_fg.select_except(['embeddings']) 

fs.get_or_create_feature_view(
    name='articles',
    query=selected_features_articles,
    version=1,
)

Feature view created successfully, explore it at 
https://snurran.hops.works/p/17527/fs/17475/fv/articles/version/1


<hsfs.feature_view.FeatureView at 0x7feaec287910>

In [6]:
# Select features
selected_features_ranking = rank_fg.select_except(["customer_id", "article_id"])

feature_view_ranking = fs.get_or_create_feature_view(
    name='ranking',
    query=selected_features_ranking,
    labels=["label"],
    version=1,
)

Feature view created successfully, explore it at 
https://snurran.hops.works/p/17527/fs/17475/fv/ranking/version/1


## <span style="color:#ff5f27">🗄️ Training Data Loading </span>

In [7]:
X_train, X_val, y_train, y_val = feature_view_ranking.train_test_split(
    test_size=0.1,
    description='Ranking training dataset',
)

X_train.head(3)

Finished: Reading data from Hopsworks, using ArrowFlight (62.39s) 




| | age | month_sin | month_cos | product_type_name | product_group_name | graphical_appearance_name | colour_group_name | perceived_colour_value_name | perceived_colour_master_name | department_name | index_name | index_group_name | section_name | garment_group_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 46.0 | -0.8660254 | 0.5 | Earring | Accessories | Solid | Gold | Bright | Metal | Jewellery | Ladies Accessories | Ladieswear | Womens Small accessories | Accessories |
| 3 | 53.0 | -0.5 | 0.866025 | Sweater | Garment Upper body | Solid | Red | Medium | Red | Jersey fancy | Ladieswear | Ladieswear | Womens Everyday Collection | Jersey Fancy |
| 4 | 24.0 | -2.449294e-16 | 1.0 | Dress | Garment Full body | Solid | Black | Dark | Black | Dresses | Divided | Divided | Divided Collection | Dresses Ladies |
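The `month_sin` and `month_cos` values in the preview are consistent with a cyclical encoding of the calendar month, which places December next to January on a circle. The sketch below is an assumption about how such columns could be derived, not the code that produced this feature group; the helper name `encode_month` is hypothetical.

```python
import math

def encode_month(month: int) -> tuple[float, float]:
    """Map a calendar month (1-12) to a point on the unit circle.

    Hypothetical helper: assumes the encoding sin/cos(2 * pi * month / 12).
    """
    angle = 2 * math.pi * month / 12
    return math.sin(angle), math.cos(angle)

# Under this assumption, the three preview rows would correspond
# to October, November, and December respectively.
for month in (10, 11, 12):
    month_sin, month_cos = encode_month(month)
    print(month, month_sin, month_cos)
```

Under this assumption, the tiny non-zero `month_sin` in the third row is floating-point rounding of `sin(2π)`.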


In [8]:
y_train.head(3)

| | label |
|---|---|
| 0 | 0 |
| 3 | 1 |
| 4 | 0 |


## <span style="color:#ff5f27">🏃🏻‍♂️ Model Training </span>

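Next, you'll train a `CatBoostClassifier`. String and object columns are passed as categorical features, the positive class is up-weighted with `scale_pos_weight`, and training stops early if the validation loss stops improving.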

In [9]:
cat_features = list(
    X_train.select_dtypes(include=['string', 'object']).columns
)

pool_train = Pool(X_train, y_train, cat_features=cat_features)
pool_val = Pool(X_val, y_val, cat_features=cat_features)

model = CatBoostClassifier(
    learning_rate=0.2,
    iterations=100,
    depth=10,
    scale_pos_weight=10,
    early_stopping_rounds=5,
    use_best_model=True,
)

model.fit(
    pool_train, 
    eval_set=pool_val,
)

0:	learn: 0.6122373	test: 0.6121214	best: 0.6121214 (0)	total: 4.18s	remaining: 6m 53s
1:	learn: 0.5588780	test: 0.5586605	best: 0.5586605 (1)	total: 9.38s	remaining: 7m 39s
2:	learn: 0.5228194	test: 0.5225191	best: 0.5225191 (2)	total: 12.8s	remaining: 6m 53s
3:	learn: 0.4969102	test: 0.4965672	best: 0.4965672 (3)	total: 19.3s	remaining: 7m 42s
4:	learn: 0.4788610	test: 0.4784777	best: 0.4784777 (4)	total: 24.7s	remaining: 7m 49s
5:	learn: 0.4665516	test: 0.4661433	best: 0.4661433 (5)	total: 27.7s	remaining: 7m 13s
6:	learn: 0.4544422	test: 0.4540722	best: 0.4540722 (6)	total: 33s	remaining: 7m 18s
7:	learn: 0.4452461	test: 0.4449183	best: 0.4449183 (7)	total: 39.3s	remaining: 7m 31s
8:	learn: 0.4388605	test: 0.4385451	best: 0.4385451 (8)	total: 44.6s	remaining: 7m 30s
9:	learn: 0.4338347	test: 0.4335593	best: 0.4335593 (9)	total: 49.7s	remaining: 7m 27s
10:	learn: 0.4292860	test: 0.4290383	best: 0.4290383 (10)	total: 55.2s	remaining: 7m 26s
11:	learn: 0.4261473	test: 0.4259250	best: 

<catboost.core.CatBoostClassifier at 0x7fc5d926a1a0>
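To illustrate what `early_stopping_rounds=5` together with `use_best_model=True` is meant to achieve, here is a simplified, library-free sketch of patience-based early stopping. It is not CatBoost's actual overfitting detector; the function name and the loss values are hypothetical and chosen only for illustration.

```python
def best_iteration_with_patience(val_losses, patience):
    """Return (best_index, stop_index) for a simple patience rule.

    Training stops once `patience` consecutive iterations have passed
    without improving on the best validation loss seen so far.
    """
    best_index = 0
    best_loss = float("inf")
    for i, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss = loss
            best_index = i
        elif i - best_index >= patience:
            return best_index, i
    # No early stop triggered: all iterations were run.
    return best_index, len(val_losses) - 1

# Hypothetical validation losses (not taken from the log above).
losses = [0.61, 0.56, 0.52, 0.50, 0.51, 0.52, 0.53, 0.54, 0.55, 0.49]
best, stopped = best_iteration_with_patience(losses, patience=5)
print(f"best iteration: {best}, stopped at: {stopped}")
```

In this toy run the loop stops before it ever sees the final value; keeping the model from the best iteration, rather than the last one, is the idea behind `use_best_model=True`.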

## <span style="color:#ff5f27">👮🏻‍♂️ Model Validation </span>

Next, you'll evaluate how well the model performs on the validation data.

In [10]:
preds = model.predict(pool_val)

precision, recall, fscore, _ = precision_recall_fscore_support(y_val, preds, average="binary")

metrics = {
    "precision" : precision,
    "recall" : recall,
    "fscore" : fscore,
}
print(classification_report(y_val, preds))

              precision    recall  f1-score   support

           0       0.96      0.21      0.34    333587
           1       0.37      0.98      0.54    158885

    accuracy                           0.46    492472
   macro avg       0.67      0.59      0.44    492472
weighted avg       0.77      0.46      0.40    492472



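The report shows that the model finds nearly all positives (recall 0.98 on class 1) but with low precision (0.37), which gives an F1-score of only 0.54 on the positive class; recall on the negative class is just 0.21. The aggressive `scale_pos_weight=10` setting probably contributes to this bias toward predicting the positive class. Performance could potentially be improved by adding more features to the dataset, e.g. image embeddings.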
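As a quick check on how the F1-scores in the report follow from precision and recall, here is a minimal sketch using the rounded values printed above. It also computes the negative-to-positive ratio of the validation support counts, which is a common heuristic starting point for `scale_pos_weight` (assuming the training split has a similar class balance). The helper name `f1_from` is hypothetical.

```python
def f1_from(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Rounded per-class values copied from the classification report.
f1_negative = f1_from(precision=0.96, recall=0.21)
f1_positive = f1_from(precision=0.37, recall=0.98)
print(round(f1_negative, 2), round(f1_positive, 2))

# Support counts from the report: class 0 and class 1.
n_negative, n_positive = 333587, 158885
class_ratio = n_negative / n_positive
print(round(class_ratio, 2))
```

The ratio comes out near 2, well below the weight of 10 used in training, which fits the pattern of very high recall and low precision on the positive class.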

Let's see which features your model considers important.

In [11]:
feat_to_score = {
    feature: score 
    for feature, score 
    in zip(
        X_train.columns, 
        model.feature_importances_,
    )
}

feat_to_score = dict(
    sorted(
        feat_to_score.items(),
        key=lambda item: item[1],
        reverse=True,
    )
)
feat_to_score

{'section_name': 11.556182308308788,
 'graphical_appearance_name': 10.84679260622565,
 'garment_group_name': 10.289137530609244,
 'product_type_name': 9.40026643703017,
 'perceived_colour_master_name': 7.630184840202655,
 'perceived_colour_value_name': 6.83913727138985,
 'department_name': 6.5077778686167305,
 'index_name': 6.441329394566311,
 'month_sin': 5.827919054139396,
 'age': 5.717436382307535,
 'product_group_name': 5.668551166925661,
 'index_group_name': 5.53892669146241,
 'colour_group_name': 5.2856626102810855,
 'month_cos': 2.4506958379345005}

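The importance is spread fairly evenly across the categorical article attributes, led by `section_name`, `graphical_appearance_name`, and `garment_group_name`, while the customer and time features (`age`, `month_sin`, `month_cos`) contribute less. No embedding features appear in this training set, so adding well-trained user and item embeddings as features is one way a better ranking model might be obtained.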
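To get a coarser view of the scores printed above, they can be summed per feature group. The sketch below copies the values (rounded to three decimals) from the output; the split into "customer/time" and "article" features is an assumption based on the column names.

```python
# Importance scores copied from the output above, rounded to 3 decimals.
feat_to_score = {
    "section_name": 11.556,
    "graphical_appearance_name": 10.847,
    "garment_group_name": 10.289,
    "product_type_name": 9.400,
    "perceived_colour_master_name": 7.630,
    "perceived_colour_value_name": 6.839,
    "department_name": 6.508,
    "index_name": 6.441,
    "month_sin": 5.828,
    "age": 5.717,
    "product_group_name": 5.669,
    "index_group_name": 5.539,
    "colour_group_name": 5.286,
    "month_cos": 2.451,
}

# Assumption: these three columns describe the customer or the time of purchase;
# everything else is treated as an article attribute.
customer_time_features = {"age", "month_sin", "month_cos"}

customer_time_total = sum(
    score for name, score in feat_to_score.items() if name in customer_time_features
)
article_total = sum(
    score for name, score in feat_to_score.items() if name not in customer_time_features
)
print(f"article attributes: {article_total:.1f}")
print(f"customer/time features: {customer_time_total:.1f}")
```

The scores sum to roughly 100, so under this grouping the article attributes account for about 86 of the 100 importance points.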

Finally, you'll save your model.

In [12]:
joblib.dump(model, 'ranking_model.pkl')

['ranking_model.pkl']

### <span style="color:#ff5f27">💾  Upload Model to Model Registry </span>

You'll upload the model to the Hopsworks Model Registry.

In [13]:
# Connect to Hopsworks Model Registry
mr = project.get_model_registry()

Connected. Call `.close()` to terminate connection gracefully.


In [14]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

input_example = X_train.sample().to_dict("records")
input_schema = Schema(X_train)
output_schema = Schema(y_train)
model_schema = ModelSchema(input_schema, output_schema)

ranking_model = mr.python.create_model(
    name="ranking_model", 
    metrics=metrics,
    model_schema=model_schema,
    input_example=input_example,
    description="Ranking model that scores item candidates",
)
ranking_model.save("ranking_model.pkl")

  0%|          | 0/6 [00:00<?, ?it/s]

Model created, explore it at https://snurran.hops.works/p/17527/models/ranking_model/1


Model(name: 'ranking_model', version: 1)

---
## <span style="color:#ff5f27">⏩️ Next Steps </span>

Now you have trained both a retrieval and a ranking model, which will allow you to generate recommendations for users. In the next notebook, you'll take a look at how you can deploy these models with the `HSML` library.