## <span style="color:#ff5f27">👨🏻‍🏫 Train Ranking Model </span>

In this notebook, we will train a ranking model using gradient boosted trees. 

In [1]:
import pandas as pd
from catboost import CatBoostClassifier, Pool
from sklearn.metrics import classification_report, precision_recall_fscore_support
import joblib

In [4]:
import hopsworks

project = hopsworks.login()

fs = project.get_feature_store()

2025-05-21 14:06:12,446 INFO: Initializing external client
2025-05-21 14:06:12,451 INFO: Base URL: https://c.app.hopsworks.ai:443
2025-05-21 14:06:13,869 INFO: Python Engine initialized.

Logged in to project, explore it here https://c.app.hopsworks.ai:443/p/1220788


### get feature groups

In [None]:

users_fg = fs.get_feature_group(
    name="users",
    version=1
)

events_fg = fs.get_feature_group(
    name="events",
    version=1
)

weather_rank_fg = fs.get_feature_group(
    name="weather_ranking",
    version=1
)

no_weather_rank_fg = fs.get_feature_group(
    name="no_weather_ranking",
    version=1
)

## <span style="color:#ff5f27">⚙️ Feature View Creation </span>

In [17]:
# Select features
selected_features_customers = users_fg.select_all()

fs.get_or_create_feature_view( 
    name='users',
    query=selected_features_customers,
    version=1,
)

<hsfs.feature_view.FeatureView at 0x7f36f8550df0>

In [18]:
# Select features
selected_features_articles = events_fg.select_all()

fs.get_or_create_feature_view(
    name='events',
    query=selected_features_articles,
    version=1,
)

<hsfs.feature_view.FeatureView at 0x7f36f8434a90>

In [6]:
NO_WEATHER_SELECTED_FEATURES =['interaction_distance_to_event',
       'event_type','event_city', 'duration',
       'attendance_rate', 'event_indoor_capability', 'user_city',
       'age', 'user_interests','label']

WEATHER_SELECTED_FEATURES =['interaction_distance_to_event', 'title',
       'event_type','event_city', 'duration','weather_condition', 'temperature',
       'attendance_rate', 'event_indoor_capability', 'user_city',
       'user_weather_preference', 'age', 'user_interests','label']

In [None]:
# Select weather features
features_weather_ranking = weather_rank_fg.select(WEATHER_SELECTED_FEATURES)
# Select no weather features
features_no_weather_ranking = no_weather_rank_fg.select(NO_WEATHER_SELECTED_FEATURES)

In [26]:
# Create feature view for weather ranking
feature_view_ranking_weather = fs.get_or_create_feature_view(
    name='weather_ranking',
    query=features_weather_ranking,
    labels=["label"],
    version=1,
)
# Create feature view for no weather ranking
feature_view_ranking_no_weather = fs.get_or_create_feature_view(
    name='no_weather_ranking',
    query=features_no_weather_ranking,
    labels=["label"],
    version=1,
)

Feature view created successfully, explore it at 
https://c.app.hopsworks.ai:443/p/1220788/fs/1208418/fv/no_weather_ranking_2/version/1


---

---

---

---

## <span style="color:#ff5f27">🗄️ Train Data loading </span>

In [15]:
# Get feature views weather ranking
feature_view_ranking_weather = fs.get_feature_view(name='weather_ranking', version=1)


In [16]:
# Get feature views no weather ranking
feature_view_ranking_no_weather = fs.get_feature_view(name='no_weather_ranking', version=1)


In [17]:
# Get training and validation data directly from feature views for weather ranking
weather_X_train, weather_X_val, weather_y_train, weather_y_val = \
    feature_view_ranking_weather.train_test_split(
    test_size=0.1,
    description='Weather ranking training dataset',
)


Finished: Reading data from Hopsworks, using Hopsworks Feature Query Service (4.78s) 



In [54]:
weather_X_train.head()

Unnamed: 0,interaction_distance_to_event,event_type,event_city,duration,weather_condition,temperature,attendance_rate,event_indoor_capability,user_city,user_weather_preference,age,user_interests
0,14.0,Business & Networking,New York,480,Cloudy,16.6,75.228082,True,New York,any,41,tech
1,16921.0,Community & Causes,Paris,240,Cloudy,16.9,79.345494,False,Sydney,outdoor,46,travel
2,10.584158,Health & Wellness,Paris,240,Clear,18.1,12.692646,False,Paris,outdoor,49,food
4,7.0,Technology,New York,240,Cloudy,16.0,81.857555,True,New York,indoor,34,tech literature
5,6581.0,Sports & Fitness,Berlin,180,Cloudy,13.3,76.052787,False,Toronto,outdoor,34,food fashion


In [18]:
# Get training and validation data directly from feature views for no weather ranking
no_weather_X_train, no_weather_X_val, no_weather_y_train, no_weather_y_val = \
    feature_view_ranking_no_weather.train_test_split(
    test_size=0.1,
    description='No-weather ranking training dataset',
)


Finished: Reading data from Hopsworks, using Hopsworks Feature Query Service (4.33s) 



In [32]:
weather_X_train.columns


Index(['interaction_distance_to_event', 'event_type', 'event_city', 'duration',
       'weather_condition', 'temperature', 'attendance_rate',
       'event_indoor_capability', 'user_city', 'user_weather_preference',
       'age', 'user_interests'],
      dtype='object')

In [2]:
# import pandas as pd
# users_df = pd.read_csv('/home/nkama/masters_thesis_project/thesis/partially_synthetic/data/main_data/users.csv')
# events_df = pd.read_csv("/home/nkama/masters_thesis_project/thesis/partially_synthetic/data/main_data/events.csv")
# interactions_df = pd.read_csv('/home/nkama/masters_thesis_project/thesis/partially_synthetic/data/main_data/interactions.csv')



# # Merge user/event features into interactions
# interactions_df = interactions_df.merge(users_df, on="user_id")
# interactions_df = interactions_df.merge(events_df, on="event_id", suffixes=('_user', '_event'))
# from sklearn.model_selection import train_test_split

# NO_WEATHER_SELECTED_FEATURES =['interaction_type',
#        'distance_to_event', 'interaction_label',
#         'gender', 'joinedAt', 'location', 'age',
#       'indoor_outdoor_preference', 'user_interests', 
#        'start_time', 'city', 'yes_count',
#        'maybe_count', 'invited_count', 'no_count', 'total_users', 'category', 
#        'title', 'event_type','event_indoor_capability']

# WEATHER_SELECTED_FEATURES =['interaction_type',
#        'distance_to_event', 'interaction_label',
#         'gender', 'joinedAt', 'location', 'age',
#       'indoor_outdoor_preference',
#        'start_time', 'city', 'yes_count',
#        'maybe_count', 'invited_count', 'no_count', 'total_users',
#        'weather_description', 'category','event_type',
#        'event_indoor_capability', 'temperature_2m_mean', 'precipitation_sum']

# # )
# # Splitting the dataset into features and labels
# weather_X = interactions_df[WEATHER_SELECTED_FEATURES]  # Features
# weather_y = interactions_df['interaction_label']   

# no_weather_X = interactions_df[NO_WEATHER_SELECTED_FEATURES]  # Features
# no_weather_y = interactions_df['interaction_label']                   # Labels

# # Splitting the dataset into training and evaluation sets
# weather_X_train, weather_X_val, weather_y_train, weather_y_val = \
#     train_test_split(weather_X, weather_y, test_size=0.2, random_state=42)

# no_weather_X_train, no_weather_X_val, no_weather_y_train, no_weather_y_val = \
#     train_test_split(no_weather_X, no_weather_y, test_size=0.2, random_state=42)

In [53]:
from catboost import CatBoostClassifier, Pool
from sklearn.metrics import classification_report, precision_recall_fscore_support
from sklearn.metrics import confusion_matrix
import numpy as np

# Final version without text fields (title, user_interests)
def train_catboost_without_text_fields(
    train_df, val_df, train_y, val_y
):
    # Drop the text fields if present
    text_columns = ["title", "user_interests"]
    train_df = train_df.drop(columns=[col for col in text_columns if col in train_df.columns])
    val_df = val_df.drop(columns=[col for col in text_columns if col in val_df.columns])

    # Identify categorical features
    cat_features = train_df.select_dtypes(include=["object", "bool"]).columns.tolist()

    # Create CatBoost Pools
    train_pool = Pool(train_df, train_y, cat_features=cat_features)
    val_pool = Pool(val_df, val_y, cat_features=cat_features)
    # Calculate class weights
    pos_weight = len(train_y[train_y == 0]) / len(train_y[train_y == 1])


    # Train the model
    model = CatBoostClassifier(
        learning_rate=0.2,
        iterations=100,
        depth=10,
        early_stopping_rounds=5,
        use_best_model=True,
        scale_pos_weight=10,  # Handle class imbalance
        verbose=False
    )


    model.fit(train_pool, eval_set=val_pool)

    # Evaluation
    preds = model.predict(val_pool)
    precision, recall, fscore, _ = precision_recall_fscore_support(val_y, preds, average="binary")
    print("\nClassification Report:")
    print(classification_report(val_y, preds))

    metrics = {
        "precision": precision,
        "recall": recall,
        "fscore": fscore,
    }
    
    preds = model.scores = model.predict_proba(val_pool)[:, 1] 
    print("Predicted Class Distribution:", np.unique(preds, return_counts=True))

    # print("\nConfusion Matrix:")
    # print(confusion_matrix(val_y, preds))

    return model, metrics, val_pool

"CatBoost training function excluding title and user_interests."


'CatBoost training function excluding title and user_interests.'

In [58]:
# Use this function to train on your weather / no-weather datasets
weather_model, weather_metrics, weather_val_pool = train_catboost_without_text_fields(
    train_df=weather_X_train,
    val_df=weather_X_val,
    train_y=weather_y_train,
    val_y=weather_y_val
)

#Save the models using Joblib
joblib.dump(weather_model, '/home/nkama/masters_thesis_project/thesis/models/weather_ranking_model.pkl')
print("\nModels saved successfully!")




Classification Report:



              precision    recall  f1-score   support

           0       0.00      0.00      0.00      8984
           1       0.45      1.00      0.62      7381

    accuracy                           0.45     16365
   macro avg       0.23      0.50      0.31     16365
weighted avg       0.20      0.45      0.28     16365

Predicted Class Distribution: (array([0.76678693, 0.77340086, 0.7816716 , ..., 0.97642114, 0.97662405,
       0.97718694]), array([1, 1, 1, ..., 1, 1, 1]))

Models saved successfully!


In [5]:
feat_to_score = {
    feature: score 
    for feature, score 
    in zip(
        weather_X_train.columns, 
        weather_model.feature_importances_,
    )
}

feat_to_score = dict(
    sorted(
        feat_to_score.items(),
        key=lambda item: item[1],
        reverse=True,
    )
)
feat_to_score

{'interaction_type': 76.5778113641695,
 'weather_description': 4.30674008101765,
 'age': 3.237467817179253,
 'category': 2.9370056032853573,
 'maybe_count': 2.7196625311286935,
 'event_indoor_capability': 2.2254651728353663,
 'indoor_outdoor_preference': 2.1986309453821233,
 'temperature_2m_mean': 1.9819653541549123,
 'distance_to_event': 1.712327496290426,
 'total_users': 0.8689565646688637,
 'yes_count': 0.7116664731004836,
 'invited_count': 0.3244445822618073,
 'precipitation_sum': 0.14332783183238257,
 'interaction_label': 0.05452766519541397,
 'no_count': 2.939750863428235e-07,
 'gender': 2.2352268228832082e-07,
 'joinedAt': 0.0,
 'location': 0.0,
 'start_time': 0.0,
 'city': 0.0,
 'event_type': 0.0}

In [59]:

# Use this function to train on your weather / no-weather datasets
no_weather_model, no_weather_metrics, no_weather_val_pool = train_catboost_without_text_fields(
    train_df=no_weather_X_train,
    val_df=no_weather_X_val,
    train_y=no_weather_y_train,
    val_y=no_weather_y_val
)

joblib.dump(no_weather_model, '/home/nkama/masters_thesis_project/thesis/models/no_weather_ranking_model.pkl')
print("\nModels saved successfully!")


Classification Report:



              precision    recall  f1-score   support

           0       0.00      0.00      0.00      8902
           1       0.46      1.00      0.63      7463

    accuracy                           0.46     16365
   macro avg       0.23      0.50      0.31     16365
weighted avg       0.21      0.46      0.29     16365

Predicted Class Distribution: (array([0.70096467, 0.70809418, 0.7224312 , ..., 0.97727748, 0.97746981,
       0.97780965]), array([1, 1, 1, ..., 1, 1, 1]))

Models saved successfully!


In [7]:

feat_to_score = {
    feature: score 
    for feature, score 
    in zip(
        no_weather_X_train.columns, 
        no_weather_model.feature_importances_,
    )
}

feat_to_score = dict(
    sorted(
        feat_to_score.items(),
        key=lambda item: item[1],
        reverse=True,
    )
)
feat_to_score

{'interaction_type': 54.72122052353688,
 'interaction_label': 30.8715625371324,
 'maybe_count': 2.3459777055818587,
 'city': 2.2449991247582415,
 'no_count': 1.6503214458250421,
 'age': 1.5090039137423563,
 'total_users': 1.4746309724073807,
 'distance_to_event': 1.1914178305045156,
 'yes_count': 1.1667068796907545,
 'indoor_outdoor_preference': 1.1514013475146947,
 'category': 0.7191426447743281,
 'title': 0.48369153655103775,
 'invited_count': 0.3897249471402347,
 'start_time': 0.08019859084027804,
 'gender': 0.0,
 'joinedAt': 0.0,
 'location': 0.0,
 'user_interests': 0.0}

In [8]:
# from sklearn.metrics import roc_auc_score, average_precision_score, ndcg_score, precision_score, recall_score
# import numpy as np


# def evaluate_ranking_model_proba(model, val_pool, val_y, k_list=[5, 10]):
#     """
#     Evaluate a CatBoost ranking model using predicted probabilities, not binary class outputs.
#     """

#     # Predict class probabilities (not class labels)
#     proba = model.predict_proba(val_pool)[:, 1]  # Probability for class 1

#     results = {
#         "AUC": roc_auc_score(val_y, proba),
#         "Average Precision (MAP)": average_precision_score(val_y, proba),
#     }

#     # Convert to numpy arrays
#     true_labels = np.array(val_y)
#     predicted_scores = np.array(proba)

#     # Sort by predicted score
#     sorted_indices = np.argsort(predicted_scores)[::-1]
#     sorted_true = true_labels[sorted_indices]

#     for k in k_list:
#         top_k = sorted_true[:k]
#         precision_at_k = np.mean(top_k)
#         recall_at_k = np.sum(top_k) / np.sum(true_labels)
#         ndcg_at_k = ndcg_score(
#             y_true=true_labels.reshape(1, -1),
#             y_score=predicted_scores.reshape(1, -1),
#             k=k
#         )



#         results[f"Precision@{k}"] = precision_at_k
#         results[f"Recall@{k}"] = recall_at_k
#         results[f"NDCG@{k}"] = ndcg_at_k

#     return results

# "✅ Evaluation function ready: scores ranking model using AUC, MAP, Precision@K, Recall@K, and NDCG@K."


'✅ Evaluation function ready: scores ranking model using AUC, MAP, Precision@K, Recall@K, and NDCG@K.'

In [None]:

# # Evaluate weather-aware model
# weather_scores = evaluate_ranking_model_proba(
#     model=weather_model,
#     val_pool=weather_val_pool,
#     val_y=weather_y_val
# )

# # Evaluate no-weather model
# no_weather_scores = evaluate_ranking_model_proba(
#     model=no_weather_model,
#     val_pool=no_weather_val_pool,
#     val_y=no_weather_y_val
# )

# # Compare results
# print("Weather Model Scores:")
# for k, v in weather_scores.items():
#     print(f"{k}: {v:.4f}")

# print("\nNo-Weather Model Scores:")
# for k, v in no_weather_scores.items():
#     print(f"{k}: {v:.4f}")


In [60]:
# Connect to Hopsworks Model Registry
mr = project.get_model_registry()

In [62]:
from hsml.schema import Schema
from hsml.model_schema import ModelSchema

# Create model schema for weather ranking model
input_example = weather_X_train.sample().to_dict("records")
input_schema = Schema(weather_X_train)
output_schema = Schema(weather_y_train)
model_schema = ModelSchema(input_schema, output_schema)

weather_ranking_model = mr.python.create_model(
    name="weather_ranking_model", 
    metrics=weather_metrics,
    model_schema=model_schema,
    input_example=input_example,
    description="Ranking model that scores item candidates",
)
weather_ranking_model.save("/home/nkama/masters_thesis_project/thesis/models/weather_ranking_model.pkl")

  0%|          | 0/6 [00:00<?, ?it/s]

Uploading /home/nkama/masters_thesis_project/thesis/models/weather_ranking_model.pkl: 0.000%|          | 0/498…

Uploading /home/nkama/masters_thesis_project/thesis/notebooks/input_example.json: 0.000%|          | 0/351 ela…

Uploading /home/nkama/masters_thesis_project/thesis/notebooks/model_schema.json: 0.000%|          | 0/1094 ela…

Model created, explore it at https://c.app.hopsworks.ai:443/p/1220788/models/weather_ranking_model/1


Model(name: 'weather_ranking_model', version: 1)

In [63]:
# Create model schema for no weather ranking model  
input_example = no_weather_X_train.sample().to_dict("records")
input_schema = Schema(no_weather_X_train)
output_schema = Schema(no_weather_y_train)
model_schema = ModelSchema(input_schema, output_schema)

no_weather_ranking_model = mr.python.create_model(
    name="no_weather_ranking_model", 
    metrics=no_weather_metrics,
    model_schema=model_schema,
    input_example=input_example,
    description="Ranking model that scores item candidates",
)
no_weather_ranking_model.save("/home/nkama/masters_thesis_project/thesis/models/no_weather_ranking_model.pkl")

  0%|          | 0/6 [00:00<?, ?it/s]

Uploading /home/nkama/masters_thesis_project/thesis/models/no_weather_ranking_model.pkl: 0.000%|          | 0/…

Uploading /home/nkama/masters_thesis_project/thesis/notebooks/input_example.json: 0.000%|          | 0/261 ela…

Uploading /home/nkama/masters_thesis_project/thesis/notebooks/model_schema.json: 0.000%|          | 0/856 elap…

Model created, explore it at https://c.app.hopsworks.ai:443/p/1220788/models/no_weather_ranking_model/1


Model(name: 'no_weather_ranking_model', version: 1)