# Model Training

**Notebook 3: Model Training**

This notebook trains two LightFM models using Optuna for hyperparameter optimization:
1. **Baseline model**: collaborative filtering only (user-item interactions)
2. **Feature-enhanced model**: adds item features (beer style + brewery)

LightFM uses matrix factorization with WARP/BPR loss to learn user and item embeddings.
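
To make the two losses concrete, here is a minimal sketch of the scoring function and of one BPR and one WARP step. It is an illustration only, not LightFM's internal code: the vectors are made-up values, `warp_step` scans the negatives in the order given instead of drawing them at random, and the margin of 1 and the log-rank weight are assumptions modeled on the usual WARP formulation.

```python
import math


def score(user_vec, item_vec, user_bias=0.0, item_bias=0.0):
    """Predicted affinity: dot product of the embeddings plus both biases."""
    dot = sum(u * i for u, i in zip(user_vec, item_vec))
    return dot + user_bias + item_bias


def bpr_loss(pos_score, neg_score):
    """BPR loss for one triple: -log(sigmoid(pos_score - neg_score))."""
    return math.log1p(math.exp(-(pos_score - neg_score)))


def warp_step(pos_score, neg_scores, max_sampled, margin=1.0):
    """Scan negatives until one violates the margin.

    Returns (number_sampled, weight) for the first violating negative,
    or None if none of the first `max_sampled` negatives violates it.
    """
    n_items = len(neg_scores) + 1  # the negatives plus the positive item
    for sampled, neg in enumerate(neg_scores[:max_sampled], start=1):
        if neg > pos_score - margin:
            # Finding a violator quickly suggests the positive is ranked low,
            # so the update gets a larger weight.
            rank_estimate = (n_items - 1) // sampled
            return sampled, math.log(max(1, rank_estimate))
    return None


# Made-up embeddings for one user, one liked beer, and one other beer
user = [0.5, -0.2, 0.1]
liked_beer = [0.9, 0.1, 0.3]
other_beer = [-0.4, 0.6, 0.2]

pos = score(user, liked_beer)
neg = score(user, other_beer)
print("BPR loss:", bpr_loss(pos, neg))
print("WARP step:", warp_step(pos, [neg, 0.0, 0.3], max_sampled=2))
```

The `max_sampled` argument caps how many negatives WARP tries before giving up on a positive, which is why it appears as a tunable hyperparameter later in this notebook.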

**NOTE: This notebook uses the `lightfm-next` package for compatibility with Python 3.10+, which the base LightFM package lacks.**

### Discussion on Feature Selection

The first model uses no item features, making it a pure matrix factorization (MF) model that corresponds to the SVD model and trains only on interaction (rating) data. The second model also learns embeddings for brewery and style metadata, which should carry useful signal because some users prefer beers from particular breweries or styles. The item features should also improve the item-item similarity results. Because our EDA showed that the data is already sparse, we want to avoid feeding the model weak or noisy signals, so we have opted not to engineer or use additional features. For example, we leave out ABV, as it is most likely a noisy variable.

### Init

In [1]:
# Import LightFM and evaluation libraries
import scipy
import numpy as np
import pandas as pd
import lightfm
import seaborn as sns
import matplotlib.pyplot as plt
from lightfm import LightFM
from lightfm import data
from lightfm.evaluation import precision_at_k, recall_at_k, auc_score, reciprocal_rank
from lightfm.cross_validation import random_train_test_split
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import CountVectorizer
from unidecode import unidecode # to deal with accents



In [2]:
# Load preprocessed train/val/test splits + datasets
train = pd.read_parquet('data/train.parquet', engine='pyarrow')
val = pd.read_parquet('data/val.parquet', engine='pyarrow')
test = pd.read_parquet('data/test.parquet', engine='pyarrow')
filtered = pd.read_parquet('data/filtered.parquet', engine='pyarrow')

In [3]:
# Initialize LightFM Dataset object for ID mapping
light_data = lightfm.data.Dataset()

### LightFM Dataset Construction

In [4]:
# First, create the Dataset object; it is fitted with users, items, and item features in the next cell
light_data = lightfm.data.Dataset()

In [5]:
# Next, we'll create a list of all unique item features (styles + breweries)
metadata = list(set(filtered['style'].unique()).union(set(filtered['brewery'].unique())))
# Now, fit the dataset with users, items, and item features from the `filtered` DataFrame
light_data.fit(users=filtered['user'].unique(),
               items=filtered['beer_id'].unique(),
               item_features=metadata)
# Store user and item ID mappings (external ID -> internal index) and their inverses for later use
# mapping() returns (user IDs, user features, item IDs, item features)
user_mappings, _, item_mappings, _ = light_data.mapping()
inv_user_mappings = {v: k for k, v in user_mappings.items()}
inv_item_mappings = {v: k for k, v in item_mappings.items()}

In [6]:
# Build interaction matrices with sample weights from rating bins
# build_interactions returns an (interactions, weights) tuple of sparse matrices
# Train uses weights (0, 0.01, 0.09, 0.9), validation is binary
train_interactions = light_data.build_interactions(train[['user', 'beer_id', 'weight_all']].values)
val_interactions = light_data.build_interactions(val[['user', 'beer_id']].values)

In [7]:
# Construct item feature matrix with helper function
from helpers.training import construct_item_features
features = construct_item_features(light_data, filtered, b=.2, s=.1)



### Optimization with Optuna

We'll use the `optuna` library to optimize our model with intelligent hyperparameter searching and trial pruning.
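
The studies in this notebook are created without an explicit `pruner`, in which case Optuna should fall back to its median pruner. The sketch below shows a simplified version of that median rule for a metric being maximized. It is not Optuna's implementation (which has further options such as warm-up steps), and the history values are invented for illustration.

```python
from statistics import median


def should_prune(current_value, step, finished_histories, n_startup_trials=5):
    """Median rule for a maximized metric.

    finished_histories holds one {step: intermediate_value} dict per earlier trial.
    """
    if len(finished_histories) < n_startup_trials:
        return False  # too few earlier trials to form a meaningful median
    peers = [h[step] for h in finished_histories if step in h]
    if not peers:
        return False  # nobody reported at this step, so nothing to compare with
    return current_value < median(peers)


# Invented validation scores reported at epochs 5 and 10 by three earlier trials
histories = [
    {5: 0.10, 10: 0.12},
    {5: 0.14, 10: 0.15},
    {5: 0.12, 10: 0.16},
]

print(should_prune(0.08, 5, histories, n_startup_trials=3))  # below the median at epoch 5
print(should_prune(0.13, 5, histories, n_startup_trials=3))  # above the median at epoch 5
```

A trial that is pruned stops training early, so the time it would have spent on its remaining epochs goes to more promising configurations.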

**Objective**: Maximize Precision@50 on validation set

We'll optimize over precision@50 because Brewtiful displays many recommendations at once, drawn from diverse style categories. We want to show not just one or two highly relevant items but many relevant items across different categories. A k of 50 should be high enough to ensure that we're optimizing the model to select many relevant items.
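
As a quick illustration of the metric, the sketch below computes precision@k for a single user: the fraction of the top-k recommended items that are known positives. The helper name `precision_at_k_single` and the beer IDs are hypothetical; LightFM's own `precision_at_k` computes this per user, and the per-user values are then typically averaged.

```python
def precision_at_k_single(ranked_items, relevant_items, k):
    """Fraction of the top-k ranked items that are known positives."""
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    return hits / k


# Hypothetical ranking for one user, best recommendation first
ranked = ["ipa_1", "stout_4", "lager_2", "porter_7", "sour_3"]
# Items the user actually interacted with in the validation set
relevant = {"stout_4", "sour_3", "gose_9"}

print(precision_at_k_single(ranked, relevant, k=2))
print(precision_at_k_single(ranked, relevant, k=5))
```

Because the denominator is always k, a user with fewer than k held-out positives can never reach a precision of 1.0, which is one reason precision@50 values tend to look small in absolute terms.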

We'll train models both with and without item features to see if the item features help.

**Hyperparameters to Optimize**:
- no_components: The dimensionality of the latent embeddings.
- learning_schedule: One of ('adagrad', 'adadelta').
- loss: Loss function. We'll test both WARP and BPR (both are popular loss functions, but WARP usually performs better when precision is the goal).
- learning_rate: Initial learning rate for the adagrad learning schedule.
- rho, epsilon: Moving-average coefficient and conditioning parameter for the adadelta learning schedule (sampled instead of learning_rate when adadelta is chosen).
- item_alpha: L2 penalty on item features.
- user_alpha: L2 penalty on user features.
- max_sampled: Maximum number of negative samples used during WARP fitting.
- epochs: Number of epochs to run model training.
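
The search space is conditional: the trial logs in this notebook show `learning_rate` only for adagrad trials, `rho` and `epsilon` only for adadelta trials, and `max_sampled` only for WARP trials. The sketch below shows one way such a space could be declared with Optuna's `suggest_*` methods. It is not the code in `helpers.training.create_objective`, and every range is an assumption chosen for illustration. `FirstChoiceTrial` is a hypothetical stand-in so the sketch runs without a study; inside a real objective, Optuna passes its own trial object.

```python
def suggest_params(trial):
    """Sample one hyperparameter configuration (all ranges are illustrative)."""
    params = {
        "no_components": trial.suggest_int("no_components", 8, 128),
        "learning_schedule": trial.suggest_categorical(
            "learning_schedule", ["adagrad", "adadelta"]),
        "loss": trial.suggest_categorical("loss", ["warp", "bpr"]),
        "item_alpha": trial.suggest_float("item_alpha", 1e-10, 1e-6, log=True),
        "user_alpha": trial.suggest_float("user_alpha", 1e-10, 1e-6, log=True),
    }
    if params["learning_schedule"] == "adagrad":
        # adagrad is driven by an initial learning rate
        params["learning_rate"] = trial.suggest_float(
            "learning_rate", 1e-3, 1.0, log=True)
    else:
        # adadelta uses rho and epsilon instead of a learning rate
        params["rho"] = trial.suggest_float("rho", 0.9, 0.999)
        params["epsilon"] = trial.suggest_float("epsilon", 1e-8, 1e-6, log=True)
    if params["loss"] == "warp":
        # max_sampled only matters for WARP's negative sampling
        params["max_sampled"] = trial.suggest_int("max_sampled", 5, 15)
    params["epochs"] = trial.suggest_int("epochs", 5, 50)
    return params


class FirstChoiceTrial:
    """Hypothetical stand-in for an Optuna trial: lower bound or first choice."""

    def suggest_int(self, name, low, high):
        return low

    def suggest_float(self, name, low, high, log=False):
        return low

    def suggest_categorical(self, name, choices):
        return choices[0]


print(suggest_params(FirstChoiceTrial()))
```

Sampling `learning_rate` only under adagrad keeps the optimizer from spending trials on a parameter that has no effect for the chosen schedule.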

#### No Item Features (Pure MF)

In [8]:
import optuna
from helpers.training import create_objective


# Initialize the study
init_study = optuna.create_study(direction="maximize")

[I 2025-11-08 17:54:49,488] A new study created in memory with name: no-name-e5cedf27-a494-43f1-9ac5-770b05a11f7f


In [9]:
# Hyperparameter tuning for BASELINE model (no item features)
# Parameters tuned: embedding size, learning schedule and rate, regularization, loss function, epochs
# Trial 0 starts from our original hyperparameter values; 150 trials are run in total
# Both WARP (Weighted Approximate-Rank Pairwise) and BPR losses are searched

# Add in our original hyperparameter values as a starting point for Optuna
init_study.enqueue_trial(params={"no_components": 10,
                                 "learning_schedule": 'adagrad',
                                 "loss": 'warp',
                                 "learning_rate": 0.05,
                                 "item_alpha": 1e-10,
                                 "user_alpha": 1e-10,
                                 "max_sampled": 10,
                                 "epochs": 20})


# Run the optimisation
init_study.optimize(create_objective(train_interactions=train_interactions[0],
                                     val_interactions=val_interactions[0],
                                     sample_weight=train_interactions[1],
                                     use_item_features=False, p_k=50, loss=['warp', 'bpr'],
                                     enable_pruning=True, pruning_interval=5),
                    n_trials=150)

# Save the best parameters
best_params = init_study.best_params
for k, v in best_params.items():
    print(k,":",v)

[I 2025-11-08 17:55:37,165] Trial 0 finished with value: 0.1568261981010437 and parameters: {'no_components': 10, 'learning_schedule': 'adagrad', 'loss': 'warp', 'item_alpha': 1e-10, 'user_alpha': 1e-10, 'learning_rate': 0.05, 'max_sampled': 10, 'epochs': 20}. Best is trial 0 with value: 0.1568261981010437.
[I 2025-11-08 17:56:43,948] Trial 1 finished with value: 0.1551637202501297 and parameters: {'no_components': 24, 'learning_schedule': 'adadelta', 'loss': 'warp', 'item_alpha': 3.6388328509503195e-08, 'user_alpha': 2.3479280663625877e-07, 'rho': 0.992726554945493, 'epsilon': 4.6781797360102236e-07, 'max_sampled': 7, 'epochs': 29}. Best is trial 0 with value: 0.1568261981010437.
[I 2025-11-08 17:58:05,561] Trial 2 finished with value: 0.06811082363128662 and parameters: {'no_components': 26, 'learning_schedule': 'adagrad', 'loss': 'bpr', 'item_alpha': 1.69489840914113e-07, 'user_alpha': 1.3200287339041189e-10, 'learning_rate': 0.5511291329402712, 'epochs': 32}. Best is trial 0 with v

no_components : 125
learning_schedule : adadelta
loss : warp
item_alpha : 9.311257093797117e-07
user_alpha : 5.5829799172677285e-08
rho : 0.9049541118958679
epsilon : 8.421159595417498e-08
max_sampled : 14
epochs : 48


In [10]:
# Save best parameters for later use
import pickle
with open('artifacts/best_params_no_item_features.pkl', 'wb') as f:
    pickle.dump(best_params, f)

#### Item Features

This model learns not only identity embeddings for every user and item, but also embeddings for breweries and styles. However, LightFM gives all features equal weighting by default. It's critical to optimize over the weighting of item features, because the default weighting is likely much too high and would hide each item's unique signal behind its metadata features. We want the metadata features to complement the identity features, not overpower them.

**Additional Hyperparameters**:
- b: relative weight of brewery embeddings to identity features
- s: relative weight of style embeddings to identity features
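
To see why the weights matter, recall that LightFM represents an item as the weighted sum of the embeddings of its features. The sketch below compares equal weighting with down-weighted metadata using the starting values b=0.2 and s=0.1. The feature names and two-dimensional embeddings are invented, and how `construct_item_features` actually applies `b` and `s` is not shown in this notebook, so treat the weight dictionary as an assumption.

```python
def item_representation(feature_weights, feature_embeddings):
    """Weighted sum of the embeddings of an item's features."""
    dim = len(next(iter(feature_embeddings.values())))
    combined = [0.0] * dim
    for feature, weight in feature_weights.items():
        for d, value in enumerate(feature_embeddings[feature]):
            combined[d] += weight * value
    return combined


# Invented 2-d embeddings for one beer's identity, brewery, and style features
embeddings = {
    "item:beer_42": [1.0, 0.0],
    "brewery:Example Brewing": [0.0, 1.0],
    "style:IPA": [1.0, 1.0],
}

b, s = 0.2, 0.1  # hypothetical brewery and style weights relative to identity

equal = item_representation({name: 1.0 for name in embeddings}, embeddings)
weighted = item_representation(
    {"item:beer_42": 1.0, "brewery:Example Brewing": b, "style:IPA": s},
    embeddings,
)

print("equal weights:   ", equal)
print("weighted (b, s): ", weighted)
```

With equal weights, the two metadata embeddings together outweigh the identity embedding; with small `b` and `s`, the result stays close to the identity embedding and the metadata only nudges it.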

In [11]:
# Initialize a new Optuna study for the ITEM FEATURES model
item_feature_study = optuna.create_study(direction="maximize")

[I 2025-11-08 20:52:53,005] A new study created in memory with name: no-name-9440778e-1df3-4f90-b778-bc9e7f7087f3


In [12]:
# Hyperparameter tuning for feature-enhanced model
# Additional parameters: b (brewery weight), s (style weight)
# These control how much style/brewery metadata influences embeddings

# Add in our original hyperparameter values as a starting point for Optuna
item_feature_study.enqueue_trial(params={"no_components": 10,
                                         "learning_schedule": 'adagrad',
                                         "loss": 'warp',
                                         "learning_rate": 0.05,
                                         "item_alpha": 1e-10,
                                         "user_alpha": 1e-10,
                                         "max_sampled": 10,
                                         "epochs": 20,
                                         "b": 0.2,
                                         "s": 0.1})


# Run the optimisation
# NOTE: p_k=10 here, while the baseline study above used p_k=50; the two studies'
# best values are only directly comparable if both use the same k
item_feature_study.optimize(create_objective(train_interactions=train_interactions[0],
                                             val_interactions=val_interactions[0],
                                             sample_weight=train_interactions[1],
                                             p_k=10, use_item_features=True,
                                             light_data=light_data, dataset=filtered,
                                             loss=['warp', 'bpr'],
                                             enable_pruning=True, pruning_interval=5),
                            n_trials=150)

# Save the best parameters
item_feature_best_params = item_feature_study.best_params
for k, v in item_feature_best_params.items():
    print(k,":",v)

[I 2025-11-08 20:53:50,874] Trial 0 finished with value: 0.167170450091362 and parameters: {'no_components': 10, 'learning_schedule': 'adagrad', 'loss': 'warp', 'item_alpha': 1e-10, 'user_alpha': 1e-10, 'learning_rate': 0.05, 'max_sampled': 10, 'epochs': 20, 'b': 0.2, 's': 0.1}. Best is trial 0 with value: 0.167170450091362.
[I 2025-11-08 20:56:17,983] Trial 1 finished with value: 0.15205709636211395 and parameters: {'no_components': 23, 'learning_schedule': 'adagrad', 'loss': 'warp', 'item_alpha': 7.778230385925488e-08, 'user_alpha': 1.3340496799811835e-07, 'learning_rate': 0.011022614761051925, 'max_sampled': 8, 'epochs': 49, 'b': 0.13145230121577267, 's': 0.10554380763123505}. Best is trial 0 with value: 0.167170450091362.
[I 2025-11-08 20:57:43,401] Trial 2 finished with value: 0.09924433380365372 and parameters: {'no_components': 54, 'learning_schedule': 'adagrad', 'loss': 'bpr', 'item_alpha': 6.0004869947923865e-09, 'user_alpha': 4.774202949111062e-07, 'learning_rate': 0.74374135

no_components : 75
learning_schedule : adadelta
loss : warp
item_alpha : 9.854944294071034e-10
user_alpha : 9.565341504354942e-07
rho : 0.9444957056553378
epsilon : 2.878992523820034e-07
max_sampled : 11
epochs : 46
b : 0.009064583858183008
s : 0.0005742885432749662


In [13]:
# Save feature-enhanced model parameters
with open('artifacts/best_params_with_item_features.pkl', 'wb') as f:
    pickle.dump(item_feature_best_params, f)
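
When these parameters are loaded again, they can't be passed straight to the model constructor. `epochs` is an argument of LightFM's `fit`, and `b` and `s` belong to the item-feature construction step. The sketch below shows one way to split a saved dictionary accordingly. The helper `split_params` is hypothetical, the values are rounded from the output above, and an in-memory buffer stands in for the pickle file.

```python
import io
import pickle

FIT_ONLY_KEYS = {"epochs"}         # passed to fit(), not to the constructor
FEATURE_WEIGHT_KEYS = {"b", "s"}   # used when building the item feature matrix


def split_params(params):
    """Separate constructor, fit, and feature-weight parameters."""
    model_kwargs, fit_kwargs, feature_kwargs = {}, {}, {}
    for key, value in params.items():
        if key in FIT_ONLY_KEYS:
            fit_kwargs[key] = value
        elif key in FEATURE_WEIGHT_KEYS:
            feature_kwargs[key] = value
        else:
            model_kwargs[key] = value
    return model_kwargs, fit_kwargs, feature_kwargs


# Values rounded from the feature-enhanced study's printed best parameters
example_params = {
    "no_components": 75, "learning_schedule": "adadelta", "loss": "warp",
    "item_alpha": 9.85e-10, "user_alpha": 9.57e-07,
    "rho": 0.9445, "epsilon": 2.88e-07,
    "max_sampled": 11, "epochs": 46, "b": 0.00906, "s": 0.000574,
}

# Round-trip through pickle, using an in-memory buffer instead of a file
buffer = io.BytesIO()
pickle.dump(example_params, buffer)
buffer.seek(0)
loaded = pickle.load(buffer)

model_kwargs, fit_kwargs, feature_kwargs = split_params(loaded)
print("constructor:", model_kwargs)
print("fit:", fit_kwargs)
print("item features:", feature_kwargs)
```

The three dictionaries can then be unpacked into the constructor, the `fit` call, and `construct_item_features` respectively when the final models are trained.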

In [None]:
optuna.importance.get_param_importances(item_feature_study)