# Hybrid Matrix Factorization (LightFM)

Next, we implement a hybrid matrix factorization model with the LightFM library. The previous model only used whether a user reviewed a product and the rating they gave it. However, our dataset contains many other features about products and users that could help with our analysis (a product's store, price, category, etc.). The hybrid model learns embeddings for these additional features on top of the user and item embeddings from plain matrix factorization, which lets it combine the strengths of collaborative filtering and content-based filtering.
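To make the idea concrete, here is a minimal sketch of how a hybrid model of this kind scores a user-item pair: each user and item is represented as the sum of the embeddings of its active features (its own ID plus any metadata tokens), and the score is the dot product of the two representations plus bias terms. The feature names and the random embedding values are illustrative assumptions, not values learned in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
n_components = 4

# Hypothetical feature vocabularies: every user/item has its own ID feature
# plus the metadata tokens attached to it.
user_feature_names = ["user:u1", "user_cat=Video_Games"]
item_feature_names = ["item:i1", "cat=Video_Games", "price_bin=Cheap"]

# One embedding vector and one scalar bias per feature
# (random stand-ins for values a trained model would have learned).
user_feat_emb = rng.normal(size=(len(user_feature_names), n_components))
item_feat_emb = rng.normal(size=(len(item_feature_names), n_components))
user_feat_bias = rng.normal(size=len(user_feature_names))
item_feat_bias = rng.normal(size=len(item_feature_names))


def represent(feature_weights, embeddings, biases):
    """Return (latent vector, bias) as the weighted sum over active features."""
    return feature_weights @ embeddings, feature_weights @ biases


# u1 has both user features active; i1 has all three item features active.
u_vec, u_bias = represent(np.ones(2), user_feat_emb, user_feat_bias)
i_vec, i_bias = represent(np.ones(3), item_feat_emb, item_feat_bias)

score = u_vec @ i_vec + u_bias + i_bias
print(f"score(u1, i1) = {score:.3f}")
```

With only the ID features active, this reduces to plain matrix factorization, which is why the hybrid model can be seen as a generalization of it.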

In [1]:
!pip install lightfm



In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.dates as mdates
import numpy as np
import gzip
import json
import os
import gcsfs
import pickle
import multiprocessing
import gc

# precision_at_k is taken from lightfm.evaluation below, so it is not imported from implicit here
from implicit.evaluation import mean_average_precision_at_k, AUC_at_k
from tqdm import tqdm
from collections import defaultdict
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer
from scipy.sparse import coo_matrix, save_npz, load_npz, vstack, hstack, csr_matrix
from implicit.als import AlternatingLeastSquares
from glob import glob
from matplotlib.ticker import FuncFormatter, MultipleLocator
from lightfm import LightFM
from lightfm.data import Dataset
from lightfm.evaluation import precision_at_k, recall_at_k, auc_score



In [None]:
pd.reset_option('display.float_format')
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)

In [2]:
train_df = pd.read_parquet("gs://amazon-reviews-project/train_df.parquet", engine="pyarrow")

As with the original matrix factorization model, we remove users and products with 5 or fewer reviews, since they add little signal to the data.

In [3]:
user_counts = train_df['user_id'].value_counts()
product_counts = train_df['parent_asin'].value_counts()

In [4]:
min_reviews = 5

valid_users = user_counts[user_counts > min_reviews].index
valid_products = product_counts[product_counts > min_reviews].index

filtered_train_df = train_df[train_df["user_id"].isin(valid_users) & train_df["parent_asin"].isin(valid_products)]

In [5]:
filtered_train_df = filtered_train_df.drop(['timestamp', 'price_missing', 'datetime', 'year'], axis=1)

We also want to calculate confidence scores for each user-product interaction, using the optimized formula from the Matrix Factorization notebook.

In [6]:
alpha = 10.0
filtered_train_df["confidence"] = 1.0 + alpha * np.log1p(filtered_train_df["rating"])
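As a quick worked example of the formula above, the sketch below evaluates `1 + alpha * log1p(rating)` for each possible star rating with `alpha = 10.0`. The rating values are simply the 1 to 5 scale; nothing else is assumed.

```python
import numpy as np

alpha = 10.0
ratings = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# log1p(r) = ln(1 + r), so confidence grows quickly at first and then flattens.
confidence = 1.0 + alpha * np.log1p(ratings)

for r, c in zip(ratings, confidence):
    print(f"rating {r:.0f} -> confidence {c:.6f}")
# rating 1 -> confidence 7.931472
# rating 2 -> confidence 11.986123
# rating 3 -> confidence 14.862944
# rating 4 -> confidence 17.094379
# rating 5 -> confidence 18.917595
```

Because of the logarithm, the gap between a 4-star and a 5-star review (about 1.8) is much smaller than the gap between a 1-star and a 2-star review (about 4.1).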

Now, we have to decide which of the features we can input as user and item metadata. 

In [20]:
filtered_train_df.head()

| | user_id | parent_asin | rating | history | category | title | average_rating | price | store | confidence |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | AFJTRBXMURLHS5EGNXLUHDHIZRFQ | B096WPNG8Q | 5.0 | | Patio_Lawn_and_Garden | Mosser Lee ML0560 Spanish Moss, 250 Cubic Inches | 4.6 | 4.97 | Mosser Lee | 18.917595 |
| 1 | AFJTRBXMURLHS5EGNXLUHDHIZRFQ | B000BQT5IG | 3.0 | B096WPNG8Q | Patio_Lawn_and_Garden | Combat Indoor and Outdoor Ant Killing Gel, 27 ... | 4.4 | 5.48 | Combat | 14.862944 |
| 2 | AFJTRBXMURLHS5EGNXLUHDHIZRFQ | B002FGU2MI | 4.0 | B096WPNG8Q B000BQT5IG | Patio_Lawn_and_Garden | SWIMLINE HYDROTOOLS Mini Venturi Pool & Spa Va... | 3.9 | 16.93 | Swimline | 17.094379 |
| 3 | AEFKF6R2GUSK2AWPSWRR4ZO36JVQ | B073V7N6RQ | 5.0 | | Patio_Lawn_and_Garden | Raisman Rewind Recoil Starter Assembly Compati... | 4.5 | 25.99 | Raisman | 18.917595 |
| 4 | AEFKF6R2GUSK2AWPSWRR4ZO36JVQ | B01J0RIRUS | 4.0 | B073V7N6RQ | Patio_Lawn_and_Garden | AUTOKAY Recoil Pull Start Compatible with Brig... | 4.0 | 16.99 | AUTOKAY | 17.094379 |


History lists the items each user previously reviewed, which makes it a natural way to capture user behavior. Thus, we will encode it as a user metadata feature. Currently, history is stored as a space-separated string of product IDs, so we first convert it into a Python list.

In [7]:
filtered_train_df["history"] = filtered_train_df["history"].fillna("").apply(lambda x: x.split())

In [28]:
filtered_train_df["history"].head()

0                          []
1                [B096WPNG8Q]
2    [B096WPNG8Q, B000BQT5IG]
3                          []
4                [B073V7N6RQ]
Name: history, dtype: object

For item metadata, we have category, title, average_rating, price, and store. However, we want the metadata we include to add signal rather than noise, and we do not want a feature with too many distinct values to unnecessarily increase the complexity of the model. For these reasons, we exclude title (nearly every product has a unique one). Store could potentially add signal, but let's first check that it won't make the data too high-dimensional.

In [15]:
store_counts = filtered_train_df['store'].value_counts()

top_200_stores = store_counts.head(200)

top_200_total = top_200_stores.sum()

other_total = store_counts.iloc[200:].sum()

print(f"The top 200 most popular stores cover {top_200_total / store_counts.sum():.2%} of interactions.")
print(f"The other stores cover {other_total / store_counts.sum():.2%} of interactions.")

The top 200 most popular stores cover 23.10% of interactions.
The other stores cover 76.90% of interactions.


Even the 200 most popular stores cover less than a quarter of interactions, so there are too many unique store names. Adding store as an item metadata feature would significantly increase the dimensionality of the data and drown out signal.

It makes sense that price, average product rating, and category would affect whether a user enjoys a product. However, the LightFM `Dataset` helper indexes metadata as discrete feature tokens, so continuous values are easiest to feed in as categories. Thus, we convert price and average product rating (currently floats) into quantile-based bins.

In [8]:
# convert price column to bins
filtered_train_df["price"] = filtered_train_df["price"].replace(-1, np.nan)

filtered_train_df["price_bins"], price_bin_edges = pd.qcut(filtered_train_df["price"], q=5, 
                                                           labels=["Very Cheap", "Cheap", "Medium", "Expensive", "Very Expensive"],
                                                           retbins=True)

filtered_train_df["price_bins"] = filtered_train_df["price_bins"].cat.add_categories("Missing")
filtered_train_df["price_bins"] = filtered_train_df["price_bins"].fillna("Missing")

In [11]:
price_bin_edges

array([    0.  ,    11.47,    17.75,    25.97,    44.92, 11099.  ])
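The same binning pattern can be seen on a small scale in the sketch below. The ten prices are made-up values used only for illustration; `-1` marks a missing price, as in the notebook's data.

```python
import numpy as np
import pandas as pd

# Hypothetical prices; -1 marks a missing price.
prices = pd.Series([4.97, 5.48, 16.93, 25.99, 16.99, -1, 120.0, 9.5, 33.0, 60.0])
prices = prices.replace(-1, np.nan)

labels = ["Very Cheap", "Cheap", "Medium", "Expensive", "Very Expensive"]

# qcut picks the bin edges so that each bin holds roughly the same number of rows.
bins, edges = pd.qcut(prices, q=5, labels=labels, retbins=True)

# NaN prices fall outside every bin, so give them an explicit category.
bins = bins.cat.add_categories("Missing").fillna("Missing")

print(edges)
print(bins.value_counts())
```

Quantile bins adapt to the skew of the data: in the notebook's output above, four of the five price bins sit below 45 dollars, while the last one stretches all the way to 11,099.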

In [9]:
# rating labels are more positive since rating data is positively skewed (as seen in the bin edges)
rating_labels =  ["Negative", "Positive", "Very Positive", "Near Perfect", "Perfect"]

filtered_train_df['avg_rating_bin'], ratings_bin_edges= pd.qcut(filtered_train_df['average_rating'], q=5, labels=rating_labels, retbins=True)

In [26]:
ratings_bin_edges

array([1. , 4.2, 4.4, 4.5, 4.7, 5. ])

In [23]:
filtered_train_df[["price_bins", "avg_rating_bin"]].head()

| | price_bins | avg_rating_bin |
|---|---|---|
| 0 | Very Cheap | Near Perfect |
| 1 | Very Cheap | Positive |
| 2 | Cheap | Negative |
| 3 | Expensive | Very Positive |
| 4 | Cheap | Negative |


The category column only has 14 categories, so we can keep it as is.

We have cleaned all the features we want to use as item and user metadata. Now, we want to build user and item feature matrices that are compatible with LightFM so we can train a hybrid recommendation model that leverages both interaction data and metadata.

In [10]:
# clean up notebook memory
del train_df, user_counts, product_counts, valid_users, valid_products, price_bin_edges, ratings_bin_edges
gc.collect()

filtered_train_df = filtered_train_df.drop(["price", "average_rating"], axis=1)

In [11]:
# build a mapping between product id and category
asin_to_category = filtered_train_df.drop_duplicates("parent_asin").set_index("parent_asin")["category"].to_dict()

In [12]:
# convert the history list of product ids to list of categories (offers more information for the model)
user_to_cat_tokens = defaultdict(set)

for row in filtered_train_df.itertuples():
    uid = row.user_id
    for asin in row.history:
        cat = asin_to_category.get(asin)
        if pd.notna(cat):
            user_to_cat_tokens[uid].add(f"user_cat={cat}")

user_to_cat_tokens = {uid: sorted(list(tokens)) for uid, tokens in user_to_cat_tokens.items()}

filtered_train_df["user_category_tokens"] = filtered_train_df["user_id"].map(user_to_cat_tokens).fillna("").apply(list)
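The effect of this step is easier to see on toy data. The sketch below repeats the same logic in plain Python; the user IDs, product IDs, and category mapping are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy data: (user_id, history of previously reviewed product IDs).
rows = [
    ("u1", []),
    ("u1", ["p1"]),
    ("u1", ["p1", "p2"]),
    ("u2", ["p3", "p9"]),  # p9 has no known category and is skipped
]
asin_to_category = {
    "p1": "Patio_Lawn_and_Garden",
    "p2": "Video_Games",
    "p3": "Video_Games",
}

# Collect, per user, the set of categories seen anywhere in their history.
user_to_cat_tokens = defaultdict(set)
for user_id, history in rows:
    for asin in history:
        cat = asin_to_category.get(asin)
        if cat is not None:
            user_to_cat_tokens[user_id].add(f"user_cat={cat}")

user_to_cat_tokens = {u: sorted(t) for u, t in user_to_cat_tokens.items()}
print(user_to_cat_tokens)
# {'u1': ['user_cat=Patio_Lawn_and_Garden', 'user_cat=Video_Games'], 'u2': ['user_cat=Video_Games']}
```

Each user ends up with the union of categories across all of their rows, and a user whose histories are all empty gets no entry at all, which is why the notebook falls back to an empty list for such users.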

In [13]:
# cleaning up notebook memory
filtered_train_df = filtered_train_df.drop(["history"], axis=1)

In [14]:
# convert categorical columns to "name=value" tokens (e.g. "cat=Video_Games", "price_bin=Cheap", "avg_rating_bin=Perfect") so that
# every feature value gets its own unique name when the LightFM Dataset object indexes it
# vectorized string operations are used for memory efficiency
filtered_train_df["cat_token"] = "cat=" + filtered_train_df["category"].astype(str)
filtered_train_df["price_token"] = "price_bin=" + filtered_train_df["price_bins"].astype(str)
filtered_train_df["rating_token"] = "avg_rating_bin=" + filtered_train_df["avg_rating_bin"].astype(str)


filtered_train_df["item_feature_tokens"] = (filtered_train_df["cat_token"].str.
                                            cat([filtered_train_df["price_token"], filtered_train_df["rating_token"]],sep="|").str.split("|"))

In [15]:
# cleaning up notebook memory
filtered_train_df = filtered_train_df.drop(["cat_token", "price_token", "rating_token"], axis=1)

In [16]:
# fit Dataset object with user/item IDs and the feature tokens so LightFM can properly index all user and item metadata features
dataset = Dataset()

interaction_user_ids = filtered_train_df["user_id"].astype(str).unique()
interaction_item_ids = filtered_train_df["parent_asin"].astype(str).unique()

dataset.fit(users = interaction_user_ids, items = interaction_item_ids,
    user_features = sorted(set(token for tokens in filtered_train_df["user_category_tokens"] for token in tokens)),
    item_features = sorted(set(token for tokens in filtered_train_df["item_feature_tokens"] for token in tokens))
)

In [23]:
# uses unique user and category metadata tokens to build sparse matrices that can be used as input to the fit() function
all_user_ids = filtered_train_df["user_id"].unique()

user_feature_dict = {
    user_id: user_to_cat_tokens.get(user_id, []) for user_id in all_user_ids
}

user_feature_tuples = list(user_feature_dict.items())
item_feature_tuples = list(filtered_train_df.groupby("parent_asin")["item_feature_tokens"].last().items())

user_features = dataset.build_user_features(user_feature_tuples)
item_features = dataset.build_item_features(item_feature_tuples)
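As a rough picture of what the resulting feature matrices look like, the sketch below builds a small user feature matrix by hand with SciPy. It assumes the `Dataset` defaults (one identity feature per user placed before the metadata features, and rows normalized to sum to 1); the three users and their tokens are hypothetical, and this is an illustration of the layout rather than LightFM's actual implementation.

```python
import numpy as np
from scipy.sparse import csr_matrix, diags, hstack, identity

user_ids = ["u1", "u2", "u3"]
feature_names = ["user_cat=Patio_Lawn_and_Garden", "user_cat=Video_Games"]
user_tokens = {
    "u1": ["user_cat=Patio_Lawn_and_Garden", "user_cat=Video_Games"],
    "u2": ["user_cat=Video_Games"],
    "u3": [],  # no history, so only the identity feature is active
}

# Binary indicator block: one row per user, one column per metadata token.
feat_index = {name: j for j, name in enumerate(feature_names)}
rows, cols = [], []
for i, uid in enumerate(user_ids):
    for tok in user_tokens[uid]:
        rows.append(i)
        cols.append(feat_index[tok])
indicators = csr_matrix(
    (np.ones(len(rows)), (rows, cols)),
    shape=(len(user_ids), len(feature_names)),
)

# Identity block (one column per user) followed by the metadata block.
features = hstack([identity(len(user_ids), format="csr"), indicators]).tocsr()

# Normalize each row so its weights sum to 1.
row_sums = np.asarray(features.sum(axis=1)).ravel()
features = (diags(1.0 / row_sums) @ features).tocsr()

print(features.shape)  # (3, 5)
print(features.toarray().round(3))
```

The identity columns are what let the model keep a per-user embedding in addition to the shared category embeddings.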

In [24]:
interaction_tuples = list(zip(filtered_train_df["user_id"], filtered_train_df["parent_asin"], filtered_train_df["confidence"]))

interactions, weights = dataset.build_interactions(interaction_tuples)

In [25]:
with open("saved_models/lightfm_dataset.pickle", "wb") as f:
    pickle.dump(dataset, f)

In [None]:
# training the model
num_cores = multiprocessing.cpu_count()

model = LightFM(loss="warp", no_components=64, learning_rate=0.05, item_alpha=1e-6, user_alpha=1e-6, random_state=42)

model.fit(interactions = interactions, user_features = user_features, item_features = item_features, sample_weight = weights, epochs = 20,
          num_threads = num_cores, verbose = True)


In [None]:
# save the trained model
with open("saved_models/lightfm_model.pickle", "wb") as f:
    pickle.dump(model, f)

In [45]:
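# save all the matrices for evaluation (same directory the evaluation cells load from)
os.makedirs("saved_csr_matrices", exist_ok=True)
save_npz("saved_csr_matrices/lightfm_user_features.npz", user_features)
save_npz("saved_csr_matrices/lightfm_item_features.npz", item_features)
save_npz("saved_csr_matrices/lightfm_interactions.npz", interactions)
save_npz("saved_csr_matrices/lightfm_weights.npz", weights)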

Now we have a trained LightFM model along with its user feature, item feature, and interaction matrices. We will use these matrices to evaluate the model on both the training and test data with the metrics below (note that recall replaces MAP, since LightFM's evaluation module does not provide MAP):
- Precision at k: Out of the top k items the model recommends for a user, what fraction are actually relevant (items the user interacted with)?
- Recall at k: Out of all the relevant items for a user, what fraction appear in the top k recommendations the model makes?
- AUC: The probability that a randomly chosen relevant item is ranked higher than a randomly chosen irrelevant item. Unlike the two metrics above, it is computed over the full ranking rather than only the top k.
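The three metrics above can be sketched for a single user in a few lines of NumPy. The scores and relevance labels are made up for illustration, and the helper `precision_recall_auc` is a hypothetical name, not a LightFM function; ties between scores are ignored for simplicity.

```python
import numpy as np


def precision_recall_auc(scores, relevant, k):
    """Compute precision@k, recall@k, and AUC for one user.

    scores:   1-D array with the predicted score of every item
    relevant: boolean array, True where the user interacted with the item
    """
    order = np.argsort(-scores)           # item indices, best score first
    hits = relevant[order[:k]].sum()      # relevant items inside the top k
    precision = hits / k
    recall = hits / relevant.sum()

    # AUC: fraction of (relevant, irrelevant) pairs ranked in the right order.
    pos = scores[relevant]
    neg = scores[~relevant]
    auc = (pos[:, None] > neg[None, :]).mean()
    return precision, recall, auc


scores = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0])
relevant = np.array([False, True, False, False, False,
                     False, False, True, False, False])

p, r, a = precision_recall_auc(scores, relevant, k=3)
print(f"Precision@3: {p:.3f}  Recall@3: {r:.3f}  AUC: {a:.4f}")
# Precision@3: 0.333  Recall@3: 0.500  AUC: 0.5625
```

The notebook reports the mean of each metric across the sampled users.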

In [2]:
# read saved dataset object
with open("saved_models/lightfm_dataset.pickle", "rb") as f:
    loaded_data = pickle.load(f)

In [3]:
# read saved model
with open("saved_models/lightfm_model.pickle", "rb") as f:
    loaded_model = pickle.load(f)

In [4]:
user_features = load_npz("saved_csr_matrices/lightfm_user_features.npz")
item_features = load_npz("saved_csr_matrices/lightfm_item_features.npz")
train_interactions = load_npz("saved_csr_matrices/lightfm_interactions.npz")

In [5]:
train_interactions = train_interactions.tocsr()

In [6]:
# read test review data
fs = gcsfs.GCSFileSystem()

all_files = fs.glob("amazon-reviews-project/test_data/*.csv.gz")

test_dfs = []

for file in all_files:
    base = os.path.basename(file).replace(".csv.gz", "")
    category, _ = base.split('.', 1)

    with fs.open(file, 'rb') as f:
        df = pd.read_csv(f, compression='gzip')
        df['category'] = category

    test_dfs.append(df)

test_df = pd.concat(test_dfs, ignore_index=True)

In [7]:
# get the saved user, user-feature, item, and item-feature index mappings from the fitted Dataset object
u_map, ufeat_map, i_map, ifeat_map = loaded_data.mapping()

In [8]:
# only keep users/items present in the training mapping
test_df = test_df[test_df["user_id"].isin(u_map.keys()) & test_df["parent_asin"].isin(i_map.keys())]

In [9]:
# build same confidence matrix as in training
alpha = 10.0
test_df["confidence"] = 1.0 + alpha * np.log1p(test_df["rating"])

In [10]:
# build interaction tuples
test_interaction_tuples = list(zip(test_df["user_id"], test_df["parent_asin"], test_df["confidence"]))

In [11]:
# build interaction matrix for test data
test_interactions, test_weights = loaded_data.build_interactions(test_interaction_tuples)

In [12]:
test_interactions = test_interactions.tocsr()

In [13]:
# evaluate the model on a random sample of 5000 active training users, as evaluating on the entire training dataset is infeasible
num_cores = multiprocessing.cpu_count()

active_user_ids = np.where(train_interactions.getnnz(axis=1) > 5)[0]
subset_user_ids = np.random.choice(active_user_ids, size=5000, replace=False)

train_subset = train_interactions[subset_user_ids, :]
user_features_subset = user_features[subset_user_ids, :]

print("=== Subset Training Data Evaluation (5000 users) ===")

train_prec = precision_at_k(
    loaded_model,
    train_subset,
    user_features=user_features_subset,
    item_features=item_features,
    k=10,
    num_threads=num_cores
).mean()
print(f"Precision@10: {train_prec:.4f}")

train_rec = recall_at_k(
    loaded_model,
    train_subset,
    user_features=user_features_subset,
    item_features=item_features,
    k=10,
    num_threads=num_cores
).mean()
print(f"Recall@10:    {train_rec:.4f}")

train_auc = auc_score(
    loaded_model,
    train_subset,
    user_features=user_features_subset,
    item_features=item_features,
    num_threads=num_cores
).mean()
print(f"AUC:          {train_auc:.4f}\n")

=== Subset Training Data Evaluation (5000 users) ===
Precision@10: 0.0105
Recall@10:    0.0072
AUC:          0.9739



In [15]:
# evaluate the model on a random sample of 5000 active test users, as evaluating on the entire test dataset is infeasible

active_user_ids_test = np.where(test_interactions.getnnz(axis=1) > 5)[0]
subset_user_ids_test = np.random.choice(active_user_ids_test, size=5000, replace=False)

test_subset = test_interactions[subset_user_ids_test, :]
user_features_subset_test = user_features[subset_user_ids_test, :]

print("=== Subset Testing Data Evaluation (5000 users) ===")

test_prec = precision_at_k(
    loaded_model,
    test_subset,
    user_features=user_features_subset_test,
    item_features=item_features,
    k=10,
    num_threads=num_cores
).mean()
print(f"Precision@10: {test_prec:.4f}")

test_rec = recall_at_k(
    loaded_model,
    test_subset,
    user_features=user_features_subset_test,
    item_features=item_features,
    k=10,
    num_threads=num_cores
).mean()
print(f"Recall@10:    {test_rec:.4f}")

test_auc = auc_score(
    loaded_model,
    test_subset,
    user_features=user_features_subset_test,
    item_features=item_features,
    num_threads=num_cores
).mean()
print(f"AUC:          {test_auc:.4f}")

=== Subset Testing Data Evaluation (5000 users) ===
Precision@10: 0.0041
Recall@10:    0.0039
AUC:          0.8668


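On the test data, the LightFM model outperforms the plain matrix factorization model from the previous notebooks on both metrics the two evaluations share (Precision@10 and AUC). MAP@10 and Recall@10 measure different things, so they are not directly comparable.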

Original MF Model:
- Precision@10: 0.0039
- MAP@10:       0.0019
- AUC:          0.5020

LightFM Model:
- Precision@10: 0.0041
- Recall@10:    0.0039
- AUC:          0.8668

Interpretation:
- Precision@10: Only about 0.41% of the items in the top 10 recommendations per user are actually relevant 
- Recall@10: Only 0.39% of all the relevant items for a user appear in the top 10 recommendations the model makes
- AUC: The model is very good at ranking relevant products higher than non relevant products

We can see the LightFM model still struggles to recommend relevant items to users (low precision and recall for the top 10 recommendations). However, given the high AUC score, this model is significantly better at ranking relevant items above irrelevant items in general. This means that, while the model rarely places relevant items in the top 10, it has a much better sense of which products users are not interested in and ranks them lower. This improvement is likely due to the model incorporating user and product metadata.
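A back-of-the-envelope calculation shows why a high AUC and a low Precision@10 can coexist. Since AUC is the probability that a relevant item outranks an irrelevant one, roughly `(1 - AUC)` of the irrelevant items are expected to sit above a typical relevant item. The catalogue size of 100,000 below is a hypothetical round number, not the actual number of products in this dataset.

```python
n_items = 100_000  # hypothetical catalogue size, for illustration only

for name, auc in [("Original MF", 0.5020), ("LightFM", 0.8668)]:
    # Expected number of irrelevant items ranked above a relevant one.
    items_ahead = (1 - auc) * (n_items - 1)
    print(f"{name}: about {items_ahead:,.0f} items ranked ahead of a relevant item")
```

Under this assumption, LightFM moves a typical relevant item from around the middle of the catalogue to roughly the top 13%, which is a large improvement in ranking quality but still far from the top 10.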

Out of the three models I experimented with (popularity-based recommendation, matrix factorization, and LightFM), LightFM was the best at distinguishing items customers are likely to engage with from items they are not interested in. However, it also required more data cleaning and preprocessing, a large amount of compute, and a long training time (more than 20 hours) for only a marginal gain in top-10 precision. In a production setting, that investment may not be worthwhile, and a simple popularity-based recommender may be the more practical choice.