# Data Preprocessing

**Notebook 2: Data Preprocessing**

This notebook cleans the data, creates weighted interactions based on ratings, and performs temporal train/validation/test splits to prevent data leakage.

In [1]:
# Load necessary libraries and data
import pandas as pd
import numpy as np
import scipy.sparse as sp
from scipy.sparse import csr_matrix
reviews = pd.read_parquet('data/reviews.parquet', engine='pyarrow')

### Data Cleaning

For collaborative filtering to work properly, we require at most one interaction per user per item. The production dataset also dropped beers without ABV information to preserve data quality, so we do the same here.

In [2]:
# Clean the data:
# 1. Drop rows with missing values (removes ~82k rows with missing user/brewery/ABV)
# 2. Sort by user, date, beer_id for chronological processing
# 3. Remove duplicate reviews, keeping the latest review per user-beer pair
reviews.dropna(inplace=True)
reviews.sort_values(by=['user', 'date', 'beer_id'], inplace=True)
reviews.drop_duplicates(subset=['user', 'beer_id'], keep='last', inplace=True)

In [3]:
# save pre-filter count 
pre_filter_count = reviews.shape[0]

In [4]:
# Keep beers with at least 5 reviews, then users with at least 5 reviews (single pass)
item_counts = reviews['beer_id'].value_counts()
filtered = reviews.loc[lambda x: x["beer_id"].map(item_counts) >= 5]
user_counts = filtered['user'].value_counts()
# .copy() so that adding columns later does not trigger SettingWithCopyWarning
filtered = filtered.loc[lambda x: x["user"].map(user_counts) >= 5].copy()
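
One caveat with the single pass above: dropping low-activity users can push some beers back under the review threshold. If a strict guarantee matters, the filter can be repeated until nothing changes. The sketch below is an optional variant, not what this notebook runs; `k_core_filter` and the toy frame are hypothetical names used for illustration.

```python
import pandas as pd

def k_core_filter(df, user_col="user", item_col="beer_id", k=5):
    """Repeat the item/user count filter until every remaining user and item has >= k rows."""
    while True:
        n_before = len(df)
        item_counts = df[item_col].value_counts()
        df = df[df[item_col].map(item_counts) >= k]
        user_counts = df[user_col].value_counts()
        df = df[df[user_col].map(user_counts) >= k]
        if len(df) == n_before:  # nothing was removed in this pass
            return df.copy()

# Toy data: users b and c fall below k=2, which then leaves beers 1 and 2
# with a single review each, so user a is removed on the next pass.
toy = pd.DataFrame({
    "user":    ["a", "a", "b", "c", "d", "d", "e", "e"],
    "beer_id": [1,   2,   1,   2,   3,   4,   3,   4],
})
core = k_core_filter(toy, k=2)
print(sorted(core["user"].unique()))
```

On real data the loop usually converges in a handful of passes, at the cost of retaining somewhat fewer reviews than the single-pass filter.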

In [5]:
# After cleaning: 1418686 reviews (94.33% retention)
print(filtered.shape[0])
print(f"Retained {filtered.shape[0] / pre_filter_count:.2%} of reviews after filtering.")

1418686
Retained 94.33% of reviews after filtering.


In [6]:
# Final density: 0.51% of all possible user-beer pairs are observed (sparsity is about 99.49%)
len(filtered) / (len(filtered['user'].unique()) * len(filtered['beer_id'].unique()))

0.005110646019842053

In [7]:
# Filtering made the interaction matrix 7.1x denser than the original data
print(round(0.005110646019842053 / 0.0007194294107235213, 2))

7.1
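
The ratio above is computed from two hand-typed numbers. A small helper makes the same calculation reusable on any frame; this is a sketch that assumes one row per user-beer pair (true after deduplication), and `interaction_density` is a hypothetical name.

```python
import pandas as pd

def interaction_density(df, user_col="user", item_col="beer_id"):
    """Fraction of the user x item grid that holds an observed interaction."""
    n_users = df[user_col].nunique()
    n_items = df[item_col].nunique()
    return len(df) / (n_users * n_items)

# Toy data: 3 observed pairs out of 2 users x 2 beers
toy = pd.DataFrame({"user": ["a", "a", "b"], "beer_id": [1, 2, 1]})
density = interaction_density(toy)
sparsity = 1 - density
print(f"density={density:.2%}, sparsity={sparsity:.2%}")
```

Calling it on both `reviews` (after deduplication) and `filtered` would give the before/after comparison without copying values between cells.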


### Creating Weighted Implicit Feedback

**Why transform explicit ratings into weights?**

LightFM can train on implicit feedback in which each interaction carries a weight reflecting how confident we are in it. We map the 5-point ratings to weights:

| Rating Range | Weight | Interpretation |
|--------------|--------|----------------|
| Below 2 stars | 0.0 | Negative signal (user dislikes this beer) |
| 2 to <3 stars | 0.01 | Weak positive (tried it, not impressed) |
| 3 to <4 stars | 0.09 | Moderate positive (liked it) |
| 4 stars and above | 0.9 | Strong positive (loved it) |

This allows the model to distinguish between good and bad experiences.

Note that we weight negative interactions by 0. LightFM's loss functions treat every interacted-with item as a "positive interaction" and every item without an interaction as a "negative interaction", so a nonzero weight would count a disliked beer as a positive. For beers users tried but didn't like very much, we keep a small weight and interpret it as a weak signal (the user cared enough to try the beer even if they didn't love it).

In [8]:
filtered['weight_all'] = pd.cut(
    filtered['rating'],
    bins=[-float('inf'), 2, 3, 4, float('inf')],
    labels=[0, 0.01, 0.09, .9],
    right=False
).astype(float)
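
Because `right=False` makes each bin closed on the left and open on the right, a rating that sits exactly on a boundary falls into the higher bin. The toy series below (hypothetical values, same bins and labels as the cell above) shows where the edges land.

```python
import pandas as pd

# Ratings chosen to sit on and around the bin edges
ratings = pd.Series([1.5, 2.0, 2.99, 3.0, 4.0, 5.0])

weights = pd.cut(
    ratings,
    bins=[-float("inf"), 2, 3, 4, float("inf")],
    labels=[0, 0.01, 0.09, 0.9],
    right=False,  # intervals are [low, high): 2.0 -> 0.01, 3.0 -> 0.09, 4.0 -> 0.9
).astype(float)

print(pd.DataFrame({"rating": ratings, "weight": weights}))
```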

### Temporal Train/Val/Test Split

**Why temporal splitting?**

Unlike random splitting, temporal splits simulate real-world deployment. The goal of recommender systems is to predict what users will want to try next, based on their interaction history.

**Split strategy:**
- Power users (≥300 reviews): reserve last 200 reviews for val/test (100 each)
- Regular users (<300 reviews): all data goes to training

This approach saves as much data as possible for training (necessary due to sparse data), while reserving validation and test data for users with enough history to meaningfully measure their taste preferences.

In [9]:
# Temporal splitting into train/val/test sets
filtered_sorted = filtered.sort_values(['user', 'date'])

# Initialize empty lists for splits
train, val, test = [], [], []

# Group by user and split based on interaction count
for user, group in filtered_sorted.groupby('user'):
    n = len(group)
    if n >= 300:
        train.append(group.iloc[:-200])
        val.append(group.iloc[-200:-100])
        test.append(group.iloc[-100:])
    else:
        train.append(group)

# Concatenate all user splits into final datasets
train = pd.concat(train)
val = pd.concat(val)
test = pd.concat(test)
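
The per-user loop above is easy to read but can be slow with many users. A vectorized alternative, sketched below under the same split rule, ranks each review from the end of the user's history and builds boolean masks. `temporal_split` and its parameters are hypothetical names; the toy call uses small thresholds so the result can be checked by eye.

```python
import pandas as pd

def temporal_split(df, min_reviews=300, holdout=200, user_col="user", date_col="date"):
    """Hold out each power user's last `holdout` rows, half for val and half for test."""
    df = df.sort_values([user_col, date_col])
    n_reviews = df.groupby(user_col)[user_col].transform("size")
    # 0 = the user's most recent review, 1 = second most recent, ...
    from_end = df.groupby(user_col).cumcount(ascending=False)
    power = n_reviews >= min_reviews
    half = holdout // 2
    test_mask = power & (from_end < half)
    val_mask = power & (from_end >= half) & (from_end < holdout)
    train_mask = ~(test_mask | val_mask)
    return df[train_mask], df[val_mask], df[test_mask]

# Toy data: user a has 6 reviews (a "power user" at min_reviews=6), user b has 3
toy = pd.DataFrame({
    "user": ["a"] * 6 + ["b"] * 3,
    "date": list(pd.date_range("2020-01-01", periods=6))
          + list(pd.date_range("2020-01-01", periods=3)),
    "beer_id": range(9),
})
toy_train, toy_val, toy_test = temporal_split(toy, min_reviews=6, holdout=4)
print(len(toy_train), len(toy_val), len(toy_test))
```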

In [10]:
# view dataset sizes
print(f"train length: {train.shape[0]}")
print(f"val length: {val.shape[0]}")
print(f"test length: {test.shape[0]}")

train length: 1180486
val length: 119100
test length: 119100


In [11]:
# item coverage
train_items = set(np.unique(train['beer_id']))
val_items = set(np.unique(val['beer_id']))
test_items = set(np.unique(test['beer_id']))

print(f"items in train: {len(train_items)}")
print(f"items in validation: {len(val_items)}")
print(f"items in test: {len(test_items)}")

# user coverage
train_users = set(np.unique(train['user']))
val_users = set(np.unique(val['user']))
test_users = set(np.unique(test['user']))

print(f"users in train: {len(train_users)}")
print(f"users in validation: {len(val_users)}")
print(f"users in test: {len(test_users)}")

items in train: 19110
items in validation: 14435
items in test: 14263
users in train: 14492
users in validation: 1191
users in test: 1191


### Cold-Start Items

If any beers appear in val/test but not in the training set, our model won't be able to generate predictions for those items because it only learns embeddings for items it trains on. Therefore, we'll exclude these items when evaluating the model.

In [12]:
# Collect cold-start items (in val/test but never in train) for evaluation exclusion
cold_start_items = (val_items | test_items) - train_items
# Count val/test rows that involve a cold-start item.
# Pass the set itself to isin: wrapping it in a DataFrame makes isin compare against column labels.
print(val['beer_id'].isin(cold_start_items).sum())
print(test['beer_id'].isin(cold_start_items).sum())

# a count of 0 means there is nothing to remove

0
0
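
For reference, here is what the exclusion step looks like when cold-start items do exist. This is a toy sketch with hypothetical frames; it uses a plain Python set for the membership test, which `Series.isin` accepts directly.

```python
import pandas as pd

# Hypothetical toy splits: beers 4 and 5 never appear in training
toy_train = pd.DataFrame({"beer_id": [1, 2, 3]})
toy_val = pd.DataFrame({"beer_id": [2, 4]})
toy_test = pd.DataFrame({"beer_id": [3, 5, 5]})

train_ids = set(toy_train["beer_id"])
cold_start = (set(toy_val["beer_id"]) | set(toy_test["beer_id"])) - train_ids

# Keep only rows whose beer has a learned embedding
val_warm = toy_val[~toy_val["beer_id"].isin(cold_start)]
test_warm = toy_test[~toy_test["beer_id"].isin(cold_start)]

print(sorted(cold_start), len(val_warm), len(test_warm))
```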


### Final Dataset Split

Evaluation framework:

| Split | Interactions | Users | Items |
|-------|-------------|-------|-------|
| **Train** | 1,180,486 | 14,492 | 19,110 |
| **Validation** | 119,100 | 1,191 | 14,435 |
| **Test** | 119,100 | 1,191 | 14,263 |

**Key properties:**
- Balanced val/test sets (same size, same users)
- Temporal ordering preserved within each user's history (train, then val, then test)
- Ready for model training!
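
The ordering property can be verified before saving. The check below is a minimal sketch, assuming `user` and `date` columns as in this notebook; `check_temporal_order` is a hypothetical helper, and reviews sharing the same date across a boundary are treated as acceptable.

```python
import pandas as pd

def check_temporal_order(train, val, test, user_col="user", date_col="date"):
    """Return users whose splits overlap in time (an empty list means the order holds)."""
    last_train = train.groupby(user_col)[date_col].max()
    first_val = val.groupby(user_col)[date_col].min()
    last_val = val.groupby(user_col)[date_col].max()
    first_test = test.groupby(user_col)[date_col].min()
    # Align on the users present in val before comparing
    bad_train_val = first_val < last_train.reindex(first_val.index)
    bad_val_test = first_test.reindex(last_val.index) < last_val
    return sorted(set(bad_train_val[bad_train_val].index)
                  | set(bad_val_test[bad_val_test].index))

# Toy data: user b has a training review dated after their validation review
toy_train = pd.DataFrame({
    "user": ["a", "a", "b"],
    "date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05"]),
})
toy_val = pd.DataFrame({
    "user": ["a", "b"],
    "date": pd.to_datetime(["2020-01-03", "2020-01-03"]),
})
toy_test = pd.DataFrame({
    "user": ["a", "b"],
    "date": pd.to_datetime(["2020-01-04", "2020-01-06"]),
})
violations = check_temporal_order(toy_train, toy_val, toy_test)
print(violations)
```

Running `check_temporal_order(train, val, test)` on the real splits and asserting that the result is empty would turn the bullet above into a tested guarantee.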

In [13]:
# Save all processed datasets for model training
filtered.to_parquet('data/filtered.parquet', engine='pyarrow', index=False)
train.to_parquet('data/train.parquet', engine='pyarrow', index=False)
val.to_parquet('data/val.parquet', engine='pyarrow', index=False)
test.to_parquet('data/test.parquet', engine='pyarrow', index=False)