# Pre-Processing

In [3]:
%%capture
import sys
import os

# Add project root to Python path
project_root = os.path.abspath("..")
if project_root not in sys.path:
    sys.path.append(project_root)
# import packages
from utils.imports import *

In [4]:
# Download latest version of data
path = kagglehub.dataset_download("rdoume/beerreviews", path='beer_reviews.csv', force_download=True)
beer = pd.read_csv(path)
# remove rows that contain any null value (~ is the boolean NOT for a pandas mask)
beer = beer[~beer.isna().any(axis=1)]

Downloading from https://www.kaggle.com/api/v1/datasets/download/rdoume/beerreviews?dataset_version_number=1&file_name=beer_reviews.csv...
100%|██████████| 27.4M/27.4M [00:00<00:00, 68.1MB/s]
Extracting zip of beer_reviews.csv...

#### Multiple reviews for the same item
We found earlier that there were around 14,000 instances of a user reviewing the same beer more than once. Since basic collaborative filtering frameworks store only a single rating per user-item pair, we need to decide how to handle these cases. In our simple model, we'll take the most recent rating as the "true" value. Later we might experiment with different approaches.

In [5]:
# let's make a new dataframe
beer_simple = beer.copy()
# sort by the relevant columns
beer_simple = beer_simple.sort_values(by=['review_profilename', 'beer_beerid', 'review_time'])
# keep only the most recent review for the user-beer key
beer_simple = beer_simple.drop_duplicates(subset=['review_profilename', 'beer_beerid'], keep="last")
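
To make the "keep the most recent review" rule concrete, here is a minimal sketch on a toy dataframe. The data are made up for illustration; only the column names match the real dataset.

```python
import pandas as pd

# Toy reviews (hypothetical data): alice reviews beer 10 twice, at times 100 and 200
toy = pd.DataFrame({
    "review_profilename": ["alice", "alice", "bob"],
    "beer_beerid": [10, 10, 10],
    "review_time": [200, 100, 150],
    "review_overall": [4.5, 3.0, 4.0],
})

key = ["review_profilename", "beer_beerid"]
# Sort so the newest review of each user-beer pair comes last, then keep that row
toy_simple = (
    toy.sort_values(by=key + ["review_time"])
       .drop_duplicates(subset=key, keep="last")
)

# No user-beer pair should appear more than once after deduplication
has_duplicates = toy_simple.duplicated(subset=key).any()
alice_rating = toy_simple.loc[
    toy_simple["review_profilename"] == "alice", "review_overall"
].item()
```

Here `alice_rating` is the rating from the later review (time 200), and `has_duplicates` gives a pandas-only version of the uniqueness check.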


In [6]:
# test using SQL: no user-beer pair should appear more than once
query = """
    SELECT review_profilename, beer_beerid
    FROM beer_simple
    GROUP BY review_profilename, beer_beerid
    HAVING COUNT(*) > 1
    ORDER BY review_profilename, beer_beerid
"""
# use duckdb to query the data
db.sql(query).df()


review_profilename,beer_beerid
(0 rows)


#### Threshold Choice
We're going to compare the performance of models trained with several different thresholds for the minimum review count. There are a few considerations. First, we saw in the EDA that many beers and users have only one review, which is the cold start problem. To build a meaningful collaborative filtering model, we need at least three reviews per user and per item. In the special case of a threshold of exactly 3, we'll have to forgo the validation set entirely so that the training set still has multiple data points per user/item. We'll investigate how different thresholds affect the tradeoff between the coverage of recommended items and the quality of recommendations.

As a baseline, we'll start by requiring at least 5 reviews per user and 3 reviews per item. We chose these thresholds to balance letting the model recommend a large number of items (a less strict item threshold) against providing high-quality recommendations (a stricter user threshold). Later, we'll experiment with different thresholds.

In [7]:
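# create a dataframe for users and beers with the specific thresholds
baseline = beer_simple.copy()
# keep beers with at least 3 reviews
baseline = baseline.groupby('beer_beerid').filter(lambda x: x.shape[0] >= 3)
# keep users with at least 5 reviews; this runs after the beer filter,
# so some beers can drop below 3 reviews again once users are removed
baseline = baseline.groupby('review_profilename').filter(lambda x: x.shape[0] >= 5)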
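
Because the user filter runs after the beer filter, a single pass does not strictly guarantee both thresholds at once. One possible way to enforce them together is to repeat the two filters until nothing changes. The sketch below is an assumption on our part and not what the cell above does; `filter_by_counts` is a hypothetical helper and the toy data are made up.

```python
import pandas as pd

def filter_by_counts(df, user_col, item_col, min_user=5, min_item=3):
    """Repeat both filters until every remaining user and item meets its threshold."""
    while True:
        n_before = len(df)
        # transform("size") gives every row the size of the group it belongs to
        df = df[df.groupby(item_col)[item_col].transform("size") >= min_item]
        df = df[df.groupby(user_col)[user_col].transform("size") >= min_user]
        if len(df) == n_before:
            return df

# Toy interactions (hypothetical): u2 and u3 have one review each,
# and removing them leaves i1 and i2 with a single review
toy = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u3", "u4", "u4", "u5", "u5"],
    "item": ["i1", "i2", "i1", "i2", "i4", "i5", "i4", "i5"],
})
core = filter_by_counts(toy, "user", "item", min_user=2, min_item=2)
```

On this toy data a single pass would keep `u1`, even though both of its beers end up with only one review; the repeated version keeps only `u4` and `u5`. The tradeoff is that repeated filtering can remove noticeably more data than a single pass.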

In [8]:
beer_simple.nunique().loc[['review_profilename','beer_beerid', 'beer_style']]

review_profilename    32908
beer_beerid           49000
beer_style              104
dtype: int64

In [9]:
baseline.nunique().loc[['review_profilename','beer_beerid', 'beer_style']]

review_profilename    14556
beer_beerid           26113
beer_style              104
dtype: int64

In this case, we've retained about 53% of our items (26,113 of 49,000 beers) and about 44% of our users (14,556 of 32,908). Because our model is quite simple, we lose a lot of coverage: almost half of all items can no longer be recommended. To properly address this, we would need to expand our model (e.g. using content-based recommendations with NLP), but since this is a simple project, we'll proceed.
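
The retention figures follow directly from the two `nunique()` outputs. A quick sketch of the arithmetic, with the counts copied by hand from those outputs:

```python
# Counts copied from the nunique() outputs for beer_simple and baseline
users_before, users_after = 32908, 14556
items_before, items_after = 49000, 26113

user_retention = users_after / users_before
item_retention = items_after / items_before

print(f"users retained: {user_retention:.1%}")  # users retained: 44.2%
print(f"items retained: {item_retention:.1%}")  # items retained: 53.3%
```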

#### Data Splitting
Now it's time to split our data. We'll hold out each user's last rating as the test set and try to predict that *next* rating using all their past ratings as training data. The validation set is built the same way from the ratings that remain. This splitting method approximates many real-world use cases, where we might want to predict a user's future behaviour given their actions up to the current time. First, we need to encode the users and items.

In [10]:
# step 1: encode users and items to integer indices
user_encoder = LabelEncoder()
item_encoder = LabelEncoder()
# fit encoders to the values in the set
user_encoder.fit(baseline['review_profilename'])  
item_encoder.fit(baseline['beer_beerid'])
# create a mapping from original values to integer indices
user_map = dict(zip(user_encoder.classes_, user_encoder.transform(user_encoder.classes_)))
item_map = dict(zip(item_encoder.classes_, item_encoder.transform(item_encoder.classes_)))
# add the mapped integer index columns to the baseline dataframe
baseline.loc[:, 'user_idx'] = user_encoder.transform(baseline['review_profilename'])
baseline.loc[:, 'item_idx'] = item_encoder.transform(baseline['beer_beerid'])
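
As a small illustration of what the encoders do, the sketch below fits a `LabelEncoder` on a few made-up names and maps the integer indices back to the original values with `inverse_transform`.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical user names; the real notebook uses review_profilename
names = ["carol", "alice", "bob", "alice"]

encoder = LabelEncoder()
codes = encoder.fit_transform(names)

# classes_ is sorted, so alice -> 0, bob -> 1, carol -> 2
decoded = encoder.inverse_transform(codes)
```

The round trip matters later on: a model works with `user_idx` and `item_idx`, and `inverse_transform` is one way to turn its output back into profile names and beer IDs.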

In [11]:
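%%capture
# generate test set and update training set
# order each user's reviews chronologically so that "last" means most recent
# (the earlier sort was by user, then beer ID, then time)
baseline = baseline.sort_values(by=['review_profilename', 'review_time'])
# save the last review for each user
test = baseline.drop_duplicates(subset=['review_profilename'], keep="last")
# remove last review in dataframe
train = baseline.groupby('review_profilename', group_keys=False).apply(
    lambda x: x.iloc[:-1])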

In [12]:
%%capture
# generate validation set and update training set
# save the last remaining review for each user
validation = train.drop_duplicates(subset=['review_profilename'], keep="last")
# remove last review in dataframe
train = train.groupby('review_profilename', group_keys=False).apply(
    lambda x: x.iloc[:-1])

In [13]:
# test that we've split correctly
baseline.shape[0] == train.shape[0] + validation.shape[0] + test.shape[0]

True
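
The leave-last-out split can be sketched on toy data as follows. This is a simplified variant under stated assumptions: `split_last` is a hypothetical helper, the data are made up, and it uses a boolean mask from `duplicated` in place of the `groupby(...).apply(...)` used in the cells above. Rows are sorted by user and review time first, so "last" means the most recent review.

```python
import pandas as pd

# Toy reviews (hypothetical), deliberately not in chronological order
toy = pd.DataFrame({
    "review_profilename": ["alice"] * 4 + ["bob"] * 3,
    "beer_beerid": [5, 1, 9, 3, 2, 8, 4],
    "review_time": [40, 10, 20, 30, 15, 5, 25],
})

def split_last(df, user_col="review_profilename", time_col="review_time"):
    """Return (remaining, held_out), holding out each user's most recent row."""
    df = df.sort_values(by=[user_col, time_col])
    # duplicated(keep="last") is False only for each user's final row
    is_last = ~df.duplicated(subset=[user_col], keep="last")
    return df[~is_last], df[is_last]

train, test = split_last(toy)
train, validation = split_last(train)
```

For alice, the test row is beer 5 (time 40), the validation row is beer 3 (time 30), and the two earliest reviews stay in training.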

In [14]:
# keep only relevant columns
cols = ['review_profilename','beer_beerid', 'review_overall', 'user_idx', 'item_idx']
train = train[cols]
validation = validation[cols]
test = test[cols]

#### Formatting our Data for CF
Now we need to make a user-item matrix. Our simple model is only going to use the overall rating data. We will filter out of the validation and test sets any items that never appear in the training set, since CF cannot make meaningful predictions for items it has not seen. We'll add these items back after we choose a model and train it on the entire dataset.

In [15]:
# save known items
known_items = set(train['item_idx'])
# remove unknown items from validation
validation = validation[validation['item_idx'].isin(known_items)].copy()
# remove unknown items from test
test = test[test['item_idx'].isin(known_items)].copy()
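
It can be useful to know how much evaluation data this filter removes. A minimal sketch on made-up indices (the variable names `mask`, `dropped_share` and `validation_known` are ours, not the notebook's):

```python
import pandas as pd

# Hypothetical toy splits: item 3 never appears in training
train = pd.DataFrame({"item_idx": [0, 1, 1, 2]})
validation = pd.DataFrame({"user_idx": [0, 1, 2], "item_idx": [1, 3, 2]})

known_items = set(train["item_idx"])
# True where the held-out item was seen during training
mask = validation["item_idx"].isin(known_items)

# Share of validation rows that would be dropped by the filter
dropped_share = 1 - mask.mean()
validation_known = validation[mask].copy()
```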

In [16]:
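def create_sparse_matrix(data, num_users, num_items):
    """Build a (num_users x num_items) COO matrix of overall ratings."""
    ratings = data['review_overall'].values
    rows = data['user_idx'].values
    cols = data['item_idx'].values
    # user-item pairs without a review are not stored (they read back as 0)
    coo = coo_matrix((ratings, (rows, cols)), shape=(num_users, num_items))
    return coo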

In [17]:
# create sparse matrix
n_users = train['user_idx'].max() + 1
n_items = train['item_idx'].max() + 1
sparse = create_sparse_matrix(train, n_users, n_items)
# convert to csr for efficient row ops
ui_csr = sparse.tocsr()
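
To show what the resulting matrix looks like, here is a small sketch with three users and three items (made-up ratings). Only the rated pairs are stored; everything else reads back as 0, which here means "no rating" and not "rated zero".

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical ratings as (user_idx, item_idx, rating) triples
user_idx = np.array([0, 0, 1, 2])
item_idx = np.array([0, 2, 1, 2])
ratings = np.array([4.0, 3.5, 5.0, 2.0])

coo = coo_matrix((ratings, (user_idx, item_idx)), shape=(3, 3))
csr = coo.tocsr()

# Row 0 holds user 0's ratings across all items
user0 = csr[0].toarray().ravel()
# Number of explicitly stored ratings
n_stored = csr.nnz
```

CSR stores each row contiguously, which is why looking up a single user's ratings via `csr[0]` is cheap.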

In [18]:
# save matrices for use in other files
from scipy.sparse import save_npz
save_npz("../data/ui_csr.npz", ui_csr)

data = {
    "train": train,
    "validation": validation,
    "test": test,
    "baseline": baseline
}

with open("../data/dataframes.pkl", "wb") as f:
    pickle.dump(data, f)

In [19]:
# pickle encoders and index maps
with open("../artifacts/user_encoder.pkl", "wb") as f:
    pickle.dump(user_encoder, f)

with open("../artifacts/item_encoder.pkl", "wb") as f:
    pickle.dump(item_encoder, f)

with open("../artifacts/user_map.pkl", "wb") as f:
    pickle.dump(user_map, f)

with open("../artifacts/item_map.pkl", "wb") as f:
    pickle.dump(item_map, f)
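
Other notebooks can read these files back with `load_npz` and `pickle.load`. The sketch below shows the round trip in a temporary directory with toy objects, so it does not touch the real `../data` or `../artifacts` folders; the file names simply mirror the ones used above.

```python
import os
import pickle
import tempfile

import numpy as np
from scipy.sparse import csr_matrix, load_npz, save_npz

# Toy stand-ins for ui_csr and user_map
matrix = csr_matrix(np.array([[4.0, 0.0], [0.0, 3.5]]))
mapping = {"alice": 0, "bob": 1}

with tempfile.TemporaryDirectory() as tmp:
    matrix_path = os.path.join(tmp, "ui_csr.npz")
    map_path = os.path.join(tmp, "user_map.pkl")

    # Save, as in the cells above
    save_npz(matrix_path, matrix)
    with open(map_path, "wb") as f:
        pickle.dump(mapping, f)

    # Load, as a downstream notebook might
    loaded_matrix = load_npz(matrix_path)
    with open(map_path, "rb") as f:
        loaded_mapping = pickle.load(f)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.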