# Model Training
## Simple Neighbourhood Approach (User/Item Collaborative Filtering)
As a first step, we use basic neighbourhood-based collaborative filtering (CF), in both user-based and item-based form, as a simple baseline. We implement both variants as described in Chapter 4 of *Recommender Systems Handbook, 3rd Ed.* (Ricci et al., 2022).

### Pre-Processing

In [88]:
%%capture
import scipy as sp
import scipy.stats as stats
import powerlaw as pl
import kagglehub
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import duckdb as db
import recbole as rb

In [89]:
# Download latest version of data
path = kagglehub.dataset_download("rdoume/beerreviews", path='beer_reviews.csv', force_download = True)
beer = pd.read_csv(path)
# remove rows that contain any null value (~ is the idiomatic boolean negation)
beer = beer[~beer.isna().any(axis=1)]

100%|██████████| 27.4M/27.4M [00:00<00:00, 79.3MB/s]


#### Multiple reviews for the same item
We found earlier that there were around 14,000 instances of a user reviewing the same beer more than once. Since basic collaborative filtering frameworks assume a single rating per user-item pair, we need to specify an approach for dealing with these cases. In our simple model, we'll take the most recent rating as the "true" value. Later we'll experiment with different approaches.

In [90]:
# let's make a new dataframe
beer_simple = beer.copy()
# sort by the relevant columns
beer_simple = beer_simple.sort_values(by=['review_profilename', 'beer_beerid', 'review_time'])
# keep only the most recent review for the user-beer key
beer_simple = beer_simple.drop_duplicates(subset=['review_profilename', 'beer_beerid'], keep="last")


In [91]:
# sanity check with SQL: no user-beer pair should appear more than once
query = """
    SELECT review_profilename, beer_beerid
    FROM beer_simple
    GROUP BY review_profilename, beer_beerid
    HAVING COUNT(*) > 1
    ORDER BY review_profilename, beer_beerid
"""
# use duckdb to query the dataframe; an empty result means no duplicates remain
db.sql(query).df()


Unnamed: 0,review_profilename,beer_beerid
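
The same check can also be written in plain pandas. The block below is a minimal sketch on a small made-up frame (`toy` and its values are hypothetical stand-ins for `beer`), showing the de-duplication step followed by a `duplicated()` test on the user-beer key:

```python
import pandas as pd

# Toy stand-in for `beer` (hypothetical values, for illustration only)
toy = pd.DataFrame({
    "review_profilename": ["ann", "ann", "ann", "bob"],
    "beer_beerid": [1, 1, 2, 1],
    "review_time": [100, 200, 150, 120],
    "review_overall": [3.0, 4.5, 4.0, 2.5],
})

key = ["review_profilename", "beer_beerid"]

# Same de-duplication as above: keep the most recent review per user-beer pair
toy_simple = (
    toy.sort_values(by=key + ["review_time"])
       .drop_duplicates(subset=key, keep="last")
)

# duplicated() flags every repeat of a key after its first occurrence,
# so any() is False exactly when each user-beer pair appears only once
has_duplicates = toy_simple.duplicated(subset=key).any()
print(has_duplicates)
```

On the real data, the equivalent call would be `beer_simple.duplicated(subset=key).any()`.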


#### Threshold Choice
We'll compare model performance across several minimum review-count thresholds. The EDA showed that many beers and users have only one review; this is the cold start problem. To build a meaningful collaborative filtering model, we need at least three reviews per user/item. With a threshold of exactly 3, we have to forgo the validation set entirely: holding out one review for testing leaves only two for training, and holding out another for validation would leave a single data point per user/item. We'll investigate how the threshold affects the trade-off between coverage of recommended items and the quality of recommendations.

In [128]:
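# set minimum review-count thresholds
t = [3,5,10,20,50]
raw_data = []
for i in t:
    # keep users with at least i reviews, then beers with at least i reviews
    # note: this is a single pass, so removing beers can push some users back below i
    df = beer_simple.groupby('review_profilename').filter(lambda x: x.shape[0] >= i)
    df = df.groupby('beer_beerid').filter(lambda x: x.shape[0] >= i)
    # store the filtered dataframe for this threshold
    raw_data.append(df)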
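
The loop above filters users first and beers second, once each. If a strict guarantee is wanted that every remaining user and every remaining beer has at least `i` reviews, the two filters can be repeated until nothing changes. The following is a sketch of that alternative, not what the notebook does; `k_core` and the toy frame are hypothetical names introduced for illustration:

```python
import pandas as pd

def k_core(df, k, user_col="review_profilename", item_col="beer_beerid"):
    """Repeatedly drop users and items with fewer than k rows until stable."""
    while True:
        n_before = len(df)
        # transform("size") broadcasts each group's row count back onto its rows
        df = df[df.groupby(user_col)[user_col].transform("size") >= k]
        df = df[df.groupby(item_col)[item_col].transform("size") >= k]
        if len(df) == n_before:
            return df

# Toy data (hypothetical): beer 3 has a single review, written by u3
toy = pd.DataFrame({
    "review_profilename": ["u1", "u1", "u2", "u2", "u3", "u3"],
    "beer_beerid":        [1,    2,    1,    2,    1,    3],
})

core = k_core(toy, k=2)
print(core)
```

In the toy data, a single pass with `k=2` would drop beer 3 and leave `u3` with only one review; the repeated version then removes `u3` as well. The trade-off is lower coverage in exchange for a consistent minimum per user and per beer.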

In [None]:
raw_data[0].nunique()

brewery_id               3313
brewery_name             3272
review_time           1447995
review_overall             10
review_aroma                9
review_appearance          10
review_profilename      18596
beer_style                104
review_palate               9
review_taste                9
beer_name               24099
beer_abv                  437
beer_beerid             25946
dtype: int64

In [129]:
raw_data[4].nunique()

brewery_id                819
brewery_name              818
review_time           1055884
review_overall              9
review_aroma                9
review_appearance           9
review_profilename       4706
beer_style                103
review_palate               9
review_taste                9
beer_name                4628
beer_abv                  244
beer_beerid              4713
dtype: int64

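Observe the difference in unique `beer_beerid` and `review_profilename` values: at a threshold of 3 we keep 25,946 beers and 18,596 users, while at a threshold of 50 only 4,713 beers and 4,706 users remain. As we saw during the EDA, we lose a lot of coverage with the high-threshold data sets.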
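
To compare all thresholds at once rather than two at a time, the retained share of users, beers and reviews can be tabulated. This is a minimal sketch; `coverage_table` is a hypothetical helper, and the toy frames stand in for `beer_simple` and `raw_data`:

```python
import pandas as pd

def coverage_table(full, filtered, thresholds,
                   user_col="review_profilename", item_col="beer_beerid"):
    """Fraction of users, beers and reviews retained at each threshold."""
    rows = []
    for thr, df in zip(thresholds, filtered):
        rows.append({
            "threshold": thr,
            "users_kept": df[user_col].nunique() / full[user_col].nunique(),
            "beers_kept": df[item_col].nunique() / full[item_col].nunique(),
            "reviews_kept": len(df) / len(full),
        })
    return pd.DataFrame(rows).set_index("threshold")

# Toy data (hypothetical values, for illustration only)
toy_full = pd.DataFrame({
    "review_profilename": ["a", "a", "a", "b", "b", "c"],
    "beer_beerid":        [1,   2,   3,   1,   2,   4],
})
keep = toy_full["review_profilename"].isin(["a", "b"]) & toy_full["beer_beerid"].isin([1, 2])
toy_filtered = [toy_full, toy_full[keep]]

cov = coverage_table(toy_full, toy_filtered, thresholds=[1, 2])
print(cov)
```

On the notebook's data this would be called as `coverage_table(beer_simple, raw_data, t)`.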

#### Data Splitting
Now that we've cleaned our data and built a dataset for each threshold, it's time to split the data. We'll split it "manually": each user's last rating is held out as the test set, and we try to predict that *next* rating using all of the user's earlier ratings as training data. This splitting method approximates many real-world use cases, where we want to predict a user's future behaviour given their actions up to the current time.

In [None]:
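test = []
for df in raw_data:
    # raw_data is ordered by (user, beer, time), so re-sort by time first;
    # otherwise "last" would be the highest beer_beerid, not the latest review
    last = df.sort_values('review_time').drop_duplicates(subset=['review_profilename'], keep="last")
    test.append(last)
    print(test[-1].shape[0])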

18596
14600
10574
7580
4706
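
The matching training set is everything that is not in the test set. One way to get both in a single step is sketched below, assuming the dataframe index is unique (which holds for a frame read with `pd.read_csv` and then filtered); `leave_last_out` and the toy frame are hypothetical names used only for illustration:

```python
import pandas as pd

def leave_last_out(df, user_col="review_profilename", time_col="review_time"):
    """Hold out each user's most recent row as test; the rest is train."""
    # stable sort by time so that "last" per user means "most recent"
    ordered = df.sort_values(time_col, kind="mergesort")
    test = ordered.drop_duplicates(subset=[user_col], keep="last")
    # assumes a unique index, so dropping by label removes exactly the test rows
    train = ordered.drop(index=test.index)
    return train, test

# Toy data (hypothetical), ordered by (user, beer) like the notebook's frames;
# bob's latest review is of beer 1, not of the beer with the highest id
toy = pd.DataFrame({
    "review_profilename": ["ann", "ann", "ann", "bob", "bob"],
    "beer_beerid":        [1,     2,     3,     1,     2],
    "review_time":        [100,   200,   300,   50,    40],
})

train, test_toy = leave_last_out(toy)
print(test_toy)
print(train)
```

Applied to the notebook's data, this would be called once per threshold, for example `leave_last_out(raw_data[0])`.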
