# Model Training
## Simple Neighbourhood Approach (User/Item Collaborative Filtering)
As a first step, we use basic neighbourhood-based collaborative filtering (CF), in both its user-based and item-based forms, as a simple baseline model. We implement both variants as described in Chapter 4 of *Recommender Systems Handbook, 3rd Ed.* (Ricci et al., 2022).
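
To make the idea concrete, here is a minimal sketch of a user-based neighbourhood prediction on a toy rating matrix. It is not the implementation used later in this notebook: the function name `predict_user_based`, the dense NaN-filled matrix, cosine similarity on mean-centred ratings, and the restriction to positively correlated neighbours are all assumptions chosen for illustration.

```python
import numpy as np

def predict_user_based(ratings, user, item, k=2):
    """Predict ratings[user, item] from the k most similar users.

    `ratings` is a dense user x item array with np.nan for missing ratings
    (an illustrative layout, not the one used for the beer data).
    """
    mask = ~np.isnan(ratings)
    # Mean-centre each user's observed ratings; missing entries become 0.
    means = np.nanmean(ratings, axis=1)
    centred = np.where(mask, ratings - means[:, None], 0.0)

    # Cosine similarity between the target user and every user.
    norms = np.linalg.norm(centred, axis=1)
    denom = norms * norms[user]
    sims = np.divide(centred @ centred[user], denom,
                     out=np.zeros_like(denom), where=denom > 0)

    # Candidate neighbours: other users who rated the item and have
    # positive similarity (a simplifying choice made for this sketch).
    candidates = [u for u in range(ratings.shape[0])
                  if u != user and mask[u, item] and sims[u] > 0]
    neighbours = sorted(candidates, key=lambda u: sims[u], reverse=True)[:k]

    weight = sum(sims[u] for u in neighbours)
    if weight == 0:
        return means[user]  # no usable neighbours: fall back to the user's mean
    # Similarity-weighted average of the neighbours' mean-centred ratings.
    offset = sum(sims[u] * centred[u, item] for u in neighbours) / weight
    return means[user] + offset

# Toy data: 3 users x 4 items, user 0 has not rated item 2.
ratings = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, 5.0, 4.0, 1.0],
    [1.0, 2.0, 1.0, 5.0],
])
pred = predict_user_based(ratings, user=0, item=2, k=2)
```

The item-based variant is analogous, with similarities computed between item columns rather than user rows.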

### Pre-Processing

In [37]:
%%capture
import scipy as sp
import scipy.stats as stats
import powerlaw as pl
import kagglehub
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import duckdb as db

In [23]:
# Download latest version of data
path = kagglehub.dataset_download("rdoume/beerreviews", path='beer_reviews.csv', force_download=True)
beer = pd.read_csv(path)
# remove rows that contain any null value
beer = beer.dropna()



In [24]:
#set random seed
np.random.seed(69420)

#### Multiple reviews for the same item
We found earlier that there were around 14,000 instances of a user reviewing the same beer more than once. Basic collaborative filtering frameworks assume a single rating per user-item pair, so we need a rule for handling these cases. In our simple model, we'll take the most recent rating as the "true" value. Later we'll experiment with different approaches.
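
As a small illustration of this rule, the sketch below applies "keep the most recent review" to a toy frame and contrasts it with one possible alternative, averaging the repeated ratings. The toy rows are made up, and the rating column name `review_overall` is an assumption about the dataset's schema; only the key and timestamp column names come from this notebook.

```python
import pandas as pd

# Toy data: "ann" reviewed beer 1 twice (hypothetical values).
toy = pd.DataFrame({
    "review_profilename": ["ann", "ann", "bob", "ann"],
    "beer_beerid": [1, 1, 1, 2],
    "review_time": [200, 100, 150, 300],
    "review_overall": [4.0, 2.5, 3.0, 5.0],  # assumed rating column
})

keys = ["review_profilename", "beer_beerid"]

# Rule used in the simple model: keep the most recent review per (user, beer).
latest = (toy.sort_values("review_time")
             .drop_duplicates(subset=keys, keep="last")
             .sort_values(keys)
             .reset_index(drop=True))

# One possible alternative: average all ratings per (user, beer).
averaged = toy.groupby(keys, as_index=False)["review_overall"].mean()
```

For the repeated pair, `latest` keeps the rating from the later timestamp, while `averaged` blends both ratings into one value.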

In [None]:
# let's make a new dataframe
beer_simple = beer.copy()
# sort by the relevant columns
beer_simple = beer_simple.sort_values(by=['review_profilename', 'beer_beerid', 'review_time'])
# keep only the most recent review for the user-beer key
beer_simple = beer_simple.drop_duplicates(subset=['review_profilename', 'beer_beerid'], keep="last")


In [42]:
# check with SQL that no (user, beer) pair still has more than one review
query = """
    SELECT review_profilename, beer_beerid
    FROM beer_simple
    GROUP BY review_profilename, beer_beerid
    HAVING COUNT(*) > 1
    ORDER BY review_profilename, beer_beerid
"""
# use duckdb to query the DataFrame; an empty result means no duplicates remain
db.sql(query).df()


Unnamed: 0,review_profilename,beer_beerid


#### Threshold Choice
We're going to compare the performance of models trained with several different minimum review-count thresholds. Two considerations constrain the choice. First, we saw from the EDA that many beers and users have only one review, which is the cold-start problem. Second, since we'll be using training, validation, and test sets, we need a threshold of at least 3 (one review available for each split). We'll investigate how different thresholds affect the trade-off between the coverage of recommended items and the quality of recommendations.
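
One way such a threshold could be applied is sketched below. This is an assumption, not the notebook's actual filtering code: the helper `filter_min_reviews` is hypothetical, and it applies the same threshold to both users and beers, repeating the filter because removing sparse users can push a beer below the threshold and vice versa.

```python
import pandas as pd

def filter_min_reviews(df, threshold,
                       user_col="review_profilename",
                       item_col="beer_beerid"):
    """Keep only rows whose user and item both have >= `threshold` reviews.

    Hypothetical helper: repeats the filter until no more rows are dropped.
    """
    out = df
    while True:
        # Number of remaining reviews for each row's user and item.
        user_counts = out[user_col].map(out[user_col].value_counts())
        item_counts = out[item_col].map(out[item_col].value_counts())
        kept = out[(user_counts >= threshold) & (item_counts >= threshold)]
        if len(kept) == len(out):
            return kept
        out = kept

# Toy data (made up): every user and beer here has exactly two reviews.
stable = pd.DataFrame({
    "review_profilename": ["a", "a", "b", "b"],
    "beer_beerid": [1, 2, 1, 2],
})

# Toy data where removals cascade until nothing is left at threshold 2.
cascade = pd.DataFrame({
    "review_profilename": ["a", "a", "b", "c", "c"],
    "beer_beerid": [1, 2, 1, 2, 3],
})

stable_kept = filter_min_reviews(stable, threshold=2)
cascade_kept = filter_min_reviews(cascade, threshold=2)
```

Comparing the number of distinct beers before and after filtering gives a rough view of how much item coverage each threshold gives up.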

In [None]:
t = [3,5,10,20,50]
training_sets
for i in t:
