# Simple Neighbourhood Approach (User-Based CF)
As a first step, we will use basic neighbourhood-based collaborative filtering (CF) techniques, with a simple model as a baseline.

### Pre-Processing

In [1]:
%%capture
import scipy as sp
import scipy.stats as stats
import powerlaw as pl
import kagglehub
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
import duckdb as db
import recbole as rb
import scipy as sp
import surprise as sks
import sklearn as sk
from scipy.sparse import coo_matrix, csr_matrix
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.metrics.pairwise import cosine_similarity

In [2]:
# Download latest version of data
path = kagglehub.dataset_download("rdoume/beerreviews", path='beer_reviews.csv', force_download = True)
beer = pd.read_csv(path)
#remove nulls
beer = beer[-beer.isna().any(axis=1)]

100%|██████████| 27.4M/27.4M [00:00<00:00, 81.5MB/s]


#### Multiple reviews for the same item
We found earlier that there were around 14000 instances of a user reviewing the same beer more than once. Since basic collaborative filtering frameworks only account for a single user-item interaction, we need to specify an approach for dealing with these cases. In our simple model, we'll take the most recent rating as the "true" value. Later we might experiment with different approaches.

In [3]:
# let's make a new dataframe
beer_simple = beer.copy()
# sort by the relevant columns
beer_simple = beer_simple.sort_values(by=['review_profilename', 'beer_beerid', 'review_time'])
# keep only the most recent review for the user-beer key
beer_simple = beer_simple.drop_duplicates(subset=['review_profilename', 'beer_beerid'], keep="last")


In [4]:
# test using SQL
query = "SELECT review_profilename, beer_beerid \
    FROM beer_simple GROUP BY review_profilename, beer_beerid\
    HAVING COUNT(*)>1 \
    ORDER BY review_profilename, beer_beerid"
#use duckdb to query the data
db.sql(query).df()


Unnamed: 0,review_profilename,beer_beerid


#### Threshold Choice
We're going to look at the performance of models using several different thresholds for review counts. There are some different considerations to make. First of all, we saw from the EDA that many beers and users only have one review - this is the cold start problem. To construct a meaningful collaborative filter model, we'll need at least three reviews per user/item. In the special case of using 3 as a threshold, we'll have to forgo the validation set entirely so that we have multiple data points per user/item. We'll investigate how different thresholds affect the tradeoff between coverage of recommended items and the quality of recommendations.

As a baseline, we're going to start with a requiring at least 5 reviews per user and 3 reviews per item. These thresholds have been chosen since we want to balance allowing the model to recommend a large amount of items (less strict item threshold) while providing high-quality recommendations (stricter user threshold). Later, we'll experiment with different thresholds.

In [5]:
#create a dataframe for users and beers with the specific threshold
baseline = beer_simple.copy()
baseline = baseline.groupby('beer_beerid').filter(lambda x: x.shape[0] >= 3)
baseline = baseline.groupby('review_profilename').filter(lambda x: x.shape[0] >= 5)

In [6]:
beer_simple.nunique().loc[['review_profilename','beer_beerid', 'beer_style']]

review_profilename    32908
beer_beerid           49000
beer_style              104
dtype: int64

In [7]:
baseline.nunique().loc[['review_profilename','beer_beerid', 'beer_style']]

review_profilename    14556
beer_beerid           26113
beer_style              104
dtype: int64

In this case, we see that we've retained over half of our items. As our model is quite simple, we'll lose a lot of coverage (almost half of all items). To properly address this, we would need to expand our model (e.g. using content-based recommendations with NLP), but since this is a simple project, we'll proceed.

#### Data Splitting
Now it's time to split our data. We're going to leave the last rating as a test - we'll try and predict a user's *next* rating using all their past ratings as training data. This data splitting method approximates many real-world use cases, where we might want to predict a user's future behaviour given their actions until the current time. First, we need to encode the users and items.

In [8]:
# step 1: encode users and items to integer indices
user_encoder = LabelEncoder()
item_encoder = LabelEncoder()
# fit encoders to the values in the set
user_encoder.fit(baseline['review_profilename'])  
item_encoder.fit(baseline['beer_beerid'])
# create a mapping from original values to integer indices
user_map = dict(zip(user_encoder.classes_, user_encoder.transform(user_encoder.classes_)))
item_map = dict(zip(item_encoder.classes_, item_encoder.transform(item_encoder.classes_)))
# make mapped columns in validation set
baseline.loc[:, 'user_idx'] = user_encoder.transform(baseline['review_profilename'])
baseline.loc[:, 'item_idx'] = item_encoder.transform(baseline['beer_beerid'])

In [9]:
%%capture
# generate test set and update training set
# save the last review for each user
test = baseline.drop_duplicates(subset=['review_profilename'], keep="last")
# remove last review in dataframe
train = baseline.groupby('review_profilename', group_keys=False).apply(
    lambda x: x.iloc[:-1])

In [10]:
%%capture
# generate validation set and update training set
# save the last review for each user
validation = train.drop_duplicates(subset=['review_profilename'], keep="last")
# remove last review in dataframe
train = train.groupby('review_profilename', group_keys=False).apply(
    lambda x: x.iloc[:-1])

In [11]:
# test that we've split correctly
baseline.shape[0] == train.shape[0] + validation.shape[0] + test.shape[0]

True

In [12]:
# keep only relevant columns
cols = ['review_profilename','beer_beerid', 'review_overall', 'user_idx', 'item_idx']
train = train[cols]
validation = validation[cols]
test = test[cols]

#### Formatting our Data for CF
Now we need to make a user-item matrix. Our simple model is only going to use the overall rating data. We will filter out the unseen items in the validation and test sets as CF is incapable of making meaningful predictions on unseen items. We'll add these items back after we choose a model and train it on the entire dataset.

In [13]:
# save known items
known_items = set(train['item_idx'])
# remove unknown items from validation
validation = validation[validation['item_idx'].isin(known_items)].copy()
# remove unknown items from test
test = test[test['item_idx'].isin(known_items)].copy()

In [14]:
def create_sparse_matrix(data, num_users, num_items):
    # create sparse matrix
    ratings = data['review_overall'].values
    rows = data['user_idx']
    cols = data['item_idx']
    coo = coo_matrix((ratings, (rows, cols)), shape=(num_users, num_items))
    return coo

In [15]:
# create sparse matrix
n_users = train['user_idx'].max() + 1
n_items = train['item_idx'].max() + 1
sparse = create_sparse_matrix(train, n_users, n_items)
# convert to csr for efficient row ops
ui_csr = sparse.tocsr()

We're going to mean-centre each user's score to account for the fact that some users tend to be more lenient or harsh. 

In [16]:
# get sum of scores per user
user_scores = ui_csr.sum(axis=1).A1
# count number of user reviews
user_counts = np.diff(ui_csr.indptr)
# get mean vector
user_mean_scores = user_scores / user_counts

Now, it's time to compute the similar matrix. We'll use cosine similarity on our mean-centred data (adjusted-cosine similarity).

In [17]:
cos_sim = cosine_similarity(ui_csr, dense_output=False)

### Training
We're ready to begin training our model. For this simple example, we'll validate our choice of $k$ nearest neighbours, first defining a prediction function. Note that we're using a prediction function and rounding to the nearest .5 instead of approaching our ratings as a classificaiton problem. We use this approach for simplicity.
#### Predict Function

In [18]:
def predict(user, item, train, similarity, user_mean_scores, k, clipped=True):
    """
    Predict ratings for the user-item pair using k nearest neighbours, 
    rounded to the nearest .5 and capped in [0,5]. If item is unseen, default to global mean.
    
    Parameters:
    -user: user index for whom to predict ratings
    -item: item index for which to predict ratings
    -train: training data in sparse matrix format
    -similarity: similarity matrix in sparse format
    -user_mean_scores: mean scores for each user
    -k: number of nearest neighbouts to consider
    
    Returns: ordinal prediction for user-item pair
    """
    # find neighbours of user for item
    nbs = train[:,item].nonzero()[0]
    nbs = nbs[nbs != user] #exclude self
    if nbs.size == 0:
        # no neighbours, return mean score
        return user_mean_scores[user]
    
    # get ratings and mean-centre them
    ratings = train[nbs,item].toarray().flatten()
    ratings -= user_mean_scores[nbs]
    # set limit for k
    k = min(k, nbs.size)

    # get similarity scores
    sims = similarity[user, :].toarray().flatten()
    # get similarity scores for neighbours
    sims = sims[nbs]
    # take k-nearest similarities
    sims = sims[np.argsort(sims)[-k:]]
    # get corresponding k-nearest ratings
    ratings = ratings[np.argsort(sims)[-k:]]
    # compute weighted average
    if np.sum(np.abs(sims)) == 0:
        return user_mean_scores[user]
    weighted_avg = np.dot(sims, ratings) / np.sum(np.abs(sims))
    # recenter
    weighted_avg += user_mean_scores[user]
    if clipped == False:
        return weighted_avg
    # round to nearest .5
    weighted_avg = np.round(weighted_avg * 2) / 2 
    # clip to [0,5]
    weighted_avg = np.clip(weighted_avg, 0, 5)
    # return prediction
    return weighted_avg

#### Evaluating the effect of neighbourhood size
Now we'll evaluate over different choices of k.

In [19]:
def evaluate_k(train, validation, similarity, user_mean_scores, k):
    preds = []
    actuals = []

    for row in validation.itertuples(index=False):
        u = row.user_idx
        i = row.item_idx
        true_r = row.review_overall
        pred = predict(u, i, train, similarity, user_mean_scores, k)
        preds.append(pred)
        actuals.append(true_r)

        RMSE = np.sqrt(mean_squared_error(actuals, preds))
        MAE = mean_absolute_error(actuals,preds)

    return RMSE, MAE

In [22]:
# let's experiment with different values of k
k_values = [3, 5, 10, 20, 50, 100]
train, validation, similarity, user_mean_scores = (
    ui_csr, validation, cos_sim, user_mean_scores
    )
for k in k_values:
    RMSE, MAE = evaluate_k(train, validation, similarity, user_mean_scores, k)
    print(f'The RMSE for user-based CF with {k}-NN is \
          {RMSE}')
    print(f'The MAE for user-based CF with {k}-NN is \
          {MAE}')
    print('\n')

The RMSE for user-based CF with 3-NN is           0.8281375351604974
The MAE for user-based CF with 3-NN is           0.6062849787578912


The RMSE for user-based CF with 5-NN is           0.7948962457823016
The MAE for user-based CF with 5-NN is           0.5767247519566189


The RMSE for user-based CF with 10-NN is           0.7731675049216724
The MAE for user-based CF with 10-NN is           0.555877013054669


The RMSE for user-based CF with 20-NN is           0.766080997095158
The MAE for user-based CF with 20-NN is           0.5465076312264295


The RMSE for user-based CF with 50-NN is           0.7604301674859361
The MAE for user-based CF with 50-NN is           0.5415636400772108


The RMSE for user-based CF with 100-NN is           0.7582446812298936
The MAE for user-based CF with 100-NN is           0.5399732653019377




#### Top-N predictions
We're not only interested in prediction accuracy; we'd also like to know how effective our algorithm is at predicting novel or less popular items. To measure this, we'll look at the top-N items as calculated by taking the $N$ items for a user with the highest predicted ratings. To avoid interminable runtime, we'll use a a fast matrix-based function which precomputes the most similar neighbours for each user (not for a user-item pair).


In [20]:
def predict_top_N_fast(user, train, similarity, user_mean_scores, k=10, N=10):
    """
    Fast top-N prediction using user-based CF with vectorized matrix ops.

    Parameters:
    - user: target user index
    - train: CSR matrix of shape (n_users, n_items)
    - similarity: (n_users, n_users) sparse matrix or dense array
    - user_mean_scores: array of user mean ratings
    - k: number of nearest neighbours
    - N: number of top items to return

    Returns:
    - List of top-N item indices predicted for the user
    """
    # 1. Get top-k most similar users to target user
    user_sims = similarity[user, :].toarray().flatten()
    topk_idx = np.argsort(user_sims)[-k:]
    topk_sims = user_sims[topk_idx]  # shape: (k,)

    # 2. Get their ratings and mean-center
    ratings = train[topk_idx, :].toarray()  # shape: (k, n_items)
    means = user_mean_scores[topk_idx][:, np.newaxis]
    ratings_centered = ratings - means  # shape: (k, n_items)

    # 3. Weighted sum of centered ratings
    numerator = topk_sims @ ratings_centered  # shape: (n_items,)
    denominator = np.sum(np.abs(topk_sims)) + 1e-8  # to avoid div by 0

    preds = user_mean_scores[user] + numerator / denominator  # shape: (n_items,)

    # 4. Mask out already rated items
    rated_items = train[user, :].nonzero()[1]
    preds[rated_items] = -np.inf  # exclude known ratings

    # 5. Return top-N items
    top_N_items = np.argsort(preds)[-N:][::-1]
    return top_N_items.tolist()

In [23]:
k_values = [3, 5, 10, 20]
N = 10

for k in k_values:
    recommended_beers = set()
    for user in range(ui_csr.shape[0]):
        preds = predict_top_N_fast(user, train, similarity, user_mean_scores, k, N)
        # update the set of recommended beers
        recommended_beers.update(preds)
    print(f'For k = {k} nearest neighbours and top-{N} beers recommended:')
    print(f'The number of recommended beers over all users is {len(recommended_beers)}')
    print(f'This corresponds to {len(recommended_beers) / ui_csr.shape[1] * 100:.2f}% \
        of all beers in the training matrix.')

For k = 3 nearest neighbours and top-10 beers recommended:
The number of recommended beers over all users is 6497
This corresponds to 24.88%         of all beers in the training matrix.
For k = 5 nearest neighbours and top-10 beers recommended:
The number of recommended beers over all users is 6128
This corresponds to 23.47%         of all beers in the training matrix.
For k = 10 nearest neighbours and top-10 beers recommended:
The number of recommended beers over all users is 5035
This corresponds to 19.28%         of all beers in the training matrix.
For k = 20 nearest neighbours and top-10 beers recommended:
The number of recommended beers over all users is 3814
This corresponds to 14.61%         of all beers in the training matrix.


It's worth mentioning that this experiment does not tell the entire story when it comes to recommending items from a catalog. This simple user-based CF model is only capable of recommending beers which have been reviewed at least three times - this already excludes the many beers which have only been reviewed once or twice. This is in addition to the fact that the system is only likely to recommend a fraction of the total available beers as demonstrated by the above code. We can investigate how recommendations differ given different neighbourhood sizes.

In [24]:
def get_top_N_beers(user, train, similarity, user_mean_scores, baseline, k = 3, N = 5):
    #predict top 5 beers with 3 neighbours
    preds = predict_top_N_fast(user, train, similarity, user_mean_scores, k, N)
    # get item dict. mappings
    beer_ids = item_encoder.inverse_transform(preds)
    # get some aggregate statistics
    top_N = baseline[baseline['beer_beerid'].isin(beer_ids)].groupby([
        'beer_beerid', 'beer_name', 'brewery_name', 'beer_style'], group_keys=False).agg(
            {'review_overall': ['mean', 'count', 'std'],
            'beer_abv': ['mean']}
        ).sort_values(by=('review_overall', 'count'), ascending=False)
    return top_N

In [25]:
# set params
user, k, N = 69, 3, 10
get_top_N_beers(user, ui_csr, similarity, user_mean_scores, baseline, k=k, N=N)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,review_overall,review_overall,review_overall,beer_abv
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,mean,count,std,mean
beer_beerid,beer_name,brewery_name,beer_style,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
276,Sierra Nevada Pale Ale,Sierra Nevada Brewing Co.,American Pale Ale (APA),4.24813,2406,0.529726,5.6
665,Anchor Liberty Ale,Anchor Brewing Company,American Pale Ale (APA),4.09607,1374,0.540642,6.0
3842,Trappistes Rochefort 6,Brasserie de Rochefort,Belgian Strong Dark Ale,4.138889,756,0.51135,7.5
21505,Lammin Kataja Olut,Lammin Sahti Oy,Sahti,3.150485,103,0.756783,7.0
21519,Smoky Mountain Porter,Natty Greene's Pub & Brewing Co.,American Porter,4.208333,12,0.257464,5.1
21498,Dog & Pony Double Dry-hopped Imperial IPA,Maritime Pacific Brewing Company,American Double / Imperial IPA,3.9,5,0.223607,7.5
21502,Dusty Trail Pale,Amnesia Brewing,American Pale Ale (APA),3.3,5,0.67082,5.2
21499,Julöl,Grebbestad Bryggeri,Vienna Lager,3.625,4,0.478714,5.3
21515,Haake Beck Pils,Brauerei Beck & Co.,German Pilsener,3.25,4,0.645497,4.9
21514,Premium Pilsener,Brauerei Herrenhausen KG,German Pilsener,3.5,3,0.0,4.9


In [26]:
# set params
user, k, N = 69, 20, 10
get_top_N_beers(user, ui_csr, similarity, user_mean_scores, baseline, k=k, N=N)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,review_overall,review_overall,review_overall,beer_abv
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,mean,count,std,mean
beer_beerid,beer_name,brewery_name,beer_style,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
1093,Two Hearted Ale,"Bell's Brewery, Inc.",American IPA,4.320482,2529,0.50668,7.0
276,Sierra Nevada Pale Ale,Sierra Nevada Brewing Co.,American Pale Ale (APA),4.24813,2406,0.529726,5.6
14916,Hop Wallop,Victory Brewing Company,American Double / Imperial IPA,3.987629,1738,0.606045,8.5
1118,Chocolate Stout,Rogue Ales,American Stout,4.115407,1733,0.59269,6.0
18862,Burton Baton,Dogfish Head Brewery,American Double / Imperial IPA,4.010145,1380,0.557992,10.0
6947,Cuvée Van De Keizer Blauw (Blue),Brouwerij Het Anker,Belgian Strong Dark Ale,4.14592,723,0.594126,11.0
2233,Summit Winter Ale,Summit Brewing Company,Winter Warmer,3.791489,235,0.494772,6.1
3646,Urthel Hibernus Quentum,De Leyerth Brouwerijen (Urthel),Tripel,4.038462,221,0.543236,9.0
27265,Bell's Wheat Love,"Bell's Brewery, Inc.",Wheatwine,3.983553,152,0.547171,7.7
35405,Victor's MemoriAle Altbier,Two Brothers Brewing Company,Altbier,4.095238,42,0.508922,7.8


In [27]:
# set params
user, k, N = 420, 3, 10
get_top_N_beers(user, ui_csr, similarity, user_mean_scores, baseline, k=k, N=N)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,review_overall,review_overall,review_overall,beer_abv
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,mean,count,std,mean
beer_beerid,beer_name,brewery_name,beer_style,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
226,Great Lakes Edmund Fitzgerald Porter,Great Lakes Brewing Company,American Porter,4.322813,1600,0.466191,5.8
142,Ommegang (Abbey Ale),Brewery Ommegang,Dubbel,4.040414,1497,0.591441,8.5
1385,Delirium Tremens,Brouwerij Huyghe,Belgian Strong Pale Ale,4.022912,1353,0.589786,8.5
228,Great Lakes Dortmunder Gold,Great Lakes Brewing Company,Dortmunder / Export Lager,4.290899,868,0.499479,5.8
760,Weihenstephaner Kristallweissbier,Bayerische Staatsbrauerei Weihenstephan,Kristalweizen,4.168576,611,0.512366,5.4
773,Goudenband,Brouwerij Liefmans,Flanders Oud Bruin,4.133333,465,0.576541,8.0
62328,Estate Homegrown Wet Hop Ale,Sierra Nevada Brewing Co.,American IPA,4.134085,399,0.441493,6.7
38366,Samuel Adams Dunkelweizen,Boston Beer Company (Samuel Adams),Dunkelweizen,3.68,275,0.553159,5.1
67262,Longshot Blackened Hops,Boston Beer Company (Samuel Adams),American Black Ale,3.924731,186,0.467577,7.0
67267,Longshot Friar Hop Ale,Boston Beer Company (Samuel Adams),Belgian IPA,3.591892,185,0.561141,9.0


In [28]:
# set params
user, k, N = 420, 20, 10
get_top_N_beers(user, ui_csr, similarity, user_mean_scores, baseline, k=k, N=N)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,review_overall,review_overall,review_overall,beer_abv
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,mean,count,std,mean
beer_beerid,beer_name,brewery_name,beer_style,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
131,Ayinger Celebrator Doppelbock,Privatbrauerei Franz Inselkammer KG / Brauerei Aying,Doppelbock,4.293592,2013,0.513722,6.7
2751,Racer 5 India Pale Ale,Bear Republic Brewing Co.,American IPA,4.229022,1871,0.472854,7.0
226,Great Lakes Edmund Fitzgerald Porter,Great Lakes Brewing Company,American Porter,4.322813,1600,0.466191,5.8
10325,Péché Mortel (Imperial Stout Au Cafe),Brasserie Dieu Du Ciel,American Double / Imperial Stout,4.264685,1396,0.528709,9.5
646,Westmalle Trappist Tripel,Brouwerij Westmalle,Tripel,4.196698,1393,0.549363,9.5
19216,Oak Aged Yeti Imperial Stout,Great Divide Brewing Company,Russian Imperial Stout,4.082671,1385,0.551066,9.5
11922,Titan IPA,Great Divide Brewing Company,American IPA,4.137327,1227,0.494349,7.1
25755,Heavy Seas - Loose Cannon (Hop3 Ale),Heavy Seas Beer,American IPA,4.073559,1006,0.47322,7.25
5428,New Holland Dragon's Milk Oak Barrel Ale,New Holland Brewing Company,American Stout,3.76247,842,0.654713,10.0
1287,Bell's Porter,"Bell's Brewery, Inc.",American Porter,3.985976,820,0.517801,5.6


We can see clearly that increasing the neighbourhood size tends to recommend more popular beers. This effect is commonly observed when implementing user-based CF. We can think critically about why this may be. When generating recommendations from a small number of neighbours, we only take the recommendations from the most similar users - and their "votes" towards recommendations have a very large impact on the predicted items. Therefore, a more rare and polarizing beer may be recommended to a user when using a small value for k if that user's closest neighbours give it a high score. Conversely, including a larger number of neighbours when generating predictions tends to push predicted scores towards the mean; more rare and polarizing beers will tend to get pushed out by popular and highly-rated beers. The "signal" from similar users will be lost. As we saw from the EDA, there are a relatively small number of highly-rated and highly-reviewed beers. Continuing to increase the value of k will lead to those beers being recommended:

In [29]:
# set params, make k big
user, k, N = 420, 500, 10
get_top_N_beers(user, ui_csr, similarity, user_mean_scores, baseline, k=k, N=N)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,review_overall,review_overall,review_overall,beer_abv
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,mean,count,std,mean
beer_beerid,beer_name,brewery_name,beer_style,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
131,Ayinger Celebrator Doppelbock,Privatbrauerei Franz Inselkammer KG / Brauerei Aying,Doppelbock,4.293592,2013,0.513722,6.7
2751,Racer 5 India Pale Ale,Bear Republic Brewing Co.,American IPA,4.229022,1871,0.472854,7.0
141,Hennepin (Farmhouse Saison),Brewery Ommegang,Saison / Farmhouse Ale,4.243311,1794,0.509828,7.7
3457,Three Philosophers Belgian Style Blend (Quadrupel),Brewery Ommegang,Quadrupel (Quad),3.981173,1620,0.564645,9.8
226,Great Lakes Edmund Fitzgerald Porter,Great Lakes Brewing Company,American Porter,4.322813,1600,0.466191,5.8
16403,"Smuttynose IPA ""Finest Kind""",Smuttynose Brewing Company,American IPA,4.124649,1424,0.577223,6.9
10325,Péché Mortel (Imperial Stout Au Cafe),Brasserie Dieu Du Ciel,American Double / Imperial Stout,4.264685,1396,0.528709,9.5
6518,Dale's Pale Ale,Oskar Blues Grill & Brew,American Pale Ale (APA),4.070605,1388,0.544409,6.5
19216,Oak Aged Yeti Imperial Stout,Great Divide Brewing Company,Russian Imperial Stout,4.082671,1385,0.551066,9.5
11922,Titan IPA,Great Divide Brewing Company,American IPA,4.137327,1227,0.494349,7.1


All of these beers are among the most popular and well-liked beers on the site. This demonstrates why it's not beneficial to choose a large value of k simply because it results in a low RMSE; in the context of an item catalog, a recommender should be capable of providing personalized and novel recommendations.

# Learning Model (SVD)
Now that we've implemented naive user-based CF, we'll implement a more advanced model - the SVD model.

INSERT DESCRIPTION HERE

One disadvantage of the SVD model is that it can't generalize to unseen items - predictions rely on item and user factors which are learned during model training. When training our model, 

In [None]:
class SVD():

    def __init__(self, k=50, lr=0.005, reg=0.02, patience=100, epsilon=10**(-3)):
        # initialize hyperparameters
        self.k = k
        self.lr = lr
        self.reg = reg
        self.patience = patience
        self.epsilon = epsilon
        # parameters will be set during fit
        self.B_u = None
        self.B_i = None
        self.P = None
        self.Q = None
        self.RMSE = 0
        self.random_seed = 420

    def fit(self, train, validation, verbose=True):
        """
        Computes the SVD model parameters using stochastic gradient descent.
        Ends after patience # of epochs have passed or the relative RMSE improvement threshold is < epsilon

        Args:
            train: sparse matrix of training data
            validation: sparse matrix of validation data
            k: number of latent factors
            lr: learning rate for SGD
            reg: regularization hyperparameter for learned parameters
            patience: maximum number of epochs to run SGD
            epsilon: Relative RMSE improvement threshold cutoff
        """
        # use random seed
        np.random.seed(self.random_seed)
        # make sure train is in COO format
        train = train.tocoo()
        # count users and items
        n_users = train.shape[0]
        n_items = train.shape[1]
        # get global mean rating
        mu = train.data.mean() 
        # initialize biases
        self.B_u = np.zeros(n_users)
        self.B_i = np.zeros(n_items)
        # initialize factors
        self.P = np.random.normal(loc=0.0, scale=0.1, size=(n_users, k))
        self.Q = np.random.normal(loc=0.0, scale=0.1, size=(n_items, k))
        # store all interactions
        interactions = list(zip(
            train.row, #get rows
            train.col, #get cols
            train.data #get ratings
        ))
        # initialize RMSE counter
        RMSE_past = 0.001
        
        #loop until relative RMSE improvement threshold is < epsilon or patience runs out
        for t in range(self.patience):
            # randomize order for SGD
            np.random.shuffle(interactions)
            # loop over all interactions and update params
            for u, i, rating in interactions:
                self.__update(mu, u, i, rating)
            # get RMSE
            RMSE = self.__get_val_RMSE(mu, self.B_u, self.B_i, self.P, self.Q, validation)
            # calculate improvement threshold
            threshold = np.abs(RMSE-RMSE_past) / RMSE_past
            t += 1
            # break if RMSE stops improving
            if threshold < self.epsilon:
                self.RMSE = RMSE
                print(f'Stopped after {t} iterations')
                print(f'Final RMSE is: {RMSE}')
                break
            # update RMSE
            RMSE_past = RMSE
            if verbose == True:
                print(f'Iteration: {t}')
                print(f'current validation RMSE: {RMSE}')
        return
    
    def __update(self, mu, u, i, rating):
        """
        Update SVD model parameters in a pass of SGD.
        Args:
            -mu: global mean of ratings
            -u: user index
            -i: item index
            -rating: user u's rating of item i
        """
        #predict rating
        e = rating - (mu + self.B_u[u] + self.B_i[i] + self.P[u] @ self.Q[i])
        #make parameter updates
        self.B_u[u] += self.lr * (e-self.reg*self.B_u[u])
        self.B_i[i] += self.lr * (e-self.reg*self.B_i[i])
        self.Q[i] += self.lr * (e*self.P[u]-self.reg*self.Q[i])
        self.P[u] += self.lr * (e*self.Q[i]-self.reg*self.P[u])

    def __get_val_RMSE(self, mu, B_u, B_i, P, Q, validation):
        """
        Generate predictions on validation data and return RMSE.
        Args:
            -mu: global mean of ratings
            -B_u: user biases
            -B_i: item biases
            -P: user factors
            -Q: item factors
        """
        # get values
        user_idx = validation['user_idx'].values
        item_idx = validation['item_idx'].values
        ratings = validation['review_overall'].values
        # get factor scores
        factor_scores = np.sum(np.multiply(
            P[user_idx], # user factors
            Q[item_idx] # item factors
        ), axis = 1)

        # generate predictions
        preds = mu + B_u[user_idx] + B_i[item_idx] + factor_scores
        # calculate error
        errors = ratings - preds
        # calculate RMSE
        RMSE = np.sqrt(np.mean(errors**2))
        return RMSE



        


    
    

In [91]:
SVD_model = SVD()
SVD_model.fit(ui_csr,validation)

Iteration: 1
current validation RMSE: 0.7618323590368138
Iteration: 2
current validation RMSE: 0.7491855221344007
Iteration: 3
current validation RMSE: 0.7415152230789009
Iteration: 4
current validation RMSE: 0.7361185319272969
Iteration: 5
current validation RMSE: 0.7322814278324057
Iteration: 6
current validation RMSE: 0.7293672748196092
Iteration: 7
current validation RMSE: 0.7268499378979351
Iteration: 8
current validation RMSE: 0.7250700608993398
Iteration: 9
current validation RMSE: 0.7234021498114351
Iteration: 10
current validation RMSE: 0.7220608142111653
Iteration: 11
current validation RMSE: 0.7209228238740131
Iteration: 12
current validation RMSE: 0.7199141369012151
Iteration: 13
current validation RMSE: 0.7189495434773804
Iteration: 14
current validation RMSE: 0.7182105211722325
Stopped after 15 iterations
Final RMSE is: 0.7179242234744753
