# Simple Neighbourhood Approach (User-Based CF)
As a first step, we will use basic neighbourhood-based collaborative filtering (CF) techniques, with a simple model as a baseline.

### Pre-Processing

In [35]:
%%capture
import sys
import os

# Add project root to Python path
project_root = os.path.abspath("..")
if project_root not in sys.path:
    sys.path.append(project_root)
# import packages
from utils.imports import *

In [36]:
# import pandas dataframes
with open("../data/dataframes.pkl", "rb") as f:
    data = pickle.load(f)

train = data["train"]
validation = data["validation"]
baseline = data["baseline"]

# load sparse matrix
ui_csr = load_npz("../data/ui_csr.npz")

# load encodings
with open("../artifacts/user_encoder.pkl", "rb") as f:
    user_encoder = pickle.load(f)
with open("../artifacts/item_encoder.pkl", "rb") as f:
    item_encoder = pickle.load(f)
with open("../artifacts/user_map.pkl", "rb") as f:
    user_map = pickle.load(f)
with open("../artifacts/item_map.pkl", "rb") as f:
    item_map = pickle.load(f)

#### Formatting our Data for CF
We're going to mean-centre each user's score to account for the fact that some users tend to be more lenient or harsh. 

In [37]:
# get sum of scores per user
user_scores = ui_csr.sum(axis=1).A1
# count number of user reviews
user_counts = np.diff(ui_csr.indptr)
# get mean vector
user_mean_scores = user_scores / user_counts

Now, it's time to compute the similar matrix. We'll use cosine similarity on our mean-centred data (adjusted-cosine similarity).

In [38]:
cos_sim = cosine_similarity(ui_csr, dense_output=False)

### Training
We're ready to begin training our model. For this simple example, we'll validate our choice of $k$ nearest neighbours, first defining a prediction function. Note that we're using a prediction function and rounding to the nearest .5 instead of approaching our ratings as a classificaiton problem. We use this approach for simplicity.
#### Predict Function

In [39]:
def predict(user, item, train, similarity, user_mean_scores, k, clipped=True):
    """
    Predict ratings for the user-item pair using k nearest neighbours, 
    rounded to the nearest .5 and capped in [0,5]. If item is unseen, default to global mean.
    
    Parameters:
    -user: user index for whom to predict ratings
    -item: item index for which to predict ratings
    -train: training data in sparse matrix format
    -similarity: similarity matrix in sparse format
    -user_mean_scores: mean scores for each user
    -k: number of nearest neighbouts to consider
    
    Returns: ordinal prediction for user-item pair
    """
    # find neighbours of user for item
    nbs = train[:,item].nonzero()[0]
    nbs = nbs[nbs != user] #exclude self
    if nbs.size == 0:
        # no neighbours, return mean score
        return user_mean_scores[user]
    
    # get ratings and mean-centre them
    ratings = train[nbs,item].toarray().flatten()
    ratings -= user_mean_scores[nbs]
    # set limit for k
    k = min(k, nbs.size)

    # get similarity scores
    sims = similarity[user, :].toarray().flatten()
    # get similarity scores for neighbours
    sims = sims[nbs]
    # take k-nearest similarities
    sims = sims[np.argsort(sims)[-k:]]
    # get corresponding k-nearest ratings
    ratings = ratings[np.argsort(sims)[-k:]]
    # compute weighted average
    if np.sum(np.abs(sims)) == 0:
        return user_mean_scores[user]
    weighted_avg = np.dot(sims, ratings) / np.sum(np.abs(sims))
    # recenter
    weighted_avg += user_mean_scores[user]
    if clipped == False:
        return weighted_avg
    # round to nearest .5
    weighted_avg = np.round(weighted_avg * 2) / 2 
    # clip to [0,5]
    weighted_avg = np.clip(weighted_avg, 0, 5)
    # return prediction
    return weighted_avg

#### Evaluating the effect of neighbourhood size
Now we'll evaluate over different choices of k.

In [40]:
def evaluate_k(train, validation, similarity, user_mean_scores, k):
    preds = []
    actuals = []

    for row in validation.itertuples(index=False):
        u = row.user_idx
        i = row.item_idx
        true_r = row.review_overall
        pred = predict(u, i, train, similarity, user_mean_scores, k)
        preds.append(pred)
        actuals.append(true_r)

        RMSE = np.sqrt(mean_squared_error(actuals, preds))
        MAE = mean_absolute_error(actuals,preds)

    return RMSE, MAE

In [41]:
# let's experiment with different values of k
k_values = [3, 5, 10, 20, 50, 100]
train, validation, similarity, user_mean_scores = (
    ui_csr, validation, cos_sim, user_mean_scores
    )
for k in k_values:
    RMSE, MAE = evaluate_k(train, validation, similarity, user_mean_scores, k)
    print(f'The RMSE for user-based CF with {k}-NN is \
          {RMSE}')
    print(f'The MAE for user-based CF with {k}-NN is \
          {MAE}')
    print('\n')

The RMSE for user-based CF with 3-NN is           0.8281375351604974
The MAE for user-based CF with 3-NN is           0.6062849787578912


The RMSE for user-based CF with 5-NN is           0.7948962457823016
The MAE for user-based CF with 5-NN is           0.5767247519566189


The RMSE for user-based CF with 10-NN is           0.7731675049216724
The MAE for user-based CF with 10-NN is           0.555877013054669


The RMSE for user-based CF with 20-NN is           0.766080997095158
The MAE for user-based CF with 20-NN is           0.5465076312264295


The RMSE for user-based CF with 50-NN is           0.7604301674859361
The MAE for user-based CF with 50-NN is           0.5415636400772108


The RMSE for user-based CF with 100-NN is           0.7582446812298936
The MAE for user-based CF with 100-NN is           0.5399732653019377




#### Top-N predictions
We're not only interested in prediction accuracy; we'd also like to know how effective our algorithm is at predicting novel or less popular items. To measure this, we'll look at the top-N items as calculated by taking the $N$ items for a user with the highest predicted ratings. To avoid interminable runtime, we'll use a a fast matrix-based function which precomputes the most similar neighbours for each user (not for a user-item pair).


In [42]:
def predict_top_N_fast(user, train, similarity, user_mean_scores, k=10, N=10):
    """
    Fast top-N prediction using user-based CF with vectorized matrix ops.

    Parameters:
    - user: target user index
    - train: CSR matrix of shape (n_users, n_items)
    - similarity: (n_users, n_users) sparse matrix or dense array
    - user_mean_scores: array of user mean ratings
    - k: number of nearest neighbours
    - N: number of top items to return

    Returns:
    - List of top-N item indices predicted for the user
    """
    # 1. Get top-k most similar users to target user
    user_sims = similarity[user, :].toarray().flatten()
    topk_idx = np.argsort(user_sims)[-k:]
    topk_sims = user_sims[topk_idx]  # shape: (k,)

    # 2. Get their ratings and mean-center
    ratings = train[topk_idx, :].toarray()  # shape: (k, n_items)
    means = user_mean_scores[topk_idx][:, np.newaxis]
    ratings_centered = ratings - means  # shape: (k, n_items)

    # 3. Weighted sum of centered ratings
    numerator = topk_sims @ ratings_centered  # shape: (n_items,)
    denominator = np.sum(np.abs(topk_sims)) + 1e-8  # to avoid div by 0

    preds = user_mean_scores[user] + numerator / denominator  # shape: (n_items,)

    # 4. Mask out already rated items
    rated_items = train[user, :].nonzero()[1]
    preds[rated_items] = -np.inf  # exclude known ratings

    # 5. Return top-N items
    top_N_items = np.argsort(preds)[-N:][::-1]
    return top_N_items.tolist()

In [43]:
k_values = [3, 5, 10, 20]
N = 10

for k in k_values:
    recommended_beers = set()
    for user in range(ui_csr.shape[0]):
        preds = predict_top_N_fast(user, train, similarity, user_mean_scores, k, N)
        # update the set of recommended beers
        recommended_beers.update(preds)
    print(f'For k = {k} nearest neighbours and top-{N} beers recommended:')
    print(f'The number of recommended beers over all users is {len(recommended_beers)}')
    print(f'This corresponds to {len(recommended_beers) / ui_csr.shape[1] * 100:.2f}% \
        of all beers in the training matrix.')

For k = 3 nearest neighbours and top-10 beers recommended:
The number of recommended beers over all users is 6514
This corresponds to 24.95%         of all beers in the training matrix.
For k = 5 nearest neighbours and top-10 beers recommended:
The number of recommended beers over all users is 6140
This corresponds to 23.52%         of all beers in the training matrix.
For k = 10 nearest neighbours and top-10 beers recommended:
The number of recommended beers over all users is 5041
This corresponds to 19.31%         of all beers in the training matrix.
For k = 20 nearest neighbours and top-10 beers recommended:
The number of recommended beers over all users is 3813
This corresponds to 14.60%         of all beers in the training matrix.


It's worth mentioning that this experiment does not tell the entire story when it comes to recommending items from a catalog. This simple user-based CF model is only capable of recommending beers which have been reviewed at least three times - this already excludes the many beers which have only been reviewed once or twice. This is in addition to the fact that the system is only likely to recommend a fraction of the total available beers as demonstrated by the above code. We can investigate how recommendations differ given different neighbourhood sizes.

In [44]:
def get_top_N_beers(user, train, similarity, user_mean_scores, baseline, k = 3, N = 5):
    #predict top 5 beers with 3 neighbours
    preds = predict_top_N_fast(user, train, similarity, user_mean_scores, k, N)
    # get item dict. mappings
    beer_ids = item_encoder.inverse_transform(preds)
    # get some aggregate statistics
    top_N = baseline[baseline['beer_beerid'].isin(beer_ids)].groupby([
        'beer_beerid', 'beer_name', 'brewery_name', 'beer_style'], group_keys=False).agg(
            {'review_overall': ['mean', 'count', 'std'],
            'beer_abv': ['mean']}
        ).sort_values(by=('review_overall', 'count'), ascending=False)
    return top_N

In [45]:
# set params
user, k, N = 69, 3, 10
get_top_N_beers(user, ui_csr, similarity, user_mean_scores, baseline, k=k, N=N)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,review_overall,review_overall,review_overall,beer_abv
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,mean,count,std,mean
beer_beerid,beer_name,brewery_name,beer_style,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
276,Sierra Nevada Pale Ale,Sierra Nevada Brewing Co.,American Pale Ale (APA),4.24813,2406,0.529726,5.6
665,Anchor Liberty Ale,Anchor Brewing Company,American Pale Ale (APA),4.09607,1374,0.540642,6.0
3842,Trappistes Rochefort 6,Brasserie de Rochefort,Belgian Strong Dark Ale,4.138889,756,0.51135,7.5
21505,Lammin Kataja Olut,Lammin Sahti Oy,Sahti,3.150485,103,0.756783,7.0
21493,Flensburger Winterbock,Flensburger Brauerei GmbH Und Co. KG,Bock,3.5,10,0.408248,7.0
21498,Dog & Pony Double Dry-hopped Imperial IPA,Maritime Pacific Brewing Company,American Double / Imperial IPA,3.9,5,0.223607,7.5
21502,Dusty Trail Pale,Amnesia Brewing,American Pale Ale (APA),3.3,5,0.67082,5.2
21499,Julöl,Grebbestad Bryggeri,Vienna Lager,3.625,4,0.478714,5.3
21515,Haake Beck Pils,Brauerei Beck & Co.,German Pilsener,3.25,4,0.645497,4.9
21514,Premium Pilsener,Brauerei Herrenhausen KG,German Pilsener,3.5,3,0.0,4.9


In [46]:
# set params
user, k, N = 69, 20, 10
get_top_N_beers(user, ui_csr, similarity, user_mean_scores, baseline, k=k, N=N)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,review_overall,review_overall,review_overall,beer_abv
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,mean,count,std,mean
beer_beerid,beer_name,brewery_name,beer_style,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
1093,Two Hearted Ale,"Bell's Brewery, Inc.",American IPA,4.320482,2529,0.50668,7.0
276,Sierra Nevada Pale Ale,Sierra Nevada Brewing Co.,American Pale Ale (APA),4.24813,2406,0.529726,5.6
14916,Hop Wallop,Victory Brewing Company,American Double / Imperial IPA,3.987629,1738,0.606045,8.5
1118,Chocolate Stout,Rogue Ales,American Stout,4.115407,1733,0.59269,6.0
18862,Burton Baton,Dogfish Head Brewery,American Double / Imperial IPA,4.010145,1380,0.557992,10.0
6947,Cuvée Van De Keizer Blauw (Blue),Brouwerij Het Anker,Belgian Strong Dark Ale,4.14592,723,0.594126,11.0
2233,Summit Winter Ale,Summit Brewing Company,Winter Warmer,3.791489,235,0.494772,6.1
3646,Urthel Hibernus Quentum,De Leyerth Brouwerijen (Urthel),Tripel,4.038462,221,0.543236,9.0
27265,Bell's Wheat Love,"Bell's Brewery, Inc.",Wheatwine,3.983553,152,0.547171,7.7
35405,Victor's MemoriAle Altbier,Two Brothers Brewing Company,Altbier,4.095238,42,0.508922,7.8


In [47]:
# set params
user, k, N = 420, 3, 10
get_top_N_beers(user, ui_csr, similarity, user_mean_scores, baseline, k=k, N=N)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,review_overall,review_overall,review_overall,beer_abv
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,mean,count,std,mean
beer_beerid,beer_name,brewery_name,beer_style,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
226,Great Lakes Edmund Fitzgerald Porter,Great Lakes Brewing Company,American Porter,4.322813,1600,0.466191,5.8
142,Ommegang (Abbey Ale),Brewery Ommegang,Dubbel,4.040414,1497,0.591441,8.5
1385,Delirium Tremens,Brouwerij Huyghe,Belgian Strong Pale Ale,4.022912,1353,0.589786,8.5
228,Great Lakes Dortmunder Gold,Great Lakes Brewing Company,Dortmunder / Export Lager,4.290899,868,0.499479,5.8
773,Goudenband,Brouwerij Liefmans,Flanders Oud Bruin,4.133333,465,0.576541,8.0
62328,Estate Homegrown Wet Hop Ale,Sierra Nevada Brewing Co.,American IPA,4.134085,399,0.441493,6.7
38366,Samuel Adams Dunkelweizen,Boston Beer Company (Samuel Adams),Dunkelweizen,3.68,275,0.553159,5.1
67262,Longshot Blackened Hops,Boston Beer Company (Samuel Adams),American Black Ale,3.924731,186,0.467577,7.0
67267,Longshot Friar Hop Ale,Boston Beer Company (Samuel Adams),Belgian IPA,3.591892,185,0.561141,9.0
38224,Point Oktoberfest,Stevens Point Brewery,Märzen / Oktoberfest,3.698718,78,0.560263,5.15


In [48]:
# set params
user, k, N = 420, 20, 10
get_top_N_beers(user, ui_csr, similarity, user_mean_scores, baseline, k=k, N=N)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,review_overall,review_overall,review_overall,beer_abv
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,mean,count,std,mean
beer_beerid,beer_name,brewery_name,beer_style,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
131,Ayinger Celebrator Doppelbock,Privatbrauerei Franz Inselkammer KG / Brauerei Aying,Doppelbock,4.293592,2013,0.513722,6.7
2751,Racer 5 India Pale Ale,Bear Republic Brewing Co.,American IPA,4.229022,1871,0.472854,7.0
226,Great Lakes Edmund Fitzgerald Porter,Great Lakes Brewing Company,American Porter,4.322813,1600,0.466191,5.8
10325,Péché Mortel (Imperial Stout Au Cafe),Brasserie Dieu Du Ciel,American Double / Imperial Stout,4.264685,1396,0.528709,9.5
646,Westmalle Trappist Tripel,Brouwerij Westmalle,Tripel,4.196698,1393,0.549363,9.5
19216,Oak Aged Yeti Imperial Stout,Great Divide Brewing Company,Russian Imperial Stout,4.082671,1385,0.551066,9.5
11922,Titan IPA,Great Divide Brewing Company,American IPA,4.137327,1227,0.494349,7.1
25755,Heavy Seas - Loose Cannon (Hop3 Ale),Heavy Seas Beer,American IPA,4.073559,1006,0.47322,7.25
5428,New Holland Dragon's Milk Oak Barrel Ale,New Holland Brewing Company,American Stout,3.76247,842,0.654713,10.0
1287,Bell's Porter,"Bell's Brewery, Inc.",American Porter,3.985976,820,0.517801,5.6


We can see clearly that increasing the neighbourhood size tends to recommend more popular beers. This effect is commonly observed when implementing user-based CF. We can think critically about why this may be. When generating recommendations from a small number of neighbours, we only take the recommendations from the most similar users - and their "votes" towards recommendations have a very large impact on the predicted items. Therefore, a more rare and polarizing beer may be recommended to a user when using a small value for k if that user's closest neighbours give it a high score. Conversely, including a larger number of neighbours when generating predictions tends to push predicted scores towards the mean; more rare and polarizing beers will tend to get pushed out by popular and highly-rated beers. The "signal" from similar users will be lost. As we saw from the EDA, there are a relatively small number of highly-rated and highly-reviewed beers. Continuing to increase the value of k will lead to those beers being recommended:

In [49]:
# set params, make k big
user, k, N = 420, 500, 10
get_top_N_beers(user, ui_csr, similarity, user_mean_scores, baseline, k=k, N=N)

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,Unnamed: 3_level_0,review_overall,review_overall,review_overall,beer_abv
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,mean,count,std,mean
beer_beerid,beer_name,brewery_name,beer_style,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2
131,Ayinger Celebrator Doppelbock,Privatbrauerei Franz Inselkammer KG / Brauerei Aying,Doppelbock,4.293592,2013,0.513722,6.7
2751,Racer 5 India Pale Ale,Bear Republic Brewing Co.,American IPA,4.229022,1871,0.472854,7.0
141,Hennepin (Farmhouse Saison),Brewery Ommegang,Saison / Farmhouse Ale,4.243311,1794,0.509828,7.7
3457,Three Philosophers Belgian Style Blend (Quadrupel),Brewery Ommegang,Quadrupel (Quad),3.981173,1620,0.564645,9.8
226,Great Lakes Edmund Fitzgerald Porter,Great Lakes Brewing Company,American Porter,4.322813,1600,0.466191,5.8
16403,"Smuttynose IPA ""Finest Kind""",Smuttynose Brewing Company,American IPA,4.124649,1424,0.577223,6.9
10325,Péché Mortel (Imperial Stout Au Cafe),Brasserie Dieu Du Ciel,American Double / Imperial Stout,4.264685,1396,0.528709,9.5
6518,Dale's Pale Ale,Oskar Blues Grill & Brew,American Pale Ale (APA),4.070605,1388,0.544409,6.5
19216,Oak Aged Yeti Imperial Stout,Great Divide Brewing Company,Russian Imperial Stout,4.082671,1385,0.551066,9.5
11922,Titan IPA,Great Divide Brewing Company,American IPA,4.137327,1227,0.494349,7.1


All of these beers are among the most popular and well-liked beers on the site. This demonstrates why it's not beneficial to choose a large value of k simply because it results in a low RMSE; in the context of an item catalog, a recommender should be capable of providing personalized and novel recommendations.