Following a tutorial from:

https://www.kaggle.com/code/gspmoreira/recommender-systems-in-python-101

The objective of a recommender system (RecSys) is to recommend relevant items to users based on their preferences. Preference and relevance are subjective, and they are generally inferred from the items users have previously consumed.
The main families of methods for RecSys are:

- **Collaborative Filtering:** This method makes automatic predictions (filtering) about the interests of a user by collecting preference or taste information from many users (collaborating). The underlying assumption is that if person A has the same opinion as person B on a set of items, A is more likely to share B's opinion on another item than the opinion of a randomly chosen person.
- **Content-Based Filtering:** This method uses only the descriptions and attributes of the items a user has previously consumed to model that user's preferences. In other words, these algorithms try to recommend items that are similar to those a user liked in the past (or is examining in the present). In particular, candidate items are compared with items previously rated by the user, and the best-matching items are recommended.
- **Hybrid methods:** Research has shown that a hybrid approach, combining collaborative filtering and content-based filtering, can be more effective than either pure approach in some cases. Hybrid methods can also be used to mitigate common problems in recommender systems, such as cold start and data sparsity.
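
To make the collaborative filtering assumption concrete, here is a minimal sketch of a user-based prediction on toy data. The ratings, user names, and the helper functions `cosine_similarity` and `predict` are hypothetical and written only for illustration; they are not part of the tutorial, which builds its models differently. The sketch scores an unseen item for a user as a similarity-weighted average of other users' ratings.

```python
import math

# Hypothetical toy ratings: user -> {item: rating}
ratings = {
    "alice": {"stout": 5.0, "ipa": 4.0, "pilsener": 1.0},
    "bob":   {"stout": 4.5, "ipa": 4.0, "pilsener": 1.5, "porter": 5.0},
    "carol": {"stout": 1.0, "ipa": 2.0, "pilsener": 5.0, "porter": 1.0},
}

def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in common))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b)

def predict(user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        num += sim * other_ratings[item]
        den += abs(sim)
    return num / den if den else None

# alice has not rated "porter"; her tastes resemble bob's more than carol's,
# so the prediction is pulled toward bob's rating.
print(predict("alice", "porter"))
```

Raw cosine similarity on uncentered ratings is always non-negative here, so dissimilar users still contribute a little; mean-centering each user's ratings first is a common refinement.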


In [5]:
import scipy
import math
import random
import sklearn
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from scipy.sparse import csr_matrix
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse.linalg import svds
from sklearn.preprocessing import MinMaxScaler

In [3]:
df = pd.read_csv('beer_reviews.csv')
df.drop('index', axis='columns', inplace=True)
df.head(5).transpose()

,0,1,2,3,4
brewery_id,10325,10325,10325,10325,1075
brewery_name,Vecchio Birraio,Vecchio Birraio,Vecchio Birraio,Vecchio Birraio,Caldera Brewing Company
review_time,1234817823,1235915097,1235916604,1234725145,1293735206
review_overall,1.5,3.0,3.0,3.0,4.0
review_aroma,2.0,2.5,2.5,3.0,4.5
review_appearance,2.5,3.0,3.0,3.5,4.0
review_profilename,stcules,stcules,stcules,stcules,johnmichaelsen
beer_style,Hefeweizen,English Strong Ale,Foreign / Export Stout,German Pilsener,American Double / Imperial IPA
review_palate,1.5,3.0,3.0,2.5,4.0
review_taste,1.5,3.0,3.0,3.0,4.5


Recommender systems suffer from a problem known as user cold-start: it is hard to provide personalized recommendations for users who have consumed no items or only a few, because there is too little information to model their preferences.
For this reason, we keep in the dataset only users with at least 5 interactions.

In [17]:
# Count distinct (user, review_time) pairs per user, so rows sharing a timestamp count once
users_review_count_df = df.groupby(['review_profilename', 'review_time']).size().groupby('review_profilename').size()
print('# users: %d' % len(users_review_count_df))
# Keep only the names of users with at least 5 interactions
users_with_enough_reviews_df = users_review_count_df[users_review_count_df >= 5].reset_index()[['review_profilename']]
print('# users with at least 5 interactions: %d' % len(users_with_enough_reviews_df))



# users: 33387
# users with at least 5 interactions: 14811


In [19]:
print('# of interactions: %d' % len(df))
reviews_from_selected_users_df = df.merge(users_with_enough_reviews_df,
                                          how='right',
                                          on='review_profilename')
print('# of interactions from users with at least 5 interactions: %d' % len(reviews_from_selected_users_df))



# of interactions: 1586614
# of interactions from users with at least 5 interactions: 1553931
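
The filtering above can be checked on a small example. The sketch below uses a hypothetical toy frame (`toy_df`) and an illustrative threshold of 2 instead of 5. It also swaps the merge for a boolean mask built with `isin`, which is an alternative assumed here for illustration and not the tutorial's method. Counting with `nunique` should give the same per-user counts as the chained `groupby` calls, since both count distinct review times per user.

```python
import pandas as pd

# Hypothetical toy interactions; column names mirror the dataset above.
toy_df = pd.DataFrame({
    "review_profilename": ["ann", "ann", "ann", "ben", "cy", "cy"],
    "review_time": [1, 2, 3, 4, 5, 6],
})

MIN_INTERACTIONS = 2  # illustrative threshold; the notebook uses 5

# Number of distinct review times per user.
counts = toy_df.groupby("review_profilename")["review_time"].nunique()

# Users that meet the threshold.
active_users = counts[counts >= MIN_INTERACTIONS].index

# A boolean mask keeps the original row order and adds no columns.
filtered_df = toy_df[toy_df["review_profilename"].isin(active_users)]

print("# users:", len(counts))
print("# users kept:", len(active_users))
print("# interactions kept:", len(filtered_df))
```

Compared with a merge, the mask cannot duplicate rows and leaves the index of the original frame untouched, which makes the before and after counts easy to reconcile.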
