# Baseline Model

In [1]:
import numpy as np
import pandas as pd
# pip install scikit-surprise
from surprise import Dataset, Reader, KNNWithMeans
from surprise.model_selection import GridSearchCV

## Intro

There are multiple ways to implement a recommendation system:

- collaborative filtering
- content-based filtering
- hybrid

Collaborative filtering can be further divided into two types; which one to use can be treated as a hyper-parameter:

- user-based: find similar users based on ratings a user gave out
- item-based: find similar items based on ratings given to an item

In either case, the algorithm relies on a user-item matrix, in which rows correspond to users and columns to items. We compute similarities among the users (or the items) directly from this matrix and use them to make predictions; this is known as a memory-based approach. If we instead first reduce the sparse user-item matrix with matrix factorization, the approach is called model-based.
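
To make the memory-based idea concrete, here is a minimal sketch in plain Python. The toy ratings and the helper `item_cosine` are hypothetical and only for illustration; restricting the cosine to users who rated both items is an assumption of this sketch.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical toy ratings: (user, item, rating)
ratings = [
    ("u1", "a", 5), ("u1", "b", 4),
    ("u2", "a", 4), ("u2", "b", 5), ("u2", "c", 1),
    ("u3", "a", 1), ("u3", "c", 5),
]

# User-item matrix stored sparsely: rows are users, columns are items.
user_item = defaultdict(dict)
for user, item, rating in ratings:
    user_item[user][item] = rating


def item_cosine(item_x, item_y, matrix):
    """Cosine similarity between two items over users who rated both."""
    pairs = [
        (row[item_x], row[item_y])
        for row in matrix.values()
        if item_x in row and item_y in row
    ]
    if not pairs:
        return 0.0  # no co-raters: treat as no evidence of similarity
    dot = sum(x * y for x, y in pairs)
    norm_x = sqrt(sum(x * x for x, _ in pairs))
    norm_y = sqrt(sum(y * y for _, y in pairs))
    return dot / (norm_x * norm_y)


sim_ab = item_cosine("a", "b", user_item)  # two co-raters
sim_bc = item_cosine("b", "c", user_item)  # a single co-rater
```

Note that `sim_bc` comes out as exactly 1.0 because only one user rated both items. A single pair of ratings always points in the same direction, so similarities built on very few co-raters carry little information.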

For our baseline model, we will implement the memory-based collaborative filtering technique.

We use `scikit-surprise`, a recommendation system package (`pip install scikit-surprise`). Its `KNNWithMeans` class is particularly useful here: it is a basic collaborative filtering algorithm that takes into account the mean rating of each user (or of each item, in the item-based case).
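
For reference, the scikit-surprise documentation describes the `KNNWithMeans` prediction as a mean-centered, similarity-weighted average. In the item-based form, with the notation of that documentation ($\mu_i$ is the mean rating of item $i$, and $N^k_u(i)$ is the set of at most $k$ items rated by user $u$ that are most similar to item $i$):

```latex
\hat{r}_{ui} = \mu_i +
  \frac{\sum_{j \in N^k_u(i)} \operatorname{sim}(i, j)\,(r_{uj} - \mu_j)}
       {\sum_{j \in N^k_u(i)} \operatorname{sim}(i, j)}
```

The user-based form is symmetric: swap the roles of users and items, so that the sum runs over neighboring users $v$ with means $\mu_v$.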

## Prep Data

We require a data frame with 3 columns: user, item, rating; with each row corresponding to a user's rating for a particular restaurant.

In [2]:
review_df = pd.read_feather('data/yelp_review_cleaned.feather')

In [3]:
# take an explicit copy so the later column assignment does not touch review_df
df = review_df.loc[:, ['user_id', 'business_id', 'stars']].copy()

Note that `stars` was standardized in an earlier step. Since it is now our response variable, we map it back to the original 1-5 scale:

In [4]:
stars_scaled_unique = sorted(df['stars'].unique())

In [5]:
stars_scale_map = dict(zip(stars_scaled_unique, range(1, 6)))
stars_scale_map

{-2.0123613662910693: 1,
 -1.2947376560318022: 2,
 -0.5771139457725354: 3,
 0.1405097644867315: 4,
 0.8581334747459984: 5}

In [6]:
df['stars'] = df['stars'].map(stars_scale_map)

In [7]:
df.stars.unique()

array([3, 5, 4, 1, 2])
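
The mapping above works because standardization is a monotonic transformation: the smallest scaled value must correspond to 1 star and the largest to 5. A minimal sketch of the same idea in plain Python, with a guard for the case where not every star level is present (the helper `build_unscale_map` and the toy data are hypothetical):

```python
from statistics import mean, pstdev


def build_unscale_map(scaled_values, levels=(1, 2, 3, 4, 5)):
    """Map each distinct scaled value to a star level by rank order."""
    distinct = sorted(set(scaled_values))
    if len(distinct) != len(levels):
        # Rank-based mapping is only valid if every level appears in the data.
        raise ValueError(
            f"expected {len(levels)} distinct values, found {len(distinct)}"
        )
    return dict(zip(distinct, levels))


# Hypothetical toy column, standardized the usual way: (x - mean) / std
stars = [3, 5, 4, 1, 2, 5, 5, 4]
mu, sigma = mean(stars), pstdev(stars)
scaled = [(s - mu) / sigma for s in stars]

unscale_map = build_unscale_map(scaled)
recovered = [unscale_map[z] for z in scaled]
```

The guard matters on small subsets: if a star level were missing, ranks would silently shift and the mapping would be wrong.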

The full data set is too large to process comfortably on our hardware, so we take a smaller random sample:

In [8]:
df.shape

(5257329, 3)

In [9]:
sub = df.sample(10000, random_state=42)

In [10]:
sub.head()

| | user_id | business_id | stars |
|---|---|---|---|
| 1322294 | 0lpxU4Dfi8AeBt0SeCrEuw | tQKqrLs16Xi-lFrd3_CBAQ | 1 |
| 4297632 | 5nw1Zc3fi_ehDJFd3mUEYA | nLxNJuvgoHQHn_IGYifRnw | 1 |
| 2143059 | 7fDqaGdUMccXQ4bnPwR6yg | etaIhl-sduOKc6J_qHmmtA | 3 |
| 3068250 | GyFJNSJjI5aWww-D0Btcbw | GlKffg2PMtzByocI5OHIQA | 3 |
| 1371839 | o66iBwIWxfWPypnqfrHVNw | XVFUNtPWYpxhoWPtBQHFdQ | 2 |
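
A random sample of individual reviews tends to leave most users and businesses with very few ratings each, which matters for neighborhood methods. A minimal sketch of how the density of the resulting user-item matrix could be checked, using hypothetical toy triples rather than the actual sample (the helper `matrix_density` is not part of any library):

```python
def matrix_density(ratings):
    """Fraction of user-item cells that contain a rating."""
    if not ratings:
        raise ValueError("ratings must not be empty")
    users = {user for user, _, _ in ratings}
    items = {item for _, item, _ in ratings}
    observed = {(user, item) for user, item, _ in ratings}
    return len(observed) / (len(users) * len(items))


# Hypothetical toy data: 3 users x 3 items, 4 observed ratings
toy = [("u1", "a", 5), ("u1", "b", 3), ("u2", "a", 4), ("u3", "c", 1)]
density = matrix_density(toy)
```

On the real sample, the same triples could be taken from the `user_id`, `business_id`, and `stars` columns.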


## Build Recommender

As a first test, we fit a memory-based, item-based collaborative filter with cosine similarity:

In [11]:
# load the data into scikit-surprise format (column order: user, item, rating)
reader = Reader(rating_scale=(1, 5))
data = Dataset.load_from_df(sub, reader)

In [12]:
# configs
sim_options = {
    "name": "cosine",  # similarity measure
    "user_based": False,  # False: compute similarities between items (item-based)
}

algo = KNNWithMeans(sim_options=sim_options)

In [13]:
trainingSet = data.build_full_trainset()

In [14]:
algo.fit(trainingSet)

Computing the cosine similarity matrix...
Done computing similarity matrix.


<surprise.prediction_algorithms.knns.KNNWithMeans at 0x3baeb9cc0>

In [15]:
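# predict the rating for the (user, item) pair in the 5th sampled row;
# note that this row was part of the training set, so this is only a sanity check
prediction = algo.predict(sub.iloc[4, 0], sub.iloc[4, 1])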
prediction.est

2.0

Next, we tune the hyper-parameters with a grid search:

In [16]:
sim_options = {
    "name": ["msd", "cosine"],
    "min_support": [3, 4, 5],
    "user_based": [False, True],
}

param_grid = {"sim_options": sim_options}
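
To get a sense of the cost before running the search, the grid can be expanded by hand. This is a minimal sketch that assumes an exhaustive search over every combination and 3-fold cross-validation (the `cv=3` setting passed to `GridSearchCV` in this notebook); the variable names are illustrative only.

```python
from itertools import product

sim_options_grid = {
    "name": ["msd", "cosine"],
    "min_support": [3, 4, 5],
    "user_based": [False, True],
}

# Cartesian product of all option values, one dict per combination
keys = list(sim_options_grid)
combos = [
    dict(zip(keys, values))
    for values in product(*sim_options_grid.values())
]

n_folds = 3  # assumption: matches cv=3
n_fits = len(combos) * n_folds
print(f"{len(combos)} combinations x {n_folds} folds = {n_fits} fits")
```

Each fit computes its own similarity matrix, which is why the log of the search repeats the same pair of messages many times.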

In [17]:
gs = GridSearchCV(KNNWithMeans, param_grid, measures=["rmse", "mae"], cv=3)
gs.fit(data)

Computing the msd similarity matrix...
Done computing similarity matrix.
...

In [22]:
print(f'best rmse: {gs.best_score["rmse"]}')
print(f'best params: {gs.best_params["rmse"]}')

best rmse: 1.4087381876138967
best params: {'sim_options': {'name': 'msd', 'min_support': 3, 'user_based': True}}
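
For readers unfamiliar with the two measures passed to the grid search, here is a minimal sketch of how RMSE and MAE are defined, using hypothetical toy values rather than the model's actual predictions:

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error: penalizes large misses more heavily."""
    errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(errors) / len(errors))


def mae(actual, predicted):
    """Mean absolute error: average size of the miss, in stars."""
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)


# Hypothetical true ratings and predictions on a 1-5 scale
actual = [5, 3, 1, 4]
predicted = [4, 3, 3, 4]

toy_rmse = rmse(actual, predicted)
toy_mae = mae(actual, predicted)
```

Both measures are in units of stars, so the best RMSE of about 1.41 reported above means the baseline's predictions are typically off by well over one star on a 1-5 scale.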


## Up Next

- Test model-based approach in collaborative filtering
- Test content-based recommenders
- Use more complex models such as neural networks rather than just cosine similarities