# Movie Recommendations (Matrix Factorization)

![alt text](amazon_prime.png "Movie Recommendations (source https://www.amazon.com)")

In this notebook, we take a look at the [MovieLens](https://grouplens.org/datasets/movielens/) dataset, a popular dataset for building and benchmarking recommender systems. We work with the 1M version, which contains 1,000,209 ratings of about 3,900 movies, made by 6,040 users who joined MovieLens in the year 2000.

To build a recommender system based on matrix factorization, we use the [Surprise](https://surprise.readthedocs.io/en/stable/index.html) library.

In [1]:
import pandas as pd
import numpy as np

# parse all the data
movies = pd.read_csv('movies.csv', 
                     sep='\t', 
                     encoding='latin-1', 
                     usecols=['movie_id', 'title', 'genres'])

users = pd.read_csv('users.csv', 
                    sep='\t', 
                    encoding='latin-1', 
                    usecols=['user_id', 'gender', 'zipcode', 'age_desc', 'occ_desc'])

ratings = pd.read_csv('ratings.csv', 
                      sep='\t', 
                      encoding='latin-1', 
                      usecols=['user_id', 'movie_id', 'rating'])

# print the first 10 rows
print("The ratings dataframe:")
print(ratings.head(10))

The ratings dataframe:
   user_id  movie_id  rating
0        1      1193       5
1        1       661       3
2        1       914       3
3        1      3408       4
4        1      2355       5
5        1      1197       3
6        1      1287       5
7        1      2804       5
8        1       594       4
9        1       919       4


For the sake of demonstration, we only consider a random 10% subset of the ratings (otherwise, training and testing the model would take much longer).

In [2]:
# random 10% subset of the ratings dataframe (random_state is fixed so that
# the same rows are drawn each time this cell is executed)
small_ratings = ratings.sample(frac=0.1, random_state=0)

In [3]:
# generate a Dataset (Surprise library) from the DataFrame (Pandas library)
from surprise import Dataset
from surprise import Reader

reader = Reader(rating_scale=(1, 5))

# required order: user id, item id, and rating
data = Dataset.load_from_df(small_ratings[['user_id', 'movie_id', 'rating']], reader)

Let us split the data into a training set and a test set. The **test set must not be touched during training**, so that it yields a realistic estimate of the model's performance on new, unseen data. If model parameters also have to be tuned, an **additional validation set** is needed (i.e., the training set has to be split further). Alternatively, you can use the **automatic cross-validation procedure** provided by the Surprise library; see the [GridSearchCV](https://surprise.readthedocs.io/en/stable/getting_started.html#tune-algorithm-parameters-with-gridsearchcv) example (not used here).
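To make the role of the three sets concrete, here is a minimal sketch of a train/validation/test split on plain rating triples. It is not part of the notebook's pipeline: the helper `split_three_ways`, the fractions, and the synthetic `rows` are assumptions chosen for illustration only.

```python
import numpy as np

def split_three_ways(rows, val_frac=0.2, test_frac=0.25, seed=0):
    """Shuffle once, then cut into disjoint train / validation / test parts.

    Hypothetical helper: the test part is set aside first, and the
    validation part is taken from what remains.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(rows))
    n_test = int(round(test_frac * len(rows)))
    n_val = int(round(val_frac * (len(rows) - n_test)))
    test = [rows[k] for k in order[:n_test]]
    val = [rows[k] for k in order[n_test:n_test + n_val]]
    train = [rows[k] for k in order[n_test + n_val:]]
    return train, val, test

# Synthetic (user_id, movie_id, rating) triples, made up for this example.
rows = [(u, i, 1 + (u + i) % 5) for u in range(10) for i in range(10)]
train, val, test = split_three_ways(rows)
print(len(train), len(val), len(test))
```

Hyperparameters would be compared on `val` only, and `test` would be evaluated once at the very end.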

In [4]:
from surprise.model_selection import train_test_split

# use 25% of the data as test set (chosen at random)
# random_state is fixed: the same split is produced each time this cell is executed
trainset, testset = train_test_split(data, test_size=0.25, random_state=0)

Next, we instantiate a matrix factorization model; see the [Surprise documentation on matrix factorization](https://surprise.readthedocs.io/en/stable/matrix_factorization.html). Afterwards, we fit the model on the training set.
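To give an idea of what such a model does internally, here is a minimal NumPy sketch of biased matrix factorization trained with stochastic gradient descent, where a rating is estimated as global mean + user bias + item bias + dot product of user and item factors. It follows the general idea described in the Surprise documentation for `SVD`, but it is not Surprise's implementation: the function names, the toy ratings, and the hyperparameter values are assumptions made for illustration.

```python
import numpy as np

def fit_biased_mf(ratings, n_users, n_items, n_factors=2, n_epochs=500,
                  lr=0.01, reg=0.02, seed=0):
    """Fit r_hat(u, i) = mu + b_u[u] + b_i[i] + p[u] . q[i] with plain SGD."""
    rng = np.random.default_rng(seed)
    mu = float(np.mean([r for _, _, r in ratings]))   # global mean rating
    b_u = np.zeros(n_users)                           # user biases
    b_i = np.zeros(n_items)                           # item biases
    p = rng.normal(0.0, 0.1, (n_users, n_factors))    # user factors
    q = rng.normal(0.0, 0.1, (n_items, n_factors))    # item factors

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - (mu + b_u[u] + b_i[i] + p[u] @ q[i])
            # gradient step on the regularized squared error
            b_u[u] += lr * (err - reg * b_u[u])
            b_i[i] += lr * (err - reg * b_i[i])
            p_u_old = p[u].copy()
            p[u] += lr * (err * q[i] - reg * p[u])
            q[i] += lr * (err * p_u_old - reg * q[i])
    return mu, b_u, b_i, p, q

def predict(model, u, i, low=1.0, high=5.0):
    """Estimate a rating and clip it to the rating scale."""
    mu, b_u, b_i, p, q = model
    return float(np.clip(mu + b_u[u] + b_i[i] + p[u] @ q[i], low, high))

# Toy (user index, item index, rating) triples, made up for this example.
toy_ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 2), (2, 2, 5)]
model = fit_biased_mf(toy_ratings, n_users=3, n_items=3)
train_rmse = float(np.sqrt(np.mean(
    [(r - predict(model, u, i)) ** 2 for u, i, r in toy_ratings])))
print(f"training RMSE on the toy data: {train_rmse:.3f}")
```

The sketch uses contiguous integer indices for users and items; a library such as Surprise handles the mapping from raw ids to such inner indices for you.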

In [5]:
from surprise import SVD

algo = SVD(n_factors=100, n_epochs=50, biased=True, verbose=1)

In [6]:
# fit the model on the training set
algo.fit(trainset)

Processing epoch 0
Processing epoch 1
Processing epoch 2
Processing epoch 3
Processing epoch 4
Processing epoch 5
Processing epoch 6
Processing epoch 7
Processing epoch 8
Processing epoch 9
Processing epoch 10
Processing epoch 11
Processing epoch 12
Processing epoch 13
Processing epoch 14
Processing epoch 15
Processing epoch 16
Processing epoch 17
Processing epoch 18
Processing epoch 19
Processing epoch 20
Processing epoch 21
Processing epoch 22
Processing epoch 23
Processing epoch 24
Processing epoch 25
Processing epoch 26
Processing epoch 27
Processing epoch 28
Processing epoch 29
Processing epoch 30
Processing epoch 31
Processing epoch 32
Processing epoch 33
Processing epoch 34
Processing epoch 35
Processing epoch 36
Processing epoch 37
Processing epoch 38
Processing epoch 39
Processing epoch 40
Processing epoch 41
Processing epoch 42
Processing epoch 43
Processing epoch 44
Processing epoch 45
Processing epoch 46
Processing epoch 47
Processing epoch 48
Processing epoch 49


<surprise.prediction_algorithms.matrix_factorization.SVD at 0x7f1dfcd29b50>

Finally, we compute predictions for the hold-out test set. **This set should only be used at the very end.** That is, model selection (e.g., choosing hyperparameters such as the number of latent factors, or trying out other models) has to be done on the training data only! Afterwards, we compute the root mean squared error (RMSE) to assess the quality of the model on the whole test set.

In [7]:
predictions = algo.test(testset)

In [8]:
from surprise import accuracy
accuracy.rmse(predictions)

RMSE: 0.9721


0.9720503252199726
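For reference, the RMSE is the square root of the mean squared difference between the true and the estimated ratings. A minimal sketch of the computation, using hypothetical (true rating, estimate) pairs rather than the notebook's predictions:

```python
import math

# Hypothetical (true rating, estimated rating) pairs, not taken from the notebook.
pairs = [(4.0, 3.5), (2.0, 2.5), (5.0, 4.0), (3.0, 3.0)]

def rmse(pairs):
    """Square root of the mean squared difference between truth and estimate."""
    squared_errors = [(r_ui - est) ** 2 for r_ui, est in pairs]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

print(f"RMSE: {rmse(pairs):.4f}")
```

Since each Surprise prediction carries the fields `r_ui` and `est`, the same function could be applied to `[(p.r_ui, p.est) for p in predictions]`.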

In [9]:
# take first test instance
user_id, movie_id, r_ui = testset[0]

# get a prediction for specific user and item
pred = algo.predict(user_id, movie_id, r_ui=r_ui, verbose=True)

user: 655        item: 1235       r_ui = 4.00   est = 3.21   {'was_impossible': False}
