# Recommender Systems Workshop
*Created by Jack Douglas, UWaterloo Data Science Club*
<br>

**REMINDER: You can make a copy of this notebook by going to File > Save a copy in Drive**


## Summary

In this notebook, we will go over the full implementation of a collaborative filtering recommender system and the processing of the feature vectors for content-based filtering. 

We will be using the MovieLens 100k dataset for this notebook. MovieLens has several datasets containing explicit ratings for thousands of movies from thousands of users. Read more about MovieLens here: https://movielens.org/. 

As well, we will be using the Surprise scikit package, which allows you to implement collaborative filtering recommender systems in a few lines! Read more about Surprise here: http://surpriselib.com/.


## Packages/Libraries

We install the Surprise scikit package (`scikit-surprise`) for implementing the collaborative recommender system, and import Pandas for creating dataframes, NumPy for manipulating and operating on the data, and tqdm for progress bars.

In [1]:
!pip install scikit-surprise
import pandas as pd # Data analysis library (particularly dataframes)
import numpy as np # Computation/data analysis library
from tqdm import tqdm # Progress bar library
from surprise import Dataset 
from surprise import Reader
from surprise import SVD
from surprise.model_selection import cross_validate



## Dataset

As mentioned, we will be using the MovieLens 100K dataset for this notebook (specifically the `ml-latest-small` release linked below). It contains roughly 100,000 ratings and 3,600 tag applications applied to 9,000 movies by 600 users.

Download the MovieLens 100K dataset here: https://files.grouplens.org/datasets/movielens/ml-latest-small.zip

Find more information on other recommender system datasets here: https://www.kdnuggets.com/2016/02/nine-datasets-investigating-recommender-systems.html

First, since we are working in Google Colab, we must mount our Google Drive to access the files stored there.

In [2]:
from google.colab import drive

drive.mount('/content/gdrive/')

Drive already mounted at /content/gdrive/; to attempt to forcibly remount, call drive.mount("/content/gdrive/", force_remount=True).


We are interested in the movies and ratings tables of the MovieLens 100k dataset, which are stored as CSV files. We load each one into a dataframe using the Pandas library.

In [3]:
# ATTENTION: you will have to change these file paths to wherever the datasets exist in your drive
movies = pd.read_csv("/content/gdrive/MyDrive/UW/UW Data Science Club/Recommender Systems Workshop/ml-latest-small/movies.csv")
ratings = pd.read_csv("/content/gdrive/MyDrive/UW/UW Data Science Club/Recommender Systems Workshop/ml-latest-small/ratings.csv")

Here are the first five entries of the movies and ratings dataframes:




In [4]:
movies.head()

| | movieId | title | genres |
|---|---|---|---|
| 0 | 1 | Toy Story (1995) | Adventure\|Animation\|Children\|Comedy\|Fantasy |
| 1 | 2 | Jumanji (1995) | Adventure\|Children\|Fantasy |
| 2 | 3 | Grumpier Old Men (1995) | Comedy\|Romance |
| 3 | 4 | Waiting to Exhale (1995) | Comedy\|Drama\|Romance |
| 4 | 5 | Father of the Bride Part II (1995) | Comedy |


In [5]:
ratings.head()

| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
| 0 | 1 | 1 | 4.0 | 964982703 |
| 1 | 1 | 3 | 4.0 | 964981247 |
| 2 | 1 | 6 | 4.0 | 964982224 |
| 3 | 1 | 47 | 5.0 | 964983815 |
| 4 | 1 | 50 | 5.0 | 964982931 |


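Here we pivot the ratings dataframe into a movie-by-user table so that it looks more like the examples shown in the slides. Each row is a movie, each column is a user, and an empty cell means that the user has not rated that movie: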

In [6]:
ds = ratings.pivot(index="movieId", columns='userId', values='rating')
ds.head()

movieId \ userId,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,...,571,572,573,574,575,576,577,578,579,580,581,582,583,584,585,586,587,588,589,590,591,592,593,594,595,596,597,598,599,600,601,602,603,604,605,606,607,608,609,610
1,4.0,,,,4.0,,4.5,,,,,,,,2.5,,4.5,3.5,4.0,,3.5,,,,,,3.0,,,,5.0,3.0,3.0,,,,,,,5.0,...,,4.0,5.0,,,,,,4.0,3.0,,,,5.0,,,5.0,,,4.0,,,,,,4.0,4.0,,3.0,2.5,4.0,,4.0,3.0,4.0,2.5,4.0,2.5,3.0,5.0
2,,,,,,4.0,,4.0,,,,,,,,,,3.0,3.0,3.0,3.5,,,,,,4.0,,,,,,,,,,,,,,...,,,4.5,,,,,,,,,,,,,4.0,,,,2.5,,4.0,,4.0,,,,,2.5,4.0,,4.0,,5.0,3.5,,,2.0,,
3,4.0,,,,,5.0,,,,,,,,,,,,,3.0,,,,,,,,,,,,,3.0,,,,,,,,,...,,,,,,,,,,,,,,,,,,3.0,,3.0,,,,4.0,,,,,1.5,,,,,,,,,2.0,,
4,,,,,,3.0,,,,,,,,3.0,,,,,,,,,,,,,,,,,,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,1.5,,,,,,,,,,
5,,,,,,5.0,,,,,,,,,,,,,,,,,,,,,,,,,3.0,,,,,,,,,,...,,,,,,,,,,,,,,,,,,,,2.0,,,,,,,,,,2.5,,,,3.0,,,,,,
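
The pivot above can also be pictured without pandas. The following is a minimal plain-Python sketch that turns (user, movie, rating) triples into a movie-by-user grid and measures how full it is. The triples are made up for illustration, and `to_matrix` and `density` are hypothetical helper names, not part of any library.

```python
# Toy (userId, movieId, rating) triples; the values are made up for illustration.
triples = [
    (1, 1, 4.0), (1, 3, 4.0),
    (2, 1, 3.5), (2, 2, 2.0),
    (3, 2, 5.0),
]

def to_matrix(triples):
    """Return (matrix, movie_ids, user_ids); matrix[i][j] is None where unrated."""
    movie_ids = sorted({m for _, m, _ in triples})
    user_ids = sorted({u for u, _, _ in triples})
    row = {m: i for i, m in enumerate(movie_ids)}
    col = {u: j for j, u in enumerate(user_ids)}
    matrix = [[None] * len(user_ids) for _ in movie_ids]
    for u, m, r in triples:
        matrix[row[m]][col[u]] = r
    return matrix, movie_ids, user_ids

def density(matrix):
    """Fraction of cells that hold a rating."""
    cells = [value for matrix_row in matrix for value in matrix_row]
    return sum(value is not None for value in cells) / len(cells)

matrix, movie_ids, user_ids = to_matrix(triples)
for movie_id, matrix_row in zip(movie_ids, matrix):
    print(movie_id, matrix_row)
print("density:", round(density(matrix), 2))
```

On the real data most cells are empty, which is why the pivoted table above is dominated by missing values. Filling in those empty cells is exactly the job of the recommender.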


## Collaborative Filtering

For the collaborative recommender system, we are using the Surprise scikit package. Here we create a reader with a rating scale from 0.5 to 5, because MovieLens ratings are given in half-star steps (note the 4.5 and 2.5 entries in the table above). We then create the dataset from the ratings dataframe, which contains only user behavioural data.

In [7]:
reader = Reader(rating_scale=(0.5, 5)) # MovieLens ratings run from 0.5 to 5.0 in half-star steps
# The dataframe containing the ratings. It must have three columns, corresponding to the user (raw) ids, the item (raw) ids, and the ratings, in this order.
data = Dataset.load_from_df(ratings[['userId', 'movieId', 'rating']], reader)

Recall that we are trying to have our recommender system learn the feature vectors for every movie. Surprise makes implementing this super easy! We just choose the algorithm we want to use to learn the feature vectors. Here is a list of algorithms you can choose from: https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html. We have chosen Surprise's Singular Value Decomposition (SVD) algorithm, a matrix-factorization method popularized during the Netflix Prize. Rather than solving a system of linear equations exactly, it learns a feature vector for every user and every movie by stochastic gradient descent, minimizing the regularized squared error on the known ratings. It is a strong and widely used baseline for rating prediction.
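
To make "learning the feature vectors" concrete, here is a minimal plain-Python sketch of matrix factorization trained by stochastic gradient descent. It follows the same general idea as Surprise's `SVD` (a global mean, a bias per user and per movie, and a dot product of feature vectors), but it is not Surprise's implementation. The function name `train_mf`, the hyperparameter values, and the toy ratings are all assumptions made for illustration.

```python
import random

def train_mf(ratings, n_factors=2, n_epochs=300, lr=0.02, reg=0.02, seed=0):
    """Learn biases and feature vectors by SGD on (user, item, rating) triples."""
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    mu = sum(r for _, _, r in ratings) / len(ratings)  # global mean rating
    bu = {u: 0.0 for u in users}                        # user biases
    bi = {i: 0.0 for i in items}                        # item biases
    # Small random starting feature vectors for every user and item
    p = {u: [rng.gauss(0, 0.1) for _ in range(n_factors)] for u in users}
    q = {i: [rng.gauss(0, 0.1) for _ in range(n_factors)] for i in items}

    def predict(u, i):
        return mu + bu[u] + bi[i] + sum(a * b for a, b in zip(p[u], q[i]))

    for _ in range(n_epochs):
        for u, i, r in ratings:
            err = r - predict(u, i)
            # Move each parameter a small step in the direction that shrinks the error
            bu[u] += lr * (err - reg * bu[u])
            bi[i] += lr * (err - reg * bi[i])
            for f in range(n_factors):
                puf, qif = p[u][f], q[i][f]
                p[u][f] += lr * (err * qif - reg * puf)
                q[i][f] += lr * (err * puf - reg * qif)
    return predict

# Made-up ratings: (userId, movieId, rating)
toy = [(1, 1, 4.0), (1, 3, 4.0), (2, 1, 3.5), (2, 2, 2.0), (3, 2, 5.0), (3, 3, 4.5)]
predict = train_mf(toy)
mse = sum((r - predict(u, i)) ** 2 for u, i, r in toy) / len(toy)
print("training MSE:", round(mse, 4))
print("user 1 on movie 2:", round(predict(1, 2), 2))
```

The last line asks for a rating that is not in the toy data, which is the kind of prediction the notebook makes next with Surprise.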

In [8]:
%%capture
# '%%capture' suppresses the output of this cell; as a cell magic it goes on the first line of the cell
algo = SVD() # We will use singular value decomposition (SVD) as our learning algorithm
trainset = data.build_full_trainset() # Build the training set from every known rating
algo.fit(trainset) # Train the model (i.e. learn the feature vectors)
testset = trainset.build_anti_testset() # Every (user, movie) pair that has no rating in the training set
predictions = algo.test(testset) # Predict a rating for each of those unrated pairs
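
The "anti-testset" built in the cell above is simply every user-movie pair that does not appear in the training data. Here is a minimal plain-Python sketch of that idea on made-up ratings. The helper `anti_testset` is a hypothetical stand-in written for this explanation, not the Surprise method itself; it fills the unknown rating with the global mean, which is what Surprise does when no fill value is given.

```python
def anti_testset(ratings, fill):
    """All (user, item, fill) triples for user-item pairs absent from `ratings`."""
    users = sorted({u for u, _, _ in ratings})
    items = sorted({i for _, i, _ in ratings})
    seen = {(u, i) for u, i, _ in ratings}
    return [(u, i, fill) for u in users for i in items if (u, i) not in seen]

# Made-up ratings: (userId, movieId, rating)
toy = [(1, 1, 4.0), (1, 3, 4.0), (2, 1, 3.5), (2, 2, 2.0), (3, 2, 5.0)]
global_mean = sum(r for _, _, r in toy) / len(toy)

anti = anti_testset(toy, fill=global_mean)
for user_id, movie_id, placeholder in anti:
    print(user_id, movie_id, round(placeholder, 2))
```

The number of pairs grows as users times movies minus the known ratings, so on the full dataset this step produces millions of predictions. If you only need a handful of predictions, Surprise's `algo.predict(uid, iid)` scores a single pair directly.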

Recall that user 1 rated movie 1 (Toy Story) highly and has no rating for movie 2 (Jumanji). Toy Story and Jumanji are both categorized as children's movies with adventure and fantasy. Let's see what rating our model predicts for user 1 on Jumanji:

In [9]:
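user_id = 1
movie_id = 2
# Look up the title by movieId rather than by a hard-coded row position
title = movies.loc[movies['movieId'] == movie_id, 'title'].iloc[0]
for pred in predictions:
  # Each prediction carries the user id (uid), the item id (iid) and the estimated rating (est)
  if pred.uid == user_id and pred.iid == movie_id:
    print(title + ' Prediction: ' + str(pred.est))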

Jumanji (1995) Prediction: 3.9483549274778076


We will now use cross-validation to assess the accuracy of our collaborative recommender system.

In [10]:
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True);

Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.8723  0.8688  0.8776  0.8766  0.8738  0.8738  0.0031  
MAE (testset)     0.6703  0.6668  0.6745  0.6710  0.6711  0.6707  0.0024  
Fit time          4.86    4.79    4.81    4.82    4.82    4.82    0.02    
Test time         0.18    0.18    0.15    0.15    0.18    0.17    0.01    
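
The five "Fold" columns come from 5-fold cross-validation: the ratings are shuffled and cut into five parts, and each part takes one turn as the test set while the other four are used for training. A minimal plain-Python sketch of that splitting step is shown here; `k_fold_indices` is a hypothetical helper written for this explanation, and it only illustrates the idea rather than reproducing Surprise's own splitter.

```python
import random

def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs; each sample is tested exactly once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[f::k] for f in range(k)]  # k roughly equal parts
    for f in range(k):
        test_idx = folds[f]
        train_idx = [i for g in range(k) if g != f for i in folds[g]]
        yield train_idx, test_idx

splits = list(k_fold_indices(10, k=5))
for train_idx, test_idx in splits:
    print("test:", sorted(test_idx), "train size:", len(train_idx))
```

Because every rating is held out exactly once, the mean over the folds is a fairer estimate of accuracy than an error measured on the training data.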


The mean RMSE of about 0.87 means that, on an independent set of ratings, our predictions are typically off by a little under one star. RMSE is already a square root, so it can be read directly on the rating scale. In other words, if our recommender predicts that a user will give a movie a 4, the user's actual rating will most likely be in the 3 to 5 range.
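
For reference, here is how the two reported measures are computed, as a short plain-Python sketch. The true and predicted ratings are made up for illustration, and `rmse` and `mae` are helper names chosen here.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error: square the errors, average them, take the root."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def mae(actual, predicted):
    """Mean absolute error: average size of the errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

actual = [4.0, 3.0, 5.0, 2.0]     # made-up true ratings
predicted = [3.5, 3.5, 4.0, 2.0]  # made-up predicted ratings

print("RMSE:", round(rmse(actual, predicted), 4))
print("MAE:", round(mae(actual, predicted), 4))
```

Both measures are in rating units (stars). RMSE squares the errors before averaging, so large misses count for more, which is why it is never smaller than MAE.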

## Content-based Filtering

One drawback of content-based filtering is that it is harder to implement than collaborative filtering, because the number and kind of features you can examine vary from one problem to the next. In this section of the notebook, I will go over how you would process the feature vectors before applying linear regression.

For the content-based recommender system, we are going to use the genre tags to create our feature vectors. To do this, we need to do some processing and encode what the genres are for each movie.

In [11]:
# Movie genre attribute processing
movie_profile = movies[['movieId', 'title', 'genres']].copy() # Work on a copy so the assignments below do not raise SettingWithCopyWarning
all_genres = [s.split("|") for s in movie_profile.genres.dropna()]
genres = [item.strip() for l in all_genres for item in l]
unique_genres = set(genres)
for genre in unique_genres:
  movie_profile[genre] = 0 # Start with every genre flag switched off

for idx in movie_profile.index:
  movie_genres = movie_profile.at[idx, 'genres']
  if pd.notna(movie_genres):
    for g in movie_genres.split('|'):
      movie_profile.at[idx, g.strip()] = 1 # Flag each genre the movie belongs to

movie_profile = movie_profile.drop(columns=['title', 'genres']).set_index('movieId')
movie_profile.sort_index(axis=0, inplace=True)

movie_profile.head()


| movieId | Animation | Romance | Crime | IMAX | (no genres listed) | Western | Children | Action | Documentary | Comedy | Adventure | Horror | Fantasy | Thriller | War | Drama | Musical | Mystery | Sci-Fi | Film-Noir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
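
The same 0/1 genre columns can probably be produced more compactly with pandas' `Series.str.get_dummies`, which splits each string on a separator and creates one indicator column per distinct value. The sketch below is an alternative to the loop above, not the code used in this notebook, and it runs on three movies typed in by hand from the `movies.head()` output rather than on the full file.

```python
import pandas as pd

# First three rows of the movies table, typed in by hand for illustration
sample = pd.DataFrame({
    "movieId": [1, 2, 3],
    "genres": [
        "Adventure|Animation|Children|Comedy|Fantasy",
        "Adventure|Children|Fantasy",
        "Comedy|Romance",
    ],
})

# One 0/1 column per genre found in the "genres" strings
profile = sample["genres"].str.get_dummies(sep="|")
profile.index = sample["movieId"]
print(profile)
```

Each row is the movie's feature vector, just like the rows of `movie_profile` above.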


Now that we have our feature vectors (which are notably simplified), more data processing would still be needed before applying linear regression, for example pairing each user's ratings with the corresponding movie vectors. If you want to build a content-based recommender on your own, you can use the scikit-learn library to apply the linear regression. Here is a link to the library: https://scikit-learn.org/stable/index.html.
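
To give a feel for that last step, here is a minimal plain-Python sketch of a linear regression for a single user, where the inputs are genre vectors and the targets are that user's ratings. In practice a library such as scikit-learn would do the fitting; this hand-written gradient descent only illustrates the idea. The three genre columns, the ratings, the learning rate, and the helper names `fit_user_weights` and `predict` are all assumptions made for illustration.

```python
def fit_user_weights(features, ratings, lr=0.5, n_epochs=2000):
    """Least-squares fit of rating ~ bias + w . x by batch gradient descent."""
    n_features = len(features[0])
    n = len(ratings)
    w = [0.0] * n_features
    b = 0.0
    for _ in range(n_epochs):
        # Prediction error for every rated movie with the current weights
        errors = [b + sum(wj * xj for wj, xj in zip(w, x)) - r
                  for x, r in zip(features, ratings)]
        b -= lr * sum(errors) / n
        for j in range(n_features):
            w[j] -= lr * sum(e * x[j] for e, x in zip(errors, features)) / n
    return w, b

def predict(w, b, x):
    """Predicted rating for a movie with genre vector x."""
    return b + sum(wj * xj for wj, xj in zip(w, x))

# Genre columns: [Children, Comedy, Romance]; four movies rated by one hypothetical user
features = [[1, 1, 0], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
ratings = [4.5, 4.0, 2.0, 3.0]

w, b = fit_user_weights(features, ratings)
print("weights:", [round(v, 2) for v in w], "bias:", round(b, 2))
print("unseen Children + Romance movie:", round(predict(w, b, [1, 0, 1]), 2))
```

The learned weights can be read as this user's taste: a positive weight means the genre pushes the predicted rating up, and a negative weight pulls it down.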