# Lab:  Low-Rank Approximation for Movie Recommendations

A common application of low-rank approximation is for recommender systems.  In this lab, we will create a very primitive recommendation system for movies.  Through the lab, you will learn to:

* Represent ratings data as a sparse matrix
* Evaluate the mean absolute error (MAE) using simple movie or user biases
* Create a low-rank model in Tensorflow for predicting the movie rating using `Embedding` layers
* Train the model and optimize the embedding dimension.
* Make predictions on new users.

## Loading the MovieLens Dataset

We first load some common packages.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

[GroupLens](https://grouplens.org/) is a research organization at the University of Minnesota that has done extensive work in recommendation systems among other topics.  They have excellent datasets on movie recommendations as part of their [MovieLens project](https://movielens.org/).  In this lab, we will use a relatively small dataset, `Movielens 1m` that has 1 million ratings.  If you are interested in continuing research in this area, they have much larger datasets as well.  But, this relatively small one is sufficient to illustrate the basic concepts.

To get the data, go to the webpage:

https://grouplens.org/datasets/movielens/latest/

and download and unzip the files, `ml-1m.zip`.  Alternatively, you can just run the following command which will download unzip the file for you.

In [5]:
import tqdm
import requests
import os
import zipfile

# Set the files names for movies and ratings files
ml_dir = 'ml-1m'
ratings_fn = os.path.join(ml_dir,'ratings.dat')
movies_fn = os.path.join(ml_dir,'movies.dat')

def download_file(src_url, dst_fn):
    
    if os.path.exists(dst_fn):
        print('File %s already exists' % dst_fn)
        return
    
    print('Downloading %s' % dst_fn)
    
    # Streaming, so we can iterate over the response.
    r = requests.get(src_url, stream=True)

    # Total size in MB.
    total_size = int(r.headers.get('content-length', 0)); 
    block_size = 1024
    wrote = 0 
    with open(dst_fn, 'wb') as f:
        with tqdm.notebook.tqdm(total=total_size//block_size, unit='kB', 
                           unit_scale=True, unit_divisor=1024) as pbar:
            for data in r.iter_content(block_size):
                wrote = wrote + len(data)
                pbar.update(1)
                f.write(data)
    if total_size != 0 and wrote != total_size:
        print("ERROR, something went wrong") 

# Test if all files are downloaded
files_exists = False
if os.path.exists(ml_dir):
    if os.path.exists(ratings_fn) and os.path.exists(movies_fn):
        files_exists = True

if files_exists:
    print('Files %s and %s already downloaded' % (ratings_fn, movies_fn))

else:
    # First download the zip file if needed
    src_url = 'http://files.grouplens.org/datasets/movielens/ml-1m.zip'
    dst_fn = 'ml-1m.zip'
    download_file(src_url, dst_fn)
    
    # Then, unzip the file
    print('Unzipping %s...' % dst_fn)
    zip_ref = zipfile.ZipFile(dst_fn, 'r')
    zip_ref.extractall('.')
    zip_ref.close()
    print('Unzip completed')

Downloading ml-1m.zip


ImportError: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html

Read the movies files with the `read_csv` command.  Print the first 5 entries of the dataframe.  You will see that the file has a list of movies.  Each movie has a `movieId` and `title`.

In [2]:
# TODO:  Read the movies 
import os
dir = './ml-1m/'
movies = pd.read_csv('./ml-1m/movies.dat',sep='::', names=['MovieID', 'Title', 'Genres'])
print(movies.size)
# TODO:  Use the movies.head() to display the first 5 entries

11649


Extract the following columns from the `movies` dataframe:
*  Extract the `movieId` column, convert to an `np.array` and store in `movie_ids`
*  Extract the `title` column, convert to a list (using `.tolist()`) and store in `titles`

In [3]:
# TODO:
movie_ids = movies['MovieID']
titles = movies['Title']

The following function returns the string of a movie title, given its movie id.

In [4]:
def get_movie_title(movie_id):
    I = np.where(movie_ids == movie_id)[0]
    if len(I) == 0:
        return 'unknown'
    else:
        return titles[I[0]]

Load the `ratings.dat` file into a `pandas` dataframe `ratings`.  Use the `head` method to print the first five rows of the dataframe.  This is a large file, so it may take a minute to read in.

In [5]:
# TODO
ratings = pd.read_csv('./ml-1m/ratings.dat',sep='::', names=['UserID','MovieID', 'Rating', 'Timestamp'])
ratings.head()

Unnamed: 0,UserID,MovieID,Rating,Timestamp
0,1,1193,5,978300760
1,1,661,3,978302109
2,1,914,3,978301968
3,1,3408,4,978300275
4,1,2355,5,978824291


Extract three columns from the `ratings` dataframe:

* Set `user` to `ratings['userId']`,
* Set `movie` to `ratings['movieId']`
* Set `y` to `ratings['rating']`

Convert to each to an `np.array`.  Print:

* Total number of movies (the maximum movie index)
* Total number of users
* Total number of ratings
* The average fraction of movies rated per user

You should see that only a small fraction of the movies are rated by each user.

In [6]:
# TODO
user = ratings['UserID']
movie = ratings['MovieID']
y = ratings['Rating']
print('Total number of movies: ',movie.max())
print('Total number of users: ', user.max())
print('Total number of ratings: ', y.size)

Total number of movies:  3952
Total number of users:  6040
Total number of ratings:  1000209


Our goal will be to predict the rating `y` from the indices `movie` and `user`.  We need to split the data into training and test.
Create training and test data of the form:

* Training data:  `Xtr = [usertr, movietr]` and `ytr` for approximately 75% of the samples.
* Test data:  `Xts = [userts, moviets]` and `yts` for approximately 25% of the samples.

In [7]:
# TODO
tr_size = int(y.size * 0.75)
ts_size = y.size - tr_size

usertr = user[0:tr_size]
movietr = movie[0:tr_size]
userts = user[tr_size + 1:]
moviets = movie[tr_size + 1:]

Xtr = np.transpose(np.vstack((usertr, movietr)))
ytr = y[0:tr_size]
Xts = np.transpose(np.vstack((userts, moviets)))
yts = y[tr_size + 1:]


## Simple Rating Prediction Based on Average Rating

Before we try to perform a complex algorithm for predicting a movie rating, we will first compute some simple statistics to get you familar with the data set.  First, compute the average movie rating across all movies in the training data set.

In [8]:
# TODO
y0 = np.average(ytr)

Next, find the average rating per movie id. For each movie id, `i` compute `ymean[i]`, the average rating for that movie in the training data set and `ycnt[i]`, the number of ratings the movie had.  If `ycnt[i]==0`, set `ymean[i]=y0`, where `y0` is the average overall rating. 

You will want to think about how you do this computation efficiently since the data set `ytr` has a large number of entries.  Make you sure you go over the entries in `ytr` only once.  Even if you do it efficiently, it will take a minute, so you may want to add a progress bar.

In [14]:
# TODO
# ymean = ...
ymean = [0.0]*movie_ids.max()
ycnt = [0]*movie_ids.max()
print(movie_ids.size)
for i, movieid in enumerate(movietr):
    ycnt[movieid-1] += 1
    ymean[movieid-1] = (ymean[movieid-1] * (ycnt[movieid-1] - 1) + y[i])/ycnt[movieid-1]
for i in range(0, len(ymean)):
    if ycnt[i] == 0:
        ymean[i] = y0

3883


Print all the movies that had an average rating over 4.8.  Print their titles, the average rating and the number of ratings they had.  You will see that most of the movies with very high ratings had very few ratings.

In [15]:
# TODO
print(ymean[:5])
for i, rate_mean in enumerate(ymean):
    if rate_mean > 4.8:
        print("Movie ", i, ": ", get_movie_title(i) ,"averate rating: ", rate_mean, " number of ratings: ", ycnt[i])

[4.144167758846663, 3.1821631878557874, 3.0600000000000005, 2.6576576576576576, 3.014354066985645]
Movie  556 :  War Room, The (1993) averate rating:  5.0  number of ratings:  1
Movie  577 :  Andre (1994) averate rating:  5.0  number of ratings:  1
Movie  786 :  Eraser (1996) averate rating:  5.0  number of ratings:  3
Movie  852 :  Tin Cup (1996) averate rating:  5.0  number of ratings:  1
Movie  988 :  Grace of My Heart (1996) averate rating:  5.0  number of ratings:  1
Movie  1829 :  Chinese Box (1997) averate rating:  5.0  number of ratings:  1
Movie  3171 :  Room at the Top (1959) averate rating:  5.0  number of ratings:  1
Movie  3232 :  Seven Chances (1925) averate rating:  5.0  number of ratings:  2
Movie  3279 :  Knockout (1999) averate rating:  5.0  number of ratings:  1
Movie  3880 :  Ballad of Ramblin' Jack, The (2000) averate rating:  5.0  number of ratings:  1


Now, for each `i` in the test data set, compute `yhat[i]` to be the mean rating for the movie in rating `i`.  Find the average value `|yhat[i]-yts[i]|`.  This is called the *mean absolute error* or MAE and is a common metric in evaluating recommendation predictions.  If you did everything correctly, you should get an MAE ~= 0.78.  That means that simply using the average movie rating by users will predict the rating of another user within 0.78 on average.

In [19]:
# TODO
yhat = [0] * yts.size
for i, movieid in enumerate(moviets):
    yhat[i] = ymean[movieid - 1]
mae = abs(yhat - yts)

## Building a Neural Network Recommender
We now build a neural network for predicting the ratings.  First, we load the necessary packages.

In [31]:
import tensorflow as tf

from tensorflow.keras.layers import Input, Embedding, Dot, Reshape, Dense, Flatten, Add, Lambda
from tensorflow.keras.models import Model
import tensorflow.keras.backend as K
from tensorflow.keras import regularizers
from tensorflow.keras.optimizers import Adam, SGD, RMSprop

We can now create a neural network in Tensorflow as follows:

*  Set the embedding dimension to `emb_dim=4`.  
*  Let `userid_in` and `movieid_in` be the input user and movie indices.  These can be created in Tensorflow with `Input` layers with `shape = (1,)`. 
*  The user index generates a bias `user_bias`.  Use an `Embedding` layer with `output_dim=1` followed by a `Flatten` layer.  
*  The user index also generates a weight `user_wt`.  Use a second `Embedding` layer with `output_dim=emb_dim` followed by a `Flatten` layer.  
*  The movie index generates biases `movie_bias` and `movie_wt` similar to the user bias.
*  We then make the rating prediction with `yhat = Dot(user_wt, movie_wt) + user_bias + movie_bias`. 
*  Optionally, you can add bias and weight regularization, although I found these did not help significantly.
*  Set the model to `mod = Model([userid_in, movieid_in], yhat)`.

Print a summary of the model `mod.summary()`. 


In [32]:
# TODO:
K.clear_session()
l2_reg = 1e-8
emb_dim = 4
userid_in  = Input(name='userid_in', shape=(1,))
movieid_in  = Input(name='movieid_in', shape=(1,))
useremb0 = Embedding(input_dim=usertr.size, output_dim=1,name='useremb0',
                embeddings_regularizer=regularizers.l2(l2_reg))(userid_in)
user_bias = Flatten(name='user_bias')(useremb0)
useremb1 = Embedding(input_dim=usertr.size, output_dim=emb_dim,name='useremb1',
                embeddings_regularizer=regularizers.l2(l2_reg))(userid_in)
user_wt = Flatten(name='user_wt')(useremb1)

movieemb0 = Embedding(input_dim=movietr.size, output_dim=1,name='movieemb0',
                embeddings_regularizer=regularizers.l2(l2_reg))(movieid_in)
movie_bias = Flatten(name='movie_bias')(movieemb0)
movieemb1 = Embedding(input_dim=movietr.size, output_dim=emb_dim,name='movieemb1',
                embeddings_regularizer=regularizers.l2(l2_reg))(movieid_in)
movie_wt = Flatten(name='movie_wt')(movieemb1)

yhat = Dot(name='product',axes=1)([user_wt, movie_wt]) + user_bias + movie_bias
mod = Model([userid_in, movieid_in], yhat)
mod.summary()


Model: "functional_1"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
userid_in (InputLayer)          [(None, 1)]          0                                            
__________________________________________________________________________________________________
movieid_in (InputLayer)         [(None, 1)]          0                                            
__________________________________________________________________________________________________
useremb1 (Embedding)            (None, 1, 4)         3000624     userid_in[0][0]                  
__________________________________________________________________________________________________
movieemb1 (Embedding)           (None, 1, 4)         3000624     movieid_in[0][0]                 
_______________________________________________________________________________________

Compile the model with `Adam` optimizer with a learning rate of `0.01` (I found these numbers to work out well).  Use the 
`'mean_absolute_error'` loss.  Then fit the model with 8 epochs.  Use a batch size of 1000.  

In [34]:
# TODO
opt = Adam(lr=0.01)
mod.compile(optimizer=opt, loss='mean_absolute_error', metrics=['mean_absolute_error'])
hist = mod.fit(usertr, movietr,ytr,epochs=8,batch_size=1000, verbose=0, validation_data=(userts, moviets,yts))

InternalError: 2 root error(s) found.
  (0) Internal:  Blas xGEMMStridedBatched launch failed : a.shape=[1000,1,4], b.shape=[1000,4,1], m=1, n=1, k=4, batch_size=1000
	 [[node functional_1/product/MatMul (defined at <ipython-input-34-32931e3bc2d0>:4) ]]
	 [[AddN_4/_70]]
  (1) Internal:  Blas xGEMMStridedBatched launch failed : a.shape=[1000,1,4], b.shape=[1000,4,1], m=1, n=1, k=4, batch_size=1000
	 [[node functional_1/product/MatMul (defined at <ipython-input-34-32931e3bc2d0>:4) ]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_2028]

Function call stack:
train_function -> train_function


Print the training and test loss as a function of the epochs.  If you did it correctly the final test loss should be around  0.71 and the training loss should be 0.68.  This is a little better than the MAE you get just using the average movie rating.

In [14]:
# TODO

## Making Predictions!

Select a random user, `user_id`.  Then, for each movie index, use the model to predict the ratings. Set the predictions to `yhat`.

In [15]:
# TODO
#    user_id = ...
#    yhat = mod.predict(...)

Print the names of the movies with the top 10 predicted ratings for the user as well as the average rating that those movies had.  You will see that the network may predict ratings above 5!  We could have avoided this by limiting the output.

In [16]:
# TODO

## Bonus:  Optimizing the Embedding Dimension

You can try to optimize the embedding dimension.  Try different dimensions from 0 to 8.  

In [17]:
# TODO


Plot the training and test loss as a function of the embedding layer. We see we get a minimum with an embedding dimension around 4 or 5.