# BLU11 - Personalized recommenders - Exercise notebook

In [None]:
import os
import hashlib # for grading purposes
import json
import pandas as pd
import numpy as np

from scipy.sparse import csr_matrix

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

In this exercise notebook, you'll create a collaborative and a content-based recommender system. We'll be using a standard dataset for recommendations, the [MovieLens](https://grouplens.org/datasets/movielens/) dataset. Our data is adapted from the smallest version of this dataset.

We have 3 data files.

The `ratings.csv` has 70,584 ratings provided by 147 users for 9,298 movies.

In [None]:
df_ratings = pd.read_csv(os.path.join("data", "ratings.csv"))
df_ratings.head()

The `movies.csv` has the movie metadata with `movieId`, `title`, `genres`, and `year` for 9,742 movies.

In [None]:
df_movies = pd.read_csv(os.path.join("data", "movies.csv"))
df_movies = df_movies.set_index("movieId")
df_movies.head()

The `tags.csv` is another metadata file with 3.5k tags provided by 49 users for 1,551 movies.

In [None]:
df_tags = pd.read_csv(os.path.join("data", "tags.csv"))
df_tags.head()

## Exercise 1 - The ratings matrix
In this exercise you will build the ratings matrix and a few helper functions.

### Exercise 1.1 - Create the ratings dataframe

Transform the `df_ratings` dataframe from the long to the wide form. The dataframe should have the `userId` as the index, the `movieId`s as the column names, and the corresponding `rating` as the values. Make sure that the index and column names are in ascending order. Assign the transformed dataframe to a variable called `df_ratings_transformed`.
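
As a rough illustration of the long-to-wide reshape described above, here is a minimal sketch on a made-up toy dataframe. The `toy_` names and values are hypothetical and are not part of the exercise data.

```python
import pandas as pd

# Toy long-form ratings (made-up values, not the MovieLens data).
toy_long = pd.DataFrame({
    "userId":  [7, 7, 3, 3],
    "movieId": [50, 10, 10, 20],
    "rating":  [4.0, 3.5, 5.0, 2.0],
})

# One row per user, one column per movie; unrated pairs become NaN.
toy_wide = toy_long.pivot(index="userId", columns="movieId", values="rating")

# Sort explicitly so the ordering does not depend on the input order.
toy_wide = toy_wide.sort_index(axis=0).sort_index(axis=1)
print(toy_wide)
```

Note that every user/movie pair without a rating shows up as `NaN` in the wide form.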

In [None]:
# df_ratings_transformed = ...
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert df_ratings_transformed.shape == (147, 9298), 'The shape of the transformed dataframe is not correct.'
assert hashlib.sha256(json.dumps(' '.join([str(i) for i in df_ratings_transformed.columns])).encode()).hexdigest() == \
'98c4241a6e147b336a9fd944f8c0ec7079f4f25ede42e93c870da6dc821da183', 'The column names are not correct.'
assert hashlib.sha256(json.dumps(' '.join([str(i) for i in df_ratings_transformed.index])).encode()).hexdigest() == \
'8c9ebdf4f3c63ef2d32811a7254ff46844abe4779142a530b242bf10ac79b79b', 'The index is not correct.'
assert 10*(df_ratings_transformed.sum().sum()) == 2416885, 'The content of the dataframe is not correct.'
df_ratings_transformed.head()

### Exercise 1.2 - Create utility functions

As you can see in the `df_ratings_transformed` dataframe, neither the `userId`s nor the `movieId`s form a continuous sequence of integers. We'll be working with matrices, which have no column names or index, so we need to know how to translate the row/column positions to `userId`s/`movieId`s.

Create a set of utility functions that will transform the given `userId`/`movieId` to a row/column number and vice versa. We will refer to the row/column numbers as `index` in the functions. The functions should:
- Accept a single `userId`/`movieId` or row/column index.
- Return the corresponding id, if accepting an index.
- Return the corresponding index, if accepting an id.

Below are the arrays of `userId`s/`movieId`s extracted from the row/column indexes of the dataframe. The correspondence between the array index and the `userId`s/`movieId`s is the same as in the dataframe.
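
The translation between ids and positions can be sketched on a small made-up array. The helper names `toy_index_of` and `toy_id_at` are hypothetical, and this is only one possible approach (a dictionary lookup would work just as well).

```python
import numpy as np

# Hypothetical sorted ids; note that they are not consecutive integers.
toy_ids = np.array([3, 7, 12, 40])

def toy_index_of(item_id, id_array):
    """Return the position of item_id in id_array."""
    matches = np.where(id_array == item_id)[0]
    return int(matches[0])

def toy_id_at(index, id_array):
    """Return the id stored at the given position."""
    return int(id_array[index])

print(toy_index_of(12, toy_ids), toy_id_at(3, toy_ids))
```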

In [None]:
user_id_array = df_ratings_transformed.index.to_numpy()
movie_id_array = df_ratings_transformed.columns.to_numpy()

In [None]:
def get_user_index(user_id, user_id_array=user_id_array):
    """
    Find the position of the user_id in user_id_array.
    
    Parameters:
        user_id (int)
        user_id_array (np.ndarray)
    
    Returns:
        user_index (int)
    """
    # YOUR CODE HERE
    raise NotImplementedError()

def get_user_id(user_index, user_id_array=user_id_array):
    """
    Find the user_id at the user_index position in user_id_array.
    
    Parameters:
        user_index (int)
        user_id_array (np.ndarray)
    
    Returns:
        user_id (int)
    """
    # YOUR CODE HERE
    raise NotImplementedError()

def get_movie_index(movie_id, movie_id_array=movie_id_array):
    """
    Find the position of movie_id in movie_id_array.
    
    Parameters:
        movie_id (int)
        movie_id_array (np.ndarray)
    
    Returns:
        movie_index (int)
    """
    # YOUR CODE HERE
    raise NotImplementedError()

def get_movie_id(movie_index, movie_id_array=movie_id_array):
    """
    Find the movie_id at the movie_index position in movie_id_array.
    
    Parameters:
        movie_index (int)
        movie_id_array (np.ndarray)
    
    Returns:
        movie_id (int)
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
user_index_45 = get_user_index(45)
user_id_45 = get_user_id(45)
movie_index_87 = get_movie_index(87)
movie_id_98 = get_movie_id(98)
assert hashlib.sha256(json.dumps(str(user_index_45)).encode()).hexdigest() == \
'f9cacf3cb91a12e03bc4546834f95a50a4c5fe02276ac260148ea9296c442d39', 'The get_user_index function does not return the correct value.'
assert hashlib.sha256(json.dumps(str(user_id_45)).encode()).hexdigest() == \
'42af87c97cab24c40fe4037f1fda64f5bd68e0b99fd11d3fca7c7f43aa7d6d68', 'The get_user_id function does not return the correct value.'
assert hashlib.sha256(json.dumps(str(movie_index_87)).encode()).hexdigest() == \
'1191917a48bb1c9571801d5e42e51928936befac8fccbae64585c9f861b2e6db', 'The get_movie_index function does not return the correct value.'
assert hashlib.sha256(json.dumps(str(movie_id_98)).encode()).hexdigest() == \
'96fc2701d1fa4b980855efd337725f91ad3f3626a8a384810157eec58df67117', 'The get_movie_id function does not return the correct value.'

### Exercise 1.3 - Create the ratings matrix

Implement a function called `create_ratings_matrix` that accepts the dataframe `df_ratings_transformed` and transforms it into a csr matrix.
The order of users and movies in the `df_ratings_transformed` dataframe should be maintained in the ratings matrix. Fill the missing values with zeros.
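
To illustrate the conversion on something small, the sketch below builds a CSR matrix from a made-up wide dataframe. The `toy_` names are hypothetical. Filling with zeros first matters, because zeros are not stored in the sparse matrix while `NaN` values would be.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy wide-form ratings with missing values (made-up data).
toy_wide = pd.DataFrame(
    [[4.0, np.nan, 3.5], [np.nan, 2.0, np.nan]],
    index=[3, 7],
    columns=[10, 20, 50],
)

# Replace NaN with 0, then convert; row and column order is preserved.
toy_R = csr_matrix(toy_wide.fillna(0).to_numpy())
print(toy_R.shape, toy_R.nnz)
```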

In [None]:
def create_ratings_matrix(df_ratings_transformed):
    """
    Creates a rating matrix from df_ratings_transformed, keeping the same organization of movies and users.
    
    Parameters:
        df_ratings_transformed (pd.DataFrame)
    
    Returns:
        R (csr_matrix): the ratings matrix created from the input dataframe
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
R = create_ratings_matrix(df_ratings_transformed)
assert R.shape == df_ratings_transformed.shape, 'The shape of the matrix is not correct.'
assert R[0].sum() == 1013, 'The values of the ratings matrix are not correct.'
assert 10*R.sum() == 2416885, 'The values of the ratings matrix are not correct.'
assert hashlib.sha256(json.dumps(' '.join([str(i) for i in R.nonzero()[0]])).encode()).hexdigest() == \
'00a39b75a7be50fd0d3f982bc2dfb840ee0cc9aa4999d8724ac0d059e0cf3948', 'Did you fill in the NA values?'
assert hashlib.sha256(json.dumps(' '.join([str(i) for i in R.nonzero()[1]])).encode()).hexdigest() == \
'11c6f9e1114b17663292f8f4602e5206d7c9697fc5d183e9a5b5ad8639b7cc27', 'Did you fill in the NA values?'

## Exercise 2 - Collaborative filtering

Now that we have the ratings matrix, let's use it to create collaborative recommendations. Before we start, we're going to merge the dataframe `df_ratings` with `df_movies` because we will need it later on. The merged dataframe is called `df_ratings_movies`:

In [None]:
df_ratings_movies = pd.merge(df_ratings, df_movies, on="movieId", how="left").sort_values("rating", ascending=False)
df_ratings_movies.head()

### Exercise 2.1 - Calculate similarities

Create a function named `calculate_similarities` that accepts a ratings matrix and a string named `similarity_type`. The function should return a matrix of all mutual user or item cosine similarities according to the `similarity_type` parameter.
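
The difference between user and item similarities comes down to which axis of the ratings matrix is compared. A minimal sketch on a made-up 2 x 3 matrix, assuming scikit-learn's `cosine_similarity` with `dense_output=False` (the `toy_` names are hypothetical):

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Toy ratings: 2 users (rows) x 3 items (columns).
toy_R = csr_matrix(np.array([
    [4.0, 0.0, 2.0],
    [0.0, 3.0, 0.0],
]))

# Rows are users, so comparing rows gives a (2, 2) user-user matrix.
toy_user_sim = csr_matrix(cosine_similarity(toy_R, dense_output=False))

# Transposing makes items the rows, giving a (3, 3) item-item matrix.
toy_item_sim = csr_matrix(cosine_similarity(toy_R.T, dense_output=False))

print(toy_user_sim.toarray())
print(toy_item_sim.toarray())
```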

In [None]:
def calculate_similarities(ratings_matrix, similarity_type='users'):
    """
    Get the cosine similarity between users or items.
    
    Parameters:
        ratings_matrix (csr_matrix): ratings matrix
        similarity_type (str): "users" or "items"

    Returns:
        similarities (csr_matrix): sparse matrix of the cosine similarity between all users or items
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
user_similarities = calculate_similarities(ratings_matrix=R, similarity_type="users")
assert isinstance(user_similarities, csr_matrix), 'The result should be a csr_matrix.'
np.testing.assert_almost_equal(user_similarities.sum(), 4811.96805343766, decimal=2, 
                               err_msg='The calculated user similarities are not correct.')

item_similarities = calculate_similarities(ratings_matrix=R, similarity_type="items")
assert isinstance(item_similarities, csr_matrix), 'The result should be a csr_matrix.'
np.testing.assert_almost_equal(item_similarities.sum(), 6325401.184577281, decimal=2, 
                               err_msg='The calculated item similarities are not correct.')

### Exercise 2.2 - Nearest neighbor

Now that we have the similarities, we can start looking for the recommendations. Let's say we want recommendations for a user called John with the id `user_id`. In the first step, we need to find the user who is the most similar to John. Build a function called `get_closest_user_id` that accepts the similarity matrix `user_similarities` and John's `user_id` and returns the id of the user who is most similar to John.
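
One detail worth keeping in mind is that every user has a similarity of 1 with themselves, so the user's own entry has to be excluded before taking the maximum. The sketch below shows this on a made-up similarity matrix; `toy_closest_user_id` and its extra `user_ids` parameter are hypothetical and differ from the signature used in the exercise.

```python
import numpy as np
from scipy.sparse import csr_matrix

toy_user_ids = np.array([3, 7, 12])  # hypothetical user ids
toy_sim = csr_matrix(np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.4],
    [0.9, 0.4, 1.0],
]))

def toy_closest_user_id(sim, user_id, user_ids):
    """Return the id of the most similar other user."""
    row = int(np.where(user_ids == user_id)[0][0])
    scores = sim[row].toarray().ravel().copy()
    scores[row] = -np.inf  # ignore the user's similarity with themselves
    return int(user_ids[np.argmax(scores)])

print(toy_closest_user_id(toy_sim, 3, toy_user_ids))
```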

In [None]:
def get_closest_user_id(user_similarities, user_id):
    """
    Return the id of the closest user to user_id.
    
    Parameters:
        user_similarities (csr_matrix): matrix of similarities between all users
        user_id (int): John's id for whom we want to find the closest neighbor
    
    Returns:
        closest_user_id (int): the id closest to user_id
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    return closest_user_id

In [None]:
neighbor_25=get_closest_user_id(user_similarities, 25)
assert hashlib.sha256(json.dumps(str(neighbor_25)).encode()).hexdigest() == \
'f20e4586c63ba3b2c06a97c4e585acea4a2977c3d8d81dc2d1f2275439ad90a7', 'Not correct.'
neighbor_414=get_closest_user_id(user_similarities, 414)
assert hashlib.sha256(json.dumps(str(neighbor_414)).encode()).hexdigest() == \
'5705a5b4b12fba416aa77cbcbc61d27c5618b10b606d4fb6eb5aaf21be95bec8', 'Not correct.'

### Exercise 2.3 - Get recommendations from the closest user

Now we want to look at the movies that John's closest neighbor has watched and select recommendations for John. Create a function called `get_closest_user_recommendations` that accepts the `df_ratings_movies` dataframe and two user ids, John's and his closest neighbor's (`user_id` and `closest_id`). It should return a dataframe with the movies that the closest neighbor rated higher than 2 and that John hasn't watched yet. The dataframe should have the columns `movieId`, `title`, `genres`, and `year`.
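
The filtering logic can be sketched on a small made-up dataframe: keep the neighbor's rows with a rating above 2, then drop the movies the target user has already rated. All `toy_` names and values below are hypothetical.

```python
import pandas as pd

# Toy merged ratings/movies table (made-up data); user 1 plays John, user 2 the neighbor.
toy_df = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 2],
    "movieId": [10, 20, 10, 30, 40],
    "rating":  [4.0, 3.0, 5.0, 4.5, 1.0],
    "title":   ["A", "B", "A", "C", "D"],
    "genres":  ["Drama", "Comedy", "Drama", "Action", "Horror"],
    "year":    [1999, 2001, 1999, 2010, 1985],
})

# Movies the target user has already rated.
toy_seen = toy_df.loc[toy_df["userId"] == 1, "movieId"]

# Neighbor's movies with a rating strictly greater than 2.
toy_neighbor = toy_df[(toy_df["userId"] == 2) & (toy_df["rating"] > 2)]

# Keep only the unseen ones and the requested columns.
toy_recs = toy_neighbor.loc[
    ~toy_neighbor["movieId"].isin(toy_seen),
    ["movieId", "title", "genres", "year"],
]
print(toy_recs)
```

Here movie 10 is dropped because user 1 has seen it, and movie 40 is dropped because the neighbor rated it below the threshold.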

In [None]:
def get_closest_user_recommendations(df_ratings_movies, user_id, closest_id):
    """
    Gets a dataframe with movies that closest_id has seen with ratings higher than 2
    that user_id has not seen.
    
    Parameters:
        df_ratings_movies (pd.DataFrame): dataframe of ratings and movies information
        user_id (int): John's id for whom we want the recommendations
        closest_id (int): id of John's nearest neighbor
    
    Returns:
        df_recommended_movies (pd.DataFrame): dataframe of selected movie recommendations with columns
                                      movieId, title, genres, year
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    return df_recommended_movies

In [None]:
user_recommendations = get_closest_user_recommendations(df_ratings_movies, user_id=25, closest_id=30)
assert isinstance(user_recommendations, pd.DataFrame), 'The result should be a dataframe.'
assert user_recommendations.shape==(5, 4), 'The shape of the dataframe is not correct.'
assert list(user_recommendations.columns.sort_values()) == ['genres', 'movieId', 'title', 'year'], \
'The columns of the dataframe are not correct.'
assert user_recommendations.movieId.sum() == 5665, 'The selected movieIds are not correct.'
assert hashlib.sha256(json.dumps(' '.join(user_recommendations.title.sort_values())).encode()).hexdigest() == \
'903d15630ec24d3c8a029df82d9ab3577614494879cd24cb409dc5f0ef4f901a', 'The selected movie titles are not correct.'
user_recommendations

### Exercise 2.4 - Get user predictions with collaborative filtering
Now that we have recommendations for John, let's get recommendations for all users. Build a function named `make_user_predictions` that predicts each user's ratings for the movies that haven't been rated by that user yet, based on the ratings and user similarities. The function should take the similarity matrix `user_similarities` and the ratings matrix `R` and return a matrix of predictions.
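
The text does not spell out the prediction formula, so the sketch below assumes one common choice: a similarity-weighted sum of all users' ratings, normalized by the sum of absolute similarities, with predictions for already rated pairs set to zero. The normalization expected by the checks in this notebook may differ. Dense arrays are used only to keep the toy example readable, and all `toy_` names are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy ratings (3 users x 3 items) and toy user similarities (made-up values).
toy_R = np.array([
    [5.0, 0.0, 0.0],
    [4.0, 2.0, 0.0],
    [0.0, 4.0, 3.0],
])
toy_S = np.array([
    [1.0, 0.5, 0.5],
    [0.5, 1.0, 0.0],
    [0.5, 0.0, 1.0],
])

# Assumed formula: weighted sum of ratings divided by the sum of similarities.
toy_weighted_sum = toy_S @ toy_R
toy_norm = np.abs(toy_S).sum(axis=1, keepdims=True)
toy_pred = toy_weighted_sum / toy_norm

# Remove predictions for pairs that already have a rating.
toy_pred[toy_R > 0] = 0.0
toy_pred = csr_matrix(toy_pred)
print(toy_pred.toarray())
```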

In [None]:
def make_user_predictions(user_similarities, R):
    """
    Makes collaborative predictions for all users based on similarities.
    
    Parameters:
        user_similarities (csr_matrix): similarity matrix with the shape (n_users, n_users)
        R (csr_matrix): ratings matrix with the shape (n_users, n_items)
    
    Returns:
        user_predictions (csr_matrix): matrix of predictions with the shape (n_users, n_items)
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    return user_predictions

In [None]:
user_predictions = make_user_predictions(user_similarities, R)
assert isinstance(user_predictions, csr_matrix), 'The result should be a sparse matrix.'
assert user_predictions.shape==(147, 9298), 'The shape of the matrix is not correct.'
assert user_predictions.nnz==1296222, 'Did you remove the predictions for the existing ratings?'
np.testing.assert_almost_equal(user_predictions[32].sum(), 28563.954242858323, decimal=3,
err_msg='The values of the matrix are not correct.')
np.testing.assert_almost_equal(user_predictions[103].sum(), 27871.43147298375, decimal=3,
err_msg='The values of the matrix are not correct.')

### Exercise 2.5 - Get top n movies

The next step is to select the best predictions for the recommendations. Create a function named `get_top_n` that takes the matrix of predictions `user_predictions`, a user id `user_id`, and the number of recommendations `n` and returns the top n unseen movies with the highest predicted rating for the given user.
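
Selecting the top n entries of a row can be sketched with `np.argsort` on a made-up prediction matrix. The helper `toy_top_n` is hypothetical; it takes a row position and an explicit id array rather than the `user_id` expected by the exercise.

```python
import numpy as np
from scipy.sparse import csr_matrix

toy_movie_ids = np.array([10, 20, 30, 40])  # hypothetical movie ids
toy_pred = csr_matrix(np.array([
    [0.0, 3.2, 4.8, 1.1],
    [2.5, 0.0, 0.0, 4.0],
]))

def toy_top_n(pred, row, movie_ids, n=2):
    """Return the ids of the n highest-scoring movies in the given row."""
    scores = pred[row].toarray().ravel()
    order = np.argsort(-scores)[:n]  # positions sorted by descending score
    return [int(movie_ids[i]) for i in order]

print(toy_top_n(toy_pred, 0, toy_movie_ids))
```

Because rated movies carry a prediction of zero, they naturally fall to the bottom of the ranking.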

In [None]:
def get_top_n(user_predictions, user_id, n=10):
    """
    Returns the top n movies for a given user.
    
    Parameters:
        user_predictions (csr_matrix): matrix of ratings predictions with the shape (n_users, n_items)
        user_id (int): the user for which to select the recommendations
        n (int): number of recommendations to select
    
    Returns:
        movie_ids (list): list of recommended movie ids, from the most to the least recommended
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    return movie_ids

In [None]:
top_movies_ids = get_top_n(user_predictions,25)
assert isinstance(top_movies_ids, list), 'The result should be a list.'
assert len(top_movies_ids)==10, 'The length of the list is not correct.'
assert user_predictions[get_user_index(25), [get_movie_index(i) for i in top_movies_ids]].sum() == 50, 'The selected movies are not correct.'
df_movies.loc[top_movies_ids]

## Exercise 3 - Content-based filtering

Now let's make predictions based on the characteristics of the items. We will use the information from the `tags.csv` file that we loaded into the `df_tags` dataframe.

In [None]:
df_tags.head()

### Exercise 3.1 - Processing the tags

Since we can have several tags for one movie, we need to join them into a single string. We also need to do other text processing operations.

Create a function named `process_movie_tags` that applies the following processing to the tags in the `df_tags` dataframe:
- Removes whitespace from multi-word tags.
- Lowercases the tags.
- Joins all the tags for the same `movieId` into a single string, with tags separated by a space.

The function should return the processed tag strings in a pandas series with `movieId` as the index.
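
The three processing steps can be sketched on a made-up tags dataframe. The `toy_` names and values are hypothetical, and the exact output expected by the checks may depend on details not shown here.

```python
import pandas as pd

# Toy tags table with the same columns as df_tags (made-up data).
toy_tags = pd.DataFrame({
    "userId":  [1, 2, 2],
    "movieId": [10, 10, 20],
    "tag":     ["Sci Fi", "Classic", "Dark Comedy"],
})

# Remove inner spaces so a multi-word tag becomes one token, then lowercase.
toy_cleaned = toy_tags["tag"].str.replace(" ", "", regex=False).str.lower()

# Join all tags of the same movie into one space-separated string.
toy_movie_tags = toy_cleaned.groupby(toy_tags["movieId"]).apply(" ".join)
print(toy_movie_tags)
```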

In [None]:
def process_movie_tags(df):
    """
    Preprocesses movie tags - joins multiword tags, lowercases, 
    groups all tags for the same movie into one string, separated by spaces.
    
    Parameters:
        df (pd.DataFrame): original dataframe with the tags with columns userId, movieId, tag
              
    Returns:
        tags (pd.Series): series with the processed tags and movieId as the index
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    return tags

In [None]:
movie_tags = process_movie_tags(df_tags).sort_index()
assert isinstance(movie_tags, pd.Series), 'The results should be a pandas series.'
assert movie_tags.shape[0] == 1551, 'The length of the series is not correct.'
assert hashlib.sha256(json.dumps(' '.join([str(i) for i in movie_tags.index])).encode()).hexdigest() == \
'50922d202f2772b749436b2b90dd30ad74b498c5a8f460b37a56c82fa7edcfdb', 'The index of the series is not correct.'
assert hashlib.sha256(json.dumps(','.join(movie_tags)).encode()).hexdigest() == \
'f3f8dcfb169e3ad2b7b253687a7a2d6620dbbb136315df170840b0b3a59acb32', 'The processed tags are not correct.'

### Exercise 3.2 - Add more information about movies
To be able to use all the metadata, we need to do more preprocessing.
1. From the dataframe `df_movies`, create a new dataframe named `df_movies_processed`, with `movieId` as the index.
1. From the column `year`, create a new column named `decade` with the help of the function `extract_decade` defined below. Drop the column `year`.
1. Clean the column `genres`: lowercase, remove spaces from the multi-word genres and separate the genres with a space.
1. Join `df_movies_processed` with the series `movie_tags` and name the new column `tags`. Replace any missing tags with empty strings.
1. From `df_movies_processed`, select only the movies that have ratings and store them in the dataframe `df_movies_processed_w_ratings` and sort the dataframe by the index.   
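
Steps 3 and 4 above can be sketched on a made-up movies table. The sketch assumes that genres are separated by a `|` character, which is an assumption about the file format and should be checked against the output of `df_movies.head()`. All `toy_` names are hypothetical.

```python
import pandas as pd

# Toy movies table indexed by movieId (made-up data, assumed "|" separator).
toy_movies = pd.DataFrame(
    {"title": ["A", "B"], "genres": ["Sci-Fi|Film Noir", "Comedy"]},
    index=pd.Index([10, 20], name="movieId"),
)
toy_movie_tags = pd.Series(["scifi classic"], index=pd.Index([10], name="movieId"))

# Lowercase, remove spaces inside multi-word genres, separate genres with a space.
toy_movies["genres"] = (
    toy_movies["genres"]
    .str.lower()
    .str.replace(" ", "", regex=False)
    .str.replace("|", " ", regex=False)
)

# Join on the index; movies without tags get NaN, which is replaced by "".
toy_movies = toy_movies.join(toy_movie_tags.rename("tags"))
toy_movies["tags"] = toy_movies["tags"].fillna("")
print(toy_movies)
```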

In [None]:
def extract_decade(value):
    try: 
        decade = int(10*round(value/10))
        return str(decade)
    except:
        return ""

In [None]:
df_movies_processed = df_movies.copy()
# df_movies_processed_w_ratings = ...
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert isinstance(df_movies_processed_w_ratings, pd.DataFrame), 'The result should be a dataframe.'
assert df_movies_processed_w_ratings.shape == (9298, 4), 'The shape of the dataframe is not correct.'
assert {'genres', 'decade', 'tags', 'title'}.issubset(df_movies_processed_w_ratings.columns), 'The column names are not correct.'
assert hashlib.sha256(json.dumps(' '.join([str(i) for i in df_movies_processed_w_ratings.index])).encode()).hexdigest() == \
'98c4241a6e147b336a9fd944f8c0ec7079f4f25ede42e93c870da6dc821da183', 'The index of the dataframe is not correct.'
assert hashlib.sha256(json.dumps(' '.join(df_movies_processed_w_ratings.genres)).encode()).hexdigest() == \
'c0271e0fc65b536e43586a792f7a183eaa6d6485c47754ca4bdbd552547aa8af', 'The genres column is not correct.'
assert hashlib.sha256(json.dumps(' '.join(df_movies_processed_w_ratings.decade)).encode()).hexdigest() == \
'56d3ce49ac8527b22eac25da092d37688bcb12f430b9a87d1ac63f77e0113d95', 'The decade column is not correct.'
assert (df_movies_processed_w_ratings.tags=="").sum() == 7769, 'Did you fill the missing tag values?'
assert hashlib.sha256(json.dumps(' '.join(df_movies_processed_w_ratings.tags)).encode()).hexdigest() == \
'1f6c83f0885a736e201c9657ae81afc18f147477dfd47c34f76cc59624bf7df3', 'The tags column is not correct.'
df_movies_processed_w_ratings.head()

### Exercise 3.3 - Calculate the item profiles
#### Exercise 3.3.1 - Tf-idf vectorization

Create a function named `get_tf_idf` that accepts a pandas series where the values are of type `str` and returns the corresponding tf-idf vectorization of that series.
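
For intuition, here is a minimal sketch of tf-idf vectorization on a made-up series, assuming `TfidfVectorizer` with its default settings (the `toy_` names are hypothetical). An empty string simply produces an all-zero row.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents; the second one is empty, like a movie without tags.
toy_docs = pd.Series(["scifi classic", "", "darkcomedy classic"])

toy_vectorizer = TfidfVectorizer()
toy_tf_idf = toy_vectorizer.fit_transform(toy_docs)  # sparse (n_docs, n_terms)

print(sorted(toy_vectorizer.vocabulary_))
print(toy_tf_idf.toarray().round(3))
```

With the default tokenizer, punctuation splits tokens (for example, `sci-fi` becomes `sci` and `fi`) and single-character tokens are dropped, which is one reason to clean the text beforehand.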

In [None]:
def get_tf_idf(doc_series):
    """
    Generates the tf-idf vectorization of doc_series.
    
    Parameters:
        doc_series (pd.Series): series of strings with the shape (n_items, )
              
    Returns:
        doc_tf_idf (csr_matrix): the input series processed by the tf-idf vectorizer
                                 with the shape (n_items, n_tfidf_features)
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    return doc_tf_idf

In [None]:
tag_tf_idf  = get_tf_idf(df_movies_processed_w_ratings["tags"])
assert isinstance(tag_tf_idf,csr_matrix), 'The result should be a sparse matrix.'
assert tag_tf_idf.shape == (9298, 1458), 'The shape of the matrix is not correct.'
np.testing.assert_almost_equal(tag_tf_idf.sum(), 2051.1163566437426, decimal=2, 
                               err_msg='The values of the matrix are not correct.')

#### Exercise 3.3.2 - Calculate item profiles

Let's now create the item profiles from the `tags`, `genres`, and `decade` information. 

Create a function named `get_item_profiles` that accepts the dataframe `df_movies_processed_w_ratings` and returns an item profile CSR matrix. The function should calculate tf-idf for the `tags`, `genres`, and `decade` columns, then horizontally concatenate the three resulting CSR matrices in this order and return the result.
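
Horizontal concatenation of sparse matrices can be sketched with `scipy.sparse.hstack` on three small made-up matrices (the `toy_` names are hypothetical). Passing `format="csr"` is needed because `hstack` returns a COO matrix by default.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Three toy feature blocks for the same 2 items (made-up values).
toy_a = csr_matrix(np.array([[1.0, 0.0], [0.0, 2.0]]))
toy_b = csr_matrix(np.array([[3.0], [0.0]]))
toy_c = csr_matrix(np.array([[0.0, 4.0, 0.0], [5.0, 0.0, 0.0]]))

# Columns are appended left to right in the order given.
toy_combined = hstack([toy_a, toy_b, toy_c], format="csr")
print(toy_combined.toarray())
```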

In [None]:
def get_item_profiles(df=df_movies_processed_w_ratings):   
    """
    Creates the item profiles from tf-idf features from the tags, 
    genres, and decade columns of the input dataframe.
    
    Parameters:
        df (pd.DataFrame): the df_movies_processed_w_ratings dataframe
              
    Returns:
        item_profiles (csr_matrix): item profiles matrix with the shape (n_items, n_tfidf_features)
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    return item_profiles

In [None]:
item_profiles = get_item_profiles(df_movies_processed_w_ratings)
assert isinstance(item_profiles, csr_matrix), 'The result should be a csr_matrix.'
assert item_profiles.shape == (9298, 1493), 'The shape of the matrix is not correct.'
np.testing.assert_almost_equal(item_profiles.sum(), 24961.948215625787, decimal=3, 
                               err_msg='The values of the matrix are not correct.')

### Exercise 3.4 - User profiles
The next step is to transform the user vectors in the ratings matrix to user profiles by projecting them into the movie attribute space. Create a function called `make_user_profiles` that accepts the ratings matrix and the item profiles matrix and calculates the user profiles.
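
One way to read "projecting into the movie attribute space" is a plain matrix product between the ratings matrix and the item profiles. That reading is an assumption here, and the sketch below applies it to made-up toy matrices without any additional normalization (the `toy_` names are hypothetical).

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy ratings: 2 users x 3 items; toy item profiles: 3 items x 2 features.
toy_R = csr_matrix(np.array([
    [5.0, 0.0, 1.0],
    [0.0, 2.0, 0.0],
]))
toy_items = csr_matrix(np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]))

# Each user profile is the rating-weighted sum of the profiles of rated items.
toy_user_profiles = toy_R @ toy_items  # shape (n_users, n_features)
print(toy_user_profiles.toarray())
```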

In [None]:
def make_user_profiles(ratings, item_profiles):
    """
    Calculate the user profiles - user vectors projected into the movie attribute space.
    
    Parameters:
        ratings (csr_matrix): ratings matrix with the shape (n_users, n_items)
        item_profiles (csr_matrix): item profiles matrix with the shape (n_items, n_tfidf_features)
    
    Returns:
        user_profiles (csr_matrix): user profiles matrix with the shape (n_users, n_tfidf_features)
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    return user_profiles

In [None]:
user_profiles = make_user_profiles(R, item_profiles)
assert isinstance(user_profiles, csr_matrix), 'The result should be a csr matrix.'
assert user_profiles.shape == (147, 1493), 'The shape of the matrix is not correct.'
np.testing.assert_almost_equal(user_profiles.sum(), 812651.284681659, decimal=3, 
                               err_msg='The values of the matrix are not correct.')

### Exercise 3.5 - The moment of truth
Finally, let's make predictions, this time based not only on the ratings, but also on the item attributes. Complete the function below, which takes the ratings, item profiles, and user profiles matrices and returns the matrix of predictions. The predictions for the user-item pairs with already existing ratings should be set to zero.
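
The scoring function is not stated explicitly, so the sketch below assumes a common choice: the cosine similarity between each user profile and each item profile, with scores for already rated pairs set to zero. It uses made-up toy matrices, and all `toy_` names are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

toy_R = csr_matrix(np.array([
    [5.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
]))
toy_items = csr_matrix(np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 1.0],
]))
toy_users = toy_R @ toy_items  # toy user profiles, shape (2, 2)

# Assumed scoring: cosine similarity of every user profile with every item profile.
toy_scores = cosine_similarity(toy_users, toy_items)  # dense, shape (2, 3)

# Zero out the pairs that already have a rating.
toy_scores[toy_R.toarray() > 0] = 0.0
toy_content_pred = csr_matrix(toy_scores)
print(toy_content_pred.toarray().round(3))
```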

In [None]:
def make_predictions(R, item_profiles, user_profiles):
    """
    Makes content-based predictions based on the ratings matrix, the item profiles,
    and the user profiles.
    
    Parameters:
        R (csr_matrix): ratings matrix with the shape (n_users, n_items)
        item_profiles (csr_matrix): item profiles matrix with the shape (n_items, n_tfidf_features)
        user_profiles (csr_matrix): user profiles matrix with the shape (n_users, n_tfidf_features)                    
        
    Returns:
        predictions (csr_matrix): matrix of predictions with the shape (n_users, n_items)
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    return predictions

In [None]:
pred = make_predictions(R, item_profiles, user_profiles)
assert isinstance(pred, csr_matrix), 'The result should be a csr matrix.'
assert pred.shape == (147, 9298), 'The shape of the matrix is not correct.'
np.testing.assert_almost_equal(pred.sum(), 477847.56765819935, decimal=3, 
                               err_msg='The values of the matrix are not correct.')
assert hashlib.sha256(json.dumps(' '.join([str(i) for i in pred.nonzero()[0]])).encode()).hexdigest() == \
'e66600d6c0ae50e4005585ab0c047e02e9db42fb5bd2a78bd45fb6870ca58154', 'Did you set the existing ratings to zero?'
assert hashlib.sha256(json.dumps(' '.join([str(i) for i in pred.nonzero()[1]])).encode()).hexdigest() == \
'f2a6dc419b756f339a34f41d58fb83d0ba2e01c468311eb7241bb8f90a44bd93', 'Did you set the existing ratings to zero?'

With these new and hopefully better predictions, we can recommend these movies to John:

In [None]:
top_movies_ids = get_top_n(pred, 25)
df_movies.loc[top_movies_ids]

These are quite different from the collaborative recommendations. I wonder which ones John likes better.

Congrats on conquering the personalized recommender systems!