# BLU11 - Exercise Notebook

## Create your own movie recommender system

This exercise notebook will help you create a Recommender System using Collaborative and Content-based filtering and, in the end, it will help you to pick out some movies according to users' preferences.

## Overall Strategy

1. **Setup:** Import and pre-process the data
1. **Collaborative Filtering:** usually gives better recommendations, but suffers from the cold-start problem (new users and new movies have no ratings yet)
1. **Content-based Filtering:** uses the additional information available about each movie (not only its ratings)

In [1]:
import os
import hashlib # for grading purposes
import json
import pandas as pd
import numpy as np

from scipy.sparse import csr_matrix

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

import warnings
warnings.filterwarnings('ignore')

We'll be using a standard dataset for recommendations, called the [MovieLens](https://grouplens.org/datasets/movielens/) dataset. Our data is adapted from the smallest version of this dataset.

We have 3 data files:

#### Dataset Ratings

`ratings.csv`: has 70,584 ratings, provided by 147 users, for 9,298 movies.

In [2]:
df_ratings = pd.read_csv(os.path.join("data", "ratings.csv"))
df_ratings.head()

Unnamed: 0,userId,movieId,rating
0,474,185,3.0
1,62,96079,4.5
2,298,127198,3.0
3,414,3836,5.0
4,68,7360,4.5


#### Dataset Movies

`movies.csv`: has the `movieId`, `title`, `genres`, and `year` for 9,742 movies

In [3]:
df_movies = pd.read_csv(os.path.join("data","movies.csv"))
df_movies = df_movies.set_index("movieId")
df_movies.head()

movieId,title,genres,year
1,Toy Story,Adventure|Animation|Children|Comedy|Fantasy,1995.0
2,Jumanji,Adventure|Children|Fantasy,1995.0
3,Grumpier Old Men,Comedy|Romance,1995.0
4,Waiting to Exhale,Comedy|Drama|Romance,1995.0
5,Father of the Bride Part II,Comedy,1995.0


#### Dataset Tags

`tags.csv`: has 3.5k tags, provided by 49 users, for 1,551 movies.

In [4]:
df_tags = pd.read_csv(os.path.join("data", "tags.csv"))
df_tags.head()

Unnamed: 0,userId,movieId,tag
0,2,60756,funny
1,2,60756,Highly quotable
2,2,60756,will ferrell
3,2,89774,Boxing story
4,2,89774,MMA


# 1) Build the ratings matrix

## 1.1 -  Create a ratings dataframe

Transform the `df_ratings` dataframe so that `userId` is the index, `movieId` is the column name, and the corresponding `rating` is the value.   
Make sure that index and column names are in ascending order.  
Assign the transformed dataframe to a variable called `df_ratings_transformed`.
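
As a minimal sketch of the reshaping described above, the toy frame below is pivoted from long to wide format and then sorted on both axes. The data and the names `toy_ratings` and `toy_wide` are made up for illustration and are not taken from `ratings.csv`.

```python
import pandas as pd

# Hypothetical toy ratings, deliberately out of order
toy_ratings = pd.DataFrame({
    "userId": [7, 2, 7, 2],
    "movieId": [30, 10, 10, 20],
    "rating": [4.0, 3.5, 5.0, 2.0],
})

# One row per user, one column per movie, ratings as values
toy_wide = (
    toy_ratings
    .pivot(index="userId", columns="movieId", values="rating")
    .sort_index(axis=0)   # users in ascending order
    .sort_index(axis=1)   # movies in ascending order
)
print(toy_wide)
```

Movies that a user has not rated show up as `NaN` in the wide frame.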

In [9]:
df_ratings

Unnamed: 0,userId,movieId,rating
0,474,185,3.0
1,62,96079,4.5
2,298,127198,3.0
3,414,3836,5.0
4,68,7360,4.5
...,...,...,...
70579,590,920,5.0
70580,105,3285,4.0
70581,234,2710,5.0
70582,603,1680,4.0


In [6]:
df_ratings_transformed = df_ratings.pivot(index='userId', columns='movieId', values='rating')
df_ratings_transformed = df_ratings_transformed.sort_index(axis=0).sort_index(axis=1)

In [7]:
assert df_ratings_transformed.shape == (147, 9298)
assert df_ratings_transformed.columns[888] == 1228
assert df_ratings_transformed.index[64] == 275
df_ratings_transformed.head()

movieId,1,2,3,4,5,6,7,8,9,10,...,187717,188189,188301,188675,188751,188797,188833,189381,189713,190183
userId,,,,,,,,,,,,,,,,,,,,,
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
6,,4.0,5.0,3.0,5.0,4.0,4.0,3.0,,3.0,...,,,,,,,,,,
18,3.5,3.0,,,,4.0,,,,,...,,,,,,,,,,
19,4.0,3.0,3.0,,,,2.0,,,2.0,...,,,,,,,,,,


## 1.2 - Create utility functions

As you can see in `df_ratings_transformed`, neither the userIds nor the movieIds form a continuous sequence of integers. To make further processing easier, we would like to know the row/column position of each userId/movieId.

Create a set of utility functions that transform a userId/movieId into a row/column number and vice versa. We will refer to the row/column numbers as `index` in the functions. The functions must follow these requirements:   
- Accept a single userId/movieId or row/column index;
- Return the corresponding id, if accepting an index;
- Return the corresponding index, if accepting an id;

Below are the arrays of userIds/movieIds extracted from the row/column indexes of the dataframe. The correspondence between the array index and the userIds/movieIds is the same as in the dataframe.
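
One possible way to implement such a mapping is sketched below with a dictionary built once from the id array, so that each id lookup does not scan the whole array. The names `toy_user_ids`, `toy_index_of`, `toy_get_index`, and `toy_get_id` are hypothetical and exist only for this illustration.

```python
import numpy as np

# Hypothetical ids laid out like the arrays described above
toy_user_ids = np.array([1, 4, 6, 18])

# Build the id -> position mapping once; the reverse direction is plain array indexing
toy_index_of = {int(uid): pos for pos, uid in enumerate(toy_user_ids)}

def toy_get_index(item_id):
    """Return the position of item_id in toy_user_ids."""
    return toy_index_of[item_id]

def toy_get_id(index):
    """Return the id stored at the given position."""
    return int(toy_user_ids[index])

print(toy_get_index(6), toy_get_id(3))
```

An unknown id raises a `KeyError` here, which makes mistakes visible early.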

In [8]:
user_id_array = df_ratings_transformed.index.to_numpy()
movie_id_array = df_ratings_transformed.columns.to_numpy()

In [9]:
user_id_array

array([  1,   4,   6,  18,  19,  20,  21,  25,  28,  30,  41,  42,  45,
        50,  51,  57,  62,  63,  64,  66,  68,  73,  82,  84,  89,  91,
       103, 104, 105, 111, 119, 122, 125, 132, 135, 139, 140, 156, 160,
       166, 169, 177, 182, 186, 187, 195, 198, 199, 200, 202, 212, 217,
       219, 220, 221, 222, 226, 232, 234, 239, 246, 249, 263, 274, 275,
       280, 282, 288, 290, 292, 294, 298, 304, 305, 307, 312, 313, 317,
       318, 325, 328, 330, 332, 339, 352, 354, 356, 357, 365, 367, 368,
       372, 380, 381, 382, 385, 387, 391, 414, 425, 428, 432, 434, 438,
       448, 452, 453, 462, 469, 474, 477, 479, 480, 483, 484, 489, 495,
       509, 514, 517, 520, 522, 525, 534, 552, 555, 560, 561, 562, 563,
       567, 570, 573, 580, 586, 590, 594, 596, 597, 599, 600, 603, 605,
       606, 607, 608, 610])

In [10]:
user_id =1
user_id_array[0]

1

In [11]:
def get_user_index(user_id, user_id_array=user_id_array):
    """
    For the user_id, returns the correspondent position in user_id_array
    
    Parameters
    ----------
    user_id : int
    user_id_array: np.array
    
    Returns
    -------
    user_index : int
    """
    user_index = np.where(user_id_array == user_id)[0][0]
    return user_index

def get_user_id(user_index, user_id_array=user_id_array):
    """
    For the user_index, returns the correspondent id in user_id_array
    
    Parameters
    ----------
    user_index : int
    user_id_array: np.array
    
    Returns
    -------
    user_id : int
    """
    user_id = user_id_array[user_index]
   
    
    return user_id

def get_movie_index(movie_id, movie_id_array=movie_id_array):
    """
    For the movie_id, returns the correspondent position in movie_id_array
    
    Parameters
    ----------
    movie_id : int
    movie_id_array: np.array
    
    Returns
    -------
    movie_index : int
    """
    movie_index = np.where(movie_id_array == movie_id)[0][0]
    
    return movie_index

def get_movie_id(movie_index, movie_id_array=movie_id_array):
    """
    For the movie_index, returns the correspondent id in movie_id_array
    
    Parameters
    ----------
    movie_index : int
    movie_id_array: np.array
    
    Returns
    -------
    movie_id : int
    """
    movie_id = movie_id_array[movie_index]
    
    return movie_id

In [12]:
user_index_45 = get_user_index(45)
user_id_45 = get_user_id(45)
movie_index_87 = get_movie_index(87)
movie_id_98 = get_movie_id(98)

In [13]:
assert hashlib.sha256(json.dumps(int(user_index_45)).encode()).hexdigest() == "6b51d431df5d7f141cbececcf79edf3dd861c3b4069f0b11661a3eefacbba918"
assert hashlib.sha256(json.dumps(int(user_id_45)).encode()).hexdigest() == "1dfacb2ea5a03e0a915999e03b5a56196f1b1664d2f768d1b7eff60ac059789d"
assert hashlib.sha256(json.dumps(int(movie_index_87)).encode()).hexdigest() == "f74efabef12ea619e30b79bddef89cffa9dda494761681ca862cff2871a85980"
assert hashlib.sha256(json.dumps(int(movie_id_98)).encode()).hexdigest() == "e5b861a6d8a966dfca7e7341cd3eb6be9901688d547a72ebed0b1f5e14f3d08d"

## 1.3 - Create a ratings matrix

Create a function called `create_ratings_matrix` that accepts the dataframe `df_ratings_transformed`  that contains userIds as indexes, movieIds as column names and ratings as values. The function should return a csr matrix.
The order of users and movies in the `df_ratings_transformed` dataframe should be maintained in the ratings matrix. 
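
The sketch below illustrates the conversion on a hypothetical 2x3 frame: missing ratings are filled with 0 and the result is stored as a CSR matrix, which only keeps the non-zero entries. This relies on the assumption that 0 is never a real rating (MovieLens ratings start at 0.5).

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Hypothetical wide ratings frame: users as rows, movies as columns
toy_wide = pd.DataFrame(
    [[4.0, np.nan, 5.0],
     [np.nan, 3.0, np.nan]],
    index=[2, 7],
    columns=[10, 20, 30],
)

# Missing ratings become 0, which CSR does not store explicitly
toy_R = csr_matrix(toy_wide.fillna(0).to_numpy())
print(toy_R.shape, toy_R.nnz)
```

Row and column order are preserved, so position `[0, 2]` in `toy_R` still refers to user 2 and movie 30.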

In [14]:
df_ratings_transformed

movieId,1,2,3,4,5,6,7,8,9,10,...,187717,188189,188301,188675,188751,188797,188833,189381,189713,190183
userId,,,,,,,,,,,,,,,,,,,,,
1,4.0,,4.0,,,4.0,,,,,...,,,,,,,,,,
4,,,,,,,,,,,...,,,,,,,,,,
6,,4.0,5.0,3.0,5.0,4.0,4.0,3.0,,3.0,...,,,,,,,,,,
18,3.5,3.0,,,,4.0,,,,,...,,,,,,,,,,
19,4.0,3.0,3.0,,,,2.0,,,2.0,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
605,4.0,3.5,,,,,,,,,...,,,,,,,,,,
606,2.5,,,,,,2.5,,,,...,,,,,,,,,,
607,4.0,,,,,,,,,,...,,,,,,,,,,
608,2.5,2.0,2.0,,,,,,,4.0,...,,,,,,,,,,


In [15]:
def create_ratings_matrix(df_ratings_transformed):
    """
    Creates a rating matrix from df_ratings_transformed, following the same organization of movies and users
    
    Parameters
    ----------
    df_ratings_transformed: pd.DataFrame
    
    Returns
    -------
    R : csr_matrix
    """
    R = csr_matrix(df_ratings_transformed.fillna(0)) 
   
    return R

In [16]:
R = create_ratings_matrix(df_ratings_transformed)
assert hashlib.sha256(json.dumps(int(R.shape[0])).encode()).hexdigest() == "1d28c120568c10e19b9d8abe8b66d0983fa3d2e11ee7751aca50f83c6f4a43aa"
assert hashlib.sha256(json.dumps(int(R.shape[1])).encode()).hexdigest() == "5c7b55dd4c978558ebd771143a57aa9825ca25ba65e6df89c7270fe10c7e9929"
assert hashlib.sha256(json.dumps(int(R[45].sum())).encode()).hexdigest() == "fc9e91cc78e1817d80b4ba8c2dc9a638d0c57959825ee34f5e3d7688ad80dfb9"

# 2) Collaborative filtering

Now that we have the ratings matrix, let's calculate similarities and give recommendations based on it.


## 2.1 -  Calculate similarities

Create a function named `calculate_similarities` that accepts a ratings matrix and a string named `similarity_type`.    
If `similarity_type` equals `users`, we want to return the users similarity, if it equals `items`, we want the items similarity.
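
To make the difference between the two modes concrete, here is a small sketch on a hypothetical ratings matrix with 3 users and 4 movies. User similarities compare rows, while item similarities compare columns, which is why the matrix is transposed in the second call.

```python
import numpy as np
from scipy.sparse import csr_matrix, issparse
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ratings: 3 users (rows) x 4 movies (columns)
toy_R = csr_matrix(np.array([
    [5.0, 0.0, 4.0, 0.0],
    [4.0, 0.0, 5.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
]))

user_sim = cosine_similarity(toy_R, dense_output=False)    # shape (3, 3)
item_sim = cosine_similarity(toy_R.T, dense_output=False)  # shape (4, 4)
print(user_sim.shape, item_sim.shape)
```

Users 0 and 1 rated the same movies with similar scores, so their similarity is close to 1, while user 2 shares no movies with them and gets a similarity of 0.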

In [17]:
def calculate_similarities(ratings_matrix, similarity_type):
    """
    Get the cosine similarity between users or items.
    
    Parameters
    ----------
    ratings_matrix : csr_matrix
                          Ratings matrix.
              
    similarity_type: str, "users" or "items"

    Returns
    -------
    similarities : csr_matrix
                        sparse representation of the cosine similarity between users or items.
    """
    if similarity_type == 'users':
        similarities = cosine_similarity(ratings_matrix, dense_output=False)
    elif similarity_type == 'items':
        similarities = cosine_similarity(ratings_matrix.T, dense_output=False)
    else:
        raise ValueError('Invalid similarity type. Must be "users" or "items"')

    return similarities

In [18]:
users_similarities = calculate_similarities(ratings_matrix=R, similarity_type="users")
assert isinstance(users_similarities, csr_matrix)
items_similarities = calculate_similarities(ratings_matrix=R, similarity_type="items")
assert isinstance(items_similarities, csr_matrix)
np.testing.assert_almost_equal(users_similarities[73].sum(), 38.97, 1)
np.testing.assert_almost_equal(items_similarities[82].sum(), 1170.75, 1)

## 2.2 - Calculate the nearest neighbor of a user

John, who has `userId` 25, wants to see a new movie and has asked us for recommendations.    
Let's first take a look at his viewing history. For that, we need to merge the dataframe `df_ratings` with `df_movies`. The merged dataframe is called `df_ratings_title` and has been created in the cell below for you.

In [19]:
df_ratings_title = pd.merge(df_ratings, df_movies, on="movieId", how="left").sort_values("rating", ascending=False)

Now let's only select and preview the information that corresponds to `John` in `df_ratings_title`.

In [20]:
john_id = 25
df_john = df_ratings_title[df_ratings_title.userId==john_id]
df_john

Unnamed: 0,userId,movieId,rating,title,genres,year
8299,25,260,5.0,Star Wars: Episode IV - A New Hope,Action|Adventure|Sci-Fi,1977.0
59927,25,357,5.0,Four Weddings and a Funeral,Comedy|Romance,1994.0
30786,25,179819,5.0,Star Wars: The Last Jedi,Action|Adventure|Fantasy|Sci-Fi,2017.0
39291,25,924,5.0,2001: A Space Odyssey,Adventure|Drama|Sci-Fi,1968.0
5882,25,1196,5.0,Star Wars: Episode V - The Empire Strikes Back,Action|Adventure|Sci-Fi,1980.0
61913,25,1270,5.0,Back to the Future,Adventure|Comedy|Sci-Fi,1985.0
3838,25,33493,5.0,Star Wars: Episode III - Revenge of the Sith,Action|Adventure|Sci-Fi,2005.0
43541,25,1210,5.0,Star Wars: Episode VI - Return of the Jedi,Action|Adventure|Sci-Fi,1983.0
4269,25,1240,5.0,"Terminator, The",Action|Sci-Fi|Thriller,1984.0
58831,25,5,5.0,Father of the Bride Part II,Comedy,1995.0


### 2.2.1 - Get the closest user

Now we want to find out who is the most similar user to John. For that, let's build a function called `get_closest_user_id` that accepts `users_similarities` and a `user_id`. It should return the id of the user who is most similar.
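
Every user has a similarity of 1 with themselves, so the user's own entry has to be excluded before looking for the maximum. One way to do this, sketched below on a hypothetical dense similarity matrix, is to mask the user's own position with negative infinity. The names `toy_sim`, `toy_user_ids`, and `closest_user` are illustrative only.

```python
import numpy as np

# Hypothetical user-user similarities and the ids of the corresponding rows
toy_sim = np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.4],
    [0.9, 0.4, 1.0],
])
toy_user_ids = np.array([25, 30, 41])

def closest_user(sim, user_ids, user_id):
    """Return the id of the most similar user, ignoring the user themselves."""
    idx = int(np.where(user_ids == user_id)[0][0])
    row = sim[idx].astype(float).copy()
    row[idx] = -np.inf  # the user must never be their own nearest neighbor
    return int(user_ids[int(np.argmax(row))])

print(closest_user(toy_sim, toy_user_ids, 25))
```

Masking is robust to ties, because it does not depend on where the user's own entry lands after sorting.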

In [21]:
def get_closest_user_id(users_similarities, user_id):
    """
    Return the id of the closest user to user_id.
    Hint: Use the utility functions to convert between id and index
    
    Parameters
    ----------
    users_similarities : csr_matrix
    user_id: int
    
    Returns
    -------
    closest_user_id : int
    """
    
    user_index = get_user_index(user_id)
    
    similarity_values = users_similarities[user_index].toarray()[0]
    sorted_indices = similarity_values.argsort()[::-1]
    closest_user_index = sorted_indices[1]
    closest_user_id = get_user_id(closest_user_index)
    
    return closest_user_id

With help of the function `get_closest_user_id`, let's see who is the closest user to John.

In [22]:
closest_user_to_john = get_closest_user_id(users_similarities, john_id)
"The Id of the closest user to John is %s" % str(closest_user_to_john)

'The Id of the closest user to John is 30'

In [23]:
assert  hashlib.sha256(json.dumps(int(closest_user_to_john)).encode()).hexdigest() == "624b60c58c9d8bfb6ff1886c2fd605d2adeb6ea4da576068201b6c6958ce93f4"

Let's take a look at the movies rated by the user most similar to John.

In [24]:
df_closest_user = df_ratings_title[df_ratings_title["userId"] == closest_user_to_john]
df_closest_user

Unnamed: 0,userId,movieId,rating,title,genres,year
25970,30,1371,5.0,Star Trek: The Motion Picture,Adventure|Sci-Fi,1979.0
12997,30,1196,5.0,Star Wars: Episode V - The Empire Strikes Back,Action|Adventure|Sci-Fi,1980.0
51886,30,11,5.0,"American President, The",Comedy|Drama|Romance,1995.0
46785,30,260,5.0,Star Wars: Episode IV - A New Hope,Action|Adventure|Sci-Fi,1977.0
29723,30,1374,5.0,Star Trek II: The Wrath of Khan,Action|Adventure|Sci-Fi|Thriller,1982.0
48799,30,2571,5.0,"Matrix, The",Action|Sci-Fi|Thriller,1999.0
51915,30,1210,4.5,Star Wars: Episode VI - Return of the Jedi,Action|Adventure|Sci-Fi,1983.0
56214,30,1270,4.5,Back to the Future,Adventure|Comedy|Sci-Fi,1985.0
67305,30,2628,4.5,Star Wars: Episode I - The Phantom Menace,Action|Adventure|Sci-Fi,1999.0
29288,30,338,4.0,Virtuosity,Action|Sci-Fi|Thriller,1995.0


## 2.3 - Get recommendations from the closest user

Create a function called `get_closest_user_recommendations` that accepts the dataframes `df_user` and `df_closest_user` and returns a dataframe with `movieId`, `title`, `genres`, and `year` of the movies watched by the closest user, with a rating greater than 2, that the user hasn't watched yet.

In [25]:
df_john

Unnamed: 0,userId,movieId,rating,title,genres,year
8299,25,260,5.0,Star Wars: Episode IV - A New Hope,Action|Adventure|Sci-Fi,1977.0
59927,25,357,5.0,Four Weddings and a Funeral,Comedy|Romance,1994.0
30786,25,179819,5.0,Star Wars: The Last Jedi,Action|Adventure|Fantasy|Sci-Fi,2017.0
39291,25,924,5.0,2001: A Space Odyssey,Adventure|Drama|Sci-Fi,1968.0
5882,25,1196,5.0,Star Wars: Episode V - The Empire Strikes Back,Action|Adventure|Sci-Fi,1980.0
61913,25,1270,5.0,Back to the Future,Adventure|Comedy|Sci-Fi,1985.0
3838,25,33493,5.0,Star Wars: Episode III - Revenge of the Sith,Action|Adventure|Sci-Fi,2005.0
43541,25,1210,5.0,Star Wars: Episode VI - Return of the Jedi,Action|Adventure|Sci-Fi,1983.0
4269,25,1240,5.0,"Terminator, The",Action|Sci-Fi|Thriller,1984.0
58831,25,5,5.0,Father of the Bride Part II,Comedy,1995.0


In [26]:
def get_closest_user_recommendations(df_user, df_closest_user):
    """
    Get the movies in df_closest_user that are not in df_user and have ratings higher than 2
    
    Parameters
    ----------
    df_user : pd.DataFrame
    df_closest_user: pd.DataFrame
    
    Returns
    -------
    df_new_movies : pd.DataFrame
    """
    
    closest_user_new_movies = df_closest_user[~df_closest_user['movieId'].isin(df_user.movieId)]
    
    df_new_movies = closest_user_new_movies[closest_user_new_movies['rating'] > 2]

    
    return df_new_movies[['movieId', 'title', 'genres', 'year']]

In [27]:
user_recommendations = get_closest_user_recommendations(df_john, df_closest_user)
assert list(user_recommendations.columns.sort_values()) == ['genres', 'movieId', 'title', 'year']
assert  hashlib.sha256(json.dumps(int(user_recommendations.movieId.sum())).encode()).hexdigest() == "ad0d7f127516cc1330d5d41d6cab963a831fc6d563568f0b8417f6d7e544f13b"
assert hashlib.sha256(json.dumps(int(user_recommendations.movieId.sort_values().iloc[1])).encode()).hexdigest() == "5d8f6cce532a7aeb57196be62344095936793400b3aeb3580d248b17d5518a86"
print("These are the movies that the closest user to John can recommend:")
user_recommendations

These are the movies that the closest user to John can recommend:


Unnamed: 0,movieId,title,genres,year
25970,1371,Star Trek: The Motion Picture,Adventure|Sci-Fi,1979.0
51886,11,"American President, The",Comedy|Drama|Romance,1995.0
29723,1374,Star Trek II: The Wrath of Khan,Action|Adventure|Sci-Fi|Thriller,1982.0
48799,2571,"Matrix, The",Action|Sci-Fi|Thriller,1999.0
29288,338,Virtuosity,Action|Sci-Fi|Thriller,1995.0


## 2.4 - Get user predictions with collaborative filtering
Let's build a function named `make_user_predictions` that predicts the user's ratings for the movies that haven't been rated by the users yet, based on the ratings and user similarities. 
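
A common way to compute such predictions is a similarity-weighted average: for each user and movie, sum the ratings of all users weighted by their similarity, then divide by the sum of the similarities of the users who actually rated that movie. The sketch below works through this on hypothetical dense arrays. The guard against a zero denominator is an addition made for this illustration.

```python
import numpy as np

# Hypothetical similarities (2 users) and ratings (2 users x 3 movies); 0 means "not rated"
S = np.array([[1.0, 0.5],
              [0.5, 1.0]])
R = np.array([[4.0, 0.0, 2.0],
              [0.0, 5.0, 3.0]])

rated = (R > 0).astype(float)   # 1 where a rating exists
weighted_sum = S @ R            # numerator: similarity-weighted ratings
norm = S @ rated                # denominator: similarities of users who rated the movie

# Divide only where the denominator is positive (assumption of this sketch)
preds = np.divide(weighted_sum, norm, out=np.zeros_like(weighted_sum), where=norm > 0)
preds[R > 0] = 0                # keep predictions for unseen movies only
print(preds)
```

In this toy case user 0 has not seen movie 1, which only user 1 rated with 5.0, so the prediction for user 0 is 5.0.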

In [28]:
def make_user_predictions(users_similarities, R):
    """
    Parameters
    ----------
    users_similarities : csr_matrix, shape: (n_users, n_users)
                Matrix with the similarities between users.
    
    R : csr_matrix, shape: (n_users, n_items)
        Matrix with the available ratings.
    
    Returns
    -------
    users_predictions : csr_matrix, shape: (n_users, n_items)
                        Ratings predictions.
    """
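    
    
    S = users_similarities
    # Numerator: similarity-weighted sum of the ratings given by all users
    weighted_sum = np.dot(S,R)
        
    # Indicator matrix: 1 where a rating exists, 0 otherwise
    R_boolean = R.copy() 
    R_boolean[R_boolean > 0] = 1 
    # Denominator: sum of the similarities of the users who rated each movie
    preds = np.divide(weighted_sum, np.dot(S,R_boolean)) 
    
    # Suppress predictions for movies the user has already rated
    preds[R.nonzero()] = 0
    
    users_predictions = csr_matrix(preds)

    return users_predictions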

In [29]:
users_predictions = make_user_predictions(users_similarities, R)
assert hashlib.sha256(json.dumps(users_predictions.shape[0]).encode()).hexdigest() == "1d28c120568c10e19b9d8abe8b66d0983fa3d2e11ee7751aca50f83c6f4a43aa"
assert hashlib.sha256(json.dumps(users_predictions.shape[1]).encode()).hexdigest() == "5c7b55dd4c978558ebd771143a57aa9825ca25ba65e6df89c7270fe10c7e9929"
assert hashlib.sha256(json.dumps((users_predictions == 0).size).encode()).hexdigest() == "bcdbafca562b7a2eaf193e1802357698be9870e0b354ce92a3bd03d22b4043ea", 'Did you return a csr matrix?'
np.testing.assert_almost_equal(users_predictions[33].sum(), 28672.67, 1)
assert hashlib.sha256(json.dumps(round(users_predictions[34].toarray()[0][5], 1)).encode()).hexdigest() == "04125177931fbd4afa4af7296dbdc95e9f209268cb4518f08aa568aa503993a2"

## 2.5 - Get top 10 movies

From users predictions, let's create a function named `get_top_n` that returns, by default, the top 10 unseen movies with the greatest predicted rating for a given user Id.
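
The ranking step can be sketched as follows on a hypothetical prediction row: sort the positions by descending predicted rating, keep the first `n`, and translate positions back to movie ids. The names `toy_preds`, `toy_movie_ids`, and `top_n_ids` are made up for this example.

```python
import numpy as np

# Hypothetical predictions for one user; 0 marks movies that were already rated
toy_preds = np.array([0.0, 4.2, 3.1, 4.8, 0.0])
toy_movie_ids = np.array([10, 20, 30, 40, 50])

def top_n_ids(pred_row, movie_ids, n=2):
    """Return the ids of the n movies with the highest predicted rating."""
    order = np.argsort(-pred_row, kind="stable")[:n]  # descending order of prediction
    return [int(movie_ids[i]) for i in order]

print(top_n_ids(toy_preds, toy_movie_ids))
```

Because seen movies were set to 0 beforehand, they sink to the bottom of the ranking and do not need a separate filter.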

In [30]:
def get_top_n(users_predictions, user_id, n=10):
    """
    Returns the top n movies for a given user
    
    Hint: Use the utility functions to convert between id and index
    
    Parameters
    ----------
    users_predictions : csr_matrix, shape: (n_users, n_items)
                        Ratings predictions.
    
    user_id: int
    
    n: int
    
    Returns
    -------
    movies_ids : list
    """
    
    
    user_index = get_user_index(user_id)

    pred = users_predictions[user_index]
    top = np.negative(pred).toarray().argsort()[:, :n]

    movies_ids= []
    for s in top[0] :
        id_ = get_movie_id(s)
        movies_ids.append(id_)
    
        
    return movies_ids

top_movies_ids = get_top_n(users_predictions,john_id)
top_movies_ids

[6442, 94810, 88448, 78836, 128914, 162344, 33649, 53355, 53280, 2824]

In [31]:
top_movies_ids = get_top_n(users_predictions,john_id)
assert hashlib.sha256(json.dumps(int(sum(top_movies_ids))).encode()).hexdigest() == "ce03c60619dc47c9cc0818d5da415a3d73918806fce620c6fa0f257415c8e1c9"
assert hashlib.sha256(json.dumps(int(len(top_movies_ids))).encode()).hexdigest() == "4a44dc15364204a80fe80e9039455cc1608281820fe2b24f1e5233ade6af1dd5"
print("These are the top 10 movies that we can recommend to John using collaborative filtering.")
df_movies.loc[top_movies_ids]

These are the top 10 movies that we can recommend to John using collaborative filtering.


movieId,title,genres,year
6442,Belle époque,Comedy|Romance,1992.0
94810,Eva,Drama|Fantasy|Sci-Fi,2011.0
88448,Paper Birds (Pájaros de papel),Comedy|Drama,2010.0
78836,Enter the Void,Drama,2009.0
128914,Tom Segura: Completely Normal,Comedy,2014.0
162344,Tom Segura: Mostly Stories,Comedy,2016.0
33649,Saving Face,Comedy|Drama|Romance,2004.0
53355,Sun Alley (Sonnenallee),Comedy|Romance,1999.0
53280,"Breed, The",Horror|Thriller,2006.0
2824,On the Ropes,Documentary|Drama,1999.0


In [32]:
hashlib.sha256(json.dumps(int(sum(top_movies_ids))).encode()).hexdigest() == "ce03c60619dc47c9cc0818d5da415a3d73918806fce620c6fa0f257415c8e1c9"

True

# 3) Content-based filtering

Now let's do predictions based on the characteristics of the items.
For that purpose, we can make use of the `tags.csv` dataset to describe the movies.

## 3.1 - Processing the tags

Since we have multiple tags for the same movie, we can join them into a single string.

Create a function named `create_movie_document_tags` that applies the following processing to the tags in the `df_tags` dataframe:
- Removes whitespace from multi-word tags;
- Lowercases the tags; 
- Joins all the tags for one movieId into a single string, with tags separated by a space;   

The function should return the tags strings in a series with `movieId` as index.  
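
The three processing steps can be sketched on a hypothetical tags frame as shown below. Removing the spaces first matters: it turns each multi-word tag into a single token, so that the space used for joining only separates different tags.

```python
import pandas as pd

# Hypothetical tags for two movies
toy_tags = pd.DataFrame({
    "movieId": [1, 1, 2],
    "tag": ["Highly quotable", "Funny", "will ferrell"],
})

# Strip spaces and lowercase without modifying the original frame
clean = toy_tags["tag"].str.replace(" ", "", regex=False).str.lower()

# One space-separated string of tags per movie
docs = clean.groupby(toy_tags["movieId"]).apply(" ".join)
print(docs)
```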

In [33]:
def create_movie_document_tags(df):
    """
    1st step of preprocessing the movie profiles.
    Join the multiple tags (lowercased and without spaces) into a single string for each movie, 
    with the tags separated by a space.
    
    Parameters
    ----------
    df : pd.DataFrame
              Original dataframe with the tags
              
    Returns
    -------
    tags : pd.Series
                        Series with the joined tags strings and movieId as index
    """
    # Remove whitespace and lowercase the tags, without modifying the input dataframe
    clean_tags = df['tag'].str.replace(' ', '', regex=False).str.lower()

    # Group by movieId and join each movie's tags with a space
    tags = clean_tags.groupby(df['movieId']).apply(' '.join)
    return tags

In [34]:
df_document_tags = create_movie_document_tags(df_tags)
assert isinstance(df_document_tags, pd.Series)
assert df_document_tags.shape[0] == 1551
assert hashlib.sha256(json.dumps(df_document_tags.loc[50]).encode()).hexdigest() == "a3bbafd77ce83eac3385449cebf8f7db4d5174b30da52e7cbdc35550e4d547cd"

## 3.2 - Add more information about movies

From the dataframe `df_movies`, create a new dataframe named `df_movies_processed`, with `movieId` as the index. Create a new column named `decade`, from the column `year` with the help of the function `extract_decade` below. After creating the column `decade`, drop the column `year`.   
Also, clean the column `genres` in order to have each genre in lowercase, without spaces (in the cases where each genre has more than one word), and separate genres with a space.   
Finally, join `df_movies_processed` with the series `df_document_tags` by `movieId` and give the name `tags` to the new column created.  
If there are any missing values, fill them with empty strings.   
Select only the movies that have ratings, assign them to `df_movies_processed_w_ratings`, and keep the same order in which they appear in the ratings matrix.   

In [35]:
def extract_decade(value):
    try: 
        decade = int(10*round(value/10))
        return str(decade)
    except:
        return ""
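
Note that `extract_decade` rounds to the nearest multiple of ten instead of truncating, so a 1995 movie ends up in the decade "2000". The sketch below repeats the function's logic (with explicit exception types, which is an adaptation for this example) so that it can be run on its own.

```python
def extract_decade(value):
    """Round a year to the nearest multiple of ten and return it as a string."""
    try:
        decade = int(10 * round(value / 10))
        return str(decade)
    except (TypeError, ValueError, OverflowError):
        # Missing or non-numeric years produce an empty string
        return ""

for year in (1994.0, 1995.0, 1985.0, float("nan")):
    print(year, "->", repr(extract_decade(year)))
```

Python's `round` resolves exact halves to the nearest even integer, which is why 1985 maps to "1980" while 1995 maps to "2000".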

In [51]:
df_movies_processed = df_movies[["genres", "year"]].copy()  # copy, so we never write into a view of df_movies

df_movies_processed["decade"] =  df_movies_processed["year"].apply(extract_decade)
df_movies_processed = df_movies_processed.drop("year", axis=1)
# "|" is a regex metacharacter, so replace it as a literal string
df_movies_processed["genres"] = df_movies_processed["genres"].str.lower().str.replace(" ", "", regex=False).str.replace("|", " ", regex=False)

df_movies_processed = df_movies_processed.join(df_document_tags, how="left")
df_movies_processed = df_movies_processed.fillna("")

movie_ids_w_ratings = np.intersect1d(df_movies_processed.index, df_ratings["movieId"])
df_movies_processed_w_ratings = df_movies_processed.loc[movie_ids_w_ratings]

df_movies_processed_w_ratings = df_movies_processed_w_ratings.rename(columns={"tag": "tags"})


In [52]:
df_movies_processed_w_ratings

movieId,genres,decade,tags
1,adventure animation children comedy fantasy,2000,pixar pixar fun
2,adventure children fantasy,2000,fantasy magicboardgame robinwilliams game
3,comedy romance,2000,
4,comedy drama romance,2000,
5,comedy,2000,pregnancy remake
...,...,...,...
188797,comedy,2020,
188833,adventure comedy fantasy,2020,
189381,action crime thriller,2020,
189713,comedy crime drama,2020,


In [54]:
assert df_movies_processed_w_ratings.shape == (9298, 3)
assert {'genres', 'decade', 'tags'}.issubset(df_movies_processed_w_ratings.columns)
assert hashlib.sha256(json.dumps(int(sum(df_movies_processed_w_ratings.index))).encode()).hexdigest() == "131cb34d8c2b760e68aca73489b1adb4df2bac32813f0f094ebf425fe93cc037"
assert hashlib.sha256(json.dumps(df_movies_processed_w_ratings.iloc[58].decade).encode()).hexdigest() == "637116a317aaf00163e5ae8e254bd0fa5625603b98429d84f87e5c8c5240e350"
assert hashlib.sha256(json.dumps(df_movies_processed_w_ratings.iloc[58].genres).encode()).hexdigest() == "2a82415726224b3ddf84350f8d9f94213c447329073f7bc9096abbc8dcfa9660"
assert hashlib.sha256(json.dumps(df_movies_processed_w_ratings.iloc[58].tags).encode()).hexdigest() == "12ae32cb1ec02d01eda3581b127c1fee3b0dc53572ed6baf239721a03d82e126"
assert df_movies_processed_w_ratings[df_movies_processed_w_ratings.tags==""].size == 23307
df_movies_processed_w_ratings.head()

movieId,genres,decade,tags
1,adventure animation children comedy fantasy,2000,pixar pixar fun
2,adventure children fantasy,2000,fantasy magicboardgame robinwilliams game
3,comedy romance,2000,
4,comedy drama romance,2000,
5,comedy,2000,pregnancy remake


In [45]:
df_movies_processed_w_ratings.columns

Index(['genres', 'decade', 'tags'], dtype='object')

## 3.3 - Calculate the profiles for the Items   
### 3.3.1 - Calculate the tf-idf

Create a function named `get_tf_idf` that accepts a pandas series whose values are of type `str` and returns the corresponding tf-idf matrix for that series.
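
As a small illustration of what the tf-idf matrix looks like, the sketch below fits scikit-learn's `TfidfVectorizer` with its default settings on three hypothetical tag strings, one of them empty.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical tag documents indexed by movieId; movie 3 has no tags
toy_docs = pd.Series(["pixar fun", "pixar", ""], index=[1, 2, 3])

vectorizer = TfidfVectorizer()
toy_tfidf = vectorizer.fit_transform(toy_docs)  # sparse matrix, one row per document

print(sorted(vectorizer.vocabulary_))
print(toy_tfidf.toarray())
```

Each row is L2-normalized by default, so a document with a single term gets weight 1 for that term, and a document without any terms stays an all-zero row.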

In [58]:
def get_tf_idf(doc_series: pd.Series):
    """
    Generates the tf-idf of doc_series
    
    Parameters
    ----------
    doc_series : pd.Series, shape:(n_items, )
                 series where elements are of type str
              
    Returns
    -------
    doc_tf_idf : csr_matrix, shape: (n_items, n_tfidf_features)
    """
    tfidf = TfidfVectorizer()

    doc_term_matrix = tfidf.fit_transform(doc_series)

    doc_tf_idf = csr_matrix(doc_term_matrix)
    
    return doc_tf_idf

In [59]:
tag_tf_idf  = get_tf_idf(df_movies_processed_w_ratings["tags"])
assert tag_tf_idf.shape == (9298, 1458)
np.testing.assert_almost_equal(tag_tf_idf[:, 5].toarray().sum(), 2.26, 2)

### 3.3.2 - Calculate item profiles

Let's now create an item's profile that contains information regarding `tags`, `genres`, and `decades`. 

Create a function named `get_items_profile` that accepts a dataframe, calculates tf-idf for each column, concatenates horizontally all the  csr_matrices that are the result of the tf-idf, and returns it as a csr_matrix.        
Hint: Use the function `get_tf_idf` and the function `np.hstack`.
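
`np.hstack` works on dense arrays, so each tf-idf matrix has to be densified before stacking. As an alternative sketch, `scipy.sparse.hstack` concatenates the blocks while keeping them sparse. The frame `toy_df` below is hypothetical, and the vectorizer is called directly so that the example runs on its own.

```python
import pandas as pd
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical processed movies: every column is a text document
toy_df = pd.DataFrame({
    "genres": ["comedy romance", "action scifi"],
    "decade": ["1990", "1980"],
    "tags": ["wedding", ""],
})

# One tf-idf block per column, concatenated side by side in sparse form
blocks = [TfidfVectorizer().fit_transform(toy_df[col]) for col in toy_df.columns]
toy_profiles = hstack(blocks, format="csr")
print(toy_profiles.shape)
```

The number of columns of the result is the sum of the vocabulary sizes of the individual blocks (here 4 + 2 + 1).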

In [101]:
df_movies_processed_w_ratings

movieId,genres,decade,tags
1,adventure animation children comedy fantasy,2000,pixar pixar fun
2,adventure children fantasy,2000,fantasy magicboardgame robinwilliams game
3,comedy romance,2000,
4,comedy drama romance,2000,
5,comedy,2000,pregnancy remake
...,...,...,...
188797,comedy,2020,
188833,adventure comedy fantasy,2020,
189381,action crime thriller,2020,
189713,comedy crime drama,2020,


In [123]:
def get_items_profile(df: pd.DataFrame):   
    """
    Creates the item profiles from tf-idf applied to all the columns of df
    
    Parameters
    ----------
    df : pd.DataFrame, shape:(n_items, n_features)
              
              
    Returns
    -------
    item_profiles : csr_matrix, shape: (n_items, n_tfidf_features)
    """
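    # tf-idf of each text column
    tf_tags = get_tf_idf(df.tags)
    tf_genres = get_tf_idf(df.genres)
    tf_dec = get_tf_idf(df.decade)
    
    # np.hstack needs dense arrays, so densify before stacking and convert back to sparse afterwards
    item_profiles_ = np.hstack([tf_tags.toarray(), tf_genres.toarray(), tf_dec.toarray()])
    
    return csr_matrix(item_profiles_)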

In [127]:
item_profiles = get_items_profile(df_movies_processed_w_ratings)
assert item_profiles.shape == (9298, 1493), 'Did you return a csr_matrix?'
assert hashlib.sha256(json.dumps(int(item_profiles[45, :].toarray()[0].sum())).encode()).hexdigest() == "ef2d127de37b942baad06145e54b0c619a1f22327b2ebbcfbec78f5564afe39d"

## 3.4 - User profiles
The next step is to create a function called `make_user_profiles` that accepts the ratings matrix and the items profiles and calculates the user profiles.
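
One common definition of a user profile is the rating-weighted sum of the profiles of the movies the user rated, which is a single matrix product. The sketch below shows it on hypothetical matrices with 2 users, 2 movies, and 3 features.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical ratings (2 users x 2 movies) and item profiles (2 movies x 3 features)
toy_R = csr_matrix(np.array([[5.0, 0.0],
                             [0.0, 3.0]]))
toy_items = csr_matrix(np.array([[1.0, 0.0, 0.0],
                                 [0.0, 0.6, 0.8]]))

# (n_users x n_items) @ (n_items x n_features) -> (n_users x n_features)
toy_user_profiles = toy_R @ toy_items
print(toy_user_profiles.toarray())
```

Unrated movies have a rating of 0 and therefore contribute nothing to the profile.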

In [130]:
def make_user_profiles(ratings, item_profiles):
    """
    Calculate the user profiles 
    
    Parameters
    ----------
    ratings : csr_matrix, shape: (n_users, n_items)
    
    item_profiles : csr_matrix, shape: (n_items, n_tfidf_features)
    
    Returns
    -------
    user_profiles : csr_matrix, shape: (n_users, n_tfidf_features)
    """
    # Each user profile is the rating-weighted sum of the profiles of the movies the user rated
    user_profiles = ratings @ item_profiles

    return user_profiles

In [131]:
user_profiles = make_user_profiles(R, item_profiles)
assert user_profiles.shape == (147, 1493)
assert hashlib.sha256(json.dumps(int(user_profiles[10, :].toarray()[0].sum())).encode()).hexdigest() == "3c7f572560e6d2f14680d05690428dbedc48378a6b8015d86024428f36791dad"

## 3.5 - The Moment of Truth
Finally, let's make predictions, this time based not only on the ratings but also on the items' characteristics. 
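
A minimal sketch of this step, assuming predictions are scored as the cosine similarity between each user profile and each item profile, is shown below on hypothetical dense arrays. Movies the user has already rated are set to 0 so that they are never recommended again.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical ratings (2 users x 2 movies) and item profiles (2 movies x 2 features)
toy_R = np.array([[5.0, 0.0],
                  [0.0, 3.0]])
toy_items = np.array([[1.0, 0.0],
                      [0.6, 0.8]])

toy_users = toy_R @ toy_items                     # user profiles
scores = cosine_similarity(toy_users, toy_items)  # shape (n_users, n_items)
scores[toy_R > 0] = 0                             # suppress movies that were already rated
print(scores)
```

Unlike the collaborative filtering predictions, these scores are similarities between 0 and 1 rather than estimated ratings, so they are only meaningful for ranking.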

In [132]:
def make_predictions(R, item_profiles, user_profiles):
    """
    Make predictions based on the ratings matrix, the item profiles and the user profiles we calculated previously.
    
    
    Parameters
    ----------
    R : csr_matrix. shape: (n_users, n_items)
        Matrix containing the ratings
        
    item_profiles : csr_matrix. shape: (n_items, n_tfidf_features)
                    Matrix containing the TF-IDF features calculated for the items.
                    
    user_profiles : csr_matrix. shape: (n_users, n_tfidf_features)
                    Matrix containing the user profiles as the product of the ratings with the item profiles.
                    
                    
    Returns
    -------
    predictions : csr_matrix. shape: (n_users, n_items)
                  Matrix with the predictions. Already rated content is suppressed to 0.
    """
    preds = cosine_similarity(user_profiles, item_profiles)
    
    preds[R.nonzero()] = 0
    predictions=csr_matrix(preds)
    
    return predictions

In [133]:
pred = make_predictions(R, item_profiles, user_profiles)
assert pred.shape == (147, 9298)
assert hashlib.sha256(json.dumps(int(pred[:, 50].toarray().sum())).encode()).hexdigest() == "bbb965ab0c80d6538cf2184babad2a564a010376712012bd07b0af92dcd3097d"
assert hashlib.sha256(json.dumps(pred[pred == 0].size).encode()).hexdigest() == "ade54f1d9e8f688033b38ac8289791a62e9e1b9fde4ea1eb8ae8cb7f3ac503a8"

With these new predictions, we can give recommendations to John, taking into account not only the ratings but also the movies' content.

In [134]:
top_movies_ids = get_top_n(pred, john_id, 20)
df_movies.loc[top_movies_ids]

movieId,title,genres,year
8633,"Last Starfighter, The",Action|Adventure|Comedy|Sci-Fi,1984.0
1375,Star Trek III: The Search for Spock,Action|Adventure|Sci-Fi,1984.0
2105,Tron,Action|Adventure|Sci-Fi,1982.0
4941,Flash Gordon,Action|Adventure|Sci-Fi,1980.0
3704,Mad Max Beyond Thunderdome,Action|Adventure|Sci-Fi,1985.0
3702,Mad Max,Action|Adventure|Sci-Fi,1979.0
4987,Spacehunter: Adventures in the Forbidden Zone,Action|Adventure|Sci-Fi,1983.0
2528,Logan's Run,Action|Adventure|Sci-Fi,1976.0
3070,Adventures of Buckaroo Banzai Across the 8th D...,Adventure|Comedy|Sci-Fi,1984.0
26444,"Hitch Hikers Guide to the Galaxy, The",Adventure|Comedy|Sci-Fi,1981.0
