# BLU11 - Exercise Notebook

## Create your own movie recommender system

This exercise notebook guides you through building a recommender system with collaborative and content-based filtering. By the end, you will use it to pick movies that match a user's preferences.

## Overall Strategy

1. **Setup:** Import and pre-process the data
1. **Collaborative Filtering:** typically performs better, but suffers from the cold-start problem (new users or movies have no ratings to learn from)
1. **Content-based Filtering:** uses additional information about each movie (tags, genres, year), not only the ratings

In [None]:
# Define your setup
import os
import hashlib # for grading purposes
import json
import pandas as pd
import numpy as np

from scipy.sparse import csr_matrix

from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

import warnings
warnings.filterwarnings('ignore')

We'll be using a standard dataset for recommendations, called the [MovieLens](https://grouplens.org/datasets/movielens/) dataset. Our dataset is adapted from the smallest available version of MovieLens.

We have 3 data files available:   

#### Dataset Ratings

`ratings.csv`: has 70,584 ratings, provided by 147 users, for 9,298 movies.

In [None]:
df_ratings = pd.read_csv(os.path.join("data", "ratings.csv"))
df_ratings.head()

#### Dataset Movies

`movies.csv`: has the `movieId`, `title`, `genres`, and `year` for 9,742 movies.

In [None]:
df_movies = pd.read_csv(os.path.join("data","movies.csv"))
df_movies = df_movies.set_index("movieId")
df_movies.head()

#### Dataset Tags

`tags.csv`: has 3.5k tags, provided by 49 users, for 1,551 movies.

In [None]:
df_tags = pd.read_csv(os.path.join("data", "tags.csv"))
df_tags.head()

# 1) Build the Ratings Matrix

## 1.1 -  Create a Ratings Dataframe

Transform the `df_ratings` dataframe so that `userId` is the index, the `movieId` values are the column names, and the corresponding `rating` is the value.   
Make sure that index and column names are in ascending order.  
Assign the transformed dataframe to a variable called `df_ratings_transformed`.
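
As an illustration of this kind of reshaping, here is a minimal sketch on made-up data. The names `toy_ratings` and `toy_wide` and all the values are hypothetical and not taken from `ratings.csv`; `pivot` is one possible way to go from a long table to a wide one, not necessarily the intended solution.

```python
import pandas as pd

# Toy long-format ratings (hypothetical values)
toy_ratings = pd.DataFrame({
    "userId": [2, 1, 2, 1],
    "movieId": [30, 10, 10, 20],
    "rating": [4.0, 5.0, 3.0, 2.5],
})

# One row per user, one column per movie; unrated pairs become NaN
toy_wide = toy_ratings.pivot(index="userId", columns="movieId", values="rating")

# Sort both axes explicitly so that index and columns are ascending
toy_wide = toy_wide.sort_index(axis=0).sort_index(axis=1)
print(toy_wide)
```

Sorting explicitly makes the ascending order visible in the code instead of relying on the default behaviour of the reshaping method.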

In [None]:
# df_ratings_transformed = ...
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert df_ratings_transformed.shape == (147, 9298)
assert df_ratings_transformed.columns[888] == 1228
assert df_ratings_transformed.index[64] == 275
df_ratings_transformed.head()

## 1.2 - Create utility functions

Create a set of utility functions that convert between a user/movie Id and its index (position) in the ratings dataframe.    

The functions must follow the requirements below:   
- Each function accepts either an index or an Id of a movie/user;
- Each function also accepts an array of movie/user Ids, in the same order as the movies and users in `df_ratings_transformed`;
- Given an index, it returns the Id stored at that position of the array;
- Given an Id, it returns the index (position) of that Id in the array.
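
The two directions of the lookup can be sketched on a small made-up array. `toy_ids`, `toy_index` and `toy_id` are hypothetical names, and `np.where` is only one way of finding a position.

```python
import numpy as np

# Hypothetical Id array: position = matrix index, value = Id
toy_ids = np.array([3, 7, 12, 40])

# Id -> index: locate the position where the array equals the Id
toy_index = int(np.where(toy_ids == 12)[0][0])

# index -> Id: plain positional lookup
toy_id = int(toy_ids[toy_index])
print(toy_index, toy_id)
```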

The array of movie/user's Id should have the following default values:

In [None]:
user_id_array = df_ratings_transformed.index.to_numpy()
movie_id_array = df_ratings_transformed.columns.to_numpy()

In [None]:
def get_user_index(user_id, user_id_array=user_id_array):
    """
    For the user_id, returns the correspondent position on user_id_array
    
    Parameters
    ----------
    user_id : int
    user_id_array: np.array
    
    Returns
    -------
    user_index : int
    """
    # user_index = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    return user_index

def get_user_id(user_index, user_id_array=user_id_array):
    """
    For the user_index, returns the correspondent id on user_id_array
    
    Parameters
    ----------
    user_index : int
    user_id_array: np.array
    
    Returns
    -------
    user_id : int
    """
    # user_id = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    return user_id

def get_movie_index(movie_id, movie_id_array=movie_id_array):
    """
    For the movie_id, returns the correspondent position on movie_id_array
    
    Parameters
    ----------
    movie_id : int
    movie_id_array: np.array
    
    Returns
    -------
    movie_index : int
    """
    # movie_index = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    return movie_index

def get_movie_id(movie_index, movie_id_array=movie_id_array):
    """
    For the movie_index, returns the correspondent id on movie_id_array
    
    Parameters
    ----------
    movie_index : int
    movie_id_array: np.array
    
    Returns
    -------
    movie_id : int
    """
    # movie_id = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    return movie_id

In [None]:
user_index_45 = get_user_index(45)
user_id_45 = get_user_id(45)
movie_index_87 = get_movie_index(87)
movie_id_98 = get_movie_id(98)

In [None]:
assert hashlib.sha256(json.dumps(int(user_index_45)).encode()).hexdigest() == "6b51d431df5d7f141cbececcf79edf3dd861c3b4069f0b11661a3eefacbba918"
assert hashlib.sha256(json.dumps(int(user_id_45)).encode()).hexdigest() == "1dfacb2ea5a03e0a915999e03b5a56196f1b1664d2f768d1b7eff60ac059789d"
assert hashlib.sha256(json.dumps(int(movie_index_87)).encode()).hexdigest() == "f74efabef12ea619e30b79bddef89cffa9dda494761681ca862cff2871a85980"
assert hashlib.sha256(json.dumps(int(movie_id_98)).encode()).hexdigest() == "e5b861a6d8a966dfca7e7341cd3eb6be9901688d547a72ebed0b1f5e14f3d08d"

## 1.3 - Create a Ratings Matrix

Create a function called `create_ratings_matrix` that accepts the dataframe `df_ratings_transformed`, which has user Ids as the index, item Ids as the column names and ratings as the values, and returns the ratings matrix as a `csr_matrix`. 
The order of users and items in the `df_ratings_transformed` dataframe should be maintained in the ratings matrix.
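
A minimal sketch of turning a wide dataframe into a sparse matrix, on made-up data. `toy_df` and `toy_R` are hypothetical names, and filling the missing ratings with 0 is an assumption of this sketch.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy wide dataframe: users as rows, movies as columns, NaN = not rated
toy_df = pd.DataFrame(
    [[5.0, np.nan, 3.0], [np.nan, 2.0, np.nan]],
    index=[1, 2],
    columns=[10, 20, 30],
)

# Missing ratings become 0, so the sparse matrix only stores actual ratings
toy_R = csr_matrix(toy_df.fillna(0).to_numpy())
print(toy_R.shape, toy_R.nnz)
```

Because the matrix is built from the dataframe's values as they are, row and column order are preserved.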

In [None]:
def create_ratings_matrix(df_ratings_transformed):
    """
    Creates a rating matrix from df_ratings_transformed, following the same organization of movies and users
    
    Parameters
    ----------
    df_ratings_transformed: pd.DataFrame
    
    Returns
    -------
    R : csr_matrix
    """
    # R = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    return R

In [None]:
R = create_ratings_matrix(df_ratings_transformed)
assert hashlib.sha256(json.dumps(int(R.shape[0])).encode()).hexdigest() == "1d28c120568c10e19b9d8abe8b66d0983fa3d2e11ee7751aca50f83c6f4a43aa"
assert hashlib.sha256(json.dumps(int(R.shape[1])).encode()).hexdigest() == "5c7b55dd4c978558ebd771143a57aa9825ca25ba65e6df89c7270fe10c7e9929"
assert hashlib.sha256(json.dumps(int(R[45].sum())).encode()).hexdigest() == "fc9e91cc78e1817d80b4ba8c2dc9a638d0c57959825ee34f5e3d7688ad80dfb9"


# 2) Collaborative Filtering

Now that we have the ratings matrix already built, let's calculate similarities and give recommendations based on it.


## 2.1 -  Calculate Similarities

Create a function, named `calculate_similarities`, that accepts a ratings matrix, and a string named `similarity_type`.    
If `similarity_type` equals `users`, the function should return the similarities between users; if it equals `items`, it should return the similarities between items.
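
The difference between the two cases can be sketched with `cosine_similarity` on a made-up matrix. `toy_R`, `toy_user_sim` and `toy_item_sim` are hypothetical names; the only assumption is that rows are users and columns are items.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.metrics.pairwise import cosine_similarity

# Toy ratings: 2 users (rows) x 3 items (columns)
toy_R = csr_matrix(np.array([[5.0, 0.0, 3.0], [4.0, 0.0, 0.0]]))

# Rows are users, so comparing rows gives user-user similarities
toy_user_sim = cosine_similarity(toy_R, dense_output=False)

# Transposing makes items the rows, giving item-item similarities
toy_item_sim = cosine_similarity(toy_R.T, dense_output=False)
print(toy_user_sim.shape, toy_item_sim.shape)
```

Passing `dense_output=False` keeps the result sparse when the input is sparse.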

In [None]:
def calculate_similarities(ratings_matrix, similarity_type):
    """
    Get the cosine similarity between users or between items.
    
    Parameters
    ----------
    ratings_matrix : csr_matrix
              Ratings matrix.
              
    similarity_type: str, "users" or "items"

    Returns
    -------
    similarities : csr_matrix
                        sparse representation of the cosine similarity between users or items.
    """
    # similarities = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    return similarities

In [None]:
users_similarities = calculate_similarities(ratings_matrix=R, similarity_type="users")
assert isinstance(users_similarities, csr_matrix)
items_similarities = calculate_similarities(ratings_matrix=R, similarity_type="items")
assert isinstance(items_similarities, csr_matrix)
np.testing.assert_almost_equal(users_similarities[73].sum(), 38.97, 1)
np.testing.assert_almost_equal(items_similarities[82].sum(), 1170.75, 1)

## 2.2 - Calculate the nearest neighbor of a user

John, whose `userId` is 25, wants to watch a new movie and has asked us for recommendations.    
Let's first take a look at his rating history. For that, we first need to merge the dataframe `df_ratings` with `df_movies`, on column `movieId`. This new dataframe is called `df_ratings_title` and has been created in the cell below for you.

In [None]:
df_ratings_title = pd.merge(df_ratings, df_movies, on="movieId", how="left").sort_values("rating", ascending=False)

Now let's select and preview only the rows of `df_ratings_title` that correspond to John.

In [None]:
john_id = 25
df_john = df_ratings_title[df_ratings_title.userId==john_id]
df_john

### 2.2.1 - Get Closest User

Now we want to find out which user is most similar to John. For that, let's build a function called `get_closest_user_id`, that accepts `users_similarities` and a `user_id`. It should return the id of the most similar user, other than the user itself.
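
One detail worth illustrating is that every user has similarity 1 with themselves, so that entry has to be skipped. Below is a minimal sketch on a dense, made-up similarity matrix. `toy_sim` and `toy_user_ids` are hypothetical; with a sparse matrix, a single row can first be converted with `.toarray()`.

```python
import numpy as np

# Hypothetical similarity matrix and Id array for three users
toy_sim = np.array([
    [1.0, 0.2, 0.9],
    [0.2, 1.0, 0.4],
    [0.9, 0.4, 1.0],
])
toy_user_ids = np.array([5, 8, 13])

# Position of the target user (Id 5) in the Id array
target_index = int(np.where(toy_user_ids == 5)[0][0])

row = toy_sim[target_index].copy()
# A user is always most similar to themselves, so exclude that entry
row[target_index] = -np.inf

closest_id = int(toy_user_ids[np.argmax(row)])
print(closest_id)
```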

In [None]:
def get_closest_user_id(users_similarities, user_id):
    """
    Return the id of the closest user to user_id.
    Hint: Use the utility functions to convert between id and index
    
    Parameters
    ----------
    users_similarities : csr_matrix
    user_id: int
    
    Returns
    -------
    closest_user_id : int
    """
    # closest_user_id = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    return closest_user_id

With the help of the function `get_closest_user_id`, let's see who is the closest user to John.

In [None]:
closest_user_to_john = get_closest_user_id(users_similarities, john_id)
"The Id of the closest user to John is %s" % str(closest_user_to_john)

In [None]:
assert  hashlib.sha256(json.dumps(int(closest_user_to_john)).encode()).hexdigest() == "624b60c58c9d8bfb6ff1886c2fd605d2adeb6ea4da576068201b6c6958ce93f4"

Let's take a look at the movies rated by the user most similar to John.

In [None]:
df_closest_user = df_ratings_title[df_ratings_title["userId"] == closest_user_to_john]
df_closest_user

## 2.3 - Get recommendations from the closest user

Create a function called `get_closest_user_recommendations`, that accepts the dataframes `df_user` and `df_closest_user`, and returns a dataframe with the `genres`, `movieId`, `title` and `year` of the movies that the closest user watched and rated higher than 2, and that the user hasn't watched yet.
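
The filtering logic can be sketched with two boolean masks on made-up dataframes. `toy_user`, `toy_neighbor` and `toy_new` are hypothetical names, and the column selection is shortened for the example.

```python
import pandas as pd

# Movies rated by the target user and by the closest user (hypothetical values)
toy_user = pd.DataFrame({"movieId": [1, 2], "rating": [4.0, 3.0]})
toy_neighbor = pd.DataFrame({
    "movieId": [2, 3, 4],
    "rating": [5.0, 4.5, 1.0],
    "title": ["B", "C", "D"],
})

# Movies the target user has not rated yet
unseen = ~toy_neighbor["movieId"].isin(toy_user["movieId"])
# Movies the closest user rated higher than 2
liked = toy_neighbor["rating"] > 2

toy_new = toy_neighbor.loc[unseen & liked, ["movieId", "title"]]
print(toy_new)
```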

In [None]:
def get_closest_user_recommendations(df_user, df_closest_user):
    """
    Get the movies in df_closest_user that are not in df_user and have ratings greater than 2
    
    Parameters
    ----------
    df_user : pd.DataFrame
    df_closest_user: pd.DataFrame
    
    Returns
    -------
    df_new_movies : pd.DataFrame
    """
    # df_new_movies = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    return df_new_movies

In [None]:
user_recommendations = get_closest_user_recommendations(df_john, df_closest_user)
assert list(user_recommendations.columns.sort_values()) == ['genres', 'movieId', 'title', 'year']
assert  hashlib.sha256(json.dumps(int(user_recommendations.movieId.sum())).encode()).hexdigest() == "ad0d7f127516cc1330d5d41d6cab963a831fc6d563568f0b8417f6d7e544f13b"
assert hashlib.sha256(json.dumps(int(user_recommendations.movieId.sort_values().iloc[1])).encode()).hexdigest() == "5d8f6cce532a7aeb57196be62344095936793400b3aeb3580d248b17d5518a86"
print("These are the movies that the closest user to John can recommend to him.")
user_recommendations

## 2.4 - Get Users Predictions with collaborative filtering
Let's build a function, named `make_user_predictions`, that predicts each user's ratings for the movies they haven't rated yet, based on the ratings given by all users and the similarities between users. 
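
As a rough illustration, here is one common variant of user-based prediction: a similarity-weighted sum of ratings, normalised by the total similarity. This is a sketch under assumptions. The normalisation and the names `toy_S`, `toy_R` and `toy_pred` are hypothetical, the graded function may expect a different formula, and dense arrays are used only for readability.

```python
import numpy as np

toy_S = np.array([[1.0, 0.5], [0.5, 1.0]])  # user-user similarities
toy_R = np.array([[4.0, 0.0], [2.0, 5.0]])  # ratings, 0 means "not rated"

# Similarity-weighted sum of all users' ratings for every item
weighted = toy_S @ toy_R

# Normalise each user's row by the sum of that user's absolute similarities
normaliser = np.abs(toy_S).sum(axis=1, keepdims=True)
toy_pred = weighted / normaliser

# Keep predictions only for items the user has not rated yet
toy_pred[toy_R.nonzero()] = 0
print(toy_pred)
```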

In [None]:
def make_user_predictions(users_similarities, R):
    """
    Parameters
    ----------
    users_similarities : csr_matrix, shape: (n_users, n_users)
                Matrix with the similarities between users.
    
    R : csr_matrix, shape: (n_users, n_items)
        Matrix with the available ratings.
    
    Returns
    -------
    users_predictions : csr_matrix, shape: (n_users, n_items)
                        Ratings predictions.
    """
    # users_predictions = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    return users_predictions

In [None]:
users_predictions = make_user_predictions(users_similarities, R)
assert hashlib.sha256(json.dumps(users_predictions.shape[0]).encode()).hexdigest() == "1d28c120568c10e19b9d8abe8b66d0983fa3d2e11ee7751aca50f83c6f4a43aa"
assert hashlib.sha256(json.dumps(users_predictions.shape[1]).encode()).hexdigest() == "5c7b55dd4c978558ebd771143a57aa9825ca25ba65e6df89c7270fe10c7e9929"
assert hashlib.sha256(json.dumps((users_predictions == 0).size).encode()).hexdigest() == "bcdbafca562b7a2eaf193e1802357698be9870e0b354ce92a3bd03d22b4043ea"
np.testing.assert_almost_equal(users_predictions[33].sum(), 28672.67, 1)
assert hashlib.sha256(json.dumps(round(users_predictions[34].toarray()[0][5], 1)).encode()).hexdigest() == "04125177931fbd4afa4af7296dbdc95e9f209268cb4518f08aa568aa503993a2"

## 2.5 - Get top 10 movies

From the users' predictions, let's create a function named `get_top_n` that returns the `n` unseen movies (10 by default) with the highest predicted rating for a given user Id.
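
Ranking a single user's predictions can be sketched with `np.argsort` on a made-up row. `toy_pred_row` and `toy_movie_ids` are hypothetical, and already rated movies are assumed to have a prediction of 0.

```python
import numpy as np

# Predicted ratings for one user (0 = already rated) and the matching movie Ids
toy_pred_row = np.array([0.0, 3.2, 4.8, 1.1, 4.1])
toy_movie_ids = np.array([10, 20, 30, 40, 50])
n = 2

# argsort is ascending, so negate the scores to rank the highest first
top_idx = np.argsort(-toy_pred_row)[:n]
top_ids = toy_movie_ids[top_idx].tolist()
print(top_ids)
```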

In [None]:
def get_top_n(users_predictions, user_id, n=10):
    """
    Returns the top n movies for a given user
    
    Hint: Use the utility functions to convert between id and index
    
    Parameters
    ----------
    users_predictions : csr_matrix, shape: (n_users, n_items)
                        Ratings predictions.
    
    user_id: int
    
    n: int
    
    Returns
    -------
    movies_ids : list
    """
    # movies_ids = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    return movies_ids

In [None]:
top_movies_ids = get_top_n(users_predictions,john_id)
assert hashlib.sha256(json.dumps(int(sum(top_movies_ids))).encode()).hexdigest() == "ce03c60619dc47c9cc0818d5da415a3d73918806fce620c6fa0f257415c8e1c9"
assert hashlib.sha256(json.dumps(int(len(top_movies_ids))).encode()).hexdigest() == "4a44dc15364204a80fe80e9039455cc1608281820fe2b24f1e5233ade6af1dd5"
print("These are the top 10 movies that we can recommend to John using collaborative filtering.")
df_movies.loc[top_movies_ids]

# 3) Content-based Filtering

Now let's do predictions based on the characteristics of the items themselves.    
For that purpose, we can make use of the `tags.csv` dataset to describe the movies.

## 3.1 - Processing the Tags

Since we have multiple tags for the same movie, we can join them into a single string (document) per movie to describe it.   
Create a function named `create_movie_document_tags`, that applies the following processing to tags:  
- Remove whitespaces for multi-word tags;
- Lowercase tags; 
- Join all the tags in a single document, separated by comma;   

The function should then return a series that maps each `movieId` to a single document containing all of its processed `tags`. The series should have `movieId` as index.  
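
The processing steps can be sketched on made-up tags. `toy_tags`, `toy_docs` and `sep` are hypothetical names; `sep` is only a placeholder for this sketch, so use whichever separator the graded function requires.

```python
import pandas as pd

# Toy tags table (hypothetical values)
toy_tags = pd.DataFrame({
    "movieId": [1, 1, 2],
    "tag": ["Dark Comedy", "Funny", "Time Travel"],
})

sep = ","  # placeholder separator for this sketch

# Remove the spaces inside multi-word tags, then lowercase
cleaned = toy_tags["tag"].str.replace(" ", "", regex=False).str.lower()

# Join all the tags of each movie into one document
toy_docs = cleaned.groupby(toy_tags["movieId"]).apply(lambda tags: sep.join(tags))
print(toy_docs)
```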

In [None]:
def create_movie_document_tags(df):
    """
    1st step of preprocessing the movies contents.
    Join the multiple tags (lower cased and without spaces) in a single row for each movie, with the tags separated by a space.
    
    Parameters
    ----------
    df : pd.DataFrame
              Original dataframe for the tags
              
    Returns
    -------
    tags : pd.Series
                        Series with movieId as Index and the multiple tags in a document.
    """
    # tags = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    return tags

In [None]:
df_document_tags = create_movie_document_tags(df_tags)
assert isinstance(df_document_tags, pd.Series)
assert df_document_tags.shape[0] == 1551
assert hashlib.sha256(json.dumps(df_document_tags.loc[50]).encode()).hexdigest() == "a3bbafd77ce83eac3385449cebf8f7db4d5174b30da52e7cbdc35550e4d547cd"

## 3.2 - Add more information about movies

From the dataframe `df_movies`, create a new dataframe named `df_movies_processed`, with `movieId` as the index. Create a new column named `decade` from the column `year`, with the help of the function `extract_decade` below. After creating the column `decade`, drop the column `year`.   
Also, clean the column `genres` so that each genre is in lower case and without spaces (for genres with more than one word), with the genres separated by one space.   
Finally, join the series `df_document_tags` by `movieId` and name the new column `tags`.  
If there are any missing values, fill them with empty strings.   
Select only the movies that have ratings, assign the result to `df_movies_processed_w_ratings`, and keep the same order in which they appear in the ratings matrix.   

In [None]:
def extract_decade(value):
    try: 
        decade = int(10*round(value/10))
        return str(decade)
    except:
        return ""

In [None]:
df_movies_processed = df_movies[["genres", "year"]].copy()  # copy so that new columns are not written to a view of df_movies
# df_movies_processed["decade"] = ...
# df_movies_processed["genres"] = ...
# df_movies_processed["tags"] = ...
# df_movies_processed_w_ratings = ...
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
assert df_movies_processed_w_ratings.shape == (9298, 3)
assert {'genres', 'decade', 'tags'}.issubset(df_movies_processed_w_ratings.columns)
assert hashlib.sha256(json.dumps(int(sum(df_movies_processed_w_ratings.index))).encode()).hexdigest() == "131cb34d8c2b760e68aca73489b1adb4df2bac32813f0f094ebf425fe93cc037"
assert hashlib.sha256(json.dumps(df_movies_processed_w_ratings.iloc[58].decade).encode()).hexdigest() == "637116a317aaf00163e5ae8e254bd0fa5625603b98429d84f87e5c8c5240e350"
assert hashlib.sha256(json.dumps(df_movies_processed_w_ratings.iloc[58].genres).encode()).hexdigest() == "2a82415726224b3ddf84350f8d9f94213c447329073f7bc9096abbc8dcfa9660"
assert hashlib.sha256(json.dumps(df_movies_processed_w_ratings.iloc[58].tags).encode()).hexdigest() == "12ae32cb1ec02d01eda3581b127c1fee3b0dc53572ed6baf239721a03d82e126"
assert df_movies_processed_w_ratings[df_movies_processed_w_ratings.tags==""].size == 23307
df_movies_processed_w_ratings.head()

## 3.3 - Calculate the Profiles for the Items   
### 3.3.1 - Calculate the tf-idf

Create a function named `get_tf_idf` that accepts a pandas series whose values are of type `str`, and returns the corresponding tf-idf matrix for that series.
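
A minimal sketch of computing tf-idf for a series of documents with `TfidfVectorizer` and its default settings. `toy_docs` and `toy_tfidf` are hypothetical names and the documents are made up.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# One document per item; an empty string stands for an item without tags
toy_docs = pd.Series(["funny dark", "funny", ""], index=[1, 2, 3])

vectorizer = TfidfVectorizer()
# fit_transform learns the vocabulary and returns a sparse (n_items, n_terms) matrix
toy_tfidf = vectorizer.fit_transform(toy_docs)
print(toy_tfidf.shape)
```

Items with an empty document simply get an all-zero row.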

In [None]:
def get_tf_idf(doc_series: pd.Series):
    """
    Generates the tf-idf of doc_series
    
    Parameters
    ----------
    doc_series : pd.Series, shape:(n_items, )
              series where elements are of type str
              
    Returns
    -------
    doc_tf_idf : csr_matrix, shape: (n_items, n_tfidf_features)
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    return doc_tf_idf

In [None]:
tag_tf_idf  = get_tf_idf(df_movies_processed_w_ratings["tags"])
assert tag_tf_idf.shape == (9298, 1458)
np.testing.assert_almost_equal(tag_tf_idf[:, 5].toarray().sum(), 2.26, 2)

### 3.3.2 - Calculate items profile

Let's now create an item's profile that contains information regarding `tags`, `genres`, and `decades`. 

Create a function named `get_items_profile` that accepts a dataframe, calculates the tf-idf for each column, horizontally concatenates the resulting csr_matrix objects, and returns the result as a csr_matrix.        
Hint: Use the function `get_tf_idf` and `scipy.sparse.hstack` (`np.hstack` only works on dense arrays).
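
Sparse matrices with the same number of rows can be concatenated side by side with `scipy.sparse.hstack`. The sketch below uses two made-up feature blocks; `toy_a`, `toy_b` and `toy_profile` are hypothetical names.

```python
from scipy.sparse import csr_matrix, hstack

# Two feature blocks for the same 2 items (hypothetical values)
toy_a = csr_matrix([[1.0, 0.0], [0.0, 2.0]])
toy_b = csr_matrix([[3.0], [0.0]])

# Concatenate column-wise and ask for a CSR result
toy_profile = hstack([toy_a, toy_b], format="csr")
print(toy_profile.shape)
```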

In [None]:
def get_items_profile(df: pd.DataFrame):   
    """
    Creates the item profiles from tf-idf applied to all the columns on df
    
    Parameters
    ----------
    df : pd.DataFrame, shape:(n_items, n_features)
              
              
    Returns
    -------
    item_profiles : csr_matrix, shape: (n_items, n_tfidf_features)
    """
    # item_profiles = ...
    # YOUR CODE HERE
    raise NotImplementedError()
    return item_profiles

In [None]:
item_profiles = get_items_profile(df_movies_processed_w_ratings)
assert item_profiles.shape == (9298, 1493)
assert hashlib.sha256(json.dumps(int(item_profiles[45, :].toarray()[0].sum())).encode()).hexdigest() == "ef2d127de37b942baad06145e54b0c619a1f22327b2ebbcfbec78f5564afe39d"

## 3.4 - User Profiles
The next step is to create a function called `make_user_profiles` that accepts the ratings matrix and the items profiles, calculates, and returns the user profiles.
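
Assuming user profiles are taken as the product of the ratings with the item profiles, the computation can be sketched on made-up matrices. `toy_R`, `toy_items` and `toy_users` are hypothetical names.

```python
from scipy.sparse import csr_matrix

toy_R = csr_matrix([[5.0, 0.0], [0.0, 4.0]])                # (n_users, n_items)
toy_items = csr_matrix([[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]])  # (n_items, n_features)

# Each user profile is the rating-weighted sum of the profiles of the items they rated
toy_users = toy_R @ toy_items                                # (n_users, n_features)
print(toy_users.toarray())
```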

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
user_profiles = make_user_profiles(R, item_profiles)
assert user_profiles.shape == (147, 1493)
assert hashlib.sha256(json.dumps(int(user_profiles[10, :].toarray()[0].sum())).encode()).hexdigest() == "3c7f572560e6d2f14680d05690428dbedc48378a6b8015d86024428f36791dad"

## 3.5 - The Moment of Truth
Finally, let's make predictions, this time based not only on ratings but also on the items' characteristics. 
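
One possible way to score items for each user is the cosine similarity between user profiles and item profiles, with already rated items set to 0. This is a sketch under assumptions: the choice of cosine similarity and the names `toy_R`, `toy_items`, `toy_users` and `toy_scores` are hypothetical, and dense arrays are used only for readability.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

toy_R = np.array([[5.0, 0.0], [0.0, 4.0]])      # ratings, 0 means "not rated"
toy_items = np.array([[1.0, 0.0], [1.0, 1.0]])  # item profiles
toy_users = toy_R @ toy_items                   # user profiles

# Score every (user, item) pair by how well the profiles align
toy_scores = cosine_similarity(toy_users, toy_items)

# Suppress the items each user has already rated
toy_scores[toy_R.nonzero()] = 0
print(toy_scores)
```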

In [None]:
def make_predictions(R, item_profiles, user_profiles):
    """
    Make predictions based on the ratings matrix, the item profiles and the user profiles we calculated previously.
    
    
    Parameters
    ----------
    R : csr_matrix. shape: (n_users, n_items)
        Matrix containing the ratings initially assigned.
        
    item_profiles : csr_matrix. shape: (n_items, n_tfidf_features)
                    Matrix containing the TF-IDF features calculated for the items.
                    
    user_profiles : csr_matrix. shape: (n_users, n_tfidf_features)
                    Matrix containing the user profiles as the product of the ratings with the item profiles.
                    
                    
    Returns
    -------
    predictions : csr_matrix. shape: (n_users, n_items)
                  Matrix with the predictions. Already rated content is suppressed to 0.
    """
    # YOUR CODE HERE
    raise NotImplementedError()
    return predictions

In [None]:
pred = make_predictions(R, item_profiles, user_profiles)
assert pred.shape == (147, 9298)
assert hashlib.sha256(json.dumps(int(pred[:, 50].toarray().sum())).encode()).hexdigest() == "bbb965ab0c80d6538cf2184babad2a564a010376712012bd07b0af92dcd3097d"
assert hashlib.sha256(json.dumps(pred[pred == 0].size).encode()).hexdigest() == "ade54f1d9e8f688033b38ac8289791a62e9e1b9fde4ea1eb8ae8cb7f3ac503a8"

## 3.6 - Get Top-20 items for John (Not Graded)

With these new predictions, we can give recommendations to John, taking into account not only the ratings but also the movies' content.

In [None]:
top_movies_ids = get_top_n(pred, john_id, 20)
df_movies.loc[top_movies_ids]