# BLU10 - Non-personalized Recommenders: Exercise notebook

In [None]:
import os
import numpy as np
import scipy as sp
from scipy.sparse import csr_matrix
import pandas as pd
from mlxtend.frequent_patterns import apriori
import hashlib
import json

You will be working with data from an online bookstore. Every time a customer buys a book, they can rate it, and the bookstore uses those ratings to create recommendations for future customers.

In this exercise notebook, you will help the bookstore team to choose which books to display in different areas of the website.

The bookstore provided a data file with customer ratings and another file with information about the book genres:

* `BookRatings.csv` has the historical ratings given by the customers and represents all the books sold.
* `BooksInfo.csv` has the main genre of each book.

Let's load and preview the data.

In [None]:
ratings = pd.read_csv('data/BookRatings.csv')
books_info = pd.read_csv('data/BooksInfo.csv')

In [None]:
ratings.head()

In [None]:
books_info.head()

## Exercise 0 - EDA (ungraded)
Let's first check whether the data is complete:
- check for ratings with incomplete data
- check for duplicated records in ratings
- check for books without a genre
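
As a rough illustration of these three checks, here is a minimal sketch on made-up data. The toy dataframes and their values are hypothetical; only the column names (`User-ID`, `ISBN`, `Book-Rating`, `Genre`) are taken from the docstrings later in this notebook.

```python
import pandas as pd

# Hypothetical toy data, not the bookstore files.
toy_ratings = pd.DataFrame({
    "User-ID": [1, 1, 2, 2, 3],
    "ISBN": ["A", "A", "B", None, "C"],
    "Book-Rating": [8, 8, 5, 7, None],
})
toy_info = pd.DataFrame({
    "ISBN": ["A", "B", "C"],
    "Genre": ["Fiction", None, "Poetry"],
})

# Number of missing values in each column of the ratings.
missing_per_column = toy_ratings.isna().sum()

# Number of rows that exactly repeat an earlier row.
n_duplicates = int(toy_ratings.duplicated().sum())

# Number of books with no genre.
n_no_genre = int(toy_info["Genre"].isna().sum())

print(missing_per_column)
print(n_duplicates, n_no_genre)
```

The same calls should work on `ratings` and `books_info`, assuming they have the columns listed above.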

In [None]:
# use this cell for your code

## Exercise 1 - Ratings matrix

In this exercise, you will create the ratings matrix and check its properties.

### Exercise 1.1 - Create the ratings matrix
Implement the function below which should transform the `ratings` dataframe to a ratings matrix.
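
To make the idea concrete, here is one possible sketch on a tiny made-up dataframe, with users as rows and books as columns. It uses `DataFrame.pivot`, which is only one of several ways to do this; the toy data is hypothetical.

```python
import pandas as pd

# Hypothetical toy ratings: one row per (user, book) pair.
toy = pd.DataFrame({
    "User-ID": [1, 1, 2, 3],
    "ISBN": ["A", "B", "B", "C"],
    "Book-Rating": [8, 5, 7, 10],
})

# One row per user, one column per book; unrated pairs become NaN, then 0.
toy_R = (
    toy.pivot(index="User-ID", columns="ISBN", values="Book-Rating")
       .fillna(0)
       .to_numpy()
)

print(toy_R)
```

Note that `pivot` raises an error if the same (user, book) pair appears more than once, so duplicated records would have to be handled first.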

In [None]:
def make_ratings(data):
    """
    Creates a ratings matrix from the provided dataframe.
    Fills the missing values with 0.
    
    Parameters:
        data (pd.DataFrame): the ratings dataframe with ratings per ISBN and User-ID
        
    Returns:
        R (np.ndarray): Ratings matrix created from data
    """

    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
R = make_ratings(ratings)
assert isinstance(R,np.ndarray), 'The ratings matrix should be a numpy array.'
assert hashlib.sha256(json.dumps(R.shape).encode()).hexdigest() == \
'b466a45e18554bb465511315ce036735e079772d9136499b0066f0acd78d5a8d', 'The shape of the ratings matrix is not correct.'
assert hashlib.sha256(json.dumps(R[0].sum()).encode()).hexdigest() == \
'0729c13ebd725201c1445a00c825237d305ff650cd72f50e45259bd942a75ef4', 'The ratings matrix is not correct.'
assert hashlib.sha256(json.dumps(R.sum()).encode()).hexdigest() == \
'bec147ffed0b304733e83e0732667d8f85aa5374f8ae52ebbb8734b3780097d2', 'The ratings matrix is not correct.'
assert hashlib.sha256(json.dumps(R[:,0].sum()).encode()).hexdigest() == \
'f1e42019aecc858ffbcca7fddec511b761b474916fde37b1a6ff321a9b459330', 'The ratings matrix is not correct.'
f"We have {R.shape[0]} users and {R.shape[1]} items."

### Exercise 1.2 - Density score
Implement the function below to calculate the density score of the ratings matrix R.
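
The density score is the fraction of cells in the matrix that hold a rating. The sketch below assumes that 0 marks a missing rating (as in the `make_ratings` docstring) and uses a hypothetical 3x3 toy matrix.

```python
import numpy as np

# Hypothetical toy ratings matrix: 4 ratings in 9 cells.
toy_R = np.array([
    [8, 5, 0],
    [0, 7, 0],
    [0, 0, 10],
], dtype=float)

# Density = number of non-zero cells / total number of cells.
toy_density = float(np.count_nonzero(toy_R) / toy_R.size)

print(toy_density)
```

Wrapping the result in `float` turns the NumPy scalar into a plain Python float.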

In [None]:
def get_density_score(matrix):
    """
    Calculates the density score of a numpy ratings matrix.
    
    Parameters:
        matrix (np.ndarray): ratings matrix
        
    Returns:
        dense_score (float): density score of the matrix
    """
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
dense_score = get_density_score(R)
assert isinstance(dense_score, float), 'The density score should be a float.'
np.testing.assert_almost_equal(dense_score,0.0004,4, err_msg='The score is not correct.')
f"The density score of the ratings matrix is {dense_score}."

### Exercise 1.3 - Sparse ratings matrix
As you just saw, the matrix is very sparse. Implement the function below which converts a numpy ratings matrix like the one you created in exercise 1.1 to a scipy compressed sparse row matrix.
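
As a small illustration, the sketch below converts a hypothetical toy matrix to CSR format with `scipy.sparse.csr_matrix` and checks that nothing is lost in the round trip.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy ratings matrix with 4 non-zero entries.
toy_R = np.array([
    [8, 5, 0],
    [0, 7, 0],
    [0, 0, 10],
], dtype=float)

# The CSR format stores only the non-zero values and their positions.
toy_sparse = csr_matrix(toy_R)

print(toy_sparse.nnz)        # number of stored non-zero values
print(toy_sparse.toarray())  # back to a dense numpy array
```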

In [None]:
def get_csr(matrix):
    """
    Transforms the provided numpy matrix to a scipy compressed sparse row matrix
    
    Parameters:
        matrix (np.ndarray): the ratings matrix
    
    Returns:
        H (sp.sparse.csr_matrix): the compressed sparse row matrix
        
    """
    
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
sparse_mat = get_csr(R)
assert sparse_mat.shape == R.shape, 'The shape of the sparse matrix is not correct.'
assert sparse_mat.sum() == R.sum(), 'The content of the matrix is not correct.'
assert hashlib.sha256(json.dumps(str(sparse_mat)).encode()).hexdigest() == \
'f0012636d41b3a8f1995247d731cc4b38ba50aa47b8dcd563923d3682e329830', 'The sparse matrix is not correct.'

## Exercise 2 - Non-personalized recommendations
Now we will use the information about books and their genres and ratings to create non-personalized recommendations. The dataframe manipulation functions from BLU02 and BLU04 might come in handy.

### Exercise 2.1 - Merge the data sets

Implement the function below to merge the dataframes `ratings` and `books_info` in order to have information about the genre of each book. Include only the books that have a rating. 
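
The `how` argument of `DataFrame.merge` decides which rows survive the merge. The sketch below contrasts two options on hypothetical toy data; which one matches the expected result for the bookstore data is left for you to work out.

```python
import pandas as pd

# Hypothetical toy data: book "Z" has a rating but no genre info,
# and book "C" has genre info but no rating.
toy_ratings = pd.DataFrame({
    "User-ID": [1, 2, 3],
    "ISBN": ["A", "B", "Z"],
    "Book-Rating": [8, 5, 7],
})
toy_info = pd.DataFrame({
    "ISBN": ["A", "B", "C"],
    "Genre": ["Fiction", "Poetry", "Science"],
})

# Keeps every rating; the genre of "Z" becomes NaN.
left = toy_ratings.merge(toy_info, on="ISBN", how="left")

# Keeps only ratings of books that also appear in toy_info.
inner = toy_ratings.merge(toy_info, on="ISBN", how="inner")

print(left)
print(inner)
```

In both cases the unrated book "C" is left out of the result.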

In [None]:
def get_book_ratings_df(ratings_, books_info_):
    """
    Merges the provided dataframes. Includes only books with a rating.
    
    Parameters:
        ratings_ (pd.DataFrame): dataframe with User-ID, ISBN, and Book-Rating
        books_info_ (pd.DataFrame): dataframe with ISBN and Genre
        
    Returns:
        book_ratings (pd.DataFrame): dataframe with all the information for books that have a rating.
    """
    
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
book_ratings = get_book_ratings_df(ratings, books_info)
assert isinstance(book_ratings,pd.DataFrame), 'The result should be a dataframe.'
assert book_ratings.shape==(108910, 4), 'The shape of book_ratings is not correct.'
assert np.sum([i in ['User-ID', 'ISBN', 'Book-Rating', 'Genre'] for i in book_ratings.columns]) == 4, \
'The columns of book_ratings are not correct.'
assert book_ratings['Book-Rating'].sum() == 842362, 'The content of book_ratings is not correct.'
book_ratings.head()

### Exercise 2.2 - The most popular books in the store

The bookstore wants to display on their website a collection of the most popular books in the store. Since we don't have information on purchases we are going to use the ratings to assess popularity.

Create a function that takes the merged dataframe from exercise 2.1 and returns a list with the ISBNs of the top N most popular books in the store - the N books that received the most ratings. The values in the list should be ordered from the most popular to the least popular book.
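
Counting ratings per book can be sketched with `Series.value_counts`, which returns the counts sorted from the most to the least frequent value. The toy data below is hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one row per rating.
toy = pd.DataFrame({
    "ISBN": ["A", "B", "B", "C", "C", "C"],
    "Book-Rating": [8, 5, 7, 10, 9, 6],
})

# Number of ratings per book, most rated first.
counts = toy["ISBN"].value_counts()

# ISBNs of the 2 most rated books.
top_2 = counts.head(2).index.tolist()

print(top_2)
```

When several books have the same number of ratings, the order among them is not something to rely on.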

In [None]:
def get_popular_books(df, n):
    """
    Finds the n books with the most ratings (the most popular books).
    
    Parameters:
        df (pd.DataFrame): dataframe with info about book ratings and genre
        n (int): how many books to find
        
    Returns:
        top_n_popular_books (list): list with the ISBNs of the top n popular books ordered
                                    from the most to the least popular book
    """
    
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
top_5_popular_books = get_popular_books(book_ratings, 5)
assert isinstance(top_5_popular_books, list), 'The result should be a list.'
assert len(top_5_popular_books) == 5, 'The length of the list is not correct.'
assert hashlib.sha256(json.dumps(''.join(top_5_popular_books)).encode()).hexdigest() == \
'0c5a95dd00da083b53d000a115b3dd038248f0e2df748180ad6111f533ae308b', 'The selected ISBNs are not correct.'
top_5_popular_books

### Exercise 2.3 - The best rated books

The bookstore also wants to display on the website a collection of the books with the best ratings in the store. 

Create a function that returns the top N best rated books with at least k ratings. Use the mean rating of each book for comparison. The function should return a list of the ISBNs of the top N books ordered from the best to the worst rated book.
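
The minimum number of ratings protects against books that look excellent only because a single customer rated them. A sketch on hypothetical toy data, computing the count and the mean per book in one pass:

```python
import pandas as pd

# Hypothetical toy data: book "C" has a perfect mean but only one rating.
toy = pd.DataFrame({
    "ISBN": ["A", "A", "B", "B", "B", "C"],
    "Book-Rating": [9, 7, 6, 8, 7, 10],
})

# Count and mean of the ratings for each book.
stats = toy.groupby("ISBN")["Book-Rating"].agg(["count", "mean"])

# Keep only books with at least k = 2 ratings.
eligible = stats[stats["count"] >= 2]

# Best mean rating first, then take the top n = 1.
top = eligible.sort_values("mean", ascending=False).head(1).index.tolist()

print(top)
```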

In [None]:
def get_topn_rates(df, n, k):
    """
    Finds the top n best rated books with at least k ratings.
    
    Parameters:
        df (pd.DataFrame): dataframe with info about book ratings and genre
        n (int): how many books to find
        k (int): minimum number of ratings that a book should have to be considered
        
    Returns:
        top_books (list): list of ISBNs of top n best mean rated books with at least k ratings.
    """
    
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
top5_rates = get_topn_rates(book_ratings, 5, 10)
assert isinstance(top5_rates, list), 'The result should be a list.'
assert len(top5_rates) == 5, 'The length of the list is not correct.'
assert hashlib.sha256(json.dumps(''.join(top5_rates)).encode()).hexdigest() == \
'a0ee87be17e923a41c3838be3f3d47ad271560e030b6382b4e916d536ba9fc21', 'The selected ISBNs are not correct.'
top5_rates

### Exercise 2.4 - Loyal customers

The bookstore wants to reward the customers that gave the most ratings on the website. 

Create a function that returns a list of the top N users that gave the most ratings. Order the list by the number of given ratings in descending order.

In [None]:
def get_loyal_customers(df, n):
    """
    Finds the customers which gave the most ratings.
    
    Parameters:
        df (pd.DataFrame): dataframe with info about book ratings and genre
        n (int): number of customers to find
        
    Returns:
        top_n_loyal_customers (list): a list of the n user IDs which gave the most ratings
    """
    
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
top_10_loyal_customers = get_loyal_customers(book_ratings, 10)
assert isinstance(top_10_loyal_customers, list), 'The result should be a list.'
assert len(top_10_loyal_customers) == 10, 'The length of the list is not correct.'
assert hashlib.sha256(json.dumps(''.join([str(i) for i in top_10_loyal_customers])).encode()).hexdigest() == \
'80b627ea07b2eedf56a56f6887c6eddb6586d8a2e9c2c6ff7c194d4b3133b33f', 'The selected userIDs are not correct.'
top_10_loyal_customers

### Exercise 2.5 - The genre of the most rated book
Implement the function below which should find the genre of the most rated book.

In [None]:
def genre_of_most_rated_book(df):
    """
    Finds the genre of the most rated book.
    
    Parameters:
        df (pd.DataFrame): dataframe with info about book ratings and genre
        
    Returns:
        genre (str): the genre of the most rated book
    """
    
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
genre_top_book = genre_of_most_rated_book(book_ratings)
assert isinstance(genre_top_book, str), 'The result should be a string.'
assert hashlib.sha256(json.dumps(genre_top_book).encode()).hexdigest() == \
'73a05d46de472f9fc34b6377495f8c7af5a1a1c0904f260955e14d18f2465846', 'Not correct.'
print(genre_top_book)

### Exercise 2.6 - The most popular books by genre

The bookstore wants to display the most popular book in each genre.

Create a function that returns a dataframe with the most popular book of each genre, judging by the number of ratings each book received. The columns of the dataframe should be `Genre`, `ISBN`, and `Rating-Count` (the number of ratings of the most popular book).

Hint: you will need [this function](https://pandas.pydata.org/docs/reference/api/pandas.core.groupby.DataFrameGroupBy.first.html#pandas.core.groupby.DataFrameGroupBy.first). Remember that you can aggregate by several columns at once.
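
One way to read the hint, sketched on hypothetical toy data: count the ratings per (genre, book) pair, sort so that the most rated books come first, and then keep the first row of each genre.

```python
import pandas as pd

# Hypothetical toy data: one row per rating.
toy = pd.DataFrame({
    "ISBN": ["A", "A", "B", "C", "C", "C", "D"],
    "Genre": ["Fiction", "Fiction", "Fiction",
              "Poetry", "Poetry", "Poetry", "Poetry"],
})

# Number of ratings for each (genre, book) pair.
counts = toy.groupby(["Genre", "ISBN"]).size().reset_index(name="Rating-Count")

# Most rated books first, then the first row within each genre.
top_per_genre = (
    counts.sort_values("Rating-Count", ascending=False)
          .groupby("Genre", as_index=False)
          .first()
)

print(top_per_genre)
```

This works because `groupby` keeps the row order within each group, so `first` picks the row with the highest count.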

In [None]:
def most_popular_book_per_genre(df):
    """
    Finds the most popular book for each genre.
    
    Parameters:
        df (pd.DataFrame): dataframe with info about book ratings and genre
        
    Returns:
        top_books_genre (pd.DataFrame): dataframe with the most popular books for each genre
                                        with three columns: Genre, ISBN, Rating-Count
    """
    
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
top_books_genre = most_popular_book_per_genre(book_ratings)
assert isinstance(top_books_genre, pd.DataFrame), 'The result should be a dataframe.'
assert top_books_genre.shape == (2543,3), 'The shape of the dataframe is not correct.'
assert np.sum([i in ['Genre','ISBN','Rating-Count'] for i in top_books_genre.columns]) == 3, 'The column names are not correct.'
assert top_books_genre['Rating-Count'].sum() == 6683, 'The Rating-Count column is not correct.'
assert hashlib.sha256(json.dumps(''.join(top_books_genre['ISBN'])).encode()).hexdigest() == \
'713ee806c9ad291b6b08c8cce24a3f6defd01f2149a2206a62389b502c94bd7b', 'The ISBN column is not correct.'
top_books_genre.head()

### Exercise 2.7 - Best average rated books by genre

The bookstore also wants to display the best rated books per genre.

Create a function that returns a dataframe with the top n books with the highest average rating in each genre. If there are more than n books with the same mean rating, sort them by `ISBN` in descending order. The dataframe columns should be `Genre`, `ISBN`, and `Book-Rating`. Sort the dataframe by `Genre` and `Book-Rating`.
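
A sketch of the top n per group pattern on hypothetical toy data. It assumes that the best rated books should come first within each genre, which is one reading of the sorting instruction above.

```python
import pandas as pd

# Hypothetical toy data: books "B" and "C" tie on the mean rating.
toy = pd.DataFrame({
    "ISBN": ["A", "A", "B", "C", "D", "D"],
    "Genre": ["Fiction", "Fiction", "Fiction", "Fiction", "Poetry", "Poetry"],
    "Book-Rating": [8, 6, 9, 9, 5, 7],
})

# Mean rating of each book within its genre.
means = toy.groupby(["Genre", "ISBN"], as_index=False)["Book-Rating"].mean()

# Genre ascending, best mean first, ties broken by ISBN in descending order.
ranked = means.sort_values(
    ["Genre", "Book-Rating", "ISBN"], ascending=[True, False, False]
)

# Keep the first n = 2 rows of each genre.
top_2 = ranked.groupby("Genre").head(2).reset_index(drop=True)

print(top_2)
```

Unlike `first`, `head(n)` on a grouped dataframe keeps up to n rows per group and preserves the existing row order.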

In [None]:
def get_topn_rated_genre(df, n):
    """
    Finds the top n books with the best average rating per genre.
    
    Parameters:
        df (pd.DataFrame): dataframe with info about book ratings and genre
        n (int): the number of books to find
              
    Returns:
        books (pd.DataFrame): dataframe with top n books with the highest average rating per genre
                              with columns Genre, ISBN, Book-Rating
    """
    
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
topn_rated_genre = get_topn_rated_genre(book_ratings, 3)
assert isinstance(topn_rated_genre, pd.DataFrame), 'The result should be a dataframe.'
assert topn_rated_genre.shape == (3921,3), 'The shape of the dataframe is not correct.'
assert np.sum([i in ['Genre','ISBN','Book-Rating'] for i in topn_rated_genre.columns]) == 3, 'The column names are not correct.'
assert hashlib.sha256(json.dumps(''.join(topn_rated_genre['Genre'])).encode()).hexdigest() == \
'588fcfcf3099b9bd150024fbf74ae943f4cf7eec60ce7d42bcbbe47b65dc69ba', 'Did you sort the Genre column?'
np.testing.assert_almost_equal(topn_rated_genre['Book-Rating'].sum(), 31393.701483656077, decimal=2,
                              err_msg='The Book-Rating column is not correct.')
assert hashlib.sha256(json.dumps(''.join(topn_rated_genre['ISBN'])).encode()).hexdigest() == \
'a1ff93a88be527a107f1bee949045dbf7c51467bfd3aa52a33676178d18e291c', 'The ISBN column is not correct.'
topn_rated_genre[topn_rated_genre['Genre']=='Fiction']

### Exercise 2.8 - Most common groups of books

The bookstore wants to display groups of books that the users usually rate together.

Create a function that returns the N most frequent sets of M books that the users rate together for a given minimum support. The function should return a dataframe with two columns, `support` and `itemsets`, ordered by support in descending order. The input of the function is the ratings matrix `R` that you created in exercise 1.1.
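
The support of a set of books is the fraction of users who rated every book in the set. The sketch below computes it by brute force for all pairs in a hypothetical toy matrix. It is not the apriori algorithm, which reaches the same supports more efficiently by discarding sets whose subsets are already below the minimum support; it only shows what the `support` column measures.

```python
from itertools import combinations

import numpy as np

# Hypothetical toy ratings matrix: 4 users (rows) and 4 books (columns).
toy_R = np.array([
    [8, 5, 0, 7],
    [9, 6, 0, 0],
    [0, 4, 3, 0],
    [7, 8, 0, 9],
], dtype=float)

# True where a user rated a book; the rating value itself is ignored.
rated = toy_R > 0


def support(itemset):
    """Fraction of users who rated every book (column index) in itemset."""
    return float(rated[:, list(itemset)].all(axis=1).mean())


# Support of every pair of books.
pairs = {pair: support(pair) for pair in combinations(range(rated.shape[1]), 2)}

# Pairs that reach a minimum support of 0.5.
frequent_pairs = {pair: s for pair, s in pairs.items() if s >= 0.5}

print(frequent_pairs)
```

The brute-force approach becomes impractical as the number of books grows, which is why a pruning algorithm is used on the full ratings matrix.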

In [None]:
def get_apriori_booksets(R, min_support=0.003, n=3, m=3):
    """
    Finds the N most frequent sets of M books that the users rate together for a given minimum support.
    
    Parameters:
        R (np.ndarray): ratings matrix
        min_support (float): minimum support for the itemsets
        n (int): number of top n itemsets to return
        m (int): number of items in itemsets
              
    Returns:
        booksets (pd.DataFrame): dataframe with the top n itemsets, with columns support and itemsets,
                                  ordered by support in descending order.
    """
    
    # YOUR CODE HERE
    raise NotImplementedError()

In [None]:
get_3_booksets = get_apriori_booksets(R, min_support=0.003, n=3, m=3)
assert isinstance(get_3_booksets, pd.DataFrame), 'The result should be a dataframe.'
assert get_3_booksets.shape == (3,2), 'The shape of the dataframe is not correct.'
np.testing.assert_almost_equal(get_3_booksets.support.sum(), 0.013349727735815914, decimal=4,
                              err_msg='The support column is not correct.')
assert hashlib.sha256(json.dumps(''.join([str(int(i)) for i in sorted(get_3_booksets.itemsets.iloc[0])])).encode()).hexdigest() == \
'9f645311b81ee935d02affd458818360c52a8fad05c731e7a477f44f4b2832e0', 'The selected itemsets are not correct.'
assert hashlib.sha256(json.dumps(''.join([str(int(i)) for i in sorted(get_3_booksets.itemsets.iloc[1])])).encode()).hexdigest() == \
'e5175b9cd2c8622984d4b0cb51604e33785f24e4a54847949269e766f096d02e', 'The selected itemsets are not correct.'
assert hashlib.sha256(json.dumps(''.join([str(int(i)) for i in sorted(get_3_booksets.itemsets.iloc[2])])).encode()).hexdigest() == \
'c258eb47dcdd3134684670626e5851b38c3bd8eeab22c5b05a8915eb67837df3', 'The selected itemsets are not correct.'
get_3_booksets

Well done! You just mastered your first recommender system.