In [1]:
# Import the necessary dependencies

# Operating System
import os

# Numpy, Pandas and Scipy
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix, save_npz, load_npz

# Scikit-learn
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.model_selection import train_test_split

# Model Evaluation
from evaluation import evaluate

# BLU12 - Learning Notebook - Workflow
## Introduction
This week we are going to simulate the real-life environment of the Hackathon! We will provide you with a dataset, some tips and you are going to train, validate and predict with a Recommender System using all the knowledge acquired throughout this specialization.

## Context
You have finished the Academy and you are hired as a Data Scientist in Recommentunify *(ignore the horrible company name, it works and we have more important stuff to discuss)*.

First week on the job and the CEO of the next-unicorn-startup that hired you pitches your first task for the company. NASA has discovered that "Despacito 2" is on the making and you need to **urgently avoid this scenario.**

<img src="./media/despacito_nasa.jpg" alt="Miss me yet?" width="500"/>

He doesn't care how you do it or how it works, just that you use fancy trendy eye-catching mind-blowing AI *stuff* to promote Recommentunify's name in the Data Science industry.

## Technical Task
After the pitch from the CEO, you talk with your Lead Data Scientist and you get the tasks needed to fulfill the job. The company collected some data on the users, you will need to create the Recommender System that is effective enough so that your users will not need a 2nd version of a Despacito.

Your Lead DS keeps some of the data as a test set (in which you will estimate production performance) and gives you the remaining data for yourself.

## Step -1: Understanding the data
You have available under the `./data/` folder some files that you can use for building the Recommender System. As a good scientist, the first step is to validate that the data you have will be good enough to proceed.

The files we have available for the **training** stage are:
* `train_play_counts.txt` has the listening history for each user
* `song_tag.csv`: has the relationship between tags and songs
* `songs.txt`: has the correspondence between the song_id (a unique id, eg: SOAAADD12AB018A9DD) and the song_index (an integer index, eg: 1)
* `tags.csv`: has the correspondence between the tag_id (a tag name, eg: classic rock) and the tag_index (an integer index, eg: 1)



For the **test** stage, we have:
* `test_users.csv`: has the user_ids for which a recommendation should be produced
* `example_output.csv`: has an example of an output (just for a couple of users)

The best approach is to look at the raw files and print out the first rows of each file just to get a overall sense of what we have.

*Note*: since the `song_tag.csv` is heavy, it was stored in a zip file which you will need to unzip first.

*Note for Windows users: remember that you need to convert the head command into the Windows equivalent. Tip: you can use what you learned in Data Wrangling Learning Unit!*

In [2]:
print("train_play_counts.txt \n")
!head -3 data/train_play_counts.txt

train_play_counts.txt 






In [3]:
#Unzip song_tag.zip first.
print("song_tag.csv\n")
!head -4 data/song_tag.csv

song_tag.csv

song_index,tag_index,val
254229,206,100.0
254229,1125,66.0
254229,582,66.0


In [4]:
print("songs.txt \n")
!head -3 data/songs.txt

songs.txt 






In [5]:
print("tags.csv \n")
!head -4 data/tags.csv

tags.csv 







In [6]:
print("test_users.csv \n")
!head -3 data/test_users.csv

test_users.csv 






In [7]:
print("example_output.csv \n")
!head -1 data/example_output.csv

example_output.csv 




## Step 0: Load the Data
After validating that the data we have is good enough, we start with building the ratings matrix. Our strategy for the first model is to get a non-personalized system as a baseline. This allows us to both predict for existing users (no cold-start problem) and for new users.

## Step 0.1: Load the Train and Test files

In [8]:
def read_users_history() -> pd.DataFrame:
    """Imports the listening history for each user.
    
    Returns:
        data (pd.DataFrame): DataFrame with the song and respective rating for each user. 
                             The rows are tuples of (user, song_id, rating).
                             
    """
    path = os.path.join('data', 'train_play_counts.txt')
    data = pd.read_csv(path, names=['user', 'song_id', 'rating'], sep='\t')
    return data

data = read_users_history()
data.head()

Unnamed: 0,user,song_id,rating
0,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOBONKR12A58A7A7E0,1
1,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOEGIYH12A6D4FC0E3,1
2,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOFLJQZ12A6D4FADA6,1
3,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SOHTKMO12AB01843B0,1
4,fd50c4007b68a3737fe052d5a4f78ce8aa117f3d,SODQZCY12A6D4F9D11,1


In [9]:
# How many ratings do we have in total?
# Tip: The ":," at the end of the f-string adds the thousand separator.
print(f"We have {len(data):,} ratings in total.")

We have 100,000 ratings in total.


In [10]:
def read_test_users() -> pd.DataFrame:
    """Imports the list of users for which we need to predict.
    
    Returns:
        users_to_pred (pd.DataFrame): DataFrame with the users for which we will recommend songs. 
    """

    path = os.path.join('data', 'test_users.csv')
    users_to_pred_ = pd.read_csv(path, names=['users to recommend songs'])
    
    return users_to_pred_


users_to_pred = read_test_users()
users_to_pred.head()

Unnamed: 0,users to recommend songs
0,56d985c92960b98ad76a48b10a062b0cd86795bf
1,0102d1549242d159df98f333aee4041f96f37e98
2,991411f0dca94f348c7bd3eae93b6e6c061605f1
3,03feb8ee0424fc5c0bafb25e4bdc6d9380a3caba
4,323fbb28144eefa3eabfa22bd310dfb0713de80d


In [11]:
# For how many users are we recommending stuff?
print(f"We have {len(users_to_pred)} users in need for better songs!")

We have 1000 users in need for better songs!


## Step 0.2: Compare the train and test files

In [12]:
# And for how many users we already know something?
# For the 1000 users in our test set, we have data for 700 in the original listening history.
# For the remaining 300 we will have to use non-personalized strategies.
users_to_pred.isin(data.user.values).sum()

users to recommend songs    700
dtype: int64

In [13]:
def get_indices_from_users_to_pred(users_to_pred: pd.DataFrame, data: pd.DataFrame):
    """Get the indices of users_to_pred for which we have data and for which we don't.
    
    Args:
        users_to_pred (pd.DataFrame): DataFrame containing the list of users we are going to recommend items.
        data (pd.DataFrame): Original of listening history for the users.
        
    Returns:
        index_users_in_data (Int64Index): Index that filters the users_to_pred to get the user_id's with training data.
        index_users_not_in_data (Int64Index): Index that filters the users_to_pred to get the user_id's without training data.
        
    """
    index_users_in_data = users_to_pred[users_to_pred.isin(data.user.values).values].index
    index_users_not_in_data = users_to_pred[~users_to_pred.isin(data.user.values).values].index
    
    return index_users_in_data, index_users_not_in_data


index_users_in_data, index_users_not_in_data = get_indices_from_users_to_pred(users_to_pred, data)

In [14]:
# For further inspection, we advise you to look at the objects themselves.
print(f"The index for users which we have training data has length of {len(index_users_in_data)}.")
print(f"The index for users which we don't have training data has length of {len(index_users_not_in_data)}.")

The index for users which we have training data has length of 700.
The index for users which we don't have training data has length of 300.


In [15]:
def get_users_to_pred_by_index(users_to_pred: pd.DataFrame, index_users_in_data: pd.Int64Index):
    """DataFrame with user_id's in test set for for which we have training data.

    Args: 
        users_to_pred (pd.DataFrame): DataFrame containing the list of users we are going to recommend items.
        index_users_in_data (Int64Index): Index that filters the users_to_pred to get the user_id's with training data.
    Returns:
        users_in_data (pd.DataFrame): Dataframe containing the list of user_id's with training data.
    
    """
    return users_to_pred.iloc[index_users_in_data].reset_index(drop=True)

# Get the test users with training data
test_users_in_data = get_users_to_pred_by_index(users_to_pred, index_users_in_data)
print(test_users_in_data.shape)
test_users_in_data.head()

(700, 1)


Unnamed: 0,users to recommend songs
0,56d985c92960b98ad76a48b10a062b0cd86795bf
1,991411f0dca94f348c7bd3eae93b6e6c061605f1
2,323fbb28144eefa3eabfa22bd310dfb0713de80d
3,55c750f0951ca1021b26c0e758660bb8a2c49d3a
4,b458e3d697276a93aa6926caf1ff08e875933940


In [16]:
# Get the test users without training data
test_users_not_in_data = get_users_to_pred_by_index(users_to_pred, index_users_not_in_data)
print(test_users_not_in_data.shape)
test_users_not_in_data.head()

(300, 1)


Unnamed: 0,users to recommend songs
0,0102d1549242d159df98f333aee4041f96f37e98
1,03feb8ee0424fc5c0bafb25e4bdc6d9380a3caba
2,02e4174f754ee4546f4c4438a4bea6ed877c6c4f
3,03070a3284db565b2e67f8ad01ce96c31c9986dd
4,01dd0b2bb39643b8925a672105819d599f827c87


## Step 0.3: Create the Ratings matrix

Here we are creating a rating matrix by instantiating the csr_matrix in a different way than before. Previously, a dense matrix (actually a `pd.DataFrame`) was created with the `.pivot()` method, the indices were ordered, the missing values were replaced with 0 and then the dataframe was used as argument of `csr_matrix`. This is an intensive task, time and memory wise. To avoid having to create a huge dataframe it is better to get the indices of users and items and the respective values (all `ndarray`s) and use the fourth instantiation method described in the [csr_matrix documentation]((https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html)).

In [17]:
def make_ratings(data: pd.DataFrame) -> csr_matrix:
    """Creates the ratings matrix of listening history with optional shape
    
    Creates the ratings matrix from the listening history imported using the read_users_history() method.
    
    Args:
        data (pd.DataFrame):  Listening history for the users.
        shape (tuple): The overall (n_users, n_items) shape desired for the matrix. 
                       If None, define the shape with the (n_users, n_items) from data argument.
        
    Returns:
        ratings (csr_matrix): Ratings matrix with shape (n_users, n_items).
    
    """
    users, user_pos = np.unique(data.iloc[:, 0].values, return_inverse=True)
    items, item_pos = np.unique(data.iloc[:, 1].values, return_inverse=True)
    values = data.iloc[:, 2].fillna(0).values
    
    #R Matrix dimensions (n_users, n_items)
    shape = (len(users), len(items))

    R_ = csr_matrix((values, (user_pos, item_pos)), shape=shape)
    return R_


R = make_ratings(data)
R

<7526x41194 sparse matrix of type '<class 'numpy.int64'>'
	with 100000 stored elements in Compressed Sparse Row format>

In [18]:
# Just for mental (in)sanity, let's match the info of the matrix to what is printed in the previous cell.
print(f"The shape is {R.shape}")
print(f"The dtypes of the elements are {R.dtype}")
print(f"The number of stored elements is {R.nnz}")
print(f"The type of the matrix is {type(R)}")

The shape is (7526, 41194)
The dtypes of the elements are int64
The number of stored elements is 100000
The type of the matrix is <class 'scipy.sparse.csr.csr_matrix'>


In [19]:
# Let's store a Series with the unique user id's that we have in the original data.
def get_unique_users(data: pd.DataFrame) -> pd.DataFrame:
    """Get unique users in training data.
    
    Args:
        data (pd.DataFrame):  listening history for the users.
        
    Returns:
        unique_users (pd.DataFrame): DataFrame of one column with unique users in training data.
    
    """
    return pd.DataFrame(np.unique(data.iloc[:, 0].values), columns=["users to recommend songs"])


unique_users_training_data = get_unique_users(data)
unique_users_training_data.head()

Unnamed: 0,users to recommend songs
0,0011d5f4fb02ff276763d385c3f2ded2b00ad94a
1,002511b392561fc1d426d875c386b356a6fc5702
2,002dfbc3c073b55a64a4abab34c0ca1f13897f1c
3,003998bc33cddeba02428a43391c6716e523c8f7
4,0042d2027dfa0340e31d2aa875c4be229730efb7


## Step 0.4: Record the indices of users
After having the users for the test set and the original ratings matrix, let's create some indices which we will use later on to filter data. Don't worry if it does not seem much intuitive in the beginning, you will further return to this and get a better grasp.

In [20]:
def get_indices_in_ratings_for_test_users_in_data(data: pd.DataFrame, test_users_in_data: pd.DataFrame) -> np.array:
    """Returns the index of the ratings matrix for the test users for which we have training data.
    
    Args:
        data (pd.DataFrame): DataFrame with the user for each user. 
                             The rows are tuples of (user, song_id, rating).
                             
        test_users_in_data (pd.DataFrame): DataFrame containing the list of test users for which we have training data.
        
    Returns:
        indices_ratings_tests_users_in_data (np.array): Indices of users in test set with training data for ratings matrix.
    
    """
    unique_users = get_unique_users(data).iloc[:, 0]
    indices_ratings_tests_users_in_data = unique_users[unique_users.isin(test_users_in_data.iloc[:, 0])].index.to_numpy()                                                                                       
    return indices_ratings_tests_users_in_data

indices_ratings_tests_users_in_data = get_indices_in_ratings_for_test_users_in_data(data, test_users_in_data)

In [21]:
len(indices_ratings_tests_users_in_data)

700

In [22]:
print(f"As expected, the length of the indices should be {len(indices_ratings_tests_users_in_data)}, matching the number of users in test set with training data.")

As expected, the length of the indices should be 700, matching the number of users in test set with training data.


## Step 1: Understand your data flow
In the end, we will be predicting over a test set of users, which may or may not have ratings available in our training data. For the ones in the test set for which we have data in the training, we may use personalized recommendation systems. Otherwise, we need to switch to non-personalized recommendations. 

Creating train and validation R matrices is not as simple as splitting the listening data into two dataframes and apply `make_ratings()` to each. This would create R matrices with **different sizes** and thus defined in **different vectorial spaces**. Having different vectorial spaces means that what a specific row represents in a matrix is not the same in the other matrix.

If the training data only has records for Ana, Beatriz and Carlos, the order of users on the training R matrix could be $U_t = \{Ana, Beatriz, Carlos\}$. If the validation data only has records for Ana, Xavier, Yolanda and Zita, the order of users on the validation R matrix could be $U_v = \{Ana, Xavier, Yolanda, Zita\}$ but **not ** $U_v = \{Ana, Beatriz, Carlos\}$. In this case $U_t$ cannot be equal to $U_v$.

The user represented by the x$^{th}$ row may not be the same on the original, training and validation data. **We are not comparing apples with apples!!!** Even if we manage to produce results, these would be **meaningless and wrong**. The same applies for items.

To solve this issue, instead of splitting the historical data, the rating values are replaced with zero on the appropriated indices before creating the R matrices. The training ratings are replaced by zero in the validation R matrix and the validations ratings are replaced by zero on the training R matrix. This makes sure that all users and items are present on all R matrices, resulting in equal vectorial spaces.

Extra Exercise: Split the original data into train and validation with `train_test_split()` and apply `make_rating()` to each. Verify that the resulting matrices have different sizes.

In [23]:
# Percentage of listening history used for validation.
test_size = 0.2

def make_train_val_split(data: pd.DataFrame, test_size : float = 0.2):
    """Split the data into train and validation and returns the ratings matrixes accordingly.
    
    Args:
        data (pd.DataFrame): Listening history for the users.
        test_size (float): Percentage of listening history used for validation.
    
    Returns:
        ratings_train (csr_matrix): Ratings matrix for train.
        ratings_val (csr_matrix): Ratings matrix for validation.
    
    """
    train_data, val_data = train_test_split(data, test_size=test_size, random_state=8)

    #Store the indexes of each observation to identify which records to replace with zero
    train_index = train_data.index
    val_index = val_data.index

    #make copies of data to replace the observations
    train_data_clean = data.copy()
    val_data_clean = data.copy()

    #Replace the validation observations on the training data
    train_data_clean.loc[val_index,["rating"]] = 0
    
    #Replace the training observations on the validation data
    val_data_clean.loc[train_index,["rating"]] = 0

    #Create the R matrices
    R_train = make_ratings(train_data_clean)
    R_val = make_ratings(val_data_clean)

    #remove the explicit zeros from the sparse matrices
    R_train.eliminate_zeros()
    R_val.eliminate_zeros()

    return R_train, R_val

ratings_train, ratings_val = make_train_val_split(data, test_size=test_size)

In [24]:
# After the train/validation split, let's compare the number of ratings available in each matrix.
print(f"After the split we have {ratings_train.nnz:,} ratings in the train set and {ratings_val.nnz:,} ratings in the validation set.")

After the split we have 80,000 ratings in the train set and 20,000 ratings in the validation set.


## Step 2: Non-Personalized
Let's build our baseline using one of the the Non-Personalized techniques we learned in BLU10 - "Popular Items". This basically fetches the items which have more ratings overall (we don't care if it is good or bad). Is this how it should be?

<img src="./media/fry_good_bad.jpg" alt="Confused Fry" width="300"/>


Probably not. But it is worth a shot!

In [25]:
def get_most_rated(ratings: csr_matrix, n: int) -> np.matrix:
    """Returns the n most rated items in a ratings matrix.
    
    Args:
        ratings (csr_matrix): A sparse ratings matrix
        n (int): The number of top-n items we should retrieve.
        
    Returns:
        most_rated (np.matrix): An array of the most rated items.
    
    """
    def is_rating(R_: csr_matrix) -> csr_matrix:
        """Returns a sparse matrix of booleans 
        
        Args:
            R_ (csr_matrix): A sparse ratings matrix
            
        Returns:
            is_rating (csr_matrix): A sparse matrix of booleans.
        """
        return R_ > 0
    
    def count_ratings(R_: csr_matrix) -> np.matrix:
        """Returns an array with the count of ratings
        
        The attribute ".A1" of a numpy matrix returns self as a flattened ndarray.
        
        Args:
            R_ (csr_matrix): A sparse matrix of booleans
        
        Returns:
            count_ratings (np.darray): Count of ratings by item
        """
        return R_.sum(axis=0).A1
    
    ratings_ = is_rating(ratings)
    ratings_ = count_ratings(ratings_)
    return np.negative(ratings_).argsort()[:n]


non_pers_most_rated = get_most_rated(ratings_train, 100)
non_pers_most_rated

array([ 9622,  1394,  2648,  1557, 30752, 23122,  7282, 18742,  5794,
       33598,  9217, 33416, 32150, 25953, 25737, 26012, 13142, 35433,
        2667, 17537,  5588,  4849, 34014, 23589, 32656, 19467, 37102,
        9121,  2616, 29486,  4153, 32168, 30860, 33143, 18055, 26211,
       35544, 35539,   221, 11151, 19105, 33009, 22051, 35892, 22119,
       35069, 32651, 25916, 31027, 22055, 25784,   930, 19464,  5555,
       24856, 22485, 20368, 19794, 35690, 22428, 14716, 32537, 37772,
       38099, 36319, 40907, 18788, 37604, 31121, 13480, 22565, 28416,
       37667, 30413, 10324, 18776, 37450, 32111, 38643,  1739,  9581,
       33098,  6983, 15124, 29939, 11471,  2062, 26777,  2707, 39865,
       36734, 26126, 17570, 20094, 40950,  5286, 11184, 29148, 23803,
        7891])

In [26]:
def convert_non_pers_recommendations_to_df(non_pers_recs: np.array, users_to_pred: pd.DataFrame) -> pd.DataFrame:
    """
    Converts the non-personalized most rated to an DataFrame with the users and the recommendations.
    We will basically repeat the non_pers_recs array for the number of users in need.
    
    Args:
        non_pers_recs (np.array): Array of indices for the best non-personalized items to recommend.
        users_to_pred (pd.DataFrame): DataFrame containing the users which need recommendations.
        
    Returns:
        non_pers_most_rated_matrix (np.array): Two dimensional array of (n_users, top_n_items)
    
    """
    non_pers_df = pd.DataFrame(np.zeros((len(users_to_pred), 1), dtype=non_pers_recs.dtype) + non_pers_recs)
    non_pers_df = pd.concat([users_to_pred, non_pers_df], axis=1)
    non_pers_df = non_pers_df.set_index("users to recommend songs")
    
    return non_pers_df


non_pers_most_rated_df = convert_non_pers_recommendations_to_df(non_pers_most_rated, unique_users_training_data)
non_pers_most_rated_df.head()

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
users to recommend songs,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0011d5f4fb02ff276763d385c3f2ded2b00ad94a,9622,1394,2648,1557,30752,23122,7282,18742,5794,33598,...,36734,26126,17570,20094,40950,5286,11184,29148,23803,7891
002511b392561fc1d426d875c386b356a6fc5702,9622,1394,2648,1557,30752,23122,7282,18742,5794,33598,...,36734,26126,17570,20094,40950,5286,11184,29148,23803,7891
002dfbc3c073b55a64a4abab34c0ca1f13897f1c,9622,1394,2648,1557,30752,23122,7282,18742,5794,33598,...,36734,26126,17570,20094,40950,5286,11184,29148,23803,7891
003998bc33cddeba02428a43391c6716e523c8f7,9622,1394,2648,1557,30752,23122,7282,18742,5794,33598,...,36734,26126,17570,20094,40950,5286,11184,29148,23803,7891
0042d2027dfa0340e31d2aa875c4be229730efb7,9622,1394,2648,1557,30752,23122,7282,18742,5794,33598,...,36734,26126,17570,20094,40950,5286,11184,29148,23803,7891


In [27]:
def create_dict_preds(preds_df: pd.DataFrame) -> dict:
    """Convert the predictions DataFrame (index:users -> columns: items) to a dictionary of key (user->list of items).
    
    Args: 
        preds_df (pd.DataFrame): DataFrame containing the users and the ordered predictions.
        
    Returns:
        preds_dict (dict): Dict of (user_id: list of items) used for evaluating the performance.
    
    """
    return {preds_df.index[i]: preds_df.values[i].tolist() for i in range(len(preds_df))}


non_pers_dict = create_dict_preds(non_pers_most_rated_df)
# Since dicts in python are not ordered, we need to HAMMER DOWN a way to print some values.
dict(list(non_pers_dict.items())[0:1])

{'0011d5f4fb02ff276763d385c3f2ded2b00ad94a': [9622,
  1394,
  2648,
  1557,
  30752,
  23122,
  7282,
  18742,
  5794,
  33598,
  9217,
  33416,
  32150,
  25953,
  25737,
  26012,
  13142,
  35433,
  2667,
  17537,
  5588,
  4849,
  34014,
  23589,
  32656,
  19467,
  37102,
  9121,
  2616,
  29486,
  4153,
  32168,
  30860,
  33143,
  18055,
  26211,
  35544,
  35539,
  221,
  11151,
  19105,
  33009,
  22051,
  35892,
  22119,
  35069,
  32651,
  25916,
  31027,
  22055,
  25784,
  930,
  19464,
  5555,
  24856,
  22485,
  20368,
  19794,
  35690,
  22428,
  14716,
  32537,
  37772,
  38099,
  36319,
  40907,
  18788,
  37604,
  31121,
  13480,
  22565,
  28416,
  37667,
  30413,
  10324,
  18776,
  37450,
  32111,
  38643,
  1739,
  9581,
  33098,
  6983,
  15124,
  29939,
  11471,
  2062,
  26777,
  2707,
  39865,
  36734,
  26126,
  17570,
  20094,
  40950,
  5286,
  11184,
  29148,
  23803,
  7891]}

### Step 2.1 - Creating the Ground Truth
Since we are splitting the data into train and validation, we will test the predictions we get with the training data in the validation set. To do this, we need to determine what is the ground truth (the actual outcomes) of the recommendations for the validation set. 

The main idea is to pick up a sorted array for each user on the songs with highest value in the validaton ratings matrix.

In [28]:
unique_users_training_data

Unnamed: 0,users to recommend songs
0,0011d5f4fb02ff276763d385c3f2ded2b00ad94a
1,002511b392561fc1d426d875c386b356a6fc5702
2,002dfbc3c073b55a64a4abab34c0ca1f13897f1c
3,003998bc33cddeba02428a43391c6716e523c8f7
4,0042d2027dfa0340e31d2aa875c4be229730efb7
...,...
7521,ffd41f0f4c56e011d86a5005439f3468fd29d1d9
7522,ffdfbc60afdcdcb630d3b667ca3a083b09ed6212
7523,ffdfc7f9864ee172c3488707969c90e7b1ac4dc7
7524,ffe2811be1a471ea1d30fd646d815d272aef7d4d


In [29]:
def get_y_true(R_val_: csr_matrix, users_to_pred: pd.DataFrame, n=100):
    """Get the ground truth (best recommendations) of the users in the validation set.
    
    Args:
        R_val_ (csr_matrix): Validation set ratings matrix.
        users_to_pred: 
        n (int): Number of top-n items.
        
    Returns:
        y_true_df (pd.DataFrame): DataFrame which returns the y_true items.
        
    """
    top_from_R_val = pd.DataFrame(np.negative(R_val_).toarray().argsort()[:, :n])
    y_true_df = pd.concat([users_to_pred, top_from_R_val], axis=1)
    y_true_df = y_true_df.set_index("users to recommend songs")
    return y_true_df


y_true_df = get_y_true(ratings_val, unique_users_training_data, n=100)
y_true_df.head()

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
users to recommend songs,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0011d5f4fb02ff276763d385c3f2ded2b00ad94a,38805,15053,1447,11133,18776,27461,27462,27463,27464,27465,...,27387,27389,27390,27391,27392,27393,27394,27395,27388,27396
002511b392561fc1d426d875c386b356a6fc5702,17243,37411,27458,27459,27460,27461,27462,27463,27457,27464,...,27384,27386,27387,27388,27389,27390,27391,27392,27385,27393
002dfbc3c073b55a64a4abab34c0ca1f13897f1c,37456,1003,2208,30222,1945,3975,27461,27462,27463,27464,...,27385,27386,27379,27387,27389,27390,27391,27392,27393,27394
003998bc33cddeba02428a43391c6716e523c8f7,38260,20340,38279,0,27457,27458,27459,27460,27461,27462,...,27385,27386,27387,27388,27389,27390,27391,27392,27377,27376
0042d2027dfa0340e31d2aa875c4be229730efb7,0,27457,27458,27459,27460,27461,27462,27463,27456,27464,...,27386,27387,27388,27389,27390,27391,27392,27385,27393,27375


In [30]:
# Create the dictionary with the ground truth.
y_true_dict = create_dict_preds(y_true_df)

## Evaluate
We will use the Mean Average Precision @ K to evaluate our predictions. We will discuss this metric and more on the second learning notebook.

In [31]:
evaluate(y_true_dict, non_pers_dict)

0.0003010999155361327

## Predict 
To submit your predictions, you just need to convert your recommendations to the format we have in the `example_output.csv` file.

In [32]:
# Join both dataframes with user_id's
all_test_users = pd.concat([test_users_in_data, test_users_not_in_data]).reset_index(drop=True)

In [33]:
non_pers_test_most_rated_df = convert_non_pers_recommendations_to_df(non_pers_most_rated, all_test_users)
print(non_pers_test_most_rated_df.shape)
non_pers_test_most_rated_df.head()

(1000, 100)


Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
users to recommend songs,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
56d985c92960b98ad76a48b10a062b0cd86795bf,9622,1394,2648,1557,30752,23122,7282,18742,5794,33598,...,36734,26126,17570,20094,40950,5286,11184,29148,23803,7891
991411f0dca94f348c7bd3eae93b6e6c061605f1,9622,1394,2648,1557,30752,23122,7282,18742,5794,33598,...,36734,26126,17570,20094,40950,5286,11184,29148,23803,7891
323fbb28144eefa3eabfa22bd310dfb0713de80d,9622,1394,2648,1557,30752,23122,7282,18742,5794,33598,...,36734,26126,17570,20094,40950,5286,11184,29148,23803,7891
55c750f0951ca1021b26c0e758660bb8a2c49d3a,9622,1394,2648,1557,30752,23122,7282,18742,5794,33598,...,36734,26126,17570,20094,40950,5286,11184,29148,23803,7891
b458e3d697276a93aa6926caf1ff08e875933940,9622,1394,2648,1557,30752,23122,7282,18742,5794,33598,...,36734,26126,17570,20094,40950,5286,11184,29148,23803,7891


In [34]:
def save_predictions(predictions: pd.DataFrame, output_path: str):
    """Save predictions to csv.
    
    Saves the predictions into a csv file with the format we need.
    We keep the index since it contains the user ids.
    
    Args:
        predictions (pd.DataFrame): DataFrame with user_id as index and ordered recommendations in the columns.
        output_path (str): Filepath for the predictions file.
    
    """
    predictions.to_csv(output_path, header=None)
    print(f"Saved to csv in '{output_path}'.")
    
    
save_predictions(non_pers_test_most_rated_df, os.path.join("data", "test_non_personalized_recommendations.csv"))

Saved to csv in 'data/test_non_personalized_recommendations.csv'.


In [35]:
# Filter the non-personalized recommendations for the users without training data and save
new_users_recommendations = non_pers_most_rated_df.iloc[index_users_not_in_data]
save_predictions(new_users_recommendations, os.path.join("data", "new_users_non_personalized.csv"))

Saved to csv in 'data/new_users_non_personalized.csv'.


## Step 3: Personalized


### Step 3.1 - Collaborative filtering


In [36]:
def make_user_similarities(R_: csr_matrix) -> csr_matrix:
    """Creates the user similarities matrix.
    
    Args:
        R_ (csr_matrix): Ratings matrix.
        
    Returns:
        user_similarities (csr_matrix): Matrix with user similarities.
    
    """
    return cosine_similarity(R_, dense_output=False)


user_similarities = make_user_similarities(ratings_train)
user_similarities

<7526x7526 sparse matrix of type '<class 'numpy.float64'>'
	with 1078541 stored elements in Compressed Sparse Row format>

In [37]:
def make_user_predictions_collab_filt(S: csr_matrix, R_: csr_matrix):
    """Predict using collaborative filtering.
    
    Args:
        S (csr_matrix): Similarities matrix (tipically using the cosine_similarity).
        R_ (csr_matrix): Ratings matrix.
        
    Returns:
        preds (csr_matrix): Predictions matrix.
    
    """
    weighted_sum = np.dot(S, R_)
    
    # We use the absolute value to support negative similarities.
    # In this particular example there are none.
    sum_of_weights = np.abs(S).sum(axis=1)
    
    preds = weighted_sum / sum_of_weights
    
    # Exclude previously rated items.
    preds[R_.nonzero()] = 0
    
    return csr_matrix(preds)
 

collab_filt_user_preds = make_user_predictions_collab_filt(user_similarities, ratings_train)
collab_filt_user_preds

<7526x41194 sparse matrix of type '<class 'numpy.float64'>'
	with 9269874 stored elements in Compressed Sparse Row format>

In [38]:
def sparsity(matrix: csr_matrix) -> float:
    """Calculates the sparsity of a matrix.
    
    Args:
        matrix (csr_matrix): Sparse matrix.
        
    Returns:
        sparsity_ (float): Sparsity percentage (between 0 and 1).
    
    """
    return 1 - matrix.nnz / (matrix.shape[0] * matrix.shape[1])


sparsity(collab_filt_user_preds)

0.9700996926567885

In [39]:
def get_most_rated_from_user_preds(user_preds_: csr_matrix, n: int) -> np.matrix:
    """Returns the n most rated items from the user predictions.
    
    Args:
        user_preds_ (csr_matrix): A sparse ratings matrix
        n (int): The number of top-n items we should retrieve.
        
    Returns:
        most_rated (np.matrix): An array of the most rated items.
    
    """
    pred_ = np.negative(user_preds_).toarray()
    return pred_.argsort()[:, :n]


collab_filt_most_rated = get_most_rated_from_user_preds(collab_filt_user_preds, 100)
print(collab_filt_most_rated.shape)
collab_filt_most_rated

(7526, 100)


array([[36432, 29939,  2648, ..., 21687,  9121,  7691],
       [31901, 25323, 14131, ..., 22554, 40790, 33750],
       [ 9966, 34014, 31046, ...,  7085,  2377,  1201],
       ...,
       [12171,  2936,  1436, ..., 27480, 27454, 27482],
       [ 6234,  1557,  1394, ...,  7598, 15159, 29817],
       [31509, 35670, 16451, ..., 14524, 36116,  3070]])

In [40]:
def convert_pers_recommendations_to_df(pers_recs: np.array, users_to_pred: pd.DataFrame) -> pd.DataFrame:
    """Converts the personalized most rated to an DataFrame with the users and the recommendations.
    
    Args:
        pers_recs (np.array): Array of indices for the best personalized items to recommend.
        users_to_pred (pd.DataFrame): DataFrame containing the users which need recommendations.
        
    Returns:
        non_pers_most_rated_matrix (np.array): Two dimensional array of (n_users, top_n_items)
    
    """
    pers_df = pd.concat([users_to_pred, pd.DataFrame(pers_recs)], axis=1)
    pers_df = pers_df.set_index("users to recommend songs")
    
    return pers_df


collab_filt_most_rated_df = convert_pers_recommendations_to_df(collab_filt_most_rated, unique_users_training_data)
collab_filt_most_rated_df.head()

Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
users to recommend songs,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0011d5f4fb02ff276763d385c3f2ded2b00ad94a,36432,29939,2648,20368,38277,12890,33331,22565,35406,10324,...,22571,25953,11151,31756,32530,13733,4204,21687,9121,7691
002511b392561fc1d426d875c386b356a6fc5702,31901,25323,14131,34763,16020,38354,27305,5232,37434,9191,...,30647,3367,15182,26669,17383,28997,4761,22554,40790,33750
002dfbc3c073b55a64a4abab34c0ca1f13897f1c,9966,34014,31046,9622,2208,7282,7598,23122,10269,33143,...,10324,21352,35592,26830,14134,23803,29486,7085,2377,1201
003998bc33cddeba02428a43391c6716e523c8f7,1612,1232,4276,33318,40377,25964,40950,991,26411,1658,...,38044,28931,15987,25147,37859,37842,8812,10654,5832,13737
0042d2027dfa0340e31d2aa875c4be229730efb7,4415,36171,29494,20028,31334,39976,33098,18742,35834,24048,...,2377,15114,18584,4840,12049,5188,24742,11238,30810,38725


In [41]:
collab_filt_dict = create_dict_preds(collab_filt_most_rated_df)
# Since dicts in python are not ordered, we need to HAMMER DOWN a way to print some values.
dict(list(collab_filt_dict.items())[0:1])

{'0011d5f4fb02ff276763d385c3f2ded2b00ad94a': [36432,
  29939,
  2648,
  20368,
  38277,
  12890,
  33331,
  22565,
  35406,
  10324,
  11658,
  33098,
  16428,
  9622,
  36319,
  30752,
  30933,
  5588,
  35539,
  32656,
  13363,
  13754,
  32684,
  33159,
  14705,
  33416,
  33110,
  5167,
  29015,
  40597,
  34495,
  17659,
  1394,
  11357,
  26445,
  9614,
  9675,
  33598,
  16044,
  17944,
  3592,
  13358,
  3669,
  6122,
  24229,
  34014,
  23944,
  34087,
  27882,
  14765,
  23589,
  26019,
  10352,
  29468,
  41019,
  24805,
  25916,
  39983,
  24965,
  33505,
  30860,
  18776,
  28267,
  7282,
  3288,
  15053,
  32537,
  1557,
  36116,
  34569,
  2616,
  25590,
  28926,
  25737,
  26261,
  24874,
  18506,
  29565,
  35967,
  23122,
  32168,
  38718,
  38805,
  15296,
  22051,
  28334,
  30247,
  23139,
  19105,
  14553,
  22571,
  25953,
  11151,
  31756,
  32530,
  13733,
  4204,
  21687,
  9121,
  7691]}

## Evaluate
We use again the Mean Average Precision @ K to evaluate our predictions.

In [42]:
evaluate(y_true_dict, collab_filt_dict)

0.032606521925826186

## Predict 
To submit your predictions, you just need to convert your recommendations to the format we have in the `example_output.csv` file.

In [43]:
# Filter the collaborative filtering most rated DataFrame  using the test_users_in_data mask.
collab_filt_most_rated_in_data_df = collab_filt_most_rated_df.loc[test_users_in_data.iloc[:, 0].to_list()]
print(collab_filt_most_rated_in_data_df.shape)
collab_filt_most_rated_in_data_df.head()

(700, 100)


Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
users to recommend songs,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
56d985c92960b98ad76a48b10a062b0cd86795bf,2026,25667,718,32147,21996,6368,7506,21443,36202,25025,...,15451,8633,27503,27494,27464,27495,27467,27496,27505,27497
991411f0dca94f348c7bd3eae93b6e6c061605f1,29148,39361,37581,34461,22587,25388,21873,28426,40616,33996,...,10301,5081,10940,9622,33143,2502,16643,18861,3910,23573
323fbb28144eefa3eabfa22bd310dfb0713de80d,25642,3209,3831,5546,2970,39526,25026,8076,30604,6306,...,24219,3859,23391,19149,11630,30589,40632,16127,18602,29735
55c750f0951ca1021b26c0e758660bb8a2c49d3a,16677,1394,7406,20797,5588,18863,8312,18776,15727,31040,...,2941,4845,11382,2605,29474,22037,1687,33573,13917,1201
b458e3d697276a93aa6926caf1ff08e875933940,2648,9622,7282,33143,38296,34014,2062,35539,23122,1557,...,25228,13537,31645,40923,25123,27869,36863,1119,3673,28077


In [44]:
print(new_users_recommendations.shape)
new_users_recommendations.head()

(300, 100)


Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
users to recommend songs,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
002511b392561fc1d426d875c386b356a6fc5702,9622,1394,2648,1557,30752,23122,7282,18742,5794,33598,...,36734,26126,17570,20094,40950,5286,11184,29148,23803,7891
003998bc33cddeba02428a43391c6716e523c8f7,9622,1394,2648,1557,30752,23122,7282,18742,5794,33598,...,36734,26126,17570,20094,40950,5286,11184,29148,23803,7891
005b1fab38cdeb9d5bb97debcf73b44050994a3e,9622,1394,2648,1557,30752,23122,7282,18742,5794,33598,...,36734,26126,17570,20094,40950,5286,11184,29148,23803,7891
005cc5d858319f13f88228f62341e5a4270f8e75,9622,1394,2648,1557,30752,23122,7282,18742,5794,33598,...,36734,26126,17570,20094,40950,5286,11184,29148,23803,7891
006d3c79b9ed677280f8ddbc422d7b0fedd6d1fa,9622,1394,2648,1557,30752,23122,7282,18742,5794,33598,...,36734,26126,17570,20094,40950,5286,11184,29148,23803,7891


In [45]:
collab_filt_most_rated_test_df = pd.concat([collab_filt_most_rated_in_data_df, new_users_recommendations])
print(collab_filt_most_rated_test_df.shape)

(1000, 100)


In [46]:
# Save the collaborative filtering recommendations 
save_predictions(collab_filt_most_rated_test_df, os.path.join("data", "collab_filt_recommendations.csv"))

Saved to csv in 'data/collab_filt_recommendations.csv'.


### Step 3.2 - Content-based recommendations

In [47]:
# If you had a FileNotFoundError, make sure you have unziped the `song_tag.zip` file.
def read_tags() -> pd.DataFrame:
    """Import the song tags file.
    
    Returns:
        tags (pd.DataFrame): DataFrame with song_index, tag_index and value of the tag in the song.
        
    """
    path = os.path.join('data', 'song_tag.csv')
    data = pd.read_csv(path)
    return data


tags = read_tags()
print(tags.shape)
tags.head()

(4909471, 3)


Unnamed: 0,song_index,tag_index,val
0,254229,206,100.0
1,254229,1125,66.0
2,254229,582,66.0
3,254229,914,33.0
4,254229,95,33.0


In [48]:
def read_songs() -> pd.DataFrame:
    """Import the songs file.
    
    Returns:
        songs (pd.DataFrame): DataFrame containing the song_id and the corresponding song_index.
    
    """
    path = os.path.join('data', 'songs.txt')
    data = pd.read_csv(path, names=['song_id', 'song_index'], sep=' ')
    return data


songs= read_songs()
print(songs.shape)
songs.head()

(386213, 2)


Unnamed: 0,song_id,song_index
0,SOAAADD12AB018A9DD,1
1,SOAAADE12A6D4F80CC,2
2,SOAAADF12A8C13DF62,3
3,SOAAADZ12A8C1334FB,4
4,SOAAAFI12A6D4F9C66,5


In [49]:
def merge_tags_songs(tags: pd.DataFrame, songs: pd.DataFrame, data: pd.DataFrame) -> pd.DataFrame:
    """Join the tags, songs and ratings data sources into a single dataframe).
    
    Args:
        tags (pd.DataFram): DataFrame with song_index, tag_index and value of the tag in the song.
        songs (pd.DataFrame): DataFrame containing the song_id and the corresponding song_index.
        data (pd.DataFrame): Listening history for the users.

    Returns:
        tags_cleaned (pd.DataFrame): Matches the song_id to the tag_index with a corresponding value.
    
    """
    tags_cleaned = (tags.merge(songs, how='left', on='song_index')
                        .merge(data, how='right', on='song_id')
                   )[['song_id', 'tag_index', 'val']]
    return tags_cleaned

tags_cleaned = merge_tags_songs(tags, songs, data)
print(tags_cleaned.shape)
tags_cleaned.head()

(4773486, 3)


Unnamed: 0,song_id,tag_index,val
0,SOBONKR12A58A7A7E0,527.0,100.0
1,SOBONKR12A58A7A7E0,384.0,28.0
2,SOBONKR12A58A7A7E0,98.0,28.0
3,SOBONKR12A58A7A7E0,4070.0,14.0
4,SOBONKR12A58A7A7E0,139705.0,14.0


In [50]:
# Get the number of unique song_id's for which we have tags. 
tags_cleaned.song_id.nunique()

41194

### One important note about the matrix dimensions. 

When creating the item profiles matrix we have to be sure that the set of all items is the same as with the rating matrix. If not, we will have a mismatch dimensions errors when calculating the dot product between these matrices.

In `make_item_profiles` below we are also instantiating `csr_matrix` using `ndarray`s instead of using `pd.DataFrame.pivot()`.

In [51]:
def make_item_profiles(data: pd.DataFrame) -> csr_matrix:
    """Creates the item profiles matrix.
    
    Args:
        data (pd.DataFrame): DataFrame containing the (rows, columns, values) for the ratings matrix.
                             In this case, we have (song_id, tag_index, value).
    
    Returns:
        item_profiles (csr_matrix): Item profiles matrix. Items as rows, tags as columns.
    
    """
    items, item_pos = np.unique(data.iloc[:, 0].values, return_inverse=True)
    tags, tag_pos = np.unique(data.iloc[:, 1].values, return_inverse=True)
    values = data.iloc[:, 2].fillna(0).values
    
    shape = (len(items), len(tags))

    item_profiles_ = csr_matrix((values, (item_pos, tag_pos)), shape=shape)
    return item_profiles_


item_profiles = make_item_profiles(tags_cleaned)
item_profiles

<41194x219644 sparse matrix of type '<class 'numpy.float64'>'
	with 1364190 stored elements in Compressed Sparse Row format>

In [52]:
def make_user_profiles(R_: csr_matrix, item_profiles_: csr_matrix) -> csr_matrix:
    """Calculate the user profiles with the items.
    
    Args:
        R_ (csr_matrix): Ratings matrix with shape (n_users, n_items).
        item_profiles_ (csr_matrix): Item profiles matrix with shape (n_items, n_tags).
        
    Returns:
        user_profiles (csr_matrix): User profiles considering the ratings and the item profiles, with shape (n_users, n_tags).
        
    """
    return np.dot(R_, item_profiles_)


user_profiles = make_user_profiles(ratings_train, item_profiles)
user_profiles

<7526x219644 sparse matrix of type '<class 'numpy.float64'>'
	with 2367971 stored elements in Compressed Sparse Row format>

In [53]:
def make_user_predictions_content(R_: csr_matrix, item_profiles_: csr_matrix, user_profiles_: csr_matrix) -> csr_matrix:
    """Produces content-based predictions.
    
    Args:
        R_ (csr_matrix): Ratings matrix with shape (n_users, n_items).
        item_profiles_ (csr_matrix): Item profiles matrix with shape (n_items, n_tags).
        user_profiles (csr_matrix): User profiles considering the ratings and the item profiles, with shape (n_users, n_tags).
       
    Returns:
        preds (csr_matrix): Predictions of ratings using content-based recommendations.
    
    """
    
    # Since we are not looking at the values inbetween the cosine_similarity and excluding previously rated items,
    # we can define dense_output=False to speed up.
    preds = cosine_similarity(user_profiles_, item_profiles_, dense_output=False)
    
    # Exclude previously rated items.
    preds[R_.nonzero()] = 0
    
    return preds


content_user_preds = make_user_predictions_content(ratings_train, item_profiles, user_profiles)
content_user_preds

<7526x41194 sparse matrix of type '<class 'numpy.float64'>'
	with 196978327 stored elements in Compressed Sparse Row format>

In [54]:
content_most_rated = get_most_rated_from_user_preds(content_user_preds, 100)
print(content_most_rated.shape)
content_most_rated

(7526, 100)


array([[35690, 24965, 34087, ...,  3413, 14047,  2511],
       [ 8356, 15600, 39353, ..., 20593,   177,  1675],
       [ 6476, 26240, 10853, ...,  6975, 24582, 19480],
       ...,
       [18123, 21223,   927, ..., 16758,  6261, 26714],
       [36252, 28597, 26519, ..., 17901,  6420, 33103],
       [20789, 35155, 21884, ..., 28755,  2725, 21034]])

In [55]:
content_most_rated_df = convert_pers_recommendations_to_df(content_most_rated, unique_users_training_data)
print(content_most_rated_df.shape)
content_most_rated_df.head()

(7526, 100)


Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
users to recommend songs,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
0011d5f4fb02ff276763d385c3f2ded2b00ad94a,35690,24965,34087,23944,13480,38410,31367,34569,29945,36174,...,38487,33039,27148,33103,37403,21147,11256,3413,14047,2511
002511b392561fc1d426d875c386b356a6fc5702,8356,15600,39353,29727,30034,21308,30226,21438,28431,6927,...,31080,32449,18312,28623,3994,27088,14616,20593,177,1675
002dfbc3c073b55a64a4abab34c0ca1f13897f1c,6476,26240,10853,17405,7410,12433,5675,10697,38010,10736,...,32620,19071,13313,39480,31475,1728,1117,6975,24582,19480
003998bc33cddeba02428a43391c6716e523c8f7,33378,7547,23431,14071,25329,37100,12643,16221,26749,29214,...,25364,5933,27271,3862,40770,24278,601,24689,8124,7216
0042d2027dfa0340e31d2aa875c4be229730efb7,27998,14385,17913,71,23087,14150,5933,25804,1688,28418,...,39126,24336,7481,26239,11350,17257,39,4164,28301,1887


In [56]:
content_dict = create_dict_preds(content_most_rated_df)
# Since dicts in python are not ordered, we need to HAMMER DOWN a way to print some values.
dict(list(content_dict.items())[0:2])

{'0011d5f4fb02ff276763d385c3f2ded2b00ad94a': [35690,
  24965,
  34087,
  23944,
  13480,
  38410,
  31367,
  34569,
  29945,
  36174,
  21717,
  38012,
  19804,
  3986,
  2808,
  14500,
  35354,
  20386,
  32045,
  40980,
  17706,
  32334,
  35162,
  19044,
  28256,
  27361,
  16590,
  15360,
  7708,
  6638,
  33244,
  19454,
  24383,
  27938,
  20243,
  25348,
  19279,
  36696,
  15371,
  34301,
  6304,
  37704,
  20729,
  17358,
  10813,
  23155,
  14489,
  12859,
  3878,
  604,
  31271,
  16055,
  14250,
  40731,
  7113,
  29193,
  39219,
  28058,
  3417,
  24048,
  7775,
  29175,
  10498,
  32467,
  27824,
  2242,
  13924,
  29696,
  9815,
  30048,
  37433,
  35280,
  14264,
  19283,
  21859,
  26121,
  8052,
  30767,
  23932,
  28628,
  5460,
  31335,
  33689,
  19916,
  31934,
  31162,
  14816,
  37809,
  37890,
  14870,
  38487,
  33039,
  27148,
  33103,
  37403,
  21147,
  11256,
  3413,
  14047,
  2511],
 '002511b392561fc1d426d875c386b356a6fc5702': [8356,
  15600,
  39353,
  

## Evaluate
We use again the Mean Average Precision @ K to evaluate our predictions.

In [57]:
evaluate(y_true_dict, content_dict)

0.002588656051398923

## Predict 
To submit your predictions, you just need to convert your recommendations to the format we have in the `example_output.csv` file.

In [58]:
content_most_rated_in_data_df = content_most_rated_df.loc[test_users_in_data.iloc[:, 0].to_list()]
print(content_most_rated_in_data_df.shape)
content_most_rated_in_data_df.head()

(700, 100)


Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
users to recommend songs,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
56d985c92960b98ad76a48b10a062b0cd86795bf,29095,2103,36649,32897,29991,21615,30905,10538,18935,37860,...,10966,695,9570,36945,27256,38050,4370,12116,22061,34366
991411f0dca94f348c7bd3eae93b6e6c061605f1,23104,425,1327,26441,19066,36245,12399,9636,7576,33342,...,7492,30867,35926,25985,10990,37196,29828,12384,24975,4525
323fbb28144eefa3eabfa22bd310dfb0713de80d,22933,27806,25566,10989,13390,2183,15114,7810,10044,18374,...,5894,20955,10700,35559,33440,9757,19037,9969,8836,35624
55c750f0951ca1021b26c0e758660bb8a2c49d3a,29445,31259,8570,33038,31912,21069,8796,450,34996,11354,...,11625,3922,848,34890,17065,23661,900,31243,1460,38516
b458e3d697276a93aa6926caf1ff08e875933940,10853,26240,7410,12433,6476,17405,5675,10697,28187,16691,...,32708,36047,38078,24453,26334,35613,3482,19071,38753,21211


In [59]:
print(new_users_recommendations.shape)
new_users_recommendations.head()

(300, 100)


Unnamed: 0_level_0,0,1,2,3,4,5,6,7,8,9,...,90,91,92,93,94,95,96,97,98,99
users to recommend songs,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
002511b392561fc1d426d875c386b356a6fc5702,9622,1394,2648,1557,30752,23122,7282,18742,5794,33598,...,36734,26126,17570,20094,40950,5286,11184,29148,23803,7891
003998bc33cddeba02428a43391c6716e523c8f7,9622,1394,2648,1557,30752,23122,7282,18742,5794,33598,...,36734,26126,17570,20094,40950,5286,11184,29148,23803,7891
005b1fab38cdeb9d5bb97debcf73b44050994a3e,9622,1394,2648,1557,30752,23122,7282,18742,5794,33598,...,36734,26126,17570,20094,40950,5286,11184,29148,23803,7891
005cc5d858319f13f88228f62341e5a4270f8e75,9622,1394,2648,1557,30752,23122,7282,18742,5794,33598,...,36734,26126,17570,20094,40950,5286,11184,29148,23803,7891
006d3c79b9ed677280f8ddbc422d7b0fedd6d1fa,9622,1394,2648,1557,30752,23122,7282,18742,5794,33598,...,36734,26126,17570,20094,40950,5286,11184,29148,23803,7891


In [60]:
content_most_rated_test_df = pd.concat([content_most_rated_in_data_df, new_users_recommendations])
print(content_most_rated_test_df.shape)

(1000, 100)


In [61]:
# Save the content-based filtering recommendations 
save_predictions(content_most_rated_test_df, os.path.join("data", "content_recommendations.csv"))

Saved to csv in 'data/content_recommendations.csv'.


# Conclusions

We have used non-personalized, collaborative and content-based recommenders to provide 100 recommendations to a set of test users. We evaluated the recommendations on validation data obtained by splitting the historical data. The mAP@100 results were:

- Non-personalized recommender: 0.0003010999155361327

- Collaborative recommender: 0.032606521925826186

- Content-based recommender: 0.002588656051398923

We will discuss mAP on the second learning notebook but what matters for now is that the larger the mAP the better the recommendations.
We can see that the collaborative recommender performed best, followed by the content-based and then non-personalized. In this case it appears that similarity between between users' listening history is a better predictor of users' preferences than the genre of music that they listen. Some additional song metadata (artist, release date, language, ...) could be beneficial for the content-based recommender.