# Recommender System Final Project

**The main purpose of this notebook is to understand how incorporating time awareness and adding various audio features can improve the accuracy of the recommender system that was developed to help Deezer in the first part of the project**. On one hand, intuitively, the time of day or the day of the week should affect the user's music selection. For example, if a user works out in the morning, then a genre like EDM with a higher tempo might be the best option. On a Sunday morning spent reading a novel, however, a less energetic genre might be more suitable. On the other hand, being able to differentiate songs by more audio features should, in principle, help the recommender system discriminate better between the preferences of each user. **Finally, to get a closer understanding of the performance of our recommender system, we incorporated our close friends' music preferences into the data set and created recommendations for them. We then discussed how satisfied they were with the recommendations**.

**The notebook consists of the following sections** (each with the aim of answering a key question)
* **_Introduction_**: will time awareness and audio features improve our recommendations?
* **_Approach_**: how can we incorporate time awareness and audio features into our recommendations?
* **_Validation_**: how were the hyperparameters tuned? what did our friends think about our recommendations?
* **_Conclusions_**: what are the shortcomings of the approach?

_Note: the previous "project report" complements this notebook. We no longer discuss in detail the problem context or the previous models that were developed. Please re-read its key sections for a refresher on those topics._

---

## Do Imports & Load Data

In [1]:
# Managing relative imports
import os
import sys
module_path = os.path.abspath(os.path.join('..'))
if module_path not in sys.path:
    sys.path.append(module_path)

In [2]:
import pandas as pd
import numpy as np
from scipy.stats import chi2_contingency
from utils.Pipeline import Pipeline
from utils.Vectorizer import Vectorizer
from models.cross_features.CrossFeaturesModel import CrossFeaturesModel
from models.tars.context_lf import ContextLF
from models.cross_validation import ModelTester
import utils.loss_functions as lf


In [3]:
%matplotlib inline

In [4]:
spotify_file = "SpotifyAudioFeatures_clean.csv"
data_file = "train.csv"
# df = pd.read_csv(data_file, nrows=100)  # to check headers
pipe = Pipeline(deezer_path=data_file, spotify_path=spotify_file)
df = pipe.make_selected()
chi_df = pipe.make()

>>> Shape: (1646569, 56)
>>> Unique users: 19918
>>> Unique songs: 4924
>>> Shape: (1646569, 56)
>>> Unique users: 19918
>>> Unique songs: 4924
Running time: 103 seconds


---

## Introduction: will time awareness and audio features improve our recommendations?

**We believe that both incorporating time awareness and discriminating by audio features will improve the recommendations of our model**. Beyond the _theoretical_ justifications of why this makes sense, **the data supports our intuition**. As shown below, the vast majority of **the users vary the songs** that they listen to **throughout the day**. Moreover, **some audio features** like "energy", "danceability" and "tempo" **are more prevalent in certain time periods**.

### User time behaviour analysis

**The goal of this section is to show**, in a data-driven manner, **that the majority of the users vary the songs that they listen to over time**, which supports the idea that incorporating time awareness in our model is important. To frame the discussion, we first justify the notion of time used in the analysis and explain how we track listening behavior for each user.

#### Time splits rationale

**In our analysis we split time into 6 buckets.** First we separated the different hours of day in the model into morning, afternoon and night. Then we separated the different days of the week into weekdays and weekends. **The first three splits of the day are based on the following time of day frequency analysis**

<img src= "./images/hour_freq.png ">

From the above plot it can be seen that activity peaks at two times of the day (hence creating three natural splits). **Beyond the data analysis, there is also a domain knowledge justification for these time splits**. In France (as in most western countries), the work schedule separates the day into three parts. The first encompasses the time period (5AM - 9AM) in which users do their pre-work activities. The second covers their work time (9AM - 5PM) and the last is after work (>5PM). Based on this, it makes sense that the _type_ of music that a user listens to is related to the time of day. We are going to show the activity of user 3466 to illustrate these nuances.

In [5]:
pipe.user_song(user_id=3466, songs=[135010092, 130700214, 70079770]).head()

Unnamed: 0,user_id,media_id,moment_of_day,hour_listen,spotify_name,spotify_artist,energy,danceability,tempo,valence,times_listened_in_month
0,3466,70079770,late_night,1,All of Me,John Legend,0.264,0.422,119.93,0.331,4
1,3466,130700214,afternoon_evening,18,C'est plus l'heure,Franglish,0.648,0.754,106.965,0.772,2
2,3466,130700214,afternoon_evening,19,C'est plus l'heure,Franglish,0.648,0.754,106.965,0.772,1
3,3466,135010092,morning,9,Makila,Kalash Criminel,0.868,0.825,129.99,0.661,4
4,3466,135010092,morning,11,Makila,Kalash Criminel,0.868,0.825,129.99,0.661,1


**The table above contains a summary of the top 3 songs played by this user through the month**. As can be seen, **this user consistently played "Makila" by Kalash Criminel to start the day, between 9 and 11 AM**. **This song contains strong rap lyrics** and, as the audio features show, it has the highest energy of the three songs displayed. Energy is measured on a scale from 0 to 1 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. However, **in the evening, around 6-7 PM, this user played "C'est plus l'heure", which has a relaxed vibe to it** and is therefore more suitable for a tranquil time of the day. Finally, **on a more sentimental note, the user played the famous "All of Me" by John Legend only late at night, around 1 AM**. Its very low energy is the opposite of the morning song's.

**The open question is whether this user is an outlier in the data, or whether the majority of the users behave like this**. We will argue below that this behavior covers the majority of the users. First, however, we illustrate why a weekday and weekend split of time also makes sense according to our data set. For this we look at the behavior of user 7.

In [6]:
pipe.user_day(user_id=7, song_id=18190270, days=[21, 23, 25, 26])

Unnamed: 0,user_id,media_id,moment_of_day,day_listen,hour_listen,spotify_name,spotify_artist,energy,danceability,tempo,valence
6533,7,18190270,morning,21,14,Born To Die,Lana Del Rey,0.65,0.315,172.026,0.389
160056,7,18190270,morning,21,10,Born To Die,Lana Del Rey,0.65,0.315,172.026,0.389
197487,7,18190270,morning,21,13,Born To Die,Lana Del Rey,0.65,0.315,172.026,0.389
633209,7,18190270,afternoon_evening,21,15,Born To Die,Lana Del Rey,0.65,0.315,172.026,0.389
1127125,7,18190270,morning,21,10,Born To Die,Lana Del Rey,0.65,0.315,172.026,0.389
1577854,7,18190270,morning,21,6,Born To Die,Lana Del Rey,0.65,0.315,172.026,0.389
575672,7,18190270,morning,23,8,Born To Die,Lana Del Rey,0.65,0.315,172.026,0.389
670208,7,18190270,afternoon_evening,23,17,Born To Die,Lana Del Rey,0.65,0.315,172.026,0.389
1317095,7,18190270,morning,23,13,Born To Die,Lana Del Rey,0.65,0.315,172.026,0.389
705194,7,18190270,morning,25,13,Born To Die,Lana Del Rey,0.65,0.315,172.026,0.389


**In the table above we see the daily activity for the top song of user 7**. The table covers the last week of November 2016, starting Sunday the 20th and ending Saturday the 26th. **Even though this user is stuck with "Born To Die", we see different time patterns depending on whether the day falls in Monday-Thursday or Friday-Saturday**. During the weekdays, the user played the song twice a day: in the morning and in the evening. However, this behavior shifted on the weekend to something closer to evening and late night. Hence, to properly control for these changes in time behavior (users are not obliged to wake up early to go to work) we create the weekday and weekend split. **Furthermore, the time frequency comparison below of weekday vs weekend also supports this point at the aggregate level.**

<img src= "./images/weekday_weekend.png ">
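The six buckets described in this subsection can be sketched as a small helper. This is only an illustration: the function names, the hour boundaries (05:00-14:59 morning, 15:00-20:59 afternoon/evening, otherwise late night) and the choice of Saturday and Sunday as the weekend are assumptions picked to agree with the sample rows shown above. They are not the cut-offs used inside `Pipeline`, which are not shown in this notebook, and the label strings are simplified.

```python
from datetime import datetime


def moment_of_day(hour: int) -> str:
    """Map an hour (0-23) to a time-of-day bucket (assumed boundaries)."""
    if 5 <= hour < 15:
        return "morning"
    if 15 <= hour < 21:
        return "afternoon_evening"
    return "late_night"


def moment_of_week(ts: datetime, weekend_days=(5, 6)) -> str:
    """Combine weekday/weekend with the time-of-day bucket.

    weekend_days uses datetime.weekday() numbering (Monday=0 ... Sunday=6);
    treating only Saturday and Sunday as weekend is an assumption.
    """
    part = "weekend" if ts.weekday() in weekend_days else "weekday"
    return f"{part}_{moment_of_day(ts.hour)}"


# Three listens from the week discussed above (20-26 November 2016).
for ts in (datetime(2016, 11, 21, 9), datetime(2016, 11, 20, 18), datetime(2016, 11, 26, 1)):
    print(ts, "->", moment_of_week(ts))
```

Passing `weekend_days=(4, 5, 6)` would also count Friday as weekend, which may be closer to the Friday-Saturday pattern observed for user 7.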

#### Listening activity measurement

**The main question we address by measuring listening activity is whether users change the songs they listen to throughout the day.** This plays a key role in justifying our approach. **If the majority of the users are stuck with a certain group of songs, playing them all day long, then making the recommender system time aware would not yield any improvements.** An example of this behavior was seen above with user 7. No matter what time of the day it was, she always played Lana Del Rey's "Born To Die"; moreover, all the top songs of that user showed the same behavior. Hence, the time of day only helps us understand when this user is more likely to log in to the app, but her listening patterns remain fixed. On the opposite side of the spectrum we have cases like user 3466, who did vary the energy levels of the songs through the day. **Therefore, which profile is more common in the data?**

**The approach used to measure the listening activity is the following.** **First, we determined which time periods are relevant for each user.** If a user never logs in to the app in the evenings, then that time period was filtered out when analyzing her listening behavior. In fact, any time period that accounted for less than 5% of the user's listening activity was removed for robustness. **Then, we looked for the songs that were played across all the relevant time periods.** Going back to our example users, for user 7 we included Lana Del Rey's song in the computation since it was played in all the relevant time periods. For user 3466, however, none of the top three songs were included since they were only played in specific time periods and not across all of them. We understand that this is a strict requirement, but it gives us a direct answer to whether or not a user plays the same songs throughout the day. **Finally, we gave each user a score based on how consistently he/she played the same songs throughout all time periods.** No user is black or white. Even user 7 has some songs that were not played in all the time periods; same for user 3488. The score measures the % of the listening activity that comes from songs played in all relevant time periods. **For the % of listening activity, we calculate the number of times that a song was played throughout the month and divide that number by the total number of plays in that time frame** (a numerical example is provided below).

Below we plot user / time period distributions to illustrate why it is important to only account for the "relevant" time periods for the calculation.

<img src= "./images/user_timedis.png ">

As seen above, user 13 never listens to music late at night. Hence it makes no sense to require that a song also be played in this time bucket in order to say that it is consistently played over time. If she plays the same song in the morning and the afternoon, then she is in fact playing the same songs consistently. For user 2, on the other hand, it seems counterintuitive to include the afternoon/evening period in the analysis. Roughly 4% of his activity is concentrated there, which looks like noise or non-recurrent events. By contrast, there are users like user 1 for whom all time periods are relevant, and hence the measurement accounts for all of them.

Once the relevant time periods for each user are determined, **we find the songs that were played across all time periods by simply checking whether the song has a positive number of plays in each time bucket.** Again, if a user does not play a song in a certain (relevant) time period, then we can argue that he is not listening to the same music throughout the day. This holds especially for the songs that matter most: if the _most listened song_ does not appear in a certain time bucket, it hints that the song is more suitable for certain times of the day or, at the very least, that the user alters his listening behavior depending on time. In the opposite scenario, if the song has at least one play in a time bucket, we say that the song was played in that time frame. It might seem overoptimistic to assert that a song played only once in a certain time bucket is consistently listened to in that time period. However, we do not correct for this since, even with the overestimation, the consistency results are low at the aggregate level.

In [7]:
pipe.get_user_pivot().head()

Unnamed: 0_level_0,moment_of_day,afternoon_evening,late_night,morning
user_id,media_id,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
0,534628,2.0,5.0,6.0
0,549957,2.0,4.0,4.0
0,549995,,1.0,2.0
0,562487,,,1.0
0,573059,,1.0,1.0


In the above example, we see that user 0 listened to the third song 3 times in total: 2 in the morning, 1 late at night and none in the afternoon, so this song does not count as a song listened to in all time periods. **Finally, to come up with an individual user score**, we add up the % of the listening activity that was consistently listened to in all time periods. For example, take user 1:

In [8]:
pipe.run_user_analysis(user_id=1).head()

Unnamed: 0,user_id,media_id,spotify_name,spotify_artist,mult,song_sum,total,weights
275,1,129310248,Closer,The Chainsmokers,1,13,1429,0.009097
435,1,92734438,Uptown Funk,Mark Ronson,0,13,1429,0.009097
152,1,1067076,Play That Funky Music,Wild Cherry,0,11,1429,0.007698
513,1,110954154,Just Like A Child,James Morrison,1,11,1429,0.007698
64,1,13161798,Me and Mrs. Jones,Billy Paul,0,10,1429,0.006998


**The table is read as follows:** the "total" column represents the total number of interactions that the user had in the month (that is, the sum over all distinct songs of the number of times each was played in the given time frame). The "song_sum" column represents the total number of times that the given song was listened to throughout the month; in this case "Uptown Funk" was listened to 13 times. Finally, "weights" is the quotient of the previous two columns, i.e. "song_sum" divided by "total". **To come up with a user score, we only aggregate the "weights" that have "mult" equal to 1, which are the songs that were listened to during all relevant time periods.**
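The scoring procedure described in this subsection can be sketched in plain pandas. This is a minimal illustration under stated assumptions, not the code behind `pipe.run_user_analysis`: the function name `consistency_score`, the one-row-per-play input layout and the toy data are hypothetical, while the 5% relevance threshold and the "played at least once in every relevant period" rule follow the text above.

```python
import pandas as pd


def consistency_score(plays: pd.DataFrame, min_share: float = 0.05) -> pd.Series:
    """Share of each user's plays coming from songs heard in every relevant period.

    `plays` holds one row per play with columns user_id, media_id, moment_of_day.
    """
    # Plays per (user, song) and period; absent combinations count as 0.
    counts = (plays.groupby(["user_id", "media_id", "moment_of_day"])
                   .size()
                   .unstack("moment_of_day", fill_value=0))
    scores = {}
    for user_id, pivot in counts.groupby(level="user_id"):
        total = pivot.to_numpy().sum()                   # the "total" column
        period_share = pivot.sum(axis=0) / total
        relevant = period_share[period_share >= min_share].index
        mult = (pivot[relevant] > 0).all(axis=1)         # the "mult" flag
        weights = pivot.sum(axis=1) / total              # "song_sum" / "total"
        scores[user_id] = float(weights[mult].sum())
    return pd.Series(scores, name="final")


# Toy data: user 1 spreads song A over all periods; user 2 barely uses the afternoon.
rows = ([(1, "A", "morning")] * 2 + [(1, "A", "afternoon_evening"), (1, "A", "late_night")]
        + [(1, "B", "morning")] * 4
        + [(2, "C", "morning")] * 20 + [(2, "C", "late_night")] * 19
        + [(2, "D", "afternoon_evening")])
plays = pd.DataFrame(rows, columns=["user_id", "media_id", "moment_of_day"])
scores = consistency_score(plays)
print(scores)
```

In the toy data, the afternoon/evening period of user 2 holds 1 of 40 plays (2.5%), so it falls under the threshold and song C only needs to appear in the morning and late-night buckets to count.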

The final summary at the user level is

In [9]:
u_fin = pipe.run_analysis(); u_fin.head(5)

Unnamed: 0,user_id,final,song_rel_per
0,0,0.436559,0.001264
1,1,0.149755,0.001295
2,2,0.767814,0.001132
3,3,0.033708,0.000161
4,4,0.6875,0.00029


where it can be seen that some users like 2 or 4 are stuck listening to the same songs all the time whereas users like 1 or 3 vary their musical patterns. **At the aggregate level the weighted value is 34%** (as seen below). 

In [10]:
print("{:2.2f}".format(np.sum(u_fin["final"] * u_fin["song_rel_per"])))

0.34


### Audio features analysis

**The audio features incorporated into the data set were retrieved from Spotify's API**: https://developer.spotify.com/web-api/object-model/#audio-features-object. We matched the song name and artist name from the Deezer data set, queried the Spotify ID and retrieved additional features (described below) from the Spotify API.

The main motivation for doing so is twofold. First, **we want to help the recommender system by giving it more features to discriminate on**. Second, **we want to understand how _diverse_ the recommendations that we provide to the user are**. For example, with the addition of these features we can determine whether the majority of the recommendations for a user come from a single genre or whether they are all energetic songs.


The list of features added into the model is quite large. However, **below are the ones that we were most interested in**. We provide a brief definition and also a rationale for why they are relevant.
* **_Energy_**: As explained in the last section, this feature measures the perceptual intensity of a song on a (0, 1) scale. This feature is relevant for the analysis since some moments of the day / activities are more suitable for more energetic music. 
* **_Danceability_**: This feature measures how suitable a track is for dancing; it combines overall regularity and rhythm stability. It is also on a (0, 1) scale. This feature is relevant again due to its connection with the day of the week. More "danceable" songs are more appealing for a Saturday night than for a Monday morning.
* **_Tempo_**: This feature is simply the BPM of a song. This feature is relevant since it helps us understand how much a user likes or dislikes slow-paced songs. Also, this feature is linked to the two previous ones.
* **_Valence_**: This feature measures the "positiveness" conveyed by the track. It is measured on a (0, 1) scale; a higher number reflects happy and cheerful songs, while a lower number reflects sad or depressed ones. This feature is relevant for understanding the music selection of a user more deeply. Two users might like the hip-hop genre; however, one may prefer a Kanye West style while the other prefers something like the Wu-Tang Clan. In this scenario, valence would help us further discriminate between users on how "strong" their hip-hop music preferences are.
* **_Speechiness_**: This feature detects the presence of spoken words in the track. It is also measured on a (0, 1) scale, where values below 0.33 most likely come from music and other non-speech-like tracks, values between 0.33 and 0.66 from tracks that mix music and speech (rap, for example), and values above 0.66 from tracks made almost entirely of spoken words. This feature is relevant since it allows us to understand how much a user is into songs with a strong spoken component rather than just music beats.

**To understand if these features would actually prove useful for our data set we ran non-parametric chi-square tests.** Below are the most relevant results: one links the time of day to the perceptual energy of a song, and the others link the time of day and the time of week to the danceability of a song.

In [63]:
# Chi-square test for time_of_day vs energy
chi = pd.crosstab(chi_df["moment_of_day"], chi_df["track_energy_bucket"])
_, p_val, _, expected_energy  = chi2_contingency(chi, correction=True, lambda_=None)
chi

track_energy_bucket,high,low,med
moment_of_day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
afternoon_evening,185151,19882,139478
late_night,157624,16547,116477
morning,551737,55647,404026


In [64]:
print("The p-value obtained for this test is {:2.4e}".format(p_val))

The p-value obtained for this test is 1.4207e-18


**As we see, the p-value is very low, hence we reject the hypothesis that time of day and energy are independent.** This result is consistent with many users behaving like user 3466, discussed at the beginning of the section, although an aggregate test cannot by itself show that the majority of users do so.
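To see which cells drive the rejection, one option is to compare the observed counts with the counts expected under independence. The sketch below re-enters the contingency table printed above by hand and computes Pearson residuals, (observed - expected) / sqrt(expected); the variable names are ours, and the residual analysis is an addition for illustration rather than part of the original pipeline.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Counts copied from the moment_of_day x track_energy_bucket table above.
observed = pd.DataFrame(
    [[185151, 19882, 139478],
     [157624, 16547, 116477],
     [551737, 55647, 404026]],
    index=["afternoon_evening", "late_night", "morning"],
    columns=["high", "low", "med"],
)

chi2, p_value, dof, expected = chi2_contingency(observed.to_numpy())
expected = pd.DataFrame(expected, index=observed.index, columns=observed.columns)

# Positive residual: the cell is more frequent than independence would predict.
residuals = (observed - expected) / np.sqrt(expected)
print("chi2 = {:.2f}, dof = {}, p = {:.4e}".format(chi2, dof, p_value))
print(residuals.round(2))
```

With these counts, the morning / high-energy cell comes out above its expected value and the afternoon_evening / high-energy cell below it. Since the table has more than 1.6 million observations, even small deviations like these are enough to produce a very small p-value.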

In [65]:
# Chi-square test for time_of_day vs danceability
chi = pd.crosstab(chi_df["moment_of_day"], chi_df["track_danceability_bucket"])
_, p_val_dance, _, expected_dance   = chi2_contingency(chi, correction=True, lambda_=None)
chi

track_danceability_bucket,no-dance,yes-dance
moment_of_day,Unnamed: 1_level_1,Unnamed: 2_level_1
afternoon_evening,22884,321627
late_night,19106,271542
morning,65327,946083


In [66]:
print("The p-value obtained for this test is {:2.4e}".format(p_val_dance))

The p-value obtained for this test is 3.3377e-04


In [67]:
# Chi-square test for time_of_day (week) vs danceability
chi = pd.crosstab(chi_df["moment_of_week"], chi_df["track_danceability_bucket"])
_, p_val_dance, _, expected_dance   = chi2_contingency(chi, correction=True, lambda_=None)
chi

track_danceability_bucket,no-dance,yes-dance
moment_of_week,Unnamed: 1_level_1,Unnamed: 2_level_1
weekday_afternoon_to_evening,13330,181040
weekday_late_night,12805,183666
weekday_morning,39365,559451
weekend_afternoon_evening,9554,140587
weekend_late_night,6301,87876
weekend_morning,25962,386632


In [68]:
print("The p-value obtained for this test is {:2.4e}".format(p_val_dance))

The p-value obtained for this test is 7.8304e-17


**Again the p-value is small, so we also reject independence between these two features.** As we see, the p-value of the chi-square test decreases when we incorporate the weekday and weekend differentiation into the analysis. Intuitively, this is explained by the nature of the activities that are performed during the weekend.

---

## Approach: how can we incorporate time awareness and audio features into recommendations?

**The approach we followed for incorporating time awareness and audio features into our recommendations is a two-step approach**: first we run the models that were created in the first part of the project for each time split and then, as an independent exercise, we run a classifier based on our audio features. **In terms of time, we simply preprocess the data into the different time buckets before running our models**. We tried other approaches; however, this one yielded the best results (or at least _a_ result that ran on our computers). **In terms of the audio features, we trained different classifiers in order to get two main insights**: first, which audio features discriminated our users best and, second, what biases are present in our data. The section is structured as follows. In the first subsection, we show the results of filtering the data by time before running our models and explain why we followed this approach. In the second subsection, we specify the components of our features classifier and its results.

### Time Aware Model

**When dealing with time, the most _popular_ model seems to be `timeSVD++`. However, this model uses time with a different purpose than ours.** By making the user and item biases a function of time, `timeSVD++` aims to understand how these biases change over long horizons, say months and years. This is not the objective that we have in mind (in fact, we only have one month of data). What we intend to model is how to best recommend, within the same day, according to the time context that a user is currently in.

**In order to manage this different objective, as seen in class, the literature has some approaches but they are too complex to implement.** For example, the _Neural Survival Recommender_ paper by How Jing and Alexander Smola proposes a "Just-In-Time" approach but requires training RNNs and also demands some derivation of formulas. **Hence, in order to get a first approximation of what we wanted, we focused on a preprocessing technique.**

**The preprocessing steps are the following.** First, we segmented our data set into the splits dictated by the time-of-day buckets. Then we ran a separate model for each time of day and, therefore, generated a recommendation for each (user, time) pairing. **We created the `ContextLF` class to implement this.** The class takes as input the data from the `Pipeline` class. It then trains one latent factor model for each context variable and one general model that ignores the time context. The function `model.fit()` returns the latent factor matrices U (users) and V (items) for each context.
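The idea of fitting one latent factor model per context can be sketched with a few lines of NumPy. This is a toy stand-in, not the `ContextLF` implementation (whose internals are not shown here): the helper names, the plain gradient-descent update, the hyperparameters and the tiny interaction lists are all assumptions made for illustration.

```python
import numpy as np


def build_matrix(triples, n_users, n_items):
    """Accumulate (user, item, value) triples into a matrix and an observation mask."""
    R = np.zeros((n_users, n_items))
    for user, item, value in triples:
        R[user, item] += value
    return R, (R > 0).astype(float)


def masked_sse(R, mask, U, V):
    """Sum of squared errors over the observed entries only."""
    return float(np.sum((mask * (R - U @ V.T)) ** 2))


def fit_latent_factors(R, mask, k=2, lr=0.02, reg=0.05, n_iter=2000, seed=0):
    """L2-regularised matrix factorisation R ~ U V^T fitted by gradient descent."""
    rng = np.random.default_rng(seed)
    U = 0.1 * rng.standard_normal((R.shape[0], k))
    V = 0.1 * rng.standard_normal((R.shape[1], k))
    for _ in range(n_iter):
        err = mask * (R - U @ V.T)
        # Simultaneous update of both factor matrices.
        U, V = U + lr * (err @ V - reg * U), V + lr * (err.T @ U - reg * V)
    return U, V


def fit_per_context(interactions, n_users, n_items, **kwargs):
    """Fit one model per context plus a pooled 'no_context' model."""
    pooled = [t for triples in interactions.values() for t in triples]
    datasets = dict(interactions, no_context=pooled)
    models = {}
    for context, triples in datasets.items():
        R, mask = build_matrix(triples, n_users, n_items)
        models[context] = fit_latent_factors(R, mask, **kwargs)
    return models


# Hypothetical interactions: (user index, item index, play indicator) per context.
interactions = {
    "morning": [(0, 0, 1.0), (0, 1, 1.0), (1, 0, 1.0), (2, 2, 1.0)],
    "late_night": [(0, 2, 1.0), (3, 1, 1.0), (3, 2, 1.0)],
}
models = fit_per_context(interactions, n_users=4, n_items=3)
for context, (U, V) in models.items():
    print(context, U.shape, V.shape)
```

Because each context gets its own pair (U, V), the same user can end up with different latent vectors in the morning and late at night, which is what allows the top recommendations to differ across time buckets.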

The data fed into the model consists of the Deezer data set and additional data we collected with a survey. **This survey data includes the preferences of 30 of our close friends, which we used to assess the quality of our predictions.** The `'train_survey.csv'` data is obtained by concatenating the online survey data with the Deezer data.

In [19]:
# Load data for the model 
pipe = Pipeline('data/train_survey.csv',
                'SpotifyAudioFeatures_raw.csv',user_thres=10,item_thres=10000)
data = pipe.make()

>>> Shape: (2480191, 56)
>>> Unique users: 18847
>>> Unique songs: 4133
Running time: 194 seconds


In [31]:
# Fit one latent factor model per time-of-day context, plus a no-context baseline, and report the loss
Model = ContextLF(data)
Model.fit(converg=5000, verbose=True)

Model for context: morning
Loss on iteration 0: 182782.273567
Model for context: late_night
Loss on iteration 0: 63845.0501602
Model for context: afternoon_evening
Loss on iteration 0: 72118.1345217
Model for context: no_context
Loss on iteration 0: 272971.753562


In the next section we discuss in depth the results that we obtain from this model. **To re-emphasize, we are trying to come up with (user, time) predictions**, since the data showed that the time of day dictates different listening patterns; **thus, we accept a loss in accuracy as long as the recommendations turn out to be different (which they did).** We hypothesize that by building different time profiles of the same user, we are able to infer recommendations from users that, without the time split, would not have been as "close" to our user in the latent space.

**The two main reasons why we followed this approach are: (1) scalability and (2) results**; in other words, it accomplished what we intended: different top recommendations for (user, time) pairings. In terms of scalability, at first we tried running different neighborhood models with time as a variable; however, our computers could not handle sample sizes larger than the ones we used in the first part of the project. **Hence we saw the necessity of splitting this problem into pieces, rather than merging it into a bigger problem.** We then switched to matrix factorization, which eventually converged. On the second point, we did get different results without sacrificing much MSE. We were afraid that the same list of top recommendations would be output for each time period, but this was not the case. **Again, we believe that this simply reflects what we saw in the analysis of user-time listening patterns in the first section.**

### Audio Features Classifier

**To understand which features were the best at predicting that a song will be listened to by a user, we created a class named `CrossFeaturesModel`.** This class lets us experiment efficiently with two important components. First, it gives us the flexibility of defining which features and cross-features the model should use and, second, it allows us to run different classifiers (e.g. Random Forest, Logistic Regression, etc.). In the next section we discuss how this classifier was validated and how its hyperparameters were tuned. In this section we discuss the rationale for the features and cross-features that we selected, as well as the insights that they gave us.

**Also, we add this classifier as a side component of our recommender system in order to address the cold-start problem (both for users and songs).** For example, if a certain genre is weighted more heavily by a certain age bucket, then, when introducing a new song of that genre, we would mostly focus on that subset of users. With new users this is a little trickier; nonetheless, we saw that a strong predictor in our data was French rap music. There is therefore a bias among Deezer users towards rap music (Nekfeu, Ninho, Timal, Jul,...), hence a first bet would be to start with these recommendations.

We first load the data that we wish to run our classifier on. Due to our limited computing resources (personal computers), we restrict ourselves to a sample of the entire data set. _To run the whole model, the second `data_file` line needs to be uncommented; however, the model then takes hours to run, so we add the results in a final plot_.

In [14]:
data_file = "db_nrows.csv"
# data_file = "train_sample.csv"
sample_path = "train_sample.csv"
pipe = Pipeline(deezer_path=data_file,
                spotify_path=spotify_file,
                sample_path=sample_path,
                use_sample=False)
data = pipe.make()

>>> Shape: (529, 57)
>>> Unique users: 524
>>> Unique songs: 4924
Running time: 0 seconds


Then we select the features to include into the model and, with the use of the class `Vectorizer`, we generate the cross-features.

In [15]:
# Columns to drop in the train set
to_drop = ['ts_listen', 'context_type', 'release_date', 
           'platform_name', 'platform_family', 'media_duration',
           'listen_type', 'user_id', 'Unnamed: 0', 'acousticness', 'danceability',
           'deezer_artist', 'deezer_bpm', 'deezer_name', 'duration_ms', 'energy',
           'id', 'instrumentalness', 'key', 'liveness', 'loudness', 'mode',
           'speechiness', 'spotify_album_name', 'spotify_artist', 'spotify_name',
           'tempo', 'time_signature', 'type', 'uri', 'valence', 'converted_ts',
           'year_listen', 'month_listen', 'day_listen', 'hour_listen',
           'year_release', 'month_release', 'day_release', 'user_age', 
           'is_listened', 'Unnamed: 0_x', 'Unnamed: 0_y']

In [16]:
# Standard classification framework: X = training set, y = target classes
X = data[[col for col in data.columns if col not in to_drop]]; y = data.is_listened
X.columns

Index(['genre_id', 'media_id', 'album_id', 'user_gender', 'artist_id',
       'moment_of_week', 'moment_of_day', 'track_tempo_bucket',
       'track_age_bucket', 'track_duration_bucket', 'user_age_bucket',
       'track_energy_bucket', 'track_valence_bucket',
       'track_speechiness_bucket', 'track_danceability_bucket'],
      dtype='object')

In [17]:
# We create 9 cross features and then let the class Vectorizer modify the data
transforms = [("genre_id", "moment_of_week", "&"),
              ("genre_id", "user_gender", "&"),
              ("artist_id", "user_gender", "&"),
              ("user_age_bucket", "track_age_bucket", "&"),
              ("user_age_bucket", "track_valence_bucket", "&"),
              ("track_danceability_bucket", "moment_of_week", "&"),
              ("track_speechiness_bucket", "moment_of_week", "&"),
              ("track_energy_bucket", "moment_of_week", "&"),
              ("track_tempo_bucket", "moment_of_week", "&")
              ]
vect = Vectorizer(transforms)
X.is_copy = False  # avoid indexing mistake notification
%time vect.fit_transform(X, transforms)
X.head()

CPU times: user 12 ms, sys: 0 ns, total: 12 ms
Wall time: 10.1 ms


Unnamed: 0,genre_id,media_id,album_id,user_gender,artist_id,moment_of_week,moment_of_day,track_tempo_bucket,track_age_bucket,track_duration_bucket,...,track_danceability_bucket,genre_id&moment_of_week,genre_id&user_gender,artist_id&user_gender,user_age_bucket&track_age_bucket,user_age_bucket&track_valence_bucket,track_danceability_bucket&moment_of_week,track_speechiness_bucket&moment_of_week,track_energy_bucket&moment_of_week,track_tempo_bucket&moment_of_week
0,2744,876497,99692,1,26,weekend_late_night,late_night,moderate,80s,long_duration,...,yes-dance,2744&weekend_late_night,2744&1,26&1,[22-25]&80s,[22-25]&positive,yes-dance&weekend_late_night,low-speech&weekend_late_night,med&weekend_late_night,moderate&weekend_late_night
1,2744,876498,99692,0,26,weekend_morning,morning,very_fast,80s,medium_duration,...,yes-dance,2744&weekend_morning,2744&0,26&0,[26-30]&80s,[26-30]&positive,yes-dance&weekend_morning,low-speech&weekend_morning,high&weekend_morning,very_fast&weekend_morning
2,2744,876498,99692,1,26,weekend_morning,morning,very_fast,80s,medium_duration,...,yes-dance,2744&weekend_morning,2744&1,26&1,[26-30]&80s,[26-30]&positive,yes-dance&weekend_morning,low-speech&weekend_morning,high&weekend_morning,very_fast&weekend_morning
3,2744,876498,99692,0,26,weekend_late_night,late_night,very_fast,80s,medium_duration,...,yes-dance,2744&weekend_late_night,2744&0,26&0,[26-30]&80s,[26-30]&positive,yes-dance&weekend_late_night,low-speech&weekend_late_night,high&weekend_late_night,very_fast&weekend_late_night
4,2744,876498,99692,0,26,weekend_morning,morning,very_fast,80s,medium_duration,...,yes-dance,2744&weekend_morning,2744&0,26&0,[26-30]&80s,[26-30]&positive,yes-dance&weekend_morning,low-speech&weekend_morning,high&weekend_morning,very_fast&weekend_morning


The last columns display the cross-features that were incorporated. To run the classifier, we one-hot encode each column.

In [18]:
X_dummies = pd.get_dummies(X); X_dummies.head()

Unnamed: 0,genre_id,album_id,user_gender,artist_id,media_id_876497,media_id_876498,media_id_876500,moment_of_week_weekday_afternoon_to_evening,moment_of_week_weekday_late_night,moment_of_week_weekday_morning,...,track_tempo_bucket&moment_of_week_moderate&weekday_morning,track_tempo_bucket&moment_of_week_moderate&weekend_afternoon_evening,track_tempo_bucket&moment_of_week_moderate&weekend_late_night,track_tempo_bucket&moment_of_week_moderate&weekend_morning,track_tempo_bucket&moment_of_week_very_fast&weekday_afternoon_to_evening,track_tempo_bucket&moment_of_week_very_fast&weekday_late_night,track_tempo_bucket&moment_of_week_very_fast&weekday_morning,track_tempo_bucket&moment_of_week_very_fast&weekend_afternoon_evening,track_tempo_bucket&moment_of_week_very_fast&weekend_late_night,track_tempo_bucket&moment_of_week_very_fast&weekend_morning
0,2744,99692,1,26,1,0,0,0,0,0,...,0,0,1,0,0,0,0,0,0,0
1,2744,99692,0,26,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
2,2744,99692,1,26,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
3,2744,99692,0,26,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
4,2744,99692,0,26,0,1,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
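The internals of `Vectorizer` are not shown in this notebook. As a minimal sketch of the two steps above, assuming a cross-feature is simply the string concatenation of two categorical columns joined by the separator, the hypothetical helper `add_cross_features` below builds the crossed column and then one-hot encodes it with `pd.get_dummies`. The toy data frame is made up for illustration.

```python
import pandas as pd


def add_cross_features(df, transforms):
    """Return a copy of df with one new column per (left, right, sep) triple."""
    out = df.copy()
    for left, right, sep in transforms:
        # e.g. "very_fast" and "weekend_morning" -> "very_fast&weekend_morning"
        out[left + sep + right] = out[left].astype(str) + sep + out[right].astype(str)
    return out


# Toy data (hypothetical), using two column names from the notebook
toy = pd.DataFrame({
    "track_tempo_bucket": ["moderate", "very_fast", "very_fast"],
    "moment_of_week": ["weekend_late_night", "weekend_morning", "weekend_morning"],
})

crossed = add_cross_features(toy, [("track_tempo_bucket", "moment_of_week", "&")])
encoded = pd.get_dummies(crossed)  # one indicator column per distinct value
print(encoded.columns.tolist())
```

The resulting indicator names follow the same `column_value` pattern as the output above, for example `track_tempo_bucket&moment_of_week_very_fast&weekend_morning`.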


We now train the model with the classifier of our choice.

In [19]:
model = CrossFeaturesModel(data=X_dummies, target=y, estimator='logistic')
model.train()

>>> Model is training
>>> Model is trained and ready to use!
Run time: 3.9033925533294678 seconds


**We found that the most interpretable results came from using a simple logistic classifier.** We are not surprised by this since, in essence, the logistic model was developed to perform tasks similar to the one at hand. Since we have implicit data, we are trying to predict a binary random variable: either a user listens to a song (success) or not. Moreover, since the insights that we want to derive are which features help us predict best, we need to do nothing more than analyze the weights that the logistic model assigns to each of the features. **Below is a plot of the most relevant weights.**

In [20]:
# Add plot from running the model not just on a sample
# model.plot_important_features(top=10)

<img src="./images/importance.png">
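How `CrossFeaturesModel.plot_important_features` ranks features is not shown here. The sketch below illustrates the general idea of reading feature importance from logistic weights, assuming a scikit-learn `LogisticRegression` on one-hot inputs. The feature names and the synthetic target are made up for illustration and are not results from our data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

# Hypothetical binary indicator features (stand-ins for one-hot columns)
toy_X = pd.DataFrame({
    "genre_rap": rng.integers(0, 2, n),
    "tempo_fast": rng.integers(0, 2, n),
    "gender&genre_pop": rng.integers(0, 2, n),
})

# Synthetic target: one feature helps, one hurts, one is pure noise
logit = 2.0 * toy_X["gender&genre_pop"] - 2.0 * toy_X["genre_rap"]
toy_y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

clf = LogisticRegression().fit(toy_X, toy_y)

# One weight per feature; sort by absolute value to get the most relevant ones
weights = pd.Series(clf.coef_[0], index=toy_X.columns)
top_weights = weights.reindex(weights.abs().sort_values(ascending=False).index)
print(top_weights)
```

A positive weight raises the predicted probability that the song is listened to, and a negative weight lowers it. Because all inputs are 0/1 indicators, the magnitudes are directly comparable.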

From the top 10 features in the plot above, **we derived the following insights**:
* **Even though strong French rap is listened to frequently in the data set, it is not a good recommendation to make**. We hypothesize two reasons. One is that users who like this music do not want it as a recommendation because they are already into it; the other is that a user who is distant from that genre might find the recommendation too aggressive. The direct implication for our recommender system is to filter out or restrain the presence of this type of song in our top-k lists.
* **Genre and gender showed a connection**. Women appear to have a bias towards French pop artists (like Marina Kaye), while they tend to steer away from rap artists. This goes back to the previous point: if the recommender system does not take gender into account, then it is not surprising that women were prone to dislike explicit rap content. The implication for our recommender system is to select top items from genres that are more suitable to the gender at hand.
* **Audio features by themselves say nothing**. This point was at first counterintuitive, but we have a hypothesis for why this is the case. Audio features by themselves are not what users choose to listen to. A user does not say, "I feel like listening to energetic music", but rather feels like listening to certain songs. Thus, in the process of creating a list of songs to listen to, the user might unintentionally add a less energetic song to the session. Even though a Monday morning might contain more energetic songs (as we saw in the data analysis), this unintentional variation in energy levels generates so much noise that the model is not able to discriminate by these features alone.

_Note: the previous 3 points were made with a sample of the data. With the whole data set, these results could change. However, based on our domain knowledge, we do believe the previous insights could generalize_.

---

## Validation: how were the hyperparameters tuned?

**This section mainly discusses the validation of the Cross-Features classifier since the other model was validated in the first part of the project.** For the latter, the exact same approach was followed; however, we do show the results of this exercise.

### Time aware model validation: hyperparameters and survey user testing

For evaluating the time-aware model we used two approaches: we first calculated the model's accuracy and then got a qualitative feel for the output from the recommendations that we made to our close friends.

##### Evaluate the model's accuracy

**To evaluate the model for time of the day, we first looked at overall accuracy**. For every context we calculated the absolute mean error (AME) and the mean squared error (MSE). We can see that the MSE for the model without context (any time of the day) is lower than for every individual context. It should be noted that a phase of hyperparameter tuning should usually follow; however, due to the large data set and the resulting runtime, we were not able to do extensive hyperparameter tuning (number of latent factors, regularization parameter). From the last project we knew that a model with 5 latent factors and a regularization parameter of 1 performed best on the given dataset. Thus we decided to go with these parameters, without further exploring the parameter space.


In [32]:
# Load training results (as the model took 12 hours to train,
# we saved the results for U and V and load it here)
for t, context in enumerate(Model.context.tolist()):
    # read the saved latent factors for this context
    U = pd.read_csv('U_' + str(context) + '.csv')
    V = pd.read_csv('V_' + str(context) + '.csv')
    # restore the original index and column labels
    U.index = Model.U[t].index
    U.columns = Model.U[t].columns
    V.index = Model.V[t].index
    V.columns = Model.V[t].columns
    Model.U[t] = U
    Model.V[t] = V

In [11]:
# MSE and AME, per context (on the validation set)
for t, context in enumerate(Model.context.tolist()):
    Tester = ModelTester.ModelTester()
    # 'no_context' uses all the data; otherwise keep only that moment of day
    if context == 'no_context':
        context_data = Model.data
    else:
        context_data = Model.data[Model.data['moment_of_day'] == context]
    # user x item matrix of mean listen rates
    context_data = pd.pivot_table(context_data, columns='media_id', index='user_id',
                                  values='is_listened', aggfunc='mean')

    Tester.fit_transform(context_data, verbose=False)
    # reconstruct the full prediction matrix from the latent factors
    predictions = pd.DataFrame(np.dot(Model.U[t], Model.V[t]),
                               index=context_data.index,
                               columns=context_data.columns)
    ame = Tester.evaluate_valid(predictions,
                                loss_func=lf.absolute_mean_error,
                                verbose=False)
    mse = Tester.evaluate_valid(predictions,
                                loss_func=lf.mean_squared_error,
                                verbose=False)
    print('For context: ' + str(context) + ' AME: ' + str(ame) + ' MSE: ' + str(mse))

For context: morning AME: 0.337878295466 MSE: 0.159564782442
For context: late_night AME: 0.379839901011 MSE: 0.196984280216
For context: afternoon_evening AME: 0.362722825614 MSE: 0.183865638052
For context: no_context AME: 0.327084946073 MSE: 0.151121176189
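The internals of `ModelTester` and of the `lf` loss functions are not shown in this notebook. As a minimal sketch of how the two errors above could be computed, assuming they are averaged only over the observed (non-missing) user-item cells, the hypothetical helper `masked_errors` compares a sparse observation matrix with the dense product of the latent factors. The toy matrices are made up for illustration.

```python
import numpy as np
import pandas as pd


def masked_errors(observed, predicted):
    """Absolute mean error and mean squared error over observed (non-NaN) cells."""
    mask = observed.notna().to_numpy()
    diff = (predicted.to_numpy() - observed.to_numpy())[mask]
    return np.abs(diff).mean(), (diff ** 2).mean()


# Toy user x item matrix: NaN means the user never met the song in this context
observed = pd.DataFrame([[1.0, np.nan], [np.nan, 0.0]],
                        index=["u1", "u2"], columns=["s1", "s2"])

# Toy latent factors with a single latent dimension
U = np.array([[1.0], [0.5]])   # users x factors
V = np.array([[0.8, 0.4]])     # factors x items
predicted = pd.DataFrame(U @ V, index=observed.index, columns=observed.columns)

ame, mse = masked_errors(observed, predicted)
print(ame, mse)
```

Only the two observed cells contribute here, so missing entries neither reward nor penalize the model.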


**It can be seen that accuracy is better for the model without a context than for each context model**. We believe this is a result of the increased sparseness in the context matrices. **However, for a user, accuracy is not the most important metric.** What matters is the quality of the top-k recommended items, which is not captured by the accuracy measures above and should be tested with the users directly; hence the main motivation for our survey analysis. Another approach that we would like to try in the future is to test the accuracy of only the top-k recommendations in certain contexts.

In [13]:
# check fill level for each context
for i in Model.context.tolist():
    Tester = ModelTester.ModelTester()
    if i == 'no_context':
        context_data = Model.data
    else:
        context_data = Model.data[Model.data['moment_of_day'] == i]
    context_data = pd.pivot_table(context_data, columns='media_id',index='user_id',values='is_listened',aggfunc='mean')
    a = context_data.shape
    b = context_data.count().sum()/float((context_data.shape[0]*context_data.shape[1]))
    print('Context: '+str(i))
    print('    Shape (user,items): '+str(a))
    print('    Fill-Level: '+str(b))
    print('==========================================')

Context: morning
    Shape (user,items): (12449, 4115)
    Fill-Level: 0.00766098220228203
Context: late_night
    Shape (user,items): (7031, 3998)
    Fill-Level: 0.004691223438486417
Context: afternoon_evening
    Shape (user,items): (7662, 4018)
    Fill-Level: 0.004518267379148309
Context: no_context
    Shape (user,items): (14266, 4122)
    Fill-Level: 0.009652891587187991


### Survey Results

**Accuracy might not be the most appropriate measure to evaluate the quality of a recommender system.** Thus we surveyed our close friends to gather their thoughts about our recommendations. The idea stems from the inherent difficulty in testing the accuracy of recommender systems. Testing against past ratings can only take one so far; we wanted to go one step further and see how people reacted to the recommendations we were making. We queried the Deezer API to get the names of our songs. **We constructed the survey with 65 popular songs and asked our friends to rate each song 1 if they liked it, 0 if they did not, and N/A if they had not heard it.** Then, for the songs that the users had rated 1, we asked them about the time at which they are most likely to listen to those songs. This was a very important step because we were building a time-sensitive model and needed context information. In the end, we collected data for 30 users.

**The next step in the process had a heavy data cleaning component**. Survey data is typically messy and is seldom in a form that can be used directly for analysis. The cleaning steps included matching the song names with their 'unique_id' values, formatting the time of day the song was listened to, and removing redundancies. The data was then in a format consistent with our original data and could be appended to our dataset. We then ran our models on this new dataset and got the top-k recommendations for all users, including the users in the survey. The next step was arguably the most exciting one: we took these top-k recommendations to our friends who had filled in the survey and collected their feedback on them.
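The cleaning code itself is not part of this notebook. The sketch below shows one way the steps just described could look in pandas, under stated assumptions: the survey layout, the `catalogue` lookup table, the song ids and the rule for assigning user ids are all hypothetical and only serve to illustrate the name matching, time formatting and de-duplication.

```python
import pandas as pd

# Hypothetical raw survey export: one row per (friend, song) answer
survey = pd.DataFrame({
    "friend": ["ana", "ana", "ben", "ben", "ben"],
    "song_name": [" Happy", "Thriller", "happy ", "Thriller", "Thriller"],
    "rating": ["1", "N/A", "0", "1", "1"],
    "time_of_day": ["Morning", None, None, "Late night", "Late night"],
})

# Hypothetical lookup from normalized song title to catalogue id
catalogue = pd.DataFrame({"song_name": ["happy", "thriller"], "media_id": [101, 202]})

clean = survey.copy()
# Normalize titles so they can be matched against the catalogue
clean["song_name"] = clean["song_name"].str.strip().str.lower()
# Drop songs the friend had not heard, then remove repeated answers
clean = clean[clean["rating"] != "N/A"].drop_duplicates()
clean["is_listened"] = clean["rating"].astype(int)
# "Late night" -> "late_night", matching the context labels used by the model
clean["moment_of_day"] = clean["time_of_day"].str.lower().str.replace(" ", "_", regex=False)
# Attach the catalogue id of each song
clean = clean.merge(catalogue, on="song_name", how="inner")
# Give survey users ids well outside the range of the existing users (assumption)
user_ids = {name: 9_999_000 + i for i, name in enumerate(sorted(clean["friend"].unique()))}
clean["user_id"] = clean["friend"].map(user_ids)

survey_rows = clean[["user_id", "media_id", "is_listened", "moment_of_day"]]
print(survey_rows)
```

Rows in this shape can then be appended to the main data set with `pd.concat`. Songs rated 0 carry no time of day, since the survey only asked for it when a song was liked.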

**We feel that doing this user test gave a new dimension to our recommendation system and helped us learn more about the subtleties of how the recommendations happen.** One interesting result was that the system would often recommend songs that were all from the same genre to a user; we tried to remove this redundancy by filtering for it.

The accuracy of our recommendations was 50%. In other words, out of the 12 recommendations that we made to our friends (3 for each of the 3 times of day, 3x3=9, plus 3 for any time of day), 6 of the songs were liked and the rest disliked.
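The 50% figure is a precision-at-k style measure: the share of recommended songs that were liked. As a minimal sketch of that computation, the hypothetical helper `precision_at_k` below scores one made-up friend with 3 recommendations in each of the 4 contexts; the song labels and the liked set are illustrative only, not survey data.

```python
def precision_at_k(recommended, liked, k):
    """Fraction of the first k recommended items that the user liked."""
    top_k = recommended[:k]
    if not top_k:
        return 0.0
    return sum(item in liked for item in top_k) / len(top_k)


# Hypothetical feedback for one friend: 3 songs per context, 4 contexts = 12 songs
recommended = {
    "morning": ["a", "b", "c"],
    "late_night": ["d", "e", "f"],
    "afternoon_evening": ["g", "h", "i"],
    "no_context": ["j", "k", "l"],
}
liked = {"a", "b", "e", "g", "h", "j"}  # 6 of the 12 songs were liked

per_context = {c: precision_at_k(items, liked, k=3) for c, items in recommended.items()}
overall = sum(per_context.values()) / len(per_context)
print(per_context, overall)
```

Reporting the value per context as well as overall would show whether some times of day are served better than others.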

**Below are some of the impressions that we received**:
* " What I liked most was that the system was able to capture my personality really well, it figured out my style and recommended those songs to me. I think it was really cool to do something like that! "
* " The recommendations that I'm getting are not congruent with each other. On one hand I'm getting Michael Jackson and on the other this french rap artist Jul"
* " I think the model was good on capturing the genres that I like, I'm seeing quite of that in the recommendations "

**What we liked the most about this exercise is that we were able to see the different recommendations that we were generating per time of day. However, we did notice that some genres and songs seemed out of place given what we know about our friends.** We believe that adding more of our friends' songs would have given us a better result.

Below is the code that was used. Note that, to also use audio features in generating this list, we created a "serendipity-filter", which aims to improve this characteristic in our recommendations. **The "serendipity-filter" takes the audio features of the top-k songs (mainly their genre) and removes items that have similar features, thus filtering out redundant items.** We hypothesize that recommendations will be more appealing if the user can select from a more diverse list, but this ultimately needs to be evaluated through a user test.

In [33]:
# get top_k predictions for survey users (unfiltered)
survey_user = data[data['user_id'] > 20000]['user_id'].unique()
results = []
for u in survey_user: 
    for c in Model.context.tolist():
        top_k = Model.predict_topk(u,k=10,context=c).tolist()
        results.append([u,c]+top_k)
recommendationA = pd.DataFrame(np.array(results),columns=['user_id','context','top1', 'top2', 'top3', 'top4', 'top5', 'top6', 'top7', 'top8', 'top9', 'top10'])

In [34]:
# show the head recommendations - the user_ids and the song_ids can be matched and tested with the user
recommendationA.head()

Unnamed: 0,user_id,context,top1,top2,top3,top4,top5,top6,top7,top8,top9,top10
0,9999023,morning,3766763,14525574,136864896,134533230,1123690,114313214,916259,109178544,102333186,7704004
1,9999023,late_night,101215842,121275004,1575417,120359044,59509561,119305454,75867425,135072090,134443938,133585096
2,9999023,afternoon_evening,13789079,638665,2435238,70697363,134521528,66657814,136889408,107544238,119374310,793226
3,9999023,no_context,100651306,136590548,235469,130005752,112662270,136332796,118476608,131998704,75867417,136889426
4,9999022,morning,1123690,62376277,107980710,102333186,102333188,136332796,101215842,127376739,127247619,136864892
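The cells above produce the unfiltered lists; the "serendipity-filter" itself is not shown. A minimal sketch of the idea, assuming the filter simply caps how many songs of the same genre may appear in the final list, could look as follows. The function `diversify`, the song labels and the genre mapping are all hypothetical.

```python
def diversify(ranked_items, genre_of, max_per_genre=1, k=3):
    """Walk the ranked list and keep an item only while its genre is under quota."""
    kept, counts = [], {}
    for item in ranked_items:
        genre = genre_of[item]
        if counts.get(genre, 0) < max_per_genre:
            kept.append(item)
            counts[genre] = counts.get(genre, 0) + 1
        if len(kept) == k:
            break
    return kept


# Hypothetical top-5 list (best first) and a made-up genre lookup
ranked = ["song_a", "song_b", "song_c", "song_d", "song_e"]
genre_of = {
    "song_a": "rap",
    "song_b": "rap",
    "song_c": "pop",
    "song_d": "rap",
    "song_e": "rock",
}

filtered = diversify(ranked, genre_of, max_per_genre=1, k=3)
print(filtered)
```

Because the list is traversed in rank order, the best-scored song of each genre survives and lower-ranked songs of the same genre are skipped. Raising `max_per_genre` trades diversity for predicted relevance.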


### Cross-features classifier validation

**The classifier was validated in a classical way.** We first performed a grid search to identify the best-performing model. Below we also show the accuracy achieved in each cross-validation fold.

In [21]:
model.grid_search_cv()

In [22]:
model.cross_val_accuracy()

>>> Mean of the accuracy of the model over all folds:
0.608692860812


array([ 0.61016949,  0.60795455,  0.60795455])

It is worth noting that the accuracy barely fluctuates across folds. This is ideal, since it suggests that the features and cross-features that we have found generalize properly.
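What `model.grid_search_cv()` and `model.cross_val_accuracy()` do internally is not shown here. The sketch below illustrates the same two steps with scikit-learn, assuming a logistic regression whose regularization strength `C` is tuned with 3-fold cross-validation. The parameter grid and the synthetic data are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for the one-hot matrix: 300 rows, 6 binary features
X_toy = rng.integers(0, 2, size=(300, 6)).astype(float)
# Target depends on the first two features plus noise
y_toy = (X_toy[:, 0] + X_toy[:, 1] + rng.normal(0, 0.5, 300) > 1).astype(int)

# Step 1: grid search over the regularization strength (assumed grid)
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=3,
    scoring="accuracy",
)
search.fit(X_toy, y_toy)

# Step 2: accuracy of the selected model on each of the 3 folds
fold_scores = cross_val_score(search.best_estimator_, X_toy, y_toy, cv=3, scoring="accuracy")
print(search.best_params_, fold_scores.mean(), fold_scores)
```

A small spread between the entries of `fold_scores` is the kind of stability described above. Note that scoring the selected model on the same folds used for the search gives a slightly optimistic estimate; a held-out test set would be stricter.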

---

## Conclusions: what are the shortcomings of the approach?

**Adding time into the model proved to be more difficult than expected.** We point to two main reasons for this. **First, it introduced really high sparseness in our models**. The input matrix already had many missing entries without splitting the data into different time buckets; asking for (user, time) combinations only worsened the problem. Hence, estimating all (user, time) combinations jointly was prohibitive with our computing resources. Moreover, with such a high degree of sparseness, convergence is not guaranteed. **Second, we had to resort to other metrics to evaluate performance; accuracy was not the best guidance in this context.** By incorporating different time profiles per user, we were at odds with increasing accuracy. However, we did accomplish what we intended: a RecSys that outputs a different top-k list for each time of the day. To better understand the quality of this list we surveyed our close friends to get a qualitative feel for our recommendations. **Even though the accuracy was 50%, we were able to generate a different list of recommendations for each time of the day.** Personally, I did like the recommendation of starting the day with Bon Jovi and then relaxing in the evening with Happy by Pharrell.

**The main shortcoming of our approach is that we discard information**. We do so by narrowing the data to only the data points of the time period for which we are developing the recommendations. It could be the case that a user in the morning is more similar to a user at night (going to the gym in the morning vs. at night), and hence we are discarding the possibility for the model to use this information. **Finally, our use of the audio features seems "too manual"**. We created a classifier on top of our recommender system in order to filter out recommendations which did not have a good chance of success (when also considering audio features and other user characteristics: age, gender, time). However, we are doing so after those songs have already made it into the top-k list. We would prefer a model in which those songs do not make it into that list in the first place.

---