In [1]:
import os
import sys

# add the 'src' directory as one where we can import modules
src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)

from visualisation.ipython import *
from visualisation.graphs import *
from features.generate_features import *
from models.predict_model import *
from models.train_model import *
from IPython.display import HTML

%matplotlib inline
HTML(input_toggle())

# Spotify and Machine Learning

Why is music recommendation such a challenging problem?
1. Scale of catalog - with over 30 million songs available, how do we determine which ones a user will like and listen to next?
2. Repeated consumption - unlike movies, television or books, users will listen to the same or similar tracks over and over, which can generate a lot of noise in the data
3. Music taste tends to be quite idiosyncratic and each user can have _vastly_ different preferences. As an example, some kinds of music that I like:
    * Instrumental guitar solos (e.g. Stevie Ray Vaughan)
    * Classic rock (e.g. AC/DC, Led Zeppelin)
    * 90's rap and hip hop (e.g. Tupac, Nas)
    * Contemporary electronic and EDM (e.g. deadmau5, Daft Punk)
    * Progressive rock (e.g. Porcupine Tree, Dream Theater)
    * More obscure progressive rock (e.g. Ayreon)
    * Classical piano (e.g. Chopin)
    * Jazz (e.g. Miles Davis)
    * Latin American pop rock (e.g. Maná, Juanes)
4. Implicit versus explicit rating - e.g. with Netflix, users explicitly rate movies, but feedback for music is implicit through streaming behaviour (how many times you listen to a song, do you listen to the whole song, etc.)



The diagram below illustrates the stack that Spotify uses to generate the now famous "Discover Weekly" playlist for each of its individual users:
    
![spotify-ecosystem](images/spotify-ecosystem.png)

# How do they do it?

Spotify uses three primary types of modelling techniques to create the "Discover Weekly" playlist and other music recommendation applications/algorithms:

### Collaborative filtering
1. Analysis and comparison of a user's behaviour and preferences versus other users and their preferences
2. Focus is on using user data to make predictions (e.g. number of times a particular song is played, songs saved on playlists, etc.)
3. Aggregate and remove temporal aspects of user data and then create (very large, sometimes sparse) matrices that represent user behaviour across genres, demographics, etc.
4. The resulting matrix has a theoretical size of M x N, where M is the number of users (140 million) and N is the number of songs available (30 million)
5. Basically a lot of matrix multiplication, but optimization techniques are needed due to the large volume of data and high number of features (e.g. matrix factorization, probabilistic latent semantic analysis)
6. Spotify uses Luigi (Python) to build complex pipelines of jobs that run the collaborative filtering model workflow
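
To make the matrix factorization idea concrete, here is a minimal sketch on a toy 4 x 4 "has played" matrix. It is not Spotify's implementation: the data is made up, the function name `factorize` and its parameters are hypothetical, and every zero is treated as an observed value, which is a simplification that real implicit-feedback models avoid.

```python
import numpy as np


def factorize(plays, n_factors=2, n_iters=50, reg=0.1):
    """Approximate `plays` as users.dot(songs.T) with alternating least squares."""
    rng = np.random.RandomState(0)  # fixed seed so the sketch is repeatable
    n_users, n_songs = plays.shape
    users = rng.normal(scale=0.1, size=(n_users, n_factors))
    songs = rng.normal(scale=0.1, size=(n_songs, n_factors))
    ridge = reg * np.eye(n_factors)
    for _ in range(n_iters):
        # hold the song factors fixed and solve a ridge regression for the users
        users = np.linalg.solve(songs.T.dot(songs) + ridge, songs.T.dot(plays.T)).T
        # hold the user factors fixed and solve for the songs
        songs = np.linalg.solve(users.T.dot(users) + ridge, users.T.dot(plays)).T
    return users, songs


# toy data (assumption): rows are users, columns are songs, 1 = has played
plays = np.array([
    [1, 1, 0, 0],
    [1, 0, 0, 0],  # this user has not yet played song 1
    [0, 0, 1, 1],
    [0, 0, 1, 1],
], dtype=float)

users, songs = factorize(plays)
scores = users.dot(songs.T)  # predicted affinity for every user/song pair
```

In this toy example the second user's score for song 1 ends up higher than for songs 2 and 3, because a user with overlapping history has played it. That is the basic intuition behind recommending unheard tracks from the factorized matrix.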

### Natural Language Processing
1. Analysis of track metadata, music blogs, forums, social media, etc. to determine what adjectives and descriptions are used for specific songs or artists
2. Categorise into buckets and then assign weights to each song/artist to determine probability that a particular description will be used to describe that item
3. These weighted categorical representations are then converted to a vector representation of a song/artist, which can then be used for comparison to determine similarity between songs or artists
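
As a rough illustration of the last step, weighted descriptors can be compared with cosine similarity. The artists, descriptor terms and weights below are invented for the example, and cosine similarity is only one plausible choice of similarity measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse {term: weight} vectors."""
    terms = set(a) | set(b)
    dot = sum(a.get(term, 0.0) * b.get(term, 0.0) for term in terms)
    norm_a = math.sqrt(sum(weight * weight for weight in a.values()))
    norm_b = math.sqrt(sum(weight * weight for weight in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


# hypothetical descriptor weights for three artists
artist_a = {'guitar': 0.8, 'blues': 0.6, 'live': 0.1}
artist_b = {'guitar': 0.7, 'blues': 0.5, 'rock': 0.3}
artist_c = {'synth': 0.9, 'dance': 0.7}

similar = cosine_similarity(artist_a, artist_b)
different = cosine_similarity(artist_a, artist_c)
```

Artists that share heavily weighted descriptors score close to 1, while artists with no descriptors in common score 0.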

### Raw audio analysis
1. Used primarily to include new songs that might not have enough social media/mainstream data or user playtime data
2. Analysis of raw features of the audio track (e.g. key, tempo, valence, duration, time signatures, verse/chorus frequency and breakdown, mode, etc.)
3. Spotify uses convolutional neural networks (CNNs) to analyse time-frequency representations of audio frames (basically excerpts of songs) and compute statistics of individual songs based on more musically objective factors
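
The sketch below shows what a "time-frequency representation" looks like in practice, using a plain magnitude spectrogram computed with NumPy. It only covers the input representation, not the CNN itself, and the frame size, hop length and synthetic test tone are arbitrary assumptions rather than anything Spotify has documented.

```python
import numpy as np


def spectrogram(signal, frame_size=256, hop=128):
    """Magnitude spectrogram: one row per audio frame, one column per frequency bin."""
    window = np.hanning(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop
    frames = np.array([
        signal[i * hop:i * hop + frame_size] * window  # windowed excerpt of the signal
        for i in range(n_frames)
    ])
    return np.abs(np.fft.rfft(frames, axis=1))


# one second of a synthetic 1000 Hz tone sampled at 8 kHz (stand-in for real audio)
sample_rate = 8000
t = np.arange(sample_rate) / float(sample_rate)
tone = np.sin(2 * np.pi * 1000 * t)

spec = spectrogram(tone)
peak_bin = int(spec.mean(axis=0).argmax())  # strongest frequency bin on average
peak_hz = peak_bin * sample_rate / 256.0    # convert the bin index back to Hz
```

Each row of `spec` describes the frequency content of one short excerpt, so the whole array can be treated like an image and passed to a convolutional model.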

# Can we do better than Spotify?

The correct answer here is "probably not," but that doesn't mean we shouldn't try.

How about using classification models to determine whether a song is "good" or "bad" based on an individual's preferences? We will need a way to access Spotify data first - this can be done via the Spotify web API.

## Classification with Spotify data
Fortunately, there are a variety of libraries available across multiple programming languages that provide wrappers around this API. The library we will be using here is called _spotipy_ (click [here](https://github.com/plamere/spotipy) for more details), which will allow us to extract song data available through the Spotify web API.

For the purposes of showing how the _spotipy_ library works and how machine learning can be used to classify songs, we will be performing binomial classification on two playlists - one playlist will contain only "country" songs and the other will contain "hip hop" songs.

### Step 1 - Create Spotify client object
First, we need to create a Spotify client object in Python using app credentials created via the Spotify Developer website (you can create your own apps, which can be accessed via a "client ID" and "client secret").

In [2]:
filepath = os.path.join(os.getcwd(), os.pardir, '.env')  # specifies path of .env file
sp = access_spotify(filepath)  # returns a Spotify client object
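
The `access_spotify` helper lives in the project's `src` directory and is not shown here. A minimal sketch of what such a helper might do is given below: it reads credentials from a `.env` file and passes them to spotipy's client credentials flow. The function names `load_env` and `make_client` and the key names `SPOTIFY_CLIENT_ID` and `SPOTIFY_CLIENT_SECRET` are assumptions, not the author's actual code.

```python
def load_env(path):
    """Parse KEY=VALUE lines from a .env file into a dictionary."""
    values = {}
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            # skip blank lines, comments and anything that is not KEY=VALUE
            if not line or line.startswith('#') or '=' not in line:
                continue
            key, value = line.split('=', 1)
            values[key.strip()] = value.strip().strip('"\'')
    return values


def make_client(path):
    """Build a spotipy client from credentials stored in a .env file."""
    # imported here so load_env can be used without spotipy installed
    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    env = load_env(path)
    credentials = SpotifyClientCredentials(
        client_id=env['SPOTIFY_CLIENT_ID'],          # assumed key name
        client_secret=env['SPOTIFY_CLIENT_SECRET'])  # assumed key name
    return spotipy.Spotify(client_credentials_manager=credentials)
```

Keeping the credentials in a `.env` file outside version control means the notebook can be shared without exposing the client secret.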

### Step 2 - Get tracks and song data from playlists

Once we have the Spotify client object, we can access playlist data for user playlists created through the Spotify application using a username and a set of playlist IDs. These can be obtained via the application or a web browser running the Spotify web application. A typical playlist URL, which includes both the username and the playlist ID, will look like this:

https://open.spotify.com/user/federicociner/playlist/7ieLWhjh16EPCRV721NyPl

The URLs for the hip-hop and country playlists we will use for this exercise are as follows:
* Hip-hop: https://open.spotify.com/user/federicociner/playlist/74RF9FPZV75ILPK3Vk7hxg
* Country: https://open.spotify.com/user/federicociner/playlist/5NQnP4gJMAAhmgnvYdRyqz
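
If you would rather not copy the IDs by hand, the username and playlist ID can be pulled out of a URL of this shape with a few string operations. This is a small sketch that assumes the `/user/<username>/playlist/<id>` layout shown above; the function name `parse_playlist_url` is made up for the example.

```python
def parse_playlist_url(url):
    """Return (username, playlist_id) from a /user/<name>/playlist/<id> URL."""
    # drop any query string and trailing slash before splitting the path
    parts = url.split('?')[0].rstrip('/').split('/')
    username = parts[parts.index('user') + 1]
    playlist_id = parts[parts.index('playlist') + 1]
    return username, playlist_id


hiphop_url = 'https://open.spotify.com/user/federicociner/playlist/74RF9FPZV75ILPK3Vk7hxg'
user, playlist = parse_playlist_url(hiphop_url)
```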

In order to get song-specific details, we need to extract the song IDs from the items in the playlist array. We will do this by iterating through the "tracks" in each playlist, collecting the track objects into a separate list and then reading the ID of each track.

In [3]:
# set user name and playlist ID variables
username = 'federicociner'
country_playlist_id = '5NQnP4gJMAAhmgnvYdRyqz'
hiphop_playlist_id = '74RF9FPZV75ILPK3Vk7hxg'

# load playlist object
country_playlist = sp.user_playlist(username, country_playlist_id)
hiphop_playlist = sp.user_playlist(username, hiphop_playlist_id)

# get track objects from country playlist object, following pagination links
country_tracks = country_playlist['tracks']
country_songs = country_tracks['items']
while country_tracks['next']:
    country_tracks = sp.next(country_tracks)
    country_songs.extend(country_tracks['items'])

# get track objects from hip-hop playlist object, following pagination links
hiphop_tracks = hiphop_playlist['tracks']
hiphop_songs = hiphop_tracks['items']
while hiphop_tracks['next']:
    hiphop_tracks = sp.next(hiphop_tracks)
    hiphop_songs.extend(hiphop_tracks['items'])

# get unique identifiers for each song in both playlists
country_ids = [song['track']['id'] for song in country_songs]
hiphop_ids = [song['track']['id'] for song in hiphop_songs]

# preview the first five IDs from each playlist
print('Country song IDs:\n')
print('\n'.join(str(track_id) for track_id in country_ids[:5]))
print('\n')
print('Hip-hop song IDs:\n')
print('\n'.join(str(track_id) for track_id in hiphop_ids[:5]))

Country song IDs:

3Ze8nX7ZYOnAgf8A1hDQQ9
7iDa6hUg2VgEL1o1HjmfBn
2OFvjqQ5xHaWqH3EQ4K3S9
6caDnpIOivsjLRXqlJJxCD
2N2EoS0dVmkNT0tfI0OYUA


Hip-hop song IDs:

7wGoVu4Dady5GV0Sv4UIsx
5eqK0tbzUPo2SoeZsov04s
43ZyHQITOjhciSUUNPVRHc
2Xqd0wUttjueBfdcltADOv
6GNifiuBPrKFpwNBYnooFm
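
The pagination loop is identical for both playlists, so it could be factored into a helper. The sketch below shows one way to do that; `collect_items` is a hypothetical name, and `FakeClient` is a stand-in with a `next` method so the example runs without network access or credentials.

```python
def collect_items(client, first_page):
    """Gather the items from a paged result by following its 'next' links."""
    items = list(first_page['items'])  # copy so the original page is left untouched
    page = first_page
    while page['next']:
        page = client.next(page)
        items.extend(page['items'])
    return items


class FakeClient(object):
    """Stand-in for the Spotify client, used only to exercise the helper."""

    def __init__(self, pages):
        self.pages = pages

    def next(self, page):
        return self.pages[page['next']]


# two made-up pages of playlist items
first = {'items': [{'track': {'id': 'a'}}, {'track': {'id': 'b'}}], 'next': 'page2'}
later = {'page2': {'items': [{'track': {'id': 'c'}}], 'next': None}}

songs = collect_items(FakeClient(later), first)
ids = [song['track']['id'] for song in songs]
```

With the real client, the call would look like `collect_items(sp, country_playlist['tracks'])`.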


### Step 3 - Get song features

Using these song IDs, we can extract features for each song via the *audio_features* method in the Spotify client object. This method returns the "audio features" object for each song, which contains all sorts of useful information:
* Technical and structural features such as tempo, key, mode, duration, time signature, etc.
* Derived/descriptive features such as "instrumentalness", "speechiness", "loudness" and valence (a measure of how happy or sad a song sounds)

Further, since we will be using this feature set for binomial classification, we will need to label each observation so that our classifiers are able to distinguish a country song from a hip-hop song during the training phase. To keep things simple, we assign a class of 1 to all country songs and a class of 0 to hip-hop songs (the integer values here are irrelevant as long as they are not the same number).

In [17]:
features = []

# get features for country songs and assign a class of '1'
for i in range(0, len(country_ids), 100):
    audio_features = sp.audio_features(country_ids[i:i + 100])
    for track in audio_features:
        if track is not None:
            features.append(track)
            features[-1]['class'] = 1

# get features for hip-hop songs and assign a class of '0'
for i in range(0, len(hiphop_ids), 100):
    audio_features = sp.audio_features(hiphop_ids[i:i + 100])
    for track in audio_features:
        if track is not None:
            features.append(track)
            features[-1]['class'] = 0
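
The two loops above differ only in the ID list and the label, so the batching and labelling could be wrapped in a single function. The sketch below is one possible shape for it: `labelled_features` and `fake_fetch` are hypothetical names, and `fake_fetch` stands in for `sp.audio_features` so the example runs offline.

```python
def labelled_features(fetch, track_ids, label, batch_size=100):
    """Fetch audio features in batches and tag each row with a class label."""
    rows = []
    for start in range(0, len(track_ids), batch_size):
        for track in fetch(track_ids[start:start + batch_size]):
            if track is not None:  # skip tracks with no features available
                row = dict(track)  # copy so the fetched object is not mutated
                row['class'] = label
                rows.append(row)
    return rows


def fake_fetch(ids):
    """Stand-in for sp.audio_features: returns None for an unknown track."""
    return [None if i == 'missing' else {'id': i, 'tempo': 120.0} for i in ids]


demo_rows = (labelled_features(fake_fetch, ['a', 'missing', 'b'], 1, batch_size=2) +
             labelled_features(fake_fetch, ['c'], 0))
```

With the real client, this would be called as `labelled_features(sp.audio_features, country_ids, 1)`, keeping the batch size of 100 used in the loops above.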

Now that we have all the features for each song in one list, we can take a look at what the data looks like. From the example below, we can see that the feature set for any given song is stored as a dictionary data structure within the list. 

In [21]:
features[-1]

{u'acousticness': 0.177,
 u'analysis_url': u'https://api.spotify.com/v1/audio-analysis/6s4kWgzQwsueo3mfllY1s6',
 'class': 1,
 u'danceability': 0.682,
 u'duration_ms': 297427,
 u'energy': 0.92,
 u'id': u'6s4kWgzQwsueo3mfllY1s6',
 u'instrumentalness': 0.00158,
 u'key': 10,
 u'liveness': 0.0443,
 u'loudness': -1.544,
 u'mode': 1,
 u'speechiness': 0.278,
 u'tempo': 82.228,
 u'time_signature': 4,
 u'track_href': u'https://api.spotify.com/v1/tracks/6s4kWgzQwsueo3mfllY1s6',
 u'type': u'audio_features',
 u'uri': u'spotify:track:6s4kWgzQwsueo3mfllY1s6',
 u'valence': 0.835}

Unfortunately, this is not the easiest type of data to use for our purposes, and therefore we are going to convert it to tabular format using Pandas (and later on, to Numpy

# Further reading

[Spotify’s Discover Weekly: How machine learning finds your new music](https://hackernoon.com/spotifys-discover-weekly-how-machine-learning-finds-your-new-music-19a41ab76efe)

[Making Your Own Discover Weekly](https://towardsdatascience.com/making-your-own-discover-weekly-f1ac7546fedb)

[Algorithmic Music Recommendations at Spotify](https://www.slideshare.net/MrChrisJohnson/algorithmic-music-recommendations-at-spotify/16-Section_nameMonday_January_13_1416)

[Collaborative Filtering at Spotify](https://www.slideshare.net/erikbern/collaborative-filtering-at-spotify-16182818/63)

[Python Spotipy API](https://spotipy.readthedocs.io/en/latest/)

[Spotify Web API](https://developer.spotify.com/web-api/)