In [1]:
import os
import sys

# add the 'src' directory as one where we can import modules
src_dir = os.path.join(os.getcwd(), os.pardir, 'src')
sys.path.append(src_dir)

from visualisation.ipython import *
from features.generate_features import *
from models.predict_model import *
from models.train_model import *
from IPython.display import HTML
import pandas as pd
import matplotlib.pyplot as plt
import seaborn

%matplotlib inline
HTML(input_toggle())

# Spotify and Machine Learning

The music recommendation problem - why is this a challenge:
1. Scale of catalog - over 30 million songs available, how to determine which ones a user will like and listen to next?
2. Repeated consumption - unlike movies, television or books, users will listen to the same or similar tracks over and over, which can generate a lot of noise in the data
3. Music taste tends to be quite idiosyncratic and each user can have _vastly_ different preferences. As an example, some kinds of music that I like:
    * Instrumental guitar solos (e.g. Stevie Ray Vaughan)
    * Classic rock (e.g. AC/DC, Led Zeppelin)
    * '90s rap and hip hop (e.g. Tupac, Nas)
    * Contemporary electronic and EDM (e.g. deadmau5, Daft Punk)
    * Progressive rock (e.g. Porcupine Tree, Dream Theater)
    * More obscure progressive rock (e.g. Ayreon)
    * Classical piano (e.g. Chopin)
    * Jazz (e.g. Miles Davis)
    * Latin American pop rock (e.g. Maná, Juanes)
4. Implicit versus explicit rating - e.g. with Netflix, users explicitly rate movies, but feedback for music is implicit through streaming behaviour (how many times you listen to a song, do you listen to the whole song, etc.)



The diagram below illustrates the stack that Spotify uses to generate the now famous "Discover Weekly" playlist for each of its individual users:
    
![spotify-ecosystem](images/spotify-ecosystem.png)

# How do they do it?

Spotify uses three primary types of modelling techniques to create the "Discover Weekly" playlist and other music recommendation applications/algorithms:

### Collaborative filtering
1. Analysis and comparison of a user's behaviour and preferences versus other users and their preferences
2. Focus is on using user data to make predictions (e.g. number of times a particular song is played, songs saved on playlists, etc.)
3. Aggregate and remove temporal aspects of user data and then create (very large, sometimes sparse) matrices that represent user behaviour across genres, demographics, etc.
4. The resulting matrix has a theoretical size of M×N, where M is the number of users (140 million) and N is the number of songs available (30 million)
5. Basically a lot of matrix multiplication, but the large volume of data and high number of features call for techniques that reduce the problem to a smaller set of latent factors (matrix factorization, probabilistic latent semantic analysis)
6. Spotify uses Luigi (Python) to build complex pipelines of jobs that drive the collaborative filtering workflow
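
To make the matrix factorization idea concrete: at the sizes quoted above, the full matrix would hold 140 million × 30 million = 4.2 × 10^15 cells, so the goal is to approximate it with two much thinner matrices of user and song factors. The sketch below is a minimal illustration using plain alternating least squares on a tiny dense matrix. It is not Spotify's implementation; the function name `factorize`, the toy `plays` matrix and all parameter values are assumptions made for this example.

```python
import numpy as np


def factorize(plays, n_factors=2, n_iters=50, reg=0.1, seed=0):
    """Approximate plays (users x songs) as users @ songs.T via ALS."""
    rng = np.random.default_rng(seed)
    n_users, n_songs = plays.shape
    users = rng.normal(scale=0.1, size=(n_users, n_factors))
    songs = rng.normal(scale=0.1, size=(n_songs, n_factors))
    ridge = reg * np.eye(n_factors)
    for _ in range(n_iters):
        # Hold song factors fixed and solve a ridge regression for the users
        users = np.linalg.solve(songs.T @ songs + ridge, songs.T @ plays.T).T
        # Hold user factors fixed and solve the same problem for the songs
        songs = np.linalg.solve(users.T @ users + ridge, users.T @ plays).T
    return users, songs


# Hypothetical listening data: 1 means the user has played the song.
# Songs 0-2 are "rock", songs 3-4 are "jazz"; user 0 has not heard song 2.
plays = np.array([
    [1, 1, 0, 0, 0],
    [1, 1, 1, 0, 0],
    [0, 0, 0, 1, 1],
    [0, 0, 0, 1, 1],
], dtype=float)

users, songs = factorize(plays)
predicted = users @ songs.T

# Rank the songs user 0 has not played yet by predicted score
unheard = np.flatnonzero(plays[0] == 0)
ranked = unheard[np.argsort(-predicted[0, unheard])]
print("Suggested order for user 0:", ranked)
```

Because user 0 shares two songs with user 1, the unheard rock song should score above the jazz songs. A production system would work on sparse matrices and weight the implicit feedback rather than treat every zero as a dislike.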

### Natural Language Processing
1. Analysis of track metadata, music blogs, forums, social media, etc. to determine what adjectives and descriptions are used for specific songs or artists
2. Categorise into buckets and then assign weights to each song/artist to determine probability that a particular description will be used to describe that item
3. These weighted categorical representations are then converted to a vector representation of a song/artist, which can then be used for comparison to determine similarity between songs or artists
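
The comparison step can be sketched with cosine similarity, one common way to compare weighted term vectors (the source does not say which measure Spotify uses). The artist names, terms and weights below are made up for illustration, and each vector is stored as a sparse dict of term to weight.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    shared_terms = set(a) & set(b)
    dot = sum(a[term] * b[term] for term in shared_terms)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty description is similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical term weights: how strongly each description is
# associated with each artist.
artist_terms = {
    "artist_a": {"guitar": 0.8, "blues": 0.6, "live": 0.2},
    "artist_b": {"guitar": 0.7, "rock": 0.5, "live": 0.3},
    "artist_c": {"piano": 0.9, "romantic": 0.6},
}

sim_ab = cosine_similarity(artist_terms["artist_a"], artist_terms["artist_b"])
sim_ac = cosine_similarity(artist_terms["artist_a"], artist_terms["artist_c"])
print(f"a vs b: {sim_ab:.2f}, a vs c: {sim_ac:.2f}")
```

Artists that share heavily weighted terms score close to 1, while artists with no terms in common score 0.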

### Raw audio analysis
1. Used primarily to include new songs that might not have enough social media/mainstream data or user playtime data
2. Analysis of raw features of the audio track (e.g. key, tempo, valence, duration, time signatures, verse/chorus frequency and breakdown, mode, etc.)
3. Spotify uses convolutional neural networks (CNNs) to analyse time-frequency representations of audio frames (basically excerpts of songs) and compute statistics of individual songs based on more musically objective factors
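
As a rough illustration of what a "time-frequency representation" looks like, the sketch below computes a plain short-time Fourier transform magnitude spectrogram with NumPy. This is only one possible representation and is an assumption here, since the source does not say which one Spotify feeds to its networks. The synthetic 440 Hz tone, the frame size and the function name `magnitude_spectrogram` are chosen purely for the example.

```python
import numpy as np


def magnitude_spectrogram(signal, frame_size=256, hop=128):
    """Split the signal into overlapping windowed frames and take |FFT|."""
    window = np.hanning(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop
    frames = np.stack([
        signal[i * hop:i * hop + frame_size] * window
        for i in range(n_frames)
    ])
    # Rows are time frames, columns are frequency bins
    return np.abs(np.fft.rfft(frames, axis=1))


# One second of a synthetic 440 Hz tone stands in for a real audio excerpt
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 440 * t)

spec = magnitude_spectrogram(tone)
frame_size = 256
peak_bin = spec.mean(axis=0).argmax()
peak_hz = peak_bin * sample_rate / frame_size
print("Spectrogram shape (frames, bins):", spec.shape)
print("Strongest frequency bin is near", peak_hz, "Hz")
```

The resulting 2D array can be treated like an image, which is what makes convolutional networks a natural fit for this kind of input.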

# Can we do better than Spotify?

The correct answer here is "probably not," but that doesn't mean we shouldn't try.

How about using classification models to determine whether a song is "good" or "bad" based on an individual's preferences? We will need a way to access Spotify data first.

In [None]:
# load features CSV
filepath = os.path.join(os.getcwd(), os.pardir, 'data', 'features.csv')
df = pd.read_csv(filepath, sep=',')

In [None]:
df.head()

In [None]:
# as_matrix() was removed in pandas 1.0; to_numpy() returns the same NumPy array
df.to_numpy()
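
Once the features are available as a NumPy array, a "good" versus "bad" classifier could look something like the sketch below. It is a minimal logistic regression trained by gradient descent on synthetic data. The two feature columns, the labels and the helper names `train_logistic_regression` and `predict` are all hypothetical, since the actual columns of `features.csv` are not shown here.

```python
import numpy as np


def train_logistic_regression(X, y, lr=0.5, n_iters=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = np.zeros(X.shape[1])
    bias = 0.0
    for _ in range(n_iters):
        probs = 1.0 / (1.0 + np.exp(-(X @ weights + bias)))
        error = probs - y
        weights -= lr * (X.T @ error) / len(y)
        bias -= lr * error.mean()
    return weights, bias


def predict(X, weights, bias):
    """Return 1 ("good") where the model's probability exceeds 0.5."""
    return (X @ weights + bias > 0).astype(int)


# Synthetic stand-in for the feature matrix: two hypothetical audio
# features per song, with liked songs clustered around higher values.
rng = np.random.default_rng(0)
liked = rng.normal(loc=[0.8, 0.7], scale=0.1, size=(50, 2))
disliked = rng.normal(loc=[0.3, 0.2], scale=0.1, size=(50, 2))
X = np.vstack([liked, disliked])
y = np.array([1] * 50 + [0] * 50)

weights, bias = train_logistic_regression(X, y)
accuracy = (predict(X, weights, bias) == y).mean()
print(f"Training accuracy: {accuracy:.2f}")
```

Real listening data will be far noisier than these two well-separated clusters, so accuracy should be measured on held-out songs rather than on the training set.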

# Further reading

[Spotify’s Discover Weekly: How machine learning finds your new music](https://hackernoon.com/spotifys-discover-weekly-how-machine-learning-finds-your-new-music-19a41ab76efe)

[Making Your Own Discover Weekly](https://towardsdatascience.com/making-your-own-discover-weekly-f1ac7546fedb)

[Algorithmic Music Recommendations at Spotify](https://www.slideshare.net/MrChrisJohnson/algorithmic-music-recommendations-at-spotify/16-Section_nameMonday_January_13_1416)

[Collaborative Filtering at Spotify](https://www.slideshare.net/erikbern/collaborative-filtering-at-spotify-16182818/63)