# Welcome to this month's Central London Data Science Meetup!

If you've ever read a data science related blog before, you'll probably have read either:
* 'AI is the new electricity' (I'm looking at you, Andrew Ng)
* '90% of a data scientist's time is spent sourcing and then cleaning the data'

In tonight's notebook we will be delving into predicting the genre of songs using Spotify data. From this you'll see that a data scientist really does spend a lot of time collecting and wrangling data.



So without further ado, let's get started!

In [None]:
import os
import re
import json
import numpy as np 
import pandas as pd 
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

# Sourcing data online

Sourcing data for a project can be tricky. One option is to scour the internet looking for datasets to download. Another option is to use an Application Programming Interface (API).


### APIs
An API allows different applications to share data with one another. By calling an API we gain access to data held on a server.

### So why are these good for a data scientist?

Say you wanted to keep up to date with the National League promotion race (absolute nail-biter). You could search for the hashtag #LeytonOrient and copy and paste each of the tweets into a document. This would take forever.

The next best thing would be to email Twitter and ask for a dataset of the Twitter stream, but this would waste even more people's time.

Instead, companies create access points known as APIs that let you query available data. This cuts down the time needed to collect the data for your project. Great!

### What to expect from an API
Information returned from an API can come in a couple of formats, but the one that we'll be using today is JavaScript Object Notation (JSON).

JSON is ubiquitous throughout the web. It's human-readable, lightweight and can be interpreted by a tonne of languages, including Python.
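
As a small illustration of that last point, here is a sketch of Python's built-in `json` module turning JSON text into native objects and back again. The payload is made up for the example and is not real Spotify output.

```python
import json

# A made-up JSON payload (not real Spotify output), as it might arrive from an API
raw = '{"track": "Example Song", "tempo": 120.5, "genres": ["funk", "jazz"], "explicit": false}'

# json.loads parses JSON text into Python objects
parsed = json.loads(raw)
print(type(parsed))          # JSON objects become dicts
print(parsed["genres"][0])   # JSON arrays become lists
print(parsed["explicit"])    # JSON false becomes Python False

# json.dumps goes the other way, from Python objects back to JSON text
round_trip = json.dumps(parsed, indent=2, sort_keys=True)
print(round_trip)
```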


# Sourcing our data
So now that we know what an API is and what to expect from it, the next hurdle is understanding how to get the data and what it'll contain.

With this in mind let's jump over to the Spotify API documentation: https://developer.spotify.com/documentation/web-api/reference/

In order to source and navigate the data needed for a project you will need to understand what each key and value represents, so get used to jumping between your code and the API documentation.

Have a look across all the documentation for yourself (it might give you a bit of inspiration for your own project).

For our project we will be living on these two pages: 

* https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-analysis/

* https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/


These pages are incredibly useful: we get not only a description of what the data represent, but also a map of how to access it.

In the interest of time (and to make sure we don't bring CodeNode's internet crashing down), all the data for tonight's exercise is stored locally in the environment.

However, at the end of the exercise we'll have a mini tutorial on how to call an API by yourself.

Let's jump into a single song's audio analysis.


In [None]:
single_song_path = '../input/singlesong/single_song.json'
with open(single_song_path, 'rb') as f:
    song_json = json.load(f)
# print(song_json)

Kinda gross, right? Let's make this a bit easier on the eyes.

Try using the `json.dumps()` function.

In [None]:
# print(json.dumps(song_json, indent=4, sort_keys=True))

Still a bit intimidating though...

When we load in a JSON object, Python interprets it as a dictionary. So let's use some Python 101 to navigate the structure.

A dictionary consists of a collection of key-value pairs. Each key-value pair maps the key to its associated value.

You can build a dictionary by pairing each `key` with its associated `value` using a colon `:`, and wrapping the pairs in curly braces `{}`.

Let's build a simple dictionary

In [None]:
dictionary = {'dog': 'woof',
              'cat': 'meow',
              'lazer': 'zapppp',
              'list_of_things': ['a', 3, dict()],
              'numbers': 10012}
dictionary

Here we have a dictionary built of key-value pairs. You can access each value stored in the dictionary using the keys.

In [None]:
dictionary.keys()

Here's how we would access the value associated with the key `dog`

In [None]:
dictionary['dog']

Now the super cool thing about dictionary objects is that you can store a tonne of information in a variety of formats. Let's check out `list_of_things`.

In [None]:
dictionary['list_of_things']

It returns a `list`. We can either assign this value to a variable and work with it elsewhere, like so...

In [None]:
list_of_things = dictionary['list_of_things']
list_of_things[0]

Or access the list object directly...

In [None]:
dictionary['list_of_things'][0]
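
One thing to watch for when chaining lookups like this: a missing key raises a `KeyError`. Here is a minimal sketch of a more forgiving pattern using `dict.get()`. The `animal_sounds` dictionary and the `'cow'` key are invented for the example.

```python
animal_sounds = {'dog': 'woof',
                 'cat': 'meow',
                 'list_of_things': ['a', 3, dict()]}

# Square brackets raise a KeyError if the key is missing
try:
    animal_sounds['cow']
except KeyError as err:
    print('Missing key:', err)

# .get() returns None (or a default of your choice) instead of raising
cow = animal_sounds.get('cow')
cow_with_default = animal_sounds.get('cow', 'unknown')
print(cow, cow_with_default)

# .items() lets you loop over keys and values together
for key, value in animal_sounds.items():
    print(key, '->', value)
```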

Cool! So now that we have the basics of how to navigate our way around a dictionary, let's put it to use.

Here are the keys to our audio analysis

In [None]:
song_json.keys()

Let's look at the metadata values...

In [None]:
song_json['meta']

OK, this stuff is a bit boring, let's get into the data that we'll be using for our model.
Head on over to the [documentation](https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-analysis/) and check out what `segments` represents.

**"Audio segments attempts to subdivide a song into many segments, with each segment containing a roughly consistent sound throughout its duration."**

In [None]:
song_json['segments']

`segments` returns a list of dictionaries, and each of these dictionaries contains numeric values and lists.

Great, our dictionary contains a list of other dictionaries, that isn't confusing at all...

What we want is the timbre values.

**Timbre is the quality of a musical note or sound that distinguishes different types of musical instruments, or voices. Timbre vectors are best used in comparison with each other.**

In [None]:
# Access the first element (dictionary) in the list
first_segment = song_json['segments'][0]
first_segment

In [None]:
# Then access the timbre values
first_segment['timbre']

We can combine the two lines and pass it through a for loop to get all the timbre values across each segment for this song...


In [None]:
song_timbres = []
for segment in song_json['segments']:
    song_timbres.append(segment['timbre'])
song_timbres

Neat, we've just navigated and stored the relevant data for our model.
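
The same loop can also be written as a list comprehension, which is a common Python idiom. Here is a small sketch on a hand-made stand-in for `song_json`. The numbers are invented, and the vectors are shortened to three values to keep the example readable.

```python
# Hand-made stand-in for the audio analysis, with shortened timbre vectors
toy_song = {'segments': [{'start': 0.0, 'duration': 0.5, 'timbre': [10.1, -3.2, 7.7]},
                         {'start': 0.5, 'duration': 0.4, 'timbre': [22.4, 15.0, -8.9]}]}

# One line instead of the loop-and-append pattern
toy_timbres = [segment['timbre'] for segment in toy_song['segments']]
print(toy_timbres)
```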

However, this is only one song. We'll need a few more samples to play with.

We've collected about 50 songs from some well known genres:
* jazz
* blues
* funk
* metal
* classical
* hiphop
* pop
* electronic

And streamlined the data to include the information we'll need for the rest of the exercise.

Let's use our knowledge of dictionary navigation to wrangle the data into a DataFrame so we can model it.

Within the `musicdata` folder we have a few JSON files, each one contains songs and their segment data from a single genre.

In [None]:
training_data_path = '../input/musicdata/'
os.listdir(training_data_path)

We want to combine all these data sources into one big DataFrame.

Each one has the same structure, so let's look at an example of how to get the data.

In [None]:
with open(os.path.join(training_data_path,'hiphop.json'), 'rb') as f:
    hiphop = json.load(f)

In [None]:
# Each key is a unique identifier for a song known as a URI
hiphop.keys()

In [None]:
hiphop['spotify:track:3MnwLa9KRUiv2gNFtWPvib'].keys()

In [None]:
hiphop['spotify:track:3MnwLa9KRUiv2gNFtWPvib']['artist']

In [None]:
hiphop['spotify:track:3MnwLa9KRUiv2gNFtWPvib']['song']

In [None]:
hiphop['spotify:track:3MnwLa9KRUiv2gNFtWPvib']['meta'].keys()

In [None]:
hiphop['spotify:track:3MnwLa9KRUiv2gNFtWPvib']['meta']['segments'][0]

To give you an idea of the end goal of this wrangling, check out what the DataFrame will look like later...

![](https://github.com/Blair-Young/PredictingMusicGenresFromSpotifyData/blob/master/images/Screen%20Shot%202019-04-21%20at%2015.43.03.png?raw=true)

Let's create a couple of functions to extract this data.

In [None]:
def get_song_name(json_data, song_uri):
    '''Returns song name from song URI key
     Args:
     * json_data- (JSON) 
     * song_uri- (str) URI
     
     Return:
     * (str)- Song name
     '''
    return json_data[song_uri]['song']

def get_artist_name(json_data, song_uri):
    '''Returns Artist name from song URI key
     Args:
     * json_data- (JSON) 
     * song_uri- (str) URI
     
     Return:
     * (str)- Artist name
     '''

    return json_data[song_uri]['artist']

def get_timbre_values(json_data, song_uri):
    '''Returns timbre values from a song
    Args:
    * json_data- (JSON) 
    * song_uri- (str) URI 
    
    Return:
    * (list) Each element is a list of timbre values
    '''
    timbre_data = []
    for segment in json_data[song_uri]['meta']['segments']:
        timbre_data.append(segment['timbre'])
    return timbre_data

def get_segment_start_time(json_data, song_uri):
    '''Returns start times of segments from a song
    Args:
    * json_data- (JSON) 
    * song_uri- (str) URI 
    
    Return:
    * (list) Each element is a float representing the segment start time in seconds
    '''
    start_times = []
    for segment in json_data[song_uri]['meta']['segments']:
        start_times.append(segment['start'])
    return start_times

def get_segment_duration(json_data, song_uri):
    '''Returns duration of segments from a song
    Args:
    * json_data- (JSON) 
    * song_uri- (str) URI 
    
    Return:
    * (list) Each element is float representing the duration of a segment
    '''
    durations = [] 
    for segment in json_data[song_uri]['meta']['segments']:
        durations.append(segment['duration'])
    return durations

We have our extraction functions, so let's get extracting!

Everything that we return from these functions should go straight into a pandas DataFrame.

For anyone who hasn't used a pandas DataFrame, it's basically a table similar to an Excel/Sheets spreadsheet.
If you want more information about them, check out https://pandas.pydata.org/
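
If that still feels abstract, here is a tiny sketch of the pattern we're about to use: build a DataFrame from a list of lists, then add columns to it. The values are invented, and the vectors are shortened to three numbers to keep things readable.

```python
import pandas as pd

# Each inner list becomes a row; columns are numbered 0, 1, 2 by default
toy_timbres = [[10.1, -3.2, 7.7],
               [22.4, 15.0, -8.9]]
toy_df = pd.DataFrame(toy_timbres)

# Assigning a list adds one value per row
toy_df['start'] = [0.0, 0.5]
# Assigning a single value repeats it on every row
toy_df['genre'] = 'funk'

print(toy_df)
```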

In [None]:
def get_genre_data(genre_data, genre_type):
    '''
    Processes a JSON object of a single genre
    Args:
    * genre_data (JSON) Songs from a single genre, keyed by URI
    * genre_type (str) Name of genre
    
    Returns:
    * pandas DataFrame containing training data and label for ML
    '''
    genre_dataframes = []
    for song_uri in genre_data.keys():
        # Extract the relevant data
        timbres = get_timbre_values(genre_data, song_uri)
        start_times = get_segment_start_time(genre_data, song_uri)
        durations = get_segment_duration(genre_data, song_uri)
        artist_name = get_artist_name(genre_data, song_uri)
        song_name = get_song_name(genre_data, song_uri)
        # Create a dataframe per song
        # We'll build the timbre parts first then add columns
        song_df = pd.DataFrame(timbres)
        song_df['start'] = start_times
        song_df['durations'] = durations
        song_df['song_name'] = song_name
        song_df['artist'] = artist_name
        # Remember to add the genre so we can use it for supervised learning later!
        song_df['genre'] = genre_type
        # Now we need to store/append all the songs in a genre dataframe
        genre_dataframes.append(song_df)
    # Now concatenate the song dataframes into a single genre specific dataframe
    genre_df = pd.concat(genre_dataframes)
    return genre_df

In [None]:
get_genre_data(hiphop, 'hiphop').head()

Let's scale it up so we can use it across all genres.

In [None]:
genre_data_path = '../input/musicdata/'
genre_list = os.listdir(genre_data_path)

In [None]:
all_genre_list = []
for genre in genre_list:
    # Get rid of the pesky .DS_Store files with this clause
    if not genre.endswith('.DS_Store'):
        path = os.path.join(genre_data_path, genre)
        with open(path, 'rb') as f:
            genre_json = json.load(f)
        # Extract the genre from the file name    
        genre_label = genre.replace('.json', '')   
        # Apply our function
        genre_data = get_genre_data(genre_json, genre_label)
        all_genre_list.append(genre_data)

df = pd.concat(all_genre_list)
df.head()
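
As an aside, checking for `.DS_Store` works for this dataset, but the loop would still try to parse any other stray file. A slightly more defensive sketch, assuming every genre file ends in `.json`, is to ask for JSON files explicitly with `pathlib`. The temporary folder and its file contents are stand-ins so the snippet runs anywhere.

```python
import json
import tempfile
from pathlib import Path

# Build a throwaway folder that mimics the layout (contents are made up)
tmp_dir = Path(tempfile.mkdtemp())
(tmp_dir / 'jazz.json').write_text(json.dumps({'spotify:track:fake': {'song': 'Toy'}}))
(tmp_dir / 'funk.json').write_text(json.dumps({}))
(tmp_dir / '.DS_Store').write_text('not json')

# glob('*.json') only matches the files we want; sorted() makes the order predictable
genre_files = sorted(tmp_dir.glob('*.json'))

loaded = {}
for path in genre_files:
    with open(path) as f:
        # path.stem is the file name without the extension, e.g. 'jazz'
        loaded[path.stem] = json.load(f)

print(list(loaded))
```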

Let's tidy the column names up.

Check out this page on what the timbre values correspond to:

https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-analysis/#timbre

From this we know what some of the values correspond to, but not all of them...

In [None]:
df.rename(columns={i: 'timbre_value_'+str(i) for i in range(0,12)}, inplace=True)

Rename the first three timbre values according to the documentation

In [None]:
df.rename(columns={'timbre_value_0':'loudness',
                   'timbre_value_1': 'brightness',
                   'timbre_value_2': 'flatness'}, inplace=True)

Now we'll wrap all this in a function. 

In [None]:
def get_genre_df(genre_data_path):
    genre_list = os.listdir(genre_data_path)
    all_genre_list = []
    for genre in genre_list:
        # Get rid of the pesky .DS_Store files with this clause
        if not genre.endswith('.DS_Store'):
            path = os.path.join(genre_data_path, genre)
            with open(path, 'rb') as f:
                genre_json = json.load(f)
            # Extract the genre from the file name    
            genre_label = genre.replace('.json', '')   
            # Apply our function
            genre_data = get_genre_data(genre_json, genre_label)
            all_genre_list.append(genre_data)

    df = pd.concat(all_genre_list)
    df.rename(columns={i: 'timbre_value_'+str(i) for i in range(0,12)}, inplace=True)
    df.rename(columns={'timbre_value_0':'loudness',
                   'timbre_value_1': 'brightness',
                   'timbre_value_2': 'flatness'}, inplace=True)
    return df

    

In [None]:
df = get_genre_df(genre_data_path)
df.head()

**Check out the last line of the documentation**

* `Timbre vectors are best used in comparison with each other.`

Looks like they've already been normalised for us, thanks Spotify.

This means we can get straight into the machine learning (about time!)

One more thing (I promise). We'll separate the pop music from the rest of the data, then later use our model to break the pop songs down into the genres it knows.

In [None]:
# .copy() so we can safely add a column to df_pop later without a SettingWithCopyWarning
df_pop = df[df['genre']=='pop'].copy()
df = df[df['genre']!='pop']

In [None]:
training_colummns = ['loudness', 'brightness', 'flatness', 'timbre_value_3',
                     'timbre_value_4', 'timbre_value_5', 'timbre_value_6', 'timbre_value_7',
                     'timbre_value_8', 'timbre_value_9', 'timbre_value_10',
                     'timbre_value_11']
X = df[training_colummns]
y = df['genre']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

# Train a logistic regression model

In [None]:
clf = LogisticRegression(random_state=0, solver='lbfgs',
                         multi_class='multinomial')
clf.fit(X_train, y_train)
y_pred_log_reg = clf.predict(X_test)

In [None]:
print('f1 score {}'.format(f1_score(y_test, y_pred_log_reg, average='weighted')))
print('recall score {}'.format(recall_score(y_test, y_pred_log_reg, average='weighted')))
print('precision score {}'.format(precision_score(y_test, y_pred_log_reg, average='weighted')))

* Not amazing performance here, let's see which genres the model is having problems with...

In [None]:
{key:value for key, value in zip(sorted(df['genre'].unique()), f1_score(y_test, y_pred_log_reg, average=None))}


In [None]:
log_reg_results = pd.DataFrame({'y_Actual':y_test,
                        'y_Predicted':y_pred_log_reg})
confusion_matrix_log_reg = pd.crosstab(log_reg_results['y_Actual'], log_reg_results['y_Predicted'], rownames=['Actual'], colnames=['Predicted'], margins = True)
confusion_matrix_log_reg
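
To make the link between this table and the scores above concrete, here is a sketch of how per-class precision, recall and F1 fall out of the counts. The labels are a toy example invented for illustration, not our results.

```python
from collections import Counter

# Toy labels, invented for illustration
actual    = ['funk', 'funk', 'funk', 'hiphop', 'hiphop', 'jazz']
predicted = ['funk', 'hiphop', 'hiphop', 'hiphop', 'hiphop', 'jazz']

# Count each (actual, predicted) pair: these are the confusion matrix cells
cells = Counter(zip(actual, predicted))

def scores(label):
    '''Return (precision, recall, f1) for one class from the confusion counts.'''
    true_pos = cells[(label, label)]
    predicted_as = sum(n for (a, p), n in cells.items() if p == label)  # column total
    actually_is = sum(n for (a, p), n in cells.items() if a == label)   # row total
    precision = true_pos / predicted_as if predicted_as else 0.0
    recall = true_pos / actually_is if actually_is else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

for label in sorted(set(actual)):
    print(label, scores(label))
```

In this toy case funk has perfect precision but poor recall, because two of its three segments were labelled hiphop. That is the kind of pattern to look for in the table above.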

# Not great :S
Let's bring in the cavalry...

In [None]:
rf = RandomForestClassifier()
rf.fit(X_train, y_train)

In [None]:
y_pred_rf = rf.predict(X_test)

In [None]:
print('f1 score {}'.format(f1_score(y_test, y_pred_rf, average='weighted')))
print('recall score {}'.format(recall_score(y_test, y_pred_rf, average='weighted')))
print('precision score {}'.format(precision_score(y_test, y_pred_rf, average='weighted')))

In [None]:
{key:value for key, value in zip(sorted(df['genre'].unique()), f1_score(y_test, y_pred_rf, average=None))}


In [None]:
y_pred_rf = rf.predict(X_test)
results_rf = pd.DataFrame({'y_Actual':y_test,
                           'y_Predicted':y_pred_rf})


In [None]:
confusion_matrix_rf = pd.crosstab(results_rf['y_Actual'], results_rf['y_Predicted'], rownames=['Actual'], colnames=['Predicted'], margins = True)
confusion_matrix_rf

Looks like the funk class is really letting itself down.

If we look closer we can see our model is having trouble distinguishing between funk and hiphop.

These genres are quite similar in terms of timbre. A lot of the samples used in hiphop are derived from funk, so we'll let it slide for now.

Maybe we can introduce new features or more data later to boost our performance.

But right now, we'll continue with the project...

# Now that the model is trained, we can test it on pop songs

Create a predicted genre column for our pop dataframe

In [None]:
pop_timbre = df_pop[training_colummns]
df_pop['predicted_genre'] = rf.predict(pop_timbre)

In [None]:
df_pop['song_name'].unique()

We'll pick a song from the list and see the breakdown of the composition

In [None]:
pop_song = df_pop[df_pop['song_name']=='CHopstix (with Travis Scott)']

In [None]:
pop_song['predicted_genre'].value_counts().plot(kind='bar')

plt.title('Genre Composition for CHopstix by ScHoolboy Q with Travis Scott')
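
If you'd rather see proportions than raw segment counts, `value_counts` accepts `normalize=True`. Here is a quick sketch on made-up predictions (not the real output for this song):

```python
import pandas as pd

# Made-up segment predictions, standing in for pop_song['predicted_genre']
toy_predictions = pd.Series(['hiphop', 'hiphop', 'electronic', 'funk', 'hiphop'])

# normalize=True returns fractions that sum to 1 instead of counts
composition = toy_predictions.value_counts(normalize=True)
print(composition)
```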

In [None]:
plt.rcParams["figure.figsize"] = (10,10)

colors = {'hiphop':'m',
           'funk': 'g',
           'metal': 'k',
           'jazz':'y',
           'blues':'b',
           'classical':'r',
           'electronic': 'C1'}

for segment in range(len(pop_song)):
    prediction = pop_song.iloc[segment]['predicted_genre']
    start = pop_song.iloc[segment]['start']
    duration = pop_song.iloc[segment]['durations']
    plt.hlines(xmin=start, xmax= start+duration, y=1,
               colors=colors[prediction], linewidth= 200)
plt.yticks([])
# Label the x-axis (assigning to plt.xlabel would overwrite the function instead)
plt.gca().set_xlabel('Seconds')
patches = [mpatches.Patch(color=color, label=genre) for genre, color in colors.items()]
plt.legend(handles=patches, bbox_to_anchor=(0.5, -0.05),
           fancybox=True, shadow=True, ncol=7,
           loc='upper center')
plt.title('{} by {}'.format(pop_song.iloc[0]['song_name'],
                            pop_song.iloc[0]['artist']))

plt.show()

This is neat, we can see the majority of the song is made up of hiphop, electronic and a bit of funk.
Check the song out for yourself [here](https://www.youtube.com/watch?v=5xvxgUE_pTA)

Chopstiiiiicks, chopsticks, chopsticks....

Lyrical genius.


Pop music isn't necessarily a genre unto itself; it's what's popular right now. So it's kinda cool that we can see what's influencing current music.

We'll now test out our model with separate data from our original genres and see how it copes.

We have another dataset waiting in the wings: `testsongdata`...

Let's use our original `get_genre_df` function to save some time wrangling.

In [None]:
test_songs = get_genre_df('../input/testsongdata/')

In [None]:
test_songs.head()

In [None]:
# Predict the genre
test_songs_timbre = test_songs[training_colummns]
test_songs['predicted_genre'] = rf.predict(test_songs_timbre)

We'll create a function that picks either a random or a chosen song to process from our `test_songs` DataFrame.

In [None]:
def get_song_data(song_dataframe, genre=None, song_name=None):
    if song_name:
        song_df = song_dataframe[song_dataframe['song_name']==song_name]
        return song_df
    else:
        genre_df = song_dataframe[song_dataframe['genre']==genre]
        random_song = np.random.choice(genre_df['song_name'].unique())
        random_song_data = genre_df[genre_df['song_name']==random_song]
        return random_song_data
        

In [None]:
def get_song_composition_bar(song_data):
    song_data['predicted_genre'].value_counts().plot(kind='bar')
    plt.title('{} by {}'.format(song_data.iloc[0]['song_name'],
                                song_data.iloc[0]['artist']))
    plt.show()

In [None]:
def get_song_composition_timeline(song_data):
    plt.rcParams["figure.figsize"] = (10,10)
    colors = {'hiphop':'m',
               'funk': 'g',
               'metal': 'k',
               'jazz':'y',
               'blues':'b',
               'classical':'r',
               'electronic': 'C1'}

    for segment in range(len(song_data)):
        prediction = song_data.iloc[segment]['predicted_genre']
        start = song_data.iloc[segment]['start']
        duration = song_data.iloc[segment]['durations']
        plt.hlines(xmin=start, xmax= start+duration, y=1,
                   colors=colors[prediction], linewidth= 200)
    plt.yticks([])
    # Label the x-axis (assigning to plt.xlabel would overwrite the function instead)
    plt.gca().set_xlabel('Seconds')
    patches = [mpatches.Patch(color=color, label=genre) for genre, color in colors.items()]
    plt.legend(handles=patches, bbox_to_anchor=(0.5, -0.05),
               fancybox=True, shadow=True, ncol=7,
               loc='upper center')
    plt.title('{} by {}'.format(song_data.iloc[0]['song_name'],
                                song_data.iloc[0]['artist']))

    plt.show()

In [None]:
def song_composition(song_dataframe, genre=None, song_name=None):
    song_data = get_song_data(song_dataframe, genre, song_name=song_name)
    get_song_composition_bar(song_data)
    get_song_composition_timeline(song_data)
    

In [None]:
song_composition(test_songs, 'classical')

Try exploring the test DataFrame and picking a song yourself.

In [None]:
test_songs[test_songs['genre']=='metal']['song_name'].unique()

In [None]:
song_composition(test_songs, genre=None, song_name='My Own Summer (Shove It)')

We'll go through each genre and pick a random song...

In [None]:
genre_types = ['metal','hiphop', 'funk', 'jazz', 'blues', 'classical', 'electronic']
for g in genre_types:
    print(g)
    song_composition(test_songs, g)

Phew...

OK that's the main part of the exercise complete, congratz! 🎉

These visualisations look pretty cool. You can even see the main parts of a song, like the intro, verses, chorus and even bridges!

If you have time to spare, or want to dive a bit deeper be our guest. 🍵🕯

Also, remember the quote we mentioned: `'90% of a data scientist's time is spent sourcing and then cleaning the data'`

In this notebook we wrote `30` lines of code dedicated to ML, the rest (`284` lines) was getting and displaying the data.

That's approx `90.4%` (`284` of `314` lines), pretty close to the `90%` quote.

Here are a few suggestions:

* Try and cluster the data 
* Bring in the start time as a feature used to predict 
* Play around with the original data and see if you can extract other meaningful information for your model
* What could we use to distinguish hiphop from funk better? Maybe time signatures or tempo? If we had a bigger dataset, could we bring in artist name?
* Predict the verse, chorus, bridge and outro of a song or genre
* Try other predictive models
* Tune the hyperparameters to optimise the current random forest
* Explore other APIs
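
As a starting point for the tuning suggestion, here is a rough sketch using scikit-learn's `GridSearchCV`. It runs on synthetic data from `make_classification` so that it is self-contained. The grid values are arbitrary picks rather than recommendations, and you would swap in `X_train` and `y_train` from earlier in the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the timbre features: 12 columns, 3 classes
X_toy, y_toy = make_classification(n_samples=300, n_features=12, n_informative=6,
                                   n_classes=3, random_state=42)

# Arbitrary small grid, just to show the mechanics
param_grid = {'n_estimators': [10, 50],
              'max_depth': [3, None]}

# Try every combination with 3-fold cross-validation, scored on weighted F1
search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid,
                      scoring='f1_weighted',
                      cv=3)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```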


# Just as promised here's a quick tutorial on calling an API
A lot of companies open up their APIs to the public so it's always worth checking their dev pages.

A great example of this is [TFL](https://api.tfl.gov.uk/). The level of detail in this API is outrageous.

If you're just looking to flex your API muscles then have a look at this website: [apilist](https://apilist.fun/)

There's an API that gives us images of dogs, so of course we're going to use that...

https://dog.ceo/dog-api/

The dataset is based on the Stanford Dogs Dataset so hopefully we get some dog images back.

**NOTE** Kaggle notebooks do not support calling an API directly, so if you want to try this code out, just copy it into a local Jupyter notebook/script.



```
# use requests to interact with an API
import requests

# The API path should be stated in the documentation 
api_path = 'https://dog.ceo/api/breeds/image/random'

# Use requests.get to collect data from the access point, then parse the JSON body
r = requests.get(api_path).json()
# Copy the url from the message and enjoy the view
print(r['message'])

```
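
In practice a request can fail, so it is worth checking what came back before using it. Below is a small sketch of that check. It assumes the Dog API wraps its results as a `message` plus a `status` of `success`, so do confirm that against its documentation. The sample payloads and the helper name `extract_image_url` are made up for the example, which means it runs without a network connection.

```python
def extract_image_url(payload):
    '''Return the image URL from a Dog API style response, or raise if the call failed.'''
    if payload.get('status') != 'success':
        raise ValueError('API call failed: {}'.format(payload))
    return payload['message']

# Hand-written payloads in the same shape as r in the tutorial above
ok_payload = {'message': 'https://images.dog.ceo/breeds/example/example.jpg',
              'status': 'success'}
bad_payload = {'message': 'Breed not found', 'status': 'error', 'code': 404}

print(extract_image_url(ok_payload))

try:
    extract_image_url(bad_payload)
except ValueError as err:
    print(err)
```

With the real API you would pass `requests.get(api_path, timeout=10).json()` to the helper. The `timeout` stops the call from hanging forever if the server doesn't respond.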


See you all next time! 👋👋👋👋