# Spotify Track Audio Analysis by Popular Genre: Data Collection and Cleaning
By Jason Ku

# Introduction
The [Spotify Web API](https://developer.spotify.com/documentation/web-api/reference-beta/#endpoint-get-several-tracks) is a pretty nifty API that lets developers access information about users, albums, playlists, artists, tracks, and more. One particularly interesting endpoint is [Get Audio Features for a Track](https://developer.spotify.com/console/get-audio-features-track/?id=06AKEBrKUckW0KREUWRnvT), which returns numerical measurements of a song's audio characteristics, such as danceability, energy, and tempo.

The data is pretty strange, but if the endpoint exists, it must be useful for something, right? Could we, for example, use Spotify's audio features to predict the genre of a song?

# Data Collection
The first step is to register an app with Spotify to obtain a client ID and client secret, which we then exchange for an access token. The access token lets us make requests to the Spotify Web API.

In [0]:
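import base64
import os

import requests

# Read the client id and secret from environment variables instead of
# hard-coding credentials in the notebook (the variable names are an assumption)
client_id = os.environ["SPOTIFY_CLIENT_ID"]
client_secret = os.environ["SPOTIFY_CLIENT_SECRET"]

# Create base64 encoding of client id and secret for POST request header
b64_auth_str = base64.b64encode((client_id + ":" + client_secret).encode()).decode()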

# Make the actual request and get an access token using client credentials
response = requests.post('https://accounts.spotify.com/api/token',
                         headers={'Authorization': 'Basic ' + b64_auth_str},
                         data={'grant_type': 'client_credentials'})

# Fail early on an HTTP error, then store the access token for future use
response.raise_for_status()
access_token = response.json()['access_token']
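
The header construction in the cell above can be wrapped in a small helper, which makes it easy to check in isolation without hitting the network. This is only a minimal sketch: the function name `basic_auth_header` and the placeholder credentials are hypothetical and not part of the original notebook.

```python
import base64


def basic_auth_header(client_id, client_secret):
    """Build the HTTP Basic Authorization header for the token request.

    Hypothetical helper, not part of the original notebook.
    """
    raw = f"{client_id}:{client_secret}".encode()
    return {"Authorization": "Basic " + base64.b64encode(raw).decode()}


# Placeholder credentials, for illustration only
header = basic_auth_header("my-client-id", "my-client-secret")
encoded = header["Authorization"].split(" ", 1)[1]

# Decoding the header value recovers the original "id:secret" pair
decoded = base64.b64decode(encoded).decode()
print(decoded)
```

The returned dictionary could be passed as the `headers` argument of the `requests.post` call that fetches the token.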

Next, we need sample songs from several popular genres. To get them, we use the [Get Recommendations Based on Seeds](https://developer.spotify.com/console/get-recommendations/?seed_artists=4NHQUGzhtTLFvgF5SZesLK&seed_tracks=0c6xIDDpzE81m2q797ordA&min_energy=0.4&min_popularity=50&market=US) endpoint to request 100 songs that Spotify thinks fall under each of our genres (based on the artist, the album, and what other users play).

In [0]:
# Set up our selection of genres to look at, the genre_recs dictionary, and
# a string format for the endpoint to hit
genres = ['classical', 'country', 'electronic', 'hip-hop', 'jazz', 'pop', 'rock']
genre_recs = {}
genre_recs_endpoint = "https://api.spotify.com/v1/recommendations?market=US&seed_genres=%s&limit=100"

# For each genre, we get 100 of their recommended songs and append all of the
# track ids to the genre_recs dictionary
for genre in genres:
  response = requests.get(genre_recs_endpoint % genre,
                          headers={'Authorization': 'Bearer ' + access_token})
  response_json = response.json()

  genre_recs[genre] = []
  for track in response_json['tracks']:
    genre_recs[genre].append(track['id'])
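
Formatting the URL with `%s` works for these simple genre names, but letting a library encode the query string is a little safer if a seed ever contains special characters. The following is a minimal sketch using only the standard library. The helper names `build_recs_url` and `extract_track_ids` are hypothetical, and the sample response is made up to mirror the shape the loop above relies on (a `tracks` list whose items have an `id`); it is not real API output.

```python
from urllib.parse import urlencode

RECS_BASE = "https://api.spotify.com/v1/recommendations"


def build_recs_url(genre, market="US", limit=100):
    """Build the recommendations URL for one seed genre (hypothetical helper)."""
    query = urlencode({"market": market, "seed_genres": genre, "limit": limit})
    return f"{RECS_BASE}?{query}"


def extract_track_ids(response_json):
    """Collect track ids; return an empty list if the 'tracks' key is missing."""
    return [track["id"] for track in response_json.get("tracks", [])]


url = build_recs_url("hip-hop")

# Made-up payload with the same shape as the parsed JSON used above
sample_response = {"tracks": [{"id": "id_a"}, {"id": "id_b"}]}
ids = extract_track_ids(sample_response)

print(url)
print(ids)
```

Using `.get("tracks", [])` means a response without a `tracks` key yields an empty list rather than a `KeyError`, which may or may not be the behavior you want when a request fails.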

Now that we have a dictionary with 100 of the recommended tracks for each genre, we can use the [Get Audio Features for Several Tracks](https://developer.spotify.com/console/get-audio-features-several-tracks/?ids=4JpKVNYnVcJ8tuMKjAj50A,2NRANZE9UCmPAS5XVbXL40,24JygzOLM0EmRQeGtFcIcG) endpoint to get Spotify's audio analysis for each track. Thankfully, the endpoint allows us to get the analyses for up to 100 ids at a time, so we only have to make one API call per genre!

In [0]:
# Set up dictionary to store the audio features for each track in each genre
# and a string format for the endpoint to hit
genre_audio_features = {}
audio_features_endpoint = 'https://api.spotify.com/v1/audio-features?ids=%s'

# For each genre, get the audio features for every track id in genre_recs
for genre in genres:
  genre_audio_features[genre] = []

  # Get the audio features and append the list of analyses to the dictionary
  response = requests.get(audio_features_endpoint % ','.join(genre_recs[genre]),
                          headers={'Authorization': 'Bearer ' + access_token})
  genre_audio_features[genre].append(response.json())
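
One call per genre is enough here because each genre has at most 100 track ids. If you collected more than that, the ids would have to be split into batches of up to 100 before joining them into the `ids` parameter. A minimal sketch of that batching, where `chunked` is a hypothetical helper and the track ids are placeholders:

```python
def chunked(items, size=100):
    """Yield successive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder ids standing in for real Spotify track ids
track_ids = [f"track_{i}" for i in range(250)]

batches = list(chunked(track_ids))

# Each batch becomes one comma-separated `ids` value, i.e. one API call
id_params = [",".join(batch) for batch in batches]

print([len(batch) for batch in batches])  # [100, 100, 50]
```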

# Data Cleaning
At this point, we are done collecting data. Now, we only need to clean it up! Let's convert the `genre_audio_features` dictionary into a dataframe of audio features.

In [0]:
import pandas as pd

df_audio = pd.DataFrame()

# For each genre, build a dataframe of its audio features, label it with the
# genre, and concatenate it with df_audio. pd.json_normalize (pandas 1.0+)
# replaces the deprecated pandas.io.json.json_normalize import.
for genre in genres:
  df = pd.json_normalize(genre_audio_features[genre][0]['audio_features'])
  df['genre'] = genre
  df_audio = pd.concat([df, df_audio])
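
The nested structure being unpacked here is a dictionary of genres, each holding a list with one parsed response, each of which has an `audio_features` list. The sketch below flattens that structure into plain rows using only the standard library. It assumes the endpoint may return a null entry for a track it has no features for, which is worth verifying against the API reference. The helper name `flatten_features` and the sample data are hypothetical.

```python
def flatten_features(genre_audio_features):
    """Turn {genre: [response_json, ...]} into a flat list of row dicts."""
    rows = []
    for genre, responses in genre_audio_features.items():
        for response_json in responses:
            for features in response_json.get("audio_features", []):
                if features is None:
                    # Skip tracks with no audio features (assumed possible)
                    continue
                rows.append({**features, "genre": genre})
    return rows


# Made-up data in the same shape as genre_audio_features
sample = {
    "rock": [{"audio_features": [{"id": "id_a", "tempo": 99.9}, None]}],
    "jazz": [{"audio_features": [{"id": "id_b", "tempo": 120.0}]}],
}

rows = flatten_features(sample)
print(rows)
```

A list of rows like this can be handed directly to `pd.DataFrame(rows)` to get one dataframe with a `genre` column.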

Taking a look at `df_audio`, we see one row of audio features per track, labeled with its genre, which we can analyze and use to train a model later.

In [5]:
df_audio.head()

| | danceability | energy | key | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | type | id | uri | track_href | analysis_url | duration_ms | time_signature | genre |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.579 | 0.946 | 9 | -2.732 | 0 | 0.054 | 0.000128 | 2.4e-05 | 0.335 | 0.4 | 99.937 | audio_features | 6Rt9GlwZEDU0V3vhXrUNqJ | spotify:track:6Rt9GlwZEDU0V3vhXrUNqJ | https://api.spotify.com/v1/tracks/6Rt9GlwZEDU0... | https://api.spotify.com/v1/audio-analysis/6Rt9... | 228053 | 4 | rock |
| 1 | 0.346 | 0.897 | 4 | -5.044 | 1 | 0.0678 | 0.122 | 8.2e-05 | 0.358 | 0.661 | 101.744 | audio_features | 4YyOPaXcxCmpv3c7SQUo5e | spotify:track:4YyOPaXcxCmpv3c7SQUo5e | https://api.spotify.com/v1/tracks/4YyOPaXcxCmp... | https://api.spotify.com/v1/audio-analysis/4YyO... | 183253 | 4 | rock |
| 2 | 0.64 | 0.864 | 7 | -6.576 | 1 | 0.0315 | 0.00832 | 0.0 | 0.123 | 0.7 | 102.026 | audio_features | 42et6fnHCw1HIPSrdPprMl | spotify:track:42et6fnHCw1HIPSrdPprMl | https://api.spotify.com/v1/tracks/42et6fnHCw1H... | https://api.spotify.com/v1/audio-analysis/42et... | 268360 | 4 | rock |
| 3 | 0.361 | 0.97 | 7 | -4.817 | 0 | 0.284 | 0.00169 | 0.00163 | 0.357 | 0.254 | 179.017 | audio_features | 7GonnnalI2s19OCQO1J7Tf | spotify:track:7GonnnalI2s19OCQO1J7Tf | https://api.spotify.com/v1/tracks/7GonnnalI2s1... | https://api.spotify.com/v1/audio-analysis/7Gon... | 282920 | 4 | rock |
| 4 | 0.666 | 0.936 | 7 | -9.919 | 1 | 0.0476 | 0.00244 | 0.086 | 0.153 | 0.776 | 91.577 | audio_features | 0uppYCG86ajpV2hSR3dJJ0 | spotify:track:0uppYCG86ajpV2hSR3dJJ0 | https://api.spotify.com/v1/tracks/0uppYCG86ajp... | https://api.spotify.com/v1/audio-analysis/0upp... | 282907 | 4 | rock |


Let's now save this DataFrame to a CSV file in Google Drive so we can analyze it and train a model on it later.

In [6]:
from google.colab import drive

# Mount the drive
drive.mount('/content/gdrive')

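# Save the file to the drive, using the absolute path of the mount point so
# the call does not depend on the current working directory
df_audio.to_csv('/content/gdrive/My Drive/audio.csv')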

Mounted at /content/gdrive
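
One detail to keep in mind when loading the file back: by default `to_csv` writes the dataframe index as an extra first column with no header, and `read_csv` then names that column `Unnamed: 0`. Here the index also restarts for each genre, because the per-genre frames were concatenated without resetting it. The sketch below shows the difference on a tiny made-up dataframe, using an in-memory buffer instead of Google Drive; it is an illustration, not a change to the notebook's saved file.

```python
import io

import pandas as pd

# Tiny made-up frame standing in for df_audio
df = pd.DataFrame({"tempo": [99.937, 101.744], "genre": ["rock", "rock"]})

# Default behavior: the index is written as an unnamed first column
reloaded_default = pd.read_csv(io.StringIO(df.to_csv()))

# With index=False, only the real columns are written
reloaded_clean = pd.read_csv(io.StringIO(df.to_csv(index=False)))

print(list(reloaded_default.columns))  # ['Unnamed: 0', 'tempo', 'genre']
print(list(reloaded_clean.columns))    # ['tempo', 'genre']
```

Alternatively, passing `index_col=0` to `pd.read_csv` restores the saved index instead of treating it as a data column.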
