# Collecting Data from the Spotify Web API using Spotipy

## About the Spotipy Library:

From the [official Spotipy docs](https://spotipy.readthedocs.io/en/latest/): 
>"Spotipy is a lightweight Python library for the Spotify Web API. With Spotipy you get full access to all of the music data provided by the Spotify platform."


## About using the Spotify Web API:

Spotify offers a number of [API endpoints](https://beta.developer.spotify.com/documentation/web-api/reference/) to access the Spotify data. In this notebook, I used the following:

- [search endpoint](https://beta.developer.spotify.com/documentation/web-api/reference/search/search/) to get the track IDs 
- [audio features endpoint](https://beta.developer.spotify.com/documentation/web-api/reference/tracks/get-several-audio-features/) to get the corresponding audio features.

The data was collected on several days during the months of April, May and August 2018.


## Goal of this notebook:

The goal is to show how to collect audio features data for tracks from the [official Spotify Web API](https://beta.developer.spotify.com/documentation/web-api/) so that it can be used for further analysis and machine learning, which will be part of another notebook.

# 1. Setting Up

The code below is sufficient to set up Spotipy for querying the API endpoints. A more detailed explanation of the whole procedure is available in the [official docs](https://spotipy.readthedocs.io/en/latest/#installation).

In [0]:
!pip install spotipy

Collecting spotipy
  Downloading https://files.pythonhosted.org/packages/d5/da/f6f71a33c99af2a22b3f885d290116d0e963afa095bf77aba4226f88a876/spotipy-2.9.0-py3-none-any.whl
Installing collected packages: spotipy
Successfully installed spotipy-2.9.0


In [0]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

cid = "XXX"  # Client ID
secret = "XXX"  # Client Secret

client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
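
Hard-coding the client ID and secret in a notebook makes them easy to leak when the notebook is shared. A minimal sketch, assuming the credentials are stored in environment variables: the helper name `load_credentials` is hypothetical, while `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` are the variable names Spotipy itself looks for when no credentials are passed explicitly.

```python
import os

def load_credentials(env=None):
    """Return (client_id, client_secret) read from an environment mapping."""
    env = os.environ if env is None else env
    cid = env.get("SPOTIPY_CLIENT_ID")
    secret = env.get("SPOTIPY_CLIENT_SECRET")
    if not cid or not secret:
        raise RuntimeError("Set SPOTIPY_CLIENT_ID and SPOTIPY_CLIENT_SECRET first")
    return cid, secret

# Possible usage with the setup cell above (not executed here):
# cid, secret = load_credentials()
# client_credentials_manager = SpotifyClientCredentials(client_id=cid, client_secret=secret)
```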

# 2. Get the Track ID Data

The data collection is divided into two parts: the track IDs and the audio features. In this step, I'm going to collect around 10,000 track IDs from the Spotify API.

The [search endpoint](https://beta.developer.spotify.com/documentation/web-api/reference/search/search/) used in this step had a few limitations:

- limit: a maximum of 50 results can be returned per query
- offset: the index of the first result to return, so to get the results with the indices 50-99 you need to set the offset to 50, and so on
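
How `limit` and `offset` combine to page through a result set can be sketched without touching the API. The helper `page_offsets` is hypothetical; it only computes the offsets that a paging loop would pass to the search endpoint.

```python
def page_offsets(total, limit=50):
    """Offsets needed to page through `total` results, `limit` per request."""
    return list(range(0, total, limit))

offsets = page_offsets(200, limit=50)
print(offsets)  # [0, 50, 100, 150]
```

With 2000 results per genre and a limit of 50, this gives 40 requests per genre.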

Spotify reduced the maximum offset to 10,000 (apparently around May 2018); I was lucky enough to make my first collection attempt while it was still 100,000.

My solution: using nested for loops, I iterate over the genres and, for each genre, increase the offset by 50 until the chosen maximum offset is reached, sending one query per offset. The innermost loop appends the returned results to the appropriate lists, which I use afterwards to create my dataframe.

In [0]:
# timeit library to measure the time needed to run this code
import timeit
start = timeit.default_timer()

# create empty lists where the results are going to be stored
artist_name = []
track_name = []
popularity = []
track_id = []
genre_name = []

genres = ["acoustic", "afrobeat", "alt-rock", "alternative", "ambient", "anime", "black-metal", "bluegrass", "blues", "bossanova"]

# List of available genres (126) from https://developer.spotify.com/console/get-available-genre-seeds/
# {"genres": 
# ["acoustic", "afrobeat", "alt-rock", "alternative", "ambient", "anime", "black-metal", "bluegrass", "blues", "bossanova", 
# "brazil", "breakbeat", "british", "cantopop", "chicago-house", "children", "chill", "classical", "club", "comedy", 
# "country", "dance", "dancehall", "death-metal", "deep-house", "detroit-techno", "disco", "disney", "drum-and-bass", "dub", 
# "dubstep", "edm", "electro", "electronic", "emo", "folk", "forro", "french", "funk", "garage", 
# "german", "gospel", "goth", "grindcore", "groove", "grunge", "guitar", "happy", "hard-rock", "hardcore", 
# "hardstyle", "heavy-metal", "hip-hop", "holidays", "honky-tonk", "house", "idm", "indian", "indie", "indie-pop", 
# "industrial", "iranian", "j-dance", "j-idol", "j-pop", "j-rock", "jazz", "k-pop", "kids", "latin", 
# "latino", "malay", "mandopop", "metal", "metal-misc", "metalcore", "minimal-techno", "movies", "mpb", "new-age", 
# "new-release", "opera", "pagode", "party", "philippines-opm", "piano", "pop", "pop-film", "post-dubstep", "power-pop", 
# "progressive-house", "psych-rock", "punk", "punk-rock", "r-n-b", "rainy-day", "reggae", "reggaeton", "road-trip", "rock", 
# "rock-n-roll", "rockabilly", "romance", "sad", "salsa", "samba", "sertanejo", "show-tunes", "singer-songwriter", "ska", 
# "sleep", "songwriter", "soul", "soundtracks", "spanish", "study", "summer", "swedish", "synth-pop", "tango", 
# "techno", "trance", "trip-hop", "turkish", "work-out", "world-music"]
# }

for g in genres:
  for i in range(0,2000,50):
      track_results = sp.search(q=f"genre:{g} year:2018", type='track', limit=50, offset=i)
      # iterate directly over the items; reusing `i` here would shadow the offset variable
      for t in track_results['tracks']['items']:
          artist_name.append(t['artists'][0]['name'])
          track_name.append(t['name'])
          track_id.append(t['id'])
          popularity.append(t['popularity'])
          genre_name.append(g)
      

stop = timeit.default_timer()
print('Time to run this code (in seconds):', stop - start)

retrying ...1secs
Time to run this code (in seconds): 30.62036742600003


# 3. EDA + Data Preparation

In the next few cells, I'm going to do some exploratory data analysis as well as data preparation on the newly collected data.

A quick check for the track_id list:

In [0]:
print('number of elements in the track_id list:', len(track_id))

number of elements in the track_id list: 9192


Looks good. Now loading the lists into a dataframe.

In [0]:
import pandas as pd

df_tracks = pd.DataFrame({'artist_name':artist_name,'track_name':track_name,'track_id':track_id,'popularity':popularity,'genre':genre_name})
print(df_tracks.shape)
df_tracks.tail()

(9192, 5)


Unnamed: 0,artist_name,track_name,track_id,popularity,genre
9187,Steve Hauschildt,Horizon of Appearances,09sBwNmHJRLTjLUkbmpLVT,0,world-music
9188,Oneohtrix Point Never,The Station,22yjezjXBvnHFCFOGDq8BZ,5,world-music
9189,Tim Hecker,Brownwedding,1aUKbiR6wyu0U6e8JgKIxl,1,world-music
9190,Tim Hecker,Keyed out,6fAo62Wd5dlKHZKoIN2Tkh,4,world-music
9191,Tim Hecker,Chimeras,21mw7zZLl70dOTLknT36vX,5,world-music


In [0]:
df_tracks.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 9192 entries, 0 to 9191
Data columns (total 5 columns):
artist_name    9192 non-null object
track_name     9192 non-null object
track_id       9192 non-null object
popularity     9192 non-null int64
genre          9192 non-null object
dtypes: int64(1), object(4)
memory usage: 359.2+ KB


Sometimes, the same track is returned under different track IDs (single, as part of an album etc.).

This needs to be checked for and corrected if needed.

In [0]:
# group the entries by artist_name and track_name and check for duplicates

grouped = df_tracks.groupby(['artist_name','track_name'], as_index=True).size()
grouped[grouped > 1].count()

908

There are 908 artist/track combinations that occur more than once. The extra entries will be dropped in the next cell, keeping the first occurrence of each:

In [0]:
df_tracks.drop_duplicates(subset=['artist_name','track_name'], inplace=True)

In [0]:
# doing the same grouping as before to verify the solution
grouped_after_dropping = df_tracks.groupby(['artist_name','track_name'], as_index=True).size()
grouped_after_dropping[grouped_after_dropping > 1].count()

0

This time the count is zero. Another way of checking this:

In [0]:
df_tracks[df_tracks.duplicated(subset=['artist_name','track_name'],keep=False)].count()

artist_name    0
track_name     0
track_id       0
popularity     0
genre          0
dtype: int64

Checking how many tracks are left now:

In [0]:
df_tracks.shape

(7938, 5)
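
The number of duplicated artist/track groups and the number of rows removed by `drop_duplicates` are different quantities, which a toy dataframe makes visible. The data below is invented purely for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "artist_name": ["A", "A", "A", "B", "B", "C"],
    "track_name":  ["x", "x", "x", "y", "y", "z"],
    "track_id":    ["1", "2", "3", "4", "5", "6"],
})

# number of (artist, track) groups that occur more than once
sizes = toy.groupby(["artist_name", "track_name"]).size()
n_duplicated_groups = int((sizes > 1).sum())

# number of rows removed when only the first entry of each group is kept
deduped = toy.drop_duplicates(subset=["artist_name", "track_name"])
n_rows_dropped = len(toy) - len(deduped)

print(n_duplicated_groups, n_rows_dropped)
```

In the run above, the shapes before and after (9192 and 7938 rows) imply that 1254 rows were removed.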

# 4. Get the Audio Features Data

With the [audio features endpoint](https://beta.developer.spotify.com/documentation/web-api/reference/tracks/get-several-audio-features/) I will now get the audio features data for the 7938 track IDs that remain after removing duplicates.

The limitation for this endpoint is that a maximum of 100 track IDs can be submitted per query.

Again, I used a nested for loop. This time the outer loop pulls track IDs in batches of 100 and sends one query per batch, and the inner loop appends the returned results to the rows list.

Additionally, I had to implement a check for track IDs that didn't return any audio features (i.e. None was returned), as this was causing issues.
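
The batching and the None check can be tried offline before spending API calls. This is a sketch under stated assumptions: `batches` is a hypothetical helper, and `fake_audio_features` is an invented stand-in that mimics the endpoint by returning None for IDs it does not know.

```python
def batches(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def fake_audio_features(track_ids):
    """Hypothetical stand-in for sp.audio_features: None for unknown IDs."""
    return [None if tid.startswith("missing") else {"id": tid} for tid in track_ids]

track_ids = [f"id{n}" for n in range(250)] + ["missing1"]
rows, none_counter = [], 0
for batch in batches(track_ids, 100):
    for features in fake_audio_features(batch):
        if features is None:
            none_counter += 1  # no audio features available for this track
        else:
            rows.append(features)

print(len(rows), none_counter)
```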

In [0]:
# again measuring the time
start = timeit.default_timer()

# empty list, batchsize and the counter for None results
rows = []
batchsize = 100
None_counter = 0

for i in range(0, len(df_tracks['track_id']), batchsize):
    # .iloc slices by position, which stays correct after rows were dropped
    batch = df_tracks['track_id'].iloc[i:i+batchsize].tolist()
    feature_results = sp.audio_features(batch)
    for t in feature_results:
        if t is None:
            None_counter += 1
        else:
            rows.append(t)

print('Number of tracks where no audio features were available:', None_counter)

stop = timeit.default_timer()
print('Time to run this code (in seconds):', stop - start)

Number of tracks where no audio features were available: 0
Time to run this code (in seconds): 9.546686364000152


# 5. EDA + Data Preparation

Same as with the first dataset, checking what the rows list looks like:

In [0]:
print('number of elements in the track_id list:', len(rows))

number of elements in the track_id list: 7938


Finally, I will load the audio features in a dataframe.

In [0]:
df_audio_features = pd.DataFrame.from_dict(rows,orient='columns')
print("Shape of the dataset:", df_audio_features.shape)
df_audio_features.head(15)

Shape of the dataset: (7938, 18)


Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature
0,0.753,0.657,7,-3.061,1,0.0449,0.171,0.0,0.112,0.437,107.01,audio_features,09IStsImFySgyp0pIQdqAc,spotify:track:09IStsImFySgyp0pIQdqAc,https://api.spotify.com/v1/tracks/09IStsImFySg...,https://api.spotify.com/v1/audio-analysis/09IS...,184732,4
1,0.741,0.814,2,-4.393,1,0.0646,0.0281,0.0,0.064,0.476,118.048,audio_features,05bUAkDRK3xzvVkZlGU6ee,spotify:track:05bUAkDRK3xzvVkZlGU6ee,https://api.spotify.com/v1/tracks/05bUAkDRK3xz...,https://api.spotify.com/v1/audio-analysis/05bU...,190730,4
2,0.712,0.601,5,-8.968,1,0.062,1.1e-05,0.802,0.0982,0.513,124.912,audio_features,5YzBL3vkQnp3JbeDRRSbSQ,spotify:track:5YzBL3vkQnp3JbeDRRSbSQ,https://api.spotify.com/v1/tracks/5YzBL3vkQnp3...,https://api.spotify.com/v1/audio-analysis/5YzB...,398151,4
3,0.66,0.857,11,-7.946,0,0.0565,0.00262,0.165,0.215,0.111,126.032,audio_features,2zDCZ8jY4kjuUZbVROHaZj,spotify:track:2zDCZ8jY4kjuUZbVROHaZj,https://api.spotify.com/v1/tracks/2zDCZ8jY4kju...,https://api.spotify.com/v1/audio-analysis/2zDC...,285857,4
4,0.554,0.848,5,-4.075,0,0.0745,0.00237,0.0,0.185,0.286,144.996,audio_features,6Jgg7hrMuBzoJ0TG8tD28G,spotify:track:6Jgg7hrMuBzoJ0TG8tD28G,https://api.spotify.com/v1/tracks/6Jgg7hrMuBzo...,https://api.spotify.com/v1/audio-analysis/6Jgg...,216441,4
5,0.805,0.764,4,-10.983,0,0.0796,0.459,0.937,0.102,0.707,122.008,audio_features,7KtbZxJU9ZdIyJJ4QMzx66,spotify:track:7KtbZxJU9ZdIyJJ4QMzx66,https://api.spotify.com/v1/tracks/7KtbZxJU9ZdI...,https://api.spotify.com/v1/audio-analysis/7Ktb...,598071,4
6,0.484,0.92,8,-3.676,1,0.0333,0.0328,0.717,0.196,0.519,127.991,audio_features,2JjzEnHml6T2UjF8Evud85,spotify:track:2JjzEnHml6T2UjF8Evud85,https://api.spotify.com/v1/tracks/2JjzEnHml6T2...,https://api.spotify.com/v1/audio-analysis/2Jjz...,178123,4
7,0.672,0.519,4,-13.699,0,0.058,0.021,0.868,0.333,0.258,116.892,audio_features,2ywFTaCXKedBFlYA0XcHJM,spotify:track:2ywFTaCXKedBFlYA0XcHJM,https://api.spotify.com/v1/tracks/2ywFTaCXKedB...,https://api.spotify.com/v1/audio-analysis/2ywF...,249794,4
8,0.751,0.721,11,-8.093,0,0.0655,0.000252,0.574,0.119,0.0388,126.02,audio_features,1WsHKAuN9vDthcmimdqqaY,spotify:track:1WsHKAuN9vDthcmimdqqaY,https://api.spotify.com/v1/tracks/1WsHKAuN9vDt...,https://api.spotify.com/v1/audio-analysis/1WsH...,502969,4
9,0.748,0.79,5,-10.116,1,0.0782,0.00238,0.874,0.0798,0.23,110.003,audio_features,26wBcR6Damyd7l4xGI6DNg,spotify:track:26wBcR6Damyd7l4xGI6DNg,https://api.spotify.com/v1/tracks/26wBcR6Damyd...,https://api.spotify.com/v1/audio-analysis/26wB...,416923,4


In [0]:
df_audio_features.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7938 entries, 0 to 7937
Data columns (total 18 columns):
danceability        7938 non-null float64
energy              7938 non-null float64
key                 7938 non-null int64
loudness            7938 non-null float64
mode                7938 non-null int64
speechiness         7938 non-null float64
acousticness        7938 non-null float64
instrumentalness    7938 non-null float64
liveness            7938 non-null float64
valence             7938 non-null float64
tempo               7938 non-null float64
type                7938 non-null object
id                  7938 non-null object
uri                 7938 non-null object
track_href          7938 non-null object
analysis_url        7938 non-null object
duration_ms         7938 non-null int64
time_signature      7938 non-null int64
dtypes: float64(9), int64(4), object(5)
memory usage: 1.1+ MB


Some columns (analysis_url, track_href, type, uri) are not needed for the analysis. The code to drop them is commented out below, so they are kept for now.

The id column is also renamed to track_id so that it matches the column name in the first dataframe.

In [0]:
# columns_to_drop = ['analysis_url','track_href','type','uri']
# df_audio_features.drop(columns_to_drop, axis=1,inplace=True)

df_audio_features.rename(columns={'id': 'track_id'}, inplace=True)

df_audio_features.shape

(7938, 18)
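
What the commented-out drop would do, together with the rename, can be seen on a small invented dataframe. This is only a sketch: the toy data is made up, and `errors="ignore"` is used so that column names missing from the toy frame are skipped rather than raising.

```python
import pandas as pd

toy_features = pd.DataFrame({
    "danceability": [0.75, 0.74],
    "id": ["a1", "b2"],
    "type": ["audio_features", "audio_features"],
    "uri": ["spotify:track:a1", "spotify:track:b2"],
})

# analysis_url and track_href are absent from the toy frame and are skipped
columns_to_drop = ["analysis_url", "track_href", "type", "uri"]
cleaned = (
    toy_features
    .drop(columns=columns_to_drop, errors="ignore")
    .rename(columns={"id": "track_id"})
)
print(list(cleaned.columns))
```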

In [0]:
# merge both dataframes
# the 'inner' method will make sure that we only keep track IDs present in both datasets
df = pd.merge(df_tracks,df_audio_features,on='track_id',how='inner')
print("Shape of the dataset:", df.shape)
df.tail(15)

Shape of the dataset: (7938, 22)
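
The effect of `how='inner'` is easiest to see on two tiny frames that only partly overlap. The data here is invented for illustration.

```python
import pandas as pd

left = pd.DataFrame({"track_id": ["a", "b", "c"], "genre": ["blues", "anime", "ambient"]})
right = pd.DataFrame({"track_id": ["b", "c", "d"], "tempo": [120.0, 98.5, 140.2]})

# only track IDs present in both frames survive an inner merge
merged = pd.merge(left, right, on="track_id", how="inner")
print(merged.shape)
```

Track "a" has no audio features and track "d" has no track metadata, so neither appears in the merged result.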


Unnamed: 0,artist_name,track_name,track_id,popularity,genre,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,uri,track_href,analysis_url,duration_ms,time_signature
7923,Jan Jelinek,"Marcel Duchamp, Would You Like Or Expect Peopl...",07Gr875W8EsS67Db61HB8e,5,world-music,0.451,0.29,1,-27.177,1,0.0843,0.73,0.476,0.347,0.0395,108.511,audio_features,spotify:track:07Gr875W8EsS67Db61HB8e,https://api.spotify.com/v1/tracks/07Gr875W8EsS...,https://api.spotify.com/v1/audio-analysis/07Gr...,140711,4
7924,Jan Jelinek,Tendency,3Gp6ijuSujVNE8EFmOe544,2,world-music,0.751,0.468,6,-14.386,1,0.0481,0.593,0.91,0.111,0.469,123.014,audio_features,spotify:track:3Gp6ijuSujVNE8EFmOe544,https://api.spotify.com/v1/tracks/3Gp6ijuSujVN...,https://api.spotify.com/v1/audio-analysis/3Gp6...,441227,4
7925,Paul Baloche,Your Mercy,4QDSPR5PR90W4X9J32AwyS,4,world-music,0.341,0.251,4,-11.596,1,0.0314,0.532,8.9e-05,0.104,0.0975,143.297,audio_features,spotify:track:4QDSPR5PR90W4X9J32AwyS,https://api.spotify.com/v1/tracks/4QDSPR5PR90W...,https://api.spotify.com/v1/audio-analysis/4QDS...,321627,4
7926,Paul Baloche,Glorious,4YYkjZvjj2kGq7oSgYD5e0,6,world-music,0.463,0.698,0,-6.343,1,0.0298,0.0286,0.0,0.405,0.137,102.045,audio_features,spotify:track:4YYkjZvjj2kGq7oSgYD5e0,https://api.spotify.com/v1/tracks/4YYkjZvjj2kG...,https://api.spotify.com/v1/audio-analysis/4YYk...,302920,4
7927,Susumu Yokota,Wave Drops - D.K. Remix,73vlQ0JIEPWwZq7HLye2VB,4,world-music,0.198,0.262,5,-18.186,1,0.0412,0.869,0.867,0.084,0.154,74.055,audio_features,spotify:track:73vlQ0JIEPWwZq7HLye2VB,https://api.spotify.com/v1/tracks/73vlQ0JIEPWw...,https://api.spotify.com/v1/audio-analysis/73vl...,587966,1
7928,Paul Baloche,Above All,50JS0GLnXO5IMujL7IsiSM,7,world-music,0.373,0.169,9,-12.645,1,0.0355,0.84,0.0,0.105,0.147,123.139,audio_features,spotify:track:50JS0GLnXO5IMujL7IsiSM,https://api.spotify.com/v1/tracks/50JS0GLnXO5I...,https://api.spotify.com/v1/audio-analysis/50JS...,321782,4
7929,Paul Baloche,What Can I Do,1MOPrqA3QDwvUBmg5Tz3QE,8,world-music,0.431,0.522,8,-8.97,1,0.0296,0.339,1e-06,0.104,0.273,143.897,audio_features,spotify:track:1MOPrqA3QDwvUBmg5Tz3QE,https://api.spotify.com/v1/tracks/1MOPrqA3QDwv...,https://api.spotify.com/v1/audio-analysis/1MOP...,316658,4
7930,Paul Baloche,My Hope - Live,0nji3iQpbSwLXyirfMgYfd,4,world-music,0.344,0.68,10,-6.271,1,0.0316,0.0469,1e-06,0.115,0.303,167.819,audio_features,spotify:track:0nji3iQpbSwLXyirfMgYfd,https://api.spotify.com/v1/tracks/0nji3iQpbSwL...,https://api.spotify.com/v1/audio-analysis/0nji...,309080,4
7931,Paul Baloche,Found In You,4Bv2NHFBPJ8iFqUPVG1CmH,4,world-music,0.577,0.787,0,-8.261,1,0.0304,0.00395,0.00391,0.27,0.397,123.001,audio_features,spotify:track:4Bv2NHFBPJ8iFqUPVG1CmH,https://api.spotify.com/v1/tracks/4Bv2NHFBPJ8i...,https://api.spotify.com/v1/audio-analysis/4Bv2...,266840,4
7932,Paul Baloche,The Same Love,5rmILytQP7KHx0DouS2eQl,5,world-music,0.237,0.659,11,-6.449,1,0.0357,0.233,1e-06,0.0879,0.229,204.081,audio_features,spotify:track:5rmILytQP7KHx0DouS2eQl,https://api.spotify.com/v1/tracks/5rmILytQP7KH...,https://api.spotify.com/v1/audio-analysis/5rmI...,267640,4


In [0]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7938 entries, 0 to 7937
Data columns (total 22 columns):
artist_name         7938 non-null object
track_name          7938 non-null object
track_id            7938 non-null object
popularity          7938 non-null int64
genre               7938 non-null object
danceability        7938 non-null float64
energy              7938 non-null float64
key                 7938 non-null int64
loudness            7938 non-null float64
mode                7938 non-null int64
speechiness         7938 non-null float64
acousticness        7938 non-null float64
instrumentalness    7938 non-null float64
liveness            7938 non-null float64
valence             7938 non-null float64
tempo               7938 non-null float64
type                7938 non-null object
uri                 7938 non-null object
track_href          7938 non-null object
analysis_url        7938 non-null object
duration_ms         7938 non-null int64
time_signature      7938 non-null int64

Just in case, checking for any duplicate tracks:

In [0]:
df[df.duplicated(subset=['artist_name','track_name'],keep=False)]

Unnamed: 0,artist_name,track_name,track_id,popularity,genre,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,uri,track_href,analysis_url,duration_ms,time_signature


# 6. Save file to .csv

Everything seems to be fine, so I will save the dataframe as a .csv file.

In [0]:
df.to_csv('SpotifyAudioFeatures20200301_genre_001_010.csv')
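
By default, `to_csv` also writes the dataframe index, which shows up as an `Unnamed: 0` column when the file is read back. A small sketch with invented data, using in-memory buffers instead of files, shows the difference that `index=False` makes:

```python
import io
import pandas as pd

toy = pd.DataFrame({"track_id": ["a", "b"], "tempo": [120.0, 98.5]})

# default behaviour: the index is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False leaves the index out of the file
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```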