# Data Collection via Spotify API
*   Student ID : 23207226


In this part of the assignment, our goal is to extract music
data via the Spotify API. More specifically, we will focus on the collection of music metrics stored by the API for each music, in order to further study the App's consumers' music trends and tastes.
In Spotify, musics are classified within labeled categories. However, we can't collect musics directly from those categories as they are stored in playlists (during the conduct of this data collection process, it has been noticed that 1 playlist is associated to each category).
After collecting these playlists for 20 categories (we wanted to limit the data volume for sharper and concise analysis), we have extracted 50 music tracks from each playlist, with their corresponding IDs and associated music metrics, and then stored the whole in a JSON file in order to take account of the nested nature of the metrics data within the playlist within a category. This is the reason using a JSON file is seen as more approriate and a more practical choice for that matter.

## 1.1 - Setup

In [1]:
# we make the required imports
import requests
import json
import pandas as pd
from urllib.parse import urlencode
import base64
import time

In [2]:
# To each Spotify API account is associated an API key to make requests to the API
# and a client ID
API_KEY = "db927bad68784b60bac1747c884754e9"
CLIENT_ID = "25bab89762e942738934d6e271e9f02b"

The function 'get_token' below has been implemented in order to authenticate with the Spotify API and retrieve an access token using the client credentials, in a way that is suitable for a server-to-server authentification, where only the application itself is acting on behalf of the user, as it is the case with the Spotify API.

You will notice that the credentials have been encoded using the base64 encoding process after being combined into a string and separated by a colon. This encoding process is imposed by Spotify because the API requires to be provided in the 'Authorization' header in base-64 form when making a request for an access token.



In [3]:
def get_token(API_KEY, CLIENT_ID):
    # indicating the endpoint for requesting the token
    auth_url = "https://accounts.spotify.com/api/token"

    # base-64 encoding process of the client credentials
    client_creds = f"{CLIENT_ID}:{API_KEY}"
    client_creds_b64 = base64.urlsafe_b64encode(client_creds.encode()).decode()

    # headers initialized as a dictionnary containing the
    #Authorization header (to authenticate the request to the API) and
    # content type indicating the format of the data in the request body
    headers = {
        "Authorization": f"Basic {client_creds_b64}",
        "Content-Type": "application/x-www-form-urlencoded"
    }
    # this dictionnary contains a single key-value pair where we specify the
    #type of grant we are requesting, and set it to our own credentials,
    # as the applications is requesting access on its own behalf, not on ours.
    data = {
        "grant_type": "client_credentials"
    }

    # we make a post request to the API to attempt to obtain an access token
    response = requests.post(auth_url, headers=headers, data=data)

    # check if request works (it does when the status code is btw 200 and 299)
    if response.status_code in range(200, 299):
        # token extraction procss
        token_response_data = response.json()
        access_token = token_response_data.get("access_token")
        return access_token
    else: #if request does not work
        return None


## 1.2 - Implementation of functions to fetch music categories and their associated tracks, and data collection loop

In this block, we are coding the logic of the functions to fetch 20categories, the corresponding playlist of the category and 50 tracks from each playlist of a category.

Then we code a data collection loop for that matter.
To avoid going beyond the authorized API rate, we put two 30 sec-waiting times at the beginning and the end of the collection loop.

We will see however that the author has still succeeded in going beyond the API rate and getting an error after multiple attempts. Fortunately, sufficient data has been collected to conduct the second part of the analysis.

In [4]:
def fetch_categories(access_token, limit=20):
   # set up the URL link to fetch categories. A limit parameter is set to extract the desired amount of categories
    categories_url = "https://api.spotify.com/v1/browse/categories?limit=" + str(limit)
    headers = {"Authorization": f"Bearer {access_token}"} # authorization header with our access token
    response = requests.get(categories_url, headers=headers) # sending a get request to the API to get the categories
    response.raise_for_status()  # checking step to see if the request has failed or not (success --> 200 to 299)
    categories_data = response.json() # parsing the respnse data to json
    return categories_data['categories']['items'] # we return the list of category items

def fetch_playlists_for_category(access_token, category_id):
    # similarly, we sett up the url to fetch a playlist within a category given its ID, and put an optional limit as it is already 1
    playlists_url = f"https://api.spotify.com/v1/browse/categories/{category_id}/playlists?limit=1"
    headers = {"Authorization": f"Bearer {access_token}"} # authorization head and access token
    response = requests.get(playlists_url, headers=headers) # get request to fetch the playlist
    response.raise_for_status() # check for success of the request
    playlists_data = response.json() # parsing data to json again
    return playlists_data['playlists']['items'] # returning the list of playlist item

def fetch_tracks_from_playlist(access_token, playlist_id, limit=50):
  # again, set up of the url to fetch the music tracks within a playlist given its ID, and setting up a limit param
    tracks_url = f"https://api.spotify.com/v1/playlists/{playlist_id}/tracks?limit={limit}"
    headers = {"Authorization": f"Bearer {access_token}"}
    response = requests.get(tracks_url, headers=headers) # sending get request to the API
    response.raise_for_status() # check if request is ok
    tracks_data = response.json() # parsing the data to json
    tracks = [] # setting an empty list to hold the track data
    for item in tracks_data['items']: # going through all the track items
        if item['track']:  # ensuring that there is a track object
            track = item['track'] # selecting the items of the detected track object
            # appending the track with its id and name
            tracks.append({
                'id': track['id'],
                'name': track['name']
            })
    return tracks

def fetch_audio_features(access_token, track_ids):
    # we setup again the url for audio features (music metrics) extraction
    features_url = f"https://api.spotify.com/v1/audio-features?ids={','.join(track_ids)}"
    headers = {"Authorization": f"Bearer {access_token}"}
    response = requests.get(features_url, headers=headers) # sending get request for fetching tracks
    response.raise_for_status() # check if request is ok
    features_data = response.json() # parsing the data to json
    return features_data['audio_features'] # returning the list of audio feature items

# data collection function
def collect_data(access_token):
    time.sleep(30) # to avoid going beyond the API rate, we wait a bit (Spotify API refreshes every 30 s)

    data = {}  # initializing a dict to store the categories

    # category fetching by calling the function, the limit is already set to 20 so it won't collect more
    categories = fetch_categories(access_token)
    # going through the collected categories
    for category in categories:
        category_id = category['id'] # collecting category's id and name
        category_name = category['name']
        data[category_name] = []  # initializing a list for the playlist of the category (I didn't know there was only one when I coded this, but it still works so did not change the logic)

        # fetching the playlist for the category
        playlists = fetch_playlists_for_category(access_token, category_id)

        # going thorugh the list and collecting the playlist ID
        for playlist in playlists:
            playlist_id = playlist['id']
            playlist_data = {'id': playlist_id, 'tracks': []} # setting a dict with the playlist ID and a list of tracks

            # now we fetch tracks for the given playlist
            tracks = fetch_tracks_from_playlist(access_token, playlist_id)

            # going through tracks
            for track in tracks:
                # fetching the tracks' id and name in a dict
                track_data = {'id': track['id'], 'name': track['name']}
                # fetching and appending the corresponding audio features for each track given its ID
                audio_features = fetch_audio_features(access_token, track['id'])
                if audio_features: # if there is any audio feature
                    track_data['audio_features'] = audio_features # we select it
                playlist_data['tracks'].append(track_data) # and append it to the track data dict
            if playlist_data['tracks']:  # to ensure that there are tracks before adding the playlist data
                data[category_name].append(playlist_data)
    time.sleep(30) # another precaution step to avoid reaching the authorized api rate

    return data





In [5]:
access_token = get_token(API_KEY, CLIENT_ID) # calling the access token
collected_data = collect_data(access_token) # initializing collection loop with the provided token

# data is dumped in a json file and saved for further usage
with open('spotify_data.json', 'w') as f:
    json.dump(collected_data, f, indent=4)

# as you can see below, a 429 client Error is displayed, signaling that the API rate has been
#reached. The API does not allow me any access anymore for further data collection

HTTPError: 429 Client Error: Too Many Requests for url: https://api.spotify.com/v1/audio-features?ids=0,N,j,0,v,1,z,G,v,D,L,g,v,h,j,L,L,z,M,z,o,m