# Track data writer

In this file, we use the previously generated ID list to fetch all the data needed for model training. The process is quite lengthy because of the number of API calls required, so ideally this file is run once before moving on. The API also tends to raise errors when some data is not available for a song, so we need a workaround for that.

As before, we use spotipy to get the data. However, as the data is now stored in Azure, we can use it (or the local copy, if it is already downloaded) to get the IDs. The Azure step will not work for anyone other than the author, but the code is kept for convenience. All the necessary data can be found in the git repo and maybe on Kaggle later on.
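
The "Azure or local copy" choice described above could be sketched roughly as follows. This is only an illustration, not part of the notebook: `ensure_ids_file` and its `download` parameter are hypothetical names, and the default path assumes the ID file ends up at `Datasets/ids.csv`, as in the cells below.

```python
from pathlib import Path
from typing import Callable


def ensure_ids_file(download: Callable[[], None], path: str = "Datasets/ids.csv") -> Path:
    """Return the path to the ID file, calling download() only if the file is missing.

    `download` is a hypothetical zero-argument callable, for example a lambda
    wrapping the Azure dataset download used later in this notebook.
    """
    ids_path = Path(path)
    if not ids_path.exists():
        # No local copy yet, so fall back to the (author-only) download step
        download()
    if not ids_path.exists():
        raise FileNotFoundError(f"{ids_path} is still missing after the download step")
    return ids_path
```

With this helper, readers who already have `Datasets/ids.csv` from the git repo would never reach the Azure call, so the workspace credentials are needed only when the file is absent.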

In [None]:
## Import all required libraries
from requests import HTTPError
import spotipy as spy
import pandas as pd
import csv

In [None]:
## Import the required Azure modules
from azureml.core import Workspace, Dataset

In [3]:
## Get the secret and initialize necessary variables for the API
with open('secret.txt') as s:
    # strip() removes the trailing newline that would otherwise become part of the secret
    client_secret = next(s).strip()
client_id = '66b50b394b214bd399d67a70be01a80e'
user = 'Willis Orr'
playlist_uri = 'spotify:playlist:5S8SJdl1BDc0ugpkEvFsIL'

In [3]:
## Set up the credentials manager and API
credentials_manager = spy.oauth2.SpotifyClientCredentials(client_id, client_secret)
sp = spy.Spotify(client_credentials_manager=credentials_manager)

In [4]:
# azureml-core of version 1.0.72 or higher is required
## Get the files from Azure datasets
subscription_id = '86ce59b5-49c9-46fa-bf68-5345d087b896'
resource_group = 'JodaHt'
workspace_name = 'DatatieteidenHarjoitustyo'

workspace = Workspace(subscription_id, resource_group, workspace_name)

dataset = Dataset.get_by_name(workspace, name="Song ID's")
dataset.download(target_path='Datasets', overwrite=True)

['/mnt/batch/tasks/shared/LS_root/mounts/clusters/yleiskone/code/Users/aleksius.kurkela/JohdatusDatatieteisiinHarjoitustyo/Datasets/ids.csv']

In [4]:
## Read the ID dataset
with open("Datasets/ids.csv", "r") as x:
    read = csv.reader(x)
    ids = list(read)
# Drop the header row; the with block has already closed the file
del ids[0]

[['1DrAlSTzo3iwiEZjkZtOip'], ['6maXBs9zUY48m1UNHgTeRC'], ['4sjLcE0GQ6urc4iUXsUPe9'], ['58E1XVmZTODC67YNjneuXM'], ['6W761RLieramPCDV9hijAF'], ['2H4zwjbv0D0ggDhf0E8j8j'], ['27QvYgBk0CHOVHthWnkuWt'], ['07NxDD1iKCHbAldceD7QLP'], ['3iKa5oAYercQOTrOBux76O'], ['6qnM0XXPZOINWA778uNqQ9'], ['3kM1Rc8bQDNpAtXCiAhGV5'], ['1oWSkX6hHx8ea2Wr0SE39g'], ['3WSbTOsuBSKUeMa3Pddu7o'], ['2wVDWtLKXunswWecARNILj'], ['5tMdNsKOlNr0iUFtSE1BwG'], ['18GiV1BaXzPVYpp9rmOg0E'], ['0kdOSSWcYye98wID8RIhdl'], ['09YA1Ma4NhmKCu9VHdQMcU'], ['41xY0XOmX8DF8GKoPi4oF3'], ['77qLIZDZx0DAjrbolGgTwk']]
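
As the output above shows, `csv.reader` yields every row as a list, so `ids` is a list of one-element lists. A minimal sketch of reading the file into a flat list of strings instead is given below. The helper name `read_track_ids` is hypothetical, and skipping blank rows and duplicate IDs is an added assumption, not something the notebook does.

```python
import csv
from typing import Iterable, List


def read_track_ids(lines: Iterable[str]) -> List[str]:
    """Read a one-column ID CSV (with a header row) into a flat list of strings."""
    reader = csv.reader(lines)
    next(reader, None)  # skip the header row, if there is one
    seen = set()
    track_ids = []
    for row in reader:
        # Assumption: blank rows and repeated IDs are not wanted
        if not row or not row[0].strip():
            continue
        track_id = row[0].strip()
        if track_id not in seen:
            seen.add(track_id)
            track_ids.append(track_id)
    return track_ids
```

It accepts any iterable of lines, so an open file handle works directly, for example `read_track_ids(open("Datasets/ids.csv", newline=""))`.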


In [None]:
## And finally, start making GET requests to the API.
## This loop queries the API one ID at a time (Spotipy's `tracks` and `audio_features` can also take lists of IDs, which is not used here).
## This means that we need to make almost 10,000 GET requests, so an error rate of 1% would mean losing about 100 songs. The error rate is printed at the end of the output.
## We also print the counter i every 100 IDs, so that even in the midst of all the error messages you can see roughly how many IDs have been processed.

## Generally, if you run this, it will take about 20 to 40 minutes to complete, depending of course on the hardware.
base = len(ids)
errors = 0
feature_keys = ['acousticness', 'danceability', 'energy', 'instrumentalness', 'liveness',
                'loudness', 'speechiness', 'tempo', 'time_signature']
# newline='' stops the csv module from writing blank rows on Windows;
# the with block closes the file even if the loop is interrupted
with open('trackdata.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(["name", "length", "popularity"] + feature_keys)
    for i, row in enumerate(ids):
        track_id = row[0]  # avoid shadowing the built-in id()
        if i % 100 == 0 and i != 0:
            print(i)
        try:
            features = sp.audio_features(track_id)
            data = sp.track(track_id)
            # features[0] is None when Spotify has no audio features for the track;
            # indexing it then raises the TypeError handled below
            track = [data["name"], data["duration_ms"], data["popularity"]]
            # time_signature is included so every row matches the 12-column header
            track += [features[0][key] for key in feature_keys]
            writer.writerow(track)
        except (HTTPError, spy.SpotifyException, TypeError) as e:
            # Print the actual exception (not just its class) and skip this track
            print(repr(e))
            errors += 1
print("Error rate: {}".format(errors / base))
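
The cell above sends two requests per ID. As far as I know, Spotipy's `tracks` and `audio_features` methods also accept lists of IDs, and the Spotify Web API limits these to 50 and 100 IDs per call, so the same data could probably be fetched with far fewer requests. The sketch below is an untested alternative under those assumptions. `chunked`, `fetch_rows` and `FEATURE_KEYS` are hypothetical names, `client` is expected to be the `sp` object from earlier, and HTTP error handling is left out for brevity.

```python
from typing import Iterator, List, Sequence, Tuple

FEATURE_KEYS = ['acousticness', 'danceability', 'energy', 'instrumentalness',
                'liveness', 'loudness', 'speechiness', 'tempo', 'time_signature']


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` holding at most `size` elements."""
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def fetch_rows(client, track_ids: Sequence[str], batch_size: int = 50) -> Tuple[list, int]:
    """Fetch track metadata and audio features in batches.

    Returns the CSV rows plus the number of tracks skipped because the API
    returned None for either the track or its audio features.
    """
    rows, skipped = [], 0
    for batch in chunked(track_ids, batch_size):
        tracks = client.tracks(batch)["tracks"]   # assumed limit: 50 IDs per call
        features = client.audio_features(batch)   # assumed limit: 100 IDs per call
        for track, feat in zip(tracks, features):
            if track is None or feat is None:
                skipped += 1
                continue
            rows.append([track["name"], track["duration_ms"], track["popularity"]]
                        + [feat[key] for key in FEATURE_KEYS])
    return rows, skipped
```

A call such as `rows, skipped = fetch_rows(sp, [row[0] for row in ids])` would then replace the per-ID loop, with `rows` passed to `writer.writerows`. The trade-off is that a failed request now affects a whole batch rather than a single song, so some retry logic around each batch may be worth adding.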