# 2. Data Acquisition: Audio Features Data

Our group used the Spotify API to gather information on music tracks. Because of API rate limits, data acquisition was split across two notebooks. This notebook is meant to be run second; it gathers the audio features for all tracks that we chose to include in our analysis. As in the first notebook, the first step is the client credentials flow, which is repeated below using the same client keys.

PLEASE NOTE: If re-running this file, data collection via the API can take some time. If the same code blocks are paused and re-run several times, allow a brief pause between runs to avoid exceeding rate limits. Additionally, the command that writes the API results to a CSV file has been commented out so that the files created by our group are not overwritten. Because API output can differ from day to day, we wanted our analysis to produce consistent results during the marking phase of this project.
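
The pause mentioned above can also be automated. The following is a minimal sketch, not part of our original workflow: `call_with_backoff` is a hypothetical helper, and its `retries`, `base_delay` and `sleep` parameters are assumptions chosen for illustration. It retries a failing call and doubles the wait between attempts.

```python
import time


def call_with_backoff(func, *args, retries=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs); on an exception, wait and try again.

    The wait doubles after each failed attempt (base_delay, 2*base_delay, ...).
    After `retries` failed retries the last exception is re-raised.
    `sleep` is injectable so the helper can be exercised without really waiting.
    """
    for attempt in range(retries + 1):
        try:
            return func(*args, **kwargs)
        except Exception:
            # Out of retries: surface the original error to the caller.
            if attempt == retries:
                raise
            sleep(base_delay * 2 ** attempt)


# Stand-in for an API call that fails twice before succeeding (hypothetical).
attempts = {"n": 0}


def flaky_call():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"


waits = []
result = call_with_backoff(flaky_call, base_delay=1.0, sleep=waits.append)
print(result, waits)  # ok [1.0, 2.0]
```

In this notebook the wrapped call would be the audio features request. Catching a narrower exception type than `Exception` would be preferable in real use.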

In [1]:
# library load & key load
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import json
import pandas as pd
import time
import timeit

with open('keys_personal.json') as f:
    keys = json.load(f)
cid = keys['spotify']['client_id']
secret = keys['spotify']['client_secret']

In [2]:
# client credentials flow
spotify = spotipy.Spotify(client_credentials_manager = 
                          SpotifyClientCredentials(client_id=cid, client_secret=secret))

In [3]:
spotify

<spotipy.client.Spotify at 0x7f7a60395b20>

________________

### Gathering features for all gathered tracks:

Note: [this](https://towardsdatascience.com/spotify-data-project-part-1-from-data-retrieval-to-first-insights-f5f819f8e1c3) article was used as a reference for the batching process below, though the code and process were altered for our use case.
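
As a standalone illustration of the batching idea, the sketch below splits a list of IDs into fixed-size chunks. `iter_batches` is a hypothetical helper and the IDs are made-up placeholders; only the batch size of 100 and the total of 9,394 rows come from this notebook.

```python
def iter_batches(items, size=100):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder IDs standing in for the 9,394 rows of the tracks table.
track_ids = [f"id_{i}" for i in range(9394)]
batches = list(iter_batches(track_ids, size=100))

print(len(batches))      # 94 requests in total
print(len(batches[-1]))  # the final batch holds the remaining 94 IDs
```

With 9,394 IDs and a batch size of 100, this gives 93 full batches plus one partial batch, so 94 requests instead of 9,394.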

In [23]:
df_tracks = pd.read_csv('spotify_tracks.csv')
df_tracks.head()

Unnamed: 0,year,market,artist_name,artist_id,track_name,track_id,popularity,artist_genre
0,2022,US,SZA,7tYKF4w9nC0nq9CsPZTHyP,Kill Bill,3OHfY25tqY28d16oZczHc8,93,"['pop', 'r&b']"
1,2022,US,Metro Boomin,0iEtIxbK0KxaSlF7G42ZOp,Creepin' (with The Weeknd & 21 Savage),2dHHgzDwk4BJdRwy9uXhTO,94,['rap']
2,2022,US,Drake,3TVXtAsR1Inumwj472S9r4,Rich Flex,1bDbXMyjaUIooNwFE9wn0N,91,"['canadian hip hop', 'canadian pop', 'hip hop'..."
3,2022,US,Metro Boomin,0iEtIxbK0KxaSlF7G42ZOp,Superhero (Heroes & Villains) [with Future & C...,0vjeOZ3Ft5jvAi9SBFJm1j,88,['rap']
4,2022,US,Lil Uzi Vert,4O15NlyKLIASxsJ0PrXPfz,Just Wanna Rock,4FyesJzVpA39hbYvcseO2d,88,"['melodic rap', 'philly rap', 'rap', 'trap']"


In [24]:
df_tracks.shape

(9394, 8)

In [25]:
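features = []   # audio-feature dictionaries returned by the API
size = 100      # number of track IDs sent per request
no_feats = 0    # count of tracks for which no features were returned

start = timeit.default_timer()

for step in range(0, len(df_tracks['track_id']), size):

    # take the next batch of track IDs and pass it as a plain list
    batch = df_tracks['track_id'][step:step+size].tolist()
    feat_results = spotify.audio_features(batch)

    # a None entry means no audio features were available for that track
    for data in feat_results:
        if data is None:
            no_feats += 1
        else:
            features.append(data)

stop = timeit.default_timer()
print('Time to run (in seconds): ', stop - start)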

Time to run (in seconds):  9.090163542000028


In [26]:
print('Number of tracks where no features were available: ',no_feats)
print('Number of elements in the feature list: ', len(features))

Number of tracks where no features were available:  14
Number of elements in the feature list:  9380
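
The counts above add up (9,394 requested rows minus 14 without features leaves 9,380), but they do not show which tracks were missing. The sketch below is one way to record them, assuming each batch of results comes back in the same order as the requested IDs. `split_results` is a hypothetical helper and the sample data is invented.

```python
def split_results(batch_ids, results):
    """Pair each requested ID with its result and separate hits from misses.

    Assumes `results` is aligned position-by-position with `batch_ids`
    and contains None where no features were available.
    """
    found, missing = [], []
    for track_id, data in zip(batch_ids, results):
        if data is None:
            missing.append(track_id)
        else:
            found.append(data)
    return found, missing


# Invented stand-in for one small batch and its response.
batch_ids = ["a", "b", "c"]
results = [{"id": "a", "tempo": 120.0}, None, {"id": "c", "tempo": 98.5}]

found, missing = split_results(batch_ids, results)
print(len(found), missing)  # 2 ['b']
```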


In [27]:
# turning into dataframe
df_features = pd.DataFrame.from_dict(features,orient='columns')
df_features.shape

(9380, 18)

In [30]:
df_features.head()

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature
0,0.644,0.728,8,-5.75,1,0.0351,0.0543,0.169,0.161,0.43,88.993,audio_features,3OHfY25tqY28d16oZczHc8,spotify:track:3OHfY25tqY28d16oZczHc8,https://api.spotify.com/v1/tracks/3OHfY25tqY28...,https://api.spotify.com/v1/audio-analysis/3OHf...,153947,4
1,0.715,0.62,1,-6.005,0,0.0484,0.417,0.0,0.0822,0.172,97.95,audio_features,2dHHgzDwk4BJdRwy9uXhTO,spotify:track:2dHHgzDwk4BJdRwy9uXhTO,https://api.spotify.com/v1/tracks/2dHHgzDwk4BJ...,https://api.spotify.com/v1/audio-analysis/2dHH...,221520,4
2,0.561,0.52,11,-9.342,0,0.244,0.0503,2e-06,0.355,0.424,153.15,audio_features,1bDbXMyjaUIooNwFE9wn0N,spotify:track:1bDbXMyjaUIooNwFE9wn0N,https://api.spotify.com/v1/tracks/1bDbXMyjaUIo...,https://api.spotify.com/v1/audio-analysis/1bDb...,239360,3
3,0.526,0.606,5,-5.3,0,0.259,0.152,2e-06,0.194,0.492,116.622,audio_features,0vjeOZ3Ft5jvAi9SBFJm1j,spotify:track:0vjeOZ3Ft5jvAi9SBFJm1j,https://api.spotify.com/v1/tracks/0vjeOZ3Ft5jv...,https://api.spotify.com/v1/audio-analysis/0vje...,182667,4
4,0.486,0.545,11,-7.924,1,0.0336,0.0652,0.00474,0.0642,0.0385,150.187,audio_features,4FyesJzVpA39hbYvcseO2d,spotify:track:4FyesJzVpA39hbYvcseO2d,https://api.spotify.com/v1/tracks/4FyesJzVpA39...,https://api.spotify.com/v1/audio-analysis/4Fye...,123891,4


In [31]:
# renaming ID column to track ID
df_features.rename(columns={'id': 'track_id'}, inplace=True)

In [33]:
df_features.head()

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,track_id,uri,track_href,analysis_url,duration_ms,time_signature
0,0.644,0.728,8,-5.75,1,0.0351,0.0543,0.169,0.161,0.43,88.993,audio_features,3OHfY25tqY28d16oZczHc8,spotify:track:3OHfY25tqY28d16oZczHc8,https://api.spotify.com/v1/tracks/3OHfY25tqY28...,https://api.spotify.com/v1/audio-analysis/3OHf...,153947,4
1,0.715,0.62,1,-6.005,0,0.0484,0.417,0.0,0.0822,0.172,97.95,audio_features,2dHHgzDwk4BJdRwy9uXhTO,spotify:track:2dHHgzDwk4BJdRwy9uXhTO,https://api.spotify.com/v1/tracks/2dHHgzDwk4BJ...,https://api.spotify.com/v1/audio-analysis/2dHH...,221520,4
2,0.561,0.52,11,-9.342,0,0.244,0.0503,2e-06,0.355,0.424,153.15,audio_features,1bDbXMyjaUIooNwFE9wn0N,spotify:track:1bDbXMyjaUIooNwFE9wn0N,https://api.spotify.com/v1/tracks/1bDbXMyjaUIo...,https://api.spotify.com/v1/audio-analysis/1bDb...,239360,3
3,0.526,0.606,5,-5.3,0,0.259,0.152,2e-06,0.194,0.492,116.622,audio_features,0vjeOZ3Ft5jvAi9SBFJm1j,spotify:track:0vjeOZ3Ft5jvAi9SBFJm1j,https://api.spotify.com/v1/tracks/0vjeOZ3Ft5jv...,https://api.spotify.com/v1/audio-analysis/0vje...,182667,4
4,0.486,0.545,11,-7.924,1,0.0336,0.0652,0.00474,0.0642,0.0385,150.187,audio_features,4FyesJzVpA39hbYvcseO2d,spotify:track:4FyesJzVpA39hbYvcseO2d,https://api.spotify.com/v1/tracks/4FyesJzVpA39...,https://api.spotify.com/v1/audio-analysis/4Fye...,123891,4


In [34]:
# checking for duplicates
grouped = df_features.groupby(['track_id'], as_index=True).size()
grouped[grouped > 1]

track_id
00Blm7zeNqgYLPtW6zg8cj    4
00NAQYOP4AmWR549nnYJZu    2
017PF4Q3l4DBUiWoXk4OWT    2
01JPQ87UHeGysPVwTqMJHK    2
01K4zKU104LyJ8gMb7227B    2
                         ..
7ytR5pFWmSjzHJIeQkgog4    3
7zFXmv6vqI4qOt4yGf3jYZ    3
7zLYKWcXnYeHHWidalz7rj    3
7zwn1eykZtZ5LODrf7c0tS    3
7zxRMhXxJMQCeDDg0rKAVo    2
Length: 1669, dtype: int64

In [35]:
# example of duplicates - seems like all feature values are the same
df_features[df_features['track_id'] == '00Blm7zeNqgYLPtW6zg8cj']

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,track_id,uri,track_href,analysis_url,duration_ms,time_signature
1243,0.687,0.781,1,-4.806,1,0.053,0.0361,0.0,0.0755,0.688,97.014,audio_features,00Blm7zeNqgYLPtW6zg8cj,spotify:track:00Blm7zeNqgYLPtW6zg8cj,https://api.spotify.com/v1/tracks/00Blm7zeNqgY...,https://api.spotify.com/v1/audio-analysis/00Bl...,193507,4
1583,0.687,0.781,1,-4.806,1,0.053,0.0361,0.0,0.0755,0.688,97.014,audio_features,00Blm7zeNqgYLPtW6zg8cj,spotify:track:00Blm7zeNqgYLPtW6zg8cj,https://api.spotify.com/v1/tracks/00Blm7zeNqgY...,https://api.spotify.com/v1/audio-analysis/00Bl...,193507,4
1970,0.687,0.781,1,-4.806,1,0.053,0.0361,0.0,0.0755,0.688,97.014,audio_features,00Blm7zeNqgYLPtW6zg8cj,spotify:track:00Blm7zeNqgYLPtW6zg8cj,https://api.spotify.com/v1/tracks/00Blm7zeNqgY...,https://api.spotify.com/v1/audio-analysis/00Bl...,193507,4
2264,0.687,0.781,1,-4.806,1,0.053,0.0361,0.0,0.0755,0.688,97.014,audio_features,00Blm7zeNqgYLPtW6zg8cj,spotify:track:00Blm7zeNqgYLPtW6zg8cj,https://api.spotify.com/v1/tracks/00Blm7zeNqgY...,https://api.spotify.com/v1/audio-analysis/00Bl...,193507,4
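
The example above checks a single track by eye. A sketch of how the same check could be run across every duplicated ID follows; `ids_with_conflicting_rows` is a hypothetical helper and the small frame is invented data, not our dataset.

```python
import pandas as pd


def ids_with_conflicting_rows(df, key="track_id"):
    """Return key values whose duplicate rows differ in at least one column."""
    # Fully identical rows collapse to one; any key still appearing
    # more than once must have rows that disagree somewhere.
    distinct = df.drop_duplicates().groupby(key).size()
    return distinct[distinct > 1].index.tolist()


# Invented data: 'x' is duplicated identically, 'y' has conflicting tempo values.
demo = pd.DataFrame({
    "track_id": ["x", "x", "y", "y"],
    "tempo": [97.014, 97.014, 120.0, 121.0],
})

print(ids_with_conflicting_rows(demo))  # ['y']
```

An empty list returned for the features table would support dropping duplicates on `track_id` alone.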


In [36]:
# dropping duplicates to avoid merge complications; the duplicated rows hold identical values
df_features.drop_duplicates(subset=['track_id'], inplace=True)

In [37]:
df_features.shape

(6563, 18)

In [38]:
df = df_tracks.merge(df_features, on = 'track_id', how = 'inner')

In [39]:
df.shape

(9380, 25)
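
The merged shape is consistent with a many-to-one join. The tracks table lists a track once per market, so the 6,563 unique feature rows fan back out to 9,380 rows, and the 8 + 18 columns share one key, which gives 25. The sketch below shows, on invented data, how the `validate` argument of `merge` can assert that assumption. It was not used in our original merge.

```python
import pandas as pd

# Invented data: track 'a' appears in three markets, track 'b' has no features.
tracks = pd.DataFrame({
    "market": ["US", "IN", "NL", "US"],
    "track_id": ["a", "a", "a", "b"],
})
feats = pd.DataFrame({"track_id": ["a"], "tempo": [88.993]})

# validate="many_to_one" raises a MergeError if track_id is not unique in feats.
merged = tracks.merge(feats, on="track_id", how="inner", validate="many_to_one")

print(merged.shape)  # (3, 3): the row for 'b' is dropped by the inner join
```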

In [40]:
df.head()

Unnamed: 0,year,market,artist_name,artist_id,track_name,track_id,popularity,artist_genre,danceability,energy,...,instrumentalness,liveness,valence,tempo,type,uri,track_href,analysis_url,duration_ms,time_signature
0,2022,US,SZA,7tYKF4w9nC0nq9CsPZTHyP,Kill Bill,3OHfY25tqY28d16oZczHc8,93,"['pop', 'r&b']",0.644,0.728,...,0.169,0.161,0.43,88.993,audio_features,spotify:track:3OHfY25tqY28d16oZczHc8,https://api.spotify.com/v1/tracks/3OHfY25tqY28...,https://api.spotify.com/v1/audio-analysis/3OHf...,153947,4
1,2022,IN,SZA,7tYKF4w9nC0nq9CsPZTHyP,Kill Bill,3OHfY25tqY28d16oZczHc8,93,"['pop', 'r&b']",0.644,0.728,...,0.169,0.161,0.43,88.993,audio_features,spotify:track:3OHfY25tqY28d16oZczHc8,https://api.spotify.com/v1/tracks/3OHfY25tqY28...,https://api.spotify.com/v1/audio-analysis/3OHf...,153947,4
2,2022,NL,SZA,7tYKF4w9nC0nq9CsPZTHyP,Kill Bill,3OHfY25tqY28d16oZczHc8,93,"['pop', 'r&b']",0.644,0.728,...,0.169,0.161,0.43,88.993,audio_features,spotify:track:3OHfY25tqY28d16oZczHc8,https://api.spotify.com/v1/tracks/3OHfY25tqY28...,https://api.spotify.com/v1/audio-analysis/3OHf...,153947,4
3,2022,US,Metro Boomin,0iEtIxbK0KxaSlF7G42ZOp,Creepin' (with The Weeknd & 21 Savage),2dHHgzDwk4BJdRwy9uXhTO,94,['rap'],0.715,0.62,...,0.0,0.0822,0.172,97.95,audio_features,spotify:track:2dHHgzDwk4BJdRwy9uXhTO,https://api.spotify.com/v1/tracks/2dHHgzDwk4BJ...,https://api.spotify.com/v1/audio-analysis/2dHH...,221520,4
4,2022,GB,Metro Boomin,0iEtIxbK0KxaSlF7G42ZOp,Creepin' (with The Weeknd & 21 Savage),2dHHgzDwk4BJdRwy9uXhTO,94,['rap'],0.715,0.62,...,0.0,0.0822,0.172,97.95,audio_features,spotify:track:2dHHgzDwk4BJdRwy9uXhTO,https://api.spotify.com/v1/tracks/2dHHgzDwk4BJ...,https://api.spotify.com/v1/audio-analysis/2dHH...,221520,4


PLEASE NOTE: If re-running this file, do not uncomment and run the line below. Doing so would overwrite the files created by our group, which could change the dataset used in the report.

In [41]:
#df.to_csv('spotify_tracks_feats.csv', index = False)  
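
An alternative to commenting out the write is to guard it so that an existing file is never replaced. This is a minimal sketch under that assumption: `write_once` is a hypothetical helper, and the demonstration writes to a temporary directory rather than to our project files.

```python
import tempfile
from pathlib import Path


def write_once(path, writer):
    """Call writer(path) only if `path` does not exist yet.

    Returns True if the file was written and False if it was left untouched.
    """
    path = Path(path)
    if path.exists():
        return False
    writer(path)
    return True


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "demo.csv"
    # The first call creates the file; the second leaves it unchanged.
    first = write_once(target, lambda p: p.write_text("a,b\n1,2\n"))
    second = write_once(target, lambda p: p.write_text("overwritten\n"))
    content = target.read_text()

print(first, second)  # True False
```

In this notebook the writer could be `lambda p: df.to_csv(p, index=False)` with `'spotify_tracks_feats.csv'` as the path.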