# 1. Data Acquisition: Track Data

Our group used the Spotify API to gather music tracks across different years and countries (markets). Data acquisition was split across two notebooks because of API rate limits. This notebook should be run first; it gathers all tracks assessed in our analysis. The first step in using the Spotify API is the client credentials flow. It requires a Spotify account (to log in to the Spotify Developer site) and a registered application, which provides the two keys used below: a client ID and a client secret. Data collection is described in detail in the report file (2.B. Data Collection).

PLEASE NOTE: If re-running this file, data collection via the API can take a significant amount of time (the genre lookup alone took roughly 19 minutes in our run). If pausing and re-running the same blocks of code multiple times, allow a brief pause between runs to avoid exceeding rate limits. Additionally, the command that creates a CSV file from the API calls below has been commented out to prevent overwriting the files created by our group. Because API output can differ from day to day, we wanted to ensure that our analysis produced consistent results during the marking phase of this project.
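The pause between repeated runs mentioned above can also be automated by wrapping each API call in a retry helper. The sketch below is a minimal illustration and is not the code used in this notebook: `call_with_backoff`, `is_rate_limited`, and the delay values are hypothetical names and settings chosen for the example, and the `sleep` parameter exists only so the helper can be exercised without actually waiting.

```python
import time


def call_with_backoff(fn, is_rate_limited, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on a rate-limit error, wait and retry with exponential backoff.

    fn              -- zero-argument callable wrapping one API request (assumption)
    is_rate_limited -- callable(exception) -> bool, decides whether to retry
    """
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except Exception as exc:
            # Re-raise anything that is not a rate-limit error, or once retries run out.
            if not is_rate_limited(exc) or attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)  # waits 1s, 2s, 4s, ...
```

With spotipy, a rate-limited request is expected to surface as a `spotipy.exceptions.SpotifyException` with `http_status` equal to 429, so `is_rate_limited` could test for that. Treat this as an assumption to verify against the installed spotipy version.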

In [3]:
# library load & key load
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import json
import pandas as pd
import time
import timeit

with open('keys_personal.json') as f:
    keys = json.load(f)
cid = keys['spotify']['client_id']
secret = keys['spotify']['client_secret']
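Judging from the lookups in the cell above, `keys_personal.json` is expected to contain a top-level `spotify` object with `client_id` and `client_secret` entries. As a minimal sketch under that assumption, the loader below fails early with a clearer message when the file has a different shape. The function name `load_spotify_keys` is hypothetical and not part of the original notebook.

```python
import json


def load_spotify_keys(path):
    """Read the client ID and secret from a JSON file shaped like
    {"spotify": {"client_id": "...", "client_secret": "..."}}."""
    with open(path) as f:
        keys = json.load(f)
    try:
        section = keys["spotify"]
        return section["client_id"], section["client_secret"]
    except KeyError as exc:
        # Name the missing key so a misconfigured file is easy to diagnose.
        raise KeyError(f"{path} is missing the expected key {exc}") from exc
```

It could be used as `cid, secret = load_spotify_keys('keys_personal.json')`.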

In [4]:
# client credentials flow
spotify = spotipy.Spotify(client_credentials_manager = 
                          SpotifyClientCredentials(client_id=cid, client_secret=secret))

In [5]:
spotify

<spotipy.client.Spotify at 0x7fb5621752e0>

____________________________

### Gathering tracks across years and markets:

Note: [this article](https://towardsdatascience.com/spotify-data-project-part-1-from-data-retrieval-to-first-insights-f5f819f8e1c3) was used as a reference point for the data collection process shown below, although the code and process were altered significantly for our use case.

In [6]:
years = ['2022', '2021', '2020', '2019', '2018', '2017', '2016', '2015']
markets = ['US', 'GB', 'IN', 'NL']

In [7]:
year_list = []
market_list = []
artist_name = []
track_name = []
popularity = []
track_id = []
artist_id = []

In [8]:
start = timeit.default_timer()

for year in years:
    for market in markets:

        # 6 pages of 50 results each = up to 300 tracks per year/market pair
        for offset in range(0, 300, 50):
            track_results = spotify.search(q=f'year:{year}', type='track', market=market, limit=50, offset=offset)
            print('Checking')
            for t in track_results['tracks']['items']:
                year_list.append(year)
                market_list.append(market)
                # only the first listed artist is recorded for each track
                artist_name.append(t['artists'][0]['name'])
                artist_id.append(t['artists'][0]['id'])
                track_name.append(t['name'])
                track_id.append(t['id'])
                popularity.append(t['popularity'])

        print('Ran: ', year + ' ' + market)
    # pause after each year to stay under the API rate limit
    time.sleep(60)

stop = timeit.default_timer()
print('Time to run (in seconds): ', stop - start)

Checking
Checking
Checking
Checking
Checking
Checking
Ran:  2022 US
Checking
Checking
Checking
Checking
Checking
Checking
Ran:  2022 GB
Checking
Checking
Checking
Checking
Checking
Checking
Ran:  2022 IN
Checking
Checking
Checking
Checking
Checking
Checking
Ran:  2022 NL
Checking
Checking
Checking
Checking
Checking
Checking
Ran:  2021 US
Checking
Checking
Checking
Checking
Checking
Checking
Ran:  2021 GB
Checking
Checking
Checking
Checking
Checking
Checking
Ran:  2021 IN
Checking
Checking
Checking
Checking
Checking
Checking
Ran:  2021 NL
Checking
Checking
Checking
Checking
Checking
Checking
Ran:  2020 US
Checking
Checking
Checking
Checking
Checking
Checking
Ran:  2020 GB
Checking
Checking
Checking
Checking
Checking
Checking
Ran:  2020 IN
Checking
Checking
Checking
Checking
Checking
Checking
Ran:  2020 NL
Checking
Checking
Checking
Checking
Checking
Checking
Ran:  2019 US
Checking
Checking
Checking
Checking
Checking
Checking
Ran:  2019 GB
Checking
Checking
Checking
Checking
Checking
Che

In [9]:
df_tracks = pd.DataFrame({'year': year_list, 'market': market_list,'artist_name':artist_name, 'artist_id': artist_id, 'track_name':track_name,'track_id':track_id,'popularity':popularity})
df_tracks.shape

(9600, 7)
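The row count above follows directly from the loop bounds: 8 years × 4 markets × 6 result pages × 50 tracks per page. The short sketch below reproduces that arithmetic without calling the API; `requests_planned`, `n_requests`, and `max_rows` are illustrative names that do not appear in the original notebook.

```python
from itertools import product

years = ['2022', '2021', '2020', '2019', '2018', '2017', '2016', '2015']
markets = ['US', 'GB', 'IN', 'NL']
page_size = 50
offsets = list(range(0, 300, page_size))  # [0, 50, 100, 150, 200, 250]

# One search request per (year, market, offset) combination.
requests_planned = list(product(years, markets, offsets))

n_requests = len(requests_planned)   # 8 * 4 * 6 = 192
max_rows = n_requests * page_size    # 192 * 50 = 9600, if every page is full
print(n_requests, max_rows)
```

A shape of exactly 9,600 rows therefore indicates that every requested page came back full.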

In [10]:
df_tracks.head()

| | year | market | artist_name | artist_id | track_name | track_id | popularity |
|---|---|---|---|---|---|---|---|
| 0 | 2022 | US | SZA | 7tYKF4w9nC0nq9CsPZTHyP | Kill Bill | 3OHfY25tqY28d16oZczHc8 | 93 |
| 1 | 2022 | US | Metro Boomin | 0iEtIxbK0KxaSlF7G42ZOp | Creepin' (with The Weeknd & 21 Savage) | 2dHHgzDwk4BJdRwy9uXhTO | 94 |
| 2 | 2022 | US | Drake | 3TVXtAsR1Inumwj472S9r4 | Rich Flex | 1bDbXMyjaUIooNwFE9wn0N | 91 |
| 3 | 2022 | US | SZA | 7tYKF4w9nC0nq9CsPZTHyP | Kill Bill | 1Qrg8KqiBpW07V7PNxwwwL | 92 |
| 4 | 2022 | US | Metro Boomin | 0iEtIxbK0KxaSlF7G42ZOp | Superhero (Heroes & Villains) [with Future & C... | 0vjeOZ3Ft5jvAi9SBFJm1j | 88 |


In [11]:
# number of unique artists
df_tracks['artist_id'].nunique()

2400

In [12]:
dedup_artist_id = list(set(artist_id))
len(dedup_artist_id)

2400

In [13]:
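# sourcing the genre per artist
artist_genre = []

start = timeit.default_timer()

for count, artist in enumerate(dedup_artist_id, start=1):

    # one request per artist; 'genres' is a list of genre strings
    artist_genre.append(spotify.artist(artist)['genres'])

    # pause for a minute after every 150 requests to respect rate limits
    if count % 150 == 0:
        print('Checked: ' + str(count))
        time.sleep(60)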
        
stop = timeit.default_timer()
print ('Time to run (in seconds): ', stop - start)

Checked: 150
Checked: 300
Checked: 450
Checked: 600
Checked: 750
Checked: 900
Checked: 1050
Checked: 1200
Checked: 1350
Checked: 1500
Checked: 1650
Checked: 1800
Checked: 1950
Checked: 2100
Checked: 2250
Checked: 2400
Time to run (in seconds):  1125.8443239580001
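The loop above issues one request per artist, 2,400 in total. spotipy also provides `Spotify.artists`, which takes a list of artist IDs, and Spotify's endpoint accepts up to 50 IDs per call. The sketch below shows how the lookup could be batched. It is an alternative, not the method used for this notebook's results. `chunked` and `fetch_genres` are hypothetical helper names, and `client` is assumed to be any object whose `artists(ids)` method returns a dict with an `'artists'` list, as the spotipy client does.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_genres(client, artist_ids, batch_size=50):
    """Map artist ID -> genre list using one request per batch of IDs."""
    genres = {}
    for batch in chunked(artist_ids, batch_size):
        response = client.artists(batch)
        for artist in response['artists']:
            # Skip empty entries, which may appear for IDs that cannot be resolved.
            if artist is not None:
                genres[artist['id']] = artist['genres']
    return genres
```

For 2,400 IDs this would need 48 requests, e.g. `fetch_genres(spotify, dedup_artist_id)`. Returning a dict keyed by ID also removes the reliance on the genre list staying in the same order as `dedup_artist_id`.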


In [14]:
len(artist_genre)

2400

In [15]:
df_genre = pd.DataFrame({'dedup_artist_id': dedup_artist_id, 'artist_genre': artist_genre})
df_genre.head()

| | dedup_artist_id | artist_genre |
|---|---|---|
| 0 | 53XhwfbYqKCa1cC15pYq2q | [modern rock, rock] |
| 1 | 3upXQVqKZlTKRES3LQ9VjO | [carnaval, levenslied] |
| 2 | 1SyKki7JI1AZNKwgNMkn25 | [desi pop, modern bollywood] |
| 3 | 3lC07j0oFDXJflpeAjo9eZ | [dutch pop] |
| 4 | 0L5fC7Ogm2YwgqVCRcF1bT | [alternative metal, danish metal, danish rock] |


In [16]:
# renaming the artist id column and merging into df
df_genre.rename(columns={'dedup_artist_id': 'artist_id'}, inplace=True)

In [17]:
df_tracks_genre = df_tracks.merge(df_genre, on = 'artist_id', how = 'left')
df_tracks_genre.head()

| | year | market | artist_name | artist_id | track_name | track_id | popularity | artist_genre |
|---|---|---|---|---|---|---|---|---|
| 0 | 2022 | US | SZA | 7tYKF4w9nC0nq9CsPZTHyP | Kill Bill | 3OHfY25tqY28d16oZczHc8 | 93 | [pop, r&b] |
| 1 | 2022 | US | Metro Boomin | 0iEtIxbK0KxaSlF7G42ZOp | Creepin' (with The Weeknd & 21 Savage) | 2dHHgzDwk4BJdRwy9uXhTO | 94 | [rap] |
| 2 | 2022 | US | Drake | 3TVXtAsR1Inumwj472S9r4 | Rich Flex | 1bDbXMyjaUIooNwFE9wn0N | 91 | [canadian hip hop, canadian pop, hip hop, rap,... |
| 3 | 2022 | US | SZA | 7tYKF4w9nC0nq9CsPZTHyP | Kill Bill | 1Qrg8KqiBpW07V7PNxwwwL | 92 | [pop, r&b] |
| 4 | 2022 | US | Metro Boomin | 0iEtIxbK0KxaSlF7G42ZOp | Superhero (Heroes & Villains) [with Future & C... | 0vjeOZ3Ft5jvAi9SBFJm1j | 88 | [rap] |
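Because `df_genre` holds one row per deduplicated artist ID, every track row should match exactly one genre row. pandas can check this during the merge. The sketch below uses small made-up frames (the IDs and track names are placeholders, not data from this project) to show the `validate` and `indicator` options of `DataFrame.merge`.

```python
import pandas as pd

# Toy stand-ins for df_tracks and df_genre (values are made up for illustration).
tracks = pd.DataFrame({
    'artist_id': ['a1', 'a2', 'a1'],
    'track_name': ['Song A', 'Song B', 'Song C'],
})
genres = pd.DataFrame({
    'artist_id': ['a1', 'a2'],
    'artist_genre': [['pop'], ['rap']],
})

merged = tracks.merge(
    genres,
    on='artist_id',
    how='left',
    validate='many_to_one',  # raises MergeError if genres has duplicate artist_id values
    indicator=True,          # adds a _merge column: 'both' or 'left_only'
)

print(len(merged) == len(tracks))          # a many-to-one left merge keeps the row count
print((merged['_merge'] == 'both').all())  # every track found a genre row
```

Applied to the real frames, the same two checks would confirm that the merge neither duplicated track rows nor left any track without a genre entry.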


In [18]:
# checking for duplicates: the same track can be returned under different track IDs when it is released both as a single and as part of an album
grouped = df_tracks_genre.groupby(['artist_name','track_name','year','market'], as_index=True).size()
grouped[grouped > 1].count()

202

In [19]:
df_tracks_genre[df_tracks_genre['track_name'] == "Kill Bill"]

| | year | market | artist_name | artist_id | track_name | track_id | popularity | artist_genre |
|---|---|---|---|---|---|---|---|---|
| 0 | 2022 | US | SZA | 7tYKF4w9nC0nq9CsPZTHyP | Kill Bill | 3OHfY25tqY28d16oZczHc8 | 93 | [pop, r&b] |
| 3 | 2022 | US | SZA | 7tYKF4w9nC0nq9CsPZTHyP | Kill Bill | 1Qrg8KqiBpW07V7PNxwwwL | 92 | [pop, r&b] |
| 308 | 2022 | GB | SZA | 7tYKF4w9nC0nq9CsPZTHyP | Kill Bill | 1Qrg8KqiBpW07V7PNxwwwL | 92 | [pop, r&b] |
| 311 | 2022 | GB | SZA | 7tYKF4w9nC0nq9CsPZTHyP | Kill Bill | 3OHfY25tqY28d16oZczHc8 | 93 | [pop, r&b] |
| 768 | 2022 | IN | SZA | 7tYKF4w9nC0nq9CsPZTHyP | Kill Bill | 3OHfY25tqY28d16oZczHc8 | 93 | [pop, r&b] |
| 795 | 2022 | IN | SZA | 7tYKF4w9nC0nq9CsPZTHyP | Kill Bill | 1Qrg8KqiBpW07V7PNxwwwL | 92 | [pop, r&b] |
| 941 | 2022 | NL | SZA | 7tYKF4w9nC0nq9CsPZTHyP | Kill Bill | 3OHfY25tqY28d16oZczHc8 | 93 | [pop, r&b] |
| 949 | 2022 | NL | SZA | 7tYKF4w9nC0nq9CsPZTHyP | Kill Bill | 1Qrg8KqiBpW07V7PNxwwwL | 92 | [pop, r&b] |


In [20]:
# keep the first occurrence of each (artist, track, year, market) combination
df_tracks_genre.drop_duplicates(subset=['artist_name','track_name','year','market'], inplace=True)

# doing the same grouping as before to verify the solution
grouped = df_tracks_genre.groupby(['artist_name','track_name','year','market'], as_index=True).size()
grouped[grouped > 1].count()

0

In [21]:
df_tracks_genre.shape

(9394, 8)
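`drop_duplicates` keeps the first occurrence by default, so which track ID survives depends on the order in which the search returned the rows. One possible alternative, shown here only as a sketch and not the approach used for the report, is to sort by popularity first so that the more popular version is retained. The two rows are the GB "Kill Bill" entries from the output above.

```python
import pandas as pd

keys = ['artist_name', 'track_name', 'year', 'market']

# The two GB "Kill Bill" rows shown earlier; the less popular ID happens to come first.
df = pd.DataFrame({
    'year': ['2022', '2022'],
    'market': ['GB', 'GB'],
    'artist_name': ['SZA', 'SZA'],
    'track_name': ['Kill Bill', 'Kill Bill'],
    'track_id': ['1Qrg8KqiBpW07V7PNxwwwL', '3OHfY25tqY28d16oZczHc8'],
    'popularity': [92, 93],
})

# Default behaviour: keep='first', so the row order decides the survivor.
first_kept = df.drop_duplicates(subset=keys)

# Alternative: sort so the most popular version comes first, then deduplicate.
most_popular = (
    df.sort_values('popularity', ascending=False)
      .drop_duplicates(subset=keys)
)

print(first_kept['popularity'].tolist())    # [92]
print(most_popular['popularity'].tolist())  # [93]
```

Either choice leaves one row per artist, track, year, and market; they differ only in which `track_id` and `popularity` value represent the track.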

PLEASE NOTE: If re-running this file, leave the line below commented out. Running it would overwrite the files created by our group, which could change the dataset used in the report.

In [22]:
#df_tracks_genre.to_csv('spotify_tracks.csv', index = False)  
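An alternative to commenting the export out is to make the write refuse to replace an existing file. The sketch below relies on Python's exclusive-creation mode (`'x'`) for `open`. `write_once` is a hypothetical helper, not part of the original notebook, and it takes a callable so that it can wrap any writer that accepts a file handle.

```python
def write_once(path, write_fn):
    """Create `path` and pass the open handle to write_fn; fail if it already exists."""
    # Mode 'x' is exclusive creation: open() raises FileExistsError for an existing file.
    # newline='' leaves line endings to the writer, as recommended for CSV output.
    with open(path, 'x', newline='') as handle:
        write_fn(handle)
```

Since `DataFrame.to_csv` accepts an open file handle, the export could then be written as `write_once('spotify_tracks.csv', lambda f: df_tracks_genre.to_csv(f, index=False))`. A second run would stop with `FileExistsError` and leave the original file untouched.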