# Intro

In this notebook, we read a pickle file containing a list of songs through the years. Each element of the list should be a dict with the keys week, name, and artist. The list should cover the Billboard 100 songs for every week of the last 10 years, where week is the number of weeks after January 1st, 2009 that the entry belongs to.

Given this list, we will query the Spotify API to build a dataframe that contains the same information, augmented with song ID, song features, and artist genre. This dataframe will be saved as a pickle file at the end.
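To make the week field concrete, here is a minimal sketch of how a chart date could be mapped to a week index and what one input record looks like. The helper name `week_index` and the choice to floor partial weeks are assumptions for illustration; the code that actually produced the pickle file may count weeks differently.

```python
from datetime import date

# Reference date taken from the description above.
EPOCH = date(2009, 1, 1)

def week_index(chart_date):
    """Hypothetical helper: whole weeks elapsed since EPOCH (partial weeks floored)."""
    return (chart_date - EPOCH).days // 7

# One input record in the expected shape: a dict with week, name, and artist.
# The values are illustrative only, not real chart data.
record = {
    'week': week_index(date(2009, 1, 15)),
    'name': 'example song',
    'artist': 'example artist',
}
print(record)
```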

In [180]:
import spotipy
import spotipy.util as util

scope = ''
username = 'testname777123'
# Note: Fill this in with an authentication token. Make sure to delete it before committing.
token = '<YOUR_SPOTIFY_ACCESS_TOKEN>'

sp = spotipy.Spotify(auth=token)

Now let's devise a way to get a song ID, given a song name and an artist:

In [181]:
def contains_artist(goal, artists):
    for artist in artists:
        if artist['name'].lower() == goal.lower():
            return True
    
    return False

def sort_fn(elem):
    return elem['popularity']

# If the song is not in the top 20 results, ignore it.
# TODO: how frequently does this happen? If frequently, maybe follow the pages?
def get_song(track, artist):
    query_result = sp.search(q=track, type='track', limit=20)
    query_result = query_result['tracks']['items']
    
    # Clean the query results
    result = []
    for elem in query_result:
        if not contains_artist(artist, elem['artists']):
            continue

        new_song = dict()
        new_song['name'] = elem['name'].lower()
        new_song['id'] = elem['id']
        #new_song['artists'] = elem['artists']
        new_song['artist'] = artist.lower()
        new_song['popularity'] = elem['popularity']
        
        result += [new_song]
    
    # Return the most popular song
    result.sort(key=sort_fn)
    
    if len(result) == 0:
        return None
    
    return result[-1]
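The filtering and ranking step can be tried out without touching the network. The sketch below is a hypothetical offline variant of the selection logic in `get_song`: the function name `pick_most_popular` and the fake result items are made up, and the items only mimic the fields that `get_song` reads (name, id, popularity, artists).

```python
def pick_most_popular(items, artist):
    """Hypothetical offline version of the selection step in get_song."""
    goal = artist.lower()
    # Keep only the tracks that list the requested artist (case-insensitive).
    matches = [
        item for item in items
        if any(a['name'].lower() == goal for a in item['artists'])
    ]
    if not matches:
        return None
    # max() picks the highest popularity directly instead of sorting first.
    best = max(matches, key=lambda item: item['popularity'])
    return {
        'name': best['name'].lower(),
        'id': best['id'],
        'artist': goal,
        'popularity': best['popularity'],
    }

# Made-up items mimicking the fields read from the search response.
fake_items = [
    {'name': 'Song A', 'id': 'id-1', 'popularity': 40,
     'artists': [{'name': 'Some Artist'}]},
    {'name': 'Song A (Live)', 'id': 'id-2', 'popularity': 55,
     'artists': [{'name': 'some artist'}, {'name': 'Guest'}]},
    {'name': 'Song A', 'id': 'id-3', 'popularity': 90,
     'artists': [{'name': 'Other Artist'}]},
]
print(pick_most_popular(fake_items, 'Some Artist'))
print(pick_most_popular(fake_items, 'Nobody'))
```

One difference to keep in mind: when two matches share the top popularity, `max()` returns the first of them, whereas sorting and taking the last element returns the last.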

Next, create a wrapper that does the same for a whole list of songs:

In [189]:
import pandas as pd

#TODO: should integrate 'retry_after' when response 429, to deal with rate limiting
# Expects song_list of type [dict{week, name, artist}....]
def convert_all_songs(song_list):
    result = []
    for song in song_list:
        temp = get_song(song['name'], song['artist'])
        if temp is None:
            print('Could not find \'%s\' by \'%s\'.' % (song['name'], song['artist']))
            continue
        temp['week'] = song['week']
        
        # Add the audio features and remove the ones we do not need.
        # The features dict has these keys:
        #   danceability, energy, key, loudness, mode, speechiness,
        #   acousticness, instrumentalness, liveness, valence, tempo,
        #   type, id, uri, track_href, analysis_url, duration_ms,
        #   time_signature
        features = sp.audio_features(temp['id'])[0]
        del features['type']
        del features['id']
        del features['uri']
        del features['track_href']
        del features['analysis_url']
        #TODO: do we want time signature? Do we want to remove duration?
        del features['time_signature']
        temp.update(features)
        
        # Note: spotipy takes care of the 429 'retry-after' response from spotify
        result += [temp]
    
    return pd.DataFrame(result).set_index('week')
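The run of `del` statements in `convert_all_songs` can also be written as a single filtering step. This is a sketch only: `prune_features` and `DROP_KEYS` are hypothetical names, the sample values are made up, and the `None` guard assumes the feature lookup may yield no data for some tracks.

```python
# Keys that convert_all_songs deletes from each audio-features dict.
DROP_KEYS = {'type', 'id', 'uri', 'track_href', 'analysis_url', 'time_signature'}

def prune_features(features):
    """Hypothetical helper: copy of features without DROP_KEYS, or {} for None."""
    if features is None:
        return {}
    return {k: v for k, v in features.items() if k not in DROP_KEYS}

# Made-up values; real ones would come from the API.
sample = {
    'danceability': 0.5,
    'energy': 0.7,
    'duration_ms': 200000,
    'type': 'audio_features',
    'id': 'id-1',
    'uri': 'example-uri',
    'track_href': 'example-href',
    'analysis_url': 'example-url',
    'time_signature': 4,
}
pruned = prune_features(sample)
print(pruned)
```

Unlike `del`, this builds a new dict and leaves the original untouched, and it does not raise a `KeyError` if one of the keys is missing.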

Here is an example of the functionality:

In [190]:
songs = [
    {'week':0, 'name':'gravy train', 'artist':'yung gravy'},
    {'week':1, 'name':'shape of you', 'artist':'ed sheeran'},
    {'week':0, 'name':'shouldn\'t exist', 'artist':'not a real artist pls 42'},
    {'week':1, 'name':'another day of sun', 'artist':'la la land cast'}
]
x = convert_all_songs(songs)
print(x)

Could not find 'shouldn't exist' by 'not a real artist pls 42'.
                    name                      id           artist  popularity  \
week                                                                            
0            gravy train  3qOkkP9JLltoUvaLjEXTmW       yung gravy          68   
1           shape of you  7qiZfU4dY1lWllzX7mPBI3       ed sheeran          87   
1     another day of sun  5kRBzRZmZTXVg8okC7SJFZ  la la land cast          68   

      danceability  energy  key  loudness  mode  speechiness  acousticness  \
week                                                                         
0            0.733   0.622    6    -7.689     0       0.1400        0.3240   
1            0.825   0.652    1    -3.183     0       0.0802        0.5810   
1            0.588   0.742    8    -6.757     1       0.0528        0.0162   

      instrumentalness  liveness  valence    tempo  duration_ms  
week                                                             
0      

Cool! Now that we have this functionality, let's run it on our data of top songs to create the final dataframe.

In [None]:
import pickle
# pickle.load needs an open binary file object, not a path string.
with open('top_songs.pickle', 'rb') as f:
    top_songs = pickle.load(f)
print(top_songs[:10])
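Note that `pickle.load` and `pickle.dump` work on open binary file objects rather than on path strings. The round trip below is a minimal sketch using a temporary directory; the file name `top_songs_example.pickle` and the two records are made up for illustration.

```python
import os
import pickle
import tempfile

# Made-up records in the expected [dict{week, name, artist}, ...] shape.
songs = [
    {'week': 0, 'name': 'example song', 'artist': 'example artist'},
    {'week': 1, 'name': 'another song', 'artist': 'another artist'},
]

# Write to a throwaway location so nothing in the project is overwritten.
path = os.path.join(tempfile.mkdtemp(), 'top_songs_example.pickle')
with open(path, 'wb') as f:
    pickle.dump(songs, f)

# Read it back the same way: binary mode, file object passed to pickle.load.
with open(path, 'rb') as f:
    loaded = pickle.load(f)

print(loaded == songs)
```

For the final dataframe, pandas offers `DataFrame.to_pickle` and `pd.read_pickle`, which do accept a path directly.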