# Intro

In this notebook, we read a pickle file containing a list of top songs through the years. Each element of this list is one year of top songs, itself a list of dicts containing date, name, and artist. The list should contain the Billboard Hot 100 songs for every week of the last 20 years, or as many years as are available for that genre.

Given this list, we query the Spotify API to build a dataframe that contains the same information, augmented with song ID, song features, and artist genre. This dataframe is saved as a pickle file at the end.

In [122]:
import spotipy
import spotipy.util as util

scope = ''
username = 'testname777123'
# Note: Fill this in with an authentication token. Make sure to delete it before committing.
token = '<YOUR_SPOTIFY_TOKEN>'

sp = spotipy.Spotify(auth=token)

Now let's devise a way to get a song id, given a song and an artist:

In [131]:
from unidecode import unidecode
from fuzzywuzzy import fuzz

TITLE_DIFF = 90
ARTIST_DIFF = 70

def contains_artist(goal, artists):
    # Use unidecode to turn special characters into plain ASCII, e.g. ñ --> n
    goal = unidecode(goal).lower()
    
    for artist in artists:
        art = unidecode(artist['name']).lower()
        if fuzz.partial_ratio(art, goal) >= ARTIST_DIFF:
            return True

    return False

def sort_fn(elem):
    return elem['popularity']

# If the song is not in the top 20 results, ignore it.
# TODO: how frequently does this happen? If frequently, maybe follow the pages?
def get_song(track, artist):
    query_result = sp.search(q=track, type='track', limit=20)
    query_result = query_result['tracks']['items']
    
    # Clean the query results
    result = []
    for elem in query_result:
        if fuzz.partial_ratio(track, elem['name']) < TITLE_DIFF:
            continue
        if not contains_artist(artist, elem['artists']):
            continue

        new_song = dict()
        new_song['name'] = unidecode(elem['name']).lower()
        new_song['id'] = elem['id']
        #new_song['artists'] = elem['artists']
        new_song['artist'] = unidecode(artist).lower()
        new_song['popularity'] = elem['popularity']
        
        result += [new_song]
    
    # Return the most popular song
    result.sort(key=sort_fn)
    
    if len(result) == 0:
        print("Could not find '%s' by '%s' in top 20 results" % (track, artist))
        return None
    
    return result[-1]
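
The TODO above asks whether to follow result pages when a song is not in the first 20 hits. Here is a minimal sketch of what that could look like, assuming the search function accepts `limit` and `offset` keyword arguments the way `sp.search` does. The names `search_pages` and `fake_search` are hypothetical, and `fake_search` is only a stand-in so the sketch runs without network access.

```python
def search_pages(search_fn, query, page_size=20, max_pages=3):
    """Yield track items page by page until a short page or max_pages is reached."""
    for page in range(max_pages):
        response = search_fn(q=query, type='track',
                             limit=page_size, offset=page * page_size)
        items = response['tracks']['items']
        yield from items
        # A page shorter than page_size means there are no more results.
        if len(items) < page_size:
            break


# Hypothetical stand-in for sp.search: a fixed catalog of 45 fake tracks.
def fake_search(q, type, limit, offset):
    catalog = [{'name': '%s (version %d)' % (q, i), 'id': str(i)} for i in range(45)]
    return {'tracks': {'items': catalog[offset:offset + limit]}}


items = list(search_pages(fake_search, "it wasn't me"))
print(len(items))
```

Capping `max_pages` keeps the number of requests per song bounded, which matters when the full song list is large.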

Let's create a cache of song mappings to limit our API requests to Spotify.

In [132]:
# Format is dict{artist : {song : data, song : data ...} ...}
memo = dict()
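
To illustrate how a nested cache of this shape behaves, here is a small sketch. It stores `None` for lookups that failed, so a song that could not be found is not queried again. The names `demo_memo`, `lookup`, and `fake_fetch` are hypothetical and exist only for this illustration; `fake_fetch` stands in for a real API call.

```python
demo_memo = dict()   # separate from the notebook's memo, same shape
calls = []           # records every simulated API call


def fake_fetch(title, artist):
    # Hypothetical stand-in for an API lookup.
    calls.append((artist, title))
    return {'id': 'abc123'} if title == 'known song' else None


def lookup(title, artist, fetch=fake_fetch):
    by_artist = demo_memo.setdefault(artist, dict())
    if title not in by_artist:       # never queried before
        by_artist[title] = fetch(title, artist)
    return by_artist[title]          # None means a cached miss


lookup('known song', 'some artist')
lookup('known song', 'some artist')
lookup('missing song', 'some artist')
lookup('missing song', 'some artist')
print(len(calls))
```

Testing `title not in by_artist` rather than the stored value is what separates "never looked up" from "looked up and not found".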

Create a wrapper that converts a whole list of songs:

In [133]:
import pandas as pd
import copy

# Expects song_list of type [dict{date, name, artist}, ...]
def convert_all_songs(song_list):
    result = []
    for song in song_list:
        artist = song['artist']
        title = song['name']
        if artist not in memo:
            memo[artist] = dict()
        
        # Check if we've already hit spotify for this
        if title in memo[artist]:
            temp = copy.copy(memo[artist][title])
            if temp is not None:
                temp['date'] = song['date']
                result += [temp]
            continue

        # Haven't hit spotify yet, so query and add to memo
        temp = get_song(title, artist)
        if temp is None:
            memo[artist][title] = None
            continue

        temp['date'] = song['date']
        
        # Add the audio features, remove unneeded ones
        """
        Here are the features:
        dict_keys(['danceability', 'energy', 'key', 'loudness', 'mode',
                   'speechiness', 'acousticness', 'instrumentalness', 'liveness',
                   'valence', 'tempo', 'type', 'id', 'uri', 'track_href',
                   'analysis_url', 'duration_ms', 'time_signature'])"""
        features = sp.audio_features(temp['id'])[0]
        del features['type']
        del features['id']
        del features['uri']
        del features['track_href']
        del features['analysis_url']
        #TODO: do we want time signature? Do we want to remove duration?
        del features['time_signature']
        temp.update(features)
        
        # Note: spotipy takes care of the 429 'retry-after' response from spotify
        result += [temp]
        memo[artist][title] = temp
    
    return pd.DataFrame(result).set_index('date')
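
The code above relies on spotipy to honor the 'retry-after' value that Spotify sends with a 429 response. If that ever had to be done by hand, the general pattern could look like the sketch below. `RateLimited` and `call_with_retry` are hypothetical names, not part of spotipy, and the `sleep` parameter is injectable so the demo does not actually wait.

```python
import time


class RateLimited(Exception):
    """Hypothetical error carrying the server's retry-after value in seconds."""

    def __init__(self, retry_after):
        super().__init__('rate limited')
        self.retry_after = retry_after


def call_with_retry(fn, max_attempts=3, sleep=time.sleep):
    """Call fn, waiting the requested number of seconds after each rate-limit error."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimited as err:
            if attempt == max_attempts - 1:
                raise                # out of attempts, surface the error
            sleep(err.retry_after)


# Demo: a function that is rate limited twice, then succeeds.
waits = []
attempts = {'count': 0}


def flaky():
    attempts['count'] += 1
    if attempts['count'] < 3:
        raise RateLimited(retry_after=2)
    return 'ok'


outcome = call_with_retry(flaky, sleep=waits.append)
print(outcome, waits)
```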

Here is an example of the functionality:

In [138]:
songs = [
    {'name': 'Case Of The Ex (Whatcha Gonna Do)', 'artist': 'Mya', 'date': '2000-12-23'},
    {'name': "It Wasn't Me", 'artist': 'Shaggy Featuring Ricardo "RikRok" Ducent', 'date': '2000-12-30'},
    {'name': "It Wasn't Me", 'artist': 'Shaggy Featuring Ricardo "RikRok" Ducent', 'date': '2000-12-23'},
    {'name': 'Case Of The Ex (Whatcha Gonna Do)', 'artist': 'Mya', 'date': '2000-12-23'}
]
"""
    {'name': 'Independent Women Part I', 'artist': "Destiny's Child", 'date': '2000-12-30'},
    {'name': 'Independent Women Part I', 'artist': "Destiny's Child", 'date': '2000-12-23'},
    {'name': 'He Loves U Not', 'artist': 'Dream', 'date': '2000-12-30'},
"""
x = convert_all_songs(songs)
print(x)

                                         name                      id  \
date                                                                    
2000-12-23  case of the ex (whatcha gonna do)  1ak0S3NhwWrUgNlQhJ1412   
2000-12-30                       it wasn't me  3WkibOpDF7cQ5xntM1epyf   
2000-12-23                       it wasn't me  3WkibOpDF7cQ5xntM1epyf   
2000-12-23  case of the ex (whatcha gonna do)  1ak0S3NhwWrUgNlQhJ1412   

                                              artist  popularity  \
date                                                               
2000-12-23                                       mya          53   
2000-12-30  shaggy featuring ricardo "rikrok" ducent          77   
2000-12-23  shaggy featuring ricardo "rikrok" ducent          77   
2000-12-23                                       mya          53   

            danceability  energy  key  loudness  mode  speechiness  \
date                                                                 
2000-12-23  

Cool! Now that we have this functionality, let's run it on our data of top songs to create the final dataframe.

In [137]:
import pickle

with open('hot100Results', 'rb') as f:
    top_songs = pickle.load(f)

top_songs = [song for year in top_songs for song in year] 

print(len(top_songs))

104200
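
The flattened list has one entry per song per week, so the same song appears many times. Because of the cache, only distinct (artist, name) pairs should trigger a Spotify query. Here is a small sketch for counting those pairs before starting a long run; `unique_pairs` is a hypothetical helper, and the sample data is copied from the example earlier in this notebook.

```python
def unique_pairs(song_list):
    """Return the set of distinct (artist, name) pairs in a list of chart entries."""
    return {(song['artist'], song['name']) for song in song_list}


sample = [
    {'name': 'Case Of The Ex (Whatcha Gonna Do)', 'artist': 'Mya', 'date': '2000-12-23'},
    {'name': "It Wasn't Me", 'artist': 'Shaggy Featuring Ricardo "RikRok" Ducent', 'date': '2000-12-30'},
    {'name': "It Wasn't Me", 'artist': 'Shaggy Featuring Ricardo "RikRok" Ducent', 'date': '2000-12-23'},
    {'name': 'Case Of The Ex (Whatcha Gonna Do)', 'artist': 'Mya', 'date': '2000-12-23'},
]

pairs = unique_pairs(sample)
print(len(pairs))
```

Running `len(unique_pairs(top_songs))` on the full list would give an upper bound on the number of search requests, since the cache keys on the exact artist and title strings.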
