# Intro

In this notebook, we read a pickle file containing a list of top songs through the years. Each element of this list is a year of top songs, itself a list of dicts containing date, name, and artist. The list should contain the Billboard Hot 100 songs for every week of the last 20 years, or for as many years as are available for that genre.

Given this list, we query the Spotify API to build a dataframe that contains the same information, augmented with song ID, song features, and artist genre. This dataframe is saved as a pickle file at the end.
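
As a rough illustration of the input described above, the sketch below builds a tiny stand-in for the unpickled list. The two songs are taken from outputs shown later in this notebook; the name `example_top_songs` and the grouping into two inner lists are assumptions made only for illustration, and the real file is far larger.

```python
# Hypothetical miniature of the unpickled structure: a list of lists of dicts.
example_top_songs = [
    [
        {'name': "Doesn't Really Matter", 'artist': 'Janet', 'date': '2000-09-09'},
    ],
    [
        {'name': 'If I Am', 'artist': 'Nine Days', 'date': '2000-12-30'},
    ],
]

# Every entry carries the three fields the rest of the notebook relies on.
for chart in example_top_songs:
    for song in chart:
        assert {'name', 'artist', 'date'} <= set(song)
```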

In [56]:
import spotipy
import spotipy.util as util

scope = ''
username = 'testname777123'
# Note: This should be filled in with an authentication token. Make sure to delete it before committing.
token = ''

sp = spotipy.Spotify(auth=token)
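
One way to avoid pasting the token into the notebook (and having to delete it before committing) is to read it from an environment variable. This is only a sketch: the variable name `SPOTIFY_TOKEN` and the helper `load_token` are hypothetical and not part of the original notebook.

```python
import os

def load_token(env=os.environ, var_name='SPOTIFY_TOKEN'):
    """Return the token stored in `var_name`, or '' if it is not set."""
    # `SPOTIFY_TOKEN` is an assumed name; use whatever your environment defines.
    return env.get(var_name, '')

token = load_token()
# sp = spotipy.Spotify(auth=token)  # same call as in the cell above
```

With this in place, the token lives in the shell environment rather than in the saved notebook.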

Let's create a cache of song mappings to limit our API requests to Spotify.

In [27]:
# Format is dict{artist : {song : data, song : data ...} ...}
memo = dict()
unique_unfound = 0
total_unfound = 0
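
The nested-dict cache can be exercised on its own, without touching the API. The sketch below is a minimal illustration of the lookup pattern; `cached_lookup`, `fake_fetch`, and `demo_memo` are hypothetical names, and `fake_fetch` merely stands in for the real search call.

```python
def cached_lookup(memo, artist, title, fetch):
    """Return memo[artist][title], calling fetch(title, artist) only on a miss."""
    songs = memo.setdefault(artist, {})
    if title not in songs:
        songs[title] = fetch(title, artist)  # may be None if nothing matched
    return songs[title]

calls = []

def fake_fetch(title, artist):
    # Stand-in for the real API call; records each invocation.
    calls.append((title, artist))
    return {'name': title, 'artist': artist}

demo_memo = {}
cached_lookup(demo_memo, 'nas', 'nastradamus', fake_fetch)
cached_lookup(demo_memo, 'nas', 'nastradamus', fake_fetch)
print(len(calls))  # 1: the second lookup is served from the cache
```

Storing `None` for misses as well means a song that could not be found is not searched for again.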

Now let's devise a way to get a song id, given a song and an artist:

In [3]:
from unidecode import unidecode
from fuzzywuzzy import fuzz

TITLE_DIFF = 70
ARTIST_DIFF = 70

def contains_artist(goal, artists):
    # Use unidecode to turn special characters to regular, i.e. ñ --> n
    goal = unidecode(goal).lower()
    
    for artist in artists:
        art = unidecode(artist['name']).lower()
        if fuzz.partial_ratio(art, goal) >= ARTIST_DIFF:
            return True

    return False

# If the song is not among the returned search results (up to 40, per `limit` below), ignore it.
# Note: the log message in get_song still says 'top 20'.
# TODO: how frequently does this happen? If frequently, maybe follow the pages?
def get_song(track, artist):
    global unique_unfound
    track = unidecode(track).lower()
    artist = unidecode(artist).lower()
    
    extra_search_words = {'is', 'a', 'i', 'the'}
    
    split1 = track.split()
    first = split1[0]
    if split1[0] in extra_search_words and len(split1) > 1:
        first += ' ' + split1[1]
    
    split2 = artist.split()
    second = split2[0]
    if split2[0] in extra_search_words and len(split2) > 1:
        second += ' ' + split2[1]
        
    to_search = first + ' ' + second
    query_result = sp.search(q=to_search, type='track', limit=40)
    query_result = query_result['tracks']['items']
    
    # Clean the query results
    result = []
    for elem in query_result:
        cur_name = unidecode(elem['name']).lower()
        """
        arts = []
        for a in elem['artists']:
            arts += [a['name']]
        print(elem['name'], arts, fuzz.partial_ratio(track, cur_name), contains_artist(artist, elem['artists']))
        """
        if fuzz.partial_ratio(track, cur_name) < TITLE_DIFF:
            continue
        if not contains_artist(artist, elem['artists']):
            continue

        new_song = dict()
        new_song['name'] = cur_name
        new_song['id'] = elem['id']
        new_song['artist'] = artist
        new_song['popularity'] = elem['popularity']
        
        result += [new_song]
    
    # Return the most popular song
    result.sort(key=lambda d: d['popularity'])
    
    if len(result) == 0:
        print("Could not find '%s' by '%s' in top 20 results" % (track, artist))
        unique_unfound += 1
        return None
    
    return result[-1]
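
The query-building step of `get_song` can be isolated as a pure function, which makes it easy to see what is actually sent to the search endpoint. This is a sketch only: `leading_words` and `build_query` are hypothetical helpers, and the `unidecode` normalization used in the original is left out to keep the example dependency-free.

```python
EXTRA_SEARCH_WORDS = {'is', 'a', 'i', 'the'}

def leading_words(text):
    """Return the first word, plus the second when the first is a filler word."""
    words = text.split()
    if not words:
        return ''  # guard against empty titles or artists
    if words[0] in EXTRA_SEARCH_WORDS and len(words) > 1:
        return words[0] + ' ' + words[1]
    return words[0]

def build_query(track, artist):
    """Combine the leading words of the title and the artist into one query."""
    return leading_words(track.lower()) + ' ' + leading_words(artist.lower())

print(build_query('the business of emotion', 'big data featuring white sea'))
# the business big
```

Because only the leading words are searched, the fuzzy title and artist checks afterwards do the real filtering.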

Create a wrapper that does this for a whole list of songs:

In [51]:
import pandas as pd
import copy

CHUNK_SIZE = 50

def divide_chunks(l, n):
    # Split l into consecutive slices of length n (the last one may be shorter)
    return [l[i:i + n] for i in range(0, len(l), n)]

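def del_unneeded(d):
    # Spotify may return None for a track without audio features; pass it through
    if d is None:
        return None
    del d['type']
    del d['id']
    del d['uri']
    del d['track_href']
    del d['analysis_url']
    #TODO: do we want time signature? Do we want to remove duration?
    del d['time_signature']
    
    return d

def combine_fn(a, b):
    # Skip the merge when no audio features came back for this track
    if b is not None:
        a.update(b)
    return a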

#TODO: should integrate 'retry_after' when response 429, to deal with rate limiting
# Expects song_list of type [dict{date, name, artist, position}....]
def convert_all_songs(song_list):
    global unique_unfound
    global total_unfound
    
    unique_unfound = 0
    total_unfound = 0
    
    result_data = []
    position = 0
    for song in song_list:
        artist = unidecode(song['artist']).lower()
        title = unidecode(song['name']).lower()
        if artist not in memo:
            memo[artist] = dict()
        
        # Check if we've already hit the API for this song
        if title in memo[artist]:
            temp = copy.copy(memo[artist][title])
        else:
            # Note: spotipy takes care of the 429 'retry-after' response from spotify
            temp = get_song(title, artist)
            
        if temp is not None:
            temp['date'] = song['date']
            temp['position'] = song['position']
            result_data += [temp]
        else:
            total_unfound += 1
        
        memo[artist][title] = temp

    # Batch the audio-feature requests into groups of CHUNK_SIZE (50) IDs to stay within Spotify's per-request limit
    song_list = divide_chunks(result_data, CHUNK_SIZE)
    result_features = []
    for sublist in song_list:
        sublist = list(map(lambda d: d['id'], sublist))
        # Add the audio features, removing the unneeded keys
        """
        Here are the features:
        dict_keys(['danceability', 'energy', 'key', 'loudness', 'mode',
                   'speechiness', 'acousticness', 'instrumentalness', 'liveness',
                   'valence', 'tempo', 'type', 'id', 'uri', 'track_href',
                   'analysis_url', 'duration_ms', 'time_signature'])"""
        features = sp.audio_features(sublist)
        features = list(map(del_unneeded, features))
        result_features += features
    
    # merge the features into the other song data
    result = list(zip(result_data, result_features))
    result = list(map(lambda a: combine_fn(a[0], a[1]), result))
    
    return pd.DataFrame(result).reset_index()
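
The batching-and-merging half of `convert_all_songs` can be tried offline with a stand-in for `sp.audio_features`. The sketch below assumes the feature call returns one entry per ID, in order, and may return `None` for a track with no features; `fake_audio_features` and `attach_features` are hypothetical names, and the chunk size of 2 is chosen only to keep the demo small.

```python
def divide_chunks(items, n):
    """Split items into consecutive slices of length n."""
    return [items[i:i + n] for i in range(0, len(items), n)]

def fake_audio_features(ids):
    # Stand-in for sp.audio_features: one dict per id, or None when unavailable.
    return [None if i == 'missing' else {'id': i, 'tempo': 120.0} for i in ids]

def attach_features(songs, fetch, chunk_size=2):
    """Fetch features chunk by chunk and merge them into copies of the songs."""
    merged = []
    for chunk in divide_chunks(songs, chunk_size):
        features = fetch([s['id'] for s in chunk])
        for song, feat in zip(chunk, features):
            row = dict(song)  # copy so the input is left untouched
            if feat is not None:
                row.update({k: v for k, v in feat.items() if k != 'id'})
            merged.append(row)
    return merged

demo = [{'id': 'a', 'name': 'x'}, {'id': 'missing', 'name': 'y'}, {'id': 'c', 'name': 'z'}]
rows = attach_features(demo, fake_audio_features)
```

Zipping within each chunk keeps every song aligned with its own features even when some entries are missing.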

Here is an example of the functionality:

In [52]:
songs = [
    {'name': 'the business of emotion', 'artist': "big data featuring white sea", 'date': '2000-12-23'}
]
"""
    {'name': 'Independent Women Part I', 'artist': "Destiny's Child", 'date': '2000-12-23'}, not in
    {'name': 'He Loves U Not', 'artist': 'Dream', 'date': '2000-12-30'}, not in
    {'name': 'She Misses Him', 'artist': 'Tim Rushlow', 'date': '2000-12-30'}, not in
    {'name': 'Get Crunked Up', 'artist': 'Iconz Featuring Tony Manshino', 'date': '2000-12-30'} 69
    
    {'name': 'NAStradamus', 'artist': 'Nas', 'date': '2000-12-30'}
    {'name': 'Independent Women Part I', 'artist': "Destiny's Child", 'date': '2000-12-30'},
    {'name': 'Case Of The Ex (Whatcha Gonna Do)', 'artist': 'Mya', 'date': '2000-12-23'},
    {'name': "It Wasn't Me", 'artist': 'Shaggy Featuring Ricardo "RikRok" Ducent', 'date': '2000-12-30'},
    {'name': "It Wasn't Me", 'artist': 'Shaggy Featuring Ricardo "RikRok" Ducent', 'date': '2000-12-23'},
    {'name': 'Case Of The Ex (Whatcha Gonna Do)', 'artist': 'Mya', 'date': '2000-12-23'},
    {'name': 'I Wish', 'artist': 'R. Kelly', 'date': '??daf'}
"""
#Could not find 'fiesta' by 'r. kelly featuring jay-z' in top 20 results

#x = convert_all_songs(songs)
#x = get_song('NAStradamus', 'Nas')
x = get_song(songs[0]['name'], songs[0]['artist'])
print(x)

{'name': 'the business of emotion (feat. white sea)', 'id': '2TKso10B9HcIWy7HR1oP2g', 'artist': 'big data featuring white sea', 'popularity': 39}


Cool! Now that we have this functionality, let's run it on our top-songs data to create the final dataframe.

In [53]:
import pickle

with open('data/results/hot100Results', 'rb') as f:
    top_songs = pickle.load(f)

for elem in top_songs:
    for i in range(len(elem)):
        elem[i]['position'] = i + 1
top_songs = [song for year in top_songs for song in year] 
print(top_songs[99])

print(len(top_songs))

{'name': 'If I Am', 'artist': 'Nine Days', 'date': '2000-12-30', 'position': 100}
260685
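
The position assignment and flattening above can also be written as a single helper using `enumerate`. This is a sketch on made-up data: `flatten_with_positions` and `demo_charts` are hypothetical, and it assumes each inner list is already in rank order, as the `position = i + 1` assignment implies.

```python
def flatten_with_positions(charts):
    """Flatten a list of charts, numbering each chart's songs from 1."""
    flat = []
    for chart in charts:
        for position, song in enumerate(chart, start=1):
            # Build a new dict so the original entries are not mutated
            flat.append({**song, 'position': position})
    return flat

demo_charts = [
    [{'name': 'first'}, {'name': 'second'}],
    [{'name': 'third'}],
]
flat = flatten_with_positions(demo_charts)
print([(s['name'], s['position']) for s in flat])
# [('first', 1), ('second', 2), ('third', 1)]
```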


In [54]:
top_songs[1600]

{'name': "Doesn't Really Matter",
 'artist': 'Janet',
 'date': '2000-09-09',
 'position': 1}

In [60]:
import time
start = time.time()
result = convert_all_songs(top_songs)
print(time.time() - start)
print("Unique unfound: %s, total unfound: %s" % (unique_unfound, total_unfound))

Could not find 'love sets you free' by 'kelly price & friends' in top 20 results
Could not find 'this gift' by '98 degrees' in top 20 results
Could not find 'roll out (my business)' by 'ludacris' in top 20 results
Could not find 'take away' by 'missy "misdemeanor" elliott featuring ginuwine & tweet' in top 20 results
Could not find 'ooohhhwee' by 'master p featuring weebie' in top 20 results
Could not find 'from her mama (mama got a**)' by 'juvenile' in top 20 results
Could not find 'never too far/hero medley' by 'mariah carey' in top 20 results
Could not find 'am to pm' by 'christina milian' in top 20 results
Could not find 'do u wanna roll (dolittle theme)' by 'r.l., snoop dogg & lil' kim' in top 20 results
Could not find 'god bless america' by 'daniel rodriguez' in top 20 results
Could not find 'so complicated' by 'carolyn dawn johnson' in top 20 results
Could not find 'what's going on' by 'all star tribute' in top 20 results
Could not find 'one minute man' by 'missy "misdemeanor" e

Could not find 'ain't no sunshine' by 'kris allen' in top 20 results
Could not find 'a change is gonna come' by 'adam lambert' in top 20 results
Could not find 'apologize' by 'kris allen' in top 20 results
Could not find '3am' by 'eminem' in top 20 results
Could not find 'f*ck you' by 'lily allen' in top 20 results
Could not find 'top of the world' by 'the pussycat dolls' in top 20 results
Could not find 'j**z in my pants' by 'the lonely island' in top 20 results
Could not find 'boots' by 'the killers' in top 20 results
Could not find 'one love (people get ready)' by 'glee cast' in top 20 results
Could not find 'the big bang' by 'rockmafia' in top 20 results
Could not find 'gonna get this' by 'hannah montana featuring iyaz' in top 20 results
Could not find 'little white church' by 'little big town' in top 20 results
Could not find 'la la la' by 'auburn featuring iyaz' in top 20 results
Could not find 'eenie meenie' by 'sean kingston & justin bieber' in top 20 results
Could not find 'to

In [61]:
with open('hot100.df', 'wb') as f:
    pickle.dump(result, f)