# Lab | API wrappers - Create your collection of songs & audio features

**Instructions**
To move forward with the project, you need to create a collection of songs with their audio features - as large as possible!

These are the songs that we will cluster. Later, when the user inputs a song, we will find the cluster it belongs to and recommend another song from that cluster. The more songs you have, the more accurate and diverse your recommendations can be. That said, you may want the collection to be curated in some way: look for playlists that are diverse but still meet a certain standard.

The process of sending hundreds or thousands of requests can take some time - it's normal if you have to wait a few minutes (or, if you're ambitious, even hours) to get all the data you need.

One way to collect as many songs as possible is to start with all the songs in a big, diverse playlist, then visit every artist present in that playlist and grab every song from every album by that artist. The number of songs collected per playlist multiplies at each step!
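To get a feel for how quickly the collection can grow, here is a rough back-of-the-envelope sketch. The function name and all three averages are hypothetical, chosen only to illustrate the arithmetic; real numbers depend on the playlist and on how many albums Spotify lists per artist.

```python
def estimate_crawl(n_artists, albums_per_artist, tracks_per_album):
    """Rough upper bound for a playlist -> artists -> albums -> tracks walk."""
    n_albums = n_artists * albums_per_artist
    n_tracks = n_albums * tracks_per_album
    return n_albums, n_tracks


# Hypothetical averages: 250 artists, 10 albums each, 12 tracks per album
n_albums, n_tracks = estimate_crawl(250, 10, 12)
print(n_albums, n_tracks)  # 2500 30000
```

Each level multiplies the previous one, so a single playlist can plausibly turn into tens of thousands of tracks, and at least as many API requests if features are fetched one track at a time.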

## Import Libraries

In [59]:
import pandas as pd
from spotipy.oauth2 import SpotifyClientCredentials
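# tqdm.tqdm_notebook is deprecated; import the notebook progress bar under the same name
from tqdm.notebook import tqdm as tqdm_notebook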
import spotipy

## Initialize SpotiPy with user credentials

In [60]:
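import os

# Read the credentials from environment variables instead of hard-coding secrets in the notebook
client_id = os.environ['SPOTIPY_CLIENT_ID']
client_secret = os.environ['SPOTIPY_CLIENT_SECRET']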
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

## Collect data from one playlist

In [61]:
greatest_hits_ever = sp.user_playlist_tracks("spotify", "7EAqBCOVkDZcbccjxZmgjp")
greatest_hits_ever

{'href': 'https://api.spotify.com/v1/playlists/7EAqBCOVkDZcbccjxZmgjp/tracks?offset=0&limit=100&additional_types=track',
 'items': [{'added_at': '2017-10-23T10:26:23Z',
   'added_by': {'external_urls': {'spotify': 'https://open.spotify.com/user/117578005'},
    'href': 'https://api.spotify.com/v1/users/117578005',
    'id': '117578005',
    'type': 'user',
    'uri': 'spotify:user:117578005'},
   'is_local': False,
   'primary_color': None,
   'track': {'album': {'album_type': 'album',
     'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/74ASZWbe4lXaubB36ztrGX'},
       'href': 'https://api.spotify.com/v1/artists/74ASZWbe4lXaubB36ztrGX',
       'id': '74ASZWbe4lXaubB36ztrGX',
       'name': 'Bob Dylan',
       'type': 'artist',
       'uri': 'spotify:artist:74ASZWbe4lXaubB36ztrGX'}],
     'available_markets': ['AD',
      'AE',
      'AG',
      'AL',
      'AM',
      'AO',
      'AR',
      'AT',
      'AU',
      'AZ',
      'BA',
      'BB',
      'BD',
 

In [62]:
greatest_hits_ever["total"], len(greatest_hits_ever["items"]) 

(502, 100)

In [63]:
results = sp.user_playlist_tracks("spotify", "7EAqBCOVkDZcbccjxZmgjp")
tracks = results['items']

while results['next']:
    results = sp.next(results)
    tracks.extend(results['items'])

In [64]:
results = sp.user_playlist_tracks("spotify", "7EAqBCOVkDZcbccjxZmgjp")
tracks = results['items']

for oset in range(100,results['total'],100):
    results = sp.user_playlist_tracks("spotify", "7EAqBCOVkDZcbccjxZmgjp", offset=oset)
    tracks += results['items']
len(tracks)

502

In [65]:
list(range(100,results['total'],100))

[100, 200, 300, 400, 500]

In [66]:
len(tracks)

502
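The offset-based loop above can be written once as a reusable helper. This is a minimal sketch, not part of Spotipy: `fetch_all_items` and `fake_page` are hypothetical names, and the fake endpoint stands in for the real client so the example runs without credentials.

```python
def fetch_all_items(fetch_page, page_size=100):
    """Collect every item from an offset-paginated endpoint.

    fetch_page(offset, limit) must return a dict with 'items' and 'total' keys.
    """
    first = fetch_page(0, page_size)
    items = list(first['items'])
    # Request the remaining pages: offsets 100, 200, ... up to the reported total
    for offset in range(page_size, first['total'], page_size):
        items.extend(fetch_page(offset, page_size)['items'])
    return items


def fake_page(offset, limit):
    """Stand-in for the Spotify client: serves 502 numbered items in pages."""
    data = list(range(502))
    return {'items': data[offset:offset + limit], 'total': len(data)}


all_items = fetch_all_items(fake_page)
print(len(all_items))  # 502
```

With the real client, `fetch_page` could be something like `lambda offset, limit: sp.user_playlist_tracks("spotify", playlist_id, limit=limit, offset=offset)`, assuming `playlist_id` holds the playlist ID used above.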

In [67]:
# Each request returns at most 100 tracks, so follow the 'next' links until every page is collected:

def get_playlist_tracks(username, playlist_id):
    
    results = sp.user_playlist_tracks(username, playlist_id)
    tracks = results['items']
    
    while results['next']:
        results = sp.next(results)
        tracks.extend(results['items'])
    
    return tracks

tracks = get_playlist_tracks("spotify", "7EAqBCOVkDZcbccjxZmgjp")

In [68]:
# Getting all the artists IDs

def get_artists_ids_from_playlist(playlist_id):
    
    tracks_from_playlist = get_playlist_tracks("spotify", playlist_id)
    
    artists_ids = []
    
    for track in tracks_from_playlist:
        artists_info = track['track']['artists']
        
        for artist_info in artists_info:
            artists_ids.append(artist_info['id'])
            
    return list(set(artists_ids))

In [69]:
artists_ids = get_artists_ids_from_playlist("7EAqBCOVkDZcbccjxZmgjp")
artists_ids

['1zuJe6b1roixEKMOtyrEak',
 '3rJ3m1tM6vUgiWLjfV8sRf',
 '0JDkhL4rjiPNEp92jAgJnS',
 '2vDV0T8sxx2ENnKXds75e5',
 '6uSKeCyQEhvPC2NODgiqFE',
 '2ye2Wgw4gimLv2eAKyk1NB',
 '3WrFJ7ztbogyGnTHbHJFl2',
 '08GQAI4eElDnROBrJRGE0X',
 '4J8cVSLFJ4T4ReYLtehLj0',
 '1CYsQCypByMVgnv17qsSbQ',
 '6lOk7hCr8x3O9vHwylXyHR',
 '1WvziZcLLYLoMMdmQx7qcN',
 '293zczrfYafIItmnmM3coR',
 '2loYllWFfoWpoxC5YrJKc4',
 '7nwUJBm0HE4ZxD3f5cy5ok',
 '1eYhYunlNJlDoQhtYBvPsi',
 '1SQRv42e4PjEYfPhS0Tk9E',
 '776Uo845nYHJpNaStv1Ds4',
 '6TKOZZDd5uV5KnyC5G4MUt',
 '2bmixwMZXlkl2sbIbOfviq',
 '4MVyzYMgTwdP7Z49wAZHx0',
 '4xls23Ye9WR9yy3yYMpAMm',
 '51Blml2LZPmy7TTiAg47vQ',
 '7lKaTIgVek1R2lqpCulQmq',
 '33EUXrFKGjpUSGacqEHhU4',
 '0XNa1vTidXlvJ2gHSsRi4A',
 '1eEfMU2AhEo7XnKgL7c304',
 '2OpqcUtj10HHvGG6h9VYC5',
 '1FClsNYBUoNFtGgzeG74dW',
 '7oPftvlwr6VrsViSDV7fJY',
 '4BFMTELQyWJU1SwqcXMBm3',
 '2y8Jo9CKhJvtfeKOsYzRdT',
 '3vbKDsSS70ZX9D2OcvbZmS',
 '4S76LQXJD6N2uPcLhKejG8',
 '0vn7UBvSQECKJm2817Yf1P',
 '4y6J8jwRAwO4dssiSmN91R',
 '4nts0oxMT67lVUoi5Kjxrb',
 

In [70]:
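from collections import OrderedDict
# Drop repeated IDs while keeping their order (a no-op here, because the list was built from a set)
artists_ids = list(OrderedDict.fromkeys(artists_ids))
len(artists_ids)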

276
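A note on ordering: the functions above deduplicate with `list(set(...))`, which does not guarantee any particular order, so a slice such as `artists_ids[0:1]` may pick a different artist from one run to the next. If a stable order matters, one option is to deduplicate the raw list directly. The IDs below are made up for illustration.

```python
ids_with_repeats = ['a1', 'b2', 'a1', 'c3', 'b2']  # made-up IDs

# dict keys are unique and keep insertion order (Python 3.7+),
# so this keeps the first occurrence of each ID in its original position
unique_in_order = list(dict.fromkeys(ids_with_repeats))
print(unique_in_order)  # ['a1', 'b2', 'c3']
```

Applying this after `list(set(...))` cannot restore the original order; it would have to replace the `set` call inside the function.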

In [71]:
test_artists_ids = artists_ids[0:1]
test_artists_ids

['1zuJe6b1roixEKMOtyrEak']

In [72]:
def get_artists_from_playlist(playlist_id):
    
    tracks_from_playlist = get_playlist_tracks("spotify", playlist_id)
    
    artists = []
    
    for track in tracks_from_playlist:
        artists_info = track['track']['artists']
        
        for artist_info in artists_info:
            artists.append(artist_info['name'])
    
    return list(set(artists))

In [79]:
top_artists = get_artists_from_playlist("7EAqBCOVkDZcbccjxZmgjp")
top_artists

['Sex Pistols',
 'The Mamas & The Papas',
 'Tina Turner',
 'Sonny & Cher',
 'Billy Joel',
 'The Jam',
 'Andy Macpherson',
 'Black Sabbath',
 'James Brown',
 'Missy Elliott',
 'Fats Domino',
 'Al Green',
 'Gladys Knight & The Pips',
 'George Jones',
 'Ray Charles',
 'Martha Reeves & The Vandellas',
 'Alice Cooper',
 'The Police',
 'Pixies',
 'Them',
 'Earth, Wind & Fire',
 'Queen',
 'Paul McCartney',
 'John Mellencamp',
 'James Taylor',
 "Booker T. & the M.G.'s",
 'Simon & Garfunkel',
 'Iggy Pop',
 'Big Star',
 'Janis Joplin',
 'Fleetwood Mac',
 'Don Henley',
 'Metallica',
 'Jefferson Airplane',
 'The Box Tops',
 'Bill Haley',
 'R. Kelly',
 'Eric B. & Rakim',
 'Dionne Warwick',
 'Drifting Cowboys',
 'Elvis Presley',
 'Carl Perkins',
 'Parliament',
 'C. Hardin',
 'Elvis Costello',
 'Coldplay',
 'Nirvana',
 'Bob Dylan',
 'The Chantels',
 'The Isley Brothers',
 'The Velvet Underground',
 'Blondie',
 'The Troggs',
 'The Notorious B.I.G.',
 'U2',
 'New Order',
 'Percy Sledge',
 'Aretha Frank

In [80]:
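from collections import OrderedDict
# Drop repeated names while keeping their order (a no-op here, because the list was built from a set)
top_artists = list(OrderedDict.fromkeys(top_artists))
len(top_artists)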

276
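The two functions above walk the same playlist twice, once for IDs and once for names. As a sketch, both can be gathered in one pass over the playlist items. `artists_from_items` is a hypothetical helper, and the sample items are hand-written in the shape of the response shown earlier; the guard for a missing `track` is defensive, on the assumption that a playlist item can come back without track data.

```python
def artists_from_items(items):
    """Map artist ID -> artist name for every artist credited on the given playlist items."""
    artists = {}
    for item in items:
        track = item.get('track')
        if not track:
            # Skip items that carry no track data
            continue
        for artist in track['artists']:
            # Keep the first name seen for each ID
            artists.setdefault(artist['id'], artist['name'])
    return artists


sample_items = [
    {'track': {'artists': [{'id': '74ASZWbe4lXaubB36ztrGX', 'name': 'Bob Dylan'}]}},
    {'track': None},
    {'track': {'artists': [{'id': '74ASZWbe4lXaubB36ztrGX', 'name': 'Bob Dylan'}]}},
]

print(artists_from_items(sample_items))
```

Keeping the ID next to the name also means later steps can look up albums by ID instead of searching by name.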

## Use the 276 artists from the playlist to grow the dataset

https://medium.com/@samlupton/spotipy-get-features-from-your-favourite-songs-in-python-6d71f0172df0

I found this solution online and tried to make it work, but I guess my data was too big? I could not really find the issue. :(

In [84]:
def artist_tracks(artists):
    
    '''
    Takes a list of artist names, iterates through their Spotify albums, checks for 
    duplicate albums, then appends all the tracks in those albums to a list of lists
    '''
    
    # Each list in this list will be a track and its features
    tracks = []
    
    for artist in tqdm_notebook(artists):
        
        # Get the artist URI (a unique ID).
        # Caveat: this is a track search, and it takes the first artist credited on the
        # first matching track, which is not always the artist we searched for
        artist_uri = sp.search(artist)['tracks']['items'][0]['artists'][0]['uri']

        # Spotify has a lot of duplicate albums, but we'll cross-reference them with this list to avoid extra loops
        album_checker = []
        
        # The starting point of our loop of albums for those artists with more than 50
        n = 0
        
        while True:
            
            # Request each page of albums only once to avoid overloading Spotify.
            # Note the album_type = 'album'. This discounts singles, compilations and collaborations
            dict_list = sp.artist_albums(artist_uri, album_type='album', limit=50, offset=n)['items']
            
            # Stop when there are no more albums for this artist
            if not dict_list:
                break
            
            for album in tqdm_notebook(dict_list):

                # Identify the album by its featured artists, its name and its date
                check_this_album = [j['name'] for j in album['artists']]
                check_this_album.append(album['name'])
                check_this_album.append(album['release_date'])

                # Skip the album if it is already in the checklist
                if check_this_album in album_checker:
                    continue
                album_checker.append(check_this_album)

                # For every song on the album, get its descriptors and features in a list and add to the tracklist
                for song in sp.album_tracks(album['uri'])['items']:
                    features = sp.audio_features(song['uri'])[0]
                    
                    # Some tracks have no audio features (None); calling .values() on them would crash
                    if features is None:
                        continue
                    
                    tracks.append([artist, album['name'], album['uri'], song['name'],
                                   album['release_date']] + list(features.values()))
            
            # Go through the next 50 albums
            n += 50

    return tracks
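The function above asks for audio features one track at a time, which is a large part of why long runs are slow. Spotipy's `audio_features` also accepts a list of up to 100 track IDs or URIs, so the requests can be batched. The following is a minimal sketch of that idea: `chunked`, `audio_features_in_batches` and `FakeClient` are hypothetical names, and the fake client only mimics the shape of the real response so the example runs offline.

```python
def chunked(seq, size):
    """Yield consecutive slices of seq with at most `size` elements."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


def audio_features_in_batches(client, track_ids, batch_size=100):
    """Fetch audio features for many tracks, skipping tracks that have none."""
    features = []
    for batch in chunked(track_ids, batch_size):
        # One request per batch; an entry is None when a track has no features
        features.extend(f for f in client.audio_features(batch) if f is not None)
    return features


class FakeClient:
    """Hypothetical stand-in for spotipy.Spotify, used only to show the flow."""

    def __init__(self):
        self.calls = 0

    def audio_features(self, tracks):
        self.calls += 1
        # Pretend that IDs ending in '7' have no audio features
        return [None if t.endswith('7') else {'id': t, 'tempo': 120.0} for t in tracks]


fake = FakeClient()
ids = [f'track{i}' for i in range(250)]
result = audio_features_in_batches(fake, ids)
print(fake.calls, len(result))  # 3 225
```

With the real client, `audio_features_in_batches(sp, track_uris)` would replace hundreds of single-track requests with a handful of batched ones, assuming `track_uris` is a list collected from `sp.album_tracks`.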

In [88]:
artist_df = artist_tracks(top_artists[0:20])

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for artist in tqdm_notebook(artists):


HBox(children=(HTML(value=''), FloatProgress(value=0.0, max=20.0), HTML(value='')))

Please use `tqdm.notebook.tqdm` instead of `tqdm.tqdm_notebook`
  for i, album in tqdm_notebook(enumerate(dict_list)):


HBox(children=(HTML(value=''), FloatProgress(value=1.0, bar_style='info', layout=Layout(width='20px'), max=1.0…









In [91]:
def df_tracks(tracklist):
    
    '''
    Takes the output of artist_tracks (i.e. a list of lists),
    puts it in a dataframe and formats it.
    '''

    df = pd.DataFrame(tracklist, columns=['artist',
     'album_name',
     'album_uri',
     'track',
     'release_date'] + list(sp.audio_features('7tr2za8SQg2CI8EDgrdtNl')[0].keys()))

    df.rename(columns={'uri':'song_uri'}, inplace=True)

    df.drop_duplicates(subset=['artist', 'track', 'release_date'], inplace=True)

    # Reorder the cols to have identifiers first, auditory features last
    cols = ['artist', 'album_name', 'album_uri', 'track', 'release_date', 'id', 'song_uri', 'track_href',
     'analysis_url', 'type', 'danceability', 'energy', 'key',  'loudness', 'mode', 'speechiness',
     'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature']

    df = df[cols]
    
    return df

In [None]:
spotify_tracks = df_tracks(artist_df)

In [57]:
spotify_tracks.sample(10)

| Unnamed: 0 | artist | album_name | album_uri | track | release_date | id | song_uri | track_href | analysis_url | type | ... | loudness | mode | speechiness | acousticness | instrumentalness | liveness | valence | tempo | duration_ms | time_signature |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2883 | Black Sabbath | Vol. 4 | spotify:album:61j7phQkxuKzcoFsi0XtkQ | Wheels of Confusion (2009 - Remaster) | 1972-09-25 | 61zDwmHFXYYNjKTTpT5GOR | spotify:track:61zDwmHFXYYNjKTTpT5GOR | https://api.spotify.com/v1/tracks/61zDwmHFXYYN... | https://api.spotify.com/v1/audio-analysis/61zD... | audio_features | ... | -10.909 | 1 | 0.0464 | 0.0441 | 0.159 | 0.0868 | 0.559 | 120.186 | 479640 | 4 |
| 1264 | Andy Macpherson | WHO (Deluxe & Live At Kingston) | spotify:album:1xSNEy7p90VpUfuT60f8qV | Rockin' In Rage | 2020-10-30 | 1oNBwGteHZ2PZ3qeeVRzd3 | spotify:track:1oNBwGteHZ2PZ3qeeVRzd3 | https://api.spotify.com/v1/tracks/1oNBwGteHZ2P... | https://api.spotify.com/v1/audio-analysis/1oNB... | audio_features | ... | -4.781 | 1 | 0.0325 | 0.0714 | 4e-06 | 0.0931 | 0.432 | 117.134 | 244613 | 4 |
| 2762 | Black Sabbath | Technical Ecstasy (2009 Remastered Version) | spotify:album:6KSQLHjRof63DtaUu55SMm | All Moving Parts (Stand Still) | 1976-09-25 | 4Xw2xDDq1mDTpXzsEQooVe | spotify:track:4Xw2xDDq1mDTpXzsEQooVe | https://api.spotify.com/v1/tracks/4Xw2xDDq1mDT... | https://api.spotify.com/v1/audio-analysis/4Xw2... | audio_features | ... | -9.255 | 0 | 0.0501 | 0.162 | 0.0594 | 0.0551 | 0.743 | 165.483 | 298493 | 4 |
| 2599 | Black Sabbath | Live Evil (2008 Remaster) | spotify:album:6AOClmLV3vaZ83kjqXtwrq | Voodoo - Live; 2000 Remaster | 1982-12-01 | 5QRTewUwStrwQp3qRozYfC | spotify:track:5QRTewUwStrwQp3qRozYfC | https://api.spotify.com/v1/tracks/5QRTewUwStrw... | https://api.spotify.com/v1/audio-analysis/5QRT... | audio_features | ... | -10.429 | 1 | 0.0927 | 0.00444 | 0.569 | 0.587 | 0.346 | 117.623 | 367933 | 4 |
| 56 | Sex Pistols | More Product | spotify:album:3zpM4FwpPQjFxS13uDVs0M | The Very Name 'Sex Pistols' - Remastered 1993 | 2017-08-25 | 5GdGkiMpf5a2ctflGxKFzS | spotify:track:5GdGkiMpf5a2ctflGxKFzS | https://api.spotify.com/v1/tracks/5GdGkiMpf5a2... | https://api.spotify.com/v1/audio-analysis/5GdG... | audio_features | ... | -14.443 | 1 | 0.696 | 0.946 | 0.0 | 0.593 | 0.696 | 131.957 | 327813 | 3 |
| 914 | Billy Joel | Glass Houses | spotify:album:5sztejERqpktXEdemlUvU5 | All for Leyna | 1980-03-12 | 57hJxdJGm8kZMU0xPGNBAA | spotify:track:57hJxdJGm8kZMU0xPGNBAA | https://api.spotify.com/v1/tracks/57hJxdJGm8kZ... | https://api.spotify.com/v1/audio-analysis/57hJ... | audio_features | ... | -5.989 | 0 | 0.0279 | 0.284 | 6e-06 | 0.191 | 0.899 | 142.814 | 253600 | 4 |
| 1837 | Andy Macpherson | Quadrophenia | spotify:album:3JV6BIIXo3mj6GLIGH9p8a | Doctor Jimmy | 1973-10-19 | 3oW69AyIAznSmb7D2Y21cx | spotify:track:3oW69AyIAznSmb7D2Y21cx | https://api.spotify.com/v1/tracks/3oW69AyIAznS... | https://api.spotify.com/v1/audio-analysis/3oW6... | audio_features | ... | -10.064 | 1 | 0.0437 | 0.215 | 0.279 | 0.121 | 0.289 | 146.763 | 515960 | 4 |
| 427 | Tina Turner | Twenty Four Seven (Expanded Version) | spotify:album:0WwNekBN3mKT5gFN6oyAK8 | Don't Leave Me This Way - Recorded Live in Lon... | 1999-11-01 | 0zneZP2jK1iNzHUKGOQObI | spotify:track:0zneZP2jK1iNzHUKGOQObI | https://api.spotify.com/v1/tracks/0zneZP2jK1iN... | https://api.spotify.com/v1/audio-analysis/0zne... | audio_features | ... | -6.377 | 1 | 0.0329 | 0.0214 | 2e-06 | 0.916 | 0.191 | 80.827 | 263773 | 4 |
| 1431 | Andy Macpherson | Quadrophenia - Live In London | spotify:album:7LxuW6EXCxjVcGe2BxzfcZ | Won't Get Fooled Again - Live In London / 2013 | 2014-01-01 | 2Tj0Ilyeg4GefiUJAUuD6L | spotify:track:2Tj0Ilyeg4GefiUJAUuD6L | https://api.spotify.com/v1/tracks/2Tj0Ilyeg4Ge... | https://api.spotify.com/v1/audio-analysis/2Tj0... | audio_features | ... | -5.5 | 1 | 0.0558 | 0.00908 | 0.0456 | 0.99 | 0.151 | 134.576 | 546253 | 4 |
| 2009 | Andy Macpherson | Tommy (Deluxe Edition) | spotify:album:2srjzxgFaYLNh8UlJPAJ8b | Cousin Kevin | 1969-05-23 | 5BikBqwnM8AZ1SG45BtcGK | spotify:track:5BikBqwnM8AZ1SG45BtcGK | https://api.spotify.com/v1/tracks/5BikBqwnM8AZ... | https://api.spotify.com/v1/audio-analysis/5Bik... | audio_features | ... | -10.103 | 1 | 0.0606 | 0.208 | 0.0 | 0.178 | 0.214 | 112.934 | 246608 | 4 |


In [52]:
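# index=False keeps the row index out of the file, so reloading it does not add an "Unnamed: 0" column
spotify_tracks.to_csv('spotify_tracks.csv', index=False)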