# Lab | Unsupervised learning intro

**Instructions** 

It's time to cluster the songs you collected. Remember that the ultimate goal of this little project is to improve artist recommendations. Clustering the songs lets the recommendation system restrict its suggestions to songs in the same cluster, that is, songs with similar audio features.

The experiments you did with the Spotify API and the Billboard web scraping will allow you to create a pipeline such that when the user enters a song, you:

- Check whether or not the song is in the Billboard Hot 200.
- Collect the audio features from the Spotify API.

After that, you want to send the Spotify audio features of the submitted song to the clustering model, which should return a cluster number.

We want to have as many songs as possible to create the clustering model, so we will add the songs you collected to a bigger dataset available on Kaggle containing 160 thousand songs.

## Import Libraries

In [249]:
import pandas as pd
import numpy as np
from sklearn import datasets

from bs4 import BeautifulSoup
import requests

from spotipy.oauth2 import SpotifyClientCredentials
from tqdm.notebook import tqdm as tqdm_notebook  # tqdm.tqdm_notebook is deprecated
import spotipy

## Import Data

In [250]:
data = pd.read_csv(r"C:\Users\nicol\lab-api-wrappers\spotify_tracks.csv")
data.sample(10)

Unnamed: 0.1,Unnamed: 0,artist,album_name,album_uri,track,release_date,id,song_uri,track_href,analysis_url,...,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
6302,6624,Pixies,"Live from MacEwan Hall, University of Calgary,...",spotify:album:3aHWofJ19fdj8C43ZV7NQQ,"Isla De Encanta - Live from MacEwan Hall, Univ...",2021-04-19,5wtiSXLdMJZwkeiVUScQcv,spotify:track:5wtiSXLdMJZwkeiVUScQcv,https://api.spotify.com/v1/tracks/5wtiSXLdMJZw...,https://api.spotify.com/v1/audio-analysis/5wti...,...,-9.75,1,0.134,0.00317,0.00117,0.84,0.268,129.249,113067,4
3458,3706,Fats Domino,My Blue Heaven,spotify:album:2P6OFyXNxpIxMZzNlCbrbV,I'm Ready,1990-01-01,7gQzF4ZKsRo7C2DY5jMf8r,spotify:track:7gQzF4ZKsRo7C2DY5jMf8r,https://api.spotify.com/v1/tracks/7gQzF4ZKsRo7...,https://api.spotify.com/v1/audio-analysis/7gQz...,...,-10.007,1,0.0516,0.682,0.000426,0.114,0.968,172.285,125040,4
1527,1613,Andy Macpherson,Live At The Isle Of Wight Festival 1970,spotify:album:1FUZJRZ4eOHT4MJuRh5fh8,Christmas - Live,1996-10-29,3ISDIWbHRvUutTBc27YmXS,spotify:track:3ISDIWbHRvUutTBc27YmXS,https://api.spotify.com/v1/tracks/3ISDIWbHRvUu...,https://api.spotify.com/v1/audio-analysis/3ISD...,...,-5.822,1,0.202,0.277,0.0,0.623,0.312,119.098,207627,4
2958,3206,Missy Elliott,The Evolution,spotify:album:3GHCyn6Wwo3CzUDv5MzC1k,Promise,2006,0BC9Uo7Ddsmn3A0IyHidgx,spotify:track:0BC9Uo7Ddsmn3A0IyHidgx,https://api.spotify.com/v1/tracks/0BC9Uo7Ddsmn...,https://api.spotify.com/v1/audio-analysis/0BC9...,...,-5.757,1,0.0425,0.409,4e-06,0.0819,0.439,123.279,267413,4
4903,5181,Martha Reeves & The Vandellas,50th Anniversary | The Singles Collection | 19...,spotify:album:5NSCsRBb7YxQTzKkb0UfLe,Jimmy Mack - Single Version (Mono),2013-01-01,1Naq0juYdFzmytkAuUFa2a,spotify:track:1Naq0juYdFzmytkAuUFa2a,https://api.spotify.com/v1/tracks/1Naq0juYdFzm...,https://api.spotify.com/v1/audio-analysis/1Naq...,...,-2.94,1,0.0305,0.154,0.00487,0.0631,0.9,129.343,174787,4
3802,4050,Al Green,The Lord Will Make a Way,spotify:album:4NoiA9k1PwQ9M6Om0N29u9,Highway to Heaven,1980,27TThjoJrBYrv7lUNXDKjf,spotify:track:27TThjoJrBYrv7lUNXDKjf,https://api.spotify.com/v1/tracks/27TThjoJrBYr...,https://api.spotify.com/v1/audio-analysis/27TT...,...,-12.418,1,0.0455,0.132,0.0247,0.358,0.962,117.412,178480,5
4028,4297,Al Green,Gets Next To You,spotify:album:5xvnozp5wBuAHgFhgfbmzQ,"Are You Lonely for Me, Baby",1971-08-14,6IpK9YZ5pCWqMIZVwNIqCq,spotify:track:6IpK9YZ5pCWqMIZVwNIqCq,https://api.spotify.com/v1/tracks/6IpK9YZ5pCWq...,https://api.spotify.com/v1/audio-analysis/6IpK...,...,-12.978,1,0.0401,0.35,0.000447,0.15,0.882,95.583,238587,4
329,371,Tina Turner,Children Beyond,spotify:album:3Y4y50H5GIOBJHFLkkO1sV,Om Ah Hum,2011,7KEq86lOihE0l2an8lKHS7,spotify:track:7KEq86lOihE0l2an8lKHS7,https://api.spotify.com/v1/tracks/7KEq86lOihE0...,https://api.spotify.com/v1/audio-analysis/7KEq...,...,-6.139,1,0.0288,0.543,0.905,0.0766,0.157,100.006,274640,4
3268,3516,Fats Domino,"Live From Austin, TX",spotify:album:0pDyIBwFjeTxJzttXt1lF3,I Want to Walk You Home - Live,2006-07-25,0HC5SabG5bBD52OQZFxpv4,spotify:track:0HC5SabG5bBD52OQZFxpv4,https://api.spotify.com/v1/tracks/0HC5SabG5bBD...,https://api.spotify.com/v1/audio-analysis/0HC5...,...,-9.929,1,0.0489,0.517,4.3e-05,0.365,0.917,141.802,164627,3
2169,2348,Black Sabbath,13,spotify:album:5WXL9YjbNd4GIqWc9mZOOq,Age Of Reason,2013-01-01,0BfFxZfRGcf6FOCpZ75tr0,spotify:track:0BfFxZfRGcf6FOCpZ75tr0,https://api.spotify.com/v1/tracks/0BfFxZfRGcf6...,https://api.spotify.com/v1/audio-analysis/0BfF...,...,-4.021,0,0.0441,8.9e-05,0.291,0.152,0.398,127.071,421067,4


In [251]:
data.shape

(7002, 24)

In [252]:
num_df = data.select_dtypes(include = np.number)
cat_df = data.select_dtypes(include = object)

In [253]:
num_df = num_df.drop('Unnamed: 0', axis=1)

In [254]:
num_df

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,0.400,0.896,7,-13.928,1,0.0407,0.2860,0.502000,0.3230,0.6330,152.520,251893,4
1,0.562,0.672,2,-13.605,1,0.0532,0.4110,0.000594,0.1220,0.3480,141.160,218200,4
2,0.383,0.829,5,-12.322,1,0.0597,0.5760,0.000005,0.1940,0.7720,166.530,172267,4
3,0.339,0.858,5,-13.166,1,0.0531,0.5680,0.000001,0.1140,0.7150,169.849,174773,4
4,0.378,0.849,5,-12.917,1,0.0658,0.5160,0.000000,0.1260,0.8160,168.026,187133,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...
6997,0.639,0.491,9,-13.906,0,0.0616,0.0975,0.902000,0.1070,0.3990,114.581,184076,4
6998,0.908,0.435,1,-8.583,1,0.0676,0.2790,0.500000,0.0889,0.9280,124.980,126903,4
6999,0.421,0.346,4,-9.700,0,0.0359,0.9510,0.911000,0.1020,0.0353,56.265,125049,5
7000,0.546,0.475,0,-9.670,1,0.0557,0.6440,0.107000,0.1310,0.2570,75.006,124479,4


## Scaling

In [255]:
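# K-Means is a distance-based algorithm, so the features need to be scaled first.
from sklearn.preprocessing import StandardScaler
# Keep the fitted scaler so that new songs can be transformed the same way later
scaler = StandardScaler()
num_df_prep = scaler.fit_transform(num_df)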

pd.DataFrame(num_df_prep).head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12
0,-0.540619,1.068808,0.530273,-1.168264,0.576581,-0.337709,0.082021,1.900405,-0.103095,0.443852,1.058719,0.258931,0.2435
1,0.412785,0.036137,-0.875517,-1.084789,0.576581,-0.21852,0.522632,-0.409819,-0.760495,-0.730027,0.658908,-0.049034,0.2435
2,-0.640667,0.759928,-0.032043,-0.753213,0.576581,-0.156541,1.104237,-0.412535,-0.525008,1.016376,1.551795,-0.468877,0.2435
3,-0.899617,0.893622,-0.032043,-0.971335,0.576581,-0.219473,1.076038,-0.41255,-0.78666,0.7816,1.668606,-0.445971,0.2435
4,-0.670093,0.852131,-0.032043,-0.906984,0.576581,-0.098377,0.892744,-0.412556,-0.747413,1.197607,1.604446,-0.332997,0.2435
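
To make the scaling step concrete, here is a minimal sketch of what standardization does to a single column: subtract the column mean and divide by the population standard deviation, which is the convention `StandardScaler` follows. The helper `standardize` and the toy values are assumptions made for illustration; they are not part of the notebook.

```python
from statistics import fmean, pstdev

def standardize(column):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = fmean(column)
    std = pstdev(column)
    # Guard against a constant column, whose standard deviation is zero
    if std == 0:
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]

# Toy values on very different scales, loosely resembling tempo and danceability
tempo = [60.0, 120.0, 180.0]
danceability = [0.2, 0.5, 0.8]

tempo_z = standardize(tempo)
danceability_z = standardize(danceability)
```

After scaling, both toy columns end up on the same scale, so a large-valued feature such as `duration_ms` no longer dominates the distance computation.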


## Clustering

In [256]:
from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(num_df_prep)

KMeans(n_clusters=3, random_state=42)

In [257]:
# Predicting / assigning the clusters:
clusters = kmeans.predict(num_df_prep)
clusters

array([0, 1, 1, ..., 2, 2, 1])

In [258]:
# Check the size of the clusters
pd.Series(clusters).value_counts().sort_index()

0    2787
1    2918
2    1297
dtype: int64

In [260]:
# Explore the cluster assignment in the original dataset
clustered_df = num_df.copy()  # copy so that num_df itself does not gain a "cluster" column
clustered_df["cluster"] = clusters
clustered_df.sample(10)

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,cluster
2805,0.639,0.755,7,-6.928,0,0.0837,0.0178,0.0,0.243,0.36,147.868,224800,4,1
6416,0.487,0.575,6,-12.845,0,0.0421,0.329,0.0105,0.962,0.469,121.568,170521,4,0
2722,0.332,0.769,9,-11.503,1,0.0765,0.00367,0.00413,0.0372,0.525,157.378,354774,4,0
1221,0.219,0.888,0,-5.28,1,0.0553,1e-05,1.1e-05,0.689,0.245,126.82,528280,4,0
1393,0.367,0.64,4,-10.043,1,0.0441,0.21,0.354,0.114,0.287,146.699,515960,4,0
5626,0.18,0.944,0,-13.214,1,0.0872,0.00195,0.928,0.935,0.124,177.348,201400,4,0
5054,0.514,0.624,0,-9.287,0,0.0842,0.513,0.00517,0.307,0.755,166.726,163053,4,1
6311,0.407,0.86,11,-11.152,0,0.0331,0.0547,0.222,0.945,0.871,144.001,245814,4,0
6861,0.413,0.813,11,-12.196,1,0.052,5.4e-05,0.00558,0.0919,0.729,124.176,101267,3,1
1713,0.618,0.908,7,-5.484,1,0.032,0.491,0.000232,0.0903,0.705,133.299,163133,4,1


In [261]:
# Inertia: sum of squared distances from each sample to its closest cluster centre (lower means tighter clusters)
kmeans.inertia_

72394.89422540777
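
As a rough illustration of what the inertia value measures, the sketch below computes it by hand on toy points: for each point, take the squared distance to the nearest centroid, then sum. The function names and the data are hypothetical. Inertia shrinks as clusters are added, so it is mostly useful for comparing different values of `n_clusters` rather than as an absolute score.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def inertia(points, centroids):
    """Sum of squared distances from each point to its nearest centroid."""
    return sum(min(squared_distance(p, c) for c in centroids) for p in points)

# Two well-separated groups of toy points
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

one_centroid = [(5.0, 1.0)]
two_centroids = [(0.0, 1.0), (10.0, 1.0)]

print(inertia(points, one_centroid))   # one cluster for everything
print(inertia(points, two_centroids))  # one cluster per group
```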

In [262]:
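from sklearn.metrics import pairwise_distances_argmin_min

# Use the first song as a stand-in for a newly submitted song
new_df = clustered_df.drop('cluster', axis=1).loc[[0]]
# The model was fitted on scaled data, so compare the scaled row (row 0 of num_df_prep)
# against the cluster centres; this returns the index of the closest centre per song
closest, _ = pairwise_distances_argmin_min(num_df_prep[[0]], kmeans.cluster_centers_)
closest

array([0], dtype=int64)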

In [264]:
new_df

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,0.4,0.896,7,-13.928,1,0.0407,0.286,0.502,0.323,0.633,152.52,251893,4


## Check whether or not the song is in the Billboard Hot 200.

### Create Soup

In [124]:
url = input("What URL do you want to use? ")
response = requests.get(url)

if response.status_code == 200:
    soup = BeautifulSoup(response.content, "html.parser")
    print('soup created!')
    
else:
    print('ERROR: ',response.status_code)

What URL do you want to use? https://www.billboard.com/charts/billboard-200
soup created!


### Pick "Ingredients" from Soup :)

In [126]:
songs = soup.find_all("span", class_="chart-element__information__song text--truncate color--primary")
artists = soup.find_all("span", class_="chart-element__information__artist text--truncate color--secondary")
last_week_ranks = soup.find_all("span", class_= "chart-element__meta text--center color--secondary text--last")
peak_ranks = soup.find_all("span", class_= "chart-element__meta text--center color--secondary text--peak")
weeks_on_chart = soup.find_all("span", class_= "chart-element__meta text--center color--secondary text--week") 

In [127]:
# Get the text of each tag
for i in [songs, artists, last_week_ranks, peak_ranks, weeks_on_chart]:
    for j in range(len(i)):
        i[j] = i[j].getText()

In [130]:
# Build a dataframe

billboard = pd.DataFrame(
    {"song": songs,
     "artist": artists,
     "last_week_rank": last_week_ranks,
     "peak_rank": peak_ranks,
     "weeks_on_chart": weeks_on_chart}
)

billboard.sample(10)

Unnamed: 0,song,artist,last_week_rank,peak_rank,weeks_on_chart
52,Traveller,Chris Stapleton,54,1,317
1,Certified Lover Boy,Drake,2,1,5
161,BLAME IT ON BABY,DaBaby,145,1,77
57,YHLQMDLG,Bad Bunny,53,2,84
12,Happier Than Ever,Billie Eilish,9,1,10
128,Moana,Soundtrack,132,2,252
139,Dont Smile At Me,Billie Eilish,136,14,198
189,IGOR,"Tyler, The Creator",-,1,104
22,Legends Never Die,Juice WRLD,24,1,65
179,The Anxiety,THE ANXIETY: WILLOW & Tyler Cole,153,141,4


### Check for song in Billboard

In [199]:
user_artist = input("Is there any artist you like? ")
user_song = input("What song do you like? ")

pos = -1
for index, song in enumerate(billboard['song'].tolist()):
    if song == user_song:
        pos = index

if pos >=0:
    print('Here is: ',billboard['song'].iloc[pos], 'from', user_artist)
else:
    print(user_song,' is not a song in the Billboard Hot 200!')

Is there any artist you like? Chris Stapleton
What song do you like? Traveller
Here is:  Traveller from Chris Stapleton
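
The check above only succeeds on an exact, case-sensitive title match and ignores the artist. A minimal sketch of a more forgiving lookup follows; `normalize`, `find_in_chart` and the sample rows are hypothetical names and data used for illustration only.

```python
def normalize(text):
    """Lower-case and trim a string so that ' traveller ' matches 'Traveller'."""
    return text.strip().casefold()

def find_in_chart(chart_rows, title, artist=None):
    """Return the index of the first matching row, or None if nothing matches."""
    for index, row in enumerate(chart_rows):
        if normalize(row["song"]) != normalize(title):
            continue
        # Only compare the artist when the caller supplied one
        if artist is None or normalize(row["artist"]) == normalize(artist):
            return index
    return None

# Hypothetical rows in the same shape as the billboard dataframe
chart_rows = [
    {"song": "Certified Lover Boy", "artist": "Drake"},
    {"song": "Traveller", "artist": "Chris Stapleton"},
]

position = find_in_chart(chart_rows, " traveller ", "chris stapleton")
```

With the dataframe from this notebook, `billboard.to_dict("records")` should yield rows in this shape.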


## Collect the audio features from the Spotify API.

### Initialize SpotiPy with user credentials

In [200]:
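import os
# Read the credentials from environment variables rather than hard-coding secrets in the notebook
client_id = os.environ['SPOTIPY_CLIENT_ID']
client_secret = os.environ['SPOTIPY_CLIENT_SECRET']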
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

In [201]:
searchResults = sp.search(q="artist:" + user_artist + " track:" + user_song, type="track")
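
A search can come back with no items, so it may help to guard against that before reading the first hit. The sketch below assumes the response has the `{'tracks': {'items': [...]}}` shape that the search result in this notebook shows; `build_query`, `first_track_id` and the sample response are hypothetical and only illustrate the idea.

```python
def build_query(artist, track):
    """Build the same field-filtered query string used for sp.search."""
    return f"artist:{artist} track:{track}"

def first_track_id(search_results):
    """Return the id of the first track in a search response, or None if empty."""
    items = search_results.get("tracks", {}).get("items", [])
    if not items:
        return None
    return items[0]["id"]

# Hypothetical, heavily trimmed response used only to exercise the helper
sample_response = {"tracks": {"items": [{"id": "hypothetical-track-id", "name": "Traveller"}]}}
empty_response = {"tracks": {"items": []}}

query = build_query("Chris Stapleton", "Traveller")
track_id = first_track_id(sample_response)
```

The returned id is what would then be passed to `sp.audio_features` to collect the audio features of the submitted song.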

In [202]:
searchResults

{'tracks': {'href': 'https://api.spotify.com/v1/search?query=artist%3AChris+Stapleton+track%3ATraveller&type=track&offset=0&limit=10',
  'items': [{'album': {'album_type': 'album',
     'artists': [{'external_urls': {'spotify': 'https://open.spotify.com/artist/4YLtscXsxbVgi031ovDDdh'},
       'href': 'https://api.spotify.com/v1/artists/4YLtscXsxbVgi031ovDDdh',
       'id': '4YLtscXsxbVgi031ovDDdh',
       'name': 'Chris Stapleton',
       'type': 'artist',
       'uri': 'spotify:artist:4YLtscXsxbVgi031ovDDdh'}],
     'available_markets': ['AD', 'AE', 'AG', 'AL', 'AM', 'AO', 'AR', 'AT', 'AU', 'AZ',
      'BA', 'BB', 'BD', 'BE', 'BF', 'BG', 'BH', 'BI', 'BJ', 'BN', 'BO', 'BR', 'BS',
      'BT', 'BW', 'BY', 'BZ', 'CA', 'CH', 'CI', 'CL', 'CM', 'CO', 'CR', 'CV', 'CW',
      'CY', 'CZ',
      ...

In [None]:
def artist_tracks(artists):
    
    '''
    Takes a list of artist names, iterates through their Spotify albums, checks for 
    duplicate albums, then appends all the tracks in those albums to a list of lists
    '''
    
    # Each list in this list will be a track and its features
    tracks = []
    
    for artist in tqdm_notebook(artists):
        
        # Get the artist URI (a unique ID)
        artist_uri = sp.search(artist)['tracks']['items'][0]['artists'][0]['uri']

        # Spotify has a lot of duplicate albums, but we'll cross-reference them with this list to avoid extra loops
        album_checker = []
        
        # The starting point of our loop of albums for those artists with more than 50
        n = 0
        
        # Note the album_type = 'album'. This discounts singles, compilations and collaborations
        while True:
            
            # Request each page of albums only once to avoid overloading Spotify with requests
            dict_list = sp.artist_albums(artist_uri, album_type = 'album', limit=50, offset = n)['items']
            
            # Stop once a page comes back empty
            if not dict_list:
                break
            
            for i, album in tqdm_notebook(enumerate(dict_list)):

                # Add the featured artists for the album in question to the checklist
                check_this_album = [j['name'] for j in dict_list[i]['artists']]
                # And the album name
                check_this_album.append(dict_list[i]['name'])
                # And its date
                check_this_album.append(dict_list[i]['release_date'])

                # Only continue looping if that album isn't in the checklist
                if check_this_album not in album_checker:
                    
                    # Add this album to the checker
                    album_checker.append(check_this_album)
                    # For every song on the album, get its descriptors and features in a list and add to the tracklist
                    tracks.extend([[artist, album['name'], album['uri'], song['name'],

                      album['release_date']] + list(sp.audio_features(song['uri'])[0].values()) 
                                   for song in sp.album_tracks(album['uri'])['items']])
            
            # Go through the next 50 albums (otherwise we'll get an infinite while loop)
            n += 50

    return tracks
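
The album loop in `artist_tracks` is an instance of offset-based pagination: request a page, stop when it is empty, otherwise advance the offset by the page size. A minimal, self-contained sketch of that pattern is shown below. `fetch_all`, `fake_fetch_page` and the fake album list are hypothetical stand-ins for the real API call, which is not made here.

```python
def fetch_all(fetch_page, page_size=50):
    """Collect items page by page, requesting each page exactly once."""
    items = []
    offset = 0
    while True:
        page = fetch_page(limit=page_size, offset=offset)
        # An empty page means there is nothing left to fetch
        if not page:
            break
        items.extend(page)
        offset += page_size
    return items

# Hypothetical stand-in for an API: 120 fake albums served in slices
ALBUMS = [f"album-{i}" for i in range(120)]
calls = []

def fake_fetch_page(limit, offset):
    calls.append(offset)  # record each request so the number of calls can be inspected
    return ALBUMS[offset:offset + limit]

all_albums = fetch_all(fake_fetch_page)
```

Each page is requested once, plus one final request that returns the empty page and ends the loop.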

In [224]:
# artist_tracks expects a list of artist names. Passing the bare string made it loop over
# single characters, and the search for the space character failed with a 404
# SpotifyException ("Not found.").
artist_df = artist_tracks([user_artist])

In [225]:
def df_tracks(tracklist):
    
    '''
    Takes the output of artist_tracks (i.e. a list of lists),
    puts it in a dataframe and formats it.
    '''

    df = pd.DataFrame(tracklist, columns=['artist',
     'album_name',
     'album_uri',
     'track',
     'release_date'] + list(sp.audio_features('7EAqBCOVkDZcbccjxZmgjp')[0].keys()))

    df.rename(columns={'uri':'song_uri'}, inplace=True)

    df.drop_duplicates(subset=['artist', 'track', 'release_date'], inplace=True)

    # Reorder the cols to have identifiers first, auditory features last
    cols = ['artist', 'album_name', 'album_uri', 'track', 'release_date', 'id', 'song_uri', 'track_href',
     'analysis_url', 'type', 'danceability', 'energy', 'key',  'loudness', 'mode', 'speechiness',
     'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature']

    df = df[cols]
    
    return df

In [226]:
searchresult_df = df_tracks(artist_df)

AttributeError: 'NoneType' object has no attribute 'keys'

In [227]:
searchresult_df

NameError: name 'searchresult_df' is not defined

## Send the Spotify audio features of the submitted song to the clustering model.
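
This last step could look roughly like the sketch below: pick the numeric features in the order the model was trained on, scale them with the training statistics, and return the index of the nearest cluster centre. It is a plain-Python illustration under stated assumptions, not the notebook's implementation. The helper names, the two-feature subset, and the means, standard deviations and centroids are all made up. With scikit-learn, the equivalent would be transforming the row with the fitted `StandardScaler` and passing the result to `kmeans.predict`.

```python
import math

def to_vector(audio_features, feature_names):
    """Pick the model's features, in training order, from an audio-features dict."""
    return [float(audio_features[name]) for name in feature_names]

def scale(vector, means, stds):
    """Apply the training-set standardization to a single feature vector."""
    return [(x - m) / s for x, m, s in zip(vector, means, stds)]

def nearest_centroid(vector, centroids):
    """Return the index of the centroid closest to the vector (Euclidean distance)."""
    distances = [math.dist(vector, c) for c in centroids]
    return distances.index(min(distances))

# Hypothetical values: two features only, to keep the example readable
feature_names = ["energy", "tempo"]
means, stds = [0.5, 120.0], [0.2, 30.0]
centroids = [(-1.0, -1.0), (0.0, 0.0), (2.0, 1.0)]

# Shaped like one entry of an audio-features response; non-numeric keys are ignored
audio_features = {"energy": 0.9, "tempo": 150.0, "id": "hypothetical-id", "type": "audio_features"}

cluster = nearest_centroid(scale(to_vector(audio_features, feature_names), means, stds), centroids)
```

The important design point is that the submitted song goes through exactly the same column order and scaling as the training data; otherwise the distances to the cluster centres are not comparable.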