# Some Background 

When I'm listening to music on Spotify, I generally like songs but don't really bother placing them into playlists. Every time I press shuffle, it can go from Bad Bunny to Little Richard and then to Romeo Santos. Can you imagine being at the club listening to hip hop and then the next song is a country song? And the following song is a corrido? Yep... these radical changes in genre can really throw you off.

That's basically every single car ride for me and my dance group as we head to practice every Sunday. At this point, we just try to guess what genre is coming up next. The person who gets the most right is usually the DJ on the way back.

I'm going to use Spotipy, a Python wrapper for Spotify's Web API, to sort my music into playlists that share a similar genre. I will focus primarily on Hispanic genres and disregard music in English, since most of my library is Hispanic music.


Shoutout to those new Bad Bunny albums though!! haha


# Imports

In [1]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy.util as util
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
import time 
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import silhouette_score
from sklearn.decomposition import PCA

# Credentials

My credentials are stored in another file and I'm going to be importing them.

In [2]:
creds= pd.read_csv('../credentials')
SPOTIPY_CLIENT_ID = creds['Client ID'][0]
SPOTIPY_CLIENT_SECRET = creds['Client Secret'][0]
#Shameless redirect to my dance group if you're signing in for the first time
SPOTIPY_REDIRECT_URI= 'https://www.ballethermosoamanecer.com/'
username = ""

# Data Collection 

## Testing 20 Songs in Liked 

In [3]:
client_credentials_manager = SpotifyClientCredentials(client_id= SPOTIPY_CLIENT_ID, 
                                                      client_secret=SPOTIPY_CLIENT_SECRET)

sp = spotipy.Spotify(client_credentials_manager= client_credentials_manager)
scope = 'user-library-read playlist-read-private playlist-modify-public playlist-modify-private user-read-recently-played user-top-read'
token = util.prompt_for_user_token(username, scope, 
                                   client_id= SPOTIPY_CLIENT_ID, 
                                   client_secret= SPOTIPY_CLIENT_SECRET, 
                                   redirect_uri= SPOTIPY_REDIRECT_URI)
if token:
    sp = spotipy.Spotify(auth=token)
    results = sp.current_user_saved_tracks()
    count = 0
    total_liked_songs = results['total']
    for item in results['items']:
        track = item['track']
        print(track['name'] + ' - ' + track['artists'][0]['name'])
        count += 1
    print(f"\nShowing {count} out of {total_liked_songs} songs")
else:
    print("Can't get token for", username)

Corazón Solitario - Alberto Pedraza
Temblor - Remix - Causa
Viento - Caifanes
Toda la Noche - El Haragán y Compañía
Maniqui - Chimbala
Qué Maldición - Banda MS de Sergio Lizárraga
Con Tus Besos - Eslabon Armado
Rosita de Olivo - Los Tigres Del Norte
Intentalo Tu - Joe Veras
La Cumbia Buenota - Sonideros de MEX USA
Cumbia Yemaya - Sonideros de MEX USA
Háblame De Ti - Banda MS de Sergio Lizárraga
Borracho de Amor - Edwin Luna y La Trakalosa de Monterrey
New Light - John Mayer
El Amor de Mi Vida - La Adictiva Banda San José de Mesillas
Aparentemente - Yaga & Mackie
La Rabia - Chuy Lizarraga y Su Banda Tierra Sinaloense
Mentirosa (Eres Mentirosa) - Chicos de Barrio
Safaera - Bad Bunny
La Santa - Bad Bunny

Showing 20 out of 1221 songs


## Collecting Audio Features for Liked Songs

In [4]:
song_name = []
song_uri = []
song_popularity = []
artist_name = []
artist_uri = []
genres_found = []
loops= 0 
track_audio_features = []

while results:
    for item in results['items']:
        track = item['track']
        song_name.append(track['name'])
        song_uri.append(track['uri'])
        song_popularity.append(track['popularity'])
    
        #Some songs have multiple artists so I'll make a list of artists as well as a list of uris for each song
        artist_name.append([artist['name'] for artist in track['artists']])
        artist_uri.append([artist['uri'] for artist in track['artists']])
    
        #Spotify doesn't provide the genre of the song so I have to look at the genre of the artist(s) of the song
        temp = sp.artists([artist['uri'] for artist in track['artists']])['artists']
        temp2 = []
        for temp_artists in temp:
            for genre in temp_artists['genres']:
                temp2.append(genre)
        #We only want to keep unique genres and avoid duplicate genres
        genres_found.append(list(set(temp2)))
    
    #Collecting audio features available for each song
    #As shown above, each page of results holds 20 songs, so features are requested 20 URIs at a time
    track_audio_features.extend(sp.audio_features(song_uri[20*loops :20*(loops+1)]))
    print(f'Audio Features were gathered for {len(track_audio_features)} songs')

    loops += 1 
    results = sp.next(results)
    if (loops % 5) == 0 :
        print(f'{loops} cycles have been completed\n')
    #Sleep in order to not bombard with too many requests
    time.sleep(1)

#Bringing it all together 
df_audio_feat = pd.DataFrame(track_audio_features) 
df_tracks = pd.DataFrame({'artist_uri': artist_uri,
                          'song_uri': song_uri,
                          'song_name':song_name,
                          'artists': artist_name,
                          'genres': genres_found,
                          'popularity': song_popularity})

print("\nCreating Dataframe . . .")
collected_data = pd.merge(df_tracks, df_audio_feat, left_on= 'song_uri', right_on= 'uri')
print("Complete!")

Audio Features were gathered for 20 songs
Audio Features were gathered for 40 songs
Audio Features were gathered for 60 songs
Audio Features were gathered for 80 songs
Audio Features were gathered for 100 songs
5 cycles have been completed

Audio Features were gathered for 120 songs
Audio Features were gathered for 140 songs
Audio Features were gathered for 160 songs
Audio Features were gathered for 180 songs
Audio Features were gathered for 200 songs
10 cycles have been completed

Audio Features were gathered for 220 songs
Audio Features were gathered for 240 songs
Audio Features were gathered for 260 songs
Audio Features were gathered for 280 songs
Audio Features were gathered for 300 songs
15 cycles have been completed

Audio Features were gathered for 320 songs
Audio Features were gathered for 340 songs
Audio Features were gathered for 360 songs
Audio Features were gathered for 380 songs
Audio Features were gathered for 400 songs
20 cycles have been completed

Audio Features were g

In [5]:
collected_data.head()

Unnamed: 0,artist_uri,song_uri,song_name,artists,genres,popularity,danceability,energy,key,loudness,...,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature
0,[spotify:artist:3TQh6LXI9ADgyZJTT19TeR],spotify:track:7pxs0seevgCsp3h23lUGBw,Corazón Solitario,[Alberto Pedraza],"[cumbia, cumbia sonidera]",56,0.506,0.589,0,-7.364,...,0.139,0.884,199.849,audio_features,7pxs0seevgCsp3h23lUGBw,spotify:track:7pxs0seevgCsp3h23lUGBw,https://api.spotify.com/v1/tracks/7pxs0seevgCs...,https://api.spotify.com/v1/audio-analysis/7pxs...,243458,4
1,"[spotify:artist:067L5aKjYxwwYPdgIfNnW1, spotif...",spotify:track:40gfrW5BX2QE5SrxzRq5TM,Temblor - Remix,"[Causa, Farruko, El Alfa]","[rap dominicano, reggaeton, trap latino, latin...",58,0.64,0.8,6,-3.768,...,0.0848,0.444,122.988,audio_features,40gfrW5BX2QE5SrxzRq5TM,spotify:track:40gfrW5BX2QE5SrxzRq5TM,https://api.spotify.com/v1/tracks/40gfrW5BX2QE...,https://api.spotify.com/v1/audio-analysis/40gf...,212683,4
2,[spotify:artist:1GImnM7WYVp95431ypofy9],spotify:track:6QJCZyJv1fhkCyZA3lRoAD,Viento,[Caifanes],"[latin rock, mexican rock, rock urbano mexican...",67,0.586,0.855,11,-5.947,...,0.129,0.448,124.917,audio_features,6QJCZyJv1fhkCyZA3lRoAD,spotify:track:6QJCZyJv1fhkCyZA3lRoAD,https://api.spotify.com/v1/tracks/6QJCZyJv1fhk...,https://api.spotify.com/v1/audio-analysis/6QJC...,236333,4
3,[spotify:artist:1kydV1RJaCH3wePowuxDhB],spotify:track:5uWjWM3kJEjVyyaEuoVkMi,Toda la Noche,[El Haragán y Compañía],[rock en espanol],46,0.535,0.736,6,-3.812,...,0.0779,0.462,141.619,audio_features,5uWjWM3kJEjVyyaEuoVkMi,spotify:track:5uWjWM3kJEjVyyaEuoVkMi,https://api.spotify.com/v1/tracks/5uWjWM3kJEjV...,https://api.spotify.com/v1/audio-analysis/5uWj...,283147,4
4,[spotify:artist:4VVEpEhC8NcR7AqNEds42U],spotify:track:5Vk7ve73fLQbuN9t9hnzpN,Maniqui,[Chimbala],"[rap dominicano, dominican pop, dembow]",52,0.885,0.805,5,-3.726,...,0.0578,0.536,122.967,audio_features,5Vk7ve73fLQbuN9t9hnzpN,spotify:track:5Vk7ve73fLQbuN9t9hnzpN,https://api.spotify.com/v1/tracks/5Vk7ve73fLQb...,https://api.spotify.com/v1/audio-analysis/5Vk7...,177633,4


Ideally, I want to create playlists for each genre and place the correct songs into each playlist. We're using the artist's genre as a proxy for the song's genre since Spotify doesn't provide us with specific song genres.

Given that some songs can have multiple artists, some songs might be classified under multiple genres if artists from different genres are collaborating on a song. In addition, the genres provided are waaaaay too specific for my needs. To start, I will simplify my problem and classify my music into larger, umbrella genres. This will definitely capture some noise as songs can overlap genres, but that's okay. I'll need to create my own training data from the data that I have collected.

After I clean up my data, I will explore a couple of different clustering techniques to see how many realistic clusters can be formed. I will use PCA to reduce dimensionality and see how that changes the effectiveness of my models. I will be using K-Means and DBSCAN, but first, I have some data cleaning to do.


# Data Cleaning

## Outliers and `null` values

Let's check for `null` values and remove columns that are repetitive and/or provide no additional information. Domain knowledge usually helps identify any outliers or data points that might be off due to human error but for this case, I will assume all values provided by Spotify have no human error. 

In [6]:
collected_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1221 entries, 0 to 1220
Data columns (total 24 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   artist_uri        1221 non-null   object 
 1   song_uri          1221 non-null   object 
 2   song_name         1221 non-null   object 
 3   artists           1221 non-null   object 
 4   genres            1221 non-null   object 
 5   popularity        1221 non-null   int64  
 6   danceability      1221 non-null   float64
 7   energy            1221 non-null   float64
 8   key               1221 non-null   int64  
 9   loudness          1221 non-null   float64
 10  mode              1221 non-null   int64  
 11  speechiness       1221 non-null   float64
 12  acousticness      1221 non-null   float64
 13  instrumentalness  1221 non-null   float64
 14  liveness          1221 non-null   float64
 15  valence           1221 non-null   float64
 16  tempo             1221 non-null   float64


In [7]:
collected_data.isnull().sum()

artist_uri          0
song_uri            0
song_name           0
artists             0
genres              0
popularity          0
danceability        0
energy              0
key                 0
loudness            0
mode                0
speechiness         0
acousticness        0
instrumentalness    0
liveness            0
valence             0
tempo               0
type                0
id                  0
uri                 0
track_href          0
analysis_url        0
duration_ms         0
time_signature      0
dtype: int64

## Checking for Duplicates 

In [10]:
collected_data[collected_data.duplicated(subset=['song_uri'])]

Unnamed: 0,artist_uri,song_uri,song_name,artists,genres,popularity,danceability,energy,key,loudness,...,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature


No duplicates found

## Dropping Unnecessary Columns

In [12]:
collected_data

Unnamed: 0,artist_uri,song_uri,song_name,artists,genres,popularity,danceability,energy,key,loudness,...,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature
0,[spotify:artist:3TQh6LXI9ADgyZJTT19TeR],spotify:track:7pxs0seevgCsp3h23lUGBw,Corazón Solitario,[Alberto Pedraza],"[cumbia, cumbia sonidera]",56,0.506,0.589,0,-7.364,...,0.1390,0.884,199.849,audio_features,7pxs0seevgCsp3h23lUGBw,spotify:track:7pxs0seevgCsp3h23lUGBw,https://api.spotify.com/v1/tracks/7pxs0seevgCs...,https://api.spotify.com/v1/audio-analysis/7pxs...,243458,4
1,"[spotify:artist:067L5aKjYxwwYPdgIfNnW1, spotif...",spotify:track:40gfrW5BX2QE5SrxzRq5TM,Temblor - Remix,"[Causa, Farruko, El Alfa]","[rap dominicano, reggaeton, trap latino, latin...",58,0.640,0.800,6,-3.768,...,0.0848,0.444,122.988,audio_features,40gfrW5BX2QE5SrxzRq5TM,spotify:track:40gfrW5BX2QE5SrxzRq5TM,https://api.spotify.com/v1/tracks/40gfrW5BX2QE...,https://api.spotify.com/v1/audio-analysis/40gf...,212683,4
2,[spotify:artist:1GImnM7WYVp95431ypofy9],spotify:track:6QJCZyJv1fhkCyZA3lRoAD,Viento,[Caifanes],"[latin rock, mexican rock, rock urbano mexican...",67,0.586,0.855,11,-5.947,...,0.1290,0.448,124.917,audio_features,6QJCZyJv1fhkCyZA3lRoAD,spotify:track:6QJCZyJv1fhkCyZA3lRoAD,https://api.spotify.com/v1/tracks/6QJCZyJv1fhk...,https://api.spotify.com/v1/audio-analysis/6QJC...,236333,4
3,[spotify:artist:1kydV1RJaCH3wePowuxDhB],spotify:track:5uWjWM3kJEjVyyaEuoVkMi,Toda la Noche,[El Haragán y Compañía],[rock en espanol],46,0.535,0.736,6,-3.812,...,0.0779,0.462,141.619,audio_features,5uWjWM3kJEjVyyaEuoVkMi,spotify:track:5uWjWM3kJEjVyyaEuoVkMi,https://api.spotify.com/v1/tracks/5uWjWM3kJEjV...,https://api.spotify.com/v1/audio-analysis/5uWj...,283147,4
4,[spotify:artist:4VVEpEhC8NcR7AqNEds42U],spotify:track:5Vk7ve73fLQbuN9t9hnzpN,Maniqui,[Chimbala],"[rap dominicano, dominican pop, dembow]",52,0.885,0.805,5,-3.726,...,0.0578,0.536,122.967,audio_features,5Vk7ve73fLQbuN9t9hnzpN,spotify:track:5Vk7ve73fLQbuN9t9hnzpN,https://api.spotify.com/v1/tracks/5Vk7ve73fLQb...,https://api.spotify.com/v1/audio-analysis/5Vk7...,177633,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1216,[spotify:artist:1qto4hHid1P71emI6Fd8xi],spotify:track:0HDHY6RSHHG2ZTE0cMT4GJ,Los Infieles,[Aventura],"[bachata, latin pop, latin, tropical, latin hi...",70,0.745,0.716,9,-8.221,...,0.0590,0.817,132.932,audio_features,0HDHY6RSHHG2ZTE0cMT4GJ,spotify:track:0HDHY6RSHHG2ZTE0cMT4GJ,https://api.spotify.com/v1/tracks/0HDHY6RSHHG2...,https://api.spotify.com/v1/audio-analysis/0HDH...,257187,4
1217,[spotify:artist:47bVt95bvBMpmJFWoyhH0C],spotify:track:2XYkvc4UWMO9U2iQcIjJe7,Te Extraño - Bachata Version,[Xtreme],"[latin, tropical, bachata]",0,0.793,0.592,6,-4.749,...,0.0876,0.892,129.960,audio_features,2XYkvc4UWMO9U2iQcIjJe7,spotify:track:2XYkvc4UWMO9U2iQcIjJe7,https://api.spotify.com/v1/tracks/2XYkvc4UWMO9...,https://api.spotify.com/v1/audio-analysis/2XYk...,213973,4
1218,"[spotify:artist:3rs3EOlJ8jyPpdGiQ9Mhub, spotif...",spotify:track:0wDEs6WvqDHq4XJZC0dHhO,Hoja En Blanco,"[Monchy & Alexandra, Alexandra]","[latin, tropical, bachata]",63,0.932,0.772,0,-3.599,...,0.1760,0.863,135.063,audio_features,0wDEs6WvqDHq4XJZC0dHhO,spotify:track:0wDEs6WvqDHq4XJZC0dHhO,https://api.spotify.com/v1/tracks/0wDEs6WvqDHq...,https://api.spotify.com/v1/audio-analysis/0wDE...,307200,4
1219,[spotify:artist:5lwmRuXgjX8xIwlnauTZIP],spotify:track:6I86RF3odBlcuZA9Vfjzeq,Eres Mía,[Romeo Santos],"[latin, tropical, bachata]",73,0.843,0.729,6,-3.634,...,0.2060,0.903,123.046,audio_features,6I86RF3odBlcuZA9Vfjzeq,spotify:track:6I86RF3odBlcuZA9Vfjzeq,https://api.spotify.com/v1/tracks/6I86RF3odBlc...,https://api.spotify.com/v1/audio-analysis/6I86...,250640,4


Some of the columns above won't be needed to create my training dataset, so I drop them below. I'm naming the new dataframe `data` in order to differentiate the two datasets. I could keep the old name, but for demonstration purposes I'll call it `data`.

In [13]:
data= collected_data.drop(columns= ['artist_uri', 'type','id', 'uri', 'track_href', 'analysis_url'])
data

Unnamed: 0,song_uri,song_name,artists,genres,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
0,spotify:track:7pxs0seevgCsp3h23lUGBw,Corazón Solitario,[Alberto Pedraza],"[cumbia, cumbia sonidera]",56,0.506,0.589,0,-7.364,0,0.0379,0.1190,0.004720,0.1390,0.884,199.849,243458,4
1,spotify:track:40gfrW5BX2QE5SrxzRq5TM,Temblor - Remix,"[Causa, Farruko, El Alfa]","[rap dominicano, reggaeton, trap latino, latin...",58,0.640,0.800,6,-3.768,0,0.1510,0.0278,0.003940,0.0848,0.444,122.988,212683,4
2,spotify:track:6QJCZyJv1fhkCyZA3lRoAD,Viento,[Caifanes],"[latin rock, mexican rock, rock urbano mexican...",67,0.586,0.855,11,-5.947,0,0.0327,0.3040,0.000000,0.1290,0.448,124.917,236333,4
3,spotify:track:5uWjWM3kJEjVyyaEuoVkMi,Toda la Noche,[El Haragán y Compañía],[rock en espanol],46,0.535,0.736,6,-3.812,1,0.0312,0.0878,0.000000,0.0779,0.462,141.619,283147,4
4,spotify:track:5Vk7ve73fLQbuN9t9hnzpN,Maniqui,[Chimbala],"[rap dominicano, dominican pop, dembow]",52,0.885,0.805,5,-3.726,1,0.0541,0.3930,0.000089,0.0578,0.536,122.967,177633,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1216,spotify:track:0HDHY6RSHHG2ZTE0cMT4GJ,Los Infieles,[Aventura],"[bachata, latin pop, latin, tropical, latin hi...",70,0.745,0.716,9,-8.221,0,0.0411,0.2020,0.000000,0.0590,0.817,132.932,257187,4
1217,spotify:track:2XYkvc4UWMO9U2iQcIjJe7,Te Extraño - Bachata Version,[Xtreme],"[latin, tropical, bachata]",0,0.793,0.592,6,-4.749,0,0.0344,0.7600,0.000000,0.0876,0.892,129.960,213973,4
1218,spotify:track:0wDEs6WvqDHq4XJZC0dHhO,Hoja En Blanco,"[Monchy & Alexandra, Alexandra]","[latin, tropical, bachata]",63,0.932,0.772,0,-3.599,1,0.0369,0.2250,0.000003,0.1760,0.863,135.063,307200,4
1219,spotify:track:6I86RF3odBlcuZA9Vfjzeq,Eres Mía,[Romeo Santos],"[latin, tropical, bachata]",73,0.843,0.729,6,-3.634,0,0.0374,0.4010,0.000001,0.2060,0.903,123.046,250640,4


In [15]:
data.describe()

Unnamed: 0,popularity,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature
count,1221.0,1221.0,1221.0,1221.0,1221.0,1221.0,1221.0,1221.0,1221.0,1221.0,1221.0,1221.0,1221.0,1221.0
mean,48.298935,0.685651,0.700354,5.285831,-5.868667,0.60688,0.093263,0.247459,0.025448,0.166513,0.693604,124.345588,224405.185094,3.927109
std,26.171657,0.147135,0.163208,3.664992,2.506716,0.488643,0.088126,0.227436,0.117048,0.127466,0.21857,32.701815,54981.247864,0.334508
min,0.0,0.167,0.178,0.0,-17.369,0.0,0.0232,2.5e-05,0.0,0.017,0.056,48.718,62563.0,1.0
25%,33.0,0.602,0.593,2.0,-7.159,0.0,0.0391,0.0569,0.0,0.0862,0.554,96.006,190435.0,4.0
50%,54.0,0.714,0.717,5.0,-5.497,1.0,0.0567,0.178,3e-06,0.118,0.735,119.079,216520.0,4.0
75%,69.0,0.793,0.823,9.0,-4.175,1.0,0.107,0.385,0.000245,0.215,0.88,147.883,248952.0,4.0
max,99.0,0.967,0.996,11.0,0.683,1.0,0.919,0.984,0.965,0.986,0.988,214.017,597720.0,5.0


## Simplifying Genres (Manually) 

Let's see what genres we have:

In [None]:
genre_dummied = data['genres'].str.join(sep='*').str.get_dummies(sep='*')
genre_dummied

In theory, I would have 371 different playlists if I made a playlist for each genre. That's excessive and unproductive since some songs pertain to multiple genres and some genres only have one song associated with them. Let's add up our genres and see what's popular in my liked songs.

In [None]:
genre_dummied.sum().sort_values(ascending= False)
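
As a cross-check on the dummy-column approach, the same tally can be sketched in plain Python with `collections.Counter`. The `sample_genres` list below is made up for illustration and only stands in for `data['genres']`.

```python
from collections import Counter

# Hypothetical stand-in for data['genres']: one list of genres per song
sample_genres = [
    ["cumbia", "cumbia sonidera"],
    ["latin", "tropical", "bachata"],
    ["latin", "reggaeton", "trap latino"],
    [],  # a song whose artists have no genre information
]

# Flatten the per-song lists and count how many songs mention each genre
genre_counts = Counter(genre for genres in sample_genres for genre in genres)

# Show the most frequent genres first
for genre, n_songs in genre_counts.most_common(3):
    print(genre, n_songs)
```

Note that the empty list contributes nothing to the tally, so songs without any artist genres never show up in a count like this.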

Given that `latin` is the most common genre, I will go back to `data` and check whether `latin` appears in each song's list of genres. If it does, I will replace the list with `latin` in order to simplify my process. There are multiple subgenres, like cumbia, bachata, merengue, etc., but for the initial phase, I am focusing on latin or not latin.

--------
After initial EDA, I have to modify my problem since my songs are overwhelmingly Hispanic. I made subdivisions of the different subgenres within Hispanic music. They're not as accurate, but this is how I distinguish my own music tastes and how I want my playlists to be organized (for the most part). I noticed that some artists do not have a genre associated with them, so let's figure out why and what to do with these values.

In [None]:
bachata_salsa_merengue = {'porro','bachata', 'merengue','salsa', 'salsa peruana', 'tropical', 'timba', 'cuban rumba','dominican pop'}
artists_bsm = []

cumbia_and_mexican = {'trival','cumbia','cumbia salvadorena','gruperas inmortales' ,'cumbia villera', 
                      'cumbia sonidera', 'tejano', 'guaracha', 'cumbia paraguaya','nu-cumbia','deep cumbia sonidera', 
                      'grupera','tamborazo','banda', 'ranchera', 'mariachi', 'duranguense', 
                      'cancion melodica', 'norteno-sax','regional mexican','regional mexican pop'}
artists_cam = []

latin_reggaeton = {'rap dominicano','perreo','venezuelan hip hop','chilean hardcore','reggaeton', 
                   'dembow', 'trap latino', 'latin hip hop','panamanian pop','venezuelan indie',
                   'latin pop', 'mexican edm', 'pop romantico', 'reggaeton flow', 'colombian pop','electro latino',
                   'puerto rican pop'}
artists_lat_reg = []

not_hispanic = {'escape room','christian music','shimmer pop','bass trap', 'indie folk', 'stomp and holler',
                'emo rap','indie pop','indie soul', 'edm', 'house', 
                'british soul', 'disco', 'funk','la pop','uk hip hop',
                'social media pop','afropop','reggae fusion','ghanaian pop','soca',
                'hip hop','pop rap', 'pop', 'hip pop', 'rap', 'chicago rap','chicago pop', 
                'meme rap', 'dancehall', 'outlaw country', 'contemporary country','indietronica', 
                'indie poptimism', 'vapor twitch','electropop', 'vapor soul','tropical house'}
artists_not_his = []

rock = {'mexican rock', 'rock en espanol', 'latin rock', 'mexican rock-and-roll',
        'metalcore','screamo','rock', 'rock-and-roll', 'punk', 'alternative metal', 
        'soft rock', 'glam rock', 'piano rock'}
artists_rock = []

empty_index = []


In [None]:
simplified_genres = []
for i in range(len(genres_found)):
    counter_a = len(not_hispanic & set(genres_found[i]))
    counter_b = len(rock & set(genres_found[i]))
    counter_c = len(cumbia_and_mexican & set(genres_found[i]))
    counter_d = len(bachata_salsa_merengue  & set(genres_found[i]))
    counter_e = len(latin_reggaeton & set(genres_found[i]))

    count_max = max([counter_a, counter_b, counter_c, counter_d, counter_e])
    
    if count_max == 0:
        simplified_genres.append('Empty')
        empty_index.append(i)
        
    elif count_max == counter_a:    
        simplified_genres.append('Non Hispanic')
        artists_not_his.extend(artist_name[i])
        
    elif count_max == counter_b:
        simplified_genres.append('Rock')
        artists_rock.extend(artist_name[i])

        
    elif count_max == counter_c:
        simplified_genres.append('Cumbia y Musica Mexicana')
        artists_cam.extend(artist_name[i])
       
    elif count_max == counter_d:
        simplified_genres.append('Bachata Salsa y Merengue')
        artists_bsm.extend(artist_name[i])

    elif count_max == counter_e:
        simplified_genres.append('Reggaeton and Latin Pop')
        artists_lat_reg.extend(artist_name[i])



data['simplified_genre'] = simplified_genres
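
The overlap counting above can also be sketched as a standalone function, which makes the tie-breaking behaviour easier to see. This is only an illustration: `pick_umbrella` and the trimmed-down `umbrellas` mapping are hypothetical names, and ties are resolved by dictionary order here, much as the `elif` chain resolves them by branch order.

```python
def pick_umbrella(song_genres, umbrellas):
    """Return the umbrella genre sharing the most genres with the song.

    Returns 'Empty' when nothing overlaps. Ties go to the umbrella
    listed first in `umbrellas`.
    """
    overlaps = {name: len(members & set(song_genres))
                for name, members in umbrellas.items()}
    best = max(overlaps, key=overlaps.get)  # max keeps the first of equal values
    return best if overlaps[best] > 0 else 'Empty'


# Trimmed-down, illustrative versions of the genre sets
umbrellas = {
    'Rock': {'mexican rock', 'rock en espanol', 'latin rock'},
    'Cumbia y Musica Mexicana': {'cumbia', 'cumbia sonidera', 'banda'},
    'Bachata Salsa y Merengue': {'bachata', 'tropical', 'salsa'},
}

print(pick_umbrella(['cumbia', 'cumbia sonidera'], umbrellas))
print(pick_umbrella(['latin', 'tropical', 'bachata'], umbrellas))
print(pick_umbrella([], umbrellas))
```

A song tagged with one genre from each of two umbrellas lands in whichever umbrella is checked first, so the ordering is itself a modelling decision.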

In [None]:
genre_dummied = data['simplified_genre'].str.get_dummies()
genre_dummied

In [None]:
genre_dummied.sum().sort_values(ascending= False) 

Looks like there are some artists that do not have genre data. I'm not sure why that is, but since the genres are stored as a list, an empty list would not have appeared as a null value. Let's check out what the `Empty` genre looks like.

In [None]:
data[data['simplified_genre'] == 'Empty'][15:20]

Oh man, it looks like I will have to manually assign them to categories. Let the listening begin. I'll reassign the simplified genres to make sure none are left empty. I will also add the artists to their respective artist lists in case I need to use them in the future. 


After this, I'll make playlists based on some machine learning models. The bright side is that since I'm only concerned with Hispanic genres, anything with an English title will be placed into Non Hispanic. 

In [None]:
artists_bsm.extend(['Kalimete','Merenglass', 'Merenglass Grupo', 'Célia', 'Jeyro', 'Judy Santos', 
                    'Orquesta los Adolecente', 'Orquesta Noche Sabrosa'])
artists_cam.extend(['Grupo Dinastia Mendoza', 'Nuevo Nivel Norteño', 'Grupo Firme', 
                    'Banda Coloso','La Atrevida Banda Sierra Blanca','Grupo Novedoso',
                    'Super Máquina Musical de Guerrero', 'Sonora Dinamica De Colombia',
                    'La Conga', 'A.B. Quintanilla III Y Los Kumbia Kings', 'Ricardo Muñoz',
                    'Sonador', 'Control', 'El Amigable De Tijuana', 'Tornado','La Hija Del Mariachi',
                    'Ángela Aguilar', 'Los Kiero', 'Banda El RetoÑo', 'Grupo Kual? Dinastia Pedraza',
                    'A Mover La Colita Cumbias', "Los Karkik's" ])
artists_lat_reg.extend(['Jenn', 'Tatiana Hazel', 'Danny Daniel','La Montra' ,'Dj Worldwide', 
                        'John Jairo & Dj Ewduarmix'])
artists_not_his.extend(['Hvrbie', 'Franklyn Watts', 'Rhiannon Roze', 'Homestead', 
                        'Chris Jobe', 'Tim Gent', 'Finatticz', 'YUNGHELLBOY', "Jo'el Monroe", 
                        'La Doña','TITUS', 'Jay Pharoah', 'Myles William', 'RØYLS', 'Ashley Clark',
                        'D Nilsz', 'K.P. & Envyi', ])
# artists_rock.extend([]) No rock music found in empty genre 

In [None]:
for index in empty_index:
    counter_a = len(set(data.loc[index, 'artists']) & set(artists_bsm))
    counter_b = len(set(data.loc[index, 'artists']) & set(artists_cam))
    counter_c = len(set(data.loc[index, 'artists']) & set(artists_lat_reg))
    counter_d = len(set(data.loc[index, 'artists']) & set(artists_not_his))
    counter_e = len(set(data.loc[index, 'artists']) & set(artists_rock))
    count_max = max([counter_a, counter_b, counter_c, counter_d, counter_e])
    
    if count_max == 0:
        data.loc[index, 'simplified_genre'] = 'Still Empty'
        
    elif count_max == counter_a:  
        data.loc[index, 'simplified_genre'] = 'Bachata Salsa y Merengue'
        
    elif count_max == counter_b:
        data.loc[index, 'simplified_genre'] = 'Cumbia y Musica Mexicana'

        
    elif count_max == counter_c:
        data.loc[index, 'simplified_genre'] = 'Reggaeton and Latin Pop'

       
    elif count_max == counter_d:
        data.loc[index, 'simplified_genre'] = 'Non Hispanic'

    elif count_max == counter_e:
        data.loc[index, 'simplified_genre'] = 'Rock'
        
    

This is the composition of my liked music. I will make playlists based on these manually selected genres.

In [None]:
data['simplified_genre'].value_counts(normalize = True)

In [None]:
data['simplified_genre'].value_counts()

# Playlist Creations

## Playlists (Manual Simplification) 

In [None]:
# # HOW TO REMOVE A PLAYLIST 
# playlist_genres = set(data['simplified_genre'])
# for l in sp.current_user_playlists()['items']: 
#     if l['name'] in playlist_genres: 
#         sp.user_playlist_unfollow(user= sp.me()['id'], playlist_id= l['id'])

In [None]:
playlist_genres = set(data['simplified_genre'])
myself = sp.me()['id']
current_playlist_names = []
for l in sp.current_user_playlists()['items']: 
    current_playlist_names.append(l['name'])


In [None]:
for i in playlist_genres: 
    if i not in current_playlist_names:
        playlist_name = i
        description = f'Using Python to Manually Sort Liked Songs pertaining to "{i}" genres'
        ls_songs_uri = list(data[data['simplified_genre'] == i]['song_uri'])
        playlist = sp.user_playlist_create(user= myself, name = playlist_name, description= description)
        current_playlist_names.append(playlist['name'])
        print(f"{playlist['name']} has been created")
        play_id = playlist['id']
        #Add the songs in batches of 50 so each request stays small
        for start in range(0, len(ls_songs_uri), 50):
            batch = ls_songs_uri[start : start + 50]
            print('attempting')
            print(f'adding songs from {start} to {start + len(batch)}')
            sp.user_playlist_add_tracks(user = myself, playlist_id= play_id, tracks= batch)
            print('done')
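
Both the audio-feature requests and the playlist updates send items in fixed-size batches. A small helper can make that slicing explicit. This is a sketch only: the name `chunked` is hypothetical and the URIs below are placeholders, not real tracks.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Placeholder URIs standing in for ls_songs_uri
fake_uris = [f"spotify:track:{n}" for n in range(120)]

# Each batch could be passed to one API call
for batch in chunked(fake_uris, 50):
    print(len(batch))
```

Stepping through `range(0, len(items), size)` never produces an empty batch, even when the list length is an exact multiple of the batch size.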



## Playlists (Using KMeans) 

In [None]:
#only want to keep numeric values
X = data.drop(columns= ['song_uri', 'song_name','artists','genres', 'simplified_genre'])
#scaling everything since KMeans is sensitive to distance
sc = StandardScaler()
X_sc = sc.fit_transform(X)

In [None]:
km = KMeans(n_clusters=5, random_state=42)
km.fit(X_sc)

In [None]:
km.labels_

In [None]:
data['Kmeans_cluster'] = km.labels_
data

In [None]:
scores = []
for k in range(2, 51):
    cl = KMeans(n_clusters=k)
    cl.fit(X_sc)
    inertia = cl.inertia_
    sil = silhouette_score(X_sc, cl.labels_)
    scores.append([k, inertia, sil])
    
score_df = pd.DataFrame(scores)
score_df.columns = ['k', 'inertia', 'silhouette']

In [None]:
score_df.head()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 7))
axes[0].plot(score_df.k, score_df.inertia)
axes[0].set_title('Inertia over k')
axes[1].plot(score_df.k, score_df.silhouette);
axes[1].set_title('Silhouette Score over k')

Based on these graphs, I can conclude that KMeans is not suitable. If it were, the inertia plot would show a sharp decrease followed by a flattening out, but there is no such "elbow" here. The silhouette plot does show a sharp decrease, but the score is so low overall that KMeans is not useful for identifying clusters in this dataset. We will have to explore other options. 
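
For reference, these are the standard definitions of the two quantities plotted above, written for k clusters C_1, ..., C_k with centroids mu_j. Here a(i) is the mean distance from point i to the other points in its own cluster, and b(i) is the mean distance from point i to the points of the nearest other cluster.

```latex
\text{inertia} = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
\qquad
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}
```

The silhouette score is the mean of s(i) over all points. It ranges from -1 to 1, and values near 0 suggest that the clusters overlap.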

## Playlists (Using KMeans) - Including PCA

In [None]:
pca = PCA(n_components= 3)
pca.fit(X_sc)

In [None]:
new_X = pca.transform(X_sc)

In [None]:
pca_km = KMeans(n_clusters=5, random_state=42)
pca_km.fit(new_X)

In [None]:
data['Kmeans_cluster_pca'] = pca_km.labels_
data

In [None]:
scores = []
for k in range(2, 51):
    cl = KMeans(n_clusters=k)
    cl.fit(new_X)
    inertia = cl.inertia_
    sil = silhouette_score(new_X, cl.labels_)
    scores.append([k, inertia, sil])
    
score_df = pd.DataFrame(scores)
score_df.columns = ['k', 'inertia', 'silhouette']

In [None]:
score_df.head()

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(14, 7))
axes[0].plot(score_df.k, score_df.inertia)
axes[0].set_title('Inertia over k')
axes[1].plot(score_df.k, score_df.silhouette);
axes[1].set_title('Silhouette Score over k')

In [None]:
# Pull the explained variance attribute.
var_exp = pca.explained_variance_ratio_
print(f'Explained variance (first 3 components): {np.round(var_exp[:3],3)}')

print('')

# Generate the cumulative explained variance.
cum_var_exp = np.cumsum(var_exp)
print(f'Cumulative explained variance (first 3 components): {np.round(cum_var_exp[:3],3)}')

In [None]:
# Plot the variance explained (and cumulative variance explained).

# Set figure size.
plt.figure(figsize=(12,8))

# Plot the explained variance.
plt.plot(range(len(var_exp)), var_exp, lw=3, label = 'Variance Explained')

# Plot the cumulative explained variance.
plt.plot(range(len(var_exp)), cum_var_exp, lw=3, color = 'orange', label = 'Cumulative Variance Explained')

# Add horizontal lines at y=0 and y=1.
plt.axhline(y=0, linewidth=1, color='grey', ls='dashed')
plt.axhline(y=1, linewidth=1, color='grey', ls='dashed')

# Set the limits of the axes.
plt.xlim([-1,15])
plt.ylim([-0.01,1.01])

# Label the axes.
plt.ylabel('Variance Explained', fontsize=20)
plt.xlabel('Principal Component', fontsize=20)

# Make the tick labels bigger
plt.xticks(range(0, 15, 5), range(1, 15, 5), fontsize=12)
plt.yticks(fontsize=12)
    
# Add title and legend.
plt.title('Component vs. Variance Explained', fontsize=24)
plt.legend(fontsize=11);

PCA isn't really useful here since we need more than three components to explain most of the variance. Although the silhouette score improved, the three components explain only 36.8% of the variance. This confirms that we need to try another type of clustering algorithm.
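
One way to decide how many components to keep is to look for the smallest number whose cumulative explained variance passes a chosen threshold. The sketch below shows the arithmetic on made-up ratios. `components_needed` and the numbers are hypothetical and are not the values from my data.

```python
from itertools import accumulate


def components_needed(variance_ratios, threshold=0.9):
    """Smallest number of leading components whose cumulative ratio reaches threshold.

    Returns None when even all supplied components together fall short.
    """
    for n_components, cumulative in enumerate(accumulate(variance_ratios), start=1):
        if cumulative >= threshold:
            return n_components
    return None


# Made-up explained variance ratios, sorted from largest to smallest
example_ratios = [0.5, 0.25, 0.125, 0.125]

print(components_needed(example_ratios, threshold=0.7))
print(components_needed(example_ratios[:2], threshold=0.9))
```

In this notebook, `pca.explained_variance_ratio_` only has three entries because of `n_components=3`, so the PCA would need to be refit with all components before a check like this is meaningful.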

## Playlists (Using DBSCAN) 

In [None]:
#scaling everything since DBSCAN is sensitive to distance
sc = StandardScaler()
X_sc = sc.fit_transform(X)

In [None]:
dbscan = DBSCAN(eps= 6, min_samples= 1)
dbscan.fit(X_sc);
set(dbscan.labels_)

In [None]:
silhouette_score(X_sc, dbscan.labels_)

In [None]:
dbscan = DBSCAN(eps= 6.7, min_samples= 1)
dbscan.fit(X_sc);
set(dbscan.labels_)

In [None]:
silhouette_score(X_sc, dbscan.labels_)

In [None]:
dbscan = DBSCAN(eps= 3, min_samples= 2)
dbscan.fit(X_sc);
set(dbscan.labels_)

In [None]:
silhouette_score(X_sc, dbscan.labels_)

In [None]:
dbscan = DBSCAN(eps= 2, min_samples= 5)
dbscan.fit(X_sc);
set(dbscan.labels_)

In [None]:
silhouette_score(X_sc, dbscan.labels_)

I need to find a systematic way to test various combinations and find the optimal hyperparameters.
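
One systematic option is a small grid search over `eps` and `min_samples`. The sketch below is a starting point rather than a finished method: `dbscan_grid` is a hypothetical helper, the data comes from `make_blobs` instead of my songs, and scoring only the non-noise points is an assumption I'm making so that `silhouette_score` has at least two clusters to compare.

```python
from itertools import product

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def dbscan_grid(X, eps_values, min_samples_values):
    """Fit DBSCAN for every (eps, min_samples) pair and score the non-noise points."""
    rows = []
    for eps, min_samples in product(eps_values, min_samples_values):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
        mask = labels != -1  # DBSCAN labels noise points as -1
        n_clusters = len(set(labels[mask]))
        # silhouette_score needs between 2 and n_samples - 1 distinct labels
        if 2 <= n_clusters < mask.sum():
            score = silhouette_score(X[mask], labels[mask])
        else:
            score = float("nan")
        rows.append((eps, min_samples, n_clusters, int((~mask).sum()), score))
    return rows


# Synthetic stand-in for X_sc
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)
X_demo = StandardScaler().fit_transform(X_demo)

grid_results = dbscan_grid(X_demo, eps_values=[0.2, 0.3, 0.5],
                           min_samples_values=[3, 5, 10])
for eps, min_samples, n_clusters, n_noise, score in grid_results:
    print(eps, min_samples, n_clusters, n_noise, round(score, 3))
```

Keeping the number of clusters and the number of noise points next to the score matters, because a high silhouette on a handful of surviving points is not a useful playlist split.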

# Actionable Next Steps 

It seems like I was able to do this pretty efficiently for myself. There is a large portion of Hispanic users who might want to sort their music by Hispanic subgenres (Bachata, Merengue, Cumbia, etc) but haven't found an easy way to sort their music. A couple of older family members are new to technology and if I can scale this up, this could potentially be used to sort music.