# Project | Machine Learning with Spotify API Data

Creation of a Spotify song recommender.

## Scenario
You have been hired as a Data Analyst for "Gnod".
"Gnod" is a site that provides recommendations for music, art, literature and products based on collaborative filtering algorithms. Their flagship product is the music recommender, which you can try at www.gnoosic.com. The site asks users to input 3 bands they like, and computes similarity scores with the rest of the users. Then, they recommend to the user bands that users with similar tastes have picked.
"Gnod" is a small company, and its only revenue stream so far is ads on the site. In the future, they would like to explore partnership options with music apps (such as Deezer, Soundcloud or even Apple Music and Spotify). However, for that to be possible, they need to expand and improve their recommendations.
That's precisely where you come in. They have hired you as a Data Analyst, and they expect you to bring a mix of technical expertise and business mindset to the table.
Jane, CTO of Gnod, has sent you an email assigning you your first task.

## Task(s)

This is an e-mail that Jane, CTO of Gnod, sent to your inbox during your first weeks working there.
Dear xxxxxxxx, We are thrilled to welcome you as a Data Analyst for Gnoosic!
As you know, we are trying to come up with ways to enhance our music recommendations. One of the new features we'd like to research is to recommend songs (not only bands). We're also aware of the limitations of our collaborative filtering algorithms, and would like to give users new possibilities when searching for recommendations:
Songs that are actually similar to the ones they picked from an acoustic point of view.
Coming up with the perfect song recommender will take us months - no need to stress out too much. In this first week, we want you to explore new data sources for songs. The Internet is full of information and our first step is to acquire it and do an initial exploration. Feel free to use APIs or directly scrape the web to collect as much information as possible from popular songs. Eventually, we'll need to collect data from millions of songs, but we can start with a few hundred or a few thousand from each source and see if the collected features are useful.
Once the data is collected, we want you to create clusters of songs that are similar to each other. The idea is that if a user inputs a song from one group, we'll prioritize giving them recommendations of songs from that same group.
On Friday, you will present your work to me and Marek, the CEO and founder. Full disclosure: I need you to be very convincing about this whole song recommender, as this has been my personal push and the main reason we hired you!
Be open-minded about this process: we are agile, and that means that we define our products and features on the go, while exploring the tools and the data that are available to us. We'd love you to provide your own vision of the product and the next steps to be taken.
Lots of luck and strength for this first week with us!
-Jane

___________________________________________________________________________________________________

# Data Gathering
____________________________________________________________________________________________________

In [1]:
import config
import spotipy
import json
import pandas as pd
import os
from time import sleep 
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import numpy as np
CLIENT_ID = os.getenv("spotify_client_id")
CLIENT_SECRET = os.getenv("spotify_client_secret_id")

In [2]:
from spotipy.oauth2 import SpotifyClientCredentials

#Initialize SpotiPy with the app's client credentials (no user login involved)
sp = spotipy.Spotify(auth_manager=SpotifyClientCredentials(client_id=CLIENT_ID,
                                                           client_secret=CLIENT_SECRET))


## Importing from existing song list

In [3]:
tracks_30K = pd.read_csv("tracks_and_features.csv")

In [4]:
track_ids_30K= tracks_30K["track_id"]
track_ids_30K_list = track_ids_30K.tolist()

## Adding more songs via categories & playlists from Spotipy API

In [85]:
# Pulling list of categories

categories = sp.categories(limit = 50)
categories['categories']['items'][0]
len(categories['categories']['items'])

50

In [86]:
# Extracting Category IDs

category_ids = [category['id'] for category in categories['categories']['items']]
category_names = [category['name'] for category in categories['categories']['items']]

In [87]:

for n, i in zip(category_names, category_ids):
    if n[:1].lower() == 'f':
        print(n,i)

French Variety 0JQ5DAqbMKFBweu2euf3Sj
Folk & Acoustic 0JQ5DAqbMKFy78wprEpAjl
Focus 0JQ5DAqbMKFCbimwdOYlsl
Funk & Disco 0JQ5DAqbMKFFsW9N8maB6z


In [88]:
category_playlists_responses = []
category_playlists_response_errors = []
for cat_name, cat_id in zip(category_names, category_ids):
    print(f"{cat_name}", end=" ")
    try:
        category_playlists_responses.append(sp.category_playlists(cat_id))
    except Exception:
        # keep track of the categories that failed so they can be retried later
        category_playlists_response_errors.append((cat_name, cat_id))
    sleep(2)

Made For You New Releases Hip-Hop Pop French Variety Latin Charts In the car Rock Mood Dance/Electronic Discover Indie Workout R&B Afro K-pop Chill Sleep Party Dancehall/Zouk Reggae At Home Decades Alternative Love Metal Jazz Trending EQUAL RADAR Classical Folk & Acoustic Focus Soul Netflix Kids & Family Student Gaming Wellness Punk Ambient Country Blues Arab Travel Caribbean Songwriters Nature & Noise Funk & Disco 
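
The loop above records failed categories but never retries them. A minimal sketch of a retry helper with exponential backoff follows; `call_with_retry` and its parameters are hypothetical names introduced here, not part of spotipy.

```python
import time


def call_with_retry(fetch, retries=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch() and retry on failure, doubling the wait each time.

    Returns the result of fetch(), or re-raises the last exception once
    all attempts are used up. `sleep` is injectable so the helper can be
    tested without actually waiting.
    """
    for attempt in range(retries):
        try:
            return fetch()
        except Exception:
            if attempt == retries - 1:
                raise  # out of attempts: surface the error to the caller
            sleep(base_delay * 2 ** attempt)  # waits 2s, 4s, 8s, ...
```

In the loop above it could be used as `call_with_retry(lambda: sp.category_playlists(cat_id))`, keeping the existing `try`/`except` as the last line of defence.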

In [89]:
for c in category_playlists_responses:
    if c['playlists']['items'] is None:
        print(c)

In [90]:
category_playlists_responses[20]['playlists']['items']

[{'collaborative': False,
  'description': 'Les hits zouk du moment. Photo :  Mik L x Nesly',
  'external_urls': {'spotify': 'https://open.spotify.com/playlist/37i9dQZF1DWZjVBCh3y4ph'},
  'href': 'https://api.spotify.com/v1/playlists/37i9dQZF1DWZjVBCh3y4ph',
  'id': '37i9dQZF1DWZjVBCh3y4ph',
  'images': [{'height': None,
    'url': 'https://i.scdn.co/image/ab67706f000000021828260835d9ea74b13fb789',
    'width': None}],
  'name': 'Zouk Station',
  'owner': {'display_name': 'Spotify',
   'external_urls': {'spotify': 'https://open.spotify.com/user/spotify'},
   'href': 'https://api.spotify.com/v1/users/spotify',
   'id': 'spotify',
   'type': 'user',
   'uri': 'spotify:user:spotify'},
  'primary_color': '#ffffff',
  'public': True,
  'snapshot_id': 'Zg8i4AAAAACaEW5BTl7kDJpoChZSBris',
  'tracks': {'href': 'https://api.spotify.com/v1/playlists/37i9dQZF1DWZjVBCh3y4ph/tracks',
   'total': 40},
  'type': 'playlist',
  'uri': 'spotify:playlist:37i9dQZF1DWZjVBCh3y4ph'},
 {'collaborative': False,

In [91]:
many_playlists = []
for category_playlists_response in category_playlists_responses:
    #print(f"x{len(category_playlists_response['playlists']['items'])}", end=" ")
    #[print(playlist.get('id')) for playlist in category_playlists_response['playlists']['items'] if playlist]
    many_playlists.append([playlist['id'] for playlist in category_playlists_response['playlists']['items'] if playlist])
len(many_playlists)

50

In [92]:
playlist_tracks_responses = []
i=0
for playlists in many_playlists:
    for playlist in playlists:
        playlist_tracks_responses.append(sp.playlist_tracks(playlist)) # we could replace with function to collect all songs from each playlist
        i+=1
        if i % 25 == 0:
            print(f"{i} playlists captured")
        sleep(4)

25 playlists captured
50 playlists captured
75 playlists captured
100 playlists captured
125 playlists captured
150 playlists captured
175 playlists captured
200 playlists captured
225 playlists captured
250 playlists captured
275 playlists captured
300 playlists captured
325 playlists captured
350 playlists captured
375 playlists captured
400 playlists captured
425 playlists captured
450 playlists captured
475 playlists captured
500 playlists captured
525 playlists captured
550 playlists captured
575 playlists captured
600 playlists captured
625 playlists captured
650 playlists captured
675 playlists captured
700 playlists captured
725 playlists captured
750 playlists captured
775 playlists captured
800 playlists captured
825 playlists captured
850 playlists captured
875 playlists captured
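
The comment in the cell above mentions replacing the single call with a function that collects all songs from each playlist. One possible sketch is shown below. It assumes `client` behaves like a `spotipy.Spotify` instance, where `playlist_items()` returns one page and `next()` returns the following page or `None`; the helper name `fetch_all_playlist_items` is hypothetical.

```python
def fetch_all_playlist_items(client, playlist_id):
    """Collect every item of a playlist by following the paging links.

    `client` is assumed to expose playlist_items(playlist_id) and
    next(page), as spotipy.Spotify does.
    """
    page = client.playlist_items(playlist_id)
    items = list(page["items"])
    # Each page carries a "next" link until the last page is reached.
    while page.get("next"):
        page = client.next(page)
        if page is None:
            break
        items.extend(page["items"])
    return items
```

Each extra page is one more request, so the `sleep` between playlists would probably need to move inside the paging loop as well.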


In [None]:
len(playlist_tracks_responses)

In [None]:
playlist_tracks_responses[822]['items'][0]['track']

In [None]:
artist_names = []
artist_ids = []
track_names = []
track_ids = []
album_release_dates = []
album_release_date_precisions = []
is_explicits = []
durations_ms = []
popularity_scores = []

for playlist_tracks_response in playlist_tracks_responses:
    for t in playlist_tracks_response['items']:
        if t['track']:
            artist_ids.append(t['track']['artists'][0]['id'])
            artist_names.append(t['track']['artists'][0]['name'])
            track_ids.append(t['track']['id'])
            track_names.append(t['track']['name'])
            album_release_dates.append(t['track']['album']['release_date'])
            album_release_date_precisions.append(t['track']['album']['release_date_precision'])
            is_explicits.append(t['track']['explicit'])
            durations_ms.append(t['track']['duration_ms'])
            popularity_scores.append(t['track']['popularity'])

tracks_df = pd.DataFrame(
{'artist_id':artist_ids
,'artist_name':artist_names
, 'track_id':track_ids
, 'track_name':track_names
, 'album_release_date':album_release_dates
, 'album_release_date_precision':album_release_date_precisions
, 'is_explicit':is_explicits
, 'duration_ms':durations_ms
, 'popularity':popularity_scores
})

print("Tracks captured: ", tracks_df.shape[0])
tracks_df = tracks_df.drop_duplicates()
print("Tracks remaining after handling dupes: ", tracks_df.shape[0])

tracks_df.to_csv('tracks.csv',index=False)

## Adding more songs via playlists from Spotipy API

In [5]:
#Gathering song information from playlists using playlist ID

playlist_ids = ["5XoWyyakdCj0eMXdNq87Xl","2beGxzTgxl11D95OTXTJjO","37i9dQZF1EQnqst5TRi17F", "1tKBndVtn5OCpfy4rcmurG", "37i9dQZF1E4qLyivW6wXLU", "4GNvB0NkFezFss5l4T0O78", "37i9dQZF1DXbXD9pMSZomS", "6RbTK3cIUwwcTbzOPSdJTC", "37i9dQZF1DX16FYsDCnfL9", "1h1wPrq6HqEgb6C88NdY1F"]

all_playlist_items = []

for playlist_id in playlist_ids:
    playlist_items = sp.playlist_items(playlist_id)
    all_playlist_items.append(playlist_items)

print(all_playlist_items)



In [6]:
#Extracting song IDs from all_playlist_items

# Note: only the first playlist (index 0) is read here; the other playlists in
# all_playlist_items are not included in track_ids.
# The API response is already a list of dicts, so no JSON round trip is needed.
track_ids = [item['track']['id'] for item in all_playlist_items[0]['items']]

print(track_ids)

['0tIkLggEJ9QriD7KFz1GGA', '1WP7rr9V3s3Ees7Yumvcsc', '1saXdvEAafdRzUphXBzSHg', '7dTyZl4cBRKnPBeq1fdrHY', '45pxaXkI8eSyvSmG5sHemY', '00my773cwSEc9tdnQBaLGI', '1WeemV2TqouSYkf8A8PQB7', '0aSQ98crQLAOa6PYUJW9Rg', '0Kc8Wv9c0p9cpyKT0BtVao', '70MdH2wpvvIfPqR6xCVDEZ', '7xrOKGH1oKSS6VbGIhiydE', '7nd51y4iaZaNLZcbAyQZWK', '5pkO89lmMmDKhZItC6iKuP', '22TrxuTE2bllUWnZVJ3Qcp', '01AKBy0h9oJamIXgMkDXu0', '3gmfFneY6uzrz6MON0A9eI', '51bI5ELOFmt0V4Z5TB0ulN', '2GZkcMGoupMWnzrDeTYmxG', '4RjkqfDxCxBVWQ6yclgLVc', '7b0NyzvbB64gsRpafUFUNP', '0Ljig8Rgb59cmfBdufLTZp', '0AhXAec5sq48khkI5xQytU', '2UQ9GUmmHHIymRNNkQDW4a', '3hmLdTnX4KQT5kpj8bMiN8', '4N8nlsCe28faBiXN6xsuKg', '7Mu5hWaPbT7Ps8aQ1eaYwq', '71PBeQP7j0plu7tAp94f6S', '1zCY8dAJg4poghTa99hpnV', '10qpwA4FS2qxVFGRYFAVfr', '1bqvrbR2AW7PB9dMgV4hTn', '7BAHKwMMX4W1K1MdfniBuV', '4eGixKa6tOmNnBVmahHqhI', '0Li1z9niLLkA4IioEzKwnr', '6Sr7kyYlnpFgiLfXyKje1g', '4vN31hpnqIwYk25lnuPMP4', '7xjmqf3xGRg3ZLjms8vPXb', '40MNKAPyVpXHHi9Fx16jWt', '2WqWrI1wDfW9WS2Tlq3uRu', '2DpvWnJreK
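
The cell above reads track IDs from the first playlist response only. A sketch of how IDs could be gathered from every response is given below; `extract_track_ids` is a hypothetical helper, and the assumption is that each response has the same `items` / `track` / `id` layout as the one used above.

```python
def extract_track_ids(playlist_responses):
    """Return the track IDs found in a list of playlist-item responses.

    Items whose 'track' is missing or None, and tracks without an ID,
    are skipped instead of raising an error.
    """
    track_ids = []
    for response in playlist_responses:
        for item in response.get("items", []):
            track = item.get("track") if item else None
            if track and track.get("id"):
                track_ids.append(track["id"])
    return track_ids
```

Calling `extract_track_ids(all_playlist_items)` would then cover all ten playlists, which also changes the total number of tracks reported further down.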

In [7]:
master_track_id = track_ids_30K_list + track_ids
print(master_track_id)

['6YLlHjowYdcrULkRk4r4Ga', '2F2p7b5Xq20mRyEeWYaeUF', '17B9XlyyYDKkTmPay4z02V', '1cYyZ1N98GyRS3cMcccFHd', '2a1o6ZejUi8U3wzzOtCOYw', '01aTsQoKoeXofSTvKuunzv', '1zLBsSVxETQOfINOLYvT5m', '5iwz1NiezX7WWjnCgY5TH4', '6OufwUcCqo81guU2jAlDVP', '5YaskwnGDZFDRipaqzbwQx', '2xsRIYRAC0mXttKR4sZscl', '0vR2rIVORmgeKiGIgNT0fV', '3BKD1PwArikchz2Zrlp1qi', '5uEYRdEIh9Bo4fpjDd4Na9', '6BePGk3eCan4FqaW2X8Qy3', '28LzmRnsiidhiBBUvdi6Aj', '5fwSHlTEWpluwOM0Sxnh5k', '2SdMEke7JCBEo8t60iyw2r', '6BhwbBtaXivvPHmHCZmxkT', '3uUuGVFu1V7jTQL60S1r8z', '4ekDggjElGJX9kNSOtnbVf', '6h5PAsRni4IRlxWr6uDPTP', '4zN21mbAuaD0WqtmaTZZeP', '3ig8Zzc0TgQbtJZbctWHZa', '4GSDkZt7GcmzBWmsMcpaVO', '20HCH8XT2EK1QYe1loAJ8E', '3f5eXxFdnV4xppMhH4IC5r', '1HFfMOxCAT4GAwaPfCdmUs', '2y2TNx72H4oE01TkztsAR6', '72GVdRHK49beKrC1NqRCqj', '4cG7HUWYHBV6R6tHn1gxrl', '4Dvkj6JhhA12EX05fT7y2e', '5N5k9nd479b1xpDZ4usjrg', '3M89zvwwb1jtDUDKYNsdiA', '5dSWNkk66Gez4AM1kpZPiu', '1tpl2fUDbrPAcbPufthvQe', '6PUIzlqotEmPuBfjbwYWOB', '525GY0mXYnxajolwgS8Eaz', '79DFRI318i

In [8]:
len(master_track_id)

36265
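
The two lists are concatenated as they are, so a track that appears in both would be counted twice. Whether that happens here is not checked in the notebook; if it does, a sketch like this one removes repeats while keeping the original order (`unique_in_order` is a hypothetical name).

```python
def unique_in_order(track_ids):
    """Drop repeated IDs while keeping the first occurrence of each."""
    # dict keys are unique and preserve insertion order
    return list(dict.fromkeys(track_ids))
```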

In [9]:
# if we pull 100 audio features with each request, add a delay of 3 seconds between requests
# how long will it take approx?
print("Number of requests needed: ",len(master_track_id)/100)
print("Estimate time in minutes: ", len(master_track_id)/100 * 3 / 60)
print("Number of requests per minute: ", 60/3)

Number of requests needed:  362.65
Estimate time in minutes:  18.132499999999997
Number of requests per minute:  20.0
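
The estimate above divides without rounding, so it reports 362.65 requests. A partial batch still costs one full request, which the sketch below accounts for; `estimate_audio_feature_run` is a hypothetical helper and the defaults mirror the values used in the cell.

```python
import math


def estimate_audio_feature_run(n_tracks, batch_size=100, delay_s=3):
    """Return (number of requests, estimated minutes spent waiting)."""
    # Round up: the last, partially filled batch is still one request.
    n_requests = math.ceil(n_tracks / batch_size)
    minutes = n_requests * delay_s / 60
    return n_requests, minutes
```

For the 36,265 IDs collected above this gives 363 requests and about 18.15 minutes of waiting, not counting the time each request takes.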


In [10]:
list_of_audio_features = []
for x in range(0, len(master_track_id), 100):
    # slicing stops at the end of the list, so the last batch needs no special case
    list_of_audio_features.extend(sp.audio_features(master_track_id[x:x + 100]))
    sleep(3)

In [11]:
# Checking how many songs - should match the length of master_track_id generated above
len(list_of_audio_features)

36265

In [12]:
# Putting audio features into separate lists

danceability = []
energy = []
key = []
loudness = []
mode = []
speechiness = []
acousticness = []
instrumentalness = []
liveness = []
valence = []
tempo = []
duration_ms = []
time_signature = []
track_href = []

for af in list_of_audio_features:
    if af:
        danceability.append(af['danceability'])
        energy.append(af['energy'])
        key.append(af.get('key', -1))
        loudness.append(af['loudness'])
        mode.append(af['mode'])
        speechiness.append(af['speechiness'])
        acousticness.append(af['acousticness'])
        instrumentalness.append(af['instrumentalness'])
        liveness.append(af['liveness'])
        valence.append(af['valence'])
        tempo.append(af['tempo'])
        duration_ms.append(af['duration_ms'])
        time_signature.append(af['time_signature'])
        track_href.append(af['track_href'])
    else:
        danceability.append(-99)
        energy.append(-99)
        key.append(-99)
        loudness.append(-99)
        mode.append(-99)
        speechiness.append(-99)
        acousticness.append(-99)
        instrumentalness.append(-99)
        liveness.append(-99)
        valence.append(-99)
        tempo.append(-99)
        duration_ms.append(-99)
        time_signature.append(-99)
        track_href.append(-99)        


In [13]:
# Checking length is good by checking one of the audio features
len(danceability)

36265

In [14]:
audio_features_dict = {
    'danceability':danceability,
    'energy':energy,
    'key':key,
    'loudness':loudness,
    'mode':mode,
    'speechiness':speechiness,
    'acousticness':acousticness,
    'instrumentalness':instrumentalness,
    'liveness':liveness,
    'valence':valence,
    'tempo':tempo,
    'duration_ms':duration_ms,
    'time_signature':time_signature,
    'track_href':track_href
    }

audio_features_df = pd.DataFrame(audio_features_dict)
audio_features_df.head()

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,track_href
0,0.691,0.519,0,-7.77,1,0.0353,0.686,0.0767,0.269,0.117,125.008,197045,4,https://api.spotify.com/v1/tracks/6YLlHjowYdcr...
1,0.69,0.89,0,-4.866,0,0.0507,0.00903,1.6e-05,0.188,0.713,124.018,181478,4,https://api.spotify.com/v1/tracks/2F2p7b5Xq20m...
2,0.587,0.722,2,-6.983,0,0.0709,0.000754,0.668,0.543,0.466,121.964,180843,4,https://api.spotify.com/v1/tracks/17B9XlyyYDKk...
3,0.734,0.855,6,-4.578,0,0.135,0.0821,5e-06,0.0901,0.54,121.955,139563,4,https://api.spotify.com/v1/tracks/1cYyZ1N98GyR...
4,0.4,0.915,0,-4.89,1,0.28,0.0846,0.0,0.0503,0.878,88.917,198773,3,https://api.spotify.com/v1/tracks/2a1o6ZejUi8U...
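
The fourteen parallel lists above can also be avoided by building the rows directly. The sketch below is an alternative, not the notebook's method: it marks missing features with NaN instead of the -99 sentinel, and `audio_features_to_frame` and `FEATURE_COLUMNS` are hypothetical names.

```python
import pandas as pd

FEATURE_COLUMNS = [
    "danceability", "energy", "key", "loudness", "mode", "speechiness",
    "acousticness", "instrumentalness", "liveness", "valence", "tempo",
    "duration_ms", "time_signature", "track_href",
]


def audio_features_to_frame(track_ids, audio_features):
    """Build one row per track; tracks without features get missing values."""
    rows = []
    for track_id, af in zip(track_ids, audio_features):
        row = {"track_id": track_id}
        for col in FEATURE_COLUMNS:
            # af is None when the API returned no features for this track
            row[col] = af[col] if af else None
        rows.append(row)
    return pd.DataFrame(rows, columns=["track_id"] + FEATURE_COLUMNS)
```

With this layout, `dropna()` would take the place of a sentinel-based filter when removing tracks that have no audio features.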


In [15]:
# Combining master_track_id list and audio_features_df

#converting master_track_id to a pandas series
master_track_id_series = pd.Series(master_track_id, name='track_id')

tracks_and_features_df = pd.concat([master_track_id_series, audio_features_df], axis=1)


In [16]:
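# remove items we have missing audio features for (filled with -99 above)
# note: Spotify reports key = -1 when no key was detected, so those tracks are dropped as well
tracks_and_features_df = tracks_and_features_df[tracks_and_features_df['key']>=0]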

In [17]:
tracks_and_features_df.columns

Index(['track_id', 'danceability', 'energy', 'key', 'loudness', 'mode',
       'speechiness', 'acousticness', 'instrumentalness', 'liveness',
       'valence', 'tempo', 'duration_ms', 'time_signature', 'track_href'],
      dtype='object')

In [18]:
audio_features_model_on = ['danceability', 'energy', 'key',
       'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
       'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature']
features_df = tracks_and_features_df[audio_features_model_on]

In [19]:
tracks_and_features_df.to_csv('tracks_and_features.csv', index=False)
features_df.to_csv('features.csv', index=False)

# Training the Model - 1st Try
_________________________________________________________________________________

In [20]:
audio_features_model_on = ['danceability', 'energy', 'key',
       'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness',
       'liveness', 'valence', 'tempo', 'time_signature']
features_df = features_df[audio_features_model_on]

In [21]:
from sklearn.preprocessing import MinMaxScaler

# n_init set explicitly to silence the default-change warning
kmeans100 = KMeans(n_clusters=100, n_init=10, random_state=42)

scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(features_df)
kmeans100.fit(scaled_features)


In [22]:
kmeans100.labels_

array([ 3, 77, 31, ..., 21, 44, 21])
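
The number of clusters is fixed at 100 here without a comparison. One common way to compare candidate values is to look at inertia and the silhouette score for each of them. The sketch below does this on synthetic stand-in data, not on the notebook's features; `compare_cluster_counts` is a hypothetical helper.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def compare_cluster_counts(scaled_features, candidate_ks, random_state=42):
    """Fit KMeans for each k and return {k: (inertia, silhouette)}."""
    results = {}
    for k in candidate_ks:
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(scaled_features)
        results[k] = (model.inertia_, silhouette_score(scaled_features, labels))
    return results


# Synthetic stand-in data: three well-separated blobs in two dimensions.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])
demo = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

scores = compare_cluster_counts(demo, candidate_ks=[2, 3, 4, 5])
best_k = max(scores, key=lambda k: scores[k][1])  # k with the highest silhouette
```

On the full set of roughly 36,000 songs the silhouette score is expensive to compute; `silhouette_score` accepts a `sample_size` argument that may help there.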

In [23]:
tracks_clustered_df = tracks_and_features_df.copy()
tracks_clustered_df['cluster_km100'] = kmeans100.labels_
tracks_clustered_df[tracks_clustered_df['cluster_km100']==4].sample(1)

Unnamed: 0,track_id,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,track_href,cluster_km100
21337,4o800k0L50O5mIUB69SPyV,0.426,0.0766,7,-21.274,1,0.0361,0.98,0.947,0.107,0.213,114.239,149251,4,https://api.spotify.com/v1/tracks/4o800k0L50O5...,4


In [24]:
display(tracks_clustered_df)

Unnamed: 0,track_id,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,track_href,cluster_km100
0,6YLlHjowYdcrULkRk4r4Ga,0.691,0.519,0,-7.770,1,0.0353,0.686000,0.076700,0.2690,0.11700,125.008,197045,4,https://api.spotify.com/v1/tracks/6YLlHjowYdcr...,3
1,2F2p7b5Xq20mRyEeWYaeUF,0.690,0.890,0,-4.866,0,0.0507,0.009030,0.000016,0.1880,0.71300,124.018,181478,4,https://api.spotify.com/v1/tracks/2F2p7b5Xq20m...,77
2,17B9XlyyYDKkTmPay4z02V,0.587,0.722,2,-6.983,0,0.0709,0.000754,0.668000,0.5430,0.46600,121.964,180843,4,https://api.spotify.com/v1/tracks/17B9XlyyYDKk...,31
3,1cYyZ1N98GyRS3cMcccFHd,0.734,0.855,6,-4.578,0,0.1350,0.082100,0.000005,0.0901,0.54000,121.955,139563,4,https://api.spotify.com/v1/tracks/1cYyZ1N98GyR...,94
4,2a1o6ZejUi8U3wzzOtCOYw,0.400,0.915,0,-4.890,1,0.2800,0.084600,0.000000,0.0503,0.87800,88.917,198773,3,https://api.spotify.com/v1/tracks/2a1o6ZejUi8U...,25
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
36260,77wRBCrqhWvvQjEI44uHEi,0.182,0.942,10,-22.021,0,0.0621,0.000016,0.965000,0.3730,0.02790,80.479,113961,3,https://api.spotify.com/v1/tracks/77wRBCrqhWvv...,44
36261,3imggdpnHlcy8I25zT93KW,0.103,0.374,1,-22.173,1,0.0459,0.000000,0.993000,0.4650,0.07070,91.115,156511,3,https://api.spotify.com/v1/tracks/3imggdpnHlcy...,56
36262,7kMw28F523wOc3OLveJSF2,0.172,1.000,9,-23.813,0,0.0653,0.695000,0.556000,0.8340,0.00001,117.909,125631,1,https://api.spotify.com/v1/tracks/7kMw28F523wO...,21
36263,3agFKTnUIFMbLCtBrnxlqA,0.112,0.994,5,-20.663,0,0.0618,0.134000,0.990000,0.4460,0.01410,47.173,73561,3,https://api.spotify.com/v1/tracks/3agFKTnUIFMb...,44


In [25]:
import pickle

tracks_clustered_df.to_csv('tracks_clustered_df.csv', index=False)

with open('model_km100.pickle', 'wb') as handle:
    pickle.dump(kmeans100, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('scaler.pickle', 'wb') as handle:
    pickle.dump(scaler, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Training the Model - 2nd Try
_________________________________________________________________________________

In [26]:
audio_features_model_on_2 = ['key','mode', 'tempo']
features_df_2 = tracks_and_features_df[audio_features_model_on_2]

In [27]:
tracks_and_features_df.to_csv('tracks_and_features_2.csv', index=False)
features_df_2.to_csv('features.csv', index=False)

In [28]:
audio_features_model_on_2 = ['key','mode', 'tempo']
features_df_2 = features_df_2[audio_features_model_on_2]

In [29]:
from sklearn.preprocessing import MinMaxScaler

# n_init set explicitly to silence the default-change warning
kmeans100 = KMeans(n_clusters=100, n_init=10, random_state=42)

scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(features_df_2)
kmeans100.fit(scaled_features)


In [31]:
tracks_clustered_df_2 = tracks_and_features_df.copy()
tracks_clustered_df_2['cluster_km100'] = kmeans100.labels_
tracks_clustered_df_2[tracks_clustered_df_2['cluster_km100']==4].sample(1)

Unnamed: 0,track_id,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,track_href,cluster_km100
21809,2K85RbA7wlm8OV707sUWrV,0.692,0.0482,7,-28.882,1,0.109,0.996,0.969,0.123,0.846,68.359,74250,3,https://api.spotify.com/v1/tracks/2K85RbA7wlm8...,4


In [32]:
import pickle

tracks_clustered_df_2.to_csv('tracks_clustered_df_2.csv', index=False)

with open('model2_km100.pickle', 'wb') as handle:
    pickle.dump(kmeans100, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('scaler2.pickle', 'wb') as handle:
    pickle.dump(scaler, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Training the Model - 3rd Try
_________________________________________________________________________________

In [60]:
audio_features_model_on_3 = ['energy','tempo','valence']
features_df_3 = tracks_and_features_df[audio_features_model_on_3]

In [61]:
tracks_and_features_df.to_csv('tracks_and_features_3.csv', index=False)
features_df_3.to_csv('features.csv', index=False)

In [62]:
audio_features_model_on_3 = ['energy','tempo','valence']
features_df_3 = features_df_3[audio_features_model_on_3]

In [63]:
from sklearn.preprocessing import MinMaxScaler

# n_init set explicitly to silence the default-change warning
kmeans100 = KMeans(n_clusters=100, n_init=10, random_state=42)

scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(features_df_3)
kmeans100.fit(scaled_features)


In [64]:
tracks_clustered_df_3 = tracks_and_features_df.copy()
tracks_clustered_df_3['cluster_km100'] = kmeans100.labels_
tracks_clustered_df_3[tracks_clustered_df_3['cluster_km100']==4].sample(1)

Unnamed: 0,track_id,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,track_href,cluster_km100
2253,1NMYbsFZyxwHRNaLGzFGWh,0.531,0.53,5,-5.267,0,0.0405,0.133,0.0,0.0927,0.329,108.864,193776,5,https://api.spotify.com/v1/tracks/1NMYbsFZyxwH...,4


In [65]:
import pickle

tracks_clustered_df_3.to_csv('tracks_clustered_df_3.csv', index=False)

with open('model3_km100.pickle', 'wb') as handle:
    pickle.dump(kmeans100, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('scaler3.pickle', 'wb') as handle:
    pickle.dump(scaler, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Training the Model - 4th Try
_________________________________________________________________________________

In [66]:
audio_features_model_on_4 = ['energy','loudness','valence']
features_df_4 = tracks_and_features_df[audio_features_model_on_4]

In [67]:
tracks_and_features_df.to_csv('tracks_and_features_4.csv', index=False)
features_df_4.to_csv('features.csv', index=False)

In [68]:
audio_features_model_on_4 = ['energy','loudness','valence']
features_df_4 = features_df_4[audio_features_model_on_4]

In [69]:
from sklearn.preprocessing import MinMaxScaler

# n_init set explicitly to silence the default-change warning
kmeans100 = KMeans(n_clusters=100, n_init=10, random_state=42)

scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(features_df_4)
kmeans100.fit(scaled_features)


In [70]:
tracks_clustered_df_4 = tracks_and_features_df.copy()
tracks_clustered_df_4['cluster_km100'] = kmeans100.labels_
tracks_clustered_df_4[tracks_clustered_df_4['cluster_km100']==4].sample(1)

Unnamed: 0,track_id,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,track_href,cluster_km100
34733,399Hm1oqfzzQaZRu3CtVMb,0.492,0.743,4,-5.807,1,0.135,0.461,0.0,0.0797,0.774,84.929,155667,4,https://api.spotify.com/v1/tracks/399Hm1oqfzzQ...,4


In [71]:
import pickle

tracks_clustered_df_4.to_csv('tracks_clustered_df_4.csv', index=False)

with open('model4_km100.pickle', 'wb') as handle:
    pickle.dump(kmeans100, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('scaler4.pickle', 'wb') as handle:
    pickle.dump(scaler, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Training the Model - 5th Try
_________________________________________________________________________________

In [72]:
audio_features_model_on_5 = ['danceability','instrumentalness','valence']
features_df_5 = tracks_and_features_df[audio_features_model_on_5]

In [73]:
tracks_and_features_df.to_csv('tracks_and_features_5.csv', index=False)
features_df_5.to_csv('features.csv', index=False)

In [74]:
audio_features_model_on_5 = ['danceability','instrumentalness','valence']
features_df_5 = features_df_5[audio_features_model_on_5]

In [75]:
from sklearn.preprocessing import MinMaxScaler

# n_init set explicitly to silence the default-change warning
kmeans100 = KMeans(n_clusters=100, n_init=10, random_state=42)

scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(features_df_5)
kmeans100.fit(scaled_features)


In [76]:
tracks_clustered_df_5 = tracks_and_features_df.copy()
tracks_clustered_df_5['cluster_km100'] = kmeans100.labels_
tracks_clustered_df_5[tracks_clustered_df_5['cluster_km100']==4].sample(1)

Unnamed: 0,track_id,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,track_href,cluster_km100
22207,4c66tmKsdmvNENPo54oDny,0.44,0.986,7,-3.117,0,0.176,0.000175,0.0455,0.101,0.0979,89.992,222741,4,https://api.spotify.com/v1/tracks/4c66tmKsdmvN...,4


In [77]:
import pickle

tracks_clustered_df_5.to_csv('tracks_clustered_df_5.csv', index=False)

with open('model5_km100.pickle', 'wb') as handle:
    pickle.dump(kmeans100, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('scaler5.pickle', 'wb') as handle:
    pickle.dump(scaler, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Training the Model - 6th Try
_________________________________________________________________________________

In [93]:
audio_features_model_on_6 = ['danceability','speechiness','acousticness']
features_df_6 = tracks_and_features_df[audio_features_model_on_6]

In [94]:
tracks_and_features_df.to_csv('tracks_and_features_6.csv', index=False)
features_df_6.to_csv('features.csv', index=False)

In [95]:
audio_features_model_on_6 = ['danceability','speechiness','acousticness']
features_df_6 = features_df_6[audio_features_model_on_6]

In [99]:
from sklearn.preprocessing import MinMaxScaler

# Note: this try uses 600 clusters, although the variable, column and file names still say "km100"
# n_init set explicitly to silence the default-change warning
kmeans100 = KMeans(n_clusters=600, n_init=10, random_state=42)

scaler = MinMaxScaler()
scaled_features = scaler.fit_transform(features_df_6)  # TODO: convert back to a DataFrame and add feature weights
kmeans100.fit(scaled_features)


In [100]:
tracks_clustered_df_6 = tracks_and_features_df.copy()
tracks_clustered_df_6['cluster_km100'] = kmeans100.labels_
tracks_clustered_df_6[tracks_clustered_df_6['cluster_km100']==4].sample(1)

Unnamed: 0,track_id,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,duration_ms,time_signature,track_href,cluster_km100
32324,1MBgaJoraHR7RHgAV58QU4,0.674,0.658,5,-9.647,0,0.104,0.404,1e-06,0.0981,0.677,143.292,329387,4,https://api.spotify.com/v1/tracks/1MBgaJoraHR7...,4


In [101]:
import pickle

tracks_clustered_df_6.to_csv('tracks_clustered_df_6.csv', index=False)

with open('model6_km100.pickle', 'wb') as handle:
    pickle.dump(kmeans100, handle, protocol=pickle.HIGHEST_PROTOCOL)

with open('scaler6.pickle', 'wb') as handle:
    pickle.dump(scaler, handle, protocol=pickle.HIGHEST_PROTOCOL)
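
The 6th try leaves a note about adding feature weights, which is not implemented in the notebook. One way to do it, sketched below under that assumption, is to multiply each scaled column by a weight before fitting k-means; `apply_feature_weights` and the example weights are hypothetical.

```python
import numpy as np


def apply_feature_weights(scaled_features, weights):
    """Multiply each scaled column by its weight before clustering.

    A larger weight stretches that feature's axis, so differences along
    it count more in the Euclidean distances k-means relies on.
    """
    scaled = np.asarray(scaled_features, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if scaled.ndim != 2 or scaled.shape[1] != weights.shape[0]:
        raise ValueError("need exactly one weight per feature column")
    return scaled * weights  # broadcasts the weights across all rows


# Example: make the first of two features count twice as much.
weighted = apply_feature_weights([[1.0, 0.5], [0.0, 1.0]], [2.0, 1.0])
```

The same weights would have to be applied to a new song's scaled features before calling `predict`, otherwise the song is compared in a different space than the one the clusters were built in.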

______________________________________________________________________

# Program Flow

- The program takes an input from the user: they are prompted to give a song name.
- The program passes the user input to Spotify, which returns a list of matching songs, and the user is asked to confirm which one is the correct song.
- Then, using the track ID of that song, the program requests the song's audio features, selects the features the model was trained on, scales them, determines which k-means cluster the song belongs to, and randomly recommends another song from that cluster.
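
The last step of this flow can be sketched as follows. This is a minimal illustration on toy data, assuming a fitted scaler and k-means model like the ones pickled above; `recommend_from_cluster` and the toy IDs are hypothetical, and the Spotify search and audio-feature requests are left out.

```python
import random

import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler


def recommend_from_cluster(song_features, scaler, model, track_ids, labels,
                           exclude_id=None, rng=random):
    """Return a random track ID from the cluster the input song falls into.

    song_features must hold the same features, in the same order, that
    the scaler and the model were fitted on.
    """
    row = np.asarray(song_features, dtype=float).reshape(1, -1)
    cluster = model.predict(scaler.transform(row))[0]
    candidates = [t for t, label in zip(track_ids, labels)
                  if label == cluster and t != exclude_id]
    return rng.choice(candidates) if candidates else None


# Toy data: six "songs" with two features, forming two obvious groups.
toy_features = np.array([[0.10, 0.10], [0.15, 0.20], [0.20, 0.10],
                         [0.90, 0.80], [0.80, 0.90], [0.85, 0.95]])
toy_ids = ["a", "b", "c", "x", "y", "z"]

toy_scaler = MinMaxScaler().fit(toy_features)
toy_model = KMeans(n_clusters=2, n_init=10, random_state=42)
toy_model.fit(toy_scaler.transform(toy_features))

pick = recommend_from_cluster([0.12, 0.15], toy_scaler, toy_model,
                              toy_ids, toy_model.labels_, exclude_id="a")
```

If the scaler was fitted on a DataFrame, scikit-learn may warn about missing feature names when it is given a plain array; passing a one-row DataFrame with the same columns avoids that.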

# Presentation
- How many extra songs did you add, and how?
- How many clusters?
- Which features did you use to train the model?
- Demo
- Streamlit (everything becomes .py file)

# Discarded Code

In [None]:
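#Gathering audio features for all songs in master_track_id

def gather_audio_features(master_track_id, batch_size=100):
    audio_features = []
    # Request the features in batches instead of passing the whole list once per track
    for start in range(0, len(master_track_id), batch_size):
        features = sp.audio_features(master_track_id[start:start + batch_size])
        # Skip tracks for which Spotify returned no features (None)
        audio_features.extend(f for f in features if f)
        sleep(3)  # same delay as the main loop
    return audio_features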

# Gather audio features for all songs in track_ids
audio_features = gather_audio_features(master_track_id)

# Create a DataFrame from audio features
audio_features_df = pd.DataFrame(audio_features)

# Select relevant columns and set 'id' as index
audio_features_df = audio_features_df[['id', 'danceability', 'energy', 'key', 'loudness', 'mode', 'speechiness', 'acousticness', 'instrumentalness', 'liveness', 'valence', 'tempo', 'duration_ms', 'time_signature']]
audio_features_df.set_index('id', inplace=True)

# Optionally rename index to 'song_id'
# df.index.name = 'song_id'

display(audio_features_df)