# Collecting the Data: Museum Analytics Playlist - Thesis Diana Marisol Rivera

 - About the data:

   Playlist with 5,987 songs collected from 100 playlists that Spotify users curated for listening in art museums.

   Playlist link: https://open.spotify.com/playlist/5y8FYZfbyVhfEi8k9bi0ka

 - Method: Spotipy Library

   From the [official Spotipy docs](https://spotipy.readthedocs.io/en/latest/): 
>"Spotipy is a lightweight Python library for the Spotify Web API. With Spotipy you get full access to all of the music data provided by the Spotify platform."


# 1. Setting Up

- Importing the necessary libraries

In [1]:
import spotipy
import spotipy.util as util
import pandas as pd
from spotipy.oauth2 import SpotifyClientCredentials
import matplotlib.pyplot as plt
import time

- Defining the client credentials I obtained by creating an app on the Spotify for Developers platform. Because this document is shared publicly, I replaced the last two characters of my client secret for security.

In [2]:
client_id='540cea50990e42e18631c5c20049b94b'
client_secret='d00753237465402997a44e0e320a51XX'
username = "marisolriveraroa"
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=client_secret) 
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)
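
Because this notebook is shared publicly, an alternative to editing the secret by hand is to keep both values out of the file entirely. The following is a minimal sketch, assuming the credentials are stored in the environment variables `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` (the names Spotipy itself looks for when no arguments are passed). The helper `load_spotify_credentials` is hypothetical and not part of Spotipy.

```python
import os


def load_spotify_credentials(env=None):
    """Return (client_id, client_secret) read from environment variables.

    Hypothetical helper, not part of Spotipy. It raises a clear error when a
    variable is missing instead of failing later during an API call.
    """
    env = os.environ if env is None else env
    names = ("SPOTIPY_CLIENT_ID", "SPOTIPY_CLIENT_SECRET")
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["SPOTIPY_CLIENT_ID"], env["SPOTIPY_CLIENT_SECRET"]
```

The returned pair could then be passed to `SpotifyClientCredentials(client_id=..., client_secret=...)` exactly as in the cell above.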

- Establishing the scope (more info: [Authorization Scopes](https://developer.spotify.com/documentation/general/guides/authorization/scopes/)) and getting the token with the Spotipy function `prompt_for_user_token`

In [3]:
scope = 'playlist-read-private'
# getting the token
util.prompt_for_user_token(username,
                           scope,
                           client_id=client_id,
                           client_secret=client_secret,
                           redirect_uri='http://localhost/')

'BQDFGLZjQsdr63dDYKqcQdysIwVyFQINFcLXl9BgVWLIJxKmMu0ZwOG7q5zqNw_bNX4JtGlbDm_nHq8vjQ1zHyze2Jab6kIGNiP1A0v7f4H06-67FMDcNhBvt_EuckjtQt86LzwXyufz1Uj61cmsmpEfX5IJVkx0jzSOGP7gbvZ1Jg77f3HizJRvyVCcoZ2Tgqk'

# 2. Getting the playlist metadata

- Endpoints: Spotify offers a number of [API endpoints](https://beta.developer.spotify.com/documentation/web-api/reference/) to request specific data. In this step, I used the following:

  To get the playlist's tracks: [user_playlist_tracks](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-playlists-tracks)

  To get the corresponding audio features: [audio_features](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-several-audio-features)
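
The audio-features endpoint linked above is a "several tracks" endpoint: it accepts a list of track IDs, up to 100 per request according to Spotify's reference. The sketch below shows one way to split a long list of IDs into batches of that size. The helper `chunked` and the placeholder IDs are hypothetical and only illustrate the idea.

```python
def chunked(ids, size=100):
    """Yield successive lists of at most `size` IDs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


track_ids = [f"id-{n}" for n in range(230)]  # placeholder IDs, not real tracks
batches = list(chunked(track_ids))
print([len(batch) for batch in batches])  # [100, 100, 30]

# With a Spotipy client `sp` as created in the setup section, each batch
# could then be sent in a single call, e.g. sp.audio_features(batch).
```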

The Spotify API endpoint behind [user_playlist_tracks](https://developer.spotify.com/documentation/web-api/reference/#/operations/get-playlists-tracks) returns at most 100 results per query. To get the metadata for all 5,987 songs, I used the function written by the Stack Overflow member [sfnxboy](https://stackoverflow.com/users/14335908/sfnxboy).

From [sfnxboy's post](https://stackoverflow.com/questions/39086287/spotipy-how-to-read-more-than-100-tracks-from-a-playlist?newreg=56ce12fb262e4988971133371bc661d4):

> "I wrote a function that can output Panda's DataFrame where it pulls all the metadata (not all of it because I didn't want to, but you can make some space for that) for playlists over 100 songs. I do it by iterating over every song, finding the metadata for each, saving the metadata to a dictionary, and then concatenating the dictionary to the DataFrame. It takes your username and the Playlist ID as input". 

I made some modifications to get the specific data I needed for this project. The metadata selection is explained in the main thesis document.
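
The way the 100-result limit is worked around can be shown without calling the API. The sketch below follows each page's `next` value until it is empty and accumulates the items, which is the same loop shape the function in this section uses with `sp.next`. The names `collect_all_items` and `fake_page` are hypothetical. For simplicity, `next` is an integer offset here, whereas the real API returns a URL.

```python
def collect_all_items(first_page, fetch_next):
    """Follow 'next' values until the last page and return every item.

    first_page: dict with an 'items' list and a 'next' value (None on the last page).
    fetch_next: callable that takes a page and returns the following page.
    """
    items = list(first_page["items"])
    page = first_page
    while page["next"]:
        page = fetch_next(page)
        items.extend(page["items"])
    return items


# Stand-in for the API: 250 fake tracks served 100 at a time.
FAKE_TRACKS = [f"track-{n}" for n in range(250)]


def fake_page(offset, limit=100):
    """Return one page of fake tracks plus the offset of the next page."""
    end = offset + limit
    return {
        "items": FAKE_TRACKS[offset:end],
        "next": end if end < len(FAKE_TRACKS) else None,
    }


all_tracks = collect_all_items(fake_page(0), lambda page: fake_page(page["next"]))
print(len(all_tracks))  # 250
```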

In [4]:
# Defining the playlist ID and the username
username = "marisolriveraroa"
playlist_id = "5y8FYZfbyVhfEi8k9bi0ka"

In [5]:
# Function to extract MetaData from a playlist with more than 100 songs. Written by: sfnxboy Edited: Diana Rivera

def get_playlist_tracks_more_than_100_songs(username, playlist_id):
    results = sp.user_playlist_tracks(username,playlist_id)
    tracks = results['items']
    while results['next']:
        results = sp.next(results)
        tracks.extend(results['items'])
    results = tracks    

    playlist_tracks_id = []
    playlist_tracks_titles = []
    playist_tracks_uri = []
    playlist_tracks_artists_uri = []
    playlist_tracks_first_artists = []
    playlist_tracks_first_release_date = []
    playlist_tracks_popularity = []

    for i in range(len(results)):
        if i == 0:
            playlist_tracks_id = results[i]['track']['id']
            playlist_tracks_titles = results[i]['track']['name']
            playist_tracks_uri = results[i]['track']['uri']
            playlist_tracks_first_release_date = results[i]['track']['album']['release_date']
            playlist_tracks_popularity = results[i]['track']['popularity']        
            playlist_tracks_artists_uri= results[i]["track"]["artists"][0]["uri"]
            
            # Main artist (looked up here, but 'artist' and 'artists_genres'
            # are not added to this first row)
            artist_info = sp.artist(playlist_tracks_artists_uri)
            artist_df = pd.DataFrame(columns=artist_info)
            # Track features
            features = sp.audio_features(playlist_tracks_id)
            features_df = pd.DataFrame(data=features, columns=features[0].keys())
            features_df['title'] = playlist_tracks_titles
            features_df['uri'] = playist_tracks_uri
            features_df['artists_uri'] = playlist_tracks_artists_uri 
            features_df['popularity'] = playlist_tracks_popularity
            features_df['release_date'] = playlist_tracks_first_release_date
            features_df = features_df[['id', 'title', 'uri', 'artists_uri','popularity', 'release_date',
                                       'danceability', 'energy', 'key', 'loudness',
                                       'mode', 'acousticness', 'instrumentalness',
                                       'liveness', 'valence','speechiness','tempo',
                                       'duration_ms', 'time_signature']]
            continue
        else:
            try:
                playlist_tracks_id = results[i]['track']['id']
                playlist_tracks_titles = results[i]['track']['name']
                playist_tracks_uri = results[i]['track']['uri']
                playlist_tracks_first_release_date = results[i]['track']['album']['release_date']
                playlist_tracks_popularity = results[i]['track']['popularity']
                playlist_tracks_artists_uri= results[i]["track"]["artists"][0]["uri"]
                artist_info = sp.artist(playlist_tracks_artists_uri)
                playlist_tracks_artists = results[i]["track"]["artists"][0]["name"]
                playlist_tracks_artists_genres = artist_info["genres"]
                features = sp.audio_features(playlist_tracks_id)
                new_row = {
                    'id': [playlist_tracks_id],
                    'title': [playlist_tracks_titles],
                    'uri': [playist_tracks_uri],
                    'artist': results[i]["track"]["artists"][0]["name"],
                    'artists_uri': [playlist_tracks_artists_uri],
                    'artists_genres': artist_info["genres"],
                    'popularity': [playlist_tracks_popularity],
                    'release_date': [playlist_tracks_first_release_date],
                    'danceability': [features[0]['danceability']],
                    'energy': [features[0]['energy']],
                    'key': [features[0]['key']],
                    'loudness': [features[0]['loudness']],
                    'mode': [features[0]['mode']],
                    'acousticness': [features[0]['acousticness']],
                    'instrumentalness': [features[0]['instrumentalness']],
                    'liveness': [features[0]['liveness']],
                    'valence': [features[0]['valence']],
                    'speechiness': [features[0]['speechiness']],
                    'tempo': [features[0]['tempo']],
                    'duration_ms': [features[0]['duration_ms']],
                    'time_signature': [features[0]['time_signature']],
                }

                dfs = [features_df, pd.DataFrame(new_row)]
                features_df = pd.concat(dfs, ignore_index = True)
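            except Exception:
                # Skip any track whose row could not be built, e.g. a missing
                # track ID or audio features, or an artist without exactly one
                # genre (pandas then rejects the unequal column lengths).
                continue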
                
    return features_df
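
The per-track step inside the function (read the track fields, look up the main artist, read the audio features, assemble one row) can be isolated into a small helper, which makes it easier to check without calling the API. This is a minimal sketch, assuming the dictionary shapes the function above already relies on. `build_row` and the sample values are hypothetical. One deliberate difference from the function above: the artist's genres are joined into a single string, so every column holds exactly one value no matter how many genres the artist has.

```python
FEATURE_KEYS = ["danceability", "energy", "key", "loudness", "mode",
                "acousticness", "instrumentalness", "liveness", "valence",
                "speechiness", "tempo", "duration_ms", "time_signature"]


def build_row(item, artist_info, features):
    """Flatten one playlist item into a single dict (one future DataFrame row)."""
    track = item["track"]
    main_artist = track["artists"][0]
    row = {
        "id": track["id"],
        "title": track["name"],
        "uri": track["uri"],
        "artist": main_artist["name"],
        "artists_uri": main_artist["uri"],
        "artists_genres": ", ".join(artist_info["genres"]),
        "popularity": track["popularity"],
        "release_date": track["album"]["release_date"],
    }
    row.update({key: features[key] for key in FEATURE_KEYS})
    return row


# Sample values are made up for illustration only.
sample_item = {"track": {
    "id": "abc123",
    "name": "Example Song",
    "uri": "spotify:track:abc123",
    "popularity": 50,
    "album": {"release_date": "2020-01-01"},
    "artists": [{"name": "Example Artist", "uri": "spotify:artist:xyz789"}],
}}
sample_artist = {"genres": ["genre one", "genre two"]}
sample_features = {key: 0 for key in FEATURE_KEYS}

row = build_row(sample_item, sample_artist, sample_features)
print(row["artists_genres"])  # genre one, genre two
```

A list of such dictionaries can be passed to `pd.DataFrame(rows)` once at the end, instead of concatenating a one-row DataFrame per track.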
    

In [8]:
# Running the function

playlist_df = get_playlist_tracks_more_than_100_songs(username, playlist_id)



In [10]:
# Checking the first rows of the dataset.

playlist_df.head()

| | id | title | uri | artists_uri | popularity | release_date | danceability | energy | key | loudness | ... | acousticness | instrumentalness | liveness | valence | speechiness | tempo | duration_ms | time_signature | artist | artists_genres |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0fBSs3fRoh1yJcne77fdu9 | Video Games | spotify:track:0fBSs3fRoh1yJcne77fdu9 | spotify:artist:00FQb4jTyendYWaN8pK0wa | 74 | 2012-01-01 | 0.236 | 0.249 | 6 | -9.595 | ... | 0.811 | 1e-06 | 0.087 | 0.181 | 0.0348 | 72.847 | 281960 | 5 | | |
| 1 | 1cyZIM22N8kmBqdATPBmI7 | Nocturnal Waltz | spotify:track:1cyZIM22N8kmBqdATPBmI7 | spotify:artist:1yLIaxyVkZnLMXhfRSYEjV | 51 | 2016-09-02 | 0.185 | 0.112 | 1 | -18.943 | ... | 0.99 | 0.915 | 0.107 | 0.0697 | 0.0366 | 166.449 | 128767 | 3 | Johannes Bornlöf | focus |
| 2 | 5Zf25eS8E1znm9mez4cGsm | Reflections | spotify:track:5Zf25eS8E1znm9mez4cGsm | spotify:artist:08tfDO4dSrwxax35a3HIMC | 67 | 1986-11-28 | 0.224 | 0.146 | 9 | -16.331 | ... | 0.961 | 0.712 | 0.113 | 0.142 | 0.0338 | 94.255 | 130693 | 3 | Toshifumi Hinata | japanese soundtrack |
| 3 | 0lx2cLdOt3piJbcaXIV74f | willow | spotify:track:0lx2cLdOt3piJbcaXIV74f | spotify:artist:06HL4z0CvFAxyc27GXpf02 | 78 | 2020-12-11 | 0.392 | 0.574 | 7 | -9.195 | ... | 0.833 | 0.00179 | 0.145 | 0.529 | 0.17 | 81.112 | 214707 | 4 | Taylor Swift | pop |
| 4 | 6VzcQuzTNTMFnJ6rBSaLH9 | Fine Line | spotify:track:6VzcQuzTNTMFnJ6rBSaLH9 | spotify:artist:6KImCVD70vtIoJWnq6nGn3 | 79 | 2019-12-13 | 0.306 | 0.347 | 2 | -8.5 | ... | 0.172 | 0.00013 | 0.0485 | 0.0511 | 0.0334 | 120.996 | 377960 | 4 | Harry Styles | pop |


In [None]:
# Exporting the data to a CSV file

playlist_df.to_csv("Dataset.csv", index = False)
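
After exporting, it can be worth confirming that the file reads back as expected, for instance by counting rows and empty cells per column (pandas writes missing values as empty fields by default). The sketch below uses only the standard library. `summarize_csv` is a hypothetical helper, and the demo writes a tiny stand-in file rather than reading `Dataset.csv`.

```python
import csv
import os
import tempfile


def summarize_csv(path):
    """Return (row_count, {column: number_of_empty_cells}) for a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        empty = {name: 0 for name in reader.fieldnames}
        rows = 0
        for record in reader:
            rows += 1
            for name in reader.fieldnames:
                if not record[name]:
                    empty[name] += 1
    return rows, empty


# Demo on a small made-up file with one missing artist.
with tempfile.TemporaryDirectory() as folder:
    demo_path = os.path.join(folder, "demo.csv")
    with open(demo_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["id", "title", "artist"])
        writer.writerow(["a1", "Song A", ""])
        writer.writerow(["b2", "Song B", "Artist B"])
    row_count, empty_cells = summarize_csv(demo_path)

print(row_count, empty_cells)  # 2 {'id': 0, 'title': 0, 'artist': 1}
```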