# Spotify Song Data Scraping

### References:
Based on the following tutorials: <br />
Max Hilsdorf, "How to Create Large Music Datasets Using Spotipy", <i>Towards Data Science</i>, 25 April 2020: <br />
https://towardsdatascience.com/how-to-create-large-music-datasets-using-spotipy-40e7242cc6a6 <br />
Max Tingle, "Getting Started with Spotify’s API & Spotipy", <i>Towards Data Science</i>, 3 Oct 2019: <br />
https://medium.com/@maxtingle/getting-started-with-spotifys-api-spotipy-197c3dc6353b <br />
Sandra Radgowska, "How to use Spotify API and what data science opportunities can it open up?", <i>My Journey As A Data Scientist</i>, 18 August 2021:<br />
https://datascientistdiary.com/index.php/2021/03/04/how-to-use-spotify-api-and-what-data-science-opportunities-can-it-open-up/

## Setup

### Import packages

In [9]:
import json
import time
import pandas as pd
import creds
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

### Spotify Credentials

#### Load credentials

The next cell loads `creds.py`, which defines the two variables `client_id` and `secret`. The file is git-ignored so the notebook can be shared without exposing the credentials. It contains the following two lines:

client_id = 'Your Client ID Here'<br />
secret = 'Your secret here'
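
As an alternative to a git-ignored file, the same two values could be read from environment variables. The sketch below is an assumption, not what this notebook does: `load_credentials` is a hypothetical helper, and the variable names `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` are the ones Spotipy itself looks for when no arguments are passed to `SpotifyClientCredentials`.

```python
import os


def load_credentials(env=None):
    """Return (client_id, secret) read from a mapping of environment variables.

    `env` defaults to os.environ; passing a plain dict makes the helper easy to test.
    """
    env = os.environ if env is None else env
    required = ("SPOTIPY_CLIENT_ID", "SPOTIPY_CLIENT_SECRET")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["SPOTIPY_CLIENT_ID"], env["SPOTIPY_CLIENT_SECRET"]


# Demonstration with a stand-in mapping (placeholder values, not real credentials)
demo_env = {
    "SPOTIPY_CLIENT_ID": "Your Client ID Here",
    "SPOTIPY_CLIENT_SECRET": "Your secret here",
}
demo_id, demo_secret = load_credentials(demo_env)
```

In the notebook itself the call would be `client_id, secret = load_credentials()`, which leaves the rest of the cells unchanged.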

In [10]:
%run -i 'creds.py'

#### Set credentials

In [11]:
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

## Functions for data extraction

### Get track data (uri, name, album, artist name, release date, explicit T/F, duration in mins)

#### Create track container list

* Note that the `tracks` variable is not created in the same cell as the function call, so subsequent calls will append to the same list.
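
One consequence of the note above is that re-running the extraction cell adds every track a second time. A minimal sketch of one way to guard against that, assuming each record carries a `uri` key as the track dictionaries in this notebook do (`append_unique` is a hypothetical helper, not part of the original code):

```python
def append_unique(container, record, key="uri"):
    """Append `record` unless an entry with the same `key` value already exists.

    Returns True when the record was added, False when it was skipped.
    """
    if any(existing[key] == record[key] for existing in container):
        return False
    container.append(record)
    return True


demo_tracks = []
row = {"uri": "spotify:track:0bXpmJyHHYPk6QBFj25bYF", "name": "Intro"}

append_unique(demo_tracks, row)
append_unique(demo_tracks, dict(row))  # simulates re-running the extraction cell
print(len(demo_tracks))  # prints 1
```

The linear scan is fine for playlists of this size. For much larger collections, keeping a set of already-seen URIs would avoid the repeated search.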

In [12]:
tracks = []

#### Function to extract all the track ids from your playlist:

In [13]:
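def get_track_ids(playlist_id):
    # Collect the id of every track on the first page of the playlist
    music_id_list = []
    playlist = sp.playlist(playlist_id)
    for item in playlist['tracks']['items']:
        music_track = item['track']
        # Entries for removed or unavailable tracks can be None, so skip them
        if music_track is not None:
            music_id_list.append(music_track['id'])
    return music_id_list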
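
Spotify returns playlist tracks in pages (the playlist output later in this notebook reports a `limit` of 100), so the function above only sees the first page of a long playlist. The sketch below shows one way to walk every page using Spotipy's `playlist_items` and `next` methods. `get_all_track_ids` and `FakeClient` are hypothetical names introduced here. The fake client stands in for `sp` so the example runs offline, and it uses a list index where the real API puts a URL in `next`.

```python
def get_all_track_ids(client, playlist_id):
    """Collect track ids from every page of a playlist."""
    track_ids = []
    page = client.playlist_items(playlist_id)
    while page:
        for item in page["items"]:
            track = item.get("track")
            if track and track.get("id"):  # skip missing or local tracks
                track_ids.append(track["id"])
        # client.next() fetches the following page; stop when there is none
        page = client.next(page) if page.get("next") else None
    return track_ids


class FakeClient:
    """Offline stand-in for spotipy.Spotify (assumption for illustration only)."""

    def __init__(self, pages):
        self._pages = pages

    def playlist_items(self, playlist_id):
        return self._pages[0]

    def next(self, page):
        return self._pages[page["next"]]


demo_pages = [
    {"items": [{"track": {"id": "0bXpmJyHHYPk6QBFj25bYF"}}, {"track": None}], "next": 1},
    {"items": [{"track": {"id": "0Xy9xPPs2zRRFqljGqKXel"}}], "next": None},
]
demo_ids = get_all_track_ids(FakeClient(demo_pages), "0qfagBJB5ou0r1kwQDZ8Op")
print(demo_ids)
```

With real credentials the call would be `get_all_track_ids(sp, playlist_id)`.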

#### Function to extract all the details of each track by passing its ID:

In [14]:
def get_track_data(track_id):
    meta = sp.track(track_id)
    track_details = {'uri': meta['uri'],
                    'name': meta['name'],
                    'album': meta['album']['name'],
                    # Use the track's own first artist (the album artist can differ on compilations)
                    'artist': meta['artists'][0]['name'],
                    'release_date': meta['album']['release_date'],
                    'explicit': meta['explicit'],
                    'duration_in_mins': round((meta['duration_ms'] * 0.001) / 60.0, 2)}
    return track_details
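
Calling `sp.track` once per id costs one request per track. Spotipy also exposes `sp.tracks`, which accepts up to 50 ids per call and returns a dictionary with a `tracks` list. The following is a sketch of batching under that assumption. `chunked`, `get_tracks_batched`, and `FakeTracksClient` are hypothetical names, and the fake client only exists so the example runs without network access.

```python
def chunked(seq, size):
    """Yield consecutive slices of `seq` with at most `size` elements."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


def get_tracks_batched(client, track_ids, batch_size=50):
    """Fetch track metadata in batches instead of one request per track."""
    metas = []
    for batch in chunked(track_ids, batch_size):
        metas.extend(client.tracks(batch)["tracks"])
    return metas


class FakeTracksClient:
    """Offline stand-in for spotipy.Spotify that records batch sizes."""

    def __init__(self):
        self.batch_sizes = []

    def tracks(self, ids):
        self.batch_sizes.append(len(ids))
        return {"tracks": [{"id": track_id} for track_id in ids]}


fake = FakeTracksClient()
demo_metas = get_tracks_batched(fake, [f"id{n}" for n in range(120)])
print(fake.batch_sizes)  # prints [50, 50, 20]
```

Each element of the returned list has the same shape as the `meta` dictionary used in `get_track_data`, so the field extraction there could be reused unchanged.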

####  Extract track data

Extract info of each track

For testing:  playlist_id = '0qfagBJB5ou0r1kwQDZ8Op'

In [15]:
# Get the ids for all the songs in your playlist
playlist_id = input('Enter the playlist id')
track_ids = get_track_ids(playlist_id)
print(len(track_ids))
print(track_ids)

# Loop over track ids and get their data points
for track_id in track_ids:
    time.sleep(.5)
    track = get_track_data(track_id)
    tracks.append(track)

Enter the playlist id0qfagBJB5ou0r1kwQDZ8Op
21
['0bXpmJyHHYPk6QBFj25bYF', '0Xy9xPPs2zRRFqljGqKXel', '0RqRN88qFUEeOQZ7VklA14', '1QZJiYulh7ak7GpZ8OAdwI', '0wrBkSiN1Y1StFR1Q3ZC28', '4Q2UM2QSR7Gye03jvl4Rdw', '5urkJ8dcmxtvsnruNfx6ZS', '6ujlgkgbsrPskgWslEcZmR', '3pyTksNccLM1jRvzQ4zTke', '7AiIlhSSVKAyMJTygWuut2', '7pKfPomDEeI4TPT6EOYjn9', '0WWz2AaqxLoO0fa9ou6Fqc', '5qJbIYo8AitajCbnOGPIvI', '7IbMonU3CRITpQ0cRVThsV', '5yEPxDjbbzUzyauGtnmVEC', '2gANywSFYF58YFMPdDSAjC', '29zkoUsOE50f0I3n44LjjU', '6d3geXDfoj6hz882o9Ip9S', '2UKYMN7VnsQo40n0qCt6Sa', '6Prs4p7iVZxODcO62NIiA6', '7kCrYUDtWsPldohOKPTKPL']


#### Create dataframe

In [22]:
df = pd.DataFrame(tracks)
df.head()

| | uri | name | album | artist | release_date | explicit | duration_in_mins |
|---|---|---|---|---|---|---|---|
| 0 | spotify:track:0bXpmJyHHYPk6QBFj25bYF | Intro | xx | The xx | 2009-08-16 | False | 2.13 |
| 1 | spotify:track:0Xy9xPPs2zRRFqljGqKXel | Pyro | Come Around Sundown | Kings of Leon | 2010-10-19 | False | 4.18 |
| 2 | spotify:track:0RqRN88qFUEeOQZ7VklA14 | Restless | Music Complete | New Order | 2015-09-25 | False | 5.47 |
| 3 | spotify:track:1QZJiYulh7ak7GpZ8OAdwI | Control | After the Disco | Broken Bells | 2014-01-13 | False | 3.69 |
| 4 | spotify:track:0wrBkSiN1Y1StFR1Q3ZC28 | All I'm Saying | La Petite Mort | James | 2014-06-02 | False | 4.97 |


### Get artist data (id, artist name, genre, popularity, followers)

#### Create artist container list

* Note that the `artists` variable is not created in the same cell as the function call, so subsequent calls will append to the same list.

In [26]:
artists = []

#### Function to extract all of the tracks' artist ids from your playlist:

In [59]:
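def get_artist_ids(playlist_id):
    # Collect the id of the first listed artist of every track in the playlist
    artist_id_list = []
    playlist = sp.playlist(playlist_id)
    for item in playlist['tracks']['items']:
        music_track = item['track']
        # Entries for removed or unavailable tracks can be None, so skip them
        if music_track is not None:
            artist_id_list.append(music_track['artists'][0]['id'])
    return artist_id_list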

#### Function to extract all the details of each artist by passing their ID:

In [61]:
def get_artist_data(artist_id):
    meta = sp.artist(artist_id)
    artist_details = {'artist id': meta['id'],
                    'artist name': meta['name'],
                    'genres': meta['genres'],
                    'popularity': meta['popularity'],
                    'followers': meta['followers']['total']
                    }
    return artist_details

####  Extract artist data

Extract artist data of each track

For testing:  playlist_id = '0qfagBJB5ou0r1kwQDZ8Op'

In [64]:
# Get the artist ids for all the songs in your playlist
playlist_id = input('Enter the playlist id')
artist_ids = get_artist_ids(playlist_id)
print(len(artist_ids))
print(artist_ids)

# Loop over artist ids and get their data points
for artist_id in artist_ids:
    time.sleep(.5)
    artist = get_artist_data(artist_id)
    artists.append(artist)

Enter the playlist id0qfagBJB5ou0r1kwQDZ8Op
21
['3iOvXCl6edW5Um0fXEBRXy', '2qk9voo8llSGYcZ6xrBzKx', '0yNLKJebCb8Aueb54LYya3', '6dgwEwnK0YtDfS9XhRwBTG', '0qLNsNKm8bQcMoRFkR8Hmh', '2DaxqgrOhkeH0fpeiQq2f4', '3Bf4u6r96pGx1eIbaGqfvf', '7sjttK1WcZeyLPn3IsQ62L', '1eClJfHLoDI4rZe5HxzBFv', '4STHEaNw4mPZ2tzheohgXB', '4x1nvY2FN8jxqAFA0DA02H', '4gzpq5DPGxSnKTe4SA8HAU', '0qLNsNKm8bQcMoRFkR8Hmh', '7sjttK1WcZeyLPn3IsQ62L', '2cGwlqi3k18jFpUyTrsR84', '2DaxqgrOhkeH0fpeiQq2f4', '0k17h0D3J5VfsdmQ1iZtE9', '4W48hZAnAHVOC2c8WH8pcq', '3OsRAKCvk37zwYcnzRf5XF', '63MQldklfxkjYDoUE4Tppz', '51Blml2LZPmy7TTiAg47vQ']
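
The list printed above holds 21 ids but only 18 distinct artists, because three artists appear on two tracks each. Fetching each artist once saves requests and avoids duplicate rows. A minimal sketch of order-preserving de-duplication, using a shortened sample of the ids above (`unique_in_order` is a hypothetical helper):

```python
def unique_in_order(ids):
    """Drop repeated ids while keeping the order of first appearance."""
    # dict keys are unique and remember insertion order
    return list(dict.fromkeys(ids))


sample_artist_ids = [
    "0qLNsNKm8bQcMoRFkR8Hmh",
    "2DaxqgrOhkeH0fpeiQq2f4",
    "7sjttK1WcZeyLPn3IsQ62L",
    "0qLNsNKm8bQcMoRFkR8Hmh",  # repeated artist
]
print(unique_in_order(sample_artist_ids))
```

Applied in the notebook, this would be `artist_ids = unique_in_order(get_artist_ids(playlist_id))` before the loop that calls `get_artist_data`.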


#### Create dataframe

In [67]:
artist_df = pd.DataFrame(artists)
artist_df.head()

| | artist id | artist name | genres | popularity | followers |
|---|---|---|---|---|---|
| 0 | 3iOvXCl6edW5Um0fXEBRXy | The xx | [downtempo, dream pop, indietronica] | 70 | 3788369 |
| 1 | 2qk9voo8llSGYcZ6xrBzKx | Kings of Leon | [modern rock, rock] | 78 | 4895360 |
| 2 | 0yNLKJebCb8Aueb54LYya3 | New Order | [art rock, dance rock, madchester, new romanti... | 69 | 1617097 |
| 3 | 6dgwEwnK0YtDfS9XhRwBTG | Broken Bells | [alternative dance, alternative rock, indie po... | 61 | 514191 |
| 4 | 0qLNsNKm8bQcMoRFkR8Hmh | James | [britpop, madchester, new wave, new wave pop, ... | 64 | 429477 |
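
Each `genres` cell in the output above is a Python list, so ordinary column operations treat it as a single value. One way to work with the individual genres, sketched here on the two rows whose genre lists are shown in full, is to tally them with a `Counter` (`count_genres` is a hypothetical helper, not part of the original notebook):

```python
from collections import Counter


def count_genres(artist_rows):
    """Count how many artists carry each genre label."""
    counts = Counter()
    for row in artist_rows:
        counts.update(row["genres"])
    return counts


sample_artists = [
    {"artist name": "The xx", "genres": ["downtempo", "dream pop", "indietronica"]},
    {"artist name": "Kings of Leon", "genres": ["modern rock", "rock"]},
]
genre_counts = count_genres(sample_artists)
print(genre_counts.most_common())
```

The same function accepts the notebook's `artists` list directly, since every entry has a `genres` key.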


### Get track's audio features directly from playlist (for concept only, still a WIP)

#### Function to fetch a playlist's track listing directly (audio features are not extracted yet)

In [16]:
def get_playlist_tracks(playlist_id):
    track_attributes = sp.playlist_tracks(playlist_id)
    return track_attributes

In [17]:
playlist_tracks_data = []
playlist_ids = ['0qfagBJB5ou0r1kwQDZ8Op']

# Loop over playlist ids and get their data points
for pid in playlist_ids:
    time.sleep(.5)
    playlist_track = get_playlist_tracks(pid)
    playlist_tracks_data.append(playlist_track)

In [18]:
playlist_df = pd.DataFrame(playlist_tracks_data)
playlist_df

| | href | items | limit | next | offset | previous | total |
|---|---|---|---|---|---|---|---|
| 0 | https://api.spotify.com/v1/playlists/0qfagBJB5... | [{'added_at': '2015-12-04T17:25:30Z', 'added_b... | 100 | | 0 | | 21 |
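
The result above has a single row because the whole paging object became one record, with all 21 tracks packed into the `items` cell. To get one row per track, the `items` list can be flattened first. The sketch below assumes each item has an `added_at` field and a nested `track` dictionary with `id` and `name`. `flatten_playlist_page` is a hypothetical helper and `sample_page` is an abbreviated stand-in for the real response.

```python
def flatten_playlist_page(page):
    """Turn one playlist paging object into a list of per-track rows."""
    rows = []
    for item in page["items"]:
        track = item.get("track")
        if not track:  # skip entries without track data
            continue
        rows.append({
            "added_at": item.get("added_at"),
            "track_id": track.get("id"),
            "track_name": track.get("name"),
        })
    return rows


sample_page = {
    "items": [
        {"added_at": "2015-12-04T17:25:30Z",
         "track": {"id": "0bXpmJyHHYPk6QBFj25bYF", "name": "Intro"}},
        {"added_at": "2015-12-04T17:25:30Z", "track": None},
    ],
    "limit": 100,
    "total": 21,
}
flat_rows = flatten_playlist_page(sample_page)
print(flat_rows)
```

Passing the flattened rows to `pd.DataFrame` then yields one row per track, and the `track_id` column is a natural starting point for the audio-feature lookup this section is working toward.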
