# Spotify Song Data Scraping

### References:
Based on the following tutorials: <br />
Max Hilsdorf, "How to Create Large Music Datasets Using Spotipy", <i>Towards Data Science</i>, 25 April 2020: <br />
https://towardsdatascience.com/how-to-create-large-music-datasets-using-spotipy-40e7242cc6a6 <br />
Max Tingle, "Getting Started with Spotify’s API & Spotipy", <i>Towards Data Science</i>, 3 Oct 2019: <br />
https://medium.com/@maxtingle/getting-started-with-spotifys-api-spotipy-197c3dc6353b <br />
Sandra Radgowska, "How to use Spotify API and what data science opportunities can it open up?", <i>My Journey As A Data Scientist</i>, 18 August 2021:<br />
https://datascientistdiary.com/index.php/2021/03/04/how-to-use-spotify-api-and-what-data-science-opportunities-can-it-open-up/

## Setup

### Import packages

In [9]:
import json
import time
import pandas as pd
import creds
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

### Spotify Credentials

#### Load credentials

Load the `creds.py` file, which defines the two variables `client_id` and `secret` as shown below. The file is gitignored so the notebook can be shared without exposing the credentials.

client_id = 'Your Client ID Here'<br />
secret = 'Your secret here'

In [10]:
%run -i 'creds.py'
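
The `%run` magic only works inside IPython/Jupyter. As a minimal sketch of one alternative for a plain Python script, assuming the same `creds.py` layout described above, the file can be executed with the standard library's `runpy` and the two variables read from the namespace it returns. The helper name `load_creds` is hypothetical and is not part of the original notebook.

```python
import runpy


def load_creds(path="creds.py"):
    """Execute a credentials file and return (client_id, secret).

    Hypothetical helper: assumes the file defines the variables
    `client_id` and `secret`, as described in this notebook.
    """
    # run_path executes the file and returns its global names as a dict
    namespace = runpy.run_path(path)
    missing = [name for name in ("client_id", "secret") if name not in namespace]
    if missing:
        raise KeyError(f"{path} does not define: {', '.join(missing)}")
    return namespace["client_id"], namespace["secret"]
```

The returned pair could then be passed to `SpotifyClientCredentials` in place of the variables loaded by `%run`.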

#### Set credentials

In [11]:
client_credentials_manager = SpotifyClientCredentials(client_id=client_id, client_secret=secret)
sp = spotipy.Spotify(client_credentials_manager=client_credentials_manager)

### Functions for data extraction

#### Create track container list

* Note: because the `tracks` variable is not created in the same cell as the extraction loop, re-running that loop appends to the same list rather than starting a new one.

In [12]:
tracks = []

Function to extract all the track ids from your playlist:

In [13]:
def get_track_ids(playlist_id):
    music_id_list = []
    playlist = sp.playlist(playlist_id)
    for item in playlist['tracks']['items']:
        music_track = item['track']
        music_id_list.append(music_track['id'])
    return music_id_list 
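
`sp.playlist` returns only the first page of a playlist's tracks, so a playlist with more tracks than the page limit would be truncated by the function above. The following is a sketch of a paginated variant, assuming spotipy's `playlist_items` and `next` methods. The function name `get_all_track_ids` and the `client` parameter are hypothetical. Skipping entries that have no track or no id (for example local files) is also an assumption about the data, not something the original notebook does.

```python
def get_all_track_ids(client, playlist_id):
    """Collect track ids from every page of a playlist.

    `client` is assumed to be a spotipy.Spotify instance (e.g. `sp`).
    """
    track_ids = []
    page = client.playlist_items(playlist_id)
    while page:
        for item in page["items"]:
            track = item.get("track")
            # Skip entries without a usable track id (assumption: such
            # entries can occur, e.g. for local or removed tracks)
            if track and track.get("id"):
                track_ids.append(track["id"])
        # Follow the "next" link until there are no more pages
        page = client.next(page) if page.get("next") else None
    return track_ids
```

With the client created earlier, this would be called as `get_all_track_ids(sp, playlist_id)`.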

Function to extract all the details of each track by passing its ID:

In [14]:
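def get_track_data(track_id):
    """Return selected metadata for a single track as a dictionary."""
    meta = sp.track(track_id)
    track_details = {'uri': meta['uri'],
                    'name': meta['name'],
                    'album': meta['album']['name'],
                    # First artist listed on the album (not necessarily the track's own artist)
                    'artist': meta['album']['artists'][0]['name'],
                    'release_date': meta['album']['release_date'],
                    'explicit': meta['explicit'],
                    # Milliseconds -> decimal minutes, rounded to 2 places
                    'duration_in_mins': round((meta['duration_ms'] * 0.001) / 60.0, 2)}
    return track_details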
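
The `duration_in_mins` value is in decimal minutes, which is easy to misread as minutes and seconds. The short sketch below illustrates the difference with a made-up duration; the helper names `ms_to_minutes` and `ms_to_clock` are hypothetical and do not appear in the original notebook.

```python
def ms_to_minutes(duration_ms):
    """Convert milliseconds to decimal minutes, rounded to 2 places."""
    return round(duration_ms / 60000, 2)


def ms_to_clock(duration_ms):
    """Convert milliseconds to a 'minutes:seconds' string."""
    minutes, seconds = divmod(round(duration_ms / 1000), 60)
    return f"{minutes}:{seconds:02d}"


# A 215,000 ms track is 3.58 decimal minutes, i.e. 3 min 35 s
print(ms_to_minutes(215000))  # 3.58
print(ms_to_clock(215000))    # 3:35
```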

### Extract track data

In [15]:
# Get the ids for all the songs in your playlist
playlist_id = input('Enter the playlist id')
track_ids = get_track_ids(playlist_id)
print(len(track_ids))
print(track_ids)

# Loop over the track ids and collect each track's data
for track_id in track_ids:
    time.sleep(.5)  # brief pause between API requests
    tracks.append(get_track_data(track_id))

Enter the playlist id0qfagBJB5ou0r1kwQDZ8Op
21
['0bXpmJyHHYPk6QBFj25bYF', '0Xy9xPPs2zRRFqljGqKXel', '0RqRN88qFUEeOQZ7VklA14', '1QZJiYulh7ak7GpZ8OAdwI', '0wrBkSiN1Y1StFR1Q3ZC28', '4Q2UM2QSR7Gye03jvl4Rdw', '5urkJ8dcmxtvsnruNfx6ZS', '6ujlgkgbsrPskgWslEcZmR', '3pyTksNccLM1jRvzQ4zTke', '7AiIlhSSVKAyMJTygWuut2', '7pKfPomDEeI4TPT6EOYjn9', '0WWz2AaqxLoO0fa9ou6Fqc', '5qJbIYo8AitajCbnOGPIvI', '7IbMonU3CRITpQ0cRVThsV', '5yEPxDjbbzUzyauGtnmVEC', '2gANywSFYF58YFMPdDSAjC', '29zkoUsOE50f0I3n44LjjU', '6d3geXDfoj6hz882o9Ip9S', '2UKYMN7VnsQo40n0qCt6Sa', '6Prs4p7iVZxODcO62NIiA6', '7kCrYUDtWsPldohOKPTKPL']
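
The loop above makes one API request per track. As a sketch of an alternative, assuming spotipy's `tracks` method (which accepts a list of ids, with the Spotify Web API allowing up to 50 per request), the ids can be fetched in batches. The function name `get_tracks_in_batches` and the `client` parameter are hypothetical.

```python
def get_tracks_in_batches(client, track_ids, batch_size=50):
    """Fetch full track objects in batches instead of one request per id.

    `client` is assumed to be a spotipy.Spotify instance (e.g. `sp`).
    """
    results = []
    for start in range(0, len(track_ids), batch_size):
        batch = track_ids[start:start + batch_size]
        response = client.tracks(batch)
        # Entries can be None when an id is not found; skip those
        results.extend(t for t in response["tracks"] if t is not None)
    return results
```

Each returned object has the same shape as the result of `sp.track`, so the same fields used in `get_track_data` could be read from it.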


#### Create dataframe

In [None]:
df = pd.DataFrame(tracks)
df

Function to retrieve the track listing of a playlist:

In [16]:
def get_playlist_tracks(playlist_id):
    track_attributes = sp.playlist_tracks(playlist_id)
    return track_attributes

In [17]:
playlist_tracks_data = []
playlist_ids = ['0qfagBJB5ou0r1kwQDZ8Op']

# Loop over the playlist ids and collect each playlist's track listing
for pid in playlist_ids:
    time.sleep(.5)  # brief pause between API requests
    playlist_tracks_data.append(get_playlist_tracks(pid))

In [18]:
playlist_df = pd.DataFrame(playlist_tracks_data)
playlist_df

| | href | items | limit | next | offset | previous | total |
|---|---|---|---|---|---|---|---|
| 0 | https://api.spotify.com/v1/playlists/0qfagBJB5... | [{'added_at': '2015-12-04T17:25:30Z', 'added_b... | 100 | | 0 | | 21 |
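
In this dataframe each playlist is a single row, and the individual tracks remain nested inside the `items` column. A minimal sketch of flattening them into one row per track is given below. The function name `flatten_playlist_items` and the choice of output columns are hypothetical; the keys `added_at`, `track`, `id`, and `name` are assumed to be present in each item, as in the response above.

```python
def flatten_playlist_items(playlist_pages):
    """Turn a list of playlist-track responses into one dict per track."""
    rows = []
    for page in playlist_pages:
        for item in page.get("items", []):
            # "track" may be missing or None for some entries (assumption)
            track = item.get("track") or {}
            rows.append({
                "added_at": item.get("added_at"),
                "track_id": track.get("id"),
                "track_name": track.get("name"),
            })
    return rows
```

The result could be passed straight to pandas, for example `pd.DataFrame(flatten_playlist_items(playlist_tracks_data))`.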


#### Extract playlist data

In [None]:
# Get the ids for all the songs in your playlist
playlist_id = input('Enter the playlist id')
track_ids = get_track_ids(playlist_id)
print(len(track_ids))
print(track_ids)

# Loop over the track ids and collect each track's data
for track_id in track_ids:
    time.sleep(.5)  # brief pause between API requests
    tracks.append(get_track_data(track_id))