**Warner Music Group: Data Scientist, International Insights - Programming Exercise
Author: Jack Munday**

*Task: We would like you to connect programmatically to the public API of Spotify, get some interesting data and produce a little POC, a predictive analytics report or anything that you think worthwhile learning about a topic of music & audience of your choice. Feel free to use other data sources and any tools that you like.*

I have collected my data from Spotify's public API through the Python SpotiPy library, performed a series of exploratory analyses, and then built a series of models to predict a song's popularity. My analysis is structured as follows:

1.   Data Collection
2.   Exploratory Analysis
3.   Logistic Regression
4.   Random Forest Classifier

The full source code for this exercise can be found on my GitHub [here](https://github.com/1602077/experiments_in_spotipy).

Some of my other music-based projects can be found there too, which include:

* a [Selenium-based web-scraper](https://github.com/1602077/vinyl_pricechecker) to automate tracking the historical prices of modern records in my wishlist; and
* a reconstruction of my [Apple Music Replay statistics](https://github.com/1602077/apple_music_replay) in BigQuery.


# Data Collection

I have collected my data using the aforementioned SpotiPy library by building the `get_artist_data(artist_name, api_credentials)` function. Given credentials for a Spotify developer application, it searches for the specified artist name and, if a match is found, downloads every track from that artist's albums. The full docstring is included in the function definition below. I then iterated through a list of artist names obtained from my personal Apple Music library to download a body of data large enough to draw some insightful conclusions.

In [None]:
%pip install spotipy --quiet
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)
%cd /content/gdrive/MyDrive/spotify/scripts
import numpy as np
import pandas as pd
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy
# Custom library containing spotify credentials for authentication
import spotify_credentials as cred
import os
import glob
import time
import random

Mounted at /content/gdrive
/content/gdrive/MyDrive/spotify/scripts


In [None]:
def get_artist_data(artist_name, api_credentials):
    """
    Calls the Spotify API using Python's SpotiPy library to search for a
    specified artist_name within Spotify's dataset. If a match is found, that
    artist's / band's albums are collected into a nested dictionary along with
    each album's tracks and their audio features as classified by Spotify.
    More information on the meaning of each feature can be found at
    https://developer.spotify.com/documentation/web-api/reference/#category-tracks
    
    inputs:
    -------------------------------------------------------------------------- 
    artist_name:        Name of the artist whose catalogue is to be downloaded.
    api_credentials:    Credentials required to call the Spotify API, i.e. the
                        output of calling SpotifyClientCredentials().

    returns:
    --------------------------------------------------------------------------
    A dataframe for the matched artist containing the tracks from their
    Spotify albums, with the audio features described at the category-tracks
    URL above. Returns 0 if no matching artist is found.
    """
    
    sp = spotipy.Spotify(client_credentials_manager=api_credentials, retries=15)
    # Search for the artist name to find their URI (unique reference id). The
    # album URIs and corresponding album names are stored in separate lists below.
    search_result = sp.search(artist_name)
    artist_uri = search_result['tracks']['items'][0]['artists'][0]['uri']
    # Top artist name search results
    artist_name_search_result = search_result['tracks']['items'][0]['artists'][0]['name']

    # If the search result doesn't match the input artist name,
    # then look through the top 10 results, if still no match skip this artist.
    if artist_name != artist_name_search_result:
        try:
            top_10_results = [search_result['tracks']['items'][i]['artists'][0]['name'] 
                               for i in range(10)]
            # Get index position of matched artist name in list to use as an 
            # index-match in artist_name_search_result.
            index = top_10_results.index(artist_name)
            artist_uri = search_result['tracks']['items'][index]['artists'][0]['uri']
            artist_name_search_result = search_result['tracks']['items'][index]['artists'][0]['name']
        except (IndexError, ValueError):
            print(f"!! {artist_name} not found in Spotify dataset.")
            return 0

    sp_albums = sp.artist_albums(artist_uri, album_type='album')

    album_names = [sp_albums['items'][i]['name']
                   for i in range(len(sp_albums['items']))]
    album_uris = [sp_albums['items'][i]['uri']
                  for i in range(len(sp_albums['items']))]

    print(f">> Currently downloading {artist_name_search_result}'s data.")

    ############################################################################
    # GET TRACK NAMES, SEQUENCING & IDS FOR EACH ARTIST ALBUM
    ############################################################################
    spotify_albums = {}
    album_counter = 0
    track_keys = ['artist_name', 'album', 'track_number', 'id', 'name', 'uri']

    for album in album_uris:
        # Assign an empty list to each key value inside a nested dictionary.
        spotify_albums[album] = {key: [] for key in track_keys} 

        # Pull track data for each album track and append its info to nest dict.
        tracks = sp.album_tracks(album)

        for n in range(len(tracks['items'])):
            spotify_albums[album]['artist_name'].append(artist_name_search_result)
            spotify_albums[album]['album'].append(album_names[album_counter])
            spotify_albums[album]['track_number'].append(tracks['items'][n]['track_number'])
            spotify_albums[album]['id'].append(tracks['items'][n]['id'])
            spotify_albums[album]['name'].append(tracks['items'][n]['name'])
            spotify_albums[album]['uri'].append(tracks['items'][n]['uri'])

        album_counter += 1
    
    ############################################################################
    # GET AUDIO FEATURES FOR EACH ALBUM TRACK
    ############################################################################
    audio_feature_keys = ['acousticness', 'danceability', 'energy', 
                          'instrumentalness', 'liveness', 'loudness', 
                          'speechiness', 'tempo', 'valence', 'duration_ms', 
                          'release_date', 'popularity']
    for album in spotify_albums:
        # Assign audio feature keys empty list values in nested dictionary.
        for key in audio_feature_keys:
            spotify_albums[album][key] = []
        
        for track in spotify_albums[album]['uri']:
            # Get all audio features for the current track and append values
            # into appropriate key in dictionary.
            features = sp.audio_features(track)

            # Append data for all keys except duration, release date and
            # popularity (final three elements in audio_feature_keys), which
            # need to be obtained using sp.track().
            for key in audio_feature_keys[:-3]:
                spotify_albums[album][key].append(features[0][key])

            track_info = sp.track(track)

            spotify_albums[album]['duration_ms'].append(track_info['duration_ms'])
            spotify_albums[album]['release_date'].append(track_info['album']['release_date'])
            spotify_albums[album]['popularity'].append(track_info['popularity'])

    ############################################################################
    # REORGANISE DATA INTO AN UNNESTED DICTIONARY TO ALLOW FOR DF CONVERSION
    ############################################################################
    all_albums_data_keys = track_keys + audio_feature_keys
    all_albums_data = {key: [] for key in all_albums_data_keys}

    for album in spotify_albums:
        for feature in spotify_albums[album]:
            all_albums_data[feature].extend(spotify_albums[album][feature])

    df = pd.DataFrame.from_dict(all_albums_data)
    return df
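
The function above makes one `sp.audio_features()` call and one `sp.track()` call per track, which accounts for most of the requests in a download. SpotiPy's `audio_features` and `tracks` methods also accept lists of IDs (up to 100 and 50 per call respectively), so batching is one possible way to cut the request count. The sketch below is not part of the original script: `chunked` and `fetch_track_details` are hypothetical helpers, and `sp` is assumed to be an authenticated `spotipy.Spotify` client.

```python
def chunked(items, size):
    """Yield successive lists of at most `size` elements from `items`."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_track_details(sp, track_uris):
    """Hypothetical batched alternative to the per-track API calls.

    `sp` is assumed to be an authenticated spotipy.Spotify client.
    Returns two lists in the same order as `track_uris`: the audio features
    and the full track objects (duration, album release date, popularity).
    """
    features, info = [], []
    # audio_features accepts at most 100 track ids per request.
    for batch in chunked(track_uris, 100):
        features.extend(sp.audio_features(batch))
    # tracks accepts at most 50 track ids per request.
    for batch in chunked(track_uris, 50):
        info.extend(sp.tracks(batch)['tracks'])
    return features, info
```

With this approach an album of 12 tracks would need two requests rather than 24. Entries in `features` can be `None` for tracks without audio analysis, so a caller would need to handle that case before indexing into them.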

## Data Download Pipeline

The `main` function below runs a data processing pipeline that calls `get_artist_data()` for each artist name to collect that artist's data from Spotify's API via the SpotiPy Python library.

### **Importing credentials**
To authenticate this process, I have written a small module (`spotify_credentials`) that holds the credentials for my Spotify Developer application. Separating the credentials from the main script lets the user keep them private when pushing to Git (provided the file is excluded from version control), while also avoiding the need to export the client id and secret as environment variables every time the script is run.

To replicate this, save your credentials in `scripts/spotify_credentials.py` as follows:
```
client_id = "xxx"
client_secret = "xxx"
redirect_url = "http://localhost:8888"
```
The client id can then be accessed as `spotify_credentials.client_id` (or `cred.client_id` with the import alias used in this notebook).

### **Generating a list of artist names to collect data for**
As I am not a Spotify user, I generated the list of artists from my Apple Music library, using a data & privacy request I had previously submitted to Apple. This was not necessary for the analysis, as I could easily have generated the list from another source, but since I already had the data I thought it would be a nice touch to have a dataset that is personal to my tastes. This gave me a list of 1,084 unique artists, for each of which I have downloaded their whole music catalogue, resulting in an output dataset of around 70k songs. I consider myself to have a broad taste in music, but this dataset will naturally be skewed; if this proves an issue, I will combine data from additional sources to balance it out.

### **Parallelising Data Download**
Since this is a one-time request for data, there would be little benefit in investing time to parallelise my code properly. Instead, I randomly shuffle the artist name input list on each run of the Python script, which allows me to run several instances of the script at the same time without each one iterating over the same part of the input list. The speed-up gained from this significantly outweighs the cost of checking on each iteration whether an artist has already been processed. Saving each artist's data as a separate file and then merging on completion also lets me checkpoint my code: if the HTTPS connection times out or Spotify forcibly disconnects me, I can easily pick up where I left off.
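
The shuffle-and-checkpoint idea can be sketched in isolation as follows. This is a minimal illustration rather than the code used for the download: `iter_pending_artists` is a hypothetical helper, and the layout of one `<artist>.csv` file per artist in a single output directory is assumed to match the pipeline.

```python
import os
import random


def iter_pending_artists(artist_names, out_dir):
    """Yield, in random order, artists that do not yet have a CSV in `out_dir`.

    The file check runs lazily, just before each artist is yielded, so an
    artist saved by another running copy of the script in the meantime is
    skipped. Restarting after a disconnect skips everything already on disk.
    """
    names = list(artist_names)
    random.shuffle(names)  # different copies of the script start in different places
    for name in names:
        if not os.path.isfile(os.path.join(out_dir, f"{name}.csv")):
            yield name
```

Two copies can still occasionally pick the same artist, since the file only appears once a download finishes, but the later write simply overwrites the earlier one with the same data. Artist names containing a path separator would need sanitising before being used as file names; that is not handled in this sketch.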


In [None]:
def main():
    
    credentials = SpotifyClientCredentials(client_id=cred.client_id,
                                           client_secret=cred.client_secret)
    ############################################################################
    # GET ARTISTS IN MY APPLE MUSIC LIBRARY
    ############################################################################
    # If the artist list has not already been generated, read in the Apple Music
    # library data and drop all columns except album artist.
    if not os.path.isfile("../data/artist_list.csv"):
        in_dir = "~/Documents/Computing/SQL/apple_music_replay/input_data/MusicLib.csv"
        artists_df = pd.read_csv(in_dir, usecols=["Album Artist"])
        artists_df.drop_duplicates(inplace=True)
        artists_df.sort_values(by=['Album Artist'], inplace=True, ascending=True)
        artists_df.to_csv("../data/artist_list.csv", index=False)
    else:
        artists_df = pd.read_csv("../data/artist_list.csv")
    artists_list = artists_df.values.tolist()

    # Randomly shuffle artists_list so that multiple instances of the script can
    # be run simultaneously. This parallelisation more than makes up for the
    # slowdown from checking on each iteration whether an artist's dataframe
    # has already been downloaded, without having to deploy any libraries to
    # parallelise my code.
    random.shuffle(artists_list)

    ############################################################################
    # CALL get_artist_data() FOR EACH ARTIST IN artists_list
    ############################################################################
    request_counter = 0
    sleep_min, sleep_max = 4, 6
    for artist in artists_list:
        # If artist dataframe doesn't exist then call get_artist_data() to 
        # download artist's data.
        if not os.path.isfile('../data/artists/' + str(*artist) + '.csv'):
            df = get_artist_data(*artist, credentials)
            request_counter += 1
            # Add random delay to avoid being forcibly disconnected.
            if request_counter % 5 == 0:
                time.sleep(np.random.uniform(sleep_min, sleep_max))
            # get_artist_data() returns 0 for an unmatched artist, else a
            # pandas dataframe for matched artists. Check that we do not have
            # our error code (0) before trying to write df to disk.
            if isinstance(df, pd.DataFrame):
                df.to_csv('../data/artists/' + str(*artist) + '.csv',
                          index=False)

    ############################################################################
    # APPEND ARTISTS DATAFRAMES INTO MASTER DATAFRAME
    ############################################################################
    # Read each artist's CSV and concatenate them into a master dataframe.
    artist_csvs = glob.glob(os.path.join("../data/artists/","*.csv"))
    frames = [pd.read_csv(f) for f in artist_csvs]
    master_df = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

    master_df.to_csv('../data/master_data.csv', index=False)
    return 1

In [None]:
if __name__ == "__main__":
  main()

I ran this script locally, not on Google Colab, as this allowed me to run many instances simultaneously and so reduce the time required to download the dataset. Consequently, the cells above that download the data have no output. I have presented this portion of my analysis in Google Colab for continuity with my other notebooks, where a Jupyter Notebook works better than a terminal console for displaying graphs, data-frames etc.