# Extracting FOB Data From Spotify and Genius APIs

Based on Vastava's notebook: <br>
https://github.com/vastava/data-science-projects/blob/master/spotify-genre-classifier/get-spotify-data.ipynb

## Setup and Library Imports

In [1]:
# Install the spotipy and lyricsgenius libraries
!pip install spotipy -q
!pip install lyricsgenius -q

In [2]:
import re
import pandas as pd
import numpy as np
import matplotlib
from matplotlib import pyplot as plt
%matplotlib inline
import seaborn as sns
from collections import Counter
from statsmodels.stats.weightstats import ztest

In [3]:
# Display entire dataframe
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

In [4]:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials
import spotipy.util as util
import lyricsgenius as lg

## FOB DataFrame: Spotify Features and Genius Lyrics

In [6]:
# Spotipy-specific variables
cid = 'Your Client ID here'
secret = 'Your Client secret here'
user = 'Your Client user name here'
scope = 'user-library-read playlist-modify-public playlist-read-private'
redirect_uri = 'http://google.com'

# Client-credentials flow: app-level access, with no user login involved
client_credentials_manager = SpotifyClientCredentials(client_id = cid, client_secret = secret)
sp = spotipy.Spotify(client_credentials_manager = client_credentials_manager)

In [7]:
# Genius-specific variables
genius_access_token = 'Your Genius access token here'
genius_obj = lg.Genius(genius_access_token)
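
To keep secrets out of the notebook, the placeholder strings above could be read from environment variables instead. The sketch below shows one way to do that. `load_credentials` is a hypothetical helper, and the variable names are assumptions: `SPOTIPY_CLIENT_ID` and `SPOTIPY_CLIENT_SECRET` follow spotipy's naming convention, and `GENIUS_ACCESS_TOKEN` is chosen here to match.

```python
import os

# Assumed variable names; adjust to whatever your environment defines.
REQUIRED_VARS = ('SPOTIPY_CLIENT_ID', 'SPOTIPY_CLIENT_SECRET', 'GENIUS_ACCESS_TOKEN')

def load_credentials(env=os.environ):
    """Return the required credentials from env, or raise if any is missing."""
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise KeyError(f'Missing environment variables: {", ".join(missing)}')
    return {name: env[name] for name in REQUIRED_VARS}
```

With the variables set, `creds = load_credentials()` returns a dictionary whose values can be assigned to `cid`, `secret`, and `genius_access_token`.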

### Playlist Info
Link: https://open.spotify.com/playlist/0ubKnC9ctDVxlGbF5jICvq?si=6ee6b776aaf943de <br>

Currently including:
- All FOB songs on Spotify
- Bonus tracks
- Covers
- Both studio versions of "Calm Before the Storm"
- FOB features, such as "I've Been Waiting" and "One and Only"<br>

Currently not including:
- Remixes, including the AB/AP remix album
- Live versions of songs whose studio versions are already included
- Acoustic versions of songs whose main studio versions are already included
- Solo work after the hiatus
- FOB songs that aren't on Spotify ("Pavlove")

Classification of songs:
- Class 0: pre-hiatus and Class 1: post-hiatus
- Songs are classified by release date, not writing date
    - Although "Light Em Up" and "Lake Effect Kid" were written pre-hiatus, they will be classified as post-hiatus
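
The classification rule can be sketched as a small function. `classify_release` is a hypothetical helper written for illustration. The 2009 cutoff and the 1973 exception are taken from the Target Variable cell later in this notebook, where 1973 covers an Elton John cover that was recorded in the post-hiatus era.

```python
def classify_release(release_date):
    """Label a Spotify release date string as 'pre-hiatus' or 'post-hiatus'."""
    # The first four characters are the year whether the string is
    # 'YYYY', 'YYYY-MM', or 'YYYY-MM-DD'.
    year = int(release_date[:4])
    # 1973 is a special case: a cover labeled with the original album's year.
    if year == 1973 or year > 2009:
        return 'post-hiatus'
    return 'pre-hiatus'

print(classify_release('2003-03-25'))  # pre-hiatus
print(classify_release('2013-04-12'))  # post-hiatus
```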

In [8]:
def get_features(username, playlist_id):
    results = sp.user_playlist_tracks(username, playlist_id)
    songs = results['items']
    
    while results['next']:
        results = sp.next(results)
        songs.extend(results['items'])    
    
    ids = []
    names = []
    dates = []
    
    for song in songs:
        track = song['track']
        # Skip entries with no track or no Spotify ID
        if track and track['id'] is not None:
            names.append(track['name'])
            ids.append(track['id'])
            dates.append(track['album']['release_date'])
    
    print(f'Found {len(ids)} songs.')
    
    features = []
    titles = []
    release_dates = []
    for id_idx, id_val in enumerate(ids):
        print(f'  Processing song #{id_idx+1}: {names[id_idx]}                                                ', end = '\r')
        # audio_features returns a list with one entry per ID; an entry is None if no features exist
        for feature in sp.audio_features(id_val):
            if feature is not None:
                features.append(feature)
                # Keep title and date aligned with the features that were actually returned
                titles.append(names[id_idx])
                release_dates.append(dates[id_idx])
    
    print(f'\r\nFinished processing {len(ids)} songs.')
    df = pd.DataFrame(features)
    df['title'] = titles
    df['release_date'] = release_dates
    return df

In [9]:
fob_songs = get_features('shiffytali', '0ubKnC9ctDVxlGbF5jICvq')

Found 141 songs.
  Processing song #141: Stayin out All Night - Boys of Zummer Remix
Finished processing 141 songs.
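
The loop in `get_features` requests audio features one track at a time. `sp.audio_features` also accepts a list of IDs, and the Spotify endpoint behind it takes at most 100 IDs per request, so batching is an option. Below is a minimal sketch of the batching step only. `chunked` is a hypothetical helper, not part of spotipy, and the IDs are stand-ins.

```python
def chunked(items, size=100):
    """Yield consecutive slices of items, each with at most size elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# Stand-in IDs; in the notebook these would be the `ids` list built in get_features.
track_ids = [f'track_{n}' for n in range(141)]
batch_sizes = [len(batch) for batch in chunked(track_ids)]
print(batch_sizes)  # [100, 41]
```

Each batch could then be passed to `sp.audio_features(batch)`, which would cut the 141 requests above down to 2.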


### RegEx Setup
- Goal: strip bracketed annotations from the lyrics before counting words
- The pattern chains several alternatives with the | (or) operator because a bracket can appear at the start of a line, at the end, or at both ends
    - The last two alternatives handle lines that have a bracket in the middle
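
To see what the pattern does in practice, the sketch below applies it to a short made-up string and counts words the same way `get_lyrics` does. `count_words` is a hypothetical helper, and the sample text is invented for illustration, not real lyrics.

```python
import re
from collections import Counter

# Same pattern as in get_lyrics
pattern = r'(\[?([\w\-]+)\])|(\[([\w\-]+)\]?)|(\[([\w\-]+)\])|.*\[.*]?|.*\[?.*]'

def count_words(lyrics):
    """Strip bracketed text, then return (total words, unique words)."""
    cleaned = re.sub(pattern, '', lyrics)
    count = Counter(cleaned.split())
    return sum(count.values()), len(count)

# Made-up sample text with two bracketed lines
sample = '[Verse 1]\nla la land\n[Chorus]\nsing la'
print(count_words(sample))  # (5, 3)
```

Two things to keep in mind. The last two alternatives match from the start of a line, so a line with a bracket in the middle, such as `oh oh [x2]`, is removed in full, words included. The counts are also case-sensitive and keep punctuation attached, so `La` and `la,` count as different words from `la`.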

In [10]:
def get_lyrics(song_list, artist):
    total_words = []
    unique_words = []
    pattern = r'(\[?([\w\-]+)\])|(\[([\w\-]+)\]?)|(\[([\w\-]+)\])|.*\[.*]?|.*\[?.*]'
    
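    for idx, song in enumerate(song_list):
        result = genius_obj.search_song(title = song, artist = artist)
        # search_song returns None when Genius finds no match
        if result is None:
            raise ValueError(f'No Genius result for "{song}" by {artist}')
        # Substitute any useless information with an empty string
        lyrics = re.sub(pattern, '', result.lyrics)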
        count = Counter(lyrics.split())
        
        total = sum(count.values())
        unique = len(count)
        total_words.append(total)
        unique_words.append(unique)
    
    print('Processed all songs')    
    return total_words, unique_words

In [11]:
# Note titles that confuse the get_lyrics function
manually = {78: 'Beat It', 115: 'From Now On We Are Enemies',
            116: 'Yule Shoot Your Eye Out', 132: "I Wan'na Be Like You (The Monkey Song)",
            139: 'Ghostbusters (I’m Not Afraid)'}

# Replace with titles that will work with the get_lyrics function
for idx, title in manually.items():
    fob_songs.at[idx, 'title'] = title

In [12]:
fob_songs['total_words'], fob_songs['unique_words'] = get_lyrics(fob_songs['title'], 'Fall Out Boy')

Searching for "Honorable Mention" by Fall Out Boy...
Done.
Searching for "Calm Before the Storm" by Fall Out Boy...
Done.
Searching for "Switchblades and Infidelity" by Fall Out Boy...
Done.
Searching for "Pretty In Punk" by Fall Out Boy...
Done.
Searching for "Growing Up" by Fall Out Boy...
Done.
Searching for "The World’s Not Waiting (For Five Tired Boys In A Broken Down Van)" by Fall Out Boy...
Done.
Searching for "Short, Fast, And Loud" by Fall Out Boy...
Done.
Searching for "Moving Pictures" by Fall Out Boy...
Done.
Searching for "Parker Lewis Can't Lose (But I'm Gonna Give It My Best Shot)" by Fall Out Boy...
Done.
Searching for "Tell That Mick He Just Made My List of Things to Do Today" by Fall Out Boy...
Done.
Searching for "Dead on Arrival" by Fall Out Boy...
Done.
Searching for "Grand Theft Autumn / Where Is Your Boy" by Fall Out Boy...
Done.
Searching for "Saturday" by Fall Out Boy...
Done.
Searching for "Homesick at Space Camp" by Fall Out Boy...
Done.
Searching for "Sendin

In [13]:
fob_songs.head()

Unnamed: 0,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,type,id,uri,track_href,analysis_url,duration_ms,time_signature,title,release_date,total_words,unique_words
0,0.384,0.969,5,-3.903,1,0.0985,0.00236,0.0,0.277,0.374,103.858,audio_features,5fWwfsMUyLeApOAdIieGiX,spotify:track:5fWwfsMUyLeApOAdIieGiX,https://api.spotify.com/v1/tracks/5fWwfsMUyLeA...,https://api.spotify.com/v1/audio-analysis/5fWw...,205867,4,Honorable Mention,2003-03-25,201,74
1,0.338,0.919,9,-5.538,1,0.316,0.00391,0.000251,0.307,0.464,173.643,audio_features,4kTbEgPxqeBz1wVeXcyKJy,spotify:track:4kTbEgPxqeBz1wVeXcyKJy,https://api.spotify.com/v1/tracks/4kTbEgPxqeBz...,https://api.spotify.com/v1/audio-analysis/4kTb...,283573,4,Calm Before the Storm,2003-03-25,289,141
2,0.374,0.99,11,-4.092,1,0.38,0.00673,3e-06,0.25,0.336,135.163,audio_features,1gVJSRoWcnlDFQ472PSUaS,spotify:track:1gVJSRoWcnlDFQ472PSUaS,https://api.spotify.com/v1/tracks/1gVJSRoWcnlD...,https://api.spotify.com/v1/audio-analysis/1gVJ...,194573,4,Switchblades and Infidelity,2003-03-25,165,45
3,0.43,0.963,9,-4.803,1,0.161,0.00347,0.0,0.298,0.268,125.773,audio_features,1ZcR4oKoZn7HHzpNsT40Nc,spotify:track:1ZcR4oKoZn7HHzpNsT40Nc,https://api.spotify.com/v1/tracks/1ZcR4oKoZn7H...,https://api.spotify.com/v1/audio-analysis/1ZcR...,217253,4,Pretty In Punk,2003-03-25,249,85
4,0.291,0.971,0,-4.066,1,0.301,0.00926,0.0,0.287,0.408,149.928,audio_features,1SWB4y4XIjmesBR5Nn2qBJ,spotify:track:1SWB4y4XIjmesBR5Nn2qBJ,https://api.spotify.com/v1/tracks/1SWB4y4XIjme...,https://api.spotify.com/v1/audio-analysis/1SWB...,168973,3,Growing Up,2003-03-25,174,94


In [14]:
fob_songs.to_csv('data/raw/FOB_songs_raw.csv')

## Data Cleaning and Validation

In [15]:
# Remove irrelevant columns
fob_songs = fob_songs.drop(labels = ['type','id', 'uri', 'track_href', 'analysis_url'], axis = 'columns')

In [16]:
# Move title column to first index
first_column = fob_songs.pop('title')
fob_songs.insert(0, 'title', first_column)

In [17]:
# Replace duration in ms with duration in minutes
fob_songs['duration_min'] = round((fob_songs['duration_ms'] / 60000), 2)
fob_songs = fob_songs.drop(['duration_ms'], axis = 1)

### Clarification on Duration Column:
The duration_min column does not show minutes.seconds. <br>
For example, the first song in this list is Honorable Mention, which is 3 minutes and 25 seconds long. Its duration_min value is 3.43 because 0.43 of a minute is 0.43 × 60 ≈ 25.8 seconds. <br>
The decimal part is a <i>fraction of a minute</i>, not a count of seconds.
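
The two representations can be compared directly. The sketch below uses the `duration_ms` value of 205867 shown for Honorable Mention in the raw table above. Both function names are hypothetical helpers added for illustration.

```python
def ms_to_decimal_minutes(duration_ms):
    """Duration as decimal minutes, rounded as in the duration_min column."""
    return round(duration_ms / 60000, 2)

def ms_to_min_sec(duration_ms):
    """Duration as (whole minutes, whole seconds), dropping leftover milliseconds."""
    total_seconds = duration_ms // 1000
    return divmod(total_seconds, 60)

print(ms_to_decimal_minutes(205867))  # 3.43
print(ms_to_min_sec(205867))          # (3, 25)
```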

In [18]:
fob_songs.to_csv('data/interim/FOB_songs_interim.csv')

## Target Variable

In [19]:
fob_songs['class'] = 'pre-hiatus' # set default value as pre-hiatus

# Designate post-hiatus songs
for idx, date in fob_songs['release_date'].items():
    year = int(date[:4])
    # The 1973 check accounts for an Elton John cover labeled as 1973 but recorded in the post-hiatus era
    if (year == 1973) or (year > 2009):
        fob_songs.at[idx, 'class'] = 'post-hiatus'

In [20]:
fob_songs.head()

Unnamed: 0,title,danceability,energy,key,loudness,mode,speechiness,acousticness,instrumentalness,liveness,valence,tempo,time_signature,release_date,total_words,unique_words,duration_min,class
0,Honorable Mention,0.384,0.969,5,-3.903,1,0.0985,0.00236,0.0,0.277,0.374,103.858,4,2003-03-25,201,74,3.43,pre-hiatus
1,Calm Before the Storm,0.338,0.919,9,-5.538,1,0.316,0.00391,0.000251,0.307,0.464,173.643,4,2003-03-25,289,141,4.73,pre-hiatus
2,Switchblades and Infidelity,0.374,0.99,11,-4.092,1,0.38,0.00673,3e-06,0.25,0.336,135.163,4,2003-03-25,165,45,3.24,pre-hiatus
3,Pretty In Punk,0.43,0.963,9,-4.803,1,0.161,0.00347,0.0,0.298,0.268,125.773,4,2003-03-25,249,85,3.62,pre-hiatus
4,Growing Up,0.291,0.971,0,-4.066,1,0.301,0.00926,0.0,0.287,0.408,149.928,3,2003-03-25,174,94,2.82,pre-hiatus


### Convert processed song data to CSV

In [21]:
fob_songs.to_csv('data/processed/FOB_songs_processed.csv')