# Preparing Lyrics - Sentiment dataset

In this notebook I use the [Spotify API](https://developer.spotify.com/documentation/web-api/) and an existing [Song/Band/Lyrics Kaggle Dataset](https://www.kaggle.com/detkov/lyrics-dataset) to build a sentiment analysis dataset, with valence as the measure of positiveness. Valence ranges from 0 to 1, and a higher valence corresponds to [a happier sentiment](https://developer.spotify.com/documentation/web-api/reference/tracks/get-several-audio-features/).

The result is a dataset of ~156,000 non-null rows.


## Steps Taken to Prepare the Dataset:

1. Get the song lyrics database (with columns: Band, Lyrics, Song).
2. Query Spotify for each song's ID and valence (i.e., happiness), in chunks of 99 songs per query.
3. Integrate everything into one dataframe, where each song has a corresponding valence value, or np.nan if the song was not found.
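
Step 2 groups track IDs into fixed-size batches before asking Spotify for audio features. Below is a minimal sketch of that batching, assuming the audio-features endpoint URL that appears later in this notebook. The names `batched` and `audio_features_url` and the example IDs are hypothetical, introduced only for illustration; they are not part of the repository's helpers.

```python
from urllib.parse import quote

# Assumed endpoint, copied from the request code later in this notebook.
AUDIO_FEATURES_URL = "https://api.spotify.com/v1/audio-features?ids="


def batched(ids, size):
    """Yield consecutive slices of `ids`, each holding at most `size` items."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def audio_features_url(track_ids):
    """Join track IDs with commas, URL-quote them, and append to the endpoint."""
    return AUDIO_FEATURES_URL + quote(",".join(track_ids))


# Hypothetical track IDs, for illustration only.
example_ids = ["id{}".format(n) for n in range(250)]
batches = list(batched(example_ids, 99))
print([len(b) for b in batches])  # [99, 99, 52]
print(audio_features_url(batches[2][:3]))
```

Slicing this way also yields the final, shorter batch, so songs at the end of the list still get sent.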


In [0]:
%%bash
# Clone Github repository 
rm -rf lyrics-sentiment
git clone https://github.com/EdenBD/lyrics-sentiment.git
# To get lyrics from genius.com
pip install lyricsgenius

Collecting lyricsgenius
  Downloading https://files.pythonhosted.org/packages/45/22/a1bb37594dff57b87134071f9881919110cd26fcb4f37b682c7db544d53b/lyricsgenius-1.8.4-py3-none-any.whl
Collecting beautifulsoup4==4.6.0
  Downloading https://files.pythonhosted.org/packages/9e/d4/10f46e5cfac773e22707237bfcd51bbffeaf0a576b0a847ec7ab15bd7ace/beautifulsoup4-4.6.0-py3-none-any.whl (86kB)
Installing collected packages: beautifulsoup4, lyricsgenius
  Found existing installation: beautifulsoup4 4.6.3
    Uninstalling beautifulsoup4-4.6.3:
      Successfully uninstalled beautifulsoup4-4.6.3
Successfully installed beautifulsoup4-4.6.0 lyricsgenius-1.8.4


Cloning into 'lyrics-sentiment'...


In [0]:
import pandas as pd
import numpy as np
import os 

# Mount google Drive
from google.colab import drive
drive.mount('/content/gdrive/')

# For spotify helper function
import sys
sys.path.append('/content/lyrics-sentiment')
from helpers import get_spotify_valence, get_lyrics_from_genius, plot_histogram, plot_training

# Get lyrics
import lyricsgenius 

# Data analysis
import matplotlib.pyplot as plt


Mounted at /content/gdrive/


In [0]:

# Download the lyrics file from Kaggle and place it in your Drive.
# Change MY_FOLDER to your folder in the Drive.
ROOT_DIRECTORY = "/content/gdrive/My Drive"
MY_FOLDER = "6864"
FILE_NAME = "Lyrics1.csv"

GIT_DB_PATH = '/content/lyrics-sentiment/diversified.xlsx'
# Path to save new df to.
OUTPUT_FILENAME = "labeled_lyrics.csv"
SAVE_PATH = os.path.join(ROOT_DIRECTORY, MY_FOLDER, OUTPUT_FILENAME)

path_to_data = os.path.join(ROOT_DIRECTORY, MY_FOLDER, FILE_NAME)
print("Verify your path: ",path_to_data,end='\n\n')
# Skip malformed rows (pandas >= 1.3; older versions used error_bad_lines=False).
df = pd.read_csv(path_to_data, on_bad_lines='skip')
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]

# Report the number of sentences.
print('Number of training sentences: {:,}\n'.format(df.shape[0]))

# Display random rows from the data.
df.sample(5)


Verify your path:  /content/gdrive/My Drive/6864/Lyrics1.csv

Number of training sentences: 250,000



| | Band | Lyrics | Song |
|---|---|---|---|
| 17858 | Sugarland | Why you walking around with your heart so heav... | Find the Beat Again |
| 13474 | Brian McKnight | Mmm, yeah\nAlright, yeah\nListen\n\nBaby, can ... | Shoulda Woulda Coulda |
| 249679 | Miranda | Esta vez seguro que hago mía tu respuesta\r\nF... | Horoscopo |
| 42987 | Johnny Cash | I was born down in the bottoms of the flat bla... | Mississippi Sand |
| 245537 | Annett Louisan | Er bringt sie hoch zu ihrem Bett und jeden Abe... | Besonders |


In [0]:
# Remove non-English songs (approximated by dropping rows that contain non-ASCII characters).
for column in df.columns:
  mask_nonAscii = df[column].str.len().ne(df[column].str.encode('ascii',errors = 'ignore').str.len())
  df = df[~mask_nonAscii & df[column].notnull()]

# Rename.
df.rename({"Lyrics": "seq", "Song": 'song', "Band":'artist'}, axis=1, inplace=True)

# Create nan column for label.
df['label'] = np.nan

# Drop duplicates
df.drop_duplicates(['song','seq','artist'], keep='first', inplace=True)

# Reset the index after deleting NaN and duplicate rows.
df = df.reset_index(drop= True)

# Display df information after changes.
df.info()
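
The language filter above relies on a length comparison: encoding a string to ASCII while ignoring errors silently drops every non-ASCII character, so any change in length means the string contained one. The sketch below shows the same check on plain Python values. The `is_ascii` helper and the sample rows are hypothetical and only illustrate the idea.

```python
def is_ascii(text):
    """Return True when every character in `text` is an ASCII character."""
    # Encoding with errors="ignore" drops non-ASCII characters, so a length
    # difference reveals that at least one was present.
    return len(text) == len(text.encode("ascii", errors="ignore"))


# Hypothetical rows, for illustration only.
rows = [
    {"Band": "Johnny Cash", "Song": "Mississippi Sand"},
    {"Band": "Miranda", "Song": "Horóscopo"},
    {"Band": "Annett Louisan", "Song": None},
]

# Keep a row only if every field is present and pure ASCII.
kept = [
    row for row in rows
    if all(value is not None and is_ascii(value) for value in row.values())
]
print(kept)  # [{'Band': 'Johnny Cash', 'Song': 'Mississippi Sand'}]
```

This is a heuristic rather than true language detection: it drops English lyrics that contain accented letters or typographic quotes, and it keeps non-English lyrics that happen to use only ASCII characters.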

### Accessing Spotify API

In [0]:
import json
import urllib.request
from urllib.request import Request
from pandas.io.json import json_normalize
from urllib.parse import quote
import time
import sys

# The token expires, so keep requesting a new one with the "Get Token" button here: https://developer.spotify.com/console/get-audio-features-track/?id=06AKEBrKUckW0KREUWRnvT
current_token = 'PLACE_YOUR_TOKEN_HERE'

artists = df.artist.values
songs = df.song.values

print(len(songs))
print(len(artists))
print(df.shape)


Thank you [Madeline Zhang](https://github.com/madelinez820) for the detailed [Spotify access example code](https://github.com/EdenBD/How-To-Win-Eurovision/blob/master/data-wrangling-scripts/spotify_script.ipynb).

In [0]:
songURIS = ""
max_tracks_count = 99
successful_i = []
nulls_count = 0

# If there are more than 35,000 songs, process them in chunks.
for i in range(0,len(songs)):
    # formatting spaces
    song = quote(songs[i])
    artist = quote(artists[i]) 

    # going from artist / song name to song URIs (https://developer.spotify.com/documentation/web-api/reference/search/search/)
    # Can make more efficient by increasing limit to 50.
    request = Request('https://api.spotify.com/v1/search?q=track:' + song + '%20artist:' + artist + '&type=track&limit=1')
    request.add_header('Accept', 'application/json')
    request.add_header('Content-Type', 'application/json')
    request.add_header('Authorization', 'Bearer ' + current_token)
    try: 
      res = urllib.request.urlopen(request)
      resObject = json.load(res)

      if (len(resObject["tracks"]["items"]) == 0):
          nulls_count += 1
      else:
          songURI = resObject["tracks"]["items"][0]["id"]

          if len(successful_i)<max_tracks_count:
            songURIS+=songURI + ','
            successful_i.append(i)
          else:
            songURIS+=songURI
            successful_i.append(i)
            print("Got {} Successful songs".format(len(successful_i)))
            songURIS = quote(songURIS)
            # Getting 99 songs URI -> audio features (https://developer.spotify.com/documentation/web-api/reference/tracks/get-several-audio-features/)
            audioRequest = Request('https://api.spotify.com/v1/audio-features?ids=' + songURIS)
            audioRequest.add_header('Accept', 'application/json')
            audioRequest.add_header('Content-Type', 'application/json')
            audioRequest.add_header('Authorization', 'Bearer ' + current_token)
            audioRes = urllib.request.urlopen(audioRequest)
            jsonObject = json.load(audioRes)
            tracks_objects = jsonObject["audio_features"]
            for idx,trackObject in zip(successful_i, tracks_objects):
              # Set the value at the given row position; a single .iat call avoids chained assignment.
              if trackObject:
                df.iat[idx, df.columns.get_loc('label')] = trackObject["valence"]
            # Reset. 
            successful_i = []
            songURIS = ""
        

    except urllib.error.HTTPError as e:
      if int(e.code) == 429: # Maxed requests, need to wait 
        wait_time = float(e.info()['Retry-After'])
      else:
        wait_time = 3

      if int(e.code) == 400:
        print("Invalid request at song: ",song)
      if int(e.code) == 401:
        print("Need to refresh Token from i: ",i)
        break
        
      print("For {} Sleeping {} seconds at {}".format(e.code,wait_time,i))
      time.sleep(wait_time)

print(df.sample(5))
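
Two pieces of the loop above are easier to follow in isolation: how the search URL is assembled, and how long to wait after an HTTP error. The sketch below pulls both into small functions. `build_search_url` and `retry_delay` are hypothetical names, and the fallback for a missing or non-numeric `Retry-After` header is an assumption added here, not behavior taken from the notebook.

```python
from urllib.parse import quote

SEARCH_URL = "https://api.spotify.com/v1/search"


def build_search_url(song, artist):
    """Build the track-search URL for one song title and artist name."""
    query = "track:" + quote(song) + "%20artist:" + quote(artist)
    return SEARCH_URL + "?q=" + query + "&type=track&limit=1"


def retry_delay(status_code, headers, default=3.0):
    """Choose how many seconds to sleep after an HTTP error.

    For 429 responses the Retry-After header is used when it is present and
    numeric; every other case falls back to `default` seconds.
    """
    if status_code == 429:
        try:
            return float(headers.get("Retry-After"))
        except (TypeError, ValueError):
            pass  # Header missing or not a number: use the default below.
    return default


print(build_search_url("Thank you, next", "ariana grande"))
print(retry_delay(429, {"Retry-After": "7"}))  # 7.0
print(retry_delay(429, {}))                    # 3.0
print(retry_delay(500, {}))                    # 3.0
```

Keeping the delay logic in one function makes it straightforward to retry the same song after sleeping, instead of moving on to the next one.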

In [0]:
# Remove nulls for songs not found in Spotify.
print("df Length before removing Spotify unfound songs: ", len(df))
df.dropna(inplace=True)
print("df Length after: ", len(df))

This step is only needed if the dataset is large (above 35K rows) and is being built in chunks. It stacks the current dataframe on top of the previously computed one.

In [0]:
# Size of the chunk that was computed now. 
done_chunk = 35000

# Read the previously calculated df file.
# Skip malformed rows (pandas >= 1.3; older versions used error_bad_lines=False).
prev = pd.read_csv(SAVE_PATH, on_bad_lines='skip')

prev = prev.loc[:done_chunk, ~prev.columns.str.contains('^Unnamed')]
current = df.loc[done_chunk:, ~df.columns.str.contains('^Unnamed')]
df = pd.concat([prev, current], axis=0)
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]


Save current df to file.

In [0]:
df.to_csv(SAVE_PATH)

Get the sentiment (valence) of a single song.

In [0]:
artist_name = "ariana grande"
song_title = "Thank you, next"
# Never commit a real token; paste a fresh one here before running.
spotify_token = "PLACE_YOUR_TOKEN_HERE"
spotify_label = get_spotify_valence(song_title, artist_name, spotify_token)


Found valence: 0.41 of the song: Thank you, next - ariana grande
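
The implementation of `get_spotify_valence` lives in the repository's `helpers` module and is not shown in this notebook. A single-song lookup plausibly needs the same two steps as the batch loop: search for the track, then read its valence from the audio features. The sketch below covers only the response parsing, using the response shapes that the batch loop reads. The function names and the hand-written responses are hypothetical.

```python
def first_track_id(search_response):
    """Return the ID of the first track in a search response, or None."""
    items = search_response["tracks"]["items"]
    return items[0]["id"] if items else None


def first_valence(features_response):
    """Return the valence of the first track, or None if it has no features."""
    tracks = features_response["audio_features"]
    if tracks and tracks[0]:
        return tracks[0]["valence"]
    return None


# Hand-written stand-ins for API responses (hypothetical values).
search_response = {"tracks": {"items": [{"id": "abc123"}]}}
features_response = {"audio_features": [{"valence": 0.41}]}

print(first_track_id(search_response))              # abc123
print(first_track_id({"tracks": {"items": []}}))    # None
print(first_valence(features_response))             # 0.41
print(first_valence({"audio_features": [None]}))    # None
```

Returning `None` for a missing track or missing features lets the caller store `np.nan` for that song, matching step 3 of the preparation steps.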
