Lyric saver notebook

The code below generates a labeled dataset of song lyrics. Each sample is saved as an individual .txt file inside a folder named after its label, where the label is the genre of the song.
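
To illustrate the folder-per-label layout described above, here is a minimal sketch of how such a dataset could be read back into (text, label) pairs. The function name `load_lyrics_dataset` is hypothetical and not part of this notebook; it only assumes one subfolder per genre containing .txt files.

```python
from pathlib import Path


def load_lyrics_dataset(root):
    """Return (text, label) pairs, taking each label from the parent folder name."""
    samples = []
    # Every .txt file exactly one level below root belongs to the genre folder it sits in
    for txt_file in sorted(Path(root).glob("*/*.txt")):
        text = txt_file.read_text(encoding="utf-8")
        samples.append((text, txt_file.parent.name))
    return samples
```

Calling `load_lyrics_dataset('./Lyrics 5')` would then return one pair per saved song, assuming the dataset was written to that folder.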

THIS CODE CANNOT BE RUN IN CLOUD-HOSTED NOTEBOOKS E.G. GOOGLE COLAB, AS THE GENIUS API WILL BLOCK ACCESS

IT MUST BE RUN IN A LOCAL ENVIRONMENT

GENIUS API KEY CAN BE OBTAINED VIA: https://docs.genius.com/

LASTFM KEY CAN BE OBTAINED VIA: https://www.last.fm/api/authentication
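
Once both keys have been obtained, one option is to read them from environment variables instead of pasting them into the notebook. This is only a sketch: the helper `read_api_key` and the environment variable names are assumptions, and the placeholders match the ones used in the first code cell.

```python
import os


def read_api_key(env_var, placeholder):
    """Return the key stored in env_var, or the placeholder if it is unset."""
    return os.environ.get(env_var, placeholder)


# Hypothetical variable names; any names would work as long as they are set locally
GENIUS_ACCESS_TOKEN = read_api_key("GENIUS_ACCESS_TOKEN", "[GeniusAPIKey]")
LASTFM_API_KEY = read_api_key("LASTFM_API_KEY", "[LastFMAPIKey]")
```

This keeps the keys out of the saved notebook file, which helps if the notebook is shared.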


In [1]:
import os
import json
import re

import pandas as pd
import requests
import lyricsgenius
from tqdm import tqdm
from requests.exceptions import Timeout

GENIUS_ACCESS_TOKEN = '[GeniusAPIKey]'
LASTFM_API_KEY = '[LastFMAPIKey]'
genius = lyricsgenius.Genius(GENIUS_ACCESS_TOKEN)

# Name of the new dataset folder, and the directory to create it in
# ('./' is the current working directory)
DatasetName = 'Lyrics 5'
DatasetFilePath = './'

In [2]:
# Get the top tags (genres) from the Last.fm API
def get_top_genres():
    base_url = "http://ws.audioscrobbler.com/2.0/"

    # Request the overall top tags from the Last.fm API
    params = {"method": "tag.getTopTags", "api_key": LASTFM_API_KEY, "format": "json"}
    response = requests.get(base_url, params=params, timeout=30)
    response.raise_for_status()

    # Parse the response into a DataFrame with one row per tag
    tags = response.json()["toptags"]["tag"]
    dataframe = pd.DataFrame(tags)

    return dataframe
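
The function above indexes straight into `["toptags"]["tag"]`, which raises a bare KeyError if the API returns a failure payload (for example with an invalid key). A minimal sketch of a more explicit check follows. It assumes that Last.fm reports failures as a JSON object with `error` and `message` fields; the helper name `extract_tags` is hypothetical.

```python
def extract_tags(payload):
    """Return the list of tag dicts from a decoded tag.getTopTags payload.

    Raises RuntimeError when the payload looks like an error response
    (assumed shape: {"error": <code>, "message": <text>}).
    """
    if "error" in payload:
        raise RuntimeError(
            f"Last.fm error {payload['error']}: {payload.get('message', '')}"
        )
    return payload["toptags"]["tag"]


# Offline example using one row copied from the printed tag table
sample = {"toptags": {"tag": [{"name": "rock", "count": 4037633, "reach": 400572}]}}
print(extract_tags(sample))
```

The returned list can be passed to `pd.DataFrame` exactly as `get_top_genres` does.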



In [3]:
# Get the top tracks for a genre (tag) from the Last.fm API
def get_top_tracks_by_genre(genre, limit):
    base_url = "http://ws.audioscrobbler.com/2.0/"

    # Request the top tracks for the given tag from the Last.fm API
    params = {
        "method": "tag.getTopTracks",
        "tag": genre,
        "limit": limit,
        "api_key": LASTFM_API_KEY,
        "format": "json",
    }
    response = requests.get(base_url, params=params, timeout=30)
    response.raise_for_status()

    # Parse the response into a list of {"name", "artist"} dicts
    data = response.json()

    tracks_list = [
        {"name": track["name"], "artist": track["artist"]["name"]}
        for track in data["tracks"]["track"]
    ]

    return tracks_list
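
The function above asks for all tracks in a single request. If the API caps how many tracks one request returns, an alternative is to request consecutive pages and concatenate them. The sketch below shows only the paging loop: `fetch_page` is a hypothetical callable (for example a thin wrapper around `requests.get` that also sends a `page` parameter), and `fake_fetch` is a stand-in so the example runs offline.

```python
def collect_tracks(fetch_page, total, page_size=50):
    """Gather up to `total` tracks by requesting consecutive pages.

    fetch_page(page, limit) is a hypothetical callable returning a list of
    track dicts for one page, or an empty list when no results remain.
    """
    tracks = []
    page = 1
    while len(tracks) < total:
        batch = fetch_page(page, page_size)
        if not batch:  # no more results available
            break
        tracks.extend(batch)
        page += 1
    return tracks[:total]


# Stand-in for the real API: a made-up catalogue of 120 tracks
def fake_fetch(page, limit):
    catalogue = [{"name": f"song {i}", "artist": "demo"} for i in range(120)]
    start = (page - 1) * limit
    return catalogue[start:start + limit]


print(len(collect_tracks(fake_fetch, 101)))
```

The final slice trims the last page so that exactly the requested number of tracks is returned when enough are available.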



In [4]:
# Get a list of the top genres from Last.fm
topgen = get_top_genres()
print(topgen)

                 name    count   reach
0                rock  4037633  400572
1          electronic  2458086  259637
2           seen live  2170788   82345
3         alternative  2119472  265657
4               indie  2050751  258479
5                 pop  2044610  231537
6    female vocalists  1624785  168853
7               metal  1287240  157837
8    alternative rock  1209625  169020
9                jazz  1179547  148954
10       classic rock  1144817  136995
11            ambient  1103572  148219
12       experimental  1092186  143416
13               folk   943362  150166
14         indie rock   910744  136082
15               punk   907680  144344
16            Hip-Hop   901909  130158
17          hard rock   898427  114887
18        black metal   886320   63671
19       instrumental   872460  125258
20  singer-songwriter   847548  109452
21              dance   813073  133698
22                80s   790744  100449
23        death metal   746150   72279
24   Progressive rock   7

In [5]:
# Define a list of genres for the dataset and obtain a list of the top songs for each
#
# NOTE: songs can and will appear in multiple genres, so the same sample may appear in multiple classes of the dataset

# Set the number of tracks to collect for each genre. Consider the following when choosing this value:
# - Use an integer that is not a round number such as 50 or 100, to avoid an issue where Last.fm returns only 50 tracks
# - Each track requires its own API call, taking anywhere between 2 and 12 seconds to complete
# - Last.fm does not disclose its API call limit; 6000 calls in total has been tested to work
trackcount = 901

# Define an empty dictionary called genre_data
genre_data = {}

# Define a list of music genres
genres = [
    'metal',
    'jazz',
    'Hip-Hop',
    'country',
    'punk',
]

# Iterate over each genre in the list
for genre in genres:
    # Call the function "get_top_tracks_by_genre" and store the returned tracks in the "tracks" variable
    tracks = get_top_tracks_by_genre(genre, trackcount)

    # Add a new key-value pair to the genre_data dictionary for the current genre, with the key "top_tracks" and the value being the tracks variable
    genre_data[genre] = {"top_tracks": tracks}
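
Because the same song can appear under several genres, it may be useful to list the overlaps before downloading any lyrics. The following sketch assumes the `genre_data` structure built above (genre mapped to `{"top_tracks": [...]}`); the helper name `find_shared_tracks` and the example tracks are made up for illustration.

```python
from collections import defaultdict


def find_shared_tracks(genre_data):
    """Map (name, artist) to the sorted genres it appears in, for tracks in 2+ genres."""
    seen = defaultdict(set)
    for genre, info in genre_data.items():
        for track in info["top_tracks"]:
            seen[(track["name"], track["artist"])].add(genre)
    return {key: sorted(found) for key, found in seen.items() if len(found) > 1}


# Made-up example data with the same shape as genre_data
example = {
    "metal": {"top_tracks": [{"name": "Song A", "artist": "Band X"}]},
    "punk": {"top_tracks": [{"name": "Song A", "artist": "Band X"},
                            {"name": "Song B", "artist": "Band Y"}]},
}
print(find_shared_tracks(example))
```

Whether to drop, keep, or relabel the shared tracks depends on how the dataset will be used, so the sketch only reports them.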

In [6]:
#print the list of the collected songs
print(genre_data)
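
Before starting the download loop below, it can help to estimate how long it will run. This sketch simply multiplies the number of genres, the tracks per genre, and the 2 to 12 seconds per call noted in the comments above; the function name is hypothetical and the estimate ignores retries.

```python
def estimate_runtime_hours(n_genres, tracks_per_genre, seconds_per_call):
    """Rough total runtime in hours, assuming one lyrics call per track."""
    return n_genres * tracks_per_genre * seconds_per_call / 3600


low = estimate_runtime_hours(5, 901, 2)
high = estimate_runtime_hours(5, 901, 12)
print(f"{low:.1f} to {high:.1f} hours")
```

With 5 genres and 901 tracks each, this works out to roughly 2.5 to 15 hours.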



In [7]:
# Create a new directory to store the dataset (no error if it already exists)
path = os.path.join(DatasetFilePath, DatasetName)
os.makedirs(path, exist_ok=True)


# Loop through each genre
for genre in tqdm(genres):

    # Create a subfolder for this genre's lyrics files (no error if it already exists)
    pathtemp = os.path.join(path, genre)
    os.makedirs(pathtemp, exist_ok=True)

    # Loop through the top tracks of the current genre
    for track in tqdm(genre_data[genre]['top_tracks']):
        # Try up to 3 times. A Timeout means the request took too long; an
        # AttributeError means search_song returned None (no match found)
        track['lyrics'] = ''
        for attempt in range(3):
            try:
                # Search for the lyrics of the current track using the Genius API
                song = genius.search_song(track['name'], track['artist'])
                track['lyrics'] = song.lyrics
                break
            except (Timeout, AttributeError):
                continue
        
        # Clean the lyrics text so it is suitable for saving
        lyrics = track['lyrics'].replace("\n", " ")
        lyrics = re.sub(r'\[.*?\]', '', lyrics)  # remove bracketed tags such as [Chorus]
        # Keep only the text after the last "Lyrics" marker, intended to drop the page header
        word_to_keep = "Lyrics"
        lyrics = word_to_keep + lyrics.split(word_to_keep)[-1]
        lyrics = re.sub(r"Translations\S*", " ", lyrics)
        lyrics = re.sub(r"\[.*?\<]", " ", lyrics)
        lyrics = re.sub(r"\d+Embed", " ", lyrics)  # remove the "<number>Embed" marker
        lyrics = lyrics.lower()
        # Remove the "see ... you might also like" recommendation text
        lyrics = re.sub(
            r"see(?!.*see.*you might also like).*you might also like", "", lyrics
        )
        track['lyrics'] = lyrics

        # Clean the track name so it can be used as a file name
        track['name'] = re.sub(r"\[.*?\]", " ", track['name'])
        track['name'] = track['name'].replace('"', '')
        track['name'] = re.sub(r'[\\/:*?<>|]', ' ', track['name'])

        # Save only when lyrics text is present. A failed search leaves just the
        # "lyrics" prefix, which is shorter than 8 characters, so no blank sample is saved
        if len(track['lyrics']) >= 8:
            with open(os.path.join(pathtemp, track['name'] + ".txt"), 'w', encoding="utf-8") as file:
                file.write(track['lyrics'])

Searching for "Chop Suey!" by System of a Down...
Done.
Searching for "In the End" by Linkin Park...
Done.
Searching for "Numb" by Linkin Park...
Searching for "Numb" by Linkin Park...
Done.
Searching for "Bring Me to Life" by Evanescence...
Done.
Searching for "Killing in the Name" by Rage Against the Machine...
Searching for "Killing in the Name" by Rage Against the Machine...
Searching for "Killing in the Name" by Rage Against the Machine...
Done.
Searching for "Toxicity" by System of a Down...
Done.
Searching for "Welcome to the Jungle" by Guns N' Roses...
Done.
Searching for "Back in Black" by AC/DC...
Done.
Searching for "Paranoid" by Black Sabbath...

  1%|▋                                                                               | 8/901 [00:32<1:01:10,  4.11s/it]
  0%|                                                                                            | 0/5 [00:32<?, ?it/s]


KeyboardInterrupt: 
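
The cleaning steps in the last code cell can also be packaged as small functions, which makes them easy to try on a sample string without calling the Genius API. This is a simplified sketch rather than a drop-in replacement: it covers only part of the cell's regexes, the function names are hypothetical, and the sample text is made up to resemble a header ending in "Lyrics" and a trailing "Embed" marker.

```python
import re


def clean_lyrics(raw):
    """Apply a subset of the notebook's cleaning steps to raw lyrics text."""
    text = raw.replace("\n", " ")
    text = re.sub(r"\[.*?\]", "", text)         # drop bracketed tags such as [Verse 1]
    text = "Lyrics" + text.split("Lyrics")[-1]  # drop everything before the last "Lyrics"
    text = re.sub(r"\d+Embed", " ", text)       # drop the "<number>Embed" marker
    return text.lower()


def safe_filename(name):
    """Replace bracketed tags and characters that are invalid in file names with spaces."""
    name = re.sub(r"\[.*?\]", " ", name)
    return re.sub(r'[\\/:*?"<>|]', " ", name)


# Made-up sample, not real Genius output
raw = "Demo Song Lyrics[Verse 1]\nfirst line\nsecond line12Embed"
print(repr(clean_lyrics(raw)))
print(repr(safe_filename("AC/DC: Live?")))
```

Keeping the cleaning separate from the download loop also means the regexes can be adjusted and re-run on already downloaded text, assuming the raw text is kept.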