# Fetching Genius Lyrics

This notebook describes how I used the Genius API to scrape lyrics for use in my lyric generation model. To fetch your own data, obtain an API key from the [Genius website](https://genius.com/api-clients) and then run the code.

Keep in mind that fetching data takes a while: on average, it took me 2 minutes to fetch 50 songs. At that rate, fetching 20 songs for each of 100 artists would take roughly 80 minutes. I recommend fetching songs one artist at a time and preparing the list of artists you want before running the code.

Credits to:
- [Hakan Sarıtaş](https://hakansaritas.medium.com/80s-hard-rock-song-scraping-and-letters-analysis-via-pandas-72a43b1267f) for his detailed guide on getting lyrics from the Genius API.
- [Rolling Stone](https://genius.com/Rolling-stone-100-greatest-artists-annotated) for their list of the 100 greatest artists, which I used as inspiration for my artist list.
- [Spinditty](https://spinditty.com/artists-bands/100-Best-Rock-Bands-of-the-2000s) for their list of the 100 best rock bands of the 2000s (I needed to include something that played during my childhood).

## Preliminaries

The two main functions in this notebook are:
- Fetching lyrics
- Formatting lyrics

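I begin by defining both of those functions. As written, the code first fetches all the lyrics and only formats them once everything has been retrieved from Genius. If you would rather clean up each song as it is fetched, uncomment the commented-out _#lyrics_list.append(...)_ line in the fetching function and comment out the line right below it that appends _lyrics_raw_, so that each song is added only once. The cleanup function called on that line must be the one defined later in this notebook, _lyrics_cleanup_parts_.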

First import the libraries and get the API key into a variable:

In [1]:
import pandas as pd
from lyricsgenius import Genius
import re
import time
from time import strftime, gmtime

#Replace the string below with your actual API key
api_key = ""
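
If you would rather not paste the key into the notebook, one option is to read it from an environment variable. The sketch below is only an illustration: the helper name `load_api_key` and the variable name `GENIUS_ACCESS_TOKEN` are assumptions made for this example, not something the rest of the notebook depends on.

```python
import os

def load_api_key(var_name="GENIUS_ACCESS_TOKEN"):
    # Hypothetical helper: read the key from the environment,
    # returning an empty string when the variable is not set
    return os.environ.get(var_name, "").strip()

api_key = load_api_key()
if not api_key:
    print("No API key found; set the environment variable or paste the key into api_key")
```

With this approach the key stays out of the saved notebook, which helps if you plan to share it.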

Then define the dataframe fetching function:

In [2]:
def dataFrameFetcher(t0,list_of_bands=["The Who"], num_of_song=25):
    '''
    The function receives:
    - t0: a timestamp (from time.time()) used to track the time it takes to retrieve the lyrics
    - list_of_bands: a list of strings with the names of the bands whose lyrics are going to be retrieved
    - num_of_song: the number of songs per band to retrieve. These are taken in descending order from the band's top songs in Genius

    The output of the function is a dataframe with three columns: artist, song name and lyrics for the song
    '''
    # Change remove_section_headers to True if you do not care about having the separations of [verse], [chorus] and so on
    genius = Genius(api_key, timeout=120, remove_section_headers=False, verbose=False)
    
    #Store relevant information in lists, then create a dataframe from those lists
    #Initialize empty lists
    lyrics_list, song_name_list, artist_name_list = [], [], []
    for band in list_of_bands:
        #You can remove these prints if you want. I kept them to monitor progress
        print(f'Fetching songs for artist: {band}')
        t1 = time.time()
        #Get seconds
        secs = t1-t0
        print(f'Time elapsed so far: {strftime("%H:%M:%S", gmtime(secs))}')
        #Access the API
        artist_all_lyrics = genius.search_artist(band, max_songs=num_of_song)
        for i in range(num_of_song):
            # Get the lyrics, song title and artist name for each song
            try:
                # Fetch the relevant information and add it to the lists
                lyrics_raw = artist_all_lyrics.songs[i].lyrics
                # To clean up while fetching, uncomment the next line and comment out the raw append below it
                #lyrics_list.append(lyrics_cleanup_parts(lyrics_raw))
                lyrics_list.append(lyrics_raw)
                song_name_list.append(artist_all_lyrics.songs[i].title)
                artist_name_list.append(artist_all_lyrics.songs[i].artist)
            except (AttributeError, IndexError):
                # The artist was not found, or it has fewer songs than requested
                continue
    #Having iterated through everything, store in dataframe
    df = pd.DataFrame({"Artist":artist_name_list,"Song Name":song_name_list,"Lyrics":lyrics_list})
    return df
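
The progress message in the function formats the elapsed seconds with `strftime` and `gmtime`. As a small standalone illustration of that formatting (the helper name `format_elapsed` is hypothetical and used only here):

```python
from time import strftime, gmtime

def format_elapsed(secs):
    # Treat the seconds as an offset from the epoch and keep only the clock part
    return strftime("%H:%M:%S", gmtime(secs))

print(format_elapsed(125))   # 2 minutes and 5 seconds
print(format_elapsed(3725))  # 1 hour, 2 minutes and 5 seconds
```

Because only the clock part is printed, the display wraps around after 24 hours, which should not matter for fetches of the size described here.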

Next, define the function for lyric clean-up (formatting). The function follows this process:
- Remove the _Embed_ keyword at the end of the song, along with the number that precedes it
- Remove instances of _You might also like_, which can appear at the beginning, at the end and sometimes in the middle of a song. I presume Genius uses this text to suggest other songs when you look at the lyrics on their website. To avoid removing the string from a song that has it as part of the actual lyrics, I only remove it at specific positions: at the very end of the lyrics or directly before a section tag.
- If you are interested in obtaining the song parts (verse, chorus, etc.), these are tagged by Genius in brackets ([ ]). Genius numbers each verse by its position in the song (verse 1, verse 2, ...); for simplicity, I treat all of them as a plain verse
- Transform all text into lower case
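
To make the tag handling in the list above concrete, here is a minimal sketch run on a made-up lyric string. The helper name `normalize_section_tags` and the sample text are assumptions for illustration only, and the number-stripping pattern is a slightly tightened variant that stops at the closing bracket.

```python
import re

def normalize_section_tags(lyrics):
    # Hypothetical helper mirroring the tag-related steps described above
    # Drop an intrusive "You might also like" that sits directly before a tag
    lyrics = lyrics.replace('You might also like[', '[')
    # Remove the numbering inside a tag: "[Verse 1]" becomes "[Verse ]"
    lyrics = re.sub(r'\[([^0-9\]]+)\s*\d+\]', r'[\1]', lyrics)

    def first_word(match):
        # Keep only the first word of the tag: "[Chorus: Someone Else]" becomes "[Chorus:]"
        words = match.group(1).split()
        return '[' + (words[0] if words else '') + ']'

    lyrics = re.sub(r'\[([^\]]+)\]', first_word, lyrics)
    # Drop the colon left over from tags such as "[Chorus:]"
    lyrics = lyrics.replace(':]', ']')
    # Join the lines with spaces and lower-case everything
    return ' '.join(lyrics.splitlines()).lower()

sample = (
    "[Verse 1]\nfirst line\n"
    "You might also like[Chorus: Someone Else]\nsecond line\n"
    "[Verse 2]\nThird Line"
)
print(normalize_section_tags(sample))
```

All verses end up under the same `[verse]` tag, which keeps the set of distinct tags small.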

In [3]:
def lyrics_cleanup_parts(lyrics,remove_break=True):
    r'''
    This function receives as input:
    - lyrics: a string with the lyrics of the song
    - remove_break: a boolean that indicates if you want to remove the line breaks (\n) at the end of each line in the song

    The function outputs a string with the formatted lyrics
    '''
    #Every lyric has the number of contributors at the beginning and an "Embed" marker at the end
    #Remove both for the sake of my sanity
    #Processing the after part
    #Strip the trailing "Embed" together with any digits that come right before it
    #If the lyrics do not end with "Embed", nothing is removed
    lyrics = re.sub(r'\d*Embed$', '', lyrics)
    #Sometimes the songs have a "you might also like" right at the end
    #Also get rid of that 
    if lyrics[-20:] == '\nYou might also like':
        lyrics = lyrics[:-20]
    #Processing the before part
    #There is a number, a space and then contributors. the space and contributors substring contributes 13 characters
    #So I count through the other characters until the number is gone
    cand = 0
    while lyrics[cand].isdigit():
        cand += 1
    #Re-write the lyrics
    lyrics = lyrics[cand+13:]
    #The Genius formatting has the song title and then the lyrics afterwards. The lyrics start at the
    #first section tag, so I drop everything before the first '['
    cand = lyrics.find('[')
    #find() returns -1 when there is no tag at all; in that case leave the lyrics untouched
    if cand != -1:
        lyrics = lyrics[cand:]
    #Sometimes the lyrics have another \n at the beginning, if that is the case I also get rid of it
    if lyrics.startswith('\n'):
        lyrics = lyrics[1:]
    #There might be other you might also like that are intrusive, so I remove them as well
    lyrics = lyrics.replace('You might also like[', '[')
    #Remove the numbers from the tags, e.g. [Verse 1] becomes [Verse ]
    pattern = re.compile(r'\[([^0-9]+)\s*\d+\]')
    #Use sub() to rewrite each matched tag without its number
    lyrics = pattern.sub(r'[\1]', lyrics)
    #Remove the whitespace left before the closing bracket, e.g. [Verse ] becomes [Verse]
    lyrics = lyrics.replace(' ]',']')
    #Keep only the first word inside each pair of brackets, e.g. [Chorus: Artist] becomes [Chorus:]
    pattern = r'\[([^\]]+)\]'
    lyrics = re.sub(pattern, lambda match: '[' + match.group(1).split()[0] + ']', lyrics)
    #Remove the colon left at the end of a tag, e.g. [Chorus:] becomes [Chorus]
    lyrics = lyrics.replace(':]',']')
    #After that is done, remove all newlines
    if remove_break:
        lyrics = ' '.join(lyrics.splitlines())
    #Make everything lowercase
    lyrics = lyrics.lower()
    #Ticket advertisements are removed separately by remove_ticket_info
    return lyrics

Some artists also have an annoying message advertising live tickets intertwined with the lyrics. Because that message includes the artist name, I create a function that receives the artist name and the lyrics and removes it.

In [4]:
def remove_ticket_info(artist_name, lyrics):
    #Transform artist_name to lower
    artist_name = artist_name.lower()
    # Construct the regex pattern using re.escape() for safe inclusion of artist_name
    #pattern = re.compile(re.escape(artist_name) + r'\s*liveget\s+tickets\s+as\s+low\s+as\s+\$\d+\s*', re.IGNORECASE)
    pattern = re.compile(r'see\s+' + re.escape(artist_name) + r'\s*liveget\s+tickets\s+as\s+low\s+as\s+\$\d+\s*', re.IGNORECASE)
    # Use re.sub() to remove the matched expression from the lyrics
    result = re.sub(pattern, '', lyrics)
    return result.strip()
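
To see what the pattern targets, the sketch below applies the same regular expression to a made-up string shaped like a cleaned lyric with the advertisement in the middle. The helper name `strip_ticket_ad`, the sample text and the price are all assumptions for illustration.

```python
import re

def strip_ticket_ad(artist_name, lyrics):
    # Same idea as remove_ticket_info: build the pattern around the escaped artist name
    pattern = re.compile(
        r'see\s+' + re.escape(artist_name)
        + r'\s*liveget\s+tickets\s+as\s+low\s+as\s+\$\d+\s*',
        re.IGNORECASE,
    )
    return pattern.sub('', lyrics).strip()

# Made-up cleaned lyrics with the advertisement in the middle
sample = "[verse] first line see the who liveget tickets as low as $45 second line"
print(strip_ticket_ad("The Who", sample))
```

Escaping the artist name matters for names that contain characters with a special meaning in regular expressions, such as a dot or a plus sign.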

## Data Fetching

Having defined the functions, it is time to run the code! The code first retrieves the data for one artist, and then iterates through the artist_list, concatenating the data to the original dataframe.

Please note that, as written, the csv file with the lyrics is overwritten every time you fetch data for a new artist. This is useful if you have an intermittent connection and are unsure whether the fetch will finish without interruption, since the data fetched so far is already on disk. If you prefer to write the file only once, move the _to_csv_ call so it runs after the for loop.

In [5]:
#Put your artist list here; you can load it from a txt or csv file if you would like
artist_list = ["The Who"]
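
If you keep your artists in a text file, a minimal sketch for loading them could look like the one below. The helper name `load_artist_list` and the one-artist-per-line layout are assumptions for this example.

```python
from pathlib import Path

def load_artist_list(path):
    # Hypothetical helper: expects one artist per line,
    # skipping blank lines and repeated names while keeping the original order
    artists = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        name = line.strip()
        if name and name not in artists:
            artists.append(name)
    return artists
```

You would then assign the result to `artist_list`, passing the path of your own file.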

In [6]:
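#Store data periodically
#Get first time
t0 = time.time()
df = dataFrameFetcher(t0,list_of_bands=artist_list[:1], num_of_song=20)
#Save the first artist right away so the file exists even if the list has a single artist
df.to_csv('Lyrics50WithParts.csv',index=False)
for artist in artist_list[1:]:
    #Fetch new data
    dfres = dataFrameFetcher(t0,list_of_bands=[artist], num_of_song=20)
    #ignore_index keeps the row index unique after concatenating
    df = pd.concat([df, dfres], ignore_index=True)
    #Overwrite the file with everything fetched so far
    df.to_csv('Lyrics50WithParts.csv',index=False)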

Fetching songs for artist: The Who
Time elapsed so far: 00:00:00
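
Since the file on disk always holds what has been fetched so far, an interrupted run can be resumed by skipping the artists that are already saved. The sketch below is one way to do that. The helper name `remaining_artists` is hypothetical, and it assumes that the name stored in the Artist column matches the name in your list apart from letter case and surrounding spaces, which may not hold for every artist.

```python
import csv
import os

def remaining_artists(artist_list, csv_path):
    # Hypothetical helper: return the artists that do not yet appear in the saved csv
    if not os.path.exists(csv_path):
        return list(artist_list)
    with open(csv_path, newline="", encoding="utf-8") as f:
        # Compare names case-insensitively (an assumption about how names are stored)
        done = {(row.get("Artist") or "").strip().lower() for row in csv.DictReader(f)}
    return [artist for artist in artist_list if artist.strip().lower() not in done]
```

You could pass the result of this function to the fetching loop instead of the full list, then concatenate the new rows with the rows already saved.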


## Lyrics Processing

Having obtained all the lyrics, it is time to perform the data formatting.

I have set up the code so it stores a new column in the original dataframe with the processed lyrics. This is to have better control over the formatting and to be able to go back to the original lyrics and format them differently if needed.

In [7]:
#Apply the transformation
lyrics_cleaned = [lyrics_cleanup_parts(lyrics,remove_break=True) for lyrics in df['Lyrics']]
#Store the cleaned lyrics in a new column of the dataframe
df['Lyrics_Cleaned'] = lyrics_cleaned
#Remove the selling ticket message
df['Lyrics_Cleaned'] = df.apply(lambda row: remove_ticket_info(row['Artist'], row['Lyrics_Cleaned']), axis=1)

In [8]:
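#Some lyrics may be a float (empty lyrics), so I print them and then drop those rows
for lyric in df['Lyrics_Cleaned']:
    if not isinstance(lyric, str):
        print(lyric)
df = df[df['Lyrics_Cleaned'].apply(lambda lyric: isinstance(lyric, str))]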

In [9]:
df.to_csv('SongsDFCleanedLyricsParts.csv',index=False)

## Done

After running the code in this notebook, you will have a dataframe with formatted lyrics, ready to be used! Please refer to the Notebook with the lyric generation model for details on that process.