# SC290: Supplementary Notebook
## Downloading Lyrics

Lyrics are a fun type of text data for experimenting with different analysis techniques. With some creativity, the lyrical content of popular music can also be examined for patterns and trends of value to social science. For example:

- What are the key topics during the x period of music in the y genre?
- Do musicians get more mellow over time?
- How is class expressed through song lyrics in x period vs y period?
- What is the sentiment of the most popular music over time?

## Genius.com / Lyricsgenius

- [Genius](http://genius.com) is a website that hosts song lyrics and, critically, also provides a free API to its database of songs.
- `lyricsgenius` is a Python library that uses the API to look up songs and then scrapes the lyrics themselves from the website.

### API KEY
- Sign up for an account at [https://genius.com/signup_or_login](https://genius.com/signup_or_login)
- To get an API key, head to [https://genius.com/api-clients](https://genius.com/api-clients) and sign in.
- Click "New API Client" and fill in the form.
    - App name: Student project
    - Icon URL: leave blank
    - App website url: https://www.essex.ac.uk
    - Redirect URI: https://www.essex.ac.uk
    - Click Save
- Your API info is under 'All API Clients'
    - Click 'Generate Access Token'
    - Double click the generated token to select it all, then copy it.
    - Ideally save this to a file somewhere safe on your computer.
    - When you're ready to use it, make a new entry in the Secrets section of Google Colab. The code in this notebook expects the entry to be named `Genius`.
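
If you ever run this notebook outside Colab, the Secrets panel will not be available. The sketch below is one possible way to handle both cases: it tries Colab's `userdata` first and falls back to an environment variable. The helper name `load_genius_key` and the variable name `GENIUS_ACCESS_TOKEN` are choices made for this example, not part of the original notebook, and the token shown is a placeholder.

```python
import os


def load_genius_key(secret_name="Genius", env_var="GENIUS_ACCESS_TOKEN"):
    """Return the Genius token from Colab Secrets, or from an environment variable.

    Both default names are assumptions for this sketch; use whatever you
    actually called your secret or variable.
    """
    try:
        # This import only succeeds when running inside Google Colab
        from google.colab import userdata
    except ImportError:
        key = os.environ.get(env_var)
    else:
        key = userdata.get(secret_name)

    if not key:
        raise RuntimeError(
            f"No Genius token found. Add a Colab secret named '{secret_name}' "
            f"or set the {env_var} environment variable."
        )
    return key


# Demonstration with a placeholder value (not a real token)
os.environ["GENIUS_ACCESS_TOKEN"] = "example-token-not-real"
GENIUS_KEY = load_genius_key()
print("Token loaded, length:", len(GENIUS_KEY))
```

Either way, the token never needs to be typed into the notebook itself, which matters if you share or submit the file.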







In [None]:
# Uncomment the line below and run this cell to install lyricsgenius library
# !pip install lyricsgenius

In [None]:
# Use Colab's Secrets section to store your key,
# then use the code below to import it and assign it to GENIUS_KEY

from google.colab import userdata
GENIUS_KEY = userdata.get('Genius') # The string should match what you named the key in the secrets panel.


In [None]:
from lyricsgenius import Genius

# Set up your API connection
genius = Genius(GENIUS_KEY)
genius.remove_section_headers = True

# Set it up with terms that filter out results for you, such as live albums.
genius.excluded_terms = ["(Live)"]

In [None]:
# First we get the API's ID for the artist we want by searching for them and returning just one song.
artist = genius.search_artist("Lawrence", get_full_info=True, max_songs=1)

In [None]:
# The internal ID is accessed using the .id attribute
print('Artist ID is:', artist.id)

In [None]:
# Now we check for all the albums available for this artist

all_albums = []
for album in genius.artist_albums(artist.id)['albums']:
  name = album['name']
  album_id = album['id']
  album_release_date = album['release_date_for_display']
  record = (name, album_id, album_release_date)
  all_albums.append(record)

all_albums
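
The loop above reads a single response from `genius.artist_albums`. For artists with many releases the API may split the results across several pages. The sketch below shows the general pattern for collecting every page. It assumes each response is a dictionary containing a `next_page` entry that is empty on the last page; check this against a real response before relying on it. `collect_all_pages` and the fake data are hypothetical names used only for illustration, so the example runs without an API key.

```python
def collect_all_pages(fetch_page, items_key):
    """Call fetch_page(page) repeatedly until there is no next page.

    fetch_page: a function taking a page number and returning a dict.
    items_key:  the key holding the list of results, e.g. 'albums'.
    """
    items = []
    page = 1
    while page:
        response = fetch_page(page)
        items.extend(response[items_key])
        # Assumption: the response reports the next page number, or None at the end
        page = response.get("next_page")
    return items


# Fake two-page response standing in for the real API
FAKE_PAGES = {
    1: {"albums": [{"name": "Album A", "id": 1}, {"name": "Album B", "id": 2}], "next_page": 2},
    2: {"albums": [{"name": "Album C", "id": 3}], "next_page": None},
}


def fake_artist_albums(page):
    return FAKE_PAGES[page]


albums = collect_all_pages(fake_artist_albums, "albums")
print([album["name"] for album in albums])
```

With a real connection, the first argument could be something like `lambda page: genius.artist_albums(artist.id, page=page)`, assuming your version of `lyricsgenius` accepts a `page` argument there.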


In [None]:

from time import sleep
import pandas as pd

# You can either use the all_albums list or hand pick albums. Below I hand pick the five main ones.
chosen_album_ids = [('Family Business', 1093905, 'June 21, 2024'),
                    ('Hotel TV', 771545, 'July 23, 2021'),
                    ('Living Room', 419506, 'September 14, 2018'),
                    ('Breakfast', 288208, 'March 11, 2016'),
                    ('Homesick', 288971, 'January 5, 2013')]

# Set up a blank list
dataframes = []

# Each entry is a tuple, which we unpack into the album name, album id and album release date
for name, album_id, album_release_date in chosen_album_ids:

    # We get all the tracks from the album
    album_tracks = genius.album_tracks(album_id)['tracks']

    # we iterate through each track and take the song information and put all this in a dataframe
    album_df = pd.DataFrame([x['song'] for x in album_tracks])

    # we select the most useful information to keep, critically the track ids
    album_df = album_df[['id','title','artist_names','release_date_for_display']]
    
    # We convert the string release dates for all the tracks to datetime (format='mixed' needs pandas 2.0 or later).
    # Afterwards we can drop the string column
    album_df['track_release_date'] = pd.to_datetime(album_df['release_date_for_display'], format='mixed')
    album_df = album_df.drop(columns=['release_date_for_display'])

    # We convert the album release date to datetime too and assign it as the value for every row of the dataframe
    album_df['album_release_date'] = pd.to_datetime(album_release_date)

    # we also assign the album name to every row under album_name
    album_df['album_name'] = name
    
    # Add the album's dataframe to the list
    dataframes.append(album_df)

    # give the API a second to rest
    sleep(1)


In [None]:
# We join the dataframes into one big dataframe and reset the index
data_df = pd.concat(dataframes).reset_index(drop=True)
# We also make a blank column filled with empty values
data_df['lyrics'] = pd.NA
data_df.info()

In [None]:
# We use the .lyrics method to finally pull the lyrics for each track

# First we get a subset of the data where lyrics are missing - at first this will be all tracks.
tracks_missing_lyrics = data_df[data_df['lyrics'].isna()]

# We iterate over each row in your dataframe where lyrics are missing
for index, row in tracks_missing_lyrics.iterrows():

  # We get the lyrics using the track id
  lyrics = genius.lyrics(row['id'])

  # We locate the correct row in the main dataframe using the index, and assign the lyrics to the lyrics column
  data_df.loc[index,'lyrics'] = lyrics

  # Report and give the scraper a 2 second rest
  print(f'Retrieved Lyrics: {index+1}/{len(data_df)} - {row["title"]}')
  sleep(2)
  # If you get an error, re-run the cell. It will pick up where it left off rather than re-running the whole job,
  # because only the rows that are still missing lyrics are selected at the top.
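
Re-running the cell by hand works, but temporary network errors can also be retried automatically. The sketch below is one possible approach, not part of the original notebook: `fetch_with_retry` is a hypothetical helper that calls any function and waits a little longer after each failure. A deliberately unreliable stand-in replaces `genius.lyrics` so the example runs without a connection.

```python
import time


def fetch_with_retry(fetch, song_id, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call fetch(song_id), pausing and retrying if it raises an error."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch(song_id)
        except Exception as error:  # broad on purpose: any failure triggers a retry
            if attempt == attempts:
                raise  # out of attempts, so let the error surface
            wait = base_delay * attempt  # 2s, then 4s, and so on
            print(f"Attempt {attempt} failed ({error}); retrying in {wait:.0f}s")
            sleep(wait)


# Stand-in for genius.lyrics that fails twice before succeeding
calls = {"count": 0}


def flaky_fetch(song_id):
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("simulated timeout")
    return f"lyrics for song {song_id}"


# The real pause is swapped out here so the demonstration finishes instantly
result = fetch_with_retry(flaky_fetch, 12345, sleep=lambda seconds: None)
print(result)
```

In the loop above, the lyrics line could then become `lyrics = fetch_with_retry(genius.lyrics, row['id'])`, keeping the existing `sleep(2)` between tracks.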


In [None]:
# Check you have lyrics for all tracks
data_df.info()

In [None]:
# Check the head
data_df.head()

In [None]:
# Save the data
data_df.to_parquet('lyrics_data.parquet')