# Data Acquisition - Lyrics

I selected a mix of artists I like to listen to, along with some I know like to say their own name in songs. I will use the lyricsgenius library to call the Genius API and pull songs for each artist, gathering the artist name, title, release year, and lyrics of each song.

### Imports

In [1]:
import pandas as pd
import numpy as np

import time   
import lyricsgenius # lyricsgenius library, a Python client for the Genius API
from requests.exceptions import Timeout # caught so a timed-out request can be retried during the pull

In [2]:
# np.set_printoptions(threshold=np.inf)

### Getting the Lyrics
I have made a list of artists: some I listen to, some I know self-announce, and others for variety. There are 130 artists in total, and I will pull up to 100 songs (if available) from each. About 50 of these artists have, to my knowledge, announced themselves in past tracks.

<p> In total I will hopefully have up to 13,000 rows of songs and lyrics, which will be trimmed if there are any missing values or duplicates. I will also need to make a target column and, depending on the class balance, I may need to randomly drop some tracks to rebalance the dataset.</p>

In [3]:
gather = [] # empty list to collect all song titles from each artist

In [4]:
chosen_artists = ['Jason Derulo', 'CAKE', 'System of a Down', 'Nicki Minaj',
       'Britney Spears', 'BLACKPINK', 'Young Thug', 'Lady Gaga',
       'Shakira', 'Ludacris', 'Akon', 'Sean Paul', 'Usher', 'Dua Lipa',
       'The Weeknd', 'Cardi B', 'Miley Cyrus', 'Logic', 'Post Malone',
       'Ini Kamoze', 'Rachel Platten', 'Sia', 'Muse', 'Maroon 5',
       'Christina Perri', 'Lorde', 'Lil Wayne', 'Diplo',
       'Florida Georgia Line', 'Brad Paisley', 'Thomas Rhett',
       'Spencer Crandall', 'Jennifer Lopez', 'Ozuna',
       'Natasha Bedingfield', 'Daddy Yankee', 'Demi Lovato',
       'Major Lazer', 'Imagine Dragons', 'Gotye', 'Birdy',
       'Matchbox Twenty', 'Uncle Kracker', 'John Newman', 'Bruno Mars',
       'P!nk', 'Lana Del Rey', 'Keane', 'A$AP Ferg', 'JoJo', 'Gorillaz',
       'Rage Against the Machine','The Proclaimers',
       'Mumford & Sons', 'Run The Jewels', 'Fischerspooner', 'Yasiin Bey',
       'Pharrell Williams', '“Weird Al” Yankovic', 'Tones and I',
       'Billie Eilish', 'Ariana Grande', 'Megan Thee Stallion',
       'Doja Cat', 'DaBaby', 'Halsey', 'Rick Astley', 'Shawn Mendes',
       'Justin Bieber', '24kGoldn', 'Lizzo', 'Katy Perry',
       'Iggy Azalea', 'Lil Eazzyy', 'Future', 'Moneybagg Yo', 'Saweetie',
       'AWOLNATION', 'Weezer', 'MGMT', 'Chiddy Bang', 'Nas', 'Snoop Dogg',
       'twenty one pilots', 'The Prodigy', 'Missy Elliot', 'Eve',
       'The Presidents of the United States of America', 'Meek Mill',
       'Drake', 'Lil Pump', 'Pusha T', 'Pitbull', 'Kesha', 'Macklemore',
       'T-Pain','Ellie Goulding', 'Tenacious D', 'Sublime',
       'The Notorious B.I.G.', '2Pac', 'Colbie Caillat', 'Flo Rida',
       'Gucci Mane', 'Young Money', 'Migos', 'Yo Gotti', 'G-Eazy',
       'Foo Fighters', 'Kehlani', 'Princess Nokia', 'French Montana',
       'Backstreet Boys', 'Spice Girls', 'Soulja Boy', 'ZAYN',
       'Travis Scott', 'Sam Smith', 'The Beatles', 'DJ Khaled',
       'Stone Temple Pilots', 'Big Sean', 'Ashnikko', 'Clean Bandit',
       'Kane Brown', 'Juice WRLD', 'Ava Max', 'J Balvin', 'Taylor Swift',
       '21 Savage']

In [5]:
len(chosen_artists)

130
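
The rebalancing step mentioned earlier in this section (randomly dropping tracks from the larger class) could look something like the sketch below. It is only an illustration: the `demo` dataframe, its `target` column, and the `random_state` value are hypothetical stand-ins, since the real target column is not built until the next notebook.

```python
import pandas as pd

# Hypothetical labelled data: 1 = the artist says their own name, 0 = they do not.
demo = pd.DataFrame({
    "Title": [f"Song {i}" for i in range(10)],
    "target": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0],
})

# Size of the smaller class; every class is downsampled to this many rows.
minority_size = demo["target"].value_counts().min()

# Randomly keep `minority_size` rows from each class, then stitch them together.
balanced = pd.concat(
    [group.sample(n=minority_size, random_state=42)
     for _, group in demo.groupby("target")]
).reset_index(drop=True)

print(balanced["target"].value_counts())
```

Downsampling throws data away, so it probably only makes sense if the larger class is much bigger than the smaller one.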

### Get Lyrics Function
Using my Genius token, this function calls the Genius API through the lyricsgenius library. For each artist in the chosen_artists list, it tries to pull up to 100 songs for that artist, then appends them to the gather list.
A timeout setting and a retry counter let the code try each artist up to 3 times if a request times out before moving on to the next artist.

In [8]:
# Code revision from Allerter on Github (My original ran into timeout issues randomly)
# https://github.com/johnwmillr/LyricsGenius/issues/121#issuecomment-704448192 

def get_lyrics(): # no arguments needed
    genius = lyricsgenius.Genius(API) # API holds the Genius access token
    genius.timeout = 15 # seconds to wait for a response before timing out
    genius.sleep_time = 5 # seconds to pause between requests (the attribute is sleep_time, not sleep)
    for human in chosen_artists: # for each artist in the pre-selected list of artists
        retries = 0 # retry counter
        while retries < 3: # allow up to 3 attempts per artist
            try:
                artist = genius.search_artist(human, max_songs=100) # look for the artist and pull up to 100 songs
                if artist is not None: # search_artist returns None when no artist is found
                    gather.append(artist.songs) # append this artist's songs to the list
                break # go to the next artist
            except Timeout: # on timeout, count the attempt and try again
                retries += 1
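
To show the retry idea on its own, here is a small sketch that does not touch the Genius API at all. The helper `call_with_retries` and the fake `flaky_fetch` function are hypothetical names made up for this illustration, and the built-in `TimeoutError` stands in for `requests.exceptions.Timeout`.

```python
import time


def call_with_retries(func, max_retries=3, retry_on=(TimeoutError,), wait=0.0):
    """Call func(); try again if it raises one of the exceptions in retry_on.

    Returns func()'s result, or None if all max_retries attempts failed.
    """
    for _ in range(max_retries):
        try:
            return func()
        except retry_on:
            time.sleep(wait)  # optional pause before the next attempt
    return None


# Hypothetical stand-in for an API call that times out twice, then succeeds.
attempts = {"count": 0}


def flaky_fetch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated timeout")
    return ["Song A", "Song B"]


songs = call_with_retries(flaky_fetch)
print(songs, attempts["count"])  # ['Song A', 'Song B'] 3
```

With the real library, the same helper would need `retry_on=(Timeout,)` using the `Timeout` imported from `requests.exceptions`, because that class is not a subclass of the built-in `TimeoutError`.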

In [9]:
get_lyrics()

Searching for songs by Jason Derulo...

Song 1: "Swalla"
Song 2: "Talk Dirty"
Song 3: "Wiggle"
Song 4: "Trumpets"
Song 5: "Tip Toe"
Song 6: "Whatcha Say"
Song 7: "Want to Want Me"
Song 8: "If I’m Lucky"
Song 9: "Marry Me"
Song 10: "Colors"
Song 11: "Get Ugly"
Song 12: "Take You Dancing"
Song 13: "The Other Side"
Song 14: "It Girl"
Song 15: "Ridin’ Solo"
Song 16: "In My Head"
Song 17: "If It Ain’t Love"
Song 18: "Bubblegum"
Song 19: "Cheyenne"
Song 20: "Don’t Wanna Go Home"
Song 21: "Stupid Love"
Song 22: "Kama Sutra"
Song 23: "Vertigo"
Song 24: "Mamacita"
Song 25: "Try Me"
Song 26: "Rest of Our Life"
Song 27: "Kiss the Sky"
Song 28: "Naked"
Song 29: "Fight for You"
Song 30: "Breathing"


KeyboardInterrupt: 

Next, I create empty lists to hold the lyrics, release year, title, artist name, and featured artists of each track.

In [None]:
lyrics = []
release_year = []
title = []
artist_name = []
featured = []

In [None]:
gather

Each entry in the gather list holds the songs pulled for one artist by the function above. I iterate through every song and append its attributes to the empty lists above.
This is nice because, unlike calling the Genius API directly, the lyricsgenius library exposes this information on the song object, so there is no need to gather the URL and look up each song. 👀

In [None]:
for fetch in gather:
    for song in fetch:
        release_year.append(song.year)
        lyrics.append(song.lyrics)
        title.append(song.title)
        artist_name.append(song.artist)
        featured.append(song.featured_artists)
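
As an aside, the same flattening can be done with one list of dictionaries instead of five parallel lists, which keeps each song's fields together. This is only a sketch: `gather_demo` and the `SimpleNamespace` objects are hypothetical stand-ins for the real song objects, and only the attribute names are taken from the loop above.

```python
from types import SimpleNamespace

# Hypothetical stand-ins for the nested structure of `gather`:
# one inner list per artist, one object per song.
gather_demo = [
    [
        SimpleNamespace(artist="Artist A", title="First", year="2015-01-01",
                        lyrics="la la", featured_artists=[]),
        SimpleNamespace(artist="Artist A", title="Second", year=None,
                        lyrics="na na", featured_artists=["Artist B"]),
    ],
    [
        SimpleNamespace(artist="Artist B", title="Third", year="2019-06-30",
                        lyrics="", featured_artists=[]),
    ],
]

fields = ("artist", "title", "year", "lyrics", "featured_artists")

# One dictionary per song, walking artists first and then their songs.
records = [
    {name: getattr(song, name) for name in fields}
    for fetch in gather_demo
    for song in fetch
]

print(len(records))         # 3
print(records[1]["title"])  # Second
```

A list like `records` can be passed straight to `pd.DataFrame(records)`, with the dictionary keys becoming the column names.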

I create a dataframe with the lists as columns and save it as a checkpoint before moving on, because the API calls took a long time.

In [None]:
df = pd.DataFrame({
    'Artist' : artist_name,
    'Featured' : featured,
    'Title' : title,
    'release_year' : release_year,
    'lyrics' : lyrics
})

In [None]:
#df.to_csv('./data/gathering_lyrics.csv') # Checkpoint
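
A quick way to check that a checkpoint like this survives a round trip is sketched below. The `demo` dataframe and the temporary file name are hypothetical; the sketch also passes `index=False`, which the commented-out call above does not, so the reloaded file has no extra index column.

```python
import os
import tempfile

import pandas as pd

# Small hypothetical frame, including one song with missing lyrics.
demo = pd.DataFrame({
    "Artist": ["Artist A", "Artist B"],
    "Title": ["First", "Second"],
    "lyrics": ["la la", None],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "gathering_lyrics_demo.csv")
    demo.to_csv(path, index=False)  # write the checkpoint without the row index
    restored = pd.read_csv(path)    # reload it as the next notebook would

print(restored.shape)                   # (2, 3)
print(restored["lyrics"].isna().sum())  # 1
```

Missing lyrics come back as NaN after the reload, which is what the later cleaning step would need to look for.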

In the next notebook I will do some cleaning: filling in lyrics that didn't get pulled and dropping songs that have no lyrics.
Once the cleaning is done, I will create a target column based on whether the artist or a featured artist says their own name or an alias in the song.
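
As a rough preview of that target column, one possible approach is a case-insensitive search for the artist's name or aliases in the lyrics. Everything here is an assumption for illustration: the `says_own_name` helper, the `aliases` table, and the sample rows are hypothetical, and the real rule in the next notebook may differ.

```python
import re


def says_own_name(lyrics, names):
    """Return 1 if any entry of `names` appears in `lyrics` as a whole word or phrase, else 0."""
    if not lyrics:
        return 0  # missing or empty lyrics cannot contain a name
    for name in names:
        # Lookarounds instead of \b so names ending in punctuation still match.
        pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
        if re.search(pattern, lyrics, flags=re.IGNORECASE):
            return 1
    return 0


# Hypothetical alias table; a real one would have to be curated by hand.
aliases = {"Artist A": ["Artist A", "Double A"]}

rows = [
    {"Artist": "Artist A", "lyrics": "It's Double A on the track"},
    {"Artist": "Artist A", "lyrics": "Nothing to see here"},
    {"Artist": "Artist B", "lyrics": None},
]

for row in rows:
    names = aliases.get(row["Artist"], [row["Artist"]])
    row["target"] = says_own_name(row["lyrics"], names)

print([row["target"] for row in rows])  # [1, 0, 0]
```

A plain substring match like this will miss stylised spellings and will flag common words for artists with short names, so the results would need spot-checking.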
