# AZLyrics Scraper

With this code you can scrape individual lyrics pages or whole artist pages for some of their basic data. The information is saved in a CSV file. To retrieve the data of the artist or band you want, replace the link within the quotation marks below with the URL of the AZLyrics artist page (make sure you use the full link).

In [1]:
index_url = 'https://www.azlyrics.com/n/nothingbutthieves.html'

## Setting up

We start by importing the Python libraries we need:

In [2]:
import sys
import csv
import requests
import time
import random
import pandas as pd
from bs4 import BeautifulSoup

Then we define functions to retrieve pages from the web and to look for certain elements within them:

In [3]:
#makes the default request
def load_page(url):
    with requests.get(url) as f:
        page = f.text
    return page

#extracts a text element from a html tag
def get_element_text(element):
    try:
        return element.text.strip()
    except AttributeError as e:                     
        print('Element not found, error: {}'.format(e), file=sys.stderr)
        return ''

#extracts the genre from the third script tag in the page body
def get_genre_element(html):
    try:
        genre_script = html.find('body').find_all('script')[2]
        genre = str(genre_script).split()[9][1:-2]
        return genre
    except (AttributeError, IndexError) as e:       #missing body, too few script tags, or too few tokens
        print('Element not found, error: {}'.format(e), file=sys.stderr)
        return ''
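
The helpers catch `AttributeError` because BeautifulSoup's `find` returns `None` when nothing matches, and `None` has no `.text` attribute. The sketch below repeats `get_element_text` and calls it on a stand-in object. `types.SimpleNamespace` is used here only as a hypothetical substitute for a real tag, so the example runs without a network connection.

```python
import sys
from types import SimpleNamespace

def get_element_text(element):
    try:
        return element.text.strip()
    except AttributeError as e:
        print('Element not found, error: {}'.format(e), file=sys.stderr)
        return ''

# Stand-in for a parsed tag (assumption: only the .text attribute matters here)
fake_tag = SimpleNamespace(text='  Graveyard Whistling \n')

found = get_element_text(fake_tag)   # whitespace around the text is stripped
missing = get_element_text(None)     # what a failed find() would hand over

print(repr(found))
print(repr(missing))
```

The second call prints a message to stderr and returns an empty string instead of stopping the scrape.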

## Getting all artist songs

We proceed to make a function that gets some basic information for each song on an artist's page on AZLyrics. From the list of songs and albums we extract the song title and the link to the specific song page, and we generate an id for every song. Each song is stored as a dictionary in the list `data`, which the function returns.

In [4]:
def get_songs(url):
    artist_page = BeautifulSoup(load_page(url), 'lxml')        
    items = artist_page.find(id="listAlbum")                   
    if not items:                                             
        print('Something went wrong!', file=sys.stderr)
        sys.exit()
    data = []
    for count, row in enumerate(items.find_all(class_= 'listalbum-item')):          
        song = get_element_text(row.find('a'))
        link = row.find('a').get('href')
        link = 'https://www.azlyrics.com/' + str(link)
        data.append({    
                         'id' : 's' + str(count + 1),
                         'song': song,
                         'link': link,
                        })
    return data
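
The link above is built by string concatenation, which produces a double slash if the `href` already starts with `/`. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves an `href` against the page it came from. The `href` values below are hypothetical examples; the format actually used on the site may differ.

```python
from urllib.parse import urljoin

index_url = 'https://www.azlyrics.com/n/nothingbutthieves.html'

# Hypothetical href values: root-relative, parent-relative, and already absolute
hrefs = [
    '/lyrics/nothingbutthieves/itch.html',
    '../lyrics/nothingbutthieves/itch.html',
    'https://www.azlyrics.com/lyrics/nothingbutthieves/itch.html',
]

# urljoin resolves each href relative to the artist page URL
links = [urljoin(index_url, href) for href in hrefs]
for link in links:
    print(link)
```

All three forms resolve to the same absolute URL, so the scraper would not need to know which form the page uses.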

Uncomment and run the code below to test the function:

In [5]:
#test = pd.DataFrame(get_songs(index_url))
#test

## Getting the individual lyrics

Now that we can get the basic information for each song, we need a function that visits an individual song page to extract the album, the year of the album release, the song's writers, the music genre, and the lyrics. Unlike the previous function, it returns the information for a single song as a dictionary.

In [6]:
def get_song_info(url):
    song_page = BeautifulSoup(load_page(url), 'lxml')                  
    interesting_html = song_page.find(class_='container main-page')    
    if not interesting_html:
        print('No information available for song at {}'.format(url), file=sys.stderr)
        return {}                                                      
    album = get_element_text(interesting_html.find(class_='songinalbum_title').find('b'))[1:-1]
    if album == 'ou May Also Lik':
        album = album.replace('ou May Also Lik', 'other songs')
    album_released = get_element_text(interesting_html.find(class_='songinalbum_title'))[-5:-1]
    if 'Lik' in album_released:
        album_released = album_released.replace('Lik', '0')
    credits = get_element_text(interesting_html.find_all(class_='smt')[2])[11:]
    genre = get_genre_element(song_page)
    lyrics = get_element_text(interesting_html.find('div', {'class':None}))
    return {'album': album, 'album release': album_released,'credits': credits, 'genre': genre, 'lyrics': lyrics}                      
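
The fixed slices above depend on the exact text of the album line. As a sketch of an alternative, and assuming the line has the shape `album: "Title" (YYYY)` (an assumption inferred from the slicing, not verified against the site), a regular expression can pull out both values and fall back to `'other songs'` and `0` when no album is listed. `parse_album_line` is a hypothetical helper, not part of the scraper above, and it returns the year as an integer rather than a string.

```python
import re

# Assumed shape of the album line: ... "Album Title" (YYYY)
ALBUM_PATTERN = re.compile(r'"(?P<album>.+)"\s*\((?P<year>\d{4})\)\s*$')

def parse_album_line(text):
    """Return (album, year); fall back to ('other songs', 0) if nothing matches."""
    match = ALBUM_PATTERN.search(text.strip())
    if match is None:
        return 'other songs', 0
    return match.group('album'), int(match.group('year'))

with_album = parse_album_line('album: "Graveyard Whistling" (2014)')
without_album = parse_album_line('You May Also Like')

print(with_album)
print(without_album)
```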

Uncomment and run the code below to test the function:

In [7]:
#song_url = 'https://www.azlyrics.com/lyrics/bobdylan/highlands.html'  #you can replace this url with any song on AZLyrics
#song_info = get_song_info(song_url)
#for key, value in song_info.items():
#    if key == 'lyrics':      #you can change 'lyrics' with any of the keys in the dictionary
#        print(value)

## Scraping

The following code applies the previously defined functions to scrape the relevant information from AZLyrics. First, `get_songs(url)` collects the songs listed on the artist page. Then we loop through the links it returned and call `get_song_info(url)` on every song page. The result for each song is merged into that song's dictionary in `song_data`. This may take a while, depending on the number of songs released by the artist, because the loop pauses for a few seconds between requests.

In [8]:
song_data = get_songs(index_url)                      
for row in song_data:
    #print('Scraping info on {}.'.format(row['song'])) #can be useful for debugging
    url = row['link']
    song_info = get_song_info(url)                    
    for key, value in song_info.items():
        row[key] = value
    time.sleep(random.uniform(3,8)) #shorter pause between requests: use this if you want to download fewer than 100 songs
    #time.sleep(random.uniform(4,16)) #longer pause between requests: use this instead if you want to download more than 100 songs
print('Finished scraping')

Finished scraping
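
The merge-and-pause loop can also be sketched as a small function. `enrich_rows` and its parameters are hypothetical names, not part of the notebook. Passing the fetch function in as an argument lets the merge step be tried with a stub instead of real requests.

```python
import random
import time

def enrich_rows(rows, fetch_info, delay_range=(3, 8)):
    """Merge the dictionary returned by fetch_info(link) into each row, pausing in between."""
    for row in rows:
        row.update(fetch_info(row['link']))       # same effect as copying key by key
        time.sleep(random.uniform(*delay_range))  # random pause between requests
    return rows

# Stub standing in for get_song_info, so no request is made (made-up values)
def fake_song_info(url):
    return {'album': 'Example Album', 'lyrics': 'la la la'}

rows = [{'id': 's1', 'song': 'Example Song', 'link': 'https://example.com/song.html'}]
enriched = enrich_rows(rows, fake_song_info, delay_range=(0, 0))
print(enriched[0])
```

With the real scraper, the call would be `enrich_rows(song_data, get_song_info)`, keeping the default pause of 3 to 8 seconds.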


## Writing data into CSV

In this last part we write the data we have just scraped to a CSV file.

In [9]:
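filename = index_url.rsplit('/', 1)[1][:-5]            #last part of the url without '.html', e.g. 'nothingbutthieves'
with open(filename, 'w', encoding='utf-8', newline='') as f:   #newline='' is recommended by the csv module to avoid blank rows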
    fieldnames=['id','song', 'album', 'album release', 'genre','credits', 'lyrics']
    writer = csv.DictWriter(f,
                            delimiter=',',                
                            quotechar='"',                
                            quoting=csv.QUOTE_NONNUMERIC, 
                            fieldnames=fieldnames
                            )
    writer.writeheader()                                  
    for row in song_data:
        writer.writerow({k:v for k,v in row.items() if k in fieldnames})
print('File created')

File created
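
To check what `csv.QUOTE_NONNUMERIC` produces without touching the disk, the writer can be pointed at an in-memory buffer. This is a minimal sketch with made-up rows. It also uses `extrasaction='ignore'`, a `csv.DictWriter` option that drops keys not listed in `fieldnames`, as an alternative to filtering each row with a dictionary comprehension.

```python
import csv
import io

fieldnames = ['id', 'song']
# Made-up row; 'link' is not in fieldnames and should be left out of the output
rows = [{'id': 's1', 'song': 'Example Song', 'link': 'https://example.com/song.html'}]

buffer = io.StringIO()  # stands in for the file opened above
writer = csv.DictWriter(buffer,
                        fieldnames=fieldnames,
                        quoting=csv.QUOTE_NONNUMERIC,  # quote every non-numeric field
                        extrasaction='ignore')         # silently skip keys such as 'link'
writer.writeheader()
writer.writerows(rows)

csv_text = buffer.getvalue()
print(csv_text)
```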


## Optional: Creating a Dataframe from the CSV file

This allows you to see what is in the CSV file from this Jupyter Notebook using pandas and to do operations on the data if desired.

In [10]:
df = pd.read_csv(filename) 
for index, value in df.dtypes.items(): 
    if value == 'object':
        df[index] = df[index].fillna('')
    else:
        df[index] = df[index].fillna(0)
df['album release'] = df['album release'].astype(int)
df['lyrics'] = df['lyrics'].astype('string')
df

| | id | song | album | album release | genre | credits | lyrics |
|---|---|---|---|---|---|---|---|
| 0 | s1 | Graveyard Whistling | Graveyard Whistling | 2014 | pop | Craik Dominic Alexander Roberto, Emery Julian | All that afterlife, I don't hold with it All ... |
| 1 | s2 | Emergency | Graveyard Whistling | 2014 | pop | Dominic Alexander Roberto Craik, Joseph Langri... | Let me be absolutely clear This is mine but I... |
| 2 | s3 | Itch | Graveyard Whistling | 2014 | pop | Craik Dominic Alexander Roberto, Hibbit Larry | There's a rumbling in my head It's getting lo... |
| 3 | s4 | Last Orders | Graveyard Whistling | 2014 | pop | Craik Dominic Alexander Roberto, Emery Julian | We left The Cliff Wandered down the Broadway ... |
| 4 | s5 | Excuse Me | Nothing But Thieves | 2015 | pop | Jim Irvin, Julian Emery, Conor Ryan Mason, Dom... | His space crowds out your space, your space Y... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 56 | s57 | Your Blood | Moral Panic II | 2021 | pop | Julian Emery, Jim Irvin, Dominic Craik, Joseph... | You know it's your blood that I bleed Tell me... |
| 57 | s58 | Crazy | other songs | 0 | pop | Gian Piero Reverberi, Brian Joseph Burton, Gia... | I remember when I remember, I remember when I... |
| 58 | s59 | Holding Out For A Hero | other songs | 0 | pop | Jim Steinman, Dean Pitchford | Where have all the good men gone And where ar... |
| 59 | s60 | Life's Coming In Slow | other songs | 0 | pop | | The sand in the hourglass keeps dripping away ... |
