The website https://lyrics.fandom.com lists its artist pages in index pages whose URLs have the following format:
https://lyrics.fandom.com/wiki/Category:Artists_A?from=Aa <br>
All index page URLs (one per combination of initial letter and second letter) can therefore be generated with the following code:

In [8]:
from string import ascii_lowercase as alphabet # the 26 letters 'a'..'z'

category_artist_urls = []
for letter in alphabet:
    for second_letter in alphabet:
        # e.g. https://lyrics.fandom.com/wiki/Category:Artists_J?from=Ja
        category_artist_urls.append('https://lyrics.fandom.com/wiki/Category:Artists_' + letter.upper() + "?from=" + letter.upper() + second_letter)


In [9]:
print("There are", len(category_artist_urls), "different artist index pages.")
print("An example artist index page URL is:", category_artist_urls[234])

There are 676 different artist index pages.
An example artist index page URL is: https://lyrics.fandom.com/wiki/Category:Artists_J?from=Ja


Import BeautifulSoup and urllib for web scraping

In [1]:
try:
    import urllib.request as urllib2
except ImportError:
    import urllib2 
from bs4 import BeautifulSoup

The number of artists in some index pages exceeds what fits on a single page. In such cases, the remaining artists are listed on further index pages whose URLs are formed from the name of the first artist on each page (e.g. https://lyrics.fandom.com/wiki/Category:Artists_A?from=Above+%26+Beyond) rather than following the standard two-letter format (such as https://lyrics.fandom.com/wiki/Category:Artists_J?from=Ja). <br> 
The following code detects such overflowing pages and retrieves these name-based ("random") URLs, so that **all** artist index pages are recorded in a list.

In [12]:
# first find the index pages that have the exception described above, and store them in a list called exceptions
exceptions = []
for url in category_artist_urls:
    try:
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page, 'html.parser')
        # check the page content to see if there is an html class that contains the 'next' button, indicating that
        # this index page overflows to subsequent index pages containing other artists with the same initial letter
        content = soup.find_all('a', {"class": "category-page__pagination-next wds-button wds-is-secondary"})
        if content == []: # means that there is no 'next' button in the page, so there is no exception
            continue
        else: # add the exceptional URL to the list
            exceptions.append(url)
    except urllib2.HTTPError:
        pass
    
# then, use a handler function to go through all the index pages with exceptional URLs, 
# and collect the artist page URLs stored in the html code of those index pages

additional_category_pages = [] # a list to collect those additional pages
def exception_getter(url): # the handler function for retrieving additional random category page URLs
    try:
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page, 'html.parser')
        
        #find the url that is attached to the 'next' button in the page
        content = soup.find_all('a', {"class": "category-page__pagination-next wds-button wds-is-secondary"})
        next_url = content[0]['href']
        
        # find the starting two-letters of the first artist of the next page
        index_start_next = next_url.find("from")
        first_two_letters_next = next_url[index_start_next+5:index_start_next+7]
        
        # also record the first two letters of the first artist in the current page
        index_start_now = url.find('from')
        first_two_letters_now = url[index_start_now+5:index_start_now+7]
            
        # recurse until the next page's first artist starts with a different two-letter prefix
        if first_two_letters_next == first_two_letters_now:
            additional_category_pages.append(next_url)
            exception_getter(next_url)
        else:
            return
    except:
        print(url,"page not accessed")


# finally, go to each of those index pages, follow until the page doesn't contain any artist that starts with 
# the corresponding character bigram, and collect all the random URLs generated using the handler function above
for exception in exceptions:
    exception_getter(exception)

https://lyrics.fandom.com/wiki/Category:Artists_D?from=Dutch+Kills page not accessed
https://lyrics.fandom.com/wiki/Category:Artists_E?from=Evolve page not accessed
https://lyrics.fandom.com/wiki/Category:Artists_F?from=Fuzz+Fuzz+Machine page not accessed
https://lyrics.fandom.com/wiki/Category:Artists_G?from=Gu%C3%A9na+LG+%26+Amir+Afargan page not accessed
https://lyrics.fandom.com/wiki/Category:Artists_J?from=Junk+Science page not accessed
https://lyrics.fandom.com/wiki/Category:Artists_L?from=Lusine page not accessed
https://lyrics.fandom.com/wiki/Category:Artists_M?from=Mystik+Spiral page not accessed
https://lyrics.fandom.com/wiki/Category:Artists_U?from=Unleashed+Power page not accessed
https://lyrics.fandom.com/wiki/Category:Artists_W?from=Wowaka page not accessed
https://lyrics.fandom.com/wiki/Category:Artists_Y?from=Your+Shapeless+Beauty page not accessed
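
The handler above reads the two-letter prefix by slicing the raw URL at fixed offsets after `from`. As a minimal sketch of the same comparison, the query string can instead be parsed with the standard library, which also decodes escapes such as `%26`. The helper name `from_prefix` is hypothetical and not part of the notebook, and the import assumes Python 3.

```python
from urllib.parse import urlparse, parse_qs

def from_prefix(url, length=2):
    """Return the first `length` characters of the URL's `from` parameter ('' if absent)."""
    query = parse_qs(urlparse(url).query)
    values = query.get('from', [''])
    return values[0][:length]

standard = 'https://lyrics.fandom.com/wiki/Category:Artists_A?from=Ab'
overflow = 'https://lyrics.fandom.com/wiki/Category:Artists_A?from=Above+%26+Beyond'

# both pages belong to the same two-letter bucket
same_bucket = from_prefix(standard) == from_prefix(overflow)
print(from_prefix(overflow), same_bucket)
```

Comparing decoded prefixes this way would let the recursion stop as soon as `from_prefix(next_url)` differs from `from_prefix(url)`.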


In [15]:
# add the additional category pages to the initial list to get the complete url set:
artist_page_urls = list(set(category_artist_urls + additional_category_pages))
print("There are", len(artist_page_urls), "artist pages to be scraped")

There are 965 artist pages to be scraped


Save the variables to a pickle file!

In [2]:
import pickle
def writePickle(Variable, fname):
    # note: the pickle_vars/ directory must already exist
    filename = fname + ".pkl"
    with open("pickle_vars/" + filename, 'wb') as f:
        pickle.dump(Variable, f)
def readPickle(fname):
    filename = "pickle_vars/" + fname + ".pkl"
    with open(filename, 'rb') as f:
        return pickle.load(f)

In [18]:
writePickle(artist_page_urls, "Artist_Collection_Pages")

In each artist collection (index) page, there are multiple artists listed with a link to their own artist pages. We store each of these in a dictionary where the artist name maps to its link in the website.

In [3]:
artist_page_urls = readPickle("Artist_Collection_Pages")

In [11]:
artist_to_url = {}
for url in artist_page_urls:
    try:
        page = urllib2.urlopen(url)
        soup = BeautifulSoup(page, 'html.parser')
        artists = soup.find_all('a', {"class": "category-page__member-link"})
        for item in artists:
            artist_to_url[item['title']] = item['href']
    except:
        pass

In [12]:
# write the artist page urls dictionary to a json file for later use
import json
with open('Artist_Pages.txt', 'w') as outfile:
    json.dump(artist_to_url, outfile)
    
# also save it to a pickle file
writePickle(artist_to_url, "Artist_Pages")

In [13]:
print("There are now", len(artist_to_url), "different artists stored with links to their artist pages")


There are now 79280 different artists stored with links to their artist pages


Our final move before getting all the lyrics is to generate a placeholder dictionary for all the songs to be retrieved from the website. This dictionary has the following format: <br> <br>
{'Artist1 Name' : <br>
        &emsp;[Genre, URL_of_Artist, <br>
         &emsp;&emsp;{'Album1_name': <br>
           &emsp;&emsp;&emsp;[URL_of_Album, <br>
           &emsp;&emsp;&emsp;&emsp;{'song1': [lyrics, year, URL_of_Song, Song_ID],<br>
           &emsp;&emsp;&emsp;&emsp;'song2': [ ], <br> 
           &emsp;&emsp;&emsp;&emsp;...., <br> 
           &emsp;&emsp;&emsp;&emsp;'songN': [ ] }<br>
           &emsp;&emsp;&emsp;] <br>
          &emsp;&emsp;'Album2_name':....,<br>
          &emsp;&emsp;...., <br>
          &emsp;&emsp;'AlbumN_name':....<br>
         &emsp;&emsp;}<br>
        &emsp;],<br>
     'Artist2 Name': ...,<br>
     'ArtistN Name':...<br>
    }
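
To make the nesting concrete, here is a small hand-written instance of this layout and how a song is reached in it. All names, URLs, and values are invented placeholders, not scraped data.

```python
# hypothetical placeholder entry following the layout described above
example_dict = {
    'Artist1 Name': [
        'Rock',                                    # genre
        'https://lyrics.fandom.com/wiki/Artist1',  # artist URL (placeholder)
        {
            'Artist1 Name:Album1 (2001)': [
                'https://lyrics.fandom.com/wiki/Artist1:Album1_(2001)',  # album URL (placeholder)
                {
                    'Artist1 Name:Song1': ['first line<>second line', '2001', '/wiki/Artist1:Song1', '0'],
                },
            ],
        },
    ],
}

# unpack one level at a time: artist -> album -> song
genre, artist_url, albums = example_dict['Artist1 Name']
album_url, songs = albums['Artist1 Name:Album1 (2001)']
lyrics, year, song_url, song_id = songs['Artist1 Name:Song1']

# lyric lines are joined with the '<>' separator, so they can be split back
lines = lyrics.split('<>')
print(genre, year, song_id, lines)
```

Note that the scraping loop further down appends the album count and the song count to each artist's list, so a fully populated entry has five elements and the three-way unpacking above applies only to this placeholder form.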


In [None]:
# create a new dictionary for lyrics, and a running counter used to give each song a unique ID
lyrics_dict = {}
id_count = 0

In [1]:
# there are certain genres that we don't want to cover
# 'Classical' artists, for instance, tend to have no lyrics, and we don't want to spend time scraping them
# other genres are blacklisted for reasons such as ambiguity, foreign language, or scarcity
# use the following list to avoid retrieving songs of artists belonging to unwanted genres
avoid_genre_list = ['Acoustic', 'Aggrotech', 'Alternative Country', 'Alternative Metal', 'Americana', 'Andean Music', 'Australian Hip Hop', 'Avant-garde', 'Ballad', 'Brazilian Rock', 'Canterbury', 'Celtic', 'Celtic Folk', 'Chanson', "Children's Music", 'Christian Hip Hop', 'Christian Metal', 'Christian Rock', 'Comedy', 'Contemporary Christian', 'Crust Punk', 'Dance', 'Dark Wave', 'Death Doom', 'Deathcore', 'Disco', 'Downtempo', 'EBM', 'Electronica', 'Electropop', 'Experimental Hip Hop', 'Experimental Pop', 'Experimental Rock', 'Extreme Metal', 'Forró', 'Freestyle', 'Goregrind', 'Gospel', 'Gothic Metal', 'Gothic Rock', 'Grindcore', 'Hardcore Punk', 'Horror Punk', 'Horrorcore', 'House', 'Industrial', 'Industrial Metal', 'J-Pop', 'J-Rock', 'Latin', 'Melodic Death Metal', 'Melodic Metalcore', 'Metalcore', 'Morna', 'New Jack Swing', 'New Wave', 'Noise', 'Pop Punk', 'Post-Hardcore', 'Power Metal', 'Progressive House', 'Progressive Metal', 'Punk Rock', 'Rockabilly', 'Roots Rock','Singer-Songwriter','Sludge Metal','Spanish Rock','Spoken Word','Stoner Rock','Synthpop', 'A Cappella',
 'Adult Contemporary','Afropop','Ambient','Austropop','Avant-garde Metal','Axé','Bachata','Baroque Pop','Black Metal',
 'Blackgaze','Blue-Eyed Soul','Bluegrass','Blues Revival','Blues Rock','Bossa Nova','British Blues','Bubblegum Pop','Canadian Folk',
 'Celtic Rock','Christian','Classical','Comedy Rock','Contemporary Folk','Country Rock','Crossover Thrash','Crunk',
 'Cumbia','Dance Punk','Dark Electro','Deathrock','Deutschrock','Doom Metal','Drone Music','Drum And Bass','Dubstep',
 'Easy Listening','Electronic Rock','Emo','Eurobeat','Eurodance','Fado','Flamenco','Folk Metal','Folk Punk','Folk Rock',
 'Freak Folk','French Hip Hop','Garage Rock','Glam Metal','Glam Rock','Gothic','Groove Metal','Happy Hardcore','Industrial Rock',
 'Italian Folk','Italo Disco','Jazz Fusion','K-Pop','Kirtan','Latin Pop','Lo-Fi','MPB','Medieval','Melodic Hardcore',
 'Neo-Psychedelia','Neofolk','Nerdcore Hip Hop','Noise Rock','Nu Metal','Nu Metalcore','Nueva Canción','Oi-Punk',
 'Pagan Metal','French Pop','Pagode','Persian','Neue Deutsche Welle','Norteño','Pachanga','Post-Punk','Progressive Metalcore',
 'Psychedelic','RAC','Raggamuffin','Rap Rock','Raï','Reggae Fusion','Reggaeton','Romanian Etno','Salsa','Sertanejo',
 'Riot Grrrl',"Worship",'Schlager','Ska','Bolero','Tango','Cabaret','Cajun','Celtic Fusion','Klezmer','Latin Rock','Soul',
 'Space Rock','Symphonic Metal','Technical Death Metal','Tejano','Trance','Turkish Folk','Turkish Rock','Vocal']
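
The blacklist is used for one membership test per artist. As a small sketch of that check, the helper below wraps the test in a function, uses a set for constant-time lookups, and treats a missing genre as unwanted. `is_wanted_genre` and the shortened `avoid_genres_sample` are hypothetical names used only for illustration.

```python
# a shortened, illustrative subset of the blacklist above
avoid_genres_sample = frozenset(['Classical', 'Comedy', 'K-Pop', 'Trance'])

def is_wanted_genre(genre, avoided=avoid_genres_sample):
    """Return True when a genre was found and is not blacklisted."""
    return genre is not None and genre not in avoided

print(is_wanted_genre('Rock'))       # True
print(is_wanted_genre('Classical'))  # False
print(is_wanted_genre(None))         # False
```

With the full list, the same idea would be `frozenset(avoid_genre_list)` built once before the scraping loop.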

In [None]:
# go to each artist page, check whether genre constraints are met, and then fill the dictionary with necessary info
# such as album name, album year, song name, lyrics, genre, etc. 
for artist, link in artist_to_url.items():
    
    lyrics_dict[artist] = [] # create an empty list for each artist in the dictionary, to be filled in later on
    try:
        page = urllib2.urlopen('https://lyrics.fandom.com'+link)
        soup = BeautifulSoup(page, 'html.parser')
        
        '''first get the genre and add it to the dictionary with the artist name'''
        genre = None # reset for each artist so that a previous artist's genre is never reused
        genres = soup.find_all('div', {'class' : "css-table-cell"})
        if genres == []: # if there are no genre blocks to be processed in the html, then we should proceed with the next artist
            del lyrics_dict[artist]
            continue
        for item in genres:
            try:
                b_blocks = item.find('b')
                for entry in b_blocks:
                    if entry == 'Genres:':
                        genre = item.find('a')['title'][15:]
                        lyrics_dict[artist] = [genre, 'https://lyrics.fandom.com'+link, dict()] # a list for each artist: genre, artist link, and an (initially empty) album dictionary
            except (TypeError, KeyError): # this block has no <b> label or no titled link
                pass
        '''genre retrieved and saved'''
        
        
        if genre is None or genre in avoid_genre_list: # skip artists without a genre or with an unwanted genre
            del lyrics_dict[artist]
            continue
        
        
        '''then get the content for further classification'''
        content = soup.find_all("span", { "class" : "mw-headline" })
        '''content retrieved'''
        
        '''here, to save time, we check whether the given artist has enough albums and songs 
        to pass a certain threshold. if not, we stop now, before spending time on
        retrieving all the lyrics pages'''
        album_count = 0
        song_count = 0
        for item in content:
            if album_count >=3 and song_count >=10:
                break
            try:
                album = item.find("a", recursive=False)["title"]
                if album[-6:] != "exist)" and album[len(artist)] == ':':
                    album_count +=1
                    album_url = 'https://lyrics.fandom.com' + item.find("a", recursive=False)["href"]
                    try:
                        page = urllib2.urlopen(album_url)
                        soup2 = BeautifulSoup(page, 'html.parser')
                        song_list = soup2.find_all('b')
                    except urllib2.HTTPError:
                        song_list = [] # album page unavailable: do not reuse the previous album's songs
                    for song in song_list:
                        try:
                            song_title = song.find('a', title = True, ref=False)['title'] # ref=False prevents listed songs without any links
                            # when the title is not in the form "Michael Jackson:...", then there might be an additional artist in the song
                            if song_title[len(artist)] == ':':
                                song_count +=1
                        except:
                            pass
            except:
                pass
        '''threshold check completed'''
        
        # we'll only continue if the artist has at least 3 albums and 10 songs with lyrics registered in the website
        if album_count < 3 or song_count < 10:
            del lyrics_dict[artist]
            continue
        else:        
            '''if the threshold requirements are satisfied, continue with retrieving the content and storing it in the dictionary'''
            album_count = 0
            song_count = 0
            for item in content:
                try:
                    album = item.find("a", recursive=False)["title"]
                    if album[-6:] == "exist)" or album[len(artist)] != ':': #if the album name ends with (page does not exist), it means the album page is not there yet
                        continue
                    else:
                        album_count += 1
                        album_url = 'https://lyrics.fandom.com' + item.find("a", recursive=False)["href"]
                        year = album[-5:-1]
                        lyrics_dict[artist][-1][album] = [album_url, dict()]
                        try:
                            page = urllib2.urlopen(album_url)
                            soup2 = BeautifulSoup(page, 'html.parser')
                            song_list = soup2.find_all('b')
                        except urllib2.HTTPError:
                            song_list = [] # album page unavailable: do not reuse the previous album's songs
                        for song in song_list:
                            try:
                                song_url = song.find('a', title = True, ref=False)['href'] # ref=False prevents listed songs without any links
                                song_title = song.find('a', title = True, ref=False)['title']
                                if song_title[len(artist)] == ':':
                                    try: # get the lyrics of each song
                                        lyricshtml = urllib2.urlopen('https://lyrics.fandom.com'+song_url)
                                        lyricssoup = BeautifulSoup(lyricshtml, 'html.parser')
                                        raw_lyrics = lyricssoup.find(attrs={'class': 'lyricbox'})
                                    except urllib2.HTTPError:
                                        continue # skip this song rather than storing lyrics left over from the previous one
                                    lyrics_dict[artist][-1][album][-1][song_title] = \
                                    [raw_lyrics.get_text(separator='<>'), year, song_url, str(id_count)]
                                    song_count +=1
                                    id_count +=1
                                else:
                                    continue
                            except:
                                pass
                except:
                    pass
            # finally add the album and song count for each artist
            lyrics_dict[artist].append(album_count)
            lyrics_dict[artist].append(song_count)
        
            # here is a double check of the numbers.
            # unless sufficient numbers are satisfied, we delete the artist data from the dict and continue
            if album_count < 3 or song_count < 10:
                del lyrics_dict[artist]
                continue
    except:
        pass
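
The loop above relies on two conventions in page titles: a title belongs to the artist when the character right after the artist name is a colon, and the album year is read from the last characters of the album title. A minimal sketch of both checks as standalone helpers follows. The function names are hypothetical, and the assumed title format `Artist:Album (YYYY)` is inferred from the slicing in the loop rather than confirmed against the site.

```python
import re

def belongs_to_artist(title, artist):
    """True when the title has the form '<artist>:<something>'."""
    return title.startswith(artist + ':')

def album_year(album_title):
    """Return the 4-digit year from a trailing '(YYYY)', or None if absent."""
    match = re.search(r'\((\d{4})\)$', album_title)
    return match.group(1) if match else None

print(belongs_to_artist('Michael Jackson:Thriller (1982)', 'Michael Jackson'))
print(belongs_to_artist('Michael', 'Michael Jackson'))
print(album_year('Michael Jackson:Thriller (1982)'))
print(album_year('Michael Jackson:Other Songs'))
```

Unlike the index-based check `title[len(artist)] == ':'`, `startswith` also verifies the artist name itself and does not raise on titles shorter than the name, so it needs no surrounding `try`.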

In [10]:
# store everything to a pickle file
writePickle(lyrics_dict, 'lyrics_dict')