 Project Title: Lyrics Scraper Using Web Scraping


Table of Contents

1. Project Objective

2. Setup and Imports

3. Define Utility Functions

4. Scraping Process

5. Data Saving and CSV Export

6. Results and Data Preview


Setup and Imports

In [2]:
# Import Required Libraries
import os
import time
import random
import pandas as pd
from bs4 import BeautifulSoup
from urllib.request import urlopen


Define Utility Functions

In [4]:
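def save_file(path, text, replace=False):
    """Save lyrics text to <path>.txt, or to <path>_2.txt if the file exists and replace is False."""
    file_path = path + ".txt"
    if not replace and os.path.exists(file_path):
        # Keep the existing file and write to a fallback name instead.
        # Note: the fallback file itself is overwritten on later runs.
        file_path = path + "_2.txt"
    directory = os.path.dirname(file_path)
    if directory:  # os.makedirs("") raises, so skip it for bare filenames
        os.makedirs(directory, exist_ok=True)
    with open(file_path, "w", encoding="utf-8") as file:
        file.write(text)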

def get_lyrics(song_url, artist_name, save=True, by_decade=False, replace=False, folder="songs"):
    """Fetch lyrics from a given song URL."""
    try:
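        # Download and parse the song page; the connection is closed when the block exits.
        with urlopen(song_url) as song:
            soup = BeautifulSoup(song.read(), "html.parser")
        # Heuristic: take the first <div> that has neither a class nor an id as the lyrics.
        all_divs = soup.find_all("div")
        lyrics_div = next((div for div in all_divs if not div.get("class") and not div.get("id")), None)
        lyrics = lyrics_div.get_text(strip=True, separator="\n") if lyrics_div else "Lyrics Not Found"

        # Assumes the second <b> tag holds the quoted song title; an IndexError here
        # is caught by the broad except below.
        title = soup.find_all("b")[1].get_text().replace('"', '').replace(" ", "_")
        album = soup.find_all(class_="songinalbum_title")

        # Default to no year and the "others" bucket when the album line has no usable year.
        year, decade = None, "others"
        if album:
            try:
                # Read the text between the last "(" and the next ")" as the year.
                year = int(album[0].get_text().split('(')[-1].split(')')[0])
                decade = f"{str(year)[:3]}0s"
            except ValueError:
                pass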

        if save:
            save_file(f"{folder}/all/{title}", lyrics, replace)
            if by_decade:
                save_file(f"{folder}/decades/{decade}/{title}", lyrics, replace)

        return {"Artist": artist_name, "Song_Title": title, "Lyrics": lyrics, "Album_Year": year, "Decade": decade}

    except Exception as e:
        print(f"Error fetching lyrics from {song_url}: {e}")
        return None
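
The year and decade parsing inside `get_lyrics` can be tried in isolation, without any network access. The sketch below pulls that logic into a hypothetical helper, `parse_year_decade` (not part of the notebook), and assumes the album line looks roughly like `album: "19" (2008)`; the text on the real page may differ.

```python
def parse_year_decade(album_text):
    """Return (year, decade) from an album line, or (None, "others") if no year is found.

    Hypothetical helper that mirrors the parsing in get_lyrics; the input format is an assumption.
    """
    try:
        # Take whatever sits between the last "(" and the following ")".
        year = int(album_text.split("(")[-1].split(")")[0])
    except ValueError:
        return None, "others"
    # Keep the first three digits and append "0s", e.g. 2008 -> "2000s".
    return year, f"{str(year)[:3]}0s"


with_year = parse_year_decade('album: "19" (2008)')
without_year = parse_year_decade("unknown album")
print(with_year, without_year)
```

Songs without a parsable year fall into the `others` bucket, which is also the folder name used when `by_decade=True`.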


## A Note on Rate Limiting

The lyrics site, www.azlyrics.com, does not publish an explicit limit on the number of requests allowed in a given period, but in our testing, too many requests in too short a time caused the site to stop returning lyrics pages. (Entertainingly, the page that gets returned seems to contain only the title of [a Tom Jones song](https://www.azlyrics.com/lyrics/tomjones/itsnotunusual.html).)

Whenever you retrieve a page (with `requests.get` or, as in this notebook, `urlopen`), put a `time.sleep(5 + 10*random.random())` on the next line. This will help you avoid getting blocked. If you _do_ get blocked, which you can tell because the returned pages are not the ones you asked for, just request a lyrics page through your browser. You'll be asked to complete a CAPTCHA, and then your requests should start working again.
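
A minimal sketch of the pause-after-every-request pattern, using `urlopen` as the rest of this notebook does. The helper name `polite_fetch` and its `opener` and `sleeper` parameters are hypothetical; they exist only so the function can be tried without touching the network or actually waiting.

```python
import random
import time
from urllib.request import urlopen


def polite_fetch(url, opener=urlopen, sleeper=time.sleep, min_delay=5.0, jitter=10.0):
    """Fetch a page, then pause for between min_delay and min_delay + jitter seconds.

    Hypothetical helper: opener and sleeper are injectable for offline testing.
    """
    response = opener(url)
    html = response.read()
    # Same delay formula as in the text: 5 + 10 * random.random()
    sleeper(min_delay + jitter * random.random())
    return html
```

Calling `polite_fetch(song_url)` with the defaults downloads the page and then sleeps for 5 to 15 seconds before returning, so consecutive calls are spaced out automatically.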

## Part 1: Finding Links to Song Lyrics

The general page for each artist lists all of that artist's songs, with links to the individual song pages.

Q: Take a look at the `robots.txt` page on www.azlyrics.com. (You can read more about these pages [here](https://developers.google.com/search/docs/advanced/robots/intro).) Is the scraping we are about to do allowed or disallowed by this page? How do you know? 

A: The answer depends on the rules in the site's `robots.txt` file, which site administrators use to tell web crawlers and automated bots which paths they may visit. If the file contains a `Disallow` rule that covers the paths we request (e.g., /lyrics/) for our user agent, then scraping those pages violates the site's policy. If no `Disallow` rule covers those paths, the scraping is permitted as far as `robots.txt` is concerned, but ethical considerations and rate limiting should still be respected.
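
The check described above can also be done programmatically with the standard library's `urllib.robotparser`. The sketch below is offline: the rules in `sample_rules` are made up for illustration and are not the rules published by www.azlyrics.com, and `is_allowed` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, url, user_agent="*"):
    """Return True if the given robots.txt lines permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return parser.can_fetch(user_agent, url)


# Made-up rules for illustration only; NOT the real robots.txt of any site.
sample_rules = [
    "User-agent: *",
    "Disallow: /private/",
]

lyrics_ok = is_allowed(sample_rules, "https://example.com/lyrics/song.html")
private_ok = is_allowed(sample_rules, "https://example.com/private/page.html")
print(lyrics_ok, private_ok)
```

Against a live site, one would instead point the parser at the file with `set_url(...)` and call `read()` before `can_fetch`; that step needs network access, so it is left out here.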

Scraping Process

In [5]:
def scrape_artist(az_url, artist_name, song_limit=20, sleep="random", by_decade=True, replace=False, folder="songs"):
    """Scrape songs for a specific artist up to a defined limit."""
    base_url = "https://www.azlyrics.com/"
    artist_data = []
    try:
        main_page = urlopen(az_url)
        soup = BeautifulSoup(main_page.read(), "html.parser")
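        # Each song on the artist page is expected to sit in a <div class="listalbum-item">.
        song_divs = soup.find_all('div', {"class": "listalbum-item"})
        # Drop everything up to the first "/" in each href, prepend base_url, and keep the first song_limit links.
        urls = [base_url + d.a['href'].split("/", 1)[1] for d in song_divs][:song_limit]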

        for idx, url in enumerate(urls, 1):
            song_data = get_lyrics(url, artist_name, save=True, by_decade=by_decade, replace=replace, folder=folder)
            if song_data:
                artist_data.append(song_data)
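            # Wait 5-15 seconds (or a fixed number of seconds if sleep is a number) between requests.
            rt = random.randint(5, 15) if sleep == "random" else sleep
            # Rough ETA: assumes every remaining request waits as long as this one, so it jumps around.
            print(f"Downloaded: {idx}/{len(urls)} - ETA: {round(rt * (len(urls) - idx) / 60, 2)} min")
            time.sleep(rt)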
    except Exception as e:
        print(f"Error scraping artist page {az_url}: {e}")

    return artist_data
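
The link-extraction step in `scrape_artist` can be tried offline on a small piece of HTML. The sketch below is a variant, not the notebook's own code: it uses `urljoin` instead of string splitting and skips items that have no link. The sample HTML is made up and only imitates the structure the scraper assumes; `extract_song_urls` is a hypothetical helper name.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Made-up HTML that imitates the assumed structure of an artist page.
sample_html = """
<div class="listalbum-item"><a href="/lyrics/artist/first_song.html">First Song</a></div>
<div class="listalbum-item"><a href="/lyrics/artist/second_song.html">Second Song</a></div>
<div class="listalbum-item">No link here</div>
"""


def extract_song_urls(html, base_url="https://www.azlyrics.com/", limit=None):
    """Return absolute song URLs found in listalbum-item divs, up to limit."""
    soup = BeautifulSoup(html, "html.parser")
    items = soup.find_all("div", class_="listalbum-item")
    # Skip items without an <a href=...> instead of raising an error.
    urls = [urljoin(base_url, d.a["href"]) for d in items if d.a and d.a.get("href")]
    return urls[:limit]


song_urls = extract_song_urls(sample_html)
print(song_urls)
```

Using `urljoin` means both site-relative (`/lyrics/...`) and absolute links resolve to full URLs without manual slicing.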


Run Scraping for Selected Artists

In [6]:
artist_urls = [
    "https://www.azlyrics.com/a/adele.html",
    "https://www.azlyrics.com/e/eminem.html"
]
artist_names = ["adele", "eminem"]

base_folder = r"C:\Users\archa\Desktop\lyrics"
os.makedirs(base_folder, exist_ok=True)

all_lyrics_data = []

for name, url in zip(artist_names, artist_urls):
    print(f"\n--- Scraping {name.title()} ---")
    artist_lyrics = scrape_artist(
        az_url=url,
        artist_name=name,
        song_limit=20,
        sleep="random",
        by_decade=True,
        replace=False,
        folder=os.path.join(base_folder, name)
    )
    all_lyrics_data.extend(artist_lyrics)



--- Scraping Adele ---
Downloaded: 1/20 - ETA: 2.85 min
Downloaded: 2/20 - ETA: 4.2 min
Downloaded: 3/20 - ETA: 1.98 min
Downloaded: 4/20 - ETA: 4.0 min
Downloaded: 5/20 - ETA: 2.0 min
Downloaded: 6/20 - ETA: 1.63 min
Downloaded: 7/20 - ETA: 1.52 min
Downloaded: 8/20 - ETA: 1.4 min
Downloaded: 9/20 - ETA: 0.92 min
Downloaded: 10/20 - ETA: 0.83 min
Downloaded: 11/20 - ETA: 0.75 min
Downloaded: 12/20 - ETA: 1.07 min
Downloaded: 13/20 - ETA: 1.28 min
Downloaded: 14/20 - ETA: 0.7 min
Downloaded: 15/20 - ETA: 0.83 min
Downloaded: 16/20 - ETA: 0.8 min
Downloaded: 17/20 - ETA: 0.3 min
Downloaded: 18/20 - ETA: 0.43 min
Downloaded: 19/20 - ETA: 0.15 min
Downloaded: 20/20 - ETA: 0.0 min

--- Scraping Eminem ---
Downloaded: 1/20 - ETA: 2.22 min
Downloaded: 2/20 - ETA: 3.3 min
Downloaded: 3/20 - ETA: 3.12 min
Downloaded: 4/20 - ETA: 2.13 min
Downloaded: 5/20 - ETA: 2.5 min
Downloaded: 6/20 - ETA: 2.8 min
Downloaded: 7/20 - ETA: 1.95 min
Downloaded: 8/20 - ETA: 2.6 min
Downloaded: 9/20 - ETA: 1.83 min

Data Saving and CSV Export

In [7]:
if all_lyrics_data:
    df = pd.DataFrame(all_lyrics_data)
    csv_path = os.path.join(base_folder, "lyrics_dataset.csv")
    df.to_csv(csv_path, index=False, encoding='utf-8')
    print(f"\n✅ Lyrics saved to: {csv_path}")
else:
    print("\n❌ No data was scraped. Please verify artist URLs.")



✅ Lyrics saved to: C:\Users\archa\Desktop\lyrics\lyrics_dataset.csv
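
Once the CSV exists, a quick sanity check is to load it back and count songs per artist and decade. The sketch below is self-contained: it writes a tiny made-up dataset to a temporary folder rather than reading the real export, and `summarize_lyrics_csv` is a hypothetical helper name.

```python
import os
import tempfile

import pandas as pd


def summarize_lyrics_csv(csv_path):
    """Load an exported lyrics CSV and count songs per artist and decade."""
    df = pd.read_csv(csv_path, encoding="utf-8")
    return df.groupby(["Artist", "Decade"]).size().reset_index(name="Song_Count")


# Made-up rows with the same columns as the scraper's output, for illustration only.
demo_rows = [
    {"Artist": "artist_a", "Song_Title": "Song_One", "Lyrics": "la la", "Album_Year": 2008, "Decade": "2000s"},
    {"Artist": "artist_a", "Song_Title": "Song_Two", "Lyrics": "la la", "Album_Year": 2009, "Decade": "2000s"},
    {"Artist": "artist_b", "Song_Title": "Song_Three", "Lyrics": "la la", "Album_Year": 1999, "Decade": "1990s"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "lyrics_dataset.csv")
    pd.DataFrame(demo_rows).to_csv(demo_path, index=False, encoding="utf-8")
    summary = summarize_lyrics_csv(demo_path)

print(summary)
```

For the real data, passing `csv_path` from the cell above to `summarize_lyrics_csv` gives the same kind of per-artist, per-decade count.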


Results and Data Preview

In [8]:
# Preview the dataset
df.head()


|   | Artist | Song_Title | Lyrics | Album_Year | Decade |
|---|---|---|---|---|---|
| 0 | adele | Daydreamer | Daydreamer\nSitting on the sea\nSoaking up the... | 2008 | 2000s |
| 1 | adele | Best_For_Last | Wait, do you see my heart on my sleeve?\nIt's ... | 2008 | 2000s |
| 2 | adele | Chasing_Pavements | I've made up my mind\nDon't need to think it o... | 2008 | 2000s |
| 3 | adele | Cold_Shoulder | You say it's all in my head\nAnd the things I ... | 2008 | 2000s |
| 4 | adele | Crazy_For_You | Found myself today\nSinging out loud your name... | 2008 | 2000s |
