# SCRAPING SONG DATASET FROM `AZLyrics`

<img src="https://i.imgur.com/U4BrQEd.png" title="source: imgur.com" />

This notebook contains our web scraping project for group 1.

Group members:
- Gift Abah
- Ibukunoluwa Moses
- Abayomi-Perez Okekunle
- Sophia Emifoniye

---

> ## TASK

To create a **dataset of songs and albums** released by different artists (e.g. Sam Smith, Nicki Minaj) by scraping the `AZLyrics` site: https://www.azlyrics.com/f.html

The function should be able to take any artist page and return the EPs and/or albums that the artist has released, the release date (year), and some songs on each album.


---



For this dataset, we decided to collect the following information:

- Name of the Artist
- EPs or Albums that they have released
- Year of release
- Tracks on the album
- Cover art (to be saved as image link) for the EP or album


---



> **OUTPUT FORMAT:** Our final export will be made into a `CSV file`



---



### OUR CODE STARTS HERE

Here's an outline of the steps we intend to follow:

1. Install the necessary libraries
2. Use `requests` to download the webpage
3. Use `BeautifulSoup` to parse and read the webpage
4. Collect the necessary dataset and save it as a list of dictionaries
5. Write a simple function that does everything
6. Save final result as a CSV file



---



> Step 1: Installing the necessary libraries - `requests` and `BeautifulSoup`

In [1]:
!pip install requests --upgrade --quiet

In [2]:
import requests

In [3]:
!pip install beautifulsoup4 --upgrade --quiet

In [4]:
from bs4 import BeautifulSoup

> Getting the topic URL to be scraped, `https://www.azlyrics.com/s/samsmith.html`

In [37]:
topic_url = 'https://www.azlyrics.com/s/samsmith.html'

#  Using requests.get to download the webpage

response = requests.get(topic_url)

In [38]:
# Checking that the request succeeded (a status code in the 2xx range, e.g. 200)
response.status_code


200
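
A successful download is signalled by a status code in the 2xx range. A minimal sketch of making that check explicit is shown below. The helper name `is_success` is our own and is not part of `requests`.

```python
def is_success(status_code):
    """Return True when an HTTP status code is in the 2xx (success) range."""
    return 200 <= status_code < 300


# 200 is what the Sam Smith page returned above; 404 and 503 are illustrative failures.
for code in (200, 404, 503):
    print(code, is_success(code))
```

With a live response this would be called as `is_success(response.status_code)`. `requests` also provides `response.raise_for_status()`, which raises an exception for 4xx and 5xx codes.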

In [39]:
page_content = response.text
len(page_content)

57787

In [40]:
# Checking to be sure the webpage has been downloaded and is being read correctly
page_content[:1001]

'<!DOCTYPE html>\r\n<html lang="en">\r\n<head>\r\n<meta charset="utf-8">\r\n<meta http-equiv="X-UA-Compatible" content="IE=edge">\r\n<meta name="viewport" content="width=device-width, initial-scale=1">\r\n<!-- The above 3 meta tags *must* come first in the head; any other head content must come *after* these tags -->\r\n<meta name="description" content="Sam Smith lyrics - 107 song lyrics sorted by album, including &quot;Stay With Me&quot;, &quot;I\'m Not The Only One&quot;, &quot;Unholy&quot;."> \r\n<meta name="keywords" content="Sam Smith, Sam Smith lyrics, discography, albums, songs">\r\n<meta name="robots" content="noarchive">\r\n<title>Sam Smith Lyrics</title>\r\n\r\n<link rel="canonical" href="https://www.azlyrics.com/s/samsmith.html" />\r\n<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.4/css/bootstrap.min.css">\r\n<link rel="stylesheet" href="/local/az.css">\r\n\r\n<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->\r

> Parsing the downloaded HTML with BeautifulSoup, using Python's built-in `html.parser`

In [41]:
doc = BeautifulSoup(response.text, 'html.parser')
# type(doc)

> **STEP 2:** Getting the `Name of the Artist` from the `h1` tag

In [42]:
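artist_name = doc.h1.text.strip()
# str.strip(' Lyrics') removes a *set of characters* from both ends, not the suffix,
# so the trailing ' Lyrics' is removed explicitly instead
if artist_name.endswith(' Lyrics'):
    artist_name = artist_name[:-len(' Lyrics')]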
artist_name

'Sam Smith'
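
One detail worth noting when trimming the heading: `str.strip` treats its argument as a set of characters to remove from both ends, not as a suffix. It happens to give the right answer for `Sam Smith`, but it can also eat letters from a name that ends in one of those characters. The sketch below compares the two approaches. `clean_artist_name` is a hypothetical helper, and the second heading is an illustrative string, not something scraped in this notebook.

```python
SUFFIX = ' Lyrics'


def clean_artist_name(heading):
    """Remove a trailing ' Lyrics' from a page heading, leaving the name intact."""
    heading = heading.strip()
    if heading.endswith(SUFFIX):
        heading = heading[:-len(SUFFIX)]
    return heading


# Illustrative heading strings (only the first was scraped above).
print(clean_artist_name('Sam Smith Lyrics'))
print(clean_artist_name('Elvis Presley Lyrics'))

# For comparison: strip() also removes the final 'y' of 'Presley'.
print('Elvis Presley Lyrics'.strip(' Lyrics'))
```

On Python 3.9 and later, `heading.removesuffix(' Lyrics')` does the same job in one call.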



---


> STEP 3:

> Using `.find_all` to find all `div` tags that contain the class `album`




In [43]:
list_of_albums = doc.find_all('div', class_='album')
list_of_albums

#  Trying it out to see if it works
first_item = list_of_albums[0]
first_item.text.strip()

'EP: "Nirvana" (2013)'

In [44]:
albums_and_year = []

for album in list_of_albums:
  album = album.text.strip()
  albums_and_year.append(album)


albums_and_year

['EP: "Nirvana" (2013)',
 'album: "In The Lonely Hour" (2014)',
 'compilation: "The Lost Tapes - Remixed" (2015)',
 'album: "The Thrill Of It All" (2017)',
 'album: "Love Goes" (2020)',
 'album: "Gloria" (2023)',
 'EP: "A Lonely Christmas" (2023)',
 'other songs:']
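
Each heading packs the release type, title, and year into one string. Since the task asks for the year of release, it may help to split them apart. The following is a rough sketch using a regular expression. The function name `parse_album_heading` and the returned keys are our own choices, and the pattern assumes headings follow the `type: "Title" (YYYY)` shape seen in the output above.

```python
import re

# Matches headings such as: EP: "Nirvana" (2013)
ALBUM_PATTERN = re.compile(r'^(?P<kind>[^:]+):\s*"(?P<title>.+)"\s*\((?P<year>\d{4})\)$')


def parse_album_heading(heading):
    """Split an album heading into release type, title, and year."""
    heading = heading.strip()
    match = ALBUM_PATTERN.match(heading)
    if match is None:
        # Headings without a title and year, e.g. 'other songs:'
        return {'kind': heading.rstrip(':'), 'title': None, 'year': None}
    return {
        'kind': match['kind'],
        'title': match['title'],
        'year': int(match['year']),
    }


for heading in ['EP: "Nirvana" (2013)',
                'compilation: "The Lost Tapes - Remixed" (2015)',
                'other songs:']:
    print(parse_album_heading(heading))
```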

In [45]:
# @title **Step 4:** Getting the `Cover Images` of the albums

album_img = doc.find_all('img', class_='album-image')
album_img

[<img alt="Sam Smith - Nirvana EP cover" class="album-image" src="/images/albums/301/e5d75335ac015e7cdf969156da6757c1.jpg"/>,
 <img alt="Sam Smith - In The Lonely Hour album cover" class="album-image" src="/images/albums/311/b68c5abf3224e6be1646cdf3cc654317.jpg"/>,
 <img alt="Sam Smith - The Lost Tapes - Remixed compilation cover" class="album-image" src="/images/albums/389/a804e1c4301d624da01765e30a8d7275.jpg"/>,
 <img alt="Sam Smith - The Thrill Of It All album cover" class="album-image" src="/images/albums/507/20be9d855606d7afd018e34e1c1e615a.jpg"/>,
 <img alt="Sam Smith - Love Goes album cover" class="album-image" src="/images/albums/880/62f6b1bbff0c287e401b40777a009bd8.jpg"/>,
 <img alt="Sam Smith - Gloria album cover" class="album-image" src="/images/albums/114/0f186ced50cebe5367f3b0292752513a.jpg"/>,
 <img alt="Sam Smith - A Lonely Christmas EP cover" class="album-image" src="/images/albums/125/f2db884a31babe5c6fc6deee506b2b0d.jpg"/>]

In [46]:
# Getting the Url for the img
img_src = []

base_url = 'https://www.azlyrics.com'

for img in album_img:
  src = base_url + img['src']
  img_src.append(src)

img_src

['https://www.azlyrics.com/images/albums/301/e5d75335ac015e7cdf969156da6757c1.jpg',
 'https://www.azlyrics.com/images/albums/311/b68c5abf3224e6be1646cdf3cc654317.jpg',
 'https://www.azlyrics.com/images/albums/389/a804e1c4301d624da01765e30a8d7275.jpg',
 'https://www.azlyrics.com/images/albums/507/20be9d855606d7afd018e34e1c1e615a.jpg',
 'https://www.azlyrics.com/images/albums/880/62f6b1bbff0c287e401b40777a009bd8.jpg',
 'https://www.azlyrics.com/images/albums/114/0f186ced50cebe5367f3b0292752513a.jpg',
 'https://www.azlyrics.com/images/albums/125/f2db884a31babe5c6fc6deee506b2b0d.jpg']
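
String concatenation works here because every `src` is a site-relative path. As a hedged alternative, the standard library's `urllib.parse.urljoin` resolves a path against the page URL and leaves an already-absolute URL untouched. The `example.com` URL below is a made-up value for illustration.

```python
from urllib.parse import urljoin

page_url = 'https://www.azlyrics.com/s/samsmith.html'

# A site-relative path (as seen in the img tags above) is resolved against the site root.
relative_src = '/images/albums/301/e5d75335ac015e7cdf969156da6757c1.jpg'
print(urljoin(page_url, relative_src))

# An already-absolute URL is returned unchanged (hypothetical example URL).
absolute_src = 'https://example.com/cover.jpg'
print(urljoin(page_url, absolute_src))
```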

In [28]:
# @title Step 5: Getting the `List of songs` under each album
list_of_songs = doc.find_all('div', class_='listalbum-item')

# Creating a list of all the songs
song_list_and_url = {}

for song in list_of_songs:
    # Using find method to extract song name and link
    song_name = song.find('a').text.strip()
    song_link = base_url + song.find('a')['href']

    # Updating the dictionary
    song_list_and_url.update({song_name: song_link})

song_list_and_url


{'Safe With Me': 'https://www.azlyrics.com/lyrics/samsmith/safewithme.html',
 'Nirvana': 'https://www.azlyrics.com/lyrics/samsmith/nirvana.html',
 "I've Told You Now": 'https://www.azlyrics.com/lyrics/samsmith/ivetoldyounow.html',
 'Latch (Acoustic)': 'https://www.azlyrics.com/lyrics/samsmith/latchacoustic.html',
 'Money On My Mind': 'https://www.azlyrics.com/lyrics/samsmith/moneyonmymind.html',
 'Good Thing': 'https://www.azlyrics.com/lyrics/samsmith/goodthing.html',
 'Stay With Me': 'https://www.azlyrics.com/lyrics/samsmith/staywithme.html',
 'Leave Your Lover': 'https://www.azlyrics.com/lyrics/samsmith/leaveyourlover.html',
 "I'm Not The Only One": 'https://www.azlyrics.com/lyrics/samsmith/imnottheonlyone.html',
 'Like I Can': 'https://www.azlyrics.com/lyrics/samsmith/likeican.html',
 'Life Support': 'https://www.azlyrics.com/lyrics/samsmith/lifesupport.html',
 'Not In That Way': 'https://www.azlyrics.com/lyrics/samsmith/notinthatway.html',
 'Lay Me Down': 'https://www.azlyrics.com/
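
One caveat with a dictionary keyed on song name: if the same title appears more than once on a page, the later entry silently overwrites the earlier one. A minimal sketch of the effect, and of a list of records as one possible alternative, follows. The `(name, href)` pairs are made-up stand-ins for scraped values.

```python
base_url = 'https://www.azlyrics.com'

# Hypothetical (song name, href) pairs; 'Song A' appears twice on purpose.
scraped = [
    ('Song A', '/lyrics/artist/songa.html'),
    ('Song B', '/lyrics/artist/songb.html'),
    ('Song A', '/lyrics/artist/songaremix.html'),
]

# Dictionary keyed on name: the second 'Song A' replaces the first.
as_dict = {}
for name, href in scraped:
    as_dict[name] = base_url + href

# List of records: every entry is kept, in page order.
as_records = [{'song_name': name, 'song_link': base_url + href} for name, href in scraped]

print(len(as_dict), len(as_records))
```

The `song_dataset` function further down stores a list of records per album, which sidesteps this problem.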



---


## Now that we know how to extract the data we want, let's combine all the lines of code above into a `function` called `song_dataset`

In [47]:
from urllib.parse import urljoin

def song_dataset(url):
    # Step 1: Download the webpage
    response = requests.get(url)

    if response.status_code != 200:
        print(f"Failed to retrieve the page. Status code: {response.status_code}")
        return None

    # Step 2: Parse the HTML content
    doc = BeautifulSoup(response.text, 'html.parser')

    # Step 3: Get the artist name from the <h1> heading, e.g. 'Sam Smith Lyrics'.
    # str.strip(' Lyrics') would remove a set of characters rather than the suffix,
    # so the trailing ' Lyrics' is removed explicitly.
    artist_name = doc.h1.text.strip()
    if artist_name.endswith(' Lyrics'):
        artist_name = artist_name[:-len(' Lyrics')]

    # Step 4: Get the list of albums
    list_of_albums = doc.find_all('div', class_='album')
    albums_and_year = [album.text.strip() for album in list_of_albums]

    # Step 5: Get the album images.
    # urljoin resolves each site-relative src against the page URL,
    # so the function no longer depends on a global base_url.
    album_img = doc.find_all('img', class_='album-image')
    img_src = [urljoin(url, img['src']) for img in album_img]

    # Step 6: Get the list of songs for each album
    list_of_songs = doc.find_all('div', class_='listalbum-item')
    song_list_and_url = {}

    for song in list_of_songs:
        # The nearest preceding album heading is the album this song belongs to
        album = song.find_previous('div', class_='album').text.strip()
        link = song.find('a')
        song_name = link.text.strip()
        # Resolve the href against the page URL (plain url + href would produce
        # '.../s/samsmith.html/lyrics/...', which is not a valid song link)
        song_link = urljoin(url, link['href'])

        # Update the dictionary with the new structure
        if album not in song_list_and_url:
            song_list_and_url[album] = []
        song_list_and_url[album].append({'song_name': song_name, 'song_link': song_link})

    # Combine all the data into a dictionary
    result = {
        'Artist Name': artist_name,
        'Album Name (Year)': albums_and_year,
        'Cover Image Link': img_src,
        'Song List and Url': song_list_and_url
    }

    return result


### Testing out the function

Example using `Sam Smith`

In [48]:
# Example for "Sam Smith"
url_to_scrape = 'https://www.azlyrics.com/s/samsmith.html'
result_dataset = song_dataset(url_to_scrape)

# Print the result (or pass it on for further processing)
print(result_dataset)

{'Artist Name': 'Sam Smith', 'Album Name (Year)': ['EP: "Nirvana" (2013)', 'album: "In The Lonely Hour" (2014)', 'compilation: "The Lost Tapes - Remixed" (2015)', 'album: "The Thrill Of It All" (2017)', 'album: "Love Goes" (2020)', 'album: "Gloria" (2023)', 'EP: "A Lonely Christmas" (2023)', 'other songs:'], 'Cover Image Link': ['https://www.azlyrics.com/images/albums/301/e5d75335ac015e7cdf969156da6757c1.jpg', 'https://www.azlyrics.com/images/albums/311/b68c5abf3224e6be1646cdf3cc654317.jpg', 'https://www.azlyrics.com/images/albums/389/a804e1c4301d624da01765e30a8d7275.jpg', 'https://www.azlyrics.com/images/albums/507/20be9d855606d7afd018e34e1c1e615a.jpg', 'https://www.azlyrics.com/images/albums/880/62f6b1bbff0c287e401b40777a009bd8.jpg', 'https://www.azlyrics.com/images/albums/114/0f186ced50cebe5367f3b0292752513a.jpg', 'https://www.azlyrics.com/images/albums/125/f2db884a31babe5c6fc6deee506b2b0d.jpg'], 'Song List and Url': {'EP: "Nirvana" (2013)': [{'song_name': 'Safe With Me', 'song_link

### Example 2 -- Scraping the song dataset for `Rihanna`

In [49]:
# Example for "Rihanna"
url_to_scrape = 'https://www.azlyrics.com/r/rihanna.html'
result_dataset = song_dataset(url_to_scrape)

# Print the result (or pass it on for further processing)
print(result_dataset)

{'Artist Name': 'Rihanna', 'Album Name (Year)': ['album: "Music Of The Sun" (2005)', 'album: "A Girl Like Me" (2006)', 'album: "Good Girl Gone Bad" (2007)', 'album: "Rated R" (2009)', 'album: "Loud" (2010)', 'album: "Talk That Talk" (2011)', 'album: "Unapologetic" (2012)', 'soundtrack: "Home" (2015)', 'album: "Anti" (2016)', 'other songs:'], 'Cover Image Link': ['https://www.azlyrics.com/images/albums/486/c8c9fdd8c445e62a27502c065d281833.jpg', 'https://www.azlyrics.com/images/albums/540/581a721398e196e20775a722602f2690.jpg', 'https://www.azlyrics.com/images/albums/610/5f1d8b6d58c7c2cbd4dbe4b3c642cec0.jpg', 'https://www.azlyrics.com/images/albums/854/59938eea10f80a5838ab10fafb63d225.jpg', 'https://www.azlyrics.com/images/albums/106/0f2cf9308094841455ab0970302649af.jpg', 'https://www.azlyrics.com/images/albums/190/8642986c76fa48b9bc11e09451c59e9f.jpg', 'https://www.azlyrics.com/images/albums/249/1a1070f90be14d6305352a7b42bd460b.jpg', 'https://www.azlyrics.com/images/albums/370/84151a1359



---

## Writing the dataset into a CSV file

In [50]:
import csv
import os

def write_csv(data, path):
    # Check if the file already exists
    file_exists = os.path.isfile(path)

    with open(path, 'a', newline='', encoding='utf-8') as csvfile:
        # Create a CSV writer
        csv_writer = csv.writer(csvfile)

        # If the file doesn't exist, write the header
        if not file_exists:
            csv_writer.writerow(['Artist Name', 'Album Name (Year)', 'Cover Image Link', 'Song Name', 'Link to Song'])

        # Repeat artist_name for each row
        artist_name = data['Artist Name']

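        # Pair each album with its own cover image.
        # Assumption: covers appear on the page in the same order as the album headings;
        # groups without a cover (e.g. 'other songs:') get an empty string.
        covers = dict(zip(data['Album Name (Year)'], data['Cover Image Link']))

        # Iterate through each album
        for album, songs_and_links in data['Song List and Url'].items():
            # Repeat the album's cover link for each row in the same album
            img_src = covers.get(album, '')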

            # Iterate through each song in the album
            for song_info in songs_and_links:
                csv_writer.writerow([artist_name, album, img_src, song_info['song_name'], song_info['song_link']])
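
The function above opens the file in append mode and writes the header only when the file does not yet exist. The sketch below isolates that pattern so it can be tried without scraping anything. `append_rows`, the two-column header, and the artist and song values are placeholders for illustration.

```python
import csv
import os
import tempfile


def append_rows(path, header, rows):
    """Append rows to a CSV file, writing the header only when the file is new."""
    file_exists = os.path.isfile(path)
    with open(path, 'a', newline='', encoding='utf-8') as csvfile:
        writer = csv.writer(csvfile)
        if not file_exists:
            writer.writerow(header)
        writer.writerows(rows)


# Demo in a temporary directory so no real dataset is touched.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, 'demo.csv')
    header = ['Artist Name', 'Song Name']
    append_rows(demo_path, header, [['Artist One', 'Song A']])
    append_rows(demo_path, header, [['Artist Two', 'Song B']])

    with open(demo_path, newline='', encoding='utf-8') as csvfile:
        lines = list(csv.reader(csvfile))

print(lines)
```

Because the file is opened in append mode, re-running the same cell adds the same artist's rows a second time. Deleting the CSV file before a fresh run avoids duplicates.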


---

### Calling and using the CSV writer function

The `write_csv` function appends each new artist's albums, song names, and links to the end of the existing file, so repeated calls build up one large dataset.

An example is shown below using the `Sam Smith` and `Rihanna` pages.

In [51]:
# Example usage:
url_to_scrape = 'https://www.azlyrics.com/s/samsmith.html'
result_dataset = song_dataset(url_to_scrape)

# Specifying the path where you want to save the CSV file
csv_path = 'output_dataset.csv'

# Calling the write_csv function to save the dataset to a CSV file
write_csv(result_dataset, csv_path)


In [52]:
# Example usage:
url_to_scrape = 'https://www.azlyrics.com/r/rihanna.html'
result_dataset = song_dataset(url_to_scrape)

# Specifying the path where you want to save the CSV file
csv_path = 'output_dataset.csv'

# Calling the write_csv function to save the dataset to a CSV file
write_csv(result_dataset, csv_path)


Testing the entire code and the `write_csv` function to see if it works. This time, we're outputting to `output_dataset2.csv`.

### For Beyonce

In [53]:
# Songs of Beyonce
url_to_scrape = 'https://www.azlyrics.com/k/knowles.html'
result_dataset = song_dataset(url_to_scrape)

# Specifying the path where you want to save the CSV file
csv_path = 'output_dataset2.csv'

# Calling the write_csv function to save the dataset to a CSV file
write_csv(result_dataset, csv_path)


### For Cardi B

To be output to the same CSV file as Beyonce's

In [54]:
# Songs of Cardi B
url_to_scrape = 'https://www.azlyrics.com/c/cardi-b.html'
result_dataset = song_dataset(url_to_scrape)

# Specifying the path where you want to save the CSV file
csv_path = 'output_dataset2.csv'

# Calling the write_csv function to save the dataset to a CSV file
write_csv(result_dataset, csv_path)
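
The cells above repeat the same three steps for each artist. A hedged sketch of wrapping them in a loop is shown below. `scrape_many` and its `delay_seconds` parameter are our own additions, and the `fake_scrape` and `fake_write` stand-ins exist only so the sketch runs without network access.

```python
import time


def scrape_many(urls, scrape, write, csv_path, delay_seconds=1.0):
    """Scrape each URL with `scrape` and pass successful results to `write`."""
    saved = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # pause between consecutive requests
        data = scrape(url)
        if data is None:  # song_dataset returns None when the request fails
            continue
        write(data, csv_path)
        saved.append(url)
    return saved


# Stand-in callables (hypothetical) so the sketch runs offline.
def fake_scrape(url):
    return None if url.endswith('missing.html') else {'Artist Name': url}


written = []


def fake_write(data, path):
    written.append((data['Artist Name'], path))


done = scrape_many(['a.html', 'missing.html', 'b.html'],
                   fake_scrape, fake_write, 'demo.csv', delay_seconds=0)
print(done)
```

In this notebook the call would look like `scrape_many(list_of_artist_urls, song_dataset, write_csv, 'output_dataset2.csv')`, where `list_of_artist_urls` is a list you supply.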

---

This completes the project.