# ADS 509 Module 1: APIs and Web Scraping
## Halee Staggs
### Disclaimer: ChatGPT4 was used as a tool to assist with this assignment. I used the HTML source code from the azlyrics website, along with the comment prompts for the desired output, to create most of the code for this assignment. Some of the code produced by ChatGPT4 had a few errors, and I troubleshot them to ensure the output was accurate and reliable. The code produced by ChatGPT4 is labeled in the code block comments.

# Importing Libraries

In [1]:
import os
import datetime
import re

# for the lyrics scrape section
import requests
import time
from bs4 import BeautifulSoup
from collections import defaultdict, Counter
import random

In [15]:
# Use this cell for any import statements you add
import shutil

---

# Lyrics Scrape

This section asks you to pull data by scraping www.AZLyrics.com. In the notebooks where you do that work you are asked to store the data in specific ways. 

In [5]:
# Define artist name, and URL for artist with correct alphabetical folder 
artists = {'chappell':"https://www.azlyrics.com/c/chappellroan.html",
           'nav':"https://www.azlyrics.com/n/nav.html"} 
# we'll use this dictionary to hold both the artist name and the link on AZlyrics

In [6]:
# Confirm dictionary
artists.items()

dict_items([('chappell', 'https://www.azlyrics.com/c/chappellroan.html'), ('nav', 'https://www.azlyrics.com/n/nav.html')])

## Part 1: Finding Links to Songs Lyrics

Q: Take a look at the `robots.txt` page on www.azlyrics.com. (You can read more about these pages [here](https://developers.google.com/search/docs/advanced/robots/intro).) Is the scraping we are about to do allowed or disallowed by this page? How do you know? 

A: Yes, the scraping is allowed. A site's crawling rules can be checked by appending "/robots.txt" to the end of its main URL. For all user agents (`User-agent: *`), the only disallowed paths are `/lyricsdb/` and `/song/`; everything else, including the individual artist pages and the `/lyrics/` song pages requested here, falls under `Allow: /`. The one exception is user agent 008, which is not allowed to scrape anything.
* www.azlyrics.com/robots.txt
* User-agent: *
* Disallow: /lyricsdb/
* Disallow: /song/
* Allow: /

* User-agent: 008
* Disallow: /
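
The same check can be done programmatically. The following is a minimal sketch using the standard library's `urllib.robotparser`, assuming the rules are exactly the ones quoted above (they are pasted in as a string rather than fetched, so the live file may differ). The helper name `is_allowed` and the user agent string `ads509-notebook` are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt excerpt quoted above (not fetched live).
ROBOTS_TXT = """\
User-agent: *
Disallow: /lyricsdb/
Disallow: /song/
Allow: /

User-agent: 008
Disallow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_allowed(url, user_agent="ads509-notebook"):
    """Return True if the quoted rules permit `user_agent` to fetch `url`.

    The default user agent is a hypothetical name, not one used in this notebook.
    """
    return parser.can_fetch(user_agent, url)


# Artist and lyrics pages fall under "Allow: /" for general user agents.
artist_page_ok = is_allowed("https://www.azlyrics.com/c/chappellroan.html")
lyrics_page_ok = is_allowed("https://www.azlyrics.com/lyrics/chappellroan/casual.html")
# Paths under /lyricsdb/ are explicitly disallowed.
database_ok = is_allowed("https://www.azlyrics.com/lyricsdb/anything")
# User agent 008 is blocked from everything.
agent_008_ok = is_allowed("https://www.azlyrics.com/c/chappellroan.html", user_agent="008")

print(artist_page_ok, lyrics_page_ok, database_ok, agent_008_ok)
```

In a live setting, `RobotFileParser.set_url(...)` followed by `read()` would load the current file instead of a pasted copy.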


In [9]:
# Let's set up a dictionary of lists to hold our links
lyrics_pages = defaultdict(list)

for artist, artist_page in artists.items():
    try:
        # Request the artist page
        response = requests.get(artist_page)
        # Pause the execution for a short, random period of time
        time.sleep(5 + 10*random.random())
        
        # THE FOLLOWING CODE IN THIS BLOCK WAS PRODUCED WITH HELP FROM CHATGPT4
        # Check if the request was successful
        if response.status_code == 200:
            # Parse the HTML content
            soup = BeautifulSoup(response.text, 'html.parser')
            # Find all links within 'div' elements having class 'listalbum-item'
            links = soup.find_all('div', class_='listalbum-item')
            # Extract 'href' attributes from each link and store them
            for link in links:
                url = link.find('a').get('href')
                full_url = f'https://www.azlyrics.com{url}' if url.startswith('/') else url
                lyrics_pages[artist].append(full_url)
        else:
            print(f"Failed to retrieve page for {artist} with status code {response.status_code}")
    except Exception as e:
        print(f"An error occurred while processing {artist}: {e}")

# Output the dictionary to see which URLs were added
print(lyrics_pages)

defaultdict(<class 'list'>, {'chappell': ['https://www.azlyrics.com/lyrics/chappellroan/dieyoung.html', 'https://www.azlyrics.com/lyrics/chappellroan/goodhurt.html', 'https://www.azlyrics.com/lyrics/chappellroan/meantime.html', 'https://www.azlyrics.com/lyrics/chappellroan/sugarhigh.html', 'https://www.azlyrics.com/lyrics/chappellroan/badforyou.html', 'https://www.azlyrics.com/lyrics/chappellroan/femininomenon.html', 'https://www.azlyrics.com/lyrics/chappellroan/redwinesupernova.html', 'https://www.azlyrics.com/lyrics/chappellroan/aftermidnight.html', 'https://www.azlyrics.com/lyrics/chappellroan/coffee.html', 'https://www.azlyrics.com/lyrics/chappellroan/casual.html', 'https://www.azlyrics.com/lyrics/chappellroan/supergraphicultramoderngirl.html', 'https://www.azlyrics.com/lyrics/chappellroan/hottogo.html', 'https://www.azlyrics.com/lyrics/chappellroan/mykinkiskarma.html', 'https://www.azlyrics.com/lyrics/chappellroan/pictureyou.html', 'https://www.azlyrics.com/lyrics/chappellroan/kal
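
The cell above builds absolute URLs by prefixing the site root when an `href` starts with `/`. As an alternative sketch (not what the cell does), the standard library's `urllib.parse.urljoin` resolves any `href` against the page it came from, which also handles relative links such as `../lyrics/...`. The helper name `absolutize` and the sample `href` values are assumptions for illustration.

```python
from urllib.parse import urljoin

ARTIST_PAGE = "https://www.azlyrics.com/c/chappellroan.html"


def absolutize(href, page_url=ARTIST_PAGE):
    """Resolve an href found on `page_url` into an absolute URL."""
    return urljoin(page_url, href)


# Root-relative link, the case the cell above handles with startswith('/')
root_relative = absolutize("/lyrics/chappellroan/casual.html")
# Already-absolute link is returned unchanged
already_absolute = absolutize("https://www.azlyrics.com/lyrics/chappellroan/coffee.html")
# Page-relative link, resolved against the /c/ folder of the artist page
page_relative = absolutize("../lyrics/chappellroan/hottogo.html")

print(root_relative)
print(already_absolute)
print(page_relative)
```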

In [10]:
# Look at lyric pages dictionary in a more organized way
lyrics_pages

defaultdict(list,
            {'chappell': ['https://www.azlyrics.com/lyrics/chappellroan/dieyoung.html',
              'https://www.azlyrics.com/lyrics/chappellroan/goodhurt.html',
              'https://www.azlyrics.com/lyrics/chappellroan/meantime.html',
              'https://www.azlyrics.com/lyrics/chappellroan/sugarhigh.html',
              'https://www.azlyrics.com/lyrics/chappellroan/badforyou.html',
              'https://www.azlyrics.com/lyrics/chappellroan/femininomenon.html',
              'https://www.azlyrics.com/lyrics/chappellroan/redwinesupernova.html',
              'https://www.azlyrics.com/lyrics/chappellroan/aftermidnight.html',
              'https://www.azlyrics.com/lyrics/chappellroan/coffee.html',
              'https://www.azlyrics.com/lyrics/chappellroan/casual.html',
              'https://www.azlyrics.com/lyrics/chappellroan/supergraphicultramoderngirl.html',
              'https://www.azlyrics.com/lyrics/chappellroan/hottogo.html',
              'https://w

Let's make sure we have enough lyrics pages to scrape. 

In [11]:
for artist, lp in lyrics_pages.items() :
    assert(len(set(lp)) > 20) 

In [12]:
# Let's see how long it's going to take to pull these lyrics 
# if we're waiting `5 + 10*random.random()` seconds 
for artist, links in lyrics_pages.items() : 
    print(f"For {artist} we have {len(links)}.")
    print(f"The full pull for this artist will take {round(len(links)*10/3600,2)} hours.")

For chappell we have 24.
The full pull for this artist will take 0.07 hours.
For nav we have 166.
The full pull for this artist will take 0.46 hours.
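
The factor of 10 in the estimate is the expected pause per request: `random.random()` averages 0.5, so `5 + 10*random.random()` averages 5 + 10 × 0.5 = 10 seconds. A small sketch of that arithmetic follows; the function name `estimate_hours` and its parameters are hypothetical, and the estimate ignores the time spent on the request itself.

```python
def estimate_hours(n_links, base_seconds=5, jitter_seconds=10):
    """Expected total sleep time in hours for `n_links` requests.

    Each pause is base_seconds + jitter_seconds * U, with U uniform on [0, 1),
    so the expected pause is base_seconds + jitter_seconds / 2.
    """
    expected_pause = base_seconds + jitter_seconds / 2
    return n_links * expected_pause / 3600


# Link counts taken from the output above
chappell_hours = round(estimate_hours(24), 2)
nav_hours = round(estimate_hours(166), 2)
print(chappell_hours, nav_hours)
```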


## Part 2: Pulling Lyrics

In [13]:
# Create a function to produce file names in the repo directory

def generate_filename_from_link(link) :
    
    if not link :
        return None
    
    # drop the http or https and the html
    name = link.replace("https://www.azlyrics.com", "").replace("http","")
    name = name.replace(".html","")  # chain on `name`, not `link`, so the domain stays stripped

    # drop the lyrics folder designation
    name = name.replace("/lyrics/","")
    
    # Replace useless characters with UNDERSCORE
    name = name.replace("://","").replace(".","_").replace("/","_")
    
    # tack on .txt
    name = name + ".txt"
    
    return(name)  # Only left with artist name and song name plus .txt
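
For comparison, here is a sketch of the same intended mapping (artist name and song name plus `.txt`) built on `urllib.parse.urlparse`, which separates the path from the scheme and domain before any string replacement. The function name `filename_from_url` is hypothetical and is not used elsewhere in this notebook.

```python
from urllib.parse import urlparse


def filename_from_url(url):
    """Turn a lyrics URL into an '<artist>_<song>.txt' style file name."""
    if not url:
        return None

    # Keep only the path, e.g. "/lyrics/chappellroan/casual.html"
    path = urlparse(url).path

    # Drop the .html extension and the /lyrics/ folder designation
    if path.endswith(".html"):
        path = path[: -len(".html")]
    if path.startswith("/lyrics/"):
        path = path[len("/lyrics/"):]

    # Replace remaining separators with underscores and tack on .txt
    return path.strip("/").replace("/", "_").replace(".", "_") + ".txt"


example_name = filename_from_url("https://www.azlyrics.com/lyrics/chappellroan/casual.html")
print(example_name)
```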

In [16]:
# Make the lyrics folder here with functionality that checks to see if the folder exists. 
# If it does, then use shutil.rmtree to remove it and create a new one.
if os.path.isdir("lyrics") : 
    shutil.rmtree("lyrics/")

os.mkdir("lyrics")
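
The remove-then-recreate pattern above can be wrapped in a small helper. This is a sketch only; the name `reset_directory` is an assumption, and the demonstration runs inside a temporary directory so it does not touch the real `lyrics` folder.

```python
import os
import shutil
import tempfile


def reset_directory(path):
    """Delete `path` (and everything in it) if it exists, then create it empty."""
    if os.path.isdir(path):
        shutil.rmtree(path)
    os.makedirs(path)


# Demonstrate in a scratch location rather than the working directory
with tempfile.TemporaryDirectory() as scratch:
    target = os.path.join(scratch, "lyrics")

    reset_directory(target)                                  # first call creates the folder
    open(os.path.join(target, "stale.txt"), "w").close()     # simulate output from an old run
    reset_directory(target)                                  # second call wipes the old output

    leftover = os.listdir(target)

print(leftover)
```

Starting from an empty folder on every run keeps files from an earlier scrape from being counted in the evaluation step.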

In [21]:
# Commented out the stub because the links provide full URL already
# url_stub = "https://www.azlyrics.com" 
start = time.time()

total_pages = 0 


# THE FOLLOWING CODE IN THIS BLOCK WAS PRODUCED WITH HELP FROM CHATGPT4
# Iterate over artist/song url dictionary
for artist, urls in lyrics_pages.items():
    artist_dir = os.path.join("lyrics", artist.replace(" ", "_").lower())
    if not os.path.exists(artist_dir):
        os.makedirs(artist_dir)  # Check if artist file path exists, if not, add new file path folder for each artist

    for link in urls[:24]:  # Ensure only the first 24 song links are processed
        full_url = link  # Links from dictionary
        try:
            response = requests.get(full_url)
            time.sleep(5 + 10 * random.random())  # Randomized sleep to prevent being blocked
            
            soup = BeautifulSoup(response.text, 'html.parser')  # Parse out html content
            
            # Extract the title from the <div class="div-share"><h1>"Song Name" lyrics</h1></div>
            # These tags can be identified using inspection on web page
            title_tag = soup.find('div', class_='div-share')
            title = title_tag.h1.text.replace(' lyrics', '') if title_tag and title_tag.h1 else 'No Title'
            
            # Extract lyrics from the div directly containing the lyrics text
            lyrics_div = soup.find('div', class_=False, id=False)  # Assuming the lyrics div has no class or id
            lyrics = lyrics_div.get_text('\n', strip=True) if lyrics_div else 'No Lyrics'
            
            # Generate output files with artist song name and lyric content
            filename = generate_filename_from_link(link)
            with open(os.path.join(artist_dir, filename), 'w', encoding='utf-8') as file:
                file.write(f"{title}\n\n{lyrics}")
            
            total_pages += 1

        except requests.RequestException as e:
            print(f"Failed to retrieve page {full_url}: {e}")  # Error warning if not successful

# Track computation time in hours
end = time.time()            
hours_elapsed = (end - start) / 3600  # Converting seconds to hours
print(f"Processed {total_pages} pages in {hours_elapsed:.2f} hours.")

Processed 48 pages in 0.14 hours.


---

# Evaluation

This assignment asks you to pull data by scraping www.AZLyrics.com. After you have finished the above sections, run all the cells in this notebook. Print this to PDF and submit it, per the instructions.

In [19]:
# Simple word extractor from Peter Norvig: https://norvig.com/spell-correct.html
def words(text): 
    return re.findall(r'\w+', text.lower())
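
A quick illustration of what `words` returns and how the total and unique counts reported below are derived. The function is repeated so the snippet runs on its own, and the sample sentence is made up rather than taken from any lyrics file.

```python
import re
from collections import Counter


def words(text):
    # Lowercase the text, then pull out runs of word characters
    return re.findall(r'\w+', text.lower())


sample = "The cat sat. The CAT ran!"
tokens = words(sample)

total_words = len(tokens)          # every token counts
unique_words = len(set(tokens))    # case-folded distinct tokens
top_two = Counter(tokens).most_common(2)

print(tokens)
print(total_words, unique_words, top_two)
```

Because the text is lowercased first, "The" and "the" count as one unique word, and punctuation is dropped entirely.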

## Checking Lyrics 

The output from your lyrics scrape should be stored in files located in this path from the directory:
`/lyrics/[Artist Name]/[filename from URL]`. This code summarizes the information at a high level to help the instructor evaluate your work. 

In [22]:
artist_folders = os.listdir("lyrics/")
artist_folders = [f for f in artist_folders if os.path.isdir("lyrics/" + f)]

for artist in artist_folders : 
    artist_files = os.listdir("lyrics/" + artist)
    artist_files = [f for f in artist_files if 'txt' in f or 'csv' in f or 'tsv' in f]

    print(f"For {artist} we have {len(artist_files)} files.")

    artist_words = []

    for f_name in artist_files : 
        with open("lyrics/" + artist + "/" + f_name, encoding="utf-8") as infile :  # match the utf-8 encoding used when writing
            artist_words.extend(words(infile.read()))

            
    print(f"For {artist} we have roughly {len(artist_words)} words, {len(set(artist_words))} are unique.")


For chappell we have 24 files.
For chappell we have roughly 8185 words, 1026 are unique.
For nav we have 22 files.
For nav we have roughly 10984 words, 1216 are unique.
