# ASSIGNMENT 1.1
# ADS 509 - APPLIED TEXT MINING 

# DIP R. BISTA


# ADS 509 Module 1: APIs and Web Scraping

This notebook has two parts. In the first part, you will scrape lyrics from AZLyrics.com. In the second part, you'll run code that verifies the completeness of your data pull.

For this assignment you have chosen two musical artists who have at least 20 songs with lyrics on AZLyrics.com. We start by pulling some information about them and analyzing it.


## General Assignment Instructions

These instructions are included in every assignment, to remind you of the coding standards for the class. Feel free to delete this cell after reading it.

One sign of mature code is conforming to a style guide. We recommend the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html). If you use a different style guide, please include a cell with a link.

Your code should be relatively easy-to-read, sensibly commented, and clean. Writing code is a messy process, so please be sure to edit your final submission. Remove any cells that are not needed or parts of cells that contain unnecessary code. Remove inessential `import` statements and make sure that all such statements are moved into the designated cell.

Make use of non-code cells for written commentary. These cells should be grammatical and clearly written. In some of these cells you will have questions to answer. The questions will be marked by a "Q:" and will have a corresponding "A:" spot for you. *Make sure to answer every question marked with a `Q:` for full credit.*


# Importing Libraries

In [39]:
import os
import datetime
import re

# for the lyrics scrape section
import requests
import time
from bs4 import BeautifulSoup
from collections import defaultdict, Counter
import random

In [40]:
# Use this cell for any import statements you add
# (os, time, random, requests, and BeautifulSoup are already imported in the cell above)
import shutil
from urllib.parse import urlparse

---

# Lyrics Scrape

This section asks you to pull data by scraping www.AZLyrics.com. In the notebooks where you do that work you are asked to store the data in specific ways.

In [41]:
artists = {'pinkfloyd':"https://www.azlyrics.com/p/pinkfloyd.html",
           'deeppurple':"https://www.azlyrics.com/d/deeppurple.html"}
# we'll use this dictionary to hold both the artist name and the link on AZlyrics

## A Note on Rate Limiting

The lyrics site, www.azlyrics.com, does not state an explicit maximum number of requests per time period, but in our testing too many requests in too short a time caused the site to stop returning lyrics pages. (Entertainingly, the page that gets returned seems to contain only the title of [a Tom Jones song](https://www.azlyrics.com/lyrics/tomjones/itsnotunusual.html).)

Whenever you call `requests.get` to retrieve a page, put a `time.sleep(5 + 10*random.random())` on the next line. This will help you not to get blocked. If you _do_ get blocked, which you can identify if the returned pages are not correct, just request a lyrics page through your browser. You'll be asked to perform a CAPTCHA and then your requests should start working again.
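
The pause-after-every-request rule can be wrapped in a small helper. This is a minimal sketch, not part of the assignment template: the names `random_delay` and `polite_get` are hypothetical, and the `fetch` and `sleep` parameters are assumptions added so the helper can be tried without network access. In the notebook they would be `requests.get` and `time.sleep`.

```python
import random
import time


def random_delay(base=5.0, jitter=10.0):
    """Return a pause in seconds, drawn uniformly from [base, base + jitter)."""
    return base + jitter * random.random()


def polite_get(url, fetch, sleep=time.sleep):
    """Fetch `url`, then pause before returning (hypothetical helper).

    `fetch` is any callable that takes a URL, for example `requests.get`.
    """
    response = fetch(url)
    sleep(random_delay())
    return response


# Offline demonstration: stand-ins replace the network call and the pause.
pauses = []
result = polite_get(
    "https://example.com/page.html",
    fetch=lambda u: f"fetched {u}",
    sleep=pauses.append,
)
print(result, pauses)
```

Keeping the pause inside the helper means no call site can forget it.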

## Part 1: Finding Links to Songs Lyrics

That general artist page has a list of all songs for that artist with links to the individual song pages.
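
Links on an artist page may be written relative to that page rather than as full URLs. The sketch below shows one way to turn such an `href` into a full song URL with the standard library's `urljoin`. The two sample `href` values are assumptions for illustration, not copied from the site.

```python
from urllib.parse import urljoin

artist_page = "https://www.azlyrics.com/p/pinkfloyd.html"

# Illustrative href values (assumed): one relative to the artist page,
# one relative to the site root.
sample_hrefs = [
    "../lyrics/pinkfloyd/bike.html",
    "/lyrics/pinkfloyd/bike.html",
]

# urljoin resolves each href against the page it was found on.
resolved = [urljoin(artist_page, href) for href in sample_hrefs]
for url in resolved:
    print(url)
```

Both forms resolve to the same URL, which makes this approach less fragile than slicing a fixed number of characters off the `href`.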

Q: Take a look at the `robots.txt` page on www.azlyrics.com. (You can read more about these pages [here](https://developers.google.com/search/docs/advanced/robots/intro).) Is the scraping we are about to do allowed or disallowed by this page? How do you know?

A: The scraping is allowed. The file at https://www.azlyrics.com/robots.txt lists the paths that crawlers may not request. For all user agents (`User-agent: *`) it disallows only `/lyricsdb/` and `/song/`, and it allows everything else (`Allow: /`). The artist pages (for example `/p/pinkfloyd.html`) and the song pages under `/lyrics/` match neither disallowed path, so the requests in this assignment are permitted. Only the user agent `008` is blocked from the whole site. The code cell below prints the file so that these rules can be verified.


Let's make sure we have enough lyrics pages to scrape.

In [42]:

# Print robots.txt so we can check whether scraping is allowed for this task.
robots_url = "https://www.azlyrics.com/robots.txt"
r_robots = requests.get(robots_url)
print("robots.txt file:")
print(r_robots.text)  

# Set up two containers:
# `artists` maps each artist (Pink Floyd and Deep Purple) to its AZLyrics page, and
# `lyrics_pages` will collect the song-page URLs for each artist.
artists = {'pinkfloyd': "https://www.azlyrics.com/p/pinkfloyd.html",
           'deeppurple': "https://www.azlyrics.com/d/deeppurple.html"}

lyrics_pages = defaultdict(list)

# Scrape each artist page in a loop:
# 1. Request the artist page from the URL above, then pause 5 to 15 seconds.
# 2. Parse the returned HTML with BeautifulSoup.
# 3. Keep only the links whose href contains "/lyrics/".
# 4. Build the full song URL and store it under the artist in `lyrics_pages`.
for artist, artist_page in artists.items():
    
    r = requests.get(artist_page)
    time.sleep(5 + 10 * random.random())

    
    soup = BeautifulSoup(r.text, 'html.parser')

    
    song_links = soup.find_all('a', href=True)

    
    for link in song_links:
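        href = link['href']
        if "/lyrics/" in href:
            # Keep everything from "/lyrics/" onward so that relative ("../lyrics/...")
            # and root-relative ("/lyrics/...") hrefs both yield a valid URL.
            full_link = f"https://www.azlyrics.com{href[href.index('/lyrics/'):]}"
            lyrics_pages[artist].append(full_link)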

    print(f"Found {len(lyrics_pages[artist])} songs for {artist}")

# Confirm that each selected artist has enough songs for this assignment.
# The check requires more than 20 unique song links per artist.
for artist, lp in lyrics_pages.items():
    assert len(set(lp)) > 20, f"Not enough lyrics pages for {artist}"

# Estimate how long the full scrape will take.
# Each request is followed by a pause that averages 10 seconds,
# so the estimate for each artist is (number of songs * 10) / 3600 hours.
for artist, links in lyrics_pages.items():
    num_songs = len(links)
    time_estimation = round(num_songs * 10 / 3600, 2)  
    print(f"For {artist}, we have {num_songs} songs.")
    print(f"The full pull will take approximately {time_estimation} hours.")

# `lyrics_pages` now holds the song-page URLs for each artist.
# The print is commented out to avoid a long series of URLs in the output.
# print(lyrics_pages)

robots.txt file:
User-agent: *
Disallow: /lyricsdb/
Disallow: /song/
Allow: /

User-agent: 008
Disallow: /

Found 155 songs for pinkfloyd
Found 226 songs for deeppurple
For pinkfloyd, we have 155 songs.
The full pull will take approximately 0.43 hours.
For deeppurple, we have 226 songs.
The full pull will take approximately 0.63 hours.
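
The robots.txt rules printed above can also be checked programmatically. This is a minimal sketch using the standard library's `urllib.robotparser`. The rules are copied from the output above, and the sample URLs are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt output printed above.
robots_txt = """\
User-agent: *
Disallow: /lyricsdb/
Disallow: /song/
Allow: /

User-agent: 008
Disallow: /
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Illustrative URLs (assumed): a song page, an artist page, and a disallowed path.
lyrics_allowed = parser.can_fetch("*", "https://www.azlyrics.com/lyrics/pinkfloyd/bike.html")
artist_allowed = parser.can_fetch("*", "https://www.azlyrics.com/p/pinkfloyd.html")
blocked_path = parser.can_fetch("*", "https://www.azlyrics.com/lyricsdb/x")

print(lyrics_allowed, artist_allowed, blocked_path)
```

Parsing the text directly avoids an extra request; `RobotFileParser` can also download the file itself via `set_url` and `read`.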


## Part 2: Pulling Lyrics

Now that we have the links to our lyrics pages, let's go scrape them! Here are the steps for this part.

1. Create an empty folder in our repo called "lyrics".
1. Iterate over the artists in `lyrics_pages`.
1. Create a subfolder in lyrics with the artist's name. For instance, if the artist was Cher you'd have `lyrics/cher/` in your repo.
1. Iterate over the pages.
1. Request the page and extract the lyrics from the returned HTML file using BeautifulSoup.
1. Use the function below, `generate_filename_from_link`, to create a filename based on the lyrics page, then write the lyrics to a text file with that name.
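
The storage part of these steps can be sketched without any network access. The helper name `save_lyrics` and the `base_dir` parameter are hypothetical and added only so the example can run in a temporary directory. The title and lyrics are placeholder text, not scraped content.

```python
import os
import tempfile


def save_lyrics(base_dir, artist, filename, title, lyrics):
    """Write one song to <base_dir>/lyrics/<artist>/<filename> and return the path."""
    artist_folder = os.path.join(base_dir, "lyrics", artist)
    # Creates both the "lyrics" folder and the artist subfolder if needed.
    os.makedirs(artist_folder, exist_ok=True)
    filepath = os.path.join(artist_folder, filename)
    with open(filepath, "w", encoding="utf-8") as outfile:
        outfile.write(title + "\n\n" + lyrics)
    return filepath


# Demonstration in a temporary directory with placeholder text.
with tempfile.TemporaryDirectory() as tmp:
    path = save_lyrics(tmp, "cher", "cher_example.txt",
                       "Example Title", "line one\nline two")
    relative_path = os.path.relpath(path, tmp)
    with open(path, encoding="utf-8") as infile:
        saved_text = infile.read()

print(relative_path)
```

Here `os.makedirs(..., exist_ok=True)` replaces separate `os.mkdir` calls, so the sketch does not fail if a folder already exists.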


In [43]:
# Generate a filename from a lyrics-page URL by removing parts of the URL
# and replacing the remaining punctuation.
def generate_filename_from_link(link):
    if not link:
        return None

    # Remove the scheme (http:// or https://), the .html extension, and "/lyrics/"
    name = link.replace("https://", "").replace("http://", "")
    name = name.replace(".html", "")
    name = name.replace("/lyrics/", "")

    # Replace the remaining periods and slashes with underscores
    name = name.replace(".", "_").replace("/", "_")

    # Finally, add the .txt extension
    name = name + ".txt"

    return name

# Extract the slug: the last segment of the URL path without ".html", which identifies the song.
# The URL is parsed first, and then its path is split on "/".
def extract_slug_from_url(url):
    parsed_url = urlparse(url)
    path_parts = parsed_url.path.split('/')
    if len(path_parts) >= 3:
        return path_parts[-1].replace('.html', '')
    return None

# Create the lyrics folder to store the lyrics.
# If the folder already exists, delete it and create a fresh, empty one.
if os.path.isdir("lyrics"):
    shutil.rmtree("lyrics/")  
os.mkdir("lyrics")  

# Record the start time and initialize the counter of successfully downloaded pages.
start = time.time()
total_pages = 0

# Loop over each artist and their individual songs:
# create a folder for each artist, rebuild each song URL from the artist key and the slug,
# and request the page to retrieve the lyrics.
# As in the cell above, pause 5 to 15 seconds after every request.
for artist, pages in lyrics_pages.items():
    # artist folder
    artist_folder = f"lyrics/{artist}"
    os.mkdir(artist_folder)

    for page in pages:
        try:
            # Reduce a full URL to its slug.
            if page.startswith("http://") or page.startswith("https://"):
                page = extract_slug_from_url(page)
                if not page:
                    print("Skipping malformed URL.")
                    continue

            # Rebuild the full URL from the artist key and the slug.
            full_url = f"https://www.azlyrics.com/lyrics/{artist}/{page}.html"

            # Request the page, then pause 5 to 15 seconds.
            r = requests.get(full_url)
            time.sleep(5 + 10 * random.random())
            
            # Skip this page if the request did not succeed.
            if r.status_code != 200:
                print(f"Failed to fetch lyrics for {artist} - {page}: Status code {r.status_code}")
                continue

            # Parse the lyrics page with BeautifulSoup.
            soup = BeautifulSoup(r.text, 'html.parser')

            # Use the <title> text as the song title, with a fallback if it is missing.
            title_tag = soup.find('title')
            title = title_tag.text.strip() if title_tag else 'Unknown Title'

            # Treat the first <div> with neither a class nor an id as the lyrics container.
            lyrics_div = soup.find('div', class_=None, id=None)
            lyrics = lyrics_div.get_text(separator='\n').strip() if lyrics_div else 'No Lyrics Found'

            # Build the filename with generate_filename_from_link and save the lyrics
            # as a text file in the matching artist folder.
            filename = generate_filename_from_link(full_url)
            filepath = os.path.join(artist_folder, filename)

            with open(filepath, 'w', encoding='utf-8') as file:
                file.write(title + "\n\n" + lyrics)

            total_pages += 1
            print(f"Saved lyrics for: {title}")

        except requests.exceptions.RequestException as e:
            print(f"Error fetching lyrics for {artist} - {page}: {e}")
            continue

# Report the total run time of the scrape in hours.
print(f"Total run time was {round((time.time() - start)/3600, 2)} hours.")


Saved lyrics for: Pink Floyd - Astronomy Domine Lyrics | AZLyrics.com
Saved lyrics for: Pink Floyd - Lucifer Sam Lyrics | AZLyrics.com
Saved lyrics for: Pink Floyd - Matilda Mother Lyrics | AZLyrics.com
Saved lyrics for: Pink Floyd - Flaming Lyrics | AZLyrics.com
Saved lyrics for: Pink Floyd - Take Up Thy Stethoscope And Walk Lyrics | AZLyrics.com
Saved lyrics for: Pink Floyd - The Gnome Lyrics | AZLyrics.com
Saved lyrics for: Pink Floyd - Chapter 24 Lyrics | AZLyrics.com
Saved lyrics for: Pink Floyd - The Scarecrow Lyrics | AZLyrics.com
Saved lyrics for: Pink Floyd - Bike Lyrics | AZLyrics.com
Saved lyrics for: Pink Floyd - Let There Be More Light Lyrics | AZLyrics.com
Saved lyrics for: Pink Floyd - Remember A Day Lyrics | AZLyrics.com
Saved lyrics for: Pink Floyd - Set The Controls For The Heart Of The Sun Lyrics | AZLyrics.com
Saved lyrics for: Pink Floyd - Corporal Clegg Lyrics | AZLyrics.com
Saved lyrics for: Pink Floyd - See-Saw Lyrics | AZLyrics.com
Saved lyrics for: Pink Floyd 

Saved lyrics for: Pink Floyd - Free Four Lyrics | AZLyrics.com
Saved lyrics for: Pink Floyd - Embryo Lyrics | AZLyrics.com
Saved lyrics for: Pink Floyd - Signs Of Life Lyrics | AZLyrics.com
Saved lyrics for: Pink Floyd - Learning To Fly Lyrics | AZLyrics.com
Saved lyrics for: Pink Floyd - The Dogs Of War Lyrics | AZLyrics.com
Saved lyrics for: Pink Floyd - One Slip Lyrics | AZLyrics.com
Saved lyrics for: Pink Floyd - On The Turning Away Lyrics | AZLyrics.com
Saved lyrics for: Pink Floyd - Yet Another Movie Lyrics | AZLyrics.com
Saved lyrics for: Pink Floyd - A New Machine (Part 1) Lyrics | AZLyrics.com
Saved lyrics for: Pink Floyd - A New Machine (Part 2) Lyrics | AZLyrics.com
Saved lyrics for: Pink Floyd - Sorrow Lyrics | AZLyrics.com
Saved lyrics for: Pink Floyd - What Do You Want From Me Lyrics | AZLyrics.com
Saved lyrics for: Pink Floyd - Poles Apart Lyrics | AZLyrics.com
Saved lyrics for: Pink Floyd - A Great Day For Freedom Lyrics | AZLyrics.com
Saved lyrics for: Pink Floyd - Wea

Saved lyrics for: Deep Purple - A Gypsy's Kiss Lyrics | AZLyrics.com
Saved lyrics for: Deep Purple - Wasted Sunsets Lyrics | AZLyrics.com
Saved lyrics for: Deep Purple - Hungry Daze Lyrics | AZLyrics.com
Saved lyrics for: Deep Purple - Not Responsible Lyrics | AZLyrics.com
Saved lyrics for: Deep Purple - Bad Attitude Lyrics | AZLyrics.com
Saved lyrics for: Deep Purple - The Unwritten Law Lyrics | AZLyrics.com
Saved lyrics for: Deep Purple - Call Of The Wild Lyrics | AZLyrics.com
Saved lyrics for: Deep Purple - Mad Dog Lyrics | AZLyrics.com
Saved lyrics for: Deep Purple - Black & White Lyrics | AZLyrics.com
Saved lyrics for: Deep Purple - Hard Lovin' Woman Lyrics | AZLyrics.com
Saved lyrics for: Deep Purple - The Spanish Archer Lyrics | AZLyrics.com
Saved lyrics for: Deep Purple - Strangeways Lyrics | AZLyrics.com
Saved lyrics for: Deep Purple - Mitzi Dupree Lyrics | AZLyrics.com
Saved lyrics for: Deep Purple - Dead Or Alive Lyrics | AZLyrics.com
Saved lyrics for: Deep Purple - King Of 

Saved lyrics for: Deep Purple - Watching The River Flow Lyrics | AZLyrics.com
Saved lyrics for: Deep Purple - Let The Good Times Roll Lyrics | AZLyrics.com
Saved lyrics for: Deep Purple - Dixie Chicken Lyrics | AZLyrics.com
Saved lyrics for: Deep Purple - Shapes Of Things Lyrics | AZLyrics.com
Saved lyrics for: Deep Purple - The Battle Of New Orleans Lyrics | AZLyrics.com
Saved lyrics for: Deep Purple - Lucifer Lyrics | AZLyrics.com
Saved lyrics for: Deep Purple - White Room Lyrics | AZLyrics.com
Saved lyrics for: Deep Purple - Caught In The Act Lyrics | AZLyrics.com
Saved lyrics for: Deep Purple - Show Me Lyrics | AZLyrics.com
Saved lyrics for: Deep Purple - A Bit On The Side Lyrics | AZLyrics.com
Saved lyrics for: Deep Purple - Sharp Shooter Lyrics | AZLyrics.com
Saved lyrics for: Deep Purple - Portable Door Lyrics | AZLyrics.com
Saved lyrics for: Deep Purple - Old-Fangled Thing Lyrics | AZLyrics.com
Saved lyrics for: Deep Purple - If I Were You Lyrics | AZLyrics.com
Saved lyrics for
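
To see what the two helpers in the cell above produce, they can be traced on a single URL. The sketch below repeats their logic in condensed form so that it runs on its own. The sample URL is an illustrative assumption.

```python
from urllib.parse import urlparse


def generate_filename_from_link(link):
    """Same string replacements as the notebook helper, condensed."""
    if not link:
        return None
    name = link.replace("https://", "").replace("http://", "")
    name = name.replace(".html", "").replace("/lyrics/", "")
    name = name.replace(".", "_").replace("/", "_")
    return name + ".txt"


def extract_slug_from_url(url):
    """Return the last path segment without its .html extension."""
    path_parts = urlparse(url).path.split('/')
    if len(path_parts) >= 3:
        return path_parts[-1].replace('.html', '')
    return None


# Illustrative URL (assumed) in the form the download loop constructs.
url = "https://www.azlyrics.com/lyrics/pinkfloyd/bike.html"
filename = generate_filename_from_link(url)
slug = extract_slug_from_url(url)
print(filename)  # www_azlyrics_compinkfloyd_bike.txt
print(slug)      # bike
```

Because `/lyrics/` is removed before the slashes are replaced, the host name runs directly into the artist name in the resulting filename.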

---

# Evaluation

This assignment asks you to pull data by scraping www.AZLyrics.com. After you have finished the above sections, run all the cells in this notebook. Print this to PDF and submit it, per the instructions.

In [34]:
# Simple word extractor from Peter Norvig: https://norvig.com/spell-correct.html
def words(text):
    return re.findall(r'\w+', text.lower())
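
A short usage sketch of `words` shows how the counts in the evaluation below come about. The function is repeated so the example runs on its own, and the sample sentence is made up for illustration.

```python
import re
from collections import Counter


def words(text):
    return re.findall(r'\w+', text.lower())


sample = "The cat saw the other cat. The end!"
tokens = words(sample)

total_words = len(tokens)          # every token counts, including repeats
unique_words = len(set(tokens))    # distinct tokens only
top_two = Counter(tokens).most_common(2)

print(tokens)
print(total_words, unique_words, top_two)

# Apostrophes are not word characters, so contractions split in two.
contraction = words("Don't stop")
print(contraction)
```

Text is lowercased before matching, so `The` and `the` count as the same word. Because contractions are split, the word counts are only approximate.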

## Checking Lyrics

The output from your lyrics scrape should be stored in files located in this path from the directory:
`/lyrics/[Artist Name]/[filename from URL]`. This code summarizes the information at a high level to help the instructor evaluate your work.

In [46]:
# Summarize the scrape for each artist:
# the number of files, the total word count, and the number of unique words in the lyrics.

artist_folders = os.listdir("lyrics/")
artist_folders = [f for f in artist_folders if os.path.isdir("lyrics/" + f)]

for artist in artist_folders:
    artist_files = os.listdir("lyrics/" + artist)
    artist_files = [f for f in artist_files if 'txt' in f or 'csv' in f or 'tsv' in f]

    print(f"\nFor {artist}, we have {len(artist_files)} files.")

    artist_words = []

    for f_name in artist_files:
        with open(f"lyrics/{artist}/{f_name}", 'r', encoding='utf-8', errors='ignore') as infile:
            artist_words.extend(words(infile.read()))

    print(f"For {artist}, we have roughly {len(artist_words)} words, {len(set(artist_words))} are unique.")



For deeppurple, we have 226 files.
For deeppurple, we have roughly 45309 words, 3685 are unique.

For pinkfloyd, we have 145 files.
For pinkfloyd, we have roughly 23954 words, 3233 are unique.
