# 2023.09.29 - Data Science for Music: Webscraping Lyrics with Python (BeautifulSoup)
## Data Science Club at NCSU
### Rebecca Seifert

#### Best Practices for Successful Webscraping:
- Use an API when available.
- Check the site's robots.txt file (<site url>/robots.txt) for crawl delays, disallowed pages, and other restrictions.
- Build in delays and limit requests out of respect for the site owner.
- Take care with sensitive data: scrape only the data you need and are permitted to access, store it appropriately, etc.
- Work iteratively. Get one page and save it. Write, test, and rewrite your script (data processing) against that saved page until it works. Then request each remaining page once, save them all, and run your script on the saved files.
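
The robots.txt check above can be sketched with the standard library's `urllib.robotparser`. The robots.txt content, the example.com URLs, the helper names, and the 5-second fallback delay below are all hypothetical, chosen so the sketch runs offline; they are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the sketch runs without a network call.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Crawl-delay: 10
Disallow: /private/
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser, user_agent="*", default_seconds=5.0):
    """Use the site's crawl delay if it declares one, else an assumed default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default_seconds


parser = build_parser(SAMPLE_ROBOTS_TXT)
allowed = parser.can_fetch("*", "https://example.com/lyrics/page.html")
blocked = parser.can_fetch("*", "https://example.com/private/page.html")
delay = polite_delay(parser)
print(allowed, blocked, delay)
```

Against a live site you would instead call `parser.set_url("<site url>/robots.txt")` followed by `parser.read()` before asking `can_fetch`.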

#### Links to Documentation / Tutorials
- Beautiful Soup 4: https://pypi.org/project/beautifulsoup4/
- CSV: https://realpython.com/python-csv/
- Pandas: https://pandas.pydata.org/docs/
- Requests: https://requests.readthedocs.io/en/latest/
- Working directory: https://note.nkmk.me/en/python-os-getcwd-chdir/

#### Sites Used
- www.last.fm
- www.azlyrics.com

These methods can be applied to other sites with appropriate tweaks. These particular sites were chosen to meet project-specific goals and requirements, so you may well find more direct sources of similar information.

### Set Up

In [11]:
### import needed packages
# from <library> import <package> (as <reference name>)
### ***must install libraries/packages before importing/using***

from bs4 import BeautifulSoup, NavigableString, Tag # for navigating HTML
import csv # for reading input/writing output
import os # for setting working directory
import pandas as pd # for wrangling data
import requests # for getting HTML

In [12]:
### set up working directory
# replace string with your own folder location

os.chdir("/home/jupyter-rjseifer@ncsu.edu/2023.09.29")
os.getcwd()

'/home/jupyter-rjseifer@ncsu.edu/2023.09.29'

### Generating Addresses

In [2]:
### generate URLs and file names
### *You can simply list the URLs you want up front; this was originally built with the end goal of automating a daily report.
# url = "<link>"

### Mine came from the Top 10 (All Time) Songs from Top 3 Artists from last.fm in October of 2022
###  https://www.last.fm/charts
###   https://www.last.fm/music/Taylor+Swift
###   https://www.last.fm/music/Drake
###   https://www.last.fm/music/BTS

songs = {'taylorswift': ['lovestory',
                         'blankspace',
                         'shakeitoff',
                         'youbelongwithme',
                         'weareneverevergettingbacktogether',
                         'style',
                         'wildestdreams',
                         'cardigan',
                         'lookwhatyoumademedo',
                         'badblood'],
         'drake': ['bestieverhad',
                   'forever',
                   'onedance',
                   'hotlinebling',
                   'over',
                   'holdonweregoinghome',
                   'passionfruit',
                   'godsplan',
                   'takecare',
                   'niceforwhat'],
         'bangtanboys': ['dynamite',
                         'boywithluv',
                         'butter',
                         'fakelove',
                         'dna',
                         'bloodsweattears',
                         'idol',
                         'blackswan',
                         'euphoriathemeofloveyourselfwonder',
                         'on']}


### Generating corresponding strings following AZLyrics (https://www.azlyrics.com/)
### format and desired file names
###  URLs: https://www.azlyrics.com/lyrics/<artistname>/<songtitle>.html
###  Raw files: rawHTML/<artistname>/<songtitle>.html
###  Final files: lyrics_<artistname>_<songtitle>.txt
urls = []
rawFiles = []
saveFiles = []
for key in songs:
    for title in songs[key]:
        urls.append('https://www.azlyrics.com/lyrics/' + key + '/' + title + '.html')
        rawFiles.append("rawHTML/" + key + "/" + title + ".html")
        saveFiles.append("lyrics_" + key + "_" + title + ".txt")

### Checking urls, raw file names, and final file names
print(urls[0], '\n', rawFiles[0], '\n', saveFiles[0])
print(urls[10], '\n', rawFiles[10], '\n', saveFiles[10])
print(urls[20], '\n', rawFiles[20], '\n', saveFiles[20])

https://www.azlyrics.com/lyrics/taylorswift/lovestory.html 
 rawHTML/taylorswift/lovestory.html 
 lyrics_taylorswift_lovestory.txt
https://www.azlyrics.com/lyrics/drake/bestieverhad.html 
 rawHTML/drake/bestieverhad.html 
 lyrics_drake_bestieverhad.txt
https://www.azlyrics.com/lyrics/bangtanboys/dynamite.html 
 rawHTML/bangtanboys/dynamite.html 
 lyrics_bangtanboys_dynamite.txt
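
If you start from display names rather than pre-formatted strings, a small helper can produce the same three paths. This is only a sketch: the rule "lowercase, keep ASCII letters and digits" is inferred from the song strings in the dictionary above, `az_slug` and `build_paths` are hypothetical names, and some artist names will not follow the rule (BTS appears as `bangtanboys`, for example), so check generated URLs by hand.

```python
import re


def az_slug(name):
    """Lowercase a name and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def build_paths(artist, title):
    """Return (url, raw HTML path, lyrics file name) for one song."""
    a, t = az_slug(artist), az_slug(title)
    url = "https://www.azlyrics.com/lyrics/" + a + "/" + t + ".html"
    raw_file = "rawHTML/" + a + "/" + t + ".html"
    save_file = "lyrics_" + a + "_" + t + ".txt"
    return url, raw_file, save_file


print(build_paths("Taylor Swift", "Love Story"))
print(az_slug("Hold On, We're Going Home"))
```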


### Storing All Pages as HTML Files

In [15]:
### Testing file creation for one page

os.makedirs(os.path.dirname(rawFiles[0]), exist_ok=True) # create "rawHTML/<artist>" folder if missing
html = requests.get(urls[0])
with open(rawFiles[0], 'wb') as file:
    file.write(html.content)

### successful if no errors thrown
### if you still run into issues with permissions, you may want to manually create "rawHTML/<artist>" folders for each artist you're using

In [16]:
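### Saving all pages to files
### (pausing between requests to respect the site; the 5-second delay is an arbitrary choice)

import time # for delaying requests

for i in range(len(urls)):
    os.makedirs(os.path.dirname(rawFiles[i]), exist_ok=True)
    html = requests.get(urls[i])
    with open(rawFiles[i], 'wb') as file:
        file.write(html.content)
    time.sleep(5)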
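
Because the workflow is iterative, it helps if re-running the download cell never re-requests pages that are already on disk. The sketch below is one way to do that. `cache_pages` and its parameters are hypothetical names, and the fetch step is passed in as a function so the logic can be tried without touching the network.

```python
import os
import time


def cache_pages(urls, paths, fetch, delay_seconds=5.0):
    """Save each URL to its path unless the file already exists.

    fetch: a function taking a URL and returning the page as bytes.
    Returns the list of paths that were actually downloaded.
    """
    fetched = []
    for url, path in zip(urls, paths):
        if os.path.exists(path):
            continue  # already cached, so no request is made
        folder = os.path.dirname(path)
        if folder:
            os.makedirs(folder, exist_ok=True)
        with open(path, "wb") as f:
            f.write(fetch(url))
        fetched.append(path)
        time.sleep(delay_seconds)  # pause only after a real request
    return fetched


# With the lists built earlier, usage might look like:
# cache_pages(urls, rawFiles, fetch=lambda u: requests.get(u, timeout=30).content)
```

A second run returns an empty list and makes no requests, which keeps repeated testing from counting against the site's tolerance for traffic.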

### Navigating Files

In [17]:
### Testing navigation on one page file

file = open(rawFiles[0], 'rb')
soup = BeautifulSoup(file.read(), "html.parser")
file.close()
# print(soup)
lyrics = soup.body.find('div', {"class": "ringtone"}).next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.get_text().strip()
print(lyrics[0:200])

AttributeError: 'NoneType' object has no attribute 'next_sibling'

In [18]:
### Testing file output for one page

lyrics = soup.body.find('div', {"class": "ringtone"}).next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.get_text().strip()
file = open(saveFiles[0], 'w')
file.write(lyrics)
file.close()

AttributeError: 'NoneType' object has no attribute 'next_sibling'

In [19]:
### Reproducing for all songs

for i in range(len(rawFiles)):
    # Navigating
    file = open(rawFiles[i], 'rb')
    soup = BeautifulSoup(file.read(), "html.parser")
    file.close()
    # Exporting (conditional logic to handle different page structures)
    if (soup.body.find('div', {"class": "ringtone"}).next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.get_text().strip() == ""):
        lyrics = soup.body.find('span', {"class": "feat"}).next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.get_text().strip()
    else:
        lyrics = soup.body.find('div', {"class": "ringtone"}).next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.next_sibling.get_text().strip()
    file = open(saveFiles[i], 'w')
    file.write(lyrics)
    file.close()

AttributeError: 'NoneType' object has no attribute 'next_sibling'
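
The `AttributeError` outputs above appear when `find` returns `None`, meaning the anchor element is not in the saved page. A more defensive version checks for that case and searches the following siblings instead of counting them. This is a sketch only: `extract_lyrics` is a hypothetical helper, the sample HTML is a simplified stand-in rather than the real page layout, and "first non-empty div after the anchor" is an assumption about where the lyrics sit.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified page used only to exercise the function.
SAMPLE_HTML = """
<html><body>
<div class="ringtone">ringtone ad</div>
<b>"Song Title"</b><br>
<div></div>
<div>
First line<br>
Second line
</div>
</body></html>
"""


def extract_lyrics(html, anchor_class="ringtone"):
    """Return the text of the first non-empty <div> after the anchor, or None."""
    soup = BeautifulSoup(html, "html.parser")
    anchor = soup.find("div", class_=anchor_class)
    if anchor is None:
        return None  # anchor missing: likely a blocked or restructured page
    for sibling in anchor.find_next_siblings("div"):
        text = sibling.get_text().strip()
        if text:
            return text
    return None


lyrics = extract_lyrics(SAMPLE_HTML)
print(lyrics)
```

Returning `None` lets the calling loop record the failure and move on to the next file instead of stopping at the first bad page.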

### Producing CSV Files from Output

In [21]:
### Testing one file read and count

file = open(saveFiles[0], 'rb')
lyrics = file.read().strip()
file.close()
words = lyrics.split()
words = [str(word, encoding='utf-8', errors='strict') for word in words]
countWords = len(words)
print(countWords)

FileNotFoundError: [Errno 2] No such file or directory: 'lyrics_taylorswift_lovestory.txt'

In [24]:
### Creating 2D array to store extracted information
### Redundancy purposeful for readability/debugging (for beginners) but can be reduced

data = [['song', 'artist', 'words', 'lines', 'lyrics']]
index = 0
for key in songs:
    for title in songs[key]:
        # Count lines
        file = open(saveFiles[index], 'rb')
        countLines = len(file.readlines())
        file.close()
        # Extract lyrics
        file = open(saveFiles[index], 'rb')
        lyrics = file.read().strip()
        file.close()
        # Process lyrics
        words = lyrics.split()
        words = [str(word, encoding='utf-8', errors='strict') for word in words]
        countWords = len(words)
        # Add observation to dataset
        if countWords != 0:
            data.append([title, key, countWords, countLines, words])
        else:
            data.append([title, key, 'error', 'error', words])
        index += 1
        
### Check
print(data[0], data[1][0], data[1][1], data[1][2])

FileNotFoundError: [Errno 2] No such file or directory: 'lyrics_taylorswift_lovestory.txt'
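
The counting step can also be written as a small function that works on text rather than bytes, which removes the per-word decoding. This is a sketch: `summarize_lyrics` and `summarize_file` are hypothetical names, the sample text is a placeholder, and reading with `encoding="utf-8"` assumes the lyrics files were written as UTF-8. Unlike the cell above, it counts only non-blank lines, so blank lines between verses are excluded.

```python
def summarize_lyrics(text):
    """Count whitespace-separated words and non-blank lines in a lyrics string."""
    words = text.split()
    lines = [line for line in text.splitlines() if line.strip()]
    return {"words": len(words), "lines": len(lines)}


def summarize_file(path):
    """Read a saved lyrics file as UTF-8 text and summarize it."""
    with open(path, encoding="utf-8") as f:
        return summarize_lyrics(f.read())


# Placeholder text standing in for a saved lyrics file
sample = "la la la\nla la\n\nla\n"
print(summarize_lyrics(sample))  # {'words': 6, 'lines': 3}
```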

In [25]:
### Writing array to CSV
### (newline='' keeps the csv module from adding blank rows between records on Windows)

with open('data_summary.csv', 'w', newline='') as file:
    csvWriter = csv.writer(file, delimiter=',')
    csvWriter.writerows(data)
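
Pandas is imported during setup, and one natural use is reading the summary back in for analysis. The sketch below assumes the CSV has the five columns written above, with the `lyrics` column holding the printed form of a Python list and failed rows holding the string `'error'`. The sample rows are hypothetical placeholders, not real output.

```python
import ast
import io

import pandas as pd

# Hypothetical rows in the same shape as data_summary.csv
SAMPLE_CSV = """song,artist,words,lines,lyrics
songa,artistx,3,2,"['la', 'la', 'la']"
songb,artistx,error,error,[]
"""

# Turn the stringified word lists back into real Python lists while reading
df = pd.read_csv(io.StringIO(SAMPLE_CSV), converters={"lyrics": ast.literal_eval})

# 'error' placeholders become NaN so the count columns are numeric
df["words"] = pd.to_numeric(df["words"], errors="coerce")
df["lines"] = pd.to_numeric(df["lines"], errors="coerce")

# Rows that need their navigation fixed or their page re-fetched
failed = df[df["words"].isna()]
print(failed[["artist", "song"]])
```

For the real file, replace `io.StringIO(SAMPLE_CSV)` with `'data_summary.csv'`.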

### Notes
- 'error' values in the final dataset may occur when HTML pages have different structures; these can be corrected by manually fixing the navigation or by building in logic to check the BeautifulSoup objects. Some of the Drake pages had additional subtitles, and the navigation has been adjusted accordingly.
- AZLyrics.com has no crawl delay or other specifications in its robots.txt file, but it will flag an IP address as bot activity after too many requests. Responses will still have status code 200, but the HTML returned does not contain the same information, and processing will throw errors.
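
Since a flagged request still returns status code 200, one option is to check the saved files themselves before parsing. The sketch below lists files that are missing or lack a marker string. `find_suspect_files` is a hypothetical helper, and using `ringtone` as the marker is an assumption based on the anchor class in the navigation code above.

```python
import os


def find_suspect_files(paths, marker=b"ringtone"):
    """Return paths that are missing or whose contents lack the marker bytes."""
    suspect = []
    for path in paths:
        if not os.path.exists(path):
            suspect.append(path)
            continue
        with open(path, "rb") as f:
            if marker not in f.read():
                suspect.append(path)
    return suspect


# With the list built earlier, usage might look like:
# to_refetch = find_suspect_files(rawFiles)
```

Files on the returned list can be deleted and requested again later, ideally after a longer pause.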