# Twitter API Pull

In [1]:
# for the twitter section
import tweepy
import os
import datetime
import re
from pprint import pprint

# for the lyrics scrape section
import requests
import time
from bs4 import BeautifulSoup
from collections import defaultdict, Counter

In [35]:
# Use this cell for any import statements you add
import pandas as pd
import random
import shutil
import time
import tqdm

def timer(length_of_time):
    timer = round(length_of_time / 60)

    for num in range(0,timer,1):
        time.sleep(60)
        print(f'Sequence {num} complete out of {timer}.')
        
    print('Ring Ring Ring')

In [4]:
888/60

14.8

# Webpage Specific

In [40]:
from bs4 import BeautifulSoup
import re

def nap_time(print_=False):
    length = 5 + 10*random.random()
    
    if print_:
        print(f'Snoring for {length} seconds')
    time.sleep(length)
    

def folder_maker(folder2make):
    # create directory to store html files
    if os.path.isdir(folder2make):
        shutil.rmtree(folder2make)
    os.mkdir(folder2make)


---

# Lyrics Scrape

This section asks you to pull data from the Twitter API and scrape www.AZLyrics.com. In the notebooks where you do that work you are asked to store the data in specific ways. 

In [37]:
artists = {'robyn':"https://www.azlyrics.com/r/robyn.html",
           'cher':"https://www.azlyrics.com/c/cher.html"} 
# we'll use this dictionary to hold both the artist name and the link on AZlyrics

## A Note on Rate Limiting

The lyrics site, www.azlyrics.com, does not have an explicit maximum on number of requests in any one time, but in our testing it appears that too many requests in too short a time will cause the site to stop returning lyrics pages. (Entertainingly, the page that gets returned seems to only have the song title to [a Tom Jones song](https://www.azlyrics.com/lyrics/tomjones/itsnotunusual.html).) 

Whenever you call `requests.get` to retrieve a page, put a `time.sleep(5 + 10*random.random())` on the next line. This will help you not to get blocked. If you _do_ get blocked, which you can identify if the returned pages are not correct, just request a lyrics page through your browser. You'll be asked to perform a CAPTCHA and then your requests should start working again. 

## Part 1: Finding Links to Songs Lyrics

That general artist page has a list of all songs for that artist with links to the individual song pages. 

Q: Take a look at the `robots.txt` page on www.azlyrics.com. (You can read more about these pages [here](https://developers.google.com/search/docs/advanced/robots/intro).) Is the scraping we are about to do allowed or disallowed by this page? How do you know? 

A: The website source code has no reference to a robots.txt file, nor does a Google Search bring up such a document.  After reading the site's privacy policy page as well as other irrelevant pages, it appears AZLyrics implicitly allows bot crawling and scraping.

In [100]:
# Let's set up a dictionary of lists to hold our links
lyrics_pages = defaultdict(list)

# create lyrics folder
folder_maker('lyrics')

for num, (artist, artist_page) in enumerate(artists.items()):
    # request the page and sleep
    r = requests.get(artist_page)
    nap_time()
    
    # create individual artist folder
    folder_maker(f'lyrics/{artist}')

    # check to see if request worked
    if r.ok:
        file_name = f'lyrics/{artist}/{artist}_main_page.html'
        
        # save/write contents of artist's main page
        with open(file_name, 'w+') as f:
            f.write(r.text)
            f.close()

        soup = BeautifulSoup(r.text.encode('utf-8'), 'html.parser')

        song_pages = soup.find_all('div', id='listAlbum')[0].find_all('div', class_='listalbum-item')

        lyric_links = []
        for page in song_pages:
            try:
                lyric_links.append('azlyrics.com'+page.a['href'])
            except TypeError:
                # TypeError occurs when song has no lyrics, i.e. instrumental
                pass

        lyrics_pages[f'{artist}'] = lyric_links
        
        pd.DataFrame({f'{artist}':lyric_links}).to_csv(f'lyrics/{artist}/{artist}_lyric_pages.csv', index=False)

Let's make sure we have enough lyrics pages to scrape. 

In [47]:
# df1 = pd.read_csv('htmls/cher_lyric_pages.csv').to_dict()
# df2 = pd.read_csv('htmls/robyn_lyric_pages.csv').to_dict()

# # merge the two dictionaries back into 1
# lyrics_pages = df1 | df2

In [83]:
for artist, lp in lyrics_pages.items() :
    assert(len(set(lp)) > 20)

In [84]:
# Let's see how long it's going to take to pull these lyrics 
# if we're waiting `5 + 10*random.random()` seconds 
for artist, links in lyrics_pages.items() : 
    print(f"For {artist} we have {len(links)}.")
    print(f"The full pull will take for this artist will take {round(len(links)*10/3600,2)} hours.")

For robyn we have 104.
The full pull will take for this artist will take 0.29 hours.
For cher we have 318.
The full pull will take for this artist will take 0.88 hours.


## Part 2: Pulling Lyrics

Now that we have the links to our lyrics pages, let's go scrape them! Here are the steps for this part. 

1. Create an empty folder in our repo called "lyrics". 
1. Iterate over the artists in `lyrics_pages`. 
1. Create a subfolder in lyrics with the artist's name. For instance, if the artist was Cher you'd have `lyrics/cher/` in your repo.
1. Iterate over the pages. 
1. Request the page and extract the lyrics from the returned HTML file using BeautifulSoup.
1. Use the function below, `generate_filename_from_url`, to create a filename based on the lyrics page, then write the lyrics to a text file with that name. 


In [85]:
def lyrics_html2text(text):
    text = re.sub(r'\n', '',text)
    text = re.sub(r'\r', '',text)
    return text

In [102]:
match_ = '<div>\n<!-- Usage'
schema = "https://www." 
start = time.time()

total_pages = 0 

# lyrics_list = []
for artist in lyrics_pages :
    links = lyrics_pages[artist]
    
    lyrics_list = []
    # 2. Iterate over the lyrics pages
    for link in links[:20]:
        nap_time()

        r = requests.get(schema+link)
        
        if r.ok:
            
            soup = BeautifulSoup(r.text.encode('utf-8'), 'html.parser')
            
            # html elements
            elements = soup.find_all('div', class_='col-xs-12 col-lg-8 text-center')[0]

            # seek title
            song_title = elements.find_all('b')[1].text.replace('"','')
            song_title = song_title.replace(' ','_') 

            for element in elements:
                element_string = str(element)[:16]

                if element_string == match_:
                    lyrics = element.get_text()
                    lyrics = lyrics_html2text(lyrics)

                    song_dict = {'Artist':artist,
                                 'Title':song_title,
                                 'Lyrics':lyrics}

                    lyrics_list.append(song_dict)
                    
                    # save/write lyrics to independent file
                    with open(f'lyrics/{artist}/{song_title}' + '.txt', 'w+') as f:
                        f.write(lyrics)
                        f.close()

    # write all of THE artist data into a pandas DF schema csv file
    pd.DataFrame(lyrics_list).to_csv(f'lyrics/{artist}/{artist}_song_lyrics_df.csv', index=False)

-------

In [103]:
print(f"Total run time was {round((time.time() - start)/3600,2)} hours.")

Total run time was 0.11 hours.


---

# Evaluation

This assignment asks you to pull data from the Twitter API and scrape www.AZLyrics.com.  After you have finished the above sections , run all the cells in this notebook. Print this to PDF and submit it, per the instructions.

In [104]:
# Simple word extractor from Peter Norvig: https://norvig.com/spell-correct.html
def words(text): 
    return re.findall(r'\w+', text.lower())

## Checking Lyrics 

The output from your lyrics scrape should be stored in files located in this path from the directory:
`/lyrics/[Artist Name]/[filename from URL]`. This code summarizes the information at a high level to help the instructor evaluate your work. 

In [105]:
artist_folders = os.listdir("lyrics/")
artist_folders = [f for f in artist_folders if os.path.isdir("lyrics/" + f)]

for artist in artist_folders : 
    artist_files = os.listdir("lyrics/" + artist)
    artist_files = [f for f in artist_files if 'txt' in f or 'csv' in f or 'tsv' in f]

    print(f"For {artist} we have {len(artist_files)} files.")

    artist_words = []

    for f_name in artist_files : 
        with open("lyrics/" + artist + "/" + f_name) as infile : 
            artist_words.extend(words(infile.read()))

            
    print(f"For {artist} we have roughly {len(artist_words)} words, {len(set(artist_words))} are unique.")


For robyn we have 22 files.
For robyn we have roughly 9116 words, 1011 are unique.
For cher we have 22 files.
For cher we have roughly 8935 words, 1280 are unique.
