Chris Richardson <br><br>

Sep 11, 2022 <br><br>

ADS-509-Fall <br><br>

Github Link: [https://github.com/CFRichardson/USD_ADS_509_HW1](https://github.com/CFRichardson/USD_ADS_509_HW1)

# ADS 509 Module 1: APIs and Web Scraping

This notebook has three parts. In the first part you will pull data from the Twitter API. In the second, you will scrape lyrics from AZLyrics.com. In the last part, you'll run code that verifies the completeness of your data pull. 

For this assignment you have chosen two musical artists who have at least 100,000 Twitter followers and 20 songs with lyrics on AZLyrics.com. In this part of the assignment we pull some of the user information for the followers of your artists and store it in text files. 


## General Assignment Instructions

These instructions are included in every assignment, to remind you of the coding standards for the class. Feel free to delete this cell after reading it. 

One sign of mature code is conforming to a style guide. We recommend the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html). If you use a different style guide, please include a cell with a link. 

Your code should be relatively easy-to-read, sensibly commented, and clean. Writing code is a messy process, so please be sure to edit your final submission. Remove any cells that are not needed or parts of cells that contain unnecessary code. Remove inessential `import` statements and make sure that all such statements are moved into the designated cell. 

Make use of non-code cells for written commentary. These cells should be grammatical and clearly written. In some of these cells you will have questions to answer. The questions will be marked by a "Q:" and will have a corresponding "A:" spot for you. *Make sure to answer every question marked with a `Q:` for full credit.* 


# Twitter API Pull

In [None]:
# for the twitter section
import datetime
import os
import re
import tweepy

from pprint import pprint

# for the lyrics scrape section
import requests
import time

from bs4 import BeautifulSoup
from collections import defaultdict, Counter

In [None]:
import pandas as pd
import random
import shutil
import time

def folder_maker(folder2make):
    # create directory to store html files
    if os.path.isdir(folder2make):
        shutil.rmtree(folder2make)
    os.mkdir(folder2make)


def timer(length_of_time):
    timer = round(length_of_time / 60)

    for num in range(0,timer,1):
        time.sleep(60)
        print(f'Sequence {num} complete out of {timer}.')
        
    print('Ring Ring Ring')


def nap_time():
    # pause 5-15 seconds between requests so the lyrics site does not block us
    time.sleep(5 + 10*random.random())

We need to bring in our API keys. Since API keys should be kept secret, we'll keep them in a file called `api_keys.py`. This file should be stored in the directory where you store this notebook. The example file is provided for you on Blackboard. The example has API keys that are _not_ functional, so you'll need to get Twitter credentials and replace the placeholder keys. 
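
For reference, a placeholder `api_keys.py` might look like the sketch below. The three variable names match the import used in this notebook. The values and the `has_placeholder_keys` helper are hypothetical, included only to catch the case where the placeholders were never replaced.

```python
# api_keys.py: placeholder values only; replace them with your own credentials.
api_key = "YOUR_API_KEY"
api_key_secret = "YOUR_API_KEY_SECRET"
bearer_token = "YOUR_BEARER_TOKEN"


def has_placeholder_keys(*keys):
    """Return True if any key still looks like an unreplaced placeholder.

    Hypothetical helper: it assumes placeholders start with "YOUR_".
    """
    return any(key.startswith("YOUR_") for key in keys)
```

Keeping the credentials in their own file makes it easy to exclude them from version control.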

In [None]:
from api_keys import api_key, api_key_secret, bearer_token

def client_FN():
    return tweepy.Client(bearer_token,wait_on_rate_limit=True)

client = client_FN()

# Testing the API

The Twitter APIs are quite rich. Let's play around with some of the features before we dive into this section of the assignment. For our testing, it's convenient to have a small data set to play with. We will seed the code with the handle of John Chandler, one of the instructors in this course. His handle is `@37chandler`. Feel free to use a different handle if you would like to look at someone else's data. 

We will write code to explore a few aspects of the API: 

1. Pull some of the followers of @37chandler.
1. Explore response data, which gives us information about Twitter users. 
1. Pull the last few tweets by @37chandler.


In [None]:
handle = "37chandler"
user_obj = client.get_user(username=handle)

followers = client.get_users_followers(
    user_obj.data.id, user_fields=["created_at","description","location",
                                   "public_metrics"]
)

Now let's explore these a bit. We'll start by printing out names, locations, following count, and followers count for these users. 

In [None]:
num_to_print = 3

for idx, user in enumerate(followers.data) :
    following_count = user.public_metrics['following_count']
    followers_count = user.public_metrics['followers_count']
    
    print(f"{user.name} lists '{user.location}' as their location.")
    print(f" Following: {following_count}, Followers: {followers_count}.")
    print()
    
    if idx >= (num_to_print - 1) :
        break

Let's find the person who follows this handle who has the most followers. 

In [None]:
max_followers = 0

for idx, user in enumerate(followers.data) :
    followers_count = user.public_metrics['followers_count']
    
    if followers_count > max_followers :
        max_followers = followers_count
        max_follower_user = user

print(max_follower_user)
print(max_follower_user.public_metrics)

WedgeLIVE
{'followers_count': 14162, 'following_count': 2223, 'tweet_count': 56079, 'listed_count': 218}


Let's pull some more user fields and take a look at them. The fields can be specified in the `user_fields` argument. 

In [None]:
response = client.get_user(id=user_obj.data.id,
                          user_fields=["created_at","description","location",
                                       "entities","name","pinned_tweet_id","profile_image_url",
                                       "verified","public_metrics"])

for field, value in response.data.items() :
    print(f"for {field} we have {value}")

## Q&A!
Now a few questions for you about the user object.


--- 

<u>Q: How many fields are being returned in this user object?</u>

A: There is a total of 9 fields.

---

<u>Q: Are any of the fields within the user object non-scalar? (I.e., more complicated than a simple data type like integer, float, string, boolean, etc.)</u>

A: Yes. `public_metrics` is a dict of scalar values, and `created_at` is a datetime rather than a simple type. Fields such as `profile_image_url` and `description` are plain strings.

---
<u>Q: How many friends, followers, and tweets does this user have? </u>

A:<br>
followers count: 194<br>
following count: 590<br>
tweet count: 989<br>
listed count: 3

Although you won't need it for this assignment, individual tweets can be a rich source of text-based data. To illustrate the concepts, let's look at the last few tweets for this user. You are encouraged to explore the fields that are available about Tweets.

In [None]:
response = client.get_users_tweets(user_obj.data.id)

# By default, only the ID and text fields of each Tweet will be returned
for idx, tweet in enumerate(response.data) :
    print(tweet.id)
    print(tweet.text, '\n')
    
    if idx > 10 :
        break

## Pulling Follower Information

In this next section of the assignment, we will pull information about the followers of your two artists. We've seen above how to pull a set of followers using `client.get_users_followers`. This function has a parameter, `max_results`, that we can use to change the number of followers that we pull. Unfortunately, we can only pull 1000 followers at a time, which means we will need to handle the _pagination_ of our results. 

The return object has the `.data` field, where the results will be found. It also has `.meta`, which we use to select the next "page" in the results using the `next_token` result. I will illustrate the ideas using our user from above. 
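
The pagination idea can be sketched without touching the API. In this minimal example, `fetch_page` is a hypothetical stand-in for a call like `client.get_users_followers`: it takes a pagination token and returns a `(data, meta)` pair. `FAKE_PAGES` and `fake_fetch` are made up for illustration.

```python
def pull_all_pages(fetch_page, max_pulls=100):
    """Collect rows page by page until there is no next_token or max_pulls is hit."""
    rows = []
    next_token = None
    pulls = 0

    while pulls < max_pulls:
        data, meta = fetch_page(next_token)
        pulls += 1
        rows.extend(data or [])

        # The token for the following page lives in the metadata, if there is one.
        next_token = meta.get("next_token")
        if next_token is None:
            break

    return rows


# Hypothetical three-page result set keyed by pagination token.
FAKE_PAGES = {
    None: ([1, 2], {"next_token": "a"}),
    "a": ([3, 4], {"next_token": "b"}),
    "b": ([5], {}),
}


def fake_fetch(token):
    return FAKE_PAGES[token]


all_rows = pull_all_pages(fake_fetch)
first_two_pages = pull_all_pages(fake_fetch, max_pulls=2)
```

Passing `None` as the first token mirrors how the first request is made without a pagination token.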


### Rate Limiting

Twitter limits the rates at which we can pull data, as detailed in [this guide](https://developer.twitter.com/en/docs/twitter-api/rate-limits). We can make 15 user requests per 15 minutes, meaning that we can pull $4 \cdot 15 \cdot 1000 = 60000$ users per hour. I illustrate the handling of rate limiting below, though whether or not you hit that part of the code depends on your value of `handle`.  
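
The arithmetic above can be written as a small function. This is a sketch under the limits stated in this section (1000 users per request, 15 requests per 15-minute window); `hours_to_pull` is a name chosen here for illustration.

```python
def hours_to_pull(follower_count, users_per_request=1000,
                  requests_per_window=15, windows_per_hour=4):
    """Estimate the hours needed to pull follower_count users under the rate limit."""
    users_per_hour = users_per_request * requests_per_window * windows_per_hour
    return follower_count / users_per_hour


one_hour_example = hours_to_pull(60000)
```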


In the below example, I'll pull all the followers, 25 at a time. (We're using 25 to illustrate the idea; when you do this set the value to 1000.) 

In [None]:
handle_followers = []
pulls = 0
max_pulls = 100
next_token = None

while True :

    followers = client.get_users_followers(
        user_obj.data.id, 
        max_results=1000, # maximum allowed per request (the text above illustrates with 25)
        pagination_token = next_token,
        user_fields=["created_at","description","location",
                     "entities","name","pinned_tweet_id","profile_image_url",
                     "verified","public_metrics"]
    )
    pulls += 1
    
    for follower in followers.data : 
        follower_row = (follower.id,follower.name,follower.created_at,follower.description)
        handle_followers.append(follower_row)
    
    if 'next_token' in followers.meta and pulls < max_pulls :
        next_token = followers.meta['next_token']
    else : 
        break

Now let's take a look at your artists and see how long it is going to take to pull all their followers. 

In [None]:
artists = dict()

handles = ['FFDP','OfficialRezz']

client = client_FN()
for handle in handles: 
    user_obj = client.get_user(username=handle,user_fields=["public_metrics"])
    artists[handle] = (user_obj.data.id, 
                       handle,
                       user_obj.data.public_metrics['followers_count'])

for artist, data in artists.items() : 
    print(f"It would take {data[2]/(1000*15*4):.2f} hours to pull all {data[2]} followers for {artist}. ")

It would take 11.27 hours to pull all 675948 followers for FFDP. 
It would take 5.02 hours to pull all 301215 followers for OfficialRezz. 


Depending on what you see in the display above, you may want to limit how many followers you pull. It'd be great to get at least 200,000 per artist. 

As we pull data for each artist we will write their data to a folder called "twitter", so we will make that folder if needed.
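
Note that the `folder_maker` helper in this notebook deletes an existing folder before recreating it. If you would rather keep files from an earlier run, a non-destructive variant might look like the sketch below. `ensure_folder` is a hypothetical name, not part of the assignment.

```python
import os
import tempfile


def ensure_folder(path):
    """Create the folder if it is missing; leave any existing contents untouched."""
    os.makedirs(path, exist_ok=True)
    return path


# Demonstrate on a throwaway directory so nothing in the repo is touched.
with tempfile.TemporaryDirectory() as tmp:
    twitter_dir = ensure_folder(os.path.join(tmp, "twitter"))
    marker = os.path.join(twitter_dir, "example.txt")
    with open(marker, "w", encoding="utf-8") as f:
        f.write("kept")

    ensure_folder(twitter_dir)  # second call is a no-op
    marker_survived = os.path.exists(marker)
```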

## Handle Pulls

In [None]:
max_results = 1000
max_pulls = 100
next_token=None
num_followers_to_pull = 200*1000
pulls = 0
# Grabs the time when we start making requests to the API
start_time = datetime.datetime.now()

# create folder to store follower data
folder_maker('twitter')

for handle in handles:
    print(f'Processing {handle}')
    client = client_FN()
    user_data = client.get_user(username=handle)
    
    follower_data = []
    follower_ids = []
    current_follower_count = 0
    
    # for get_user_followers param user_fields
    needed_fields = ['description',
                     'id',
                     'location',
                     'name',
                     'public_metrics']

    # reset pagination state so each artist starts from the first page
    next_token = None
    pulls = 0
    failures = 0

    while True :
        try: # try is used to avoid stoppage from client disconnection 
            followers = client.get_users_followers(
                user_data.data.id, 
                max_results=max_results, 
                pagination_token=next_token,
                user_fields=needed_fields
            )
        except (tweepy.TweepyException, requests.exceptions.RequestException) as err:
            failures += 1
            print(f'Request failed ({err}); attempt {failures} of 5.')
            if failures >= 5:
                break
            # re-establish the connection and retry the same page
            time.sleep(60)
            client = client_FN()
            continue

        failures = 0
        pulls += 1

        for follower in followers.data or []:
            follower_ids.append(follower.id)

            follower_row = {'Artist':handle,
                            'Id':follower.id,
                            'Name':follower.name,
                            'User_Name':follower.username, # screen_name
                            'Description':follower.description,
                            'Location':follower.location,
                            'Followers_Count':follower.public_metrics['followers_count']}
            follower_data.append(follower_row)

        current_follower_count = len(follower_data)

        # print status once per rate-limit window (15 requests)
        if pulls % 15 == 0:
            print(f'Current Count: {current_follower_count}\n')

        # stop when there is no next page or a pull/follower limit is reached
        next_token = followers.meta.get('next_token')
        if (next_token is None or pulls >= max_pulls
                or current_follower_count >= num_followers_to_pull):
            break
    
    # Write the data to the output file in the `twitter` folder.
    pd.DataFrame(follower_data).to_csv(f'twitter/{handle}_followers_data.txt', index=False, sep='\t')
    
    # one follower ID per line under a header row
    pd.Series(follower_ids, name=handle).to_csv(f'twitter/{handle}_followers.txt', index=False, sep='\t')

# Let's see how long it took to grab all follower IDs
end_time = datetime.datetime.now()
print(end_time - start_time)

In [None]:
tricky_description = """
    Home by Warsan Shire
    
    no one leaves home unless
    home is the mouth of a shark.
    you only run for the border
    when you see the whole city
    running as well.

"""
# This won't work in a tab-delimited text file.

clean_description = re.sub(r"\s+"," ",tricky_description)
clean_description
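
The same substitution can be wrapped in a helper and applied to any free-text field before it is written to a tab-delimited file. This is a minimal sketch; `clean_field` is a hypothetical name, and treating a missing value as an empty string is an assumption.

```python
import re


def clean_field(value):
    """Collapse tabs, newlines, and repeated spaces so the value fits on one line."""
    if value is None:
        return ""
    return re.sub(r"\s+", " ", str(value)).strip()


cleaned = clean_field("  Home by\n\nWarsan Shire\t")
```

Without this step, a description containing a newline or tab can split one follower across several lines or columns when the file is read back line by line.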

---

# Lyrics Scrape

This section asks you to pull data from the Twitter API and scrape www.AZLyrics.com. In the notebooks where you do that work you are asked to store the data in specific ways. 

In [None]:
artists = {'FFDP':"https://www.azlyrics.com/f/fivefingerdeathpunch.html",
           'OfficialRezz':"https://www.azlyrics.com/r/rezz.html"} 
# we'll use this dictionary to hold both the artist name and the link on AZlyrics

## A Note on Rate Limiting

The lyrics site, www.azlyrics.com, does not have an explicit maximum on number of requests in any one time, but in our testing it appears that too many requests in too short a time will cause the site to stop returning lyrics pages. (Entertainingly, the page that gets returned seems to only have the song title to [a Tom Jones song](https://www.azlyrics.com/lyrics/tomjones/itsnotunusual.html).) 

Whenever you call `requests.get` to retrieve a page, put a `time.sleep(5 + 10*random.random())` on the next line. This will help you not to get blocked. If you _do_ get blocked, which you can identify if the returned pages are not correct, just request a lyrics page through your browser. You'll be asked to perform a CAPTCHA and then your requests should start working again. 

## Part 1: Finding Links to Songs Lyrics

That general artist page has a list of all songs for that artist with links to the individual song pages. 

<u>Q: Take a look at the `robots.txt` page on www.azlyrics.com. (You can read more about these pages [here](https://developers.google.com/search/docs/advanced/robots/intro).) Is the scraping we are about to do allowed or disallowed by this page? How do you know?</u>

A: The website source code has no reference to a robots.txt file, nor does a Google Search bring up such a document.  After reading the site's privacy policy page as well as other irrelevant pages, it appears AZLyrics implicitly allows bot crawling and scraping.
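
One way to check this programmatically is Python's standard-library `urllib.robotparser`. The sketch below parses a made-up rule set so it runs offline; `EXAMPLE_ROBOTS_TXT`, the example.com URLs, and the `is_allowed` helper are hypothetical and do not reflect the real contents of any site's `robots.txt`.

```python
from urllib import robotparser

# Hypothetical robots.txt content, used only to illustrate the parsing logic.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given rules permit user_agent to fetch url."""
    parser = robotparser.RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


lyrics_ok = is_allowed(EXAMPLE_ROBOTS_TXT, "https://example.com/lyrics/song.html")
private_ok = is_allowed(EXAMPLE_ROBOTS_TXT, "https://example.com/private/page.html")
```

Against a live site, the parser's `set_url` and `read` methods can fetch the file directly from the site's `/robots.txt` path instead of parsing a string.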

In [None]:
# Let's set up a dictionary of lists to hold our links
lyrics_pages = defaultdict(list)

# create lyrics folder
folder_maker('lyrics')

for num, (artist, artist_page) in enumerate(artists.items()):
    # request the page and sleep
    r = requests.get(artist_page)
    nap_time()
    
    # create individual artist folder
    folder_maker(f'lyrics/{artist}')

    # check to see if request worked
    if r.ok:
        file_name = f'lyrics/{artist}/{artist}_main_page.html'
        
        # save/write contents of artist's main page
        with open(file_name, 'w', encoding='utf-8') as f:
            f.write(r.text)

            
        soup = BeautifulSoup(r.text.encode('utf-8'), 'html.parser')
        # find_all song links
        song_pages = soup.find_all('div', id='listAlbum')[0].find_all('div', class_='listalbum-item')

        lyric_links = []
        for page in song_pages:
            try:
                # check if div tag has a link, if True append link
                lyric_links.append('azlyrics.com'+page.a['href'])
            except TypeError:
                # TypeError occurs when song has no lyrics, i.e. instrumental
                pass

        lyrics_pages[f'{artist}'] = lyric_links
        
        # save data in the format of a Pandas Dataframe to csv
        pd.DataFrame({f'{artist}':lyric_links}).to_csv(f'lyrics/{artist}/{artist}_lyric_pages.csv', index=False)

Let's make sure we have enough lyrics pages to scrape. 

In [None]:
for artist, lp in lyrics_pages.items() :
    assert(len(set(lp)) > 20) 

In [None]:
# Let's see how long it's going to take to pull these lyrics 
# if we're waiting `5 + 10*random.random()` seconds 
for artist, links in lyrics_pages.items() : 
    print(f"For {artist} we have {len(links)}.")
    print(f"The full pull for this artist will take {round(len(links)*10/3600,2)} hours.")

## Part 2: Pulling Lyrics

Now that we have the links to our lyrics pages, let's go scrape them! Here are the steps for this part. 

1. Create an empty folder in our repo called "lyrics". 
1. Iterate over the artists in `lyrics_pages`. 
1. Create a subfolder in lyrics with the artist's name. For instance, if the artist was Cher you'd have `lyrics/cher/` in your repo.
1. Iterate over the pages. 
1. Request the page and extract the lyrics from the returned HTML file using BeautifulSoup.
1. Use the function below, `generate_filename_from_url`, to create a filename based on the lyrics page, then write the lyrics to a text file with that name. 
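
The `generate_filename_from_url` helper mentioned in the last step does not appear in this notebook. A minimal sketch of what such a function might look like is shown here; the exact naming scheme is an assumption, not the course-provided implementation.

```python
def generate_filename_from_url(url):
    """Turn a lyrics-page URL into a flat .txt filename (illustrative sketch)."""
    if not url:
        return None

    # Drop the scheme and the site prefix, with or without "www.".
    name = url.replace("https://", "").replace("http://", "")
    name = name.replace("www.azlyrics.com/lyrics/", "")
    name = name.replace("azlyrics.com/lyrics/", "")

    # Drop the extension, then flatten path separators and stray dots.
    name = name.replace(".html", "")
    name = name.replace("/", "_").replace(".", "_")
    return name + ".txt"


example_name = generate_filename_from_url(
    "https://www.azlyrics.com/lyrics/tomjones/itsnotunusual.html"
)
```

Deriving the filename from the URL rather than the song title avoids characters in titles that are not valid in file paths.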


In [None]:
def lyrics_html2text(text):
    # replace line breaks with a space so words on adjacent lines do not merge
    text = re.sub(r'[\r\n]+', ' ', text)
    return text.strip()

In [None]:
notice_agreement = '<div>\n<!-- Usage'
schema = "https://www." 
start = time.time()

total_pages = 0 

for artist in lyrics_pages :
    links = lyrics_pages[artist]
    
    lyrics_list = []
    # 2. Iterate over the lyrics pages
    for link in links[:20]:
        nap_time()

        r = requests.get(schema+link)
        
        if r.ok:
            
            soup = BeautifulSoup(r.text.encode('utf-8'), 'html.parser')
            
            # html elements containing lyrics
            elements = soup.find_all('div', class_='col-xs-12 col-lg-8 text-center')[0]

            # seek title
            song_title = elements.find_all('b')[1].text.replace('"','')
            song_title = song_title.replace(' ','_') 

            # search through lyric block contents
            for element in elements:
                element_string = str(element)[:16]
                
                # lyrics start with a License Agreement notice 
                # <!-- Usage of azlyrics.com content by any third-party lyrics provider is prohibited by our licensing agreement. 
                if element_string == notice_agreement:
                    lyrics = element.get_text()
                    lyrics = lyrics_html2text(lyrics)

                    song_dict = {'Artist':artist,
                                 'Title':song_title,
                                 'Lyrics':lyrics}

                    lyrics_list.append(song_dict)
                    
                    # save/write lyrics to independent file
                    with open(f'lyrics/{artist}/{song_title}' + '.txt', 'w', encoding='utf-8') as f:
                        f.write(lyrics)

    # write all of THE artist data into a pandas DF schema csv file
    pd.DataFrame(lyrics_list).to_csv(f'lyrics/{artist}/{artist}_song_lyrics_df.csv', index=False)

In [None]:
print(f"Total run time was {round((time.time() - start)/3600,2)} hours.")

---

# Evaluation

This assignment asks you to pull data from the Twitter API and scrape www.AZLyrics.com. After you have finished the above sections, run all the cells in this notebook. Print this to PDF and submit it, per the instructions.

In [None]:
# Simple word extractor from Peter Norvig: https://norvig.com/spell-correct.html
def words(text): 
    return re.findall(r'\w+', text.lower())

---

## Checking Twitter Data

The output from your Twitter API pull should be two files per artist, stored in files with formats like `cher_followers.txt` (a list of all follower IDs you pulled) and `cher_followers_data.txt`. These files should be in a folder named `twitter` within the repository directory. This code summarizes the information at a high level to help the instructor evaluate your work. 
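
Column order in the follower data file can differ between submissions, so reading fields by header name is less brittle than positional indexing. The sketch below uses the standard-library `csv` module on made-up rows; `SAMPLE` and `summarize_followers` are hypothetical, and the column names are the ones written by this notebook.

```python
import csv
import io

# Made-up rows in the column layout written by this notebook (illustration only).
SAMPLE = (
    "Artist\tId\tName\tUser_Name\tDescription\tLocation\tFollowers_Count\n"
    "cher\t1\tAnn\tann01\tmusic fan\tSan Diego\t10\n"
    "cher\t2\tBob\tbob02\tloves music\tSan Diego\t3\n"
)


def summarize_followers(infile):
    """Collect unique locations and description words, looking fields up by name."""
    locations = set()
    description_words = []
    for row in csv.DictReader(infile, delimiter="\t"):
        locations.add(row["Location"])
        description_words.extend(row["Description"].lower().split())
    return locations, description_words


locations, description_words = summarize_followers(io.StringIO(SAMPLE))
```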

In [None]:
twitter_files = os.listdir("twitter")
twitter_files = [f for f in twitter_files if f != ".DS_Store"]
artist_handles = list(set([name.split("_")[0] for name in twitter_files]))

print(f"We see two artist handles: {artist_handles[0]} and {artist_handles[1]}.")

We see two artist handles: OfficialRezz and FFDP.


In [None]:
for artist in artist_handles :
    follower_file = artist + "_followers.txt"
    follower_data_file = artist + "_followers_data.txt"
    
    ids = open("twitter/" + follower_file,'r').readlines()
    
    print(f"We see {len(ids)-1} in your follower file for {artist}, assuming a header row.")
    
    with open("twitter/" + follower_data_file,'r') as infile :
        
        # check the headers
        headers = infile.readline().split("\t")
        
        print(f"In the follower data file ({follower_data_file}) for {artist}, we have these columns:")
        print(" : ".join(headers))
        
        description_words = []
        locations = set()
        
        
        for idx, line in enumerate(infile.readlines()) :
            line = line.strip("\n").split("\t")
            
            try : 
                locations.add(line[3])            
                description_words.extend(words(line[6]))
            except :
                pass

        print(f"We have {idx+1} data rows for {artist} in the follower data file.")

        print(f"For {artist} we have {len(locations)} unique locations.")

        print(f"For {artist} we have {len(description_words)} words in the descriptions.")
        print("Here are the five most common words:")
        print(Counter(description_words).most_common(5))

        
        print("")
        print("-"*40)
        print("")

We see 1 in your follower file for OfficialRezz, assuming a header row.
In the follower data file (OfficialRezz_followers_data.txt) for OfficialRezz, we have these columns:
Artist : Id : Name : User_Name : Description : Location : Followers_Count

We have 122520 data rows for OfficialRezz in the follower data file.
For OfficialRezz we have 99997 unique locations.
For OfficialRezz we have 181374 words in the descriptions.
Here are the five most common words:
[('0', 95739), ('1', 4175), ('2', 3221), ('3', 2612), ('4', 2117)]

----------------------------------------

We see 1 in your follower file for FFDP, assuming a header row.
In the follower data file (FFDP_followers_data.txt) for FFDP, we have these columns:
Artist : Id : Name : User_Name : Description : Location : Followers_Count

We have 122213 data rows for FFDP in the follower data file.
For FFDP we have 99996 unique locations.
For FFDP we have 91065 words in the descriptions.
Here are the five most common words:
[('0', 11610), 

## Checking Lyrics 

The output from your lyrics scrape should be stored in files located in this path from the directory:
`/lyrics/[Artist Name]/[filename from URL]`. This code summarizes the information at a high level to help the instructor evaluate your work. 

In [None]:
artist_folders = os.listdir("lyrics/")
artist_folders = [f for f in artist_folders if os.path.isdir("lyrics/" + f)]

for artist in artist_folders : 
    artist_files = os.listdir("lyrics/" + artist)
    artist_files = [f for f in artist_files if 'txt' in f or 'csv' in f or 'tsv' in f]

    print(f"For {artist} we have {len(artist_files)} files.")

    artist_words = []

    for f_name in artist_files : 
        with open("lyrics/" + artist + "/" + f_name) as infile : 
            artist_words.extend(words(infile.read()))

            
    print(f"For {artist} we have roughly {len(artist_words)} words, {len(set(artist_words))} are unique.")


For FFDP we have 22 files.
For FFDP we have roughly 9712 words, 1094 are unique.
For OfficialRezz we have 22 files.
For OfficialRezz we have roughly 4993 words, 564 are unique.
