# ADS 509 Module 1: APIs and Web Scraping

This notebook has three parts. In the first part you will pull data from the Twitter API. In the second, you will scrape lyrics from AZLyrics.com. In the last part, you'll run code that verifies the completeness of your data pull. 

For this assignment you have chosen two musical artists who have at least 100,000 Twitter followers and 20 songs with lyrics on AZLyrics.com. In this part of the assignment we pull the some of the user information for the followers of your artist and store them in text files. 


## Important Note

This assignment requires you to have a version of Tweepy that is at least version 4. The latest version is 4.10 as I write this. Critically, this version of Tweepy is *not* on the upgrade path from Version 3, so you will not be able to simply upgrade the package if you are on Version 3. Instead you will need to explicitly install version 4, which you can do with a command like this: `pip install "tweepy>=4"`. You will also be using Version 2 of the Twitter API for this assignment. 

Run the below cell. If your version of Tweepy begins with a "4", then you should be good to go. If it begins with a "3" then run the following command, found [here](https://stackoverflow.com/questions/5226311/installing-specific-package-version-with-pip), at the command line or in a cell: `pip install -Iv tweepy==4.9`. (You may want to update that version number if Tweepy has moved on past 4.9. 

In [256]:
#pip install "tweepy>=4"

In [265]:
pip show tweepy

Name: tweepy
Version: 4.12.1
Summary: Twitter library for Python
Home-page: https://www.tweepy.org/
Author: Joshua Roesslein
Author-email: tweepy@googlegroups.com
License: MIT
Location: /Users/ryan_s_dunn/opt/anaconda3/lib/python3.8/site-packages
Requires: oauthlib, requests, requests-oauthlib
Required-by: 
Note: you may need to restart the kernel to use updated packages.


## General Assignment Instructions

These instructions are included in every assignment, to remind you of the coding standards for the class. Feel free to delete this cell after reading it. 

One sign of mature code is conforming to a style guide. We recommend the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html). If you use a different style guide, please include a cell with a link. 

Your code should be relatively easy-to-read, sensibly commented, and clean. Writing code is a messy process, so please be sure to edit your final submission. Remove any cells that are not needed or parts of cells that contain unnecessary code. Remove inessential `import` statements and make sure that all such statements are moved into the designated cell. 

Make use of non-code cells for written commentary. These cells should be grammatical and clearly written. In some of these cells you will have questions to answer. The questions will be marked by a "Q:" and will have a corresponding "A:" spot for you. *Make sure to answer every question marked with a `Q:` for full credit.* 


# Twitter API Pull

In [266]:
# for the twitter section
import tweepy
import os
import datetime
import re
from pprint import pprint

# for the lyrics scrape section
import requests
import time
from bs4 import BeautifulSoup
from collections import defaultdict, Counter

In [397]:
# Use this cell for any import statements you add
import json
import pandas as pd
import numpy as np
from random import * 

We need bring in our API keys. Since API keys should be kept secret, we'll keep them in a file called `api_keys.py`. This file should be stored in the directory where you store this notebook. The example file is provided for you on Blackboard. The example has API keys that are _not_ functional, so you'll need to get Twitter credentials and replace the placeholder keys. 

In [268]:
from api_keys import api_key, api_key_secret, bearer_token

In [269]:
client = tweepy.Client(bearer_token,wait_on_rate_limit=True)
print(client)

<tweepy.client.Client object at 0x7fbfb866aca0>


# Testing the API

The Twitter APIs are quite rich. Let's play around with some of the features before we dive into this section of the assignment. For our testing, it's convenient to have a small data set to play with. We will seed the code with the handle of John Chandler, one of the instructors in this course. His handle is `@37chandler`. Feel free to use a different handle if you would like to look at someone else's data. 

We will write code to explore a few aspects of the API: 

1. Pull some of the followers @37chandler.
1. Explore response data, which gives us information about Twitter users. 
1. Pull the last few tweets by @37chandler.


In [270]:
handle = "37chandler"

user_obj = client.get_user(username=handle)

followers = client.get_users_followers(
    # Learn about user fields here: 
    # https://developer.twitter.com/en/docs/twitter-api/data-dictionary/object-model/user
    user_obj.data.id, user_fields=["created_at","description","location",
                                   "public_metrics"]
)

#print the user object
print(user_obj)
#print(followers)

Response(data=<User id=33029025 name=John Chandler username=37chandler>, includes={}, errors=[], meta={})


Now let's explore these a bit. We'll start by printing out names, locations, following count, and followers count for these users. 

In [271]:
num_to_print = 20

for idx, user in enumerate(followers.data) :
    following_count = user.public_metrics['following_count']
    followers_count = user.public_metrics['followers_count']
    
    print(f"{user.name} lists '{user.location}' as their location.")
    print(f" Following: {following_count}, Followers: {followers_count}.")
    print()
    
    if idx >= (num_to_print - 1) :
        break
    

John chandler lists 'Decatur, GA' as their location.
 Following: 129, Followers: 10.

Frank P Seidl lists 'Twin Cities, Minnesota USA' as their location.
 Following: 37698, Followers: 37557.

Roberta lists 'Salinas' as their location.
 Following: 1874, Followers: 181.

Anna bikes MKE lists 'mke ' as their location.
 Following: 2288, Followers: 1757.

Catherine lists 'San Angelo' as their location.
 Following: 2178, Followers: 223.

Lisa lists 'None' as their location.
 Following: 2596, Followers: 831.

Lexi lists 'None' as their location.
 Following: 432, Followers: 25.

Dave Renn lists 'None' as their location.
 Following: 98, Followers: 10.

Lionel lists 'None' as their location.
 Following: 200, Followers: 200.

Megan Randall lists 'None' as their location.
 Following: 142, Followers: 98.

Jacob Salzman lists 'None' as their location.
 Following: 563, Followers: 136.

twiter not fun lists 'None' as their location.
 Following: 218, Followers: 20.

Christian Tinsley lists 'None' as th

Let's find the person who follows this handle who has the most followers. 

In [272]:
max_followers = 0

for idx, user in enumerate(followers.data) :
    followers_count = user.public_metrics['followers_count']
    
    if followers_count > max_followers :
        max_followers = followers_count
        max_follower_user = user

        
print(max_follower_user)
print(max_follower_user.public_metrics)

SpaceConscious
{'followers_count': 37557, 'following_count': 37698, 'tweet_count': 13956, 'listed_count': 305}


Let's pull some more user fields and take a look at them. The fields can be specified in the `user_fields` argument. 

In [273]:
response = client.get_user(id=user_obj.data.id,
                          user_fields=["created_at","description","location",
                                       "entities","name","pinned_tweet_id","profile_image_url",
                                       "verified","public_metrics"])

In [274]:
for field, value in response.data.items() :
    print(f"for {field} we have {value}")

for name we have John Chandler
for location we have MN
for verified we have False
for public_metrics we have {'followers_count': 185, 'following_count': 592, 'tweet_count': 1049, 'listed_count': 3}
for username we have 37chandler
for id we have 33029025
for profile_image_url we have https://pbs.twimg.com/profile_images/2680483898/b30ae76f909352dbae5e371fb1c27454_normal.png
for description we have He/Him. Data scientist, urban cyclist, educator, erstwhile frisbee player. 

¯\_(ツ)_/¯
for created_at we have 2009-04-18 22:08:22+00:00


Now a few questions for you about the user object.

Q: How many fields are being returned in the `response` object? 

##### A: For a given user id, 9 additional fields are being returned in the response object. These include description, username, profile_image_url, location, verified (T/F), public_metrics, name, and created_at

---

Q: Are any of the fields within the user object non-scalar? (I.e., more complicated than a simple data type like integer, float, string, boolean, etc.) 

##### A: Yes, the profile_image_url is returned as https path to a .png file

---

Q: How many friends, followers, and tweets does this user have? 

##### A: This user has 184 followers (interpreted as "friends"), is following 592 users, and has 1049 tweets.


Although you won't need it for this assignment, individual tweets can be a rich source of text-based data. To illustrate the concepts, let's look at the last few tweets for this user. You are encouraged to explore the fields that are available about Tweets.

In [275]:
response = client.get_users_tweets(user_obj.data.id)

# By default, only the ID and text fields of each Tweet will be returned
for idx, tweet in enumerate(response.data) :
    print(tweet.id)
    print(tweet.text)
    print()
    
    if idx > 10 :
        break

1611545485029810180
Happy Dia de los Reyes to all who celebrate it. https://t.co/4G7zAuwC70

1608230093071212544
RT @year_progress: ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 99%

1606038920604499969
RT @CoachBalto: This video is perfect. Parents please watch. Part 1/2 https://t.co/NvcBFmyFPO

1602407567036190743
RT @LindsayMasland: I had the realization that "grades are pretend" the first time I taught (as a TA).  

I was grading something with both…

1598645130075856896
RT @marinaendicott: My new favourite lawyer’s letter, just for the sheer joy of the tone.

1598156055997222912
RT @_TanHo: Hey friends, #AdventOfCode starts TONIGHT! I've organized a friendly leaderboard every year for the #rstats (and friends) commu…

1597746144108740608
If you like biking and not getting hit by muederboxes, you should consider one of these. https://t.co/0prMLbvj3b

1597734124927995904
RT @CraigTheDev: A lot of people argue that AI art isn't theft as it isn't copying the original images but referencing them like a person.…

15

## Pulling Follower Information

In this next section of the assignment, we will pull information about the followers of your two artists. We've seen above how to pull a set of followers using `client.get_users_followers`. This function has a parameter, `max_results`, that we can use to change the number of followers that we pull. Unfortunately, we can only pull 1000 followers at a time, which means we will need to handle the _pagination_ of our results. 

The return object has the `.data` field, where the results will be found. It also has `.meta`, which we use to select the next "page" in the results using the `next_token` result. I will illustrate the ideas using our user from above. 


### Rate Limiting

Twitter limits the rates at which we can pull data, as detailed in [this guide](https://developer.twitter.com/en/docs/twitter-api/rate-limits). We can make 15 user requests per 15 minutes, meaning that we can pull $4 \cdot 15 \cdot 1000 = 60000$ users per hour. I illustrate the handling of rate limiting below, though whether or not you hit that part of the code depends on your value of `handle`.  


In the below example, I'll pull all the followers, 25 at a time. (We're using 25 to illustrate the idea; when you do this set the value to 1000.) 

In [276]:
handle_followers = []
pulls = 0
max_pulls = 100
next_token = None

while True :

    followers = client.get_users_followers(
        user_obj.data.id, 
        max_results=25, # when you do this for real, set this to 1000!
        pagination_token = next_token,
        user_fields=["created_at","description","location",
                     "entities","name","pinned_tweet_id","profile_image_url",
                     "verified","public_metrics"]
    )
    pulls += 1
    
    for follower in followers.data : 
        follower_row = (follower.username, follower.id,follower.name,follower.location, 
                        follower.public_metrics['following_count'],follower.public_metrics['followers_count'],
                        follower.description)
        handle_followers.append(follower_row)
    
    if 'next_token' in followers.meta and pulls < max_pulls :
        next_token = followers.meta['next_token']
    else : 
        break

##### test the functionality of converting to dataframe

In [420]:
# convert from list to dataframe
df = pd.DataFrame(handle_followers, columns = ['username','id', 'name','location','followers_count',
                                               'following_count','description'])
 
# view top 5 elements from dataframe 
df.head()

Unnamed: 0,username,id,name,location,followers_count,following_count,description
0,PattyDuQuin,1610106760597229568,Patty Duarte,,578,8,
1,Sam_Habilay,1586615640324214785,Samantha,,33,1,
2,IcyLapis,1349131136073687046,✨IcyLapis🍮🥐🍡 (COMMISSIONS OPEN),all art is f2u with credit,1300,175,check carrd for dni/byf and commission info. i...
3,dudeguybutdumb,1411358993268842503,dudeguy99,"Sus, France",2286,330,warning beatles fan. drummer. l4d2 my goat🐐🐐🐐🐐
4,n4hu_m,1260356791755509760,nahum,Nevada,1071,191,artist?? / ocasionally funny / sdp fan 💤 / als...


## Pulling Twitter Data for Your Artists

Now let's take a look at your artists and see how long it is going to take to pull all their followers. 

In [278]:
artists = dict()

for handle in ['weezer','greenday'] : 
    user_obj = client.get_user(username=handle,user_fields=["public_metrics"])
    artists[handle] = (user_obj.data.id, 
                       handle,
                       user_obj.data.public_metrics['followers_count'])
    

for artist, data in artists.items() : 
    print(f"It would take {data[2]/(1000*15*4):.2f} hours to pull all {data[2]} followers for {artist}. ")
    

It would take 27.73 hours to pull all 1664073 followers for weezer. 
It would take 81.32 hours to pull all 4879343 followers for greenday. 


Depending on what you see in the display above, you may want to limit how many followers you pull. It'd be great to get at least 200,000 per artist. 

As we pull data for each artist we will write their data to a folder called "twitter", so we will make that folder if needed.

In [279]:
# Make the "twitter" folder here. If you'd like to practice your programming, add functionality 
# that checks to see if the folder exists. If it does, then "unlink" it. Then create a new one.
# Specify path

my_path = '/Users/ryan_s_dunn/opt/anaconda3/lib/python3.8/posixpath.py'

isExist = os.path.exists(my_path)
print(isExist)

if not os.path.isdir("twitter") : 
    #shutil.rmtree("twitter/")
    os.mkdir("twitter")

True


In this following cells, build on the above code to pull some of the followers and their data for your two artists. As you pull the data, write the follower ids to a file called `[artist name]_followers.txt` in the "twitter" folder. For instance, for Cher I would create a file named `cher_followers.txt`. As you pull the data, also store it in an object like a list or a data frame.

In addition to creating a file that only has follower IDs in it, you will create a file that includes user data. From the response object please extract and store the following fields: 

* screen_name	
* name	
* id	
* location	
* followers_count	
* friends_count	
* description

Store the fields with one user per row in a tab-delimited text file with the name `[artist name]_follower_data.txt`. For instance, for Cher I would create a file named `cher_follower_data.txt`. 

One note: the user's description can have tabs or returns in it, so make sure to clean those out of the description before writing them to the file. I've included some example code to do that below the stub. 

In [281]:
artist_ids = []

handles = ['weezer','GreenDay']
for handle in handles:
    user_name = client.get_user(username=handle).data.name
    user_id = client.get_user(username=handle).data.id
    artist_ids.append(user_id)
    print(user_name,user_id)

#print(artist_ids)  

weezer 16685316
Green Day 67995848


In [337]:
# Modify the below code stub to pull the follower IDs and write them to a file. 
handles = ['weezer','GreenDay']

whitespace_pattern = re.compile(r"\s+") 

user_data = dict() 
followers_data = dict()

for handle in handles :
    user_data[handle] = [] # will be a list of lists
    followers_data[handle] = [] # will be a simple list of IDs

print(user_data, followers_data)

# Grabs the time when we start making requests to the API
start_time = datetime.datetime.now()

for handle in handles :
    
    # Create the output file names 
    followers_output_file = handle + "_followers.txt"
    user_data_output_file = handle + "_follower_data.txt"
    
    #validate creation
    print(followers_output_file, user_data_output_file)

{'weezer': [], 'GreenDay': []} {'weezer': [], 'GreenDay': []}
weezer_followers.txt weezer_follower_data.txt
GreenDay_followers.txt GreenDay_follower_data.txt


#### Collect Weezer followers data

In [338]:
handle = 'weezer'

total_followers = 201000
weezer_followers = []
weezer_followers_id = []
pulls = 0
max_pulls = 100
next_token = None

user_name = client.get_user(username=handle).data.name
user_id = client.get_user(username=handle).data.id

while True :

    followers = client.get_users_followers(
        user_id, 
        max_results=1000, 
        pagination_token = next_token,
        user_fields=["created_at","description","location",
                     "entities","name","pinned_tweet_id","profile_image_url",
                     "verified","public_metrics"]
    )
    pulls += 1
    
    for follower in followers.data : 
        follower_row = (follower.username, follower.id,follower.name,follower.location, 
                        follower.public_metrics['following_count'],follower.public_metrics['followers_count'],
                        follower.description)
        weezer_followers.append(follower_row)
    
    for follower in followers.data :
        follower_id_row = (follower.id)
        weezer_followers_id.append(follower_id_row)
        
    print('pull number: ', pulls, ' current followers pulled: ', len(weezer_followers_id))
    
    if len(weezer_followers_id) > total_followers:
        break

end_time = datetime.datetime.now()
print(end_time - start_time)

pull number:  1  current followers pulled:  1000
pull number:  2  current followers pulled:  2000
pull number:  3  current followers pulled:  3000
pull number:  4  current followers pulled:  4000
pull number:  5  current followers pulled:  5000
pull number:  6  current followers pulled:  6000
pull number:  7  current followers pulled:  7000
pull number:  8  current followers pulled:  8000


Rate limit exceeded. Sleeping for 59 seconds.


pull number:  9  current followers pulled:  9000
pull number:  10  current followers pulled:  10000
pull number:  11  current followers pulled:  11000
pull number:  12  current followers pulled:  12000
pull number:  13  current followers pulled:  13000
pull number:  14  current followers pulled:  14000
pull number:  15  current followers pulled:  15000
pull number:  16  current followers pulled:  16000
pull number:  17  current followers pulled:  17000
pull number:  18  current followers pulled:  18000
pull number:  19  current followers pulled:  19000
pull number:  20  current followers pulled:  20000
pull number:  21  current followers pulled:  21000
pull number:  22  current followers pulled:  22000
pull number:  23  current followers pulled:  23000


Rate limit exceeded. Sleeping for 887 seconds.


pull number:  24  current followers pulled:  24000
pull number:  25  current followers pulled:  25000
pull number:  26  current followers pulled:  26000
pull number:  27  current followers pulled:  27000
pull number:  28  current followers pulled:  28000
pull number:  29  current followers pulled:  29000
pull number:  30  current followers pulled:  30000
pull number:  31  current followers pulled:  31000
pull number:  32  current followers pulled:  32000
pull number:  33  current followers pulled:  33000
pull number:  34  current followers pulled:  34000
pull number:  35  current followers pulled:  35000
pull number:  36  current followers pulled:  36000
pull number:  37  current followers pulled:  37000
pull number:  38  current followers pulled:  38000


Rate limit exceeded. Sleeping for 888 seconds.


pull number:  39  current followers pulled:  39000
pull number:  40  current followers pulled:  40000
pull number:  41  current followers pulled:  41000
pull number:  42  current followers pulled:  42000
pull number:  43  current followers pulled:  43000
pull number:  44  current followers pulled:  44000
pull number:  45  current followers pulled:  45000
pull number:  46  current followers pulled:  46000
pull number:  47  current followers pulled:  47000
pull number:  48  current followers pulled:  48000
pull number:  49  current followers pulled:  49000
pull number:  50  current followers pulled:  50000
pull number:  51  current followers pulled:  51000
pull number:  52  current followers pulled:  52000
pull number:  53  current followers pulled:  53000


Rate limit exceeded. Sleeping for 889 seconds.


pull number:  54  current followers pulled:  54000
pull number:  55  current followers pulled:  55000
pull number:  56  current followers pulled:  56000
pull number:  57  current followers pulled:  57000
pull number:  58  current followers pulled:  58000
pull number:  59  current followers pulled:  59000
pull number:  60  current followers pulled:  60000
pull number:  61  current followers pulled:  61000
pull number:  62  current followers pulled:  62000
pull number:  63  current followers pulled:  63000
pull number:  64  current followers pulled:  64000
pull number:  65  current followers pulled:  65000
pull number:  66  current followers pulled:  66000
pull number:  67  current followers pulled:  67000
pull number:  68  current followers pulled:  68000


Rate limit exceeded. Sleeping for 887 seconds.


pull number:  69  current followers pulled:  69000
pull number:  70  current followers pulled:  70000
pull number:  71  current followers pulled:  71000
pull number:  72  current followers pulled:  72000
pull number:  73  current followers pulled:  73000
pull number:  74  current followers pulled:  74000
pull number:  75  current followers pulled:  75000
pull number:  76  current followers pulled:  76000
pull number:  77  current followers pulled:  77000
pull number:  78  current followers pulled:  78000
pull number:  79  current followers pulled:  79000
pull number:  80  current followers pulled:  80000
pull number:  81  current followers pulled:  81000
pull number:  82  current followers pulled:  82000
pull number:  83  current followers pulled:  83000


Rate limit exceeded. Sleeping for 889 seconds.


pull number:  84  current followers pulled:  84000
pull number:  85  current followers pulled:  85000
pull number:  86  current followers pulled:  86000
pull number:  87  current followers pulled:  87000
pull number:  88  current followers pulled:  88000
pull number:  89  current followers pulled:  89000
pull number:  90  current followers pulled:  90000
pull number:  91  current followers pulled:  91000
pull number:  92  current followers pulled:  92000
pull number:  93  current followers pulled:  93000
pull number:  94  current followers pulled:  94000
pull number:  95  current followers pulled:  95000
pull number:  96  current followers pulled:  96000
pull number:  97  current followers pulled:  97000
pull number:  98  current followers pulled:  98000


Rate limit exceeded. Sleeping for 889 seconds.


pull number:  99  current followers pulled:  99000
pull number:  100  current followers pulled:  100000
pull number:  101  current followers pulled:  101000
pull number:  102  current followers pulled:  102000
pull number:  103  current followers pulled:  103000
pull number:  104  current followers pulled:  104000
pull number:  105  current followers pulled:  105000
pull number:  106  current followers pulled:  106000
pull number:  107  current followers pulled:  107000
pull number:  108  current followers pulled:  108000
pull number:  109  current followers pulled:  109000
pull number:  110  current followers pulled:  110000
pull number:  111  current followers pulled:  111000
pull number:  112  current followers pulled:  112000
pull number:  113  current followers pulled:  113000


Rate limit exceeded. Sleeping for 888 seconds.


pull number:  114  current followers pulled:  114000
pull number:  115  current followers pulled:  115000
pull number:  116  current followers pulled:  116000
pull number:  117  current followers pulled:  117000
pull number:  118  current followers pulled:  118000
pull number:  119  current followers pulled:  119000
pull number:  120  current followers pulled:  120000
pull number:  121  current followers pulled:  121000
pull number:  122  current followers pulled:  122000
pull number:  123  current followers pulled:  123000
pull number:  124  current followers pulled:  124000
pull number:  125  current followers pulled:  125000
pull number:  126  current followers pulled:  126000
pull number:  127  current followers pulled:  127000
pull number:  128  current followers pulled:  128000


Rate limit exceeded. Sleeping for 888 seconds.


pull number:  129  current followers pulled:  129000
pull number:  130  current followers pulled:  130000
pull number:  131  current followers pulled:  131000
pull number:  132  current followers pulled:  132000
pull number:  133  current followers pulled:  133000
pull number:  134  current followers pulled:  134000
pull number:  135  current followers pulled:  135000
pull number:  136  current followers pulled:  136000
pull number:  137  current followers pulled:  137000
pull number:  138  current followers pulled:  138000
pull number:  139  current followers pulled:  139000
pull number:  140  current followers pulled:  140000
pull number:  141  current followers pulled:  141000
pull number:  142  current followers pulled:  142000
pull number:  143  current followers pulled:  143000


Rate limit exceeded. Sleeping for 888 seconds.


pull number:  144  current followers pulled:  144000
pull number:  145  current followers pulled:  145000
pull number:  146  current followers pulled:  146000
pull number:  147  current followers pulled:  147000
pull number:  148  current followers pulled:  148000
pull number:  149  current followers pulled:  149000
pull number:  150  current followers pulled:  150000
pull number:  151  current followers pulled:  151000
pull number:  152  current followers pulled:  152000
pull number:  153  current followers pulled:  153000
pull number:  154  current followers pulled:  154000
pull number:  155  current followers pulled:  155000
pull number:  156  current followers pulled:  156000
pull number:  157  current followers pulled:  157000
pull number:  158  current followers pulled:  158000


Rate limit exceeded. Sleeping for 886 seconds.


pull number:  159  current followers pulled:  159000
pull number:  160  current followers pulled:  160000
pull number:  161  current followers pulled:  161000
pull number:  162  current followers pulled:  162000
pull number:  163  current followers pulled:  163000
pull number:  164  current followers pulled:  164000
pull number:  165  current followers pulled:  165000
pull number:  166  current followers pulled:  166000
pull number:  167  current followers pulled:  167000
pull number:  168  current followers pulled:  168000
pull number:  169  current followers pulled:  169000
pull number:  170  current followers pulled:  170000
pull number:  171  current followers pulled:  171000
pull number:  172  current followers pulled:  172000
pull number:  173  current followers pulled:  173000


Rate limit exceeded. Sleeping for 888 seconds.


pull number:  174  current followers pulled:  174000
pull number:  175  current followers pulled:  175000
pull number:  176  current followers pulled:  176000
pull number:  177  current followers pulled:  177000
pull number:  178  current followers pulled:  178000
pull number:  179  current followers pulled:  179000
pull number:  180  current followers pulled:  180000
pull number:  181  current followers pulled:  181000
pull number:  182  current followers pulled:  182000
pull number:  183  current followers pulled:  183000
pull number:  184  current followers pulled:  184000
pull number:  185  current followers pulled:  185000
pull number:  186  current followers pulled:  186000
pull number:  187  current followers pulled:  187000
pull number:  188  current followers pulled:  188000


Rate limit exceeded. Sleeping for 887 seconds.


pull number:  189  current followers pulled:  189000
pull number:  190  current followers pulled:  190000
pull number:  191  current followers pulled:  191000
pull number:  192  current followers pulled:  192000
pull number:  193  current followers pulled:  193000
pull number:  194  current followers pulled:  194000
pull number:  195  current followers pulled:  195000
pull number:  196  current followers pulled:  196000
pull number:  197  current followers pulled:  197000
pull number:  198  current followers pulled:  198000
pull number:  199  current followers pulled:  199000
pull number:  200  current followers pulled:  200000
pull number:  201  current followers pulled:  201000
pull number:  202  current followers pulled:  202000
3:01:44.522487


In [348]:
# convert from lists to dataframes
df_weezer_data = pd.DataFrame(weezer_followers, columns = ['username','id', 'name','location','followers_count',
                                               'following_count','description'])

df_weezer_ids = pd.DataFrame(weezer_followers_id)
df_weezer_data.head()

Unnamed: 0,username,id,name,location,followers_count,following_count,description
0,pistachioIattes,1457901735205228546,madison,co,213,74,222 ⋆⁺₊⋆ I write fun things for @AllTimeEDM
1,NillaChinchilla,1313666612852072449,VanillaChinchilla,,199,27,uhh\nartist\n18
2,addisenisaloser,1346836723716730881,Addisen,her killing jar,154,20,i have an obsession with four guys from jersey...
3,knight_uzumaki,1612319583674978311,Nezuko Grace Knight Wilson Quinn Uzumaki,,103,0,Parent IRL:(Hinata/Naruto)\nIRL: Family Foreve...
4,Emojix_Granie,1430873450818777090,Emojix🫠,"Zadupie, Polska",545,17,Emojix 2: @Emojix11


In [368]:
#write the dataframes as .txt files to the twitter folder
df_weezer_data.to_csv(r'/Users/ryan_s_dunn/twitter/weezer_follower_data.txt', sep=' ')

df_weezer_ids.to_csv(r'/Users/ryan_s_dunn/twitter/weezer_followers.txt', sep=' ', header = ['id'])

#### Collect Greenday Followers Data

In [344]:
handle = 'greenday'

total_followers = 201000
greenday_followers = []
greenday_followers_id = []
pulls = 0
max_pulls = 100
next_token = None

user_name = client.get_user(username=handle).data.name
user_id = client.get_user(username=handle).data.id

while True :

    followers = client.get_users_followers(
        user_id, 
        max_results=1000, 
        pagination_token = next_token,
        user_fields=["created_at","description","location",
                     "entities","name","pinned_tweet_id","profile_image_url",
                     "verified","public_metrics"]
    )
    pulls += 1
    
    for follower in followers.data : 
        follower_row = (follower.username, follower.id,follower.name,follower.location, 
                        follower.public_metrics['following_count'],follower.public_metrics['followers_count'],
                        follower.description)
        greenday_followers.append(follower_row)
    
    for follower in followers.data :
        follower_id_row = (follower.id)
        greenday_followers_id.append(follower_id_row)
        
    print('pull number: ', pulls, ' current followers pulled: ', len(greenday_followers_id),
         datetime.datetime.now())
    
    if len(greenday_followers_id) > total_followers:
        break

# convert from lists to dataframes
df_greenday_data = pd.DataFrame(greenday_followers, columns = ['username','id', 'name','location','followers_count',
                                               'following_count','description'])

df_greenday_ids = pd.DataFrame(greenday_followers_id)

#write the dataframes as .txt files to the twitter folder
df_greenday_data.to_csv(r'/Users/ryan_s_dunn/twitter/greenday_follower_data.txt', 
                      header=None, index=None, sep=' ', mode='a')

df_greenday_ids.to_csv(r'/Users/ryan_s_dunn/twitter/greenday_followers.txt', 
                     header=None, index=None, sep=' ', mode='a')

end_time = datetime.datetime.now()
print(end_time - start_time)

pull number:  1  current followers pulled:  1000 2023-01-16 08:25:47.684023
pull number:  2  current followers pulled:  2000 2023-01-16 08:25:48.327893
pull number:  3  current followers pulled:  3000 2023-01-16 08:25:49.125366
pull number:  4  current followers pulled:  4000 2023-01-16 08:25:49.843901
pull number:  5  current followers pulled:  5000 2023-01-16 08:25:50.564602
pull number:  6  current followers pulled:  6000 2023-01-16 08:25:51.381287
pull number:  7  current followers pulled:  7000 2023-01-16 08:25:52.095082
pull number:  8  current followers pulled:  8000 2023-01-16 08:25:52.916207
pull number:  9  current followers pulled:  9000 2023-01-16 08:25:53.635182
pull number:  10  current followers pulled:  10000 2023-01-16 08:25:54.349793
pull number:  11  current followers pulled:  11000 2023-01-16 08:25:55.030132
pull number:  12  current followers pulled:  12000 2023-01-16 08:25:55.780593
pull number:  13  current followers pulled:  13000 2023-01-16 08:25:56.503283
pull

Rate limit exceeded. Sleeping for 890 seconds.


pull number:  15  current followers pulled:  15000 2023-01-16 08:25:57.826980
pull number:  16  current followers pulled:  16000 2023-01-16 08:40:49.191337
pull number:  17  current followers pulled:  17000 2023-01-16 08:40:50.097236
pull number:  18  current followers pulled:  18000 2023-01-16 08:40:50.770196
pull number:  19  current followers pulled:  19000 2023-01-16 08:40:51.505256
pull number:  20  current followers pulled:  20000 2023-01-16 08:40:52.200671
pull number:  21  current followers pulled:  21000 2023-01-16 08:40:52.917099
pull number:  22  current followers pulled:  22000 2023-01-16 08:40:53.634670
pull number:  23  current followers pulled:  23000 2023-01-16 08:40:54.451853
pull number:  24  current followers pulled:  24000 2023-01-16 08:40:55.071064
pull number:  25  current followers pulled:  25000 2023-01-16 08:40:55.887607
pull number:  26  current followers pulled:  26000 2023-01-16 08:40:56.601076
pull number:  27  current followers pulled:  27000 2023-01-16 08

Rate limit exceeded. Sleeping for 890 seconds.


pull number:  30  current followers pulled:  30000 2023-01-16 08:40:59.672699
pull number:  31  current followers pulled:  31000 2023-01-16 08:55:51.046057
pull number:  32  current followers pulled:  32000 2023-01-16 08:55:51.856983
pull number:  33  current followers pulled:  33000 2023-01-16 08:55:52.546263
pull number:  34  current followers pulled:  34000 2023-01-16 08:55:53.429356
pull number:  35  current followers pulled:  35000 2023-01-16 08:55:54.216265
pull number:  36  current followers pulled:  36000 2023-01-16 08:55:54.884570
pull number:  37  current followers pulled:  37000 2023-01-16 08:55:55.583581
pull number:  38  current followers pulled:  38000 2023-01-16 08:55:56.315711
pull number:  39  current followers pulled:  39000 2023-01-16 08:55:57.132309
pull number:  40  current followers pulled:  40000 2023-01-16 08:55:57.766369
pull number:  41  current followers pulled:  41000 2023-01-16 08:55:58.467810
pull number:  42  current followers pulled:  42000 2023-01-16 08

Rate limit exceeded. Sleeping for 890 seconds.


pull number:  45  current followers pulled:  45000 2023-01-16 08:56:01.226800
pull number:  46  current followers pulled:  46000 2023-01-16 09:10:52.687534
pull number:  47  current followers pulled:  47000 2023-01-16 09:10:53.437618
pull number:  48  current followers pulled:  48000 2023-01-16 09:10:54.047287
pull number:  49  current followers pulled:  49000 2023-01-16 09:10:54.788937
pull number:  50  current followers pulled:  50000 2023-01-16 09:10:55.424791
pull number:  51  current followers pulled:  51000 2023-01-16 09:10:56.215892
pull number:  52  current followers pulled:  52000 2023-01-16 09:10:57.033785
pull number:  53  current followers pulled:  53000 2023-01-16 09:10:57.656298
pull number:  54  current followers pulled:  54000 2023-01-16 09:10:58.368156
pull number:  55  current followers pulled:  55000 2023-01-16 09:10:59.076173
pull number:  56  current followers pulled:  56000 2023-01-16 09:10:59.696889
pull number:  57  current followers pulled:  57000 2023-01-16 09

Rate limit exceeded. Sleeping for 890 seconds.


pull number:  60  current followers pulled:  60000 2023-01-16 09:11:02.764978
pull number:  61  current followers pulled:  61000 2023-01-16 09:25:54.028996
pull number:  62  current followers pulled:  62000 2023-01-16 09:25:54.682649
pull number:  63  current followers pulled:  63000 2023-01-16 09:25:55.399300
pull number:  64  current followers pulled:  64000 2023-01-16 09:25:56.109729
pull number:  65  current followers pulled:  65000 2023-01-16 09:25:56.832436
pull number:  66  current followers pulled:  66000 2023-01-16 09:25:57.651818
pull number:  67  current followers pulled:  67000 2023-01-16 09:25:58.265673
pull number:  68  current followers pulled:  68000 2023-01-16 09:25:58.884012
pull number:  69  current followers pulled:  69000 2023-01-16 09:25:59.478600
pull number:  70  current followers pulled:  70000 2023-01-16 09:26:00.217392
pull number:  71  current followers pulled:  71000 2023-01-16 09:26:00.830866
pull number:  72  current followers pulled:  72000 2023-01-16 09

Rate limit exceeded. Sleeping for 891 seconds.


pull number:  75  current followers pulled:  75000 2023-01-16 09:26:03.560417
pull number:  76  current followers pulled:  76000 2023-01-16 09:41:02.390291
pull number:  77  current followers pulled:  77000 2023-01-16 09:41:03.134365
pull number:  78  current followers pulled:  78000 2023-01-16 09:41:03.809817
pull number:  79  current followers pulled:  79000 2023-01-16 09:41:04.511635
pull number:  80  current followers pulled:  80000 2023-01-16 09:41:05.172270
pull number:  81  current followers pulled:  81000 2023-01-16 09:41:05.769118
pull number:  82  current followers pulled:  82000 2023-01-16 09:41:06.573160
pull number:  83  current followers pulled:  83000 2023-01-16 09:41:07.205217
pull number:  84  current followers pulled:  84000 2023-01-16 09:41:07.930680
pull number:  85  current followers pulled:  85000 2023-01-16 09:41:08.556477
pull number:  86  current followers pulled:  86000 2023-01-16 09:41:09.240773
pull number:  87  current followers pulled:  87000 2023-01-16 09

Rate limit exceeded. Sleeping for 890 seconds.


pull number:  90  current followers pulled:  90000 2023-01-16 09:41:12.220138
pull number:  91  current followers pulled:  90999 2023-01-16 09:56:03.460293
pull number:  92  current followers pulled:  91998 2023-01-16 09:56:04.159166
pull number:  93  current followers pulled:  92997 2023-01-16 09:56:04.792308
pull number:  94  current followers pulled:  93996 2023-01-16 09:56:05.455515
pull number:  95  current followers pulled:  94995 2023-01-16 09:56:06.138022
pull number:  96  current followers pulled:  95994 2023-01-16 09:56:06.761403
pull number:  97  current followers pulled:  96993 2023-01-16 09:56:07.503612
pull number:  98  current followers pulled:  97992 2023-01-16 09:56:08.218557
pull number:  99  current followers pulled:  98991 2023-01-16 09:56:08.830976
pull number:  100  current followers pulled:  99990 2023-01-16 09:56:09.552468
pull number:  101  current followers pulled:  100989 2023-01-16 09:56:10.175369
pull number:  102  current followers pulled:  101988 2023-01-

Rate limit exceeded. Sleeping for 891 seconds.


pull number:  105  current followers pulled:  104985 2023-01-16 09:56:12.888715
pull number:  106  current followers pulled:  105984 2023-01-16 10:11:05.153159
pull number:  107  current followers pulled:  106983 2023-01-16 10:11:05.882086
pull number:  108  current followers pulled:  107982 2023-01-16 10:11:06.507109
pull number:  109  current followers pulled:  108981 2023-01-16 10:11:07.477778
pull number:  110  current followers pulled:  109980 2023-01-16 10:11:08.135396
pull number:  111  current followers pulled:  110979 2023-01-16 10:11:08.811186
pull number:  112  current followers pulled:  111978 2023-01-16 10:11:09.681546
pull number:  113  current followers pulled:  112977 2023-01-16 10:11:10.447888
pull number:  114  current followers pulled:  113976 2023-01-16 10:11:11.069505
pull number:  115  current followers pulled:  114975 2023-01-16 10:11:11.675069
pull number:  116  current followers pulled:  115974 2023-01-16 10:11:12.293373
pull number:  117  current followers pul

Rate limit exceeded. Sleeping for 890 seconds.


pull number:  120  current followers pulled:  119970 2023-01-16 10:11:15.162652
pull number:  121  current followers pulled:  120969 2023-01-16 10:26:06.458412
pull number:  122  current followers pulled:  121968 2023-01-16 10:26:07.177630
pull number:  123  current followers pulled:  122967 2023-01-16 10:26:07.762006
pull number:  124  current followers pulled:  123966 2023-01-16 10:26:08.290044
pull number:  125  current followers pulled:  124965 2023-01-16 10:26:08.876490
pull number:  126  current followers pulled:  125964 2023-01-16 10:26:09.507031
pull number:  127  current followers pulled:  126963 2023-01-16 10:26:10.089348
pull number:  128  current followers pulled:  127962 2023-01-16 10:26:10.637775
pull number:  129  current followers pulled:  128961 2023-01-16 10:26:11.221691
pull number:  130  current followers pulled:  129960 2023-01-16 10:26:11.794873
pull number:  131  current followers pulled:  130959 2023-01-16 10:26:12.390371
pull number:  132  current followers pul

Rate limit exceeded. Sleeping for 892 seconds.


pull number:  135  current followers pulled:  134955 2023-01-16 10:26:14.893912
pull number:  136  current followers pulled:  135954 2023-01-16 10:41:08.282987
pull number:  137  current followers pulled:  136953 2023-01-16 10:41:08.913323
pull number:  138  current followers pulled:  137952 2023-01-16 10:41:09.506414
pull number:  139  current followers pulled:  138951 2023-01-16 10:41:10.259466
pull number:  140  current followers pulled:  139950 2023-01-16 10:41:10.976848
pull number:  141  current followers pulled:  140949 2023-01-16 10:41:11.585917
pull number:  142  current followers pulled:  141948 2023-01-16 10:41:12.196819
pull number:  143  current followers pulled:  142947 2023-01-16 10:41:12.819841
pull number:  144  current followers pulled:  143946 2023-01-16 10:41:13.430794
pull number:  145  current followers pulled:  144945 2023-01-16 10:41:14.053242
pull number:  146  current followers pulled:  145944 2023-01-16 10:41:14.941629
pull number:  147  current followers pul

Rate limit exceeded. Sleeping for 891 seconds.


pull number:  150  current followers pulled:  149940 2023-01-16 10:41:17.374980
pull number:  151  current followers pulled:  150939 2023-01-16 10:56:09.765600
pull number:  152  current followers pulled:  151938 2023-01-16 10:56:10.396680
pull number:  153  current followers pulled:  152937 2023-01-16 10:56:10.919091
pull number:  154  current followers pulled:  153936 2023-01-16 10:56:11.704993
pull number:  155  current followers pulled:  154935 2023-01-16 10:56:12.304606
pull number:  156  current followers pulled:  155934 2023-01-16 10:56:12.955713
pull number:  157  current followers pulled:  156933 2023-01-16 10:56:13.476279
pull number:  158  current followers pulled:  157932 2023-01-16 10:56:14.065396
pull number:  159  current followers pulled:  158931 2023-01-16 10:56:14.887741
pull number:  160  current followers pulled:  159930 2023-01-16 10:56:15.445021
pull number:  161  current followers pulled:  160929 2023-01-16 10:56:15.951882
pull number:  162  current followers pul

Rate limit exceeded. Sleeping for 891 seconds.


pull number:  165  current followers pulled:  164925 2023-01-16 10:56:18.196716
pull number:  166  current followers pulled:  165924 2023-01-16 11:11:10.580816
pull number:  167  current followers pulled:  166923 2023-01-16 11:11:11.157531
pull number:  168  current followers pulled:  167922 2023-01-16 11:11:11.799345
pull number:  169  current followers pulled:  168921 2023-01-16 11:11:12.465705
pull number:  170  current followers pulled:  169920 2023-01-16 11:11:13.079603
pull number:  171  current followers pulled:  170919 2023-01-16 11:11:13.697811
pull number:  172  current followers pulled:  171918 2023-01-16 11:11:14.310762
pull number:  173  current followers pulled:  172917 2023-01-16 11:11:14.926594
pull number:  174  current followers pulled:  173916 2023-01-16 11:11:15.644254
pull number:  175  current followers pulled:  174915 2023-01-16 11:11:16.307991
pull number:  176  current followers pulled:  175914 2023-01-16 11:11:17.173843
pull number:  177  current followers pul

Rate limit exceeded. Sleeping for 891 seconds.


pull number:  180  current followers pulled:  179910 2023-01-16 11:11:19.734201
pull number:  181  current followers pulled:  180909 2023-01-16 11:26:12.003670
pull number:  182  current followers pulled:  181908 2023-01-16 11:26:12.614863
pull number:  183  current followers pulled:  182907 2023-01-16 11:26:13.284127
pull number:  184  current followers pulled:  183906 2023-01-16 11:26:13.995704
pull number:  185  current followers pulled:  184905 2023-01-16 11:26:14.625855
pull number:  186  current followers pulled:  185904 2023-01-16 11:26:15.235080
pull number:  187  current followers pulled:  186903 2023-01-16 11:26:15.948451
pull number:  188  current followers pulled:  187902 2023-01-16 11:26:16.559077
pull number:  189  current followers pulled:  188901 2023-01-16 11:26:17.179608
pull number:  190  current followers pulled:  189900 2023-01-16 11:26:17.770038
pull number:  191  current followers pulled:  190899 2023-01-16 11:26:18.407532
pull number:  192  current followers pul

Rate limit exceeded. Sleeping for 891 seconds.


pull number:  195  current followers pulled:  194895 2023-01-16 11:26:20.964945
pull number:  196  current followers pulled:  195894 2023-01-16 11:41:13.304223
pull number:  197  current followers pulled:  196893 2023-01-16 11:41:14.058514
pull number:  198  current followers pulled:  197892 2023-01-16 11:41:14.840619
pull number:  199  current followers pulled:  198891 2023-01-16 11:41:15.557615
pull number:  200  current followers pulled:  199890 2023-01-16 11:41:16.173199
pull number:  201  current followers pulled:  200889 2023-01-16 11:41:16.821209
pull number:  202  current followers pulled:  201888 2023-01-16 11:41:17.433654
14:25:23.099113


In [382]:
#greenday_followers

In [436]:
# convert from lists to dataframes
df_greenday_data = pd.DataFrame(greenday_followers, columns = ['username','id', 'name','location','followers_count',
                                               'following_count','description'])

df_greenday_ids = pd.DataFrame(greenday_followers_id, columns = ['id'])

df_greenday_data.head()

Unnamed: 0,username,id,name,location,followers_count,following_count,description
0,SamuelStubbs15,1331713470090391557,Samuel Stubbs,Larne,221,10,29M\nLarne \nN.Ireland \nLet’s say Darwin woul...
1,Eugen_Sobran,2941144085,Eugen,,30,8,
2,garybrooks82,55915955,Gary,UK,557,111,"F1 fan, Escape Rooms, Brewdog, Gaming, Star Tr..."
3,jamiecook_777,1142336904370761728,Jamie,,782,48,car go brom
4,ashx_e,1379800080874672135,ngl.xlll10,《𝘏𝘦 | 𝘏𝘪𝘮》intp-t 6w5,601,20,trolo bardeando por todo


In [385]:
#write the dataframes as .txt files to the twitter folder
df_greenday_data.to_csv(r'/Users/ryan_s_dunn/twitter/greenday_follower_data.txt', sep=' ', header = True,
                        index = False,
                       columns = ['username','id','name','location','followers_count','following_count',
                                 'description'])

df_greenday_ids.to_csv(r'/Users/ryan_s_dunn/twitter/greenday_followers.txt', sep=' ', header = ['id'])

In [437]:
tricky_description = """
    Home by Warsan Shire
    
    no one leaves home unless
    home is the mouth of a shark.
    you only run for the border
    when you see the whole city
    running as well.

"""
# This won't work in a tab-delimited text file.

clean_description = re.sub(r"\s+"," ",tricky_description).strip()
clean_description

'Home by Warsan Shire no one leaves home unless home is the mouth of a shark. you only run for the border when you see the whole city running as well.'

---

# Lyrics Scrape

This section asks you to pull data from the Twitter API and scrape www.AZLyrics.com. In the notebooks where you do that work you are asked to store the data in specific ways. 

In [438]:
artists = {'weezer':"https://www.azlyrics.com/w/weezer.html",
           'greenday':"https://www.azlyrics.com/g/greenday.html"} 
# we'll use this dictionary to hold both the artist name and the link on AZlyrics

artists

{'weezer': 'https://www.azlyrics.com/w/weezer.html',
 'greenday': 'https://www.azlyrics.com/g/greenday.html'}

## A Note on Rate Limiting

The lyrics site, www.azlyrics.com, does not have an explicit maximum on number of requests in any one time, but in our testing it appears that too many requests in too short a time will cause the site to stop returning lyrics pages. (Entertainingly, the page that gets returned seems to only have the song title to [a Tom Jones song](https://www.azlyrics.com/lyrics/tomjones/itsnotunusual.html).) 

Whenever you call `requests.get` to retrieve a page, put a `time.sleep(5 + 10*random.random())` on the next line. This will help you not to get blocked. If you _do_ get blocked, which you can identify if the returned pages are not correct, just request a lyrics page through your browser. You'll be asked to perform a CAPTCHA and then your requests should start working again. 

## Part 1: Finding Links to Songs Lyrics

That general artist page has a list of all songs for that artist with links to the individual song pages. 

Q: Take a look at the `robots.txt` page on www.azlyrics.com. (You can read more about these pages [here](https://developers.google.com/search/docs/advanced/robots/intro).) Is the scraping we are about to do allowed or disallowed by this page? How do you know? 

A: <!-- Delete this comment and put your answer here. --> 


In [439]:
# Let's set up a dictionary of lists to hold our links
lyrics_pages = defaultdict(list)
test_list = []
for artist, artist_page in artists.items() :
    # request the page and sleep
    r = requests.get(artist_page)
    print(r)
    soup = BeautifulSoup(r.content, "html.parser")
    song_links = [link["href"] for link in soup.find_all("a", href=True) if "lyrics" in link["href"]]
    test_list.append(song_links)
    print(r)

    time.sleep(3 + 4*random())

<Response [200]>
<Response [200]>
<Response [200]>
<Response [200]>


In [441]:
test_list

[['//www.azlyrics.com',
  '//www.azlyrics.com/a.html',
  '//www.azlyrics.com/b.html',
  '//www.azlyrics.com/c.html',
  '//www.azlyrics.com/d.html',
  '//www.azlyrics.com/e.html',
  '//www.azlyrics.com/f.html',
  '//www.azlyrics.com/g.html',
  '//www.azlyrics.com/h.html',
  '//www.azlyrics.com/i.html',
  '//www.azlyrics.com/j.html',
  '//www.azlyrics.com/k.html',
  '//www.azlyrics.com/l.html',
  '//www.azlyrics.com/m.html',
  '//www.azlyrics.com/n.html',
  '//www.azlyrics.com/o.html',
  '//www.azlyrics.com/p.html',
  '//www.azlyrics.com/q.html',
  '//www.azlyrics.com/r.html',
  '//www.azlyrics.com/s.html',
  '//www.azlyrics.com/t.html',
  '//www.azlyrics.com/u.html',
  '//www.azlyrics.com/v.html',
  '//www.azlyrics.com/w.html',
  '//www.azlyrics.com/x.html',
  '//www.azlyrics.com/y.html',
  '//www.azlyrics.com/z.html',
  '//www.azlyrics.com/19.html',
  'https://www.facebook.com/sharer.php?u=https%3A%2F%2Fwww.azlyrics.com%2Fw%2Fweezer.html&title=Weezer%20Lyrics&display=page',
  'https://

In [421]:
import requests
from bs4 import BeautifulSoup

# Make an HTTP request to the website
url = "https://www.azlyrics.com/w/weezer.html"
response = requests.get(url)

# Parse the HTML content
soup = BeautifulSoup(response.content, "html.parser")

# Find all the song links on the website
song_links = [link["href"] for link in soup.find_all("a", href=True) if "lyrics" in link["href"]]
song_links

['//www.azlyrics.com',
 '//www.azlyrics.com/a.html',
 '//www.azlyrics.com/b.html',
 '//www.azlyrics.com/c.html',
 '//www.azlyrics.com/d.html',
 '//www.azlyrics.com/e.html',
 '//www.azlyrics.com/f.html',
 '//www.azlyrics.com/g.html',
 '//www.azlyrics.com/h.html',
 '//www.azlyrics.com/i.html',
 '//www.azlyrics.com/j.html',
 '//www.azlyrics.com/k.html',
 '//www.azlyrics.com/l.html',
 '//www.azlyrics.com/m.html',
 '//www.azlyrics.com/n.html',
 '//www.azlyrics.com/o.html',
 '//www.azlyrics.com/p.html',
 '//www.azlyrics.com/q.html',
 '//www.azlyrics.com/r.html',
 '//www.azlyrics.com/s.html',
 '//www.azlyrics.com/t.html',
 '//www.azlyrics.com/u.html',
 '//www.azlyrics.com/v.html',
 '//www.azlyrics.com/w.html',
 '//www.azlyrics.com/x.html',
 '//www.azlyrics.com/y.html',
 '//www.azlyrics.com/z.html',
 '//www.azlyrics.com/19.html',
 'https://www.facebook.com/sharer.php?u=https%3A%2F%2Fwww.azlyrics.com%2Fw%2Fweezer.html&title=Weezer%20Lyrics&display=page',
 'https://twitter.com/intent/tweet?url=h

In [434]:
for url in song_links:
    soup = BeautifulSoup(requests.get("https://www.azlyrics.com"+url).content, "html.parser")
    lyric = soup.get_text(
        strip=True, separator="\n"
    )

#    print(lyric)

ConnectionError: HTTPSConnectionPool(host='www.azlyrics.comhttps', port=443): Max retries exceeded with url: //www.facebook.com/sharer.php?u=https%3A%2F%2Fwww.azlyrics.com%2Fg%2Fgreenday.html&title=Green%20Day%20Lyrics&display=page (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fbf89c60850>: Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'))

In [422]:
lyrics_list = []

for link in song_links:
    lyrics_list.append(soup.get_text())

In [423]:
lyrics_list

['\n\n\n\n\n\n\n\n\n\nWeezer Lyrics\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nA\nB\nC\nD\nE\nF\nG\nH\nI\nJ\nK\nL\nM\nN\nO\nP\nQ\nR\nS\nT\nU\nV\nW\nX\nY\nZ\n#\n\n\n\n\n\n\n\n\n Search\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nWeezer Lyrics\n\n\n\n\n sort by album sort by song\n\nalbum: "Weezer (The Blue Album)" (1994)\nMy Name Is Jonas\nNo One Else\nThe World Has Turned And Left Me Here\nBuddy Holly\nUndone - The Sweater Song\nSurf Wax America\nSay It Ain\'t So\nIn The Garage\nHoliday\nOnly In Dreams\nMykel And Carli(Deluxe Edition Bonus Track)\nSuzanne(Deluxe Edition Bonus Track)\nMy Evaline(Deluxe Edition Bonus Track)\nJamie(Deluxe Edition Bonus Track)\nMy Name Is Jonas (Live)(Deluxe Edition Bonus Track)\nSurf Wax America (Live)(Deluxe Edition Bonus Track)\nJamie (Acoustic Live)(Deluxe Edition Bonus Track)\nNo One Else (Acoustic Live)(Deluxe Edition Bonus Track)\nUndone (The Sweater Song) (Demo)(Deluxe Edition

In [428]:
# Let's set up a dictionary of lists to hold our links
lyrics_pages = []

for artist, artist_page in artists.items() :
    # request the page and sleep
    r = requests.get(artist_page)
    time.sleep(5 + 10*random())

    # now extract the links to lyrics pages from this page
    soup = BeautifulSoup(r.text, "html.parser")

    # Find all the song links on the website
    song_links = [link["href"] for link in soup.find_all("a", href=True) if "lyrics" in link["href"]]
    # store the links `lyrics_pages` where the key is the artist and the
    lyrics_pages.append(song_links)
    # value is a list of links. 
    

In [429]:
lyrics_pages

[['//www.azlyrics.com',
  '//www.azlyrics.com/a.html',
  '//www.azlyrics.com/b.html',
  '//www.azlyrics.com/c.html',
  '//www.azlyrics.com/d.html',
  '//www.azlyrics.com/e.html',
  '//www.azlyrics.com/f.html',
  '//www.azlyrics.com/g.html',
  '//www.azlyrics.com/h.html',
  '//www.azlyrics.com/i.html',
  '//www.azlyrics.com/j.html',
  '//www.azlyrics.com/k.html',
  '//www.azlyrics.com/l.html',
  '//www.azlyrics.com/m.html',
  '//www.azlyrics.com/n.html',
  '//www.azlyrics.com/o.html',
  '//www.azlyrics.com/p.html',
  '//www.azlyrics.com/q.html',
  '//www.azlyrics.com/r.html',
  '//www.azlyrics.com/s.html',
  '//www.azlyrics.com/t.html',
  '//www.azlyrics.com/u.html',
  '//www.azlyrics.com/v.html',
  '//www.azlyrics.com/w.html',
  '//www.azlyrics.com/x.html',
  '//www.azlyrics.com/y.html',
  '//www.azlyrics.com/z.html',
  '//www.azlyrics.com/19.html',
  'https://www.facebook.com/sharer.php?u=https%3A%2F%2Fwww.azlyrics.com%2Fw%2Fweezer.html&title=Weezer%20Lyrics&display=page',
  'https://

Let's make sure we have enough lyrics pages to scrape. 

In [430]:
for artis in lyrics_pages.items() :
    assert(len(set(lp)) > 20) 

AttributeError: 'list' object has no attribute 'items'

In [412]:
# Let's see how long it's going to take to pull these lyrics 
# if we're waiting `5 + 10*random.random()` seconds 
for artist, links in lyrics_pages.items() : 
    print(f"For {artist} we have {len(links)}.")
    print(f"The full pull will take for this artist will take {round(len(links)*10/3600,2)} hours.")

AttributeError: 'list' object has no attribute 'items'

## Part 2: Pulling Lyrics

Now that we have the links to our lyrics pages, let's go scrape them! Here are the steps for this part. 

1. Create an empty folder in our repo called "lyrics". 
1. Iterate over the artists in `lyrics_pages`. 
1. Create a subfolder in lyrics with the artist's name. For instance, if the artist was Cher you'd have `lyrics/cher/` in your repo.
1. Iterate over the pages. 
1. Request the page and extract the lyrics from the returned HTML file using BeautifulSoup.
1. Use the function below, `generate_filename_from_url`, to create a filename based on the lyrics page, then write the lyrics to a text file with that name. 


In [446]:
def generate_filename_from_link(link) :
    
    if not link :
        return None
    
    # drop the http or https and the html
    name = link.replace("https","").replace("http","")
    name = link.replace(".html","")

    name = name.replace("/lyrics/","")
    
    # Replace useless chareacters with UNDERSCORE
    name = name.replace("://","").replace(".","_").replace("/","_")
    
    # tack on .txt
    name = name + ".txt"
    
    print(name)
    return(name)

actual_links = generate_filename_from_link(test_list)

AttributeError: 'list' object has no attribute 'replace'

In [414]:
# Make the lyrics folder here. If you'd like to practice your programming, add functionality 
# that checks to see if the folder exists. If it does, then use shutil.rmtree to remove it and create a new one.

if os.path.isdir("lyrics") : 
    shutil.rmtree("lyrics/")

os.mkdir("lyrics")

In [415]:
url_stub = "https://www.azlyrics.com" 
start = time.time()

total_pages = 0 

for artist in lyrics_pages :

    # Use this space to carry out the following steps: 
    
    # 1. Build a subfolder for the artist
    # 2. Iterate over the lyrics pages
    # 3. Request the lyrics page. 
        # Don't forget to add a line like `time.sleep(5 + 10*random.random())`
        # to sleep after making the request
    # 4. Extract the title and lyrics from the page.
    # 5. Write out the title, two returns ('\n'), and the lyrics. Use `generate_filename_from_url`
    #    to generate the filename. 
    
    # Remember to pull at least 20 songs per artist. It may be fun to pull all the songs for the artist
    

SyntaxError: unexpected EOF while parsing (<ipython-input-415-9e7b4f75ca4d>, line 20)

In [416]:
print(f"Total run time was {round((time.time() - start)/3600,2)} hours.")

NameError: name 'start' is not defined

---

# Evaluation

This assignment asks you to pull data from the Twitter API and scrape www.AZLyrics.com.  After you have finished the above sections , run all the cells in this notebook. Print this to PDF and submit it, per the instructions.

In [417]:
# Simple word extractor from Peter Norvig: https://norvig.com/spell-correct.html
def words(text): 
    return re.findall(r'\w+', text.lower())

---

## Checking Twitter Data

The output from your Twitter API pull should be two files per artist, stored in files with formats like `cher_followers.txt` (a list of all follower IDs you pulled) and `cher_followers_data.txt`. These files should be in a folder named `twitter` within the repository directory. This code summarizes the information at a high level to help the instructor evaluate your work. 

In [418]:
twitter_files = os.listdir("twitter")
twitter_files = [f for f in twitter_files if f != ".DS_Store"]
artist_handles = list(set([name.split("_")[0] for name in twitter_files]))

print(f"We see two artist handles: {artist_handles[0]} and {artist_handles[1]}.")

We see two artist handles: greenday and weezer.


In [419]:
for artist in artist_handles :
    follower_file = artist + "_followers.txt"
    follower_data_file = artist + "_follower_data.txt"
    
    ids = open("twitter/" + follower_file,'r').readlines()
    
    print(f"We see {len(ids)-1} in your follower file for {artist}, assuming a header row.")
    
    with open("twitter/" + follower_data_file,'r') as infile :
        
        # check the headers
        headers = infile.readline().split("\t")
        
        print(f"In the follower data file ({follower_data_file}) for {artist}, we have these columns:")
        print(" : ".join(headers))
        
        description_words = []
        locations = set()
        
        
        for idx, line in enumerate(infile.readlines()) :
            line = line.strip("\n").split("\t")
            
            try : 
                locations.add(line[3])            
                description_words.extend(words(line[6]))
            except :
                pass
    
        

        print(f"We have {idx+1} data rows for {artist} in the follower data file.")

        print(f"For {artist} we have {len(locations)} unique locations.")

        print(f"For {artist} we have {len(description_words)} words in the descriptions.")
        print("Here are the five most common words:")
        print(Counter(description_words).most_common(5))

        
        print("")
        print("-"*40)
        print("")
    

We see 201888 in your follower file for greenday, assuming a header row.
In the follower data file (greenday_follower_data.txt) for greenday, we have these columns:
username id name location followers_count following_count description

We have 245908 data rows for greenday in the follower data file.
For greenday we have 0 unique locations.
For greenday we have 0 words in the descriptions.
Here are the five most common words:
[]

----------------------------------------

We see 202000 in your follower file for weezer, assuming a header row.
In the follower data file (weezer_follower_data.txt) for weezer, we have these columns:
 username id name location followers_count following_count description

We have 260667 data rows for weezer in the follower data file.
For weezer we have 0 unique locations.
For weezer we have 0 words in the descriptions.
Here are the five most common words:
[]

----------------------------------------



FileNotFoundError: [Errno 2] No such file or directory: 'twitter/.ipynb_followers.txt'

## Checking Lyrics 

The output from your lyrics scrape should be stored in files located in this path from the directory:
`/lyrics/[Artist Name]/[filename from URL]`. This code summarizes the information at a high level to help the instructor evaluate your work. 

In [None]:
artist_folders = os.listdir("lyrics/")
artist_folders = [f for f in artist_folders if os.path.isdir("lyrics/" + f)]

for artist in artist_folders : 
    artist_files = os.listdir("lyrics/" + artist)
    artist_files = [f for f in artist_files if 'txt' in f or 'csv' in f or 'tsv' in f]

    print(f"For {artist} we have {len(artist_files)} files.")

    artist_words = []

    for f_name in artist_files : 
        with open("lyrics/" + artist + "/" + f_name) as infile : 
            artist_words.extend(words(infile.read()))

            
    print(f"For {artist} we have roughly {len(artist_words)} words, {len(set(artist_words))} are unique.")
