# Web Data Scraping

[Spring 2019 ITSS Mini-Course](https://www.colorado.edu/cartss/programs/interdisciplinary-training-social-sciences-itss/mini-course-web-data-scraping) — ARSC 5040  
[Brian C. Keegan, Ph.D.](http://brianckeegan.com/)  
[Assistant Professor, Department of Information Science](https://www.colorado.edu/cmci/people/information-science/brian-c-keegan)  
University of Colorado Boulder  

Copyright and distributed under an [MIT License](https://opensource.org/licenses/MIT)

## Class outline

* **Week 1**: Introduction to Jupyter, browser console, structured data, ethical considerations
* **Week 2**: Scraping HTML with `requests` and `BeautifulSoup`
* **Week 3**: Scraping an API with `requests` and `json`, Wikipedia
* **Week 4**: Scraping web data with Selenium and using the Internet Archive API
* **Week 5**: Scraping data from Twitter

## Acknowledgements

This course will draw on resources that [Allison Morgan](https://allisonmorgan.github.io/) and I built for the [2018 Summer Institute for Computational Social Science](https://github.com/allisonmorgan/sicss_boulder), which were in turn derived from [other resources](https://github.com/simonmunzert/web-scraping-with-r-extended-edition) developed by [Simon Munzert](http://simonmunzert.github.io/) and [Chris Bail](http://www.chrisbail.net/).

Thank you also to Professors [Bhuvana Narasimhan](https://www.colorado.edu/linguistics/bhuvana-narasimhan) and [Stefanie Mollborn](https://behavioralscience.colorado.edu/person/stefanie-mollborn) for coordinating the ITSS seminars.

## Class 4 goals

* Sharing accomplishments and challenges with last week's material
* Using Selenium to interact with websites
* Implementing a screen-scraper with Selenium 
* Ethics of spoofing headers, screen scraping, and parallelizing API requests
* Using Internet Archive API to find historical web pages
* Retrieving and parsing Internet Archive pages

Start with our usual suspect packages.

In [None]:
# Lets us talk to servers on the web
import requests

# Parsing HTML magic
from bs4 import BeautifulSoup

# For data manipulation
import pandas as pd

# Will be helpful for converting between timestamps
from datetime import datetime

# We want to sleep from time-to-time to avoid overwhelming another server
import time

from urllib.parse import quote, unquote
import json

The block of code below will only work once you've installed Selenium.

In [None]:
# Our interface to a real-life web browser... won't import until you install!
import selenium.webdriver

## Installing Selenium

This is a non-trivial process: you will need to (1) install the Python bindings for Selenium, (2) download a web driver to interface with a web browser, and (3) configure Selenium to recognize your web driver. Follow the installation instructions in the documentation [here](https://selenium-python.readthedocs.io/installation.html) (you won't need the Selenium server).

1. Install the Python bindings for Selenium. Go to your Anaconda terminal window, type in this command, and agree to whatever the package manager wants to install or update.

`conda install selenium`

2. Download the driver(s) for the web browser you want to use from the [links on the Selenium documentation](https://selenium-python.readthedocs.io/installation.html). If you use a Chrome browser, download the Chrome driver. Note that the Safari driver will not work on PCs and the Edge driver will not work on Macs. 

3. You will need to unzip the file and move the executable to the same directory where you are running this notebook. Make a note of the path to this directory.

### Using Selenium to control a web browser
The `driver` object we create is a connection from this Python environment out to the browser window.

In [None]:
# Path to the Chrome driver for my PC -- yours is likely very different
# pc_path = 'E:/Dropbox/Courses/2019 Spring - ITSS Web Data Scraping/chromedriver.exe'
# driver = selenium.webdriver.Chrome(executable_path=pc_path)

# Path to the Chrome driver for my Mac -- yours is likely very different
mac_path = '/Users/briankeegan/Dropbox/Courses/2019 Spring - ITSS Web Data Scraping/chromedriver'
driver = selenium.webdriver.Chrome(executable_path=mac_path)


Creating the `driver` opens a new browser window. Your computer's security protocols may vigorously protest because you are launching a program that is controlled by another process/program. You will need to dismiss these warnings in order to proceed. Whether and how to do that will vary considerably across PCs and Macs, the kinds of permissions your account has on this operating system, and other security measures employed by your computer.

The single line of code below then requests the "xkcd" homepage in that window.

In [None]:
driver.get('https://xkcd.com')

In Classes 01 and 02, we used `BeautifulSoup` to turn HTML and XML into a data structure that we could search and access using Python-like syntax. With Selenium we use a standard called "XPath" to navigate through an HTML document: [this W3Schools tutorial](https://www.w3schools.com/xml/xpath_syntax.asp) is a good introduction to working with XPath. The syntax is different, but the intuition is similar: we can find a parent node by its attribute (class, id, *etc*.) and then navigate down the tree to its children.

The XPath below has the following elements in sequence:
* `//`: search the whole document, at any depth, for nodes that match what follows
* `*[@id="middleContainer"]`: find the element (of any type) that has a "middleContainer" id
* `/ul[2]`: select the second `<ul>` element directly underneath the `<div id="middleContainer">`
* `/li[3]`: select the third `<li>` element inside that list
* `/a`: select the `<a>` (hyperlink) element inside that list item

The combined XPath string `//*[@id="middleContainer"]/ul[2]/li[3]/a` is like a "file directory" path that (hopefully!) points to the hyperlink button that takes us to a random xkcd comic. With the directions to this button, we can have the web browser "click" the "Random" button beneath the comic.
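To see how each step of an XPath narrows the search without opening a browser, here is a minimal sketch using Python's built-in `xml.etree.ElementTree`, which supports only a limited subset of XPath. The markup is a simplified, hypothetical stand-in for the xkcd page (the real page is more complex and may change), and the `href` values are invented so the two navigation lists can be told apart.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical stand-in for the xkcd layout: one navigation
# list above the comic and an identical one below it
page = """
<html><body>
  <div id="middleContainer">
    <ul class="comicNav">
      <li><a href="/1/">First</a></li>
      <li><a href="/prev/top">Prev</a></li>
      <li><a href="/random/top">Random</a></li>
    </ul>
    <div id="comic"><img src="comic.png" title="hover text"/></div>
    <ul class="comicNav">
      <li><a href="/1/">First</a></li>
      <li><a href="/prev/bottom">Prev</a></li>
      <li><a href="/random/bottom">Random</a></li>
    </ul>
  </div>
</body></html>
""".strip()

root = ET.fromstring(page)

# ElementTree paths are relative to an element, hence the leading '.'
link = root.find('.//*[@id="middleContainer"]/ul[2]/li[3]/a')
print(link.text, link.get('href'))

# The same idea retrieves the comic's hover text from the title attribute
img = root.find('.//*[@id="comic"]/img')
print(img.get('title'))
```

Because `ul[2]` counts only `<ul>` siblings, the `<div id="comic">` sitting between the two lists does not affect the position.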

In [None]:
# Let's find the 'random' button
element = driver.find_element_by_xpath('//*[@id="middleContainer"]/ul[2]/li[3]/a')

# Once we've found it, now click it
element.click()

We can also get the attributes of different parts of the web page. xkcd is famous for its "hidden messages" inside the image alt-text, which actually lives in the `title` attribute of the comic's `<img>` element.

In [None]:
alttext_element = driver.find_element_by_xpath('//*[@id="comic"]/img')
alttext_element.get_attribute("title")

We could write a simple loop to click on the random button five times and print the alt-text from each of those pages.

In [None]:
for c in range(5):
    random_element = driver.find_element_by_xpath('//*[@id="middleContainer"]/ul[2]/li[3]/a')
    random_element.click()
    
    alttext_element = driver.find_element_by_xpath('//*[@id="comic"]/img')
    print('\n',alttext_element.get_attribute("title"))

When you're done playing with your programmable web browser, make sure to close it.

In [None]:
driver.quit()

Note that with the connection to the web browser closed, functions like `find_element_by_xpath`, `click()`, *etc*. will no longer work, so the cell below raises an error.

In [None]:
alttext_element = driver.find_element_by_xpath('//*[@id="comic"]/img')
alttext_element.get_attribute("title")

Just about any operation you do in a web browser can be automated with Selenium: scrolling, clicking, completing forms, moving between tabs/windows, handling pop-ups, navigating back and forward, handling cookies, *etc*. Learn more about the functionality with tutorials and other resources in the [Selenium documentation](https://selenium-python.readthedocs.io/navigating.html).

### Exercises

Start your driver again and get the xkcd homepage.

1. Change the XPath to click on the "Prev" button above the comic.
2. Change the XPath to search for the "comicNav" class instead of the "middleContainer" id.
3. Change the XPath to click on the "About" button in the upper-left.

## Ethical web scraping

James Densmore has a nice summary of [practices for ethical web scraping](https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01):

> * If you have a public API that provides the data I’m looking for, I’ll use it and avoid scraping all together.
> * I will always provide a User Agent string that makes my intentions clear and provides a way for you to contact me with questions or concerns.
> * I will request data at a reasonable rate. I will strive to never be confused for a DDoS attack.
> * I will only save the data I absolutely need from your page. If all I need it OpenGraph meta-data, that’s all I’ll keep.
> * I will respect any content I do keep. I’ll never pass it off as my own.
> * I will look for ways to return value to you. Maybe I can drive some (real) traffic to your site or credit you in an article or post.
> * I will respond in a timely fashion to your outreach and work with you towards a resolution.
> * I will scrape for the purpose of creating new value from the data, not to duplicate it.

Some other important components of ethical web scraping practices [include](http://robertorocha.info/on-the-ethics-of-web-scraping/):

* Read the Terms of Service and Privacy Policies for the site's rules on scraping.
* Inspect the robots.txt file for rules about what pages can be scraped, indexed, *etc*.
* Be gentle on smaller websites by running during off-peak hours and spacing out requests.
* Identify yourself by name and email in your User-Agent strings.

What does a robots.txt file look like? Here is CNN's. It helpfully provides a sitemap that points robots to other pages, allows all kinds of User-agents, and disallows crawling of pages in specific directories (ads, polls, tests).

In [None]:
print(requests.get('https://www.cnn.com/robots.txt').text)
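Rather than reading robots.txt by eye, Python's built-in `urllib.robotparser` can check a URL against the rules. The sketch below parses a small made-up rule set (not CNN's actual file) so that it runs without touching the network; the `example.com` URLs and the agent name are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration only, loosely modeled on the description above
sample_rules = """
User-agent: *
Disallow: /ads/
Disallow: /polls/
Sitemap: https://www.example.com/sitemap.xml
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)

agent = 'Python research tool'  # placeholder User-Agent name

# A page outside the disallowed directories may be fetched
allowed = parser.can_fetch(agent, 'https://www.example.com/politics/story.html')

# A page inside a disallowed directory may not
blocked = parser.can_fetch(agent, 'https://www.example.com/ads/banner.html')

print(allowed, blocked)
```

For a live site, calling `parser.set_url(...)` with the address of the robots.txt file and then `parser.read()` fetches and parses the real file instead.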

When scraping websites, it is a good idea to include your contact information as a custom User-Agent string so that the webmaster can get in contact with you.

In [None]:
contact_header = {'User-Agent':'Python research tool by Brian Keegan, brian.keegan@colorado.edu'}

request = requests.get('https://www.cnn.com',headers=contact_header)

Adverse consequences of web scraping include:
* Compromising the privacy and integrity of individual users' data
* Damaging a web server with too many requests
* Denying access to the web service to other authorized users
* Infringing on copyrighted material
* Damaging the business value of a web site

[Amanda Bee](http://velociraptor.info/) compiled [a nice set of examples](https://github.com/amandabee/scraping-for-journalists/wiki/Reporting-Examples) of data journalists using web scraping for their reporting. There are some ethical justifications for violating a site's terms of service to scrape data:
* Obtaining data for the public interest from official statements, government reports, *etc*.
* Conducting audit studies (as long as these are responsibly designed and pre-cleared)
* Collecting data that is unavailable from APIs, FOIA requests, and other reports

[Sophie Chou](http://sophiechou.com/) made this nice [decision flow-chart](http://www.storybench.org/to-scrape-or-not-to-scrape-the-technical-and-ethical-challenges-of-collecting-data-off-the-web/) of whether to build a scraper or not from a NICAR panel in 2016:

![Should you build a scraper flowchart](http://www.storybench.org/wp-content/uploads/2016/04/flowchart_final.jpeg)

Why is there a "Talk to a lawyer?" outcome at the bottom?

### Computer Fraud and Abuse Act

The [Computer Fraud and Abuse Act](https://en.wikipedia.org/wiki/Computer_Fraud_and_Abuse_Act) was enacted in 1986, building on a computer fraud law passed in 1984 [in large part due to](https://www.cnet.com/news/from-wargames-to-aaron-swartz-how-u-s-anti-hacking-law-went-astray/) the 1983 film [WarGames](https://en.wikipedia.org/wiki/WarGames) starring Matthew Broderick. A plain reading of the text of the law ([18 U.S.C. § 1030](https://www.law.cornell.edu/uscode/text/18/1030)) criminalizes just about any form of web scraping:

> * Whoever intentionally accesses a computer without authorization or exceeds authorized access, and thereby obtains… information from any protected computer;
> * knowingly causes the transmission of a program, information, code, or command, and as a result of such conduct, intentionally causes damage without authorization, to a protected computer;
> * the term “exceeds authorized access” means to access a computer with authorization and to use such access to obtain or alter information in the computer that the accesser is not entitled so to obtain or alter;
> * the term “damage” means any impairment to the integrity or availability of data, a program, a system, or information;
> * the term “protected computer” means a computer which is used in or affecting interstate or foreign commerce or communication, including a computer located outside the United States that is used in a manner that affects interstate or foreign commerce or communication of the United States;

Violators can be fined and jailed under a misdemeanor charge for up to 1 year for the first violation and jailed up to 10 years under a felony charge for repeated violations.

This law has a [chilling effect](https://en.wikipedia.org/wiki/Chilling_effect) on many forms of research, journalism, and other forms of protected speech. The CFAA has been used by federal prosecutors to bring federal felony charges against programmers, journalists, and activists. In 2011, programmer and hacktivist [Aaron Swartz](https://en.wikipedia.org/wiki/Aaron_Swartz) (who contributed to the development of RSS, Markdown, Creative Commons, and Reddit) was [arrested and charged](https://en.wikipedia.org/wiki/United_States_v._Swartz) with violating the CFAA for downloading several million PDFs from JSTOR over MIT's network. The [decision to prosecute was unusual](https://www.huffingtonpost.com/2013/03/13/aaron-swartz-prosecutorial-misconduct_n_2867529.html). Facing 35 years of imprisonment and over $1 million in fines under the CFAA, Swartz committed suicide on January 11, 2013.

In 2016, four computer science researchers and the publisher of *The Intercept*, who all use scraping techniques to run experiments to measure bias and discrimination in web content, [filed suit with the ACLU](https://www.aclu.org/cases/sandvig-v-sessions-challenge-cfaa-prohibition-uncovering-racial-discrimination-online) against the U.S. Government: *Sandvig v. Sessions*. Their research involves creating multiple fake accounts, providing inaccurate information to websites, using automated tools to record publicly-available data, and other scraping techniques. In March 2018, the [U.S. District Court for the District of Columbia ruled](https://www.aclu.org/news/judge-allows-aclu-case-challenging-law-preventing-studies-big-data-discrimination-proceed) that two of the plaintiffs have standing to sue and the case is currently being prepared for trial.

### Warning

The code we will write and execute below will repeatedly violate Twitter's [Terms of Service](https://twitter.com/en/tos) ("scraping the Services without the prior consent of Twitter is expressly prohibited") for retrieving information from the platform. In effect, we will transmit code in excess of our authorized access and potentially cause damage, in order to obtain information from a protected computer. 

We will do this in order to obtain public statements made by government officials acting in their official capacity because this data is otherwise unavailable for retrieval from Twitter. There is an interesting body of emerging legal precedent treating elected officials' use of Twitter as a public forum: [*Knight First Amendment Institute v. Trump*](https://en.wikipedia.org/wiki/Knight_First_Amendment_Institute_v._Trump) established that [the President may not block other Twitter users](https://www.courtlistener.com/docket/6087955/72/knight-first-amendment-institute-at-columbia-university-v-trump/):

> * "We hold that portions of the @realDonaldTrump account -- the “interactive space” where Twitter users may directly engage with the content of the President’s tweets -- are properly analyzed under the “public forum” doctrines set forth by the Supreme Court, that such space is a designated public forum..."
> * "we nonetheless conclude that the extent to which the President and Scavino can, and do, exercise control over aspects of the @realDonaldTrump account are sufficient to establish the government-control element as to the content of the tweets sent by the @realDonaldTrump account, the timeline compiling those tweets, and the interactive space associated with each of those tweets."
> * "Because a Twitter user lacks control over the comment thread beyond the control exercised over first-order replies through blocking, the comment threads -- as distinguished from the content of tweets sent by @realDonaldTrump, the @realDonaldTrump timeline, and the interactive space associated with each tweet -- do not meet the threshold criterion for being a forum."
> * "the account’s timeline, which “displays all tweets generated by the [account]”... all of which is government speech."

I would advise you against using these tools and approaches without a similarly clear public interest rationale and jurisprudence linking behavior to public forum doctrines.

## Screen-scraping a Twitter ego network with Selenium

I am adapting a [tutorial by Shawn Wang](https://dev.to/swyx/scraping-my-twitter-social-graph-with-python-and-selenium--hn8) on scraping a Twitter graph with Python and Selenium.

In [None]:
# Path to the Chrome driver for my PC -- yours is likely very different
# driver = selenium.webdriver.Chrome(executable_path='E:/Dropbox/Courses/2019 Spring - ITSS Web Data Scraping/chromedriver.exe')

# Path to the Chrome driver for my Mac -- yours is likely very different
driver = selenium.webdriver.Chrome(executable_path='/Users/briankeegan/Dropbox/Courses/2019 Spring - ITSS Web Data Scraping/chromedriver')

driver.get('https://www.twitter.com')

Manually log in to your Twitter account through the driver page.

Then go to the "followings" (or followees, also called "friends" in the Twitter API) of an account. 

In [None]:
driver.get('https://twitter.com/realDonaldTrump/following')

At the time of this Notebook's writing, the "realDonaldTrump" account followed 45 other accounts. Depending on the resolution of your display, size of the window, *etc*. there may only be 10–20 accounts visible. We can scroll to see the rest of these accounts programmatically.

Run this cell a few times to keep scrolling to the bottom.

In [None]:
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')

Pass the HTML of the web page in the browser back to Python and turn it into soup.

In [None]:
raw = driver.page_source.encode('utf-8')

soup = BeautifulSoup(raw)

The information about each followed account lives inside a `<div data-item-type="user">` element.

In [None]:
user_divs = soup.body.find_all('div', attrs={'data-item-type':'user'})

The first `user_div` (at the time of this writing) was for the `@VP` account.

Where does the Twitter account handle live in the HTML document?

In [None]:
user_divs[0].div['data-screen-name']

Where does the name of the Twitter account live in the HTML document?

In [None]:
user_divs[0].find_all('a',{'class':'fullname'})[0].text.strip()

Where does the bio live in the HTML document?

In [None]:
user_divs[0].p.text

Put all the pieces together now: loop through each `user_div` and pull out the relevant information to store as a list of dictionaries.

In [None]:
# Create an empty list to store the followings data
following_graph_alters = []

# Loop through each user_div
for ud in user_divs:
    
    # Create an empty alter dictionary to fill with the handle, name, bio for each user
    alter = {}
    
    # Get the formal account handle
    alter['Screen Name'] = ud.div['data-screen-name']
    
    # Get the displayed name
    alter['Display Name'] = ud.find_all('a',{'class':'fullname'})[0].text.strip()
    
    # Get the biography
    alter['Bio'] = ud.p.text
    
    # Add the alter to the list
    following_graph_alters.append(alter)
    
# Turn the list into a DataFrame
following_graph_df = pd.DataFrame(following_graph_alters)

# Inspect the DataFrame
following_graph_df

## Screen-scraping a Twitter account's timeline of tweets

Go to the Twitter account for the White House.

In [None]:
driver.get('https://twitter.com/WhiteHouse')

Load the source of the page from the browser and soup-ify it.

In [None]:
raw = driver.page_source.encode('utf-8')

soup = BeautifulSoup(raw)

Count the number of tweet objects currently on the screen.

In [None]:
tweets = soup.find_all('div',{'class':'original-tweet'})
len(tweets)

Now scroll to the bottom of the page, get the source again, parse out the tweets, and count them. With the browser, window size, and resolution I'm using, a single scroll loaded 20 more tweets.

In [None]:
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

In [None]:
raw = driver.page_source.encode('utf-8')
soup = BeautifulSoup(raw)
tweets = soup.find_all('div',{'class':'original-tweet'})
len(tweets)

In [None]:
# Screen name
tweets[0]['data-screen-name']

In [None]:
# Display Name
tweets[0]['data-name']

In [None]:
# Tweet ID
tweets[0]['data-tweet-id']

In [None]:
# Timestamp
tweets[0].find('a',{'class':'tweet-timestamp'}).span['data-time']

This is a UNIX timestamp: the number of seconds since the "UNIX epoch" of midnight (UTC) on January 1, 1970. Because it is a common way to store dates and times, `datetime` provides a way to convert it into a meaningful timestamp: `utcfromtimestamp`.

In [None]:
print(datetime.utcfromtimestamp(1549143559))
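`utcfromtimestamp` returns a "naive" datetime with no time zone attached, and it is deprecated as of Python 3.12. A minimal alternative, assuming you want the result explicitly labeled as UTC, is `datetime.fromtimestamp` with a `tz` argument:

```python
from datetime import datetime, timezone

# Same UNIX timestamp as above, converted to a timezone-aware UTC datetime
aware = datetime.fromtimestamp(1549143559, tz=timezone.utc)
print(aware.isoformat())
```

The date and time are the same as before; the only difference is the `+00:00` offset marking the value as UTC.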

In [None]:
# Text of the tweet
tweets[0].find('div',{'class':'js-tweet-text-container'}).text.strip()

In [None]:
# Replies
tweets[0].find_all('span',{'class':'ProfileTweet-actionCount'})[0]['data-tweet-stat-count']

In [None]:
# Retweets
tweets[0].find_all('span',{'class':'ProfileTweet-actionCount'})[1]['data-tweet-stat-count']

In [None]:
# Favorites
tweets[0].find_all('span',{'class':'ProfileTweet-actionCount'})[2]['data-tweet-stat-count']

Write a function that extracts all this information from each tweet.

In [None]:
def tweet_timeline_parser(tweet):
    payload = {}
    
    payload['Screen name'] = tweet['data-screen-name']
    payload["Display name"] = tweet['data-name']
    payload['TweetID'] = tweet['data-tweet-id']
    payload['Timestamp'] = tweet.find('a',{'class':'tweet-timestamp'}).span['data-time']
    payload['Text'] = tweet.find('div',{'class':'js-tweet-text-container'}).text.strip()
    # The reply, retweet, and favorite counts share one class and appear in that order
    counts = tweet.find_all('span',{'class':'ProfileTweet-actionCount'})
    payload['Replies'] = int(counts[0]['data-tweet-stat-count'])
    payload['Retweets'] = int(counts[1]['data-tweet-stat-count'])
    payload['Favorites'] = int(counts[2]['data-tweet-stat-count'])
    
    return payload

Loop through all the tweets.

In [None]:
parsed_tweets = []

for tweet in tweets:
    parsed_tweet = tweet_timeline_parser(tweet)
    parsed_tweets.append(parsed_tweet)

Look at the first three tweets that we parsed from the HTML.

In [None]:
parsed_tweets[:3]

The profile page also reports the total number of tweets on the account.

In [None]:
tweet_count = soup.find('a',{'class':'ProfileNav-stat'}).find_all('span')[-1]['data-count']
tweet_count

Doing some rough math, 6,696 tweets divided by 20 tweets per scroll means we need to do about 335 scrolls to get the whole timeline.

In [None]:
int(tweet_count)/20
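Plain division leaves a fractional number of scrolls; rounding up with `math.ceil` gives the whole number quoted above. This sketch hard-codes the tweet count reported at the time of writing and assumes the 20-tweets-per-scroll rate observed earlier, which may differ on your machine.

```python
import math

tweet_count = '6696'     # value scraped from the profile page at the time of writing
tweets_per_scroll = 20   # observed earlier; depends on browser and window size

# Round up: a final partial batch of tweets still needs its own scroll
scrolls_needed = math.ceil(int(tweet_count) / tweets_per_scroll)
print(scrolls_needed)
```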

Let's start with 10 scrolls and see whether things are still working.

In [None]:
for scroll in range(10):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)

Pull in the data and parse out the tweets.

In [None]:
# Get the data from the browser
raw = driver.page_source.encode('utf-8')

# Soup-ify
soup = BeautifulSoup(raw)

# Find all the tweets
tweets = soup.find_all('div',{'class':'original-tweet'})

# Create the container
parsed_tweets = []

# Try to parse the tweets
for tweet in tweets:
    parsed_tweet = tweet_timeline_parser(tweet)
    parsed_tweets.append(parsed_tweet)

How many tweets were parsed out after 10 scrolls?

In [None]:
len(parsed_tweets)

We could try to scroll until we can't anymore, logging how many scrolls it took.

In [None]:
# Go to the webpage
driver.get('https://twitter.com/WhiteHouse')

# Initialize the scroll counter and current page height
scroll_counter = 0
last_height = driver.execute_script('return document.body.scrollHeight')

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

# Start a loop
while True:
    
    # Print out our progress every 10th scroll
    if scroll_counter > 0 and scroll_counter % 10 == 0:
        print("This is scroll: {0}".format(scroll_counter))
    
    # Sleep for 2 seconds in between scrolls
    time.sleep(2)
    
    # Get the current height of the page
    current_height = driver.execute_script('return document.body.scrollHeight')
    
    # If the current page height is the same as the previous page height, we can't scroll anymore
    if current_height == last_height:
        break
    
    # Scroll to the bottom again
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    
    # Increment our scroll counter
    scroll_counter += 1
    
    # Update the height of the page
    last_height = current_height

We can still parse the data.

In [None]:
# Get the data from the browser
raw = driver.page_source.encode('utf-8')

# Soup-ify
soup = BeautifulSoup(raw)

# Find all the tweets
tweets = soup.find_all('div',{'class':'original-tweet'})

# Create the container
parsed_tweets = []

# Try to parse the tweets
for tweet in tweets:
    parsed_tweet = tweet_timeline_parser(tweet)
    parsed_tweets.append(parsed_tweet)
    
print(len(parsed_tweets))

We can turn our `parsed_tweets` into a DataFrame for saving to CSV or doing visualization, *etc*.

In [None]:
pd.DataFrame(parsed_tweets).head()

Scrolling through a timeline and parsing whatever tweets Twitter serves up, until it stops serving them, still runs afoul of the language in Twitter's Terms of Service. As a model of data scraping, however, it rests on a good-faith assumption: if we are willing to interface like a human user, then we also stop collecting data at the point where Twitter limits a human user.

But we really want all 6,700 of those White House tweets, not just the most recent 831. Twitter's [get statuses/user_timeline](https://developer.twitter.com/en/docs/tweets/timelines/api-reference/get-statuses-user_timeline.html) API will only return up to 3,200 of a Twitter user's most recent tweets. While we could get the `@WhiteHouse` account's most recent 3,200 tweets, we would not be able to retrieve the first 3,500 tweets. We are going to use Twitter's search functionality to get these first `@WhiteHouse` tweets instead. 

This *significantly* escalates the burden of proof on the researcher to demonstrate that this violation of Twitter's Terms of Service is ethical. In this specific case, I will argue that violating Twitter's Terms of Service can be justified by the greater importance of being able to build an archive of public statements by government officials that are not otherwise available. However, using this approach to scrape private users' timelines would raise significant ethical concerns about violating a platform's Terms of Service and users' reasonable expectations about privacy and the availability of their data.

### Screen scraping

We can use Twitter's search functionality to find all the tweets an account sent since and/or until a given date, then scroll to get all the data. In practice, you can only get up to approximately 9,999 tweets with this approach.

First, we'll make a `query_params` dictionary with the name of the account, a start date, and a stop date. Here we will only collect the `@WhiteHouse` tweets from the first year of the Trump administration.

In [None]:
# Make the query params
query_params = {}
query_params['from'] = 'WhiteHouse'
query_params['since'] = '2017-01-20'
query_params['until'] = '2018-01-20'

# Pass the params into a string and quote to format it properly
query_params_quoted = quote("from:{from} since:{since} until:{until}".format(**query_params))

# Add the quoted query params into the URL
query_url = "https://twitter.com/search?f=tweets&q={0}&src=typd".format(query_params_quoted)
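It can help to see what `quote` actually does to the query string. This small check reuses the same parameters and shows that characters with special meaning in URLs (here, colons and spaces) are percent-encoded, and that `unquote` reverses the process.

```python
from urllib.parse import quote, unquote

query = "from:WhiteHouse since:2017-01-20 until:2018-01-20"

# Colons become %3A and spaces become %20; letters, digits, and hyphens pass through
quoted = quote(query)
print(quoted)

# unquote recovers the original query string
assert unquote(quoted) == query
```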

Then we load the page and scroll to the bottom, preserving the logging functionality to keep track of how far along we are and when problems occur.

In [None]:
# Load the web page from the URL
driver.get(query_url)

# Repeat the scrolling until the end
scroll_counter = 0
last_height = driver.execute_script('return document.body.scrollHeight')

driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

while True:
    if scroll_counter > 0 and scroll_counter % 10 == 0:
        print("This is scroll: {0}".format(scroll_counter))
    
    time.sleep(2)
    
    current_height = driver.execute_script('return document.body.scrollHeight')
    
    if current_height == last_height:
        break
    
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    
    scroll_counter += 1
    last_height = current_height
    
print("It took {:,} scrolls to reach the end.".format(scroll_counter))
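Since the scroll-until-the-page-stops-growing loop has now appeared twice, one option is to package it as a function. The sketch below is an assumption about how that might look, not part of the original tutorial: `scroll_until_stable`, its parameters, and the `FakePage` stand-in are all hypothetical names. It takes two callables so the logic can be exercised without a browser, and it counts every scroll, including the final one that loads nothing, so its total differs slightly from the counter above.

```python
import time

def scroll_until_stable(get_height, scroll_down, pause=2.0, max_scrolls=1000):
    """Scroll until the page height stops growing; return the number of scrolls."""
    scrolls = 0
    last_height = get_height()

    while scrolls < max_scrolls:
        scroll_down()
        scrolls += 1

        # Give the page time to load more content
        time.sleep(pause)

        # If the height did not change, there is nothing more to load
        current_height = get_height()
        if current_height == last_height:
            break
        last_height = current_height

    return scrolls

# Hypothetical stand-in for a browser page that loads new content three times
class FakePage:
    def __init__(self, loads_remaining):
        self.height = 100
        self.loads_remaining = loads_remaining

    def scroll(self):
        if self.loads_remaining > 0:
            self.height += 100
            self.loads_remaining -= 1

page = FakePage(loads_remaining=3)
total_scrolls = scroll_until_stable(lambda: page.height, page.scroll, pause=0)
print(total_scrolls)
```

With a live driver, the two callables would wrap `driver.execute_script('return document.body.scrollHeight')` and `driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')`. The `max_scrolls` cap is a design choice that keeps a runaway loop from hammering the server indefinitely.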

Once we've finished scrolling to load all the data, we can parse it. This may take a while; the `raw` HTML code after all that scrolling is close to 20 million characters long.

In [None]:
# Get the source after scrolling and soup-ify
raw = driver.page_source.encode('utf-8')
print("There are {0:,} characters in the raw HTML.".format(len(raw)))

soup = BeautifulSoup(raw)

# Find all the tweets
tweets = soup.find_all('div',{'class':'original-tweet'})

# Create the container
parsed_tweets = []

# Try to parse the tweets
for tweet in tweets:
    parsed_tweet = tweet_timeline_parser(tweet)
    parsed_tweets.append(parsed_tweet)
    
print("There are {0:,} parsed tweets.".format(len(parsed_tweets)))

Quit the driver, which closes the window.

In [None]:
driver.quit()

Turn the tweets into a DataFrame for analysis.

In [None]:
historical_tweets_df = pd.DataFrame(parsed_tweets)

# Replace the UNIX timestamp with a more usable datetime
historical_tweets_df['Timestamp'] = historical_tweets_df['Timestamp'].apply(lambda x:datetime.utcfromtimestamp(int(x)))

# Inspect
historical_tweets_df.head()


Make a basic scatterplot of the relationship between replies and retweets. By ocular inspection, there is a pretty strong correlation between the number of retweets and replies.

In [None]:
ax = historical_tweets_df.plot(x='Replies',y='Retweets',kind='scatter',logx=True,logy=True,s=5)
ax.set_xlim((1e1,1e5))
ax.set_ylim((1e1,1e5));

### Spoofing headers

When we use `requests` to get data from other web servers, each of the get requests carries some meta-data about ourselves, called [headers](https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html). These headers tell the server what kind of web browser we are, what kinds of data we can receive, *etc*. so that the server can reply with properly-formatted information. 

But it is also possible for the server to understand a request and refuse to fulfill it, known as an [HTTP 403 error](https://en.wikipedia.org/wiki/HTTP_403). A server's refusal to fulfill a client's request can often be traced back to the identity a client presents through its headers or a client lacking authorization to access the data (*i.e.*, you need to authenticate with the website first). In the case of `requests`, its `get` request includes default header information that identifies it as a Python script rather than a human-driven web browser.

Let's make a request for an article from the NYTimes.

In [None]:
honest_response = requests.get('https://www.nytimes.com/2019/02/03/us/politics/trump-interview-mueller.html')

We can see the headers we sent with this request.

In [None]:
honest_response.request.headers

Specifically, the 'User-Agent' string identifies this request as originating from the "python-requests/2.21.0" program, rather than a typical web browser. Some web servers are configured to inspect the headers of incoming requests and refuse any that do not appear to come from an actual web browser.

We can often circumvent these filters by sending alternative headers that claim to be from a web browser as a part of our `requests.get()`.

In [None]:
# Make a dictionary with spoofed headers for the User-Agent
spoofed_headers = {'User-Agent':"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"}

# Make the request with the spoofed headers
nytimes_url = 'https://www.nytimes.com/2019/02/03/us/politics/trump-interview-mueller.html'
spoofed_response = requests.get(nytimes_url,headers=spoofed_headers)

Sure enough, the get request we sent to the NYTimes web server now includes the spoofed "User-Agent" string we wrote that claims our request is from a web browser. The server should now return the data we requested, even though we are not who we claimed to be.

In [None]:
spoofed_response.request.headers

I had trouble finding a website that automatically refused "python-requests" connections (I tried Amazon, the NYTimes, *etc*.), but you will likely find some along the way.

Spoofing headers to conceal the identity of your client to a web server is another example of how technological capabilities can overtake ethical responsibilities. The owners of a web server may have good reasons for refusing to serve content to non-web browsers (copyright, privacy, business model, *etc*.). Misrepresenting your identity to extract this data should only be done if the risks to others are small, the benefits are in the public interest, there are no other alternatives for obtaining the data, *etc*. 

There can be *very* real consequences for spoofing headers. Because it is such a common and relatively trivial method for circumventing server security settings, making repeated spoofed requests could result in your IP address or an IP address range (worst case, the entire university) being blocked from making requests to the server.

### Parallelizing requests

A third web scraping practice that warrants ethical scrutiny is parallelization. In the example of getting historical `@WhiteHouse` tweets, we launched a single browser window and "scrolled" until we reached the end, a process that took on the order of a minute.

However, we *could* launch multiple scripts that each create a browser window and collect a different segment of the data in parallel, then combine the results at the end. In an API context, we *could* create multiple applications and design our requests so that they all work simultaneously to get the data.

Each request imposes some cost on the server to receive, process, and return the requested data: making these requests in parallel increases the convenience and efficiency for the data scraper, but also dramatically increases the strain on the server to fulfill other clients' requests. In fact, highly-parallelized and synchronized requests can look like [denial-of-service attacks](https://en.wikipedia.org/wiki/Denial-of-service_attack) and may get your requests far more scrutiny and blowback than patiently waiting for your data to arrive in series. The ethical justifications for employing highly-parallelized scraping approaches are thin: documenting a rapidly-unfolding event before the data disappears, for example.
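
The patient alternative, making requests in series with a pause between them, can be sketched as follows. This is a minimal illustration rather than a recommended library: `fetch_in_series` and `fake_fetch` are hypothetical names, and the one-second default delay is an arbitrary assumption, not a rule any particular server publishes.

```python
import time

def fetch_in_series(urls, fetch, delay=1.0):
    """Call fetch(url) for each URL in order, pausing between requests."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # give the server a break between requests
        results[url] = fetch(url)
    return results

# Stand-in for a real download step such as `lambda u: requests.get(u).text`
def fake_fetch(url):
    return 'contents of ' + url

pages = fetch_in_series(['page-1', 'page-2', 'page-3'], fake_fetch, delay=0.1)
print(len(pages))
```

Passing the download step in as `fetch` keeps the pacing logic separate from whatever library actually makes the request.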

## Scraping the Internet Archive's Wayback Machine

Now we'll leave some of the ethically-fraught methods of web scraping behind. The Internet Archive maintains the "[Wayback Machine](https://www.archive.org/web/)" where old versions of websites are stored. Some of my favorites:

* [CNN in August 2000](https://web.archive.org/web/20000815052826/http://www.cnn.com/)
* [Facebook in August 2004](https://web.archive.org/web/20040817020419/http://www.facebook.com/)
* [Apple in April 1997](https://web.archive.org/web/19970404064444/http://www.apple.com:80/)

In the URLs above, the long numeric identifier is the timestamp of when the snapshot of the website was captured. How do we know when the Wayback Machine archived a webpage? There's a free and open API!
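
The numeric identifier in a Wayback Machine URL can be decoded without any API at all. A small sketch, assuming the identifier is always the 14-digit `YYYYMMDDhhmmss` string that follows `/web/` (the helper name `wayback_timestamp` is made up for this example):

```python
import re
from datetime import datetime

def wayback_timestamp(wayback_url):
    """Pull the 14-digit capture timestamp out of a Wayback Machine URL."""
    match = re.search(r'/web/(\d{14})', wayback_url)
    if match is None:
        return None
    return datetime.strptime(match.group(1), '%Y%m%d%H%M%S')

fb_2004 = 'https://web.archive.org/web/20040817020419/http://www.facebook.com/'
print(wayback_timestamp(fb_2004))
```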

### Using the Wayback Machine API

The simplest API request we can make asks for the most recent snapshot of a webpage archived by the Wayback Machine.

In [None]:
wb_url = 'http://archive.org/wayback/available?url=facebook.com'

wb_response = requests.get(wb_url)

wb_response.json()

This response tells us the timestamp and location of this snapshot, which we could then go retrieve and parse.

In [None]:
wb_response_json = wb_response.json()

recent_fb_wb_url = wb_response_json['archived_snapshots']['closest']['url']

recent_fb_wb_response = requests.get(recent_fb_wb_url)

Get the raw text out, soupify, and look for links. For some reason all the links in this snapshot are in German.

In [None]:
recent_fb_wb_raw = recent_fb_wb_response.text

recent_fb_wb_soup = BeautifulSoup(recent_fb_wb_raw)

[link.text for link in recent_fb_wb_soup.find_all('a')]

We can also ask for the snapshot of a webpage closest to a specific date. Let's ask the Wayback Machine for a snapshot of Facebook around February 1, 2008.

In [None]:
wb_url = 'http://archive.org/wayback/available?url=facebook.com&timestamp=20080201'

wb_response = requests.get(wb_url)

wb_response_json = wb_response.json()

wb_response_json

Note that this is a relatively deep JSON object we have to navigate into to access information like the Wayback URL or the timestamp of the snapshot. The closest snapshot to February 1, 2008 was January 30, 2008. We use the [`datetime.strptime`](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior) function to turn this numeric string that we recognize as a timestamp into a datetime object.

In [None]:
print(datetime.strptime(wb_response_json['archived_snapshots']['closest']['timestamp'],
                        '%Y%m%d%H%M%S'))

As before, we could scrape out the links on this 2008 version of the page.

In [None]:
# Find the old URL
fb_wb_url = wb_response_json['archived_snapshots']['closest']['url']

# Go get the archived snapshot from the Wayback Machine
fb_wb_response = requests.get(fb_wb_url)

# Get the text from the response
fb_wb_raw = fb_wb_response.text

# Soup-ify
fb_wb_soup = BeautifulSoup(fb_wb_raw)

# Make a list of the text of the links
[link.text for link in fb_wb_soup.find_all('a')]

We could likewise launch this link to view the page in Selenium.

In [None]:
driver = selenium.webdriver.Chrome(executable_path='/Users/briankeegan/Dropbox/Courses/2019 Spring - ITSS Web Data Scraping/chromedriver')

driver.get(fb_wb_url)

In [None]:
driver.quit()

### Scraping historical web pages

A current project I am working on is exploring how social media platforms' terms of service have evolved over time. Let's start with Facebook's terms of service and privacy policy.

In [None]:
fb_tos = 'http://www.facebook.com/terms.php'
fb_pp = 'http://www.facebook.com/policy.php'

We will take advantage of the [`date_range`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html) function in `pandas` to generate a range of dates between January 2005 and January 2019.

In [None]:
dates_list = pd.date_range(start='2005-01-01',end='2019-01-01',freq='M')
dates_list

We'll use [`datetime.strftime`](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior) (the inverse of `strptime`) to make these date objects into specifically-formatted strings that we can format into a URL.

In [None]:
# Take the first datetime object and turn it into a string
datetime.strftime(dates_list[0],'%Y%m%d')
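
To see that `strftime` and `strptime` really are inverses, here is a quick round trip on a made-up timestamp string (the value is illustrative, not one returned by the API):

```python
from datetime import datetime

stamp = '20080130123456'  # made-up 14-digit timestamp in the Wayback Machine's format

# strptime: string -> datetime
parsed = datetime.strptime(stamp, '%Y%m%d%H%M%S')

# strftime: datetime -> string, using the same format codes
round_tripped = parsed.strftime('%Y%m%d%H%M%S')

print(parsed)
print(round_tripped == stamp)

# A shorter format string keeps only the date portion
print(parsed.strftime('%Y%m%d'))
```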

Use string formatting to put the `fb_tos` URL and formatted timestamp into a request to the Wayback Machine.

In [None]:
date_str = datetime.strftime(dates_list[0],'%Y%m%d')

wb_api_url = 'https://archive.org/wayback/available?url={0}&timestamp={1}'
wb_api_url_formatted = wb_api_url.format(fb_tos,date_str)

print(wb_api_url_formatted)
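
Plain string formatting works here, but the URL we are asking about contains characters (`:` and `/`) that also have meaning inside a query string. A more defensive sketch percent-encodes the parameters with the standard library; `build_wayback_query` is a hypothetical helper name, not part of any Wayback Machine client.

```python
from urllib.parse import urlencode

def build_wayback_query(target_url, timestamp):
    """Build an availability-API URL with percent-encoded query parameters."""
    base = 'https://archive.org/wayback/available'
    return base + '?' + urlencode({'url': target_url, 'timestamp': timestamp})

print(build_wayback_query('http://www.facebook.com/terms.php', '20050131'))
```

With `requests`, passing the same dictionary as the `params` argument of `requests.get()` does this encoding for you.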

Make the request to the Wayback Machine to get the URL and timestamp of the Wayback Machine's snapshot of Facebook's Terms of Service closest to January 31, 2005.

In [None]:
wb_api_response = requests.get(wb_api_url_formatted)

wb_api_response.json()

Parse the markup of this old version.

In [None]:
# Find the old URL
wb_fb_old_url = wb_api_response.json()['archived_snapshots']['closest']['url']

# Go get the archived snapshot from the Wayback Machine
wb_fb_raw = requests.get(wb_fb_old_url).text

# Soup-ify
wb_fb_soup = BeautifulSoup(wb_fb_raw)

# Find the content element and get the text out
wb_fb_terms_str = wb_fb_soup.find('div',{'id':'content'}).text.strip()

# Inspect
wb_fb_terms_str

We could use a really dumb tokenizer, [`.split()`](https://docs.python.org/3.7/library/stdtypes.html#str.split), to count the number of words in these terms.

In [None]:
len(wb_fb_terms_str.split())
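
A quick illustration of what `.split()` does and does not do, using a made-up sentence rather than the actual terms: it breaks on any run of whitespace, but leaves punctuation attached to the neighboring word.

```python
sample = "You agree to these Terms; by using the Site, you consent."

# With no arguments, split() breaks on any run of whitespace
tokens = sample.split()
print(tokens)
print(len(tokens))

# Repeated spaces and newlines collapse into a single separator
messy_tokens = "one  two\n three".split()
print(messy_tokens)
```

For a rough word count this is good enough, since punctuation changes what the tokens look like but not how many there are.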

Write a function that loops over the months in a date range and finds the closest snapshot of a URL, such as Facebook's ToS, for each one.

In [None]:
def get_urls(url_str,start_date='2005-01-01',end_date='2019-01-01',freq='M'):
    
    # Make the list of dates
    date_l = pd.date_range(start_date,end_date,freq=freq)
    
    # Create an empty container to store our data
    urls = dict()

    # For each date in the list of dates
    for date in date_l:
        
        # Turn the date object back into a string
        date_str = datetime.strftime(date,'%Y%m%d%H%M%S')
        
        # Define the API URL request to the Wayback machine
        wb_api_url = 'http://archive.org/wayback/available?url={0}&timestamp={1}'
        
        # Format the API URL with the URL of the website and the closest datetime
        wb_api_request = wb_api_url.format(url_str,date_str)
        
        # Make the request
        r = requests.get(wb_api_request).json()

        # Check if the returned request has all the right parts (this is probably overkill)
        if 'archived_snapshots' in r.keys():
            if 'closest' in r['archived_snapshots'].keys():
                if 'url' in r['archived_snapshots']['closest'].keys():
                    
                    # If it does have all the right parts, get the URL
                    _url = r['archived_snapshots']['closest']['url']
                    
                    # Get the timestamp
                    _timestamp = r['archived_snapshots']['closest']['timestamp']
                    
                    # Save to our URL dictionary with the timestamp of the snapshot as key, the url as value
                    urls[_timestamp] = _url
    return urls
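
The three nested `if` checks can also be written more compactly with `dict.get()`. The sketch below is one possible alternative, not a replacement you need to adopt; `extract_closest` is a hypothetical helper, and the sample payloads are made-up values shaped like the dictionaries we navigated above.

```python
def extract_closest(api_json):
    """Return (timestamp, url) for the closest snapshot, or None if absent."""
    closest = api_json.get('archived_snapshots', {}).get('closest', {})
    if 'url' in closest and 'timestamp' in closest:
        return closest['timestamp'], closest['url']
    return None

# Illustrative payloads (values are made up for this example)
found = {'archived_snapshots': {'closest': {
    'available': True,
    'status': '200',
    'timestamp': '20080130000000',
    'url': 'http://web.archive.org/web/20080130000000/http://www.facebook.com/'}}}
missing = {'archived_snapshots': {}}

print(extract_closest(found))
print(extract_closest(missing))
```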

Run our function to make a dictionary keyed by snapshot timestamp whose values are the Wayback Machine URLs for each month's version of the terms of service. We'll write a loop to get the Terms for each snapshot and count the words. 

This will take a few minutes. 

I've converted the code block into a "Raw" cell to prevent accidental execution. You can always turn it into a "Code" cell if you really want to run it.
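
A loop of this kind might look roughly like the sketch below. It is an illustration, not the original cell: `count_words_by_snapshot` is a hypothetical name, and `get_terms_text` stands in for the download-and-parse steps from the cell that built `wb_fb_terms_str` (request the snapshot, soup-ify it, and pull the text out of the content element). The one-second pause is an arbitrary courtesy delay.

```python
import time

def count_words_by_snapshot(urls, get_terms_text, delay=1.0):
    """Map each snapshot timestamp to the word count of its terms text."""
    wordcounts = {}
    for timestamp, url in sorted(urls.items()):
        try:
            wordcounts[timestamp] = len(get_terms_text(url).split())
        except (AttributeError, OSError):
            # Skip snapshots that fail to download or lack the expected element
            pass
        time.sleep(delay)  # pause between requests to go easy on the archive
    return wordcounts

# Fake pages so the sketch runs without touching the network
fake_pages = {'url-a': ' one two three ', 'url-b': None}

def fake_get_terms_text(url):
    # Calling .strip() on None raises AttributeError, like a missing element would
    return fake_pages[url].strip()

demo_urls = {'20050131000000': 'url-a', '20050228000000': 'url-b'}
demo_counts = count_words_by_snapshot(demo_urls, fake_get_terms_text, delay=0)
print(demo_counts)
```

The resulting dictionary could then be written to disk with `json.dump` so the requests only have to be made once.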

To avoid having everyone hit the Internet Archive server with the same requests, you can also load this file with the same data.

In [None]:
with open('facebook_tos_archive.json','r') as f:
    fb_terms_wordcount2 = json.load(f)

Visualize the changes in the size of Facebook's Terms of Service over time.

In [None]:
# Turn the dictionary into a pandas Series
fb_terms_s = pd.Series(fb_terms_wordcount2)

# Convert the index to datetime objects
fb_terms_s.index = pd.to_datetime(fb_terms_s.index)

# Plot
ax = fb_terms_s.plot()

# Make the x-tick labels less weird
ax.set_xticklabels(range(2004,2019,2),rotation=0,horizontalalignment='center')

# Always label your axes
ax.set_ylabel('Word count');