# Web Data Scraping

[Spring 2021 ITSS Mini-Course](https://www.colorado.edu/cartss/programs/interdisciplinary-training-social-sciences-itss/mini-course-web-data-scraping) — ARSC 5040  
[Brian C. Keegan, Ph.D.](http://brianckeegan.com/)  
[Assistant Professor, Department of Information Science](https://www.colorado.edu/cmci/people/information-science/brian-c-keegan)  
University of Colorado Boulder  

Copyright and distributed under an [MIT License](https://opensource.org/licenses/MIT)

## Class outline

* **Week 1**: Introduction to Jupyter, browser console, structured data, ethical considerations
* **Week 2**: Scraping HTML with `requests` and `BeautifulSoup`
* **Week 3**: Scraping web data with Selenium and using the Internet Archive API
* **Week 4**: Scraping an API with `requests` and `json`, Wikipedia
* **Week 5**: Scraping data from Twitter

## Acknowledgements

This course will draw on resources built by myself and [Allison Morgan](https://allisonmorgan.github.io/) for the [2018 Summer Institute for Computational Social Science](https://github.com/allisonmorgan/sicss_boulder), which were in turn derived from [other resources](https://github.com/simonmunzert/web-scraping-with-r-extended-edition) developed by [Simon Munzert](http://simonmunzert.github.io/) and [Chris Bail](http://www.chrisbail.net/). 

Thank you also to Professor Terra McKinnish for coordinating the ITSS seminars.

## Class 3 goals

* Sharing accomplishments and challenges with last week's material
* Using Selenium to interact with websites
* Implementing a screen-scraper with Selenium 
* Ethics of spoofing headers, screen scraping, and parallelizing API requests
* Using Internet Archive API to find historical web pages
* Retrieving and parsing Internet Archive pages

Start with our usual suspect packages.

In [60]:
# Lets us talk to servers on the web
import requests

# Parsing HTML magic
from bs4 import BeautifulSoup

# For data manipulation
import pandas as pd

# Will be helpful for converting between timestamps
from datetime import datetime

# We want to sleep from time-to-time to avoid overwhelming another server
import time

# We'll need to parse some strings, so we'll write some regular expressions
import re

from urllib.parse import quote, unquote
import json

The block of code below will only work once you've installed Selenium.

In [2]:
# Our interface to a real-life web browser... won't import until you install!
import selenium.webdriver

## Installing Selenium

This is a non-trivial process: you will need to (1) install the Python bindings for Selenium, (2) download a web driver to interface with a web browser, and (3) configure Selenium to recognize your web driver. Follow the installation instructions in the documentation [here](https://selenium-python.readthedocs.io/installation.html) (you won't need the Selenium server).

1. Install the Python bindings for Selenium. Go to your Anaconda terminal window, type in this command, and agree to whatever the package manager wants to install or update.

`conda install selenium`

2. Download the driver(s) for the web browser you want to use from the [links on the Selenium documentation](https://selenium-python.readthedocs.io/installation.html). If you use a Chrome browser, download the Chrome driver. Note that the Safari driver will not work on PCs and the Edge driver will not work on Macs. 

3. You will need to unzip the file and move the executable to the same directory where you are running this notebook. Make a note of the path to this directory.

### Using Selenium to control a web browser
The `driver` object we create is a connection from this Python environment out to the browser window.

If you're on a Mac, the latest versions of OS X *really* do not like letting you run applications you've just downloaded. You'll need to dive into your system settings to fix it: https://support.apple.com/en-us/HT202491

In [5]:
# Path to the Chrome driver for my PC -- yours is likely very different
# pc_path = 'E:/Dropbox/Courses/2019 Spring - ITSS Web Data Scraping/chromedriver.exe'
# driver = selenium.webdriver.Chrome(executable_path=pc_path)

# Path to the Firefox driver (geckodriver) for my Mac -- yours is likely very different
mac_path = '/Users/briankeegan/Documents/GitHub/Web-Data-Scraping-S2021/Class 03 - Scraping with Selenium/geckodriver'
driver = selenium.webdriver.Firefox(executable_path=mac_path)


The cell above opens a new browser window; the single line of code below asks that window to request the "xkcd" homepage.

Your computer's security protocols may vigorously protest because you are launching a program that is controlled by another process/program. You will need to dismiss these warnings in order to proceed. Whether and how to do that will vary considerably across PCs and Macs, the kinds of permissions your account has on this operating system, and other security measures employed by your computer.

In [11]:
driver.get('https://xkcd.com')

In Classes 01 and 02, we used `BeautifulSoup` to turn HTML and XML into a data structure that we could search and access using Python-like syntax. With Selenium we use a standard called "XPath" to navigate through an HTML document: [this W3Schools tutorial](https://www.w3schools.com/xml/xpath_syntax.asp) covers the XPath syntax. The syntax is different, but the intuition is similar: we can find a parent node by its attribute (class, id, *etc*.) and then navigate down the tree to its children.

The XPath below has the following elements in sequence:
* `//*`: search the whole document for elements of any tag that match the condition that follows
* `[@id="middleContainer"]`: keep only the element that has a "middleContainer" id
* `/ul[2]`: select the second `<ul>` element underneath the `<div id="middleContainer">`
* `/li[3]`: select the third `<li>` element inside that list
* `/a`: select the `<a>` element inside that list item

The combined XPath string `//*[@id="middleContainer"]/ul[2]/li[3]/a` is like a "file directory" that (hopefully!) points to the hyperlink button that takes us to a random xkcd comic. With the directions to this button, we can have the web browser "click" the "Random" button beneath the comic.
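To build intuition for how the pieces of the XPath combine, here is a minimal sketch that evaluates the same path without a browser. It assumes a hypothetical, heavily simplified stand-in for the xkcd page (the `toy_page` string and its `href` values are invented for illustration). It also swaps Selenium for the limited XPath support in Python's standard-library `xml.etree.ElementTree`, which requires the path to start with a leading `.`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the xkcd page: two navigation lists
# (above and below the comic) inside the "middleContainer" div.
toy_page = """
<html><body>
  <div id="middleContainer">
    <ul class="comicNav">
      <li><a href="/1/">First</a></li>
      <li><a href="/prev-top/">Prev</a></li>
      <li><a href="/random-top/">Random</a></li>
    </ul>
    <div id="comic"><img src="comic.png" title="Hover text"/></div>
    <ul class="comicNav">
      <li><a href="/1/">First</a></li>
      <li><a href="/prev-bottom/">Prev</a></li>
      <li><a href="/random-bottom/">Random</a></li>
    </ul>
  </div>
</body></html>
""".strip()

root = ET.fromstring(toy_page)

# Any element with id="middleContainer" -> its second <ul> -> third <li> -> <a>
link = root.find(".//*[@id='middleContainer']/ul[2]/li[3]/a")

print(link.text, link.get('href'))
```

Changing `ul[2]` to `ul[1]` in this sketch selects the link in the top navigation list instead, which is a quick way to see how the positional indexes behave.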

In [7]:
# Let's find the 'random' button
element = driver.find_element_by_xpath('//*[@id="middleContainer"]/ul[2]/li[3]/a')

# Once we've found it, now click it
element.click()

We can also get the attributes of different parts of the web page. xkcd is famous for the "hidden messages" in each comic's hover text, which is stored in the image's `title` attribute.

In [8]:
alttext_element = driver.find_element_by_xpath('//*[@id="comic"]/img')
alttext_element.get_attribute("title")

'On the other hand, poor Samara -- transcoded to FLV.  No one deserves that.'

We could write a simple loop to click on the random button five times and print the alt-text from each of those pages.

In [12]:
for c in range(5):
    time.sleep(1)
    random_element = driver.find_element_by_xpath('//*[@id="middleContainer"]/ul[2]/li[3]/a')
    random_element.click()
    
    alttext_element = driver.find_element_by_xpath('//*[@id="comic"]/img')
    print('\n',alttext_element.get_attribute("title"))


 WHEN I WAS ON A BOAT I DROPPED MY PHONE CAN U LOOK FOR IT

 If a wild bun is sighted, a nice gesture of respect is to send a 'BUN ALERT' message to friends and family, with photographs documenting the bun's location and rank. If no photographs are possible, emoji may be substituted.

 President Andrew Johnson once said, "If I am to be shot at, I want Gnome Ann to be in the way of the bullet."

 There was a schism in 2007, when a sect advocating OpenOffice created a fork of Sunday.xlsx and maintained it independently for several months. The efforts to reconcile the conflicting schedules led to the reinvention, within the cells of the spreadsheet, of modern version control.

 On the other hand, as far as they know, my system is working perfectly.


When you're done playing with your programmable web browser, make sure to close it.

In [13]:
driver.quit()

Note that with the connection to the web browser closed, any of the functions like `find_element_by_xpath`, `click()`, *etc*. will not work.

In [14]:
alttext_element = driver.find_element_by_xpath('//*[@id="comic"]/img')
alttext_element.get_attribute("title")

MaxRetryError: HTTPConnectionPool(host='127.0.0.1', port=63263): Max retries exceeded with url: /session/5be3e66d-73fc-5541-84e7-2ae204f72ae1/element (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7fa2b014b250>: Failed to establish a new connection: [Errno 61] Connection refused'))

Just about any operation you do in a web browser can be automated with Selenium: scrolling, clicking, completing forms, moving between tabs/windows, handling pop-ups, navigating back and forward, handling cookies, *etc*. Learn more about the functionality with tutorials and other resources in the [Selenium documentation](https://selenium-python.readthedocs.io/navigating.html).

### Exercises

Start your driver again and get the xkcd homepage.

1. Change the XPath to click on the "Prev" button above the comic.
2. Change the XPath to search for the "comicNav" class instead of the "middleContainer" id.
3. Change the XPath to click on the "About" button in the upper-left.

## Ethical web scraping

James Densmore has a nice summary of [practices for ethical web scraping](https://towardsdatascience.com/ethics-in-web-scraping-b96b18136f01):

> * If you have a public API that provides the data I’m looking for, I’ll use it and avoid scraping all together.
> * I will always provide a User Agent string that makes my intentions clear and provides a way for you to contact me with questions or concerns.
> * I will request data at a reasonable rate. I will strive to never be confused for a DDoS attack.
> * I will only save the data I absolutely need from your page. If all I need it OpenGraph meta-data, that’s all I’ll keep.
> * I will respect any content I do keep. I’ll never pass it off as my own.
> * I will look for ways to return value to you. Maybe I can drive some (real) traffic to your site or credit you in an article or post.
> * I will respond in a timely fashion to your outreach and work with you towards a resolution.
> * I will scrape for the purpose of creating new value from the data, not to duplicate it.

Some other important components of ethical web scraping practices [include](http://robertorocha.info/on-the-ethics-of-web-scraping/):

* Reading the Terms of Service and Privacy Policies for the site's rules on scraping.
* Inspecting the robots.txt file for rules about what pages can be scraped, indexed, *etc*.
* Being gentle on smaller websites by running during off-peak hours and spacing out requests.
* Identifying yourself by name and email in your User-Agent strings.

What does a robots.txt file look like? Here is CNN's. It helpfully provides sitemaps that point robots to other pages, applies its rules to all User-agents, and disallows crawling of pages in specific directories (ads, polls, tests).

In [15]:
print(requests.get('https://www.cnn.com/robots.txt').text)

Sitemap: https://www.cnn.com/sitemaps/cnn/index.xml
Sitemap: https://www.cnn.com/sitemaps/cnn/news.xml
Sitemap: https://www.cnn.com/sitemaps/sitemap-section.xml
Sitemap: https://www.cnn.com/sitemaps/sitemap-interactive.xml
Sitemap: https://www.cnn.com/ampstories/sitemap.xml
Sitemap: https://edition.cnn.com/sitemaps/news.xml
User-agent: *
Allow: /partners/ipad/live-video.json
Disallow: /*.jsx$
Disallow: *.jsx$
Disallow: /*.jsx/
Disallow: *.jsx?
Disallow: /ads/
Disallow: /aol/
Disallow: /beta/
Disallow: /browsers/
Disallow: /cl/
Disallow: /cnews/
Disallow: /cnn_adspaces
Disallow: /cnnbeta/
Disallow: /cnnintl_adspaces
Disallow: /development
Disallow: /editionssi
Disallow: /help/cnnx.html
Disallow: /NewsPass
Disallow: /NOKIA
Disallow: /partners/
Disallow: /pipeline/
Disallow: /pointroll/
Disallow: /POLLSERVER/
Disallow: /pr/
Disallow: /privacy
Disallow: /PV/
Disallow: /Quickcast/
Disallow: /quickcast/
Disallow: /QUICKNEWS/
Disallow: /search/
Disallow: /terms
Disallow: /test/
Disallow: /vir
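Rather than reading these rules by eye, we can ask Python whether a given URL is allowed. The sketch below uses the standard library's `urllib.robotparser` on a short excerpt of the rules above, parsed from a string so that no request is made. The `research-bot` user-agent name is a hypothetical placeholder. One caveat: this parser does simple prefix matching, so the wildcard rules (such as `/*.jsx$`) are left out of the excerpt.

```python
from urllib.robotparser import RobotFileParser

# A few plain-path rules excerpted from the CNN robots.txt shown above
robots_excerpt = """
User-agent: *
Allow: /partners/ipad/live-video.json
Disallow: /ads/
Disallow: /partners/
Disallow: /search/
Disallow: /test/
""".strip().splitlines()

parser = RobotFileParser()
parser.parse(robots_excerpt)

# Hypothetical user-agent name for our scraper
agent = 'research-bot'

urls_to_check = [
    'https://www.cnn.com/world',
    'https://www.cnn.com/search/?q=colorado',
    'https://www.cnn.com/partners/ipad/live-video.json',
]

# True means the rules permit this agent to fetch the URL
for url in urls_to_check:
    print(url, parser.can_fetch(agent, url))
```

To check a live site, `parser.set_url(...)` followed by `parser.read()` would download and parse the file instead of using a hard-coded excerpt.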

When we are scraping websites, it is a good idea to include your contact information as a custom User-Agent string so that the webmaster can get in contact.

In [16]:
contact_header = {'User-Agent':'Python research tool by Brian Keegan, brian.keegan@colorado.edu'}

request = requests.get('https://www.cnn.com',headers=contact_header)
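Identifying yourself is one half of being a polite scraper; spacing out requests is the other. Below is a minimal sketch of one way to enforce a minimum delay between requests. The `Throttle` class, the 0.2-second delay, and the `example.com` URLs are all hypothetical choices for illustration, and the actual request is left as a comment so the sketch runs without touching any server.

```python
import time

class Throttle:
    """Enforce a minimum delay between consecutive calls to wait()."""

    def __init__(self, min_delay):
        self.min_delay = min_delay
        self._last_call = None

    def wait(self):
        now = time.monotonic()
        if self._last_call is not None:
            # Sleep only for whatever part of the delay has not yet elapsed
            remaining = self.min_delay - (now - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
        self._last_call = time.monotonic()

# Hypothetical delay and URLs, chosen only for illustration
throttle = Throttle(min_delay=0.2)
urls = ['https://example.com/a', 'https://example.com/b', 'https://example.com/c']

start = time.monotonic()
for url in urls:
    throttle.wait()
    # A real scraper would make its request here, for example:
    # requests.get(url, headers=contact_header)
elapsed = time.monotonic() - start

print(round(elapsed, 2))
```

For a real site, a delay of a second or more is a safer default than the short value used here to keep the sketch quick.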

Adverse consequences of web scraping include:
* Compromising the privacy and integrity of individual users' data
* Damaging a web server with too many requests
* Denying access to the web service to other authorized users
* Infringing on copyrighted material
* Damaging the business value of a web site

[Amanda Bee](http://velociraptor.info/) compiled [a nice set of examples](https://github.com/amandabee/scraping-for-journalists/wiki/Reporting-Examples) of data journalists using web scraping for their reporting. There are some ethical justifications for violating a site's terms of service to scrape data:
* Obtaining data for the public interest from official statements, government reports, *etc*.
* Conducting audit studies (as long as these are responsibly designed and pre-cleared)
* The data is unavailable from APIs, FOIA requests, and other reports

[Sophie Chou](http://sophiechou.com/) made this nice [decision flow-chart](http://www.storybench.org/to-scrape-or-not-to-scrape-the-technical-and-ethical-challenges-of-collecting-data-off-the-web/) of whether to build a scraper or not from a NICAR panel in 2016:

![Should you build a scraper flowchart](http://www.storybench.org/wp-content/uploads/2016/04/flowchart_final.jpeg)

Why is there a "Talk to a lawyer?" outcome at the bottom?

### Computer Fraud and Abuse Act

The [Computer Fraud and Abuse Act](https://en.wikipedia.org/wiki/Computer_Fraud_and_Abuse_Act) was enacted in 1986, amending computer-crime provisions first passed in 1984, [in large part due to](https://www.cnet.com/news/from-wargames-to-aaron-swartz-how-u-s-anti-hacking-law-went-astray/) the 1983 film [WarGames](https://en.wikipedia.org/wiki/WarGames) starring Matthew Broderick. A plain reading of the text of the law ([18 U.S.C. § 1030](https://www.law.cornell.edu/uscode/text/18/1030)) criminalizes just about any form of web scraping:

> * Whoever intentionally accesses a computer without authorization or exceeds authorized access, and thereby obtains… information from any protected computer;
> * knowingly causes the transmission of a program, information, code, or command, and as a result of such conduct, intentionally causes damage without authorization, to a protected computer;
> * the term “exceeds authorized access” means to access a computer with authorization and to use such access to obtain or alter information in the computer that the accesser is not entitled so to obtain or alter;
> * the term “damage” means any impairment to the integrity or availability of data, a program, a system, or information;
> * the term “protected computer” means a computer which is used in or affecting interstate or foreign commerce or communication, including a computer located outside the United States that is used in a manner that affects interstate or foreign commerce or communication of the United States;

Violators can be fined and jailed under a misdemeanor charge for up to 1 year for the first violation and jailed up to 10 years under a felony charge for repeated violations.

This law has a [chilling effect](https://en.wikipedia.org/wiki/Chilling_effect) on many forms of research, journalism, and other forms of protected speech. The CFAA has been used by federal prosecutors to bring federal felony charges against programmers, journalists, and activists. In 2011, programmer and hacktivist [Aaron Swartz](https://en.wikipedia.org/wiki/Aaron_Swartz) (who contributed to the development of RSS, Markdown, Creative Commons, and Reddit) was [arrested and charged](https://en.wikipedia.org/wiki/United_States_v._Swartz) with violating the CFAA for downloading several million PDFs from JSTOR over MIT's network. The [decision to prosecute was unusual](https://www.huffingtonpost.com/2013/03/13/aaron-swartz-prosecutorial-misconduct_n_2867529.html). Facing 35 years of imprisonment and over $1 million in fines under the CFAA, Swartz committed suicide on January 11, 2013.

In 2016, four computer science researchers and the publisher of *The Intercept*, all of whom use scraping techniques to run experiments measuring bias and discrimination in web content, [filed suit with the ACLU](https://www.aclu.org/cases/sandvig-v-sessions-challenge-cfaa-prohibition-uncovering-racial-discrimination-online) against the U.S. Government: *Sandvig v. Sessions*. Their research involves creating multiple fake accounts, providing inaccurate information to websites, using automated tools to record publicly-available data, and other scraping techniques. In March 2018, the [U.S. District Court for the District of Columbia ruled](https://www.aclu.org/news/judge-allows-aclu-case-challenging-law-preventing-studies-big-data-discrimination-proceed) that two of the plaintiffs have standing to sue and the case is currently being prepared for trial.

### Warning

The code we will write and execute below will repeatedly violate YouTube's [Terms of Service](https://www.youtube.com/static?template=terms) ("you are not allowed to... access the Service using any automated means (such as robots, botnets or scrapers)...") for retrieving information from the platform. In effect, we will transmit code in excess of our authorized access and potentially cause damage, in order to obtain information from a protected computer. 

We will do this in order to obtain public statements made by government officials acting in their official capacity because this data is otherwise unavailable for retrieval from YouTube. There is an interesting body of emerging legal precedent treating elected officials' use of Twitter as a public forum: [*Knight First Amendment Institute v. Trump*](https://en.wikipedia.org/wiki/Knight_First_Amendment_Institute_v._Trump) established that [the President may not block other Twitter users](https://www.courtlistener.com/docket/6087955/72/knight-first-amendment-institute-at-columbia-university-v-trump/):

> * "We hold that portions of the @realDonaldTrump account -- the “interactive space” where Twitter users may directly engage with the content of the President’s tweets -- are properly analyzed under the “public forum” doctrines set forth by the Supreme Court, that such space is a designated public forum..."
> * "we nonetheless conclude that the extent to which the President and Scavino can, and do, exercise control over aspects of the @realDonaldTrump account are sufficient to establish the government-control element as to the content of the tweets sent by the @realDonaldTrump account, the timeline compiling those tweets, and the interactive space associated with each of those tweets."
> * "Because a Twitter user lacks control over the comment thread beyond the control exercised over first-order replies through blocking, the comment threads -- as distinguished from the content of tweets sent by @realDonaldTrump, the @realDonaldTrump timeline, and the interactive space associated with each tweet -- do not meet the threshold criterion for being a forum."
> * "the account’s timeline, which “displays all tweets generated by the [account]”... all of which is government speech."

On this basis, I believe the White House's videos posted to YouTube are government speech and our automated retrieval of this content and associated meta-data in violation of YouTube's Terms of Service is justifiable for understanding this speech as a public forum.

I would advise you against using these tools and approaches without a similarly clear public interest rationale and jurisprudence linking behavior to public forum doctrines.

## Aside: Screen-scraping Twitter with Selenium

I am adapting a [tutorial by Shawn Wang](https://dev.to/swyx/scraping-my-twitter-social-graph-with-python-and-selenium--hn8) on scraping a Twitter graph with Python and Selenium.

In [41]:
# Path to the Chrome driver for my PC -- yours is likely very different
# driver = selenium.webdriver.Chrome(executable_path='E:/Dropbox/Courses/2019 Spring - ITSS Web Data Scraping/chromedriver.exe')

# Path to the Firefox driver (geckodriver) for my Mac -- yours is likely very different
driver = selenium.webdriver.Firefox(executable_path=mac_path)

driver.get('https://www.twitter.com')

Manually log in to your Twitter account through the driver page.

Then go to the "followings" (or followees, also called "friends" in the Twitter API) of an account. 

In [19]:
driver.get('https://twitter.com/JoeBiden/following')

At the time of this Notebook's writing, the "JoeBiden" account followed 47 other accounts. Depending on the resolution of your display, size of the window, *etc*. there may only be 10–20 accounts visible. We can scroll to see the rest of these accounts programmatically.

Run this cell a few times to keep scrolling to the bottom.

In [22]:
driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')

Pass the HTML of the web page in the browser back to Python and turn it into soup.

In [23]:
raw = driver.page_source.encode('utf-8')

soup = BeautifulSoup(raw)

Unfortunately, since I last ran this class in 2019, Twitter changed how they design and populate their website. If we inspect the elements of the following list, we can see that the tags and class names are obfuscated, which makes them hard to scrape.

*Back in the day*, they had nice div tags with "data-item-type"s called "user". No longer!

In [24]:
soup.body.find_all('div', attrs={'data-item-type':'user'})

## Screen-scraping YouTube with Selenium

Let's get data from YouTube instead, which appears to be better-behaved from a web scraper's perspective.

In [44]:
driver.get('https://www.youtube.com/c/whitehouse/videos')

Like Twitter, YouTube loads additional videos as you scroll, so we will scroll repeatedly to load as many videos as possible. Instead of re-using the JavaScript call above, we send the END key to the page.

In [36]:
# https://stackoverflow.com/a/51345544
from selenium.webdriver.common.keys import Keys

In [46]:
# Send keystrokes to the page's root <html> element
html = driver.find_element_by_tag_name('html')
for i in range(20):
    time.sleep(.1)
    html.send_keys(Keys.END)

Now that we've loaded as many videos as the YouTube interface allows by scrolling, pull the contents of the web page. We can revert back to our strategies from Week 2 to identify, navigate, and pull out the relevant fields we want.

In [47]:
raw = driver.page_source.encode('utf-8')

soup = BeautifulSoup(raw)

By inspecting the source, the video cells appear to live within elements called `<ytd-grid-video-renderer>`.

In [52]:
video_divs = soup.find_all('ytd-grid-video-renderer')

Inspect one of them.

In [53]:
video_divs[-1]

<ytd-grid-video-renderer class="style-scope ytd-grid-renderer" lockup="true"><!--css-build:shady--><div class="style-scope ytd-grid-video-renderer" id="dismissible"><ytd-thumbnail class="style-scope ytd-grid-video-renderer" use-hovered-property=""><!--css-build:shady--><a aria-hidden="true" class="yt-simple-endpoint inline-block style-scope ytd-thumbnail" href="/watch?v=q5iCPKDp4V4" id="thumbnail" rel="null" tabindex="-1">
<yt-img-shadow class="style-scope ytd-thumbnail no-transition" ftl-eligible="" loaded="" style="background-color: transparent;"><!--css-build:shady--><img alt="" class="style-scope yt-img-shadow" id="img" src="https://i.ytimg.com/vi/q5iCPKDp4V4/hqdefault.jpg?sqp=-oaymwEcCNACELwBSFXyq4qpAw4IARUAAIhCGAFwAcABBg==&amp;rs=AOn4CLBVngDX-C_GMBN6KcMXK80Pc0rq2A" width="210"/></yt-img-shadow>
<div class="style-scope ytd-thumbnail" id="overlays"><ytd-thumbnail-overlay-time-status-renderer class="style-scope ytd-thumbnail" overlay-style="DEFAULT"><!--css-build:shady--><yt-icon cl

Further drill-down and inspection of this element from the browser reveals that some promising data lives within an element defined as `<a id="video-title">`.

In [56]:
video_divs[-1].find_all('a',{'id':'video-title'})

[<a aria-label="The Inauguration of the 46th President of the United States by The White House 1 month ago 31 minutes 1,018,031 views" class="yt-simple-endpoint style-scope ytd-grid-video-renderer" href="/watch?v=q5iCPKDp4V4" id="video-title" title="The Inauguration of the 46th President of the United States">The Inauguration of the 46th President of the United States</a>]

Within the `aria-label` string there's the title, account, a relative date, the duration, and a detailed number of views. Within the `href` attribute is the video ID. These all seem like promising bits of data to try to grab.

In [58]:
video_divs[-1].find_all('a',{'id':'video-title'})[0]['aria-label']

'The Inauguration of the 46th President of the United States by The White House 1 month ago 31 minutes 1,018,031 views'

We can use regular expressions to try to match the numeric fields like the number of views.

In [81]:
_s = video_divs[-1].find_all('a',{'id':'video-title'})[0]['aria-label']
re.findall(r'([\d,]+) views',_s)

['1,018,031']
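The same idea extends to the other fields packed into the `aria-label`. The sketch below uses a single regular expression with named groups, assuming the label always follows the layout seen in this one example (title, "by", account, relative age, duration, view count). That layout is an assumption based on a single string, and the group names are my own; labels for live streams or titles containing " by " may well break it.

```python
import re

# The aria-label string shown above
label = ('The Inauguration of the 46th President of the United States '
         'by The White House 1 month ago 31 minutes 1,018,031 views')

# Assumed layout: "<title> by <account> <N unit> ago <duration> <count> views"
label_pattern = re.compile(
    r'^(?P<title>.+) by (?P<account>.+?) '
    r'(?P<age>\d+ \w+ ago) (?P<duration>.+?) '
    r'(?P<views>[\d,]+) views$'
)

match = label_pattern.match(label)

# Fall back to an empty dictionary when the label does not fit the layout
fields = match.groupdict() if match else {}

print(fields)
```

Checking `match` before using it avoids the `IndexError` that indexing an empty `re.findall` result with `[0]` would raise on an unexpected label.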

Extract other relevant fields.

In [59]:
# The video link
video_divs[-1].find_all('a',{'id':'video-title'})[0]['href']

'/watch?v=q5iCPKDp4V4'
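The `href` is a relative link, so two small conversions are useful later: pulling out the video ID and building the full URL for the video's page. A minimal sketch using the standard library's `urllib.parse`, assuming the ID always lives in the `v` query parameter as it does in this example:

```python
from urllib.parse import urljoin, urlparse, parse_qs

# The relative link extracted from the <a id="video-title"> element
href = '/watch?v=q5iCPKDp4V4'

# Parse the query string into a dictionary of lists, e.g. {'v': ['...']}
query = parse_qs(urlparse(href).query)
video_id = query['v'][0]

# Combine the relative link with the site root to get a full URL
full_url = urljoin('https://www.youtube.com', href)

print(video_id)
print(full_url)
```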

In [82]:
# The video title
video_divs[-1].find_all('a',{'id':'video-title'})[0]['title']

'The Inauguration of the 46th President of the United States'

There's also some helpful data about the length of the video hiding in a `<ytd-thumbnail-overlay-time-status-renderer>`.

In [70]:
video_divs[-1].find_all('ytd-thumbnail-overlay-time-status-renderer')

[<ytd-thumbnail-overlay-time-status-renderer class="style-scope ytd-thumbnail" overlay-style="DEFAULT"><!--css-build:shady--><yt-icon class="style-scope ytd-thumbnail-overlay-time-status-renderer" disable-upgrade="" hidden=""></yt-icon><span aria-label="31 minutes, 45 seconds" class="style-scope ytd-thumbnail-overlay-time-status-renderer">
   31:45
 </span></ytd-thumbnail-overlay-time-status-renderer>]

We can pull out that video length element with:

In [71]:
video_divs[-1].find_all('ytd-thumbnail-overlay-time-status-renderer')[0].text.strip()

'31:45'

There are also labels about whether the videos are closed captioned within a tag called `<span class="style-scope ytd-badge-supported-renderer">`.

In [84]:
video_divs[-1].find_all('span',{'class':'style-scope ytd-badge-supported-renderer'})

[<span class="style-scope ytd-badge-supported-renderer"></span>,
 <span class="style-scope ytd-badge-supported-renderer">CC</span>]

The last video in the list (the oldest, at the bottom of the page) has a CC badge; the one before it does not.

In [85]:
video_divs[-2].find_all('span',{'class':'style-scope ytd-badge-supported-renderer'})

[<span class="style-scope ytd-badge-supported-renderer"></span>]

Let's put these pieces together into a loop that grabs all the data from this list of videos.

In [87]:
videos_l = []

for d in video_divs:
    # Get the number of views
    _s = d.find_all('a',{'id':'video-title'})[0]['aria-label']
    _views = re.findall(r'([\d,]+) views',_s)[0]
    
    # Get the link
    _link = d.find_all('a',{'id':'video-title'})[0]['href']
    
    # Get the title
    _title = d.find_all('a',{'id':'video-title'})[0]['title']
    
    # Get the length
    _length = d.find_all('ytd-thumbnail-overlay-time-status-renderer')[0].text.strip()
    
    # Get the captioning
    if len(d.find_all('span',{'class':'style-scope ytd-badge-supported-renderer'})) > 1:
        _cc = True
    else:
        _cc = False
        
    # Package it all up into a dictionary
    _d = {'Views':_views,
          'Link':_link,
          'Title':_title,
          'Length':_length,
          'Captioned':_cc
         }
    
    # Add our dictionary to the container
    videos_l.append(_d)

Make the `videos_l` container into a DataFrame.

In [88]:
pd.DataFrame(videos_l)

Unnamed: 0,Views,Link,Title,Length,Captioned
0,7193,/watch?v=MqPF_mF8Oww,Vice President Harris Swears In Marcia Fudge a...,1:46,False
1,7317,/watch?v=LwJ8fLt1dow,Congress Passed President Biden's American Res...,0:39,False
2,13392,/watch?v=UjH4_NOVtWc,President Biden Hosts an Event with the CEOs o...,15:02,False
3,14013,/watch?v=3RYqFFwU84g,03/10/21: Press Briefing by Press Secretary Je...,1:16:09,False
4,13953,/watch?v=6y-1OLaFtA0,03/10/21: Press Briefing by White House COVID-...,28:59,False
...,...,...,...,...,...
154,775552,/watch?v=m55tzTIJwwA,President Biden Signs Executive Orders and Oth...,2:56,True
155,336228,/watch?v=EntM93k3wUs,President Biden and Vice President Harris Part...,5:31,False
156,496207,/watch?v=U-bF2vYHTQE,The Work Begins,1:37,True
157,354828,/watch?v=C8OhSd1TY7g,President Biden Reviews the Readiness of Milit...,5:44,False


Depending on your priorities, you could stop here and save this to a CSV since there is already rich data. 

Some limitations to think of include:
* How would YouTube serve up an account with hundreds or thousands of videos? Is there a limit to the videos you can get from scrolling?
* The "Views" values are stored as strings rather than numbers, so you'll want to convert them. The commas could also complicate storing the data as a *comma-separated* file, so you'll want to strip them out. The two fixes are related.
* We don't have any of the valuable data about the actual date the video was posted, the number of up/down votes, or even the transcript from the captioning.
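One possible way to handle the string-to-number conversions is sketched below. The helper names `views_to_int` and `length_to_seconds` are hypothetical, and the example values are taken from the output shown earlier in this notebook. Converting the length to seconds is an extra step I am assuming would be useful for analysis.

```python
def views_to_int(views):
    """Turn a string like '1,018,031' into the integer 1018031."""
    return int(views.replace(',', ''))

def length_to_seconds(length):
    """Turn 'M:SS' or 'H:MM:SS' into a total number of seconds."""
    seconds = 0
    for part in length.split(':'):
        # Each step to the right is worth 60 times less than the one before
        seconds = seconds * 60 + int(part)
    return seconds

# One record in the same shape as the dictionaries in videos_l
record = {'Views': '1,018,031', 'Length': '31:45'}

clean_record = {
    'Views': views_to_int(record['Views']),
    'Length': length_to_seconds(record['Length']),
}

print(clean_record)
```

With a DataFrame, the same functions could be applied to a whole column with `.apply()`.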

### Retrieving data from each video's page

There's valuable data on each video's page about the specific date, the up and down votes, and even the transcript of the video that we can also retrieve.

Let's start with the inauguration video.

In [89]:
driver.get('https://www.youtube.com/watch?v=q5iCPKDp4V4')

Get the raw HTML and soup-ify it.

In [90]:
yt_inauguration_raw = driver.page_source.encode('utf-8')

yt_inauguration_soup = BeautifulSoup(yt_inauguration_raw)

The date the video was uploaded appears within a `<div id="date">` tag.

In [91]:
yt_inauguration_soup.find_all('div',{'id':'date'})

[<div class="style-scope ytd-video-primary-info-renderer" id="date"><span class="style-scope ytd-video-primary-info-renderer" id="dot">•</span><yt-formatted-string class="style-scope ytd-video-primary-info-renderer">Jan 20, 2021</yt-formatted-string></div>]

Digging in, we can pull out the date.

In [93]:
yt_inauguration_soup.find_all('div',{'id':'date'})[0].text[1:]

'Jan 20, 2021'
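A date string is easier to sort and compare once it is a real `datetime` object. The sketch below assumes YouTube always renders the date in the "Jan 20, 2021" style seen here and that Python is running with an English locale (the `%b` month abbreviation is locale-dependent).

```python
from datetime import datetime

# The text of the <div id="date"> element, including the leading dot
raw_date = '•Jan 20, 2021'

# Remove the leading dot and any stray whitespace
clean_date = raw_date.lstrip('•').strip()

# %b = abbreviated month name, %d = day of month, %Y = four-digit year
uploaded = datetime.strptime(clean_date, '%b %d, %Y')

print(uploaded.date().isoformat())
```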

The up and downvotes appear as coarse aggregations ("16K","95K") but the true counts at the time are hidden in some "aria-label"s. 

First, find the `<button id="button">` within the `<div id="top-level-buttons">`.

In [99]:
yt_inauguration_soup.find_all('div',{'id':'top-level-buttons'})

[<div class="style-scope ytd-menu-renderer" id="top-level-buttons"></div>,
 <div class="style-scope ytd-menu-renderer" id="top-level-buttons"><ytd-toggle-button-renderer button-renderer="true" class="style-scope ytd-menu-renderer force-icon-button style-text" is-icon-button="" style-action-button="" use-keyboard-focused=""><a class="yt-simple-endpoint style-scope ytd-toggle-button-renderer" tabindex="-1"><yt-icon-button class="style-scope ytd-toggle-button-renderer style-text" id="button"><!--css-build:shady--><button aria-label="like this video along with 16,295 other people" aria-pressed="false" class="style-scope yt-icon-button" id="button"><yt-icon class="style-scope ytd-toggle-button-renderer"><svg class="style-scope yt-icon" focusable="false" preserveaspectratio="xMidYMid meet" style="pointer-events: none; display: block; width: 100%; height: 100%;" viewbox="0 0 24 24"><g class="style-scope yt-icon"><path class="style-scope yt-icon" d="M1 21h4V9H1v12zm22-11c0-1.1-.9-2-2-2h-6.31l.

There are apparently (and hopefully only!) two of these types of divs. The one we care about is the second.

In [101]:
tlb2 = yt_inauguration_soup.find_all('div',{'id':'top-level-buttons'})[1]

tlb2.find_all('button',{'id':'button'})

[<button aria-label="like this video along with 16,295 other people" aria-pressed="false" class="style-scope yt-icon-button" id="button"><yt-icon class="style-scope ytd-toggle-button-renderer"><svg class="style-scope yt-icon" focusable="false" preserveaspectratio="xMidYMid meet" style="pointer-events: none; display: block; width: 100%; height: 100%;" viewbox="0 0 24 24"><g class="style-scope yt-icon"><path class="style-scope yt-icon" d="M1 21h4V9H1v12zm22-11c0-1.1-.9-2-2-2h-6.31l.95-4.57.03-.32c0-.41-.17-.79-.44-1.06L14.17 1 7.59 7.59C7.22 7.95 7 8.45 7 9v10c0 1.1.9 2 2 2h9c.83 0 1.54-.5 1.84-1.22l3.02-7.05c.09-.23.14-.47.14-.73v-1.91l-.01-.01L23 10z"></path></g></svg><!--css-build:shady--></yt-icon></button>,
 <button aria-label="dislike this video along with 95,615 other people" aria-pressed="false" class="style-scope yt-icon-button" id="button"><yt-icon class="style-scope ytd-toggle-button-renderer"><svg class="style-scope yt-icon" focusable="false" preserveaspectratio="xMidYMid mee

The upvote and downvote buttons are the first two buttons, and each carries an `aria-label` containing its vote count.

In [103]:
upvotes = tlb2.find_all('button',{'id':'button'})[0]
downvotes = tlb2.find_all('button',{'id':'button'})[1]

upvotes

<button aria-label="like this video along with 16,295 other people" aria-pressed="false" class="style-scope yt-icon-button" id="button"><yt-icon class="style-scope ytd-toggle-button-renderer"><svg class="style-scope yt-icon" focusable="false" preserveaspectratio="xMidYMid meet" style="pointer-events: none; display: block; width: 100%; height: 100%;" viewbox="0 0 24 24"><g class="style-scope yt-icon"><path class="style-scope yt-icon" d="M1 21h4V9H1v12zm22-11c0-1.1-.9-2-2-2h-6.31l.95-4.57.03-.32c0-.41-.17-.79-.44-1.06L14.17 1 7.59 7.59C7.22 7.95 7 8.45 7 9v10c0 1.1.9 2 2 2h9c.83 0 1.54-.5 1.84-1.22l3.02-7.05c.09-.23.14-.47.14-.73v-1.91l-.01-.01L23 10z"></path></g></svg><!--css-build:shady--></yt-icon></button>

Access the `aria-label`.

In [104]:
upvotes['aria-label']

'like this video along with 16,295 other people'

Use a regex to extract the number.

In [105]:
re.findall(r'([\d,]+) other people',upvotes['aria-label'])

['16,295']

In [106]:
re.findall(r'([\d,]+) other people',downvotes['aria-label'])

['95,615']
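The regular expression hands back the counts as strings with thousands separators, which won't sort or sum as numbers. Here is a minimal sketch of turning an `aria-label` into an integer, assuming the label keeps the "along with N other people" wording seen above. `parse_vote_count` is a hypothetical helper name, not something from the original notebook.

```python
import re

def parse_vote_count(aria_label):
    """Return the count in an aria-label as an int, or None if no count is found."""
    matches = re.findall(r'([\d,]+) other people', aria_label)
    if not matches:
        return None
    # Drop the thousands separators before converting to an integer
    return int(matches[0].replace(',', ''))

parse_vote_count('like this video along with 16,295 other people')
```

Returning `None` rather than raising keeps a single odd label from stopping a longer scrape.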

For closed-captioned videos, there's also a transcript of the video, with timestamps and text stored inside the `<ytd-transcript-renderer>` parent element in a series of `<div class="cue-group">` tags.

In [108]:
yt_inauguration_soup.find_all('ytd-transcript-renderer')

[<ytd-transcript-renderer class="style-scope ytd-engagement-panel-section-list-renderer" panel="" panel-content-visible="" panel-target-id="engagement-panel-transcript" refresh=""><!--css-build:shady--><div class="style-scope ytd-transcript-renderer" id="header"></div>
 <div class="style-scope ytd-transcript-renderer" id="body"><ytd-transcript-body-renderer class="style-scope ytd-transcript-renderer" panel="" refresh=""><!--css-build:shady-->
 <div class="cue-group style-scope ytd-transcript-body-renderer active">
 <div class="cue-group-start-offset style-scope ytd-transcript-body-renderer">
       00:00
     </div>
 <div class="cues style-scope ytd-transcript-body-renderer">
 <div class="cue style-scope ytd-transcript-body-renderer active" role="button" start-offset="0" tabindex="0">
           &gt;&gt; Senator Klobuchar:  What
 you are all about
         </div>
 <dom-repeat class="style-scope ytd-transcript-body-renderer"><template is="dom-repeat"></template></dom-repeat>
 </div>
 </

In [120]:
yt_inauguration_soup.find_all('div',{'class':'cue-group'})[-1]

<div class="cue-group style-scope ytd-transcript-body-renderer">
<div class="cue-group-start-offset style-scope ytd-transcript-body-renderer">
      31:42
    </div>
<div class="cues style-scope ytd-transcript-body-renderer">
<div class="cue style-scope ytd-transcript-body-renderer" role="button" start-offset="1902840" tabindex="0">
          Thank you, America.
        </div>
<dom-repeat class="style-scope ytd-transcript-body-renderer"><template is="dom-repeat"></template></dom-repeat>
</div>
</div>

From the `<div class="cue-group">` elements, we can extract the time code and the text.

In [119]:
last_cg = yt_inauguration_soup.find_all('div',{'class':'cue-group'})[-1]
last_cg.find_all('div',{'class':'cue-group-start-offset'})[0].text.strip()

'31:42'

In [121]:
last_cg.find_all('div',{'class':'cue'})[0].text.strip()

'Thank you, America.'

Repeat this for every cue group to get the whole transcript.

In [129]:
cg_l = []

for cg in yt_inauguration_soup.find_all('div',{'class':'cue-group'}):
    _time_code = cg.find_all('div',{'class':'cue-group-start-offset'})[0].text.strip()
    _text = cg.find_all('div',{'class':'cue'})[0].text.strip()
    
    _d = {'Time':_time_code,
          'Text':_text.replace('\n',' ')}
    
    cg_l.append(_d)
    
pd.DataFrame(cg_l).set_index('Time')['Text']

Time
00:00        >> Senator Klobuchar:  What you are all about
00:01    to be part of, America, is a historic moment o...
00:06    To administer the oath to our first African Am...
00:11        our first Asian-American and our first woman,
00:14    Vice President Kamala Harris, it is my great p...
                               ...                        
31:29            Sustained by faith, driven by conviction,
31:33      And, devoted to one another and to this country
31:35                         we love with all our hearts.
31:37    May God bless America and may God protect our ...
31:42                                  Thank you, America.
Name: Text, Length: 478, dtype: object
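The time codes in this index are strings like `'31:42'`, which is awkward if we want to compute durations or align the transcript with other data. Here is a small sketch for converting them to seconds, assuming the codes are always `MM:SS` or `H:MM:SS`. `time_code_to_seconds` is a hypothetical helper, not part of the original notebook.

```python
def time_code_to_seconds(time_code):
    """Convert a 'MM:SS' or 'H:MM:SS' time code into a number of seconds."""
    seconds = 0
    # Work left to right: each new part is worth 1/60th of the running total's unit
    for part in time_code.strip().split(':'):
        seconds = seconds * 60 + int(part)
    return seconds

time_code_to_seconds('31:42')
```

The `start-offset` attribute on each cue (`1902840` for the `31:42` cue above) looks like the same offset in milliseconds, so it may be a more precise alternative if it is present on every cue.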

That was all to extract the data from a single video. Let's now scrape the content from each of the White House YouTube videos. With something like 160 videos times 2 seconds per video, this scrape should take just over 5 minutes. So let's let this run and take a break. Something will likely break, so let's check back in afterwards! 

In [141]:
for v in videos_l:
    _link = v['Link']
    
    # Have Selenium get the page, get the source, and convert to soup
    driver.get('https://www.youtube.com'+_link)
    
    # Give the page a few seconds to load
    time.sleep(2)
    
    # Retrieve the content and soup-ify
    _raw = driver.page_source.encode('utf-8')
    _soup = BeautifulSoup(_raw)
    
    # Get the date; if it is missing, wait and re-read the page source before retrying
    try:
        _date = _soup.find_all('div',{'id':'date'})[0].text[1:]
    except IndexError:
        time.sleep(1)
        _soup = BeautifulSoup(driver.page_source.encode('utf-8'))
        _date = _soup.find_all('div',{'id':'date'})[0].text[1:]
    
    # Get the up and downvotes, again re-reading the page source once if they are missing
    try:
        _tlb2 = _soup.find_all('div',{'id':'top-level-buttons'})[-1]
        _upvotes_soup = _tlb2.find_all('button',{'id':'button'})[0]
        _downvotes_soup = _tlb2.find_all('button',{'id':'button'})[1]
        
        _upvotes = re.findall(r'([\d,]+) other people',_upvotes_soup['aria-label'])[0]
        _downvotes = re.findall(r'([\d,]+) other people',_downvotes_soup['aria-label'])[0]
    except (IndexError, KeyError):
        time.sleep(1)
        _soup = BeautifulSoup(driver.page_source.encode('utf-8'))
        _tlb2 = _soup.find_all('div',{'id':'top-level-buttons'})[-1]
        _upvotes_soup = _tlb2.find_all('button',{'id':'button'})[0]
        _downvotes_soup = _tlb2.find_all('button',{'id':'button'})[1]
        
        _upvotes = re.findall(r'([\d,]+) other people',_upvotes_soup['aria-label'])[0]
        _downvotes = re.findall(r'([\d,]+) other people',_downvotes_soup['aria-label'])[0]
    
    # Update the dictionary
    v['Date'] = _date
    v['Upvotes'] = _upvotes
    v['Downvotes'] = _downvotes

IndexError: list index out of range
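The `IndexError` above halts the whole loop at the first page that is missing an expected element. One way to keep going is to record the failures and move on, then revisit the failed videos afterwards. A minimal sketch of that pattern follows. `scrape_all` and `fake_scrape` are hypothetical names for illustration, and the stand-in function only mimics a scraper that sometimes raises `IndexError`.

```python
def scrape_all(items, scrape_one):
    """Apply scrape_one to each item, collecting failures instead of stopping."""
    results, failures = [], []
    for item in items:
        try:
            results.append(scrape_one(item))
        except (IndexError, KeyError) as err:
            # Remember which item failed and why, then continue with the next one
            failures.append((item, repr(err)))
    return results, failures

def fake_scrape(link):
    """Stand-in for a real scraper: raises IndexError when the link has no '='."""
    return link.split('=')[1]

results, failures = scrape_all(['/watch?v=MqPF_mF8Oww', '/broken'], fake_scrape)
```

In the real loop, `scrape_one` would wrap the Selenium and BeautifulSoup steps for a single video, and `failures` gives us a list of links to retry or inspect by hand.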

In [143]:
pd.DataFrame(videos_l).head(20)

| | Views | Link | Title | Length | Captioned | Date | Upvotes | Downvotes |
|---|---|---|---|---|---|---|---|---|
| 0 | 7193 | /watch?v=MqPF_mF8Oww | Vice President Harris Swears In Marcia Fudge a... | 1:46 | False | Mar 10, 2021 | 16295 | 95615 |
| 1 | 7317 | /watch?v=LwJ8fLt1dow | Congress Passed President Biden's American Res... | 0:39 | False | Mar 10, 2021 | 16295 | 95615 |
| 2 | 13392 | /watch?v=UjH4_NOVtWc | President Biden Hosts an Event with the CEOs o... | 15:02 | False | Mar 10, 2021 | 16295 | 95615 |
| 3 | 14013 | /watch?v=3RYqFFwU84g | 03/10/21: Press Briefing by Press Secretary Je... | 1:16:09 | False | Mar 10, 2021 | 16295 | 95615 |
| 4 | 13953 | /watch?v=6y-1OLaFtA0 | 03/10/21: Press Briefing by White House COVID-... | 28:59 | False | Mar 10, 2021 | 16295 | 95615 |
| 5 | 32642 | /watch?v=Kqlk_h-oIKI | 03/09/21: Press Briefing by Press Secretary Je... | 54:44 | True | Mar 9, 2021 | 16295 | 95615 |
| 6 | 32332 | /watch?v=EwSGGg7UFGo | President Biden Visits a Small Business that h... | 4:44 | True | Mar 9, 2021 | 16295 | 95615 |
| 7 | 35375 | /watch?v=H3ym47f014c | President Biden Marks Women's History Month | 2:58 | False | Mar 8, 2021 | 16295 | 95615 |
| 8 | 36511 | /watch?v=TYqKwcquxz8 | President Biden Delivers Remarks on Internatio... | 19:41 | True | Mar 8, 2021 | 16295 | 95615 |
| 9 | 29101 | /watch?v=wwtOtsZEMTA | Vice President Harris Remarks to the National ... | 10:11 | True | Mar 8, 2021 | 16295 | 95615 |


### Spoofing headers

When we use `requests` or Selenium to get data from other web servers, each of the get requests carries some meta-data about ourselves, called [headers](https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html). These headers tell the server what kind of web browser we are, what kinds of data we can receive, *etc*. so that the server can reply with properly-formatted information. 

But it is also possible for the server to understand a request and refuse to fulfill it, which is known as an [HTTP 403 error](https://en.wikipedia.org/wiki/HTTP_403). A server's refusal to fulfill a client's request can often be traced back to the identity a client presents through its headers or to the client lacking authorization to access the data (*i.e.*, you need to authenticate with the website first). In the case of `requests`, its `get` request includes default header information that identifies it as a Python script rather than a human-driven web browser.

Let's make a request for an article from the NYTimes.

In [None]:
honest_response = requests.get('https://www.nytimes.com/2019/02/03/us/politics/trump-interview-mueller.html')

We can see the headers we sent with this request.

In [None]:
honest_response.request.headers

Specifically, the 'User-Agent' string identifies this request as originating from the "python-requests/2.21.0" program, rather than a typical web browser. Some web servers are configured to inspect the headers of incoming requests and refuse any that do not appear to come from an actual web browser.

We can often circumvent these filters by sending alternative headers that claim to be from a web browser as a part of our `requests.get()`.

In [None]:
# Make a dictionary with spoofed headers for the User-Agent
spoofed_headers = {'User-Agent':"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36"}

# Make the request with the spoofed headers
nytimes_url = 'https://www.nytimes.com/2019/02/03/us/politics/trump-interview-mueller.html'
spoofed_response = requests.get(nytimes_url,headers=spoofed_headers)

Sure enough, the get request we sent to the NYTimes web server now includes the spoofed "User-Agent" string we wrote that claims our request is from a web browser. The server should now return the data we requested, even though we are not who we claimed to be.

In [None]:
spoofed_response.request.headers

I had trouble finding a website that refused "python-requests" connections automatically (*e.g.*, Amazon, NYTimes, etc.), but you will likely find some along the way. 

Spoofing headers to conceal the identity of your client to a web server is another example of how technological capabilities can overtake ethical responsibilities. The owners of a web server may have good reasons for refusing to serve content to non-web browsers (copyright, privacy, business model, *etc*.). Misrepresenting your identity to extract this data should only be done if the risks to others are small, the benefits are in the public interest, there are no other alternatives for obtaining the data, *etc*. 

There can be *very* real consequences for spoofing headers. Because it is such a common and relatively trivial method for circumventing server security settings, making repeated spoofed requests could result in your IP address or an IP address range (worst case, the entire university) being blocked from making requests to the server.

### Parallelizing requests

A third web scraping practice that warrants ethical scrutiny is parallelization. In the example of getting historical `@WhiteHouse` tweets, we launched a single browser window and "scrolled" until we reached the end, a process that took on the order of a minute.

However, we *could* launch multiple scripts that each create a browser window and collect a different segment of the data in parallel, then combine the results at the end. In an API context, we *could* create multiple applications and design our requests so that they work simultaneously to get all the data.

Each request imposes some cost on the server to receive, process, and return the requested data: making these requests in parallel increases the convenience and efficiency for the data scraper, but also dramatically increases the strain on the server to fulfill other clients' requests. In fact, highly-parallelized and synchronized requests can look like [denial-of-service attacks](https://en.wikipedia.org/wiki/Denial-of-service_attack) and may get your requests far more scrutiny and blowback than patiently waiting for your data to arrive in series. The ethical justifications for employing highly-parallelized scraping approaches are thin: documenting a rapidly-unfolding event before the data disappears, for example.
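The patient alternative is to keep requests in series and put a pause between them, as the `time.sleep(2)` in the YouTube loop did. A minimal sketch of a reusable version follows. `throttled` is a hypothetical helper name, and the two-second default is just the delay used earlier, not a recommendation from any particular website.

```python
import time

def throttled(items, delay_seconds=2.0):
    """Yield items one at a time, pausing between consecutive items."""
    for i, item in enumerate(items):
        # No need to wait before the very first item
        if i > 0:
            time.sleep(delay_seconds)
        yield item

# Example with a short delay so it finishes quickly
for link in throttled(['/watch?v=a', '/watch?v=b'], delay_seconds=0.1):
    print('Would request', link)
```

Wrapping the iterable this way keeps the rate limit in one place, so it is harder to forget when the body of the loop changes.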

## Scraping the Internet Archive's Wayback Machine

Now we'll leave some of the ethically-fraught methods of web scraping behind. The Internet Archive maintains the "[Wayback Machine](https://www.archive.org/web/)" where old versions of websites are stored. Some of my favorites:

* [CNN in August 2000](https://web.archive.org/web/20000815052826/http://www.cnn.com/)
* [Facebook in August 2004](https://web.archive.org/web/20040817020419/http://www.facebook.com/)
* [Apple in April 1997](https://web.archive.org/web/19970404064444/http://www.apple.com:80/)

In the URLs above, the numeric identifier is the timestamp of when the snapshot of the website was captured. How do we know when the Wayback Machine archived a webpage? There's a free and open API!
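The identifier in those URLs reads as `YYYYMMDDhhmmss`, so we can already recover a capture time from a Wayback URL with a regular expression and `datetime.strptime`. This is a sketch that assumes the URL follows the `/web/<14 digits>/` pattern of the three examples. `wayback_timestamp` is a hypothetical helper name.

```python
import re
from datetime import datetime

def wayback_timestamp(wayback_url):
    """Pull the 14-digit capture timestamp out of a Wayback Machine URL."""
    match = re.search(r'/web/(\d{14})', wayback_url)
    if match is None:
        return None
    # Interpret the digits as year, month, day, hour, minute, second
    return datetime.strptime(match.group(1), '%Y%m%d%H%M%S')

wayback_timestamp('https://web.archive.org/web/20040817020419/http://www.facebook.com/')
```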

### Using the Wayback Machine API

The simplest API request we can make asks for the most recent snapshot of a webpage archived by the Wayback Machine.

In [None]:
wb_url = 'http://archive.org/wayback/available?url=facebook.com'

wb_response = requests.get(wb_url)

wb_response.json()

This response tells us the timestamp and location of this snapshot, which we could then go retrieve and parse.

In [None]:
wb_response_json = wb_response.json()

recent_fb_wb_url = wb_response_json['archived_snapshots']['closest']['url']

recent_fb_wb_response = requests.get(recent_fb_wb_url)

Get the raw text out, soupify, and look for links. For some reason all the links in this snapshot are in German.

In [None]:
recent_fb_wb_raw = recent_fb_wb_response.text

recent_fb_wb_soup = BeautifulSoup(recent_fb_wb_raw)

[link.text for link in recent_fb_wb_soup.find_all('a')]

We can also ask for the snapshot of a webpage closest to a specific date. Let's ask the Wayback Machine for a snapshot of Facebook around February 1, 2008.

In [None]:
wb_url = 'http://archive.org/wayback/available?url=facebook.com&timestamp=20080201'

wb_response = requests.get(wb_url)

wb_response_json = wb_response.json()

wb_response_json

Note that this is a relatively deep JSON object we have to navigate into to access information like the Wayback URL or the timestamp of the snapshot. The closest snapshot to February 1, 2008 was January 30, 2008. We use the [`datetime.strptime`](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior) function to turn this numeric string that we recognize as a timestamp into a datetime object.

In [None]:
print(datetime.strptime(wb_response_json['archived_snapshots']['closest']['timestamp'],
                        '%Y%m%d%H%M%S'))
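Because the URL and timestamp sit a few levels down, and a page with no snapshots may not have a `closest` entry at all, it can help to wrap the navigation in one function. This is a minimal sketch: `closest_snapshot` is a hypothetical helper, and `example_payload` is hand-written to mirror only the keys used in this notebook, with illustrative values rather than a real API response.

```python
from datetime import datetime

def closest_snapshot(payload):
    """Return (url, datetime) for the closest snapshot, or None if there is none."""
    closest = payload.get('archived_snapshots', {}).get('closest', {})
    if 'url' not in closest or 'timestamp' not in closest:
        return None
    captured = datetime.strptime(closest['timestamp'], '%Y%m%d%H%M%S')
    return closest['url'], captured

# Hand-written payload with illustrative values (not copied from a real response)
example_payload = {
    'archived_snapshots': {
        'closest': {
            'url': 'http://web.archive.org/web/20080130000000/http://www.facebook.com/',
            'timestamp': '20080130000000',
        }
    }
}

print(closest_snapshot(example_payload))
print(closest_snapshot({'archived_snapshots': {}}))
```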

As before, we could scrape out the links on this 2008 version of the page.

In [None]:
# Find the old URL
fb_wb_url = wb_response_json['archived_snapshots']['closest']['url']

# Go get the archived snapshot from the Wayback Machine
fb_wb_response = requests.get(fb_wb_url)

# Get the text from the response
fb_wb_raw = fb_wb_response.text

# Soup-ify
fb_wb_soup = BeautifulSoup(fb_wb_raw)

# Make a list of the text of the links
[link.text for link in fb_wb_soup.find_all('a')]

We could likewise launch this link to view the page in Selenium.

In [None]:
driver = selenium.webdriver.Chrome(executable_path='/Users/briankeegan/Dropbox/Courses/2019 Spring - ITSS Web Data Scraping/chromedriver')

driver.get(fb_wb_url)

In [None]:
driver.quit()

### Scraping historical web pages

A current project I am working on is exploring how social media platforms' terms of service have evolved over time. Let's start with Facebook's terms of service and privacy policy.

In [None]:
fb_tos = 'http://www.facebook.com/terms.php'
fb_pp = 'http://www.facebook.com/policy.php'

We will take advantage of the [`date_range`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.date_range.html) function in `pandas` to generate a range of month-end dates between January 2005 and March 2021.

In [None]:
dates_list = pd.date_range(start='2005-01-01',end='2021-03-01',freq='M')
dates_list

We'll use [`datetime.strftime`](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior) (the inverse of `strptime`) to make these date objects into specifically-formatted strings that we can format into a URL.

In [None]:
# Take the first datetime object and turn it into a string
datetime.strftime(dates_list[0],'%Y%m%d')
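To see that the two functions really are inverses, here is a small round trip using only the standard library. It assumes a month-end date of January 31, 2005, which is what a monthly range starting in January 2005 begins with. The variable names are just for illustration.

```python
from datetime import datetime

month_end = datetime(2005, 1, 31)

# Date object -> compact string, in both the short and long Wayback formats
short_str = month_end.strftime('%Y%m%d')
long_str = month_end.strftime('%Y%m%d%H%M%S')

# Compact string -> date object again
round_trip = datetime.strptime(short_str, '%Y%m%d')

print(short_str, long_str, round_trip == month_end)
```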

Use string formatting to put the `fb_tos` URL and formatted timestamp into a request to the Wayback Machine.

In [None]:
date_str = datetime.strftime(dates_list[0],'%Y%m%d')

wb_api_url = 'https://archive.org/wayback/available?url={0}&timestamp={1}'
wb_api_url_formatted = wb_api_url.format(fb_tos,date_str)

print(wb_api_url_formatted)
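The formatted string drops a full URL, colons and slashes included, straight into the query string. That works here, but a URL with its own `?` or `&` characters could confuse the request. A hedged alternative is to let the standard library percent-encode the parameters. `build_wayback_query` is a hypothetical helper name.

```python
from urllib.parse import urlencode

def build_wayback_query(url, timestamp):
    """Build a Wayback availability request with percent-encoded parameters."""
    params = urlencode({'url': url, 'timestamp': timestamp})
    return 'https://archive.org/wayback/available?' + params

print(build_wayback_query('http://www.facebook.com/terms.php', '20050131'))
```

Passing the same dictionary to `requests.get` through its `params` argument should have a similar effect, since `requests` encodes query parameters for us.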

Make the request to the Wayback Machine to get the URL and timestamp of the Wayback Machine's snapshot of Facebook's Terms of Service closest to January 31, 2005.

In [None]:
wb_api_response = requests.get(wb_api_url_formatted)

wb_api_response.json()

Parse the markup of this old version.

In [None]:
# Find the old URL
wb_fb_old_url = wb_api_response.json()['archived_snapshots']['closest']['url']

# Go get the archived snapshot from the Wayback Machine
wb_fb_raw = requests.get(wb_fb_old_url).text

# Soup-ify
wb_fb_soup = BeautifulSoup(wb_fb_raw)

# Find the content element and get the text out
wb_fb_terms_str = wb_fb_soup.find('div',{'id':'content'}).text.strip()

# Inspect
wb_fb_terms_str

We could use a really dumb tokenizer, [`.split()`](https://docs.python.org/3.7/library/stdtypes.html#str.split), to count the number of words in these terms.

In [None]:
len(wb_fb_terms_str.split())
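Splitting on whitespace counts stray punctuation like a lone `--` as a word. A slightly less dumb sketch counts runs of letters, digits, and apostrophes instead. `count_words` is a hypothetical helper and the sample sentence is made up for illustration.

```python
import re

def count_words(text):
    """Count runs of letters, digits, and apostrophes as words."""
    return len(re.findall(r"[A-Za-z0-9']+", text))

sample = 'You agree to the Terms -- all of them.'

# The whitespace split counts the stray '--' as a word; the regex does not
print(len(sample.split()), count_words(sample))
```

For tracking how a document grows over time, either count is probably fine as long as we use the same one for every snapshot.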

Write a function that loops over a range of dates and finds a snapshot of a page, such as Facebook's ToS, for each month.

In [None]:
def get_urls(url_str,start_date='2005-01-01',end_date='2019-01-01',freq='M'):
    
    # Make the list of dates
    date_l = pd.date_range(start_date,end_date,freq=freq)
    
    # Create an empty container to store our data
    urls = dict()

    # For each date in the list of dates
    for date in date_l:
        
        # Turn the date object back into a string
        date_str = datetime.strftime(date,'%Y%m%d%H%M%S')
        
        # Define the API URL request to the Wayback machine
        wb_api_url = 'http://archive.org/wayback/available?url={0}&timestamp={1}'
        
        # Format the API URL with the URL of the website and the closest datetime
        wb_api_request = wb_api_url.format(url_str,date_str)
        
        # Make the request
        r = requests.get(wb_api_request).json()

        # Check if the returned request has all the right parts (this is probably overkill)
        if 'archived_snapshots' in r.keys():
            if 'closest' in r['archived_snapshots'].keys():
                if 'url' in r['archived_snapshots']['closest'].keys():
                    
                    # If it does have all the right parts, get the URL
                    _url = r['archived_snapshots']['closest']['url']
                    
                    # Get the timestamp
                    _timestamp = r['archived_snapshots']['closest']['timestamp']
                    
                    # Save to our URL dictionary with the timestamp of the snapshot as key, the url as value
                    urls[_timestamp] = _url
    return urls

Run our function to make a dictionary that maps each snapshot's timestamp to the Wayback Machine URL for that month's version of the terms of service. We'll write a loop to get the Terms for each snapshot and count the words.

This will take a few minutes. 

I've converted the code block into a "Raw" cell to prevent accidental execution. You can always turn it into a "Code" cell if you really want to run it.

To avoid having everyone hit the Internet Archive server with the same requests, you can also load this file with the same data.

In [None]:
with open('facebook_tos_archive.json','r') as f:
    fb_terms_wordcount2 = json.load(f)
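For reference, here is a rough sketch of a loop that could produce a file like this one. It is not the original "Raw" cell. The function names are hypothetical, and I am assuming the file holds a dictionary mapping snapshot timestamps to word counts. The downloading and parsing step is passed in as `fetch_text` so the counting logic can be tried without touching the Internet Archive.

```python
import json

def count_words_per_snapshot(urls, fetch_text):
    """Map each snapshot timestamp to the word count of the text at its URL."""
    word_counts = {}
    for timestamp, url in sorted(urls.items()):
        try:
            text = fetch_text(url)
        except Exception as err:
            # Skip snapshots that fail to download or parse, but report them
            print('Skipping', timestamp, repr(err))
            continue
        word_counts[timestamp] = len(text.split())
    return word_counts

def save_word_counts(word_counts, path):
    """Write the timestamp-to-word-count dictionary to a JSON file."""
    with open(path, 'w') as f:
        json.dump(word_counts, f)
```

Here `urls` would be the dictionary returned by `get_urls(fb_tos)`, and `fetch_text` would be a function that requests a snapshot URL, soup-ifies it, and returns the text of the `content` div as we did for the single snapshot above, ideally with a pause between requests.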

Visualize the changes in the size of Facebook's Terms of Service over time.

In [None]:
# Turn the dictionary into a pandas Series
fb_terms_s = pd.Series(fb_terms_wordcount2)

# Convert the index to datetime objects
fb_terms_s.index = pd.to_datetime(fb_terms_s.index)

# Plot
ax = fb_terms_s.plot()

# Make the x-tick labels less weird
ax.set_xticklabels(range(2004,2019,2),rotation=0,horizontalalignment='center')

# Always label your axes
ax.set_ylabel('Word count');