# Introduction to Data Collection and Extraction
In this primer we will go over a number of techniques for collecting our own data from various sources.

## 1. Using HTTP requests and API

### 1.1. HTTP request basics
An application programming interface (API) is a contract for communication between pieces of software; it defines, among other things, the kinds of requests that can be made, how to make them, and the data formats that can be used. Many online services provide their own APIs for developers to request data from (for example, [Facebook](https://developers.facebook.com/docs/graph-api/), [Twitter](https://developer.twitter.com/en/docs) and [Yelp](https://www.yelp.com/developers)). A common way of interacting with these APIs is to send an HTTP GET request to the provided URL.

To make a request, we use `requests.get()` and provide the URL string:

In [48]:
import requests

response = requests.get("https://api.github.com")
response

<Response [200]>

We can then examine different attributes of the `response` object to see what is returned by our request. Here are some important attributes to pay attention to:

* `status_code` indicates the HTTP status code of the request. There are many status codes that represent different response states -- you can see the full list on [Wikipedia](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes). Here is a handy summary table:

<br>

| Status code 	| Meaning 	| What the server is basically saying 	|
|-	|-	|-	|
| 1xx - informational response 	| Request was received, continuing process 	| "Hold on" 	|
| 2xx - successful 	| Request was successfully received and accepted 	| "Here you go" 	|
| 3xx - redirect 	| Further action needs to be taken elsewhere 	| "Go away" 	|
| 4xx - client error 	| The request cannot be fulfilled 	| "You messed up" 	|
| 5xx - server error 	| The server failed to fulfill the request 	| "I messed up :( " 	|
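
As a rough illustration of the table above, the leading digit of a status code is enough to tell which class it belongs to. The helper name `status_class` below is our own and is not part of `requests`; the sketch uses only the standard library.

```python
from http import HTTPStatus

def status_class(code):
    """Map an HTTP status code to the class names used in the table above."""
    classes = {
        1: "informational response",
        2: "successful",
        3: "redirect",
        4: "client error",
        5: "server error",
    }
    # Integer division by 100 keeps only the leading digit (e.g., 404 -> 4)
    return classes.get(code // 100, "unknown")

for code in (200, 301, 404, 503):
    # HTTPStatus(code).phrase is the standard reason phrase, e.g., "Not Found"
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```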

In [49]:
response.status_code

200

Generally a 200 status code is what we would expect from a successful request. Since we did get 200 from our request, we can go ahead and examine other attributes:

* `headers` contains meta-information about the page; this typically includes information not related to the main content that may be useful, e.g., server, content type, and cache control.

In [50]:
response.headers

{'server': 'GitHub.com', 'date': 'Sun, 21 Jun 2020 07:10:30 GMT', 'content-type': 'application/json; charset=utf-8', 'status': '200 OK', 'cache-control': 'public, max-age=60, s-maxage=60', 'vary': 'Accept, Accept-Encoding, Accept, X-Requested-With', 'etag': 'W/"c6bac8870a7f94b08b440c3d5873c9ca"', 'x-github-media-type': 'github.v3; format=json', 'access-control-expose-headers': 'ETag, Link, Location, Retry-After, X-GitHub-OTP, X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset, X-OAuth-Scopes, X-Accepted-OAuth-Scopes, X-Poll-Interval, X-GitHub-Media-Type, Deprecation, Sunset', 'access-control-allow-origin': '*', 'strict-transport-security': 'max-age=31536000; includeSubdomains; preload', 'x-frame-options': 'deny', 'x-content-type-options': 'nosniff', 'x-xss-protection': '1; mode=block', 'referrer-policy': 'origin-when-cross-origin, strict-origin-when-cross-origin', 'content-security-policy': "default-src 'none'", 'content-encoding': 'gzip', 'X-Ratelimit-Limit': '60', 'X-Ratelim

* `content` and `text` typically hold the response information that you are interested in. `text` is a Unicode string and `content` is a byte string. When the response is textual, you should use `.text` so that `requests` can decode the bytes using the encoding it infers from the response. When the response is binary (for example, if you send a GET request to an image link), you should use `.content`.

In [51]:
print(response.content)
print("\n\n")
print(response.text)

b'{"current_user_url":"https://api.github.com/user","current_user_authorizations_html_url":"https://github.com/settings/connections/applications{/client_id}","authorizations_url":"https://api.github.com/authorizations","code_search_url":"https://api.github.com/search/code?q={query}{&page,per_page,sort,order}","commit_search_url":"https://api.github.com/search/commits?q={query}{&page,per_page,sort,order}","emails_url":"https://api.github.com/user/emails","emojis_url":"https://api.github.com/emojis","events_url":"https://api.github.com/events","feeds_url":"https://api.github.com/feeds","followers_url":"https://api.github.com/user/followers","following_url":"https://api.github.com/user/following{/target}","gists_url":"https://api.github.com/gists{/gist_id}","hub_url":"https://api.github.com/hub","issue_search_url":"https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}","issues_url":"https://api.github.com/issues","keys_url":"https://api.github.com/user/keys","label_sea

You may notice that the above response is not particularly easy to read, because it is in JSON (JavaScript Object Notation) format. Fortunately, JSON integrates with Python well, and you can seamlessly convert from a JSON string to a Python dictionary or vice versa:

In [52]:
import json
json.loads(response.text)

{'current_user_url': 'https://api.github.com/user',
 'current_user_authorizations_html_url': 'https://github.com/settings/connections/applications{/client_id}',
 'authorizations_url': 'https://api.github.com/authorizations',
 'code_search_url': 'https://api.github.com/search/code?q={query}{&page,per_page,sort,order}',
 'commit_search_url': 'https://api.github.com/search/commits?q={query}{&page,per_page,sort,order}',
 'emails_url': 'https://api.github.com/user/emails',
 'emojis_url': 'https://api.github.com/emojis',
 'events_url': 'https://api.github.com/events',
 'feeds_url': 'https://api.github.com/feeds',
 'followers_url': 'https://api.github.com/user/followers',
 'following_url': 'https://api.github.com/user/following{/target}',
 'gists_url': 'https://api.github.com/gists{/gist_id}',
 'hub_url': 'https://api.github.com/hub',
 'issue_search_url': 'https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}',
 'issues_url': 'https://api.github.com/issues',
 'keys_url': '

Note that you can also call `response.json()` to get the above output, without having to import the `json` package. However, this package is also useful for [many functionalities](https://docs.python.org/3/library/json.html) other than displaying response content, so you are encouraged to import it every time you work with HTTP requests.
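
A small sketch of the conversions mentioned above, using a short hand-written JSON string (not a real API response) in place of `response.text`:

```python
import json

# A hand-written JSON string standing in for response.text (illustrative only)
raw = '{"emojis_url": "https://api.github.com/emojis", "count": 2, "ok": true}'

data = json.loads(raw)          # JSON string -> Python dict
print(type(data).__name__)
print(data["emojis_url"])

# Python dict -> JSON string; indent makes the output easier to read
pretty = json.dumps(data, indent=2, sort_keys=True)
print(pretty)
```

Note how the JSON literal `true` becomes the Python value `True` after parsing, and is written back as `true` by `json.dumps`.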

### 1.2. HTTP request customization
It is common to pass parameters with your requests in order to customize the content you get back. With the `requests` library, we can specify a `params` argument in the form of a dictionary. These parameters are then URL-encoded and appended to the base URL (the API endpoint) as a query string.
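
To get a feel for what happens to the `params` dictionary, the sketch below builds a query string by hand with the standard library's `urllib.parse.urlencode`. This only approximates what `requests` does internally, and the parameter values are chosen purely for illustration.

```python
from urllib.parse import urlencode

base_url = "https://api.github.com/search/repositories"
params = {"q": "language:python", "per_page": 5}  # illustrative values

# Each key=value pair is percent-encoded and the pairs are joined with "&"
query_string = urlencode(params)
full_url = base_url + "?" + query_string
print(full_url)
```

With `requests` itself, you can inspect `response.url` after a call to see the final URL that was actually requested.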

How the keys and values are specified depends on the specific API; typically the API documentation provides detailed guidance. In this case, let's say we want to search for public GitHub repositories that use Python. We can consult the [Search API page](https://developer.github.com/v3/search/#search-repositories) and the [List of qualifiers](https://help.github.com/en/github/searching-for-information-on-github/searching-for-repositories) to compose a search query:

In [53]:
URL = "https://api.github.com/search/repositories"
      
# defining a params dict for the parameters to be sent to the API
PARAMS={'q': 'requests+language:python'}
  
# sending get request and saving the response as response object 
response = requests.get(url = URL, params = PARAMS)
print(response.status_code)

200


The status code indicates a success -- let's see what the response content is:

In [54]:
response.text[:2000]

'{"total_count":10503,"incomplete_results":false,"items":[{"id":4290214,"node_id":"MDEwOlJlcG9zaXRvcnk0MjkwMjE0","name":"grequests","full_name":"spyoungtech/grequests","private":false,"owner":{"login":"spyoungtech","id":15212758,"node_id":"MDQ6VXNlcjE1MjEyNzU4","avatar_url":"https://avatars2.githubusercontent.com/u/15212758?v=4","gravatar_id":"","url":"https://api.github.com/users/spyoungtech","html_url":"https://github.com/spyoungtech","followers_url":"https://api.github.com/users/spyoungtech/followers","following_url":"https://api.github.com/users/spyoungtech/following{/other_user}","gists_url":"https://api.github.com/users/spyoungtech/gists{/gist_id}","starred_url":"https://api.github.com/users/spyoungtech/starred{/owner}{/repo}","subscriptions_url":"https://api.github.com/users/spyoungtech/subscriptions","organizations_url":"https://api.github.com/users/spyoungtech/orgs","repos_url":"https://api.github.com/users/spyoungtech/repos","events_url":"https://api.github.com/users/spyoungt

Our response contains around 10000 results, where each result describes a GitHub repository that contains Python code. Let's check out the `html_url` of the first result to confirm:

https://github.com/spyoungtech/grequests

This is indeed a Python repository (and a popular one!), so we got what we wanted.
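
The cell that produced the link above is not shown; one way to pull out the same field, sketched here on a trimmed, hand-copied fragment of the response rather than a live request, is to parse the JSON and index into the `items` list:

```python
import json

# Trimmed fragment of the search response shown above (most fields omitted)
sample_text = (
    '{"total_count": 10503, "incomplete_results": false, '
    '"items": [{"name": "grequests", '
    '"full_name": "spyoungtech/grequests", '
    '"html_url": "https://github.com/spyoungtech/grequests"}]}'
)

results = json.loads(sample_text)   # with a live response: response.json()
first_repo = results["items"][0]    # "items" holds one dict per repository
print(first_repo["html_url"])
```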

Note that in the majority of cases, you also need an API key to authenticate your requests; this key can be obtained by manually signing up for a developer account. For example, to send a request to the Yelp API, you can request a key [here](https://www.yelp.com/developers/documentation/v3/authentication), save it in a file called `api_key.txt`, then pass its content in the request header:

In [56]:
# read the API key from a local file and make sure the file gets closed
with open("api_key.txt") as f:
    api_key = f.read().strip()
headers = {'Authorization': 'Bearer ' + api_key}
params = {"location" : "Pittsburgh"}
response = requests.get("https://api.yelp.com/v3/businesses/search", params = params, headers = headers)
print(response.status_code)
response.text[:2000]

200


'{"businesses": [{"id": "JLbgvGM4FXh9zNP4O5ZWjQ", "alias": "meat-and-potatoes-pittsburgh", "name": "Meat & Potatoes", "image_url": "https://s3-media4.fl.yelpcdn.com/bphoto/Rc_WMhLcgKSAJnsitlJj1g/o.jpg", "is_closed": false, "url": "https://www.yelp.com/biz/meat-and-potatoes-pittsburgh?adjust_creative=5x_un5xfjVKExwcT_HwPvA&utm_campaign=yelp_api_v3&utm_medium=api_v3_business_search&utm_source=5x_un5xfjVKExwcT_HwPvA", "review_count": 2023, "categories": [{"alias": "gastropubs", "title": "Gastropubs"}], "rating": 4.0, "coordinates": {"latitude": 40.4431445862832, "longitude": -80.0011037907791}, "transactions": ["delivery", "restaurant_reservation"], "price": "$$$", "location": {"address1": "649 Penn Ave", "address2": "", "address3": "", "city": "Pittsburgh", "zip_code": "15222", "country": "US", "state": "PA", "display_address": ["649 Penn Ave", "Pittsburgh, PA 15222"]}, "phone": "+14123257007", "display_phone": "(412) 325-7007", "distance": 1870.1784637351072}, {"id": "woXlprCuowrLJswWer

Here we have searched for all restaurants in Pittsburgh, and the status code 200 indicates a successful query. The returned JSON maps the key `businesses` to a list of JSON objects, one for each restaurant. Let's see how many there are in total:

In [57]:
businesses = json.loads(response.text)['businesses']
len(businesses)

20

Surprisingly we only get 20 results, but there should be more than 20 restaurants in Pittsburgh! This is due to *pagination*, a safeguard against returning too much data in a single request, which would impose a heavy burden on the server. Instead, the API divides the results into several pages and returns one page per request. To get all the results we want, we then need to send multiple requests, each with a different "page number" parameter.

Let's start by recording the 20 business ids that we got from the last request:

In [58]:
print([b['id'] for b in businesses])

['JLbgvGM4FXh9zNP4O5ZWjQ', 'woXlprCuowrLJswWere3TQ', 'dLc1d1zwd1Teu2QED5TmlA', 'SmkYLXEYhzwUZdS6TAevHg', 'j83FLfnjvSHw-vrCNHuTCg', 'NoF90rswXBHESSyDaWeKKA', 'MKYcOZSpMwJK7uwacK13EA', 'd2ZQRjuizstCTnicysmpMQ', 'BcLFIr4wtd3GQ3fnz15yDQ', 'S8urdN6ACnRQUm-7o9An8w', 'VgWvHMs5TJ2lr0ec2DaMtQ', 'sc__kdcFV4IcNTfBx1707w', 'xcmmTXhuMx2fZF2Bt69F4w', 'RKVYQ00LvK0_FO6Ll7lvOg', 'hcFSc0OHgZJybnjQBrL_8Q', 'Ul6JwluSTm12PVDIqnNaTg', 'LQFmktF43j2NPncKdNd9mg', '4mYS-4UOjTKgsf0tX1_IkQ', 'Cf0iV72DTqR0ggBje2d0sg', '0PCBt3JKD6IooicImKNBzA']


Now we send another query to request the next "page" of businesses. For the Yelp API, this can be done by the `offset` parameter, which specifies the starting index of our restaurant list. Since we already got the restaurants from 0 to 19, we can now set `offset` to 20:

In [59]:
params["offset"] = 20
response = requests.get("https://api.yelp.com/v3/businesses/search", params = params, headers = headers)
print([b['id'] for b in json.loads(response.text)["businesses"]])

['sMzNLdhJZGzYirIWt-fMAg', 'gldPX9ANF5Nic0N7igu2og', 'wmCBxE0PfLZD8sxIwAY59Q', 'Z2NIEKTVVP2nEmOF4jNxqQ', 'eIedjt0mHVKDFmfhtfKSAQ', 'lKom12WnYEjH5FFemK3M1Q', 'VxRlBe2wjtycFWSZm1orTA', 'LsjYGLrWe6psx1y5m2J-4A', 'PU-CSnMYXizOS3uhr316eg', 'dQj5DLZjeDK3KFysh1SYOQ', 'SvCjBtbN1cKElDKPTw9dOA', '0bjFYstj8USMzEV4ZQldjA', 'efSbCWuU0FJbLmPC5CDfdg', 'bXCWON2Me0o86qvAb-XZPQ', 'oeW0vIYd3rUnAPgmD4fEFg', 'ejaUQ1hYo7Q7xCL1HdPINw', '7z2x16M7IuG8KPfMsyVrKA', 'Fozo0B-y42EhRMomR0K5vQ', 'BskUTTscZ1XGa9ev7TlfeQ', 'dRbmeC5hl211hH-WeMtv-g']


And we indeed have 20 different ids. In general you typically want to collect a large number of restaurants (say 10000), and you can use a for-loop to send multiple requests. However, make sure to keep in mind the *rate limit*: you should pause slightly (at least 0.2 seconds) between consecutive requests so as to not overwhelm the API and get blocked. The rate limit is also API-dependent; for example, with the [GitHub API](https://developer.github.com/v3/search/#search-repositories), you can send up to 30 authenticated requests or 10 unauthenticated requests per minute (so you would need a longer pause than 0.2 seconds between requests).
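
Putting pagination and rate limiting together, a collection loop might look like the sketch below. The function name `collect_businesses` and the `fetch_page` callable are hypothetical; in practice `fetch_page` would wrap the `requests.get` call shown earlier (with the `offset` parameter) and return the parsed `businesses` list. Here a fake in-memory version is used so the loop can be run without an API key.

```python
import time

def collect_businesses(fetch_page, total, page_size=20, pause=0.2):
    """Collect up to `total` results by requesting one page at a time."""
    results = []
    for offset in range(0, total, page_size):
        page = fetch_page(offset)        # one request per page
        if not page:                     # an empty page means no more data
            break
        results.extend(page)
        time.sleep(pause)                # pause to stay under the rate limit
    return results[:total]

# Fake data source with 45 ids, standing in for the real API (illustrative only)
fake_ids = ["id_%d" % i for i in range(45)]

def fake_fetch_page(offset, page_size=20):
    return fake_ids[offset:offset + page_size]

collected = collect_businesses(fake_fetch_page, total=100, pause=0.0)
print(len(collected))
```

Passing the page-fetching step in as a function keeps the loop independent of any particular API, so the same structure can be reused with a different `pause` for services with stricter limits.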

## 2. Web scraping with BeautifulSoup
While an API provides a convenient way to collect data, not all websites have a dedicated API set up for developers (in fact, most do not). In many cases, the only data available to us is the source code of the same HTML page that we would see through the web browser. The good news is that this HTML content may still contain the data we need; the bad news is that we now have to parse it ourselves. Fortunately, the `BeautifulSoup` library in Python makes this task a lot easier.

Before we start, if you aren't familiar with HTML and CSS, here is a [CSS selector cheatsheet](https://gist.github.com/magicznyleszek/809a69dd05e1d5f12d01) that may come in handy when you go through the materials below and Project 3.

### 2.1. BeautifulSoup basics
To begin parsing, we can input the HTML string of a page (obtained through a GET request) to a BeautifulSoup object. We can also specify the parser for this string; the recommended option is `html.parser`. For now let's create our own simple HTML source code to play with:

In [12]:
from bs4 import BeautifulSoup

html = """
<div class="container" id="title">
    <div class="row" hello_attr="hello">
        <p onclick="execute();">
            <img src="pic.png"/>
            Click me
            <br />
        </p>
    </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

Scraping an HTML page typically involves identifying the traversal path through the page structure to what you want, and then providing that path to BeautifulSoup using HTML tags and CSS selectors. There are three common methods you can use:

* `find` returns the first HTML tag which is a descendant of the caller and satisfies the search condition, or `None` if no match is found. Typically you would provide the tag name and any custom attribute that your targeted tag has.

In [13]:
# look for the first p element
soup.find("p")

<p onclick="execute();">
<img src="pic.png"/>
            Click me
            <br/>
</p>

In [14]:
# look for the first div element whose id is "title"
soup.find("div", id = "title")

<div class="container" id="title">
<div class="row" hello_attr="hello">
<p onclick="execute();">
<img src="pic.png"/>
            Click me
            <br/>
</p>
</div>
</div>

In [15]:
# look for the first div element whose custom attribute "hello_attr" has value "hello"
soup.find("div", {"hello_attr" : "hello"})

<div class="row" hello_attr="hello">
<p onclick="execute();">
<img src="pic.png"/>
            Click me
            <br/>
</p>
</div>

Note that here we are using a dictionary to specify which attribute to look for and what its corresponding value is. Attributes can also be passed directly as keyword arguments, which is convenient for common ones such as `id` or `itemprop`. For example, in the second code block above we used `find("div", id = "title")`, which is equivalent to `find("div", {"id" : "title"})`. Note that if you want to find by class name, the corresponding parameter is `class_`, not `class`, because the latter is a reserved Python keyword. In addition, if you want to search for a tag that has an id, regardless of what the id is, you can simply use `.find(id = True)`.
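
The variants described in the paragraph above can be compared side by side. The sketch below uses a shortened copy of the toy HTML from earlier and assumes the `bs4` package is installed:

```python
from bs4 import BeautifulSoup

html = """
<div class="container" id="title">
    <div class="row" hello_attr="hello">
        <p onclick="execute();">Click me</p>
    </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Keyword argument and dictionary forms locate the same tag
by_keyword = soup.find("div", id="title")
by_dict = soup.find("div", {"id": "title"})
print(by_keyword == by_dict)

# class_ (with a trailing underscore) filters on the class attribute
row = soup.find("div", class_="row")
print(row["hello_attr"])

# id=True matches the first tag that has an id attribute, whatever its value
first_with_id = soup.find(id=True)
print(first_with_id["id"])
```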

* `find_all` returns a list of HTML tags that are descendants of the caller and satisfy the search condition, or an empty list if no match is found. It accepts the same input parameters as `find`.

In [16]:
# find all divs
soup.find_all("div")

[<div class="container" id="title">
 <div class="row" hello_attr="hello">
 <p onclick="execute();">
 <img src="pic.png"/>
             Click me
             <br/>
 </p>
 </div>
 </div>,
 <div class="row" hello_attr="hello">
 <p onclick="execute();">
 <img src="pic.png"/>
             Click me
             <br/>
 </p>
 </div>]

In [17]:
# find all p tags whose class is "hello"
# empty list because no match is found
soup.find_all("p", class_ = "hello")

[]

* `select` works similarly to `find_all` in that it returns a list of matches, or an empty list if no match is found. The difference is that it only takes a CSS selector string as input. If you are familiar with CSS, `.select` is a good option to concisely specify exactly what you need. Note that older versions of BeautifulSoup did not support some CSS selectors, such as chained attribute selectors like `a[href][src]`; however, what is supported should be sufficient for many web scraping tasks.

In [18]:
# find all div tags whose id is "title"
# for each div tag from the last step, find all div tags whose class is "row"
# for each div tag from the last step, find all p tags
soup.select("div#title div.row p")

[<p onclick="execute();">
 <img src="pic.png"/>
             Click me
             <br/>
 </p>]

To perform the same task using `find_all`, we would need to chain multiple calls:

In [19]:
[
    z
    for x in soup.find_all("div", id = "title")
    for y in x.find_all("div", class_ = "row")
    for z in y.find_all("p")
]

[<p onclick="execute();">
 <img src="pic.png"/>
             Click me
             <br/>
 </p>]

Or in this specific case, we can use `.find` since there are no duplicate instances of any tag:

In [20]:
soup.find("div", id = "title").find("div", class_ = "row").find("p")

<p onclick="execute();">
<img src="pic.png"/>
            Click me
            <br/>
</p>

You may be wondering why we need to specify three levels of HTML tags when a simple `soup.find("p")` would be enough. In this case it actually is enough, but in the more general case, a website can contain many `p` tags at different locations -- in order to avoid capturing tags you don't intend to, you would need to be as specific about your input path as possible.

Once we have the targeted element / list of elements, we can call some methods to extract their content. The most common method is `get_text()`, which gives the inner text content:

In [21]:
soup.find("p").get_text()

'\n\n            Click me\n            \n'

This is now a normal Python string that you can process, for example by calling `.strip()` to remove leading and trailing whitespace. To get an element attribute, you can also use indexing:

In [22]:
soup.find("p")["onclick"]

'execute();'
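
As a small follow-up sketch (assuming the `bs4` package is installed and reusing the toy HTML from above), the snippet below cleans up the extracted text and shows the difference between indexing and `.get()` when an attribute is missing:

```python
from bs4 import BeautifulSoup

html = """
<div class="container" id="title">
    <div class="row" hello_attr="hello">
        <p onclick="execute();">
            <img src="pic.png"/>
            Click me
            <br />
        </p>
    </div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
p = soup.find("p")

# strip() removes the surrounding newlines and indentation
clean_text = p.get_text().strip()
print(clean_text)

# Indexing raises KeyError for a missing attribute; .get() returns None instead
onclick = p.get("onclick")
missing = p.get("href")
print(onclick, missing)
```

Using `.get()` is convenient when scraping many elements where only some of them carry the attribute you are after.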

You should also check out the [official BeautifulSoup documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/). This is a self-contained tutorial with many useful examples and advanced scraping functions.

### 2.2. BeautifulSoup example
Now let's do an example where we want to collect the names of all faculty members in the MCDS program. We first retrieve the HTML code of the [faculty page](https://mcds.cs.cmu.edu/directory/all/154/1):

In [60]:
url = "https://mcds.cs.cmu.edu/directory/all/154/1"
soup = BeautifulSoup(requests.get(url).text, "html.parser")
print(soup.prettify()[:2000])

<!DOCTYPE html>
<!--[if IEMobile 7]><html class="iem7"  lang="en" dir="ltr"><![endif]-->
<!--[if lte IE 6]><html class="lt-ie9 lt-ie8 lt-ie7"  lang="en" dir="ltr"><![endif]-->
<!--[if (IE 7)&(!IEMobile)]><html class="lt-ie9 lt-ie8"  lang="en" dir="ltr"><![endif]-->
<!--[if IE 8]><html class="lt-ie9"  lang="en" dir="ltr"><![endif]-->
<!--[if (gte IE 9)|(gt IEMobile 7)]><!-->
<html dir="ltr" lang="en">
 <!--<![endif]-->
 <head>
  <meta charset="utf-8"/>
  <link href="https://mcds.cs.cmu.edu/sites/all/themes/mcds2015/favicon.ico" rel="shortcut icon" type="image/vnd.microsoft.icon"/>
  <meta content="Drupal 7 (http://drupal.org)" name="Generator"/>
  <title>
   Master of Computational Data Science - Faculty | Carnegie Mellon University - Language Technologies Institute
  </title>
  <meta content="width" name="MobileOptimized"/>
  <meta content="true" name="HandheldFriendly"/>
  <meta content="width=device-width" name="viewport"/>
  <!--[if IEMobile]><meta http-equiv="cleartype" content="on

Let's open the same page in a web browser, so that we can investigate the HTML structure. We can right click on the name of the first faculty member on the list, George Amvrosidias, and select "Inspect" to check which HTML tag contains that name.

![mcds_faculty_1](http://clouddatascience.blob.core.windows.net/m20-foundation-data-science/p3-data-collection-extraction-primer/mcds_faculty_1.png)

We see that our target tag is an `<h2>`. Let's start with finding all `h2` tags:

In [24]:
soup.find_all("h2")

[<h2 class="element-invisible">Search form</h2>,
 <h2>George Amvrosidias</h2>,
 <h2>Jeffrey Bigham</h2>,
 <h2>Jamie Callan</h2>,
 <h2>William Cohen</h2>,
 <h2>Lorrie Cranor</h2>,
 <h2>Christos Faloutsos</h2>,
 <h2>Kayvon Fatahalian</h2>,
 <h2>Gregory Ganger</h2>,
 <h2>Matthias Grabmair</h2>,
 <h2>Geoff Kauffman</h2>,
 <h2>Aniket Kittur</h2>,
 <h2>Robert Kraut</h2>,
 <h2>Jennifer Lucas</h2>,
 <h2>Todd Mowry</h2>,
 <h2>Daniel Neill</h2>,
 <h2>Eric Nyberg</h2>,
 <h2>Andrew Pavlo</h2>,
 <h2>Ronald Rosenfeld</h2>,
 <h2>Alexander Rudnicky</h2>,
 <h2>Majd Sakr</h2>,
 <h2 class="element-invisible">Pages</h2>,
 <h2>Contact Us</h2>,
 <h2>Connect</h2>]

We see that this does give us all the faculty names, but also some unneeded content like "Search form" and "Contact Us". We want to avoid having to remove these manually, so let's form a stricter search condition -- this time we search for all the `h2` tags inside an `a` tag:

In [25]:
soup.select("a h2")

[<h2>George Amvrosidias</h2>,
 <h2>Jeffrey Bigham</h2>,
 <h2>Jamie Callan</h2>,
 <h2>William Cohen</h2>,
 <h2>Lorrie Cranor</h2>,
 <h2>Christos Faloutsos</h2>,
 <h2>Kayvon Fatahalian</h2>,
 <h2>Gregory Ganger</h2>,
 <h2>Matthias Grabmair</h2>,
 <h2>Geoff Kauffman</h2>,
 <h2>Aniket Kittur</h2>,
 <h2>Robert Kraut</h2>,
 <h2>Jennifer Lucas</h2>,
 <h2>Todd Mowry</h2>,
 <h2>Daniel Neill</h2>,
 <h2>Eric Nyberg</h2>,
 <h2>Andrew Pavlo</h2>,
 <h2>Ronald Rosenfeld</h2>,
 <h2>Alexander Rudnicky</h2>,
 <h2>Majd Sakr</h2>]

And this works! In case it still doesn't, we can narrow down the scope even more, e.g., 

`select("span.field-content a h2")`

until we get exactly what is needed. As the last step, let's now extract the string names and put them in a list:

In [26]:
print([name.get_text() for name in soup.select("a h2")])

['George Amvrosidias', 'Jeffrey Bigham', 'Jamie Callan', 'William Cohen', 'Lorrie Cranor', 'Christos Faloutsos', 'Kayvon Fatahalian', 'Gregory Ganger', 'Matthias Grabmair', 'Geoff Kauffman', 'Aniket Kittur', 'Robert Kraut', 'Jennifer Lucas', 'Todd Mowry', 'Daniel Neill', 'Eric Nyberg', 'Andrew Pavlo', 'Ronald Rosenfeld', 'Alexander Rudnicky', 'Majd Sakr']


Note that the above list is not complete, since the faculty directory actually spans 2 pages. You can extract the URL to the second page using BeautifulSoup, send another GET request to that second page, and repeat the above process to fill in the remaining faculty members.
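
A minimal sketch of the first step, extracting the link to the next page, is given below. The pager markup and the `li.pager-next a` selector are hypothetical and only illustrate the idea; you would need to inspect the real page to find the actual structure. The sketch assumes the `bs4` package is installed.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical pager markup; inspect the real page to find the actual structure
page_html = """
<ul class="pager">
    <li class="pager-current">1</li>
    <li class="pager-next"><a href="/directory/all/154/1?page=1">next</a></li>
</ul>
"""
current_url = "https://mcds.cs.cmu.edu/directory/all/154/1"

page_soup = BeautifulSoup(page_html, "html.parser")
next_link = page_soup.select_one("li.pager-next a")

# Links are often relative, so resolve them against the current page URL
next_url = urljoin(current_url, next_link["href"]) if next_link else None
print(next_url)
```

The resulting `next_url` can then be passed to `requests.get` and parsed in the same way as the first page.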

As another note, if you look at the top portion of the HTML code in the screenshot above, you will see a `div` tag with a very long class name `view-dom-id-55a0ca...` If this looks like a computer-generated name to you, that is because it is in fact computer-generated, and you may see a different name if you access the same webpage at a different time. Therefore, make sure you do not hard-code these names in your BeautifulSoup searches.

## 3. Navigating dynamic web pages with Selenium

### 3.1. Introduction
BeautifulSoup does a good job when you have all of the content needed in the HTML page source, i.e., when you request a *static* webpage. However, thanks to advances in JavaScript (😢) many modern webpages are *dynamic* -- their content will change based on user interaction. In other words, you may need to perform some action on the webpage itself, such as clicking a button or entering some text.

As an example, consider the following search page: https://www.foxnews.com/search-results/search?q=president
We see that the search results are ordered by recency. Let's say we want to search for all articles from March 2020. We can first try doing it manually to see what the page interaction looks like, by clicking on the "By Content" and "Date Range" boxes:
![fox_news](http://clouddatascience.blob.core.windows.net/m20-foundation-data-science/p3-data-collection-extraction-primer/fox_news.png)

When clicking on Search, we see that the search results are updated to indeed show articles in March; however, the page URL doesn't change at all. In other words, there are no GET parameters for content type or date range, so in order to filter our result by these attributes, we need to access the page and do the button clicking ourselves, which is not scalable.

This is when Selenium comes in handy. Selenium is an open source automated testing suite for web application testing across different platforms. The basic idea is that if you are developing a website and you expect an element X to show up when the user clicks on button Y, you can use Selenium to simulate the Y button clicking and to then check the presence of X. Note that it was not built for web scraping, but its functionality is exactly what we want in this case, so let's go ahead and get started!

### 3.2. Installation
After installing the `selenium` Python package, the next steps depend on whether you are running Jupyter locally or in an online environment such as Google Colab. Note that Selenium **does not work on Azure Notebook**, since we do not have permission to install packages via `apt-get` in the Azure terminal.

#### 3.2.1. For local Jupyter notebook
Other than the `selenium` package itself, you will also need a web browser and the corresponding webdriver. The browsers that work best with Selenium are either some version of Chrome (Google Chrome or Chromium) or Firefox.

If you use Chrome, head to [this page](https://chromedriver.chromium.org/downloads) and download the appropriate webdriver zip file, based on your operating system and Chrome version. Then unzip it in the **same directory as your Jupyter notebook**; if you are on Mac OS or Linux, run the following Bash command:

In [27]:
!chmod +x ./chromedriver

If you use Firefox, head to [this page](https://github.com/mozilla/geckodriver/releases) and download the latest driver version (currently 0.26.0), based on your operating system. Then unzip it in the **same directory as your Jupyter notebook**; if you are on Mac OS or Linux, run the following Bash command:

In [28]:
!chmod +x ./geckodriver

#### 3.2.2. For Google Colab
If you use Colab, there is no need to download anything. Simply create this Bash cell at the beginning of the notebook and run it. If you use this in your Project 3 notebook, remember to tag this cell with `excluded_from_script` so that it doesn't get run by the autograder:

In [None]:
!pip install selenium
!apt-get update 
!apt-get install chromium-chromedriver
!cp /usr/lib/chromium-browser/chromedriver .

### 3.3. Selenium functionalities
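From now on we will assume that you have either the `chromedriver` or `geckodriver` executable in the same folder as your notebook. We will import the `webdriver` from `selenium` and initialize the two drivers as follows. Note that if you use Windows, you may have to append `.exe` to the `executable_path` parameter here. Also note that the code in this section is written against the Selenium 3 API; Selenium 4 replaced the `executable_path` argument with a `Service` object and the `find_element_by_*` methods with `find_element(By.CSS_SELECTOR, ...)`, so you may need to adapt these calls if you have a newer version installed: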

In [29]:
from selenium import webdriver

def init_chromedriver(debug = False):
    options = webdriver.ChromeOptions()
    if not debug:
        options.add_argument('--headless')
        options.add_argument('--no-sandbox')
        options.add_argument("--disable-setuid-sandbox")
    return webdriver.Chrome(executable_path = "./chromedriver", options = options)


def init_geckodriver(debug = False):
    options = webdriver.FirefoxOptions()
    if not debug:
        options.add_argument('--headless')
        options.add_argument('--no-sandbox')
        options.add_argument("--disable-setuid-sandbox")
    return webdriver.Firefox(executable_path = "./geckodriver", options = options)

We will use the Chrome driver for the rest of this primer; if you prefer Firefox, simply edit the following cell to call `init_geckodriver` instead:

In [30]:
driver = init_chromedriver()

Note: if you get an "executable must be in path" error from the above cell, again make sure that your driver executable is in the same directory as this `data_collection_extraction_primer.ipynb` file. You can run the following cell and check that the driver name shows up:

In [61]:
!ls

GiftOfTheMagi.pdf                       fox_news_content_type.png
api_key.txt                             fox_news_search_results.png
chromedriver                            geckodriver
data_collection_extraction_primer.ipynb mcds_faculty_1.png
fox_news.png                            text_data_processing_primer.ipynb
fox_news_articles_selected.png


All of the functionalities of the Python selenium webdriver are described in the [Selenium documentation](https://selenium-python.readthedocs.io/getting-started.html). In our context we are mostly interested in sections 3 (Navigating), 4 (Locating Elements), and 5 (Waits). In particular, here are the commonly used methods:

* **Page navigation**: `.get` navigates the webdriver to a given webpage, based on the input string URL.
* **Element search**: `.find_element_by_*` or `.find_elements_by_*` are roughly equivalent to the search functionalities in BeautifulSoup. Typically we only need to use `find_elements_by_css_selector` since it is the most flexible.
* **Element action**: once you get the returned object from one of the element search functions, you can call a number of methods to simulate actions on that object. Common element methods include `click`, `send_keys`, and `clear`; more complex gestures such as `double_click` are performed through `ActionChains`. See Section 7.2 of the [API](https://selenium-python.readthedocs.io/api.html) for their usage.

Once we have performed the necessary (simulated) user interactions, we can call `driver.page_source` to get the HTML string of the current state of the page, and pass this to BeautifulSoup for scraping as we did before.

Let's walk through an example of getting the first 10 articles about the term "president" in March from Fox News. We begin by directing the webdriver to the search URL above:

In [31]:
driver.get("https://www.foxnews.com/search-results/search?q=president")

Let's first inspect the "By Content" box and see how we can select the "Article" from the dropdown menu. We will record the actions we took manually:

1. Click on the "Select Content Type" box.
1. Click on the "Article" cell in the dropdown menu.

After these two actions, we get the resulting HTML structure as follows:

![content_type](http://clouddatascience.blob.core.windows.net/m20-foundation-data-science/p3-data-collection-extraction-primer/fox_news_content_type.png)

Now we translate the above actions into Selenium function calls and specify one possible traversal path to the "Article" cell as follows:

In [32]:
driver.get("https://www.foxnews.com/search-results/search?q=president")

content = driver.find_element_by_css_selector("div.filter.content")
content.click()
content.find_element_by_css_selector("ul.option>li>label>input[value=\"%s\"]"% "Article").click()
content.click()

html_content = driver.page_source

How do we check whether this works? We notice that in the browser, after we manually selected "Article", a purple "Article" cell pops up below the search query box:

![article_selected](http://clouddatascience.blob.core.windows.net/m20-foundation-data-science/p3-data-collection-extraction-primer/fox_news_articles_selected.png)

Since we recorded the HTML content of the page after our Selenium calls, we can pass it to BeautifulSoup and check that this "Article" cell is present:

In [33]:
soup = BeautifulSoup(html_content, "html.parser")
soup.find("div", class_ = "search-terms").find("li")

<li class="Article" id="Article">Article<span></span></li>

We do see the `<li>` Article cell from the screenshot, so we are good to go! Similarly, we can click on the other boxes and then the Search button to submit our query. Note that we will run `driver.get` again, as all Selenium code should be executed together in the same scope; if you run some code in one cell and the rest in another cell a moment later, you may get an "element not interactable" exception.

In [35]:
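import time

def autoclick(driver, min_month_val, min_day_val, max_month_val, max_day_val, year):
    """Apply the "Article" and date-range filters on the search page and return the resulting HTML.

    Month, day and year values are strings matching the ids of the dropdown options, e.g. "03".
    """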
    driver.get("https://www.foxnews.com/search-results/search?q=president")

    content = driver.find_element_by_css_selector("div.filter.content")
    content.click()
    content.find_element_by_css_selector("ul.option>li>label>input[value=\"%s\"]"% "Article").click()
    content.click()

    min_month = driver.find_element_by_css_selector("div.date.min div.sub.month")
    min_month.click()
    min_month.find_element_by_css_selector("ul.option>li[id=\"%s\"]"%min_month_val).click()

    min_day = driver.find_element_by_css_selector("div.date.min div.sub.day")
    min_day.click()
    min_day.find_element_by_css_selector("ul.option>li[id=\"%s\"]"%min_day_val).click()

    min_year = driver.find_element_by_css_selector("div.date.min div.sub.year")
    min_year.click()
    min_year.find_element_by_css_selector("ul.option>li[id=\"%s\"]"%year).click()

    max_month = driver.find_element_by_css_selector("div.date.max div.sub.month")
    max_month.click()
    max_month.find_element_by_css_selector("ul.option>li[id=\"%s\"]"%max_month_val).click()

    max_day = driver.find_element_by_css_selector("div.date.max div.sub.day")
    max_day.click()
    max_day.find_element_by_css_selector("ul.option>li[id=\"%s\"]"%max_day_val).click()

    max_year = driver.find_element_by_css_selector("div.date.max div.sub.year")
    max_year.click()
    max_year.find_element_by_css_selector("ul.option>li[id=\"%s\"]"%year).click()

    search = driver.find_element_by_css_selector("div.search-form a")
    search.click()

    # wait a bit for the new search results to load
    time.sleep(3)
    return driver.page_source

html_content = autoclick(driver, "03", "01", "03", "31", "2020")

Here is what the resulting search result page looks like. We see that every result article is represented by an `<article>` tag, with some inner tags for the content.

![fox_news_search_results](http://clouddatascience.blob.core.windows.net/m20-foundation-data-science/p3-data-collection-extraction-primer/fox_news_search_results.png)
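
A side note on the fixed `time.sleep(3)` at the end of `autoclick`: it wastes time when the results load quickly and is too short when they load slowly. Section 5 (Waits) of the Selenium documentation covers the built-in alternative, `WebDriverWait`. To illustrate the underlying idea without depending on a browser, here is a minimal polling helper; `wait_until` and its parameters are names made up for this sketch, not part of Selenium.

```python
import time


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        # wait a little before checking again
        time.sleep(poll_interval)
```

With the driver from the cells above, a call such as `wait_until(lambda: driver.find_elements_by_css_selector("article"))` would return as soon as at least one `<article>` element is present. Note that this particular condition would also be satisfied by results left over from a previous query, so a real script needs a condition that distinguishes old results from new ones.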

Let's first print out the published time of all articles in the result page, and make sure that they are in March (i.e., assuming we are in 06/2020, the time value should be "3 months ago"):

In [36]:
soup = BeautifulSoup(html_content, "html.parser")
articles = soup.find_all("article")
print([article.find("span", class_ = "time").get_text() for article in articles])

['3 months ago ', '3 months ago ', '3 months ago ', '3 months ago ', '3 months ago ', '3 months ago ', '3 months ago ', '3 months ago ', '3 months ago ', '3 months ago ']


That was indeed the case! Now let's check that the article titles are also the same as what we see from the screenshot:

In [37]:
[article.select("h2.title a")[0].get_text() for article in articles]

['Trump calls himself a ‘wartime president’ over coronavirus as he invokes Defense Production Act',
 'President Trump reveals he took coronavirus test',
 'Vice President Pence, wife test negative for COVID-19',
 "Brazil President Bolsonaro's son claims father tested negative for coronavirus despite earlier reports",
 "Former White House photographer dubs Andrew Cuomo 'acting president' amid COVID-19 outbreak",
 'Venezuela President Maduro wanted by DOJ for drug trafficking, Barr announces',
 'Fox News hosting virtual coronavirus town hall with President Trump, White House task force',
 'Trump downplays coronavirus threat, notes ‘common flu’ kills thousands every year',
 'LIVE BLOG: Fox News hosts virtual coronavirus town hall with President Trump',
 'Mexico suspends large gatherings over coronavirus days after its president urges residents to dine out']

These check out as well, so we are good to go! Note that Fox News sometimes gives inconsistent search results even with the same search query, so if you rerun this code, your output may not look like the screenshots above, but it should be consistent with what you see in your browser at the time of running.
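
The visual check on the published times can also be written as an assertion, which is convenient when the scraper is rerun later. The sketch below parses labels such as `'3 months ago '` (note the trailing space in the scraped text) and checks that every label has the expected age. The helper names are hypothetical, and the pattern only assumes the `<number> <unit>(s) ago` shape seen in the output above; any other label format the site may use is not handled.

```python
import re

# matches e.g. "3 months ago" or "1 month ago"; the unit is captured without a plural "s"
_AGE_PATTERN = re.compile(r"(\d+)\s+([a-z]+?)s?\s+ago", re.IGNORECASE)


def parse_relative_age(label):
    """Turn a label like '3 months ago ' into (3, 'month'); return None if it does not match."""
    match = _AGE_PATTERN.fullmatch(label.strip())
    if match is None:
        return None
    return int(match.group(1)), match.group(2).lower()


def all_have_age(labels, count, unit):
    """Return True if every label parses to exactly (count, unit)."""
    return all(parse_relative_age(label) == (count, unit) for label in labels)


labels = ['3 months ago ', '3 months ago ', '3 months ago ']
print(all_have_age(labels, 3, "month"))  # True
```

In the notebook, `labels` would be the list printed by the published-time cell above rather than a hard-coded sample.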

As a bonus, in the following cell we put together a copy of all the Selenium code above, but with `debug = True` in the driver initialization. When you run this cell, an actual browser window will pop up and you can see Selenium performing the actions in real time; it's quite cool!

In [39]:
driver = init_chromedriver(debug = True)
html_content = autoclick(driver, "03", "01", "03", "31", "2020")
print("Finished autoclicking!")

Finished autoclicking!


From this point, the next steps are:
1. Instead of extracting the article titles, you can extract their URLs, then visit each individual article page to parse its content.
1. Note that we only get 10 articles because the rest are hidden; there is a "Load More" button at the end of the page that we have to click on to get more search results. This step can also be done with Selenium.

In Project 3, you will get to practice with these two steps.
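
As a starting point for the second step, the loop below sketches the general pattern: keep clicking a button until enough results are present or the button is no longer found. It uses the same `find_elements_by_css_selector` call as the cells above. The function name and its parameters are assumptions made for illustration, and the button selector is deliberately left as an argument, since the actual selector has to be found by inspecting the page.

```python
import time


def load_more_results(driver, target_count, button_selector,
                      result_selector="article", pause=1.0, max_clicks=20):
    """Click the "load more" button until `target_count` results are on the page.

    Stops early if the button cannot be found or after `max_clicks` clicks,
    then returns the HTML of the page.
    """
    for _ in range(max_clicks):
        if len(driver.find_elements_by_css_selector(result_selector)) >= target_count:
            break
        buttons = driver.find_elements_by_css_selector(button_selector)
        if not buttons:
            # no button left: all available results are already shown
            break
        buttons[0].click()
        # give the page a moment to render the additional results
        time.sleep(pause)
    return driver.page_source
```

The `max_clicks` limit is a safety net so that the loop terminates even if the page never reaches the requested number of results.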

### 3.4. Dealing with cookies
Certain sites such as https://nytimes.com/ or http://towardsdatascience.com/ may require you to sign in after you have accessed the site a number of times. While this does not affect the Selenium operations if you use a headless browser (by setting `debug = False`), it may prevent you from visiting the site manually to explore the page's HTML structure. Typically your visit history to a site is stored in local cookies, so if you simply clear the cookies for that site, you will be able to access it again.

For example, you may see this popup when visiting an article on https://nytimes.com/:

![nytimes_blocked](http://clouddatascience.blob.core.windows.net/m20-foundation-data-science/p3-data-collection-extraction-primer/nytimes_blocked.png)

On Google Chrome you can reset cookies with the following steps:

1. Go to the Cookies page [chrome://settings/siteData?search=cookies](chrome://settings/siteData?search=cookies)
1. In the "Search cookies" box, type in "nytimes"
1. You will see some cookies associated with `nytimes.com` listed. Click on "Remove All Shown" to remove all of them.
1. On a new browser tab, reopen the NYT article you previously tried to access.

For other browsers, you can search for specific instructions on clearing cookies. For example, [this guide](https://support.mozilla.org/en-US/kb/clear-cookies-and-site-data-firefox) is for Firefox. You will perform web scraping on nytimes.com in Project 3, so keep this trick in mind.

### 3.5. Some closing notes on Selenium
While we have seen the power of Selenium in handling dynamic website interaction, it should be noted that Selenium was designed as a tool to test *your* own website, not to scrape *others*' websites. Many websites are not happy with you crawling their content in this manner, and they may either implement a captcha test to thwart the bots (in this case, your Selenium script) or block your IP outright. Therefore, use this tool carefully and responsibly. See [this article](https://levelup.gitconnected.com/web-scraping-with-selenium-in-python-8fde2f0fd559) for more details.

The websites that you will scrape in Project 3 are not too strict about this; just make sure to pause shortly between consecutive requests.
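
One simple way to pause between consecutive requests is to sleep for a short, slightly randomized interval before each request after the first. In the sketch below, `fetch_politely` and its parameters are hypothetical names, and `fetch` stands for whatever function retrieves a single page (for example, a wrapper around `driver.get` or `requests.get`). The default delays are arbitrary values, not a recommendation from any particular website.

```python
import random
import time


def fetch_politely(urls, fetch, min_pause=1.0, max_pause=3.0):
    """Call `fetch(url)` for each URL, sleeping a random interval between calls."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            # pause before every request except the first
            time.sleep(random.uniform(min_pause, max_pause))
        results.append(fetch(url))
    return results
```

Randomizing the pause avoids sending requests at a perfectly regular rhythm; a fixed `time.sleep` between requests would serve the same basic purpose.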

## 4. Parsing pdf files
We have spent much of this primer covering how to parse online content from APIs or websites. There is actually an equally rich data source: pdf files. PDF stands for "portable document format" and is typically used for distribution of read-only files. Nowadays it is perhaps the most ubiquitous type of file that can be shared and read across different platforms.

However, this popularity comes at a cost: because so many file formats can be converted to pdf (e.g., Word documents, Excel spreadsheets, Jupyter notebooks), a pdf file is much closer to an image than to a (structured) document, despite having the term "document" in its name. It mainly records where text and graphics are drawn on each page, not how the text is organized into paragraphs or tables. Libraries that parse pdf files therefore read the file in binary form and try to reconstruct the text from this layout information, which does not always work perfectly. In this section we will introduce one such library, called `pdfminer`. You can install it by running

`pip3 install pdfminer.six`

**Note**: do not run `pip3 install pdfminer`, since it will give you an [outdated version](https://github.com/euske/pdfminer). Make sure to install `pdfminer.six` instead; after this installation we can still `import pdfminer` as usual.

In [40]:
import pdfminer
from pdfminer.high_level import extract_text

The most straightforward way to extract text from a pdf file is by calling the `extract_text` function:

In [41]:
# download the file from here: http://clouddatascience.blob.core.windows.net/m20-foundation-data-science/p3-data-collection-extraction-primer/GiftOfTheMagi.pdf
extract_text("GiftOfTheMagi.pdf")

'T h e   G i\n\nf\n\nt\n\n  o f\n\n \n\nt h e   M a g i\n\np\n\nT h e   G i f t   o f   t h e   M a g i\n\nThat was all. She had put it aside, one cent and then another and then \n\nONE DOLLAR AND EIGHTY-SEVEN CENTS.     \n\nanother,  in  her  careful  buying  of  meat  and  other  food.  Della  counted  \n\nit  three  times.  One  dollar  and  eighty-seven  cents.  And  the  next  day  \n\nwould be Christmas.\n\nThere was nothing to do but fall on the bed and cry. So Della did it. \n\nWhile  the  lady  of  the  home  is  slowly  growing  quieter,  we  can  \n\nlook at the home. Furnished rooms at a cost of $8 a week. There is lit-\n\ntle more to say about it.\n\nIn the hall below was a letter-box too small to hold a letter. There \n\nwas  an  electric  bell,  but  it  could  not  make  a  sound.  Also  there  was  a  \n\nname beside the door: “Mr. James Dillingham Young.”\n\n1\n\n\x0cO .\n\n  H e n r y\n\nWhen the name was placed there, Mr. James Dillingham Young \nwas being paid $30 

For Project 3, `extract_text` is the only pdf function you will need. For reference, we also provide another useful text extraction function that works page by page, which you can consider for future projects if, for example, a given pdf file is too large to be parsed all at once.

In [42]:
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfpage import PDFPage
import io

# source: https://www.blog.pythonlibrary.org/2018/05/03/exporting-data-from-pdfs-with-python/
def extract_text_by_page(pdf_path):
    with open(pdf_path, 'rb') as fh:
        for page in PDFPage.get_pages(fh, caching = True, check_extractable = True):
            resource_manager = PDFResourceManager()
            fake_file_handle = io.StringIO()
            converter = TextConverter(resource_manager, fake_file_handle)
            page_interpreter = PDFPageInterpreter(resource_manager, converter)
            page_interpreter.process_page(page)

            text = fake_file_handle.getvalue()

            # release the per-page resources before handing the text to the caller
            converter.close()
            fake_file_handle.close()
            yield text
            
for index, page in enumerate(extract_text_by_page("GiftOfTheMagi.pdf")):
    print(f"**** PAGE {index + 1} ****\n")
    print(page)
    print()

**** PAGE 1 ****

The Gift of the MagipThe Gift of the MagiONE DOLLAR AND EIGHTY-SEVEN CENTS.     That was all. She had put it aside, one cent and then another and then another, in her careful buying of meat and other food. Della counted  it three times. One dollar and eighty-seven cents. And the next day  would be Christmas.There was nothing to do but fall on the bed and cry. So Della did it. While the lady of the home is slowly growing quieter, we can  look at the home. Furnished rooms at a cost of $8 a week. There is lit-tle more to say about it.In the hall below was a letter-box too small to hold a letter. There was an electric bell, but it could not make a sound. Also there was a  name beside the door: “Mr. James Dillingham Young.”1

**** PAGE 2 ****

O. HenryWhen the name was placed there, Mr. James Dillingham Young was being paid $30 a week. Now, when he was being paid only $20 a week, the name seemed too long and important. It should perhaps have been “Mr. James D. Young.” But

`pdfminer` also offers other functionalities, such as extracting images from pdf files, which you can read more about in the [official documentation](https://pdfminersix.readthedocs.io/en/latest/index.html).
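
The raw string returned by `extract_text` earlier in this section usually needs some cleanup before further processing: pages are separated by the form feed character `\x0c`, lines are separated by blank lines, and words broken across lines keep their hyphen (as in `lit-\n\ntle`). The function below is a rough sketch of such a cleanup using only the standard library. `clean_pdf_text` is a hypothetical helper, and its rules are heuristics based on the sample output above rather than a general solution. In particular, it would also merge genuinely hyphenated words that happen to break across lines, and it does not repair the letter-spaced title (`T h e   G i f t ...`).

```python
import re


def clean_pdf_text(raw):
    """Split raw extracted text into pages and normalize the whitespace in each page."""
    pages = []
    for page in raw.split("\x0c"):
        # rejoin words that were hyphenated across a line break: "lit-\n\ntle" -> "little"
        page = re.sub(r"(\w)-\s*\n\s*(\w)", r"\1\2", page)
        # collapse every remaining run of whitespace (including newlines) into one space
        page = re.sub(r"\s+", " ", page).strip()
        if page:
            pages.append(page)
    return pages
```

A typical use would be `pages = clean_pdf_text(extract_text("GiftOfTheMagi.pdf"))`, which gives one cleaned string per non-empty page.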