# Web Scraping

In [1]:
from time import time
import pandas as pd
import requests                 # HTTP programming
from bs4 import BeautifulSoup   # HTML parsing
from selenium import webdriver  # Browser automation
from selenium.common.exceptions import NoSuchElementException

In [2]:
latest_xkcd_comic = 2436
oldest_xkcd_comic = 2350

[Web scraping](https://en.wikipedia.org/wiki/Web_scraping) is a set of techniques, increasingly popular in recent years, for creating a dataset from unstructured data found on the Internet. There are compelling reasons why economists may want to scrape data from the web. Here are some examples.

- The availability of structured data is limited by the incentives to create it. Assembling a dataset is quite costly. Not only do you need manual labor to enter data in the dataset, you also need to put in place mechanisms to verify that the data is accurate. Raw data points are often not comparable, and hence require some methodology to ensure comparability. The provider of the data is also held responsible for inaccuracies or mistakes. On the other hand, the benefits are typically not enjoyed by private businesses. Often, government agencies or comparable institutions provide public structured data. They do so because they believe there is a common, public benefit, such as academic or policy research.
- The Internet is a medium of information exchange. Consider firms such as Amazon or eBay, who use the web to sell products and secure revenue streams. Consumers demand such services and firms supply them. This requires an exchange of information that includes, but is not limited to, quantities and prices. These firms have no incentive to release _public_ structured data about information that is exchanged on their websites. Rather, they tend to protect such data in the name of industrial interests. On the other hand, consumers have no incentive to collect data in a structured manner.
- The Internet is also a platform for user-generated content. This characteristic of the web rose with the popularity of platforms such as Facebook, Twitter, YouTube, Reddit and so on. A researcher may be interested in collecting user-generated information (e.g., political sentiment on Twitter) for their own data analyses. Again, neither firms nor consumers have incentives to assemble structured datasets and make them public.

These examples have to be considered together with the incentive every researcher has to uncover new evidence. Some published papers ask old questions and use old methodologies, but provide novel answers simply because they rely on novel data. While a new dataset will not guarantee a publication by itself, the implications of potentially new evidence may.

In this TA class, I will show brief examples for three web scraping techniques.

1. HTTP programming
2. HTML parsing
3. Browser automation (DOM parsing)

These are basic techniques that cover a wide set of needs. Before I explain how these techniques work, I need to provide an overview of the fundamentals of the World Wide Web.

The most familiar action everybody probably takes every day is to open a web browser, type a URL in the address bar and press the <kbd>Enter</kbd> key. When that happens, there is an exchange of information between your computer, which is called a _client_, and a remote _server_. While this exchange of information can be very complicated (e.g., it can be encrypted, or routed through intermediate servers), its most basic form involves an [HTTP request](https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol) and an [HTTP response](https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol). When we type a URL in the address bar and press <kbd>Enter</kbd>, the client sends a request: it contacts the remote server and requests specific data. Upon receipt of the request, the server analyzes it, carries out the required tasks and sends back a response.

The HTTP request contains the following ingredients:

- A request line
- Some request headers
- An empty line
- (Optional) A message body

On the other hand, the HTTP response contains the following ones:

- A response status line
- Some response headers
- An empty line
- (Optional) A message body

An example of such basic exchange is the following.

The request may look like this

<pre>
GET / HTTP/1.1
Host: www.example.com

</pre>

(Note the empty line.)

The first line uses the method `GET` to send a request through the protocol HTTP 1.1. The request asks for all that is found at the root (i.e., `/`) of the server. The server is hosted at the address `www.example.com`. Finally, there is the mandatory empty line and no message body.
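To see that a request is nothing more than plain text, the ingredients listed above can be assembled by hand. The following is only a minimal sketch: `build_request` is a hypothetical helper written for this illustration, not part of any library, and in practice a client such as `requests` composes the message for us.

```python
def build_request(method, path, host, headers=None, body=""):
    """Assemble the raw text of an HTTP/1.1 request (illustrative helper)."""
    # Request line, followed by the Host header
    lines = [f"{method} {path} HTTP/1.1", f"Host: {host}"]
    # Any additional request headers, one per line
    for name, value in (headers or {}).items():
        lines.append(f"{name}: {value}")
    # Lines are separated by CRLF; the empty line closes the headers,
    # and the (optional) message body follows it
    return "\r\n".join(lines) + "\r\n\r\n" + body


raw = build_request("GET", "/", "www.example.com")
print(raw)
```

Printing `raw` reproduces the two-line request shown above, followed by the empty line.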

The response, instead, may look like this

```http
HTTP/1.1 200 OK
Date: Mon, 23 May 2005 22:38:34 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 155
Last-Modified: Wed, 08 Jan 2003 23:11:55 GMT
Server: Apache/1.3.3.7 (Unix) (Red-Hat/Linux)
ETag: "3f80f-1b6-3e1cb03b"
Accept-Ranges: bytes
Connection: close

<html>
<head>
    <title>An Example Page</title>
</head>
<body>
    <p>Hello World, this is a HTML document.</p>
</body>
</html>
```

Here, the response acknowledges the HTTP 1.1 protocol and informs the client about the success of the request, both with the [status code](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) `200` and the human-readable message `OK`. The response attaches metadata through its headers, such as the date and time at which the request was served, information about the message body of the response, identification information for the server, and so on. We then find the mandatory empty line and the message body, which here consists of a simple HTML file.
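The same structure can be taken apart programmatically. The sketch below splits a shortened version of the response above into its ingredients; `split_response` is a hypothetical helper for illustration only, and it assumes a well-formed message with `\r\n` line endings. Libraries such as `requests` do this work for us and expose the result as `response.status_code`, `response.headers` and `response.text`.

```python
# Shortened version of the example response (only two headers kept)
raw_response = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: text/html; charset=UTF-8\r\n"
    "Connection: close\r\n"
    "\r\n"
    "<html><body><p>Hello World</p></body></html>"
)


def split_response(raw):
    """Split raw HTTP response text into status code, reason, headers and body."""
    # The first empty line separates the headers from the message body
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    # Status line: protocol, numeric status code, human-readable reason
    _protocol, code, reason = status_line.split(" ", 2)
    headers = dict(line.split(": ", 1) for line in header_lines)
    return int(code), reason, headers, body


code, reason, headers, body = split_response(raw_response)
print(code, reason)
print(headers["Content-Type"])
```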

We need to know about HTTP requests and responses because which of the three techniques is appropriate depends on the message body of the response.

1. HTTP programming: the message body is _per se_ the data we are after. For example, it can be a CSV file, a JSON file, and so on.
2. HTML parsing: the message body is an HTML file which contains the data we are after. This requires us to know how to navigate (programmatically) the content of the HTML file.
3. Browser automation: the HTML code in the message body changes (through other requests) depending on what the user does. This is typically the case when the message body of the response contains JavaScript code or other code that requests additional information (e.g., think of what happens when you continuously scroll down on your Twitter feed: new tweets are dynamically loaded).

Even though HTTP programming is the technique that is least likely associated with web scraping (that would imply that the data is already somewhat structured), knowing its basics is necessary because it is the starting point for the other two techniques.

I will now proceed and provide one example for each of the three techniques. The operating example is the [xkcd](https://xkcd.com) website. While this example is not interesting from an economic point of view, it is quite excellent (in my opinion) for pedagogical purposes. This website (for those who do not know it) presents a set of comics drawn by Randall Munroe, a physicist. On top of being a source of fun, the website allows me to showcase how each of the three techniques listed above can work. At https://xkcd.com/about/, we even find mention of a machine-readable interface through [JSON files](https://en.wikipedia.org/wiki/JSON).

## HTTP programming

The Python package [requests](https://requests.readthedocs.io/) provides a simple interface and some convenience utilities to manage HTTP requests and responses. The package is powerful and flexible enough to handle very complicated scenarios (e.g., connections that remain open, programmatic user authentication). Here I just want to scratch the surface.

xkcd's Randall Munroe is kind enough to provide us with his comics through a JSON interface. To see it, simply visit https://xkcd.com/info.0.json. This is all the data we need about his latest comic. For older comics, we simply include the comic number in the URL, such as https://xkcd.com/2434/info.0.json. In the following code, I write a Python class that represents a single xkcd comic. Given a comic number, it reaches out to the appropriate URL, fetches the JSON file and transforms it into a Python dictionary. As a bonus, I write a method that allows us to download the PNG file of the comic to disk.

In [3]:
class xkcdComicJson:
    """
    Uses the JSON interface at https://xkcd.com/ for retrieving information about a single xkcd comic.
    """

    def __init__(self, comic_no):
        url = f'https://xkcd.com/{comic_no}/info.0.json'
        response = requests.get(url)
        response.raise_for_status()  # throws an error for bad requests
        comic_info = response.json()  # automagically converts JSON into a dict
        self.number = comic_no
        self.json = comic_info
        self.year = int(comic_info['year'])
        self.month = int(comic_info['month'])
        self.day = int(comic_info['day'])
        self.title = comic_info['title']
        self.caption = comic_info['alt']
        self.img_url = comic_info['img']
        self.url = url
        self.img_name = self.img_url.split('/')[-1]
        
    def save_img_to_disk(self, directory='./'):
        response = requests.get(self.img_url)
        response.raise_for_status()
        if directory[-1] != '/':
            directory += '/'
        with open(directory + f'{self.number}-{self.img_name}', mode='wb') as f:
            f.write(response.content)

As we can see, the HTTP request is sent with the instruction `requests.get(url)`. This returns the HTTP response. With `response.raise_for_status()`, I ask Python to raise an exception (i.e., an error) if the request was bad. In other words, if the server responded with status codes `4xx` (client error) or `5xx` (server error), we would have an error and the rest of the code would not be executed. With `response.json()` we convert the JSON text into a Python dictionary. This is straightforward, because a JSON object maps naturally onto a Python dictionary. The conversion relies on the Python package [json](https://docs.python.org/3/library/json.html). The rest of the code in the `__init__` function is self-explanatory. The method `save_img_to_disk` takes a string argument describing the directory where we wish the PNG file to be saved. We use the URL of the comic image (now saved in `self`) and we download its binary content with `requests.get`. The binary content is written directly to disk through a file opened with Python's built-in `open()` function.
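To make the JSON-to-dictionary step concrete without touching the network, here is a minimal sketch using the standard `json` module directly, which is roughly what `response.json()` does with the response text. The payload is an abridged, hand-written imitation of what the xkcd interface returns for comic 2436; the exact set of fields, and the fact that `year`, `month` and `day` arrive as strings, are assumptions of this illustration (the class above converts them with `int()`).

```python
import json

# Abridged, hand-written payload in the style of https://xkcd.com/2436/info.0.json
payload = '{"month": "3", "num": 2436, "year": "2021", "title": "Circles", "day": "12"}'

comic_info = json.loads(payload)       # JSON object -> Python dict
print(type(comic_info).__name__)       # dict
print(comic_info["title"])             # Circles

# Values keep their JSON type: "year" is a string here, "num" is a number
year = int(comic_info["year"])
print(year, comic_info["num"])         # 2021 2436
```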

Now we can turn to using the class. Suppose that we wish to create a dataset about these comics. All we do is download all the JSON files using the `xkcdComicJson` class we wrote above.

In [4]:
df_json_rows = []
t0 = time()
for no in range(oldest_xkcd_comic, latest_xkcd_comic+1):
    comic = xkcdComicJson(no)
    df_json_rows.append(comic.json)
    # comic.save_img_to_disk()
t1 = time()
time_json = t1 - t0
df_json = pd.DataFrame(df_json_rows)
print("Data download completed in {:.3f} seconds.".format(time_json))

Data download completed in 42.673 seconds.


<sup>We never grow a `pandas.DataFrame` iteratively, row by row. An accurate and detailed account on the reason is found [here](https://stackoverflow.com/a/56746204).</sup>

In [5]:
df_json.tail()

| | month | num | link | year | news | safe_title | transcript | alt | img | title | day |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 82 | 3 | 2432 | | 2021 | | Manage Your Preferences | | Manage cookies related to essential site funct... | https://imgs.xkcd.com/comics/manage_your_prefe... | Manage Your Preferences | 3 |
| 83 | 3 | 2433 | | 2021 | | Mars Rovers | | I just Googled 'roomba sojourner mod' and was ... | https://imgs.xkcd.com/comics/mars_rovers.png | Mars Rovers | 5 |
| 84 | 3 | 2434 | | 2021 | | Vaccine Guidance | | I can't wait until I'm fully vaccinated and ca... | https://imgs.xkcd.com/comics/vaccine_guidance.png | Vaccine Guidance | 8 |
| 85 | 3 | 2435 | | 2021 | | Geothmetic Meandian | | Pythagorean means are nice and all, but throwi... | https://imgs.xkcd.com/comics/geothmetic_meandi... | Geothmetic Meandian | 10 |
| 86 | 3 | 2436 | | 2021 | | Circles | | ( MSTE ( AR ) CD ) | https://imgs.xkcd.com/comics/circles.png | Circles | 12 |


_Voilà!_ While this example is not particularly interesting from an economic perspective, we can see how we can leverage the [requests](https://requests.readthedocs.io/) package of Python to download data. This technique can also come in handy when we need to download many files that are listed as clickable links on a website. All we need to do is to obtain the relevant URLs, include them in a Python list and run a simple `for` loop.
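The "list of URLs plus a `for` loop" idea can be sketched as follows. This is a minimal illustration under stated assumptions, not a robust downloader: `filename_from_url` and `download_all` are hypothetical helpers, and the `fetch` parameter exists only so that the loop can be tried without a network connection (by default it falls back to `requests.get`).

```python
from pathlib import Path, PurePosixPath
from urllib.parse import urlsplit


def filename_from_url(url):
    """Return the last path component of a URL, e.g. 'circles.png'."""
    return PurePosixPath(urlsplit(url).path).name


def download_all(urls, directory=".", fetch=None):
    """Download every URL in `urls` into `directory`; return the saved paths."""
    if fetch is None:
        import requests            # imported lazily; only needed for real downloads
        fetch = requests.get
    target = Path(directory)
    target.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        response = fetch(url)
        response.raise_for_status()              # stop on 4xx/5xx responses
        path = target / filename_from_url(url)
        path.write_bytes(response.content)       # raw bytes of the file
        saved.append(path)
    return saved
```

A call such as `download_all(['https://imgs.xkcd.com/comics/circles.png'], 'comics')` would then save `circles.png` inside a `comics` folder.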

## HTML parsing

Now, suppose that xkcd did not have this nice and clean JSON interface. How can we obtain the same dataset (or an observationally equivalent one anyway) otherwise? 

The [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/) package provides an alternative. All this package does is parse an HTML file and provide some convenience utilities to navigate it. The Quick Start section of the documentation has a simple walk-through with a simple HTML document. I will not repeat the same here. Instead, I will jump directly into how we can use the package, hands-on. A fundamental tool we need is the set of Web Developer Tools in your browser. You can bring them up by pressing <kbd>F12</kbd> or <kbd>Ctrl</kbd>+<kbd>Shift</kbd>+<kbd>I</kbd>. A side pane will appear with an overwhelming amount of information. However, all we need is the HTML inspector. This shows us the HTML code tree that structures the webpage. What we need to do before writing any Python code is carefully inspect the HTML code around our objects of interest.

![inspector on xkcd](../slides/assets/xkcd-home.png)

As you will see with experience, the most popular HTML tag is `<div>`. This is a generic container, which can be filled with content and styled (e.g., colored). In well-coded websites, each `<div>` tag comes with an `id` and/or a `class` attribute, which provide a (not necessarily unique) identifier. From a technical point of view, these attributes are useful for a Cascading Style Sheet (CSS). This is the file that sets colors, decides which fonts to use and so on. By carefully looking at how HTML tags and attributes are used, we may be able to find unique patterns that identify the objects of interest. In our example, we have the following.

- Most of the information we need is in a `<div>` container with `id` attribute `comic`.
- Within such container, the tag `<img>` has a `src` attribute pointing at the location of the image on the remote server.
- In the same `<img>` tag, the attribute `title` gives, contrary to expectations, the caption (i.e., the _alt_ text) of the comic.
- A `<div>` tag with `id` attribute `ctitle` contains the title of the comic.

Note that with this approach we have no information on the date on which each comic was published. This is exclusively due to the fact that the HTML code does not contain this information. However, a last-modified date is provided among the headers of the HTTP response that delivers the image of the comic. Such a date would also reflect later changes to the image, as opposed to the date of first upload, so it may be subject to errors.
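Before pointing any code at the live website, the lookup pattern in the list above can be tried offline on a hand-written snippet. The sketch below is an assumption-laden illustration: the HTML is a simplified imitation of the structure described above (not the real xkcd page), and it uses Python's built-in `html.parser` module instead of Beautiful Soup so that it runs without extra packages. `ComicParser` is a hypothetical class written for this example.

```python
from html.parser import HTMLParser

# Hand-written snippet imitating the structure described above
html_doc = """
<div id="ctitle">Circles</div>
<div id="comic">
  <img src="//imgs.xkcd.com/comics/circles.png" title="( MSTE ( AR ) CD )">
</div>
"""


class ComicParser(HTMLParser):
    """Collect title, image URL and caption from an xkcd-like page."""

    def __init__(self):
        super().__init__()
        self.open_divs = []   # ids of the <div> tags we are currently inside
        self.found = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            self.open_divs.append(attrs.get("id"))
        elif tag == "img" and "comic" in self.open_divs:
            # The src attribute is protocol-relative, so we prepend the scheme
            self.found["img_url"] = "https:" + attrs["src"]
            self.found["caption"] = attrs["title"]

    def handle_endtag(self, tag):
        if tag == "div" and self.open_divs:
            self.open_divs.pop()

    def handle_data(self, data):
        # Text directly inside <div id="ctitle"> is the comic title
        if self.open_divs and self.open_divs[-1] == "ctitle":
            self.found["title"] = data.strip()


parser = ComicParser()
parser.feed(html_doc)
print(parser.found)
```

Beautiful Soup condenses all of this bookkeeping into calls such as `soup.find('div', id='comic')`, which is why the class that follows is much shorter.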

In [6]:
class xkcdComicSoup:
    """
    Uses Beautiful Soup to parse the HTML page for a given comic.
    """
    
    def __init__(self, comic_no):
        url = f'https://xkcd.com/{comic_no}/'
        response = requests.get(url)                     # retrieve the remote HTML document
        response.raise_for_status()                      # make sure the request was successful
        soup = BeautifulSoup(response.text, 'html.parser')  # parse the HTML code (explicit parser avoids a warning)
        container = soup.find('div', id='comic')         # find a specific unique occurrence of this <div> tag
        self.img_url = 'https:' + container.img['src']
        self.caption = container.img['title']            # access <div>...<img src="..." title="...">...</div>
        self.title = soup.find('div', id='ctitle').text  # find a specific unique occurrence of this <div> tag
        self.number = comic_no
        self.url = url
        self.img_name = self.img_url.split('/')[-1]
        self._soup = soup
        self.img_response = requests.get(self.img_url)   # retrieve the image only
        self.img_response.raise_for_status()             # make sure this request was successful
        self.date = self.img_response.headers['Last-Modified']
    
    def save_img_to_disk(self, directory='./'):
        if directory[-1] != '/':
            directory += '/'
        with open(directory + f'{self.number}-{self.img_name}', mode='wb') as f:
            f.write(self.img_response.content)

Equipped with this new class, we can use it iteratively again to collect information and build a dataset.

In [7]:
df_soup_rows = []
t0 = time()
for no in range(oldest_xkcd_comic, latest_xkcd_comic+1):
    comic = xkcdComicSoup(no)
    row = {
        'number':   comic.number,
        'date':     comic.date,
        'title':    comic.title,
        'caption':  comic.caption,
        'img_name': comic.img_name,
        'img':      comic.img_url
    }
    df_soup_rows.append(row)
    # comic.save_img_to_disk()
t1 = time()
time_soup = t1 - t0
df_soup = pd.DataFrame(df_soup_rows)
print("Data download completed in {:.3f} seconds.".format(time_soup))

Data download completed in 68.071 seconds.


In [8]:
df_soup.tail()

| | number | date | title | caption | img_name | img |
|---|---|---|---|---|---|---|
| 82 | 2432 | Thu, 04 Mar 2021 00:09:10 GMT | Manage Your Preferences | Manage cookies related to essential site funct... | manage_your_preferences.png | https://imgs.xkcd.com/comics/manage_your_prefe... |
| 83 | 2433 | Sat, 06 Mar 2021 01:47:55 GMT | Mars Rovers | I just Googled 'roomba sojourner mod' and was ... | mars_rovers.png | https://imgs.xkcd.com/comics/mars_rovers.png |
| 84 | 2434 | Tue, 09 Mar 2021 03:04:34 GMT | Vaccine Guidance | I can't wait until I'm fully vaccinated and ca... | vaccine_guidance.png | https://imgs.xkcd.com/comics/vaccine_guidance.png |
| 85 | 2435 | Wed, 10 Mar 2021 23:45:59 GMT | Geothmetic Meandian | Pythagorean means are nice and all, but throwi... | geothmetic_meandian.png | https://imgs.xkcd.com/comics/geothmetic_meandi... |
| 86 | 2436 | Fri, 12 Mar 2021 18:20:11 GMT | Circles | ( MSTE ( AR ) CD ) | circles.png | https://imgs.xkcd.com/comics/circles.png |


Here it is. The dataset is obviously not identical to the previous one we obtained, but it gets really close. We should notice, however, that the column `date` is indeed not the date the comic was published on. It is the date on which the image of the comic was last modified. It looks like some comics were updated after they were published, while others were uploaded before going public.
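The `date` column holds the raw header text. If we wanted to compare or sort these values, one option is to convert them into `datetime` objects with the standard library, as in this small sketch (the input string is the value shown for comic 2436 above).

```python
from email.utils import parsedate_to_datetime

last_modified = "Fri, 12 Mar 2021 18:20:11 GMT"    # Last-Modified header of comic 2436
timestamp = parsedate_to_datetime(last_modified)   # timezone-aware datetime (UTC)

print(timestamp.isoformat())   # 2021-03-12T18:20:11+00:00
print(timestamp.date())        # 2021-03-12
```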

Note that this method took much longer than the previous one. With Beautiful Soup, we download a full HTML page for every comic, which is considerably larger than the corresponding JSON file. (`requests` fetches only the HTML document; it does not download the images the page references.) Additionally, we send a second request that downloads the image itself, exclusively to access the `Last-Modified` header that pertains to that specific file. With the JSON interface, we only needed to download very small text files. Obviously, in economically relevant applications, we rarely get to choose which method to use. However, it is important here to understand that web scraping tends to take its time, however efficient our code may be. This download time also critically depends on our Internet connection. For best results in terms of time and stability, it is recommended to use a cabled connection over a wireless one.

## Browser Automation

Suppose that, for some reason, neither of the previous techniques was viable. This may well happen when the content of the page changes dynamically. In other words, suppose that the website loads new content in the webpage without reloading the whole page. This is often the case with websites that support "infinite scrolling": a technique to automatically load new content, to be stacked vertically, as long as the user scrolls down (e.g., Twitter, Facebook, Reddit). The last-resort option in these cases is to automate your own browser, such that you can emulate human behavior using a program.

Browser webdrivers are essential tools in the toolkit of a web developer. They are mostly used for testing websites and trying many possible combinations of user behavior, without actually doing it manually. We can use webdrivers to our advantage. Instead of testing a website, we would just be instructing our browser to navigate to a given webpage, move around and select or read elements using HTML identifiers. In essence, our practical approach is unchanged relative to HTML parsing: we still need to manually inspect the HTML code of the webpage. The main difference is that we write code that retraces the steps we would take by hand, just automatically. [Selenium](https://www.selenium.dev/) is a browser automation framework with bindings for a number of programming languages, including Python.

Using a webdriver is relatively simple, but it requires an additional piece of software. Specifically, [such software is the webdriver itself](https://www.selenium.dev/selenium/docs/api/py/index.html#drivers)! This is a small executable file that provides the necessary code to translate commands in your programming language into human-like actions in the browser. We have two options here. Either (1) we download the executable, place it in an arbitrary folder, and tell Selenium where that folder is, or (2) we place the executable inside a folder listed in the [PATH variable](https://en.wikipedia.org/wiki/PATH_(variable)) of the Operating System. I opted for the first option, purely out of personal preference.

In [9]:
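from selenium.webdriver.firefox.service import Service  # Selenium 4 takes the driver path through a Service object
browser = webdriver.Firefox(service=Service(executable_path='C:/Users/Andrea/Documents/geckodriver.exe'))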

This line creates a new browser window that is automatically controlled by the code we will write next.

To showcase the power of browser automation, we take an approach to searching comics different from the one we used above. We start from the latest comic, found at https://xkcd.com, and then we iteratively press the "Previous" button up until we get to comic number 2350. This shows how we can emulate human behavior, by means of clicks. Therefore, we will have a `while` loop that uses Selenium functions to find elements. To find these elements, we use the following strategies (in bold, the information we gather):

- The **title of the comic** is in a `<div>` tag with `id` attribute `ctitle`
- The `<div>` tag with `id` attribute `middleContainer` contains the following lines
  - A line with the comic title
  - A line with navigation buttons (above the comic)
  - A line with navigation buttons (below the comic)
  - An empty line
  - A line with a "permalink" to the comic, which gives us the **comic number**
  - A line with a "hotlink" to the comic image, which gives us the URL of the comic image
- The `<div>` tag with `id` attribute `comic` contains an `<img>` tag that contains
  - The **URL of the comic image**, at the `src` attribute
  - The **caption of the comic**, at the `title` attribute
- The "Previous" button is a link with text `< Prev`

We close the procedure by closing the automated browser window.
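The least obvious step in the list above is recovering the comic number from the "permalink" line. The string handling can be tried in isolation, as in the sketch below. The text assigned to `container_text` is an assumption made for this illustration: it imitates what the `middleContainer` text may look like, and the exact wording on the live page may differ.

```python
# Assumed text content of <div id="middleContainer"> (illustrative, not copied from the live page)
container_text = (
    "Circles\n"
    "(navigation buttons)\n"
    "(navigation buttons)\n"
    "\n"
    "Permanent link to this comic: https://xkcd.com/2436/\n"
    "Image URL (for hotlinking/embedding): https://imgs.xkcd.com/comics/circles.png"
)

lines = container_text.split("\n")
permalink = lines[-2].split(": ")[-1]    # second-to-last line holds the permalink
number = int(permalink.split("/")[-2])   # 'https://xkcd.com/2436/' -> 2436
img_url = lines[-1].split(": ")[-1]      # last line holds the image URL

print(number)
print(img_url)
```

Because the approach depends on the position of lines within the container, it breaks as soon as the page layout changes, which is a general fragility of scraping by position.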

In [10]:
from selenium.webdriver.common.by import By  # locator strategies (find_element_by_* was removed in Selenium 4.3)

df_dom_rows = []
t0 = time()
browser.get('https://xkcd.com')  # point the browser to the homepage
number = 3000  # placeholder larger than oldest_xkcd_comic; overwritten in the first iteration

while number > oldest_xkcd_comic:
    # Find the number of the comic
    div = browser.find_element(By.XPATH, '//div[@id="middleContainer"]')
    lines = div.text.split('\n')
    permalink = lines[-2].split(': ')[-1]
    number = int( permalink.split('/')[-2] )
    # img_url = lines[-1].split(': ')[-1]  we could use this instead of ...
    
    # Find the title of the comic
    title = browser.find_element(By.XPATH, '//div[@id="ctitle"]').text
    
    # Find the caption of the comic
    try:
        img = browser.find_element(By.XPATH, '//div[@id="comic"]/img')
    except NoSuchElementException:  # see 2399, where the image is wrapped in a link
        img = browser.find_element(By.XPATH, '//div[@id="comic"]/a/img')
    caption = img.get_attribute('title')
    
    # Find the URL of the comic image
    img_url = img.get_attribute('src')  # ... this
    
    # Find the name of the PNG file
    img_name = img_url.split('/')[-1]
    
    # Collect information for dataset
    row = {
        'number': number,
        'title': title,
        'caption': caption,
        'img_name': img_name,
        'img': img_url
    }
    
    # Append information to list
    df_dom_rows.append(row)
    
    # Go to the previous comic
    browser.find_element(By.LINK_TEXT, "< Prev").click()
    
browser.quit()  # close the automated browser window
t1 = time()
time_dom = t1-t0
print("Data download completed in {:.3f} seconds.".format(time_dom))

Data download completed in 45.347 seconds.


In [11]:
df_dom = pd.DataFrame(df_dom_rows)
df_dom.head()

| | number | title | caption | img_name | img |
|---|---|---|---|---|---|
| 0 | 2436 | Circles | ( MSTE ( AR ) CD ) | circles.png | https://imgs.xkcd.com/comics/circles.png |
| 1 | 2435 | Geothmetic Meandian | Pythagorean means are nice and all, but throwi... | geothmetic_meandian.png | https://imgs.xkcd.com/comics/geothmetic_meandi... |
| 2 | 2434 | Vaccine Guidance | I can't wait until I'm fully vaccinated and ca... | vaccine_guidance.png | https://imgs.xkcd.com/comics/vaccine_guidance.png |
| 3 | 2433 | Mars Rovers | I just Googled 'roomba sojourner mod' and was ... | mars_rovers.png | https://imgs.xkcd.com/comics/mars_rovers.png |
| 4 | 2432 | Manage Your Preferences | Manage cookies related to essential site funct... | manage_your_preferences.png | https://imgs.xkcd.com/comics/manage_your_prefe... |


Overall, these procedures retrieve roughly the same data. The main differences pertain to the method and the execution times. While browser automation may seem more intuitive, as it directly relates to what we normally do in a browser, it carries the overhead of running and rendering pages in a full browser. In the run reported here, it was slower than HTTP programming but faster than HTML parsing, which sent two requests per comic. Furthermore, if the connection to the Internet is not particularly fast, HTTP programming may just be the best option, because it allows for optimizations over what exactly to download. However, when scraping data, we often do not have a choice: only one method is feasible. Knowing what to expect is important.

In [12]:
print('Comics retrieved: {:d}.'.format(latest_xkcd_comic - oldest_xkcd_comic + 1))  # both endpoints are included
print('HTTP programming took   {:.3f} seconds.'.format(time_json))
print('HTML parsing took       {:.3f} seconds.'.format(time_soup))
print('Browser automation took {:.3f} seconds.'.format(time_dom))

Comics retrieved: 87.
HTTP programming took   42.673 seconds.
HTML parsing took       68.071 seconds.
Browser automation took 45.347 seconds.
