# Web scraping using Python: Example 2

## Introduction

Computational methods for collecting, cleaning and analysing data are an increasingly important component of a social scientist’s toolkit. Central to engaging in these methods is the ability to write readable and effective code using a programming language.

In this lesson we apply the logic of web scraping to a genuine website.

### Aims

This lesson - **Web scraping using Python: Example 2** - has two aims:
1. Demonstrate how to use Python to collect data found on more complicated, realistic websites.
2. Cultivate your computational thinking skills through coding examples. In particular, how to define and solve a data collection problem using a computational method.

### Lesson details

* **Level**: Introductory
* **Time**: 30-40 minutes
* **Pre-requisites**: None
* **Audience**: Researchers and analysts from any disciplinary background
* **Learning outcomes**:
    1. Understand the key steps and requirements for collecting data from web pages using computational methods.
    2. Be able to use Python for requesting, parsing, extracting and saving data stored on a web page.

## Guide to using this resource

This learning resource was built using <a href="https://jupyter.org/" target=_blank>Jupyter Notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> espouse:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (*What is the general approach for scraping data from a web page?*). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

**You only need to execute the code that is contained in sections which are marked by `In []`.**

To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut Shift + Enter).

Try it for yourself:

In [None]:
print("Enter your name and press enter:")
name = input()
print("\r")
print("Hello {}, enjoy learning more about Python and web-scraping!".format(name))

### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool. 

## What is the general approach for scraping data from a web page?

We begin by identifying a web page containing information we are interested in collecting. Then we need to **know** the following:
1. The location (i.e., web address) where the web page can be accessed. For example, the UK Data Service homepage can be accessed via <a href="https://ukdataservice.ac.uk/" target=_blank>https://ukdataservice.ac.uk/</a>.
2. The location of the information we are interested in within the structure of the web page. This involves visually inspecting a web page's underlying code using a web browser.

And **do** the following:
1. Request the web page using its web address.
2. Parse the structure of the web page so your programming language can work with its contents.
3. Extract the information we are interested in.
4. Write this information to a file for future use.

For any programming task, it is useful to write out the steps needed to solve the problem: we call this *pseudo-code*, as it captures the main tasks and the order in which they need to be executed.
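As a rough illustration, the four "do" steps can be written as commented pseudo-code with a small runnable stand-in beneath each one. Everything specific here is an assumption made for the sketch: the HTML string, the `id="headline"` attribute and the file name are invented, and Python's built-in `html.parser` module is swapped in for a third-party parser so the example runs without an internet connection or extra installs.

```python
import csv
import os
import tempfile
from html.parser import HTMLParser

# Step 1. Request the web page
# (here: a made-up HTML string instead of a live request)
page = '<html><body><div id="headline"><span>1,234</span></div></body></html>'


# Step 2. Parse the page so Python can work with its structure
class SpanCollector(HTMLParser):
    """Collect the text found inside every <span> tag."""

    def __init__(self):
        super().__init__()
        self.in_span = False
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.in_span = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_span = False

    def handle_data(self, data):
        if self.in_span:
            self.values.append(data.strip())


parser = SpanCollector()
parser.feed(page)

# Step 3. Extract the information of interest and clean it
headline = parser.values[0].replace(",", "")

# Step 4. Write the information to a file for future use
# (a temporary folder is used so the sketch leaves your own files untouched)
outfile = os.path.join(tempfile.mkdtemp(), "example.csv")
with open(outfile, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Headline"])
    writer.writerow([headline])

print(headline)
```

The rest of this lesson follows the same four steps, but with a live web page and more convenient modules for requesting and parsing.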

## A social science web scraping example: COVID-19 data

The examples we covered previously were ideal for demonstrating the general web scraping approach. However, the websites and files were simple to navigate, request and parse. It's time to encounter a more difficult example, this time focusing on accessing up-to-date COVID-19 data from Worldometer.

Worldometer is a website that provides up-to-date statistics on the following domains: the global population; food, water and energy consumption; environmental degradation and many others (known as its Real Time Statistics Project). In its own words:<sup>[1]</sup>
> Worldometer is run by an international team of developers, researchers, and volunteers with the goal of making world statistics available in a thought-provoking and time relevant format to a wide audience around the world. Worldometer is owned by Dadax, an independent company. We have no political, governmental, or corporate affiliation.

Since the outbreak of Covid-19 it has provided regular daily snapshots on the progress of this disease, both globally and at a country level.

[1]: https://www.worldometers.info/about/

### Identifying the web address

The website can be accessed here: <a href="https://www.worldometers.info/coronavirus/" target=_blank>https://www.worldometers.info/coronavirus/</a>

Let's work through the steps necessary to collect data about the number of Covid-19 cases, deaths and recoveries globally.

First, let's become familiar with this website by viewing the web page in your browser: <a href="https://www.worldometers.info/coronavirus/" target=_blank>https://www.worldometers.info/coronavirus/</a>

(Note: it is possible to load websites into Python in order to view them; however, the website we are interested in doesn't allow this. See the example code below for how it would work for a different website - just remove the quotation marks enclosing the code and run the cell).

In [None]:
"""
from IPython.display import IFrame

IFrame("https://httpbin.org/html", width="600", height="650")
"""

### Locating information

The statistics we need are near the top of the page under the following headings:
* Coronavirus Cases:
* Deaths:
* Recovered:

#### Visually inspecting the underlying HTML code

Therefore, what we need are the tags that identify the section of the web page where the statistics are stored. We can discover the tags by examining the *source code* (HTML) of the web page. This can be done using your web browser: for example, if you use Firefox you can right-click on the web page and select *View Page Source* from the list of options.

The cell below shows a snippet of the source code for the section of the web page we are interested in.

In the source code, we can see multiple tags that contain various elements (e.g., text content, other tags). For instance, we can see that the Covid-19 statistics are enclosed in `<span></span>` tags, which themselves are located within `<div></div>` tags.

As you can see, exploring and locating the contents of a web page remains a manual and visual process, and in Brooker's estimation (2020, 252):
> Hence, more so than the actual Python, it's the detective work of unpicking the internal structure of a webpage that is probably the most vital skill here.
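To make the idea of nested tags concrete, here is a minimal sketch that prints the hierarchy of a small piece of HTML. The snippet is hypothetical: the `id` and heading text follow this lesson, but the exact layout and the figure are placeholders rather than a copy of the real page source. Python's built-in `html.parser` module is used purely for illustration.

```python
from html.parser import HTMLParser

# Hypothetical snippet: placeholder figure, simplified layout
snippet = """
<div id="maincounter-wrap">
  <h1>Coronavirus Cases:</h1>
  <div>
    <span>1,234,567</span>
  </div>
</div>
"""


class OutlinePrinter(HTMLParser):
    """Record each opening tag and piece of text, indented by nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.outline = []

    def handle_starttag(self, tag, attrs):
        self.outline.append("  " * self.depth + "<" + tag + ">")
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text:  # ignore whitespace between tags
            self.outline.append("  " * self.depth + text)


printer = OutlinePrinter()
printer.feed(snippet)
print("\n".join(printer.outline))
```

The indentation in the output mirrors what you look for when inspecting a page by eye: the statistic sits inside a `<span>`, which sits inside a `<div>`, which sits inside the outer `<div>`.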

### Requesting the web page

Now that we possess the necessary information, let's begin the process of scraping the web page. There is a preliminary step, which is setting up Python with the modules it needs to perform the web-scrape.

In [None]:
# Import modules

import os # module for navigating your machine (e.g., file directories)
import requests # module for requesting urls
import csv # module for handling csv files
import pandas as pd # module for handling data
from datetime import datetime # module for working with dates and time
from bs4 import BeautifulSoup as soup # module for parsing web pages

print("Successfully imported necessary modules")

Modules are additional techniques or functions that are not present when you launch Python. Some do not even come with Python when you download it and must be installed on your machine separately - think of using `ssc install <package>` in Stata, or `install.packages(<package>)` in R. For now just understand that many useful modules need to be imported every time you start a new Python session.

Now, let's implement the process of scraping the page. First, we need to request the web page using Python; this is analogous to opening a web browser and entering the web address manually. We refer to a page's location on the internet as its web address or Uniform Resource Locator (URL).

In [None]:
# Define the URL where the web page can be accessed

url = "https://www.worldometers.info/coronavirus/"

# Request the web page from the URL

response = requests.get(url, allow_redirects=True) # request the url
# response.headers
response.status_code # check if page was requested successfully

Good, we get a status code of *200*, which means the request was successful. A status code in the *400s* or *500s* represents an unsuccessful attempt at requesting a web page (see <a href="https://www.textbook.ds100.org/ch/07/web_http.html" target=_blank>Lau, Gonzalez and Nolan</a> for a succinct description of different types of response status codes).

Let's unpack the code a bit. First, we define a variable (also known as an 'object' in Python) called `url` that contains the web address of the page we want to request. Next, we use the `get()` method of the `requests` module to request the web page, and in the same line of code, we store the results of the request in a variable called `response`. Finally, we check whether the request was successful by calling on the `status_code` attribute of the `response` variable.

Confused? Don't worry, the conventions of Python and using its modules take a bit of getting used to. At this point, just understand that you can store the results of commands in variables, and a variable can have different attributes that can be accessed when needed. Also note that you have a lot of freedom in how you name your variables (subject to certain restrictions - see <a href="https://www.python.org/dev/peps/pep-0008/" target=_blank>here for some guidance</a>).

For example, the following would also work:

In [None]:
web_address = "https://www.worldometers.info/coronavirus/"

scrape_result = requests.get(web_address, allow_redirects=True)
scrape_result.status_code

We can also view the metadata associated with our request:

In [None]:
response.headers

Back to the request and its status code. <a href="https://www.textbook.ds100.org/ch/07/web_http.html" target=_blank>Lau, Gonzalez and Nolan</a> provide a succinct description of the different types of response status codes:

* **100s** - Informational: More input is expected from client or server (e.g. 100 Continue, 102 Processing)
* **200s** - Success: The client's request was successful (e.g. 200 OK, 202 Accepted)
* **300s** - Redirection: Requested URL is located elsewhere; May need user's further action (e.g. 300 Multiple Choices, 301 Moved Permanently)
* **400s** - Client Error: Client-side error (e.g. 400 Bad Request, 403 Forbidden, 404 Not Found)
* **500s** - Server Error: Server-side error or server is incapable of performing the request (e.g. 500 Internal Server Error, 503 Service Unavailable)

For clarity:
* **Client**: your machine
* **Server**: the machine you are requesting the web page from
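If you want Python to interpret a status code for you, the following sketch maps a code to its class using the categories listed above. The function name `describe_status` is our own invention for this example; the reason phrases come from Python's built-in `http` module.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description of an HTTP status code and its class."""
    classes = {
        1: "Informational",
        2: "Success",
        3: "Redirection",
        4: "Client Error",
        5: "Server Error",
    }
    status_class = classes.get(code // 100, "Unknown")
    try:
        phrase = HTTPStatus(code).phrase  # standard reason phrase, e.g. "OK"
    except ValueError:
        phrase = "unrecognised code"
    return "{} {} ({})".format(code, phrase, status_class)


for code in [200, 301, 404, 503]:
    print(describe_status(code))
```

In a real scrape you could pass `response.status_code` to a function like this. The `requests` module also offers `response.raise_for_status()`, which raises an error when the code is in the 400s or 500s.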

You may be wondering exactly what it is we requested: if you were to type the URL (https://www.worldometers.info/coronavirus/) into your browser and hit `enter`, the web page should appear on your screen. This is not the case when we request the URL through Python but rest assured, we have successfully requested the web page. To see the content of our request, we can examine the `text` attribute of the `response` variable:

In [None]:
response.text[:1000]

This shows us the underlying code (HTML) of the web page we requested. It should be obvious that in its current form, the result of this request will be difficult to work with. This is where the `BeautifulSoup` module comes in handy.

### Parsing the web page

Now it's time to identify and understand the structure of the web page we requested. We do this by converting the content contained in the `response.text` attribute into a `BeautifulSoup` variable. `BeautifulSoup` is a Python module that provides a systematic way of navigating the elements of a web page and extracting its contents. Let's see how it works in practice:

In [None]:
# Extract the contents of the web page from the response

soup_response = soup(response.text, "html.parser") # Parse the text as a Beautiful Soup object
soup_sample = soup(response.text[:1000], "html.parser") # Parse a sample of the text
soup_sample

Notice how the hierarchical structure of the web page is now recognised by Python? `BeautifulSoup` has taken the unstructured text contained in `response.text` and parsed it as HTML: now we can clearly see the hierarchical structure and tags that comprise a web page's HTML. 

Note again how we call on a function from a module and store the results in a variable: `soup()` is the short name we gave to `BeautifulSoup` when we imported it from the `bs4` module, and its result is stored in `soup_response`.

Of course, we've only displayed a sample of the code here for readability. What about the full text contained in `soup_response`: how do we navigate such voluminous results? Thankfully the `BeautifulSoup` module provides some intuitive methods for doing so.

### Extracting information

Now that we have parsed the web page, we can use Python to navigate and extract the information of interest. To begin with, let's locate the section of the web page containing the overall Covid-19 statistics.

In [None]:
sections = soup_response.find_all("div", id="maincounter-wrap")
sections

We used the `find_all()` method to search for all `<div>` tags where `id="maincounter-wrap"`. And because there is more than one set of tags matching this id, we get a list of results. We can check how many tags match this id by calling on the `len()` function:

In [None]:
len(sections)

We can view each element in the list of results as follows:

In [None]:
for section in sections:
    print("--------")
    print(section)
    print("--------")
    print("\r") # print some blank space for better formatting

We are nearing the end of our scrape. The penultimate task is to extract the statistics within the `<span>` tags and store them in some variables. We do this by accessing each item in the _sections_ list using its positional value (index).

In [None]:
cases = sections[0].find("span").text.replace(" ", "").replace(",", "")
deaths = sections[1].find("span").text.replace(" ", "").replace(",", "")
recov = sections[2].find("span").text.replace(" ", "").replace(",", "")
print("Number of cases: {}; deaths: {}; and recoveries: {}.".format(cases, deaths, recov))

The above code performs a couple of operations:
* For each item (i.e., set of `<div>` tags) in the list, it finds the `<span>` tags and extracts the text enclosed within them.
* We clean the text by removing blank spaces and commas.
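The cleaning step can be tried out on its own, without a web page. The sketch below uses made-up figures and a helper function of our own (`to_number`) that removes spaces and commas and then converts the text to an integer, which is useful if you later want to do arithmetic with the values.

```python
def to_number(text):
    """Remove spaces and thousands separators, then convert to an integer."""
    cleaned = text.replace(" ", "").replace(",", "")
    return int(cleaned)


raw_values = ["1,234,567 ", "89,012", " 345"]  # made-up figures
numbers = [to_number(value) for value in raw_values]
print(numbers)
```

Note that `int()` raises an error if the text is empty or contains something like "N/A", so a real scrape may need to handle such values separately.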

In this example, referring to an item's positional index works because our list of `<div>` tags stored in the `sections` variable is ordered: the tag containing the number of cases appears before the tag containing the number of deaths, which appears before the tag containing the number of recovered patients.

In Python, indexing begins at zero (in R indexing begins at 1). Therefore, the first item in the list is accessed using `sections[0]`, the second using `sections[1]` etc.

(To learn more about lists in Python, see Chapter 22 of <a href="https://assets.digitalocean.com/books/python/how-to-code-in-python.pdf" target=_blank>*How to Code in Python 3*</a>)
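Here is a short, self-contained illustration of positional indexing. The list of labels is a stand-in for the ordered `sections` list, not the real scraped content.

```python
labels = ["cases", "deaths", "recoveries"]  # stand-in for the ordered list

first = labels[0]    # first item
second = labels[1]   # second item
last = labels[-1]    # negative indices count from the end of the list

print(first, second, last)
print(len(labels))   # three items, so the valid positions are 0, 1 and 2
```

Asking for `labels[3]` would raise an `IndexError`, because the third and final item sits at position 2.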

### Saving results from the scrape

The final task is to save the variables to a file that we can use in the future. We'll write to a Comma-Separated Values (CSV) file for this purpose, as it is an open, text-based file format that is commonly used for sharing data on the web.

In [None]:
# Create a downloads folder

try:
    os.mkdir("./downloads")
except FileExistsError: # only catch the error raised when the folder is already there
    print("Unable to create folder: already exists")

The use of "./" tells the `os.mkdir()` command that the "downloads" folder should be created inside the current working directory, which is usually the directory where this notebook is located. So if this notebook was stored in a directory located at "C:/Users/joebloggs/notebooks", the `os.mkdir()` command would result in a new folder located at "C:/Users/joebloggs/notebooks/downloads".
   
(Technically the "./" is not needed and you could just write `os.mkdir("downloads")` but it's good practice to be explicit)
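An alternative worth knowing is `os.makedirs()` with `exist_ok=True`, which creates the folder if it is missing and does nothing if it already exists. The sketch below assumes a temporary directory as a stand-in for the notebook's folder, so it can be run safely anywhere.

```python
import os
import tempfile

base = tempfile.mkdtemp()  # temporary directory standing in for the notebook's folder
downloads = os.path.join(base, "downloads")

os.makedirs(downloads, exist_ok=True)  # creates the folder
os.makedirs(downloads, exist_ok=True)  # second call does nothing and raises no error

print(os.path.isdir(downloads))
```

This removes the need for a `try`/`except` block when all you want is to make sure the folder is there.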

In [None]:
# Write the results to a CSV file

date = datetime.now().strftime("%Y-%m-%d") # get today's date in YYYY-MM-DD format
print(date)

variables = ["Cases", "Deaths", "Recoveries"] # define variable names for the file
outfile = "./downloads/covid-19-statistics-" + date + ".csv" # define a file for writing the results
obs = cases, deaths, recov # define an observation (row)
print(obs)

with open(outfile, "w", newline="") as f: # with the file open in "write" mode, and giving it a shorter name (f)
    writer = csv.writer(f) # define a 'writer' object that allows us to export information to a CSV
    writer.writerow(variables) # write the variable names to the first row of the file
    writer.writerow(obs) # write the observation to the next row in the file

The code above defines some headers and a name and location for the file which will store the results of the scrape. We then open the file in *write* mode, and write the headers to the first row, and the statistics to the next row.
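As a variation on the same idea, the `csv` module also provides `DictWriter` and `DictReader`, which tie each value to its variable name rather than relying on position. This is only a sketch: the values are made up and the file is written to a temporary folder rather than to "downloads".

```python
import csv
import os
import tempfile

variables = ["Cases", "Deaths", "Recoveries"]
obs = {"Cases": "1234567", "Deaths": "89012", "Recoveries": "345"}  # made-up values

outfile = os.path.join(tempfile.mkdtemp(), "example-statistics.csv")

with open(outfile, "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=variables)
    writer.writeheader()   # first row: the variable names
    writer.writerow(obs)   # next row: one observation

with open(outfile, "r", newline="") as f:
    rows = list(csv.DictReader(f))  # each row becomes a dictionary keyed by variable name

print(rows[0]["Deaths"])
```

Because each value is looked up by name, the order in which you list the values in `obs` no longer matters.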

How do we know this worked? The simplest way is to check whether a) the file was created, and b) the results were written to it.

In [None]:
# Check presence of file in "downloads" folder

os.listdir("./downloads")

In [None]:
# Open file and read (import) its contents

with open(outfile, "r") as f:
    data = f.read()
    
print(data)    

And voilà, we have successfully scraped a web page!

### Country-level COVID-19 data

We will complete our work gathering data on the COVID-19 pandemic by employing the techniques we learned previously to capture country-level statistics. We'll assume some knowledge on your part and thus you'll notice fewer annotations explaining what is happening at each step. As we progress through this example, there are some tasks for you to complete and some questions to be answered also.

**TASK**: if you need to, re-acquaint yourself with the <a href="https://www.worldometers.info/coronavirus/" target=_blank>Worldometer website</a> - the table with country-level data is located near the bottom of the page.

First, let's get the preliminaries out of the way:

In [None]:
import os
import requests
import csv
import pandas as pd
from datetime import datetime
from bs4 import BeautifulSoup as soup

date = datetime.now().strftime("%Y-%m-%d")

Request the web page and parse it:

In [None]:
url = "https://www.worldometers.info/coronavirus/"

response = requests.get(url, allow_redirects=True)
response.headers
soup_response = soup(response.text, "html.parser")
#
# QUESTION: How do you know if the web page was requested successfully?
#

Find the table containing country-level statistics:

In [None]:
table = soup_response.find("table", id="main_table_countries_today").find("tbody")
rows = table.find_all("tr", style="")

Extract the information contained in each row in the table: 

In [None]:
global_info = []
for row in rows:
    columns = row.find_all("td")
    country_info = [column.text.strip() for column in columns]
    del country_info[7] # remove the eighth item (index 7) from the row
    global_info.append(country_info)

print(global_info[0:10])
print("\r")
print("Number of rows in table: {}".format(len(global_info)))
print("\r")

del global_info[0] # delete first row containing world statistics

First, we define a blank list to store statistics for each country (`global_info = []`); then for each row in the table, we extract the contents of each column and store the results in a list (`country_info = [column.text.strip() for column in columns]`); we remove the eighth item from that list (`del country_info[7]`); finally we add the results for each country to the overall list (`global_info.append(country_info)`).
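The same pattern can be practised on plain Python lists. In the sketch below, the nested list of made-up strings stands in for the text of the `<td>` tags in two table rows, and the position of the deleted item (index 4) is chosen only for this example.

```python
# Made-up cell text standing in for the <td> contents of two table rows
raw_rows = [
    [" 1 ", "Country A\n", " 1,000 ", "+10", " n/a "],
    [" 2 ", "Country B\n", " 2,000 ", "+20", " n/a "],
]

table_info = []  # blank list to hold the cleaned rows
for raw_row in raw_rows:
    row_info = [cell.strip() for cell in raw_row]  # strip whitespace from every cell
    del row_info[4]                                # drop the fifth item (index 4)
    table_info.append(row_info)                    # add the cleaned row to the overall list

print(table_info)
```

The line in square brackets is a *list comprehension*: a compact way of building a new list by applying the same operation to every item in an existing one.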

We save the results of the scrape to a file:

In [None]:
try:
    os.mkdir("./downloads")
except OSError as error:
    print("Folder already exists")

variables = ["Number", "Country", "Total Cases", "New Cases", "Total Deaths", 
            "New Deaths", "Total Recovered", "Active Cases", 
            "Serious_Critical", "Total Cases Per 1m Pop", "Deaths Per 1m Pop",
            "Total Tests", "Tests Per 1m Pop", "Population"]
outfile = "./downloads/covid-19-country-statistics-" + date + ".csv"
print(outfile)

with open(outfile, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(variables)
    for country in global_info:
        writer.writerow(country)

Finally, we check the file was created; if so we load it into Python and examine its contents:

In [None]:
data = pd.read_csv(outfile, encoding = "ISO-8859-1", index_col=False)
data.head(10)

In [None]:
data[data["Country"]=="Ireland"]

## What have we learned?

Let's recap what key skills and techniques we've learned:
* **How to import modules**. You will usually need to import modules into Python to support your work. Python does come with some methods and functions that are ready to use straight away, but for computational social science tasks you'll almost certainly need to import some additional modules.
* **How to request and parse web pages**. You can use Python to request a web page, and the `BeautifulSoup` module to parse its contents.
* **How to read and write data**. You can save the results of your scrape to a file for future use.
* **How to do all of the above in an efficient, clear and effective manner**.

## Tasks

## Conclusion

While the above examples demonstrate the basics of web scraping well, collecting research-relevant data from a web page is a little more difficult:
* Data may be spread throughout a web page (or across multiple pages).
* There may be many tags with similar data that need to be filtered in order to get to the information you need.
* And many other potential issues.

Thankfully the process/logic is the same even for more complicated examples - we'll explore these in the next lesson.

## Bibliography

Barba, Lorena A. et al. (2019). *Teaching and Learning with Jupyter*. <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>https://jupyter4edu.github.io/jupyter-edu-book/</a>.

Lau, S., Gonzalez, J., & Nolan, D. (n.d.). *Principles and Techniques of Data Science*. <a href="https://www.textbook.ds100.org" target=_blank>https://www.textbook.ds100.org</a>.
