![SGSSS Logo](../../img/SGSSS_Stacked.png)

# Practical Computational Methods for Social Scientists

## Introduction

Computational methods are transforming research practice across the disciplines. For social scientists these methods offer a number of valuable opportunities, including creating new datasets from digital sources; unearthing new insights and avenues for research from existing data sources; and improving the accuracy and efficiency of fundamental research activities.

In this lesson we apply the logic of web scraping to a simple, genuine website.

### Aims

This lesson has two aims:
1. Demonstrate how to use Python to collect data found on more complex websites.
2. Cultivate your computational thinking skills through coding examples, in particular how to define and solve a data collection problem using a computational method.

### Lesson details

* **Level**: Introductory
* **Time**: 40-60 minutes
* **Pre-requisites**: None
* **Audience**: Researchers and analysts from any disciplinary background
* **Learning outcomes**:
    1. Understand the key steps and requirements for collecting data from web pages using computational methods.
    2. Be able to use Python for requesting, parsing, extracting and saving data stored on a web page.

## Guide to using this resource

This learning resource was built using <a href="https://jupyter.org/" target=_blank>Jupyter Notebook</a>, an open-source software application that allows you to mix code, results and narrative in a single document. As <a href="https://jupyter4edu.github.io/jupyter-edu-book/" target=_blank>Barba et al. (2019)</a> espouse:
> In a world where every subject matter can have a data-supported treatment, where computational devices are omnipresent and pervasive, the union of natural language and computation creates compelling communication and learning opportunities.

If you are familiar with Jupyter notebooks then skip ahead to the main content (*What is the general approach for scraping data from a web page?*). Otherwise, the following is a quick guide to navigating and interacting with the notebook.

### Interaction

**You only need to execute the code that is contained in sections which are marked by `In []`.**

To execute a cell, click or double-click the cell and press the `Run` button on the top toolbar (you can also use the keyboard shortcut Shift + Enter).

Try it for yourself:

In [None]:
print("Enter your name and press enter:")
name = input()
print("\r")
print("Hello {}, enjoy learning more about Python and web-scraping!".format(name))

### Learn more

Jupyter notebooks provide rich, flexible features for conducting and documenting your data analysis workflow. To learn more about additional notebook features, we recommend working through some of the <a href="https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb" target=_blank>materials</a> provided by Dani Arribas-Bel at the University of Liverpool. 

## What is the general approach for scraping data from a web page?

We begin by identifying a web page containing information we are interested in collecting. Then we need to **know** the following:
* The location (i.e., web address) where the web page can be accessed. For example, the SGSSS homepage can be accessed via <a href="https://www.sgsss.ac.uk/" target=_blank>https://www.sgsss.ac.uk/</a>.
* The location of the information we are interested in within the structure of the web page. This involves visually inspecting a web page's underlying code using a web browser.

And **do** the following:
* Request the web page using its web address.
* Parse the structure of the web page so your programming language can work with its contents.
* Extract the information we are interested in.
* Write this information to a file for future use.

For any programming task, it is useful to write out the steps needed to solve the problem: we call this *pseudo-code*, as it captures the main tasks and the order in which they need to be executed.
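
As a minimal sketch of how these steps map onto Python, the example below works through the parse, extract and write steps on a small, made-up HTML snippet. The request step appears only as a comment so the example runs without an internet connection. The snippet, its tags and the output file name are illustrative assumptions, not taken from a real website.

```python
import json
import os
import tempfile

from bs4 import BeautifulSoup

# Step 1 (request) would normally be: html = requests.get(url).text
# Here a made-up HTML snippet stands in for the downloaded page.
html = """
<html>
  <body>
    <h1>Community groups</h1>
    <ul>
      <li>Group A</li>
      <li>Group B</li>
    </ul>
  </body>
</html>
"""

# Step 2: parse the HTML so Python can navigate its structure
page = BeautifulSoup(html, "html.parser")

# Step 3: extract the information of interest (the text of each list item)
groups = [item.text.strip() for item in page.find_all("li")]

# Step 4: write the results to a file for future use (here, a temporary folder)
outfile = os.path.join(tempfile.gettempdir(), "example-groups.json")
with open(outfile, "w", encoding="utf-8") as f:
    json.dump(groups, f)

print(groups)
```

Each comment corresponds to one line of pseudo-code; the real examples later in the lesson follow the same order.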

## A more complex web scraping example

Let's work through the steps in our general approach using a real web page, one that is **not** designed for practicing web scraping but instead contains information relevant to a social science research project.

### Identifying the web address

The web page we are interested in can be found at the following web address: <a href="https://www.edinburgh.gov.uk/directory/10258/other-warm-and-welcoming-locations" target=_blank>https://www.edinburgh.gov.uk/directory/10258/other-warm-and-welcoming-locations</a>.

This is a web page on City of Edinburgh Council's website containing a list of organisations that provide warm and welcoming spaces to residents. There is no data file to download, therefore the only way to get this information is by scraping the web page.

### Locating information

Our task is to extract the list of organisations providing warm and welcoming spaces. In order to do so, we need to understand where the text is located within the underlying *source code* of the web page. Web pages are written in a language called HyperText Markup Language (HTML). HTML describes the structure of a web page, and consists of a number of elements (e.g., paragraphs, tables, headers), with each element represented by a tag (e.g., `<p>`, `<table>`, `<h1>`). Browsers do not display the HTML tags, but use them to render the content of the page.

See <a href="https://www.w3schools.com/html/html_intro.asp" target=_blank>https://www.w3schools.com/html/html_intro.asp</a> for more information on HTML.
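
To make the idea of elements and tags concrete, here is a small sketch that parses a made-up HTML document (not the Council's page) with the `BeautifulSoup` module used later in this lesson. It shows how a tag name can be used to look up an element's text or one of its attributes.

```python
from bs4 import BeautifulSoup

# A made-up HTML document containing a header, two paragraphs and a link
html = """
<html>
  <body>
    <h1>Warm spaces</h1>
    <p>First paragraph.</p>
    <p>Second paragraph with a <a href="/more-info">link</a>.</p>
  </body>
</html>
"""

page = BeautifulSoup(html, "html.parser")

heading = page.find("h1").text       # text inside the first <h1> element
paragraphs = page.find_all("p")      # list of every <p> element
link = page.find("a").get("href")    # value of the href attribute of the first <a>

print(heading)
print(len(paragraphs))
print(link)
```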

#### Visually inspecting the underlying HTML code

What we need, therefore, are the tags that identify the section of the web page where the text is stored. We can discover the tags by examining the *source code* (HTML) of the web page. This can be done using your web browser: for example, if you use Firefox you can right-click on the web page and select *View Page Source* from the list of options. (Chrome: *View page source*; Safari: follow <a href="https://www.lifewire.com/view-html-source-in-safari-3469315" target=_blank>these instructions</a>).

The cell below shows a sample of the HTML code for the web page.

Our data collection task is a little more complicated than in the previous lesson, as the information we require is not actually on this page. Instead, the page links to multiple pages that might contain the information we require.

### Preliminaries

Let's set up Python with the modules it needs, as well as the file paths and dates necessary to process the extracted data.

In [None]:
# Import modules

import string # module for working with ASCII and other strings
import os # module for navigating your machine (e.g., file directories)
import requests # module for requesting urls
import json # module for working with JSON data structures
import csv # module for working with csv files
import pandas as pd  # module for working with dataframes
from datetime import datetime as dt # module for working with dates and time
from bs4 import BeautifulSoup as soup # module for parsing web pages

print("Successfully imported necessary modules")

In [None]:
# Filepaths

other_data = "./data/"
la_data = "./data/local-authorities/"

In [None]:
# Create data folders

for folder in other_data, la_data:
    try:
        os.mkdir(folder)
    except FileExistsError: # only catch the case where the folder is already there
        print("Unable to create {}: already exists".format(folder))

**QUESTION:** What do you think the `for loop` and the `try-except` clause are doing in the code block above?

In [None]:
# Download date

ddate = dt.now().strftime("%Y-%m-%d")
ddate

### Requesting the web page

Now, let's implement the process of scraping the page. First, we need to request the web page using Python; this is analogous to opening a web browser and entering the web address manually. We refer to a page's location on the internet as its web address or Uniform Resource Locator (URL).

The specific page we need is https://www.edinburgh.gov.uk/directory/10258/a-to-z/A

In [None]:
header = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36"}
base = "https://www.edinburgh.gov.uk/directory/10258/a-to-z/"
abc = list(string.ascii_uppercase)

Let's unpack the above code, as we haven't seen it before. The first line creates a variable called `header`, which holds information identifying the type of web browser making the request. This is an important step, as many modern websites can detect whether a URL has been requested using programming code and refuse to accept the request. The information in the `header` variable essentially tells the website that you are making the request using a web browser. This could be considered deceptive, but it is a technical barrier, not an ethical one (we will discuss ethical barriers in the lecture).
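
To see what the `header` variable changes, the sketch below builds a request without sending it and compares the user-agent that `requests` would use by default with the one we supply. The shortened user-agent string is an illustrative assumption; the lesson's code uses the full Chrome string.

```python
import requests

# Shortened, illustrative user-agent string (the lesson uses a full Chrome string)
header = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

# What requests identifies itself as when no user-agent is supplied
default_agent = requests.utils.default_user_agent()

# Build the request without sending it, so we can inspect its headers
session = requests.Session()
request = requests.Request("GET", "https://www.edinburgh.gov.uk/", headers=header)
prepared = session.prepare_request(request)

print(default_agent)
print(prepared.headers["User-Agent"])
```

The first value printed begins with `python-requests`, which is how a website can tell that a request came from code rather than a browser.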

The second line defines a base / stem of a URL - that is, it is not a complete URL but can be if the right information is appended to the end of it. For example *https://www.edinburgh.gov.uk/directory/10258/a-to-z/A* is a valid URL (the web page containing the list of organisations whose name begins with the letter 'A'). We define a base URL because we have many web pages to loop over.

The final line creates a variable called `abc` that contains a list of the letters in the English alphabet.
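
As a brief sketch of how the base URL and the letters combine, the example below builds the full set of web addresses without requesting any of them. The variable name `urls` is introduced here for illustration only.

```python
import string

base = "https://www.edinburgh.gov.uk/directory/10258/a-to-z/"

# One URL per letter of the alphabet: .../a-to-z/A, .../a-to-z/B, and so on
urls = [base + letter for letter in string.ascii_uppercase]

print(len(urls))
print(urls[0])
print(urls[-1])
```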

**TASK:** Print the contents of the `abc` variable.

In [None]:
# INSERT CODE HERE

OK let's start requesting the list of organisations.

In [None]:
# Define a variable to store the list of organisations

org_list = []

# Loop over the A-Z list of organisations
    
for l in abc:
    url = base + str(l) # build the URL to be requested
    response = requests.get(url, headers=header) # request the web address
    
    if response.status_code==200: # if the web page is successfully requested
        orgs = soup(response.text, "html.parser")
        try:
            results = orgs.find("ul", class_="list list--record").find_all("li") # find all organisations listed on page
            for el in results:
                name = el.find("a").text # extract organisation name from <a> tag
                link = el.find("a").get("href") # extract organisation URL from <a> tag
                obs = {"org_name": name, "org_url": link} # create an observation (dictionary) with details of each organisation
                #print(obs)
                org_list.append(obs)
        except AttributeError: # raised when the list or a link cannot be found on the page
            print("Could not find list of organisations for letter {}".format(l))

**TASK**: Write pseudo-code that explains the logic and steps in the above code.

**TASK**: Check the web pages that did not provide information on organisations. Are there genuinely no organisations listed or is the code failing to identify them?

The above code should produce a list of observations containing details for each organisation listed under each web page.
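
The error handling in the loop matters because `find()` returns `None` when no element matches, and calling `find_all()` on `None` raises an error. The sketch below illustrates this with two made-up HTML snippets that borrow the class name used above. The function `extract_orgs` is a hypothetical helper, not part of the lesson's code.

```python
from bs4 import BeautifulSoup

# Made-up snippets: one page with a list of organisations, one without
page_with_list = """
<ul class="list list--record">
  <li><a href="/directory-record/1/example-org">Example Org</a></li>
</ul>
"""
page_without_list = "<p>No records found.</p>"

def extract_orgs(html):
    """Return a list of organisation dictionaries, or an empty list if none are found."""
    page = BeautifulSoup(html, "html.parser")
    listing = page.find("ul", class_="list list--record")
    if listing is None:  # find() returns None when no element matches
        return []
    orgs = []
    for item in listing.find_all("li"):
        link = item.find("a")
        orgs.append({"org_name": link.text, "org_url": link.get("href")})
    return orgs

print(extract_orgs(page_with_list))
print(extract_orgs(page_without_list))
```

Note that passing a string containing a space to `class_` matches the exact value of the `class` attribute, so a page that listed the same classes in a different order would not match.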

In [None]:
print(len(org_list)) # display number of elements in list
org_list # display contents of list

Excellent, we have a list of organisations that provide warm and welcoming spaces. Now it's time to go to each organisation's web page on the Council's website and extract the information of interest. See an example below.

In [None]:
from IPython.display import IFrame

IFrame("https://www.edinburgh.gov.uk/directory-record/1697699/action-porty-", width="1000", height="650")

Uh oh, we cannot request and display the web page through this notebook. Most modern websites have protections in place that prevent their contents being displayed elsewhere - otherwise it would be possible to display a website under a different URL / location and claim it as your own. This isn't an issue for web scraping but simply an example of how real, functioning websites provide additional challenges when collecting data.
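
One common way websites implement this kind of protection is through response headers: `X-Frame-Options`, or a `Content-Security-Policy` header containing a `frame-ancestors` directive. Whether the Council's website uses either is an assumption. The sketch below uses a hypothetical helper, `blocks_framing`, and made-up header values to show how such headers could be checked.

```python
def blocks_framing(headers):
    """Return True if the headers ask browsers to restrict display inside a frame."""
    # Header names are case-insensitive, so normalise names and values to lower case
    lowered = {name.lower(): value.lower() for name, value in headers.items()}
    if "x-frame-options" in lowered:
        return True
    return "frame-ancestors" in lowered.get("content-security-policy", "")

print(blocks_framing({"X-Frame-Options": "SAMEORIGIN"}))
print(blocks_framing({"Content-Type": "text/html"}))
```

With a real request, the `response.headers` object returned by `requests` could be passed to this function in the same way.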

Let's look at the source code of the web page using our browser.

### Parsing the web page and extracting information

Let's speed up the process by requesting each organisation's web page, parsing it as HTML and extracting the information required in a single block of code.

In [None]:
org_details = []
base = "https://www.edinburgh.gov.uk"

for org in org_list:
    url = base + org["org_url"]
    
    response = requests.get(url, headers=header) # request the web address
    if response.status_code==200:
        soup_org = soup(response.text, "html.parser")
        results = soup_org.find("dl", class_="list list--definition definition")
        #print(results)
        
        dts = results.find_all("dt")
        dt_list = []
        for term in dts: # named "term" because "dt" already refers to the datetime module
            dt_list.append(term.text.strip())
            
        dds = results.find_all("dd")
        dd_list = []
        for dd in dds:
            dd_list.append(dd.text.strip())
        
        obs = dict(zip(dt_list, dd_list))
        obs["org_name"] = org["org_name"]
        obs["org_url"] = url
        #print(obs)
        
        org_details.append(obs)
    else:
        print("Could not request webpage for organisation {}".format(org["org_name"]))

In [None]:
org_details[0:3] # display the first three elements in the list
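
The line `dict(zip(dt_list, dd_list))` is what turns the two lists of labels and values into a single observation. The sketch below shows the pairing on made-up values; the labels and values are illustrative and not taken from the Council's website.

```python
# Made-up labels (<dt>) and values (<dd>) as they might appear after extraction
dt_list = ["Address", "Telephone", "Opening times"]
dd_list = ["1 Example Street", "0131 000 0000", "Mon-Fri 9am-5pm"]

# zip() pairs the first label with the first value, the second with the second, ...
obs = dict(zip(dt_list, dd_list))
print(obs)

# If the lists differ in length, zip() silently stops at the shorter one
short = dict(zip(dt_list, dd_list[:2]))
print(short)
```

Because of the second behaviour, a page with a label but no matching value would produce a shorter dictionary rather than an error, which is worth checking for when inspecting the results.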

### Saving results from the scrape

Let's conclude by saving the scraped data to a file for future use.

In [None]:
outfile = la_data + "coe-warm-spaces-" + ddate + ".json"
with open(outfile, "w", encoding="utf-8") as f:
    json.dump(org_details, f)

How do we know this worked? The simplest way is to check whether a) the file was created, and b) the results were written to it.

In [None]:
# Check presence of file in the local authorities data folder

os.listdir("./data/local-authorities")

In [None]:
# Open file and read (import) its contents

with open(outfile, "r", encoding="utf-8") as f:
    data = json.load(f)
    
print(data)  
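
JSON suits these results because each organisation can have a different set of fields. If a spreadsheet-friendly file is preferred, one possible approach is sketched below using the `csv` module imported earlier. The records, the file name and the use of a temporary folder are illustrative assumptions.

```python
import csv
import os
import tempfile

# Made-up records; note the second one lacks a "Telephone" field
records = [
    {"org_name": "Example Org A", "Address": "1 Example Street", "Telephone": "0131 000 0000"},
    {"org_name": "Example Org B", "Address": "2 Example Road"},
]

# Collect every field name that appears in any record, preserving first-seen order
fieldnames = []
for record in records:
    for key in record:
        if key not in fieldnames:
            fieldnames.append(key)

csv_file = os.path.join(tempfile.gettempdir(), "example-warm-spaces.csv")
with open(csv_file, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")  # blank for missing fields
    writer.writeheader()
    writer.writerows(records)

# Read the file back to confirm what was written
with open(csv_file, "r", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

print(rows)
```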

And voilà, we have successfully scraped multiple web pages!

## What have we learned?

Let's recap what key skills and techniques we've learned:
* **How to import modules**. You will usually need to import modules into Python to support your work. Python does come with some methods and functions that are ready to use straight away, but for computational social science tasks you'll almost certainly need to import some additional modules.
* **How to request and parse web pages**. You can use Python to request a web page, and the `BeautifulSoup` module to parse its contents.
* **How to read and write data**. You can save the results of your scrape to a file for future use.
* **How to do all of the above in an efficient, clear and effective manner**.

## Conclusion

The above examples demonstrate the basics of using computational methods for collecting social science data from real websites. These techniques should be sufficient for most social science-relevant data collection exercises. However, there are more complicated examples:
* Data may only appear on a web page after user input (e.g., https://www.charities.gov.sg/Pages/AdvanceSearch.aspx).
* Data may be contained in embedded maps and thus are more difficult to extract (e.g., https://www.foodaidnetwork.org.uk/our-members).
* And many other potential issues.

### Exercise

Returning to our example from City of Edinburgh Council, see if you can scrape the list of public libraries:
* https://www.edinburgh.gov.uk/directory/10199/library-locations-and-opening-hours

In [None]:
# INSERT CODE HERE

In [None]:
# INSERT CODE HERE

The solution is provided at the end of this notebook.

## Appendix A

### Exercise Solution

In [None]:
header = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36"}
base = "https://www.edinburgh.gov.uk/directory/10199/a-to-z/"
abc = list(string.ascii_uppercase)

org_list = []
    
for l in abc:
    url = base + str(l)
    print(url)
    response = requests.get(url, headers=header) # request the web address
    
    if response.status_code==200:
        orgs = soup(response.text, "html.parser")
        try:
            results = orgs.find("ul", class_="list list--record").find_all("li")
            for el in results:
                name = el.find("a").text
                link = el.find("a").get("href")
                obs = {"org_name": name, "org_url": link}
                #print(obs)
                org_list.append(obs)
        except AttributeError: # raised when the list or a link cannot be found on the page
            print("Could not find list of organisations")
            
#print(response.text)

In [None]:
org_details = []
base = "https://www.edinburgh.gov.uk"
for org in org_list:
    url = base + org["org_url"]
    
    response = requests.get(url, headers=header) # request the web address
    if response.status_code==200:
        soup_org = soup(response.text, "html.parser")
        results = soup_org.find("dl", class_="list list--definition definition")
        #print(results)
        
        dts = results.find_all("dt")
        dt_list = []
        for term in dts: # named "term" because "dt" already refers to the datetime module
            dt_list.append(term.text.strip())
            
        dds = results.find_all("dd")
        dd_list = []
        for dd in dds:
            dd_list.append(dd.text.strip())
        
        obs = dict(zip(dt_list, dd_list))
        obs["org_name"] = org["org_name"]
        obs["org_url"] = url
        #print(obs)
        
        org_details.append(obs)
    else:
        print("Could not request webpage")

In [None]:
org_details[0:5]

--END OF FILE--