![SGSSS Logo](../img/SGSSS_Stacked.png)

# Collecting Digital Data for Social Scientists

## Introduction

Computational methods are transforming the social sciences, enabling researchers to collect, analyse, and interpret data at scales and speeds that were previously impossible. One of the most powerful techniques in this toolkit is **web scraping** — the automated extraction of information from websites. Web scraping allows social scientists to create new datasets from digital sources, turning the vast and often unstructured content of the internet into structured, analysable data.

This practical session introduces you to web scraping using R. We will start with a simple example — extracting text from a single web page — and then move on to a more realistic scenario involving multiple pages. By the end of this session, you will have a solid foundation for collecting digital data from the web.

## Aims

1. **Demonstrate how R can be used for web scraping** — from requesting web pages, to parsing HTML, extracting information, and saving results.
2. **Cultivate computational thinking skills** — breaking down a data collection problem into a series of logical, repeatable steps.

## Lesson Details

| | |
| --- | --- |
| **Level** | Introductory |
| **Time** | ~45 minutes |
| **Pre-requisites** | None |
| **Learning outcomes** | Understand the key steps involved in web scraping |
| | Be able to use R to request a web page |
| | Be able to use R to parse HTML content |
| | Be able to use R to extract specific information from a web page |
| | Be able to use R to save scraped data to a file |

## Guide to Using This Resource

This is a **Jupyter Notebook** — an interactive document that combines text, code, and output in a single environment. If you are viewing this in **Google Colab**, you are running the notebook in the cloud, which means you do not need to install anything on your own machine.

**Note: This notebook uses R.** In Google Colab, you need to change the runtime: go to **Runtime > Change runtime type > select R**.

A notebook is made up of **cells**. There are two main types:

- **Markdown cells** contain formatted text (like this one). They provide explanations, instructions, and context.
- **Code cells** contain R code that you can execute. Code cells are displayed with a grey background and have a play button on the left.

To **run a cell**, click on it and press `Shift+Enter` (or click the play button). The output will appear directly below the cell. You should run the code cells **in order**, from top to bottom, as later cells often depend on variables or packages loaded in earlier cells.

If you are new to Jupyter Notebooks and would like a more detailed introduction, see the excellent materials by Dani Arribas-Bel: [https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb](https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb)

In [None]:
install.packages(c("httr", "rvest", "jsonlite"))
library(httr)
library(rvest)
library(jsonlite)
cat("Successfully loaded packages\n")

In [None]:
name <- readline("Enter your name: ")
cat(paste0("\nHello ", name, ", enjoy learning about R and web scraping!\n"))

## General Approach

Web scraping follows a consistent pattern regardless of the website or the data you want to collect. Before writing any code, there are things you need to **KNOW** and things you need to **DO**.

**What you need to KNOW:**
- The **URL** (web address) of the page(s) containing the data you want.
- The **HTML structure** of the page — specifically, which HTML tags and attributes contain the information you need.

**What you need to DO:**
1. **Request** the web page (download the HTML).
2. **Parse** the HTML (turn the raw text into a structured, searchable object).
3. **Extract** the specific information you need.
4. **Save** the results to a file.

This four-step pattern — request, parse, extract, save — is the foundation of nearly all web scraping tasks. It can be expressed as **pseudo-code**, which is an informal, plain-language description of the steps a program needs to follow. Writing pseudo-code before you write real code is an excellent habit: it helps you think through the logic of your task without getting bogged down in syntax.
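
As a rough illustration, the four steps might be sketched in R as follows. This is only a sketch under stated assumptions: the request step is replaced by a hard-coded, made-up HTML string so that the example runs without an internet connection, and the output goes to a temporary file chosen purely for illustration.

```r
library(rvest)

# 1. REQUEST: a real scraper would download the page here.
#    Assumption: we substitute a made-up HTML string to stay self-contained.
raw_html <- "<html><body><div><p>This is a paragraph.</p></div></body></html>"

# 2. PARSE: turn the raw text into a structured, searchable document.
page <- read_html(raw_html)

# 3. EXTRACT: find the first <p> element and pull out its text.
paragraph <- html_text2(html_element(page, "p"))

# 4. SAVE: write the result to a file (a temporary one in this sketch).
outfile <- tempfile(fileext = ".txt")
writeLines(paragraph, outfile)

# Read the file back to confirm what was saved.
readLines(outfile)
```

The rest of this session fills in each of these steps with real requests to real pages.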

## Simple Text Extraction

We will begin with a simple example: extracting a passage of text from a single web page. The website we will use is [httpbin.org/html](https://httpbin.org/html), which serves a short excerpt from *Moby Dick* by Herman Melville. This is a deliberately simple page, which makes it ideal for learning the basics of web scraping.

### Identifying the web address

The first thing we need to know is the **URL** of the page we want to scrape. In this case, the address is:

> [https://httpbin.org/html](https://httpbin.org/html)

If you visit this URL in your browser, you will see a short passage of text from *Moby Dick*. This is the data we want to extract.

### Locating information in the HTML

Web pages are written in **HTML** (HyperText Markup Language). HTML uses **tags** to structure content. For example, a paragraph of text is enclosed in `<p>` tags:

```html
<p>This is a paragraph.</p>
```

To scrape a web page, we need to identify which HTML tags contain the information we want. You can view the HTML source code of any web page in your browser by right-clicking on the page and selecting **"View Page Source"** (or pressing `Ctrl+U`).

If you view the HTML source of [httpbin.org/html](https://httpbin.org/html), you will see that:

- The entire page is wrapped in `<html>` tags.
- The visible content is inside the `<body>` tag.
- The text we want is inside a `<p>` (paragraph) tag, which is nested inside a `<div>` tag.

This tells us that to extract the text, we need to find the `<p>` tag and get its text content.

### Requesting the web page

Now we are ready to write some code. We first **load** the R packages we need, then request the page and check the server's response.

In [None]:
library(httr)
library(rvest)

link <- "https://httpbin.org/html"
response <- GET(link)
status_code(response)

We have loaded two packages:

- **`httr`** — a package for making HTTP requests (i.e., downloading web pages). The `GET()` function sends a request to a URL, similar to what your browser does when you visit a page.
- **`rvest`** — a package for parsing HTML and extracting information from it. It provides a set of functions for navigating and querying HTML documents.

Let's break down what just happened:

1. We defined the URL of the page we want to scrape and stored it in a variable called `link`.
2. We used `GET()` to send an HTTP GET request to that URL.
3. The server's response is stored in a variable called `response`.
4. We checked the **status code** of the response. A status code of **200** means the request was successful (the page was found and returned). Other common status codes include 404 (page not found) and 403 (access forbidden).
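
One way to make the meaning of a status code explicit in code is a small helper function. Note that `describe_status()` below is a hypothetical helper written for this illustration; it is not a function from `httr`, and it only covers the codes mentioned above.

```r
# Hypothetical helper (not part of httr): describe a numeric HTTP status code.
describe_status <- function(code) {
  if (code >= 200 && code < 300) {
    "success"
  } else if (code == 403) {
    "access forbidden"
  } else if (code == 404) {
    "page not found"
  } else {
    "other response"
  }
}

# Try the helper on the three codes discussed above.
sapply(c(200, 404, 403), describe_status)
```

In a real script you would pass in the value returned by `status_code(response)`. `httr` also provides `http_error()` and `stop_for_status()` for checking a response object directly.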

### Parsing the web page

The raw HTML is just a long string of text. To search through it and extract specific elements, we need to **parse** it — that is, turn it into a structured object that R can navigate.

In [None]:
page <- read_html(content(response, "text"))
page

We pass the raw HTML text (`content(response, "text")`) to `read_html()` from the `rvest` package. The result, `page`, is a parsed HTML document that we can search and navigate using R functions. The output may look different from the raw HTML, but it is now a structured object rather than a plain string.

### Extracting information

Now we can use `rvest`'s `html_element()` function to locate the first `<p>` tag on the page and `html_text2()` to extract its text content.

In [None]:
paragraph <- html_text2(html_element(page, "p"))
cat(paragraph)
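
The page above contains only one paragraph, so it may help to see how `html_element()` and its plural counterpart `html_elements()` behave when there are several matches. The sketch below uses a made-up HTML string (an assumption for illustration, not the httpbin page).

```r
library(rvest)

# Made-up HTML with two paragraphs, purely for illustration.
demo_page <- read_html(
  "<html><body><div><p>First paragraph.</p><p>Second paragraph.</p></div></body></html>"
)

# html_element() returns only the first match.
first_p <- html_text2(html_element(demo_page, "p"))

# html_elements() returns every match, so we get a character vector.
all_p <- html_text2(html_elements(demo_page, "p"))

first_p
all_p
```

We will rely on `html_elements()` later, when a page lists many organisations at once.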

### Saving results

The final step is to save our extracted data to a file. First, we need to create a folder to store the output.

In [None]:
dir.create("downloads", showWarnings = FALSE)
writeLines(paragraph, "downloads/moby-dick-scraped-data.txt")

In [None]:
readLines("downloads/moby-dick-scraped-data.txt")

## Multi-Page Scraping

In practice, the data you need is rarely on a single page. A more realistic scenario involves collecting information spread across **multiple pages** of a website. In this section, we will scrape the **Edinburgh Council Warm Spaces Directory**, which lists organisations across the city that offer warm, welcoming spaces for members of the public.

The directory is organised as an **A-to-Z listing**: there is a separate page for each letter of the alphabet (e.g., one page for organisations beginning with "A", another for "B", and so on). Each page contains a list of organisation names, each of which links to a detail page with further information (address, opening hours, etc.).

This means we need **two loops**:
1. A first loop to visit each A-Z page and collect the names and links of all organisations.
2. A second loop to visit each organisation's detail page and extract the relevant information.

### Setup

In [None]:
library(httr)
library(rvest)
library(jsonlite)

dir.create("data", showWarnings = FALSE)

### Building the URL list

To scrape all 26 pages, we need to construct the URL for each letter. The pattern is:

> `https://www.edinburgh.gov.uk/directory/10258/a-to-z/A`
> `https://www.edinburgh.gov.uk/directory/10258/a-to-z/B`
> ... and so on.

We can generate these URLs programmatically using R's built-in `LETTERS` constant.

In [None]:
header <- add_headers(`User-Agent` = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
base <- "https://www.edinburgh.gov.uk/directory/10258/a-to-z/"
abc <- LETTERS
print(abc)

A few things to note:

- **`header`**: Some websites block requests that do not include a `User-Agent` header, because they look like automated bots rather than real browsers. By including a header that mimics a standard web browser, we make our requests look like normal web traffic. This is common practice in web scraping. In R, we use `add_headers()` from the `httr` package to set custom headers.
- **`base`**: This is the base URL for the A-Z directory. We will append each letter of the alphabet to this base to construct the full URL for each page.
- **`abc`**: `LETTERS` is a built-in R constant that gives us all 26 uppercase letters of the English alphabet as a character vector: `"A" "B" "C" ... "Z"`.
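
To see what the full set of addresses looks like, here is a minimal sketch that builds all 26 URLs in one step. The variable name `urls` is introduced here for illustration only; the loop later in this section builds each URL one at a time instead.

```r
base <- "https://www.edinburgh.gov.uk/directory/10258/a-to-z/"

# paste0() is vectorised: it appends each letter to the base URL in turn.
urls <- paste0(base, LETTERS)

length(urls)   # one URL per letter
head(urls, 3)  # the first three URLs
```

No web requests are made here; we are only constructing text.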

### First loop: collect organisation names and links

In this step, we loop over each letter of the alphabet, visit the corresponding A-Z page, and extract the name and link for every organisation listed on that page.

In [None]:
org_list <- list()

for (letter in abc) {
  url <- paste0(base, letter)
  response <- GET(url, header)

  if (status_code(response) == 200) {
    page <- read_html(content(response, "text"))
    items <- tryCatch({
      page |> html_elements("ul.list.list--record li")
    }, error = function(e) NULL)

    if (!is.null(items) && length(items) > 0) {
      for (item in items) {
        name <- html_text2(html_element(item, "a"))
        link <- html_attr(html_element(item, "a"), "href")
        org_list <- append(org_list, list(list(org_name = name, org_url = link)))
      }
    } else {
      cat(paste("No organisations found for letter", letter, "\n"))
    }
  }
}

cat(paste("Found", length(org_list), "organisations\n"))

Let's walk through the logic of this loop:

1. We start with an empty list called `org_list` to store our results.
2. For each letter in the alphabet, we construct the full URL by pasting the letter onto the base URL using `paste0()`.
3. We request the page using `GET()` and check that the status code is 200 (success).
4. We parse the HTML using `read_html()` and look for `<li>` elements inside a `<ul>` tag with the class `"list list--record"` — this is the unordered list that contains the directory entries.
5. For each list item, we extract the organisation name (the text inside the `<a>` tag using `html_text2()`) and the link (the `href` attribute of the `<a>` tag using `html_attr()`).
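6. We wrap the element selection in a `tryCatch()` block as a safeguard against unexpected errors. Note that `html_elements()` returns an empty result rather than an error when nothing matches, so letters with no organisations listed are handled by the `length(items) > 0` check.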
7. Each organisation is stored as a named list with two elements: `org_name` and `org_url`.
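
To try the extraction logic without contacting the website, the sketch below applies the same selectors to a small piece of made-up HTML. The organisation names and links are invented, and the markup only imitates the structure described above, so treat it as an assumption rather than a copy of the real page.

```r
library(rvest)

# Made-up HTML imitating the directory listing; names and links are invented.
listing <- read_html('
  <ul class="list list--record">
    <li><a href="/directory-record/1/example-library">Example Library</a></li>
    <li><a href="/directory-record/2/example-church">Example Church</a></li>
  </ul>')

items <- html_elements(listing, "ul.list.list--record li")

# html_element() applied to a set of items returns one <a> per item,
# so names and links can be extracted without an explicit loop.
orgs <- data.frame(
  org_name = html_text2(html_element(items, "a")),
  org_url  = html_attr(html_element(items, "a"), "href"),
  stringsAsFactors = FALSE
)
orgs

# A selector that matches nothing gives an empty result, not an error.
no_match <- html_elements(listing, "ul.no-such-class li")
length(no_match)
```

The data frame is just one possible way to hold the results; the loop above stores the same information as a list of named lists.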

In [None]:
org_list[1:3]

### Second loop: visit each organisation's page

Now that we have a list of organisations and their URLs, we can visit each organisation's detail page and extract the information displayed there (e.g., address, opening hours, contact details).

In [None]:
org_details <- list()
base_url <- "https://www.edinburgh.gov.uk"

for (org in org_list) {
  url <- paste0(base_url, org$org_url)
  response <- GET(url, header)

  if (status_code(response) == 200) {
    page <- read_html(content(response, "text"))
    dl <- html_element(page, "dl.list.list--definition.definition")

    keys <- html_text2(html_elements(dl, "dt"))
    values <- html_text2(html_elements(dl, "dd"))

    obs <- setNames(as.list(values), keys)
    obs$org_name <- org$org_name
    obs$org_url <- url
    org_details <- append(org_details, list(obs))
  } else {
    cat(paste("Could not request page for", org$org_name, "\n"))
  }
}

cat(paste("Collected details for", length(org_details), "organisations\n"))
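
The key step in this loop is turning the `<dt>` (term) and `<dd>` (description) pairs of a definition list into a named list. The sketch below shows that step on made-up HTML; the address and opening hours are invented, and the markup is an assumption based on the selector used above.

```r
library(rvest)

# Made-up detail page fragment; the values are invented for illustration.
detail <- read_html('
  <dl class="list list--definition definition">
    <dt>Address</dt><dd>1 Example Street</dd>
    <dt>Opening hours</dt><dd>Monday to Friday, 9am to 5pm</dd>
  </dl>')

dl <- html_element(detail, "dl.list.list--definition.definition")
keys <- html_text2(html_elements(dl, "dt"))
values <- html_text2(html_elements(dl, "dd"))

# setNames() pairs keys and values by position, so it is worth
# checking that there are as many <dt> as <dd> elements.
stopifnot(length(keys) == length(values))

obs <- setNames(as.list(values), keys)
obs[["Address"]]
```

If a real page had an unequal number of `<dt>` and `<dd>` elements, the names and values would no longer line up, which is why the length check is included in this sketch.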

### Saving to JSON

We will save the collected data as a **JSON** file — a widely used, human-readable format for structured data. We include the current date in the filename so that we know when the data was collected.

In [None]:
write_json(org_details, paste0("data/edinburgh-warm-spaces-", Sys.Date(), ".json"))
cat("Saved!\n")
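
To check that a JSON file contains what you expect, you can read it back into R. The sketch below does this with two invented records and a temporary file (both assumptions for illustration). It also passes `auto_unbox = TRUE`, an optional `jsonlite` setting not used in the cell above, which writes single values as plain strings rather than one-element arrays.

```r
library(jsonlite)

# Invented records with the same shape as our scraped data.
records <- list(
  list(org_name = "Example Library", Address = "1 Example Street"),
  list(org_name = "Example Church", Address = "2 Example Road")
)

# Write to a temporary file; pretty = TRUE adds indentation for readability.
outfile <- tempfile(fileext = ".json")
write_json(records, outfile, auto_unbox = TRUE, pretty = TRUE)

# Read the file back as a list and inspect the first record.
restored <- read_json(outfile)
restored[[1]]$org_name
```

Without `auto_unbox = TRUE`, each value would be stored as a one-element array (for example `["Example Library"]`), which is still valid JSON but slightly more awkward to work with.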

## Exercise

Now it's your turn! Using the same approach we used for the warm spaces directory, scrape the **Edinburgh Council Library Locations** directory:

> [https://www.edinburgh.gov.uk/directory/10199/library-locations-and-opening-hours](https://www.edinburgh.gov.uk/directory/10199/library-locations-and-opening-hours)

The URL pattern for the A-Z pages is:

> `https://www.edinburgh.gov.uk/directory/10199/a-to-z/{letter}`

Use the skeleton code below to guide you. Replace the `# INSERT CODE HERE` comments with your own code, following the same pattern as the warm spaces example above.

In [None]:
# Exercise: Scrape Edinburgh library locations
# The URL pattern is: https://www.edinburgh.gov.uk/directory/10199/a-to-z/{letter}

# Step 1: Define your variables
header <- add_headers(`User-Agent` = "Mozilla/5.0")
base <- "https://www.edinburgh.gov.uk/directory/10199/a-to-z/"
abc <- LETTERS
base_url <- "https://www.edinburgh.gov.uk"

# Step 2: Loop over A-Z pages and collect library names and links
library_list <- list()

# INSERT CODE HERE

cat(paste("Found", length(library_list), "libraries\n"))

In [None]:
# Step 3: Visit each library page and extract details
library_details <- list()

# INSERT CODE HERE

cat(paste("Collected details for", length(library_details), "libraries\n"))

In [None]:
# Step 4: Save results to JSON

# INSERT CODE HERE

In practice, before scraping a website, always check whether the data is available through an **API** (Application Programming Interface). APIs provide structured, reliable access to data and avoid many of the legal and ethical issues associated with web scraping. We'll explore APIs in the next session.

## Appendix: Exercise Solution

In [None]:
# Exercise Solution: Scrape Edinburgh library locations

library(httr)
library(rvest)
library(jsonlite)

# Step 1: Define variables
header <- add_headers(`User-Agent` = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)")
base <- "https://www.edinburgh.gov.uk/directory/10199/a-to-z/"
abc <- LETTERS
base_url <- "https://www.edinburgh.gov.uk"

# Step 2: Loop over A-Z pages and collect library names and links
library_list <- list()

for (letter in abc) {
  url <- paste0(base, letter)
  response <- GET(url, header)

  if (status_code(response) == 200) {
    page <- read_html(content(response, "text"))
    items <- tryCatch({
      page |> html_elements("ul.list.list--record li")
    }, error = function(e) NULL)

    if (!is.null(items) && length(items) > 0) {
      for (item in items) {
        name <- html_text2(html_element(item, "a"))
        link <- html_attr(html_element(item, "a"), "href")
        library_list <- append(library_list, list(list(library_name = name, library_url = link)))
      }
    } else {
      cat(paste("No libraries found for letter", letter, "\n"))
    }
  }
}

cat(paste("Found", length(library_list), "libraries\n"))

# Step 3: Visit each library page and extract details
library_details <- list()

for (lib in library_list) {
  url <- paste0(base_url, lib$library_url)
  response <- GET(url, header)

  if (status_code(response) == 200) {
    page <- read_html(content(response, "text"))
    dl <- html_element(page, "dl.list.list--definition.definition")

    keys <- html_text2(html_elements(dl, "dt"))
    values <- html_text2(html_elements(dl, "dd"))

    obs <- setNames(as.list(values), keys)
    obs$library_name <- lib$library_name
    obs$library_url <- url
    library_details <- append(library_details, list(obs))
  } else {
    cat(paste("Could not request page for", lib$library_name, "\n"))
  }
}

cat(paste("Collected details for", length(library_details), "libraries\n"))

# Step 4: Save results to JSON
outfile <- paste0("data/edinburgh-libraries-", Sys.Date(), ".json")
write_json(library_details, outfile)
cat(paste("Saved", length(library_details), "libraries to", outfile, "\n"))
