![SGSSS Logo](../img/SGSSS_Stacked.png)

# Collecting Digital Data for Social Scientists

## Introduction

Computational methods are transforming the social sciences, enabling researchers to collect, analyse, and interpret data at scales and speeds that were previously impossible. One of the most powerful techniques in this toolkit is **web scraping** — the automated extraction of information from websites. Web scraping allows social scientists to create new datasets from digital sources, turning the vast and often unstructured content of the internet into structured, analysable data.

This practical session introduces you to web scraping using Python. We will start with a simple example — extracting text from a single web page — and then move on to a more realistic scenario involving multiple pages. By the end of this session, you will have a solid foundation for collecting digital data from the web.

## Aims

1. **Demonstrate how Python can be used for web scraping** — from requesting web pages, to parsing HTML, extracting information, and saving results.
2. **Cultivate computational thinking skills** — breaking down a data collection problem into a series of logical, repeatable steps.

## Lesson Details

| | |
| --- | --- |
| **Level** | Introductory |
| **Time** | ~45 minutes |
| **Pre-requisites** | None |
| **Learning outcomes** | Understand the key steps involved in web scraping |
| | Be able to use Python to request a web page |
| | Be able to use Python to parse HTML content |
| | Be able to use Python to extract specific information from a web page |
| | Be able to use Python to save scraped data to a file |

## Guide to Using This Resource

This is a **Jupyter Notebook** — an interactive document that combines text, code, and output in a single environment. If you are viewing this in **Google Colab**, you are running the notebook in the cloud, which means you do not need to install anything on your own machine.

A notebook is made up of **cells**. There are two main types:

- **Markdown cells** contain formatted text (like this one). They provide explanations, instructions, and context.
- **Code cells** contain Python code that you can execute. Code cells are displayed with a grey background and have a play button on the left.

To **run a cell**, click on it and press `Shift+Enter` (or click the play button). The output will appear directly below the cell. You should run the code cells **in order**, from top to bottom, as later cells often depend on variables or modules defined in earlier cells.

If you are new to Jupyter Notebooks and would like a more detailed introduction, see the excellent materials by Dani Arribas-Bel: [https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb](https://github.com/darribas/gds19/blob/master/content/labs/lab_00.ipynb)

In [None]:
print("Enter your name and press enter:")
name = input()
print(f"\nHello {name}, enjoy learning about Python and web scraping!")

## General Approach

Web scraping follows a consistent pattern regardless of the website or the data you want to collect. Before writing any code, there are things you need to **KNOW** and things you need to **DO**.

**What you need to KNOW:**
- The **URL** (web address) of the page(s) containing the data you want.
- The **HTML structure** of the page — specifically, which HTML tags and attributes contain the information you need.

**What you need to DO:**
1. **Request** the web page (download the HTML).
2. **Parse** the HTML (turn the raw text into a structured, searchable object).
3. **Extract** the specific information you need.
4. **Save** the results to a file.

This four-step pattern — request, parse, extract, save — is the foundation of nearly all web scraping tasks. It can be expressed as **pseudo-code**, which is an informal, plain-language description of the steps a program needs to follow. Writing pseudo-code before you write real code is an excellent habit: it helps you think through the logic of your task without getting bogged down in syntax.
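
As an illustration, the four steps can be written down as a plain-language plan before any scraping code exists. The sketch below is only a hypothetical outline: the `steps` list, its wording, and the `format_plan` helper are assumptions made for this example and are not part of any library.

```python
# A plain-language plan for a scraping task, stored as (step, description) pairs.
# Nothing is downloaded here; the point is to fix the order of the steps first.
steps = [
    ("request", "download the HTML of the page at a known URL"),
    ("parse", "turn the raw HTML text into a searchable object"),
    ("extract", "pull out the tags or attributes that hold the data"),
    ("save", "write the extracted data to a file"),
]


def format_plan(plan):
    """Return the plan as numbered lines of pseudo-code."""
    return [
        f"{number}. {name.upper()}: {description}"
        for number, (name, description) in enumerate(plan, start=1)
    ]


for line in format_plan(steps):
    print(line)
```

Each line of the plan later becomes one or two lines of real Python, so the outline doubles as a checklist while you write the code.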

## Simple Text Extraction

We will begin with a simple example: extracting a passage of text from a single web page. The website we will use is [httpbin.org/html](https://httpbin.org/html), which serves a short excerpt from *Moby Dick* by Herman Melville. This is a deliberately simple page, which makes it ideal for learning the basics of web scraping.

### Identifying the web address

The first thing we need to know is the **URL** of the page we want to scrape. In this case, the address is:

> [https://httpbin.org/html](https://httpbin.org/html)

If you visit this URL in your browser, you will see a short passage of text from *Moby Dick*. This is the data we want to extract.

### Locating information in the HTML

Web pages are written in **HTML** (HyperText Markup Language). HTML uses **tags** to structure content. For example, a paragraph of text is enclosed in `<p>` tags:

```html
<p>This is a paragraph.</p>
```

To scrape a web page, we need to identify which HTML tags contain the information we want. You can view the HTML source code of any web page in your browser by right-clicking on the page and selecting **"View Page Source"** (or pressing `Ctrl+U`).

If you view the HTML source of [httpbin.org/html](https://httpbin.org/html), you can see that:

- The entire page is wrapped in `<html>` tags.
- The visible content is inside the `<body>` tag.
- The text we want is inside a `<p>` (paragraph) tag, which is nested inside a `<div>` tag.

This tells us that to extract the text, we need to find the `<p>` tag and get its text content.
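
To see this kind of nesting without opening a browser, the following sketch prints each opening tag of a small HTML string, indented by how deeply it is nested. The `sample_html` string is a simplified stand-in written for this example (it is not the actual source of httpbin.org/html), and `TagPrinter` is a hypothetical helper built on Python's built-in `html.parser` module.

```python
from html.parser import HTMLParser

# Simplified stand-in for a web page (an assumption for this example)
sample_html = "<html><body><h1>Title</h1><div><p>Some text.</p></div></body></html>"


class TagPrinter(HTMLParser):
    """Record each opening tag, indented by its nesting depth."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + f"<{tag}>")
        self.depth += 1  # anything that follows is nested one level deeper

    def handle_endtag(self, tag):
        self.depth -= 1  # a closing tag moves back up one level


printer = TagPrinter()
printer.feed(sample_html)
print("\n".join(printer.lines))
```

The indentation in the output mirrors the structure described above: `<p>` sits inside `<div>`, which sits inside `<body>`. This simple helper assumes every tag is explicitly closed, which holds for the sample string but not for every real page.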

### Requesting the web page

Now we are ready to write some code. The first step is to **import** the Python modules (libraries) we need.

In [None]:
import os
import requests
from bs4 import BeautifulSoup as soup
print("Successfully imported necessary modules")

We have imported three modules:

- **`os`** — a built-in Python module for interacting with the operating system (e.g., creating folders).
- **`requests`** — a popular module for making HTTP requests (i.e., downloading web pages).
- **`BeautifulSoup`** (from the `bs4` package) — a module for parsing HTML and extracting information from it. We import it with the alias `soup` for convenience.

In [None]:
link = "https://httpbin.org/html"
response = requests.get(link)
response.status_code

Let's break down what just happened:

1. We defined the URL of the page we want to scrape and stored it in a variable called `link`.
2. We used `requests.get()` to send an HTTP GET request to that URL — this is the same thing your browser does when you visit a web page.
3. The server's response is stored in a variable called `response`.
4. We checked the **status code** of the response. A status code of **200** means the request was successful (the page was found and returned). Other common status codes include 404 (page not found) and 403 (access forbidden).
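
If you want to look up what a status code means, Python's built-in `http` module knows the standard names. The sketch below needs no internet connection; `describe_status` and `request_succeeded` are hypothetical helper names chosen for this example, and treating every code from 200 to 299 as a success is a simplifying assumption.

```python
from http import HTTPStatus


def describe_status(code):
    """Return a short description such as '404 Not Found'."""
    status = HTTPStatus(code)
    return f"{status.value} {status.phrase}"


def request_succeeded(code):
    """Treat any status code in the 2xx range as a success."""
    return 200 <= code < 300


for code in (200, 403, 404):
    print(describe_status(code), "| success:", request_succeeded(code))
```

In a scraper, a check like `request_succeeded(response.status_code)` could decide whether to parse a page or skip it.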

In [None]:
response.text

### Parsing the web page

The raw HTML is just a long string of text. To search through it and extract specific elements, we need to **parse** it — that is, turn it into a structured object that Python can navigate.

In [None]:
soup_response = soup(response.text, "html.parser")
soup_response

We pass the raw HTML text (`response.text`) to BeautifulSoup along with the parser we want to use (`"html.parser"`). The result, `soup_response`, is a BeautifulSoup object that we can search and navigate using Python methods. The output may look similar to the raw HTML, but it is now a structured object rather than a plain string.
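
The difference between the raw string and the parsed object is easier to see on a tiny example. The sketch below uses a hand-written HTML string (`raw_html`, an assumption for this example) rather than a downloaded page, so it runs without an internet connection.

```python
from bs4 import BeautifulSoup

# Hand-written stand-in for a downloaded page (an assumption for this example)
raw_html = (
    "<html><head><title>Demo</title></head>"
    "<body><h1>Heading</h1><p>First.</p></body></html>"
)

parsed = BeautifulSoup(raw_html, "html.parser")

# The raw HTML is an ordinary string; the parsed version is a BeautifulSoup object
print(type(raw_html).__name__)
print(type(parsed).__name__)

# A string can only be searched as text, whereas the parsed object can be asked for tags
print(parsed.title.text)
print(parsed.h1.text)
```

Accessing a tag as an attribute, as in `parsed.h1`, returns the first tag with that name, which is convenient for quick inspection.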

### Extracting information

Now we can use BeautifulSoup's `.find()` method to locate the `<p>` tag and extract its text content.

In [None]:
paragraph = soup_response.find("p")
paragraph

In [None]:
data = paragraph.text
print(data)
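
Two details of `.find()` are worth knowing before moving on: it returns only the first matching tag, and it returns `None` when nothing matches. The sketch below shows both on a hand-written HTML snippet (`snippet` is an assumption for this example), alongside `.find_all()`, which returns every match.

```python
from bs4 import BeautifulSoup

# Hand-written snippet with two paragraphs (an assumption for this example)
snippet = "<div><p>First paragraph.</p><p>Second paragraph.</p></div>"
page = BeautifulSoup(snippet, "html.parser")

first = page.find("p")           # the first <p> tag only
every = page.find_all("p")       # a list-like collection of all <p> tags
missing = page.find("table")     # no <table> in the snippet, so this is None

print(first.text)
print([p.text for p in every])
print(missing)
```

Because a missing tag gives `None`, calling `.text` on the result of a failed `.find()` raises an `AttributeError`; checking for `None` first avoids that.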

### Saving results

The final step is to save our extracted data to a file. First, we need to create a folder to store the output.

In [None]:
try:
    os.mkdir("./downloads")
except FileExistsError:
    # Only ignore the "folder already exists" case; other errors should still surface
    print("Folder already exists")

In [None]:
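outfile = "./downloads/moby-dick-scraped-data.txt"
# Set the encoding explicitly so the file is written the same way on every system
with open(outfile, "w", encoding="utf-8") as f:
    f.write(data)

In [None]:
# Read the file back to confirm that the text was saved
with open(outfile, "r", encoding="utf-8") as f:
    print(f.read())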
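
The save step can be tried out safely in a temporary folder. The sketch below is a minimal example under a few assumptions: the sample `text`, the temporary directory, and the file name `example.txt` are all made up for illustration. It uses `os.makedirs(..., exist_ok=True)`, which creates a folder and does nothing if the folder is already there.

```python
import os
import tempfile

# Made-up text that includes a non-ASCII character (an assumption for this example)
text = "Some scraped text with an accented character: é"

with tempfile.TemporaryDirectory() as tmp_dir:
    folder = os.path.join(tmp_dir, "downloads")
    os.makedirs(folder, exist_ok=True)
    os.makedirs(folder, exist_ok=True)  # a second call is harmless

    outfile = os.path.join(folder, "example.txt")
    with open(outfile, "w", encoding="utf-8") as f:
        f.write(text)

    # Read the file back with the same encoding to confirm the round trip
    with open(outfile, "r", encoding="utf-8") as f:
        restored = f.read()

print(restored == text)
```

Using the same encoding for writing and reading keeps accented and other non-ASCII characters intact across operating systems.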

## Multi-Page Scraping

In practice, the data you need is rarely on a single page. A more realistic scenario involves collecting information spread across **multiple pages** of a website. In this section, we will scrape the **Edinburgh Council Warm Spaces Directory**, which lists organisations across the city that offer warm, welcoming spaces for members of the public.

The directory is organised as an **A-to-Z listing**: there is a separate page for each letter of the alphabet (e.g., one page for organisations beginning with "A", another for "B", and so on). Each page contains a list of organisation names, each of which links to a detail page with further information (address, opening hours, etc.).

This means we need **two loops**:
1. A first loop to visit each A-Z page and collect the names and links of all organisations.
2. A second loop to visit each organisation's detail page and extract the relevant information.
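
The two-loop structure can be sketched without touching the web at all. In the example below, two dictionaries stand in for the website: `listing_pages` plays the role of the A-to-Z pages and `detail_pages` plays the role of the organisation pages. All names, links, and addresses are invented for illustration.

```python
# Hypothetical stand-in for a website: listing pages map a letter to links,
# and detail pages map a link to the information shown on that page.
listing_pages = {
    "A": ["/org/abbey-hall", "/org/acorn-centre"],
    "B": ["/org/bridge-cafe"],
    "C": [],  # a letter with no entries
}
detail_pages = {
    "/org/abbey-hall": {"Address": "1 Example Street"},
    "/org/acorn-centre": {"Address": "2 Example Road"},
    "/org/bridge-cafe": {"Address": "3 Example Lane"},
}

# First loop: visit each listing page and collect the links it contains
links = []
for letter in listing_pages:
    links.extend(listing_pages[letter])

# Second loop: visit each collected link and gather its details
records = []
for link in links:
    record = dict(detail_pages[link])  # copy so the source data is not modified
    record["url"] = link
    records.append(record)

print(len(links), "links,", len(records), "records")
```

The real scraper follows the same shape; the only difference is that each dictionary lookup is replaced by a request to the website followed by parsing.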

### Setup

In [None]:
import string
import os
import requests
import json
from datetime import datetime as dt
from bs4 import BeautifulSoup as soup

data_folder = "./data/"
try:
    os.mkdir(data_folder)
except FileExistsError:
    # Only ignore the "folder already exists" case; other errors should still surface
    print("Folder already exists")

### Building the URL list

To scrape all 26 pages, we need to construct the URL for each letter. The pattern is:

> `https://www.edinburgh.gov.uk/directory/10258/a-to-z/A`
> `https://www.edinburgh.gov.uk/directory/10258/a-to-z/B`
> ... and so on.

We can generate these URLs programmatically using Python's `string` module.

In [None]:
header = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36"}
base = "https://www.edinburgh.gov.uk/directory/10258/a-to-z/"
abc = list(string.ascii_uppercase)
print(abc)

A few things to note:

- **`header`**: Some websites block requests that do not include a `User-Agent` header, because they look like automated bots rather than real browsers. By including a header that mimics a standard web browser, we make our requests look like normal web traffic. This is common practice in web scraping.
- **`base`**: This is the base URL for the A-Z directory. We will append each letter of the alphabet to this base to construct the full URL for each page.
- **`abc`**: `string.ascii_uppercase` is a string containing the 26 uppercase letters of the English alphabet. Wrapping it in `list()` turns it into a list: `['A', 'B', 'C', ..., 'Z']`.
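
Because the letters can be looped over one at a time, the full set of page URLs can also be built in a single step. The sketch below uses the base URL given above; the variable name `urls` is an assumption for this example, and no requests are sent.

```python
import string

base = "https://www.edinburgh.gov.uk/directory/10258/a-to-z/"

# Append each uppercase letter to the base URL to get one URL per A-Z page
urls = [base + letter for letter in string.ascii_uppercase]

print(len(urls))
print(urls[0])
print(urls[-1])
```

Printing the first and last URL is a quick way to confirm that the pattern was built as intended before any pages are requested.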

### First loop: collect organisation names and links

In this step, we loop over each letter of the alphabet, visit the corresponding A-Z page, and extract the name and link for every organisation listed on that page.

In [None]:
org_list = []

for letter in abc:
    url = base + letter
    response = requests.get(url, headers=header)
    
    if response.status_code == 200:
        page = soup(response.text, "html.parser")
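        try:
            # .find() returns None if the list is missing, so .find_all() raises AttributeError
            results = page.find("ul", class_="list list--record").find_all("li")
            for item in results:
                anchor = item.find("a")  # look up the link once and reuse it
                name = anchor.text
                link = anchor.get("href")
                org_list.append({"org_name": name, "org_url": link})
        except AttributeError:
            print(f"No organisations found for letter {letter}")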

print(f"Found {len(org_list)} organisations")

Let's walk through the logic of this loop:

1. We start with an empty list called `org_list` to store our results.
2. For each letter in the alphabet, we construct the full URL by appending the letter to the base URL.
3. We request the page and check that the status code is 200 (success).
4. We parse the HTML and look for a `<ul>` tag with the class `"list list--record"` — this is the unordered list that contains the directory entries. Within that list, we find all `<li>` (list item) tags.
5. For each list item, we extract the organisation name (the text inside the `<a>` tag) and the link (the `href` attribute of the `<a>` tag).
6. We wrap this in a `try/except` block because some letters may have no organisations listed. In that case `.find()` returns `None`, and calling `.find_all()` on `None` raises an `AttributeError`.
7. Each organisation is stored as a dictionary with two keys: `org_name` and `org_url`.
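
The extraction logic can be tested on a small piece of HTML before running it against the live site. In the sketch below, `listing_html` and `empty_html` are hand-written stand-ins modelled on the tags and class name used in the loop (the organisation names and links are invented), and `extract_orgs` is a hypothetical helper that checks for a missing list explicitly instead of relying on `try/except`.

```python
from bs4 import BeautifulSoup

# Hand-written stand-ins for an A-Z page with entries and one without
listing_html = """
<ul class="list list--record">
  <li><a href="/directory-record/1/abbey-hall">Abbey Hall</a></li>
  <li><a href="/directory-record/2/acorn-centre">Acorn Centre</a></li>
</ul>
"""
empty_html = "<p>No records found.</p>"


def extract_orgs(html):
    """Return a list of {'org_name', 'org_url'} dictionaries from one listing page."""
    page = BeautifulSoup(html, "html.parser")
    record_list = page.find("ul", class_="list list--record")
    if record_list is None:
        return []  # no directory list on this page
    orgs = []
    for item in record_list.find_all("li"):
        anchor = item.find("a")
        if anchor is not None:
            orgs.append({"org_name": anchor.text.strip(), "org_url": anchor.get("href")})
    return orgs


print(extract_orgs(listing_html))
print(extract_orgs(empty_html))
```

Checking for `None` makes the "no organisations" case explicit, so other unexpected errors are not hidden by a broad `except`.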

In [None]:
org_list[:5]

### Second loop: visit each organisation's page

Now that we have a list of organisations and their URLs, we can visit each organisation's detail page and extract the information displayed there (e.g., address, opening hours, contact details).

In [None]:
org_details = []
base_url = "https://www.edinburgh.gov.uk"

for org in org_list:
    url = base_url + org["org_url"]
    response = requests.get(url, headers=header)
    
    if response.status_code == 200:
        page = soup(response.text, "html.parser")
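        results = page.find("dl", class_="list list--definition definition")
        if results is None:
            # Skip pages that do not contain the expected definition list
            print(f"No details found for {org['org_name']}")
            continue
        
        # Pair each <dt> label with the <dd> value that follows it
        keys = [dt_tag.text.strip() for dt_tag in results.find_all("dt")]
        values = [dd.text.strip() for dd in results.find_all("dd")]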
        
        obs = dict(zip(keys, values))
        obs["org_name"] = org["org_name"]
        obs["org_url"] = url
        org_details.append(obs)
    else:
        print(f"Could not request page for {org['org_name']}")

In [None]:
org_details[:3]
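
The second loop turns two parallel lists, the `<dt>` labels and the `<dd>` values, into one dictionary with `dict(zip(keys, values))`. The sketch below shows how that pairing works on invented labels and values, and also what happens when the two lists differ in length.

```python
# Invented labels and values standing in for the <dt> and <dd> text on a detail page
keys = ["Address", "Opening hours", "Telephone"]
values = ["1 Example Street", "Mon-Fri 10:00-16:00", "0131 000 0000"]

# zip pairs the first key with the first value, the second with the second, and so on
obs = dict(zip(keys, values))
print(obs)

# zip stops at the shorter list, so a missing value silently drops the last label
short = dict(zip(keys, values[:2]))
print(short)

# A simple length check can flag pages where labels and values do not line up
print(len(keys) == len(values))
```

If a page ever has a different number of labels and values, the pairing may be incomplete or misaligned without any error being raised, so a length check like the one above can be a useful safeguard.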

### Saving to JSON

We will save the collected data as a **JSON** file — a widely used, human-readable format for structured data. We include the current date in the filename so that we know when the data was collected.

In [None]:
ddate = dt.now().strftime("%Y-%m-%d")
outfile = f"./data/edinburgh-warm-spaces-{ddate}.json"

with open(outfile, "w", encoding="utf-8") as f:
    json.dump(org_details, f)

print(f"Saved {len(org_details)} organisations to {outfile}")

In [None]:
with open(outfile, "r", encoding="utf-8") as f:  # read with the same encoding used when saving
    data = json.load(f)
print(f"Loaded {len(data)} records")
data[:2]
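
The JSON round trip can be checked on a small, made-up record before trusting it with real data. In the sketch below, the `records` list, the temporary folder, and the file name are assumptions for this example. The optional arguments `indent=2` and `ensure_ascii=False` are not used in the cell above; they are shown here because they make the saved file easier to read in a text editor.

```python
import json
import os
import tempfile
from datetime import datetime

# Made-up record containing a non-ASCII character (an assumption for this example)
records = [{"org_name": "Café Example", "Address": "1 Example Street"}]

stamp = datetime.now().strftime("%Y-%m-%d")

with tempfile.TemporaryDirectory() as tmp_dir:
    outfile = os.path.join(tmp_dir, f"example-{stamp}.json")

    # indent=2 spreads the data over several lines; ensure_ascii=False keeps "é" as written
    with open(outfile, "w", encoding="utf-8") as f:
        json.dump(records, f, indent=2, ensure_ascii=False)

    with open(outfile, "r", encoding="utf-8") as f:
        restored = json.load(f)

print(restored == records)
```

Loading the file straight after saving it is a cheap way to confirm that nothing was lost or altered along the way.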

## Exercise

Now it's your turn! Using the same approach we used for the warm spaces directory, scrape the **Edinburgh Council Library Locations** directory:

> [https://www.edinburgh.gov.uk/directory/10199/library-locations-and-opening-hours](https://www.edinburgh.gov.uk/directory/10199/library-locations-and-opening-hours)

The URL pattern for the A-Z pages is:

> `https://www.edinburgh.gov.uk/directory/10199/a-to-z/{letter}`

Use the skeleton code below to guide you. Replace the `# INSERT CODE HERE` comments with your own code, following the same pattern as the warm spaces example above.

In [None]:
# Exercise: Scrape Edinburgh library locations
# The URL pattern is: https://www.edinburgh.gov.uk/directory/10199/a-to-z/{letter}

# Step 1: Define your variables
header = {"user-agent": "Mozilla/5.0"}
base = "https://www.edinburgh.gov.uk/directory/10199/a-to-z/"
abc = list(string.ascii_uppercase)

# Step 2: Loop over A-Z pages and collect library names and links
library_list = []

# INSERT CODE HERE

print(f"Found {len(library_list)} libraries")

In [None]:
# Step 3: Visit each library page and extract details
library_details = []

# INSERT CODE HERE

print(f"Collected details for {len(library_details)} libraries")

In [None]:
# Step 4: Save results to JSON

# INSERT CODE HERE

In practice, before scraping a website, always check whether the data is available through an **API** (Application Programming Interface). APIs provide structured, reliable access to data and avoid many of the legal and ethical issues associated with web scraping. We'll explore APIs in the next session.

## Appendix: Exercise Solution

In [None]:
# Exercise Solution: Scrape Edinburgh library locations

import string
import os
import requests
import json
from datetime import datetime as dt
from bs4 import BeautifulSoup as soup

# Step 1: Define variables
header = {"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36"}
base = "https://www.edinburgh.gov.uk/directory/10199/a-to-z/"
abc = list(string.ascii_uppercase)
base_url = "https://www.edinburgh.gov.uk"

# Step 2: Loop over A-Z pages and collect library names and links
library_list = []

for letter in abc:
    url = base + letter
    response = requests.get(url, headers=header)
    
    if response.status_code == 200:
        page = soup(response.text, "html.parser")
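        try:
            # .find() returns None if the list is missing, so .find_all() raises AttributeError
            results = page.find("ul", class_="list list--record").find_all("li")
            for item in results:
                anchor = item.find("a")  # look up the link once and reuse it
                name = anchor.text
                link = anchor.get("href")
                library_list.append({"library_name": name, "library_url": link})
        except AttributeError:
            print(f"No libraries found for letter {letter}")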

print(f"Found {len(library_list)} libraries")

# Step 3: Visit each library page and extract details
library_details = []

for lib in library_list:
    url = base_url + lib["library_url"]
    response = requests.get(url, headers=header)
    
    if response.status_code == 200:
        page = soup(response.text, "html.parser")
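        results = page.find("dl", class_="list list--definition definition")
        if results is None:
            # Skip pages that do not contain the expected definition list
            print(f"No details found for {lib['library_name']}")
            continue
        
        # Pair each <dt> label with the <dd> value that follows it
        keys = [dt_tag.text.strip() for dt_tag in results.find_all("dt")]
        values = [dd.text.strip() for dd in results.find_all("dd")]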
        
        obs = dict(zip(keys, values))
        obs["library_name"] = lib["library_name"]
        obs["library_url"] = url
        library_details.append(obs)
    else:
        print(f"Could not request page for {lib['library_name']}")

print(f"Collected details for {len(library_details)} libraries")

# Step 4: Save results to JSON
os.makedirs("./data", exist_ok=True)  # make sure the output folder exists
ddate = dt.now().strftime("%Y-%m-%d")
outfile = f"./data/edinburgh-libraries-{ddate}.json"

with open(outfile, "w", encoding="utf-8") as f:
    json.dump(library_details, f)

print(f"Saved {len(library_details)} libraries to {outfile}")

---

**END OF FILE**