# Week 02 - Extracting Structure: Headings, Sections, and "Choice" Cues

**Time budget:** ~2 hours  
**Goal:** Extract headings/sections and detect UX-relevant cues (choices, opt-out, consent, retention).

**Theme (PhD focus):** Human factors of privacy & security: scraping public pages (privacy policies, cookie notices, security help pages, standards/regulator guidance) and extracting *UX-relevant* signals.

---


## Deliverables
- A completed notebook with working code
- A dataset variable (`rows` or `df`) saved to disk (CSV/JSON depending on week)
- A 3–5 bullet reflection grounded in human factors/privacy-security research


## Responsible scraping note (important)
We will only scrape **public pages** and keep the volume small.
- Prefer a few pages, not thousands
- Respect robots.txt/Terms of Service when you scale later
- Avoid collecting personal data
- Add delays for politeness when doing multi-page work
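
The checklist above can be sketched with the standard library alone. This is a minimal illustration, not part of the notebook's pipeline: `is_allowed`, `polite_pause`, the sample robots.txt text, and the `week02-notebook` user-agent string are hypothetical names chosen for the example. It parses robots.txt text that has already been obtained, so it runs offline.

```python
import time
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "week02-notebook") -> bool:
    """Return True if the given robots.txt text permits fetching `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text directly, no network call
    return parser.can_fetch(user_agent, url)


def polite_pause(seconds: float = 1.0) -> None:
    """Sleep between requests so a multi-page run stays low-volume."""
    time.sleep(seconds)


# Hypothetical robots.txt content, used only for this demonstration
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "https://example.org/privacy/"))
print(is_allowed(SAMPLE_ROBOTS, "https://example.org/private/logs"))
```

In a real run you would download each site's own `/robots.txt` first and call `polite_pause()` between page requests.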


## Step 0 - Imports

In [None]:
import re
import time
import json
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

## Step 1 - Choose 5 policy/security-help URLs

In [None]:
urls = [
    "https://www.mozilla.org/en-US/privacy/",
    "https://www.nist.gov/privacy-framework",
    "https://support.google.com/accounts/answer/6294825?hl=en",
    "https://www.enisa.europa.eu/topics/data-protection",
    "https://www.wikipedia.org/",
]

## Step 2 - Extract headings + cue flags

In [None]:
# Case-insensitive cue patterns. Optional suffixes catch common inflections
# (e.g. "cookies", "retained") that a strict \b...\b match would miss.
CUE_PATTERNS = {
    "choices_controls": r"\b(choices?|controls?|opt\s?-?out|preferences|settings)\b",
    "consent": r"\b(consent|agree|manage consent)\b",
    "cookies": r"\b(cookies?|tracking|pixels?)\b",
    "sharing_third_party": r"\b(third\s?-?part(y|ies)|shar(e|es|ed|ing)|partners?)\b",
    "retention": r"\b(retention|retain(s|ed)?|stored|storage period)\b",
    "security": r"\b(security|protect(s|ed|ion)?|encryption|safeguards?)\b",
}

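def extract_headings(soup: BeautifulSoup) -> list[str]:
    """Return the non-empty text of every h1-h3 element, in document order."""
    return [
        t.get_text(" ", strip=True)
        for t in soup.find_all(["h1", "h2", "h3"])
        if t.get_text(" ", strip=True)
    ]

def score_cues(text: str) -> dict:
    """Map each cue name to True if its pattern occurs anywhere in `text`."""
    return {k: bool(re.search(p, text, re.I)) for k, p in CUE_PATTERNS.items()}

def analyze(url: str) -> dict:
    """Fetch one page and return a row: metadata, heading preview, cue flags."""
    r = requests.get(url, timeout=20)
    soup = BeautifulSoup(r.text, "html.parser")
    headings = extract_headings(soup)
    # Cues are scored on the page's full text, not only on the headings
    text = soup.get_text(" ", strip=True)
    row = {
        "url": url,
        "status": r.status_code,  # recorded, not raised, so error pages stay visible
        "title": soup.title.get_text(strip=True) if soup.title else None,
        "num_headings": len(headings),
        "headings_preview": headings[:12],  # first 12 headings only
    }
    row.update(score_cues(text))
    return row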
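
Scoring the whole page tells you whether a cue appears anywhere, but not whether it appears in the navigational structure. As a sketch, the same regex idea can be applied to each heading separately. The `DEMO_PATTERNS` subset, the `tag_headings` helper, and the sample headings are hypothetical and exist only for illustration.

```python
import re

# Hypothetical subset of cue patterns, kept small for the demo
DEMO_PATTERNS = {
    "choices_controls": r"\b(choices?|controls?|opt\s?-?out|preferences|settings)\b",
    "cookies": r"\b(cookies?|tracking|pixels?)\b",
    "retention": r"\b(retention|retain(s|ed)?|stored)\b",
}


def tag_headings(headings: list[str]) -> list[dict]:
    """Attach the list of matching cue names to each heading."""
    tagged = []
    for h in headings:
        cues = [name for name, pat in DEMO_PATTERNS.items() if re.search(pat, h, re.I)]
        tagged.append({"heading": h, "cues": cues})
    return tagged


# Made-up headings standing in for the output of extract_headings()
sample_headings = [
    "Your choices and controls",
    "How we use cookies",
    "How long data is retained",
    "Contact us",
]

for item in tag_headings(sample_headings):
    print(item)
```

Headings with an empty `cues` list carry none of the tracked signals, which gives a quick view of how much of a page's structure is devoted to user-facing controls.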

### 🧠 Concept: List of Dictionaries = A Dataset

If a **Dictionary** is a Row...
And a **List** is a container...
Then a **List of Dictionaries** is a **Table**!

```python
dataset = [
  {"url": "google.com", "status": 200},  # Row 1
  {"url": "bing.com",   "status": 200},  # Row 2
]
```

This is the same row-and-column model that pandas and spreadsheets use: `pd.DataFrame(dataset)` turns each dictionary into a row and each key into a column.
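
To make the row/table idea concrete, here is a small sketch using only the standard library. The `column` and `to_csv_text` helpers and the third sample row are hypothetical. The point is that every dictionary key acts as a column name, and a row that lacks a key simply leaves that cell empty.

```python
import csv
import io

dataset = [
    {"url": "google.com", "status": 200},
    {"url": "bing.com", "status": 200},
    {"url": "example.invalid", "error": "connection failed"},  # made-up failure row
]


def column(rows: list[dict], key: str) -> list:
    """Read one 'column' out of the table; missing cells become None."""
    return [row.get(key) for row in rows]


def to_csv_text(rows: list[dict]) -> str:
    """Render the rows as CSV text, using the union of all keys as the header."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


print(column(dataset, "status"))
print(to_csv_text(dataset))
```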

In [None]:
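rows = []
for u in urls:
    try:
        rows.append(analyze(u))
    except Exception as e:
        # Keep failed URLs in the dataset so gaps stay visible
        rows.append({"url": u, "error": str(e)})
    time.sleep(1)  # polite delay between requests
rows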
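
The deliverables ask for the dataset to be saved to disk. One possible way is sketched below under stated assumptions: `save_rows` and `cue_counts` are hypothetical helpers, `week02_rows.json` is an arbitrary filename, and `example_rows` stands in for the real `rows` so the snippet runs without network access. JSON is a reasonable fit here because `headings_preview` is a list, which does not map cleanly onto a single CSV cell.

```python
import json
from pathlib import Path


def save_rows(rows: list[dict], path: str) -> None:
    """Write the list of row dictionaries to a UTF-8 JSON file."""
    Path(path).write_text(json.dumps(rows, indent=2, ensure_ascii=False), encoding="utf-8")


def cue_counts(rows: list[dict], cue_names: list[str]) -> dict:
    """Count, per cue, how many successfully fetched pages flagged it."""
    ok_rows = [r for r in rows if "error" not in r]
    return {name: sum(bool(r.get(name)) for r in ok_rows) for name in cue_names}


# Made-up rows with the same shape as the output of analyze()
example_rows = [
    {"url": "https://example.org/a", "status": 200,
     "headings_preview": ["Your choices"], "consent": True, "cookies": True},
    {"url": "https://example.org/b", "status": 200,
     "headings_preview": [], "consent": False, "cookies": True},
    {"url": "https://example.org/c", "error": "timeout"},
]

save_rows(example_rows, "week02_rows.json")
print(cue_counts(example_rows, ["consent", "cookies"]))
```

In the notebook itself you would pass `rows` instead of `example_rows`.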

## Reflection: headings as navigational UX

- Which headings looked most "actionable" for users?
- Did you see a "Your choices" / "controls" / "opt out" style of structure?
- What does this suggest about usability/comprehension?
