# Week 02 - Extracting Structure: Headings, Sections, and "Choice" Cues

**Time budget:** ~2 hours  
**Goal:** Extract headings/sections and detect UX-relevant cues (choices, opt-out, consent, retention).

**Theme (PhD focus):** Human factors of privacy & security: scraping public pages (privacy policies, cookie notices, security help pages, standards/regulator guidance) and extracting *UX-relevant* signals.

---


## Responsible scraping note (important)
We will only scrape **public pages** and keep the volume small.
- Prefer a few pages, not thousands
- Respect robots.txt/Terms of Service when you scale later
- Avoid collecting personal data
- Add delays for politeness when doing multi-page work
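
A minimal sketch of two of these habits, assuming you already have the text of a site's robots.txt and some function that fetches a page. The names `USER_AGENT`, `allowed_by_robots`, and `polite_fetch_all` are hypothetical helpers for illustration, and the robots.txt rules in the demo are made up. The demo uses a fake fetch function, so it runs without touching the network.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "privacy-ux-course-bot"  # hypothetical name; pick your own


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Check a URL against robots.txt rules supplied as plain text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def polite_fetch_all(urls, fetch, delay_seconds: float = 2.0) -> dict:
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # politeness delay before every request but the first
        results[url] = fetch(url)
    return results


# Offline demo: made-up rules and a fake fetch function, so no network is needed.
SAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/\n"
print(allowed_by_robots(SAMPLE_ROBOTS, "https://example.com/privacy"))       # True
print(allowed_by_robots(SAMPLE_ROBOTS, "https://example.com/private/data"))  # False

pages = polite_fetch_all(
    ["https://example.com/a", "https://example.com/b"],
    fetch=lambda url: f"<html>fake page for {url}</html>",
    delay_seconds=0.1,
)
print(len(pages))  # 2
```

With real pages, `fetch` could be something like `lambda url: requests.get(url, timeout=20)`.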


## Setup
We'll use `requests` + `BeautifulSoup`. Install if needed:

```bash
pip install requests beautifulsoup4 pandas matplotlib
```


In [None]:
import re
import time
import json
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

## Why structure matters (human factors angle)
Keywords are a start, but policies and security pages usually communicate through **sections** like:
- "Your choices" / "Choices & controls"
- "How we use data"
- "Sharing" / "Third parties"
- "Retention"
- "Security measures"

This week: extract headings and map them to "UX cues".


### 🧠 Concept: Regular Expressions (Regex)

Code: `re.search(...)`

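Regex gives you **Super Search** powers.
- **Normal search**: Finds one exact string, like "cookie".
- **Regex**: Finds patterns, like "cookie" OR "cookies" OR "tracking pixel".

The patterns in our code below follow this shape:
- `r"\b(cookie|tracking)\b"` means:
    - `\b`: Word boundary, so only whole words match ("cookiecutter" doesn't count). The plural "cookies" doesn't count either unless the pattern allows it, e.g. `cookies?`.
    - `(cookie|tracking)`: Either "cookie" OR "tracking".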
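
To see the word-boundary behavior in action, here is a small sketch. The names `STRICT`, `PLURAL_OK`, and `matches` are made up for this illustration, and the sample sentences are invented.

```python
import re

STRICT = r"\b(cookie|tracking)\b"       # singular "cookie" only
PLURAL_OK = r"\b(cookies?|tracking)\b"  # "s?" makes the plural "s" optional


def matches(pattern: str, text: str) -> bool:
    """Return True if the pattern occurs anywhere in the text, ignoring case."""
    return re.search(pattern, text, flags=re.I) is not None


print(matches(STRICT, "We use a cookie banner"))             # True
print(matches(STRICT, "We use Cookies"))                     # False: "s" follows, so no word boundary
print(matches(PLURAL_OK, "We use Cookies"))                  # True
print(matches(PLURAL_OK, "Try our cookiecutter template"))   # False: part of a longer word
```

The second result is the one to watch for: a pattern that lists only the singular form will silently miss plural headings.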

In [None]:
# One regex per cue. "s?" makes a plural "s" optional, so "cookie" and "cookies" both match.
CUE_PATTERNS = {
    "choices_controls": r"\b(choices?|controls?|opt\s?-?out|preferences|settings)\b",
    "consent": r"\b(consent|agree|manage consent)\b",
    "cookies": r"\b(cookies?|tracking|pixels?)\b",
    "sharing_third_party": r"\b(third\s?-?party|share|sharing|partners)\b",
    "retention": r"\b(retention|retain|stored|storage period)\b",
    "security": r"\b(security|protect|encryption|safeguards)\b",
}

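def extract_headings(soup: BeautifulSoup) -> list[str]:
    """Return the non-empty text of every h1, h2, and h3 tag, in document order."""
    headings = []
    for tag in soup.find_all(["h1","h2","h3"]):
        # Join nested text fragments with a space, stripping whitespace from each.
        text = tag.get_text(" ", strip=True)
        if text:
            headings.append(text)
    return headings

def score_cues(text: str) -> dict:
    """Map each cue name to True if its pattern appears anywhere in `text`."""
    out = {}
    for cue, pat in CUE_PATTERNS.items():
        out[cue] = bool(re.search(pat, text, flags=re.I))  # re.I: ignore case
    return out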

### 🧠 Concept: The Dictionary (`dict`)

Code: `{"key": "value"}`

Think of a Dictionary like a **Contact Card**:
- You look up a **Key** (Name) -> You get a **Value** (Phone Number).

In Data Analysis:
- We use a dictionary to represent **One Row** of data.
- **Key**: Column Name (e.g., "url")
- **Value**: The Data (e.g., "https://...")
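
A short sketch of the "one dictionary = one row" idea. The URLs and values are invented for illustration, and the standard-library `csv` module is used here only to show how a list of such rows turns into a table.

```python
import csv
import io

# One row of data: column name -> value (values here are made up).
row = {"url": "https://example.com/privacy", "status": 200, "consent": True}
print(row["url"])          # look up a value by its key
row["num_headings"] = 12   # add a new "column" to this row

# Many rows = a list of dictionaries that share the same keys.
rows = [
    row,
    {"url": "https://example.com/cookies", "status": 200, "consent": False, "num_headings": 5},
]

# Write the rows as CSV text into an in-memory buffer instead of a file.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row.keys()))
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Because every row uses the same keys, the keys become the header line and each dictionary becomes one line of the table.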

In [None]:
def analyze_policy_structure(url: str) -> dict:
    """Fetch one page and return a row (dict) of structure and cue signals."""
    r = requests.get(url, timeout=20)
    soup = BeautifulSoup(r.text, "html.parser")

    headings = extract_headings(soup)
    full_text = soup.get_text(" ", strip=True)
    title_tag = soup.find("title")  # None if the page has no <title>

    row = {
        "url": url,
        "status": r.status_code,  # check this: error pages (e.g. 404) get parsed too
        "title": title_tag.get_text(strip=True) if title_tag else None,
        "num_headings": len(headings),
        "headings_preview": headings[:10],  # first 10 headings only
    }
    row.update(score_cues(full_text))  # adds one True/False entry per cue
    return row

analyze_policy_structure("https://www.mozilla.org/en-US/privacy/")
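
The function above records headings but not the text that sits under each one. As a rough sketch of section extraction, the block below groups text under the most recent h1/h2/h3 heading. It deliberately uses the standard library's `html.parser` rather than BeautifulSoup so it runs with no extra installs. `SectionParser` and `split_into_sections` are hypothetical names, the sample HTML is invented, and the sketch does not filter out `<script>` or `<style>` content, which a real page would need.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3"}


class SectionParser(HTMLParser):
    """Collect text under the most recent h1/h2/h3 heading."""

    def __init__(self):
        super().__init__()
        self.sections = []        # each item: {"heading": str, "text": str}
        self._in_heading = False

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._in_heading = True
            self.sections.append({"heading": "", "text": ""})  # start a new section

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        chunk = data.strip()
        if not chunk or not self.sections:
            return  # skip whitespace and any text before the first heading
        key = "heading" if self._in_heading else "text"
        current = self.sections[-1]
        current[key] = (current[key] + " " + chunk).strip()


def split_into_sections(html: str) -> list:
    """Return a list of {"heading", "text"} dicts, in document order."""
    parser = SectionParser()
    parser.feed(html)
    parser.close()  # flush any buffered text
    return parser.sections


sample_html = (
    "<h1>Privacy</h1><p>Intro.</p>"
    "<h2>Your choices</h2><p>You can <b>opt out</b> anytime.</p>"
)
for section in split_into_sections(sample_html):
    print(section["heading"], "->", section["text"])
```

Each section's text could then be scored separately, which tells you not only whether a cue appears on the page but under which heading.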

## Interpretation prompt
Consider: do headings make it easier for a user to locate "choices"?
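
One way to start answering this with data is to check which headings themselves contain a choice cue, rather than scoring the full page text. The sketch below assumes a list of heading strings like the one `extract_headings` returns. `CHOICE_PATTERN` and `headings_with_choice_cue` are hypothetical names, and the sample headings are invented.

```python
import re

# Assumed pattern for illustration: choice-related words, plural "s" optional.
CHOICE_PATTERN = r"\b(choices?|controls?|opt\s?-?out|preferences|settings)\b"


def headings_with_choice_cue(headings: list) -> list:
    """Return (position, heading) pairs for headings that mention a choice cue."""
    return [
        (i, h)
        for i, h in enumerate(headings)
        if re.search(CHOICE_PATTERN, h, flags=re.I)
    ]


sample_headings = [
    "Privacy Notice",
    "How we use data",
    "Your choices",
    "Retention",
    "Manage your settings",
]
found = headings_with_choice_cue(sample_headings)
print(found)                                       # [(2, 'Your choices'), (4, 'Manage your settings')]
print(f"{len(found)} of {len(sample_headings)} headings mention a choice cue")
```

The position of the first match is a crude proxy for how far a user has to scan before a heading signals that choices exist. It says nothing about how the page is visually laid out.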
