# Week 03 — Pagination + Robustness

**Time budget:** ~2 hours  
**Goal:** Scrape multiple pages with politeness (rate limiting) and robust error handling; introduce comprehensions.

**Theme (PhD focus):** Human factors of privacy & security — scraping public pages (privacy policies, cookie notices, security help pages, standards/regulator guidance) and extracting *UX-relevant* signals.

---


## Deliverables
- A completed notebook with working code
- A dataset variable (`rows` or `df`) saved to disk (CSV/JSON depending on week)
- 3–5 bullet reflection grounded in human factors/privacy-security research


## Responsible scraping note (important)
We will only scrape **public pages** and keep the volume small.
- Prefer a few pages, not thousands
- Respect robots.txt/Terms of Service when you scale later
- Avoid collecting personal data
- Add delays for politeness when doing multi-page work
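
One way to act on the robots.txt point is the standard library's `urllib.robotparser`. The sketch below parses a made-up robots.txt string offline, so it runs without network access; `EXAMPLE_ROBOTS_TXT`, `build_parser`, and the `example.org` URLs are illustrative assumptions, not part of this week's code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline for illustration only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "HF-PrivacyScraper/0.1"  # same User-Agent string used in Step 2


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt text instead of fetching it."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp


rp = build_parser(EXAMPLE_ROBOTS_TXT)

# Ask whether our user agent may fetch each (made-up) URL.
allowed_public = rp.can_fetch(USER_AGENT, "https://example.org/privacy/")
allowed_private = rp.can_fetch(USER_AGENT, "https://example.org/private/report")

# Crawl-delay is optional; crawl_delay() returns None when the file omits it.
crawl_delay = rp.crawl_delay(USER_AGENT)

print(allowed_public, allowed_private, crawl_delay)
```

Against a live site you would normally call `rp.set_url(...)` with the site's `/robots.txt` address and then `rp.read()` instead of `parse()`. Terms of Service still need to be read by a human.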


## Step 0 — Imports

In [None]:
import re
import time
import json
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

## Step 1 — Provide a small list of URLs (8–15 max)

In [None]:
urls = [
    "https://www.mozilla.org/en-US/privacy/",
    "https://www.nist.gov/privacy-framework",
    "https://www.enisa.europa.eu/topics/data-protection",
    "https://support.google.com/accounts/answer/6294825?hl=en",
    "https://www.wikipedia.org/",
    # add a few more
]

## Step 2 — Implement polite multi-page scrape with delay + error capture

### 🧠 Concept: Rate Limiting (Politeness)

Servers have finite capacity. If you request 1,000 pages in 1 second, your traffic can look like a denial-of-service (DoS) attack, and the site may block you.
- **Golden Rule**: Wait 1–2 seconds between requests.
- **Code**: `time.sleep(1.0)`
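
A fixed `time.sleep(1.0)` is enough for this exercise. As an optional variation, the sketch below adds a small random jitter so the wait falls somewhere in the 1–2 second range. The helper name `polite_sleep` and its `sleep` parameter are assumptions made for illustration; the `sleep` parameter exists only so the demo can run without actually pausing.

```python
import random
import time


def polite_sleep(base_s: float = 1.0, jitter_s: float = 1.0, sleep=time.sleep) -> float:
    """Pause for base_s plus a random extra of up to jitter_s seconds.

    Returns the number of seconds requested, which is handy for logging.
    """
    wait = base_s + random.uniform(0.0, jitter_s)
    sleep(wait)
    return wait


# Demo: record the requested waits instead of really sleeping.
recorded = []
for _ in range(3):
    polite_sleep(base_s=1.0, jitter_s=1.0, sleep=recorded.append)

print(recorded)  # three values, each between 1.0 and 2.0
```

In the scraping loop you would call `polite_sleep()` with the default `sleep` where the fixed delay currently sits.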

In [None]:
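def safe_get(url: str):
    """Fetch url; return the Response on success or the exception on failure."""
    try:
        r = requests.get(url, timeout=20, headers={"User-Agent": "HF-PrivacyScraper/0.1"})
        r.raise_for_status()  # treat 4xx/5xx responses as failures
        return r
    except requests.RequestException as e:
        # Return (rather than raise) network and HTTP errors so one bad URL cannot stop the run
        return e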

def scrape_many(urls: list[str], delay_s: float = 1.0) -> list[dict]:
    """Fetch each URL in turn, pausing delay_s seconds between requests."""
    rows = []
    for u in urls:
        res = safe_get(u)
        if isinstance(res, Exception):
            # Keep failed URLs in the dataset so failures stay visible
            rows.append({"url": u, "error": str(res)})
        else:
            soup = BeautifulSoup(res.text, "html.parser")
            # Lowercase the visible text so keyword checks are case-insensitive
            text = soup.get_text(" ", strip=True).lower()
            rows.append({
                "url": u,
                "status": res.status_code,
                "title": soup.title.get_text(strip=True) if soup.title else None,
                "mentions_choices": ("opt out" in text) or ("preferences" in text) or ("your choices" in text),
                "mentions_retention": ("retention" in text) or ("retain" in text),
            })
        time.sleep(delay_s)  # politeness delay after every request
    return rows

In [None]:
rows = scrape_many(urls, delay_s=1.0)
rows
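
The deliverables ask for the dataset to be saved to disk. Here is a minimal sketch of writing rows to both JSON and CSV, assuming the row shape produced by `scrape_many`. The helper names, the `FIELDS` list, the file names, and `sample_rows` are illustrative assumptions; the sample data is made up so the block runs without scraping anything.

```python
import csv
import json
import tempfile
from pathlib import Path

# Union of the keys used by success rows and error rows.
FIELDS = ["url", "status", "title", "mentions_choices", "mentions_retention", "error"]


def save_json(rows, path):
    """Write rows as a JSON array; missing keys are simply absent."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)


def save_csv(rows, path, fields=FIELDS):
    """Write rows as CSV; keys missing from a row become empty cells."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fields, restval="")
        writer.writeheader()
        writer.writerows(rows)


# Made-up rows standing in for the output of scrape_many.
sample_rows = [
    {"url": "https://example.org/privacy", "status": 200, "title": "Privacy",
     "mentions_choices": True, "mentions_retention": False},
    {"url": "https://example.org/missing", "error": "404 Client Error"},
]

out_dir = Path(tempfile.mkdtemp())  # in the notebook, a project folder works too
save_json(sample_rows, out_dir / "week03_rows.json")
save_csv(sample_rows, out_dir / "week03_rows.csv")
print(sorted(p.name for p in out_dir.iterdir()))
```

JSON keeps booleans and missing keys as they are, while CSV flattens everything to text, so JSON is the safer choice if you plan to reload the data in Python.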

## Step 3 — Use comprehensions to filter/transform

In [None]:
ok_rows = [r for r in rows if "error" not in r]
ok_rows
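
The same pattern extends to dict, set, and generator forms. The sketch below uses a small made-up `sample_rows` list (an assumption, shaped like the output of `scrape_many`) so it runs on its own; swap in your real `rows` in the notebook.

```python
from urllib.parse import urlparse

# Made-up rows standing in for the output of scrape_many.
sample_rows = [
    {"url": "https://example.org/privacy", "status": 200, "title": "Privacy",
     "mentions_choices": True, "mentions_retention": True},
    {"url": "https://example.com/cookies", "status": 200, "title": "Cookies",
     "mentions_choices": True, "mentions_retention": False},
    {"url": "https://example.org/missing", "error": "404 Client Error"},
]

# List comprehension: filter AND transform (keep only the URL of failed rows).
failed_urls = [r["url"] for r in sample_rows if "error" in r]

# Dict comprehension: map each successful URL to its page title.
title_by_url = {r["url"]: r["title"] for r in sample_rows if "error" not in r}

# Set comprehension: unique hostnames in the dataset.
hosts = {urlparse(r["url"]).netloc for r in sample_rows}

# Generator expression inside sum(): count pages that mention retention.
n_retention = sum(1 for r in sample_rows if r.get("mentions_retention"))

print(failed_urls, title_by_url, hosts, n_retention)
```

Using `r.get(...)` rather than `r[...]` in the last line avoids a `KeyError` on error rows, which lack the `mentions_retention` key.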

## Reflection

- What kinds of failures occurred (timeouts, 403s, etc.)?
- How would failures bias a research dataset?
