# Week 03: Pagination + Robustness

**Time budget:** ~2 hours  
**Goal:** Scrape multiple pages with politeness (rate limiting) and robust error handling; introduce comprehensions.

**Theme (PhD focus):** Human factors of privacy & security: scraping public pages (privacy policies, cookie notices, security help pages, standards/regulator guidance) and extracting *UX-relevant* signals.

---


## Responsible scraping note (important)
We will only scrape **public pages** and keep the volume small.
- Prefer a few pages, not thousands
- Respect robots.txt and Terms of Service, especially when you scale later
- Avoid collecting personal data
- Add delays for politeness when doing multi-page work
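
As a rough illustration of the robots.txt point, the standard library's `urllib.robotparser` can answer "may I fetch this URL?". The robots.txt content and the `example.com` URLs below are made up for the demo, so it runs without touching the network; the user-agent string is the one used later in this notebook.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline for illustration only.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

user_agent = "HF-PrivacyScraper/0.1"
allowed = rp.can_fetch(user_agent, "https://example.com/privacy")
blocked = rp.can_fetch(user_agent, "https://example.com/private/report")
delay = rp.crawl_delay(user_agent)  # seconds requested by the site, or None

print(allowed, blocked, delay)
```

Against a real site you would instead call `rp.set_url(".../robots.txt")` followed by `rp.read()`, which downloads the file before you ask `can_fetch`.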


## Setup
We'll use `requests` + `BeautifulSoup`. Install if needed:

```bash
pip install requests beautifulsoup4 pandas matplotlib
```


In [None]:
import re
import time
import json
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

## Multi-page scraping patterns
We introduce:
- Politeness delay (`time.sleep`)
- Error handling that records failures instead of crashing (a first step toward retries)
- List comprehensions for clean transforms
- Basic logging (print is okay for now)
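
The error-handling and delay ideas above can be combined into a small retry helper. This is only a sketch under assumptions: `get_with_retries`, `flaky_fetch`, and the `sleep` parameter are hypothetical names introduced here, and the demo uses a fake fetcher so it runs without network access.

```python
import time


def get_with_retries(fetch, url, attempts=3, backoff_s=1.0, sleep=time.sleep):
    """Call fetch(url); on failure wait and try again, up to `attempts` times."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch(url)
        except Exception as e:  # broad on purpose: this is a teaching sketch
            last_error = e
            print(f"attempt {attempt} failed for {url}: {e}")
            if attempt < attempts:
                # Wait longer after each failure: backoff_s, 2*backoff_s, 4*backoff_s, ...
                sleep(backoff_s * 2 ** (attempt - 1))
    raise last_error


# Demo: a fake fetcher that fails twice, then succeeds.
calls = {"n": 0}


def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("simulated timeout")
    return f"content of {url}"


waits = []  # record the delays instead of actually sleeping
result = get_with_retries(flaky_fetch, "https://example.com/policy", sleep=waits.append)
print(result, waits)
```

In real use, `fetch` could be any function that raises on failure, for example one that calls `requests.get(...)` and then `raise_for_status()`. Passing `sleep` as a parameter is a design choice that keeps the demo fast; the default is the real `time.sleep`.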


### 🧠 Concept: Functions (`def`)

Think of a Function like a **Kitchen Appliance** (e.g., a Blender).
- **Input (Arguments)**: Fruit, Milk (`url`, `timeout`).
- **Action (Body)**: Blends them (Executes code).
- **Output (Return)**: Smoothie (`response` object).

We write functions so we don't have to build the blender every time we want a smoothie. We just use it.

In [None]:
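def safe_get(url: str, timeout: int = 20):
    """Fetch `url`; return the Response on success or the exception on failure."""
    try:
        # Identify the scraper so site owners can see who is making requests.
        r = requests.get(url, timeout=timeout, headers={"User-Agent": "HF-PrivacyScraper/0.1"})
        r.raise_for_status()  # turn 4xx/5xx responses into exceptions
        return r
    except requests.RequestException as e:
        # Return (not raise) the error so one bad URL does not stop the whole batch.
        return e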

In [None]:
def scrape_many(urls: list[str], delay_s: float = 1.0) -> list[dict]:
    """Fetch each URL in turn and return one dict per URL (a result row or an error row)."""
    rows = []
    for u in urls:
        result = safe_get(u)
        if isinstance(result, Exception):
            # Failed request: keep the URL and the error message, then move on.
            rows.append({"url": u, "error": str(result)})
        else:
            soup = BeautifulSoup(result.text, "html.parser")
            # Flatten the page to lowercase text for simple keyword checks.
            text = soup.get_text(" ", strip=True).lower()
            rows.append({
                "url": u,
                "status": result.status_code,
                "title": soup.title.get_text(strip=True) if soup.title else None,
                "mentions_choices": ("choice" in text) or ("opt out" in text) or ("preferences" in text),
                "mentions_retention": ("retention" in text) or ("retain" in text),
            })
        # Politeness delay between requests.
        time.sleep(delay_s)
    return rows

In [None]:
urls = [
    "https://www.nist.gov/privacy-framework",
    "https://www.enisa.europa.eu/topics/data-protection",
    "https://www.mozilla.org/en-US/privacy/",
]
rows = scrape_many(urls, delay_s=1.0)
rows
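
The list above is written out by hand. For paginated content, the page URLs often follow a pattern and can be generated instead. A minimal sketch, assuming a hypothetical site (`example.com`) that exposes its pages through a `page` query parameter; real sites may use a different parameter name or "next" links instead.

```python
from urllib.parse import urlencode, urljoin

base = "https://example.com/help/"  # hypothetical base URL

# Build the URLs for pages 1 to 3 with a list comprehension.
page_urls = [
    urljoin(base, "articles?" + urlencode({"page": n}))
    for n in range(1, 4)
]

for u in page_urls:
    print(u)
```

A list built this way could be passed to `scrape_many(page_urls, delay_s=1.0)` exactly like the hand-written list.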

## Comprehension example
Filter only successful rows:


### 🧠 Concept: List Comprehension

A list comprehension is a one-line loop that builds a new list.

**The Long Way:**
```python
results = []
for r in rows:
    if "error" not in r:
        results.append(r)
```

**The Comprehension Way:**
```python
results = [r for r in rows if "error" not in r]
```
Read it like English: "Give me `r` for every `r` in `rows` IF the key `"error"` is not in `r`."
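
Comprehensions can also transform items, not only filter them. The sketch below uses made-up sample rows (hypothetical `example.com` URLs and values, shaped like the rows produced by `scrape_many`) so it runs on its own.

```python
from urllib.parse import urlparse

# Made-up rows for illustration; not real scraping results.
sample_rows = [
    {"url": "https://example.com/privacy", "status": 200, "mentions_retention": True},
    {"url": "https://example.org/cookies", "status": 200, "mentions_retention": False},
    {"url": "https://example.net/missing", "error": "404 Client Error"},
]

# Transform: pull the domain out of every URL.
domains = [urlparse(r["url"]).netloc for r in sample_rows]

# Filter + transform: URLs of pages that mention retention.
# .get() returns None for error rows, which have no such key.
retention_urls = [r["url"] for r in sample_rows if r.get("mentions_retention")]

# Dict comprehension: map each successful URL to its status code.
status_by_url = {r["url"]: r["status"] for r in sample_rows if "error" not in r}

print(domains)
print(retention_urls)
print(status_by_url)
```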

In [None]:
ok_rows = [r for r in rows if "error" not in r]
ok_rows