# Week 11 — More Complex Sites: Sitemaps + Multi-source Datasets

**Time budget:** ~2 hours  
**Goal:** Use sitemap.xml (where available) or curated sources; unify multi-source datasets.

**Theme (PhD focus):** Human factors of privacy & security — scraping public pages (privacy policies, cookie notices, security help pages, standards/regulator guidance) and extracting *UX-relevant* signals.

---


## Responsible scraping note (important)
We will only scrape **public pages** and keep the volume small.
- Prefer a few pages, not thousands
- Respect robots.txt and each site's Terms of Service, especially when you scale up later
- Avoid collecting personal data
- Add delays for politeness when doing multi-page work
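
The checklist above can be sketched in a few lines. This is a minimal illustration, not a complete policy: the helper names (`build_robots`, `allowed_urls`, `polite_visit`), the sample robots.txt text, and the example.com URLs are all assumptions made for the example. It parses robots.txt text that has already been downloaded, so the sketch itself makes no network requests.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "HF-PrivacyScraper/0.1"  # same User-Agent string used in this notebook


def build_robots(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that was already downloaded (no network here)."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp


def allowed_urls(urls, rp, user_agent=USER_AGENT, max_pages=5):
    """Keep only URLs that robots.txt permits, capped at max_pages."""
    kept = [u for u in urls if rp.can_fetch(user_agent, u)]
    return kept[:max_pages]


def polite_visit(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, sleeping between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)
        results.append(fetch(url))
    return results


# Hypothetical robots.txt content and candidate pages
sample_robots = """User-agent: *
Disallow: /account/
"""
rp = build_robots(sample_robots)
candidates = [
    "https://example.com/privacy",
    "https://example.com/account/settings",
    "https://example.com/cookies",
]
to_visit = allowed_urls(candidates, rp)
print(to_visit)
```

In real use, `fetch` would be a small wrapper around `requests.get`; passing it in as an argument keeps the delay logic easy to test without touching the network.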


## Setup
We’ll use `requests` + `BeautifulSoup`. Install if needed:

```bash
pip install requests beautifulsoup4 lxml pandas matplotlib
```


In [None]:
import re
import time
import json
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

import pandas as pd
import matplotlib.pyplot as plt
from dataclasses import dataclass

## Sitemaps + multi-source datasets
Many sites expose `sitemap.xml`. If available, it can help you discover relevant pages.
We will keep this lightweight and ethical: extract a *small* subset of URLs.

If a site has no sitemap, fall back to a curated list.
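
One way to express the "small subset, else curated list" rule is sketched below. The function name `pick_urls`, the keyword list, the limit of 10, and the example.org URLs are assumptions for illustration; adjust them to your own research question.

```python
def pick_urls(sitemap_urls, curated_urls,
              keywords=("privacy", "cookie", "security"), limit=10):
    """Return a small list of URLs to scrape.

    Keeps sitemap URLs whose address contains one of the keywords.
    If the sitemap is empty or nothing matches, uses the curated list instead.
    """
    relevant = [u for u in sitemap_urls
                if any(k in u.lower() for k in keywords)]
    chosen = relevant if relevant else list(curated_urls)
    # dict.fromkeys drops duplicates while preserving order
    return list(dict.fromkeys(chosen))[:limit]


# Hypothetical inputs
sitemap = [
    "https://example.org/about",
    "https://example.org/privacy/",
    "https://example.org/legal/cookies",
    "https://example.org/privacy/",
]
curated = ["https://example.org/privacy-policy"]

print(pick_urls(sitemap, curated))  # keyword matches from the sitemap
print(pick_urls([], curated))       # no sitemap: falls back to the curated list
```

Matching on the URL string is crude (a privacy page may live at `/legal/terms`), so treat the result as a starting point and review it by hand.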


### 🧠 Concept: Scraping vs. Crawling

| Activity | Definition | Analogy |
| :--- | :--- | :--- |
| **Scraping** | Extracting data from a specific page. | Reading a book and taking notes. |
| **Crawling** | Finding new URLs to read. | Walking through the library to *find* books. |

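**Sitemaps** are the library catalog: they list the books (pages) the site owner chose to include, so you don't have to wander around to find them. A catalog can be incomplete or out of date, so treat it as a starting point.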

In [None]:
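def try_fetch_sitemap(base_url: str) -> list[str]:
    """Return the <loc> URLs from base_url's /sitemap.xml, or [] if unavailable."""
    # Naive: expects base_url like https://example.com
    sitemap_url = base_url.rstrip("/") + "/sitemap.xml"
    try:
        r = requests.get(sitemap_url, timeout=20, headers={"User-Agent": "HF-PrivacyScraper/0.1"})
    except requests.RequestException:
        # Network errors (DNS failure, timeout, ...) are treated like a missing sitemap
        return []
    if r.status_code != 200:
        return []
    # The "xml" parser requires the lxml package to be installed
    soup = BeautifulSoup(r.text, "xml")
    # Note: for a sitemap index, these are child sitemap URLs rather than pages
    return [loc.get_text(strip=True) for loc in soup.find_all("loc")]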

# Example (may or may not exist):
try_fetch_sitemap("https://www.mozilla.org")[:10]
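
`try_fetch_sitemap` collects every `<loc>` element. Under the sitemap protocol, a `sitemap.xml` can be either a `<urlset>` (page URLs) or a `<sitemapindex>` (links to further sitemap files), so the returned list may contain sitemap files rather than pages. The sketch below separates the two cases using the standard library's `xml.etree.ElementTree`; the function name `parse_sitemap_xml` and the sample XML are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemap protocol
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap_xml(xml_text):
    """Split a sitemap document into child sitemap URLs and page URLs."""
    root = ET.fromstring(xml_text)
    child_sitemaps = [el.text.strip()
                      for el in root.findall("sm:sitemap/sm:loc", SITEMAP_NS)
                      if el.text]
    page_urls = [el.text.strip()
                 for el in root.findall("sm:url/sm:loc", SITEMAP_NS)
                 if el.text]
    return {"sitemaps": child_sitemaps, "pages": page_urls}


# Hypothetical sitemap index: points at other sitemap files
index_xml = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.org/sitemap-pages.xml</loc></sitemap>
</sitemapindex>"""

# Hypothetical urlset: lists actual pages
urlset_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/privacy/</loc></url>
  <url><loc>https://example.org/security/</loc></url>
</urlset>"""

index_result = parse_sitemap_xml(index_xml)
urlset_result = parse_sitemap_xml(urlset_xml)
print(index_result)
print(urlset_result)
```

If `"sitemaps"` is non-empty, you would fetch one or two of those child files (with a delay) and parse them the same way, rather than following every link.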

## Unifying sources
Add a `source` field and a common schema so you can compare across institutions.
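
A minimal sketch of a common schema follows, using a `dataclass`. The `PageRecord` fields, the two converter functions, and the shapes of the raw dictionaries are all hypothetical; the point is only that each source gets its own small converter that emits the same fields plus a `source` label.

```python
from dataclasses import dataclass, asdict


@dataclass
class PageRecord:
    """Common schema shared by all sources (field names are illustrative)."""
    source: str  # e.g. "regulator" or "company"
    url: str
    title: str
    text: str


def from_regulator(raw: dict) -> PageRecord:
    # Assumed raw shape: {"link": ..., "heading": ..., "body": ...}
    return PageRecord(source="regulator", url=raw["link"],
                      title=raw["heading"], text=raw["body"])


def from_company(raw: dict) -> PageRecord:
    # Assumed raw shape: {"url": ..., "page_title": ..., "content": ...}
    return PageRecord(source="company", url=raw["url"],
                      title=raw.get("page_title", ""), text=raw["content"])


# Hypothetical scraped items from two differently structured sources
regulator_item = {"link": "https://example.org/guidance",
                  "heading": "Cookie guidance", "body": "Consent must be..."}
company_item = {"url": "https://example.com/privacy",
                "content": "We collect..."}

records = [from_regulator(regulator_item), from_company(company_item)]
rows = [asdict(r) for r in records]
print(rows)
```

Because every row has the same keys, `pd.DataFrame(rows)` yields one table that you can group or filter by `source`.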
