# Week 11 — More Complex Sites: Sitemaps + Multi-source Datasets

**Time budget:** ~2 hours  
**Goal:** Use sitemap.xml (where available) or curated sources; unify multi-source datasets.

**Theme (PhD focus):** Human factors of privacy & security — scraping public pages (privacy policies, cookie notices, security help pages, standards/regulator guidance) and extracting *UX-relevant* signals.

---


## Deliverables
- A completed notebook with working code
- A dataset variable (`rows` or `df`) saved to disk (CSV/JSON depending on week)
- 3–5 bullet reflection grounded in human factors/privacy-security research


## Responsible scraping note (important)
We will only scrape **public pages** and keep the volume small.
- Prefer a few pages, not thousands
- Respect robots.txt and Terms of Service, especially if you scale up later
- Avoid collecting personal data
- Add delays for politeness when doing multi-page work
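
One way to make the robots.txt point concrete is Python's standard-library `urllib.robotparser`. The sketch below is illustrative only: the robots.txt content and the `example.org` URLs are hypothetical and inlined so the cell runs offline. The user-agent string is the one used in Step 1.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "HF-PrivacyScraper/0.1"  # same user agent string as in Step 1


def build_robot_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded (or inlined)."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp


rp = build_robot_parser(SAMPLE_ROBOTS_TXT)

# Ask whether our user agent may fetch each (hypothetical) URL.
allowed_public = rp.can_fetch(USER_AGENT, "https://example.org/privacy/")
allowed_private = rp.can_fetch(USER_AGENT, "https://example.org/private/notes")

# Crawl-delay is a non-standard directive; crawl_delay() returns None when it is absent.
delay = rp.crawl_delay(USER_AGENT)

print(allowed_public, allowed_private, delay)
```

For a live site, `rp.set_url(...)` followed by `rp.read()` would download the real file instead of parsing an inlined string. If a crawl delay is declared, it is a reasonable value to pass to `time.sleep` between requests.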


## Step 0 — Imports

In [None]:
import json
import re
import time
from dataclasses import dataclass
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

import matplotlib.pyplot as plt
import pandas as pd

## Step 1 — Try sitemap discovery (small sample)

In [None]:
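def try_fetch_sitemap(base_url: str) -> list[str]:
    """Return the <loc> URLs from base_url/sitemap.xml, or [] if it is unavailable."""
    sitemap_url = base_url.rstrip("/") + "/sitemap.xml"
    try:
        r = requests.get(sitemap_url, timeout=20, headers={"User-Agent": "HF-PrivacyScraper/0.1"})
    except requests.RequestException:
        return []  # network error or timeout: treat as "no sitemap"
    if r.status_code != 200:
        return []
    # The "xml" parser requires the lxml package to be installed.
    soup = BeautifulSoup(r.text, "xml")
    # In a sitemap index file, these <loc> values point to child sitemaps, not pages.
    return [loc.get_text(strip=True) for loc in soup.find_all("loc")]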

locs = try_fetch_sitemap("https://www.mozilla.org")
print("Found:", len(locs))
locs[:10]
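
A `sitemap.xml` can be either a URL set (page URLs) or a sitemap index (links to further sitemap files). The following sketch distinguishes the two with the standard-library XML parser. The helper name `classify_sitemap` and the sample XML are hypothetical and exist only for illustration. The namespace URI is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample documents, inlined so the sketch runs offline.
SAMPLE_INDEX = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.org/sitemap-en.xml</loc></sitemap>
  <sitemap><loc>https://example.org/sitemap-de.xml</loc></sitemap>
</sitemapindex>"""

SAMPLE_URLSET = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/privacy/</loc></url>
</urlset>"""


def classify_sitemap(xml_text: str) -> tuple[str, list[str]]:
    """Return ("index" | "urlset" | "unknown", list of <loc> values)."""
    root = ET.fromstring(xml_text)
    tag = root.tag.rsplit("}", 1)[-1]  # strip the "{namespace}" prefix
    if tag == "sitemapindex":
        path = "sm:sitemap/sm:loc"
        kind = "index"
    elif tag == "urlset":
        path = "sm:url/sm:loc"
        kind = "urlset"
    else:
        return "unknown", []
    locs = [el.text.strip() for el in root.findall(path, SITEMAP_NS) if el.text]
    return kind, locs


index_kind, index_locs = classify_sitemap(SAMPLE_INDEX)
urlset_kind, urlset_locs = classify_sitemap(SAMPLE_URLSET)
print(index_kind, index_locs)
print(urlset_kind, urlset_locs)
```

When the result is `"index"`, each returned URL is itself a sitemap that would need to be fetched (politely, with a delay) before any page URLs appear.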

## Step 2 — Filter sitemap URLs for relevant keywords (privacy/security)

In [None]:
keywords = ["privacy", "security", "cookie", "data"]
relevant = [u for u in locs if any(k in u.lower() for k in keywords)]
relevant[:20], len(relevant)
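
Substring matching on the whole URL is simple but can over-match: `"data"` also hits hostnames such as `data.example.org` and words such as `metadata`. A stricter alternative, sketched below, matches keywords only against whole tokens of the URL path. The helper names, the extra `"cookies"` keyword, and the sample URLs are assumptions made for this illustration.

```python
import re
from urllib.parse import urlparse

# Plural "cookies" is added here because token matching is exact (an assumption of this sketch).
KEYWORDS = {"privacy", "security", "cookie", "cookies", "data"}


def path_tokens(url: str) -> set[str]:
    """Split the URL path (not the hostname) into lowercase alphanumeric tokens."""
    path = urlparse(url).path.lower()
    return {t for t in re.split(r"[^a-z0-9]+", path) if t}


def is_relevant(url: str, keywords: set[str] = KEYWORDS) -> bool:
    """True if at least one path token is exactly one of the keywords."""
    return not path_tokens(url).isdisjoint(keywords)


# Hypothetical URLs for illustration.
sample_urls = [
    "https://example.org/en/privacy/",
    "https://example.org/en/cookie-notice",
    "https://data.example.org/en/about/",     # keyword only in the hostname
    "https://example.org/en/metadata-guide",  # keyword only as part of a longer word
]

strict = [u for u in sample_urls if is_relevant(u)]
loose = [u for u in sample_urls if any(k in u.lower() for k in KEYWORDS)]
print(len(strict), len(loose))
```

The trade-off is recall: exact tokens miss variants such as `privacy-policies` only if none of their tokens is listed, so the keyword set needs more care than with substring matching.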

## Step 3 — Scrape a small subset (5 max)

In [None]:
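subset = relevant[:5]
headers = {"User-Agent": "HF-PrivacyScraper/0.1"}  # same user agent as the sitemap request
choices_re = re.compile(r"\b(opt\s?-?out|preferences|your choices|controls?)\b", re.I)
rows = []
for u in subset:
    try:
        r = requests.get(u, timeout=20, headers=headers)
        r.raise_for_status()  # record 4xx/5xx responses as errors instead of parsing error pages
        soup = BeautifulSoup(r.text, "html.parser")
        text = soup.get_text(" ", strip=True)
        rows.append({
            "url": u,
            "title": soup.title.get_text(strip=True) if soup.title else None,
            # UX signal: does the page mention user choices or controls?
            "mentions_choices": bool(choices_re.search(text)),
        })
    except Exception as e:
        rows.append({"url": u, "error": str(e)})
    time.sleep(1)  # politeness delay between requests
pd.DataFrame(rows)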
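
To cover the "unify multi-source datasets" goal and the "saved to disk" deliverable, here is a minimal sketch that merges row lists from two sources into one schema, tags each row with its source, drops duplicate URLs, and writes CSV and JSON. The `unify` helper, the column list, the file names, and the sample rows are all hypothetical. It uses only the standard library so it runs anywhere; `pd.DataFrame(unified)` would work on the same list.

```python
import csv
import json
import tempfile
from pathlib import Path

# Shared schema for all sources (an assumption based on the Step 3 row keys).
COLUMNS = ["source", "url", "title", "mentions_choices", "error"]


def unify(rows_by_source: dict[str, list[dict]]) -> list[dict]:
    """Merge rows from several sources; the first occurrence of a URL wins."""
    unified, seen = [], set()
    for source, rows in rows_by_source.items():
        for row in rows:
            url = row.get("url")
            if url in seen:
                continue  # skip duplicates across sources
            seen.add(url)
            record = {col: row.get(col) for col in COLUMNS}  # missing keys become None
            record["source"] = source
            unified.append(record)
    return unified


# Hypothetical rows standing in for sitemap-derived and hand-curated results.
sitemap_rows = [
    {"url": "https://example.org/privacy/", "title": "Privacy", "mentions_choices": True},
    {"url": "https://example.org/security/", "error": "timeout"},
]
curated_rows = [
    {"url": "https://example.org/privacy/", "title": "Privacy", "mentions_choices": True},
    {"url": "https://example.net/cookies/", "title": "Cookies", "mentions_choices": False},
]

unified = unify({"sitemap": sitemap_rows, "curated": curated_rows})

# Write to a temporary directory here; in the notebook, any working folder would do.
out_dir = Path(tempfile.mkdtemp())
csv_path = out_dir / "week11_rows.csv"
with csv_path.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(unified)

json_path = out_dir / "week11_rows.json"
json_path.write_text(json.dumps(unified, indent=2), encoding="utf-8")
print(len(unified), csv_path.name, json_path.name)
```

Keeping the `source` column makes it possible to compare sitemap-derived and hand-picked pages later, which is useful for the sampling-bias questions in the reflection.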

## Reflection

- How does sitemap-based sampling differ from “choose 5 URLs by hand”?
- What sampling biases might appear?
