# Week 09 — OOP Scraper Design

**Time budget:** ~2 hours  
**Goal:** Implement a scraper class with clear responsibilities; introduce dataclasses.

**Theme (PhD focus):** Human factors of privacy & security — scraping public pages (privacy policies, cookie notices, security help pages, standards/regulator guidance) and extracting *UX-relevant* signals.

---


## Deliverables
- A completed notebook with working code
- A dataset variable (`rows` or `df`) saved to disk (CSV/JSON depending on week)
- 3–5 bullet reflection grounded in human factors/privacy-security research


## Responsible scraping note (important)
We will only scrape **public pages** and keep the volume small.
- Prefer a few pages, not thousands
- Respect robots.txt/Terms of Service when you scale later
- Avoid collecting personal data
- Add delays for politeness when doing multi-page work
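
The robots.txt and delay points above can be sketched with the standard library's `urllib.robotparser`. This is a minimal illustration, not a full crawler policy: the robots.txt text, the `example.org` URLs, and the helper names `load_rules` and `pause_seconds` are all made up for the example, and the rules are parsed from a string so the cell runs without network access.

```python
from urllib.robotparser import RobotFileParser

USER_AGENT = "HF-PrivacyScraper/0.1"  # assumed to match the scraper's User-Agent

# Made-up robots.txt, parsed in memory so the example needs no network access.
# For a real site: rp.set_url("https://<host>/robots.txt") followed by rp.read().
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def load_rules(robots_txt: str) -> RobotFileParser:
    """Build a parser from the text of a robots.txt file."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp


def pause_seconds(rp: RobotFileParser, default: float = 1.0) -> float:
    """Use the site's Crawl-delay when it declares one, else a default pause."""
    delay = rp.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default


rules = load_rules(EXAMPLE_ROBOTS_TXT)
candidates = [
    "https://example.org/privacy",           # hypothetical allowed page
    "https://example.org/private/accounts",  # hypothetical disallowed page
]

# Keep only the URLs that the rules permit for our user agent.
allowed = [u for u in candidates if rules.can_fetch(USER_AGENT, u)]
print(allowed)
print(pause_seconds(rules))
```

In a multi-page loop, the value from `pause_seconds` would be passed to `time.sleep` between requests.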


## Step 0 — Imports

In [None]:
import re
import time
import json
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

import pandas as pd
import matplotlib.pyplot as plt
from dataclasses import dataclass

## Step 1 — Implement a scraper class + run it for 3 pages

In [None]:
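from dataclasses import dataclass

# Note: the `X | None` annotations below need Python 3.10+.
# On older versions, use `Optional[X]` from `typing` instead.
@dataclass
class ScrapeResult:
    """Outcome of scraping one URL, whether it succeeded or failed."""
    url: str
    status: int | None        # HTTP status code, or None when unavailable
    title: str | None         # text of the <title> tag, if present
    cues: dict                # cue name -> bool, as returned by extract_cues
    error: str | None = None  # exception message when the scrape failed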

class PolicyScraper:
    def __init__(self, user_agent: str = "HF-PrivacyScraper/0.1"):
        self.session = requests.Session()
        self.session.headers.update({"User-Agent": user_agent})

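    def fetch(self, url: str) -> tuple[int, str]:
        # Network only: return the real status code together with the HTML.
        r = self.session.get(url, timeout=20)
        r.raise_for_status()  # raises requests.HTTPError on 4xx/5xx
        return r.status_code, r.text

    def parse(self, html: str) -> tuple[str | None, str]:
        # HTML only: pull out the page title and the visible text.
        soup = BeautifulSoup(html, "html.parser")
        title = soup.title.get_text(strip=True) if soup.title else None
        text = soup.get_text(" ", strip=True)
        return title, text

    def extract_cues(self, text: str) -> dict:
        # Plain text only: flag whether each UX-relevant cue is mentioned.
        return {
            "choices_controls": bool(re.search(r"\b(opt\s?-?out|preferences|your choices|controls?)\b", text, re.I)),
            "retention": bool(re.search(r"\b(retention|retain)\b", text, re.I)),
            "third_party": bool(re.search(r"\b(third\s?-?party|share|sharing)\b", text, re.I)),
        }

    def scrape(self, url: str) -> ScrapeResult:
        try:
            status, html = self.fetch(url)
            title, text = self.parse(html)
            cues = self.extract_cues(text)
            return ScrapeResult(url=url, status=status, title=title, cues=cues)
        except requests.HTTPError as e:
            # The server answered with 4xx/5xx: keep its status code.
            status = e.response.status_code if e.response is not None else None
            return ScrapeResult(url=url, status=status, title=None, cues={}, error=str(e))
        except Exception as e:
            # Timeouts, DNS failures, parse errors: no HTTP status to record.
            return ScrapeResult(url=url, status=None, title=None, cues={}, error=str(e))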

In [None]:
scraper = PolicyScraper()
urls = [
    "https://www.mozilla.org/en-US/privacy/",
    "https://www.nist.gov/privacy-framework",
    "https://www.enisa.europa.eu/topics/data-protection",
]
results = []
for u in urls:
    results.append(scraper.scrape(u))
    time.sleep(1)  # brief pause between requests, per the politeness note above
results
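
The deliverables ask for a `rows` dataset saved to disk, so here is one possible way to get there from a list of `ScrapeResult` objects. It is a sketch under stated assumptions: the `ScrapeResult` class below is a stand-in with the same fields as the one in Step 1, the sample results are invented so the cell runs offline, and the helper names `to_row` and `save_rows`, the `cue_` column prefix, and the `week09_results` file stem are illustrative choices.

```python
import csv
import json
import tempfile
from dataclasses import asdict, dataclass
from pathlib import Path
from typing import Optional


@dataclass
class ScrapeResult:
    # Stand-in with the same fields as the Step 1 class.
    url: str
    status: Optional[int]
    title: Optional[str]
    cues: dict
    error: Optional[str] = None


def to_row(result: ScrapeResult) -> dict:
    """Flatten one result: each cue becomes its own cue_<name> column."""
    row = asdict(result)
    cues = row.pop("cues")
    row.update({f"cue_{name}": value for name, value in cues.items()})
    return row


def save_rows(rows: list, out_dir: Path, stem: str = "week09_results"):
    """Write the rows as CSV and JSON and return both paths."""
    csv_path = out_dir / f"{stem}.csv"
    json_path = out_dir / f"{stem}.json"
    # Failed scrapes have no cue columns, so take the ordered union of keys.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    json_path.write_text(json.dumps(rows, indent=2), encoding="utf-8")
    return csv_path, json_path


# Invented sample results so the sketch runs without network access.
sample_results = [
    ScrapeResult(
        "https://example.org/privacy", 200, "Privacy Notice",
        {"choices_controls": True, "retention": True, "third_party": False},
    ),
    ScrapeResult("https://example.org/missing", None, None, {}, error="timeout"),
]

rows = [to_row(r) for r in sample_results]
# A temporary folder keeps the sketch tidy; use a project folder in practice.
csv_path, json_path = save_rows(rows, Path(tempfile.mkdtemp()))
print(csv_path, json_path)
```

In the notebook, `sample_results` would be replaced by `results`. Since pandas is imported in Step 0, `pd.DataFrame(rows).to_csv(path, index=False)` is an alternative to the `csv` module.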

## Reflection: responsibilities

- What belongs in `fetch` vs `parse` vs `extract_cues`?
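
One way to approach this question is to ask which steps can be checked without the network. Cue extraction only needs a string, so it can be exercised offline on hand-written sentences. The sketch below uses a standalone copy of the `extract_cues` patterns (the `CUE_PATTERNS` name and both sample sentences are made up for illustration) and also shows a limitation worth noting in the reflection: the word-boundary patterns miss inflected forms such as "retained" and "shared".

```python
import re

# Standalone copy of the patterns used in PolicyScraper.extract_cues.
CUE_PATTERNS = {
    "choices_controls": r"\b(opt\s?-?out|preferences|your choices|controls?)\b",
    "retention": r"\b(retention|retain)\b",
    "third_party": r"\b(third\s?-?party|share|sharing)\b",
}


def extract_cues(text: str) -> dict:
    """Return, for each cue, whether its pattern occurs in the text."""
    return {
        name: bool(re.search(pattern, text, re.I))
        for name, pattern in CUE_PATTERNS.items()
    }


# Made-up sentences: no fetch or HTML parsing is needed to test this step.
cues = extract_cues("You can opt-out at any time. We retain logs for 30 days.")
missed = extract_cues("Data is retained and shared with partners.")

print(cues)
print(missed)
```

Here `cues` flags `choices_controls` and `retention`, while `missed` flags nothing even though the sentence is about retention and sharing.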
