# Week 09 — OOP Scraper Design

**Time budget:** ~2 hours  
**Goal:** Implement a scraper class with clear responsibilities; introduce dataclasses.

**Theme (PhD focus):** Human factors of privacy & security — scraping public pages (privacy policies, cookie notices, security help pages, standards/regulator guidance) and extracting *UX-relevant* signals.

---


## Responsible scraping note (important)
We will only scrape **public pages** and keep the volume small.
- Prefer a few pages, not thousands
- Respect robots.txt/Terms of Service when you scale later
- Avoid collecting personal data
- Add delays for politeness when doing multi-page work
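
The robots.txt and delay points above can be sketched with the standard library's `urllib.robotparser`. This is a minimal illustration, not a full politeness policy: the robots.txt text, the `example.org` URLs, and the helper names `build_parser` and `polite_delay` are made up for the example, and the rules are parsed from an inline string so that nothing is downloaded.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt rules from a string instead of fetching them."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def polite_delay(parser: RobotFileParser, user_agent: str, default: float = 1.0) -> float:
    """Use the site's Crawl-delay when it declares one, else a default pause."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default


parser = build_parser(SAMPLE_ROBOTS_TXT)
agent = "HF-PrivacyScraper/0.1"
urls = [
    "https://example.org/privacy/",
    "https://example.org/account/settings",
]

# Keep only the URLs that the rules allow this user agent to fetch.
allowed = [u for u in urls if parser.can_fetch(agent, u)]
delay = polite_delay(parser, agent)

print(allowed)
print(delay)
# In a real multi-page loop you would call time.sleep(delay) between requests.
```

For a live site you would point the parser at the real file with `set_url(...)` and `read()` rather than an inline string.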


## Setup
We'll use `requests` + `BeautifulSoup`. Install if needed:

```bash
pip install requests beautifulsoup4 pandas matplotlib
```


In [None]:
import re
import time
import json
from dataclasses import dataclass
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

import pandas as pd
import matplotlib.pyplot as plt

## OOP pattern: a Scraper class
A class helps manage:
- shared session / headers
- caching config
- parsing methods
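
As one illustration of the "caching config" item in the list above, here is a small sketch of an object that owns a cache setting and a cache. The class name `CachingFetcher`, its `use_cache` flag, and the `fake_fetch` stand-in are all hypothetical. The real fetch function is passed in, so the example runs without any network access.

```python
from typing import Callable


class CachingFetcher:
    """Hypothetical helper: wraps any fetch function with an in-memory cache."""

    def __init__(self, fetch_fn: Callable[[str], str], use_cache: bool = True):
        self.fetch_fn = fetch_fn          # how to get a page (injected)
        self.use_cache = use_cache        # the "caching config"
        self.cache: dict[str, str] = {}   # url -> html
        self.network_calls = 0            # counts calls that bypassed the cache

    def get(self, url: str) -> str:
        # Serve from the cache when allowed and available.
        if self.use_cache and url in self.cache:
            return self.cache[url]
        self.network_calls += 1
        html = self.fetch_fn(url)
        if self.use_cache:
            self.cache[url] = html
        return html


def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP request so the sketch runs offline."""
    return f"<html><title>{url}</title></html>"


fetcher = CachingFetcher(fake_fetch)
first = fetcher.get("https://example.org/privacy/")
second = fetcher.get("https://example.org/privacy/")  # served from the cache

print(fetcher.network_calls)
```

Because the setting and the cache live on the object, two fetchers can use different policies without sharing any global state.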


### 🧠 Concept: The Blueprint (Class) vs The House (Object)

- **Class** (`class PolicyScraper`): the architectural drawing. It says "every scraper has a session and a fetch method".
- **Object** (`scraper = PolicyScraper()`): an actual house built from that drawing. You can build as many houses (scraper objects) as you like from one drawing, and each keeps its own state.

### 🧠 Concept: `self`
- `self` refers to the specific object a method was called on. Read it as **"my own"**.
- `self.session` = "my own session" (not another scraper's).
- When a house (object) wants to open its *own* front door, it calls `self.open_door()`.
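
The blueprint/house idea and `self` can be seen in a toy example. `MiniScraper` is a hypothetical stand-in invented for this illustration; it stores state but fetches nothing.

```python
class MiniScraper:
    """Toy stand-in for a scraper class; it stores state but fetches nothing."""

    def __init__(self, user_agent: str):
        # Each object gets its own copies of these attributes.
        self.user_agent = user_agent
        self.visited: list[str] = []

    def visit(self, url: str) -> None:
        # "self.visited" is this object's own list, not a shared one.
        self.visited.append(url)

    def describe(self) -> str:
        return f"{self.user_agent} has visited {len(self.visited)} page(s)"


# One class (blueprint), two objects (houses).
policy_bot = MiniScraper("PolicyBot/0.1")
cookie_bot = MiniScraper("CookieBot/0.1")

policy_bot.visit("https://example.org/privacy/")

print(policy_bot.describe())
print(cookie_bot.describe())
```

Only `policy_bot` recorded a visit. `cookie_bot` was built from the same class but has its own, still empty, `visited` list.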

In [None]:
@dataclass
class ScrapeResult:
    url: str
    status: int | None
    title: str | None
    cues: dict
    error: str | None = None

In [None]:
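class PolicyScraper:
    """Fetch a policy page and flag UX-relevant cues in its visible text."""

    def __init__(self, user_agent: str = "HF-PrivacyScraper/0.1"):
        # One shared session reuses connections and sends the same headers each time.
        self.session = requests.Session()
        self.session.headers.update({"User-Agent": user_agent})

    def fetch(self, url: str) -> requests.Response:
        r = self.session.get(url, timeout=20)
        r.raise_for_status()  # turn 4xx/5xx responses into exceptions
        return r

    def parse(self, url: str, html: str, status: int | None = None) -> ScrapeResult:
        soup = BeautifulSoup(html, "html.parser")
        text = soup.get_text(" ", strip=True)
        # Each cue is True when its pattern appears anywhere in the page text.
        cues = {
            "choices_controls": bool(re.search(r"\b(opt\s?-?out|preferences|your choices|controls?)\b", text, re.I)),
            "retention": bool(re.search(r"\b(retention|retain)\b", text, re.I)),
            "third_party": bool(re.search(r"\b(third\s?-?party|sharing|share)\b", text, re.I)),
        }
        return ScrapeResult(
            url=url,
            status=status,  # the real HTTP status, not a hardcoded 200
            title=soup.title.get_text(strip=True) if soup.title else None,
            cues=cues,
        )

    def scrape(self, url: str) -> ScrapeResult:
        try:
            r = self.fetch(url)
            return self.parse(url, r.text, status=r.status_code)
        except Exception as e:
            # Keep the status code when the server answered with an HTTP error.
            response = getattr(e, "response", None)
            status = response.status_code if response is not None else None
            return ScrapeResult(url=url, status=status, title=None, cues={}, error=str(e))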

In [None]:
scraper = PolicyScraper()
result = scraper.scrape("https://www.mozilla.org/en-US/privacy/")
result
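
To see what the cue patterns match without touching the network, they can be tried on plain strings. The sketch below copies the three regular expressions from `parse` into a hypothetical `detect_cues` helper. The sample sentences are invented for illustration.

```python
import re

# Same patterns as in PolicyScraper.parse, gathered in one place.
CUE_PATTERNS = {
    "choices_controls": r"\b(opt\s?-?out|preferences|your choices|controls?)\b",
    "retention": r"\b(retention|retain)\b",
    "third_party": r"\b(third\s?-?party|sharing|share)\b",
}


def detect_cues(text: str) -> dict[str, bool]:
    """Return True for each cue whose pattern occurs in the text (case-insensitive)."""
    return {name: bool(re.search(pattern, text, re.I)) for name, pattern in CUE_PATTERNS.items()}


sample = "You can opt out of marketing emails. We share data with partners."
cues = detect_cues(sample)
print(cues)

# The word boundaries (\b) mean inflected forms are not matched:
# "retained" contains "retain" but is followed by more letters.
inflected = detect_cues("Data is retained for 30 days.")
print(inflected["retention"])
```

The second call shows a limitation worth knowing when you read the results: a page that only says "retained" or "retains" will not set the `retention` cue unless the pattern is widened.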