# Week 12 — LLM Enrichment (Gemini/OpenAI) for Policy UX Coding

**Time budget:** ~2 hours  
**Goal:** Use an LLM to label, summarize, and classify policy sections; store the enriched outputs safely.

**Theme (PhD focus):** Human factors of privacy & security — scraping public pages (privacy policies, cookie notices, security help pages, standards/regulator guidance) and extracting *UX-relevant* signals.

---


## Deliverables
- A completed notebook with working code
- A dataset variable (`rows` or `df`) saved to disk (CSV/JSON depending on week)
- 3–5 bullet reflection grounded in human factors/privacy-security research


## Responsible scraping note (important)
We will only scrape **public pages** and keep the volume small.
- Prefer a few pages, not thousands
- Respect robots.txt/Terms of Service when you scale later
- Avoid collecting personal data
- Add delays for politeness when doing multi-page work
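
The robots.txt and delay points above can be sketched with the standard library alone. This is a minimal illustration, not a full crawler: the robots.txt content, the `week12-notebook` user agent, and the helper names `build_parser` and `polite_urls` are all hypothetical, and the rules are parsed from a string so the cell runs without any network access.

```python
import time
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed offline so no request is made.
ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
Crawl-delay: 2
"""


def build_parser(robots_text: str) -> RobotFileParser:
    """Build a robots.txt parser from text that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


def polite_urls(urls, parser, user_agent="week12-notebook", default_delay=1.0):
    """Yield only the URLs robots.txt allows, pausing between them."""
    # Use the site's Crawl-delay if it declares one, else our own default.
    delay = parser.crawl_delay(user_agent) or default_delay
    first = True
    for url in urls:
        if not parser.can_fetch(user_agent, url):
            continue  # skip disallowed paths entirely
        if not first:
            time.sleep(delay)
        first = False
        yield url


parser = build_parser(ROBOTS_TXT)
```

In a real run you would fetch each site's own robots.txt and loop over `polite_urls(...)` when requesting pages; Terms of Service still need a human read.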


## Step 0 — Imports

In [None]:
import re
import time
import json
from pathlib import Path  # used in Step 3 to write the JSON file
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

import pandas as pd
import matplotlib.pyplot as plt
from dataclasses import dataclass

## Step 1 — Choose 5 short excerpts from your scraped pages

Paste excerpts (100–300 words each) below. Keep it small.

In [None]:
excerpts = [
    """PASTE EXCERPT 1""",
    """PASTE EXCERPT 2""",
    """PASTE EXCERPT 3""",
    """PASTE EXCERPT 4""",
    """PASTE EXCERPT 5""",
]
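
A quick length check helps confirm each excerpt falls in the 100–300 word range before you spend API calls on it. The sketch below is one possible check; `word_count` and `check_excerpts` are hypothetical helper names, and splitting on whitespace is only an approximation of word count.

```python
def word_count(text: str) -> int:
    """Approximate word count by splitting on whitespace."""
    return len(text.split())


def check_excerpts(excerpts, min_words=100, max_words=300):
    """Return one (index, word_count, within_range) tuple per excerpt."""
    report = []
    for i, text in enumerate(excerpts):
        n = word_count(text)
        report.append((i, n, min_words <= n <= max_words))
    return report
```

Calling `check_excerpts(excerpts)` on the list above flags any excerpt that is too short or too long, so you can trim it before labeling.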

## Step 2 — Implement an LLM call (Gemini/OpenAI) OR use the placeholder

In [None]:
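def llm_label_section(text: str) -> dict:
    """Placeholder labeler: keyword heuristics standing in for a real LLM call."""
    # TODO: replace with real API call (keep the same return shape)
    labels = {
        # User-facing choices: opt-out, preferences, controls
        "choices_controls": bool(re.search(r"\b(opt\s?-?out|preferences|your choices|controls?)\b", text, re.I)),
        # Retention language, including "retains", "retained", "retaining"
        "retention": bool(re.search(r"\b(retention|retain(s|ed|ing)?)\b", text, re.I)),
        # Third parties and sharing, including plural and inflected forms
        "third_party": bool(re.search(r"\b(third\s?-?part(y|ies)|shar(e|es|ed|ing))\b", text, re.I)),
        # Security measures, including "encrypts" and "encrypted"
        "security": bool(re.search(r"\b(encrypt(s|ed|ion)?|security|safeguards?)\b", text, re.I)),
    }
    # The "summary" is just the first 200 characters, with an ellipsis if truncated
    return {"labels": labels, "summary": text[:200] + ("..." if len(text) > 200 else "")}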

In [None]:
enriched = [llm_label_section(x) for x in excerpts]
enriched
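
When you swap the placeholder for a real model, most of the work is building a constrained prompt and validating the reply, so it helps to keep those parts independent of the provider. The sketch below assumes a hypothetical `call_model` function that takes a prompt string and returns the model's text reply (you would write it with whichever Gemini or OpenAI client you use); `build_prompt`, `parse_llm_reply`, and `llm_label_with` are illustrative names, not part of any SDK.

```python
import json

# Same label keys as the placeholder labeler in this step.
LABEL_KEYS = ("choices_controls", "retention", "third_party", "security")

PROMPT_TEMPLATE = (
    "You are coding privacy-policy text for UX research.\n"
    "Return ONLY a JSON object with two keys:\n"
    '  "labels": an object with boolean values for the keys: {keys}\n'
    '  "summary": one plain-language sentence\n\n'
    "Text:\n{text}"
)


def build_prompt(text: str) -> str:
    """Fill the template with the label keys and the excerpt."""
    return PROMPT_TEMPLATE.format(keys=", ".join(LABEL_KEYS), text=text)


def parse_llm_reply(reply: str) -> dict:
    """Validate a model reply; raise ValueError if it breaks the schema."""
    # Models often wrap JSON in extra prose, so take the outermost {...} span.
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    data = json.loads(reply[start:end + 1])  # JSONDecodeError is a ValueError
    labels = data.get("labels")
    if not isinstance(labels, dict) or set(labels) != set(LABEL_KEYS):
        raise ValueError("labels missing or has unexpected keys")
    if not all(isinstance(v, bool) for v in labels.values()):
        raise ValueError("every label must be true or false")
    summary = data.get("summary")
    if not isinstance(summary, str):
        raise ValueError("summary must be a string")
    return {"labels": labels, "summary": summary.strip()}


def llm_label_with(call_model, text: str) -> dict:
    """Label one excerpt using any prompt -> reply-text function."""
    return parse_llm_reply(call_model(build_prompt(text)))
```

The return shape matches `llm_label_section`, so `[llm_label_with(call_model, x) for x in excerpts]` could replace the list comprehension above. Rejecting malformed replies, rather than silently coercing them, keeps bad labels out of the saved dataset.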

## Step 3 — Save enriched outputs with provenance

In [None]:
from datetime import datetime, timezone
payload = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "model": "placeholder-or-your-model-name",
    "items": [{"text": t, "enriched": e} for t, e in zip(excerpts, enriched)],
}
Path("week12_enriched.json").write_text(json.dumps(payload, indent=2), encoding="utf-8")
print("Saved week12_enriched.json")
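
The nested JSON keeps provenance next to each item, but a flat table is easier to inspect and matches the `rows` or `df` deliverable. The sketch below assumes the `payload` structure built above; `flatten_payload` is a hypothetical helper name.

```python
def flatten_payload(payload: dict) -> list:
    """Turn the nested payload into one flat dict per excerpt."""
    rows = []
    for i, item in enumerate(payload["items"]):
        row = {
            "item_id": i,
            # Repeat provenance on every row so it survives filtering and merging.
            "model": payload["model"],
            "timestamp": payload["timestamp"],
            "text": item["text"],
            "summary": item["enriched"]["summary"],
        }
        # One boolean column per label, e.g. "retention", "security".
        row.update(item["enriched"]["labels"])
        rows.append(row)
    return rows
```

With pandas imported as in Step 0, `pd.DataFrame(flatten_payload(payload))` gives a `df` you can save with `to_csv`.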

## Reflection

- What did the LLM help with (coding, summarization, classification)?
- What are the risks (hallucination, bias) and how would you validate?
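
One way to approach the validation question is to hand-code the same excerpts yourself and compare against the model, label by label. The sketch below computes raw agreement and Cohen's kappa for a single boolean label; it is an illustration under the assumption that you have one human judgment per excerpt, and the function names are hypothetical. With only five excerpts the numbers are far too noisy to report, so treat this as a template for a larger sample.

```python
def percent_agreement(human, model):
    """Share of items where the two boolean codings match."""
    if len(human) != len(model) or not human:
        raise ValueError("need two non-empty lists of equal length")
    return sum(h == m for h, m in zip(human, model)) / len(human)


def cohens_kappa(human, model):
    """Cohen's kappa for two boolean codings of the same items."""
    p_observed = percent_agreement(human, model)
    p_h = sum(human) / len(human)   # share coded True by the human
    p_m = sum(model) / len(model)   # share coded True by the model
    # Agreement expected by chance if the two coders were independent.
    p_expected = p_h * p_m + (1 - p_h) * (1 - p_m)
    if p_expected == 1:
        return float("nan")  # undefined when both coders use a single class
    return (p_observed - p_expected) / (1 - p_expected)
```

For example, `cohens_kappa([True, True, False, False], [True, False, False, False])` compares a human coding of the "retention" label against the model's; disagreements are the excerpts worth rereading for hallucinated or biased labels.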
