# Week 10 — Functional Pipelines + Testing Basics

**Time budget:** ~2 hours  
**Goal:** Build transformation pipelines (map/filter) and write a few unit tests for parser functions.

**Theme (PhD focus):** Human factors of privacy & security — scraping public pages (privacy policies, cookie notices, security help pages, standards/regulator guidance) and extracting *UX-relevant* signals.

---


## Responsible scraping note (important)
We will only scrape **public pages** and keep the volume small.
- Prefer a few pages, not thousands
- Respect robots.txt/Terms of Service when you scale later
- Avoid collecting personal data
- Add delays for politeness when doing multi-page work
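
The politeness points above can be sketched with the standard library alone. This is a minimal illustration, not a complete crawler policy: `allowed_by_robots` and `polite_map` are hypothetical helper names, and the robots.txt rules are made up for the example.

```python
import time
from urllib.robotparser import RobotFileParser

def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against robots.txt rules supplied as plain text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

def polite_map(fetch, urls, delay_seconds: float = 1.0) -> list:
    """Apply `fetch` to each URL in order, pausing between requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(url))
    return results

# Invented rules for illustration only
EXAMPLE_ROBOTS = "User-agent: *\nDisallow: /private/"

print(allowed_by_robots(EXAMPLE_ROBOTS, "https://example.org/privacy"))
print(allowed_by_robots(EXAMPLE_ROBOTS, "https://example.org/private/data"))
```

For a live site, `RobotFileParser` can also load the real file through its `set_url` and `read` methods; the text-based version here keeps the sketch runnable offline.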


## Setup
We'll use `requests` + `BeautifulSoup`. Install if needed:

```bash
pip install requests beautifulsoup4 pandas matplotlib
```


In [None]:
import re
import time
import json
from dataclasses import dataclass
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

import pandas as pd
import matplotlib.pyplot as plt

## Functional pipelines + testing mindset
We'll:
- separate parsing from fetching (pure functions are testable)
- map URLs → rows
- write simple tests (assertions) for parser helpers
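
The three steps in this list can be sketched as a small map/filter pipeline in which the fetcher is passed in as an argument, so the whole thing runs offline. `build_rows`, `fake_fetch`, `toy_parse`, and the example.org pages are all hypothetical stand-ins, not part of the notebook's own code.

```python
from typing import Callable, Iterable

def build_rows(urls: Iterable[str],
               fetch: Callable[[str], str],
               parse: Callable[[str], dict]) -> list:
    # map: URL -> (URL, page text)
    pages = map(lambda u: (u, fetch(u)), urls)
    # filter: drop pages that came back empty
    non_empty = filter(lambda pair: pair[1].strip() != "", pages)
    # map: (URL, text) -> one row per page
    return [{"url": u, **parse(t)} for u, t in non_empty]

# Canned pages standing in for the network
FAKE_PAGES = {
    "https://example.org/a": "You can opt out at any time.",
    "https://example.org/b": "",
}

def fake_fetch(url: str) -> str:
    return FAKE_PAGES[url]

def toy_parse(text: str) -> dict:
    return {"mentions_opt_out": "opt out" in text.lower()}

rows = build_rows(list(FAKE_PAGES), fake_fetch, toy_parse)
print(rows)
```

Because `fetch` is a parameter, swapping `fake_fetch` for a real network function changes nothing else in the pipeline.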


### 🧠 Concept: Pure Functions (The Math Machine)

A **pure function** is like `2 + 2`: it ALWAYS equals `4`. The same input gives the same output, and nothing else changes.
- **Impure**: `fetch_url()` (depends on the internet).
- **Pure**: `parse_html()` (depends ONLY on the HTML you give it).

**Why care?**
You can test pure functions without the internet. Checking small pieces of code in isolation like this is called **unit testing**.
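
As a small illustration of a pure parser, the sketch below turns an HTML string into text and can be checked with no network access. It uses the standard library's `html.parser` as a stand-in for BeautifulSoup so that it runs without third-party packages; `html_to_text` and `SAMPLE_HTML` are hypothetical names invented for this example.

```python
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Collect the text found between tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

def html_to_text(html: str) -> str:
    """Pure: the result depends only on the `html` argument."""
    collector = _TextCollector()
    collector.feed(html)
    collector.close()
    return " ".join(collector.chunks)

SAMPLE_HTML = "<html><body><h1>Privacy</h1><p>You can <b>opt out</b> anytime.</p></body></html>"

first = html_to_text(SAMPLE_HTML)
second = html_to_text(SAMPLE_HTML)
print(first)
print(first == second)  # same input, same output
```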

### 🧠 Concept: Unit Tests (The Safety Net)

Code: `assert result == expected`

It's a robotic checklist.
- *"Did the parser find the 'Opt Out' button? Yes/No."*
- If the answer is no (the condition is `False`), the robot yells: Python raises an `AssertionError`.
- Re-run the tests **after every change** to make sure you didn't break anything.
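
To see what the "yell" looks like, here is a minimal sketch that catches the `AssertionError` and reports it as text instead of stopping the program. `check` is a hypothetical helper written for this illustration; in everyday tests a bare `assert` is enough.

```python
def check(result, expected, label: str) -> str:
    """Return 'PASS', or a 'FAIL: ...' message describing the mismatch."""
    try:
        # The text after the comma becomes the error message on failure
        assert result == expected, f"{label}: expected {expected!r}, got {result!r}"
    except AssertionError as err:
        return f"FAIL: {err}"
    return "PASS"

print(check(2 + 2, 4, "addition"))
print(check("opt out".upper(), "OPT-OUT", "upper-casing"))
```

If `pytest` is installed, functions named `test_*` inside files named `test_*.py` are collected and run automatically, which makes re-running the whole checklist a single command.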

In [None]:
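def parse_cues(text: str) -> dict:
    """Flag UX-relevant cues in page text (pure: depends only on `text`)."""
    return {
        # Mentions of user choice: "opt out", "opt-out", "preferences", "controls", ...
        "choices_controls": bool(re.search(r"\b(opt\s?-?out|preferences|your choices|controls?)\b", text, re.I)),
        # Mentions of data retention ("retention" or "retain" as whole words)
        "retention": bool(re.search(r"\b(retention|retain)\b", text, re.I)),
    }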

def test_parse_cues():
    t = "You can opt out in settings. We retain data for 30 days."
    cues = parse_cues(t)
    assert cues["choices_controls"] is True
    assert cues["retention"] is True

test_parse_cues()
print("Tests passed")
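
A single happy-path test leaves the `False` branches unchecked. One possible extension is a table of cases, including texts where a cue should not fire. The sketch repeats `parse_cues` so that it runs on its own; the example sentences and the name `test_parse_cues_table` are invented for illustration.

```python
import re

def parse_cues(text: str) -> dict:
    # Copy of the notebook's parser, repeated so this block is self-contained
    return {
        "choices_controls": bool(re.search(r"\b(opt\s?-?out|preferences|your choices|controls?)\b", text, re.I)),
        "retention": bool(re.search(r"\b(retention|retain)\b", text, re.I)),
    }

# (text, expected choices_controls, expected retention)
CASES = [
    ("You can opt out in settings. We retain data for 30 days.", True, True),
    ("Manage your cookie preferences here.", True, False),
    ("Our data retention period is 12 months.", False, True),
    ("We value your trust.", False, False),
]

def test_parse_cues_table():
    for text, want_choices, want_retention in CASES:
        cues = parse_cues(text)
        # Passing `text` as the message shows which case failed
        assert cues["choices_controls"] is want_choices, text
        assert cues["retention"] is want_retention, text

test_parse_cues_table()
print(f"{len(CASES)} cases passed")
```

Adding a new edge case then means adding one tuple to `CASES` rather than writing another test function.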

In [None]:
urls = [
    "https://www.mozilla.org/en-US/privacy/",
    "https://www.nist.gov/privacy-framework",
]

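def fetch_text(url: str) -> str:
    """Download a page and return its text content (impure: needs the network)."""
    r = requests.get(url, timeout=20)
    r.raise_for_status()  # fail loudly on 4xx/5xx instead of parsing an error page
    soup = BeautifulSoup(r.text, "html.parser")
    # Join text nodes with single spaces, stripping whitespace from each piece
    return soup.get_text(" ", strip=True)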

rows = [{"url": u, **parse_cues(fetch_text(u))} for u in urls]
pd.DataFrame(rows)
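
In a multi-page run, one failing request would stop the whole list comprehension. A possible way around that, sketched here with a fake fetcher so it runs offline, is to record the error next to the URL and carry on. `try_fetch` and `fake_fetch` are hypothetical names, and the simulated failure is invented for the example.

```python
def try_fetch(fetch, url: str):
    """Return (url, text, error) so one bad page does not stop the batch."""
    try:
        return url, fetch(url), None
    except Exception as exc:  # deliberately broad for this sketch
        return url, "", str(exc)

def fake_fetch(url: str) -> str:
    # Stand-in for a network call: URLs containing 'broken' fail
    if "broken" in url:
        raise ConnectionError("simulated network failure")
    return "You can opt out at any time."

results = [try_fetch(fake_fetch, u) for u in [
    "https://example.org/ok",
    "https://example.org/broken",
]]

# Split successes from failures with two small filters
ok = [(u, text) for u, text, err in results if err is None]
failed = [(u, err) for u, text, err in results if err is not None]
print(ok)
print(failed)
```

With `requests`, narrowing the `except` clause to `requests.RequestException` would avoid hiding unrelated bugs.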