# Week 04 - Data Formats: CSV vs JSON + Clean Schemas

**Time budget:** ~2 hours  
**Goal:** Design a simple schema, export and import clean datasets, and apply basic validation.

**Theme (PhD focus):** Human factors of privacy & security — scraping public pages (privacy policies, cookie notices, security help pages, standards/regulator guidance) and extracting *UX-relevant* signals.

---


## Responsible scraping note (important)
We will only scrape **public pages** and keep the volume small.
- Prefer a few pages, not thousands
- Respect `robots.txt` and Terms of Service, especially when you scale up later
- Avoid collecting personal data
- Add delays for politeness when doing multi-page work
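
The points above can be sketched in code. This is a minimal sketch, not a complete crawler: `allowed_by_robots` and `polite_fetch_all` are hypothetical helper names, the `robots.txt` content is passed in as a string (so nothing is downloaded here), and `fetch` stands for whatever download function you use, for example a thin wrapper around `requests.get`.

```python
import time
from typing import Callable
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against robots.txt rules supplied as plain text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def polite_fetch_all(
    urls: list[str],
    fetch: Callable[[str], str],
    robots_txt: str,
    delay_s: float = 2.0,
    max_pages: int = 5,
) -> dict[str, str]:
    """Fetch a small number of allowed URLs, pausing between requests."""
    results: dict[str, str] = {}
    for url in urls[:max_pages]:      # keep the volume small
        if not allowed_by_robots(robots_txt, url):
            continue                  # skip paths the site disallows
        if results:
            time.sleep(delay_s)       # politeness delay between requests
        results[url] = fetch(url)
    return results
```

The default values for `delay_s` and `max_pages` are illustrative assumptions; choose values that suit the site you are studying.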


## Setup
We'll use `requests` + `BeautifulSoup`. Install if needed:

```bash
pip install requests beautifulsoup4 pandas matplotlib
```


In [None]:
import re
import time
import json
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

### 🧠 Concept: CSV vs JSON

| Feature | CSV (Comma Separated Values) | JSON (JavaScript Object Notation) |
| :--- | :--- | :--- |
| **Shape** | Flat Table (Rows & Columns) | Nested Tree (Boxes inside Boxes) |
| **Best For** | Excel, Pandas, flat lists of records | API responses, complex data |
| **Analogy** | A Spreadsheet | A Folder System |

**The Problem**: How do you store a list of authors `["Alice", "Bob"]` in a single CSV cell?
- CSV struggles with this: you have to flatten the list into one string, such as `"Alice, Bob"`.
- JSON stores it natively as an array: `["Alice", "Bob"]`.
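
To make the difference concrete, here is a small sketch using only the standard library. The author names are made-up sample data, and the second name deliberately contains a comma to show where the comma-joined workaround breaks.

```python
import csv
import io
import json

authors = ["Alice", "Bob, Jr."]  # made-up names; the second contains a comma

# JSON keeps the list as a list.
from_json = json.loads(json.dumps({"authors": authors}))["authors"]

# CSV has to squeeze the list into one string cell.
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["authors"])
writer.writerow([", ".join(authors)])

buffer.seek(0)
cell = list(csv.reader(buffer))[1][0]  # the single cell, read back as a string
from_csv = cell.split(", ")

print(from_json)  # two items, exactly as written
print(from_csv)   # three items: the comma inside "Bob, Jr." was split too
```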

## Designing a clean schema
A schema is an agreed set of keys and their types. Example fields:
- url (str)
- source (str), e.g., 'mozilla', 'nist'
- cues (bool flags), e.g., `mentions_privacy`
- extracted headings (list[str]), which only JSON stores naturally

This week: export CSV (flat) vs JSON (nested).
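
Because the goal includes basic validation, a schema like this can be written down as a plain dictionary and checked row by row. This is a minimal sketch under assumptions: `SCHEMA` and `validate_row` are hypothetical names, and the field names are borrowed from the sample rows in this notebook rather than from any fixed standard.

```python
SCHEMA = {
    "url": str,
    "status": int,
    "mentions_privacy": bool,
    "headings_preview": list,
}


def validate_row(row: dict, schema: dict = SCHEMA) -> list[str]:
    """Return a list of problems found in one row (empty list means valid)."""
    problems = []
    for key, expected in schema.items():
        if key not in row:
            problems.append(f"missing key: {key}")
            continue
        value = row[key]
        # bool is a subclass of int in Python, so reject True/False for int fields.
        wrong_type = not isinstance(value, expected) or (
            expected is int and isinstance(value, bool)
        )
        if wrong_type:
            problems.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}"
            )
    for key in row:
        if key not in schema:
            problems.append(f"unexpected key: {key}")
    return problems


good = {"url": "a", "status": 200, "mentions_privacy": True, "headings_preview": ["Cookies"]}
bad = {"url": "b", "status": "200", "mentions_privacy": False}

print(validate_row(good))  # no problems
print(validate_row(bad))   # wrong type for status, missing headings_preview
```

Returning a list of problems instead of raising on the first one makes it easier to report everything wrong with a scraped row in a single pass.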


In [None]:
import csv
from pathlib import Path

In [None]:
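def to_json(path: str, rows: list[dict]) -> None:
    """Write all rows, including nested lists, to a JSON file."""
    Path(path).write_text(json.dumps(rows, indent=2), encoding="utf-8")

def to_csv_flat(path: str, rows: list[dict], fieldnames: list[str]) -> None:
    """Write only the given flat columns to a CSV file; other keys are dropped."""
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        for r in rows:
            # Keys missing from a row become empty cells.
            w.writerow({k: r.get(k) for k in fieldnames})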

In [None]:
rows = [
    {"url":"a", "status":200, "mentions_privacy":True, "headings_preview":["Your choices","Cookies"]},
    {"url":"b", "status":200, "mentions_privacy":False, "headings_preview":["Overview"]},
]
to_json("week04_data.json", rows)
to_csv_flat("week04_data.csv", rows, fieldnames=["url","status","mentions_privacy"])

print("Wrote:", "week04_data.json", "and", "week04_data.csv")

## Note
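CSV is great for tables, but it can't naturally store lists (like headings) without awkward encoding. That is why `headings_preview` is left out of the CSV export above and appears only in the JSON file.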
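
One common workaround, sketched here as an assumption rather than as part of this week's required code, is to JSON-encode the list inside a single CSV cell and decode it again on import. The helper names `write_csv_with_lists` and `read_csv_with_lists` are hypothetical. Because `csv` reads every cell back as a string, the reader also converts `status` and `mentions_privacy` to their intended types.

```python
import csv
import io
import json

FIELDNAMES = ["url", "status", "mentions_privacy", "headings_preview"]


def write_csv_with_lists(f, rows: list[dict]) -> None:
    """Write rows to an open text file, storing the list column as JSON text."""
    writer = csv.DictWriter(f, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in rows:
        flat = {k: row.get(k) for k in FIELDNAMES}
        flat["headings_preview"] = json.dumps(row.get("headings_preview", []))
        writer.writerow(flat)


def read_csv_with_lists(f) -> list[dict]:
    """Read the rows back and restore the original Python types."""
    rows = []
    for raw in csv.DictReader(f):
        rows.append({
            "url": raw["url"],
            "status": int(raw["status"]),
            "mentions_privacy": raw["mentions_privacy"] == "True",
            "headings_preview": json.loads(raw["headings_preview"]),
        })
    return rows


sample = [
    {"url": "a", "status": 200, "mentions_privacy": True,
     "headings_preview": ["Your choices", "Cookies"]},
]

buffer = io.StringIO()  # in-memory stand-in for a file on disk
write_csv_with_lists(buffer, sample)
buffer.seek(0)
restored = read_csv_with_lists(buffer)
print(restored == sample)
```

With a real file, open it with `newline=""` and `encoding="utf-8"`, as `to_csv_flat` does. The trade-off is that the encoded cell is harder to read in a spreadsheet, so this suits pipelines more than manual inspection.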
