# Week 04 - Data Formats: CSV vs JSON + Clean Schemas

**Time budget:** ~2 hours  
**Goal:** Design a simple schema, export and re-import clean datasets, and apply basic validation.

**Theme (PhD focus):** Human factors of privacy & security — scraping public pages (privacy policies, cookie notices, security help pages, standards/regulator guidance) and extracting *UX-relevant* signals.

---


## Deliverables
- A completed notebook with working code
- A dataset variable (`rows` or `df`) saved to disk as both JSON and CSV
- A 3–5 bullet reflection grounded in human factors/privacy-security research


## Responsible scraping note (important)
We will only scrape **public pages** and keep the volume small.
- Prefer a few pages, not thousands
- Respect robots.txt/Terms of Service when you scale later
- Avoid collecting personal data
- Add delays for politeness when doing multi-page work


## Step 0 - Imports

In [None]:
import re
import time
import json
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

## Step 1 - Start from your Week 2 or Week 3 `rows`

In [None]:
rows = [
    # paste your rows here or re-run your scraper from prior weeks
]
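
If you do not have earlier output at hand, a small placeholder dataset can stand in while you work through the export steps. The sketch below is only an illustration: the field names (`url`, `status`, `num_headings`, `headings`), the `example.org` URLs, and the variable name `sample_rows` are assumptions, not the output of any particular scraper. Substitute whatever your Week 2 or Week 3 code actually produced.

```python
# Hypothetical sample rows; field names and values are illustrative assumptions.
sample_rows = [
    {
        "url": "https://example.org/privacy",
        "status": 200,
        "num_headings": 2,
        # A nested list: natural in JSON, awkward in a flat CSV.
        "headings": ["What we collect", "Your choices"],
    },
    {
        "url": "https://example.org/cookies",
        "status": 200,
        "num_headings": 1,
        "headings": ["Cookie settings"],
    },
]

# Quick shape check: a clean schema means every row exposes the same keys.
keys = {frozenset(r) for r in sample_rows}
print(len(sample_rows), "rows;", "consistent keys" if len(keys) == 1 else "inconsistent keys")
```

Checking that all rows share one set of keys is a cheap first test of whether the data will fit a single CSV header.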

## Step 2 - Export JSON (nested-friendly) and CSV (flat)

### 🧠 Pro Tip: `pathlib` for File Paths

Instead of:
```python
open("data/file.txt", "w").write("hello")
```
We use:
```python
from pathlib import Path
Path("data/file.txt").write_text("hello", encoding="utf-8")
```
`pathlib` handles macOS/Windows path separator differences (`/` vs `\`) for you, and `write_text` opens, writes, and closes the file in a single call.
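
One caveat: `write_text` raises `FileNotFoundError` if the parent folder (here `data/`) does not exist yet. The following is a minimal sketch of building a path with the `/` operator and creating the folder first. It writes into a temporary directory so it does not touch your project files; the `data` folder and `file.txt` names simply mirror the tip above.

```python
import tempfile
from pathlib import Path

# A temporary directory keeps this sketch away from your real project files.
with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "data"                # "/" joins path parts portably
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing

    out_file = out_dir / "file.txt"
    out_file.write_text("hello", encoding="utf-8")

    # Read the file back to confirm the write worked.
    content = out_file.read_text(encoding="utf-8")
    print(out_file.name, out_file.suffix, content)
```

In your own notebook you would replace `Path(tmp)` with a project folder such as `Path("data")`.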

In [None]:
import csv
from pathlib import Path

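def to_json(path: str, rows: list[dict]) -> None:
    """Write rows as a JSON array; nested lists and dicts are preserved."""
    # ensure_ascii=False keeps non-ASCII text readable instead of \uXXXX escapes.
    text = json.dumps(rows, indent=2, ensure_ascii=False)
    Path(path).write_text(text, encoding="utf-8")

def to_csv_flat(path: str, rows: list[dict], fieldnames: list[str]) -> None:
    """Write only the chosen fields to CSV; all other keys are dropped."""
    # newline="" lets the csv module control line endings (avoids blank rows on Windows).
    with open(path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=fieldnames)
        w.writeheader()
        for r in rows:
            # r.get(k) yields None for a missing key, which csv writes as an empty cell.
            w.writerow({k: r.get(k) for k in fieldnames})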
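
The goal for this week also mentions importing, so it helps to read the files back and compare. The sketch below is one possible way to do that, not a required implementation: `from_json` and `from_csv` are hypothetical helper names, and the `demo` row is made up. It runs in a temporary directory so it stays self-contained.

```python
import csv
import json
import tempfile
from pathlib import Path

def from_json(path: str) -> list[dict]:
    """Load a JSON array of row dicts."""
    return json.loads(Path(path).read_text(encoding="utf-8"))

def from_csv(path: str) -> list[dict]:
    """Load CSV rows as dicts; every value comes back as a string."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

# Made-up row for illustration, including a missing value (None).
demo = [{"url": "https://example.org/privacy", "status": 200, "num_headings": None}]

with tempfile.TemporaryDirectory() as tmp:
    json_path = Path(tmp) / "demo.json"
    csv_path = Path(tmp) / "demo.csv"

    # Write the same data in both formats.
    json_path.write_text(json.dumps(demo, indent=2), encoding="utf-8")
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=["url", "status", "num_headings"])
        w.writeheader()
        w.writerows(demo)

    # Read both back while the files still exist.
    back_json = from_json(str(json_path))
    back_csv = from_csv(str(csv_path))

print(back_json[0])  # types survive: 200 stays an int, None stays None
print(back_csv[0])   # types are gone: "200" is a string, None became ""
```

The JSON round trip returns the original values, while the CSV round trip returns strings only. If you re-import from CSV, you need to convert types yourself and decide how an empty cell should be interpreted.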

In [None]:
# TODO: choose a minimal set of CSV fields you care about
fieldnames = ["url", "status", "num_headings"]  # example
to_json("week04_rows.json", rows)
to_csv_flat("week04_rows.csv", rows, fieldnames=fieldnames)
print("Saved week04_rows.json and week04_rows.csv")
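
Before or after saving, a basic validation pass can catch rows that do not match your schema. The following is a minimal sketch under stated assumptions: `validate_rows` is a hypothetical helper, and the `schema` mapping reuses the example field names from above with types chosen for illustration. It checks only that each required field is present and has the expected Python type.

```python
def validate_rows(rows: list[dict], required: dict[str, type]) -> list[str]:
    """Return human-readable problems; an empty list means every row passed."""
    problems = []
    for i, row in enumerate(rows):
        for field, expected in required.items():
            if field not in row:
                problems.append(f"row {i}: missing '{field}'")
            elif not isinstance(row[field], expected):
                found = type(row[field]).__name__
                problems.append(
                    f"row {i}: '{field}' should be {expected.__name__}, got {found}"
                )
    return problems

# Hypothetical schema: field name -> expected Python type.
schema = {"url": str, "status": int, "num_headings": int}

good = {"url": "https://example.org/privacy", "status": 200, "num_headings": 3}
bad = {"url": "https://example.org/cookies", "status": "200"}  # wrong type, missing field

problems = validate_rows([good, bad], schema)
for p in problems:
    print(p)
```

Returning a list of messages instead of raising on the first error lets you see every problem in a small dataset at once. In your notebook you would call `validate_rows(rows, schema)` with your own field names.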

## Reflection: schema decisions

- What did you lose when converting to CSV?
- Which format fits your research workflow better and why?
