# Week 08 — From Notebook to Script: Project Structure

**Time budget:** ~2 hours  
**Goal:** Move notebook code into `.py` modules, run them from the VS Code terminal, and introduce `argparse` for command-line options.

**Theme (PhD focus):** Human factors of privacy & security — scraping public pages (privacy policies, cookie notices, security help pages, standards/regulator guidance) and extracting *UX-relevant* signals.

---


## Deliverables
- A completed notebook with working code
- A dataset variable (`rows` or `df`) saved to disk (CSV/JSON depending on week)
- A reflection of 3–5 bullets grounded in human factors/privacy-security research


## Responsible scraping note (important)
We will only scrape **public pages** and keep the volume small.
- Prefer a few pages, not thousands
- Respect `robots.txt` and the site's Terms of Service, especially when you scale up later
- Avoid collecting personal data
- Add delays for politeness when doing multi-page work
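
The `robots.txt` point above can be checked in code. The following is a minimal sketch using the standard library's `urllib.robotparser`. The helper name `allowed_by_robots`, the sample rules, and the `example.org` URLs are made up for illustration; a real script would load the site's actual file (for example with `RobotFileParser.set_url()` and `read()`).

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits fetching url."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())  # parse rules from text, no network needed
    return rp.can_fetch(user_agent, url)


# Hypothetical rules, not taken from any real site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(allowed_by_robots(SAMPLE_ROBOTS, "https://example.org/privacy/"))
print(allowed_by_robots(SAMPLE_ROBOTS, "https://example.org/private/notes"))
```

A check like this covers only `robots.txt`; Terms of Service still have to be read by a person.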


## Step 0 — Imports

In [None]:
import re
import time
import json
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

import pandas as pd
import matplotlib.pyplot as plt

## Step 1 — Draft a mini project structure (in markdown)

```
privacy_scraper/
  scraper.py
  analyze.py
  run.py
  data/
```
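
If you would rather create this layout from Python than by hand, a small sketch like the one below should work. The function name `scaffold` is hypothetical, and the files it creates are empty; what goes into each module is up to you.

```python
from pathlib import Path


def scaffold(root="privacy_scraper"):
    """Create the project folder, empty module files, and the data/ directory."""
    root = Path(root)
    (root / "data").mkdir(parents=True, exist_ok=True)
    for name in ("scraper.py", "analyze.py", "run.py"):
        # touch() creates the file if missing and does not erase existing content
        (root / name).touch(exist_ok=True)
    return sorted(p.name for p in root.iterdir())
```

Calling `scaffold()` from the folder where you keep course work creates `privacy_scraper/` next to your notebook.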


## Step 2 — Write a `main()` that returns rows (still in notebook)

In [None]:
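def main():
    """Fetch each URL and return one row (dict) per page."""
    urls = [
        "https://www.mozilla.org/en-US/privacy/",
        "https://www.nist.gov/privacy-framework",
    ]
    rows = []
    for u in urls:
        row = {"url": u, "title": None, "error": None}  # same keys in every row
        try:
            resp = requests.get(u, timeout=20)
            resp.raise_for_status()  # treat 4xx/5xx responses as errors
            soup = BeautifulSoup(resp.text, "html.parser")
            if soup.title:
                row["title"] = soup.title.get_text(strip=True)
        except requests.RequestException as e:
            row["error"] = str(e)
        rows.append(row)
        time.sleep(1)  # polite delay between requests
    return rows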

rows = main()
rows
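
The deliverables ask for the dataset to be saved to disk. One possible way to do that with only the standard library is sketched below. The function name `save_rows` and the file names `rows.json` and `rows.csv` are assumptions for illustration, not part of the assignment.

```python
import csv
import json
from pathlib import Path


def save_rows(rows, out_dir="data"):
    """Write rows (a list of dicts) to rows.json and rows.csv inside out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    json_path = out / "rows.json"
    json_path.write_text(json.dumps(rows, indent=2, ensure_ascii=False), encoding="utf-8")

    # Rows may not all share the same keys, so collect every key
    # in first-seen order to build the CSV header.
    fieldnames = list(dict.fromkeys(key for row in rows for key in row))
    csv_path = out / "rows.csv"
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames, restval="")
        writer.writeheader()
        writer.writerows(rows)
    return json_path, csv_path
```

After running `main()`, calling `save_rows(rows)` would write both files into a `data/` folder relative to the current working directory.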

## Step 3 — Move to VS Code (offline step)

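- Copy the imports from Step 0 and `main()` into `run.py`
- Add a call at the bottom so the script prints something, for example `print(main())` under `if __name__ == "__main__":`
- In the VS Code terminal, from the `privacy_scraper/` folder, run: `python run.py`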
- Paste your terminal output here (as markdown)
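
This week's goal also mentions `argparse`. A minimal sketch of how `run.py` might accept options from the terminal is shown below. The option names (`urls`, `--out`, `--delay`) and their defaults are illustrative assumptions, not a required interface.

```python
import argparse

DEFAULT_URLS = [
    "https://www.mozilla.org/en-US/privacy/",
    "https://www.nist.gov/privacy-framework",
]


def build_parser():
    """Describe the command-line options that run.py understands."""
    parser = argparse.ArgumentParser(
        description="Scrape page titles from a few public pages."
    )
    parser.add_argument("urls", nargs="*", default=DEFAULT_URLS,
                        help="pages to fetch (defaults to the two course URLs)")
    parser.add_argument("--out", default="data/rows.json",
                        help="where to save the results")
    parser.add_argument("--delay", type=float, default=1.0,
                        help="seconds to wait between requests")
    return parser


# An explicit list stands in for what the terminal would normally supply.
args = build_parser().parse_args(["--delay", "2", "https://www.nist.gov/privacy-framework"])
print(args.urls, args.out, args.delay)
```

In `run.py` itself you would call `build_parser().parse_args()` with no arguments so that it reads the real command line, for example `python run.py --delay 2 https://www.nist.gov/privacy-framework`. Running `python run.py --help` prints the generated usage text.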

## Reflection

- What was harder in scripts vs notebooks?
