# Week 1 — Guided Assignment (Human Factors, Privacy & Security)

**Goal:** Build your first mini dataset from public privacy/security-related pages.

**Deliverable:** A list of dictionaries named `dataset` with at least **5 rows**.
Each row should include:
- `url`
- `status`
- `title`
- `num_links`
- `mentions_cookies`
- `mentions_privacy`
- `mentions_security`
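
One way to check the deliverable before submitting is a small validator. This is only a sketch: `check_rows` and `REQUIRED_KEYS` are hypothetical names, not part of the assignment template. It assumes a row is acceptable if it has all seven fields, or if it is an error row of the kind Step 3 allows.

```python
REQUIRED_KEYS = {
    "url", "status", "title", "num_links",
    "mentions_cookies", "mentions_privacy", "mentions_security",
}

def check_rows(dataset: list) -> list:
    """Return a list of problems; an empty list means the dataset looks fine."""
    problems = []
    if len(dataset) < 5:
        problems.append(f"expected at least 5 rows, found {len(dataset)}")
    for i, row in enumerate(dataset):
        if "error" in row:
            # Error rows only need the URL and the error message.
            if "url" not in row:
                problems.append(f"row {i}: error row is missing 'url'")
            continue
        missing = REQUIRED_KEYS - row.keys()
        if missing:
            problems.append(f"row {i}: missing {sorted(missing)}")
    return problems

# Hypothetical sample rows, for illustration only.
sample = [
    {"url": "https://example.org/a", "status": 200, "title": "A", "num_links": 3,
     "mentions_cookies": True, "mentions_privacy": True, "mentions_security": False},
    {"url": "https://example.org/b", "error": "timeout"},
    {"url": "https://example.org/c", "status": 200},
]
problems = check_rows(sample)
for p in problems:
    print(p)
```

Running `check_rows(dataset)` on your own list after Step 3 should ideally return an empty list.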

**Time target:** ~60–90 minutes

---

## Step 0 — Imports
Run the cell below.


In [None]:
import requests
from bs4 import BeautifulSoup

## Step 1 — Choose your 5 URLs (public pages)

Pick pages relevant to human factors of privacy & security, for example:
- privacy policy pages
- account security help pages
- cookie policy pages
- “how we use your data” pages
- NIST / ISO / ENISA public pages
- university lab pages that mention privacy/security

Add them to the list below.

Tip: mix source types (companies, standards bodies, university labs) so the rows in your dataset are worth comparing.
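
To see how varied a candidate list is, one option is to count the distinct hostnames in it. The sketch below uses only the standard library. `candidate_urls` is a hypothetical list of placeholder addresses, and the HTTPS check is an extra assumption, not an assignment requirement.

```python
from urllib.parse import urlparse

# Hypothetical placeholder URLs; swap in your own candidates.
candidate_urls = [
    "https://example.org/privacy",
    "https://example.org/cookies",
    "https://example.com/security",
    "http://example.net/help",
]

# A set keeps each hostname once, so its size is the number of distinct sources.
hosts = {urlparse(u).netloc for u in candidate_urls}

# Flag anything that is not served over HTTPS.
not_https = [u for u in candidate_urls if urlparse(u).scheme != "https"]

print("Distinct hosts:", len(hosts))
print("Not HTTPS:", not_https)
```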


In [None]:
urls = [
    # Replace these with your chosen pages (at least 5)
    "https://www.mozilla.org/en-US/privacy/",
    "https://support.google.com/accounts/answer/6294825?hl=en",  # account security example
    "https://www.wikipedia.org/",
    "https://www.nist.gov/privacy-framework",
    "https://www.enisa.europa.eu/topics/data-protection",
]

## Step 2 — Implement `analyze_page(url)`

Use the template below and fill in the TODOs.

Hints:
- `requests.get(url, timeout=20)`
- `BeautifulSoup(r.text, "html.parser")`
- `soup.find("title")`
- `soup.find_all("a")`
- `soup.get_text(" ", strip=True).lower()`
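
The parsing hints can be tried without any network access by feeding BeautifulSoup a small HTML string. The snippet below is a sketch; the HTML is made up for illustration and stands in for `r.text`.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a downloaded page.
html = """
<html>
  <head><title> Example Privacy Policy </title></head>
  <body>
    <h1>Your choices</h1>
    <p>We use cookies to keep you signed in.</p>
    <a href="/opt-out">Opt out</a>
    <a href="/security">Account security</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# find() returns None when the tag is absent, so guard before reading text.
title_tag = soup.find("title")
title = title_tag.get_text(strip=True) if title_tag else None

# find_all() returns a list-like result, so len() gives the link count.
num_links = len(soup.find_all("a"))

# Join all text nodes with spaces and lowercase for case-insensitive checks.
text_lower = soup.get_text(" ", strip=True).lower()

print(title, num_links, "cookie" in text_lower)
```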


In [None]:
def analyze_page(url: str, timeout: int = 20) -> dict:
    # TODO 1: fetch the page
    # Note: 4xx/5xx responses do not raise here; the code is recorded in "status".
    r = requests.get(url, timeout=timeout)

    # TODO 2: parse HTML
    soup = BeautifulSoup(r.text, "html.parser")

    # TODO 3: title extraction (handle missing title)
    title_tag = soup.find("title")
    title = title_tag.get_text(strip=True) if title_tag else None

    # TODO 4: count links
    links = soup.find_all("a")

    # TODO 5: build a lowercased text blob for keyword checks
    text_lower = soup.get_text(" ", strip=True).lower()

    # TODO 6: return a dict row (schema below)
    return {
        "url": url,
        "status": r.status_code,
        "title": title,
        "num_links": len(links),
        "mentions_cookies": ("cookie" in text_lower),
        "mentions_privacy": ("privacy" in text_lower),
        "mentions_security": ("security" in text_lower),
    }

## Step 3 — Build the dataset (list of dicts)

Call `analyze_page` on each URL, catching exceptions so that one failing page does not stop the loop.

Expected:
- `dataset` ends up with one dict per URL
- If a URL fails, store `{"url": ..., "error": ...}`
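
The expected shape can be previewed offline with a stand-in function. In this sketch, `fake_analyze`, `demo_urls`, and `demo_dataset` are hypothetical names, and the failure is simulated rather than coming from a real request.

```python
def fake_analyze(url: str) -> dict:
    """Hypothetical stand-in for analyze_page that fails on one URL."""
    if "broken" in url:
        raise ValueError("simulated failure")
    return {"url": url, "status": 200}

demo_urls = ["https://example.org/ok", "https://example.org/broken"]
demo_dataset = []

for u in demo_urls:
    try:
        demo_dataset.append(fake_analyze(u))
    except Exception as e:
        # Keep the URL so the failure can be traced later.
        demo_dataset.append({"url": u, "error": str(e)})

print(demo_dataset)
```

Both URLs end up in `demo_dataset`: the first as a normal row, the second as an error row.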

Fill in the TODO.


In [None]:
dataset = []

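for u in urls:
    try:
        # TODO: analyze the page and store the resulting row
        row = analyze_page(u)
        dataset.append(row)
    except Exception as e:
        # A failed page still gets a row, so no URL silently disappears
        dataset.append({"url": u, "error": str(e)})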

dataset

## Step 4 — Quick checks

Answer the following:
- How many rows succeeded?
- Which pages mention “privacy”?
- Which pages mention “cookies”?

Fill in the TODOs.


In [None]:
# TODO: how many rows have an 'error' key?
num_errors = sum(1 for row in dataset if "error" in row)

# TODO: count mentions of privacy/cookies/security
# Errored rows lack these keys, so .get() returns None and they are not counted.
privacy_count = sum(1 for row in dataset if row.get("mentions_privacy") is True)
cookies_count = sum(1 for row in dataset if row.get("mentions_cookies") is True)
security_count = sum(1 for row in dataset if row.get("mentions_security") is True)

print("Total rows:", len(dataset))
print("Errors:", num_errors)
print("Mentions privacy:", privacy_count)
print("Mentions cookies:", cookies_count)
print("Mentions security:", security_count)
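
The counts say how many pages match, but the questions in this step also ask which ones. A possible way to list them is sketched below. `rows` is a hypothetical sample; in the notebook you would loop over your own `dataset` instead.

```python
# Hypothetical sample rows, for illustration only.
rows = [
    {"url": "https://example.org/a", "mentions_privacy": True, "mentions_cookies": False},
    {"url": "https://example.org/b", "mentions_privacy": True, "mentions_cookies": True},
    {"url": "https://example.org/c", "error": "timeout"},
]

# Drop error rows first so the keyword fields are guaranteed to exist.
ok_rows = [row for row in rows if "error" not in row]

privacy_urls = [row["url"] for row in ok_rows if row["mentions_privacy"]]
cookie_urls = [row["url"] for row in ok_rows if row["mentions_cookies"]]

print("Succeeded:", len(ok_rows))
print("Mention privacy:", privacy_urls)
print("Mention cookies:", cookie_urls)
```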

## Step 5 — Reflection (research lens)

Write 3–5 bullet points answering:
- Which pages were easiest/hardest to scrape and why?
- Does keyword presence (“cookie”, “privacy”) capture what you care about in human factors research?
- What signals might be more meaningful (e.g., readability, presence of opt-out instructions, headings like “Your choices”)?
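
As a starting point for the last question, here is a rough sketch of three alternative signals computed from plain page text. `simple_signals` is a hypothetical helper. The sentence splitter is deliberately crude, and the "your choices" check looks at the whole text, not only at headings.

```python
import re

def simple_signals(text: str) -> dict:
    """Compute a few rough signals from plain page text (hypothetical helper)."""
    words = text.split()
    # Split on sentence-ending punctuation; crude, but enough for a rough average.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    return {
        # Longer sentences are a loose proxy for harder-to-read text.
        "avg_words_per_sentence": len(words) / len(sentences) if sentences else 0.0,
        # Matches "opt out", "opt-out", and "optout", in any letter case.
        "has_opt_out": bool(re.search(r"\bopt[\s-]?out\b", text, re.IGNORECASE)),
        "has_your_choices": "your choices" in text.lower(),
    }

sample_text = (
    "Your choices. You can opt out of analytics cookies at any time. "
    "We store data securely."
)
signals = simple_signals(sample_text)
print(signals)
```

None of these is a validated readability or usability measure. They only illustrate signals that go beyond keyword presence.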

(Write your answers below in Markdown.)


### My reflection

- 
- 
- 


## Optional stretch (if you finish early)

Add one more field to each row:
- `num_words`: count words in the page text

Hint:
```python
words = soup.get_text(" ", strip=True).split()
num_words = len(words)
```
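
To show where the new field would go, the sketch below applies the same `split()` idea to a plain string and attaches the result to a row. `count_words`, `page_text`, and the row contents are hypothetical. In `analyze_page` the text would come from `soup.get_text(" ", strip=True)`.

```python
def count_words(text: str) -> int:
    """Count whitespace-separated tokens; split() with no argument collapses runs of whitespace."""
    return len(text.split())

# Hypothetical page text with irregular spacing and a line break.
page_text = "We  use cookies\nto keep you signed in."

row = {"url": "https://example.org/cookies", "title": "Cookies"}
row["num_words"] = count_words(page_text)

print(row["num_words"])
```

This counts punctuation attached to a word (such as `in.`) as part of that word, which is usually fine for a rough page-length measure.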
