# Week 06 — Text Signals for Human Factors: Readability + Sentiment (light)

**Time budget:** ~2 hours  
**Goal:** Compute basic text metrics (length, readability approximation) and visualize distributions.

**Theme (PhD focus):** Human factors of privacy & security — scraping public pages (privacy policies, cookie notices, security help pages, standards/regulator guidance) and extracting *UX-relevant* signals.

---


## Deliverables
- A completed notebook with working code
- A dataset variable (`rows` or `df`) saved to disk (CSV/JSON depending on week)
- A 3–5 bullet reflection grounded in human factors/privacy-security research


## Responsible scraping note (important)
We will only scrape **public pages** and keep the volume small.
- Prefer a few pages, not thousands
- Respect robots.txt/Terms of Service when you scale later
- Avoid collecting personal data
- Add delays for politeness when doing multi-page work
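
The robots.txt point above can be checked programmatically with the standard library. The following is a minimal sketch, not a required part of the notebook: the robots.txt content, the `example.org` URLs, and the `hf-course-bot` agent name are all hypothetical, and the rules are inlined so the cell runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

USER_AGENT = "hf-course-bot"  # hypothetical agent name for illustration


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(SAMPLE_ROBOTS_TXT)

# Paths outside /private/ are allowed; paths under it are not.
allowed_public = parser.can_fetch(USER_AGENT, "https://example.org/privacy/")
allowed_private = parser.can_fetch(USER_AGENT, "https://example.org/private/notes")

# Crawl-delay (in seconds) requested by the site, or None if absent.
delay = parser.crawl_delay(USER_AGENT)
```

For a live site you would instead call `set_url(...)` followed by `read()` on the parser, then use `can_fetch` before each request and sleep for at least `delay` seconds between requests.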


## Step 0 — Imports

In [None]:
import re
import time
import json
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

import pandas as pd
import matplotlib.pyplot as plt

## Step 1 — Scrape text + compute metrics for 3–5 pages

In [None]:
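def basic_text_metrics(text: str) -> dict:
    """Return word count, sentence count, and mean sentence length in words."""
    words = text.split()
    num_words = len(words)
    # Split after sentence-ending punctuation. This is a rough heuristic:
    # abbreviations such as "e.g." or "U.S." are counted as sentence breaks.
    sentences = re.split(r"[.!?]+\s+", text)
    sentences = [s for s in sentences if s.strip()]
    avg_sentence_len = (num_words / len(sentences)) if sentences else None
    return {"num_words": num_words, "num_sentences": len(sentences), "avg_sentence_len": avg_sentence_len}

def scrape_text_metrics(url: str) -> dict:
    """Fetch one page and return its URL, HTTP status, and text metrics."""
    r = requests.get(url, timeout=20)
    soup = BeautifulSoup(r.text, "html.parser")
    # Explicitly drop <script>/<style> elements; some bs4 versions and
    # parsers include their contents in get_text(), inflating the counts.
    for tag in soup(["script", "style"]):
        tag.decompose()
    text = soup.get_text(" ", strip=True)
    # The status is recorded so that error pages (non-200) can be filtered out later.
    row = {"url": url, "status": r.status_code}
    row.update(basic_text_metrics(text))
    return row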
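
The goal mentions a readability approximation, which the metrics above only hint at through sentence length. One common option is Flesch Reading Ease, `206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)`. The sketch below is an illustration rather than the course's required method: `count_syllables` and `flesch_reading_ease` are hypothetical helper names, and the syllable counter is a crude vowel-group heuristic, so scores are approximate.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic: count vowel groups, discounting a silent trailing 'e'."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return 0
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str):
    """Approximate Flesch Reading Ease; higher scores mean easier text."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    if not words or not sentences:
        return None
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


easy = flesch_reading_ease("The cat sat on the mat.")
hard = flesch_reading_ease(
    "Notwithstanding applicable jurisdictional requirements, organizations "
    "processing personal information must implement appropriate technical safeguards."
)
```

To use it in the notebook, you could add `"flesch": flesch_reading_ease(text)` to the row built in `scrape_text_metrics`. Expect legal and policy text to score noticeably lower than everyday prose.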

In [None]:
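urls = [
    "https://www.mozilla.org/en-US/privacy/",
    "https://www.nist.gov/privacy-framework",
    # Add one to three more public pages to reach the 3–5 page target.
]
rows = []
for u in urls:
    rows.append(scrape_text_metrics(u))
    time.sleep(1)  # polite delay between requests
df = pd.DataFrame(rows)
df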
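
The deliverables ask for the dataset to be saved to disk. A minimal sketch using only the standard library follows; the file names, the `save_rows` helper, and the sample rows are hypothetical, and the sample rows merely mimic the shape of what `scrape_text_metrics` returns.

```python
import csv
import json
import tempfile
from pathlib import Path

# Hypothetical sample rows shaped like the output of scrape_text_metrics.
sample_rows = [
    {"url": "https://example.org/a", "status": 200, "num_words": 120,
     "num_sentences": 8, "avg_sentence_len": 15.0},
    {"url": "https://example.org/b", "status": 200, "num_words": 300,
     "num_sentences": 12, "avg_sentence_len": 25.0},
]


def save_rows(rows, out_dir):
    """Write rows (a list of dicts with identical keys) as CSV and JSON."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    csv_path = out_dir / "week06_text_metrics.csv"
    json_path = out_dir / "week06_text_metrics.json"
    with csv_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)
    json_path.write_text(json.dumps(rows, indent=2), encoding="utf-8")
    return csv_path, json_path


# A temporary directory keeps the example self-contained.
out_dir = tempfile.mkdtemp()
csv_path, json_path = save_rows(sample_rows, out_dir)
```

In the notebook itself you would pass the real `rows` and a project folder. With pandas, `df.to_csv("week06_text_metrics.csv", index=False)` achieves the same CSV output in one line.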

In [None]:
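# Label bars by URL rather than by the default integer index.
df.set_index("url")["num_words"].plot(kind="bar", title="Word count by page")
plt.xticks(rotation=45, ha="right")
plt.ylabel("Words")
plt.tight_layout()
plt.show()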
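
A bar chart of totals says little about the distribution within a page. As one possible extension, the sketch below summarizes per-sentence word counts for a single text. The helper names, the sample text, and the `long_threshold` cutoff are assumptions chosen for illustration, and the sentence splitter is the same rough regex used in `basic_text_metrics`.

```python
import re
import statistics


def sentence_lengths(text: str) -> list:
    """Word count of each sentence, using the same rough splitter as above."""
    sentences = [s for s in re.split(r"[.!?]+\s+", text) if s.strip()]
    return [len(s.split()) for s in sentences]


def summarize_lengths(lengths, long_threshold=25):
    """Median, maximum, and share of sentences longer than the threshold."""
    if not lengths:
        return None
    return {
        "median": statistics.median(lengths),
        "max": max(lengths),
        "share_long": sum(n > long_threshold for n in lengths) / len(lengths),
    }


# Hypothetical sample text; in the notebook this would be a scraped page.
sample = "We collect data. We share it with partners when you agree. You can opt out."
lengths = sentence_lengths(sample)
summary = summarize_lengths(lengths, long_threshold=5)
```

Passing `lengths` to `plt.hist(lengths, bins=20)` would show the distribution directly. A long right tail usually points to a few very long sentences that carry most of the reading burden.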

## Reflection: readability as a UX issue

- Does length (word count, average sentence length) correlate with comprehensibility, or can a short page still be hard to read?
- What other measures would you want (Flesch Reading Ease, jargon count, etc.)?
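
As a starting point for the jargon-count idea, here is a minimal sketch that counts occurrences of a fixed term list per 100 words. The term list and the `jargon_density` helper are hypothetical and purely illustrative. A real study would need to justify its lexicon, for example by deriving it from regulation text or from user testing.

```python
import re

# Hypothetical term list for illustration only.
JARGON_TERMS = frozenset({
    "processor",
    "controller",
    "pseudonymization",
    "legitimate interest",
    "third party",
})


def jargon_density(text: str, terms=JARGON_TERMS) -> dict:
    """Count whole-word matches of each term and normalize per 100 words."""
    lowered = text.lower()
    num_words = len(re.findall(r"[a-z']+", lowered))
    hits = sum(
        len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
        for term in terms
    )
    density = (100 * hits / num_words) if num_words else None
    return {"jargon_hits": hits, "jargon_per_100_words": density}


sample = (
    "The controller may share data with a third party. "
    "The processor acts for the controller."
)
result = jargon_density(sample)
```

Normalizing per 100 words lets you compare pages of different lengths, which matters given how much policy pages vary in size.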
