# Week 06 - Text Signals for Human Factors: Readability + Sentiment (light)

**Time budget:** ~2 hours  
**Goal:** Compute basic text metrics (length, readability approximation) and visualize distributions.

**Theme (PhD focus):** Human factors of privacy & security — scraping public pages (privacy policies, cookie notices, security help pages, standards/regulator guidance) and extracting *UX-relevant* signals.

---


## Responsible scraping note (important)
We will only scrape **public pages** and keep the volume small.
- Prefer a few pages, not thousands
- Respect robots.txt/Terms of Service when you scale later
- Avoid collecting personal data
- Add delays for politeness when doing multi-page work
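
As a minimal sketch of the robots.txt point, the standard library's `urllib.robotparser` can test a URL against a set of rules. The robots.txt text, the `week06-notebook` user agent, and the `example.org` URLs below are made-up placeholders so the cell runs offline; in real use you would read the site's actual robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access
ROBOTS_TXT = """\
User-agent: *
Disallow: /account/
"""

def is_allowed(robots_txt: str, url: str, user_agent: str = "week06-notebook") -> bool:
    """Return True if the given robots.txt text permits fetching the URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

# Placeholder URLs: one public policy page, one page under a disallowed path
candidates = [
    "https://example.org/privacy/",
    "https://example.org/account/settings",
]
allowed = [url for url in candidates if is_allowed(ROBOTS_TXT, url)]
print(allowed)
```

Only the URLs that pass this check would be handed to the scraping code later in the notebook.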


## Setup
We'll use `requests` + `BeautifulSoup`. Install if needed:

```bash
pip install requests beautifulsoup4 pandas matplotlib
```


In [None]:
import re
import time
import json
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

import pandas as pd
import matplotlib.pyplot as plt

## Text metrics that matter for human factors
How a policy is written affects whether people can understand it. Simple proxies:
- total word count
- average sentence length (rough)
- readability approximations (very rough)

We won't overclaim: these are *signals*, not truth.
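
One common readability approximation is the Flesch Reading Ease score, which combines words per sentence with syllables per word (higher scores suggest easier text). The sketch below is one possible implementation, not the only one: the helper names are ours, the syllable counter is a crude vowel-group heuristic, and the formula was calibrated on English prose, so treat the numbers as relative rather than exact.

```python
import re
from typing import Optional

def count_syllables(word: str) -> int:
    """Very rough estimate: count groups of consecutive vowels, at least 1."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_reading_ease(text: str) -> Optional[float]:
    """Flesch Reading Ease; returns None for empty text."""
    words = text.split()
    sentences = [s for s in re.split(r"[.!?]+\s+", text) if s.strip()]
    if not words or not sentences:
        return None
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# Two made-up snippets: plain wording vs. dense legal-style wording
easy = "We use cookies. You can opt out."
hard = ("Notwithstanding aforementioned provisions, organizational representatives "
        "periodically disseminate personally identifiable information.")
print(flesch_reading_ease(easy), flesch_reading_ease(hard))
```

The useful comparison is between pages (or between versions of the same page), not the absolute value of any single score.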


### 🧠 Concept: Text as Data (Tokenization)

Computers don't read. They count.

To analyze text, we chop it up into pieces called **Tokens**.
- **Sentence Tokenization**: Splitting after sentence-ending punctuation (`.`, `!`, `?`).
- **Word Tokenization**: Splitting on whitespace.

**Why?**
Long sentences are harder to read. By counting words per sentence, we get a rough proxy for *cognitive load*.

In [None]:
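def basic_text_metrics(text: str) -> dict:
    # Word tokens: split on any run of whitespace
    words = text.split()
    num_words = len(words)
    # Sentence tokens: split after ., ! or ? followed by whitespace (rough)
    sentences = re.split(r"[.!?]+\s+", text)
    sentences = [s for s in sentences if s.strip()]
    # Empty input has no sentences, so return None instead of dividing by zero
    avg_sentence_len = (num_words / len(sentences)) if sentences else None
    return {"num_words": num_words, "num_sentences": len(sentences), "avg_sentence_len": avg_sentence_len}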

sample = "We use cookies. You can opt out in settings. We retain data for 30 days."
basic_text_metrics(sample)
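
To see why these counts are only signals, the sketch below applies the same splitting rule to a made-up snippet containing an abbreviation. The `split_sentences` helper, the text, and the `example.org` address are illustrative assumptions.

```python
import re

def split_sentences(text: str) -> list[str]:
    """Same rough rule as basic_text_metrics: split after ., ! or ? plus whitespace."""
    return [s for s in re.split(r"[.!?]+\s+", text) if s.strip()]

# Hypothetical policy text with an abbreviation and an email address
tricky = "We share data with partners, e.g. advertisers. Contact privacy@example.org for details."
parts = split_sentences(tricky)
print(len(parts))
for part in parts:
    print(repr(part))
```

A human reader would count two sentences here, but the rule yields three because it also splits after "e.g.". The dot inside the email address is not followed by whitespace, so it does not trigger a split. On text with many abbreviations, average sentence length will therefore be underestimated.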

In [None]:
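# Example: integrate metrics into a scrape
def scrape_text(url: str) -> dict:
    r = requests.get(url, timeout=20)
    soup = BeautifulSoup(r.text, "html.parser")
    # Remove script/style/noscript elements so embedded code is not counted as words
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    text = soup.get_text(" ", strip=True)
    # Keep the HTTP status so failed or blocked pages are easy to spot later
    row = {"url": url, "status": r.status_code}
    row.update(basic_text_metrics(text))
    return row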

In [None]:
df = pd.DataFrame([
    scrape_text("https://www.mozilla.org/en-US/privacy/"),
])
df.head()
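
A distribution needs more than one page. The sketch below shows one way to loop over several URLs with a delay between requests and without stopping on a single failure. `collect_rows` and `fake_scrape` are hypothetical names, and `fake_scrape` is an offline stand-in for `scrape_text` so the cell runs without network access; the `example.org` URLs are placeholders.

```python
import time
from typing import Callable, Iterable

def collect_rows(urls: Iterable[str], scrape: Callable[[str], dict],
                 delay_s: float = 2.0) -> list:
    """Scrape each URL in turn, pausing between requests and recording failures."""
    rows = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_s)  # politeness delay between consecutive requests
        try:
            rows.append(scrape(url))
        except Exception as exc:
            # Keep going if one page fails; store the error for later inspection
            rows.append({"url": url, "error": str(exc)})
    return rows

def fake_scrape(url: str) -> dict:
    """Offline stand-in for scrape_text (illustrative values only)."""
    if "broken" in url:
        raise ValueError("simulated failure")
    return {"url": url, "status": 200, "num_words": 100}

rows = collect_rows(
    ["https://example.org/a", "https://example.org/broken"],
    fake_scrape,
    delay_s=0.0,  # no waiting needed for the offline demo
)
print(rows)
```

In the notebook you would pass `scrape_text` and your own short list of public URLs, then build the table with `pd.DataFrame(collect_rows(urls, scrape_text))`.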

In [None]:
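# With a single page this histogram is one bar; add more URLs above to see a distribution
df["num_words"].plot(kind="hist", title="Word count")
plt.xlabel("Words per page")
plt.show()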