# Week 07 — Requests Session + Caching + Reproducibility

**Time budget:** ~2 hours  
**Goal:** Use sessions, headers, caching strategy, and reproducible runs; introduce pathlib and configs.

**Theme (PhD focus):** Human factors of privacy & security — scraping public pages (privacy policies, cookie notices, security help pages, standards/regulator guidance) and extracting *UX-relevant* signals.

---


## Deliverables
- A completed notebook with working code
- A dataset variable (`rows` or `df`) saved to disk (CSV/JSON depending on week)
- A 3–5 bullet reflection grounded in human factors/privacy-security research


## Responsible scraping note (important)
We will only scrape **public pages** and keep the volume small.
- Prefer a few pages, not thousands
- Respect robots.txt and Terms of Service, especially when you scale up later
- Avoid collecting personal data
- Add delays for politeness when doing multi-page work
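
The robots.txt and delay points above can be sketched with the standard library's `urllib.robotparser`. This is a minimal illustration, not a full politeness policy: the robots.txt content, the `example.org` URLs, and the helper names `build_robot_parser` and `polite_pause` are assumptions made for the example. Parsing text you already hold keeps the sketch offline; against a real site you would download its robots.txt first.

```python
import time
from urllib.robotparser import RobotFileParser

USER_AGENT = "HF-PrivacyScraper/0.1"  # same string the session in Step 1 sends

def build_robot_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

def polite_pause(seconds: float = 1.0) -> None:
    """Sleep between requests; the one-second default is an assumption."""
    time.sleep(seconds)

# Hypothetical robots.txt content, for illustration only
EXAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

parser = build_robot_parser(EXAMPLE_ROBOTS)
allowed = parser.can_fetch(USER_AGENT, "https://example.org/privacy/")
blocked = parser.can_fetch(USER_AGENT, "https://example.org/private/report")
print("privacy page allowed:", allowed)
print("private page allowed:", blocked)
```

In a multi-page loop, you would check `can_fetch` before each request and call `polite_pause()` between requests.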


## Step 0 — Imports

In [None]:
import re
import time
import json
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

import pandas as pd
import matplotlib.pyplot as plt

## Step 1 — Implement caching and demonstrate it works

In [None]:
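import hashlib
from pathlib import Path

CACHE_DIR = Path("cache_html")
CACHE_DIR.mkdir(exist_ok=True)

# One shared session reuses connections and sends the same headers on every request.
session = requests.Session()
session.headers.update({"User-Agent": "HF-PrivacyScraper/0.1"})

def cache_key(url: str) -> str:
    # Readable slug plus a short hash of the full URL, so two URLs that
    # produce the same truncated slug still map to different cache files.
    slug = re.sub(r"[^a-zA-Z0-9]+", "_", url)[:120]
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:10]
    return f"{slug}_{digest}.html"

def fetch_with_cache(url: str, use_cache: bool = True) -> str:
    path = CACHE_DIR / cache_key(url)
    # Cache hit: return the saved copy without touching the network.
    if use_cache and path.exists():
        return path.read_text(encoding="utf-8", errors="ignore")
    # Cache miss (or use_cache=False): fetch, fail loudly on HTTP errors, then save.
    r = session.get(url, timeout=20)
    r.raise_for_status()
    path.write_text(r.text, encoding="utf-8")
    return r.text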

In [None]:
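url = "https://www.mozilla.org/en-US/privacy/"
t0 = time.perf_counter()
html1 = fetch_with_cache(url, use_cache=True)  # first run: fetched over the network, then saved
t1 = time.perf_counter()
html2 = fetch_with_cache(url, use_cache=True)  # read back from the cache file
t2 = time.perf_counter()
print("Same length:", len(html1) == len(html2))
print(f"First call: {t1 - t0:.3f}s, second call: {t2 - t1:.3f}s")
print("Cache files:", len(list(CACHE_DIR.glob("*.html"))))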
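
The goal for this week also mentions configs and reproducible runs. One way to approach that, sketched here under assumptions, is to keep the run settings in a small dataclass, save them as JSON next to the data, and record a content hash for each cached page. The names `RunConfig`, `save_config`, `load_config`, and `content_digest`, the field names, and the `run_config.json` filename are all hypothetical; the default values simply mirror the settings used in the cells above.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from pathlib import Path

@dataclass(frozen=True)
class RunConfig:
    # Field names and defaults are illustrative assumptions
    cache_dir: str = "cache_html"
    user_agent: str = "HF-PrivacyScraper/0.1"
    timeout_s: int = 20
    delay_s: float = 1.0

def save_config(config: RunConfig, path: Path) -> None:
    """Write the settings as sorted, indented JSON so diffs stay readable."""
    text = json.dumps(asdict(config), indent=2, sort_keys=True)
    path.write_text(text, encoding="utf-8")

def load_config(path: Path) -> RunConfig:
    """Rebuild the settings object from a saved JSON file."""
    return RunConfig(**json.loads(path.read_text(encoding="utf-8")))

def content_digest(html: str) -> str:
    """SHA-256 of the page text, to detect whether a cached page has changed."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()

config = RunConfig()
config_path = Path("run_config.json")
save_config(config, config_path)
restored = load_config(config_path)
print("Config round-trips:", restored == config)
print("Digest length:", len(content_digest("<html></html>")))
```

Saving the config file alongside the CSV/JSON output lets a later reader see which settings produced the dataset.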

## Reflection: reproducibility

- Why is caching important for research reproducibility?
- What are the risks (stale cache) and how would you mitigate them?
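
As one possible mitigation to compare against your own answer, a cache file can be treated as stale once it passes a maximum age. The sketch below is an assumption-laden illustration: `is_stale`, `MAX_AGE_S`, and the one-week window are hypothetical choices, and the file's modification time stands in for the fetch time.

```python
import tempfile
import time
from pathlib import Path
from typing import Optional

MAX_AGE_S = 7 * 24 * 3600  # assumed one-week freshness window

def is_stale(path: Path, max_age_s: float = MAX_AGE_S,
             now: Optional[float] = None) -> bool:
    """Return True when the file is missing or older than max_age_s seconds."""
    if not path.exists():
        return True
    current = time.time() if now is None else now
    return (current - path.stat().st_mtime) > max_age_s

# Demonstrate with a temporary file instead of the real cache directory
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "page.html"
    page.write_text("<html></html>", encoding="utf-8")
    written_at = page.stat().st_mtime
    fresh = is_stale(page, now=written_at + 60)
    expired = is_stale(page, now=written_at + MAX_AGE_S + 1)
    missing = is_stale(Path(tmp) / "absent.html")

print("Stale after one minute:", fresh)
print("Stale after the window:", expired)
print("Stale when missing:", missing)
```

A fetch function could call such a check before reading from the cache and refetch when it returns `True`. For strict reproducibility you might instead keep the old copy and save the new one under a dated name.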
