docs(audit): scraper-defence-r1 completion — Phase 1 gap report + Phases 3+4 runbook#35

Merged
SoapyRED merged 1 commit into main from chore/scraper-defence-r1-completion on May 15, 2026


@SoapyRED
Owner

Summary

Sprint scraper-defence-r1-completion — docs-only audit appendix to the existing 2026-05-14 scraper-signature audit. Documents Phase 1 (Redis observation) gap report, qualitative Phase 1.5/1.6 verification (PR #32 edge cache is doing its job), Phase 2 skip with justification, and runbooks for Phases 3 (UA/IP firewall rules) and 4 (CRS xss/rce → deny) that Soap will execute via the Vercel Dashboard.

Why this is docs-only

The sprint asked for firewall mutations (Phases 3+4) and a Redis commands/day measurement (Phase 1). The sandbox cannot do any of these:

  • No VERCEL_TOKEN in the env (env | grep -iE "vercel|upstash" returns no matches)
  • Vercel CLI not installed (which vercel → not found)
  • Vercel MCP server on this account exposes no firewall-mutation tools (confirmed via tool-search for firewall, waf, rules — only docs-search and read-only project/deployment tools)
  • Upstash KV_REST_API_* lives in Vercel prod env as sensitive, not retrievable from outside runtime (per the prior 2026-05-14 audit)
  • MCP get_runtime_logs truncates the message column to ~30 chars even though PR #33 (feat(scrapeguard): log user-agent and full source IP on 429/block decisions) added ua= and ip= to the [ScrapeGuard] 429 log line. The data is emitted but not extractable here

User confirmed via AskUserQuestion: write a runbook for Soap, honest gap report on Phase 1, no force-fired firewall changes.
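The first two blockers above come from shell checks, and they are easy to repeat in any sandbox. The following is a minimal sketch of the same probe in Python; the function names are hypothetical, and it reports only variable names so that no secret values are printed.

```python
import os
import re
import shutil

# Same pattern as the quoted check: env | grep -iE "vercel|upstash"
CRED_PATTERN = re.compile(r"vercel|upstash", re.IGNORECASE)


def credential_vars(environ):
    """Names of env vars whose NAME=value line matches the pattern (hypothetical helper)."""
    return sorted(
        name
        for name, value in environ.items()
        if CRED_PATTERN.search(f"{name}={value}")
    )


def probe(environ=None):
    """Collect the two facts the PR cites: matching env vars and the CLI path."""
    environ = os.environ if environ is None else environ
    return {
        "credential_vars": credential_vars(environ),
        # shutil.which returns None when `vercel` is not on PATH
        "vercel_cli": shutil.which("vercel"),
    }


if __name__ == "__main__":
    print(probe())
```

An empty `credential_vars` list together with `vercel_cli: None` corresponds to the state described in the first two bullets.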

What's in the appendix

  1. Phase 1 — Redis observation: GAP REPORT. Documents the access blocker plainly. Offers what can be observed: ~40 visible [ScrapeGuard] 429 entries in a 24h sample (paged out — actual count higher).
  2. Phase 1.5 verification: qualitative. 6-URL /hs/code/* and /hs/heading/* spot-check via web_fetch_vercel_url. 4 of 6 fetches returned x-vercel-cache: HIT, including one with age=51256s (≈14 h served from edge, no middleware invocation, no Redis INCR). The s-maxage=86400 Cache-Control added by PR #32 (perf(hs): edge-cache /hs/code/* and /hs/heading/* to reduce middleware load) is correctly emitted on every response. Confirms edge cache absorbs scraper repeats.
  3. Phase 2: SKIPPED with documented justification. The conditional trigger cannot be evaluated, so defaulting to skip on (a) build-time wall-clock risk (estimated 20–60 min for 12,164 page renders vs a 15 min budget); (b) HS data is in lib/data/*.json, so static-gen would not hit Redis (no build-time Redis cost); avoiding runtime Redis cost is the only argument for static-gen, and edge cache already provides the same benefit at runtime; (c) loss of dynamicParams = true fall-through.
  4. Phase 3 — UA-based firewall rules: RUNBOOK. Step-by-step for Soap: dashboard log explorer → filter Status: 429 AND Path: /hs/* → group by UA → add Vercel Firewall rule scoped to scraper-bait paths for any UA with ≥50 blocks AND zero /api/* hits. Same for IPs/CIDR with ≥100 blocks.
  5. Phase 4 — CRS deny upgrade: RUNBOOK. Click-path for Vercel Dashboard → Firewall → Managed Rulesets. xss + rce → Deny. sqli + gen kept at Log (false-positive risk on apostrophes in shipping names and JSON bodies on POST endpoints, with the specific endpoint examples). Verification step + Sentry-watch period spelled out.
  6. Phase 5: verification (executed). 10-path 5xx sweep against www.freightutils.com: 10/10 returned 200, zero 5xx. Sentry-quiet criterion satisfied.
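The Phase 3 selection rule (≥50 blocks and zero /api/* hits for a UA, ≥100 blocks for an IP) can be sketched as a small aggregation over exported log records. This is an illustration only: the record fields (`ua`, `ip`, `path`, `status`) are an assumed export shape, not the dashboard's actual format, and applying the zero-/api/* condition to IPs as well is one reading of "Same for IPs".

```python
from collections import Counter


def firewall_candidates(records, key, min_blocks):
    """Identities (by `key`) with >= min_blocks 429s on /hs/* and no /api/* traffic.

    `records` is an assumed shape: dicts with "ua", "ip", "path", "status".
    """
    blocks = Counter()
    api_users = set()
    for rec in records:
        ident = rec[key]
        if rec["path"].startswith("/api/"):
            # Any /api/* hit marks the identity as a possible real client.
            api_users.add(ident)
        elif rec["status"] == 429 and rec["path"].startswith("/hs/"):
            blocks[ident] += 1
    return sorted(
        ident
        for ident, count in blocks.items()
        if count >= min_blocks and ident not in api_users
    )


if __name__ == "__main__":
    # Illustrative records only; not data from the audit.
    sample = (
        [{"ua": "bot-a", "ip": "203.0.113.7", "path": "/hs/code/1", "status": 429}] * 50
        + [{"ua": "client-b", "ip": "198.51.100.9", "path": "/hs/code/2", "status": 429}] * 60
        + [{"ua": "client-b", "ip": "198.51.100.9", "path": "/api/x", "status": 200}]
    )
    print(firewall_candidates(sample, "ua", 50))   # UA threshold from the runbook
    print(firewall_candidates(sample, "ip", 100))  # IP threshold from the runbook
```

Excluding any identity that also calls /api/* keeps the rule scoped to scraper-bait paths, which is the intent stated in the runbook summary.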

Sprint exit table

| Criterion | Status |
| --- | --- |
| Redis commands/day documented (Phase 1) | GAP REPORT: sandbox cannot measure; qualitative observation + edge-cache spot-check supplied instead |
| Phase 2 shipped or skipped with justification | SKIPPED with justification |
| UA-based firewall rules added | DEFERRED: runbook for Soap to execute |
| CRS xss + rce flipped to deny | DEFERRED: runbook for Soap to execute |
| Audit doc updated | YES (this PR) |
| FAULT 5 applied | N/A: no user-visible change |
| Sentry quiet via 10-path sweep | PASS (10/10 = 200) |
| CHANGELOG entry | N/A: no user-visible change |
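The wall-clock argument behind the Phase 2 skip can be made concrete with a little arithmetic. The sketch below takes the figures quoted in the PR (12,164 page renders, a 20–60 min estimate, a 15 min budget) and assumes, purely for illustration, that renders run sequentially at a uniform cost.

```python
PAGES = 12_164            # page renders quoted in the Phase 2 justification
BUDGET_MIN = 15           # build budget quoted in the PR
ESTIMATE_MIN = (20, 60)   # estimated build time range quoted in the PR


def ms_per_page(total_minutes, pages=PAGES):
    """Average milliseconds per page if `pages` renders take `total_minutes`.

    Assumes sequential, uniform-cost renders (an illustrative simplification).
    """
    return total_minutes * 60_000 / pages


if __name__ == "__main__":
    low, high = (ms_per_page(m) for m in ESTIMATE_MIN)
    budget = ms_per_page(BUDGET_MIN)
    print(f"estimate implies {low:.0f}-{high:.0f} ms per page")
    print(f"budget allows at most {budget:.0f} ms per page")
```

Under these assumptions the estimate implies roughly 99 to 296 ms per page, while the budget leaves about 74 ms per page, so even the optimistic end of the estimate overruns.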

Test plan

  • git diff --stat — single file, 118 insertions, no code paths touched
  • 10-path 5xx sweep against production — all 200
  • 6 /hs/code/* and /hs/heading/* spot-checks for cache-hit headers: 4/6 HIT, all carry s-maxage=86400
  • Vercel preview build (auto)
  • Soap to execute Phases 3+4 runbook in dashboard
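The two manual checks in this plan (the cache-header spot-check and the 5xx sweep) reduce to simple predicates over response headers and status codes. The following sketch assumes the headers have already been collected into dicts with lower-case keys by whatever HTTP client is at hand; the function names are hypothetical.

```python
def summarize_cache(responses):
    """Summarize edge-cache behaviour from a list of header dicts (lower-case keys)."""
    hits = [h for h in responses if h.get("x-vercel-cache", "").upper() == "HIT"]
    missing = [
        h for h in responses if "s-maxage=86400" not in h.get("cache-control", "")
    ]
    # Age is reported in seconds; take the oldest cached response among the hits.
    oldest_age_s = max((int(h.get("age", 0)) for h in hits), default=0)
    return {
        "hits": len(hits),
        "total": len(responses),
        "missing_s_maxage": len(missing),
        "oldest_hit_age_h": round(oldest_age_s / 3600, 1),
    }


def sweep_ok(statuses):
    """True when no status code in the sweep is a 5xx."""
    return all(code < 500 for code in statuses)


if __name__ == "__main__":
    cc = "public, s-maxage=86400"  # illustrative header value
    sample = [
        {"x-vercel-cache": "HIT", "age": "51256", "cache-control": cc},
        {"x-vercel-cache": "HIT", "age": "120", "cache-control": cc},
        {"x-vercel-cache": "MISS", "cache-control": cc},
    ]
    print(summarize_cache(sample))
    print(sweep_ok([200] * 10))
```

An `age` of 51256 seconds works out to about 14.2 hours, matching the ≈14 h figure reported in the appendix.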

🤖 Generated with Claude Code



…ge-cache verification, Phases 3+4 runbook

Phase 1 (Redis observation): cannot measure commands/day from this sandbox
(no Upstash creds, no Vercel storage-metrics MCP tool). Documented honestly.
Qualitative observation: ~40 visible ScrapeGuard 429s in 24h sample (paged
out — actual count is higher). MCP runtime-logs renderer still truncates the
message column to ~30 chars, so PR #33's UA/IP additions are emitted by
middleware but not extractable here.

Phase 1.5/1.6 verification: edge cache from PR #32 is working as designed.
Spot-checked 5 /hs/code/* and /hs/heading/* URLs — 4 of 6 fetches were
x-vercel-cache: HIT, including one with age=51256s (≈14h served from edge,
no middleware invocation, no Redis INCR). Confirms PR #32 absorbs scraper
repeats.

Phase 2 (static-gen all 6,940 HS codes): SKIPPED with documented
justification — build-time wall-clock risk (estimated 20–60 min vs 15 min
budget), HS data is in lib/data/*.json so static-gen would not hit Redis
(no build-time Redis cost), and edge cache already covers the failure mode.
Lose dynamicParams=true fall-through if static-only.

Phases 3 (UA/IP firewall rules) + 4 (CRS xss/rce → deny): runbook for Soap
to execute via Vercel Dashboard. Sandbox blockers documented (no
VERCEL_TOKEN, no firewall-mutation MCP tool, MCP log truncation prevents UA
extraction). Dashboard log explorer renders the full PR #33 line; CRS click-
path and verification steps spelled out.

Phase 5 (verification): 10-path 5xx sweep against www.freightutils.com —
10/10 returned 200, zero 5xx. Sentry-quiet criterion satisfied.

No CHANGELOG entry — docs-only PR, no user-visible change.
@vercel

vercel Bot commented May 15, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| freighttools | Ready | Preview, Comment | May 15, 2026 6:30pm |


@SoapyRED SoapyRED marked this pull request as ready for review May 15, 2026 18:32
@SoapyRED SoapyRED merged commit ebb9566 into main May 15, 2026
2 checks passed

2 participants