docs(audit): scraper-defence-r1 completion — Phase 1 gap report + Phases 3+4 runbook #35
Merged
…ge-cache verification, Phases 3+4 runbook

Phase 1 (Redis observation): cannot measure commands/day from this sandbox (no Upstash creds, no Vercel storage-metrics MCP tool). Documented honestly. Qualitative observation: ~40 visible ScrapeGuard 429s in a 24h sample (paged out, so the actual count is higher). The MCP runtime-logs renderer still truncates the message column to ~30 chars, so PR #33's UA/IP additions are emitted by middleware but not extractable here.

Phase 1.5/1.6 verification: edge cache from PR #32 is working as designed. Spot-checked 5 /hs/code/* and /hs/heading/* URLs: 4 of 6 fetches were x-vercel-cache: HIT, including one with age=51256s (≈14h served from edge, no middleware invocation, no Redis INCR). Confirms PR #32 absorbs scraper repeats.

Phase 2 (static-gen all 6,940 HS codes): SKIPPED with documented justification: build-time wall-clock risk (estimated 20–60 min vs 15 min budget), HS data is in lib/data/*.json so static-gen would not hit Redis (no build-time Redis cost), and edge cache already covers the failure mode. Going static-only would also lose the dynamicParams=true fall-through.

Phases 3 (UA/IP firewall rules) + 4 (CRS xss/rce → deny): runbook for Soap to execute via the Vercel Dashboard. Sandbox blockers documented (no VERCEL_TOKEN, no firewall-mutation MCP tool, MCP log truncation prevents UA extraction). The Dashboard log explorer renders the full PR #33 line; CRS click-path and verification steps are spelled out.

Phase 5 (verification): 10-path 5xx sweep against www.freightutils.com: 10/10 returned 200, zero 5xx. Sentry-quiet criterion satisfied.

No CHANGELOG entry: docs-only PR, no user-visible change.
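The Phase 1.5/1.6 spot-check above boils down to reading three response headers per fetch. A minimal sketch of that check follows, assuming the headers are already available as a plain mapping; the function name `check_edge_cache`, the `EdgeCacheCheck` type, and the sample header values are hypothetical and not taken from the PR.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class EdgeCacheCheck:
    hit: bool                 # True when x-vercel-cache is HIT
    age_hours: float          # Age header converted to hours (0.0 if absent)
    s_maxage: Optional[int]   # s-maxage from Cache-Control, if present


def check_edge_cache(headers: Mapping[str, str]) -> EdgeCacheCheck:
    """Summarise the cache-related headers of one response (hypothetical helper)."""
    # HTTP header names are case-insensitive, so normalise before lookup.
    h = {k.lower(): v.strip() for k, v in headers.items()}
    hit = h.get("x-vercel-cache", "").upper() == "HIT"
    age_raw = h.get("age", "0")
    age_seconds = int(age_raw) if age_raw.isdigit() else 0
    s_maxage = None
    for directive in h.get("cache-control", "").split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "s-maxage" and value.isdigit():
            s_maxage = int(value)
    return EdgeCacheCheck(hit, age_seconds / 3600, s_maxage)


# Illustrative headers only: the age value is the one quoted in the report,
# the rest of the mapping is assumed.
sample = {"x-vercel-cache": "HIT", "age": "51256", "cache-control": "s-maxage=86400"}
result = check_edge_cache(sample)
print(result.hit, round(result.age_hours, 1), result.s_maxage)  # True 14.2 86400
```

An age of 51256 s is 51256 / 3600 ≈ 14.2 h, which is where the "≈14h served from edge" figure comes from.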
Summary
Sprint `scraper-defence-r1-completion`: docs-only audit appendix to the existing 2026-05-14 scraper-signature audit. Documents the Phase 1 (Redis observation) gap report, the qualitative Phase 1.5/1.6 verification (PR #32 edge cache is doing its job), the Phase 2 skip with justification, and runbooks for Phases 3 (UA/IP firewall rules) and 4 (CRS xss/rce → deny) that Soap will execute via the Vercel Dashboard.

Why this is docs-only
The sprint asked for firewall mutations (Phases 3+4) and a Redis commands/day measurement (Phase 1). The sandbox cannot do any of these:
- No `VERCEL_TOKEN` in the env (`env | grep -iE "vercel|upstash"` returns null)
- No `vercel` CLI (`which vercel` → not found)
- No firewall-mutation MCP tool ([…] `firewall`, `waf`, `rules`: only docs-search and read-only project/deployment tools)
- No Upstash creds: […] `sensitive`, not retrievable from outside runtime (per the prior 2026-05-14 audit)
- MCP `get_runtime_logs` truncates the message column to ~30 chars even though PR #33 (feat(scrapeguard): log user-agent and full source IP on 429/block decisions) added `ua=` and `ip=` to the `[ScrapeGuard] 429` log line; the data is emitted but not extractable here

User confirmed via `AskUserQuestion`: write a runbook for Soap, honest gap report on Phase 1, no force-fired firewall changes.

What's in the appendix
- Phase 1 (Redis observation): […] ~40 visible `[ScrapeGuard] 429` entries in a 24h sample (paged out; actual count higher).
- Phase 1.5/1.6: […] `/hs/code/*` and `/hs/heading/*` spot-check via `web_fetch_vercel_url`. 4 of 6 fetches returned `x-vercel-cache: HIT`, including one with `age=51256s` (≈14 h served from edge, no middleware invocation, no Redis INCR). The `s-maxage=86400` Cache-Control from PR #32 (perf(hs): edge-cache /hs/code/* and /hs/heading/* to reduce middleware load) is correctly emitted on every response. Confirms edge cache absorbs scraper repeats.
- Phase 2 (static-gen all 6,940 HS codes) skipped: (a) build-time wall-clock risk (estimated 20–60 min vs 15 min budget), (b) HS data is in `lib/data/*.json` so static-gen would not hit Redis (no build-time Redis cost is the only argument for static-gen, and edge cache already provides the same benefit at runtime), (c) loss of `dynamicParams = true` fall-through.
- Phase 3 runbook: […] `Status: 429 AND Path: /hs/*` → group by UA → add a Vercel Firewall rule scoped to scraper-bait paths for any UA with ≥50 blocks AND zero `/api/*` hits. Same for IPs/CIDR with ≥100 blocks.
- Phase 5 (verification): 10-path 5xx sweep against `www.freightutils.com`: 10/10 returned 200, zero 5xx. Sentry-quiet criterion satisfied.

Sprint exit table
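For reference, the Phase 3 selection rule from the appendix summary (group 429s on `/hs/*` by UA, keep UAs with ≥50 blocks and zero `/api/*` hits, same for IPs with ≥100 blocks) can be sketched as below. This is only an illustration: the `LogRow` shape and the `firewall_candidates` name are hypothetical, since the Dashboard export format is not described in the PR. It reads "same for IPs" as also requiring zero `/api/*` hits, which is an interpretation, and it does not aggregate IPs into CIDR ranges.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple, Tuple

UA_BLOCK_THRESHOLD = 50    # from the runbook: UA with >= 50 blocks
IP_BLOCK_THRESHOLD = 100   # from the runbook: IP with >= 100 blocks


class LogRow(NamedTuple):
    """Hypothetical shape of one exported log row (not the real export format)."""
    status: int
    path: str
    ua: str
    ip: str


def firewall_candidates(rows: Iterable[LogRow]) -> Tuple[List[str], List[str]]:
    """Return (user agents, IPs) that meet the runbook's block thresholds."""
    rows = list(rows)
    # Equivalent of the "Status: 429 AND Path: /hs/*" filter.
    blocked = [r for r in rows if r.status == 429 and r.path.startswith("/hs/")]
    ua_blocks = Counter(r.ua for r in blocked)
    ip_blocks = Counter(r.ip for r in blocked)
    # Anything that also hits /api/* is excluded, to avoid blocking API users.
    api_uas = {r.ua for r in rows if r.path.startswith("/api/")}
    api_ips = {r.ip for r in rows if r.path.startswith("/api/")}
    uas = sorted(ua for ua, n in ua_blocks.items()
                 if n >= UA_BLOCK_THRESHOLD and ua not in api_uas)
    ips = sorted(ip for ip, n in ip_blocks.items()
                 if n >= IP_BLOCK_THRESHOLD and ip not in api_ips)
    return uas, ips
```

The output is a review list for whoever creates the rules in the Dashboard, not something to apply automatically.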
Test plan
- `git diff --stat`: single file, 118 insertions, no code paths touched
- […] `/hs/code/*` spot-checks for cache-hit headers: 4/6 HIT, all carry `s-maxage=86400`

🤖 Generated with Claude Code