feat(scrapeguard): log user-agent and full source IP on 429/block decisions #33
Extends ScrapeGuard middleware to capture the User-Agent and full client IP on every block decision (429 path only, never on the success / cache-hit path). Unblocks evidence-based firewall additions: we can now correlate IP ranges with UA signatures before promoting a block to the Cloudflare WAF.

- middleware.ts: new getSanitisedUa() helper. Strips ASCII control chars (incl. \r\n\t and DEL) as a log-injection guard, replaces internal " with ' to keep the quoted ua="..." field parseable, truncates at 200 chars, returns the literal 'empty' for null / whitespace-only UAs.
- Both [ScrapeGuard] 429 warn sites (tryBulkRefScrape + handleScrapeProtection) now emit key=value pairs (path=, ip=, ua=, group=, limit=, resets=) for grep/awk parsing. Existing IP resolution (x-real-ip first, Vercel-trusted) unchanged; see existing getClientIp() comment.
- scripts/test-scrapeguard-ua-sanitiser.mjs: 20-assertion smoke test covering null/empty, plain UAs, CR/LF/tab/NUL/DEL stripping, quote escape, 200-char truncation, and full log-line shape including injection-attempt regression guard.
- CHANGELOG.md + lib/changelog-data.ts: 2026-05-14 Security entry.

Privacy: IP + UA logging on suspicious traffic falls under legitimate interest (UK GDPR Art. 6(1)(f)) for security purposes. Logs stay in Vercel's internal log stream, covered by the existing Vercel DPA. No log drain export to third parties in this change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
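The sanitiser rules listed in the commit message can be sketched roughly as follows. This is an illustration only, not the code in middleware.ts: the function name sanitiseUa, the MAX_UA_LENGTH constant, and the order of the steps are assumptions made for the example.

```typescript
// Hypothetical sketch of the rules described above; the real getSanitisedUa()
// in middleware.ts may order or implement these steps differently.
const MAX_UA_LENGTH = 200;

function sanitiseUa(raw: string | null | undefined): string {
  if (raw == null) return "empty";

  const cleaned = raw
    // Drop ASCII control characters (0x00-0x1F, which covers CR, LF, tab, NUL) and DEL (0x7F).
    .replace(/[\x00-\x1f\x7f]/g, "")
    // Swap double quotes for single quotes so a surrounding ua="..." field stays parseable.
    .replace(/"/g, "'");

  // Treat empty or whitespace-only results the same as a missing header.
  if (cleaned.trim() === "") return "empty";

  // Truncate last, so stripped characters do not count towards the cap.
  return cleaned.slice(0, MAX_UA_LENGTH);
}
```

Stripping CR/LF before anything else means an injected line break cannot start a forged log entry, which is the regression the smoke test is described as guarding against.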
Preview verification: PASS

Verification flow:
(MCP log viewer table-truncates the message column at

Smoke test: IP-capped on prod, not a regression

Production smoke from this dev machine returned 22 × 429 failures because my IP is at the 25/day anonymous cap on
Change surface is middleware logging only.
Will re-run smoke against production post-merge with a fresh IP window and document the result.
Post-merge production verification: PASS

Deploy:
Trigger: 15 GETs to
Log verification via Vercel MCP
Sentry-quiet 10-path 5xx sweep (post-merge ≤ 10 min): 0 × 5xx across 11 paths. Sentry quiet. ✓
Bumps Last-updated 9 May → 16 May. Captures the 17 PRs landed across 2026-05-13..2026-05-16 (PR #25 through PR #41) plus the 14 May infra changes that didn't have their own PR (Cloudflare disconnect, Upstash PAYG, IndexNow live).

Sections refreshed:
- Sprint cadence 13–16 May (new): full PR list with one-liner per PR.
- Platform: MCP v2.1.0 → v2.1.1; route count 36 → 38.
- Infrastructure changes (new): CF Workers disconnected 14 May, CF DNS-only / Vercel firewall is sole edge security, Upstash PAYG $20 cap, CLAUDE.md at root encodes FAULT 5 + FAULT 14, IndexNow workflow live.
- Data integrity status (new): table for ULD / Airlines / ADR / Containers / UN-LOCODE / HS / Vehicles / Customs-duty. ULD + Airlines + ADR verified: true; the other 5 verified: false pending allowlist extension (specific domains enumerated).
- Scraper defence status (new): PR #31 / #32 / #33 / #38 live, Phases 3+4 deferred to runbook, Phase 2 skipped.
- Edge firewall: scoped to Vercel-only (CF inert now).
- Distribution surfaces: table with current download counts, Smithery score, MCP Registry STALE flag, Glama description STALE flag.
- Weekly digest CLI (new): six FAULT 14 invariants summarised; points at scripts/weekly-digest/README.md for the full spec.
- Vercel Analytics: 30-day baseline updated (3,311 visitors / 6,070 PV / 69% bounce / SG 73%).
- First validated user signals: Tom (CEVA) preserved + Simon's team organic adoption added per 16 May report.
- What's blocked / What's next / Red flags: updated to reflect today's reality: vehicles+customs SHIPPED (#39 #40), weekly digest SHIPPED (#41), Make.com Town Hall 21 May 4PM BST queued, CEVA→WFS transition complete with week 2 of induction pending.
- Canonical references: added pointers to scripts/weekly-digest/ and the IndexNow workflow.

No CHANGELOG entry: internal doc, not user-visible. Per the prompt.

Co-authored-by: SoapyRED <soapyred@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Extends ScrapeGuard middleware to capture the User-Agent and full source IP on every block decision (429 path only — never on the success / cache-hit path). This unblocks evidence-based firewall rule additions: we can now correlate IP ranges with UA signatures before promoting a block from application-layer rate-limiting to Cloudflare WAF.
What changes
Before (existing block log):
After (new format):
- key=value pairs survive grep/awk parsing intact.
- ua= is quoted when present (ua="...") so UAs with spaces stay one field; the sanitiser strips ASCII control chars (incl. \r\n\t and DEL, as a log-injection guard), replaces internal " with ', and truncates at 200 chars.
- Null or whitespace-only UAs are logged as ua=empty (never ua=null).
- getClientIp() already returns the full client IP, and the Vercel-trust ordering (x-real-ip first) is documented in-source.

Block-only logging: the 2xx / cache-hit path doesn't log UA/IP, keeping log volume bounded and respecting privacy on non-suspicious traffic.
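To illustrate why the quoted ua="..." field stays a single field, here is a minimal parsing sketch. The field names (path=, ip=, ua=, group=, limit=, resets=) come from this PR; the parseBlockLine helper, the log prefix, and every sample value below are invented for the example and are not taken from real logs.

```typescript
// Hypothetical sketch: parses one key=value log line into a plain object.
type BlockFields = Record<string, string>;

function parseBlockLine(line: string): BlockFields {
  const fields: BlockFields = {};
  // A value is either a double-quoted string or a run of non-space characters.
  // Because the sanitiser replaces inner double quotes, [^"]* cannot end early.
  const pattern = /(\w+)=(?:"([^"]*)"|(\S+))/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(line)) !== null) {
    fields[match[1]] = match[2] !== undefined ? match[2] : match[3];
  }
  return fields;
}

// Invented sample line; 203.0.113.7 is a documentation-only address.
const sample =
  '[ScrapeGuard] 429 path=/hs/code/0101 ip=203.0.113.7 ' +
  'ua="python-requests/2.31.0 (test run)" group=anon limit=25 resets=3600';

const parsed = parseBlockLine(sample);
console.log(parsed.ip, parsed.ua);
```

Grouping parsed lines by ua and then by ip prefix is one way to do the IP-range / UA-signature correlation the summary mentions, though the PR itself only specifies the log format.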
Privacy note (UK GDPR)
- /dpa).
- /dpa sub-processor list before going live.

FAULT 5 (minimal, internal-only change)
- Security entry at top of entries[], renders on /changelog)
- withAuditRest / generateMetadata: N/A (middleware-only change)

Test plan
- npx tsc --noEmit: clean
- npm run lint: same pre-existing baseline (49 problems, 14 errors); zero new findings in middleware.ts or the new test script
- node scripts/test-scrapeguard-ua-sanitiser.mjs: 20/20 PASS (null/empty, plain curl + python UAs, CR/LF/tab/NUL/DEL stripping, quote escape, 200-char truncation, log-injection regression guard)
- npx next build: succeeds
- curl -H 'User-Agent: python-requests/2.31.0' hammering /hs/code/* and inspect logs via Vercel MCP get_runtime_logs for the new ua= + ip= fields. (Vercel preview is auth-walled; bypass via get_access_to_vercel_url.)

Out of scope (does not block this PR)
fix(scrapeguard): rate-limit Redis error logs to prevent log-storm) is currently OPEN with a failing Cloudflare Workers Build check. That PR is from a separate sprint and has its own exit criteria; flagged for follow-up.

🤖 Generated with Claude Code