feat(scrapeguard): log user-agent and full source IP on 429/block decisions by SoapyRED · Pull Request #33 · SoapyRED/freighttools

SoapyRED · 2026-05-14T20:01:53Z

Summary

Extends ScrapeGuard middleware to capture the User-Agent and full source IP on every block decision (429 path only — never on the success / cache-hit path). This unblocks evidence-based firewall rule additions: we can now correlate IP ranges with UA signatures before promoting a block from application-layer rate-limiting to Cloudflare WAF.

What changes

Before (existing block log):

[ScrapeGuard] 429 — IP: 216.244.66.231, path: /hs/code/0101, group: hs, limit: 10, resets: 2026-05-14T12:34:56Z

After (new format):

[ScrapeGuard] 429 path=/hs/code/0101 ip=216.244.66.231 ua="python-requests/2.31.0" group=hs limit=10 resets=2026-05-14T12:34:56Z
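
As a rough illustration of why the new format is easier to consume, a line like the one above can be split into fields with a single regular expression. This is a sketch only, not code from this PR: `parseBlockLog` is a hypothetical name, and the pattern assumes each value is either a bare token without spaces or a double-quoted string with no embedded double quote.

```typescript
// Hypothetical helper, not part of this PR: parses the key=value pairs of a
// [ScrapeGuard] 429 line into a plain object.
function parseBlockLog(line: string): Record<string, string> {
  const fields: Record<string, string> = {};
  // Matches key="quoted value" or key=bareToken.
  const pair = /(\w+)=(?:"([^"]*)"|(\S+))/g;
  let m: RegExpExecArray | null;
  while ((m = pair.exec(line)) !== null) {
    // Group 2 is the quoted form, group 3 the bare form.
    fields[m[1]] = m[2] !== undefined ? m[2] : m[3];
  }
  return fields;
}

const sample =
  '[ScrapeGuard] 429 path=/hs/code/0101 ip=216.244.66.231 ' +
  'ua="python-requests/2.31.0" group=hs limit=10 resets=2026-05-14T12:34:56Z';
const parsed = parseBlockLog(sample);
console.log(parsed.ip, parsed.ua);
```

Because the quoted form is matched as a whole, a user agent that contains spaces or its own `=` characters still comes back as one `ua` value.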

Space-separated key=value pairs survive grep/awk parsing intact.
ua= is quoted when present (ua="...") so UAs with spaces stay one field; sanitiser strips ASCII control chars (incl. \r\n\t and DEL — log-injection guard), replaces internal " with ', truncates at 200 chars.
Null / whitespace-only UAs render as ua=empty (never ua=null).
IP resolution unchanged — existing getClientIp() already returns the full client IP and the Vercel-trust ordering (x-real-ip first) is documented in-source.
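
The sanitiser rules listed above could be sketched roughly as follows. This is an assumption-laden illustration, not the `getSanitisedUa()` source: the name `sanitiseUa`, the constant `MAX_UA_LENGTH`, and the order of the steps (strip, replace quotes, trim, truncate) are all guesses based on the description.

```typescript
// Illustrative sketch only; the real getSanitisedUa() in middleware.ts may
// order these steps differently.
const MAX_UA_LENGTH = 200; // truncation limit stated in the PR description

function sanitiseUa(raw: string | null | undefined): string {
  if (raw == null) return "empty";
  const cleaned = raw
    .replace(/[\x00-\x1f\x7f]/g, "") // drop ASCII control chars (CR, LF, tab, NUL) and DEL
    .replace(/"/g, "'") // keep the quoted ua="..." field parseable
    .trim();
  if (cleaned === "") return "empty"; // whitespace-only counts as absent
  return cleaned.slice(0, MAX_UA_LENGTH);
}

console.log(sanitiseUa('evil"\r\n[ScrapeGuard] 429 fake'));
```

Stripping CR and LF before the value is written is what stops a crafted header from starting a forged second log line; the injected text stays inside the single quoted field.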

Block-only logging — the 2xx / cache-hit path doesn't log UA/IP, keeping log volume bounded and respecting privacy on non-suspicious traffic.
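
A minimal sketch of block-only logging, assuming a hypothetical `BlockDecision` shape whose field names mirror the log keys. Neither `formatBlockLog` nor `logLineFor` is a name from middleware.ts, and the UA is assumed to be sanitised before it reaches this point.

```typescript
// Hypothetical shape of a block decision; field names mirror the log keys.
interface BlockDecision {
  path: string;
  ip: string;
  ua: string; // already sanitised; the literal "empty" when absent
  group: string;
  limit: number;
  resets: string; // ISO 8601 timestamp
}

function formatBlockLog(d: BlockDecision): string {
  // Quoted when a UA is present, bare ua=empty otherwise.
  const ua = d.ua === "empty" ? "ua=empty" : `ua="${d.ua}"`;
  return `[ScrapeGuard] 429 path=${d.path} ip=${d.ip} ${ua} group=${d.group} limit=${d.limit} resets=${d.resets}`;
}

// Only the block path yields a line; any other status returns null so the
// caller writes nothing and UA/IP never reach the log.
function logLineFor(status: number, d: BlockDecision): string | null {
  return status === 429 ? formatBlockLog(d) : null;
}

const decision: BlockDecision = {
  path: "/hs/code/0101",
  ip: "216.244.66.231",
  ua: "python-requests/2.31.0",
  group: "hs",
  limit: 10,
  resets: "2026-05-14T12:34:56Z",
};
console.log(logLineFor(429, decision));
console.log(logLineFor(200, decision));
```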

Privacy note (UK GDPR)

Lawful basis for logging IP + UA on suspicious / rate-limited traffic: legitimate interest (Art. 6(1)(f)) — preventing abuse, fraud, and operational disruption is named in Recital 47 as a recognised legitimate interest.
No log drain change in this PR. UA + IP go into Vercel's internal log stream only, which is covered by the existing Vercel DPA (already listed as a sub-processor on /dpa).
A future PR adding a third-party log drain (Better Stack / Datadog / Logtail) must add that processor to the /dpa sub-processor list before going live.

FAULT 5 (minimal — internal-only change)

siteStats / sitemap / OpenAPI / api-docs / nav / homepage / footer — N/A (no user-visible endpoint or tool change)
CHANGELOG.md — YES (2026-05-14 Security entry)
lib/changelog-data.ts — YES (matching Security entry at top of entries[], renders on /changelog)
MCP registration / npm bump / Postman / 200-word page minimum / IndexNow — N/A
withAuditRest / generateMetadata — N/A (middleware-only change)

Test plan

npx tsc --noEmit — clean
npm run lint — same pre-existing baseline (49 problems, 14 errors); zero new findings in middleware.ts or the new test script
node scripts/test-scrapeguard-ua-sanitiser.mjs — 20/20 PASS (null/empty, plain curl + python UAs, CR/LF/tab/NUL/DEL stripping, quote escape, 200-char truncation, log-injection regression guard)
npx next build — succeeds
Preview verification — push triggers Vercel preview; force a 429 via curl -H 'User-Agent: python-requests/2.31.0' hammering /hs/code/* and inspect logs via Vercel MCP get_runtime_logs for the new ua= + ip= fields. (Vercel preview is auth-walled — bypass via get_access_to_vercel_url.)
Production verification (post-merge) — same hammer against production, confirm log line format via Vercel MCP runtime logs, document the verified line in a PR comment.
Sentry quiet — 10-path prod-curl 5xx sweep in the 10 min post-merge window.

Out of scope (does not block this PR)

PR #31 (fix(scrapeguard): rate-limit Redis error logs to prevent log-storm) is currently OPEN with a failing Cloudflare Workers Build check. That PR belongs to a separate sprint and has its own exit criteria; it is flagged for follow-up.

🤖 Generated with Claude Code

…isions

Extends ScrapeGuard middleware to capture the User-Agent and full client IP on every block decision (429 path only, never on the success / cache-hit path). Unblocks evidence-based firewall additions: we can now correlate IP ranges with UA signatures before promoting a block to the Cloudflare WAF.

- middleware.ts: new getSanitisedUa() helper. Strips ASCII control chars (incl. \r\n\t and DEL) as a log-injection guard, replaces internal " with ' to keep the quoted ua="..." field parseable, truncates at 200 chars, returns the literal 'empty' for null / whitespace-only UAs.
- Both [ScrapeGuard] 429 warn sites (tryBulkRefScrape + handleScrapeProtection) now emit key=value pairs (path=, ip=, ua=, group=, limit=, resets=) for grep/awk parsing. Existing IP resolution (x-real-ip first, Vercel-trusted) unchanged; see existing getClientIp() comment.
- scripts/test-scrapeguard-ua-sanitiser.mjs: 20-assertion smoke test covering null/empty, plain UAs, CR/LF/tab/NUL/DEL stripping, quote escape, 200-char truncation, and full log-line shape including injection-attempt regression guard.
- CHANGELOG.md + lib/changelog-data.ts: 2026-05-14 Security entry.

Privacy: IP + UA logging on suspicious traffic falls under legitimate interest (UK GDPR Art. 6(1)(f)) for security purposes. Logs stay in Vercel's internal log stream, covered by the existing Vercel DPA. No log drain export to third parties in this change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
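
For readers unfamiliar with the header ordering mentioned in the commit message, the following is a hedged sketch of an "x-real-ip first" lookup. It is not the `getClientIp()` source, which this PR leaves unchanged and does not show. The `resolveClientIp` name, the `x-forwarded-for` fallback, and the `"unknown"` placeholder are assumptions for illustration.

```typescript
// Minimal header accessor so the sketch does not depend on a runtime type.
interface HeaderReader {
  get(name: string): string | null;
}

// Assumed ordering: x-real-ip first (as the commit message states), then the
// first hop of x-forwarded-for as a hypothetical fallback, then a placeholder.
function resolveClientIp(headers: HeaderReader): string {
  const realIp = (headers.get("x-real-ip") || "").trim();
  if (realIp) return realIp;
  const forwarded = headers.get("x-forwarded-for");
  const firstHop = forwarded ? forwarded.split(",")[0].trim() : "";
  return firstHop || "unknown";
}

// Tiny in-memory reader for demonstration.
function readerFrom(entries: Record<string, string>): HeaderReader {
  return { get: (name) => (name in entries ? entries[name] : null) };
}

console.log(
  resolveClientIp(
    readerFrom({ "x-real-ip": "216.244.66.231", "x-forwarded-for": "10.0.0.1, 10.0.0.2" })
  )
);
```

The ordering matters because `x-forwarded-for` can carry client-supplied entries, so a value set by the trusted platform is preferred when it exists.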

vercel · 2026-05-14T20:01:59Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
freighttools	Ready	Preview, Comment	May 14, 2026 8:03pm

SoapyRED · 2026-05-14T20:11:07Z

Preview verification — PASS

Verification flow:

Bypass URL obtained via Vercel MCP get_access_to_vercel_url → cookie jar set
15 GETs to https://freighttools-git-feat-scrapegu-216330-mrcristoiu-5817s-projects.vercel.app/hs/code/0101010000 with User-Agent: python-requests/2.31.0 → first 10 returned 404 (no such HS code, but middleware fires first) and requests 11–15 returned 429 ✓
Additional 429s triggered with User-Agent: curl/8.0.0 and User-Agent; (curl empty-header syntax)
Vercel MCP get_runtime_logs confirmed all expected substrings present in the new log lines:
- query=ScrapeGuard → 5 hits
- query=python-requests → 5 hits (confirms ua="python-requests/2.31.0" field present)
- query=curl/8.0.0 → 1 hit (confirms ua="curl/8.0.0" field present)
- query=group=hs limit=10 → 3 hits (confirms full key=value tail of log line)
- query=ua= → 5 hits across all 429s

(MCP log viewer table-truncates the message column at [ScrapeGuard] 429 path=/hs/... but substring search hits confirm the full payload arrived in the log stream.)
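
The pattern seen in the flow above (ten 404s, then 429s for requests 11 to 15) is what a per-client counter with a limit of 10 would produce. A minimal in-memory sketch follows. The fixed-window strategy, the 60-second window, the key format, and the `FixedWindowLimiter` name are all assumptions; the store and window logic that ScrapeGuard actually uses are not shown in this PR.

```typescript
// In-memory fixed-window counter; purely illustrative.
class FixedWindowLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly counts = new Map<string, { windowStart: number; count: number }>();

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  // Returns true when the request is allowed, false when it should get a 429.
  allow(key: string, now: number): boolean {
    const entry = this.counts.get(key);
    if (!entry || now - entry.windowStart >= this.windowMs) {
      // First request, or the previous window has expired: start a new one.
      this.counts.set(key, { windowStart: now, count: 1 });
      return true;
    }
    entry.count += 1;
    return entry.count <= this.limit;
  }
}

// Replays the verification: 15 requests from one client inside one window.
const limiter = new FixedWindowLimiter(10, 60000);
const statuses: number[] = [];
for (let i = 0; i < 15; i++) {
  // 404 stands in for "passed the middleware, route found no such HS code".
  statuses.push(limiter.allow("hs:203.0.113.7", 1000 + i) ? 404 : 429);
}
console.log(statuses.join(" "));
```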

Smoke test — IP-capped on prod, not a regression

Production smoke from this dev machine returned 22 × 429 failures because my IP is at the 25/day anonymous cap on /api/* after the preview-test traffic above. The 23rd failure (/api/auth/whoami (valid key) → 401) is caused by a missing SMOKE_API_KEY env var, not a regression.

Change surface is middleware logging only — getSanitisedUa() is a pure function, log lines change format but no HTTP status / body / header behaviour changes. The 20/20 unit test in scripts/test-scrapeguard-ua-sanitiser.mjs covers the sanitiser behaviour (null, empty, control-char stripping, quote escape, 200-char truncation, log-injection regression guard).

Will re-run smoke against production post-merge with a fresh IP window and document the result.

SoapyRED · 2026-05-14T20:19:19Z

Post-merge production verification — PASS

Deploy: dpl_84PcG3PF38yfKjKtY6gDWCDLf6nj from cb7acd6cc6fc57035080ce0aada1803d7dde2a64 — READY at 2026-05-14T20:13:21Z, aliased to www.freightutils.com.

Trigger: 15 GETs to https://www.freightutils.com/hs/code/0101010099 with User-Agent: python-requests/2.31.0 → first 10 returned 404 (no such HS code, but middleware fires first) and requests 11–15 returned 429. ✓

Log verification via Vercel MCP get_runtime_logs (environment=production, since=3m):

query=python-requests → 3 matching log lines from /hs/code/0101010099 at 20:18:20–21 (confirms ua="python-requests/2.31.0" field present in production log payload).
Message column truncated by viewer at [ScrapeGuard] 429 path=/hs/... but substring search hits confirm the full new format reached the production log stream.

Sentry-quiet 5xx sweep (11 paths checked, post-merge ≤ 10 min):

/api/health                    HTTP 200
/api/tools                     HTTP 200
/api/openapi.json              HTTP 404   (path not served — not a regression)
/ldm                           HTTP 200
/cbm                           HTTP 200
/chargeable-weight             HTTP 200
/hs                            HTTP 200
/adr                           HTTP 200
/uld                           HTTP 200
/api/incoterms                 HTTP 429   (IP at daily anon cap from prior test traffic)
/sitemap.xml                   HTTP 200

0 × 5xx across 11 paths. Sentry quiet. ✓
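
The pass criterion for the sweep can be stated in a few lines: only 5xx responses count against it, so the 404 and 429 above are tolerated. The sketch below applies that rule to the statuses recorded in this comment. `SweepResult` and `summariseSweep` are hypothetical names, not part of the repository's tooling.

```typescript
interface SweepResult {
  path: string;
  status: number;
}

// Counts server errors only; 4xx responses (404, 429) are not treated as
// regressions for the purposes of this sweep.
function summariseSweep(results: SweepResult[]): { total: number; serverErrors: SweepResult[] } {
  const serverErrors = results.filter((r) => r.status >= 500 && r.status <= 599);
  return { total: results.length, serverErrors };
}

// Statuses copied from the sweep output above.
const observed: SweepResult[] = [
  { path: "/api/health", status: 200 },
  { path: "/api/tools", status: 200 },
  { path: "/api/openapi.json", status: 404 },
  { path: "/ldm", status: 200 },
  { path: "/cbm", status: 200 },
  { path: "/chargeable-weight", status: 200 },
  { path: "/hs", status: 200 },
  { path: "/adr", status: 200 },
  { path: "/uld", status: 200 },
  { path: "/api/incoterms", status: 429 },
  { path: "/sitemap.xml", status: 200 },
];

const summary = summariseSweep(observed);
console.log(`${summary.serverErrors.length} x 5xx across ${summary.total} paths`);
```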

Bumps Last-updated 9 May → 16 May. Captures the 17 PRs landed across 2026-05-13..2026-05-16 (PR #25 through PR #41) plus the 14 May infra changes that didn't have their own PR (Cloudflare disconnect, Upstash PAYG, IndexNow live).

Sections refreshed:
- Sprint cadence 13–16 May (new): full PR list with one-liner per PR.
- Platform: MCP v2.1.0 → v2.1.1; route count 36 → 38.
- Infrastructure changes (new): CF Workers disconnected 14 May, CF DNS-only / Vercel firewall is sole edge security, Upstash PAYG $20 cap, CLAUDE.md at root encodes FAULT 5 + FAULT 14, IndexNow workflow live.
- Data integrity status (new): table for ULD / Airlines / ADR / Containers / UN-LOCODE / HS / Vehicles / Customs-duty. ULD + Airlines + ADR verified: true; the other 5 verified: false pending allowlist extension (specific domains enumerated).
- Scraper defence status (new): PR #31 / #32 / #33 / #38 live, Phases 3+4 deferred to runbook, Phase 2 skipped.
- Edge firewall: scoped to Vercel-only (CF inert now).
- Distribution surfaces: table with current download counts, Smithery score, MCP Registry STALE flag, Glama description STALE flag.
- Weekly digest CLI (new): six FAULT 14 invariants summarised; points at scripts/weekly-digest/README.md for the full spec.
- Vercel Analytics: 30-day baseline updated (3,311 visitors / 6,070 PV / 69% bounce / SG 73%).
- First validated user signals: Tom (CEVA) preserved + Simon's team organic adoption added per 16 May report.
- What's blocked / What's next / Red flags: updated to reflect today's reality: vehicles+customs SHIPPED (#39 #40), weekly digest SHIPPED (#41), Make.com Town Hall 21 May 4PM BST queued, CEVA→WFS transition complete with week 2 of induction pending.
- Canonical references: added pointers to scripts/weekly-digest/ and the IndexNow workflow.

No CHANGELOG entry: internal doc, not user-visible. Per the prompt.

Co-authored-by: SoapyRED <soapyred@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel Bot deployed to Preview May 14, 2026 20:03

SoapyRED merged commit cb7acd6 into main May 14, 2026
2 checks passed

SoapyRED deleted the feat/scrapeguard-ua-and-ip-logging branch May 14, 2026 20:11
