
feat(scrapeguard): log user-agent and full source IP on 429/block decisions #33

Merged: SoapyRED merged 1 commit into main from feat/scrapeguard-ua-and-ip-logging on May 14, 2026


@SoapyRED
Owner

Summary

Extends ScrapeGuard middleware to capture the User-Agent and full source IP on every block decision (429 path only — never on the success / cache-hit path). This unblocks evidence-based firewall rule additions: we can now correlate IP ranges with UA signatures before promoting a block from application-layer rate-limiting to Cloudflare WAF.

What changes

Before (existing block log):

[ScrapeGuard] 429 — IP: 216.244.66.231, path: /hs/code/0101, group: hs, limit: 10, resets: 2026-05-14T12:34:56Z

After (new format):

[ScrapeGuard] 429 path=/hs/code/0101 ip=216.244.66.231 ua="python-requests/2.31.0" group=hs limit=10 resets=2026-05-14T12:34:56Z
  • Space-separated key=value pairs survive grep/awk parsing intact.
  • ua= is quoted when present (ua="...") so UAs with spaces stay one field; sanitiser strips ASCII control chars (incl. \r\n\t and DEL — log-injection guard), replaces internal " with ', truncates at 200 chars.
  • Null / whitespace-only UAs render as ua=empty (never ua=null).
  • IP resolution unchanged — existing getClientIp() already returns the full client IP and the Vercel-trust ordering (x-real-ip first) is documented in-source.
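
A minimal sketch of the sanitiser rules listed above, for readers who want to reason about edge cases. This is not the actual getSanitisedUa() from middleware.ts: the names sanitiseUa and formatUaField are hypothetical, and the order of operations (strip, replace quotes, trim, truncate) is an assumption.

```typescript
// Hypothetical sketch; NOT the real getSanitisedUa() from middleware.ts.
const UA_MAX_LENGTH = 200;
const EMPTY_UA = "empty";

function sanitiseUa(raw: string | null | undefined): string {
  if (raw == null) return EMPTY_UA;
  const cleaned = raw
    .replace(/[\x00-\x1f\x7f]/g, "") // strip ASCII control chars (CR, LF, tab, NUL, DEL, ...)
    .replace(/"/g, "'")              // keep the quoted ua="..." field parseable
    .trim()
    .slice(0, UA_MAX_LENGTH);        // truncate at 200 chars
  return cleaned === "" ? EMPTY_UA : cleaned;
}

function formatUaField(raw: string | null | undefined): string {
  const ua = sanitiseUa(raw);
  // Note: in this sketch a UA that is literally "empty" also renders unquoted.
  return ua === EMPTY_UA ? "ua=empty" : `ua="${ua}"`;
}
```

Because CR and LF are removed before the value is interpolated, a crafted UA cannot start a second, forged log line; it can only shorten its own field.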

Block-only logging — the 2xx / cache-hit path doesn't log UA/IP, keeping log volume bounded and respecting privacy on non-suspicious traffic.
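
To illustrate the block-only rule, here is a hedged sketch of a log-line builder that returns nothing on the allowed path. The BlockDecision type and blockLogLine name are hypothetical and do not come from middleware.ts; the sketch assumes the UA has already been through the sanitiser described above.

```typescript
// Hypothetical types and names; the real middleware internals are not shown in this PR.
interface BlockDecision {
  blocked: boolean;
  path: string;
  ip: string;
  ua: string;     // assumed already sanitised; "empty" means no usable UA
  group: string;
  limit: number;
  resets: string; // ISO 8601 timestamp
}

function blockLogLine(d: BlockDecision): string | null {
  // Success / cache-hit path: nothing is logged, so no UA or IP is recorded.
  if (!d.blocked) return null;
  const uaField = d.ua === "empty" ? "ua=empty" : `ua="${d.ua}"`;
  return `[ScrapeGuard] 429 path=${d.path} ip=${d.ip} ${uaField} ` +
    `group=${d.group} limit=${d.limit} resets=${d.resets}`;
}
```

Returning null rather than an empty string makes the "do not log" case explicit at the call site.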

Privacy note (UK GDPR)

  • Lawful basis for logging IP + UA on suspicious / rate-limited traffic: legitimate interest (Art. 6(1)(f)) — preventing abuse, fraud, and operational disruption is named in Recital 47 as a recognised legitimate interest.
  • No log drain change in this PR. UA + IP go into Vercel's internal log stream only, which is covered by the existing Vercel DPA (already listed as a sub-processor on /dpa).
  • A future PR adding a third-party log drain (Better Stack / Datadog / Logtail) must add that processor to the /dpa sub-processor list before going live.

FAULT 5 (minimal — internal-only change)

  • siteStats / sitemap / OpenAPI / api-docs / nav / homepage / footer — N/A (no user-visible endpoint or tool change)
  • CHANGELOG.md — YES (2026-05-14 Security entry)
  • lib/changelog-data.ts — YES (matching Security entry at top of entries[], renders on /changelog)
  • MCP registration / npm bump / Postman / 200-word page minimum / IndexNow — N/A
  • withAuditRest / generateMetadata — N/A (middleware-only change)

Test plan

  • npx tsc --noEmit — clean
  • npm run lint — same pre-existing baseline (49 problems, 14 errors); zero new findings in middleware.ts or the new test script
  • node scripts/test-scrapeguard-ua-sanitiser.mjs — 20/20 PASS (null/empty, plain curl + python UAs, CR/LF/tab/NUL/DEL stripping, quote escape, 200-char truncation, log-injection regression guard)
  • npx next build — succeeds
  • Preview verification — push triggers Vercel preview; force a 429 via curl -H 'User-Agent: python-requests/2.31.0' hammering /hs/code/* and inspect logs via Vercel MCP get_runtime_logs for the new ua= + ip= fields. (Vercel preview is auth-walled — bypass via get_access_to_vercel_url.)
  • Production verification (post-merge) — same hammer against production, confirm log line format via Vercel MCP runtime logs, document the verified line in a PR comment.
  • Sentry quiet — 10-path prod-curl 5xx sweep in the 10 min post-merge window.

Out of scope (does not block this PR)

🤖 Generated with Claude Code

…isions

Extends ScrapeGuard middleware to capture the User-Agent and full client IP
on every block decision (429 path only — never on the success / cache-hit
path). Unblocks evidence-based firewall additions: we can now correlate IP
ranges with UA signatures before promoting a block to the Cloudflare WAF.

- middleware.ts: new getSanitisedUa() helper. Strips ASCII control chars
  (incl. \r\n\t and DEL) as a log-injection guard, replaces internal " with
  ' to keep the quoted ua="..." field parseable, truncates at 200 chars,
  returns the literal 'empty' for null / whitespace-only UAs.
- Both [ScrapeGuard] 429 warn sites (tryBulkRefScrape + handleScrape-
  Protection) now emit key=value pairs (path=, ip=, ua=, group=, limit=,
  resets=) for grep/awk parsing. Existing IP resolution (x-real-ip first,
  Vercel-trusted) unchanged — see existing getClientIp() comment.
- scripts/test-scrapeguard-ua-sanitiser.mjs: 20-assertion smoke test
  covering null/empty, plain UAs, CR/LF/tab/NUL/DEL stripping, quote
  escape, 200-char truncation, and full log-line shape including
  injection-attempt regression guard.
- CHANGELOG.md + lib/changelog-data.ts: 2026-05-14 Security entry.

Privacy: IP + UA logging on suspicious traffic falls under legitimate
interest (UK GDPR Art. 6(1)(f)) for security purposes. Logs stay in
Vercel's internal log stream — covered by the existing Vercel DPA. No
log drain export to third parties in this change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vercel

vercel Bot commented May 14, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project: freighttools | Deployment: Ready | Actions: Preview, Comment | Updated (UTC): May 14, 2026 8:03pm


@SoapyRED
Owner Author

Preview verification — PASS

Verification flow:

  1. Bypass URL obtained via Vercel MCP get_access_to_vercel_url → cookie jar set
  2. 15 GETs to https://freighttools-git-feat-scrapegu-216330-mrcristoiu-5817s-projects.vercel.app/hs/code/0101010000 with User-Agent: python-requests/2.31.0 → first 10 returned 404 (no such HS code, but middleware fires first) and requests 11–15 returned 429 ✓
  3. Additional 429s triggered with User-Agent: curl/8.0.0 and User-Agent; (curl empty-header syntax)
  4. Vercel MCP get_runtime_logs confirmed all expected substrings present in the new log lines:
    • query=ScrapeGuard → 5 hits
    • query=python-requests → 5 hits (confirms ua="python-requests/2.31.0" field present)
    • query=curl/8.0.0 → 1 hit (confirms ua="curl/8.0.0" field present)
    • query=group=hs limit=10 → 3 hits (confirms full key=value tail of log line)
    • query=ua= → 5 hits across all 429s

(MCP log viewer table-truncates the message column at [ScrapeGuard] 429 path=/hs/... but substring search hits confirm the full payload arrived in the log stream.)
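
Since the viewer truncates the message column, a field-level check on an exported log line can complement the substring queries. The parser below is a rough sketch, not part of this PR; parseScrapeGuardLine is a hypothetical name, and it assumes the key=value format shown in the PR description with double quotes used only around the ua value.

```typescript
// Hypothetical helper; assumes the "[ScrapeGuard] 429 key=value ..." format from the PR description.
function parseScrapeGuardLine(line: string): Record<string, string> {
  const fields: Record<string, string> = {};
  // key="quoted value" or key=bare-value (bare values end at the next whitespace)
  const pattern = /(\w+)=(?:"([^"]*)"|(\S+))/g;
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(line)) !== null) {
    fields[m[1]] = m[2] !== undefined ? m[2] : m[3];
  }
  return fields;
}
```

The quoted alternative is tried first, so a UA containing spaces or an equals sign stays a single ua field.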

Smoke test — IP-capped on prod, not a regression

Production smoke from this dev machine returned 22 × 429 failures because my IP is at the 25/day anonymous cap on /api/* after the preview-test traffic above. The 23rd failure (/api/auth/whoami (valid key) → 401) is caused by a missing SMOKE_API_KEY env var, not a regression.

Change surface is middleware logging only: getSanitisedUa() is a pure function, and the log lines change format, but no HTTP status / body / header behaviour changes. The 20/20 unit test in scripts/test-scrapeguard-ua-sanitiser.mjs covers the sanitiser behaviour (null, empty, control-char stripping, quote escape, 200-char truncation, log-injection regression guard).

Will re-run smoke against production post-merge with a fresh IP window and document the result.

@SoapyRED SoapyRED merged commit cb7acd6 into main May 14, 2026
2 checks passed
@SoapyRED SoapyRED deleted the feat/scrapeguard-ua-and-ip-logging branch May 14, 2026 20:11
@SoapyRED
Owner Author

Post-merge production verification — PASS

Deploy: dpl_84PcG3PF38yfKjKtY6gDWCDLf6nj from cb7acd6cc6fc57035080ce0aada1803d7dde2a64 — READY at 2026-05-14T20:13:21Z, aliased to www.freightutils.com.

Trigger: 15 GETs to https://www.freightutils.com/hs/code/0101010099 with User-Agent: python-requests/2.31.0 → first 10 returned 404 (no such HS code, but middleware fires first) and requests 11–15 returned 429. ✓
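
The 10 × 404 then 5 × 429 pattern is what a limit of 10 per window would produce for a 15-request burst. The sketch below reproduces that arithmetic with a simple fixed-window counter. It is purely illustrative: the real ScrapeGuard algorithm, storage, and window length are not described in this PR, and FixedWindowLimiter, simulateBurst, and the 60-second window are all assumptions.

```typescript
// Hypothetical fixed-window limiter; NOT the real ScrapeGuard implementation.
class FixedWindowLimiter {
  private counts = new Map<string, { windowStart: number; count: number }>();
  private limit: number;
  private windowMs: number;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  /** Returns true when the request is allowed, false when it should get a 429. */
  allow(key: string, nowMs: number): boolean {
    const entry = this.counts.get(key);
    if (entry === undefined || nowMs - entry.windowStart >= this.windowMs) {
      this.counts.set(key, { windowStart: nowMs, count: 1 }); // start a new window
      return true;
    }
    entry.count += 1;
    return entry.count <= this.limit;
  }
}

function simulateBurst(requests: number, limit: number): number[] {
  const limiter = new FixedWindowLimiter(limit, 60000); // 60 s window is an assumption
  const statuses: number[] = [];
  for (let i = 0; i < requests; i++) {
    // 404 stands in for whatever the route returns once the limiter lets the request through.
    statuses.push(limiter.allow("216.244.66.231:hs", i) ? 404 : 429);
  }
  return statuses;
}
```

Calling simulateBurst(15, 10) yields ten 404s followed by five 429s, matching the trigger result above.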

Log verification via Vercel MCP get_runtime_logs (environment=production, since=3m):

  • query=python-requests → 3 matching log lines from /hs/code/0101010099 at 20:18:20–21 (confirms ua="python-requests/2.31.0" field present in production log payload).
  • Message column truncated by viewer at [ScrapeGuard] 429 path=/hs/... but substring search hits confirm the full new format reached the production log stream.

Sentry-quiet 5xx sweep (11 paths checked, post-merge ≤ 10 min):

/api/health                    HTTP 200
/api/tools                     HTTP 200
/api/openapi.json              HTTP 404   (path not served — not a regression)
/ldm                           HTTP 200
/cbm                           HTTP 200
/chargeable-weight             HTTP 200
/hs                            HTTP 200
/adr                           HTTP 200
/uld                           HTTP 200
/api/incoterms                 HTTP 429   (IP at daily anon cap from prior test traffic)
/sitemap.xml                   HTTP 200

0 × 5xx across 11 paths. Sentry quiet. ✓
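
For clarity on how the sweep is scored, here is a small sketch that counts only 5xx responses, so the 404 and 429 above do not count as failures. The summariseSweep helper and SweepResult type are hypothetical; the path/status pairs are copied from the listing above.

```typescript
// Hypothetical helper; the statuses below are copied from the sweep listing in this comment.
interface SweepResult {
  path: string;
  status: number;
}

function summariseSweep(results: SweepResult[]): { total: number; serverErrors: string[] } {
  // Only 5xx counts as a server-side failure; 404 and 429 are client-class responses.
  const serverErrors = results
    .filter((r) => r.status >= 500 && r.status <= 599)
    .map((r) => r.path);
  return { total: results.length, serverErrors };
}

const postMergeSweep: SweepResult[] = [
  { path: "/api/health", status: 200 },
  { path: "/api/tools", status: 200 },
  { path: "/api/openapi.json", status: 404 },
  { path: "/ldm", status: 200 },
  { path: "/cbm", status: 200 },
  { path: "/chargeable-weight", status: 200 },
  { path: "/hs", status: 200 },
  { path: "/adr", status: 200 },
  { path: "/uld", status: 200 },
  { path: "/api/incoterms", status: 429 },
  { path: "/sitemap.xml", status: 200 },
];

const sweepSummary = summariseSweep(postMergeSweep);
console.log(`${sweepSummary.serverErrors.length} x 5xx across ${sweepSummary.total} paths`);
```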

SoapyRED added a commit that referenced this pull request May 16, 2026
Bumps Last-updated 9 May → 16 May. Captures the 17 PRs landed across
2026-05-13..2026-05-16 (PR #25 through PR #41) plus the 14 May infra
changes that didn't have their own PR (Cloudflare disconnect, Upstash
PAYG, IndexNow live).

Sections refreshed:
- Sprint cadence 13–16 May (new): full PR list with one-liner per PR.
- Platform: MCP v2.1.0 → v2.1.1; route count 36 → 38.
- Infrastructure changes (new): CF Workers disconnected 14 May, CF DNS-
  only / Vercel firewall is sole edge security, Upstash PAYG $20 cap,
  CLAUDE.md at root encodes FAULT 5 + FAULT 14, IndexNow workflow live.
- Data integrity status (new): table for ULD / Airlines / ADR / Containers
  / UN-LOCODE / HS / Vehicles / Customs-duty. ULD + Airlines + ADR
  verified: true; the other 5 verified: false pending allowlist
  extension (specific domains enumerated).
- Scraper defence status (new): PR #31 / #32 / #33 / #38 live, Phases
  3+4 deferred to runbook, Phase 2 skipped.
- Edge firewall: scoped to Vercel-only (CF inert now).
- Distribution surfaces: table with current download counts, Smithery
  score, MCP Registry STALE flag, Glama description STALE flag.
- Weekly digest CLI (new): six FAULT 14 invariants summarised; points
  at scripts/weekly-digest/README.md for the full spec.
- Vercel Analytics: 30-day baseline updated (3,311 visitors / 6,070
  PV / 69% bounce / SG 73%).
- First validated user signals: Tom (CEVA) preserved + Simon's team
  organic adoption added per 16 May report.
- What's blocked / What's next / Red flags: updated to reflect today's
  reality — vehicles+customs SHIPPED (#39 #40), weekly digest SHIPPED
  (#41), Make.com Town Hall 21 May 4PM BST queued, CEVA→WFS transition
  complete with week 2 of induction pending.
- Canonical references: added pointers to scripts/weekly-digest/ and
  the IndexNow workflow.

No CHANGELOG entry — internal doc, not user-visible. Per the prompt.

Co-authored-by: SoapyRED <soapyred@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>