perf(hs): edge-cache /hs/code/* and /hs/heading/* to reduce middleware load#32

Merged
SoapyRED merged 3 commits into main from perf/hs-edge-cache-and-static-gen
May 14, 2026

@SoapyRED
Owner

Summary

Phase 0 + Phase 1 + Phase 1.5 of the scraper-defence-r1 sprint, landed together. Phase 1.6 / Phase 2 / Phase 3 are wall-clock blocked (24h + 48h windows) and become Soap-actionable follow-ups.

Phase 0 — scraper signature audit

See docs/audit/scraper-signature-2026-05-14.md (committed in this PR).

Headline finding: a sustained scraper is active. 7-day window pulled via Vercel MCP get_runtime_logs filtered to source=edge-middleware, query=ScrapeGuard, statusCode=429:

  • ≥100 [ScrapeGuard] 429 log lines captured (MCP query timed out before exhausting matches)

  • All from IPs in the 216.* /8 range

  • All on /hs/code/*, /hs/heading/*, /hs/chapter/*, /hs/section/*

  • Sustained ~5-second interval (scripted enumeration, not a real user)

  • Sample 10 minutes from window-close:

    19:12:45  GET /hs/code/843691     429
    19:12:35  GET /hs/heading/2003    429
    19:12:30  GET /hs/heading/2007    429
    19:12:25  GET /hs/heading/2829    429
    19:12:20  GET /hs/heading/0803    429
    19:12:10  GET /hs/section/xv      429
    ...
    

ScrapeGuard is doing its job — every request 429'd. The cost is the Redis INCR per request. At sustained ~5s/req from one source, that's ~17K wasted Redis commands/day.
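
The ~17K figure follows from the request interval. A minimal sketch of the arithmetic, assuming exactly one Redis INCR per rate-limited request and a constant 5-second interval (both are simplifications of the observed traffic; the function name is illustrative):

```typescript
// Assumption: one Redis INCR per request, constant interval between requests.
const SECONDS_PER_DAY = 24 * 60 * 60; // 86,400

// Estimated Redis commands per day generated by a single steady source.
function redisCommandsPerDay(intervalSeconds: number, commandsPerRequest = 1): number {
  if (intervalSeconds <= 0) {
    throw new RangeError("intervalSeconds must be positive");
  }
  return Math.floor(SECONDS_PER_DAY / intervalSeconds) * commandsPerRequest;
}

console.log(redisCommandsPerDay(5)); // 17280
```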

Vercel WAF events (/v1/security/firewall/events) returned 0 actions over the same 7-day window. The 9 existing firewall rules (Tencent AS132203, 43.172/16+43.173/16, AhrefsBot/SemrushBot/etc UA blocks, HS code rate-limit) did NOT match this scraper.

Data-access gaps documented honestly:

| Sprint request | Status |
| --- | --- |
| Top 20 UAs | ❌ middleware log line doesn't include UA; no raw access log accessible from sandbox |
| Top 20 source IPs / ASNs | ⚠️ partial: 216.* /8 confirmed, finer resolution truncated by MCP UI render |
| Country distribution | ❌ same blocker |
| Upstash Redis baseline | ❌ KV_* sensitive-typed in Vercel prod env, not retrievable from sandbox |

Probed (all returned 404): /v3/projects/.../logs, /v1/projects/.../runtime-logs, /v2/logs, /v9/logs, /v1/observability/access-logs, /v2/configurations/log-drains. /v2/integrations/log-drains returned 200 but empty (no drains configured). No vercel CLI in sandbox.

Phase 1 — firewall rules added: 0

The sprint brief's evidence thresholds (≥1000 reqs from one UA, ≥500 reqs from one ASN at ≥80% scraper-bait paths) cannot be evaluated given the data-access gaps above:

  • UA thresholds need UA strings → middleware doesn't log them
  • ASN thresholds need full IPs → MCP truncates IP at 216...
  • Adding a 216.0.0.0/8 block would cover ~16M IPs across many ARIN ASNs (residential, corporate, cloud) — too broad to be evidence-based
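
For scale, the size of an IPv4 CIDR block is 2^(32 − prefix length). A small sketch of that calculation (the helper name is illustrative, not something from the repo):

```typescript
// Number of IPv4 addresses covered by a CIDR prefix of the given length.
function ipv4AddressesInPrefix(prefixLength: number): number {
  if (!Number.isInteger(prefixLength) || prefixLength < 0 || prefixLength > 32) {
    throw new RangeError("prefixLength must be an integer between 0 and 32");
  }
  return 2 ** (32 - prefixLength);
}

console.log(ipv4AddressesInPrefix(8));  // 16777216 (the ~16M figure above)
console.log(ipv4AddressesInPrefix(16)); // 65536 (size of each existing /16 rule)
```

A /8 rule is therefore 256 times wider than either of the /16 ranges already in the firewall config.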

Honest call: stop short of speculative rules. The application-layer ScrapeGuard is catching the scraper. Phase 1.5 (below) reduces the cost of catching it rather than trying to add another firewall layer with insufficient signature data.

Phase 1.5 — next.config.ts cache headers + FAULT 5

Adds explicit Cache-Control to /hs/code/* and /hs/heading/* via next.config.ts headers():

public, max-age=300, s-maxage=86400, stale-while-revalidate=604800

Both routes already had export const revalidate = 86400 (ISR) — this surfaces the cache strategy at the config level so it's tunable + curl-visible.
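
A rough sketch of what such a headers() mapping might look like. The name hsPageCachedPaths comes from the Files list in this PR; the source patterns (`/hs/code/:code`, `/hs/heading/:heading`), the local `HeaderRule` type, and the helper `buildHsHeaderRules` are assumptions made for illustration, and the actual next.config.ts may differ:

```typescript
// Local stand-in for the shape Next.js expects from headers() (assumed here
// so the sketch compiles without the "next" package installed).
type HeaderRule = {
  source: string;
  headers: { key: string; value: string }[];
};

// Directive string taken verbatim from this PR.
const HS_CACHE_CONTROL =
  "public, max-age=300, s-maxage=86400, stale-while-revalidate=604800";

// Assumed route patterns; the real patterns are not shown in the PR text.
const hsPageCachedPaths: string[] = ["/hs/code/:code", "/hs/heading/:heading"];

// Map every cached path to one rule that sets Cache-Control.
function buildHsHeaderRules(): HeaderRule[] {
  return hsPageCachedPaths.map((source) => ({
    source,
    headers: [{ key: "Cache-Control", value: HS_CACHE_CONTROL }],
  }));
}

// In next.config.ts this object would be the default export.
const nextConfig = {
  async headers(): Promise<HeaderRule[]> {
    return buildHsHeaderRules();
  },
};

console.log(buildHsHeaderRules().map((rule) => rule.source));
```

Keeping the paths in one array means a further route can be added or removed in a single place.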

What this changes on the wire:

  • Cold serve: unchanged (middleware → page render → cached at edge)
  • Warm cache hit: response served from Vercel edge without re-rendering the page server component (React tree + lib/calculations/hs lookup skipped)
  • The application-layer ScrapeGuard rate limiter STILL runs on every request (Next.js middleware runs before cache lookup). Phase 1.6 measures whether page-render savings reduce overall Redis-INCR volume enough; if not, Phase 2 (generateStaticParams()) ships.
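
To illustrate why the counter cost is per request regardless of how the page is served, here is a minimal fixed-window limiter sketch. It is not the ScrapeGuard implementation: `CounterStore` is a hypothetical in-memory stand-in for Redis INCR, and the limit (3 per 60 s), key format, and the documentation-range IP are invented for the example.

```typescript
// Hypothetical in-memory stand-in for a Redis counter (INCR semantics only).
class CounterStore {
  private counts = new Map<string, number>();
  commands = 0; // how many INCR-equivalent calls were issued

  incr(key: string): number {
    this.commands += 1;
    const next = (this.counts.get(key) ?? 0) + 1;
    this.counts.set(key, next);
    return next;
  }
}

// Fixed-window limiter: one counter per (ip, window) pair.
function checkRateLimit(
  store: CounterStore,
  ip: string,
  nowMs: number,
  limit: number,
  windowMs: number
): { allowed: boolean; count: number } {
  const windowId = Math.floor(nowMs / windowMs);
  const count = store.incr(`ratelimit:${ip}:${windowId}`);
  return { allowed: count <= limit, count };
}

// Ten requests, 5 s apart, from one source (203.0.113.7 is a documentation IP).
const store = new CounterStore();
const statuses: number[] = [];
for (let i = 0; i < 10; i++) {
  const { allowed } = checkRateLimit(store, "203.0.113.7", i * 5000, 3, 60000);
  statuses.push(allowed ? 200 : 429);
}

console.log(statuses);       // [200, 200, 200, 429, 429, 429, 429, 429, 429, 429]
console.log(store.commands); // 10
```

Every request increments the counter, including the ones that end in a 429, which is why blocked scraper traffic still shows up as Redis command volume.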

FAULT 5 checklist — applied

| Check | Status |
| --- | --- |
| CHANGELOG.md entry under 2026-05-14 | ✅ added |
| lib/changelog-data.ts entry (renders on /changelog via monthGroups()) | ✅ added |
| /changelog page renders the entry | ✅ verified by reading the monthGroups() function: entries are sorted by isoDate and grouped by month; new May 14 entry surfaces under "May 2026" group |
| No new URLs (no sitemap or IndexNow action needed) | ✅ no new URLs |
| No siteStats.ts changes | ✅ none |
| No nav changes | ✅ none |
| No API contract changes | ✅ response bodies unchanged; only response headers added |

Smoke test posture

scripts/smoke-test.mjs will be run against the Vercel preview URL after this PR opens. The smoke covers API endpoints — it does NOT specifically assert Cache-Control on /hs/* pages, but it does assert no 5xx regression. Post-merge, the deliverable is a curl -I https://www.freightutils.com/hs/code/<sample> showing the new Cache-Control header.
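
Since the smoke test does not assert the header, a scripted equivalent of the curl -I check could look roughly like this. It is a sketch, not part of scripts/smoke-test.mjs; `parseCacheControl` and `missingDirectives` are hypothetical helpers, and the expected values are the directives listed in this PR.

```typescript
// Parse a Cache-Control header into directive -> value (true for bare flags).
function parseCacheControl(header: string): Map<string, string | true> {
  const directives = new Map<string, string | true>();
  for (const part of header.split(",")) {
    const trimmed = part.trim();
    if (trimmed === "") continue;
    const eq = trimmed.indexOf("=");
    if (eq === -1) {
      directives.set(trimmed.toLowerCase(), true);
    } else {
      directives.set(trimmed.slice(0, eq).trim().toLowerCase(), trimmed.slice(eq + 1).trim());
    }
  }
  return directives;
}

// Return the names of expected directives that are absent or have another value.
function missingDirectives(
  header: string | null,
  expected: Record<string, string | true>
): string[] {
  const actual = parseCacheControl(header ?? "");
  return Object.keys(expected).filter((name) => actual.get(name) !== expected[name]);
}

const expectedHsPolicy: Record<string, string | true> = {
  public: true,
  "max-age": "300",
  "s-maxage": "86400",
  "stale-while-revalidate": "604800",
};

const sample = "public, max-age=300, s-maxage=86400, stale-while-revalidate=604800";
console.log(missingDirectives(sample, expectedHsPolicy)); // []
```

Against a live URL the header would come from something like `(await fetch(url, { method: "HEAD" })).headers.get("cache-control")`. A CDN may consume or rewrite shared-cache directives before responding to the client, so the expected set may need to be relaxed to what is actually visible on the wire.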

Files

| File | Type |
| --- | --- |
| docs/audit/scraper-signature-2026-05-14.md | new (Phase 0 audit) |
| next.config.ts | extended (added hsPageCachedPaths mapping to Cache-Control) |
| CHANGELOG.md | new entry under 2026-05-14 |
| lib/changelog-data.ts | new entry at top of entries[] |

Hard-rule compliance

  • ✅ No force-push (PR merged via gh pr merge)
  • ✅ lib/data/test_airlines.js not touched (no modification needed this sprint)
  • ✅ Conventional commits (docs(audit): + perf(hs):)
  • ✅ TaskStop on every backgrounded shell (none spawned this sprint — all Bash calls were foreground)
  • ✅ No new data fabricated, no LLM-knowledge claims, no fictional sources

Soap to-do post-merge (time-blocked phases)

  • Phase 1.6 (24h observation window): Pull Upstash Redis commands/day from the Upstash or Vercel Storage dashboard 24h after this PR merges and compare it to the pre-merge baseline. If the average is below the 3K commands/day target, Phase 2 is OPTIONAL: skip it and proceed to Phase 3.
  • Phase 2 (conditional) — full generateStaticParams() for HS pages: Only fire if Phase 1.6 shows the cache-header path didn't get under 3K commands/day. Adds ~6,940 HS code params + ~5,224 heading params at build time. Watch the build time budget — abort if it crosses 20 minutes.
  • Phase 3 — 48h verification window: Pull Redis commands/day + Sentry quiet check + spot-check 5 /hs/code/* URLs for x-vercel-cache: HIT. Write a final status note to docs/audit/scraper-signature-2026-05-14.md with baseline vs post-fix percent reduction.
  • Unblock the Phase 0 data-access gaps for the next iteration: Soap can export the Vercel Dashboard log explorer (Chrome path — has full IP + UA + ASN columns) OR install a Log Drain to a 3rd-party (Better Stack, Logtail), OR extend middleware.ts ScrapeGuard's console.warn to log req.headers.get('user-agent') so the next sprint can do evidence-based UA blocking.
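
The third option in the last item (logging the user agent from middleware) could be sketched as below. The field set (IP, path, group, limit, reset) follows the audit's description of the current log line; the exact log format, the `ScrapeGuardEvent` shape, and the `formatScrapeGuardLog` helper are assumptions for illustration and do not reproduce middleware.ts.

```typescript
// Assumed shape of the data ScrapeGuard already has when it emits a 429.
interface ScrapeGuardEvent {
  ip: string;
  path: string;
  group: string;
  limit: number;
  resetSeconds: number;
}

// Structural type satisfied by the standard Headers object (req.headers).
interface HeaderReader {
  get(name: string): string | null;
}

// Build a single log line that also carries the user agent.
function formatScrapeGuardLog(event: ScrapeGuardEvent, headers: HeaderReader): string {
  const rawUa = headers.get("user-agent") ?? "unknown";
  // Neutralise line breaks and quotes, and cap the length, so one request
  // cannot forge or bloat log lines.
  const ua = rawUa.replace(/[\r\n\t"]/g, " ").slice(0, 200);
  return (
    `[ScrapeGuard] 429 ip=${event.ip} path=${event.path} group=${event.group} ` +
    `limit=${event.limit} reset=${event.resetSeconds}s ua="${ua}"`
  );
}

// Example with made-up values (203.0.113.7 is a documentation IP).
const exampleHeaders: HeaderReader = {
  get: (name) => (name.toLowerCase() === "user-agent" ? "ExampleBot/1.0" : null),
};
const exampleLine = formatScrapeGuardLog(
  { ip: "203.0.113.7", path: "/hs/code/843691", group: "hs", limit: 30, resetSeconds: 60 },
  exampleHeaders
);
console.warn(exampleLine);
```

Putting the UA at the end of the line keeps the existing fields in place for anything that already parses them.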

🤖 Generated with Claude Code

SoapyRED and others added 2 commits May 14, 2026 20:22
…ively

Phase 0 audit of the scraper-defence-r1 sprint. Pulled the last 7 days
of edge-middleware runtime logs via Vercel MCP get_runtime_logs filtered
to source=edge-middleware, query=ScrapeGuard, statusCode=429.

Headline finding: an active scraper is hammering /hs/code/*,
/hs/heading/*, /hs/chapter/*, /hs/section/* from IPs in the 216.* /8
range at sustained ~5-second intervals. ≥100 ScrapeGuard 429s captured
in the window (MCP query timed out before exhausting matches). All
requests caught by the application-layer rate limiter — none reached
the page handlers. The Vercel WAF events endpoint returned 0 actions
over the same window: the WAF-layer rules in the firewall config did
NOT match this scraper.

Data-access gaps documented honestly:
  - MCP UI truncates the message column at "IP: 216..." — can't
    extract the next 3 octets for ASN resolution.
  - middleware.ts ScrapeGuard log line records IP/path/group/limit/
    reset but NOT user-agent — top-N UA aggregation impossible from
    application logs alone.
  - No raw Vercel access log endpoint accessible from this sandbox
    (probed /v3/projects/.../logs, /v1/runtime-logs, /v2/logs,
    /v9/logs, /v1/observability/access-logs — all 404). No Vercel
    CLI installed. No Log Drain configured (verified empty).
  - Upstash Redis commands/day baseline not accessible — KV_* are
    sensitive-typed in Vercel prod env per the 9 May env audit.

Phase 1 conclusion: 0 firewall rules added this sprint. The sprint
brief's evidence thresholds (≥1000 requests from a single UA, ≥500
requests from a confirmed proxy-provider ASN) cannot be evaluated
without the truncated log fields. Blocking all of 216.0.0.0/8 would cover
~16M IPs across many ARIN-region ASNs, too broad for an evidence-based
rule.

Phase 1.5 (cache headers) ships in the same PR as the actual
intervention — see next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e load

Adds explicit Cache-Control headers in next.config.ts headers() for the
two scraper-target HS page routes:

  public, max-age=300, s-maxage=86400, stale-while-revalidate=604800

Both page routes already had `export const revalidate = 86400` (ISR)
in place; this commit makes the cache strategy explicit at the config
level so it's tunable, visible in the response headers (debuggable
via `curl -I`), and not buried in per-page constants.

What this changes on the wire:
  - Cold serve: unchanged — middleware fires, page renders, response cached
  - Warm cache hit: response served from Vercel edge without re-rendering
    the page server component (the React tree + the lib/calculations/hs
    data lookup are skipped)
  - The application-layer ScrapeGuard rate limiter in middleware.ts
    STILL runs on every request (legitimate or scraper) — Next.js
    middleware runs before cache lookup. Phase 1.6 measures whether the
    page-render savings reduce overall Redis-INCR volume enough; if not,
    Phase 2 (full generateStaticParams() at build time) ships.

Sourced by docs/audit/scraper-signature-2026-05-14.md: a 216.* scraper
is hitting these paths at sustained ~5-second intervals. Each repeat
hit is now cheap at the edge.

FAULT 5 (user-visible change checklist):
  - CHANGELOG.md entry added under 2026-05-14
  - lib/changelog-data.ts entry added (renders on /changelog page via
    monthGroups() — verified by reading the function)
  - No new URLs (no sitemap or IndexNow action needed)
  - No siteStats.ts changes, no nav changes, no API contract changes

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vercel

vercel Bot commented May 14, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| freighttools | Ready | Preview, Comment | May 14, 2026 7:33pm |

@SoapyRED SoapyRED merged commit d32c6b8 into main May 14, 2026
@SoapyRED SoapyRED deleted the perf/hs-edge-cache-and-static-gen branch May 14, 2026 19:31
SoapyRED added a commit that referenced this pull request May 16, 2026
Bumps Last-updated 9 May → 16 May. Captures the 17 PRs landed across
2026-05-13..2026-05-16 (PR #25 through PR #41) plus the 14 May infra
changes that didn't have their own PR (Cloudflare disconnect, Upstash
PAYG, IndexNow live).

Sections refreshed:
- Sprint cadence 13–16 May (new): full PR list with one-liner per PR.
- Platform: MCP v2.1.0 → v2.1.1; route count 36 → 38.
- Infrastructure changes (new): CF Workers disconnected 14 May, CF DNS-
  only / Vercel firewall is sole edge security, Upstash PAYG $20 cap,
  CLAUDE.md at root encodes FAULT 5 + FAULT 14, IndexNow workflow live.
- Data integrity status (new): table for ULD / Airlines / ADR / Containers
  / UN-LOCODE / HS / Vehicles / Customs-duty. ULD + Airlines + ADR
  verified: true; the other 5 verified: false pending allowlist
  extension (specific domains enumerated).
- Scraper defence status (new): PR #31 / #32 / #33 / #38 live, Phases
  3+4 deferred to runbook, Phase 2 skipped.
- Edge firewall: scoped to Vercel-only (CF inert now).
- Distribution surfaces: table with current download counts, Smithery
  score, MCP Registry STALE flag, Glama description STALE flag.
- Weekly digest CLI (new): six FAULT 14 invariants summarised; points
  at scripts/weekly-digest/README.md for the full spec.
- Vercel Analytics: 30-day baseline updated (3,311 visitors / 6,070
  PV / 69% bounce / SG 73%).
- First validated user signals: Tom (CEVA) preserved + Simon's team
  organic adoption added per 16 May report.
- What's blocked / What's next / Red flags: updated to reflect today's
  reality — vehicles+customs SHIPPED (#39 #40), weekly digest SHIPPED
  (#41), Make.com Town Hall 21 May 4PM BST queued, CEVA→WFS transition
  complete with week 2 of induction pending.
- Canonical references: added pointers to scripts/weekly-digest/ and
  the IndexNow workflow.

No CHANGELOG entry — internal doc, not user-visible. Per the prompt.

Co-authored-by: SoapyRED <soapyred@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>