perf(hs): edge-cache /hs/code/* and /hs/heading/* to reduce middleware load #32
Merged
…ively
Phase 0 audit of the scraper-defence-r1 sprint. Pulled the last 7 days
of edge-middleware runtime logs via Vercel MCP get_runtime_logs filtered
to source=edge-middleware, query=ScrapeGuard, statusCode=429.
Headline finding: an active scraper is hammering /hs/code/*,
/hs/heading/*, /hs/chapter/*, /hs/section/* from IPs in the 216.* /8
range at sustained ~5-second intervals. ≥100 ScrapeGuard 429s captured
in the window (MCP query timed out before exhausting matches). All
requests caught by the application-layer rate limiter — none reached
the page handlers. The Vercel WAF events endpoint returned 0 actions
over the same window: the WAF-layer rules in the firewall config did
NOT match this scraper.
Data-access gaps documented honestly:
- MCP UI truncates the message column at "IP: 216..." — can't
extract the next 3 octets for ASN resolution.
- middleware.ts ScrapeGuard log line records IP/path/group/limit/
reset but NOT user-agent — top-N UA aggregation impossible from
application logs alone.
- No raw Vercel access log endpoint accessible from this sandbox
(probed /v3/projects/.../logs, /v1/runtime-logs, /v2/logs,
/v9/logs, /v1/observability/access-logs — all 404). No Vercel
CLI installed. No Log Drain configured (verified empty).
- Upstash Redis commands/day baseline not accessible — KV_* are
sensitive-typed in Vercel prod env per the 9 May env audit.
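The user-agent gap listed above could be closed by extending the ScrapeGuard log line. The following is a minimal sketch only: `ScrapeGuardEvent`, `formatScrapeGuardLog`, and the `key=value` layout are hypothetical, since the real log format in `middleware.ts` is not shown in this PR.

```typescript
// Hypothetical event shape: the fields the log line already records
// (IP/path/group/limit/reset) plus the missing user-agent.
interface ScrapeGuardEvent {
  ip: string;
  path: string;
  group: string;
  limit: number;
  reset: number; // epoch seconds at which the rate-limit window resets
  userAgent: string | null;
}

// Builds a single-line log message. The user-agent is JSON-quoted so that
// spaces and quotes inside it cannot break downstream log parsing.
function formatScrapeGuardLog(e: ScrapeGuardEvent): string {
  const ua = JSON.stringify(e.userAgent ?? "unknown");
  return (
    `[ScrapeGuard] 429 ip=${e.ip} path=${e.path} group=${e.group} ` +
    `limit=${e.limit} reset=${e.reset} ua=${ua}`
  );
}
```

In middleware the value would come from `req.headers.get('user-agent')`, which returns `null` when the header is absent.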
Phase 1 conclusion: 0 firewall rules added this sprint. The sprint
brief's evidence thresholds (≥1000 requests from a single UA, ≥500
requests from a confirmed proxy-provider ASN) cannot be evaluated
without the truncated log fields. Blocking 216.0.0.0/8 by CIDR would
cover ~16M IPs across many ARIN-region ASNs, which is too broad for an
evidence-based rule.
Phase 1.5 (cache headers) ships in the same PR as the actual
intervention — see next commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e load
Adds explicit Cache-Control headers in next.config.ts headers() for the
two scraper-target HS page routes:
public, max-age=300, s-maxage=86400, stale-while-revalidate=604800
Both page routes already had `export const revalidate = 86400` (ISR)
in place; this commit makes the cache strategy explicit at the config
level so it's tunable, visible in the response headers (debuggable
via `curl -I`), and not buried in per-page constants.
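A sketch of how such a `headers()` entry might look, for orientation only. The header value is the one quoted above and `hsPageCachedPaths` is named in this PR, but the `:path*` patterns, the `HeaderRule` type, and the `buildHsCacheRules` helper are assumptions. Whether the platform honours this override for ISR pages should be confirmed with `curl -I` rather than assumed.

```typescript
const HS_CACHE_CONTROL =
  "public, max-age=300, s-maxage=86400, stale-while-revalidate=604800";

// Assumed route patterns for the two scraper-target HS page routes.
const hsPageCachedPaths: string[] = ["/hs/code/:path*", "/hs/heading/:path*"];

interface HeaderRule {
  source: string;
  headers: { key: string; value: string }[];
}

// Maps each cached path to one rule that sets Cache-Control.
function buildHsCacheRules(): HeaderRule[] {
  return hsPageCachedPaths.map((source) => ({
    source,
    headers: [{ key: "Cache-Control", value: HS_CACHE_CONTROL }],
  }));
}

// In next.config.ts this object would be the default export.
const nextConfig = {
  async headers(): Promise<HeaderRule[]> {
    return buildHsCacheRules();
  },
};
```

Keeping the path list and the header value in two constants means either can be tuned without touching the page files.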
What this changes on the wire:
- Cold serve: unchanged — middleware fires, page renders, response cached
- Warm cache hit: response served from Vercel edge without re-rendering
the page server component (the React tree + the lib/calculations/hs
data lookup are skipped)
- The application-layer ScrapeGuard rate limiter in middleware.ts
STILL runs on every request (legitimate or scraper) — Next.js
middleware runs before cache lookup. Phase 1.6 measures whether the
page-render savings reduce overall Redis-INCR volume enough; if not,
Phase 2 (full generateStaticParams() at build time) ships.
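To make the per-request cost in the last bullet concrete, here is a minimal fixed-window limiter sketch. It assumes ScrapeGuard counts with one INCR per request into a per-window key; the key layout, function names, and the in-memory counter standing in for Redis are all hypothetical.

```typescript
interface WindowInfo {
  key: string; // counter key for this IP, group and time window
  reset: number; // epoch seconds at which the window ends
}

// Fixed-window bucketing: every request in the same window shares one key.
function currentWindow(ip: string, group: string, windowSec: number, nowMs: number): WindowInfo {
  const windowId = Math.floor(nowMs / 1000 / windowSec);
  return { key: `sg:${group}:${ip}:${windowId}`, reset: (windowId + 1) * windowSec };
}

// In-memory stand-in for Redis INCR so the sketch runs without a database.
class MemoryCounter {
  private counts = new Map<string, number>();
  incr(key: string): number {
    const next = (this.counts.get(key) ?? 0) + 1;
    this.counts.set(key, next);
    return next;
  }
}

// One increment per request, whether or not the page is later served from cache.
function shouldBlock(
  counter: MemoryCounter,
  ip: string,
  group: string,
  limit: number,
  windowSec: number,
  nowMs: number,
): boolean {
  const { key } = currentWindow(ip, group, windowSec, nowMs);
  return counter.incr(key) > limit;
}
```

Because the increment happens before any cache decision, cache headers alone reduce render work but not the counter traffic.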
Sourced by docs/audit/scraper-signature-2026-05-14.md: a 216.* scraper
is hitting these paths at sustained ~5-second intervals. Each repeat
hit is now cheap at the edge.
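Checking the result from a script rather than by eye could look like the sketch below. `parseCacheControl` and `hasDirectives` are hypothetical helpers. Which directives to expect on the wire is also an assumption, since a CDN may consume some of them (for example `s-maxage`) before the response reaches the client.

```typescript
// Parses a Cache-Control header value into a directive map.
// Valueless directives such as "public" map to true.
function parseCacheControl(value: string): Map<string, string | true> {
  const directives = new Map<string, string | true>();
  for (const part of value.split(",")) {
    const trimmed = part.trim();
    if (trimmed === "") continue;
    const eq = trimmed.indexOf("=");
    if (eq === -1) {
      directives.set(trimmed.toLowerCase(), true);
    } else {
      directives.set(trimmed.slice(0, eq).toLowerCase(), trimmed.slice(eq + 1));
    }
  }
  return directives;
}

// Returns true only if every expected directive is present with the expected value.
function hasDirectives(
  headerValue: string | null,
  expected: Record<string, string | true>,
): boolean {
  if (headerValue === null) return false;
  const actual = parseCacheControl(headerValue);
  return Object.keys(expected).every((name) => actual.get(name) === expected[name]);
}
```

A caller would pass the `cache-control` value from a `HEAD` response together with the directives it expects, for example `{ public: true, "max-age": "300" }`.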
FAULT 5 (user-visible change checklist):
- CHANGELOG.md entry added under 2026-05-14
- lib/changelog-data.ts entry added (renders on /changelog page via
monthGroups() — verified by reading the function)
- No new URLs (no sitemap or IndexNow action needed)
- No siteStats.ts changes, no nav changes, no API contract changes
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…d-static-gen # Conflicts: # CHANGELOG.md
This was referenced May 15, 2026
SoapyRED added a commit that referenced this pull request May 16, 2026
Bumps Last-updated 9 May → 16 May. Captures the 17 PRs landed across 2026-05-13..2026-05-16 (PR #25 through PR #41) plus the 14 May infra changes that didn't have their own PR (Cloudflare disconnect, Upstash PAYG, IndexNow live).
Sections refreshed:
- Sprint cadence 13–16 May (new): full PR list with one-liner per PR.
- Platform: MCP v2.1.0 → v2.1.1; route count 36 → 38.
- Infrastructure changes (new): CF Workers disconnected 14 May, CF DNS-only / Vercel firewall is sole edge security, Upstash PAYG $20 cap, CLAUDE.md at root encodes FAULT 5 + FAULT 14, IndexNow workflow live.
- Data integrity status (new): table for ULD / Airlines / ADR / Containers / UN-LOCODE / HS / Vehicles / Customs-duty. ULD + Airlines + ADR verified: true; the other 5 verified: false pending allowlist extension (specific domains enumerated).
- Scraper defence status (new): PR #31 / #32 / #33 / #38 live, Phases 3+4 deferred to runbook, Phase 2 skipped.
- Edge firewall: scoped to Vercel-only (CF inert now).
- Distribution surfaces: table with current download counts, Smithery score, MCP Registry STALE flag, Glama description STALE flag.
- Weekly digest CLI (new): six FAULT 14 invariants summarised; points at scripts/weekly-digest/README.md for the full spec.
- Vercel Analytics: 30-day baseline updated (3,311 visitors / 6,070 PV / 69% bounce / SG 73%).
- First validated user signals: Tom (CEVA) preserved + Simon's team organic adoption added per 16 May report.
- What's blocked / What's next / Red flags: updated to reflect today's reality: vehicles+customs SHIPPED (#39 #40), weekly digest SHIPPED (#41), Make.com Town Hall 21 May 4PM BST queued, CEVA→WFS transition complete with week 2 of induction pending.
- Canonical references: added pointers to scripts/weekly-digest/ and the IndexNow workflow.
No CHANGELOG entry: internal doc, not user-visible. Per the prompt.
Co-authored-by: SoapyRED <soapyred@users.noreply.github.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Phase 0 + Phase 1 + Phase 1.5 of the scraper-defence-r1 sprint, landed together. Phase 1.6 / Phase 2 / Phase 3 are wall-clock blocked (24h + 48h windows) and become Soap-actionable follow-ups.
Phase 0: scraper signature audit
See `docs/audit/scraper-signature-2026-05-14.md` (committed in this PR).
Headline finding: a sustained scraper is active. 7-day window pulled via Vercel MCP `get_runtime_logs` filtered to `source=edge-middleware, query=ScrapeGuard, statusCode=429`:
- ≥100 `[ScrapeGuard] 429` log lines captured (MCP query timed out before exhausting matches)
- All from IPs in the `216.*` /8 range
- All on `/hs/code/*`, `/hs/heading/*`, `/hs/chapter/*`, `/hs/section/*`
- Sustained ~5-second interval (scripted enumeration, not a real user)
Sample 10 minutes from window-close:
ScrapeGuard is doing its job — every request 429'd. The cost is the Redis INCR per request. At sustained ~5s/req from one source, that's ~17K wasted Redis commands/day.
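The ~17K figure follows from simple arithmetic, sketched here under the assumption of exactly one Redis command per request (the function and constant names are illustrative):

```typescript
const SECONDS_PER_DAY = 24 * 60 * 60; // 86,400

// Commands per day for a single source requesting at a fixed interval.
function redisCommandsPerDay(intervalSec: number, commandsPerRequest: number = 1): number {
  if (intervalSec <= 0) {
    throw new RangeError("intervalSec must be positive");
  }
  return Math.round((SECONDS_PER_DAY / intervalSec) * commandsPerRequest);
}

const perDay = redisCommandsPerDay(5); // 17280, i.e. ~17K
```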
Vercel WAF events (`/v1/security/firewall/events`) returned 0 actions over the same 7-day window. The 9 existing firewall rules (Tencent AS132203, 43.172/16 + 43.173/16, AhrefsBot/SemrushBot/etc. UA blocks, HS code rate-limit) did NOT match this scraper.
Data-access gaps documented honestly:
- `216.*` /8 confirmed, finer resolution truncated by MCP UI render
- Probed (all returned 404): `/v3/projects/.../logs`, `/v1/projects/.../runtime-logs`, `/v2/logs`, `/v9/logs`, `/v1/observability/access-logs`, `/v2/configurations/log-drains`. `/v2/integrations/log-drains` returned 200 but empty (no drains configured). No `vercel` CLI in sandbox.
Phase 1: 0 firewall rules added
The sprint brief's evidence thresholds (≥1000 reqs from one UA, ≥500 reqs from one ASN at ≥80% scraper-bait paths) cannot be evaluated given the data-access gaps above:
- …`216...`
- `216.0.0.0/8` block would cover ~16M IPs across many ARIN ASNs (residential, corporate, cloud), too broad to be evidence-based
Honest call: stop short of speculative rules. The application-layer ScrapeGuard is catching the scraper. Phase 1.5 (below) reduces the cost of catching it rather than trying to add another firewall layer with insufficient signature data.
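For reference, the ~16M figure for a /8 is 2^(32−8) addresses. A small sketch (the helper name is illustrative):

```typescript
// Number of IPv4 addresses covered by a CIDR prefix length.
function cidrAddressCount(prefixLength: number): number {
  if (!Number.isInteger(prefixLength) || prefixLength < 0 || prefixLength > 32) {
    throw new RangeError("prefixLength must be an integer between 0 and 32");
  }
  return 2 ** (32 - prefixLength);
}

const slash8 = cidrAddressCount(8); // 16777216, i.e. ~16M
const slash16 = cidrAddressCount(16); // 65536
```

By the same arithmetic a /16 rule, like the existing `43.172/16` block, covers 256 times fewer addresses than a /8.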
Phase 1.5: `next.config.ts` cache headers + FAULT 5
Adds explicit Cache-Control to /hs/code/* and /hs/heading/* via `next.config.ts headers()`:
`public, max-age=300, s-maxage=86400, stale-while-revalidate=604800`
Both routes already had `export const revalidate = 86400` (ISR). This surfaces the cache strategy at the config level so it's tunable + curl-visible.
What this changes on the wire:
- …`generateStaticParams()`) ships.
FAULT 5 checklist: applied
- `lib/changelog-data.ts` entry (renders on /changelog via `monthGroups()`)
Smoke test posture
`scripts/smoke-test.mjs` will be run against the Vercel preview URL after this PR opens. The smoke covers API endpoints. It does NOT specifically assert Cache-Control on /hs/* pages, but it does assert no 5xx regression. Post-merge, the deliverable is a `curl -I https://www.freightutils.com/hs/code/<sample>` showing the new Cache-Control header.
Files
- `docs/audit/scraper-signature-2026-05-14.md`
- `next.config.ts` (…`hsPageCachedPaths` mapping to `Cache-Control`)
- `CHANGELOG.md`
- `lib/changelog-data.ts` (…`entries[]`)
Hard-rule compliance
- …`gh pr merge`)
- `lib/data/test_airlines.js` not touched (no modification needed this sprint)
- …`docs(audit):` + `perf(hs):`)
Soap to-do post-merge (time-blocked phases)
- …`generateStaticParams()` for HS pages: Only fire if Phase 1.6 shows the cache-header path didn't get under 3K commands/day. Adds ~6,940 HS code params + ~5,224 heading params at build time. Watch the build time budget and abort if it crosses 20 minutes.
- …`x-vercel-cache: HIT`. Write a final status note to `docs/audit/scraper-signature-2026-05-14.md` with baseline vs post-fix percent reduction.
- …`middleware.ts` ScrapeGuard's `console.warn` to log `req.headers.get('user-agent')` so the next sprint can do evidence-based UA blocking.
🤖 Generated with Claude Code