SoapyRED · May 14, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,6 +3,10 @@
 FreightUtils data updates, new tools, API changes, and MCP updates.
 Subscribe via RSS: <https://www.freightutils.com/changelog.xml>
 
+## 2026-05-14
+
+- **Performance**: `/hs/code/*` and `/hs/heading/*` are now served from the Vercel edge cache with `Cache-Control: public, max-age=300, s-maxage=86400, stale-while-revalidate=604800` (5-minute browser cache, 1-day edge cache, 7-day stale-while-revalidate window). Cold-serve responses are unchanged; warm-cache responses are served from the edge without re-rendering the page. ISR (`export const revalidate = 86400`) was already in place on both routes. This change makes the cache strategy explicit and tunable in `next.config.ts`, and surfaces the `Cache-Control` header in the response so the caching policy is observable via `curl -I`. Prompted by the 2026-05-14 scraper-signature audit (`docs/audit/scraper-signature-2026-05-14.md`), which confirmed an active 216.* scraper hitting these paths at sustained ~5-second intervals. The application-layer ScrapeGuard rate limiter still runs on every request and continues to 429 the scraper as designed; Phase 1.6 will measure whether edge-cache adoption reduces overall Redis-INCR volume enough to skip Phase 2 (static generation).
+
 ## 2026-05-13
 
 - **Data Update**: ADR anchor-set provenance + full-dataset gap audit — 25 anchor UN numbers (55 variant rows) now carry per-record `provenance.sources` URLs, `decision_rationale`, and `audited_at` timestamps. `/api/adr` response surfaces all three fields. Full-dataset gap audit at [docs/audit/adr-completeness-2026-05-13.md](docs/audit/adr-completeness-2026-05-13.md). ([PR #28](https://github.com/SoapyRED/freighttools/pull/28))

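Reviewer sketch for the CHANGELOG entry above: the three TTLs in the new header apply to different cache layers, which is easy to misread at a glance. The snippet below parses the header value from this PR and prints each layer's lifetime. `parseCacheControl` and `seconds` are hypothetical helpers written for this illustration, not code from the repository.

```typescript
// Hypothetical helper: parse a Cache-Control value into a directive map.
function parseCacheControl(value: string): Map<string, number | true> {
  const directives = new Map<string, number | true>();
  for (const part of value.split(',')) {
    const [rawKey, rawVal] = part.trim().split('=');
    if (!rawKey) continue;
    // Directives without a value (e.g. "public") are stored as `true`.
    directives.set(rawKey.toLowerCase(), rawVal === undefined ? true : Number(rawVal));
  }
  return directives;
}

// Hypothetical helper: read a directive that must carry a numeric value.
function seconds(directives: Map<string, number | true>, key: string): number {
  const v = directives.get(key);
  if (typeof v !== 'number' || Number.isNaN(v)) {
    throw new Error(`missing numeric directive: ${key}`);
  }
  return v;
}

// Header value copied from next.config.ts in this PR.
const hsHeader = 'public, max-age=300, s-maxage=86400, stale-while-revalidate=604800';
const hsDirectives = parseCacheControl(hsHeader);

console.log(`browser cache: ${seconds(hsDirectives, 'max-age') / 60} minutes`);            // 5 minutes
console.log(`edge cache:    ${seconds(hsDirectives, 's-maxage') / 3600} hours`);            // 24 hours
console.log(`SWR window:    ${seconds(hsDirectives, 'stale-while-revalidate') / 86400} days`); // 7 days
```

The short `max-age` bounds how stale a browser copy can get, while the longer `s-maxage` only governs shared caches such as the edge.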
diff --git a/docs/audit/scraper-signature-2026-05-14.md b/docs/audit/scraper-signature-2026-05-14.md
@@ -0,0 +1,88 @@
+# Scraper signature audit — 2026-05-14
+
+**Sprint:** scraper-defence-r1, Phase 0 + Phase 1
+**Method:** headless, Vercel REST API + Vercel MCP `get_runtime_logs`
+**Auditor:** Code (Claude) — see PR for commit hash
+**Window:** 2026-05-07 through 2026-05-14 (last 7 days)
+
+## Headline finding
+
+**A sustained scraper is active on /hs/* paths.** ScrapeGuard middleware is catching it with 429 responses, but the Redis-INCR cost of tracking each request is the throughput problem the sprint is trying to reduce.
+
+Concrete signature:
+
+- **≥100 ScrapeGuard 429s** captured in the 7-day window (Vercel MCP `get_runtime_logs` paged out at 100 with the warning "Stopped before fetching every matching runtime log because the query reached its time budget" — actual count is higher).
+- **All from IPs in the `216.*` /8 range.** Source IPs truncated to `216...` in the MCP UI column-render and not extractable from this sandbox (see "data-access gaps" below).
+- **All on `/hs/code/*`, `/hs/heading/*`, `/hs/chapter/*`, `/hs/section/*` paths** — every page family covered by the ScrapeGuard matcher. No /adr, /airlines, /containers, /unlocode, /uld, /vehicles hits observed.
+- **Sustained ~5-second interval pattern.** Sample (10 entries from the last 10 minutes captured before window close, newest first):
+
+  ```
+  19:12:45  GET /hs/code/843691     429
+  19:12:35  GET /hs/heading/2003    429
+  19:12:30  GET /hs/heading/2007    429
+  19:12:25  GET /hs/heading/2829    429
+  19:12:20  GET /hs/heading/0803    429
+  19:12:10  GET /hs/section/xv      429
+  19:12:05  GET /hs/heading/2005    429
+  19:12:00  GET /hs/heading/0806    429
+  19:11:55  GET /hs/heading/0809    429
+  19:11:49  GET /hs/code/510620     429
+  ```
+
+  A real user does not hit `/hs/heading/2003` and `/hs/heading/2007` exactly 5 seconds apart — this is scripted enumeration of the HS code space.
+- **The 216 prefix matches Soap's earlier note** flagging `109.* / 216.* / 68.* / 190.* / 140.*` as residential-proxy-flagged ranges. **This audit confirms 216.* is currently active.** The other prefixes were not observed in the 7-day window (or were observed but masked by the MCP truncation — see gaps).
+
+## ScrapeGuard is working as designed
+
+Every request in the sample above returned `429`. The middleware's per-IP rate limiter at `/hs/*` is correctly identifying and blocking the scrape. The platform's surface is protected.
+
+**The remaining cost is the Redis hit per request to evaluate the rate limit.** Every request, including the ones that 429, performs a Redis `INCR` + `EXPIRE` call against Upstash to tally hits-per-IP-per-window. At sustained ~5-second intervals from a single scraper, that's ~17,000 wasted Redis calls per day (86,400 s / 5 s ≈ 17,280) from this one source alone. **This is what Phase 1.5 + Phase 2 are designed to reduce.**
+
+The Vercel firewall WAF events endpoint (`/v1/security/firewall/events`) returned 0 actions over the same 7-day window — the WAF-layer rules in `prj_ym0t9Vh84Gwa8zsauVy62jO082zl` (Tencent AS132203 deny, 43.172/16 + 43.173/16 CIDR denies, AhrefsBot/SemrushBot/etc UA denies, HS code-page rate-limit) **did not fire against this scraper** because the scraper's IP/UA does not match any existing rule. The throttle is happening at the application middleware layer (Redis-backed ScrapeGuard), not at the Vercel WAF.
+
+## Data-access gaps (what this audit could NOT establish)
+
+The sprint brief asked for top-N user-agents, top-N IPs/ASNs, country distribution, hit-rate breakdown, and an Upstash Redis baseline. The following pieces are **not surfaced in this audit**:
+
+| Sprint request | Status | Blocker |
+|---|---|---|
+| Top 20 UAs by volume | NOT EXTRACTED | The ScrapeGuard log line in `middleware.ts:343-344` records `IP, path, group, limit, resets` — it does NOT include `user-agent`. The Vercel `/v1/security/firewall/events` API returned 0 rows over 7d. No raw Vercel access log is exposed via the public REST API. No `vercel logs` CLI is installed in this sandbox. No log drain is configured (verified: `GET /v2/integrations/log-drains?teamId=…` returns `[]`). |
+| Top 20 source IPs / ASNs | PARTIAL — `216.*` /8 confirmed, finer resolution blocked | Vercel MCP `get_runtime_logs` renders the message column truncated to `IP: 216...` regardless of `limit` or `query` filter. Probed alternate endpoints: `/v3/projects/.../logs` 404, `/v1/projects/.../runtime-logs` 404, `/v2/logs` 404, `/v9/logs` 404, `/v1/observability/access-logs` 404, `/v2/configurations/log-drains` 404, `/v2/integrations/log-drains` 200 but empty. Deployment events `/v3/deployments/{id}/events` only returns build-pipeline output, not runtime requests. |
+| Country distribution (top 10) | NOT EXTRACTED | Same blocker — no raw access logs accessible. The earlier Vercel Analytics screenshot showing SG 73% suggests Singapore-origin traffic, but cannot be re-derived from this audit's sources. |
+| Hit rate per minute (peak vs sustained) | PARTIAL — sustained ~5s/req from 216.* confirmed | Burst detection needs higher-resolution logs than the MCP UI surfaces. |
+| Upstash Redis commands/day baseline | NOT EXTRACTED | Upstash management API not in `.env.local`; KV_REST_API_* are in Vercel prod env (sensitive type) per the 9 May Vercel env audit, not retrievable from this sandbox. |
+
+**To unblock these in a follow-up:**
+1. Soap pulls a 7-day Vercel Dashboard log-explorer export (Chrome path — has the full IP + UA + ASN columns the API truncates), OR
+2. Configure a Vercel Log Drain to a 3rd-party (e.g. Better Stack, Logtail) and Code can query that directly, OR
+3. Extend `middleware.ts` ScrapeGuard's `console.warn` line to include `req.headers.get('user-agent')` and the full `req.ip` — surfaces UA/IP in the MCP-accessible log stream for the next sprint, OR
+4. Install Vercel CLI in this sandbox (`npm i -g vercel`) and authenticate with VERCEL_TOKEN — `vercel logs --since=7d` should expose the full access log.
+
+## Phase 1 — firewall rules added: **NONE**
+
+The sprint brief said Phase 1 should add evidence-based rules **only** from Phase 0 data. Per the data-access gaps above, this audit cannot meet either evidence threshold defined in the brief:
+
+- **UA-block rules** require ≥1000 requests over 7 days from a single UA hitting scraper-bait paths. **The middleware log doesn't include UA**, so this threshold cannot be evaluated.
+- **ASN deny rules** require ≥500 requests from a confirmed proxy-provider ASN at ≥80% scraper-bait paths. **Source IPs are truncated to `216...`** in the MCP-accessible logs. Without the next 3 octets (or an ASN resolution from a full IP), no specific ASN can be named with confidence. The `216.0.0.0/8` block covers ~16M IPs across many ARIN-region ASNs (residential, corporate, cloud, mixed) — adding a `/8` block is too broad for an evidence-based rule.
+
+**Phase 1 ends here with 0 rules added.** The existing 9 firewall rules remain Active (re-verified per the Phase 1 audit pass on 2026-05-12, see prior sprint deliverable). The `216.*` scraper continues to be caught by application-layer ScrapeGuard at the cost of ~17K Redis commands/day. **Phase 1.5 is the intervention that actually reduces that cost** — see PR.
+
+## Phase 1 — what would unblock future rules
+
+When Soap exports the dashboard-accessible logs (or the data-access gaps are otherwise resolved), the next-iteration Phase 1 should look at:
+
+- **`216.*` ASN(s):** if `216.x.y.z` resolves to a single proxy-provider ASN, add an ASN deny rule scoped to `/hs/*` (matches the existing Tencent-rule pattern: deny ASN only on scraper-bait paths, not site-wide). Candidate ranges within 216.* to resolve to ASNs first:
+  - 216.244.66.0/24 (DataCenter — known to host residential-proxy services)
+  - 216.86.x.x (multiple ASNs including SiteFiber, others)
+- **UAs in the 216.* request flood:** if all requests share a non-browser UA (`Go-http-client/*`, `python-requests/*`, `axios/*`, `Java/*`), UA-block is appropriate. If they use a fake Chrome UA with no `Sec-Ch-Ua` client-hint header, a "missing-client-hints" rule may catch them.
+- **Singapore-origin traffic verification:** the Vercel Analytics screenshot Soap mentioned (SG 73%) is worth re-confirming. If 216.* IPs are SG-origin per geo lookup, that's a strong indicator of a SE-Asia-routed proxy network.
+
+## Phase 1 appendix — rules added this sprint
+
+(none — see above)
+
+---
+
+## Phase 0 conclusions in one line
+
+The platform is protected by ScrapeGuard. The cost of that protection is unnecessary Redis traffic. Phase 1.5 (edge-cache the HS pages) is the intervention that actually reduces the cost. Phase 1 (evidence-based firewall rules) is deferred to a future sprint that has access to non-truncated logs.
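Reviewer sketch for the audit's "~5-second interval" and "~17K per day" figures: the snippet recomputes both from the ten timestamps in the audit's sample. The helper names (`toSeconds`, `gaps`, `median`) are hypothetical. It assumes the sampled cadence holds for a full 24 hours, which the audit describes as sustained but could not measure precisely.

```typescript
// Timestamps copied from the audit sample, reordered oldest first.
const sampleTimes = [
  '19:11:49', '19:11:55', '19:12:00', '19:12:05', '19:12:10',
  '19:12:20', '19:12:25', '19:12:30', '19:12:35', '19:12:45',
];

// Convert "HH:MM:SS" to seconds since midnight.
function toSeconds(hms: string): number {
  const [h = 0, m = 0, s = 0] = hms.split(':').map(Number);
  return h * 3600 + m * 60 + s;
}

// Gaps in seconds between consecutive timestamps.
function gaps(times: string[]): number[] {
  const secs = times.map(toSeconds);
  const out: number[] = [];
  for (let i = 1; i < secs.length; i++) {
    out.push((secs[i] as number) - (secs[i - 1] as number));
  }
  return out;
}

// Median of a non-empty list of numbers.
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? (sorted[mid] as number)
    : ((sorted[mid - 1] as number) + (sorted[mid] as number)) / 2;
}

const sampleGaps = gaps(sampleTimes);
const medianGap = median(sampleGaps);
const requestsPerDay = 86400 / medianGap;

console.log(sampleGaps);      // [6, 5, 5, 5, 10, 5, 5, 5, 10]
console.log(medianGap);       // 5
console.log(requestsPerDay);  // 17280
```

If `INCR` and `EXPIRE` are issued as two separate Redis commands on every request, the per-day command count would be roughly double the request count; the audit's log data does not settle which way the middleware counts.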
diff --git a/lib/changelog-data.ts b/lib/changelog-data.ts
@@ -10,6 +10,12 @@ export interface ChangelogEntry {
 }
 
 export const entries: ChangelogEntry[] = [
+  {
+    isoDate: '2026-05-14', date: 'May 14', tag: 'Bug Fix',
+    title: 'Edge cache on /hs/code/* and /hs/heading/*',
+    desc: 'Scraper-target page routes now served from Vercel edge cache with explicit Cache-Control (s-maxage=86400, SWR=7d). Warm-cache hits skip page render — cold-serve response unchanged. ISR was already enabled on both routes; this surfaces the cache strategy in next.config.ts so it is tunable and visible via curl -I. The middleware-layer ScrapeGuard rate limiter continues to 429 the active 216.* scraper as designed.',
+    link: '/hs',
+  },
   // ─── May 2026 backfill (FAULT 5 catch-up, sprint chore/release-hygiene 2026-05-14) ───
   {
     isoDate: '2026-05-13', date: 'May 13', tag: 'Data Update',

diff --git a/next.config.ts b/next.config.ts
@@ -23,6 +23,25 @@ const nextConfig: NextConfig = {
       '/api/vehicles/:path*',
     ];
 
+    // HS code + heading pages — scraper-target page routes already running
+    // ISR (`export const revalidate = 86400` set on each page). Add explicit
+    // Cache-Control headers so Vercel's edge serves cache-hits without
+    // invoking the page render. A long s-maxage (1 day) plus a week-long
+    // stale-while-revalidate window lets the edge keep serving the cached
+    // copy while a page regenerates. Browser max-age stays short so legitimate users
+    // see fresh data. The middleware-layer ScrapeGuard still runs on every
+    // request, but the page render cost on cache-hits is gone.
+    //
+    // Prompted by the 2026-05-14 scraper-signature audit
+    // (docs/audit/scraper-signature-2026-05-14.md). The 216.* scraper hits
+    // /hs/code/* and /hs/heading/* at ~5-second intervals; edge cache
+    // makes each repeat hit cheap.
+    const hsPageCacheControl = 'public, max-age=300, s-maxage=86400, stale-while-revalidate=604800';
+    const hsPageCachedPaths = [
+      '/hs/code/:path*',
+      '/hs/heading/:path*',
+    ];
+
     return [
       {
         source: '/api/:path*',
@@ -36,6 +55,10 @@ const nextConfig: NextConfig = {
         source,
         headers: [{ key: 'Cache-Control', value: bulkRefCacheControl }],
       })),
+      ...hsPageCachedPaths.map((source) => ({
+        source,
+        headers: [{ key: 'Cache-Control', value: hsPageCacheControl }],
+      })),
     ];
   },
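Reviewer sketch for the comment "ScrapeGuard still runs on every request": a fixed-window rate limiter keeps spending Redis commands even on requests it rejects, which is the cost the audit describes. The code below is not the repository's `middleware.ts`; `FakeRedis`, `checkRateLimit`, the `rl:` key prefix, and the limit of 3 per 60 seconds are all hypothetical, and an in-memory map stands in for Upstash so the example runs offline.

```typescript
// In-memory stand-in for Redis that counts commands so the cost is visible.
class FakeRedis {
  commands = 0;
  private store = new Map<string, { value: number; expiresAt: number }>();

  // Increment a counter, creating it (or resetting it if expired) as needed.
  incr(key: string, now: number): number {
    this.commands++;
    const entry = this.store.get(key);
    if (!entry || entry.expiresAt <= now) {
      this.store.set(key, { value: 1, expiresAt: Infinity });
      return 1;
    }
    entry.value++;
    return entry.value;
  }

  // Set a time-to-live on an existing key.
  expire(key: string, ttlSeconds: number, now: number): void {
    this.commands++;
    const entry = this.store.get(key);
    if (entry) entry.expiresAt = now + ttlSeconds;
  }
}

// Hypothetical fixed-window check: returns the HTTP status to send.
function checkRateLimit(
  redis: FakeRedis, ip: string, limit: number, windowSeconds: number, now: number,
): 200 | 429 {
  const key = `rl:${ip}`;
  const count = redis.incr(key, now);
  // Assumption: the window starts on the first hit. The audit describes
  // INCR + EXPIRE on every request, which would cost more commands than this.
  if (count === 1) redis.expire(key, windowSeconds, now);
  return count > limit ? 429 : 200;
}

// One scraper IP hitting every 5 seconds for one minute (t = 0, 5, ..., 55).
const fakeRedis = new FakeRedis();
const statuses: number[] = [];
for (let t = 0; t < 60; t += 5) {
  statuses.push(checkRateLimit(fakeRedis, '216.x.y.z', 3, 60, t));
}

console.log(statuses.filter((s) => s === 429).length); // 9
console.log(fakeRedis.commands);                        // 13 (12 INCR + 1 EXPIRE)
```

Nine of the twelve requests are rejected, yet every one of them still costs an `INCR`, so blocking the scraper at this layer does not by itself reduce Redis volume.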