Merged
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -3,6 +3,10 @@
FreightUtils data updates, new tools, API changes, and MCP updates.
Subscribe via RSS: <https://www.freightutils.com/changelog.xml>

## 2026-05-14

- **Performance**: `/hs/code/*` and `/hs/heading/*` now served from Vercel edge cache with `Cache-Control: public, max-age=300, s-maxage=86400, stale-while-revalidate=604800`. Cold-serve response unchanged; warm-cache response served from the edge without re-rendering the page. ISR (`export const revalidate = 86400`) was already in place on both routes — this change makes the cache strategy explicit and tunable in `next.config.ts`, and surfaces the `Cache-Control` header in the response so cache hits are observable via `curl -I`. Sourced by the 2026-05-14 scraper-signature audit (`docs/audit/scraper-signature-2026-05-14.md`) which confirmed an active 216.* scraper hitting these paths at sustained ~5-second intervals. The application-layer ScrapeGuard rate limiter still runs on every request and continues to 429 the scraper as designed; Phase 1.6 will measure whether edge-cache adoption reduces overall Redis-INCR volume enough to skip Phase 2 (static generation).

## 2026-05-13

- **Data Update**: ADR anchor-set provenance + full-dataset gap audit — 25 anchor UN numbers (55 variant rows) now carry per-record `provenance.sources` URLs, `decision_rationale`, and `audited_at` timestamps. `/api/adr` response surfaces all three fields. Full-dataset gap audit at [docs/audit/adr-completeness-2026-05-13.md](docs/audit/adr-completeness-2026-05-13.md). ([PR #28](https://github.com/SoapyRED/freighttools/pull/28))
88 changes: 88 additions & 0 deletions docs/audit/scraper-signature-2026-05-14.md
@@ -0,0 +1,88 @@
# Scraper signature audit — 2026-05-14

**Sprint:** scraper-defence-r1, Phase 0 + Phase 1
**Method:** headless, Vercel REST API + Vercel MCP `get_runtime_logs`
**Auditor:** Code (Claude) — see PR for commit hash
**Window:** 2026-05-07 through 2026-05-14 (last 7 days)

## Headline finding

**A sustained scraper is active on /hs/* paths.** ScrapeGuard middleware is catching it with 429 responses, but the Redis-INCR cost of tracking each request is the throughput problem the sprint is trying to reduce.

Concrete signature:

- **≥100 ScrapeGuard 429s** captured in the 7-day window (Vercel MCP `get_runtime_logs` paged out at 100 with the warning "Stopped before fetching every matching runtime log because the query reached its time budget" — actual count is higher).
- **All from IPs in the `216.*` /8 range.** Source IPs truncated to `216...` in the MCP UI column-render and not extractable from this sandbox (see "data-access gaps" below).
- **All on `/hs/code/*`, `/hs/heading/*`, `/hs/chapter/*`, `/hs/section/*` paths** — every page family covered by the ScrapeGuard matcher. No /adr, /airlines, /containers, /unlocode, /uld, /vehicles hits observed.
- **Sustained ~5-second interval pattern.** Sample (10 entries spanning 56 seconds, taken from the last 10 minutes captured before window close):

```
19:12:45 GET /hs/code/843691 429
19:12:35 GET /hs/heading/2003 429
19:12:30 GET /hs/heading/2007 429
19:12:25 GET /hs/heading/2829 429
19:12:20 GET /hs/heading/0803 429
19:12:10 GET /hs/section/xv 429
19:12:05 GET /hs/heading/2005 429
19:12:00 GET /hs/heading/0806 429
19:11:55 GET /hs/heading/0809 429
19:11:49 GET /hs/code/510620 429
```

A real user does not hit `/hs/heading/2003` and `/hs/heading/2007` exactly 5 seconds apart — this is scripted enumeration of the HS code space.
- **The 216 prefix matches Soap's earlier note** flagging `109.* / 216.* / 68.* / 190.* / 140.*` as residential-proxy-flagged ranges. **This audit confirms 216.* is currently active.** The other prefixes were not observed in the 7-day window (or were observed but masked by the MCP truncation — see gaps).
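
The interval claim can be checked mechanically. The sketch below is illustrative only and is not part of the audited codebase: `toSeconds`, `requestGaps`, and `median` are hypothetical helper names, and the timestamps are copied from the sample above. It assumes all timestamps fall on the same day (no midnight wrap).

```typescript
// Hypothetical helper: converts "HH:MM:SS" to seconds since midnight.
function toSeconds(hms: string): number {
  const [h, m, s] = hms.split(':').map(Number);
  return h * 3600 + m * 60 + s;
}

// Gaps (in seconds) between consecutive requests, oldest first.
function requestGaps(timestamps: string[]): number[] {
  const sorted = timestamps.map(toSeconds).sort((a, b) => a - b);
  return sorted.slice(1).map((t, i) => t - sorted[i]);
}

// The median is less sensitive than the mean to an occasional missing log line.
function median(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

// Timestamps copied from the sample above.
const sampleTimes = [
  '19:12:45', '19:12:35', '19:12:30', '19:12:25', '19:12:20',
  '19:12:10', '19:12:05', '19:12:00', '19:11:55', '19:11:49',
];

const sampleGaps = requestGaps(sampleTimes);
console.log('gaps:', sampleGaps, 'median:', median(sampleGaps));
```

On this sample the median gap is 5 seconds, with two 10-second gaps that are consistent with a skipped or unlogged request rather than a change in cadence.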

## ScrapeGuard is working as designed

Every request in the sample above returned `429`. The middleware's per-IP rate limiter at `/hs/*` is correctly identifying and blocking the scrape. The platform's surface is protected.

**The remaining cost is the Redis hit per request to evaluate the rate limit.** Every request, including the ones that 429, performs a Redis `INCR` + `EXPIRE` call against Upstash to tally hits-per-IP-per-window. At a sustained ~5-second interval, this one scraper alone generates about 17,000 such calls per day (86,400 s ÷ 5 s ≈ 17,280), none of which serve a legitimate user. **This is what Phase 1.5 + Phase 2 are designed to reduce.**
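
To make the cost arithmetic concrete, here is a minimal sketch of a fixed-window limiter of the kind described above. It is an assumption-laden illustration, not the ScrapeGuard implementation (which this PR does not show): `FakeCounterStore` is an in-memory stand-in for Upstash, and the key format, the limit of 10, and the 60-second window are all hypothetical.

```typescript
// In-memory stand-in for the Redis store (hypothetical; the real store is Upstash).
class FakeCounterStore {
  private entries = new Map<string, { value: number; expiresAt: number }>();
  commands = 0; // how many simulated Redis commands were issued

  incr(key: string, nowMs: number): number {
    this.commands += 1;
    const entry = this.entries.get(key);
    if (entry === undefined || entry.expiresAt <= nowMs) {
      // Missing or expired key: INCR starts again from 1 with no TTL yet.
      this.entries.set(key, { value: 1, expiresAt: Infinity });
      return 1;
    }
    entry.value += 1;
    return entry.value;
  }

  expire(key: string, ttlMs: number, nowMs: number): void {
    this.commands += 1;
    const entry = this.entries.get(key);
    if (entry !== undefined) entry.expiresAt = nowMs + ttlMs;
  }
}

function checkRateLimit(
  store: FakeCounterStore, ip: string, limit: number, windowMs: number, nowMs: number,
): { allowed: boolean; count: number } {
  // Keying by window bucket keeps the window fixed even though EXPIRE runs on every hit.
  const key = `sg:${ip}:${Math.floor(nowMs / windowMs)}`;
  const count = store.incr(key, nowMs);
  store.expire(key, windowMs, nowMs);
  return { allowed: count <= limit, count };
}

// Replays one day of requests from a single IP at a fixed interval.
function simulateDay(intervalSec: number, limit: number, windowMs: number) {
  const store = new FakeCounterStore();
  const requests = Math.floor(86400 / intervalSec);
  let blocked = 0;
  for (let i = 0; i < requests; i++) {
    const result = checkRateLimit(store, '203.0.113.7', limit, windowMs, i * intervalSec * 1000);
    if (!result.allowed) blocked += 1;
  }
  return { requests, blocked, commands: store.commands };
}

console.log(simulateDay(5, 10, 60000));
```

Under these assumptions a 5-second cadence yields 17,280 limiter checks per day. The store sees twice that many commands if `INCR` and `EXPIRE` are issued and counted separately; whether the real middleware pipelines them into one call is not visible from this PR. Either way, the cost is paid on blocked requests as well as allowed ones.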

The Vercel firewall WAF events endpoint (`/v1/security/firewall/events`) returned 0 actions over the same 7-day window — the WAF-layer rules in `prj_ym0t9Vh84Gwa8zsauVy62jO082zl` (Tencent AS132203 deny, 43.172/16 + 43.173/16 CIDR denies, AhrefsBot/SemrushBot/etc UA denies, HS code-page rate-limit) **did not fire against this scraper** because the scraper's IP/UA does not match any existing rule. The throttle is happening at the application middleware layer (Redis-backed ScrapeGuard), not at the Vercel WAF.

## Data-access gaps (what this audit could NOT establish)

The sprint brief asked for top-N user-agents, top-N IPs/ASNs, country distribution, hit-rate breakdown, and an Upstash Redis baseline. The following pieces are **not surfaced in this audit**:

| Sprint request | Status | Blocker |
|---|---|---|
| Top 20 UAs by volume | NOT EXTRACTED | The ScrapeGuard log line in `middleware.ts:343-344` records `IP, path, group, limit, resets` — it does NOT include `user-agent`. The Vercel `/v1/security/firewall/events` API returned 0 rows over 7d. No raw Vercel access log is exposed via the public REST API. No `vercel logs` CLI is installed in this sandbox. No log drain is configured (verified: `GET /v2/integrations/log-drains?teamId=…` returns `[]`). |
| Top 20 source IPs / ASNs | PARTIAL — `216.*` /8 confirmed, finer resolution blocked | Vercel MCP `get_runtime_logs` renders the message column truncated to `IP: 216...` regardless of `limit` or `query` filter. Probed alternate endpoints: `/v3/projects/.../logs` 404, `/v1/projects/.../runtime-logs` 404, `/v2/logs` 404, `/v9/logs` 404, `/v1/observability/access-logs` 404, `/v2/configurations/log-drains` 404, `/v2/integrations/log-drains` 200 but empty. Deployment events `/v3/deployments/{id}/events` only returns build-pipeline output, not runtime requests. |
| Country distribution (top 10) | NOT EXTRACTED | Same blocker — no raw access logs accessible. The earlier Vercel Analytics screenshot showing SG 73% suggests Singapore-origin traffic, but cannot be re-derived from this audit's sources. |
| Hit rate per minute (peak vs sustained) | PARTIAL — sustained ~5s/req from 216.* confirmed | Burst detection needs higher-resolution logs than the MCP UI surfaces. |
| Upstash Redis commands/day baseline | NOT EXTRACTED | Upstash management API not in `.env.local`; KV_REST_API_* are in Vercel prod env (sensitive type) per the 9 May Vercel env audit, not retrievable from this sandbox. |

**To unblock these in a follow-up:**
1. Soap pulls a 7-day Vercel Dashboard log-explorer export (Chrome path — has the full IP + UA + ASN columns the API truncates), OR
2. Configure a Vercel Log Drain to a 3rd-party (e.g. Better Stack, Logtail) and Code can query that directly, OR
3. Extend `middleware.ts` ScrapeGuard's `console.warn` line to include `req.headers.get('user-agent')` and the full `req.ip` — surfaces UA/IP in the MCP-accessible log stream for the next sprint, OR
4. Install Vercel CLI in this sandbox (`npm i -g vercel`) and authenticate with VERCEL_TOKEN — `vercel logs --since=7d` should expose the full access log.
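
Option 3 is the smallest code change. A rough sketch of what the extended log line could look like follows; the function name, the field names, and the line format are hypothetical, since the actual `console.warn` call in `middleware.ts` is not shown in this PR. The example IP is from a documentation-reserved range.

```typescript
// Hypothetical shape: the first five fields mirror what the audit says the
// current log line records; userAgent is the proposed addition.
interface ScrapeGuardLogFields {
  ip: string;
  path: string;
  group: string;
  limit: number;
  resets: number;
  userAgent: string | null;
}

function formatScrapeGuardLog(f: ScrapeGuardLogFields): string {
  // User-Agent is attacker-controlled: strip newlines so it cannot forge
  // extra log lines, and cap its length so it cannot bloat the log.
  const ua = (f.userAgent ?? 'unknown').replace(/[\r\n]+/g, ' ').trim().slice(0, 200);
  return (
    `[ScrapeGuard] 429 ip=${f.ip} path=${f.path} group=${f.group} ` +
    `limit=${f.limit} resets=${f.resets} ua=${JSON.stringify(ua)}`
  );
}

console.warn(
  formatScrapeGuardLog({
    ip: '203.0.113.7',
    path: '/hs/heading/2003',
    group: 'hs',
    limit: 30,
    resets: 1747250000,
    userAgent: 'python-requests/2.31',
  }),
);
```

Putting the IP in a `key=value` field at the start of the line, rather than after a free-text prefix, may also make it less likely to be cut off by a truncating log viewer, though that depends on how the MCP renders the column.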

## Phase 1 — firewall rules added: **NONE**

The sprint brief said Phase 1 should add evidence-based rules **only** from Phase 0 data. Per the data-access gaps above, this audit cannot meet either evidence threshold defined in the brief:

- **UA-block rules** require ≥1000 requests over 7 days from a single UA hitting scraper-bait paths. **The middleware log doesn't include UA**, so this threshold cannot be evaluated.
- **ASN deny rules** require ≥500 requests from a confirmed proxy-provider ASN at ≥80% scraper-bait paths. **Source IPs are truncated to `216...`** in the MCP-accessible logs. Without the next 3 octets (or an ASN resolution from a full IP), no specific ASN can be named with confidence. The `216.0.0.0/8` block covers ~16M IPs across many ARIN-region ASNs (residential, corporate, cloud, mixed) — adding a `/8` block is too broad for an evidence-based rule.
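
Once full logs are available, both thresholds can be evaluated mechanically. The sketch below assumes a hypothetical `RequestRecord` shape for a 7-day log export and assumes that "scraper-bait paths" means the `/hs/*` page families; neither is defined in this PR.

```typescript
// Hypothetical record shape for one row of a 7-day log export.
interface RequestRecord {
  userAgent: string;
  asn: number;
  path: string;
}

// Thresholds as stated in the sprint brief.
const UA_MIN_REQUESTS = 1000;
const ASN_MIN_REQUESTS = 500;
const ASN_MIN_BAIT_SHARE = 0.8;

// Assumption: "scraper-bait" means the /hs/* page families.
function isScraperBait(path: string): boolean {
  return path.indexOf('/hs/') === 0;
}

// UAs with at least UA_MIN_REQUESTS hits on scraper-bait paths.
function uaBlockCandidates(records: RequestRecord[]): string[] {
  const counts = new Map<string, number>();
  for (const r of records) {
    if (!isScraperBait(r.path)) continue;
    counts.set(r.userAgent, (counts.get(r.userAgent) ?? 0) + 1);
  }
  return Array.from(counts.entries())
    .filter(([, n]) => n >= UA_MIN_REQUESTS)
    .map(([ua]) => ua);
}

// Confirmed proxy-provider ASNs with enough volume and a high enough
// share of that volume on scraper-bait paths.
function asnDenyCandidates(records: RequestRecord[], proxyAsns: Set<number>): number[] {
  const totals = new Map<number, { all: number; bait: number }>();
  for (const r of records) {
    if (!proxyAsns.has(r.asn)) continue;
    const t = totals.get(r.asn) ?? { all: 0, bait: 0 };
    t.all += 1;
    if (isScraperBait(r.path)) t.bait += 1;
    totals.set(r.asn, t);
  }
  return Array.from(totals.entries())
    .filter(([, t]) => t.all >= ASN_MIN_REQUESTS && t.bait / t.all >= ASN_MIN_BAIT_SHARE)
    .map(([asn]) => asn);
}
```

The set of confirmed proxy-provider ASNs is passed in rather than derived, because confirming an ASN as a proxy provider is a manual judgment that logs alone cannot make.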

**Phase 1 ends here with 0 rules added.** The existing 9 firewall rules remain Active (re-verified per the Phase 1 audit pass on 2026-05-12, see prior sprint deliverable). The `216.*` scraper continues to be caught by application-layer ScrapeGuard at the cost of ~17K Redis rate-limit calls/day. **Phase 1.5 is the intervention that actually reduces that cost** (see PR).

## Phase 1 — what would unblock future rules

When Soap exports the dashboard-accessible logs (or the data-access gaps are otherwise resolved), the next-iteration Phase 1 should look at:

- **`216.*` ASN(s):** if `216.x.y.z` resolves to a single proxy-provider ASN, add an ASN deny rule scoped to `/hs/*` (matches the existing Tencent-rule pattern: deny ASN only on scraper-bait paths, not site-wide). Candidate ASNs in the 216 range:
  - 216.244.66.0/24 (DataCenter — known to host residential-proxy services)
  - 216.86.x.x (multiple ASNs including SiteFiber, others)
- **UAs in the 216.* request flood:** if all requests share a non-browser UA (`Go-http-client/*`, `python-requests/*`, `axios/*`, `Java/*`), UA-block is appropriate. If they use a fake Chrome UA with no `Sec-Ch-Ua` client-hint header, a "missing-client-hints" rule may catch them.
- **Singapore-origin traffic verification:** the Vercel Analytics screenshot Soap mentioned (SG 73%) is worth re-confirming. If 216.* IPs are SG-origin per geo lookup, that's a strong indicator of a SE-Asia-routed proxy network.
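
The two UA checks in the list above could be expressed roughly as follows. This is a sketch under assumptions: `classifyRequest` and the `Signal` labels are hypothetical names, headers are assumed to arrive as a plain object with lowercase keys, and the non-browser pattern covers only the four UA families named above. Older Chrome releases and some embedded browsers do not send `Sec-CH-UA`, so a "missing-client-hints" result is a signal to investigate, not proof of a fake UA.

```typescript
// Assumption: headers arrive as a plain object with lowercase keys.
type HeaderMap = Record<string, string | undefined>;
type Signal = 'non-browser-ua' | 'missing-client-hints' | 'none';

// The non-browser UA families named in the audit.
const NON_BROWSER_UA = /^(Go-http-client|python-requests|axios|Java)\//;

function classifyRequest(headers: HeaderMap): Signal {
  const ua = headers['user-agent'] ?? '';
  if (NON_BROWSER_UA.test(ua)) return 'non-browser-ua';

  // A UA that claims to be Chrome but carries no Sec-CH-UA header is suspicious.
  const claimsChrome = /\bChrome\/\d+/.test(ua);
  const hasClientHints = (headers['sec-ch-ua'] ?? '').trim() !== '';
  if (claimsChrome && !hasClientHints) return 'missing-client-hints';

  return 'none';
}

console.log(classifyRequest({ 'user-agent': 'python-requests/2.31.0' }));
```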

## Phase 1 appendix — rules added this sprint

(none — see above)

---

## Phase 0 conclusions in one line

The platform is protected by ScrapeGuard. The cost of that protection is unnecessary Redis traffic. Phase 1.5 (edge-cache the HS pages) is the intervention that actually reduces the cost. Phase 1 (evidence-based firewall rules) is deferred to a future sprint that has access to non-truncated logs.
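
For Phase 1.5, the header value set in `next.config.ts` can be sanity-checked by parsing it into its directives. The parser below is a deliberately simple sketch (`parseCacheControl` is a hypothetical helper, and it ignores quoted-string arguments); the header string is copied from this PR.

```typescript
// Parses a Cache-Control value into { directive: seconds | true }.
// Simplification: quoted-string arguments are treated as flag directives.
function parseCacheControl(value: string): Record<string, number | true> {
  const directives: Record<string, number | true> = {};
  for (const part of value.split(',')) {
    const [rawName, rawArg] = part.trim().split('=');
    if (rawName === '') continue;
    const seconds = rawArg === undefined ? NaN : Number(rawArg);
    directives[rawName.toLowerCase()] = isNaN(seconds) ? true : seconds;
  }
  return directives;
}

// Header value copied from next.config.ts in this PR.
const hsHeader = 'public, max-age=300, s-maxage=86400, stale-while-revalidate=604800';
const parsed = parseCacheControl(hsHeader);

console.log({
  browserMinutes: (parsed['max-age'] as number) / 60,
  edgeHours: (parsed['s-maxage'] as number) / 3600,
  staleDays: (parsed['stale-while-revalidate'] as number) / 86400,
});
```

The three values work out to 5 minutes of browser caching, 24 hours of edge caching, and a 7-day stale-while-revalidate window.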
6 changes: 6 additions & 0 deletions lib/changelog-data.ts
@@ -10,6 +10,12 @@ export interface ChangelogEntry {
}

export const entries: ChangelogEntry[] = [
{
isoDate: '2026-05-14', date: 'May 14', tag: 'Bug Fix',
title: 'Edge cache on /hs/code/* and /hs/heading/*',
desc: 'Scraper-target page routes now served from Vercel edge cache with explicit Cache-Control (s-maxage=86400, SWR=7d). Warm-cache hits skip page render — cold-serve response unchanged. ISR was already enabled on both routes; this surfaces the cache strategy in next.config.ts so it is tunable and visible via curl -I. The middleware-layer ScrapeGuard rate limiter continues to 429 the active 216.* scraper as designed.',
link: '/hs',
},
// ─── May 2026 backfill (FAULT 5 catch-up, sprint chore/release-hygiene 2026-05-14) ───
{
isoDate: '2026-05-13', date: 'May 13', tag: 'Data Update',
23 changes: 23 additions & 0 deletions next.config.ts
@@ -23,6 +23,25 @@ const nextConfig: NextConfig = {
'/api/vehicles/:path*',
];

// HS code + heading pages — scraper-target page routes already running
// ISR (`export const revalidate = 86400` set on each page). Add explicit
// Cache-Control headers so Vercel's edge serves cache-hits without
// invoking the page render. An s-maxage of 1 day plus a week-long
// stale-while-revalidate window lets the edge keep serving the cached
// copy while a page regenerates. Browser cache (max-age) is kept short
// (5 minutes) so legitimate users see fresh data. The middleware-layer
// ScrapeGuard still runs on every request, but the page render cost on
// cache hits is gone.
//
// Sourced by the 2026-05-14 scraper-signature audit
// (docs/audit/scraper-signature-2026-05-14.md). The 216.* scraper hits
// /hs/code/* and /hs/heading/* at ~5-second intervals; edge cache
// makes each repeat hit cheap.
const hsPageCacheControl = 'public, max-age=300, s-maxage=86400, stale-while-revalidate=604800';
const hsPageCachedPaths = [
'/hs/code/:path*',
'/hs/heading/:path*',
];

return [
{
source: '/api/:path*',
@@ -36,6 +55,10 @@ const nextConfig: NextConfig = {
source,
headers: [{ key: 'Cache-Control', value: bulkRefCacheControl }],
})),
...hsPageCachedPaths.map((source) => ({
source,
headers: [{ key: 'Cache-Control', value: hsPageCacheControl }],
})),
];
},
