Problem
Telemetry-2026-05-21 showed 486 residual webfetch 404 errors (down 78% from March's 2,222 — the existing failure cache is working). The remaining cases mostly escape because of two gaps in the cache:
- TTL is 5 minutes. Many sessions run 30+ minutes; the same dead URL can be revisited multiple times within a session and miss the cache after the first 5-minute window.
- Cache key is the raw URL string. LLM-generated URLs often include tracking params (
?utm_source=..., &fbclid=..., &gclid=...) or fragments — each variation gets its own cache entry, so a known-bad URL with different tracking params re-hits the network.
Fix
- Extend TTL from 5 min → 30 min. URLs that 404 rarely self-heal within a session; the longer window catches the slow-burn case.
- Normalize URLs before cache lookup. Strip universally-tracking params (
utm_*, ref, fbclid, gclid, msclkid, mc_*, _ga, _gl, igshid, mibextid, __cf_chl_*, etc.), strip URL fragments, lowercase the hostname, and sort remaining params. The conservative tracking-param list ensures we never strip a functional param (?page=2, ?id=42) by accident.
Both changes apply to all 404/410/451 cache entries.
Impact
Should drop residual webfetch 404 errors further from 486. Specifically helps:
- Long sessions revisiting the same dead doc URL multiple times (TTL extension)
- LLMs that paste the same URL with different
utm_* tags pulled from a search result (normalization)
Tests
15 new tests pin normalizeUrlForCache:
- strips utm/fbclid/gclid/msclkid/etc.
- preserves functional params (page, id, q)
- sorts params for stable cache key
- strips fragments
- lowercases hostname (case-insensitive per RFC 3986)
- preserves path case (HTTP paths ARE case-sensitive)
- collapses Cloudflare
__cf_chl_* challenge tokens
- 6 variations of the same logical URL all collapse to one key
- different functional params or paths do NOT collapse
Problem
Telemetry-2026-05-21 showed 486 residual webfetch 404 errors (down 78% from March's 2,222 — the existing failure cache is working). The remaining cases mostly escape because of two gaps in the cache:
?utm_source=...,&fbclid=...,&gclid=...) or fragments — each variation gets its own cache entry, so a known-bad URL with different tracking params re-hits the network.Fix
utm_*,ref,fbclid,gclid,msclkid,mc_*,_ga,_gl,igshid,mibextid,__cf_chl_*, etc.), strip URL fragments, lowercase the hostname, and sort remaining params. The conservative tracking-param list ensures we never strip a functional param (?page=2,?id=42) by accident.Both changes apply to all 404/410/451 cache entries.
Impact
Should drop residual webfetch 404 errors further from 486. Specifically helps:
utm_*tags pulled from a search result (normalization)Tests
15 new tests pin
normalizeUrlForCache:__cf_chl_*challenge tokens