Skip to content

webfetch 404 cache: short TTL + no URL normalization lets tracking-param variations escape #838

@anandgupta42

Description

@anandgupta42

Problem

Telemetry-2026-05-21 showed 486 residual webfetch 404 errors (down 78% from March's 2,222 — the existing failure cache is working). The remaining cases mostly escape because of two gaps in the cache:

  1. TTL is 5 minutes. Many sessions run 30+ minutes; the same dead URL can be revisited multiple times within a session and miss the cache after the first 5-minute window.
  2. Cache key is the raw URL string. LLM-generated URLs often include tracking params (?utm_source=..., &fbclid=..., &gclid=...) or fragments — each variation gets its own cache entry, so a known-bad URL with different tracking params re-hits the network.

Fix

  1. Extend TTL from 5 min → 30 min. URLs that 404 rarely self-heal within a session; the longer window catches the slow-burn case.
  2. Normalize URLs before cache lookup. Strip universally-tracking params (utm_*, ref, fbclid, gclid, msclkid, mc_*, _ga, _gl, igshid, mibextid, __cf_chl_*, etc.), strip URL fragments, lowercase the hostname, and sort remaining params. The conservative tracking-param list ensures we never strip a functional param (?page=2, ?id=42) by accident.

Both changes apply to all 404/410/451 cache entries.

Impact

Should drop residual webfetch 404 errors further from 486. Specifically helps:

  • Long sessions revisiting the same dead doc URL multiple times (TTL extension)
  • LLMs that paste the same URL with different utm_* tags pulled from a search result (normalization)

Tests

15 new tests pin normalizeUrlForCache:

  • strips utm/fbclid/gclid/msclkid/etc.
  • preserves functional params (page, id, q)
  • sorts params for stable cache key
  • strips fragments
  • lowercases hostname (case-insensitive per RFC 3986)
  • preserves path case (HTTP paths ARE case-sensitive)
  • collapses Cloudflare __cf_chl_* challenge tokens
  • 6 variations of the same logical URL all collapse to one key
  • different functional params or paths do NOT collapse

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions