refactor: replace EMA-driven context cap with tier-based cost-aware decisions#348

Merged: BYK merged 2 commits into main from refactor/cost-aware-tier-system on May 15, 2026
@BYK BYK commented May 15, 2026

Summary

  • Remove the adaptive context cap system (EMA bust rate tracking, dynamic cap tightening/relaxation, MIN_CONTEXT_FLOOR, maxContextTokensCeiling) that caused death spirals by ratcheting the cap below the session's actual token count
  • Replace with tier-based quality watermarks (200K/500K/model limit) and per-turn economic bust-vs-continue decisions using actual cache pricing data
  • Add rolling bust detection: after 5 consecutive cache busts, stop compressing and inject a warning message advising the user to compact or start a new conversation

Key changes

New API

  • setCachePricing(write, read) — sets per-token cache costs from models.dev
  • shouldCompress(currentTokens, compressedTokens, consecutiveBusts) — bust-vs-continue economic decision with 0.85 threshold. Heavily favors NOT compressing since cache writes are 12.5x more expensive than reads on Opus. Returns false (don't compress) when no pricing data is available.
  • getTier(tokens) — maps token count to quality tier (0: ≤200K best, 1: ≤500K acceptable, 2: >500K degraded)
  • recordCacheUsage(write, read, inputTokens, sessionID) — tracks consecutive busts using total input tokens as denominator (not just cache tokens)
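As an illustration of the tier mapping listed above, here is a minimal sketch of what getTier could look like. The 200K and 500K boundaries come from the description; the `Tier` type and the constant names are hypothetical and may not match the actual implementation.

```typescript
// Hypothetical sketch of the quality-tier mapping; only the 200K/500K
// boundaries are taken from the PR description.
type Tier = 0 | 1 | 2;

const TIER_0_MAX_TOKENS = 200_000; // up to here: "best"
const TIER_1_MAX_TOKENS = 500_000; // up to here: "acceptable"

function getTier(tokens: number): Tier {
  if (tokens <= TIER_0_MAX_TOKENS) return 0; // best
  if (tokens <= TIER_1_MAX_TOKENS) return 1; // acceptable
  return 2; // degraded
}

// Usage: the boundaries are inclusive on the lower tier.
const tiers = [150_000, 200_000, 350_000, 900_000].map(getTier);
console.log(tiers);
```

Since the tiers describe model quality rather than price, a mapping like this would only inform how aggressively to act; the cost comparison itself lives in shouldCompress.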

Integration points

  • Tier gate in transformInner: Between the layer-0 passthrough and compression stages, shouldCompress() is called. When bust cost isn't justified, the session stays at layer 0 with full context up to the model limit.
  • Unsustainable warning in pipeline.ts: When TransformResult.unsustainable is set (5+ consecutive busts), a <system-reminder> warning is injected into the last user message advising the user to compact or start a new conversation.
  • DB persistence: consecutiveBusts is persisted via the repurposed dynamicContextCap column (no migration needed) and restored on session load.
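The warning injection described above could look roughly like the following sketch. Everything here except the `<system-reminder>` tag and the `unsustainable` flag is an assumption: the `Message` shape, the function name, and the warning wording are placeholders, not the code in pipeline.ts.

```typescript
// Hypothetical message shape; the real pipeline types are not shown in the PR.
interface Message {
  role: "system" | "user" | "assistant";
  content: string;
}

// Placeholder wording; the actual warning text is not quoted in the PR.
const UNSUSTAINABLE_WARNING =
  "<system-reminder>Repeated cache busts detected. Consider compacting " +
  "or starting a new conversation.</system-reminder>";

// Returns a new array with the warning appended to the last user message.
// Leaves the input untouched when the flag is unset or no user message exists.
function injectUnsustainableWarning(
  messages: Message[],
  unsustainable: boolean,
): Message[] {
  if (!unsustainable) return messages;

  let lastUserIndex = -1;
  for (let i = messages.length - 1; i >= 0; i--) {
    const m = messages[i];
    if (m && m.role === "user") {
      lastUserIndex = i;
      break;
    }
  }
  if (lastUserIndex === -1) return messages;

  return messages.map((m, i) =>
    i === lastUserIndex
      ? { ...m, content: `${m.content}\n\n${UNSUSTAINABLE_WARNING}` }
      : m,
  );
}
```

Returning a copy instead of mutating in place keeps the original messages available to earlier pipeline steps; whether the real code does this is not stated.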

Removed

  • adaptContextCap(), computeContextCap(), setMaxContextTokens(), getMaxContextTokens(), updateBustRate()
  • bustRateEMA, interBustIntervalEMA, lastBustAt, dynamicContextCap session state fields (DB columns retained, repurposed)
  • effectiveCap constraint in transformInner — context now uses full model window
  • targetBustCost, maxContextTokens config fields (marked deprecated, still parsed to avoid breaking existing .lore.json)

Why

The old EMA system optimized for minimizing input_tokens per turn via a static dollar-based cap. This created a death spiral: high bust rate → tighten cap → session can't fit in lower layers → forced to Layer 4 → bust rate stays high → cap ratchets down further. The tiers are quality-based (empirical model effectiveness), not pricing-based — token costs scale linearly regardless of tier.

Math verification (Opus 4.6)

  • Cache write: $6.25/MTok, Cache read: $0.50/MTok (write is 12.5x more expensive)
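  • At 250K → 150K compressed: bustCost = 150K × $6.25/MTok ≈ $0.94, continueCost = 250K × $0.50/MTok = $0.125 → don't compress
  • At 500K → 100K compressed: bustCost = 100K × $6.25/MTok = $0.625, continueCost = 500K × $0.50/MTok = $0.25 → don't compress
  • At 2M → 100K compressed: bustCost = 100K × $6.25/MTok = $0.625, continueCost = 2M × $0.50/MTok = $1.00 → compress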
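The three cases above can be reproduced with a short sketch. It assumes bustCost is the compressed size priced at the cache-write rate and continueCost is the current size priced at the cache-read rate, which matches the listed figures. It also assumes compression is chosen only when bustCost < 0.85 × continueCost: the PR gives the threshold value but not the exact comparison, so that direction is a guess consistent with the outcomes shown. Constant names are hypothetical.

```typescript
const TOKENS_PER_MTOK = 1_000_000;
const COMPRESS_THRESHOLD = 0.85;   // value from the PR; comparison form is assumed
const MAX_CONSECUTIVE_BUSTS = 5;   // from the PR: stop compressing after 5 busts

// Per-token prices in dollars; null until pricing data is available.
let cachePricing: { write: number; read: number } | null = null;

function setCachePricing(write: number, read: number): void {
  cachePricing = { write, read };
}

function shouldCompress(
  currentTokens: number,
  compressedTokens: number,
  consecutiveBusts: number,
): boolean {
  const pricing = cachePricing;
  if (pricing === null) return false;                           // no data: keep the cache
  if (consecutiveBusts >= MAX_CONSECUTIVE_BUSTS) return false;  // unsustainable: stop

  const bustCost = compressedTokens * pricing.write;    // rewrite the compressed prefix
  const continueCost = currentTokens * pricing.read;    // re-read the current prefix
  return bustCost < COMPRESS_THRESHOLD * continueCost;  // assumed comparison
}

// Opus 4.6 figures from the list above, converted to dollars per token.
setCachePricing(6.25 / TOKENS_PER_MTOK, 0.5 / TOKENS_PER_MTOK);

const decisions = [
  shouldCompress(250_000, 150_000, 0),   // bust ~$0.94 vs continue $0.125
  shouldCompress(500_000, 100_000, 0),   // bust $0.625 vs continue $0.25
  shouldCompress(2_000_000, 100_000, 0), // bust $0.625 vs continue $1.00
];
console.log(decisions); // [false, false, true]
```

With a 12.5x write-to-read ratio, this comparison only favors compression when the compressed size is below roughly 1/15 of the current size (0.85 / 12.5 ≈ 0.068), which is why only the 2M → 100K case passes.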

BYK added 2 commits May 15, 2026 17:08
…ecisions

Remove the adaptive context cap system (EMA bust rate tracking, dynamic cap
tightening/relaxation, MIN_CONTEXT_FLOOR) that caused death spirals by
ratcheting the cap below the session's token count.

Replace with a tier-based model using quality watermarks (200K/500K/model
limit) and per-turn economic bust-vs-continue decisions:

- setCachePricing(write, read): configures per-token cache costs from models.dev
- shouldCompress(current, compressed, busts): compares bust cost vs continue
  cost with 0.85 threshold — heavily favors NOT compressing since cache writes
  are 12.5x more expensive than reads on Opus
- getTier(tokens): maps token count to quality tier (0/1/2)
- recordCacheUsage(): tracks consecutive busts for rolling detection
- TransformResult.unsustainable: signals 5+ consecutive busts for user warning

The write-to-read cost ratio (12.5x on Opus) naturally makes shouldCompress
reject compression in most cases — only extreme ratios (e.g. 2M->100K) trigger
it. After 5 consecutive busts, compression stops entirely and the unsustainable
flag is set for warning injection.
…s, inject unsustainable warning

Address critical review findings:

C1: shouldCompress() is now called in the tier gate between layer-0
    passthrough and compression stages. When bust cost is not justified,
    the session stays at layer 0 with full context up to the model limit.

C2: unsustainable flag is consumed in pipeline.ts step 7c — injects a
    system-reminder warning into the last user message advising to compact
    or start a new conversation.

C3: consecutiveBusts is persisted to DB via the repurposed
    dynamicContextCap column in session_state (no migration needed).
    Restored on session load.

M2: recordCacheUsage now takes total inputTokens as denominator for bust
    ratio, not just cacheWrite+cacheRead — prevents inflated bust ratios
    when a large fraction of tokens is uncached.

M5: shouldCompress fallback (no pricing data) now returns false (don't
    compress) instead of true — conservative default matches the design
    principle of favoring cache preservation.

L4: resetCalibration now resets cache pricing globals for test isolation.
L5: Fixed misleading 'no artificial cap' comment.
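
To illustrate finding M2 above, the sketch below compares the two denominators and tracks consecutive busts per session. The parameter list follows recordCacheUsage(write, read, inputTokens, sessionID) from the PR, but the definition of a bust (cache writes making up at least half of the input) and the in-memory Map are assumptions made for illustration; the PR does not state the actual cutoff.

```typescript
const BUST_RATIO_CUTOFF = 0.5; // assumed cutoff; not specified in the PR

// Old-style ratio: denominator is cache tokens only.
function cacheOnlyBustRatio(write: number, read: number): number {
  const denom = write + read;
  return denom > 0 ? write / denom : 0;
}

// New-style ratio: denominator is total input tokens. The max() is a
// defensive assumption in case inputTokens is reported below the cache totals.
function totalInputBustRatio(write: number, read: number, inputTokens: number): number {
  const denom = Math.max(inputTokens, write + read);
  return denom > 0 ? write / denom : 0;
}

const consecutiveBustsBySession = new Map<string, number>();

// Returns the updated consecutive-bust count for the session.
function recordCacheUsage(
  write: number,
  read: number,
  inputTokens: number,
  sessionID: string,
): number {
  const isBust = totalInputBustRatio(write, read, inputTokens) >= BUST_RATIO_CUTOFF;
  const previous = consecutiveBustsBySession.get(sessionID) ?? 0;
  const next = isBust ? previous + 1 : 0; // any non-bust turn resets the streak
  consecutiveBustsBySession.set(sessionID, next);
  return next;
}

// A turn with 10K cache writes, no cache reads, and 200K total input tokens:
// the cache-only ratio is 1.0, while the total-input ratio is 0.05.
const inflated = cacheOnlyBustRatio(10_000, 0);
const corrected = totalInputBustRatio(10_000, 0, 200_000);
console.log({ inflated, corrected });
```

Under the cache-only denominator, that turn would count as a complete bust even though only 5% of the input was rewritten, which is the inflation M2 describes.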
@BYK BYK merged commit ed5b203 into main May 15, 2026
7 checks passed
@BYK BYK deleted the refactor/cost-aware-tier-system branch May 15, 2026 17:47