refactor: replace EMA-driven context cap with tier-based cost-aware decisions #348
Merged
…ecisions

Remove the adaptive context cap system (EMA bust rate tracking, dynamic cap tightening/relaxation, MIN_CONTEXT_FLOOR) that caused death spirals by ratcheting the cap below the session's token count. Replace it with a tier-based model using quality watermarks (200K/500K/model limit) and per-turn economic bust-vs-continue decisions:

- setCachePricing(write, read): configures per-token cache costs from models.dev
- shouldCompress(current, compressed, busts): compares bust cost vs continue cost with a 0.85 threshold; heavily favors NOT compressing since cache writes are 12.5x more expensive than reads on Opus
- getTier(tokens): maps token count to quality tier (0/1/2)
- recordCacheUsage(): tracks consecutive busts for rolling detection
- TransformResult.unsustainable: signals 5+ consecutive busts for user warning

The write-to-read cost ratio (12.5x on Opus) naturally makes shouldCompress reject compression in most cases; only extreme ratios (e.g. 2M->100K) trigger it. After 5 consecutive busts, compression stops entirely and the unsustainable flag is set for warning injection.
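To make the bust-vs-continue decision concrete, here is a minimal sketch of how getTier and shouldCompress might fit together. The function names and the 200K/500K, 0.85, and 5-bust constants come from the commit message; the cost formula is an assumption (a single-turn comparison of rewriting the compressed context at the cache-write price against re-reading the current context at the cache-read price), and the actual implementation in this PR may differ. Prices are in arbitrary units where only the write-to-read ratio matters.

```typescript
// Hypothetical sketch: names follow the commit message, bodies are assumptions.
type Tier = 0 | 1 | 2;

const TIER_0_MAX = 200_000; // assumed to mean 200,000 tokens
const TIER_1_MAX = 500_000; // assumed to mean 500,000 tokens
const COMPRESS_THRESHOLD = 0.85;
const MAX_CONSECUTIVE_BUSTS = 5;

// Per-token cache prices; null until setCachePricing is called.
let cacheWriteCost: number | null = null;
let cacheReadCost: number | null = null;

function setCachePricing(write: number, read: number): void {
  cacheWriteCost = write;
  cacheReadCost = read;
}

// Quality tier: 0 (best), 1 (acceptable), 2 (degraded).
function getTier(tokens: number): Tier {
  if (tokens <= TIER_0_MAX) return 0;
  if (tokens <= TIER_1_MAX) return 1;
  return 2;
}

function shouldCompress(
  currentTokens: number,
  compressedTokens: number,
  consecutiveBusts: number,
): boolean {
  const write = cacheWriteCost;
  const read = cacheReadCost;
  // Conservative fallback: without pricing data, keep the cache intact.
  if (write === null || read === null) return false;
  // After repeated busts, stop compressing entirely.
  if (consecutiveBusts >= MAX_CONSECUTIVE_BUSTS) return false;
  // Assumed model: busting rewrites the compressed context at the write
  // price; continuing re-reads the current context at the read price.
  const bustCost = compressedTokens * write;
  const continueCost = currentTokens * read;
  return bustCost < COMPRESS_THRESHOLD * continueCost;
}
```

Under this assumed formula and a 12.5x write-to-read ratio, compression only pays off when the context shrinks by more than roughly 14.7x (12.5 / 0.85). That matches the commit's observation that 2M->100K (20x) triggers compression, while something like 500K->100K (5x) does not.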
…s, inject unsustainable warning
Address critical review findings:
C1: shouldCompress() is now called in the tier gate between layer-0
passthrough and compression stages. When bust cost is not justified,
the session stays at layer 0 with full context up to the model limit.
C2: unsustainable flag is consumed in pipeline.ts step 7c — injects a
system-reminder warning into the last user message advising to compact
or start a new conversation.
C3: consecutiveBusts is persisted to DB via the repurposed
dynamicContextCap column in session_state (no migration needed).
Restored on session load.
M2: recordCacheUsage now takes total inputTokens as denominator for bust
ratio, not just cacheWrite+cacheRead — prevents inflated bust ratios
when a large fraction of tokens is uncached.
M5: shouldCompress fallback (no pricing data) now returns false (don't
compress) instead of true — conservative default matches the design
principle of favoring cache preservation.
L4: resetCalibration now resets cache pricing globals for test isolation.
L5: Fixed misleading 'no artificial cap' comment.
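As a rough illustration of findings C2 and M2, the sketch below tracks consecutive busts per session with total input tokens as the denominator. The recordCacheUsage signature follows the PR description; the 0.5 bust-ratio threshold, the in-memory Map, the Math.max guard, and the isUnsustainable helper are all hypothetical stand-ins, not the PR's actual code (which persists the counter to the database).

```typescript
// Hypothetical sketch: only the recordCacheUsage signature and the
// 5-bust limit come from the PR text; everything else is assumed.
const BUST_RATIO_THRESHOLD = 0.5; // assumed cutoff for calling a turn a "bust"
const UNSUSTAINABLE_AFTER = 5;

// In-memory stand-in for the persisted per-session counter.
const consecutiveBustsBySession = new Map<string, number>();

function recordCacheUsage(
  cacheWrite: number,
  cacheRead: number,
  inputTokens: number,
  sessionID: string,
): number {
  const previous = consecutiveBustsBySession.get(sessionID) ?? 0;
  // Denominator is total input tokens (finding M2). The Math.max guard is an
  // added assumption for callers whose inputTokens excludes cached tokens.
  const denominator = Math.max(inputTokens, cacheWrite + cacheRead);
  if (denominator <= 0) return previous; // nothing to measure this turn
  const bustRatio = cacheWrite / denominator;
  // A bust extends the streak; any non-bust turn resets it.
  const next = bustRatio > BUST_RATIO_THRESHOLD ? previous + 1 : 0;
  consecutiveBustsBySession.set(sessionID, next);
  return next;
}

// True once the streak is long enough to warrant the user-facing warning.
function isUnsustainable(sessionID: string): boolean {
  return (consecutiveBustsBySession.get(sessionID) ?? 0) >= UNSUSTAINABLE_AFTER;
}
```

The denominator choice matters when most of a turn is uncached: with 50 cache-write tokens, 0 cache-read tokens, and 1,000 total input tokens, dividing by cache tokens alone would give a ratio of 1.0, whereas dividing by total input gives 0.05 and the turn is not counted as a bust.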
Closed
This was referenced May 15, 2026
Summary
- Remove the adaptive context cap system (…, MIN_CONTEXT_FLOOR, maxContextTokensCeiling) that caused death spirals by ratcheting the cap below the session's actual token count

Key changes
New API
- setCachePricing(write, read): sets per-token cache costs from models.dev
- shouldCompress(currentTokens, compressedTokens, consecutiveBusts): bust-vs-continue economic decision with a 0.85 threshold. Heavily favors NOT compressing since cache writes are 12.5x more expensive than reads on Opus. Returns false (don't compress) when no pricing data is available.
- getTier(tokens): maps token count to quality tier (0: ≤200K best, 1: ≤500K acceptable, 2: >500K degraded)
- recordCacheUsage(write, read, inputTokens, sessionID): tracks consecutive busts using total input tokens as denominator (not just cache tokens)

Integration points
- transformInner: Between the layer-0 passthrough and compression stages, shouldCompress() is called. When bust cost isn't justified, the session stays at layer 0 with full context up to the model limit.
- When TransformResult.unsustainable is set (5+ consecutive busts), a <system-reminder> warning is injected into the last user message advising to compact or start fresh.
- consecutiveBusts is persisted via the repurposed dynamicContextCap column (no migration needed) and restored on session load.

Removed
- adaptContextCap(), computeContextCap(), setMaxContextTokens(), getMaxContextTokens(), updateBustRate()
- bustRateEMA, interBustIntervalEMA, lastBustAt, dynamicContextCap session state fields (DB columns retained, repurposed)
- effectiveCap constraint in transformInner: context now uses the full model window
- targetBustCost, maxContextTokens config fields (marked deprecated, still parsed to avoid breaking existing .lore.json)

Why
The old EMA system optimized for minimizing input_tokens per turn via a static dollar-based cap. This created a death spiral: high bust rate → tighten cap → session can't fit in lower layers → forced to Layer 4 → bust rate stays high → cap ratchets down further. The new tiers are quality-based (empirical model effectiveness), not pricing-based; token costs scale linearly regardless of tier.

Math verification (Opus 4.6)