What Changed

@chauncygu chauncygu released this 07 Jun 05:58
· 4 commits to main since this release
99a3b4a
  • June 5, 2026 (v3.05.82) (latest): User-controllable token / cost budgets — set a spend cap; on hit the session auto-saves and you can resume or raise it. The quota engine (quota.py: per-session + per-day token/cost counters, enforced before each model call) already existed but had no friendly surface — you had to know four config keys (session_token_budget / session_cost_budget / daily_token_budget / daily_cost_budget) and there was no way to see how close you were, no warning before the wall, and the hard stop printed a bare [Quota exceeded]. This adds the UX layer on top of the unchanged engine: a /budget command — no args shows usage vs every budget as colored bars + percentages; /budget $5 sets a session cost cap (the $ means USD), /budget 200k a session token cap (parses 200k / 1.5m / 200000), /budget daily $20 / /budget daily 2m the daily caps, and /budget clear removes all. A --budget $5 / --budget 200k startup flag sets the session cap at launch. Proximity warnings fire at the end of any turn that crosses ≥80% (yellow) / ≥95% (red) of a cap, so the wall never arrives by surprise. On hit the agent now yields a QuotaPause event (instead of a plain text line): the REPL auto-saves the session (session_latest.json + daily backup, the same path /resume reads) and prints a friendly next-steps block — raise the same cap or remove it (/budget clear) then resend, or restart later and /resume. So a long task that runs out of budget is never lost: you analyze, adjust, and continue. Tight enforcement (no surprise overshoot): the check projects the next request's input (compaction.estimate_tokens) and stops before the call if it would cross the cap, and clamps that call's max_tokens to the remaining headroom (quota.output_room) — so a single tool-heavy turn can't blow 40k→49k past the budget the way a pure "already-spent ≥ limit" check let it. 
One budget per scope: setting a cap replaces the other unit for that scope (/budget $5 after /budget 200k switches the session cap to cost rather than stacking), so a leftover token cap can't silently keep blocking after you switch to a $ cap. Unit-matched hint: QuotaExceeded / QuotaPause carry which cap broke (key/scope/unit/limit), so the "raise it" suggestion is in the right unit — a token cap shows /budget 40k, a daily cost cap shows /budget daily $40 — instead of a generic $ amount that wouldn't lift a token cap. New helpers quota.parse_budget / fmt_amount / usage_vs_limits / warnings / output_room; command in commands/core.py:cmd_budget; QuotaPause in agent.py; REPL handling + --budget in cheetahclaws.py; 42-case tests/test_budget.py (isolated quota dir, incl. a regression that the hint matches the breached unit and that switching units clears the stale cap). The daemon's conservative serve-mode defaults (200k tok / $2 per session, 2M / $20 per day) are unchanged — interactive stays unlimited by default, the server stays guard-railed. See docs/guides/features.md · docs/guides/reference.md.
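As a rough illustration of the budget syntax and the tight-enforcement rule described above, the sketch below parses `$5` / `200k` / `1.5m` / `200000`, computes how many output tokens a call may still use, and picks a warning color at the 80% / 95% thresholds. The function names, signatures, and return types are hypothetical stand-ins, not the actual `quota.parse_budget` / `quota.output_room` / `quota.warnings` APIs.

```python
import re
from typing import NamedTuple, Optional


class Budget(NamedTuple):
    unit: str      # "cost" (USD) or "tokens"; labels assumed for this sketch
    amount: float


_SUFFIX = {"": 1, "k": 1_000, "m": 1_000_000}


def parse_budget_sketch(text: str) -> Budget:
    """Parse '$5', '200k', '1.5m', or '200000' (hypothetical helper)."""
    m = re.fullmatch(r"(\$?)(\d+(?:\.\d+)?)([km]?)", text.strip().lower())
    if not m:
        raise ValueError(f"unrecognized budget: {text!r}")
    dollar, number, suffix = m.groups()
    value = float(number) * _SUFFIX[suffix]
    # A leading '$' means a cost cap in USD; otherwise it is a token cap.
    return Budget("cost", value) if dollar else Budget("tokens", int(value))


def output_room_sketch(spent: int, limit: int,
                       projected_input: int, requested_max: int) -> int:
    """Return the max_tokens to allow, or 0 if the call should not be made.

    The projected input of the next request is counted before the call,
    so the cap is checked against spent + input rather than spent alone.
    """
    headroom = limit - spent - projected_input
    if headroom <= 0:
        return 0
    return min(requested_max, headroom)


def warning_level_sketch(spent: int, limit: int) -> Optional[str]:
    """Map usage to the yellow (>=80%) / red (>=95%) proximity warning."""
    if spent * 100 >= limit * 95:
        return "red"
    if spent * 100 >= limit * 80:
        return "yellow"
    return None


if __name__ == "__main__":
    print(parse_budget_sketch("$5"))
    print(parse_budget_sketch("1.5m"))
    print(output_room_sketch(40_000, 45_000, 3_000, 8_192))
    print(warning_level_sketch(38_000, 40_000))
```

Returning 0 from the headroom helper is one way to signal "pause before the call"; the real engine raises `QuotaExceeded` / yields `QuotaPause` instead, as described above.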
  • June 5, 2026 (v3.05.82): Adaptive Markdown streaming: live output that stays correct on every device. In-place Rich Live redraw is great on capable terminals but breaks elsewhere: it was disabled wholesale over SSH (so SSH users got raw tokens with no formatting), and where it did run it could leave duplicate or stale frames on macOS Terminal (which can't erase above the scroll boundary), over laggy network PTYs, or with wide CJK / emoji text whose display width a naive line-count gets wrong. The renderer now selects a streaming tier per device in ui.render.auto_stream_mode(config). The live tier is the full in-place redraw, used only on terminals known to handle cursor-up (local TTYs, and modern emulators even over SSH: iTerm2, WezTerm, Windows Terminal, VSCode, kitty, Alacritty, Ghostty, detected via TERM_PROGRAM / TERM / WT_SESSION / KITTY_WINDOW_ID / ALACRITTY_WINDOW_ID / WEZTERM_PANE). The commit tier is append-only progressive Markdown, the safe default for unknown-SSH / Apple Terminal / pipes / non-TTY: each completed block (split on blank lines, respecting open code fences so a fenced block renders atomically) is rendered and printed permanently and the cursor is never moved, which makes a duplicate frame structurally impossible regardless of terminal, latency, or character width. The plain tier is raw tokens, used only when rich is unavailable. The append-only floor is provably duplication-free; live is progressive enhancement on top. Override with /config stream_mode=live|commit|plain (legacy boolean /config rich_live=true|false still works → live/commit). Implemented in ui/render.py (set_stream_mode / auto_stream_mode / _safe_commit_point / _commit_stream / _commit_flush), wired in at REPL start in cheetahclaws.py, with a 26-case test suite in tests/test_stream_modes.py (device routing, code-fence-aware block boundaries, append-only commit, and a regression asserting commit mode emits zero cursor sequences even on a TTY with CJK text).
Two related UX items shipped alongside: /context is now a visual grid — a Claude-Code-style 20×10 cell grid of context-window usage, colored and broken down by category (system prompt / system tools / memory files / skills / messages / free space) with per-category token counts and percentages, adapting to the model's real context window and falling back to #/. on non-UTF-8 terminals (commands/core.py:cmd_context); and deepseek-v4-flash is registered at its 1M context window in providers._MODEL_CONTEXT_LIMITS (overriding the 128K deepseek provider default, which still applies to deepseek-chat / deepseek-v4-pro), so the prompt %, /context, and the compaction trigger all reflect the true 1M window. See docs/guides/features.md · docs/guides/reference.md.
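The append-only commit tier described in this entry depends on finding a point in the streamed buffer where everything before it is a finished block. The sketch below shows one way to do that: scan complete lines, toggle a flag on code-fence lines, and treat a blank line as a commit point only when no fence is open. The function names are hypothetical and this is not the actual `_safe_commit_point` in `ui/render.py`; it only illustrates the rule "split on blank lines, respecting open code fences".

```python
from typing import Tuple

FENCE = "`" * 3  # a Markdown code-fence marker, built here to avoid a literal fence


def safe_commit_point_sketch(buffer: str) -> int:
    """Return the index up to which `buffer` can be printed permanently.

    Only complete lines are inspected. A blank line outside a code fence
    ends a block; blank lines inside an open fence are ignored so that a
    fenced block is emitted as one unit. Returns 0 if nothing is ready.
    """
    in_fence = False
    commit = 0
    pos = 0
    while True:
        newline = buffer.find("\n", pos)
        if newline == -1:
            break  # the trailing partial line is still streaming
        line = buffer[pos:newline].strip()
        pos = newline + 1
        if line.startswith(FENCE):
            in_fence = not in_fence
        elif not line and not in_fence:
            commit = pos
    return commit


def drain_sketch(buffer: str) -> Tuple[str, str]:
    """Split `buffer` into (text ready to render, text to keep buffering)."""
    cut = safe_commit_point_sketch(buffer)
    return buffer[:cut], buffer[cut:]


if __name__ == "__main__":
    stream = "Intro paragraph.\n\n" + FENCE + "python\nx = 1\n\ny = 2\n"
    ready, pending = drain_sketch(stream)
    print(repr(ready))    # the intro paragraph only; the fence is still open
    print(repr(pending))
```

Because the ready text is only ever appended and the cursor is never moved, a renderer built on this split cannot produce a duplicate frame; the cost is that an open fenced block stays invisible until its closing fence arrives.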