v0.9.0 stewardship integration #2762
Draft
Hmbown wants to merge 190 commits into
…metadata (#2682)
- Add model_visible() hook to ToolSpec trait (default true)
- Override model_visible() -> false on todo_write, todo_add, todo_update, todo_list
- Checklist variants remain model-visible as the canonical surface
- Legacy todo_* calls still work for saved transcript replay
- Return _deprecation metadata with use_instead and removed_in=0.9.0
- Update prompts to recommend checklist_* only
- Update TOOL_SURFACE.md with v0.9.0 deprecation notes
- Add tests for hidden catalog, compat alias behavior, and metadata
Verification: cargo test -p codewhale-tui -- todo, cargo clippy -D warnings
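The visibility hook described in this commit can be pictured with a small sketch. The trait shape, the `checklist_write` tool name, and the helper functions are simplified assumptions for illustration, not the actual codewhale-tui definitions.

```rust
/// Simplified stand-in for the ToolSpec trait (assumed shape).
trait ToolSpec {
    fn name(&self) -> &'static str;

    /// Whether the tool is advertised in the model-facing catalog.
    /// Defaults to true so existing tools need no changes.
    fn model_visible(&self) -> bool {
        true
    }
}

struct ChecklistWrite; // hypothetical canonical tool
struct TodoWrite; // legacy alias, hidden but still registered

impl ToolSpec for ChecklistWrite {
    fn name(&self) -> &'static str {
        "checklist_write"
    }
}

impl ToolSpec for TodoWrite {
    fn name(&self) -> &'static str {
        "todo_write"
    }
    fn model_visible(&self) -> bool {
        false
    }
}

/// Names advertised to the model: only visible tools.
fn model_catalog(tools: &[Box<dyn ToolSpec>]) -> Vec<&'static str> {
    tools
        .iter()
        .filter(|t| t.model_visible())
        .map(|t| t.name())
        .collect()
}

/// Dispatch lookup ignores visibility, so replayed transcripts still resolve.
fn is_callable(tools: &[Box<dyn ToolSpec>], name: &str) -> bool {
    tools.iter().any(|t| t.name() == name)
}

fn registry() -> Vec<Box<dyn ToolSpec>> {
    vec![Box::new(ChecklistWrite), Box::new(TodoWrite)]
}
```

With this split, hiding a tool is a catalog concern only. The dispatch path never consults model_visible(), which is how legacy calls keep working during replay.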
The token estimator walks the full session.messages and the active system
prompt. Five call sites per turn in the engine (capacity pre/post tool
checkpoints, error escalation, the seam manager, the trim budget check)
plus four TUI/command consumers (footer, /status, /debug, context
inspector) all re-walked the same data independently. On a 200-message
history with 5 KB of tool results that is roughly 2 ms per call, or
~20 ms of pure waste on a single turn.
Introduce a process-local TokenEstimateCache keyed on
(session.messages_revision, system_prompt_fingerprint). Repeated calls
with the same inputs return the cached value without re-walking the
message list. The cache invalidates as soon as either input changes:
* session.messages_revision is a monotonic counter bumped in
Session::add_message, Session::replace_messages, the new
Session::bump_messages_revision helper, and at every direct
session.messages mutation site in core/engine.rs and
core/engine/capacity_flow.rs.
* system_prompt_fingerprint is a stable 64-bit hash of the
SystemPrompt::Text or SystemPrompt::Blocks payload.
Also restructures layered_context_checkpoint to compute the estimated
token count before taking a long-lived &SeamManager borrow, and
re-routes the capacity pre/post tool checkpoints to compute the
observation into a local before calling
capacity_controller.observe_*. Both refactors are required to satisfy
the borrow checker once estimated_input_tokens requires &mut self.
Tests: 10 new unit tests cover the miss/hit path, revision bumps,
system-prompt changes, audit-ring capacity, and downward-revision
no-ops. The full 157-test engine suite still passes.
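A minimal sketch of the keyed cache described above, assuming a single cached entry and a crude byte-based estimator. The struct layout, `get_or_compute`, `fingerprint`, and `estimate_tokens` are illustrative names; the real TokenEstimateCache also has an audit ring that is not modelled here.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical single-entry cache keyed on
/// (messages_revision, system_prompt_fingerprint).
#[derive(Default)]
struct TokenEstimateCache {
    entry: Option<((u64, u64), usize)>,
    misses: u64,
}

impl TokenEstimateCache {
    /// Returns the cached estimate when both key parts match;
    /// otherwise runs `compute` and stores the result.
    fn get_or_compute(
        &mut self,
        revision: u64,
        fingerprint: u64,
        compute: impl FnOnce() -> usize,
    ) -> usize {
        let key = (revision, fingerprint);
        if let Some((cached_key, value)) = self.entry {
            if cached_key == key {
                return value;
            }
        }
        self.misses += 1;
        let value = compute();
        self.entry = Some((key, value));
        value
    }
}

/// Stand-in fingerprint; the commit only says "stable 64-bit hash".
fn fingerprint(prompt: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    prompt.hash(&mut hasher);
    hasher.finish()
}

/// Crude stand-in estimator: roughly four bytes per token.
fn estimate_tokens(messages: &[String], prompt: &str) -> usize {
    (messages.iter().map(|m| m.len()).sum::<usize>() + prompt.len()) / 4
}
```

Because `get_or_compute` takes `&mut self`, callers cannot hold another borrow of the owning struct across the call, which is the same constraint that forced the borrow-checker refactors mentioned above.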
PrefixFingerprint::compute is called once per turn by the turn loop prefix-stability check. The tool-side work serializes every tool to the chat-API JSON shape, sorts the resulting strings, joins with newlines, and SHA-256s the result. For a 60-tool catalog that is ~25-40 KB of allocation plus a sort, all of which produces a byte-identical output once the tool set is stable across turns (the common case after the first turn of a session).
Introduce a process-local ToolCatalogCache that stores the joined+sorted catalog under a content-derived u64 identity (length + per-tool name + description + serialized input_schema). On a hit, the per-tool JSON serialization, sort, and join are skipped entirely: the pre-computed SHA-256 hex digest is returned directly.
The cache lives on PrefixStabilityManager (per-session ownership) and backs a new PrefixFingerprint::compute_with_tool_cache entry point. check_and_update, PrefixStabilityManager::new, and pin() all use the cached path. The original compute() is kept as a fallback for callers that do not have a cache in hand (e.g. CLI tools that build a one-shot fingerprint).
The cache is bounded (default capacity = 8) and uses insertion-order eviction, matching the eviction strategy already in transcript_cache.rs. invalidate() is exposed for tool-registry hot-reload and MCP attach paths.
Tests: 8 new unit tests cover the miss/hit path (pointer-equal Arc on hit), identity collisions, schema change detection, capacity eviction, invalidate, empty slice, and the equivalence between cached and uncached fingerprints. The full 30-test prefix_cache suite passes; the wider prefix-cache contract tests in settings, prompts, and core::engine::tests continue to pass.
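The bounded, insertion-order cache can be sketched roughly as follows. `BoundedCache` and its methods are hypothetical names, and the cached payload is a plain `String` standing in for the pre-computed digest.

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::Arc;

/// Hypothetical bounded cache with insertion-order (FIFO) eviction.
struct BoundedCache {
    capacity: usize,
    map: HashMap<u64, Arc<String>>,
    order: VecDeque<u64>,
}

impl BoundedCache {
    fn new(capacity: usize) -> Self {
        Self {
            capacity: capacity.max(1),
            map: HashMap::new(),
            order: VecDeque::new(),
        }
    }

    /// On a hit the stored Arc is cloned (pointer-equal to the original);
    /// on a miss the oldest inserted entry is evicted if the cache is full.
    fn get_or_insert_with(&mut self, id: u64, build: impl FnOnce() -> String) -> Arc<String> {
        if let Some(hit) = self.map.get(&id) {
            return Arc::clone(hit);
        }
        if self.map.len() >= self.capacity {
            if let Some(oldest) = self.order.pop_front() {
                self.map.remove(&oldest);
            }
        }
        let value = Arc::new(build());
        self.map.insert(id, Arc::clone(&value));
        self.order.push_back(id);
        value
    }

    fn contains(&self, id: u64) -> bool {
        self.map.contains_key(&id)
    }

    /// Drop everything, e.g. after a tool-registry reload.
    fn invalidate(&mut self) {
        self.map.clear();
        self.order.clear();
    }
}
```

Note that a hit does not refresh an entry's position, so this is FIFO rather than LRU: a frequently used entry is still evicted once enough newer identities have been inserted.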
…with PrefixFingerprint::compute
Three follow-ups to the previous perf commit:
1. Correctness: tool.strict participates in the wire format emitted by
tool_to_api_json, so it MUST participate in the cache identity. Two
catalogs that differ only in strict would otherwise collide and serve
a stale SHA-256, silently busting prefix-cache stability on the wire.
2. Allocation: replace the per-tool serde_json::to_string in
tool_set_identity with a hash_json_value helper that walks the JSON
tree directly. For a 60-tool catalog this drops ~25-40 KB of
transient allocation per cache miss.
3. Dead code: the previous patch introduced PrefixFingerprint::compute,
CachedCatalog::joined, ToolCatalogCache::{invalidate,is_empty}, and a
thread-local cache helper that were not used outside tests. With
-D warnings in CI all four triggered dead-code errors. The compute
helper is now only built in cfg(test); the rest are marked
#[allow(dead_code)] with comments explaining their observability and
test-only use.
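Points 1 and 2 can be illustrated together. The sketch below hashes a JSON tree directly, without serializing it to a string, and folds `strict` into the identity. The `Json` enum is a minimal stand-in for serde_json's value type, and the `Tool` struct, `sample_tool`, and the use of `DefaultHasher` are assumptions made to keep the example dependency-free.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Minimal stand-in for a JSON value (assumption; the commit uses serde_json).
#[allow(dead_code)]
enum Json {
    Null,
    Bool(bool),
    Num(i64),
    Str(String),
    Array(Vec<Json>),
    Object(Vec<(String, Json)>),
}

/// Walk the tree and feed it to the hasher without allocating strings.
/// A tag byte per variant keeps differently shaped values distinct.
fn hash_json_value(value: &Json, h: &mut impl Hasher) {
    match value {
        Json::Null => 0u8.hash(h),
        Json::Bool(b) => {
            1u8.hash(h);
            b.hash(h);
        }
        Json::Num(n) => {
            2u8.hash(h);
            n.hash(h);
        }
        Json::Str(s) => {
            3u8.hash(h);
            s.hash(h);
        }
        Json::Array(items) => {
            4u8.hash(h);
            items.len().hash(h);
            for item in items {
                hash_json_value(item, h);
            }
        }
        Json::Object(fields) => {
            5u8.hash(h);
            fields.len().hash(h);
            for (key, field) in fields {
                key.hash(h);
                hash_json_value(field, h);
            }
        }
    }
}

/// Hypothetical tool record; `strict` is part of the wire format,
/// so it must be part of the identity too.
struct Tool {
    name: String,
    description: String,
    input_schema: Json,
    strict: Option<bool>,
}

fn tool_set_identity(tools: &[Tool]) -> u64 {
    let mut h = DefaultHasher::new();
    tools.len().hash(&mut h);
    for tool in tools {
        tool.name.hash(&mut h);
        tool.description.hash(&mut h);
        hash_json_value(&tool.input_schema, &mut h);
        tool.strict.hash(&mut h);
    }
    h.finish()
}

fn sample_tool(strict: Option<bool>) -> Tool {
    Tool {
        name: "read_file".to_string(),
        description: "Read a file".to_string(),
        input_schema: Json::Object(vec![
            ("type".to_string(), Json::Str("object".to_string())),
            ("required".to_string(), Json::Array(vec![Json::Str("path".to_string())])),
        ]),
        strict,
    }
}
```

Two catalogs that differ only in `strict` now hash to different identities, so the cache can no longer return a digest computed for the other wire shape.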
… pass
build_canonical_state previously did two independent reverse walks of session.messages: one to extract the most recent user goal, and one to collect up to four confirmed-fact snippets. apply_verify_and_replan then added a third and fourth reverse scan to locate the latest user message and the latest [verification replay] user message for the re-plan path.
All four reverse scans collect disjoint facts about the same most-recent-first view of the conversation. This PR folds them into a single helper, scan_canonical_inputs, that walks messages once in reverse, fills a CanonicalStateScan, and short-circuits as soon as every collector is satisfied. The helper exposes the latest-message indices so apply_verify_and_replan can clone the full Message values after the scan (eliminating the two independent find().cloned() walks).
The output CanonicalState is byte-identical to the prior implementation: same goal, same confirmed facts (newest first, errors filtered), same fallback string when no user text exists. The re-plan path's keep-messages set is identical: latest user + latest verified.
Tests: 6 new unit tests cover the goal lookup, fact cap, error-result filter, verified-marker scan, empty input, and the early-exit condition. The full engine test suite (153 tests) still passes.
…-user lookup
The build_canonical_state path never reads CanonicalStateScan::latest_verified_user_idx, but the previous patch required is_complete() to find a verified user message before it would short-circuit. On a long history with no verification replay (the common case), the scan walked the entire message list looking for a match that could not exist.
Add a find_verified: bool parameter to scan_canonical_inputs and CanonicalStateScan::is_complete. build_canonical_state now passes false, so the loop stops as soon as the goal and CANONICAL_SCAN_MAX_FACTS facts are found. The replan path (apply_verify_and_replan) keeps the existing true behavior so it still locates the latest verified user message.
Test calls are updated to match; no behavior change for any test.
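The single-pass scan with its optional verified-user lookup might look roughly like this. The `Message` and `Role` types, the rule that non-error tool results count as facts, and the `visited` counter are assumptions added for illustration; only the overall shape (one reverse walk, early exit gated by `find_verified`) comes from the commit text.

```rust
enum Role {
    User,
    Tool,
}

/// Simplified stand-in for the session message type.
struct Message {
    role: Role,
    text: String,
    is_error: bool,
}

const MAX_FACTS: usize = 4;
const VERIFIED_MARKER: &str = "[verification replay]";

#[derive(Default)]
struct Scan {
    goal: Option<String>,
    facts: Vec<String>,
    latest_user_idx: Option<usize>,
    latest_verified_user_idx: Option<usize>,
    visited: usize, // illustration only: how many messages were walked
}

impl Scan {
    /// The verified-user collector only blocks completion when requested.
    fn is_complete(&self, find_verified: bool) -> bool {
        self.goal.is_some()
            && self.facts.len() >= MAX_FACTS
            && (!find_verified || self.latest_verified_user_idx.is_some())
    }
}

/// One reverse walk that fills every collector and stops early.
fn scan_inputs(messages: &[Message], find_verified: bool) -> Scan {
    let mut scan = Scan::default();
    for (idx, msg) in messages.iter().enumerate().rev() {
        scan.visited += 1;
        match msg.role {
            Role::User => {
                if scan.latest_user_idx.is_none() {
                    scan.latest_user_idx = Some(idx);
                    scan.goal = Some(msg.text.clone());
                }
                if find_verified
                    && scan.latest_verified_user_idx.is_none()
                    && msg.text.starts_with(VERIFIED_MARKER)
                {
                    scan.latest_verified_user_idx = Some(idx);
                }
            }
            Role::Tool => {
                if !msg.is_error && scan.facts.len() < MAX_FACTS {
                    scan.facts.push(msg.text.clone());
                }
            }
        }
        if scan.is_complete(find_verified) {
            break;
        }
    }
    scan
}

fn user(text: &str) -> Message {
    Message { role: Role::User, text: text.to_string(), is_error: false }
}

fn tool(text: &str, is_error: bool) -> Message {
    Message { role: Role::Tool, text: text.to_string(), is_error }
}
```

On a history with no verification replay, passing `find_verified = false` lets the loop stop at the goal message, while `true` walks all the way back to the first message.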
output_rows (in tui::history) walks the raw tool output, ANSI-strips each line, classifies path/URL-like rows, and wraps the rest to the current viewport width. selected_output_indices then computes the head/tail/importance subset that the compact Live view shows.
Both functions are pure, but they are called on every render frame for every visible tool cell. For a 4 KB tool output on a 120 FPS render loop that is 2-6 redundant walks per frame, per cell, and the function is called from a non-trivial number of cells across exec, tool, command, and review history.
Add tui::output_rows_cache, a thread-local, content-addressed cache keyed on (content_hash, width) for the rows and (content_hash, width, line_limit) for the indices. The cache stores the wrapped Vec<OutputRow> plus a per-line-limit map of selected indices on a single entry, so a single key lookup satisfies both render steps.
render_preserved_output_mode now consults the cache for both the rows and the indices; on a hit, neither the per-line ANSI strip nor the importance-ranking pass runs. The cache is bounded (default capacity 256) with insertion-order eviction.
The OutputRow struct gains PartialEq + Eq + pub fields so the cache module can store and hash it without exposing private internals.
Tests: 6 new unit tests cover the hit/miss path, width invalidation, content invalidation, indices per-line_limit caching, capacity eviction, and hash stability. The wider tui::history test suite (68 tests) still passes.
Three follow-ups to the previous perf commit:
1. Drop the rows_hash field on CacheEntry. The field was computed and stored but never read on the hot path; tests exercised it only to assert the cache returned a stable hash. After this change get_or_compute_rows returns just Vec<OutputRow>, halving the tuple-return ABI and removing one DefaultHasher::write pass on every cache miss.
2. Replace DefaultHasher (SipHash) with a hand-rolled FNV-1a 64-bit hash. SipHash is per-process-keyed and ~5-10x slower than FNV on the small-to-medium tool output strings we see at 120 FPS. FNV-1a has no per-process key, fits in 20 lines of pure-Rust, and a 64-bit collision space is more than wide enough for the per-process LRU's expected <= a few hundred entries. The cache is a correctness optimization, not a security boundary; collisions only cause a false miss, never wrong data.
3. Caller in tui::history::render_preserved_output_mode updated to the new Vec<OutputRow>-only signature.
Two new tests cover the FNV-1a properties (length-suffix sensitivity, empty-input stability).
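For reference, a plain FNV-1a 64-bit hash is only a few lines. This sketch uses the standard offset basis and prime; the length-suffix step that the tests mention is not reproduced here, and `rows_cache_key` is a hypothetical helper showing how the hash could combine with the viewport width.

```rust
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// Plain FNV-1a over a byte slice: XOR the byte in, then multiply.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    let mut hash = FNV_OFFSET_BASIS;
    for &byte in bytes {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(FNV_PRIME);
    }
    hash
}

/// Hypothetical cache key: content hash plus viewport width, so a
/// resize produces a different key for the same output.
fn rows_cache_key(content: &str, width: u16) -> (u64, u16) {
    (fnv1a_64(content.as_bytes()), width)
}
```

The empty input hashes to the offset basis itself, which makes "empty-input stability" easy to assert in a test.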
- Add scroll state field to PlanPromptView with PgUp/PgDn, Ctrl+U/D/F/B, Home/End, gg/G vim-style keybindings
- Show scroll indicator footer when content overflows the popup
- Add confirming_exit state: Esc while scrolled asks for confirmation before discarding, preventing accidental exits on long plans
- Clamp scroll in render() so overscroll doesn't hide bottom options
- Use wrapped_line_count() with UnicodeWidthStr for accurate overflow detection with CJK characters
- Add 11 unit tests covering scroll, keybindings, and exit confirmation
- Use word-wrapping-aware line count to prevent underestimating scroll range (gemini-code-assist / greptile-apps)
- Merge PLAN_OPTIONS, PLAN_SHORTCUTS, PLAN_SHORT_LABELS into PlanOption struct (gemini-code-assist)
- Remove dead Esc code in handle_key (greptile-apps)
- Guard gg/G with modifier checks (gemini-code-assist)
- Increase PgUp/PgDn scroll amount from 6 to 12 (greptile-apps)
- Use u16::try_from for scroll value to avoid silent truncation (greptile-apps)
- Update related unit tests for new scroll values
… key
Use clamped (effective) scroll instead of raw `self.scroll` in the Esc handler so that a short plan that fits entirely (max_scroll == 0) never triggers the "exit without implementing?" dialog when the user pressed a scroll key (PgDn/Ctrl-D/G/End) beforehand.
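The difference between raw and clamped scroll in the Esc handler can be sketched as below. The struct fields, `on_esc`, and `max_scroll` are simplified assumptions; only the idea of comparing the effective (clamped) offset comes from the commit.

```rust
/// Simplified stand-in for the plan prompt view state.
struct PlanPromptView {
    scroll: usize, // raw offset, may exceed the real range after PgDn/G/End
    confirming_exit: bool,
}

/// Number of lines that can actually be scrolled past.
fn max_scroll(total_lines: usize, viewport_lines: usize) -> usize {
    total_lines.saturating_sub(viewport_lines)
}

impl PlanPromptView {
    /// The offset the renderer would really use.
    fn effective_scroll(&self, max_scroll: usize) -> usize {
        self.scroll.min(max_scroll)
    }

    /// Returns true when Esc should close the prompt immediately,
    /// false when it opens the exit-confirmation dialog instead.
    fn on_esc(&mut self, max_scroll: usize) -> bool {
        if self.effective_scroll(max_scroll) > 0 {
            self.confirming_exit = true;
            false
        } else {
            true
        }
    }
}
```

A plan that fits in the viewport has `max_scroll == 0`, so the effective scroll is always zero no matter how many scroll keys were pressed, and Esc exits without the dialog.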
…e wrap width
- wrapped_line_count: compute leading-space width via UnicodeWidthStr instead of byte length, so non-ASCII leading whitespace is measured correctly.
- render: hoist popup_area / content_width computation above plan rendering so wrap_text can share the same content_width derived from the actual popup geometry instead of a magic 68.
…rome
- Clear pending_g when Esc triggers the exit-confirmation prompt so a stray 'g' press does not leak into and survive the confirmation dialog.
- Move render_modal_chrome into the else branch so only one call fires per render pass, eliminating a shadow artifact when confirming_exit is active.
…nder work
- wrap_text: replace chars().count() with UnicodeWidthStr::width() so CJK text is wrapped by display columns, consistent with wrapped_line_count and ratatui's Paragraph::wrap. Also fix the hard-split loop to use exclusive byte ranges (..end) instead of inclusive (..=i) so multi-byte UTF-8 prefixes are always valid.
- render: hoist the confirming_exit branch to an early return so the plan-content construction (lines, scroll bounds, footer) is skipped entirely when the confirmation dialog is visible.
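A sketch of a hard-split loop that measures display columns and slices with exclusive byte ranges. `char_width` is a rough, hand-written approximation standing in for the unicode-width crate (its ranges are not exhaustive), and `hard_split` is a hypothetical name, not the actual wrap_text implementation.

```rust
/// Rough stand-in for a display-width lookup: common wide (CJK) ranges
/// count as two columns, everything else as one. Not exhaustive.
fn char_width(c: char) -> usize {
    match c as u32 {
        0x1100..=0x115F
        | 0x2E80..=0xA4CF
        | 0xAC00..=0xD7A3
        | 0xF900..=0xFAFF
        | 0xFE30..=0xFE4F
        | 0xFF00..=0xFF60
        | 0xFFE0..=0xFFE6 => 2,
        _ => 1,
    }
}

/// Split `text` into chunks of at most `max_width` display columns.
/// `char_indices` yields byte offsets at char boundaries, so the
/// exclusive slice `start..i` never cuts a multi-byte character.
fn hard_split(text: &str, max_width: usize) -> Vec<&str> {
    let mut lines = Vec::new();
    let mut start = 0;
    let mut width = 0;
    for (i, c) in text.char_indices() {
        let w = char_width(c);
        // Break before this char if it would overflow the line.
        if width + w > max_width && i > start {
            lines.push(&text[start..i]);
            start = i;
            width = 0;
        }
        width += w;
    }
    if start < text.len() {
        lines.push(&text[start..]);
    }
    lines
}
```

An inclusive range such as `start..=i` ends one byte past the start of a character, which lands inside any multi-byte character and panics when slicing a `&str`. Ending the slice at the next character's offset avoids that.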
…breaks at script boundaries
Wrap plan steps via wrap_text() before rendering, breaking only on display-width overflow, not on Latin/CJK Unicode word boundaries. Switch main render path from Wrap { trim: true } to Wrap { trim: false } since all content is pre-wrapped. Replace wrapped_line_count() with lines.len() for accurate scroll bounds. Keep confirm-exit dialog on Wrap { trim: true } (English-only, no risk).
#2683)
Subagent aliases:
- Legacy names (agent_spawn, agent_result, agent_cancel, resume_agent, agent_list, agent_send_input, agent_assign, agent_wait, delegate_to_agent) are already NOT registered; they exist as dead code with #[allow(dead_code)] since v0.8.33
- Add test verifying model catalog only advertises canonical subagent tools: agent_open, agent_eval, agent_close, tool_agent
Shell aliases:
- Hide exec_wait from model catalog (legacy alias for exec_shell_wait)
- Hide exec_interact from model catalog (legacy alias for exec_shell_interact)
- Both remain callable for saved transcript replay
- Add test verifying shell aliases are hidden but callable
Verification: cargo test -p codewhale-tui --locked (4040 passed), cargo clippy -D warnings
Root cause: AgentComplete unconditionally calls resume_terminal() even when the terminal was never paused, causing a secondary EnterAlternateScreen on Windows that creates a new buffer whose width may differ from the window width.
Additionally, ColorCompatBackend had no terminal_size cache, so size() fell through to crossterm::terminal::size() which on Windows returns the WinAPI buffer width rather than the window width.
Changes:
- AgentComplete: add event_broker.is_paused() guard
- resume_terminal(): cache real terminal size before reset_viewport
- Resize handler: also set terminal_size alongside forced_size
- subagent_routing: 3x mark_history_updated -> bump_history_cell(idx)
- color_compat: add terminal_size field, set_terminal_size(), fix size() fallback priority (forced_size > terminal_size)
- tests: 3 unit tests for size() fallback chain
Review feedback addressed:
- forced_size now takes priority over terminal_size (gemini-code-assist)
- Redundant map lookups removed in subagent_routing (both bots)
- set_terminal_size moved before reset_terminal_viewport (greptile-apps)
(cherry picked from commit 4463c46)
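The size() fallback chain reads naturally as a chain of Options. In this sketch the struct is a simplified stand-in for ColorCompatBackend, and the OS query is passed in as a closure (an assumption made so the example does not depend on crossterm).

```rust
/// Simplified stand-in for the backend's size-related state.
#[derive(Default)]
struct ColorCompatBackend {
    forced_size: Option<(u16, u16)>,
    terminal_size: Option<(u16, u16)>,
}

impl ColorCompatBackend {
    fn set_terminal_size(&mut self, cols: u16, rows: u16) {
        self.terminal_size = Some((cols, rows));
    }

    /// Priority: forced_size, then the cached terminal_size, and only
    /// then the OS query (which may report the buffer width on Windows).
    fn size(&self, query_os: impl FnOnce() -> (u16, u16)) -> (u16, u16) {
        self.forced_size
            .or(self.terminal_size)
            .unwrap_or_else(query_os)
    }
}
```

Because `unwrap_or_else` is lazy, the OS query is never invoked when either cached value is present, which is the behavior the fix relies on.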
Harvests provider-scoped TLS skip-verify from #1893 by @wavezhang. Disabled by default, active-provider-only, doctor-reported, and keeps SSL_CERT_FILE as the preferred custom CA path.
fix(tui): count workspace MCP servers in status surfaces
docs(runtime): document read-only VS Code Agent View APIs
Classify stream/body decode failures such as the #2847 report as recoverable network interruptions and add focused taxonomy coverage.
Verification:
- cargo test -p codewhale-tui error_taxonomy::tests --locked
- git diff --check
- cmp -s CHANGELOG.md crates/tui/CHANGELOG.md
- ./scripts/release/check-versions.sh
- ./scripts/release/check-ohos-deps.sh
Add a pure HarnessProfile resolver for provider/model routes while keeping runtime provider/model routing, prompts, tools, auth, context, and persisted config unchanged.
Verification:
- cargo test -p codewhale-config harness_profile --locked
- cargo fmt --all --check
- git diff --check
- cmp -s CHANGELOG.md crates/tui/CHANGELOG.md
- ./scripts/release/check-versions.sh
- ./scripts/release/check-ohos-deps.sh
Add recorded mock-trace replay coverage for workflows/rlm_cache_change.star and prove missing dogfood records produce ReplayDiverged instead of live fallback.
Verification:
- cargo test -p codewhale-whaleflow rlm_cache_change --locked
- cargo fmt --all --check
- git diff --check
- cmp -s CHANGELOG.md crates/tui/CHANGELOG.md
- ./scripts/release/check-versions.sh
- ./scripts/release/check-ohos-deps.sh
Summary
Current verification
Maintainer notes