Release What Changed · SafeRL-Lab/cheetahclaws

May 10, 2026 (latest, v3.05.79): Web Chat UI session organization + headless-bridges slash handler + stale-session reaper crash fix. Three threads of work merged into a single release. Bridges / headless deploys (#84 follow-up): Telegram / Slack / WeChat /help, /monitor, /model, /status produced zero response in Docker / --web deploys because _start_headless_bridges() only wired run_query and agent_state on the shared session_ctx — never handle_slash. The bridge poll loops gate on if slash_cb: and fell through to continue before the 📩 Telegram: log line, so the failure was invisible in docker compose logs -f. Fix: extracted the slash handler (originally inlined in repl()) into a module-level factory _make_bridge_slash_handler(state, config, run_query); both REPL and headless paths now use it (single source of truth, no future drift between modes). Stale-session reaper crash: web/api.py:reap_stale_chat_sessions() called remove_chat_session(sid) without the user_id the function now requires for ownership-check parity — every reaper tick raised TypeError, killing the daemon thread, so stale ChatSession objects accumulated forever in the in-memory cache. Fix: capture (sid, user_id) pairs from the cached ChatSession objects under _chat_lock, then apply outside the lock. Web UI session organization: five-feature bundle layered on top — folders + drag-drop + Move-to context menu, ChatGPT-style active-folder context (click a folder name → + New and direct-typing both drop new sessions into that folder, with a Chat · in <Folder> topbar breadcrumb), batch select with Select-all-respecting-search-filter, batch delete + combined-Markdown export (chats-N-sessions.md), and a 4-px draggable sidebar divider with localStorage persistence. Backend adds a folders table, chat_sessions.folder_id nullable FK, in-place PRAGMA table_info + ALTER TABLE migration in init_db(), and 5 new HTTP endpoints (GET/POST /api/folders, PATCH/DELETE /api/folders/{id}, PATCH /api/sessions/{id}/folder). 
Also rolled in: issue #111 (handle_slash_sync / handle_slash_stream no longer double-broadcast to WS) and --web --model X persistence. Tests: +16 new across test_web_api.py (folder CRUD, batch ops, reaper regression) and the new test_bridge_slash_handler.py (5 cases pinning the headless handler contract). Full suite: 2154 / 2154 passing, zero regressions. User-side guide: docs/guides/web-ui.md.
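The in-place migration described above can be sketched as follows. This is a minimal illustration using the standard `sqlite3` module, not the project's actual `init_db()`; every column other than `chat_sessions.folder_id` is an assumption made for the example.

```python
import sqlite3


def init_db(conn: sqlite3.Connection) -> None:
    """Create tables if missing, then add folder_id to older databases."""
    # Hypothetical minimal schemas; only folder_id is named in the notes.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS folders ("
        "id INTEGER PRIMARY KEY, user_id TEXT NOT NULL, name TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS chat_sessions ("
        "id TEXT PRIMARY KEY, user_id TEXT NOT NULL, title TEXT)"
    )
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(chat_sessions)")}
    if "folder_id" not in columns:
        # Nullable FK: existing rows get folder_id = NULL, meaning "no folder".
        conn.execute(
            "ALTER TABLE chat_sessions "
            "ADD COLUMN folder_id INTEGER REFERENCES folders(id)"
        )
    conn.commit()
```

Because the column check runs on every start, calling `init_db()` repeatedly is safe: the `ALTER TABLE` only executes on databases created before the folders feature.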
May 10, 2026: Web Chat UI fixes — slash commands no longer reply twice; --web --model X actually applies the model. Two related issues that surfaced when wiring a self-hosted vLLM endpoint into the Chat UI. (1) Issue #111 — slash commands duplicated in Chat UI but not in terminal. web/api.py:handle_slash_sync was both returning events inline in the HTTP response and broadcasting the same events to the WS subscribers of the same client; chat.js then iterated data.events AND fired _handleEvent from ws.onmessage, rendering every reply twice. Same bug in handle_slash_stream for SSE-streamed long commands (/brainstorm, /worker, /agent, /plan). Both helpers now deliver events through a single channel — HTTP/SSE only — so _handleEvent runs exactly once per event. Background-thread events (sentinel flows, agent runs) are unaffected: by the time the worker thread emits, _broadcast is already restored to the live WS broadcaster in finally. (2) --web --model X was silently ignored. The CLI override branch only ran in the interactive-REPL path; the if args.web: branch loaded config straight from disk and started the server, so python cheetahclaws.py --web --model custom/qwen2.5-72b would happily boot but every request handler reloaded ~/.cheetahclaws/config.json with the previous model name (e.g. gemma-4-31B-it), producing a confusing 404: model does not exist against the new endpoint. Fix: cheetahclaws.py now persists args.model to config before calling start_web_server, matching the documented behavior; provider:model → provider/model normalization is identical to the REPL path. User-side guide: docs/guides/web-ui.md (Troubleshooting + Architecture notes updated).
May 10, 2026: Small-context local models survive large workloads — 4-part fix: ctx cap, auto-fanout, stagnation-stop, output paths under ~/.cheetahclaws/. Repro that motivated the work: running /agent → 1 (Research Assistant) on a 6.6 MB PDF (AutoRedTeamer.pdf — ~70k tokens of extracted text) with custom/qwen2.5-72b (32k ctx). Old behavior: 400 BadRequest "context length 32768"; the agent_runner kept polling the template every 2 s; the model produced 1500+ identical "task complete" summaries before anything stopped it. New behavior, four cooperating layers: (1) Per-model context-window registry + dynamic max_tokens cap (providers._MODEL_CONTEXT_LIMITS + get_model_context_window + dynamic_cap_max_tokens) — covers Qwen 2.5/3, Llama 3.x, Mistral/Mixtral, Phi, Gemma, DeepSeek local variants; _fetch_custom_model_limit now backfills PROVIDERS["custom"]["context_limit"] so compaction sees the live /v1/models value; per-call shrink based on actual prompt size keeps input + output + 1024 safety ≤ ctx. compaction.get_context_limit gains an optional config arg so custom-endpoint detection works on the very first turn. (2) Auto-fanout for oversize tool outputs (multi_agent/fanout.py) — when a single tool result (Read on a huge PDF, Grep over a giant tree, WebFetch of a long article) exceeds 0.4 × ctx_window, split into chunks at paragraph boundaries with token-overlap, dispatch parallel sub-LLM map calls (one per chunk, default cap 5 subagents), merge with a single reduce call; substitutes the merged summary in conversation history instead of letting the next API call overflow. Hooked at the tool-result append site in agent.py; transparent UX prints [Auto-fanout: <Tool> returned ~N chars (>threshold) → dispatching K parallel sub-summaries]. Configurable: auto_fanout_enabled / _threshold / _max_subagents / _chunk_overlap_tokens. 
(3) Stagnation-stop in agent_runner.py — when the model emits the same summary N iterations in a row (default 3, whitespace/case-normalized), stop the loop with a clear notification instead of burning thousands of API calls; configurable via auto_agent_dup_summary_limit (0 disables). (4) Agent output paths under ~/.cheetahclaws/ — /agent wizard now resolves relative output filenames (e.g. research_notes.md) to absolute paths under ~/.cheetahclaws/agents/<name>/output/ instead of CWD; AgentRunner exposes runner.output_dir, eagerly mkdir'd; Summary block + post-start info show the resolved path in green; absolute paths pass through unchanged. Tests: +47 new (fanout 23, ctx cap 18, dup-stop 13, output paths 8). Full suite: 2139 passing, zero regressions. User-side guide: docs/guides/extensions.md.
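A rough sketch of the stagnation-stop idea from (3), assuming the runner sees one summary string per iteration. The class and method names are hypothetical; only the default of 3, the whitespace/case normalization, and the "0 disables" rule come from the notes above.

```python
import re


def normalize_summary(text: str) -> str:
    """Collapse whitespace and lowercase so cosmetic differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


class DupSummaryDetector:
    """Signals a stop once the same summary repeats `limit` times in a row."""

    def __init__(self, limit: int = 3):
        self.limit = limit  # 0 disables the check
        self._last = None
        self._streak = 0

    def should_stop(self, summary: str) -> bool:
        if self.limit <= 0:
            return False
        key = normalize_summary(summary)
        if key and key == self._last:
            self._streak += 1
        else:
            # A new (or empty) summary restarts the streak.
            self._last = key
            self._streak = 1
        return self._streak >= self.limit
```

In a polling loop, the runner would call `should_stop()` after each iteration and break out with a notification when it returns `True`.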
May 9, 2026: Read tool auto-redirects on overflow — defense-in-depth for the case where model ignores the template instruction. Re-running the same /agent + autodan.pdf failure showed two real-world problems with the prior fix: (1) The user was running the pip-installed binary (/home/shangdinggu/anaconda3/bin/cheetahclaws), not the source tree. New tools / templates added to source had no effect. (2) Even if the user reinstalled, qwen2.5-72b would likely still call Read instead of SummarizeLargeFile — models default to familiar tools no matter what the template says. The fix moves the routing decision into the Read tool itself. (a) New _maybe_redirect_to_summarize helper (tools/files.py). When Read or ReadPDF would return content too large to safely fit in the next API call, it instead returns a short redirect message like [ReadTooLarge: file is too large — call SummarizeLargeFile with file_path='X' instead] PREVIEW: …. The model sees the redirect, calls SummarizeLargeFile, gets a chunked-and-merged summary back. The raw content never enters the API call. (b) CJK-aware token estimation. CJK content tokenizes at ~1 token per character (vs ~2.8 chars/token for English). New _is_cjk_heavy() heuristic: ≥20% CJK characters → use 1:1 char-to-token estimate. A 24K-char Chinese file is 24K tokens, not 8.6K, and now triggers redirect on a 32K-context model. (c) Conservative ceiling for unreliable provider declarations. custom/<model> provider declares 128K context by default but the underlying model is often 32K (qwen2.5-72b, llama 3 8B, etc.). New safe_ctx = min(declared_ctx, 30000) caps the threshold at 30K tokens regardless of provider claims — the redirect now fires on the user's exact ~25K-token PDF case (would NOT have fired with the unconditional 128K ceiling, which is exactly the bug). (d) Wrapped Read registration (tools/__init__.py). 
New _read_with_overflow_check lambda calls _maybe_redirect_to_summarize after _read returns; for results <8KB it skips (not worth the check). ReadPDF gets the same treatment inline in _read_pdf. Why this works even on the old install: as soon as the user updates tools/files.py and tools/__init__.py, the redirect fires regardless of whether SummarizeLargeFile / template changes are present. The redirect's prose tells the model exactly which tool to call and with what args. Tests: 14 new pytest cases (tests/test_read_overflow_redirect.py) — CJK detection (English / Chinese / Japanese / mixed-minority / empty), threshold logic (small file → no redirect; user's exact failure case → redirect with right pointer; CJK at lower char count triggers vs same chars in English; conservative ceiling protects against overconfident provider; preview included for context). Plus 2 integration tests via execute_tool("Read", ...) confirming the wrapper applies the redirect end-to-end. 2077 targeted regression tests pass (2063 prior + 14 new), zero regressions across the whole repo.
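The token arithmetic behind (b) and (c) can be illustrated with a small sketch. The Unicode ranges below are an assumption (the notes do not say which code points `_is_cjk_heavy()` counts), and the function names are illustrative rather than the ones in tools/files.py. The redirect threshold itself is not specified above, so it is left out.

```python
def is_cjk_char(ch: str) -> bool:
    """Assumed ranges: common CJK ideographs, kana, and Hangul syllables."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF   # CJK Extension A
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7AF   # Hangul syllables
    )


def is_cjk_heavy(text: str, ratio: float = 0.20) -> bool:
    """True when at least `ratio` of the characters are CJK."""
    if not text:
        return False
    cjk = sum(1 for ch in text if is_cjk_char(ch))
    return cjk / len(text) >= ratio


def estimate_tokens(text: str) -> int:
    """About 1 token per char for CJK-heavy text, about 2.8 chars per token otherwise."""
    if is_cjk_heavy(text):
        return len(text)
    return int(len(text) / 2.8)


def safe_ctx(declared_ctx: int, ceiling: int = 30_000) -> int:
    """Never trust a provider-declared window above the conservative ceiling."""
    return min(declared_ctx, ceiling)
```

With these estimates, a 24,000-character Chinese file counts as 24,000 tokens while the same length of English counts as roughly 8,571, and a declared 128K window is treated as 30K.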
May 9, 2026: Multi-agent map-reduce SummarizeLargeFile tool — solves the "file too big for model context" problem at the source. Re-running the same /agent + autodan.pdf failure case showed the SAFETY_BUFFER bumps were still band-aids — even with 2500-token buffer the prompt re-tokenization sometimes ate ~1K, leaving no margin. The real fix: when a file is too big for the model's context, chunk it and run multiple sub-LLM agents in parallel then merge. This makes file size irrelevant. (a) New SummarizeLargeFile(file_path, focus="") tool (tools/files.py). Reads any-size file (PDF / txt / md / code), estimates tokens, and: if it fits in (model_ctx - 8.5K_reserved) tokens → single-shot summary; otherwise → splits into N chunks (number adaptive to file size: 200KB on 32K-context model → ~4 chunks; 200KB on 200K-context → 2 chunks), summarizes each chunk in parallel via ThreadPoolExecutor (up to 8 workers), then a reduce step merges all chunk summaries into one unified output. Per-chunk failures are logged inline as [chunk N: error] markers so one flaky source doesn't sink the whole job. Returns the final summary as the tool result. Registered with read_only=True, concurrent_safe=True. (b) /summarize <path> [focus] slash command (commands/advanced.py:cmd_summarize). Thin wrapper around the same helper for direct user invocation — handy for quickly summarizing a paper or large code file without spinning up a full /agent flow. (c) research_assistant.md template updated. Step 2 of "each iteration" now tells the agent to prefer SummarizeLargeFile over Read for academic papers (handles chunking + never overflows context regardless of length). Falls back to Read for tiny (< 5KB) files. (d) Quick band-aid: SAFETY_BUFFER 1000 → 2500 in _try_reduce_output_cap_from_error. Even with the new tool, output-cap auto-reduction is still useful for the rare case where Read is called on a moderately big file. 
The 2500-token (~7.6% of 32K) buffer now absorbs the +1K vLLM decoder-priming variance we observed in the wild. Tests: 18 new pytest cases (tests/test_summarize_large_file.py) — token estimator parametrized cases, chunk planner adaptiveness (small file → 1 chunk; size scales monotonically; larger context → fewer chunks; chunks have overlap; chunks cover all content), file reader dispatch (text / missing / directory rejected), full pipeline (small → single-shot, big → map-reduce with N≥3 map calls + 1 reduce), tool registration + schema check. 2063 targeted regression tests pass (2045 prior + 18 new), zero regressions. Golden prompt fixture regenerated for the new /summarize command in the help index.
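The chunk, map, and reduce flow described in (a) might look roughly like the sketch below. It splits by character count rather than estimated tokens, and `summarize` is a hypothetical callable standing in for the sub-LLM call, so treat it as an outline of the technique rather than the tool's real code.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def split_with_overlap(text: str, chunk_chars: int, overlap_chars: int) -> List[str]:
    """Fixed-size chunks where each chunk repeats the tail of the previous one."""
    if chunk_chars <= overlap_chars:
        raise ValueError("chunk_chars must exceed overlap_chars")
    step = chunk_chars - overlap_chars
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_chars])
        if start + chunk_chars >= len(text):
            break
        start += step
    return chunks or [""]


def map_reduce_summarize(
    text: str,
    summarize: Callable[[str], str],
    chunk_chars: int = 20_000,
    overlap_chars: int = 500,
    max_workers: int = 8,
) -> str:
    chunks = split_with_overlap(text, chunk_chars, overlap_chars)
    if len(chunks) == 1:
        return summarize(chunks[0])  # fits in one call: single-shot summary

    def safe_map(item):
        index, chunk = item
        try:
            return summarize(chunk)
        except Exception as exc:
            # One failing chunk is recorded inline instead of aborting the job.
            return f"[chunk {index}: error] {exc}"

    workers = min(max_workers, len(chunks))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(safe_map, enumerate(chunks, start=1)))
    # Reduce step: one final call merges the per-chunk summaries.
    return summarize("\n\n".join(partials))
```

`pool.map` preserves chunk order, so the reduce step sees partial summaries in document order even though the map calls run in parallel.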
May 9, 2026: Two follow-up fixes after re-running the same /agent failure case. The previous patch wasn't enough: running the user's exact scenario again still showed 1st call prompt 24577 + cap 8192 = 32769 fail → auto-reduction fired → 2nd call prompt 24778 + cap 7991 = 32769 fail again. The prompt grew by 201 tokens between attempts (the provider re-tokenized differently on retry), which consumed the entire 200-token safety buffer and overshot the limit by one token. In addition, the agent_runner's consecutive-failure detector kept resetting because agent.py alternates between [Failed ...] and [Circuit breaker ...] markers, so the signature-matched counter went 1 → 1 → 1 → 1 forever. (a) Bumped SAFETY_BUFFER 200 → 1000 in _try_reduce_output_cap_from_error. ~3% headroom on a 32K window absorbs provider-side tokenization variance. User's case: new safe cap = 32768 - 24577 - 1000 = 7191, which still fits after the prompt grows (24778 + 7191 = 31969 ≤ 32768). (b) agent_runner now counts ANY failure, not just signature-matched ones. A new parallel counter, consecutive_any_failures, increments on ANY [Failed] / [Circuit breaker] marker regardless of signature and trips at 4 consecutive iterations. The [Failed → Circuit breaker → Failed → ...] alternation now stops the agent at iteration 4 instead of looping forever. The updated stop message clarifies whether the trip was "same identical failures" or "consecutive mixed failures". 8 existing tests updated for the new buffer; 2045 targeted regression tests pass.
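To see why the alternating markers defeated the signature-matched counter and how the second counter catches them, here is a small sketch. The class name, the marker regex, and the sample marker strings are assumptions; the limits (3 identical, 4 of any kind) and the two stop reasons are taken from the changelog entries.

```python
import re
from typing import Optional

# Assumed marker shape: a bracketed block starting with "Failed" or "Circuit breaker".
FAILURE_MARKER = re.compile(r"\[(Failed|Circuit breaker)[^\]]*\]")


class FailureTracker:
    def __init__(self, same_limit: int = 3, any_limit: int = 4):
        self.same_limit = same_limit
        self.any_limit = any_limit
        self.last_signature: Optional[str] = None
        self.consecutive_same = 0
        self.consecutive_any_failures = 0

    def observe(self, iteration_text: str) -> Optional[str]:
        """Return a stop reason, or None to keep iterating."""
        match = FAILURE_MARKER.search(iteration_text)
        if match is None:
            # A successful iteration resets both counters.
            self.last_signature = None
            self.consecutive_same = 0
            self.consecutive_any_failures = 0
            return None
        signature = match.group(0)[:80]
        if signature == self.last_signature:
            self.consecutive_same += 1
        else:
            # A different marker restarts the signature streak at 1.
            self.last_signature = signature
            self.consecutive_same = 1
        self.consecutive_any_failures += 1
        if self.consecutive_same >= self.same_limit:
            return "same identical failures"
        if self.consecutive_any_failures >= self.any_limit:
            return "consecutive mixed failures"
        return None
```

Feeding it a Failed / Circuit breaker / Failed / Circuit breaker sequence keeps `consecutive_same` at 1 throughout, while `consecutive_any_failures` reaches 4 and ends the loop.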
May 9, 2026: Three fixes for the context-overflow + circuit-breaker doom loop. User report: /ssj 15 → Research Assistant pointed at a large PDF, model qwen2.5-72b (32K context), output cap 8192, prompt 24577 input tokens → total 32769 → 1 token over the limit. Every API call returned the same BadRequestError. The retry loop hit the same error 5 times in 60s → circuit breaker opened (120s cooldown). After cooldown the agent runner retried with the SAME config → re-opened the breaker → cycle continued forever, generating hundreds of circuit_open_skip log lines. Three coordinated fixes break the loop. (a) agent.py auto-reduces output cap on context overflow. New _try_reduce_output_cap_from_error parses the explicit token counts from the error message (max=32768, requested=8192, prompt=24577) and computes a safe new cap = model_max - prompt_tokens - 200_buffer. In the user's case: 32768 - 24577 - 200 = 7991, which fits. The retry uses the new cap WITHOUT consuming the attempt budget; bounded to ONE auto-reduction per turn so a true overflow (prompt itself too big to fit any reasonable output) eventually surfaces. Tolerant regex matches both OpenAI-style and Anthropic-style overflow messages. Falls through to existing _force_compact path if numbers can't be parsed or the safe cap < 256. (b) agent_runner.py stops after N consecutive identical failures. Track each iteration's failure signature (the [Failed ...] or [Circuit breaker ...] marker text from agent.py's output, capped at 80 chars). When 3 in a row match, stop the agent with a clear notify message naming the underlying error. Prevents the doom loop where a fundamentally broken request (context too big for compaction to fix, missing API key, unauthorized model) keeps re-running every 2s for hours. (c) agent_runner.py honors circuit-breaker cooldown. When iteration text contains [Circuit breaker OPEN ... 
Cooldown: Xs], parse Xs and wait that long (capped at 5 min) instead of the configured 2s interval before next iteration. Avoids 60+ wasted iterations per single 120s cooldown. Tests: 8 new pytest cases (tests/test_context_overflow_recovery.py) — parser reproduces user's exact failure → 7991 cap, no-op when current cap already fits, give-up when safe cap < 256, OpenAI vs Anthropic phrasing tolerance, regex match for circuit-breaker cooldown extraction, regex match for [Failed / [Circuit breaker markers in real outputs. 2045 targeted regression tests pass (2037 prior + 8 new), zero regressions.
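The arithmetic in (a) and the cooldown handling in (c) can be sketched as below. The error-message wording matched by the regex is an assumption modeled on one OpenAI-style phrasing, and the circuit-breaker sample text is likewise hypothetical; the notes say the real parser is more tolerant.

```python
import re
from typing import Optional

# Assumed phrasing: "... maximum context length is 32768 tokens. However, you
# requested 32769 tokens (24577 in the messages, 8192 in the completion)."
_OVERFLOW = re.compile(
    r"maximum context length is (\d+) tokens.*?"
    r"\((\d+) in the messages, (\d+) in the completion\)",
    re.S,
)
_COOLDOWN = re.compile(r"\[Circuit breaker OPEN[^\]]*?Cooldown:\s*(\d+(?:\.\d+)?)s\]")


def reduce_output_cap(
    error_message: str, safety_buffer: int = 200, min_cap: int = 256
) -> Optional[int]:
    """Return a smaller output cap that fits, or None to fall back to compaction."""
    m = _OVERFLOW.search(error_message)
    if m is None:
        return None  # numbers could not be parsed
    model_max, prompt_tokens, requested = (int(g) for g in m.groups())
    safe_cap = model_max - prompt_tokens - safety_buffer
    if safe_cap < min_cap:
        return None  # prompt leaves no room for a useful reply
    if safe_cap >= requested:
        return None  # current cap already fits; nothing to reduce
    return safe_cap


def next_wait_seconds(
    iteration_text: str, default_interval: float = 2.0, max_wait: float = 300.0
) -> float:
    """Honor an announced breaker cooldown, capped at five minutes."""
    m = _COOLDOWN.search(iteration_text)
    if m is None:
        return default_interval
    return min(float(m.group(1)), max_wait)
```

Passing `safety_buffer=1000` or `safety_buffer=2500` reproduces the later buffer bumps with the same formula.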
May 9, 2026: /brainstorm v2: programmatic backstops + ranked synthesis + --bg background mode. Three coordinated additions that make brainstorm output usable even when the lead model is weak (qwen2.5 etc.) and let users keep working while the debate runs. (a) Programmatic action-plan filter (commands/advanced.py). Two new helpers _extract_ban_keywords(opening) and _filter_action_plan(synthesis_md, ban_keywords). After _lead_synthesis returns, the action plan is regex-scanned (case-insensitive substring) against a built-in default ban list — consult an advisor, diversify your portfolio, monitor regularly, 考虑, 咨询, 定期监控, 多元化, 咨询财务顾问, 分散投资, 关注市场动态 and dozens more, English + Chinese — PLUS topic-specific bans extracted from quoted strings ("..." / 「...」) in the lead's own opening. Matched items are dropped with a _(programmatic self-check removed N action(s))_ note appended. Deterministic — runs regardless of whether the lead model actually executed its prompt-side SELF-CHECK instruction. The user-reported failure case where qwen2.5 banned "consult an advisor" in the opening but still wrote "明天与财务顾问讨论" as Action Plan item #10 is now caught at the code level. (b) Ranked synthesis enforcement. The _lead_synthesis prompt's ## Consensus section is renamed to ## Ranked Consensus with a mandatory **Ranked by: <metric>** header (metric extracted from the user's topic — "highest expected return" / "best refactor impact" / etc.) and items must be numbered with a → Why this rank: <one sentence> line. Programmatic backstop _consensus_is_ranked regex-checks for ≥2 numbered items in the section; if missing, ONE fallback LLM call asks the lead to rank. If the fallback also fails to produce a ranking, the original ships unchanged (no crash). (c) Background mode --bg (or --background). New flag spawns a daemon thread, returns the REPL immediately. 
Stage progress (Lead opening, Round 2/3 (cross-examination), Synthesis) prints from the thread and interleaves with the user's typing — acceptable trade-off for a freed REPL. New /brainstorm status subcommand shows all in-flight bg brainstorms with their current stage + elapsed time + output path. Implementation uses recursion: when --bg is set, the thread re-enters cmd_brainstorm with _bg_recursion=True markers in config that bypass the interactive prompts (which would block on stdin) and suppress the TODO-generation sentinel (no REPL is listening for it). Module-level _BG_BRAINSTORMS dict is mutex-locked so /brainstorm status reads a clean snapshot. Finished brainstorms older than 1h are pruned from status to keep the list useful; running ones never prune regardless of age. Tests: 27 new pytest cases (tests/test_brainstorm_v2_advanced.py) — ban-keyword extraction (defaults + opening-quoted), action-plan filter (English + Chinese + no-section + all-clean), ranking detector (proper / unranked-bullets / no-section / single-item), _ensure_consensus_is_ranked (no-op when ranked + LLM call when not + keep-original on LLM failure), --bg flag parsing (7 cases including --background alias + flag-position-tolerance + --bgmode not matching), bg registry (register/set_stage/complete/snapshot + sort + 1h-prune-finished + keep-running-regardless). 2037 targeted regression tests pass (2010 prior + 27 new), zero regressions across the whole repo. Doc: docs/guides/brainstorm.md adds --bg row to the flag table + new "Programmatic backstops on the synthesis" section + tip "use --bg for long debates so you can keep working".
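As an illustration of the programmatic action-plan filter in (a), the sketch below drops list items in the action-plan section that contain a banned phrase. The default list is a small subset of the keywords quoted above, and the section and item detection rules are assumptions; the shipped helpers may differ.

```python
import re
from typing import List

# Subset of the built-in bans quoted in the notes.
DEFAULT_BANS = [
    "consult an advisor",
    "diversify your portfolio",
    "monitor regularly",
    "咨询",        # "consult"
    "定期监控",    # "monitor regularly"
    "分散投资",    # "diversify investments"
]


def extract_ban_keywords(opening: str) -> List[str]:
    """Defaults plus anything the lead quoted with "..." or 「...」 in its opening."""
    quoted = re.findall(r'"([^"]+)"|「([^」]+)」', opening)
    extra = [a or b for a, b in quoted]
    return DEFAULT_BANS + [k for k in extra if k not in DEFAULT_BANS]


def filter_action_plan(synthesis_md: str, ban_keywords: List[str]) -> str:
    """Remove banned items from the action-plan section only."""
    kept, removed, in_plan = [], 0, False
    for line in synthesis_md.splitlines():
        if line.startswith("## "):
            # Assumed rule: the plan is the section whose heading mentions it.
            in_plan = "action plan" in line.lower()
        is_item = in_plan and re.match(r"\s*(\d+\.|[-*])\s+", line)
        # Case-insensitive substring match against every banned phrase.
        if is_item and any(k.lower() in line.lower() for k in ban_keywords):
            removed += 1
            continue
        kept.append(line)
    if removed:
        kept.append(f"_(programmatic self-check removed {removed} action(s))_")
    return "\n".join(kept)
```

Because the filter is plain string matching, it behaves the same no matter how well the lead model followed its own SELF-CHECK instruction.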
May 9, 2026: /brainstorm --ground: pre-fetch real /research data so personas debate against facts. Closes the biggest remaining gap in the brainstorm pipeline. Until now /brainstorm was pure-reasoning (no_tools=True on every persona) — fine for design / refactor / strategy questions, but useless for data-hungry topics like stocks / current events / recent news where personas would confidently invent prices and tickers from training memory. New --ground (or --ground=N for top-N cap, clamped to [3, 50], default 15) runs research.aggregator.research() on the topic BEFORE the debate starts, formats the top results as a compact ### GROUNDING DATA markdown block, and inlines that into the snapshot every persona / lead opening / lead synthesis sees. Persona round-1 instructions gain "you MUST cite specific results by [N] when your claim relates to one — do not invent figures the data doesn't show." Lead opening detects the grounding block and anchors the agenda to it ("forbid any claim that contradicts the grounding data without citing it"). Lead synthesis takes a new grounding= kwarg and the prompt requires every consensus claim to trace to either a [N] result OR a specific persona claim — un-traceable claims must be DROPPED. Failure-tolerant: any exception from the research aggregator (network, missing API keys, all sources 429) is caught silently — _fetch_grounding returns "" and the brainstorm continues un-grounded with a logged warning. Cost: 10-30s for the fetch, but cached for 24h via the existing /research SQLite cache so back-to-back runs on the same topic are basically free. Composes cleanly with --rounds, --lead, --models. SSJ interactive flow gains a new Ground in /research data first? [y/N] prompt right after Rounds; default N so existing usage is unchanged. 
Tests: 18 new pytest cases (tests/test_brainstorm_grounding.py) — 8 flag-parse cases including bound-clamping + four-flag composition, brief-formatting shape + sort + char-budget + empty-results, three fetch-graceful-degradation paths (raises / empty brief / happy path), backward-compat for _lead_synthesis(grounding=). 2010 targeted regression tests pass (1992 prior + 18 new), zero regressions across the whole repo. Doc: docs/guides/brainstorm.md "Data-hungry topics" section rewritten with examples + tip "always pass --ground for any topic touching the real world".
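The flag handling and the fail-soft fetch could be sketched like this. The clamp range [3, 50] and default 15 come from the entry above; everything else is an assumption, including the result shape (dicts with `title` and `url` keys), the injected `research_fn`, and the choice to raise on a non-numeric value.

```python
import logging
from typing import Callable, List, Optional, Tuple

log = logging.getLogger("brainstorm")


def parse_ground_flag(args: str) -> Tuple[Optional[int], str]:
    """Return (top_n or None when grounding is off, remaining topic text)."""
    top_n: Optional[int] = None
    rest = []
    for token in args.split():
        if token == "--ground":
            top_n = 15  # documented default
        elif token.startswith("--ground="):
            value = token.split("=", 1)[1]
            if not value.isdigit():
                # Assumed behavior; the notes do not cover this case.
                raise ValueError(f"--ground expects a number, got {value!r}")
            top_n = max(3, min(50, int(value)))  # clamp to [3, 50]
        else:
            rest.append(token)
    return top_n, " ".join(rest)


def fetch_grounding(
    topic: str, top_n: int, research_fn: Callable[[str], List[dict]]
) -> str:
    """Build the grounding block, or return "" so the debate runs un-grounded."""
    try:
        results = research_fn(topic)[:top_n]
    except Exception as exc:
        log.warning("grounding fetch failed, continuing un-grounded: %s", exc)
        return ""
    if not results:
        return ""
    lines = ["### GROUNDING DATA"]
    for n, item in enumerate(results, start=1):
        # [N] labels give personas something concrete to cite.
        lines.append(f"[{n}] {item.get('title', '')} ({item.get('url', '')})")
    return "\n".join(lines)
```

Matching whole tokens keeps a topic word such as `--grounded` from being mistaken for the flag.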
May 9, 2026: /brainstorm output-quality guards: fix 5 real bugs surfaced from a live transcript. Reviewing brainstorm_outputs/brainstorm_20260509_000935.md exposed five concrete failures the structural changes alone didn't catch. (a) All persona letters were P: the code ran letter, name = get_identity(persona_name[0].upper()) while the persona dict keys are p1/p2/…, so the first character was always p and every Agent ended up labeled P ("Agent P quoting Agent P attacking Agent P"). Letters now come from a stable persona_identity map keyed by index → A, B, C, D, E… (capped at Z). (b) The same persona's NAME re-rolled every round because get_identity was called fresh and Faker is random: round 1's "Riley Torres" became round 2's "Alex Lopez". persona_identity is now sealed once before the rounds loop. (c) Round 2+ challenges were verbatim copy-paste: qwen2.5 saw the first persona's CHALLENGE block in history and cloned it (8 of 10 round-2/3 challenges in the failing transcript were >95% identical). New _extract_challenge_blocks + _jaccard_similarity + _is_redundant_challenge (threshold 0.7) guards: when a round-2+ persona's CHALLENGE is too similar to a prior one, the lead force-regenerates ONCE with an explicit "pick a different target / different angle" nudge; if still redundant, the contribution is kept but tagged _[lead note: contribution flagged as redundant]_ so the synthesizer can ignore it. (d) Lead synthesis contradicted itself: it listed "consult an advisor" in What Was Filler, then included "明天与财务顾问讨论" ("discuss with a financial advisor tomorrow") as Action Plan item #10. _lead_synthesis now takes the lead's own opening text as context and the prompt explicitly forces a SELF-CHECK before writing the action plan: "if any action matches a banned escape hatch, REWRITE or DELETE." (e) Weak lead models silently produce flat output: qwen2.5-72b leading qwen2.5-72b is the same model on both sides with no real moderation. 
New _is_weak_lead_model family check (qwen / qwq / gemma / phi-3 / mistral-7b / llama-3.2 / kimi-7b / minimax-text / abab / etc.); when triggered, prints a one-line warning suggesting --lead claude-opus-4-7 or the free --lead nim/deepseek-ai/deepseek-r1. Never silently overrides — just informs. Plus a new docs/guides/brainstorm.md "When NOT to use /brainstorm" section: the panel runs with no_tools=True so it can't pull live data — bad fit for stocks / current events / repo-specific code; good fit for architecture decisions / refactor strategy / risk assessment / API design. Tests: 28 new pytest cases (extraction + Jaccard + redundancy + weak-lead + synthesis-with-opening). 297 targeted regression tests pass.
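Two of the guards above lend themselves to a short sketch: the index-based letter map from (a) and the Jaccard redundancy check from (c). Word-level tokenization and the `>=` comparison are assumptions, since the notes only give the 0.7 threshold; the function names are illustrative.

```python
import re
import string
from typing import Dict, Iterable, List


def build_persona_identity(persona_keys: List[str]) -> Dict[str, str]:
    """Assign letters by position (A, B, C, ...), capped at Z."""
    letters = string.ascii_uppercase
    return {key: letters[min(i, 25)] for i, key in enumerate(persona_keys)}


def jaccard_similarity(a: str, b: str) -> float:
    """Size of the shared word set divided by the size of the combined word set."""
    tokens_a = set(re.findall(r"\w+", a.lower()))
    tokens_b = set(re.findall(r"\w+", b.lower()))
    union = tokens_a | tokens_b
    if not union:
        return 0.0  # nothing to compare
    return len(tokens_a & tokens_b) / len(union)


def is_redundant_challenge(
    candidate: str, prior: Iterable[str], threshold: float = 0.7
) -> bool:
    """True when the candidate is too close to any earlier challenge."""
    return any(jaccard_similarity(candidate, p) >= threshold for p in prior)
```

Building the identity map once, before the rounds loop, is what keeps a persona's letter stable across rounds.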
May 9, 2026: /brainstorm round 2+ becomes adversarial cross-examination. Previous round-2+ prompt asked personas to "engage with what others said" but that was too soft — weak models defaulted to "agree-and-extend" or just continued their own line, producing N rounds of polite parallel monologues instead of a real debate. Three coordinated changes flip round 2+ into mandatory adversarial mode. (a) Persona round-2+ prompt rewrite (commands/advanced.py:call_persona). Each persona MUST: quote a specific claim from another agent verbatim (by letter), attack a specific weakness (data wrong / mechanism doesn't produce outcome / confounder ignored / claim un-falsifiable / contradicts stronger claim), AND propose a falsifiable counter-claim with a specific number/date/named entity. Structured format ### [CHALLENGE → Agent X] so weak models can follow. Politeness ("great point", "I agree, and would add", restating without attacking) is explicitly FORBIDDEN. Synthesis is the lead's job, not the persona's. (b) Round-aware lead probe (_lead_probe). Round 1 keeps the existing concrete-vs-vague check. Round 2+ uses a different probe that fires on DODGES — a polite agreement, a synthesis, or a defense-only reply that doesn't quote and attack another agent earns a probe demanding "Agent X said '...'. Attack it or accept it — your call, but commit. Quote and refute, don't dodge." (c) Lead opening warns about cross-examination upfront. Opening prompt now ends with explicit rule: "in any round after the first, each expert MUST quote a specific claim from another expert and either attack with a counter-claim OR explicitly accept it. Polite agreement counts as a dodge." UI label changes too — ── Round 2/3 (adversarial cross-examination — agents must attack each other's claims) ──. Tests: 3 new round-aware probe cases (round-2 polite-agreement gets probed; round-2 real challenge passes; round-1 still uses old vague check — captured so a future round-2 change can't regress round 1). 
269 targeted regression tests pass.
May 8, 2026: /ssj brainstorm: interactive Rounds prompt. Tiny UX follow-up to the multi-round /brainstorm landing — /ssj → 1 (Brainstorm) now asks Rounds [1=monologues, 2=critique (default), 3-6=more debate] > right after the existing "How many agents?" prompt, so SSJ users can dial in debate depth without remembering the --rounds N CLI flag. Behaviour: when the user invokes /brainstorm --rounds 3 … directly via the slash-command line, the explicit value wins and the prompt is skipped (no double-asking). Telegram / web bridge sessions still skip the prompt entirely (no interactive input channel) and use the documented default of 2 rounds.
May 8, 2026: /brainstorm: real multi-round debate + tighter post-Write contract. Two follow-up fixes after the lead-moderator landing. (a) Multi-round debate (commands/advanced.py). Previous flow ran every persona exactly once — even with the lead moderator, that's three monologues stapled together, not a debate. New --rounds N flag (default 2, capped to [1, 6]) wraps the persona iteration in an outer rounds loop. Round 1 is initial positions (existing prompt). Round 2+ uses a different system prompt that explicitly forbids repeating: "Read the prior debate. Pick 1-2 specific claims from OTHER agents that you disagree with, can sharpen, or that change your view. Quote and engage. Do NOT re-list your round-1 ideas." Lead probes still fire after each persona in each non-final round. The synthesis prompt's transcript is rebuilt from brainstorm_history directly so adding new header rows can't mis-slice it again. Composes with --lead <model> and --models a,b,c: /brainstorm --rounds 3 --lead claude-opus-4-7 --models gpt-5,nim/deepseek-ai/deepseek-r1 redesign auth. (b) Tighter TODO prompt (cheetahclaws.py). The previous "do not echo / do not Read" prompt didn't stop qwen2.5 from Write → echo content as text → Bash ls to verify (with truncated path due to vLLM streaming) → echo content again. New prompt is numbered STRICT RULES: call Write EXACTLY ONCE; do NOT call Read; do NOT call Bash to verify; do NOT echo file content after Write; after Write succeeds, your turn ENDS. Both REPL and Telegram handlers updated. Tests: 9 new pytest cases (--rounds parser including bound-clamping + non-numeric rejection + three-flag composition). 266 targeted regression tests pass. The Bash-args truncation symptom (ls /srv/.../cheetahcla cut mid-path) is a vLLM hermes-parser streaming bug at the model server, not fixable on the client side; the tighter prompt avoids the Bash call entirely.
May 8, 2026: Three fixes for /monitor + /research stability — multi-word topics + aggregator deadlock + REPL Ctrl+C. Two distinct bugs reported on a /ssj → 17 (Trend Track) flow with the topic "Agent OS Benchmark". (a) Topic truncated to first word (commands/monitor_cmd.py:_parse_subscribe_args). The previous parser did args.split() and treated the FIRST whitespace token as the topic, dropping the rest. So /subscribe research:7d:Agent OS Benchmark daily became topic=research:7d:Agent + the rest was either silently dropped or mis-classified as flags. The new rule: walk left-to-right, peel off --flag tokens into channels, then if the LAST remaining token is in _VALID_SCHEDULES it's the schedule — everything before joined by single spaces is the topic. Correctly handles ai_research, ai_research weekly, custom:quantum computing weekly, research:7d:Agent OS Benchmark daily, research:7d:Agent OS Benchmark (default schedule), and edge cases. 12 new pytest cases (tests/test_subscribe_parser.py). (b) Aggregator deadlocked on slow source then killed REPL on Ctrl+C (research/aggregator.py:190). The with concurrent.futures.ThreadPoolExecutor(...) context manager calls shutdown(wait=True) on __exit__, which BLOCKS waiting for any in-flight worker to finish. When as_completed(timeout=...) fires its TimeoutError because one source is hung on a stuck socket, control unwinds into the __exit__ and joins the hung thread. Then the user Ctrl+Cs to escape, the KeyboardInterrupt fires during the join, and Python's atexit hook _python_exit ALSO joins the same threads — double-blocking, then atexit kills the process and the user is dumped to bash. Fix: switch to manual try/finally with shutdown(wait=False, cancel_futures=True) (Python 3.9+) so partial results return immediately; the hung worker keeps running as a daemon thread and dies silently with the process. 
Both _cf.TimeoutError and KeyboardInterrupt paths now mark unfinished sources with a status entry ("timeout (aggregator deadline exceeded)" or "interrupted by user") instead of dropping them silently. (c) REPL: Ctrl+C during a slow slash command killed the process (cheetahclaws.py:1368). The REPL did result = handle_slash(user_input, state, config) with NO try/except, so a KeyboardInterrupt during /monitor run, /research, /trading backtest, etc. unwound the call stack all the way to main() → sys.exit() → atexit. Fix: wrap the REPL slash dispatch in try / except KeyboardInterrupt → print '(command interrupted)' → continue so Ctrl+C cancels the command and returns to the prompt. Also wrapped the SSJ inner re-dispatches at lines 1420/1430 (__ssj_passthrough__ and __ssj_cmd__) so Ctrl+C from inside a slow SSJ-launched command bounces back to the SSJ menu instead of killing the REPL. 257 targeted regression tests pass.
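The parsing rule from (a) is easy to state in code. In this sketch the schedule set is a guess (only `daily` and `weekly` appear above), a missing schedule is returned as `None` rather than a concrete default, and a lone word is always treated as the topic; those three choices are assumptions.

```python
from typing import List, Optional, Tuple

# Hypothetical contents; the real _VALID_SCHEDULES is not listed in the notes.
VALID_SCHEDULES = {"hourly", "daily", "weekly"}


def parse_subscribe_args(args: str) -> Tuple[str, Optional[str], List[str]]:
    """Return (topic, schedule or None, channels) for a /subscribe argument string."""
    channels, words = [], []
    for token in args.split():
        if token.startswith("--"):
            channels.append(token[2:])  # peel off --flag tokens as channels
        else:
            words.append(token)
    schedule = None
    # Only the LAST remaining token may be the schedule; a lone word stays the topic.
    if len(words) > 1 and words[-1] in VALID_SCHEDULES:
        schedule = words.pop()
    # Everything left, joined by single spaces, is the topic.
    return " ".join(words), schedule, channels
```

For example, `research:7d:Agent OS Benchmark daily` keeps the full three-word topic and reads `daily` as the schedule, instead of truncating the topic at the first space.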
May 8, 2026: /brainstorm gets a real lead moderator + read-only tool dedup. Two coordinated changes that turn /brainstorm from "round-robin echo chamber that produces filler advice" into "moderated debate with a structured master plan", and stop weak models from re-Reading the same file twice. (a) Lead moderator (commands/advanced.py). Three new in-process stages (no main-agent invocation, no tool calls — the whole pipeline lives inside cmd_brainstorm): (i) Opening — lead frames the agenda, names the concrete artifact this debate must produce (e.g. "specific tickers with thesis, not 'consider semiconductors'"), and lists 2-3 cheap escape hatches that will be REJECTED ("consult an advisor", "diversify", "monitor regularly"). The opening becomes the persona system-prompt's "DEBATE ANCHOR" so every persona writes against the same bar. (ii) Probe — after each persona speaks, lead reads their contribution and either replies NO_PROBE (concrete enough) or asks one ≤25-word follow-up that demands a specific commitment; the persona then gets one more swing answering the probe. (iii) Synthesis — lead produces the final master plan with four named sections (Consensus / Dissents / Concrete Action Plan / What Was Filler), with the consensus matrix tagging each claim with the agent letters that backed it. New --lead <model> flag lets you point lead at a stronger model than the default (/brainstorm --lead claude-opus-4-7 --models gpt-5,deepseek-r1 redesign auth). Composes cleanly with the existing --models a,b,c flag. (b) Eliminates the duplicate-Read bug. The previous flow returned a sentinel that asked the main agent to Read the brainstorm file and synthesize — qwen2.5 + vLLM cheerfully Read it twice and echoed the entire 4 KB master plan as text twice (also writing a different much shorter content via Write — a separate tool-call truncation issue). 
The new sentinel inlines the lead's master plan directly in the TODO-generation prompt, so the main agent only writes the TODO file. No Read, no rewrite. The old _save_synthesis step is now a no-op (everything is written inside cmd_brainstorm). (c) Read-only tool dedup (agent.py). Defense-in-depth even outside brainstorm: when the model fires Read/Glob/Grep/WebFetch/WebSearch with identical args twice within a single run(), the 2nd call is short-circuited — execute_tool is skipped (saves time), ToolStart/ToolEnd UI yields are suppressed (no ⚙ Read(...) printed twice), a brief [deduped Read: already in context] text marker is yielded so the user still knows what happened, and a synthetic [deduped] reminder is appended as the tool_result so the model sees "you already called this; use the content already in your context" — both nudging the model AND keeping the OpenAI/Anthropic tool_calls ↔ tool_response pairing valid. Write/Edit/Bash are explicitly NOT deduped (those can be intentional rewrites). Tests: 19 new pytest cases (8 lead helpers + 4 dedup integration via fake provider stream + 7 flag-parse). 245 targeted regression tests pass.
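The read-only dedup in (c) can be pictured as a per-run set of already-seen calls. The sketch below is hypothetical: the class name, the key format, and the exact reminder wording are assumptions, and the real logic in agent.py also handles UI events that are omitted here.

```python
import json
from typing import Any, Dict, Optional, Set, Tuple

# Tools named in the release note as safe to dedupe; Write/Edit/Bash are
# deliberately absent because repeating them can be intentional.
READ_ONLY_TOOLS = frozenset({"Read", "Glob", "Grep", "WebFetch", "WebSearch"})


class ReadOnlyDedup:
    """Tracks read-only tool calls seen during a single run() (hypothetical helper)."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[str, str]] = set()

    def check(self, name: str, args: Dict[str, Any]) -> Optional[str]:
        """Return a synthetic tool_result for a repeated call, else None."""
        if name not in READ_ONLY_TOOLS:
            return None
        # sort_keys makes the key independent of argument order.
        key = (name, json.dumps(args, sort_keys=True, default=str))
        if key in self._seen:
            return (f"[deduped] You already called {name} with these arguments; "
                    "use the content already in your context.")
        self._seen.add(key)
        return None
```

Returning a string for the duplicate (rather than dropping the call) mirrors the point made above: every tool call still gets a paired tool response, so the provider message format stays valid.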
May 8, 2026: /ssj brainstorm hot-fixes: absolute path in synthesis prompt + tool dispatch hardened against empty args. Two bugs surfaced when a user ran /ssj → 1 (Brainstorm) on custom/qwen2.5-72b. (a) commands/advanced.py:244: the synthesis prompt leaked a relative path. The brainstorm synthesizer was injecting out_file (a Path resolved relative to cwd) into the model's prompt as brainstorm_outputs/brainstorm_<ts>.md. The model, obeying the system prompt's "always use absolute paths" rule, invented an absolute prefix and guessed wrong (in this case …/PR/cheetahclaws/brainstorm_outputs/…, a stale sibling source tree it had never been told existed). The Read failed, so the synthesis ran on no actual evidence. Fix: out_file.resolve() before formatting + an explicit "use this path verbatim, do NOT prepend any directory" line. (b) tools/__init__.py:459-471: the permission-prompt description used inputs['file_path'] instead of inputs.get(...). When a weak model fired a tool_call with empty arguments (qwen2.5 + vLLM hermes-parser is a documented offender; see the "Be agentic on every model" entry below), the wrapper raised KeyError: 'file_path' before the registered ToolDef's friendly "Error: missing required parameter 'file_path'" lambda ever ran. The user saw Error executing Write: KeyError: 'file_path' and the model couldn't self-correct. Fix: .get(..., '<missing path>') for the Write/Edit/NotebookEdit description and .get('command', '') or '' for Bash, so the inner ToolDef's friendly error always reaches the model. Bash's _is_safe_bash already tolerates empty input. Tests: 9 new pytest cases (tests/test_tool_dispatch_robustness.py): empty args on Write/Edit/Read/Bash/NotebookEdit must return a friendly string and never leak KeyError to the agent loop. 226 targeted regression tests pass.
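Both fixes in this entry are small defensive patterns. The sketch below shows them in isolation; `describe_tool_call` and `synthesis_path_line` are hypothetical names, and the description strings are illustrative rather than the real permission-prompt text.

```python
from pathlib import Path
from typing import Any, Dict


def describe_tool_call(name: str, inputs: Dict[str, Any]) -> str:
    """Build a permission-prompt description without assuming any key exists."""
    if name in ("Write", "Edit", "NotebookEdit"):
        # .get() with a placeholder instead of inputs['file_path'], so empty
        # arguments never raise KeyError before the tool's own validation runs.
        return f"{name}({inputs.get('file_path', '<missing path>')})"
    if name == "Bash":
        # The trailing `or ''` also covers an explicit None value.
        return f"Bash({inputs.get('command', '') or ''})"
    return f"{name}(...)"


def synthesis_path_line(out_file: Path) -> str:
    """Embed an absolute path so the model has no prefix left to guess."""
    return (f"Read {out_file.resolve()} "
            "(use this path verbatim, do NOT prepend any directory)")
```

With empty arguments the wrapper now produces a harmless description, and the inner tool definition is left to report the missing parameter in its own words.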
May 8, 2026: NVIDIA NIM free-tier provider + 429 cascade fallback + multi-model /brainstorm. Three small, focused additions — borrowed selectively from sibling forks (Falcon for NIM, Dulus for the multi-model debate idea) — that lower the barrier to entry for users without paid API keys and tighten epistemic diversity in brainstorming. (a) NIM provider (providers.py). New nim entry registered against https://integrate.api.nvidia.com/v1 (build.nvidia.com — free signup, no payment info), curated 10-model chain (deepseek-r1, deepseek-v3.1, llama-3.3-70b, llama-3.1-405b, nemotron-70b, mixtral-8x22b, qwen2.5-72b, qwen2.5-coder-32b, phi-3-medium, gemma-2-27b). All listed in COSTS as $0 so the UI doesn't show "unknown" for free-tier usage. Invocation: cheetahclaws --model nim/<vendor>/<model> — the double-prefix preserves NIM's upstream <vendor>/<name> form through detect_provider + bare_model. (b) 429 cascade fallback (agent.py). When a NIM model returns rate-limit (ErrorCategory.RATE_LIMIT), the agent loop calls nim_next_model() to pick the next model in the curated chain and retries — without consuming a regular retry slot. Capped at _NIM_FALLBACK_LIMIT = 3 swaps per turn so a fully-throttled tier can't busy-loop; after the cap, falls through to the standard exponential-backoff retry path. Disabled by setting nim_auto_fallback=False in config. Other providers (anthropic / openai / etc.) are not affected — the swap is gated by detect_provider() == "nim". (c) Multi-model /brainstorm (commands/advanced.py). New --models a,b,c flag distributes models round-robin across personas (/brainstorm --models claude-opus-4-7,gpt-5,nim/deepseek-ai/deepseek-r1 redesign auth) so a 5-persona session alternates 1, 2, 3, 1, 2 instead of running every persona on the same model. Single-model brainstorm is an echo chamber — different model families have different training data and blind spots, so multi-model debate buys real epistemic diversity. 
Each persona's section in the output Markdown is tagged with the model that produced it (## 🏗️ Architect _(via gpt-5)_) so the synthesizer can weight by source. Borrowed in spirit from Dulus's RoundtableAgent; the existing /brainstorm flow is unchanged when --models is omitted. Tests: 21 new pytest cases (tests/test_nim_provider.py 12 + tests/test_brainstorm_models_flag.py 9) covering provider registration, chain cycling (cycle-through + wraparound + unknown-model head fallback), 429 swap-then-succeed, fallback-cap-then-fallthrough, fallback-disabled honor, non-NIM no-leak, flag parsing across --models a,b,c / --models=a,b,c / flag-at-end / provider-prefixed IDs / single model. 217 targeted regression tests pass, zero regressions. Skipped by design: ia-web-parser's WebToolParser — Cheetahclaws' existing _extract_native_tool_calls already covers 4 marker formats (Gemma official + asymmetric, Hermes, Mistral) plus channel-tagged form and args recovery, so the streaming-vs-buffered UX delta wasn't worth the duplication.
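The chain cycling, the swap cap, and the round-robin persona assignment described in this entry can be sketched together. Everything here is illustrative: `RateLimited` stands in for the provider's rate-limit error category, the chain is shortened to three of the listed models, and the helper names are assumptions rather than the functions in providers.py or agent.py.

```python
from typing import Callable, List, Sequence


class RateLimited(Exception):
    """Stand-in for a provider error categorised as a rate limit."""


# Shortened, illustrative chain; the release note lists ten models.
NIM_CHAIN = ["deepseek-r1", "deepseek-v3.1", "llama-3.3-70b"]
FALLBACK_LIMIT = 3  # mirrors the cap of 3 swaps per turn described above


def next_model(current: str, chain: Sequence[str] = NIM_CHAIN) -> str:
    """Next model in the chain, wrapping around; unknown models restart at the head."""
    if current not in chain:
        return chain[0]
    return chain[(chain.index(current) + 1) % len(chain)]


def call_with_fallback(call: Callable[[str], str], model: str,
                       limit: int = FALLBACK_LIMIT) -> str:
    """Retry on rate limits by swapping models, at most `limit` times."""
    swaps = 0
    while True:
        try:
            return call(model)
        except RateLimited:
            if swaps >= limit:
                raise  # caller falls through to its normal backoff path
            model = next_model(model)
            swaps += 1


def assign_models(personas: Sequence[str], models: Sequence[str]) -> List[str]:
    """Distribute models round-robin: five personas over three models gives 1, 2, 3, 1, 2."""
    return [models[i % len(models)] for i in range(len(personas))]
```

Re-raising once the cap is hit keeps the swap logic separate from ordinary retries, which matches the statement that swaps do not consume a regular retry slot.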
May 8, 2026 (earlier): "Be agentic on every model" pass — explore-first prompt + qwen overlay + runtime auto-nudge. A user reported cheetahclaws --model custom/qwen2.5-72b replying "please tell me which file you mean" when handed a directory path, instead of just ls-ing it. Three coordinated defenses, layered so any one of them is enough to fix the failure mode on any model: (a) prompts/base/default.md — new "Investigate Before Asking" section + softened Stop Conditions. Every model now gets explicit "default to action over conversation" framing: a directory is not "missing information", it's an invitation to enumerate; AskUserQuestion is reserved for genuine post-exploration ambiguity (intent that no ls/Glob/Read could disambiguate), never as a substitute for a tool call. (b) prompts/overlays/qwen.md — new family overlay (10 lines, cites the Qwen function-calling guide). Qwen / QwQ chat-tuned models hedge by default ("could you specify…"); the overlay overrides that with "treat every concrete noun the user names — path, filename, URL, function, command, error string — as an instruction to investigate it with a tool, not echo it back as a question." Registered in _OVERLAY_RULES for all qwen / qwq model IDs regardless of runtime (DashScope / Ollama / vLLM / OpenRouter all match). (c) agent.py runtime auto-nudge — model-agnostic safety net. New _looks_like_investigation() heuristic detects absolute-path tokens in the user message (URL-stripped to avoid false positives on https://host/path); if the heuristic fires AND the model's first reply is text-only with zero tool calls, the loop injects a one-shot [system reminder] use your tools, don't ask for what was given message into history and continues. Bounded to one nudge per run() invocation so it can never cause a loop — second text-only reply always falls through to break. 
The nudge fires on conversion to the OpenAI/Anthropic format as a normal user-role message and is invisible in the rendered UI (yielded events drive the display, not state.messages). Tests: 13 new pytest cases (tests/test_agent_nudge.py) — heuristic positives/negatives across English + Chinese + URL-only + relative-path + bare greeting; loop integration via fake provider stream verifying nudge fires, doesn't fire without path, fires at most once. 89 prompt + 196 targeted regression tests pass, zero regressions. Docs updated: prompts/README.md overlay table + Known Gaps, docs/architecture.md overlays tree + agent-loop step (h), docs/contributor_guide.md overlay enumeration. The three layers compose: strong models (Claude/Gemini) read the new default rule but already behaved this way; mid-tier models (GPT/DeepSeek/Kimi) get a clearer prompt-level instruction; weak models (qwen2.5/QwQ) get prompt + overlay + runtime nudge stacked. Even on a model that ignores the prompt entirely, the runtime nudge gives one free retry before the user has to intervene.
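A rough sketch of the runtime auto-nudge follows. The regular expressions are assumptions chosen to match the behaviour described (absolute-path tokens count, URLs and relative paths do not); the real `_looks_like_investigation()` may use different rules, and Windows-style paths are not handled here.

```python
import re

# Strip URLs first so https://host/path does not look like an absolute path.
_URL_RE = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*://\S+")
# An absolute-path token: starts with / or ~/ at a word boundary and has
# at least one more character after the slash.
_ABS_PATH_RE = re.compile(r"(?:^|[\s\"'(])~?/[^\s\"')]+")

NUDGE_TEXT = "[system reminder] use your tools, don't ask for what was given"


def looks_like_investigation(message: str) -> bool:
    """True when the message names an absolute path outside of any URL."""
    stripped = _URL_RE.sub(" ", message)
    return bool(_ABS_PATH_RE.search(stripped))


def should_nudge(user_message: str, reply_has_tool_calls: bool,
                 already_nudged: bool) -> bool:
    """One-shot guard: nudge only on a text-only reply, and only once per run."""
    return (not already_nudged
            and not reply_has_tool_calls
            and looks_like_investigation(user_message))
```

The `already_nudged` flag is what bounds the mechanism: a second text-only reply returns False and the loop ends normally.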
May 8, 2026 (earlier): Agent-OS layer (cc_kernel/) reaches v1.0: 27 RFCs shipped, 1771 tests passing, zero regressions on the legacy REPL/bridges path. What started as a daemon foundation (RFC 0001/0002) is now a single-node agent operating system: AgentProcess + EventLog (0003), Capability model (0005), per-agent ResourceLedger with first_breach signal (0006), priority Scheduler with admission filter (0007), RLIMIT + bubblewrap Sandbox (0008), Mailbox + topic pub/sub (0009), AgentRegistry (0010), AgentFS unified VFS (0011), Observability + Prometheus exposition (0012), and a frozen 58-method JSON-RPC contract with CI drift guard (0013). On top of that substrate: F-4 Subprocess agent runner (0016), WorkerLoop scheduler↔supervisor glue (0017), Bridge mirror that wires Telegram/WeChat/Slack into kernel.mbox without touching bridges/ (0018), LLM runner MVP (0019), DialogueOrchestrator for multi-turn (0020), Tool Dispatch + Permission Routing (0021), LLM Tool Calling Integration (0022); defense-in-depth tools: Exec (argv-only, RLIMITed, env scrubbed; 0023), Glob+List (0024), and Fetch (SSRF + DNS-rebind + redirect-leak defended; 0025); four streaming layers (IPC chunks 0026, LLM token streaming 0027, Exec line streaming 0028, Fetch body streaming 0029); and three new built-in inspectors (Diff 0030, AST 0031, Git 0032). All kernel code lives in cc_kernel/ and is gated behind --enable-kernel, so the default CheetahClaws CLI / REPL / bridges / web UI are byte-for-byte unchanged. Operators introspect via cheetahclaws kernel summary | info | agents | proc <pid> | events | queue | registry | methods | prometheus. Kernel SQLite schema is forward-only (v1 → v7). RFC 0014 multi-tenant + RFC 0015 cluster remain explicitly parked. Full overview: docs/agent-os.md. Each design note lives in docs/RFC/.
May 8, 2026: F-2/F-3 follow-ups + CI unblock (feature/fix-f2). Two-commit branch on top of #101's daemon foundation (F-2 SQLite persistence + F-3 monitor in daemon). (a) CI unblock (fix(ci)). Main has been red since 9c01237d (the trading-agent #99 merge) — tests/test_packaging.py::test_required_module_imports[modular.trading.ml] (the regression test added for issue #97) caught that modular/trading/ml/features.py and modular/trading/portfolio.py import numpy at module top while numpy is in the [trading] extra, not core deps. So pip install . (no extras) shipped a wheel where import modular.trading.ml blew up. PR #100 and #101 both inherited the red. Fix: dead import numpy as np removed from features.py; stacker.py defers numpy to inside train() and predict_proba() past the early-return paths so the diagnostic-only callers (train(too_few_rows), predict_proba(missing_model)) still work without the heavy stack; portfolio.py gates the numpy import behind try/except so module import succeeds and runtime callers raise on first use as before. test_trading_advanced.py and test_trading_discovery.py get pytest.mark.skipif markers on tests that genuinely need numpy / scipy / sklearn / pandas at runtime — skip cleanly on lean CI installs, run as before on full installs. Verified in a clean venv with only [web,autosuggest] (the exact CI install): 1075 passed, 11 skipped; with [all] extras: 1086 passed, no regressions. (b) F-2/F-3 follow-ups (fix(daemon)). Five issues found during the #101 review that the merged code didn't address: (i) cc_daemon/cli.py:cmd_serve started monitor.scheduler.start(...) before the listener bound — order matters because if a due subscription fires before the daemon is reachable, an LLM/network error in fetch/summarize/deliver surfaces in the log before the user sees the listening line, and external clients can't yet act on the resulting monitor_report SSE event; moved past the bind + discovery write. 
(ii) monitor/scheduler.py had no defense against the daemon coming up after REPL /monitor start fired — both schedulers would race on last_run_at and double-fire subscriptions; added _foreign_daemon_running() step-aside check at every loop tick (REPL-side instances bow out when a daemon registers ownership), with owned_by_daemon=True flag the daemon passes to opt out of the check on its own scheduler. (iii) EventBus.publish was synchronous=FULL (SQLite default) → every event was an fsync per commit, ~305 μs each; for streaming agent output (text_chunk events at dozens/sec) that's a real disk-IO concern. cc_daemon/schema.py now sets PRAGMA synchronous=NORMAL on init + every thread-local connection — safe under WAL (only the most recent transactions can be lost on hard kernel crash, which for a 24h-pruned event log is fine), microbenchmark drops to 39 μs/publish (~8×). (iv) The PR description said the JSON files were "kept readable for one release as fallback", but no fallback read path actually exists — jobs.py and monitor/store.py migration is fundamentally one-way once the schema_meta marker is set. Updated docstrings + docs/architecture.md to make the one-way semantics explicit and tell users how to redo a migration if needed. (v) docs/RFC/0002-daemon-foundation-roadmap.md F-2/F-3 marked OPEN → MERGED #101 + follow-ups (#fix-f2), with a new "Follow-ups" subsection under each. Branch: feature/fix-f2.
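The durability trade-off in item (iii) comes down to two SQLite pragmas. The snippet below is a minimal sketch using the standard `sqlite3` module; `open_event_db` is a hypothetical helper, not the code in cc_daemon/schema.py, and it assumes a file-backed database (WAL is not available for in-memory databases).

```python
import sqlite3


def open_event_db(path: str) -> sqlite3.Connection:
    """Open a connection tuned for frequent small commits (hypothetical helper)."""
    conn = sqlite3.connect(path)
    # WAL is persistent for the database file once set.
    conn.execute("PRAGMA journal_mode=WAL")
    # synchronous is a per-connection setting, so it has to be applied on
    # every new connection, including thread-local ones.
    conn.execute("PRAGMA synchronous=NORMAL")
    return conn
```

Because `synchronous` is not stored in the database file, setting it once at schema creation is not enough, which is consistent with applying it on init and on every thread-local connection as described above.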
May 8, 2026: Two production fixes — Gemma 4 native tool-call interceptor + issue #97 (pip install . shipping a broken wheel). Two unrelated bugs that both blocked end users on the v3.1 release. (a) Gemma 4 native tool-call interceptor (providers.py). When users run cheetahclaws against gemma-4-31B-it via vLLM, the model emits its native <|tool_call>call:NAME{json}<tool_call|> format instead of the Hermes/JSON envelope vLLM's --tool-call-parser hermes expects. vLLM doesn't recognise the format → leaves it in delta.content → cheetahclaws yields it as TextChunk → terminal shows raw <|tool_call>call:Research{topic:<\|"\|>...<\|"\|>}<tool_call\|> garbage instead of a coherent answer. The interceptor in stream_openai_compat now watches the streamed text for any of four native tool-call openers (Gemma official <|tool_call|>, Gemma 4 asymmetric <|tool_call>, Hermes <tool_call>, Mistral [TOOL_CALLS]); on detection it (i) yields the pre-marker text as a clean TextChunk, (ii) stops yielding text and switches into buffer mode, (iii) at end-of-stream tries three parser branches against the buffer (Gemma's call:NAME{json}, JSON envelope with name/arguments, Mistral's array form) and adds successful matches to tool_calls. Also normalises Gemma's <|"|> → " quote escaping. If no parser matches, falls back to yielding the buffered raw text so users see something rather than a silent stall. Tests: 16 new pytest cases (tests/test_native_tool_intercept.py) covering marker detection (4 variants), 3 parser branches, robustness (empty buffer / unparseable garbage / multi-call buffer), and end-to-end streaming via mocked OpenAI client (verifies pre-marker text yielded as TextChunk + <|tool_call> tokens NOT in any TextChunk + tool_call appears in AssistantTurn). (b) Issue #97 — pip install . produces a broken wheel (pyproject.toml, deleted memory.py, tests/test_packaging.py). 
Reported by @albertcheng on Windows + Python 3.13: cheetahclaws.exe crashed at startup with ModuleNotFoundError: No module named 'prompts'. Root cause: a name collision in pyproject.toml. memory was listed in BOTH py-modules (referring to an 11-line backward-compat shim memory.py that re-exports from the memory/ package) AND packages (the real memory/ directory). Python's import system always prefers the package directory over a same-named .py file, so the shim was dead code; setuptools ≥ 75 on Windows treats this dual-registration as a hard error and silently drops unrelated packages from the wheel build, which is how prompts/ went missing. Fix: deleted the dead memory.py shim, removed memory from py-modules, and replaced the manual packages = [...] list with [tool.setuptools.packages.find] + wildcard include patterns so future sub-packages auto-discover. This also caught a separate latent bug: the four sub-packages added in the v3.1 trading discovery layer (modular.trading.alt_data, modular.trading.broker, modular.trading.discover, modular.trading.ml) were missing from the manual packages = [...] list and would have been excluded from production wheels even after a successful build. Tests: 29 new pytest cases (tests/test_packaging.py) covering config sanity (no module/package name collision allowed; memory.py shim must not be re-introduced; pyproject.toml must use find not manual list), discovery walk (every top-level dir with __init__.py is reachable from find's include patterns or explicitly excluded), and the exact issue #97 failure reproduction (parametrised import test for 24 modules including prompts, prompts.select, all four new modular.trading.* sub-packages, and the cheetahclaws entry point; it fails the build if any can't be imported). Verified locally: rebuilt wheel after fix contains all 31 packages including prompts/ and the four new sub-packages. 1005 passing (960 baseline + 16 native-tool-intercept + 29 packaging = 1005), zero regressions. 
CONTRIBUTING.md updated with explicit packaging discipline notes: never put a name in both py-modules and packages, sub-packages auto-discover via find, only top-level packages need a new include pattern.
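To make the Gemma 4 branch of the interceptor in (a) concrete, here is a simplified parser for the asymmetric `<|tool_call>call:NAME{...}<tool_call|>` form only. It is a sketch under assumptions: the function name and regexes are illustrative, and the bare-key quoting fallback is a naive heuristic added so the example from this entry (`{topic:<|"|>...<|"|>}`) parses; it could mis-handle string values that themselves contain `, key:` patterns.

```python
import json
import re
from typing import Any, Dict, List

# Asymmetric Gemma 4 form: <|tool_call>call:NAME{...}<tool_call|>
_GEMMA_CALL_RE = re.compile(
    r"<\|tool_call>\s*call:(\w+)\s*(\{.*?\})\s*<tool_call\|>", re.DOTALL
)
# Naive fallback: quote bare object keys such as {topic: "..."}.
_BARE_KEY_RE = re.compile(r"([{,]\s*)([A-Za-z_]\w*)\s*:")


def parse_gemma_tool_calls(buffer: str) -> List[Dict[str, Any]]:
    """Extract tool calls from buffered text; unparseable bodies are skipped."""
    calls: List[Dict[str, Any]] = []
    for name, body in _GEMMA_CALL_RE.findall(buffer):
        # Normalise Gemma's <|"|> quote escaping to a plain double quote.
        body = body.replace('<|"|>', '"')
        try:
            args = json.loads(body)
        except json.JSONDecodeError:
            try:
                args = json.loads(_BARE_KEY_RE.sub(r'\1"\2":', body))
            except json.JSONDecodeError:
                continue
        calls.append({"name": name, "arguments": args})
    return calls
```

An empty result is the signal to fall back to yielding the buffered raw text, so the user sees something rather than a silent stall.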
May 8, 2026 (later): /trading v3.1 — automatic candidate discovery + composite ranking + anomaly detector + market monitor with bridge alerts. Closes the biggest gap in v3: previously you had to feed the agent symbols (/trading analyze NVDA); now it actively scans a universe and finds candidates for you. Four orthogonal discovery scanners ship: (a) insider_cluster — SEC EDGAR Form 4 cluster detector, flags tickers with ≥3 officer / 10%-holder filings in 30 days, surfaces SEC URLs so user can verify direction; (b) earnings_beat — yfinance earnings_dates surprise extractor, requires ≥10% beat AND post-print continuation (filters out the pop-and-fade pattern); (c) momentum_quality — factor intersection over the new factors.py (momentum = 6m return + 50d>200d trend confirmation; quality = ROE − 0.3·D/E + 2·op-margin; both min-max normalised + composite-scored); (d) sector_rotation — ranks SPDR Select sector ETFs by 1m+3m return, surfaces top holdings of the leaders. The orchestrator (discover/orchestrator.py) merges per-symbol hits across all four sources with weighted aggregation (insider 1.0, earnings 0.9, mom-qual 0.7, sector 0.5) AND a +0.5 confluence bonus when ≥2 sources flag the same ticker. New CLI: /trading discover [insider|earnings|momentum-quality|sector|all] [--universe sp100|sectors] [--add-watchlist N] — the --add-watchlist flag auto-promotes the top N hits to your watchlist for downstream /trading scan / /trading analyze. New /trading rank composite-ranks candidates by 0.5×factor + 0.3×discovery + ±0.1 calibration-tilt; output is a triage table for "which names deserve a real /trading analyze". New /trading factors [SYMS] shows raw momentum/quality/low-vol scores with a 24h disk cache at ~/.cheetahclaws/trading/factors_cache.json (S&P 100 takes ~1-2 min to scan, parallel ThreadPoolExecutor with 4 workers). 
New /trading anomaly [SYMS] runs three independent checks per ticker: volume spike (today vs 90d median ratio ≥ 2×), price gap (open vs prior close ≥ 3%), volatility regime z-score (5d realised vol vs 90d distribution ≥ 2σ). New /trading monitor scan runs one full monitoring cycle — anomaly detection + stop-loss/take-profit hits on open paper trades + earnings within 3 days for any held position + new SEC Form 4 filings since last scan (delta detection persisted in ~/.cheetahclaws/trading/monitor_state.db); --notify [telegram] [slack] [wechat] dispatches structured alerts (severity-tagged: critical/warning/info) through cheetahclaws's existing bridge layer. Honest framing on "real-time" in the docs: yfinance is 15-20min delayed for free tier, so polling more often than every 5-10 min is wasted effort; three scheduling options documented (manual, external cron, /monitor integration). New universe.py ships hardcoded S&P 100 (~7-8% drift/year, refresh quarterly) + 11 SPDR Select sector ETFs + curated top-10 holdings per sector ETF for sector_rotation. The discovery layer also fixes a real gap in the system prompt: the LLM didn't know what /trading discover etc. existed, so when users asked "can you find me good stocks" it confabulated; the dynamic _render_commands_block from earlier session now picks up the new subcommands automatically. Tests: 21 new pytest cases in test_trading_discovery.py covering universe resolution, factor scan + score with stubbed yfinance, insider cluster threshold logic, momentum-quality intersection, sector rotation top-sector picking, orchestrator multi-source merge + bonus, anomaly triple-check (volume/gap/vol-regime), ranker factor+discovery combination, monitor alert rendering + dispatch + end-to-end scan with stubbed market data. 960 passing (939 baseline + 21 new), zero regressions; golden system-prompt fixture regenerated. 
Honest disclaimer in PLUGIN.md and trading.md: discovery reduces search cost, not generates alpha — the named factors (momentum, value, quality) are well-known and largely priced in by quant funds; what users get is a 100-ticker → 15-ticker triage list to spend tokens on, plus structured discipline (anomaly detection, stop monitoring, earnings calendar) that's hard to do by hand. Form 4 transaction direction is NOT yet parsed from XML (we count filings, not buys vs sales); URLs included so user verifies in 5 seconds. Insider direction parsing is on the roadmap but requires reliable XML scraping of SEC archives across version drift.
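The orchestrator's merge step can be sketched with the weights and bonus quoted in this entry. How each scanner scores an individual hit is not stated, so the sketch assumes every hit carries a score in [0, 1] and that the composite is a weighted sum; the function name and data shapes are hypothetical.

```python
from collections import defaultdict
from typing import Dict

# Weights and bonus as quoted in the release note.
SOURCE_WEIGHTS = {
    "insider": 1.0,
    "earnings": 0.9,
    "momentum_quality": 0.7,
    "sector": 0.5,
}
CONFLUENCE_BONUS = 0.5  # added when at least two sources flag the same ticker


def merge_hits(hits_by_source: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Merge per-source {symbol: score} maps into one composite score per symbol."""
    totals: Dict[str, float] = defaultdict(float)
    sources_seen: Dict[str, int] = defaultdict(int)
    for source, hits in hits_by_source.items():
        weight = SOURCE_WEIGHTS[source]
        for symbol, score in hits.items():
            totals[symbol] += weight * score
            sources_seen[symbol] += 1
    merged = {
        symbol: total + (CONFLUENCE_BONUS if sources_seen[symbol] >= 2 else 0.0)
        for symbol, total in totals.items()
    }
    # Highest composite first, so the top N can be promoted to a watchlist.
    return dict(sorted(merged.items(), key=lambda kv: kv[1], reverse=True))
```

Under these assumptions, a ticker with a full-strength insider hit and a half-strength earnings hit scores 1.0 + 0.45 + 0.5 = 1.95, ahead of a ticker flagged only by earnings at 0.9.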
May 8, 2026: /trading v3 — paper-trade tracker, calibration, managed $X portfolios, alt-data, MV optimizer, ML stacker, walk-forward, broker abstraction. A two-stage upgrade that turns the trading module from "ask LLM about a stock" into a measurable research substrate. Stage 1 (the discipline layer): every /trading analyze recommendation is auto-recorded as a paper trade (~/.cheetahclaws/trading/paper_trades.db) — long and short signals account correctly. /trading calibration aggregates closed trades by confidence + signal and reports hit rate + mean return + a t-stat vs zero baseline; if 30+ closed trades show HIGH conviction not outperforming LOW, the agent's confidence label is noise and the diagnosis fires. /trading verify enforces hard risk rules (single-name 5% / sector 25% / total exposure 80% / stop 1-10% / earnings blackout 3 days → cap 2.5%) reading the live paper book — fixes the "LLM forgets its own rules" problem. The analyze prompt now auto-injects macro context (SPY/QQQ trend + VIX regime + 10y headwind, 30-min cached), earnings calendar warnings (🚨 if reporting within 7 days), and the current paper-book exposure so the LLM doesn't double-down on a sector already at 30%. /trading walkforward runs rolling out-of-sample chunks with a STABLE/MIXED/FRAGILE/INCONCLUSIVE verdict, replacing the dishonest aggregate backtest. /trading scan does a coarse heuristic sweep (RSI / 50d / 200d) over the watchlist before spending tokens on a real analyze. Stage 2 (the autonomous + alpha-research layer): /trading review runs a multi-agent debate on existing positions and emits structured ACTION ID=… DECISION=HOLD|ADD|TRIM|EXIT … rows for each. 
/trading manage start hundred 100 creates a virtual $100 portfolio backed by a SQLite-cleanly-namespaced PaperBroker; /trading manage step hundred runs one mean-variance rebalance cycle (scipy SLSQP, long-only, single-name + sector caps), /trading manage report hundred prints a markdown PnL report with equity curve — this is the canonical "I give the agent $100, check in a week" workflow. /trading optimize exposes the same MV solver standalone. The alt-data layer auto-injects three sources LLM analysis can actually add value on: SEC EDGAR Form 4 insider transactions (urllib, no API key, free), LLM-scored yfinance news headlines via the auxiliary cheap model (-10..+10 per headline aggregated to BULLISH/MIXED/BEARISH), and Google Trends search interest (soft-fails if pytrends not installed). The broker layer has a tiny BrokerBackend protocol with two backends — PaperBroker works out of the box, IBKRBroker is a stub with full setup docs (pip install ib_insync + IB Gateway config + connect()); the abstraction means switching from paper to live is one line when the user is ready. /trading ml train builds a LightGBM (or sklearn GradientBoostingClassifier fallback) classifier on closed paper trades — features are LLM signal one-hot + confidence ordinal + position size + stop / take profit + sector one-hot, label is "did this trade beat zero"; reports cross-validated AUC and feature importance, persists to ~/.cheetahclaws/trading/ml/stacker.pkl. The _CMD_META registry is also auto-populated from modular/-loaded commands now (closed a pre-existing bug where /trading, /video, /voice, /tts were callable but invisible to /help, tab-completion, and the system-prompt slash-command index — the LLM literally couldn't see its own subcommands). 
Tests: 46 new pytest cases across test_trading_pipeline.py and test_trading_advanced.py covering paper-trader CRUD, long/short PnL math, Phase-5 parser permissiveness, calibration aggregation, verifier 8-branch enforcement, macro/earnings/insider/sentiment/trends soft-fail behavior, MV optimizer constraints, broker buy/sell/avg-cost round-trip, IBKR stub setup-required diagnostic, end-to-end $100→step→status→report lifecycle with mocked quotes, ML feature engineering + train + predict, and the position-review prompt format. 939 passing (893 baseline + 46 new), zero regressions; golden system-prompt fixture regenerated. Also fixed a banner-rendering bug where the welcome box's right border was missing on every middle line (cheetahclaws.py now computes inner width from plain-text length and pads each row to close with │ regardless of model-name length). Honest disclaimer in the docs and PLUGIN.md: this is a research and discipline tool, not a money printer — public-data + LLM analysis does not have predictive edge over quant funds; the value is information aggregation, programmatic risk discipline, and empirical accountability. Run paper for ≥3 months with green calibration + walk-forward before considering an IBKR live account; small accounts (<$1k) have unfavorable fixed-cost economics in real life regardless of strategy.
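The calibration report's three numbers (hit rate, mean return, t-stat against a zero baseline) can be illustrated with a one-sample t-statistic. This is an assumption about the statistic used; the actual /trading calibration code may bucket trades by confidence and signal first and may compute the test differently.

```python
import math
import statistics
from typing import Dict, Sequence


def calibration_stats(returns: Sequence[float]) -> Dict[str, float]:
    """Hit rate, mean return, and one-sample t-stat vs. a zero-return baseline."""
    n = len(returns)
    if n < 2:
        raise ValueError("need at least two closed trades")
    mean = statistics.fmean(returns)
    stdev = statistics.stdev(returns)  # sample standard deviation (n - 1)
    # t = mean / standard error; undefined when every return is identical.
    t_stat = mean / (stdev / math.sqrt(n)) if stdev > 0 else math.nan
    hit_rate = sum(1 for r in returns if r > 0) / n
    return {"n": n, "hit_rate": hit_rate, "mean_return": mean, "t_stat": t_stat}
```

Running this separately on HIGH-conviction and LOW-conviction trades gives the comparison the entry describes: if the HIGH bucket does not outperform the LOW bucket, the confidence label carries little information.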
May 7, 2026: /theme slash command — 15 console color presets + post-merge UX fixes (PRs #92, follow-up). Adds a curated palette system to ui/render.py and a new /theme command:
- /theme lists all 15 presets (default, dracula, nord, gruvbox, solarized, tokyo-night, catppuccin, matrix, synthwave, midnight, ocean, monokai, cheetah, mono, none); each row renders an info / ok / warn / err swatch in the row's own theme colors so the listing is a real palette preview, not 15 identical lines in the current theme.
- /theme <name> mutates the shared C ANSI dict in-place so every existing clr() / info() / ok() / warn() / err() call site (~25 files) switches palette without touching any call site, and persists the choice via save_config() so the next launch re-applies it (early in cheetahclaws.py:main(), before the first output).
- Per-theme color roles. Each palette declares 4 semantic colors — accent (info / cyan / blue), ok (success / green / diff +), warn (yellow / magenta), err (red / diff -) — plus a Rich code style. Picking 4 hexes per theme means info() and ok() are always visually distinguishable, and render_diff keeps semantic colors (green = add, red = remove) under every theme. The original PR collapsed cyan/green/blue to a single accent color, making info() and ok() indistinguishable and turning diff additions into the accent color (purple under dracula, yellow under gruvbox, magenta under synthwave) — the follow-up split them apart.
- CODE_THEME is now actually consumed. _make_renderable() in ui/render.py passes code_theme=CODE_THEME to rich.markdown.Markdown, so Rich code-block syntax highlighting tracks the active theme (the original PR set CODE_THEME but never plumbed it through — it was dead code).
- none theme is genuinely uncolored (clears every key in C, including reset, to "" so clr() returns plain text). mono is genuinely grayscale (4 distinct gray levels for accent/ok/warn/err — the original PR hardcoded C["red"] = "\033[38;5;196m" regardless of theme, breaking both).
- Tests: 9 new pytest cases (tests/test_theme.py) covering schema validation, unknown-theme rejection, info/ok distinguishability across all themes, diff-color distinguishability, none-as-plain-text, CODE_THEME tracking, apply_theme idempotency across state, and the Rich Markdown code_theme round-trip. 893 passing, zero regressions on the 884 pre-existing.
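The in-place mutation that lets every call site switch palette can be sketched as below. The role names follow the bullets above, but the dict layout, the ANSI codes, and the three themes shown are illustrative assumptions rather than the contents of ui/render.py.

```python
from typing import Dict

# Shared ANSI dict; call sites hold a reference to this exact object.
C: Dict[str, str] = {
    "accent": "\033[36m",
    "ok": "\033[32m",
    "warn": "\033[33m",
    "err": "\033[31m",
    "reset": "\033[0m",
}

THEMES: Dict[str, Dict[str, str]] = {
    "default": dict(C),
    # Four distinct gray levels so the roles stay distinguishable.
    "mono": {
        "accent": "\033[38;5;252m",
        "ok": "\033[38;5;248m",
        "warn": "\033[38;5;244m",
        "err": "\033[38;5;240m",
        "reset": "\033[0m",
    },
    # Every key, including reset, becomes "" so output is plain text.
    "none": {key: "" for key in C},
}


def apply_theme(name: str) -> None:
    """Mutate C in place; rebinding C would leave importers with the old dict."""
    if name not in THEMES:
        raise ValueError(f"unknown theme: {name}")
    C.update(THEMES[name])


def clr(text: str, role: str) -> str:
    """Wrap text in the active color for a semantic role."""
    return f"{C[role]}{text}{C['reset']}"
```

Using `C.update(...)` rather than assigning a new dict is the design choice that matters: modules that did `from ui.render import C` keep seeing the live palette without any change at the call site.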
