What Changed
-
May 10, 2026 (latest, v3.05.79): Web Chat UI session organization + headless-bridges slash handler + stale-session reaper crash fix. Three threads of work merged into a single release. Bridges / headless deploys (#84 follow-up): Telegram / Slack / WeChat
/help,/monitor,/model,/statusproduced zero response in Docker /--webdeploys because_start_headless_bridges()only wiredrun_queryandagent_stateon the sharedsession_ctx— neverhandle_slash. The bridge poll loops gate onif slash_cb:and fell through tocontinuebefore the📩 Telegram:log line, so the failure was invisible indocker compose logs -f. Fix: extracted the slash handler (originally inlined inrepl()) into a module-level factory_make_bridge_slash_handler(state, config, run_query); both REPL and headless paths now use it (single source of truth, no future drift between modes). Stale-session reaper crash:web/api.py:reap_stale_chat_sessions()calledremove_chat_session(sid)without theuser_idthe function now requires for ownership-check parity — every reaper tick raisedTypeError, killing the daemon thread, so staleChatSessionobjects accumulated forever in the in-memory cache. Fix: capture(sid, user_id)pairs from the cachedChatSessionobjects under_chat_lock, then apply outside the lock. Web UI session organization: five-feature bundle layered on top — folders + drag-drop + Move-to context menu, ChatGPT-style active-folder context (click a folder name →+ Newand direct-typing both drop new sessions into that folder, with aChat · in <Folder>topbar breadcrumb), batch select with Select-all-respecting-search-filter, batch delete + combined-Markdown export (chats-N-sessions.md), and a 4-px draggable sidebar divider with localStorage persistence. Backend adds afolderstable,chat_sessions.folder_idnullable FK, in-placePRAGMA table_info+ALTER TABLEmigration ininit_db(), and 5 new HTTP endpoints (GET/POST /api/folders,PATCH/DELETE /api/folders/{id},PATCH /api/sessions/{id}/folder). Also rolled in: issue #111 (handle_slash_sync/handle_slash_streamno longer double-broadcast to WS) and--web --model Xpersistence. Tests: +16 new acrosstest_web_api.py(folder CRUD, batch ops, reaper regression) and the newtest_bridge_slash_handler.py(5 cases pinning the headless handler contract). Full suite: 2154 / 2154 passing, zero regressions. User-side guide:docs/guides/web-ui.md. -
May 10, 2026: Web Chat UI fixes — slash commands no longer reply twice;
--web --model Xactually applies the model. Two related issues that surfaced when wiring a self-hosted vLLM endpoint into the Chat UI. (1) Issue #111 — slash commands duplicated in Chat UI but not in terminal.web/api.py:handle_slash_syncwas both returning events inline in the HTTP response and broadcasting the same events to the WS subscribers of the same client;chat.jsthen iterateddata.eventsAND fired_handleEventfromws.onmessage, rendering every reply twice. Same bug inhandle_slash_streamfor SSE-streamed long commands (/brainstorm,/worker,/agent,/plan). Both helpers now deliver events through a single channel — HTTP/SSE only — so_handleEventruns exactly once per event. Background-thread events (sentinel flows, agent runs) are unaffected: by the time the worker thread emits,_broadcastis already restored to the live WS broadcaster infinally. (2)--web --model Xwas silently ignored. The CLI override branch only ran in the interactive-REPL path; theif args.web:branch loaded config straight from disk and started the server, sopython cheetahclaws.py --web --model custom/qwen2.5-72bwould happily boot but every request handler reloaded~/.cheetahclaws/config.jsonwith the previous model name (e.g.gemma-4-31B-it), producing a confusing404: model does not existagainst the new endpoint. Fix:cheetahclaws.pynow persistsargs.modelto config before callingstart_web_server, matching the documented behavior;provider:model→provider/modelnormalization is identical to the REPL path. User-side guide:docs/guides/web-ui.md(Troubleshooting + Architecture notes updated). -
May 10, 2026: Small-context local models survive large workloads — 4-part fix: ctx cap, auto-fanout, stagnation-stop, output paths under
~/.cheetahclaws/. Repro that motivated the work: running/agent → 1 (Research Assistant)on a 6.6 MB PDF (AutoRedTeamer.pdf— ~70k tokens of extracted text) withcustom/qwen2.5-72b(32k ctx). Old behavior: 400 BadRequest "context length 32768"; the agent_runner kept polling the template every 2 s; the model produced 1500+ identical "task complete" summaries before anything stopped it. New behavior, four cooperating layers: (1) Per-model context-window registry + dynamic max_tokens cap (providers._MODEL_CONTEXT_LIMITS+get_model_context_window+dynamic_cap_max_tokens) — covers Qwen 2.5/3, Llama 3.x, Mistral/Mixtral, Phi, Gemma, DeepSeek local variants;_fetch_custom_model_limitnow backfillsPROVIDERS["custom"]["context_limit"]so compaction sees the live/v1/modelsvalue; per-call shrink based on actual prompt size keepsinput + output + 1024 safety ≤ ctx.compaction.get_context_limitgains an optionalconfigarg so custom-endpoint detection works on the very first turn. (2) Auto-fanout for oversize tool outputs (multi_agent/fanout.py) — when a single tool result (Read on a huge PDF, Grep over a giant tree, WebFetch of a long article) exceeds 0.4 × ctx_window, split into chunks at paragraph boundaries with token-overlap, dispatch parallel sub-LLM map calls (one per chunk, default cap 5 subagents), merge with a single reduce call; substitutes the merged summary in conversation history instead of letting the next API call overflow. Hooked at the tool-result append site inagent.py; transparent UX prints[Auto-fanout: <Tool> returned ~N chars (>threshold) → dispatching K parallel sub-summaries]. Configurable:auto_fanout_enabled/_threshold/_max_subagents/_chunk_overlap_tokens. (3) Stagnation-stop inagent_runner.py— when the model emits the same summary N iterations in a row (default 3, whitespace/case-normalized), stop the loop with a clear notification instead of burning thousands of API calls; configurable viaauto_agent_dup_summary_limit(0 disables). (4) Agent output paths under~/.cheetahclaws/—/agentwizard now resolves relative output filenames (e.g.research_notes.md) to absolute paths under~/.cheetahclaws/agents/<name>/output/instead of CWD;AgentRunnerexposesrunner.output_dir, eagerly mkdir'd; Summary block + post-start info show the resolved path in green; absolute paths pass through unchanged. Tests: +47 new (fanout 23, ctx cap 18, dup-stop 13, output paths 8). Full suite: 2139 passing, zero regressions. User-side guide:docs/guides/extensions.md. -
May 9, 2026: Read tool auto-redirects on overflow — defense-in-depth for the case where model ignores the template instruction. Re-running the same
/agent + autodan.pdffailure showed two real-world problems with the prior fix: (1) The user was running the pip-installed binary (/home/shangdinggu/anaconda3/bin/cheetahclaws), not the source tree. New tools / templates added to source had no effect. (2) Even if the user reinstalled, qwen2.5-72b would likely still callReadinstead ofSummarizeLargeFile— models default to familiar tools no matter what the template says. The fix moves the routing decision into the Read tool itself. (a) New_maybe_redirect_to_summarizehelper (tools/files.py). WhenReadorReadPDFwould return content too large to safely fit in the next API call, it instead returns a short redirect message like[ReadTooLarge: file is too large — call SummarizeLargeFile with file_path='X' instead] PREVIEW: …. The model sees the redirect, callsSummarizeLargeFile, gets a chunked-and-merged summary back. The raw content never enters the API call. (b) CJK-aware token estimation. CJK content tokenizes at ~1 token per character (vs ~2.8 chars/token for English). New_is_cjk_heavy()heuristic: ≥20% CJK characters → use 1:1 char-to-token estimate. A 24K-char Chinese file is 24K tokens, not 8.6K, and now triggers redirect on a 32K-context model. (c) Conservative ceiling for unreliable provider declarations.custom/<model>provider declares 128K context by default but the underlying model is often 32K (qwen2.5-72b, llama 3 8B, etc.). Newsafe_ctx = min(declared_ctx, 30000)caps the threshold at 30K tokens regardless of provider claims — the redirect now fires on the user's exact ~25K-token PDF case (would NOT have fired with the unconditional 128K ceiling, which is exactly the bug). (d) Wrapped Read registration (tools/__init__.py). New_read_with_overflow_checklambda calls_maybe_redirect_to_summarizeafter_readreturns; for results <8KB it skips (not worth the check). ReadPDF gets the same treatment inline in_read_pdf. Why this works even on the old install: as soon as the user updatestools/files.pyandtools/__init__.py, the redirect fires regardless of whether SummarizeLargeFile / template changes are present. The redirect's prose tells the model exactly which tool to call and with what args. Tests: 14 new pytest cases (tests/test_read_overflow_redirect.py) — CJK detection (English / Chinese / Japanese / mixed-minority / empty), threshold logic (small file → no redirect; user's exact failure case → redirect with right pointer; CJK at lower char count triggers vs same chars in English; conservative ceiling protects against overconfident provider; preview included for context). Plus 2 integration tests viaexecute_tool("Read", ...)confirming the wrapper applies the redirect end-to-end. 2077 targeted regression tests pass (2063 prior + 14 new), zero regressions across the whole repo. -
May 9, 2026: Multi-agent map-reduce
SummarizeLargeFiletool — solves the "file too big for model context" problem at the source. Re-running the same/agent + autodan.pdffailure case showed the SAFETY_BUFFER bumps were still band-aids — even with 2500-token buffer the prompt re-tokenization sometimes ate ~1K, leaving no margin. The real fix: when a file is too big for the model's context, chunk it and run multiple sub-LLM agents in parallel then merge. This makes file size irrelevant. (a) NewSummarizeLargeFile(file_path, focus="")tool (tools/files.py). Reads any-size file (PDF / txt / md / code), estimates tokens, and: if it fits in(model_ctx - 8.5K_reserved)tokens → single-shot summary; otherwise → splits into N chunks (number adaptive to file size: 200KB on 32K-context model → ~4 chunks; 200KB on 200K-context → 2 chunks), summarizes each chunk in parallel viaThreadPoolExecutor(up to 8 workers), then a reduce step merges all chunk summaries into one unified output. Per-chunk failures are logged inline as[chunk N: error]markers so one flaky source doesn't sink the whole job. Returns the final summary as the tool result. Registered withread_only=True, concurrent_safe=True. (b)/summarize <path> [focus]slash command (commands/advanced.py:cmd_summarize). Thin wrapper around the same helper for direct user invocation — handy for quickly summarizing a paper or large code file without spinning up a full/agentflow. (c)research_assistant.mdtemplate updated. Step 2 of "each iteration" now tells the agent to preferSummarizeLargeFileoverReadfor academic papers (handles chunking + never overflows context regardless of length). Falls back toReadfor tiny (< 5KB) files. (d) Quick band-aid:SAFETY_BUFFER1000 → 2500 in_try_reduce_output_cap_from_error. Even with the new tool, output-cap auto-reduction is still useful for the rare case whereReadis called on a moderately big file. The 2500-token (~7.6% of 32K) buffer now absorbs the +1K vLLM decoder-priming variance we observed in the wild. Tests: 18 new pytest cases (tests/test_summarize_large_file.py) — token estimator parametrized cases, chunk planner adaptiveness (small file → 1 chunk; size scales monotonically; larger context → fewer chunks; chunks have overlap; chunks cover all content), file reader dispatch (text / missing / directory rejected), full pipeline (small → single-shot, big → map-reduce with N≥3 map calls + 1 reduce), tool registration + schema check. 2063 targeted regression tests pass (2045 prior + 18 new), zero regressions. Golden prompt fixture regenerated for the new/summarizecommand in the help index. -
May 9, 2026: Two follow-up fixes after re-running the same
/agentfailure case. The previous patch wasn't enough — running the user's exact scenario again still showed: 1st callprompt 24577 + cap 8192 = 32769 fail→ my auto-reduction fired → 2nd callprompt 24778 + cap 7991 = 32769 fail again. The prompt grew by 201 tokens between attempts (provider re-tokenized differently on retry), exactly eating the 200-token safety buffer. AND the agent_runner's consecutive-failure detector kept resetting because agent.py alternates between[Failed ...]and[Circuit breaker ...]markers, so signature-matched counter went 1 → 1 → 1 → 1 forever. (a) BumpedSAFETY_BUFFER200 → 1000 in_try_reduce_output_cap_from_error. ~3% headroom on a 32K window absorbs provider-side tokenization variance. User's case: new safe cap =32768 - 24577 - 1000 = 7191, which actually fits even after the prompt grows. (b) agent_runner now counts ANY failure, not just signature-matched. New parallel counterconsecutive_any_failuresincrements on ANY[Failed]/[Circuit breaker]marker regardless of signature; trips at 4 consecutive iterations. The[Failed → Circuit breaker → Failed → ...]alternation now stops the agent at iteration 4 instead of looping forever. Updated stop-message clarifies whether the trip was "same identical failures" or "consecutive mixed failures". 8 existing tests updated for new buffer + 2045 targeted regression tests pass. -
May 9, 2026: Three fixes for the context-overflow + circuit-breaker doom loop. User report:
/ssj 15 → Research Assistantpointed at a large PDF, modelqwen2.5-72b(32K context), output cap 8192, prompt 24577 input tokens → total 32769 → 1 token over the limit. Every API call returned the same BadRequestError. The retry loop hit the same error 5 times in 60s → circuit breaker opened (120s cooldown). After cooldown the agent runner retried with the SAME config → re-opened the breaker → cycle continued forever, generating hundreds ofcircuit_open_skiplog lines. Three coordinated fixes break the loop. (a)agent.pyauto-reduces output cap on context overflow. New_try_reduce_output_cap_from_errorparses the explicit token counts from the error message (max=32768, requested=8192, prompt=24577) and computes a safe new cap =model_max - prompt_tokens - 200_buffer. In the user's case:32768 - 24577 - 200 = 7991, which fits. The retry uses the new cap WITHOUT consuming the attempt budget; bounded to ONE auto-reduction per turn so a true overflow (prompt itself too big to fit any reasonable output) eventually surfaces. Tolerant regex matches both OpenAI-style and Anthropic-style overflow messages. Falls through to existing_force_compactpath if numbers can't be parsed or the safe cap < 256. (b)agent_runner.pystops after N consecutive identical failures. Track each iteration's failure signature (the[Failed ...]or[Circuit breaker ...]marker text from agent.py's output, capped at 80 chars). When 3 in a row match, stop the agent with a clear notify message naming the underlying error. Prevents the doom loop where a fundamentally broken request (context too big for compaction to fix, missing API key, unauthorized model) keeps re-running every 2s for hours. (c)agent_runner.pyhonors circuit-breaker cooldown. When iteration text contains[Circuit breaker OPEN ... Cooldown: Xs], parse Xs and wait that long (capped at 5 min) instead of the configured 2s interval before next iteration. Avoids 60+ wasted iterations per single 120s cooldown. Tests: 8 new pytest cases (tests/test_context_overflow_recovery.py) — parser reproduces user's exact failure → 7991 cap, no-op when current cap already fits, give-up when safe cap < 256, OpenAI vs Anthropic phrasing tolerance, regex match for circuit-breaker cooldown extraction, regex match for [Failed / [Circuit breaker markers in real outputs. 2045 targeted regression tests pass (2037 prior + 8 new), zero regressions. -
May 9, 2026: /brainstorm v2: programmatic backstops + ranked synthesis +
--bgbackground mode. Three coordinated additions that make brainstorm output usable even when the lead model is weak (qwen2.5 etc.) and let users keep working while the debate runs. (a) Programmatic action-plan filter (commands/advanced.py). Two new helpers_extract_ban_keywords(opening)and_filter_action_plan(synthesis_md, ban_keywords). After_lead_synthesisreturns, the action plan is regex-scanned (case-insensitive substring) against a built-in default ban list —consult an advisor,diversify your portfolio,monitor regularly,考虑,咨询,定期监控,多元化,咨询财务顾问,分散投资,关注市场动态and dozens more, English + Chinese — PLUS topic-specific bans extracted from quoted strings ("..."/「...」) in the lead's own opening. Matched items are dropped with a_(programmatic self-check removed N action(s))_note appended. Deterministic — runs regardless of whether the lead model actually executed its prompt-side SELF-CHECK instruction. The user-reported failure case where qwen2.5 banned "consult an advisor" in the opening but still wrote "明天与财务顾问讨论" as Action Plan item #10 is now caught at the code level. (b) Ranked synthesis enforcement. The_lead_synthesisprompt's## Consensussection is renamed to## Ranked Consensuswith a mandatory**Ranked by: <metric>**header (metric extracted from the user's topic — "highest expected return" / "best refactor impact" / etc.) and items must be numbered with a→ Why this rank: <one sentence>line. Programmatic backstop_consensus_is_rankedregex-checks for ≥2 numbered items in the section; if missing, ONE fallback LLM call asks the lead to rank. If the fallback also fails to produce a ranking, the original ships unchanged (no crash). (c) Background mode--bg(or--background). New flag spawns a daemon thread, returns the REPL immediately. Stage progress (Lead opening,Round 2/3 (cross-examination),Synthesis) prints from the thread and interleaves with the user's typing — acceptable trade-off for a freed REPL. New/brainstorm statussubcommand shows all in-flight bg brainstorms with their current stage + elapsed time + output path. Implementation uses recursion: when--bgis set, the thread re-enterscmd_brainstormwith_bg_recursion=Truemarkers in config that bypass the interactive prompts (which would block on stdin) and suppress the TODO-generation sentinel (no REPL is listening for it). Module-level_BG_BRAINSTORMSdict is mutex-locked so/brainstorm statusreads a clean snapshot. Finished brainstorms older than 1h are pruned fromstatusto keep the list useful; running ones never prune regardless of age. Tests: 27 new pytest cases (tests/test_brainstorm_v2_advanced.py) — ban-keyword extraction (defaults + opening-quoted), action-plan filter (English + Chinese + no-section + all-clean), ranking detector (proper / unranked-bullets / no-section / single-item),_ensure_consensus_is_ranked(no-op when ranked + LLM call when not + keep-original on LLM failure),--bgflag parsing (7 cases including--backgroundalias + flag-position-tolerance +--bgmodenot matching), bg registry (register/set_stage/complete/snapshot + sort + 1h-prune-finished + keep-running-regardless). 2037 targeted regression tests pass (2010 prior + 27 new), zero regressions across the whole repo. Doc:docs/guides/brainstorm.mdadds--bgrow to the flag table + new "Programmatic backstops on the synthesis" section + tip "use --bg for long debates so you can keep working". -
May 9, 2026: /brainstorm
--ground: pre-fetch real /research data so personas debate against facts. Closes the biggest remaining gap in the brainstorm pipeline. Until now/brainstormwas pure-reasoning (no_tools=Trueon every persona) — fine for design / refactor / strategy questions, but useless for data-hungry topics like stocks / current events / recent news where personas would confidently invent prices and tickers from training memory. New--ground(or--ground=Nfor top-N cap, clamped to[3, 50], default 15) runsresearch.aggregator.research()on the topic BEFORE the debate starts, formats the top results as a compact### GROUNDING DATAmarkdown block, and inlines that into the snapshot every persona / lead opening / lead synthesis sees. Persona round-1 instructions gain "you MUST cite specific results by[N]when your claim relates to one — do not invent figures the data doesn't show." Lead opening detects the grounding block and anchors the agenda to it ("forbid any claim that contradicts the grounding data without citing it"). Lead synthesis takes a newgrounding=kwarg and the prompt requires every consensus claim to trace to either a[N]result OR a specific persona claim — un-traceable claims must be DROPPED. Failure-tolerant: any exception from the research aggregator (network, missing API keys, all sources 429) is caught silently —_fetch_groundingreturns""and the brainstorm continues un-grounded with a logged warning. Cost: 10-30s for the fetch, but cached for 24h via the existing/researchSQLite cache so back-to-back runs on the same topic are basically free. Composes cleanly with--rounds,--lead,--models. SSJ interactive flow gains a newGround in /research data first? [y/N]prompt right after Rounds; defaultNso existing usage is unchanged. Tests: 18 new pytest cases (tests/test_brainstorm_grounding.py) — 8 flag-parse cases including bound-clamping + four-flag composition, brief-formatting shape + sort + char-budget + empty-results, three fetch-graceful-degradation paths (raises / empty brief / happy path), backward-compat for_lead_synthesis(grounding=). 2010 targeted regression tests pass (1992 prior + 18 new), zero regressions across the whole repo. Doc:docs/guides/brainstorm.md"Data-hungry topics" section rewritten with examples + tip "always pass --ground for any topic touching the real world". -
May 9, 2026: /brainstorm output-quality guards — fix 5 real bugs surfaced from a live transcript. Reviewing
brainstorm_outputs/brainstorm_20260509_000935.mdexposed five concrete failures the structural changes alone didn't catch. (a) All persona letters wereP—letter, name = get_identity(persona_name[0].upper())and persona dict keys arep1/p2/…, so every Agent ended up labeledP("Agent P quoting Agent P attacking Agent P"). Letters now come from a stablepersona_identitymap keyed by index →A, B, C, D, E…(capped at Z). (b) Same persona's NAME re-rolled every round becauseget_identitywas called fresh and Faker is random — round 1's "Riley Torres" became round 2's "Alex Lopez".persona_identityis sealed once before the rounds loop. (c) Round 2+ challenges were verbatim copy-paste — qwen2.5 saw the first persona's CHALLENGE block in history and cloned it (8 of 10 round-2/3 challenges in the failing transcript were >95% identical). New_extract_challenge_blocks+_jaccard_similarity+_is_redundant_challenge(threshold 0.7) guards: when a round-2+ persona's CHALLENGE is too similar to a prior one, the lead force-regenerates ONCE with explicit "pick a different target / different angle" nudge; if still redundant, the contribution is kept but tagged_[lead note: contribution flagged as redundant]_so the synthesizer can ignore it. (d) Lead synthesis self-contradicted itself — listed "consult an advisor" inWhat Was Fillerthen included "明天与财务顾问讨论" as Action Plan item #10._lead_synthesisnow takes the lead's ownopeningtext as context and the prompt explicitly forces a SELF-CHECK before writing the action plan: "if any action matches a banned escape hatch, REWRITE or DELETE." (e) Weak lead models silently produce flat output — qwen2.5-72b leading qwen2.5-72b is the same model on both sides with no real moderation. New_is_weak_lead_modelfamily check (qwen / qwq / gemma / phi-3 / mistral-7b / llama-3.2 / kimi-7b / minimax-text / abab / etc.); when triggered, prints a one-line warning suggesting--lead claude-opus-4-7or the free--lead nim/deepseek-ai/deepseek-r1. Never silently overrides — just informs. Plus a newdocs/guides/brainstorm.md"When NOT to use /brainstorm" section: the panel runs withno_tools=Trueso it can't pull live data — bad fit for stocks / current events / repo-specific code; good fit for architecture decisions / refactor strategy / risk assessment / API design. Tests: 28 new pytest cases (extraction + Jaccard + redundancy + weak-lead + synthesis-with-opening). 297 targeted regression tests pass. -
May 9, 2026: /brainstorm round 2+ becomes adversarial cross-examination. Previous round-2+ prompt asked personas to "engage with what others said" but that was too soft — weak models defaulted to "agree-and-extend" or just continued their own line, producing N rounds of polite parallel monologues instead of a real debate. Three coordinated changes flip round 2+ into mandatory adversarial mode. (a) Persona round-2+ prompt rewrite (
commands/advanced.py:call_persona). Each persona MUST: quote a specific claim from another agent verbatim (by letter), attack a specific weakness (data wrong / mechanism doesn't produce outcome / confounder ignored / claim un-falsifiable / contradicts stronger claim), AND propose a falsifiable counter-claim with a specific number/date/named entity. Structured format### [CHALLENGE → Agent X]so weak models can follow. Politeness ("great point", "I agree, and would add", restating without attacking) is explicitly FORBIDDEN. Synthesis is the lead's job, not the persona's. (b) Round-aware lead probe (_lead_probe). Round 1 keeps the existing concrete-vs-vague check. Round 2+ uses a different probe that fires on DODGES — a polite agreement, a synthesis, or a defense-only reply that doesn't quote and attack another agent earns a probe demanding "Agent X said '...'. Attack it or accept it — your call, but commit. Quote and refute, don't dodge." (c) Lead opening warns about cross-examination upfront. Opening prompt now ends with explicit rule: "in any round after the first, each expert MUST quote a specific claim from another expert and either attack with a counter-claim OR explicitly accept it. Polite agreement counts as a dodge." UI label changes too —── Round 2/3 (adversarial cross-examination — agents must attack each other's claims) ──. Tests: 3 new round-aware probe cases (round-2 polite-agreement gets probed; round-2 real challenge passes; round-1 still uses old vague check — captured so a future round-2 change can't regress round 1). 269 targeted regression tests pass. -
May 8, 2026: /ssj brainstorm: interactive Rounds prompt. Tiny UX follow-up to the multi-round /brainstorm landing —
/ssj→ 1 (Brainstorm) now asksRounds [1=monologues, 2=critique (default), 3-6=more debate] >right after the existing "How many agents?" prompt, so SSJ users can dial in debate depth without remembering the--rounds NCLI flag. Behaviour: when the user invokes/brainstorm --rounds 3 …directly via the slash-command line, the explicit value wins and the prompt is skipped (no double-asking). Telegram / web bridge sessions still skip the prompt entirely (no interactive input channel) and use the documented default of 2 rounds. -
May 8, 2026: /brainstorm: real multi-round debate + tighter post-Write contract. Two follow-up fixes after the lead-moderator landing. (a) Multi-round debate (
commands/advanced.py). Previous flow ran every persona exactly once — even with the lead moderator, that's three monologues stapled together, not a debate. New--rounds Nflag (default2, capped to[1, 6]) wraps the persona iteration in an outer rounds loop. Round 1 is initial positions (existing prompt). Round 2+ uses a different system prompt that explicitly forbids repeating: "Read the prior debate. Pick 1-2 specific claims from OTHER agents that you disagree with, can sharpen, or that change your view. Quote and engage. Do NOT re-list your round-1 ideas." Lead probes still fire after each persona in each non-final round. The synthesis prompt's transcript is rebuilt frombrainstorm_historydirectly so adding new header rows can't mis-slice it again. Composes with--lead <model>and--models a,b,c:/brainstorm --rounds 3 --lead claude-opus-4-7 --models gpt-5,nim/deepseek-ai/deepseek-r1 redesign auth. (b) Tighter TODO prompt (cheetahclaws.py). The previous "do not echo / do not Read" prompt didn't stop qwen2.5 from Write → echo content as text → Bashlsto verify (with truncated path due to vLLM streaming) → echo content again. New prompt is numbered STRICT RULES: call Write EXACTLY ONCE; do NOT call Read; do NOT call Bash to verify; do NOT echo file content after Write; after Write succeeds, your turn ENDS. Both REPL and Telegram handlers updated. Tests: 9 new pytest cases (--roundsparser including bound-clamping + non-numeric rejection + three-flag composition). 266 targeted regression tests pass. The Bash-args truncation symptom (ls /srv/.../cheetahclacut mid-path) is a vLLM hermes-parser streaming bug at the model server, not fixable on the client side; the tighter prompt avoids the Bash call entirely. -
May 8, 2026: Three fixes for /monitor + /research stability — multi-word topics + aggregator deadlock + REPL Ctrl+C. Two distinct bugs reported on a
/ssj→ 17 (Trend Track) flow with the topic "Agent OS Benchmark". (a) Topic truncated to first word (commands/monitor_cmd.py:_parse_subscribe_args). The previous parser didargs.split()and treated the FIRST whitespace token as the topic, dropping the rest. So/subscribe research:7d:Agent OS Benchmark dailybecame topic=research:7d:Agent+ the rest was either silently dropped or mis-classified as flags. The new rule: walk left-to-right, peel off--flagtokens into channels, then if the LAST remaining token is in_VALID_SCHEDULESit's the schedule — everything before joined by single spaces is the topic. Correctly handlesai_research,ai_research weekly,custom:quantum computing weekly,research:7d:Agent OS Benchmark daily,research:7d:Agent OS Benchmark(default schedule), and edge cases. 12 new pytest cases (tests/test_subscribe_parser.py). (b) Aggregator deadlocked on slow source then killed REPL on Ctrl+C (research/aggregator.py:190). Thewith concurrent.futures.ThreadPoolExecutor(...)context manager callsshutdown(wait=True)on__exit__, which BLOCKS waiting for any in-flight worker to finish. Whenas_completed(timeout=...)fires its TimeoutError because one source is hung on a stuck socket, control unwinds into the__exit__and joins the hung thread. Then the user Ctrl+Cs to escape, the KeyboardInterrupt fires during the join, and Python's atexit hook_python_exitALSO joins the same threads — double-blocking, then atexit kills the process and the user is dumped to bash. Fix: switch to manualtry/finallywithshutdown(wait=False, cancel_futures=True)(Python 3.9+) so partial results return immediately; the hung worker keeps running as a daemon thread and dies silently with the process. Both_cf.TimeoutErrorandKeyboardInterruptpaths now mark unfinished sources with a status entry ("timeout (aggregator deadline exceeded)"or"interrupted by user") instead of dropping them silently. (c) REPL: Ctrl+C during a slow slash command killed the process (cheetahclaws.py:1368). The REPL didresult = handle_slash(user_input, state, config)with NO try/except, so a KeyboardInterrupt during/monitor run,/research,/trading backtest, etc. unwound the call stack all the way tomain()→sys.exit()→ atexit. Fix: wrap the REPL slash dispatch intry / except KeyboardInterrupt → print '(command interrupted)' → continueso Ctrl+C cancels the command and returns to the prompt. Also wrapped the SSJ inner re-dispatches at lines 1420/1430 (__ssj_passthrough__and__ssj_cmd__) so Ctrl+C from inside a slow SSJ-launched command bounces back to the SSJ menu instead of killing the REPL. 257 targeted regression tests pass. -
May 8, 2026: /brainstorm gets a real lead moderator + read-only tool dedup. Two coordinated changes that turn /brainstorm from "round-robin echo chamber that produces filler advice" into "moderated debate with a structured master plan", and stop weak models from re-Reading the same file twice. (a) Lead moderator (
commands/advanced.py). Three new in-process stages (no main-agent invocation, no tool calls — the whole pipeline lives insidecmd_brainstorm): (i) Opening — lead frames the agenda, names the concrete artifact this debate must produce (e.g. "specific tickers with thesis, not 'consider semiconductors'"), and lists 2-3 cheap escape hatches that will be REJECTED ("consult an advisor", "diversify", "monitor regularly"). The opening becomes the persona system-prompt's "DEBATE ANCHOR" so every persona writes against the same bar. (ii) Probe — after each persona speaks, lead reads their contribution and either repliesNO_PROBE(concrete enough) or asks one ≤25-word follow-up that demands a specific commitment; the persona then gets one more swing answering the probe. (iii) Synthesis — lead produces the final master plan with four named sections (Consensus / Dissents / Concrete Action Plan / What Was Filler), with the consensus matrix tagging each claim with the agent letters that backed it. New--lead <model>flag lets you point lead at a stronger model than the default (/brainstorm --lead claude-opus-4-7 --models gpt-5,deepseek-r1 redesign auth). Composes cleanly with the existing--models a,b,cflag. (b) Eliminates the duplicate-Read bug. The previous flow returned a sentinel that asked the main agent to Read the brainstorm file and synthesize — qwen2.5 + vLLM cheerfully Read it twice and echoed the entire 4 KB master plan as text twice (also writing a different much shorter content via Write — a separate tool-call truncation issue). The new sentinel inlines the lead's master plan directly in the TODO-generation prompt, so the main agent only writes the TODO file. No Read, no rewrite. The old_save_synthesisstep is now a no-op (everything is written insidecmd_brainstorm). (c) Read-only tool dedup (agent.py). Defense-in-depth even outside brainstorm: when the model fires Read/Glob/Grep/WebFetch/WebSearch with identical args twice within a singlerun(), the 2nd call is short-circuited —execute_toolis skipped (saves time),ToolStart/ToolEndUI yields are suppressed (no⚙ Read(...)printed twice), a brief[deduped Read: already in context]text marker is yielded so the user still knows what happened, and a synthetic[deduped]reminder is appended as the tool_result so the model sees "you already called this; use the content already in your context" — both nudging the model AND keeping the OpenAI/Anthropic tool_calls ↔ tool_response pairing valid. Write/Edit/Bash are explicitly NOT deduped (those can be intentional rewrites). Tests: 19 new pytest cases (8 lead helpers + 4 dedup integration via fake provider stream + 7 flag-parse). 245 targeted regression tests pass. -
May 8, 2026: /ssj brainstorm hot-fixes — absolute path in synthesis prompt + tool dispatch hardened against empty args. Two bugs surfaced when a user ran
/ssj→ 1 (Brainstorm) oncustom/qwen2.5-72b. (a) commands/advanced.py:244 — synthesis prompt leaked a relative path. The brainstorm synthesizer was injectingout_file(aPathresolved relative to cwd) into the model's prompt asbrainstorm_outputs/brainstorm_<ts>.md. The model — obeying the system prompt's "always use absolute paths" rule — invented an absolute prefix and guessed wrong (in this case…/PR/cheetahclaws/brainstorm_outputs/…, a stale sibling source tree it had never been told existed). Read failed, the synthesis ran on no actual evidence. Fix:out_file.resolve()before formatting + an explicit "use this path verbatim, do NOT prepend any directory" line. (b) tools/init.py:459-471 — permission-prompt description usedinputs['file_path']notinputs.get(...). When a weak model fired a tool_call with empty arguments (qwen2.5 + vLLM hermes-parser is a documented offender — see "Be agentic on every model" entry above), the wrapper raisedKeyError: 'file_path'before the registered ToolDef's friendly"Error: missing required parameter 'file_path'"lambda ever ran. The user sawError executing Write: KeyError: 'file_path'and the model couldn't self-correct. Fix:.get(..., '<missing path>')for Write/Edit/NotebookEdit description,.get('command', '') or ''for Bash, so the inner ToolDef's friendly error always reaches the model. Bash's_is_safe_bashalready tolerates empty input. Tests: 9 new pytest cases (tests/test_tool_dispatch_robustness.py) — empty args on Write/Edit/Read/Bash/NotebookEdit must return a friendly string and never leakKeyErrorto the agent loop. 226 targeted regression tests pass. -
May 8, 2026: NVIDIA NIM free-tier provider + 429 cascade fallback + multi-model
/brainstorm. Three small, focused additions — borrowed selectively from sibling forks (Falcon for NIM, Dulus for the multi-model debate idea) — that lower the barrier to entry for users without paid API keys and tighten epistemic diversity in brainstorming. (a) NIM provider (providers.py). Newnimentry registered againsthttps://integrate.api.nvidia.com/v1(build.nvidia.com — free signup, no payment info), curated 10-model chain (deepseek-r1, deepseek-v3.1, llama-3.3-70b, llama-3.1-405b, nemotron-70b, mixtral-8x22b, qwen2.5-72b, qwen2.5-coder-32b, phi-3-medium, gemma-2-27b). All listed inCOSTSas $0 so the UI doesn't show "unknown" for free-tier usage. Invocation:cheetahclaws --model nim/<vendor>/<model>— the double-prefix preserves NIM's upstream<vendor>/<name>form throughdetect_provider+bare_model. (b) 429 cascade fallback (agent.py). When a NIM model returns rate-limit (ErrorCategory.RATE_LIMIT), the agent loop callsnim_next_model()to pick the next model in the curated chain and retries — without consuming a regular retry slot. Capped at_NIM_FALLBACK_LIMIT = 3swaps per turn so a fully-throttled tier can't busy-loop; after the cap, falls through to the standard exponential-backoff retry path. Disabled by settingnim_auto_fallback=Falsein config. Other providers (anthropic / openai / etc.) are not affected — the swap is gated bydetect_provider() == "nim". (c) Multi-model/brainstorm(commands/advanced.py). New--models a,b,cflag distributes models round-robin across personas (/brainstorm --models claude-opus-4-7,gpt-5,nim/deepseek-ai/deepseek-r1 redesign auth) so a 5-persona session alternates 1, 2, 3, 1, 2 instead of running every persona on the same model. Single-model brainstorm is an echo chamber — different model families have different training data and blind spots, so multi-model debate buys real epistemic diversity. Each persona's section in the output Markdown is tagged with the model that produced it (## 🏗️ Architect _(via gpt-5)_) so the synthesizer can weight by source. Borrowed in spirit from Dulus'sRoundtableAgent; the existing /brainstorm flow is unchanged when--modelsis omitted. Tests: 21 new pytest cases (tests/test_nim_provider.py12 +tests/test_brainstorm_models_flag.py9) covering provider registration, chain cycling (cycle-through + wraparound + unknown-model head fallback), 429 swap-then-succeed, fallback-cap-then-fallthrough, fallback-disabled honor, non-NIM no-leak, flag parsing across--models a,b,c/--models=a,b,c/ flag-at-end / provider-prefixed IDs / single model. 217 targeted regression tests pass, zero regressions. Skipped by design: ia-web-parser'sWebToolParser— Cheetahclaws' existing_extract_native_tool_callsalready covers 4 marker formats (Gemma official + asymmetric, Hermes, Mistral) plus channel-tagged form and args recovery, so the streaming-vs-buffered UX delta wasn't worth the duplication. -
May 8, 2026 (earlier): "Be agentic on every model" pass — explore-first prompt + qwen overlay + runtime auto-nudge. A user reported
cheetahclaws --model custom/qwen2.5-72breplying "please tell me which file you mean" when handed a directory path, instead of justls-ing it. Three coordinated defenses, layered so any one of them is enough to fix the failure mode on any model: (a)prompts/base/default.md— new "Investigate Before Asking" section + softened Stop Conditions. Every model now gets explicit "default to action over conversation" framing: a directory is not "missing information", it's an invitation to enumerate;AskUserQuestionis reserved for genuine post-exploration ambiguity (intent that nols/Glob/Read could disambiguate), never as a substitute for a tool call. (b)prompts/overlays/qwen.md— new family overlay (10 lines, cites the Qwen function-calling guide). Qwen / QwQ chat-tuned models hedge by default ("could you specify…"); the overlay overrides that with "treat every concrete noun the user names — path, filename, URL, function, command, error string — as an instruction to investigate it with a tool, not echo it back as a question." Registered in_OVERLAY_RULESfor allqwen/qwqmodel IDs regardless of runtime (DashScope / Ollama / vLLM / OpenRouter all match). (c)agent.pyruntime auto-nudge — model-agnostic safety net. New_looks_like_investigation()heuristic detects absolute-path tokens in the user message (URL-stripped to avoid false positives onhttps://host/path); if the heuristic fires AND the model's first reply is text-only with zero tool calls, the loop injects a one-shot[system reminder] use your tools, don't ask for what was givenmessage into history and continues. Bounded to one nudge perrun()invocation so it can never cause a loop — second text-only reply always falls through to break. The nudge fires on conversion to the OpenAI/Anthropic format as a normal user-role message and is invisible in the rendered UI (yielded events drive the display, notstate.messages). Tests: 13 new pytest cases (tests/test_agent_nudge.py) — heuristic positives/negatives across English + Chinese + URL-only + relative-path + bare greeting; loop integration via fake provider stream verifying nudge fires, doesn't fire without path, fires at most once. 89 prompt + 196 targeted regression tests pass, zero regressions. Docs updated:prompts/README.mdoverlay table + Known Gaps,docs/architecture.mdoverlays tree + agent-loop step (h),docs/contributor_guide.mdoverlay enumeration. The three layers compose: strong models (Claude/Gemini) read the new default rule but already behaved this way; mid-tier models (GPT/DeepSeek/Kimi) get a clearer prompt-level instruction; weak models (qwen2.5/QwQ) get prompt + overlay + runtime nudge stacked. Even on a model that ignores the prompt entirely, the runtime nudge gives one free retry before the user has to intervene. -
May 8, 2026 (earlier): Agent-OS layer (
cc_kernel/) reaches v1.0 — 27 RFCs shipped, 1771 tests passing, zero regressions on the legacy REPL/bridges path. What started as a daemon foundation (RFC 0001/0002) is now a single-node agent operating system: AgentProcess + EventLog (0003), Capability model (0005), per-agent ResourceLedger withfirst_breachsignal (0006), priority Scheduler with admission filter (0007), RLIMIT + bubblewrap Sandbox (0008), Mailbox + topic pub/sub (0009), AgentRegistry (0010), AgentFS unified VFS (0011), Observability + Prometheus exposition (0012), and a frozen 58-method JSON-RPC contract with CI drift guard (0013). On top of that substrate: F-4 Subprocess agent runner (0016), WorkerLoop scheduler↔supervisor glue (0017), Bridge mirror that wires Telegram/WeChat/Slack intokernel.mboxwithout touchingbridges/(0018), LLM runner MVP (0019), DialogueOrchestrator for multi-turn (0020), Tool Dispatch + Permission Routing (0021), LLM Tool Calling Integration (0022), defense-in-depth tools — Exec (argv-only, RLIMITed, env scrubbed; 0023), Glob+List (0024), Fetch (SSRF + DNS-rebind + redirect-leak defended; 0025) — three streaming layers (IPC chunks 0026, LLM token streaming 0027, Exec line streaming 0028, Fetch body streaming 0029), and three new built-in inspectors (Diff 0030, AST 0031, Git 0032). All kernel code lives incc_kernel/and is gated behind--enable-kernel— default CheetahClaws CLI / REPL / bridges / web UI are byte-for-byte unchanged. Operators introspect viacheetahclaws kernel summary | info | agents | proc <pid> | events | queue | registry | methods | prometheus. Kernel SQLite schema is forward-only (v1 → v7). RFC 0014 multi-tenant + RFC 0015 cluster remain explicitly parked. Full overview:docs/agent-os.md. Each design note indocs/RFC/. -
May 8, 2026: F-2/F-3 follow-ups + CI unblock (
feature/fix-f2). Two-commit branch on top of #101's daemon foundation (F-2 SQLite persistence + F-3 monitor in daemon). (a) CI unblock (fix(ci)). Main has been red since9c01237d(the trading-agent #99 merge) —tests/test_packaging.py::test_required_module_imports[modular.trading.ml](the regression test added for issue #97) caught thatmodular/trading/ml/features.pyandmodular/trading/portfolio.pyimport numpy at module top while numpy is in the[trading]extra, not core deps. Sopip install .(no extras) shipped a wheel whereimport modular.trading.mlblew up. PR #100 and #101 both inherited the red. Fix: deadimport numpy as npremoved fromfeatures.py;stacker.pydefers numpy to insidetrain()andpredict_proba()past the early-return paths so the diagnostic-only callers (train(too_few_rows),predict_proba(missing_model)) still work without the heavy stack;portfolio.pygates the numpy import behindtry/exceptso module import succeeds and runtime callers raise on first use as before.test_trading_advanced.pyandtest_trading_discovery.pygetpytest.mark.skipifmarkers on tests that genuinely need numpy / scipy / sklearn / pandas at runtime — skip cleanly on lean CI installs, run as before on full installs. Verified in a clean venv with only[web,autosuggest](the exact CI install): 1075 passed, 11 skipped; with[all]extras: 1086 passed, no regressions. (b) F-2/F-3 follow-ups (fix(daemon)). Five issues found during the #101 review that the merged code didn't address: (i)cc_daemon/cli.py:cmd_servestartedmonitor.scheduler.start(...)before the listener bound — order matters because if a due subscription fires before the daemon is reachable, an LLM/network error in fetch/summarize/deliver surfaces in the log before the user sees the listening line, and external clients can't yet act on the resultingmonitor_reportSSE event; moved past the bind + discovery write. (ii)monitor/scheduler.pyhad no defense against the daemon coming up after REPL/monitor startfired — both schedulers would race onlast_run_atand double-fire subscriptions; added_foreign_daemon_running()step-aside check at every loop tick (REPL-side instances bow out when a daemon registers ownership), withowned_by_daemon=Trueflag the daemon passes to opt out of the check on its own scheduler. (iii)EventBus.publishwassynchronous=FULL(SQLite default) → every event was anfsyncper commit, ~305 μs each; for streaming agent output (text_chunkevents at dozens/sec) that's a real disk-IO concern.cc_daemon/schema.pynow setsPRAGMA synchronous=NORMALon init + every thread-local connection — safe under WAL (only the most recent transactions can be lost on hard kernel crash, which for a 24h-pruned event log is fine), microbenchmark drops to 39 μs/publish (~8×). (iv) The PR description said the JSON files were "kept readable for one release as fallback", but no fallback read path actually exists —jobs.pyandmonitor/store.pymigration is fundamentally one-way once theschema_metamarker is set. Updated docstrings +docs/architecture.mdto make the one-way semantics explicit and tell users how to redo a migration if needed. (v)docs/RFC/0002-daemon-foundation-roadmap.mdF-2/F-3 marked OPEN → MERGED #101 + follow-ups (#fix-f2), with a new "Follow-ups" subsection under each. Branch:feature/fix-f2. -
May 8, 2026: Two production fixes — Gemma 4 native tool-call interceptor + issue #97 (
pip install .shipping a broken wheel). Two unrelated bugs that both blocked end users on the v3.1 release. (a) Gemma 4 native tool-call interceptor (providers.py). When users run cheetahclaws againstgemma-4-31B-itvia vLLM, the model emits its native<|tool_call>call:NAME{json}<tool_call|>format instead of the Hermes/JSON envelope vLLM's--tool-call-parser hermesexpects. vLLM doesn't recognise the format → leaves it indelta.content→ cheetahclaws yields it asTextChunk→ terminal shows raw<|tool_call>call:Research{topic:<\|"\|>...<\|"\|>}<tool_call\|>garbage instead of a coherent answer. The interceptor instream_openai_compatnow watches the streamed text for any of four native tool-call openers (Gemma official<|tool_call|>, Gemma 4 asymmetric<|tool_call>, Hermes<tool_call>, Mistral[TOOL_CALLS]); on detection it (i) yields the pre-marker text as a cleanTextChunk, (ii) stops yielding text and switches into buffer mode, (iii) at end-of-stream tries three parser branches against the buffer (Gemma'scall:NAME{json}, JSON envelope withname/arguments, Mistral's array form) and adds successful matches totool_calls. Also normalises Gemma's<|"|>→"quote escaping. If no parser matches, falls back to yielding the buffered raw text so users see something rather than a silent stall. Tests: 16 new pytest cases (tests/test_native_tool_intercept.py) covering marker detection (4 variants), 3 parser branches, robustness (empty buffer / unparseable garbage / multi-call buffer), and end-to-end streaming via mocked OpenAI client (verifies pre-marker text yielded as TextChunk +<|tool_call>tokens NOT in any TextChunk + tool_call appears in AssistantTurn). (b) Issue #97 —pip install .produces a broken wheel (pyproject.toml, deletedmemory.py,tests/test_packaging.py). Reported by @albertcheng on Windows + Python 3.13:cheetahclaws.execrashed at startup withModuleNotFoundError: No module named 'prompts'. Root cause: a name collision inpyproject.toml—memorywas listed in BOTHpy-modules(referring to a 11-line backward-compat shimmemory.pythat re-exports from thememory/package) ANDpackages(the realmemory/directory). Python's import system always prefers the package directory over a same-named .py file, so the shim was dead code; setuptools ≥ 75 on Windows treats this dual-registration as a hard error and silently drops unrelated packages from the wheel build — which is howprompts/went missing. Fix: deleted the deadmemory.pyshim, removedmemoryfrompy-modules, and replaced the manualpackages = [...]list with[tool.setuptools.packages.find]+ wildcardincludepatterns so future sub-packages auto-discover. This also caught a separate latent bug — the four sub-packages added in the v3.1 trading discovery layer (modular.trading.alt_data,modular.trading.broker,modular.trading.discover,modular.trading.ml) were missing from the manualpackages = [...]list and would have been excluded from production wheels even after a successful build. Tests: 29 new pytest cases (tests/test_packaging.py) — config sanity (no module/package name collision allowed;memory.pyshim must not be re-introduced; pyproject.toml must usefindnot manual list), discovery walk (every top-level dir with__init__.pyis reachable fromfind's include patterns or explicitly excluded), and the exact issue #97 failure reproduction (parametrised import test for 24 modules includingprompts,prompts.select, all four newmodular.trading.*sub-packages, and thecheetahclawsentry point — fails the build if any can't be imported). Verified locally: rebuilt wheel after fix contains all 31 packages includingprompts/and the four new sub-packages. 1005 passing (976 baseline + 16 native-tool-intercept + 29 packaging = 1005), zero regressions. CONTRIBUTING.md updated with explicit packaging discipline notes: never put a name in bothpy-modulesandpackages, sub-packages auto-discover viafind, only top-level packages need a newincludepattern. -
May 8, 2026 (later):
/tradingv3.1 — automatic candidate discovery + composite ranking + anomaly detector + market monitor with bridge alerts. Closes the biggest gap in v3: previously you had to feed the agent symbols (/trading analyze NVDA); now it actively scans a universe and finds candidates for you. Four orthogonal discovery scanners ship: (a)insider_cluster— SEC EDGAR Form 4 cluster detector, flags tickers with ≥3 officer / 10%-holder filings in 30 days, surfaces SEC URLs so user can verify direction; (b)earnings_beat— yfinance earnings_dates surprise extractor, requires ≥10% beat AND post-print continuation (filters out the pop-and-fade pattern); (c)momentum_quality— factor intersection over the newfactors.py(momentum = 6m return + 50d>200d trend confirmation; quality = ROE − 0.3·D/E + 2·op-margin; both min-max normalised + composite-scored); (d)sector_rotation— ranks SPDR Select sector ETFs by 1m+3m return, surfaces top holdings of the leaders. The orchestrator (discover/orchestrator.py) merges per-symbol hits across all four sources with weighted aggregation (insider 1.0, earnings 0.9, mom-qual 0.7, sector 0.5) AND a +0.5 confluence bonus when ≥2 sources flag the same ticker. New CLI:/trading discover [insider|earnings|momentum-quality|sector|all] [--universe sp100|sectors] [--add-watchlist N]— the--add-watchlistflag auto-promotes the top N hits to your watchlist for downstream/trading scan//trading analyze. New/trading rankcomposite-ranks candidates by 0.5×factor + 0.3×discovery + ±0.1 calibration-tilt; output is a triage table for "which names deserve a real/trading analyze". New/trading factors [SYMS]shows raw momentum/quality/low-vol scores with a 24h disk cache at~/.cheetahclaws/trading/factors_cache.json(S&P 100 takes ~1-2 min to scan, parallel ThreadPoolExecutor with 4 workers). New/trading anomaly [SYMS]runs three independent checks per ticker: volume spike (today vs 90d median ratio ≥ 2×), price gap (open vs prior close ≥ 3%), volatility regime z-score (5d realised vol vs 90d distribution ≥ 2σ). New/trading monitor scanruns one full monitoring cycle — anomaly detection + stop-loss/take-profit hits on open paper trades + earnings within 3 days for any held position + new SEC Form 4 filings since last scan (delta detection persisted in~/.cheetahclaws/trading/monitor_state.db);--notify [telegram] [slack] [wechat]dispatches structured alerts (severity-tagged: critical/warning/info) through cheetahclaws's existing bridge layer. Honest framing on "real-time" in the docs: yfinance is 15-20min delayed for free tier, so polling more often than every 5-10 min is wasted effort; three scheduling options documented (manual, external cron,/monitorintegration). Newuniverse.pyships hardcoded S&P 100 (~7-8% drift/year, refresh quarterly) + 11 SPDR Select sector ETFs + curated top-10 holdings per sector ETF for sector_rotation. The discovery layer also fixes a real gap in the system prompt: the LLM didn't know what/trading discoveretc. existed, so when users asked "can you find me good stocks" it confabulated; the dynamic_render_commands_blockfrom earlier session now picks up the new subcommands automatically. Tests: 21 new pytest cases intest_trading_discovery.pycovering universe resolution, factor scan + score with stubbed yfinance, insider cluster threshold logic, momentum-quality intersection, sector rotation top-sector picking, orchestrator multi-source merge + bonus, anomaly triple-check (volume/gap/vol-regime), ranker factor+discovery combination, monitor alert rendering + dispatch + end-to-end scan with stubbed market data. 960 passing (939 baseline + 21 new), zero regressions; golden system-prompt fixture regenerated. Honest disclaimer in PLUGIN.md and trading.md: discovery reduces search cost, not generates alpha — the named factors (momentum, value, quality) are well-known and largely priced in by quant funds; what users get is a 100-ticker → 15-ticker triage list to spend tokens on, plus structured discipline (anomaly detection, stop monitoring, earnings calendar) that's hard to do by hand. Form 4 transaction direction is NOT yet parsed from XML (we count filings, not buys vs sales); URLs included so user verifies in 5 seconds. Insider direction parsing is on the roadmap but requires reliable XML scraping of SEC archives across version drift. -
May 8, 2026:
/tradingv3 — paper-trade tracker, calibration, managed$Xportfolios, alt-data, MV optimizer, ML stacker, walk-forward, broker abstraction. A two-stage upgrade that turns the trading module from "ask LLM about a stock" into a measurable research substrate. Stage 1 (the discipline layer): every/trading analyzerecommendation is auto-recorded as a paper trade (~/.cheetahclaws/trading/paper_trades.db) — long and short signals account correctly./trading calibrationaggregates closed trades by confidence + signal and reports hit rate + mean return + a t-stat vs zero baseline; if 30+ closed trades show HIGH conviction not outperforming LOW, the agent's confidence label is noise and the diagnosis fires./trading verifyenforces hard risk rules (single-name 5% / sector 25% / total exposure 80% / stop 1-10% / earnings blackout 3 days → cap 2.5%) reading the live paper book — fixes the "LLM forgets its own rules" problem. The analyze prompt now auto-injects macro context (SPY/QQQ trend + VIX regime + 10y headwind, 30-min cached), earnings calendar warnings (🚨 if reporting within 7 days), and the current paper-book exposure so the LLM doesn't double-down on a sector already at 30%./trading walkforwardruns rolling out-of-sample chunks with a STABLE/MIXED/FRAGILE/INCONCLUSIVE verdict, replacing the dishonest aggregate backtest./trading scandoes a coarse heuristic sweep (RSI / 50d / 200d) over the watchlist before spending tokens on a real analyze. Stage 2 (the autonomous + alpha-research layer):/trading reviewruns a multi-agent debate on existing positions and emits structuredACTION ID=… DECISION=HOLD|ADD|TRIM|EXIT …rows for each./trading manage start hundred 100creates a virtual$100portfolio backed by a SQLite-cleanly-namespacedPaperBroker;/trading manage step hundredruns one mean-variance rebalance cycle (scipy SLSQP, long-only, single-name + sector caps),/trading manage report hundredprints a markdown PnL report with equity curve — this is the canonical "I give the agent $100, check in a week" workflow./trading optimizeexposes the same MV solver standalone. The alt-data layer auto-injects three sources LLM analysis can actually add value on: SEC EDGAR Form 4 insider transactions (urllib, no API key, free), LLM-scored yfinance news headlines via the auxiliary cheap model (-10..+10per headline aggregated to BULLISH/MIXED/BEARISH), and Google Trends search interest (soft-fails ifpytrendsnot installed). The broker layer has a tinyBrokerBackendprotocol with two backends —PaperBrokerworks out of the box,IBKRBrokeris a stub with full setup docs (pip install ib_insync+ IB Gateway config +connect()); the abstraction means switching from paper to live is one line when the user is ready./trading ml trainbuilds a LightGBM (or sklearnGradientBoostingClassifierfallback) classifier on closed paper trades — features are LLM signal one-hot + confidence ordinal + position size + stop / take profit + sector one-hot, label is "did this trade beat zero"; reports cross-validated AUC and feature importance, persists to~/.cheetahclaws/trading/ml/stacker.pkl. The_CMD_METAregistry is also auto-populated frommodular/-loaded commands now (closed a pre-existing bug where/trading,/video,/voice,/ttswere callable but invisible to/help, tab-completion, and the system-prompt slash-command index — the LLM literally couldn't see its own subcommands). Tests: 46 new pytest cases acrosstest_trading_pipeline.pyandtest_trading_advanced.pycovering paper-trader CRUD, long/short PnL math, Phase-5 parser permissiveness, calibration aggregation, verifier 8-branch enforcement, macro/earnings/insider/sentiment/trends soft-fail behavior, MV optimizer constraints, broker buy/sell/avg-cost round-trip, IBKR stub setup-required diagnostic, end-to-end$100→step→status→report lifecycle with mocked quotes, ML feature engineering + train + predict, and the position-review prompt format. 939 passing (893 baseline + 46 new), zero regressions; golden system-prompt fixture regenerated. Also fixed a banner-rendering bug where the welcome box's right border was missing on every middle line (cheetahclaws.pynow computes inner width from plain-text length and pads each row to close with│regardless of model-name length). Honest disclaimer in the docs and PLUGIN.md: this is a research and discipline tool, not a money printer — public-data + LLM analysis does not have predictive edge over quant funds; the value is information aggregation, programmatic risk discipline, and empirical accountability. Run paper for ≥3 months with green calibration + walk-forward before considering an IBKR live account; small accounts (<$1k) have unfavorable fixed-cost economics in real life regardless of strategy. -
May 7, 2026:
/themeslash command — 15 console color presets + post-merge UX fixes (PRs #92, follow-up). Adds a curated palette system toui/render.pyand a new/themecommand:/themelists all 15 presets (default,dracula,nord,gruvbox,solarized,tokyo-night,catppuccin,matrix,synthwave,midnight,ocean,monokai,cheetah,mono,none); each row renders aninfo / ok / warn / errswatch in the row's own theme colors so the listing is a real palette preview, not 15 identical lines in the current theme./theme <name>mutates the sharedCANSI dict in-place so every existingclr() / info() / ok() / warn() / err()call site (~25 files) switches palette without touching any call site, and persists the choice viasave_config()so the next launch re-applies it (early incheetahclaws.py:main(), before the first output).- Per-theme color roles. Each palette declares 4 semantic colors —
accent(info / cyan / blue),ok(success / green / diff +),warn(yellow / magenta),err(red / diff -) — plus a Richcodestyle. Picking 4 hexes per theme meansinfo()andok()are always visually distinguishable, andrender_diffkeeps semantic colors (green = add, red = remove) under every theme. The original PR collapsed cyan/green/blue to a single accent color, makinginfo()andok()indistinguishable and turning diff additions into the accent color (purple under dracula, yellow under gruvbox, magenta under synthwave) — the follow-up split them apart. CODE_THEMEis now actually consumed._make_renderable()inui/render.pypassescode_theme=CODE_THEMEtorich.markdown.Markdown, so Rich code-block syntax highlighting tracks the active theme (the original PR setCODE_THEMEbut never plumbed it through — it was dead code).nonetheme is genuinely uncolored (clears every key inC, includingreset, to""soclr()returns plain text).monois genuinely grayscale (4 distinct gray levels for accent/ok/warn/err — the original PR hardcodedC["red"] = "\033[38;5;196m"regardless of theme, breaking both).- Tests: 9 new pytest cases (
tests/test_theme.py) covering schema validation, unknown-theme rejection,info/okdistinguishability across all themes, diff-color distinguishability,none-as-plain-text,CODE_THEMEtracking,apply_themeidempotency across state, and the Rich Markdowncode_themeround-trip. 893 passing, zero regressions on the 884 pre-existing.