feat: streaming LLM responses #84

Open
TumCucTom wants to merge 10 commits into MiniMax-AI:main from TumCucTom:feat/streaming-llm

Conversation


@TumCucTom TumCucTom commented Apr 5, 2026

Summary

  • Implement generate_stream() async generator method across all LLM clients (Anthropic, OpenAI)
  • Add StreamChunk schema with chunk types: thinking, content, tool_call_delta, tool_call_complete, done
  • Add _run_step_stream() in agent.py that streams thinking and content live
  • Add --no-stream CLI flag to disable streaming
  • Force unbuffered stdout (sys.stdout.reconfigure(line_buffering=False)) for real-time terminal output
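The StreamChunk schema with its five chunk types could be sketched roughly as below; only the type names come from this PR, and the other field names are assumptions for illustration:

```python
from dataclasses import dataclass
from typing import Literal, Optional

# The five chunk types listed in the PR summary.
ChunkType = Literal["thinking", "content", "tool_call_delta", "tool_call_complete", "done"]

@dataclass
class StreamChunk:
    """Hypothetical sketch of a partial-response chunk; `text` and
    `tool_call` are assumed field names, not confirmed by the PR."""
    type: ChunkType
    text: str = ""                     # partial thinking/content text
    tool_call: Optional[dict] = None   # populated on tool_call_complete

chunk = StreamChunk(type="content", text="Hello")
```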

Test plan

  • 12 streaming unit tests passing
  • Manual testing confirms live output appears progressively

Closes #71

TumCucTom and others added 10 commits on April 5, 2026 at 18:14
Add generate_stream method to LLM clients for streaming responses:

- Add StreamChunk schema for partial response chunks (thinking, content,
  tool_call_delta, tool_call_complete, done)
- Add generate_stream abstract method to LLMClientBase
- Implement streaming in OpenAIClient via chat.completions.create(stream=True)
- Implement streaming in AnthropicClient via messages.stream()
- Add generate_stream to LLMClient wrapper
- Buffer partial tool calls and emit tool_call_complete when JSON is complete
- Add tests for streaming functionality

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
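The "buffer partial tool calls and emit tool_call_complete when JSON is complete" strategy can be illustrated with a minimal standalone helper (the function name and plumbing are hypothetical; the real clients wire this into their event loops):

```python
import json

def buffer_tool_call_deltas(deltas):
    """Accumulate partial tool-call JSON fragments and return the parsed
    arguments once the buffered string becomes valid JSON; returns None
    if the stream ends while the JSON is still incomplete."""
    buf = ""
    for fragment in deltas:
        buf += fragment
        try:
            return json.loads(buf)   # complete: emit tool_call_complete
        except json.JSONDecodeError:
            continue                 # still partial: keep buffering
    return None

args = buffer_tool_call_deltas(['{"path": "a', '.txt", "mode": ', '"read"}'])
# args == {"path": "a.txt", "mode": "read"}
```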
…back

- Add `stream` parameter to Agent (default True) enabling real-time
  token output in the agent loop
- Refactor run() to dispatch to _run_step_stream() or _run_step_nonstream()
- Extract tool call execution into _execute_tool_calls() helper
- Streaming: buffers thinking/content, emits tool_call_complete when JSON
  parses, executes all tools after done event
- Add --no-stream CLI flag to disable streaming and use generate()
- All 12 streaming tests pass

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
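The dispatch described above (a `stream` flag defaulting to True, with `run()` choosing between the streaming and non-streaming step) might look like this; the method bodies here are stand-ins, since the real agent calls the LLM client and executes tools:

```python
class Agent:
    """Minimal sketch of the streaming dispatch; not the actual agent."""

    def __init__(self, stream: bool = True):
        self.stream = stream

    def run(self, prompt: str) -> str:
        # Dispatch to the streaming or non-streaming step implementation.
        if self.stream:
            return self._run_step_stream(prompt)
        return self._run_step_nonstream(prompt)

    def _run_step_stream(self, prompt: str) -> str:
        return f"streamed: {prompt}"      # placeholder body

    def _run_step_nonstream(self, prompt: str) -> str:
        return f"non-streamed: {prompt}"  # placeholder body
```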
The Anthropic SDK sends:
- type="text"/"thinking" as TOP-LEVEL event types (content in event.text/event.thinking)
- type="content_block_delta" wraps content in block.delta.text / block.delta.thinking
- type="signature" events carry no streamable content

Also fix unit test mocks to match real SDK event structure.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
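Based on the event shapes described in this commit, a normalizer over these three cases might look as follows (attribute access patterns are assumptions inferred from the commit message, not the SDK's documented API):

```python
from types import SimpleNamespace

def extract_text(event):
    """Map a stream event to (kind, text), or None for events with no
    streamable content, per the three shapes listed above."""
    if event.type == "text":
        return ("content", event.text)        # top-level text event
    if event.type == "thinking":
        return ("thinking", event.thinking)   # top-level thinking event
    if event.type == "content_block_delta":
        delta = event.delta                   # content wrapped in a delta
        if getattr(delta, "text", None) is not None:
            return ("content", delta.text)
        if getattr(delta, "thinking", None) is not None:
            return ("thinking", delta.thinking)
    return None  # e.g. type == "signature": nothing to stream

extract_text(SimpleNamespace(type="text", text="hi"))  # ("content", "hi")
```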
…tput

- Add sys.stdout.reconfigure(line_buffering=False) in cli.py to force
  unbuffered output at the file descriptor level
- Use sys.stdout.write() + flush() instead of print() in the
  streaming loop in agent.py for immediate character-by-character display
- This ensures thinking and content chunks appear live rather than
  buffering until the stream completes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
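The write-and-flush pattern can be shown with a small helper (hypothetical name; the real loop inlines this in agent.py). Passing an explicit stream makes it testable with a StringIO:

```python
import io
import sys

def emit(chunk_text, out=None):
    """Write a chunk and flush immediately so it appears live, rather
    than waiting on print()'s line buffering."""
    out = out or sys.stdout
    out.write(chunk_text)
    out.flush()

buf = io.StringIO()
for piece in ["Thin", "king", "..."]:
    emit(piece, out=buf)
# buf.getvalue() == "Thinking..."
```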
MiniMax API sends BOTH top-level text/thinking events AND
content_block_delta events with identical content. Previously both
were yielded, creating duplicate chunks.

Fix: use top-level text/thinking events only (they arrive first
with complete content), skip content_block_delta for text/thinking,
use content_block_delta only for tool_use blocks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
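The dedup rule (prefer top-level text/thinking events, drop their duplicate content_block_delta counterparts, keep deltas only for tool_use blocks) can be sketched with plain dicts standing in for SDK event objects:

```python
def dedupe_events(events):
    """Yield (kind, text) once per chunk, applying the dedup fix above.
    Events are illustrative dicts; the real code consumes SDK objects."""
    for ev in events:
        if ev["type"] in ("text", "thinking"):
            yield (ev["type"], ev["value"])           # authoritative copy
        elif ev["type"] == "content_block_delta":
            if ev.get("block_type") == "tool_use":
                yield ("tool_call_delta", ev["value"])
            # text/thinking deltas duplicate the top-level events: skip

events = [
    {"type": "text", "value": "hi"},
    {"type": "content_block_delta", "block_type": "text", "value": "hi"},
    {"type": "content_block_delta", "block_type": "tool_use", "value": "{"},
]
# list(dedupe_events(events)) == [("text", "hi"), ("tool_call_delta", "{")]
```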


Development

Successfully merging this pull request may close these issues.

Any plans to support streaming output for LLM responses?

1 participant