feat: streaming benchmark suite — PromptKit vs LangChain vs Strands#915
Merged
Add benchmarks/ as a new Go module in the workspace with:
- go.mod declaring module github.com/AltairaLabs/PromptKit/benchmarks
- gorilla/websocket v1.5.3 and gopkg.in/yaml.v3 v3.0.1 dependencies
- Makefile with build-mock, build-harness, round1, round2, all, clean targets
- results/.gitkeep placeholder directory
- .gitignore entries for *.json, *.csv, *.md result artifacts
Implements Profile, OpenAIProfile, STTProfile, TTSProfile structs with YAML tags and time.Duration fields. Provides DefaultProfile() and LoadProfile(path) with full test coverage. Adds fast.yaml (10ms delays) and realistic.yaml (200ms first-chunk, 30ms inter-chunk) profiles. Also fixes unused "net/http" import in stt_ws_test.go.
Implements NewTTSHandler(cfg TTSProfile) http.Handler that upgrades HTTP
to WebSocket at /tts/ws, reads a Cartesia-compatible synthesis request,
waits FirstByteDelay, streams ceiling(32000/ChunkSize) full binary audio
chunks (each exactly ChunkSize bytes), then sends JSON {"type":"done"}.
Includes TestTTSWebSocket_StreamsAudio and TestTTSWebSocket_FirstByteDelay.
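The chunk count above is a ceiling division over a 32000-byte payload. A minimal sketch of that arithmetic follows; the function name `chunkCount` and the sample chunk sizes are hypothetical. Because every chunk is sent at full `ChunkSize`, the bytes on the wire can exceed 32000 when the size does not divide evenly.

```go
package main

import "fmt"

// totalAudioBytes is the payload size named in the commit message.
const totalAudioBytes = 32000

// chunkCount returns ceiling(total/chunkSize) using integer arithmetic.
// It returns 0 for a non-positive chunk size instead of dividing by zero.
func chunkCount(total, chunkSize int) int {
	if chunkSize <= 0 {
		return 0
	}
	return (total + chunkSize - 1) / chunkSize
}

func main() {
	for _, size := range []int{1000, 3200, 4096} {
		n := chunkCount(totalAudioBytes, size)
		// Each chunk is exactly size bytes, so n*size bytes are sent in total.
		fmt.Printf("ChunkSize=%d -> %d chunks, %d bytes sent\n", size, n, n*size)
	}
}
```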
Implements NewSTTHandler(STTProfile) http.Handler at /v1/listen with Deepgram-compatible Results events. Read loop accepts binary audio frames and CloseStream JSON; write loop emits interim transcripts on a ticker and guarantees at least one interim before sending the final transcript. Also declares the shared wsUpgrader used by all WebSocket handlers.
wsUpgrader is already declared in tts_ws.go (added by the TTS task). Remove the redundant declaration to avoid a redeclaration build error.
Implements BenchmarkReport and TierResult types with WriteJSON (indented), RenderMarkdown (table with Framework/Concurrent/p50/p99/Throughput/RSS columns), and WriteCSV (header + one row per tier result). Round-trip and content tests included.
…ation Wraps the PromptKit SDK behind an OpenAI-compatible HTTP endpoint so the benchmark harness can measure it identically to LangChain and Pipecat.
Add docker-compose.yaml with profiled services for round1 (LLM streaming) and round2 (voice pipeline), using repo root as build context so Dockerfiles can access the full monorepo. Add Dockerfiles for mockupstream and harness using multi-stage builds. Replace stub Makefile with full orchestration: tiered concurrency loops for both rounds, help target, and local build/clean targets.
- mockupstream/stt_ws.go: handle Pipecat's KeepAlive (ignored) and Finalize (emit final transcript without closing) messages in the STT WebSocket read loop; accept any URL path/query-params so the Deepgram SDK's decorated handshake URLs connect cleanly
- mockupstream/tts_ws.go: auto-detect Pipecat/Cartesia request format (transcript + voice.id) and respond with base64-encoded JSON chunk frames instead of raw binary; simple protocol unchanged
- mockupstream/stt_ws_test.go: add TestSTTWebSocket_KeepAliveIgnored and TestSTTWebSocket_FinalizeTriggersImmediateFinal
- mockupstream/tts_ws_test.go: add TestTTSWebSocket_PipecatProtocol
- benchmarks/frameworks/promptkit/round2: new standalone Go WebSocket server that coordinates STT→LLM→TTS using stdlib + gorilla/websocket (no SDK dependency), measuring raw Go runtime performance
- go.work: add round2 module
…ipeline Replace Pipecat framework-specific implementation with a raw Python asyncio equivalent that coordinates STT → LLM → TTS using the same WebSocket protocol as the PromptKit round2 server. This measures Python's async runtime overhead for voice pipeline coordination — the fairest comparison: same protocol, same pipeline logic, different language runtime.
…iver
- Real Pipecat framework using FastAPIWebsocketTransport for multi-client
- Protobuf frame serialization (AudioRawFrame) matching Pipecat's wire format
- Generated Go protobuf bindings from Pipecat's frames.proto
- Updated mock TTS to include context_id for Cartesia protocol compat
- Updated mock STT/TTS to use catch-all handlers for SDK URL decoration
- Added python-asyncio as separate framework (renamed from pipecat)
- Harness round2-pipecat mode with 440Hz sine wave audio for VAD triggering
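A 440Hz test tone like the one used for VAD triggering can be generated in a few lines. This is a sketch under stated assumptions: the 16 kHz sample rate, the 16-bit sample format, the 320-sample frame length, the amplitude, and the function name `sineWavePCM` are all chosen for illustration; only the 440Hz frequency comes from the commit message.

```go
package main

import (
	"fmt"
	"math"
)

// sineWavePCM returns numSamples signed 16-bit PCM samples of a sine tone.
// amplitude must stay within the int16 range (at most 32767).
func sineWavePCM(freqHz float64, sampleRate, numSamples int, amplitude float64) []int16 {
	samples := make([]int16, numSamples)
	for n := range samples {
		phase := 2 * math.Pi * freqHz * float64(n) / float64(sampleRate)
		samples[n] = int16(amplitude * math.Sin(phase))
	}
	return samples
}

func main() {
	// 320 samples at an assumed 16 kHz rate is one 20 ms frame.
	frame := sineWavePCM(440, 16000, 320, 16000)
	fmt.Println(len(frame), frame[0])
}
```

A steady tone is presumably used because silence would not trip voice activity detection, so the pipeline would never start a turn.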
…entations
- GenKit: Google's Go-based AI framework, OpenAI-compatible plugin pointing at mock upstream. Benchmarked Round 1 at 10-10k concurrent.
- LiveKit Agents: Python voice pipeline framework with fake I/O pattern (bypasses LiveKit server). FastAPI WebSocket wrapper for harness compat. Round 2 deferred: needs mock HTTP STT/TTS endpoints (/audio/transcriptions, /audio/speech), which aren't implemented yet.
- Renamed python-asyncio framework (was incorrectly in pipecat/ directory)
…sults
- Mock upstream now serves /v1/audio/transcriptions (Whisper-compatible) and /v1/audio/speech (TTS) on the OpenAI port for LiveKit Agents compat
- LiveKit Agents benchmarked at 1-100 concurrent voice sessions
- Remove GenKit from Round 1 comparison (easily optimizable HTTP client config)



Summary
Reproducible benchmark suite comparing PromptKit's LLM streaming performance against LangChain and Strands Agents (AWS AgentCore preferred runtime).
Local benchmark results (darwin/arm64, 14 cores, fast profile)
PromptKit sustains ~920 rps from 100 to 10k concurrent. LangChain collapses at 2.5k concurrent. Strands degrades gracefully but peaks at ~193 rps, which is 4.8x lower throughput than PromptKit and 7.7x less memory-efficient per rps.
Test plan
benchmarks/mockupstream/+benchmarks/harness/)