feat: streaming benchmark suite — PromptKit vs LangChain vs Strands #915

Merged
chaholl merged 24 commits into main from feat/benchmark-suite on Apr 7, 2026

@chaholl chaholl commented Apr 7, 2026

Summary

Reproducible benchmark suite comparing PromptKit's LLM streaming performance against LangChain and Strands Agents (AWS AgentCore preferred runtime).

  • Mock upstream servers — OpenAI SSE, Deepgram STT WebSocket, Cartesia TTS WebSocket with configurable latency profiles
  • Client harness — concurrent load driver with per-request timing, percentiles, jitter, RSS/CPU sampling, JSON/markdown/CSV output
  • Framework implementations — minimal, idiomatic wrappers for PromptKit (Go/SDK), LangChain (Python/FastAPI), Strands Agents (Python/FastAPI), Pipecat (Python/voice pipeline)
  • Docker Compose + Makefile orchestration for one-command reproducible runs

Local benchmark results (darwin/arm64 14 cores, fast profile)

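| Concurrent | PromptKit rps | PromptKit p50 | LangChain rps | LangChain p50 | Strands rps | Strands p50 |
|---|---|---|---|---|---|---|
| 100 | 889 | 107ms | 319 | 316ms | 185 | 540ms |
| 1000 | 921 | 1.06s | 192 | 5.23s | 175 | 5.19s |
| 5000 | 922 | 5.32s | 2.9 | 44.7s | 122 | 17.6s |
| 10000 | 920 | 10.7s | 1.7 | 1m40s | 115 | 14.9s |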

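PromptKit sustains ~920 rps from 100 to 10k concurrent, with p50 latency growing roughly in proportion to concurrency. LangChain collapses at 2.5k concurrent. Strands degrades gracefully but peaks at ~193 rps (about 4.8x lower throughput than PromptKit, and 7.7x less memory-efficient per rps).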

Test plan

  • All 22 Go tests pass (benchmarks/mockupstream/ + benchmarks/harness/)
  • Mock upstream smoke tested locally (all 3 protocols)
  • Harness smoke tested against mock upstream
  • PromptKit Round 1 benchmarked at 10 to 25k concurrent
  • LangChain Round 1 benchmarked at 10 to 10k concurrent
  • Strands Agents Round 1 benchmarked at 10 to 10k concurrent
  • Docker Compose end-to-end (not yet validated — local runs used direct processes)
  • Round 2 voice pipeline (Pipecat) — deferred to follow-up

chaholl added 24 commits April 7, 2026 12:33
Add benchmarks/ as a new Go module in the workspace with:
- go.mod declaring module github.com/AltairaLabs/PromptKit/benchmarks
- gorilla/websocket v1.5.3 and gopkg.in/yaml.v3 v3.0.1 dependencies
- Makefile with build-mock, build-harness, round1, round2, all, clean targets
- results/.gitkeep placeholder directory
- .gitignore entries for *.json, *.csv, *.md result artifacts
Implements Profile, OpenAIProfile, STTProfile, TTSProfile structs with
YAML tags and time.Duration fields. Provides DefaultProfile() and
LoadProfile(path) with full test coverage. Adds fast.yaml (10ms delays)
and realistic.yaml (200ms first-chunk, 30ms inter-chunk) profiles.

Also fixes unused "net/http" import in stt_ws_test.go.
Implements NewTTSHandler(cfg TTSProfile) http.Handler that upgrades HTTP
to WebSocket at /tts/ws, reads a Cartesia-compatible synthesis request,
waits FirstByteDelay, streams ceiling(32000/ChunkSize) full binary audio
chunks (each exactly ChunkSize bytes), then sends JSON {"type":"done"}.

Includes TestTTSWebSocket_StreamsAudio and TestTTSWebSocket_FirstByteDelay.
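
The chunk count above is a ceiling division. A small sketch of that arithmetic, using integer math only; `chunkCount` is a hypothetical name and the chunk sizes in `main` are arbitrary examples:

```go
package main

import "fmt"

// totalAudioBytes is the 32000-byte payload size named in the commit message.
const totalAudioBytes = 32000

// chunkCount returns ceil(totalBytes / chunkSize) without floating point.
// A non-positive chunk size yields zero chunks.
func chunkCount(totalBytes, chunkSize int) int {
	if chunkSize <= 0 {
		return 0
	}
	return (totalBytes + chunkSize - 1) / chunkSize
}

func main() {
	for _, size := range []int{640, 1000, 4096} {
		n := chunkCount(totalAudioBytes, size)
		// Every chunk is full-size, so the bytes sent can exceed 32000
		// when the chunk size does not divide it evenly.
		fmt.Printf("ChunkSize=%d: %d chunks, %d bytes sent\n", size, n, n*size)
	}
}
```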
Implements NewSTTHandler(STTProfile) http.Handler at /v1/listen with
Deepgram-compatible Results events. Read loop accepts binary audio frames
and CloseStream JSON; write loop emits interim transcripts on a ticker
and guarantees at least one interim before sending the final transcript.
Also declares the shared wsUpgrader used by all WebSocket handlers.
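
The "at least one interim before the final" guarantee can be sketched as a select loop that counts the interims it has sent. This is an assumption about one way to structure it, not the actual handler; `writeLoop` and its `emit` callback are hypothetical, and WebSocket I/O is left out.

```go
package main

import (
	"fmt"
	"time"
)

// writeLoop emits an "interim" event on every tick. When done is closed
// (standing in for a CloseStream message), it sends one interim if none has
// gone out yet, then the "final" event, and returns.
func writeLoop(ticks <-chan time.Time, done <-chan struct{}, emit func(kind string)) {
	interims := 0
	for {
		select {
		case <-ticks:
			emit("interim")
			interims++
		case <-done:
			if interims == 0 {
				emit("interim") // uphold the at-least-one-interim guarantee
			}
			emit("final")
			return
		}
	}
}

func main() {
	ticks := make(chan time.Time) // unbuffered: each send waits for the loop
	done := make(chan struct{})

	// Feed two ticks, then signal end of stream.
	go func() {
		ticks <- time.Now()
		ticks <- time.Now()
		close(done)
	}()

	writeLoop(ticks, done, func(kind string) { fmt.Println(kind) })
}
```

In a real handler `ticks` would be the `C` channel of a `time.Ticker`; a manually fed channel is used here so the event order is deterministic.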
wsUpgrader is already declared in tts_ws.go (added by the TTS task).
Remove the redundant declaration to avoid a redeclaration build error.
Implements BenchmarkReport and TierResult types with WriteJSON (indented),
RenderMarkdown (table with Framework/Concurrent/p50/p99/Throughput/RSS columns),
and WriteCSV (header + one row per tier result). Round-trip and content tests included.
…ation

Wraps the PromptKit SDK behind an OpenAI-compatible HTTP endpoint so the
benchmark harness can measure it identically to LangChain and Pipecat.
Add docker-compose.yaml with profiled services for round1 (LLM
streaming) and round2 (voice pipeline), using repo root as build
context so Dockerfiles can access the full monorepo.

Add Dockerfiles for mockupstream and harness using multi-stage builds.

Replace stub Makefile with full orchestration: tiered concurrency
loops for both rounds, help target, and local build/clean targets.
- mockupstream/stt_ws.go: handle Pipecat's KeepAlive (ignored) and
  Finalize (emit final transcript without closing) messages in the
  STT WebSocket read loop; accept any URL path/query-params so the
  Deepgram SDK's decorated handshake URLs connect cleanly
- mockupstream/tts_ws.go: auto-detect Pipecat/Cartesia request format
  (transcript + voice.id) and respond with base64-encoded JSON chunk
  frames instead of raw binary; simple protocol unchanged
- mockupstream/stt_ws_test.go: add TestSTTWebSocket_KeepAliveIgnored
  and TestSTTWebSocket_FinalizeTriggersImmediateFinal
- mockupstream/tts_ws_test.go: add TestTTSWebSocket_PipecatProtocol
- benchmarks/frameworks/promptkit/round2: new standalone Go WebSocket
  server that coordinates STT→LLM→TTS using stdlib + gorilla/websocket
  (no SDK dependency), measuring raw Go runtime performance
- go.work: add round2 module
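
The TTS auto-detection described above (a request carrying `transcript` and `voice.id` gets base64 JSON frames) might look roughly like this. The frame field names `type` and `data` are assumptions, as are the helper names; only the `transcript` and `voice.id` keys come from the commit message.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
)

// ttsRequest captures only the fields used for format detection.
type ttsRequest struct {
	Transcript string `json:"transcript"`
	Voice      struct {
		ID string `json:"id"`
	} `json:"voice"`
}

// isCartesiaFormat reports whether raw looks like a Pipecat/Cartesia-style
// request, meaning both transcript and voice.id are present and non-empty.
func isCartesiaFormat(raw []byte) bool {
	var req ttsRequest
	if err := json.Unmarshal(raw, &req); err != nil {
		return false
	}
	return req.Transcript != "" && req.Voice.ID != ""
}

// encodeChunk wraps audio bytes in a JSON frame with a base64 payload.
// The "type" and "data" keys are hypothetical.
func encodeChunk(audio []byte) ([]byte, error) {
	return json.Marshal(map[string]string{
		"type": "chunk",
		"data": base64.StdEncoding.EncodeToString(audio),
	})
}

func main() {
	cartesia := []byte(`{"transcript":"hello","voice":{"id":"v1"}}`)
	simple := []byte(`{"text":"hello"}`)
	fmt.Println(isCartesiaFormat(cartesia), isCartesiaFormat(simple))

	frame, err := encodeChunk([]byte{1, 2, 3})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(string(frame))
}
```

Detecting the format from the request body, rather than from the URL, fits the same commit's choice to accept any path and query parameters.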
…ipeline

Replace Pipecat framework-specific implementation with a raw Python
asyncio equivalent that coordinates STT → LLM → TTS using the same
WebSocket protocol as the PromptKit round2 server. This measures
Python's async runtime overhead for voice pipeline coordination —
the fairest comparison: same protocol, same pipeline logic, different
language runtime.
…iver

- Real Pipecat framework using FastAPIWebsocketTransport for multi-client
- Protobuf frame serialization (AudioRawFrame) matching Pipecat's wire format
- Generated Go protobuf bindings from Pipecat's frames.proto
- Updated mock TTS to include context_id for Cartesia protocol compat
- Updated mock STT/TTS to use catch-all handlers for SDK URL decoration
- Added python-asyncio as separate framework (renamed from pipecat)
- Harness round2-pipecat mode with 440Hz sine wave audio for VAD triggering
…entations

- GenKit: Google's Go-based AI framework, OpenAI-compatible plugin pointing
  at mock upstream. Benchmarked Round 1 at 10-10k concurrent.
- LiveKit Agents: Python voice pipeline framework with fake I/O pattern
  (bypasses LiveKit server). FastAPI WebSocket wrapper for harness compat.
  Round 2 deferred — needs mock HTTP STT/TTS endpoints (/audio/transcriptions,
  /audio/speech) which aren't implemented yet.
- Renamed python-asyncio framework (was incorrectly in pipecat/ directory)
…sults

- Mock upstream now serves /v1/audio/transcriptions (Whisper-compatible)
  and /v1/audio/speech (TTS) on the OpenAI port for LiveKit Agents compat
- LiveKit Agents benchmarked at 1-100 concurrent voice sessions
- Remove GenKit from Round 1 comparison (easily optimizable HTTP client config)
sonarqubecloud Bot commented Apr 7, 2026

@chaholl chaholl merged commit 1a508b8 into main Apr 7, 2026
32 checks passed
@chaholl chaholl deleted the feat/benchmark-suite branch April 18, 2026 17:49