release: 0.5.3 — latency pass + observability + parity #76
Merged
Pre-release housekeeping pass across both SDKs.

- Tightened public API surface and removed unused code paths
- Aligned Python and TypeScript exports for full parity
- Refactored logging to use the SDK's own logger namespace
- Hardened Telnyx webhook signature verification and audio pipeline
- Refreshed Mintlify docs, READMEs, and CHANGELOG for 0.5.3
- Rewrote pricing tables and dropped stale references
- Trimmed examples to working stub redirects only
- Standard dependency hygiene + version bump 0.5.2 → 0.5.3

- Add `OpenAITranscribeSTT` as a first-class STT class for OpenAI's `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` models. Reuses the Whisper transcription infrastructure (same `/v1/audio/transcriptions` endpoint) but defaults to `gpt-4o-transcribe` and rejects `whisper-1` for clarity.
- Type the ElevenLabs `model_id` field with a `Literal` / union so `eleven_v3` (newest, highest quality), `eleven_flash_v2_5` (current default, fastest TTFT), `eleven_turbo_v2_5`, `eleven_multilingual_v2`, and `eleven_monolingual_v1` are all surfaced via autocomplete. Custom model strings remain accepted for forward-compat.
- Public exports + tests added in both Python and TypeScript SDKs.
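
A minimal sketch of the `Literal` / union typing described above, assuming a plain type alias is enough to illustrate the idea. The names `ElevenLabsModel`, `ModelId`, and `is_known_model` are hypothetical and not taken from the SDK; only the five model IDs come from this change.

```python
from typing import Literal, Union, get_args

# Model IDs listed in this change; editors can offer these via autocomplete.
ElevenLabsModel = Literal[
    "eleven_v3",
    "eleven_flash_v2_5",
    "eleven_turbo_v2_5",
    "eleven_multilingual_v2",
    "eleven_monolingual_v1",
]

# Known IDs plus any other string, so newer models keep working (forward-compat).
ModelId = Union[ElevenLabsModel, str]

KNOWN_MODELS = frozenset(get_args(ElevenLabsModel))


def is_known_model(model_id: ModelId) -> bool:
    """Return True when the ID is one of the models named in the Literal."""
    return model_id in KNOWN_MODELS
```

Type checkers collapse `Union[Literal[...], str]` to `str` for validation purposes, so the union documents the known values without rejecting custom strings.
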
Cerebras hardening pass: keeps the OpenAI-compat layer (single auth/retry path, smaller dep tree) but closes the gaps documented in their public inference docs.

- Default model bumped to `gpt-oss-120b`: production tier, no deprecation date, highest WSE-3 throughput in the catalog.
- TypeScript: enable gzip request compression by default (Python already had it on). Reduces TTFT on prompts >2 KB.
- TypeScript: add 1 retry with exponential backoff on 5xx and 429, honoring `x-ratelimit-reset-tokens-minute` / `x-ratelimit-reset-requests-minute` / `retry-after` headers. Terminal failures throw `PatterError` so the LLM loop can fall back rather than silently yielding nothing.
- Forward `response_format` (JSON mode + structured outputs with strict schema), `parallel_tool_calls`, `tool_choice`, `seed`, `top_p`, `frequency_penalty`, `presence_penalty`, `stop`: all OpenAI-standard sampling kwargs that Cerebras supports but we previously dropped.
- Send `max_completion_tokens` on the wire (the current Cerebras spec) while still accepting `max_tokens`/`maxTokens` from the user.
- Add `User-Agent: getpatter/<version>` header.

Same surface mirrored on Groq for parity (it shares the OpenAI-compat wire format).
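
The retry rule above (retry on 5xx and 429, prefer a server hint over plain backoff) can be sketched roughly as follows. This is an illustration in Python rather than the TypeScript implementation, and it assumes the header values are plain numbers of seconds; the header precedence, the base delay, and the cap are all assumptions, and `should_retry` / `retry_delay_s` are hypothetical names.

```python
from typing import Mapping

# Headers named in this change; the lookup order is an assumption.
RATE_LIMIT_HEADERS = (
    "retry-after",
    "x-ratelimit-reset-tokens-minute",
    "x-ratelimit-reset-requests-minute",
)


def should_retry(status: int) -> bool:
    """Retry only on rate limiting (429) and server errors (5xx)."""
    return status == 429 or 500 <= status <= 599


def retry_delay_s(
    headers: Mapping[str, str],
    attempt: int,
    base_s: float = 0.25,  # assumed base delay
    cap_s: float = 5.0,    # assumed upper bound
) -> float:
    """Use a numeric server hint when present, else exponential backoff."""
    lowered = {name.lower(): value for name, value in headers.items()}
    for name in RATE_LIMIT_HEADERS:
        raw = lowered.get(name)
        if raw is None:
            continue
        try:
            hinted = float(raw)
        except ValueError:
            continue  # e.g. an HTTP-date Retry-After; not handled in this sketch
        if hinted >= 0:
            return min(hinted, cap_s)
    return min(base_s * (2 ** attempt), cap_s)
```

Capping the server hint keeps a long rate-limit window from stalling a live call; a caller that exhausts its single retry would raise instead of waiting.
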
Adopt two patterns from the LangChain voice-agent reference that we were missing.

Pipeline hooks around the LLM call:
- `before_llm(messages, ctx) -> messages | None` runs once per turn before invoking the LLM. Return `None` to keep the original messages or return a new list to replace them. Use cases: PII redaction, prompt rewriting, dynamic system-prompt injection.
- `after_llm(text, ctx) -> text | None` runs once per turn after the full assistant response is assembled. Use cases: output validation, redaction, post-processing, cost capping.
- Both hooks are opt-in and fail-open: a hook raising an exception logs but does not break the call.

Fine-grained pipeline events on `EventBus`:
- `transcript_partial` / `transcript_final`: STT chunks surfaced as events (existing `on_transcript` callback path remains).
- `llm_chunk`: emitted per LLM streaming text chunk.
- `tts_chunk`: emitted per outbound audio frame.
- `tool_call_started`: emitted when an LLM tool call begins (so UIs can render "calling weather…" mid-utterance).

These are additive: every existing callback continues to fire.
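
To illustrate the fail-open contract of the hooks, here is a small sketch of how a pipeline might invoke them. It assumes synchronous hooks and a plain `dict` context, and it assumes `after_llm` returning `None` keeps the original text (the commit states this only for `before_llm`). The helpers `apply_before_llm`, `apply_after_llm`, and `redact_digits` are hypothetical names, not SDK functions.

```python
import logging
import re
from typing import Any, Callable, Dict, List, Optional

logger = logging.getLogger("hook_sketch")  # hypothetical logger name

Message = Dict[str, Any]
Ctx = Dict[str, Any]
BeforeLLM = Callable[[List[Message], Ctx], Optional[List[Message]]]
AfterLLM = Callable[[str, Ctx], Optional[str]]


def apply_before_llm(
    hook: Optional[BeforeLLM], messages: List[Message], ctx: Ctx
) -> List[Message]:
    """Run the hook once; on None or on any exception keep the original messages."""
    if hook is None:
        return messages
    try:
        replaced = hook(messages, ctx)
    except Exception:
        logger.exception("before_llm hook failed; continuing with original messages")
        return messages
    return messages if replaced is None else replaced


def apply_after_llm(hook: Optional[AfterLLM], text: str, ctx: Ctx) -> str:
    """Same fail-open rule for the assembled assistant response."""
    if hook is None:
        return text
    try:
        replaced = hook(text, ctx)
    except Exception:
        logger.exception("after_llm hook failed; continuing with original text")
        return text
    return text if replaced is None else replaced


def redact_digits(messages: List[Message], ctx: Ctx) -> List[Message]:
    """Example before_llm hook: mask every digit in message content."""
    return [{**m, "content": re.sub(r"\d", "#", m["content"])} for m in messages]
```

Returning the untouched input on failure is what keeps a buggy redaction hook from dropping a live call; the trade-off is that unredacted text reaches the LLM in that case, so the log line matters.
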
- README: new comparison paragraph between Quickstart and Features noting telephony parity (Twilio + Telnyx), both pipeline (sandwich) and speech-to-speech (Realtime/ConvAI engines) architectures from one API, production-grade barge-in/VAD/IVR primitives, OpenTelemetry tracing, the 4-line quickstart, and identical Python + TypeScript surface.
- CHANGELOG: document the 0.5.3 additions (OpenAITranscribeSTT, ElevenLabs v3, Cerebras hardening, before_llm/after_llm hooks, fine-grained pipeline events) under the existing 0.5.3 section.
Latency tuning pass on the TTS, Realtime, and pipeline glue providers.

- **Cartesia**: bump default model `sonic-2` → `sonic-3` (now GA, ~90 ms TTFB, voice IDs back-compat). Bump API version `2024-11-13` → `2025-04-16` to match the Cartesia STT path.
- **OpenAI TTS**: drop `aiter_bytes` chunk size from 4096 to 1024 bytes (~85 ms → ~21 ms first-byte at 24 kHz).
- **OpenAI Realtime**: default `silence_duration_ms` 500 → 300 (the documented sweet-spot for snappier turns; configurable via constructor).
- **SentenceChunker**: short greetings like "Hi there!" used to sit in the buffer until end-of-stream because the splitter required ≥20 chars AND ≥2 sentences before emitting. Add a short-flush path that emits on a sentence terminator when the preceding text has ≥2 words and no trailing digit/uppercase ambiguity.

No public API breaks; all defaults are user-overridable.
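
The ~85 ms → ~21 ms figure for the OpenAI TTS chunk size follows from how much audio one chunk holds. The sketch below works through that arithmetic, assuming 16-bit mono PCM at 24 kHz (the sample width and channel count are assumptions; the commit states only the sample rate). `chunk_duration_ms` is a hypothetical helper.

```python
def chunk_duration_ms(
    chunk_bytes: int,
    sample_rate_hz: int = 24_000,
    bytes_per_sample: int = 2,  # assumed 16-bit PCM
    channels: int = 1,          # assumed mono
) -> float:
    """Milliseconds of audio contained in one chunk of raw PCM."""
    samples = chunk_bytes / (bytes_per_sample * channels)
    return samples / sample_rate_hz * 1000.0


# 4096 bytes hold 2048 samples, about 85.3 ms of audio at 24 kHz.
old_ms = chunk_duration_ms(4096)
# 1024 bytes hold 512 samples, about 21.3 ms of audio at 24 kHz.
new_ms = chunk_duration_ms(1024)
```

Because the first chunk cannot be forwarded until it is full, the chunk duration is a lower bound on the added first-byte delay at real-time streaming speed.
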
Skip the SDK-side resample + μ-law transcode hop on phone calls by asking ElevenLabs for the carrier-native output format directly.

- `ElevenLabsTTS.for_twilio(api_key=...)`: sets `output_format = ulaw_8000` (Twilio Media Streams native). Saves ~30-80 ms first-byte + per-frame CPU vs the default `pcm_16000` → resample → μ-law chain.
- `ElevenLabsTTS.for_telnyx(api_key=...)`: sets `output_format = pcm_16000` (Telnyx-negotiated default).
- Constructor default unchanged (`pcm_16000`) so existing web / dashboard callers are unaffected.
- Same surface mirrored on the TS pipeline `TTS` wrapper with parent-compatible overloads (positional and options-object).
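
The factory pattern above can be pictured with a small stand-in class. This is a sketch only: `TTSConfigSketch` and its `needs_sdk_transcode_for_twilio` property are hypothetical and do not reflect the real `ElevenLabsTTS` constructor; the format strings are the ones named in the commit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTSConfigSketch:
    """Hypothetical stand-in for a TTS provider's configuration."""

    api_key: str
    output_format: str = "pcm_16000"  # constructor default stays unchanged

    @classmethod
    def for_twilio(cls, api_key: str) -> "TTSConfigSketch":
        # Ask the provider for 8 kHz mu-law so no SDK-side transcode is needed.
        return cls(api_key=api_key, output_format="ulaw_8000")

    @classmethod
    def for_telnyx(cls, api_key: str) -> "TTSConfigSketch":
        # 16 kHz PCM, described in the commit as the Telnyx-negotiated default.
        return cls(api_key=api_key, output_format="pcm_16000")

    @property
    def needs_sdk_transcode_for_twilio(self) -> bool:
        """True when audio would still need resample + mu-law before Twilio."""
        return self.output_format != "ulaw_8000"
```

Keeping the carrier choice in a named constructor means existing callers of the plain constructor see no behaviour change.
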
Wrap the system prompt and the last tool block with `cache_control: ephemeral` and send the `anthropic-beta: prompt-caching-2024-07-31` header so subsequent turns skip re-encoding the system block. Saves 100-400 ms TTFT and ~90% of input-token cost on agents with long system prompts.

- Default ON (`prompt_caching=True` / `promptCaching: true`); pass `False` to opt out for very short system prompts that fall under Anthropic's minimum-cacheable-block size.
- Backwards-compatible: when caching is disabled, the system prompt is sent as a plain string and no beta header is added.
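
A rough sketch of the two request shapes described above, one with caching and one without. The function `build_request` is hypothetical and builds only the `system`, `tools`, and beta-header parts of a Messages API request; how the SDK assembles the rest of the request is not shown in this PR.

```python
import copy
from typing import Any, Dict, List, Tuple

BETA_HEADER_NAME = "anthropic-beta"
BETA_HEADER_VALUE = "prompt-caching-2024-07-31"
EPHEMERAL = {"type": "ephemeral"}


def build_request(
    system_prompt: str,
    tools: List[Dict[str, Any]],
    prompt_caching: bool = True,
) -> Tuple[Dict[str, str], Dict[str, Any]]:
    """Return (extra_headers, partial_body) for the caching-relevant fields."""
    headers: Dict[str, str] = {}
    body: Dict[str, Any] = {}
    tools_copy = copy.deepcopy(tools)  # never mutate the caller's tool list

    if not prompt_caching:
        # Backwards-compatible shape: plain string, no beta header.
        body["system"] = system_prompt
        if tools_copy:
            body["tools"] = tools_copy
        return headers, body

    headers[BETA_HEADER_NAME] = BETA_HEADER_VALUE
    body["system"] = [
        {"type": "text", "text": system_prompt, "cache_control": dict(EPHEMERAL)}
    ]
    if tools_copy:
        # Marking only the last tool caches the whole tool prefix up to it.
        tools_copy[-1]["cache_control"] = dict(EPHEMERAL)
        body["tools"] = tools_copy
    return headers, body
```
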
…n metrics

Tightens the user-stop → first-TTS-audio loop and adds the observability needed to keep tightening it.

STT:
- **Python `speech_final` parity**: the TS Deepgram path already short-circuited end-of-utterance via the `speech_final` flag, but the Python side dropped it. Surface the flag through the `Transcript` dataclass and let `_stt_loop` dispatch the LLM on `is_final OR speech_final`. Saves ~300-700 ms per turn on Python.
- **Deepgram `smart_format` defaults to False**: telephony users save ~50-150 ms TTFT per final transcript; the option remains configurable.
- **WhisperSTT / OpenAITranscribeSTT** flush any non-empty buffer on `close()` so the trailing 0-250 ms of audio aren't silently dropped.

Observability:
- New `LatencyBreakdown` fields: `endpoint_ms`, `bargein_ms`, `tts_total_ms`, and a properly split `llm_ttft_ms` / `llm_total_ms` on the TS side (Python already had both).
- New aggregate: `latency_p90` alongside the existing P50 / P95 / P99.
- New OTel spans `getpatter.endpoint` and `getpatter.bargein`. The pre-existing `SPAN_LLM` constant is now actually emitted around the pipeline LLM call. TS span names normalised from a mix of `patter.*` and `getpatter.*` to `getpatter.*` everywhere.
- New accumulator hooks for the matching timestamps: `record_tts_complete_ts`, `record_tts_stopped`, `record_bargein_detected`, plus a unified `_endpoint_signal_at` that takes the first-fires-wins between VAD-stop and STT-final.
- Dashboard SSE feed surfaces the new fields.
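
The first-fires-wins endpoint signal and the `is_final OR speech_final` dispatch rule can be sketched as below. `TurnTimingSketch`, its methods, and `should_dispatch_llm` are hypothetical names; the real accumulator's fields and the exact definition of `endpoint_ms` are not shown in this PR, so the sketch measures only the user-stop → first-TTS-audio interval named in the commit.

```python
from dataclasses import dataclass
from typing import Optional


def should_dispatch_llm(is_final: bool, speech_final: bool) -> bool:
    """Dispatch as soon as either end-of-utterance flag is set."""
    return is_final or speech_final


@dataclass
class TurnTimingSketch:
    """Hypothetical per-turn timestamps, in seconds on a monotonic clock."""

    endpoint_signal_at: Optional[float] = None
    first_tts_audio_at: Optional[float] = None

    def record_endpoint_signal(self, ts: float) -> None:
        # First-fires-wins: VAD-stop and STT-final both call this,
        # and whichever arrives first sets the value.
        if self.endpoint_signal_at is None:
            self.endpoint_signal_at = ts

    def record_first_tts_audio(self, ts: float) -> None:
        if self.first_tts_audio_at is None:
            self.first_tts_audio_at = ts

    def user_stop_to_first_audio_ms(self) -> Optional[float]:
        """Interval from the endpoint signal to the first TTS audio, if both exist."""
        if self.endpoint_signal_at is None or self.first_tts_audio_at is None:
            return None
        return (self.first_tts_audio_at - self.endpoint_signal_at) * 1000.0
```

In a live pipeline the timestamps would come from something like `time.monotonic()`, so wall-clock adjustments cannot produce negative intervals.
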
Today the Python `notify_dashboard` does a synchronous `httpx.post` on the asyncio loop. If the dashboard is offline, `_on_call_start` and `_on_call_end` block the live call path for up to 1-3 seconds. Even on a healthy localhost it costs 5-15 ms per call.

- Convert to `async def` using `httpx.AsyncClient`.
- Wrap the two server.py call sites in `asyncio.create_task(...)` so the call-start / call-end paths return immediately regardless of dashboard responsiveness.
- Drop the wasteful `json.loads(json.dumps(data, default=...))` round-trip; serialize dataclasses with a recursive helper that produces a shape httpx can encode directly.
- All exceptions are swallowed (this is fire-and-forget).

The TS equivalent already uses non-blocking `http.request`; left a TODO comment for future API-surface parity.
Two unrelated TS-side latency wins.

call-log: every `fs.*Sync` call (`writeFileSync`, `fsyncSync`, `renameSync`, `appendFileSync`, `readFileSync`) ran on the Node main thread, costing ~75 ms of cumulative blocking per call (call_start + ~12 turns + call_end). Replaced with `fs.promises.*` and made `logCallStart`, `logTurn`, `logEvent`, `logCallEnd` async. Server.ts call sites switched to fire-and-forget, so call logging never blocks the WS handler.

Twilio outbound: replaced the `Url:` parameter (which made Twilio do a fresh HTTPS GET back to our server to fetch TwiML, ~100-200 ms) with inline `Twiml:` carrying the `<Connect><Stream>` directly. Brings TS to parity with the Python adapter and saves one webhook round-trip per outbound call.
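
For the Twilio outbound change, the sketch below shows what the inline `Twiml` form parameter might carry. It is written in Python for consistency with the other examples, even though the change itself is on the TS side. `stream_twiml` and `outbound_call_params` are hypothetical helpers, and the WebSocket URL is a placeholder.

```python
from typing import Dict
from xml.sax.saxutils import quoteattr


def stream_twiml(ws_url: str) -> str:
    """TwiML that connects the call audio to a WebSocket media stream."""
    # quoteattr adds the surrounding quotes and escapes &, <, > in the URL.
    return f"<Response><Connect><Stream url={quoteattr(ws_url)}/></Connect></Response>"


def outbound_call_params(to: str, from_: str, ws_url: str) -> Dict[str, str]:
    """Form fields for a call-create request using inline TwiML instead of Url."""
    return {"To": to, "From": from_, "Twiml": stream_twiml(ws_url)}


params = outbound_call_params(
    "+15550100", "+15550101", "wss://example.com/ws?call=1&mode=a"
)
```

Sending the document inline removes the extra fetch Twilio would otherwise make to retrieve it, which is where the quoted ~100-200 ms comes from.
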
Cartesia accepts `sample_rate=8000` natively in the request body, so asking for 8 kHz directly skips the SDK-side 16 kHz → 8 kHz resample on every TTS chunk for Twilio paths.

- `CartesiaTTS.for_twilio(api_key=...)`: sets `sample_rate=8000` (native μ-law transcode still done in TwilioAudioSender, but one resample step saved per chunk).
- `CartesiaTTS.for_telnyx(api_key=...)`: sets `sample_rate=16000` to align with Telnyx's L16 default when that path is used.
- Constructor default unchanged (`sample_rate=16000`) so existing callers behave identically.
- Same surface mirrored on the TS pipeline `TTS` wrapper with parent-compatible overloads (positional + options-object).
Two Telnyx-side latency wins plus a ring-timeout default tweak.

- Telnyx supports passing the streaming params inside the `answer` action body. We previously waited for `call.answered` to arrive and then made a second POST to `actions/streaming_start`. Folding the parameters into the original `answer` body eliminates one webhook round-trip and one HTTP POST, saving ~100-200 ms per inbound call. `call.answered` becomes a no-op debug log.
- `stream_track` flipped from `both_tracks` to `inbound_track`: Telnyx no longer forwards the outbound echo we were filtering out on receive. Halves WS upstream bandwidth and removes the per-frame filter branch (kept as defense-in-depth).
- Default `ring_timeout` 60 s → 25 s on both Twilio and Telnyx paths. 60 s leaves trunks tied up on phantom rings; 25 s is the production-recommended default. `ring_timeout=60` opts back in; passing `None` (Py) or `null` (TS) sends no timeout.
ElevenLabs ConvAI supports `output_audio_format="ulaw_8000"` and
`input_audio_format="ulaw_8000"` natively. When configured, the SDK
can drop both the inbound mulaw → pcm16 + 8k → 16k resample chain and
the outbound 16k → 8k + pcm16 → mulaw transcode, giving a per-turn
saving of ~5-10 ms plus CPU.
- `ElevenLabsConvAIAdapter.for_twilio(api_key, agent_id, …)` and
`for_telnyx(...)` factories that set both audio formats to
`ulaw_8000`. Constructor default unchanged.
- `ElevenLabsConvAIStreamHandler` (Python + TypeScript) detects the
ulaw configuration on `start()` and:
• bypasses the inbound transcode in `on_audio_received` /
inbound branch — forwards raw mulaw bytes to ConvAI;
• flips `audio_sender._input_is_mulaw_8k = True` so the outbound
sender skips its resample + μ-law conversion.
- Pipeline mode and other handlers untouched.
…s default switch to llama3.1-8b)
Wave 10A added 10 sampling kwargs and a User-Agent header to both CerebrasLLMProvider and GroqLLMProvider, each duplicating the entire SSE consumption loop from OpenAILLMProvider. The duplication broke when main's PR #73 introduced tests that mock the parent's stream() directly, and made cerebras / groq subclasses ~80 lines each that we would have to keep in sync forever.

Move the kwargs (and the User-Agent header) up to OpenAILLMProvider so every OpenAI-compat client benefits, including Anthropic-compat flows and any future provider that subclasses it.

- OpenAILLMProvider gains: response_format, parallel_tool_calls, tool_choice, seed, top_p, frequency_penalty, presence_penalty, stop, temperature, max_tokens (forwarded as max_completion_tokens on the wire), user_agent (default "getpatter/<version>").
- CerebrasLLMProvider.stream() is now a 17-line wrapper around super().stream() that adds the 404 model_not_found recovery hint. Cerebras-specific gzip + msgpack compression and the per-tier default model are unchanged.
- GroqLLMProvider drops its stream() override entirely; the parent handles everything.
- Same kwargs surfaced on the public TS OpenAILLMSamplingOptions and the OpenAILLM wrapper. TS Cerebras / Groq retain their own kwargs forwarding because they use a different transport (bare fetch + gzip + retry) and have no shared parent class.

Net code change: -193 LOC across cerebras_llm.py / groq_llm.py. All Wave 10A features preserved; only their architectural layer changes.
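
The refactor above amounts to letting the parent own kwarg forwarding so that subclasses only override defaults. A rough sketch of that shape follows; `OpenAICompatSketch` and `ProviderSketch` are hypothetical classes, the model names are placeholders, and the real provider's constructor and `stream()` are not reproduced here.

```python
from typing import Any, Dict, List, Optional

# The ten sampling kwargs listed in this commit.
SAMPLING_KEYS = (
    "response_format", "parallel_tool_calls", "tool_choice", "seed", "top_p",
    "frequency_penalty", "presence_penalty", "stop", "temperature", "max_tokens",
)


class OpenAICompatSketch:
    """Hypothetical parent that owns request-body construction."""

    default_model = "placeholder-model"

    def __init__(
        self,
        model: Optional[str] = None,
        user_agent: str = "getpatter/<version>",
        **sampling: Any,
    ) -> None:
        unknown = set(sampling) - set(SAMPLING_KEYS)
        if unknown:
            raise TypeError(f"unsupported sampling kwargs: {sorted(unknown)}")
        self.model = model or self.default_model
        self.user_agent = user_agent
        self.sampling = sampling

    def headers(self) -> Dict[str, str]:
        return {"User-Agent": self.user_agent}

    def build_body(self, messages: List[Dict[str, Any]]) -> Dict[str, Any]:
        body: Dict[str, Any] = {"model": self.model, "messages": messages, "stream": True}
        for key in SAMPLING_KEYS:
            value = self.sampling.get(key)
            if value is None:
                continue  # unset kwargs are not sent at all
            # Users pass max_tokens; the wire field is max_completion_tokens.
            wire_key = "max_completion_tokens" if key == "max_tokens" else key
            body[wire_key] = value
        return body


class ProviderSketch(OpenAICompatSketch):
    """A subclass now only needs to override defaults."""

    default_model = "provider-default-model"
```

With this layout a test that mocks the parent's streaming path exercises every subclass too, which is the failure mode the commit set out to remove.
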
Three tests written before the latency / measurement / pricing waves asserted on values that the production code has since updated. They passed locally because the tests target functions that were touched incrementally and the failures only surfaced once the matrix CI ran the suite end-to-end against the merged release branch.

- test_llm_loop._make_llm_loop fixture used `LLMLoop.__new__(LLMLoop)` to bypass the constructor, then manually set instance attributes. The Wave 12b observability pass added `_metrics`, `_event_bus`, `_model`, and `_provider_name` to LLMLoop's runtime contract; the fixture now sets them so `loop.run()` finds the expected attrs.
- test_twilio_handler.test_telnyx_webhook_stream_url_both_tracks asserted `stream_track == "both_tracks"`. Wave 14a flipped the default to `inbound_track` (halves WS upstream bandwidth, the outbound echo is filtered downstream anyway). Test renamed to test_telnyx_webhook_stream_url_inbound_track and asserts the new value with rationale comment.
- test_soak.test_s2_1000_turn_conversation hard-coded the pre-Wave-12b3 Deepgram batch rate ($0.0043/min) and ElevenLabs Creator-overage rate ($0.18/1k). Both were corrected to the streaming-API rates ($0.0077/min and $0.06/1k respectively) when the cost-accuracy audit found ~45% under-reporting on Deepgram and ~3x over-reporting on ElevenLabs. Test updated to the corrected rates.

Full suite: 1352 passed, 15 skipped on Python 3.12.
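
The under- and over-reporting figures quoted for the soak test can be checked directly from the four rates in the commit. The helper name `relative_error` is hypothetical; the rates are the ones stated above.

```python
def relative_error(reported_rate: float, actual_rate: float) -> float:
    """Signed error of the reported rate relative to the actual rate."""
    return (reported_rate - actual_rate) / actual_rate


# Deepgram: batch rate was used where the streaming rate applies.
deepgram_error = relative_error(0.0043, 0.0077)  # about -0.44, i.e. ~44% under

# ElevenLabs: Creator-overage rate was used where the streaming-API rate applies.
elevenlabs_ratio = 0.18 / 0.06  # about 3, i.e. ~3x over
```

The Deepgram figure works out to roughly 44%, which matches the "~45%" in the commit to within rounding.
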
Summary
Pre-public-release polishing pass. Brings the 0.5.3 SDK to public-launch quality: tighter latency budget, broader provider catalogue, observability you can actually act on, full Python ↔ TypeScript parity.
Highlights — latency
End-to-end P50 (user-stop → first TTS audio byte) reduced by ~1000-2000 ms across the pipeline, distributed across many independent wins:
- Python `speech_final` parity with TypeScript (Deepgram fast endpointing). Default `smart_format=False` for telephony. Whisper / OpenAITranscribeSTT always flush on close.
- Anthropic prompt caching (`cache_control: ephemeral` on system + last tool block). Cerebras default bumped to `gpt-oss-120b` with retry + structured-output + sampling-kwargs forwarding. New `before_llm` / `after_llm` pipeline hooks for PII redaction, output validation, and prompt rewriting.
- Cartesia default model bumped to `sonic-3` (~90 ms TTFB). OpenAI TTS chunk size 4096 → 1024. Sentence chunker emits short greetings immediately. New telephony factories (`for_twilio()` / `for_telnyx()`) on ElevenLabs, Cartesia, and ConvAI that negotiate carrier-native codecs (`ulaw_8000` / 8 kHz PCM) and skip per-chunk SDK transcoding.
- OpenAI Realtime default `silence_duration_ms` 500 → 300.
- Telnyx `answer` + `streaming_start` consolidated into a single API call (saves one webhook round-trip). TS Twilio outbound switched from `Url:` to inline `Twiml:` (parity with Python adapter, saves another round-trip).
- Telnyx `stream_track` set to `inbound_track` (halves WS upstream bandwidth). Default `ring_timeout` lowered from 60 s to 25 s.
- Python `notify_dashboard` made async + fire-and-forget (avoids 1-3 s stall when dashboard is offline). TS `call-log` switched to `fs.promises` to keep ~75 ms of cumulative blocking off the Node main thread.

Highlights — providers
- `OpenAITranscribeSTT` for `gpt-4o-transcribe` / `gpt-4o-mini-transcribe`.
- ElevenLabs `model_id` typed for `eleven_v3` / `eleven_flash_v2_5` / `eleven_turbo_v2_5` / `eleven_multilingual_v2` / `eleven_monolingual_v1`.
- Cerebras / Groq: `response_format` (JSON mode + structured outputs), `parallel_tool_calls`, `tool_choice`, `seed`, `top_p`, `frequency_penalty`, `presence_penalty`, `stop`, `User-Agent` telemetry, `max_completion_tokens`, gzip compression on by default in TypeScript (parity with Python).

Highlights — observability
- `LatencyBreakdown` extended with `endpoint_ms`, `bargein_ms`, `tts_total_ms`, properly split `llm_ttft_ms` / `llm_total_ms`.
- New aggregate `latency_p90` alongside P50 / P95 / P99.
- New OTel spans `getpatter.endpoint` and `getpatter.bargein`. The pre-existing `getpatter.llm` span is now actually emitted around the pipeline LLM call.
- New `EventBus` event types: `transcript_partial`, `transcript_final`, `llm_chunk`, `tts_chunk`, `tool_call_started`.
- TS span names normalised from a mix of `patter.*` and `getpatter.*` to `getpatter.*` everywhere.

Test plan
- Type check (`tsc --noEmit`) clean.
- `from getpatter import …` resolves the full public surface in Python; `require('getpatter')` exposes equivalent TS symbols.
- `Patter(api_key=…)` raises `NotImplementedError` with a clear message in both SDKs.
- Tag `v0.5.3` to trigger PyPI + npm publish via `release.yml`.
- `pip install getpatter==0.5.3` and `npm install getpatter@0.5.3` in fresh environments.