release: 0.5.3 — latency pass + observability + parity by nicolotognoni · Pull Request #76 · PatterAI/Patter

nicolotognoni · 2026-04-27T17:49:21Z

Summary

Pre-public-release polishing pass. Brings the 0.5.3 SDK to public-launch quality: a tighter latency budget, a broader provider catalogue, observability you can actually act on, and full Python ↔ TypeScript parity.

Highlights — latency

End-to-end P50 (user-stop → first TTS audio byte) reduced by ~1000-2000 ms across the pipeline, distributed across many independent wins:

STT: Python speech_final parity with TypeScript (Deepgram fast endpointing). Default smart_format=False for telephony. Whisper / OpenAITranscribeSTT always flush on close.
LLM: Anthropic prompt caching enabled by default (cache_control: ephemeral on system + last tool block). Cerebras default bumped to gpt-oss-120b with retry + structured-output + sampling-kwargs forwarding. New before_llm / after_llm pipeline hooks for PII redaction, output validation, and prompt rewriting.
TTS: Cartesia bumped to sonic-3 (~90 ms TTFB). OpenAI TTS chunk size 4096 → 1024. Sentence chunker emits short greetings immediately. New telephony factories (for_twilio() / for_telnyx()) on ElevenLabs, Cartesia, and ConvAI that negotiate carrier-native codecs (ulaw_8000 / 8 kHz PCM) and skip per-chunk SDK transcoding.
Realtime: OpenAI Realtime silence_duration_ms 500 → 300.
Telephony: Telnyx answer + streaming_start consolidated into a single API call (saves one webhook round-trip). TS Twilio outbound switched from Url: to inline Twiml: (parity with Python adapter, saves another round-trip). stream_track set to inbound_track (halves WS upstream bandwidth). Default ring_timeout lowered from 60 s to 25 s.
Infrastructure: notify_dashboard made async + fire-and-forget (avoids 1-3 s stall when dashboard is offline). TS call-log switched to fs.promises to keep ~75 ms of cumulative blocking off the Node main thread.

Highlights — providers

New first-class OpenAITranscribeSTT for gpt-4o-transcribe / gpt-4o-mini-transcribe.
Typed ElevenLabs model literal — eleven_v3 / eleven_flash_v2_5 / eleven_turbo_v2_5 / eleven_multilingual_v2 / eleven_monolingual_v1.
Cerebras: response_format (JSON mode + structured outputs), parallel_tool_calls, tool_choice, seed, top_p, frequency_penalty, presence_penalty, stop, User-Agent telemetry, max_completion_tokens, gzip compression on by default in TypeScript (parity with Python).

Highlights — observability

LatencyBreakdown extended with endpoint_ms, bargein_ms, tts_total_ms, properly split llm_ttft_ms / llm_total_ms.
New aggregate: latency_p90 alongside P50 / P95 / P99.
New OTel spans getpatter.endpoint and getpatter.bargein. The pre-existing getpatter.llm span is now actually emitted around the pipeline LLM call.
New EventBus event types: transcript_partial, transcript_final, llm_chunk, tts_chunk, tool_call_started.
TS span names normalised from a mix of patter.* and getpatter.* to getpatter.* everywhere.
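
As an illustration of the `latency_p90` aggregate listed above, here is a minimal sketch of a nearest-rank percentile over per-turn totals. The helper names (`percentile`, `latency_percentiles`), the sample values, and the nearest-rank method are all assumptions; the SDK may compute its percentiles differently.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of the data at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # Integer pct keeps the rank arithmetic exact for small sample counts.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_percentiles(samples: Sequence[float]) -> Dict[str, float]:
    """Hypothetical aggregate: P50 / P90 / P95 / P99 keyed like the metric names above."""
    return {f"latency_p{p}": percentile(samples, p) for p in (50, 90, 95, 99)}


# Made-up per-turn totals in milliseconds, for illustration only.
turn_totals_ms = [410.0, 520.0, 480.0, 950.0, 610.0, 430.0, 700.0, 455.0, 1200.0, 505.0]
summary = latency_percentiles(turn_totals_ms)
```

With only ten samples, P95 and P99 both collapse onto the slowest turn, which is one reason a P90 is a useful extra data point on short calls.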

Test plan

Python: 885 unit tests pass (10 skipped, 0 failed).
TypeScript: 1107 tests across 64 files pass.
TS typecheck (tsc --noEmit) clean.
TS build (ESM + CJS + DTS + CLI) clean.
Smoke import: from getpatter import … resolves the full public surface in Python; require('getpatter') exposes equivalent TS symbols.
Cloud-mode rejection: Patter(api_key=…) raises NotImplementedError with a clear message in both SDKs.
After merge: tag v0.5.3 to trigger PyPI + npm publish via release.yml.
After publish: smoke test pip install getpatter==0.5.3 and npm install getpatter@0.5.3 in fresh environments.

Pre-release housekeeping pass across both SDKs.

- Tightened public API surface and removed unused code paths
- Aligned Python and TypeScript exports for full parity
- Refactored logging to use the SDK's own logger namespace
- Hardened Telnyx webhook signature verification and audio pipeline
- Refreshed Mintlify docs, READMEs, and CHANGELOG for 0.5.3
- Rewrote pricing tables and dropped stale references
- Trimmed examples to working stub redirects only
- Standard dependency hygiene + version bump 0.5.2 → 0.5.3

- Add `OpenAITranscribeSTT` as a first-class STT class for OpenAI's `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` models. Reuses the Whisper transcription infrastructure (same `/v1/audio/transcriptions` endpoint) but defaults to `gpt-4o-transcribe` and rejects `whisper-1` for clarity.
- Type the ElevenLabs `model_id` field with a `Literal` / union so `eleven_v3` (newest, highest quality), `eleven_flash_v2_5` (current default, fastest TTFT), `eleven_turbo_v2_5`, `eleven_multilingual_v2`, and `eleven_monolingual_v1` are all surfaced via autocomplete. Custom model strings remain accepted for forward-compat.
- Public exports + tests added in both Python and TypeScript SDKs.
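
The typing pattern described above can be sketched as follows. This is not the SDK's code: the alias names and the `validate_transcribe_model` helper are hypothetical, and only the model IDs and the `whisper-1` rejection come from the commit message.

```python
from typing import Literal, Union, get_args

# Known IDs, so editors can offer them via autocomplete.
ElevenLabsModel = Literal[
    "eleven_v3",
    "eleven_flash_v2_5",
    "eleven_turbo_v2_5",
    "eleven_multilingual_v2",
    "eleven_monolingual_v1",
]
# Widening with plain `str` keeps custom or future model strings accepted.
ElevenLabsModelId = Union[ElevenLabsModel, str]

KNOWN_ELEVENLABS_MODELS = get_args(ElevenLabsModel)


def is_known_elevenlabs_model(model_id: ElevenLabsModelId) -> bool:
    """True when the ID is one of the literals above (hypothetical helper)."""
    return model_id in KNOWN_ELEVENLABS_MODELS


def validate_transcribe_model(model: str = "gpt-4o-transcribe") -> str:
    """Hypothetical check mirroring the 'rejects whisper-1' behaviour."""
    if model == "whisper-1":
        raise ValueError("whisper-1 is not accepted here; use the Whisper STT class instead")
    return model
```

The `Union[Literal[...], str]` form is only a hint to type checkers and editors; at runtime any string still passes, which is what keeps the change forward-compatible.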

Cerebras hardening pass: keeps the OpenAI-compat layer (single auth/retry path, smaller dep tree) but closes the gaps documented in their public inference docs.

- Default model bumped to `gpt-oss-120b`: production tier, no deprecation date, highest WSE-3 throughput in the catalog.
- TypeScript: enable gzip request compression by default (Python already had it on). Reduces TTFT on prompts >2 KB.
- TypeScript: add 1 retry with exponential backoff on 5xx and 429, honoring `x-ratelimit-reset-tokens-minute` / `x-ratelimit-reset-requests-minute` / `retry-after` headers. Terminal failures throw `PatterError` so the LLM loop can fall back rather than silently yielding nothing.
- Forward `response_format` (JSON mode + structured outputs with strict schema), `parallel_tool_calls`, `tool_choice`, `seed`, `top_p`, `frequency_penalty`, `presence_penalty`, `stop`: all OpenAI-standard sampling kwargs that Cerebras supports but we previously dropped.
- Send `max_completion_tokens` on the wire (the current Cerebras spec) while still accepting `max_tokens`/`maxTokens` from the user.
- Add `User-Agent: getpatter/<version>` header.

Same surface mirrored on Groq for parity (it shares the OpenAI-compat wire format).
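
A minimal sketch of the retry-delay decision described above, written in Python for illustration even though the commit applies it on the TypeScript side. Assumptions: header values are plain seconds, `retry-after` takes precedence over the rate-limit headers, and the delay is capped; none of these details are confirmed by the commit message, and the function names are hypothetical.

```python
from typing import Mapping

# Header names come from the commit message; the precedence order is an assumption.
RESET_HEADERS = (
    "retry-after",
    "x-ratelimit-reset-requests-minute",
    "x-ratelimit-reset-tokens-minute",
)


def is_retryable(status: int) -> bool:
    """Retry on 429 and on any 5xx response."""
    return status == 429 or 500 <= status <= 599


def retry_delay_seconds(
    headers: Mapping[str, str],
    attempt: int,
    base: float = 0.5,
    cap: float = 8.0,
) -> float:
    """Prefer a server-provided hint; otherwise fall back to exponential backoff."""
    lowered = {name.lower(): value for name, value in headers.items()}
    for name in RESET_HEADERS:
        raw = lowered.get(name)
        if raw is None:
            continue
        try:
            hinted = float(raw)
        except ValueError:
            continue  # e.g. an HTTP-date retry-after; ignored in this sketch
        if hinted >= 0:
            return min(hinted, cap)
    return min(base * 2 ** attempt, cap)
```

Capping the hinted delay matters on a live call: a 30 s rate-limit reset is longer than a caller will wait, so failing over to another provider is usually the better outcome.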

Adopt two patterns from the LangChain voice-agent reference that we were missing.

Pipeline hooks around the LLM call:

- `before_llm(messages, ctx) -> messages | None` runs once per turn before invoking the LLM. Return `None` to keep the original messages or return a new list to replace them. Use cases: PII redaction, prompt rewriting, dynamic system-prompt injection.
- `after_llm(text, ctx) -> text | None` runs once per turn after the full assistant response is assembled. Use cases: output validation, redaction, post-processing, cost capping.
- Both hooks are opt-in and fail-open: a hook that raises an exception is logged but does not break the call.

Fine-grained pipeline events on `EventBus`:

- `transcript_partial` / `transcript_final`: STT chunks surfaced as events (existing `on_transcript` callback path remains).
- `llm_chunk`: emitted per LLM streaming text chunk.
- `tts_chunk`: emitted per outbound audio frame.
- `tool_call_started`: emitted when an LLM tool call begins (so UIs can render "calling weather…" mid-utterance).

These are additive: every existing callback continues to fire.
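
The hook contract above (return `None` to keep the input, return a value to replace it, never break the call on an exception) can be sketched like this. The hook signatures follow the commit message; the `run_hook` helper, the `ctx` being a plain dict, the `max_chars` key, and the email regex are assumptions made for illustration, and how hooks are registered with the SDK is not shown here.

```python
import logging
import re
from typing import Any, Callable, Dict, List, Optional

logger = logging.getLogger("hooks_sketch")
Message = Dict[str, Any]
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def redact_emails(messages: List[Message], ctx: Dict[str, Any]) -> Optional[List[Message]]:
    """before_llm-style hook: returns a redacted copy, or None when nothing changed."""
    changed = False
    redacted: List[Message] = []
    for message in messages:
        content = message.get("content")
        if isinstance(content, str):
            new_content = EMAIL_RE.sub("[email]", content)
            changed = changed or new_content != content
            message = {**message, "content": new_content}
        redacted.append(message)
    return redacted if changed else None


def cap_length(text: str, ctx: Dict[str, Any]) -> Optional[str]:
    """after_llm-style hook: truncates long replies (max_chars is a hypothetical ctx key)."""
    limit = ctx.get("max_chars", 200)
    return text[:limit] if len(text) > limit else None


def run_hook(hook: Optional[Callable[[Any, Dict[str, Any]], Any]], value: Any, ctx: Dict[str, Any]) -> Any:
    """Fail-open runner: a raising hook is logged and the original value is kept."""
    if hook is None:
        return value
    try:
        result = hook(value, ctx)
    except Exception:
        logger.exception("hook failed; continuing with the original value")
        return value
    return value if result is None else result
```

Returning a copy rather than mutating `messages` in place keeps the stored conversation history intact, so redaction only affects what is sent to the model on that turn.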

- README: new comparison paragraph between Quickstart and Features noting telephony parity (Twilio + Telnyx), both pipeline (sandwich) and speech-to-speech (Realtime/ConvAI engines) architectures from one API, production-grade barge-in/VAD/IVR primitives, OpenTelemetry tracing, the 4-line quickstart, and identical Python + TypeScript surface.
- CHANGELOG: document the 0.5.3 additions (OpenAITranscribeSTT, ElevenLabs v3, Cerebras hardening, before_llm/after_llm hooks, fine-grained pipeline events) under the existing 0.5.3 section.

Latency tuning pass on the TTS, Realtime, and pipeline glue providers.

- **Cartesia**: bump default model `sonic-2` → `sonic-3` (now GA, ~90 ms TTFB, voice IDs back-compat). Bump API version `2024-11-13` → `2025-04-16` to match the Cartesia STT path.
- **OpenAI TTS**: drop `aiter_bytes` chunk size from 4096 to 1024 bytes (~85 ms → ~21 ms first-byte at 24 kHz).
- **OpenAI Realtime**: default `silence_duration_ms` 500 → 300 (the documented sweet-spot for snappier turns; configurable via constructor).
- **SentenceChunker**: short greetings like "Hi there!" used to sit in the buffer until end-of-stream because the splitter required ≥20 chars AND ≥2 sentences before emitting. Add a short-flush path that emits on a sentence terminator when the preceding text has ≥2 words and no trailing digit/uppercase ambiguity.

No public API breaks; all defaults are user-overridable.
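
The ~85 ms → ~21 ms figure for the OpenAI TTS chunk size follows from how much audio one chunk holds. The sketch below assumes 16-bit mono PCM at 24 kHz (the sample width and channel count are assumptions; the commit only states the sample rate), and the function name is illustrative.

```python
def chunk_duration_ms(
    chunk_bytes: int,
    sample_rate_hz: int = 24_000,
    bytes_per_sample: int = 2,  # assumed 16-bit PCM
    channels: int = 1,          # assumed mono
) -> float:
    """Milliseconds of audio that must be buffered before one chunk can be emitted."""
    frames = chunk_bytes / (bytes_per_sample * channels)
    return frames / sample_rate_hz * 1000.0


old_ms = chunk_duration_ms(4096)  # 2048 frames at 24 kHz
new_ms = chunk_duration_ms(1024)  # 512 frames at 24 kHz
```

Under those assumptions `old_ms` is about 85.3 and `new_ms` about 21.3, matching the numbers quoted in the bullet.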

Skip the SDK-side resample + μ-law transcode hop on phone calls by asking ElevenLabs for the carrier-native output format directly.

- `ElevenLabsTTS.for_twilio(api_key=...)`: sets `output_format = ulaw_8000` (Twilio Media Streams native). Saves ~30-80 ms first-byte + per-frame CPU vs the default `pcm_16000` → resample → μ-law chain.
- `ElevenLabsTTS.for_telnyx(api_key=...)`: sets `output_format = pcm_16000` (Telnyx-negotiated default).
- Constructor default unchanged (`pcm_16000`) so existing web / dashboard callers are unaffected.
- Same surface mirrored on the TS pipeline `TTS` wrapper with parent-compatible overloads (positional and options-object).

Wrap the system prompt and the last tool block with `cache_control: ephemeral` and send the `anthropic-beta: prompt-caching-2024-07-31` header so subsequent turns skip re-encoding the system block. Saves 100-400 ms TTFT and ~90% of input-token cost on agents with long system prompts.

- Default ON (`prompt_caching=True` / `promptCaching: true`); pass `False` to opt out for very short system prompts that fall under Anthropic's minimum-cacheable-block size.
- Backwards-compatible: when caching is disabled, the system prompt is sent as a plain string and no beta header is added.
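
The request shape described above can be sketched as a small builder. The `cache_control` block and the beta header value come from the commit message; the function name `build_anthropic_request` and the returned dict layout are assumptions for illustration, not the SDK's internals.

```python
from typing import Any, Dict, List

CACHE_CONTROL = {"type": "ephemeral"}
BETA_HEADER = {"anthropic-beta": "prompt-caching-2024-07-31"}


def build_anthropic_request(
    system_prompt: str,
    tools: List[Dict[str, Any]],
    prompt_caching: bool = True,
) -> Dict[str, Any]:
    """Return headers plus the system/tools fields for a Messages API call (sketch)."""
    tool_copies = [dict(tool) for tool in tools]  # never mutate the caller's tool specs
    if not prompt_caching:
        # Caching off: plain string system prompt, no beta header.
        return {"headers": {}, "system": system_prompt, "tools": tool_copies}

    system_blocks = [
        {"type": "text", "text": system_prompt, "cache_control": dict(CACHE_CONTROL)}
    ]
    if tool_copies:
        # Marking only the last tool caches the whole tool prefix up to that point.
        tool_copies[-1]["cache_control"] = dict(CACHE_CONTROL)
    return {"headers": dict(BETA_HEADER), "system": system_blocks, "tools": tool_copies}
```

The opt-out path matters because a prompt shorter than the provider's minimum cacheable size gains nothing from the extra markers.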

…n metrics

Tightens the user-stop → first-TTS-audio loop and adds the observability needed to keep tightening it.

STT:

- **Python `speech_final` parity**: the TS Deepgram path already short-circuited end-of-utterance via the `speech_final` flag, but the Python side dropped it. Surface the flag through the `Transcript` dataclass and let `_stt_loop` dispatch the LLM on `is_final OR speech_final`. Saves ~300-700 ms per turn on Python.
- **Deepgram `smart_format` defaults to False**: telephony users save ~50-150 ms TTFT per final transcript; the option remains configurable.
- **WhisperSTT / OpenAITranscribeSTT** flush any non-empty buffer on `close()` so the trailing 0-250 ms of audio aren't silently dropped.

Observability:

- New `LatencyBreakdown` fields: `endpoint_ms`, `bargein_ms`, `tts_total_ms`, and a properly split `llm_ttft_ms` / `llm_total_ms` on the TS side (Python already had both).
- New aggregate: `latency_p90` alongside the existing P50 / P95 / P99.
- New OTel spans `getpatter.endpoint` and `getpatter.bargein`. The pre-existing `SPAN_LLM` constant is now actually emitted around the pipeline LLM call. TS span names normalised from a mix of `patter.*` and `getpatter.*` to `getpatter.*` everywhere.
- New accumulator hooks for the matching timestamps: `record_tts_complete_ts`, `record_tts_stopped`, `record_bargein_detected`, plus a unified `_endpoint_signal_at` that takes the first-fires-wins between VAD-stop and STT-final.
- Dashboard SSE feed surfaces the new fields.
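
Two of the rules above are small enough to sketch directly: dispatching on `is_final OR speech_final`, and the first-fires-wins endpoint timestamp. The `Transcript` fields shown, the empty-text guard, and both function names are assumptions; the real dataclass and accumulator may look different.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Transcript:
    """Stand-in for the SDK's Transcript; only the fields needed here."""
    text: str
    is_final: bool = False
    speech_final: bool = False


def should_dispatch_llm(transcript: Transcript) -> bool:
    """Dispatch on is_final OR speech_final (the empty-text guard is an assumption)."""
    return bool(transcript.text.strip()) and (transcript.is_final or transcript.speech_final)


def endpoint_signal_at(
    vad_stop_at: Optional[float],
    stt_final_at: Optional[float],
) -> Optional[float]:
    """First-fires-wins: the earlier of the two timestamps that have actually fired."""
    fired = [ts for ts in (vad_stop_at, stt_final_at) if ts is not None]
    return min(fired) if fired else None
```

Taking the earlier signal means `endpoint_ms` is measured from the moment either detector first noticed the user had stopped, so a slow STT final does not hide a fast VAD stop.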

Today the Python `notify_dashboard` does a synchronous `httpx.post` on the asyncio loop. If the dashboard is offline, `_on_call_start` and `_on_call_end` block the live call path for up to 1-3 seconds. Even on a healthy localhost it costs 5-15 ms per call.

- Convert to `async def` using `httpx.AsyncClient`.
- Wrap the two server.py call sites in `asyncio.create_task(...)` so the call-start / call-end paths return immediately regardless of dashboard responsiveness.
- Drop the wasteful `json.loads(json.dumps(data, default=...))` round-trip; serialize dataclasses with a recursive helper that produces a shape httpx can encode directly.
- All exceptions swallowed (this is fire-and-forget).

TS equivalent already uses `http.request` non-blocking; left a TODO comment for future API-surface parity.

Two unrelated TS-side latency wins.

- call-log: every `fs.*Sync` call (`writeFileSync`, `fsyncSync`, `renameSync`, `appendFileSync`, `readFileSync`) ran on the Node main thread, costing ~75 ms of cumulative blocking per call (call_start + ~12 turns + call_end). Replaced with `fs.promises.*` and made `logCallStart`, `logTurn`, `logEvent`, `logCallEnd` async. Server.ts call sites switched to fire-and-forget, so call logging never blocks the WS handler.
- Twilio outbound: replaced the `Url:` parameter (which made Twilio do a fresh HTTPS GET back to our server to fetch TwiML, ~100-200 ms) with inline `Twiml:` carrying the `<Connect><Stream>` directly. Brings TS to parity with the Python adapter and saves one webhook round-trip per outbound call.

Cartesia accepts `sample_rate=8000` natively in the request body, so asking for 8 kHz directly skips the SDK-side 16 kHz → 8 kHz resample on every TTS chunk for Twilio paths.

- `CartesiaTTS.for_twilio(api_key=...)`: sets `sample_rate=8000` (native μ-law transcode still done in TwilioAudioSender, but one resample step saved per chunk).
- `CartesiaTTS.for_telnyx(api_key=...)`: sets `sample_rate=16000` to align with Telnyx's L16 default when that path is used.
- Constructor default unchanged (`sample_rate=16000`) so existing callers behave identically.
- Same surface mirrored on the TS pipeline `TTS` wrapper with parent-compatible overloads (positional + options-object).

Two Telnyx-side latency wins plus a ring-timeout default tweak.

- Telnyx supports passing the streaming params inside the `answer` action body. We previously waited for `call.answered` to arrive and then made a second POST to `actions/streaming_start`. Folding the parameters into the original `answer` body eliminates one webhook round-trip and one HTTP POST, saving ~100-200 ms per inbound call. `call.answered` becomes a no-op debug log.
- `stream_track` flipped from `both_tracks` to `inbound_track`: Telnyx no longer forwards the outbound echo we were filtering out on receive. Halves WS upstream bandwidth and removes the per-frame filter branch (kept as defense-in-depth).
- Default `ring_timeout` 60 s → 25 s on both Twilio and Telnyx paths. 60 s leaves trunks tied up on phantom rings; 25 s is the production-recommended default. `ring_timeout=60` opts back in; passing `None` (Py) or `null` (TS) sends no timeout.
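
A rough sketch of the two request-shaping rules above: streaming parameters folded into the `answer` body, and a ring timeout that is omitted entirely when `None` is passed. The helper names, the example URL, and the `timeout` wire field are placeholders chosen for illustration; check the carrier documentation for the exact field names before relying on them.

```python
from typing import Any, Dict, Optional

DEFAULT_RING_TIMEOUT_S = 25  # new default described above


def build_answer_body(stream_url: str, stream_track: str = "inbound_track") -> Dict[str, Any]:
    """Body for a single `answer` action that also starts media streaming."""
    return {"stream_url": stream_url, "stream_track": stream_track}


def with_ring_timeout(
    params: Dict[str, Any],
    ring_timeout: Optional[int] = DEFAULT_RING_TIMEOUT_S,
    field: str = "timeout",  # hypothetical wire name; differs per carrier
) -> Dict[str, Any]:
    """Copy params, adding the timeout unless the caller explicitly passed None."""
    out = dict(params)
    if ring_timeout is not None:
        out[field] = ring_timeout
    return out


answer_body = build_answer_body("wss://example.invalid/media")  # placeholder URL
dial_default = with_ring_timeout({"to": "+15550100"})
dial_legacy = with_ring_timeout({"to": "+15550100"}, ring_timeout=60)
dial_unbounded = with_ring_timeout({"to": "+15550100"}, ring_timeout=None)
```

Using `None` to mean "send nothing" (rather than "use the default") is what lets callers defer to the carrier's own timeout.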

ElevenLabs ConvAI supports `output_audio_format="ulaw_8000"` and `input_audio_format="ulaw_8000"` natively. When configured, the SDK can drop both the inbound mulaw → pcm16 + 8k → 16k resample chain and the outbound 16k → 8k + pcm16 → mulaw transcode, giving a per-turn saving of ~5-10 ms plus CPU.

- `ElevenLabsConvAIAdapter.for_twilio(api_key, agent_id, …)` and `for_telnyx(...)` factories that set both audio formats to `ulaw_8000`. Constructor default unchanged.
- `ElevenLabsConvAIStreamHandler` (Python + TypeScript) detects the ulaw configuration on `start()` and:
  - bypasses the inbound transcode in `on_audio_received` / inbound branch and forwards raw mulaw bytes to ConvAI;
  - flips `audio_sender._input_is_mulaw_8k = True` so the outbound sender skips its resample + μ-law conversion.
- Pipeline mode and other handlers untouched.

mintlify · 2026-04-27T17:49:37Z

Preview deployment for your docs. Learn more about Mintlify Previews.

Project	Status	Preview	Updated (UTC)
patter-06b046ce	🟢 Ready	View Preview	Apr 27, 2026, 5:50 PM


…s default switch to llama3.1-8b)

Wave 10A added 10 sampling kwargs and a User-Agent header to both CerebrasLLMProvider and GroqLLMProvider, each duplicating the entire SSE consumption loop from OpenAILLMProvider. The duplication broke when main's PR #73 introduced tests that mock the parent's stream() directly, and made cerebras / groq subclasses ~80 lines each that we would have to keep in sync forever.

Move the kwargs (and the User-Agent header) up to OpenAILLMProvider so every OpenAI-compat client benefits, including Anthropic-compat flows and any future provider that subclasses it.

- OpenAILLMProvider gains: response_format, parallel_tool_calls, tool_choice, seed, top_p, frequency_penalty, presence_penalty, stop, temperature, max_tokens (forwarded as max_completion_tokens on the wire), user_agent (default "getpatter/<version>").
- CerebrasLLMProvider.stream() is now a 17-line wrapper around super().stream() that adds the 404 model_not_found recovery hint. Cerebras-specific gzip + msgpack compression and the per-tier default model are unchanged.
- GroqLLMProvider drops its stream() override entirely; the parent handles everything.
- Same kwargs surfaced on the public TS OpenAILLMSamplingOptions and the OpenAILLM wrapper. TS Cerebras / Groq retain their own kwargs forwarding because they use a different transport (bare fetch + gzip + retry) and have no shared parent class.

Net code change: -193 LOC across cerebras_llm.py / groq_llm.py. All Wave 10A features preserved; only their architectural layer changes.
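
The refactor above amounts to building the request once in the parent and letting subclasses inherit it. A minimal sketch under that reading: the class names (`OpenAICompatSketch`, `GroqSketch`), the method names, and the placeholder model IDs are hypothetical, while the kwarg names and the `max_tokens` → `max_completion_tokens` mapping come from the commit message.

```python
from typing import Any, Dict, List, Optional

SAMPLING_KEYS = (
    "response_format", "parallel_tool_calls", "tool_choice", "seed", "top_p",
    "frequency_penalty", "presence_penalty", "stop", "temperature",
)


class OpenAICompatSketch:
    """Stand-in parent that owns kwargs forwarding for every OpenAI-compat provider."""

    default_model = "base-model"  # placeholder model ID

    def __init__(
        self,
        model: Optional[str] = None,
        max_tokens: Optional[int] = None,
        user_agent: str = "getpatter/<version>",
        **sampling: Any,
    ) -> None:
        unknown = set(sampling) - set(SAMPLING_KEYS)
        if unknown:
            raise TypeError(f"unsupported sampling kwargs: {sorted(unknown)}")
        self.model = model or self.default_model
        self.max_tokens = max_tokens
        self.user_agent = user_agent
        self.sampling = sampling

    def request_headers(self) -> Dict[str, str]:
        return {"User-Agent": self.user_agent}

    def request_body(self, messages: List[Dict[str, Any]]) -> Dict[str, Any]:
        body: Dict[str, Any] = {"model": self.model, "messages": messages, "stream": True}
        # Only forward kwargs the caller actually set.
        body.update({k: v for k, v in self.sampling.items() if v is not None})
        if self.max_tokens is not None:
            body["max_completion_tokens"] = self.max_tokens  # wire name differs from the kwarg
        return body


class GroqSketch(OpenAICompatSketch):
    """No overrides needed; the parent handles every kwarg."""

    default_model = "groq-default"  # placeholder model ID
```

A subclass that only changes a default model (or, like the Cerebras wrapper, adds an error hint) no longer needs its own copy of the forwarding logic, which is where the LOC reduction comes from.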

Three tests written before the latency / measurement / pricing waves asserted on values that the production code has since updated. They passed locally because the tests target functions that were touched incrementally and the failures only surfaced once the matrix CI ran the suite end-to-end against the merged release branch.

- test_llm_loop._make_llm_loop fixture used `LLMLoop.__new__(LLMLoop)` to bypass the constructor, then manually set instance attributes. The Wave 12b observability pass added `_metrics`, `_event_bus`, `_model`, and `_provider_name` to LLMLoop's runtime contract; the fixture now sets them so `loop.run()` finds the expected attrs.
- test_twilio_handler.test_telnyx_webhook_stream_url_both_tracks asserted `stream_track == "both_tracks"`. Wave 14a flipped the default to `inbound_track` (halves WS upstream bandwidth, the outbound echo is filtered downstream anyway). Test renamed to test_telnyx_webhook_stream_url_inbound_track and asserts the new value with rationale comment.
- test_soak.test_s2_1000_turn_conversation hard-coded the pre-Wave-12b3 Deepgram batch rate ($0.0043/min) and ElevenLabs Creator-overage rate ($0.18/1k). Both were corrected to the streaming-API rates ($0.0077/min and $0.06/1k respectively) when the cost-accuracy audit found ~45% under-reporting on Deepgram and ~3x over-reporting on ElevenLabs. Test updated to the corrected rates.

Full suite: 1352 passed, 15 skipped on Python 3.12.

nicolotognoni added 14 commits April 27, 2026 10:41

mintlify Bot deployed to staging - docs April 27, 2026 17:50 View deployment

nicolotognoni added 3 commits April 27, 2026 20:10

Merge branch 'main' into release/0.5.3 (conflict resolution + Cerebra…

a6ea175

…s default switch to llama3.1-8b)

nicolotognoni merged commit 35ed245 into main Apr 27, 2026
15 checks passed

nicolotognoni deleted the release/0.5.3 branch May 8, 2026 14:56

nicolotognoni mentioned this pull request May 25, 2026

release: 0.6.1 — patter.* OTel span attributes (Python) #85

Closed
