release: 0.5.3 — latency pass + observability + parity#76

Merged: nicolotognoni merged 17 commits into main from release/0.5.3 on Apr 27, 2026

@nicolotognoni
Collaborator

Summary

Pre-public-release polishing pass. Brings the 0.5.3 SDK to public-launch quality: tighter latency budget, broader provider catalogue, observability you can actually act on, full Python ↔ TypeScript parity.

Highlights — latency

End-to-end P50 (user-stop → first TTS audio byte) reduced by ~1000-2000 ms across the pipeline, distributed across many independent wins:

  • STT: Python speech_final parity with TypeScript (Deepgram fast endpointing). Default smart_format=False for telephony. Whisper / OpenAITranscribeSTT always flush on close.
  • LLM: Anthropic prompt caching enabled by default (cache_control: ephemeral on system + last tool block). Cerebras default bumped to gpt-oss-120b with retry + structured-output + sampling-kwargs forwarding. New before_llm / after_llm pipeline hooks for PII redaction, output validation, and prompt rewriting.
  • TTS: Cartesia bumped to sonic-3 (~90 ms TTFB). OpenAI TTS chunk size 4096 → 1024. Sentence chunker emits short greetings immediately. New telephony factories (for_twilio() / for_telnyx()) on ElevenLabs, Cartesia, and ConvAI that negotiate carrier-native codecs (ulaw_8000 / 8 kHz PCM) and skip per-chunk SDK transcoding.
  • Realtime: OpenAI Realtime silence_duration_ms 500 → 300.
  • Telephony: Telnyx answer + streaming_start consolidated into a single API call (saves one webhook round-trip). TS Twilio outbound switched from Url: to inline Twiml: (parity with Python adapter, saves another round-trip). stream_track set to inbound_track (halves WS upstream bandwidth). Default ring_timeout lowered from 60 s to 25 s.
  • Infrastructure: notify_dashboard made async + fire-and-forget (avoids 1-3 s stall when dashboard is offline). TS call-log switched to fs.promises to keep ~75 ms of cumulative blocking off the Node main thread.

Highlights — providers

  • New first-class OpenAITranscribeSTT for gpt-4o-transcribe / gpt-4o-mini-transcribe.
  • Typed ElevenLabs model literal — eleven_v3 / eleven_flash_v2_5 / eleven_turbo_v2_5 / eleven_multilingual_v2 / eleven_monolingual_v1.
  • Cerebras: response_format (JSON mode + structured outputs), parallel_tool_calls, tool_choice, seed, top_p, frequency_penalty, presence_penalty, stop, User-Agent telemetry, max_completion_tokens, gzip compression on by default in TypeScript (parity with Python).

Highlights — observability

  • LatencyBreakdown extended with endpoint_ms, bargein_ms, tts_total_ms, properly split llm_ttft_ms / llm_total_ms.
  • New aggregate: latency_p90 alongside P50 / P95 / P99.
  • New OTel spans getpatter.endpoint and getpatter.bargein. The pre-existing getpatter.llm span is now actually emitted around the pipeline LLM call.
  • New EventBus event types: transcript_partial, transcript_final, llm_chunk, tts_chunk, tool_call_started.
  • TS span names normalised from a mix of patter.* and getpatter.* to getpatter.* everywhere.

Test plan

  • Python: 885 unit tests pass (10 skipped, 0 failed).
  • TypeScript: 1107 tests across 64 files pass.
  • TS typecheck (tsc --noEmit) clean.
  • TS build (ESM + CJS + DTS + CLI) clean.
  • Smoke import: from getpatter import … resolves the full public surface in Python; require('getpatter') exposes equivalent TS symbols.
  • Cloud-mode rejection: Patter(api_key=…) raises NotImplementedError with a clear message in both SDKs.
  • After merge: tag v0.5.3 to trigger PyPI + npm publish via release.yml.
  • After publish: smoke test pip install getpatter==0.5.3 and npm install getpatter@0.5.3 in fresh environments.

Pre-release housekeeping pass across both SDKs.

- Tightened public API surface and removed unused code paths
- Aligned Python and TypeScript exports for full parity
- Refactored logging to use the SDK's own logger namespace
- Hardened Telnyx webhook signature verification and audio pipeline
- Refreshed Mintlify docs, READMEs, and CHANGELOG for 0.5.3
- Rewrote pricing tables and dropped stale references
- Trimmed examples to working stub redirects only
- Standard dependency hygiene + version bump 0.5.2 → 0.5.3

- Add `OpenAITranscribeSTT` as a first-class STT class for OpenAI's
  `gpt-4o-transcribe` and `gpt-4o-mini-transcribe` models. Reuses the
  Whisper transcription infrastructure (same `/v1/audio/transcriptions`
  endpoint) but defaults to `gpt-4o-transcribe` and rejects `whisper-1`
  for clarity.

- Type the ElevenLabs `model_id` field with a `Literal` / union so
  `eleven_v3` (newest, highest quality), `eleven_flash_v2_5` (current
  default, fastest TTFT), `eleven_turbo_v2_5`, `eleven_multilingual_v2`,
  and `eleven_monolingual_v1` are all surfaced via autocomplete.
  Custom model strings remain accepted for forward-compat.

- Public exports + tests added in both Python and TypeScript SDKs.

Cerebras hardening pass: keeps the OpenAI-compat layer (single auth/retry
path, smaller dep tree) but closes the gaps against what Cerebras
documents in its public inference docs.

- Default model bumped to `gpt-oss-120b` — production tier, no
  deprecation date, highest WSE-3 throughput in the catalog.
- TypeScript: enable gzip request compression by default (Python
  already had it on). Reduces TTFT on prompts >2 KB.
- TypeScript: add 1 retry with exponential backoff on 5xx and 429,
  honoring `x-ratelimit-reset-tokens-minute` /
  `x-ratelimit-reset-requests-minute` / `retry-after` headers. Terminal
  failures throw `PatterError` so the LLM loop can fall back rather
  than silently yielding nothing.
- Forward `response_format` (JSON mode + structured outputs with strict
  schema), `parallel_tool_calls`, `tool_choice`, `seed`, `top_p`,
  `frequency_penalty`, `presence_penalty`, `stop` — all OpenAI-standard
  sampling kwargs that Cerebras supports but we previously dropped.
- Send `max_completion_tokens` on the wire (the current Cerebras spec)
  while still accepting `max_tokens`/`maxTokens` from the user.
- Add `User-Agent: getpatter/<version>` header.
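
The retry rule described above (retry on 5xx and 429, prefer the server's reset hint, otherwise back off exponentially) might be sketched as follows. This is a minimal illustration in Python rather than the TypeScript implementation: `should_retry`, `retry_delay_seconds`, the `base` / `cap` defaults, the header lookup order, and the assumption that header values are plain seconds are all hypothetical.

```python
from typing import Mapping

# Header names come from the commit message; the lookup order is an assumption.
RESET_HEADERS = (
    "retry-after",
    "x-ratelimit-reset-tokens-minute",
    "x-ratelimit-reset-requests-minute",
)


def should_retry(status: int) -> bool:
    """Retry only on 429 and 5xx responses."""
    return status == 429 or 500 <= status < 600


def retry_delay_seconds(
    headers: Mapping[str, str],
    attempt: int,
    base: float = 0.25,
    cap: float = 5.0,
) -> float:
    """Prefer a server-provided delay; fall back to exponential backoff."""
    lowered = {k.lower(): v for k, v in headers.items()}
    for name in RESET_HEADERS:
        raw = lowered.get(name)
        if raw is None:
            continue
        try:
            # Assumes the value is a number of seconds; clamp to [0, cap].
            return min(max(float(raw), 0.0), cap)
        except ValueError:
            continue  # unparseable value (e.g. an HTTP date): try the next header
    return min(base * (2 ** attempt), cap)
```

Capping the delay keeps a large reset window from stalling a live call; once the single retry is exhausted, the caller would raise so the LLM loop can fall back.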

Same surface mirrored on Groq for parity (it shares the OpenAI-compat
wire format).

Adopt two patterns from the LangChain voice-agent reference that we
were missing.

Pipeline hooks around the LLM call:
- `before_llm(messages, ctx) -> messages | None` runs once per turn
  before invoking the LLM. Return `None` to keep the original messages
  or return a new list to replace them. Use cases: PII redaction,
  prompt rewriting, dynamic system-prompt injection.
- `after_llm(text, ctx) -> text | None` runs once per turn after the
  full assistant response is assembled. Use cases: output validation,
  redaction, post-processing, cost capping.
- Both hooks are opt-in and fail-open: a hook that raises an exception
  is logged but does not break the call.
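
The fail-open contract can be illustrated with a small wrapper. This is a sketch under stated assumptions, not the SDK's code: `run_before_llm`, the logger name, and the two example hooks are hypothetical, and only the `before_llm(messages, ctx) -> messages | None` shape is taken from the description above.

```python
import logging
import re
from typing import Callable, Optional

logger = logging.getLogger("getpatter.hooks")  # logger name is an assumption

BeforeLLM = Callable[[list, dict], Optional[list]]


def run_before_llm(hook: Optional[BeforeLLM], messages: list, ctx: dict) -> list:
    """Apply the hook; a None result or an exception keeps the original messages."""
    if hook is None:
        return messages
    try:
        replaced = hook(messages, ctx)
    except Exception:
        # Fail-open: log the failure and carry on with the unmodified turn.
        logger.exception("before_llm hook failed; using original messages")
        return messages
    return messages if replaced is None else replaced


EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def redact_emails(messages: list, ctx: dict) -> list:
    """Example hook: mask email addresses before they reach the LLM."""
    return [{**m, "content": EMAIL.sub("[email]", m["content"])} for m in messages]


def broken_hook(messages: list, ctx: dict) -> list:
    """Example hook that always fails, to show the fail-open path."""
    raise RuntimeError("boom")
```

With `redact_emails`, a user message containing `a.b@example.com` reaches the LLM as `[email]`; with `broken_hook`, the original list is passed through untouched and the error only appears in the log.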

Fine-grained pipeline events on `EventBus`:
- `transcript_partial` / `transcript_final` — STT chunks surfaced as
  events (existing `on_transcript` callback path remains).
- `llm_chunk` — emitted per LLM streaming text chunk.
- `tts_chunk` — emitted per outbound audio frame.
- `tool_call_started` — emitted when an LLM tool call begins (so UIs
  can render "calling weather…" mid-utterance).
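
How a UI might consume these events, as a rough sketch: `MiniEventBus`, its `on` / `emit` methods, and the payload keys (`text`, `name`, `bytes`) are hypothetical stand-ins, since the real `EventBus` API is not shown here. Only the event type names come from the list above.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List

Handler = Callable[[dict], None]


class MiniEventBus:
    """Hypothetical stand-in for the SDK's EventBus (the real API may differ)."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def on(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def emit(self, event_type: str, payload: dict) -> None:
        # Copy the list so handlers may subscribe or unsubscribe while running.
        for handler in list(self._handlers[event_type]):
            handler(payload)


status_lines: List[str] = []
bus = MiniEventBus()
bus.on("llm_chunk", lambda e: status_lines.append(e["text"]))
bus.on("tool_call_started", lambda e: status_lines.append(f"calling {e['name']}..."))

bus.emit("llm_chunk", {"text": "Let me check. "})
bus.emit("tool_call_started", {"name": "weather"})
bus.emit("tts_chunk", {"bytes": 320})  # no subscriber registered: ignored
```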

These are additive: every existing callback continues to fire.

- README: new comparison paragraph between Quickstart and Features
  noting telephony parity (Twilio + Telnyx), both pipeline (sandwich)
  and speech-to-speech (Realtime/ConvAI engines) architectures from
  one API, production-grade barge-in/VAD/IVR primitives, OpenTelemetry
  tracing, the 4-line quickstart, and identical Python + TypeScript
  surface.

- CHANGELOG: document the 0.5.3 additions (OpenAITranscribeSTT,
  ElevenLabs v3, Cerebras hardening, before_llm/after_llm hooks,
  fine-grained pipeline events) under the existing 0.5.3 section.

Latency tuning pass on the TTS, Realtime, and pipeline glue providers.

- **Cartesia**: bump default model `sonic-2` → `sonic-3` (now GA, ~90 ms
  TTFB, voice IDs back-compat). Bump API version `2024-11-13` →
  `2025-04-16` to match the Cartesia STT path.
- **OpenAI TTS**: drop `aiter_bytes` chunk size from 4096 to 1024 bytes
  (~85 ms → ~21 ms first-byte at 24 kHz).
- **OpenAI Realtime**: default `silence_duration_ms` 500 → 300 (the
  documented sweet-spot for snappier turns; configurable via constructor).
- **SentenceChunker**: short greetings like "Hi there!" used to sit in
  the buffer until end-of-stream because the splitter required ≥20 chars
  AND ≥2 sentences before emitting. Add a short-flush path that emits
  on a sentence terminator when the preceding text has ≥2 words and no
  trailing digit/uppercase ambiguity.
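
One possible reading of the short-flush rule, as a sketch only: `try_short_flush` is a hypothetical helper, and treating "trailing digit/uppercase ambiguity" as "the character just before the terminator is a digit or an uppercase letter" is an interpretation, not the SDK's exact logic.

```python
import re
from typing import Optional, Tuple

# A sentence terminator followed by whitespace or the end of the buffer.
_TERMINATOR = re.compile(r"[.!?](?=\s|$)")


def try_short_flush(buffer: str) -> Optional[Tuple[str, str]]:
    """Return (sentence, rest) when the buffer starts with a safely complete sentence."""
    match = _TERMINATOR.search(buffer)
    if match is None:
        return None
    head = buffer[: match.start()]
    if len(head.split()) < 2:
        return None  # "Dr." or "Hi." alone: too short to be sure
    if head[-1].isdigit() or head[-1].isupper():
        return None  # "version 3." or "the U.S." may still continue
    end = match.end()
    return buffer[:end].strip(), buffer[end:].lstrip()
```

Under these assumptions `"Hi there! How can I"` flushes `"Hi there!"` immediately, while `"Dr. Smith will"` and `"We ship to the U.S. and"` stay buffered for the regular splitter.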

No public API breaks; all defaults are user-overridable.
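
The OpenAI TTS first-byte figures follow directly from the chunk size, assuming the stream is 16-bit mono PCM at 24 kHz (the sample width and channel count are assumptions; only the 24 kHz rate and the byte sizes are stated above). A quick check:

```python
def chunk_duration_ms(
    chunk_bytes: int,
    sample_rate_hz: int = 24_000,
    bytes_per_sample: int = 2,
    channels: int = 1,
) -> float:
    """Milliseconds of audio contained in one chunk of raw PCM."""
    samples = chunk_bytes / (bytes_per_sample * channels)
    return 1000.0 * samples / sample_rate_hz


old_ms = chunk_duration_ms(4096)  # 2048 samples, about 85.3 ms
new_ms = chunk_duration_ms(1024)  # 512 samples, about 21.3 ms
```

A chunk cannot be forwarded until it is full, so the chunk duration is a lower bound on the added first-byte delay; shrinking it to a quarter saves roughly 64 ms.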

Skip the SDK-side resample + μ-law transcode hop on phone calls by
asking ElevenLabs for the carrier-native output format directly.

- `ElevenLabsTTS.for_twilio(api_key=...)` — sets `output_format =
  ulaw_8000` (Twilio Media Streams native). Saves ~30-80 ms first-byte
  + per-frame CPU vs the default `pcm_16000` → resample → μ-law chain.
- `ElevenLabsTTS.for_telnyx(api_key=...)` — sets `output_format =
  pcm_16000` (Telnyx-negotiated default).
- Constructor default unchanged (`pcm_16000`) so existing web /
  dashboard callers are unaffected.
- Same surface mirrored on the TS pipeline `TTS` wrapper with
  parent-compatible overloads (positional and options-object).
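
The factory pattern can be sketched with a small config class. `TTSConfig` and `needs_sdk_transcode` are hypothetical stand-ins for illustration, not the real `ElevenLabsTTS` class; only the factory names and the `output_format` values come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTSConfig:
    """Hypothetical stand-in for the SDK's ElevenLabs TTS settings."""

    api_key: str
    output_format: str = "pcm_16000"  # constructor default stays unchanged

    @classmethod
    def for_twilio(cls, api_key: str) -> "TTSConfig":
        # Twilio Media Streams carry 8 kHz mu-law, so request it directly.
        return cls(api_key=api_key, output_format="ulaw_8000")

    @classmethod
    def for_telnyx(cls, api_key: str) -> "TTSConfig":
        # Telnyx-negotiated default per the commit message.
        return cls(api_key=api_key, output_format="pcm_16000")


def needs_sdk_transcode(config: TTSConfig, carrier_format: str) -> bool:
    """True when the SDK must resample or transcode before sending to the carrier."""
    return config.output_format != carrier_format
```

Keeping the carrier choice in a named constructor means existing `TTSConfig(api_key=...)` callers see no behaviour change, while telephony callers opt in with one call.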

Wrap the system prompt and the last tool block with
`cache_control: ephemeral` and send the
`anthropic-beta: prompt-caching-2024-07-31` header so subsequent turns
skip re-encoding the system block. Saves 100-400 ms TTFT and ~90% of
input-token cost on agents with long system prompts.

- Default ON (`prompt_caching=True` / `promptCaching: true`); pass
  `False` to opt out for very short system prompts that fall under
  Anthropic's minimum-cacheable-block size.
- Backwards-compatible: when caching is disabled, the system prompt is
  sent as a plain string and no beta header is added.
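
A sketch of the request shaping this describes. `build_request` is a hypothetical helper, not the SDK's function; the `cache_control` block shape and the beta header value are the ones named above, and the rest of the Messages API body (model, messages, max_tokens) is omitted.

```python
import copy
from typing import Any, Dict, List, Tuple

CACHE_CONTROL = {"type": "ephemeral"}
BETA_HEADERS = {"anthropic-beta": "prompt-caching-2024-07-31"}


def build_request(
    system_prompt: str,
    tools: List[Dict[str, Any]],
    prompt_caching: bool = True,
) -> Tuple[Dict[str, str], Dict[str, Any]]:
    """Return (extra_headers, body_fragment) for one LLM call."""
    if not prompt_caching:
        # Opt-out path: plain string system prompt, no beta header.
        return {}, {"system": system_prompt, "tools": tools}

    tools = copy.deepcopy(tools)  # never mutate the caller's tool list
    if tools:
        # Marking the last tool caches the whole tool prefix up to that point.
        tools[-1]["cache_control"] = dict(CACHE_CONTROL)
    system = [
        {"type": "text", "text": system_prompt, "cache_control": dict(CACHE_CONTROL)}
    ]
    return dict(BETA_HEADERS), {"system": system, "tools": tools}
```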

…n metrics

Tightens the user-stop → first-TTS-audio loop and adds the observability
needed to keep tightening it.

STT:
- **Python `speech_final` parity** — the TS Deepgram path already short-
  circuited end-of-utterance via the `speech_final` flag, but the Python
  side dropped it. Surface the flag through the `Transcript` dataclass
  and let `_stt_loop` dispatch the LLM on `is_final OR speech_final`.
  Saves ~300-700 ms per turn on Python.
- **Deepgram `smart_format` defaults to False** — telephony users save
  ~50-150 ms TTFT per final transcript; the option remains configurable.
- **WhisperSTT / OpenAITranscribeSTT** flush any non-empty buffer on
  `close()` so the trailing 0-250 ms of audio aren't silently dropped.
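
The dispatch condition from the first bullet can be shown in a few lines. This `Transcript` is a simplified stand-in with only the two flags; `should_dispatch_llm` is a hypothetical name, and skipping empty transcripts is an added assumption rather than documented behaviour.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """Simplified stand-in; only the two end-of-utterance flags matter here."""

    text: str
    is_final: bool = False
    speech_final: bool = False


def should_dispatch_llm(t: Transcript) -> bool:
    """End the user turn on either flag, ignoring empty transcripts."""
    return bool(t.text.strip()) and (t.is_final or t.speech_final)
```

Before the change only `is_final` would trigger the LLM on the Python side, so a transcript carrying `speech_final=True` alone had to wait for a later final result.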

Observability:
- New `LatencyBreakdown` fields: `endpoint_ms`, `bargein_ms`,
  `tts_total_ms`, and a properly split `llm_ttft_ms` / `llm_total_ms`
  on the TS side (Python already had both).
- New aggregate: `latency_p90` alongside the existing P50 / P95 / P99.
- New OTel spans `getpatter.endpoint` and `getpatter.bargein`. The
  pre-existing `SPAN_LLM` constant is now actually emitted around the
  pipeline LLM call. TS span names normalised from a mix of `patter.*`
  and `getpatter.*` to `getpatter.*` everywhere.
- New accumulator hooks for the matching timestamps:
  `record_tts_complete_ts`, `record_tts_stopped`,
  `record_bargein_detected`, plus a unified `_endpoint_signal_at` that
  records whichever of VAD-stop and STT-final fires first.
- Dashboard SSE feed surfaces the new fields.
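
For readers unfamiliar with the aggregates, a sketch of how P50 / P90 / P95 / P99 could be computed over per-turn latencies. The nearest-rank method and the names `percentile` and `latency_aggregates` are assumptions; the SDK may interpolate or use a different definition.

```python
import math
from typing import Dict, Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty sequence."""
    if not values:
        raise ValueError("percentile() needs at least one value")
    ordered = sorted(values)
    # Multiply before dividing so exact ranks (e.g. 90 of 100) stay exact.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_aggregates(total_ms: Sequence[float]) -> Dict[str, float]:
    """Compute the four aggregates named in the commit message."""
    return {f"latency_p{p}": percentile(total_ms, p) for p in (50, 90, 95, 99)}
```

P90 sits between the median and the tail metrics: it moves when one turn in ten is slow, which P50 hides and P99 exaggerates on short calls.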

Today the Python notify_dashboard does a synchronous httpx.post on the
asyncio loop. If the dashboard is offline, _on_call_start and
_on_call_end block the live call path for up to 1-3 seconds. Even on
a healthy localhost it costs 5-15 ms per call.

- Convert to `async def` using `httpx.AsyncClient`.
- Wrap the two server.py call sites in `asyncio.create_task(...)` so the
  call-start / call-end paths return immediately regardless of dashboard
  responsiveness.
- Drop the wasteful `json.loads(json.dumps(data, default=...))`
  round-trip; serialize dataclasses with a recursive helper that
  produces a shape httpx can encode directly.
- All exceptions swallowed (this is fire-and-forget).
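
The fire-and-forget pattern, sketched with the standard library only. The HTTP call is replaced by `slow_offline_post`, a stand-in that sleeps and then fails, so the example runs without httpx; `fire_and_forget` and `on_call_start` are hypothetical names and this `notify_dashboard` signature is simplified.

```python
import asyncio
import time
from typing import Any, Awaitable, Callable, Coroutine, Set

# Keep strong references so pending tasks are not garbage-collected.
_background: Set[asyncio.Task] = set()


def fire_and_forget(coro: Coroutine[Any, Any, None]) -> None:
    """Schedule coro on the running loop and return immediately."""
    task = asyncio.create_task(coro)
    _background.add(task)
    task.add_done_callback(_background.discard)


async def notify_dashboard(post: Callable[[dict], Awaitable[None]], data: dict) -> None:
    """Best-effort notification: every exception is swallowed."""
    try:
        await post(data)
    except Exception:
        pass  # dashboard offline or slow: never surface this to the call path


async def slow_offline_post(data: dict) -> None:
    await asyncio.sleep(0.2)  # stands in for a connect timeout
    raise ConnectionError("dashboard offline")


async def on_call_start() -> float:
    started = time.perf_counter()
    fire_and_forget(notify_dashboard(slow_offline_post, {"event": "call_start"}))
    elapsed = time.perf_counter() - started  # time spent on the call path
    await asyncio.gather(*_background)  # demo only: drain before the loop closes
    return elapsed


elapsed_s = asyncio.run(on_call_start())
```

The call path pays only for scheduling the task; the 0.2 s stall and the connection error both stay inside the background task.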

The TS equivalent already uses a non-blocking `http.request`; a TODO
comment marks it for future API-surface parity.

Two unrelated TS-side latency wins.

call-log: every fs.*Sync (writeFileSync, fsyncSync, renameSync,
appendFileSync, readFileSync) ran on the Node main thread, costing ~75
ms of cumulative blocking per call (call_start + ~12 turns + call_end).
Replaced with `fs.promises.*` and made `logCallStart`, `logTurn`,
`logEvent`, `logCallEnd` async. Server.ts call sites switched to
fire-and-forget — call logging never blocks the WS handler.

Twilio outbound: replaced the `Url:` parameter (which made Twilio do a
fresh HTTPS GET back to our server to fetch TwiML, ~100-200 ms) with
inline `Twiml:` carrying the `<Connect><Stream>` directly. Brings TS to
parity with the Python adapter and saves one webhook round-trip per
outbound call.
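
What the inline variant amounts to, as a sketch in Python: the TwiML document is built locally and sent in the `Twiml` form field instead of pointing Twilio at a `Url`. `stream_twiml` and `outbound_call_params` are hypothetical helpers, and the WebSocket URL in the usage note is a placeholder.

```python
from xml.etree.ElementTree import Element, SubElement, tostring


def stream_twiml(ws_url: str) -> str:
    """Build a <Response><Connect><Stream/></Connect></Response> document."""
    response = Element("Response")
    connect = SubElement(response, "Connect")
    SubElement(connect, "Stream", {"url": ws_url})
    return tostring(response, encoding="unicode")


def outbound_call_params(to: str, from_: str, ws_url: str) -> dict:
    """Form fields for the create-call request, carrying the TwiML inline."""
    # No "Url" field: Twilio has nothing to fetch back from our server.
    return {"To": to, "From": from_, "Twiml": stream_twiml(ws_url)}
```

For example, `outbound_call_params("+15550100", "+15550101", "wss://example.com/ws")` yields a payload whose `Twiml` field already contains the `<Stream>` element, so the call connects to the media stream without the extra HTTPS fetch.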

Cartesia accepts `sample_rate=8000` natively in the request body, so
asking for 8 kHz directly skips the SDK-side 16 kHz → 8 kHz resample on
every TTS chunk for Twilio paths.

- `CartesiaTTS.for_twilio(api_key=...)` — sets `sample_rate=8000`
  (native μ-law transcode still done in TwilioAudioSender, but one
  resample step saved per chunk).
- `CartesiaTTS.for_telnyx(api_key=...)` — sets `sample_rate=16000` to
  align with Telnyx's L16 default when that path is used.
- Constructor default unchanged (`sample_rate=16000`) so existing
  callers behave identically.
- Same surface mirrored on the TS pipeline `TTS` wrapper with
  parent-compatible overloads (positional + options-object).

Two Telnyx-side latency wins plus a ring-timeout default tweak.

- Telnyx supports passing the streaming params inside the `answer`
  action body. We previously waited for `call.answered` to arrive and
  then made a second POST to `actions/streaming_start`. Folding the
  parameters into the original `answer` body eliminates one webhook
  round-trip and one HTTP POST — saves ~100-200 ms per inbound call.
  `call.answered` becomes a no-op debug log.
- `stream_track` flipped from `both_tracks` to `inbound_track` —
  Telnyx no longer forwards the outbound echo we were filtering out
  on receive. Halves WS upstream bandwidth and removes the per-frame
  filter branch (kept as defense-in-depth).
- Default `ring_timeout` 60 s → 25 s on both Twilio and Telnyx paths.
  60 s leaves trunks tied up on phantom rings; 25 s is the
  production-recommended default. `ring_timeout=60` opts back in;
  passing `None` (Py) or `null` (TS) sends no timeout.
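
A rough sketch of the two request shapes involved. `answer_body` and `dial_params` are hypothetical helpers; the `stream_url` and `timeout_secs` field names are assumptions used for illustration, while `stream_track`, its two values, and the 25 s default come from the list above.

```python
from typing import Any, Dict, Optional

DEFAULT_RING_TIMEOUT_S = 25


def answer_body(stream_url: str, stream_track: str = "inbound_track") -> Dict[str, Any]:
    """Body for the `answer` action with the streaming params folded in."""
    return {"stream_url": stream_url, "stream_track": stream_track}


def dial_params(
    to: str, ring_timeout: Optional[int] = DEFAULT_RING_TIMEOUT_S
) -> Dict[str, Any]:
    """Outbound dial fields; None omits the timeout so the carrier default applies."""
    params: Dict[str, Any] = {"to": to}
    if ring_timeout is not None:
        params["timeout_secs"] = ring_timeout
    return params
```

Because the streaming fields travel with `answer`, there is no second request to issue when `call.answered` arrives, and `dial_params(to, ring_timeout=60)` restores the previous ring window.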

ElevenLabs ConvAI supports `output_audio_format="ulaw_8000"` and
`input_audio_format="ulaw_8000"` natively. When configured, the SDK
can drop both the inbound mulaw → pcm16 + 8k → 16k resample chain and
the outbound 16k → 8k + pcm16 → mulaw transcode, giving a per-turn
saving of ~5-10 ms plus CPU.

- `ElevenLabsConvAIAdapter.for_twilio(api_key, agent_id, …)` and
  `for_telnyx(...)` factories that set both audio formats to
  `ulaw_8000`. Constructor default unchanged.
- `ElevenLabsConvAIStreamHandler` (Python + TypeScript) detects the
  ulaw configuration on `start()` and:
  • bypasses the inbound transcode in `on_audio_received` /
    inbound branch — forwards raw mulaw bytes to ConvAI;
  • flips `audio_sender._input_is_mulaw_8k = True` so the outbound
    sender skips its resample + μ-law conversion.
- Pipeline mode and other handlers untouched.

@mintlify

mintlify Bot commented Apr 27, 2026

Preview deployment for your docs. Learn more about Mintlify Previews.

| Project | Status | Preview | Updated (UTC) |
| --- | --- | --- | --- |
| patter-06b046ce | 🟢 Ready | View Preview | Apr 27, 2026, 5:50 PM |


Wave 10A added 10 sampling kwargs and a User-Agent header to both
CerebrasLLMProvider and GroqLLMProvider, each duplicating the entire
SSE consumption loop from OpenAILLMProvider. The duplication broke
when main's PR #73 introduced tests that mock the parent's stream()
directly, and made cerebras / groq subclasses ~80 lines each that we
would have to keep in sync forever.

Move the kwargs (and the User-Agent header) up to OpenAILLMProvider
so every OpenAI-compat client benefits — including Anthropic-compat
flows and any future provider that subclasses it.

- OpenAILLMProvider gains: response_format, parallel_tool_calls,
  tool_choice, seed, top_p, frequency_penalty, presence_penalty,
  stop, temperature, max_tokens (forwarded as max_completion_tokens
  on the wire), user_agent (default "getpatter/<version>").
- CerebrasLLMProvider.stream() is now a 17-line wrapper around
  super().stream() that adds the 404 model_not_found recovery hint.
  Cerebras-specific gzip + msgpack compression and the per-tier
  default model are unchanged.
- GroqLLMProvider drops its stream() override entirely; the parent
  handles everything.
- Same kwargs surfaced on the public TS OpenAILLMSamplingOptions and
  the OpenAILLM wrapper. TS Cerebras / Groq retain their own kwargs
  forwarding because they use a different transport (bare fetch +
  gzip + retry) and have no shared parent class.
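
The kwarg forwarding on the shared parent might look like the following. This is a sketch: `build_chat_payload` is a hypothetical helper, not the `OpenAILLMProvider` code, and rejecting unknown options with `TypeError` is a design choice made here for illustration.

```python
from typing import Any, Dict, List, Optional

# Sampling options listed in the commit message.
ALLOWED_SAMPLING = {
    "response_format", "parallel_tool_calls", "tool_choice", "seed", "top_p",
    "frequency_penalty", "presence_penalty", "stop", "temperature",
}


def build_chat_payload(
    model: str,
    messages: List[Dict[str, Any]],
    max_tokens: Optional[int] = None,
    **sampling: Any,
) -> Dict[str, Any]:
    """Assemble a chat-completions body, dropping options the caller left unset."""
    unknown = set(sampling) - ALLOWED_SAMPLING
    if unknown:
        raise TypeError(f"unsupported sampling options: {sorted(unknown)}")
    payload: Dict[str, Any] = {"model": model, "messages": messages, "stream": True}
    if max_tokens is not None:
        # The user-facing kwarg stays max_tokens; the wire name differs.
        payload["max_completion_tokens"] = max_tokens
    payload.update({k: v for k, v in sampling.items() if v is not None})
    return payload
```

Building the body in one place is what lets a subclass shrink to a thin wrapper: it only has to supply its default model and any provider-specific error handling.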

Net code change: -193 LOC across cerebras_llm.py / groq_llm.py.
All Wave 10A features preserved — only their architectural layer
changes.

Three tests written before the latency / measurement / pricing waves
asserted on values that the production code has since updated. They
passed locally because the tests target functions that were touched
incrementally and the failures only surfaced once the matrix CI ran
the suite end-to-end against the merged release branch.

- test_llm_loop._make_llm_loop fixture used `LLMLoop.__new__(LLMLoop)`
  to bypass the constructor, then manually set instance attributes.
  The Wave 12b observability pass added `_metrics`, `_event_bus`,
  `_model`, and `_provider_name` to LLMLoop's runtime contract; the
  fixture now sets them so `loop.run()` finds the expected attrs.
- test_twilio_handler.test_telnyx_webhook_stream_url_both_tracks
  asserted `stream_track == "both_tracks"`. Wave 14a flipped the
  default to `inbound_track` (halves WS upstream bandwidth, the
  outbound echo is filtered downstream anyway). Test renamed to
  test_telnyx_webhook_stream_url_inbound_track and asserts the new
  value with rationale comment.
- test_soak.test_s2_1000_turn_conversation hard-coded the pre-Wave-12b3
  Deepgram batch rate ($0.0043/min) and ElevenLabs Creator-overage
  rate ($0.18/1k). Both were corrected to the streaming-API rates
  ($0.0077/min and $0.06/1k respectively) when the cost-accuracy audit
  found ~45% under-reporting on Deepgram and ~3x over-reporting on
  ElevenLabs. Test updated to the corrected rates.
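
The under- and over-reporting figures can be checked from the four rates quoted above; the variable names below are illustrative only.

```python
OLD_DEEPGRAM_PER_MIN = 0.0043   # batch rate previously hard-coded
NEW_DEEPGRAM_PER_MIN = 0.0077   # streaming rate
OLD_ELEVENLABS_PER_1K = 0.18    # Creator-overage rate previously hard-coded
NEW_ELEVENLABS_PER_1K = 0.06    # corrected rate

# Share of the true Deepgram cost that the old rate failed to report.
deepgram_under = 1 - OLD_DEEPGRAM_PER_MIN / NEW_DEEPGRAM_PER_MIN
# Factor by which the old ElevenLabs rate overstated the true cost.
elevenlabs_over = OLD_ELEVENLABS_PER_1K / NEW_ELEVENLABS_PER_1K

print(f"Deepgram under-reported by {deepgram_under:.0%}")
print(f"ElevenLabs over-reported by {elevenlabs_over:.1f}x")
```

The Deepgram ratio works out to about 44%, in line with the "~45%" quoted, and the ElevenLabs ratio is 3x.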

Full suite: 1352 passed, 15 skipped on Python 3.12.

@nicolotognoni merged commit 35ed245 into main on Apr 27, 2026
15 checks passed
@nicolotognoni deleted the release/0.5.3 branch on May 8, 2026, 14:56