OVOS-AUDIO-1: Audio Output Service Specification #38
📝 Walkthrough
This PR introduces the OVOS-AUDIO-1 audio output service specification and integrates it with the existing pipeline schema by adding a …
Changes
Audio Output Specification and Pipeline Integration
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~12 minutes
🚥 Pre-merge checks | ✅ Passed checks (5 passed)
205b5d4 to
607cf56
Async speak, sentence segmentation, two-queue model. Clarify audio output service is optional; deployment MAY have none (e.g. server deployments).
Co-Authored-By: big-pickle <big-pickle@opencode.ai>
…bsection, wording cleanup
- Section 1: added hardware-access non-goal (only audio service needs audio devices)
- Section 4.2 (new): instant-sounds queue subsection with discipline and ovos.audio.play_sound payload table
- 'audio file' replaced with 'audio data' throughout
- Removed viseme, binary_data, audio_ext fields from payload tables
- Section 7: 'sound file' replaced with 'sound' in bus surface table
Co-Authored-By: Claude Code <claude@anthropic.com>
- §4.3: define `listen` as an extension to ovos.utterance.speak payload (was referenced but never defined); add payload table; explicit MUST to suppress the flag on all but the final per-sentence segment; state that ovos.mic.listen is NOT emitted on stop-initiated end
- §5.1/5.2: ovos.audio.output.started and .ended now carry session_id; add session boundary definition (fires once, not per-speak)
- §5.3: ovos.audio.is_speaking query and reply are session-scoped
- §6: both stop topics share the same MAY-scope-to-session_id rule; remove asymmetry between ovos.audio.stop and ovos.stop
- §8: move fallback TTS from MUST to SHOULD (matches §3.2 body); add MUST NOT emit ovos.mic.listen on stop-initiated end; add MAY scope-stop-to-session_id
- §4.2: fix "preempting any idle time" → "without waiting for the scheduled queue to drain"
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
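
The stop rule in the §4.3 and §8 bullets above can be sketched as a single decision, assuming the service tracks whether playback ended naturally or through a stop signal. The function name `should_emit_mic_listen` and its parameters are hypothetical and are not taken from the spec.

```python
def should_emit_mic_listen(listen: bool, stop_initiated: bool) -> bool:
    """Decide whether ovos.mic.listen follows the end of playback.

    listen         -- the `listen` flag carried by the speak message
    stop_initiated -- True when playback ended because of a stop signal
                      (an assumed piece of internal state)
    """
    # The flag only takes effect when playback finished on its own.
    return listen and not stop_initiated
```

Under these assumptions, a speak message with `listen: true` that is interrupted by `ovos.audio.stop` does not re-open the microphone.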
The `listen` field on ovos.utterance.speak is a pipeline-level intent
("handler expects follow-up utterance"), not an audio mechanism. Define
it in PIPELINE-1 §9.6 where the message is owned; remove the
"extends" definition from AUDIO-1 §4.3, which now references PIPELINE-1.
PIPELINE-1 §9.6 also makes explicit that a get_response flow MUST set
listen: true — omitting it is non-conformant because the user input
channel is never re-opened. The obligation will be mirrored in
CONVERSE-1 §5 when that spec is written.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Fix lang source: data.lang (content language) not session.lang (preferred language); fall back to session when absent
- Cut §3.4 (internal queue-item assembly); fold ordering note into §3.2
- Cut §4.1 "Typed metadata" bullet (internal queue item schema)
- Cut §4 intro duplication; replace with one-sentence summary
- Trim §2 async paragraph to one line (PIPELINE-1 §6.1 owns it)
- Cut §5.2 "session_id matches" (obvious)
- Cut §6 "targets only the audio output service" (already in §7)
- Rewrite §1 hardware-access exclusion as a true exclusion
- Drop "short" from "short sound effects" (no size constraint in protocol)
- Cut §8 SHOULD fallback TTS and MAY load-multiple/streaming-TTS (TTS internals; spec makes no assertions about TTS beyond it exists)
- Fix stale §3.4 cross-reference in §4.1 → §3.2
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Bugs:
- §3.2: fix dead §3.4 cross-reference → §4.1
- §6: step 1 now clears both queues (scheduled + instant-sounds); §4.2 already said instant-sounds are clearable on stop but §6 only cleared the scheduled queue
- §6 table: "Stop all audio output for the session" contradicted the MAY-scope paragraph; simplified to "Stop audio output"
- §4.3 MUST vs §8 SHOULD: ovos.mic.listen emission is MUST in §4.3 body but was SHOULD in conformance; moved to MUST, also moved the suppress-on-stop rule to MUST and rephrased positively
Minor:
- §4.1 ovos.audio.queue listen field: "Signal listening" was opaque; now references §4.3
- §1 hardware access: drop second sentence (normative claim about other components doesn't belong in a scope-exclusion list)
- RFC 2119 preamble: remove RECOMMENDED (never used in the spec)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
session_id must never appear in message.data; session identity is always read from context.session.session_id.
- ovos.audio.output.started: removed session_id payload field; no-payload note added; prose updated to context.session.session_id
- ovos.audio.output.ended: same
- ovos.audio.is_speaking: removed session_id from request and response payloads; requester scopes via context.session instead
- §6 stop integration prose: session_id → context.session.session_id
- §conformance MAY bullet: same
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
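
The rule in this commit can be illustrated with a minimal sketch. It assumes a Message is represented as a plain dict with `type`, `data`, and `context` keys; the helper names `session_id_of` and `violates_session_rule` are hypothetical and exist only for illustration.

```python
from typing import Any, Optional

# Assumed message shape: {"type": ..., "data": {...}, "context": {...}}
Message = dict[str, Any]


def session_id_of(message: Message) -> Optional[str]:
    """Read the session identity from context.session.session_id only."""
    session = message.get("context", {}).get("session", {})
    return session.get("session_id")


def violates_session_rule(message: Message) -> bool:
    """True when session_id leaked into message.data."""
    return "session_id" in message.get("data", {})


# ovos.audio.output.started carries no payload; the session lives in context.
started: Message = {
    "type": "ovos.audio.output.started",
    "data": {},
    "context": {"session": {"session_id": "default"}},
}
```

A consumer written this way never needs to inspect `data` to scope an event to a session.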
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@audio.md`:
- Line 143: The spec mixes "synthesise/synthesizes/synthesised" and
"synthesize/synthesizes/synthesized"; pick one spelling (e.g., American
"synthesize/synthesizes/synthesized" or British
"synthesise/synthesises/synthesised") and normalize all occurrences accordingly
(including the instance at line with "The audio output service synthesises the
utterance text into audio." and the occurrences around 385-386), updating
headings, body text, and examples so the chosen variant is used consistently
throughout the document.
- Around line 115-129: Add a language identifier to the fenced flow-diagram
block that begins with "ovos.utterance.speak" so markdown tooling treats it as
plain text: change the opening triple-backtick fence to use "text" (i.e.,
```text) for the block that contains the lines "[dialog transformers] ←
OVOS-TRANSFORM-1 §3.5", "[tts transformers] ← OVOS-TRANSFORM-1 §3.6", and
"scheduled playback queue → audio output" so the diagram is correctly typed by
renderers.
In `@ovos-pipeline-1.md`:
- Around line 1150-1158: The spec has a normative conflict: the `listen: true`
MUST on messages emitted as `ovos.utterance.speak` in a `get_response` flow
(OVOS-CONVERSE-1 §5) conflicts with the later statement that handlers have “no
normative obligation” (currently §11); resolve by choosing one approach and
updating the doc accordingly — either move the `listen` requirement out of
handler text into the orchestrator/framework contract (mentioning
`get_response`, `ovos.utterance.speak`, and `listen` so handlers remain
implementation-neutral), or keep the handler-level MUST and update §11 to add
explicit handler conformance obligations requiring handlers to set `listen:
true` when emitting `ovos.utterance.speak` in `get_response` flows; apply the
same change consistently where the spec references handler obligations.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 32f3ebb8-0636-4bc4-be10-56cf26a76b09
📒 Files selected for processing (2)
- audio.md
- ovos-pipeline-1.md
```
ovos.utterance.speak
        │
        ▼
[dialog transformers] ← OVOS-TRANSFORM-1 §3.5
        │
        ▼
TTS synthesis (text → audio data)
        │
        ▼
[tts transformers] ← OVOS-TRANSFORM-1 §3.6
        │
        ▼
scheduled playback queue → audio output
```
Add a language identifier to the fenced block.
The flow diagram fence is untyped; please mark it as text for markdown tooling compatibility.
Proposed edit
-```
+```text
ovos.utterance.speak
│
▼
[dialog transformers] ← OVOS-TRANSFORM-1 §3.5
@@
 scheduled playback queue → audio output
🧰 Tools
🪛 markdownlint-cli2 (0.22.1)
[warning] 115-115: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
**The `listen` flag and follow-up flows.** When a handler emits
`ovos.utterance.speak` as the prompt in a `get_response` flow
(OVOS-CONVERSE-1 §5), it **MUST** set `listen: true` on that Message.
The flag is a protocol-level statement that the handler expects a
follow-up utterance; every output consumer (audio, chat, any other
delivery channel) reads it and re-opens the user input channel
accordingly. Omitting the flag in a `get_response` flow is
non-conformant: the user is asked a question but the input channel
is never re-opened.
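
As a rough illustration of the quoted paragraph, the sketch below builds a speak Message for a `get_response` prompt and shows how an output consumer might read the flag. Messages are modelled as plain dicts; the `utterance` field name and the helpers `make_speak` and `should_reopen_input` are assumptions made for this example, not definitions from the spec.

```python
from typing import Any

Message = dict[str, Any]


def make_speak(utterance: str, session_id: str, expect_response: bool = False) -> Message:
    """Build an ovos.utterance.speak Message (dict form is an assumption)."""
    data: dict[str, Any] = {"utterance": utterance}
    if expect_response:
        # A get_response prompt MUST carry listen: true.
        data["listen"] = True
    return {
        "type": "ovos.utterance.speak",
        "data": data,
        "context": {"session": {"session_id": session_id}},
    }


def should_reopen_input(message: Message) -> bool:
    """What any output consumer checks before re-opening the input channel."""
    return (
        message["type"] == "ovos.utterance.speak"
        and message["data"].get("listen") is True
    )
```

A plain statement built with `make_speak("It is noon.", "default")` leaves the input channel closed, while a prompt built with `expect_response=True` re-opens it.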
Resolve conformance contradiction for handler obligations.
Lines 1150-1158 impose a handler MUST (`listen: true` in `get_response` prompts), but Lines 1318-1324 state handlers have no normative obligation under this spec. This is a normative conflict and makes conformance ambiguous.
Please align one of these:
- keep handler-neutral conformance and move the `listen` requirement to the orchestrator/framework contract, or
- keep handler MUST and update §11 to include explicit handler conformance obligations.
Also applies to: 1318-1324
…t stoppable
- One scheduled queue (not two): TTS speech and queued sounds only
- Instant sounds (ovos.audio.play_sound) are fire-and-forget: play immediately, MAY overlap each other, NOT affected by stop signals
- §2, §4, §6, §8 updated throughout; removed "two-queue model" framing
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
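
The model described in this commit can be sketched roughly as follows. This is an illustrative toy, not the service's implementation: the class name `AudioOutputSketch` is hypothetical, payloads are opaque dicts, and "playing" an instant sound is stood in for by appending to a log.

```python
from collections import deque
from typing import Any


class AudioOutputSketch:
    """Toy model: one scheduled queue plus fire-and-forget instant sounds."""

    def __init__(self) -> None:
        # TTS speech and queued sounds share this single queue.
        self.scheduled: deque[dict[str, Any]] = deque()
        # Stand-in for immediate playback of instant sounds.
        self.instant_log: list[dict[str, Any]] = []

    def handle(self, msg_type: str, data: dict[str, Any]) -> None:
        if msg_type == "ovos.audio.queue":
            self.scheduled.append(data)
        elif msg_type == "ovos.audio.play_sound":
            # Plays immediately, never enters the scheduled queue.
            self.instant_log.append(data)
        elif msg_type in ("ovos.audio.stop", "ovos.stop"):
            # Stop clears scheduled items only; instant sounds are unaffected.
            self.scheduled.clear()
```

The point of the sketch is the asymmetry: a stop signal empties `scheduled` but has no path to anything started through `ovos.audio.play_sound`.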
Actionable comments posted: 0
…ppendix
- §3.2: remove sentence segmentation from normative body; add note pointing to appendix §4.9
- §4.3: rewrite listen flag without segment references
- §4.1, §4.2: remove inline audio payload notes (implementation-specific)
- §3.3: remove transformer example list (TRANSFORM-1's job)
- §5.1: collapse session-continuity paragraph to one sentence; remove named-subscriber list
- §6: remove per-session queue implementation detail; trim to MAY
- §8: fix "clear both queues" → "clear the scheduled queue"; remove duplicate ovos.stop SHOULD
- appendix/rationale.md: add §4.9 sentence segmentation explanation
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ovos.utterance.speak.b64: same TTS pipeline as ovos.utterance.speak but output stage emits ovos.audio.speech (base64) instead of enqueueing for local playback. Client receives audio and handles playback itself.
ovos.audio.speech: emitted by audio output service on ovos.utterance.speak.b64; carries synthesised audio as base64 + listen flag; bridge relays to satellite (BRIDGE-1 §4.2.4).
§7 bus surface and §8 conformance updated.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rvice
§4.1: audio output service MUST only enqueue items for sessions it serves locally. Co-located service SHOULD serve session_id "default" only; named sessions are remote participants delivered via ovos.utterance.speak.b64 / ovos.audio.speech.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- audio.md renamed to audio-out.md for clarity alongside audio-in.md
- appendix/divergences.md §5.5: add ovos.utterance.speak.b64, ovos.audio.speech, ovos.audio.queue/play_sound inline-audio entries
- ovos-pipeline-1.md §9.6: forward reference to speak.b64 / AUDIO-1 §3.4
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
audio-out.md (1)
137-398: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Use one spelling family for “synthesise/synthesize” across the file.
Line 137 uses `synthesises` while Line 398 uses `synthesized`. Please normalize to one variant throughout the spec.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@audio-out.md` around lines 137 - 398, The document inconsistently uses the British "synthesise/synthesises" and American "synthesize/synthesized" spellings; pick one spelling family and normalize every occurrence (e.g., replace "synthesises", "synthesise" and "synthesised" or alternatively "synthesizes", "synthesize" and "synthesized") across the file so all mentions (including headings like "TTS transformer stage", sentence bodies such as the first paragraph and the section titles/notes) use the chosen variant consistently.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@audio-out.md`:
- Around line 238-255: The subsection numbering jumps from "### 4.3 Synthesised
audio delivery — `ovos.audio.speech`" to "### 4.5 Listen flag", leaving out 4.4;
update the heading "### 4.5 Listen flag" to "### 4.4 Listen flag" (or renumber
subsequent headings accordingly) so section references are consistent, and
verify any cross-references in the document that mention 4.5/4.4 are adjusted to
the new number.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f3c644e8-83c2-48a0-8ba5-435824deff4a
📒 Files selected for processing (4)
- appendix/divergences.md
- appendix/rationale.md
- audio-out.md
- ovos-pipeline-1.md
✅ Files skipped from review due to trivial changes (1)
- appendix/divergences.md
🚧 Files skipped from review as they are similar to previous changes (1)
- ovos-pipeline-1.md
### 4.3 Synthesised audio delivery — `ovos.audio.speech`

`ovos.audio.speech` is emitted by the audio output service when
processing an `ovos.utterance.speak.b64` Message (§3.4). It carries
the synthesised audio as base64; the receiving client is responsible
for decoding and playing it.

| Field | Type | Required | Meaning |
|-------|------|----------|---------|
| `audio` | string | yes | Base64-encoded synthesised audio. |
| `listen` | bool | no | When `true`, the client SHOULD re-open its microphone after playback. |

The session is identified via `context.session` as usual. A bridge
(OVOS-BRIDGE-1 §4.2.4) subscribes by `session_id` or `destination`
and relays this message to the client.
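
A minimal sketch of the payload in the table above, using only the Python standard library. Messages are modelled as plain dicts; the helper names `build_audio_speech` and `decode_audio_speech` are hypothetical, and the audio bytes are treated as opaque because the quoted text does not name an encoding or container format.

```python
import base64
from typing import Any

Message = dict[str, Any]


def build_audio_speech(audio_bytes: bytes, session_id: str, listen: bool = False) -> Message:
    """Service side: wrap synthesised audio as an ovos.audio.speech Message."""
    data: dict[str, Any] = {"audio": base64.b64encode(audio_bytes).decode("ascii")}
    if listen:
        data["listen"] = True  # optional field; omitted when not needed
    return {
        "type": "ovos.audio.speech",
        "data": data,
        # Session identity travels in context, not in data.
        "context": {"session": {"session_id": session_id}},
    }


def decode_audio_speech(message: Message) -> tuple[bytes, bool]:
    """Client side: recover the raw audio and the listen flag."""
    data = message["data"]
    return base64.b64decode(data["audio"]), bool(data.get("listen", False))
```

After playing the decoded bytes, a client following the table would re-open its microphone only when the second return value is `True`.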

### 4.5 Listen flag
Fix subsection numbering gap (4.3 → 4.5).
### 4.5 Listen flag appears immediately after ### 4.3, so section 4.4 is missing. This creates broken/ambiguous spec references.
Companion issue: #49
Summary
Defines the audio output service, the pipeline's output-side counterpart that consumes `ovos.utterance.speak` and renders natural-language responses as audio.

What the spec covers
- `ovos.utterance.speak`) and sound effects
- `ovos.audio.queue` for scheduled playback in queue order
- `ovos.audio.play_sound` (plays without queuing)
- `ovos.audio.output.started` / `ovos.audio.output.ended` (session identity from `context.session.session_id`)
- `ovos.audio.is_speaking` (session-scoped via context, not data)
- `ovos.audio.stop` and universal `ovos.stop`; MAY scope response to session
- `ovos.mic.listen` emitted after playback ends when `listen: true` on the speak message

Bus surface
- `ovos.utterance.speak`
- `ovos.audio.queue`
- `ovos.audio.play_sound`
- `ovos.audio.stop`
- `ovos.audio.is_speaking`
- `ovos.audio.output.started`
- `ovos.audio.output.ended`
- `ovos.mic.listen`

Summary by CodeRabbit
- `ovos.utterance.speak` protocol to include optional `listen` flag, signaling when system should await follow-up user input
- `ovos.utterance.speak.b64`) for remote clients