OVOS-AUDIO-IN-1: Audio Input Service Specification #51
Minimal spec with three normative obligations:
1. An STT mechanism MUST exist (deployer-defined: engine, API, and model are all out of scope)
2. The audio-transformer chain (TRANSFORM-1 §3.1) MUST run before STT
3. MUST emit ovos.utterance.handle with data.utterances and data.lang

Everything else (audio capture method such as mic, file, remote, wake word, or VAD; STT engine selection; post-STT transformer chains) is a deployer concern and explicitly out of scope.

Language is resolved from session.detected_lang → session.stt_lang → session.lang, in that order.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
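As a rough illustration of the third obligation, the emitted message could look like the sketch below. The `type` / `data` / `context` envelope and the helper name `build_utterance_message` are assumptions made for this example; only the message name and the two `data` fields come from the commit message.

```python
import json


def build_utterance_message(utterances, lang, context=None):
    """Build an ovos.utterance.handle message as a plain dict.

    Hypothetical helper: the envelope keys ("type", "data", "context")
    are an assumption for illustration, not taken from the spec text.
    """
    if not utterances:
        raise ValueError("at least one transcription candidate is required")
    return {
        "type": "ovos.utterance.handle",
        # data.utterances: transcription candidates; data.lang: language tag
        "data": {"utterances": list(utterances), "lang": lang},
        "context": dict(context or {}),
    }


message = build_utterance_message(["turn on the lights"], "en-US")
print(json.dumps(message, indent=2))
```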
- Language identification: detected lang written to session.detected_lang
- Denoising/normalisation: noise reduction, format conversion
- Speaker recognition: speaker_id written to Message.context for downstream personalisation without audio service knowing semantics

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
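One way to picture these use cases is as a chain of callables that each receive the audio and the message context and return both, possibly modified. This is only a sketch: the transformer signature, the function names, and the fixed `pt-PT` result are all hypothetical, and a real detector would inspect the audio.

```python
from typing import Callable, List, Tuple

# Assumed signature for illustration: (audio, context) -> (audio, context)
AudioTransformer = Callable[[bytes, dict], Tuple[bytes, dict]]


def strip_leading_silence(audio: bytes, context: dict) -> Tuple[bytes, dict]:
    # Stand-in for denoising/normalisation: drop leading zero bytes.
    return audio.lstrip(b"\x00"), context


def detect_language(audio: bytes, context: dict) -> Tuple[bytes, dict]:
    # Stand-in for language identification: a real implementation would
    # analyse the audio; here the result is hard-coded.
    session = context.setdefault("session", {})
    session["detected_lang"] = "pt-PT"
    return audio, context


def run_chain(audio: bytes, context: dict,
              chain: List[AudioTransformer]) -> Tuple[bytes, dict]:
    # Transformers run in order, before the audio reaches STT.
    for transformer in chain:
        audio, context = transformer(audio, context)
    return audio, context


audio, context = run_chain(b"\x00\x00abc", {},
                           [strip_leading_silence, detect_language])
```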
session.stt_lang records the language STT actually decoded in (SESSION-1 §3.2.4). It is a result written by the audio input service after transcription, not an input to language selection.

Corrected language resolution order (inputs to STT selection):
1. session.detected_lang (audio transformer detection)
2. session.request_lang (capture mechanism hint, e.g. wake word)
3. session.lang (general session preference)

Added SHOULD obligation to write session.stt_lang after transcription.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
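The corrected resolution order can be sketched as a first-match lookup over the session. The session is modelled here as a plain dict and `resolve_stt_input_lang` is a hypothetical name; note that `stt_lang` is deliberately not consulted, since it is a result rather than an input.

```python
from typing import Optional

# Inputs to STT language selection, most specific first.
RESOLUTION_ORDER = ("detected_lang", "request_lang", "lang")


def resolve_stt_input_lang(session: dict) -> Optional[str]:
    """Return the first non-empty language hint, or None if none is set."""
    for key in RESOLUTION_ORDER:
        value = session.get(key)
        if value:
            return value
    return None


session = {"request_lang": "es-ES", "lang": "en-US", "stt_lang": "de-DE"}
print(resolve_stt_input_lang(session))
```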
stt_lang is the language the STT model was configured to assume for the audio (model input language), written before/at STT invocation. In normal transcription stt_lang == data.lang; in speech-translation they diverge: stt_lang is the audio's spoken language, data.lang is the transcript's output language.

SESSION-1 §3.2.4 updated to match: "actually transcribed in" was ambiguous in the translation case.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
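To make the distinction concrete, the sketch below builds the two fields for a normal transcription and for a speech-translation run. The helper name `transcription_fields` and the sample phrases are hypothetical; the session and payload are modelled as plain dicts.

```python
def transcription_fields(text: str, audio_lang: str, output_lang: str):
    """Return (session, data) fragments for one STT result.

    stt_lang: language the STT model was configured to assume for the audio.
    data.lang: language of the transcript text that is emitted.
    """
    session = {"stt_lang": audio_lang}
    data = {"utterances": [text], "lang": output_lang}
    return session, data


# Normal transcription: both fields agree.
plain_session, plain_data = transcription_fields("bom dia", "pt-PT", "pt-PT")

# Speech-translation: Portuguese audio, English transcript.
tr_session, tr_data = transcription_fields("good morning", "pt-PT", "en-US")
```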
… ref
- §1 non-goals: replace confusing "post-STT transformers are deployer concern" with "owned by utterance lifecycle (PIPELINE-1), run after emission". Session lifecycle non-goal now cross-references SESSION-2 and points to §5.2.
- §5.2 (new): session assignment. Audio input is the originator of interactions and MUST assign a session. Local device SHOULD use session_id "default" (SESSION-2 §5); satellite session is assigned by the bridge at the hub boundary (BRIDGE-1 §4.2.1). Session MUST be in context.session, not data.
- §6 conformance: add session assignment MUST and SHOULD; remove utterance-transformer MAY (it belongs to the utterance lifecycle)
- See also: add SESSION-2 and BRIDGE-1; note PIPELINE-1 owns post-STT

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
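The session assignment rule in §5.2 could be sketched as follows, assuming the context is a plain dict. `assign_session` is a hypothetical helper; it fills in `"default"` only when no session has already been assigned (for example by a bridge), and it keeps the session under `context.session` rather than in `data`.

```python
def assign_session(context: dict, session_id: str = "default") -> dict:
    """Return a copy of context that carries a session under context["session"]."""
    ctx = dict(context)
    session = dict(ctx.get("session") or {})
    # Keep an existing session_id (e.g. one assigned upstream by a bridge).
    session.setdefault("session_id", session_id)
    ctx["session"] = session
    return ctx


local_ctx = assign_session({})
satellite_ctx = assign_session({"session": {"session_id": "satellite-1"}})

message = {
    "type": "ovos.utterance.handle",
    "data": {"utterances": ["what time is it"], "lang": "en-US"},
    "context": local_ctx,  # the session travels in context, never in data
}
```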
- Preamble: drop repeated capture-method list (already in §1)
- §1: drop "defines" list (restated section headings); keep non-goals only; add AUDIO-1 to post-STT non-goal (dialog/TTS chains live there)
- §2: merge two-sentence role description into one
- §3: drop obvious "no constraint beyond..." clause
- §4: trim use-case bullet tails to one line each; drop "no transformers → unchanged" (obvious)
- §5: drop redundant MSG-1 §4 reference (covered by §5.2)
- §5.1: drop "most specific signal" and "prior not guarantee" padding
- §5.2: drop "this is the normal case" sentence; drop final "MUST be in context.session not data" (in §6 MUST)
- §6 MAY: remove "acquire audio by any mechanism" (a non-goal, not a MAY)
- See also: AUDIO-1 added; entries tightened

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Walkthrough
This PR introduces the OVOS-AUDIO-IN-1 specification defining the audio input service responsible for audio acquisition, pre-STT audio transformation, speech-to-text conversion, and utterance emission. It updates the session specification to clarify stt_lang semantics and indexes the new service in the specification registry.

Changes: Audio Input Service Specification

Estimated code review effort: 2 (Simple), ~12 minutes
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@ovos-audio-in-1.md`:
- Line 84: Change the type wording for the field named "utterances" from "array
of string" to "array of strings" in the specification table so it matches
established payload wording; locate the table row that defines the utterances
field (header shows `utterances | array of string | yes | Transcription
candidates; first element is primary.`) and update the type cell to read `array
of strings`.
- Line 85: The table entry that currently marks `data.lang` as required (`yes` /
MUST) is incorrect; update the `lang` field to reflect the PIPELINE-1 §9.1 rule
(optional-by-authority) so producers only emit `data.lang` when they
authoritatively know the content language. Change the `lang` requirement from
unconditional "yes / MUST emit" to an authority-conditional form (e.g.,
"conditional / when authoritative" or "optional-by-authority") in the
`data.lang` row and make the same adjustment for the other occurrences
referenced (lines 128-129) so the spec consistently defers to PIPELINE-1 §9.1.
Ensure the text mentions `data.lang` and cites PIPELINE-1 §9.1 for clarity.
In `@ovos-session-1.md`:
- Around line 374-375: The text incorrectly advises stages to read transcript
language from "data.lang or session.lang"; remove "or session.lang" so stages
only use data.lang (i.e., keep "read `stt_lang`; stages that need the
transcript's language read `data.lang`"). Ensure any mention of `session.lang`
as a fallback for `stt_lang` is deleted and add a clarifying note that
`stt_lang` must come from the utterance payload (`data.lang`) and not be
synthesized from `session.lang`.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 93d82bcd-118f-43f5-9c10-af8f8558d7a4
📒 Files selected for processing (3)
README.md, ovos-audio-in-1.md, ovos-session-1.md
| Field | Type | Required | Meaning |
|-------|------|----------|---------|
| `utterances` | array of string | yes | Transcription candidates; first element is primary. |
Use array of strings for the utterances type.
array of string is ambiguous in a normative table. Please switch to array of strings to match the established payload wording and reduce implementer interpretation drift.
| Field | Type | Required | Meaning |
|-------|------|----------|---------|
| `utterances` | array of string | yes | Transcription candidates; first element is primary. |
| `lang` | string | yes | BCP-47 output language of the transcription. See §5.1. |
data.lang is required here but optional-by-authority in PIPELINE-1.
This creates a contract mismatch: OVOS-PIPELINE-1 §9.1 allows lang only when the producer authoritatively knows content language, but this spec currently makes it unconditional (yes / MUST emit data.lang). Please align this to avoid forcing synthesized or guessed language tags.
Also applies to: 128-129
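A minimal sketch of what the authority-conditional form might mean for a producer, assuming the payload is a plain dict: `lang` is included only when the producer actually knows the content language and is omitted otherwise, never guessed. The helper name `build_utterance_data` is hypothetical.

```python
from typing import List, Optional


def build_utterance_data(utterances: List[str],
                         lang: Optional[str] = None) -> dict:
    """Build the data payload, emitting lang only when it is known."""
    data = {"utterances": list(utterances)}
    if lang is not None:
        # Producer authoritatively knows the content language.
        data["lang"] = lang
    return data


with_lang = build_utterance_data(["hello there"], lang="en-US")
without_lang = build_utterance_data(["hello there"])
```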
read `stt_lang`; stages that need the transcript's language read
`data.lang` or `session.lang`. Once set, `stt_lang` travels with
Do not recommend session.lang as transcript-language fallback.
The clause “stages that need the transcript’s language read data.lang or session.lang” conflicts with utterance-layer semantics where payload language must come from data.lang (and must not be synthesized from session preference). Please remove or session.lang here to keep contracts consistent.
Audio transformer is the inline voice-signal injection point per USER-ID-1 §3.1; context.voice_match is the intermediate signal. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Companion issue: #52
Summary
Minimal spec for the audio input service. Three normative obligations:
1. An STT mechanism MUST exist (deployer-defined)
2. The audio-transformer chain (TRANSFORM-1 §3.1) MUST run before STT
3. MUST emit `ovos.utterance.handle` with `data.utterances` and `data.lang`

What is NOT prescribed
- Audio capture method (mic, file, remote, wake word, VAD)
- STT engine selection

Audio transformer canonical use cases (§4)
- Language identification: detected language written to `session.detected_lang`
- Denoising/normalisation
- Speaker recognition: result written to `Message.context`

Language fields (§5.1)
Selection order for STT input language: `session.detected_lang` → `session.request_lang` → `session.lang`

After transcription, SHOULD write `session.stt_lang` = the language the STT model was configured to assume. Matches `data.lang` in normal transcription; diverges in speech-translation (`stt_lang` = audio's spoken language, `data.lang` = translated output language).

Companion edit (same PR)
`ovos-session-1.md` §3.2.4: clarified `stt_lang` definition (was "actually transcribed in", now "configured to assume for the audio") with speech-translation distinction documented.