OVOS-AUDIO-IN-1: Audio Input Service Specification#51

Merged
JarbasAl merged 7 commits into dev from spec/audio-input
May 28, 2026

Conversation

Member

@JarbasAl JarbasAl commented May 28, 2026

Companion issue: #52

Summary

Minimal spec for the audio input service. Three normative obligations:

  1. An STT mechanism MUST exist (deployer-defined)
  2. Audio-transformer chain (TRANSFORM-1 §3.1) MUST run before STT
  3. MUST emit ovos.utterance.handle with data.utterances and data.lang
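
The three obligations above can be read as a single ordered flow. The following is a minimal sketch, not the reference implementation: the `handle_audio` function, the `(audio, context)` transformer signature, the `stt` callable, and the `emit` callback are all hypothetical names chosen for illustration. Only the message type `ovos.utterance.handle` and the `utterances` / `lang` fields come from this PR.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical signatures for illustration; the spec does not prescribe them.
AudioTransformer = Callable[[bytes, Dict], Tuple[bytes, Dict]]
SttEngine = Callable[[bytes, str], List[str]]
Emit = Callable[[str, Dict, Dict], None]


def handle_audio(audio: bytes,
                 transformers: List[AudioTransformer],
                 stt: SttEngine,
                 lang: str,
                 emit: Emit) -> None:
    """Run the audio-transformer chain, then STT, then emit the utterance."""
    context: Dict = {}
    # Obligation 2: every audio transformer runs before STT sees the audio.
    for transform in transformers:
        audio, context = transform(audio, context)
    # Obligation 1: some STT mechanism produces transcription candidates.
    utterances = stt(audio, lang)
    # Obligation 3: emit ovos.utterance.handle with data.utterances and data.lang.
    emit("ovos.utterance.handle",
         {"utterances": utterances, "lang": lang},
         context)
```

How the audio reaches `handle_audio` and which engine backs `stt` are deliberately left open, matching the non-goals listed in this description.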

What is NOT prescribed

  • How audio is acquired (mic, file, remote, wake word, VAD, push-to-talk)
  • Which STT engine, model, or API is used
  • Post-STT transformer chains (utterance §3.2, metadata §3.3) — deployer MAY run before emission
  • Session lifecycle

Audio transformer canonical use cases (§4)

  • Language identification → writes session.detected_lang
  • Denoising / normalisation → acoustic pre-processing
  • Speaker recognition → result written into Message.context
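
To make the write locations in these use cases concrete, here is a small sketch of two audio transformers and a chain runner. It assumes a transformer takes and returns `(audio, context)` and that the session lives under `context["session"]`; the function names, the hard-coded detection results, and the `speaker_id` key are illustrative placeholders rather than anything the spec defines.

```python
from typing import Callable, Dict, List, Tuple

AudioTransformer = Callable[[bytes, Dict], Tuple[bytes, Dict]]


def language_id(audio: bytes, context: Dict) -> Tuple[bytes, Dict]:
    """Hypothetical language-identification transformer."""
    detected = "pt-PT"  # stand-in for a real classifier's output
    session = dict(context.get("session", {}))
    session["detected_lang"] = detected  # result goes into the session
    return audio, {**context, "session": session}


def speaker_recognition(audio: bytes, context: Dict) -> Tuple[bytes, Dict]:
    """Hypothetical speaker-recognition transformer."""
    # Result goes into the message context, not the session or the payload.
    return audio, {**context, "speaker_id": "speaker-0"}


def run_chain(audio: bytes, context: Dict,
              chain: List[AudioTransformer]) -> Tuple[bytes, Dict]:
    """Apply each transformer in order, threading audio and context through."""
    for transform in chain:
        audio, context = transform(audio, context)
    return audio, context
```

A denoising transformer would follow the same shape but change `audio` and leave `context` untouched.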

Language fields (§5.1)

Selection order for STT input language: session.detected_lang → session.request_lang → session.lang

After transcription, the service SHOULD set session.stt_lang to the language the STT model was configured to assume. This matches data.lang in normal transcription and diverges in speech translation (stt_lang is the audio's spoken language; data.lang is the translated output language).
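
A short sketch of the language handling described in this section, assuming the session is a plain dict keyed by the field names used here. The helper names `resolve_stt_input_lang` and `after_transcription`, and the `translated_to` parameter, are hypothetical and exist only to show the precedence and the stt_lang / data.lang split.

```python
from typing import Dict, Optional, Tuple

# Precedence of the STT input language signals, most specific first.
STT_LANG_PRECEDENCE = ("detected_lang", "request_lang", "lang")


def resolve_stt_input_lang(session: Dict[str, str]) -> Optional[str]:
    """Return the first non-empty language signal, or None if none is set."""
    for key in STT_LANG_PRECEDENCE:
        value = session.get(key)
        if value:
            return value
    return None


def after_transcription(session: Dict[str, str],
                        translated_to: Optional[str] = None
                        ) -> Tuple[Dict[str, str], Optional[str]]:
    """Return (session with stt_lang recorded, value for data.lang)."""
    stt_lang = resolve_stt_input_lang(session)
    updated = dict(session)
    if stt_lang is not None:
        # stt_lang is what the STT model was configured to assume.
        updated["stt_lang"] = stt_lang
    # In plain transcription data.lang equals stt_lang; in speech
    # translation it is the language of the translated transcript.
    data_lang = translated_to or stt_lang
    return updated, data_lang
```

For example, a session with `detected_lang="pt-PT"` and `lang="en-US"` resolves to `pt-PT` as the STT input language; with `translated_to="en-US"` the emitted `data.lang` would be `en-US` while `stt_lang` stays `pt-PT`.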

Companion edit (same PR)

ovos-session-1.md §3.2.4 — clarified stt_lang definition (was "actually transcribed in", now "configured to assume for the audio") with speech-translation distinction documented.

Summary by CodeRabbit

  • Documentation
    • Added Audio Input Service specification (OVOS-AUDIO-IN-1) covering audio acquisition, pre-STT transformation, and speech-to-text processing workflows.
    • Updated session specification documentation to clarify STT language signal configuration and behavior.

JarbasAl and others added 4 commits May 28, 2026 09:42
Minimal spec with three normative obligations:
1. An STT mechanism MUST exist (deployer-defined — engine, API, model
   are all out of scope)
2. Audio-transformer chain (TRANSFORM-1 §3.1) MUST run before STT
3. MUST emit ovos.utterance.handle with data.utterances and data.lang

Everything else — audio capture method (mic, file, remote, wake word,
VAD), STT engine selection, post-STT transformer chains — is deployer
concern and explicitly out of scope. Language resolved from
session.detected_lang → session.stt_lang → session.lang in that order.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Language identification: detected lang written to session.detected_lang
- Denoising/normalisation: noise reduction, format conversion
- Speaker recognition: speaker_id written to Message.context for
  downstream personalisation without audio service knowing semantics

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
session.stt_lang records the language STT actually decoded in
(SESSION-1 §3.2.4). It is a result written by the audio input
service after transcription, not an input to language selection.

Corrected language resolution order (inputs to STT selection):
  1. session.detected_lang (audio transformer detection)
  2. session.request_lang (capture mechanism hint, e.g. wake word)
  3. session.lang (general session preference)

Added SHOULD obligation to write session.stt_lang after transcription.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
stt_lang is the language the STT model was configured to assume for
the audio (model input language), written before/at STT invocation.
In normal transcription stt_lang == data.lang; in speech-translation
they diverge — stt_lang is the audio's spoken language, data.lang
is the transcript's output language.

SESSION-1 §3.2.4 updated to match: "actually transcribed in" was
ambiguous in the translation case.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
JarbasAl and others added 2 commits May 28, 2026 10:03
… ref

- §1 non-goals: replace confusing "post-STT transformers are deployer
  concern" with "owned by utterance lifecycle (PIPELINE-1), run after
  emission". Session lifecycle non-goal now cross-references SESSION-2
  and points to §5.2.
- §5.2 (new): session assignment. Audio input is the originator of
  interactions and MUST assign a session. Local device SHOULD use
  session_id "default" (SESSION-2 §5); satellite session is assigned
  by the bridge at the hub boundary (BRIDGE-1 §4.2.1). Session MUST
  be in context.session, not data.
- §6 conformance: add session assignment MUST and SHOULD; remove
  utterance-transformer MAY (it belongs to the utterance lifecycle)
- See also: add SESSION-2 and BRIDGE-1; note PIPELINE-1 owns post-STT

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
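
The session-assignment rule from the commit above can be sketched as follows. This assumes a message is represented as a dict with `type`, `data`, and `context` keys; `build_utterance_message` is a hypothetical helper, and the `"default"` fallback models only the local-device case (a satellite's session would be supplied by the bridge instead).

```python
from typing import Dict, List, Optional


def build_utterance_message(utterances: List[str], lang: str,
                            session: Optional[Dict] = None) -> Dict:
    """Build an ovos.utterance.handle message with the session in context."""
    if session is None:
        # Local-device case: fall back to the "default" session id.
        session = {"session_id": "default"}
    return {
        "type": "ovos.utterance.handle",
        # The payload carries only the transcription fields.
        "data": {"utterances": utterances, "lang": lang},
        # The session travels in context.session, never in data.
        "context": {"session": session},
    }
```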
- Preamble: drop repeated capture-method list (already in §1)
- §1: drop "defines" list (restated section headings); keep non-goals
  only; add AUDIO-1 to post-STT non-goal (dialog/TTS chains live there)
- §2: merge two-sentence role description into one
- §3: drop obvious "no constraint beyond..." clause
- §4: trim use-case bullet tails to one line each; drop "no transformers
  → unchanged" (obvious)
- §5: drop redundant MSG-1 §4 reference (covered by §5.2)
- §5.1: drop "most specific signal" and "prior not guarantee" padding
- §5.2: drop "this is the normal case" sentence; drop final
  "MUST be in context.session not data" (in §6 MUST)
- §6 MAY: remove "acquire audio by any mechanism" (a non-goal, not a MAY)
- See also: AUDIO-1 added; entries tightened

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai Bot commented May 28, 2026

Warning

Review limit reached

@JarbasAl, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 14 minutes and 51 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ed5facf8-6556-4322-8fb9-6f186cc11c41

📥 Commits

Reviewing files that changed from the base of the PR and between d074ae2 and 649542e.

📒 Files selected for processing (1)
  • ovos-audio-in-1.md
📝 Walkthrough

Walkthrough

This PR introduces the OVOS-AUDIO-IN-1 specification defining the audio input service responsible for audio acquisition, pre-STT audio transformation, speech-to-text conversion, and utterance emission. It updates the session specification to clarify stt_lang semantics and indexes the new service in the specification registry.

Changes

Audio Input Service Specification

| Layer / File(s) | Summary |
|-----------------|---------|
| Audio Input Service Specification (ovos-audio-in-1.md) | Defines the audio input service contract: acquiring audio, executing the pre-STT audio-transformer chain (OVOS-TRANSFORM-1), performing STT with deployer-defined mechanism, and emitting utterances to ovos.utterance.handle. Specifies required emission fields (data.utterances, data.lang), language resolution precedence (session.detected_lang, session.request_lang, session.lang), session assignment requirements (local vs. satellite), and conformance checklist. |
| Specification Index and Session Language Semantics (README.md, ovos-session-1.md) | Adds OVOS-AUDIO-IN-1 to the specifications index and clarifies the stt_lang field in the session specification to describe the language the STT stage was configured to assume (distinct from data.lang, the transcript output language), with explicit contracts for transcription and speech-translation scenarios. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related issues

  • OpenVoiceOS/architecture#52: Directly addresses the proposal for the OVOS-AUDIO-IN-1 audio input service specification with STT and session language semantics.

Possibly related PRs

  • OpenVoiceOS/architecture#11: The new OVOS-AUDIO-IN-1 spec explicitly emits results to ovos.utterance.handle, requiring conformance to OVOS-PIPELINE-1's utterance-lifecycle contract and terminal event guarantees.

Poem

🐰 In the audio streams so bright,
Input service sets things right,
STT transforms the sound,
Sessions bind the ground,
Pipeline's path now shines with light!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|------------|--------|-------------|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: introducing the OVOS-AUDIO-IN-1 specification document for the audio input service, which is the primary addition in this pull request. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@ovos-audio-in-1.md`:
- Line 84: Change the type wording for the field named "utterances" from "array
of string" to "array of strings" in the specification table so it matches
established payload wording; locate the table row that defines the utterances
field (header shows `utterances | array of string | yes | Transcription
candidates; first element is primary.`) and update the type cell to read `array
of strings`.
- Line 85: The table entry that currently marks `data.lang` as required (`yes` /
MUST) is incorrect; update the `lang` field to reflect the PIPELINE-1 §9.1 rule
(optional-by-authority) so producers only emit `data.lang` when they
authoritatively know the content language. Change the `lang` requirement from
unconditional "yes / MUST emit" to an authority-conditional form (e.g.,
"conditional / when authoritative" or "optional-by-authority") in the
`data.lang` row and make the same adjustment for the other occurrences
referenced (lines 128-129) so the spec consistently defers to PIPELINE-1 §9.1.
Ensure the text mentions `data.lang` and cites PIPELINE-1 §9.1 for clarity.

In `@ovos-session-1.md`:
- Around line 374-375: The text incorrectly advises stages to read transcript
language from "data.lang or session.lang"; remove "or session.lang" so stages
only use data.lang (i.e., keep "read `stt_lang`; stages that need the
transcript's language read `data.lang`"). Ensure any mention of `session.lang`
as a fallback for `stt_lang` is deleted and add a clarifying note that
`stt_lang` must come from the utterance payload (`data.lang`) and not be
synthesized from `session.lang`.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 93d82bcd-118f-43f5-9c10-af8f8558d7a4

📥 Commits

Reviewing files that changed from the base of the PR and between 5dd4a7b and d074ae2.

📒 Files selected for processing (3)
  • README.md
  • ovos-audio-in-1.md
  • ovos-session-1.md

Comment thread ovos-audio-in-1.md

| Field | Type | Required | Meaning |
|-------|------|----------|---------|
| `utterances` | array of string | yes | Transcription candidates; first element is primary. |

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use array of strings for the utterances type.

array of string is ambiguous in a normative table. Please switch to array of strings to match the established payload wording and reduce implementer interpretation drift.


Comment thread ovos-audio-in-1.md
| Field | Type | Required | Meaning |
|-------|------|----------|---------|
| `utterances` | array of string | yes | Transcription candidates; first element is primary. |
| `lang` | string | yes | BCP-47 output language of the transcription. See §5.1. |

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

data.lang is required here but optional-by-authority in PIPELINE-1.

This creates a contract mismatch: OVOS-PIPELINE-1 §9.1 allows lang only when the producer authoritatively knows content language, but this spec currently makes it unconditional (yes / MUST emit data.lang). Please align this to avoid forcing synthesized or guessed language tags.

Also applies to: 128-129
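
To illustrate what the authority-conditional form suggested in this comment would mean for a producer, here is a hedged sketch. It shows the reviewer's proposal only, not necessarily what the merged spec requires; `build_utterance_data` and its `authoritative_lang` parameter are hypothetical names.

```python
from typing import Dict, List, Optional


def build_utterance_data(utterances: List[str],
                         authoritative_lang: Optional[str] = None) -> Dict:
    """Build the data payload, adding lang only when it is actually known."""
    data: Dict = {"utterances": utterances}
    if authoritative_lang:
        # Emit data.lang only when the producer knows the content language;
        # never guess or copy a session preference into the payload.
        data["lang"] = authoritative_lang
    return data
```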


Comment thread ovos-session-1.md
Comment on lines +374 to +375
read `stt_lang`; stages that need the transcript's language read
`data.lang` or `session.lang`. Once set, `stt_lang` travels with

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not recommend session.lang as transcript-language fallback.

The clause “stages that need the transcript’s language read data.lang or session.lang” conflicts with utterance-layer semantics where payload language must come from data.lang (and must not be synthesized from session preference). Please remove or session.lang here to keep contracts consistent.


Audio transformer is the inline voice-signal injection point per
USER-ID-1 §3.1; context.voice_match is the intermediate signal.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@JarbasAl JarbasAl merged commit fc3df31 into dev May 28, 2026
1 check passed