OVOS-AUDIO-IN-1: Audio Input Service Specification by JarbasAl · Pull Request #51 · OpenVoiceOS/architecture

JarbasAl · 2026-05-28T08:42:35Z

Companion issue: #52

Summary

Minimal spec for the audio input service. Three normative obligations:

An STT mechanism MUST exist (deployer-defined)
Audio-transformer chain (TRANSFORM-1 §3.1) MUST run before STT
MUST emit ovos.utterance.handle with data.utterances and data.lang
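
The three obligations can be sketched as a single function. This is a minimal illustration, not the reference implementation: `handle_audio`, the `AudioTransformer` and `SttEngine` type aliases, and the `emit` callback are hypothetical names, and the `type` / `data` / `context` dictionary layout is an assumption about how the message is represented.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical type aliases for illustration; the spec does not define these names.
AudioTransformer = Callable[[bytes, dict], Tuple[bytes, dict]]
SttEngine = Callable[[bytes, str], List[str]]


def handle_audio(
    audio: bytes,
    context: dict,
    transformers: Iterable[AudioTransformer],
    stt: SttEngine,
    lang: str,
    emit: Callable[[dict], None],
) -> dict:
    """Run the audio transformers, then STT, then emit ovos.utterance.handle."""
    # Obligation 2: the audio-transformer chain runs before STT.
    for transform in transformers:
        audio, context = transform(audio, context)

    # Obligation 1: some STT mechanism exists; which one is up to the deployer.
    utterances = stt(audio, lang)

    # Obligation 3: emit the utterance with data.utterances and data.lang.
    message = {
        "type": "ovos.utterance.handle",
        "data": {"utterances": utterances, "lang": lang},
        "context": context,
    }
    emit(message)
    return message
```

Passing the STT engine and the emitter in as callables mirrors the spec's stance that both are deployer-defined.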

What is NOT prescribed

How audio is acquired (mic, file, remote, wake word, VAD, push-to-talk)
Which STT engine, model, or API is used
Post-STT transformer chains (utterance §3.2, metadata §3.3) — deployer MAY run before emission
Session lifecycle

Audio transformer canonical use cases (§4)

Language identification → writes session.detected_lang
Denoising / normalisation → acoustic pre-processing
Speaker recognition → result written into Message.context
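
A rough sketch of how two of these use cases might sit in one chain. The `(audio, context)` call signature, the function names, and the placeholder results are assumptions made for illustration; only `session.detected_lang` and the idea of writing the speaker result into the message context come from the PR (one commit message names the key `speaker_id`).

```python
from typing import Callable, Iterable, Tuple

# Hypothetical transformer signature: takes audio and context, returns both.
Transformer = Callable[[bytes, dict], Tuple[bytes, dict]]


def identify_language(audio: bytes, context: dict) -> Tuple[bytes, dict]:
    """Toy language-ID step: records a detected language on the session."""
    session = dict(context.get("session", {}))
    session["detected_lang"] = "pt-PT"  # placeholder result, not a real detector
    return audio, {**context, "session": session}


def recognise_speaker(audio: bytes, context: dict) -> Tuple[bytes, dict]:
    """Toy speaker-recognition step: writes its result into the context."""
    return audio, {**context, "speaker_id": "speaker-1"}  # placeholder value


def run_chain(
    audio: bytes, context: dict, transformers: Iterable[Transformer]
) -> Tuple[bytes, dict]:
    """Apply each transformer in order, threading audio and context through."""
    for transform in transformers:
        audio, context = transform(audio, context)
    return audio, context
```

Each step returns a new context rather than mutating its input, so a transformer cannot silently alter state another step still holds.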

Language fields (§5.1)

Selection order for STT input language: session.detected_lang → session.request_lang → session.lang

After transcription, the service SHOULD write session.stt_lang, the language the STT model was configured to assume. It matches data.lang in normal transcription and diverges in speech translation (stt_lang = the audio's spoken language, data.lang = the translated output language).
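
The selection order and the stt_lang write can be illustrated as follows. This is a sketch that treats the session as a plain dictionary, which is an assumption; `resolve_stt_lang` and `record_stt_lang` are hypothetical helper names.

```python
from typing import Optional


def resolve_stt_lang(session: dict) -> Optional[str]:
    """Pick the STT input language: detected_lang, then request_lang, then lang."""
    for key in ("detected_lang", "request_lang", "lang"):
        value = session.get(key)
        if value:
            return value
    return None


def record_stt_lang(session: dict, stt_lang: str) -> dict:
    """Return a copy of the session with stt_lang set to the assumed input language."""
    return {**session, "stt_lang": stt_lang}
```

For example, a session holding `request_lang="es-ES"` and `lang="en-US"` but no `detected_lang` resolves to `es-ES`. In a speech-translation setup that outputs English, `stt_lang` would then be `es-ES` while `data.lang` would be `en-US`.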

Companion edit (same PR)

ovos-session-1.md §3.2.4 — clarified stt_lang definition (was "actually transcribed in", now "configured to assume for the audio") with speech-translation distinction documented.

Summary by CodeRabbit

Documentation
- Added Audio Input Service specification (OVOS-AUDIO-IN-1) covering audio acquisition, pre-STT transformation, and speech-to-text processing workflows.
- Updated session specification documentation to clarify STT language signal configuration and behavior.

Minimal spec with three normative obligations:
1. An STT mechanism MUST exist (deployer-defined: engine, API, and model are all out of scope)
2. Audio-transformer chain (TRANSFORM-1 §3.1) MUST run before STT
3. MUST emit ovos.utterance.handle with data.utterances and data.lang

Everything else, including audio capture method (mic, file, remote, wake word, VAD), STT engine selection, and post-STT transformer chains, is a deployer concern and explicitly out of scope. Language resolved from session.detected_lang → session.stt_lang → session.lang in that order.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Language identification: detected lang written to session.detected_lang
- Denoising/normalisation: noise reduction, format conversion
- Speaker recognition: speaker_id written to Message.context for downstream personalisation without audio service knowing semantics

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

session.stt_lang records the language STT actually decoded in (SESSION-1 §3.2.4). It is a result written by the audio input service after transcription, not an input to language selection.

Corrected language resolution order (inputs to STT selection):
1. session.detected_lang (audio transformer detection)
2. session.request_lang (capture mechanism hint, e.g. wake word)
3. session.lang (general session preference)

Added SHOULD obligation to write session.stt_lang after transcription.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

stt_lang is the language the STT model was configured to assume for the audio (model input language), written before/at STT invocation. In normal transcription stt_lang == data.lang; in speech-translation they diverge: stt_lang is the audio's spoken language, data.lang is the transcript's output language.

SESSION-1 §3.2.4 updated to match: "actually transcribed in" was ambiguous in the translation case.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… ref

- §1 non-goals: replace confusing "post-STT transformers are deployer concern" with "owned by utterance lifecycle (PIPELINE-1), run after emission". Session lifecycle non-goal now cross-references SESSION-2 and points to §5.2.
- §5.2 (new): session assignment. Audio input is the originator of interactions and MUST assign a session. Local device SHOULD use session_id "default" (SESSION-2 §5); satellite session is assigned by the bridge at the hub boundary (BRIDGE-1 §4.2.1). Session MUST be in context.session, not data.
- §6 conformance: add session assignment MUST and SHOULD; remove utterance-transformer MAY (it belongs to the utterance lifecycle)
- See also: add SESSION-2 and BRIDGE-1; note PIPELINE-1 owns post-STT

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
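
The session-assignment rule from §5.2 in this commit might look like the sketch below. It assumes the context is a plain dictionary and that a session supplied upstream (for instance by a bridge) is already present in `context["session"]`; `ensure_session` and `DEFAULT_SESSION_ID` are hypothetical names.

```python
DEFAULT_SESSION_ID = "default"  # local-device session id named in the commit message


def ensure_session(context: dict) -> dict:
    """Return a context that carries a session under context["session"]."""
    if context.get("session"):
        # A session was already assigned upstream; leave it untouched.
        return context
    # Local device case: fall back to the "default" session.
    return {**context, "session": {"session_id": DEFAULT_SESSION_ID}}
```

The session is attached to the context and never to the data payload, in line with the rule that it MUST be in context.session.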

- Preamble: drop repeated capture-method list (already in §1)
- §1: drop "defines" list (restated section headings); keep non-goals only; add AUDIO-1 to post-STT non-goal (dialog/TTS chains live there)
- §2: merge two-sentence role description into one
- §3: drop obvious "no constraint beyond..." clause
- §4: trim use-case bullet tails to one line each; drop "no transformers → unchanged" (obvious)
- §5: drop redundant MSG-1 §4 reference (covered by §5.2)
- §5.1: drop "most specific signal" and "prior not guarantee" padding
- §5.2: drop "this is the normal case" sentence; drop final "MUST be in context.session not data" (in §6 MUST)
- §6 MAY: remove "acquire audio by any mechanism" (a non-goal, not a MAY)
- See also: AUDIO-1 added; entries tightened

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

coderabbitai · 2026-05-28T09:16:47Z


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: ed5facf8-6556-4322-8fb9-6f186cc11c41

📥 Commits

Reviewing files that changed from the base of the PR and between d074ae2 and 649542e.

📒 Files selected for processing (1)

ovos-audio-in-1.md

📝 Walkthrough

This PR introduces the OVOS-AUDIO-IN-1 specification defining the audio input service responsible for audio acquisition, pre-STT audio transformation, speech-to-text conversion, and utterance emission. It updates the session specification to clarify stt_lang semantics and indexes the new service in the specification registry.

Changes

Audio Input Service Specification

| Layer / File(s) | Summary |
|-----------------|---------|
| Audio Input Service Specification `ovos-audio-in-1.md` | Defines the audio input service contract: acquiring audio, executing the pre-STT audio-transformer chain (OVOS-TRANSFORM-1), performing STT with deployer-defined mechanism, and emitting utterances to `ovos.utterance.handle`. Specifies required emission fields (`data.utterances`, `data.lang`), language resolution precedence (`session.detected_lang`, `session.request_lang`, `session.lang`), session assignment requirements (local vs. satellite), and conformance checklist. |
| Specification Index and Session Language Semantics `README.md`, `ovos-session-1.md` | Adds OVOS-AUDIO-IN-1 to the specifications index and clarifies the `stt_lang` field in the session specification to describe the language the STT stage was configured to assume (distinct from `data.lang`, the transcript output language), with explicit contracts for transcription and speech-translation scenarios. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related issues

OpenVoiceOS/architecture#52: Directly addresses the proposal for the OVOS-AUDIO-IN-1 audio input service specification with STT and session language semantics.

Possibly related PRs

OpenVoiceOS/architecture#11: The new OVOS-AUDIO-IN-1 spec explicitly emits results to ovos.utterance.handle, requiring conformance to OVOS-PIPELINE-1's utterance-lifecycle contract and terminal event guarantees.

Poem

🐰 In the audio streams so bright,
Input service sets things right,
STT transforms the sound,
Sessions bind the ground,
Pipeline's path now shines with light! ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|------------|--------|-------------|
| Title check | ✅ Passed | The title clearly and concisely summarizes the main change: introducing the OVOS-AUDIO-IN-1 specification document for the audio input service, which is the primary addition in this pull request. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |


coderabbitai

Actionable comments posted: 3



ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 93d82bcd-118f-43f5-9c10-af8f8558d7a4

📥 Commits

Reviewing files that changed from the base of the PR and between 5dd4a7b and d074ae2.

📒 Files selected for processing (3)

README.md
ovos-audio-in-1.md
ovos-session-1.md

coderabbitai · 2026-05-28T09:19:04Z

+
+| Field | Type | Required | Meaning |
+|-------|------|----------|---------|
+| `utterances` | array of string | yes | Transcription candidates; first element is primary. |


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Use array of strings for the utterances type.

array of string is ambiguous in a normative table. Please switch to array of strings to match the established payload wording and reduce implementer interpretation drift.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ovos-audio-in-1.md` at line 84, Change the type wording for the field named "utterances" from "array of string" to "array of strings" in the specification table so it matches established payload wording; locate the table row that defines the utterances field (header shows `utterances | array of string | yes | Transcription candidates; first element is primary.`) and update the type cell to read `array of strings`.

coderabbitai · 2026-05-28T09:19:04Z

+| Field | Type | Required | Meaning |
+|-------|------|----------|---------|
+| `utterances` | array of string | yes | Transcription candidates; first element is primary. |
+| `lang` | string | yes | BCP-47 output language of the transcription. See §5.1. |


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

data.lang is required here but optional-by-authority in PIPELINE-1.

This creates a contract mismatch: OVOS-PIPELINE-1 §9.1 allows lang only when the producer authoritatively knows content language, but this spec currently makes it unconditional (yes / MUST emit data.lang). Please align this to avoid forcing synthesized or guessed language tags.

Also applies to: 128-129

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ovos-audio-in-1.md` at line 85, The table entry that currently marks `data.lang` as required (`yes` / MUST) is incorrect; update the `lang` field to reflect the PIPELINE-1 §9.1 rule (optional-by-authority) so producers only emit `data.lang` when they authoritatively know the content language. Change the `lang` requirement from unconditional "yes / MUST emit" to an authority-conditional form (e.g., "conditional / when authoritative" or "optional-by-authority") in the `data.lang` row and make the same adjustment for the other occurrences referenced (lines 128-129) so the spec consistently defers to PIPELINE-1 §9.1. Ensure the text mentions `data.lang` and cites PIPELINE-1 §9.1 for clarity.
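
What the reviewer calls "optional-by-authority" could be sketched as below. This only illustrates the suggestion and does not show what the merged spec requires; `build_utterance_data` is a hypothetical helper, and passing `lang=None` is the assumed way for a producer to signal that it does not authoritatively know the content language.

```python
from typing import List, Optional


def build_utterance_data(utterances: List[str], lang: Optional[str] = None) -> dict:
    """Build the data payload, including lang only when the producer knows it."""
    data = {"utterances": list(utterances)}
    if lang is not None:
        # Emit data.lang only when it is known, never a guessed or synthesized tag.
        data["lang"] = lang
    return data
```

Under this reading a consumer has to tolerate a missing `lang` key, which is the contract difference the comment points at.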

coderabbitai · 2026-05-28T09:19:04Z

+read `stt_lang`; stages that need the transcript's language read
+`data.lang` or `session.lang`. Once set, `stt_lang` travels with


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not recommend session.lang as transcript-language fallback.

The clause “stages that need the transcript’s language read data.lang or session.lang” conflicts with utterance-layer semantics where payload language must come from data.lang (and must not be synthesized from session preference). Please remove or session.lang here to keep contracts consistent.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@ovos-session-1.md` around lines 374 - 375, The text incorrectly advises stages to read transcript language from "data.lang or session.lang"; remove "or session.lang" so stages only use data.lang (i.e., keep "read `stt_lang`; stages that need the transcript's language read `data.lang`"). Ensure any mention of `session.lang` as a fallback for `stt_lang` is deleted and add a clarifying note that `stt_lang` must come from the utterance payload (`data.lang`) and not be synthesized from `session.lang`.


JarbasAl and others added 4 commits May 28, 2026 09:42

JarbasAl mentioned this pull request May 28, 2026

Spec proposal: OVOS-AUDIO-IN-1 — Audio Input Service #52

Open

JarbasAl and others added 2 commits May 28, 2026 10:03

coderabbitai Bot reviewed May 28, 2026


This was referenced May 28, 2026

USER-ID-1: user identity resolution and authentication level specification #53

Open

USER-ID-1 v1: user identity resolution specification #54

Draft

AUDIO-IN-1: cross-reference USER-ID-1 voice signal injection

649542e

Audio transformer is the inline voice-signal injection point per USER-ID-1 §3.1; context.voice_match is the intermediate signal. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

This was referenced May 28, 2026

Adoption: move the OVOS specifications out of Draft status #5

Open

Epic: migrate OVOS repos onto ovos-spec-tools #7

Open

JarbasAl merged commit fc3df31 into dev May 28, 2026
1 check passed

JarbasAl merged 7 commits into dev from spec/audio-input
