Factory recipe: audio encoder + speech head (Whisper-style) #650

@joelteply

Description

Summary

A factory recipe that adds audio understanding and/or speech output to any text-only LLM: an audio encoder for input, and an optional vocoder head for output.

Approach

Audio Input (hearing)

  • Whisper-style encoder or distilled variant
  • Train projection layer: audio embeddings → LLM token space
  • Same pattern as vision — frozen base, train the bridge

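The bridge can be sketched in a few lines. This is a minimal NumPy illustration, not the actual implementation; the dimensions (768 for a Whisper-style encoder, 4096 for the LLM) and the names `W_proj` / `project_audio` are hypothetical placeholders. Only the projection parameters would receive gradients; the encoder and LLM stay frozen.

```python
import numpy as np

# Hypothetical dimensions: encoder output width vs. LLM embedding width.
D_AUDIO = 768    # Whisper-style encoder hidden size (assumed)
D_MODEL = 4096   # LLM token-embedding size (assumed)

rng = np.random.default_rng(0)

# The trainable bridge: a single linear projection (a small MLP also works).
W_proj = rng.normal(0.0, 0.02, size=(D_AUDIO, D_MODEL))
b_proj = np.zeros(D_MODEL)

def project_audio(encoder_out: np.ndarray) -> np.ndarray:
    """Map frozen-encoder frames (T, D_AUDIO) into LLM token space (T, D_MODEL).

    The projected frames are prepended/interleaved with text embeddings and
    fed to the frozen LLM; only W_proj and b_proj are updated in training.
    """
    return encoder_out @ W_proj + b_proj

# 1500 frames corresponds to ~30 s of audio at Whisper's 50 Hz frame rate.
audio_frames = rng.normal(size=(1500, D_AUDIO))
soft_tokens = project_audio(audio_frames)
print(soft_tokens.shape)  # (1500, 4096)
```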
Audio Output (speech)

  • Train a small vocoder head on the LLM's hidden states
  • Or use adapter approach: LLM generates speech tokens → decoder produces audio
  • Could leverage existing TTS as bridge while training native capability

Factory Integration

  • New forge profile type: audio-encoder and/or speech-head
  • Recipes composable: vision + audio + personality on same base model
  • Validation: transcription accuracy (e.g. word error rate) for input, speech naturalness (MOS scores) for output

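For the transcription-accuracy check, word error rate is the obvious starting metric and needs no external libraries. A self-contained sketch, assuming simple whitespace tokenization (libraries like jiwer add normalization on top of this):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard edit-distance DP over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("turn on the lights", "turn on the light"))  # 0.25
```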
Why

A local model that can hear and speak natively, rather than through TTS/STT bridges, is qualitatively different: lower latency, better understanding of tone and emphasis, natural conversation. The factory already handles the training infra; this is just a new recipe.

Constraints

  • Audio data is large — need efficient data loading pipeline
  • Real-time inference needs AudioWorklet integration (already architected)
  • Speech head training needs paired text-audio data

Labels: enhancement (New feature or request)