Summary
Factory recipe to add audio understanding and/or speech output to any text-only LLM. Audio encoder for input, optional vocoder head for output.
Approach
Audio Input (hearing)
- Whisper-style encoder or distilled variant
- Train projection layer: audio embeddings → LLM token space
- Same pattern as vision — frozen base, train the bridge
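The bridge above can be sketched as a small trainable MLP between a frozen encoder and a frozen LLM. A minimal sketch in PyTorch; the dimensions (768 for a Whisper-style encoder, 4096 for the LLM) are illustrative assumptions, not fixed by the recipe:

```python
import torch
import torch.nn as nn

class AudioProjector(nn.Module):
    """Maps frozen audio-encoder embeddings into the LLM's token embedding space.

    Only this module is trained; encoder and LLM stay frozen, mirroring the
    vision-adapter pattern.
    """
    def __init__(self, audio_dim: int = 768, llm_dim: int = 4096):
        super().__init__()
        # Two-layer MLP bridge; a single Linear also works for a first pass.
        self.proj = nn.Sequential(
            nn.Linear(audio_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, audio_embeds: torch.Tensor) -> torch.Tensor:
        # audio_embeds: (batch, frames, audio_dim) from the audio encoder
        return self.proj(audio_embeds)

bridge = AudioProjector()
fake_frames = torch.randn(1, 50, 768)   # stand-in for encoder output
tokens = bridge(fake_frames)            # (1, 50, 4096): prepend to text embeddings
```

The projected frames are concatenated with the text token embeddings before the LLM forward pass, exactly as vision adapters do with image patches.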
Audio Output (speech)
- Train a small vocoder head on the LLM's hidden states
- Or use adapter approach: LLM generates speech tokens → decoder produces audio
- Could leverage existing TTS as bridge while training native capability
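One shape the vocoder-head option could take: a linear head that predicts mel-spectrogram frames from the LLM's hidden states, which a neural vocoder then renders to audio. A sketch under assumed dimensions (4096 hidden size, 80 mel bins, 4 frames per token):

```python
import torch
import torch.nn as nn

class SpeechHead(nn.Module):
    """Predicts mel-spectrogram frames from LLM hidden states.

    A separate neural vocoder (e.g. HiFi-GAN-style) turns the mel frames
    into a waveform; only this head is trained on paired text-audio data.
    """
    def __init__(self, llm_dim: int = 4096, n_mels: int = 80,
                 frames_per_token: int = 4):
        super().__init__()
        self.frames_per_token = frames_per_token
        self.n_mels = n_mels
        # Each hidden state emits several mel frames, since audio runs at a
        # much higher frame rate than text tokens.
        self.head = nn.Linear(llm_dim, n_mels * frames_per_token)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        b, t, _ = hidden.shape
        mel = self.head(hidden)                        # (b, t, n_mels * fpt)
        return mel.view(b, t * self.frames_per_token, self.n_mels)

head = SpeechHead()
hidden = torch.randn(2, 10, 4096)   # stand-in for LLM hidden states
mels = head(hidden)                 # (2, 40, 80) mel frames for the vocoder
```

The speech-token alternative swaps this regression head for a discrete codec vocabulary (the LLM emits codec IDs, a decoder reconstructs audio), which fits the existing next-token training loop more directly.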
Factory Integration
- New forge profile type: audio-encoder and/or speech-head
- Recipes composable: vision + audio + personality on same base model
- Validation: transcription accuracy, speech naturalness (MOS scores)
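A forge profile for an audio recipe might look like the following. Every field name here is hypothetical, illustrating the shape of the config rather than the factory's actual schema:

```python
# Hypothetical forge profile for an audio recipe. Field names and values
# are illustrative assumptions, not the factory's real schema.
audio_profile = {
    "base_model": "any-text-llm",
    "adapters": [
        # Hearing: frozen encoder + trainable projection bridge.
        {"type": "audio-encoder", "encoder": "whisper-small",
         "train": ["projection"]},
        # Speaking: trainable head over LLM hidden states + external vocoder.
        {"type": "speech-head", "vocoder": "hifigan", "train": ["head"]},
    ],
    "validation": {
        "transcription": "wer",   # word error rate on held-out audio
        "speech": "mos",          # mean opinion score for naturalness
    },
}

# Composability: a vision or personality recipe would add entries to the
# same "adapters" list against the same base model.
adapter_types = [a["type"] for a in audio_profile["adapters"]]
```

Keeping each capability as an adapter entry is what makes the recipes composable: vision, audio, and personality each contribute an adapter against one shared base.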
Why
A local model that can hear and speak natively — not through TTS/STT bridges — is qualitatively different. Lower latency, better understanding of tone/emphasis, natural conversation. The factory already handles the training infra; this is just a new recipe.
Constraints
- Audio data is large — need efficient data loading pipeline
- Real-time inference needs AudioWorklet integration (already architected)
- Speech head training needs paired text-audio data
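For the data-size constraint, one common mitigation is streaming: decode one clip at a time from a manifest instead of materializing the corpus in memory. A minimal sketch with a hypothetical `manifest` of (path, transcript) pairs; the placeholder decode stands in for a real call like `torchaudio.load(path)`:

```python
import torch
from torch.utils.data import IterableDataset

class StreamingAudioDataset(IterableDataset):
    """Streams (waveform, transcript) pairs lazily.

    `manifest` is a hypothetical list of (path, transcript) entries;
    the corpus never needs to fit in RAM.
    """
    def __init__(self, manifest):
        self.manifest = manifest

    def __iter__(self):
        for path, text in self.manifest:
            # Placeholder decode for the sketch; in practice use
            # torchaudio.load(path) here, one clip at a time.
            waveform = torch.zeros(16000)  # 1 s of silence at 16 kHz
            yield waveform, text

ds = StreamingAudioDataset([("clip0.wav", "hello world")])
wave, text = next(iter(ds))
```

An `IterableDataset` also composes with `DataLoader` workers for parallel decode, which matters once clips are decompressed on the fly.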
Related