fix(audio/tts): orpheus must fail-closed on no-Metal, no CPU fallback #1312
Merged
vhsm-d1f4 (Joel via vHSM cwd) flagged this in the 2026-05-16 audit pass:
orpheus.rs:179-191 had explicit Metal→CPU fallback with a friendly
"Orpheus: Using CPU (with Accelerate BLAS)" log. The fallback evaded
tests/no_cpu_fallback_contract.rs because that test only inspects
llamacpp.rs/ort_providers.rs/llamacpp_adapter.rs — Candle-side TTS
slipped through.
Joel's audit attributes the 900% CPU pathology seen during chat to this
class of silent fallback: render loop is sacred per the README, main
thread should not be doing inference, but Orpheus CPU+Accelerate BLAS
via candle ends up doing exactly that.
What changes:
- select_device() -> Device becomes select_device() -> Result<Device, TTSError>
- On Metal failure, returns TTSError::ModelNotLoaded with explicit
"Orpheus requires Metal GPU; no CPU fallback. Device::new_metal(0)
failed: {e}"
- Caller at line 550 propagates with ?
- The "Using CPU" log line is gone; only the success-path Metal log
remains
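The signature change listed above can be sketched roughly as follows. This is a minimal illustration, not the code from orpheus.rs: `Device` and `TTSError` are stand-ins for `candle_core::Device` and the crate's own error type, `load_orpheus` is a hypothetical caller, and the Metal constructor is injected as a closure (an assumption made only so the sketch runs on hosts without a GPU).

```rust
use std::fmt;

/// Stand-in for `candle_core::Device`, reduced to what the sketch needs.
#[derive(Debug, Clone, PartialEq)]
pub enum Device {
    Metal(usize),
}

/// Stand-in for the crate's TTS error type; only the variant named in this PR.
#[derive(Debug, PartialEq)]
pub enum TTSError {
    ModelNotLoaded(String),
}

impl fmt::Display for TTSError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            TTSError::ModelNotLoaded(msg) => write!(f, "model not loaded: {msg}"),
        }
    }
}

impl std::error::Error for TTSError {}

/// Fail-closed device selection: a Metal failure becomes a typed error.
/// `new_metal` is injected here (assumption); the real code would call
/// `Device::new_metal(0)` directly.
pub fn select_device<E: fmt::Display>(
    new_metal: impl FnOnce(usize) -> Result<Device, E>,
) -> Result<Device, TTSError> {
    new_metal(0).map_err(|e| {
        TTSError::ModelNotLoaded(format!(
            "Orpheus requires Metal GPU; no CPU fallback. Device::new_metal(0) failed: {e}"
        ))
    })
}

/// Hypothetical caller: `?` propagates the error instead of loading on CPU.
pub fn load_orpheus<E: fmt::Display>(
    new_metal: impl FnOnce(usize) -> Result<Device, E>,
) -> Result<Device, TTSError> {
    let device = select_device(new_metal)?;
    // Model weights would be loaded onto `device` here.
    Ok(device)
}
```

Because there is no `Device::Cpu` arm left to return, a host without Metal cannot reach the model-loading step at all.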
What does NOT change:
- Behavior on Metal-capable hosts: identical
- SNAC decoder ORT path already required GPU EP (lines 196-208); this
PR brings the GGUF/candle path to the same standard
- TTS engine selection elsewhere — if Orpheus refuses to load, the
caller can register a different TTS engine or surface to operator
Why this is safe:
- cargo check + clippy clean (146 warnings, baseline 146 = no regression)
- All Mac dev hosts have Metal; production runtime contract per README
requires GPU
- Error surface is typed (TTSError::ModelNotLoaded) so callers can
fall through to alternative TTS engines if registered, or fail-loud
otherwise — no silent CPU drift
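One way a caller might use the typed error to fall through to another registered engine, or fail loudly when none loads, is sketched below. The `TtsEngine` trait, `first_loadable`, and `StubEngine` are hypothetical names introduced for illustration; the PR does not show the actual engine registry.

```rust
/// Stand-in for the crate's TTS error type.
#[derive(Debug, PartialEq)]
pub enum TTSError {
    ModelNotLoaded(String),
}

/// Hypothetical engine interface: a name plus a fallible load step.
pub trait TtsEngine {
    fn name(&self) -> &str;
    fn load(&self) -> Result<(), TTSError>;
}

/// Try engines in registration order and return the first that loads.
/// If none loads, return every failure so nothing is silently swallowed.
pub fn first_loadable<'a>(
    engines: &'a [Box<dyn TtsEngine>],
) -> Result<&'a dyn TtsEngine, Vec<(String, TTSError)>> {
    let mut failures = Vec::new();
    for engine in engines {
        match engine.load() {
            Ok(()) => return Ok(&**engine),
            Err(e) => failures.push((engine.name().to_string(), e)),
        }
    }
    Err(failures)
}

/// Hypothetical engine used only to exercise the sketch.
pub struct StubEngine {
    pub name: &'static str,
    pub gpu_available: bool,
}

impl TtsEngine for StubEngine {
    fn name(&self) -> &str {
        self.name
    }

    fn load(&self) -> Result<(), TTSError> {
        if self.gpu_available {
            Ok(())
        } else {
            Err(TTSError::ModelNotLoaded(format!(
                "{} requires a GPU; no CPU fallback",
                self.name
            )))
        }
    }
}
```

Returning the collected failures keeps the fail-loud branch informative: the operator sees why each engine refused to load.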
VDD note (per vhsm-d1f4 audit pass 2): this PR is defensive (prevents
the CPU pathology); tok/s measurement isn't applicable because it
removes the SOURCE of the pathology rather than tuning the hot path.
Whoever owns Phase A.8 next can measure aggregate tok/s with Orpheus
load now correctly gated.
Follow-ups (separate PRs):
- src/workers/inference-grpc/src/model.rs:275-295 same CUDA→Metal→CPU
fallback (vhsm-d1f4 finding #2)
- Widen tests/no_cpu_fallback_contract.rs to grep whole workers tree
for Device::Cpu, require allow-list justification (finding #3)
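The widened contract test in the second follow-up could look something like the sketch below. It assumes a same-line justification comment as the allow-list mechanism; the marker text `cpu-fallback-allowed:` and the `scan_source`, `scan_tree`, and `Violation` names are hypothetical, since the PR does not specify the allow-list format.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical justification marker; the real allow-list format is not
/// specified in the PR.
const ALLOW_MARKER: &str = "cpu-fallback-allowed:";

#[derive(Debug, PartialEq)]
pub struct Violation {
    pub file: PathBuf,
    pub line: usize,
}

/// 1-based line numbers that mention `Device::Cpu` without the marker
/// on the same line.
pub fn scan_source(source: &str) -> Vec<usize> {
    source
        .lines()
        .enumerate()
        .filter(|(_, l)| l.contains("Device::Cpu") && !l.contains(ALLOW_MARKER))
        .map(|(i, _)| i + 1)
        .collect()
}

/// Recursively scan every `.rs` file under `root`.
pub fn scan_tree(root: &Path) -> io::Result<Vec<Violation>> {
    let mut out = Vec::new();
    let mut stack = vec![root.to_path_buf()];
    while let Some(dir) = stack.pop() {
        for entry in fs::read_dir(&dir)? {
            let path = entry?.path();
            if path.is_dir() {
                stack.push(path);
            } else if path.extension().and_then(|e| e.to_str()) == Some("rs") {
                let text = fs::read_to_string(&path)?;
                for line in scan_source(&text) {
                    out.push(Violation { file: path.clone(), line });
                }
            }
        }
    }
    Ok(out)
}
```

A contract test would then assert that `scan_tree` over the workers tree returns an empty list, so a new file with an unjustified `Device::Cpu` fails CI without anyone having to add that file to a list of inspected paths.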
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
joelteply added a commit that referenced this pull request May 16, 2026
Companion to codex's #1312 (orpheus same-shape fix). Closes the
inference-grpc CPU-fallback path supervisor vhsm-d1f4 flagged in audit
pass 1 finding #2 (2026-05-16). Evaded the codified
no_cpu_fallback_contract.rs test (only inspects llamacpp /
ort_providers / llamacpp_adapter, not workers/inference-grpc).
Pre-fix select_best_device tried CUDA, tried Metal, then printed
'Using CPU (no GPU acceleration)' and returned Device::Cpu.
- select_best_device now returns Result<Device, Box<dyn Error>>
- caller propagates via ?, no behavior change on GPU-available hosts
- Error message names what to do
- cargo check clean: --features metal
Co-authored-by: Test <test@test.com>
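The companion change to select_best_device might be shaped roughly like this. It is a sketch under stated assumptions: `Device` is a stand-in for `candle_core::Device`, the two backend probes are injected as closures so the example runs without a GPU, and the error wording is invented for illustration (the commit only says the message "names what to do").

```rust
use std::error::Error;

/// Stand-in for `candle_core::Device`, reduced to the GPU backends.
#[derive(Debug, Clone, PartialEq)]
pub enum Device {
    Cuda(usize),
    Metal(usize),
}

/// Try CUDA, then Metal; if both fail, return an error instead of
/// `Device::Cpu`. The probes are injected (assumption) for testability.
pub fn select_best_device(
    try_cuda: impl FnOnce() -> Result<Device, Box<dyn Error>>,
    try_metal: impl FnOnce() -> Result<Device, Box<dyn Error>>,
) -> Result<Device, Box<dyn Error>> {
    let cuda_err = match try_cuda() {
        Ok(device) => return Ok(device),
        Err(e) => e,
    };
    let metal_err = match try_metal() {
        Ok(device) => return Ok(device),
        Err(e) => e,
    };
    // Illustrative wording only; the real message is not shown in the commit.
    Err(format!(
        "no GPU device available (CUDA: {cuda_err}; Metal: {metal_err}); \
         refusing CPU fallback. Run on a GPU-capable host with a GPU feature enabled."
    )
    .into())
}
```

Keeping both underlying errors in the message lets an operator tell a missing feature flag apart from a missing device.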
Summary
vhsm-d1f4 (Joel via vHSM cwd) flagged `orpheus.rs:181-190` in audit pass 1 (2026-05-16): explicit Metal→CPU fallback with a friendly `"Orpheus: Using CPU (with Accelerate BLAS)"` log. The fallback evaded `tests/no_cpu_fallback_contract.rs` because that test only inspects llamacpp.rs/ort_providers.rs/llamacpp_adapter.rs — Candle-side TTS slipped through.
Joel attributes the 900% CPU pathology seen during chat to this class of silent fallback: render loop is sacred per README, main thread should not be doing inference, but Orpheus CPU+Accelerate BLAS via candle ends up doing exactly that.
Vision Pro test (per Joel's TS-vs-Rust criterion)
The Rust crate compiles + tests standalone with this change. Caller (any host — Swift on Vision Pro, Node on Mac, whatever) sees a typed `TTSError` if Metal unavailable. No JS runtime involved.
VDD note (per vhsm-d1f4 audit pass 2)
This PR is defensive — it removes the SOURCE of the CPU pathology rather than tuning the hot path. tok/s measurement isn't applicable to a deletion. Whoever owns Phase A.8 next can measure aggregate tok/s with Orpheus load now correctly gated.
🤖 Generated with Claude Code