fix(inference,#1280): delete dead Candle adapter chain (Phase 1, ~2500 LOC) #1288
Merged
Conversation
fix(inference,#1280): delete dead Candle adapter chain (Phase 1, ~2500 LOC)

Per the plasticity reachability audit on #1280 (#1280 (comment)), production routes local inference exclusively through `LlamaCppAdapter`. The Candle-side chain — `CandleAdapter`, `ContinuumModel`, `select_best_device`, `load_model_by_id`, `quantized.rs::load_*_quantized`, `backends::generate`, `backends::load_gguf_backend` — was reachable only through itself or orphaned `bin/*` files. Plasticity's IPC handlers (`plasticity/{analyze,compact,compress,topology,pipeline}`) work on safetensors files via plasticity's own helpers and don't touch this chain.

Deleted:
- `inference/candle_adapter.rs` (1486 LOC)
- `inference/quantized.rs` (287 LOC)
- `inference/model.rs` collapsed from 857 → 167 LOC, retaining only `rebuild_with_stacked_lora` (used by `backends/llama_safetensors.rs::CompactLlamaSafetensorsBackend`, test-only, slated for Phase 2 deletion alongside the safetensors backends once plasticity LoRA training is migrated or retired)

Wire updates:
- `ai/mod.rs`: drop the `pub use crate::inference::CandleAdapter` re-export
- `inference/mod.rs`: drop the `candle_adapter`/`quantized` modules and their re-exports; keep `model::rebuild_with_stacked_lora` only
- `modules/ai_provider.rs`: drop the dead `CandleAdapter` import (imported but never instantiated by `register_adapters`)

Contract relocation (the audit's flagged risk): the no-CPU-fallback `panic!("...CPU fallback is disabled")` in `select_best_device` was deleted along with the rest of the dead chain. The contract's actual production enforcement was already on llama.cpp: `LlamaCppConfig::default()` sets `n_gpu_layers: -1` (= "all layers on GPU"), and llama.cpp's loader hard-fails when no GPU is available. `tests/no_cpu_fallback_contract.rs` is updated atomically to assert the `n_gpu_layers: -1` invariant in `backends/llamacpp.rs` rather than the deleted panic site. The `ort_providers` and `LlamaCppAdapter` assertions survive unchanged.

Net: 7 files changed, +92 / -2546 LOC.

Verified:
- `cargo check --features metal`: clean (52 pre-existing warnings, 0 errors)
- `cargo test --test no_cpu_fallback_contract`: 3 passed (new contract assertion `llamacpp_default_config_requires_full_gpu_offload` green)
- `cargo test --lib --features metal`: 2166 passed, 0 failed

Phase 2 (deferred): delete safetensors backends + vendored qwen2/llama backends + `rebuild_with_stacked_lora` once plasticity's production reachability allows.

Audit: #1262 (comment)
Mission: Joel 2026-05-15 — "eliminate slop and slowly oxidize this project"

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Phase 1 of #1280 — delete the Candle inference chain that's been vestigial since the llama.cpp migration. Production routes local inference exclusively through `LlamaCppAdapter`; this PR removes ~2500 LOC that was reachable only from itself or orphaned bin/* files.
Net: 7 files changed, +92 / -2546 LOC.
Deleted
- `inference/candle_adapter.rs` (1486 LOC)
- `inference/quantized.rs` (287 LOC)
- `inference/model.rs` collapsed from 857 → 167 LOC, retaining only `rebuild_with_stacked_lora` (used by the test-only `backends/llama_safetensors.rs::CompactLlamaSafetensorsBackend`; slated for Phase 2 deletion alongside the safetensors backends once plasticity LoRA training is migrated or retired)
Wire updates
- `ai/mod.rs`: drop the `pub use crate::inference::CandleAdapter` re-export
- `inference/mod.rs`: drop the `candle_adapter`/`quantized` modules and their re-exports; keep `model::rebuild_with_stacked_lora` only (sketched below)
- `modules/ai_provider.rs`: drop the dead `CandleAdapter` import (it was imported but never instantiated by `register_adapters`)
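A minimal sketch of what `inference/mod.rs` looks like after these drops — only the retained module and re-export are confirmed by this PR; the comment summarizes the removals:

```rust
// inference/mod.rs after Phase 1 (sketch).
// Removed: `pub mod candle_adapter;`, `pub mod quantized;` and their
// re-exports. Retained: the single LoRA helper still used by the
// test-only safetensors backend, pending Phase 2.
pub mod model;

pub use model::rebuild_with_stacked_lora;
```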
Contract relocation (audit-flagged risk)
The audit (#1280 (comment)) flagged that the no-CPU-fallback `panic!("...CPU fallback is disabled")` lived in `select_best_device` (deleted with the dead chain). The contract's actual production enforcement was already on llama.cpp: `LlamaCppConfig::default()` sets `n_gpu_layers: -1` (= "all layers on GPU"); llama.cpp's loader hard-fails when no GPU is available.
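For context, a minimal sketch of that default as the PR describes it — only `n_gpu_layers: -1` is confirmed; the struct shape around it is an assumption:

```rust
pub struct LlamaCppConfig {
    /// -1 = offload all layers to the GPU. llama.cpp's loader
    /// hard-fails at model load when no GPU backend is available,
    /// so this default is the production no-CPU-fallback enforcement.
    pub n_gpu_layers: i32,
}

impl Default for LlamaCppConfig {
    fn default() -> Self {
        Self { n_gpu_layers: -1 }
    }
}
```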
`tests/no_cpu_fallback_contract.rs` is updated atomically to assert the `n_gpu_layers: -1` invariant in `backends/llamacpp.rs` rather than the deleted panic site:
```rust
#[test]
fn llamacpp_default_config_requires_full_gpu_offload() {
    assert!(LLAMACPP_BACKEND_SOURCE.contains("n_gpu_layers: -1"), "...");
}
```
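The PR doesn't show where `LLAMACPP_BACKEND_SOURCE` comes from; the usual pattern for a source-text contract test like this is an `include_str!` of the backend file, roughly (path assumed):

```rust
// Assumed definition — include_str! embeds the backend source at
// compile time so the test can assert on it as plain text.
const LLAMACPP_BACKEND_SOURCE: &str =
    include_str!("../src/backends/llamacpp.rs");
```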
The `ort_providers` and `LlamaCppAdapter::NoLocalModelLoadable` assertions survive unchanged.
Why safe
Per the plasticity reachability audit (#1280 audit comment): plasticity's IPC handlers (`plasticity/{analyze,compact,compress,topology,pipeline}`) work on safetensors files via plasticity's own `compactor::` / `pipeline::` helpers and don't import anything from the deleted Candle chain. Only plasticity's `#[cfg(test)] mod tests` block touches the safetensors backends — those tests are preserved (they go in Phase 2 alongside the safetensors backends).
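Concretely, the reachability split the audit describes looks like this — a sketch with illustrative paths; only the module names and the test-only backend are named in this PR:

```rust
// Plasticity's IPC handlers import only plasticity's own helpers,
// never anything from the deleted Candle chain (paths illustrative).
use crate::plasticity::compactor;
use crate::plasticity::pipeline;

// The one surviving consumer of the safetensors backends is gated
// behind #[cfg(test)] inside plasticity; it goes in Phase 2.
#[cfg(test)]
mod tests {
    use crate::backends::llama_safetensors::CompactLlamaSafetensorsBackend;
}
```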
Mission context
Joel 2026-05-15: "mission to eliminate slop and slowly oxidize this project". This PR is the second-largest single dead-code deletion in the #1262 audit's follow-on series (after #1279's qwen3.5 deletion at 1132 LOC).
Test plan
- `cargo check --features metal`: clean (52 pre-existing warnings, 0 errors)
- `cargo test --test no_cpu_fallback_contract`: 3 passed, including the new `llamacpp_default_config_requires_full_gpu_offload` assertion
- `cargo test --lib --features metal`: 2166 passed, 0 failed
Phase 2 (deferred)
Delete safetensors backends (`llama_safetensors.rs`, `compact_llama_safetensors.rs`, `qwen2_safetensors.rs`, `llama_gguf.rs`) + vendored Candle modules (`compact_llama`, `quantized_llama`, `qwen2`) + `rebuild_with_stacked_lora` once plasticity's LoRA training infrastructure is migrated to llama.cpp or formally retired. Estimated additional ~2500 LOC.
🤖 Generated with Claude Code