feat(accessors): issue #2 wrapper API + 27-round safety hardening pass#8
Merged
Conversation
There was a problem hiding this comment.
Pull request overview
This PR expands the safe Rust wrapper’s public “accessor” surface and hardens several FFI-facing behaviors (language handling, duration bounds, tokenization, timings) with additional validation and typed errors, plus a comments-only safety audit module to document the review matrix.
Changes:
- Add new public accessors/introspection APIs (e.g.,
version(),lang_max_id(),lang_id_for(),Lang::full_name(),Context::{token_to_bytes, token_* accessors, tokenize, tokenize_one, model_dims},State::{n_mel_frames, print_timings, reset_timings}). - Add
State::fullpreflights for model-bound language hints andoffset_ms/duration_msaudio-range validation (new error variants). - Extend the sys shim surface (no-log tokenize + state-aware timings + lang-id shim) and enforce required patch markers in
whispercpp-sys/build.rs; add a large comments-onlysafety_auditmodule.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
whispercpp/src/state.rs |
Adds State::full preflights (language/model + duration bounds) and new State accessors for mel frames and timings. |
whispercpp/src/params.rs |
Strengthens setters (set_language, set_initial_prompt, set_tokens) with bounds/alloc-safety and adds internal accessors used by State::full. |
whispercpp/src/context.rs |
Adds token/vocab accessors, tokenization APIs, model-dims introspection, and improves constructor error reporting. |
whispercpp/src/error.rs |
Expands WhisperError with new variants and adds sentinel code fields for richer diagnostics. |
whispercpp/src/lang.rs |
Adds Lang::full_name plus lang_max_id/lang_id_for top-level APIs and tests. |
whispercpp/src/lib.rs |
Wires in safety_audit, re-exports new language helpers, and adds version() with link-guard tests. |
whispercpp/src/safety_audit.rs |
Adds a comments-only safety audit matrix documenting the API surface and required native patches. |
whispercpp-sys/whispercpp_shim.h |
Declares new shims (no-log tokenize, token-count probe, state-aware timings, lang-id shim). |
whispercpp-sys/whispercpp_shim.cpp |
Implements whispercpp_lang_id and updates shim notes. |
whispercpp-sys/build.rs |
Adds required patch markers and allowlists additional no-throw accessors/shims for bindgen. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+969
to
977
| /// | ||
| /// Returns [`WhisperError::AllocationFailed`] if the copy | ||
| /// buffer cannot be reserved (`Vec::try_reserve_exact` | ||
| /// failure). **Panic-free** — `slice::to_vec` would route | ||
| /// through the abort-on-OOM global allocator; the fallible | ||
| /// reserve makes a multi-gigabyte token vector under memory | ||
| /// pressure a typed error instead of a process abort. | ||
| pub fn set_tokens(&mut self, tokens: &[i32]) -> WhisperResult<&mut Self> { | ||
| if tokens.is_empty() { |
Comment on lines
34
to
56
| #[error("failed to load model from {path}: {reason}")] | ||
| ContextLoad { | ||
| /// Path the caller passed in. Stored so logs can pinpoint | ||
| /// which model file failed. | ||
| path: SmolStr, | ||
| /// Any extra context whisper.cpp surfaced (often empty — | ||
| /// the C API just returns NULL). | ||
| /// Any extra context whisper.cpp surfaced. Set to one of | ||
| /// a small set of `&'static str` values via | ||
| /// `SmolStr::new_static`, never `format!` — this error | ||
| /// is reached on the `bad_alloc` / `system_error` path, | ||
| /// and the `format!` macro plus | ||
| /// `SmolStr::new(formatted_string)` would each | ||
| /// heap-allocate while the global allocator is already | ||
| /// under pressure. | ||
| reason: SmolStr, | ||
| /// Constructor exception sentinel from the C++ shim. | ||
| /// `Some(WHISPERCPP_ERR_*)` when the shim caught a | ||
| /// throw; `None` when upstream returned NULL via its | ||
| /// own bool-failure path. Kept as a separate `Option<i32>` | ||
| /// (instead of formatted into `reason`) so error | ||
| /// construction allocates nothing on the | ||
| /// caught-exception branch. | ||
| code: Option<i32>, | ||
| }, |
Implements the wrapper additions tracked in issue #2 (token / vocab / language / model-dim accessors, state-aware timings, fallible setters) and hardens the FFI seam against panic, leak, double-free, OOM-abort, dangling-alias, and integer- boundary classes reachable from safe Rust through `Context::new` / `Context::create_state` / `State::full`. Rust-side surface ----------------- * New accessors: `Context::{token_to_bytes, tokenize, tokenize_one, model_dims, token_for_lang}`, special-token getters, `Lang::full_name` (owned `SmolStr`), top-level `version` / `lang_max_id` / `lang_id_for`. * State-bound timings: `State::{print_timings, reset_timings, n_mel_frames}` (state-bound only — context-shared totals dropped to avoid races and stale-baseline output). * Fallible setters with `try_reserve_exact` on every public copying input: `Params::{set_language, set_initial_prompt, set_tokens}`. * New error variants: `UnknownLanguage`, `LanguageNotSupportedByModel`, `InvalidDuration`, `AllocationFailed`, plus `ContextLoad { code: Option<i32> }` and `StateInit { code: Option<i32> }` carrying shim sentinels without OOM-path heap allocation. * Bounded error payloads (≤ 256 bytes) and static-string reasons on the OOM-recoverable path. Submodule patches (Findit-AI/whisper.cpp@rust) ---------------------------------------------- 37 patches, pinned via `REQUIRED_MARKERS` in `whispercpp-sys/build.rs`. A submodule rebase that drops any sentinel fails patch-verification at build time. * C++ exception safety: RAII entry/exit for `whisper_init_state` and `whisper_init_from_file_with_params_no_state`; per-pointer `ggml_context_ptr` / `ggml_backend_buffer_ptr` guards in `whisper_model_load`, `whisper_vad_load`, and `whisper_backend_init` (closes leak windows around `ctx_map[buft] = ctx`, `model.ctxs.emplace_back`, raw-local tensor-prep contexts, and backend-handle accumulation). * OOM-safe allocators on the create-state path: every `GGML_ASSERT(ptr != NULL)` and `GGML_ABORT` reachable from `Context::create_state` → `whisper_sched_graph_init` → `ggml_backend_sched_new` / `ggml_gallocr_new_n` / `ggml_gallocr_reserve_n_impl` / `ggml_dyn_tallocr_*` converted to return-NULL/false with cleanup. File-local non-aborting `whispercpp_hash_set_new_or_null` variants replace upstream's `GGML_MALLOC`-routed `ggml_hash_set_new` calls. * Transactional state changes on the recoverable path: `node_allocs`, `leaf_allocs`, `vbuffer` realloc (with alias-aware pre-pass), and `hash_set` + `hash_values` all use atomic-commit so an alloc failure leaves the gallocr in a coherent retryable state instead of poisoned (NULL with stale capacity, dangling pointers in aliased slots, etc.). * Validation: vocab `n_vocab` bounds + post-synthesis size check, tensor header validation, special-token bounds, hparams head divisibility, `lang_str` null guard, sched abort callback wiring, model-bound language preflight (multilingual + non-multilingual + auto-detect filter against `vocab.num_languages()`). * Integer-boundary: `whisper_tokenize`, `whisper_token_count`, and the `whispercpp_*` shim variants reject `res.size() > INT_MAX` with `INT_MIN` instead of UB. * DoS / range guards: `Params::set_duration_ms` preflight (avoids a `seek_end` slip driving 71k zero-padded decode windows); auto-detect candidate-set filter (avoids out-of-range language ids aliasing onto task tokens). * Sentinel hygiene: `whispercpp_*` shims with disjoint sentinels (`INT_MIN` for tokenize-overflow, `WHISPERCPP_ERR_*` ≤ -100 for caught exceptions); `va_copy` + truncation in `whisper_log_internal` to close va-list reuse and heap-throw UB. Audit framework --------------- `whispercpp/src/safety_audit.rs` documents the 14+ safety axes (throw-safety, sync, allocation, lifetime, linkage, sentinel collisions, log pollution, error-payload bounds, race / classification, model-bound semantics, metadata consistency, DoS amplification, default-state mirror, state vs ctx scope, integer boundary, abort-bypass of shim, invariant preservation), the per-method × per-axis ✓ matrix, and the native-side patches list. Pre-emptive out-of-scope sweep targets (VAD `(int) size()` casts, `ggml_backend_tensor_copy` slow-path malloc, the broader `GGML_MALLOC` / `GGML_CALLOC` abort-on-OOM call graph, `ggml_backend_graph_copy`'s aborting hash_set use) are flagged in the doc rather than buried in commit footers. Test plan --------- 47/47 unit tests pass; clippy `--all-targets -D warnings` clean. DTW × flash-attn rejection covered. `Params` setters fully exercised (length caps, NUL rejection, default mirroring, accessor round-trips, language overlong-payload bound). State-aware timing shim C linkage pinned. `Token::from_raw` / DTW sentinel projection / language preflight covered. Fault-injection coverage for OOM paths (LD_PRELOAD-style allocator interposition or custom `mem_buffer` threading) is logged as the next sweep target. Real-model integration tests still rely on the parity harness in `whispery`.
uqio
added a commit
that referenced
this pull request
May 9, 2026
Documents the DTW timestamp surface (#7), the issue-#2 accessor surface (#8), the safety hardening pass against panic / leak / OOM-abort / integer-boundary classes, the breaking changes since 0.1.0, and the 37 native-side patches pinned via build-time marker verification. Format follows Keep a Changelog 1.1.0; project follows SemVer 2.0.
14 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Implements the wrapper additions tracked in issue #2 (token / vocab / language / model-dim accessors, state-aware timings, fallible setters), then expanded — over 27 rounds of adversarial review — into a comprehensive safety pass that hardens the FFI seam against panic, leak, double-free, OOM-abort, dangling-alias, and integer-boundary classes reachable from safe Rust through
Context::new/Context::create_state/State::full.Rust-side surface
Context::{token_to_bytes, tokenize, tokenize_one, model_dims, token_for_lang}, special-token getters,Lang::full_name(ownedSmolStr), top-levelversion/lang_max_id/lang_id_for.State::{print_timings, reset_timings, n_mel_frames}(state-bound only — context-shared totals dropped to avoid races and stale-baseline output).try_reserve_exacton every public copying input:Params::{set_language, set_initial_prompt, set_tokens}.UnknownLanguage,LanguageNotSupportedByModel,InvalidDuration,AllocationFailed, plusContextLoad { code: Option<i32> }andStateInit { code: Option<i32> }carrying shim sentinels without OOM-path heap allocation.Submodule patches (
Findit-AI/whisper.cpp@rust)37 patches, pinned via
REQUIRED_MARKERSinwhispercpp-sys/build.rs. A submodule rebase that drops any sentinel fails patch-verification at build time.whisper_init_state/whisper_init_from_file_with_params_no_state; per-pointerggml_context_ptr/ggml_backend_buffer_ptrguards inwhisper_model_load,whisper_vad_load, andwhisper_backend_init(closes leak windows aroundctx_map[buft] = ctx,model.ctxs.emplace_back, raw-local tensor-prep contexts, and backend-handle accumulation).GGML_ASSERT(ptr != NULL)andGGML_ABORTreachable fromContext::create_state→whisper_sched_graph_init→ggml_backend_sched_new/ggml_gallocr_new_n/ggml_gallocr_reserve_n_impl/ggml_dyn_tallocr_*converted to return-NULL/false with cleanup. File-local non-abortingwhispercpp_hash_set_new_or_nullvariants replace upstream'sGGML_MALLOC-routedggml_hash_set_newcalls.node_allocs,leaf_allocs,vbufferrealloc (with alias-aware pre-pass), andhash_set+hash_valuesall use atomic-commit so an alloc failure leaves the gallocr in a coherent retryable state instead of poisoned (NULL with stale capacity, dangling pointers in aliased slots, etc.).n_vocabbounds + post-synthesis size check, tensor header validation, special-token bounds, hparams head divisibility,lang_strnull guard, sched abort callback wiring, model-bound language preflight (multilingual + non-multilingual + auto-detect filter againstvocab.num_languages()).whisper_tokenize,whisper_token_count, and thewhispercpp_*shim variants rejectres.size() > INT_MAXwithINT_MINinstead of UB.Params::set_duration_mspreflight (avoids aseek_endslip driving 71k zero-padded decode windows); auto-detect candidate-set filter (avoids out-of-range language ids aliasing onto task tokens).whispercpp_*shims with disjoint sentinels (INT_MINfor tokenize-overflow,WHISPERCPP_ERR_*≤ -100 for caught exceptions);va_copy+ truncation inwhisper_log_internalto close va-list reuse and heap-throw UB.Audit framework
whispercpp/src/safety_audit.rsis the canonical entry point. It documents:(int) size()casts,ggml_backend_tensor_copyslow-path malloc, the broaderGGML_MALLOC/GGML_CALLOCabort-on-OOM call graph,ggml_backend_graph_copy's aborting hash_set use).Test plan
--all-targets -D warningsclean.Paramssetters fully exercised (length caps, NUL rejection, default mirroring, accessor round-trips, language overlong-payload bound).Token::from_raw/ DTW sentinel projection / language preflight.mem_bufferthreading; logged as the next sweep target.whispery.Out-of-scope follow-ups (audit-tracked)
(int) speeches.size()casts atwhisper.cpp:6264+— not yet exposed.ggml_backend_tensor_copyslow-path uncheckedmallocatggml-backend.cpp:407— runtime, not init.ggml_backend_graph_copy'sggml_hash_set_new(ggml-backend.cpp:2069) — not on safe-Rust paths today.GGML_MALLOC/GGML_CALLOCabort-on-OOM globally — would touch hundreds of callers, deferred to a separate effort.🤖 Generated with Claude Code