Skip to content

feat(accessors): issue #2 wrapper API + 27-round safety hardening pass#8

Merged
uqio merged 1 commit into
mainfrom
feat/accessors
May 9, 2026
Merged

feat(accessors): issue #2 wrapper API + 27-round safety hardening pass#8
uqio merged 1 commit into
mainfrom
feat/accessors

Conversation

@uqio
Copy link
Copy Markdown
Collaborator

@uqio uqio commented May 8, 2026

Summary

Implements the wrapper additions tracked in issue #2 (token / vocab / language / model-dim accessors, state-aware timings, fallible setters), then expanded — over 27 rounds of adversarial review — into a comprehensive safety pass that hardens the FFI seam against panic, leak, double-free, OOM-abort, dangling-alias, and integer-boundary classes reachable from safe Rust through Context::new / Context::create_state / State::full.

Rust-side surface

  • New accessors: Context::{token_to_bytes, tokenize, tokenize_one, model_dims, token_for_lang}, special-token getters, Lang::full_name (owned SmolStr), top-level version / lang_max_id / lang_id_for.
  • State-bound timings: State::{print_timings, reset_timings, n_mel_frames} (state-bound only — context-shared totals dropped to avoid races and stale-baseline output).
  • Fallible setters with try_reserve_exact on every public copying input: Params::{set_language, set_initial_prompt, set_tokens}.
  • New error variants: UnknownLanguage, LanguageNotSupportedByModel, InvalidDuration, AllocationFailed, plus ContextLoad { code: Option<i32> } and StateInit { code: Option<i32> } carrying shim sentinels without OOM-path heap allocation.
  • Bounded error payloads (≤ 256 bytes) and static-string reasons on the OOM-recoverable path.

Submodule patches (Findit-AI/whisper.cpp@rust)

37 patches, pinned via REQUIRED_MARKERS in whispercpp-sys/build.rs. A submodule rebase that drops any sentinel fails patch-verification at build time.

  • C++ exception safety: RAII entry/exit for whisper_init_state / whisper_init_from_file_with_params_no_state; per-pointer ggml_context_ptr / ggml_backend_buffer_ptr guards in whisper_model_load, whisper_vad_load, and whisper_backend_init (closes leak windows around ctx_map[buft] = ctx, model.ctxs.emplace_back, raw-local tensor-prep contexts, and backend-handle accumulation).
  • OOM-safe allocators on the create-state path: every GGML_ASSERT(ptr != NULL) and GGML_ABORT reachable from Context::create_statewhisper_sched_graph_initggml_backend_sched_new / ggml_gallocr_new_n / ggml_gallocr_reserve_n_impl / ggml_dyn_tallocr_* converted to return-NULL/false with cleanup. File-local non-aborting whispercpp_hash_set_new_or_null variants replace upstream's GGML_MALLOC-routed ggml_hash_set_new calls.
  • Transactional state changes on the recoverable path: node_allocs, leaf_allocs, vbuffer realloc (with alias-aware pre-pass), and hash_set + hash_values all use atomic-commit so an alloc failure leaves the gallocr in a coherent retryable state instead of poisoned (NULL with stale capacity, dangling pointers in aliased slots, etc.).
  • Validation: vocab n_vocab bounds + post-synthesis size check, tensor header validation, special-token bounds, hparams head divisibility, lang_str null guard, sched abort callback wiring, model-bound language preflight (multilingual + non-multilingual + auto-detect filter against vocab.num_languages()).
  • Integer-boundary: whisper_tokenize, whisper_token_count, and the whispercpp_* shim variants reject res.size() > INT_MAX with INT_MIN instead of UB.
  • DoS / range guards: Params::set_duration_ms preflight (avoids a seek_end slip driving 71k zero-padded decode windows); auto-detect candidate-set filter (avoids out-of-range language ids aliasing onto task tokens).
  • Sentinel hygiene: whispercpp_* shims with disjoint sentinels (INT_MIN for tokenize-overflow, WHISPERCPP_ERR_* ≤ -100 for caught exceptions); va_copy + truncation in whisper_log_internal to close va-list reuse and heap-throw UB.

Audit framework

whispercpp/src/safety_audit.rs is the canonical entry point. It documents:

  • The 14+ safety axes (throw-safety, sync, allocation, lifetime, linkage, sentinel collisions, log pollution, error-payload bounds, race / classification, model-bound semantics, metadata consistency, DoS amplification, default-state mirror, state vs ctx scope, integer boundary, abort-bypass of shim, invariant preservation).
  • A per-method × per-axis ✓ matrix for every public method this PR touches.
  • A per-round narrative (1–27) explaining what each review round caught, why it slipped past prior rounds, and the structural fix or new axis it forced.
  • A pre-emptive list of out-of-scope sweep targets (VAD (int) size() casts, ggml_backend_tensor_copy slow-path malloc, the broader GGML_MALLOC / GGML_CALLOC abort-on-OOM call graph, ggml_backend_graph_copy's aborting hash_set use).

Test plan

  • 47/47 unit tests pass; clippy --all-targets -D warnings clean.
  • DTW × flash-attn rejection.
  • Params setters fully exercised (length caps, NUL rejection, default mirroring, accessor round-trips, language overlong-payload bound).
  • State-aware timing shim C linkage pinned.
  • Token::from_raw / DTW sentinel projection / language preflight.
  • Fault-injection coverage for OOM paths — requires LD_PRELOAD-style allocator interposition or a custom mem_buffer threading; logged as the next sweep target.
  • Real-model integration tests still rely on the parity harness in whispery.

Out-of-scope follow-ups (audit-tracked)

  • VAD (int) speeches.size() casts at whisper.cpp:6264+ — not yet exposed.
  • ggml_backend_tensor_copy slow-path unchecked malloc at ggml-backend.cpp:407 — runtime, not init.
  • ggml_backend_graph_copy's ggml_hash_set_new (ggml-backend.cpp:2069) — not on safe-Rust paths today.
  • GGML_MALLOC / GGML_CALLOC abort-on-OOM globally — would touch hundreds of callers, deferred to a separate effort.

🤖 Generated with Claude Code

@al8n al8n requested a review from Copilot May 8, 2026 23:54
@uqio uqio changed the title feat: accessors feat(accessors): issue #2 wrapper API + 27-round safety hardening pass May 8, 2026
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR expands the safe Rust wrapper’s public “accessor” surface and hardens several FFI-facing behaviors (language handling, duration bounds, tokenization, timings) with additional validation and typed errors, plus a comments-only safety audit module to document the review matrix.

Changes:

  • Add new public accessors/introspection APIs (e.g., version(), lang_max_id(), lang_id_for(), Lang::full_name(), Context::{token_to_bytes, token_* accessors, tokenize, tokenize_one, model_dims}, State::{n_mel_frames, print_timings, reset_timings}).
  • Add State::full preflights for model-bound language hints and offset_ms/duration_ms audio-range validation (new error variants).
  • Extend the sys shim surface (no-log tokenize + state-aware timings + lang-id shim) and enforce required patch markers in whispercpp-sys/build.rs; add a large comments-only safety_audit module.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
whispercpp/src/state.rs Adds State::full preflights (language/model + duration bounds) and new State accessors for mel frames and timings.
whispercpp/src/params.rs Strengthens setters (set_language, set_initial_prompt, set_tokens) with bounds/alloc-safety and adds internal accessors used by State::full.
whispercpp/src/context.rs Adds token/vocab accessors, tokenization APIs, model-dims introspection, and improves constructor error reporting.
whispercpp/src/error.rs Expands WhisperError with new variants and adds sentinel code fields for richer diagnostics.
whispercpp/src/lang.rs Adds Lang::full_name plus lang_max_id/lang_id_for top-level APIs and tests.
whispercpp/src/lib.rs Wires in safety_audit, re-exports new language helpers, and adds version() with link-guard tests.
whispercpp/src/safety_audit.rs Adds a comments-only safety audit matrix documenting the API surface and required native patches.
whispercpp-sys/whispercpp_shim.h Declares new shims (no-log tokenize, token-count probe, state-aware timings, lang-id shim).
whispercpp-sys/whispercpp_shim.cpp Implements whispercpp_lang_id and updates shim notes.
whispercpp-sys/build.rs Adds required patch markers and allowlists additional no-throw accessors/shims for bindgen.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread whispercpp/src/params.rs Outdated
Comment thread whispercpp/src/params.rs Outdated
Comment thread whispercpp/src/context.rs Outdated
Comment thread whispercpp/src/context.rs Outdated
Comment thread whispercpp/src/lang.rs Outdated
Comment thread whispercpp/src/params.rs
Comment on lines +969 to 977
///
/// Returns [`WhisperError::AllocationFailed`] if the copy
/// buffer cannot be reserved (`Vec::try_reserve_exact`
/// failure). **Panic-free** — `slice::to_vec` would route
/// through the abort-on-OOM global allocator; the fallible
/// reserve makes a multi-gigabyte token vector under memory
/// pressure a typed error instead of a process abort.
pub fn set_tokens(&mut self, tokens: &[i32]) -> WhisperResult<&mut Self> {
if tokens.is_empty() {
Comment thread whispercpp/src/error.rs
Comment on lines 34 to 56
#[error("failed to load model from {path}: {reason}")]
ContextLoad {
/// Path the caller passed in. Stored so logs can pinpoint
/// which model file failed.
path: SmolStr,
/// Any extra context whisper.cpp surfaced (often empty —
/// the C API just returns NULL).
/// Any extra context whisper.cpp surfaced. Set to one of
/// a small set of `&'static str` values via
/// `SmolStr::new_static`, never `format!` — this error
/// is reached on the `bad_alloc` / `system_error` path,
/// and the `format!` macro plus
/// `SmolStr::new(formatted_string)` would each
/// heap-allocate while the global allocator is already
/// under pressure.
reason: SmolStr,
/// Constructor exception sentinel from the C++ shim.
/// `Some(WHISPERCPP_ERR_*)` when the shim caught a
/// throw; `None` when upstream returned NULL via its
/// own bool-failure path. Kept as a separate `Option<i32>`
/// (instead of formatted into `reason`) so error
/// construction allocates nothing on the
/// caught-exception branch.
code: Option<i32>,
},
Implements the wrapper additions tracked in issue #2 (token /
vocab / language / model-dim accessors, state-aware timings,
fallible setters) and hardens the FFI seam against panic,
leak, double-free, OOM-abort, dangling-alias, and integer-
boundary classes reachable from safe Rust through
`Context::new` / `Context::create_state` / `State::full`.

Rust-side surface
-----------------

* New accessors: `Context::{token_to_bytes, tokenize,
  tokenize_one, model_dims, token_for_lang}`, special-token
  getters, `Lang::full_name` (owned `SmolStr`), top-level
  `version` / `lang_max_id` / `lang_id_for`.
* State-bound timings: `State::{print_timings, reset_timings,
  n_mel_frames}` (state-bound only — context-shared totals
  dropped to avoid races and stale-baseline output).
* Fallible setters with `try_reserve_exact` on every public
  copying input: `Params::{set_language, set_initial_prompt,
  set_tokens}`.
* New error variants: `UnknownLanguage`,
  `LanguageNotSupportedByModel`, `InvalidDuration`,
  `AllocationFailed`, plus `ContextLoad { code: Option<i32> }`
  and `StateInit { code: Option<i32> }` carrying shim
  sentinels without OOM-path heap allocation.
* Bounded error payloads (≤ 256 bytes) and static-string
  reasons on the OOM-recoverable path.

Submodule patches (Findit-AI/whisper.cpp@rust)
----------------------------------------------

37 patches, pinned via `REQUIRED_MARKERS` in
`whispercpp-sys/build.rs`. A submodule rebase that drops
any sentinel fails patch-verification at build time.

* C++ exception safety: RAII entry/exit for
  `whisper_init_state` and
  `whisper_init_from_file_with_params_no_state`; per-pointer
  `ggml_context_ptr` / `ggml_backend_buffer_ptr` guards in
  `whisper_model_load`, `whisper_vad_load`, and
  `whisper_backend_init` (closes leak windows around
  `ctx_map[buft] = ctx`, `model.ctxs.emplace_back`, raw-local
  tensor-prep contexts, and backend-handle accumulation).
* OOM-safe allocators on the create-state path: every
  `GGML_ASSERT(ptr != NULL)` and `GGML_ABORT` reachable from
  `Context::create_state` → `whisper_sched_graph_init` →
  `ggml_backend_sched_new` / `ggml_gallocr_new_n` /
  `ggml_gallocr_reserve_n_impl` / `ggml_dyn_tallocr_*`
  converted to return-NULL/false with cleanup. File-local
  non-aborting `whispercpp_hash_set_new_or_null` variants
  replace upstream's `GGML_MALLOC`-routed
  `ggml_hash_set_new` calls.
* Transactional state changes on the recoverable path:
  `node_allocs`, `leaf_allocs`, `vbuffer` realloc (with
  alias-aware pre-pass), and `hash_set` + `hash_values` all
  use atomic-commit so an alloc failure leaves the gallocr
  in a coherent retryable state instead of poisoned (NULL
  with stale capacity, dangling pointers in aliased slots,
  etc.).
* Validation: vocab `n_vocab` bounds + post-synthesis size
  check, tensor header validation, special-token bounds,
  hparams head divisibility, `lang_str` null guard, sched
  abort callback wiring, model-bound language preflight
  (multilingual + non-multilingual + auto-detect filter
  against `vocab.num_languages()`).
* Integer-boundary: `whisper_tokenize`, `whisper_token_count`,
  and the `whispercpp_*` shim variants reject
  `res.size() > INT_MAX` with `INT_MIN` instead of UB.
* DoS / range guards: `Params::set_duration_ms` preflight
  (avoids a `seek_end` slip driving 71k zero-padded decode
  windows); auto-detect candidate-set filter (avoids
  out-of-range language ids aliasing onto task tokens).
* Sentinel hygiene: `whispercpp_*` shims with disjoint
  sentinels (`INT_MIN` for tokenize-overflow,
  `WHISPERCPP_ERR_*` ≤ -100 for caught exceptions);
  `va_copy` + truncation in `whisper_log_internal` to
  close va-list reuse and heap-throw UB.

Audit framework
---------------

`whispercpp/src/safety_audit.rs` documents the 14+ safety
axes (throw-safety, sync, allocation, lifetime, linkage,
sentinel collisions, log pollution, error-payload bounds,
race / classification, model-bound semantics, metadata
consistency, DoS amplification, default-state mirror, state
vs ctx scope, integer boundary, abort-bypass of shim,
invariant preservation), the per-method × per-axis ✓
matrix, and the native-side patches list. Pre-emptive
out-of-scope sweep targets (VAD `(int) size()` casts,
`ggml_backend_tensor_copy` slow-path malloc, the broader
`GGML_MALLOC` / `GGML_CALLOC` abort-on-OOM call graph,
`ggml_backend_graph_copy`'s aborting hash_set use) are
flagged in the doc rather than buried in commit footers.

Test plan
---------

47/47 unit tests pass; clippy `--all-targets -D warnings`
clean. DTW × flash-attn rejection covered. `Params`
setters fully exercised (length caps, NUL rejection,
default mirroring, accessor round-trips, language
overlong-payload bound). State-aware timing shim C
linkage pinned. `Token::from_raw` / DTW sentinel
projection / language preflight covered.

Fault-injection coverage for OOM paths (LD_PRELOAD-style
allocator interposition or custom `mem_buffer` threading)
is logged as the next sweep target. Real-model
integration tests still rely on the parity harness in
`whispery`.
@uqio uqio force-pushed the feat/accessors branch from b404767 to 39289b8 Compare May 9, 2026 00:44
@uqio uqio merged commit 5dcf84e into main May 9, 2026
15 checks passed
@uqio uqio deleted the feat/accessors branch May 9, 2026 00:52
uqio added a commit that referenced this pull request May 9, 2026
Documents the DTW timestamp surface (#7), the issue-#2
accessor surface (#8), the safety hardening pass against
panic / leak / OOM-abort / integer-boundary classes, the
breaking changes since 0.1.0, and the 37 native-side
patches pinned via build-time marker verification.

Format follows Keep a Changelog 1.1.0; project follows
SemVer 2.0.
@uqio uqio linked an issue May 9, 2026 that may be closed by this pull request
14 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add small accessors to the safe wrapper (tracking)

2 participants