feat(state): segments / tokens iterators (#3)#9
Merged
Merged
Conversation
Adds two iterator entry points to the safe API and the
standard adapter trait impls so they compose with
`flat_map` / `rev` / `for ... in &state`.
Surface
-------
```rust
state.segments_iter() // Segments<'_>
seg.tokens_iter() // Tokens<'_>
for seg in &state { ... } // IntoIterator for &State
for tok in seg { ... } // IntoIterator for Segment<'a>
for tok in &seg { ... } // IntoIterator for &Segment<'a>
state.segments_iter().rev() // DoubleEndedIterator
seg.tokens_iter().rev()
state.segments_iter().flat_map(|s| s.tokens_iter()) // flat token stream
```
Both iterators implement `Iterator + ExactSizeIterator
+ DoubleEndedIterator + FusedIterator`. Yielded items:
`Segments::Item = Segment<'a>` (borrows from the source
`&'a State`); `Tokens::Item = Token` (owned, value-typed
snapshot, no further borrow).
Lifetime shape
--------------
* `Segments<'a>` borrows `&'a State`. Multiple iterators
alive simultaneously is sound — the underlying
`whisper_state` is immutable for the borrow's
duration (`State::full` requires `&mut self`, ruled
out while any `&self` iterator exists).
* `Tokens<'state>` owns a `Copy` of the `Segment` (which
is itself a thin index + pointer projection,
derive-`Copy`). This makes
`state.segments_iter().flat_map(|seg| seg.tokens_iter())`
compile — the inner iterator does not borrow the
closure-local `seg` value. `'state` still ties yielded
item pointer projections to the parent `State`.
* Auto-traits: `!Send + !Sync` on both, inherited from
`&State` (State is `!Sync`) / `NonNull<whisper_state>`
(foreign C type, no impl). Matches the existing State
threading model.
Performance
-----------
`Segments::next` and `Tokens::next` inline the
construction / pointer projection rather than calling
back through `State::segment(i)` / `Segment::token(j)`.
Those accessors do their own `n_segments()` /
`n_tokens()` FFI bounds-check, but the iterator already
captured `end` from the same call at construction —
and the borrow chain pins the underlying state. One
FFI call per yielded item saved on `Tokens` (which
dominates: hundreds to low thousands of tokens per
state).
Tests
-----
14 new unit tests in `whispercpp/src/state.rs`'s
`#[cfg(test)] mod tests`. Two `#[cfg(test)] pub(crate)`
test fixtures make the iterator tests possible without
a model file:
* `Context::dangling_for_test()` — `unsafe fn` whose
contract requires the caller to `mem::forget` the
result (or any `Arc<Context>` derived from it) before
its `Drop` runs. The unsafe is at the producer; the
only call site wraps it in `unsafe { ... }` with a
SAFETY comment explaining the Arc-refcount-leak
invariant.
* `State::poisoned_for_test(ctx: Arc<Context>)` — safe
(the Drop impl already short-circuits for poisoned
states; no FFI runs).
Runtime exercise of `Tokens` requires a real model
(the poisoned fixture yields zero segments, blocking
`Segment` construction). Compile-time bound assertions
cover the type-system contract instead.
Audit
-----
`whispercpp/src/safety_audit.rs` gains a
`segments_iter` / `tokens_iter` row in the per-method
coverage matrix walking all ten safety axes. Inlined
FFI projection (`whisper_full_get_token_data_from_state`)
is documented under axis #1 (throw); the
borrow-chain-pinned bounds-check redundancy is the
soundness argument.
Verification
------------
* `cargo build -p whispercpp` clean (debug + release).
* `cargo test -p whispercpp --lib` — 62 passed (was
48; +14 new).
* `cargo clippy -p whispercpp -p whispercpp-sys
--all-targets -- -D warnings` clean.
* `cargo doc -p whispercpp --no-deps` clean.
Closes #3.
Cargo-compatible patch release. Only `whispercpp` bumps — `whispercpp-sys` stays at 0.2.0 (no native changes since 0.2.0). CHANGELOG documents the iterator additions (`Segments` / `Tokens` types, `IntoIterator` impls, `DoubleEndedIterator`), the per-yielded-item FFI savings, and the new test fixtures. No breaking changes from 0.2.0.
There was a problem hiding this comment.
Pull request overview
Adds ergonomic, adapter-friendly iteration over whisper.cpp decode outputs by introducing segment and token iterator entry points on the safe wrapper types, plus standard IntoIterator impls to enable idiomatic for ... in loops.
Changes:
- Add
State::segments_iter()andSegment::tokens_iter()returningSegments<'_>/Tokens<'_>withExactSizeIterator + DoubleEndedIterator + FusedIterator. - Implement
IntoIteratorfor&State,Segment<'a>, and&Segment<'a>to compose with standard iterator adapters. - Add test-only fixtures (
Context::dangling_for_test,State::poisoned_for_test) and unit tests validating iterator shape/composability; extend the safety audit.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
whispercpp/src/state.rs |
Introduces Segments/Tokens iterators, segments_iter/tokens_iter, IntoIterator impls, plus iterator-focused unit tests and test fixtures. |
whispercpp/src/context.rs |
Adds a #[cfg(test)] unsafe dangling_for_test() constructor used by iterator unit tests without requiring a real model. |
whispercpp/src/lib.rs |
Re-exports the new iterator types (Segments, Tokens) as part of the public surface. |
whispercpp/src/safety_audit.rs |
Documents safety/audit coverage for the new iterator methods and their IntoIterator delegations. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Miri's default leak checker correctly flags the `Arc<Context>` allocation that `poisoned_state_for_test` deliberately leaks (one ArcInner per fixture call, forgotten so `Context::drop` can never call `whisper_free` on the dangling pointer). Two reasons the existing `context.rs::tests::*` tests that use the same `mem::forget` pattern don't trip this: they construct a bare-stack `Context` and forget the value (no heap allocation), while State's `ctx: Arc<Context>` field forces the iterator fixture through `Arc::new` — and the ArcInner is what Miri reports. Tag the 12 fixture-using tests with `#[cfg_attr(miri, ignore = "...")]`. Pure compile-time tests (`into_iter_for_segment_compiles`, `tokens_iter_double_ended_compiles`) stay Miri-eligible since they only touch `PhantomData` + trait-bound functions and allocate nothing. Trade-off: Miri loses runtime coverage of the iterator counters on poisoned state (n_segments == 0; loop body never runs under Miri or otherwise). The load-bearing assertions are at compile time (`Iterator<Item = …>` bounds, lifetime threading for flat_map composition, ExactSizeIterator / DoubleEndedIterator impls), and those still get checked by the regular `cargo test` job. Doc note added to `poisoned_state_for_test` explaining the Miri trade-off so future test additions know to copy the cfg_attr pattern.
Three findings from the Copilot review:
* Replace the helper-pair fixture (`poisoned_state_for_test`
+ `forget_poisoned_state`) with a `PoisonedStateFixture`
RAII guard. The previous design required every test to
call `forget_poisoned_state(state)` for cleanup — fragile
against panic / early-return mid-test. The guard holds a
`ManuallyDrop<Arc<Context>>` clone in addition to the
`State`, so the leak invariant
("Context refcount stays ≥ 1 forever") survives test
panics. Tests now read `let fixture =
PoisonedStateFixture::new(); let state = &*fixture;` with
no explicit cleanup. Doc on the struct spells out the
Miri trade-off (still a leak by design; tests tagged
`#[cfg_attr(miri, ignore)]`).
* Fix the `Segments::next` comment that claimed
`self.state.ptr` is `pub(super)` — the field is
actually private and only accessible because
`Segments` lives in the same module. Comment now
reflects the real visibility.
* Clarify the `Context::dangling_for_test` doc rationale
for `unsafe fn` over `ManuallyDrop<Self>`. The previous
text implied `Arc<ManuallyDrop<…>>` would still drop the
inner `Context` via Arc's refcount mechanism — that's
wrong (`ManuallyDrop`'s drop is a no-op, so the UB
would in fact be prevented). The real reason is
API-shape: `State::poisoned_for_test` and the
production code expect `Arc<Context>` not
`Arc<ManuallyDrop<Context>>`. The
`PoisonedStateFixture` test guard handles the leak
invariant via composition (separate `ManuallyDrop`
field) instead.
62/62 tests pass under standard `cargo test`; clippy +
doc clean.
…ANGELOG link Two findings from the second Copilot review on PR #9. * `State::poisoned_for_test` was a `#[cfg(test)] pub(crate) fn` returning a `State` from a passed-in `Arc<Context>`. Since `Context::dangling_for_test` is `unsafe fn` but its resulting `Arc<Context>` is a regular safe value, this constructor effectively LIFTED the unsafe contract back into safe code: a caller could chain `Arc::new(unsafe { dangling_for_test() })` → `State::poisoned_for_test(arc)` → `drop(state)` and reach `Context::drop`'s `whisper_free(NonNull::dangling())` through what looked like a safe API. Deleted the constructor entirely. `PoisonedStateFixture::new()` in the test module now constructs `State { ptr: None, ctx }` directly via private-field access (the test submodule is a child of `state.rs` and inherits the parent's privacy boundary). Collapses the soundness chain into a single auditable point; no crate-visible safe API takes a potentially-dangling Arc<Context> as input. * CHANGELOG.md 0.2.1 entry had `[#9]` defined as a reference-style link but never used inline. The release is about closing issue #3 via PR #9 — both should be cited. Added inline references in the section header ("Closes [#3] (delivered via PR [#9])") and added the matching `[#3]` link def. Also refreshed the Internal bullet to mention `PoisonedStateFixture` (replaced the stale `State::poisoned_for_test` reference now that the function is gone). 62/62 tests pass; clippy + cargo doc clean.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #3.
Adds two ergonomic iterator entry points to the safe API and the standard adapter trait impls so they compose with
flat_map/rev/for ... in &state.Surface
Both iterators implement
Iterator + ExactSizeIterator + DoubleEndedIterator + FusedIterator. Yielded items:Segments::Item = Segment<'a>(borrows from the source&'a State);Tokens::Item = Token(owned, value-typed snapshot, no further borrow).Lifetime shape
Segments<'a>borrows&'a State. Multiple iterators alive simultaneously is sound — the underlyingwhisper_stateis immutable for the borrow's duration (State::fullrequires&mut self, ruled out by the borrow checker while any&selfiterator exists).Tokens<'state>owns aCopyof theSegment(which is itself a thin index + pointer projection, derive-Copy). This makessegvalue. The'statelifetime still ties yielded item pointer projections to the parentState.!Send + !Syncon both, inherited from&State(State is!Sync) /NonNull<whisper_state>(foreign C type with no impl). Matches the existing State threading model.Performance
Segments::nextandTokens::nextinline the construction / pointer projection rather than calling back throughState::segment(i)/Segment::token(j). Those accessors do their ownn_segments()/n_tokens()FFI bounds-check, but the iterator already capturedendfrom the same call at construction — and the borrow chain pins the underlying state, so the bounds-check is redundant. One FFI call per yielded item saved onTokens(which dominates: hundreds to low thousands of tokens per state).Tests
13 new unit tests in
whispercpp/src/state.rs's#[cfg(test)] mod tests:segments_iter_empty_state_yields_zero_itemssegments_iter_count_matches_n_segmentssegments_iter_exact_size_len_matches_countsegments_iter_size_hint_is_exactsegments_iter_fused_after_exhaustionmultiple_segments_iter_alive_concurrentlysegments_iter_composes_with_adaptersnested_segments_and_tokens_iter_compilesiterator_type_bounds_are_correcttokens_iter_composes_with_flat_mapinto_iter_for_state_ref_yields_segmentsinto_iter_for_segment_compilessegments_iter_double_ended_compiles_and_empty_yields_nonetokens_iter_double_ended_compilesTwo
#[cfg(test)] pub(crate)test fixtures make this possible without a model file:Context::dangling_for_test()—unsafe fn. Returns aContextwithNonNull::dangling(). The unsafe contract: the caller must ensure the value (or anyArc<Context>derived from it) iscore::mem::forget'd before itsDropruns (which would otherwise callwhisper_freeon the dangling pointer).State::poisoned_for_test(ctx: Arc<Context>)— safe. Returns aStatewithptr: None(theDropimpl already short-circuits for poisoned states; no FFI runs).The test fixture (
poisoned_state_for_test) wraps the unsafe call in a singleunsafeblock with a SAFETY comment explaining the Arc-refcount-leak invariant:Arc::new+Arc::clonebrings refcount to 2, thenmem::forgeton the original handle keeps it at ≥1 forever, soContext::dropis unreachable.The runtime exercise of
Tokensrequires a real model (the poisoned fixture yields zero segments, so we can't construct aSegment). The compile-time bound assertions (assert_token_iter,assert_dei) cover the type-system contract instead — every adapter the docs claim should compile is type-checked at the call site.Audit
whispercpp/src/safety_audit.rsgains asegments_iter/tokens_iterrow in the per-method coverage matrix, walking all ten safety axes. Highlights:whisper_full_get_token_data_from_state) is a pure C accessor; bounds verified at iterator construction.Stateis!Sync;&selfpermits multiple iterators alive simultaneously, sound because the buffers are immutable for the borrow's duration.Tokens<'state>owns a copiedSegment<'state>so adapter composition works;'stateties pointer projections to the parent State.next/endindex counters; the converged-cursor case (next == end) is rejected at the top of every direction's call.Verification
cargo build -p whispercppclean (debug + release).cargo test -p whispercpp --lib— 62 passed (was 48, +14 new across the four commits).cargo clippy -p whispercpp -p whispercpp-sys --all-targets -- -D warningsclean.cargo doc -p whispercpp --no-depsclean.🤖 Generated with Claude Code