Fix HTTP-continuation hang: externalize table-function scan state as a cursor #1
Merged
…a cursor

The structure table functions (tables/words/pages) were TableFunctionGenerator[Args] with `process(params, state: None, out)` that did `out.emit(...ALL rows...); out.finish()` in a single tick. Over the stateless HTTP transport the framework wire-serializes the per-scan state after each tick and resumes by deserializing it, emitting at most one producer batch per response, so a position-less `state: None` generator restarts from row 0 on every HTTP resume and loops forever once the output exceeds one batch. `words` (hundreds–thousands of rows/PDF) and `tables` (one row/cell) are genuinely unbounded, so this is a real hang on the http leg. subprocess/unix hide it by keeping state in-process.

Convert all six functions to TableFunctionGenerator[Args, ScanState], mirroring vgi-search's ScanState pattern:
- Add ROWS_PER_TICK = 64 and ScanState(ArrowSerializableDataclass) with started/offset/rows_ipc (all plainly serializable), plus result_to_ipc / ipc_to_table / _stream_slice helpers.
- Add initial_state() -> ScanState(); refactor _emit_* into _build_* that return the full RecordBatch.

process() materializes the full batch into rows_ipc on the first tick, then emits a bounded ROWS_PER_TICK slice from offset, advancing offset and finishing when drained. NULL/empty-source early finish paths stay; rows/schema are byte-identical to before.

Validation:
- tests/harness.invoke_table_function gains serialize_state=True, round-tripping the state through serialize_to_bytes/deserialize_from_bytes between every tick (1000-tick guard), mimicking the HTTP wire.
- TestScanStateRoundTrip / TestCursorSurvivesContinuation assert identical rows/order, no dupes, termination, and bounded chunks (>= 2 batches, each <= ROWS_PER_TICK); this fails on the old emit-all code, which emits one 200-row batch.
- New manywords.pdf fixture (200 words > ROWS_PER_TICK) + structure.test paging case (count = 200, ordered head, distinct = 200); over http this only terminates if the cursor works.

All three transports (subprocess/http/unix) pass locally; CLAUDE.md documents the cursor and why.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The bug
The structure table functions (`tables`, `words`, `pages`) were `TableFunctionGenerator[Args]` with `process(params, state: None, out)` doing `out.emit(...ALL rows...); out.finish()` in a single tick.

Over the stateless HTTP transport the framework wire-serializes the per-scan state after every tick and resumes by deserializing it, emitting at most one producer batch per response. A position-less `state: None` generator therefore restarts from row 0 on every HTTP resume and loops forever once the output exceeds one batch. `words` (hundreds–thousands of rows/PDF) and `tables` (one row/cell) are genuinely unbounded, so this is a real hang on the http leg. subprocess/unix hide it by keeping state in-process.
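For reviewers who want to see the failure mode in isolation, here is a toy model of it. It is only a sketch: `positionless_tick`, `cursor_tick`, `run_stateless`, the `pickle` round-trip, and the 64-row batch cap are all illustrative assumptions, not the framework's transport code. The only thing it models is that nothing survives between ticks except the serialized state.

```python
import pickle

ROWS = list(range(200))  # stand-in for 200 output rows
BATCH = 64               # assumed per-response batch cap (illustrative)
TICK_GUARD = 1000        # stop the simulation instead of hanging


def positionless_tick(state):
    """Old shape: the state carries no position, so each resume starts at row 0."""
    return ROWS[:BATCH], state, False  # (batch, next_state, finished)


def cursor_tick(state):
    """New shape: the offset lives inside the serialized state."""
    offset = state["offset"]
    batch = ROWS[offset:offset + BATCH]
    next_state = {"offset": offset + len(batch)}
    return batch, next_state, next_state["offset"] >= len(ROWS)


def run_stateless(tick, state):
    """Drive ticks, round-tripping the state through bytes between them."""
    seen = []
    for ticks in range(1, TICK_GUARD + 1):
        batch, state, finished = tick(state)
        seen.extend(batch)
        if finished:
            return seen, ticks, True
        state = pickle.loads(pickle.dumps(state))  # simulated wire hop
    return seen, TICK_GUARD, False  # guard tripped: a real client would hang
```

In this model `run_stateless(cursor_tick, {"offset": 0})` drains all 200 rows and stops, while `run_stateless(positionless_tick, None)` keeps re-emitting the first batch until the guard trips.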
The fix
Convert all six functions to `TableFunctionGenerator[Args, ScanState]`, mirroring `vgi-search`'s `ScanState` pattern:

- `ROWS_PER_TICK = 64` + `ScanState(ArrowSerializableDataclass)` with `started`/`offset`/`rows_ipc` (all plainly serializable), plus `result_to_ipc` / `ipc_to_table` / `_stream_slice` helpers.
- `initial_state() -> ScanState()`; `_emit_*` refactored into `_build_*` that return the full RecordBatch.
- `process()` materializes the full batch into `rows_ipc` on the first tick, then emits a bounded `ROWS_PER_TICK` slice from `offset`, advancing it and finishing when drained. NULL/empty-source early finish paths stay; rows/schema are byte-identical to before.
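The tick logic described above can be sketched roughly as follows. This is a stdlib-only approximation, not the code in this PR: `ScanState` here is a plain dataclass rather than an `ArrowSerializableDataclass`, `rows_ipc` holds JSON bytes in place of Arrow IPC bytes, and `Out` and `build_rows` are hypothetical stand-ins for the framework's output sink and the `_build_*` helpers.

```python
import json
from dataclasses import dataclass

ROWS_PER_TICK = 64  # value taken from the PR description


@dataclass
class ScanState:
    """Stand-in for the PR's ScanState (plain dataclass, assumed shape)."""
    started: bool = False
    offset: int = 0
    rows_ipc: bytes = b""  # JSON bytes here; the PR stores Arrow IPC bytes


class Out:
    """Hypothetical output sink that records what each tick emitted."""

    def __init__(self):
        self.batches = []
        self.finished = False

    def emit(self, rows):
        self.batches.append(rows)

    def finish(self):
        self.finished = True


def build_rows(params):
    """Hypothetical _build_* helper: returns the full result for one scan."""
    return [{"word": f"w{i}"} for i in range(params["n"])]


def process(params, state, out):
    """One tick: materialize once, then emit a bounded slice from the cursor."""
    if not state.started:
        state.rows_ipc = json.dumps(build_rows(params)).encode()
        state.started = True
    rows = json.loads(state.rows_ipc)
    chunk = rows[state.offset:state.offset + ROWS_PER_TICK]
    if chunk:
        out.emit(chunk)
        state.offset += len(chunk)
    if state.offset >= len(rows):
        out.finish()  # drained (also covers an empty source)
```

Calling `process({"n": 200}, state, out)` repeatedly with the same `state` until `out.finished` is set yields batches of 64, 64, 64 and 8 rows. Storing the materialized rows in the state trades a larger serialized payload for never having to rebuild the result on resume.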
Validation (fail-old / pass-new)
- `tests/harness.invoke_table_function(..., serialize_state=True)` round-trips the state through `serialize_to_bytes`/`deserialize_from_bytes` between every tick (1000-tick guard), mimicking the HTTP wire.
- `TestScanStateRoundTrip` / `TestCursorSurvivesContinuation` assert identical rows/order, no dupes, termination, and bounded chunks (`>= 2` batches, each `<= ROWS_PER_TICK`). This fails on the old emit-all code (one 200-row batch) and passes on the cursor code.
- `manywords.pdf` fixture (200 words > `ROWS_PER_TICK`) + `structure.test` paging case (`count = 200`, ordered head, `distinct = 200`). Over http this only terminates if the cursor works.

All three transports (subprocess/http/unix) pass locally. `CLAUDE.md` documents the cursor and why.
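As a rough illustration of what the serialize-between-ticks check does, the sketch below reimplements the idea with the standard library only. Everything in it is an assumption for illustration: `Cursor`, `paged_words`, and this `invoke_table_function` signature are hypothetical, and the local `serialize_to_bytes` / `deserialize_from_bytes` functions use JSON rather than the framework's serialization.

```python
import json
from dataclasses import asdict, dataclass

ROWS_PER_TICK = 64
MAX_TICKS = 1000  # tick guard, as in the PR description


@dataclass
class Cursor:
    """Hypothetical minimal scan state: just a position."""
    offset: int = 0


def serialize_to_bytes(state):
    """Local stand-in for the framework's state serialization."""
    return json.dumps(asdict(state)).encode()


def deserialize_from_bytes(data):
    return Cursor(**json.loads(data))


def paged_words(state, total=200):
    """Toy tick: returns (batch, finished) and advances the cursor."""
    end = min(state.offset + ROWS_PER_TICK, total)
    batch = [f"w{i}" for i in range(state.offset, end)]
    state.offset = end
    return batch, state.offset >= total


def invoke_table_function(tick, state, serialize_state=False):
    """Collect batches, optionally round-tripping state between ticks."""
    batches = []
    for _ in range(MAX_TICKS):
        batch, finished = tick(state)
        batches.append(batch)
        if finished:
            return batches
        if serialize_state:
            state = deserialize_from_bytes(serialize_to_bytes(state))
    raise RuntimeError("tick guard exceeded: scan did not terminate")


batches = invoke_table_function(paged_words, Cursor(), serialize_state=True)
rows = [r for b in batches for r in b]
assert len(batches) >= 2 and all(len(b) <= ROWS_PER_TICK for b in batches)
assert rows == [f"w{i}" for i in range(200)]  # identical rows and order
assert len(set(rows)) == 200                  # no duplicates
```

The guard turns a would-be hang into a test failure, which is what lets the same assertions fail on an emit-all implementation and pass on a cursor-based one.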
🤖 Generated with Claude Code