Skip to content

Fix file descriptor leaks in wsds #63

Merged
rashishhume merged 1 commit into
mainfrom
fix/wsds-fd-leaks
Apr 10, 2026
Merged

Fix file descriptor leaks in wsds #63
rashishhume merged 1 commit into
mainfrom
fix/wsds-fd-leaks

Conversation

@tlebryk
Copy link
Copy Markdown
Contributor

@tlebryk tlebryk commented Apr 10, 2026

SUMMARY: pyarrow's RecordBatchFileWriter.close() doesn't release the fd — only GC does. Force gc.collect() after close and add explicit close() methods to WSShard/WSDataset so file handles are released.

  • RecordBatchFileWriter.close() flushes metadata but keeps the fd open until the Python object is garbage-collected. On container reuse, volume.reload() fails with "there are open files preventing the operation" because the previous invocation's writer fds are still held.

  • WSDataset caches WSShard objects in _open_shards, each holding a memory-mapped file via pa.memory_map().

  • _build_key_iter created a temporary WSDataset (with open shard handles) via a lazy generator that was never closed, leaking handles until GC.

Changes:

  • ws_sink.py: gc.collect() after RecordBatchFileWriter.close() to force fd release; close writer on error path too; _build_key_iter eagerly collects keys and closes the temp dataset; catch KeyError for race condition where sibling column dir exists but has no committed shards yet
  • ws_shard.py: Add close() method; keep reference to _source_file so we can explicitly close the pyarrow NativeFile (memory_map/OSFile)
  • ws_dataset.py: Add close() that closes all cached shards and linked datasets

SUMMARY: pyarrow's RecordBatchFileWriter.close() doesn't release the fd — only
GC does. Force gc.collect() after close and add explicit close() methods to
WSShard/WSDataset so file handles are released between fused pipeline stages.

Fused stages run 5-7 stages back-to-back in a single Modal container. Each
stage opens pyarrow memory-mapped files for reading shards and IPC writers for
output. Before this fix, none of these file handles were explicitly released:

- RecordBatchFileWriter.close() flushes metadata but keeps the fd open until
  the Python object is garbage-collected. On container reuse, volume.reload()
  fails with "there are open files preventing the operation" because the
  previous invocation's writer fds are still held.

- WSDataset caches WSShard objects in _open_shards, each holding a
  memory-mapped file via pa.memory_map(). No close() method existed, so
  these accumulated across fused stages.

- _build_key_iter created a temporary WSDataset (with open shard handles) via
  a lazy generator that was never closed, leaking handles until GC.

Changes:
- ws_sink.py: gc.collect() after RecordBatchFileWriter.close() to force fd
  release; close writer on error path too; _build_key_iter eagerly collects
  keys and closes the temp dataset; catch KeyError for race condition where
  sibling column dir exists but has no committed shards yet
- ws_shard.py: Add close() method; keep reference to _source_file so we can
  explicitly close the pyarrow NativeFile (memory_map/OSFile)
- ws_dataset.py: Add close() that closes all cached shards and linked datasets
@rashishhume rashishhume merged commit 8b173cb into main Apr 10, 2026
@jpc jpc deleted the fix/wsds-fd-leaks branch April 20, 2026 08:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants