Fix file descriptor leaks in wsds by tlebryk · Pull Request #63 · HumeAI/wsds

tlebryk · 2026-04-10T15:45:21Z

SUMMARY: pyarrow's RecordBatchFileWriter.close() doesn't release the fd — only GC does. Force gc.collect() after close and add explicit close() methods to WSShard/WSDataset so file handles are released.

RecordBatchFileWriter.close() flushes metadata but keeps the fd open until the Python object is garbage-collected. On container reuse, volume.reload() fails with "there are open files preventing the operation" because the previous invocation's writer fds are still held.
WSDataset caches WSShard objects in _open_shards, each holding a memory-mapped file via pa.memory_map().
_build_key_iter created a temporary WSDataset (with open shard handles) via a lazy generator that was never closed, leaking handles until GC.

Changes:

ws_sink.py: gc.collect() after RecordBatchFileWriter.close() to force fd release; close writer on error path too; _build_key_iter eagerly collects keys and closes the temp dataset; catch KeyError for race condition where sibling column dir exists but has no committed shards yet
ws_shard.py: Add close() method; keep reference to _source_file so we can explicitly close the pyarrow NativeFile (memory_map/OSFile)
ws_dataset.py: Add close() that closes all cached shards and linked datasets

SUMMARY: pyarrow's RecordBatchFileWriter.close() doesn't release the fd — only GC does. Force gc.collect() after close and add explicit close() methods to WSShard/WSDataset so file handles are released between fused pipeline stages. Fused stages run 5-7 stages back-to-back in a single Modal container. Each stage opens pyarrow memory-mapped files for reading shards and IPC writers for output. Before this fix, none of these file handles were explicitly released: - RecordBatchFileWriter.close() flushes metadata but keeps the fd open until the Python object is garbage-collected. On container reuse, volume.reload() fails with "there are open files preventing the operation" because the previous invocation's writer fds are still held. - WSDataset caches WSShard objects in _open_shards, each holding a memory-mapped file via pa.memory_map(). No close() method existed, so these accumulated across fused stages. - _build_key_iter created a temporary WSDataset (with open shard handles) via a lazy generator that was never closed, leaking handles until GC. Changes: - ws_sink.py: gc.collect() after RecordBatchFileWriter.close() to force fd release; close writer on error path too; _build_key_iter eagerly collects keys and closes the temp dataset; catch KeyError for race condition where sibling column dir exists but has no committed shards yet - ws_shard.py: Add close() method; keep reference to _source_file so we can explicitly close the pyarrow NativeFile (memory_map/OSFile) - ws_dataset.py: Add close() that closes all cached shards and linked datasets

rashishhume approved these changes Apr 10, 2026

View reviewed changes

shahbaz-humeai approved these changes Apr 10, 2026

View reviewed changes

rashishhume merged commit 8b173cb into main Apr 10, 2026

jpc deleted the fix/wsds-fd-leaks branch April 20, 2026 08:40

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix file descriptor leaks in wsds #63

Fix file descriptor leaks in wsds #63
rashishhume merged 1 commit into
mainfrom
fix/wsds-fd-leaks

tlebryk commented Apr 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

tlebryk commented Apr 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants