Fix file descriptor leaks in wsds #63
Merged
Conversation
SUMMARY: pyarrow's RecordBatchFileWriter.close() doesn't release the fd; only GC does. Force gc.collect() after close and add explicit close() methods to WSShard/WSDataset so file handles are released between fused pipeline stages.

Fused stages run 5-7 stages back-to-back in a single Modal container. Each stage opens pyarrow memory-mapped files for reading shards and IPC writers for output. Before this fix, none of these file handles were explicitly released:

- RecordBatchFileWriter.close() flushes metadata but keeps the fd open until the Python object is garbage-collected. On container reuse, volume.reload() fails with "there are open files preventing the operation" because the previous invocation's writer fds are still held.
- WSDataset caches WSShard objects in _open_shards, each holding a memory-mapped file via pa.memory_map(). No close() method existed, so these accumulated across fused stages.
- _build_key_iter created a temporary WSDataset (with open shard handles) via a lazy generator that was never closed, leaking handles until GC.

Changes (a sketch of the pattern follows this list):

- ws_sink.py: gc.collect() after RecordBatchFileWriter.close() to force fd release; close the writer on the error path too; _build_key_iter eagerly collects keys and closes the temp dataset; catch KeyError for the race condition where a sibling column dir exists but has no committed shards yet
- ws_shard.py: add close() method; keep a reference to _source_file so we can explicitly close the pyarrow NativeFile (memory_map/OSFile)
- ws_dataset.py: add close() that closes all cached shards and linked datasets
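For reference, a minimal sketch of the pattern described above, not the actual diff: the class names (WSShard, WSDataset) and attributes (_open_shards, _source_file) come from this summary, while the constructors, the shard()/read_all() helpers, and the write_shard() function are assumed for illustration and will differ from the real ws_shard.py, ws_dataset.py, and ws_sink.py.

```python
import gc

import pyarrow as pa
import pyarrow.ipc as ipc


class WSShard:
    """One shard backed by a pyarrow NativeFile (memory_map or OSFile)."""

    def __init__(self, path, mmap=True):
        # Hold on to the NativeFile so close() can release the fd explicitly
        # instead of waiting for garbage collection.
        self._source_file = pa.memory_map(path) if mmap else pa.OSFile(path)
        self._reader = ipc.open_file(self._source_file)

    def read_all(self):
        return self._reader.read_all()

    def close(self):
        if not self._source_file.closed:
            self._source_file.close()


class WSDataset:
    """Caches WSShard objects in _open_shards; close() releases them all."""

    def __init__(self):
        self._open_shards = {}

    def shard(self, path):
        if path not in self._open_shards:
            self._open_shards[path] = WSShard(path)
        return self._open_shards[path]

    def close(self):
        for shard in self._open_shards.values():
            shard.close()
        self._open_shards.clear()


def write_shard(path, table):
    """Write an IPC file, then force the writer's fd to be dropped."""
    writer = ipc.new_file(path, table.schema)
    try:
        writer.write_table(table)
    finally:
        # Close on the error path as well; per the summary, close() flushes
        # metadata but the fd lives until the wrapper is garbage-collected,
        # so drop the reference and collect before the next fused stage.
        writer.close()
    del writer
    gc.collect()
```

Whether gc.collect() runs once per written shard or once per fused stage, and how linked datasets are walked in WSDataset.close(), are details of the real code that this sketch doesn't capture.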
rashishhume approved these changes on Apr 10, 2026
shahbaz-humeai approved these changes on Apr 10, 2026