Parallel access to b-tree and data via cat_ranges and threading #218
bnlawrence wants to merge 26 commits into main from
Conversation
These results show the benefit of the parallelism for data reading, though they suggest one would not make the parallel b-tree read the default. Further investigation is necessary. Note that the POSIX results are not believable, as they reflect memory caching by the OS, as discussed here. Note also that the ssh results use `p5rem`, not `fsspec`. To what extent server-side caching (for http and s3) is involved is not clear.
@bnlawrence I fixed your ruff issues so you have a clean CI and can focus on the functional failures, if any. You can always fix ruff issues to a first pass by running
Codecov Report ❌ Patch coverage is

Additional details and impacted files

@@           Coverage Diff            @@
##             main     #218     +/-  ##
==========================================
+ Coverage   77.62%   78.39%   +0.77%
==========================================
  Files          15       15
  Lines        3128     3300    +172
  Branches      499      526     +27
==========================================
+ Hits         2428     2587    +159
- Misses        573      578      +5
- Partials      127      135      +8

View full report in Codecov by Sentry.
Pull request overview
Adds internal parallelism to pyfive’s chunk and B-tree access paths to reduce latency (especially for remote/object-store reads) by using fsspec cat_ranges and/or threaded os.pread.
Changes:
- Introduces a chunk-read dispatch layer in DatasetID (bulk cat_ranges, threaded pread, serial fallback) and wires it into the chunk-selection hot path.
- Adds an optional fetch_fn to BTreeV1RawDataChunks to bulk-fetch leaf nodes (including handling variable leaf sizes) and parse nodes from in-memory buffers.
- Adds test coverage for parallel B-tree reads and cat_ranges usage; adds an additional S3 caching investigation test module.
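To make the dispatch idea concrete, here is a minimal sketch of how such a layer might choose between a bulk fsspec `cat_ranges` call, threaded `os.pread`, and a serial fallback. All names here (`read_chunks`, the `fs`/`path` attribute probing) are illustrative assumptions, not pyfive's actual DatasetID API.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def read_chunks(handle, offsets, sizes, max_workers=8):
    """Fetch byte ranges, choosing a strategy based on the handle type."""
    fs = getattr(handle, "fs", None)
    if fs is not None and hasattr(fs, "cat_ranges"):
        # fsspec-backed handle: one bulk request covering all ranges.
        starts = list(offsets)
        ends = [o + s for o, s in zip(offsets, sizes)]
        return fs.cat_ranges([handle.path] * len(starts), starts, ends)
    try:
        fd = handle.fileno()
    except (AttributeError, OSError):
        fd = None
    if fd is not None:
        # Local file: positional reads need no shared seek pointer,
        # so they are safe to issue from a thread pool.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(
                lambda pair: os.pread(fd, pair[1], pair[0]),
                zip(offsets, sizes)))
    # Serial fallback for plain file-like objects.
    results = []
    for off, size in zip(offsets, sizes):
        handle.seek(off)
        results.append(handle.read(size))
    return results
```

The key design point is that the caller never needs to know which environment (POSIX, fsspec) it is running in; the handle's capabilities select the path.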
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.
Summary per file:
| File | Description |
|---|---|
| pyfive/h5d.py | Adds ChunkRead mixin, default parallelism configuration, and b-tree leaf fetch function plumbing. |
| pyfive/btree.py | Implements bulk leaf-node fetching and buffer-based parsing for v1 raw-data chunk B-trees. |
| pyfive/high_level.py | Tweaks remote-handle detection messaging and changes the Dataset base class to an ABC. |
| pyfive/utilities.py | Exposes fs/path on MetadataBufferingWrapper and delegates unknown attributes to the underlying handle. |
| tests/test_btree_parallel.py | Adds tests validating b-tree parallel leaf reads via cat_ranges and pread. |
| tests/test_s3_caching.py | Adds tests/framework for investigating S3 handle reuse/caching (currently mostly non-asserting). |
| doc/pyfive_class_diagram.pu | Adds/updates the UML diagram to reflect new classes and relationships. |
…tribute_read_detection Agent-Logs-Url: https://github.com/NCAS-CMS/pyfive/sessions/e26762f6-6c10-4c60-9365-a21f50bb3fc3 Co-authored-by: bnlawrence <1792815+bnlawrence@users.noreply.github.com>
…meter naming, test fixes, and logging setup
@bnlawrence please run

Description
It is clear that pyfive itself could benefit from internal parallelism. This idea was outlined in #154, and some detailed thinking and architecture design resulted in #216. This pull request is the outcome of that work, providing both parallel chunk reading and parallel reading of b-tree information. Both are turned on by default. The API to turn them off is somewhat obscure and might be something to address in the discussion around this pull request.
This would close #209 and #216 (#154 has been already closed in anticipation).
Considerations:
The use of a mixin class for reading chunks: while concerns have been expressed, I think this is, for now at least, the right pattern.
This retains a nearly complete separation of concerns between pyfive and the environment (POSIX, FSSPEC, etc.), but it is not perfect. Future work will need to address that, but the benefits are remarkable enough that it is worth doing this now and foreshadowing the necessary work (an issue will be forthcoming in the next few days and will link back here).
This replaces the previous pull request (First cut at adding some parallelism in pyfive #209).
Parallel decompression of chunks is postponed for future work.
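Although parallel decompression is deferred, a minimal sketch suggests what it might look like: because zlib releases the GIL while decompressing, a thread pool can overlap the work. This is a hypothetical illustration assuming zlib-compressed chunks, not the design this PR or the follow-up work commits to.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def decompress_chunks(raw_chunks, max_workers=4):
    """Decompress zlib-compressed chunk payloads concurrently.

    zlib.decompress drops the GIL during the C-level work, so threads
    give real parallelism here without needing a process pool.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(zlib.decompress, raw_chunks))
```

For HDF5 filter pipelines with other codecs, the same pattern applies only if the codec releases the GIL; otherwise a process pool would be needed.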
Checklist