Parallel access to b-tree and data via cat_ranges and threading #218
bnlawrence wants to merge 26 commits into main from
Conversation
These results show the benefit of the parallelism for data reading, though they suggest one would not make the parallel b-tree read the default. Further investigation is necessary. Note that the POSIX results are not believable, as they reflect memory caching by the OS, as discussed here. Note also that the ssh results use `p5rem`, not `fsspec`. To what extent server-side caching (for http and s3) is involved is not clear.
@bnlawrence I fixed your ruff issues so you have a clean CI and can focus on the functional failures, if any. You can always fix ruff issues to a first pass by running
Codecov Report ❌ Patch coverage is

Additional details and impacted files

@@           Coverage Diff            @@
##             main     #218     +/-  ##
==========================================
+ Coverage   77.62%   78.39%   +0.77%
==========================================
  Files          15       15
  Lines        3128     3300    +172
  Branches      499      526     +27
==========================================
+ Hits         2428     2587    +159
- Misses        573      578      +5
- Partials      127      135      +8

View full report in Codecov by Sentry.
Pull request overview
Adds internal parallelism to pyfive’s chunk and B-tree access paths to reduce latency (especially for remote/object-store reads) by using fsspec cat_ranges and/or threaded os.pread.
Changes:
- Introduces a chunk-read dispatch layer in DatasetID (bulk cat_ranges, threaded pread, serial fallback) and wires it into the chunk-selection hot path.
- Adds an optional fetch_fn to BTreeV1RawDataChunks to bulk-fetch leaf nodes (including handling variable leaf sizes) and parse nodes from in-memory buffers.
- Adds test coverage for parallel B-tree reads and cat_ranges usage; adds an additional S3 caching investigation test module.
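To make the dispatch idea concrete, here is a minimal sketch of how such a layer might choose between a bulk fsspec `cat_ranges` call, threaded `os.pread`, and a serial fallback. All names here (`read_chunks`, the `fs`/`path` attribute probing) are illustrative assumptions, not pyfive's actual DatasetID API.

```python
import os
from concurrent.futures import ThreadPoolExecutor

def read_chunks(handle, offsets, sizes, max_workers=8):
    """Fetch byte ranges, choosing a strategy based on the handle type."""
    fs = getattr(handle, "fs", None)
    if fs is not None and hasattr(fs, "cat_ranges"):
        # fsspec-backed handle: one bulk request covering all ranges.
        starts = list(offsets)
        ends = [o + s for o, s in zip(offsets, sizes)]
        return fs.cat_ranges([handle.path] * len(starts), starts, ends)
    try:
        fd = handle.fileno()
    except (AttributeError, OSError):
        fd = None
    if fd is not None:
        # Local file: positional reads need no shared seek pointer,
        # so they are safe to issue from a thread pool.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            return list(pool.map(
                lambda pair: os.pread(fd, pair[1], pair[0]),
                zip(offsets, sizes)))
    # Serial fallback for plain file-like objects.
    results = []
    for off, size in zip(offsets, sizes):
        handle.seek(off)
        results.append(handle.read(size))
    return results
```

The key design point is that the caller never needs to know which environment (POSIX, fsspec) it is running in; the handle's capabilities select the path.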
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 10 comments.
Summary per file:
| File | Description |
|---|---|
| pyfive/h5d.py | Adds ChunkRead mixin, default parallelism configuration, and b-tree leaf fetch function plumbing. |
| pyfive/btree.py | Implements bulk leaf-node fetching and buffer-based parsing for v1 raw-data chunk B-trees. |
| pyfive/high_level.py | Tweaks remote-handle detection messaging and changes the Dataset base class to an ABC. |
| pyfive/utilities.py | Exposes fs/path on MetadataBufferingWrapper and delegates unknown attributes to the underlying handle. |
| tests/test_btree_parallel.py | Adds tests validating b-tree parallel leaf reads via cat_ranges and pread. |
| tests/test_s3_caching.py | Adds tests/framework for investigating S3 handle reuse/caching (currently mostly non-asserting). |
| doc/pyfive_class_diagram.pu | Adds/updates the UML diagram to reflect new classes and relationships. |
…tribute_read_detection Agent-Logs-Url: https://github.com/NCAS-CMS/pyfive/sessions/e26762f6-6c10-4c60-9365-a21f50bb3fc3 Co-authored-by: bnlawrence <1792815+bnlawrence@users.noreply.github.com>
…meter naming, test fixes, and logging setup
@bnlawrence please run

Description
It is clear that pyfive itself could benefit from internal parallelism. This idea was outlined in #154, and some detailed thinking and architecture design resulted in #216. This pull request is the outcome of that work, providing both parallel chunk reading and parallel reading of b-tree information. Both are turned on by default. The API to turn them off is somewhat obscure and might be something to address in the discussion around this pull request.
This would close #209 and #216 (#154 has been already closed in anticipation).
Considerations:
The use of a mixin class for reading chunks: while concerns have been expressed, I think this is, for now at least, the right pattern.
This retains a nearly complete separation of concerns between pyfive and the environment (POSIX, FSSPEC, etc.), but it is not perfect. Future work will need to address that, but the benefits are remarkable enough that it is worth doing this now and foreshadowing the necessary work (an issue will be forthcoming in the next few days and will link back here).
This replaces the previous pull request (First cut at adding some parallelism in pyfive #209).
Parallel decompression of chunks is postponed for future work.
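Although parallel decompression is deferred, a minimal sketch suggests what it might look like: because zlib releases the GIL while decompressing, a thread pool can overlap the work. This is a hypothetical illustration assuming zlib-compressed chunks, not the design this PR or the follow-up work commits to.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def decompress_chunks(raw_chunks, max_workers=4):
    """Decompress zlib-compressed chunk payloads concurrently.

    zlib.decompress drops the GIL during the C-level work, so threads
    give real parallelism here without needing a process pool.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(zlib.decompress, raw_chunks))
```

For HDF5 filter pipelines with other codecs, the same pattern applies only if the codec releases the GIL; otherwise a process pool would be needed.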
Checklist