Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve performance of qml.data.load() when partially loading a dataset #4674

Merged

Conversation

brownj85
Copy link
Contributor

@brownj85 brownj85 commented Oct 13, 2023

Context:
Link to shortcut story

Loading individual attributes of datasets took much longer than loading a whole dataset. This is because the fsspec library was mapping the HDF5 reads directly to HTTP requests, which only loaded a few KB each.

Description of the Change:
open_hdf5_s3() now opens the remote dataset in read-buffered mode, which reads data in 8MB chunks into a memory-mapped cache. This results in much fewer requests and faster loading.

Benefits:
Acceptable performance for partial loading of large datasets. The download throughput for partial loading is now comparable to downloading the whole dataset (about 15-20% less mb/s).

Possible Drawbacks:
None

Related GitHub Issues:

@brownj85 brownj85 changed the title use mmap cache for fsspec Use fsspec read-buffering when partially loading dataset Oct 16, 2023
@brownj85 brownj85 requested a review from DSGuala October 16, 2023 14:08
@brownj85 brownj85 changed the title Use fsspec read-buffering when partially loading dataset Improve performance of qml.data.load() when partially loading a dataset Oct 16, 2023
@brownj85 brownj85 marked this pull request as ready for review October 16, 2023 14:11
@codecov
Copy link

codecov bot commented Oct 16, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (5f246a9) 99.64% compared to head (0d0e7bf) 99.63%.
Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4674      +/-   ##
==========================================
- Coverage   99.64%   99.63%   -0.01%     
==========================================
  Files         377      377              
  Lines       33999    33735     -264     
==========================================
- Hits        33878    33613     -265     
- Misses        121      122       +1     
Files Coverage Δ
pennylane/data/base/hdf5.py 100.00% <100.00%> (ø)
pennylane/data/data_manager/__init__.py 98.52% <100.00%> (-0.03%) ⬇️

... and 42 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@timmysilv
Copy link
Contributor

nice speed-up! to be clear, we used to only use (disk) caches when users requested a specific path, but now we always use one (in-memory), right? is there any loss of functionality by removing the option of specifying a cache path?

Copy link
Contributor

@anthayes92 anthayes92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @brownj85, looks good overall! Just wondering how was the speedup verified?

@brownj85
Copy link
Contributor Author

Thanks @brownj85, looks good overall! Just wondering how was the speedup verified

Just by my own testing. I'm assuming @timmysilv and @obliviateandsurrender tested it as well

@brownj85
Copy link
Contributor Author

nice speed-up! to be clear, we used to only use (disk) caches when users requested a specific path, but now we always use one (in-memory), right? is there any loss of functionality by removing the option of specifying a cache path?

Pretty much - I used blockcache because I thought it did what mmap does, but it actually just stores the response data after the fact and doesn't do any read buffering. mmap is in-memory but it can commit blocks to a temporary file as needed. No loss of functionality, the cache_dir was just there to make the block cache work

Copy link
Contributor

@timmysilv timmysilv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good. I didn't do any testing personally so I'm a bit curious as to what your own testing entailed. that said, I trust your judgement here and I'm happy with this change!

@brownj85 brownj85 enabled auto-merge (squash) October 19, 2023 18:57
Copy link
Contributor

@DSGuala DSGuala left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested with H2 and H2O. The performance improved significantly on downloading individual attributes 💪

@timmysilv timmysilv added the merge-ready ✔️ All tests pass and the PR is ready to be merged. label Oct 19, 2023
@brownj85 brownj85 merged commit 4e2cb17 into master Oct 19, 2023
33 checks passed
@brownj85 brownj85 deleted the sc-43737-loading-a-dataset-hamiltonian-attribute-takes branch October 19, 2023 22:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
merge-ready ✔️ All tests pass and the PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants