Improve performance of qml.data.load() when partially loading a dataset #4674

brownj85 · 2023-10-13T17:19:28Z

Loading individual attributes of datasets took much longer than loading a whole dataset. This is because the fsspec library was mapping the HDF5 reads directly to HTTP requests, each of which loaded only a few KB.

Description of the Change:
open_hdf5_s3() now opens the remote dataset in read-buffered mode, which reads data in 8 MB chunks into a memory-mapped cache. This results in far fewer requests and faster loading.
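
To illustrate why block-sized reads cut the request count, here is a minimal toy model. It is a sketch under stated assumptions, not the code in this PR and not fsspec's implementation: `CountingRemote`, `BufferedReader`, and `compare` are hypothetical names, and the "remote file" is simulated in memory so that each `fetch()` call stands in for one HTTP range request.

```python
BLOCK_SIZE = 8 * 1024 * 1024  # 8 MB, the chunk size named in the description
_PATTERN = bytes(range(256))  # fake file content: the byte at offset o is o % 256


class CountingRemote:
    """Hypothetical stand-in for a remote file; each fetch() is one 'request'."""

    def __init__(self, size):
        self.size = size
        self.requests = 0

    def fetch(self, start, end):
        self.requests += 1
        end = min(end, self.size)
        length = max(0, end - start)
        phase = start % 256
        reps = (phase + length) // 256 + 1
        return (_PATTERN * reps)[phase:phase + length]


class BufferedReader:
    """Serves small reads from whole blocks that are fetched once and kept."""

    def __init__(self, remote, block_size=BLOCK_SIZE):
        self.remote = remote
        self.block_size = block_size
        self.blocks = {}  # block index -> bytes of that block

    def read(self, start, length):
        end = min(start + length, self.remote.size)
        out = bytearray()
        pos = start
        while pos < end:
            idx = pos // self.block_size
            block_start = idx * self.block_size
            if idx not in self.blocks:
                # One request brings in the whole block around the read.
                self.blocks[idx] = self.remote.fetch(
                    block_start, block_start + self.block_size
                )
            take_end = min(end, block_start + self.block_size)
            out += self.blocks[idx][pos - block_start:take_end - block_start]
            pos = take_end
        return bytes(out)


def compare(num_reads=500, read_size=4096, file_size=32 * 1024 * 1024):
    """Return (requests without buffering, requests with buffering)."""
    direct = CountingRemote(file_size)
    for i in range(num_reads):
        direct.fetch(i * read_size, (i + 1) * read_size)

    remote = CountingRemote(file_size)
    reader = BufferedReader(remote)
    for i in range(num_reads):
        reader.read(i * read_size, read_size)
    return direct.requests, remote.requests
```

In this toy setup, 500 contiguous 4 KB reads all fall inside the first 8 MB block, so the buffered reader issues a single request where the direct approach issues 500. Real HDF5 access patterns are less regular, so the actual savings depend on how the reads are distributed.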

Benefits:
Acceptable performance for partial loading of large datasets. The download throughput for partial loading is now comparable to that of downloading the whole dataset (about 15-20% lower in MB/s).

Possible Drawbacks:
None

Related GitHub Issues:

codecov · 2023-10-16T14:56:14Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (5f246a9) 99.64% compared to head (0d0e7bf) 99.63%.
Report is 1 commit behind head on master.

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #4674      +/-   ##
==========================================
- Coverage   99.64%   99.63%   -0.01%     
==========================================
  Files         377      377              
  Lines       33999    33735     -264     
==========================================
- Hits        33878    33613     -265     
- Misses        121      122       +1

| Files | Coverage Δ |
|---|---|
| pennylane/data/base/hdf5.py | `100.00% <100.00%> (ø)` |
| pennylane/data/data_manager/__init__.py | `98.52% <100.00%> (-0.03%)` ⬇️ |

... and 42 files with indirect coverage changes

timmysilv · 2023-10-19T14:09:19Z

nice speed-up! to be clear, we used to only use (disk) caches when users requested a specific path, but now we always use one (in-memory), right? is there any loss of functionality by removing the option of specifying a cache path?

anthayes92

Thanks @brownj85, looks good overall! Just wondering how was the speedup verified?

brownj85 · 2023-10-19T16:31:50Z

> Thanks @brownj85, looks good overall! Just wondering how was the speedup verified

Just by my own testing. I'm assuming @timmysilv and @obliviateandsurrender tested it as well

brownj85 · 2023-10-19T16:43:26Z

> nice speed-up! to be clear, we used to only use (disk) caches when users requested a specific path, but now we always use one (in-memory), right? is there any loss of functionality by removing the option of specifying a cache path?

Pretty much - I used blockcache because I thought it did what mmap does, but it actually just stores the response data after the fact and doesn't do any read buffering. mmap is in-memory, but it can commit blocks to a temporary file as needed. No loss of functionality: the cache_dir was just there to make the block cache work.
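
The idea of an in-memory mapping that can be backed by a temporary file can be sketched with the standard library alone. This is an illustration only, not fsspec's actual mmap cache: `MmapBlockCache`, `fetcher`, and `demo` are hypothetical names, and the remote source is simulated with an in-memory byte string.

```python
import mmap
import tempfile


class MmapBlockCache:
    """Hypothetical block cache whose storage is an mmap over a temporary file."""

    def __init__(self, size, block_size, fetcher):
        self.size = size
        self.block_size = block_size
        self.fetcher = fetcher  # callable(start, end) -> bytes of that range
        self.fetched = set()    # indices of blocks already written to the map
        self._file = tempfile.TemporaryFile()
        self._file.truncate(size)  # reserve the full length up front
        self._map = mmap.mmap(self._file.fileno(), size)

    def read(self, start, end):
        end = min(end, self.size)
        if start >= end:
            return b""
        first = start // self.block_size
        last = (end - 1) // self.block_size
        for idx in range(first, last + 1):
            if idx not in self.fetched:
                lo = idx * self.block_size
                hi = min(lo + self.block_size, self.size)
                # Fetch the whole block and store it in the mapping.
                self._map[lo:hi] = self.fetcher(lo, hi)
                self.fetched.add(idx)
        return self._map[start:end]

    def close(self):
        self._map.close()
        self._file.close()


def demo():
    source = bytes(range(256)) * 64  # 16 KiB of fake remote data
    calls = []

    def fetcher(start, end):
        calls.append((start, end))
        return source[start:end]

    cache = MmapBlockCache(len(source), block_size=4096, fetcher=fetcher)
    try:
        a = cache.read(10, 20)      # fetches block 0
        b = cache.read(100, 200)    # already in the mapping, no fetch
        c = cache.read(4090, 4100)  # crosses into block 1, one more fetch
        ok = (a == source[10:20], b == source[100:200], c == source[4090:4100])
        return ok + (calls,)
    finally:
        cache.close()
```

Because the mapping is file-backed, the operating system is free to keep hot pages in memory and write others out to the temporary file, which is roughly the behaviour described in the comment above.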

timmysilv

sounds good. I didn't do any testing personally so I'm a bit curious as to what your own testing entailed. that said, I trust your judgement here and I'm happy with this change!

DSGuala

Tested with H2 and H2O. The performance improved significantly on downloading individual attributes 💪

brownj85 added 3 commits October 13, 2023 13:19

use mmap cache for fsspec

b9a6651

fix test

d99443a

fmt

30727d4

brownj85 changed the title ~~use mmap cache for fsspec~~ Use fsspec read-buffering when partially loading dataset Oct 16, 2023

Merge branch 'master' into sc-43737-loading-a-dataset-hamiltonian-att…

f9a3c8c

…ribute-takes

brownj85 requested a review from DSGuala October 16, 2023 14:08

brownj85 changed the title ~~Use fsspec read-buffering when partially loading dataset~~ Improve performance of qml.data.load() when partially loading a dataset Oct 16, 2023

brownj85 added 2 commits October 16, 2023 10:11

update changelog

c7df35f

changelog

bf30567

brownj85 marked this pull request as ready for review October 16, 2023 14:11

brownj85 requested review from timmysilv and anthayes92 October 16, 2023 14:12

Merge branch 'master' into sc-43737-loading-a-dataset-hamiltonian-att…

27bbb1e

…ribute-takes

brownj85 mentioned this pull request Oct 16, 2023

Fix full dataset download after partial #4681

Merged

brownj85 requested a review from obliviateandsurrender October 16, 2023 19:35

Merge branch 'master' into sc-43737-loading-a-dataset-hamiltonian-att…

a32952c

…ribute-takes

obliviateandsurrender approved these changes Oct 18, 2023

pennylane/data/base/hdf5.py (resolved)

trbromley reviewed Oct 19, 2023

doc/releases/changelog-dev.md (outdated, resolved)

anthayes92 reviewed Oct 19, 2023

make block size configurable

c008276

brownj85 added 3 commits October 19, 2023 12:31

Merge branch 'master' into sc-43737-loading-a-dataset-hamiltonian-att…

bdb80b1

…ribute-takes

update changelog

6b3fa9f

Merge branch 'sc-43737-loading-a-dataset-hamiltonian-attribute-takes'…

fb2d8c4

… of github.com:PennyLaneAI/pennylane into sc-43737-loading-a-dataset-hamiltonian-attribute-takes

brownj85 requested review from trbromley and anthayes92 October 19, 2023 16:33

timmysilv approved these changes Oct 19, 2023

Merge branch 'master' into sc-43737-loading-a-dataset-hamiltonian-att…

55a73d8

…ribute-takes

brownj85 enabled auto-merge (squash) October 19, 2023 18:57

Merge branch 'master' into sc-43737-loading-a-dataset-hamiltonian-att…

79388cb

…ribute-takes

DSGuala approved these changes Oct 19, 2023

Merge branch 'master' into sc-43737-loading-a-dataset-hamiltonian-att…

46c3a01

…ribute-takes

timmysilv added the merge-ready ✔️ label Oct 19, 2023

brownj85 added 2 commits October 19, 2023 17:14

Merge branch 'master' into sc-43737-loading-a-dataset-hamiltonian-att…

7a23b9f

…ribute-takes

Merge branch 'master' into sc-43737-loading-a-dataset-hamiltonian-att…

0d0e7bf

…ribute-takes

brownj85 merged commit 4e2cb17 into master Oct 19, 2023
33 checks passed

brownj85 deleted the sc-43737-loading-a-dataset-hamiltonian-attribute-takes branch October 19, 2023 22:06