Context.export_dataset method to export to another file format #1379
Conversation
Thx for addressing this! :-) Maybe one could also use the Dask integration? That already allows loading to memory, or storing into anything that allows slice assignment: HDF5, TileDB, Zarr, NPY stacks: https://docs.dask.org/en/stable/array-creation.html#store-dask-arrays. Probably it just has to be documented properly? Also, in LiberTEM-live we have https://github.com/LiberTEM/LiberTEM-live/blob/master/src/libertem_live/udf/record.py that performs parallel writing to an NPY file on the workers using a UDF. That approach could be the way to go for best performance. Instead of NPY one could also look into Zarr -- it just has to support parallel assignment to non-overlapping slices.
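A minimal sketch of that `da.store` route, assuming the data is already available as a Dask array (the random array below is just a stand-in for what the LiberTEM Dask integration would return):

```python
# Sketch of the dask.array.store route mentioned above; the random array
# stands in for an actual dataset exposed as a dask array.
import dask.array as da
import h5py

dask_array = da.random.random((32, 32, 64, 64), chunks=(8, 32, 64, 64))

with h5py.File("out.h5", "w") as f:
    target = f.create_dataset("data", shape=dask_array.shape, dtype=dask_array.dtype)
    # store() writes chunk-by-chunk via slice assignment into the target
    da.store(dask_array, target)
```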
Dask integration is a good way to do this, I agree. The reason that I didn't go with that for this first implementation (even in a hidden way) is that right now we can't ensure that partitions are axis-aligned, so saving via Dask could incur rechunking (or even loading all the data into memory). It is a good reason to revisit #1264 and do it properly. It's also possible that the delayed functions will be executed on different workers / machines, leading to IPC for the array chunks, though this is probably a minor issue.
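For illustration, with made-up chunk shapes (nothing here reflects actual LiberTEM partitioning):

```python
# Hypothetical illustration: chunks that don't line up with the target
# layout force a rechunk, which can shuffle data between workers.
import dask.array as da

arr = da.random.random((1000, 64, 64), chunks=(333, 64, 64))
aligned = arr.rechunk((250, 64, 64))  # extra data movement before any write
```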
Great, I can look at implementing this approach, though we would have to upstream RecordUDF or similar rather than importing from LiberTEM-live.
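For reference, a hedged sketch of what using the LiberTEM-live writer looks like today; the constructor argument is an assumption:

```python
# Hypothetical usage; the exact constructor signature of RecordUDF
# is an assumption here.
from libertem_live.udf.record import RecordUDF

# ctx: a LiberTEM Context, ds: a loaded DataSet (as elsewhere in this thread)
udf = RecordUDF("recorded.npy")  # assumed: target file as first argument
ctx.run_udf(dataset=ds, udf=udf)
```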
Some comments:

The API: if we want to extend to a parallel version later, there will probably be some interaction with a
As far as I can see, with this we (will) have multiple ways of converting/saving:
Both 3) and 4) can be summarized as: built for the live case; "declarative" writing; limited format support and full integration into libertem-live is still TODO. Should we attempt to unify these in some way, at least in the interface? Having different interfaces for offline conversion vs. live acquisition would also be a good option IMHO.
Agreed for offline conversion of data; I'm not convinced that it will work in the general case for live acquisition. I haven't tried it yet in a multi-process execution, but having a single-threaded mmap based writer in the pipeline of the k2is receiver kills/severely limits performance (IIRC: page faults and memory pressure -> jitter and less efficient memory re-use).
Agreed.
Good point about the K2 IS and dedicated high-performance low-overhead writers! In that sense, it would be a dataset option in a UDF run, and just saving or converting would be running an empty UDF set with the dataset saving option enabled?
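A sketch of that idea; `NoOpUDF` does exist in LiberTEM, but the `record_to=` option below is purely hypothetical:

```python
# Sketch of the proposed API; `record_to=` is a hypothetical
# dataset-saving option, not something that exists in LiberTEM.
from libertem.udf.base import NoOpUDF

ctx.run_udf(
    dataset=ds,
    udf=NoOpUDF(),
    record_to="converted.npy",  # hypothetical option
)
```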
Must do the necessary import magic in libertem-live!
After discussion with @sk1p, updated this to instead use a method on the Context:

```python
def export_dataset(
    self,
    dataset: DataSet,
    *,
    path: os.PathLike,
    progress: bool = False,
    overwrite: bool = False,
):
    """
    Export the dataset to another format on disk with an at-this-time
    minimal interface.
    """
```

Also this PR now upstreams RecordUDF. To be discussed if this is something we want to add, but it is undeniably a very convenient way to work with some odd datasets (offline) with other tools!
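For illustration, a hedged usage sketch of the method above; the input file and `"auto"` loading are placeholders:

```python
# Hedged usage sketch; the input path is a placeholder.
import libertem.api as lt

ctx = lt.Context()
ds = ctx.load("auto", path="input_data.h5")
ctx.export_dataset(ds, path="converted.npy", progress=True)
```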
Changed the title from "DataSet method to export to another file format" to "Context.convert_dataset method to export to another file format"
Codecov Report

Patch coverage:

Additional details and impacted files:

```
@@            Coverage Diff             @@
##           master    #1379      +/-   ##
==========================================
+ Coverage   68.47%   68.70%    +0.23%
==========================================
  Files         305      158      -147
  Lines       17874    15541     -2333
  Branches     3201     2771      -430
==========================================
- Hits        12239    10678     -1561
+ Misses       5115     4436      -679
+ Partials      520      427       -93
```
☔ View full report in Codecov by Sentry.
Changed the title from "Context.convert_dataset method to export to another file format" to "Context.export_dataset method to export to another file format"
Looks good overall, I've added some comments in-line, mostly details.
Co-authored-by: Alexander Clausen <alex@gc-web.de>
Thanks for the review again, sorry to have missed all those doc changes!
/azp run libertem.libertem-data
Azure Pipelines successfully started running 1 pipeline(s).
LGTM!
NOTE: The implementation has changed; leaving the below for the record...
Supports export to `npy` and `raw`; could be later extended. In particular in response to #1366 and likely other issues.

The user-facing interface is fairly minimal:

Works via `partition.get_macrotile()` so should be quite well-supported. Writes data serially, partition-by-partition, but this could of course be improved.

I did see some old code for writing data, notably `io.writers.base.WriteHandle`, but I couldn't quite make sense of it. Instead I wrote the writer support into the file `common.writers`.

A further quality-of-life extension would be a method to actually load the data directly into an array in memory.
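For the record, a rough sketch of the serial write path described above; apart from `get_macrotile()`, the names and array handling are assumptions:

```python
# Hedged sketch of serial partition-by-partition export; treating the
# macrotile directly as an ndarray is an assumption.
import numpy as np

def export_raw(dataset, path):
    with open(path, "wb") as f:
        for partition in dataset.get_partitions():
            tile = partition.get_macrotile()  # whole partition as one tile
            np.asarray(tile).tofile(f)
```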
I'm expecting to discuss / re-work this before merging, of course, but I'm putting up the PR for discussion!
/azp run libertem.libertem-data
passed