
Context.export_dataset method to export to another file format #1379

Merged

22 commits merged into LiberTEM:master on Jul 25, 2023

Conversation

@matbryan52 (Member) commented Jan 11, 2023

NOTE The implementation has since changed; leaving the below for the record...

Supports export to npy and raw; this could later be extended to other formats. Written in particular in response to #1366, and likely relevant to other issues.

The user-facing API is fairly minimal:

import os
from typing import Literal, Optional

import numpy.typing as nt


class DataSet:
    ...

    def save_as(
        self,
        path: os.PathLike,
        format: Optional[Literal['raw', 'npy']] = None,
        progress: bool = False,
        save_dtype: Optional[nt.DTypeLike] = None,
    ):
        ...

This works via partition.get_macrotile(), so it should be quite well-supported. Data is written serially, partition-by-partition, but this could of course be improved.
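
For illustration, a minimal sketch of that serial path (not the PR's actual code; it assumes the usual DataSet/Partition attributes such as get_partitions(), get_macrotile() and tile_slice.origin):

import numpy as np

def save_as_npy(dataset, path):
    # Pre-allocate the output .npy file as a memmap with the full dataset shape.
    out = np.lib.format.open_memmap(
        path, mode='w+', dtype=dataset.dtype, shape=tuple(dataset.shape),
    )
    # Flatten the navigation dimensions so each partition's flat origin
    # maps directly onto a contiguous slice of the output.
    flat = out.reshape((-1,) + tuple(dataset.shape.sig))
    for partition in dataset.get_partitions():
        tile = partition.get_macrotile()
        start = tile.tile_slice.origin[0]
        flat[start:start + tile.data.shape[0]] = tile.data
    out.flush()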

I did see some old code for writing data, notably io.writers.base.WriteHandle, but I couldn't quite make sense of it. Instead I wrote the writer support into the file common.writers.

A further quality-of-life extension would be a method to actually load the data directly into an array in memory.

I'm expecting to discuss / re-work this before merging, of course, but I'm putting up the PR for discussion!

Contributor Checklist:

Reviewer Checklist:

  • /azp run libertem.libertem-data passed
  • No import of GPL code from MIT code

@matbryan52 linked an issue on Jan 11, 2023 that may be closed by this pull request
@uellue (Member) commented Jan 11, 2023

Thanks for addressing this! :-)

Maybe one could also use the Dask integration? That already allows loading to memory, and storing into anything that allows slice assignment (HDF5, TileDB, Zarr, NPY stacks): https://docs.dask.org/en/stable/array-creation.html#store-dask-arrays. Probably it just has to be documented properly?
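
As a sketch, that route could look roughly like this, assuming make_dask_array from libertem.contrib.daskadapter and its (array, workers) return value; any slice-assignable target works in place of the h5py dataset:

import h5py
from libertem.contrib.daskadapter import make_dask_array

# Wrap the LiberTEM dataset as a dask array (one chunk per partition).
dask_array, workers = make_dask_array(dataset)

# Store into any target supporting slice assignment, here an HDF5 dataset.
with h5py.File('exported.h5', 'w') as f:
    target = f.create_dataset('data', shape=dask_array.shape, dtype=dask_array.dtype)
    dask_array.store(target)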

Also, in LiberTEM-live we have https://github.com/LiberTEM/LiberTEM-live/blob/master/src/libertem_live/udf/record.py, which performs parallel writing to an NPY file on the workers using a UDF. That approach could be the way to go for best performance. Instead of NPY one could also look into Zarr -- it just has to support parallel assignment to non-overlapping slices.
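
For context, using that UDF is essentially a one-liner; a sketch, assuming the constructor simply takes the output filename:

from libertem_live.udf.record import RecordUDF

# Each worker writes its own partitions into the shared NPY file in parallel.
ctx.run_udf(dataset=ds, udf=RecordUDF('recorded.npy'))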

@matbryan52 (Member, Author) commented:

> Maybe one could also use the Dask integration? That already allows loading to memory, and storing into anything that allows slice assignment (HDF5, TileDB, Zarr, NPY stacks): https://docs.dask.org/en/stable/array-creation.html#store-dask-arrays. Probably it just has to be documented properly?

Dask integration is a good way to do this, I agree. The reason that I didn't go with that for this first implementation (even in a hidden way) is that right now we can't ensure that partitions are axis-aligned, so saving via Dask could incur rechunking (or even loading all the data into memory). It is a good reason to revisit #1264 and do it properly.

It's also possible that the delayed functions will be executed on different workers / machines, leading to IPC for the array chunks, though this is probably a minor issue.

> Also, in LiberTEM-live we have https://github.com/LiberTEM/LiberTEM-live/blob/master/src/libertem_live/udf/record.py, which performs parallel writing to an NPY file on the workers using a UDF. That approach could be the way to go for best performance. Instead of NPY one could also look into Zarr -- it just has to support parallel assignment to non-overlapping slices.

Great, I can look at implementing this approach, though we would have to upstream RecordUDF or similar rather than importing from libertem-live I think.

@sk1p (Member) commented Jan 12, 2023

Some comments:

The API: if we want to extend to a parallel version later, there will probably be some interaction with a Context - we should think about what that would look like, and whether we can easily implement that in a backwards-compatible way. Also, running the export in the main process / node breaks with one "tradition" of the DataSet API, which is to only perform I/O "remotely" on an executor.

> Also, in LiberTEM-live we have https://github.com/LiberTEM/LiberTEM-live/blob/master/src/libertem_live/udf/record.py, which performs parallel writing to an NPY file on the workers using a UDF.

As far as I can see, with this we (will) have multiple ways of converting/saving:

  1. As a side-effect of running RecordUDF
  2. "Imperatively" using DataSet.save_as
  3. live: controlling the data acquisition software of the detector and telling it to acquire into a specified path (saved in the detector-specific format, like .mib or HDF5 with bslz4 or similar)
  4. live: specialized data acquisition / writing routines that are built into our receiving libraries (example: the upcoming k2is support in LiberTEM-live can write to disk very efficiently and can keep up with the data rates without any issues; data writing can be controlled via the Python API)

Both 3) and 4) can be summarized as: built for the live case; "declarative" writing; limited format support and full integration into libertem-live is still TODO.

Should we attempt to unify these in some way, at least in the interface? Having different interfaces for offline conversion vs. live acquisition would also be a good option IMHO.

> That approach could be the way to go for best performance.

Agreed for offline conversion of data; I'm not convinced that it will work in the general case for live acquisition. I haven't tried it yet in a multi-process execution, but having a single-threaded mmap based writer in the pipeline of the k2is receiver kills/severely limits performance (IIRC: page faults and memory pressure -> jitter and less efficient memory re-use).

> though we would have to upstream RecordUDF or similar rather than importing from libertem-live I think.

Agreed.

@uellue (Member) commented Jan 12, 2023

Good point about the K2 IS and dedicated high-performance low-overhead writers! In that sense, it would be a dataset option in a UDF run, and just saving or converting would be running an empty UDF set with the dataset saving option enabled?

@sk1p added this to the 0.12 milestone on Jun 28, 2023
@matbryan52 (Member, Author) commented Jun 29, 2023

After discussion with @sk1p, I updated this to use a method on the Context instead:

    def export_dataset(
        self,
        dataset: DataSet,
        *,
        path: os.PathLike,
        progress: bool = False,
        overwrite: bool = False,
    ):
        """
        Export the dataset to another format on disk
        """
        ...

with a minimal interface for now. This PR also upstreams RecordUDF to take advantage of its performance.

Whether this is something we want to add is to be discussed, but it is undeniably a very convenient way to open up some odd datasets (offline) to other tools!
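
For example, a usage sketch based on the signature above (the input file and the extension-based format inference are assumptions, not taken from the PR):

import libertem.api as lt

ctx = lt.Context()
ds = ctx.load('auto', path='input.mib')  # placeholder input file
# Assumption: the output format is inferred from the '.npy' extension;
# overwrite guards against clobbering an existing file.
ctx.export_dataset(ds, path='exported.npy', progress=True, overwrite=False)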

@matbryan52 changed the title from "DataSet method to export to another file format" to "Context.convert_dataset method to export to another file format" on Jun 29, 2023
codecov bot commented Jun 29, 2023

Codecov Report

Patch coverage: 94.44%; project coverage change: +0.23% 🎉

Comparison is base (4f9de10) 68.47% compared to head (36f3695) 68.70%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1379      +/-   ##
==========================================
+ Coverage   68.47%   68.70%   +0.23%     
==========================================
  Files         305      158     -147     
  Lines       17874    15541    -2333     
  Branches     3201     2771     -430     
==========================================
- Hits        12239    10678    -1561     
+ Misses       5115     4436     -679     
+ Partials      520      427      -93     
Impacted Files               Coverage Δ
src/libertem/udf/record.py   91.30% <91.30%> (ø)
src/libertem/api.py          91.09% <100.00%> (+1.52%) ⬆️

... and 182 files with indirect coverage changes


[5 review threads on src/libertem/api.py, all resolved]
matbryan52 and others added 4 commits on July 6, 2023, including "Good spot" (Co-authored-by: Alexander Clausen <alex@gc-web.de>)
@matbryan52 changed the title from "Context.convert_dataset method to export to another file format" to "Context.export_dataset method to export to another file format" on Jul 6, 2023
@sk1p (Member) left a review:

Looks good overall, I've added some comments in-line, mostly details.

[6 review threads on docs/source/changelog/features/export_dataset.rst, docs/source/formats.rst and src/libertem/api.py, all resolved]
matbryan52 and others added 5 commits on July 6, 2023 (Co-authored-by: Alexander Clausen <alex@gc-web.de>)
@matbryan52 (Member, Author) commented:
Thanks for the review again, sorry to have missed all those doc changes!

@matbryan52 (Member, Author) commented:
/azp run libertem.libertem-data

@azure-pipelines bot commented:
Azure Pipelines successfully started running 1 pipeline(s).

@sk1p (Member) left a review:

LGTM!

@sk1p merged commit 9275040 into LiberTEM:master on Jul 25, 2023
31 checks passed
Development

Successfully merging this pull request may close these issues:

How to get data out and other stupid questions
3 participants