
Context.export_dataset method to export to another file format #1379

Merged

22 commits merged into LiberTEM:master on Jul 25, 2023

Conversation

@matbryan52 (Member) commented Jan 11, 2023

NOTE The implementation has since changed; leaving the below for the record...

Supports export to npy and raw; this could later be extended to other formats. Written in particular in response to #1366, and likely relevant to other issues.

The user-facing API is fairly minimal:

import os
from typing import Literal, Optional

import numpy.typing as nt


class DataSet:
    ...

    def save_as(
        self,
        path: os.PathLike,
        format: Optional[Literal['raw', 'npy']] = None,
        progress: bool = False,
        save_dtype: Optional[nt.DTypeLike] = None,
    ):
        ...

This works via partition.get_macrotile(), so it should be quite well-supported. Data is written serially, partition-by-partition, but this could of course be improved.
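
For illustration, a minimal sketch of that serial path (not the PR's actual code; it assumes the usual DataSet/Partition attributes such as get_partitions(), get_macrotile() and tile_slice.origin):

import numpy as np

def save_as_npy(dataset, path):
    # Pre-allocate the output .npy file as a memmap with the full dataset shape.
    out = np.lib.format.open_memmap(
        path, mode='w+', dtype=dataset.dtype, shape=tuple(dataset.shape),
    )
    # Flatten the navigation dimensions so each partition's flat origin
    # maps directly onto a contiguous slice of the output.
    flat = out.reshape((-1,) + tuple(dataset.shape.sig))
    for partition in dataset.get_partitions():
        tile = partition.get_macrotile()
        start = tile.tile_slice.origin[0]
        flat[start:start + tile.data.shape[0]] = tile.data
    out.flush()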

I did see some old code for writing data, notably io.writers.base.WriteHandle, but I couldn't quite make sense of it. Instead I wrote the writer support into the file common.writers.

A further quality-of-life extension would be a method to actually load the data directly into an array in memory.

I'm expecting to discuss / re-work this before merging, of course, but I'm putting up the PR for discussion!

Contributor Checklist:

Reviewer Checklist:

  • /azp run libertem.libertem-data passed
  • No import of GPL code from MIT code

@matbryan52 linked an issue on Jan 11, 2023 that may be closed by this pull request
@uellue (Member) commented Jan 11, 2023

Thanks for addressing this! :-)

Maybe one could also use the Dask integration? That already allows loading to memory, and storing into anything that allows slice assignment (HDF5, TileDB, Zarr, NPY stacks): https://docs.dask.org/en/stable/array-creation.html#store-dask-arrays. Probably it just has to be documented properly?
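
As a sketch, that route could look roughly like this, assuming make_dask_array from libertem.contrib.daskadapter and its (array, workers) return value; any slice-assignable target works in place of the h5py dataset:

import h5py
from libertem.contrib.daskadapter import make_dask_array

# Wrap the LiberTEM dataset as a dask array (one chunk per partition).
dask_array, workers = make_dask_array(dataset)

# Store into any target supporting slice assignment, here an HDF5 dataset.
with h5py.File('exported.h5', 'w') as f:
    target = f.create_dataset('data', shape=dask_array.shape, dtype=dask_array.dtype)
    dask_array.store(target)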

Also, in LiberTEM-live we have https://github.com/LiberTEM/LiberTEM-live/blob/master/src/libertem_live/udf/record.py, which performs parallel writing to an NPY file on the workers using a UDF. That approach could be the way to go for best performance. Instead of NPY one could also look into Zarr -- it just has to support parallel assignment to non-overlapping slices.
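
For context, using that UDF is essentially a one-liner; a sketch, assuming the constructor simply takes the output filename:

from libertem_live.udf.record import RecordUDF

# Each worker writes its own partitions into the shared NPY file in parallel.
ctx.run_udf(dataset=ds, udf=RecordUDF('recorded.npy'))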

@matbryan52 (Member, Author) commented:

> Maybe one could also use the Dask integration? That already allows loading to memory, and storing into anything that allows slice assignment (HDF5, TileDB, Zarr, NPY stacks): https://docs.dask.org/en/stable/array-creation.html#store-dask-arrays. Probably it just has to be documented properly?

Dask integration is a good way to do this, I agree. The reason that I didn't go with that for this first implementation (even in a hidden way) is that right now we can't ensure that partitions are axis-aligned, so saving via Dask could incur rechunking (or even loading all the data into memory). It is a good reason to revisit #1264 and do it properly.

It's also possible that the delayed functions will be executed on different workers / machines, leading to IPC for the array chunks, though this is probably a minor issue.

> Also, in LiberTEM-live we have https://github.com/LiberTEM/LiberTEM-live/blob/master/src/libertem_live/udf/record.py, which performs parallel writing to an NPY file on the workers using a UDF. That approach could be the way to go for best performance. Instead of NPY one could also look into Zarr -- it just has to support parallel assignment to non-overlapping slices.

Great, I can look at implementing this approach, though we would have to upstream RecordUDF or similar rather than importing from libertem-live I think.

@sk1p (Member) commented Jan 12, 2023

Some comments:

The API: if we want to extend to a parallel version later, there will probably be some interaction with a Context - we should think about what that would look like, and whether we can easily implement that in a backwards-compatible way. Also, running the export in the main process / node breaks with one "tradition" of the DataSet API, which is to only perform I/O "remotely" on an executor.

> Also, in LiberTEM-live we have https://github.com/LiberTEM/LiberTEM-live/blob/master/src/libertem_live/udf/record.py, which performs parallel writing to an NPY file on the workers using a UDF.

As far as I can see, with this we (will) have multiple ways of converting/saving:

  1. As a side-effect of running RecordUDF
  2. "Imperatively" using DataSet.save_as
  3. live: controlling the data acquisition software of the detector and telling it to acquire into a specified path (saved in the detector-specific format, like .mib or HDF5 with bslz4 or similar)
  4. live: specialized data acquisition / writing routines that are built into our receiving libraries (example: the upcoming k2is support in LiberTEM-live can write to disk very efficiently and can keep up with the data rates without any issues; data writing can be controlled via the Python API)

Both 3) and 4) can be summarized as: built for the live case; "declarative" writing; limited format support and full integration into libertem-live is still TODO.

Should we attempt to unify these in some way, at least in the interface? Having different interfaces for offline conversion vs. live acquisition would also be a good option IMHO.

> That approach could be the way to go for best performance.

Agreed for offline conversion of data; I'm not convinced that it will work in the general case for live acquisition. I haven't tried it yet in a multi-process execution, but having a single-threaded mmap based writer in the pipeline of the k2is receiver kills/severely limits performance (IIRC: page faults and memory pressure -> jitter and less efficient memory re-use).

> though we would have to upstream RecordUDF or similar rather than importing from libertem-live I think.

Agreed.

@uellue (Member) commented Jan 12, 2023

Good point about the K2 IS and dedicated high-performance low-overhead writers! In that sense, it would be a dataset option in a UDF run, and just saving or converting would be running an empty UDF set with the dataset saving option enabled?

@sk1p added this to the 0.12 milestone on Jun 28, 2023
@matbryan52 (Member, Author) commented Jun 29, 2023

After discussion with @sk1p, I updated this to use a method on the Context instead:

    def export_dataset(
        self,
        dataset: DataSet,
        *,
        path: os.PathLike,
        progress: bool = False,
        overwrite: bool = False,
    ):
        """
        Export the dataset to another format on disk
        """
        ...

with a minimal interface for now. This PR also upstreams RecordUDF to take advantage of its performance.

Whether this is something we want to add is to be discussed, but it is undeniably a very convenient way to open up some odd datasets (offline) to other tools!
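
For example, a usage sketch based on the signature above (the input file and the extension-based format inference are assumptions, not taken from the PR):

import libertem.api as lt

ctx = lt.Context()
ds = ctx.load('auto', path='input.mib')  # placeholder input file
# Assumption: the output format is inferred from the '.npy' extension;
# overwrite guards against clobbering an existing file.
ctx.export_dataset(ds, path='exported.npy', progress=True, overwrite=False)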

@matbryan52 changed the title from "DataSet method to export to another file format" to "Context.convert_dataset method to export to another file format" on Jun 29, 2023
codecov bot commented Jun 29, 2023

Codecov Report

Patch coverage: 94.44%; project coverage change: +0.23% 🎉

Comparison is base (4f9de10) 68.47% compared to head (36f3695) 68.70%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1379      +/-   ##
==========================================
+ Coverage   68.47%   68.70%   +0.23%     
==========================================
  Files         305      158     -147     
  Lines       17874    15541    -2333     
  Branches     3201     2771     -430     
==========================================
- Hits        12239    10678    -1561     
+ Misses       5115     4436     -679     
+ Partials      520      427      -93     
Impacted Files               Coverage Δ
src/libertem/udf/record.py   91.30% <91.30%> (ø)
src/libertem/api.py          91.09% <100.00%> (+1.52%) ⬆️

... and 182 files with indirect coverage changes


[5 review threads on src/libertem/api.py, all resolved]
matbryan52 and others added 4 commits on July 6, 2023, including "Good spot" (Co-authored-by: Alexander Clausen <alex@gc-web.de>)
@matbryan52 changed the title from "Context.convert_dataset method to export to another file format" to "Context.export_dataset method to export to another file format" on Jul 6, 2023
@sk1p (Member) left a review:

Looks good overall, I've added some comments in-line, mostly details.

[6 review threads on docs/source/changelog/features/export_dataset.rst, docs/source/formats.rst and src/libertem/api.py, all resolved]
matbryan52 and others added 5 commits on July 6, 2023 (Co-authored-by: Alexander Clausen <alex@gc-web.de>)
@matbryan52 (Member, Author) commented:
Thanks for the review again, sorry to have missed all those doc changes!

@matbryan52 (Member, Author) commented:
/azp run libertem.libertem-data

@azure-pipelines bot commented:
Azure Pipelines successfully started running 1 pipeline(s).

@sk1p (Member) left a review:

LGTM!

@sk1p merged commit 9275040 into LiberTEM:master on Jul 25, 2023
31 checks passed
Development

Successfully merging this pull request may close these issues:

How to get data out and other stupid questions
3 participants