Support for DM3/DM4 #291

Closed · uellue opened this issue Mar 12, 2019 · 5 comments · Fixed by #497

uellue commented Mar 12, 2019

Support DM3 files in order to open this dataset: https://zenodo.org/record/2566137

See also #254
The simulated data is in npz format #222

CC @joverbee: thank you for publishing the data on Zenodo!

@uellue added the "enhancement", "good first issue" and "file formats and I/O" labels on Mar 12, 2019
@sk1p changed the title from "Support for DM3" to "Support for DM3/DM4" on Jun 17, 2019
@sk1p mentioned this issue on Jun 17, 2019

sk1p commented Jun 17, 2019

Support for DM3 and DM4 can be added in one reader.

I've experimented with using the ncempy reader for reading stacks of DM3 files, and sadly the performance wasn't great. This is mostly because a lot of header parsing is going on, and that overhead is multiplied by the number of files in the stack. For large DM4 files with 4D STEM data, I think ncempy should be fine (I haven't tested this yet).
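For reference, reading a single file with ncempy looks roughly like the sketch below. The full tag-tree parse inside dmReader is the per-file overhead that gets multiplied across a stack. This is a minimal sketch assuming ncempy's ncempy.io.dm.dmReader convenience function; the file name is a placeholder.

```python
# Minimal sketch: reading one DM3/DM4 file with ncempy's convenience reader.
# The header/tag-tree parsing inside dmReader is the per-file cost that is
# repeated for every file in a stack. 'frame_0000.dm3' is a placeholder path.
from ncempy.io.dm import dmReader

out = dmReader('frame_0000.dm3')   # parses the full tag tree, then reads the data
print(out['data'].shape, out['data'].dtype)
```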

ercius commented Sep 26, 2019

> Support for DM3 and DM4 can be added in one reader.

Very true. The two formats differ very little. ncempy implements reading either type automatically.

@sk1p @uellue I'm in charge of ncempy and I just found this comment. You are correct that the performance is hindered by the amount of parsing in the header. That cannot be avoided, since it is impossible to figure out where the data starts in a file without parsing its header. I have two ideas that might help in this case, though:

  1. Use the on_memory=True keyword. This is now the default in the most recent version. It reads the entire file into memory, which can make the parsing much faster, especially when the data is on a distributed file system.

  2. You can use the internals of the fileDM class to figure out the offset to one of the files. It should be in fileDM.dataOffsetArray; other relevant attributes are the dataTypeArray and the dataSizeArrays. If all of the other files were acquired in an identical way, the header might be exactly the same; then the data will sit at the same position in every file, and you can simply seek to that position and read the data (see the sketch after this list). I've done this before with K2 data extracted in the hour/minute/second/file.dm4 structure with 400 files per directory.
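To illustrate idea 2 (a sketch, not anyone's actual implementation): parse a single reference file with ncempy to learn the data offset, shape and dtype, then read every sibling file directly at that offset with NumPy, under the assumption that all files were written identically. The paths are placeholders, and the offset attribute on fileDM (called dataOffsetArray above) may have a different name depending on the ncempy version.

```python
# Hypothetical sketch of the "identical header, identical offset" shortcut:
# parse one reference file with ncempy, then bulk-read its siblings with NumPy.
import glob
import numpy as np
from ncempy.io import dm

ref = dm.fileDM('stack/frame_0000.dm4')      # placeholder path; recent ncempy
                                             # versions parse the header here
ref_data = ref.getDataset(0)['data']         # the slow, fully parsed read
offset = ref.dataOffset[-1]                  # data offset from the parsed header;
                                             # attribute name may vary by version
shape, dtype = ref_data.shape, ref_data.dtype

frames = []
for path in sorted(glob.glob('stack/*.dm4')):
    # No per-file header parsing: seek straight to the cached offset.
    flat = np.fromfile(path, dtype=dtype, count=ref_data.size, offset=offset)
    frames.append(flat.reshape(shape))

stack = np.stack(frames)                     # (n_files, sy, sx)
```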

Using a large number of DM files is not an efficient storage method for large data sets. It would be best to load them all once (slow) and then write them out to a more efficient file type such as HDF5 or a NumPy array.
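A rough one-off conversion in that spirit might look like the following sketch: read each DM file once with ncempy (slow) and write the whole stack into a single HDF5 dataset with h5py. The file names and dataset layout are assumptions, not part of any existing tool.

```python
# One-off conversion sketch: stack many DM3 files into a single HDF5 dataset.
import glob
import h5py
from ncempy.io.dm import dmReader

paths = sorted(glob.glob('stack/*.dm3'))         # placeholder location
first = dmReader(paths[0])['data']               # learn per-frame shape/dtype

with h5py.File('stack.h5', 'w') as out:
    ds = out.create_dataset(
        'data', shape=(len(paths),) + first.shape, dtype=first.dtype
    )
    ds[0] = first
    for i, path in enumerate(paths[1:], start=1):
        ds[i] = dmReader(path)['data']           # one full header parse per file
```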

Please let me know if you have any questions. I might be able to write a quick script or example if I took a look at the data.

uellue commented Sep 26, 2019

Hi @ercius, thanks for your input!

> You can use the internals of the fileDM class to figure out the offset to one of the files.

In the data stack that we received (https://zenodo.org/record/2566137), the individual files had different offsets. It probably depends on the software that created the files.

Probably your proposal to convert the data once to a more suitable format is the most pragmatic way forward for collections of many individual DM3 or DM4 files.

As a side note, LiberTEM has excellent support for K2IS raw files, so no need to convert to DM4 -- quite the opposite. :-) You could use LiberTEM for your entire computation or, alternatively, create efficient Dask arrays from any dataset type.
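To illustrate the Dask route (this is generic dask.delayed usage, not LiberTEM's own mechanism for exposing datasets as Dask arrays), one could wrap a folder of DM3 files as shown below; the paths, shapes and dtypes are assumptions.

```python
# Generic sketch: a lazy Dask array over a folder of DM3 files via ncempy.
import glob
import dask
import dask.array as da
from ncempy.io.dm import dmReader

paths = sorted(glob.glob('stack/*.dm3'))        # placeholder location
sample = dmReader(paths[0])['data']             # read one frame for shape/dtype

read_frame = dask.delayed(lambda p: dmReader(p)['data'])
lazy_frames = [
    da.from_delayed(read_frame(p), shape=sample.shape, dtype=sample.dtype)
    for p in paths
]
stack = da.stack(lazy_frames)                   # (n_files, sy, sx), computed lazily
print(stack)
```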

@sk1p, you had investigated ways to speed up the DM3 parsing? Doing it in memory sounds sensible for a start. :-) The last state of the discussion that I remember was that one would probably have to write a C/Cython/Numba implementation to deal with all the individual little data structures efficiently if one wanted to use folders of small DM3/DM4 files with decent throughput. That was the point where we had to weigh up cost and benefit.

sk1p commented Oct 7, 2019

Just as a note, there is an initial implementation for a stack-of-dm-files reader: https://github.com/LiberTEM/LiberTEM/blob/dm-reader/src/libertem/io/dataset/dm.py

It's been a while since I worked on this, but I think I implemented some caching of data regions, so the reader parses the file structure once when the dataset is opened and afterwards reads directly using offset + size. Still, it didn't work well with many files, with behavior similar to #440 as far as I remember.
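The caching idea could be sketched like this (illustrative only, not the code in the linked dm.py; parse_data_region is a hypothetical stand-in for the DM tag-tree parser): parse every file's structure once when the dataset is opened, remember (offset, shape, dtype), and serve later reads with a plain seek.

```python
# Illustrative cache of data regions, parsed once at open time.
import numpy as np

class StackOfDMFiles:
    def __init__(self, paths, parse_data_region):
        # parse_data_region(path) -> (offset, shape, dtype) is a hypothetical
        # stand-in for the DM header parser; it runs once per file, at open.
        self._regions = {p: parse_data_region(p) for p in paths}

    def read_frame(self, path):
        # Later reads skip the parser entirely: seek + read at the cached offset.
        offset, shape, dtype = self._regions[path]
        nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
        with open(path, 'rb') as f:
            f.seek(offset)
            raw = f.read(nbytes)
        return np.frombuffer(raw, dtype=dtype).reshape(shape)
```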

I also have some uncommitted prototypes somewhere; if there is interest, I can dig them out and put them into the prototype folder.

@sk1p removed the "good first issue" label on Oct 7, 2019

sk1p commented Dec 4, 2019

In PR #497 I have added a reader for stacks of DM3/DM4 files. It uses the method described by @ercius, reading the offset via ncempy and caching it per file. By reading the offsets in parallel, we get the initialization time down to bearable levels (~20 s for 2500 files on our workstation), but depending on the operation, initialization is still comparatively slow. There is a parameter for assuming that all files in the stack have the same data offset, but that still needs to be tested (cc @woozey).
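The parallel offset discovery could look roughly like the sketch below (not the actual PR #497 code): each worker parses one file's header via ncempy and reports the data offset, so the per-file parsing overlaps instead of running serially. The paths and the offset attribute name on fileDM are assumptions.

```python
# Sketch of parallel offset discovery across a stack of DM4 files.
import glob
from concurrent.futures import ProcessPoolExecutor
from ncempy.io import dm

def data_offset(path):
    f = dm.fileDM(path)              # parses this file's header
    # attribute name may differ between ncempy versions
    return path, f.dataOffset[-1]

if __name__ == '__main__':
    paths = sorted(glob.glob('stack/*.dm4'))     # placeholder location
    with ProcessPoolExecutor() as pool:
        offsets = dict(pool.map(data_offset, paths))
    print('cached offsets for', len(offsets), 'files')
```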

Proper GUI support needs #498.
