Support for DM3/DM4 #291

Closed · uellue opened this issue Mar 12, 2019 · 5 comments · Fixed by #497

uellue commented Mar 12, 2019

Support DM3 files in order to open this dataset: https://zenodo.org/record/2566137

See also #254
The simulated data is in npz format #222

CC @joverbee: thank you for publishing the data on Zenodo!

@uellue added the "enhancement", "good first issue" and "file formats and I/O" labels on Mar 12, 2019
@sk1p changed the title from "Support for DM3" to "Support for DM3/DM4" on Jun 17, 2019
@sk1p mentioned this issue on Jun 17, 2019

sk1p commented Jun 17, 2019

Support for DM3 and DM4 can be added in one reader.

I've experimented with using the ncempy reader for reading stacks of DM3 files, and sadly the performance wasn't great. This is mostly because a lot of header parsing is going on, and that overhead is multiplied by the number of files in the stack. For large DM4 files with 4D STEM data, I think ncempy should be fine (I haven't tested this yet).
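For reference, reading a single file with ncempy looks roughly like the sketch below. The full tag-tree parse inside dmReader is the per-file overhead that gets multiplied across a stack. This is a minimal sketch assuming ncempy's ncempy.io.dm.dmReader convenience function; the file name is a placeholder.

```python
# Minimal sketch: reading one DM3/DM4 file with ncempy's convenience reader.
# The header/tag-tree parsing inside dmReader is the per-file cost that is
# repeated for every file in a stack. 'frame_0000.dm3' is a placeholder path.
from ncempy.io.dm import dmReader

out = dmReader('frame_0000.dm3')   # parses the full tag tree, then reads the data
print(out['data'].shape, out['data'].dtype)
```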

ercius commented Sep 26, 2019

> Support for DM3 and DM4 can be added in one reader.

Very true. The two formats differ very little. ncempy implements reading either type automatically.

@sk1p @uellue I'm in charge of ncempy and I just found this comment. You are correct that the performance is hindered by the amount of parsing in the header. That cannot be avoided, since it is impossible to figure out where the data starts in a file without parsing its header. I have two ideas that might help in this case, though:

  1. Use the on_memory=True keyword. This is now the default in the most recent version. It reads the entire file into memory, which can make the parsing much faster, especially when the data is on a distributed file system.

  2. You can use the internals of the fileDM class to figure out the offset to one of the files. It should be in fileDM.dataOffsetArray; other relevant attributes are the dataTypeArray and the dataSizeArrays. If all of the other files were acquired in an identical way, the header might be exactly the same; then the data will sit at the same position in every file, and you can simply seek to that position and read the data (see the sketch after this list). I've done this before with K2 data extracted in the hour/minute/second/file.dm4 structure with 400 files per directory.
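To illustrate idea 2 (a sketch, not anyone's actual implementation): parse a single reference file with ncempy to learn the data offset, shape and dtype, then read every sibling file directly at that offset with NumPy, under the assumption that all files were written identically. The paths are placeholders, and the offset attribute on fileDM (called dataOffsetArray above) may have a different name depending on the ncempy version.

```python
# Hypothetical sketch of the "identical header, identical offset" shortcut:
# parse one reference file with ncempy, then bulk-read its siblings with NumPy.
import glob
import numpy as np
from ncempy.io import dm

ref = dm.fileDM('stack/frame_0000.dm4')      # placeholder path; recent ncempy
                                             # versions parse the header here
ref_data = ref.getDataset(0)['data']         # the slow, fully parsed read
offset = ref.dataOffset[-1]                  # data offset from the parsed header;
                                             # attribute name may vary by version
shape, dtype = ref_data.shape, ref_data.dtype

frames = []
for path in sorted(glob.glob('stack/*.dm4')):
    # No per-file header parsing: seek straight to the cached offset.
    flat = np.fromfile(path, dtype=dtype, count=ref_data.size, offset=offset)
    frames.append(flat.reshape(shape))

stack = np.stack(frames)                     # (n_files, sy, sx)
```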

Using a large number of DM files is not an efficient storage method for large data sets. It would be best to load them all once (slow) and then write them out to a more efficient file type such as HDF5 or a NumPy array.
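A rough one-off conversion in that spirit might look like the following sketch: read each DM file once with ncempy (slow) and write the whole stack into a single HDF5 dataset with h5py. The file names and dataset layout are assumptions, not part of any existing tool.

```python
# One-off conversion sketch: stack many DM3 files into a single HDF5 dataset.
import glob
import h5py
from ncempy.io.dm import dmReader

paths = sorted(glob.glob('stack/*.dm3'))         # placeholder location
first = dmReader(paths[0])['data']               # learn per-frame shape/dtype

with h5py.File('stack.h5', 'w') as out:
    ds = out.create_dataset(
        'data', shape=(len(paths),) + first.shape, dtype=first.dtype
    )
    ds[0] = first
    for i, path in enumerate(paths[1:], start=1):
        ds[i] = dmReader(path)['data']           # one full header parse per file
```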

Please let me know if you have any questions. I might be able to write a quick script or example if I took a look at the data.

uellue commented Sep 26, 2019

Hi @ercius, thanks for your input!

> You can use the internals of the fileDM class to figure out the offset to one of the files.

In the data stack that we received (https://zenodo.org/record/2566137), the individual files had different offsets. It probably depends on the software that created the files.

Probably your proposal to convert the data once to a more suitable format is the most pragmatic way forward for collections of many individual DM3 or DM4 files.

As a side note, LiberTEM has excellent support for K2IS raw files, so no need to convert to DM4 -- quite the opposite. :-) You could use LiberTEM for your entire computation or, alternatively, create efficient Dask arrays from any dataset type.
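To illustrate the Dask route (this is generic dask.delayed usage, not LiberTEM's own mechanism for exposing datasets as Dask arrays), one could wrap a folder of DM3 files as shown below; the paths, shapes and dtypes are assumptions.

```python
# Generic sketch: a lazy Dask array over a folder of DM3 files via ncempy.
import glob
import dask
import dask.array as da
from ncempy.io.dm import dmReader

paths = sorted(glob.glob('stack/*.dm3'))        # placeholder location
sample = dmReader(paths[0])['data']             # read one frame for shape/dtype

read_frame = dask.delayed(lambda p: dmReader(p)['data'])
lazy_frames = [
    da.from_delayed(read_frame(p), shape=sample.shape, dtype=sample.dtype)
    for p in paths
]
stack = da.stack(lazy_frames)                   # (n_files, sy, sx), computed lazily
print(stack)
```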

@sk1p, you had investigated ways to speed up the DM3 parsing? Doing it in memory sounds sensible for a start. :-) The last state of the discussion that I remember was that one would probably have to write a C/Cython/Numba implementation to deal with all the individual little data structures efficiently if one wanted to use folders of small DM3/DM4 files with decent throughput. That was the point where we had to weigh up cost and benefit.

sk1p commented Oct 7, 2019

Just as a note, there is an initial implementation for a stack-of-dm-files reader: https://github.com/LiberTEM/LiberTEM/blob/dm-reader/src/libertem/io/dataset/dm.py

It's been a while since I worked on this, but I think I implemented some caching of data regions, so the reader parses the file structure once when the dataset is opened and afterwards reads directly using offset + size. Still, it didn't work well with many files, with behavior similar to #440 as far as I remember.
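The caching idea could be sketched like this (illustrative only, not the code in the linked dm.py; parse_data_region is a hypothetical stand-in for the DM tag-tree parser): parse every file's structure once when the dataset is opened, remember (offset, shape, dtype), and serve later reads with a plain seek.

```python
# Illustrative cache of data regions, parsed once at open time.
import numpy as np

class StackOfDMFiles:
    def __init__(self, paths, parse_data_region):
        # parse_data_region(path) -> (offset, shape, dtype) is a hypothetical
        # stand-in for the DM header parser; it runs once per file, at open.
        self._regions = {p: parse_data_region(p) for p in paths}

    def read_frame(self, path):
        # Later reads skip the parser entirely: seek + read at the cached offset.
        offset, shape, dtype = self._regions[path]
        nbytes = int(np.prod(shape)) * np.dtype(dtype).itemsize
        with open(path, 'rb') as f:
            f.seek(offset)
            raw = f.read(nbytes)
        return np.frombuffer(raw, dtype=dtype).reshape(shape)
```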

I also have some uncommitted prototypes somewhere; if there is interest, I can dig them out and put them into the prototype folder.

@sk1p removed the "good first issue" label on Oct 7, 2019

sk1p commented Dec 4, 2019

In PR #497 I have added a reader for stacks of DM3/DM4 files. It uses the method described by @ercius, reading the offset via ncempy and caching it per file. By reading the offsets in parallel, we get the initialization time down to bearable levels (~20 s for 2500 files on our workstation), but depending on the operation, initialization is still comparatively slow. There is a parameter for assuming that all files in the stack have the same data offset, but that still needs to be tested (cc @woozey).
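The parallel offset discovery could look roughly like the sketch below (not the actual PR #497 code): each worker parses one file's header via ncempy and reports the data offset, so the per-file parsing overlaps instead of running serially. The paths and the offset attribute name on fileDM are assumptions.

```python
# Sketch of parallel offset discovery across a stack of DM4 files.
import glob
from concurrent.futures import ProcessPoolExecutor
from ncempy.io import dm

def data_offset(path):
    f = dm.fileDM(path)              # parses this file's header
    # attribute name may differ between ncempy versions
    return path, f.dataOffset[-1]

if __name__ == '__main__':
    paths = sorted(glob.glob('stack/*.dm4'))     # placeholder location
    with ProcessPoolExecutor() as pool:
        offsets = dict(pool.map(data_offset, paths))
    print('cached offsets for', len(offsets), 'files')
```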

Proper GUI support needs #498.
