Support for DM3/DM4 #291
Support for DM3 and DM4 can be added in one reader. I've experimented with using the ncempy reader for reading stacks of DM3 files, and sadly, the performance wasn't great. This is mostly because a lot of parsing is going on, and that overhead is multiplied by the number of files in the stack. I think ncempy should be fine for large DM4 files with 4D STEM data (I have not tested this yet).
Very true. The two formats differ very little. ncempy implements reading either type automatically. @sk1p @uellue I'm in charge of ncempy and I just found this comment. You are correct that the performance is hindered by the amount of parsing in the header. That cannot be avoided, since it is impossible to figure out where the data starts in the file before you parse the header. I have two ideas which might help you in this case though:
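To illustrate why the parsing cost is unavoidable, here is a minimal sketch of a tag-based container (this is an invented toy layout, NOT the real DM3/DM4 structure): the data offset is only known after walking every header tag, so each file open pays that cost at least once.

```python
import io
import struct

def write_fake_file(n_tags):
    # build an invented tag-based file: tag count, then variable-length
    # tags, then the "image data" at an offset that depends on the tags
    buf = io.BytesIO()
    buf.write(struct.pack('<I', n_tags))          # number of header tags
    for i in range(n_tags):
        name = f'tag{i}'.encode()
        buf.write(struct.pack('<H', len(name)))   # tag name length
        buf.write(name)                           # tag name
        buf.write(struct.pack('<I', i))           # tag payload
    buf.write(b'\x01\x02\x03\x04')                # "image data" starts here
    return buf.getvalue()

def find_data_offset(raw):
    # we must walk every tag before the data offset is known
    (n_tags,) = struct.unpack_from('<I', raw, 0)
    pos = 4
    for _ in range(n_tags):
        (name_len,) = struct.unpack_from('<H', raw, pos)
        pos += 2 + name_len + 4                   # length field + name + payload
    return pos

raw = write_fake_file(3)
offset = find_data_offset(raw)
print(offset, raw[offset:])  # 34 b'\x01\x02\x03\x04'
```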
Using a large number of DM files is not an efficient way to store large data sets. It would be best to load them all once (slow) and then write them to a more efficient data file type like HDF5 or a NumPy array. Please let me know if you have any questions. I might be able to write a quick script or example if I took a look at the data.
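A hedged sketch of the "convert once" idea using plain binary files (in practice one would read the DM3 frames with ncempy and write HDF5; the filenames and the 16-byte fake header below are made up for illustration):

```python
import os
import struct
import tempfile

FRAME_BYTES = 8  # pretend each frame is 8 bytes of pixel data

with tempfile.TemporaryDirectory() as d:
    # simulate a stack of small per-frame files, each with its own header
    for i in range(5):
        with open(os.path.join(d, f'frame_{i:03d}.bin'), 'wb') as f:
            f.write(b'H' * 16)                        # fake per-file header
            f.write(struct.pack('<8B', *([i] * 8)))   # fake pixel data

    # slow pass, done once: parse each file, strip headers,
    # write everything into one consolidated file
    stack_path = os.path.join(d, 'stack.raw')
    names = sorted(n for n in os.listdir(d) if n.startswith('frame_'))
    with open(stack_path, 'wb') as out:
        for name in names:
            with open(os.path.join(d, name), 'rb') as f:
                f.seek(16)                 # skip header, located by parsing
                out.write(f.read(FRAME_BYTES))

    # fast access afterwards: frame i lives at i * FRAME_BYTES, no parsing
    with open(stack_path, 'rb') as f:
        f.seek(3 * FRAME_BYTES)
        frame3 = f.read(FRAME_BYTES)

print(frame3)
```

The same pattern maps directly onto an HDF5 dataset, which additionally gives you chunking and compression for free.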
Hi @ercius, thanks for your input!
In a data stack that we got https://zenodo.org/record/2566137, the individual files had different offsets. It probably really depends on the software that created the files. Probably your proposal to convert the data once to a more suitable format is the most pragmatic way forward for collections of many individual DM3 or DM4 files. As a side note, LiberTEM has excellent support for K2IS raw files, so no need to convert to DM4 -- quite the opposite. :-) You could use LiberTEM for your entire computation or, alternatively, create efficient Dask arrays from any dataset type. @sk1p, you had investigated ways to speed up the DM3 parsing? Doing it in memory sounds sensible for a start. :-) The last state of the discussion that I remember was that one would probably have to write a C/Cython/Numba implementation to deal with all the individual little data structures efficiently if one wanted to use folders of small DM3/DM4 files with decent throughput. That was the point where we had to weigh up cost and benefit. |
Just as a note, there is an initial implementation of a stack-of-DM-files reader: https://github.com/LiberTEM/LiberTEM/blob/dm-reader/src/libertem/io/dataset/dm.py It's been a while since I worked on this, but I think I implemented some caching of data regions, so the reader would parse the file structure once, on opening the dataset, and afterwards read directly using offset+size. Still, it didn't work well with many files, with behavior similar to #440 as far as I remember. I also have some uncommitted prototypes somewhere; if there is interest I can dig them out and put them into the prototype folder.
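The caching idea described above can be sketched roughly as follows (the two-uint32 header here is invented for illustration, not the real DM3 layout): parse each file's header once when opening the dataset, remember `(offset, size)`, and serve all later reads with a plain seek-and-slice.

```python
import struct

class StackOfFiles:
    """Toy stack reader: parse headers once, cache the data regions."""

    def __init__(self, files):
        # one-time cost: parse every header to locate the data region
        self._regions = {}
        for name, raw in files.items():
            offset, size = struct.unpack_from('<II', raw, 0)
            self._regions[name] = (offset, size)
        self._files = files

    def read_frame(self, name):
        # later reads are cheap: no parsing, just cached offset + size
        offset, size = self._regions[name]
        return self._files[name][offset:offset + size]

# fake files: 8-byte header (offset, size) followed by payload
files = {
    'a.dm3': struct.pack('<II', 8, 4) + b'AAAA',
    'b.dm3': struct.pack('<II', 8, 2) + b'BB\x00\x00',
}
ds = StackOfFiles(files)
print(ds.read_frame('a.dm3'))  # b'AAAA'
```

Even with this caching, the one-time parsing pass still scales linearly with the number of files, which matches the behavior reported for many-file stacks.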
In PR #497 I have added a reader for stacks of DM3/DM4 files. It uses the method described by @ercius for reading the offset via ncempy, which is then cached. By reading the offsets in parallel, we get the initialization time down to a bearable level (~20 s for 2500 files on our workstation), but depending on the operation, initialization is still comparatively slow. There is a parameter for assuming that all files in the stack have the same data offset, but that still needs to be tested (cc @woozey). Proper GUI support needs #498.
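A hedged sketch of the parallel initialization idea (in the PR the real offsets come from ncempy; `parse_offset` below is a stand-in, and the `same_offset` shortcut mirrors the parameter mentioned above). Since header parsing is largely I/O-bound, a thread pool can cut the wall-clock setup time roughly by the number of concurrent readers.

```python
from concurrent.futures import ThreadPoolExecutor

def parse_offset(name):
    # stand-in for the expensive header parse of one DM3/DM4 file
    return len(name)  # fake "offset" for illustration

def init_offsets(filenames, same_offset=False, workers=8):
    if same_offset:
        # optional shortcut: assume all files share the first file's offset
        first = parse_offset(filenames[0])
        return {name: first for name in filenames}
    # otherwise parse all headers concurrently
    with ThreadPoolExecutor(max_workers=workers) as pool:
        offsets = list(pool.map(parse_offset, filenames))
    return dict(zip(filenames, offsets))

names = [f'scan_{i:04d}.dm4' for i in range(100)]
offsets = init_offsets(names)
print(len(offsets), offsets['scan_0000.dm4'])
```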
Support DM3 files in order to open this dataset: https://zenodo.org/record/2566137
See also #254
The simulated data is in npz format #222
CC:
@joverbee, thank you for publishing the data on Zenodo!