Virtual overview files #69
Conversation
For reference, a run of some 3000 trains (5 minutes) with DSSC makes an overview file of about 32 MB. The cell IDs and pulse IDs compress well, and if I write those gzipped, the file is 9.2 MB. I think this use case fits well into HDF5's chunking & compression mechanism - it's easy to make useful chunks under 1 MB, at which point they fit into the HDF5 chunk cache, so sequential access can be efficient if you're using HDF5 correctly. So I'm going to compress those by default.
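The chunking-plus-gzip idea described above can be sketched with h5py. This is a minimal illustration, not the PR's actual code: the file name, chunk size, and pulse-ID pattern are all made up for the demo.

```python
import os
import tempfile

import h5py
import numpy as np

# Sketch of the idea above (names & sizes illustrative, not the PR's code):
# pulse IDs repeat from train to train, so they gzip well, and chunks under
# 1 MiB fit comfortably in HDF5's default chunk cache.
path = os.path.join(tempfile.mkdtemp(), 'overview.h5')
pulse_ids = np.tile(np.arange(400, dtype='u8'), 3000)  # ~3000 trains

with h5py.File(path, 'w') as f:
    f.create_dataset(
        'pulseId', data=pulse_ids,
        chunks=(100_000,),   # 100k x 8 bytes = ~780 KiB per chunk
        compression='gzip',
    )

with h5py.File(path, 'r') as f:
    stored = f['pulseId'].id.get_storage_size()
    ratio = stored / pulse_ids.nbytes  # well under 1 for repetitive IDs
```

Because the IDs are so repetitive, the on-disk size ends up a small fraction of the raw 9.6 MB, consistent with the 32 MB → 9.2 MB figure quoted above.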
extra_data/reader.py
Outdated
@@ -1310,12 +1310,17 @@ def RunDirectory(path, include='*', file_filter=locality.lc_any):
         Function to subset the list of filenames to open.
         Meant to be used with functions in the extra_data.locality module.
     """
-    files = [f for f in os.listdir(path) if f.endswith('.h5')]
+    files = [f for f in os.listdir(path) if f.endswith('.h5') and f != 'overview.h5']
I've assumed here that if we create a virtual overview file in a run directory, it will be called overview.h5.
I wonder if it should not contain the proposal and run number in the name. I expect that people might want to move them around (maybe have them all in the same directory for convenience - maybe we want to do that as well?), and it would be convenient if they are immediately identifiable.
something like RAW-P700000-R0001-OVERVIEW.h5
to keep something similar to the current naming scheme...
What do you think?
I think you're probably right, but I'll see if we can get some input from ITDM.
I've broadened this to allow for overview anywhere in the name. So long as we don't have a big detector called 'overview', I think this should be safe enough.
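As a pure-Python sketch, the broadened rule amounts to a case-insensitive substring check. The helper name here is mine, not necessarily what the PR uses:

```python
def is_overview_file(filename):
    # Hypothetical helper: treat any .h5 file with 'overview' anywhere
    # in its name (case-insensitive) as a virtual overview file.
    return filename.endswith('.h5') and 'overview' in filename.lower()

# Both the plain and proposal/run-numbered names are excluded from
# the list of data files:
data_files = [f for f in ['RAW-R0001-AGIPD00-S00000.h5',
                          'overview.h5',
                          'RAW-P700000-R0001-OVERVIEW.h5']
              if not is_overview_file(f)]
```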
        raise Exception("No HDF5 files found in {} with glob pattern {}".format(path, include))

    if _use_voview and (sel_files == files):
If some files were filtered out by the locality filter, this won't use the virtual overview file, to avoid accidentally trying to access data from tape when we've asked for only data on disk.

However, if you filter files with e.g. include='*AGIPD*', we'll still read a virtual overview file if possible. This could be unexpected in some situations - e.g. if you use .train_from_id() to get data from all sources, you'll see ones that weren't previously included. But the primary reason for the include= parameter is to open a run faster, and using the virtual overview file should be a fast option if it exists.
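That decision rule can be sketched as a small predicate (the function name and argument layout are my invention, not the code in the PR):

```python
def should_use_voview(files, sel_files, voview_available=True):
    # files: names matching the include= glob pattern.
    # sel_files: the subset of those kept by the locality filter.
    # Fall back to opening individual files if locality filtering removed
    # anything, to avoid touching data that's only on tape; a name-based
    # include pattern alone doesn't disqualify the overview file.
    return voview_available and set(sel_files) == set(files)
```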
Some more numbers: creating a virtual overview file for a run of ~12k trains with DSSC took 3m 30s from GPFS (proc data) and 10m from dCache (raw). The resulting files are each about 28 MiB.
What version of h5py are you using? With our anaconda environment:
~/projects/EXtra-data/extra_data/voview.py in main(argv)
146 print(f"Creating {file_path} from {len(run.files)} files...")
147 vofw = VirtualOverviewFileWriter(file_path, run)
--> 148 vofw.write()
149
150 if __name__ == '__main__':
~/projects/EXtra-data/extra_data/voview.py in write(self)
34
35 def write(self):
---> 36 self.record_source_files()
37 super().write()
38
~/projects/EXtra-data/extra_data/voview.py in record_source_files(self)
28
29 grp.create_dataset(
---> 30 'names', data=names, dtype=h5py.string_dtype(encoding='ascii')
31 )
32 grp.create_dataset('mtimes', data=mtimes, dtype='f8')
AttributeError: module 'h5py' has no attribute 'string_dtype'
Looks like I have 2.10.0 installed in my home directory. It should be easy enough to make it work with 2.9, though...
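A compatibility shim along those lines might look like this. This is my sketch, not the PR's actual fix, assuming `h5py.string_dtype` appeared in 2.10 and that `special_dtype(vlen=...)` covers older releases:

```python
import h5py

# Sketch of a backwards-compatible variable-length string dtype:
# h5py.string_dtype() only exists from h5py 2.10; earlier versions
# spell the same thing as special_dtype(vlen=...).
try:
    vlen_str = h5py.string_dtype(encoding='ascii')
except AttributeError:
    vlen_str = h5py.special_dtype(vlen=bytes)
```

Either branch yields a NumPy object dtype that `create_dataset` accepts for string data.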
Now there's a failure importing matplotlib on Python 3.7. I've reported an issue for matplotlib about this, but we can work around it by installing a newer numpy.
matplotlib has uploaded new wheels which fix the problem, so I removed the commit where I was working around that.
So, if I get it right, currently there is:
What isn't referenced:
Should we try here to add these as well or do you think this is a too big task for now?
Any idea how much space the CONTROL data would take?
Yup, that's about right - except that we're effectively writing the file format we call '0.5' at the moment. So far we don't use any of the extra 1.0 fields, so I don't think it's urgent to write them. The 'flag' field in 1.0 might also be tricky - what if different files have different flag values for the same train ID?

I can try copying the CONTROL data for a run. My guess is that it will be significantly bigger, though obviously still much smaller than the run itself. I quite like the consistency of making local indexes but using references for all the real data.
Somewhat to my surprise, adding control data didn't affect the size that much - one file I tried went from 8 MB to 16, another from 28 MB to 40.

I'm still inclined to leave CONTROL data as virtual datasets. I don't know whether there's some scenario where control data could be much bigger than the runs I tested with. It also feels clearer to say that these files we're creating are just a kind of index, and don't contain any real data. E.g. it doesn't matter much if someone who shouldn't have access to the raw data gets access to an overview file, because it doesn't contain any data.
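For reference, the virtual-dataset mechanism being relied on here is h5py's `VirtualLayout`/`VirtualSource` API. A self-contained sketch (file and dataset names fabricated for the demo; requires HDF5 >= 1.10 / h5py >= 2.9):

```python
import os
import tempfile

import h5py
import numpy as np

# Sketch: stitch per-sequence train IDs into one virtual dataset, so the
# overview file references the source data rather than copying it.
tmp = tempfile.mkdtemp()
seq_paths = []
for i in range(2):
    p = os.path.join(tmp, f'RAW-SEQ{i}.h5')  # fabricated names
    with h5py.File(p, 'w') as f:
        f['trainId'] = np.arange(i * 100, (i + 1) * 100, dtype='u8')
    seq_paths.append(p)

layout = h5py.VirtualLayout(shape=(200,), dtype='u8')
for i, p in enumerate(seq_paths):
    layout[i * 100:(i + 1) * 100] = h5py.VirtualSource(p, 'trainId', shape=(100,))

vo_path = os.path.join(tmp, 'overview.h5')
with h5py.File(vo_path, 'w') as f:
    f.create_virtual_dataset('trainId', layout)

with h5py.File(vo_path, 'r') as f:
    combined = f['trainId'][:]       # reads through to the source files
    virtual = f['trainId'].is_virtual
```

The overview file itself stores only the mapping, which is why an index over a multi-TB run can stay in the tens of MB.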
I'm not too surprised; most of the data in CONTROL is 0d data, repeating along the run... Actually, in future it might even be smaller, since the DAQ storage policy can now filter device properties, so there should be much less CONTROL data than is currently in files. That being said, I agree with you.
I had a quick look again. I'm a bit afraid that when we deploy this and people start using it, it will hide the new attributes from the original files (timestamp, flag, metadata, ...). We should maybe try to make EXtra-data use these new datasets (not in this PR, but prioritise this task).
Yes, that's a good point, we should work on exposing the new datasets. We'll need to work out what to do with … It should definitely have tests as well.

I'm less sure about a new command at the moment - I think…
Ideally the … But I guess it's much simpler to exclude all data for a train if a flag is set in any file. Maybe this happens so rarely that we should not worry about it? (I'd be particularly afraid if a flag is set for all trains in a file, say for a faulty detector module.)
It'd be very helpful in terms of the FAIR concept to include as much from RUN as possible.
I'm not sure whether to bother with RUN. IIRC, it duplicates the first entry of every CONTROL dataset, but also includes everything which is marked as not recorded in Karabo. There's no way to access it in EXtra-data, and I don't think anyone has ever asked for it (see #26), which suggests that there's not much interest in it. I imagine there will be some overhead in reading and creating thousands of tiny datasets, and I'm not convinced it's worth paying that on a purely speculative basis in case someone one day wants it.

Maybe we could go rogue from the EuXFEL file format a bit, and make a set of links to the RUN groups in different files. Unfortunately, I don't think there's anything like a 'virtual group' which could pool the contents of several groups.

I'm also still working out what we can do about the per-run information (like the start & end timestamps now in metadata). The concept of a run isn't too important in EXtra-data: you can split a run up, or join runs together, so it's not exactly clear what happens to per-run info then.
I can see this being useful in the (far) future, e.g. for automated analysis pipelines, where this could provide useful configuration (device version, result X was processed with algo A, parameter P, etc.). But agreed that this is not important at the moment.

In practice, I assume that most of the time the DataCollection will represent a run, or consecutive runs, so that information might still be relevant in most cases? 🤔
In fact, I misremembered - it's creationDate and updateDate that are in the metadata, not run start & end time. That possibly makes it simpler in one sense - they differ per file, not per run - but we also want to avoid opening 400 files when the user asks 'when was this run created?'
I see... Probably the update date does not make much sense, but we could at least have the creation date. I was often annoyed at not being able to know approximately when a run was taken (even which day...). For that we could open only the sequence 0 files, which isn't too expensive?
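A rough sketch of that idea, reading creationDate from only the first-sequence files. The `*S00000.h5` glob and the `METADATA/creationDate` dataset path follow the EuXFEL naming scheme discussed above, but treat the details as assumptions:

```python
import glob
import os
import tempfile

import h5py

def run_creation_dates(run_dir):
    # Sketch: open only sequence-0 files, rather than all ~400 files
    # in a run, to answer 'when was this run created?'
    dates = {}
    for path in sorted(glob.glob(os.path.join(run_dir, '*S00000.h5'))):
        with h5py.File(path, 'r') as f:
            dates[os.path.basename(path)] = f['METADATA/creationDate'][0]
    return dates

# Demo with a fabricated run directory: one seq-0 file with metadata,
# one later sequence file that should be skipped.
run_dir = tempfile.mkdtemp()
with h5py.File(os.path.join(run_dir, 'RAW-R0001-DA01-S00000.h5'), 'w') as f:
    f['METADATA/creationDate'] = [b'20200101T120000Z']
with h5py.File(os.path.join(run_dir, 'RAW-R0001-DA01-S00001.h5'), 'w') as f:
    f['trainId'] = [1]

dates = run_creation_dates(run_dir)
```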
I'm unsure how to handle 'suspect' trains (where …). We could also set the flag to zero for a train in the virtual overview file where all data files containing that train mark it as bad, or where any of them do.

Other than that, things to do:
I'd like to check that, actually. Flagged train IDs are often off by a large margin, and it's possible that it is not a problem in the end.
I think I found some cases of train IDs just 1 or 2 out of sequence when I looked ~1 year ago, but I can't easily point to specific runs. More data on what's going on is definitely welcome. But even if it's rare, we're handling the flag per-file, so we logically still have to combine the values somehow. 🙂
For now, I have implemented it such that flag is 0 (meaning suspect) in the virtual overview file only for trains where it's 0 in all source files. This should still catch the error where AGIPD data is ~60k trains out, but some other data will be erroneously included even if you use …

We could add extra information somewhere which records the flag value per train & per source, so it can store that train 1234 is valid for source X but not source Y. But that's an extra complexity to read the data - at the moment, you can read the virtual overview file like any other EuXFEL HDF5 file. Or we could make two such files, one including suspect trains and one excluding them. But that feels like an awkward workaround.
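The "flag is 0 only if it's 0 in every file" rule amounts to OR-ing the per-file flags for each train. A toy sketch of that combination (not the PR's code; the data layout here is simplified to dicts):

```python
def combine_flags(per_file_flags):
    # per_file_flags: one {train_id: flag} mapping per source file.
    # A train ends up suspect (0) in the overview only if every file
    # containing it marks it suspect; one valid (1) vote keeps it valid.
    combined = {}
    for flags in per_file_flags:
        for tid, flag in flags.items():
            combined[tid] = combined.get(tid, 0) | flag
    return combined

# Train 1002 is suspect in one file but valid in another, so it stays
# valid; train 1003 is suspect everywhere it appears.
flags = combine_flags([
    {1001: 1, 1002: 0},
    {1001: 1, 1002: 1},
    {1003: 0},
])
```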
Virtual overview edits
I think having some sort of version number is a good idea. Do you think we can get away with a single number, and it ignores files with a newer version than it recognises? Or should we have major & minor numbers, so we can record versions for backwards-compatible changes?

I'd say no to a CLI entry point, at least for now. If the plan is for us to autogenerate these files as runs are taken, we can do that with python -m extra_data.voview.
A single number is probably sufficient 👍
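With a single number, the check could be as simple as this sketch. The attribute name `format_version` and its location are my invention, not necessarily what the PR writes:

```python
import os
import tempfile

import h5py

FORMAT_VERSION = 1  # version this reader understands (illustrative)

def voview_readable(path):
    # Ignore overview files written with a newer format version than this
    # code recognises; missing attribute counts as the oldest version.
    with h5py.File(path, 'r') as f:
        return bool(f.attrs.get('format_version', 0) <= FORMAT_VERSION)

# Demo: a file from a hypothetical future version is skipped,
# a current-version file is accepted.
path = os.path.join(tempfile.mkdtemp(), 'overview.h5')
with h5py.File(path, 'w') as f:
    f.attrs['format_version'] = 2
ok_future = voview_readable(path)

with h5py.File(path, 'w') as f:
    f.attrs['format_version'] = 1
ok_current = voview_readable(path)
```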
Thanks for adding more tests! This LGTM
extra_data/voview.py
Outdated
        return False  # Basic check that things make sense

    files_now = {f for f in os.listdir(run_dir)
                 if f.endswith('.h5') and ('overview' not in f.lower())}
same thing here?
Co-authored-by: Thomas Michelat <32831491+tmichela@users.noreply.github.com>
This includes:
This does not automatically create virtual overview files. To make one, run python -m extra_data.voview path/to/run_dir. On a run of some 3000 trains with the DSSC detector, this takes ~20 seconds when the source files are cached locally. If they're not cached, it may take much longer.