
Virtual overview files #69

Merged — 30 commits, merged into master on Jan 10, 2022

Conversation

takluyver
Member

This includes:

  • Finding virtual overview files in the same locations we use for cached run maps
  • Writing a virtual overview file, including metadata about the source files it covers
  • Checking that metadata against the files in a run directory, to see if the overview file is valid
  • RunDirectory and open_run will use a valid overview file if available, so long as no locality filter is in use

This does not automatically create virtual overview files. To make one, run python -m extra_data.voview path/to/run_dir. On a run of some 3000 trains with the DSSC detector, this takes ~20 seconds when the source files are cached locally. If they're not cached, it may take much longer.
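The metadata check mentioned above — comparing recorded source-file names and modification times against what's in the run directory — could be sketched like this. This is a minimal sketch with a hypothetical helper name, not the actual implementation in extra_data/voview.py:

```python
import os
import os.path as osp

def voview_is_valid(recorded_names, recorded_mtimes, run_dir):
    """Check recorded source-file names & mtimes against a run directory."""
    # Data files currently in the run directory (overview files excluded)
    files_now = {f for f in os.listdir(run_dir)
                 if f.endswith('.h5') and 'overview' not in f.lower()}
    if set(recorded_names) != files_now:
        return False  # files were added or removed since the overview was made
    for name, mtime in zip(recorded_names, recorded_mtimes):
        if osp.getmtime(osp.join(run_dir, name)) != mtime:
            return False  # a source file was rewritten
    return True
```

If any source file was added, removed, or rewritten since the overview was created, the overview is treated as stale and the run is opened from the individual files instead.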

@takluyver
Member Author

For reference, the overview file for a run of some 3000 trains (5 minutes) with DSSC makes an overview file of about 32 MB. The cell IDs and pulse IDs compress well, and if I write those gzipped, the file is 9.2 MB.

I think this use case fits well into HDF5's chunking & compression mechanism - it's easy to make useful chunks under 1 MB, at which point they fit into the HDF5 chunk cache, so sequential access can be efficient if you're using HDF5 correctly. So I'm going to compress those by default.
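The chunking and compression being described can be illustrated with plain h5py — illustrative shapes and dataset names, not the exact voview layout:

```python
import h5py
import numpy as np

# Pulse IDs for ~3000 trains x 400 pulses: highly repetitive data, so
# gzip shrinks it well. Chunks are kept under 1 MB so they fit in the
# default HDF5 chunk cache and sequential reads stay efficient.
pulse_ids = np.tile(np.arange(400, dtype='u8'), 3000)

with h5py.File('overview-demo.h5', 'w') as f:
    f.create_dataset(
        'INSTRUMENT/detector/image/pulseId',
        data=pulse_ids,
        chunks=(100_000,),   # 100k * 8 bytes = ~0.8 MB per chunk
        compression='gzip',
    )
```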

@takluyver takluyver mentioned this pull request Jun 25, 2020
@@ -1310,12 +1310,17 @@ def RunDirectory(path, include='*', file_filter=locality.lc_any):
Function to subset the list of filenames to open.
Meant to be used with functions in the extra_data.locality module.
"""
-    files = [f for f in os.listdir(path) if f.endswith('.h5')]
+    files = [f for f in os.listdir(path) if f.endswith('.h5') and f != 'overview.h5']
Member Author

I've assumed here that if we create a virtual overview file in a run directory, it will be called overview.h5.

Member

I wonder if it should contain the proposal and run number in the name. I expect that people might want to move these files around (maybe have them all in the same directory for convenience - maybe we want to do that as well?), and it would be convenient if they were immediately identifiable.

Something like RAW-P700000-R0001-OVERVIEW.h5, to keep something similar to the current naming scheme...

What do you think?

Member Author

I think you're probably right, but I'll see if we can get some input from ITDM.

Member Author

I've broadened this to allow for overview anywhere in the name. So long as we don't have a big detector called 'overview', I think this should be safe enough.
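The broadened filter could look roughly like this — a sketch with a hypothetical helper name; the real expression lives inside RunDirectory:

```python
import os

def data_files(path):
    # Treat any HDF5 file with 'overview' anywhere in its name
    # (case-insensitive) as a virtual overview file, not a data source.
    return [f for f in os.listdir(path)
            if f.endswith('.h5') and 'overview' not in f.lower()]
```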

raise Exception("No HDF5 files found in {} with glob pattern {}".format(path, include))

if _use_voview and (sel_files == files):
Member Author

If some files were filtered out by the locality filter, this won't use the virtual overview file, to avoid accidentally trying to access data from tape when we've asked for only data on disk.

However, if you filter files with e.g. include='*AGIPD*', we'll still read a virtual overview file if possible. This could be unexpected in some situations - e.g. if you use .train_from_id() to get data from all sources, you'll see ones that weren't previously included. But the primary reason for the include= parameter is to open a run faster, and using the virtual overview file should be a fast option if it exists.
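The guard can be read as: only fall back to the virtual overview when the locality filter removed nothing. A simplified sketch with hypothetical names, where all_files is the list after the include= glob and selected_files is the list after the locality filter:

```python
def should_use_voview(use_voview, all_files, selected_files):
    # If the locality filter removed any files, some data may still be
    # on tape - the overview file could point at it, so don't use it.
    return use_voview and set(selected_files) == set(all_files)
```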

@takluyver
Member Author

Some more numbers: creating a virtual overview file for a run of ~12k trains with DSSC took 3m 30s from GPFS (proc data) and 10m from dCache (raw). The resulting files are each about 28 MiB.

@takluyver takluyver changed the title WIP: Virtual overview files Virtual overview files Jul 13, 2020
@takluyver takluyver added this to the 1.3 milestone Jul 15, 2020
@tmichela
Member

What version of h5py do you use? Using our Anaconda environment:

~/projects/EXtra-data/extra_data/voview.py in main(argv)
    146         print(f"Creating {file_path} from {len(run.files)} files...")
    147         vofw = VirtualOverviewFileWriter(file_path, run)
--> 148         vofw.write()
    149 
    150 if __name__ == '__main__':

~/projects/EXtra-data/extra_data/voview.py in write(self)
     34 
     35     def write(self):
---> 36         self.record_source_files()
     37         super().write()
     38 

~/projects/EXtra-data/extra_data/voview.py in record_source_files(self)
     28 
     29         grp.create_dataset(
---> 30             'names', data=names, dtype=h5py.string_dtype(encoding='ascii')
     31         )
     32         grp.create_dataset('mtimes', data=mtimes, dtype='f8')

AttributeError: module 'h5py' has no attribute 'string_dtype'
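One way to make this work on older h5py is a small compatibility shim — a sketch, assuming h5py.special_dtype(vlen=bytes) as the pre-2.10 spelling of an ASCII string dtype:

```python
import h5py

try:
    # h5py >= 2.10 has string_dtype()
    ascii_str_dtype = h5py.string_dtype(encoding='ascii')
except AttributeError:
    # Older h5py: variable-length bytes act as an ASCII string dtype
    ascii_str_dtype = h5py.special_dtype(vlen=bytes)
```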

@takluyver
Member Author

Looks like I have 2.10.0 installed in my home directory. It should be easy enough to make it work with 2.9, though...

@takluyver
Member Author

Now there's a failure importing matplotlib on Python 3.7. I've reported an issue for matplotlib about this, but we can work around it by installing a newer numpy.

@takluyver
Member Author

matplotlib has uploaded new wheels which fix the problem, so I removed the commit where I was working around that.

@tmichela
Member

tmichela commented Jul 20, 2020

So, if I get it right, currently:

  • all datasets under INDEX are copied and compressed
  • all datasets under INSTRUMENT and CONTROL are virtual datasets

What isn't referenced:

  • RUN
  • in INDEX: flag and timestamp
  • in METADATA: all except sources names

Should we try to add these here as well, or do you think that's too big a task for now?

  • data in RUN should be easy, as it contains the same thing in all files
  • METADATA is a bit trickier for some entries (creation date, ...), but some are easy (run number, proposal number, ...)
  • flag and timestamp? Not sure how to treat them (any idea?).
    • for timestamp, I'd be inclined to use the latest timestamp for a train ID, as the convention in Karabo is that the latest update on data within a 100 ms range is tagged with that specific train ID.

Any idea how much space the data in CONTROL takes? I wonder if it is worth copying it as well. Most of the values never change (motor not moving, ...) so it must compress quite well.

@takluyver
Member Author

Yup, that's about right - except that image.pulseId and image.cellId under INSTRUMENT sources are also copied - we treat these as extra indexes.

We're effectively writing the file format we call '0.5' at the moment. So far we don't use any of the extra 1.0 fields, so I don't think it's urgent to write them. The 'flag' field in 1.0 might also be tricky - what if different files have different flag values for the same train ID?

I can try copying the CONTROL data for a run. My guess is that it will be significantly bigger, though obviously still much smaller than the run itself. I quite like the consistency of making local indexes but using references for all the real data.

@takluyver
Member Author

Somewhat to my surprise, adding control data didn't affect the size that much - one file I tried went from 8 MB to 16, another from 28 MB to 40.

I'm still inclined to leave CONTROL data as virtual datasets. I don't know whether there's some scenario where control data could be much bigger than the runs I tested with. It also feels clearer to say that these files we're creating are just a kind of index, and don't contain any real data. E.g. it doesn't matter much if someone who shouldn't have access to the raw data gets access to an overview file, because it doesn't contain any data.

@tmichela
Member

I'm not too surprised; most of the data in CONTROL is 0D and repeats along the run... Actually, in the future it might even be smaller, since the DAQ storage policy can now filter device properties, so there should be much less CONTROL data than is currently in files.

That being said, I agree with you.

@tmichela
Member

I had a quick look again. I'm a bit afraid that once we deploy this and people start using it, it will hide the new attributes from the original files (timestamp, flag, metadata, ...). We should maybe try to make EXtra-data use these new datasets (not in this PR, but we should prioritise that task).

  • Should we add a command line tool to generate these files?
  • Maybe add some tests for this feature?

@takluyver
Member Author

Yes, that's a good point, we should work on exposing the new datasets. We'll need to work out what to do with flag - should we ignore any data in a train where flag=0 for that train?

It should definitely have tests as well. I'm less sure about a new command at the moment - I think python -m commands are a good way to try something out without advertising it too much.

@tmichela
Member

Ideally the flag should only invalidate the sources in the file where the flag is set.

But, I guess it's much simpler to exclude all data for a train if a flag is set in any file. Maybe this happens so rarely that we should not worry about it? (I'd be particularly afraid if a flag is set for all trains in a file, let's say for a faulty detector module)

@takluyver takluyver modified the milestones: 1.3, 1.4 Aug 3, 2020
@kirienko

What isn't referenced:
* RUN
* in INDEX: flag and timestamp
* in METADATA: all except sources names

Should we try here to add these as well or do you think this is a too big task for now?
* data in RUN should be easy as they contain the same thing in all files,
* METADATA is a bit trickier for some of them (creation date, ...), but some are easy (run number, proposal number, ...)
* flag and timestamp? Not sure how to treat them (any idea?).
* for timestamp, I'd be inclined to use the latest timestamp for a trainId, as the convention in karabo is that the latest update on data for a 100ms range is to be tagged with this specific trainid.

It'd be very helpful in terms of the FAIR concept to include as much from METADATA as possible. Indeed, it's much easier to take things like the creation date or sample name from the DAQ files than from MDC!

@takluyver
Member Author

I'm not sure whether to bother with RUN. IIRC, it duplicates the first entry of every CONTROL dataset, but also includes everything which is marked as not recorded in Karabo. There's no way to access it in EXtra-data, and I don't think anyone has ever asked for it (see #26), which suggests that there's not much interest in it. I imagine there will be some overhead in reading and creating thousands of tiny datasets, and I'm not convinced it's worth paying that on a purely speculative basis in case someone one day wants it.

Maybe we could go rogue from the EuXFEL file format a bit, and make a set of links to the RUN groups in different files. Unfortunately, I don't think there's anything like a 'virtual group' which could pool the contents of several groups.

I'm also still working out what we can do about the per-run information (like the start & end timestamps now in metadata). The concept of a run isn't too important in EXtra-data: you can split a run up, or join runs together, so it's not exactly clear what happens to per-run info then.

@tmichela
Member

I'm not sure whether to bother with RUN. IIRC, it duplicates the first entry of every CONTROL dataset, but also includes everything which is marked as not recorded in Karabo. There's no way to access it in EXtra-data, and I don't think anyone has ever asked for it (see #26), which suggests that there's not much interest in it. I imagine there will be some overhead in reading and creating thousands of tiny datasets, and I'm not convinced it's worth paying that on a purely speculative basis in case someone one day wants it.

I can see this being useful in a (far) future, e.g. for automated analysis pipelines this could provide useful configuration (device version, result X was processed with algo A, parameter P, etc.).
It's also currently the only way to see a string property.

But agreed that this is not important at the moment.

Maybe we could go rogue from the EuXFEL file format a bit, and make a set of links to the RUN groups in different files. Unfortunately, I don't think there's anything like a 'virtual group' which could pool the contents of several groups.

I'm also still working out what we can do about the per-run information (like the start & end timestamps now in metadata). The concept of a run isn't too important in EXtra-data: you can split a run up, or join runs together, so it's not exactly clear what happens to per-run info then.

In practice, I assume that most of the time the DataCollection will represent a run, or consecutive runs, so that information might still be relevant in most cases? 🤔
Maybe in the DataCollection object we could take the earliest start timestamp and the latest end timestamp and make those the DataCollection's start/end?
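Merging per-file run times as suggested could be as simple as taking the earliest start and the latest end — a sketch with a hypothetical helper and inputs:

```python
def combined_run_times(per_file_times):
    """per_file_times: iterable of (start, end) timestamp pairs, one per file/run."""
    starts, ends = zip(*per_file_times)
    return min(starts), max(ends)
```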

@takluyver
Member Author

In fact, I misremembered - it's creationDate and updateDate that are in the metadata, not run start & end time. That possibly makes it simpler in one sense - they differ per file, not per run - but we also want to avoid opening 400 files when the user asks 'when was this run created?'

@tmichela
Member

I see... Probably the update date does not make much sense, but we could at least have the creation date. I was often annoyed not to be able to know even approximately when a run was taken (even which day...). For that, we could open only the sequence-0 files, which isn't too expensive?

@takluyver
Member Author

takluyver commented May 14, 2021

I'm unsure how to handle 'suspect' trains (where INDEX/flag is 0) for this. We can't just replicate the flag, because the same train ID may be good in one file and bad in another. So we'll have to decide whether to include or exclude these suspect trains from the virtual overview file.

We could also set the flag to zero for a train in the virtual overview file where all data files containing that train mark it as bad, or where any of them do.

Other than that, things to do:

  • INDEX/timestamp
  • New METADATA entries
  • RUN data

@tmichela
Member

tmichela commented May 14, 2021

We can't just replicate the flag, because the same train ID may be good in one file and bad in another.

I'd like to check that, actually. Flagged train IDs are often off by a large margin, and it's possible that this is not a problem in the end.
There's also a case where some devices have a train ID slowly drifting (after losing sync with the time server?), and there train IDs could be off by only a few 1000s, so there might be collisions, though I'd expect such cases are not common 🤔

@takluyver
Member Author

I think I found some cases of train IDs just 1 or 2 out of sequence when I looked ~1 year ago, but I can't easily point to specific runs. More data on what's going on is definitely welcome. But even if it's rare, we're handling the flag per-file, so we logically still have to combine the values somehow. 🙂

@takluyver
Member Author

For now, I have implemented it such that flag is 0 (meaning suspect) in the virtual overview file only for trains where it's 0 in all source files. This should still catch the error where AGIPD data is ~60k trains out, but some other data will be erroneously included even if you use inc_suspect_trains=False.
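The rule implemented here — a train stays flagged suspect only if every source file flags it — amounts to OR-ing the per-file flags. A NumPy sketch with made-up data:

```python
import numpy as np

# One row of flags per source file, one column per train (1 = good, 0 = suspect)
flags_per_file = np.array([
    [1, 1, 0, 0],   # file A
    [1, 0, 0, 1],   # file B
])

# A train is good in the overview if any file marks it good; it stays
# suspect (0) only when it is 0 in all source files.
combined = flags_per_file.any(axis=0).astype('u4')
```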

We could add extra information somewhere which records the flag value per train & per source, so it can store that train 1234 is valid for source X but not source Y. But that's an extra complexity to read the data - at the moment, you can read the virtual overview file like any other EuXFEL HDF5 file. Or we could make two such files, one including suspect trains and one excluding them. But that feels like an awkward workaround.

@takluyver takluyver mentioned this pull request Jan 7, 2022
@tmichela
Member

tmichela commented Jan 7, 2022

  • Should we have a version number for the virtual files (I think we discussed that), in case of breaking changes in future?
  • Did you want to add a CLI entry point to generate these files?

@takluyver
Member Author

I think having some sort of version number is a good idea. Do you think we can get away with a single number, and it ignores files with a newer version than it recognises? Or should we have major & minor numbers, so we can record versions for backwards-compatible changes?
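A single version number could be stored as a root attribute and checked on read, ignoring files written by a newer format. A sketch with a hypothetical attribute name:

```python
import h5py

FORMAT_VERSION = 1  # single number, bumped on breaking changes

def write_version(path):
    with h5py.File(path, 'a') as f:
        f.attrs['virtual_overview_version'] = FORMAT_VERSION

def usable_voview(path):
    # Ignore overview files written with a newer, unknown format version
    with h5py.File(path, 'r') as f:
        return f.attrs.get('virtual_overview_version', 0) <= FORMAT_VERSION
```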

I'd say no to a CLI entry point, at least for now. If the plan is for us to autogenerate these files as runs are taken, we can do that with python -m extra_data.voview, and there shouldn't be much need for other people to generate them, so we needn't make it obvious.

@tmichela
Member

tmichela commented Jan 7, 2022

Do you think we can get away with a single number

A single number is probably sufficient

I'd say no to a CLI entry point, at least for now.

👍

@tmichela (Member) left a comment

Thanks for adding more tests! This LGTM

extra_data/reader.py (outdated, resolved)
return False # Basic check that things make sense

files_now = {f for f in os.listdir(run_dir)
if f.endswith('.h5') and ('overview' not in f.lower())}
Member

same thing here?

extra_data/voview.py (outdated, resolved)
takluyver and others added 2 commits January 10, 2022 14:07
Co-authored-by: Thomas Michelat <32831491+tmichela@users.noreply.github.com>
@takluyver takluyver added this to the 1.10 milestone Jan 10, 2022
@takluyver takluyver merged commit 8b06c77 into master Jan 10, 2022
@takluyver takluyver deleted the virtual-overview branch January 31, 2022 18:24