Virtual overview files #69
Conversation
For reference, a run of some 3000 trains (5 minutes) with DSSC makes an overview file of about 32 MB. The cell IDs and pulse IDs compress well, and if I write those gzipped, the file is 9.2 MB. I think this use case fits well into HDF5's chunking & compression mechanism - it's easy to make useful chunks under 1 MB, at which point they fit into the HDF5 chunk cache, so sequential access can be efficient if you're using HDF5 correctly. So I'm going to compress those by default.
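The chunking-plus-gzip idea described above can be sketched with h5py. This is a minimal illustration, not the PR's actual code: the file name, chunk size, and pulse-ID pattern are all made up for the demo.

```python
import os
import tempfile

import h5py
import numpy as np

# Sketch of the idea above (names & sizes illustrative, not the PR's code):
# pulse IDs repeat from train to train, so they gzip well, and chunks under
# 1 MiB fit comfortably in HDF5's default chunk cache.
path = os.path.join(tempfile.mkdtemp(), 'overview.h5')
pulse_ids = np.tile(np.arange(400, dtype='u8'), 3000)  # ~3000 trains

with h5py.File(path, 'w') as f:
    f.create_dataset(
        'pulseId', data=pulse_ids,
        chunks=(100_000,),   # 100k x 8 bytes = ~780 KiB per chunk
        compression='gzip',
    )

with h5py.File(path, 'r') as f:
    stored = f['pulseId'].id.get_storage_size()
    ratio = stored / pulse_ids.nbytes  # well under 1 for repetitive IDs
```

Because the IDs are so repetitive, the on-disk size ends up a small fraction of the raw 9.6 MB, consistent with the 32 MB → 9.2 MB figure quoted above.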
extra_data/reader.py
Outdated
@@ -1310,12 +1310,17 @@ def RunDirectory(path, include='*', file_filter=locality.lc_any):
         Function to subset the list of filenames to open.
         Meant to be used with functions in the extra_data.locality module.
     """
-    files = [f for f in os.listdir(path) if f.endswith('.h5')]
+    files = [f for f in os.listdir(path) if f.endswith('.h5') and f != 'overview.h5']
I've assumed here that if we create a virtual overview file in a run directory, it will be called overview.h5.
I wonder if it should not contain the proposal and run number in the name. I expect that people might want to move them around (maybe have them all in the same directory for convenience - maybe we want to do that as well?), and it would be convenient if they are immediately identifiable.
something like RAW-P700000-R0001-OVERVIEW.h5
to keep something similar to the current naming scheme...
What do you think?
I think you're probably right, but I'll see if we can get some input from ITDM.
I've broadened this to allow for overview anywhere in the name. So long as we don't have a big detector called 'overview', I think this should be safe enough.
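As a pure-Python sketch, the broadened rule amounts to a case-insensitive substring check. The helper name here is mine, not necessarily what the PR uses:

```python
def is_overview_file(filename):
    # Hypothetical helper: treat any .h5 file with 'overview' anywhere
    # in its name (case-insensitive) as a virtual overview file.
    return filename.endswith('.h5') and 'overview' in filename.lower()

# Both the plain and proposal/run-numbered names are excluded from
# the list of data files:
data_files = [f for f in ['RAW-R0001-AGIPD00-S00000.h5',
                          'overview.h5',
                          'RAW-P700000-R0001-OVERVIEW.h5']
              if not is_overview_file(f)]
```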
        raise Exception("No HDF5 files found in {} with glob pattern {}".format(path, include))

    if _use_voview and (sel_files == files):
If some files were filtered out by the locality filter, this won't use the virtual overview file, to avoid accidentally trying to access data from tape when we've asked for only data on disk.

However, if you filter files with e.g. include='*AGIPD*', we'll still read a virtual overview file if possible. This could be unexpected in some situations - e.g. if you use .train_from_id() to get data from all sources, you'll see ones that weren't previously included. But the primary reason for the include= parameter is to open a run faster, and using the virtual overview file should be a fast option if it exists.
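That decision rule can be sketched as a small predicate (the function name and argument layout are my invention, not the code in the PR):

```python
def should_use_voview(files, sel_files, voview_available=True):
    # files: names matching the include= glob pattern.
    # sel_files: the subset of those kept by the locality filter.
    # Fall back to opening individual files if locality filtering removed
    # anything, to avoid touching data that's only on tape; a name-based
    # include pattern alone doesn't disqualify the overview file.
    return voview_available and set(sel_files) == set(files)
```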
Some more numbers: creating a virtual overview file for a run of ~12k trains with DSSC took 3m 30s from GPFS (proc data) and 10m from dCache (raw). The resulting files are each about 28 MiB.
What version of h5py are you using? With our anaconda environment:
~/projects/EXtra-data/extra_data/voview.py in main(argv)
146 print(f"Creating {file_path} from {len(run.files)} files...")
147 vofw = VirtualOverviewFileWriter(file_path, run)
--> 148 vofw.write()
149
150 if __name__ == '__main__':
~/projects/EXtra-data/extra_data/voview.py in write(self)
34
35 def write(self):
---> 36 self.record_source_files()
37 super().write()
38
~/projects/EXtra-data/extra_data/voview.py in record_source_files(self)
28
29 grp.create_dataset(
---> 30 'names', data=names, dtype=h5py.string_dtype(encoding='ascii')
31 )
32 grp.create_dataset('mtimes', data=mtimes, dtype='f8')
AttributeError: module 'h5py' has no attribute 'string_dtype'
Looks like I have 2.10.0 installed in my home directory. It should be easy enough to make it work with 2.9, though...
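A compatibility shim along those lines might look like this. This is my sketch, not the PR's actual fix, assuming `h5py.string_dtype` appeared in 2.10 and that `special_dtype(vlen=...)` covers older releases:

```python
import h5py

# Sketch of a backwards-compatible variable-length string dtype:
# h5py.string_dtype() only exists from h5py 2.10; earlier versions
# spell the same thing as special_dtype(vlen=...).
try:
    vlen_str = h5py.string_dtype(encoding='ascii')
except AttributeError:
    vlen_str = h5py.special_dtype(vlen=bytes)
```

Either branch yields a NumPy object dtype that `create_dataset` accepts for string data.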
Now there's a failure importing matplotlib on Python 3.7. I've reported an issue for matplotlib about this, but we can work around it by installing a newer numpy.
matplotlib has uploaded new wheels which fix the problem, so I removed the commit where I was working around that.
So, if I get it right, currently there is:
What isn't referenced:
Should we try here to add these as well or do you think this is a too big task for now?
Any idea how much space the CONTROL data would take?
Yup, that's about right - except that we're effectively writing the file format we call '0.5' at the moment. So far we don't use any of the extra 1.0 fields, so I don't think it's urgent to write them. The 'flag' field in 1.0 might also be tricky - what if different files have different flag values for the same train ID?

I can try copying the CONTROL data for a run. My guess is that it will be significantly bigger, though obviously still much smaller than the run itself. I quite like the consistency of making local indexes but using references for all the real data.
Somewhat to my surprise, adding control data didn't affect the size that much - one file I tried went from 8 MB to 16, another from 28 MB to 40.

I'm still inclined to leave CONTROL data as virtual datasets. I don't know whether there's some scenario where control data could be much bigger than the runs I tested with. It also feels clearer to say that these files we're creating are just a kind of index, and don't contain any real data. E.g. it doesn't matter much if someone who shouldn't have access to the raw data gets access to an overview file, because it doesn't contain any data.
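For reference, the virtual-dataset mechanism being relied on here is h5py's `VirtualLayout`/`VirtualSource` API. A self-contained sketch (file and dataset names fabricated for the demo; requires HDF5 >= 1.10 / h5py >= 2.9):

```python
import os
import tempfile

import h5py
import numpy as np

# Sketch: stitch per-sequence train IDs into one virtual dataset, so the
# overview file references the source data rather than copying it.
tmp = tempfile.mkdtemp()
seq_paths = []
for i in range(2):
    p = os.path.join(tmp, f'RAW-SEQ{i}.h5')  # fabricated names
    with h5py.File(p, 'w') as f:
        f['trainId'] = np.arange(i * 100, (i + 1) * 100, dtype='u8')
    seq_paths.append(p)

layout = h5py.VirtualLayout(shape=(200,), dtype='u8')
for i, p in enumerate(seq_paths):
    layout[i * 100:(i + 1) * 100] = h5py.VirtualSource(p, 'trainId', shape=(100,))

vo_path = os.path.join(tmp, 'overview.h5')
with h5py.File(vo_path, 'w') as f:
    f.create_virtual_dataset('trainId', layout)

with h5py.File(vo_path, 'r') as f:
    combined = f['trainId'][:]       # reads through to the source files
    virtual = f['trainId'].is_virtual
```

The overview file itself stores only the mapping, which is why an index over a multi-TB run can stay in the tens of MB.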
I'm not too surprised; most of the data in CONTROL is 0d data, repeating along the run... Actually, in future it might even be smaller, since the DAQ storage policy can now filter device properties, so there should be much less CONTROL data than is currently in files. That being said, I agree with you.
I had a quick look again. I'm a bit afraid that when we deploy this and people start using it, it will hide the new attributes from the original files (timestamp, flag, metadata, ...). We should maybe try to make EXtra-data use these new datasets (not in this PR, but prioritise this task).
Yes, that's a good point, we should work on exposing the new datasets. We'll need to work out what to do with … It should definitely have tests as well.

I'm less sure about a new command at the moment - I think…
Ideally the … But I guess it's much simpler to exclude all data for a train if a flag is set in any file. Maybe this happens so rarely that we should not worry about it? (I'd be particularly afraid if a flag is set for all trains in a file, say for a faulty detector module.)
It'd be very helpful in terms of the FAIR concept to include as much from RUN as possible.
I'm not sure whether to bother with RUN. IIRC, it duplicates the first entry of every CONTROL dataset, but also includes everything which is marked as not recorded in Karabo. There's no way to access it in EXtra-data, and I don't think anyone has ever asked for it (see #26), which suggests that there's not much interest in it. I imagine there will be some overhead in reading and creating thousands of tiny datasets, and I'm not convinced it's worth paying that on a purely speculative basis in case someone one day wants it.

Maybe we could go rogue from the EuXFEL file format a bit, and make a set of links to the RUN groups in different files. Unfortunately, I don't think there's anything like a 'virtual group' which could pool the contents of several groups.

I'm also still working out what we can do about the per-run information (like the start & end timestamps now in metadata). The concept of a run isn't too important in EXtra-data: you can split a run up, or join runs together, so it's not exactly clear what happens to per-run info then.
I can see this being useful in the (far) future, e.g. for automated analysis pipelines, where this could provide useful configuration (device version, result X was processed with algo A, parameter P, etc.). But agreed that this is not important at the moment.

In practice, I assume that most of the time the DataCollection will represent a run, or consecutive runs, so that information might still be relevant in most cases? 🤔
In fact, I misremembered - it's creationDate and updateDate that are in the metadata, not run start & end time. That possibly makes it simpler in one sense - they differ per file, not per run - but we also want to avoid opening 400 files when the user asks 'when was this run created?'
I see... Probably the update date does not make much sense, but we could at least have the creation date. I was often annoyed at not being able to know approximately when a run was taken (even which day...). For that we could open only the sequence 0 files, which isn't too expensive?
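A rough sketch of that idea, reading creationDate from only the first-sequence files. The `*S00000.h5` glob and the `METADATA/creationDate` dataset path follow the EuXFEL naming scheme discussed above, but treat the details as assumptions:

```python
import glob
import os
import tempfile

import h5py

def run_creation_dates(run_dir):
    # Sketch: open only sequence-0 files, rather than all ~400 files
    # in a run, to answer 'when was this run created?'
    dates = {}
    for path in sorted(glob.glob(os.path.join(run_dir, '*S00000.h5'))):
        with h5py.File(path, 'r') as f:
            dates[os.path.basename(path)] = f['METADATA/creationDate'][0]
    return dates

# Demo with a fabricated run directory: one seq-0 file with metadata,
# one later sequence file that should be skipped.
run_dir = tempfile.mkdtemp()
with h5py.File(os.path.join(run_dir, 'RAW-R0001-DA01-S00000.h5'), 'w') as f:
    f['METADATA/creationDate'] = [b'20200101T120000Z']
with h5py.File(os.path.join(run_dir, 'RAW-R0001-DA01-S00001.h5'), 'w') as f:
    f['trainId'] = [1]

dates = run_creation_dates(run_dir)
```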
I'm unsure how to handle 'suspect' trains (where …). We could also set the flag to zero for a train in the virtual overview file where all data files containing that train mark it as bad, or where any of them do.

Other than that, things to do:
I'd like to check that, actually. Flagged train IDs are often off by a large margin, and it's possible that it is not a problem in the end.
I think I found some cases of train IDs just 1 or 2 out of sequence when I looked ~1 year ago, but I can't easily point to specific runs. More data on what's going on is definitely welcome. But even if it's rare, we're handling the flag per-file, so we logically still have to combine the values somehow. 🙂
For now, I have implemented it such that flag is 0 (meaning suspect) in the virtual overview file only for trains where it's 0 in all source files. This should still catch the error where AGIPD data is ~60k trains out, but some other data will be erroneously included even if you use …

We could add extra information somewhere which records the flag value per train & per source, so it can store that train 1234 is valid for source X but not source Y. But that's an extra complexity to read the data - at the moment, you can read the virtual overview file like any other EuXFEL HDF5 file. Or we could make two such files, one including suspect trains and one excluding them. But that feels like an awkward workaround.
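The "flag is 0 only if it's 0 in every file" rule amounts to OR-ing the per-file flags for each train. A toy sketch of that combination (not the PR's code; the data layout here is simplified to dicts):

```python
def combine_flags(per_file_flags):
    # per_file_flags: one {train_id: flag} mapping per source file.
    # A train ends up suspect (0) in the overview only if every file
    # containing it marks it suspect; one valid (1) vote keeps it valid.
    combined = {}
    for flags in per_file_flags:
        for tid, flag in flags.items():
            combined[tid] = combined.get(tid, 0) | flag
    return combined

# Train 1002 is suspect in one file but valid in another, so it stays
# valid; train 1003 is suspect everywhere it appears.
flags = combine_flags([
    {1001: 1, 1002: 0},
    {1001: 1, 1002: 1},
    {1003: 0},
])
```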
Virtual overview edits
I think having some sort of version number is a good idea. Do you think we can get away with a single number, and it ignores files with a newer version than it recognises? Or should we have major & minor numbers, so we can record versions for backwards-compatible changes?

I'd say no to a CLI entry point, at least for now. If the plan is for us to autogenerate these files as runs are taken, we can do that with python -m extra_data.voview.
A single number is probably sufficient 👍
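With a single number, the check could be as simple as this sketch. The attribute name `format_version` and its location are my invention, not necessarily what the PR writes:

```python
import os
import tempfile

import h5py

FORMAT_VERSION = 1  # version this reader understands (illustrative)

def voview_readable(path):
    # Ignore overview files written with a newer format version than this
    # code recognises; missing attribute counts as the oldest version.
    with h5py.File(path, 'r') as f:
        return bool(f.attrs.get('format_version', 0) <= FORMAT_VERSION)

# Demo: a file from a hypothetical future version is skipped,
# a current-version file is accepted.
path = os.path.join(tempfile.mkdtemp(), 'overview.h5')
with h5py.File(path, 'w') as f:
    f.attrs['format_version'] = 2
ok_future = voview_readable(path)

with h5py.File(path, 'w') as f:
    f.attrs['format_version'] = 1
ok_current = voview_readable(path)
```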
Thanks for adding more tests! This LGTM
extra_data/voview.py
Outdated
        return False  # Basic check that things make sense

    files_now = {f for f in os.listdir(run_dir)
                 if f.endswith('.h5') and ('overview' not in f.lower())}
same thing here?
Co-authored-by: Thomas Michelat <32831491+tmichela@users.noreply.github.com>
This includes:
This does not automatically create virtual overview files. To make one, run python -m extra_data.voview path/to/run_dir. On a run of some 3000 trains with the DSSC detector, this takes ~20 seconds when the source files are cached locally. If they're not cached, it may take much longer.