fix: use ak.merge_union_of_records to generate input data format #1017

lgray · 2024-01-30T17:31:20Z

~~May not need to merge if current pre-processing scheme already functions for this task.~~

As it stands, this PR uses ak.merge_union_of_records to tag fields that are not shared across files in a dataset.
When a file is parsed and a given key is not available in that file, a buffer is generated as an array of None, using an IndexedOptionArray with the expected length of the step for that form key.

The only arrays this workaround currently supports are flat arrays of booleans. Users can choose the default value for the missing booleans with ak.fill_none to match their needs.

If any other kind of array is encountered, the file is considered not openable as nanoevents, and errors are emitted either in preprocessing or within nanoevents itself.

lgray · 2024-01-31T00:55:38Z

@nsmith- this workaround was the best I could come up with for now. It at least lets us deal with changing forms in a dataset in a way that doesn't mess up the _meta of the dask_awkward array, even if it does one mildly nasty thing to nanoevents.

I think it keeps the leak fairly contained, but if you can think of a cleaner solution I'll happily accept it.

nsmith- · 2024-01-31T20:46:28Z

> When a file is parsed, if a given key is not available in that file a buffer is generated with all identities (False in this case) that is the length of the file.

Why not None?

lgray · 2024-01-31T21:06:15Z

Because I can't get it to work properly.

lgray · 2024-01-31T22:53:54Z

After letting this simmer in the back of my head all day, I think I have a way to get it to do option arrays. I'll give it a try.

lgray · 2024-02-01T17:41:33Z

@nsmith- ready for a review.

src/coffea/nanoevents/mapping/uproot.py

src/coffea/dataset_tools/preprocess.py

lgray force-pushed the use_merge_union_of_records branch from 5f638d2 to 6e3e492 Compare January 30, 2024 19:30

lgray mentioned this pull request Jan 30, 2024

Bug in NanoEventsFactory.from_root() when reading in multiple files with different trigger paths #1014

Closed

lgray force-pushed the use_merge_union_of_records branch from aca12a2 to d2b10ef Compare January 31, 2024 00:45

lgray force-pushed the use_merge_union_of_records branch from d2b10ef to a1dbafe Compare January 31, 2024 01:04

lgray force-pushed the use_merge_union_of_records branch 3 times, most recently from 1b0b518 to 0d9fe7c Compare February 1, 2024 19:38

lgray requested a review from nsmith- February 1, 2024 22:47

lgray force-pushed the use_merge_union_of_records branch from 0eef99b to bf40012 Compare February 2, 2024 20:14

nsmith- approved these changes Feb 2, 2024

src/coffea/nanoevents/mapping/uproot.py

src/coffea/dataset_tools/preprocess.py

lgray force-pushed the use_merge_union_of_records branch 4 times, most recently from 974c56b to 51fbc80 Compare February 11, 2024 17:27

lgray added 11 commits February 13, 2024 11:07

fix: use ak.merge_union_of_records to generate input data format

df6ac59

typo

9c8489d

another typo

292c21f

a hacky implementation that appears to function for only bools

fcf87ea

cleaner implementation of allowing missing arrays that does not misrepresent computed form

59190ae

now properly functions as an IndexedOptionArray on output

58a4cf4

deal with trivial NanoAOD forms and/or length-zero files

3c50888

revive tests, add useful warning when empties are encountered

db5b5e3

update typing

7c49cb1

use num_entries in _default_filter of dataset_tools.filter_files

78469b3

further safeguards against strange forms

7f82629

lgray force-pushed the use_merge_union_of_records branch from 51fbc80 to 7f82629 Compare February 13, 2024 17:07

lgray merged commit a5674f2 into master Feb 13, 2024
14 checks passed

lgray deleted the use_merge_union_of_records branch February 13, 2024 21:17
