Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: use ak.merge_union_of_records to generate input data format #1017

Merged
merged 11 commits into from Feb 13, 2024

Conversation

lgray
Copy link
Collaborator

@lgray lgray commented Jan 30, 2024

Fixes #1014

May not need to merge if current pre-processing scheme already functions for this task.

As it is right now this PR uses ak.merge_union_of_records to tag fields that are not shared across files in a dataset.
When a file is parsed, if a given key is not available in that file a buffer is generated as an array of None using an IndexedOptionArray and the expected length of the step in that form key.

The only kind of arrays that are supported by this workaround at present are flat arrays of booleans. The user may choose with ak.fill_none the default behavior for the boolean to match their needs.

If any other kind of array is encountered the file is considered not openable as nanoevents and emits errors either in preprocessing or within nanoevents itself.

@lgray
Copy link
Collaborator Author

lgray commented Jan 31, 2024

@nsmith- this work around was the best I could come up with for now. This at least lets us deal with changing forms in a dataset in a way that doesn't mess up the _meta of the dask_awkward array, even if it does one mildly nasty thing to nanoevents.

I think it keeps the leak fairly contained, but if you can think of a cleaner solution I'll happily accept it.

@nsmith-
Copy link
Member

nsmith- commented Jan 31, 2024

When a file is parsed, if a given key is not available in that file a buffer is generated with all identities (False in this case) that is the length of the file.

Why not None?

@lgray
Copy link
Collaborator Author

lgray commented Jan 31, 2024

Because I can't get it to work properly.

@lgray
Copy link
Collaborator Author

lgray commented Jan 31, 2024

After letting this simmer all day in the back of my head I think I have a way to get it to do option arrays. I'll give it a try.

@lgray
Copy link
Collaborator Author

lgray commented Feb 1, 2024

@nsmith- ready for a review.

@lgray lgray force-pushed the use_merge_union_of_records branch 3 times, most recently from 1b0b518 to 0d9fe7c Compare February 1, 2024 19:38
@lgray lgray requested a review from nsmith- February 1, 2024 22:47
src/coffea/nanoevents/mapping/uproot.py Show resolved Hide resolved
src/coffea/dataset_tools/preprocess.py Show resolved Hide resolved
@lgray lgray force-pushed the use_merge_union_of_records branch 4 times, most recently from 974c56b to 51fbc80 Compare February 11, 2024 17:27
@lgray lgray merged commit a5674f2 into master Feb 13, 2024
14 checks passed
@lgray lgray deleted the use_merge_union_of_records branch February 13, 2024 21:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug in NanoEventsFactory.from_root() when reading in multiple files with different trigger paths
2 participants