Add support multi-jagged arrays and records in UprootSourceMapping._extract_base_form
#448
The item transform is a bit suspicious. In theory, it would only be needed to access non-split data members. Otherwise, each split member should be accessible directly, as is the case for e.g.
I think, and hopefully @jpivarski can confirm, that TTree does not support splitting a double-jagged object. If that is the case, then indeed I would also hope that filtering of branches can be performed at the schema. Is it a matter of suppressing warnings from |
That is correct. The first jagged dimension gets split, but the rest do not. That's a large part of the motivation behind RNTuple. |
Yes, the double-jagged element links cannot be split automatically by ROOT; that's why we need to read them as an array of structs and split them afterwards. But I think those are the only ones.
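A minimal sketch of the "read as an array of structs, then split" idea mentioned above, written in plain Python rather than Awkward Array so that it runs without dependencies. The field names and values are made up for illustration; the real element links in the file may look different.

```python
# Doubly jagged data: events -> particles -> links, where each link is a record.
# The field names "key" and "index" are illustrative assumptions.
events = [
    [[{"key": 10, "index": 0}, {"key": 10, "index": 3}], []],
    [],
    [[{"key": 20, "index": 1}]],
]


def split_field(nested, field):
    """Project one record field out of arbitrarily nested lists of records."""
    if isinstance(nested, dict):
        return nested[field]
    return [split_field(item, field) for item in nested]


# One doubly jagged array per field, with the same nesting as the input.
keys = split_field(events, "key")
indices = split_field(events, "index")
```

The nesting structure is preserved for every field, which is what lets the split arrays be zipped back together later.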
Both - there are 2 cases where I would get an exception:
So one could either turn those exceptions into warnings (there will be quite a lot, since essentially every collection has one such non-data branch) or have some mechanism that lets me filter them out before trying. |
Also, as it came up in scikit-hep/uproot5#267: "cannot be Awkward" is different from "cannot be read." A TTree containing TH1F, for example (weird, but it does happen), can be read if Oh! That's the same file as in that discussion. So this is a small world, then. In general, you should allow a file to be opened and examined even if some branches are not readable, since a particular use-case might not use those branches. |
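One way to picture the "exceptions become warnings" option is the sketch below. It is not the code in this PR: `collect_readable` and `fake_extract_form` are hypothetical names, and the stand-in extractor simply fails for names ending in "." to mimic a branch that holds no data.

```python
import warnings


def collect_readable(branch_names, extract_form):
    """Return {name: form} for branches whose form can be built; warn about the rest."""
    forms = {}
    for name in branch_names:
        try:
            forms[name] = extract_form(name)
        except Exception as err:  # deliberately broad: this is only a sketch
            warnings.warn(f"Skipping branch {name!r}: {err}", RuntimeWarning)
    return forms


def fake_extract_form(name):
    """Hypothetical stand-in for the real form extraction."""
    if name.endswith("."):
        raise ValueError("branch holds no data")
    return {"class": "NumpyArray", "form_key": name}


# Record the warnings so they can be inspected instead of printed.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    forms = collect_readable(["ElectronsAuxDyn.pt", "ElectronsAux."], fake_extract_form)
```

With this pattern the file can still be opened and examined, and the unreadable branches only matter if something actually asks for them.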
Seems to me we could wrap the interior of the |
I'm not sure where we left off on this PR, but it would be nice to reboot it. Certainly I've seen more cases where something is awkward-describable but not able to be in NanoEvents due to the limited scope of forms considered in |
Yes, we should reboot it, thanks for pinging on this! To try it out, I put those try/except blocks in there. With that it now works without filtering, but of course it gives a lot of warnings:
from coffea.nanoevents.schemas import BaseSchema
from coffea.nanoevents.factory import NanoEventsFactory
factory = NanoEventsFactory.from_root(
    "DAOD_PHYSLITE.art_split99.pool.root",
    treepath="CollectionTree",
    schemaclass=BaseSchema,
)
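import uproot

def filter_name(branch):
    # Names ending in "." are skipped.
    if branch.endswith("."):
        return False
    # Names containing "::" are skipped.
    if "::" in branch:
        return False
    # The neutralParticleLinks branches are skipped.
    if "neutralParticleLinks" in branch:
        return False
    # Keep only the members of "Aux." and "AuxDyn." containers.
    if "AuxDyn." in branch or "Aux." in branch:
        return True
    return False

lazy_tree = uproot.lazy("DAOD_PHYSLITE.art_split99.pool.root:CollectionTree", filter_name=filter_name)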
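To see what this filter keeps, here is a quick check that needs no ROOT file. The filter is repeated so the snippet runs on its own; the branch names in `candidates` are made-up examples, not a listing of the actual file.

```python
def filter_name(branch):
    if branch.endswith("."):
        return False
    if "::" in branch:
        return False
    if "neutralParticleLinks" in branch:
        return False
    if "AuxDyn." in branch or "Aux." in branch:
        return True
    return False


# Hypothetical branch names, chosen to exercise each rule once.
candidates = [
    "AnalysisElectronsAuxDyn.pt",               # kept: member of an AuxDyn container
    "AnalysisElectronsAux.",                    # dropped: ends with "."
    "AnalysisJetsAuxDyn.neutralParticleLinks",  # dropped: explicitly excluded
    "EventInfoAux.xAOD::AuxInfoBase",           # dropped: contains "::"
    "EventInfoAux.runNumber",                   # kept: member of an Aux container
    "index_ref",                                # dropped: not an Aux/AuxDyn member
]

kept = [name for name in candidates if filter_name(name)]
```

The order of the checks matters: the exclusions run first, so a name that matches "AuxDyn." is still dropped if it also matches one of the earlier rules.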
By that, is the main goal to serialize the lazy view, together with the already cached columns? In general it would of course be nice not having to duplicate all the code from |
I'm imagining saving the plain json |
Just pinging here as it has been two weeks since any update and it seems everyone was interested in seeing this PR through. |
If you'd like me to look into something in particular let me know. Otherwise maybe it makes sense to merge this as-is and ponder long-term plans later? I guess one thing we could put together is a Schema class for the DAOD, that is the other big piece of added value w.r.t. uproot.lazy. |
Sounds good - I think the modifications I made should not interfere with reading non-double-jagged branches, so if it's fine with you, you can merge it and I can try to build the DAOD schema on top of that. |
@nikoladze cool - please resolve the conflicts with the base branch and we can move forward. |
As discussed yesterday, as a first step to get DAOD_PHYSLITE supported with NanoEvents we need to get all branches that we want to read into a base form. For PHYSLITE that includes a few multi-jagged arrays and records, such as the following:
For experimentation I tried to put support for these two cases into UprootSourceMapping._extract_base_form. That involved wrapping the .layout calls in transforms to check if the object on the stack is already a layout, to be able to do things like !content,!content. I also added an item transform for getting a record field. Unfortunately that had to become another special case, since it also needs to receive the field name, so the form keys for these have !item'{field}' in them.
Finally, to actually pass a PHYSLITE file through NanoEventsFactory I needed to add a filter for branches, because some of the branches are not meaningful and can't be read since they don't contain any data (and some have issues that still need to be sorted out). So for testing I added an argument iteritems_options that gets passed down to uproot's iteritems function, where I can pass e.g. filter_name. So with these modifications the following works:
Of course this might not be the best way of doing things, so I'll stop here for now. Do you have any suggestions on how to move on or do things better? I'm also thinking of extending the logic to recursively go through the form to support arbitrarily nested jagged arrays and records.
Then there is the question of where to define such branch filtering - in principle it would seem logical to have this in the PHYSLITE schema? But at the moment it already has to happen in the UprootSourceMapping._extract_base_form
...
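The recursive extension mentioned in the description could look roughly like the sketch below. It is only an illustration under stated assumptions: the dict-based form description (classes "list", "record", "leaf") is a simplified stand-in rather than Awkward's real form JSON, `leaf_keys` is a hypothetical helper, and the exact layout of the resulting keys is assumed from the !content and !item'{field}' fragments quoted above.

```python
def leaf_keys(form, prefix):
    """Yield one transform chain per leaf of a nested, dict-based form description."""
    kind = form["class"]
    if kind == "list":
        # One jagged dimension: descend into the list content.
        yield from leaf_keys(form["content"], prefix + ",!content")
    elif kind == "record":
        # One chain per record field, carrying the field name in the key.
        for field, sub in form["fields"].items():
            yield from leaf_keys(sub, prefix + f",!item'{field}'")
    else:
        # Leaf: nothing left to unwrap.
        yield prefix


# Hypothetical doubly jagged array of records with two fields.
form = {
    "class": "list",
    "content": {
        "class": "list",
        "content": {
            "class": "record",
            "fields": {"key": {"class": "leaf"}, "index": {"class": "leaf"}},
        },
    },
}

chains = list(leaf_keys(form, "links"))
```

Because the function recurses on whatever it finds, the same code handles any depth of nesting, and records inside records, without further special cases.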