Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support multi-jagged arrays and records in UprootSourceMapping._extract_base_form #448

Merged
merged 10 commits into from
Apr 23, 2021

Conversation

nikoladze
Copy link
Contributor

@nikoladze nikoladze commented Feb 10, 2021

As discussed yesterday, as a first step to get DAOD_PHYSLITE supported with NanoEvents we need to get all branches that we want read into a base form. That includes for PHYSLITE a few multi-jagged arrays and records, such as the following:

def example_file(
    filename="DAOD_PHYSLITE.art_split99.pool.root",
    url="https://cernbox.cern.ch/index.php/s/xTuPIvSJsP3QmXa/download"
):
    import os
    if not os.path.exists(filename):
        print(f"Downloading {url} to {filename}")
        import requests
        res = requests.get(url)
        with open(filename, "wb") as of:
            of.write(res.content)
    return filename
>>> import uproot
>>> tree = uproot.open(f"{example_file()}:CollectionTree")
>>> # double jagged of record with fields "m_persKey", "m_persIndex"
>>> tree["AnalysisJetsAuxDyn.NumTrkPt500"].interpretation.awkward_form(None)
....
>>> # double jagged + record
>>> tree["AnalysisElectronsAuxDyn.trackParticleLinks"].interpretation.awkward_form(None)
...

For experimentation i tried to put support for these two cases into UprootSourceMapping._extract_base_form. That involved wrapping the .layout calls in transforms to check if the object on the stack is already a layout, to be able to do things like !content,!content. And i added a item transform for getting a record field that unfortunately had to become another special case since it needs to also receive the field name, so the form keys for these have !item'{field}' in them.

Finally, to actually pass a PHYSLITE file through NanoEventsFactory i needed to add a filter for branches because some of the branches are not meaningful and can't be read since they don't contain any data (and some have issues that need to be sorted out still). So for testing i added an argument iteritems_options that gets passed down to uproot's iteritems function where i can pass e.g. filter_name. So with these modifications the following works:

>>> from coffea.nanoevents.schemas import BaseSchema
>>> from coffea.nanoevents.factory import NanoEventsFactory
def filter_name(branch):
    if branch.endswith("."):
        return False
    if "::" in branch:
        return False
    if "neutralParticleLinks" in branch:
        return False
    if "AuxDyn." in branch or "Aux." in branch:
        return True
    return False

factory = NanoEventsFactory.from_root(
    example_file(),
    treepath="CollectionTree",
    schemaclass=BaseSchema,
    iteritems_options=dict(filter_name=filter_name)
)
>>> factory.events()["AnalysisJetsAuxDyn.NumTrkPt500"]
<Array [[[7, 0, 1, 0, 0, ... 0, 0, 0, 0, 0]]] type='100 * var * [var * int32, pa...'>
>>> factory.events()["AnalysisElectronsAuxDyn.trackParticleLinks"]
<Array [[], [], ... m_persKey: 0}]], []] type='100 * var * [var * struct[["m_per...'>

Of course this might not be the best way of doing things, so i'll stop here for now. Do you have any suggestions how to move on or do things better? I'm also thinking of extending the logic to recursively go through the form to support arbitrarily nested jagged arrays and records.
Then the question where to define such a branch filtering - in principle it would seem logical to have this then in the PHYSLITE scheme? But at the moment it already has to happen in the UprootSourceMapping._extract_base_form ...

@nsmith-
Copy link
Member

nsmith- commented Feb 11, 2021

The item transform is a bit suspicious.. in theory that would only be needed to access non-split data members. Otherwise, each split member should be accessible directly, as is the case for e.g.

 'AnalysisJetsAuxDyn.btaggingLink/AnalysisJetsAuxDyn.btaggingLink.m_persKey',
 'AnalysisJetsAuxDyn.btaggingLink/AnalysisJetsAuxDyn.btaggingLink.m_persIndex',

I think, and hopefully @jpivarski can confirm, that TTree does not support splitting a double-jagged object. If that is the case, then indeed !item transform would be a necessary feature.

I would also hope that filtering of branches can be performed at the schema. Is it a matter of suppressing warnings from _extract_base_form or that it actually raises an exception? (I guess I will try it and find out)

@jpivarski
Copy link
Contributor

I think, and hopefully @jpivarski can confirm, that TTree does not support splitting a double-jagged object. If that is the case, then indeed !item transform would be a necessary feature.

That is correct. The first jagged dimension gets split, but the rest do not. That's a large part of the motivation behind RNTuple.

@nikoladze
Copy link
Contributor Author

Yes, the double jagged element links can not be split automatically by root, that's why we need to read them as array of structs and split afterwards. But i think those are the only ones.

I would also hope that filtering of branches can be performed at the schema. Is it a matter of suppressing warnings from _extract_base_form or that it actually raises an exception? (I guess I will try it and find out)

Both - there are 2 cases where i would get an exception:

  • Branches that uproot can't read (unknown interpretation). Some of them don't contain any data, like EventInfo/SG::AuxElement/SG::IAuxElement and for others i haven't checked yet why it i does not work or if we even need them like McEventInfo/m_subEvents/m_subEvents.m_subEventInfo.m_event_type.m_bit_mask or PrimaryVerticesAuxDyn.neutralParticleLinks
  • Branches that have an interpretation in uproot but cannot be read into an awkward array (uproot raises CannotBeAwkward) which are typically also branches without data like AnalysisHLT_mu50 (without any .)

So one could either turn those Exceptions into warnings (will be quite a lot since essentially every collection with has one such non-data branch) or have some mechanism that i can filter them already before trying.

@jpivarski
Copy link
Contributor

  • Branches that have an interpretation in uproot but cannot be read into an awkward array (uproot raises CannotBeAwkward) which are typically also branches without data like AnalysisHLT_mu50 (without any .)

So one could either turn those Exceptions into warnings (will be quite a lot since essentially every collection with has one such non-data branch) or have some mechanism that i can filter them already before trying.

Also, as it came up here: scikit-hep/uproot5#267 "cannot be Awkward" is different from "cannot be read." A TTree containing TH1F, for example (weird, but it does happen), can be read if library="np". TH1F objects can't be Awkward Arrays, but they can be Python objects, and library="np" will make a NumPy array of Python objects (so dtype="O"). It's not fast, but it's possible.

Oh! That's the same file as in that discussion. So this is a small world, then.

In general, you should allow a file to be opened and examined even if some branches are not readable, since a particular use-case might not use those branches.

@nsmith-
Copy link
Member

nsmith- commented Feb 15, 2021

Seems to me we could wrap the interior of the _extract_base_form loop with a try-except block and pass these uproot exceptions.

@nsmith-
Copy link
Member

nsmith- commented Apr 8, 2021

I'm not sure where we left off on this PR but would be nice to reboot it. Certainly I've seen more cases where something is awkward-describable but not able to be in NanoEvents due to the limited scope of forms considered in _extract_base_form. It's good to keep in mind the goal of the nanoevents mappings layer and how it is different from uproot.lazy: it is to allow lazy cacheable awkward view of a root file that can be serialized. However, now that uproot files can be pickled scikit-hep/uproot5#302 maybe if we relax the "json form + string key" requirement we can just use directly the uproot.lazy implementation instead.

@nikoladze
Copy link
Contributor Author

Yes, we should reboot it, thanks for pinging on this! For the purpose of trying it out i put those try - except blocks in there. With that it now works without filtering, but of course gives a lot of warnings:

from coffea.nanoevents.schemas import BaseSchema
from coffea.nanoevents.factory import NanoEventsFactory
factory = NanoEventsFactory.from_root(
    "DAOD_PHYSLITE.art_split99.pool.root",
    treepath="CollectionTree",
    schemaclass=BaseSchema,
)

uproot.lazy also works, but here i have to include the filter

def filter_name(branch):
    if branch.endswith("."):
        return False
    if "::" in branch:
        return False
    if "neutralParticleLinks" in branch:
        return False
    if "AuxDyn." in branch or "Aux." in branch:
        return True
    return False

lazy_tree = uproot.lazy("DAOD_PHYSLITE.art_split99.pool.root:CollectionTree", filter_name=filter_name)

It's good to keep in mind the goal of the nanoevents mappings layer and how it is different from uproot.lazy: it is to allow lazy cacheable awkward view of a root file that can be serialized.

By that, is the main goal to serialize the lazy view, together with the already cached columns? In general it would of course be nice not having to duplicate all the code from uproot.lazy.

@nsmith-
Copy link
Member

nsmith- commented Apr 9, 2021

By that, is the main goal to serialize the lazy view, together with the already cached columns?

I'm imagining saving the plain json form (which is common across all chunks in a dataset) in one system, and then the cached columns in another system, but yes essentially that's the idea.

@lgray
Copy link
Collaborator

lgray commented Apr 22, 2021

Just pinging here as it has been two weeks since any update and it seems everyone was interested in seeing this PR through.
@nikoladze

@nsmith-
Copy link
Member

nsmith- commented Apr 22, 2021

If you'd like me to look into something in particular let me know. Otherwise maybe it makes sense to merge this as-is and ponder long-term plans later? I guess one thing we could put together is a Schema class for the DAOD, that is the other big piece of added value w.r.t. uproot.lazy.

@nikoladze
Copy link
Contributor Author

Sounds good - i think the modifications i made should not interfer when reading non-double-jagged branches, so if it's fine for you you can merge it and i can try to build the DAOD schema on top of that.

@lgray
Copy link
Collaborator

lgray commented Apr 23, 2021

@nikoladze cool - please resolve the conflicts with the base branch and we can move forward.

@lgray lgray marked this pull request as ready for review April 23, 2021 21:21
@lgray lgray merged commit 517a686 into CoffeaTeam:master Apr 23, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants