Add support multi-jagged arrays and records in `UprootSourceMapping._extract_base_form` #448

nikoladze · 2021-02-10T17:47:28Z

As discussed yesterday, as a first step to get DAOD_PHYSLITE supported with NanoEvents we need to get all branches that we want read into a base form. That includes for PHYSLITE a few multi-jagged arrays and records, such as the following:

def example_file(
    filename="DAOD_PHYSLITE.art_split99.pool.root",
    url="https://cernbox.cern.ch/index.php/s/xTuPIvSJsP3QmXa/download"
):
    import os
    if not os.path.exists(filename):
        print(f"Downloading {url} to {filename}")
        import requests
        res = requests.get(url)
        with open(filename, "wb") as of:
            of.write(res.content)
    return filename

>>> import uproot
>>> tree = uproot.open(f"{example_file()}:CollectionTree")
>>> # double jagged of record with fields "m_persKey", "m_persIndex"
>>> tree["AnalysisJetsAuxDyn.NumTrkPt500"].interpretation.awkward_form(None)
....
>>> # double jagged + record
>>> tree["AnalysisElectronsAuxDyn.trackParticleLinks"].interpretation.awkward_form(None)
...

For experimentation i tried to put support for these two cases into UprootSourceMapping._extract_base_form. That involved wrapping the .layout calls in transforms to check if the object on the stack is already a layout, to be able to do things like !content,!content. And i added a item transform for getting a record field that unfortunately had to become another special case since it needs to also receive the field name, so the form keys for these have !item'{field}' in them.

Finally, to actually pass a PHYSLITE file through NanoEventsFactory i needed to add a filter for branches because some of the branches are not meaningful and can't be read since they don't contain any data (and some have issues that need to be sorted out still). So for testing i added an argument iteritems_options that gets passed down to uproot's iteritems function where i can pass e.g. filter_name. So with these modifications the following works:

>>> from coffea.nanoevents.schemas import BaseSchema
>>> from coffea.nanoevents.factory import NanoEventsFactory

def filter_name(branch):
    if branch.endswith("."):
        return False
    if "::" in branch:
        return False
    if "neutralParticleLinks" in branch:
        return False
    if "AuxDyn." in branch or "Aux." in branch:
        return True
    return False

factory = NanoEventsFactory.from_root(
    example_file(),
    treepath="CollectionTree",
    schemaclass=BaseSchema,
    iteritems_options=dict(filter_name=filter_name)
)

>>> factory.events()["AnalysisJetsAuxDyn.NumTrkPt500"]
<Array [[[7, 0, 1, 0, 0, ... 0, 0, 0, 0, 0]]] type='100 * var * [var * int32, pa...'>
>>> factory.events()["AnalysisElectronsAuxDyn.trackParticleLinks"]
<Array [[], [], ... m_persKey: 0}]], []] type='100 * var * [var * struct[["m_per...'>

Of course this might not be the best way of doing things, so i'll stop here for now. Do you have any suggestions how to move on or do things better? I'm also thinking of extending the logic to recursively go through the form to support arbitrarily nested jagged arrays and records.
Then the question where to define such a branch filtering - in principle it would seem logical to have this then in the PHYSLITE scheme? But at the moment it already has to happen in the UprootSourceMapping._extract_base_form ...

…mapping

…tSourceMapping

nsmith- · 2021-02-11T21:07:22Z

The item transform is a bit suspicious.. in theory that would only be needed to access non-split data members. Otherwise, each split member should be accessible directly, as is the case for e.g.

 'AnalysisJetsAuxDyn.btaggingLink/AnalysisJetsAuxDyn.btaggingLink.m_persKey',
 'AnalysisJetsAuxDyn.btaggingLink/AnalysisJetsAuxDyn.btaggingLink.m_persIndex',

I think, and hopefully @jpivarski can confirm, that TTree does not support splitting a double-jagged object. If that is the case, then indeed !item transform would be a necessary feature.

I would also hope that filtering of branches can be performed at the schema. Is it a matter of suppressing warnings from _extract_base_form or that it actually raises an exception? (I guess I will try it and find out)

jpivarski · 2021-02-11T21:19:29Z

I think, and hopefully @jpivarski can confirm, that TTree does not support splitting a double-jagged object. If that is the case, then indeed !item transform would be a necessary feature.

That is correct. The first jagged dimension gets split, but the rest do not. That's a large part of the motivation behind RNTuple.

nikoladze · 2021-02-15T17:11:07Z

Yes, the double jagged element links can not be split automatically by root, that's why we need to read them as array of structs and split afterwards. But i think those are the only ones.

I would also hope that filtering of branches can be performed at the schema. Is it a matter of suppressing warnings from _extract_base_form or that it actually raises an exception? (I guess I will try it and find out)

Both - there are 2 cases where i would get an exception:

Branches that uproot can't read (unknown interpretation). Some of them don't contain any data, like EventInfo/SG::AuxElement/SG::IAuxElement and for others i haven't checked yet why it i does not work or if we even need them like McEventInfo/m_subEvents/m_subEvents.m_subEventInfo.m_event_type.m_bit_mask or PrimaryVerticesAuxDyn.neutralParticleLinks
Branches that have an interpretation in uproot but cannot be read into an awkward array (uproot raises CannotBeAwkward) which are typically also branches without data like AnalysisHLT_mu50 (without any .)

So one could either turn those Exceptions into warnings (will be quite a lot since essentially every collection with has one such non-data branch) or have some mechanism that i can filter them already before trying.

jpivarski · 2021-02-15T18:02:07Z

Branches that have an interpretation in uproot but cannot be read into an awkward array (uproot raises CannotBeAwkward) which are typically also branches without data like AnalysisHLT_mu50 (without any .)

So one could either turn those Exceptions into warnings (will be quite a lot since essentially every collection with has one such non-data branch) or have some mechanism that i can filter them already before trying.

Also, as it came up here: scikit-hep/uproot5#267 "cannot be Awkward" is different from "cannot be read." A TTree containing TH1F, for example (weird, but it does happen), can be read if library="np". TH1F objects can't be Awkward Arrays, but they can be Python objects, and library="np" will make a NumPy array of Python objects (so dtype="O"). It's not fast, but it's possible.

Oh! That's the same file as in that discussion. So this is a small world, then.

In general, you should allow a file to be opened and examined even if some branches are not readable, since a particular use-case might not use those branches.

nsmith- · 2021-02-15T19:00:16Z

Seems to me we could wrap the interior of the _extract_base_form loop with a try-except block and pass these uproot exceptions.

nsmith- · 2021-04-08T19:34:13Z

I'm not sure where we left off on this PR but would be nice to reboot it. Certainly I've seen more cases where something is awkward-describable but not able to be in NanoEvents due to the limited scope of forms considered in _extract_base_form. It's good to keep in mind the goal of the nanoevents mappings layer and how it is different from uproot.lazy: it is to allow lazy cacheable awkward view of a root file that can be serialized. However, now that uproot files can be pickled scikit-hep/uproot5#302 maybe if we relax the "json form + string key" requirement we can just use directly the uproot.lazy implementation instead.

nikoladze · 2021-04-09T18:00:19Z

Yes, we should reboot it, thanks for pinging on this! For the purpose of trying it out i put those try - except blocks in there. With that it now works without filtering, but of course gives a lot of warnings:

from coffea.nanoevents.schemas import BaseSchema
from coffea.nanoevents.factory import NanoEventsFactory
factory = NanoEventsFactory.from_root(
    "DAOD_PHYSLITE.art_split99.pool.root",
    treepath="CollectionTree",
    schemaclass=BaseSchema,
)

uproot.lazy also works, but here i have to include the filter

def filter_name(branch):
    if branch.endswith("."):
        return False
    if "::" in branch:
        return False
    if "neutralParticleLinks" in branch:
        return False
    if "AuxDyn." in branch or "Aux." in branch:
        return True
    return False

lazy_tree = uproot.lazy("DAOD_PHYSLITE.art_split99.pool.root:CollectionTree", filter_name=filter_name)

It's good to keep in mind the goal of the nanoevents mappings layer and how it is different from uproot.lazy: it is to allow lazy cacheable awkward view of a root file that can be serialized.

By that, is the main goal to serialize the lazy view, together with the already cached columns? In general it would of course be nice not having to duplicate all the code from uproot.lazy.

nsmith- · 2021-04-09T21:47:44Z

By that, is the main goal to serialize the lazy view, together with the already cached columns?

I'm imagining saving the plain json form (which is common across all chunks in a dataset) in one system, and then the cached columns in another system, but yes essentially that's the idea.

lgray · 2021-04-22T19:30:14Z

Just pinging here as it has been two weeks since any update and it seems everyone was interested in seeing this PR through.
@nikoladze

nsmith- · 2021-04-22T19:38:59Z

If you'd like me to look into something in particular let me know. Otherwise maybe it makes sense to merge this as-is and ponder long-term plans later? I guess one thing we could put together is a Schema class for the DAOD, that is the other big piece of added value w.r.t. uproot.lazy.

nikoladze · 2021-04-23T14:38:32Z

Sounds good - i think the modifications i made should not interfer when reading non-double-jagged branches, so if it's fine for you you can merge it and i can try to build the DAOD schema on top of that.

lgray · 2021-04-23T14:45:23Z

@nikoladze cool - please resolve the conflicts with the base branch and we can move forward.

nikoladze and others added 5 commits February 10, 2021 15:27

pass through iteritems kwargs and add double jagged arrays to Uproot …

0517433

…mapping

pass through iteritems kwargs already from NanoEventsFactory.from_root

9633ca2

Add support for double jagged arrays (possibly with records) in Uproo…

cc82505

…tSourceMapping

flake8

e90fbef

Merge branch 'master' into dev-physlite

3f5452f

turn exceptions into warnings

dbbc782

lgray added 4 commits April 23, 2021 15:52

Merge branch 'master' into dev-physlite

06c3ae9

black

8f8b9c4

black

78ad76e

oops

2368e6d

lgray marked this pull request as ready for review April 23, 2021 21:21

lgray mentioned this pull request Apr 23, 2021

Follow up on #448 - DAOD Schema #514

Closed

nsmith- approved these changes Apr 23, 2021

View reviewed changes

lgray merged commit 517a686 into CoffeaTeam:master Apr 23, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support multi-jagged arrays and records in `UprootSourceMapping._extract_base_form` #448

Add support multi-jagged arrays and records in `UprootSourceMapping._extract_base_form` #448

nikoladze commented Feb 10, 2021 •

edited

nsmith- commented Feb 11, 2021 •

edited

jpivarski commented Feb 11, 2021

nikoladze commented Feb 15, 2021

jpivarski commented Feb 15, 2021

nsmith- commented Feb 15, 2021

nsmith- commented Apr 8, 2021

nikoladze commented Apr 9, 2021

nsmith- commented Apr 9, 2021

lgray commented Apr 22, 2021 •

edited

nsmith- commented Apr 22, 2021 •

edited

nikoladze commented Apr 23, 2021

lgray commented Apr 23, 2021

Add support multi-jagged arrays and records in UprootSourceMapping._extract_base_form #448

Add support multi-jagged arrays and records in UprootSourceMapping._extract_base_form #448

Conversation

nikoladze commented Feb 10, 2021 • edited

nsmith- commented Feb 11, 2021 • edited

jpivarski commented Feb 11, 2021

nikoladze commented Feb 15, 2021

jpivarski commented Feb 15, 2021

nsmith- commented Feb 15, 2021

nsmith- commented Apr 8, 2021

nikoladze commented Apr 9, 2021

nsmith- commented Apr 9, 2021

lgray commented Apr 22, 2021 • edited

nsmith- commented Apr 22, 2021 • edited

nikoladze commented Apr 23, 2021

lgray commented Apr 23, 2021

Add support multi-jagged arrays and records in `UprootSourceMapping._extract_base_form` #448

Add support multi-jagged arrays and records in `UprootSourceMapping._extract_base_form` #448

nikoladze commented Feb 10, 2021 •

edited

nsmith- commented Feb 11, 2021 •

edited

lgray commented Apr 22, 2021 •

edited

nsmith- commented Apr 22, 2021 •

edited