PHYSLITE schema #527

Merged (31 commits, Jun 4, 2021)
Conversation

nikoladze
Contributor

This addresses #514. I started implementing a schema for DAOD_PHYSLITE. So far it's just zipping together the collections and assigning some basic LorentzVector behavior which I just copied over from my prototype.
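
For context, here is a minimal, hedged sketch (toy field and class names, not the code in this PR) of what attaching such a mixin-based behavior looks like in awkward 1.x:

import awkward as ak
import numpy as np

behavior = {}

# Toy mixin for illustration: derive px/py from pt/phi fields.
@ak.mixin_class(behavior)
class ToyParticle:
    @property
    def px(self):
        return self.pt * np.cos(self.phi)

    @property
    def py(self):
        return self.pt * np.sin(self.phi)

particles = ak.zip(
    {"pt": [[10.0, 20.0], []], "phi": [[0.0, 1.5], []]},
    with_name="ToyParticle",
    behavior=behavior,
)
particles.px  # computed on the fly from pt and phi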

So far the following works:

from coffea.nanoevents.schemas import PHYSLITESchema
from coffea.nanoevents.factory import NanoEventsFactory

# file from https://cernbox.cern.ch/index.php/s/xTuPIvSJsP3QmXa/download
factory = NanoEventsFactory.from_root(
    "DAOD_PHYSLITE.art_split99.pool.root",
    treepath="CollectionTree",
    schemaclass=PHYSLITESchema
)
events = factory.events()
>>> events.Electrons
<xAODParticleArray [[], [], ... truthType: 17}], []] type='100 * var * xAODParti...'>
>>> events.Electrons.pt
<Array [[], [], [9.46e+04, ... [6.91e+03], []] type='100 * var * float32[paramet...'>
>>> events.Electrons.px
<Array [[], [], [-9.39e+04, ... [6.81e+03], []] type='100 * var * float32'>
>>> events.GSFTrackParticles
<xAODTrackParticleArray [[], ... z0: 0.426}]] type='100 * var * xAODTrackParticl...'>
>>> events.GSFTrackParticles.pt
<Array [[], [9.46e+03, ... 3.11e+03, 2.44e+03]] type='100 * var * float32'>

Besides refining the behavior and adding other things like EventInfo, the next big step will be the cross references. From looking over the code for NanoAOD, it looks like everything is there to do basic indexing from one collection into another, even with nested indices (that result in a double-jagged array). I just need to figure out how best to create the global indices. Maybe I'll have to add or modify some of the transforms to get them, since I don't get the local indices from counts but need to pop them out of already existing awkward arrays that uproot created.

Example for a local index, resulting in a single jagged array:

>>> events.Muons["combinedTrackParticleLink.m_persIndex"]
<Array [[], [0], [], [], ... 0], [0], [0, 1]] type='100 * var * uint32[parameter...'>

which should link into events.CombinedMuonTrackParticles. Without global indices, that would be something like

>>> muon_track_link = events.Muons["combinedTrackParticleLink.m_persIndex"]
>>> muon_track_key = events.Muons["combinedTrackParticleLink.m_persKey"]
>>> events.CombinedMuonTrackParticles[muon_track_link.mask[muon_track_key != 0]]
<xAODTrackParticleArray [[], ... z0: 12.5}]] type='100 * var * ?xAODTrackParticl...'>
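
To illustrate the global-index idea with something self-contained (toy arrays, not the actual transforms of this PR), turning a per-event local index into a global index into the flattened target collection could look like this:

import awkward as ak
import numpy as np

# Toy stand-ins: a target collection and per-event local indices into it.
target = ak.Array([[10.0, 11.0], [12.0], [13.0, 14.0, 15.0]])
local_index = ak.Array([[1], [], [0, 2]])

# Start of each event in the flattened target collection.
counts = np.asarray(ak.num(target))
starts = np.concatenate([[0], np.cumsum(counts)[:-1]])

# Broadcasting adds each event's start offset to its local indices.
global_index = local_index + starts

# Index the flattened target and restore the per-event structure.
flat_target = ak.flatten(target)
linked = ak.unflatten(flat_target[ak.flatten(global_index)], ak.num(global_index))
# linked.tolist() -> [[11.0], [], [13.0, 15.0]]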

Example for a local (nested) index, resulting in a double jagged array:

>>> events.Electrons.trackParticleLinks.m_persIndex
<Array [[], [], [[0]], ... [], [[0, 0]], []] type='100 * var * var * uint32'>

which should link into events.GSFTrackParticles. I have an example for that, but I hope this can also be bent somehow into the nestedindex machinery that already exists.

A bit more advanced would then be to link into multiple collections, resulting in a UnionArray. In principle, all these ElementLinks have an m_persKey field that contains hashes of the collections this index links into. In most cases (like the examples above) this is always the same hash, but in some cases there can be several, e.g.

>>> import numpy as np
>>> import awkward as ak
>>> np.unique(ak.to_numpy(ak.flatten(events.TruthElectrons.parentLinks.m_persKey, axis=None)))
array([        0, 375408000, 614719239, 660928181, 779635413],
      dtype=uint32)

which are the hashes for TruthTaus, TruthBoson, TruthTop and TruthBottom. I think 0 should be masked out, since those entries probably correspond to links to particles that don't exist anymore (maybe thinned away).

I'll try out a few more things, but let me know if you have ideas or suggestions on how to do these things.

@lgray
Collaborator

lgray commented May 12, 2021

This might be an interesting place to do unit conversions from MeV to GeV. ;-)

@lgray
Collaborator

lgray commented May 12, 2021

Aside from the inter-experiment snark: this is awesome, I'll have a careful look through it. :-)

class PHYSLITESchema(BaseSchema):

    mixins = {
        "Electrons": "xAODParticle",
Collaborator

Are there no further specializations to electrons/muons/etc. that you'd like to do? It's still a draft so I suppose you may be working on that.

Contributor Author

Yes, there are going to be more specializations - e.g. the cross references will be different for Electrons/Muons/Jets

@lgray
Collaborator

lgray commented May 13, 2021

Do you have like a 40-event DAOD file that we can put into tests?

Then we can automate running test processors and such on it.

@nikoladze
Contributor Author

I experimented a bit more to get cross references working (added some comments inline). I think it should work in principle; the way I implemented it can probably be improved. Let me know if you have ideas.

But there is another issue now: I get the wrong integer types for my ElementLink arrays. If I ask uproot for the awkward form of this branch:

# file from https://cernbox.cern.ch/index.php/s/xTuPIvSJsP3QmXa/download
tree = uproot.open("DAOD_PHYSLITE.art_split99.pool.root:CollectionTree")
tree["AnalysisElectronsAuxDyn.trackParticleLinks"].interpretation.awkward_form(None)

it will give me uint32 for the record fields:

...
"content": {
            "class": "RecordArray",
            "contents": {
                "m_persIndex": "uint32",
                "m_persKey": "uint32"
            },
...

But when I actually retrieve that array with uproot, it's suddenly int64:

tree["AnalysisElectronsAuxDyn.trackParticleLinks"].array().layout.form
{
    "class": "ListOffsetArray64",
    "offsets": "i64",
    "content": {
        "class": "ListOffsetArray64",
        "offsets": "i64",
        "content": {
            "class": "RecordArray",
            "contents": {
                "m_persIndex": "int64",
                "m_persKey": "int64"
            },
            "parameters": {
                "__record__": "ElementLink<DataVector<xAOD::TrackParticle_v1>>"
            }
        }
    }
}

Any ideas how to circumvent this?

@nikoladze
Contributor Author

Do you have like a 40-event DAOD file that we can put into tests?

Then we can automate running test processors and such on it.

The file from https://cernbox.cern.ch/index.php/s/xTuPIvSJsP3QmXa/download has 100 events and is 2.8 MB. If that's still too large I could also make a smaller one.

@lgray
Collaborator

lgray commented May 13, 2021

Do you have like a 40-event DAOD file that we can put into tests?
Then we can automate running test processors and such on it.

The file from https://cernbox.cern.ch/index.php/s/xTuPIvSJsP3QmXa/download has 100 events and is 2.8 MB. If that's still too large I could also make a smaller one.

That's fine, we have other ~2 MB files in test/samples.

@nikoladze
Contributor Author

About the issue I mentioned before, where the link indices and keys become int64 although the awkward form says uint32: reading a bit through the code of uproot.lazy, I noticed there is special treatment for forms where the arrays are fetched via ak.from_iter, so that's probably something I would need as a temporary solution here as well, until these branches can be read directly, maybe with Awkward Forth in the future.

For now I just hardcoded the form for the element links to have int64. I also added a small file in tests/samples with 40 events.
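
For illustration, here is a hedged sketch of what "hardcoding the form" amounts to; the dictionary below is a trimmed stand-in for the real branch form, not the code in this PR:

import copy

# Trimmed stand-in for the awkward form uproot reports for an ElementLink branch.
link_form = {
    "class": "ListOffsetArray64",
    "offsets": "i64",
    "content": {
        "class": "RecordArray",
        "contents": {"m_persIndex": "uint32", "m_persKey": "uint32"},
        "parameters": {
            "__record__": "ElementLink<DataVector<xAOD::TrackParticle_v1>>"
        },
    },
}

# Override the integer primitives so the form matches what uproot actually returns.
patched = copy.deepcopy(link_form)
patched["content"]["contents"] = {
    field: "int64" for field in patched["content"]["contents"]
}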

Now I can have cross references using global indices:

import os
from coffea.nanoevents import NanoEventsFactory, PHYSLITESchema
factory = NanoEventsFactory.from_root(
    os.path.abspath("tests/samples/DAOD_PHYSLITE_21.2.108.0.art.pool.root"),
    treepath="CollectionTree",
    schemaclass=PHYSLITESchema
)
events = factory.events()
>>> events.Electrons.trackParticles
<xAODTrackParticleArray [[], [], ... vz: 0, z0: -0.0982}]]] type='40 * var * var...'>
>>> events[30:].Electrons.trackParticles.pt.tolist()
[[], [], [], [], [], [[100500.9765625]], [], [], [[36525.61328125]], [[27814.41015625, 18714.423828125]]]
>>> events[32:].Electrons.trackParticles.pt.tolist()
[[], [], [], [[100500.9765625]], [], [], [[36525.61328125]], [[27814.41015625, 18714.423828125]]]

I also experimented a bit with links into multiple collections and made a function that creates a union array from them:

>>> events.TruthBoson.children
<Array [[[], ... status: 1}]]] type='40 * var * var * ?union[?xAODTruthParticle[...'>

which creates a union array of all the collections linked into, by evaluating np.unique on the events.TruthBoson.childLinks.m_persKey array. The key-to-branch-name mapping is currently hardcoded, but in principle it can be read from the MetaData/EventFormat tree. Also, unfortunately, I had to put all potential combinations of links into the form, because the global indices have to be created from the form ...
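
As a rough, hedged sketch of the underlying idea (toy data and made-up key values, not the function in this PR), a union array can be assembled from an ElementLink-style (key, index) pair like this:

import awkward as ak
import numpy as np

# Toy flattened target collections, keyed by made-up m_persKey hashes.
key_to_target = {
    111: ak.Array([10.0, 11.0, 12.0]),
    222: ak.Array([20.0, 21.0]),
}
keys = np.array([111, 222, 111, 222, 111])
indices = np.array([0, 1, 2, 0, 1])

# tags pick which collection each link points into, index picks the element.
tags = ak.layout.Index8((keys == 222).astype(np.int8))
index = ak.layout.Index64(indices.astype(np.int64))
union = ak.Array(
    ak.layout.UnionArray8_64(
        tags, index, [key_to_target[111].layout, key_to_target[222].layout]
    )
)
# union.tolist() -> [10.0, 21.0, 12.0, 20.0, 11.0]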

So I'm not quite sure if this is the best way to do it; at the moment I see it more as a proof of concept. Maybe there is a better way to deal with these links in general ...

Also, I don't quite know how to attach behavior to the union arrays.

Finally, one thing I haven't quite understood yet: the way I use the _apply_global_index function, it seems the virtual arrays I link to get materialized, e.g. if I do

events.Electrons.trackParticles.layout

there is no VirtualArray anymore.

@nikoladze
Contributor Author

I figured out a few more things:

The behavior for union arrays works if I add __record__ on the top level, so now I can do the famous "cyclic" references in NanoEvents:

>>> events.TruthBoson.children
<TruthParticleArray [[[], ... TruthParticle]]] type='40 * truthParticle'>
>>> events.TruthBoson.children.parents
<TruthParticleArray [[[], ... [TruthParticle]]]] type='40 * truthParticle'>
>>> events.TruthBoson.children.parents.children
<TruthParticleArray [[[], ... TruthParticle]]]]] type='40 * truthParticle'>
>>> events.TruthBoson.children.parents.children.ndim
5

Note: needs the fix from scikit-hep/awkward#870

However, as you can see, the overridden typestring (with _set_repr_name, which I copied from the NanoAOD behavior) does not work anymore here, since I had to add the __record__ on the top level. But the arrays seem to be created correctly (although I haven't really tested it yet), so for now it's more an aesthetic issue.

Also, I found that adding the collection_name parameters prevents the index arrays from being materialized. I have not quite understood how that happens though ...

Note on performance: in current uproot, loading the ElementLink branches is quite slow, so as a temporary solution for playing with larger ROOT files I wrote a small hack that hijacks the extract_column function of UprootSourceMapping to use my custom Numba deserialization functions, to be used like

# pip install git+https://gitlab.cern.ch/nihartma/physlite-experiments.git
from physlite_experiments.deserialization_hacks import patch_nanoevents
patch_nanoevents()

@nikoladze
Contributor Author

While testing the element links I found scikit-hep/awkward#876, which is already fixed. After the fix the tests should pass.

@lgray
Collaborator

lgray commented Jun 1, 2021

Nice, a new awkward is being built. Once that is done baking, we should require awkward >=1.3.0 with this branch.

@lgray
Collaborator

lgray commented Jun 1, 2021

@nikoladze Can you add the awkward array version requirement to setup.py in this PR? Thanks!
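
For reference, a hedged sketch of the kind of pin being asked for (the actual setup.py has more entries, not reproduced here):

# sketch only: the relevant dependency pin in setup.py's install_requires
install_requires = [
    "awkward>=1.3.0",
    # ... the other existing dependencies stay unchanged
]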

@lgray
Collaborator

lgray commented Jun 2, 2021

Are you ready to switch this from a draft PR, and have it reviewed?

@nikoladze nikoladze marked this pull request as ready for review June 2, 2021 15:02
Member

@nsmith- nsmith- left a comment

This looks really nice! I have some proposed simplifications, but for the most part it looks good to go.

def trackParticle(self):
    trackParticles = self.trackParticles
    return self.trackParticles[
        tuple([slice(None) for i in range(trackParticles.ndim - 1)] + [0])
    ]
Member

Could be self.trackParticles[..., 0]?
By the way, nice catch to guard against accessing attributes in objects that may be nested further than the default 2 dimensions.

Contributor Author

That did not work; awkward complains with ValueError: ellipsis (...) can't be used on a data structure of different depths. Should that work? Then I can open an issue for awkward array.

Actually, I noticed it for a lower number of dimensions, e.g. when I do events[2].Electrons.trackParticle

Member

I would think that as long as array.ndim is well-defined, ellipsis is as well (provided no field indexers are used in the getitem). @jpivarski thoughts?

Contributor

That error message about ellipsis is talking about an issue with unions (so, unless it's not what I'm thinking, it's not a bug).

>>> array = ak.Array([[1], [2], [3]])
>>> array[..., 0]
<Array [1, 2, 3] type='3 * int64'>
>>> array = ak.Array([[1], [2], [3], [[4], [4], [4], [4]]])
>>> array[..., 0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/miniconda3/lib/python3.8/site-packages/awkward/highlevel.py", line 993, in __getitem__
    return ak._util.wrap(self.layout[where], self._behavior)
ValueError: ellipsis (...) can't be used on a data structure of different depths

(https://github.com/scikit-hep/awkward-1.0/blob/1.2.2/src/libawkward/Content.cpp#L1491)

Arrays always have an ndim, and this is the "purelist depth". A union like the above doesn't have an easily definable depth. I had thought that I made this return -1 in cases like that, but apparently, it gives the depth of the UnionArray object:

>>> array.ndim                    # high-level
0
>>> array.layout.purelist_depth   # low-level
0

That's not ideal: it doesn't give more information than if it had been a RecordArray at that level. But I don't know if this addresses your question.

Contributor Author

I looked a bit at what's going on for this particular array. It seems the issue here is not that it is a union, but that one field has a higher dimension. There is definingParametersCovarianceMatrix, which has one more dimension than the other fields (but is still zipped on the outermost level of offsets). So a small example for that situation would be

import awkward as ak
array = ak.zip(
    {"a" : [[1], [2, 3]], "b" : [[[10, 10]], [[20, 20], [30, 30]]]},
    depth_limit=2
)

The resulting array actually has ndim=2, which is what I use in that code for trackParticle.

>>> array.ndim
2

I want this:

>>> array[:, 0].tolist()
[{'a': 1, 'b': [10, 10]}, {'a': 2, 'b': [20, 20]}]

The ellipsis doesn't work, presumably because of the extra dimension in the field b:

>>> array[..., 0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nikolai/.local/lib/python3.9/site-packages/awkward/highlevel.py", line 993, in __getitem__
    return ak._util.wrap(self.layout[where], self._behavior)
ValueError: ellipsis (...) can't be used on a data structure of different depths

(https://github.com/scikit-hep/awkward-1.0/blob/1.3.0/src/libawkward/Content.cpp#L1491)
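
For completeness, a hedged illustration: the tuple-of-slices construction from trackParticle above, applied to the same toy array, is equivalent to array[:, 0] here since ndim is 2:

# Slice all but the last dimension, then take the first element of the last one.
idx = tuple([slice(None)] * (array.ndim - 1) + [0])
array[idx].tolist()
# [{'a': 1, 'b': [10, 10]}, {'a': 2, 'b': [20, 20]}]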

@nsmith-
Member

nsmith- commented Jun 2, 2021

Also i don't quite know how to attach behavior to the union arrays.

This might be a bug, perhaps related to scikit-hep/awkward#770
In principle a union array should be able to have a behavior.

Finally one thing i haven't quite understood yet: the way i use the _apply_global_index function it seems the virtual arrays i link to get materialized, e.g. if i do

events.Electrons.trackParticles.layout

there is no VirtualArray anymore

This might be related to the lack of a collection_name in the layout. I think what is happening comes down to this function:

def _getlistarray(self):
    """Do some digging to find the initial listarray"""

    def descend(layout, depth):
        islistarray = isinstance(
            layout,
            (
                awkward.layout.ListOffsetArray32,
                awkward.layout.ListOffsetArray64,
            ),
        )
        if islistarray and layout.content.parameter("collection_name") is not None:
            return lambda: layout

    return awkward._util.recursively_apply(self.layout, descend)
If no collection_name is encountered, then awkward._util.recursively_apply(self.layout, descend) just returns the original (but now materialized) array, whereas if a collection_name is encountered, it returns the flat collection RecordArray without materializing its constituent fields. The purpose of that function is to descend through potentially extraneous array nodes like VirtualArray, since in general VirtualArray nodes in layouts can disappear in certain circumstances.
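
A minimal, hedged illustration of where that parameter has to live for the lookup above to find it (toy collection and name, not coffea's actual zipping code):

import awkward as ak

# The lookup above checks the parameter on the list's content (the RecordArray),
# so that is where it has to be attached.
collection = ak.zip({"pt": [[1.0], [2.0, 3.0]], "eta": [[0.1], [0.2, 0.3]]})
collection.layout.content.setparameter("collection_name", "Electrons")
collection.layout.content.parameter("collection_name")  # 'Electrons'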

@lgray
Collaborator

lgray commented Jun 3, 2021

@nikoladze if we can wrap this up today or tomorrow we can have it in the next coffea release - which would be pretty snazzy.

@nikoladze
Contributor Author

@nikoladze if we can wrap this up today or tomorrow we can have it in the next coffea release - which would be pretty snazzy.

Fine for me. Of course this is still a work in progress, but so is DAOD_PHYSLITE, so I think it won't hurt to have some basic functionality already included in coffea.

I tried to get the PHYSLITESchema included in the docs, which seemed to work more or less. I just didn't manage it yet for nanoevents.methods.physlite: when I add it to docs/source/reference.rst and try to compile the documentation I get an error

Extension error (sphinx.ext.linkcode):
Handler <function doctree_read at 0x7fdd9b47e4c0> for event 'doctree-read' threw an exception (exception: could not find class definition)

So here is a TODO list for the future:

  • Complete the lists of possible cross-reference mappings
  • Complete the mixin classes, add all important methods
  • Consider making Records of Unions instead of Unions of Records for the cases with links into multiple collections
  • Write more tests

... and I'm sure more things will come up.

Today I won't be able to get more work done on this, so if you think there are still significant things to do before this can be merged, it is of course also fine to just have it in the next release. Thanks a lot!

@lgray
Collaborator

lgray commented Jun 4, 2021

So here is a TODO list for the future:

Complete the lists of possible cross-reference mappings
Complete the mixin classes, add all important methods
Consider making Records of Unions instead of Unions of Records for the cases with links into multiple collections
Write more tests

Please make a tracking issue for these and use checkboxes or whatever you like to mark completion.

Unless @nsmith- has something further that needs addressing, I'd like to go ahead and merge this; we can figure out the docs stuff in time (or I'll fix it before minting a release).

@lgray lgray merged commit de8e792 into CoffeaTeam:master Jun 4, 2021