PHYSLITE schema #527

Merged (31 commits, Jun 4, 2021)
Conversation

nikoladze
Contributor

This addresses #514. I started implementing a schema for DAOD_PHYSLITE. So far it's just zipping together the collections and assigning some basic LorentzVector behavior which I just copied over from my prototype.
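
For context, here is a minimal, hedged sketch (toy field and class names, not the code in this PR) of what attaching such a mixin-based behavior looks like in awkward 1.x:

import awkward as ak
import numpy as np

behavior = {}

# Toy mixin for illustration: derive px/py from pt/phi fields.
@ak.mixin_class(behavior)
class ToyParticle:
    @property
    def px(self):
        return self.pt * np.cos(self.phi)

    @property
    def py(self):
        return self.pt * np.sin(self.phi)

particles = ak.zip(
    {"pt": [[10.0, 20.0], []], "phi": [[0.0, 1.5], []]},
    with_name="ToyParticle",
    behavior=behavior,
)
particles.px  # computed on the fly from pt and phi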

So far the following works:

from coffea.nanoevents.schemas import PHYSLITESchema
from coffea.nanoevents.factory import NanoEventsFactory

# file from https://cernbox.cern.ch/index.php/s/xTuPIvSJsP3QmXa/download
factory = NanoEventsFactory.from_root(
    "DAOD_PHYSLITE.art_split99.pool.root",
    treepath="CollectionTree",
    schemaclass=PHYSLITESchema
)
events = factory.events()
>>> events.Electrons
<xAODParticleArray [[], [], ... truthType: 17}], []] type='100 * var * xAODParti...'>
>>> events.Electrons.pt
<Array [[], [], [9.46e+04, ... [6.91e+03], []] type='100 * var * float32[paramet...'>
>>> events.Electrons.px
<Array [[], [], [-9.39e+04, ... [6.81e+03], []] type='100 * var * float32'>
>>> events.GSFTrackParticles
<xAODTrackParticleArray [[], ... z0: 0.426}]] type='100 * var * xAODTrackParticl...'>
>>> events.GSFTrackParticles.pt
<Array [[], [9.46e+03, ... 3.11e+03, 2.44e+03]] type='100 * var * float32'>

Besides refining the behavior and adding other things like EventInfo, the next big step will be the cross references. From looking over the code for NanoAOD, it looks like everything is there to do basic indexing from one collection into another, even with nested indices (that result in a double-jagged array). I just need to figure out how best to create the global indices. Maybe I'll have to add or modify some of the transforms to get them, since I don't get the local indices from counts but need to pop them out of already existing awkward arrays that uproot created.

Example for a local index, resulting in a single jagged array:

>>> events.Muons["combinedTrackParticleLink.m_persIndex"]
<Array [[], [0], [], [], ... 0], [0], [0, 1]] type='100 * var * uint32[parameter...'>

which should link into events.CombinedMuonTrackParticles. Without global indices, that would be something like

>>> muon_track_link = events.Muons["combinedTrackParticleLink.m_persIndex"]
>>> muon_track_key = events.Muons["combinedTrackParticleLink.m_persKey"]
>>> events.CombinedMuonTrackParticles[muon_track_link.mask[muon_track_key != 0]]
<xAODTrackParticleArray [[], ... z0: 12.5}]] type='100 * var * ?xAODTrackParticl...'>
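
To illustrate the global-index idea with something self-contained (toy arrays, not the actual transforms of this PR), turning a per-event local index into a global index into the flattened target collection could look like this:

import awkward as ak
import numpy as np

# Toy stand-ins: a target collection and per-event local indices into it.
target = ak.Array([[10.0, 11.0], [12.0], [13.0, 14.0, 15.0]])
local_index = ak.Array([[1], [], [0, 2]])

# Start of each event in the flattened target collection.
counts = np.asarray(ak.num(target))
starts = np.concatenate([[0], np.cumsum(counts)[:-1]])

# Broadcasting adds each event's start offset to its local indices.
global_index = local_index + starts

# Index the flattened target and restore the per-event structure.
flat_target = ak.flatten(target)
linked = ak.unflatten(flat_target[ak.flatten(global_index)], ak.num(global_index))
# linked.tolist() -> [[11.0], [], [13.0, 15.0]]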

Example for a local (nested) index, resulting in a double jagged array:

>>> events.Electrons.trackParticleLinks.m_persIndex
<Array [[], [], [[0]], ... [], [[0, 0]], []] type='100 * var * var * uint32'>

which should link into events.GSFTrackParticles. I have an example for that, but I hope this can also be bent somehow into the nestedindex machinery that already exists.

A bit more advanced would then be to link into multiple collections, resulting in a UnionArray. In principle, all these ElementLinks have an m_persKey field that contains hashes of the collections this index links into. In most cases (like the examples above) this is always the same hash, but in some cases there can be several, e.g.

>>> import numpy as np
>>> import awkward as ak
>>> np.unique(ak.to_numpy(ak.flatten(events.TruthElectrons.parentLinks.m_persKey, axis=None)))
array([        0, 375408000, 614719239, 660928181, 779635413],
      dtype=uint32)

which are the hashes for TruthTaus, TruthBoson, TruthTop and TruthBottom. I think 0 should be masked out, since those entries probably correspond to links to particles that don't exist anymore (maybe thinned away).

I'll try out a few more things, but let me know if you have ideas or suggestions on how to do these things.

@lgray
Collaborator

lgray commented May 12, 2021

This might be an interesting place to do unit conversions from MeV to GeV. ;-)

@lgray
Collaborator

lgray commented May 12, 2021

Aside from the inter-experiment snark: this is awesome, I'll have a careful look through it. :-)

class PHYSLITESchema(BaseSchema):

    mixins = {
        "Electrons": "xAODParticle",
Collaborator

Are there no further specializations to electrons/muons/etc. that you'd like to do? It's still a draft so I suppose you may be working on that.

Contributor Author

Yes, there are going to be more specializations - e.g. the cross references will be different for Electrons/Muons/Jets

@lgray
Collaborator

lgray commented May 13, 2021

Do you have like a 40-event DAOD file that we can put into tests?

Then we can automate running test processors and such on it.

@nikoladze
Contributor Author

I experimented a bit more to get cross references working (added some comments inline). I think it should work in principle; the way I implemented it can probably be improved. Let me know if you have ideas.

But there is another issue now: I get the wrong integer types for my ElementLink arrays. If I ask uproot for the awkward form of this branch:

# file from https://cernbox.cern.ch/index.php/s/xTuPIvSJsP3QmXa/download
tree = uproot.open("DAOD_PHYSLITE.art_split99.pool.root:CollectionTree")
tree["AnalysisElectronsAuxDyn.trackParticleLinks"].interpretation.awkward_form(None)

it will give me uint32 for the record fields:

...
"content": {
            "class": "RecordArray",
            "contents": {
                "m_persIndex": "uint32",
                "m_persKey": "uint32"
            },
...

But when I actually retrieve that array with uproot, it's suddenly int64:

tree["AnalysisElectronsAuxDyn.trackParticleLinks"].array().layout.form
{
    "class": "ListOffsetArray64",
    "offsets": "i64",
    "content": {
        "class": "ListOffsetArray64",
        "offsets": "i64",
        "content": {
            "class": "RecordArray",
            "contents": {
                "m_persIndex": "int64",
                "m_persKey": "int64"
            },
            "parameters": {
                "__record__": "ElementLink<DataVector<xAOD::TrackParticle_v1>>"
            }
        }
    }
}

Any ideas how to circumvent this?

@nikoladze
Contributor Author

Do you have like a 40-event DAOD file that we can put into tests?

Then we can automate running test processors and such on it.

The file from https://cernbox.cern.ch/index.php/s/xTuPIvSJsP3QmXa/download has 100 events and is 2.8 MB. If that's still too large I could also make a smaller one.

@lgray
Collaborator

lgray commented May 13, 2021

Do you have like a 40-event DAOD file that we can put into tests?
Then we can automate running test processors and such on it.

The file from https://cernbox.cern.ch/index.php/s/xTuPIvSJsP3QmXa/download has 100 events and is 2.8 MB. If that's still too large I could also make a smaller one.

That's fine, we have other ~2 MB files in test/samples.

@nikoladze
Contributor Author

About the issue I mentioned before, where the link indices and keys become int64 although the awkward form says uint32: reading a bit through the code of uproot.lazy, I noticed there is special treatment for forms where the arrays are fetched via ak.from_iter, so that's probably something I would need as a temporary solution here as well, until these branches can be read directly, maybe with Awkward Forth in the future.

For now I just hardcoded the form for the element links to have int64. I also added a small file in tests/samples with 40 events.
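
For illustration, here is a hedged sketch of what "hardcoding the form" amounts to; the dictionary below is a trimmed stand-in for the real branch form, not the code in this PR:

import copy

# Trimmed stand-in for the awkward form uproot reports for an ElementLink branch.
link_form = {
    "class": "ListOffsetArray64",
    "offsets": "i64",
    "content": {
        "class": "RecordArray",
        "contents": {"m_persIndex": "uint32", "m_persKey": "uint32"},
        "parameters": {
            "__record__": "ElementLink<DataVector<xAOD::TrackParticle_v1>>"
        },
    },
}

# Override the integer primitives so the form matches what uproot actually returns.
patched = copy.deepcopy(link_form)
patched["content"]["contents"] = {
    field: "int64" for field in patched["content"]["contents"]
}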

Now I can have cross references using global indices:

import os
from coffea.nanoevents import NanoEventsFactory, PHYSLITESchema
factory = NanoEventsFactory.from_root(
    os.path.abspath("tests/samples/DAOD_PHYSLITE_21.2.108.0.art.pool.root"),
    treepath="CollectionTree",
    schemaclass=PHYSLITESchema
)
events = factory.events()
>>> events.Electrons.trackParticles
<xAODTrackParticleArray [[], [], ... vz: 0, z0: -0.0982}]]] type='40 * var * var...'>
>>> events[30:].Electrons.trackParticles.pt.tolist()
[[], [], [], [], [], [[100500.9765625]], [], [], [[36525.61328125]], [[27814.41015625, 18714.423828125]]]
>>> events[32:].Electrons.trackParticles.pt.tolist()
[[], [], [], [[100500.9765625]], [], [], [[36525.61328125]], [[27814.41015625, 18714.423828125]]]

I also experimented a bit with links into multiple collections and made a function that creates a union array from them:

>>> events.TruthBoson.children
<Array [[[], ... status: 1}]]] type='40 * var * var * ?union[?xAODTruthParticle[...'>

which creates a union array of all the collections linked into, by evaluating np.unique on the events.TruthBoson.childLinks.m_persKey array. The key-to-branch-name mapping is currently hardcoded, but in principle it can be read from the MetaData/EventFormat tree. Also, unfortunately, I had to put all potential combinations of links into the form, because the global indices have to be created from the form ...
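
As a rough, hedged sketch of the underlying idea (toy data and made-up key values, not the function in this PR), a union array can be assembled from an ElementLink-style (key, index) pair like this:

import awkward as ak
import numpy as np

# Toy flattened target collections, keyed by made-up m_persKey hashes.
key_to_target = {
    111: ak.Array([10.0, 11.0, 12.0]),
    222: ak.Array([20.0, 21.0]),
}
keys = np.array([111, 222, 111, 222, 111])
indices = np.array([0, 1, 2, 0, 1])

# tags pick which collection each link points into, index picks the element.
tags = ak.layout.Index8((keys == 222).astype(np.int8))
index = ak.layout.Index64(indices.astype(np.int64))
union = ak.Array(
    ak.layout.UnionArray8_64(
        tags, index, [key_to_target[111].layout, key_to_target[222].layout]
    )
)
# union.tolist() -> [10.0, 21.0, 12.0, 20.0, 11.0]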

So I'm not quite sure if this is the best way to do it; at the moment I see it more as a proof of concept. Maybe there is a better way to deal with these links in general ...

Also, I don't quite know how to attach behavior to the union arrays.

Finally, one thing I haven't quite understood yet: the way I use the _apply_global_index function, it seems the virtual arrays I link to get materialized, e.g. if I do

events.Electrons.trackParticles.layout

there is no VirtualArray anymore.

@nikoladze
Contributor Author

I figured out a few more things:

The behavior for union arrays works if I add __record__ on the top level, so now I can do the famous "cyclic" references in NanoEvents:

>>> events.TruthBoson.children
<TruthParticleArray [[[], ... TruthParticle]]] type='40 * truthParticle'>
>>> events.TruthBoson.children.parents
<TruthParticleArray [[[], ... [TruthParticle]]]] type='40 * truthParticle'>
>>> events.TruthBoson.children.parents.children
<TruthParticleArray [[[], ... TruthParticle]]]]] type='40 * truthParticle'>
>>> events.TruthBoson.children.parents.children.ndim
5

Note: needs the fix from scikit-hep/awkward#870

However, as you can see, the overridden typestring (with _set_repr_name, which I copied from the NanoAOD behavior) does not work anymore here, since I had to add the __record__ on the top level. But the arrays seem to be created correctly (although I haven't really tested it yet), so for now it's more an aesthetic issue.

Also, I found that adding the collection_name parameters prevents the index arrays from being materialized. I have not quite understood how that happens though ...

Note on performance: in current uproot, loading the ElementLink branches is quite slow, so as a temporary solution for playing with larger ROOT files I wrote a small hack that hijacks the extract_column function of UprootSourceMapping to use my custom Numba deserialization functions, to be used like

# pip install git+https://gitlab.cern.ch/nihartma/physlite-experiments.git
from physlite_experiments.deserialization_hacks import patch_nanoevents
patch_nanoevents()

@nikoladze
Contributor Author

While testing the element links I found scikit-hep/awkward#876, which is already fixed. After the fix the tests should pass.

@lgray
Collaborator

lgray commented Jun 1, 2021

Nice, a new awkward is being built. Once that is done baking, we should require awkward >=1.3.0 with this branch.

@lgray
Collaborator

lgray commented Jun 1, 2021

@nikoladze Can you add the awkward array version requirement to setup.py in this PR? Thanks!
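
For reference, a hedged sketch of the kind of pin being asked for (the actual setup.py has more entries, not reproduced here):

# sketch only: the relevant dependency pin in setup.py's install_requires
install_requires = [
    "awkward>=1.3.0",
    # ... the other existing dependencies stay unchanged
]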

@lgray
Collaborator

lgray commented Jun 2, 2021

Are you ready to switch this from a draft PR, and have it reviewed?

@nikoladze nikoladze marked this pull request as ready for review June 2, 2021 15:02
Member

@nsmith- nsmith- left a comment

This looks really nice! I have some proposed simplifications, but for the most part it looks good to go.

def trackParticle(self):
    trackParticles = self.trackParticles
    return self.trackParticles[
        tuple([slice(None) for i in range(trackParticles.ndim - 1)] + [0])
    ]
Member

Could be self.trackParticles[..., 0]?
By the way, nice catch to guard against accessing attributes in objects that may be nested further than the default 2 dimensions.

Contributor Author

That did not work; awkward complains with ValueError: ellipsis (...) can't be used on a data structure of different depths. Should that work? Then I can open an issue for awkward array.

Actually, I noticed it for a lower number of dimensions, e.g. when I do events[2].Electrons.trackParticle

Member

I would think that as long as array.ndim is well-defined, ellipsis is as well (provided no field indexers are used in the getitem). @jpivarski thoughts?

Contributor

That error message about ellipsis is talking about an issue with unions (so, unless it's not what I'm thinking, it's not a bug).

>>> array = ak.Array([[1], [2], [3]])
>>> array[..., 0]
<Array [1, 2, 3] type='3 * int64'>
>>> array = ak.Array([[1], [2], [3], [[4], [4], [4], [4]]])
>>> array[..., 0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/miniconda3/lib/python3.8/site-packages/awkward/highlevel.py", line 993, in __getitem__
    return ak._util.wrap(self.layout[where], self._behavior)
ValueError: ellipsis (...) can't be used on a data structure of different depths

(https://github.com/scikit-hep/awkward-1.0/blob/1.2.2/src/libawkward/Content.cpp#L1491)

Arrays always have an ndim, and this is the "purelist depth". A union like the above doesn't have an easily definable depth. I had thought that I made this return -1 in cases like that, but apparently, it gives the depth of the UnionArray object:

>>> array.ndim                    # high-level
0
>>> array.layout.purelist_depth   # low-level
0

That's not ideal: it doesn't give more information than if it had been a RecordArray at that level. But I don't know if this addresses your question.

Contributor Author

I looked a bit at what's going on for this particular array. It seems the issue here is not that it is a union, but that one field has a higher dimension. There is definingParametersCovarianceMatrix, which has one more dimension than the other fields (but is still zipped on the outermost level of offsets). So a small example for that situation would be

import awkward as ak
array = ak.zip(
    {"a" : [[1], [2, 3]], "b" : [[[10, 10]], [[20, 20], [30, 30]]]},
    depth_limit=2
)

The resulting array actually has ndim=2, which is what I use in that code for trackParticle.

>>> array.ndim
2

I want this:

>>> array[:, 0].tolist()
[{'a': 1, 'b': [10, 10]}, {'a': 2, 'b': [20, 20]}]

The ellipsis doesn't work, presumably because of the extra dimension in the field b:

>>> array[..., 0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/nikolai/.local/lib/python3.9/site-packages/awkward/highlevel.py", line 993, in __getitem__
    return ak._util.wrap(self.layout[where], self._behavior)
ValueError: ellipsis (...) can't be used on a data structure of different depths

(https://github.com/scikit-hep/awkward-1.0/blob/1.3.0/src/libawkward/Content.cpp#L1491)
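
For completeness, a hedged illustration: the tuple-of-slices construction from trackParticle above, applied to the same toy array, is equivalent to array[:, 0] here since ndim is 2:

# Slice all but the last dimension, then take the first element of the last one.
idx = tuple([slice(None)] * (array.ndim - 1) + [0])
array[idx].tolist()
# [{'a': 1, 'b': [10, 10]}, {'a': 2, 'b': [20, 20]}]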

@nsmith-
Member

nsmith- commented Jun 2, 2021

Also i don't quite know how to attach behavior to the union arrays.

This might be a bug, perhaps related to scikit-hep/awkward#770
In principle a union array should be able to have a behavior.

Finally one thing i haven't quite understood yet: the way i use the _apply_global_index function it seems the virtual arrays i link to get materialized, e.g. if i do

events.Electrons.trackParticles.layout

there is no VirtualArray anymore

This might be related to the lack of a collection_name in the layout. I think what is happening comes down to this function:

def _getlistarray(self):
    """Do some digging to find the initial listarray"""

    def descend(layout, depth):
        islistarray = isinstance(
            layout,
            (
                awkward.layout.ListOffsetArray32,
                awkward.layout.ListOffsetArray64,
            ),
        )
        if islistarray and layout.content.parameter("collection_name") is not None:
            return lambda: layout

    return awkward._util.recursively_apply(self.layout, descend)
If no collection_name is encountered, then awkward._util.recursively_apply(self.layout, descend) just returns the original (but now materialized) array, whereas if a collection_name is encountered, it returns the flat collection RecordArray without materializing its constituent fields. The purpose of that function is to descend through potentially extraneous array nodes like VirtualArray, since in general VirtualArray nodes in layouts can disappear in certain circumstances.
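
A minimal, hedged illustration of where that parameter has to live for the lookup above to find it (toy collection and name, not coffea's actual zipping code):

import awkward as ak

# The lookup above checks the parameter on the list's content (the RecordArray),
# so that is where it has to be attached.
collection = ak.zip({"pt": [[1.0], [2.0, 3.0]], "eta": [[0.1], [0.2, 0.3]]})
collection.layout.content.setparameter("collection_name", "Electrons")
collection.layout.content.parameter("collection_name")  # 'Electrons'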

@lgray
Collaborator

lgray commented Jun 3, 2021

@nikoladze if we can wrap this up today or tomorrow we can have it in the next coffea release - which would be pretty snazzy.

@nikoladze
Contributor Author

@nikoladze if we can wrap this up today or tomorrow we can have it in the next coffea release - which would be pretty snazzy.

Fine for me. Of course this is still a work in progress, but so is DAOD_PHYSLITE, so I think it won't hurt to have some basic functionality already included in coffea.

I tried to get the PHYSLITESchema included in the docs, which seemed to work more or less. I just didn't manage it yet for nanoevents.methods.physlite: when I add it to docs/source/reference.rst and try to compile the documentation I get an error

Extension error (sphinx.ext.linkcode):
Handler <function doctree_read at 0x7fdd9b47e4c0> for event 'doctree-read' threw an exception (exception: could not find class definition)

So here is a TODO list for the future:

  • Complete the lists of possible cross-reference mappings
  • Complete the mixin classes, add all important methods
  • Consider making Records of Unions instead of Unions of Records for the cases with links into multiple collections
  • Write more tests

... and I'm sure more things will come up.

Today I won't be able to get more work done on this, so if you think there are still significant things to do before this can be merged, it is of course also fine to just have it in the next release. Thanks a lot!

@lgray
Collaborator

lgray commented Jun 4, 2021

So here is a TODO list for the future:

Complete the lists of possible cross-reference mappings
Complete the mixin classes, add all important methods
Consider making Records of Unions instead of Unions of Records for the cases with links into multiple collections
Write more tests

Please make a tracking issue for these and use checkboxes or whatever you like to mark completion.

Unless @nsmith- has something further that needs addressing, I'd like to go ahead and merge this; we can figure out the docs stuff in time (or I'll fix it before minting a release).

@lgray lgray merged commit de8e792 into CoffeaTeam:master Jun 4, 2021