-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes for two types of crashes in recent nightlies #83
Fixes for two types of crashes in recent nightlies #83
Conversation
…bration code crash with new ONNX runtime
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seems to be more occurences of MetaDataHandle::get
that are not yet changed, which leads to the failing CI. The one that I could identify from looking at the tests log would be this one:
k4RecCalorimeter/RecFCCeeCalorimeter/src/components/CreateCaloCellPositionsFCCee.cpp
Lines 37 to 38 in 3a1b7f9
// Copy over the CellIDEncoding string from the input collection to the output collection | |
m_positionedHitsCellIDEncoding.put(m_hitsCellIDEncoding.get()); |
I am not entirely sure whether it is a problem that the CellIDEncoding that you try to copy is not present, but it could be one.
Unrelated to this PR, there are also some warnings from included ONNXRuntime headers, which might be silencable by changing this bit here:
k4RecCalorimeter/RecFCCeeCalorimeter/CMakeLists.txt
Lines 30 to 31 in 3a1b7f9
${ONNXRUNTIME_LIBRARY} | |
) |
to
- ${ONNXRUNTIME_LIBRARY}
- )
+ )
+ target_link_libraries(k4RecFCCeeCalorimeterPlugins SYSTEM ${ONNXRUNTIME_LIBRARIES})
I tried this suggestion but I get a compilation error:
The cmake file looks like:
|
Ah, in that case you are running into the fact that |
I am not sure we should change this. Is there a good reason for the CellIDEncoding to be missing for the CaloClusters cell collection? Also, this show up in a test for an old geo (ALLEGRO_o1_v01) that I never worked on, I am not even sure that this segmentation fault is still there and worth investigating and requires this test. Maybe @BrieucF can comment/have a look? In the "standard" digi+reco jobs for ALLEGRO v02/v03 I don't need to setup a dedicated cell positioning tool for the CaloClusters cells after clustering because they are already positioned when passed as input to the clustering, so this error does not appear. |
Hi @giovannimarchiori, @tmadlener, this looks to be symptomatic of the fact that the cell collection attached to the cluster does not have the CellIDEncoding https://github.com/HEP-FCC/k4RecCalorimeter/blob/main/RecCalorimeter/src/components/CreateCaloClustersSlidingWindow.cpp#L45C8-L45C23. This is indeed potentially a problem (analysis scripts later in the chain may need this information to properly set-up the decoder). On the other hand, as things are now, this |
Hi @BrieucF,
|
I agree with @giovannimarchiori, I think this PR should not deal with heterogenous subsystems. It should rather fix the issue that was uncovered / introduced with key4hep/k4FWCore#195 Regarding how to handle hits from multiple different subsystems, there are two "clean" ways to deal with it IMHO,
If you go for 1, you can also just use a subset collection (which would technically be a superset collection in this case) to merge all collections into one collection if you only want a "global calorimeter hit" collection. Since we can always get back to the collectionID from the individual hits, we can also always get back to the cell ID encodings of the different collections if necessary. |
Thinking out loud about this thing... the problem arises because we're copying cells from the various cell collections produced from digitisation into a new "cluster cells" collection that includes only the clustered cells and that may come from different subsystems (e.g. ECAL and HCAL). Maybe another possibility would be that we do not copy the cells at all, but just persist as "calo cluster cells" the links to the cells in the original collections? If we don't want to save all cells in output because they take too much space, we could then implement a "thinning" scheme that drops non-clustered cells from the ROOT output (not sure how easy would that be with our EDM?) |
I just added a couple of extra fixes for the crash observed when cellIDencoding is missing in input collection, introduced by the recent changes to metadatahandle::get(). |
I think all workflows that are currently expected to pass are passing now. The one thing about a warning vs hard error is that the former is easily ignorable ;) In this case it's hard to say what is the best choice. The warning is certainly less disrupting, if you want this to be an error at some point, it might be easier to introduce it now, rather than later when everyone has already got used to living with the warning. Coming back to the discussion about the clustering. I might be missing some information here on how things are done in the end, so take this with a grain of salt. IIUC, the algorithm that we are mainly discussing here, is taking in an already clustered shower and is then doing some energy correction / calibration on this shower? Does this also mean that the underlying calorimeter hits are updated as well, or do they stay the same? In case they stay the same, I don't really see a reason to make a copy of them. You can always make a subset collection of the clustered ones, if you need an easy handle to only the ones that are clustered. Filtering them when writing is then another topic, I think. |
I can change the warning to error, that would be my preference, but it will make one test fail in the nightly because the cells used by one alg in that test do not have cellID encoding. We can then open a separate issue, while approving and merging this PR, OK?
There are some algorithms - not specifically those affected by this PR, but upstream ones doing the clustering - that take all cells and create a collection of clustered cells, and the cluster hits point to these cloned cells, see here: k4RecCalorimeter/RecFCCeeCalorimeter/src/components/CreateCaloClustersSlidingWindowFCCee.cpp Line 303 in 3a1b7f9
which calls this:
The cells/hits seem to me to be just cloned, without any additional operation performed on them. I guess the code is like that so that one can then drop the full cell collection and keep only that of clustered cells (in case e.g. noise is on and a lot of cells with non-zero energy appear in the full cell collection but are not clustered because they are below threshold), reducing the output file size. |
Yes, we can deal with this somewhere else than in this PR (but let's open an issue) |
Changed warning to error, and created issue #84 Is the PR OK for merging? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks.
@kjvbrt : @BrieucF on vacation and told me to ask you to merge this PR now that it has been approved by @tmadlener Thanks, cheers, |
Hi @giovannimarchiori, yap I will merge it soon... |
Thanks! |
This PR fixes two crashes observed with recent nightlies:
a crash if the clusters used in input to AugmentedCaloClusters or CalibrateCaloClusters do not have metadata
a crash in CalibrateCaloClusters to the update version of the ONNX runtime used by the software stuck