Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes for two types of crashes in recent nightlies #83

Conversation

giovannimarchiori
Copy link
Contributor

This PR fixes two crashes observed with recent nightlies:

  1. a crash if the clusters used in input to AugmentedCaloClusters or CalibrateCaloClusters do not have metadata

  2. a crash in CalibrateCaloClusters to the update version of the ONNX runtime used by the software stuck

Copy link
Contributor

@tmadlener tmadlener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be more occurences of MetaDataHandle::get that are not yet changed, which leads to the failing CI. The one that I could identify from looking at the tests log would be this one:

// Copy over the CellIDEncoding string from the input collection to the output collection
m_positionedHitsCellIDEncoding.put(m_hitsCellIDEncoding.get());

I am not entirely sure whether it is a problem that the CellIDEncoding that you try to copy is not present, but it could be one.

Unrelated to this PR, there are also some warnings from included ONNXRuntime headers, which might be silencable by changing this bit here:

${ONNXRUNTIME_LIBRARY}
)

to

-		      ${ONNXRUNTIME_LIBRARY}
-                      )
+                      )
+ target_link_libraries(k4RecFCCeeCalorimeterPlugins SYSTEM ${ONNXRUNTIME_LIBRARIES})

@giovannimarchiori
Copy link
Contributor Author

giovannimarchiori commented Jun 14, 2024

Unrelated to this PR, there are also some warnings from included ONNXRuntime headers, which might be silencable by changing this bit here:

${ONNXRUNTIME_LIBRARY}
)

to

-		      ${ONNXRUNTIME_LIBRARY}
-                      )
+                      )
+ target_link_libraries(k4RecFCCeeCalorimeterPlugins SYSTEM ${ONNXRUNTIME_LIBRARIES})

I tried this suggestion but I get a compilation error:

-- Found ONNXRuntime: /cvmfs/sw-nightlies.hsf.org/key4hep/releases/2024-05-09/x86_64-almalinux9-gcc11.4.1-opt/py-onnxruntime/1.17.1-cokoc6/include  
CMake Error at RecFCCeeCalorimeter/CMakeLists.txt:31 (target_link_libraries):
  The keyword signature for target_link_libraries has already been used with
  the target "k4RecFCCeeCalorimeterPlugins".  All uses of
  target_link_libraries with a target must be either all-keyword or
  all-plain.

  The uses of the keyword signature are here:

   * /cvmfs/sw-nightlies.hsf.org/key4hep/releases/2024-05-09/x86_64-almalinux9-gcc11.4.1-opt/gaudi/38.1-kpe2ng/lib/cmake/Gaudi/GaudiToolbox.cmake:453 (target_link_libraries)



-- Configuring incomplete, errors occurred!
make: *** No rule to make target 'install'.  Stop.


The cmake file looks like:

# ONNX DEPENDENCIES
# includes
include_directories("${ONNXRUNTIME_INCLUDE_DIRS}")
# libs
link_directories("${ONNXRUNTIME_LIBRARY_DIRS}")
# New versions of ONNXRuntime add directories to include
# through the target onnxruntime::onnxruntime
if(onnxruntime_FOUND)
  set(ONNXRUNTIME_LIBRARY onnxruntime::onnxruntime)
endif()

file(GLOB _module_sources src/components/*.cpp)
gaudi_add_module(k4RecFCCeeCalorimeterPlugins
                 SOURCES ${_module_sources}
                 LINK k4FWCore::k4FWCore
                      k4FWCore::k4Interface
                      Gaudi::GaudiAlgLib
                      Gaudi::GaudiKernel
                      DD4hep::DDCore
                      EDM4HEP::edm4hep
                      k4geo::detectorSegmentations
                      k4geo::detectorCommon
                      DD4hep::DDG4
                      ROOT::Core
                      ROOT::Hist
                      )
target_link_libraries(k4RecFCCeeCalorimeterPlugins SYSTEM ${ONNXRUNTIME_LIBRARIES})          
install(TARGETS k4RecFCCeeCalorimeterPlugins
  EXPORT k4RecCalorimeterTargets
  RUNTIME DESTINATION "${CMAKE_INSTALL_BINDIR}" COMPONENT bin
  LIBRARY DESTINATION "${CMAKE_INSTALL_LIBDIR}" COMPONENT shlib
  COMPONENT dev)

@tmadlener
Copy link
Contributor

Ah, in that case you are running into the fact that gaudi_add_module doesn't use a qualifying keyword for their call to target_link_libraries. In that case there is nothing easy to do at this point, unfortunately. (Other than living with the warnings).

@giovannimarchiori
Copy link
Contributor Author

There seems to be more occurences of MetaDataHandle::get that are not yet changed, which leads to the failing CI. The one that I could identify from looking at the tests log would be this one:

// Copy over the CellIDEncoding string from the input collection to the output collection
m_positionedHitsCellIDEncoding.put(m_hitsCellIDEncoding.get());

I am not entirely sure whether it is a problem that the CellIDEncoding that you try to copy is not present, but it could be one.

I am not sure we should change this. Is there a good reason for the CellIDEncoding to be missing for the CaloClusters cell collection? Also, this show up in a test for an old geo (ALLEGRO_o1_v01) that I never worked on, I am not even sure that this segmentation fault is still there and worth investigating and requires this test. Maybe @BrieucF can comment/have a look? In the "standard" digi+reco jobs for ALLEGRO v02/v03 I don't need to setup a dedicated cell positioning tool for the CaloClusters cells after clustering because they are already positioned when passed as input to the clustering, so this error does not appear.

@BrieucF
Copy link
Contributor

BrieucF commented Jun 14, 2024

Hi @giovannimarchiori, @tmadlener, this looks to be symptomatic of the fact that the cell collection attached to the cluster does not have the CellIDEncoding https://github.com/HEP-FCC/k4RecCalorimeter/blob/main/RecCalorimeter/src/components/CreateCaloClustersSlidingWindow.cpp#L45C8-L45C23. This is indeed potentially a problem (analysis scripts later in the chain may need this information to properly set-up the decoder).

On the other hand, as things are now, this clusterCells collection may host CalorimeterHits from heterogeneous subsystems which than have inconsistent CellIDEncodings. The only way I see to handle this would be to create a clusterCells collection per subsystem. Do you see a better way?

@giovannimarchiori
Copy link
Contributor Author

Hi @BrieucF,
a couple of comments from my side.

  1. this discussion about the missing cellID enconding and about how to handle cells from heterogeneous subsystems which than have inconsistent CellIDEncodings goes beyond this PR, which did not introduce this problem (it was there before), and fixes other issues. I would then decouple the things - if the PR OK, approve it and merge it, and open an issue to decide how to treat the cells with different encodings (I have another PR for this package ready to be submitted - photon ID with MVA within Gaudi algorithm - but I am waiting for this PR to be merged)

  2. about what to do with that issue: either we save cluster cells in separate collections (cleanest option, but multiplies the output branches/collections), or we save a map of cellType -> cellEnconding in the metadata, and use the "type" of the cells to record where they come from (currently it seems to me that this variable is unused, it's always zero, what is its purpose?) or the map is SystemID -> cellEnconding, and we know that independently of the enconding the cellID always stores the systemID in the first 4 bits (or something like that).

@tmadlener
Copy link
Contributor

tmadlener commented Jun 17, 2024

I agree with @giovannimarchiori, I think this PR should not deal with heterogenous subsystems. It should rather fix the issue that was uncovered / introduced with key4hep/k4FWCore#195

Regarding how to handle hits from multiple different subsystems, there are two "clean" ways to deal with it IMHO,

  1. different collections with different cell ID encodings for each subsystem.
  2. come up with a cell ID encoding scheme that works for all subsystems, e.g. like what has been done for tracking in ILD & CLD. Both of them have the same basic structure in the form of field names, they only differ in how many bits they assign to each field. Ideally, even that would be the same, that would make exchanging tracker sub detectors between the systems a lot easier.

If you go for 1, you can also just use a subset collection (which would technically be a superset collection in this case) to merge all collections into one collection if you only want a "global calorimeter hit" collection. Since we can always get back to the collectionID from the individual hits, we can also always get back to the cell ID encodings of the different collections if necessary.

@giovannimarchiori
Copy link
Contributor Author

Thinking out loud about this thing... the problem arises because we're copying cells from the various cell collections produced from digitisation into a new "cluster cells" collection that includes only the clustered cells and that may come from different subsystems (e.g. ECAL and HCAL). Maybe another possibility would be that we do not copy the cells at all, but just persist as "calo cluster cells" the links to the cells in the original collections? If we don't want to save all cells in output because they take too much space, we could then implement a "thinning" scheme that drops non-clustered cells from the ROOT output (not sure how easy would that be with our EDM?)

@giovannimarchiori
Copy link
Contributor Author

giovannimarchiori commented Jun 17, 2024

I just added a couple of extra fixes for the crash observed when cellIDencoding is missing in input collection, introduced by the recent changes to metadatahandle::get().
For the moment, if the cellID encoding is missing in the input, I just print a WARNING message. Let me know if we should instead print an ERROR message and return StatusCode::Failure.
I hope that's all that's needed in order to approve this PR.
Cheers!

@tmadlener
Copy link
Contributor

I think all workflows that are currently expected to pass are passing now. The one thing about a warning vs hard error is that the former is easily ignorable ;) In this case it's hard to say what is the best choice. The warning is certainly less disrupting, if you want this to be an error at some point, it might be easier to introduce it now, rather than later when everyone has already got used to living with the warning.

Coming back to the discussion about the clustering. I might be missing some information here on how things are done in the end, so take this with a grain of salt. IIUC, the algorithm that we are mainly discussing here, is taking in an already clustered shower and is then doing some energy correction / calibration on this shower? Does this also mean that the underlying calorimeter hits are updated as well, or do they stay the same? In case they stay the same, I don't really see a reason to make a copy of them. You can always make a subset collection of the clustered ones, if you need an easy handle to only the ones that are clustered.

Filtering them when writing is then another topic, I think.

@giovannimarchiori
Copy link
Contributor Author

I think all workflows that are currently expected to pass are passing now. The one thing about a warning vs hard error is that the former is easily ignorable ;) In this case it's hard to say what is the best choice. The warning is certainly less disrupting, if you want this to be an error at some point, it might be easier to introduce it now, rather than later when everyone has already got used to living with the warning.

I can change the warning to error, that would be my preference, but it will make one test fail in the nightly because the cells used by one alg in that test do not have cellID encoding. We can then open a separate issue, while approving and merging this PR, OK?

Coming back to the discussion about the clustering. I might be missing some information here on how things are done in the end, so take this with a grain of salt. IIUC, the algorithm that we are mainly discussing here, is taking in an already clustered shower and is then doing some energy correction / calibration on this shower? Does this also mean that the underlying calorimeter hits are updated as well, or do they stay the same? In case they stay the same, I don't really see a reason to make a copy of them. You can always make a subset collection of the clustered ones, if you need an easy handle to only the ones that are clustered.

There are some algorithms - not specifically those affected by this PR, but upstream ones doing the clustering - that take all cells and create a collection of clustered cells, and the cluster hits point to these cloned cells, see here:


which calls this:
void CaloTowerTool::attachCells(float eta, float phi, uint halfEtaFin, uint halfPhiFin, edm4hep::MutableCluster& aEdmCluster, edm4hep::CalorimeterHitCollection* aEdmClusterCells, bool aEllipse) {

The cells/hits seem to me to be just cloned, without any additional operation performed on them. I guess the code is like that so that one can then drop the full cell collection and keep only that of clustered cells (in case e.g. noise is on and a lot of cells with non-zero energy appear in the full cell collection but are not clustered because they are below threshold), reducing the output file size.

@BrieucF
Copy link
Contributor

BrieucF commented Jun 17, 2024

Yes, we can deal with this somewhere else than in this PR (but let's open an issue)

@giovannimarchiori
Copy link
Contributor Author

giovannimarchiori commented Jun 18, 2024

Changed warning to error, and created issue #84

Is the PR OK for merging?

Copy link
Contributor

@tmadlener tmadlener left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks.

@giovannimarchiori
Copy link
Contributor Author

@kjvbrt : @BrieucF on vacation and told me to ask you to merge this PR now that it has been approved by @tmadlener

Thanks, cheers,
Giovanni

@kjvbrt
Copy link
Contributor

kjvbrt commented Jun 21, 2024

Hi @giovannimarchiori, yap I will merge it soon...

@kjvbrt kjvbrt merged commit 4506532 into HEP-FCC:main Jun 21, 2024
2 of 6 checks passed
@giovannimarchiori
Copy link
Contributor Author

Thanks!

@giovannimarchiori giovannimarchiori deleted the gmarchio-main-20240612-fix-onnx-and-metadata-missing branch July 30, 2024 12:50
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants