[bugfix] Improved Egamma PFID model selection consistency #38356

valsdav · 2022-06-13T12:36:42Z

PR description:

This PR solves issue #38175.
The crash happened because the "eta" requirement used for model selection differed between the ElectronDNNEstimator and the GsfElectronProducer: one used electron.eta, the other superCluster.eta. The model index is now passed directly from the DNNHelper evaluator to the calling code, which keeps the number of outputs consistent. (Following comment #38175 (comment))

Moreover the electron model selection is now performed correctly with SuperCluster.eta instead of Electron.eta.

PR Validation:

The PR has been validated with local tests.

Release notes:

This is urgently needed for the 12_4_0 release.

cmsbuild · 2022-06-13T12:43:49Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38356/30544

This PR adds an extra 44KB to repository

cmsbuild · 2022-06-13T12:44:09Z

A new Pull Request was created by @valsdav (Davide Valsecchi) for master.

It involves the following packages:

RecoEgamma/EgammaElectronProducers (reconstruction)
RecoEgamma/EgammaPhotonProducers (reconstruction)
RecoEgamma/EgammaTools (reconstruction)
RecoEgamma/ElectronIdentification (reconstruction)
RecoEgamma/PhotonIdentification (reconstruction)

@jpata, @cmsbuild, @clacaputo, @slava77 can you please review it and eventually sign? Thanks.
@Sam-Harper, @jainshilpi, @rovere, @lgray, @sobhatta, @lecriste, @afiqaize, @wrtabb, @varuns23, @ram1123 this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

Dr15Jones · 2022-06-13T12:52:26Z

RecoEgamma/EgammaElectronProducers/plugins/GsfElectronProducer.cc

-          mvaOutput.dnn_e_bkgPhoton = values[4];
-        } else {
-          mvaOutput.dnn_e_sigIsolated = values[0];
+        if (iModel <= 3) {  // models 0,1,2,3 have 5 outpus in this version


minor typo in comment

Suggested change

-        if (iModel <= 3) {  // models 0,1,2,3 have 5 outpus in this version
+        if (iModel <= 3) {  // models 0,1,2,3 have 5 outputs in this version

Dr15Jones · 2022-06-13T12:53:57Z

RecoEgamma/EgammaElectronProducers/plugins/GsfElectronProducer.cc

-        } else {
-          mvaOutput.dnn_e_sigIsolated = values[0];
+        if (iModel <= 3) {  // models 0,1,2,3 have 5 outpus in this version
+          mvaOutput.dnn_e_sigIsolated = values.at(0);


Each at() call checks the size of the container, which isn't the most efficient. Instead I'd suggest adding
assert(values.size() == 5) at the beginning of the if and then just using [].

I implemented the assert and removed the .at(). We wanted a way to be sure that the code crashes if there is a model index misconfiguration, and the assert is a good choice. Thanks.
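The pattern agreed on here can be sketched as follows. This is a minimal illustration, not the code from GsfElectronProducer.cc: the function name `copyFiveOutputs` and the constant `kNumOutputs` are hypothetical, and the five scores are copied into a plain array rather than into the producer's MVA output fields.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical constant: number of scores expected from models 0-3.
constexpr std::size_t kNumOutputs = 5;

// Check the container size once, then use unchecked operator[] for each read,
// instead of paying for a bounds check on every .at() call.
std::array<float, kNumOutputs> copyFiveOutputs(const std::vector<float>& values) {
  assert(values.size() == kNumOutputs);  // aborts on a model index misconfiguration
  std::array<float, kNumOutputs> out{};
  for (std::size_t i = 0; i < kNumOutputs; ++i) {
    out[i] = values[i];
  }
  return out;
}
```

Note that assert is compiled out when NDEBUG is defined, so the check only protects builds that keep assertions enabled.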

Dr15Jones · 2022-06-13T12:56:25Z

Thanks for making the change!

The model index used to evaluate the candidate is now saved in the DNNHelper output and used in the producer to select how many DNN outputs should be saved, without performing again the pt/eta binning. Moreover the eta selection is now performed with SuperCluster.eta instead of Electron.eta.
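A rough sketch of this idea is given below, under stated assumptions. The names `selectModel`, `evaluateOne` and `numberOfScoresToStore` are hypothetical, the pt and barrel/endcap thresholds are placeholders, and the output count of the extended-endcap model (`kOutputsExtended`) is an illustrative value. Only the |eta| = 2.65 boundary and the "models 0-3 have 5 outputs" rule come from this thread. The point is that binning happens in one place and the caller only reads back the index.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// (model index, DNN scores) for one candidate, mirroring the new return shape.
using ModelOutput = std::pair<unsigned int, std::vector<float>>;

// ASSUMPTION: illustrative output count for the extended-endcap model.
constexpr std::size_t kOutputsExtended = 3;

// Hypothetical binning on pt and supercluster eta. Only the 2.65 boundary
// is taken from the discussion; the other thresholds are placeholders.
unsigned int selectModel(float pt, float scEta) {
  const float absEta = std::abs(scEta);
  if (absEta > 2.65f)
    return 4u;
  const bool endcap = absEta > 1.5f;
  const bool highPt = pt > 10.f;
  return (endcap ? 2u : 0u) + (highPt ? 1u : 0u);
}

// Stand-in for the evaluator: picks the model once and returns its index
// together with dummy scores of the matching length.
ModelOutput evaluateOne(float pt, float scEta) {
  const unsigned int iModel = selectModel(pt, scEta);
  const std::size_t nOut = (iModel <= 3) ? 5 : kOutputsExtended;
  return ModelOutput(iModel, std::vector<float>(nOut, 0.f));
}

// Stand-in for the producer: decides how many scores to store from the
// returned index alone, without repeating the pt/eta binning.
std::size_t numberOfScoresToStore(const ModelOutput& result) {
  if (result.first <= 3) {
    assert(result.second.size() == 5);
    return 5;
  }
  assert(result.second.size() == kOutputsExtended);
  return kOutputsExtended;
}
```

Because the caller never recomputes the bin, a disagreement between two eta definitions can no longer produce a mismatch between the expected and the actual number of scores.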

jpata · 2022-06-13T13:20:13Z

@cmsbuild please test

cmsbuild · 2022-06-13T13:20:29Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38356/30545

This PR adds an extra 44KB to repository

cmsbuild · 2022-06-13T13:20:55Z

Pull request #38356 was updated. @jpata, @clacaputo, @slava77 can you please check and sign again.

cmsbuild · 2022-06-13T18:39:34Z

-1

Failed Tests: RelVals-INPUT
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7fa79a/25489/summary.html
COMMIT: acc34c7
CMSSW: CMSSW_12_5_X_2022-06-13-1100/el8_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/38356/25489/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals-INPUT

The relvals timed out after 4 hours.

Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 12 differences found in the comparisons
DQMHistoTests: Total files compared: 50
DQMHistoTests: Total histograms compared: 3659074
DQMHistoTests: Total failures: 13
DQMHistoTests: Total nulls: 1
DQMHistoTests: Total successes: 3659038
DQMHistoTests: Total skipped: 22
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: -0.004 KiB( 49 files compared)
DQMHistoSizes: changed ( 312.0 ): -0.004 KiB MessageLogger/Warnings
Checked 208 log files, 45 edm output root files, 50 DQM output files
TriggerResults: no differences found

qliphy · 2022-06-14T06:21:45Z

urgent

qliphy · 2022-06-14T06:22:15Z

please test

valsdav · 2022-06-14T09:30:47Z

Dear @qliphy, I think something broke in the tests. Shall we restart them?

qliphy · 2022-06-14T09:34:23Z

please abort

qliphy · 2022-06-14T09:36:09Z

please test

jpata · 2022-06-14T12:49:28Z

@valsdav do you expect any physics differences in the MVA? Is a larger-scale validation possible?

kdlong · 2022-06-14T13:24:41Z

You haven't actually changed the model, right? Shouldn't the validation be identical?

cmsbuild · 2022-06-14T14:04:44Z

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7fa79a/25506/summary.html
COMMIT: acc34c7
CMSSW: CMSSW_12_5_X_2022-06-13-2300/el8_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/38356/25506/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

No significant changes to the logs found
Reco comparison results: 14 differences found in the comparisons
DQMHistoTests: Total files compared: 50
DQMHistoTests: Total histograms compared: 3659074
DQMHistoTests: Total failures: 13
DQMHistoTests: Total nulls: 1
DQMHistoTests: Total successes: 3659038
DQMHistoTests: Total skipped: 22
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: -0.004 KiB( 49 files compared)
DQMHistoSizes: changed ( 312.0 ): -0.004 KiB MessageLogger/Warnings
Checked 208 log files, 45 edm output root files, 50 DQM output files
TriggerResults: no differences found

a-kapoor · 2022-06-14T14:10:12Z

@kdlong
The boundary logic in the model selector was incorrectly using ele.eta() when it should have been using ele.supercluster.eta().
This is in fact what would have led to the problem reported in #38175.
Here we only use eta to decide whether the electron is in the barrel, the endcap, etc. That decision will very rarely differ between eta() and supercluster.eta(), even if the absolute values are different.

@jpata So only at the boundaries (barrel-endcap, endcap-extended-endcap) might we expect some minor differences, for electrons that are in the barrel according to supercluster.eta() but, say, in the endcap according to eta().

We thus expect no significant physics differences. Given that we are at the deadline for 12_4_0, we would like to know if this can be merged without a full-scale validation. We can still start a full-scale validation in parallel, but based on our experience with crab from last time, this could take a week.

jpata · 2022-06-14T14:13:34Z

Thanks for the summary. I'm fine with this explanation. There are small differences in the MVA output due to the bugfix, and it should be validated separately, but let's proceed anyway.

BTW: this didn't show up in the previous large-scale validation, right? Did any of the jobs crash?

a-kapoor · 2022-06-14T14:34:46Z

> Thanks for the summary. I'm fine with this explanation. There are small differences in the MVA output due to the bugfix, and it should be validated separately, but let's proceed anyway.
>
> BTW: this didn't show up in the previous large-scale validation, right? Did any of the jobs crash?

@jpata No crashes were reported in the final validation. We did see some initial crashes but once crab was fixed, all went fine. So the crashes were crab specific.

To make a note of it, I want to stress that the only way a heap-buffer-overflow could have occurred in our earlier MVA code is when an electron is in |eta|>2.65 according to ele.eta() but in |eta|<2.65 according to ele->supercluster.eta(). This is because the only model that has a different number of nodes is the model in |eta|>2.65. The electron efficiency in this region is already low, and the chance of the above condition being satisfied is lower still, which may be why we never saw this. This PR will fix it though.
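As a small illustration of the condition described above, the check below flags candidates whose two eta values fall on opposite sides of the |eta| = 2.65 boundary. The function names are hypothetical and this is not code from the PR; it only makes the rare disagreement case concrete.

```cpp
#include <cassert>
#include <cmath>

// Boundary quoted in the discussion for the extended-endcap model.
constexpr float kExtendedEndcapEta = 2.65f;

// True if the given eta selects the extended-endcap model.
bool inExtendedEndcap(float eta) { return std::abs(eta) > kExtendedEndcapEta; }

// True when the electron eta and the supercluster eta would pick models on
// different sides of the boundary, i.e. the case where the old code could
// expect one number of scores while a different number was produced.
bool selectionsDisagree(float electronEta, float superClusterEta) {
  return inExtendedEndcap(electronEta) != inExtendedEndcap(superClusterEta);
}
```

For example, an electron with eta 2.66 and supercluster eta 2.64 is flagged, while one with both values well inside the barrel is not.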

jpata · 2022-06-14T15:13:56Z

Could you please also open a backport to 12_4?

jpata · 2022-06-15T07:31:49Z

RecoEgamma/EgammaTools/interface/EgammaDNNHelper.h

@@ -49,8 +49,9 @@ namespace egammaTools {
    // which has access to all the variables.
    std::pair<uint, std::vector<float>> getScaledInputs(const std::map<std::string, float>& variables) const;

-    std::vector<std::vector<float>> evaluate(const std::vector<std::map<std::string, float>>& candidates,
-                                             const std::vector<tensorflow::Session*>& sessions) const;
+    std::vector<std::pair<uint, std::vector<float>>> evaluate(


Not for this PR, but I think the same comment applies here as was suggested in the DeepSC PR:
these kinds of supernested structures are easy to write down but hard to reason about later. It would be better to define classes that encapsulate the required data.
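One possible shape for such a class is sketched below, purely as an illustration of the suggestion. `DnnResult` and `wrapResults` are hypothetical names that do not exist in EgammaDNNHelper.h; the sketch only shows how named members could replace the `.first`/`.second` access of the nested pair.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical named replacement for std::pair<uint, std::vector<float>>.
struct DnnResult {
  unsigned int modelIndex = 0;  // which model evaluated the candidate
  std::vector<float> scores;    // DNN output nodes for that model
};

// Converts the nested pair/vector structure into the named type.
std::vector<DnnResult> wrapResults(
    const std::vector<std::pair<unsigned int, std::vector<float>>>& raw) {
  std::vector<DnnResult> out;
  out.reserve(raw.size());
  for (const auto& item : raw) {
    DnnResult r;
    r.modelIndex = item.first;
    r.scores = item.second;
    out.push_back(std::move(r));
  }
  return out;
}
```

With this, calling code would read `result.modelIndex` and `result.scores` instead of having to remember what each half of the pair means.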

jpata · 2022-06-15T07:33:57Z

+reconstruction

bugfix for a rare EGamma PFID segfault: "GsfElectronProducer Reading off the end of a std::vector" #38175
small changes to the MVA output are expected

cmsbuild · 2022-06-15T07:34:19Z

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

perrotta · 2022-06-15T08:07:20Z

+1

cmsbuild added this to the CMSSW_12_5_X milestone Jun 13, 2022

cmsbuild added code-checks-pending orp-pending pending-signatures reconstruction-pending tests-pending labels Jun 13, 2022

a-kapoor mentioned this pull request Jun 13, 2022

GsfElectronProducer Reading off the end of a std::vector #38175

Closed

cmsbuild added code-checks-approved and removed code-checks-pending labels Jun 13, 2022

Dr15Jones suggested changes Jun 13, 2022

valsdav force-pushed the egammapfid_modelselection_bugfix_12_4_X branch from 65004d6 to acc34c7 Compare June 13, 2022 13:13

cmsbuild added code-checks-pending and removed code-checks-approved labels Jun 13, 2022

cmsbuild added tests-started code-checks-approved and removed tests-pending code-checks-pending labels Jun 13, 2022

cmsbuild added tests-rejected and removed tests-started labels Jun 13, 2022

cmsbuild added the urgent label Jun 14, 2022

cmsbuild added tests-pending and removed tests-started labels Jun 14, 2022

cmsbuild added tests-started and removed tests-pending labels Jun 14, 2022

cmsbuild added tests-approved and removed tests-started labels Jun 14, 2022

valsdav mentioned this pull request Jun 14, 2022

[12_4_X] Improved Egamma PFID model selection consistency #38372

Merged

jpata reviewed Jun 15, 2022

cmsbuild added fully-signed reconstruction-approved and removed reconstruction-pending pending-signatures labels Jun 15, 2022

cmsbuild added orp-approved and removed orp-pending labels Jun 15, 2022

cmsbuild merged commit 159edb5 into cms-sw:master Jun 15, 2022

This was referenced Jun 15, 2022

[L1T] Phase-2 update phase 2 emulator sequence #38375

Closed

[DEVEL] Added alpaka build rules cms-sw/cmsdist#7936

Merged
