[bugfix] Improved Egamma PFID model selection consistency #38356

Merged

Conversation

valsdav
Contributor

@valsdav valsdav commented Jun 13, 2022

PR description:

This PR fixes issue #38175.
The crash happened because the eta-based model selection differed between the ElectronDNNEstimator and the GsfElectronProducer (one used electron.eta, the other superCluster.eta). The model index is now passed directly from the DNNHelper evaluator to the calling code, which keeps the number of outputs consistent. (Following comment #38175 (comment))

Moreover, the electron model selection is now performed correctly with SuperCluster.eta instead of Electron.eta.
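
As a rough illustration of the approach described above (not the actual CMSSW code), the sketch below has a hypothetical evaluator return the model index together with the outputs, so the caller never repeats the binning. The function names, the two-model layout, and the output count of the second model are assumptions made for the example; only the 5-output count and the 2.65 boundary are mentioned in this PR.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative constants: 5 outputs is mentioned in this PR; the count for the
// other model is a placeholder chosen for the example.
constexpr std::size_t kOutputsStandard = 5;
constexpr std::size_t kOutputsExtended = 3;  // assumption, not taken from CMSSW

// Hypothetical two-model binning that looks at the supercluster eta only.
unsigned int selectModelIndex(float superClusterEta) {
  return std::fabs(superClusterEta) > 2.65f ? 1u : 0u;
}

std::size_t outputsForModel(unsigned int modelIndex) {
  return modelIndex == 1u ? kOutputsExtended : kOutputsStandard;
}

// Hypothetical evaluator: returns the model index it used, with dummy outputs.
std::pair<unsigned int, std::vector<float>> evaluateCandidate(float superClusterEta) {
  const unsigned int modelIndex = selectModelIndex(superClusterEta);
  return {modelIndex, std::vector<float>(outputsForModel(modelIndex), 0.f)};
}

// Caller side: trusts the returned index instead of redoing the eta binning.
std::size_t storeOutputs(const std::pair<unsigned int, std::vector<float>>& result) {
  const std::size_t expected = outputsForModel(result.first);
  assert(result.second.size() == expected);
  return expected;
}
```

Because the index travels with the outputs, the evaluator and the caller cannot disagree about which model was used, whichever eta definition the binning is based on.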

PR Validation:

The PR has been validated with local tests.

Release notes:

This is urgently needed for the 12_4_0 release.

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38356/30544

  • This PR adds an extra 44KB to repository

@cmsbuild
Contributor

A new Pull Request was created by @valsdav (Davide Valsecchi) for master.

It involves the following packages:

  • RecoEgamma/EgammaElectronProducers (reconstruction)
  • RecoEgamma/EgammaPhotonProducers (reconstruction)
  • RecoEgamma/EgammaTools (reconstruction)
  • RecoEgamma/ElectronIdentification (reconstruction)
  • RecoEgamma/PhotonIdentification (reconstruction)

@jpata, @cmsbuild, @clacaputo, @slava77 can you please review it and eventually sign? Thanks.
@Sam-Harper, @jainshilpi, @rovere, @lgray, @sobhatta, @lecriste, @afiqaize, @wrtabb, @varuns23, @ram1123 this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

mvaOutput.dnn_e_bkgPhoton = values[4];
} else {
mvaOutput.dnn_e_sigIsolated = values[0];
if (iModel <= 3) { // models 0,1,2,3 have 5 outpus in this version
Contributor

minor typo in comment

Suggested change
if (iModel <= 3) { // models 0,1,2,3 have 5 outpus in this version
if (iModel <= 3) { // models 0,1,2,3 have 5 outputs in this version

} else {
mvaOutput.dnn_e_sigIsolated = values[0];
if (iModel <= 3) { // models 0,1,2,3 have 5 outpus in this version
mvaOutput.dnn_e_sigIsolated = values.at(0);
Contributor

Each at call checks the size of the container, which isn't the most efficient. Instead, I'd suggest adding assert(values.size() == 5) at the beginning of the if and then just using [].

Contributor Author

@valsdav valsdav Jun 13, 2022

I implemented the assert and removed the .at(). We wanted a way to be sure that the code crashes if there is a model index misconfiguration, and the assert is a good choice. Thanks.

@Dr15Jones
Contributor

Thanks for making the change!
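
The assert-then-index pattern discussed in this thread can be sketched as follows. This is a minimal example under assumptions: the FiveOutputs struct and the unpackFive function are hypothetical stand-ins, and only the first and last field names echo the dnn_e_sigIsolated and dnn_e_bkgPhoton members seen in the diff.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, reduced stand-in for the MVA output structure.
struct FiveOutputs {
  float sigIsolated;
  float other1, other2, other3;  // placeholder names for the middle outputs
  float bkgPhoton;
};

// Check the size once, then use unchecked operator[] for every element.
FiveOutputs unpackFive(const std::vector<float>& values) {
  assert(values.size() == 5);
  return {values[0], values[1], values[2], values[3], values[4]};
}
```

One caveat of this design choice: assert is compiled out when NDEBUG is defined, so the check only guards builds where assertions are enabled, whereas .at() always checks.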

The model index used to evaluate the candidate is now saved in the DNNHelper output and used in the producer to select how many DNN outputs should be saved, without repeating the pt/eta binning.

Moreover, the eta selection is now performed with SuperCluster.eta instead of Electron.eta.
@jpata
Contributor

jpata commented Jun 13, 2022

@cmsbuild please test

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-38356/30545

  • This PR adds an extra 44KB to repository

@cmsbuild
Contributor

Pull request #38356 was updated. @jpata, @clacaputo, @slava77 can you please check and sign again.

@cmsbuild
Contributor

-1

Failed Tests: RelVals-INPUT
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7fa79a/25489/summary.html
COMMIT: acc34c7
CMSSW: CMSSW_12_5_X_2022-06-13-1100/el8_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/38356/25489/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals-INPUT

The relvals timed out after 4 hours.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 12 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 3659074
  • DQMHistoTests: Total failures: 13
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 3659038
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -0.004 KiB( 49 files compared)
  • DQMHistoSizes: changed ( 312.0 ): -0.004 KiB MessageLogger/Warnings
  • Checked 208 log files, 45 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

@qliphy
Contributor

qliphy commented Jun 14, 2022

urgent

@qliphy
Contributor

qliphy commented Jun 14, 2022

please test

@valsdav
Contributor Author

valsdav commented Jun 14, 2022

Dear @qliphy, I think something broke in the tests. Shall we restart them?

@qliphy
Contributor

qliphy commented Jun 14, 2022

please abort

@qliphy
Contributor

qliphy commented Jun 14, 2022

please test

@jpata
Contributor

jpata commented Jun 14, 2022

@valsdav do you expect any physics differences in the MVA? Is a larger-scale validation possible?

@kdlong
Contributor

kdlong commented Jun 14, 2022

You haven't actually changed the model, right? Shouldn't the validation be identical?

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7fa79a/25506/summary.html
COMMIT: acc34c7
CMSSW: CMSSW_12_5_X_2022-06-13-2300/el8_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/38356/25506/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 14 differences found in the comparisons
  • DQMHistoTests: Total files compared: 50
  • DQMHistoTests: Total histograms compared: 3659074
  • DQMHistoTests: Total failures: 13
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 3659038
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: -0.004 KiB( 49 files compared)
  • DQMHistoSizes: changed ( 312.0 ): -0.004 KiB MessageLogger/Warnings
  • Checked 208 log files, 45 edm output root files, 50 DQM output files
  • TriggerResults: no differences found

@a-kapoor
Contributor

@kdlong
The boundary logic in the model selector was incorrectly using ele.eta() when it should have been using ele.supercluster.eta().
This is in fact what would have led to the problem reported in #38175.
Here we only use eta to decide whether the electron is in the barrel, the endcap, etc. This decision will very rarely differ between eta() and supercluster.eta(), even if the absolute values are different.

@jpata So only at the boundaries (barrel-endcap, endcap-extended-endcap) might we expect some minor differences, for electrons that are in the barrel according to supercluster.eta() but, say, in the endcap according to eta().

We thus expect no significant physics differences. Given that we are at the deadline for 12_4_0, we would like to know whether this can be merged without a full-scale validation. We can still start a full-scale validation in parallel, but based on our experience with crab last time, it could take a week.

@jpata
Contributor

jpata commented Jun 14, 2022

Thanks for the summary. I'm fine with this explanation. There are small differences in the MVA output due to the bugfix, and it should be validated separately, but let's proceed anyway.

BTW: this didn't show up in the previous large-scale validation, right? Did any of the jobs crash?

@a-kapoor
Contributor

> Thanks for the summary. I'm fine with this explanation. There are small differences in the MVA output due to the bugfix, and it should be validated separately, but let's proceed anyway.
>
> BTW: this didn't show up in the previous large-scale validation, right? Did any of the jobs crash?

@jpata No crashes were reported in the final validation. We did see some initial crashes, but once crab was fixed, all went fine. So the crashes were crab-specific.

For the record, I want to stress that the only way a heap-buffer-overflow could have occurred in our earlier MVA code is when an electron has |eta|>2.65 according to ele.eta() but |eta|<2.65 according to ele->supercluster.eta(). This is because the only model that has a different number of nodes is the model for |eta|>2.65. The electron efficiency in this region is already low, and the chance of the above condition being satisfied is lower still, which may be why we never saw this. This PR fixes it.
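
The condition described above can be written down as a small check. This is only a sketch: the helper names are hypothetical, and the code looks at a single boundary (2.65, as quoted in this comment) rather than the full pt/eta binning.

```cpp
#include <cassert>
#include <cmath>

// Boundary quoted in the discussion; the other bin edges are omitted here.
constexpr float kExtendedEndcapBoundary = 2.65f;

// Hypothetical helper: is this eta beyond the extended-endcap boundary?
bool inExtendedEndcap(float eta) {
  return std::fabs(eta) > kExtendedEndcapBoundary;
}

// True when the two eta definitions put the same electron on opposite sides
// of the boundary, i.e. when two independent selections would pick models
// with different numbers of outputs.
bool selectionsDisagree(float electronEta, float superClusterEta) {
  return inExtendedEndcap(electronEta) != inExtendedEndcap(superClusterEta);
}
```

For instance, selectionsDisagree(2.66f, 2.64f) is true, while an electron whose two eta values both sit well inside the barrel can never trigger the mismatch.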

@jpata
Contributor

jpata commented Jun 14, 2022

Could you please also open a backport to 12_4?

@@ -49,8 +49,9 @@ namespace egammaTools {
// which has access to all the variables.
std::pair<uint, std::vector<float>> getScaledInputs(const std::map<std::string, float>& variables) const;

std::vector<std::vector<float>> evaluate(const std::vector<std::map<std::string, float>>& candidates,
const std::vector<tensorflow::Session*>& sessions) const;
std::vector<std::pair<uint, std::vector<float>>> evaluate(
Contributor

Not for this PR, but I think the same comment applies here as was suggested in the DeepSC PR:
this kind of supernested structure is easy to write down but hard to reason about later. It would be better to define classes that encapsulate the required data.
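
One possible shape for such a class, purely as an illustration of the suggestion: the DNNEvaluation struct and the countForModel helper below are hypothetical names, not part of CMSSW.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical named replacement for std::pair<uint, std::vector<float>>.
struct DNNEvaluation {
  unsigned int modelIndex;     // which model produced the outputs
  std::vector<float> outputs;  // raw DNN scores for one candidate
};

// Example consumer: named fields read more clearly than .first and .second.
std::size_t countForModel(const std::vector<DNNEvaluation>& results, unsigned int modelIndex) {
  std::size_t count = 0;
  for (const auto& result : results) {
    if (result.modelIndex == modelIndex) {
      ++count;
    }
  }
  return count;
}
```

With a type like this, the evaluate signature would return std::vector<DNNEvaluation> instead of a vector of pairs of vectors, which keeps the meaning of each element visible at the call site.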

@jpata
Contributor

jpata commented Jun 15, 2022

+reconstruction

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@perrotta
Contributor

+1
