Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix features order #29799

Merged
merged 3 commits into from May 13, 2020
Merged

Fix features order #29799

merged 3 commits into from May 13, 2020

Conversation

rovere
Copy link
Contributor

@rovere rovere commented May 11, 2020

PR description:

In recent meetings at the Upgrade HLT TDR, strange distribution of PFCandidates produced from TICL have been reported: JetMET BTV.
After some debugging, this has been traced back to the way in which we pass features to ML model for PID and energy regression. This PR reshuffle them in the correct order. Local tests show no more dependency on phi.

PR validation:

Besides checking that the strange phi behaviour is gone:

runTheMatrix.py -l limited

@cmsbuild
Copy link
Contributor

The code-checks are being triggered in jenkins.

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-29799/15284

  • This PR adds an extra 28KB to repository

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @rovere (Marco Rovere) for master.

It involves the following packages:

RecoHGCal/TICL

@perrotta, @cmsbuild, @kpedro88, @slava77 can you please review it and eventually sign? Thanks.
@felicepantaleo, @riga, @apsallid, @sobhatta, @lecriste, @hatakeyamak, @clelange this is something you requested to watch as well.
@silviodonato, @dpiparo you are the release manager for this.

cms-bot commands are listed here

@rovere
Copy link
Contributor Author

rovere commented May 11, 2020

please test

@cmsbuild
Copy link
Contributor

cmsbuild commented May 11, 2020

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-run-pr-tests/6227/console Started: 2020/05/11 17:53

@cmsbuild
Copy link
Contributor

+1
Tested at: eda2df4
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bd1abe/6227/summary.html
CMSSW: CMSSW_11_1_X_2020-05-11-1100
SCRAM_ARCH: slc7_amd64_gcc820

@cmsbuild
Copy link
Contributor

Comparison job queued.

@cmsbuild
Copy link
Contributor

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-bd1abe/6227/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 318 differences found in the comparisons
  • DQMHistoTests: Total files compared: 34
  • DQMHistoTests: Total histograms compared: 2697527
  • DQMHistoTests: Total failures: 1116
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 2696092
  • DQMHistoTests: Total skipped: 319
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 33 files compared)
  • Checked 147 log files, 16 edm output root files, 34 DQM output files

@kpedro88
Copy link
Contributor

Comparison differences are limited to Phase 2 workflows, as expected.

@rovere I notice a few potentially surprising changes. Please confirm if these are expected:

  • overall slightly fewer objects reconstructed:
    trackster eta
  • many fewer EM objects:
    tracksterEM energy
  • corresponding increase in HAD objects:
    tracksterHAD energy
  • narrower distribution of MIP regressed energy:
    tracksterMIP energy

@rovere
Copy link
Contributor Author

rovere commented May 12, 2020

Ciao @kpedro88 thanks for checking. The problem is that the comparison, in this case, does not make too much sense, since the ML inference was wrong before and, as a consequence, all the results produced in terms of PID and energy regression were unreliable.
Rephrased in another way: if POGs report mismatches and poor performances related to PID and energy regression, we will follow them up and investigate. We are also in the process of retraining the model using the most up-to-date TICL/trackster reconstruction that was not available at the time of the original training included in CMSSW.

@@ -333,9 +333,9 @@ void PatternRecognitionbyCA::energyRegressionAndID(const std::vector<reco::CaloC
float *features = &input.tensor<float, 4>()(i, j, seenClusters[j], 0);

// fill features
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This "interface" is extremely error prone (as this bug just demonstrated).
Whoever approved this code for CMSSW clearly was not doing their job.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like the original sin is #27917.

What are the plans to fix the use of anonymous, unordered list for passing parameters to tensorflow or other ML engines ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fwyzard this is a good point. Maybe the framework should add a data structure like edm::featureMap that enforces a schema and has interfaces to output in various ML formats.

(Though, this would only prevent one of the two problems addressed by this PR. No framework feature can specify whether or not energy should be divided by vertex multiplicity...)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Though, this would only prevent one of the two problems addressed by this PR. No framework feature can specify whether or not energy should be divided by vertex multiplicity...)

:-)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #29818 to follow this

@kpedro88
Copy link
Contributor

@rovere I take the point that the "old" results are just wrong. However, can you remark if the "new" results seem correct?

@rovere
Copy link
Contributor Author

rovere commented May 12, 2020

@kpedro88 I can confirm that using single-pion and single-electron guns into HGCAL the network is able to assign the proper ID to the reconstructed Tracksters. How this will scale on more complex events with PU is something we still have to verify.
The regressed energy is a little off, right now, but that's expected and due to the changes that went into the reconstruction since the last time the model has been trained. Users have been warned not to use it for the time being. A retraining of the model is needed and will come in the next weeks.

@kpedro88
Copy link
Contributor

+upgrade

@perrotta
Copy link
Contributor

+1

  • Correct order of TensorFlow input features is restored
  • Jenkins tests pass and show differences where expected
  • A retraining of the model will be needed, but this is independent from this pull request

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo (and backports should be raised in the release meeting by the corresponding L2)

@silviodonato
Copy link
Contributor

+1

@cmsbuild cmsbuild merged commit e1b8483 into cms-sw:master May 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants