Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Changing tensorflow implementation in RecoHGCal/TICL #36724

Merged
merged 12 commits into from
Jan 27, 2022
Merged

Changing tensorflow implementation in RecoHGCal/TICL #36724

merged 12 commits into from
Jan 27, 2022

Conversation

dr-stringfellow
Copy link
Contributor

PR description:

The tensorflow model used to regress energy and ID of TICL objects ("tracksters") has been moved from an ED-based implementation to an ES setup. All plugins (CA, CLUE3D, FastJet, ...) will now access the same cached model, instead of caching one model for each plugin as was the case in the EDProducer setup.

This is now in line with the RecoTracker tensorflow implementation. In this first step, we are pointing also to their tensorflow producer. Eventually, in a pull request that would follow soon, this should be harmonised and moved to a central, neutral place that both RecoTracker and RecoHGCal use.

PR validation:

It saves about 5% in memory on upgrade workflow
39034.202:

https://bmaier.web.cern.ch/bmaier/view.php?dir=figs/hgcal/ticltf/new

(The more plugins are run in the same job, the higher the savings.)

@cmsbuild
Copy link
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-36724/27825

@cmsbuild
Copy link
Contributor

A new Pull Request was created by @dr-stringfellow for master.

It involves the following packages:

  • RecoHGCal/TICL (upgrade, reconstruction)

@clacaputo, @cmsbuild, @AdrianoDee, @srimanob, @slava77, @jpata can you please review it and eventually sign? Thanks.
@felicepantaleo, @rovere, @apsallid, @sobhatta, @lecriste, @hatakeyamak, @ebrondol this is something you requested to watch as well.
@perrotta, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@rovere
Copy link
Contributor

rovere commented Jan 17, 2022

Thanks a lot, @dr-stringfellow for this nice piece of work!

@rovere
Copy link
Contributor

rovere commented Jan 17, 2022

@cmsbuild please test

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-95bd10/21763/summary.html
COMMIT: 2dd2348
CMSSW: CMSSW_12_3_X_2022-01-17-1100/slc7_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/36724/21763/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 1 differences found in the comparisons
  • DQMHistoTests: Total files compared: 43
  • DQMHistoTests: Total histograms compared: 3464734
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3464712
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 42 files compared)
  • Checked 181 log files, 42 edm output root files, 43 DQM output files
  • TriggerResults: no differences found

@cmsbuild
Copy link
Contributor

-1

Failed Tests: RelVals-INPUT
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-95bd10/21806/summary.html
COMMIT: 2dd2348
CMSSW: CMSSW_12_3_X_2022-01-18-2300/slc7_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/36724/21806/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals-INPUT

  • 4.764.76_ZMuSkim2012D+ZMuSkim2012D+HLTDSKIM2+RECODR1reHLT2+HARVESTDR1reHLT/step2_ZMuSkim2012D+ZMuSkim2012D+HLTDSKIM2+RECODR1reHLT2+HARVESTDR1reHLT.log

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 2 differences found in the comparisons
  • DQMHistoTests: Total files compared: 43
  • DQMHistoTests: Total histograms compared: 3464860
  • DQMHistoTests: Total failures: 5
  • DQMHistoTests: Total nulls: 1
  • DQMHistoTests: Total successes: 3464832
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.004 KiB( 42 files compared)
  • DQMHistoSizes: changed ( 312.0 ): 0.004 KiB MessageLogger/Warnings
  • Checked 181 log files, 42 edm output root files, 43 DQM output files
  • TriggerResults: no differences found

@jpata
Copy link
Contributor

jpata commented Jan 24, 2022

Note that the static checker complains about the const_cast (also an issue for the trackDNN), related to #32529. It's not something that can be fixed in this PR, but to be kept in mind.

4 warnings generated.
/data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_12_3_X_2022-01-18-2300/src/RecoHGCal/TICL/plugins/PatternRecognitionbyFastJet.cc:272:19: warning: const_cast was used, this may result in thread-unsafe code [threadsafety.ConstCast]
  tensorflow::run(const_cast<tensorflow::Session *>(eidSession), inputList, outputNames, &outputs);
                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 warning generated.
/data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_12_3_X_2022-01-18-2300/src/RecoHGCal/TICL/plugins/PatternRecognitionbyCA.cc:439:19: warning: const_cast was used, this may result in thread-unsafe code [threadsafety.ConstCast]
  tensorflow::run(const_cast<tensorflow::Session *>(eidSession), inputList, outputNames, &outputs);
                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 warning generated.
/data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_12_3_X_2022-01-18-2300/src/RecoHGCal/TICL/plugins/TrackstersMergeProducer.cc:662:19: warning: const_cast was used, this may result in thread-unsafe code [threadsafety.ConstCast]
  tensorflow::run(const_cast<tensorflow::Session *>(eidSession), inputList, outputNames, &outputs);
                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1 warning generated.
/data/cmsbld/jenkins/workspace/ib-run-pr-tests/CMSSW_12_3_X_2022-01-18-2300/src/RecoHGCal/TICL/plugins/PatternRecognitionbyCLUE3D.cc:403:19: warning: const_cast was used, this may result in thread-unsafe code [threadsafety.ConstCast]
  tensorflow::run(const_cast<tensorflow::Session *>(eidSession), inputList, outputNames, &outputs);
                  ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I agree that in a follow-up (not necessarily by the HGCAL team), the TfGraphDefProducer stuff should be moved out of RecoTracker to a common area.

@jpata
Copy link
Contributor

jpata commented Jan 24, 2022

For posterity, can you comment on what settings were used to make this plot (besides the worfklow 39034.202):
image

@jpata
Copy link
Contributor

jpata commented Jan 24, 2022

enable profiling

@jpata
Copy link
Contributor

jpata commented Jan 25, 2022

@cmsbuild please test

@dr-stringfellow
Copy link
Contributor Author

dr-stringfellow commented Jan 25, 2022

For posterity, can you comment on what settings were used to make this plot (besides the worfklow 39034.202):
image

The following settings were used:

cmsRun step3 -s RAW2DIGI,L1Reco,RECO,RECOSIM,PAT,VALIDATION:@phase2Validation+@miniAODValidation,DQM:@phase2+@miniAODDQM --conditions auto:phase2_realistic_T27 --datatier GEN-SIM-RECO,MINIAODSIM,DQMIO -n 30 --eventcontent FEVTDEBUGHLT,MINIAODSIM,DQM --geometry Extended2026D87 --era Phase2C11I13T27M9 --procModifiers fastJetTICL,clue3D --nThreads 1 --customise Validation/Performance/TimeMemoryInfo.py

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-95bd10/21972/summary.html
COMMIT: 2dd2348
CMSSW: CMSSW_12_3_X_2022-01-24-2300/slc7_amd64_gcc10
Additional Tests: PROFILING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/36724/21972/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 1 differences found in the comparisons
  • DQMHistoTests: Total files compared: 43
  • DQMHistoTests: Total histograms compared: 3449324
  • DQMHistoTests: Total failures: 0
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3449302
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 42 files compared)
  • Checked 181 log files, 42 edm output root files, 43 DQM output files
  • TriggerResults: no differences found

@jpata
Copy link
Contributor

jpata commented Jan 26, 2022

+reconstruction

  • technical, improves the memory handling of the TICL TF models by loading them from the EventSetup
  • no reco changes expected/observed

@AdrianoDee
Copy link
Contributor

+upgrade
Technical, improving memory for upgrade wfs including TICL.

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@@ -446,7 +436,7 @@ void PatternRecognitionbyCA<TILES>::energyRegressionAndID(const std::vector<reco
}

// run the inference (7)
tensorflow::run(eidSession_, inputList, outputNames, &outputs);
tensorflow::run(const_cast<tensorflow::Session *>(eidSession), inputList, outputNames, &outputs);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eidSession is already a const tensorflow::Session *: why do you need the const_cast?

Copy link
Contributor

@jpata jpata Jan 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is removing the const (OK since tensorflow::run is known to be thread-safe), to fit the the tensorflow interface.
related to #35541

@@ -410,7 +400,7 @@ void PatternRecognitionbyCLUE3D<TILES>::energyRegressionAndID(const std::vector<
}

// run the inference (7)
tensorflow::run(eidSession_, inputList, outputNames, &outputs);
tensorflow::run(const_cast<tensorflow::Session *>(eidSession), inputList, outputNames, &outputs);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eidSession is already a const tensorflow::Session *: why do you need the const_cast?

@@ -278,7 +269,7 @@ void PatternRecognitionbyFastJet<TILES>::energyRegressionAndID(const std::vector
}

// run the inference (7)
tensorflow::run(eidSession_, inputList, outputNames, &outputs);
tensorflow::run(const_cast<tensorflow::Session *>(eidSession), inputList, outputNames, &outputs);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eidSession is already a const tensorflow::Session *: why do you need the const_cast?

@@ -655,7 +659,7 @@ void TrackstersMergeProducer::energyRegressionAndID(const std::vector<reco::Calo
}

// run the inference (7)
tensorflow::run(eidSession_, inputList, outputNames, &outputs);
tensorflow::run(const_cast<tensorflow::Session *>(eidSession), inputList, outputNames, &outputs);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

eidSession is already a const tensorflow::Session *: why do you need the const_cast?

@perrotta
Copy link
Contributor

+1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants