Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HBHE: speed up + mahi move double to float #28366

Merged

Conversation

mariadalfonso
Copy link
Contributor

@mariadalfonso mariadalfonso commented Nov 8, 2019

This PR gain 45% in CPU speed up in the minimize function see igprof results below.

Validation plots provided in double and floats
http://dalfonso.web.cern.ch/dalfonso/HCAL/HCAL_PR28366.pdf
Impact on the INtime energy is negligible, small changes on the arrivalTime.

To keep in mind in the future, the maxSV size should be moved 10 -->8 once the 2018menu is removed.

-- release tag: CMSSW_11_0_X_2019-11-07-2300 --

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
           81.6  .........      54.70 / 54.97        MahiFit::phase1Apply(HBHEChannelInfo const&, float&, float&, bool&, float&) const [18]
[19]       81.6      54.70       2.76 / 51.93      MahiFit::doFit(std::array<float, 3ul>&, int) const
           60.0  .........      40.23 / 40.23        MahiFit::minimize() const [20]
            9.8  .........       6.56 / 6.56         MahiFit::calculateArrivalTime() const [25]
            7.5  .........       5.05 / 5.05         MahiFit::updatePulseShape(double, Eigen::Matrix<double, -1, 1, 0, 19, 1>&, Eigen::Matrix<double, -1, 1, 0, 19, 1>&, Eigen::Matrix<double, -1, -1, 0, 19, 19>&) const [28]


- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
           60.0  .........      40.23 / 54.70        MahiFit::doFit(std::array<float, 3ul>&, int) const [19]
[20]       60.0      40.23       0.31 / 39.91      MahiFit::minimize() const
           30.8  .........      20.61 / 20.61        MahiFit::nnls() const [21]
           16.0  .........      10.74 / 10.74        MahiFit::updateCov() const [23]
            7.7  .........       5.15 / 5.15         MahiFit::onePulseMinimize() const [27]
            5.0  .........       3.38 / 3.38         MahiFit::calculateChiSq() const [33]

-- this PR at commit d6a2dd1 --

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
           74.1  .........      33.31 / 33.56        MahiFit::phase1Apply(HBHEChannelInfo const&, float&, float&, bool&, float&) const [18]
[19]       74.1      33.31       1.74 / 31.56      MahiFit::doFit(std::array<float, 3ul>&, int) const
           49.8  .........      22.38 / 22.38        MahiFit::minimize() const [20]
           11.0  .........       4.95 / 4.95         MahiFit::updatePulseShape(float, Eigen::Matrix<float, -1, 1, 0, 15, 1>&, Eigen::Matrix<float, -1, 1, 0, 15, 1>&, Eigen::Matrix<float, -1, -1, 0, 15, 15>&) const [25]
            9.4  .........       4.22 / 4.22         MahiFit::calculateArrivalTime(unsigned int) const [26]

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
           49.8  .........      22.38 / 33.31        MahiFit::doFit(std::array<float, 3ul>&, int) const [19]
[20]       49.8      22.38       0.32 / 22.06      MahiFit::minimize() const
           23.2  .........      10.42 / 10.42        MahiFit::nnls() const [21]
           14.7  .........       6.59 / 6.59         MahiFit::updateCov(Eigen::Matrix<float, -1, -1, 0, 10, 10> const&) const [22]
            6.5  .........       2.93 / 2.93         MahiFit::onePulseMinimize() const [29]
            4.7  .........       2.11 / 2.11         MahiFit::calculateChiSq() const [34]

Tests above done on fu-c2a02-37-03 2 CPUs:
0: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz (16 cores, 32 threads)
1: Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz (16 cores, 32 threads)

@cmsbuild
Copy link
Contributor

cmsbuild commented Nov 8, 2019

The code-checks are being triggered in jenkins.

@cmsbuild
Copy link
Contributor

cmsbuild commented Nov 8, 2019

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-28366/12669

  • This PR adds an extra 28KB to repository

Code check has found code style and quality issues which could be resolved by applying following patch(s)

@cmsbuild
Copy link
Contributor

cmsbuild commented Nov 8, 2019

The code-checks are being triggered in jenkins.

@cmsbuild
Copy link
Contributor

cmsbuild commented Nov 8, 2019

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-28366/12672

  • This PR adds an extra 28KB to repository

@cmsbuild
Copy link
Contributor

cmsbuild commented Nov 8, 2019

A new Pull Request was created by @mariadalfonso for master.

It involves the following packages:

RecoLocalCalo/HcalRecAlgos

@perrotta, @cmsbuild, @slava77 can you please review it and eventually sign? Thanks.
@apsallid, @argiro this is something you requested to watch as well.
@davidlange6, @slava77, @fabiocos you are the release manager for this.

cms-bot commands are listed here

@perrotta
Copy link
Contributor

perrotta commented Nov 8, 2019

please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Nov 8, 2019

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-run-pr-tests/3394/console Started: 2019/11/08 14:06

@cmsbuild
Copy link
Contributor

cmsbuild commented Nov 8, 2019

-1

Tested at: 393aff0

You can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-66522f/3394/summary.html

I found follow errors while testing this PR

Failed tests: AddOn

  • AddOn:

I found errors in the following addon tests:

cmsRun /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc820/cms/cmssw-patch/CMSSW_11_0_X_2019-11-07-2300/src/HLTrigger/Configuration/test/OnLine_HLT_2018.py realData=False globalTag=@ inputFiles=@ : FAILED - time: date Fri Nov 8 15:33:09 2019-date Fri Nov 8 15:17:16 2019 s - exit: 34304
cmsDriver.py RelVal -s HLT:2018,RAW2DIGI,L1Reco,RECO --mc --scenario=pp -n 10 --conditions auto:run2_mc_2018 --relval 9000,50 --datatier "RAW-HLT-RECO" --eventcontent FEVTDEBUGHLT --customise=HLTrigger/Configuration/CustomConfigs.L1THLT --era Run2_2018 --processName=HLTRECO --filein file:RelVal_Raw_2018_MC.root --fileout file:RelVal_Raw_2018_MC_HLT_RECO.root : FAILED - time: date Fri Nov 8 15:33:09 2019-date Fri Nov 8 15:17:16 2019 s - exit: 34304
cmsRun /cvmfs/cms-ib.cern.ch/week1/slc7_amd64_gcc820/cms/cmssw-patch/CMSSW_11_0_X_2019-11-07-2300/src/HLTrigger/Configuration/test/OnLine_HLT_2018.py realData=True globalTag=@ inputFiles=@ : FAILED - time: date Fri Nov 8 15:31:15 2019-date Fri Nov 8 15:17:28 2019 s - exit: 34304
cmsDriver.py RelVal -s HLT:2018,RAW2DIGI,L1Reco,RECO --data --scenario=pp -n 10 --conditions auto:run2_data_2018 --relval 9000,50 --datatier "RAW-HLT-RECO" --eventcontent FEVTDEBUGHLT --customise=HLTrigger/Configuration/CustomConfigs.L1THLT --era Run2_2018 --processName=HLTRECO --filein file:RelVal_Raw_2018_DATA.root --fileout file:RelVal_Raw_2018_DATA_HLT_RECO.root : FAILED - time: date Fri Nov 8 15:31:15 2019-date Fri Nov 8 15:17:28 2019 s - exit: 34304

@cmsbuild
Copy link
Contributor

cmsbuild commented Nov 8, 2019

Comparison job queued.

@cmsbuild
Copy link
Contributor

cmsbuild commented Nov 8, 2019

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-66522f/3394/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 10359 differences found in the comparisons
  • DQMHistoTests: Total files compared: 34
  • DQMHistoTests: Total histograms compared: 2936360
  • DQMHistoTests: Total failures: 6779
  • DQMHistoTests: Total nulls: 3
  • DQMHistoTests: Total successes: 2929237
  • DQMHistoTests: Total skipped: 341
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.012 KiB( 33 files compared)
  • DQMHistoSizes: changed ( 136.788 ): 0.020 KiB JetMET/SUSYDQM
  • DQMHistoSizes: changed ( 136.85 ): -0.008 KiB JetMET/SUSYDQM
  • Checked 147 log files, 16 edm output root files, 34 DQM output files

@slava77
Copy link
Contributor

slava77 commented Nov 8, 2019

the crash seems related, it's in MahiFit::calculateArrivalTime

https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-66522f/3394/addOnTests/hlt_mc_2018/cmsDriver.py_RelVal_-s_HLT:2018,RAW2DIGI,L1Reco,RECO_--mc_--scenario=pp_-n_10_--conditions_auto:run2_mc_2018_--relval_9000,50_--datatier_RAW-HLT-RECO.log

Eigen/src/Core/Block.h:122: Eigen::Block<XprType, BlockRows, BlockCols, InnerPanel>::Block(XprType&, Eigen::Index) [with XprType = Eigen::Matrix<float, -1, -1, 0, 8, 8>; int BlockRows = -1; int BlockCols = 1; bool InnerPanel = true; Eigen::Index = long int]: Assertion `(i>=0) && ( ((BlockRows==1) && (BlockCols==XprType::ColsAtCompileTime) && i<xpr.rows()) ||((BlockRows==XprType::RowsAtCompileTime) && (BlockCols==1) && i<xpr.cols()))' failed.

@cmsbuild
Copy link
Contributor

Pull request #28366 was updated. @perrotta, @cmsbuild, @slava77 can you please check and sign again.

@perrotta
Copy link
Contributor

please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Nov 14, 2019

The tests are being triggered in jenkins.
https://cmssdt.cern.ch/jenkins/job/ib-run-pr-tests/3486/console Started: 2019/11/14 19:12

@cmsbuild
Copy link
Contributor

@cmsbuild
Copy link
Contributor

Comparison job queued.

@cmsbuild
Copy link
Contributor

Comparison is ready
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-66522f/3486/summary.html

Comparison Summary:

  • No significant changes to the logs found
  • Reco comparison results: 10357 differences found in the comparisons
  • DQMHistoTests: Total files compared: 34
  • DQMHistoTests: Total histograms compared: 2936360
  • DQMHistoTests: Total failures: 6777
  • DQMHistoTests: Total nulls: 3
  • DQMHistoTests: Total successes: 2929239
  • DQMHistoTests: Total skipped: 341
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.012 KiB( 33 files compared)
  • DQMHistoSizes: changed ( 136.788 ): 0.020 KiB JetMET/SUSYDQM
  • DQMHistoSizes: changed ( 136.85 ): -0.008 KiB JetMET/SUSYDQM
  • Checked 147 log files, 16 edm output root files, 34 DQM output files

@perrotta
Copy link
Contributor

@mariadalfonso I'm noticing that in the validation plot linked to the PR description there is not the migration of the mahi arrival time from 0 to non zero values that was visible in the first instance of it. Does that new plot refer to the state after the latest commit d6a2dd1 of this PR?
What was causing the cancellation of that previous migration?

@perrotta
Copy link
Contributor

The igprof performance results still refer to a quite old version of this PR (after commit 0ac75e6 , as written in the PR descritpion): will you be able to update them with the most recent version of this PR?

@mariadalfonso
Copy link
Contributor Author

The igprof performance results still refer to a quite old version of this PR (after commit 0ac75e6 , as written in the PR descritpion): will you be able to update them with the most recent version of this PR?

updated :)

@mariadalfonso I'm noticing that in the validation plot linked to the PR description there is not the migration of the mahi arrival time from 0 to non zero values that was visible in the first instance of it. Does that new plot refer to the state after the latest commit d6a2dd1 of this PR?
What was causing the cancellation of that previous migration?

the migration on the arrivalTime is visible on floats see page2 (both original PR and final commit) , not visible double see page1 (intermediate test)

@perrotta
Copy link
Contributor

the migration on the arrivalTime is visible on floats see page2 (both original PR and final commit) , not visible double see page1 (intermediate test)

Ah, ok, sorry: I did not notice the second page...

Then, I must agree with you that "ArrivalTime is pretty much a ill undefined", if by only changing the precision of the input quantities you end up with such large differences in the fitted result. Such arrival time eventually becomes the RecHit time: forgive my ignorance about it, but how is this evidently "ill defined" quantity relevant in the propagation to some higher level computed quantity? (Perhaps outside the scope of this PR) are there plans to make it more robust, if so?

@mariadalfonso
Copy link
Contributor Author

Goal of this PR is solely to speed up for Run3-HLT and I hope we can have it in the pre13.

For the Run2/Run3 there is no use of the recHit time() other than the beamhalo filter.
For the Phase2 I honestly do not know how is used: jenkin saw enough changes in the PF and jetMET quantities. I suspect is generically used for the barrel to exploit the new ECAL electronics.

HCAL long term plan redefinition of the arrivalTime

The physics case will be to target the LLP analyses.

The possibilities we considered so far:

  1. M0: charge weights of the TS of the DIGI, problem in high PU
  2. DIGI/FIT hybrid: charge weighted of the TS of the DIGI after subtraction of the OOT pulses.
  3. M2 (fit E and time) = the time is explicitely fitted but the M2-fit was also abandoned because was not perfect.
  4. MAHI: add explicitely the fit for the time. For now we this have an indirect estimation that any discrepancy of the fit is related to different arrival time of the pulse.

I never saw the correlation with a real MC signal displaced in time and this should be done before we use. In addition use the TDC from HE detector to validate the time with data.

@perrotta
Copy link
Contributor

+1

  • This PR speeds up the mahi algorithm by reducing the dimensions of the fit matrices and by lowering the precision of the internal calculations from double to float (plus some additional cleaning)
  • Test outputs show differences which are compatible with numerical fluctuations, see e.g. workflow 136.788 (RunSinglePh2017B) which with all changes in this PR but excluding the double->float conversion shows
    image
    while when adding such double->float conversion the fluctuations further amplificate
    image
  • During the review it appeared evident that the computation of the mahi arrival time is instable, and can lead to some ill defined result: whenever such an arrival time is propagated to higher level quantities this instability could lead to some "unpredictable" fluctuation in the final object.
    • However, such an instability is not originated by this PR, which is only meant to speed up the current algo
    • Apparently (at least it is so in the single example provided) , with the usage of float's instead of double's the migration of non-zero arrival times to zero seems to get reduced
    • There are (long term...) plans to fix the computation of that arrival time, see HBHE: speed up + mahi move double to float #28366 (comment)

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @davidlange6, @slava77, @smuzaffar, @fabiocos (and backports should be raised in the release meeting by the corresponding L2)

@perrotta
Copy link
Contributor

Issue #28423 was opened to follow up with the progresses in fixing the ArrivalTime computation

@fabiocos
Copy link
Contributor

+1

@cmsbuild cmsbuild merged commit 775ab38 into cms-sw:master Nov 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants