
Reorganization of kdtrees for PFBlockAlgo and optimize track-hcal links #31295

Merged

Conversation

hatakeyamak
Contributor

@hatakeyamak hatakeyamak commented Aug 31, 2020

PR description:

This PR re-organizes kdtrees for track-hcal and track-ecal links, and optimizes the track-hcal link storage. It should give much more granular PFBlocks without changing PF outputs (with some caveats [a]), addressing the huge PFBlocks discussed at:
J. Pata (Sep 2019): https://indico.cern.ch/event/846887/contributions/3557300/
K. Hatakeyama (Mar 2020): https://indico.cern.ch/event/897397/contributions/3784811/

The PFBlock size change:

Original:
Offline: Eight PFBlocks with >1000 elements, rms of size 23.1
HLT: One PFBlock with >200 elements, rms of size 4.6

With this PR:
Offline: One PFBlock with >1000 elements, rms of size 12.5
HLT: One PFBlock with >30 elements, rms of size ~1

igprof output with 2023PU ttbar (matrix 12634.99)

Rank    % total       Self       Self / Children   Function
[Original]
            0.3  .........       1.31 / 1.31         PFBlockProducer::produce(edm::Event&, edm::EventSetup const&) [815]
            0.1  .........       0.42 / 0.42         PFProducer::produce(edm::Event&, edm::EventSetup const&) [1774]
—>
[with this PR]
            0.2  .........       0.86 / 0.86         PFBlockProducer::produce(edm::Event&, edm::EventSetup const&) [1137]
            0.0  .........       0.12 / 0.12         PFProducer::produce(edm::Event&, edm::EventSetup const&) [3483]

I believe this speed-up mainly comes from the fact that we now distinguish the two preshower layers at the prefilter stage of the linker. (This was not very time-consuming to start with in the offline sequence, but this is a sanity check.)
(Original)
http://hep.baylor.edu/hatake/PFval/misc/igreport_perf_12634.99_PFBlockLinks_ref.res
(with this PR)
http://hep.baylor.edu/hatake/PFval/misc/igreport_perf_12634.99_PFBlockLinks_test.res

[a] Some changes are observed in, for example, rawCaloFraction and rawHcalFraction of packed candidates, while the caloFraction and hcalFraction stay the same. These "raw" quantities follow the history of how candidates are produced in PFBlocks, so this is not surprising.
Also, some changes in btag CSV values are observed. In the past, when the candidate order changed, we also observed that btag-related quantities change. I believe this shouldn't introduce a systematic bias in physics performance (to be confirmed by relval etc.).

This PR also includes some cleanup, as a follow-up from #31151.

This PR also includes a long-standing request from JME to have a config parameter to control the eta range for ECAL-HCAL linking (currently performed only for |eta|>2.5, i.e. outside the traditional tracker acceptance). @lathomas @ahinzmann

This will have some interference with the ongoing #31191.
It probably makes sense for this simpler PR to go first.

@bendavid @jpata

PR validation:

Ran the matrix test 11834.0 (ttbar 2021 with PU) and made sure we don't see changes in the PF outputs in hlt, reco, and miniaod.
Also ran 23234.0_TTbar_14TeV+2026D49 and made sure the reco PF candidates stay the same.

The PF validation based on 2021 samples can be found at:
http://hep.baylor.edu/hatake/PFval/PFValidation_PFBlockLinks/

if this PR is a backport please specify the original PR and why you need to backport that PR:

This is not a backport.

@cmsbuild
Contributor

The code-checks are being triggered in jenkins.

@hatakeyamak hatakeyamak changed the title Reorganization of kdtrees and optimize track-hcal links Reorganization of kdtrees for PFBlockAlgo and optimize track-hcal links Aug 31, 2020
@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-31295/18022

@cmsbuild
Contributor

A new Pull Request was created by @hatakeyamak (Kenichi Hatakeyama) for master.

It involves the following packages:

DataFormats/ParticleFlowReco
HLTrigger/Configuration
RecoParticleFlow/PFClusterProducer
RecoParticleFlow/PFClusterTools
RecoParticleFlow/PFProducer

@perrotta, @Martin-Grunewald, @slava77, @cmsbuild, @fwyzard, @jpata can you please review it and eventually sign? Thanks.
@mmarionncern, @makortel, @felicepantaleo, @riga, @rovere, @lgray, @Martin-Grunewald, @clelange, @lecriste, @cbernet, @seemasharmafnal this is something you requested to watch as well.
@silviodonato, @dpiparo, @qliphy you are the release manager for this.

cms-bot commands are listed here

@jpata
Contributor

jpata commented Aug 31, 2020

@cmsbuild please test

@cmsbuild
Contributor

cmsbuild commented Aug 31, 2020

The tests are being triggered in jenkins.

@cmsbuild
Contributor

-1

Tested at: ba64a58

CMSSW: CMSSW_11_2_X_2020-08-30-2300
SCRAM_ARCH: slc7_amd64_gcc820
You can see the results of the tests here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-e22797/9009/summary.html

I found the following errors while testing this PR

Failed tests: RelVals

  • RelVals:

When I ran the RelVals I found an error in the following workflows:
23234.0 step3

runTheMatrix-results/23234.0_TTbar_14TeV+2026D49+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTrigger+RecoGlobal+HARVESTGlobal/step3_TTbar_14TeV+2026D49+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTrigger+RecoGlobal+HARVESTGlobal.log

28234.0 step3
runTheMatrix-results/28234.0_TTbar_14TeV+2026D60+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTrigger+RecoGlobal+HARVESTGlobal/step3_TTbar_14TeV+2026D60+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTrigger+RecoGlobal+HARVESTGlobal.log

23434.999 step4
runTheMatrix-results/23434.999_TTbar_14TeV+2026D49PU_PMXS1S2PR+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+PREMIX_PremixHLBeamSpot14PU+DigiTriggerPU+RecoGlobalPU+HARVESTGlobalPU/step4_TTbar_14TeV+2026D49PU_PMXS1S2PR+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+PREMIX_PremixHLBeamSpot14PU+DigiTriggerPU+RecoGlobalPU+HARVESTGlobalPU.log

@cmsbuild
Contributor

Comparison not run due to runTheMatrix errors (RelVals and Igprof tests were also skipped)

@hatakeyamak
Contributor Author

hatakeyamak commented Sep 13, 2020

Do you have any more info on this one?

while this change is much smaller in the QCD noPU sample:
http://hep.baylor.edu/hatake/PFval/PFValidation_PFBlockLinks/FlatQCD_NoPU_2021_ParticleFlow/PackedCandidates/electron/electronEta.pdf
Trying to see if this is the desired change.

This has been investigated, and I think the changes are actually favorable ones. I will try to summarize that over the next couple of days.

Since the largest changes appeared in electrons, I looked at gedGsfElectrons (well, actually slimmedElectrons in MINIAODSIM) more closely. The above plot comes from QCD, so the reduction of the electron yield is likely a reduction of fakes, but another question is what the impact would be on real electrons. So, I looked at the ZEE PU 2021 relval samples (slimmedElectrons, with no ID on top):
http://hep.baylor.edu/hatake/PFval/PFValidation_PFBlockLinks/SlimmedElectronEta_PFBlockLinks.png
which shows that electrons matched to gen electrons are essentially fully maintained, while the reduction appears in those that don't match gen electrons.

This is probably due to reduced combinatorics in PFEGammaAlgo, thanks to the more granular PFBlocks. So, it looks like this is a favorable change, as I already commented.

This might also indicate that a closer review of PFEGammaAlgo and how links are handled could improve its performance a little more.

To me this PR looks good to go.

@jainshilpi (should any other EGamma people be notified?)

@Martin-Grunewald
Contributor

+1

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @silviodonato, @dpiparo, @qliphy (and backports should be raised in the release meeting by the corresponding L2)

@silviodonato
Contributor

+1

@cmsbuild cmsbuild merged commit c24fc31 into cms-sw:master Sep 15, 2020
@Dr15Jones
Contributor

Dr15Jones commented Sep 15, 2020

This appears to have broken today's IB. The problem is

----- Begin Fatal Exception 15-Sep-2020 12:42:37 CEST-----------------------
An exception of category 'FileReadError' occurred while
   [0] Rethrowing an exception that happened on a different thread.
   [1] Reading branch recoPFBlocks_particleFlowBlock__RECO.
   Additional Info:
      [a] Fatal Root Error: @SUB=TStreamerInfo::BuildOld
Cannot convert reco::PFBlockElement::multilinks_ from type: reco::PFMultiLinksTC to type: map<reco::PFBlockElement::Type,reco::PFMultiLinksTC>, skip element

----- End Fatal Exception -------------------------------------------------

so we can no longer read the previously written std::vector<reco::PFBlock> data.

Another item of note: std::map<> is very bad for ROOT storage. It is not handled particularly well, which means it takes up lots of space on disk and is slow to read/write.

In addition, std::map<> has a poor in-memory layout, as each item in the map has its own memory allocation. This requires a large number of memory reads to move around in the map. The best use of std::map<> is where you are doing a large number of inserts into and removes from the data structure and need it to remain sorted the entire time. For storage that is not the case: we write it once and read it many times.

A better data structure for an Event data product would be a sorted std::vector<std::pair<>>. It stores well and it has a very compact memory layout.
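
For illustration, a minimal standalone sketch (not code from this PR) of the sorted std::vector<std::pair<>> pattern; `Key` and `Value` are placeholders standing in for reco::PFBlockElement::Type and reco::PFMultiLinksTC:

```cpp
// Minimal sketch of a "flat map": a vector of pairs, sorted once by key,
// then searched with std::lower_bound. Contiguous storage, compact on disk.
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

using Key = int;      // placeholder for reco::PFBlockElement::Type
using Value = float;  // placeholder for reco::PFMultiLinksTC
using FlatMap = std::vector<std::pair<Key, Value>>;

// Fill in any order, then sort once by key before lookups or storage.
void finalize(FlatMap& m) {
  std::sort(m.begin(), m.end(),
            [](auto const& a, auto const& b) { return a.first < b.first; });
}

// Binary search instead of pointer chasing through tree nodes.
const Value* find(const FlatMap& m, Key k) {
  auto it = std::lower_bound(m.begin(), m.end(), k,
                             [](auto const& p, Key key) { return p.first < key; });
  return (it != m.end() && it->first == k) ? &it->second : nullptr;
}

int main() {
  FlatMap links{{3, 0.5f}, {1, 0.2f}};
  finalize(links);
  if (const Value* v = find(links, 3))
    std::printf("key 3 -> %.1f\n", *v);
}
```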

@hatakeyamak
Contributor Author

hatakeyamak commented Sep 15, 2020

Another item of note: std::map<> is very bad for ROOT storage. It is not handled particularly well, which means it takes up lots of space on disk and is slow to read/write.

In addition, std::map<> has a poor in-memory layout, as each item in the map has its own memory allocation. This requires a large number of memory reads to move around in the map. The best use of std::map<> is where you are doing a large number of inserts into and removes from the data structure and need it to remain sorted the entire time. For storage that is not the case: we write it once and read it many times.

A better data structure for an Event data product would be a sorted std::vector<std::pair<>>. It stores well and it has a very compact memory layout.

I see. Thank you. I will look into std::vector<std::pair<>>.

so we can no longer read the previously written std::vector<reco::PFBlock> data.

I see. I am not clear on what we should do about it and would appreciate suggestions. Sorry for the trouble.

@makortel
Contributor

so we can no longer read the previously written std::vector<reco::PFBlock> data.

I see. I am not clear on what we should do about it and would appreciate suggestions. Sorry for the trouble.

Can the earlier "scalar" PFMultiLinksTC multilinks_ be read into one element of the std::vector<std::pair<reco::PFBlockElement::Type, PFMultiLinksTC>> aggregate? If yes, it could be dealt with via an ioread rule.

@hatakeyamak
Contributor Author

hatakeyamak commented Sep 15, 2020

I see. Maybe we can make it work so that it won't crash, but I am not sure it will do the right thing, given that how PFMultiLinksTC is used for different combinations of links is being changed.
What workflow/test reads previously produced std::vector<reco::PFBlock>?

@Dr15Jones
Contributor

Workflows 134.807, 134.808, 134.907, 134.908, 136.727, 136.728, and 136.8391 are the ones failing in the IB, all in step 2.

@hatakeyamak
Contributor Author

Thank you. I will have a deeper look with urgency.

@slava77
Contributor

slava77 commented Sep 15, 2020

134.807

this is reading /DoubleEG/Run2015C-ZElectron-PromptReco-v1/RAW-RECO
Is the PFBlock actually consumed, or is this an error due to simply reading the file?
If it's the latter, we could just as well come up with a solution that simply ignores this data in the input.

@hatakeyamak
Contributor Author

I am hoping the latter is the case, but I don't know. I will attack this after my class... Of course, I will be happy if somebody comes up with a solution by then.

@hatakeyamak
Contributor Author

hatakeyamak commented Sep 16, 2020

So, if I comment out:
https://cmssdt.cern.ch/lxr/source/RecoParticleFlow/Configuration/python/RecoParticleFlow_EventContent_cff.py#0066
'keep recoPFBlocks_particleFlowBlock_*_*',
then 134.807 works fine, so I think this block is not really consumed.
In this situation, I am still wondering what would be the best way forward.

@slava77
Contributor

slava77 commented Sep 16, 2020

So, if I comment out:
https://cmssdt.cern.ch/lxr/source/RecoParticleFlow/Configuration/python/RecoParticleFlow_EventContent_cff.py#0066
'keep recoPFBlocks_particleFlowBlock_*_*',
then 134.807 works fine, so I think this block is not really consumed.
In this situation, I am still wondering what would be the best way forward.

If the input multilinks_ is not really needed, then one solution is an ioread rule (in the src/*_def.xml file) specifying that the new multilinks object is initialized to an empty/default-constructed value.
I'd still suggest trying to read it properly, if the conversion is not too complicated.
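
For reference, a rough sketch of what such an ioread rule could look like in the dictionary XML; the class name, version range, and member names here are illustrative assumptions and would have to match the actual classes_def.xml entry:

```xml
<!-- Illustrative sketch only: attributes and member names must match the real dictionary. -->
<ioread sourceClass="reco::PFBlockElement" version="[1-]"
        targetClass="reco::PFBlockElement"
        source="reco::PFMultiLinksTC multilinks_" target="multilinks_">
<![CDATA[
  // The old scalar multilinks_ is dropped; the new container is left default-constructed.
  // Alternatively, onfile.multilinks_ could be copied into one entry of the new container.
  multilinks_.clear();
]]>
</ioread>
```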

@jpata
Contributor

jpata commented Sep 16, 2020

I'm a bit concerned that this bug did not show up in the jenkins tests. How can we improve this?

@jpata
Contributor

jpata commented Sep 16, 2020

It does not seem like the multilinks are used anywhere outside of PFBlockAlgo. As they are created within PFBlockAlgo for internal consumption, how about marking them transient in the future (or otherwise not saving them in the EDM file, also to save disk space), and just initializing them with defaults to prevent a crash when reading legacy files?

Just to note, in #31191 (a rebase wrt. this PR is almost done) we change multilinks to be index-based, so this can be revisited once more.
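
As a rough illustration (not the actual change made in #31485), marking a member transient in the genreflex selection XML would look something like the following; the class and member names are taken from the error message above and are meant only as an example:

```xml
<!-- Illustrative sketch: a transient member is never persisted to the EDM file. -->
<class name="reco::PFBlockElement">
  <field name="multilinks_" transient="true"/>
</class>
```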

@Dr15Jones
Contributor

@jpata

I'm a bit concerned that this bug did not show up in the jenkins tests. How can we improve this?

For the framework, we have ROOT files containing RAW data for all the different file formats we've had throughout the history of CMSSW. We have unit tests that run over those files to make sure we can always read the old formats. We have this because the framework guarantees that we can always read all old RAW data files.

We don't have anything similar for files containing other data formats.

@jpata
Contributor

jpata commented Sep 16, 2020

@Dr15Jones

We don't have anything similar for files containing other data formats.

OK, understood. Maybe we can look into increasing the unit-test coverage for reading old files also for reco.

To fix the issue currently arising in the IB, I tried the following: #31485; perhaps we can discuss that particular approach there.

@hatakeyamak
Contributor Author

@jpata thanks for #31485. I had already considered making multilinks transient in this PR, but didn't go ahead without seeing a real need; I think this is good.

Maybe one of the RAW-RECO workflows can be made part of the standard jenkins tests to catch this sort of thing in the future.

@Dr15Jones
Contributor

Maybe one of the RAW-RECO workflows can be made part of the standard jenkins tests to catch this sort of thing in the future.

A simpler idea would be to just have a job that only reads/writes a file (forcing it not to do a 'fast copy').
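
A minimal sketch of such a copy job as a cmsRun configuration, assuming a placeholder input file name; disabling fast cloning in PoolOutputModule forces each stored product to be actually read and re-streamed:

```python
# Minimal copy job: read an existing EDM file and rewrite it without fast cloning,
# so every stored product really goes through the ROOT read path.
import FWCore.ParameterSet.Config as cms

process = cms.Process("COPY")

process.source = cms.Source("PoolSource",
    fileNames = cms.untracked.vstring("file:old_format_input.root")  # placeholder input
)

process.out = cms.OutputModule("PoolOutputModule",
    fileName = cms.untracked.string("copied.root"),
    fastCloning = cms.untracked.bool(False)  # force a real read/write instead of a fast copy
)

process.ep = cms.EndPath(process.out)
```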
