Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Skip invalid channels in the GPU HCAL RecHit producer #39738

Merged
merged 2 commits into from Oct 18, 2022

Conversation

fwyzard
Copy link
Contributor

@fwyzard fwyzard commented Oct 14, 2022

PR description:

This PR consistes of two parts:

  • the GPU code now marks as invalid (setting chi² = -9999.) the HCAL channels for which it cannot determine the SOI;
  • the CPU code that converts from SoA to legacy format now skips all invalid channels.

See h#39693 for a discussion on the underlying issues.

PR validation:

Tested on top of CMSSW_12_4_10 over the events that caused the HLT to crash in runs 357898, 359998, and 360295.
Without these changes the original crash can be reproduced in multiple files:

----- Begin Fatal Exception 14-Oct-2022 20:27:30 CEST-----------------------
An exception of category 'Conditions not found' occurred while
   [0] Processing  Event run: 360295 lumi: 174 event: 264575349 stream: 7
   [1] Running path 'Path'
   [2] Calling method for module CaloTowersCreator/'hltTowerMakerForAll'
Exception Message:
Unavailable Conditions of type HcalChannelQuality for cell (0x0) 
----- End Fatal Exception -------------------------------------------------

With these changes the HLT jobs run to completion on almost all those files (barring one that is affected by a different, ECAL-related crash).

If this PR is a backport please specify the original PR and why you need to backport that PR. If this PR will be backported please specify to which release cycle the backport is meant for:

To be backported to CMSSW 12.4.x and 12.5.x for data taking.

@cmsbuild cmsbuild added this to the CMSSW_12_6_X milestone Oct 14, 2022
@fwyzard
Copy link
Contributor Author

fwyzard commented Oct 14, 2022

type bugfix

@fwyzard
Copy link
Contributor Author

fwyzard commented Oct 14, 2022

assign hcal-dpg

@fwyzard
Copy link
Contributor Author

fwyzard commented Oct 14, 2022

enable gpu

@fwyzard
Copy link
Contributor Author

fwyzard commented Oct 14, 2022

please test

@fwyzard
Copy link
Contributor Author

fwyzard commented Oct 14, 2022

urgent

Relatively urgent, as the 12.4.x backport should be deployed online ASAP.

@cmsbuild
Copy link
Contributor

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-39738/32582

  • This PR adds an extra 24KB to repository

Code check has found code style and quality issues which could be resolved by applying following patch(s)

@fwyzard
Copy link
Contributor Author

fwyzard commented Oct 14, 2022

please test

During the conversion from SoA to legacy format, skip bad channels,
identified by a negative chi².
@fwyzard fwyzard force-pushed the fix_HCAL_GPU_rechit_producer branch from 9419cbe to ceec45f Compare October 14, 2022 18:40
@fwyzard
Copy link
Contributor Author

fwyzard commented Oct 14, 2022

please test

@fwyzard
Copy link
Contributor Author

fwyzard commented Oct 15, 2022

11634.523 should include the GPU vs CPU comparison plots

@fwyzard
Copy link
Contributor Author

fwyzard commented Oct 15, 2022

in fact, if it works we should probably replace 11634.522 with 11634.523 in the GPU tests, etc.

@cmsbuild
Copy link
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-683156/28285/summary.html
COMMIT: ceec45f
CMSSW: CMSSW_12_6_X_2022-10-15-1100/el8_amd64_gcc10
Additional Tests: GPU
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/39738/28285/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

@slava77 comparisons for the following workflows were not done due to missing matrix map:

  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-683156/2500.601_mc126X+TTBarMINIAOD12.6+NANO_mc12.6+HRV_NANO_mc

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 5 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3391158
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3391130
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 201 log files, 48 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

GPU Comparison Summary

@slava77 comparisons for the following workflows were not done due to missing matrix map:

  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-683156/11634.523_TTbar_14TeV+2021_Patatrack_HCALOnlyGPU_Validation+TTbar_14TeV_TuneCP5_GenSim+Digi+RecoNano+HARVESTNano

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 17 differences found in the comparisons
  • DQMHistoTests: Total files compared: 5
  • DQMHistoTests: Total histograms compared: 21147
  • DQMHistoTests: Total failures: 251
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 20896
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 4 files compared)
  • Checked 16 log files, 12 edm output root files, 5 DQM output files
  • TriggerResults: no differences found

@missirol
Copy link
Contributor

@cms-sw/hcal-dpg-l2 , what is your take on this PR?

It would be good to converge soon, to fix the HLT crashes reported in #39693.

@clacaputo
Copy link
Contributor

Are there plots for the GPU vs CPU comparison of the energy, chi², etc. ?

The GPU vs CPU energy comparisons show some differences in HE at low energy [1], but it could be just an effect of the different number of entries [2]. From [2], there should be also differences in HB, but they are not visible in [1]

[1] https://tinyurl.com/2ky3szdr
[2] https://tinyurl.com/2gwd4hcg

@fwyzard
Copy link
Contributor Author

fwyzard commented Oct 17, 2022

@cms-sw/hcal-dpg-l2, are there any plots that show the rechits reconstructed only on CPU or only on GPU ?

Also, did the 11634.523 workflow produce all the plots that came with 11634.522 ?

@mariadalfonso
Copy link
Contributor

@cms-sw/hcal-dpg-l2, are there any plots that show the rechits reconstructed only on CPU or only on GPU ?

Also, did the 11634.523 workflow produce all the plots that came with 11634.522 ?

Investigating now.
There are two separate issues:

  1. addressing the crash due to wrong SOI and NumTS (scope of this PR)
  2. resolve the CPU-GPU discrepancy (seen especially on HE, less in HB) seen also in some standard DQM

@mariadalfonso
Copy link
Contributor

mariadalfonso commented Oct 18, 2022

@fwyzard @clacaputo @missirol
We can merge these PRs.

I did a (standalone) comparison of the CPU-GPU reco in the MC-ttbar events wf 11634.0 (starting from the same digi).
I confirm that I get equality. There is only about a 7% reduction of entries, due to non copying those that fails the channel quality.

Screen Shot 2022-10-17 at 23 10 03

Screen Shot 2022-10-17 at 23 08 28

So in the good cases where the soi and TS are well defined (always true in MC), this PR doesn't introduce any change.
From the sw point of view this just add some protection that was found useful based on Run1-Run2 experience.

From hw post of view, wrong soi and wrong #TS should not happen and the OPS team at P5 is looking into it.
I will look separately the data, as in addition to the crash was also seen some outlier.

359699_HBHE

359998_HB

359998_HE

@clacaputo
Copy link
Contributor

+reconstruction

@missirol
Copy link
Contributor

missirol commented Oct 18, 2022

@cms-sw/hcal-dpg-l2 , just a reminder that your signature is required in this PR (I guess ORP is waiting for that).

@fwyzard
Copy link
Contributor Author

fwyzard commented Oct 18, 2022

@mariadalfonso can I ask you a couple of things? I'm sorry but I don't know where to look myself :-(

@fwyzard
Copy link
Contributor Author

fwyzard commented Oct 18, 2022

does the 11634.523 workflow produce those plots ?

maybe answering myself: I see that the JR comparison does not run on the .523 workflow, but I guess if we replace .522 with .523 in the matrix, that could be updated accordingly ?

@rappoccio
Copy link
Contributor

@mariadalfonso can you explicitly say "+1" so the bot picks up your approval?

@missirol
Copy link
Contributor

(.. I think the signature needs to come from a member of https://github.com/orgs/cms-sw/teams/hcal-dpg-l2/members .)

@wang-hui
Copy link
Contributor

+1

@cmsbuild
Copy link
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@rappoccio
Copy link
Contributor

+1

  • Once IBs finalize with this, will merge the back ports and cut new 12_4 and 12_5 releases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants