[TsosGaussianStateConversions] Require GSF states to be positive definite [12_6_X] #39873

Merged
merged 1 commit into cms-sw:master on Oct 31, 2022

swagata87
Contributor

PR description:

This PR fixes a crash in prompt reco that led to a paused job, as reported in https://cms-talk.web.cern.ch/t/logic-error-in-reco-job-for-run-360888-dataset-parkingdoublemuonlowmass2/16641 and discussed in #39570.

It was checked in 12_4_X that this patch cures the crash in low-pT electron reconstruction.

PR validation:

runTheMatrix.py -l 12434.0 ran fine.

In 12_4_X, it was checked that [Base] and [Base+thisPR] lead to the same number of reconstructed electrons, photons and low-pT electrons, with the same pT spectra. This check was made by running on 200 raw events from this file: /eos/cms/tier0/store/data/Run2022F/EGamma/RAW/v1/000/361/197/00000/76bd97fa-4ad6-4d85-b941-014e3ed27f9c.root

Tagging @francescobrivio as this week's ORM.

A backport of this PR might be necessary.

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-39873/32784

  • This PR adds an extra 12KB to repository

@cmsbuild
Contributor

A new Pull Request was created by @swagata87 (Swagata Mukherjee) for master.

It involves the following packages:

  • TrackingTools/GsfTracking (reconstruction)

@cmsbuild, @mandrenguyen, @clacaputo can you please review it and eventually sign? Thanks.
@VourMa, @bellan, @felicepantaleo, @GiacomoSguazzoni, @JanFSchulte, @rovere, @VinInn, @missirol, @ebrondol, @lecriste, @gpetruc, @mmusich, @mtosi, @dgulhan this is something you requested to watch as well.
@perrotta, @dpiparo, @rappoccio you are the release manager for this.

cms-bot commands are listed here

@francescobrivio
Contributor

assign tracking-pog

  • Since this touches tracking code

@cmsbuild
Contributor

New categories assigned: tracking-pog

@slava77,@mmusich you have been requested to review this Pull request/Issue and eventually sign? Thanks

@mmusich
Contributor

mmusich commented Oct 27, 2022

please test

@cmsbuild
Contributor

-1

Failed Tests: RelVals-INPUT
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-94fd56/28561/summary.html
COMMIT: 57a8f0a
CMSSW: CMSSW_12_6_X_2022-10-27-1100/el8_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/39873/28561/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals-INPUT

The relvals timed out after 4 hours.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 5 differences found in the comparisons
  • Reco comparison had 6 failed jobs
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3384029
  • DQMHistoTests: Total failures: 6
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3384001
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 201 log files, 48 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

@mmusich
Contributor

mmusich commented Oct 28, 2022

please test

  • let's try again

@cmsbuild
Contributor

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-94fd56/28589/summary.html
COMMIT: 57a8f0a
CMSSW: CMSSW_12_6_X_2022-10-27-2300/el8_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/39873/28589/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 6 differences found in the comparisons
  • DQMHistoTests: Total files compared: 48
  • DQMHistoTests: Total histograms compared: 3384029
  • DQMHistoTests: Total failures: 9
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3383998
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 47 files compared)
  • Checked 201 log files, 48 edm output root files, 48 DQM output files
  • TriggerResults: no differences found

@mmusich
Contributor

mmusich commented Oct 28, 2022

+1

Footnotes

  1. using the recipe at https://github.com/cms-sw/cmssw/issues/35929#issuecomment-1288800329

@francescobrivio
Contributor

urgent

  • @cms-sw/reconstruction-l2 please give high priority to the review of this PR, which is needed to fix crashes in prompt reconstruction at Tier0

@mmusich
Contributor

mmusich commented Oct 28, 2022

type egamma, tracking

@swagata87
Contributor Author

This fix is useful for HLT also, see #39570 (comment).
I expect the HLT Gsf crashes to go away or happen less frequently after this fix is integrated.

@clacaputo
Contributor

clacaputo commented Oct 31, 2022

+reconstruction

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@rappoccio
Contributor

+1

  • Urgent fix for data taking. Once this clears IBs we will merge the backport and cut a new 12_4.

@cmsbuild cmsbuild merged commit d624d70 into cms-sw:master Oct 31, 2022
@rappoccio
Contributor

Backports are in #39903 and #39904

@mmusich
Contributor

mmusich commented Nov 4, 2022

For the record, when the release containing the backport of this PR (#39903) was tested at Tier0, a job that previously threw an exception now ends in a segfault: https://cms-talk.web.cern.ch/t/replay-testing-for-cmssw-12-4-11/17062/4
FYI: @swagata87

@swagata87
Contributor Author

swagata87 commented Nov 4, 2022

Hello @mmusich

I copied the folder as indicated in the cmsTalk link you shared, /afs/cern.ch/user/c/cmst0/public/PausedJobs/Replay12_4_11/job_1175

From what is reported there, Run: 361239 Event: 453401426 crashed, right?
So I tried to run on that event only, with this configuration:

import FWCore.ParameterSet.Config as cms
from PSet import process
# Restrict the job to the single event named in the crash report
process.source.eventsToProcess = cms.untracked.VEventRange('361239:453401426-361239:453401426')

But it did not crash.
I am using CMSSW_12_4_11 with slc7_amd64_gcc10.

Am I missing something?
Can someone else check whether the crash is reproducible on another machine?

@mmusich
Contributor

mmusich commented Nov 4, 2022

Hi Swagata @swagata87,
I am trying to isolate the event that crashes the process as well. Since the process runs multi-threaded, the last event printed in the log is not necessarily the one that caused the crash. One should check which stream was processing at the time of the crash.
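
As a rough aid for this kind of search, the sketch below scans a cmsRun stdout log for "Begin processing" lines and keeps the last event each stream started, since those events are the candidates for what was in flight when the job died. It is only a sketch under stated assumptions: the regular expression is modelled on the single "Begin processing" line quoted in this thread, the function names are made up for illustration, and the first sample line uses invented values.

```python
import re

# Pattern modelled on the one "Begin processing" line quoted in this thread.
# Assumption: other lines of the same kind follow the same layout.
BEGIN_RE = re.compile(
    r"Begin processing the \d+\w* record\. "
    r"Run (\d+), Event (\d+), LumiSection (\d+) on stream (\d+)"
)


def last_event_per_stream(log_lines):
    """Return {stream: (run, event, lumi)} for the last event each stream started."""
    last = {}
    for line in log_lines:
        match = BEGIN_RE.search(line)
        if match:
            run, event, lumi, stream = (int(group) for group in match.groups())
            # Later lines overwrite earlier ones, so only the latest survives.
            last[stream] = (run, event, lumi)
    return last


def event_range(run, event):
    """Format one event as a 'run:event-run:event' range string."""
    return f"{run}:{event}-{run}:{event}"


sample = [
    # Invented line, for illustration only (not taken from the real log).
    "Begin processing the 1st record. Run 1, Event 100, LumiSection 1 on stream 1 at (time)",
    # Line quoted in this thread.
    "Begin processing the 5984th record. Run 361239, Event 453474850, "
    "LumiSection 247 on stream 0 at 04-Nov-2022 08:44:47.720 CET",
]

candidates = last_event_per_stream(sample)
for stream, (run, event, lumi) in sorted(candidates.items()):
    print(f"stream {stream}: lumi {lumi}, range {event_range(run, event)}")
```

Each printed range string has the same form as the one passed to cms.untracked.VEventRange earlier in this thread, so the candidates can be tried one at a time.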

@slava77
Contributor

slava77 commented Nov 4, 2022

The t0 logs say

%MSG-e GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  04-Nov-2022 08:44:48 CET Run: 361239 Event: 453401426
KF updated state 4 is invalid. skipping.
%MSG

for states 0-5. How many are tried? Is this a case of every state failing and the code not expecting that?

@swagata87
Contributor Author

swagata87 commented Nov 4, 2022

In the log file from T0, job_1175/job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log, all of states 0-5 failed [1].
So we may need another protection somewhere for the scenario where no state passes. But we have not located the event yet; if we can find the exact event that crashed, we can look for a fix. From the log below it looks like Event 453474850, but this event runs fine in my area.

[1]

Begin processing the 5984th record. Run 361239, Event 453474850, LumiSection 247 on stream 0 at 04-Nov-2022 08:44:47.720 CET
%MSG-e GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  04-Nov-2022 08:44:48 CET Run: 361239 Event: 453401426
KF updated state 0 is invalid. skipping.
%MSG
%MSG-e GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  04-Nov-2022 08:44:48 CET Run: 361239 Event: 453401426
KF updated state 1 is invalid. skipping.
%MSG
%MSG-e GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  04-Nov-2022 08:44:48 CET Run: 361239 Event: 453401426
KF updated state 2 is invalid. skipping.
%MSG
%MSG-e GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  04-Nov-2022 08:44:48 CET Run: 361239 Event: 453401426
KF updated state 3 is invalid. skipping.
%MSG
%MSG-e GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  04-Nov-2022 08:44:48 CET Run: 361239 Event: 453401426
KF updated state 4 is invalid. skipping.
%MSG
%MSG-e GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  04-Nov-2022 08:44:48 CET Run: 361239 Event: 453401426
KF updated state 5 is invalid. skipping.
%MSG
%MSG-e GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  04-Nov-2022 08:44:48 CET Run: 361239 Event: 453401426
KF updated state 0 is invalid. skipping.
%MSG
%MSG-e GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  04-Nov-2022 08:44:48 CET Run: 361239 Event: 453401426
KF updated state 1 is invalid. skipping.
%MSG
%MSG-e GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  04-Nov-2022 08:44:48 CET Run: 361239 Event: 453401426
KF updated state 2 is invalid. skipping.
%MSG
%MSG-e GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  04-Nov-2022 08:44:48 CET Run: 361239 Event: 453401426
KF updated state 3 is invalid. skipping.
%MSG
%MSG-e GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  04-Nov-2022 08:44:48 CET Run: 361239 Event: 453401426
KF updated state 4 is invalid. skipping.
%MSG
%MSG-e GsfMultiStateUpdator:  GsfTrackProducer:lowPtGsfEleGsfTracks  04-Nov-2022 08:44:48 CET Run: 361239 Event: 453401426
KF updated state 5 is invalid. skipping.
%MSG

A fatal system signal has occurred: segmentation violation
The following is the call stack containing the origin of the signal.
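
To make the scenario discussed here concrete, the following is a minimal Python sketch of filtering a Gaussian mixture down to components with a positive-definite covariance, and of handling the case where every component is rejected. It is not the CMSSW code (the actual change lives in the C++ package TrackingTools/GsfTracking). The function names, the (weight, covariance) representation and the Cholesky-based test are all assumptions made for illustration.

```python
import math


def is_positive_definite(matrix):
    """Check symmetry, then attempt a Cholesky factorisation.

    The factorisation succeeds only for symmetric positive-definite input.
    """
    n = len(matrix)
    for i in range(n):
        for j in range(i):
            if not math.isclose(matrix[i][j], matrix[j][i],
                                rel_tol=1e-9, abs_tol=1e-12):
                return False
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if not pivot > 0.0:  # written this way so NaN is rejected too
                    return False
                lower[i][i] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return True


def select_valid_components(components):
    """Keep (weight, covariance) pairs that are usable and renormalise weights.

    Returns None when nothing survives, so the caller has to handle that case
    explicitly instead of working with an empty mixture.
    """
    kept = [(w, cov) for w, cov in components
            if w > 0.0 and is_positive_definite(cov)]
    if not kept:
        return None
    total = sum(w for w, _ in kept)
    return [(w / total, cov) for w, cov in kept]


# Toy 2x2 covariances, invented for illustration.
good = [[2.0, 0.5], [0.5, 1.0]]
bad = [[1.0, 2.0], [2.0, 1.0]]  # symmetric but not positive definite

print(select_valid_components([(0.3, good), (0.7, bad)]))
print(select_valid_components([(0.5, bad), (0.5, bad)]))
```

Returning an explicit "nothing left" value forces the calling code to decide what to do when all components fail, which is the situation the log above suggests for states 0-5.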

@mmusich
Contributor

mmusich commented Nov 4, 2022

The follow-up issue at #39987 is perhaps a better place to continue the discussion.
