Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug fix, start Services in runOrQueueEventSetupForInstanceAsync #39644

Closed
wants to merge 1 commit into from

Conversation

wddgit
Copy link
Contributor

@wddgit wddgit commented Oct 5, 2022

PR description:

Bug fix related to #38801. The problem was noticed in the IBs after that pull request was merged. It didn't show up in earlier tests. This should fix the problem with the message "no ServiceRegistry has been set for this thread".

Note that a second problem with the symptom of a seg fault also showed up rarely after that PR was merged. This probably will not fix that. We are still working on a fix for that one.

PR validation:

This is a non-reproducible problem that does not show up every time so I am not 100% sure this will fix it. We should merge it and see what happens in the IBs.

@cmsbuild
Copy link
Contributor

cmsbuild commented Oct 5, 2022

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-39644/32445

@wddgit
Copy link
Contributor Author

wddgit commented Oct 5, 2022

please test

@cmsbuild
Copy link
Contributor

cmsbuild commented Oct 5, 2022

A new Pull Request was created by @wddgit (W. David Dagenhart) for master.

It involves the following packages:

  • FWCore/Framework (core)

@makortel, @smuzaffar, @Dr15Jones can you please review it and eventually sign? Thanks.
@makortel, @missirol this is something you requested to watch as well.
@perrotta, @dpiparo, @rappoccio you are the release manager for this.

cms-bot commands are listed here

@makortel
Copy link
Contributor

makortel commented Oct 6, 2022

type bug-fix

@makortel
Copy link
Contributor

makortel commented Oct 6, 2022

urgent

@makortel
Copy link
Contributor

makortel commented Oct 6, 2022

+1

(assuming tests succeed)

@cmsbuild
Copy link
Contributor

cmsbuild commented Oct 6, 2022

This pull request is fully signed and it will be integrated in one of the next master IBs after it passes the integration tests. This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@perrotta
Copy link
Contributor

perrotta commented Oct 6, 2022

ping bot
(test did not start yet)

@perrotta
Copy link
Contributor

perrotta commented Oct 6, 2022

@smuzaffar @aandvalenzuela I see that tests are stuck in quite several PRs
I've tried to abort some of them, but still the situation does not improve for the other ones
It would be important to have at least this PR tested, because it blocks the building of pre3.
It would be possible to unblock, hopefully all tests, if not at least this one and #39642?
Thank you so much for all what you can do!

@aandvalenzuela
Copy link
Contributor

Hi @perrotta,

We realized this morning about the PR test delays. @iarspider is on shift and he is trying to solve this issue with IT. PRs are stuck because of a failure in cvmfs.
I will retry the two PRs you are mentioning giving them priority, but I will probably need to abort other testings.

Sorry for the inconvenience,
Andrea.

@aandvalenzuela
Copy link
Contributor

please test

@wddgit
Copy link
Contributor Author

wddgit commented Oct 6, 2022

Somehow this is triggering one Framework unit test to fail that was not failing before. I do not think it has anything direct to do with this bug fix, which I still think is correct. The unit test is TestFWCoreFrameworkTransitions

At first I thought it was just the other seg fault we already saw, but possibly it is not. I am looking at that now also. I suspect this might just be a problem with the unit test.

@perrotta
Copy link
Contributor

perrotta commented Oct 6, 2022

@wddgit @makortel we have to close 12_6_0_pre3.
Would you suggest to merge this one as soon as test finishes (there are no unit tests failing here, apparently...) and keep investigating, with tomorrow morning (CET) as a deadline; or merge the revert PR #39642 instead, and keep investigating with calm before in the two and half week before pre4?

@makortel
Copy link
Contributor

makortel commented Oct 6, 2022

@perrotta I'd merge the revert PR #39642 for 12_6_0_pre3, and we continue to investigate the problem(s) with calm. We still don't have a clue on the EventProcessor::readEvent() segfault.

Would it be feasible to revert the #39642 (i.e. add the concurrent run PR again) after the pre3 has been built? The problems seem rare-enough to me to not cause problems for PR testing.

Apologies for the inconvenience.

@wddgit
Copy link
Contributor Author

wddgit commented Oct 6, 2022

Reverting is probably a good idea if the deadline is tomorrow morning. The odds this will be resolved by then are small. One possibility is putting this in the DEVEL release first. Are we testing anything in DEVEL now? Does it run full IBs with all the same statistics and tests?

@perrotta
Copy link
Contributor

perrotta commented Oct 6, 2022

Thank you Matti!
We'll merge #39642 then, and build pre3 afterwards
As soon as pre3 is out, we can make the "revert of the revert" PR, and merge it, so that you have time to investigate in the IBs

@cmsbuild
Copy link
Contributor

cmsbuild commented Oct 6, 2022

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b4faef/28060/summary.html
COMMIT: 5684438
CMSSW: CMSSW_12_6_X_2022-10-05-2300/el8_amd64_gcc10
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/39644/28060/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

@slava77 comparisons for the following workflows were not done due to missing matrix map:

  • /data/cmsbld/jenkins/workspace/compare-root-files-short-matrix/data/PR-b4faef/41834.0_TTbar_14TeV+2026D94+TTbar_14TeV_TuneCP5_GenSimHLBeamSpot14+DigiTrigger+RecoGlobal+HARVESTGlobal

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3391103
  • DQMHistoTests: Total failures: 3
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3391078
  • DQMHistoTests: Total skipped: 22
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 204 log files, 49 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

@makortel
Copy link
Contributor

makortel commented Oct 6, 2022

Tests look good

@wddgit
Copy link
Contributor Author

wddgit commented Oct 6, 2022

Curious that they all passed. In my local working area, I get a consistent seg fault with the following test (and only that test):

cmsRun FWCore/Framework/test/transition_test_cfg.py 6

It passes if I add the Tracer. Probably I should not discuss this on this PR anymore. I think it is unrelated to this PR and bugfix. I'm not sure if it is related to the other unrelated segfault. Nothing looks obviously wrong with the unit test itself.

@rappoccio
Copy link
Contributor

@wddgit and @makortel would you then like to merge this one for pre3 since these are now passing?

@makortel
Copy link
Contributor

makortel commented Oct 6, 2022

@rappoccio This fix is needed anyway (#38801 just made the problem apparent). But it is not sufficient fix all the problems #38801 showed.

If you indeed revert #38801 (i.e. merge #39642) for pre3, this PR is not strictly necessary (seems we either don't this code path without #38801, or it is extremely unlikely), but it shouldn't hurt either (famous last words, in a sense it could be safer to build pre3 without this PR if you proceed with #39642).

(if you would consider keeping #38801 for pre3, then this PR should be merged as well)

@wddgit
Copy link
Contributor Author

wddgit commented Oct 6, 2022

This bug fix PR will not work without #38801. If you want me to fix the bug without #38801 then I would need to revise it. I agree the bug existed previously, but the list of circumstances which have to exist to hit the bug is long and hard to create. I think it literally has never happened before concurrent runs. I think you would need IOVs that change between beginRun and the following beginLumi in an ESSource that does not support concurrent IOVs. That never happens. I would not bother to merge this fix without #38801.

@rappoccio
Copy link
Contributor

OK let's just merge the other then, and not do this one, see comment here:

#39642 (comment)

@perrotta
Copy link
Contributor

perrotta commented Oct 6, 2022

@wddgit maybe tomorrow you could combine this PR with the single commit of #38801, so that you have all in one.
Otherwise we will revert #38801 and you rebase this after that merge

@wddgit
Copy link
Contributor Author

wddgit commented Oct 6, 2022

@perrotta I will do as you suggested and also combine in the second bug fix as soon as it is ready. I'll submit a new PR containing all that.

@wddgit wddgit closed this Oct 6, 2022
@wddgit wddgit deleted the fixNoServiceRegistrySet branch January 13, 2023 17:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants