Instabilities in 11634.911 (DD4Hep) workflow comparisons #35109

Open
makortel opened this issue Sep 1, 2021 · 57 comments

Comments

@makortel
Contributor

makortel commented Sep 1, 2021

We've observed differences in the DD4Hep workflow 11634.911 comparisons in tests of a few PRs that should not affect results of the DD4Hep workflow. This issue is to collect pointers to those comparisons.

@cmsbuild
Contributor

cmsbuild commented Sep 1, 2021

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Contributor Author

makortel commented Sep 1, 2021

assign geometry

@cmsbuild
Contributor

cmsbuild commented Sep 1, 2021

New categories assigned: geometry

@Dr15Jones,@cvuosalo,@civanch,@ianna,@mdhildreth,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Contributor Author

makortel commented Sep 1, 2021

Observed in #35068 (comment) and #34995 (comment)

@makortel
Contributor Author

makortel commented Sep 9, 2021

@civanch
Contributor

civanch commented Sep 13, 2021

@cvuosalo, is the problem back, or is it another one?

@cvuosalo
Contributor

The instability appears to be random and rare. It is strange that wf 11634.912 does not show it. The difference between the two workflows is that 11634.911 runs the algorithms and calculates the reco geometry, while 11634.912 reads the already calculated algorithm results and reco geometry out of the DB.

@cvuosalo
Contributor

I ran workflow 11634.911 thirty times in CMSSW_12_1_X_2021-09-20-1100 with identical results each time. It appears the instability has gone away.
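A repeated-run check of this kind can be sketched as follows. This is only a minimal illustration, not the procedure actually used above: the directory layout, the output file name, and the helper names `file_digest` and `group_runs_by_output` are all hypothetical. It groups run directories by the SHA-256 digest of one output file, so a fully reproducible workflow yields a single group.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def group_runs_by_output(run_dirs, filename):
    """Group run directories by the digest of `filename` inside each one.

    `run_dirs` and `filename` are hypothetical inputs: one directory per
    repeated run, each containing the same-named output file.
    """
    groups = {}
    for run_dir in run_dirs:
        digest = file_digest(Path(run_dir) / filename)
        groups.setdefault(digest, []).append(str(run_dir))
    return groups


if __name__ == "__main__":
    # Hypothetical layout: runs/run00 ... runs/run29, each with summary.txt.
    runs = sorted(Path("runs").glob("run*"))
    groups = group_runs_by_output(runs, "summary.txt")
    print(f"{len(runs)} runs fall into {len(groups)} distinct output group(s)")
```

A byte-level digest is only meaningful for outputs that carry no embedded timestamps or unique identifiers; binary outputs such as ROOT files may need a content-level comparison instead, as the DQM comparisons in the PR tests do.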

@cmsbuild
Contributor

This issue is fully signed and ready to be closed.

@makortel
Contributor Author

On the other hand, the comparison differences have appeared rather rarely.

@makortel
Contributor Author

Here is another instance: #36222 (comment).

Could we re-open the issue (and keep it open for a longer time)?

@civanch
Contributor

civanch commented Nov 24, 2021

@makortel, I cannot; maybe you can reopen it?

@makortel
Contributor Author

I don't have the power. I'm not sure whether @qliphy / @perrotta have it, or if we need @smuzaffar.

@perrotta perrotta reopened this Nov 24, 2021
@perrotta
Contributor

Wow: I have the power!

@cms-sw cms-sw deleted a comment from cvuosalo Nov 26, 2021
@makortel
Contributor Author

makortel commented Apr 10, 2023

Let's record here that the tests in #41273 (comment) showed 5932 differences in the DQM comparisons of 11634.911 (that being the only phase Run-{1,2,3} workflow showing differences). Running the tests a second time did not show any differences. The differences seemed to be across the board (i.e. not localized to a few subsystems).

@makortel
Contributor Author

makortel commented May 4, 2023

Let's record here that the tests in #41522 (comment) showed 4822 differences in the DQM comparisons of 23634.911 across the board.

@missirol
Contributor

missirol commented May 4, 2023

For the record, something similar happened in #41533: 47459 differences in the DQM comparisons of wf 23634.911.

@makortel
Contributor Author

makortel commented May 4, 2023

@cms-sw/geometry-l2 Should we open a new issue to record these instabilities or reopen this one?

@missirol
Contributor

missirol commented May 4, 2023

> For the record, something similar happened in #41533: 47459 differences in the DQM comparisons of wf 23634.911.

It is strange to me that #41541 (comment) reports exactly the same number: 47459 differences in the DQM comparisons of wf 23634.911. I haven't often seen this kind of difference before, but I have now seen it twice today.

@makortel
Contributor Author

makortel commented May 4, 2023

And another one in #41532 (comment), 4822 differences in workflow 23634.911.

@perrotta
Contributor

perrotta commented May 5, 2023

One more in #41504 (comment)

@makortel
Contributor Author

makortel commented Jun 2, 2023

Another one in #41852 (comment), 5582 differences in workflow 11634.911

@makortel
Contributor Author

makortel commented Jun 2, 2023

(reopening the issue)

@makortel
Contributor Author

Another one in #43041 (comment), 6123 differences in workflow 11634.911. The CPU model was the same (Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz) for both the reference and the PR test.
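For bookkeeping, the difference counts recorded in this thread up to this point can be tallied with a short script. This is only a sketch for spotting repeated (workflow, count) pairs; the helper name `repeated_signatures` is hypothetical, and the occurrence in #41504 is left out because no count was reported for it.

```python
from collections import defaultdict

# (PR number, workflow, number of DQM differences), as reported in this thread.
REPORTS = [
    (41273, "11634.911", 5932),
    (41522, "23634.911", 4822),
    (41533, "23634.911", 47459),
    (41541, "23634.911", 47459),
    (41532, "23634.911", 4822),
    (41852, "11634.911", 5582),
    (43041, "11634.911", 6123),
]


def repeated_signatures(reports):
    """Return the (workflow, count) pairs seen in more than one PR test."""
    by_signature = defaultdict(list)
    for pr, workflow, n_diffs in reports:
        by_signature[(workflow, n_diffs)].append(pr)
    return {sig: prs for sig, prs in by_signature.items() if len(prs) > 1}


if __name__ == "__main__":
    for (workflow, n_diffs), prs in repeated_signatures(REPORTS).items():
        print(f"wf {workflow}: {n_diffs} differences seen in PRs {prs}")
```

With the numbers above, the pairs that repeat are 4822 and 47459 differences for 23634.911, each seen in two PR tests; whether that repetition is meaningful is not established here.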

@makortel
Contributor Author

To note here that #43439 is removing 11634.911 from the short matrix, after which we would not see these instabilities anymore in PR tests.

@AdrianoDee
Contributor

AdrianoDee commented Dec 11, 2023

> To note here that #43439 is removing 11634.911 from the short matrix, after which we would not see these instabilities anymore in PR tests.

Let me know if you think it is preferable to keep it just to have this "constant reminder" of the issue or if it is something that we can leave to IB tests.

@civanch
Contributor

civanch commented Dec 11, 2023

From my point of view, keeping this issue does not help much, even though we likely have a problem with 11634.911, which is being taken out of everyday testing.

@makortel
Contributor Author

> To note here that #43439 is removing 11634.911 from the short matrix, after which we would not see these instabilities anymore in PR tests.

> Let me know if you think it is preferable to keep it just to have this "constant reminder" of the issue or if it is something that we can leave to IB tests.

Good question. PR tests (including the short matrix) should be about ensuring the PRs behave as expected, and therefore I think using PR tests to stress-test reproducibility is likely not the best way.

If there is no other use for 11634.911 in the short matrix (@cms-sw/geometry-l2 could you comment?), I'd be in favor of dropping 11634.911 from the short matrix. Unfortunately, IBs themselves don't provide any facilities for inspecting workflow results. @smuzaffar Maybe we should think about something here, at least for select workflows? (Not really optimal, but maybe better than (mis)using PR tests?)

@makortel
Contributor Author

Just to note that, in the end, #43439 kept 11634.911.

@srimanob
Contributor

Hi @makortel,
I think this issue is solved; should we close it? Thanks.

@makortel
Contributor Author

Do we know how the issue got resolved? Or is it just not occurring anymore?

@srimanob
Contributor

The workflow in question is Run-3, right? As DD4hep is run by default in Run-3 workflows (.911 = .0 for Run-3), I think we don't see any instabilities anymore. Am I missing a reason why we should keep investigating the Run-3 DD4hep workflow?

@makortel
Contributor Author

From the history the frequency seems to have been one occurrence every 1-4 months (although I suspect not all L2s report those).

Earlier comments suggest that .911 and .0 are different, by .911 reading the geometry from XML and .0 from the DB.

@srimanob
Contributor

> From the history the frequency seems to have been one occurrence every 1-4 months (although I suspect not all L2s report those).

> Earlier comments suggest that .911 and .0 are different, by .911 reading the geometry from XML and .0 from the DB.

Ah, you are right: .911 is the XML version, and .912 (which is now the .0 default) is the DB version. Do we need to monitor XML when we use the DB? I mean, we don't run Run-1 or Run-2 XML (DDD) anymore, so we never know whether there is an issue there.
