
Workflow 11634.911 appearing broken in several PR tests #32963

Closed · qliphy opened this issue Feb 22, 2021 · 31 comments

Comments

@qliphy (Contributor) commented Feb 22, 2021

see e.g. #32900 #32932 and #32956 where many differences appear.

Although #32956 is expected to change DD4hep indeed and thus can lead to difference, but it should not affect tracks... (perhaps it is due to the change of the Geant4 history?)

@cmsbuild (Contributor)

A new Issue was created by @qliphy Qiang Li.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@qliphy (Contributor, Author) commented Feb 22, 2021

11634.911 was added to the short matrix by #32857 and cms-sw/cms-bot#1490

assign @cms-sw/pdmv-l2 @cms-sw/geometry-l2 @cms-sw/simulation-l2

@makortel (Contributor) commented Feb 22, 2021

assign pdmv, geometry, simulation

@cmsbuild (Contributor)

New categories assigned: geometry, pdmv, simulation

@Dr15Jones, @cvuosalo, @mdhildreth, @chayanit, @wajidalikhan, @makortel, @jordan-martins, @ianna, @civanch you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cvuosalo (Contributor) commented Feb 22, 2021

@qliphy I don't think there is any big problem. In #32900, the matrix map was missing, but it has since been added. In #32932, the XML files were changed, which may have triggered a new random seed and caused statistical fluctuations in the 11634.911 workflow. In #32956, the intent of the PR was to change the results of the workflow, since there was a bug. Again, the XML files changed, which triggered a new random seed, so there are statistical fluctuations along with the desired changes.

@cvuosalo (Contributor)

I am seeing other PRs where 11634.911 is showing differences even though the PR has no effect on any workflows.
@smuzaffar Could there be a configuration problem with 11634.911 so that it incorrectly gets a new random seed for every PR test?

@cvuosalo (Contributor)

@smuzaffar The initialSeed values in the TTbar_14TeV_TuneCP5_cfi_GEN_SIM.py config are the same between IB and PR tests. However, the IB test is using the ROOT unit convention, and the PR test is using the Geant4 units convention. We have just gone through the update of the DD4hep version. Could that trigger a new random seed?
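
For context: in a CMSSW workflow, the seeds being compared live in the RandomNumberGeneratorService block of the configuration. A minimal sketch of what such a block looks like (the module names, seed values, and engine choices below are illustrative, not the actual contents of TTbar_14TeV_TuneCP5_cfi_GEN_SIM.py):

```python
import FWCore.ParameterSet.Config as cms

process = cms.Process("SIM")

# Sketch of the RandomNumberGeneratorService section of a GEN-SIM config.
# Seed values and engine names here are illustrative only.
process.RandomNumberGeneratorService = cms.Service("RandomNumberGeneratorService",
    generator = cms.PSet(
        initialSeed = cms.untracked.uint32(123456789),
        engineName = cms.untracked.string('HepJamesRandom')
    ),
    g4SimHits = cms.PSet(
        initialSeed = cms.untracked.uint32(9876),
        engineName = cms.untracked.string('MixMaxRng')
    )
)
```

If the initialSeed values match between the IB and PR configs, the seeding itself is identical, and any divergence has to enter downstream of the seed.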

@smuzaffar (Contributor) commented Feb 22, 2021

I cannot tell if that will trigger a new random seed or not, but for sure mixing units is not good.
By the way, we now have a newer DD4hep in IBs ( https://github.com/cms-sw/cmsdist/pull/6612/files ), so the IB should also use G4 units. I would suggest triggering PR tests to see if 11634.911 still shows comparison differences.
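
To illustrate why mixing the two conventions is dangerous (a generic numeric sketch, not actual DD4hep code): ROOT/TGeo expresses lengths in cm, while Geant4/CLHEP expresses them in mm, so the same bare number denotes different physical lengths depending on which convention the reader assumes.

```python
# ROOT/TGeo unit convention: cm == 1.0 (mm == 0.1).
# Geant4/CLHEP unit convention: mm == 1.0 (cm == 10.0).
ROOT_cm, G4_cm = 1.0, 10.0

raw_length = 2.5  # a bare number taken from a geometry description

print(raw_length / ROOT_cm, "cm if interpreted with ROOT units")    # 2.5 cm
print(raw_length / G4_cm, "cm if interpreted with Geant4 units")    # 0.25 cm
```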

@qliphy (Contributor, Author) commented Feb 23, 2021

> I cannot tell if that will trigger a new random seed or not, but for sure mixing units is not good.
> By the way, we now have a newer DD4hep in IBs ( https://github.com/cms-sw/cmsdist/pull/6612/files ), so the IB should also use G4 units. I would suggest triggering PR tests to see if 11634.911 still shows comparison differences.

I am re-triggering #32966 to see whether there is still difference from 11634.911.

@qliphy (Contributor, Author) commented Feb 23, 2021

Re-triggering the #32966 PR tests shows no differences in 11634.911.
So it seems this issue can be closed, to be confirmed at today's ORP.

@qliphy (Contributor, Author) commented Feb 23, 2021

Closed as discussed at today's ORP.

@qliphy closed this as completed Feb 23, 2021
@cvuosalo (Contributor)

+1

@makortel (Contributor)

The test in #30949 (comment) shows many differences in 11634.911, yet the PR should have no impact on simulation.

@cvuosalo (Contributor)

@makortel @smuzaffar Workflow 11634.911 performs GEN-SIM every time. If there is a change in the random seed, the simulation will be slightly different due to statistical fluctuations. I am not sure what triggers a new random seed.
For the PR tests, what is the source of the baseline for comparison? Is it re-generated for each PR test? If it is an old baseline, that could also be a cause of the differences.

@makortel (Contributor) commented Feb 26, 2021

The baseline comes from the IB the PR is tested against (so changes every 12 hours or so). The PR test setup is such that the random number seeds are (or should be) the same every time. All the other MC workflows redo the GEN-SIM every time as well, so an infrastructure problem should be visible in other workflows as well.

@cvuosalo (Contributor)

@makortel With the baseline coming from the IB, is the PR tested with IB + current PR, or is it tested with IB + latest merged PRs + current PR?

@makortel (Contributor)

My recollection is that there may be other recently merged PRs included in the tests, but those are shown in the test result message and/or page.

@cvuosalo (Contributor)

Workflow 11634.911 is different from other MC workflows in that it uses XML files for the geometry, rather than getting the geometry from the DB like other workflows. This makes it sensitive to any changes in the geometry XML files or in DD4hep. If there are such changes between the PR and the IB, then the simulation geometry will be (slightly) different, and the comparison results will show statistical fluctuations. It seems we see these differences even more often than there have been actual changes in geometry, and I don't know why that is so.

@makortel (Contributor) commented Mar 3, 2021

Here is another one #32804 (comment). No changes that should impact random number generation sequence, no other PRs being tested, and 11634.911 showing differences.

@cvuosalo (Contributor) commented Mar 3, 2021

@makortel #32804 changed tbb usage, which may have affected DD4hep. If the geometry is re-calculated and is no longer bit-wise identical to the previous geometry, that could change the simulation history and introduce statistical fluctuations. None of these changes would be significant.
I am thinking this geometry re-calculation may be the reason workflow 11634.911 is so sensitive.

@Dr15Jones (Contributor)

If DD4hep uses TBB then that implies it does stuff concurrently. Maybe there is an 'order of operation' problem which changes the geometry depending on which parts of the calculation finish first? If so, that would be very bad.
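
A concrete illustration of how an order-of-operation difference can change results at the bit level (a generic sketch, not DD4hep code): floating-point addition is not associative, so if concurrent tasks combine partial results in whatever order they happen to finish, the final value can differ between runs.

```python
# Floating-point addition is not associative: the grouping (i.e. the order
# in which partial results are combined) changes the last bits of the sum.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))  # False
print((a + b) + c, a + (b + c))    # 0.6000000000000001 0.6
```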

@makortel (Contributor) commented Mar 3, 2021

By a quick git grep I found these uses of TBB concurrency in DD4hep:
https://github.com/AIDASoft/DD4hep/blob/57bdfda84c7d6447ba070bcaebf0bb6f278ec3c4/DDDigi/src/DigiKernel.cpp#L274-L277
https://github.com/AIDASoft/DD4hep/blob/57bdfda84c7d6447ba070bcaebf0bb6f278ec3c4/DDDigi/src/DigiKernel.cpp#L349-L352
Do we use DigiKernel in some way? (I'm confused by all the "event processing" printouts in that class)

@cvuosalo (Contributor) commented Mar 3, 2021

There's very little concurrent operation, if any, in DD4hep. In the past we had issues integrating DD4hep that were related to TBB libraries, so that is why I suggest there may be some dependency on TBB. I should ask the DD4hep team exactly what level of concurrency DD4hep supports.
My thought is that somewhere in the process of re-compiling, re-optimizing, and re-linking with possibly different libraries, re-calculating the thousands upon thousands of floating-point numbers in the geometry could slightly change numerical values, causing the simulation history to change.

@davidlange6 (Contributor) commented Mar 3, 2021 via email

@Dr15Jones (Contributor)

> My thought is that somewhere in the process of re-compiling, re-optimizing, and re-linking with possibly different libraries, re-calculating the thousands upon thousands of floating-point numbers in the geometry could slightly change numerical values, causing the simulation history to change.

The compilers and linkers generate extremely reproducible results. Plus, what you describe is happening to all the other code in CMSSW, and that gives consistent results. One thing that could cause such issues would be uninitialized values, which would be affected by whatever was last on the stack or in that heap location; those would change when threads are used.
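
A Python analogue of the failure mode described above (numpy's empty() allocates without initializing, much like reading an uninitialized C++ member): the observed value depends on whatever bytes happened to be left in that memory, so it can change from run to run and with threading.

```python
import numpy as np

# np.empty allocates memory without initializing it, so the contents are
# whatever bytes were already in that heap region -- analogous to reading
# an uninitialized C++ data member. The printout below can change from
# run to run, which is exactly the kind of hidden state that makes
# results irreproducible.
x = np.empty(4)
print(x)  # garbage values; they may even look plausible by accident
```

For the C++ case, valgrind's memcheck with --track-origins=yes is the usual way to trace such a read back to its allocation.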

@qliphy reopened this Mar 4, 2021
@davidlange6 (Contributor)

Running valgrind might be of use to find memory issues. As @Dr15Jones suggests, that is often the reason for things like re-compilations changing results.

@cvuosalo (Contributor) commented Mar 5, 2021

I ran valgrind on DD4hep simulation. There are a lot of complaints about TStorage::UpdateIsOnHeap() in ROOT, but nothing I could find in CMSSW itself. I'll keep investigating.

@davidlange6 (Contributor) commented Mar 8, 2021 via email

@cvuosalo (Contributor) commented Mar 9, 2021

During simulation, one can get a debug print-out of the random seeds. From multiple executions of DD4hep simulation, I found that each execution showed one of two sequences of random seeds, with the alternation between the two sequences seemingly random from execution to execution. These tests show that DD4hep simulation does not have completely different results each time, but rather alternates between two possible simulation histories. Also, the alternation does not depend upon recompilation -- the same executable can produce alternating results. I tested several IBs, and I found some IBs seemed to produce more stable results than others for unknown reasons.
These results imply to me an uninitialized variable that can randomly take on one of two values each time simulation is run. However, I could find no report of uninitialized variables from the static analyzer or from valgrind, so I'm not sure how else to look for it.

@qliphy (Contributor, Author) commented Mar 11, 2021

Closed by #33141.
