Workflow 11634.911 appearing broken in several PR tests #32963
Comments
A new Issue was created by @qliphy Qiang Li. @Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
11634.911 was added to the short matrix by #32857 and cms-sw/cms-bot#1490 assign @cms-sw/pdmv-l2 @cms-sw/geometry-l2 @cms-sw/simulation-l2 |
assign pdmv, geometry, simulation |
New categories assigned: geometry,pdmv,simulation @Dr15Jones,@cvuosalo,@mdhildreth,@mdhildreth,@chayanit,@wajidalikhan,@makortel,@jordan-martins,@ianna,@civanch,@civanch you have been requested to review this Pull request/Issue and eventually sign? Thanks |
@qliphy I don't think there is any big problem. In #32900, the matrix map was missing, but it has since been added. In #32932, the XML files were changed, which may have triggered a new random seed and caused statistical fluctuations in the 11634.911 workflow. In #32956, the intent of the PR was to change the results of the workflow since there was a bug. Again, the XML files changed, which triggered a new random seed, so there are statistical fluctuations along with the desired changes. |
I am seeing other PRs where 11634.911 is showing differences even though the PR has no effect on any workflows. |
@cvuosalo, I have no idea. Can you check the RelVal configuration/output for the following to see if you can find any differences?
|
@smuzaffar The |
I cannot tell whether that will trigger a new random seed or not, but mixing units is certainly not good. |
I am re-triggering #32966 to see whether there is still difference from 11634.911. |
Re-triggering #32966 PR tests shows no difference from 11634.911. |
Closed as discussed at today's ORP. |
+1 |
Test in #30949 (comment) shows many differences in 11634.911, and the PR should have no impact on simulation. |
@makortel @smuzaffar Workflow 11634.911 performs GEN-SIM every time. If there is a change in the random seed, the simulation will be slightly different due to statistical fluctuations. I am not sure what triggers a new random seed. |
The baseline comes from the IB the PR is tested against (so changes every 12 hours or so). The PR test setup is such that the random number seeds are (or should be) the same every time. All the other MC workflows redo the GEN-SIM every time as well, so an infrastructure problem should be visible in other workflows as well. |
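As a minimal sketch of the reproducibility assumption stated above (same seed in the baseline and in the PR test, so same result), here is a toy stand-in for a GEN-SIM step. The function `simulate` and the seed values are hypothetical illustrations, not part of CMSSW or the PR test setup.

```python
import random


def simulate(seed, n_events=5):
    """Toy stand-in for a GEN-SIM step: one pseudo-random number per event."""
    rng = random.Random(seed)  # private generator, independent of global state
    return [rng.random() for _ in range(n_events)]


# Hypothetical seeds: the baseline and the PR test are meant to share one.
baseline = simulate(seed=12345)
pr_test = simulate(seed=12345)
reseeded = simulate(seed=12346)

same_seed_matches = baseline == pr_test        # identical seed, identical output
different_seed_matches = baseline == reseeded  # a new seed changes every event

print(same_seed_matches, different_seed_matches)
```

If the seeds really are fixed, any difference against the baseline has to come from something else that feeds into the simulation, which is what the rest of this thread tries to pin down.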
@makortel With the baseline coming from the IB, is the PR tested with IB + current PR, or is it tested with IB + latest merged PRs + current PR? |
My recollection is that there may be other, recently merged, PRs included in the tests, but those are shown in the test result message and/or page. |
Workflow 11634.911 is different from other MC workflows in that it uses XML files for the geometry, rather than getting geometry from the DB like other workflows. This fact makes it sensitive to any changes in the geometry XML files or in DD4hep. If there are such changes between the PR and the IB, then the simulation geometry will be different (slightly), and the comparison results will show statistical fluctuations. It seems we see these differences even more often than there have been actual changes in geometry, and I don't know why that is so. |
Here is another one #32804 (comment). No changes that should impact random number generation sequence, no other PRs being tested, and 11634.911 showing differences. |
@makortel #32804 changed tbb usage, which may have affected DD4hep. If the geometry is re-calculated and is no longer bit-wise identical to the previous geometry, that could change the simulation history and introduce statistical fluctuations. None of these changes would be significant. |
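To illustrate how a geometry that is not bit-wise identical could change the simulation history even with the same seed, here is a toy sketch. Everything in it is an assumption made for illustration: `toy_history`, the single boundary, and the "extra random number on crossing" rule are not how Geant4 or DD4hep work in detail. The boundary is deliberately placed at the first step length so that a change in the last bits flips the decision.

```python
import random


def toy_history(boundary, seed=2021, n_draws=5):
    """Toy transport: the first step either stays inside a volume or leaves it."""
    rng = random.Random(seed)
    first_step = rng.random()
    crossed = first_step > boundary
    if crossed:
        rng.random()  # hypothetical boundary process consumes one extra number
    # Everything after the first step uses whatever the generator yields next.
    return crossed, [rng.random() for _ in range(n_draws)]


# Nominal boundary sits exactly at the first step length (contrived on purpose).
nominal = random.Random(2021).random()
# Same boundary after a relative change of about 1e-15, i.e. not bit-wise identical.
shifted = nominal * (1.0 - 1e-15)

crossed_nominal, draws_nominal = toy_history(nominal)
crossed_shifted, draws_shifted = toy_history(shifted)

print(crossed_nominal, crossed_shifted)
print(draws_nominal != draws_shifted)
```

The seed is the same in both calls, yet the later draws are offset by one position, so every subsequent "event" differs. The two outputs are statistically equivalent, but a bin-by-bin comparison would flag them.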
If DD4hep uses TBB then that implies it does stuff concurrently. Maybe there is an 'order of operation' problem which changes the geometry depending on which parts of the calculation finish first? If so, that would be very bad. |
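For reference, the kind of "order of operation" effect meant here can be shown with plain floating-point addition, which is not associative. This is a generic sketch and says nothing about whether DD4hep actually does this; the values are arbitrary.

```python
values = [0.1, 0.2, 0.3]

# Same three numbers, combined in two different orders, as could happen if
# partial results from concurrent tasks were merged in completion order.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

print(left_to_right == right_to_left)
print(repr(left_to_right), repr(right_to_left))
```

The two sums differ in the last bit, which would be enough to make a derived geometry quantity non-reproducible from run to run.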
By quick |
There's very little concurrent operation, if any, in DD4hep. In the past we had issues integrating DD4hep that were related to tbb libraries, so that is why I suggest there may be some dependency on tbb. I should ask the DD4hep team exactly what level of concurrency DD4hep supports. |
I agree - any influence of TBB on the geometry is a sign of bugs in the geometry code.
|
The compilers and linkers generate extremely reproducible results. Plus, what you describe is happening to all the other code in CMSSW and that gives consistent results. One thing that could cause such issues would be uninitialized values, which would be affected by whatever was last on the stack or in that heap location, and those would change when threads are used. |
Running valgrind might be of use to find memory issues. As @Dr15Jones suggests that is often the reason for things like re-compilations to change results. |
I ran valgrind on DD4hep simulation. There are a lot of complaints about TStorage::UpdateIsOnHeap() in ROOT, but nothing I could find in CMSSW itself. I'll keep investigating. |
Since we see this again today - I’m wondering if the issue is rather something that changes when dd4hep itself is rebuilt? [I don’t see anything obvious in the cmake files]
|
During simulation, one can get a debug print-out of the random seeds. From multiple executions of DD4hep simulation, I found that each execution showed one of two sequences of random seeds, with the alternation between the two sequences seemingly random from execution to execution. These tests show that DD4hep simulation does not have completely different results each time, but rather alternates between two possible simulation histories. Also, the alternation does not depend upon recompilation -- the same executable can produce alternating results. I tested several IBs, and I found some IBs seemed to produce more stable results than others for unknown reasons. |
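A check like the one described above can be sketched as follows: collect the seed print-out of each execution and count how many distinct sequences appear. The seed values below are made-up placeholders, not the real debug output, and `runs` stands in for however the print-outs are collected.

```python
from collections import Counter

# Placeholder seed sequences, one list per execution of the same executable.
# These numbers are invented for illustration only.
runs = [
    [101, 202, 303],
    [111, 222, 333],
    [101, 202, 303],
    [101, 202, 303],
    [111, 222, 333],
]

# Use the whole sequence as the fingerprint of a run's simulation history.
counts = Counter(tuple(seeds) for seeds in runs)

print(len(counts))  # number of distinct histories seen
for sequence, n_runs in counts.most_common():
    print(n_runs, sequence)
```

A fully reproducible setup would give one distinct sequence; a small fixed number greater than one, as reported here, points at a discrete source of variation rather than at uncontrolled noise.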
closed by #33141 |
See e.g. #32900, #32932 and #32956, where many differences appear.
Although #32956 is indeed expected to change DD4hep and can thus lead to differences, it should not affect tracks... (perhaps this is due to the change in the Geant4 history?)