
Workflow 11634.911 appearing broken in several PR tests #32963

Closed · qliphy opened this issue Feb 22, 2021 · 31 comments

Comments

@qliphy (Contributor) commented Feb 22, 2021

see e.g. #32900 #32932 and #32956 where many differences appear.

Although #32956 is expected to change DD4hep indeed and thus can lead to difference, but it should not affect tracks... (perhaps it is due to the change of the Geant4 history?)

@cmsbuild (Contributor)

A new Issue was created by @qliphy Qiang Li.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@qliphy (Contributor, Author) commented Feb 22, 2021

11634.911 was added to the short matrix by #32857 and cms-sw/cms-bot#1490

assign @cms-sw/pdmv-l2 @cms-sw/geometry-l2 @cms-sw/simulation-l2

@makortel (Contributor) commented Feb 22, 2021

assign pdmv, geometry, simulation

@cmsbuild (Contributor)

New categories assigned: geometry, pdmv, simulation

@Dr15Jones, @cvuosalo, @mdhildreth, @chayanit, @wajidalikhan, @makortel, @jordan-martins, @ianna, @civanch you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cvuosalo (Contributor) commented Feb 22, 2021

@qliphy I don't think there is any big problem. In #32900, the matrix map was missing, but it has since been added. In #32932, the XML files were changed, which may have triggered a new random seed and caused statistical fluctuations in the 11634.911 workflow. In #32956, the intent of the PR was to change the results of the workflow, since there was a bug. Again, the XML files changed, which triggered a new random seed, so there are statistical fluctuations along with the desired changes.

@cvuosalo (Contributor)

I am seeing other PRs where 11634.911 is showing differences even though the PR has no effect on any workflows.
@smuzaffar Could there be a configuration problem with 11634.911 so that it incorrectly gets a new random seed for every PR test?

@cvuosalo (Contributor)

@smuzaffar The initialSeed values in the TTbar_14TeV_TuneCP5_cfi_GEN_SIM.py config are the same between IB and PR tests. However, the IB test is using the ROOT unit convention, and the PR test is using the Geant4 units convention. We have just gone through the update of the DD4hep version. Could that trigger a new random seed?
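
For context: in a CMSSW workflow, the seeds being compared live in the RandomNumberGeneratorService block of the configuration. A minimal sketch of what such a block looks like (the module names, seed values, and engine choices below are illustrative, not the actual contents of TTbar_14TeV_TuneCP5_cfi_GEN_SIM.py):

```python
import FWCore.ParameterSet.Config as cms

process = cms.Process("SIM")

# Sketch of the RandomNumberGeneratorService section of a GEN-SIM config.
# Seed values and engine names here are illustrative only.
process.RandomNumberGeneratorService = cms.Service("RandomNumberGeneratorService",
    generator = cms.PSet(
        initialSeed = cms.untracked.uint32(123456789),
        engineName = cms.untracked.string('HepJamesRandom')
    ),
    g4SimHits = cms.PSet(
        initialSeed = cms.untracked.uint32(9876),
        engineName = cms.untracked.string('MixMaxRng')
    )
)
```

If the initialSeed values match between the IB and PR configs, the seeding itself is identical, and any divergence has to enter downstream of the seed.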

@smuzaffar (Contributor) commented Feb 22, 2021

I cannot tell if that will trigger a new random seed or not, but for sure mixing units is not good.
By the way, we now have a newer DD4hep in IBs ( https://github.com/cms-sw/cmsdist/pull/6612/files ), so the IB should also use G4 units. I would suggest triggering PR tests to see if 11634.911 still shows comparison differences.
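
To illustrate why mixing the two conventions is dangerous (a generic numeric sketch, not actual DD4hep code): ROOT/TGeo expresses lengths in cm, while Geant4/CLHEP expresses them in mm, so the same bare number denotes different physical lengths depending on which convention the reader assumes.

```python
# ROOT/TGeo unit convention: cm == 1.0 (mm == 0.1).
# Geant4/CLHEP unit convention: mm == 1.0 (cm == 10.0).
ROOT_cm, G4_cm = 1.0, 10.0

raw_length = 2.5  # a bare number taken from a geometry description

print(raw_length / ROOT_cm, "cm if interpreted with ROOT units")    # 2.5 cm
print(raw_length / G4_cm, "cm if interpreted with Geant4 units")    # 0.25 cm
```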

@qliphy (Contributor, Author) commented Feb 23, 2021

> I cannot tell if that will trigger a new random seed or not, but for sure mixing units is not good.
> By the way, we now have a newer DD4hep in IBs ( https://github.com/cms-sw/cmsdist/pull/6612/files ), so the IB should also use G4 units. I would suggest triggering PR tests to see if 11634.911 still shows comparison differences.

I am re-triggering #32966 to see whether there is still difference from 11634.911.

@qliphy (Contributor, Author) commented Feb 23, 2021

Re-triggering the #32966 PR tests shows no differences in 11634.911.
So it seems this issue can be closed, to be confirmed at today's ORP.

@qliphy (Contributor, Author) commented Feb 23, 2021

Closed as discussed at today's ORP.

@qliphy closed this as completed Feb 23, 2021
@cvuosalo (Contributor)

+1

@makortel (Contributor)

The test in #30949 (comment) shows many differences in 11634.911, yet the PR should have no impact on simulation.

@cvuosalo (Contributor)

@makortel @smuzaffar Workflow 11634.911 performs GEN-SIM every time. If there is a change in the random seed, the simulation will be slightly different due to statistical fluctuations. I am not sure what triggers a new random seed.
For the PR tests, what is the source of the baseline for comparison? Is it re-generated for each PR test? If it is an old baseline, that could also be a cause of the differences.

@makortel (Contributor) commented Feb 26, 2021

The baseline comes from the IB the PR is tested against (so changes every 12 hours or so). The PR test setup is such that the random number seeds are (or should be) the same every time. All the other MC workflows redo the GEN-SIM every time as well, so an infrastructure problem should be visible in other workflows as well.

@cvuosalo (Contributor)

@makortel With the baseline coming from the IB, is the PR tested with IB + current PR, or is it tested with IB + latest merged PRs + current PR?

@makortel (Contributor)

My recollection is that there may be other recently merged PRs included in the tests, but those are shown in the test result message and/or page.

@cvuosalo (Contributor)

Workflow 11634.911 is different from other MC workflows in that it uses XML files for the geometry, rather than getting the geometry from the DB like other workflows. This makes it sensitive to any changes in the geometry XML files or in DD4hep. If there are such changes between the PR and the IB, then the simulation geometry will be (slightly) different, and the comparison results will show statistical fluctuations. It seems we see these differences even more often than there have been actual changes in geometry, and I don't know why that is so.

@makortel (Contributor) commented Mar 3, 2021

Here is another one #32804 (comment). No changes that should impact random number generation sequence, no other PRs being tested, and 11634.911 showing differences.

@cvuosalo (Contributor) commented Mar 3, 2021

@makortel #32804 changed tbb usage, which may have affected DD4hep. If the geometry is re-calculated and is no longer bit-wise identical to the previous geometry, that could change the simulation history and introduce statistical fluctuations. None of these changes would be significant.
I am thinking this geometry re-calculation may be the reason workflow 11634.911 is so sensitive.

@Dr15Jones (Contributor)

If DD4hep uses TBB then that implies it does stuff concurrently. Maybe there is an 'order of operation' problem which changes the geometry depending on which parts of the calculation finish first? If so, that would be very bad.
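
A concrete illustration of how an order-of-operation difference can change results at the bit level (a generic sketch, not DD4hep code): floating-point addition is not associative, so if concurrent tasks combine partial results in whatever order they happen to finish, the final value can differ between runs.

```python
# Floating-point addition is not associative: the grouping (i.e. the order
# in which partial results are combined) changes the last bits of the sum.
a, b, c = 0.1, 0.2, 0.3
print((a + b) + c == a + (b + c))  # False
print((a + b) + c, a + (b + c))    # 0.6000000000000001 0.6
```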

@makortel (Contributor) commented Mar 3, 2021

By a quick git grep I found these uses of TBB concurrency in DD4hep:
https://github.com/AIDASoft/DD4hep/blob/57bdfda84c7d6447ba070bcaebf0bb6f278ec3c4/DDDigi/src/DigiKernel.cpp#L274-L277
https://github.com/AIDASoft/DD4hep/blob/57bdfda84c7d6447ba070bcaebf0bb6f278ec3c4/DDDigi/src/DigiKernel.cpp#L349-L352
Do we use DigiKernel in some way? (I'm confused by all the "event processing" printouts in that class)

@cvuosalo (Contributor) commented Mar 3, 2021

There's very little concurrent operation, if any, in DD4hep. In the past we had issues integrating DD4hep that were related to TBB libraries, so that is why I suggest there may be some dependency on TBB. I should ask the DD4hep team exactly what level of concurrency DD4hep supports.
My thought is that somewhere in the process of re-compiling, re-optimizing, and re-linking with possibly different libraries, re-calculating the thousands upon thousands of floating-point numbers in the geometry could slightly change numerical values, causing the simulation history to change.

@davidlange6 (Contributor) commented Mar 3, 2021 via email

@Dr15Jones (Contributor)

> My thought is that somewhere in the process of re-compiling, re-optimizing, and re-linking with possibly different libraries, re-calculating the thousands upon thousands of floating-point numbers in the geometry could slightly change numerical values, causing the simulation history to change.

The compilers and linkers generate extremely reproducible results. Plus, what you describe is happening to all the other code in CMSSW, and that gives consistent results. One thing that could cause such issues would be uninitialized values, which would be affected by whatever was last on the stack or in that heap location; those would change when threads are used.
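
A Python analogue of the failure mode described above (numpy's empty() allocates without initializing, much like reading an uninitialized C++ member): the observed value depends on whatever bytes happened to be left in that memory, so it can change from run to run and with threading.

```python
import numpy as np

# np.empty allocates memory without initializing it, so the contents are
# whatever bytes were already in that heap region -- analogous to reading
# an uninitialized C++ data member. The printout below can change from
# run to run, which is exactly the kind of hidden state that makes
# results irreproducible.
x = np.empty(4)
print(x)  # garbage values; they may even look plausible by accident
```

For the C++ case, valgrind's memcheck with --track-origins=yes is the usual way to trace such a read back to its allocation.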

@qliphy reopened this Mar 4, 2021
@davidlange6 (Contributor)

Running valgrind might be of use to find memory issues. As @Dr15Jones suggests, that is often the reason for things like re-compilations changing results.

@cvuosalo (Contributor) commented Mar 5, 2021

I ran valgrind on DD4hep simulation. There are a lot of complaints about TStorage::UpdateIsOnHeap() in ROOT, but nothing I could find in CMSSW itself. I'll keep investigating.

@davidlange6 (Contributor) commented Mar 8, 2021 via email

@cvuosalo (Contributor) commented Mar 9, 2021

During simulation, one can get a debug print-out of the random seeds. From multiple executions of DD4hep simulation, I found that each execution showed one of two sequences of random seeds, with the alternation between the two sequences seemingly random from execution to execution. These tests show that DD4hep simulation does not have completely different results each time, but rather alternates between two possible simulation histories. Also, the alternation does not depend upon recompilation -- the same executable can produce alternating results. I tested several IBs, and I found some IBs seemed to produce more stable results than others for unknown reasons.
These results imply to me an uninitialized variable that can randomly take on one of two values each time simulation is run. However, I could find no report of uninitialized variables from the static analyzer or from valgrind, so I'm not sure how else to look for it.

@qliphy (Contributor, Author) commented Mar 11, 2021

Closed by #33141.
