SIGSEGV in HGCalImagingAlgo present in RelVals for slc7_aarch64_gcc530 & slc7_aarch64_gcc700 (aarch64 only) #19179
A new Issue was created by @mrodozov . @davidlange6, @Dr15Jones, @smuzaffar can you please review it and eventually sign/assign? Thanks. cms-bot commands are listed here |
assign reconstruction, upgrade |
I'm running valgrind on the step 3 of 27034.0. The job isn't finished yet but it has already found
|
If this helps: makeClusters was substituted with populate here https://github.com/cms-sw/cmssw/pull/18236/files#diff-b09c179fedcd894f76956c03c04d943cR132. It seems like the usage of populate inside produce() is where something goes wrong. |
@rovere @felicepantaleo @clelange please follow up on this HGCal issue. @mrodozov please change the title of this issue to be more descriptive of the problem (e.g. "SIGSEGV in HGCalImagingAlgo"). |
Also, for the record, we need some instructions to reproduce. |
The valgrind log has
and
|
To reproduce, I created a work area for CMSSW_9_2_X_2017-06-06-2300 on a standard amd64 machine (slc6_amd64_gcc530) and then ran step 3 of workflow 27034.0. This does not crash, but valgrind does show problems. |
As a first guess, I think the problem is on lines 571 and/or 572,
probably because of an off-by-one error in the |
As a test I added the following to HGCalImagingAlgo.cc
I then ran the job and it failed with: cmsRun: /uscms_data/d2/cdj/build/temp/crash/CMSSW_9_2_X_2017-06-06-2300/src/RecoLocalCalo/HGCalRecAlgos/src/HGCalImagingAlgo.cc:567: void HGCalImagingAlgo::computeThreshold(): Assertion `wafer < static_cast<int>(thresholds[layer-1].size())' failed. So it does look like an off-by-one error with |
Hi @Dr15Jones @mrodozov - sorry about the hassle. If it's just an off-by-one error for
dummy.resize(maxNumberOfWafersPerLayer+1, 0); should fix it. If not, then the magic number in
I can have a look at that tomorrow, too many other things going on today. |
I ran valgrind on 27434.0 after compiling
So it looks like @Dr15Jones had the right idea. |
@clelange, using |
@slava77, the crash is only visible on aarch64. In order to reproduce it you need to login to one of our arm64 build machines then create 92X dev area and run workflow 27034.0. I have created cmsuser account on moonshot-arm64-13.cern.ch (I can send you the password in email). |
@clelange I think you're correct that we need to ask @bsunanda for the correct "magic number". It would be great to be able to get this number directly from the HGCal geometry/topology in a way that would enforce its correctness... |
I looked at 27034.0 with CMSSW_9_2_ROOT6_X_2017-06-08-2300 on slc6_amd64_gcc700
Same workflow on AArch64 for CMSSW_9_2_ROOT6_X_2017-06-05-2300 produces 40 invalid writes/reads. Some are here:
|
There are wafer numbers 0..795 present for FH, so setting the maximum to 796 should help. I am trying to get a helper function in the geometry which can provide the maximum wafer number for a given configuration.
Submitted a PR with a hardwired number, soon to be replaced by a number derived from the geometry.
Good, maybe we should move back to using op [], which is infinitely faster.
On Mon, Jun 12, 2017, 23:27 Kevin Pedro wrote:
I confirm that #19198 from @bsunanda does not have any out-of-range exceptions when I run workflow 27434.0 replacing []s with .at()s.
--Marco Rovere
|
I confirm that 4 failing workflows on aarch64 run without crash with #19198
|
But does that fix all the issues from valgrind?
…On Tue, Jun 13, 2017, 9:02 AM Malik Shahzad Muzaffar wrote:
I confirm that 4 failing workflows on aarch64 run without crash with
#19198
27034.0_TTbar_14TeV+TTbar_14TeV_TuneCUETP8M1_2023D16_GenSimHLBeamSpotFull14+DigiFullTrigger_2023D16+RecoFullGlobal_2023D16+HARVESTFullGlobal_2023D16 Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED - time date Tue Jun 13 08:51:06 2017-date Tue Jun 13 08:03:32 2017; exit: 0 0 0 0
27034.2_TTbar_14TeV_Timing+TTbar_14TeV_TuneCUETP8M1_2023D16_GenSimHLBeamSpotFull14_Timing+DigiFullTrigger_Timing_2023D16+RecoFullGlobal_Timing_2023D16+HARVESTFullGlobal_Timing_2023D16 Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED - time date Tue Jun 13 08:51:06 2017-date Tue Jun 13 08:03:37 2017; exit: 0 0 0 0
27434.0_TTbar_14TeV+TTbar_14TeV_TuneCUETP8M1_2023D17_GenSimHLBeamSpotFull14+DigiFullTrigger_2023D17+RecoFullGlobal_2023D17+HARVESTFullGlobal_2023D17 Step0-PASSED Step1-PASSED Step2-PASSED Step3-PASSED - time date Tue Jun 13 08:51:05 2017-date Tue Jun 13 08:03:40 2017; exit: 0 0 0 0
3 3 3 3 tests passed, 0 0 0 0 failed
|
PR #19207 should fix the TMVA::Reader invalid-read issue seen here: #19179 (comment) |
Phase2-hgx85 Correct for overwriting (as reported in PR #19179)
+1 |
+1 fixed in #19198 as noted above already |
This issue is fully signed and ready to be closed. |
We were tracking release validation errors present only in aarch64 builds (here http://goo.gl/bhxlJE and here http://goo.gl/wPUz5C; workflows 270* and 274* fail with SIGSEGV) and found they started after PR #18236. Before that, we manually ran the first test, 27034.0, which failed with the following:
This appears to start failing in the destructor of HGCalClusterProducer (which is empty); digging further, we found a reference showing something was wrong with the disposal of
https://github.com/cms-sw/cmssw/blob/master/RecoLocalCalo/HGCalRecProducers/plugins/HGCalClusterProducer.cc#L47
i.e. improper deletion of a nested vector structure.
@Dr15Jones @clelange