Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ecal and Hcal Local GPU Reco crashes on missing detector #34197

Closed
Sam-Harper opened this issue Jun 21, 2021 · 53 comments
Closed

Ecal and Hcal Local GPU Reco crashes on missing detector #34197

Sam-Harper opened this issue Jun 21, 2021 · 53 comments

Comments

@Sam-Harper
Copy link
Contributor

Sam-Harper commented Jun 21, 2021

A bug was exposed last MWGR in that both HCAL and ECAL local reconstruction on a GPU do not have protections when the respective detector is out.

This is explicitly in the HBHERecHitProducerGPU

cmsRun: /build/cmsbld/jenkins/workspace/auto-builds/CMSSW_11_3_1-slc7_amd64_gcc900/build/CMSSW_11_3_1-build/tmp/BUILDROOT/ebcff69f73237d0d563ea1e7c33f2d83/opt/cmssw/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_1/src/RecoLocalCalo/HcalRecProducers/src/MahiGPU.cu:1064: void hcal::reconstruction::entryPoint(const hcal::reconstruction::InputDataGPU&, hcal::reconstruction::OutputDataGPU&, const hcal::reconstruction::ConditionsProducts&, hcal::reconstruction::ScratchDataGPU&, const hcal::reconstruction::ConfigParameters&, cudaStream_t): Assertion `startingSample == 0 || startingSample == 2' failed.

and a similar crash was observed in ECAL

To reproduce simply run over any run with the appropriate detector missing.

@cmsbuild
Copy link
Contributor

A new Issue was created by @Sam-Harper Sam Harper.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@makortel
Copy link
Contributor

assign reconstruction, heterogeneous

@cmsbuild
Copy link
Contributor

New categories assigned: heterogeneous,reconstruction

@slava77,@fwyzard,@perrotta,@makortel,@jpata you have been requested to review this Pull request/Issue and eventually sign? Thanks

@makortel
Copy link
Contributor

FYI @cms-sw/ecal-dpg-l2 @cms-sw/hcal-dpg-l2 @vkhristenko

@fwyzard
Copy link
Contributor

fwyzard commented Jun 21, 2021

An other corner case that needs to be tested and possibly addressed is the unpacking and local reconstruction of calibration data.

It may also be useful to check if data taken in non-standard conditions (different zero suppression, etc.) is properly handled.

@thomreis
Copy link
Contributor

Is there a recipe to reproduce this? Running at P5 with a specific HLT configuration? @amassiro FYI even though I guess you know already.

@fwyzard
Copy link
Contributor

fwyzard commented Jun 23, 2021

@mzarucki does FOG have some samples to reproduce the issues ?

@mzarucki
Copy link
Contributor

mzarucki commented Jun 24, 2021

Hi all,

@smorovic put the file from the run 342053 with ECAL in, HCAL out (e-log) that was crashing in NFS /nfshome0/smorovic/gpu-test

The HLT menu used was /cdaq/cosmic/commissioning2021/MWGR4/Cosmics_GPUTest1_V2/V2 which requires the upgraded ConfDB (briefly described here).

Cheers,
Mateusz

PS. The test we did with HCAL in and ECAL out was run 342175 (e-log)

@fwyzard
Copy link
Contributor

fwyzard commented Jun 24, 2021 via email

@mzarucki
Copy link
Contributor

mzarucki commented Jun 25, 2021

Hi Andrea, all,

I came up with the simplest set of instructions to reproduce the errors, by copying the HLT menu file and raw input files from @smorovic into my directory on NFS (/nfshome0/mzarucki/GPUTests).

As a first step, one would need to be logged into one of the GPU machines with a working area set up - all instructions are included in the GPU development Twiki: https://twiki.cern.ch/twiki/bin/viewauth/CMS/TriggerDevelopmentWithGPUs

Concerning the working area, it could be either on NFS (/nfshome0/) or on the /data/user/$USER directory of the machine, as described here.

The release to set up is CMSSW_11_3_1_patch1. You can copy the HLT GPU menu configuration file from: /nfshome0/mzarucki/GPUTests/Cosmics_GPUTest1_V2-V2.py. On top of the standard configuration, there is a block of code which sets up the DAQ source, including the local input file:

process.EvFDaqDirector = cms.Service(
    "EvFDaqDirector",
    runNumber=cms.untracked.uint32(342053),
    baseDir=cms.untracked.string("tmp"),
    buBaseDir=cms.untracked.string(
        "/nfshome0/mzarucki/CMSSW_11_3_1_patch1/src"
    ),
    useFileBroker=cms.untracked.bool(False),
    fileBrokerKeepAlive=cms.untracked.bool(True),
    fileBrokerPort=cms.untracked.string("8080"),
    fileBrokerUseLocalLock=cms.untracked.bool(True),
    fuLockPollInterval=cms.untracked.uint32(2000),
    requireTransfersPSet=cms.untracked.bool(False),
    selectedTransferMode=cms.untracked.string(""),
    mergingPset=cms.untracked.string(""),
    outputAdler32Recheck=cms.untracked.bool(False),
)

process.source.fileNames = cms.untracked.vstring("/nfshome0/mzarucki/GPUTests/ECALin_HCALout/run342053/run342053_ls0013_index000002.raw")
#process.source.fileNames = cms.untracked.vstring("/nfshome0/mzarucki/GPUTests/HCALin_ECALout/run342110/run342110_ls0023_index000000.raw")
process.source.fileListMode = True

where one would have to update the buBaseDir to your CMSSW directory.

To recreate the HCAL crash from run 342053 with ECAL in and HCAL out (e-log), one would use the file that is already set up (/nfshome0/mzarucki/GPUTests/ECALin_HCALout/run342053/run342053_ls0013_index000002.raw).

To recreate the ECAL crash from run 342175 with HCAL in and ECAL out (e-log), we have another raw file locally available from run 342110 with the same setup (/nfshome0/mzarucki/GPUTests/HCALin_ECALout/run342110/run342110_ls0023_index000000.raw). In the EvFDaqDirector one would have to update the runNumber, as well as the process.source.fileNames of course.

@fwyzard, @Sam-Harper please comment if there is anything else to add.

Cheers,
Mateusz

PS. @Sam-Harper has updated the hltGetConfiguration command for dumping GPU menus, however, this is not (yet) working on the GPU machines. Therefore, a local copy is the simplest alternative. The menus that were ran are also temporarily saved in /var/log/hltd/pid/ on the machines that were used in the DAQ configuration (fu-c2a02-37-[01-04]).

@thomreis
Copy link
Contributor

Thanks @mzarucki for the detailed recipe. I think this should be enough to get going.
However, @fwyzard is the 11_3_1_patch1 release installed on the GPU machines? I find only 11_3_0_patch1.

@fwyzard
Copy link
Contributor

fwyzard commented Jun 30, 2021

@thomreis all offline releases are actually available from /cvmfs; I've added instruction for using them to the twiki: https://twiki.cern.ch/twiki/bin/view/CMS/TriggerDevelopmentWithGPUs#Setting_up_a_working_area_fo_AN2 .

@thomreis
Copy link
Contributor

@fwyzard thanks I got it to work. I must have had a cached version of the twiki without those instructions.
I can now reproduce the crash.

@fwyzard
Copy link
Contributor

fwyzard commented Jun 30, 2021

no, I just added them one hour ago :-)

@thomreis
Copy link
Contributor

I have made PRs to master and CMSSW_11_3_X with a fix for the ECAL out crash.

@Sam-Harper
Copy link
Contributor Author

thanks Thomas!

@Sam-Harper
Copy link
Contributor Author

@cms-sw/hcal-dpg-l2 is there a timeline the HCAL DPG to have a look at this? Thanks!

@mseidel42
Copy link
Contributor

@mariadalfonso could you have a look, please?

@mzarucki
Copy link
Contributor

mzarucki commented Aug 4, 2021

Hi @fwyzard,

Yes, our previous tests have already shown this. I have re-done them and confirm that we see no crashes when running only one set of path types in 11_3_3 + fixes. We saw this already for ECAL in the pure 11_3_3 release, and we see this for Pixel (11_3_3 + Pixel PR #34684) and HCAL (11_3_3 + HCAL PR #34750).

Cheers,
Mateusz

@thomreis
Copy link
Contributor

thomreis commented Aug 4, 2021

I could reproduce the crash with the configuration of @mzarucki. It happens always with 2 or more threads.

I have extended a bit the error message just before the crash and it seems that there is a wrong number of channels passed in the digis.

%MSG-e EcalUncalibRecHitProducerGPU:   EcalUncalibRecHitProducerGPU:hltEcalUncalibRecHitGPU 04-Aug-2021 10:10:48 CEST  Run: 343762 Event: 2109
Max number of channels exceeded in barrel or endcap. Number of barrel channels: 2294272000 with maxNumberHitsEB=61200, number of endcap channels: 32689 with maxNumberHitsEE=14648
%MSG
----- Begin Fatal Exception 04-Aug-2021 10:10:54 CEST-----------------------
An exception of category 'StdException' occurred while
   [0] Processing  Event run: 343762 lumi: 1 event: 2109 stream: 1
   [1] Running path 'HLT_ECALHT1_GPU_v1'
   [2] Prefetching for module CaloTowersCreator/'hltTowerMakerEcal'
   [3] Prefetching for module EcalRecHitProducer/'hltEcalRecHit@cuda'
   [4] Prefetching for module EcalUncalibRecHitConvertGPU2CPUFormat/'hltEcalUncalibRecHit@cuda'
   [5] Prefetching for module EcalCPUUncalibRecHitProducer/'hltEcalUncalibRecHitSoA'
   [6] Calling method for module EcalUncalibRecHitProducerGPU/'hltEcalUncalibRecHitGPU'
Exception Message:
A std::exception was thrown.

/data/user/treis/ecal_local_reco/CMSSW_11_3_3/src/RecoLocalCalo/EcalRecProducers/plugins/EcalUncalibRecHitMultiFitAlgoGPU.cu, line 118:
cudaCheck(cudaGetLastError());
cudaErrorInvalidConfiguration: invalid configuration argument
----- End Fatal Exception -------------------------------------------------

In principle the number of channels should be zero.
The crash is not always the same after the error message. I also got this one already:

terminate called after throwing an instance of 'std::runtime_error'
  what():  
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_11_3_3-slc7_amd64_gcc900/build/CMSSW_11_3_3-build/tmp/BUILDROOT/402e2a5eeeb9630ea9f5469bb50cc947/opt/cmssw/slc7_amd64_gcc900/cms/cmssw/CMSSW_11_3_3/src/HeterogeneousCore/CUDACore/src/ScopedContext.cc, line 86:
cudaCheck(cudaStreamAddCallback(stream, cudaScopedContextCallback, new CallbackData{waitingTaskHolder_, device}, 0));
cudaErrorIllegalAddress: an illegal memory access was encountered

@fwyzard
Copy link
Contributor

fwyzard commented Aug 4, 2021

The crash is not always the same after the error message.

The problem is that - if the error happens inside a kernel, running asynchronously on the GPU - it till be reported by the first CUDA runtime call after it, in any thread.

This can happen in the memory allocator or in the framework support calls, because they are quite frequent and kind of wrap all other CUDA modules, even if the error is not there.

@thomreis
Copy link
Contributor

thomreis commented Aug 4, 2021

While the crash happens in the EcalUncalibRecHitProducerGPU the reason for it seems to be in the EcalRawToDigiGPU module already where the false number of channels is stored in the digis.

@fwyzard
Copy link
Contributor

fwyzard commented Aug 4, 2021

looking at the code of EventFilter/EcalRawToDigi/plugins/EcalRawToDigiGPU.cc:

void EcalRawToDigiGPU::acquire(edm::Event const& event,
                               edm::EventSetup const& setup,
                               edm::WaitingTaskWithArenaHolder holder) {
...
  // unpack if at least one FED has data
  if (counter > 0) {
    ecal::raw::entryPoint(
        inputCPU, inputGPU, outputGPU_, scratchGPU, outputCPU_, conditions, ctx.stream(), counter, currentCummOffset);
  }
}

void EcalRawToDigiGPU::produce(edm::Event& event, edm::EventSetup const& setup) {
  cms::cuda::ScopedContextProduce ctx{cudaState_};

  // get the number of channels
  outputGPU_.digisEB.size = outputCPU_.nchannels[0];
  outputGPU_.digisEE.size = outputCPU_.nchannels[1];

  ctx.emplace(event, digisEBToken_, std::move(outputGPU_.digisEB));
  ctx.emplace(event, digisEEToken_, std::move(outputGPU_.digisEE));

  // reset ptrs that are carried as members
  outputCPU_.nchannels.reset();
}

my guess is that by not calling ecal::raw::entryPoint(...), outputCPU_.nchannels is not properly set, and outputGPU_.digisEB.size gets a random value (same for EE).

@fwyzard
Copy link
Contributor

fwyzard commented Aug 4, 2021

So, it might be enough to add

  outputCPU_.nchannels[0] = 0;
  outputCPU_.nchannels[1] = 0;

right before

  // unpack if at least one FED has data
  if (counter > 0) {
    ecal::raw::entryPoint(
        inputCPU, inputGPU, outputGPU_, scratchGPU, outputCPU_, conditions, ctx.stream(), counter, currentCummOffset);
  }

@fwyzard
Copy link
Contributor

fwyzard commented Aug 4, 2021

or there may be other fields that should be initialised properly instead of skipping the call altogether - I don't know by heart

@thomreis
Copy link
Contributor

thomreis commented Aug 4, 2021

Yes that is my guess as well. Checking that now.

@thomreis
Copy link
Contributor

thomreis commented Aug 4, 2021

Confirmed.

@fwyzard
Copy link
Contributor

fwyzard commented Aug 4, 2021

I can confirm it as well, with this patch

diff --git a/EventFilter/EcalRawToDigi/plugins/EcalRawToDigiGPU.cc b/EventFilter/EcalRawToDigi/plugins/EcalRawToDigiGPU.cc
index 4dcb1bd0e26e..36fdaeb4cfe9 100644
--- a/EventFilter/EcalRawToDigi/plugins/EcalRawToDigiGPU.cc
+++ b/EventFilter/EcalRawToDigi/plugins/EcalRawToDigiGPU.cc
@@ -134,6 +134,10 @@ void EcalRawToDigiGPU::acquire(edm::Event const& event,
     ++counter;
   }
 
+  // reset the number of channels
+  outputCPU_.nchannels[0] = 0;
+  outputCPU_.nchannels[1] = 0;
+
   // unpack if at least one FED has data
   if (counter > 0) {
     ecal::raw::entryPoint(

the ECAL plus Pixel job runs to completion:

TrigReport ---------- Path   Summary ------------
TrigReport  Trig Bit#   Executed     Passed     Failed      Error Name
...
TrigReport     1   38      21765      21765          0          0 Status_OnGPU
TrigReport     1   39      21765          0      21765          0 HLT_L1RandomTrigType_v1
TrigReport     1   40      21765      21765          0          0 HLT_L1PhysicsTrigType_v1
TrigReport     1   41      21765      21765          0          0 HLT_Pixel_v1
TrigReport     1   42      21765          0      21765          0 HLT_ECALHT1_v1
TrigReport     1   43      21765          0      21765          0 HLT_ECALHT1_NoGPU_v1
TrigReport     1   44      21765          0      21765          0 HLT_ECALHT1_GPU_v1

@thomreis
Copy link
Contributor

thomreis commented Aug 4, 2021

So

  // output cpu
  outputCPU_ = {cms::cuda::make_host_unique<uint32_t[]>(2, ctx.stream())};

in https://github.com/cms-sw/cmssw/blob/master/EventFilter/EcalRawToDigi/plugins/EcalRawToDigiGPU.cc#L108 does not initialise the object?

@fwyzard
Copy link
Contributor

fwyzard commented Aug 4, 2021

I think it just allocates the memory, but doesn't perform any initialisation.

@thomreis
Copy link
Contributor

thomreis commented Aug 4, 2021

OK. I'll prepare a PR.
Thanks for the help.

@fwyzard
Copy link
Contributor

fwyzard commented Aug 4, 2021

By the way, looking again at the code, it's a bit of a waste to deallocate and reallocate 2 integers at every event...
It should be few ms faster to allocate them once (in the constructor, but only if CUDA is available) and simply resetting the values to 0 at the end of produce.

But let's keep this separate from the fix itself.

@mzarucki
Copy link
Contributor

mzarucki commented Aug 4, 2021

Hi all,

Just wanted to confirm with you that running our Hilton GPU tests with the full GPU menu [1] over run 343762 with all three PRs (Pixel #34684, HCAL #34750 and ECAL #34768) on top of CMSSW_11_3_3 we do not see any more crashes (as documented in this e-log).

Thank you for the quick reaction.

Best regards,
Mateusz on behalf of FOG

[1] /cdaq/cosmic/commissioning2021/CRUZET/Cosmics_GPU/V2

@perrotta
Copy link
Contributor

Can this issue get closed, then?

@fwyzard
Copy link
Contributor

fwyzard commented Aug 10, 2021

+heterogeneous

@cmsbuild
Copy link
Contributor

This issue is fully signed and ready to be closed.

@mzarucki
Copy link
Contributor

Dear all,

From the FOG side, I would like to report that we have tested the full GPU menu in CMSSW_11_3_4 in run 344449 with ECAL, HCAL and Pixel out of the run and we saw no issues (as reported in this e-log and today's Daily Run meeting just now). This confirms that the updated protections as working well.

Best regards,
Mateusz on behalf of FOG

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests