assertion failed in RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h #38922

Open
fwyzard opened this issue Aug 1, 2022 · 14 comments

Comments

@fwyzard
Contributor

fwyzard commented Aug 1, 2022

Building CMSSW 12.4.4 plus #38902, #38884 and #38921 with GPU assertions enabled (with USER_CXXFLAGS="-g -DGPU_DEBUG" USER_CUDA_FLAGS="-g -DGPU_DEBUG" scram b -j$(nproc)) results in an assertion failure in RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h while running over run = 356381, lumi = 220, event = 175766601:

/data/user/fwyzard/CMSSW_12_4_4/src/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h:208: void gpuClustering::findClus(const unsigned short *, const unsigned short *, const unsigned short *, const unsigned int *, unsigned int *, unsigned int *, signed int *, int) [with __nv_bool isPhase2 = false]: block: [604,0,0], thread: [1,0,0] Assertion `l < maxNeighbours` failed.

(for a large number of GPU threads, but I did not copy all the messages).

The problem can be reproduced with

cmsrel CMSSW_12_4_4
cd CMSSW_12_4_4/src
cmsenv
git cms-init
git cms-merge-topic 38884
git cms-merge-topic 38902
git cms-merge-topic 38921
git cms-addpkg RecoLocalTracker/SiPixelClusterizer RecoPixelVertexing/PixelTrackFitting RecoPixelVertexing/PixelTriplets
USER_CXXFLAGS="-g -DGPU_DEBUG" USER_CUDA_FLAGS="-g -DGPU_DEBUG" scram b -j`nproc`
cd ..
mkdir run
cd run
cp /afs/cern.ch/user/f/fwyzard/public/CMSSW_12_4_4/* .
cmsRun crashing.py
@fwyzard
Contributor Author

fwyzard commented Aug 1, 2022

assign heterogeneous

@fwyzard
Contributor Author

fwyzard commented Aug 1, 2022

assign tracker

@cmsbuild
Contributor

cmsbuild commented Aug 1, 2022

New categories assigned: heterogeneous

@fwyzard,@makortel you have been requested to review this Pull request/Issue and eventually sign? Thanks

@cmsbuild
Contributor

cmsbuild commented Aug 1, 2022

A new Issue was created by @fwyzard Andrea Bocci.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

@fwyzard
Contributor Author

fwyzard commented Aug 1, 2022

@VinInn @makortel FYI

@makortel
Contributor

makortel commented Aug 1, 2022

assign reconstruction

@makortel
Contributor

makortel commented Aug 1, 2022

FYI @cms-sw/trk-dpg-l2

@cmsbuild
Contributor

cmsbuild commented Aug 1, 2022

New categories assigned: reconstruction

@jpata,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks

@VinInn
Contributor

VinInn commented Aug 1, 2022

A large number of duplicate pixels, I suppose.
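To illustrate why duplicate pixels could trip `l < maxNeighbours`, here is a minimal host-side sketch. It is not the code in gpuClustering.h: the `Pixel` struct, `countNeighbours`, `fitsInNeighbourArray` and the cap `kMaxNeighbours = 10` are assumptions made for the example. The point is only that, on a duplicate-free grid, a pixel can touch at most 8 others, while repeated digis at the same coordinates make the neighbour count unbounded.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical digi type for this sketch: pixel coordinates within one module.
struct Pixel {
  std::uint16_t x;
  std::uint16_t y;
};

// Assumed per-pixel capacity of the neighbour array (illustrative value).
constexpr int kMaxNeighbours = 10;

// Count the digis other than i that touch digi i (8-connectivity).
// Every duplicate digi at an adjacent coordinate is counted separately.
inline int countNeighbours(const std::vector<Pixel>& digis, std::size_t i) {
  assert(i < digis.size());
  int n = 0;
  for (std::size_t j = 0; j < digis.size(); ++j) {
    if (j == i)
      continue;
    const int dx = std::abs(int(digis[j].x) - int(digis[i].x));
    const int dy = std::abs(int(digis[j].y) - int(digis[i].y));
    if (dx <= 1 && dy <= 1)
      ++n;
  }
  return n;
}

// True if all neighbours of digi i fit in a fixed array of kMaxNeighbours slots.
inline bool fitsInNeighbourArray(const std::vector<Pixel>& digis, std::size_t i) {
  return countNeighbours(digis, i) <= kMaxNeighbours;
}
```

With a full 3x3 block of distinct pixels the centre has 8 neighbours and fits; one pixel next to 11 copies of the same adjacent digi has 11 and does not.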

@makortel
Contributor

makortel commented Aug 1, 2022

IIUC the failing assertion is `l < maxNeighbours`, at gpuClustering.h:208 in the log above

(from #38902 since it changes the line numbers wrt master)

@fwyzard
Contributor Author

fwyzard commented Aug 1, 2022

By the way, here is the full log from the offending event:

++++ starting: source event
++++ finished: source event
++++ starting: processing event : stream = 0 run = 356381 lumi = 220 event = 175766601 time = 7125571355059409920
++++++ starting: processing path 'PathGettingStuck' : stream = 0
++++++++ starting: prefetching before processing event for module: stream = 0 label = 'hltSiPixelClustersGPU' id = 5
++++++++ starting: prefetching before processing event for module: stream = 0 label = 'hltOnlineBeamSpotToGPU' id = 4
++++++++ finished: prefetching before processing event for module: stream = 0 label = 'hltSiPixelClustersGPU' id = 5
++++++++ starting: processing event acquire for module: stream = 0 label = 'hltSiPixelClustersGPU' id = 5
++++++++ starting: prefetching before processing event for module: stream = 0 label = 'hltOnlineBeamSpot' id = 3
decoding 63450 digis.
++++++++ finished: prefetching before processing event for module: stream = 0 label = 'hltOnlineBeamSpot' id = 3
++++++++ starting: processing event for module: stream = 0 label = 'hltOnlineBeamSpot' id = 3
++++++++ finished: processing event for module: stream = 0 label = 'hltOnlineBeamSpot' id = 3
++++++++ finished: prefetching before processing event for module: stream = 0 label = 'hltOnlineBeamSpotToGPU' id = 4
++++++++ starting: processing event for module: stream = 0 label = 'hltOnlineBeamSpotToGPU' id = 4
CUDA countModules kernel launch with 248 blocks of 256 threads
CUDA findClus kernel launch with 1856 blocks of 384 threads
++++++++ finished: processing event for module: stream = 0 label = 'hltOnlineBeamSpotToGPU' id = 4
start clusterizer for module 1101 in block 16
histo size 21
start clusterizer for module 701 in block 329
start clusterizer for module 101 in block 424
columns with more than 60 px 1 in 92
start clusterizer for module 1 in block 514
# loops 9
start clusterizer for module 1001 in block 531
histo size 10
start clusterizer for module 301 in block 584
start clusterizer for module 501 in block 601
start clusterizer for module 801 in block 634
histo size 73
3 clusters in module 1101
histo size 174
start clusterizer for module 901 in block 733
# loops 3
histo size 7
histo size 88
histo size 29
histo size 14
# loops 5
2 clusters in module 701
start clusterizer for module 201 in block 900
# loops 3
# loops 5
histo size 50
# loops 3
start clusterizer for module 401 in block 1000
# loops 3
20 clusters in module 101
start clusterizer for module 601 in block 1030
/tmp/fwyzard/CMSSW_12_4_4/src/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h:208: void gpuClustering::findClus(const unsigned short *, const unsigned short *, const unsigned short *, const unsigned int *, unsigned int *, unsigned int *, int *, int) [with __nv_bool isPhase2 = false]: block: [476,0,0], thread: [189,0,0] Assertion `l < maxNeighbours` failed.
/tmp/fwyzard/CMSSW_12_4_4/src/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h:208: void gpuClustering::findClus(const unsigned short *, const unsigned short *, const unsigned short *, const unsigned int *, unsigned int *, unsigned int *, int *, int) [with __nv_bool isPhase2 = false]: block: [476,0,0], thread: [191,0,0] Assertion `l < maxNeighbours` failed.
# loops 15
/tmp/fwyzard/CMSSW_12_4_4/src/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h:208: void gpuClustering::findClus(const unsigned short *, const unsigned short *, const unsigned short *, const unsigned int *, unsigned int *, unsigned int *, int *, int) [with __nv_bool isPhase2 = false]: block: [476,0,0], thread: [257,0,0] Assertion `l < maxNeighbours` failed.
...
/tmp/fwyzard/CMSSW_12_4_4/src/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h:208: void gpuClustering::findClus(const unsigned short *, const unsigned short *, const unsigned short *, const unsigned int *, unsigned int *, unsigned int *, int *, int) [with __nv_bool isPhase2 = false]: block: [476,0,0], thread: [157,0,0] Assertion `l < maxNeighbours` failed.
/tmp/fwyzard/CMSSW_12_4_4/src/RecoLocalTracker/SiPixelClusterizer/plugins/gpuClustering.h:208: void gpuClustering::findClus(const unsigned short *, const unsigned short *, const unsigned short *, const unsigned int *, unsigned int *, unsigned int *, int *, int) [with __nv_bool isPhase2 = false]: block: [476,0,0], thread: [159,0,0] Assertion `l < maxNeighbours` failed.
2 clusters in module 1001
start clusterizer for module 1501 in block 1093
30 clusters in module 1
histo size 79
start clusterizer for module 1201 in block 1106
start clusterizer for module 1401 in block 1162
# loops 9
start clusterizer for module 1301 in block 1166
10 clusters in module 501
4 clusters in module 801
histo size 10
histo size 16
17 clusters in module 301
histo size 15
# loops 5
histo size 30
start clusterizer for module 1601 in block 1304
8 clusters in module 901
start clusterizer for module 1701 in block 1344
histo size 7
# loops 3
histo size 22
# loops 7
17 clusters in module 201
start clusterizer for module 1801 in block 1443
# loops 3
# loops 3
histo size 33
2 clusters in module 401
# loops 3
histo size 41
# loops 3
4 clusters in module 601
5 clusters in module 1501
12 clusters in module 1201
histo size 39
3 clusters in module 1401
12 clusters in module 1301
# loops 7
# loops 3
# loops 3
6 clusters in module 1601
11 clusters in module 1701
16 clusters in module 1801
terminate called after throwing an instance of 'std::runtime_error'
  what():  
/data/cmsbld/jenkins/workspace/auto-builds/CMSSW_12_4_4-slc7_amd64_gcc10/build/CMSSW_12_4_4-build/tmp/BUILDROOT/c168b61873ba8a4c8e213417620ca4f0/opt/cmssw/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_4_4/src/HeterogeneousCore/CUDAUtilities/src/CachingHostAllocator.h, line 543:
cudaCheck(error = cudaEventRecord(search_key.ready_event, search_key.associated_stream));
cudaErrorAssert: device-side assert triggered

@VinInn
Contributor

VinInn commented Aug 1, 2022

the detection of duplicate pixels has never been integrated, not even in debug mode...

@VinInn
Contributor

VinInn commented Aug 1, 2022

One can try to add this in debug mode (just the printout, no fix):
https://github.com/cms-sw/cmssw/pull/37359/files#diff-6473251300f99a4418907c32c97ac4f20ad06764ccbbe757e8acf39f02c3da1aR126
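For readers who want to check an input for duplicates outside the GPU code, a rough host-side sketch follows. It is not the code from #37359: the `Digi` struct and `countDuplicateDigis` are hypothetical names, and sorting a copy is chosen for clarity rather than efficiency. It counts digis that share (module, x, y) with another digi, without removing them.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <tuple>
#include <vector>

// Hypothetical digi type for this sketch: module id plus pixel coordinates.
struct Digi {
  std::uint32_t module;
  std::uint16_t x;
  std::uint16_t y;
};

// Return how many digis repeat the (module, x, y) of another digi.
// Takes the vector by value so the caller's ordering is left untouched.
inline std::size_t countDuplicateDigis(std::vector<Digi> digis) {
  auto key = [](const Digi& d) { return std::tie(d.module, d.x, d.y); };
  std::sort(digis.begin(), digis.end(),
            [&](const Digi& a, const Digi& b) { return key(a) < key(b); });
  std::size_t duplicates = 0;
  // After sorting, identical digis are adjacent; count each repeat once.
  for (std::size_t i = 1; i < digis.size(); ++i) {
    if (key(digis[i]) == key(digis[i - 1]))
      ++duplicates;
  }
  return duplicates;
}
```

A non-zero return value for the digis of the offending event would be consistent with the duplicate-pixel hypothesis; a debug printout could report it per module.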

@fwyzard
Contributor Author

fwyzard commented Aug 2, 2022

@VinInn thanks for the suggestion.

I've squashed and rebased #37359 like this (https://github.com/cms-sw/cmssw/compare/master...fwyzard:cmssw:test_expensive_duplicate_pixel_removal?expand=1), and it does fix the problem of the job getting stuck or failing the assertion.

I guess it's finally time to pick an efficient solution 🤷‍♂️
