[bug][gfx908] 1x1 convolution failure passing from MIOpen to rocBLAS #1460

junliume · 2022-03-11T21:14:47Z

./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1

On gfx908 :

MIOpen(HIP): Info [ConvolutionForward] algo = 5, workspace = 0
MIOpen(HIP): Info2 [GetInvoker] Returning an invoker for problem 76x9x9x1x1x32x9x9x16xNCHWxFP32x0x0x1x1x1x1x1xF and algorithm miopenConvolutionFwdAlgoImplicitGEMM
:0:rocdevice.cpp :2616: 103657926263 us: 37351: [tid:0x7fe289112700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. code: 0x1008
Aborted (core dumped)

On gfx90a:

MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 0
MIOpen(HIP): Info2 [GetInvoker] Returning an invoker for problem 76x9x9x1x1x32x9x9x16xNCHWxFP32x0x0x1x1x1x1x1xF and algorithm miopenConvolutionFwdAlgoGEMM
MIOpen(HIP): auto miopen::solver::GemmFwd1x1_0_1::GetSolution(const miopen::ExecutionContext &, const conv::ProblemDescription &)::(anonymous class)::operator()(const std::vector &)::(anonymous class)::operator()(const miopen::Handle &, const miopen::AnyInvokeParams &) const{
MIOpen(HIP): "convolution, 1x1" = convolution, 1x1
MIOpen(HIP): }
MIOpen(HIP): Info2 [CallGemmStridedBatched] gemm_desc: {isColMajor 0, transA 0, transB 0, m 32, n 81, k 76, lda 76, ldb 81, ldc 81, batch_count 16, strideA 0, strideB 6156, strideC 2592, alpha 1, beta 0, dataType 1}
MIOpen(HIP): miopenStatus_t miopen::CallGemmStridedBatched(const miopen::Handle &, miopen::GemmDescriptor, ConstData_t, int, ConstData_t, int, Data_t, int, miopen::FindDbKCacheKey *, miopen::GemmBackend_t, bool){
MIOpen(HIP): "rocBLAS" = rocBLAS
MIOpen(HIP): }
MIOpen Forward Conv. Algorithm: 0, Solution: 88/GemmFwd1x1_0_1

junliume · 2022-03-14T19:17:49Z

It’s quite interesting that if you try the command on a fresh docker container, it would actually work. However, run it a few times, and it consistently fails thereafter.

[] ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpen Forward Conv. Algorithm: 5, Solution: 74/ConvAsmImplicitGemmGTCDynamicFwdXdlops
GPU Kernel Time Forward Conv. Elapsed: 0.018987 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv1x1u1, 16, 76, 9, 9, 1, 1, 32,  6303744, 403712, 165888, 332, 30, 0.018987
Forward Convolution Verifies OK on CPU reference (3.61839e-08)
[] ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
:0:rocdevice.cpp            :2616: 29908722944 us: 167  : [tid:0x7f3cee38c700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_MEMORY_FAULT: Agent attempted to access an inaccessible address. code: 0x2b
Aborted (core dumped)
[] ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
:0:rocdevice.cpp            :2616: 29911124821 us: 169  : [tid:0x7fc0e1ac4700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. code: 0x1008
Aborted (core dumped)

atamazov · 2022-03-21T15:47:02Z

Looks like a HIP runtime problem or issue in the invoker of ConvAsmImplicitGemmGTCDynamicFwdXdlops.

The difference between fresh and used container is: binary cache and find-db. Does removing these helps?

Is this scenario (test passed one or more times, then memory fault(s), then consistently out of resources) stable or random?

/cc @DrizztDoUrden

junliume · 2022-03-21T19:14:44Z

@atamazov this issue is currently assigned to runtime.
Yes clearing cache and config does help:

[First Run Fresh]:/opt/rocm/miopen# ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpen Forward Conv. Algorithm: 5, Solution: 74/ConvAsmImplicitGemmGTCDynamicFwdXdlops
GPU Kernel Time Forward Conv. Elapsed: 0.019431 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv1x1u1, 16, 76, 9, 9, 1, 1, 32, 6303744, 403712, 165888, 324, 29, 0.019431
Forward Convolution Verifies OK on CPU reference (3.61839e-08)
[Second Run NOT Fresh]:/opt/rocm/miopen# ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
:0:rocdevice.cpp :2614: 472573914516 us: 163 : [tid:0x7f2699c78700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_ILLEGAL_INSTRUCTION: The agent attempted to execute an illegal shader instruction. code: 0x2a
Aborted (core dumped)
[Clear Cache]:/opt/rocm/miopen# rm -rf ~/.cache/miopen/
[Clear Config]:/opt/rocm/miopen# rm -rf ~/.config/miopen/
[Third Run Fresh]:/opt/rocm/miopen# ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpen Forward Conv. Algorithm: 5, Solution: 74/ConvAsmImplicitGemmGTCDynamicFwdXdlops
GPU Kernel Time Forward Conv. Elapsed: 0.019076 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv1x1u1, 16, 76, 9, 9, 1, 1, 32, 6303744, 403712, 165888, 330, 30, 0.019076
Forward Convolution Verifies OK on CPU reference (3.61839e-08)
[Fourth Run NOT Fresh]:/opt/rocm/miopen# ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
:0:rocdevice.cpp :2614: 472627880390 us: 317 : [tid:0x7fe398062700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. code: 0x1008
Aborted (core dumped)

atamazov · 2022-03-21T19:33:15Z

@junliume

Yes clearing cache and config does help...

Then this is most likely MIOpen issue. I am afraid it'll be back.

Can you please find out if clearing find-db or clearing binary cache helps?

junliume · 2022-03-21T19:43:34Z

@cderb @JehandadKhan this is an easily reproducible issue with the above mentioned driver command, could you or assign someone to take a look?

JehandadKhan · 2022-06-30T19:14:49Z

@atamazov Can you please investigate this, if you have time ?

atamazov · 2022-07-01T23:46:09Z

@JehandadKhan With pleasure, but I do not have gfx908/90a available. Or there is some MI100/200 node available for open-source developers?

atamazov · 2022-07-01T23:46:48Z

Or this is reproducible on MI50 or Navi21?

junliume · 2022-08-03T07:31:13Z

@atamazov @JehandadKhan This issue still exists in the latest build, so far I have tested on gfx908 and gfx1030 and only gfx908 has this problem: (I have attached detailed logs)
First Run Is okay

MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpen Forward Conv. Algorithm: 5, Solution: 74/ConvAsmImplicitGemmGTCDynamicFwdXdlops
GPU Kernel Time Forward Conv. Elapsed: 0.021280 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv1x1u1, 16, 76, 9, 9, 1, 1, 32,  6303744, 403712, 165888, 
[pass_log.log](https://github.com/ROCmSoftwarePlatform/MIOpen/files/9249012/pass_log.log)
[fail_log.log](https://github.com/ROCmSoftwarePlatform/MIOpen/files/9249014/fail_log.log)
296, 27, 0.021280
Forward Convolution Verifies OK on CPU reference (3.64455e-08)

pass_log.log
fail_log.log

Second Run Is NOT okay

root@ixt-rack-148:/opt/rocm# ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpen Forward Conv. Algorithm: 5, Solution: 74/ConvAsmImplicitGemmGTCDynamicFwdXdlops
GPU Kernel Time Forward Conv. Elapsed: 0.021333 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv1x1u1, 16, 76, 9, 9, 1, 1, 32,  6303744, 403712, 165888, 295, 27, 0.021333
Forward Convolution FAILED: 0.026341 > 1.5e-05

carlushuang · 2022-08-04T11:35:31Z

I tried to reproduce this issue, and found, if every time we manually delete the user db (or just run for the first time, since there is no user db yet), then we can have the correct result.

e.g, using docker compute-artifactory.amd.com:5000/rocm-plus-docker/framework/compute-rocm-dkms-no-npi-hipclang:10584_ubuntu20.04_py3.7_pytorch_rocm5.3_internal_testing_9b13302

# inside above docker, run for first time, will have correct anwser
/opt/rocm//bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 
# then, repeat for a second time, you will get computation error
/opt/rocm//bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1

However, if before run the cmd we manually delete the user db (which should exist in ~/.config/miopen/*.ufdb.txt)

# inside above docker, run for first time, will have correct anwser
/opt/rocm//bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 
# then, manually delete the user db before run the same cmd, you will get the correct result
rm -rf ~/.config/miopen/*.ufdb.txt
/opt/rocm//bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1

@JehandadKhan can you please take a look at this behavior?

carlushuang · 2022-08-04T11:43:37Z

Wait, #1619 should disabled above solver.

# first time, manually tune the kernel, will have correct result, and actually that solver reports not applicable
export MIOPEN_FIND_MODE=4
export MIOPEN_FIND_ENFORCE=1
/opt/rocm//bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 

# second time, launch normally, will have correct result
/opt/rocm//bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1

I guess just retune the db should be fine? @JehandadKhan

DrizztDoUrden · 2022-08-04T12:09:05Z

I guess just retune the db should be fine?

That may work as a temporary hack for manually picked cases (AFAIK we have one rn, but it is possible to gather them from running tests two times in a row and logging failures, but continuing), but we don't know if there are other cases it would fail at that are not covered by our tests. And, obviously, it is impossible to test every case in sane amount of time.

atamazov · 2022-08-04T12:40:37Z

@junliume

This issue still exists in the latest build...

AFAICS from the logs, you've used latest amd-master (Mainline) which is bfe7103 and 21 days old. #1619 is not there yet; it is promoted into Staging for now. We shall either promote Staging into Master or wait until release branch is cut and then cherry-pick #1619 directly there.

(I am assuming that the reason of this issue is ConvAsmImplicitGemmGTCDynamicFwdXdlops).

junliume · 2022-08-04T19:59:26Z

@carlushuang @atamazov Thanks for the detective work! It seems that #1619 is critical. However, recent staging has found that #1619 has caused some performance regressions. Instead of disabling it I think we need to fix it afterall.

carlushuang · 2022-08-05T07:07:18Z

#1675 is for narrow down the non-applicable range

ppanchad-amd · 2024-04-16T15:44:03Z

@junliume Is this issue fixed with latest ROCm 6.0.2 (HIP 6.0.32831)? Thanks!

junliume added bug help wanted value_high urgency_high labels Mar 11, 2022

junliume changed the title ~~[bug] 1x1 convolution failure passing from MIOpen to rocBLAS~~ [bug][gfx908] 1x1 convolution failure passing from MIOpen to rocBLAS Mar 14, 2022

JehandadKhan self-assigned this Mar 21, 2022

JehandadKhan assigned atamazov Jun 30, 2022

junliume modified the milestones: ROCm 5.2, ROCm 5.3 Jun 30, 2022

junliume pinned this issue Aug 3, 2022

junliume added urgency_blocker and removed urgency_high labels Aug 3, 2022

aska-0096 unpinned this issue Aug 4, 2022

junliume removed this from the ROCm 5.3 milestone Aug 18, 2022

junliume added this to the ROCm 5.4 milestone Aug 18, 2022

junliume removed this from the ROCm 5.4 milestone Jan 11, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[bug][gfx908] 1x1 convolution failure passing from MIOpen to rocBLAS #1460

[bug][gfx908] 1x1 convolution failure passing from MIOpen to rocBLAS #1460

junliume commented Mar 11, 2022

junliume commented Mar 14, 2022 •

edited

Loading

atamazov commented Mar 21, 2022

junliume commented Mar 21, 2022

atamazov commented Mar 21, 2022 •

edited

Loading

junliume commented Mar 21, 2022

JehandadKhan commented Jun 30, 2022

atamazov commented Jul 1, 2022

atamazov commented Jul 1, 2022

junliume commented Aug 3, 2022

carlushuang commented Aug 4, 2022

carlushuang commented Aug 4, 2022

DrizztDoUrden commented Aug 4, 2022

atamazov commented Aug 4, 2022

junliume commented Aug 4, 2022

carlushuang commented Aug 5, 2022

ppanchad-amd commented Apr 16, 2024

[bug][gfx908] 1x1 convolution failure passing from MIOpen to rocBLAS #1460

[bug][gfx908] 1x1 convolution failure passing from MIOpen to rocBLAS #1460

Comments

junliume commented Mar 11, 2022

junliume commented Mar 14, 2022 • edited Loading

atamazov commented Mar 21, 2022

junliume commented Mar 21, 2022

atamazov commented Mar 21, 2022 • edited Loading

junliume commented Mar 21, 2022

JehandadKhan commented Jun 30, 2022

atamazov commented Jul 1, 2022

atamazov commented Jul 1, 2022

junliume commented Aug 3, 2022

carlushuang commented Aug 4, 2022

carlushuang commented Aug 4, 2022

DrizztDoUrden commented Aug 4, 2022

atamazov commented Aug 4, 2022

junliume commented Aug 4, 2022

carlushuang commented Aug 5, 2022

ppanchad-amd commented Apr 16, 2024

junliume commented Mar 14, 2022 •

edited

Loading

atamazov commented Mar 21, 2022 •

edited

Loading