Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bug][gfx908] 1x1 convolution failure passing from MIOpen to rocBLAS #1460

Open
junliume opened this issue Mar 11, 2022 · 16 comments
Open

[bug][gfx908] 1x1 convolution failure passing from MIOpen to rocBLAS #1460

junliume opened this issue Mar 11, 2022 · 16 comments

Comments

@junliume
Copy link
Collaborator

./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1

On gfx908 :

MIOpen(HIP): Info [ConvolutionForward] algo = 5, workspace = 0
MIOpen(HIP): Info2 [GetInvoker] Returning an invoker for problem 76x9x9x1x1x32x9x9x16xNCHWxFP32x0x0x1x1x1x1x1xF and algorithm miopenConvolutionFwdAlgoImplicitGEMM
:0:rocdevice.cpp :2616: 103657926263 us: 37351: [tid:0x7fe289112700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. code: 0x1008
Aborted (core dumped)

On gfx90a:

MIOpen(HIP): Info [ConvolutionForward] algo = 0, workspace = 0
MIOpen(HIP): Info2 [GetInvoker] Returning an invoker for problem 76x9x9x1x1x32x9x9x16xNCHWxFP32x0x0x1x1x1x1x1xF and algorithm miopenConvolutionFwdAlgoGEMM
MIOpen(HIP): auto miopen::solver::GemmFwd1x1_0_1::GetSolution(const miopen::ExecutionContext &, const conv::ProblemDescription &)::(anonymous class)::operator()(const std::vector &)::(anonymous class)::operator()(const miopen::Handle &, const miopen::AnyInvokeParams &) const{
MIOpen(HIP): "convolution, 1x1" = convolution, 1x1
MIOpen(HIP): }
MIOpen(HIP): Info2 [CallGemmStridedBatched] gemm_desc: {isColMajor 0, transA 0, transB 0, m 32, n 81, k 76, lda 76, ldb 81, ldc 81, batch_count 16, strideA 0, strideB 6156, strideC 2592, alpha 1, beta 0, dataType 1}
MIOpen(HIP): miopenStatus_t miopen::CallGemmStridedBatched(const miopen::Handle &, miopen::GemmDescriptor, ConstData_t, int, ConstData_t, int, Data_t, int, miopen::FindDbKCacheKey *, miopen::GemmBackend_t, bool){
MIOpen(HIP): "rocBLAS" = rocBLAS
MIOpen(HIP): }
MIOpen Forward Conv. Algorithm: 0, Solution: 88/GemmFwd1x1_0_1

@junliume junliume changed the title [bug] 1x1 convolution failure passing from MIOpen to rocBLAS [bug][gfx908] 1x1 convolution failure passing from MIOpen to rocBLAS Mar 14, 2022
@junliume
Copy link
Collaborator Author

junliume commented Mar 14, 2022

It’s quite interesting that if you try the command on a fresh docker container, it would actually work. However, run it a few times, and it consistently fails thereafter.

[] ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpen Forward Conv. Algorithm: 5, Solution: 74/ConvAsmImplicitGemmGTCDynamicFwdXdlops
GPU Kernel Time Forward Conv. Elapsed: 0.018987 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv1x1u1, 16, 76, 9, 9, 1, 1, 32,  6303744, 403712, 165888, 332, 30, 0.018987
Forward Convolution Verifies OK on CPU reference (3.61839e-08)
[] ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
:0:rocdevice.cpp            :2616: 29908722944 us: 167  : [tid:0x7f3cee38c700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_MEMORY_FAULT: Agent attempted to access an inaccessible address. code: 0x2b
Aborted (core dumped)
[] ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
:0:rocdevice.cpp            :2616: 29911124821 us: 169  : [tid:0x7fc0e1ac4700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. code: 0x1008
Aborted (core dumped)

@atamazov
Copy link
Contributor

Looks like a HIP runtime problem or issue in the invoker of ConvAsmImplicitGemmGTCDynamicFwdXdlops.

The difference between fresh and used container is: binary cache and find-db. Does removing these helps?

Is this scenario (test passed one or more times, then memory fault(s), then consistently out of resources) stable or random?

/cc @DrizztDoUrden

@junliume
Copy link
Collaborator Author

@atamazov this issue is currently assigned to runtime.
Yes clearing cache and config does help:

[First Run Fresh]:/opt/rocm/miopen# ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpen Forward Conv. Algorithm: 5, Solution: 74/ConvAsmImplicitGemmGTCDynamicFwdXdlops
GPU Kernel Time Forward Conv. Elapsed: 0.019431 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv1x1u1, 16, 76, 9, 9, 1, 1, 32, 6303744, 403712, 165888, 324, 29, 0.019431
Forward Convolution Verifies OK on CPU reference (3.61839e-08)
[Second Run NOT Fresh]:/opt/rocm/miopen# ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
:0:rocdevice.cpp :2614: 472573914516 us: 163 : [tid:0x7f2699c78700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_ILLEGAL_INSTRUCTION: The agent attempted to execute an illegal shader instruction. code: 0x2a
Aborted (core dumped)
[Clear Cache]:/opt/rocm/miopen# rm -rf ~/.cache/miopen/
[Clear Config]:/opt/rocm/miopen# rm -rf ~/.config/miopen/
[Third Run Fresh]:/opt/rocm/miopen# ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpen Forward Conv. Algorithm: 5, Solution: 74/ConvAsmImplicitGemmGTCDynamicFwdXdlops
GPU Kernel Time Forward Conv. Elapsed: 0.019076 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv1x1u1, 16, 76, 9, 9, 1, 1, 32, 6303744, 403712, 165888, 330, 30, 0.019076
Forward Convolution Verifies OK on CPU reference (3.61839e-08)
[Fourth Run NOT Fresh]:/opt/rocm/miopen# ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
:0:rocdevice.cpp :2614: 472627880390 us: 317 : [tid:0x7fe398062700] Device::callbackQueue aborting with error : HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events. code: 0x1008
Aborted (core dumped)

@atamazov
Copy link
Contributor

atamazov commented Mar 21, 2022

@junliume

Yes clearing cache and config does help...

Then this is most likely MIOpen issue. I am afraid it'll be back.
image

Can you please find out if clearing find-db or clearing binary cache helps?

@junliume
Copy link
Collaborator Author

@cderb @JehandadKhan this is an easily reproducible issue with the above mentioned driver command, could you or assign someone to take a look?

@JehandadKhan
Copy link
Collaborator

@atamazov Can you please investigate this, if you have time ?

@junliume junliume modified the milestones: ROCm 5.2, ROCm 5.3 Jun 30, 2022
@atamazov
Copy link
Contributor

atamazov commented Jul 1, 2022

@JehandadKhan With pleasure, but I do not have gfx908/90a available. Or there is some MI100/200 node available for open-source developers?

@atamazov
Copy link
Contributor

atamazov commented Jul 1, 2022

Or this is reproducible on MI50 or Navi21?

@junliume
Copy link
Collaborator Author

junliume commented Aug 3, 2022

@atamazov @JehandadKhan This issue still exists in the latest build, so far I have tested on gfx908 and gfx1030 and only gfx908 has this problem: (I have attached detailed logs)
First Run Is okay

MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpen Forward Conv. Algorithm: 5, Solution: 74/ConvAsmImplicitGemmGTCDynamicFwdXdlops
GPU Kernel Time Forward Conv. Elapsed: 0.021280 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv1x1u1, 16, 76, 9, 9, 1, 1, 32,  6303744, 403712, 165888, 
[pass_log.log](https://github.com/ROCmSoftwarePlatform/MIOpen/files/9249012/pass_log.log)
[fail_log.log](https://github.com/ROCmSoftwarePlatform/MIOpen/files/9249014/fail_log.log)
296, 27, 0.021280
Forward Convolution Verifies OK on CPU reference (3.64455e-08)

pass_log.log
fail_log.log

Second Run Is NOT okay

root@ixt-rack-148:/opt/rocm# ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
MIOpen Forward Conv. Algorithm: 5, Solution: 74/ConvAsmImplicitGemmGTCDynamicFwdXdlops
GPU Kernel Time Forward Conv. Elapsed: 0.021333 ms (average)
stats: name, n, c, ho, wo, x, y, k, flopCnt, bytesRead, bytesWritten, GFLOPs, GB/s, timeMs
stats: fwd-conv1x1u1, 16, 76, 9, 9, 1, 1, 32,  6303744, 403712, 165888, 295, 27, 0.021333
Forward Convolution FAILED: 0.026341 > 1.5e-05

@junliume junliume pinned this issue Aug 3, 2022
@carlushuang
Copy link
Contributor

I tried to reproduce this issue, and found, if every time we manually delete the user db (or just run for the first time, since there is no user db yet), then we can have the correct result.

e.g, using docker compute-artifactory.amd.com:5000/rocm-plus-docker/framework/compute-rocm-dkms-no-npi-hipclang:10584_ubuntu20.04_py3.7_pytorch_rocm5.3_internal_testing_9b13302

# inside above docker, run for first time, will have correct anwser
/opt/rocm//bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 
# then, repeat for a second time, you will get computation error
/opt/rocm//bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 

However, if before run the cmd we manually delete the user db (which should exist in ~/.config/miopen/*.ufdb.txt)

# inside above docker, run for first time, will have correct anwser
/opt/rocm//bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 
# then, manually delete the user db before run the same cmd, you will get the correct result
rm -rf ~/.config/miopen/*.ufdb.txt
/opt/rocm//bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 

@JehandadKhan can you please take a look at this behavior?

@aska-0096 aska-0096 unpinned this issue Aug 4, 2022
@carlushuang
Copy link
Contributor

Wait, #1619 should disabled above solver.

# first time, manually tune the kernel, will have correct result, and actually that solver reports not applicable
export MIOPEN_FIND_MODE=4
export MIOPEN_FIND_ENFORCE=1
/opt/rocm//bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 

# second time, launch normally, will have correct result
/opt/rocm//bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1 

I guess just retune the db should be fine? @JehandadKhan

@DrizztDoUrden
Copy link
Contributor

I guess just retune the db should be fine?

That may work as a temporary hack for manually picked cases (AFAIK we have one rn, but it is possible to gather them from running tests two times in a row and logging failures, but continuing), but we don't know if there are other cases it would fail at that are not covered by our tests. And, obviously, it is impossible to test every case in sane amount of time.

@atamazov
Copy link
Contributor

atamazov commented Aug 4, 2022

@junliume

This issue still exists in the latest build...

AFAICS from the logs, you've used latest amd-master (Mainline) which is bfe7103 and 21 days old. #1619 is not there yet; it is promoted into Staging for now. We shall either promote Staging into Master or wait until release branch is cut and then cherry-pick #1619 directly there.

(I am assuming that the reason of this issue is ConvAsmImplicitGemmGTCDynamicFwdXdlops).

@junliume
Copy link
Collaborator Author

junliume commented Aug 4, 2022

@carlushuang @atamazov Thanks for the detective work! It seems that #1619 is critical. However, recent staging has found that #1619 has caused some performance regressions. Instead of disabling it I think we need to fix it afterall.

@carlushuang
Copy link
Contributor

#1675 is for narrow down the non-applicable range

@junliume junliume removed this from the ROCm 5.3 milestone Aug 18, 2022
@junliume junliume added this to the ROCm 5.4 milestone Aug 18, 2022
@junliume junliume removed this from the ROCm 5.4 milestone Jan 11, 2023
@ppanchad-amd
Copy link

@junliume Is this issue fixed with latest ROCm 6.0.2 (HIP 6.0.32831)? Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

6 participants