[bug][gfx908] 1x1 convolution failure passing from MIOpen to rocBLAS #1460
Comments
It’s quite interesting that if you try the command in a fresh docker container, it actually works. However, after it has been run a few times, it consistently fails.
|
Looks like a HIP runtime problem or an issue in the invoker of ConvAsmImplicitGemmGTCDynamicFwdXdlops. The difference between a fresh and a used container is the binary cache and the find-db. Does removing these help? Is this scenario (test passes one or more times, then memory fault(s), then consistently out of resources) stable or random? /cc @DrizztDoUrden |
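One way to answer the stable-or-random question above is to repeat the same command and record the exit status of each run. The following is only a minimal shell sketch; `repeat_run` is a hypothetical helper written for this thread, not part of MIOpen or ROCm.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of MIOpen): run a command N times and
# print the exit status of every run, space-separated, on one line.
# Output of the command itself is discarded; only the statuses are kept.
repeat_run() {
    local n="$1" i=0 rc codes=""
    shift
    while [ "$i" -lt "$n" ]; do
        "$@" >/dev/null 2>&1
        rc=$?
        codes="${codes}${codes:+ }${rc}"
        i=$((i + 1))
    done
    echo "$codes"
}

# Example invocation with the driver command from this issue:
# repeat_run 5 ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 \
#     -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
```

A sequence that starts with zeros and then stays non-zero would suggest the stable pattern described above, while mixed zeros and non-zeros would point to a random failure.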
@atamazov this issue is currently assigned to runtime.
|
@cderb @JehandadKhan this issue is easily reproducible with the driver command mentioned above. Could you take a look, or assign someone to? |
@atamazov Can you please investigate this, if you have time? |
@JehandadKhan With pleasure, but I do not have gfx908/90a available. Or is there an MI100/200 node available for open-source developers? |
Or is this reproducible on MI50 or Navi21? |
@atamazov @JehandadKhan This issue still exists in the latest build. So far I have tested on gfx908 and gfx1030, and only gfx908 has this problem (I have attached detailed logs):
Second Run Is NOT okay
|
I tried to reproduce this issue and found that if we manually delete the user db every time (or just run for the first time, when there is no user db yet), we get the correct result. E.g., using docker
However, if we manually delete the user db (which should exist in ~/.config/miopen/*.ufdb.txt) before running the command, the result is correct again.
@JehandadKhan can you please take a look at this behavior? |
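The reset-then-run experiment described above could be scripted roughly as follows. This is a sketch under assumptions: the default directory is the `~/.config/miopen` location quoted in the comment, while `clear_user_db`, `run_clean`, and the `UFDB_DIR` variable are hypothetical names local to this sketch, not MIOpen settings.

```bash
#!/usr/bin/env bash
# Hypothetical helper: delete user find-db files (*.ufdb.txt) from a
# directory and print how many were removed. The default directory is
# the location mentioned in the comment above.
clear_user_db() {
    local dir="${1:-$HOME/.config/miopen}"
    local f count=0
    for f in "$dir"/*.ufdb.txt; do
        [ -e "$f" ] || continue   # the glob matched nothing
        rm -f -- "$f"
        count=$((count + 1))
    done
    echo "$count"
}

# Hypothetical wrapper: clear the user db, then run the given command.
# UFDB_DIR is a variable local to this sketch, not an MIOpen setting.
run_clean() {
    clear_user_db "${UFDB_DIR:-$HOME/.config/miopen}" >/dev/null
    "$@"
}

# Example invocation with the driver command from this issue:
# run_clean ./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 \
#     -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
```

Running the command once through `run_clean` and once directly would show whether the stale user db alone is enough to trigger the failure.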
Wait, #1619 should have disabled the above solver.
I guess just retuning the db should be fine? @JehandadKhan |
That may work as a temporary hack for manually picked cases (AFAIK we have one rn; it is possible to gather more by running the tests twice in a row, logging failures, and continuing), but we don't know whether there are other failing cases that our tests do not cover. And, obviously, it is impossible to test every case in a sane amount of time. |
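The "run twice, log failures, keep going" idea above might look like the sketch below. It assumes a plain text file with one test command per line; `run_twice_log` and the tab-separated log format are hypothetical choices for illustration, not an existing MIOpen test facility.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run_twice_log LOGFILE CMDFILE
# Runs every non-empty line of CMDFILE twice in a row via `sh -c`.
# A line is written to LOGFILE (with both exit statuses) if either run
# fails; the loop always continues with the next command.
run_twice_log() {
    local log="$1" cmds="$2" line first second
    : > "$log"
    while IFS= read -r line || [ -n "$line" ]; do
        [ -z "$line" ] && continue
        # stdin is redirected so the command cannot consume CMDFILE
        sh -c "$line" >/dev/null 2>&1 </dev/null; first=$?
        sh -c "$line" >/dev/null 2>&1 </dev/null; second=$?
        if [ "$first" -ne 0 ] || [ "$second" -ne 0 ]; then
            printf '%s\tfirst=%s\tsecond=%s\n' "$line" "$first" "$second" >> "$log"
        fi
    done < "$cmds"
}
```

Entries with `first=0` and a non-zero `second` are the candidates that pass on a fresh state and fail once the db has been populated, which is the pattern reported in this issue.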
AFAICS from the logs, you've used the latest amd-master (Mainline), which is bfe7103 and 21 days old. #1619 is not there yet; it has only been promoted into Staging so far. We shall either promote Staging into Master, or wait until the release branch is cut and then cherry-pick #1619 directly there. (I am assuming that the cause of this issue is ConvAsmImplicitGemmGTCDynamicFwdXdlops). |
@carlushuang @atamazov Thanks for the detective work! It seems that #1619 is critical. However, recent staging has found that #1619 has caused some performance regressions. Instead of disabling the solver, I think we need to fix it after all. |
#1675 is for narrowing down the non-applicable range |
@junliume Is this issue fixed with the latest ROCm 6.0.2 (HIP 6.0.32831)? Thanks! |
./bin/MIOpenDriver conv -n 16 -c 76 -H 9 -W 9 -k 32 -y 1 -x 1 -p 0 -q 0 -u 1 -v 1 -l 1 -j 1 -m conv -g 1 -F 1 -t 1
On gfx908:
On gfx90a: