[Issue]: rocm 6.1.0 with wheel pytorch at rx7600xt and running stable diffusion => HSA_STATUS_ERROR_INVALID_ISA: The instruction set architecture is invalid. code: 0x100f #3054
Comments
Same issue with Radeon 780M IGP inside 7840U, which is gfx1103. |
Same with RX 7600 on ROCm 6.0.2 on Arch. |
Same issue here with a SER7 mini PC (AMD Ryzen 7 7840HS w/ Radeon 780M Graphics) on the ROCm 6.0.0 that's bundled with Fedora 40. One thing I've noticed is that it isn't triggered with fp32 models; the issue only arises when loading fp16 models or doing fp16 operations. I get so many amdgpu crashes with ComfyUI :(, sometimes it recovers, sometimes it doesn't.
May 06 20:52:06.678044 fedora kernel: amdgpu: amdgpu_amdkfd_restore_userptr_worker: Failed to resume KFD
May 06 20:57:48.841133 fedora kernel: [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
May 06 20:57:48.842808 fedora kernel: amdgpu 0000:c6:00.0: amdgpu: failed to remove hardware queue from MES, doorbell=0x1002
May 06 20:57:48.843368 fedora kernel: amdgpu 0000:c6:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
May 06 20:57:48.843488 fedora kernel: amdgpu 0000:c6:00.0: amdgpu: Failed to evict queue 1
May 06 20:57:48.843587 fedora kernel: amdgpu: Failed to evict process queues
May 06 20:57:48.843598 fedora kernel: amdgpu 0000:c6:00.0: amdgpu: GPU reset begin!
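The fp16-only trigger described above can be checked with a minimal smoke test. This is just a sketch (the function name is mine): it assumes a ROCm build of PyTorch, where the HIP device is exposed through the `torch.cuda` API, and skips gracefully when no GPU or no PyTorch is present.

```python
# Minimal fp16 smoke test: run one half-precision matmul on the GPU.
# On ROCm builds of PyTorch the HIP device is reached via torch.cuda.
try:
    import torch
except ImportError:
    torch = None

def fp16_smoke_test() -> str:
    if torch is None or not torch.cuda.is_available():
        return "skipped: no ROCm PyTorch / no GPU"
    a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    c = a @ b                     # the kind of fp16 op that trips the crash
    torch.cuda.synchronize()      # force the kernel to actually execute
    return f"ok: fp16 matmul ran, mean={c.float().mean().item():.4f}"

if __name__ == "__main__":
    print(fp16_smoke_test())
```

If this alone reproduces the MES errors in dmesg, the problem is independent of ComfyUI and purely in the fp16 kernel path.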
If I run ComfyUI like this, it crashes amdgpu during the first generation: python main.py --use-split-cross-attention --disable-cuda-malloc --force-fp16 --fp16-unet --fp16-vae
[ 2244.423381] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 2244.423555] amdgpu 0000:c6:00.0: amdgpu: failed to remove hardware queue from MES, doorbell=0x1002
[ 2244.423557] amdgpu 0000:c6:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
[ 2244.423562] amdgpu 0000:c6:00.0: amdgpu: Failed to evict queue 1
[ 2244.423563] amdgpu: Failed to evict process queues
[ 2244.423584] amdgpu 0000:c6:00.0: amdgpu: GPU reset begin!
[ 2244.557162] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
[ 2244.557331] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 2244.686375] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
[ 2244.686519] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 2244.722199] amdgpu 0000:c6:00.0: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_wait_for_state line:839
[ 2244.826144] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
[ 2244.826296] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 2244.937723] amdgpu 0000:c6:00.0: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_wait_for_state line:839
[ 2244.955289] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
[ 2244.955425] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 2245.084781] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
[ 2245.085321] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 2245.153317] amdgpu 0000:c6:00.0: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_wait_for_state line:839
[ 2245.214478] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
[ 2245.214612] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 2245.290256] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 2245.290416] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 2245.345326] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
[ 2245.345467] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 2245.419315] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 2245.419459] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 2245.474638] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
[ 2245.474770] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 2245.548276] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 2245.548417] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 2245.605132] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
[ 2245.605280] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 2245.677534] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 2245.677671] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 2245.734707] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
[ 2245.734884] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 2245.805152] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 2245.805363] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 2245.868985] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
[ 2245.869187] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 2245.938470] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 2245.938672] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 2246.002419] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
[ 2246.002644] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 2246.070654] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 2246.070854] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 2246.137502] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
[ 2246.137685] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 2246.203915] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 2246.204114] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 2246.270888] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=14
[ 2246.271065] [drm:amdgpu_mes_reg_write_reg_wait [amdgpu]] *ERROR* failed to reg_write_reg_wait
[ 2246.337204] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 2246.337389] [drm:amdgpu_mes_unmap_legacy_queue [amdgpu]] *ERROR* failed to unmap legacy queue
[ 2246.338964] amdgpu 0000:c6:00.0: amdgpu: MODE2 reset
[ 2246.373708] amdgpu 0000:c6:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 2246.374197] [drm] PCIE GART of 512M enabled (table at 0x00000083FFD00000).
[ 2246.374380] [drm] VRAM is lost due to GPU reset!
[ 2246.374383] amdgpu 0000:c6:00.0: amdgpu: SMU is resuming...
[ 2246.375639] amdgpu 0000:c6:00.0: amdgpu: SMU is resumed successfully!
[ 2246.377051] [drm] DMUB hardware initialized: version=0x08003700
[ 2246.600168] [drm] kiq ring mec 3 pipe 1 q 0
[ 2246.601916] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[ 2246.602002] amdgpu 0000:c6:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
[ 2246.602537] amdgpu 0000:c6:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 2246.602539] amdgpu 0000:c6:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 2246.602541] amdgpu 0000:c6:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 2246.602543] amdgpu 0000:c6:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[ 2246.602545] amdgpu 0000:c6:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[ 2246.602547] amdgpu 0000:c6:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[ 2246.602549] amdgpu 0000:c6:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[ 2246.602550] amdgpu 0000:c6:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[ 2246.602552] amdgpu 0000:c6:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[ 2246.602554] amdgpu 0000:c6:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 2246.602555] amdgpu 0000:c6:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[ 2246.602557] amdgpu 0000:c6:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
[ 2246.602559] amdgpu 0000:c6:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[ 2246.603835] amdgpu 0000:c6:00.0: amdgpu: recover vram bo from shadow start
[ 2246.603837] amdgpu 0000:c6:00.0: amdgpu: recover vram bo from shadow done
[ 2246.603855] amdgpu 0000:c6:00.0: amdgpu: GPU reset(1) succeeded!
[ 2248.012487] rfkill: input handler enabled
[ 2248.272906] amdgpu 0000:c6:00.0: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_wait_for_state line:839
[ 2248.488619] amdgpu 0000:c6:00.0: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_wait_for_state line:839
[ 2248.704623] amdgpu 0000:c6:00.0: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_wait_for_state line:839
[ 2250.079124] rfkill: input handler disabled
[ 2259.685882] [drm:mes_v11_0_submit_pkt_and_poll_completion.constprop.0 [amdgpu]] *ERROR* MES failed to response msg=3
[ 2259.686054] amdgpu 0000:c6:00.0: amdgpu: failed to remove hardware queue from MES, doorbell=0x1000
[ 2259.686057] amdgpu 0000:c6:00.0: amdgpu: MES might be in unrecoverable state, issue a GPU reset
[ 2259.686059] amdgpu 0000:c6:00.0: amdgpu: Failed to remove queue 0
[ 2259.686103] amdgpu 0000:c6:00.0: amdgpu: GPU reset begin!
[ 2259.916472] amdgpu 0000:c6:00.0: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_wait_for_state line:839
[ 2260.133458] amdgpu 0000:c6:00.0: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_wait_for_state line:839
[ 2260.350076] amdgpu 0000:c6:00.0: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_wait_for_state line:839
[ 2260.430349] amdgpu 0000:c6:00.0: amdgpu: MODE2 reset
[ 2260.465620] amdgpu 0000:c6:00.0: amdgpu: GPU reset succeeded, trying to resume
[ 2260.466077] [drm] PCIE GART of 512M enabled (table at 0x00000083FFD00000).
[ 2260.466159] [drm] VRAM is lost due to GPU reset!
[ 2260.466170] amdgpu 0000:c6:00.0: amdgpu: SMU is resuming...
[ 2260.467166] amdgpu 0000:c6:00.0: amdgpu: SMU is resumed successfully!
[ 2260.468603] [drm] DMUB hardware initialized: version=0x08003700
[ 2260.691721] [drm] kiq ring mec 3 pipe 1 q 0
[ 2260.694214] [drm] VCN decode and encode initialized successfully(under DPG Mode).
[ 2260.694304] amdgpu 0000:c6:00.0: [drm:jpeg_v4_0_hw_init [amdgpu]] JPEG decode initialized successfully.
[ 2260.694879] amdgpu 0000:c6:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[ 2260.694882] amdgpu 0000:c6:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[ 2260.694884] amdgpu 0000:c6:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[ 2260.694886] amdgpu 0000:c6:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 6 on hub 0
[ 2260.694888] amdgpu 0000:c6:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 7 on hub 0
[ 2260.694889] amdgpu 0000:c6:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 8 on hub 0
[ 2260.694892] amdgpu 0000:c6:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 9 on hub 0
[ 2260.694893] amdgpu 0000:c6:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 10 on hub 0
[ 2260.694894] amdgpu 0000:c6:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 11 on hub 0
[ 2260.694896] amdgpu 0000:c6:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[ 2260.694898] amdgpu 0000:c6:00.0: amdgpu: ring vcn_unified_0 uses VM inv eng 0 on hub 8
[ 2260.694900] amdgpu 0000:c6:00.0: amdgpu: ring jpeg_dec uses VM inv eng 1 on hub 8
[ 2260.694901] amdgpu 0000:c6:00.0: amdgpu: ring mes_kiq_3.1.0 uses VM inv eng 13 on hub 0
[ 2260.696172] amdgpu 0000:c6:00.0: amdgpu: recover vram bo from shadow start
[ 2260.696175] amdgpu 0000:c6:00.0: amdgpu: recover vram bo from shadow done
[ 2260.696194] amdgpu 0000:c6:00.0: amdgpu: GPU reset(2) succeeded!
[ 2261.704123] rfkill: input handler enabled
[ 2261.980781] amdgpu 0000:c6:00.0: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_wait_for_state line:839
[ 2262.196811] amdgpu 0000:c6:00.0: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_wait_for_state line:839
[ 2262.413335] amdgpu 0000:c6:00.0: [drm] REG_WAIT timeout 1us * 100000 tries - optc1_wait_for_state line:839
Got lucky, it recovered and I didn't have to restart! If I run it like this: python main.py --use-split-cross-attention --disable-cuda-malloc, I can make a couple of runs before it crashes. If I run it two or three times and then restart the server, I can keep it running without bringing down the system for quite a while. I have 16 GB of RAM assigned as VRAM via the BIOS. I have this in my .bashrc:
export PYTORCH_ROCM_ARCH="gfx1100"
export HSA_OVERRIDE_GFX_VERSION=11.0.0
export HSA_ENABLE_SDMA=0
export PYTORCH_HIP_ALLOC_CONF="garbage_collection_threshold:0.7,max_split_size_mb:1024"
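Note these overrides only take effect if they are in the environment before the HIP/HSA runtime initializes, which in practice means before `import torch`. A small guard at the top of the launch script can make that explicit; this is just a sketch, and the values simply mirror the .bashrc exports above.

```python
import os

# These must be set BEFORE `import torch`: the HIP/HSA runtime reads them
# once at initialization. Values mirror the .bashrc exports above
# (gfx1103 on the 780M is spoofed as gfx1100 via HSA_OVERRIDE_GFX_VERSION).
overrides = {
    "HSA_OVERRIDE_GFX_VERSION": "11.0.0",
    "HSA_ENABLE_SDMA": "0",
    "PYTORCH_HIP_ALLOC_CONF": "garbage_collection_threshold:0.7,max_split_size_mb:1024",
}
for key, value in overrides.items():
    os.environ.setdefault(key, value)  # keep any value already exported

print(os.environ["HSA_OVERRIDE_GFX_VERSION"])
```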
So frustrated :/. I have a question: does this happen with dedicated cards that are unsupported? Would it be less likely if I had a real desktop with, say, a 7900 GRE? This is the ROCm 6.0 build of PyTorch on ROCm 6.0. It's less buggy if I use the ROCm 5.7 build of PyTorch; that combo usually only crashes when I run out of memory. |
Tested the same setup on a 7800XT equipped desktop, Fedora 40 with ROCm 6.0. No such issue. |
Seems like the same with an RX 7600 (XT).
This mainly affects integrated GPUs, I guess.
|
Problem Description
Hi, I have an RX 7600 XT. I installed ROCm 6.1, and today I installed the PyTorch wheel in a venv:
http://repo.radeon.com/rocm/manylinux/rocm-rel-6.1/torch-2.1.2%2Brocm6.1-cp310-cp310-linux_x86_64.whl
But when I run it, this error occurs:
:0:rocdevice.cpp :2879: 3764010792 us: [pid:10952 tid:0x75a40ebff640] Callback: Queue 0x75a278c00000 aborting with error : HSA_STATUS_ERROR_INVALID_ISA: The instruction set architecture is invalid. code: 0x100f
Stable Diffusion works with the ROCm 5.7 build of PyTorch in the venv environment, but not with the ROCm 6.0 or ROCm 6.1 PyTorch builds.
AUTOMATIC1111/stable-diffusion-webui#15434
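HSA_STATUS_ERROR_INVALid_ISA generally means the wheel does not ship compiled kernels for the GPU's gfx target. A quick way to cross-check is to pull the gfx target out of rocminfo output and derive the matching HSA_OVERRIDE_GFX_VERSION value. This is a sketch: the helper names and the sample text are mine, and the decoding assumes AMD's usual gfx&lt;major&gt;&lt;minor&gt;&lt;step&gt; naming where minor and step are single hex digits.

```python
import re

def gfx_targets(rocminfo_output: str) -> list[str]:
    """Extract gfx ISA names (e.g. 'gfx1102') from `rocminfo` output."""
    return sorted(set(re.findall(r"gfx[0-9a-f]+", rocminfo_output)))

def gfx_to_override(gfx: str) -> str:
    """Map a gfx name to the HSA_OVERRIDE_GFX_VERSION format.

    In gfx<major><minor><step>, minor and step are single hex digits;
    everything before them is the decimal major version.
    """
    core = gfx.removeprefix("gfx")
    major, minor, step = core[:-2], core[-2], core[-1]
    return f"{int(major)}.{int(minor, 16)}.{int(step, 16)}"

# Hypothetical fragment of rocminfo output for illustration:
sample = "  Name:                    gfx1102\n  Marketing Name:          AMD Radeon RX 7600 XT"
print(gfx_targets(sample))            # ['gfx1102']
print(gfx_to_override("gfx1102"))     # 11.0.2
print(gfx_to_override("gfx1100"))     # 11.0.0
```

If the GPU's own gfx target is not among the architectures the wheel was built for, spoofing a supported one (e.g. export HSA_OVERRIDE_GFX_VERSION=11.0.0 for RDNA3 parts, as in the .bashrc in an earlier comment) is the usual workaround.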
Operating System
OS: NAME="Ubuntu" VERSION="22.04.4 LTS (Jammy Jellyfish)"
CPU
CPU: model name : AMD Ryzen 5 5600 6-Core Processor
GPU
AMD Radeon RX 7900 XT
ROCm Version
ROCm 6.0.0
ROCm Component
No response
Steps to Reproduce
No response
(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support
Additional Information
No response