[Issue]: amdgpu driver errors #2642
Comments
I have the same issue. If it's solved, please let me know. dmesg log:
They said it's caused by a bug in the CPU page table update code. You can either revert the commit I mentioned and hope things work, or wait for them to release the actual fix.
I do not think this is a duplicate of #2596: the GCVM_L2_PROTECTION_FAULT_STATUS is different and there is no NULL pointer dereference. Also, the possible workarounds mentioned have no effect for this 0x00000B3C status (revert of https://lists.freedesktop.org/archives/amd-gfx/2023-October/100298.html, revert of 96c211f1f9ef82183493f4ceed4e347b52849149 as mentioned in https://gitlab.freedesktop.org/drm/amd/-/issues/2991, or changing amdgpu.vm_update_mode).

I also have an RX 7900 XT and the same error as here. There is also this stable-diffusion-webui report with the same card and the same code: AUTOMATIC1111/stable-diffusion-webui#14128. This sounds like a Navi 31 specific bug.

I am not sure whether the bug is directly in amdgpu or in ROCm, but running stable-diffusion-webui with ROCm 5.7 triggers a lot of these faults, until it triggers a GPU reset (dmesg log attached). As mentioned, this happens almost immediately with:
Update on my previous comment: I am now running kernel 6.7.0, and Stable Diffusion works well with ROCm 5.7.
Same here on Linux 6.6.10:
Running SD on 7900XTX with ROCm 5.7.
These GCVM_L2_PROTECTION_FAULT_STATUS errors seem to have completely disappeared after rebooting into kernel 6.7.7 (I also switched to ROCm 6).
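For anyone who wants to confirm the same thing on their own machine, here is a minimal sketch that counts the fault status values in saved dmesg text. The regular expression assumes the register name is followed by an optional colon and a hex value; that layout is an assumption, so adjust it if your log lines look different. The function name `count_fault_statuses` is made up for this example.

```python
import re
from collections import Counter

# Assumption: the register name is followed by an optional ':' and a hex value.
FAULT_RE = re.compile(r"GCVM_L2_PROTECTION_FAULT_STATUS\s*:?\s*(0x[0-9A-Fa-f]+)")


def count_fault_statuses(dmesg_text: str) -> Counter:
    """Count each distinct fault status, normalised to the form 0x0000ABCD."""
    counts = Counter()
    for match in FAULT_RE.finditer(dmesg_text):
        value = int(match.group(1), 16)
        counts[f"0x{value:08X}"] += 1
    return counts
```

Feeding it the text of `dmesg` before and after a kernel or ROCm upgrade gives a quick comparison; an empty counter means no line matched the pattern, which is not by itself proof that the driver is healthy.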
@bog-dan-ro Can you please test with latest ROCm 6.1.2? If resolved, please close the ticket. Thanks!
Hi @bog-dan-ro, I wasn't able to reproduce these page faults after running the MatrixTranspose, inline_asm or assembly_to_executable tests. My testing was done using ROCm 6.2 on a system running the Linux 6.8 kernel. I do see a couple of fixes rolled out related to page faults occurring after tests were successful. The root cause was likely the same and the fix should resolve all the errors reported in this thread. If you do encounter these page faults again on the latest ROCm release, please open a new ticket and we will further investigate the issue. Thanks!
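The reports in this thread line up by kernel version: faults on 6.6.10, none on 6.7.0, 6.7.7 or 6.8. A rough sketch of a check against that boundary follows. The 6.7 threshold is only inferred from these comments and is not a confirmed fix version, and the names `ASSUMED_FIXED_KERNEL`, `kernel_version` and `at_or_after_assumed_fix` are hypothetical.

```python
import re

# Assumption: 6.7 is used as the boundary only because of the reports above.
ASSUMED_FIXED_KERNEL = (6, 7)


def kernel_version(release: str) -> tuple[int, int]:
    """Extract (major, minor) from a release string such as '6.6.10-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def at_or_after_assumed_fix(release: str) -> bool:
    """True if the release is at or beyond the assumed boundary."""
    return kernel_version(release) >= ASSUMED_FIXED_KERNEL
```

On Linux, the running kernel's release string is available from `platform.release()` in the standard library, so `at_or_after_assumed_fix(platform.release())` gives a quick answer. The ROCm version also changed in some of the reports, so the kernel alone may not be the whole story.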
Problem Description
When running some examples I see errors in dmesg:
/opt/rocm-5.7.0/share/hip/samples/2_Cookbook/0_MatrixTranspose
Same error ^ /opt/rocm-5.7.0/share/hip/samples/2_Cookbook/10_inline_asm
/opt/rocm-5.7.0/share/hip/samples/2_Cookbook/16_assembly_to_executable$ ./square_asm.out
app output:
dmesg output:
Operating System
22.04.3 LTS (Jammy Jellyfish)
CPU
AMD Ryzen 9 7950X3D
GPU
Radeon RX 7900 XT
ROCm Version
5.7.1
ROCm Component
No response
Steps to Reproduce
Run those examples and check the dmesg output.
Running the same examples and some LLMs using the FOSS mesa drivers works just fine. Is amdgpu really needed for ROCm? If not, what are the advantages of amdgpu?
Output of /opt/rocm/bin/rocminfo --support
ROCk module is loaded
HSA System Attributes
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model: LARGE
System Endianness: LITTLE
Mwaitx: DISABLED
DMAbuf Support: YES
==========
HSA Agents
Agent 1
Name: AMD Ryzen 9 7950X3D 16-Core Processor
Uuid: CPU-XX
Marketing Name: AMD Ryzen 9 7950X3D 16-Core Processor
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0(0x0)
Queue Min Size: 0(0x0)
Queue Max Size: 0(0x0)
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768(0x8000) KB
Chip ID: 0(0x0)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 5759
BDFID: 0
Internal Node ID: 0
Compute Unit: 32
SIMDs per CU: 0
Shader Engines: 0
Shader Arrs. per Eng.: 0
WatchPts on Addr. Ranges:1
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: FINE GRAINED
Size: 65564044(0x3e86d8c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 65564044(0x3e86d8c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
Pool 3
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 65564044(0x3e86d8c) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: TRUE
ISA Info:
Agent 2
Name: gfx1100
Uuid: GPU-0001b7e800000000
Marketing Name: Radeon RX 7900 XT
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128(0x80)
Queue Min Size: 64(0x40)
Queue Max Size: 131072(0x20000)
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 32(0x20) KB
L2: 6144(0x1800) KB
L3: 81920(0x14000) KB
Chip ID: 29772(0x744c)
ASIC Revision: 0(0x0)
Cacheline Size: 64(0x40)
Max Clock Freq. (MHz): 2075
BDFID: 768
Internal Node ID: 1
Compute Unit: 84
SIMDs per CU: 2
Shader Engines: 6
Shader Arrs. per Eng.: 2
WatchPts on Addr. Ranges:4
Features: KERNEL_DISPATCH
Fast F16 Operation: TRUE
Wavefront Size: 32(0x20)
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Max Waves Per CU: 32(0x20)
Max Work-item Per CU: 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
Max fbarriers/Workgrp: 32
Packet Processor uCode:: 494
SDMA engine uCode:: 19
IOMMU Support:: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 20955136(0x13fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 2
Segment: GLOBAL; FLAGS:
Size: 20955136(0x13fc000) KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Accessible by all: FALSE
Pool 3
Segment: GROUP
Size: 64(0x40) KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Accessible by all: FALSE
ISA Info:
ISA 1
Name: amdgcn-amd-amdhsa--gfx1100
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Size: 1024(0x400)
Workgroup Max Size per Dimension:
x 1024(0x400)
y 1024(0x400)
z 1024(0x400)
Grid Max Size: 4294967295(0xffffffff)
Grid Max Size per Dimension:
x 4294967295(0xffffffff)
y 4294967295(0xffffffff)
z 4294967295(0xffffffff)
FBarrier Max Size: 32
*** Done ***
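When comparing reports from different machines, it can help to reduce a dump like the one above to one short record per agent. The following is a minimal sketch, assuming the output uses `Agent N` header lines followed by `Key: value` lines as shown here; `parse_agents` is a name made up for this example and is not part of ROCm.

```python
import re

# An agent header is a line containing only "Agent <number>" plus optional punctuation.
AGENT_RE = re.compile(r"^\W*Agent\s+(\d+)\W*$")
WANTED = ("Name", "Marketing Name", "Device Type")


def parse_agents(rocminfo_text: str) -> list[dict]:
    """Return one dict per agent with its number and the WANTED fields."""
    agents = []
    current = None
    for line in rocminfo_text.splitlines():
        header = AGENT_RE.match(line)
        if header:
            current = {"Agent": int(header.group(1))}
            agents.append(current)
            continue
        if current is None or ":" not in line:
            continue
        key, _, value = line.partition(":")
        key = key.strip()
        if key in WANTED:
            # Keep the first occurrence: "Name" appears again under "ISA Info".
            current.setdefault(key, value.strip())
    return agents
```

For the output in this report, that would yield two records: the CPU agent and the gfx1100 GPU agent with marketing name Radeon RX 7900 XT.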