[Issue]: Enabling runtime PM causes failure to resume SMU on MI100 #183

@IMbackK

Problem Description

Booting with runtime PM enabled (amdgpu.runpm=1) causes the GPUs to fail to appear, because the SMU fails to resume on MI100 devices. dmesg:

[   33.711163] [drm] PCIE GART of 512M enabled.
[   33.716881] [drm] PTB located at 0x00000087FEF00000
[   33.723056] amdgpu 0000:03:00.0: amdgpu: PSP is resuming...
[   33.778774] amdgpu 0000:03:00.0: amdgpu: reserve 0x400000 from 0x87fe800000 for PSP TMR
[   33.850011] amdgpu 0000:03:00.0: amdgpu: RAP: optional rap ta ucode is not available
[   33.858894] amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
[   33.865392] amdgpu 0000:03:00.0: amdgpu: SMC is not ready
[   33.871308] amdgpu 0000:03:00.0: amdgpu: SMC engine is not correctly up!
[   33.878965] amdgpu 0000:03:00.0: amdgpu: resume of IP block <smu> failed -5
[   33.886517] amdgpu 0000:03:00.0: amdgpu: amdgpu_device_ip_resume failed (-5).
[   37.878379] [drm] PCIE GART of 512M enabled.
[   37.883438] [drm] PTB located at 0x00000087FEF00000
[   37.889037] amdgpu 0000:83:00.0: amdgpu: PSP is resuming...
[   37.945322] amdgpu 0000:83:00.0: amdgpu: reserve 0x400000 from 0x87fe800000 for PSP TMR
[   38.016581] amdgpu 0000:83:00.0: amdgpu: RAP: optional rap ta ucode is not available
[   38.024903] amdgpu 0000:83:00.0: amdgpu: SMU is resuming...
[   38.030956] amdgpu 0000:83:00.0: amdgpu: SMC is not ready
[   38.036682] amdgpu 0000:83:00.0: amdgpu: SMC engine is not correctly up!
[   38.044093] amdgpu 0000:83:00.0: amdgpu: resume of IP block <smu> failed -5
[   38.051452] amdgpu 0000:83:00.0: amdgpu: amdgpu_device_ip_resume failed (-5).
[   38.416529] amdgpu 0000:c3:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=none:owns=none
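The failing step in the log above is the "resume of IP block <smu> failed -5" line, which appears once per affected device (-5 is -EIO on Linux). A minimal sketch for pulling those lines out of a saved dmesg capture follows; the function name `find_resume_failures` and the `SAMPLE` string are hypothetical, and the regular expression only assumes the message format shown in this report.

```python
import re

# Matches lines such as:
# [   33.878965] amdgpu 0000:03:00.0: amdgpu: resume of IP block <smu> failed -5
RESUME_FAILURE = re.compile(
    r"amdgpu (?P<bdf>[0-9a-f]{4}:[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]): amdgpu: "
    r"resume of IP block <(?P<block>[^>]+)> failed (?P<errno>-?\d+)"
)


def find_resume_failures(log_text):
    """Return (pci_address, ip_block, errno) for each failed IP block resume."""
    return [
        (m.group("bdf"), m.group("block"), int(m.group("errno")))
        for m in RESUME_FAILURE.finditer(log_text)
    ]


# Two lines copied from the dmesg output in this report.
SAMPLE = """\
[   33.878965] amdgpu 0000:03:00.0: amdgpu: resume of IP block <smu> failed -5
[   38.044093] amdgpu 0000:83:00.0: amdgpu: resume of IP block <smu> failed -5
"""

if __name__ == "__main__":
    for bdf, block, err in find_resume_failures(SAMPLE):
        print(f"{bdf}: IP block {block} failed to resume ({err})")
```

To use it on a real capture, pass the contents of a file written with `dmesg > dmesg.txt` instead of `SAMPLE`.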

rocm-smi:

Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
Expected integer value from monitor, but got ""
============================================ ROCm System Management Interface ============================================
====================================================== Concise Info ======================================================
Device  Node  IDs              Temp    Power  Partitions          SCLK    MCLK    Fan  Perf     PwrCap       VRAM%  GPU%  
              (DID,     GUID)  (Edge)  (Avg)  (Mem, Compute, ID)                                                          
==========================================================================================================================
0       3     0x738c,   4106   N/A     N/A    N/A, N/A, 0         None    None    0%   unknown  Unsupported  0%     0%    

Operating System

Ubuntu 24.04

CPU

EPYC 7552

GPU

MI100

ROCm Version

ROCm 6.3.1

ROCm Component

No response

Steps to Reproduce

Add amdgpu.runpm=1 to the kernel command line and boot.
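When reproducing, it can help to confirm that the parameter actually reached the kernel. The sketch below checks /proc/cmdline and then, as an assumption about the running system, tries to read the module parameter from /sys/module/amdgpu/parameters/runpm; if that file is missing or unreadable the script prints None rather than failing. The helper names are hypothetical.

```python
from pathlib import Path


def cmdline_runpm(cmdline):
    """Return the value of amdgpu.runpm from a kernel command line, or None."""
    value = None
    for token in cmdline.split():
        if token.startswith("amdgpu.runpm="):
            value = token.split("=", 1)[1]  # keep the last occurrence seen
    return value


def read_text_or_none(path):
    """Read a small text file, returning None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    cmdline = read_text_or_none("/proc/cmdline") or ""
    print("cmdline amdgpu.runpm:", cmdline_runpm(cmdline))
    # Assumed sysfs location of the loaded module's parameter value.
    print("module runpm:", read_text_or_none("/sys/module/amdgpu/parameters/runpm"))
```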

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

/opt/rocm/bin/rocminfo --support
ROCk module is loaded
hsa api call failure at: /usr/src/debug/rocminfo/rocminfo-rocm-6.2.4/rocminfo.cc:1306
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

Additional Information

No response
