ROCM 3.9-17 fan not work #1300

Closed
yanghoxom opened this issue Nov 23, 2020 · 37 comments

Comments

@yanghoxom

yanghoxom commented Nov 23, 2020

rock-dkms:
Installed: 1:3.9-17
Candidate: 1:3.9-17
Version table:
*** 1:3.9-17 500
500 https://repo.radeon.com/rocm/apt/debian xenial/main amd64 Packages
100 /var/lib/dpkg/status
Device: gigabyte 5600xt
OS: Ubuntu 20.04.1 LTS

rocm-smi

======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU  Temp   AvgPwr  SCLK    MCLK    Fan   Perf  PwrCap  VRAM%  GPU%  
0    49.0c  31.0W   800Mhz  875Mhz  100.0%  auto  180.0W   12%   16%   
================================================================================
============================= End of ROCm SMI Log ==============================

But when I look at the card, I can't see the fan spinning.

In 3.5 it worked normally. In this version I can set the fan speed, but the fan does not spin. Does anyone know the reason or have any idea how to fix it? Thanks.

@yanghoxom changed the title from "ROCM 3.9-19 fan not work" to "ROCM 3.9-17 fan not work" Nov 23, 2020
@ROCmSupport

Thanks @memsenpai for reaching out.
We will check and get back asap.

@yanghoxom
Author

yanghoxom commented Nov 23, 2020

@ROCmSupport
One more thing that may be helpful.
I checked on Windows with AMD's app; it shows a maximum fan speed of ~4000 rpm.
But when I set 100% with rocm-smi, the info shows a fan speed of ~14000 rpm (even though the fans are still not actually spinning).
In version 3.5 on Ubuntu, when I set the fan speed to 35%, the fan speed was ~1300 rpm.

@ROCmSupport

Thanks for the additional information @memsenpai
This looks like a hardware issue to me, as the software functions are working properly.

@yanghoxom
Author

@ROCmSupport But it still works on ROCm 3.5.
Is there any way to roll back to version 3.5?

@ROCmSupport

Hi @memsenpai
Are you sure that the fan runs on ROCm 3.5? Please confirm as soon as possible.
You can install ROCm 3.5/3.5.1 from http://repo.radeon.com/rocm/apt/3.5.1

@yanghoxom
Author

@ROCmSupport I rolled back to 3.8 and it is working now,
even though rocm-smi does not report the fan speed correctly (always 0%).
hardware-info shows the same.
But I set the speed with https://github.com/DominiLux/amdgpu-pro-fans, checked the card by eye, and saw the fan spinning.
In summary:
With 3.8: rocm-smi and hardware-info report an incorrect fan speed (e.g. after rocm-smi --setfan 10); they always return 0, but the fan spins and works normally.
With 3.9: rocm-smi and hardware-info report the fan speed that was set (e.g. after rocm-smi --setfan 10), but the fan does not spin.

@ROCmSupport

Thanks @memsenpai.
Can you please provide the output of the following commands?
/opt/rocm/bin/rocminfo
/opt/rocm/opencl/bin/clinfo

@yanghoxom
Author

@ROCmSupport here, for you
/opt/rocm/bin/rocminfo


ROCk module is loaded
Able to open /dev/kfd read-write
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    Intel(R) Core(TM) i5-10400 CPU @ 2.90GHz
  Uuid:                    CPU-XX                             
  Marketing Name:          Intel(R) Core(TM) i5-10400 CPU @ 2.90GHz
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   4300                               
  BDFID:                   0                                  
  Internal Node ID:        0                                  
  Compute Unit:            12                                 
  SIMDs per CU:            0                                  
  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    16331632(0xf93370) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16331632(0xf93370) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
    N/A                      
*******                  
Agent 2                  
*******                  
  Name:                    gfx1010                            
  Uuid:                    GPU-XX                             
  Marketing Name:          Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          4096(0x1000)                       
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
  Chip ID:                 29471(0x731f)                      
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   1780                               
  BDFID:                   768                                
  Internal Node ID:        1                                  
  Compute Unit:            36                                 
  SIMDs per CU:            4                                  
  Shader Engines:          4                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        80(0x50)                           
  Max Work-item Per CU:    2560(0xa00)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    6275072(0x5fc000) KB               
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1010         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done *** 

/opt/rocm/opencl/bin/clinfo

Number of platforms:				 1
  Platform Profile:				 FULL_PROFILE
  Platform Version:				 OpenCL 2.0 AMD-APP (3186.0)
  Platform Name:				 AMD Accelerated Parallel Processing
  Platform Vendor:				 Advanced Micro Devices, Inc.
  Platform Extensions:				 cl_khr_icd cl_amd_event_callback 


  Platform Name:				 AMD Accelerated Parallel Processing
Number of devices:				 1
  Device Type:					 CL_DEVICE_TYPE_GPU
  Vendor ID:					 1002h
  Board name:					 Navi 10 [Radeon RX 5600 OEM/5600 XT / 5700/5700 XT]
  Device Topology:				 PCI[ B#3, D#0, F#0 ]
  Max compute units:				 18
  Max work items dimensions:			 3
    Max work items[0]:				 1024
    Max work items[1]:				 1024
    Max work items[2]:				 1024
  Max work group size:				 256
  Preferred vector width char:			 4
  Preferred vector width short:			 2
  Preferred vector width int:			 1
  Preferred vector width long:			 1
  Preferred vector width float:			 1
  Preferred vector width double:		 1
  Native vector width char:			 4
  Native vector width short:			 2
  Native vector width int:			 1
  Native vector width long:			 1
  Native vector width float:			 1
  Native vector width double:			 1
  Max clock frequency:				 1780Mhz
  Address bits:					 64
  Max memory allocation:			 5461822668
  Image support:				 Yes
  Max number of images read arguments:		 128
  Max number of images write arguments:		 8
  Max image 2D width:				 16384
  Max image 2D height:				 16384
  Max image 3D width:				 2048
  Max image 3D height:				 2048
  Max image 3D depth:				 2048
  Max samplers within kernel:			 29471
  Max size of kernel argument:			 1024
  Alignment (bits) of base address:		 1024
  Minimum alignment (bytes) for any datatype:	 128
  Single precision floating point capability
    Denorms:					 Yes
    Quiet NaNs:					 Yes
    Round to nearest even:			 Yes
    Round to zero:				 Yes
    Round to +ve and infinity:			 Yes
    IEEE754-2008 fused multiply-add:		 Yes
  Cache type:					 Read/Write
  Cache line size:				 64
  Cache size:					 16384
  Global memory size:				 6425673728
  Constant buffer size:				 5461822668
  Max number of constant args:			 8
  Local memory type:				 Scratchpad
  Local memory size:				 65536
  Max pipe arguments:				 16
  Max pipe active reservations:			 16
  Max pipe packet size:				 1166855372
  Max global variable size:			 5461822668
  Max global variable preferred total size:	 6425673728
  Max read/write image args:			 64
  Max on device events:				 1024
  Queue on device max size:			 8388608
  Max on device queues:				 1
  Queue on device preferred size:		 262144
  SVM capabilities:				 
    Coarse grain buffer:			 Yes
    Fine grain buffer:				 Yes
    Fine grain system:				 No
    Atomics:					 No
  Preferred platform atomic alignment:		 0
  Preferred global atomic alignment:		 0
  Preferred local atomic alignment:		 0
  Kernel Preferred work group size multiple:	 32
  Error correction support:			 0
  Unified memory for Host and Device:		 0
  Profiling timer resolution:			 1
  Device endianess:				 Little
  Available:					 Yes
  Compiler available:				 Yes
  Execution capabilities:				 
    Execute OpenCL kernels:			 Yes
    Execute native function:			 No
  Queue on Host properties:				 
    Out-of-Order:				 No
    Profiling :					 Yes
  Queue on Device properties:				 
    Out-of-Order:				 Yes
    Profiling :					 Yes
  Platform ID:					 0x7faae9374cd0
  Name:						 gfx1010
  Vendor:					 Advanced Micro Devices, Inc.
  Device OpenCL C version:			 OpenCL C 2.0 
  Driver version:				 3186.0 (HSA1.1,LC)
  Profile:					 FULL_PROFILE
  Version:					 OpenCL 2.0 
  Extensions:					 cl_khr_fp64 cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_int64_base_atomics cl_khr_int64_extended_atomics cl_khr_3d_image_writes cl_khr_byte_addressable_store cl_khr_fp16 cl_khr_gl_sharing cl_amd_device_attribute_query cl_amd_media_ops cl_amd_media_ops2 cl_khr_image2d_from_buffer cl_khr_subgroups cl_khr_depth_images cl_amd_copy_buffer_p2p cl_amd_assembly_program 


@ROCmSupport

ROCmSupport commented Nov 26, 2020

Hi @memsenpai
We are not officially supporting Navi10 series of cards with ROCm. You can check the docs for more clarity.
(Our test teams are not validating these cards either, so I do not have a ready answer.)
Anyway, I will try to gather more information by reproducing this problem on other cards and will try to answer this question, if possible.

@ptitjes

ptitjes commented Nov 27, 2020

We are not officially supporting Navi10 series of cards with ROCm. You can check the docs for more clarity.

@ROCmSupport, release 20.45 of amdgpu-pro now uses ROCm as an OpenCL driver. So are we supposed to understand that AMD Navi10 graphics cards aren't supported by any AMD driver?

@ROCmSupport

Hi @ptitjes
Navi10 is supported by the amdgpu-pro driver, but ROCm does not officially support it.

@ROCmSupport

ROCmSupport commented Nov 27, 2020

Hi @memsenpai
We are not observing this issue locally on ROCm 3.9.
Fan values are shown correctly and the fans are spinning with SMI on a Radeon VII.

@ptitjes

ptitjes commented Nov 28, 2020

@ROCmSupport, well, with ROCm now being part of AMDGPU-PRO, you obviously and implicitly have to support Navi10 and Big Navi. You should update your front-page!

@yanghoxom
Author

@ROCmSupport I tried 3.9 and 3.9.1; both show a false speed and the fan does not spin. I also have no experience with debugging this kind of error. I will wait for the next version and try again.

@ROCmSupport

ROCmSupport commented Nov 30, 2020

Hi @memsenpai
Let me check on different hardware and update you soon.
I will also verify with 3.10, which is going to be released in 1 or 2 days, and will update.
Thank you.

@ROCmSupport

Hi @memsenpai
I recommend trying 3.10, which has updated rock-dkms and rock-dkms-firmware packages that include the fix for the 5.4.0-56 kernel.
Can you please try the latest 3.10 repo as of today and share your findings?
Thank you.

@yanghoxom
Author

yanghoxom commented Dec 16, 2020

@ROCmSupport Is it OK if I test with 5.4.0-54?

@ROCmSupport

Hi @memsenpai
You can test it with 5.4.0-54 kernel. It works.

@yanghoxom
Author

yanghoxom commented Dec 16, 2020

@ROCmSupport
I tested it; it's still the same error as before.
P.S. With 3.10.0 I can no longer use the rocm-smi command; I have to use /opt/rocm-3.10.0/bin/rocm_smi.py instead.
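
For anyone scripting around the missing command, here is a minimal sketch of a lookup that tries `rocm-smi` on the PATH first and then falls back to a versioned install directory like the `/opt/rocm-3.10.0/bin/rocm_smi.py` path mentioned above. The function name `find_rocm_smi` and the `/opt/rocm*` glob pattern are assumptions made for illustration; they are not part of ROCm.

```python
import glob
import os
import shutil


def find_rocm_smi(opt_root="/opt", which=shutil.which):
    """Return a path to rocm-smi, or None if it cannot be found.

    Tries the command on PATH first, then versioned install directories
    such as /opt/rocm-3.10.0/bin/rocm_smi.py.
    """
    on_path = which("rocm-smi")
    if on_path:
        return on_path
    # Assumed layout: <opt_root>/rocm*/bin/rocm_smi.py
    pattern = os.path.join(opt_root, "rocm*", "bin", "rocm_smi.py")
    candidates = sorted(glob.glob(pattern))
    # Plain string sort: the last entry is not guaranteed to be the newest release.
    return candidates[-1] if candidates else None
```

Calling `find_rocm_smi()` with no arguments searches the real PATH and `/opt`; the two parameters exist only so the lookup can be exercised against a scratch directory.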

@ROCmSupport

Hi @memsenpai
The issue should no longer occur, as the ROCm 3.10 repo has been updated with fixed kernel packages.
I recommend doing a clean uninstall and a fresh install of 3.10; that should work.

@yanghoxom
Author

@ROCmSupport Can you tell me how to do a clean uninstall?
I just tried following this guide https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html to remove 3.10 and reinstall 3.8 (which worked for me before I updated to 3.10), but everything is still broken.
The package version is back to 3.8, but the error is still there and the fan does not work.

@ROCmSupport

I recommend a fresh 3.10 install (the repo has been updated with fixed kernel packages).

Uninstall ROCm:

  1. sudo apt autoremove rocm-dkms
  2. Check whether any packages are left by running sudo dpkg -l | grep
    for each of these names: hsa, hip, llvm, comgr, rock, rocm
    If you find any of the above packages, remove them all with sudo apt purge
  3. Once everything is removed, reboot
  4. Map the repo again and install the latest ROCm using sudo apt install rocm-dkms

Thank you.
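
The leftover-package check in step 2 can be sketched as a small filter over the text printed by `dpkg -l`. This is only an illustration under stated assumptions: the helper name `leftover_packages` is made up, and only the keyword list comes from the steps above. Keywords such as llvm or hip can also match distribution packages that have nothing to do with ROCm, so review the resulting list before purging anything.

```python
# Keywords taken from step 2 of the uninstall instructions.
KEYWORDS = ("hsa", "hip", "llvm", "comgr", "rock", "rocm")


def leftover_packages(dpkg_output, keywords=KEYWORDS):
    """Return installed package names from `dpkg -l` text that contain a keyword."""
    names = []
    for line in dpkg_output.splitlines():
        fields = line.split()
        # Fully installed packages carry the status "ii" in the first column;
        # header lines and other states are skipped in this sketch.
        if len(fields) >= 2 and fields[0] == "ii":
            name = fields[1]
            if any(keyword in name for keyword in keywords):
                names.append(name)
    return names
```

One way to feed it is `subprocess.run(["dpkg", "-l"], capture_output=True, text=True).stdout`, after which the returned names can be checked by hand.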

@yanghoxom
Author

@ROCmSupport
Thank you for all your help.
After a fresh install of 3.10 on kernel 5.4.0-54, it's still the same error as before.

@ROCmSupport

Hi @memsenpai
I am not able to reproduce the issue with 5.4.0-54.
Can you please share the error logs?

@yanghoxom
Author

yanghoxom commented Dec 17, 2020

@ROCmSupport
How can I get the error log for you?
For me, it doesn't give any error, simply the fan is not working.

[    0.000000] Linux version 5.4.0-54-generic (buildd@lcy01-amd64-024) (gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) #60-Ubuntu SMP Fri Nov 6 10:37:59 UTC 2020 (Ubuntu 5.4.0-54.60-generic 5.4.65)
[    6.552418] amdkcl: loading out-of-tree module taints kernel.
[    6.552437] amdkcl: module verification failed: signature and/or required key missing - tainting kernel
[    6.674875] [drm] amdgpu kernel modesetting enabled.
[    6.674875] [drm] amdgpu version: 5.6.15
[    6.674921] amdgpu: CRAT table not found
[    6.674923] amdgpu: Virtual CRAT table created for CPU
[    6.674934] amdgpu: Topology: Add CPU node
[    6.677491] amdgpu 0000:03:00.0: remove_conflicting_pci_framebuffers: bar 0: 0x50000000 -> 0x5fffffff
[    6.677493] amdgpu 0000:03:00.0: remove_conflicting_pci_framebuffers: bar 2: 0x60000000 -> 0x601fffff
[    6.677494] amdgpu 0000:03:00.0: remove_conflicting_pci_framebuffers: bar 5: 0x40100000 -> 0x4017ffff
[    6.677496] fb0: switching to amdgpudrmfb from EFI VGA
[    6.677603] amdgpu 0000:03:00.0: vgaarb: deactivate vga console
[    6.677629] amdgpu 0000:03:00.0: enabling device (0006 -> 0007)
[    6.677803] amdgpu 0000:03:00.0: amdgpu: Trusted Memory Zone (TMZ) feature disabled as experimental (default)
[    6.677804] amdgpu 0000:03:00.0: amdgpu: set kernel compute queue number to 8 due to invalid parameter provided by user
[    6.678647] amdgpu: ATOM BIOS: xxx-xxx-xxx
[    6.678696] amdgpu 0000:03:00.0: amdgpu: VRAM: 6128M 0x0000008000000000 - 0x000000817EFFFFFF (6128M used)
[    6.678697] amdgpu 0000:03:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[    6.679923] [drm] amdgpu: 6128M of VRAM memory ready
[    6.679925] [drm] amdgpu: 15948M of GTT memory ready.
[    7.471371] amdgpu 0000:03:00.0: amdgpu: RAS: optional ras ta ucode is not available
[    7.503475] amdgpu 0000:03:00.0: amdgpu: use vbios provided pptable
[    7.503476] amdgpu 0000:03:00.0: amdgpu: smc_dpm_info table revision(format.content): 4.5
[    7.538201] amdgpu 0000:03:00.0: amdgpu: SMU is initialized successfully!
[    7.566816] snd_hda_intel 0000:03:00.1: bound 0000:03:00.0 (ops amdgpu_dm_audio_component_bind_ops [amdgpu])
[    7.797471] kfd kfd: amdgpu: Allocated 3969056 bytes on gart
[    7.798050] amdgpu: Virtual CRAT table created for GPU
[    7.798144] amdgpu: Topology: Add dGPU node [0x731f:0x1002]
[    7.798145] kfd kfd: amdgpu: added device 1002:731f
[    7.798147] amdgpu 0000:03:00.0: amdgpu: SE 2, SH per SE 2, CU per SH 10, active_cu_number 36
[    7.799770] fbcon: amdgpudrmfb (fb0) is primary device
[    7.799837] amdgpu 0000:03:00.0: fb0: amdgpudrmfb frame buffer device
[    7.840902] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[    7.840904] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[    7.840905] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[    7.840905] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[    7.840906] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[    7.840907] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[    7.840908] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[    7.840908] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[    7.840909] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[    7.840910] amdgpu 0000:03:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[    7.840911] amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[    7.840911] amdgpu 0000:03:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[    7.840912] amdgpu 0000:03:00.0: amdgpu: ring vcn_dec uses VM inv eng 0 on hub 1
[    7.840913] amdgpu 0000:03:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 1 on hub 1
[    7.840914] amdgpu 0000:03:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 4 on hub 1
[    7.840915] amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 1
[    7.846688] [drm] Initialized amdgpu 3.40.0 20150101 for 0000:03:00.0 on minor 0
[13047.973124] amdgpu 0000:03:00.0: amdgpu: RAS: optional ras ta ucode is not available
[13047.997127] amdgpu 0000:03:00.0: amdgpu: SMU is resuming...
[13048.032178] amdgpu 0000:03:00.0: amdgpu: SMU is resumed successfully!
[13048.215125] amdgpu 0000:03:00.0: amdgpu: ring gfx_0.0.0 uses VM inv eng 0 on hub 0
[13048.215126] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 1 on hub 0
[13048.215126] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 4 on hub 0
[13048.215127] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 5 on hub 0
[13048.215127] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 6 on hub 0
[13048.215128] amdgpu 0000:03:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 7 on hub 0
[13048.215128] amdgpu 0000:03:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 8 on hub 0
[13048.215129] amdgpu 0000:03:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 9 on hub 0
[13048.215130] amdgpu 0000:03:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 10 on hub 0
[13048.215130] amdgpu 0000:03:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 11 on hub 0
[13048.215131] amdgpu 0000:03:00.0: amdgpu: ring sdma0 uses VM inv eng 12 on hub 0
[13048.215131] amdgpu 0000:03:00.0: amdgpu: ring sdma1 uses VM inv eng 13 on hub 0
[13048.215132] amdgpu 0000:03:00.0: amdgpu: ring vcn_dec uses VM inv eng 0 on hub 1
[13048.215133] amdgpu 0000:03:00.0: amdgpu: ring vcn_enc0 uses VM inv eng 1 on hub 1
[13048.215133] amdgpu 0000:03:00.0: amdgpu: ring vcn_enc1 uses VM inv eng 4 on hub 1
[13048.215134] amdgpu 0000:03:00.0: amdgpu: ring jpeg_dec uses VM inv eng 5 on hub 1

I tried
rocm-smi --setfan 30%
and rocm-smi reports the fan running at 100% and 4000 rpm.

I tried
rocm-smi --setfan 100%
and rocm-smi reports the fan running at 100% and 12000 rpm.
(Is the sensor reading wrong? Is the fan's maximum-speed parameter wrong?)

In fact, I checked with my own eyes and the fans were not spinning.
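
One way to narrow down whether the reading or the maximum-speed parameter is off is to look at the raw values the amdgpu driver exposes through hwmon, independently of rocm-smi. The sketch below reads `pwm1`, `pwm1_enable`, `fan1_input` and `fan1_max` from a hwmon directory and flags a reported RPM above the advertised maximum. The helper names are made up for illustration, `fan1_max` may not be present on every card, and nothing here has been verified on the 5600 XT from this report.

```python
from pathlib import Path

# Attribute names from the amdgpu hwmon interface; fan1_max may be absent.
ATTRS = ("pwm1", "pwm1_enable", "fan1_input", "fan1_max")


def read_fan_state(hwmon_dir):
    """Read raw integer fan attributes from one hwmon directory.

    Attributes that are missing or unreadable are returned as None.
    """
    state = {}
    for name in ATTRS:
        try:
            state[name] = int((Path(hwmon_dir) / name).read_text().strip())
        except (OSError, ValueError):
            state[name] = None
    return state


def rpm_exceeds_max(state):
    """True if the reported RPM is above the advertised maximum, False if not,
    None when either value is unavailable."""
    rpm, rpm_max = state["fan1_input"], state["fan1_max"]
    if rpm is None or rpm_max is None:
        return None
    return rpm > rpm_max
```

On a typical system the directory would be something like `/sys/class/drm/card0/device/hwmon/hwmon*` (the `card0` index is an assumption), so looping over `Path("/sys/class/drm/card0/device/hwmon").glob("hwmon*")` and printing `read_fan_state(d)` gives values to compare with what rocm-smi prints.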

@ROCmSupport

Thanks.
I have passed this info to dev and let me try to get more information for you.
Thank you.

@yanghoxom
Author

yanghoxom commented Dec 17, 2020

@ROCmSupport I added the boot log above 👆; maybe it helps. Thanks for your support.

@ROCmSupport

Hi @memsenpai
Is it still reproducible with ROCm 4.0?
Please test with 4.0 and update asap.

CC: @kentrussell

@kentrussell
Collaborator

kentrussell commented Jan 5, 2021

EDIT: I read through the whole thread, updating my response accordingly

  1. Can you confirm that the fan does spin at all? I saw mixed information above, so ensuring that the fan can spin is paramount.
  2. We have a fix coming in 4.1 regarding improper set-fan calculations on SMU11 (Navi). If you try the 4.1 build once released, we should be good to go.

@ddobreff

ddobreff commented Jan 7, 2021

Hi @kentrussell, if you are referring to this patch: https://patchwork.freedesktop.org/patch/397122/?series=83066&rev=2
It fixes fan speed settings in manual mode, but the problem of not being able to set 100% (it ends up at around 81%) persists. The problem affects only swsmu devices; Vega10, Vega20 and Polaris are not affected.
The new approach to setting up fans for swsmu doesn't work at all for me, and I'm using a reference 5700, which is supposed to be the least affected by AIB vendor modifications.
-> https://cgit.freedesktop.org/~agd5f/linux/commit/?h=amd-staging-drm-next&id=337b57aecb3e11294b0e547aac871a5481fd42ed

@yanghoxom
Author

@ROCmSupport @kentrussell
I tested with 4.0 and everything is the same as before.

@kentrussell
Collaborator

@ddobreff There are a few patches, but there is also https://www.mail-archive.com/amd-gfx@lists.freedesktop.org/msg55933.html . Unfortunately 4.0 doesn't have these fixes, they will be in 4.1 when we move to the newer kernel base.

@ddobreff

ddobreff commented Jan 8, 2021

@kentrussell Tried that too, with the same result: the fan cannot exceed 81% on the Navi10 reference design. I also have a Navi21; if you wish, I can run the test and report back, but for now the old way seems to work OK.

@kentrussell
Collaborator

Thanks @ddobreff . Once we get 4.1 release out, we can revisit this issue as well as @memsenpai 's issue. We won't be backporting fixes to the 5.6-based branch but if we can see the issue on 4.1, we can start testing fixes on the newer code base, and can also get the amdgpu (kernel and PM) guys looking into it as well (they won't look into deprecated kernel bases).
As a temporary investigation, you could try the 20.45 Pro driver or the open-sourced amdgpu driver and see if the issue continues to persist there too. If it's on the upstream amd-staging-drm-next kernel as well, then it's easier to reproduce and easier to get isolated. But it's a lot of time and effort, and makes ROCm work a lot harder. But if you're bored in lockdown and have nothing else to do, then those are a couple things to try that can alleviate your boredom until we drop 4.1

@ROCmSupport

Hi @memsenpai and all,
I have verified with the latest internal build, which will become ROCm 4.1; the hardware fan is running in my case.
Please verify with the latest ROCm 4.1.
Thank you.

@ROCmSupport

No issue with 4.1, so I am closing this.
Thank you.

@sinix-del

sinix-del commented May 31, 2022

@ROCmSupport

First I tried a clean install with kernel 5.4 and ROCm 4.5.2; I get 0 rpm for GPU0 when two cards are in the system. The system is Ubuntu 20.04 LTS. After that I did a clean install with kernel 5.11, fully removed the other kernels, fully updated 5.11, and then installed ROCm 5.0; the result is the same.
Here is the output:

======================= ROCm System Management Interface =======================
================================= Concise Info =================================
GPU Temp AvgPwr SCLK MCLK Fan Perf PwrCap VRAM% GPU%
0 57.0c 119.0W 900Mhz 1050Mhz 0% manual 264.0W 83% 99%
1 57.0c 129.0W 900Mhz 1050Mhz 73.73% manual 264.0W 83% 99%
================================================================================
============================= End of ROCm SMI Log ==============================
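
As a quick check on the 73.73% shown for GPU 1: assuming rocm-smi derives the percentage from a raw fan level on a 0-255 scale (the `rocm-smi -a` output in this comment reports `Fan Level: 188 (74%)`), the arithmetic can be sketched as follows. The function name is made up for illustration, and the 0-255 range is an assumption that happens to be consistent with both numbers.

```python
def pwm_to_percent(level, max_level=255):
    """Convert a raw fan level to a percentage, rounded to two decimals.

    max_level=255 is an assumed full-scale value, not read from the driver.
    """
    if not 0 <= level <= max_level:
        raise ValueError(f"fan level {level} outside 0..{max_level}")
    return round(100.0 * level / max_level, 2)


# 188 / 255 = 0.73725..., which rounds to 73.73 and matches the Fan column for GPU 1.
print(pwm_to_percent(188))
```

By the same arithmetic a level of 0 maps to 0%, which is what GPU 0 shows, although the `-a` output reports that the fan speed for GPU 0 could not be detected at all rather than reading an actual level of 0.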

And the rocm-smi -a output is the same as on 4.5.2:

======================= ROCm System Management Interface =======================
========================= Version of System Component ==========================
Driver version: 5.13.11.21.50
================================================================================
====================================== ID ======================================
GPU[0] : GPU ID: 0x687f
GPU[1] : GPU ID: 0x687f
================================================================================
================================== Unique ID ===================================
GPU[0] : Unique ID: 0x215084616963084
GPU[1] : Unique ID: 0x2150654c1aa5144
================================================================================
==================================== VBIOS =====================================
GPU[0] : VBIOS version: 113-D0500100-O08
GPU[1] : VBIOS version: 113-D0500100-O08
================================================================================
================================= Temperature ==================================
GPU[0] : Temperature (Sensor edge) (C): 57.0
GPU[0] : Temperature (Sensor junction) (C): 63.0
GPU[0] : Temperature (Sensor memory) (C): 78.0
GPU[0] : Temperature (Sensor HBM 0) (C): N/A
GPU[0] : Temperature (Sensor HBM 1) (C): N/A
GPU[0] : Temperature (Sensor HBM 2) (C): N/A
GPU[0] : Temperature (Sensor HBM 3) (C): N/A
GPU[1] : Temperature (Sensor edge) (C): 58.0
GPU[1] : Temperature (Sensor junction) (C): 70.0
GPU[1] : Temperature (Sensor memory) (C): 90.0
GPU[1] : Temperature (Sensor HBM 0) (C): N/A
GPU[1] : Temperature (Sensor HBM 1) (C): N/A
GPU[1] : Temperature (Sensor HBM 2) (C): N/A
GPU[1] : Temperature (Sensor HBM 3) (C): N/A
================================================================================
========================== Current clock frequencies ===========================
GPU[0] : pcie clock level: 1 (5.0GT/s x1)
GPU[1] : dcefclk clock level: 0: (600Mhz)
GPU[1] : mclk clock level: 3: (1050Mhz)
GPU[1] : sclk clock level: 7: (900Mhz)
GPU[1] : socclk clock level: 7: (1107Mhz)
GPU[1] : pcie clock level: 1 (5.0GT/s x1)
================================================================================
============================== Current Fan Metric ==============================
GPU[0] : Unable to detect fan speed for GPU 0
GPU[1] : Fan Level: 188 (74%)
GPU[1] : Fan RPM: 2441
================================================================================
============================ Show Performance Level ============================
GPU[0] : Performance Level: manual
GPU[1] : Performance Level: manual
================================================================================
=============================== OverDrive Level ================================
GPU[0] : GPU OverDrive value (%): 0
GPU[1] : GPU OverDrive value (%): 0
================================================================================
=============================== OverDrive Level ================================
================================== Power Cap ===================================
GPU[0] : Max Graphics Package Power (W): 264.0
GPU[1] : Max Graphics Package Power (W): 264.0
================================================================================
============================= Show Power Profiles ==============================
GPU[0] : 1. Available power profile (#1 of 7): CUSTOM
GPU[0] : 2. Available power profile (#2 of 7): VIDEO
GPU[0] : 3. Available power profile (#3 of 7): POWER SAVING
GPU[0] : 4. Available power profile (#4 of 7): COMPUTE*
GPU[0] : 5. Available power profile (#5 of 7): VR
GPU[0] : 6. Available power profile (#6 of 7): 3D FULL SCREEN
GPU[0] : 7. Available power profile (#7 of 7): BOOTUP DEFAULT
GPU[1] : 1. Available power profile (#1 of 7): CUSTOM
GPU[1] : 2. Available power profile (#2 of 7): VIDEO
GPU[1] : 3. Available power profile (#3 of 7): POWER SAVING
GPU[1] : 4. Available power profile (#4 of 7): COMPUTE*
GPU[1] : 5. Available power profile (#5 of 7): VR
GPU[1] : 6. Available power profile (#6 of 7): 3D FULL SCREEN
GPU[1] : 7. Available power profile (#7 of 7): BOOTUP DEFAULT
================================================================================
============================== Power Consumption ===============================
GPU[0] : Average Graphics Package Power (W): 119.0
GPU[1] : Average Graphics Package Power (W): 131.0
================================================================================

========================= Supported clock frequencies ==========================
GPU[0] : Supported PCIe frequencies on GPU0
GPU[0] : 0: 5.0GT/s x1
GPU[0] : 1: 5.0GT/s x1 *
GPU[0] :


GPU[1] : Supported dcefclk frequencies on GPU1
GPU[1] : 0: 600Mhz *
GPU[1] : 1: 720Mhz
GPU[1] : 2: 800Mhz
GPU[1] : 3: 847Mhz
GPU[1] : 4: 900Mhz
GPU[1] :
GPU[1] : Supported mclk frequencies on GPU1
GPU[1] : 0: 167Mhz
GPU[1] : 1: 500Mhz
GPU[1] : 2: 850Mhz
GPU[1] : 3: 1050Mhz *
GPU[1] :
GPU[1] : Supported sclk frequencies on GPU1
GPU[1] : 0: 852Mhz
GPU[1] : 1: 1175Mhz
GPU[1] : 2: 1105Mhz
GPU[1] : 3: 1110Mhz
GPU[1] : 4: 1115Mhz
GPU[1] : 5: 1120Mhz
GPU[1] : 6: 1125Mhz
GPU[1] : 7: 900Mhz *
GPU[1] :
GPU[1] : Supported socclk frequencies on GPU1
GPU[1] : 0: 600Mhz
GPU[1] : 1: 720Mhz
GPU[1] : 2: 800Mhz
GPU[1] : 3: 847Mhz
GPU[1] : 4: 900Mhz
GPU[1] : 5: 960Mhz
GPU[1] : 6: 1028Mhz
GPU[1] : 7: 1107Mhz *
GPU[1] :
GPU[1] : Supported PCIe frequencies on GPU1
GPU[1] : 0: 5.0GT/s x1
GPU[1] : 1: 5.0GT/s x1 *
GPU[1] :


================================================================================
============================== % time GPU is busy ==============================
GPU[0] : GPU use (%): 100
GPU[0] : GFX Activity: N/A
GPU[1] : GPU use (%): 100
GPU[1] : GFX Activity: N/A
================================================================================
============================== Current Memory Use ==============================
GPU[0] : Memory Activity: N/A
GPU[1] : Memory Activity: N/A
================================================================================
================================ Memory Vendor =================================
GPU[0] : GPU memory vendor: samsung
GPU[1] : GPU memory vendor: samsung
================================================================================
============================= PCIe Replay Counter ==============================
GPU[0] : PCIe Replay Count: 0
GPU[1] : PCIe Replay Count: 0
================================================================================
================================ Serial Number =================================
GPU[0] : Serial Number: N/A
GPU[1] : Serial Number: N/A
================================================================================
================================ KFD Processes =================================
KFD process information:
PID PROCESS NAME GPU(s) VRAM USED SDMA USED CU OCCUPANCY
2045 teamredminer 0 14147411968 0 20
================================================================================
============================= GPUs Indexed by PID ==============================
None
================================================================================
================== GPU Memory clock frequencies and voltages ===================
================================================================================
=============================== Current voltage ================================
GPU[0] : Voltage (mV): 900
GPU[1] : Voltage (mV): 900
================================================================================
================================== PCI Bus ID ==================================
GPU[0] : PCI Bus: 0000:04:00.0
GPU[1] : PCI Bus: 0000:07:00.0
================================================================================
============================= Firmware Information =============================
GPU[0] : ASD firmware version: 553648242
GPU[0] : CE firmware version: 79
GPU[0] : DMCU firmware version: 0
GPU[0] : MC firmware version: 0
GPU[0] : ME firmware version: 166
GPU[0] : MEC firmware version: 33232
GPU[0] : MEC2 firmware version: 33232
GPU[0] : PFP firmware version: 194
GPU[0] : RLC firmware version: 96
GPU[0] : RLC SRLC firmware version: 0
GPU[0] : RLC SRLG firmware version: 0
GPU[0] : RLC SRLS firmware version: 0
GPU[0] : SDMA firmware version: 434
GPU[0] : SDMA2 firmware version: 434
GPU[0] : SMC firmware version: 05.28.19.00
GPU[0] : SOS firmware version: 0x0008025d
GPU[0] : TA RAS firmware version: 00.00.00.00
GPU[0] : TA XGMI firmware version: 00.00.00.00
GPU[0] : UVD firmware version: 0x422b1100
GPU[0] : VCE firmware version: 0x39060400
GPU[0] : VCN firmware version: 0x00000000
GPU[1] : ASD firmware version: 553648242
GPU[1] : CE firmware version: 79
GPU[1] : DMCU firmware version: 0
GPU[1] : MC firmware version: 0
GPU[1] : ME firmware version: 166
GPU[1] : MEC firmware version: 33232
GPU[1] : MEC2 firmware version: 33232
GPU[1] : PFP firmware version: 194
GPU[1] : RLC firmware version: 96
GPU[1] : RLC SRLC firmware version: 0
GPU[1] : RLC SRLG firmware version: 0
GPU[1] : RLC SRLS firmware version: 0
GPU[1] : SDMA firmware version: 434
GPU[1] : SDMA2 firmware version: 434
GPU[1] : SMC firmware version: 05.28.19.00
GPU[1] : SOS firmware version: 0x0008025d
GPU[1] : TA RAS firmware version: 00.00.00.00
GPU[1] : TA XGMI firmware version: 00.00.00.00
GPU[1] : UVD firmware version: 0x422b1100
GPU[1] : VCE firmware version: 0x39060400
GPU[1] : VCN firmware version: 0x00000000
================================================================================
================================= Product Info =================================
GPU[0] : Card series: Vega 10 XL/XT [Radeon RX Vega 56/64]
GPU[0] : Card model: 0xe37f
GPU[0] : Card vendor: Advanced Micro Devices, Inc. [AMD/ATI]
GPU[0] : Card SKU: D05001
GPU[1] : Card series: Vega 10 XL/XT [Radeon RX Vega 56/64]
GPU[1] : Card model: 0xe37f
GPU[1] : Card vendor: Advanced Micro Devices, Inc. [AMD/ATI]
GPU[1] : Card SKU: D05001
================================================================================
================================== Pages Info ==================================
============================ Show Valid sclk Range =============================
GPU[0] : Unable to display sclk range
GPU[1] : Unable to display sclk range
================================================================================
============================ Show Valid mclk Range =============================
GPU[0] : Unable to display mclk range
GPU[1] : Unable to display mclk range
================================================================================
=========================== Show Valid voltage Range ===========================
GPU[0] : Unable to display voltage range
GPU[1] : Unable to display voltage range
================================================================================
============================= Voltage Curve Points =============================
GPU[0] : Voltage Curve is not supported
GPU[1] : Voltage Curve is not supported
================================================================================
=============================== Consumed Energy ================================
================================================================================
============================= End of ROCm SMI Log ==============================

So can you please help me, or point me to where I should go and what I should look into?

P.S.

With only one card in the PC, everything is read correctly, and when I start the mining daemon and set the fan parameters they work fine. When I add a second card, the fan parameters are ignored for the first card and applied only to the second; the first card goes to 90%.

Thx in advance.
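
For anyone comparing logs like the one above, here is a minimal sketch that pulls the per-GPU fan readings out of the "Current Fan Metric" section and lists the GPUs whose fan speed could not be detected. It assumes the exact line format shown in this log; the function name `parse_fan_metrics` and the dictionary keys are hypothetical, not part of rocm-smi.

```python
import re

# Patterns assume the line format seen in the rocm-smi log above.
FAN_LEVEL = re.compile(r"GPU\[(\d+)\]\s*:\s*Fan Level:\s*(\d+)\s*\((\d+)%\)")
FAN_RPM = re.compile(r"GPU\[(\d+)\]\s*:\s*Fan RPM:\s*(\d+)")
FAN_MISSING = re.compile(r"GPU\[(\d+)\]\s*:\s*Unable to detect fan speed")


def _empty_entry():
    # Hypothetical record layout, used only for this illustration.
    return {"level": None, "percent": None, "rpm": None, "detected": False}


def parse_fan_metrics(text):
    """Map each GPU index to the fan readings found in the log text."""
    gpus = {}
    for line in text.splitlines():
        missing = FAN_MISSING.search(line)
        if missing:
            # Record the GPU even though no reading is available.
            gpus.setdefault(int(missing.group(1)), _empty_entry())
            continue
        level = FAN_LEVEL.search(line)
        if level:
            entry = gpus.setdefault(int(level.group(1)), _empty_entry())
            entry.update(level=int(level.group(2)),
                         percent=int(level.group(3)),
                         detected=True)
            continue
        rpm = FAN_RPM.search(line)
        if rpm:
            entry = gpus.setdefault(int(rpm.group(1)), _empty_entry())
            entry.update(rpm=int(rpm.group(2)), detected=True)
    return gpus


# Lines copied from the fan section of the log above.
sample = """\
GPU[0] : Unable to detect fan speed for GPU 0
GPU[1] : Fan Level: 188 (74%)
GPU[1] : Fan RPM: 2441
"""

report = parse_fan_metrics(sample)
undetected = [idx for idx, entry in sorted(report.items())
              if not entry["detected"]]
print("GPUs without a fan reading:", undetected)
```

On this log the sketch flags only GPU 0. The reported level of 188 matches the 74% shown if the level is on a 0 to 255 scale (188 / 255 is about 0.737), which is an assumption based on the numbers in this log.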
