Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

incoherent power settings and behaviour of Vega frontier edition #67

Closed
akostadinov opened this issue Jan 1, 2019 · 6 comments
Closed

Comments

@akostadinov
Copy link

Hi, I was playing with overdrive settings on my Vega Frontier Edition vs rocm-dkms-2.0.89-1.x86_64

I enabled overdrive settings and used rocm-smi to set some settings. By pure luck I'm observing a very strange behaviour. First of all I'm aware of ROCm/ROCm#463 (comment) and ROCm/ROCm#564 (comment) and these are the commands I prepared:

rocm-smi --setfan 60%

rocm-smi --autorespond y --setmlevel 2 1100 903
rocm-smi --autorespond y --setmlevel 3 1100 905

rocm-smi --autorespond y --setslevel 2 992 901
rocm-smi --autorespond y --setslevel 3 993 902
rocm-smi --autorespond y --setslevel 4 994 903
rocm-smi --autorespond y --setslevel 5 995 904
rocm-smi --autorespond y --setslevel 6 996 905
rocm-smi --autorespond y --setslevel 7 1000 906

With these settings my benchmark (ethminer) worked fine initially hitting 41.5MH/s but quickly card was throttled to 37MH/s. smi showed:

========================        ROCm System Management Interface        ========================
================================================================================================
GPU   Temp   AvgPwr   SCLK    MCLK    PCLK           Fan     Perf    PwrCap   SCLK OD   MCLK OD  GPU%
0     69c    207.0W   1000Mhz 1100Mhz 8.0GT/s, x16   66.67%  manual  220W     26806%    16%      100%     
================================================================================================
========================               End of ROCm SMI Log              ========================

Earlier by mistake I tried different (not according to recommendations) settings which turn out to work much better though:

rocm-smi --setfan 60%

rocm-smi --autorespond y --setslevel 5 1000 906
rocm-smi --autorespond y --setslevel 6 1000 906
rocm-smi --autorespond y --setslevel 7 1000 906

rocm-smi --autorespond y --setmlevel 3 1100 950

With the above settings the card works cooler and lower power at 41.5MH/s stable. smi shows:

========================        ROCm System Management Interface        ========================
================================================================================================
GPU   Temp   AvgPwr   SCLK    MCLK    PCLK           Fan     Perf    PwrCap   SCLK OD   MCLK OD  GPU%
0     43c    141.0W   1000Mhz 1100Mhz 8.0GT/s, x16   66.67%  manual  220W     26806%    16%      99%      
================================================================================================
========================               End of ROCm SMI Log              ========================

The only difference is AvgPwr which dropped to 141W from 207 and card is cooler as a result thus performance not thermal throttled.

I have verified that inside /sys/class/drm/card0/device/pp_od_clk_voltage settings are what were expected by my rocm-smi commands (luck_settings.txt vs bad_settings.txt).

So question is why the supposedly incorrect settings work way better than the recommended ones although gpu/mem clocks appear to be the same?

P.S. another interesting thing is why Fan works at 66% instead of 60% as command should set it to?

@akostadinov akostadinov changed the title incoherent power behaviour of Vega frontier edition incoherent power settings and behaviour of Vega frontier edition Jan 1, 2019
@kentrussell
Copy link
Contributor

For the fan, it was an issue with the SMI taking values at different times and giving different results. That part should be fine in 2.2 now that I handled a race condition where the value and the percent weren't matching up, giving skewed percentages.

As for the incorrect settings, I am unsure as it could just be the quirk of the GPU itself in terms of how it handles the voltages and what "optimal" values are. Maybe @Jgreathouse has an idea.

@jlgreathouse
Copy link
Contributor

IIRC, I've observed the same behavior but I have not had the bandwidth to root cause it. Hunting down power idiosyncrasies takes a tremendous amount of work since it crosses half a dozen layers of software and firmware.

@akostadinov
Copy link
Author

I suspect that this is issue with firmware because I remember some strange results doing my ill-fated windows attempts one year ago. Given that difference is more than 25% it must be worth for AMD to invest the resources and figure out what is doing crap. It will be a huge performance/power advantage of AMD if its cards start working more efficiently just by fixing some power profile selection algo or something.

@jlgreathouse
Copy link
Contributor

It's been a while since I tested this, but IIRC I was only able to reproduce the problem when setting custom powerplay tables rather than using the tables we ship in the GPU's VBIOS. This significantly reduces the value proposition, since the vast majority of normal users aren't setting custom powerplay tables.

@akostadinov
Copy link
Author

Do you mean powerplay tables like changing in the driver? I didn't change them. Only run the commands in description. Given different use cases can't be covered well with default power/clock settings, in case user can easily tune them, that will be a huge advantage for AMD. I think for compute use cases this would be used a lot.

25% difference was only between "properly" optimized settings and settings that seem improper but effectively work awesome. If I run on vanilla default settings (no clock setting) card tops at 250W with lower performance. So changing power/clock settings can result in 50% improved characteristics.

IMHO that's worth it. On the other hand I trust that AMD has a clue what must be a priority.

Regards.

kentrussell pushed a commit that referenced this issue Nov 30, 2022
To set the panel orientation property with quirk, we need the mode size
provided by EDID. This info is available after EDID is read by dc_link_detect()
and updated by amdgpu_dm_update_connector_after_detect(). The detection
happens at driver load in amdgpu_dm_initialize_drm_device() and,
therefore, we can get modes and set panel orientation before
drm_dev_register() to avoid DRM warns on creating the connector property
after device registration:

[    2.563969] ------------[ cut here ]------------
[    2.563971] WARNING: CPU: 6 PID: 325 at drivers/gpu/drm/drm_mode_object.c:45 drm_mode_object_add+0x72/0x80 [drm]
[    2.563997] Modules linked in: btusb btrtl btbcm btintel btmtk bluetooth rfkill ecdh_generic ecc usbhid crc16 amdgpu(+) drm_ttm_helper ttm agpgart gpu_sched i2c_algo_bit drm_display_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm serio_raw sdhci_pci atkbd libps2 cqhci vivaldi_fmap ccp sdhci i8042 crct10dif_pclmul crc32_pclmul hid_multitouch ghash_clmulni_intel aesni_intel crypto_simd cryptd wdat_wdt mmc_core cec xhci_pci sp5100_tco rng_core xhci_pci_renesas serio 8250_dw i2c_hid_acpi i2c_hid btrfs blake2b_generic libcrc32c crc32c_generic crc32c_intel xor raid6_pq dm_mirror dm_region_hash dm_log dm_mod pkcs8_key_parser crypto_user
[    2.564032] CPU: 6 PID: 325 Comm: systemd-udevd Not tainted 5.18.0-amd-staging-drm-next+ #67
[    2.564034] Hardware name: Valve Jupiter/Jupiter, BIOS F7A0105 03/21/2022
[    2.564036] RIP: 0010:drm_mode_object_add+0x72/0x80 [drm]
[    2.564053] Code: f0 89 c3 85 c0 78 07 89 45 00 44 89 65 04 4c 89 ef e8 e2 99 04 f1 31 c0 85 db 0f 4e c3 5b 5d 41 5c 41 5d c3 80 7f 50 00 74 ac <0f> 0b eb a8 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 41 54 4c
[    2.564055] RSP: 0018:ffffb2e880413860 EFLAGS: 00010202
[    2.564056] RAX: ffffffffc0ba1440 RBX: ffff99508a860010 RCX: 0000000000000001
[    2.564057] RDX: 00000000b0b0b0b0 RSI: ffff99508c050110 RDI: ffff99508a860010
[    2.564058] RBP: ffff99508c050110 R08: 0000000000000020 R09: ffff99508c292c20
[    2.564059] R10: 0000000000000000 R11: ffff99508c0507d8 R12: 00000000b0b0b0b0
[    2.564060] R13: 0000000000000004 R14: ffffffffc068a4b6 R15: ffffffffc068a47f
[    2.564061] FS:  00007fc69b5f1a40(0000) GS:ffff9953aff80000(0000) knlGS:0000000000000000
[    2.564063] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.564063] CR2: 00007f9506804000 CR3: 0000000107f92000 CR4: 0000000000350ee0
[    2.564065] Call Trace:
[    2.564068]  <TASK>
[    2.564070]  drm_property_create+0xc9/0x170 [drm]
[    2.564088]  drm_property_create_enum+0x1f/0x70 [drm]
[    2.564105]  drm_connector_set_panel_orientation_with_quirk+0x96/0xc0 [drm]
[    2.564123]  get_modes+0x4fb/0x530 [amdgpu]
[    2.564378]  drm_helper_probe_single_connector_modes+0x1ad/0x850 [drm_kms_helper]
[    2.564390]  drm_client_modeset_probe+0x229/0x1400 [drm]
[    2.564411]  ? xas_store+0x52/0x5e0
[    2.564416]  ? kmem_cache_alloc_trace+0x177/0x2c0
[    2.564420]  __drm_fb_helper_initial_config_and_unlock+0x44/0x4e0 [drm_kms_helper]
[    2.564430]  drm_fbdev_client_hotplug+0x173/0x210 [drm_kms_helper]
[    2.564438]  drm_fbdev_generic_setup+0xa5/0x166 [drm_kms_helper]
[    2.564446]  amdgpu_pci_probe+0x35e/0x370 [amdgpu]
[    2.564621]  local_pci_probe+0x45/0x80
[    2.564625]  ? pci_match_device+0xd7/0x130
[    2.564627]  pci_device_probe+0xbf/0x220
[    2.564629]  ? sysfs_do_create_link_sd+0x69/0xd0
[    2.564633]  really_probe+0x19c/0x380
[    2.564637]  __driver_probe_device+0xfe/0x180
[    2.564639]  driver_probe_device+0x1e/0x90
[    2.564641]  __driver_attach+0xc0/0x1c0
[    2.564643]  ? __device_attach_driver+0xe0/0xe0
[    2.564644]  ? __device_attach_driver+0xe0/0xe0
[    2.564646]  bus_for_each_dev+0x78/0xc0
[    2.564648]  bus_add_driver+0x149/0x1e0
[    2.564650]  driver_register+0x8f/0xe0
[    2.564652]  ? 0xffffffffc1023000
[    2.564654]  do_one_initcall+0x44/0x200
[    2.564657]  ? kmem_cache_alloc_trace+0x177/0x2c0
[    2.564659]  do_init_module+0x4c/0x250
[    2.564663]  __do_sys_init_module+0x12e/0x1b0
[    2.564666]  do_syscall_64+0x3b/0x90
[    2.564670]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[    2.564673] RIP: 0033:0x7fc69bff232e
[    2.564674] Code: 48 8b 0d 45 0b 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 12 0b 0c 00 f7 d8 64 89 01 48
[    2.564676] RSP: 002b:00007ffe872ba3e8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[    2.564677] RAX: ffffffffffffffda RBX: 000055873f797820 RCX: 00007fc69bff232e
[    2.564678] RDX: 000055873f7bf390 RSI: 0000000001155e81 RDI: 00007fc699e4d010
[    2.564679] RBP: 00007fc699e4d010 R08: 000055873f7bfe20 R09: 0000000001155e90
[    2.564680] R10: 000000055873f7bf R11: 0000000000000246 R12: 000055873f7bf390
[    2.564681] R13: 000000000000000d R14: 000055873f7c4cb0 R15: 000055873f797820
[    2.564683]  </TASK>
[    2.564683] ---[ end trace 0000000000000000 ]---
[    2.564696] ------------[ cut here ]------------
[    2.564696] WARNING: CPU: 6 PID: 325 at drivers/gpu/drm/drm_mode_object.c:242 drm_object_attach_property+0x52/0x80 [drm]
[    2.564717] Modules linked in: btusb btrtl btbcm btintel btmtk bluetooth rfkill ecdh_generic ecc usbhid crc16 amdgpu(+) drm_ttm_helper ttm agpgart gpu_sched i2c_algo_bit drm_display_helper drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm serio_raw sdhci_pci atkbd libps2 cqhci vivaldi_fmap ccp sdhci i8042 crct10dif_pclmul crc32_pclmul hid_multitouch ghash_clmulni_intel aesni_intel crypto_simd cryptd wdat_wdt mmc_core cec xhci_pci sp5100_tco rng_core xhci_pci_renesas serio 8250_dw i2c_hid_acpi i2c_hid btrfs blake2b_generic libcrc32c crc32c_generic crc32c_intel xor raid6_pq dm_mirror dm_region_hash dm_log dm_mod pkcs8_key_parser crypto_user
[    2.564738] CPU: 6 PID: 325 Comm: systemd-udevd Tainted: G        W         5.18.0-amd-staging-drm-next+ #67
[    2.564740] Hardware name: Valve Jupiter/Jupiter, BIOS F7A0105 03/21/2022
[    2.564741] RIP: 0010:drm_object_attach_property+0x52/0x80 [drm]
[    2.564759] Code: 2d 83 f8 18 74 33 48 89 74 c1 08 48 8b 4f 08 48 89 94 c1 c8 00 00 00 48 8b 47 08 83 00 01 c3 4d 85 d2 75 dd 83 7f 58 01 75 d7 <0f> 0b eb d3 41 80 78 50 00 74 cc 0f 0b eb c8 44 89 ce 48 c7 c7 28
[    2.564760] RSP: 0018:ffffb2e8804138d8 EFLAGS: 00010246
[    2.564761] RAX: 0000000000000010 RBX: ffff99508c1a2000 RCX: ffff99508c1a2180
[    2.564762] RDX: 0000000000000003 RSI: ffff99508c050100 RDI: ffff99508c1a2040
[    2.564763] RBP: 00000000ffffffff R08: ffff99508a860010 R09: 00000000c0c0c0c0
[    2.564763] R10: 0000000000000000 R11: 0000000000000020 R12: ffff99508a860010
[    2.564764] R13: ffff995088733008 R14: ffff99508c1a2000 R15: ffffffffc068a47f
[    2.564765] FS:  00007fc69b5f1a40(0000) GS:ffff9953aff80000(0000) knlGS:0000000000000000
[    2.564766] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    2.564767] CR2: 00007f9506804000 CR3: 0000000107f92000 CR4: 0000000000350ee0
[    2.564768] Call Trace:
[    2.564769]  <TASK>
[    2.564770]  drm_connector_set_panel_orientation_with_quirk+0x4a/0xc0 [drm]
[    2.564789]  get_modes+0x4fb/0x530 [amdgpu]
[    2.565024]  drm_helper_probe_single_connector_modes+0x1ad/0x850 [drm_kms_helper]
[    2.565036]  drm_client_modeset_probe+0x229/0x1400 [drm]
[    2.565056]  ? xas_store+0x52/0x5e0
[    2.565060]  ? kmem_cache_alloc_trace+0x177/0x2c0
[    2.565062]  __drm_fb_helper_initial_config_and_unlock+0x44/0x4e0 [drm_kms_helper]
[    2.565072]  drm_fbdev_client_hotplug+0x173/0x210 [drm_kms_helper]
[    2.565080]  drm_fbdev_generic_setup+0xa5/0x166 [drm_kms_helper]
[    2.565088]  amdgpu_pci_probe+0x35e/0x370 [amdgpu]
[    2.565261]  local_pci_probe+0x45/0x80
[    2.565263]  ? pci_match_device+0xd7/0x130
[    2.565265]  pci_device_probe+0xbf/0x220
[    2.565267]  ? sysfs_do_create_link_sd+0x69/0xd0
[    2.565268]  really_probe+0x19c/0x380
[    2.565270]  __driver_probe_device+0xfe/0x180
[    2.565272]  driver_probe_device+0x1e/0x90
[    2.565274]  __driver_attach+0xc0/0x1c0
[    2.565276]  ? __device_attach_driver+0xe0/0xe0
[    2.565278]  ? __device_attach_driver+0xe0/0xe0
[    2.565279]  bus_for_each_dev+0x78/0xc0
[    2.565281]  bus_add_driver+0x149/0x1e0
[    2.565283]  driver_register+0x8f/0xe0
[    2.565285]  ? 0xffffffffc1023000
[    2.565286]  do_one_initcall+0x44/0x200
[    2.565288]  ? kmem_cache_alloc_trace+0x177/0x2c0
[    2.565290]  do_init_module+0x4c/0x250
[    2.565291]  __do_sys_init_module+0x12e/0x1b0
[    2.565294]  do_syscall_64+0x3b/0x90
[    2.565296]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[    2.565297] RIP: 0033:0x7fc69bff232e
[    2.565298] Code: 48 8b 0d 45 0b 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 49 89 ca b8 af 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 12 0b 0c 00 f7 d8 64 89 01 48
[    2.565299] RSP: 002b:00007ffe872ba3e8 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[    2.565301] RAX: ffffffffffffffda RBX: 000055873f797820 RCX: 00007fc69bff232e
[    2.565302] RDX: 000055873f7bf390 RSI: 0000000001155e81 RDI: 00007fc699e4d010
[    2.565303] RBP: 00007fc699e4d010 R08: 000055873f7bfe20 R09: 0000000001155e90
[    2.565303] R10: 000000055873f7bf R11: 0000000000000246 R12: 000055873f7bf390
[    2.565304] R13: 000000000000000d R14: 000055873f7c4cb0 R15: 000055873f797820
[    2.565306]  </TASK>
[    2.565307] ---[ end trace 0000000000000000 ]---

--

v2:
- call amdgpu_dm_connector_get_modes() instead of ddc_get_modes() (Harry)

Fixes: d77de78 ("amd/display: enable panel orientation quirks")
Acked-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Melissa Wen <mwen@igalia.com>
Signed-off-by: Alex Deucher <alexander.deucher@amd.com>
@ppanchad-amd
Copy link

@akostadinov Apologies for the lack of response. Vega Frontier Edition boards are no longer supported on the latest ROCm 6.2. Thanks!

@ppanchad-amd ppanchad-amd closed this as not planned Won't fix, can't repro, duplicate, stale Aug 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants