Release Candidate - Testing Requested #76
I have prepared v3.2.0 Release Candidate 1 on master. I have tested on my 3 systems. Looks good so far. Please provide your experience here as verification/feedback before the release planned for this coming weekend. Thanks!
Comments
I will work on updating the User Guide.
BTW, I have been able to overclock and underclock the endpoints and undervolt the curve. |
I have modified the format of … To overclock, I assume you would not need to change the curve, but just define an operating point at a higher frequency than the stock highest. This may be limited by the OD_Range points. |
Good, got it. |
I have always been working to manage power, so I don't have much experience overclocking, though I have tried it with older cards in some benchmarking I was doing. The curve is what defines how AVFS works; the GPU is meant to operate on that curve. Perhaps the curve doesn't represent operating points beyond it accurately, so redefining an endpoint might make sense. It may be a good idea to plot the curve in Excel and see how any modified curve would compare, as sketched below. Another use case is instability in an aged card: maybe you can get more life out of it by shifting the whole curve by a voltage offset. |
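As a rough illustration of that plotting idea, here is a minimal Python sketch (matplotlib rather than Excel, and the curve points are made-up placeholders, not values read from a card) that shifts a voltage-frequency curve by a fixed offset and plots both for comparison:

```python
import matplotlib.pyplot as plt

# Hypothetical VF curve points (frequency MHz, voltage mV);
# real values would come from the card's curve data.
curve = [(800, 860), (1650, 1000), (2100, 1150)]
offset_mv = -50  # shift the whole curve down by 50 mV

shifted = [(freq, volt + offset_mv) for freq, volt in curve]

for points, label in ((curve, 'stock'), (shifted, f'{offset_mv:+d} mV offset')):
    freqs, volts = zip(*points)
    plt.plot(freqs, volts, marker='o', label=label)

plt.xlabel('Frequency (MHz)')
plt.ylabel('Voltage (mV)')
plt.title('Stock vs. offset VF curve (illustrative values)')
plt.legend()
plt.show()
```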
@csecht |
@csecht I have merged your pull request. Looks good! A couple of minor observations:
Applicable version should be 3.2.x
The plot example is from the old version. There are minor format changes in the latest.
The pac example for Type 1 cards is not the latest.
Have you been able to test the latest on master on your systems? On my systems, it is more responsive with the optimizations. |
Okay, I’ll update the plot and pac Type1 examples tomorrow.
Yes, I did notice that the monitor GUI launches very quickly. Nice. I’ll test the other modules tomorrow.
|
I implemented another optimization by using an Enum object in the definition of sensors instead of using names, which should be slightly faster. It was a major change, so a thorough review of the changes would be appreciated. |
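A minimal sketch of what an Enum-based sensor definition might look like (the names here are illustrative stand-ins, not the project's actual identifiers); comparing enum members is an identity check, which avoids repeated string comparisons:

```python
from enum import Enum, auto

class SensorType(Enum):
    # Illustrative sensor categories; the project's actual
    # definitions may differ.
    Temperature = auto()
    Power = auto()
    Frequency = auto()

def scale_reading(stype: SensorType, raw: str) -> float:
    # 'is' on enum members is a pointer comparison, slightly
    # faster than matching on name strings.
    if stype is SensorType.Temperature:
        return int(raw) / 1000.0  # hwmon reports millidegrees C
    return float(raw)
```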
It all looks good. Nice and responsive too. |
@csecht |
I got this error this morning with PAC whenever I try to change any parameter:
At which point it just hangs and I don't get a prompt to enter my sudo password. EDIT: I was able to successfully run a startup PAC BASH script, as a service, to change the sclk endpoint for that card, so the device files can be edited. |
@csecht |
I just pushed a more robust approach. |
Yes, that fixed it. |
Still not happy with the robustness of the solution, so I will delay the official release for a week. I did enhance critical temp reading and display values for all sensors in |
I think I have a more robust solution for dealing with variable types of numeric values in pac and monitor. While working on this, I implemented Enum for GPU Types and Vendors, so I no longer use numeric type indicators and use enumerated names instead. Perhaps the User Guide needs to be updated with these new type names:
Probably only the last two are relevant to the user. |
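For reference, a sketch of what such an enumeration might look like, using the four type names discussed in this thread (membership and naming in the actual code may differ):

```python
from enum import Enum, auto

class GpuType(Enum):
    # Type names as discussed in this thread; the project's
    # actual enum may differ.
    Undefined = auto()   # default until p-states are read
    PStatesNE = auto()   # p-states present but not editable
    PStates = auto()     # editable p-states
    CurvePts = auto()    # AVFS curve points

gpu_type = GpuType.Undefined
print(gpu_type.name)  # 'Undefined': an enumerated name, not a number
```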
I edited the User Guide accordingly and issued a pull request. "Type 0" was replaced with Type Undefined, etc. Type PStatesNE was not introduced in the guide. |
Actually, Type0 was used for an older GPU that had non-editable p-states, but the rewrite in 3.x seems to have eliminated that classification. I have one old card; maybe I will work with that to re-implement the PStatesNE type classification. |
Would Undefined be used as an initial default type? |
Yes, Undefined is the default type. It gets set to PStates or CurvePts when the p-states are read from the card's files. It looks like the code I had to set the Type for HD series is missing after the rewrite. I need to put an old card back in and work it out again with the new code base. I have made some user guide modifications, so be sure to pull the latest if you are going to make some edits. |
I have implemented a few more Enum objects and made a major change to how sensors are read. It should be much more efficient now. I think that was the last major change for release 3.2. I will release this weekend, so let me know if you see any issues. |
It looks like I gave away the R9 290x card I had, so I installed an older HD 7870 GPU. It had only a few parameters available, but I am not sure if this is due to not having amdgpu installed. I am using Ubuntu 20.04, and there is no amdgpu install package for it yet. Anyway, here is what I get with
Also, I am now reading the device ID details and decoding them from the pciid file for the non-readable onboard GPUs. Here is what I get for my server system:
The missing values in Legacy cards cause problems with monitor and plot, so I exclude them when getting a list of readable GPUs. |
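A minimal sketch of that kind of exclusion (GpuItem and its fields are hypothetical stand-ins for the project's GPU object):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class GpuItem:
    # Hypothetical stand-in for the project's GPU object.
    name: str
    readable: bool = False

def get_readable_gpus(gpu_list: List[GpuItem]) -> List[GpuItem]:
    # Legacy cards with missing sensor values would break monitor
    # and plot, so only cards flagged readable are returned.
    return [gpu for gpu in gpu_list if gpu.readable]
```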
So, for the User Guide, what is the difference between PStatesNE and Legacy cards? |
A minor point in formatting output from amdgpu-ls:
For Current Temps, the order of 'edge' and 'junction' ought to be switched to match the order in Critical Temps (or vice versa). |
I am concerned that the observations for HD 7870 are very different from what I observed for R9 290x. Not sure if it is a real difference, or an artifact of not having amdgpu driver package installed on my 20.04 system. Let's hold off documenting Legacy and PStatesNE until I get more clarity. |
Implemented sorting of dictionaries for print in the latest on master. |
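For what it's worth, sorted printing of a dictionary is short in Python (the parameter names and values here are made up):

```python
params = {'vddgfx': 1050, 'sclk': 1340, 'mclk': 1500}
for key in sorted(params):
    print(f'{key}: {params[key]}')
```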
I just remembered I had a Radeon HD 4650, so I installed it in my machine with Ubuntu 18.04, kernel 5.3.0, and amdgpu version 20.10-1048554, then ran
I get a similar error with all other amdgpu-utils commands, except
|
$ ./amdgpu-ls --debug |
Is that log from the latest on master? I added a few more log statements in the latest. |
Sorry. Here is the terminal stdout
and here is the debug file: |
I think I have covered the other places where card_path is referenced. Let me know when you get a chance to try it out. |
Got it. Here is the terminal
and the debug |
Looks like the readable flag was still True for unsupported GPUs. I fixed that. |
Hmmmm. The terminal:
and debug: |
Oops... Used wrong string format in logger. Fixed and pushed. |
Not quite...
|
It looks like the readable flag is still True. Not sure why, so I have added more logger statements. |
The debug says it's looking in the path /sys/devices/, but the only thing there is the CPU. Shouldn't it look in /sys/class/drm/ where the GPUs are? The HD 4650 is in the first PCI slot, so ...
In the /sys/devices directory:
|
The way to associate the correct card path is to look for the full system device path with the pcie_id in the pathname. This version of the pathname is derived from the typical card_path name using resolve. So I check the full system path of each potential card path for a match to the pcie_id; if a match is found, that card path is associated with the pcie_id. For this card, no match is found. |
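A minimal sketch of that matching approach (find_card_path is a hypothetical helper, not the project's actual function):

```python
from pathlib import Path
from typing import Optional

def find_card_path(pcie_id: str) -> Optional[Path]:
    # Resolve each candidate card path to its full system device
    # path, e.g. /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0,
    # and check whether the pcie_id appears in it.
    for device in Path('/sys/class/drm').glob('card[0-9]/device'):
        if pcie_id in str(device.resolve()):
            return device
    return None
```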
From the log file: this card has a pcie_id of 01:00.0, and there are 2 potential card paths: 0 & 1.
Neither matches the pcie_id, so the GPU type is set to Unsupported. |
Even if we find that there is a valid card path, we still need to fix the issue where an unsupported card is interpreted as readable. Let's get this one fixed first, then work on the potential issue of matching a pcie_id to a card path. |
To make the card path details clearer, I have added the system card path to the output of |
I have discovered an inconsistency in the way I was accessing the list of GPUs. Maybe this was the source of unreadable cards being read. But the real problem was that I was only checking the readability flag in |
Yes! It's working now, dealing with the unsupported card:
Now, about that card path...
A grep for those pci-ids shows that card's path is /sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/
And...
|
A minor point of formatting:
To match the terminal output of the |
I am going to need to think about how to deal with cards that don't have a normal card_path; I am currently only examining the system paths of card paths that exist. I will work on it over the weekend. Hope you don't mind, but I have made significant changes across all modules to deal with the issue causing confusion in the way I access GPUs in a GPU list. The code is now much more intuitive. I have only tried it on one of my systems, but it is getting late here, so I will push to master. Let me know if you find any issues. It also includes the help format change. |
I ran through all the commands and everything is working. Nice.
|
Still researching how to get the /sys/devices path for a specific pcie ID. My first attempt is this code:
But it maxes out the CPU for a long time and hasn't returned anything useful yet. Still need to do some research. |
Maybe the use of a naked '*' is too greedy. Would a more explicit regex work?
Or this, which is more general but uses a dot to give '*' something to work on, while '?' removes the greediness of '*':
I tested these regexes out on https://pythex.org/ and both seem to work for matching up to the pcie-id. |
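A quick demonstration of the greediness difference on a sys-path-like string (note that the '.' characters in a pcie-id should be escaped, e.g. r'01:00\.0', or they act as wildcards):

```python
import re

s = '/sys/devices/pci0000:00/0000:00:01.0/0000:01:00.0/drm/card0'

# Greedy '.*0' consumes as much as possible before its final '0'.
print(re.match(r'.*0', s).group())   # whole string, through 'card0'
# Non-greedy '.*?0' stops at the first '0' it can.
print(re.match(r'.*?0', s).group())  # '/sys/devices/pci0'
```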
Yes, my first approach was to use a compiled regex, but it seems like glob only uses wildcards. The main problem with recursive glob seems to be that there are symbolic links back into the directory structure, so it may be getting into an infinite loop. My final solution is to make multiple glob calls, each deeper than the previous. You can see the code in this commit. Seems like in your case the files in the sys device directory are only generic pcie files, so I still classify the card as unsupported. Let me know of any concerns. |
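A minimal sketch of that multiple-glob idea (the helper name and depth limit are illustrative assumptions, not the project's actual code):

```python
from pathlib import Path
from typing import Optional

def find_sys_device_path(pcie_id: str, max_depth: int = 6) -> Optional[Path]:
    # One bounded glob call per depth level sidesteps a single
    # recursive glob, which can chase symlinks back up the tree.
    base = Path('/sys/devices')
    for depth in range(1, max_depth + 1):
        pattern = '*/' * (depth - 1) + '*' + pcie_id
        for match in base.glob(pattern):
            return match
    return None
```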
Looks good. From 'amdgpu-ls':
|
Thanks for the confirmation! I have run all of the utilities on all 3 of my systems and all looks good. I plan to release tomorrow. Let me know if you notice any additional issues. |
No issues found. All modules check out fine on my local and remote hosts. |
I made an official release and announced on the SETI@Home message board. |
I just downloaded the master branch with your latest pull request #85 merged and everything looks good.
Thanks for the confirmation. I will close this issue and raise one for the next major release plan. |