Hardcoded PCI address range #6
I don't actually know what the addresses printed in dmesg represent. However, I really doubt that the kernel will log the regions mapped by the driver. Anyway, I do believe that the hardcoded addresses in this region may be an issue for graphics cards that are not Pascal based. Unfortunately I do not have any other graphics card to test with. Do you have a graphics card that doesn't work with these addresses? |
First of all, you seem to have much more knowledge about these things; if you have any good sources for learning more about this stuff, I would appreciate it if you could share them with me. Secondly, I should probably clarify what I meant above. I tried your vgpu_unlock with a 1070 on an AMD 2950X system. |
You know what, that's pretty good information. I feel like it should be included in the README so that AMD users, more specifically AMD Zen users, can get help setting this software up on their machines. Great findings! |
I might be having the same issue. I'm not sure how to check the logging, but it's not working on my system. Specs:
Looked through the dmesg like @arki05 and found this line, which is similar to his. After changing the magic, I'm not sure if this is even the issue, but it's the only relevant problem I could find |
The fix only works if you are certain that it's a memory address range issue. Are you trying to emulate an RTX A40? That card seems to be a fair bit different as well, but that shouldn't be a limiting factor, as we have seen with the GTX 1060 running this script. Do you happen to have an Intel system to try this on? My Intel-based system didn't require an address change. I'm assuming the 10GB VRAM may require a slightly higher memory range though. |
I do have an Intel i3-9100F that I could spin up tomorrow if that's even going to work? |
In vgpu_unlock_hooks.c there is a section for enabling logs. You have to change the 0 to a 1
Try to enable logs and rebuild & reinstall the DKMS module. Reboot and post your dmesg / logs. That might help find the issue. |
Already enabled the logs, but I'm not sure where to find them. I can post the dmesg here tomorrow for sure |
It is likely the same issue. The addresses printed in dmesg are the PCI BARs (Base Address Registers) set up by the kernel; for more information on how that works, see this Wikipedia article: PCI configuration space. What we are interested in is BAR3 (documented here), which maps the card's VRAM onto the PCI bus. From my understanding, some code is written into the card's VRAM using this mapping, then the card's Falcon microprocessor is used to execute that code, which generates the magic and key values, and those can then be read back by the driver. Unfortunately I believe that the different generations of cards have different versions of the Falcon microprocessor, so the code used might not be the same. It is therefore also likely that the offset into BAR3 will have to be different for different generations of cards. As far as I know vgpu_unlock has only been tested on Pascal (10-series) graphics cards. If anyone is interested in providing additional log files for analysis, I would like MMIO-traces for the execution of |
RTX 30 series has resizable BAR support, so I would assume that this new generation of GPUs uses a larger memory space than previous cards. There should be a way to get a beginning and end value for that space, so would it work if you plugged those values into the script? |
It would, if you knew the offset of the magic and key values, and those offsets were constant. |
Alright, just ran the mmio trace and the dmesg. The MMIO trace was apparently really difficult or something, because I don't think I got it to work properly. dmesg_3080.log Will test my Intel system next |
Unfortunately, it doesn't look like the memory regions that I am interested in were accessed during the recording of that log file. We can list the NVIDIA devices found by mmiotrace (annotations and formatting added for readability):
The first device is the RTX 3080 GPU (PCI device id 0x2206), which we are interested in, and the second device is an audio device (probably for sound over HDMI), which is not interesting. We can see that there are three initialized BARs on the GPU: BAR0, BAR1 and BAR3. We can now look at all mapping commands:
Here we can see that BAR0 is mapped five times (ids 1, 2, 3, 36 and 57), but BAR3 is never mapped. Unfortunately, it is the values inside BAR3 that I am interested in. Documentation for MMIO-trace, including the log file format, can be found here. |
Hmm, I'll give it a try again then. I've also tried running the 3080 on Intel, but no luck there either. The PCI id was 0000:01:00.0 instead of 0000:c1:00.0, but it didn't change a thing. It might not be a PCI address issue after all. One odd thing I did notice was that the GPU temperature and power usage were really high on both systems after applying the mod. The 3080 was about 60°C after a while and drew about 160W. This is definitely not normal and only happens using this script. |
The 3080 is rather new, but I know there are implementations for the GA100 chip in driver 450 and 460, but I don't know if they have added GA102 yet, which the RTX A6000 and 3080 have. It might work in the future, though. Speaking of 450, have you tried out the 450 driver, or is it no longer available for download from the Nvidia Enterprise portal? |
The 460 driver supports both the RTX A6000 and the A40. The 450 doesn't, it only supports the A100. I've tried to install it but it would just give me an error (which is obvious I guess) |
I've noticed slightly higher idle wattages on mine too, but only 33 watts, which is not much. I believe this script was built mostly around Pascal, with not much testing done on newer generations like Ampere. Usually, much higher wattages, temperatures and fan speeds mean that the driver is unable to work properly with the graphics card. Notice that your GPU is sitting in the P0 high-performance power state despite idling. This isn't supposed to happen in normal operation and could mean that the script is preventing the driver from working as intended. I'm no expert by any means, but I figure a modified version of the script focused on Ampere's far larger memory space and other quirks of the Ampere generation could be made, either separately or as part of the same script, activating only upon detection of an Ampere card PCI ID. And one last thing, this is unrelated, but @FIFARenderZ do you plan on purchasing a license for vGPU after your trial license expires for real-time usage? Or are you trying out this setup for tinkering purposes? |
I have no idea why the GPU usage would be affected by the script. But the MMIO-trace is equally useful whether or not vgpu_unlock is used. So an MMIO-trace with an unmodified driver and Ampere GPU would be interesting. |
That would make sense, although I haven't seen any other power state mentioned in the
For tinkering purposes right now. Maybe we're able to do much more later 😉
For sure, which is what I was trying to do. It didn't work out for some reason and I'll give it another try tomorrow |
Tried another round of MMIO-tracing with no success. The driver works normally when booted, but once I start tracing (and disabled & enabled the driver) it just spits out "No device found". Checked the logs and it again contained no info about BAR3. @DualCoder Did you do anything different from the guide? Because I'm kind of at a loss right now PS: Maybe I'm doing something wrong, but every time I execute |
If anyone wants to join, https://discord.gg/mAz38ZBrjx |
We usually use the EEVblog forum to discuss this, but their data center caught fire. I joined your Telegram, but it would be nice if we all could have permission to post messages.
In theory it could, but based on @FIFARenderZ's experience with the RTX 3080, it may or may not work out. This script works with older generations like Pascal and Turing though. Also, the 3090 uses the GA102 which is the same as the 3080, so your chances of success are going to be about as high as everyone else with a 3080... |
This has been solved by DualCoder in 54d90cde
In the readme you wrote
"Physical PCI address range 0xf0000000-0xf1000000"
in my case the range turned out to be 0x4810000000-0x4811ffffff
The Address Range can be found in dmesg in lines like this:
pci 0000:0a:00.0: reg 0x1c: [mem 0x4810000000-0x4811ffffff 64bit pref]
This is probably due to Above 4G Decoding (not 100% sure, but my best guess)