nvidia-bug-report.sh requested for further information regarding why the script failed. #1

Closed
KrutavShah opened this issue Feb 5, 2021 · 3 comments

Comments

@KrutavShah
Contributor

Hello, the README says that the script is not working at the moment, so could you please run nvidia-bug-report.sh to generate a report of all the messages and errors emitted by the vGPU manager and vgpud? That will help in figuring out what went wrong. Thank you.

@DualCoder
Owner

Here is the requested file: nvidia-bug-report.log.gz

The interesting section is this:

Feb 13 15:20:08 Debian-dom0 nvidia-vgpu-mgr[1562]: notice: vmiop_env_log: vmiop-env: guest_max_gpfn:0x0
Feb 13 15:20:08 Debian-dom0 nvidia-vgpu-mgr[1562]: notice: vmiop_env_log: (0x0): Received start call from nvidia-vgpu-vfio module: mdev uuid 38512783-4893-47f7-9179-b0594167e86b GPU PCI id 00:01:00.0 config params vgpu_type_id=50
Feb 13 15:20:08 Debian-dom0 nvidia-vgpu-mgr[1562]: notice: vmiop_env_log: (0x0): pluginconfig: vgpu_type_id=50
Feb 13 15:20:08 Debian-dom0 nvidia-vgpu-mgr[1562]: notice: vmiop_env_log: Successfully updated env symbols!
Feb 13 15:20:08 Debian-dom0 nvidia-vgpu-mgr[1562]: error: vmiop_log: NVOS status 0x56
Feb 13 15:20:08 Debian-dom0 nvidia-vgpu-mgr[1562]: error: vmiop_log: Assertion Failed at 0xf69873bf:293
Feb 13 15:20:08 Debian-dom0 nvidia-vgpu-mgr[1562]: error: vmiop_log: 11 frames returned by backtrace
Feb 13 15:20:08 Debian-dom0 nvidia-vgpu-mgr[1562]: error: vmiop_log: /lib/x86_64-linux-gnu/libnvidia-vgpu.so(_nv005021vgpu+0x18) [0x7ff3f69cc3c8]
Feb 13 15:20:08 Debian-dom0 nvidia-vgpu-mgr[1562]: error: vmiop_log: /lib/x86_64-linux-gnu/libnvidia-vgpu.so(+0xa3e3b) [0x7ff3f6982e3b]
Feb 13 15:20:08 Debian-dom0 nvidia-vgpu-mgr[1562]: error: vmiop_log: /lib/x86_64-linux-gnu/libnvidia-vgpu.so(+0xa83bf) [0x7ff3f69873bf]
Feb 13 15:20:08 Debian-dom0 nvidia-vgpu-mgr[1562]: error: vmiop_log: /lib/x86_64-linux-gnu/libnvidia-vgpu.so(+0xa98c7) [0x7ff3f69888c7]
Feb 13 15:20:08 Debian-dom0 nvidia-vgpu-mgr[1562]: error: vmiop_log: vgpu() [0x413e72]
Feb 13 15:20:08 Debian-dom0 nvidia-vgpu-mgr[1562]: error: vmiop_log: vgpu() [0x4140e9]
Feb 13 15:20:08 Debian-dom0 nvidia-vgpu-mgr[1562]: error: vmiop_log: vgpu() [0x40e9d7]
Feb 13 15:20:08 Debian-dom0 nvidia-vgpu-mgr[1562]: error: vmiop_log: vgpu() [0x40c2c9]
Feb 13 15:20:08 Debian-dom0 nvidia-vgpu-mgr[1562]: error: vmiop_log: vgpu() [0x40bc7c]
Feb 13 15:20:08 Debian-dom0 nvidia-vgpu-mgr[1562]: error: vmiop_log: /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xeb) [0x7ff3f6e7109b]
Feb 13 15:20:08 Debian-dom0 nvidia-vgpu-mgr[1562]: error: vmiop_log: vgpu() [0x4033ba]
Feb 13 15:20:08 Debian-dom0 nvidia-vgpu-mgr[1562]: error: vmiop_log: (0x0): init_device_instance failed for inst 0 with error 1 (error setting vGPU configuration information from RM)
Feb 13 15:20:08 Debian-dom0 nvidia-vgpu-mgr[1562]: error: vmiop_log: (0x0): Initialization: init_device_instance failed error 1
Feb 13 15:20:08 Debian-dom0 nvidia-vgpu-mgr[1562]: error: vmiop_log: display_init failed for inst: 0
Feb 13 15:20:08 Debian-dom0 nvidia-vgpu-mgr[1562]: error: vmiop_env_log: (0x0): vmiope_process_configuration: plugin registration error
Feb 13 15:20:08 Debian-dom0 nvidia-vgpu-mgr[1562]: error: vmiop_env_log: (0x0): vmiope_process_configuration failed with 0x1f
Feb 13 15:20:08 Debian-dom0 kernel: [nvidia-vgpu-vfio] 38512783-4893-47f7-9179-b0594167e86b: start failed. status: 0x1
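
One thing the backtrace above allows is a quick cross-check of the assertion address. The following is only a sketch: it assumes the `Assertion Failed at 0xf69873bf:293` message prints the low 32 bits of the faulting address, which is not confirmed anywhere in this thread. The helper names (`module_relative_frames`, `load_base`, `find_assert_frame`) are made up for illustration; the log lines are copied from the excerpt.

```python
import re

# Matches frames reported relative to the start of libnvidia-vgpu.so,
# e.g. "libnvidia-vgpu.so(+0xa83bf) [0x7ff3f69873bf]".
# Symbol-relative frames such as "(_nv005021vgpu+0x18)" are skipped on purpose.
FRAME_RE = re.compile(r"libnvidia-vgpu\.so\(\+0x([0-9a-f]+)\) \[0x([0-9a-f]+)\]")

# Copied from the log excerpt in this comment.
LOG_LINES = [
    "vmiop_log: /lib/x86_64-linux-gnu/libnvidia-vgpu.so(_nv005021vgpu+0x18) [0x7ff3f69cc3c8]",
    "vmiop_log: /lib/x86_64-linux-gnu/libnvidia-vgpu.so(+0xa3e3b) [0x7ff3f6982e3b]",
    "vmiop_log: /lib/x86_64-linux-gnu/libnvidia-vgpu.so(+0xa83bf) [0x7ff3f69873bf]",
    "vmiop_log: /lib/x86_64-linux-gnu/libnvidia-vgpu.so(+0xa98c7) [0x7ff3f69888c7]",
]
ASSERT_ADDR = 0xF69873BF  # from "Assertion Failed at 0xf69873bf:293"


def module_relative_frames(lines):
    """Return (offset, absolute_address) pairs for module-relative frames."""
    frames = []
    for line in lines:
        match = FRAME_RE.search(line)
        if match:
            frames.append((int(match.group(1), 16), int(match.group(2), 16)))
    return frames


def load_base(frames):
    """Every module-relative frame should imply the same load address."""
    bases = {addr - off for off, addr in frames}
    if len(bases) != 1:
        raise ValueError(f"inconsistent load addresses: {sorted(map(hex, bases))}")
    return bases.pop()


def find_assert_frame(frames, assert_addr):
    """Assumption: the assertion message shows only the low 32 bits."""
    for off, addr in frames:
        if addr & 0xFFFFFFFF == assert_addr:
            return off
    return None


frames = module_relative_frames(LOG_LINES)
print(hex(load_base(frames)))                       # library load address
print(hex(find_assert_frame(frames, ASSERT_ADDR)))  # offset of the asserting frame
```

If that assumption holds, the failing assertion sits at offset `0xa83bf` inside libnvidia-vgpu.so, which is the offset to look up in a disassembler.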

According to nvstatuscodes.h, status code 0x56 is NV_STATUS_CODE(NV_ERR_NOT_SUPPORTED, 0x00000056, "Call not supported"). As far as I can tell, it is returned by code inside the nv-kernel.o_binary file, and I have not been able to figure out why.
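
For anyone scanning a similar report, here is a minimal sketch of pulling the `NVOS status` value out of a log line and mapping it to a name. The lookup table contains only the single entry quoted from nvstatuscodes.h above; the table name `KNOWN_STATUS` and the function `decode_nvos_status` are hypothetical helpers, not part of any NVIDIA tool.

```python
import re

# Only 0x56 is taken from this thread; add further entries from
# nvstatuscodes.h by hand if you need them.
KNOWN_STATUS = {
    0x56: ("NV_ERR_NOT_SUPPORTED", "Call not supported"),
}

STATUS_RE = re.compile(r"NVOS status (0x[0-9a-fA-F]+)")


def decode_nvos_status(line):
    """Return (code, name, description) for a log line, or None if absent."""
    match = STATUS_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1), 16)
    name, text = KNOWN_STATUS.get(code, ("UNKNOWN", "not in local table"))
    return code, name, text


sample = ("Feb 13 15:20:08 Debian-dom0 nvidia-vgpu-mgr[1562]: "
          "error: vmiop_log: NVOS status 0x56")
print(decode_nvos_status(sample))
```

The same function can be applied line by line to the decompressed report, for example by iterating over `gzip.open("nvidia-bug-report.log.gz", "rt", errors="replace")`.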

@KrutavShah
Contributor Author

KrutavShah commented Feb 13, 2021

error 1 (error setting vGPU configuration information from RM)
As far as I know, this is a pretty typical error when you’re using the wrong graphics card. Because Nvidia only officially supports Red Hat Linux, I used it for these tests and ran the Red Hat hypervisor nested on top of a KVM hypervisor. The level 1 hypervisor spoofed the PCI ID so that Red Hat would detect a “Tesla P4,” and instead of getting loads of errors, I usually got this same error 1. I haven’t gone through your whole bug report yet, but I will compare it with some of my previous testing and try to dig up a few details. Right now I have a feeling that it has to do with ECC memory, which has to be disabled for vGPU to work at all. On GeForce, however, you can’t turn ECC on or off, so some additional modifications will be needed. I’ll let you know when more information surfaces.

@DualCoder
Owner

I am closing this issue, since the bug report has been provided and the new README explains what causes the failure.
