A40 not supported? #21

Closed
dimm0 opened this issue Jul 9, 2021 · 7 comments · Fixed by #22

Comments

@dimm0

dimm0 commented Jul 9, 2021

Logs from daemonset pod:

2021/07/09 00:38:05 Not a device, continuing
2021/07/09 00:38:05 Nvidia device  0000:01:00.0
2021/07/09 00:38:05 Iommu Group 0
2021/07/09 00:38:05 Device Id 2235
2021/07/09 00:38:05 Nvidia device  0000:81:00.0
2021/07/09 00:38:05 Iommu Group 60
2021/07/09 00:38:05 Device Id 2235
2021/07/09 00:38:05 Error accessing file path "/sys/bus/mdev/devices": lstat /sys/bus/mdev/devices: no such file or directory
2021/07/09 00:38:05 Iommu Map map[0:[{0000:01:00.0}] 60:[{0000:81:00.0}]]
2021/07/09 00:38:05 Device Map map[2235:[0 60[]]
2021/07/09 00:38:05 vGPU Map  map[]
2021/07/09 00:38:05 GPU vGPU Map  map[]
2021/07/09 00:38:05 Error: Could not find device name for device id: 2235
root@k8s-usra-01:/home/nautilus# lspci -nnk -d 10de:
01:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:2235] (rev a1)
	Subsystem: NVIDIA Corporation Device [10de:145a]
	Kernel driver in use: vfio-pci
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
81:00.0 3D controller [0302]: NVIDIA Corporation Device [10de:2235] (rev a1)
	Subsystem: NVIDIA Corporation Device [10de:145a]
	Kernel driver in use: vfio-pci
	Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

I don't see the A40 in /usr/pci.ids either.

@rthallisey
Collaborator

Your devices are still being seen, but the name is showing up as 2235.

2021/07/09 00:38:05 Nvidia device  0000:01:00.0
2021/07/09 00:38:05 Iommu Group 0
2021/07/09 00:38:05 Device Id 2235
2021/07/09 00:38:05 Nvidia device  0000:81:00.0

2021/07/09 00:38:05 Error: Could not find device name for device id: 2235
The code that creates this error gets the id like this:

# cat /sys/bus/pci/devices/0000\:81\:00.0/device
0x2235

Can you paste what shows up on your nodes?

Similar issue: #20
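
For readers following along, the sysfs lookup shown above can be sketched in a few lines of Python. This is only an illustration, not the device plugin's own code: the helper names `normalize_device_id` and `read_device_id` are hypothetical, and the only thing taken from the thread is that the `device` file holds a value such as `0x2235`.

```python
from pathlib import Path


def normalize_device_id(raw: str) -> str:
    """Turn sysfs text such as '0x2235\\n' into the bare hex id '2235'."""
    text = raw.strip().lower()
    return text[2:] if text.startswith("0x") else text


def read_device_id(pci_address: str, sysfs_root: str = "/sys/bus/pci/devices") -> str:
    """Read <sysfs_root>/<pci_address>/device and return the normalized id.

    Hypothetical helper: it raises FileNotFoundError when the address is absent.
    """
    device_file = Path(sysfs_root) / pci_address / "device"
    return normalize_device_id(device_file.read_text())
```

On the node from this issue, `read_device_id("0000:81:00.0")` would be expected to yield the same id that appears in the log lines, provided the file contains `0x2235` as shown.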

@dimm0
Author

dimm0 commented Jul 9, 2021

I fixed this by adding the 2235 GPU to the /usr/pci.ids file; it works fine after that.
What do you mean by what shows up on my node? Where?

@rthallisey
Collaborator

The output of kubectl get nodes -o yaml <node> will have an Allocatable section for the devices. Can you paste that?
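
If it helps, the relevant part of that output can be filtered with a short script. The sketch below assumes the node object was exported as JSON (for example with `-o json` instead of `-o yaml`) and that GPU resources are advertised under an `nvidia.com/` prefix. The sample document, the resource name `nvidia.com/GA102GL_A40`, and the quantities are illustrative assumptions, not output from the reporter's cluster.

```python
import json

# Illustrative node excerpt; every value here is an assumption, not real output.
SAMPLE_NODE_JSON = """
{
  "status": {
    "allocatable": {
      "cpu": "64",
      "memory": "263856Mi",
      "nvidia.com/GA102GL_A40": "2"
    }
  }
}
"""


def allocatable_with_prefix(node: dict, prefix: str = "nvidia.com/") -> dict:
    """Return only the allocatable resources whose name starts with `prefix`."""
    allocatable = node.get("status", {}).get("allocatable", {})
    return {name: qty for name, qty in allocatable.items() if name.startswith(prefix)}


if __name__ == "__main__":
    node = json.loads(SAMPLE_NODE_JSON)
    print(allocatable_with_prefix(node))
```

An empty result would match the situation described in this issue, where nothing new appears in the allocatable resources.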

@dimm0
Author

dimm0 commented Jul 9, 2021

As I said, it's fixed now and shows the GPU the way I added it to /usr/pci.ids. Before I modified the image, nothing new was appearing in the node's allocatable resources, since the pod was stopping with the error that I provided.

But I had to extend your image and add my modified /usr/pci.ids. I was hoping you could do it the right way in your image. Currently the /usr/pci.ids in this repo has no definition for the A40.
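
To make the workaround concrete, here is a rough sketch of how a name lookup against a pci.ids-style file behaves before and after an entry is added. It is not the plugin's parser. The function `find_device_name` is hypothetical, the sample text is a made-up excerpt, and the `GA102GL [A40]` label is an assumption modeled on the upstream PCI ID database rather than a line copied from this repo.

```python
# Illustrative excerpt in pci.ids layout: vendor lines are unindented,
# device lines start with one tab, subsystem lines start with two tabs.
SAMPLE_PCI_IDS = """\
# Assumed entries for illustration only.
10de  NVIDIA Corporation
\t2235  GA102GL [A40]
\t\t10de 145a  Made-up subsystem line, skipped by the lookup
1234  Made-up vendor
\t2235  Made-up device under another vendor
"""


def find_device_name(pci_ids_text, vendor_id, device_id):
    """Return the device name for vendor_id:device_id, or None if it is missing."""
    vendor_id, device_id = vendor_id.lower(), device_id.lower()
    current_vendor = None
    for line in pci_ids_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank line or comment
        if line.startswith("\t\t"):
            continue  # subsystem entry, not needed for a device lookup
        parts = line.split(maxsplit=1)
        if line.startswith("\t"):
            # Device entry: only meaningful under the vendor we are looking for.
            if current_vendor == vendor_id and parts[0].lower() == device_id:
                return parts[1].strip() if len(parts) > 1 else ""
            continue
        current_vendor = parts[0].lower()  # new vendor (or class) section
    return None


if __name__ == "__main__":
    print(find_device_name(SAMPLE_PCI_IDS, "10de", "2235"))
    print(find_device_name(SAMPLE_PCI_IDS, "10de", "20b0"))
```

With the device line present the lookup returns a name. Without it the function returns `None`, which corresponds to the "Could not find device name for device id: 2235" error in the logs.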

@rthallisey
Collaborator

> Before I modified the image, nothing new was appearing in the node's allocatable resources, since the pod was stopping with the error that I provided.

OK, I misunderstood; I didn't realize the pod was exiting. Looking at your logs, the device was registered as 2235, so I figured the device plugin (dp) would continue and use that as the name on the node. TBH, it sounds like a bug that the dp exits early. Regardless, I'll add the A40 (the A100 is needed too) to the dp's pci.ids to get this unblocked.
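
The fallback behavior described here, where an unknown id is still used as the name instead of stopping, could look roughly like the sketch below. It illustrates the idea only and is not a patch for the plugin: `resource_name` is a hypothetical helper, and the rule for turning a device name into a resource-friendly string is an assumption.

```python
import re


def resource_name(device_id, known_names):
    """Pick a resource name for a device id, falling back to the raw id.

    `known_names` maps device ids to human-readable names (for example,
    entries loaded from pci.ids). The sanitizing rule is an assumption.
    """
    name = known_names.get(device_id)
    if name is None:
        # Unknown device: keep going and advertise it under its id.
        return device_id
    # Collapse anything that is not a letter or digit into underscores.
    return re.sub(r"[^A-Za-z0-9]+", "_", name).strip("_")


if __name__ == "__main__":
    print(resource_name("2235", {"2235": "GA102GL [A40]"}))
    print(resource_name("2235", {}))
```

With a fallback like this, a missing pci.ids entry would only change the advertised name rather than block registration.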

@dimm0
Author

dimm0 commented Jul 9, 2021

It was not crashlooping, but it was not going any further either.

@dimm0
Author

dimm0 commented Jul 9, 2021

Thanks!

dimm0 closed this as completed Jul 9, 2021