The template below is mostly useful for bug reports and support questions. Feel free to remove anything which doesn't apply to you and add more information where it makes sense.
1. Quick Debug Checklist
1. Issue or feature description
Question about nvidia-peermem.ko
As I understand it, nvidia-peermem.ko is not installed by the CUDA Driver Container.
If I want to install nvidia-peermem.ko, which approach should I take?
1) Install nvidia-peermem.ko and keep using the CUDA Driver Container.
2) Install the CUDA driver / NVIDIA driver on the host.
If option 1 is possible, could you provide instructions for installing nvidia-peermem.ko?
References
https://docs.nvidia.com/cuda/gpudirect-rdma/index.html#nvidia-peermem
https://gitlab.com/nvidia/container-images/driver/-/blob/master/ubuntu20.04/nvidia-driver#L201-L204
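For context, a minimal sketch of how one might check whether the module is currently loaded on a node. It assumes the module shows up in /proc/modules as nvidia_peermem (the kernel lists module names with underscores instead of dashes). The helper name module_loaded is hypothetical and not part of any NVIDIA tooling.

```shell
# Hypothetical helper: succeed if a kernel module appears in a modules list.
# $1 = module name (dashes are converted to underscores, as the kernel lists them)
# $2 = modules list file (optional, defaults to /proc/modules)
module_loaded() {
    name=$(printf '%s' "$1" | tr '-' '_')
    grep -q "^${name} " "${2:-/proc/modules}" 2>/dev/null
}

if module_loaded nvidia-peermem; then
    echo "nvidia_peermem is loaded"
else
    echo "nvidia_peermem is not loaded"
fi
```

This only reports whether the module is loaded on the machine where it runs. It does not show whether the driver container built or shipped the module.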
2. Steps to reproduce the issue
3. Information to attach (optional if deemed irrelevant)
kubernetes pods status:
kubectl get pods --all-namespaces
kubernetes daemonset status:
kubectl get ds --all-namespaces
If a pod/ds is in an error state or pending state:
kubectl describe pod -n NAMESPACE POD_NAME
If a pod/ds is in an error state or pending state:
kubectl logs -n NAMESPACE POD_NAME
Output of running a container on the GPU machine:
docker run -it alpine echo foo
Docker configuration file:
cat /etc/docker/daemon.json
Docker runtime configuration:
docker info | grep runtime
NVIDIA shared directory:
ls -la /run/nvidia
NVIDIA packages directory:
ls -la /usr/local/nvidia/toolkit
NVIDIA driver directory:
ls -la /run/nvidia/driver
kubelet logs:
journalctl -u kubelet > kubelet.logs
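The commands listed above could be gathered in one pass with a small script. This is only a sketch: the collect helper, the OUT_DIR variable, and the output file names are hypothetical, and the two commands that need NAMESPACE and POD_NAME are left out because they depend on the failing pod. The -it flags are dropped from docker run so the script can run without a terminal.

```shell
#!/bin/sh
# Sketch: run each diagnostic command and save its output under OUT_DIR.
OUT_DIR="${OUT_DIR:-./gpu-operator-debug}"   # hypothetical output directory
mkdir -p "$OUT_DIR"

collect() {
    # $1 = output file name; remaining arguments = the command to run
    out="$OUT_DIR/$1"
    shift
    printf '%s\n' "\$ $*" > "$out"           # record the command itself first
    "$@" >> "$out" 2>&1 || echo "(command failed with status $?)" >> "$out"
}

collect pods.txt            kubectl get pods --all-namespaces
collect daemonsets.txt      kubectl get ds --all-namespaces
collect docker-run.txt      docker run alpine echo foo
collect docker-daemon.txt   cat /etc/docker/daemon.json
collect docker-runtime.txt  sh -c 'docker info | grep runtime'
collect nvidia-shared.txt   ls -la /run/nvidia
collect nvidia-toolkit.txt  ls -la /usr/local/nvidia/toolkit
collect nvidia-driver.txt   ls -la /run/nvidia/driver
collect kubelet.txt         journalctl -u kubelet
```

Each output file starts with the command that produced it. A failed or missing command is recorded in its file rather than stopping the script, so a partial set of results can still be attached.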