Use device_create to ensure /dev nodes are created correctly. #547

Open
wants to merge 1 commit into base: main

arnej27959

After following the CUDA installation instructions on RHEL 8.8, I ran into problems later on. Debugging with system call tracing showed that the cause was missing device nodes: some of them, such as /dev/nvidia-uvm and /dev/nvidiactl, did not exist. There are tips in
https://docs.nvidia.com/cuda/cuda-installation-guide-linux/index.html#device-node-verification
on how to fix this manually, but that should not really be necessary.
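
For anyone who wants to confirm the symptom without tracing system calls, here is a minimal userland sketch that checks whether the nodes exist as character devices. The helper names `is_char_device` and `count_missing_nodes` are made up for illustration and are not part of the driver or of any NVIDIA tool.

```c
#include <stdio.h>
#include <sys/stat.h>

/* Returns 1 if `path` exists and is a character device, 0 otherwise. */
int is_char_device(const char *path)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return 0;               /* missing or not accessible */
    return S_ISCHR(st.st_mode) ? 1 : 0;
}

/* Reports every path that is not a character device and returns
 * how many such paths were found. */
int count_missing_nodes(const char *const paths[], int n)
{
    int missing = 0;

    for (int i = 0; i < n; i++) {
        if (!is_char_device(paths[i])) {
            fprintf(stderr, "missing device node: %s\n", paths[i]);
            missing++;
        }
    }
    return missing;
}
```

Calling `count_missing_nodes` with "/dev/nvidiactl" and "/dev/nvidia-uvm" should return 0 on a machine where the nodes were created correctly.
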
Currently, rules in /usr/lib/udev/rules.d/60-nvidia.rules create these nodes using "mknod", but "journalctl" showed that they fail randomly:

Aug 16 09:35:34 gpu-test-arnej-1 sudo[6801]: arnej_yahooinc_com : TTY=pts/0 ; PWD=/home/arnej_yahooinc_com ; USER=root ; COMMAND=/bin/nvidia-modprobe
Aug 16 09:35:34 gpu-test-arnej-1 sudo[6801]: pam_unix(sudo:session): session opened for user root by arnej_yahooinc_com(uid=0)
Aug 16 09:35:35 gpu-test-arnej-1 kernel: nvidia: module license 'NVIDIA' taints kernel.
Aug 16 09:35:35 gpu-test-arnej-1 kernel: Disabling lock debugging due to kernel taint
Aug 16 09:35:35 gpu-test-arnej-1 systemd-udevd[614]: Network interface NamePolicy= disabled on kernel command line, ignoring.
Aug 16 09:35:35 gpu-test-arnej-1 kernel: nvidia-nvlink: Nvlink Core is being initialized, major device number 240
Aug 16 09:35:35 gpu-test-arnej-1 kernel: NVRM: loading NVIDIA UNIX x86_64 Kernel Module  535.86.10  Wed Jul 26 23:20:03 UTC 2023
Aug 16 09:35:35 gpu-test-arnej-1 systemd-udevd[6804]: Process '/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidiactl c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) 255'' failed with exit code 1.
Aug 16 09:35:35 gpu-test-arnej-1 systemd-udevd[6812]: Process '/usr/bin/bash -c '/usr/bin/mknod -Z -m 666 /dev/nvidiactl c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) 255'' failed with exit code 1.
Aug 16 09:35:35 gpu-test-arnej-1 systemd-udevd[6804]: Process '/usr/bin/bash -c 'for i in $(cat /proc/driver/nvidia/gpus/*/information | grep Minor | cut -d \  -f 4); do /usr/bin/mknod -Z -m 666 /dev/nvidia${i} c $(grep nvidia-frontend /proc/devices | cut -d \  -f 1) ${i}; done'' failed with exit code 1.
Aug 16 09:35:35 gpu-test-arnej-1 kernel: nvidia-uvm: Loaded the UVM driver, major device number 238.
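
The failing rules look up the major number with `grep nvidia-frontend /proc/devices | cut ...` before calling "mknod", so they depend on that entry being present at the moment the rule runs. The sketch below shows the same lookup in C on /proc/devices-style text. The function name `find_char_major` is hypothetical, and the sketch does not claim to identify the actual cause of the failures in the log above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns the major number registered under `name` in the
 * "Character devices:" section of /proc/devices-style text,
 * or -1 if there is no such entry. */
int find_char_major(const char *text, const char *name)
{
    int in_char_section = 0;
    const char *line = text;

    assert(text != NULL && name != NULL);

    while (*line != '\0') {
        const char *end = strchr(line, '\n');
        size_t full = end ? (size_t)(end - line) : strlen(line);
        size_t len = full;
        char buf[256];
        char dev[128];
        int major;

        /* Copy one line into a bounded buffer. */
        if (len >= sizeof buf)
            len = sizeof buf - 1;
        memcpy(buf, line, len);
        buf[len] = '\0';

        if (strcmp(buf, "Character devices:") == 0)
            in_char_section = 1;
        else if (strcmp(buf, "Block devices:") == 0)
            in_char_section = 0;
        else if (in_char_section &&
                 sscanf(buf, "%d %127s", &major, dev) == 2 &&
                 strcmp(dev, name) == 0)
            return major;

        /* Advance past this line and its newline, if any. */
        line += full;
        if (*line == '\n')
            line++;
    }
    return -1;
}
```

When the lookup returns -1, there is no major number to pass on, which is the situation in which a shell pipeline like the one in the rules would hand "mknod" an empty argument.
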

Best practice, however, is for the device driver to trigger node creation directly by calling the device_create() kernel function; using "mknod" in udev rules is not the usual way to solve this.

This PR calls device_create() as needed, and device_destroy() to clean up when the module or device is detached. I tested it with the udev rules disabled, using "rmmod" and "modprobe" to unload and load the modules, and also confirmed that it works on reboot.
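
To illustrate the general pattern (not the actual diff in this PR), here is a minimal sketch of a standalone kernel module that registers a character device and lets the kernel create and remove its /dev node. All `demo_*` names are hypothetical. The version check around class_create() is an assumption that may not hold on distribution kernels with backported API changes, and node permissions are left out.

```c
#include <linux/cdev.h>
#include <linux/device.h>
#include <linux/err.h>
#include <linux/fs.h>
#include <linux/module.h>
#include <linux/version.h>

#define DEMO_NAME "demo_node"   /* hypothetical device name */

static dev_t demo_devt;
static struct cdev demo_cdev;
static struct class *demo_class;

static const struct file_operations demo_fops = {
	.owner = THIS_MODULE,
};

static int __init demo_init(void)
{
	struct device *dev;
	int ret;

	/* Let the kernel pick a free major number. */
	ret = alloc_chrdev_region(&demo_devt, 0, 1, DEMO_NAME);
	if (ret)
		return ret;

	cdev_init(&demo_cdev, &demo_fops);
	ret = cdev_add(&demo_cdev, demo_devt, 1);
	if (ret)
		goto err_region;

	/* class_create() lost its module argument in mainline 6.4. */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(6, 4, 0)
	demo_class = class_create(DEMO_NAME);
#else
	demo_class = class_create(THIS_MODULE, DEMO_NAME);
#endif
	if (IS_ERR(demo_class)) {
		ret = PTR_ERR(demo_class);
		goto err_cdev;
	}

	/* Registers the device; devtmpfs/udev then provide /dev/demo_node. */
	dev = device_create(demo_class, NULL, demo_devt, NULL, "%s", DEMO_NAME);
	if (IS_ERR(dev)) {
		ret = PTR_ERR(dev);
		goto err_class;
	}
	return 0;

err_class:
	class_destroy(demo_class);
err_cdev:
	cdev_del(&demo_cdev);
err_region:
	unregister_chrdev_region(demo_devt, 1);
	return ret;
}

static void __exit demo_exit(void)
{
	/* Tear down in reverse order; device_destroy() removes the node. */
	device_destroy(demo_class, demo_devt);
	class_destroy(demo_class);
	cdev_del(&demo_cdev);
	unregister_chrdev_region(demo_devt, 1);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
MODULE_DESCRIPTION("Sketch: /dev node created via device_create()");
```

With this pattern the major number never has to be looked up from user space, since the driver passes the dev_t it already holds.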

@kanashimia

The original justification for not using device_create is probably that it is marked as EXPORT_SYMBOL_GPL and as such can't be used in a proprietary module. That is not a problem in the open kernel module, but it adds another difference between the two.

@CLAassistant

CLAassistant commented Jun 6, 2024

CLA assistant check
All committers have signed the CLA.
