Getting GPU device minor number: Not Supported #332

Open

zengzhengrong opened this issue Sep 6, 2022 · 32 comments
@zengzhengrong


1. Issue or feature description

helm install nvidia-device-plugin

 helm upgrade -i nvdp nvdp/nvidia-device-plugin --namespace nvidia-device-plugin --create-namespace --version 0.12.2 

nvidia-device-plugin-ctr logs

2022/09/06 15:24:00 Starting FS watcher.
2022/09/06 15:24:00 Starting OS watcher.
2022/09/06 15:24:00 Starting Plugins.
2022/09/06 15:24:00 Loading configuration.
2022/09/06 15:24:00 Initializing NVML.
2022/09/06 15:24:00 Updating config with default resource matching patterns.
2022/09/06 15:24:00 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "nvidiaDriverRoot": "/",
    "plugin": {
      "passDeviceSpecs": true,
      "deviceListStrategy": "envvar",
      "deviceIDStrategy": "index"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
2022/09/06 15:24:00 Retreiving plugins.
panic: Unable to load resource managers to manage plugin devices: error building device map: error building device map from config.resources: error building GPU device map: error building GPU Device: error getting device paths: error getting GPU device minor number: Not Supported

goroutine 1 [running]:
main.(*migStrategyNone).GetPlugins(0xc000010a30)
	/build/cmd/nvidia-device-plugin/mig-strategy.go:57 +0x1a5
main.startPlugins(0xc0000e5c58?, {0xc0001cc460, 0x9, 0xe}, 0x9?)
	/build/cmd/nvidia-device-plugin/main.go:247 +0x4bd
main.start(0x10d7b20?, {0xc0001cc460, 0x9, 0xe})
	/build/cmd/nvidia-device-plugin/main.go:147 +0x355
main.main.func1(0xc0001cc460?)
	/build/cmd/nvidia-device-plugin/main.go:43 +0x32
github.com/urfave/cli/v2.(*App).RunContext(0xc0001e8820, {0xca9328?, 0xc00003a050}, {0xc000032230, 0x1, 0x1})
	/build/vendor/github.com/urfave/cli/v2/app.go:322 +0x953
github.com/urfave/cli/v2.(*App).Run(...)
	/build/vendor/github.com/urfave/cli/v2/app.go:224
main.main()
	/build/cmd/nvidia-device-plugin/main.go:91 +0x665

When I use ctr to run a test, the GPU works fine:

ctr run --rm --gpus 0 nvcr.io/nvidia/k8s/cuda-sample:nbody test-gpu /tmp/nbody -benchmark

Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
        -fullscreen       (run n-body simulation in fullscreen mode)
        -fp64             (use double precision floating point values for simulation)
        -hostmem          (stores simulation data in host memory)
        -benchmark        (run benchmark to measure performance) 
        -numbodies=<N>    (number of bodies (>= 1) to run in simulation) 
        -device=<d>       (where d=0,1,2.... for the CUDA device to use)
        -numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
        -compare          (compares simulation results running once on the default GPU and once on the CPU)
        -cpu              (run n-body simulation on the CPU)
        -tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Pascal" with compute capability 6.1

> Compute 6.1 CUDA device: [NVIDIA GeForce GTX 1060 3GB]
9216 bodies, total time for 10 iterations: 7.467 ms
= 113.747 billion interactions per second
= 2274.931 single-precision GFLOP/s at 20 flops per interaction

3. Information to attach (optional if deemed irrelevant)

Common error checking:

  • The output of nvidia-smi -a on your host
 nvidia-smi -a
 
 ==============NVSMI LOG==============
 
 Timestamp                                 : Tue Sep  6 15:30:06 2022
 Driver Version                            : 516.94
 CUDA Version                              : 11.7
 
 Attached GPUs                             : 1
 GPU 00000000:01:00.0
     Product Name                          : NVIDIA GeForce GTX 1060 3GB
     Product Brand                         : GeForce
     Product Architecture                  : Pascal
     Display Mode                          : Enabled
     Display Active                        : Enabled
     Persistence Mode                      : Enabled
     MIG Mode
         Current                           : N/A
         Pending                           : N/A
     Accounting Mode                       : Disabled
     Accounting Mode Buffer Size           : 4000
     Driver Model
         Current                           : WDDM
         Pending                           : WDDM
     Serial Number                         : N/A
     GPU UUID                              : GPU-9445de88-eb50-477d-ff7c-5e0d77cdb203
     Minor Number                          : N/A
     VBIOS Version                         : 86.06.3c.00.2e
     MultiGPU Board                        : No
     Board ID                              : 0x100
     GPU Part Number                       : N/A
     Module ID                             : 0
     Inforom Version
         Image Version                     : G001.0000.01.04
         OEM Object                        : 1.1
         ECC Object                        : N/A
         Power Management Object           : N/A
     GPU Operation Mode
         Current                           : N/A
         Pending                           : N/A
     GSP Firmware Version                  : N/A
     GPU Virtualization Mode
         Virtualization Mode               : None
         Host VGPU Mode                    : N/A
     IBMNPU
         Relaxed Ordering Mode             : N/A
     PCI
         Bus                               : 0x01
         Device                            : 0x00
         Domain                            : 0x0000
         Device Id                         : 0x1C0210DE
         Bus Id                            : 00000000:01:00.0
         Sub System Id                     : 0x11C210DE
         GPU Link Info
             PCIe Generation
                 Max                       : 3
                 Current                   : 3
             Link Width
                 Max                       : 16x
                 Current                   : 16x
         Bridge Chip
             Type                          : N/A
             Firmware                      : N/A
         Replays Since Reset               : 0
         Replay Number Rollovers           : 0
         Tx Throughput                     : 0 KB/s
         Rx Throughput                     : 8000 KB/s
     Fan Speed                             : 42 %
     Performance State                     : P5
     Clocks Throttle Reasons
         Idle                              : Active
         Applications Clocks Setting       : Not Active
         SW Power Cap                      : Not Active
         HW Slowdown                       : Not Active
             HW Thermal Slowdown           : Not Active
             HW Power Brake Slowdown       : Not Active
         Sync Boost                        : Not Active
         SW Thermal Slowdown               : Not Active
         Display Clock Setting             : Not Active
     FB Memory Usage
         Total                             : 3072 MiB
         Reserved                          : 84 MiB
         Used                              : 2407 MiB
         Free                              : 580 MiB
     BAR1 Memory Usage
         Total                             : 256 MiB
         Used                              : 2 MiB
         Free                              : 254 MiB
     Compute Mode                          : Default
     Utilization
         Gpu                               : 3 %
         Memory                            : 5 %
         Encoder                           : 0 %
         Decoder                           : 0 %
     Encoder Stats
         Active Sessions                   : 0
         Average FPS                       : 0
         Average Latency                   : 0
     FBC Stats
         Active Sessions                   : 0
         Average FPS                       : 0
         Average Latency                   : 0
     Ecc Mode
         Current                           : N/A
         Pending                           : N/A
     ECC Errors
         Volatile
             Single Bit            
                 Device Memory             : N/A
                 Register File             : N/A
                 L1 Cache                  : N/A
                 L2 Cache                  : N/A
                 Texture Memory            : N/A
                 Texture Shared            : N/A
                 CBU                       : N/A
                 Total                     : N/A
             Double Bit            
                 Device Memory             : N/A
                 Register File             : N/A
                 L1 Cache                  : N/A
                 L2 Cache                  : N/A
                 Texture Memory            : N/A
                 Texture Shared            : N/A
                 CBU                       : N/A
                 Total                     : N/A
         Aggregate
             Single Bit            
                 Device Memory             : N/A
                 Register File             : N/A
                 L1 Cache                  : N/A
                 L2 Cache                  : N/A
                 Texture Memory            : N/A
                 Texture Shared            : N/A
                 CBU                       : N/A
                 Total                     : N/A
             Double Bit            
                 Device Memory             : N/A
                 Register File             : N/A
                 L1 Cache                  : N/A
                 L2 Cache                  : N/A
                 Texture Memory            : N/A
                 Texture Shared            : N/A
                 CBU                       : N/A
                 Total                     : N/A
     Retired Pages
         Single Bit ECC                    : N/A
         Double Bit ECC                    : N/A
         Pending Page Blacklist            : N/A
     Remapped Rows                         : N/A
     Temperature
         GPU Current Temp                  : 45 C
         GPU Shutdown Temp                 : 102 C
         GPU Slowdown Temp                 : 99 C
         GPU Max Operating Temp            : N/A
         GPU Target Temperature            : 83 C
         Memory Current Temp               : N/A
         Memory Max Operating Temp         : N/A
     Power Readings
         Power Management                  : Supported
         Power Draw                        : 12.16 W
         Power Limit                       : 120.00 W
         Default Power Limit               : 120.00 W
         Enforced Power Limit              : 120.00 W
         Min Power Limit                   : 60.00 W
         Max Power Limit                   : 140.00 W
     Clocks
         Graphics                          : 683 MHz
         SM                                : 683 MHz
         Memory                            : 810 MHz
         Video                             : 607 MHz
     Applications Clocks
         Graphics                          : N/A
         Memory                            : N/A
     Default Applications Clocks
         Graphics                          : N/A
         Memory                            : N/A
     Max Clocks
         Graphics                          : 1911 MHz
         SM                                : 1911 MHz
         Memory                            : 4004 MHz
         Video                             : 1708 MHz
     Max Customer Boost Clocks
         Graphics                          : N/A
     Clock Policy
         Auto Boost                        : N/A
         Auto Boost Default                : N/A
     Voltage
         Graphics                          : N/A
     Processes                             : None
  • Your docker configuration file (e.g: /etc/docker/daemon.json)
  • The k8s-device-plugin container logs
  • The kubelet logs on the node (e.g: sudo journalctl -r -u kubelet)

Additional information that might help better understand your environment and reproduce the bug:

  • Any relevant kernel output lines from dmesg
  • NVIDIA packages version from dpkg -l '*nvidia*' or rpm -qa '*nvidia*'
dpkg -l '*nvidia*'
Desired=Unknown/Install/Remove/Purge/Hold
| Status=Not/Inst/Conf-files/Unpacked/halF-conf/Half-inst/trig-aWait/Trig-pend
|/ Err?=(none)/Reinst-required (Status,Err: uppercase=bad)
||/ Name                          Version      Architecture Description
+++-=============================-============-============-=====================================================
ii  libnvidia-container-tools     1.10.0-1     amd64        NVIDIA container runtime library (command-line tools)
ii  libnvidia-container1:amd64    1.10.0-1     amd64        NVIDIA container runtime library
ii  nvidia-container-runtime      3.10.0-1     all          NVIDIA container runtime
un  nvidia-container-runtime-hook <none>       <none>       (no description available)
ii  nvidia-container-toolkit      1.10.0-1     amd64        NVIDIA container runtime hook
  • NVIDIA container library version from nvidia-container-cli -V
nvidia-container-cli -V
cli-version: 1.10.0
lib-version: 1.10.0
build date: 2022-06-13T10:39+00:00
build revision: 395fd41701117121f1fd04ada01e1d7e006a37ae
build compiler: x86_64-linux-gnu-gcc-8 8.3.0
build platform: x86_64
build flags: -D_GNU_SOURCE -D_FORTIFY_SOURCE=2 -DNDEBUG -std=gnu11 -O2 -g -fdata-sections -ffunction-sections -fplan9-extensions -fstack-protector -fno-strict-aliasing -fvisibility=hidden -Wall -Wextra -Wcast-align -Wpointer-arith -Wmissing-prototypes -Wnonnull -Wwrite-strings -Wlogical-op -Wformat=2 -Wmissing-format-attribute -Winit-self -Wshadow -Wstrict-prototypes -Wunreachable-code -Wconversion -Wsign-conversion -Wno-unknown-warning-option -Wno-format-extra-args -Wno-gnu-alignof-expression -Wl,-zrelro -Wl,-znow -Wl,-zdefs -Wl,--gc-sections
nvidia-container-cli list  
/dev/dxg
/usr/lib/wsl/drivers/nv_dispi.inf_amd64_47917a79b8c7fd22/nvidia-smi
/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.1
/usr/lib/x86_64-linux-gnu/libcuda.so.1
/usr/lib/x86_64-linux-gnu/libdxcore.so

containerd config (containerd.toml)

[plugins."io.containerd.grpc.v1.cri".containerd]
      default_runtime_name = "nvdia"
      disable_snapshot_annotations = true
      discard_unpacked_layers = false
      no_pivot = false
      snapshotter = "overlayfs"

      [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime]
        base_runtime_spec = ""
        container_annotations = []
        pod_annotations = []
        privileged_without_host_devices = false
        runtime_engine = ""
        runtime_root = ""
        runtime_type = "io.containerd.runtime.v1.linux"

        [plugins."io.containerd.grpc.v1.cri".containerd.default_runtime.options]
        Runtime = "nvidia-container-runtime"
      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]

        [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvdia]
          base_runtime_spec = ""
          container_annotations = []
          pod_annotations = []
          privileged_without_host_devices = false
          runtime_engine = ""
          runtime_root = ""
          runtime_type = "io.containerd.runtime.v1.linux"

          [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvdia.options]
            BinaryName = "/usr/bin/nvidia-container-runtime"
            Runtime = "nvidia-container-runtime"
            CriuImagePath = ""
            CriuPath = ""
            CriuWorkPath = ""
            IoGid = 0
            IoUid = 0
            NoNewKeyring = false
            NoPivotRoot = false
            Root = ""
            ShimCgroup = ""
            SystemdCgroup = false

      [plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime]
        base_runtime_spec = ""
        container_annotations = []
        pod_annotations = []
        privileged_without_host_devices = false
        runtime_engine = ""
        runtime_root = ""
        runtime_type = ""

        [plugins."io.containerd.grpc.v1.cri".containerd.untrusted_workload_runtime.options]

    [plugins."io.containerd.grpc.v1.cri".image_decryption]
      key_model = "node"

    [plugins."io.containerd.grpc.v1.cri".registry]
      config_path = ""

      [plugins."io.containerd.grpc.v1.cri".registry.auths]

      [plugins."io.containerd.grpc.v1.cri".registry.configs]

      [plugins."io.containerd.grpc.v1.cri".registry.headers]

      [plugins."io.containerd.grpc.v1.cri".registry.mirrors]
[plugins."io.containerd.runtime.v1.linux"]
    no_shim = false
    runtime = "nvidia-container-runtime"
    runtime_root = ""
    shim = "containerd-shim"
    shim_debug = false
@elezar
Member

elezar commented Sep 7, 2022

You seem to be running the device plugin under WSL2. This is not currently a supported use case of the device plugin. The specific reason is that device nodes on WSL2 and Linux systems are not the same and as such the CPU Manager Workaround (which includes the device nodes in the container being launched) does not work as expected.
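
For reference, the error string in the panic comes from an NVML query for the device minor number, which backs the /dev/nvidia<minor> device nodes that do not exist under WSL2. A minimal reproduction sketch using the go-nvml bindings (illustrative code only, not the plugin's own source) could look like this:

package main

import (
	"fmt"
	"log"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func main() {
	// NVML itself initializes fine on WSL2, since libnvidia-ml.so.1 is provided by the host driver.
	if ret := nvml.Init(); ret != nvml.SUCCESS {
		log.Fatalf("failed to initialize NVML: %v", nvml.ErrorString(ret))
	}
	defer nvml.Shutdown()

	device, ret := nvml.DeviceGetHandleByIndex(0)
	if ret != nvml.SUCCESS {
		log.Fatalf("failed to get device 0: %v", nvml.ErrorString(ret))
	}

	// On WSL2 this query is expected to return ERROR_NOT_SUPPORTED, which the
	// device plugin surfaces as "error getting GPU device minor number: Not Supported".
	minor, ret := device.GetMinorNumber()
	switch ret {
	case nvml.ERROR_NOT_SUPPORTED:
		fmt.Println("minor number: Not Supported (no /dev/nvidia* nodes on this system)")
	case nvml.SUCCESS:
		fmt.Printf("minor number: %d (device node /dev/nvidia%d)\n", minor, minor)
	default:
		log.Fatalf("GetMinorNumber failed: %v", nvml.ErrorString(ret))
	}
}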

@zengzhengrong
Author

You seem to be running the device plugin under WSL2. This is not currently a supported use case of the device plugin. The specific reason is that device nodes on WSL2 and Linux systems are not the same and as such the CPU Manager Workaround (which includes the device nodes in the container being launched) does not work as expected.

All right. I followed this guide, https://docs.nvidia.com/cuda/wsl-user-guide/index.html#getting-started-with-cuda-on-wsl, to install CUDA on WSL. Looking at the known limitations, https://docs.nvidia.com/cuda/wsl-user-guide/index.html#known-limitations-for-linux-cuda-apps, there is no restriction on installing k8s on WSL, and running the GPU via the ctr command works well.

@wizpresso-steve-cy-fan

@elezar would you guys put this into the roadmap? Our company is running Windows but we want to transition to Linux, so WSL2 seems like a natural choice. We are running deep-learning workloads that require CUDA support, and while Docker Desktop does support GPU workloads, it would be strange not to see this work in normal WSL2 containers as well.

@patrykkaj

Hi @elezar ,
in case it's unlikely to appear on the roadmap soon, could you please describe a rough plan of how the support should be added? And whether executing the plan would be doable by outside contributors?
Thanks!

@elezar
Member

elezar commented Sep 29, 2022

@patrykkaj I think that in theory this could be done by outside contributors and is simplified by the recent changes to support Tegra-based systems. What I can see happening here is that:

  1. We detect whether this is a WSL2 system (e.g. by checking for the presence of dxcore.so.1); see the detection sketch at the end of this comment.
  2. We modify / extend the NVML resource manager to create a device that does not require the device minor number.

Some things to note here:

  • On WSL2 systems there is currently no option to select specific devices. This means that the available devices should be treated as a set and cannot be assigned to different containers. The other thing to note is that the device node (for use with the CPU manager workaround) on WSL2 systems is /dev/dxg and not /dev/nvidia*.

If you feel comfortable creating an MR against https://gitlab.com/nvidia/kubernetes/device-plugin that adds this functionality, we can work together on getting it in.
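
Regarding step 1 above, a minimal detection sketch could look like the following. This is a hypothetical helper, not part of the plugin; the library and device-node paths are simply the ones reported elsewhere in this thread.

package wsldetect

import "os"

// Candidate locations for the dxcore library on WSL2 systems, as reported
// earlier in this thread.
var dxcoreCandidates = []string{
	"/usr/lib/wsl/lib/libdxcore.so",
	"/usr/lib/x86_64-linux-gnu/libdxcore.so",
}

// IsWSL2 reports whether this looks like a WSL2 system with GPU paravirtualization:
// the shared /dev/dxg device node exists and a dxcore library can be found.
func IsWSL2() bool {
	if _, err := os.Stat("/dev/dxg"); err != nil {
		return false
	}
	for _, lib := range dxcoreCandidates {
		if _, err := os.Stat(lib); err == nil {
			return true
		}
	}
	return false
}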

@Vinrobot

Vinrobot commented Nov 1, 2022

Hello,

I was interested in this, and I adapted the plugin to work.
I pushed my version to GitLab (https://gitlab.com/Vinrobot/nvidia-kubernetes-device-plugin/-/tree/features/wsl2) and it works on my machine.
I also had to modify NVIDIA/gpu-monitoring-tools (https://github.com/Vinrobot/nvidia-gpu-monitoring-tools/tree/features/wsl2) to also use /dev/dxg.

I can try to do a clean version, but I don't really know how to correctly check whether /dev/dxg is an NVIDIA GPU or an incompatible device. Does someone have a good idea?

@elezar
Member

elezar commented Nov 2, 2022

@Vinrobot thanks for the work here. Some thoughts on this:

We recently moved away from nvidia-gpu-monitoring-tools and use bindings from go-nvml through go-nvlib instead.

I think the steps outlined in #332 (comment) should be considered as the starting point. Check if dxcore.so.1 is available and, if it is, assume a WSL2 system (one could also check for the existence of /dev/dxg here). In this case, create a wslDevice that implements the deviceInfo interface and ensure that this gets instantiated when enumerating devices. This can then return 0 for the minor number and return the correct path.
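
As a rough illustration of that idea (the deviceInfo interface below is a simplified stand-in, not the plugin's real definition), a WSL2-backed device could report a fixed minor number and expose /dev/dxg as its device node:

package rm

// deviceInfo is a simplified stand-in for the interface the plugin's resource
// manager uses when enumerating devices.
type deviceInfo interface {
	GetMinorNumber() (int, error)
	GetPaths() ([]string, error)
}

// wslDevice is a hypothetical implementation selected when a WSL2 system is detected.
type wslDevice struct{}

// Compile-time check that wslDevice satisfies the stand-in interface.
var _ deviceInfo = wslDevice{}

// GetMinorNumber returns 0, since WSL2 exposes no per-GPU minor numbers.
func (d wslDevice) GetMinorNumber() (int, error) { return 0, nil }

// GetPaths returns the shared WSL2 device node instead of /dev/nvidia<minor>.
func (d wslDevice) GetPaths() ([]string, error) { return []string{"/dev/dxg"}, nil }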

With regards to the following:

I can try to do a clean version, but I don't really know how to correctly check whether /dev/dxg is an NVIDIA GPU or an incompatible device. Does someone have a good idea?

I don't think that this is required. If there are no NVIDIA GPUs available on the system then the NVML enumeration that is used to list the devices would not be expected to work. This should already be handled by the lower-level components of the NVIDIA container stack.

@Vinrobot

Vinrobot commented Nov 2, 2022

Hi @elezar,
Thanks for the feedback.

I tried to make it work with the most recent version, but I got this error (on the pod)

Warning  UnexpectedAdmissionError  30s   kubelet            Allocate failed due to device plugin GetPreferredAllocation rpc failed with err: rpc error: code = Unknown desc = error getting list of preferred allocation devices: unable to retrieve list of available devices: error creating nvml.Device 0: unsupported GPU device, which is unexpected

which is caused by this line in gpu-monitoring-tools (still used by gpuallocator).

As it's the same as before, I can re-use my custom version of gpu-monitoring-tools to make it work, but it's not the goal.
Anyway, I will look into it tomorrow.

@elezar
Member

elezar commented Nov 3, 2022

@Vinrobot yes, it is an issue that gpuallocator still uses gpu-monitoring-tools. It is on our roadmap to port it to the go-nvml bindings, but this is not yet complete.

The issue is the call to get aligned allocation here. (You can confirm this by removing this section.)

If this does work, what we would need is a mechanism to disable this for WSL2 devices.

One option would be to add an AlignedAllocationSupported() bool function to the Devices and Device types. This could look something like:

// AlignedAllocationSupported checks whether all devices support an aligned allocation
func (ds Devices) AlignedAllocationSupported() bool {
	for _, d := range ds {
		if !d.AlignedAllocationSupported() {
			return false
		}
	}
	return true
}

// AlignedAllocationSupported checks whether the device supports an aligned allocation
func (d Device) AlignedAllocationSupported() bool {
	if d.IsMigDevice() {
		return false
	}

	for _, p := range d.Paths {
		if p == "/dev/dxg" {
			return false
		}
	}

	return true
}

(Note that this should still be discussed and could definitely be improved, but would be a good starting point).
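
A hypothetical usage sketch of how such a check could gate the preferred-allocation path follows. All names here are illustrative stand-ins rather than the plugin's actual API; the idea is simply to fall back to a trivial selection when an aligned allocation is not supported (e.g. for WSL2 devices backed by /dev/dxg).

package main

import "fmt"

// Device and Devices are trimmed-down stand-ins for the plugin's types, just
// enough to make the example self-contained.
type Device struct {
	ID    string
	Paths []string
}

type Devices []Device

// AlignedAllocationSupported mirrors the check proposed above: devices exposing
// /dev/dxg cannot participate in a topology-aware allocation.
func (d Device) AlignedAllocationSupported() bool {
	for _, p := range d.Paths {
		if p == "/dev/dxg" {
			return false
		}
	}
	return true
}

func (ds Devices) AlignedAllocationSupported() bool {
	for _, d := range ds {
		if !d.AlignedAllocationSupported() {
			return false
		}
	}
	return true
}

// preferredAllocation picks up to size devices, skipping the gpuallocator-based
// aligned selection when it is not supported.
func preferredAllocation(available Devices, size int) []string {
	var ids []string
	if !available.AlignedAllocationSupported() {
		for _, d := range available {
			ids = append(ids, d.ID)
		}
		if size < len(ids) {
			ids = ids[:size]
		}
		return ids
	}
	// ...otherwise call into gpuallocator for an aligned selection, as today...
	return ids
}

func main() {
	wsl := Devices{{ID: "GPU-0", Paths: []string{"/dev/dxg"}}}
	fmt.Println(preferredAllocation(wsl, 1)) // [GPU-0]
}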

@achim92
Contributor

achim92 commented May 11, 2023

Hi @elezar,

I'm also interested in running the device plugin with WSL2.
I have created an MR https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/291

Would be great to get those changes in.

@elezar
Member

elezar commented May 12, 2023

Thanks @achim92 -- I will have a look at the MR.

Note that with the v1.13.0 release of the NVIDIA Container Toolkit we now support the generation of CDI specifications on WSL2 based systems. Support for consuming this and generating a spec for available devices was included in the v0.14.0 version of the device plugin. This was largely targeted at usage in the context of our GPU operator, but could be generalised to also support WSL2-based systems without requiring additional device plugin changes.

@leon96

leon96 commented May 15, 2023

hi @elezar,
Does v0.14.0 support adding GPU resources to Capacity and Allocatable?
I'm using WSL2 + v0.14.0, and the device plugin logs are showing "No devices found. Waiting indefinitely."

I0515 07:23:12.247146       1 main.go:154] Starting FS watcher.
I0515 07:23:12.247248       1 main.go:161] Starting OS watcher.
I0515 07:23:12.248352       1 main.go:176] Starting Plugins.
I0515 07:23:12.248389       1 main.go:234] Loading configuration.
I0515 07:23:12.248530       1 main.go:242] Updating config with default resource matching patterns.
I0515 07:23:12.248786       1 main.go:253]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0515 07:23:12.248816       1 main.go:256] Retreiving plugins.
I0515 07:23:12.251257       1 factory.go:107] Detected NVML platform: found NVML library
I0515 07:23:12.251330       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0515 07:23:12.270094       1 main.go:287] No devices found. Waiting indefinitely.

@achim92
Contributor

achim92 commented May 15, 2023

Thanks @elezar,

It would be even better without requiring additional device plugin changes.

I have generated a CDI spec with nvidia-ctk cdi generate --output=/etc/cdi/nvidia.yaml:

cdiVersion: 0.3.0
containerEdits:
  hooks:
  - args:
    - nvidia-ctk
    - hook
    - create-symlinks
    - --link
    - /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/nvidia-smi::/usr/bin/nvidia-smi
    hookName: createContainer
    path: /usr/bin/nvidia-ctk
  - args:
    - nvidia-ctk
    - hook
    - update-ldcache
    - --folder
    - /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4
    - --folder
    - /usr/lib/wsl/lib
    hookName: createContainer
    path: /usr/bin/nvidia-ctk
  mounts:
  - containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libnvidia-ml_loader.so
    hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libnvidia-ml_loader.so
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/wsl/lib/libdxcore.so
    hostPath: /usr/lib/wsl/lib/libdxcore.so
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/nvcubins.bin
    hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/nvcubins.bin
    options:
    - ro
    - nosuid
    - bind
  - containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/nvidia-smi
    hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/nvidia-smi
    options:
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libcuda.so.1.1
    hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libcuda.so.1.1
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libcuda_loader.so
    hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libcuda_loader.so
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libnvidia-ptxjitcompiler.so.1
    hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libnvidia-ptxjitcompiler.so.1
    options:
    - ro
    - nosuid
    - nodev
    - bind
  - containerPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libnvidia-ml.so.1
    hostPath: /usr/lib/wsl/drivers/nvblig.inf_amd64_2e6b1db93a108fb4/libnvidia-ml.so.1
    options:
    - ro
    - nosuid
    - nodev
    - bind
devices:
- containerEdits:
    deviceNodes:
    - path: /dev/dxg
  name: all
kind: nvidia.com/gpu

I also removed the NVIDIA Container Runtime hook under /usr/share/containers/oci/hooks.d/oci-nvidia-hook.json.
How can I enable CDI to make it work? I'm using CRI-O as the container runtime, so CDI support should be enabled by default.

I0515 08:39:51.471150       1 main.go:154] Starting FS watcher.
I0515 08:39:51.471416       1 main.go:161] Starting OS watcher.
I0515 08:39:51.472727       1 main.go:176] Starting Plugins.
I0515 08:39:51.472771       1 main.go:234] Loading configuration.
I0515 08:39:51.473017       1 main.go:242] Updating config with default resource matching patterns.
I0515 08:39:51.473350       1 main.go:253]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0515 08:39:51.473380       1 main.go:256] Retreiving plugins.
W0515 08:39:51.473833       1 factory.go:31] No valid resources detected, creating a null CDI handler
I0515 08:39:51.474021       1 factory.go:107] Detected non-NVML platform: could not load NVML library: libnvidia-ml.so.1: cannot open shared object file: No such file or directory
I0515 08:39:51.474878       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
E0515 08:39:51.474918       1 factory.go:115] Incompatible platform detected
E0515 08:39:51.474925       1 factory.go:116] If this is a GPU node, did you configure the NVIDIA Container Toolkit?
E0515 08:39:51.474930       1 factory.go:117] You can check the prerequisites at: https://github.com/NVIDIA/k8s-device-plugin#prerequisites
E0515 08:39:51.474934       1 factory.go:118] You can learn how to set the runtime at: https://github.com/NVIDIA/k8s-device-plugin#quick-start
E0515 08:39:51.474937       1 factory.go:119] If this is not a GPU node, you should set up a toleration or nodeSelector to only deploy this plugin on GPU nodes
I0515 08:39:51.474946       1 main.go:287] No devices found. Waiting indefinitely.

@achim92
Contributor

achim92 commented May 24, 2023

@elezar could you please give some guidance here?

@NikulausRui

hi @elezar, Does v0.14.0 support adding GPU resources to Capacity and Allocatable? I'm using WSL2 + v0.14.0, and the device plugin logs are showing "No devices found. Waiting indefinitely."

I0515 07:23:12.247146       1 main.go:154] Starting FS watcher.
I0515 07:23:12.247248       1 main.go:161] Starting OS watcher.
I0515 07:23:12.248352       1 main.go:176] Starting Plugins.
I0515 07:23:12.248389       1 main.go:234] Loading configuration.
I0515 07:23:12.248530       1 main.go:242] Updating config with default resource matching patterns.
I0515 07:23:12.248786       1 main.go:253]
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0515 07:23:12.248816       1 main.go:256] Retreiving plugins.
I0515 07:23:12.251257       1 factory.go:107] Detected NVML platform: found NVML library
I0515 07:23:12.251330       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0515 07:23:12.270094       1 main.go:287] No devices found. Waiting indefinitely.

Hi brother, I've encountered the same issue. Have you managed to solve it?

@elezar
Member

elezar commented May 25, 2023

Note: We have https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/291 under review from @achim92 to allow the device plugin to work under WSL2. Testing of the changes there would be welcomed.

@davidshen84

Hi @elezar ,

How can I test your changes? Do I need to create a new image and install the plugin to my k8s using https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml as a template?

Thanks

@wizpresso-steve-cy-fan

@elezar We are also interested in this

@wizpresso-steve-cy-fan

I believe registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016 would be the right image, right?

@davidshen84

✔️ registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016

WSL environment

WSL version: 1.2.5.0
Kernel version: 5.15.90.1
WSLg version: 1.0.51
MSRDC version: 1.2.3770
Direct3D version: 1.608.2-61064218
DXCore version: 10.0.25131.1002-220531-1700.rs-onecore-base2-hyp
Windows version: 10.0.19044.3208

K8S Setup

≥ k3s --version                                                                                                                    
k3s version v1.26.4+k3s1 (8d0255af)
go version go1.19.8

nvidia-smi output in WSL

Tue Jul 25 16:36:46 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.54.04              Driver Version: 536.25       CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA RTX A2000 8GB Lap...    On  | 00000000:01:00.0 Off |                  N/A |
| N/A   46C    P8               3W /  40W |      0MiB /  8192MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

##deleted processes table##

nvidia-device-plugin daemonset pod log

I0725 06:26:03.108417       1 main.go:154] Starting FS watcher.
I0725 06:26:03.108468       1 main.go:161] Starting OS watcher.
I0725 06:26:03.108974       1 main.go:176] Starting Plugins.
I0725 06:26:03.108995       1 main.go:234] Loading configuration.
I0725 06:26:03.109063       1 main.go:242] Updating config with default resource matching patterns.
I0725 06:26:03.109205       1 main.go:253] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0725 06:26:03.109219       1 main.go:256] Retrieving plugins.
I0725 06:26:03.113336       1 factory.go:107] Detected NVML platform: found NVML library
I0725 06:26:03.113372       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0725 06:26:03.138677       1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I0725 06:26:03.139033       1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0725 06:26:03.143248       1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet

Test GPU pod output

Used the example from https://docs.k3s.io/advanced#nvidia-container-runtime-support

Run "nbody -benchmark [-numbodies=<numBodies>]" to measure performance.
    -fullscreen       (run n-body simulation in fullscreen mode)
    -fp64             (use double precision floating point values for simulation)
    -hostmem          (stores simulation data in host memory)
    -benchmark        (run benchmark to measure performance) 
    -numbodies=<N>    (number of bodies (>= 1) to run in simulation) 
    -device=<d>       (where d=0,1,2.... for the CUDA device to use)
    -numdevices=<i>   (where i=(number of CUDA devices > 0) to use for simulation)
    -compare          (compares simulation results running once on the default GPU and once on the CPU)
    -cpu              (run n-body simulation on the CPU)
    -tipsy=<file.bin> (load a tipsy model file for simulation)

NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.

> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
GPU Device 0: "Ampere" with compute capability 8.6

> Compute 8.6 CUDA device: [NVIDIA RTX A2000 8GB Laptop GPU]
20480 bodies, total time for 10 iterations: 25.066 ms
= 167.327 billion interactions per second
= 3346.542 single-precision GFLOP/s at 20 flops per interaction
Stream closed EOF for default/nbody-gpu-benchmark (cuda-container)

Thank you @elezar. I hope this commit can be merged into this repo and published ASAP 🚀!

@wizpresso-steve-cy-fan

wizpresso-steve-cy-fan commented Jul 25, 2023

@davidshen84 I can also confirm it works. However, we have to add some additional stuff:

$ touch /run/nvidia/validations/toolkit-ready
$ touch /run/nvidia/validations/driver-ready
$ mkdir -p /run/nvidia/driver/dev
$ ln -s /run/nvidia/driver/dev/dxg /dev/dxg

Annotate the WSL node:

    nvidia.com/gpu-driver-upgrade-state: pod-restart-required
    nvidia.com/gpu.count: '1'
    nvidia.com/gpu.deploy.container-toolkit: 'true'
    nvidia.com/gpu.deploy.dcgm: 'true'
    nvidia.com/gpu.deploy.dcgm-exporter: 'true'
    nvidia.com/gpu.deploy.device-plugin: 'true'
    nvidia.com/gpu.deploy.driver: 'true'
    nvidia.com/gpu.deploy.gpu-feature-discovery: 'true'
    nvidia.com/gpu.deploy.node-status-exporter: 'true'
    nvidia.com/gpu.deploy.nvsm: ''
    nvidia.com/gpu.deploy.operands: 'true'
    nvidia.com/gpu.deploy.operator-validator: 'true'
    nvidia.com/gpu.present: 'true'
    nvidia.com/device-plugin.config: 'RTX-4070-Ti'

Change device plugin in ClusterPolicy:

  devicePlugin:
    config:
      name: time-slicing-config
    enabled: true
    env:
      - name: PASS_DEVICE_SPECS
        value: 'true'
      - name: FAIL_ON_INIT_ERROR
        value: 'true'
      - name: DEVICE_LIST_STRATEGY
        value: envvar
      - name: DEVICE_ID_STRATEGY
        value: uuid
      - name: NVIDIA_VISIBLE_DEVICES
        value: all
      - name: NVIDIA_DRIVER_CAPABILITIES
        value: all
    image: k8s-device-plugin
    imagePullPolicy: IfNotPresent
    repository: registry.gitlab.com/nvidia/kubernetes/device-plugin/staging
    version: 8b416016

It should work for now:


> Windowed mode
> Simulation data stored in video memory
> Single precision floating point simulation
> 1 Devices used for simulation
MapSMtoCores for SM 8.9 is undefined.  Default to use 128 Cores/SM
MapSMtoArchName for SM 8.9 is undefined.  Default to use Ampere
GPU Device 0: "Ampere" with compute capability 8.9

> Compute 8.9 CUDA device: [NVIDIA GeForce RTX 4070 Ti]
61440 bodies, total time for 10 iterations: 34.665 ms
= 1088.943 billion interactions per second
= 21778.869 single-precision GFLOP/s at 20 flops per interaction

@davidshen84

davidshen84 commented Jul 25, 2023 via email

@wizpresso-steve-cy-fan

@davidshen84 Because I used the gpu-operator for automatic GPU provision

@davidshen84

davidshen84 commented Jul 25, 2023 via email

@msclock

msclock commented Jul 26, 2023

I verified that the staging image registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016 is truly working on WSL2.

Based on dockerd

Step 1, install k3s cluster based on dockerd

curl -sfL https://get.k3s.io | sh -s - --docker

Step 2, install the device plugin with the staging image.

# set RuntimeClass
cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: docker
EOF

# install nvdp
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
helm upgrade -i nvdp nvdp/nvidia-device-plugin \
    --namespace nvdp \
    --create-namespace \
    --set=runtimeClassName=nvidia \
    --set=image.repository=registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin \
    --set=image.tag=8b416016

Based on containerd

Step 1, install k3s cluster based on containerd

curl -sfL https://get.k3s.io | sh -

Step 2, install the device plugin with the staging image.

# set RuntimeClass
cat <<EOF | kubectl apply -f -
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia # change the handler to `nvidia` for containerd
EOF

# install nvdp with the same steps as above.

Test with nvdp

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  restartPolicy: Never
  runtimeClassName: nvidia
  containers:
    - name: cuda-container
      image: nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda10.2
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
EOF

And the example cuda-sample-vectoradd works normally. Waiting for the next working release on WSL2 😃😃

@davidshen84

Note: We have https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/291 under review from @achim92 to allow the device plugin to work under WSL2. Testing of the changes there would be welcomed.

Hi @elezar, I saw this PR was merged in the upstream repository a long time ago. What's the plan for publishing this on GitHub?

@guhuajun

guhuajun commented Aug 16, 2023

Hi @elezar,

I can confirm registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016 is working for me, even though my GPU card is a Quadro P1000. :) I can move forward to test Koordinator.

itadmin@server:~/repos/k3s-on-wsl2$ cat /proc/version
Linux version 5.15.90.1-microsoft-standard-WSL2 (oe-user@oe-host) (x86_64-msft-linux-gcc (GCC) 9.3.0, GNU ld (GNU Binutils) 2.34.0.20200220) #1 SMP Fri Jan 27 02:56:13 UTC 2023
itadmin@server:~/repos/k3s-on-wsl2$ sudo docker run --rm --runtime=nvidia --gpus all nvidia/cuda:11.6.2-base-ubuntu20.04 nvidia-smi
Wed Aug 16 06:21:56 2023       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 525.105.14   Driver Version: 528.86       CUDA Version: 12.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Quadro P1000        On   | 00000000:01:00.0  On |                  N/A |
| 34%   39C    P8    N/A /  47W |   1061MiB /  4096MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A        23      G   /Xwayland                       N/A      |
+-----------------------------------------------------------------------------+
itadmin@server:~/repos/k3s-on-wsl2$ sudo kubectl -n kube-system logs nvidia-device-plugin-daemonset-q642m
I0816 06:20:28.927429       1 main.go:154] Starting FS watcher.
I0816 06:20:28.927534       1 main.go:161] Starting OS watcher.
I0816 06:20:28.927691       1 main.go:176] Starting Plugins.
I0816 06:20:28.927698       1 main.go:234] Loading configuration.
I0816 06:20:28.927762       1 main.go:242] Updating config with default resource matching patterns.
I0816 06:20:28.927936       1 main.go:253] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "nvidiaDriverRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  }
}
I0816 06:20:28.927960       1 main.go:256] Retrieving plugins.
I0816 06:20:28.930313       1 factory.go:107] Detected NVML platform: found NVML library
I0816 06:20:28.930362       1 factory.go:107] Detected non-Tegra platform: /sys/devices/soc0/family file not found
I0816 06:20:28.947623       1 server.go:165] Starting GRPC server for 'nvidia.com/gpu'
I0816 06:20:28.948059       1 server.go:117] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0816 06:20:28.949737       1 server.go:125] Registered device plugin for 'nvidia.com/gpu' with Kubelet
itadmin@server:~/repos/k3s-on-wsl2$ sudo kubectl get nodes -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Node
  metadata:
    annotations:
      etcd.k3s.cattle.io/node-address: 172.18.88.17
      etcd.k3s.cattle.io/node-name: server-d622491e
      flannel.alpha.coreos.com/backend-data: '{"VNI":1,"VtepMAC":"52:95:ba:16:e9:29"}'
      flannel.alpha.coreos.com/backend-type: vxlan
      flannel.alpha.coreos.com/kube-subnet-manager: "true"
      flannel.alpha.coreos.com/public-ip: 172.18.88.17
      k3s.io/node-args: '["server","--cluster-init","true","--etcd-expose-metrics","true","--disable","traefik","--disable-cloud-controller","true","--docker","true","--kubelet-arg","node-status-update-frequency=4s","--kube-controller-manager-arg","node-monitor-period=2s","--kube-controller-manager-arg","node-monitor-grace-period=16s","--kube-apiserver-arg","default-not-ready-toleration-seconds=20","--kube-apiserver-arg","default-unreachable-toleration-seconds=20","--write-kubeconfig","/home/itadmin/.kube/config","--private-registry","/etc/rancher/k3s/registry.yaml","--flannel-iface","eth0","--bind-address","172.18.88.17","--https-listen-port","6443","--advertise-address","172.18.88.17","--log","/var/log/k3s-server.log"]'
      k3s.io/node-config-hash: IDWWDZRIJO5DHZKGYYHONVZC2DN7TK7THKPSONCFR74ST4LAGNGQ====
      k3s.io/node-env: '{"K3S_DATA_DIR":"/var/lib/rancher/k3s/data/c26e7571d760c5f199d18efd197114f1ca4ab1e6ffe494f96feb65c87fcb8cf0"}'
      node.alpha.kubernetes.io/ttl: "0"
      volumes.kubernetes.io/controller-managed-attach-detach: "true"
    creationTimestamp: "2023-08-16T05:47:03Z"
    finalizers:
    - wrangler.cattle.io/managed-etcd-controller
    - wrangler.cattle.io/node
    labels:
      beta.kubernetes.io/arch: amd64
      beta.kubernetes.io/os: linux
      kubernetes.io/arch: amd64
      kubernetes.io/hostname: server
      kubernetes.io/os: linux
      node-role.kubernetes.io/control-plane: "true"
      node-role.kubernetes.io/etcd: "true"
      node-role.kubernetes.io/master: "true"
    name: server
    resourceVersion: "8151"
    uid: 04b6a572-830c-4102-a9a9-15265e4f6a15
  spec:
    podCIDR: 10.42.0.0/24
    podCIDRs:
    - 10.42.0.0/24
  status:
    addresses:
    - address: 172.18.88.17
      type: InternalIP
    - address: server
      type: Hostname
    allocatable:
      cpu: "4"
      ephemeral-storage: "1027046117185"
      hugepages-1Gi: "0"
      hugepages-2Mi: "0"
      memory: 32760580Ki
      nvidia.com/gpu: "1"
      pods: "110"
    capacity:
      cpu: "4"
      ephemeral-storage: 1055762868Ki
      hugepages-1Gi: "0"
      hugepages-2Mi: "0"
      memory: 32760580Ki
      nvidia.com/gpu: "1"
      pods: "110"
    conditions:
    - lastHeartbeatTime: "2023-08-16T06:20:34Z"
      lastTransitionTime: "2023-08-16T05:47:03Z"
      message: kubelet has sufficient memory available
      reason: KubeletHasSufficientMemory
      status: "False"
      type: MemoryPressure
    - lastHeartbeatTime: "2023-08-16T06:20:34Z"
      lastTransitionTime: "2023-08-16T05:47:03Z"
      message: kubelet has no disk pressure
      reason: KubeletHasNoDiskPressure
      status: "False"
      type: DiskPressure
    - lastHeartbeatTime: "2023-08-16T06:20:34Z"
      lastTransitionTime: "2023-08-16T05:47:03Z"
      message: kubelet has sufficient PID available
      reason: KubeletHasSufficientPID
      status: "False"
      type: PIDPressure
    - lastHeartbeatTime: "2023-08-16T06:20:34Z"
      lastTransitionTime: "2023-08-16T05:47:07Z"
      message: kubelet is posting ready status
      reason: KubeletReady
      status: "True"
      type: Ready
    daemonEndpoints:
      kubeletEndpoint:
        Port: 10250
    images:
    - names:
      - nvcr.io/nvidia/tensorflow@sha256:7b74f2403f62032db8205cf228052b105bd94f2871e27c1f144c5145e6072984
      - nvcr.io/nvidia/tensorflow:20.03-tf2-py3
      sizeBytes: 7440987700
    - names:
      - 192.168.0.96:5000/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin@sha256:35ef4e7f7070e9ec0c9d9f9658200ce2dd61b53a436368e8ea45ec02ced78559
      - 192.168.0.96:5000/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016
      sizeBytes: 298298015
    - names:
      - 192.168.0.96:5000/nvidia/k8s-device-plugin@sha256:68fa1607030680a5430ee02cf4fdce040c99436d680ae24ba81ef5bbf4409e8e
      - nvcr.io/nvidia/k8s-device-plugin@sha256:15c4280d13a61df703b12d1fd1b5b5eec4658157db3cb4b851d3259502310136
      - 192.168.0.96:5000/nvidia/k8s-device-plugin:v0.14.1
      - nvcr.io/nvidia/k8s-device-plugin:v0.14.1
      sizeBytes: 298277535
    - names:
      - nvidia/cuda@sha256:4b0c83c0f2e66dc97b52f28c7acf94c1461bfa746d56a6f63c0fef5035590429
      - nvidia/cuda:11.6.2-base-ubuntu20.04
      sizeBytes: 153991389
    - names:
      - rancher/mirrored-metrics-server@sha256:16185c0d4d01f8919eca4779c69a374c184200cd9e6eded9ba53052fd73578df
      - rancher/mirrored-metrics-server:v0.6.2
      sizeBytes: 68892890
    - names:
      - rancher/mirrored-coredns-coredns@sha256:823626055cba80e2ad6ff26e18df206c7f26964c7cd81a8ef57b4dc16c0eec61
      - rancher/mirrored-coredns-coredns:1.9.4
      sizeBytes: 49802873
    - names:
      - rancher/local-path-provisioner@sha256:db1a3225290dd8be481a1965fc7040954d0aa0e1f86a77c92816d7c62a02ae5c
      - rancher/local-path-provisioner:v0.0.23
      sizeBytes: 37443889
    - names:
      - rancher/mirrored-pause@sha256:74c4244427b7312c5b901fe0f67cbc53683d06f4f24c6faee65d4182bf0fa893
      - rancher/mirrored-pause:3.6
      sizeBytes: 682696
    nodeInfo:
      architecture: amd64
      bootID: de2732a0-17d9-4272-a205-7b9ac1103e2b
      containerRuntimeVersion: docker://20.10.25
      kernelVersion: 5.15.90.1-microsoft-standard-WSL2
      kubeProxyVersion: v1.26.3+k3s1
      kubeletVersion: v1.26.3+k3s1
      machineID: 53da58bf9ac14c33847a4b6e1269419b
      operatingSystem: linux
      osImage: Ubuntu 22.04.3 LTS
      systemUUID: 53da58bf9ac14c33847a4b6e1269419b
kind: List
metadata:
  resourceVersion: ""

@alexeadem

alexeadem commented Jan 30, 2024

Tested and documented in qbo with:

  • Windows 11
  • WSL2
  • Docker cgroup v2
  • Nvidia GPU operator
  • Kubeflow

https://docs.qbo.io/#/ai_and_ml?id=kubeflow

Thanks to @achim92's contribution and @elezar's approval :)

Please note that on Linux the default helm chart works in qbo and kind, so there is no need for this.

This fix also works for kind Kubernetes, using accept-nvidia-visible-devices-as-volume-mounts = true in /etc/nvidia-container-runtime/config.toml

and

 extraMounts:
    - hostPath: /dev/null
      containerPath: /var/run/nvidia-container-devices/all

For more details, see here:

kubernetes-sigs/kind#3257 (comment)

A couple of notes for the gpu-operator

Labels

The NVIDIA GPU operator requires a manual label, feature.node.kubernetes.io/pci-10de.present=true, for node-feature-discovery to add all the labels necessary for the GPU operator to work. This applies only to kind and qbo; I am not sure why k8s requires more labels, as indicated here: #332 (comment)

The label can be added as follows:

for i in $(kubectl get no --selector '!node-role.kubernetes.io/control-plane' -o json | jq -r '.items[].metadata.name'); do
        kubectl label node $i feature.node.kubernetes.io/pci-10de.present=true
done

The reason is that WSL2 doesn't contain PCI info under /sys, so node-feature-discovery is unable to detect the GPU.

I believe the relevant code is here: node-feature-discovery/source/usb/utils.go:106

I believe node-feature-discovery is expecting something like the output below to build the 10de label:

lspci -nn |grep -i  nvidia
0000:01:00.0 VGA compatible controller [0300]: NVIDIA Corporation GA106M [GeForce RTX 3060 Mobile / Max-Q] [10de:2560] (rev a1)
0000:01:00.1 Audio device [0403]: NVIDIA Corporation Device [10de:228e] (rev a1)

I believe the right place to add this label is once the driver has been detected on the host. See here:

https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/881/diffs

I'll add my comments there.

Docker Image for device-plugin

I built a new image based on https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/291 for testing purposes, but it also works with the one provided here: #332 (comment)

git branch
* device-plugin-wsl2

device-plugin docker image

helm chart templates

Docker Image for gpu-operator

I created a docker image with changes similar to this:

https://gitlab.com/nvidia/kubernetes/gpu-operator/-/merge_requests/881/diffs

gpu-operator docker image

Docker Image for gpu-operator-validator

gpu-operator-validator image

Blogs on how to install: Nvidia GPU Operator + Kubeflow + Docker in Docker + cgroups v2 (In Linux and Windows WSL2)

Blog part 1

Blog part 2

@pbasov

pbasov commented Feb 5, 2024

Thank you for working on this; now that WSL2 supports systemd, I think more people will be running k8s on Windows.
I can confirm registry.gitlab.com/nvidia/kubernetes/device-plugin/staging/k8s-device-plugin:8b416016 is working on a kubeadm-deployed cluster with Driver Version: 551.23 and a 2080 Ti.

@elezar
Member

elezar commented Feb 5, 2024

Just a general note: We will release a v0.15.0-rc.1 of the GPU Device Plugin in the next week or so including these changes. That should then allow us to get more concrete feedback on the released version instead of relying on the SHA-tagged image.

@alexeadem

Just a general note: We will release a v0.15.0-rc.1 of the GPU Device Plugin in the next week or so including these changes. That should then allow us to get more concrete feedback on the released version instead of relying on the SHA-tagged image.

Hi @elezar, any update on when v0.15.0-rc.1 is going to be out?

@mrjohnsonalexander

mrjohnsonalexander commented Jun 29, 2024

v0.15.0-rc1 successfully enabled my scenario today: https://github.com/mrjohnsonalexander/classic

TL;DR Stack notes

  • Python Version 3.11
  • Docker Community Version 26.1.4
  • Nvidia Device Plugin v0.15.0-rc.1
  • Kubernetes v1.30
  • Containerd 1.6.33
  • WSL Distribution Centos Stream9
  • WSL version: 2.2.4.0
  • OS Version Windows 10 BUILD 19045
  • Nvidia Game Ready Driver Version: 555.99
  • Installed Physical Memory (RAM) 32 GB
  • Nvidia Geforce RTX 4060 Ti 8 GB VRAM
  • Intel CPU i7-4820k
