diff --git a/alnair-device-plugin/README.md b/alnair-device-plugin/README.md
index 3d59177..da2585f 100644
--- a/alnair-device-plugin/README.md
+++ b/alnair-device-plugin/README.md
@@ -12,13 +12,9 @@ The CUDA driver API interpose library can be found [here](https://github.com/Cen
 ### 1. Prerequisites
 * Provision a Kubernetes cluster with at least one GPU node.
-* Install [nvidia-container-runtime](https://github.com/NVIDIA/nvidia-container-runtime) and configure it as the default docker runtime.
+* Install the NVIDIA docker runtime [nvidia-docker2](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker) and configure it as the default docker runtime.
 * A kubeconfig with enough permissions.
-**NOTE**: Don't install nvidia-docker2. We mount host /run directory to vgpu-server container. Nvidia-docker2 has conflicts with mounting /run. Error is like ``` stderr: nvidia-container-cli: mount error: file creation failed: /var/lib/docker/overlay2/XXXX/merged/run/nvidia-persistenced/socket: no such device or address: unknown"```. If this happens, you can purge all nvidia* packages and reinstall.
-
-
-
 ### 2. Single Yaml Installation:
 ```kubectl apply -f https://raw.githubusercontent.com/CentaurusInfra/alnair/main/alnair-device-plugin/manifests/alnair-device-plugin.yaml```
@@ -52,10 +48,9 @@ spec:
 ![memory](./docs/images/memory_limits.png)
 #### 2. Compute limit
-Combine with profiler, GPU utilization can be viewed. The actual utilization should be no greater than the limits. However, different types of GPU cards may behave differently. Fine tune is required when apply to new types of GPUs.
-
-The chart below is vGPU Pod example running on a RTX 2080 card with different settings of vgpu-compute limits. The last red one is from running on physical GPU without limits.
-![compute](./docs/images/compute_limits.png)
+Combined with the profiler, GPU utilization can be viewed. The actual utilization should be no greater than the limits. However, different types of GPU cards may behave differently. Fine tuning is required when applying the plugin to new types of GPUs.
+The following graph shows the GPU utilization of the same deep learning job run with different vgpu-compute limits specified:
+![compute](./docs/images/vgpu-compute.png)
diff --git a/alnair-device-plugin/docs/images/vgpu-compute.png b/alnair-device-plugin/docs/images/vgpu-compute.png
new file mode 100644
index 0000000..c1f3577
Binary files /dev/null and b/alnair-device-plugin/docs/images/vgpu-compute.png differ