upload vgpu-compute results graph
hxhp committed Apr 7, 2022
1 parent f5a446f commit 5ceef24
Showing 2 changed files with 4 additions and 9 deletions.
13 changes: 4 additions & 9 deletions alnair-device-plugin/README.md
@@ -12,13 +12,9 @@ The CUDA driver API interpose library can be found [here](https://github.com/Cen

### 1. Prerequisites
* Provision a Kubernetes cluster with at least one GPU node.
* Install the Nvidia docker runtime [nvidia-docker2](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#docker) and configure it as the default docker runtime.
* A kubeconfig with sufficient permissions.
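Configuring the default docker runtime is not spelled out above; as a sketch, the typical `/etc/docker/daemon.json` for this (following Nvidia's standard setup, and assuming `nvidia-container-runtime` is on the default path) looks like:

```json
{
  "default-runtime": "nvidia",
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  }
}
```

Restart docker (e.g. `sudo systemctl restart docker`) after editing this file so the new default runtime takes effect.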

**NOTE**: We mount the host /run directory into the vgpu-server container, which can conflict with the Nvidia runtime setup and produce an error like ```stderr: nvidia-container-cli: mount error: file creation failed: /var/lib/docker/overlay2/XXXX/merged/run/nvidia-persistenced/socket: no such device or address: unknown```. If this happens, purge all nvidia* packages and reinstall.



### 2. Single Yaml Installation:

```kubectl apply -f https://raw.githubusercontent.com/CentaurusInfra/alnair/main/alnair-device-plugin/manifests/alnair-device-plugin.yaml```
@@ -52,10 +48,9 @@ spec:
![memory](./docs/images/memory_limits.png)

#### 2. Compute limit
Combined with the profiler, GPU utilization can be viewed. The actual utilization should be no greater than the specified limit. However, different types of GPU cards may behave differently, so fine-tuning is required when applying the limits to new GPU types.

The following graph shows the GPU utilization of the same deep learning job running as a vGPU Pod on an RTX 2080 card with different vgpu-compute limits specified; the last (red) series is from a run on the physical GPU without limits:

![compute](./docs/images/vgpu-compute.png)
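A Pod spec with vgpu limits is implied above but not shown in this excerpt; the following is a minimal sketch only. The resource names `alnair/vgpu-memory` and `alnair/vgpu-compute`, their units, and the image are assumptions here; check the device plugin's manifests and docs for the exact resource names your nodes actually advertise.

```yaml
# Hypothetical Pod requesting Alnair vGPU resources (names/units assumed,
# not confirmed by this excerpt -- verify against the plugin's manifests).
apiVersion: v1
kind: Pod
metadata:
  name: vgpu-demo
spec:
  containers:
  - name: cuda-app
    image: nvidia/cuda:11.0-base   # any CUDA workload image
    command: ["sleep", "infinity"]
    resources:
      limits:
        alnair/vgpu-memory: 4      # assumed unit: GiB of GPU memory
        alnair/vgpu-compute: 50    # assumed unit: percent of GPU compute
```

Once such a Pod is scheduled, its utilization can be compared against the configured vgpu-compute limit as in the graph above.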


(The second changed file, the uploaded results graph, is a binary image and cannot be displayed.)
