Updating CPU quota causes NVML unknown error #515

Open
dvenza opened this issue Nov 2, 2017 · 5 comments
dvenza commented Nov 2, 2017

I'm testing nvidia-docker 2, starting containers through Zoe Analytics, which uses the Docker API over the network.
Zoe dynamically adjusts CPU quotas to redistribute spare capacity, but doing so makes nvidia-docker break down:

Start a container (the nvidia runtime is set as the default in daemon.json):

$ docker run -d -e NVIDIA_VISIBLE_DEVICES=all -p 8888 gcr.io/tensorflow/tensorflow:1.3.0-gpu-py3

Test with nvidia-smi (it works):

$ docker exec -it 9e nvidia-smi
Thu Nov  2 08:03:25 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.81                 Driver Version: 384.81                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla P100-PCIE...  Off  | 00000000:04:00.0 Off |                    0 |
| N/A   26C    P0    31W / 250W |      0MiB / 16276MiB |      0%      Default |

[...]

Change the CPU quota:

$ docker update --cpu-quota 640000 9e

Test with nvidia-smi (it breaks):

$ docker exec -it 9e nvidia-smi
Failed to initialize NVML: Unknown Error
  • If I set the CPU quota at container creation time, it works.
  • I tried different values for the quota; it always breaks.
  • I could find no messages in the logs.
  • The same happens when updating the memory soft limit (--memory-reservation); see the example below.
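
For example, the same failure can be reproduced with the memory soft limit alone (the value here is arbitrary):

$ docker update --memory-reservation 4g 9e
$ docker exec -it 9e nvidia-smi
Failed to initialize NVML: Unknown Error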

dvenza added a commit to DistributedSystemsGroup/zoe that referenced this issue Nov 2, 2017

Workaround bug in nvidia-docker: NVIDIA/nvidia-docker#515

See merge request !41
3XX0 (Member) commented Nov 2, 2017

Good catch, it looks like Docker is resetting all the cgroups when it only needs to update one (the CPU quota in this case).
Not sure how we can work around that, though.
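
A rough way to confirm this (a sketch assuming cgroup v1 with the default cgroupfs driver; the container ID is the one from the report above, and the exact paths and device majors can differ) is to compare the container's devices cgroup whitelist before and after the update:

$ CID=$(docker inspect --format '{{.Id}}' 9e)
$ cat /sys/fs/cgroup/devices/docker/$CID/devices.list   # entries for the nvidia character devices (major 195) should be listed here
$ docker update --cpu-quota 640000 9e
$ cat /sys/fs/cgroup/devices/docker/$CID/devices.list   # the nvidia entries are gone, so NVML can no longer access the devices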

mrjackbo commented Jan 10, 2019

Has there been any progress on this? It seems I ran into the same problem while trying to set up the Kubernetes CPU Manager with the "static" policy.
(https://kubernetes.io/docs/tasks/administer-cluster/cpu-management-policies/)

klueska commented Apr 4, 2019

@3XX0 I think it is unlikely that this will ever be addressed upstream.

From docker's perspective, it owns and controls all of the cgroup/device settings for the containers it launches. If something comes along (in this case, libnvidia-container) and changes those settings outside of docker, then docker should be free to resolve the discrepancy in order to keep its state in sync.

The long-term solution should probably involve making libnvidia-container "docker-aware" in some way, so that it can propagate the necessary state changes via:

https://docs.docker.com/engine/api/v1.25/#operation/ContainerUpdate

I know this goes against the current design (i.e., keeping libnvidia-container container-runtime agnostic), but I don't see any other way around this.

For example, if you do a docker inspect on a functioning GPU-enabled container today, you will see that its device list is empty, even though it clearly has the nvidia devices injected into it and the cgroup access to those devices is set up properly. However, once some external entity hits docker's ContainerUpdate API (whether directly via the CLI or through an API call like the CPUManager does in Kubernetes), docker re-applies this empty device list, essentially "undoing" what libnvidia-container had set up for these devices.
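
For instance, on the container from the original report (a sketch; the exact output format may vary):

$ docker inspect --format '{{json .HostConfig.Devices}}' 9e    # prints an empty/null device list
$ docker exec -it 9e sh -c 'ls /dev/nvidia*'                   # yet the nvidia devices are clearly present inside the container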

@RenaudWasTaken How does the new --gpus flag for docker handle the fact that libnvidia-container is messing with cgroups/devices outside of docker's control?

klueska commented Apr 4, 2019

@mrjackbo if your setup is constrained such that GPUs will only ever be used by containers that have CPUsets assigned to them via the static allocation policy, then the following patch to Kubernetes will avoid having docker update its cgroups after these containers are initially launched.

diff --git a/pkg/kubelet/cm/cpumanager/cpu_manager.go b/pkg/kubelet/cm/cpumanager/cpu_manager.go
index 4ccddd5..ff3fbdf 100644
--- a/pkg/kubelet/cm/cpumanager/cpu_manager.go
+++ b/pkg/kubelet/cm/cpumanager/cpu_manager.go
@@ -242,7 +242,8 @@ func (m *manager) reconcileState() (success []reconciledContainer, failure []rec
                        // - policy does not want to track the container
                        // - kubelet has just been restarted - and there is no previous state file
                        // - container has been removed from state by RemoveContainer call (DeletionTimestamp is set)
-                       if _, ok := m.state.GetCPUSet(containerID); !ok {
+                       cset, ok := m.state.GetCPUSet(containerID)
+                       if !ok {
                                if status.Phase == v1.PodRunning && pod.DeletionTimestamp == nil {
                                        klog.V(4).Infof("[cpumanager] reconcileState: container is not present in state - trying to add (pod: %s, container: %s, container id: %s)", pod.Name, container.Name, containerID)
                                        err := m.AddContainer(pod, &container, containerID)
@@ -258,7 +259,13 @@ func (m *manager) reconcileState() (success []reconciledContainer, failure []rec
                                }
                        }

-                       cset := m.state.GetCPUSetOrDefault(containerID)
+                       if !cset.IsEmpty() && m.policy.Name() == string(PolicyStatic) {
+                               klog.V(4).Infof("[cpumanager] reconcileState: skipping container; assigned cpuset unchanged (pod: %s, container: %s, container id: %s, cpuset: \"%v\")", pod.Name, container.Name, containerID, cset)
+                               success = append(success, reconciledContainer{pod.Name, container.Name, containerID})
+                               continue
+                       }
+
+                       cset = m.state.GetDefaultCPUSet()
                        if cset.IsEmpty() {
                                // NOTE: This should not happen outside of tests.
                                klog.Infof("[cpumanager] reconcileState: skipping container; assigned cpuset is empty (pod: %s, container: %s)", pod.Name, container.Name)
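
The idea is that, under the static policy, containers whose exclusive cpuset assignment has not changed are skipped during reconciliation, so the kubelet stops issuing periodic container-update calls for them, which is what was triggering docker's cgroup reset described above.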
arlofaria commented Aug 22, 2019

Here's a workaround that might be helpful:
docker run --device /dev/nvidiactl:/dev/nvidiactl --device /dev/nvidia0:/dev/nvidia0 ...
(Replace/repeat nvidia0 with other/more devices as needed.)

This seems to fix the problem with both --runtime=nvidia and the newer --gpus option.
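
For example, adapting the original reproduction (a sketch; device paths vary by system, and /dev/nvidia-uvm or /dev/nvidia-modeset may also be needed depending on the workload):

$ docker run -d --runtime=nvidia -e NVIDIA_VISIBLE_DEVICES=all \
    --device /dev/nvidiactl --device /dev/nvidia-uvm --device /dev/nvidia0 \
    -p 8888 gcr.io/tensorflow/tensorflow:1.3.0-gpu-py3
$ docker update --cpu-quota 640000 <container>
$ docker exec -it <container> nvidia-smi   # keeps working after the update

Presumably this works because docker records the explicitly passed devices in HostConfig.Devices and re-whitelists them whenever it rebuilds the container's device cgroup, instead of leaving them only in the out-of-band state set up by libnvidia-container.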
