
Cannot checkpoint container: /usr/bin/nvidia-container-runtime did not terminate successfully: exit status 1 #2397

dayinfinite opened this issue Apr 26, 2024 · 7 comments


@dayinfinite

Description
k8s 1.28
containerd 2.0

I want to call the kubelet checkpoint API with curl to create a container checkpoint.

Steps to reproduce the issue:

  1. curl -sk -X POST "https://127.0.0.1:10250/checkpoint/default/gpu-base-02/gpu-base-02" --key /etc/kubernetes/pki/apiserver-kubelet-client.key --cacert /etc/kubernetes/pki/ca.crt --cert /etc/kubernetes/pki/apiserver-kubelet-client.crt
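Note: on Kubernetes 1.28 the kubelet checkpoint endpoint used above is gated behind the alpha `ContainerCheckpoint` feature gate, so it has to be enabled before the request returns anything. A quick way to confirm, assuming the kubelet config file lives at `/var/lib/kubelet/config.yaml` (the path varies by distribution):

```sh
# The checkpoint API only responds when the kubelet runs with the
# ContainerCheckpoint feature gate enabled (alpha in 1.25-1.29).
grep -A3 featureGates /var/lib/kubelet/config.yaml
# Expect output similar to:
#   featureGates:
#     ContainerCheckpoint: true
```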

Describe the results you received:

An error occurs, showing: checkpointing of default/gpu-base-02/gpu-base-02 failed (rpc error: code = Unknown desc = checkpointing container "208a82339ddc590e460b89912304f56ad64924f89a959f982b17aeb6ab0c2aa8" failed: /usr/bin/nvidia-container-runtime did not terminate successfully: exit status 1: criu failed: type NOTIFY errno 0 path= /run/containerd/io.containerd.runtime.v2.task/k8s.io/208a82339ddc590e460b89912304f56ad64924f89a959f982b17aeb6ab0c2aa8/criu-dump.log: unknown)

Describe the results you expected:

A container checkpoint is created successfully.
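The error message already names the CRIU log file inside the containerd task directory; reading it directly is how the dump log below was collected:

```sh
# Path copied from the error message above; the file goes away once the
# task directory is cleaned up.
sudo cat /run/containerd/io.containerd.runtime.v2.task/k8s.io/208a82339ddc590e460b89912304f56ad64924f89a959f982b17aeb6ab0c2aa8/criu-dump.log
```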

Additional information you deem important (e.g. issue happens only occasionally):

CRIU logs and information:

CRIU full dump/restore logs:

(00.011105) mnt: Inspecting sharing on 1494 shared_id 0 master_id 0 (@./proc/sys)
(00.011109) mnt: Inspecting sharing on 1493 shared_id 0 master_id 0 (@./proc/irq)
(00.011113) mnt: Inspecting sharing on 1492 shared_id 0 master_id 0 (@./proc/fs)
(00.011116) mnt: Inspecting sharing on 1491 shared_id 0 master_id 0 (@./proc/bus)
(00.011120) mnt: Inspecting sharing on 1611 shared_id 0 master_id 13 (@./proc/driver/nvidia/gpus/0000:b1:00.0)
(00.011124) Error (criu/mount.c:1088): mnt: Mount 1611 ./proc/driver/nvidia/gpus/0000:b1:00.0 (master_id: 13 shared_id: 0) has unreachable sharing. Try --enable-external-masters.
(00.011142) net: Unlock network
(00.011146) Running network-unlock scripts
(00.011149) RPC
(00.072541) Unfreezing tasks into 1
(00.072552) Unseizing 1641382 into 1
(00.072562) Unseizing 1641424 into 1
(00.072568) Unseizing 1641533 into 1
(00.072580) Unseizing 1641475 into 1
(00.072586) Unseizing 1641500 into 1
(00.072599) Unseizing 2157578 into 1
(00.072632) Error (criu/cr-dump.c:2093): Dumping FAILED.
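Side note on the log line `Try --enable-external-masters.`: runc documents `/etc/criu/runc.conf` as a place to hand extra options to CRIU, so the suggested flag could be supplied as sketched below. This is only an illustration of that mechanism (and assumes nvidia-container-runtime passes through to runc); it does not address the NVIDIA device state that CRIU cannot dump, as the replies below confirm.

```sh
# Sketch only: runc asks CRIU to also evaluate /etc/criu/runc.conf, where
# options are written without the leading "--". At best this moves past the
# mount-sharing error; GPU state still blocks the dump.
echo "enable-external-masters" | sudo tee -a /etc/criu/runc.conf
```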

Output of `criu --version`:

Version: 3.18

Output of `criu check --all`:

Looks good.

Additional environment details:

@adrianreber
Member

Checkpointing Kubernetes containers with Nvidia GPUs is not working as far as we know.

We have seen success with AMD GPUs.

@dayinfinite
Author

> Checkpointing Kubernetes containers with Nvidia GPUs is not working as far as we know.
>
> We have seen success with AMD GPUs.

I checked and containerd reports this error, and containerd uses CRIU underneath. Do you know how to skip the NVIDIA devices when checkpointing a container that uses NVIDIA GPUs?

@dayinfinite
Copy link
Author

> Checkpointing Kubernetes containers with Nvidia GPUs is not working as far as we know.
>
> We have seen success with AMD GPUs.
>
> I checked and containerd reports this error, and containerd uses CRIU underneath. Do you know how to skip the NVIDIA devices when checkpointing a container that uses NVIDIA GPUs?

I just want to keep the environment inside the container, mainly the files.

@adrianreber
Member

Then checkpointing is the wrong approach.

@dayinfinite
Author

> Then checkpointing is the wrong approach.

What I mean is that I want to preserve the environment inside the container. After checkpointing, the export is then built into an image. This is a very fast way to build a runtime image.
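For reference, the workflow described here, turning a kubelet checkpoint archive into an OCI image, is sketched in the upstream forensic container checkpointing example roughly as below. The archive name is a placeholder for whatever the kubelet writes under `/var/lib/kubelet/checkpoints/`, and restoring such an image is runtime-dependent (the annotation shown is the one CRI-O looks for); none of it works while the checkpoint itself fails on the GPU, as noted in the replies.

```sh
# Sketch based on the upstream example: wrap the checkpoint tar in a scratch
# image with buildah. CKPT_TAR is a placeholder; the real file name includes
# the pod, namespace, container name and a timestamp.
CKPT_TAR=/var/lib/kubelet/checkpoints/checkpoint-gpu-base-02_default-gpu-base-02-TIMESTAMP.tar
newcontainer=$(buildah from scratch)
buildah add "$newcontainer" "$CKPT_TAR" /
buildah config --annotation=io.kubernetes.cri-o.annotations.checkpoint.name=gpu-base-02 "$newcontainer"
buildah commit "$newcontainer" checkpoint-image:latest
```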

@adrianreber
Member

Sorry, I do not understand what you want to do. First you said you want to checkpoint the container, then you said you just want to keep the environment inside of the container.

Anyway, checkpointing containers with Nvidia GPUs does not work. You need to talk to Nvidia to enable it.

@dayinfinite
Author

> Sorry, I do not understand what you want to do. First you said you want to checkpoint the container, then you said you just want to keep the environment inside of the container.
>
> Anyway, checkpointing containers with Nvidia GPUs does not work. You need to talk to Nvidia to enable it.

Thanks.
