[ci-kubernetes-e2e-gce-device-plugin-gpu] NVIDIA K80 end of support #32242
@dims: GitHub didn't allow me to assign the following users: sig-node. Note that only kubernetes members with read permissions, repo collaborators, and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this: /assign sig-node
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/sig node
This will likely affect the v1.31 CI Signal for GPU scheduling.
We do have a backup CI job we can promote to informing: https://testgrid.k8s.io/amazon-ec2#ci-kubernetes-e2e-ec2-device-plugin-gpu&width=20
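If someone picks that up, a rough sketch of where to start in test-infra (the path and annotation names are from memory and worth verifying against the repo):

```bash
# In a kubernetes/test-infra checkout: locate the backup EC2 GPU job's prow config.
git grep -n "ci-kubernetes-e2e-ec2-device-plugin-gpu" config/jobs/

# "Promoting to informing" would then mean adding the relevant sig-release
# informing dashboard to the job's testgrid-dashboards annotation
# (testgrid-tab-name controls how the tab appears on testgrid.k8s.io).
```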
Have we brought this up to SIG Node directly? (not sure how often they check this issue tracker)
cc @bobbypage
I'll add this to our sig-node-ci agenda.
/assign
I see there was a previous attempt to migrate GPU stuff that was reverted, but without much context (#32147) - it's too long ago to (easily?) find any logs that would indicate why. Anyone got a TL;DR? (and then I can take this)
xref: kubernetes/kubernetes#124950. This is now hard-failing to bring up the clusters as the grace period has expired. We might as well change the config to use some other GPU and see what that failure looks like with current logs?
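For context on what "change the config to use some other GPU" would touch, a minimal sketch assuming the job still flows through kube-up's GCE accelerator setting (the env var name and count here are assumptions to verify against the job config):

```bash
# Sketch only: point cluster bring-up at a different accelerator type.
# Assumes the GCE kube-up path still reads NODE_ACCELERATORS (verify in cluster/gce).
export NODE_ACCELERATORS="type=nvidia-tesla-t4,count=1"   # previously an nvidia-tesla-k80 type

# kube-up plumbs this into the node instance template, roughly equivalent to
# passing --accelerator=type=nvidia-tesla-t4,count=1 to gcloud.
./cluster/kube-up.sh
```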
Note that while the job is "green" after #32635, we are not actually running any GPU tests.
Run with T4: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-device-plugin-gpu/1792639007469867008 (with no GPU test running though ...)
The job's ARTIFACTS are weird ... docker logs but not containerd ... and nothing for the GPU driver install / no pod logs. That's probably the next thing to fix; the worker node artifacts are lacking details that would be helpful. I'm guessing, though, that we don't actually get the GPU installed and so the test doesn't run, and unlike the Windows GPU test it doesn't get marked skipped, it just doesn't report.
/cc
Last time I looked at this (2 months ago), installing NVIDIA drivers on cos-109 for the T4 GPU was not working. kubernetes/kubernetes#123814 https://kubernetes.slack.com/archives/CCK68P2Q2/p1708914356010229 (long thread with more details)
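For anyone reproducing the cos-109 + T4 driver problem, a sketch of the manual path (the daemonset URL is the commonly used upstream installer, the CI job may pin something else, and the label selector is an assumption):

```bash
# Apply the upstream COS NVIDIA driver installer daemonset (sketch; the CI job may use a pinned copy).
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

# Watch the installer pods; this is the step reported as failing on cos-109 with T4.
kubectl get pods -n kube-system -l k8s-app=nvidia-driver-installer -o wide
kubectl logs -n kube-system -l k8s-app=nvidia-driver-installer --tail=100
```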
I think the next step is getting the GPU jobs to dump pod / containerd logs, so we can actually see what is happening.
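A minimal sketch of what that collection would need to grab (the systemd unit names are standard; the pod label selector is an assumption based on the usual GCE addon manifests):

```bash
# On the GPU worker node: runtime and kubelet journals.
journalctl -u containerd --no-pager > containerd.log
journalctl -u kubelet --no-pager > kubelet.log

# From the test runner: device plugin pod logs (label selector is an assumption
# to verify against the addon manifest), plus the driver installer logs shown above.
kubectl logs -n kube-system -l k8s-app=nvidia-gpu-device-plugin --tail=-1 > nvidia-gpu-device-plugin.log
```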
This job still uses kubernetes_e2e.py, so there's that ... the kubelet-serial-containerd job does as well. These should really be using kubetest2 and skipping the ancient, deprecated scenarios/*, but ... back on this topic: the jobs running containerd with this old tooling have some additional arguments that shouldn't be necessary on current runners. Rather than migrate, I'm just going to add the logging args quickly and move on for now, enough other things to deal with :-)
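For whenever the kubetest2 migration does happen, a very rough sketch of the shape of the invocation (deployer and tester flag names should be checked against current kubetest2-gce and the ginkgo tester; the focus regex is an assumption):

```bash
# Rough kubetest2 equivalent of the old kubernetes_e2e.py scenario (sketch only).
kubetest2 gce \
  --up --down \
  --test=ginkgo \
  -- \
  --focus-regex='\[Feature:GPUDevicePlugin\]' \
  --parallel=1
```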
#32640 adds containerd logs; we probably also need logs from the device plugin pod.
... or from the GPU driver install.
So the problem now is that there are no tests ... kubernetes/kubernetes#124950 (comment). It's unclear whether, if we had tests, they would work on this cluster, but the cluster is coming up, and the reason we have no test results is that there are no tests now :-) (as opposed to the cluster not coming up due to using K80 previously)
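One quick sanity check for the "no tests" theory is to see what the job's focus/skip regexes could even match; a sketch against a kubernetes/kubernetes checkout (the Feature tag is my assumption for the GPU device plugin specs):

```bash
# List e2e specs tagged for the GPU device plugin (tag name is an assumption to verify),
# then compare against the ginkgo focus/skip flags in the prow job definition.
git grep -n "GPUDevicePlugin" test/e2e/ | head -n 20
```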
The job ci-kubernetes-e2e-gce-device-plugin-gpu will start failing in GCP in May (https://cloud.google.com/compute/docs/eol/k80-eol). By then we are hoping sig-node folks can help figure out a transition to T4 GPUs.
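For reference, the change the K80 EOL forces at the instance level is just the accelerator type on the node VMs; a hedged gcloud sketch (zone, machine type, and image are placeholders, and T4 availability varies by zone):

```bash
# Placeholder values throughout; this only illustrates the accelerator-related flags.
gcloud compute instances create gpu-node-example \
  --zone=us-central1-a \
  --machine-type=n1-standard-4 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --maintenance-policy=TERMINATE \
  --image-family=cos-stable \
  --image-project=cos-cloud
```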
/assign sig-node
Context: following up from #32241