[ci-kubernetes-e2e-gce-device-plugin-gpu] NVIDIA K80 end of support #32242
@dims: GitHub didn't allow me to assign the following users: sig-node. Note that only kubernetes members with read permissions, repo collaborators, and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. In response to this: /assign sig-node
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/sig node
This will likely affect the v1.31 CI Signal for GPU scheduling.
We do have a backup CI job we can promote to informing: https://testgrid.k8s.io/amazon-ec2#ci-kubernetes-e2e-ec2-device-plugin-gpu&width=20
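If someone picks that up, a rough sketch of where to start in test-infra (the path and annotation names are from memory and worth verifying against the repo):

```bash
# In a kubernetes/test-infra checkout: locate the backup EC2 GPU job's prow config.
git grep -n "ci-kubernetes-e2e-ec2-device-plugin-gpu" config/jobs/

# "Promoting to informing" would then mean adding the relevant sig-release
# informing dashboard to the job's testgrid-dashboards annotation
# (testgrid-tab-name controls how the tab appears on testgrid.k8s.io).
```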
Have we brought this up to SIG Node directly? (not sure how often they check this issue tracker)
cc @bobbypage
I'll add this to our sig-node-ci agenda.
/assign
I see there was a previous attempt to migrate GPU stuff that was reverted, but without much context (#32147) - it's too long ago to (easily?) find any logs that would indicate why. Anyone got a TL;DR? (and then I can take this)
xref: kubernetes/kubernetes#124950. This is now hard-failing to bring up the clusters as the grace period has expired. We might as well change the config to use some other GPU and see what that failure looks like with current logs?
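For context on what "change the config to use some other GPU" would touch, a minimal sketch assuming the job still flows through kube-up's GCE accelerator setting (the env var name and count here are assumptions to verify against the job config):

```bash
# Sketch only: point cluster bring-up at a different accelerator type.
# Assumes the GCE kube-up path still reads NODE_ACCELERATORS (verify in cluster/gce).
export NODE_ACCELERATORS="type=nvidia-tesla-t4,count=1"   # previously an nvidia-tesla-k80 type

# kube-up plumbs this into the node instance template, roughly equivalent to
# passing --accelerator=type=nvidia-tesla-t4,count=1 to gcloud.
./cluster/kube-up.sh
```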
Note that while the job is "green" after #32635, we are not actually running any GPU tests.
Run with T4: https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/ci-kubernetes-e2e-gce-device-plugin-gpu/1792639007469867008 (with no GPU test running though ...)
The job's ARTIFACTS are weird ... docker logs but not containerd ... and nothing for the GPU driver install / no pod logs. That's probably the next thing to fix; the worker node artifacts are lacking details that would be helpful. I'm guessing, though, that we don't actually get the GPU installed and so the test doesn't run, and unlike the Windows GPU test it doesn't get marked skipped, it just doesn't report.
/cc
Last time I looked at this (2 months ago), installing NVIDIA drivers on cos-109 for the T4 GPU was not working. kubernetes/kubernetes#123814 https://kubernetes.slack.com/archives/CCK68P2Q2/p1708914356010229 (long thread with more details)
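For anyone reproducing the cos-109 + T4 driver problem, a sketch of the manual path (the daemonset URL is the commonly used upstream installer, the CI job may pin something else, and the label selector is an assumption):

```bash
# Apply the upstream COS NVIDIA driver installer daemonset (sketch; the CI job may use a pinned copy).
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/cos/daemonset-preloaded.yaml

# Watch the installer pods; this is the step reported as failing on cos-109 with T4.
kubectl get pods -n kube-system -l k8s-app=nvidia-driver-installer -o wide
kubectl logs -n kube-system -l k8s-app=nvidia-driver-installer --tail=100
```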
I think the next step is getting the GPU jobs to dump pod / containerd logs, so we can actually see what is happening.
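A minimal sketch of what that collection would need to grab (the systemd unit names are standard; the pod label selector is an assumption based on the usual GCE addon manifests):

```bash
# On the GPU worker node: runtime and kubelet journals.
journalctl -u containerd --no-pager > containerd.log
journalctl -u kubelet --no-pager > kubelet.log

# From the test runner: device plugin pod logs (label selector is an assumption
# to verify against the addon manifest), plus the driver installer logs shown above.
kubectl logs -n kube-system -l k8s-app=nvidia-gpu-device-plugin --tail=-1 > nvidia-gpu-device-plugin.log
```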
This job still uses kubernetes_e2e.py, so there's that ... the kubelet-serial-containerd job does as well. These should really be using kubetest2 and skipping the ancient, deprecated scenarios/*, but ... back on this topic: the jobs running containerd with this old tooling have some additional arguments that shouldn't be necessary on current runners. Rather than migrate, I'm just going to add the logging args quickly and move on for now, enough other things to deal with :-)
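For whenever the kubetest2 migration does happen, a very rough sketch of the shape of the invocation (deployer and tester flag names should be checked against current kubetest2-gce and the ginkgo tester; the focus regex is an assumption):

```bash
# Rough kubetest2 equivalent of the old kubernetes_e2e.py scenario (sketch only).
kubetest2 gce \
  --up --down \
  --test=ginkgo \
  -- \
  --focus-regex='\[Feature:GPUDevicePlugin\]' \
  --parallel=1
```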
#32640 adds containerd logs; we probably also need logs from the device plugin pod.
... or from the GPU driver install.
So the problem now is that there are no tests ... kubernetes/kubernetes#124950 (comment). It's unclear whether, if we had tests, they would work on this cluster, but the cluster is coming up, and the reason we have no test results is that there are no tests now :-) (as opposed to the cluster not coming up due to using K80 previously)
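One quick sanity check for the "no tests" theory is to see what the job's focus/skip regexes could even match; a sketch against a kubernetes/kubernetes checkout (the Feature tag is my assumption for the GPU device plugin specs):

```bash
# List e2e specs tagged for the GPU device plugin (tag name is an assumption to verify),
# then compare against the ginkgo focus/skip flags in the prow job definition.
git grep -n "GPUDevicePlugin" test/e2e/ | head -n 20
```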
The job ci-kubernetes-e2e-gce-device-plugin-gpu will start failing in GCP in May (https://cloud.google.com/compute/docs/eol/k80-eol). By then we are hoping sig-node folks can help figure out a transition to T4 GPUs.
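For reference, the change the K80 EOL forces at the instance level is just the accelerator type on the node VMs; a hedged gcloud sketch (zone, machine type, and image are placeholders, and T4 availability varies by zone):

```bash
# Placeholder values throughout; this only illustrates the accelerator-related flags.
gcloud compute instances create gpu-node-example \
  --zone=us-central1-a \
  --machine-type=n1-standard-4 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --maintenance-policy=TERMINATE \
  --image-family=cos-stable \
  --image-project=cos-cloud
```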
/assign sig-node
Context: following up from #32241