Summary
AICR H100 GitHub Actions runners have shown long-running performance variance and instability. Recent slow, canceled, or failed jobs strongly correlate with managed-runner NodeUID: 507dc070-3563-4487-8a0c-d411242a11a1.
Description
AICR H100 Kind CI has had intermittent slow-runner and instability issues for a while. The current investigation focused on recent failures in the linux-amd64-gpu-h100-latest-2 runner pool, where the variance became severe enough that the H100 PR jobs did not pass for several days.
On slow/unhealthy runners, jobs run for more than 2 hours and fail with Kubernetes-level symptoms: API server timeouts, control-plane instability, readiness/liveness probe timeouts, runtime components not becoming available, and cleanup/debug steps timing out.
On healthy runners, the same PR workflows complete in about 20 minutes.
This strongly suggests that the root issue is runner performance/health rather than AICR code or CI logic. AICR can mitigate some CI behavior, but when the runner is overloaded, the Kind control plane becomes unreliable and causes cascading failures.
Evidence
The strongest recent signal is the correlation with the managed-runner NodeUID printed in the NVIDIA runner bootstrap logs. Multiple recent slow, canceled, or failed jobs landed on the same NodeUID:
507dc070-3563-4487-8a0c-d411242a11a1
Those jobs used different ephemeral runner names and even different H100 runner paths (x1 and x2):
24ba-l-amd-g-h100-l-2-f45nb-runner-pdpwm
24ba-l-amd-g-h100-l-2-f45nb-runner-rdn4z
609e-l-amd-g-h100-l-1-9rdw2-runner-cb5rw
Successful fast runs landed on a different NodeUID:
3510eba6-bc52-4e81-9bb3-4f7ab58ebc51
This points to a broader H100 runner performance/reliability issue, with 507dc070-3563-4487-8a0c-d411242a11a1 as a recent concrete host-level example that appears unhealthy or overloaded.
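The correlation described above can be checked mechanically by grouping the runs listed in this report by NodeUID. The following is a minimal sketch, not part of the actual investigation tooling: the names `RUNS`, `BAD_NODE`, `GOOD_NODE`, and `outcomes_by_node` are hypothetical, and the tuples are transcribed by hand from the job details in this report.

```python
from collections import defaultdict

# Hypothetical labels for the two NodeUIDs named in this report.
BAD_NODE = "507dc070-3563-4487-8a0c-d411242a11a1"
GOOD_NODE = "3510eba6-bc52-4e81-9bb3-4f7ab58ebc51"

# (job, ephemeral runner, NodeUID, outcome), transcribed from this report.
RUNS = [
    ("GPU Training Test (nvkind + H100 x2)",
     "24ba-l-amd-g-h100-l-2-f45nb-runner-pdpwm", BAD_NODE, "failed/canceled"),
    ("GPU Inference Test (nvkind + H100 x2)",
     "24ba-l-amd-g-h100-l-2-f45nb-runner-rdn4z", BAD_NODE, "failed"),
    ("GPU Inference Test (nvkind + H100 x1)",
     "609e-l-amd-g-h100-l-1-9rdw2-runner-cb5rw", BAD_NODE, "canceled"),
    ("GPU Inference Test (nvkind + H100 x2)",
     "24ba-l-amd-g-h100-l-2-f45nb-runner-mtscq", GOOD_NODE, "success"),
    ("GPU Training Test (nvkind + H100 x2)",
     "24ba-l-amd-g-h100-l-2-f45nb-runner-bnk5r", GOOD_NODE, "success"),
]


def outcomes_by_node(runs):
    """Count successful and unsuccessful runs per NodeUID and collect runner names."""
    grouped = defaultdict(lambda: {"success": 0, "not_success": 0, "runners": set()})
    for _job, runner, node_uid, outcome in runs:
        bucket = grouped[node_uid]
        key = "success" if outcome == "success" else "not_success"
        bucket[key] += 1
        bucket["runners"].add(runner)
    return dict(grouped)


summary = outcomes_by_node(RUNS)
for node_uid, stats in summary.items():
    print(f"{node_uid}: {stats['success']} ok, "
          f"{stats['not_success']} slow/failed/canceled, "
          f"{len(stats['runners'])} distinct runners")
```

With only five runs this is an illustration of the pattern rather than statistical proof, but it shows that the outcome tracks the NodeUID and not the ephemeral runner name.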
Primary Recent Failure
Most recent PR #694 H100 x2 conformance test failure/cancel:
Job: GPU Training Test (nvkind + H100 x2)
Runner: 24ba-l-amd-g-h100-l-2-f45nb-runner-pdpwm
Runner pool: linux-amd64-gpu-h100-latest-2
NodeUID: 507dc070-3563-4487-8a0c-d411242a11a1
Symptom: ... kai-scheduler/admission to become ready
Additional Failed/Slow Evidence
Failed inference example with high load diagnostics:
Job: GPU Inference Test (nvkind + H100 x2)
Runner: 24ba-l-amd-g-h100-l-2-f45nb-runner-rdn4z
Runner pool: linux-amd64-gpu-h100-latest-2
NodeUID: 507dc070-3563-4487-8a0c-d411242a11a1
Resources: 41 CPUs and 241GiB memory
... 75 in one diagnostic snapshot
Load average: 129.20, 127.54, 112.75
Earlier canceled inference example:
Job: GPU Inference Test (nvkind + H100 x1)
Runner: 609e-l-amd-g-h100-l-1-9rdw2-runner-cb5rw
NodeUID: 507dc070-3563-4487-8a0c-d411242a11a1
Successful Fast Runs
Successful inference example:
Job: GPU Inference Test (nvkind + H100 x2)
Duration: 20m46s
Runner: 24ba-l-amd-g-h100-l-2-f45nb-runner-mtscq
NodeUID: 3510eba6-bc52-4e81-9bb3-4f7ab58ebc51
Successful training example:
Job: GPU Training Test (nvkind + H100 x2)
Duration: 20m58s
Runner: 24ba-l-amd-g-h100-l-2-f45nb-runner-bnk5r
NodeUID: 3510eba6-bc52-4e81-9bb3-4f7ab58ebc51
The same workflow completed quickly on the same healthy NodeUID as the successful inference run.
Conclusion
AICR H100 Kind CI has seen slow-runner and instability issues over time. The recent failures strongly correlate with the underlying managed-runner NodeUID:
507dc070-3563-4487-8a0c-d411242a11a1
This correlation is independent of the ephemeral runner name and appears across both the x1 and x2 H100 runner paths. The observed Kubernetes failures look different at the component level, but they share the same underlying signal: overloaded or unhealthy runner/node behavior.
This points to a runner infrastructure issue rather than an AICR code or CI logic issue.
Follow-Up
Follow up with IPP to improve H100 runner performance and reliability, especially by avoiding overloaded/unhealthy hosts.
Why are jobs scheduled on NodeUID: 507dc070-3563-4487-8a0c-d411242a11a1 much slower?
Is this host unhealthy, overloaded, or oversubscribed?
Are multiple runner VMs being scheduled on the same physical H100 host?
Is there CPU steal/pressure, IO pressure, memory pressure, or kubelet/containerd contention?
Can this host be quarantined if unhealthy?
Can IPP review whether this pattern exists on other H100 runner hosts as well?
Can runner scheduling or startup health checks include load average, CPU pressure/steal, IO pressure, memory pressure, and concurrent runner count per host?
Can runner health metadata, including NodeUID and pressure signals, be exposed consistently in job logs so CI failures can be correlated to host health quickly?
Improve the current AICR H100 CI flow so it is simpler, cheaper, and easier to diagnose.
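The startup health check suggested in the follow-up questions could look roughly like the sketch below. It is an illustration only, under stated assumptions: it relies on the standard Linux `/proc/loadavg` and `/proc/pressure/<resource>` (PSI) file formats, the function names are hypothetical, the two thresholds are illustrative guesses rather than values agreed with IPP, and CPU steal and concurrent runner count are left out. It also assumes that the figures 129.20, 127.54, 112.75 from the failed inference run are 1/5/15-minute load averages on the 41-CPU runner.

```python
from pathlib import Path


def parse_loadavg(text):
    """Parse the first three fields of /proc/loadavg (1, 5, 15 minute load)."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def parse_psi_some_avg10(text):
    """Return the 'some avg10' percentage from a /proc/pressure/<resource> file."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "some":
            for field in fields[1:]:
                key, _, value = field.partition("=")
                if key == "avg10":
                    return float(value)
    raise ValueError("no 'some ... avg10=' entry found")


def overload_reasons(load1, cpu_count, psi_avg10,
                     max_load_per_cpu=1.5, max_psi_avg10=20.0):
    """List reasons a host looks overloaded; both thresholds are illustrative guesses."""
    reasons = []
    load_per_cpu = load1 / cpu_count
    if load_per_cpu > max_load_per_cpu:
        reasons.append(f"load per CPU {load_per_cpu:.2f} > {max_load_per_cpu}")
    for resource, avg10 in sorted(psi_avg10.items()):
        if avg10 > max_psi_avg10:
            reasons.append(f"{resource} pressure avg10 {avg10:.1f}% > {max_psi_avg10}%")
    return reasons


def read_psi(root=Path("/proc/pressure")):
    """Read cpu/io/memory pressure where available; missing or unreadable files are skipped."""
    values = {}
    for resource in ("cpu", "io", "memory"):
        try:
            values[resource] = parse_psi_some_avg10((root / resource).read_text())
        except (OSError, ValueError):
            continue
    return values


# Figures from the failed inference run in this report, read as load averages on 41 CPUs.
load1, _, _ = parse_loadavg("129.20 127.54 112.75")
report_reasons = overload_reasons(load1, cpu_count=41, psi_avg10={})
print(report_reasons)
```

On a live runner, a bootstrap step could call `parse_loadavg(Path("/proc/loadavg").read_text())` together with `read_psi()` and print the result next to the NodeUID, so that a failed job can be matched to host load without a separate investigation.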
Success Criteria