H100 Kind CI unstable on GPU runners #696

@yuanchen8911

Summary

AICR H100 GitHub Actions runners have shown persistent performance variance and instability. Recent slow, canceled, or failed jobs strongly correlate with managed-runner NodeUID: 507dc070-3563-4487-8a0c-d411242a11a1.

Description

AICR H100 Kind CI has had intermittent slow-runner and instability issues for a while. The current investigation focused on recent failures in the linux-amd64-gpu-h100-latest-2 runner pool, where the variance became severe enough that the H100 PR jobs did not pass for several days.

On slow/unhealthy runners, jobs run for more than 2 hours and fail with Kubernetes-level symptoms: API server timeouts, control-plane instability, readiness/liveness probe timeouts, runtime components not becoming available, and cleanup/debug steps timing out.

On healthy runners, the same PR workflows complete in about 20 minutes.

This strongly suggests the root issue is runner performance/health rather than AICR code or CI logic. AICR can mitigate some CI behavior, but when the runner is overloaded the Kind control plane becomes unreliable and causes cascading failures.

Evidence

The strongest recent signal is the correlation with the managed-runner NodeUID printed in the NVIDIA runner bootstrap logs. Multiple recent slow, canceled, or failed jobs landed on the same NodeUID:

  • 507dc070-3563-4487-8a0c-d411242a11a1

Those jobs used different ephemeral runner names and even different H100 runner paths:

  • 24ba-l-amd-g-h100-l-2-f45nb-runner-pdpwm
  • 24ba-l-amd-g-h100-l-2-f45nb-runner-rdn4z
  • 609e-l-amd-g-h100-l-1-9rdw2-runner-cb5rw

Successful fast runs landed on a different NodeUID:

  • 3510eba6-bc52-4e81-9bb3-4f7ab58ebc51

This points to a broader H100 runner performance/reliability issue, with 507dc070-3563-4487-8a0c-d411242a11a1 as a recent concrete host-level example that appears unhealthy or overloaded.
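The correlation above can be reproduced mechanically by grouping job records by NodeUID instead of by runner name. The following is a minimal sketch, not tooling that exists in AICR: the `JOBS` list is transcribed by hand from the failing jobs described in this issue, and `group_by_node` is a hypothetical helper name.

```python
from collections import defaultdict

BAD_NODE_UID = "507dc070-3563-4487-8a0c-d411242a11a1"

# Records transcribed from this issue: (runner name, NodeUID, outcome).
# Only the failing jobs are listed, because the issue names their runners.
JOBS = [
    ("24ba-l-amd-g-h100-l-2-f45nb-runner-pdpwm", BAD_NODE_UID, "failed/canceled"),
    ("24ba-l-amd-g-h100-l-2-f45nb-runner-rdn4z", BAD_NODE_UID, "failed"),
    ("609e-l-amd-g-h100-l-1-9rdw2-runner-cb5rw", BAD_NODE_UID, "canceled"),
]


def group_by_node(jobs):
    """Map each NodeUID to the list of (runner, outcome) pairs seen on it."""
    by_node = defaultdict(list)
    for runner, node_uid, outcome in jobs:
        by_node[node_uid].append((runner, outcome))
    return dict(by_node)


if __name__ == "__main__":
    for node_uid, runs in group_by_node(JOBS).items():
        print(f"{node_uid}: {len(runs)} job(s)")
        for runner, outcome in runs:
            print(f"  {runner}: {outcome}")
```

Grouping this way makes it visible that three distinct ephemeral runner names collapse onto one host identifier.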

Primary Recent Failure

Most recent failed/canceled H100 x2 conformance test, on PR #694:

  • Job: https://github.com/NVIDIA/aicr/actions/runs/25013461746/job/73255076257?pr=694
  • Workflow: GPU Training Test (nvkind + H100 x2)
  • Runner: 24ba-l-amd-g-h100-l-2-f45nb-runner-pdpwm
  • Runner label: linux-amd64-gpu-h100-latest-2
  • NodeUID / host id: 507dc070-3563-4487-8a0c-d411242a11a1
  • Result: failed/canceled around the 2-hour job limit
  • Root symptom: Chainsaw was stuck waiting for kai-scheduler/admission to become ready
  • Interpretation: the 2-hour timeout only cut off the run; the real issue was the runtime component not becoming healthy on that slow host

Additional Failed/Slow Evidence

Failed inference example with high load diagnostics:

  • Job: https://github.com/NVIDIA/aicr/actions/runs/24974424715/job/73135328977?pr=687
  • Workflow: GPU Inference Test (nvkind + H100 x2)
  • Runner: 24ba-l-amd-g-h100-l-2-f45nb-runner-rdn4z
  • Runner label: linux-amd64-gpu-h100-latest-2
  • NodeUID / host id: 507dc070-3563-4487-8a0c-d411242a11a1
  • Runner exposed roughly 41 CPUs and 241GiB memory
  • Diagnostics showed severe host pressure:
    • Load average around 75 in one diagnostic snapshot
    • Near failure/debug time: 129.20, 127.54, 112.75 (presumably the 1-, 5-, and 15-minute load averages)
  • Kubelet and containerd were each consuming more than one full CPU
  • Symptoms included:
    • API server timeouts
    • control-plane instability
    • repeated readiness/liveness probe timeouts across unrelated pods
    • runtime components not becoming Available
    • cleanup/debug steps timing out
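To put the load numbers above in context, it helps to normalize them by the CPU count the runner exposed. The sketch below assumes the reported values are standard Linux load averages (which count runnable tasks and tasks in uninterruptible sleep) and uses the "roughly 41 CPUs" figure from the diagnostics. `load_per_cpu` is a hypothetical helper, not part of the CI workflow.

```python
def load_per_cpu(load_average: float, cpu_count: int) -> float:
    """Return the load average divided by the number of CPUs.

    Values well above 1.0 mean more tasks are waiting than the
    host has CPUs to run them.
    """
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load_average / cpu_count


CPU_COUNT = 41  # "roughly 41 CPUs" per the runner diagnostics

# Load values quoted in this issue.
SNAPSHOTS = {
    "diagnostic snapshot": 75.0,
    "near failure/debug time": 129.20,
}

if __name__ == "__main__":
    for label, load in SNAPSHOTS.items():
        ratio = load_per_cpu(load, CPU_COUNT)
        print(f"{label}: load {load:.2f} / {CPU_COUNT} CPUs = {ratio:.2f} per CPU")
```

Under these assumptions the host was carrying roughly two to three times as many active tasks as it had CPUs, which is consistent with the probe and API server timeouts listed above.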

Earlier canceled inference example:

  • Job: https://github.com/NVIDIA/aicr/actions/runs/25010653926/job/73245252932
  • Workflow: GPU Inference Test (nvkind + H100 x1)
  • Runner: 609e-l-amd-g-h100-l-1-9rdw2-runner-cb5rw
  • Runner path: H100 x1
  • NodeUID / host id: 507dc070-3563-4487-8a0c-d411242a11a1
  • This is supporting evidence that the same bad NodeUID appears across both H100 x1 and x2 runner paths.

Successful Fast Runs

Successful inference example:

Successful training example:

Conclusion

AICR H100 Kind CI has seen slow-runner and instability issues over time. The recent failures strongly correlate with the underlying managed-runner NodeUID:

  • 507dc070-3563-4487-8a0c-d411242a11a1

This correlation is independent of ephemeral runner name and appears across both x1 and x2 H100 runner paths. The observed Kubernetes failures look different at the component level, but they share the same underlying signal: overloaded or unhealthy runner/node behavior.

This points to a runner infrastructure issue rather than an AICR code or CI logic issue.

Follow-Up

  1. Follow up with IPP to improve H100 runner performance and reliability, especially avoiding overloaded/unhealthy hosts.

    • Slack thread: https://nvidia.slack.com/archives/C0A790E5MBP/p1777313281278739?thread_ts=1776424399.083159&cid=C0A790E5MBP
    • Key questions:
      • Why are jobs scheduled on NodeUID: 507dc070-3563-4487-8a0c-d411242a11a1 much slower?
      • Is this host unhealthy, overloaded, or oversubscribed?
      • Are multiple runner VMs being scheduled on the same physical H100 host?
      • Is there CPU steal/pressure, IO pressure, memory pressure, or kubelet/containerd contention?
      • Can this host be quarantined if unhealthy?
      • Can IPP review whether this pattern exists on other H100 runner hosts as well?
      • Can runner scheduling or startup health checks include load average, CPU pressure/steal, IO pressure, memory pressure, and concurrent runner count per host?
      • Can runner health metadata, including NodeUID and pressure signals, be exposed consistently in job logs so CI failures can be correlated to host health quickly?
  2. Improve the current AICR H100 CI flow so it is simpler, cheaper, and easier to diagnose.

    • Latest test PR: Refactor and harden H100 GPU CI workflow #694
    • Keep only lightweight health checks and useful failure diagnostics.
    • Avoid heavy CI workarounds that try to compensate for bad infrastructure.
    • Keep CI-only behavior scoped to Kind and avoid unnecessary changes to end-user deployment paths.
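As an illustration of what a lightweight startup health check could look like, the sketch below reads the load average from `/proc/loadavg` and CPU pressure from the Linux PSI file `/proc/pressure/cpu`, then applies two thresholds. This is an assumption-laden example rather than the check used in #694: the function names and both threshold values are hypothetical, and PSI may be disabled or unreadable on a given runner, in which case the sketch falls back to load average alone.

```python
import os
from pathlib import Path


def parse_loadavg(text: str) -> tuple[float, float, float]:
    """Parse /proc/loadavg content into (1-min, 5-min, 15-min) load averages."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def parse_psi(text: str) -> dict[str, dict[str, float]]:
    """Parse a /proc/pressure/* file into {"some": {...}, "full": {...}}."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        result[kind] = {k: float(v) for k, v in (f.split("=", 1) for f in fields)}
    return result


def is_overloaded(load1: float, cpu_count: int, cpu_some_avg60: float,
                  max_load_per_cpu: float = 1.5,      # hypothetical threshold
                  max_cpu_pressure: float = 25.0) -> bool:  # hypothetical threshold (%)
    """Return True if either load per CPU or CPU pressure exceeds its limit."""
    return load1 / cpu_count > max_load_per_cpu or cpu_some_avg60 > max_cpu_pressure


def check_host() -> bool:
    """Read live values from /proc and report whether the host looks overloaded."""
    load1, _, _ = parse_loadavg(Path("/proc/loadavg").read_text())
    try:
        psi = parse_psi(Path("/proc/pressure/cpu").read_text())
        cpu_some = psi["some"]["avg60"]
    except (OSError, KeyError):
        cpu_some = 0.0  # PSI unavailable: rely on load average only
    return is_overloaded(load1, os.cpu_count() or 1, cpu_some)


if __name__ == "__main__":
    print("overloaded" if check_host() else "healthy")
```

A check of this size runs in milliseconds and could fail fast or log a warning before the Kind cluster is created, which keeps diagnostics cheap without trying to compensate for a bad host.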

Success Criteria

  • IPP confirms and mitigates the overloaded runner condition.
  • H100 Kind CI avoids unhealthy hosts or gets reliable runner-health signaling.
  • AICR keeps the CI workflow clean and maintainable through PR #694 (Refactor and harden H100 GPU CI workflow) or its successor.
  • On healthy runners, H100 training and inference consistently complete in about 20 minutes.
