H100 Kind CI unstable on GPU runners #696

## Summary

AICR H100 GitHub Actions runners have shown long-standing performance variance and instability. Recent slow, canceled, or failed jobs strongly correlate with managed-runner `NodeUID: 507dc070-3563-4487-8a0c-d411242a11a1`.

## Description

AICR H100 Kind CI has had intermittent slow-runner and instability issues for a while. The current investigation focused on recent failures in the `linux-amd64-gpu-h100-latest-2` runner pool, where the variance became severe enough that the H100 PR jobs did not pass for several days.

On slow/unhealthy runners, jobs run for more than 2 hours and fail with Kubernetes-level symptoms: API server timeouts, control-plane instability, readiness/liveness probe timeouts, runtime components not becoming available, and cleanup/debug steps timing out.

On healthy runners, the same PR workflows complete in about 20 minutes.

This strongly suggests the root issue is runner performance/health rather than AICR code or CI logic. AICR can mitigate some CI behavior, but when the runner is overloaded the Kind control plane becomes unreliable and causes cascading failures.

## Evidence

The strongest recent signal is correlation with the managed-runner `NodeUID` printed in the NVIDIA runner bootstrap logs. Multiple recent slow, canceled, or failed jobs landed on the same NodeUID:

- `507dc070-3563-4487-8a0c-d411242a11a1`

Those jobs used different ephemeral runner names and even different H100 runner paths:

- `24ba-l-amd-g-h100-l-2-f45nb-runner-pdpwm`
- `24ba-l-amd-g-h100-l-2-f45nb-runner-rdn4z`
- `609e-l-amd-g-h100-l-1-9rdw2-runner-cb5rw`

Successful fast runs landed on a different NodeUID:

- `3510eba6-bc52-4e81-9bb3-4f7ab58ebc51`

This points to a broader H100 runner performance/reliability issue, with `507dc070-3563-4487-8a0c-d411242a11a1` as a recent concrete host-level example that appears unhealthy or overloaded.

### Primary Recent Failure

Most recent failed/canceled H100 x2 conformance test, on PR #694:

- Job: https://github.com/NVIDIA/aicr/actions/runs/25013461746/job/73255076257?pr=694
- Workflow: `GPU Training Test (nvkind + H100 x2)`
- Runner: `24ba-l-amd-g-h100-l-2-f45nb-runner-pdpwm`
- Runner label: `linux-amd64-gpu-h100-latest-2`
- NodeUID / host id: `507dc070-3563-4487-8a0c-d411242a11a1`
- Result: failed/canceled around the 2-hour job limit
- Root symptom: Chainsaw was stuck waiting for `kai-scheduler/admission` to become ready
- Interpretation: the 2-hour timeout only cut off the run; the real issue was the runtime component not becoming healthy on that slow host

### Additional Failed/Slow Evidence

Failed inference example with high load diagnostics:

- Job: https://github.com/NVIDIA/aicr/actions/runs/24974424715/job/73135328977?pr=687
- Workflow: `GPU Inference Test (nvkind + H100 x2)`
- Runner: `24ba-l-amd-g-h100-l-2-f45nb-runner-rdn4z`
- Runner label: `linux-amd64-gpu-h100-latest-2`
- NodeUID / host id: `507dc070-3563-4487-8a0c-d411242a11a1`
- Runner exposed roughly `41` CPUs and `241GiB` memory
- Diagnostics showed severe host pressure:
  - Load average around `75` in one diagnostic snapshot
  - Load averages near failure/debug time: `129.20`, `127.54`, `112.75` (about three times the roughly `41` CPUs exposed)
- Kubelet and containerd were each consuming more than one full CPU
- Symptoms included:
  - API server timeouts
  - control-plane instability
  - repeated readiness/liveness probe timeouts across unrelated pods
  - runtime components not becoming Available
  - cleanup/debug steps timing out
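
To put the load figures above in proportion, the following minimal sketch divides each reported load average by the reported CPU count. The function name `load_per_cpu` and the snapshot labels are illustrative, and it assumes the first near-failure figure (`129.20`) is a load average comparable to the earlier `75` snapshot.

```python
def load_per_cpu(load_average: float, cpu_count: int) -> float:
    """Return the load average normalized by the number of CPUs."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load_average / cpu_count


CPU_COUNT = 41  # "roughly 41 CPUs" from the runner diagnostics

# Labels are illustrative; the values are the ones reported in this issue.
SNAPSHOTS = {
    "one diagnostic snapshot": 75.0,
    "near failure/debug time": 129.20,
}

for label, load in SNAPSHOTS.items():
    print(f"{label}: {load_per_cpu(load, CPU_COUNT):.2f} load per CPU")
```

A value well above 1.0 per CPU means more tasks are waiting than the host can run at once, which is consistent with the probe and API server timeouts listed above.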

Earlier canceled inference example:

- Job: https://github.com/NVIDIA/aicr/actions/runs/25010653926/job/73245252932
- Workflow: `GPU Inference Test (nvkind + H100 x1)`
- Runner: `609e-l-amd-g-h100-l-1-9rdw2-runner-cb5rw`
- Runner path: H100 x1
- NodeUID / host id: `507dc070-3563-4487-8a0c-d411242a11a1`
- This is supporting evidence that the same bad NodeUID appears across both H100 x1 and x2 runner paths.

### Successful Fast Runs

Successful inference example:

- Job: https://github.com/NVIDIA/aicr/actions/runs/25002493314/job/73216227822
- Workflow: `GPU Inference Test (nvkind + H100 x2)`
- Duration: `20m46s`
- Runner: `24ba-l-amd-g-h100-l-2-f45nb-runner-mtscq`
- NodeUID / host id: `3510eba6-bc52-4e81-9bb3-4f7ab58ebc51`
- The same workflow completed quickly, and the Kubernetes/control-plane symptoms disappeared.

Successful training example:

- Job: https://github.com/NVIDIA/aicr/actions/runs/25002493348/job/73216217936
- Workflow: `GPU Training Test (nvkind + H100 x2)`
- Duration: `20m58s`
- Runner: `24ba-l-amd-g-h100-l-2-f45nb-runner-bnk5r`
- NodeUID / host id: `3510eba6-bc52-4e81-9bb3-4f7ab58ebc51`
- The same workflow completed quickly on the same healthy NodeUID as the successful inference run.
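
The correlation in the evidence above can be tabulated with a small script. This is only a sketch over the five jobs cited in this issue; the constant names `SLOW_NODE` and `FAST_NODE`, the outcome strings, and the helper `outcomes_by_node` are illustrative and not part of any runner tooling.

```python
from collections import defaultdict

SLOW_NODE = "507dc070-3563-4487-8a0c-d411242a11a1"
FAST_NODE = "3510eba6-bc52-4e81-9bb3-4f7ab58ebc51"

# (runner name, NodeUID, outcome) for the five jobs cited in this issue.
JOBS = [
    ("24ba-l-amd-g-h100-l-2-f45nb-runner-pdpwm", SLOW_NODE, "failed/canceled"),
    ("24ba-l-amd-g-h100-l-2-f45nb-runner-rdn4z", SLOW_NODE, "failed"),
    ("609e-l-amd-g-h100-l-1-9rdw2-runner-cb5rw", SLOW_NODE, "canceled"),
    ("24ba-l-amd-g-h100-l-2-f45nb-runner-mtscq", FAST_NODE, "success"),
    ("24ba-l-amd-g-h100-l-2-f45nb-runner-bnk5r", FAST_NODE, "success"),
]


def outcomes_by_node(jobs):
    """Group job outcomes by NodeUID, ignoring the ephemeral runner name."""
    grouped = defaultdict(list)
    for _runner, node_uid, outcome in jobs:
        grouped[node_uid].append(outcome)
    return dict(grouped)


for node_uid, outcomes in outcomes_by_node(JOBS).items():
    print(node_uid, outcomes)
```

Grouping by NodeUID rather than runner name is the point: the runner names differ on every job, while the outcomes split cleanly by host.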

## Conclusion

AICR H100 Kind CI has seen slow-runner and instability issues over time. The recent failures strongly correlate with the underlying managed-runner NodeUID:

- `507dc070-3563-4487-8a0c-d411242a11a1`

This correlation is independent of ephemeral runner name and appears across both x1 and x2 H100 runner paths. The observed Kubernetes failures look different at the component level, but they share the same underlying signal: overloaded or unhealthy runner/node behavior.

This points to a runner infrastructure issue rather than an AICR code or CI logic issue.

## Follow-Up

1. Follow up with IPP to improve H100 runner performance and reliability, especially avoiding overloaded/unhealthy hosts.
   - Slack thread: https://nvidia.slack.com/archives/C0A790E5MBP/p1777313281278739?thread_ts=1776424399.083159&cid=C0A790E5MBP
   - Key questions:
     - Why are jobs scheduled on `NodeUID: 507dc070-3563-4487-8a0c-d411242a11a1` much slower?
     - Is this host unhealthy, overloaded, or oversubscribed?
     - Are multiple runner VMs being scheduled on the same physical H100 host?
     - Is there CPU steal/pressure, IO pressure, memory pressure, or kubelet/containerd contention?
     - Can this host be quarantined if unhealthy?
     - Can IPP review whether this pattern exists on other H100 runner hosts as well?
     - Can runner scheduling or startup health checks include load average, CPU pressure/steal, IO pressure, memory pressure, and concurrent runner count per host?
     - Can runner health metadata, including NodeUID and pressure signals, be exposed consistently in job logs so CI failures can be correlated to host health quickly?

2. Improve the current AICR H100 CI flow so it is simpler, cheaper, and easier to diagnose.
   - Latest test PR: https://github.com/NVIDIA/aicr/pull/694
   - Keep only lightweight health checks and useful failure diagnostics.
   - Avoid heavy CI workarounds that try to compensate for bad infrastructure.
   - Keep CI-only behavior scoped to Kind and avoid unnecessary changes to end-user deployment paths.
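
As one possible shape for the lightweight health check mentioned above, the sketch below parses the Linux `/proc/loadavg` and `/proc/pressure/cpu` (PSI) formats and flags an overloaded host. It is not the check used in #694: the thresholds, function names, and the sample PSI line are assumptions for illustration, and only the three load averages in the sample come from this issue.

```python
import os
from pathlib import Path

# Hypothetical thresholds; tune them against known-healthy runners.
MAX_LOAD_PER_CPU = 1.5
MAX_CPU_PSI_AVG10 = 25.0


def parse_loadavg(text: str) -> float:
    """Return the 1-minute load average from /proc/loadavg content."""
    return float(text.split()[0])


def parse_psi_some_avg10(text: str) -> float:
    """Return the 'some avg10' percentage from a /proc/pressure/* file."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == "some":
            for field in fields[1:]:
                key, _, value = field.partition("=")
                if key == "avg10":
                    return float(value)
    raise ValueError("no 'some avg10' field found")


def host_problems(load1: float, cpu_count: int, cpu_psi_avg10: float) -> list[str]:
    """Return reasons the host looks overloaded (empty list if none)."""
    problems = []
    if load1 / cpu_count > MAX_LOAD_PER_CPU:
        problems.append(f"load per CPU {load1 / cpu_count:.2f} > {MAX_LOAD_PER_CPU}")
    if cpu_psi_avg10 > MAX_CPU_PSI_AVG10:
        problems.append(f"CPU pressure avg10 {cpu_psi_avg10:.2f}% > {MAX_CPU_PSI_AVG10}%")
    return problems


def check_this_host() -> list[str]:
    """Read live signals on a Linux runner (PSI requires a kernel that exposes it)."""
    load1 = parse_loadavg(Path("/proc/loadavg").read_text())
    psi = parse_psi_some_avg10(Path("/proc/pressure/cpu").read_text())
    return host_problems(load1, os.cpu_count() or 1, psi)


# Demonstration with sample file contents instead of a live host.
# The load averages are from this issue; every other value is made up.
sample_loadavg = "129.20 127.54 112.75 97/4213 123456"
sample_psi = "some avg10=61.20 avg60=58.03 avg300=40.11 total=123456789"
print(host_problems(parse_loadavg(sample_loadavg), 41, parse_psi_some_avg10(sample_psi)))
```

A CI step could call `check_this_host()` before creating the Kind cluster and log the result next to the NodeUID, so a slow run can be attributed to host pressure without rerunning it. Whether to fail fast or only warn is a policy choice left open here.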

## Success Criteria

- IPP confirms and mitigates the overloaded runner condition.
- H100 Kind CI avoids unhealthy hosts or gets reliable runner-health signaling.
- AICR keeps the CI workflow clean and maintainable through #694 or its successor.
- On healthy runners, H100 training and inference consistently complete in about 20 minutes.

