feat(validation): add performance-phase constraints to OKE overlays once OCI testbed lands #1007

@yuanchen8911

Summary

Add performance-phase constraints to the OCI / OKE GB200 overlays once an OCI testbed is available to produce empirically-grounded thresholds. This is the largest testbed-blocked cohort.

Affected overlays

| Overlay | Required performance constraint kind |
| --- | --- |
| gb200-oke-training | NCCL all-reduce-bw (nccl-all-reduce-bw-net + nccl-all-reduce-bw-nvls, mirroring gb200-eks-training) |
| gb200-oke-ubuntu-training | same as gb200-oke-training |
| gb200-oke-ubuntu-training-kubeflow | same as gb200-oke-training |
| gb200-oke-ubuntu-inference-dynamo | inference-perf (throughput + TTFT p99, mirroring h100-eks-ubuntu-inference-dynamo) |

These are all the OKE entries flagged by the strict-mode floor today (AICR_VALIDATION_FLOOR_STRICT=1). The deployment and conformance phases are inherited from oke.yaml via PR #1001; only the performance phase is still missing.

Blocker

No OCI testbed available today for:

  • NCCL bandwidth measurements on GB200 / OKE bare-metal shapes (BM.GPU.B200.8 or equivalent NVL72 IMEX domains)
  • inference-perf runs on the same hardware

The GB200 EKS reference (recipes/overlays/gb200-eks-training.yaml:90-126) splits NCCL into -net (EFA) and -nvls (MNNVL across the NVL72 IMEX domain) channels. OCI's network stack is different, so the EKS thresholds (>= 40 GB/s NET, >= 500 GB/s NVLS) are not portable; OCI needs its own empirically-grounded numbers.
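One possible way to turn testbed baselines into thresholds is sketched below. It is not an established procedure for this repository. The helper names, the safety margins, and the idea of anchoring on the worst observed run are all assumptions made for illustration.

```python
import math


def floor_from_samples(samples, margin=0.9):
    """Lower-bound threshold (for ">=" constraints such as bandwidth).

    Assumption: take the slowest observed baseline run and scale it down
    by a safety margin so normal run-to-run noise does not fail the check.
    """
    if not samples:
        raise ValueError("need at least one baseline measurement")
    return math.floor(min(samples) * margin)


def ceiling_from_samples(samples, margin=1.1):
    """Upper-bound threshold (for "<=" constraints such as TTFT p99).

    Assumption: take the slowest (largest) observed latency and scale it
    up by a safety margin.
    """
    if not samples:
        raise ValueError("need at least one baseline measurement")
    return math.ceil(max(samples) * margin)
```

The inputs would be repeated measurements from the OCI testbed once it exists; no OCI numbers are available yet, so none are shown here.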

Design notes

  • NCCL training overlays should follow the GB200 / EKS multi-channel pattern (one constraint per transport) rather than a single nccl-all-reduce-bw constraint. Each transport then has to exercise its actual interconnect, so a silent fallback to NET cannot masquerade as a pass.
  • Dynamo inference uses the same inference-perf check as h100-eks-ubuntu-inference-dynamo; thresholds can start at the H100 placeholder floors (inference-throughput >= 5000, inference-ttft-p99 <= 200), on the assumption that GB200 is at least as fast as H100, and then be tightened once testbed numbers exist.
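The reason for one constraint per transport can be illustrated with a minimal sketch. The `Constraint` type and `evaluate` function are hypothetical and are not the validator's real API. The thresholds are the EKS values quoted above, used only as stand-ins because OCI values do not exist yet, and the measured numbers describe an invented fallback scenario rather than real results.

```python
import operator
from dataclasses import dataclass

# Comparison operators used by the constraints in this issue.
OPS = {">=": operator.ge, "<=": operator.le}


@dataclass(frozen=True)
class Constraint:
    name: str
    op: str
    threshold: float


def evaluate(constraints, measured):
    """Return {constraint name: passed}. A missing measurement counts as a failure."""
    results = {}
    for c in constraints:
        value = measured.get(c.name)
        results[c.name] = value is not None and OPS[c.op](value, c.threshold)
    return results


# Per-transport constraints (EKS thresholds as stand-ins, in GB/s).
per_transport = [
    Constraint("nccl-all-reduce-bw-net", ">=", 40.0),
    Constraint("nccl-all-reduce-bw-nvls", ">=", 500.0),
]

# Invented scenario: the NVLS path silently fell back to NET-level bandwidth.
fallback_run = {"nccl-all-reduce-bw-net": 45.0, "nccl-all-reduce-bw-nvls": 45.0}
results = evaluate(per_transport, fallback_run)

# A single combined constraint with a NET-level floor (hypothetical) would pass
# the same run, hiding the fallback.
single = evaluate(
    [Constraint("nccl-all-reduce-bw", ">=", 40.0)],
    {"nccl-all-reduce-bw": 45.0},
)
```

In this sketch the NVLS constraint fails while the single combined constraint passes, which is the masking effect the per-transport pattern is meant to prevent.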

Done when

  • OCI GB200 testbed produces baseline NCCL bandwidth numbers (NET + NVLS) and inference-perf numbers (throughput tok/s, TTFT p99 ms).
  • All 4 overlays gain a performance.checks block with the appropriate constraint set.
  • The overlays disappear from the AICR_VALIDATION_FLOOR_STRICT=1 floor test output.
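The completion criteria above can be expressed as a small coverage check. This sketch is not the actual floor test behind AICR_VALIDATION_FLOOR_STRICT=1. The `REQUIRED` mapping and the `missing_constraints` helper are hypothetical, and the input shape (overlay name mapped to the constraint names found in its performance checks) is an assumption. The constraint names come from this issue.

```python
NCCL_PER_TRANSPORT = {"nccl-all-reduce-bw-net", "nccl-all-reduce-bw-nvls"}
INFERENCE_PERF = {"inference-throughput", "inference-ttft-p99"}

# Required performance constraints per OKE overlay, as listed in this issue.
REQUIRED = {
    "gb200-oke-training": NCCL_PER_TRANSPORT,
    "gb200-oke-ubuntu-training": NCCL_PER_TRANSPORT,
    "gb200-oke-ubuntu-training-kubeflow": NCCL_PER_TRANSPORT,
    "gb200-oke-ubuntu-inference-dynamo": INFERENCE_PERF,
}


def missing_constraints(overlays):
    """Map each overlay to the sorted list of required constraints it lacks.

    `overlays` maps overlay name -> iterable of constraint names present in
    its performance checks. Overlays with full coverage are omitted.
    """
    gaps = {}
    for name, required in REQUIRED.items():
        present = set(overlays.get(name, ()))
        missing = required - present
        if missing:
            gaps[name] = sorted(missing)
    return gaps
```

An empty result from `missing_constraints` would correspond to all four overlays dropping out of the strict-mode floor output.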
