feat(ci): migrate from Azure to AWS ephemeral runners #15620

Merged
ko3n1g merged 20 commits into main from ko3n1g/ci/aws-ephemeral-runners
Apr 28, 2026


@ko3n1g ko3n1g commented Apr 17, 2026

Claude summary

Summary

  • Migrates all runners from self-hosted-azure* to AWS runners selected by the FW-CI-templates pre-flight (NVIDIA members → DEFAULT_RUNNER_PREFIX, external contributors → NON_NVIDIA_RUNNER_PREFIX)
  • Switches container registry from nemoci.azurecr.io (Azure ACR) to AWS ECR (DEFAULT_CONTAINER_REGISTRY / NON_NVIDIA_CONTAINER_REGISTRY), both configured as repo vars and routed via pre-flight.outputs.registry
  • Replaces the bespoke is-not-external-contributor job and local check-nvidia-sso-membership action with the shared NVIDIA-NeMo/FW-CI-templates/_cicd_preflight.yml@v0.89.0 reusable workflow — same pattern as Megatron-Bridge
  • Renames the old NeMo-specific pre-flight → configure (outputs test_to_run, components_to_run, PR label flags); configure runs after pre-flight to preserve ordering
  • Rebuilds _build_container.yml as an inline docker/build-push-action job (removes FW-CI-templates delegation) so ECR auth works on AWS runners; registry is now forwarded from pre-flight.outputs.registry
  • Propagates built image URLs through workflow outputs so all downstream jobs reference the correct ECR image tag (format: nemo-speech:<image-name>-<run_id>)
  • Updates test-template action to accept a full image URL instead of constructing nemoci.azurecr.io/<name>:<run_id> internally
  • Switches cicd-main.yml from pull_request: types: [labeled] to push: branches: pull-request/[0-9]+ (copy-pr-bot pattern, onboarded in ci: onboard copy-pr-bot #15631); derives PR author from branch name; removes cicd-relabel-bot.yml and label-based pre-flight logic
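
With the push trigger on pull-request/[0-9]+, the PR number is only available through the branch name. A minimal Python sketch of that parsing step, for illustration only: the real workflow is YAML, `pr_number_from_ref` is a hypothetical name, and resolving the author from the number is assumed to need a separate GitHub API lookup that is not shown here.

```python
import re
from typing import Optional

# Mirrors the push trigger pattern from the summary: pull-request/[0-9]+
_PR_BRANCH = re.compile(r"^refs/heads/pull-request/(\d+)$")


def pr_number_from_ref(ref: str) -> Optional[int]:
    """Return the PR number encoded in a copy-pr-bot branch ref, or None."""
    match = _PR_BRANCH.match(ref)
    return int(match.group(1)) if match else None


print(pr_number_from_ref("refs/heads/pull-request/15620"))  # 15620
print(pr_number_from_ref("refs/heads/main"))                # None
```

Anchoring the pattern at both ends keeps refs such as pull-request/15620-backup from being mistaken for a PR branch.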

Job graph

pre-flight  (FW-CI-templates: SSO → runner_prefix, registry)
configure   (NeMo-specific: test_to_run, components_to_run, labels)
  └── code-linting
  └── cicd-wait-in-queue
       └── cicd-test-container-build  (runner + registry from pre-flight)
            ├── cicd-import-tests
            ├── L0_Setup_Test_Data_And_Models
            ├── cicd-main-unit-tests
            └── cicd-main-speech
                 └── Nemo_CICD_Test
                      └── Coverage
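
One way to sanity-check the ordering in this graph is to transcribe it as a needs-map and let Python's standard `graphlib` compute a valid run order. This is a sketch only: the edges below follow the tree exactly as drawn (for example, Nemo_CICD_Test is listed under cicd-main-speech), which may not match every `needs:` entry in the actual workflow files.

```python
from graphlib import TopologicalSorter

# job -> set of jobs it needs, transcribed from the tree as drawn
NEEDS = {
    "pre-flight": set(),
    "configure": {"pre-flight"},
    "code-linting": {"configure"},
    "cicd-wait-in-queue": {"configure"},
    "cicd-test-container-build": {"cicd-wait-in-queue"},
    "cicd-import-tests": {"cicd-test-container-build"},
    "L0_Setup_Test_Data_And_Models": {"cicd-test-container-build"},
    "cicd-main-unit-tests": {"cicd-test-container-build"},
    "cicd-main-speech": {"cicd-test-container-build"},
    "Nemo_CICD_Test": {"cicd-main-speech"},
    "Coverage": {"Nemo_CICD_Test"},
}

# static_order() yields each job only after all of its prerequisites
order = list(TopologicalSorter(NEEDS).static_order())
print(order[0])  # pre-flight
```

`TopologicalSorter` raises `CycleError` if the map contains a cycle, so the same snippet doubles as a quick check that no circular dependency was introduced.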

Repo vars required

Var                            Internal                                      External
DEFAULT_RUNNER_PREFIX          nemo-ci-aws-gpu-x2
NON_NVIDIA_RUNNER_PREFIX                                                     nemo-ci-aws-gpu-x2-ephemeral
DEFAULT_CONTAINER_REGISTRY     766267172432.dkr.ecr.us-east-1.amazonaws.com
NON_NVIDIA_CONTAINER_REGISTRY                                                (ephemeral ECR or public)
SSO_USERS_FILENAME             users_sso.json
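
The routing these variables drive can be sketched in a few lines of Python. This is not the FW-CI-templates implementation, only an illustration of the selection rule described in the summary; `Routing` and `select_routing` are hypothetical names, and the external registry value is a placeholder because the PR does not state it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Routing:
    runner_prefix: str
    registry: str


def select_routing(is_nvidia_member: bool, repo_vars: dict[str, str]) -> Routing:
    """Pick runner prefix and registry based on the SSO membership result."""
    if is_nvidia_member:
        return Routing(
            repo_vars["DEFAULT_RUNNER_PREFIX"],
            repo_vars["DEFAULT_CONTAINER_REGISTRY"],
        )
    return Routing(
        repo_vars["NON_NVIDIA_RUNNER_PREFIX"],
        repo_vars["NON_NVIDIA_CONTAINER_REGISTRY"],
    )


repo_vars = {
    "DEFAULT_RUNNER_PREFIX": "nemo-ci-aws-gpu-x2",
    "NON_NVIDIA_RUNNER_PREFIX": "nemo-ci-aws-gpu-x2-ephemeral",
    "DEFAULT_CONTAINER_REGISTRY": "766267172432.dkr.ecr.us-east-1.amazonaws.com",
    # Placeholder: the actual external registry is not given in this PR
    "NON_NVIDIA_CONTAINER_REGISTRY": "external-registry.example.invalid",
}

print(select_routing(True, repo_vars).runner_prefix)   # nemo-ci-aws-gpu-x2
print(select_routing(False, repo_vars).runner_prefix)  # nemo-ci-aws-gpu-x2-ephemeral
```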

Test plan

  • Verify pre-flight (FW-CI-templates) passes for internal contributor PRs and selects the correct runner prefix
  • Verify configure emits correct test_to_run / components_to_run outputs
  • Verify cicd-test-container-build builds and pushes to ECR successfully
  • Verify cicd-import-tests pulls the correct ECR image
  • Verify unit tests in cicd-main-unit-tests run with the ECR image
  • Verify speech tests in cicd-main-speech build and run the speech ECR image
  • Verify CI triggers on push to pull-request/* branch (not on label)
  • Verify external contributor PR is routed to ephemeral runner

Signed-off-by: Oliver Koenig <okoenig@nvidia.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Replace nemoci.azurecr.io (Azure ACR) with
766267172432.dkr.ecr.us-east-1.amazonaws.com/nemo-speech (AWS ECR)
across all CI workflows. Rebuild _build_container.yml as an inline
docker/build-push-action job so ECR registry access works on AWS
runners. Image tags embed the image-name prefix (e.g.
nemo_container-<run_id>) since all images share one ECR repository.

Propagate image URLs through workflow outputs so downstream jobs
reference the correct ECR tag. Update test-template action to accept
a full image URL. Add root checkout to each job using the local
action so ./.github/actions/ resolves correctly.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
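
The tag scheme from the commit above (one shared nemo-speech repository, with the image name folded into the tag) can be illustrated with a short Python sketch. `image_url` is a hypothetical helper, and the run ID in the example is made up.

```python
def image_url(
    registry: str,
    image_name: str,
    run_id: int,
    repository: str = "nemo-speech",
) -> str:
    """Build <registry>/<repository>:<image-name>-<run_id>."""
    return f"{registry}/{repository}:{image_name}-{run_id}"


url = image_url(
    "766267172432.dkr.ecr.us-east-1.amazonaws.com",
    "nemo_container",
    123456,  # made-up run ID, for illustration only
)
print(url)
# 766267172432.dkr.ecr.us-east-1.amazonaws.com/nemo-speech:nemo_container-123456
```

Because downstream jobs now receive this full URL through workflow outputs, they no longer need to know the registry or the tag convention themselves.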
Replace ubuntu-latest with inputs.runner in all CPU-only matrix entries
in cicd-main-unit-tests.yml and cicd-main-speech.yml.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Add test-data-path input to test-template action and wire
vars.DEFAULT_TEST_DATA_PATH through all callers. Defaults to
/mnt/datadrive/TestData when the variable is unset.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
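
The fallback behaviour described in the commit above amounts to the following, shown here as a Python sketch rather than the actual action code. `resolve_test_data_path` is a hypothetical name, and the sketch assumes an unset repository variable reaches the action as an empty string.

```python
from typing import Optional

# Default stated in the commit message
DEFAULT_TEST_DATA_PATH = "/mnt/datadrive/TestData"


def resolve_test_data_path(configured: Optional[str]) -> str:
    """Use the configured path, falling back when it is unset or blank."""
    if configured is None or not configured.strip():
        return DEFAULT_TEST_DATA_PATH
    return configured


print(resolve_test_data_path(""))           # /mnt/datadrive/TestData
print(resolve_test_data_path("/data/ci"))   # /data/ci
```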
Replace Bors-style `nv-gha-runners/get-pr-info` action with direct
`github.event.pull_request.*` context, following the pattern from
Megatron-Bridge/pull/3370 adapted for NeMo's standard PR events:

- cicd-main.yml: remove `get-pr-info` step; use
  `github.event.pull_request.user.login` for SSO username lookup;
  bump checkout action from v4 to v6
- _build_container.yml: remove Bors-style `get-pr-info` step (condition
  `startsWith(refs/heads/pull-request/)` never fires for pull_request
  events); use `github.event.pull_request.number` for cache keys

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
ko3n1g and others added 2 commits April 22, 2026 15:16
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>

ko3n1g commented Apr 22, 2026

/ok to test

@github-actions
Contributor

[🤖]: Hi @ko3n1g 👋,

We wanted to let you know that a CICD pipeline for this PR just finished successfully.

So it might be time to merge this PR or get some approvals.

@ko3n1g ko3n1g marked this pull request as ready for review April 22, 2026 21:01
@ko3n1g ko3n1g marked this pull request as draft April 22, 2026 21:02
…re-flight

- Remove bespoke `is-not-external-contributor` job and its local
  `check-nvidia-sso-membership` action
- Wire `NVIDIA-NeMo/FW-CI-templates/_cicd_preflight.yml@v0.89.0` as
  the new `pre-flight` job (SSO check + runner/registry selection via
  `DEFAULT_RUNNER_PREFIX`, `NON_NVIDIA_RUNNER_PREFIX`,
  `DEFAULT_CONTAINER_REGISTRY`, `NON_NVIDIA_CONTAINER_REGISTRY` vars)
- Rename old NeMo-specific pre-flight → `configure` (needs: pre-flight);
  keeps test_to_run, components_to_run, and label outputs
- Downstream jobs: runner from `pre-flight.outputs.runner_prefix`,
  NeMo-specific outputs from `configure.outputs.*`

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>

ko3n1g commented Apr 22, 2026

/ok to test

@ko3n1g ko3n1g marked this pull request as ready for review April 22, 2026 21:16
@ko3n1g ko3n1g requested a review from chtruong814 April 22, 2026 21:16
@ko3n1g ko3n1g enabled auto-merge (squash) April 22, 2026 21:23
@chtruong814
Collaborator

Maybe we need to rsync whatever they think HF HOME needs to be?

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Signed-off-by: oliver könig <okoenig@nvidia.com>
@ko3n1g ko3n1g force-pushed the ko3n1g/ci/aws-ephemeral-runners branch from 81557d5 to c95b234 Compare April 28, 2026 08:14
Collaborator

@chtruong814 chtruong814 left a comment

Thanks a lot for doing this. Please hold off on merging until you give the Speech team a heads-up and a chance to communicate the new way to kick off CI within their team.

@ko3n1g ko3n1g merged commit 52fee26 into main Apr 28, 2026
233 of 237 checks passed
@ko3n1g ko3n1g deleted the ko3n1g/ci/aws-ephemeral-runners branch April 28, 2026 14:38