feat(ci): migrate from Azure to AWS ephemeral runners #15620
Signed-off-by: Oliver Koenig <okoenig@nvidia.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
Replace nemoci.azurecr.io (Azure ACR) with 766267172432.dkr.ecr.us-east-1.amazonaws.com/nemo-speech (AWS ECR) across all CI workflows. Rebuild _build_container.yml as an inline docker/build-push-action job so ECR registry access works on AWS runners. Image tags embed the image-name prefix (e.g. nemo_container-<run_id>) since all images share one ECR repository. Propagate image URLs through workflow outputs so downstream jobs reference the correct ECR tag. Update test-template action to accept a full image URL. Add root checkout to each job using the local action so ./.github/actions/ resolves correctly. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
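The tagging scheme in the commit above can be sketched as follows. This is only an illustration of how an image URL might be assembled when all images share one ECR repository; the helper name `build_image_url` and the example run ID are hypothetical and do not come from the workflows themselves.

```python
def build_image_url(registry: str, image_name: str, run_id: int,
                    repository: str = "nemo-speech") -> str:
    """Return `<registry>/<repository>:<image_name>-<run_id>`.

    Hypothetical helper: because every image lives in the same repository,
    the image name moves into the tag to keep tags from colliding.
    """
    registry = registry.rstrip("/")  # tolerate a trailing slash in the variable
    return f"{registry}/{repository}:{image_name}-{run_id}"


# Example with the registry named in the commit message and a made-up run ID.
url = build_image_url(
    "766267172432.dkr.ecr.us-east-1.amazonaws.com", "nemo_container", 123456
)
print(url)
```

Passing the full URL downstream (rather than only the run ID) is what lets the `test-template` action stay unaware of the registry layout.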
9d11be8 to 7e9c29f
Replace ubuntu-latest with inputs.runner in all CPU-only matrix entries in cicd-main-unit-tests.yml and cicd-main-speech.yml. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
Add test-data-path input to test-template action and wire vars.DEFAULT_TEST_DATA_PATH through all callers. Defaults to /mnt/datadrive/TestData when the variable is unset. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
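The fallback described in this commit amounts to "use the configured value unless it is empty". A minimal sketch, assuming the unset variable arrives as an empty string (which is how GitHub renders an unset configuration variable); the function name is hypothetical:

```python
from typing import Optional

# Default named in the commit message above.
FALLBACK_TEST_DATA_PATH = "/mnt/datadrive/TestData"


def resolve_test_data_path(configured: Optional[str]) -> str:
    """Return the configured path, or the fallback when it is unset or empty."""
    # Treat both None and "" as "unset".
    return configured or FALLBACK_TEST_DATA_PATH


print(resolve_test_data_path(""))
print(resolve_test_data_path("/data/TestData"))
```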
Replace Bors-style `nv-gha-runners/get-pr-info` action with direct `github.event.pull_request.*` context, following the pattern from Megatron-Bridge/pull/3370 adapted for NeMo's standard PR events:
- cicd-main.yml: remove `get-pr-info` step; use `github.event.pull_request.user.login` for SSO username lookup; bump checkout action from v4 to v6
- _build_container.yml: remove Bors-style `get-pr-info` step (condition `startsWith(refs/heads/pull-request/)` never fires for pull_request events); use `github.event.pull_request.number` for cache keys
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
/ok to test
[🤖]: Hi @ko3n1g 👋, We wanted to let you know that a CICD pipeline for this PR just finished successfully. So it might be time to merge this PR or get some approvals.
…re-flight
- Remove bespoke `is-not-external-contributor` job and its local `check-nvidia-sso-membership` action
- Wire `NVIDIA-NeMo/FW-CI-templates/_cicd_preflight.yml@v0.89.0` as the new `pre-flight` job (SSO check + runner/registry selection via `DEFAULT_RUNNER_PREFIX`, `NON_NVIDIA_RUNNER_PREFIX`, `DEFAULT_CONTAINER_REGISTRY`, `NON_NVIDIA_CONTAINER_REGISTRY` vars)
- Rename old NeMo-specific pre-flight → `configure` (needs: pre-flight); keeps test_to_run, components_to_run, and label outputs
- Downstream jobs: runner from `pre-flight.outputs.runner_prefix`, NeMo-specific outputs from `configure.outputs.*`
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
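The selection the new `pre-flight` job performs can be pictured roughly as below. This is a sketch of the routing as this PR describes it (NVIDIA members get the default runner and registry, external contributors get the `NON_NVIDIA_*` ones), not the actual FW-CI-templates implementation; `PreflightOutputs`, `select_runner_and_registry`, and the placeholder registry value are hypothetical.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class PreflightOutputs:
    runner_prefix: str
    registry: str


def select_runner_and_registry(is_nvidia_member: bool,
                               repo_vars: Mapping[str, str]) -> PreflightOutputs:
    """Pick runner prefix and registry from repo vars based on SSO membership."""
    if is_nvidia_member:
        return PreflightOutputs(repo_vars["DEFAULT_RUNNER_PREFIX"],
                                repo_vars["DEFAULT_CONTAINER_REGISTRY"])
    return PreflightOutputs(repo_vars["NON_NVIDIA_RUNNER_PREFIX"],
                            repo_vars["NON_NVIDIA_CONTAINER_REGISTRY"])


# Example values; the external registry is a placeholder, not a real value.
repo_vars = {
    "DEFAULT_RUNNER_PREFIX": "nemo-ci-aws-gpu-x2",
    "NON_NVIDIA_RUNNER_PREFIX": "nemo-ci-aws-gpu-x2-ephemeral",
    "DEFAULT_CONTAINER_REGISTRY": "766267172432.dkr.ecr.us-east-1.amazonaws.com",
    "NON_NVIDIA_CONTAINER_REGISTRY": "external-registry.example.invalid",
}
print(select_runner_and_registry(True, repo_vars))
print(select_runner_and_registry(False, repo_vars))
```

Downstream jobs would then read only the two outputs and never branch on contributor type themselves.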
/ok to test
Maybe we need to rsync whatever they think HF_HOME needs to be?
600c643 to 81557d5
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: oliver könig <okoenig@nvidia.com>
81557d5 to c95b234
chtruong814
left a comment
Thanks a lot for doing this. Please hold off on merging until you have given the Speech team a heads-up and a chance to communicate the new way to kick off CI within their team.
Claude summary
Summary
- Migrate runners from `self-hosted-azure*` to AWS runners selected by the FW-CI-templates pre-flight (NVIDIA members → `DEFAULT_RUNNER_PREFIX`, external contributors → `NON_NVIDIA_RUNNER_PREFIX`)
- Migrate the container registry from `nemoci.azurecr.io` (Azure ACR) to AWS ECR (`DEFAULT_CONTAINER_REGISTRY` / `NON_NVIDIA_CONTAINER_REGISTRY`), both configured as repo vars and routed via `pre-flight.outputs.registry`
- Replace the bespoke `is-not-external-contributor` job and local `check-nvidia-sso-membership` action with the shared `NVIDIA-NeMo/FW-CI-templates/_cicd_preflight.yml@v0.89.0` reusable workflow, the same pattern as Megatron-Bridge
- Rename the old NeMo-specific `pre-flight` → `configure` (outputs `test_to_run`, `components_to_run`, PR label flags); `configure` runs after `pre-flight` to preserve ordering
- Rebuild `_build_container.yml` as an inline `docker/build-push-action` job (removes FW-CI-templates delegation) so ECR auth works on AWS runners; `registry` is now forwarded from `pre-flight.outputs.registry`
- Embed the image-name prefix in image tags, since all images share one ECR repository (`nemo-speech:<image-name>-<run_id>`)
- Update the `test-template` action to accept a full image URL instead of constructing `nemoci.azurecr.io/<name>:<run_id>` internally
- Switch the `cicd-main.yml` trigger from `pull_request: types: [labeled]` to `push: branches: pull-request/[0-9]+` (copy-pr-bot pattern, onboarded in ci: onboard copy-pr-bot #15631); derives PR author from branch name; removes `cicd-relabel-bot.yml` and label-based pre-flight logic

Job graph
Repo vars required
| Variable | Value |
| --- | --- |
| `DEFAULT_RUNNER_PREFIX` | `nemo-ci-aws-gpu-x2` |
| `NON_NVIDIA_RUNNER_PREFIX` | `nemo-ci-aws-gpu-x2-ephemeral` |
| `DEFAULT_CONTAINER_REGISTRY` | `766267172432.dkr.ecr.us-east-1.amazonaws.com` |
| `NON_NVIDIA_CONTAINER_REGISTRY` | |
| `SSO_USERS_FILENAME` | `users_sso.json` |

Test plan
- `pre-flight` (FW-CI-templates) passes for internal contributor PRs and selects the correct runner prefix
- `configure` emits correct `test_to_run` / `components_to_run` outputs
- `cicd-test-container-build` builds and pushes to ECR successfully
- `cicd-import-tests` pulls the correct ECR image
- `cicd-main-unit-tests` run with the ECR image
- `cicd-main-speech` build and run the speech ECR image
- CI triggers on pushes to a `pull-request/*` branch (not on label)
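With the trigger keyed on `pull-request/[0-9]+` branches, the PR number has to be recovered from the pushed ref. A minimal sketch of that step, assuming the ref arrives in the usual `refs/heads/...` form; the function name is hypothetical, and looking up the PR author from the number would need a separate API call that is not shown here.

```python
import re
from typing import Optional

# Matches only refs of the form refs/heads/pull-request/<digits>.
PR_BRANCH_RE = re.compile(r"refs/heads/pull-request/([0-9]+)")


def pr_number_from_ref(ref: str) -> Optional[int]:
    """Return the PR number encoded in a copy-pr-bot style branch ref, else None."""
    match = PR_BRANCH_RE.fullmatch(ref)
    return int(match.group(1)) if match else None


print(pr_number_from_ref("refs/heads/pull-request/15620"))
print(pr_number_from_ref("refs/heads/main"))
```

Using `fullmatch` rather than a prefix check keeps refs such as `pull-request/123-wip` from being treated as CI branches.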