fix(aws): infer AMI architecture from instance type for arm64 support#669
Conversation
When image.architecture is not explicitly set, holodeck defaults to x86_64 regardless of the instance type. This causes EC2 RunInstances to fail with "Unsupported configuration" when an arm64-only instance type (e.g., g5g, m7g, c7g) is paired with an x86_64 AMI. Add inferArchFromInstanceType() that queries DescribeInstanceTypes to determine the instance type's supported architectures. When the type only supports arm64, architecture is automatically set to "arm64" before AMI resolution. This is wired into all three resolution paths: resolveOSToAMI (single-node), setLegacyAMI (legacy default), and createInstances (cluster mode). Backward compatible: dual-arch or x86_64-only instance types still default to x86_64. Explicit image.architecture always takes precedence. Ref: https://github.com/NVIDIA/gpu-driver-container/actions/runs/22012665274/job/63611032634 Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Pull Request Test Coverage Report for Build 22025054055Details
💛 - Coveralls |
There was a problem hiding this comment.
Pull request overview
This PR improves Holodeck’s AWS AMI selection by automatically inferring CPU architecture from the chosen EC2 instance type when image.architecture is not explicitly set, enabling arm64-only instance types (e.g., g5g.*) to resolve arm64 AMIs instead of incorrectly defaulting to x86_64.
Changes:
- Add
inferArchFromInstanceType()(backed byDescribeInstanceTypes) to inferarm64for arm64-only instance types, otherwise defaulting tox86_64. - Wire architecture inference into AMI resolution for single-node (
resolveOSToAMI), legacy default AMI selection (setLegacyAMI), and cluster mode (createInstances). - Add unit and end-to-end-ish tests covering inference behavior and
resolveOSToAMIintegration.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| pkg/provider/aws/image.go | Adds instance-type-based architecture inference and uses it in OS + legacy AMI resolution paths. |
| pkg/provider/aws/cluster.go | Applies architecture inference during cluster nodepool AMI resolution when image.architecture is unset. |
| pkg/provider/aws/image_test.go | Adds test coverage for inference logic and for resolveOSToAMI using inferred arm64. |
| for _, a := range archs { | ||
| switch a { | ||
| case "x86_64": | ||
| hasX86 = true | ||
| case "arm64": | ||
| hasArm = true | ||
| } | ||
| } |
There was a problem hiding this comment.
inferArchFromInstanceType only treats exact "x86_64" and "arm64" as signals. getInstanceTypeArch returns raw EC2 ArchitectureType strings, which can include "x86_64_mac" and "arm64_mac"; those currently fall through and would be inferred as "x86_64" even if the instance type is ARM-only. Consider mapping the *_mac variants to their base arch (or using a small normalization helper) before setting hasX86/hasArm.
pkg/provider/aws/image.go
Outdated
| if inferred, err := p.inferArchFromInstanceType(p.Spec.Type); err == nil { | ||
| arch = inferred | ||
| } else { | ||
| arch = "x86_64" // Default if inference fails | ||
| } |
There was a problem hiding this comment.
When image.architecture is unset, inference errors are silently swallowed and the code falls back to x86_64. This can mask IAM/EC2 API failures and reintroduce the same cryptic RunInstances mismatch error for arm64-only types if inference fails at runtime. Consider surfacing the inference failure (e.g., return a wrapped error that tells users to set image.architecture explicitly), or at minimum emitting a warning when falling back.
| if inferred, err := p.inferArchFromInstanceType(p.Spec.Type); err == nil { | |
| arch = inferred | |
| } else { | |
| arch = "x86_64" // Default if inference fails | |
| } | |
| inferred, err := p.inferArchFromInstanceType(p.Spec.Type) | |
| if err != nil { | |
| return fmt.Errorf( | |
| "failed to infer image architecture from instance type %s: %w; set spec.image.architecture to override", | |
| p.Spec.Type, | |
| err, | |
| ) | |
| } | |
| arch = inferred |
| } else { | ||
| // Infer architecture from instance type (e.g., arm64 for g5g/m7g/c7g) | ||
| if inferred, err := p.inferArchFromInstanceType(instanceType); err == nil { | ||
| arch = inferred | ||
| } | ||
| // If inference fails, leave empty for resolveImageForNode to default | ||
| } |
There was a problem hiding this comment.
createInstances ignores errors from inferArchFromInstanceType and proceeds with an empty arch, relying on downstream defaults. If inference fails due to an AWS API/IAM issue, users won't know why arm64 inference didn't happen and may still hit a confusing instance/AMI mismatch at RunInstances. Consider returning/logging the inference error (or explicitly defaulting to x86_64 with a warning) when image.architecture isn't set.
| name: "dual-arch instance type defaults to x86_64", | ||
| instanceType: "m6i.large", | ||
| setupMock: func(ec2Mock *MockEC2Client) { | ||
| ec2Mock.DescribeInstTypesFunc = func(ctx context.Context, | ||
| params *ec2.DescribeInstanceTypesInput, | ||
| optFns ...func(*ec2.Options)) (*ec2.DescribeInstanceTypesOutput, error) { | ||
| return &ec2.DescribeInstanceTypesOutput{ | ||
| InstanceTypes: []types.InstanceTypeInfo{ | ||
| { | ||
| InstanceType: "m6i.large", | ||
| ProcessorInfo: &types.ProcessorInfo{ | ||
| SupportedArchitectures: []types.ArchitectureType{ | ||
| types.ArchitectureTypeX8664, | ||
| types.ArchitectureTypeArm64, | ||
| }, | ||
| }, | ||
| }, | ||
| }, | ||
| }, nil | ||
| } | ||
| }, | ||
| wantArch: "x86_64", | ||
| wantErr: false, |
There was a problem hiding this comment.
The test case labeled "dual-arch instance type" uses m6i.large, which is an x86_64-only family in AWS. Using a clearly synthetic name (or a real instance type that actually reports multiple architectures) would make the intent of the test less confusing for future maintainers.
- Normalize *_mac architecture variants (x86_64_mac, arm64_mac) using strings.HasPrefix instead of exact match, so Mac instance types like mac2-m2.metal are correctly classified - Surface inference errors in resolveOSToAMI, setLegacyAMI, and createInstances instead of silently falling back to x86_64, which would mask IAM/API failures and reproduce the original mismatch - Use synthetic instance type name in dual-arch test case to avoid confusion with real AWS instance families - Add test coverage for arm64_mac and x86_64_mac architecture variants Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Add an E2E test that exercises the full GPU stack (driver, CTK, Docker, Kubernetes) on an ARM64 g5g.xlarge instance (Graviton2 + T4g GPU). The test intentionally omits image.architecture to validate that the architecture inference from instance type (added in NVIDIA#669) works end-to-end in production. The g5g instance type is arm64-only, so holodeck must infer arm64 and resolve the correct AMI automatically. This test only runs on merge to main (not on PRs) since g5g instances are more expensive than the standard x86_64 test fleet. The periodic cleanup workflow already covers us-east-1 where g5g is available. Changes: - tests/data/test_aws_arm64.yml: g5g.xlarge config, no explicit arch - tests/aws_test.go: new "arm64" labeled test entry - .github/workflows/e2e.yaml: e2e-test-arm64 job gated on main Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Summary
inferArchFromInstanceType()helper that queriesDescribeInstanceTypesto detect arm64-only instance typesresolveOSToAMI(single-node),setLegacyAMI(legacy default), andcreateInstances(cluster mode)image.architectureis unset and the instance type only supports arm64 (e.g.,g5g,m7g,c7g), holodeck now automatically resolves arm64 AMIs instead of defaulting to x86_64Root Cause
When a holodeck config specifies an arm64-only instance type (like
g5g.xlarge) without explicitly settingimage.architecture: arm64, holodeck unconditionally defaults tox86_64. This resolves an x86_64 AMI, and EC2RunInstancesrejects the mismatch:The existing cross-validation (#664) catches this mismatch in
DryRun()with a better error, but doesn't prevent it. This PR fixes the selection itself.Backward Compatibility
image.architecturealways takes precedence (no behavior change)x86_64(no behavior change)Supersedes
Closes the gap left by PRs #661-664 which addressed validation and provisioning but not AMI selection.
Related: https://github.com/NVIDIA/gpu-driver-container/actions/runs/22012665274/job/63611032634
Test plan
TestInferArchFromInstanceType— arm64-only, x86_64-only, dual-arch, API errorTestResolveOSToAMI_InfersArchFromInstanceType— end-to-end: g5g.xlarge + empty arch → arm64 AMIgo test ./pkg/provider/aws/... -v— all 84 Ginkgo + unit tests passgo test ./pkg/... -count=1— full package suite passesgo vet ./pkg/provider/aws/...— clean