fix(aws): infer AMI architecture from instance type for arm64 support by ArangoGutierrez · Pull Request #669 · NVIDIA/holodeck

ArangoGutierrez · 2026-02-14T21:41:13Z

Summary

Add inferArchFromInstanceType() helper that queries DescribeInstanceTypes to detect arm64-only instance types
Wire architecture inference into all three AMI resolution paths: resolveOSToAMI (single-node), setLegacyAMI (legacy default), and createInstances (cluster mode)
When image.architecture is unset and the instance type only supports arm64 (e.g., g5g, m7g, c7g), holodeck now automatically resolves arm64 AMIs instead of defaulting to x86_64

Root Cause

When a holodeck config specifies an arm64-only instance type (like g5g.xlarge) without explicitly setting image.architecture: arm64, holodeck unconditionally defaults to x86_64. This resolves an x86_64 AMI, and EC2 RunInstances rejects the mismatch:

api error Unsupported: The requested configuration is currently not supported.

The existing cross-validation (#664) catches this mismatch in DryRun() with a better error, but doesn't prevent it. This PR fixes the selection itself.

Backward Compatibility

Explicit image.architecture always takes precedence (no behavior change)
Dual-arch or x86_64-only instance types still default to x86_64 (no behavior change)
Only arm64-only instance types trigger the new inference

Supersedes

Closes the gap left by PRs #661-664 which addressed validation and provisioning but not AMI selection.
Related: https://github.com/NVIDIA/gpu-driver-container/actions/runs/22012665274/job/63611032634

Test plan

TestInferArchFromInstanceType — arm64-only, x86_64-only, dual-arch, API error
TestResolveOSToAMI_InfersArchFromInstanceType — end-to-end: g5g.xlarge + empty arch → arm64 AMI
go test ./pkg/provider/aws/... -v — all 84 Ginkgo + unit tests pass
go test ./pkg/... -count=1 — full package suite passes
go vet ./pkg/provider/aws/... — clean
CI validation

When image.architecture is not explicitly set, holodeck defaults to x86_64 regardless of the instance type. This causes EC2 RunInstances to fail with "Unsupported configuration" when an arm64-only instance type (e.g., g5g, m7g, c7g) is paired with an x86_64 AMI. Add inferArchFromInstanceType() that queries DescribeInstanceTypes to determine the instance type's supported architectures. When the type only supports arm64, architecture is automatically set to "arm64" before AMI resolution. This is wired into all three resolution paths: resolveOSToAMI (single-node), setLegacyAMI (legacy default), and createInstances (cluster mode). Backward compatible: dual-arch or x86_64-only instance types still default to x86_64. Explicit image.architecture always takes precedence. Ref: https://github.com/NVIDIA/gpu-driver-container/actions/runs/22012665274/job/63611032634 Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

coveralls · 2026-02-14T21:44:46Z

Pull Request Test Coverage Report for Build 22025054055

Details

30 of 54 (55.56%) changed or added relevant lines in 2 files are covered.
2 unchanged lines in 1 file lost coverage.
Overall coverage increased (+0.4%) to 48.234%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
pkg/provider/aws/cluster.go	0	12	0.0%
pkg/provider/aws/image.go	30	42	71.43%

Files with Coverage Reduction	New Missed Lines	%
pkg/provider/aws/image.go	2	88.05%

Totals
Change from base Build 21999063489:	0.4%
Covered Lines:	2609
Relevant Lines:	5409

💛 - Coveralls

Copilot

Pull request overview

This PR improves Holodeck’s AWS AMI selection by automatically inferring CPU architecture from the chosen EC2 instance type when image.architecture is not explicitly set, enabling arm64-only instance types (e.g., g5g.*) to resolve arm64 AMIs instead of incorrectly defaulting to x86_64.

Changes:

Add inferArchFromInstanceType() (backed by DescribeInstanceTypes) to infer arm64 for arm64-only instance types, otherwise defaulting to x86_64.
Wire architecture inference into AMI resolution for single-node (resolveOSToAMI), legacy default AMI selection (setLegacyAMI), and cluster mode (createInstances).
Add unit and end-to-end-ish tests covering inference behavior and resolveOSToAMI integration.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File	Description
pkg/provider/aws/image.go	Adds instance-type-based architecture inference and uses it in OS + legacy AMI resolution paths.
pkg/provider/aws/cluster.go	Applies architecture inference during cluster nodepool AMI resolution when `image.architecture` is unset.
pkg/provider/aws/image_test.go	Adds test coverage for inference logic and for `resolveOSToAMI` using inferred arm64.

Copilot · 2026-02-14T21:46:04Z

pkg/provider/aws/image.go

+	for _, a := range archs {
+		switch a {
+		case "x86_64":
+			hasX86 = true
+		case "arm64":
+			hasArm = true
+		}
+	}


inferArchFromInstanceType only treats exact "x86_64" and "arm64" as signals. getInstanceTypeArch returns raw EC2 ArchitectureType strings, which can include "x86_64_mac" and "arm64_mac"; those currently fall through and would be inferred as "x86_64" even if the instance type is ARM-only. Consider mapping the *_mac variants to their base arch (or using a small normalization helper) before setting hasX86/hasArm.

Copilot · 2026-02-14T21:46:04Z

pkg/provider/aws/image.go

+		if inferred, err := p.inferArchFromInstanceType(p.Spec.Type); err == nil {
+			arch = inferred
+		} else {
+			arch = "x86_64" // Default if inference fails
+		}


When image.architecture is unset, inference errors are silently swallowed and the code falls back to x86_64. This can mask IAM/EC2 API failures and reintroduce the same cryptic RunInstances mismatch error for arm64-only types if inference fails at runtime. Consider surfacing the inference failure (e.g., return a wrapped error that tells users to set image.architecture explicitly), or at minimum emitting a warning when falling back.

Suggested change

if inferred, err := p.inferArchFromInstanceType(p.Spec.Type); err == nil {

arch = inferred

} else {

arch = "x86_64" // Default if inference fails

}

inferred, err := p.inferArchFromInstanceType(p.Spec.Type)

if err != nil {

return fmt.Errorf(

"failed to infer image architecture from instance type %s: %w; set spec.image.architecture to override",

p.Spec.Type,

err,

)

}

arch = inferred

Copilot · 2026-02-14T21:46:05Z

pkg/provider/aws/cluster.go

+	} else {
+		// Infer architecture from instance type (e.g., arm64 for g5g/m7g/c7g)
+		if inferred, err := p.inferArchFromInstanceType(instanceType); err == nil {
+			arch = inferred
+		}
+		// If inference fails, leave empty for resolveImageForNode to default
 	}


createInstances ignores errors from inferArchFromInstanceType and proceeds with an empty arch, relying on downstream defaults. If inference fails due to an AWS API/IAM issue, users won't know why arm64 inference didn't happen and may still hit a confusing instance/AMI mismatch at RunInstances. Consider returning/logging the inference error (or explicitly defaulting to x86_64 with a warning) when image.architecture isn't set.

Copilot · 2026-02-14T21:46:05Z

pkg/provider/aws/image_test.go

+			name:         "dual-arch instance type defaults to x86_64",
+			instanceType: "m6i.large",
+			setupMock: func(ec2Mock *MockEC2Client) {
+				ec2Mock.DescribeInstTypesFunc = func(ctx context.Context,
+					params *ec2.DescribeInstanceTypesInput,
+					optFns ...func(*ec2.Options)) (*ec2.DescribeInstanceTypesOutput, error) {
+					return &ec2.DescribeInstanceTypesOutput{
+						InstanceTypes: []types.InstanceTypeInfo{
+							{
+								InstanceType: "m6i.large",
+								ProcessorInfo: &types.ProcessorInfo{
+									SupportedArchitectures: []types.ArchitectureType{
+										types.ArchitectureTypeX8664,
+										types.ArchitectureTypeArm64,
+									},
+								},
+							},
+						},
+					}, nil
+				}
+			},
+			wantArch: "x86_64",
+			wantErr:  false,


The test case labeled "dual-arch instance type" uses m6i.large, which is an x86_64-only family in AWS. Using a clearly synthetic name (or a real instance type that actually reports multiple architectures) would make the intent of the test less confusing for future maintainers.

- Normalize *_mac architecture variants (x86_64_mac, arm64_mac) using strings.HasPrefix instead of exact match, so Mac instance types like mac2-m2.metal are correctly classified - Surface inference errors in resolveOSToAMI, setLegacyAMI, and createInstances instead of silently falling back to x86_64, which would mask IAM/API failures and reproduce the original mismatch - Use synthetic instance type name in dual-arch test case to avoid confusion with real AWS instance families - Add test coverage for arm64_mac and x86_64_mac architecture variants Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

Add an E2E test that exercises the full GPU stack (driver, CTK, Docker, Kubernetes) on an ARM64 g5g.xlarge instance (Graviton2 + T4g GPU). The test intentionally omits image.architecture to validate that the architecture inference from instance type (added in NVIDIA#669) works end-to-end in production. The g5g instance type is arm64-only, so holodeck must infer arm64 and resolve the correct AMI automatically. This test only runs on merge to main (not on PRs) since g5g instances are more expensive than the standard x86_64 test fleet. The periodic cleanup workflow already covers us-east-1 where g5g is available. Changes: - tests/data/test_aws_arm64.yml: g5g.xlarge config, no explicit arch - tests/aws_test.go: new "arm64" labeled test entry - .github/workflows/e2e.yaml: e2e-test-arm64 job gated on main Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

Copilot AI review requested due to automatic review settings February 14, 2026 21:41

Copilot started reviewing on behalf of ArangoGutierrez February 14, 2026 21:41 View session

Copilot AI reviewed Feb 14, 2026

View reviewed changes

ArangoGutierrez merged commit 44e43ba into NVIDIA:main Feb 14, 2026
19 checks passed

ArangoGutierrez mentioned this pull request Feb 15, 2026

test(e2e): add ARM64 GPU end-to-end test on merge to main #670

Open

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(aws): infer AMI architecture from instance type for arm64 support#669

fix(aws): infer AMI architecture from instance type for arm64 support#669
ArangoGutierrez merged 2 commits intoNVIDIA:mainfrom
ArangoGutierrez:fix/arm64-arch-inference

ArangoGutierrez commented Feb 14, 2026

Uh oh!

coveralls commented Feb 14, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Feb 14, 2026

Uh oh!

Copilot AI Feb 14, 2026

Uh oh!

Copilot AI Feb 14, 2026

Uh oh!

Copilot AI Feb 14, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

ArangoGutierrez commented Feb 14, 2026

Summary

Root Cause

Backward Compatibility

Supersedes

Test plan

Uh oh!

coveralls commented Feb 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pull Request Test Coverage Report for Build 22025054055

Details

💛 - Coveralls

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Feb 14, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 14, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 14, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Feb 14, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

coveralls commented Feb 14, 2026 •

edited

Loading