
fix(aws): infer AMI architecture from instance type for arm64 support #669

Merged
ArangoGutierrez merged 2 commits into NVIDIA:main from
ArangoGutierrez:fix/arm64-arch-inference
Feb 14, 2026

@ArangoGutierrez
Collaborator

Summary

  • Add inferArchFromInstanceType() helper that queries DescribeInstanceTypes to detect arm64-only instance types
  • Wire architecture inference into all three AMI resolution paths: resolveOSToAMI (single-node), setLegacyAMI (legacy default), and createInstances (cluster mode)
  • When image.architecture is unset and the instance type only supports arm64 (e.g., g5g, m7g, c7g), holodeck now automatically resolves arm64 AMIs instead of defaulting to x86_64
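
The inference rule described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `inferArch` and its plain string-slice input are hypothetical stand-ins for the helper that reads the DescribeInstanceTypes result, and only the exact values "x86_64" and "arm64" are recognized here.

```go
package main

import "fmt"

// inferArch is a hypothetical, simplified stand-in for the PR's
// inferArchFromInstanceType(). It receives the architectures an instance
// type reports as supported and returns "arm64" only when arm64 is
// supported and x86_64 is not. Every other case keeps the x86_64 default.
func inferArch(supported []string) string {
	hasX86, hasArm := false, false
	for _, a := range supported {
		switch a {
		case "x86_64":
			hasX86 = true
		case "arm64":
			hasArm = true
		}
	}
	if hasArm && !hasX86 {
		return "arm64"
	}
	return "x86_64"
}

func main() {
	fmt.Println(inferArch([]string{"arm64"}))           // arm64
	fmt.Println(inferArch([]string{"x86_64", "arm64"})) // x86_64
	fmt.Println(inferArch([]string{"x86_64"}))          // x86_64
}
```

Keeping the classification separate from the API call is only a choice made for this sketch; it makes the three cases easy to check without AWS credentials.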

Root Cause

When a holodeck config specifies an arm64-only instance type (like g5g.xlarge) without explicitly setting image.architecture: arm64, holodeck unconditionally defaults to x86_64. This resolves an x86_64 AMI, and EC2 RunInstances rejects the mismatch:

api error Unsupported: The requested configuration is currently not supported.

The existing cross-validation (#664) catches this mismatch in DryRun() with a better error, but doesn't prevent it. This PR fixes the selection itself.

Backward Compatibility

  • Explicit image.architecture always takes precedence (no behavior change)
  • Dual-arch or x86_64-only instance types still default to x86_64 (no behavior change)
  • Only arm64-only instance types trigger the new inference
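
As a rough sketch of the precedence these three rules imply (explicit value first, then arm64-only inference, then the x86_64 default): `chooseArch` and its parameters are hypothetical names used only for illustration and do not appear in the PR.

```go
package main

import "fmt"

// chooseArch is a hypothetical helper illustrating the precedence rules.
// explicit models image.architecture from the config ("" when unset);
// supported models the architectures reported for the instance type.
func chooseArch(explicit string, supported []string) string {
	// Rule 1: an explicit setting always wins.
	if explicit != "" {
		return explicit
	}
	hasX86, hasArm := false, false
	for _, a := range supported {
		hasX86 = hasX86 || a == "x86_64"
		hasArm = hasArm || a == "arm64"
	}
	// Rule 3: only arm64-only instance types trigger the new inference.
	if hasArm && !hasX86 {
		return "arm64"
	}
	// Rule 2: dual-arch and x86_64-only types keep the old default.
	return "x86_64"
}

func main() {
	fmt.Println(chooseArch("x86_64", []string{"arm64"})) // explicit value is kept
	fmt.Println(chooseArch("", []string{"arm64"}))       // inferred arm64
	fmt.Println(chooseArch("", []string{"x86_64"}))      // default x86_64
}
```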

Supersedes

Closes the gap left by PRs #661-664, which addressed validation and provisioning but not AMI selection.
Related: https://github.com/NVIDIA/gpu-driver-container/actions/runs/22012665274/job/63611032634

Test plan

  • TestInferArchFromInstanceType — arm64-only, x86_64-only, dual-arch, API error
  • TestResolveOSToAMI_InfersArchFromInstanceType — end-to-end: g5g.xlarge + empty arch → arm64 AMI
  • go test ./pkg/provider/aws/... -v — all 84 Ginkgo + unit tests pass
  • go test ./pkg/... -count=1 — full package suite passes
  • go vet ./pkg/provider/aws/... — clean
  • CI validation

When image.architecture is not explicitly set, holodeck defaults to
x86_64 regardless of the instance type. This causes EC2 RunInstances
to fail with "Unsupported configuration" when an arm64-only instance
type (e.g., g5g, m7g, c7g) is paired with an x86_64 AMI.

Add inferArchFromInstanceType() that queries DescribeInstanceTypes to
determine the instance type's supported architectures. When the type
only supports arm64, architecture is automatically set to "arm64"
before AMI resolution. This is wired into all three resolution paths:
resolveOSToAMI (single-node), setLegacyAMI (legacy default), and
createInstances (cluster mode).

Backward compatible: dual-arch or x86_64-only instance types still
default to x86_64. Explicit image.architecture always takes precedence.

Ref: https://github.com/NVIDIA/gpu-driver-container/actions/runs/22012665274/job/63611032634
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Copilot AI review requested due to automatic review settings February 14, 2026 21:41
@coveralls

coveralls commented Feb 14, 2026

Pull Request Test Coverage Report for Build 22025054055

Details

  • 30 of 54 (55.56%) changed or added relevant lines in 2 files are covered.
  • 2 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+0.4%) to 48.234%

Changes Missing Coverage       Covered Lines  Changed/Added Lines  %
pkg/provider/aws/cluster.go    0              12                   0.0%
pkg/provider/aws/image.go      30             42                   71.43%

Files with Coverage Reduction  New Missed Lines  %
pkg/provider/aws/image.go      2                 88.05%

Totals                                Coverage Status
Change from base Build 21999063489:   0.4%
Covered Lines:                        2609
Relevant Lines:                       5409

💛 - Coveralls

Copilot AI left a comment

Pull request overview

This PR improves Holodeck’s AWS AMI selection by automatically inferring CPU architecture from the chosen EC2 instance type when image.architecture is not explicitly set, enabling arm64-only instance types (e.g., g5g.*) to resolve arm64 AMIs instead of incorrectly defaulting to x86_64.

Changes:

  • Add inferArchFromInstanceType() (backed by DescribeInstanceTypes) to infer arm64 for arm64-only instance types, otherwise defaulting to x86_64.
  • Wire architecture inference into AMI resolution for single-node (resolveOSToAMI), legacy default AMI selection (setLegacyAMI), and cluster mode (createInstances).
  • Add unit and end-to-end-ish tests covering inference behavior and resolveOSToAMI integration.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File                            Description
pkg/provider/aws/image.go       Adds instance-type-based architecture inference and uses it in OS + legacy AMI resolution paths.
pkg/provider/aws/cluster.go     Applies architecture inference during cluster nodepool AMI resolution when image.architecture is unset.
pkg/provider/aws/image_test.go  Adds test coverage for inference logic and for resolveOSToAMI using inferred arm64.

Comment on lines 378 to 385
for _, a := range archs {
	switch a {
	case "x86_64":
		hasX86 = true
	case "arm64":
		hasArm = true
	}
}

Copilot AI Feb 14, 2026

inferArchFromInstanceType only treats exact "x86_64" and "arm64" as signals. getInstanceTypeArch returns raw EC2 ArchitectureType strings, which can include "x86_64_mac" and "arm64_mac"; those currently fall through and would be inferred as "x86_64" even if the instance type is ARM-only. Consider mapping the *_mac variants to their base arch (or using a small normalization helper) before setting hasX86/hasArm.
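
One possible shape for the small normalization helper mentioned in this comment is a prefix match on the base architecture. The sketch below is an assumption for illustration: `baseArch` is a hypothetical name, and the inputs are plain strings rather than the SDK's ArchitectureType values.

```go
package main

import (
	"fmt"
	"strings"
)

// baseArch is a hypothetical normalization helper. It folds the Mac
// variants ("x86_64_mac", "arm64_mac") into their base architecture by
// prefix match and returns any other value unchanged.
func baseArch(a string) string {
	switch {
	case strings.HasPrefix(a, "x86_64"):
		return "x86_64"
	case strings.HasPrefix(a, "arm64"):
		return "arm64"
	default:
		return a
	}
}

func main() {
	for _, a := range []string{"x86_64", "x86_64_mac", "arm64", "arm64_mac", "i386"} {
		fmt.Printf("%s -> %s\n", a, baseArch(a))
	}
}
```

With a helper like this applied before the switch, an instance type that reports only "arm64_mac" would set hasArm instead of falling through to the x86_64 default.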

Comment on lines 74 to 78
if inferred, err := p.inferArchFromInstanceType(p.Spec.Type); err == nil {
	arch = inferred
} else {
	arch = "x86_64" // Default if inference fails
}

Copilot AI Feb 14, 2026

When image.architecture is unset, inference errors are silently swallowed and the code falls back to x86_64. This can mask IAM/EC2 API failures and reintroduce the same cryptic RunInstances mismatch error for arm64-only types if inference fails at runtime. Consider surfacing the inference failure (e.g., return a wrapped error that tells users to set image.architecture explicitly), or at minimum emitting a warning when falling back.

Suggested change
- if inferred, err := p.inferArchFromInstanceType(p.Spec.Type); err == nil {
- 	arch = inferred
- } else {
- 	arch = "x86_64" // Default if inference fails
- }
+ inferred, err := p.inferArchFromInstanceType(p.Spec.Type)
+ if err != nil {
+ 	return fmt.Errorf(
+ 		"failed to infer image architecture from instance type %s: %w; set spec.image.architecture to override",
+ 		p.Spec.Type,
+ 		err,
+ 	)
+ }
+ arch = inferred

Comment on lines 405 to 411
} else {
	// Infer architecture from instance type (e.g., arm64 for g5g/m7g/c7g)
	if inferred, err := p.inferArchFromInstanceType(instanceType); err == nil {
		arch = inferred
	}
	// If inference fails, leave empty for resolveImageForNode to default
}

Copilot AI Feb 14, 2026

createInstances ignores errors from inferArchFromInstanceType and proceeds with an empty arch, relying on downstream defaults. If inference fails due to an AWS API/IAM issue, users won't know why arm64 inference didn't happen and may still hit a confusing instance/AMI mismatch at RunInstances. Consider returning/logging the inference error (or explicitly defaulting to x86_64 with a warning) when image.architecture isn't set.
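
A minimal sketch of what returning the inference error could look like, under stated assumptions: `resolveArch`, `archLookup`, and `errAccessDenied` are hypothetical names, and the lookup function stands in for inferArchFromInstanceType so the example runs without AWS access.

```go
package main

import (
	"errors"
	"fmt"
)

// errAccessDenied is a hypothetical stand-in for an AWS API/IAM failure.
var errAccessDenied = errors.New("access denied")

// archLookup is a hypothetical function type standing in for
// inferArchFromInstanceType.
type archLookup func(instanceType string) (string, error)

// resolveArch returns the explicit architecture when set. Otherwise it asks
// the lookup and, on failure, returns a wrapped error instead of silently
// falling back to x86_64.
func resolveArch(explicit, instanceType string, lookup archLookup) (string, error) {
	if explicit != "" {
		return explicit, nil
	}
	arch, err := lookup(instanceType)
	if err != nil {
		return "", fmt.Errorf(
			"failed to infer image architecture from instance type %s: %w; set image.architecture explicitly",
			instanceType, err)
	}
	return arch, nil
}

func main() {
	failing := func(string) (string, error) { return "", errAccessDenied }

	_, err := resolveArch("", "g5g.xlarge", failing)
	fmt.Println(err)
	// Wrapping with %w keeps the original cause inspectable.
	fmt.Println(errors.Is(err, errAccessDenied)) // true
}
```

Because the cause is wrapped rather than replaced, callers can still distinguish an API failure from other errors while users see a hint about setting the architecture explicitly.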

Comment on lines 1065 to 1087
name: "dual-arch instance type defaults to x86_64",
instanceType: "m6i.large",
setupMock: func(ec2Mock *MockEC2Client) {
	ec2Mock.DescribeInstTypesFunc = func(ctx context.Context,
		params *ec2.DescribeInstanceTypesInput,
		optFns ...func(*ec2.Options)) (*ec2.DescribeInstanceTypesOutput, error) {
		return &ec2.DescribeInstanceTypesOutput{
			InstanceTypes: []types.InstanceTypeInfo{
				{
					InstanceType: "m6i.large",
					ProcessorInfo: &types.ProcessorInfo{
						SupportedArchitectures: []types.ArchitectureType{
							types.ArchitectureTypeX8664,
							types.ArchitectureTypeArm64,
						},
					},
				},
			},
		}, nil
	}
},
wantArch: "x86_64",
wantErr: false,

Copilot AI Feb 14, 2026

The test case labeled "dual-arch instance type" uses m6i.large, which is an x86_64-only family in AWS. Using a clearly synthetic name (or a real instance type that actually reports multiple architectures) would make the intent of the test less confusing for future maintainers.

- Normalize *_mac architecture variants (x86_64_mac, arm64_mac) using
  strings.HasPrefix instead of exact match, so Mac instance types like
  mac2-m2.metal are correctly classified
- Surface inference errors in resolveOSToAMI, setLegacyAMI, and
  createInstances instead of silently falling back to x86_64, which
  would mask IAM/API failures and reproduce the original mismatch
- Use synthetic instance type name in dual-arch test case to avoid
  confusion with real AWS instance families
- Add test coverage for arm64_mac and x86_64_mac architecture variants

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
@ArangoGutierrez ArangoGutierrez merged commit 44e43ba into NVIDIA:main Feb 14, 2026
19 checks passed
ArangoGutierrez added a commit to ArangoGutierrez/holodeck that referenced this pull request Feb 15, 2026
Add an E2E test that exercises the full GPU stack (driver, CTK, Docker,
Kubernetes) on an ARM64 g5g.xlarge instance (Graviton2 + T4g GPU).

The test intentionally omits image.architecture to validate that the
architecture inference from instance type (added in NVIDIA#669) works
end-to-end in production. The g5g instance type is arm64-only, so
holodeck must infer arm64 and resolve the correct AMI automatically.

This test only runs on merge to main (not on PRs) since g5g instances
are more expensive than the standard x86_64 test fleet. The periodic
cleanup workflow already covers us-east-1 where g5g is available.

Changes:
- tests/data/test_aws_arm64.yml: g5g.xlarge config, no explicit arch
- tests/aws_test.go: new "arm64" labeled test entry
- .github/workflows/e2e.yaml: e2e-test-arm64 job gated on main

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>