feat: wire cluster networking into CreateCluster flow by ArangoGutierrez · Pull Request #728 · NVIDIA/holodeck

ArangoGutierrez · 2026-03-14T08:07:37Z

Summary

Wire public subnet, NAT gateway, and route tables into CreateCluster() flow
NLB placed in public subnet (internet-facing) instead of private subnet
Cluster instances get AssociatePublicIpAddress: false (private subnet behind NAT GW)

Test plan

Unit tests pass
Lint clean
CI checks

Connect existing networking functions (public subnet, public route table, NAT gateway, private route table) into the CreateCluster() sequence. Instances in cluster mode no longer get public IPs (private subnet behind NAT GW), and the NLB is placed in the public subnet for internet-facing access. - Add createPublicSubnet, createPublicRouteTable, createNATGateway, createPrivateRouteTable calls between route table and security group phases in CreateCluster - Change NLB subnet from cache.Subnetid to cache.PublicSubnetid - Set AssociatePublicIpAddress to false for cluster instances - Add cluster_test.go with tests for networking order, NLB subnet binding, and instance public IP behavior Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

Rewrite cluster_test.go to use focused unit tests instead of full CreateCluster integration tests. The InstanceRunningWaiter can't be easily mocked and causes 5-minute timeouts per test. Also fix HighAvailabilityConfig -> HAConfig type name. Signed-off-by: Eduardo Aguilar <eduardoa@nvidia.com> Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

coveralls · 2026-03-14T08:12:57Z

Pull Request Test Coverage Report for Build 23084812122

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

1 of 54 (1.85%) changed or added relevant lines in 3 files are covered.
1 unchanged line in 1 file lost coverage.
Overall coverage increased (+1.9%) to 55.859%

Changes Missing Coverage	Covered Lines	Changed/Added Lines	%
pkg/provider/aws/nlb.go	0	2	0.0%
pkg/provisioner/cluster.go	0	23	0.0%
pkg/provider/aws/cluster.go	1	29	3.45%

Files with Coverage Reduction	New Missed Lines	%
pkg/provisioner/cluster.go	1	12.71%

Totals
Change from base Build 23083001272:	1.9%
Covered Lines:	3637
Relevant Lines:	6511

💛 - Coveralls

Copilot

Pull request overview

This PR updates the AWS multinode-cluster networking setup so that an internet-facing NLB is placed in a public subnet, while cluster nodes are launched without public IPs.

Changes:

Switch NLB subnet selection from the private subnet (Subnetid) to the public subnet (PublicSubnetid).
Add “Phase 1b” cluster networking steps in CreateCluster (public subnet, public route table, NAT gateway, private route table).
Change cluster instance creation to not associate public IPs, and add a new test file intended to validate the wiring.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

File	Description
pkg/provider/aws/nlb.go	Updates NLB creation to use `cache.PublicSubnetid` for an internet-facing NLB.
pkg/provider/aws/cluster.go	Wires in public/NAT networking creation and disables public IP association on cluster instances.
pkg/provider/aws/cluster_test.go	Adds tests intended to validate cluster networking wiring and subnet/IP behavior.

You can also share your feedback on Copilot code review. Take the survey.

pkg/provider/aws/cluster.go

+	if err := p.createPrivateRouteTable(&cache.AWS); err != nil {
+		_ = p.updateDegradedCondition(*p.DeepCopy(), &cache.AWS, "v1alpha1.Creating", "Error creating private route table")
+		return fmt.Errorf("error creating private route table: %w", err)
+	}
+	_ = p.updateProgressingCondition(*p.DeepCopy(), &cache.AWS, "v1alpha1.Creating", "Private route table created")


pkg/provider/aws/cluster.go

 				NetworkInterfaces: []types.InstanceNetworkInterfaceSpecification{
 					{
-						AssociatePublicIpAddress: aws.Bool(true),
+						AssociatePublicIpAddress: aws.Bool(false),


pkg/provider/aws/cluster_test.go

+// TestClusterNetworkingPhasesExist verifies that CreateCluster has the
+// networking calls wired in by checking the code structure.
+// The actual networking functions are tested individually in create_test.go.
+func TestClusterNetworkingPhasesExist(t *testing.T) {
+	// Verify the provider has the networking methods we expect to be wired in.
+	// This is a compile-time guarantee — if these methods don't exist, this won't build.
+	var p *Provider
+	_ = p.createPublicSubnet
+	_ = p.createPublicRouteTable


pkg/provider/aws/cluster_test.go

+// TestNLBUsesPublicSubnetField verifies the NLB creation reads PublicSubnetid
+// from the cache, not Subnetid (the private subnet).
+func TestNLBUsesPublicSubnetField(t *testing.T) {
+	// The NLB creation in nlb.go uses cache.PublicSubnetid.
+	// Verify by checking the cache field exists and is distinct from Subnetid.
+	cache := &ClusterCache{}
+	cache.Subnetid = "subnet-private"
+	cache.PublicSubnetid = "subnet-public"
+
+	if cache.Subnetid == cache.PublicSubnetid {
+		t.Error("Subnetid and PublicSubnetid should be distinct for cluster mode")
+	}


pkg/provider/aws/cluster_test.go

+// TestInstancesNoPublicIPInClusterMode verifies that createInstances sets
+// AssociatePublicIpAddress to false (checked by reading the code).
+// The actual value is set at cluster.go line where createInstances builds
+// the RunInstancesInput with AssociatePublicIpAddress: aws.Bool(false).
+func TestInstancesNoPublicIPInClusterMode(t *testing.T) {
+	// This test validates the Image bypass works correctly for cluster tests.
+	// The actual AssociatePublicIpAddress=false is verified at the code level
+	// and will be caught by E2E tests. Unit-testing it requires a full
+	// InstanceRunningWaiter mock which is not worth the complexity.
+	env := v1alpha1.Environment{}
+	env.Spec.Cluster = &v1alpha1.ClusterSpec{
+		ControlPlane: v1alpha1.ControlPlaneSpec{
+			Count:        1,
+			InstanceType: "t3.medium",
+			Image:        &v1alpha1.Image{ImageId: aws.String("ami-test")},
+		},
+	}
+
+	// Verify Image bypasses OS resolution
+	if env.Spec.Cluster.ControlPlane.Image == nil {
+		t.Fatal("Image should be set to bypass AMI resolution")
+	}
+	if *env.Spec.Cluster.ControlPlane.Image.ImageId != "ami-test" {
+		t.Errorf("ImageId = %q, want %q", *env.Spec.Cluster.ControlPlane.Image.ImageId, "ami-test")
+	}
+}


Copilot

Pull request overview

Adds AWS cluster networking primitives to support internet-facing NLBs and private-subnet cluster nodes in Holodeck’s AWS provider.

Changes:

Switch NLB subnet selection to use a dedicated public subnet (PublicSubnetid).
Add “Phase 1b” cluster networking creation steps (public subnet, public route table, NAT gateway, private route table).
Flip cluster node instance networking to not associate public IPs; add a small test file.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File	Description
`pkg/provider/aws/nlb.go`	Use `cache.PublicSubnetid` for the internet-facing NLB.
`pkg/provider/aws/cluster.go`	Create public subnet/NAT/private routing for clusters; disable public IP assignment on instances.
`pkg/provider/aws/cluster_test.go`	Adds tests intended to validate cluster networking wiring (currently non-behavioral).

Comments suppressed due to low confidence (1)

pkg/provider/aws/cluster.go:709

Setting AssociatePublicIpAddress=false means cluster nodes won't receive public IPs in the private subnet. Current multinode provisioning and kubeconfig download paths dial SSH using node.PublicIP from status (see cmd/cli/create/create.go:499-568), and cluster.go populates PublicIP/PublicDNS from EC2 public fields (which will now be empty). This will break provisioning unless you also wire up an alternative transport (e.g., SSM) and include InstanceID/transport info in the node list. Either keep public IPs for now, or implement/persist SSM-based transport end-to-end (instance profile + node list + transport selection).

						AssociatePublicIpAddress: aws.Bool(false),
						DeleteOnTermination:      aws.Bool(true),
						DeviceIndex:              aws.Int32(0),
						Groups:                   []string{sgID},
						SubnetId:                 aws.String(cache.Subnetid),

You can also share your feedback on Copilot code review. Take the survey.

pkg/provider/aws/cluster.go

+	if err := p.createPrivateRouteTable(&cache.AWS); err != nil {
+		_ = p.updateDegradedCondition(*p.DeepCopy(), &cache.AWS, "v1alpha1.Creating", "Error creating private route table")
+		return fmt.Errorf("error creating private route table: %w", err)
+	}
+	_ = p.updateProgressingCondition(*p.DeepCopy(), &cache.AWS, "v1alpha1.Creating", "Private route table created")


pkg/provider/aws/cluster_test.go

+// TestClusterNetworkingPhasesExist verifies that CreateCluster has the
+// networking calls wired in by checking the code structure.
+// The actual networking functions are tested individually in create_test.go.
+func TestClusterNetworkingPhasesExist(t *testing.T) {
+	// Verify the provider has the networking methods we expect to be wired in.
+	// This is a compile-time guarantee — if these methods don't exist, this won't build.
+	var p *Provider
+	_ = p.createPublicSubnet
+	_ = p.createPublicRouteTable
+	_ = p.createNATGateway
+	_ = p.createPrivateRouteTable
+}


pkg/provider/aws/cluster_test.go

+// TestNLBUsesPublicSubnetField verifies the NLB creation reads PublicSubnetid
+// from the cache, not Subnetid (the private subnet).
+func TestNLBUsesPublicSubnetField(t *testing.T) {
+	// The NLB creation in nlb.go uses cache.PublicSubnetid.
+	// Verify by checking the cache field exists and is distinct from Subnetid.
+	cache := &ClusterCache{}
+	cache.Subnetid = "subnet-private"
+	cache.PublicSubnetid = "subnet-public"
+
+	if cache.Subnetid == cache.PublicSubnetid {
+		t.Error("Subnetid and PublicSubnetid should be distinct for cluster mode")
+	}
+}


pkg/provider/aws/cluster_test.go

+// TestInstancesNoPublicIPInClusterMode verifies that createInstances sets
+// AssociatePublicIpAddress to false (checked by reading the code).
+// The actual value is set at cluster.go line where createInstances builds
+// the RunInstancesInput with AssociatePublicIpAddress: aws.Bool(false).
+func TestInstancesNoPublicIPInClusterMode(t *testing.T) {
+	// This test validates the Image bypass works correctly for cluster tests.
+	// The actual AssociatePublicIpAddress=false is verified at the code level
+	// and will be caught by E2E tests. Unit-testing it requires a full
+	// InstanceRunningWaiter mock which is not worth the complexity.
+	env := v1alpha1.Environment{}
+	env.Spec.Cluster = &v1alpha1.ClusterSpec{
+		ControlPlane: v1alpha1.ControlPlaneSpec{
+			Count:        1,
+			InstanceType: "t3.medium",
+			Image:        &v1alpha1.Image{ImageId: aws.String("ami-test")},
+		},
+	}
+
+	// Verify Image bypasses OS resolution
+	if env.Spec.Cluster.ControlPlane.Image == nil {
+		t.Fatal("Image should be set to bypass AMI resolution")
+	}
+	if *env.Spec.Cluster.ControlPlane.Image.ImageId != "ami-test" {
+		t.Errorf("ImageId = %q, want %q", *env.Spec.Cluster.ControlPlane.Image.ImageId, "ami-test")
+	}
+}


…orNode, add real tests 1. Remove orphaned createRouteTable() call in cluster mode — private subnet gets NAT-routed table from createPrivateRouteTable(), public subnet gets IGW-routed table from createPublicRouteTable(). The old call created an unused IGW table for the private subnet. 2. Complete hostForNode migration in GetClusterHealth — match nodes by both PublicIP and PrivateIP so SSM transport nodes (no public IP) are found. GetClusterHealthFromEnv falls back to PrivateIP when PublicIP is empty. 3. Replace trivial compile-time checks with 6 behavioral mock tests: - TestCreateInstancesSetsNoPublicIP (captures RunInstancesInput) - TestCreateInstancesUsesRoleSecurityGroup (CP/Worker SG selection) - TestPrivateRouteTableRoutesToNATGW (NAT GW, not IGW) - TestPublicRouteTableRoutesToIGW (IGW + public subnet association) - TestPublicSubnetCreatedInCorrectCIDR (10.0.1.0/24) - TestNATGatewayCreatedInPublicSubnet (placed in public subnet) Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

The private subnet topology (PR NVIDIA#728) moved cluster instances behind a NAT GW, but SSM infrastructure (IAM instance profiles) was never set up, making SSH connections impossible. Revert instances to the public subnet with AssociatePublicIpAddress=true so they are directly reachable via SSH. The NAT GW + NLB infrastructure remains in place for future SSM migration. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

* fix: wait for NAT Gateway to reach available state before creating routes CreateNatGateway returns immediately with a pending NAT Gateway ID. Using this ID in CreateRoute before the NAT GW is available causes InvalidNatGatewayID.NotFound errors, breaking all cluster deployments. Poll DescribeNatGateways until state transitions to available (or failed) before returning from createNATGateway. This matches the existing pattern in deleteNATGateway which polls for deleted state. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> * fix: remove public IP assertion for cluster nodes in private subnet In the production cluster topology all nodes are in the private subnet and do not have public IPs. The E2E test was asserting PublicIP is non-empty for all nodes, which fails by design. Only PrivateIP is guaranteed. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> * fix: wire SSM transport for private-subnet cluster nodes Cluster nodes in private subnets have no public IP, so SSH connections must go through SSM port-forwarding. The E2E test and GetClusterHealth were building NodeInfo without setting the Transport field, causing SSH to attempt direct connection to an empty host. Wire SSMTransport for nodes with no PublicIP, matching the production code path in buildClusterNodeInfoList. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> * fix: place cluster instances in public subnet with public IPs The private subnet topology (PR #728) moved cluster instances behind a NAT GW, but SSM infrastructure (IAM instance profiles) was never set up, making SSH connections impossible. Revert instances to the public subnet with AssociatePublicIpAddress=true so they are directly reachable via SSH. The NAT GW + NLB infrastructure remains in place for future SSM migration. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> * cluster: skip NAT Gateway creation to avoid EIP quota exhaustion Cluster instances are placed in the public subnet with AssociatePublicIpAddress=true, so they reach the internet directly through the IGW. The NAT Gateway (and its EIP allocation + private route table) are unnecessary for this topology. Removing them from the cluster code path frees up EIP quota (AWS limit: 5 per region in us-west-1), which was being exhausted when multiple CI jobs ran concurrently, causing AddressLimitExceeded errors on the rpm-rocky and rpm-al2023 E2E tests. The private subnet is retained for future SSM endpoint use. Single-node mode (create.go) is unchanged and continues to use NAT Gateway for its private-subnet instances. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> * feat(aws): add RevokeSecurityGroup{Ingress,Egress} to EC2Client interface Extend the EC2Client interface with RevokeSecurityGroupIngress and RevokeSecurityGroupEgress methods needed for proper SG cleanup. Update both mock implementations to satisfy the interface. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> * feat(aws): revoke SG rules before deletion to prevent DependencyViolation Before calling DeleteSecurityGroup, enumerate and revoke all ingress and egress rules. This breaks cross-SG references (e.g., worker SG referencing control-plane SG) that cause DependencyViolation errors. Gracefully handles already-deleted groups (InvalidGroup.NotFound). Revoke errors are logged as warnings and do not block the delete attempt. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> * feat(aws): wait for ENIs to drain before deleting security groups After instance termination, AWS ENIs can linger for 2-5 minutes in 'in-use' state. Attempting to delete security groups while ENIs reference them causes DependencyViolation errors. Add waitForENIsDrained() that polls DescribeNetworkInterfaces every 10s (up to 5min) until all non-available ENIs in the VPC are gone. Insert this as Phase 1.5 between instance termination and SG deletion. ENI drain timeout is a warning, not a fatal error — the extended retry budget on deleteSecurityGroup provides a second line of defense. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> * feat(aws): increase retry budget to ~5min for eventual-consistency delays Increase maxRetries from 5 to 10 and retryDelay from 5s to 10s. This extends the total retry budget from ~65s to ~5min, covering the full AWS ENI detach window (2-5 min) as a final safety net beyond the explicit ENI drain wait and SG rule revocation. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> * fix: add RevokeSecurityGroup methods to shared mock EC2Client The EC2Client interface gained RevokeSecurityGroupIngress and RevokeSecurityGroupEgress in commit fa13fef, but the shared mock in pkg/testutil/mocks was not updated. This broke pkg/cleanup tests which use the shared mock. Also fix pre-existing gofmt issue in mock_ec2_test.go. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> * fix(aws): break cross-SG references and disassociate route tables before deletion Security group deletion failed with DependencyViolation because worker SG ingress rules referenced CP SG and vice versa. Fix: revoke both SGs' ingress rules upfront before any deletion. Egress revocation is skipped because CI IAM user lacks ec2:RevokeSecurityGroupEgress permission (and the default egress rule doesn't create cross-SG dependencies). Route table deletion also failed with DependencyViolation because subnet associations were not removed first. Fix: describe route table associations and disassociate non-main entries before calling DeleteRouteTable. Both fixes validated with local E2E: create + delete of a minimal cluster completed cleanly with no leaked resources. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> * fix(aws): update test to match egress revocation removal TestRevokeSecurityGroupRules_RevokesIngressAndEgress asserted that egress revocation was called, but the implementation now intentionally skips it (CI IAM lacks ec2:RevokeSecurityGroupEgress). Updated test to assert 0 egress calls and renamed to TestRevokeSecurityGroupRules_RevokesIngressOnly. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> * fix(aws): delete subnets before route tables to avoid DependencyViolation CI IAM user (cnt-ci) lacks ec2:DisassociateRouteTable, so the explicit disassociation approach fails with 403. Instead, swap the deletion order: delete subnets first (which implicitly removes route table associations), then delete route tables. This avoids needing DisassociateRouteTable entirely. Remove the disassociateRouteTableSubnets helper and DisassociateRouteTable from the EC2Client interface since they are no longer needed. Validated with local E2E: create + delete completes cleanly. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> --------- Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

ArangoGutierrez added 2 commits March 14, 2026 08:18

ArangoGutierrez requested a review from Copilot March 14, 2026 08:09

Copilot started reviewing on behalf of ArangoGutierrez March 14, 2026 08:10 View session

Copilot AI reviewed Mar 14, 2026

View reviewed changes

ArangoGutierrez requested a review from Copilot March 14, 2026 08:20

ArangoGutierrez marked this pull request as ready for review March 14, 2026 08:21

Copilot started reviewing on behalf of ArangoGutierrez March 14, 2026 08:21 View session

Copilot AI reviewed Mar 14, 2026

View reviewed changes

ArangoGutierrez added 2 commits March 14, 2026 09:55

style: fix gofmt alignment in cluster_test.go

0a0c514

Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>

ArangoGutierrez merged commit da4748e into NVIDIA:main Mar 16, 2026
21 of 23 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: wire cluster networking into CreateCluster flow#728

feat: wire cluster networking into CreateCluster flow#728
ArangoGutierrez merged 4 commits intoNVIDIA:mainfrom
ArangoGutierrez:feat/prod-clusters-wire

ArangoGutierrez commented Mar 14, 2026

Uh oh!

coveralls commented Mar 14, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

ArangoGutierrez commented Mar 14, 2026

Summary

Test plan

Uh oh!

coveralls commented Mar 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pull Request Test Coverage Report for Build 23084812122

Warning: This coverage report may be inaccurate.

Details

💛 - Coveralls

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

coveralls commented Mar 14, 2026 •

edited

Loading