feat: wire cluster networking into CreateCluster flow#728
feat: wire cluster networking into CreateCluster flow#728ArangoGutierrez merged 4 commits intoNVIDIA:mainfrom
Conversation
Connect existing networking functions (public subnet, public route table, NAT gateway, private route table) into the CreateCluster() sequence. Instances in cluster mode no longer get public IPs (private subnet behind NAT GW), and the NLB is placed in the public subnet for internet-facing access. - Add createPublicSubnet, createPublicRouteTable, createNATGateway, createPrivateRouteTable calls between route table and security group phases in CreateCluster - Change NLB subnet from cache.Subnetid to cache.PublicSubnetid - Set AssociatePublicIpAddress to false for cluster instances - Add cluster_test.go with tests for networking order, NLB subnet binding, and instance public IP behavior Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Rewrite cluster_test.go to use focused unit tests instead of full CreateCluster integration tests. The InstanceRunningWaiter can't be easily mocked and causes 5-minute timeouts per test. Also fix HighAvailabilityConfig -> HAConfig type name. Signed-off-by: Eduardo Aguilar <eduardoa@nvidia.com> Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Pull Request Test Coverage Report for Build 23084812122Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
There was a problem hiding this comment.
Pull request overview
This PR updates the AWS multinode-cluster networking setup so that an internet-facing NLB is placed in a public subnet, while cluster nodes are launched without public IPs.
Changes:
- Switch NLB subnet selection from the private subnet (
Subnetid) to the public subnet (PublicSubnetid). - Add “Phase 1b” cluster networking steps in
CreateCluster(public subnet, public route table, NAT gateway, private route table). - Change cluster instance creation to not associate public IPs, and add a new test file intended to validate the wiring.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| pkg/provider/aws/nlb.go | Updates NLB creation to use cache.PublicSubnetid for an internet-facing NLB. |
| pkg/provider/aws/cluster.go | Wires in public/NAT networking creation and disables public IP association on cluster instances. |
| pkg/provider/aws/cluster_test.go | Adds tests intended to validate cluster networking wiring and subnet/IP behavior. |
You can also share your feedback on Copilot code review. Take the survey.
| if err := p.createPrivateRouteTable(&cache.AWS); err != nil { | ||
| _ = p.updateDegradedCondition(*p.DeepCopy(), &cache.AWS, "v1alpha1.Creating", "Error creating private route table") | ||
| return fmt.Errorf("error creating private route table: %w", err) | ||
| } | ||
| _ = p.updateProgressingCondition(*p.DeepCopy(), &cache.AWS, "v1alpha1.Creating", "Private route table created") |
| NetworkInterfaces: []types.InstanceNetworkInterfaceSpecification{ | ||
| { | ||
| AssociatePublicIpAddress: aws.Bool(true), | ||
| AssociatePublicIpAddress: aws.Bool(false), |
pkg/provider/aws/cluster_test.go
Outdated
| // TestClusterNetworkingPhasesExist verifies that CreateCluster has the | ||
| // networking calls wired in by checking the code structure. | ||
| // The actual networking functions are tested individually in create_test.go. | ||
| func TestClusterNetworkingPhasesExist(t *testing.T) { | ||
| // Verify the provider has the networking methods we expect to be wired in. | ||
| // This is a compile-time guarantee — if these methods don't exist, this won't build. | ||
| var p *Provider | ||
| _ = p.createPublicSubnet | ||
| _ = p.createPublicRouteTable |
pkg/provider/aws/cluster_test.go
Outdated
| // TestNLBUsesPublicSubnetField verifies the NLB creation reads PublicSubnetid | ||
| // from the cache, not Subnetid (the private subnet). | ||
| func TestNLBUsesPublicSubnetField(t *testing.T) { | ||
| // The NLB creation in nlb.go uses cache.PublicSubnetid. | ||
| // Verify by checking the cache field exists and is distinct from Subnetid. | ||
| cache := &ClusterCache{} | ||
| cache.Subnetid = "subnet-private" | ||
| cache.PublicSubnetid = "subnet-public" | ||
|
|
||
| if cache.Subnetid == cache.PublicSubnetid { | ||
| t.Error("Subnetid and PublicSubnetid should be distinct for cluster mode") | ||
| } |
pkg/provider/aws/cluster_test.go
Outdated
| // TestInstancesNoPublicIPInClusterMode verifies that createInstances sets | ||
| // AssociatePublicIpAddress to false (checked by reading the code). | ||
| // The actual value is set at cluster.go line where createInstances builds | ||
| // the RunInstancesInput with AssociatePublicIpAddress: aws.Bool(false). | ||
| func TestInstancesNoPublicIPInClusterMode(t *testing.T) { | ||
| // This test validates the Image bypass works correctly for cluster tests. | ||
| // The actual AssociatePublicIpAddress=false is verified at the code level | ||
| // and will be caught by E2E tests. Unit-testing it requires a full | ||
| // InstanceRunningWaiter mock which is not worth the complexity. | ||
| env := v1alpha1.Environment{} | ||
| env.Spec.Cluster = &v1alpha1.ClusterSpec{ | ||
| ControlPlane: v1alpha1.ControlPlaneSpec{ | ||
| Count: 1, | ||
| InstanceType: "t3.medium", | ||
| Image: &v1alpha1.Image{ImageId: aws.String("ami-test")}, | ||
| }, | ||
| } | ||
|
|
||
| // Verify Image bypasses OS resolution | ||
| if env.Spec.Cluster.ControlPlane.Image == nil { | ||
| t.Fatal("Image should be set to bypass AMI resolution") | ||
| } | ||
| if *env.Spec.Cluster.ControlPlane.Image.ImageId != "ami-test" { | ||
| t.Errorf("ImageId = %q, want %q", *env.Spec.Cluster.ControlPlane.Image.ImageId, "ami-test") | ||
| } | ||
| } |
There was a problem hiding this comment.
Pull request overview
Adds AWS cluster networking primitives to support internet-facing NLBs and private-subnet cluster nodes in Holodeck’s AWS provider.
Changes:
- Switch NLB subnet selection to use a dedicated public subnet (
PublicSubnetid). - Add “Phase 1b” cluster networking creation steps (public subnet, public route table, NAT gateway, private route table).
- Flip cluster node instance networking to not associate public IPs; add a small test file.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
pkg/provider/aws/nlb.go |
Use cache.PublicSubnetid for the internet-facing NLB. |
pkg/provider/aws/cluster.go |
Create public subnet/NAT/private routing for clusters; disable public IP assignment on instances. |
pkg/provider/aws/cluster_test.go |
Adds tests intended to validate cluster networking wiring (currently non-behavioral). |
Comments suppressed due to low confidence (1)
pkg/provider/aws/cluster.go:709
- Setting AssociatePublicIpAddress=false means cluster nodes won't receive public IPs in the private subnet. Current multinode provisioning and kubeconfig download paths dial SSH using node.PublicIP from status (see cmd/cli/create/create.go:499-568), and cluster.go populates PublicIP/PublicDNS from EC2 public fields (which will now be empty). This will break provisioning unless you also wire up an alternative transport (e.g., SSM) and include InstanceID/transport info in the node list. Either keep public IPs for now, or implement/persist SSM-based transport end-to-end (instance profile + node list + transport selection).
AssociatePublicIpAddress: aws.Bool(false),
DeleteOnTermination: aws.Bool(true),
DeviceIndex: aws.Int32(0),
Groups: []string{sgID},
SubnetId: aws.String(cache.Subnetid),
You can also share your feedback on Copilot code review. Take the survey.
| if err := p.createPrivateRouteTable(&cache.AWS); err != nil { | ||
| _ = p.updateDegradedCondition(*p.DeepCopy(), &cache.AWS, "v1alpha1.Creating", "Error creating private route table") | ||
| return fmt.Errorf("error creating private route table: %w", err) | ||
| } | ||
| _ = p.updateProgressingCondition(*p.DeepCopy(), &cache.AWS, "v1alpha1.Creating", "Private route table created") |
pkg/provider/aws/cluster_test.go
Outdated
| // TestClusterNetworkingPhasesExist verifies that CreateCluster has the | ||
| // networking calls wired in by checking the code structure. | ||
| // The actual networking functions are tested individually in create_test.go. | ||
| func TestClusterNetworkingPhasesExist(t *testing.T) { | ||
| // Verify the provider has the networking methods we expect to be wired in. | ||
| // This is a compile-time guarantee — if these methods don't exist, this won't build. | ||
| var p *Provider | ||
| _ = p.createPublicSubnet | ||
| _ = p.createPublicRouteTable | ||
| _ = p.createNATGateway | ||
| _ = p.createPrivateRouteTable | ||
| } |
pkg/provider/aws/cluster_test.go
Outdated
| // TestNLBUsesPublicSubnetField verifies the NLB creation reads PublicSubnetid | ||
| // from the cache, not Subnetid (the private subnet). | ||
| func TestNLBUsesPublicSubnetField(t *testing.T) { | ||
| // The NLB creation in nlb.go uses cache.PublicSubnetid. | ||
| // Verify by checking the cache field exists and is distinct from Subnetid. | ||
| cache := &ClusterCache{} | ||
| cache.Subnetid = "subnet-private" | ||
| cache.PublicSubnetid = "subnet-public" | ||
|
|
||
| if cache.Subnetid == cache.PublicSubnetid { | ||
| t.Error("Subnetid and PublicSubnetid should be distinct for cluster mode") | ||
| } | ||
| } |
pkg/provider/aws/cluster_test.go
Outdated
| // TestInstancesNoPublicIPInClusterMode verifies that createInstances sets | ||
| // AssociatePublicIpAddress to false (checked by reading the code). | ||
| // The actual value is set at cluster.go line where createInstances builds | ||
| // the RunInstancesInput with AssociatePublicIpAddress: aws.Bool(false). | ||
| func TestInstancesNoPublicIPInClusterMode(t *testing.T) { | ||
| // This test validates the Image bypass works correctly for cluster tests. | ||
| // The actual AssociatePublicIpAddress=false is verified at the code level | ||
| // and will be caught by E2E tests. Unit-testing it requires a full | ||
| // InstanceRunningWaiter mock which is not worth the complexity. | ||
| env := v1alpha1.Environment{} | ||
| env.Spec.Cluster = &v1alpha1.ClusterSpec{ | ||
| ControlPlane: v1alpha1.ControlPlaneSpec{ | ||
| Count: 1, | ||
| InstanceType: "t3.medium", | ||
| Image: &v1alpha1.Image{ImageId: aws.String("ami-test")}, | ||
| }, | ||
| } | ||
|
|
||
| // Verify Image bypasses OS resolution | ||
| if env.Spec.Cluster.ControlPlane.Image == nil { | ||
| t.Fatal("Image should be set to bypass AMI resolution") | ||
| } | ||
| if *env.Spec.Cluster.ControlPlane.Image.ImageId != "ami-test" { | ||
| t.Errorf("ImageId = %q, want %q", *env.Spec.Cluster.ControlPlane.Image.ImageId, "ami-test") | ||
| } | ||
| } |
…orNode, add real tests 1. Remove orphaned createRouteTable() call in cluster mode — private subnet gets NAT-routed table from createPrivateRouteTable(), public subnet gets IGW-routed table from createPublicRouteTable(). The old call created an unused IGW table for the private subnet. 2. Complete hostForNode migration in GetClusterHealth — match nodes by both PublicIP and PrivateIP so SSM transport nodes (no public IP) are found. GetClusterHealthFromEnv falls back to PrivateIP when PublicIP is empty. 3. Replace trivial compile-time checks with 6 behavioral mock tests: - TestCreateInstancesSetsNoPublicIP (captures RunInstancesInput) - TestCreateInstancesUsesRoleSecurityGroup (CP/Worker SG selection) - TestPrivateRouteTableRoutesToNATGW (NAT GW, not IGW) - TestPublicRouteTableRoutesToIGW (IGW + public subnet association) - TestPublicSubnetCreatedInCorrectCIDR (10.0.1.0/24) - TestNATGatewayCreatedInPublicSubnet (placed in public subnet) Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The private subnet topology (PR NVIDIA#728) moved cluster instances behind a NAT GW, but SSM infrastructure (IAM instance profiles) was never set up, making SSH connections impossible. Revert instances to the public subnet with AssociatePublicIpAddress=true so they are directly reachable via SSH. The NAT GW + NLB infrastructure remains in place for future SSM migration. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
The private subnet topology (PR NVIDIA#728) moved cluster instances behind a NAT GW, but SSM infrastructure (IAM instance profiles) was never set up, making SSH connections impossible. Revert instances to the public subnet with AssociatePublicIpAddress=true so they are directly reachable via SSH. The NAT GW + NLB infrastructure remains in place for future SSM migration. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
* fix: wait for NAT Gateway to reach available state before creating routes CreateNatGateway returns immediately with a pending NAT Gateway ID. Using this ID in CreateRoute before the NAT GW is available causes InvalidNatGatewayID.NotFound errors, breaking all cluster deployments. Poll DescribeNatGateways until state transitions to available (or failed) before returning from createNATGateway. This matches the existing pattern in deleteNATGateway which polls for deleted state. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> * fix: remove public IP assertion for cluster nodes in private subnet In the production cluster topology all nodes are in the private subnet and do not have public IPs. The E2E test was asserting PublicIP is non-empty for all nodes, which fails by design. Only PrivateIP is guaranteed. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> * fix: wire SSM transport for private-subnet cluster nodes Cluster nodes in private subnets have no public IP, so SSH connections must go through SSM port-forwarding. The E2E test and GetClusterHealth were building NodeInfo without setting the Transport field, causing SSH to attempt direct connection to an empty host. Wire SSMTransport for nodes with no PublicIP, matching the production code path in buildClusterNodeInfoList. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> * fix: place cluster instances in public subnet with public IPs The private subnet topology (PR #728) moved cluster instances behind a NAT GW, but SSM infrastructure (IAM instance profiles) was never set up, making SSH connections impossible. Revert instances to the public subnet with AssociatePublicIpAddress=true so they are directly reachable via SSH. The NAT GW + NLB infrastructure remains in place for future SSM migration. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> * cluster: skip NAT Gateway creation to avoid EIP quota exhaustion Cluster instances are placed in the public subnet with AssociatePublicIpAddress=true, so they reach the internet directly through the IGW. The NAT Gateway (and its EIP allocation + private route table) are unnecessary for this topology. Removing them from the cluster code path frees up EIP quota (AWS limit: 5 per region in us-west-1), which was being exhausted when multiple CI jobs ran concurrently, causing AddressLimitExceeded errors on the rpm-rocky and rpm-al2023 E2E tests. The private subnet is retained for future SSM endpoint use. Single-node mode (create.go) is unchanged and continues to use NAT Gateway for its private-subnet instances. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> * feat(aws): add RevokeSecurityGroup{Ingress,Egress} to EC2Client interface Extend the EC2Client interface with RevokeSecurityGroupIngress and RevokeSecurityGroupEgress methods needed for proper SG cleanup. Update both mock implementations to satisfy the interface. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> * feat(aws): revoke SG rules before deletion to prevent DependencyViolation Before calling DeleteSecurityGroup, enumerate and revoke all ingress and egress rules. This breaks cross-SG references (e.g., worker SG referencing control-plane SG) that cause DependencyViolation errors. Gracefully handles already-deleted groups (InvalidGroup.NotFound). Revoke errors are logged as warnings and do not block the delete attempt. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> * feat(aws): wait for ENIs to drain before deleting security groups After instance termination, AWS ENIs can linger for 2-5 minutes in 'in-use' state. Attempting to delete security groups while ENIs reference them causes DependencyViolation errors. Add waitForENIsDrained() that polls DescribeNetworkInterfaces every 10s (up to 5min) until all non-available ENIs in the VPC are gone. Insert this as Phase 1.5 between instance termination and SG deletion. ENI drain timeout is a warning, not a fatal error — the extended retry budget on deleteSecurityGroup provides a second line of defense. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> * feat(aws): increase retry budget to ~5min for eventual-consistency delays Increase maxRetries from 5 to 10 and retryDelay from 5s to 10s. This extends the total retry budget from ~65s to ~5min, covering the full AWS ENI detach window (2-5 min) as a final safety net beyond the explicit ENI drain wait and SG rule revocation. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> * fix: add RevokeSecurityGroup methods to shared mock EC2Client The EC2Client interface gained RevokeSecurityGroupIngress and RevokeSecurityGroupEgress in commit fa13fef, but the shared mock in pkg/testutil/mocks was not updated. This broke pkg/cleanup tests which use the shared mock. Also fix pre-existing gofmt issue in mock_ec2_test.go. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> * fix(aws): break cross-SG references and disassociate route tables before deletion Security group deletion failed with DependencyViolation because worker SG ingress rules referenced CP SG and vice versa. Fix: revoke both SGs' ingress rules upfront before any deletion. Egress revocation is skipped because CI IAM user lacks ec2:RevokeSecurityGroupEgress permission (and the default egress rule doesn't create cross-SG dependencies). Route table deletion also failed with DependencyViolation because subnet associations were not removed first. Fix: describe route table associations and disassociate non-main entries before calling DeleteRouteTable. Both fixes validated with local E2E: create + delete of a minimal cluster completed cleanly with no leaked resources. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> * fix(aws): update test to match egress revocation removal TestRevokeSecurityGroupRules_RevokesIngressAndEgress asserted that egress revocation was called, but the implementation now intentionally skips it (CI IAM lacks ec2:RevokeSecurityGroupEgress). Updated test to assert 0 egress calls and renamed to TestRevokeSecurityGroupRules_RevokesIngressOnly. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> * fix(aws): delete subnets before route tables to avoid DependencyViolation CI IAM user (cnt-ci) lacks ec2:DisassociateRouteTable, so the explicit disassociation approach fails with 403. Instead, swap the deletion order: delete subnets first (which implicitly removes route table associations), then delete route tables. This avoids needing DisassociateRouteTable entirely. Remove the disassociateRouteTableSubnets helper and DisassociateRouteTable from the EC2Client interface since they are no longer needed. Validated with local E2E: create + delete completes cleanly. Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com> --------- Signed-off-by: Carlos Eduardo Arango Gutierrez <eduardoa@nvidia.com>
Summary
CreateCluster()flowAssociatePublicIpAddress: false(private subnet behind NAT GW)Test plan