Skip to content

Introduce config opt-in NLB provisioning with Security Groups #1158

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

mtulio
Copy link
Contributor

@mtulio mtulio commented Jun 11, 2025

What type of PR is this?

/kind feature

What this PR does / why we need it:

This PR introduces managed security group support for Service type-LoadBalancer NLB through a global cloud-config option. Key changes:

  • Managed Security Groups: Introduces NLBSecurityGroupMode: Managed cloud-config to automatically provision Security Groups for new NLBs
  • Automatic Lifecycle Management: Controller creates, manages, and deletes security groups with proper AWS tagging
  • Backward Compatibility: Existing Service NLBs without security groups remain unchanged (no retrofitting)
  • BYO Security Group Limitation: Explicitly blocks service.beta.kubernetes.io/aws-load-balancer-security-groups and service.beta.kubernetes.io/aws-load-balancer-extra-security-groups annotations for NLB

Rationale for BYO SG Limitation:

BYO Security Group support for NLB is intentionally blocked in this release due to:

  1. AWS API complexity with NLB security group management mixing with the managed SG feature
  2. Security group leak bug (see #1208 / #1209) logic will be reused in NLB BYO SG feature

Which issue(s) this PR fixes:
Refs #1151

Special notes for your reviewer:

Future Work:
Managed SG is important to empower users to bypass managed SG, and enhance security control boundaries with user-managed security groups on NLBs. BYO Security Group support for NLB will be considered in a future release after:

  • The managed SG be merged with NLB SG management logic
  • Resolution of security group leak bug (#1209)
  • Comprehensive field testing of managed SG functionality
  • Community feedback on managed SG approach

The changes introducing the global, opt-in, cloud-config for enabling Security Group (SG) on NLB creation (similar CLB), is for users/administrators who intentionally wants to enforce SG across all new services - following AWS recommendations, and ALBC defaults. This won't change the default CCM behavior if the configuration isn't added.

Done checklist:

  • introduce cloud-config to signalize the controller to always provision Security Group on Service type-LoadBalancer NLB
  • Create Security Group when NLB is created, and global config is added
  • Delete Security Group when NLB is removed
  • Update Service port and ensure SG rules are updated
  • Review SG rules logic
  • Elaborate an e2e scenario that allows to update the cloud-config to exercise NLB+SG feature
  • Resolve pending question related to the approach
  • Create User Documentation for the feature

Related changes isolated from this PR:

https://github.com/kubernetes/cloud-provider-aws/pull/1207
https://github.com/kubernetes/cloud-provider-aws/issues/1206
https://github.com/kubernetes/cloud-provider-aws/pull/1163
https://github.com/kubernetes/cloud-provider-aws/pull/1205
https://github.com/kubernetes/cloud-provider-aws/pull/1159
https://github.com/kubernetes/cloud-provider-aws/pull/1153
https://github.com/kubernetes/cloud-provider-aws/pull/1215

Does this PR introduce a user-facing change?:

Add opt-in NLB Security Group provisioning support. Introduces support for provisioning Service type-LoadBalancer NLB with Security Groups through global cloud-config (`NLBSecurityGroupMode: Managed`). The managed security group approache is opt-in and include automatic lifecycle management of Security Groups.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. release-note Denotes a PR that will be considered when it comes time to generate release notes. kind/feature Categorizes issue or PR as related to a new feature. labels Jun 11, 2025
@k8s-ci-robot k8s-ci-robot requested review from hakman and nckturner June 11, 2025 21:42
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 11, 2025
@k8s-ci-robot
Copy link
Contributor

This issue is currently awaiting triage.

If cloud-provider-aws contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Jun 11, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @mtulio. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@mtulio mtulio changed the title Splat 2253 nlb sg config Introduce Security Group provisioning for NLB Jun 11, 2025
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jun 11, 2025
@mtulio mtulio changed the title Introduce Security Group provisioning for NLB Introduce opt-in Security Group provisioning for NLB Jun 11, 2025
Copy link
Contributor Author

@mtulio mtulio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Almost done. A few pending items in TODO, and raising questions in key points from a broad group.

if _, ok := tags["Name"]; !ok {
tags["Name"] = name
}
// TODO/TBD do we need to introduce a managed tag to ensure SG wont be leaked if config is changed when LB is created in managed mode?
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is important, it could cause leak if we dont introduce a webhook or mechanism to handle configuration changes.

@mtulio mtulio changed the title Introduce opt-in Security Group provisioning for NLB Introduce opt-in NLB provisioning with Security Groups Jun 11, 2025
@mtulio
Copy link
Contributor Author

mtulio commented Jun 11, 2025

Hey @kmala, I still have some items to review in this PR, but tests are passing locally. Would you mind stamping ok-to-test to validate if there is no local addiction from my env? Thanks

@kmala
Copy link
Member

kmala commented Jun 11, 2025

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jun 11, 2025
@mtulio mtulio force-pushed the SPLAT-2253-nlb-sg-config branch 2 times, most recently from e1e47cf to 1e8e527 Compare June 13, 2025 02:13
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jun 13, 2025
@mtulio
Copy link
Contributor Author

mtulio commented Jun 13, 2025

/test all

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 14, 2025
@mtulio mtulio force-pushed the SPLAT-2253-nlb-sg-config branch from 1e8e527 to 9a0b0cc Compare June 18, 2025 20:10
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jun 18, 2025
@mtulio mtulio changed the title Introduce opt-in NLB provisioning with Security Groups Introduce config opt-in NLB provisioning with Security Groups Jun 18, 2025
@elmiko
Copy link
Contributor

elmiko commented Jul 24, 2025

just starting to take a look at this.

@mtulio
Copy link
Contributor Author

mtulio commented Jul 29, 2025

Observing in the last run:

The maximum number of target groups has been reached

/test pull-cloud-provider-aws-e2e

@mtulio
Copy link
Contributor Author

mtulio commented Jul 29, 2025

I will rebase tough and try again. Quickly triage:

/test pull-cloud-provider-aws-e2e-kubetest2
/test pull-cloud-provider-aws-e2e

Copy link
Contributor Author

@mtulio mtulio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggesting improvements in the documentation strings. I will keep this open for comments and apply it in the next review round.

@mtulio
Copy link
Contributor Author

mtulio commented Jul 31, 2025

e2e test are very flaky (getting stuck on go install) in other PRs, re-running to validate latest version here:

/test pull-cloud-provider-aws-e2e

Copy link
Contributor

@elmiko elmiko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes sense to me, i have some questions and suggestions.


The controller can be configured to enable managed Security Groups (SGs) for AWS Network Load Balancers (NLBs) by setting an opt-in configuration in your cloud config. When enabled, each NLB created for a Kubernetes Service of type `LoadBalancer` will have a dedicated Security Group, managed by the cloud provider controller. We are calling this as opt-in Managed NLB Security Group ("Managed SG" mode).

The following annotations (BYO SG / user-provided security groups) are **not supported** for NLB services and will result in service creation/update failure:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

curious why we call these out as "not supported", i'm guessing to set user expectations but it seemed a little confusing to me. just curious.

looking at the example below on how to engage this feature, i think if we are going to specify the annotations that do not work here, we should also specify the annotations that should be added.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good point, Mike. The initial idea of highlighting those BYO SG annotation of "not supported" was enhance user experience when trying to consume an valid annotation to the controller, but not implemented to NLB, only on CLB logic.

This UX pitfall is a very common use case users trying to use annotations valid only on CLB, and no feedback is provided/annotation ignored (afaict a webhook would help on that). Starting to this proposal, we are validating the BYO SG field when NLB to ensure users are not mixing concepts, getting an early feedback.

we should also specify the annotations that should be added.

good idea. To deploy Service type-loadBalancer NLB we should always set the annotation "service.beta.kubernetes.io/aws-load-balancer-type" to CCM. I will update it in a new paragraph above. Moreover, we can also mention the service_controller documentation, which was recently updated with a column Valid for which mention annotation support by ELB type.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense to me. in general, if we add something to the docs that says "do not use these things ....", we should also have something next to it that says "use these other things instead ....". or something similar, that's my desire here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added! I made a few changes from the original version regarding to this statement, focusing in redirecting to the supported service controller documentation which is complete with matrix of supported annotations.

I also added the note reinforcing the users the annotation used to provision NLB. LMK what it looks like now. Thanks

@@ -239,6 +239,7 @@ const ServiceAnnotationLoadBalancerSubnets = "service.beta.kubernetes.io/aws-loa

const headerSourceArn = "x-amz-source-arn"
const headerSourceAccount = "x-amz-source-account"
const securityGroupPrefix = "k8s-elb-"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 from me

@mtulio
Copy link
Contributor Author

mtulio commented Aug 5, 2025

Thanks for review so far. FWIW I am returning on this PR today, I am applying the suggestions and synchronize the shared pieces (SG ownership check) with changes/feedback from PR #1209

@mtulio mtulio marked this pull request as draft August 5, 2025 21:34
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 5, 2025
mtulio added 2 commits August 5, 2025 19:26
Introduce the NLB Security Group Mode configuration
(NLBSecurityGroupMode) to make the controller creates the Security Group
by default when provisioning Service type-LoadBalancer NLB.

This configuration is opt-in and global to the cluster.
Ensure the Security Group IDs is added on NLB load balancer creation.

Additionally, this is fixing the BYO SG update scenario by detecting the
replaced SG on CLB and delete it when it is owned by controller. The
same behavior will be implemented in the BYO SG scenario for NLB too.
@mtulio mtulio force-pushed the SPLAT-2253-nlb-sg-config branch from 5f62c03 to a1a785e Compare August 5, 2025 22:30
mtulio added 4 commits August 5, 2025 19:53
Ensure the Security Group is managed when creating
a service type-LoadBalancer NLB object, considering the
global configuration to manage SGs in NLBs:
NLBSecurityGroupMode=NLBSecurityGroupModeManaged
Ensure unit tests on EnsureLoadBalancer, including case to test NLB with
security group by changing the cloud-config.
Introduce hasClusterTagOwned() to validate if a resource has the
kubernetes cluster tag (`kubernetes.io/cluster/clusterID`) with
value `owned`, so it can quickly used when ensuring states to
cloud resources managed by controller, such as SG deletions, etc.
Introduce the documentation to use the feature Service type-LoadBalancer
with Security Group by opt-in through the cloud-config.
@mtulio mtulio force-pushed the SPLAT-2253-nlb-sg-config branch from a1a785e to 59722ba Compare August 5, 2025 22:53
@mtulio
Copy link
Contributor Author

mtulio commented Aug 5, 2025

Feedback addressed. The only pending question is related to the SG name, currently I am following the CCMs approach: SG name will have the same name of LB

/test pull-cloud-provider-aws-e2e

@mtulio
Copy link
Contributor Author

mtulio commented Aug 6, 2025

e2e are passing on CI, units are green locally. Converting this PR to ready. PTAL?

@mtulio mtulio marked this pull request as ready for review August 6, 2025 01:41
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 6, 2025
@k8s-ci-robot k8s-ci-robot requested a review from olemarkus August 6, 2025 01:41
@mtulio
Copy link
Contributor Author

mtulio commented Aug 6, 2025

Observing jobs getting stuck frequently on build, I need to rerun those frequently after many hours on Prow Console.

Just did it again since the last run.

@mtulio
Copy link
Contributor Author

mtulio commented Aug 6, 2025

Reviewers: there is a known-bug on kubelet killing the group and Prow is not caughting correctly, leaving the job running forever. Furthermore, I can see e2e job is consuming more CPU and Mem than limit, leading it to be killed by kube. More information to the Slack thread: https://kubernetes.slack.com/archives/C7J9RP96G/p1754511876402269?thread_ts=1754505741.634999&cid=C7J9RP96G

I am constantly monitoring and restarting it manually to get readiness of the latest version.

This is not blocking this PR for review. Thanks!

@mtulio
Copy link
Contributor Author

mtulio commented Aug 11, 2025

CI must be fixed

/test pull-cloud-provider-aws-e2e

@mtulio
Copy link
Contributor Author

mtulio commented Aug 11, 2025

Perfect, now with CI fixes the tests are passing. cc @elmiko

@elmiko
Copy link
Contributor

elmiko commented Aug 11, 2025

thanks @mtulio , i will take a look this week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants