Overview of the Issue
We hit a rate-limiting issue when scaling up many new nodes running Consul agents in our Kubernetes clusters. New agents log in with the Kubernetes Auth Method, and the resulting token review calls exceed the default QPS and Burst settings of the Kubernetes client used by the Consul server leader. This creates a significant bottleneck and limits how many new nodes can be brought online at one time. Once the Kubernetes client-go client-side rate limiting kicks in, the Consul server cannot complete the agents' logins, and the leader gets stuck in an infinite retry loop with ever-increasing backoff. The leader stays trapped in this retry/backoff loop until we remove the pressure or a new leader is elected.
Resolving this issue will significantly improve the scalability and reliability of Consul services in Kubernetes environments. We would greatly appreciate assistance in addressing this problem.
Reproduction Steps
1. Create a cluster with 3 Consul server nodes.
2. Create more than 300 Consul agents at once, each logging in through the Kubernetes Auth Method (a sketch of the login call is shown below).
3. Observe the client-side rate-limit log lines on the Consul server leader.
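For context, each agent's login is roughly equivalent to the following call against the Consul HTTP API. This is a minimal sketch using the consul/api Go client; the auth method name is an assumption, and the token path is the standard service account mount:

```go
package main

import (
	"fmt"
	"log"
	"os"

	"github.com/hashicorp/consul/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// Each agent presents its Kubernetes service account JWT. The Consul
	// server leader validates it with a TokenReview call against the
	// Kubernetes API, which is where the client-side rate limit is hit.
	jwt, err := os.ReadFile("/var/run/secrets/kubernetes.io/serviceaccount/token")
	if err != nil {
		log.Fatal(err)
	}

	token, _, err := client.ACL().Login(&api.ACLLoginParams{
		AuthMethod:  "kubernetes", // assumed auth method name
		BearerToken: string(jwt),
	}, nil)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("logged in, token accessor:", token.AccessorID)
}
```

With more than 300 of these logins arriving at once, the leader's single Kubernetes client has to issue the corresponding TokenReview requests in one burst.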
Expected Behavior
The Consul agents should scale without hitting Kubernetes client-side rate limits, allowing for a smooth and efficient scaling process.
Actual Behavior
The leader Consul server hits the default rate limits inside the Kubernetes client-go package, preventing new nodes and their Consul agents from coming online as expected. The leader then gets stuck in an infinite retry loop with increasing backoff time.
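To put rough numbers on the backoff: client-go's default client-side limiter is a token bucket with QPS 5 and Burst 10, which the validator's Kubernetes client inherits when QPS/Burst are left unset. The following is a small illustration, not Consul code; it only exercises the flowcontrol limiter that client-go uses by default:

```go
package main

import (
	"fmt"

	"k8s.io/client-go/util/flowcontrol"
)

func main() {
	// client-go's defaults: 5 requests per second with a burst of 10.
	limiter := flowcontrol.NewTokenBucketRateLimiter(5, 10)

	admitted := 0
	for i := 0; i < 300; i++ {
		if limiter.TryAccept() {
			admitted++ // only the initial burst is admitted without waiting
		}
	}
	fmt.Printf("admitted immediately: %d of 300\n", admitted)

	// The rest are queued at 5/s, so the tail of a 300-request burst of
	// TokenReview calls waits on the order of (300-10)/5 ≈ 58 seconds,
	// which surfaces as the long, growing backoff on the leader.
}
```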
Workaround
The workaround is to manually delete or scale down the new nodes and Consul agents to relieve pressure on the Consul server. Once the leader is stuck in the retry loop with a long backoff, we must issue a consul leave on the leader to reset the rate limit (see the sketch below).
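The same consul leave can also be issued through the HTTP API. A rough sketch, where the leader address is a placeholder:

```go
package main

import (
	"log"

	"github.com/hashicorp/consul/api"
)

func main() {
	cfg := api.DefaultConfig()
	cfg.Address = "http://consul-server-leader.example.internal:8500" // placeholder

	client, err := api.NewClient(cfg)
	if err != nil {
		log.Fatal(err)
	}

	// Gracefully leave the cluster on the stuck leader, forcing a new
	// leader election, which resets the rate limiting described above.
	if err := client.Agent().Leave(); err != nil {
		log.Fatal(err)
	}
}
```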
Proposed Resolution
We propose exposing the QPS and Burst parameters of the Kubernetes auth method validator client as configurable settings for the server pods, defaulting to the upstream client-go defaults. We tested this by raising the limits in the code and rebuilding the Consul binary: with sufficiently high QPS and Burst values, the client-side rate limiting no longer occurs on the Consul server, and in our testing the higher request rate did not overwhelm the control plane.
This proposal does not remove the client-side rate limiting altogether; it lets cluster operators tune the QPS and Burst values to their cluster size and performance. By default, the Kubernetes Auth Method validator client would keep the same settings as the stock Kubernetes client-go library, so existing clusters would see no change in behavior. Operators can adjust the settings based on their own testing and monitoring, and verify that their control plane can handle the increased load without affecting other Kubernetes API clients.
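As a rough sketch of the shape of this change (the Consul-side field names below are assumptions, not the actual configuration surface; QPS and Burst on rest.Config are the real client-go knobs), the validator would simply pass the configured values through when it builds its Kubernetes client:

```go
// Hypothetical sketch only, not the actual Consul implementation.
package kubeauthsketch

import (
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// k8sAuthConfig stands in for the auth method's existing config block,
// extended with the two proposed knobs. Field names are assumed.
type k8sAuthConfig struct {
	Host              string
	CACert            string
	ServiceAccountJWT string
	QPS               float32 // 0 keeps client-go's default of 5
	Burst             int     // 0 keeps client-go's default of 10
}

// newTokenReviewClient builds the client used for TokenReview calls,
// passing the operator-configured QPS/Burst straight through to rest.Config.
func newTokenReviewClient(c k8sAuthConfig) (*kubernetes.Clientset, error) {
	return kubernetes.NewForConfig(&rest.Config{
		Host:        c.Host,
		BearerToken: c.ServiceAccountJWT,
		TLSClientConfig: rest.TLSClientConfig{
			CAData: []byte(c.CACert),
		},
		// The only functional change: operators can raise these above the
		// client-go defaults when bringing many agents online at once.
		QPS:   c.QPS,
		Burst: c.Burst,
	})
}
```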
Impact
The current rate limiting issue has the potential to impact service deployment and scaling within large Kubernetes clusters. Addressing this bug is crucial for maintaining operational efficiency and reliability for clusters that support burst workloads.
Consul info for both Client and Server
Client info
```
$ consul info
agent:
	check_monitors = 0
	check_ttls = 0
	checks = 0
	services = 0
build:
	prerelease =
	revision = 0e046bbb
	version = 1.13.2
	version_metadata =
consul:
	acl = enabled
	known_servers = 3
	server = false
runtime:
	arch = amd64
	cpu_count = 16
	goroutines = 53
	max_procs = 16
	os = linux
	version = go1.18.1
serf_lan:
	coordinate_resets = 0
	encrypted = true
	event_queue = 0
	event_time = 193
	failed = 0
	health_score = 0
	intent_queue = 0
	left = 0
	member_time = 655959
	members = 6
	query_queue = 0
	query_time = 1
```
For reference, the QPS and Burst fields of the client-go rest.Config that the validator client inherits (see Kube Client Config Defaults below):

```go
// Config holds the common attributes that can be passed to a Kubernetes client on
// initialization.
type Config struct {
	...
	// QPS indicates the maximum QPS to the master from this client.
	// If it's zero, the created RESTClient will use DefaultQPS: 5
	QPS float32

	// Maximum burst for throttle.
	// If it's zero, the created RESTClient will use DefaultBurst: 10.
	Burst int
	...
}
```
Server info
Operating system and Environment details
Log Fragments
Full logs here: https://gist.github.com/brainshell/660ede2a1830c7d01083a8972ce7318d
Additional Context
Source Code
Validator Instantiation
consul/agent/consul/authmethod/kubeauth/k8s.go
Lines 103 to 113 in 0e046bb
Kube Client Config Defaults
https://github.com/kubernetes/client-go/blob/6b7c68377979c821b73d98d1bd4c5a466034f491/rest/config.go#L116-L120