forked from cilium/cilium
v1.12 Cherry-picks #514
Merged
Conversation
jaredledvina changed the title from "Add support to fallback from ENI PD if subnet is out of /28 prefixes" to "v1.12 Cherry-picks" on Jul 12, 2023
When a subnet is very fragmented, there might not be enough free contiguous IPs to make up a /28 block. Without this fallback, new node creation would be blocked when PD is enabled and the subnet is in this state. Currently, the operator does not support PD in mixed mode: once a node has secondary IP addresses attached on the ENI, the operator will not attempt to allocate any more prefixes. Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
The ENI CNI uses the 7th bit (0x80/0x80) of the mark for policy-based routing. Cilium also uses the 7th bit to carry the uppermost bit of the security identity in some cases (e.g. pod-to-pod communication on the same node). The ENI CNI uses the CONNMARK iptables target to record that a flow came from the external network and restores the mark on the reply path. When the ENI rule "restores" the mark for a new connection (connmark == 0), it zeroes the 7th bit of the mark and thereby changes the identity. This becomes a problem when Cilium is carrying the identity and the uppermost bit is 1. The uppermost bit is 1 when the ClusterID of the source endpoint is 128-255, since we encode the ClusterID into the uppermost 8 bits of the identity.

There are two possible iptables rules which cause this error:

1. The rule in the CILIUM_PRE_mangle chain (managed by us, introduced in cilium#12770):

```
-A CILIUM_PRE_mangle -i lxc+ -m comment --comment "cilium: primary ENI" -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
```

2. The rule in the PREROUTING chain of the nat table (managed by AWS):

```
-A PREROUTING -m comment --comment "AWS, CONNMARK" -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
```

There are three possible setups affected by this issue:

1. Cilium running in ENI mode after uninstalling the AWS VPC CNI (e.g. create an EKS cluster and install Cilium afterwards)
2. Cilium running with the AWS VPC CNI in chaining mode
3. Cilium running in ENI mode without the AWS VPC CNI from the beginning (e.g. a self-hosted k8s cluster on EC2 hosts)

In setup 3, we can fix the issue by modifying only rule 1, which is what this commit focuses on. The fix is to make the rule skip restoring the connmark when we are carrying the identity in the mark, which we can detect by checking whether MARK_MAGIC_IDENTITY is set:

```
-A CILIUM_PRE_mangle -i lxc+ -m comment --comment "cilium: primary ENI" -m mark ! --mark 0x0F00/0x0F00 -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
```

The corresponding IP rule that looks up the main routing table based on a mark value of 0x80 has also been updated to account for this exclusion.

Co-developed-by: Hemanth Malla <hemanth.malla@datadoghq.com>
Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
Suggested-by: Eric Mountain <eric.mountain@datadoghq.com>
Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
Add a MaxRetryInterval option to ControllerParams struct, which allows users to configure the max error retry interval for each controller. This can mitigate the thundering herd problem as reported in cilium#21886. Signed-off-by: ArthurChiao <arthurchiao@hotmail.com>
This fixes a bug in the helpers/rate package (used for rate limiting cilium-operator access to external cloud APIs such as AWS/Azure/AlibabaCloud etc). Before this commit, the `Limit` function would accidentally request two wait tokens instead of one when the rate limit was hit: once in `Reserve`, and then once again by calling `Wait`. This effectively halved the bucket refill rate (aka "QPS") under load. This commit extends the unit tests with a regression test. The old code will fail in the newly added TestRateLimitWait test. The existing test is extended to validate the expected burst behavior. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Some calls in the Azure API packages were missing the rate limit wait, needed to ensure we do not get throttled by the Azure API. This commit adds the missing checks. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
When provisioning network interfaces and IP addresses for Cilium ENI, Azure or AlibabaCloud IPAM, the operator has to issue requests to the cloud provider API. Those requests might be subject to throttling if the operator issues too many requests in a short time window (which can happen, for example, if many nodes are added to the cluster at once). Users have reported that the current limits are not enough to avoid throttling on the cloud provider. If throttling occurs, some operations (like interface deletion) might fail, causing potential resource leaks. Therefore, this commit adjusts the rate limit used by Cilium operator. The new limit consists of a bucket size of 20 (also called the burst size) and a bucket refill rate (called QPS in the config flag) of 4. The old refill rate of 20 (effectively 10 under load, due to a bug fixed in the previous commit) was reportedly high enough to be throttled on the Amazon EC2 API under load. Therefore, the new refill rate of 4 queries per second (under load) is chosen rather conservatively, to ensure we are not throttled by the backend API. To compensate, the burst size is raised to 20, meaning the operator can issue up to 20 operations before throttling to 4 queries per second kicks in. It should be noted that Cilium's rate limiter is shared for all operations (read and write), whereas cloud providers have more sophisticated per-operation quotas. But since the interaction in real systems is likely even more complicated than that (e.g. the quota is likely shared with other API users too), we stick to a rather simple and conservative limit by default. If the new default limit is too low for sophisticated or large-scale users, it remains possible to overwrite the new default values if needed.
For reference, these are the documented API rate limits on supported cloud providers:

Amazon EC2 API [^1]
------------------

| Type   | Burst | RPS |
|--------|-------|-----|
| Reads  | 100   | 20  |
| Writes | 200   | 10  |

Azure API
---------

*Network API [^2]*

| Type   | Burst* | RPS* |
|--------|--------|------|
| Reads  | 1000   | 3.33 |
| Writes | 10000  | 33.3 |

*Estimated burst refill rate per second based on their 5 minute window.

*Compute API [^3]*

The Compute API does not document what the default limits are, and it seems like the actual setup is more sophisticated. But based on the sample response on the page, the rates that apply overall seem to be in the same ballpark as the Network API.

Alibaba ECS API [^4]
--------------------

According to the public documentation, the limits are only visible to logged-in users. They may vary based on region and subscription. We assume they are also in the same ballpark as the other two providers.

[^1]: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/throttling.html#throttling-limits
[^2]: https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/request-limits-and-throttling#network-throttling
[^3]: https://learn.microsoft.com/en-us/troubleshoot/azure/virtual-machines/troubleshooting-throttling-errors
[^4]: https://www.alibabacloud.com/help/en/elastic-compute-service/latest/api-throttling

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This commit also updates the documentation to mention that the default values have changed and how users can use the new Helm values to restore previous behavior. Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Currently, trigger-related histogram metrics in pkg/ipam use the default Prometheus histogram buckets. Resync operations in cloud providers like Azure tend to take a long time, and the current buckets are inadequate to track changes in behavior. This commit extends the buckets to allow for measuring longer durations. Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
Currently we're not capturing response status and duration for some of the calls made to Azure. This commit calls ObserveAPICall() with status and duration for the operations missing them. Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
jaredledvina force-pushed the jared.ledvina/cherry-picks-for-1-12 branch from 62cd08f to 9bf4d29 on July 12, 2023 16:55
hemanthmalla approved these changes on Jul 12, 2023
Cherry-pick the following upstream commits into 1.12:
Documentation/operations/upgrade.rst