
v1.12 Cherry-picks #514

Merged: 11 commits merged into v1.12-dd on Jul 12, 2023
Conversation

@jaredledvina jaredledvina changed the title from "Add support to fallback from ENI PD if subnet is out of /28 prefixes" to "v1.12 Cherry-picks" on Jul 12, 2023
@jaredledvina jaredledvina marked this pull request as ready for review July 12, 2023 16:30
hemanthmalla and others added 11 commits July 12, 2023 12:53
When a subnet is very fragmented, there might not be enough free
contiguous IPs to make up a /28 block. Without this fallback, new node
creation would be blocked when PD is enabled and the subnet is in this
state. Currently, the operator does not support PD in mixed mode: once a node
has secondary IP addresses attached to the ENI, the operator will not attempt
to allocate any more prefixes.
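A minimal sketch of the fallback decision described above, assuming hypothetical helper names (allocPrefix, allocSecondaryIPs) and error value; this is not the actual operator code:

```
package main

import (
	"errors"
	"fmt"
)

// errNoContiguousBlock stands in for the provider error seen when the subnet
// has no free contiguous /28 range (hypothetical name).
var errNoContiguousBlock = errors.New("no contiguous /28 prefix available")

// allocateForNode prefers prefix delegation, but falls back to individual
// secondary IPs when the subnet is too fragmented. Since mixed mode is not
// supported, a node that already holds secondary IPs never gets prefixes.
func allocateForNode(hasSecondaryIPs bool, allocPrefix, allocSecondaryIPs func() error) error {
	if !hasSecondaryIPs {
		switch err := allocPrefix(); {
		case err == nil:
			return nil
		case !errors.Is(err, errNoContiguousBlock):
			return err
		}
		// Subnet is fragmented: fall through to secondary IP allocation.
	}
	return allocSecondaryIPs()
}

func main() {
	err := allocateForNode(false,
		func() error { return errNoContiguousBlock },
		func() error { fmt.Println("fell back to secondary IPs"); return nil })
	fmt.Println("err:", err)
}
```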

Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
The ENI CNI uses the 7th bit (0x80/0x80) of the mark for PBR. Cilium also
uses the 7th bit to carry the uppermost bit of the security identity in
some cases (e.g. pod-to-pod communication on the same node). The ENI CNI
uses the CONNMARK iptables target to record that the flow came from the
external network and restores the mark on the reply path.

When the ENI CNI "restores" the mark for a new connection (connmark == 0), it
zeros the 7th bit of the mark and thereby changes the identity. This becomes a
problem when Cilium is carrying the identity and the uppermost bit is 1.
The uppermost bit becomes 1 when the ClusterID of the source endpoint is
128-255, since we encode the ClusterID into the uppermost 8 bits of the
identity.
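A small illustration of the collision. The mark packing below is an assumption made only for this example (identity's lower 16 bits in the upper half of the mark, the 0x0F00 identity magic from the rule below in the middle, the identity's upper 8 bits, i.e. the ClusterID, in the low byte); it is not a definition of Cilium's actual encoding:

```
package main

import "fmt"

func main() {
	clusterID := uint32(200)           // ClusterID in the 128-255 range
	identity := clusterID<<16 | 0x1234 // upper 8 bits of the identity = ClusterID

	// Assumed packing of the identity into the packet mark for this example.
	mark := (identity&0xFFFF)<<16 | 0x0F00 | identity>>16
	fmt.Printf("mark %#08x, bit 0x80 set: %v\n", mark, mark&0x80 != 0)

	// CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80 on a new connection
	// (connmark == 0) clears bit 7 of the packet mark ...
	mark &^= 0x80

	// ... which corrupts the identity carried in the mark: ClusterID 200 becomes 72.
	restored := (mark>>16)&0xFFFF | (mark&0xFF)<<16
	fmt.Printf("identity before %#x, after %#x\n", identity, restored)
}
```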

There are two iptables rules that can cause this error.

1. The rule in the CILIUM_PRE_mangle chain (managed by us, introduced in cilium#12770)

```
-A CILIUM_PRE_mangle -i lxc+ -m comment --comment "cilium: primary ENI" -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
```

2. The rule in the PREROUTING chain of the nat table (managed by AWS)

```
-A PREROUTING -m comment --comment "AWS, CONNMARK" -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
```

There are three possible setups affected by this issue:

1. Cilium is running in ENI mode after uninstalling the AWS VPC CNI (e.g. creating an EKS cluster and installing Cilium afterwards)
2. Cilium is running alongside the AWS VPC CNI in chaining mode
3. Cilium is running in ENI mode without the AWS VPC CNI from the beginning (e.g. a self-hosted k8s cluster on EC2 hosts)

In setup 3, we can fix the issue by modifying only rule 1, which is what this
commit focuses on. The fix is to add a mark match so that the rule does not
restore the connmark while we are carrying an identity in the mark; we can
check whether MARK_MAGIC_IDENTITY is set.

```
-A CILIUM_PRE_mangle -i lxc+ -m comment --comment "cilium: primary ENI" -m mark ! --mark 0x0F00/0x0F00 -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
```

The corresponding IP rule that looks up the main routing table based on a mark value of 0x80 has also been updated to account for this exclusion.

Co-developed-by: Hemanth Malla <hemanth.malla@datadoghq.com>
Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
Suggested-by: Eric Mountain <eric.mountain@datadoghq.com>
Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
Add a MaxRetryInterval option to the ControllerParams struct, which allows
users to configure the maximum error retry interval for each controller.
This can mitigate the thundering-herd problem reported in cilium#21886.
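A minimal sketch of the idea behind the cap, assuming a simple doubling backoff; the names and policy are illustrative, not the actual pkg/controller implementation:

```
package main

import (
	"fmt"
	"time"
)

// retryInterval doubles the base interval per consecutive error, but never
// exceeds maxRetryInterval when one is configured, which keeps many failing
// controllers from backing off to arbitrarily long intervals at once.
func retryInterval(base, maxRetryInterval time.Duration, consecutiveErrors int) time.Duration {
	d := base
	for i := 0; i < consecutiveErrors; i++ {
		d *= 2
		if maxRetryInterval > 0 && d >= maxRetryInterval {
			return maxRetryInterval
		}
	}
	return d
}

func main() {
	for errs := 0; errs <= 8; errs++ {
		fmt.Printf("errors=%d retry in %v\n", errs, retryInterval(time.Second, 30*time.Second, errs))
	}
}
```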

Signed-off-by: ArthurChiao <arthurchiao@hotmail.com>
This fixes a bug in the helpers/rate package (used for rate limiting
cilium-operator access to external cloud APIs such as
AWS/Azure/AlibabaCloud etc). Before this commit, the `Limit` function
would accidentally request two wait tokens instead of one when the rate
limit was hit: once in `Reserve`, and then once again by calling `Wait`.
This effectively halved the bucket refill rate (aka "QPS") under load.

This commit extends the unit tests with a regression test: the old code
fails in the newly added TestRateLimitWait test. The existing test
is extended to validate the expected burst behavior.
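A sketch of the general pitfall, shown with golang.org/x/time/rate as a stand-in for the helpers/rate package rather than the actual Cilium code: reserving a token and then also waiting consumes two tokens per call, roughly halving the effective refill rate.

```
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	ctx := context.Background()
	limiter := rate.NewLimiter(rate.Limit(10), 1) // 10 tokens/s, burst 1

	start := time.Now()
	for i := 0; i < 10; i++ {
		// Buggy pattern: Reserve takes one token, Wait takes another.
		limiter.Reserve()
		_ = limiter.Wait(ctx)
	}
	fmt.Println("reserve+wait:", time.Since(start)) // roughly twice as slow

	start = time.Now()
	for i := 0; i < 10; i++ {
		// Fixed pattern: a single Wait is enough to honor the limit.
		_ = limiter.Wait(ctx)
	}
	fmt.Println("wait only:   ", time.Since(start))
}
```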

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Some calls in the Azure API packages were missing the rate-limit wait
needed to ensure we do not get throttled by the Azure API. This commit
adds the missing checks.
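A sketch of the pattern being added, with a hypothetical stand-in for the Azure SDK call; only the limiter.Wait before the request reflects the change described here.

```
package main

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
)

// callAzureAPI is a hypothetical stand-in for an Azure SDK call.
func callAzureAPI(ctx context.Context) error {
	fmt.Println("issuing Azure API request")
	return nil
}

// listInterfaces blocks on the shared rate limiter before every outbound
// call, so the operator throttles itself instead of being throttled.
func listInterfaces(ctx context.Context, limiter *rate.Limiter) error {
	if err := limiter.Wait(ctx); err != nil { // the previously missing wait
		return err
	}
	return callAzureAPI(ctx)
}

func main() {
	_ = listInterfaces(context.Background(), rate.NewLimiter(rate.Limit(4), 20))
}
```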

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
When provisioning network interfaces and IP addresses for Cilium ENI, Azure or
AlibabaCloud IPAM, the operator has to issue requests to the cloud provider API.
Those requests might be subject to throttling if the operator issues too many
requests in a short time window (which can, for example, happen if many nodes are
added to the cluster at once). Users have reported that the current limits are
not enough to avoid throttling on the cloud provider. If throttling occurs,
some operations (like interface deletion) might fail, causing potential resource
leaks.

Therefore, this commit adjusts the rate limit used by the Cilium operator.
The new limit consists of a bucket size of 20 (also called the burst size) and
a bucket refill rate (called QPS in the config flag) of 4. The old refill rate
of 20 (effectively 10 under load, due to a bug fixed in the previous commit)
was reportedly high enough to be throttled on the Amazon EC2 API under load.

Therefore, the new refill rate of 4 queries per second (under load) is chosen
rather conservatively, to ensure we are not throttled by the backend API. To
compensate for that, the burst size is raised to 20, meaning the operator can
issue up to 20 operations before throttling to 4 queries per second kicks in.
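A sketch of how the new defaults behave, using golang.org/x/time/rate as a stand-in for the operator's limiter:

```
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// A burst of 20 requests passes immediately; after that, calls are paced at
// 4 per second.
func main() {
	ctx := context.Background()
	limiter := rate.NewLimiter(rate.Limit(4) /* QPS */, 20 /* burst */)

	start := time.Now()
	for i := 1; i <= 30; i++ {
		_ = limiter.Wait(ctx)
		if i == 20 || i == 30 {
			fmt.Printf("request %d after %v\n", i, time.Since(start).Round(100*time.Millisecond))
		}
	}
	// Expected: request 20 completes almost immediately (burst),
	// request 30 only after roughly 10/4 = 2.5 more seconds.
}
```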

It should be noted that Cilium's rate limiter is shared across all operations
(reads and writes), whereas cloud providers have more sophisticated per-operation
quotas. But since the interaction in a real system is likely even more
complicated than that (e.g. the quota is likely also shared with other API
users), we stick to a rather simple and conservative limit by default.

If the new default limit is too low for sophisticated or large-scale users,
it remains possible to override the default values.

For reference, these are the documented API rate limits on supported cloud
providers:

Amazon EC2 API [^1]
------------------

| Type   | Burst | RPS  |
|--------|-------|------|
| Reads  |   100 |   20 |
| Writes |   200 |   10 |

Azure API
---------

*Network API [^2]*

| Type   | Burst*| RPS* |
|--------|-------|------|
| Reads  |  1000 | 3.33 |
| Writes | 10000 | 33.3 |

*Estimated burst refill rate per second based on their 5 minute window.

*Compute API [^3]*

The Compute API does not document what the default limits are, and it seems
like the actual setup is more sophisticated. But the rates that apply overall
seem to be in the same ballpark as the Network API, given the sample response
on the page.

Alibaba ECS API [^4]
--------------------

According to the public documentation, the limits are only visible to logged-in
users. They may vary based on region and subscription. We assume they are
also in the same ballpark as the other two providers.

[^1]: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/throttling.html#throttling-limits
[^2]: https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/request-limits-and-throttling#network-throttling
[^3]: https://learn.microsoft.com/en-us/troubleshoot/azure/virtual-machines/troubleshooting-throttling-errors
[^4]: https://www.alibabacloud.com/help/en/elastic-compute-service/latest/api-throttling

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This commit also updates the documentation to mention that the default
values have changed and to explain how users can use the new Helm values to
restore the previous behavior.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Currently, trigger-related histogram metrics in pkg/ipam use the default
Prometheus histogram buckets. Resync operations in cloud providers like
Azure tend to take a long time, and the current buckets are inadequate to
track changes in behavior. This commit extends the buckets to allow for
measuring longer durations.
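A sketch of the kind of change involved; the metric name and bucket boundaries below are illustrative, not the ones used in pkg/ipam:

```
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// resyncDuration widens the upper bucket range: the default Prometheus
// buckets top out at 10s, while cloud-provider resyncs can run far longer.
var resyncDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "example_ipam_resync_duration_seconds",
	Help:    "Duration of IPAM resync operations.",
	Buckets: []float64{0.5, 1, 2.5, 5, 10, 30, 60, 120, 300, 600},
})

func main() {
	prometheus.MustRegister(resyncDuration)
	start := time.Now()
	// ... perform the resync ...
	resyncDuration.Observe(time.Since(start).Seconds())
}
```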

Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
Currently, we're not capturing the response status and duration for some of
the calls made to Azure. This commit calls ObserveAPICall() with the status
and duration for the operations that are missing them.
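A sketch of the instrumentation pattern; observeAPICall here is a hypothetical stand-in with an assumed (operation, status, duration) signature rather than the actual helper:

```
package main

import (
	"context"
	"fmt"
	"time"
)

// observeAPICall is a hypothetical stand-in for the metrics helper.
func observeAPICall(operation, status string, durationSeconds float64) {
	fmt.Printf("api_call operation=%s status=%s duration=%.3fs\n", operation, status, durationSeconds)
}

// callAzureAPI is a hypothetical stand-in for an Azure SDK call.
func callAzureAPI(ctx context.Context) error { return nil }

// getVirtualMachine records both the response status and the call duration,
// which some Azure operations were previously missing.
func getVirtualMachine(ctx context.Context) error {
	start := time.Now()
	err := callAzureAPI(ctx)
	status := "success"
	if err != nil {
		status = "failure"
	}
	observeAPICall("GetVirtualMachine", status, time.Since(start).Seconds())
	return err
}

func main() {
	_ = getVirtualMachine(context.Background())
}
```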

Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
@jaredledvina jaredledvina force-pushed the jared.ledvina/cherry-picks-for-1-12 branch from 62cd08f to 9bf4d29 on July 12, 2023 16:55
@jaredledvina jaredledvina merged this pull request into v1.12-dd Jul 12, 2023
21 of 33 checks passed