
v1.12 Cherry-picks #514

Merged: 11 commits merged into v1.12-dd on Jul 12, 2023
Conversation

@jaredledvina jaredledvina changed the title from "Add support to fallback from ENI PD if subnet is out of /28 prefixes" to "v1.12 Cherry-picks" on Jul 12, 2023
@jaredledvina jaredledvina marked this pull request as ready for review July 12, 2023 16:30
hemanthmalla and others added 11 commits July 12, 2023 12:53
When a subnet is very fragmented, there might not be enough free
contiguous IPs to make up a /28 block. Without this fallback, new node
creation would be blocked when PD is enabled and the subnet is in this
state. Currently, the operator does not support PD in mixed mode: once a node
has secondary IP addresses attached to the ENI, the operator will not attempt
to allocate any more prefixes.
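A minimal sketch of the fallback decision described above, assuming hypothetical helper names (allocPrefix, allocSecondaryIPs) and error value; this is not the actual operator code:

```
package main

import (
	"errors"
	"fmt"
)

// errNoContiguousBlock stands in for the provider error seen when the subnet
// has no free contiguous /28 range (hypothetical name).
var errNoContiguousBlock = errors.New("no contiguous /28 prefix available")

// allocateForNode prefers prefix delegation, but falls back to individual
// secondary IPs when the subnet is too fragmented. Since mixed mode is not
// supported, a node that already holds secondary IPs never gets prefixes.
func allocateForNode(hasSecondaryIPs bool, allocPrefix, allocSecondaryIPs func() error) error {
	if !hasSecondaryIPs {
		switch err := allocPrefix(); {
		case err == nil:
			return nil
		case !errors.Is(err, errNoContiguousBlock):
			return err
		}
		// Subnet is fragmented: fall through to secondary IP allocation.
	}
	return allocSecondaryIPs()
}

func main() {
	err := allocateForNode(false,
		func() error { return errNoContiguousBlock },
		func() error { fmt.Println("fell back to secondary IPs"); return nil })
	fmt.Println("err:", err)
}
```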

Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
The ENI CNI uses the 7th bit (0x80/0x80) of the mark for PBR. Cilium also
uses the 7th bit to carry the uppermost bit of the security identity in
some cases (e.g. pod-to-pod communication on the same node). The ENI CNI
uses the CONNMARK iptables target to record that the flow came from the
external network and restores the mark on the reply path.

When the ENI CNI "restores" the mark for a new connection (connmark == 0), it
zeros the 7th bit of the mark and thereby changes the identity. This becomes a
problem when Cilium is carrying the identity and the uppermost bit is 1.
The uppermost bit becomes 1 when the ClusterID of the source endpoint is
128-255, since we encode the ClusterID into the uppermost 8 bits of the
identity.
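A small illustration of the collision. The mark packing below is an assumption made only for this example (identity's lower 16 bits in the upper half of the mark, the 0x0F00 identity magic from the rule below in the middle, the identity's upper 8 bits, i.e. the ClusterID, in the low byte); it is not a definition of Cilium's actual encoding:

```
package main

import "fmt"

func main() {
	clusterID := uint32(200)           // ClusterID in the 128-255 range
	identity := clusterID<<16 | 0x1234 // upper 8 bits of the identity = ClusterID

	// Assumed packing of the identity into the packet mark for this example.
	mark := (identity&0xFFFF)<<16 | 0x0F00 | identity>>16
	fmt.Printf("mark %#08x, bit 0x80 set: %v\n", mark, mark&0x80 != 0)

	// CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80 on a new connection
	// (connmark == 0) clears bit 7 of the packet mark ...
	mark &^= 0x80

	// ... which corrupts the identity carried in the mark: ClusterID 200 becomes 72.
	restored := (mark>>16)&0xFFFF | (mark&0xFF)<<16
	fmt.Printf("identity before %#x, after %#x\n", identity, restored)
}
```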

There are two iptables rules that can cause this error.

1. The rule in the CILIUM_PRE_mangle chain (managed by us, introduced in cilium#12770)

```
-A CILIUM_PRE_mangle -i lxc+ -m comment --comment "cilium: primary ENI" -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
```

2. The rule in the PREROUTING chain of the nat table (managed by AWS)

```
-A PREROUTING -m comment --comment "AWS, CONNMARK" -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
```

There are three possible setups affected by this issue:

1. Cilium is running in ENI mode after uninstalling the AWS VPC CNI (e.g. creating an EKS cluster and installing Cilium afterwards)
2. Cilium is running alongside the AWS VPC CNI in chaining mode
3. Cilium is running in ENI mode without the AWS VPC CNI from the beginning (e.g. a self-hosted k8s cluster on EC2 hosts)

In setup 3, we can fix the issue by modifying only rule 1, which is what this
commit focuses on. The fix is to add a mark match so that the rule does not
restore the connmark while we are carrying an identity in the mark; we can
check whether MARK_MAGIC_IDENTITY is set.

```
-A CILIUM_PRE_mangle -i lxc+ -m comment --comment "cilium: primary ENI" -m mark ! --mark 0x0F00/0x0F00 -j CONNMARK --restore-mark --nfmask 0x80 --ctmask 0x80
```

The corresponding IP rule that looks up the main routing table based on a mark value of 0x80 has also been updated to account for this exclusion.

Co-developed-by: Hemanth Malla <hemanth.malla@datadoghq.com>
Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
Signed-off-by: Yutaro Hayakawa <yutaro.hayakawa@isovalent.com>
Suggested-by: Eric Mountain <eric.mountain@datadoghq.com>
Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
Add a MaxRetryInterval option to the ControllerParams struct, which allows
users to configure the maximum error retry interval for each controller.
This can mitigate the thundering-herd problem reported in cilium#21886.
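A minimal sketch of the idea behind the cap, assuming a simple doubling backoff; the names and policy are illustrative, not the actual pkg/controller implementation:

```
package main

import (
	"fmt"
	"time"
)

// retryInterval doubles the base interval per consecutive error, but never
// exceeds maxRetryInterval when one is configured, which keeps many failing
// controllers from backing off to arbitrarily long intervals at once.
func retryInterval(base, maxRetryInterval time.Duration, consecutiveErrors int) time.Duration {
	d := base
	for i := 0; i < consecutiveErrors; i++ {
		d *= 2
		if maxRetryInterval > 0 && d >= maxRetryInterval {
			return maxRetryInterval
		}
	}
	return d
}

func main() {
	for errs := 0; errs <= 8; errs++ {
		fmt.Printf("errors=%d retry in %v\n", errs, retryInterval(time.Second, 30*time.Second, errs))
	}
}
```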

Signed-off-by: ArthurChiao <arthurchiao@hotmail.com>
This fixes a bug in the helpers/rate package (used for rate limiting
cilium-operator access to external cloud APIs such as
AWS/Azure/AlibabaCloud etc). Before this commit, the `Limit` function
would accidentally request two wait tokens instead of one when the rate
limit was hit: once in `Reserve`, and then once again by calling `Wait`.
This effectively halved the bucket refill rate (aka "QPS") under load.

This commit extends the unit tests with a regression test: the old code
fails in the newly added TestRateLimitWait test. The existing test
is extended to validate the expected burst behavior.
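A sketch of the general pitfall, shown with golang.org/x/time/rate as a stand-in for the helpers/rate package rather than the actual Cilium code: reserving a token and then also waiting consumes two tokens per call, roughly halving the effective refill rate.

```
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	ctx := context.Background()
	limiter := rate.NewLimiter(rate.Limit(10), 1) // 10 tokens/s, burst 1

	start := time.Now()
	for i := 0; i < 10; i++ {
		// Buggy pattern: Reserve takes one token, Wait takes another.
		limiter.Reserve()
		_ = limiter.Wait(ctx)
	}
	fmt.Println("reserve+wait:", time.Since(start)) // roughly twice as slow

	start = time.Now()
	for i := 0; i < 10; i++ {
		// Fixed pattern: a single Wait is enough to honor the limit.
		_ = limiter.Wait(ctx)
	}
	fmt.Println("wait only:   ", time.Since(start))
}
```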

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Some calls in the Azure API packages were missing the rate-limit wait
needed to ensure we do not get throttled by the Azure API. This commit
adds the missing checks.
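A sketch of the pattern being added, with a hypothetical stand-in for the Azure SDK call; only the limiter.Wait before the request reflects the change described here.

```
package main

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
)

// callAzureAPI is a hypothetical stand-in for an Azure SDK call.
func callAzureAPI(ctx context.Context) error {
	fmt.Println("issuing Azure API request")
	return nil
}

// listInterfaces blocks on the shared rate limiter before every outbound
// call, so the operator throttles itself instead of being throttled.
func listInterfaces(ctx context.Context, limiter *rate.Limiter) error {
	if err := limiter.Wait(ctx); err != nil { // the previously missing wait
		return err
	}
	return callAzureAPI(ctx)
}

func main() {
	_ = listInterfaces(context.Background(), rate.NewLimiter(rate.Limit(4), 20))
}
```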

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
When provisioning network interfaces and IP addresses for Cilium ENI, Azure or
AlibabaCloud IPAM, the operator has to issue requests to the cloud provider API.
Those requests might be subject to throttling if the operator issues too many
requests in a short time window (which can, for example, happen if many nodes are
added to the cluster at once). Users have reported that the current limits are
not enough to avoid throttling on the cloud provider. If throttling occurs,
some operations (like interface deletion) might fail, causing potential resource
leaks.

Therefore, this commit adjusts the rate limit used by the Cilium operator.
The new limit consists of a bucket size of 20 (also called the burst size) and
a bucket refill rate (called QPS in the config flag) of 4. The old refill rate
of 20 (effectively 10 under load, due to a bug fixed in the previous commit)
was reportedly high enough to be throttled on the Amazon EC2 API under load.

Therefore, the new refill rate of 4 queries per second (under load) is chosen
rather conservatively, to ensure we are not throttled by the backend API. To
compensate for that, the burst size is raised to 20, meaning the operator can
issue up to 20 operations before throttling to 4 queries per second kicks in.
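A sketch of how the new defaults behave, using golang.org/x/time/rate as a stand-in for the operator's limiter:

```
package main

import (
	"context"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// A burst of 20 requests passes immediately; after that, calls are paced at
// 4 per second.
func main() {
	ctx := context.Background()
	limiter := rate.NewLimiter(rate.Limit(4) /* QPS */, 20 /* burst */)

	start := time.Now()
	for i := 1; i <= 30; i++ {
		_ = limiter.Wait(ctx)
		if i == 20 || i == 30 {
			fmt.Printf("request %d after %v\n", i, time.Since(start).Round(100*time.Millisecond))
		}
	}
	// Expected: request 20 completes almost immediately (burst),
	// request 30 only after roughly 10/4 = 2.5 more seconds.
}
```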

It should be noted that Cilium's rate limiter is shared across all operations
(reads and writes), whereas cloud providers have more sophisticated per-operation
quotas. But since the interaction in a real system is likely even more
complicated than that (e.g. the quota is likely also shared with other API
users), we stick to a rather simple and conservative limit by default.

If the new default limit is too low for sophisticated or large-scale users,
it remains possible to override the default values.

For reference, these are the documented API rate limits on supported cloud
providers:

Amazon EC2 API [^1]
------------------

| Type   | Burst | RPS  |
|--------|-------|------|
| Reads  |   100 |   20 |
| Writes |   200 |   10 |

Azure API
---------

*Network API [^2]*

| Type   | Burst*| RPS* |
|--------|-------|------|
| Reads  |  1000 | 3.33 |
| Writes | 10000 | 33.3 |

*Estimated burst refill rate per second based on their 5 minute window.

*Compute API [^3]*

The Compute API does not document what the default limits are, and it seems
like the actual setup is more sophisticated. But the rates that apply overall
seem to be in the same ballpark as the Network API, given the sample response
on the page.

Alibaba ECS API [^4]
--------------------

According to the public documentation, the limits are only visible to logged-in
users. They may vary based on region and subscription. We assume they are
also in the same ballpark as the other two providers.

[^1]: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/throttling.html#throttling-limits
[^2]: https://learn.microsoft.com/en-us/azure/azure-resource-manager/management/request-limits-and-throttling#network-throttling
[^3]: https://learn.microsoft.com/en-us/troubleshoot/azure/virtual-machines/troubleshooting-throttling-errors
[^4]: https://www.alibabacloud.com/help/en/elastic-compute-service/latest/api-throttling

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
This commit also updates the documentation to mention that the default
values have changed and to explain how users can use the new Helm values to
restore the previous behavior.

Signed-off-by: Sebastian Wicki <sebastian@isovalent.com>
Currently, trigger-related histogram metrics in pkg/ipam use the default
Prometheus histogram buckets. Resync operations in cloud providers like
Azure tend to take a long time, and the current buckets are inadequate to
track changes in behavior. This commit extends the buckets to allow for
measuring longer durations.
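A sketch of the kind of change involved; the metric name and bucket boundaries below are illustrative, not the ones used in pkg/ipam:

```
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// resyncDuration widens the upper bucket range: the default Prometheus
// buckets top out at 10s, while cloud-provider resyncs can run far longer.
var resyncDuration = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "example_ipam_resync_duration_seconds",
	Help:    "Duration of IPAM resync operations.",
	Buckets: []float64{0.5, 1, 2.5, 5, 10, 30, 60, 120, 300, 600},
})

func main() {
	prometheus.MustRegister(resyncDuration)
	start := time.Now()
	// ... perform the resync ...
	resyncDuration.Observe(time.Since(start).Seconds())
}
```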

Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
Currently, we're not capturing the response status and duration for some of
the calls made to Azure. This commit calls ObserveAPICall() with the status
and duration for the operations that are missing them.
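A sketch of the instrumentation pattern; observeAPICall here is a hypothetical stand-in with an assumed (operation, status, duration) signature rather than the actual helper:

```
package main

import (
	"context"
	"fmt"
	"time"
)

// observeAPICall is a hypothetical stand-in for the metrics helper.
func observeAPICall(operation, status string, durationSeconds float64) {
	fmt.Printf("api_call operation=%s status=%s duration=%.3fs\n", operation, status, durationSeconds)
}

// callAzureAPI is a hypothetical stand-in for an Azure SDK call.
func callAzureAPI(ctx context.Context) error { return nil }

// getVirtualMachine records both the response status and the call duration,
// which some Azure operations were previously missing.
func getVirtualMachine(ctx context.Context) error {
	start := time.Now()
	err := callAzureAPI(ctx)
	status := "success"
	if err != nil {
		status = "failure"
	}
	observeAPICall("GetVirtualMachine", status, time.Since(start).Seconds())
	return err
}

func main() {
	_ = getVirtualMachine(context.Background())
}
```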

Signed-off-by: Hemanth Malla <hemanth.malla@datadoghq.com>
@jaredledvina jaredledvina force-pushed the jared.ledvina/cherry-picks-for-1-12 branch from 62cd08f to 9bf4d29 on July 12, 2023 16:55
@jaredledvina jaredledvina merged this pull request into v1.12-dd Jul 12, 2023
21 of 33 checks passed