From 7fc43f98b4eecddae86fdf19ec2fa9a8cf54dd7b Mon Sep 17 00:00:00 2001
From: Evan Baker
Date: Tue, 18 Oct 2022 01:20:33 +0000
Subject: [PATCH 1/2] feature proposal: subnet scarcity phase 1

Signed-off-by: Evan Baker
---
 .../subnet-scarcity/phase-1/2-exhaustion.md   | 39 +++++++++++++++++-
 .../subnet-scarcity/phase-1/3-releaseips.md   | 23 ++++++++++-
 .../subnet-scarcity/phase-2/3-subnetscaler.md | 41 +++++++++++++------
 .../subnet-scarcity/phase-3/1-watchpods.md    |  4 --
 docs/feature/subnet-scarcity/proposal.md      | 22 ++--------
 5 files changed, 91 insertions(+), 38 deletions(-)
 delete mode 100644 docs/feature/subnet-scarcity/phase-3/1-watchpods.md

diff --git a/docs/feature/subnet-scarcity/phase-1/2-exhaustion.md b/docs/feature/subnet-scarcity/phase-1/2-exhaustion.md
index 1333ed77b7..41889d1116 100644
--- a/docs/feature/subnet-scarcity/phase-1/2-exhaustion.md
+++ b/docs/feature/subnet-scarcity/phase-1/2-exhaustion.md
@@ -1 +1,38 @@
-TODO
+# DNC-RC watches and reacts to exhaustion [[Phase 1 Design]](../proposal.md#1-2-subnet-exhaustion-is-calculated-by-dnc-rc)
+
+DNC-RC will poll the `SubnetState` API to periodically check the Subnet utilization. DNC-RC will be configured with a lower and upper threshold ( $T_l$ and $T_u$ ) as percentages of the Subnet capacity $Q$. If the Subnet utilization crosses the upper threshold, DNC-RC will consider the Subnet "exhausted". If the Subnet utilization then falls below the lower threshold, DNC-RC will consider the Subnet "not exhausted". Two values are necessary to induce hysteresis and minimize oscillation between the two states.
+
+$$
+E = \neg E \text{ when}\begin{cases}
+R \gt T \times Q &\text{if } \neg E\\
+R \lt t \times Q &\text{if } E
+\end{cases}
+$$
+
+> Note: $\neg$ is the negation operator.
+
+If the Subnet is exhausted, DNC-RC will write an additional, per-subnet CRD, the [`ClusterSubnetState`](https://github.com/Azure/azure-container-networking/blob/master/crd/clustersubnetstate/api/v1alpha1/clustersubnetstate.go), with a Status of `exhausted=true`. When the Subnet is un-exhausted, DNC-RC will write the Status as `exhausted=false`.
+
+```yaml
+apiVersion: acn.azure.com/v1alpha1
+kind: ClusterSubnet
+metadata:
+  name: subnet
+  namespace: kube-system
+status:
+  exhausted: true
+  timestamp: 123456789
+```
+
+```mermaid
+sequenceDiagram
+participant Kubernetes
+participant RC
+participant DNC
+loop
+RC->>+DNC: Query Subnet Utilization
+DNC->>-RC: Utilization
+RC->>RC: Calculate Exhaustion
+RC->>Kubernetes: Write Exhaustion to ClusterSubnet CRD
+end
+```
diff --git a/docs/feature/subnet-scarcity/phase-1/3-releaseips.md b/docs/feature/subnet-scarcity/phase-1/3-releaseips.md
index 1333ed77b7..ee45479da2 100644
--- a/docs/feature/subnet-scarcity/phase-1/3-releaseips.md
+++ b/docs/feature/subnet-scarcity/phase-1/3-releaseips.md
@@ -1 +1,22 @@
-TODO
+# CNS releases IPs back to Exhausted Subnets [[Phase 1 Design]](../proposal.md#1-3-ips-are-released-by-cns)
+
+CNS will watch the `ClusterSubnetState` CRD and will update its internal state with the Subnet's exhaustion status. When the Subnet is exhausted, CNS will ignore the configured Batch size from the `NodeNetworkConfig`, and internally will use a Batch size of $1$. As the IPAM Pool Monitor reconciles the Pool, the changes to the Batch size will get picked up and applied to the subsequent Pool Scaling and target `RequestedIPCount`.
+
+```mermaid
+sequenceDiagram
+participant IPAM Pool Monitor
+participant ClusterSubnet Watcher
+participant Kubernetes
+Kubernetes->>ClusterSubnet Watcher: ClusterSubnet Update
+alt Exhausted
+ClusterSubnet Watcher->>IPAM Pool Monitor: Batch size = 1
+else Un-exhausted
+ClusterSubnet Watcher->>IPAM Pool Monitor: Batch size = 16
+end
+loop
+IPAM Pool Monitor->>IPAM Pool Monitor: Recalculate RequestedIPCount
+Note right of IPAM Pool Monitor: Request = Batch * X
+IPAM Pool Monitor->>Kubernetes: Update NodeNetworkConfig CRD Spec
+Kubernetes->>IPAM Pool Monitor: Update NodeNetworkConfig CRD Status
+end
+```
diff --git a/docs/feature/subnet-scarcity/phase-2/3-subnetscaler.md b/docs/feature/subnet-scarcity/phase-2/3-subnetscaler.md
index bd00f09493..8de81606be 100644
--- a/docs/feature/subnet-scarcity/phase-2/3-subnetscaler.md
+++ b/docs/feature/subnet-scarcity/phase-2/3-subnetscaler.md
@@ -14,24 +14,39 @@ Since the Scaler values are dependent on the state of the Subnet, the Scaler obj
 ### ClusterSubnet Scaler
 The ClusterSubnet `Status.Scaler` definition will be:
-```yaml
-...
-status:
-  scaler:
-    batch: X // equal to batchSize
-    buffer: X // equal to requestThresholdPercent
+```diff
+ apiVersion: acn.azure.com/v1alpha1
+ kind: ClusterSubnet
+ metadata:
+   name: subnet
+   namespace: kube-system
+ status:
+   exhausted: true
+   timestamp: 123456789
++  scaler:
++    batch: 16
++    buffer: 0.5
 ```

 Additionally, the `Spec` of the ClusterSubnet will accept `Scaler` values to be used as runtime overrides. DNC-RC will read and validate the `Spec`, then write the values back out to the `Status` if present.
-```yaml
-...
-spec:
-  scaler:
-    <...>
+```diff
+ apiVersion: acn.azure.com/v1alpha1
+ kind: ClusterSubnet
+ metadata:
+   name: subnet
+   namespace: kube-system
+ spec:
++  scaler:
++    batch: 8
++    buffer: 0.25
+ status:
+   exhausted: true
+   timestamp: 123456789
++  scaler:
++    batch: 8
++    buffer: 0.25
 ```
-
-
 Note:
 - The `scaler.maxIPCount` will not be migrated, as the maxIPCount is a property of the Node and not the Subnet.
 - The `scaler.releaseThresholdPercent` will not be migrated, as it is redundant. The `buffer` (and in fact the `requestThresholdPercent`), imply a `releaseThresholdPercent` and one does not need to be specified explicitly. The [IPAM Scaling Math](../phase-2/2-scalingmath.md) incorporates only a single threshold value and fully describes the behavior of the system.
diff --git a/docs/feature/subnet-scarcity/phase-3/1-watchpods.md b/docs/feature/subnet-scarcity/phase-3/1-watchpods.md
deleted file mode 100644
index 04b3df8cb1..0000000000
--- a/docs/feature/subnet-scarcity/phase-3/1-watchpods.md
+++ /dev/null
@@ -1,4 +0,0 @@
-CNS current IPAM solution is reactive: it waits for the CNI to request (or release) an IP address for a Pod, and attempts to honor that request out of the current IP Pool. Because of this, requests are necessarily serial, and this creates a scaling bottleneck.
-When a Pod is created, the CNI will call with a request to assign an IP. If CNS is out of IPs and cannot honor that request, the CNI will return an error to the CRI, which will follow up by tearing down that Pod sandbox and starting over. Because of this stateless retrying, CNS can only reliable understand that it needs _at least one more_ IP, because it is impossible to tell if subsequent requests are retries for the same Pod, or many different Pods. If _many_ Pods have been scheduled, CNS will still only request a single additional batch of IPs, and assign those IPs one at a time until it runs out, then request a single additional batch of IPs...
-
-A more predictive method of IP Pool scaling will be added to CNS: CNS will watch Pods for its Node, and will request/release IPs immediately based on the number of Pods scheduled. The Batching behavior will be unchanged, and CNS will continue to request IPs in Batches $B$ based on the local IP usage.
diff --git a/docs/feature/subnet-scarcity/proposal.md b/docs/feature/subnet-scarcity/proposal.md
index 932f815018..aa4f7f9c35 100644
--- a/docs/feature/subnet-scarcity/proposal.md
+++ b/docs/feature/subnet-scarcity/proposal.md
@@ -49,19 +49,10 @@ DNC (which maintains the state of the Subnet in its database) will cache the res
 per Subnet. DNC will also expose an API to query $R$ of the Subnet, the `SubnetState` API.

 #### [[1-2]](phase-1/2-exhaustion.md) Subnet Exhaustion is calculated by DNC-RC
-DNC-RC will poll the `SubnetState` API to periodically check the Subnet utilization. DNC-RC will be configured with a lower and upper threshold ( $t$ and $T$ ) as fractions of the Subnet capacity $Q$. If the Subnet utilization crosses the upper threshold, DNC-RC will consider the Subnet "exhausted". If the Subnet utilization falls below the lower threshold, DNC-RC will consider the Subnet "not exhausted". Two values are necessary to induce hysteresis and avoid continous oscillation between the two states.
-
-$$
-E = !E \text{(toggle exhaustion) when}\begin{cases}
-R \gt T \times Q &\text{if not exhausted}\\
-R \lt t \times Q &\text{if exhausted}
-\end{cases}
-$$
-
-If the Subnet is exhausted, DNC-RC will write an additional per-subnet CRD, the [`ClusterSubnetState`](https://github.com/Azure/azure-container-networking/blob/master/crd/clustersubnetstate/api/v1alpha1/clustersubnetstate.go), with a Status of `exhausted=true`. When the Subnet is not exhausted, DNC-RC will write the Status as `exhausted=false`.
+DNC-RC will DNC's SubnetState API on an interval to check the Subnet Utilization. If the Subnet Utilization crosses some configurable lower and upper thresholds, RC will consider that Subnet un-exhausted or exhausted, respectively, and will write the exhaustion state to the ClusterSubnet CRD.

 #### [[1-3]](phase-1/3-releaseips.md) IPs are released by CNS
-CNS will watch the `ClusterSubnetState` CRD and will update its internal state with the Subnet's exhaustion status. When the Subnet is exhausted, CNS will ignore the configured Batch size from the `NodeNetworkConfig`, and instead will scale in Batches of 1 IP. This will have the effect of releasing almost every unassigned IP back to the Subnet - 1 free IP will be kept in the Node's IPAM Pool, and scaling up or down will be done in increments of 1 IP.
+CNS will watch the `ClusterSubnet` CRD, scaling down and releasing IPs when the Subnet is marked as Exhausted.

 ### Phase 2
 The batch size $B$ is dynamically adjusted based on the current subnet utilization. The batch size is increased when the subnet utilization is low, and decreased when the subnet utilization is high. IPs are not assigned to a new Node until CNS requests them, allowing Nodes to start safely even in very constrained subnets.
@@ -69,7 +60,7 @@ The batch size $B$ is dynamically adjusted based on the current subnet utilizati
 #### [[2-1]](phase-2/1-emptync.md) DNC-RC creates NCs with no Secondary IPs
 DNC-RC will create the NNC for a new Node with an initial IP Request of 0. An empty NC (containing a Primary, but no Secondary IPs) will be created via normal DNC API calls. The empty NC will be written to the NNC, allowing CNS to start. CNS will make the initial IP request according to the Subnet Exhaustion State.

-DNC-RC will continue to poll the `SubnetState` API periodically to check the Subnet utilization, and write the exhaustion to the `ClusterSubnetState` CRD.
+DNC-RC will continue to poll the `SubnetState` API periodically to check the Subnet utilization, and write the exhaustion to the `ClusterSubnet` CRD.
 #### [[2-2]](phase-2/2-scalingmath.md) CNS scales IPAM pool idempotently
 Instead of increasing/decreasing the Pool size by 1 Batch at a time to try to satisfy the min/max free IP constraints, CNS will calculate the correct target Requested IP Count using a single O(1) algorithm.
@@ -86,10 +77,3 @@ CNS will include the NC Primary IP(s) as IPs that it has been allocated, and wil
 #### [[2-3]](phase-2/3-subnetscaler.md) Scaler properties move to the ClusterSubnet CRD
 The Scaler properties from the v1alpha/NodeNetworkConfig `Status.Scaler` definition are moved to the ClusterSubnet CRD, and CNS will use the Scaler from this CRD as priority when it is available, and fall back to the NNC Scaler otherwise. The `.Spec` field of the CRD may serve as an "overrides" location for runtime reconfiguration.
-
-### Phase 3
-#### [[3-1]](phase-3/1-watchpods.md) CNS watches Pods
-
-
-#### CNS stops watching the ClusterSubnetState
-#### DNC-RC iteratively adjusts the Batch size

From 557199f03ef817ab6dd42e9d4205152294f6d139 Mon Sep 17 00:00:00 2001
From: Evan Baker
Date: Tue, 18 Oct 2022 19:39:17 +0000
Subject: [PATCH 2/2] address review comments

Signed-off-by: Evan Baker
---
 docs/feature/subnet-scarcity/phase-1/2-exhaustion.md | 6 +++---
 docs/feature/subnet-scarcity/proposal.md             | 2 +-
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/docs/feature/subnet-scarcity/phase-1/2-exhaustion.md b/docs/feature/subnet-scarcity/phase-1/2-exhaustion.md
index 41889d1116..2dccdcd1c8 100644
--- a/docs/feature/subnet-scarcity/phase-1/2-exhaustion.md
+++ b/docs/feature/subnet-scarcity/phase-1/2-exhaustion.md
@@ -1,11 +1,11 @@
 # DNC-RC watches and reacts to exhaustion [[Phase 1 Design]](../proposal.md#1-2-subnet-exhaustion-is-calculated-by-dnc-rc)

-DNC-RC will poll the `SubnetState` API to periodically check the Subnet utilization. DNC-RC will be configured with a lower and upper threshold ( $T_l$ and $T_u$ ) as percentages of the Subnet capacity $Q$. If the Subnet utilization crosses the upper threshold, DNC-RC will consider the Subnet "exhausted". If the Subnet utilization then falls below the lower threshold, DNC-RC will consider the Subnet "not exhausted". Two values are necessary to induce hysteresis and minimize oscillation between the two states.
+DNC-RC will poll the `SubnetState` API to periodically check the Subnet utilization. DNC-RC will be configured with a lower and upper threshold ( $T_l$ and $T_u$ ) as percentages of the Subnet capacity $C$. If the Subnet utilization $U$ crosses the upper threshold, DNC-RC will consider the Subnet "exhausted". If the Subnet utilization then falls below the lower threshold, DNC-RC will consider the Subnet "not exhausted". Two values are necessary to induce hysteresis and minimize oscillation between the two states.

 $$
 E = \neg E \text{ when}\begin{cases}
-R \gt T \times Q &\text{if } \neg E\\
-R \lt t \times Q &\text{if } E
+U \gt T_u \times C &\text{if } E \text{ is false}\\
+U \lt T_l \times C &\text{if } E \text{ is true}
 \end{cases}
 $$

diff --git a/docs/feature/subnet-scarcity/proposal.md b/docs/feature/subnet-scarcity/proposal.md
index aa4f7f9c35..a79e9d6061 100644
--- a/docs/feature/subnet-scarcity/proposal.md
+++ b/docs/feature/subnet-scarcity/proposal.md
@@ -49,7 +49,7 @@ DNC (which maintains the state of the Subnet in its database) will cache the res
 per Subnet. DNC will also expose an API to query $R$ of the Subnet, the `SubnetState` API.

 #### [[1-2]](phase-1/2-exhaustion.md) Subnet Exhaustion is calculated by DNC-RC
-DNC-RC will DNC's SubnetState API on an interval to check the Subnet Utilization. If the Subnet Utilization crosses some configurable lower and upper thresholds, RC will consider that Subnet un-exhausted or exhausted, respectively, and will write the exhaustion state to the ClusterSubnet CRD.
+DNC-RC will poll DNC's SubnetState API on a fixed interval to check the Subnet Utilization. If the Subnet Utilization crosses some configurable lower and upper thresholds, RC will consider that Subnet un-exhausted or exhausted, respectively, and will write the exhaustion state to the ClusterSubnet CRD.

 #### [[1-3]](phase-1/3-releaseips.md) IPs are released by CNS
 CNS will watch the `ClusterSubnet` CRD, scaling down and releasing IPs when the Subnet is marked as Exhausted.
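
Reviewer note (illustration only, not part of either patch): the Phase 1 behavior described in `2-exhaustion.md` and `3-releaseips.md` can be sketched in a few lines. This is a minimal sketch under stated assumptions: `ExhaustionTracker` and `effective_batch_size` are hypothetical names that do not exist in DNC-RC or CNS, the thresholds are expressed as fractions of capacity rather than percentages, and the sample values (0.5, 0.8, a capacity of 100) are made up. The configured Batch size of 16 is taken from the sequence diagram in `3-releaseips.md`.

```python
from dataclasses import dataclass


@dataclass
class ExhaustionTracker:
    """Hypothetical helper, not DNC-RC code. Thresholds are fractions of capacity."""

    lower: float  # lower threshold T_l (assumed value supplied by the caller)
    upper: float  # upper threshold T_u (assumed value supplied by the caller)
    exhausted: bool = False

    def update(self, utilization: int, capacity: int) -> bool:
        """Apply the hysteresis rule to one polled sample and return the new state."""
        if not self.exhausted and utilization > self.upper * capacity:
            # Crossed the upper threshold while not exhausted: mark exhausted.
            self.exhausted = True
        elif self.exhausted and utilization < self.lower * capacity:
            # Fell below the lower threshold while exhausted: clear the flag.
            self.exhausted = False
        return self.exhausted


def effective_batch_size(configured_batch: int, exhausted: bool) -> int:
    """Batch size CNS would scale with: 1 while exhausted, else the configured value."""
    return 1 if exhausted else configured_batch


# Made-up utilization samples against a Subnet with capacity 100.
tracker = ExhaustionTracker(lower=0.5, upper=0.8)
capacity = 100
samples = [70, 85, 75, 60, 45, 70]
states = [tracker.update(u, capacity) for u in samples]
batches = [effective_batch_size(16, s) for s in states]
print(states)
print(batches)
```

Samples between the two thresholds (75 and 60 here) leave the state unchanged, which is the hysteresis the proposal relies on to avoid flapping between `exhausted=true` and `exhausted=false`.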