Azure · rbtr · Oct 18, 2022 · Oct 18, 2022 · Oct 18, 2022
@@ -1 +1,38 @@
-TODO
+# DNC-RC watches and reacts to exhaustion [[Phase 1 Design]](../proposal.md#1-2-subnet-exhaustion-is-calculated-by-dnc-rc)
+
+DNC-RC will poll the `SubnetState` API periodically to check the Subnet utilization. DNC-RC will be configured with a lower and an upper threshold ( $T_l$ and $T_u$ ), expressed as fractions of the Subnet capacity $C$. If the Subnet utilization $U$ rises above the upper threshold, DNC-RC will consider the Subnet "exhausted". If the Subnet utilization then falls below the lower threshold, DNC-RC will consider the Subnet "not exhausted". Two thresholds are necessary to induce hysteresis and minimize oscillation between the two states.
+
+$$
+E = \neg E \text{ when}\begin{cases}
+U \gt T_u \times C &\text{if } E \text{ is false}\\
+U \lt T_l \times C &\text{if } E \text{ is true}
+\end{cases}
+$$
+
+> Note: $\neg$ is the negation operator.
+
+If the Subnet is exhausted, DNC-RC will write an additional, per-subnet CRD, the [`ClusterSubnetState`](https://github.com/Azure/azure-container-networking/blob/master/crd/clustersubnetstate/api/v1alpha1/clustersubnetstate.go), with a Status of `exhausted=true`. When the Subnet is un-exhausted, DNC-RC will write the Status as `exhausted=false`.
+
+```yaml
+apiVersion: acn.azure.com/v1alpha1
+kind: ClusterSubnet
+metadata:
+    name: subnet
+    namespace: kube-system
+status:
+    exhausted: true
+    timestamp: 123456789
+```
+
+```mermaid
+sequenceDiagram
+participant Kubernetes
+participant RC
+participant DNC
+loop
+RC->>+DNC: Query Subnet Utilization
+DNC->>-RC: Utilization
+RC->>RC: Calculate Exhaustion
+RC->>Kubernetes: Write Exhaustion to ClusterSubnet CRD
+end
+```
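The two-threshold rule in `2-exhaustion.md` can be sketched in Go as below. This is a minimal illustration, not DNC-RC's implementation: the function name, parameter names, and the example threshold values are all hypothetical, and the thresholds are assumed to be fractions of capacity.

```go
package main

// nextExhausted applies the hysteresis rule to a single poll result.
// utilization and capacity are IP counts; lower and upper are the thresholds
// expressed as fractions of capacity (assumed form, e.g. 0.5 and 0.9).
// All names here are hypothetical and not taken from DNC-RC.
func nextExhausted(exhausted bool, utilization, capacity int, lower, upper float64) bool {
	u := float64(utilization)
	c := float64(capacity)
	switch {
	case !exhausted && u > upper*c:
		// Not exhausted and above the upper threshold: become exhausted.
		return true
	case exhausted && u < lower*c:
		// Exhausted and below the lower threshold: recover.
		return false
	default:
		// Between the thresholds the previous state is kept, which is
		// what prevents oscillation around a single cutoff.
		return exhausted
	}
}
```

With example thresholds of 0.5 and 0.9 on a 100-IP Subnet, a utilization of 70 leaves the state unchanged in either direction; only crossing 90 upward or 50 downward flips it.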
@@ -1 +1,22 @@
-TODO
+# CNS releases IPs back to Exhausted Subnets [[Phase 1 Design]](../proposal.md#1-3-ips-are-released-by-cns)
+
+CNS will watch the `ClusterSubnetState` CRD and will update its internal state with the Subnet's exhaustion status. When the Subnet is exhausted, CNS will ignore the configured Batch size from the `NodeNetworkConfig`, and internally will use a Batch size of $1$. As the IPAM Pool Monitor reconciles the Pool, the changes to the Batch size will get picked up and applied to the subsequent Pool Scaling and target `RequestedIPCount`.
+
+```mermaid
+sequenceDiagram
+participant IPAM Pool Monitor
+participant ClusterSubnet Watcher
+participant Kubernetes
+Kubernetes->>ClusterSubnet Watcher: ClusterSubnet Update
+alt Exhausted
+ClusterSubnet Watcher->>IPAM Pool Monitor: Batch size = 1
+else Un-exhausted
+ClusterSubnet Watcher->>IPAM Pool Monitor: Batch size = 16
+end
+loop
+IPAM Pool Monitor->>IPAM Pool Monitor: Recalculate RequestedIPCount
+Note right of IPAM Pool Monitor: Request = Batch * X
+IPAM Pool Monitor->>Kubernetes: Update NodeNetworkConfig CRD Spec
+Kubernetes->>IPAM Pool Monitor: Update NodeNetworkConfig CRD Status
+end
+```
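The Batch size override in `3-releaseips.md` amounts to a small selection step ahead of the Pool Monitor's reconcile. A minimal sketch, assuming a hypothetical helper name and signature (neither is taken from CNS):

```go
package main

// effectiveBatch returns the Batch size the IPAM Pool Monitor should scale
// with. While the Subnet is exhausted, the Batch size configured in the
// NodeNetworkConfig is ignored and 1 is used instead.
// The function name and signature are hypothetical.
func effectiveBatch(configuredBatch int, subnetExhausted bool) int {
	if subnetExhausted {
		return 1
	}
	return configuredBatch
}
```

Because the Pool Monitor re-reads this value on every reconcile, a change in exhaustion state takes effect on the next Pool Scaling pass without further coordination.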
@@ -14,24 +14,39 @@ Since the Scaler values are dependent on the state of the Subnet, the Scaler obj
 
 ### ClusterSubnet Scaler
 The ClusterSubnet `Status.Scaler` definition will be: 
-```yaml
-...
-status:
-    scaler:
-        batch: X // equal to batchSize
-        buffer: X // equal to requestThresholdPercent
+```diff
+    apiVersion: acn.azure.com/v1alpha1
+    kind: ClusterSubnet
+    metadata:
+        name: subnet
+        namespace: kube-system
+    status:
+        exhausted: true
+        timestamp: 123456789
++       scaler:
++           batch: 16
++           buffer: 0.5 
 ```
 
 Additionally, the `Spec` of the ClusterSubnet will accept `Scaler` values to be used as runtime overrides. DNC-RC will read and validate the `Spec`, then write the values back out to the `Status` if present.
-```yaml
-...
-spec:
-    scaler:
-        <...>
+```diff
+    apiVersion: acn.azure.com/v1alpha1
+    kind: ClusterSubnet
+    metadata:
+        name: subnet
+        namespace: kube-system
+    spec:
++       scaler:
++           batch: 8
++           buffer: 0.25
+    status:
+        exhausted: true
+        timestamp: 123456789
++       scaler:
++           batch: 8
++           buffer: 0.25
 ```
 
-
-
 Note: 
 - The `scaler.maxIPCount` will not be migrated, as the maxIPCount is a property of the Node and not the Subnet.
 - The `scaler.releaseThresholdPercent` will not be migrated, as it is redundant. The `buffer` (and, in fact, the `requestThresholdPercent`) implies a `releaseThresholdPercent`, so one does not need to be specified explicitly. The [IPAM Scaling Math](../phase-2/2-scalingmath.md) incorporates only a single threshold value and fully describes the behavior of the system.
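
The Spec-to-Status flow for Scaler overrides could look roughly like the Go sketch below. The type, the function, and especially the validation rules (Batch of at least 1, buffer between 0 and 1, invalid overrides rejected in favor of the current values) are assumptions made for illustration; the document does not define them.

```go
package main

import "fmt"

// Scaler mirrors the batch and buffer fields shown in the YAML above.
// The Go type itself is hypothetical.
type Scaler struct {
	Batch  int
	Buffer float64
}

// resolveScaler returns the Scaler that would be written to Status.
// A nil override means the Spec carries no Scaler, so current is kept.
// The validation bounds are assumed, not taken from DNC-RC.
func resolveScaler(current Scaler, override *Scaler) (Scaler, error) {
	if override == nil {
		return current, nil
	}
	if override.Batch < 1 {
		return current, fmt.Errorf("batch must be at least 1, got %d", override.Batch)
	}
	if override.Buffer < 0 || override.Buffer > 1 {
		return current, fmt.Errorf("buffer must be within [0, 1], got %v", override.Buffer)
	}
	return *override, nil
}
```

Returning the current values alongside the error keeps the Status usable when an override is rejected; whether DNC-RC should behave this way is a design choice the document leaves open.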

@@ -49,27 +49,18 @@ DNC (which maintains the state of the Subnet in its database) will cache the res
 per Subnet. DNC will also expose an API to query $R$ of the Subnet, the `SubnetState` API.
 
 #### [[1-2]](phase-1/2-exhaustion.md) Subnet Exhaustion is calculated by DNC-RC
-DNC-RC will poll the `SubnetState` API to periodically check the Subnet utilization. DNC-RC will be configured with a lower and upper threshold ( $t$ and $T$ ) as fractions of the Subnet capacity $Q$. If the Subnet utilization crosses the upper threshold, DNC-RC will consider the Subnet "exhausted". If the Subnet utilization falls below the lower threshold, DNC-RC will consider the Subnet "not exhausted". Two values are necessary to induce hysteresis and avoid continous oscillation between the two states.
-
-$$
-E = !E \text{(toggle exhaustion) when}\begin{cases}
-R \gt T \times Q &\text{if not exhausted}\\
-R \lt t \times Q &\text{if exhausted}
-\end{cases}
-$$
-
-If the Subnet is exhausted, DNC-RC will write an additional per-subnet CRD, the [`ClusterSubnetState`](https://github.com/Azure/azure-container-networking/blob/master/crd/clustersubnetstate/api/v1alpha1/clustersubnetstate.go), with a Status of `exhausted=true`. When the Subnet is not exhausted, DNC-RC will write the Status as `exhausted=false`.
+DNC-RC will poll DNC's `SubnetState` API on a fixed interval to check the Subnet utilization. When the utilization rises above a configurable upper threshold, DNC-RC will consider the Subnet exhausted; when it then falls below a configurable lower threshold, DNC-RC will consider the Subnet un-exhausted. DNC-RC will write the exhaustion state to the `ClusterSubnet` CRD.
 
 #### [[1-3]](phase-1/3-releaseips.md) IPs are released by CNS
-CNS will watch the `ClusterSubnetState` CRD and will update its internal state with the Subnet's exhaustion status. When the Subnet is exhausted, CNS will ignore the configured Batch size from the `NodeNetworkConfig`, and instead will scale in Batches of 1 IP. This will have the effect of releasing almost every unassigned IP back to the Subnet - 1 free IP will be kept in the Node's IPAM Pool, and scaling up or down will be done in increments of 1 IP.
+CNS will watch the `ClusterSubnet` CRD, scaling down and releasing IPs when the Subnet is marked as Exhausted.
 
 ### Phase 2
 The batch size $B$ is dynamically adjusted based on the current subnet utilization. The batch size is increased when the subnet utilization is low, and decreased when the subnet utilization is high. IPs are not assigned to a new Node until CNS requests them, allowing Nodes to start safely even in very constrained subnets.
 
 #### [[2-1]](phase-2/1-emptync.md) DNC-RC creates NCs with no Secondary IPs
 DNC-RC will create the NNC for a new Node with an initial IP Request of 0. An empty NC (containing a Primary, but no Secondary IPs) will be created via normal DNC API calls. The empty NC will be written to the NNC, allowing CNS to start. CNS will make the initial IP request according to the Subnet Exhaustion State.
 
-DNC-RC will continue to poll the `SubnetState` API periodically to check the Subnet utilization, and write the exhaustion to the `ClusterSubnetState` CRD.
+DNC-RC will continue to poll the `SubnetState` API periodically to check the Subnet utilization, and write the exhaustion to the `ClusterSubnet` CRD.
 
 #### [[2-2]](phase-2/2-scalingmath.md) CNS scales IPAM pool idempotently
 Instead of increasing/decreasing the Pool size by 1 Batch at a time to try to satisfy the min/max free IP constraints, CNS will calculate the correct target Requested IP Count using a single O(1) algorithm.
@@ -86,10 +77,3 @@ CNS will include the NC Primary IP(s) as IPs that it has been allocated, and wil
 
 #### [[2-3]](phase-2/3-subnetscaler.md) Scaler properties move to the ClusterSubnet CRD
 The Scaler properties from the v1alpha/NodeNetworkConfig `Status.Scaler` definition are moved to the ClusterSubnet CRD. CNS will prefer the Scaler from this CRD when it is available and fall back to the NNC Scaler otherwise. The `.Spec` field of the CRD may serve as an "overrides" location for runtime reconfiguration.
-
-### Phase 3
-#### [[3-1]](phase-3/1-watchpods.md) CNS watches Pods
-
-
-#### CNS stops watching the ClusterSubnetState
-#### DNC-RC iteratively adjusts the Batch size
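The Scaler precedence described in 2-3 (ClusterSubnet first, NNC as fallback) can be sketched on the CNS side as follows. The type and function names are hypothetical, and a nil pointer is assumed to stand in for "the ClusterSubnet CRD carries no Scaler".

```go
package main

// Scaler holds the batch and buffer values read from either CRD.
// The Go type is hypothetical.
type Scaler struct {
	Batch  int
	Buffer float64
}

// selectScaler returns the ClusterSubnet Scaler when one is available and
// falls back to the NodeNetworkConfig Scaler otherwise.
func selectScaler(clusterSubnet *Scaler, nnc Scaler) Scaler {
	if clusterSubnet != nil {
		return *clusterSubnet
	}
	return nnc
}
```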