Merged
39 changes: 38 additions & 1 deletion docs/feature/subnet-scarcity/phase-1/2-exhaustion.md
@@ -1 +1,38 @@
TODO
# DNC-RC watches and reacts to exhaustion [[Phase 1 Design]](../proposal.md#1-2-subnet-exhaustion-is-calculated-by-dnc-rc)

DNC-RC will poll the `SubnetState` API to periodically check the Subnet utilization. DNC-RC will be configured with a lower and upper threshold ( $T_l$ and $T_u$ ) as percentages of the Subnet capacity $C$. If the Subnet utilization $U$ crosses the upper threshold, DNC-RC will consider the Subnet "exhausted". If the Subnet utilization then falls below the lower threshold, DNC-RC will consider the Subnet "not exhausted". Two values are necessary to induce hysteresis and minimize oscillation between the two states.

$$
E = \neg E \text{ when}\begin{cases}
U \gt T_u \times C &\text{if } E \text{ is false}\\
U \lt T_l \times C &\text{if } E \text{ is true}
\end{cases}
$$

> Note: $\neg$ is the negation operator.
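
As a minimal sketch of the hysteresis rule described above (not DNC-RC's actual implementation), the state update could look like the following Python. The function name and the fractional threshold parameters are assumptions made for illustration.

```python
def next_exhausted(exhausted: bool, utilization: int, capacity: int,
                   lower: float, upper: float) -> bool:
    """Return the new exhaustion state E given utilization U and capacity C.

    `lower` and `upper` stand for the thresholds T_l and T_u, expressed as
    fractions of capacity (hypothetical parameter names, not from DNC-RC).
    """
    if not 0 <= lower < upper <= 1:
        raise ValueError("thresholds must satisfy 0 <= lower < upper <= 1")
    if not exhausted and utilization > upper * capacity:
        return True   # crossed the upper threshold: mark exhausted
    if exhausted and utilization < lower * capacity:
        return False  # fell below the lower threshold: mark un-exhausted
    return exhausted  # inside the hysteresis band: keep the current state
```

With thresholds of 0.5 and 0.9 on a capacity of 100, a utilization of 70 leaves the state unchanged in either direction, which is the oscillation damping that the two thresholds are meant to provide.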

If the Subnet is exhausted, DNC-RC will write an additional, per-subnet CRD, the [`ClusterSubnetState`](https://github.com/Azure/azure-container-networking/blob/master/crd/clustersubnetstate/api/v1alpha1/clustersubnetstate.go), with a Status of `exhausted=true`. When the Subnet is un-exhausted, DNC-RC will write the Status as `exhausted=false`.

```yaml
apiVersion: acn.azure.com/v1alpha1
kind: ClusterSubnet
metadata:
  name: subnet
  namespace: kube-system
status:
  exhausted: true
  timestamp: 123456789
```

```mermaid
sequenceDiagram
participant Kubernetes
participant RC
participant DNC
loop
RC->>+DNC: Query Subnet Utilization
DNC->>-RC: Utilization
RC->>RC: Calculate Exhaustion
RC->>Kubernetes: Write Exhaustion to ClusterSubnet CRD
end
```
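
One iteration of the loop in the diagram could be sketched roughly as follows. This is illustrative only: `query_utilization` and `write_status` are hypothetical callables standing in for the `SubnetState` API call and the CRD write, and the default thresholds are example values rather than anything DNC-RC ships with.

```python
import time
from typing import Callable, Tuple


def reconcile_once(exhausted: bool,
                   query_utilization: Callable[[], Tuple[int, int]],
                   write_status: Callable[[bool, int], None],
                   lower: float = 0.5, upper: float = 0.9) -> bool:
    """Run one loop iteration: query utilization, calculate, write status.

    query_utilization returns (used, capacity); write_status receives the
    exhaustion flag and a Unix timestamp. Both are hypothetical stand-ins.
    """
    used, capacity = query_utilization()
    if not exhausted and used > upper * capacity:
        exhausted = True    # rose above the upper threshold
    elif exhausted and used < lower * capacity:
        exhausted = False   # fell below the lower threshold
    write_status(exhausted, int(time.time()))
    return exhausted
```

A caller would invoke `reconcile_once` on each poll interval and carry the returned state into the next iteration, since the hysteresis decision depends on the previous state.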
23 changes: 22 additions & 1 deletion docs/feature/subnet-scarcity/phase-1/3-releaseips.md
@@ -1 +1,22 @@
TODO
# CNS releases IPs back to Exhausted Subnets [[Phase 1 Design]](../proposal.md#1-3-ips-are-released-by-cns)

CNS will watch the `ClusterSubnetState` CRD and will update its internal state with the Subnet's exhaustion status. When the Subnet is exhausted, CNS will ignore the configured Batch size from the `NodeNetworkConfig`, and internally will use a Batch size of $1$. As the IPAM Pool Monitor reconciles the Pool, the changes to the Batch size will get picked up and applied to the subsequent Pool Scaling and target `RequestedIPCount`.

```mermaid
sequenceDiagram
participant IPAM Pool Monitor
participant ClusterSubnet Watcher
participant Kubernetes
Kubernetes->>ClusterSubnet Watcher: ClusterSubnet Update
alt Exhausted
ClusterSubnet Watcher->>IPAM Pool Monitor: Batch size = 1
else Un-exhausted
ClusterSubnet Watcher->>IPAM Pool Monitor: Batch size = 16
end
loop
IPAM Pool Monitor->>IPAM Pool Monitor: Recalculate RequestedIPCount
Note right of IPAM Pool Monitor: Request = Batch * X
IPAM Pool Monitor->>Kubernetes: Update NodeNetworkConfig CRD Spec
Kubernetes->>IPAM Pool Monitor: Update NodeNetworkConfig CRD Status
end
```
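
The Batch size switch described above could be sketched like this in Python. The function names are hypothetical, and the multiplier `X` from the diagram is passed through unchanged because its derivation is not covered in this section.

```python
def effective_batch_size(configured_batch: int, exhausted: bool) -> int:
    """Batch size CNS would scale with: 1 while the Subnet is exhausted,
    otherwise the Batch size configured in the NodeNetworkConfig."""
    if configured_batch < 1:
        raise ValueError("configured batch size must be at least 1")
    return 1 if exhausted else configured_batch


def requested_ip_count(batch: int, multiplier: int) -> int:
    """Request = Batch * X, as in the diagram. X is taken as an input
    because how it is computed is outside the scope of this sketch."""
    return batch * multiplier
```

For example, with a configured Batch size of 16 and a multiplier of 3, the request drops from 48 IPs to 3 IPs once the Subnet is marked exhausted.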
41 changes: 28 additions & 13 deletions docs/feature/subnet-scarcity/phase-2/3-subnetscaler.md
@@ -14,24 +14,39 @@ Since the Scaler values are dependent on the state of the Subnet, the Scaler obj

### ClusterSubnet Scaler
The ClusterSubnet `Status.Scaler` definition will be:
```yaml
...
status:
scaler:
batch: X // equal to batchSize
buffer: X // equal to requestThresholdPercent
```diff
  apiVersion: acn.azure.com/v1alpha1
  kind: ClusterSubnet
  metadata:
    name: subnet
    namespace: kube-system
  status:
    exhausted: true
    timestamp: 123456789
+   scaler:
+     batch: 16
+     buffer: 0.5
```

Additionally, the `Spec` of the ClusterSubnet will accept `Scaler` values to be used as runtime overrides. DNC-RC will read and validate the `Spec`, then write the values back out to the `Status` if present.
```yaml
...
spec:
scaler:
<...>
```diff
  apiVersion: acn.azure.com/v1alpha1
  kind: ClusterSubnet
  metadata:
    name: subnet
    namespace: kube-system
  spec:
+   scaler:
+     batch: 8
+     buffer: 0.25
  status:
    exhausted: true
    timestamp: 123456789
+   scaler:
+     batch: 8
+     buffer: 0.25
```
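
A rough sketch of the override flow, assuming DNC-RC computes a default Scaler and replaces it with the `Spec` values when they are present and valid. The function name, the validation bounds, and the choice to raise on an invalid override are all assumptions; the actual rules are not specified here.

```python
from typing import Optional, TypedDict


class Scaler(TypedDict):
    batch: int
    buffer: float


def resolve_status_scaler(default: Scaler, spec: Optional[Scaler]) -> Scaler:
    """Pick the Scaler to publish in Status: the Spec override when it is
    present and valid, otherwise the default computed for the Subnet."""
    if spec is None:
        return default
    # Assumed validation rules; the real constraints may differ.
    if spec["batch"] < 1 or spec["buffer"] <= 0:
        raise ValueError(f"invalid scaler override: {spec}")
    return spec
```

Applied to the example above, a default of `batch: 16, buffer: 0.5` with a `Spec` of `batch: 8, buffer: 0.25` yields a `Status.Scaler` of `batch: 8, buffer: 0.25`.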



Note:
- The `scaler.maxIPCount` will not be migrated, as the maxIPCount is a property of the Node and not the Subnet.
- The `scaler.releaseThresholdPercent` will not be migrated, as it is redundant. The `buffer` (and in fact the `requestThresholdPercent`) implies a `releaseThresholdPercent`, so one does not need to be specified explicitly. The [IPAM Scaling Math](../phase-2/2-scalingmath.md) incorporates only a single threshold value and fully describes the behavior of the system.
4 changes: 0 additions & 4 deletions docs/feature/subnet-scarcity/phase-3/1-watchpods.md

This file was deleted.

22 changes: 3 additions & 19 deletions docs/feature/subnet-scarcity/proposal.md
@@ -49,27 +49,18 @@ DNC (which maintains the state of the Subnet in its database) will cache the res
per Subnet. DNC will also expose an API to query $R$ of the Subnet, the `SubnetState` API.

#### [[1-2]](phase-1/2-exhaustion.md) Subnet Exhaustion is calculated by DNC-RC
DNC-RC will poll the `SubnetState` API to periodically check the Subnet utilization. DNC-RC will be configured with a lower and upper threshold ( $t$ and $T$ ) as fractions of the Subnet capacity $Q$. If the Subnet utilization crosses the upper threshold, DNC-RC will consider the Subnet "exhausted". If the Subnet utilization falls below the lower threshold, DNC-RC will consider the Subnet "not exhausted". Two values are necessary to induce hysteresis and avoid continuous oscillation between the two states.

$$
E = !E \text{(toggle exhaustion) when}\begin{cases}
R \gt T \times Q &\text{if not exhausted}\\
R \lt t \times Q &\text{if exhausted}
\end{cases}
$$

If the Subnet is exhausted, DNC-RC will write an additional per-subnet CRD, the [`ClusterSubnetState`](https://github.com/Azure/azure-container-networking/blob/master/crd/clustersubnetstate/api/v1alpha1/clustersubnetstate.go), with a Status of `exhausted=true`. When the Subnet is not exhausted, DNC-RC will write the Status as `exhausted=false`.
DNC-RC will poll DNC's `SubnetState` API on a fixed interval to check the Subnet utilization. When the utilization rises above a configurable upper threshold, RC will consider the Subnet exhausted; when it falls back below a configurable lower threshold, RC will consider the Subnet un-exhausted. RC will write the exhaustion state to the ClusterSubnet CRD.

#### [[1-3]](phase-1/3-releaseips.md) IPs are released by CNS
CNS will watch the `ClusterSubnetState` CRD and will update its internal state with the Subnet's exhaustion status. When the Subnet is exhausted, CNS will ignore the configured Batch size from the `NodeNetworkConfig`, and instead will scale in Batches of 1 IP. This will have the effect of releasing almost every unassigned IP back to the Subnet - 1 free IP will be kept in the Node's IPAM Pool, and scaling up or down will be done in increments of 1 IP.
CNS will watch the `ClusterSubnet` CRD, scaling down and releasing IPs when the Subnet is marked as Exhausted.

### Phase 2
The batch size $B$ is dynamically adjusted based on the current subnet utilization. The batch size is increased when the subnet utilization is low, and decreased when the subnet utilization is high. IPs are not assigned to a new Node until CNS requests them, allowing Nodes to start safely even in very constrained subnets.

#### [[2-1]](phase-2/1-emptync.md) DNC-RC creates NCs with no Secondary IPs
DNC-RC will create the NNC for a new Node with an initial IP Request of 0. An empty NC (containing a Primary, but no Secondary IPs) will be created via normal DNC API calls. The empty NC will be written to the NNC, allowing CNS to start. CNS will make the initial IP request according to the Subnet Exhaustion State.

DNC-RC will continue to poll the `SubnetState` API periodically to check the Subnet utilization, and write the exhaustion to the `ClusterSubnetState` CRD.
DNC-RC will continue to poll the `SubnetState` API periodically to check the Subnet utilization, and write the exhaustion to the `ClusterSubnet` CRD.

#### [[2-2]](phase-2/2-scalingmath.md) CNS scales IPAM pool idempotently
Instead of increasing/decreasing the Pool size by 1 Batch at a time to try to satisfy the min/max free IP constraints, CNS will calculate the correct target Requested IP Count using a single O(1) algorithm.
@@ -86,10 +77,3 @@ CNS will include the NC Primary IP(s) as IPs that it has been allocated, and wil

#### [[2-3]](phase-2/3-subnetscaler.md) Scaler properties move to the ClusterSubnet CRD
The Scaler properties from the v1alpha/NodeNetworkConfig `Status.Scaler` definition are moved to the ClusterSubnet CRD. CNS will prefer the Scaler from this CRD when it is available and fall back to the NNC Scaler otherwise. The `.Spec` field of the CRD may serve as an "overrides" location for runtime reconfiguration.
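
The fallback order could be sketched as follows in Python. This is illustrative only: the function name and the plain-dict representation of the two Scalers are assumptions, not CNS's actual types.

```python
from typing import Optional


def select_scaler(cluster_subnet_scaler: Optional[dict], nnc_scaler: dict) -> dict:
    """Prefer the Scaler published on the ClusterSubnet CRD; fall back to the
    NodeNetworkConfig Scaler when the CRD does not provide one."""
    if cluster_subnet_scaler is not None:
        return cluster_subnet_scaler
    return nnc_scaler
```

Keeping the NNC Scaler as a fallback would let CNS keep running against clusters where the ClusterSubnet CRD has not been written yet.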

### Phase 3
#### [[3-1]](phase-3/1-watchpods.md) CNS watches Pods


#### CNS stops watching the ClusterSubnetState
#### DNC-RC iteratively adjusts the Batch size