diff --git a/docs/feature/subnet-scarcity/phase-1/1-subnetstate.md b/docs/feature/subnet-scarcity/phase-1/1-subnetstate.md
deleted file mode 100644
index edfef05f57..0000000000
--- a/docs/feature/subnet-scarcity/phase-1/1-subnetstate.md
+++ /dev/null
@@ -1,83 +0,0 @@
-## DNC SubnetState API and Subnet Utilization cache [[Phase 1 Design](../proposal.md#1-1-subnet-utilization-is-cached-by-dnc)]
-
-### SubnetState API
-An API will be added to DNC which will provide the Reserved IP Count (the "Utilization") of a Subnet. The API will synchronously query the Subnet Utilization Cache and return the response from the Cache directly.
-
-```yaml
-paths:
-  /networks/{networkID}/subnets/{subnet}/utilization:
-    get:
-      summary: Returns the Subnet State
-      operationId: querySubnetState
-      description: |
-        Queries the State for the passed Subnet.
-      parameters:
-        - in: path
-          name: networkID
-          description: The Network ID
-          required: true
-          schema:
-            type: string
-        - in: path
-          name: subnet
-          description: The Subnet Name
-          required: true
-          schema:
-            type: string
-      responses:
-        '200':
-          description: The matching SubnetState
-          content:
-            application/json:
-              schema:
-                type: array
-                items:
-                  $ref: '#/components/schemas/SubnetState'
-        '400':
-          description: bad input parameter
-components:
-  schemas:
-    SubnetState:
-      type: object
-      required:
-        - timestamp
-        - capacity
-        - reserved
-      properties:
-        timestamp:
-          type: string
-          format: date-time
-          example: '2016-08-29T09:12:33.001Z'
-        capacity:
-          type: integer
-          example: 256
-        reserved:
-          type: integer
-          example: 128
-      description: The Subnet Utilization State
-```
-
-### Subnet Utilization Cache
-A cache will be added to DNC which will hold the Subnet Reserved IP Count per Subnet. The cache will be implemented as a pull-through, self-refreshing cache with a configurable refresh rate.
-
-The cache will act as a proxy to the Database and, when queried for a Subnet which it does not already know about, will iterate the Subnet Table to build the present Subnet Utilization ("Loading"), cache that result, and return the result. The Subnet will be added to the Cache's Known Subnets, and the Cache will periodically iterate the Known Subnets and re-Load their Utilization from the Database.
-
-
-```mermaid
-sequenceDiagram
-    Client->>+API: Query Subnet State
-    Note over Client,API: API blocks
-    API->>+Cache: Lookup State
-    alt Cache hit
-        Cache->>Cache: Cache hit
-    else Cache miss
-        Cache->>+Database: Iterate Subnet table
-        Database->>-Cache: Return Subnet State
-    end
-    Cache->>-API: Return State
-    API->>-Client: Return Subnet State
-    loop Refresh cache
-        Cache->>+Database: Iterate Subnet table
-        Database->>-Cache: Return Subnet State
-    end
-```
diff --git a/docs/feature/subnet-scarcity/phase-1/2-exhaustion.md b/docs/feature/subnet-scarcity/phase-1/2-exhaustion.md
deleted file mode 100644
index 2dccdcd1c8..0000000000
--- a/docs/feature/subnet-scarcity/phase-1/2-exhaustion.md
+++ /dev/null
@@ -1,38 +0,0 @@
-# DNC-RC watches and reacts to exhaustion [[Phase 1 Design]](../proposal.md#1-2-subnet-exhaustion-is-calculated-by-dnc-rc)
-
-DNC-RC will poll the `SubnetState` API to periodically check the Subnet utilization. DNC-RC will be configured with a lower and upper threshold ( $T_l$ and $T_u$ ) as percentages of the Subnet capacity $C$. If the Subnet utilization $U$ crosses the upper threshold, DNC-RC will consider the Subnet "exhausted". If the Subnet utilization then falls below the lower threshold, DNC-RC will consider the Subnet "not exhausted". Two values are necessary to induce hysteresis and minimize oscillation between the two states. With $E$ denoting the current exhaustion state, the state flips as follows:
-
-$$
-E = \neg E \text{ when}\begin{cases}
-U \gt T_u \times C &\text{if } E \text{ is false}\\
-U \lt T_l \times C &\text{if } E \text{ is true}
-\end{cases}
-$$
-
-> Note: $\neg$ is the negation operator.
-
-If the Subnet is exhausted, DNC-RC will write an additional, per-subnet CRD, the [`ClusterSubnetState`](https://github.com/Azure/azure-container-networking/blob/master/crd/clustersubnetstate/api/v1alpha1/clustersubnetstate.go), with a Status of `exhausted=true`. When the Subnet is un-exhausted, DNC-RC will write the Status as `exhausted=false`.
-
-```yaml
-apiVersion: acn.azure.com/v1alpha1
-kind: ClusterSubnet
-metadata:
-  name: subnet
-  namespace: kube-system
-status:
-  exhausted: true
-  timestamp: 123456789
-```
-
-```mermaid
-sequenceDiagram
-participant Kubernetes
-participant RC
-participant DNC
-loop
-RC->>+DNC: Query Subnet Utilization
-DNC->>-RC: Utilization
-RC->>RC: Calculate Exhaustion
-RC->>Kubernetes: Write Exhaustion to ClusterSubnet CRD
-end
-```
diff --git a/docs/feature/subnet-scarcity/phase-1/3-releaseips.md b/docs/feature/subnet-scarcity/phase-1/3-releaseips.md
deleted file mode 100644
index ee45479da2..0000000000
--- a/docs/feature/subnet-scarcity/phase-1/3-releaseips.md
+++ /dev/null
@@ -1,22 +0,0 @@
-# CNS releases IPs back to Exhausted Subnets [[Phase 1 Design]](../proposal.md#1-3-ips-are-released-by-cns)
-
-CNS will watch the `ClusterSubnetState` CRD and will update its internal state with the Subnet's exhaustion status. When the Subnet is exhausted, CNS will ignore the configured Batch size from the `NodeNetworkConfig`, and internally will use a Batch size of $1$. As the IPAM Pool Monitor reconciles the Pool, the changes to the Batch size will get picked up and applied to the subsequent Pool Scaling and target `RequestedIPCount`.
-
-```mermaid
-sequenceDiagram
-participant IPAM Pool Monitor
-participant ClusterSubnet Watcher
-participant Kubernetes
-Kubernetes->>ClusterSubnet Watcher: ClusterSubnet Update
-alt Exhausted
-ClusterSubnet Watcher->>IPAM Pool Monitor: Batch size = 1
-else Un-exhausted
-ClusterSubnet Watcher->>IPAM Pool Monitor: Batch size = 16
-end
-loop
-IPAM Pool Monitor->>IPAM Pool Monitor: Recalculate RequestedIPCount
-Note right of IPAM Pool Monitor: Request = Batch * X
-IPAM Pool Monitor->>Kubernetes: Update NodeNetworkConfig CRD Spec
-Kubernetes->>IPAM Pool Monitor: Update NodeNetworkConfig CRD Status
-end
-```
diff --git a/docs/feature/subnet-scarcity/phase-2/1-emptync.md b/docs/feature/subnet-scarcity/phase-2/1-emptync.md
deleted file mode 100644
index ca244e49a2..0000000000
--- a/docs/feature/subnet-scarcity/phase-2/1-emptync.md
+++ /dev/null
@@ -1,82 +0,0 @@
-## DNC-RC creates empty NCs for new Nodes [[Phase 2 Design]](../proposal.md#2-1-dnc-rc-creates-ncs-with-no-secondary-ips)
-
-When a new Node is created in the Cluster, the NodeController in DNC-RC will create a new stub NodeNetworkConfig associated with that Node. When a NodeNetworkConfig is Created/Updated/Deleted, the NodeNetworkConfigController will reconcile that NodeNetworkConfig target state.
-
-Currently, the NodeController sets the [`Spec.RequestedIPCount`](https://github.com/Azure/azure-container-networking/blob/238d12fb6c3bf4132cecce9a1356b77d13816d1c/crd/nodenetworkconfig/api/v1alpha/nodenetworkconfig.go#L44) equal to the [`Status.Scaler.BatchSize`](https://github.com/Azure/azure-container-networking/blob/238d12fb6c3bf4132cecce9a1356b77d13816d1c/crd/nodenetworkconfig/api/v1alpha/nodenetworkconfig.go#L68) when it initially scaffolds the NodeNetworkConfig. When the NodeNetworkConfigController reconciles that NNC, it sees that the Spec contains a Requested IP Count and attempts to honor that request by making an IP allocation request to DNC.
During this, [CNS is blocked, waiting for the NetworkContainer](https://github.com/Azure/azure-container-networking/blob/238d12fb6c3bf4132cecce9a1356b77d13816d1c/cns/kubecontroller/nodenetworkconfig/reconciler.go#L78) containing those SecondaryIPs to be added to the NNC. When DNC allocates the SecondaryIPs, the NodeNetworkConfigController writes the NetworkContainer to the NNC, and CNS starts its Pool Monitor loop.
-
-```mermaid
-sequenceDiagram
-participant CNS
-participant Kubernetes
-participant NodeController
-participant NNCController
-participant DNC
-loop Node Reconciler Loop
-Kubernetes->>+NodeController: Node XYZ created, create NNC XYZ
-NodeController->>-Kubernetes: Publish NodeNetworkConfig XYZ
-Note over NodeController,Kubernetes: Spec.RequestedIPs = 16
-end
-loop DNC-RC NodeNetworkConfig Reconciler Loop
-Kubernetes->>+NNCController: NodeNetworkConfig XYZ Spec Updated
-Note over NNCController,Kubernetes: Spec.RequestedIPs = 16
-NNCController->>+DNC: Create NC with 16 IPs
-alt
-DNC->>NNCController: Error allocating IPs
-Note left of NNCController: Terminal case, NNC is stuck
-else
-DNC->>-NNCController: Return NC with 16 IPs
-NNCController->>-Kubernetes: Update NodeNetworkConfig XYZ with 16 SecondaryIPs
-Note over NNCController,Kubernetes: Status.NC[0].SecondaryIPConfigs = [16]{...}
-loop CNS NodeNetworkConfig Reconciler Loop
-Kubernetes->>+CNS: NodeNetworkConfig XYZ Status Updated
-Note over Kubernetes,CNS: Status.NC[0].SecondaryIPConfigs = [16]{...}
-CNS->>CNS: Watches and Scales IPAM Pool
-CNS->>-Kubernetes: Updates NodeNetworkConfig XYZ Spec
-end
-end
-end
-```
-
-Due to the division of responsibilities, it is possible for this flow to deadlock if the Subnet is exhausted.
-- the Node Reconciler loop is responsible for initially scaffolding the NNC for a new Node and can only set the Requested IP Count safely when it is *creating* the NodeNetworkConfig
-- the NodeNetworkConfig Reconciler loop is reacting to updates to the Requested IP Count and attempting to honor them
-- only CNS can update the Requested IP Count after an NNC has been created, as CNS does Pod IPAM on the Node
-
-If the Subnet becomes exhausted _after_ the Node Reconciler loop has set the initial Requested IP Count and the NodeNetworkConfig Reconciler is unable to honor the request, the NetworkContainer will never be written to the NNC Status. This Status update is what indicates to CNS that the Network is ready (enough for it to start). In this scenario, no running component can safely update the Requested IP Count to bring it within the constraints of the Subnet, and the NNC Status will never be updated. CNS will get no IPs, and no Pods can run on that Node.
-
-### Solution: Create NetworkContainer with no SecondaryIPs when creating NodeNetworkConfig
-
-Instead of creating the NodeNetworkConfig with a Requested IP Count of $B$, the NodeController will create NodeNetworkConfigs with a Requested IP Count of $0$. The NodeNetworkConfigController will create an NC Request with only a single Primary IP and zero Secondary IPs for the initial create, and will write the empty NC to the NodeNetworkConfig Status. This skeleton NC in the NNC Status will be enough to signal to CNS to start the IPAM loop, and CNS will be able to iteratively adjust the Requested IP Count based on the current Subnet Exhaustion State at any time, as it does at steady state already.
-
-```mermaid
-sequenceDiagram
-participant CNS
-participant Kubernetes
-participant NodeController
-participant NNCController
-participant DNC
-loop Node Reconciler Loop
-Kubernetes->>+NodeController: Node XYZ created, create NNC XYZ
-NodeController->>-Kubernetes: Publish NodeNetworkConfig XYZ
-Note over NodeController,Kubernetes: Spec.RequestedIPs = 0
-end
-loop DNC-RC NodeNetworkConfig Reconciler Loop
-Kubernetes->>+NNCController: NodeNetworkConfig XYZ Spec Updated
-Note over NNCController,Kubernetes: Spec.RequestedIPs = 0
-NNCController->>+DNC: Create NC with 0 IPs
-DNC->>-NNCController: Return NC with 0 IPs
-NNCController->>-Kubernetes: Update NodeNetworkConfig XYZ with 0 SecondaryIPs
-Note over NNCController,Kubernetes: Status.NC[0].SecondaryIPConfigs = [0]{...}
-end
-loop CNS NodeNetworkConfig Reconciler Loop
-Kubernetes->>+CNS: NodeNetworkConfig XYZ Status Updated
-Note over Kubernetes,CNS: Status.NC[0].SecondaryIPConfigs = [0]{...}
-CNS->>CNS: Watches and Scales IPAM Pool
-CNS->>-Kubernetes: Updates NodeNetworkConfig XYZ Spec
-Note over Kubernetes,CNS: Status.NC[0].SecondaryIPConfigs = [B]{...}
-end
-```
-
-Due to the shift of responsibility for asking for the initial Secondary IP allocation from the NodeController to CNS, there will be additional startup latency of one Request-Allocate loop duration while CNS asks for and waits to receive some Secondary IPs the first time.
-
-However, this improved architecture breaks the hard startup dependency between CNS and the initial $B$ Secondary IP allocation. In this way, the initial *creation* of the NodeNetworkConfig has no special knowledge or cases, can be handled identically to the steady-state Update scenario, and can handle Subnet Exhaustion without the previous race/deadlock condition.
diff --git a/docs/feature/subnet-scarcity/phase-2/2-scalingmath.md b/docs/feature/subnet-scarcity/phase-2/2-scalingmath.md
deleted file mode 100644
index b1aea2e1bd..0000000000
--- a/docs/feature/subnet-scarcity/phase-2/2-scalingmath.md
+++ /dev/null
@@ -1,89 +0,0 @@
-## CNS idempotent Pool Scaling math [[Phase 2 Design]](../proposal.md#2-2-cns-scales-ipam-pool-idempotently)
-
-The current Pod IP allocation works as follows:
-- CNS is allocated a Batch of IPs from DNC and records them internally as "Available"
-- Pods are scheduled on the Node
-    - The CRI creates a Pod Sandbox and asks the CNI to assign an IP
-    - The CNI makes an IP assignment request to CNS
-    - If there is an Available IP:
-        - CNS allocates an Available IP out of the Pool.
-    - If there is not an Available IP:
-        - CNS returns an error
-        - CNI returns an error
-        - CRI tears down the Pod Sandbox
-- As described in the [Background](../proposal.md#background), CNS watches the IPAM Pool and continuously verifies that there are at least the Minimum Free IPs left in the Pool. If there are not, it requests an additional Batch from RC via the `NodeNetworkConfig` CRD.
-$$m = mf \times B \quad \text{the Minimum Free IPs}$$
-$$\text{if Available IPs} \lt m \quad \text{request an additional Batch }B$$
-
-
-```mermaid
-sequenceDiagram
-    participant CRI
-    participant CNI
-    participant CNS
-    participant DNC
-    loop
-    CNS->CNS: Available IPs < m
-    CNS->>+DNC: Request B IPs
-    DNC->>-CNS: Provide IPs
-    end
-    CRI->>+CNI: Create Pod
-    CNI->>+CNS: Request IP
-    alt IP is Available
-    CNS->>CNI: Assign IP
-    CNI->>CRI: Start Pod
-    else No IP Available
-    CNS->>-CNI: Error
-    CNI->>-CRI: Destroy Pod
-    end
-```
-
-The existing IP Pool scaling behavior in CNS is reactive and serial: CNS will only request to increase or decrease its Pool size by a single batch at a time.
It reacts to the IP usage, attempting to adjust the Pool size to stay between the minimum and maximum free IPs, but it will only step the pool size by a single Batch at a time.
-
-This introduces latency; CNS must calculate a new Pool size, request an additional Batch, wait for IPs to be allocated from DNC-RC, then loop. The request<->response loop for IP allocations may take several seconds.
-
-### Idempotent Scaling Math
-
-The process can be improved by directly calculating the target Pool size based on the current IP usage on the Node. Using this idempotent algorithm, we will always calculate the correct target Pool size in a single step based on the current IP usage.
-
-The O(1) Pool scaling formula is:
-
-$$
-Request = B \times \lceil mf + \frac{U}{B} \rceil
-$$
-
-> Note: $\lceil ... \rceil$ is the ceiling function.
-
-where $U$ is the number of Assigned (Used) IPs on the Node, $B$ is the Batch size, and $mf$ is the Minimum Free Fraction, as discussed in the [Background](../proposal.md#background).
-
-The "Required" IP Count is forward-looking without affecting the correctness of the Request: it represents the target quantity of IP addresses that CNS *will Assign to Pods* at some instant in time. This may include Pods scheduled which do not *currently* have Assigned IPs because there are insufficient Available IPs in the Pool.
-
-In this way, at any point in time, CNS may calculate what its exact IP Request should be based on the instantaneous IP demand from currently scheduled Pods on its Node.
-
-A concrete example:
-
-$$
-\displaylines{
-  \text{Given: }\quad B=16\quad mf=0.5 \quad U=25 \text{ scheduled Pods}\\
-  Request = 16 \times \lceil 0.5 + \frac{25}{16} \rceil\\
-  Request = 16 \times \lceil 0.5 + 1.5625 \rceil\\
-  Request = 16 \times \lceil 2.0625 \rceil\\
-  Request = 16 \times 3 \\
-  Request = 48
-}
-$$
-
-As shown, if the demand is for $25$ IPs, and the Batch is $16$, and the Min Free is $8$ (half of the Batch), then the Request must be $48$.
$32$ is too few, as $32-25=7 < 8$.
-
-This algorithm will significantly improve the time-to-pod-ready for large changes in the quantity of scheduled Pods on a Node, because it eliminates the iterations otherwise required for CNS to converge on the final Requested IP Count.
-
-
-### Including PrimaryIPs
-
-The IPAM Pool scaling operates only on NC SecondaryIPs. However, CNS is allocated an additional `PrimaryIP` for every NC as a prerequisite of that NC's existence. Therefore, to align the **real allocated** IP Count to the Batch size, CNS should deduct those PrimaryIPs from its Requested (Secondary) IP Count.
-
-This makes the RequestedIPCount:
-
-$$
-RequestedIPCount = B \times \lceil mf + \frac{U}{B} \rceil - PrimaryIPCount
-$$
diff --git a/docs/feature/subnet-scarcity/phase-2/3-subnetscaler.md b/docs/feature/subnet-scarcity/phase-2/3-subnetscaler.md
deleted file mode 100644
index 8de81606be..0000000000
--- a/docs/feature/subnet-scarcity/phase-2/3-subnetscaler.md
+++ /dev/null
@@ -1,57 +0,0 @@
-## Migrating the Scaler properties to the ClusterSubnet CRD [[Phase 2 Design]](../proposal.md#2-3-scaler-properties-move-to-the-clustersubnet-crd)
-Currently, the [`v1alpha/NodeNetworkConfig` contains the Scaler inputs](https://github.com/Azure/azure-container-networking/blob/eae2389f888468e3b863cb28045ba613a5562360/crd/nodenetworkconfig/api/v1alpha/nodenetworkconfig.go#L66-L72) which CNS will use to scale the local IPAM pool:
-
-```yaml
-...
-status:
-  scaler:
-    batchSize: X
-    releaseThresholdPercent: X
-    requestThresholdPercent: X
-    maxIPCount: X
-```
-Since the Scaler values are dependent on the state of the Subnet, the Scaler object will be moved to the ClusterSubnet CRD and optimized.
-
-### ClusterSubnet Scaler
-The ClusterSubnet `Status.Scaler` definition will be:
-```diff
-  apiVersion: acn.azure.com/v1alpha1
-  kind: ClusterSubnet
-  metadata:
-    name: subnet
-    namespace: kube-system
-  status:
-    exhausted: true
-    timestamp: 123456789
-+   scaler:
-+     batch: 16
-+     buffer: 0.5
-```
-
-Additionally, the `Spec` of the ClusterSubnet will accept `Scaler` values to be used as runtime overrides. DNC-RC will read and validate the `Spec`, then write the values back out to the `Status` if present.
-```diff
-  apiVersion: acn.azure.com/v1alpha1
-  kind: ClusterSubnet
-  metadata:
-    name: subnet
-    namespace: kube-system
-  spec:
-+   scaler:
-+     batch: 8
-+     buffer: 0.25
-  status:
-    exhausted: true
-    timestamp: 123456789
-+   scaler:
-+     batch: 8
-+     buffer: 0.25
-```
-
-Note:
-- The `scaler.maxIPCount` will not be migrated, as the maxIPCount is a property of the Node and not the Subnet.
-- The `scaler.releaseThresholdPercent` will not be migrated, as it is redundant. The `buffer` (and, in fact, the `requestThresholdPercent`) implies a `releaseThresholdPercent`, so one does not need to be specified explicitly. The [IPAM Scaling Math](../phase-2/2-scalingmath.md) incorporates only a single threshold value and fully describes the behavior of the system.
-
-#### Migration
-When the Scaler is added to the ClusterSubnet CRD definition, DNC-RC will begin replicating the `batch` and `buffer` properties from the NodeNetworkConfig, keeping both up to date.
-
-CNS, which already watches the ClusterSubnet CRD for known Subnets, will use the Scaler properties from that object as a priority, and will fall back to using the NNC Scaler properties if they are not present in the ClusterSubnet.
diff --git a/docs/feature/subnet-scarcity/phase-3/1-watchpods.md b/docs/feature/subnet-scarcity/phase-3/1-watchpods.md
deleted file mode 100644
index 18c507f872..0000000000
--- a/docs/feature/subnet-scarcity/phase-3/1-watchpods.md
+++ /dev/null
@@ -1,157 +0,0 @@
-## CNS watches Pods to drive IPAM scaling [[Phase 3 Design]](../proposal.md#3-1-cns-watches-pods)
-
-As described in [Phase 2: Scaling Math](../phase-2/2-scalingmath.md), the IPAM Pool Scaling is reactive: CNS assigns IPs out of the IPAM Pool as it is asked for them by the CNI, while trying to maintain a buffer of IPs that is within the Scaler parameters. The CNI makes IP assignment requests serially, and as the CNI requests that IPs are assigned or freed, CNS makes requests to scale the IPAM Pool up or down by adjusting the Requested IP Count in the NodeNetworkConfig. If CNS is unable to honor an IP assignment request because it has no free IPs, the CNI returns an error to the CRI, which causes the Pod sandbox to be cleaned up, and CNS will receive an IP Release request for that Pod.
-
-In the reactive architecture, CNS is not able to track the number of incoming Pod IP assignment requests, so it can only reliably scale by a single Batch at a time.
For example:
-- At $T_0$ 1 Pod $P_0$ is scheduled: CNS has 1 Batch ( $16$ ) IPs
-- At $T_1$ 35 Pods are scheduled for a Total of 36 Pods
-- At $T_2$ CNI is sequentially requesting IP assignments for Pods, and for Pod $P_8$, CNS has fewer than $B\times mf$ unassigned IPs and requests an additional Batch of IPs
-- At $T_3$ CNI requests an IP for Pod $P_{16}$ but CNS is out of free IPs and returns an error
-    - CRI tears down $P_{16}$, and CNI requests that CNS frees the IP for $P_{16}$
-    - $P_{17-35}$ are similarly stuck, pending available IPs
-- At $T_4$ CNS receives an additional Batch of IPs, and as the CNI retries they are assigned to $P_{16-31}$
-- At $T_5$ CNS has too few unassigned IPs again and requests another Batch
-- At $T_6$ $P_{32}$ is stuck, pending available IPs
-- At $T_7$ CNS receives an additional Batch of IPs, and as the CNI retries they are assigned to $P_{32-35}$
-
-By proactively watching Pods instead of waiting for the CNI requests, this process could be faster and simpler:
-- At $T_0$ 1 Pod $P_0$ is scheduled: CNS has 1 Batch ( $16$ ) IPs
-- At $T_1$ 35 Pods are scheduled for a Total of 36 Pods
-- At $T_2$ CNS sees 36 Pods have been scheduled and updates the Requested IP Count to $48$ according to the [Scaling Equation](../phase-2/2-scalingmath.md#idempotent-scaling-math)
-- At $T_3$ CNS receives 48 total IPs, and as the CNI requests IP assignments they are assigned to $P_{1-35}$
-
-The following details explain how this design will be accomplished while accounting for the horizontal scalability of CNS ( $N = Nodes$ ) and the load on the API Server from watching Pods ( $N \propto Nodes$ ).
-
-#### SharedInformers and local Caches
-Kubernetes `client-go` [provides machinery for local caching](https://github.com/kubernetes/sample-controller/blob/6d1d76794eb5f951e63a46f1ad6e097c1879d81b/docs/controller-client-go.md): Reflectors, (Shared)Informers, Indexers, and Stores.
-
-
-
-
- > [Image from kubernetes/sample-controller documentation](https://github.com/kubernetes/sample-controller/blob/6d1d76794eb5f951e63a46f1ad6e097c1879d81b/docs/controller-client-go.md).
-