diff --git a/docs/feature/subnet-scarcity/phase-1/1-subnetstate.md b/docs/feature/subnet-scarcity/phase-1/1-subnetstate.md deleted file mode 100644 index edfef05f57..0000000000 --- a/docs/feature/subnet-scarcity/phase-1/1-subnetstate.md +++ /dev/null @@ -1,83 +0,0 @@ -## DNC SubnetState API and Subnet Utilization cache [[Phase 1 Design](../proposal.md#1-1-subnet-utilization-is-cached-by-dnc)] - -### SubnetState API -An API will be added to DNC which will provide the Reserved IP Count (the "Utilization") of a Subnet. The API will synchronously query the Subnet Utilization Cache and return the response from the Cache directly. - -```yaml -paths: - /networks/{networkID}/subnets/{subnet}/utilization: - get: - summary: Returns the Subnet State - operationId: querySubnetState - description: | - Queries the State for the passed Subnet. - parameters: - - in: path - name: networkID - description: The Network ID - required: true - schema: - type: string - - in: path - name: subnet - description: The Subnet Name - required: true - schema: - type: string - responses: - '200': - description: The matching SubnetState - content: - application/json: - schema: - type: array - items: - $ref: '#/components/schemas/SubnetState' - '400': - description: bad input parameter -components: - schemas: - SubnetState: - type: object - required: - - timestamp - - capacity - - reserved - properties: - timestamp: - type: string - format: date-time - example: '2016-08-29T09:12:33.001Z' - capacity: - type: integer - example: 256 - reserved: - type: integer - example: 128 - description: The Subnet Utilization State -``` - -### Subnet Utilization Cache -A cache will be added to DNC which will hold the Reserved IP Count per Subnet. The cache will be implemented as a pull-through, self-refreshing cache with a configurable refresh rate.
- -The cache will act as a proxy to the Database and, when queried for a Subnet which it does not already know about, will iterate the Subnet Table to build the present Subnet Utilization ("Loading"), cache that result, and return the result. The Subnet will be added to the Cache's Known Subnets, and the Cache will periodically iterate the Known Subnets and re-Load their Utilization from the Database. - - -```mermaid -sequenceDiagram - Client->>+API: Query Subnet State - Note over Client,API: API blocks - API->>+Cache: Lookup State - alt Cache hit - Cache->>Cache: Cache hit - else Cache miss - Cache->>+Database: Iterate Subnet table - Database->>-Cache: Return Subnet State - end - Cache->>-API: Return State - API->>-Client: Return Subnet State - loop Refresh cache - Cache->>+Database: Iterate Subnet table - Database->>-Cache: Return Subnet State - end -``` diff --git a/docs/feature/subnet-scarcity/phase-1/2-exhaustion.md b/docs/feature/subnet-scarcity/phase-1/2-exhaustion.md deleted file mode 100644 index 2dccdcd1c8..0000000000 --- a/docs/feature/subnet-scarcity/phase-1/2-exhaustion.md +++ /dev/null @@ -1,38 +0,0 @@ -# DNC-RC watches and reacts to exhaustion [[Phase 1 Design]](../proposal.md#1-2-subnet-exhaustion-is-calculated-by-dnc-rc) - -DNC-RC will periodically poll the `SubnetState` API to check the Subnet utilization. DNC-RC will be configured with a lower and upper threshold ( $T_l$ and $T_u$ ) as percentages of the Subnet capacity $C$. If the Subnet utilization $U$ crosses the upper threshold, DNC-RC will consider the Subnet "exhausted". If the Subnet utilization then falls below the lower threshold, DNC-RC will consider the Subnet "not exhausted". Two values are necessary to induce hysteresis and minimize oscillation between the two states. - -$$ -E = \neg E \text{ when}\begin{cases} -U \gt T_u \times C &\text{if } E \text{ is false}\\ -U \lt T_l \times C &\text{if } E \text{ is true} -\end{cases} -$$ - -> Note: $\neg$ is the negation operator.
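The hysteresis rule can be written as a small pure function. A sketch, using illustrative names (`exhausted`, `Tl`, `Tu`) rather than DNC-RC's actual implementation:

```go
package main

import "fmt"

// exhausted applies the hysteresis rule: flip to exhausted only when
// utilization exceeds Tu*C, and flip back only when it falls below Tl*C.
// Tl and Tu are fractions of the capacity C; U is the reserved IP count.
func exhausted(prev bool, U, C int, Tl, Tu float64) bool {
	if prev {
		// currently exhausted: stay exhausted until U drops below the lower threshold
		return !(float64(U) < Tl*float64(C))
	}
	// currently not exhausted: become exhausted once U crosses the upper threshold
	return float64(U) > Tu*float64(C)
}

func main() {
	// C=256 with thresholds Tl=0.5 and Tu=0.9
	fmt.Println(exhausted(false, 240, 256, 0.5, 0.9)) // true: crossed the upper threshold
	fmt.Println(exhausted(true, 200, 256, 0.5, 0.9))  // true: still above the lower threshold
	fmt.Println(exhausted(true, 100, 256, 0.5, 0.9))  // false: fell below the lower threshold
}
```

The middle case is the hysteresis at work: a utilization between the two thresholds keeps whatever state the Subnet was already in.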
- -If the Subnet is exhausted, DNC-RC will write an additional, per-subnet CRD, the [`ClusterSubnetState`](https://github.com/Azure/azure-container-networking/blob/master/crd/clustersubnetstate/api/v1alpha1/clustersubnetstate.go), with a Status of `exhausted=true`. When the Subnet is un-exhausted, DNC-RC will write the Status as `exhausted=false`. - -```yaml -apiVersion: acn.azure.com/v1alpha1 -kind: ClusterSubnet -metadata: - name: subnet - namespace: kube-system -status: - exhausted: true - timestamp: 123456789 -``` - -```mermaid -sequenceDiagram -participant Kubernetes -participant RC -participant DNC -loop -RC->>+DNC: Query Subnet Utilization -DNC->>-RC: Utilization -RC->>RC: Calculate Exhaustion -RC->>Kubernetes: Write Exhaustion to ClusterSubnet CRD -end -``` diff --git a/docs/feature/subnet-scarcity/phase-1/3-releaseips.md b/docs/feature/subnet-scarcity/phase-1/3-releaseips.md deleted file mode 100644 index ee45479da2..0000000000 --- a/docs/feature/subnet-scarcity/phase-1/3-releaseips.md +++ /dev/null @@ -1,22 +0,0 @@ -# CNS releases IPs back to Exhausted Subnets [[Phase 1 Design]](../proposal.md#1-3-ips-are-released-by-cns) - -CNS will watch the `ClusterSubnetState` CRD and will update its internal state with the Subnet's exhaustion status. When the Subnet is exhausted, CNS will ignore the configured Batch size from the `NodeNetworkConfig`, and internally will use a Batch size of $1$. As the IPAM Pool Monitor reconciles the Pool, the changes to the Batch size will get picked up and applied to the subsequent Pool Scaling and target `RequestedIPCount`. 
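The Batch override can be sketched as a tiny helper; `effectiveBatch` is a hypothetical name for illustration, not a function in CNS:

```go
package main

import "fmt"

// effectiveBatch returns the Batch size the Pool Monitor should use:
// the NNC-configured Batch normally, and 1 while the Subnet is exhausted,
// so that subsequent Pool Scaling releases all unneeded IPs.
func effectiveBatch(configured int64, exhausted bool) int64 {
	if exhausted {
		return 1
	}
	return configured
}

func main() {
	fmt.Println(effectiveBatch(16, false)) // 16
	fmt.Println(effectiveBatch(16, true))  // 1
}
```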
- -```mermaid -sequenceDiagram -participant IPAM Pool Monitor -participant ClusterSubnet Watcher -participant Kubernetes -Kubernetes->>ClusterSubnet Watcher: ClusterSubnet Update -alt Exhausted -ClusterSubnet Watcher->>IPAM Pool Monitor: Batch size = 1 -else Un-exhausted -ClusterSubnet Watcher->>IPAM Pool Monitor: Batch size = 16 -end -loop -IPAM Pool Monitor->>IPAM Pool Monitor: Recalculate RequestedIPCount -Note right of IPAM Pool Monitor: Request = Batch * X -IPAM Pool Monitor->>Kubernetes: Update NodeNetworkConfig CRD Spec -Kubernetes->>IPAM Pool Monitor: Update NodeNetworkConfig CRD Status -end -``` diff --git a/docs/feature/subnet-scarcity/phase-2/1-emptync.md b/docs/feature/subnet-scarcity/phase-2/1-emptync.md deleted file mode 100644 index ca244e49a2..0000000000 --- a/docs/feature/subnet-scarcity/phase-2/1-emptync.md +++ /dev/null @@ -1,82 +0,0 @@ -## DNC-RC creates empty NCs for new Nodes [[Phase 2 Design]](../proposal.md#2-1-dnc-rc-creates-ncs-with-no-secondary-ips) - -When a new Node is created in the Cluster, the NodeController in DNC-RC will create a new stub NodeNetworkConfig associated with that Node. When a NodeNetworkConfig is Created/Updated/Deleted, the NodeNetworkConfigController will reconcile that NodeNetworkConfig target state. - -Currently, the NodeController sets the [`Spec.RequestedIPCount`](https://github.com/Azure/azure-container-networking/blob/238d12fb6c3bf4132cecce9a1356b77d13816d1c/crd/nodenetworkconfig/api/v1alpha/nodenetworkconfig.go#L44) equal to the [`Status.Scaler.BatchSize`](https://github.com/Azure/azure-container-networking/blob/238d12fb6c3bf4132cecce9a1356b77d13816d1c/crd/nodenetworkconfig/api/v1alpha/nodenetworkconfig.go#L68) when it initially scaffolds the NodeNetworkConfig. When the NodeNetworkConfigController reconciles that NNC, it sees that the Spec contains a Requested IP Count and attempts to honor that request by making an IP allocation request to DNC. 
During this, [CNS is blocked, waiting for the NetworkContainer](https://github.com/Azure/azure-container-networking/blob/238d12fb6c3bf4132cecce9a1356b77d13816d1c/cns/kubecontroller/nodenetworkconfig/reconciler.go#L78) containing those SecondaryIPs to be added to the NNC. When DNC allocates the SecondaryIPs, the NodeNetworkConfigController writes the NetworkContainer to the NNC, and CNS starts its Pool Monitor loop. - -```mermaid -sequenceDiagram -participant CNS -participant Kubernetes -participant NodeController -participant NNCController -participant DNC -loop Node Reconciler Loop -Kubernetes->>+NodeController: Node XYZ created, create NNC XYZ -NodeController->>-Kubernetes: Publish NodeNetworkConfig XYZ -Note over NodeController,Kubernetes: Spec.RequestedIPs = 16 -end -loop DNC-RC NodeNetworkConfig Reconciler Loop -Kubernetes->>+NNCController: NodeNetworkConfig XYZ Spec Updated -Note over NNCController,Kubernetes: Spec.RequestedIPs = 16 -NNCController->>+DNC: Create NC with 16 IPs -alt -DNC->>NNCController: Error allocating IPs -Note left of NNCController: Terminal case, NNC is stuck -else -DNC->>-NNCController: Return NC with 16 IPs -NNCController->>-Kubernetes: Update NodeNetworkConfig XYZ with 16 SecondaryIPs -Note over NNCController,Kubernetes: Status.NC[0].SecondaryIPConfigs = [16]{...} -loop CNS NodeNetworkConfig Reconciler Loop -Kubernetes->>+CNS: NodeNetworkConfig XYZ Status Updated -Note over Kubernetes,CNS: Status.NC[0].SecondaryIPConfigs = [16]{...} -CNS->>CNS: Watch and Scales IPAM Pool -CNS->>-Kubernetes: Updates NodeNetworkConfig XYZ Spec -end -end -end -``` - -Due to the division of responsibilities, it is possible for this flow to deadlock if the Subnet is exhausted. 
-- the Node Reconciler loop is responsible for initially scaffolding the NNC for a new Node and can only set the Requested IP count safely when it is *creating* the NodeNetworkConfig -- the NodeNetworkConfig Reconciler loop is reacting to updates to the Requested IP count and attempting to honor them -- only CNS can update the Requested IP Count after an NNC has been created, as CNS does Pod IPAM on the Node - -If the Subnet becomes exhausted _after_ the Node Reconciler loop has set the initial Requested IP count and the NodeNetworkConfig Reconciler is unable to honor the request, the NetworkContainer will never be written to the NNC Status. This Status update is what indicates to CNS that the Network is ready (enough for it to start). In this scenario, no running component can safely update the Requested IP Count to bring it within the constraints of the Subnet, and the NNC Status will never be updated. CNS will get no IPs, and no Pods can run on that Node. - -### Solution: Create NetworkContainer with no SecondaryIPs when creating NodeNetworkConfig - -Instead of creating the NodeNetworkConfig with a Requested IP count of $B$, the NodeController will create NodeNetworkConfigs with a Requested IP count of $0$. The NodeNetworkConfigController will create an NC Request with only a single Primary IP and zero Secondary IPs for the initial create, and will write the empty NC to the NodeNetworkConfig Status. This skeleton NC in the NNC Status will be enough to signal to CNS to start the IPAM loop, and CNS will be able to iteratively adjust the Requested IP Count based on the current Subnet Exhaustion State at any time, as it does at steady state already.
- -```mermaid -sequenceDiagram -participant CNS -participant Kubernetes -participant NodeController -participant NNCController -participant DNC -loop Node Reconciler Loop -Kubernetes->>+NodeController: Node XYZ created, create NNC XYZ -NodeController->>-Kubernetes: Publish NodeNetworkConfig XYZ -Note over NodeController,Kubernetes: Spec.RequestedIPs = 0 -end -loop DNC-RC NodeNetworkConfig Reconciler Loop -Kubernetes->>+NNCController: NodeNetworkConfig XYZ Spec Updated -Note over NNCController,Kubernetes: Spec.RequestedIPs = 0 -NNCController->>+DNC: Create NC with 0 IPs -DNC->>-NNCController: Return NC with 0 IPs -NNCController->>-Kubernetes: Update NodeNetworkConfig XYZ with 0 SecondaryIPs -Note over NNCController,Kubernetes: Status.NC[0].SecondaryIPConfigs = [0]{...} -end -loop CNS NodeNetworkConfig Reconciler Loop -Kubernetes->>+CNS: NodeNetworkConfig XYZ Status Updated -Note over Kubernetes,CNS: Status.NC[0].SecondaryIPConfigs = [0]{...} -CNS->>CNS: Watch and Scales IPAM Pool -CNS->>-Kubernetes: Updates NodeNetworkConfig XYZ Spec -Note over Kubernetes,CNS: Status.NC[0].SecondaryIPConfigs = [B]{...} -end -``` - -Due to the shift of responsibility for asking for the initial Secondary IP allocation from the NodeController to CNS, there will be additional startup latency of one Request-Allocate loop duration while CNS asks for and waits to receive some Secondary IPs the first time. - -However, this improved architecture breaks the hard startup dependency between CNS and the initial $B$ Secondary IP allocation. In this way, the initial *creation* of the NodeNetworkConfig has no special knowledge or cases, can be handled identically to the steady-state Update scenario, and can handle Subnet Exhaustion without the previous race/deadlock condition. 
diff --git a/docs/feature/subnet-scarcity/phase-2/2-scalingmath.md b/docs/feature/subnet-scarcity/phase-2/2-scalingmath.md deleted file mode 100644 index b1aea2e1bd..0000000000 --- a/docs/feature/subnet-scarcity/phase-2/2-scalingmath.md +++ /dev/null @@ -1,89 +0,0 @@ -## CNS idempotent Pool Scaling math [[Phase 2 Design]](../proposal.md#2-2-cns-scales-ipam-pool-idempotently) - -The current Pod IP allocation works as follows: -- CNS is allocated a Batch of IPs from DNC and records them internally as "Available" -- Pods are scheduled on the Node - - The CRI creates a Pod Sandbox and asks the CNI to assign an IP - - The CNI makes an IP assignment request to CNS - - If there is an Available IP: - - CNS assigns an Available IP out of the Pool. - - If there is not an Available IP: - - CNS returns an error - - CNI returns an error - - CRI tears down the Pod Sandbox -- As described in the [Background](../proposal.md#background), CNS watches the IPAM Pool and continuously verifies that there are at least the Minimum Free IPs left in the Pool. If there are not, it requests an additional Batch from RC via the `NodeNetworkConfig` CRD. -$$m = mf \times B \quad \text{the Minimum Free IPs}$$ -$$\text{if Available IPs} \lt m \quad \text{request an additional Batch } B$$ - - -```mermaid -sequenceDiagram - participant CRI - participant CNI - participant CNS - participant DNC - loop - CNS->CNS: Available IPs < m - CNS->>+DNC: Request B IPs - DNC->>-CNS: Provide IPs - end - CRI->>+CNI: Create Pod - CNI->>+CNS: Request IP - alt IP is Available - CNS->>CNI: Assign IP - CNI->>CRI: Start Pod - else No IP Available - CNS->>-CNI: Error - CNI->>-CRI: Destroy Pod - end -``` - -The existing IP Pool scaling behavior in CNS is reactive and serial: CNS will only request to increase or decrease its Pool size by a single Batch at a time.
It reacts to the IP usage, attempting to adjust the Pool size to stay between the minimum and maximum free IPs, but it will only step the Pool size by a single Batch at a time. - -This introduces latency; CNS must calculate a new Pool size, request an additional Batch, wait for IPs to be allocated from DNC-RC, then loop. The request<->response loop for IP allocations may take several seconds. - -### Idempotent Scaling Math - -The process can be improved by directly calculating the target Pool size based on the current IP usage on the Node. Using this idempotent algorithm, we will always calculate the correct target Pool size in a single step based on the current IP usage. - -The O(1) Pool scaling formula is: - -$$ -Request = B \times \lceil mf + \frac{U}{B} \rceil -$$ - -> Note: $\lceil ... \rceil$ is the ceiling function. - -where $U$ is the number of Assigned (Used) IPs on the Node, $B$ is the Batch size, and $mf$ is the Minimum Free Fraction, as discussed in the [Background](../proposal.md#background). - -The "Required" IP Count is forward looking without affecting the correctness of the Request: it represents the target quantity of IP addresses that CNS *will Assign to Pods* at some instant in time. This may include Pods scheduled which do not *currently* have Assigned IPs because there are insufficient Available IPs in the Pool. - -In this way, at any point in time, CNS may calculate what its exact IP Request should be based on the instantaneous IP demand from currently scheduled Pods on its Node. - -A concrete example: - -$$ -\displaylines{ - \text{Given: }\quad B=16\quad mf=0.5 \quad U=25 \text{ scheduled Pods}\\ - Request = 16 \times \lceil 0.5 + \frac{25}{16} \rceil\\ - Request = 16 \times \lceil 0.5 + 1.5625 \rceil\\ - Request = 16 \times \lceil 2.0625 \rceil\\ - Request = 16 \times 3 \\ - Request = 48 -} -$$ - -As shown, if the demand is for $25$ IPs, and the Batch is $16$, and the Min Free is $8$ (half of the Batch), then the Request must be $48$.
$32$ is too few, as $32-25=7 < 8$. - -This algorithm will significantly improve the time-to-pod-ready for large changes in the quantity of scheduled Pods on a Node, due to eliminating all iterations required for CNS to converge on the final Requested IP Count. - - -### Including PrimaryIPs - -The IPAM Pool scaling operates only on NC SecondaryIPs. However, CNS is allocated an additional `PrimaryIP` for every NC as a prerequisite of that NC's existence. Therefore, to align the **real allocated** IP Count to the Batch size, CNS should deduct those PrimaryIPs from its Requested (Secondary) IP Count. - -This makes the RequestedIPCount: - -$$ -RequestedIPCount = B \times \lceil mf + \frac{U}{B} \rceil - PrimaryIPCount -$$ diff --git a/docs/feature/subnet-scarcity/phase-2/3-subnetscaler.md b/docs/feature/subnet-scarcity/phase-2/3-subnetscaler.md deleted file mode 100644 index 8de81606be..0000000000 --- a/docs/feature/subnet-scarcity/phase-2/3-subnetscaler.md +++ /dev/null @@ -1,57 +0,0 @@ -## Migrating the Scaler properties to the ClusterSubnet CRD [[Phase 2 Design]](../proposal.md#2-3-scaler-properties-move-to-the-clustersubnet-crd) -Currently, the [`v1alpha/NodeNetworkConfig` contains the Scaler inputs](https://github.com/Azure/azure-container-networking/blob/eae2389f888468e3b863cb28045ba613a5562360/crd/nodenetworkconfig/api/v1alpha/nodenetworkconfig.go#L66-L72) which CNS will use to scale the local IPAM pool: - -```yaml -... -status: - scaler: - batchSize: X - releaseThresholdPercent: X - requestThresholdPercent: X - maxIPCount: X -``` -Since the Scaler values are dependent on the state of the Subnet, the Scaler object will be moved to the ClusterSubnet CRD and optimized.
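The scaling formulas above can be exercised directly. This sketch shows only the arithmetic, not CNS's actual Pool Monitor:

```go
package main

import (
	"fmt"
	"math"
)

// request implements Request = B × ⌈mf + U/B⌉.
func request(B int64, mf float64, U int64) int64 {
	return B * int64(math.Ceil(mf+float64(U)/float64(B)))
}

// requestedIPCount additionally deducts the per-NC PrimaryIPs, so that the
// real allocated IP count stays aligned to the Batch size.
func requestedIPCount(B int64, mf float64, U int64, primaryIPs int64) int64 {
	return request(B, mf, U) - primaryIPs
}

func main() {
	// The worked example: B=16, mf=0.5, U=25 scheduled Pods.
	fmt.Println(request(16, 0.5, 25)) // 48
	// With one NC (and so one PrimaryIP), the Secondary IP request is one less.
	fmt.Println(requestedIPCount(16, 0.5, 25, 1)) // 47
}
```

Because the function depends only on the instantaneous inputs, calling it repeatedly with the same usage always yields the same Request, which is exactly the idempotence the design relies on.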
-### ClusterSubnet Scaler -The ClusterSubnet `Status.Scaler` definition will be: -```diff - apiVersion: acn.azure.com/v1alpha1 - kind: ClusterSubnet - metadata: - name: subnet - namespace: kube-system - status: - exhausted: true - timestamp: 123456789 -+ scaler: -+ batch: 16 -+ buffer: 0.5 -``` - -Additionally, the `Spec` of the ClusterSubnet will accept `Scaler` values to be used as runtime overrides. DNC-RC will read and validate the `Spec`, then write the values back out to the `Status` if present. -```diff - apiVersion: acn.azure.com/v1alpha1 - kind: ClusterSubnet - metadata: - name: subnet - namespace: kube-system - spec: -+ scaler: -+ batch: 8 -+ buffer: 0.25 - status: - exhausted: true - timestamp: 123456789 -+ scaler: -+ batch: 8 -+ buffer: 0.25 -``` - -Note: -- The `scaler.maxIPCount` will not be migrated, as the maxIPCount is a property of the Node and not the Subnet. -- The `scaler.releaseThresholdPercent` will not be migrated, as it is redundant. The `buffer` (and in fact the `requestThresholdPercent`) implies a `releaseThresholdPercent`; one does not need to be specified explicitly. The [IPAM Scaling Math](../phase-2/2-scalingmath.md) incorporates only a single threshold value and fully describes the behavior of the system. - -#### Migration -When the Scaler is added to the ClusterSubnet CRD definition, DNC-RC will begin replicating the `batch` and `buffer` properties from the NodeNetworkConfig, keeping both up to date. - -CNS, which already watches the ClusterSubnet CRD for known Subnets, will use the Scaler properties from that object as a priority, and will fall back to using the NNC Scaler properties if they are not present in the ClusterSubnet.
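The priority/fallback rule can be sketched as follows; the `Scaler` struct here is illustrative, not the generated CRD Go types:

```go
package main

import "fmt"

// Scaler carries the batch and buffer properties described above.
type Scaler struct {
	Batch  int64
	Buffer float64
}

// effectiveScaler prefers the ClusterSubnet Scaler when it is present, and
// falls back to the values carried on the NodeNetworkConfig otherwise.
func effectiveScaler(fromClusterSubnet *Scaler, fromNNC Scaler) Scaler {
	if fromClusterSubnet != nil {
		return *fromClusterSubnet
	}
	return fromNNC
}

func main() {
	nnc := Scaler{Batch: 16, Buffer: 0.5}
	// No Scaler on the ClusterSubnet yet: fall back to the NNC values.
	fmt.Println(effectiveScaler(nil, nnc).Batch) // 16
	// ClusterSubnet Scaler present: it takes priority.
	fmt.Println(effectiveScaler(&Scaler{Batch: 8, Buffer: 0.25}, nnc).Batch) // 8
}
```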
diff --git a/docs/feature/subnet-scarcity/phase-3/1-watchpods.md b/docs/feature/subnet-scarcity/phase-3/1-watchpods.md deleted file mode 100644 index 18c507f872..0000000000 --- a/docs/feature/subnet-scarcity/phase-3/1-watchpods.md +++ /dev/null @@ -1,157 +0,0 @@ -## CNS watches Pods to drive IPAM scaling [[Phase 3 Design]](../proposal.md#3-1-cns-watches-pods) - -As described in [Phase 2: Scaling Math](../phase-2/2-scalingmath.md), the IPAM Pool Scaling is reactive: CNS assigns IPs out of the IPAM Pool as it is asked for them by the CNI, while trying to maintain a buffer of IPs that is within the Scaler parameters. The CNI makes IP assignment requests serially, and as CNI requests that IPs are assigned or freed, CNS makes requests to scale up or down the IPAM Pool by adjusting the Requested IP Count in the NodeNetworkConfig. If CNS is unable to honor an IP assignment request due to no free IPs, CNI returns an error to the CRI which causes the Pod sandbox to be cleaned up, and CNS will receive an IP Release request for that Pod. - -In the reactive architecture, CNS is not able to track the number of incoming Pod IP assignment requests, so it can only reliably scale by a single Batch at a time.
For example: -- At $T_0$ 1 Pod $P_0$ is scheduled: CNS has 1 Batch ( $16$ ) IPs -- At $T_1$ 35 Pods are scheduled for a Total of 36 Pods -- At $T_2$ CNI is sequentially requesting IP assignments for Pods, and for Pod $P_8$, CNS has less than $B\times mf$ unassigned IPs and requests an additional Batch of IPs -- At $T_3$ CNI requests an IP for Pod $P_{16}$ but CNS is out of free IPs and returns an error - - CRI tears down $P_{16}$, and CNI requests that CNS frees the IP for $P_{16}$ - - $P_{17-35}$ are similarly stuck, pending available IPs -- At $T_4$ CNS receives an additional Batch of IPs, and as the CNI retries they are assigned to $P_{16-31}$ -- At $T_5$ CNS has too few unassigned IPs again and requests another Batch -- At $T_6$ $P_{32-35}$ are stuck, pending available IPs -- At $T_7$ CNS receives an additional Batch of IPs, and as the CNI retries they are assigned to $P_{32-35}$ - -By proactively watching Pods instead of waiting for the CNI requests, this process could be faster and simpler: -- At $T_0$ 1 Pod $P_0$ is scheduled: CNS has 1 Batch ( $16$ ) IPs -- At $T_1$ 35 Pods are scheduled for a Total of 36 Pods -- At $T_2$ CNS sees 36 Pods have been scheduled and updates the Requested IP Count to $48$ according to the [Scaling Equation](../phase-2/2-scalingmath.md#idempotent-scaling-math) -- At $T_3$ CNS receives 48 total IPs, and as the CNI requests IP assignments they are assigned to $P_{1-35}$ - -The following details explain how this design will be accomplished while accounting for the horizontal scalability of CNS ( $N = Nodes$ ) and the load on the API Server from watching Pods ( $N \propto Nodes$ ). - -#### SharedInformers and local Caches -Kubernetes `client-go` [provides machinery for local caching](https://github.com/kubernetes/sample-controller/blob/6d1d76794eb5f951e63a46f1ad6e097c1879d81b/docs/controller-client-go.md): Reflectors, (Shared)Informers, Indexer, and Stores - -

- - - > [Image from kubernetes/sample-controller documentation](https://github.com/kubernetes/sample-controller/blob/6d1d76794eb5f951e63a46f1ad6e097c1879d81b/docs/controller-client-go.md). -

- -By leveraging this machinery, CNS will set up a `Watch` on Pods which will open a single long-lived socket connection to the API Server and will let the API Server push incremental updates. This significantly decreases the data transferred and API Server load when compared to naively polling `List` to get Pods repeatedly. - -Additionally, any read-only requests (`Get`, `List`, `Watch`) that CNS makes to Kubernetes using a cache-aware client will hit the local Cache instead of querying the remote API Server. This means that the only requests leaving CNS to the API Server for this Pod Watcher will be the Reflector's List and Watch. - -#### Server-side filtering -To further reduce API Server load and traffic, CNS can use an available [Field Selector](https://kubernetes.io/docs/concepts/overview/working-with-objects/field-selectors/) for Pods: [`spec.nodeName=`](https://github.com/kubernetes/kubernetes/blob/691d4c3989f18e0be22c4499d22eff95d516d32b/pkg/apis/core/v1/conversion.go#L40). Field selectors are, like Label Selectors, [applied on the server-side](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#list-and-watch-filtering) to List and Watch queries to reduce the dataset that is returned from the API Server to the Client. - -By restricting the Watch to Pods on the current Node, the traffic generated by the Watch will be proportional to the number of Pods on that Node, and will *not* scale in relation to either the number of Nodes in the cluster or the total number of Pods in the cluster. - -### Controller-runtime -To make setting up the filters, SharedInformers, and cache-aware client easy, we will use [`controller-runtime`](https://github.com/kubernetes-sigs/controller-runtime) and create a Pod Reconciler. A controller already exists for managing the `NodeNetworkConfig` CRD lifecycle, so the necessary infrastructure (namely, a Manager) already exists in CNS. 
- -To create a filtered Cache during the Manager instantiation, the existing `nodeScopedCache` will be expanded to include Pods: - -```go -import ( - v1 "k8s.io/api/core/v1" - "k8s.io/apimachinery/pkg/fields" - "sigs.k8s.io/controller-runtime/pkg/cache" - //... -) -//... -nodeName := "the-node-name" -// the nodeScopedCache sets Selector options on the Manager cache which are used -// to perform *server-side* filtering of the cached objects. This is very important -// for high node/pod count clusters, as it keeps us from watching objects at the -// whole cluster scope when we are only interested in our Node's scope. -nodeScopedCache := cache.BuilderWithOptions(cache.Options{ - SelectorsByObject: cache.SelectorsByObject{ - // existing options - //..., - &v1.Pod{}: { - Field: fields.SelectorFromSet(fields.Set{"spec.nodeName": nodeName}), - }, - }, -}) -//... -manager, err := ctrl.NewManager(kubeConfig, ctrl.Options{ - // existing options - //..., - NewCache: nodeScopedCache, -}) -``` - -After the local Cache and ListWatch have been set up correctly, the Reconciler should use the Manager-provided Kubernetes API Client within its event loop so that reads hit the cache instead of the real API. - -```go -import ( - "context" - - v1 "k8s.io/api/core/v1" - ctrl "sigs.k8s.io/controller-runtime" - "sigs.k8s.io/controller-runtime/pkg/client" - "sigs.k8s.io/controller-runtime/pkg/reconcile" -) - -type Reconciler struct { - client client.Client -} - -func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) { - pods := v1.PodList{} - if err := r.client.List(ctx, &pods); err != nil { - return reconcile.Result{}, err - } - // do things with the list of pods - // ... - return reconcile.Result{}, nil -} - -func (r *Reconciler) SetupWithManager(mgr ctrl.Manager) error { - r.client = mgr.GetClient() - return ctrl.NewControllerManagedBy(mgr). - For(&v1.Pod{}).
- Complete(r) -} -``` - -This can be further optimized by ignoring "Status" Updates to any Pods in the controller setup func: -```go -func (r *Reconciler) SetupWithManager(mgr ctrl.Manager) error { - r.client = mgr.GetClient() - return ctrl.NewControllerManagedBy(mgr). - For(&v1.Pod{}). - WithEventFilter(predicate.Funcs{ - // check that the generation has changed - status changes don't update generation. - UpdateFunc: func(ue event.UpdateEvent) bool { - return ue.ObjectOld.GetGeneration() != ue.ObjectNew.GetGeneration() - }, - }). - Complete(r) -} -``` - -### The updated IPAM Pool Monitor - -When CNS is watching Pods via the above mechanism, the number of Pods scheduled on the Node (after discarding `hostNetwork: true` Pods) is the instantaneous IP demand for the Node. This IP demand can be fed into the IPAM Pool scaler in place of the "Used" quantity described in the [idempotent Pool Scaling equation](../phase-2/2-scalingmath.md#idempotent-scaling-math): - -$$ -Request = B \times \lceil mf + \frac{Demand}{B} \rceil -$$ - -to immediately calculate the target Requested IP Count for the current actual Pod load. At this point, CNS can scale directly to the necessary number of IPs in a single operation proactively, as soon as Pods are scheduled on the Node, without waiting for the CNI to request IPs serially. - ---- -Note: - -The CNS memory usage may increase with this change, because it will cache a view of the Pods on its Node. The impact of this will be investigated.
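Deriving the demand from the cached Pods can be sketched as below. The `pod` struct stands in for `v1.Pod` to keep the sketch self-contained, and skipping terminal (Succeeded/Failed) Pods is an assumption of this sketch, not something specified above:

```go
package main

import "fmt"

// pod stands in for v1.Pod with only the fields the demand count needs.
type pod struct {
	HostNetwork bool
	Phase       string
}

// ipDemand counts the Pods that need a Pod-subnet IP: hostNetwork Pods use
// the Node's own IP and are discarded; terminal Pods no longer need one.
func ipDemand(pods []pod) int64 {
	var demand int64
	for _, p := range pods {
		if p.HostNetwork || p.Phase == "Succeeded" || p.Phase == "Failed" {
			continue
		}
		demand++
	}
	return demand
}

func main() {
	pods := []pod{
		{HostNetwork: false, Phase: "Running"},
		{HostNetwork: true, Phase: "Running"}, // e.g. a host-network daemonset Pod
		{HostNetwork: false, Phase: "Pending"}, // scheduled but not yet started: still demands an IP
		{HostNetwork: false, Phase: "Succeeded"},
	}
	fmt.Println(ipDemand(pods)) // 2
}
```

The resulting demand is what would be substituted for the "Used" quantity in the equation above.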
- -The CNS RBAC will need to be updated to include permission to access Pods: -```yaml -apiVersion: rbac.authorization.k8s.io/v1 -kind: ClusterRole -metadata: - name: pod-ro -rules: -- apiGroups: - - "" - verbs: - - get - - list - - watch - resources: - - pods -``` diff --git a/docs/feature/subnet-scarcity/phase-3/2-nncbeta.md b/docs/feature/subnet-scarcity/phase-3/2-nncbeta.md deleted file mode 100644 index 38ab262d78..0000000000 --- a/docs/feature/subnet-scarcity/phase-3/2-nncbeta.md +++ /dev/null @@ -1,70 +0,0 @@ -## Revising the NodeNetworkConfig to v1beta1 [[Phase 3 Design]](../proposal.md#3-2-revise-the-nnc-to-v1beta1) - -As some responsibility is shifted out of the NodeNetworkConfig (the Scaler), and the use-cases evolve, the NodeNetworkConfig needs to be updated to remain adaptable to all scenarios. Notably, to support multiple NetworkContainers per Node, the NNC should acknowledge that those may be from separate Subnets, and should map the `requestedIPCount` per known NetworkContainer. With the ClusterSubnet CRD hosting the Subnet Scaler properties, this will allow Subnets to scale independently even when they are used on the same Node. - -Since this is a significant breaking change, the NodeNetworkConfig definition must be incremented. As the spec is being incremented anyway, some additional improvements are included.
-```diff -- apiVersion: acn.azure.com/v1alpha -+ apiVersion: acn.azure.com/v1beta1 - kind: NodeNetworkConfig - metadata: - name: nodename - namespace: kube-system - spec: -- requestedIPCount: 16 -- ipsNotInUse: -+ releasedIPs: - - abc-ip-123-guid -+ secondaryIPs: -+ abc-nc-123-guid: 16 - status: -- assignedIPCount: 1 - networkContainers: - - assignmentMode: dynamic - defaultGateway: 10.241.0.1 - id: abc-nc-123-guid -- ipAssignments: -- - ip: 10.241.0.2 -- name: abc-ip-123-guid - nodeIP: 10.240.0.5 - primaryIP: 10.241.0.38 - resourceGroupID: rg-id -+ secondaryIPCount: 1 -+ secondaryIPs: -+ - address: 10.241.0.2 -+ id: abc-ip-123-guid - subcriptionID: abc-sub-123-guid - subnetAddressSpace: 10.241.0.0/16 -- subnetID: podnet - subnetName: podnet - type: vnet - version: 49 - vnetID: vnet-id -- scaler: -- batchSize: 16 -- maxIPCount: 250 -- releaseThresholdPercent: 150 -- requestThresholdPercent: 50 - status: Updating -``` - -In order: -- the GV is incremented to `acn.azure.com/v1beta1` -- the `spec.requestedIPCount` key is renamed to `spec.secondaryIPs` - - the value is changed from a single scalar to a map of `NC ID` to scalar values -- the `spec.ipsNotInUse` key is renamed to `spec.releasedIPs` -- the `status.assignedIPCount` field is moved and renamed to `status.networkContainers[].secondaryIPCount` -- the `status.networkContainers[].ipAssignments` field is renamed to `status.networkContainers[].secondaryIPs` - - the keys of the secondaryIPs are renamed from `ip` and `name` to `address` and `id` respectively -- the `status.subnetID` field is removed as a duplicate of `status.subnetName`, where both were actually the "name" and not a unique ID. -- the `status.scaler` is removed entirely - -#### Migration -This update does not _add_ information to the NodeNetworkConfig, but removes and renames some properties.
The transition will take place as follows: -1) The `v1beta1` CRD revision is created - - conversion functions are added to the NodeNetworkConfig schema which translate `v1beta1` <-> `v1alpha` (via `v1beta1` as the hub and "Storage Version"). -2) DNC-RC installs the new CRD definition and registers a conversion webhook. -3) CNS switches to `v1beta1`. - -At this time, any mutation of existing NNCs will automatically up-convert them to the `v1beta1` definition. Any client still requesting `v1alpha` will be served a down-converted representation of the NNC in a backwards-compatible fashion, and updates to that NNC will be stored in the `v1beta1` representation. diff --git a/docs/feature/subnet-scarcity/proposal.md b/docs/feature/subnet-scarcity/proposal.md deleted file mode 100644 index 71221c0974..0000000000 --- a/docs/feature/subnet-scarcity/proposal.md +++ /dev/null @@ -1,90 +0,0 @@ -# Subnet Scarcity -Dynamic SWIFT IP Overhead Reduction (aka IP Reaping) - -## Abstract -AKS clusters using Azure CNI assign VNET IPs to Pods such that those Pods are reachable on the VNET. -In Dynamic mode (SWIFT), IP addresses are reserved out of a customer-specified Pod Subnet and allocated to the cluster Nodes, and then assigned to Pods as they are created. IPs are allocated to Nodes in batches, based on the demand for Pod IPs on that Node. - -Since the IPs are allocated in batches, there is always some overhead of IPs allocated to a Node but unused by any Pod. This over-reservation of IPs from the Subnet will eventually lead to IP exhaustion in the Pod Subnet, even though the number of IPs assigned to Pods is lower than the Pod Subnet capacity. - -The intent of this feature is to reduce the IP wastage by reclaiming unassigned IPs from the Nodes as the Subnet utilization increases. - -## Background -In SWIFT, IPs are allocated to Nodes in batches $B$ according to the request for Pod IPs on that Node. CNS runs on the Node and is the IPAM for that Node.
As Pods are scheduled, the CNI requests IPs from CNS. CNS assigns IPs from its allocated IPAM Pool, and dynamically scales the pool according to utilization as follows:
-- If the unassigned IPs in the Pool fall below a threshold ( $m$ , the minimum free IPs), CNS requests a batch of IPs from DNC-RC.
-- If the unassigned IPs in the Pool exceed a threshold ( $M$ , the maximum free IPs), CNS releases a batch of IPs back to the subnet.
-
-The minimum and maximum free IPs are calculated as fractions of the Batch size. The minimum free IP quantity is the minimum free fraction ( $mf$ ) of the batch size, and the maximum free IP quantity is the maximum free fraction ( $Mf$ ) of the batch size. For convergent scaling behavior, the maximum free fraction must be at least 1 greater than the minimum free fraction.
-
-Therefore the scaling thresholds $m$ and $M$ can be described by:
-
-$$
-m = mf \times B \text{ , } M = Mf \times B \text{ , and } Mf = mf + 1
-$$
-
-For $B > 1$, this means that for a cluster of $N$ Nodes, there is at least $m \times N$ wastage of IPs at steady-state, and at most $M \times N$:
-
-$$
-m \times N \lt \text{Wasted IPs} \lt M \times N
-$$
-
-For total Subnet capacity ( $Q$ ) and reserved Subnet capacity ( $R$ ), CNS may be unable to request additional IPs, and thus Kubernetes may be unable to start additional Pods, if the Subnet's unreserved capacity is insufficient:
-
-$$
-Q - R < B
-$$
-
-In this scenario, no Node's request for IPs can be fulfilled, as there are fewer than $B$ IPs left unreserved in the Subnet. However, for any $B>1$, the Reserved capacity is not the actual count of assigned Pod IPs, and unassigned IPs could be reclaimed from Nodes which have reserved them and reallocated to Nodes which need them to provide assignable capacity.
-
-Thus, to allow true full utilization of all usable IPs in the Pod Subnet, these parameters (primarily $B$) need to be tuned at runtime according to the ongoing subnet utilization.
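The threshold math above can be sketched as follows. This is a minimal illustration, not CNS's actual implementation, and the function and parameter names are my own:

```python
def pool_thresholds(batch_size: int, min_free_fraction: float) -> tuple[float, float]:
    """Return (m, M): the min/max free-IP thresholds for a Node's IPAM pool."""
    m = min_free_fraction * batch_size        # m = mf * B
    M = (min_free_fraction + 1) * batch_size  # M = Mf * B, with Mf = mf + 1
    return m, M

def steady_state_wastage_bounds(batch_size: int, min_free_fraction: float,
                                nodes: int) -> tuple[float, float]:
    """Cluster-wide wasted-IP bounds at steady state: m*N < wasted < M*N."""
    m, M = pool_thresholds(batch_size, min_free_fraction)
    return m * nodes, M * nodes
```

For example, with the Scaler values shown in the NNC example earlier (`batchSize: 16`, request/release thresholds of 50%/150%, i.e. $mf = 0.5$), a 10-Node cluster holds between 80 and 240 unassigned-but-reserved IPs at steady state.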
-
-## Solutions and Complications
-The following solutions are proposed to address the IP wastage and reclaim unassigned IPs from Nodes.
-
-### Phase 1
-Subnet utilization is cached by DNC; exhaustion is calculated by DNC-RC, which writes it to a ClusterSubnetState CRD that is read by CNS to trigger the release of IPs.
-
-#### [[1-1]](phase-1/1-subnetstate.md) Subnet utilization is cached by DNC
-DNC (which maintains the state of the Subnet in its database) will cache the reserved IP count $R$
-per Subnet. DNC will also expose an API to query $R$ of the Subnet, the `SubnetState` API.
-
-#### [[1-2]](phase-1/2-exhaustion.md) Subnet Exhaustion is calculated by DNC-RC
-DNC-RC will poll DNC's SubnetState API on a fixed interval to check the Subnet Utilization. If the Subnet Utilization crosses its configurable upper or lower threshold, RC will consider that Subnet exhausted or un-exhausted, respectively, and will write the exhaustion state to the ClusterSubnetState CRD.
-
-#### [[1-3]](phase-1/3-releaseips.md) IPs are released by CNS
-CNS will watch the `ClusterSubnetState` CRD, scaling down and releasing IPs when the Subnet is marked as Exhausted.
-
-### Phase 2
-IPs are not assigned to a new Node until CNS requests them, allowing Nodes to start safely even in very constrained subnets. CNS scaling math is improved, and CNS Scaler properties come from the ClusterSubnetState CRD instead of the NodeNetworkConfig CRD.
-
-#### [[2-1]](phase-2/1-emptync.md) DNC-RC creates NCs with no Secondary IPs
-DNC-RC will create the NNC for a new Node with an initial IP Request of 0. An empty NC (containing a Primary, but no Secondary IPs) will be created via normal DNC API calls. The empty NC will be written to the NNC, allowing CNS to start. CNS will make the initial IP request according to the Subnet Exhaustion State.
-
-DNC-RC will continue to poll the `SubnetState` API periodically to check the Subnet utilization, and write the exhaustion state to the `ClusterSubnetState` CRD.
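The exhaustion calculation described in [1-2] is a simple hysteresis: the state only flips when utilization crosses the far threshold, so it does not oscillate while utilization hovers near a single cutoff. A minimal sketch (illustrative names, not DNC-RC's actual code):

```python
def next_exhaustion_state(exhausted: bool, reserved: int, capacity: int,
                          lower: float, upper: float) -> bool:
    """Compute the next Subnet exhaustion state with hysteresis.

    lower and upper are threshold fractions of capacity, with lower < upper.
    """
    if not exhausted and reserved > upper * capacity:
        return True   # crossed the upper threshold: Subnet becomes exhausted
    if exhausted and reserved < lower * capacity:
        return False  # fell below the lower threshold: Subnet recovers
    return exhausted  # otherwise the state is sticky
```

For example, with capacity 256 and thresholds of 50%/90%, an exhausted Subnet stays exhausted until reservations drop below 128, even though it only became exhausted above ~230.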
-
-#### [[2-2]](phase-2/2-scalingmath.md) CNS scales IPAM pool idempotently
-Instead of increasing/decreasing the Pool size by 1 Batch at a time to try to satisfy the min/max free IP constraints, CNS will calculate the correct target Requested IP Count in a single idempotent O(1) calculation.
-
-This idempotent Pool scaling formula is:
-
-$$
-Request = B \times \lceil mf + \frac{U}{B} \rceil
-$$
-
-where $U$ is the number of Assigned (Used) IPs on the Node.
-
-CNS will count the NC Primary IP(s) among the IPs that it has been allocated, and will subtract them from its real Requested IP Count so that the _total_ number of IPs allocated to CNS is a multiple of the Batch.
-
-#### [[2-3]](phase-2/3-subnetscaler.md) Scaler properties move to the ClusterSubnetState CRD
-The Scaler properties from the v1alpha NodeNetworkConfig `Status.Scaler` definition move to the ClusterSubnetState CRD; CNS will prefer the Scaler from this CRD when it is available, and fall back to the NNC Scaler otherwise. The `.Spec` field of the CRD may serve as an "overrides" location for runtime reconfiguration.
-
-### Phase 3
-CNS watches Pods and adjusts the Secondary IP Count immediately in reaction to Pod IP demand changes. The NNC is revised to cut weight and prepare for the dynamic batch size (or multi-NC) future.
-
-#### [[3-1]](phase-3/1-watchpods.md) CNS watches Pods
-CNS will Watch for Pod events on its Node, and use the number of scheduled Pods to calculate the target Requested IP Count.
-
-#### [[3-2]](phase-3/2-nncbeta.md) Revise the NNC to v1beta1
-With the Scaler migration in [[Phase 2-3]](#2-3-scaler-properties-move-to-the-clustersubnetstate-crd), the NodeNetworkConfig will be revised to remove the Scaler object and slim down the schema.
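The idempotent scaling formula from [2-2] can be sketched directly. This is a minimal illustration under the formula as written (names are my own; it omits the Primary-IP subtraction described above):

```python
import math

def target_requested_ip_count(batch_size: int, min_free_fraction: float,
                              used: int) -> int:
    """Request = B * ceil(mf + U/B).

    This is the smallest multiple of the Batch size that still leaves at
    least mf * B free IPs above the currently Assigned (Used) count, so
    recomputing it from the same inputs always yields the same answer.
    """
    return batch_size * math.ceil(min_free_fraction + used / batch_size)
```

For example, with $B = 16$ and $mf = 0.5$: 8 Used IPs yield a request of 16, while 9 Used IPs yield 32, since a request of 16 would leave only 7 free IPs, below the minimum of 8.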