From 09a60f2b0f6073e6162be08c4030d6a4b0e8831e Mon Sep 17 00:00:00 2001 From: Evan Baker Date: Tue, 4 Oct 2022 23:29:18 +0000 Subject: [PATCH 1/4] feature proposal: subnet scarcity phase 3 Signed-off-by: Evan Baker --- .../subnet-scarcity/phase-2/1-emptync.md | 2 +- .../subnet-scarcity/phase-3/1-watchpods.md | 157 ++++++++++++++++++ .../subnet-scarcity/phase-3/2-subnetscaler.md | 32 ++++ docs/feature/subnet-scarcity/proposal.md | 8 + 4 files changed, 198 insertions(+), 1 deletion(-) create mode 100644 docs/feature/subnet-scarcity/phase-3/1-watchpods.md create mode 100644 docs/feature/subnet-scarcity/phase-3/2-subnetscaler.md diff --git a/docs/feature/subnet-scarcity/phase-2/1-emptync.md b/docs/feature/subnet-scarcity/phase-2/1-emptync.md index e90361245a..ca244e49a2 100644 --- a/docs/feature/subnet-scarcity/phase-2/1-emptync.md +++ b/docs/feature/subnet-scarcity/phase-2/1-emptync.md @@ -44,7 +44,7 @@ Due to the division of responsibilities, it is possible for this flow to deadloc If the Subnet becomes exhausted _after_ the Node Reconciler loop has set the initial Requested IP count and the NodeNetworkConfig Reconciler is unable to honor the request, the NetworkContainer will never be written to the NNC Status. This Status update is what indicates to CNS that the Network is ready (enough for it to start). In this scenario, no running components can safely update the Request IP Count to get it within the constraints of the Subnet, and the NNC Status will never be updated. CNS will get no IPs, and no Pods can run on that Node. -#### Solution: Create NetworkContainer with no SecondaryIPs when creating NodeNetworkConfig +### Solution: Create NetworkContainer with no SecondaryIPs when creating NodeNetworkConfig Instead of creating the NodeNetworkConfig with a Requested IP count of $B$, the NodeController will create NodeNetworkConfigs with a Requested IP count of $0$. 
The NodeNetworkController will create an NC Request with only single Primary IP and zero Secondary IPs for the initial create, and will write the empty NC to the NodeNetworkConfig Status. This skeleton NC in the NNC Status will be enough to signal to CNS to start the IPAM loop, and CNS will be able to iteratively adjust the Requested IP Count based on the current Subnet Exhaustion State at any time, as it does at steady state already. diff --git a/docs/feature/subnet-scarcity/phase-3/1-watchpods.md b/docs/feature/subnet-scarcity/phase-3/1-watchpods.md new file mode 100644 index 0000000000..55071e9dc4 --- /dev/null +++ b/docs/feature/subnet-scarcity/phase-3/1-watchpods.md @@ -0,0 +1,157 @@ +## CNS watches Pods to drive IPAM scaling [[Phase 3 Design]](../proposal.md#3-1-cns-watches-pods) + +As described in [Phase 2: Scaling Math](../phase-2/2-scalingmath.md), the IPAM Pool Scaling is reactive: CNS assigns IPs out of the IPAM Pool as it is asked for them by the CNI, while trying to maintain a buffer of IPs that is within the Scaler parameters. The CNI makes IP assignment requests serially, and as CNI requests that IPs are assigned or freed, CNS makes requests to scale up or down the IPAM Pool by adjusting the Requested IP Count in the NodeNetworkConfig. If CNS is unable to honor an IP assignment request due to no free IPs, CNI returns an error to the CRI which causes the Pod sandbox to be cleaned up, and CNS will receive an IP Release request for that Pod. + +In the reactive architecture, CNS is not able to track the number of incoming Pod IP assignment requests, so CNS can only reliably scale by a single Batch at a time. 
For example: +- At $T_0$ 1 Pod $P_0$ is scheduled: CNS has 1 Batch ($16$) IPs +- At $T_1$ 35 Pods are scheduled for a Total of 36 Pods +- At $T_2$ CNI is sequentially requesting IP assignments for Pods, and for Pod $P_8$, CNS has less than $B\times mf$ unassigned IPs and requests an additional Batch of IPs +- At $T_3$ CNI requests an IP for Pod $P_{16}$ but CNS is out of free IPs and returns an error + - CRI tears down $P_{16}$, and CNI requests that CNS frees the IP for $P_{16}$ + - $P_{17-36}$ are similarly stuck, pending available IPs +- At $T_4$ CNS receives an additional Batch of IPs, and as the CNI retries they are assigned to $P_{16-31}$ +- At $T_5$ CNS has too few unassigned IPs again and requests another Batch +- At $T_6$ $P_{32}$ is stuck, pending available IPs +- At $T_7$ CNS receives an additional Batch of IPs, and as the CNI retries they are assigned to $P_{32-35}$ + +By proactively watching Pods instead of waiting for the CNI requests, this process could be faster and simpler: +- At $T_0$ 1 Pod $P_0$ is scheduled: CNS has 1 Batch ($16$) IPs +- At $T_1$ 35 Pods are scheduled for a Total of 36 Pods +- At $T_2$ CNS sees 36 Pods have been scheduled and updates the Requested IP Count to $48$ according to the [Scaling Equation](../phase-2/2-scalingmath.md#idempotent-scaling-math) +- At $T_3$ CNS receives 48 total IPs, and as the CNI requests IP assignments they are assigned to $P_{1-35}$ + +The following details explain how this design will be accomplished while accounting for the horizontal scalability of CNS ($N = Nodes$) and the load on the API Server from watching Pods ($N \propto Nodes$). + +#### SharedInformers and local Caches +Kubernetes `client-go` [provides machinery for local caching](https://github.com/kubernetes/sample-controller/blob/6d1d76794eb5f951e63a46f1ad6e097c1879d81b/docs/controller-client-go.md): Reflectors, (Shared)Informers, Indexer, and Stores + +

+ + + > [Image from kubernetes/sample-controller documentation](https://github.com/kubernetes/sample-controller/blob/6d1d76794eb5f951e63a46f1ad6e097c1879d81b/docs/controller-client-go.md). +

+ +By leveraging this machinery, CNS will set up a `Watch` on Pods which will open a single long-lived socket connection to the API Server and will let the API Server push incremental updates. This significantly decreases the data transferred and API Server load when compared to naively polling `List` to get Pods repeatedly. + +Additionally, any read-only requests (`Get`, `List`, `Watch`) that CNS makes to Kubernetes using a cache-aware client will hit the local Cache instead of querying the remote API Server. This means that the only requests leaving CNS to the API Server for this Pod Watcher will be the Reflector's List and Watch. + +#### Server-side filtering +To further reduce API Server load and traffic, CNS can use an available [Field Selector](https://kubernetes.io/docs/concepts/overview/working-with-objects/field-selectors/) for Pods: [`spec.nodeName=`](https://github.com/kubernetes/kubernetes/blob/691d4c3989f18e0be22c4499d22eff95d516d32b/pkg/apis/core/v1/conversion.go#L40). Field selectors are, like Label Selectors, [applied on the server-side](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#list-and-watch-filtering) to List and Watch queries to reduce the dataset that is returned from the API Server to the Client. + +By restricting the Watch to Pods on the current Node, the traffic generated by the Watch will be proportional to the number of Pods on that Node, and will *not* scale in relation to either the number of Nodes in the cluster or the total number of Pods in the cluster. + +### Controller-runtime +To make setting up the filters, SharedInformers, and cache-aware client easy, we will use [`controller-runtime`](https://github.com/kubernetes-sigs/controller-runtime) and create a Pod Reconciler. A controller already exists for managing the `NodeNetworkConfig` CRD lifecycle, so the necessary infrastructure (namely, a Manager) already exists in CNS. 
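To make the server-side filtering described above concrete, the following is a minimal sketch of how a `spec.nodeName` Field Selector travels to the API Server as a query parameter on the Pod List/Watch request. The helper name `podWatchPath` and the node name `node-1` are hypothetical and only for illustration; in CNS this request would be issued by the Reflector inside `client-go`, not built by hand.

```go
package main

import (
	"fmt"
	"net/url"
)

// podWatchPath (hypothetical helper) builds the request path for a Watch on
// Pods that is restricted to a single Node. The fieldSelector is evaluated by
// the API Server, so only Pods bound to nodeName are sent to the client.
func podWatchPath(nodeName string) string {
	q := url.Values{}
	q.Set("fieldSelector", "spec.nodeName="+nodeName)
	q.Set("watch", "true")
	// Encode sorts the keys and escapes the "=" inside the selector value.
	return "/api/v1/pods?" + q.Encode()
}

func main() {
	fmt.Println(podWatchPath("node-1"))
}
```

Because the selector is part of the request itself, the size of the response stream depends only on the Pods bound to that Node.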
+ +To create a filtered Cache during the Manager instantiation, the existing `nodeScopedCache` will be expanded to include Pods: + +```go +import ( + v1 "k8s.io/api/core/v1" + "k8s.io/apimachinery/pkg/fields" + "sigs.k8s.io/controller-runtime/pkg/cache" + //... +) +//... +nodeName := "the-node-name" +// the nodeScopedCache sets Selector options on the Manager cache which are used +// to perform *server-side* filtering of the cached objects. This is very important +// for high node/pod count clusters, as it keeps us from watching objects at the +// whole cluster scope when we are only interested in our Node's scope. +nodeScopedCache := cache.BuilderWithOptions(cache.Options{ + SelectorsByObject: cache.SelectorsByObject{ + // existing options + //..., + &v1.Pod{}: { + Field: fields.SelectorFromSet(fields.Set{"spec.nodeName": nodeName}), + }, + }, +}) +//... +manager, err := ctrl.NewManager(kubeConfig, ctrl.Options{ + // existing options + //..., + NewCache: nodeScopedCache, +}) +``` + +After the local Cache and ListWatch have been set up correctly, the Reconciler should use the Manager-provided Kubernetes API Client within its event loop so that reads hit the cache instead of the real API. + +```go +import ( + "context" + + v1 "k8s.io/api/core/v1" + ctrl "sigs.k8s.io/controller-runtime" + "sigs.k8s.io/controller-runtime/pkg/client" + "sigs.k8s.io/controller-runtime/pkg/reconcile" +) + +type Reconciler struct { + client client.Client +} + +func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) { + pods := v1.PodList{} + // List is served from the node-scoped local cache; return the error so the request is retried. + if err := r.client.List(ctx, &pods); err != nil { + return reconcile.Result{}, err + } + // do things with the list of pods + // ... + return reconcile.Result{}, nil +} + +func (r *Reconciler) SetupWithManager(mgr ctrl.Manager) error { + r.client = mgr.GetClient() + return ctrl.NewControllerManagedBy(mgr). + For(&v1.Pod{}). 
+ Complete(r) +} +``` + +This can be further optimized by ignoring "Status" Updates to any Pods in the controller setup func: +```go +// additionally imports sigs.k8s.io/controller-runtime/pkg/event and sigs.k8s.io/controller-runtime/pkg/predicate +func (r *Reconciler) SetupWithManager(mgr ctrl.Manager) error { + r.client = mgr.GetClient() + return ctrl.NewControllerManagedBy(mgr). + For(&v1.Pod{}). + WithEventFilter(predicate.Funcs{ + // check that the generation has changed - status changes don't update generation. + UpdateFunc: func(ue event.UpdateEvent) bool { + return ue.ObjectOld.GetGeneration() != ue.ObjectNew.GetGeneration() + }, + }). + Complete(r) +} +``` + +### The updated IPAM Pool Monitor + +When CNS is watching Pods via the above mechanism, the number of Pods scheduled on the Node (after discarding `hostNetwork: true` Pods) is the instantaneous IP demand for the Node. This IP demand can be fed into the IPAM Pool scaler in place of the "Used" quantity described in the [idempotent Pool Scaling equation](../phase-2/2-scalingmath.md#idempotent-scaling-math): + +$$ +Request = B \times \lceil mf + \frac{Demand}{B} \rceil +$$ + +to immediately calculate the target Requested IP Count for the current actual Pod load. At this point, CNS can scale directly to the necessary number of IPs in a single operation proactively, as soon as Pods are scheduled on the Node, without waiting for the CNI to request IPs serially. + +--- +Note: + +The CNS memory usage may increase with this change, because it will cache a view of the Pods on its Node. The impact of this will be investigated. 
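As a rough illustration of the Request equation in this section, the sketch below evaluates it for the walkthrough's numbers: a Batch of 16 and 36 scheduled Pods. The function name `requestedIPCount` is hypothetical, and the value $mf = 0.5$ is an assumption chosen to be consistent with the walkthrough, where the extra Batch is requested at Pod $P_8$; it is not a statement of how CNS implements the Pool Monitor.

```go
package main

import (
	"fmt"
	"math"
)

// requestedIPCount (hypothetical name) evaluates
//   Request = B * ceil(mf + Demand/B)
// where batch is B, minFree is mf expressed as a fraction of a Batch, and
// demand is the number of non-hostNetwork Pods scheduled on the Node.
func requestedIPCount(demand, batch int, minFree float64) int {
	batches := math.Ceil(minFree + float64(demand)/float64(batch))
	return batch * int(batches)
}

func main() {
	// B = 16 and Demand = 36 come from the walkthrough; mf = 0.5 is assumed.
	fmt.Println(requestedIPCount(36, 16, 0.5))
}
```

Under these assumptions the result is 48, matching the Requested IP Count at $T_2$ in the proactive walkthrough. The calculation depends only on the current demand, so repeating it for the same Pod count yields the same request.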
+ +The CNS RBAC will need to be updated to include permission to access Pods: +```yaml +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRole +metadata: + name: pod-ro + namespace: kube-system +rules: +- apiGroups: + - "" + verbs: + - get + - list + - watch + resources: + - pods +``` diff --git a/docs/feature/subnet-scarcity/phase-3/2-subnetscaler.md b/docs/feature/subnet-scarcity/phase-3/2-subnetscaler.md new file mode 100644 index 0000000000..193f3e984f --- /dev/null +++ b/docs/feature/subnet-scarcity/phase-3/2-subnetscaler.md @@ -0,0 +1,32 @@ +## Migrating the Scaler properties to the ClusterSubnet CRD [[Phase 3 Design]](../proposal.md#3-3-scaler-properties-move-to-the-clustersubnet-crd) +Currently, the [`v1alpha/NodeNetworkConfig` contains the Scaler inputs](https://github.com/Azure/azure-container-networking/blob/eae2389f888468e3b863cb28045ba613a5562360/crd/nodenetworkconfig/api/v1alpha/nodenetworkconfig.go#L66-L72) which CNS will use to scale the local IPAM pool: + +```yaml +... +status: + scaler: + batchSize: X + releaseThresholdPercent: X + requestThresholdPercent: X + maxIPCount: X +``` +Since the Scaler values are dependent on the state of the Subnet, the Scaler object will be moved to the ClusterSubnet CRD and optimized. + +### ClusterSubnet Scaler +The ClusterSubnet Scaler definition will be: +```yaml +... +status: + scaler: + batch: X // equal to batchSize + buffer: X // equal to requestThresholdPercent +``` + +Note: +- The `scaler.maxIPCount` will not be migrated, as the maxIPCount is a property of the Node and not the Subnet. +- The `scaler.releaseThresholdPercent` will not be migrated, as it is redundant. The `buffer` (and in fact the `requestThresholdPercent`), imply a `releaseThresholdPercent` and one does not need to be specified explicitly. The [IPAM Scaling Math](../phase-2/2-scalingmath.md) incorporates only a single threshold value and fully describes the behavior of the system. 
+ +#### Migration +When the Scaler is added to the ClusterSubnet CRD definiton, DNC-RC will begin replicating the `batch` and `buffer` properties from the NodeNetworkConfig, keeping both up to date. + +CNS, which already watches the ClusterSubnet CRD for known Subnets, will use the Scaler properties from that object as a priority, and will fall back to using the NNC Scaler properties if they are not present in the ClusterSubnet. diff --git a/docs/feature/subnet-scarcity/proposal.md b/docs/feature/subnet-scarcity/proposal.md index a79e9d6061..b94538c00a 100644 --- a/docs/feature/subnet-scarcity/proposal.md +++ b/docs/feature/subnet-scarcity/proposal.md @@ -77,3 +77,11 @@ CNS will include the NC Primary IP(s) as IPs that it has been allocated, and wil #### [[2-3]](phase-2/3-subnetscaler.md) Scaler properties move to the ClusterSubnet CRD The Scaler properties from the v1alpha/NodeNetworkConfig `Status.Scaler` definition are moved to the ClusterSubnet CRD, and CNS will use the Scaler from this CRD as priority when it is available, and fall back to the NNC Scaler otherwise. The `.Spec` field of the CRD may serve as an "overrides" location for runtime reconfiguration. + +### Phase 3 +#### [[3-1]](phase-3/1-watchpods.md) CNS watches Pods +CNS will Watch for Pod events on its Node, and use the number of scheduled Pods to calculate the target Requested IP Count. + + +#### [[3-2]](phase-3/3-subnetscaler.md) Scaler properties move to the ClusterSubnet CRD +The Scaler properties from the v1alpha/NodeNetworkConfig `Status.Scaler` definition are moved to the ClusterSubnet CRD, and CNS will use the Scaler from this CRD as priority when it is available, and fall back to the NNC Scaler otherwise. 
From 5cf4aff3acee98d9cef1a784297a33a86d0bcea1 Mon Sep 17 00:00:00 2001 From: Evan Baker Date: Tue, 18 Oct 2022 00:20:34 +0000 Subject: [PATCH 2/4] revise the nodenetworkconfig Signed-off-by: Evan Baker --- .../subnet-scarcity/phase-3/2-nncbeta.md | 70 +++++++++++++++++++ .../subnet-scarcity/phase-3/2-subnetscaler.md | 32 --------- docs/feature/subnet-scarcity/proposal.md | 9 ++- 3 files changed, 76 insertions(+), 35 deletions(-) create mode 100644 docs/feature/subnet-scarcity/phase-3/2-nncbeta.md delete mode 100644 docs/feature/subnet-scarcity/phase-3/2-subnetscaler.md diff --git a/docs/feature/subnet-scarcity/phase-3/2-nncbeta.md b/docs/feature/subnet-scarcity/phase-3/2-nncbeta.md new file mode 100644 index 0000000000..f4f6d398f6 --- /dev/null +++ b/docs/feature/subnet-scarcity/phase-3/2-nncbeta.md @@ -0,0 +1,70 @@ +## Revising the NodeNetworkConfig to v1beta1 [[Phase 3 Design]](../proposal.md#3-2-revise-the-nnc-to-v1beta1) + +As some responsibility is shifted out of the NodeNetworkConfig (the Scaler), and the use-cases evolve, the NodeNetworkConfig needs to be updated to remain adaptable to all scenarios. Notably, to support multiple NetworkContainers per Node, the NNC should acknowledge that those may be from separate Subnets, and should map the `requestedIPCount` per known NetworkContainer. With the ClusterSubnet CRD hosting the Subnet Scaler properties, this will allow Subnets to scale independently even when they are used on the same Node. + +Since this is a significant breaking change, the NodeNetworkConfig definition must be incremented. Since the spec is being incremented, some additional improvements are included. 
+ +```diff +- apiVersion: acn.azure.com/v1alpha ++ apiVersion: acn.azure.com/v1beta1 + kind: NodeNetworkConfig + metadata: + name: nodename + namespace: kube-system + spec: +- requestedIPCount: 16 +- ipsNotInUse: ++ releasedIPs: + - abc-ip-123-guid ++ secondaryIPs: ++ abc-nc-123-guid: 16 + status: +- assignedIPCount: 1 + networkContainers: + - assignmentMode: dynamic + defaultGateway: 10.241.0.1 + id: abc-nc-123-guid +- ipAssignments: +- - ip: 10.241.0.2 +- name: abc-ip-123-guid + nodeIP: 10.240.0.5 + primaryIP: 10.241.0.38 + resourceGroupID: rg-id ++ secondaryIPCount: 1 ++ secondaryIPs: ++ - address: 10.241.0.2 ++ id: abc-ip-123-guid + subcriptionID: abc-sub-123-guid + subnetAddressSpace: 10.241.0.0/16 + subnetID: podnet +- subnetName: podnet + type: vnet + version: 49 + vnetID: vnet-id +- scaler: +- batchSize: 16 +- maxIPCount: 250 +- releaseThresholdPercent: 150 +- requestThresholdPercent: 50 + status: Updating +``` + +In order: +- the GV is incremented to `acn.azure.com/v1beta1` +- the `spec.requestedIPCount` key is renamed to `spec.secondaryIPs` + - the value is changed from a single scalar to a map of `NC ID` to scalar values +- the `spec.ipsNotInUse` key is renamed to `spec.releasedIPs` +- the `status.assignedIPCount` field is moved and renamed to `status.networkContainers[].secondaryIPCount` +- the `status.networkContainers[].ipAssignments` field is renamed to `status.networkContainers[].secondaryIPs` + - the keys of the secondaryIPs are renamed from `ip` and `name` to `address` and `id` respectively +- the `status.subnetName` fields is removed as a duplicate of `status.subnetID` +- the `status.scaler` is removed entirely + +#### Migration +This update does not _add_ information to the NodeNetworkConfig, but removes and renames some properties. 
The transition will take place as follows: +1) The `v1beta1` CRD revision is created + - conversion functions are added to the NodeNetworkConfig schema which translate `v1beta1` <-> `v1alpha` (via `v1beta1` as the hub and "Storage Version"). +2) DNC-RC installs the new CRD definition and registers a conversion webhook. +3) CNS switches to `v1beta1`. + +At this time, any mutation of existing NNCs will automatically up-convert them to the `v1beta1` definition. Any client still requesting `v1alpha` will still be served a down-converted representation of the NNC in a backwards-compatible fashion, and updates to that NNC will be stored in the `v1beta1` representation. diff --git a/docs/feature/subnet-scarcity/phase-3/2-subnetscaler.md b/docs/feature/subnet-scarcity/phase-3/2-subnetscaler.md deleted file mode 100644 index 193f3e984f..0000000000 --- a/docs/feature/subnet-scarcity/phase-3/2-subnetscaler.md +++ /dev/null @@ -1,32 +0,0 @@ -## Migrating the Scaler properties to the ClusterSubnet CRD [[Phase 3 Design]](../proposal.md#3-3-scaler-properties-move-to-the-clustersubnet-crd) -Currently, the [`v1alpha/NodeNetworkConfig` contains the Scaler inputs](https://github.com/Azure/azure-container-networking/blob/eae2389f888468e3b863cb28045ba613a5562360/crd/nodenetworkconfig/api/v1alpha/nodenetworkconfig.go#L66-L72) which CNS will use to scale the local IPAM pool: - -```yaml -... -status: - scaler: - batchSize: X - releaseThresholdPercent: X - requestThresholdPercent: X - maxIPCount: X -``` -Since the Scaler values are dependent on the state of the Subnet, the Scaler object will be moved to the ClusterSubnet CRD and optimized. - -### ClusterSubnet Scaler -The ClusterSubnet Scaler definition will be: -```yaml -... -status: - scaler: - batch: X // equal to batchSize - buffer: X // equal to requestThresholdPercent -``` - -Note: -- The `scaler.maxIPCount` will not be migrated, as the maxIPCount is a property of the Node and not the Subnet. 
-- The `scaler.releaseThresholdPercent` will not be migrated, as it is redundant. The `buffer` (and in fact the `requestThresholdPercent`), imply a `releaseThresholdPercent` and one does not need to be specified explicitly. The [IPAM Scaling Math](../phase-2/2-scalingmath.md) incorporates only a single threshold value and fully describes the behavior of the system. - -#### Migration -When the Scaler is added to the ClusterSubnet CRD definiton, DNC-RC will begin replicating the `batch` and `buffer` properties from the NodeNetworkConfig, keeping both up to date. - -CNS, which already watches the ClusterSubnet CRD for known Subnets, will use the Scaler properties from that object as a priority, and will fall back to using the NNC Scaler properties if they are not present in the ClusterSubnet. diff --git a/docs/feature/subnet-scarcity/proposal.md b/docs/feature/subnet-scarcity/proposal.md index b94538c00a..71221c0974 100644 --- a/docs/feature/subnet-scarcity/proposal.md +++ b/docs/feature/subnet-scarcity/proposal.md @@ -55,7 +55,7 @@ DNC-RC will poll DNC's SubnetState API on a fixed interval to check the Subnet U CNS will watch the `ClusterSubnet` CRD, scaling down and releasing IPs when the Subnet is marked as Exhausted. ### Phase 2 -The batch size $B$ is dynamically adjusted based on the current subnet utilization. The batch size is increased when the subnet utilization is low, and decreased when the subnet utilization is high. IPs are not assigned to a new Node until CNS requests them, allowing Nodes to start safely even in very constrained subnets. +IPs are not assigned to a new Node until CNS requests them, allowing Nodes to start safely even in very constrained subnets. CNS scaling math is improved, and CNS Scaler properties come from the ClusterSubnet CRD instead of the NodeNetworkConfig CRD. #### [[2-1]](phase-2/1-emptync.md) DNC-RC creates NCs with no Secondary IPs DNC-RC will create the NNC for a new Node with an initial IP Request of 0. 
An empty NC (containing a Primary, but no Secondary IPs) will be created via normal DNC API calls. The empty NC will be written to the NNC, allowing CNS to start. CNS will make the initial IP request according to the Subnet Exhaustion State. @@ -79,9 +79,12 @@ CNS will include the NC Primary IP(s) as IPs that it has been allocated, and wil The Scaler properties from the v1alpha/NodeNetworkConfig `Status.Scaler` definition are moved to the ClusterSubnet CRD, and CNS will use the Scaler from this CRD as priority when it is available, and fall back to the NNC Scaler otherwise. The `.Spec` field of the CRD may serve as an "overrides" location for runtime reconfiguration. ### Phase 3 +CNS watches Pods and adjusts the SecondaryIP Count immediately in reaction to Pod IP demand changes. The NNC is revised to cut weight and prepare for the dynamic batch size (or multi-nc) future. + + #### [[3-1]](phase-3/1-watchpods.md) CNS watches Pods CNS will Watch for Pod events on its Node, and use the number of scheduled Pods to calculate the target Requested IP Count. -#### [[3-2]](phase-3/3-subnetscaler.md) Scaler properties move to the ClusterSubnet CRD -The Scaler properties from the v1alpha/NodeNetworkConfig `Status.Scaler` definition are moved to the ClusterSubnet CRD, and CNS will use the Scaler from this CRD as priority when it is available, and fall back to the NNC Scaler otherwise. +#### [[3-2]](phase-3/2-nncbeta.md) Revise the NNC to v1beta1 +With the Scaler migration in [[Phase 2-3]](#2-3-scaler-properties-move-to-the-clustersubnet-crd), the NodeNetworkConfig will be revised to remove this object and optimize. 
From 7dad2d989760305581ed0c8c3708db8c4a37a539 Mon Sep 17 00:00:00 2001 From: Evan Baker Date: Wed, 19 Oct 2022 19:36:37 +0000 Subject: [PATCH 3/4] fix rendering of math in parenths Signed-off-by: Evan Baker --- docs/feature/subnet-scarcity/phase-3/1-watchpods.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/feature/subnet-scarcity/phase-3/1-watchpods.md b/docs/feature/subnet-scarcity/phase-3/1-watchpods.md index 55071e9dc4..18c507f872 100644 --- a/docs/feature/subnet-scarcity/phase-3/1-watchpods.md +++ b/docs/feature/subnet-scarcity/phase-3/1-watchpods.md @@ -3,7 +3,7 @@ As described in [Phase 2: Scaling Math](../phase-2/2-scalingmath.md), the IPAM Pool Scaling is reactive: CNS assigns IPs out of the IPAM Pool as it is asked for them by the CNI, while trying to maintain a buffer of IPs that is within the Scaler parameters. The CNI makes IP assignment requests serially, and as CNI requests that IPs are assigned or freed, CNS makes requests to scale up or down the IPAM Pool by adjusting the Requested IP Count in the NodeNetworkConfig. If CNS is unable to honor an IP assignment request due to no free IPs, CNI returns an error to the CRI which causes the Pod sandbox to be cleaned up, and CNS will receive an IP Release request for that Pod. In the reactive architecture, CNS is not able to track the number of incoming Pod IP assignment requests, so CNS can only reliably scale by a single Batch at a time. 
For example: -- At $T_0$ 1 Pod $P_0$ is scheduled: CNS has 1 Batch ($16$) IPs +- At $T_0$ 1 Pod $P_0$ is scheduled: CNS has 1 Batch ( $16$ ) IPs - At $T_1$ 35 Pods are scheduled for a Total of 36 Pods - At $T_2$ CNI is sequentially requesting IP assignments for Pods, and for Pod $P_8$, CNS has less than $B\times mf$ unassigned IPs and requests an additional Batch of IPs - At $T_3$ CNI requests an IP for Pod $P_{16}$ but CNS is out of free IPs and returns an error @@ -15,12 +15,12 @@ In the reactive architecture, CNS is not able to track the number of incoming Po - At $T_7$ CNS receives an additional Batch of IPs, and as the CNI retries they are assigned to $P_{32-35}$ By proactively watching Pods instead of waiting for the CNI requests, this process could be faster and simpler: -- At $T_0$ 1 Pod $P_0$ is scheduled: CNS has 1 Batch ($16$) IPs +- At $T_0$ 1 Pod $P_0$ is scheduled: CNS has 1 Batch ( $16$ ) IPs - At $T_1$ 35 Pods are scheduled for a Total of 36 Pods - At $T_2$ CNS sees 36 Pods have been scheduled and updates the Requested IP Count to $48$ according to the [Scaling Equation](../phase-2/2-scalingmath.md#idempotent-scaling-math) - At $T_3$ CNS receives 48 total IPs, and as the CNI requests IP assignments they are assigned to $P_{1-35}$ -The following details explain how this design will be accomplished while accounting for the horizontal scalability of CNS ($N = Nodes$) and the load on the API Server from watching Pods ($N \propto Nodes$). +The following details explain how this design will be accomplished while accounting for the horizontal scalability of CNS ( $N = Nodes$ ) and the load on the API Server from watching Pods ( $N \propto Nodes$ ). 
#### SharedInformers and local Caches Kubernetes `client-go` [provides machinery for local caching](https://github.com/kubernetes/sample-controller/blob/6d1d76794eb5f951e63a46f1ad6e097c1879d81b/docs/controller-client-go.md): Reflectors, (Shared)Informers, Indexer, and Stores From 06be2a26f924e46518abefea11314ffb89ac586a Mon Sep 17 00:00:00 2001 From: Evan Baker Date: Tue, 25 Oct 2022 00:30:52 +0000 Subject: [PATCH 4/4] drop subnetID instead of subnetName Signed-off-by: Evan Baker --- docs/feature/subnet-scarcity/phase-3/2-nncbeta.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/feature/subnet-scarcity/phase-3/2-nncbeta.md b/docs/feature/subnet-scarcity/phase-3/2-nncbeta.md index f4f6d398f6..38ab262d78 100644 --- a/docs/feature/subnet-scarcity/phase-3/2-nncbeta.md +++ b/docs/feature/subnet-scarcity/phase-3/2-nncbeta.md @@ -36,8 +36,8 @@ Since this is a significant breaking change, the NodeNetworkConfig definition mu + id: abc-ip-123-guid subcriptionID: abc-sub-123-guid subnetAddressSpace: 10.241.0.0/16 - subnetID: podnet -- subnetName: podnet +- subnetID: podnet + subnetName: podnet type: vnet version: 49 vnetID: vnet-id @@ -57,7 +57,7 @@ In order: - the `status.assignedIPCount` field is moved and renamed to `status.networkContainers[].secondaryIPCount` - the `status.networkContainers[].ipAssignments` field is renamed to `status.networkContainers[].secondaryIPs` - the keys of the secondaryIPs are renamed from `ip` and `name` to `address` and `id` respectively -- the `status.subnetName` fields is removed as a duplicate of `status.subnetID` +- the `status.subnetID` field is removed as a duplicate of `status.subnetName`, where both were actually the "name" and not a unique ID. - the `status.scaler` is removed entirely #### Migration