Proposal: Subnet Scarcity Phase 3 detailed design #1645
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,157 @@ | ||
| ## CNS watches Pods to drive IPAM scaling [[Phase 3 Design]](../proposal.md#3-1-cns-watches-pods) | ||
|
|
||
| As described in [Phase 2: Scaling Math](../phase-2/2-scalingmath.md), IPAM Pool scaling is reactive: CNS assigns IPs out of the IPAM Pool as the CNI asks for them, while trying to maintain a buffer of IPs that is within the Scaler parameters. The CNI makes IP assignment requests serially, and as the CNI requests that IPs be assigned or freed, CNS requests that the IPAM Pool be scaled up or down by adjusting the Requested IP Count in the NodeNetworkConfig. If CNS is unable to honor an IP assignment request because it has no free IPs, the CNI returns an error to the CRI, which causes the Pod sandbox to be cleaned up, and CNS receives an IP Release request for that Pod. | ||
|
|
||
| In the reactive architecture, CNS cannot track the number of incoming Pod IP assignment requests, so it can only reliably scale by a single Batch at a time. For example: | ||
| - At $T_0$ 1 Pod $P_0$ is scheduled: CNS has 1 Batch ( $16$ ) IPs | ||
| - At $T_1$ 35 Pods are scheduled for a Total of 36 Pods | ||
| - At $T_2$ CNI is sequentially requesting IP assignments for Pods, and for Pod $P_8$, CNS has fewer than $B\times mf$ unassigned IPs and requests an additional Batch of IPs | ||
| - At $T_3$ CNI requests an IP for Pod $P_{16}$ but CNS is out of free IPs and returns an error | ||
| - CRI tears down $P_{16}$, and CNI requests that CNS free the IP for $P_{16}$ | ||
| - $P_{17-35}$ are similarly stuck, pending available IPs | ||
| - At $T_4$ CNS receives an additional Batch of IPs, and as the CNI retries they are assigned to $P_{16-31}$ | ||
| - At $T_5$ CNS has too few unassigned IPs again and requests another Batch | ||
| - At $T_6$ $P_{32}$ is stuck, pending available IPs | ||
| - At $T_7$ CNS receives an additional Batch of IPs, and as the CNI retries they are assigned to $P_{32-35}$ | ||
|
|
||
| By proactively watching Pods instead of waiting for the CNI requests, this process could be faster and simpler: | ||
| - At $T_0$ 1 Pod $P_0$ is scheduled: CNS has 1 Batch ( $16$ ) IPs | ||
| - At $T_1$ 35 Pods are scheduled for a Total of 36 Pods | ||
| - At $T_2$ CNS sees 36 Pods have been scheduled and updates the Requested IP Count to $48$ according to the [Scaling Equation](../phase-2/2-scalingmath.md#idempotent-scaling-math) | ||
| - At $T_3$ CNS receives 48 total IPs, and as the CNI requests IP assignments they are assigned to $P_{1-35}$ | ||
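
As a worked check of the $48$ in the proactive example, assume a Batch size $B = 16$ and $mf = 0.5$. The value of $mf$ is not stated in this section; it is inferred from the reactive example above, where CNS requests a new Batch once fewer than $B \times mf = 8$ IPs remain unassigned. With a Demand of 36 Pods:

$$
Request = B \times \lceil mf + \frac{Demand}{B} \rceil = 16 \times \lceil 0.5 + \frac{36}{16} \rceil = 16 \times \lceil 2.75 \rceil = 48
$$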
|
|
||
| The following details explain how this design will be accomplished while accounting for the horizontal scalability of CNS ( $N = Nodes$ ) and the load on the API Server from watching Pods ( $N \propto Nodes$ ). | ||
|
|
||
| #### SharedInformers and local Caches | ||
| Kubernetes `client-go` [provides machinery for local caching](https://github.com/kubernetes/sample-controller/blob/6d1d76794eb5f951e63a46f1ad6e097c1879d81b/docs/controller-client-go.md): Reflectors, (Shared)Informers, Indexers, and Stores. | ||
|
|
||
| <p align="center"> | ||
| <img src="https://raw.githubusercontent.com/kubernetes/sample-controller/6d1d76794eb5f951e63a46f1ad6e097c1879d81b/docs/images/client-go-controller-interaction.jpeg" height="600" width="700"/> | ||
|
|
||
| > [Image from kubernetes/sample-controller documentation](https://github.com/kubernetes/sample-controller/blob/6d1d76794eb5f951e63a46f1ad6e097c1879d81b/docs/controller-client-go.md). | ||
| </p> | ||
|
|
||
| By leveraging this machinery, CNS will set up a `Watch` on Pods which will open a single long-lived socket connection to the API Server and will let the API Server push incremental updates. This significantly decreases the data transferred and API Server load when compared to naively polling `List` to get Pods repeatedly. | ||
|
|
||
| Additionally, any read-only requests (`Get`, `List`, `Watch`) that CNS makes to Kubernetes using a cache-aware client will hit the local Cache instead of querying the remote API Server. This means that the only requests leaving CNS to the API Server for this Pod Watcher will be the Reflector's List and Watch. | ||
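
To illustrate why a Watch is cheaper than repeated Lists, the following is a minimal sketch of a local cache that is seeded once from a List and then kept current by applying incremental Watch events. The `Event` and `Store` types are hypothetical stand-ins written for this example; they are not the `client-go` types, which handle resyncs, resource versions, and indexing in addition to this basic idea.

```go
package main

import "fmt"

// EventType mirrors the event type names used by Kubernetes Watch streams.
type EventType string

const (
	Added    EventType = "ADDED"
	Modified EventType = "MODIFIED"
	Deleted  EventType = "DELETED"
)

// Event is a hypothetical, simplified Watch event: an object key plus an
// opaque value standing in for the object itself.
type Event struct {
	Type  EventType
	Key   string
	Value string
}

// Store is a hypothetical local cache keyed by "namespace/name".
type Store struct {
	items map[string]string
}

// NewStore seeds the cache from the result of a single initial List.
func NewStore(listed map[string]string) *Store {
	s := &Store{items: make(map[string]string, len(listed))}
	for k, v := range listed {
		s.items[k] = v
	}
	return s
}

// Apply folds one incremental Watch event into the cache.
func (s *Store) Apply(e Event) {
	switch e.Type {
	case Added, Modified:
		s.items[e.Key] = e.Value
	case Deleted:
		delete(s.items, e.Key)
	}
}

// Get reads from the local cache only; it never contacts a server.
func (s *Store) Get(key string) (string, bool) {
	v, ok := s.items[key]
	return v, ok
}

// Len reports how many objects are currently cached.
func (s *Store) Len() int { return len(s.items) }

func main() {
	s := NewStore(map[string]string{"default/a": "v1"})
	s.Apply(Event{Type: Added, Key: "default/b", Value: "v1"})
	s.Apply(Event{Type: Modified, Key: "default/a", Value: "v2"})
	s.Apply(Event{Type: Deleted, Key: "default/b"})
	fmt.Println(s.Len())
}
```

After the initial List, every read in this sketch is answered from the map, and only the changed objects cross the network.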
|
|
||
| #### Server-side filtering | ||
| To further reduce API Server load and traffic, CNS can use an available [Field Selector](https://kubernetes.io/docs/concepts/overview/working-with-objects/field-selectors/) for Pods: [`spec.nodeName=<node>`](https://github.com/kubernetes/kubernetes/blob/691d4c3989f18e0be22c4499d22eff95d516d32b/pkg/apis/core/v1/conversion.go#L40). Field selectors are, like Label Selectors, [applied on the server-side](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#list-and-watch-filtering) to List and Watch queries to reduce the dataset that is returned from the API Server to the Client. | ||
|
|
||
| By restricting the Watch to Pods on the current Node, the traffic generated by the Watch will be proportional to the number of Pods on that Node, and will *not* scale in relation to either the number of Nodes in the cluster or the total number of Pods in the cluster. | ||
|
|
||
| ### Controller-runtime | ||
| To make setting up the filters, SharedInformers, and cache-aware client easy, we will use [`controller-runtime`](https://github.com/kubernetes-sigs/controller-runtime) and create a Pod Reconciler. A controller already exists for managing the `NodeNetworkConfig` CRD lifecycle, so the necessary infrastructure (namely, a Manager) already exists in CNS. | ||
|
|
||
| To create a filtered Cache during the Manager instantiation, the existing `nodeScopedCache` will be expanded to include Pods: | ||
|
|
||
| ```go | ||
| import ( | ||
| v1 "k8s.io/api/core/v1" | ||
| "k8s.io/apimachinery/pkg/fields" | ||
| ctrl "sigs.k8s.io/controller-runtime" | ||
| "sigs.k8s.io/controller-runtime/pkg/cache" | ||
| //... | ||
| ) | ||
| //... | ||
| nodeName := "the-node-name" | ||
| // the nodeScopedCache sets Selector options on the Manager cache which are used | ||
| // to perform *server-side* filtering of the cached objects. This is very important | ||
| // for high node/pod count clusters, as it keeps us from watching objects at the | ||
| // whole cluster scope when we are only interested in our Node's scope. | ||
| nodeScopedCache := cache.BuilderWithOptions(cache.Options{ | ||
| SelectorsByObject: cache.SelectorsByObject{ | ||
| // existing options | ||
| //..., | ||
| &v1.Pod{}: { | ||
| Field: fields.SelectorFromSet(fields.Set{"spec.nodeName": nodeName}), | ||
| }, | ||
| }, | ||
| }) | ||
| //... | ||
| manager, err := ctrl.NewManager(kubeConfig, ctrl.Options{ | ||
| // existing options | ||
| //..., | ||
| NewCache: nodeScopedCache, | ||
| }) | ||
| ``` | ||
|
|
||
| After the local Cache and ListWatch have been set up correctly, the Reconciler should use the Manager-provided Kubernetes API Client within its event loop so that reads hit the cache instead of the real API. | ||
|
|
||
| ```go | ||
| import ( | ||
| "context" | ||
|
|
||
| v1 "k8s.io/api/core/v1" | ||
| ctrl "sigs.k8s.io/controller-runtime" | ||
| "sigs.k8s.io/controller-runtime/pkg/client" | ||
| "sigs.k8s.io/controller-runtime/pkg/reconcile" | ||
| ) | ||
|
|
||
| type Reconciler struct { | ||
| client client.Client | ||
| } | ||
|
|
||
| func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) { | ||
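| pods := v1.PodList{} | ||
| // List is served from the node-scoped cache; return the error so the request is requeued. | ||
| if err := r.client.List(ctx, &pods); err != nil { | ||
| return reconcile.Result{}, err | ||
| } | ||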
| // do things with the list of pods | ||
| // ... | ||
| return reconcile.Result{}, nil | ||
| } | ||
|
|
||
| func (r *Reconciler) SetupWithManager(mgr ctrl.Manager) error { | ||
| r.client = mgr.GetClient() | ||
| return ctrl.NewControllerManagedBy(mgr). | ||
| For(&v1.Pod{}). | ||
| Complete(r) | ||
| } | ||
| ``` | ||
|
|
||
| This can be further optimized by ignoring "Status" Updates to any Pods in the controller setup func: | ||
| ```go | ||
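| // In addition to the imports above, this requires | ||
| // "sigs.k8s.io/controller-runtime/pkg/event" and "sigs.k8s.io/controller-runtime/pkg/predicate". | ||
| func (r *Reconciler) SetupWithManager(mgr ctrl.Manager) error { | ||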
| r.client = mgr.GetClient() | ||
| return ctrl.NewControllerManagedBy(mgr). | ||
| For(&v1.Pod{}). | ||
| WithEventFilter(predicate.Funcs{ | ||
| // check that the generation has changed - status changes don't update generation. | ||
| UpdateFunc: func(ue event.UpdateEvent) bool { | ||
| return ue.ObjectOld.GetGeneration() != ue.ObjectNew.GetGeneration() | ||
| }, | ||
| }). | ||
| Complete(r) | ||
| } | ||
| ``` | ||
|
|
||
| ### The updated IPAM Pool Monitor | ||
|
|
||
| When CNS is watching Pods via the above mechanism, the number of Pods scheduled on the Node (after discarding `hostNetwork: true` Pods) is the instantaneous IP demand for the Node. This IP demand can be fed into the IPAM Pool scaler in place of the "Used" quantity described in the [idempotent Pool Scaling equation](../phase-2/2-scalingmath.md#idempotent-scaling-math): | ||
|
|
||
| $$ | ||
| Request = B \times \lceil mf + \frac{Demand}{B} \rceil | ||
| $$ | ||
|
|
||
| to immediately calculate the target Requested IP Count for the current actual Pod load. At this point, CNS can proactively scale directly to the necessary number of IPs in a single operation, as soon as Pods are scheduled on the Node, without waiting for the CNI to request IPs serially. | ||
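
The calculation above can be sketched in a few lines of Go. This is an illustration only: the `Pod` struct is a hypothetical, minimal stand-in for the cached `v1.Pod`, the function names are invented for this example, and the values $B = 16$ and $mf = 0.5$ are assumptions taken from the earlier example rather than fixed constants.

```go
package main

import (
	"fmt"
	"math"
)

// Pod is a hypothetical, trimmed-down view of a cached Pod on this Node.
type Pod struct {
	Name        string
	HostNetwork bool
}

// IPDemand counts the Pods that need an IP from the IPAM Pool.
// hostNetwork Pods share the Node's network and are discarded.
func IPDemand(pods []Pod) int {
	demand := 0
	for _, p := range pods {
		if !p.HostNetwork {
			demand++
		}
	}
	return demand
}

// RequestedIPCount implements Request = B * ceil(mf + Demand/B).
func RequestedIPCount(demand, batch int, mf float64) int {
	return batch * int(math.Ceil(mf+float64(demand)/float64(batch)))
}

func main() {
	// 36 regular Pods plus 2 hostNetwork Pods scheduled on the Node.
	pods := make([]Pod, 0, 38)
	for i := 0; i < 36; i++ {
		pods = append(pods, Pod{Name: fmt.Sprintf("p-%d", i)})
	}
	pods = append(pods, Pod{Name: "host-a", HostNetwork: true}, Pod{Name: "host-b", HostNetwork: true})

	demand := IPDemand(pods)
	fmt.Println(demand, RequestedIPCount(demand, 16, 0.5)) // prints "36 48"
}
```

Because the result depends only on the current Demand, recomputing it on every Pod event is idempotent: repeated reconciles for the same set of Pods yield the same Requested IP Count.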
|
|
||
| --- | ||
| Note: | ||
|
|
||
| The CNS memory usage may increase with this change, because it will cache a view of the Pods on its Node. The impact of this will be investigated. | ||
|
||
|
|
||
| The CNS RBAC will need to be updated to include permission to access Pods: | ||
| ```yaml | ||
| apiVersion: rbac.authorization.k8s.io/v1 | ||
| kind: ClusterRole | ||
| metadata: | ||
| name: pod-ro | ||
| rules: | ||
| - apiGroups: | ||
| - "" | ||
| verbs: | ||
| - get | ||
| - list | ||
| - watch | ||
| resources: | ||
| - pods | ||
| ``` | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,70 @@ | ||
| ## Revising the NodeNetworkConfig to v1beta1 [[Phase 3 Design]](../proposal.md#3-2-revise-the-nnc-to-v1beta1) | ||
|
|
||
| As some responsibility is shifted out of the NodeNetworkConfig (the Scaler), and the use-cases evolve, the NodeNetworkConfig needs to be updated to remain adaptable to all scenarios. Notably, to support multiple NetworkContainers per Node, the NNC should acknowledge that those may be from separate Subnets, and should map the `requestedIPCount` per known NetworkContainer. With the ClusterSubnet CRD hosting the Subnet Scaler properties, this will allow Subnets to scale independently even when they are used on the same Node. | ||
|
|
||
| Since this is a significant breaking change, the NodeNetworkConfig version must be incremented. Because the spec is being revised anyway, some additional improvements are included. | ||
|
|
||
| ```diff | ||
| - apiVersion: acn.azure.com/v1alpha | ||
| + apiVersion: acn.azure.com/v1beta1 | ||
| kind: NodeNetworkConfig | ||
| metadata: | ||
| name: nodename | ||
| namespace: kube-system | ||
| spec: | ||
| - requestedIPCount: 16 | ||
| - ipsNotInUse: | ||
| + releasedIPs: | ||
| - abc-ip-123-guid | ||
|
||
| + secondaryIPs: | ||
| + abc-nc-123-guid: 16 | ||
| status: | ||
| - assignedIPCount: 1 | ||
| networkContainers: | ||
| - assignmentMode: dynamic | ||
| defaultGateway: 10.241.0.1 | ||
| id: abc-nc-123-guid | ||
| - ipAssignments: | ||
| - - ip: 10.241.0.2 | ||
| - name: abc-ip-123-guid | ||
| nodeIP: 10.240.0.5 | ||
| primaryIP: 10.241.0.38 | ||
| resourceGroupID: rg-id | ||
| + secondaryIPCount: 1 | ||
| + secondaryIPs: | ||
| + - address: 10.241.0.2 | ||
| + id: abc-ip-123-guid | ||
| subcriptionID: abc-sub-123-guid | ||
| subnetAddressSpace: 10.241.0.0/16 | ||
| - subnetID: podnet | ||
| subnetName: podnet | ||
| type: vnet | ||
| version: 49 | ||
| vnetID: vnet-id | ||
| - scaler: | ||
| - batchSize: 16 | ||
| - maxIPCount: 250 | ||
| - releaseThresholdPercent: 150 | ||
| - requestThresholdPercent: 50 | ||
| status: Updating | ||
| ``` | ||
|
|
||
| In order: | ||
| - the GV is incremented to `acn.azure.com/v1beta1` | ||
| - the `spec.requestedIPCount` key is renamed to `spec.secondaryIPs` | ||
| - the value is changed from a single scalar to a map of `NC ID` to scalar values | ||
| - the `spec.ipsNotInUse` key is renamed to `spec.releasedIPs` | ||
| - the `status.assignedIPCount` field is moved and renamed to `status.networkContainers[].secondaryIPCount` | ||
| - the `status.networkContainers[].ipAssignments` field is renamed to `status.networkContainers[].secondaryIPs` | ||
| - the keys of the secondaryIPs are renamed from `ip` and `name` to `address` and `id` respectively | ||
| - the `status.subnetID` field is removed as a duplicate of `status.subnetName`, where both were actually the "name" and not a unique ID. | ||
| - the `status.scaler` is removed entirely | ||
|
|
||
| #### Migration | ||
| This update does not _add_ information to the NodeNetworkConfig, but removes and renames some properties. The transition will take place as follows: | ||
| 1) The `v1beta1` CRD revision is created | ||
| - conversion functions are added to the NodeNetworkConfig schema which translate `v1beta1` <-> `v1alpha` (via `v1beta1` as the hub and "Storage Version"). | ||
| 2) DNC-RC installs the new CRD definition and registers a conversion webhook. | ||
| 3) CNS switches to `v1beta1`. | ||
|
|
||
| At this time, any mutation of existing NNCs will automatically up-convert them to the `v1beta1` definition. Any client still requesting `v1alpha` will still be served a down-converted representation of the NNC in a backwards-compatible fashion, and updates to that NNC will be stored in the `v1beta1` representation. | ||
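
The up-conversion described above could look roughly like the following sketch. All types here are hypothetical, heavily trimmed stand-ins for the real CRD Go types, and the function is not the actual conversion code. In particular, mapping the scalar `requestedIPCount` onto a single NetworkContainer is an assumption made for this example; how the real conversion handles zero or multiple NetworkContainers is not specified in this document.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical v1alpha shapes, reduced to the fields that change.
type IPAssignmentAlpha struct{ IP, Name string }

type NCAlpha struct {
	ID            string
	IPAssignments []IPAssignmentAlpha
}

type NNCAlpha struct {
	RequestedIPCount  int
	IPsNotInUse       []string
	AssignedIPCount   int
	NetworkContainers []NCAlpha
}

// Hypothetical v1beta1 shapes, reduced to the fields that change.
type SecondaryIP struct{ Address, ID string }

type NCBeta struct {
	ID               string
	SecondaryIPCount int
	SecondaryIPs     []SecondaryIP
}

type NNCBeta struct {
	SecondaryIPs      map[string]int // NC ID -> requested secondary IP count
	ReleasedIPs       []string
	NetworkContainers []NCBeta
}

// ConvertUp translates the trimmed v1alpha shape into the v1beta1 shape.
// Assumption: the v1alpha object has exactly one NetworkContainer.
func ConvertUp(in NNCAlpha) (NNCBeta, error) {
	if len(in.NetworkContainers) != 1 {
		return NNCBeta{}, errors.New("sketch handles exactly one NetworkContainer")
	}
	nc := in.NetworkContainers[0]
	betaNC := NCBeta{ID: nc.ID, SecondaryIPCount: in.AssignedIPCount}
	for _, a := range nc.IPAssignments {
		// ip -> address, name -> id
		betaNC.SecondaryIPs = append(betaNC.SecondaryIPs, SecondaryIP{Address: a.IP, ID: a.Name})
	}
	return NNCBeta{
		SecondaryIPs:      map[string]int{nc.ID: in.RequestedIPCount},
		ReleasedIPs:       append([]string(nil), in.IPsNotInUse...),
		NetworkContainers: []NCBeta{betaNC},
	}, nil
}

func main() {
	out, err := ConvertUp(NNCAlpha{
		RequestedIPCount: 16,
		IPsNotInUse:      []string{"abc-ip-123-guid"},
		AssignedIPCount:  1,
		NetworkContainers: []NCAlpha{{
			ID:            "abc-nc-123-guid",
			IPAssignments: []IPAssignmentAlpha{{IP: "10.241.0.2", Name: "abc-ip-123-guid"}},
		}},
	})
	fmt.Println(out, err)
}
```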