2 changes: 1 addition & 1 deletion docs/feature/subnet-scarcity/phase-2/1-emptync.md
@@ -44,7 +44,7 @@ Due to the division of responsibilities, it is possible for this flow to deadlock

If the Subnet becomes exhausted _after_ the Node Reconciler loop has set the initial Requested IP count and the NodeNetworkConfig Reconciler is unable to honor the request, the NetworkContainer will never be written to the NNC Status. This Status update is what indicates to CNS that the Network is ready (enough for it to start). In this scenario, no running component can safely update the Requested IP Count to bring it within the constraints of the Subnet, and the NNC Status will never be updated. CNS will get no IPs, and no Pods can run on that Node.

#### Solution: Create NetworkContainer with no SecondaryIPs when creating NodeNetworkConfig
### Solution: Create NetworkContainer with no SecondaryIPs when creating NodeNetworkConfig

Instead of creating the NodeNetworkConfig with a Requested IP count of $B$, the NodeController will create NodeNetworkConfigs with a Requested IP count of $0$. The NodeNetworkController will create an NC Request with only a single Primary IP and zero Secondary IPs for the initial create, and will write the empty NC to the NodeNetworkConfig Status. This skeleton NC in the NNC Status will be enough to signal to CNS to start the IPAM loop, and CNS will be able to iteratively adjust the Requested IP Count based on the current Subnet Exhaustion State at any time, as it does at steady state already.

157 changes: 157 additions & 0 deletions docs/feature/subnet-scarcity/phase-3/1-watchpods.md
@@ -0,0 +1,157 @@
## CNS watches Pods to drive IPAM scaling [[Phase 3 Design]](../proposal.md#3-1-cns-watches-pods)

As described in [Phase 2: Scaling Math](../phase-2/2-scalingmath.md), the IPAM Pool Scaling is reactive: CNS assigns IPs out of the IPAM Pool as it is asked for them by the CNI, while trying to maintain a buffer of IPs that is within the Scaler parameters. The CNI makes IP assignment requests serially, and as the CNI requests that IPs are assigned or freed, CNS makes requests to scale the IPAM Pool up or down by adjusting the Requested IP Count in the NodeNetworkConfig. If CNS is unable to honor an IP assignment request because it has no free IPs, CNI returns an error to the CRI, which causes the Pod sandbox to be cleaned up, and CNS will receive an IP Release request for that Pod.

In the reactive architecture, CNS is not able to track the number of incoming Pod IP assignment requests, so it can only reliably scale by a single Batch at a time. For example:
- At $T_0$ 1 Pod $P_0$ is scheduled: CNS has 1 Batch ( $16$ ) IPs
- At $T_1$ 35 Pods are scheduled for a Total of 36 Pods
- At $T_2$ CNI is sequentially requesting IP assignments for Pods, and for Pod $P_8$, CNS has fewer than $B\times mf$ unassigned IPs and requests an additional Batch of IPs
- At $T_3$ CNI requests an IP for Pod $P_{16}$ but CNS is out of free IPs and returns an error
- CRI tears down $P_{16}$, and CNI requests that CNS frees the IP for $P_{16}$
- $P_{17-36}$ are similarly stuck, pending available IPs
- At $T_4$ CNS receives an additional Batch of IPs, and as the CNI retries they are assigned to $P_{16-31}$
- At $T_5$ CNS has too few unassigned IPs again and requests another Batch
- At $T_6$ $P_{32}$ is stuck, pending available IPs
- At $T_7$ CNS receives an additional Batch of IPs, and as the CNI retries they are assigned to $P_{32-35}$

If CNS proactively watches Pods instead of waiting for CNI requests, this process becomes faster and simpler:
- At $T_0$ 1 Pod $P_0$ is scheduled: CNS has 1 Batch ( $16$ ) IPs
- At $T_1$ 35 Pods are scheduled for a Total of 36 Pods
- At $T_2$ CNS sees 36 Pods have been scheduled and updates the Requested IP Count to $48$ according to the [Scaling Equation](../phase-2/2-scalingmath.md#idempotent-scaling-math)
- At $T_3$ CNS receives 48 total IPs, and as the CNI requests IP assignments they are assigned to $P_{1-35}$
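The $T_2$ jump to $48$ follows directly from the [Scaling Equation](../phase-2/2-scalingmath.md#idempotent-scaling-math). Assuming a minimum-free multiplier of $mf = 0.5$ (an illustrative value; the actual parameter comes from the Scaler):

$$
Request = B \times \left\lceil mf + \frac{Demand}{B} \right\rceil = 16 \times \left\lceil 0.5 + \frac{36}{16} \right\rceil = 16 \times 3 = 48
$$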

The following details explain how this design will be accomplished while accounting for the horizontal scalability of CNS ( $N = Nodes$ ) and the load on the API Server from watching Pods ( $N \propto Nodes$ ).

#### SharedInformers and local Caches
Kubernetes `client-go` [provides machinery for local caching](https://github.com/kubernetes/sample-controller/blob/6d1d76794eb5f951e63a46f1ad6e097c1879d81b/docs/controller-client-go.md): Reflectors, (Shared)Informers, Indexers, and Stores.

<p align="center">
<img src="https://raw.githubusercontent.com/kubernetes/sample-controller/6d1d76794eb5f951e63a46f1ad6e097c1879d81b/docs/images/client-go-controller-interaction.jpeg" height="600" width="700"/>

> [Image from kubernetes/sample-controller documentation](https://github.com/kubernetes/sample-controller/blob/6d1d76794eb5f951e63a46f1ad6e097c1879d81b/docs/controller-client-go.md).
</p>

By leveraging this machinery, CNS will set up a `Watch` on Pods which will open a single long-lived socket connection to the API Server and will let the API Server push incremental updates. This significantly decreases the data transferred and API Server load when compared to naively polling `List` to get Pods repeatedly.

Additionally, any read-only requests (`Get`, `List`, `Watch`) that CNS makes to Kubernetes using a cache-aware client will hit the local Cache instead of querying the remote API Server. This means that the only requests leaving CNS to the API Server for this Pod Watcher will be the Reflector's List and Watch.

#### Server-side filtering
To further reduce API Server load and traffic, CNS can use an available [Field Selector](https://kubernetes.io/docs/concepts/overview/working-with-objects/field-selectors/) for Pods: [`spec.nodeName=<node>`](https://github.com/kubernetes/kubernetes/blob/691d4c3989f18e0be22c4499d22eff95d516d32b/pkg/apis/core/v1/conversion.go#L40). Field selectors are, like Label Selectors, [applied on the server-side](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/#list-and-watch-filtering) to List and Watch queries to reduce the dataset that is returned from the API Server to the Client.

By restricting the Watch to Pods on the current Node, the traffic generated by the Watch will be proportional to the number of Pods on that Node, and will *not* scale in relation to either the number of Nodes in the cluster or the total number of Pods in the cluster.

### Controller-runtime
To make setting up the filters, SharedInformers, and cache-aware client easy, we will use [`controller-runtime`](https://github.com/kubernetes-sigs/controller-runtime) and create a Pod Reconciler. A controller already exists for managing the `NodeNetworkConfig` CRD lifecycle, so the necessary infrastructure (namely, a Manager) already exists in CNS.

To create a filtered Cache during the Manager instantiation, the existing `nodeScopedCache` will be expanded to include Pods:

```go
import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/fields"
	"sigs.k8s.io/controller-runtime/pkg/cache"
	//...
)

//...
nodeName := "the-node-name"
// The nodeScopedCache sets Selector options on the Manager cache which are used
// to perform *server-side* filtering of the cached objects. This is very important
// for high node/pod count clusters, as it keeps us from watching objects at the
// whole cluster scope when we are only interested in our Node's scope.
nodeScopedCache := cache.BuilderWithOptions(cache.Options{
	SelectorsByObject: cache.SelectorsByObject{
		// existing options
		//...,
		&v1.Pod{}: {
			Field: fields.SelectorFromSet(fields.Set{"spec.nodeName": nodeName}),
		},
	},
})
//...
manager, err := ctrl.NewManager(kubeConfig, ctrl.Options{
	// existing options
	//...,
	NewCache: nodeScopedCache,
})
```

After the local Cache and ListWatch have been set up correctly, the Reconciler should use the Manager-provided Kubernetes API client within its event loop so that reads hit the cache instead of the real API Server.

```go
import (
	"context"

	v1 "k8s.io/api/core/v1"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

type Reconciler struct {
	client client.Client
}

func (r *Reconciler) Reconcile(ctx context.Context, req reconcile.Request) (reconcile.Result, error) {
	// This List is served from the local cache, not the API Server.
	pods := v1.PodList{}
	if err := r.client.List(ctx, &pods); err != nil {
		return reconcile.Result{}, err
	}
	// do things with the list of pods
	// ...
	return reconcile.Result{}, nil
}

func (r *Reconciler) SetupWithManager(mgr ctrl.Manager) error {
	r.client = mgr.GetClient()
	return ctrl.NewControllerManagedBy(mgr).
		For(&v1.Pod{}).
		Complete(r)
}
```

This can be further optimized by ignoring "Status" Updates to any Pods in the controller setup func:
```go
import (
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

func (r *Reconciler) SetupWithManager(mgr ctrl.Manager) error {
	r.client = mgr.GetClient()
	return ctrl.NewControllerManagedBy(mgr).
		For(&v1.Pod{}).
		WithEventFilter(predicate.Funcs{
			// Only reconcile when the generation has changed - status-only
			// changes don't update the generation.
			UpdateFunc: func(ue event.UpdateEvent) bool {
				return ue.ObjectOld.GetGeneration() != ue.ObjectNew.GetGeneration()
			},
		}).
		Complete(r)
}
```

### The updated IPAM Pool Monitor

When CNS is watching Pods via the above mechanism, the number of Pods scheduled on the Node (after discarding `hostNetwork: true` Pods) is the instantaneous IP demand for the Node. This IP demand can be fed into the IPAM Pool scaler in place of the "Used" quantity described in the [idempotent Pool Scaling equation](../phase-2/2-scalingmath.md#idempotent-scaling-math):

$$
Request = B \times \lceil mf + \frac{Demand}{B} \rceil
$$

to immediately calculate the target Requested IP Count for the current actual Pod load. At this point, CNS can proactively scale directly to the necessary number of IPs in a single operation, as soon as Pods are scheduled on the Node, without waiting for the CNI to request IPs serially.
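As a minimal, self-contained sketch (the `pod` struct and function names are illustrative stand-ins, not CNS's actual types, and $mf = 0.5$ is an assumed Scaler value), the demand count and the resulting Requested IP Count could be computed as:

```go
package main

import (
	"fmt"
	"math"
)

// pod is a hypothetical, minimal stand-in for the cached Pod view.
type pod struct {
	hostNetwork bool
}

// demand counts the Pods that need a Pod-subnet IP: every scheduled Pod
// except hostNetwork Pods, which use the Node's own IP.
func demand(pods []pod) int64 {
	var n int64
	for _, p := range pods {
		if !p.hostNetwork {
			n++
		}
	}
	return n
}

// requestedIPCount implements the idempotent scaling equation:
// Request = B * ceil(mf + Demand/B)
func requestedIPCount(batch int64, minFree float64, demand int64) int64 {
	return batch * int64(math.Ceil(minFree+float64(demand)/float64(batch)))
}

func main() {
	// The worked example above: B = 16, mf = 0.5, 36 Pods scheduled.
	fmt.Println(requestedIPCount(16, 0.5, 36)) // prints 48
}
```

Because the equation is idempotent, feeding it the same demand repeatedly yields the same request, so CNS can recompute it on every Pod event without tracking deltas.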

---
Note:

The CNS memory usage may increase with this change, because it will cache a view of the Pods on its Node. The impact of this will be investigated.

The CNS RBAC will need to be updated to include permission to access Pods:
```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  # ClusterRoles are cluster-scoped, so no namespace is set.
  name: pod-ro
rules:
  - apiGroups:
      - ""
    verbs:
      - get
      - list
      - watch
    resources:
      - pods
```
70 changes: 70 additions & 0 deletions docs/feature/subnet-scarcity/phase-3/2-nncbeta.md
@@ -0,0 +1,70 @@
## Revising the NodeNetworkConfig to v1beta1 [[Phase 3 Design]](../proposal.md#3-2-revise-the-nnc-to-v1beta1)

As some responsibility is shifted out of the NodeNetworkConfig (the Scaler), and the use-cases evolve, the NodeNetworkConfig needs to be updated to remain adaptable to all scenarios. Notably, to support multiple NetworkContainers per Node, the NNC should acknowledge that those may be from separate Subnets, and should map the `requestedIPCount` per known NetworkContainer. With the ClusterSubnet CRD hosting the Subnet Scaler properties, this will allow Subnets to scale independently even when they are used on the same Node.

Because this is a significant breaking change, the NodeNetworkConfig API version must be incremented. While the version is being revised, some additional improvements are included.

```diff
- apiVersion: acn.azure.com/v1alpha
+ apiVersion: acn.azure.com/v1beta1
  kind: NodeNetworkConfig
  metadata:
    name: nodename
    namespace: kube-system
  spec:
-   requestedIPCount: 16
-   ipsNotInUse:
+   releasedIPs:
      - abc-ip-123-guid
+   secondaryIPs:
+     abc-nc-123-guid: 16
  status:
-   assignedIPCount: 1
    networkContainers:
      - assignmentMode: dynamic
        defaultGateway: 10.241.0.1
        id: abc-nc-123-guid
-       ipAssignments:
-         - ip: 10.241.0.2
-           name: abc-ip-123-guid
        nodeIP: 10.240.0.5
        primaryIP: 10.241.0.38
        resourceGroupID: rg-id
+       secondaryIPCount: 1
+       secondaryIPs:
+         - address: 10.241.0.2
+           id: abc-ip-123-guid
        subcriptionID: abc-sub-123-guid
        subnetAddressSpace: 10.241.0.0/16
-       subnetID: podnet
        subnetName: podnet
        type: vnet
        version: 49
        vnetID: vnet-id
-   scaler:
-     batchSize: 16
-     maxIPCount: 250
-     releaseThresholdPercent: 150
-     requestThresholdPercent: 50
    status: Updating
```

In order:
- the GV is incremented to `acn.azure.com/v1beta1`
- the `spec.requestedIPCount` key is renamed to `spec.secondaryIPs`
  - the value is changed from a single scalar to a map of `NC ID` to scalar values
- the `spec.ipsNotInUse` key is renamed to `spec.releasedIPs`
- the `status.assignedIPCount` field is moved and renamed to `status.networkContainers[].secondaryIPCount`
- the `status.networkContainers[].ipAssignments` field is renamed to `status.networkContainers[].secondaryIPs`
  - the keys of the secondaryIPs are renamed from `ip` and `name` to `address` and `id` respectively
- the `status.subnetID` field is removed as a duplicate of `status.subnetName`; both were actually the "name" and not a unique ID
- the `status.scaler` is removed entirely

#### Migration
This update does not _add_ information to the NodeNetworkConfig, but removes and renames some properties. The transition will take place as follows:
1) The `v1beta1` CRD revision is created
- conversion functions are added to the NodeNetworkConfig schema which translate `v1beta1` <-> `v1alpha` (via `v1beta1` as the hub and "Storage Version").
2) DNC-RC installs the new CRD definition and registers a conversion webhook.
3) CNS switches to `v1beta1`.

At this time, any mutation of existing NNCs will automatically up-convert them to the `v1beta1` definition. Any client still requesting `v1alpha` will still be served a down-converted representation of the NNC in a backwards-compatible fashion, and updates to that NNC will be stored in the `v1beta1` representation.
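As a self-contained sketch of the up-conversion logic (the struct definitions below are abbreviated, hypothetical stand-ins for the generated CRD types; the real implementation would hang this logic on controller-runtime's `conversion.Convertible`/`conversion.Hub` interfaces), converting a `v1alpha` spec to `v1beta1` attributes the single `requestedIPCount` to the one NC known at that time:

```go
package main

import "fmt"

// SpecV1Alpha is an abbreviated v1alpha spec: a single scalar requested count.
type SpecV1Alpha struct {
	RequestedIPCount int64
	IPsNotInUse      []string
}

// SpecV1Beta1 is an abbreviated v1beta1 spec: per-NC requested counts.
type SpecV1Beta1 struct {
	SecondaryIPs map[string]int64 // NC ID -> requested Secondary IP count
	ReleasedIPs  []string
}

// convertUp maps a v1alpha spec into v1beta1, attributing the single
// requestedIPCount to the (single) known NC and renaming ipsNotInUse.
func convertUp(src SpecV1Alpha, ncID string) SpecV1Beta1 {
	return SpecV1Beta1{
		SecondaryIPs: map[string]int64{ncID: src.RequestedIPCount},
		ReleasedIPs:  src.IPsNotInUse,
	}
}

func main() {
	dst := convertUp(SpecV1Alpha{
		RequestedIPCount: 16,
		IPsNotInUse:      []string{"abc-ip-123-guid"},
	}, "abc-nc-123-guid")
	fmt.Println(dst.SecondaryIPs["abc-nc-123-guid"]) // prints 16
}
```

The inverse (down-conversion) would sum the `secondaryIPs` map back into a single `requestedIPCount`, which is lossless only while a Node has one NC; that constraint is what makes serving `v1alpha` clients backwards-compatibly possible.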
13 changes: 12 additions & 1 deletion docs/feature/subnet-scarcity/proposal.md
@@ -55,7 +55,7 @@ DNC-RC will poll DNC's SubnetState API on a fixed interval to check the Subnet Utilization
CNS will watch the `ClusterSubnet` CRD, scaling down and releasing IPs when the Subnet is marked as Exhausted.

### Phase 2
The batch size $B$ is dynamically adjusted based on the current subnet utilization. The batch size is increased when the subnet utilization is low, and decreased when the subnet utilization is high. IPs are not assigned to a new Node until CNS requests them, allowing Nodes to start safely even in very constrained subnets.
IPs are not assigned to a new Node until CNS requests them, allowing Nodes to start safely even in very constrained subnets. CNS scaling math is improved, and CNS Scalar properties come from the ClusterSubnet CRD instead of the NodeNetworkConfig CRD.

#### [[2-1]](phase-2/1-emptync.md) DNC-RC creates NCs with no Secondary IPs
DNC-RC will create the NNC for a new Node with an initial IP Request of 0. An empty NC (containing a Primary, but no Secondary IPs) will be created via normal DNC API calls. The empty NC will be written to the NNC, allowing CNS to start. CNS will make the initial IP request according to the Subnet Exhaustion State.
@@ -77,3 +77,14 @@ CNS will include the NC Primary IP(s) as IPs that it has been allocated, and will

#### [[2-3]](phase-2/3-subnetscaler.md) Scaler properties move to the ClusterSubnet CRD
The Scaler properties from the v1alpha/NodeNetworkConfig `Status.Scaler` definition are moved to the ClusterSubnet CRD. CNS will prefer the Scaler from this CRD when it is available, and fall back to the NNC Scaler otherwise. The `.Spec` field of the CRD may serve as an "overrides" location for runtime reconfiguration.

### Phase 3
CNS watches Pods and adjusts the SecondaryIP Count immediately in reaction to Pod IP demand changes. The NNC is revised to cut weight and prepare for the dynamic batch size (or multi-NC) future.


#### [[3-1]](phase-3/1-watchpods.md) CNS watches Pods
CNS will Watch for Pod events on its Node, and use the number of scheduled Pods to calculate the target Requested IP Count.


#### [[3-2]](phase-3/2-nncbeta.md) Revise the NNC to v1beta1
With the Scaler migration in [[Phase 2-3]](#2-3-scaler-properties-move-to-the-clustersubnet-crd), the NodeNetworkConfig will be revised to remove the Scaler object and otherwise optimize the schema.