Overview

This folder holds the topology-aware scheduler plugin implementation based on the NodeResourceTopology CRD. The plugin makes scheduling decisions based on worker node hardware topology, overcoming the issue described here.

The document describing the NodeResourceTopology API Custom Resource Definition standard can be found here.

Maturity Level

  • 💡 Sample (for demonstrating and inspiring purpose)
  • 👶 Alpha (used in companies for pilot projects)
  • 👦 Beta (used in companies and developed actively)
  • 👨 Stable (used in companies for production workloads)

Tutorial

Expectation

When the cumulative allocatable resources appear to be the same for both nodes in the cluster, the topology-aware scheduler plugin uses the CRD instances corresponding to the nodes to obtain resource topology information and make a topology-aware scheduling decision.

NOTE:

  • From NodeResourceTopology v0.0.12 onwards, the CRD is cluster scoped rather than namespace scoped. Scheduler plugin versions later than v0.21.6 depend on NodeResourceTopology CRD v0.0.12, and the namespace field in the NodeResourceTopology scheduler config args has been deprecated.

Dependency:

Config

Scheduler

Enable the "NodeResourceTopologyMatch" Filter and Score plugins via KubeSchedulerConfiguration.

apiVersion: kubescheduler.config.k8s.io/v1beta2
kind: KubeSchedulerConfiguration
leaderElection:
  leaderElect: false
clientConnection:
  kubeconfig: "/etc/kubernetes/scheduler.conf"
profiles:
- schedulerName: topo-aware-scheduler
  plugins:
    filter:
      enabled:
      - name: NodeResourceTopologyMatch
    score:
      enabled:
      - name: NodeResourceTopologyMatch
# optional plugin configs
  pluginConfig:
  - name: NodeResourceTopologyMatch
    args:
      # other strategies are MostAllocated and BalancedAllocation
      scoringStrategy:
        type: "LeastAllocated"
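
To give a feel for how the scoringStrategy choice affects ranking, here is a minimal Python sketch of least-allocated and most-allocated style scores. It is modeled on the general formulas used by kube-scheduler's resource scoring and is not the plugin's actual code; the function names and the sample zone data are hypothetical, and BalancedAllocation is left out.

```python
MAX_SCORE = 100  # kube-scheduler node scores range from 0 to 100


def _average_score(requested, allocatable, per_resource):
    """Average an integer per-resource score over all requested resources."""
    scores = []
    for res, want in requested.items():
        have = allocatable.get(res, 0)
        if have <= 0 or want > have:
            # Resource missing or insufficient: contributes the lowest score.
            scores.append(0)
        else:
            scores.append(per_resource(want, have))
    return sum(scores) // len(scores) if scores else 0


def least_allocated(requested, allocatable):
    """Higher score when more of each resource stays free after placement."""
    return _average_score(
        requested, allocatable,
        lambda want, have: (have - want) * MAX_SCORE // have,
    )


def most_allocated(requested, allocatable):
    """Higher score when the request uses up more of each resource."""
    return _average_score(
        requested, allocatable,
        lambda want, have: want * MAX_SCORE // have,
    )


if __name__ == "__main__":
    # Hypothetical zone and request, for illustration only.
    zone = {"cpu": 4, "example.com/deviceA": 2}
    request = {"cpu": 1, "example.com/deviceA": 1}
    print("LeastAllocated:", least_allocated(request, zone))
    print("MostAllocated:", most_allocated(request, zone))
```

With LeastAllocated, emptier candidates rank higher and workloads spread out; with MostAllocated, fuller candidates rank higher and workloads pack together.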

Cluster

The topology-aware scheduler bases its decisions on a number of node-specific hardware details and configuration settings, which are defined per node rather than per cluster. Consistent settings across a set of nodes, or across the whole cluster, are a fundamental prerequisite for the scheduler to work correctly. In other words, the nodes in a set must share the same NUMA topology and kubelet configuration, at least for settings such as the topology manager and the resource (cpu, memory, device) managers.

However, the scheduler has no means to enforce or even validate this prerequisite; for example, the NodeResourceTopology CRD does not expose all the relevant fields, nor should it (that would be out of scope).

Hence, proper cluster configuration is expected from cluster admins or from other software components, such as controllers or operators, which are outside the scope of this document.

Should the cluster need different settings (e.g. topology manager policies) or NUMA topologies, we recommend using standard Kubernetes mechanisms, such as affinity or taints, to identify each set of nodes.
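
Since the scheduler does not validate consistency itself, an admin could run a rough check out of band. The Python sketch below groups NodeResourceTopology objects by their topologyPolicies field so that mixed settings stand out. It assumes the objects are plain dictionaries shaped like the CRs in this document (for example, loaded from JSON output of kubectl); the function name and the sample node names are hypothetical, and topologyPolicies covers only one of the settings that need to match.

```python
from collections import defaultdict


def group_by_policy(topologies):
    """Group node names by the topology policies reported in their CRs."""
    groups = defaultdict(list)
    for nrt in topologies:
        # Sort so that the order of policies in the CR does not matter.
        key = tuple(sorted(nrt.get("topologyPolicies", [])))
        groups[key].append(nrt["metadata"]["name"])
    return dict(groups)


if __name__ == "__main__":
    # Hypothetical CRs, for illustration only.
    sample = [
        {"metadata": {"name": "node-1"},
         "topologyPolicies": ["SingleNUMANodeContainerLevel"]},
        {"metadata": {"name": "node-2"},
         "topologyPolicies": ["SingleNUMANodeContainerLevel"]},
        {"metadata": {"name": "node-3"},
         "topologyPolicies": ["SingleNUMANodePodLevel"]},
    ]
    for policies, nodes in group_by_policy(sample).items():
        print(policies, "->", nodes)
```

More than one group in the result suggests that the node sets should be kept apart, for example with the affinity or taints mentioned above.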

Demo

Let us assume we have two nodes in a cluster deployed with sample-device-plugin with the hardware topology described by the diagram below:

[Figure: Setup]

The hardware topology of both nodes is represented by the CRD instances below. These CRD instances are supposed to be created by node agents such as Resource Topology Exporter (RTE) or Node Feature Discovery (NFD). For details, see the issue "Exposing Hardware Topology through CRDs in NFD" and the design document on enhancing NFD to expose node resource topology through CRDs.

To configure your cluster with NFD-topology updater, a software component in Node Feature Discovery that creates NodeResourceTopology CRs for the nodes in the cluster, follow the Quick Start guide here.

# Worker Node A CRD spec
apiVersion: topology.node.k8s.io/v1alpha1
kind: NodeResourceTopology
metadata:
  name: worker-node-A
topologyPolicies: ["SingleNUMANodeContainerLevel"]
zones:
  - name: numa-node-0
    type: Node
    resources:
      - name: cpu
        capacity: 4
        allocatable: 3
      - name: example.com/deviceA
        capacity: 1
        allocatable: 1
      - name: example.com/deviceB
        capacity: 2
        allocatable: 2
  - name: numa-node-1
    type: Node
    resources:
      - name: cpu
        capacity: 4
        allocatable: 3
      - name: example.com/deviceA
        capacity: 2
        allocatable: 2
      - name: example.com/deviceB
        capacity: 1
        allocatable: 1

# Worker Node B CRD spec
apiVersion: topology.node.k8s.io/v1alpha1
kind: NodeResourceTopology
metadata:
  name: worker-node-B
topologyPolicies: ["SingleNUMANodeContainerLevel"]
zones:
  - name: numa-node-0
    type: Node
    resources:
      - name: cpu
        capacity: 4
        allocatable: 3
      - name: example.com/deviceA
        capacity: 3
        allocatable: 3
  - name: numa-node-1
    type: Node
    resources:
      - name: cpu
        capacity: 4
        allocatable: 3
      - name: example.com/deviceB
        capacity: 3
        allocatable: 3

  • Verify that the CRD has been created, using one of the following options:

    1. If NFD or RTE is deployed in the cluster, ensure that the CRD and the CR instances have been created by running:

        $ kubectl get noderesourcetopologies.topology.node.k8s.io
    2. Alternatively, if you only want to test the scheduler plugin, use the manifests in the manifest directory to deploy the CRD and CRs as follows:

      1. Deploy the Custom Resource Definition manifest

        $ kubectl create -f crd.yaml
      2. Check if the noderesourcetopologies.topology.node.k8s.io CRD is created

         $ kubectl get crd
         $ kubectl get noderesourcetopologies.topology.node.k8s.io
      3. Deploy the CRs representing the hardware topology of worker-node-A and worker-node-B, if they have not already been created by RTE or NFD as mentioned above:

         $ kubectl create -f worker-node-A.yaml
         $ kubectl create -f worker-node-B.yaml

        NOTE: In case you are testing this demo by creating CRs manually, ensure that the names of the nodes in the cluster match the CR names.

  • Copy cluster kubeconfig file to /etc/kubernetes/scheduler.conf

  • Build the image locally

    $ make local-image
  • Push the built image to the image registry:

    $ docker push <IMAGE_REGISTRY>/scheduler-plugins/kube-scheduler:latest
  • Deploy the topology-aware scheduler plugin config

    $ kubectl create -f scheduler-configmap.yaml
  • Deploy the Scheduler plugin

    $ kubectl create -f cluster-role.yaml
    serviceaccount/topo-aware-scheduler created
    clusterrole.rbac.authorization.k8s.io/noderesourcetoplogy-handler created
    clusterrolebinding.rbac.authorization.k8s.io/topo-aware-scheduler-as-kube-scheduler created
    clusterrolebinding.rbac.authorization.k8s.io/my-scheduler-as-volume-scheduler created
    rolebinding.rbac.authorization.k8s.io/topo-aware-scheduler-as-kube-scheduler created
    clusterrolebinding.rbac.authorization.k8s.io/noderesourcetoplogy created
    
    $ kubectl create -f deploy.yaml
    deployment.apps/topo-aware-scheduler created
  • Check if the scheduler plugin is deployed correctly by running the following

    $ kubectl get pods -n kube-system -o wide
    NAME                                         READY   STATUS    RESTARTS   AGE   IP            NODE                 NOMINATED NODE   READINESS GATES
    topo-aware-scheduler-764c475854-vpmcw        1/1     Running   0          2s    10.244.0.14   kind-control-plane   <none>           <none>
    
  • Deploy the pod to be scheduled by the topology-aware scheduler plugin by setting schedulerName: topo-aware-scheduler in the pod spec:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-deployment
spec:
  selector:
    matchLabels:
      name: test
  template:
    metadata:
      labels:
        name: test
    spec:
      schedulerName: topo-aware-scheduler
      containers:
      - name: test-deployment-1-container-1
        image: quay.io/fromani/numalign
        imagePullPolicy: IfNotPresent
        command: ["/bin/sh", "-c"]
        args: [ "while true; do numalign; sleep 100000; done;" ]
        resources:
          limits:
            cpu: 1
            memory: 100Mi
            example.com/deviceA: 1
            example.com/deviceB: 1
          requests:
            cpu: 1
            memory: 100Mi
            example.com/deviceA: 1
            example.com/deviceB: 1

$ kubectl create -f test-deployment.yaml
deployment.apps/test-deployment created
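  • The test-deployment pod should be scheduled on worker-node-A, the only node where a single NUMA zone can provide the requested cpu, example.com/deviceA, and example.com/deviceB together: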

    $ kubectl get pods -o wide
    NAME                               READY   STATUS    RESTARTS   AGE     IP           NODE                 NOMINATED NODE   READINESS GATES
    device-plugin-a-ds-9bpsj           1/1     Running   0          3h13m   172.17.0.3   worker-node-B          <none>           <none>
    device-plugin-a-ds-dv55t           1/1     Running   0          3h13m   172.17.0.2   worker-node-A          <none>           <none>
    device-plugin-b-ds-8t7lh           1/1     Running   0          3h13m   172.17.0.2   worker-node-A          <none>           <none>
    device-plugin-b-ds-lt4pr           1/1     Running   0          3h13m   172.17.0.3   worker-node-B          <none>           <none>
    test-deployment-6dccf65ddb-pkg9j   1/1     Running   0          18s     172.17.0.2   worker-node-A          <none>           <none>
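
The outcome above can be reproduced with a small Python sketch of a single-NUMA-node fit check. It is an illustration of the idea, not the plugin's actual filter logic. The allocatable values are copied from the two CRs in this document; the function names are hypothetical, and memory is left out because the sample CRs do not report it per zone.

```python
# Per-NUMA-zone allocatable resources, copied from the two CRs above.
NODES = {
    "worker-node-A": {
        "numa-node-0": {"cpu": 3, "example.com/deviceA": 1, "example.com/deviceB": 2},
        "numa-node-1": {"cpu": 3, "example.com/deviceA": 2, "example.com/deviceB": 1},
    },
    "worker-node-B": {
        "numa-node-0": {"cpu": 3, "example.com/deviceA": 3},
        "numa-node-1": {"cpu": 3, "example.com/deviceB": 3},
    },
}

# Container requests from test-deployment.yaml. Memory is omitted because
# the sample CRs do not report it per zone (an assumption of this sketch).
REQUEST = {"cpu": 1, "example.com/deviceA": 1, "example.com/deviceB": 1}


def node_totals(zones):
    """Sum allocatable resources over all zones of one node."""
    totals = {}
    for allocatable in zones.values():
        for res, qty in allocatable.items():
            totals[res] = totals.get(res, 0) + qty
    return totals


def zones_that_fit(zones, request):
    """Return the zones that can satisfy every requested resource on their own."""
    return [
        name
        for name, allocatable in zones.items()
        if all(allocatable.get(res, 0) >= qty for res, qty in request.items())
    ]


def feasible_nodes(nodes, request):
    """Map each node that has at least one fitting zone to those zones."""
    result = {}
    for node, zones in nodes.items():
        fitting = zones_that_fit(zones, request)
        if fitting:
            result[node] = fitting
    return result


if __name__ == "__main__":
    for node, zones in NODES.items():
        print(node, "totals:", node_totals(zones))
    print("feasible:", feasible_nodes(NODES, REQUEST))
```

Both nodes report the same totals (6 cpu, 3 deviceA, 3 deviceB), so a scheduler looking only at node-level allocatable cannot tell them apart. Per zone, worker-node-B keeps deviceA and deviceB on different NUMA nodes, so only worker-node-A passes the check.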