
Alnair Installation Guide

Zhaobo edited this page Jan 16, 2023 · 11 revisions

Install a vanilla Kubernetes cluster

1. Prepare two Linux nodes (at least one GPU node as the worker) and install Docker Engine on each

  • In /etc/docker/daemon.json add {"exec-opts": ["native.cgroupdriver=systemd"]}, then run systemctl daemon-reload and systemctl restart docker. This is needed because kubeadm (since v1.22) configures the kubelet to use the systemd cgroup driver by default, while Docker's default is cgroupfs; the kubelet cannot start while the two differ.
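
A minimal sketch of the daemon.json step above, assuming a fresh Docker install with no existing daemon.json. The function name write_cgroup_config and the DAEMON_JSON variable are illustrative and not part of Docker or Alnair. It refuses to overwrite an existing file, because that file may already carry other settings that need a manual merge.

```shell
# Path of Docker's daemon config; overridable so the sketch can be tried safely.
DAEMON_JSON="${DAEMON_JSON:-/etc/docker/daemon.json}"

# Write a daemon.json that switches Docker to the systemd cgroup driver.
# Hypothetical helper: fails instead of clobbering an existing file.
write_cgroup_config() {
    if [ -e "$DAEMON_JSON" ]; then
        echo "refusing to overwrite existing $DAEMON_JSON; merge by hand" >&2
        return 1
    fi
    printf '{\n    "exec-opts": ["native.cgroupdriver=systemd"]\n}\n' > "$DAEMON_JSON"
}
```

Run write_cgroup_config as root, then systemctl daemon-reload and systemctl restart docker. Afterwards docker info --format '{{.CgroupDriver}}' should report systemd.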

2. On the master node, install version 1.23 of kubeadm, kubelet, and kubectl, then create the cluster with kubeadm init

 apt-mark unhold kubeadm && apt-get install --allow-downgrades kubeadm=1.23.0-00 &&  apt-mark hold kubeadm
 apt-mark unhold kubelet && apt-get install --allow-downgrades kubelet=1.23.0-00 &&  apt-mark hold kubelet
 apt-mark unhold kubectl && apt-get install --allow-downgrades kubectl=1.23.0-00 &&  apt-mark hold kubectl
 kubeadm init --pod-network-cidr=  
  • Pass the pod network CIDR to kubeadm init; otherwise flannel fails later.
  • From Kubernetes 1.24 on, you also need to install cri-dockerd. Docker Engine does not implement the CRI, which a container runtime must support to work with Kubernetes, so the additional cri-dockerd service is required. cri-dockerd is based on the legacy built-in Docker Engine support (dockershim) that was removed from the kubelet in version 1.24.
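
The version pinning and the 1.24 cutoff above can be sketched as two small helpers. Both names (pin_k8s_packages, needs_cri_dockerd) are hypothetical. The first only prints the apt commands as a dry run so they can be reviewed before running them as root; the second assumes a version string of the form MAJOR.MINOR[.PATCH], with an optional leading v.

```shell
# Print (do not run) the unhold/install/hold command for each package.
pin_k8s_packages() {
    pin_version="$1"
    for pin_pkg in kubeadm kubelet kubectl; do
        echo "apt-mark unhold $pin_pkg && apt-get install --allow-downgrades $pin_pkg=$pin_version && apt-mark hold $pin_pkg"
    done
}

# Succeed when the given Kubernetes version is 1.24 or newer,
# i.e. when the kubelet no longer ships built-in Docker Engine support.
needs_cri_dockerd() {
    crid_version="${1#v}"            # drop an optional leading "v"
    crid_major="${crid_version%%.*}" # text before the first dot
    crid_rest="${crid_version#*.}"   # text after the first dot
    crid_minor="${crid_rest%%.*}"    # minor field only
    [ "$crid_major" -gt 1 ] || { [ "$crid_major" -eq 1 ] && [ "$crid_minor" -ge 24 ]; }
}
```

For example, needs_cri_dockerd "$(kubeadm version -o short)" && echo "install cri-dockerd first" warns before kubeadm init on a newer release.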

3. On the master node, install the flannel network plugin

4. On the worker node, disable memory swap: swapoff -a
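
swapoff -a only lasts until the next reboot, and by default the kubelet refuses to start while swap is active. A rough sketch of how to check the current state and preview a persistent change follows; the helper names are hypothetical, and the fstab pattern assumes conventional whitespace-separated entries with swap as the filesystem type.

```shell
# Succeed when no swap area is active. /proc/swaps holds one header line
# followed by one line per active swap area.
swap_is_off() {
    awk 'NR > 1 { n++ } END { exit (n > 0) }' "${1:-/proc/swaps}"
}

# Print the given fstab with every active swap entry commented out.
# Writes to stdout only; review the result before replacing /etc/fstab.
comment_out_swap() {
    sed -E '/^[[:space:]]*[^#[:space:]].*[[:space:]]swap[[:space:]]/ s/^/#/' "$1"
}
```

Running comment_out_swap /etc/fstab shows what the file would look like with swap disabled across reboots.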

5. On the worker node, install nvidia-docker2 (apt-get install nvidia-docker2) and set the NVIDIA runtime as Docker's default runtime in /etc/docker/daemon.json

{
    "default-runtime": "nvidia",
    "exec-opts": ["native.cgroupdriver=systemd"],
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

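Docker selects its default runtime through the default-runtime key in daemon.json. The helper below is a quick way to confirm the worker's file names nvidia there; default_runtime_of is a hypothetical name, and it does a naive text match rather than real JSON parsing, so it assumes the key and value sit on one line.

```shell
# Print the value of "default-runtime" from a daemon.json, or nothing
# if the key is absent. Naive line-based match, not a JSON parser.
default_runtime_of() {
    sed -n 's/.*"default-runtime"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "${1:-/etc/docker/daemon.json}"
}
```

After restarting Docker, docker info --format '{{.DefaultRuntime}}' should report the same value.
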
6. On the worker node, join the worker to the cluster

The join command can be printed on the master node with kubeadm token create --print-join-command; run the printed command on the worker.
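
Once the worker has joined, it should show up as Ready on the master. A small sketch for spotting nodes that have not reached that state; not_ready_nodes is a hypothetical helper that reads kubectl get nodes --no-headers output on stdin, where the second column is the node status.

```shell
# Print the name of every node whose STATUS column is not exactly "Ready".
# Cordoned nodes ("Ready,SchedulingDisabled") are listed too.
not_ready_nodes() {
    awk '$2 != "Ready" { print $1 }'
}
```

Usage on the master: kubectl get nodes --no-headers | not_ready_nodes. Empty output means every node is Ready; a freshly joined worker may stay NotReady until the network plugin is running.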

7. On the master node, install the NVIDIA device plugin
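
With the device plugin running, GPU workers advertise the nvidia.com/gpu resource. One way to check this is sketched below; nodes_with_gpu is a hypothetical helper that reads two-column output (node name, allocatable GPU count) with a header line on stdin.

```shell
# Print the nodes that report at least one allocatable GPU.
# Skips the header line and rows whose GPU column is not a number (e.g. <none>).
nodes_with_gpu() {
    awk 'NR > 1 && $2 ~ /^[0-9]+$/ && $2 + 0 > 0 { print $1 }'
}
```

Usage on the master: kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' | nodes_with_gpu. If the GPU worker is missing from the output, the device plugin pod on that node is the first place to look.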

8. To authorize a new user to use kubectl, run the following as that user

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config
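
When an administrator sets this up on behalf of another account, the same steps can be wrapped in a helper. This is only a sketch: install_kubeconfig is a hypothetical name, it must run with enough privilege to read admin.conf, and ownership of the target directory still has to be handed to the user afterwards with chown. The copy is restricted to mode 600 because admin.conf carries cluster-admin credentials.

```shell
# Copy a kubeconfig into a target .kube directory and restrict its permissions.
# $1: source kubeconfig (defaults to the kubeadm admin config)
# $2: destination directory (defaults to the caller's ~/.kube)
install_kubeconfig() {
    kc_src="${1:-/etc/kubernetes/admin.conf}"
    kc_dest="${2:-$HOME/.kube}"
    mkdir -p "$kc_dest" || return 1
    cp "$kc_src" "$kc_dest/config" || return 1
    chmod 600 "$kc_dest/config"
}
```

For a hypothetical user alice, run as root: install_kubeconfig /etc/kubernetes/admin.conf /home/alice/.kube, followed by chown -R alice: /home/alice/.kube.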

Install Alnair components

1. Install Alnair profiler

2. Install CRD and Controller for Elastic Horovod

3. Install CRD and Controller for Torch Elastic

4. Install vGPU device plugin

5. Install vGPU scheduler
