# What is Distributed Training
Deep learning has shown that training large models on vast amounts of data can drastically improve model performance.

However, consider the problem of training a deep network with millions, or even billions, of parameters. How do we achieve this without waiting for days, or even weeks? Dean et al. propose a different training paradigm that allows us to train and serve a model on multiple physical machines. The authors propose two novel methodologies to accomplish this: `model parallelism` and `data parallelism`.


# Model Parallelism
When a model is too big to fit into a single node's memory, model-parallel training can be used to split it across nodes. Model-parallel training has two key features:
1. Each worker task is responsible for estimating a different part of the model parameters, so the computation logic in each worker differs from that of the others.
2. There is application-level data communication between workers.

![Model Parallelism](./images/model_parallelism.jpg)
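
The two features above can be sketched in plain Python. This is a minimal illustration, not a real distributed implementation: the `LayerWorker` class, the weights, and the input are all hypothetical, and both "workers" run in one process. Each worker holds only its own layer's parameters, and the hidden activations are the data that would be communicated between workers.

```python
# Minimal sketch of model parallelism: a two-layer network whose layers
# live on two different (simulated) workers. All names are hypothetical.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LayerWorker:
    """Holds the parameters of one layer only."""

    def __init__(self, weights, use_relu):
        self.weights = weights      # this worker's part of the model
        self.use_relu = use_relu    # workers may run different logic

    def forward(self, inputs):
        outputs = matvec(self.weights, inputs)
        if self.use_relu:
            outputs = [max(0.0, value) for value in outputs]
        return outputs


# Worker A owns layer 1 (3 inputs -> 2 hidden units); worker B owns layer 2.
worker_a = LayerWorker([[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]], use_relu=True)
worker_b = LayerWorker([[2.0, -1.0]], use_relu=False)

x = [1.0, 2.0, 3.0]
hidden = worker_a.forward(x)   # computed on worker A
# In a real system, `hidden` would be sent over the network to worker B.
y = worker_b.forward(hidden)   # computed on worker B
print(hidden, y)
```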


# Data Parallelism

In data parallelism, the data is distributed across multiple tasks, each of which trains the same model:
1. Each worker task computes its estimate (for example, a gradient) from a different part of the dataset.
2. The tasks then exchange their estimates with each other to arrive at a combined estimate for the step.

![Data Parallelism](./images/data_parallelism.png)
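
As a rough sketch of these two steps, the example below fits a one-parameter model `y = w * x` with gradient descent. The data, the learning rate, and the choice to combine estimates by averaging are assumptions made for illustration. Each simulated worker computes a gradient on its own shard, and the averaged gradient is used for the update.

```python
# Minimal sketch of data parallelism with two simulated workers.
# Data, learning rate, and the averaging rule are illustrative assumptions.

def gradient(w, batch):
    """Mean gradient of the squared error (w*x - y)**2 over a batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # generated by y = 2x
shards = [data[0:2], data[2:4]]                          # one shard per worker

# With equally sized shards, the average of the per-shard gradients
# equals the gradient over the full dataset.
shard_average = sum(gradient(0.0, shard) for shard in shards) / len(shards)
full_gradient = gradient(0.0, data)

w = 0.0
for step in range(50):
    local_grads = [gradient(w, shard) for shard in shards]  # step 1: per worker
    combined = sum(local_grads) / len(local_grads)          # step 2: exchange
    w -= 0.05 * combined

print(w)
```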



# Distributed Training in TensorFlow
Data parallelism is the most common training configuration. It involves multiple tasks in a `worker` job training the same model on different mini-batches of data and updating shared parameters hosted in one or more tasks in a `ps` (parameter server) job. All tasks typically run on different machines or containers. There are many ways to specify this structure in TensorFlow, and the TensorFlow team is building libraries that will simplify the work of specifying a replicated model. Other platforms such as `MXNet` and `Petuum` offer the same abstraction.

- __In-graph replication__. In this approach, the client builds a single `tf.Graph` that contains one set of parameters (in `tf.Variable` nodes pinned to `/job:ps`) and multiple copies of the compute-intensive part of the model, each pinned to a different task in `/job:worker`.

- __Between-graph replication__. In this approach, there is a separate client for each `/job:worker` task, typically in the same process as the worker task. Each client builds a similar graph containing the parameters (pinned to `/job:ps` as before, using `tf.train.replica_device_setter` to map them deterministically to the same tasks) and a single copy of the compute-intensive part of the model, pinned to the local task in `/job:worker`.

- __Asynchronous training__. In this approach, each replica of the graph has an independent training loop that executes without coordination. It is compatible with both forms of replication above.

- __Synchronous training__. In this approach, all of the replicas read the same values for the current parameters, compute gradients in parallel, and then apply them together. It is compatible with in-graph replication (e.g. using gradient averaging as in the CIFAR-10 multi-GPU trainer) and between-graph replication (e.g. using the `tf.train.SyncReplicasOptimizer`).
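
The difference between the last two modes can be sketched without TensorFlow. The example below is a simplified simulation under stated assumptions: the one-parameter model, the shards, and the learning rate are hypothetical, and the "asynchronous" replicas run one after another in a single process. In a real asynchronous job, replicas may also compute gradients from parameters that are already out of date.

```python
# Sketch: synchronous vs. asynchronous updates of one shared parameter.
# The model (y = w * x), shards, and learning rate are illustrative only.

def gradient(w, batch):
    """Mean gradient of the squared error (w*x - y)**2 over a batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


SHARDS = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0), (4.0, 8.0)]]  # one per replica
LEARNING_RATE = 0.02


def synchronous_step(w):
    """All replicas read the same w; gradients are averaged and applied once."""
    grads = [gradient(w, shard) for shard in SHARDS]
    return w - LEARNING_RATE * sum(grads) / len(grads)


def asynchronous_round(w):
    """Each replica reads the current w and applies its own gradient at once."""
    for shard in SHARDS:
        w = w - LEARNING_RATE * gradient(w, shard)
    return w


w_sync = w_async = 0.0
for _ in range(200):
    w_sync = synchronous_step(w_sync)
    w_async = asynchronous_round(w_async)

print(w_sync, w_async)
```

Both loops approach the same value here because the problem is tiny and well conditioned. The sketch is meant to show where the updates are applied, not to compare convergence behaviour.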


# Examples

The examples below run distributed training with two frameworks: TensorFlow and PyTorch.

# View Distributed TensorFlow Job (TFJob) Definition

In [1]:
!cat ./distributed-training-jobs/distributed-tensorflow-job.yaml

apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "distributed-tensorflow-job"
spec:
  tfReplicaSpecs:
    PS:
      replicas: 2
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/kubeflow-ci/tf-dist-mnist-test:1.0
    Worker:
      replicas: 4
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/kubeflow-ci/tf-dist-mnist-test:1.0
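
The same definition can also be generated programmatically, which is handy when the replica counts vary between runs. The sketch below rebuilds the manifest shown above as a Python dictionary and prints it as JSON, which `kubectl` accepts as well as YAML. The helper names `tf_replica_spec` and `build_tfjob` are hypothetical, not part of any Kubeflow library.

```python
import json

# Sketch: build the TFJob definition above as a dictionary.
# Helper names are hypothetical; the field values mirror the YAML shown.


def tf_replica_spec(replicas, image):
    """One entry of tfReplicaSpecs (used for both PS and Worker)."""
    return {
        "replicas": replicas,
        "restartPolicy": "Never",
        "template": {
            "metadata": {"annotations": {"sidecar.istio.io/inject": "false"}},
            "spec": {"containers": [{"name": "tensorflow", "image": image}]},
        },
    }


def build_tfjob(name, image, num_ps, num_workers):
    """Assemble a TFJob manifest with parameter servers and workers."""
    return {
        "apiVersion": "kubeflow.org/v1",
        "kind": "TFJob",
        "metadata": {"name": name},
        "spec": {
            "tfReplicaSpecs": {
                "PS": tf_replica_spec(num_ps, image),
                "Worker": tf_replica_spec(num_workers, image),
            }
        },
    }


manifest = build_tfjob(
    name="distributed-tensorflow-job",
    image="gcr.io/kubeflow-ci/tf-dist-mnist-test:1.0",
    num_ps=2,
    num_workers=4,
)
print(json.dumps(manifest, indent=2))
```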

# Launch Distributed TensorFlow Training job (`TFJob`)

In [2]:
!kubectl create -f distributed-training-jobs/distributed-tensorflow-job.yaml

tfjob.kubeflow.org/distributed-tensorflow-job created


# View All TFJobs

In [3]:
!kubectl get tfjob

NAME                         STATE   AGE
distributed-tensorflow-job           1s


# Check TFJob Status

In [4]:
!kubectl describe tfjob distributed-tensorflow-job

Name:         distributed-tensorflow-job
Namespace:    anonymous
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1
Kind:         TFJob
Metadata:
  Creation Timestamp:  2020-08-23T01:05:31Z
  Generation:          1
  Resource Version:    63010
  Self Link:           /apis/kubeflow.org/v1/namespaces/anonymous/tfjobs/distributed-tensorflow-job
  UID:                 c8ec2343-8022-4f17-9582-8bb53d52de44
Spec:
  Tf Replica Specs:
    PS:
      Replicas:        2
      Restart Policy:  Never
      Template:
        Metadata:
          Annotations:
            sidecar.istio.io/inject:  false
        Spec:
          Containers:
            Image:  gcr.io/kubeflow-ci/tf-dist-mnist-test:1.0
            Name:   tensorflow
    Worker:
      Replicas:        4
      Restart Policy:  Never
      Template:
        Metadata:
          Annotations:
            sidecar.istio.io/inject:  false
        Spec:
          Containers:
            Image:  

# Check Distributed TensorFlow Training Logs
_Note:  If you see an error in this cell, just wait a bit and re-run to see the logs._

In [5]:
!kubectl get pod | grep distributed-tensorflow-job

distributed-tensorflow-job-ps-0       0/1     ContainerCreating   0          2s
distributed-tensorflow-job-ps-1       0/1     ContainerCreating   0          2s
distributed-tensorflow-job-worker-0   0/1     ContainerCreating   0          2s
distributed-tensorflow-job-worker-1   0/1     ContainerCreating   0          1s
distributed-tensorflow-job-worker-2   0/1     ContainerCreating   0          1s
distributed-tensorflow-job-worker-3   0/1     ContainerCreating   0          1s
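
The pod names in this listing follow a regular pattern: job name, replica type in lower case, then replica index. The sketch below reproduces that pattern so the expected pods can be listed before they appear. The pattern, the `.svc` address form, and port `2222` are inferred from the output in this notebook rather than from operator documentation, and the function names are hypothetical.

```python
# Sketch: derive pod names and addresses from the replica counts.
# The naming pattern is inferred from the kubectl and log output shown here.


def replica_pod_names(job_name, replica_counts):
    """Return names such as '<job>-ps-0' for every replica."""
    names = []
    for replica_type, count in replica_counts.items():
        for index in range(count):
            names.append(f"{job_name}-{replica_type.lower()}-{index}")
    return names


def replica_addresses(pod_names, namespace, port=2222):
    """Return '<pod>.<namespace>.svc:<port>' for each pod name."""
    return [f"{name}.{namespace}.svc:{port}" for name in pod_names]


pods = replica_pod_names("distributed-tensorflow-job", {"PS": 2, "Worker": 4})
addresses = replica_addresses(pods, namespace="anonymous")
for address in addresses:
    print(address)
```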


In [15]:
!kubectl logs distributed-tensorflow-job-worker-0

  from ._conv import register_converters as _register_converters
2020-08-23 01:05:53.696926: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2020-08-23 01:05:53.697635: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> distributed-tensorflow-job-ps-0.anonymous.svc:2222, 1 -> distributed-tensorflow-job-ps-1.anonymous.svc:2222}
2020-08-23 01:05:53.697657: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2222, 1 -> distributed-tensorflow-job-worker-1.anonymous.svc:2222, 2 -> distributed-tensorflow-job-worker-2.anonymous.svc:2222, 3 -> distributed-tensorflow-job-worker-3.anonymous.svc:2222}
2020-08-23 01:05:53.697947: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://localhost:22

1598144797.146798: Worker 0: training step 1951 done (global step: 8084)
1598144797.167984: Worker 0: training step 1952 done (global step: 8087)
1598144797.191108: Worker 0: training step 1953 done (global step: 8092)
1598144797.213397: Worker 0: training step 1954 done (global step: 8097)
1598144797.242188: Worker 0: training step 1955 done (global step: 8102)
1598144797.259722: Worker 0: training step 1956 done (global step: 8105)
1598144797.277138: Worker 0: training step 1957 done (global step: 8109)
1598144797.298728: Worker 0: training step 1958 done (global step: 8114)
1598144797.321893: Worker 0: training step 1959 done (global step: 8115)
1598144797.345451: Worker 0: training step 1960 done (global step: 8120)
1598144797.367704: Worker 0: training step 1961 done (global step: 8124)
1598144797.386993: Worker 0: training step 1962 done (global step: 8126)
1598144797.411022: Worker 0: training step 1963 done (global step: 8131)
1598144797.432483: Worker 0: training 
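
In the log above, the worker's own step advances by one per line while the global step advances by several. That is consistent with the four workers in this job all incrementing a shared global step, although the log alone does not prove it. The sketch below parses lines in this format and computes how fast the global step moves relative to the local step. The regular expression is written against the lines shown here, and the function name is hypothetical.

```python
import re

# Sketch: compare local and global step counts in the worker log.
# The pattern matches the log lines shown in this notebook.
STEP_PATTERN = re.compile(
    r"Worker (\d+): training step (\d+) done \(global step: (\d+)\)"
)


def parse_steps(log_text):
    """Return (worker, local_step, global_step) for each complete line."""
    steps = []
    for line in log_text.splitlines():
        match = STEP_PATTERN.search(line)
        if match:  # truncated or unrelated lines are skipped
            worker, local_step, global_step = map(int, match.groups())
            steps.append((worker, local_step, global_step))
    return steps


sample = """1598144797.146798: Worker 0: training step 1951 done (global step: 8084)
1598144797.411022: Worker 0: training step 1963 done (global step: 8131)
1598144797.432483: Worker 0: training """

steps = parse_steps(sample)
first, last = steps[0], steps[-1]
# Global steps taken per local step of this worker over the sampled range.
ratio = (last[2] - first[2]) / (last[1] - first[1])
print(f"global steps per local step: {ratio:.2f}")
```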

# View Distributed PyTorch Job (PyTorchJob) Definition

In [7]:
!cat ./distributed-training-jobs/distributed-pytorch-job.yaml

apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "distributed-pytorch-job"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
              args: ["--backend", "gloo"]
              # Comment out the below resources to use the CPU.
              #resources:
                #limits:
                  #nvidia.com/gpu: 1
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
              args: ["--backend", "gloo"]
              # Co

# Launch Distributed PyTorch Training Job

In [8]:
!kubectl apply -f ./distributed-training-jobs/distributed-pytorch-job.yaml

pytorchjob.kubeflow.org/distributed-pytorch-job created


In [9]:
!kubectl describe pytorchjob distributed-pytorch-job

Name:         distributed-pytorch-job
Namespace:    anonymous
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"kubeflow.org/v1","kind":"PyTorchJob","metadata":{"annotations":{},"name":"distributed-pytorch-job","namespace":"anonymous"}...
API Version:  kubeflow.org/v1
Kind:         PyTorchJob
Metadata:
  Creation Timestamp:  2020-08-23T01:05:35Z
  Generation:          1
  Resource Version:    63151
  Self Link:           /apis/kubeflow.org/v1/namespaces/anonymous/pytorchjobs/distributed-pytorch-job
  UID:                 eecc655f-2da3-4e5e-b319-5859b7a6df80
Spec:
  Pytorch Replica Specs:
    Master:
      Replicas:        1
      Restart Policy:  OnFailure
      Template:
        Metadata:
          Annotations:
            sidecar.istio.io/inject:  false
        Spec:
          Containers:
            Args:
              --backend
              gloo
            Image:  gcr.io/kubeflow-ci/pytor

# Check Distributed PyTorch Training Logs
_Note:  If you see an error below, just wait a bit and re-run.  You will eventually see the pod status change to `Running` or `Completed`._

In [10]:
!kubectl get pod | grep distributed-pytorch-job

distributed-pytorch-job-master-0      0/1     ContainerCreating   0          1s
distributed-pytorch-job-worker-0      0/1     Init:0/1            0          2s
distributed-pytorch-job-worker-1      0/1     Init:0/1            0          2s
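
The three pods listed here form one process group. When `torch.distributed` is initialized with its `env://` method, it reads `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `RANK` from each process's environment. The sketch below shows what those values could look like for this job. It assumes the master takes rank 0 and workers follow in index order, and the port number is a placeholder. The function names are hypothetical, and the values actually injected into the pods may differ.

```python
# Sketch: rendezvous settings for a job with one master and N workers.
# Rank order and port are assumptions made for illustration.


def assign_ranks(job_name, num_workers):
    """Map each pod name to a rank; rank 0 is the master (assumed)."""
    ranks = {f"{job_name}-master-0": 0}
    for index in range(num_workers):
        ranks[f"{job_name}-worker-{index}"] = index + 1
    return ranks


def rendezvous_env(job_name, num_workers, pod_name, port=23456):
    """Build the env:// variables one pod would need (port is a placeholder)."""
    ranks = assign_ranks(job_name, num_workers)
    return {
        "MASTER_ADDR": f"{job_name}-master-0",
        "MASTER_PORT": str(port),
        "WORLD_SIZE": str(len(ranks)),
        "RANK": str(ranks[pod_name]),
    }


env = rendezvous_env(
    "distributed-pytorch-job",
    num_workers=2,
    pod_name="distributed-pytorch-job-worker-1",
)
print(env)
```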


# If You See an Error Below, Wait a Few Seconds and Re-Run It 

In [16]:
!kubectl logs distributed-pytorch-job-master-0

Using distributed PyTorch with gloo backend
Downloading http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz
Downloading http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz
Processing...
Done!
