# Background
Deep learning has shown that being able to train large models on vast amounts of data can drastically improve model performance.


However, consider the problem of training a deep network with millions, or even billions, of parameters. How do we achieve this without waiting for days, or even multiple weeks? Dean et al. propose a different training paradigm that allows us to train and serve a model on multiple physical machines. The authors propose two novel methodologies to accomplish this, namely `model parallelism` and `data parallelism`.


## Model Parallelism
When a model is too large to fit into a single node's memory, model-parallel training can be used to handle it. Model-parallel training has two key features:
1. Each worker task is responsible for estimating a different part of the model parameters, so the computation logic in each worker differs from that of the others.
2. There is application-level data communication between workers.

![Model Parallelism](./images/model_parallelism.jpg)
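
The two features above can be sketched in plain Python. This is a toy illustration under stated assumptions, not how any framework implements it: `LayerWorker` and `model_parallel_forward` are hypothetical names, each "worker" is just an object in one process, and the hand-off of activations stands in for the network transfer a real system would perform.

```python
class LayerWorker:
    """Holds the weights of one dense layer; stands in for one worker task."""

    def __init__(self, weights, bias):
        self.weights = weights  # list of rows, one row per output unit
        self.bias = bias

    def forward(self, inputs):
        # Dense layer followed by ReLU, computed only from local parameters.
        outputs = []
        for row, b in zip(self.weights, self.bias):
            z = sum(w * x for w, x in zip(row, inputs)) + b
            outputs.append(max(0.0, z))
        return outputs


# Worker 0 owns layer 1 (2 inputs -> 3 units); worker 1 owns layer 2 (3 -> 1).
worker0 = LayerWorker([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, -1.0])
worker1 = LayerWorker([[1.0, 2.0, 3.0]], [0.5])


def model_parallel_forward(x):
    activations = worker0.forward(x)     # computed on worker 0
    # In a real system this hand-off is a transfer between machines.
    return worker1.forward(activations)  # computed on worker 1


result = model_parallel_forward([1.0, 2.0])
print(result)
```

Neither worker ever sees the other's weights; only the intermediate activations cross the boundary, which is the application-level communication mentioned in the list.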


## Data Parallelism

The algorithm distributes the data across the worker tasks.
1. Each worker task is responsible for computing an estimate from a different part of the dataset.
2. Tasks then exchange their estimates with each other to arrive at the right estimate for the step.

![Data Parallelism](./images/data_parallelism.png)
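
As a minimal sketch of these two steps, assume the "estimate" is a gradient of a mean-squared-error loss for the one-parameter model `y = w * x`, and that the workers combine their estimates by simple averaging. The function names and the toy dataset are illustrative only.

```python
def gradient(w, shard):
    """Mean-squared-error gradient of y = w * x over one shard of (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def data_parallel_step(w, shards, learning_rate):
    # Step 1: each worker computes a gradient estimate on its own shard.
    local_grads = [gradient(w, shard) for shard in shards]
    # Step 2: workers exchange estimates; here they are simply averaged.
    avg_grad = sum(local_grads) / len(local_grads)
    return w - learning_rate * avg_grad


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]  # generated by y = 2x
shards = [data[0:2], data[2:4]]                           # two workers, equal shards

w = 0.0
for _ in range(50):
    w = data_parallel_step(w, shards, learning_rate=0.05)
print(w)
```

With equally sized shards, the average of the per-shard gradients equals the gradient over the full dataset, so every worker ends the step with the same update a single machine would have computed. With unequal shards the average would need to be weighted by shard size.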



# Distributed Training in TensorFlow
"Data Parallelism" is the most common training configuration. It involves multiple tasks in a `worker` job training the same model on different mini-batches of data and updating shared parameters hosted in one or more tasks in a `ps` (parameter server) job. All tasks typically run on different machines or containers. There are many ways to specify this structure in TensorFlow, and the TensorFlow team is building libraries that will simplify the work of specifying a replicated model. Other platforms such as `MXNet` and `Petuum` offer the same abstraction.

- __In-graph replication__. In this approach, the client builds a single tf.Graph that contains one set of parameters (in tf.Variable nodes pinned to /job:ps); and multiple copies of the compute-intensive part of the model, each pinned to a different task in /job:worker.

- __Between-graph replication__. In this approach, there is a separate client for each /job:worker task, typically in the same process as the worker task. Each client builds a similar graph containing the parameters (pinned to /job:ps as before using tf.train.replica_device_setter to map them deterministically to the same tasks); and a single copy of the compute-intensive part of the model, pinned to the local task in /job:worker.

- __Asynchronous training__. In this approach, each replica of the graph has an independent training loop that executes without coordination. It is compatible with both forms of replication above.

- __Synchronous training__. In this approach, all of the replicas read the same values for the current parameters, compute gradients in parallel, and then apply them together. It is compatible with in-graph replication (e.g. using gradient averaging as in the CIFAR-10 multi-GPU trainer), and between-graph replication (e.g. using the tf.train.SyncReplicasOptimizer).
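
The difference between the asynchronous and synchronous modes can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not TensorFlow code: each replica minimizes a toy loss `(w - target)^2`, the shared parameter is a single float, and "asynchronous" is modeled as every replica reading the parameter at the start of a round and then applying its update without waiting for the others.

```python
def grad(w, target):
    """Gradient of the toy loss (w - target)^2."""
    return 2.0 * (w - target)


def synchronous_round(w, targets, lr):
    # All replicas read the same w; their gradients are averaged and
    # applied together as a single update.
    grads = [grad(w, t) for t in targets]
    return w - lr * sum(grads) / len(grads), 1


def asynchronous_round(w, targets, lr):
    # Every replica reads w at the start of the round, then applies its own
    # update independently; later updates use a gradient from a stale w.
    stale_w = w
    updates = 0
    for t in targets:
        w = w - lr * grad(stale_w, t)
        updates += 1
    return w, updates


targets = [2.0, 3.0, 4.0, 3.0]  # one toy objective per replica, mean 3.0

w_sync, sync_updates = 0.0, 0
w_async, async_updates = 0.0, 0
for _ in range(200):
    w_sync, n = synchronous_round(w_sync, targets, lr=0.1)
    sync_updates += n
    w_async, n = asynchronous_round(w_async, targets, lr=0.05)
    async_updates += n

print(w_sync, sync_updates)
print(w_async, async_updates)
```

Both variants approach the same parameter value here, but the synchronous loop applies one update per round while the asynchronous loop applies one update per replica per round. In a real cluster the asynchronous interleaving is not deterministic as it is in this simulation.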


## Examples

We will walk through distributed training with two frameworks: TensorFlow and PyTorch.

### TensorFlow

#### Check TensorFlow PS Job

In [1]:
!cat ./distributed-training-jobs/distributed-tensorflow-job.yaml

apiVersion: "kubeflow.org/v1"
kind: "TFJob"
metadata:
  name: "distributed-tensorflow-job"
spec:
  tfReplicaSpecs:
    PS:
      replicas: 2
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/kubeflow-ci/tf-dist-mnist-test:1.0
    Worker:
      replicas: 4
      restartPolicy: Never
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/kubeflow-ci/tf-dist-mnist-test:1.0

#### Submit TFJob distributed training job

In [2]:
!kubectl create -f distributed-training-jobs/distributed-tensorflow-job.yaml

tfjob.kubeflow.org/distributed-tensorflow-job created


#### Get all TFJobs

In [3]:
!kubectl get tfjob

NAME                         STATE     AGE
distributed-tensorflow-job   Created   3s


#### Check TFJob Status

In [4]:
!kubectl describe tfjob distributed-tensorflow-job

Name:         distributed-tensorflow-job
Namespace:    eksworkshop
Labels:       <none>
Annotations:  <none>
API Version:  kubeflow.org/v1
Kind:         TFJob
Metadata:
  Creation Timestamp:  2020-01-25T21:44:16Z
  Generation:          1
  Resource Version:    49678
  Self Link:           /apis/kubeflow.org/v1/namespaces/eksworkshop/tfjobs/distributed-tensorflow-job
  UID:                 d584767b-3fbb-11ea-9a5c-0a9556c1ecda
Spec:
  Tf Replica Specs:
    PS:
      Replicas:        2
      Restart Policy:  Never
      Template:
        Metadata:
          Annotations:
            sidecar.istio.io/inject:  false
        Spec:
          Containers:
            Image:  gcr.io/kubeflow-ci/tf-dist-mnist-test:1.0
            Name:   tensorflow
    Worker:
      Replicas:        4
      Restart Policy:  Never
      Template:
        Metadata:
          Annotations:
            sidecar.istio.io/inject:  false
        Spec:
          Containers:
            Imag

#### Check all the pods created by this TFJob

In [5]:
!kubectl get pod | grep distributed-tensorflow-job

distributed-tensorflow-job-ps-0       0/1     ContainerCreating   0          7s
distributed-tensorflow-job-ps-1       0/1     ContainerCreating   0          7s
distributed-tensorflow-job-worker-0   0/1     ContainerCreating   0          8s
distributed-tensorflow-job-worker-1   0/1     ContainerCreating   0          8s
distributed-tensorflow-job-worker-2   0/1     ContainerCreating   0          8s
distributed-tensorflow-job-worker-3   0/1     ContainerCreating   0          8s
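
The pod names above follow directly from the replica counts in the job definition: two `PS` replicas and four `Worker` replicas give six pods. The sketch below rebuilds that mapping in Python. The `cluster_spec` helper is hypothetical, and the `<pod>.<namespace>.svc:2222` address format is an assumption taken from the addresses printed in the worker log later in this notebook, not from the operator's documentation.

```python
def cluster_spec(job_name, namespace, replicas, port=2222):
    """Build a {job type: [host:port, ...]} mapping from replica counts."""
    spec = {}
    for replica_type, count in replicas.items():
        job = replica_type.lower()
        spec[job] = [
            f"{job_name}-{job}-{index}.{namespace}.svc:{port}"
            for index in range(count)
        ]
    return spec


spec = cluster_spec(
    job_name="distributed-tensorflow-job",
    namespace="eksworkshop",
    replicas={"PS": 2, "Worker": 4},
)
for job, hosts in spec.items():
    print(job, hosts)
```

Each task needs this kind of mapping so that workers know where the parameter servers are and vice versa.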


#### Check logs of one worker pod
`-f` means follow: the command blocks until the job finishes. To check the current logs and return immediately, run the command without `-f`, or click `Kernel` -> `Interrupt` to stop the process.

In [7]:
!kubectl logs -f distributed-tensorflow-job-worker-0

  from ._conv import register_converters as _register_converters
2020-01-25 21:44:38.312118: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2020-01-25 21:44:38.312854: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job ps -> {0 -> distributed-tensorflow-job-ps-0.eksworkshop.svc:2222, 1 -> distributed-tensorflow-job-ps-1.eksworkshop.svc:2222}
2020-01-25 21:44:38.312872: I tensorflow/core/distributed_runtime/rpc/grpc_channel.cc:215] Initialize GrpcChannelCache for job worker -> {0 -> localhost:2222, 1 -> distributed-tensorflow-job-worker-1.eksworkshop.svc:2222, 2 -> distributed-tensorflow-job-worker-2.eksworkshop.svc:2222, 3 -> distributed-tensorflow-job-worker-3.eksworkshop.svc:2222}
2020-01-25 21:44:38.313251: I tensorflow/core/distributed_runtime/rpc/grpc_server_lib.cc:324] Started server with target: grpc://lo

1579988688.804729: Worker 0: training step 1910 done (global step: 5051)
1579988688.810537: Worker 0: training step 1911 done (global step: 5055)
1579988688.816378: Worker 0: training step 1912 done (global step: 5058)
1579988688.822737: Worker 0: training step 1913 done (global step: 5060)
1579988688.828150: Worker 0: training step 1914 done (global step: 5066)
1579988688.834909: Worker 0: training step 1915 done (global step: 5069)
1579988688.841194: Worker 0: training step 1916 done (global step: 5073)
1579988688.845813: Worker 0: training step 1917 done (global step: 5078)
1579988688.852504: Worker 0: training step 1918 done (global step: 5080)
1579988688.857516: Worker 0: training step 1919 done (global step: 5083)
1579988688.862439: Worker 0: training step 1920 done (global step: 5087)
1579988688.867044: Worker 0: training step 1921 done (global step: 5090)
1579988688.872571: Worker 0: training step 1922 done (global step: 5093)
1579988688.876796: Worker 0: training 

1579988698.682033: Worker 0: training step 3491 done (global step: 10820)
1579988698.692224: Worker 0: training step 3492 done (global step: 10823)
1579988698.697398: Worker 0: training step 3493 done (global step: 10829)
1579988698.702740: Worker 0: training step 3494 done (global step: 10832)
1579988698.707543: Worker 0: training step 3495 done (global step: 10836)
1579988698.712958: Worker 0: training step 3496 done (global step: 10838)
1579988698.718237: Worker 0: training step 3497 done (global step: 10842)
1579988698.723427: Worker 0: training step 3498 done (global step: 10845)
1579988698.728884: Worker 0: training step 3499 done (global step: 10850)
1579988698.734527: Worker 0: training step 3500 done (global step: 10852)
1579988698.740630: Worker 0: training step 3501 done (global step: 10857)
1579988698.745534: Worker 0: training step 3502 done (global step: 10859)
1579988698.750866: Worker 0: training step 3503 done (global step: 10863)
1579988698.755294: Worker 0: training 

1579988699.321543: Worker 0: training step 3602 done (global step: 11221)
1579988699.329483: Worker 0: training step 3603 done (global step: 11225)
1579988699.335305: Worker 0: training step 3604 done (global step: 11229)
1579988699.340703: Worker 0: training step 3605 done (global step: 11233)
1579988699.346023: Worker 0: training step 3606 done (global step: 11236)
1579988699.351820: Worker 0: training step 3607 done (global step: 11239)
1579988699.357370: Worker 0: training step 3608 done (global step: 11244)
1579988699.364529: Worker 0: training step 3609 done (global step: 11248)
1579988699.371055: Worker 0: training step 3610 done (global step: 11251)
1579988699.377208: Worker 0: training step 3611 done (global step: 11255)
1579988699.384905: Worker 0: training step 3612 done (global step: 11259)
1579988699.396030: Worker 0: training step 3613 done (global step: 11263)
1579988699.402086: Worker 0: training step 3614 done (global step: 11269)
1579988699.407728: Worker 0: training 

1579988700.028805: Worker 0: training step 3713 done (global step: 11641)
1579988700.034445: Worker 0: training step 3714 done (global step: 11645)
1579988700.041357: Worker 0: training step 3715 done (global step: 11648)
1579988700.046646: Worker 0: training step 3716 done (global step: 11652)
1579988700.050511: Worker 0: training step 3717 done (global step: 11655)
1579988700.055617: Worker 0: training step 3718 done (global step: 11658)
1579988700.060978: Worker 0: training step 3719 done (global step: 11661)
1579988700.078748: Worker 0: training step 3720 done (global step: 11665)
1579988700.085475: Worker 0: training step 3721 done (global step: 11671)
1579988700.090277: Worker 0: training step 3722 done (global step: 11673)
1579988700.095005: Worker 0: training step 3723 done (global step: 11676)
1579988700.101278: Worker 0: training step 3724 done (global step: 11680)
1579988700.106380: Worker 0: training step 3725 done (global step: 11683)
1579988700.110831: Worker 0: training 

1579988701.124863: Worker 0: training step 3879 done (global step: 12241)
1579988701.131487: Worker 0: training step 3880 done (global step: 12245)
1579988701.136527: Worker 0: training step 3881 done (global step: 12249)
1579988701.143044: Worker 0: training step 3882 done (global step: 12253)
1579988701.149097: Worker 0: training step 3883 done (global step: 12257)
1579988701.153968: Worker 0: training step 3884 done (global step: 12259)
1579988701.158878: Worker 0: training step 3885 done (global step: 12263)
1579988701.164428: Worker 0: training step 3886 done (global step: 12267)
1579988701.169081: Worker 0: training step 3887 done (global step: 12270)
1579988701.174100: Worker 0: training step 3888 done (global step: 12272)
1579988701.179039: Worker 0: training step 3889 done (global step: 12275)
1579988701.183715: Worker 0: training step 3890 done (global step: 12277)
1579988701.188017: Worker 0: training step 3891 done (global step: 12280)
1579988701.196995: Worker 0: training 

1579988701.842945: Worker 0: training step 3990 done (global step: 12623)
1579988701.855350: Worker 0: training step 3991 done (global step: 12628)
1579988701.865587: Worker 0: training step 3992 done (global step: 12631)
1579988701.873389: Worker 0: training step 3993 done (global step: 12635)
1579988701.893491: Worker 0: training step 3994 done (global step: 12638)
1579988701.899690: Worker 0: training step 3995 done (global step: 12646)
1579988701.912190: Worker 0: training step 3996 done (global step: 12649)
1579988701.921129: Worker 0: training step 3997 done (global step: 12652)
1579988701.929386: Worker 0: training step 3998 done (global step: 12655)
1579988701.937414: Worker 0: training step 3999 done (global step: 12658)
1579988701.945485: Worker 0: training step 4000 done (global step: 12661)
1579988701.952357: Worker 0: training step 4001 done (global step: 12666)
1579988701.957817: Worker 0: training step 4002 done (global step: 12667)
1579988701.963320: Worker 0: training 

1579988702.878710: Worker 0: training step 4156 done (global step: 13224)
1579988702.884375: Worker 0: training step 4157 done (global step: 13227)
1579988702.891182: Worker 0: training step 4158 done (global step: 13231)
1579988702.897203: Worker 0: training step 4159 done (global step: 13235)
1579988702.905261: Worker 0: training step 4160 done (global step: 13240)
1579988702.911306: Worker 0: training step 4161 done (global step: 13243)
1579988702.916259: Worker 0: training step 4162 done (global step: 13248)
1579988702.920257: Worker 0: training step 4163 done (global step: 13250)
1579988702.925141: Worker 0: training step 4164 done (global step: 13253)
1579988702.931289: Worker 0: training step 4165 done (global step: 13256)
1579988702.938110: Worker 0: training step 4166 done (global step: 13260)
1579988702.944124: Worker 0: training step 4167 done (global step: 13264)
1579988702.949083: Worker 0: training step 4168 done (global step: 13270)
1579988702.953539: Worker 0: training 

1579988703.934283: Worker 0: training step 4322 done (global step: 13840)
1579988703.943751: Worker 0: training step 4323 done (global step: 13843)
1579988703.948749: Worker 0: training step 4324 done (global step: 13846)
1579988703.956737: Worker 0: training step 4325 done (global step: 13850)
1579988703.962129: Worker 0: training step 4326 done (global step: 13853)
1579988703.970216: Worker 0: training step 4327 done (global step: 13856)
1579988703.976064: Worker 0: training step 4328 done (global step: 13858)
1579988703.981750: Worker 0: training step 4329 done (global step: 13862)
1579988703.987267: Worker 0: training step 4330 done (global step: 13866)
1579988703.993953: Worker 0: training step 4331 done (global step: 13869)
1579988704.000791: Worker 0: training step 4332 done (global step: 13873)
1579988704.005452: Worker 0: training step 4333 done (global step: 13877)
1579988704.010828: Worker 0: training step 4334 done (global step: 13880)
1579988704.016705: Worker 0: training 

1579988705.073598: Worker 0: training step 4488 done (global step: 14505)
1579988705.082357: Worker 0: training step 4489 done (global step: 14508)
1579988705.087882: Worker 0: training step 4490 done (global step: 14513)
1579988705.093992: Worker 0: training step 4491 done (global step: 14516)
1579988705.101576: Worker 0: training step 4492 done (global step: 14521)
1579988705.106141: Worker 0: training step 4493 done (global step: 14524)
1579988705.112347: Worker 0: training step 4494 done (global step: 14528)
1579988705.117111: Worker 0: training step 4495 done (global step: 14532)
1579988705.122871: Worker 0: training step 4496 done (global step: 14535)
1579988705.128327: Worker 0: training step 4497 done (global step: 14539)
1579988705.133781: Worker 0: training step 4498 done (global step: 14541)
1579988705.139157: Worker 0: training step 4499 done (global step: 14547)
1579988705.143658: Worker 0: training step 4500 done (global step: 14549)
1579988705.148086: Worker 0: training 

1579988706.148430: Worker 0: training step 4654 done (global step: 15094)
1579988706.155142: Worker 0: training step 4655 done (global step: 15098)
1579988706.160535: Worker 0: training step 4656 done (global step: 15102)
1579988706.168519: Worker 0: training step 4657 done (global step: 15104)
1579988706.175365: Worker 0: training step 4658 done (global step: 15110)
1579988706.180898: Worker 0: training step 4659 done (global step: 15114)
1579988706.186740: Worker 0: training step 4660 done (global step: 15117)
1579988706.192925: Worker 0: training step 4661 done (global step: 15121)
1579988706.197532: Worker 0: training step 4662 done (global step: 15124)
1579988706.202430: Worker 0: training step 4663 done (global step: 15127)
1579988706.207447: Worker 0: training step 4664 done (global step: 15131)
1579988706.213382: Worker 0: training step 4665 done (global step: 15134)
1579988706.219551: Worker 0: training step 4666 done (global step: 15138)
1579988706.224947: Worker 0: training 

1579988707.026580: Worker 0: training step 4796 done (global step: 15630)
1579988707.035689: Worker 0: training step 4797 done (global step: 15633)
1579988707.045929: Worker 0: training step 4798 done (global step: 15637)
1579988707.053529: Worker 0: training step 4799 done (global step: 15641)
1579988707.063448: Worker 0: training step 4800 done (global step: 15645)
1579988707.067981: Worker 0: training step 4801 done (global step: 15650)
1579988707.074270: Worker 0: training step 4802 done (global step: 15653)
1579988707.080745: Worker 0: training step 4803 done (global step: 15657)
1579988707.085781: Worker 0: training step 4804 done (global step: 15661)
1579988707.091802: Worker 0: training step 4805 done (global step: 15664)
1579988707.097161: Worker 0: training step 4806 done (global step: 15668)
1579988707.102979: Worker 0: training step 4807 done (global step: 15671)
1579988707.107833: Worker 0: training step 4808 done (global step: 15675)
1579988707.114256: Worker 0: training 

1579988707.876123: Worker 0: training step 4930 done (global step: 16095)
1579988707.893849: Worker 0: training step 4931 done (global step: 16101)
1579988707.901971: Worker 0: training step 4932 done (global step: 16108)
1579988707.908082: Worker 0: training step 4933 done (global step: 16112)
1579988707.914395: Worker 0: training step 4934 done (global step: 16116)
1579988707.923634: Worker 0: training step 4935 done (global step: 16120)
1579988707.930688: Worker 0: training step 4936 done (global step: 16125)
1579988707.943530: Worker 0: training step 4937 done (global step: 16129)
1579988707.948594: Worker 0: training step 4938 done (global step: 16135)
1579988707.954425: Worker 0: training step 4939 done (global step: 16138)
1579988707.961164: Worker 0: training step 4940 done (global step: 16141)
1579988707.967077: Worker 0: training step 4941 done (global step: 16146)
1579988707.972403: Worker 0: training step 4942 done (global step: 16149)
1579988707.978503: Worker 0: training 

1579988708.680617: Worker 0: training step 5041 done (global step: 16562)
1579988708.690076: Worker 0: training step 5042 done (global step: 16566)
1579988708.695355: Worker 0: training step 5043 done (global step: 16570)
1579988708.703042: Worker 0: training step 5044 done (global step: 16573)
1579988708.707887: Worker 0: training step 5045 done (global step: 16577)
1579988708.712571: Worker 0: training step 5046 done (global step: 16581)
1579988708.717488: Worker 0: training step 5047 done (global step: 16584)
1579988708.723272: Worker 0: training step 5048 done (global step: 16587)
1579988708.731049: Worker 0: training step 5049 done (global step: 16591)
1579988708.735944: Worker 0: training step 5050 done (global step: 16594)
1579988708.740375: Worker 0: training step 5051 done (global step: 16598)
1579988708.746358: Worker 0: training step 5052 done (global step: 16602)
1579988708.751587: Worker 0: training step 5053 done (global step: 16604)
1579988708.756331: Worker 0: training 

1579988709.320654: Worker 0: training step 5152 done (global step: 16960)
1579988709.327119: Worker 0: training step 5153 done (global step: 16964)
1579988709.333160: Worker 0: training step 5154 done (global step: 16965)
1579988709.337928: Worker 0: training step 5155 done (global step: 16968)
1579988709.344606: Worker 0: training step 5156 done (global step: 16971)
1579988709.349998: Worker 0: training step 5157 done (global step: 16974)
1579988709.355703: Worker 0: training step 5158 done (global step: 16977)
1579988709.361006: Worker 0: training step 5159 done (global step: 16979)
1579988709.366920: Worker 0: training step 5160 done (global step: 16983)
1579988709.371260: Worker 0: training step 5161 done (global step: 16986)
1579988709.375247: Worker 0: training step 5162 done (global step: 16988)
1579988709.381777: Worker 0: training step 5163 done (global step: 16990)
1579988709.388529: Worker 0: training step 5164 done (global step: 16993)
1579988709.396413: Worker 0: training 

1579988710.010639: Worker 0: training step 5263 done (global step: 17339)
1579988710.019426: Worker 0: training step 5264 done (global step: 17344)
1579988710.024736: Worker 0: training step 5265 done (global step: 17347)
1579988710.029286: Worker 0: training step 5266 done (global step: 17350)
1579988710.034753: Worker 0: training step 5267 done (global step: 17354)
1579988710.040377: Worker 0: training step 5268 done (global step: 17357)
1579988710.046473: Worker 0: training step 5269 done (global step: 17361)
1579988710.051655: Worker 0: training step 5270 done (global step: 17365)
1579988710.057699: Worker 0: training step 5271 done (global step: 17368)
1579988710.065456: Worker 0: training step 5272 done (global step: 17373)
1579988710.070856: Worker 0: training step 5273 done (global step: 17376)
1579988710.081860: Worker 0: training step 5274 done (global step: 17377)
1579988710.087293: Worker 0: training step 5275 done (global step: 17381)
1579988710.093946: Worker 0: training 

1579988710.989363: Worker 0: training step 5429 done (global step: 17913)
1579988710.996682: Worker 0: training step 5430 done (global step: 17916)
1579988711.001893: Worker 0: training step 5431 done (global step: 17920)
1579988711.006760: Worker 0: training step 5432 done (global step: 17924)
1579988711.012938: Worker 0: training step 5433 done (global step: 17925)
1579988711.018715: Worker 0: training step 5434 done (global step: 17930)
1579988711.023960: Worker 0: training step 5435 done (global step: 17934)
1579988711.030165: Worker 0: training step 5436 done (global step: 17938)
1579988711.035871: Worker 0: training step 5437 done (global step: 17940)
1579988711.041167: Worker 0: training step 5438 done (global step: 17944)
1579988711.046878: Worker 0: training step 5439 done (global step: 17947)
1579988711.053331: Worker 0: training step 5440 done (global step: 17952)
1579988711.057999: Worker 0: training step 5441 done (global step: 17956)
1579988711.063658: Worker 0: training 

1579988712.190663: Worker 0: training step 5595 done (global step: 18586)
1579988712.199336: Worker 0: training step 5596 done (global step: 18590)
1579988712.206019: Worker 0: training step 5597 done (global step: 18594)
1579988712.211354: Worker 0: training step 5598 done (global step: 18598)
1579988712.219476: Worker 0: training step 5599 done (global step: 18602)
1579988712.224730: Worker 0: training step 5600 done (global step: 18606)
1579988712.229160: Worker 0: training step 5601 done (global step: 18609)
1579988712.234765: Worker 0: training step 5602 done (global step: 18612)
1579988712.239757: Worker 0: training step 5603 done (global step: 18615)
1579988712.244341: Worker 0: training step 5604 done (global step: 18619)
1579988712.249694: Worker 0: training step 5605 done (global step: 18622)
1579988712.254319: Worker 0: training step 5606 done (global step: 18624)
1579988712.261113: Worker 0: training step 5607 done (global step: 18628)
1579988712.266835: Worker 0: training 

1579988713.171107: Worker 0: training step 5761 done (global step: 19191)
1579988713.179731: Worker 0: training step 5762 done (global step: 19196)
1579988713.187804: Worker 0: training step 5763 done (global step: 19200)
1579988713.194193: Worker 0: training step 5764 done (global step: 19205)
1579988713.199247: Worker 0: training step 5765 done (global step: 19208)
1579988713.204657: Worker 0: training step 5766 done (global step: 19211)
1579988713.210062: Worker 0: training step 5767 done (global step: 19216)
1579988713.215852: Worker 0: training step 5768 done (global step: 19220)
1579988713.221072: Worker 0: training step 5769 done (global step: 19223)
1579988713.228819: Worker 0: training step 5770 done (global step: 19226)
1579988713.233838: Worker 0: training step 5771 done (global step: 19231)
1579988713.238803: Worker 0: training step 5772 done (global step: 19234)
1579988713.244440: Worker 0: training step 5773 done (global step: 19237)
1579988713.249376: Worker 0: training 

1579988714.158071: Worker 0: training step 5927 done (global step: 19747)
1579988714.166716: Worker 0: training step 5928 done (global step: 19749)
1579988714.174573: Worker 0: training step 5929 done (global step: 19755)
1579988714.179594: Worker 0: training step 5930 done (global step: 19759)
1579988714.186728: Worker 0: training step 5931 done (global step: 19762)
1579988714.191896: Worker 0: training step 5932 done (global step: 19766)
1579988714.196309: Worker 0: training step 5933 done (global step: 19770)
1579988714.201377: Worker 0: training step 5934 done (global step: 19772)
1579988714.207552: Worker 0: training step 5935 done (global step: 19776)
1579988714.214022: Worker 0: training step 5936 done (global step: 19780)
1579988714.220651: Worker 0: training step 5937 done (global step: 19783)
1579988714.226703: Worker 0: training step 5938 done (global step: 19788)
1579988714.232905: Worker 0: training step 5939 done (global step: 19792)
1579988714.238697: Worker 0: training 
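
In the log above, the global step advances several counts for every local step of worker 0. The sketch below parses a few of those lines and computes the ratio. The three sample lines are copied from the output above; the regular expression and the `parse_steps` helper are illustrative, and truncated lines are skipped rather than repaired.

```python
import re

LOG_LINE = re.compile(
    r"Worker (\d+): training step (\d+) done \(global step: (\d+)\)"
)


def parse_steps(lines):
    """Return (local_step, global_step) pairs for lines that match fully."""
    pairs = []
    for line in lines:
        match = LOG_LINE.search(line)
        if match:  # truncated lines do not match and are skipped
            pairs.append((int(match.group(2)), int(match.group(3))))
    return pairs


sample = [
    "1579988688.804729: Worker 0: training step 1910 done (global step: 5051)",
    "1579988688.876796: Worker 0: training ",
    "1579988714.232905: Worker 0: training step 5939 done (global step: 19792)",
]
pairs = parse_steps(sample)
(first_local, first_global), (last_local, last_global) = pairs[0], pairs[-1]
ratio = (last_global - first_global) / (last_local - first_local)
print(f"global steps per local step: {ratio:.2f}")
```

A ratio between 3 and 4 is what one would expect if all four workers increment a shared global step without waiting for each other, though the log alone does not prove which training mode the image uses.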

### PyTorch

In [8]:
!cat ./distributed-training-jobs/distributed-pytorch-job.yaml

apiVersion: "kubeflow.org/v1"
kind: "PyTorchJob"
metadata:
  name: "distributed-pytorch-job"
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
              args: ["--backend", "gloo"]
              # Comment out the below resources to use the CPU.
              #resources:
                #limits:
                  #nvidia.com/gpu: 1
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
        spec:
          containers:
            - name: pytorch
              image: gcr.io/kubeflow-ci/pytorch-dist-mnist_test:1.0
              args: ["--backend", "gloo"]
              # Co

In [9]:
!kubectl apply -f ./distributed-training-jobs/distributed-pytorch-job.yaml

pytorchjob.kubeflow.org/distributed-pytorch-job created


In [10]:
!kubectl describe pytorchjob distributed-pytorch-job

Name:         distributed-pytorch-job
Namespace:    eksworkshop
Labels:       <none>
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"kubeflow.org/v1","kind":"PyTorchJob","metadata":{"annotations":{},"name":"distributed-pytorch-job","namespace":"eksworkshop...
API Version:  kubeflow.org/v1
Kind:         PyTorchJob
Metadata:
  Creation Timestamp:  2020-01-25T21:45:54Z
  Generation:          1
  Resource Version:    50380
  Self Link:           /apis/kubeflow.org/v1/namespaces/eksworkshop/pytorchjobs/distributed-pytorch-job
  UID:                 1069d727-3fbc-11ea-9a5c-0a9556c1ecda
Spec:
  Pytorch Replica Specs:
    Master:
      Replicas:        1
      Restart Policy:  OnFailure
      Template:
        Metadata:
          Annotations:
            sidecar.istio.io/inject:  false
        Spec:
          Containers:
            Args:
              --backend
              gloo
            Image:  gcr.io/kubeflow-ci/p

In [11]:
!kubectl get pod | grep distributed-pytorch-job

distributed-pytorch-job-master-0      0/1     ContainerCreating   0          6s
distributed-pytorch-job-worker-0      0/1     Init:0/1            0          6s
distributed-pytorch-job-worker-1      0/1     Init:0/1            0          6s
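
The three pods above form a single process group in which every process needs a unique rank and the total process count (the world size). The sketch below shows one plausible assignment, assuming the common convention that the master takes rank 0 and workers follow in index order. The `rank_assignments` helper is hypothetical and does not read anything from the cluster.

```python
def rank_assignments(job_name, master_replicas, worker_replicas):
    """Map pod name -> (rank, world_size) for a master/worker job."""
    world_size = master_replicas + worker_replicas
    ranks = {}
    rank = 0
    for index in range(master_replicas):
        ranks[f"{job_name}-master-{index}"] = (rank, world_size)
        rank += 1
    for index in range(worker_replicas):
        ranks[f"{job_name}-worker-{index}"] = (rank, world_size)
        rank += 1
    return ranks


assignments = rank_assignments("distributed-pytorch-job", 1, 2)
for pod, (rank, world_size) in assignments.items():
    print(pod, rank, world_size)
```

With one master and two workers the world size is 3, which matches the three pods listed above.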


In [13]:
!kubectl logs -f distributed-pytorch-job-master-0

Error from server (BadRequest): container "pytorch" in pod "distributed-pytorch-job-master-0" is waiting to start: ContainerCreating
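
The error above only means that the container had not started yet when the logs were requested. One way to handle this is to retry until the logs become available. The sketch below is a generic retry helper. `wait_for_logs` and `kubectl_logs` are hypothetical names, and the demonstration uses a fake fetcher so that it runs without a cluster.

```python
import subprocess
import time


def wait_for_logs(fetch_logs, attempts=10, delay_seconds=5.0, sleep=time.sleep):
    """Call fetch_logs() until it stops raising RuntimeError, up to `attempts` times."""
    last_error = None
    for _ in range(attempts):
        try:
            return fetch_logs()
        except RuntimeError as error:  # raised by the caller-supplied fetcher
            last_error = error
            sleep(delay_seconds)
    raise TimeoutError(f"logs not available after {attempts} attempts: {last_error}")


def kubectl_logs(pod):
    """Fetch current logs for a pod; raise RuntimeError if kubectl fails."""
    result = subprocess.run(
        ["kubectl", "logs", pod], capture_output=True, text=True
    )
    if result.returncode != 0:
        raise RuntimeError(result.stderr.strip())
    return result.stdout


# Demonstration with a fake fetcher that fails twice before succeeding.
calls = {"count": 0}


def fake_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("container is waiting to start: ContainerCreating")
    return "training started"


logs = wait_for_logs(fake_fetch, delay_seconds=0.0, sleep=lambda seconds: None)
print(logs)
```

Against a real cluster, the call would look like `wait_for_logs(lambda: kubectl_logs("distributed-pytorch-job-master-0"))`.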
