# Getting Started with Cloud-Native AI - 3. Submitting an MPI Job

In this example, we demonstrate how to:

* Use Arena to submit a distributed MPI training job, then check its status and logs
* View the training job in TensorBoard

> Prerequisite: complete the [shared storage setup](../docs/setup/SETUP_NAS.md) first. The current `${HOME}` is the directory backing the `training-data` data volume created there.

1. Download the TensorFlow sample source code to the `${HOME}/models` directory.

In [1]:
! if [ ! -d "${HOME}/models/tensorflow-benchmarks" ] ; then \
  git clone -b cnn_tf_v1.9_compatible "https://code.aliyun.com/xiaozhou/benchmark.git" "${HOME}/models/tensorflow-benchmarks"; \
fi

Cloning into '/root/models/tensorflow-benchmarks'...
remote: Enumerating objects: 3748, done.
remote: Counting objects: 100% (3748/3748), done.
remote: Compressing objects: 100% (1170/1170), done.
remote: Total 3748 (delta 2557), reused 3748 (delta 2557)
Receiving objects: 100% (3748/3748), 1.98 MiB | 0 bytes/s, done.
Resolving deltas: 100% (2557/2557), done.
Checking connectivity... done.


3. Create the training output directory `${HOME}/output`.

In [2]:
! mkdir -p ${HOME}/output

4. Inspect the directory layout: `dataset` holds the data, `models` holds the model code, and `output` holds the training results.

In [3]:
! tree -I ai-starter -L 3 ${HOME}

/root
|-- dataset
|-- models
|   `-- tensorflow-benchmarks
|       |-- LICENSE
|       |-- README.md
|       |-- bower_components
|       |-- dashboard_app
|       |-- index.html
|       |-- js
|       |-- scripts
|       |-- soumith_benchmarks.html
|       `-- tools
`-- output

9 directories, 4 files


5. Check the available GPU resources.

In [4]:
! arena top node

NAME                                   IPADDRESS      ROLE    GPU(Total)  GPU(Allocated)
cn-zhangjiakou.i-8vb2knpxzlk449e7lugx  192.168.0.209  <none>  1           0
cn-zhangjiakou.i-8vb2knpxzlk449e7lugy  192.168.0.210  <none>  1           0
cn-zhangjiakou.i-8vb2knpxzlk449e7lugz  192.168.0.208  <none>  1           0
cn-zhangjiakou.i-8vb7yuo831zjzijo9sdw  192.168.0.205  master  0           0
cn-zhangjiakou.i-8vbezxqzueo7662i0dbq  192.168.0.204  master  0           0
cn-zhangjiakou.i-8vbezxqzueo7681j4fav  192.168.0.206  master  0           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/3 (0%)  
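
Before submitting a job that asks for two workers with one GPU each, it helps to confirm that enough GPUs are free. The following is a minimal sketch, assuming the column layout of `arena top node` shown above (the output format may differ between Arena versions). The function names `parse_top_node` and `free_gpus` are hypothetical helpers, not part of Arena.

```python
def parse_top_node(text):
    """Parse the table printed by `arena top node` into a list of dicts.

    Assumes data rows have exactly five whitespace-separated columns and
    that the last two columns are integers, as in the sample output.
    """
    nodes = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 5 and fields[3].isdigit() and fields[4].isdigit():
            nodes.append({
                "name": fields[0],
                "ip": fields[1],
                "role": fields[2],
                "gpu_total": int(fields[3]),
                "gpu_allocated": int(fields[4]),
            })
    return nodes


def free_gpus(nodes):
    """Return the number of unallocated GPUs across all parsed nodes."""
    return sum(n["gpu_total"] - n["gpu_allocated"] for n in nodes)


# A shortened copy of the sample output from this step.
SAMPLE = """\
NAME                                   IPADDRESS      ROLE    GPU(Total)  GPU(Allocated)
cn-zhangjiakou.i-8vb2knpxzlk449e7lugx  192.168.0.209  <none>  1           0
cn-zhangjiakou.i-8vb2knpxzlk449e7lugy  192.168.0.210  <none>  1           0
cn-zhangjiakou.i-8vb2knpxzlk449e7lugz  192.168.0.208  <none>  1           0
cn-zhangjiakou.i-8vb7yuo831zjzijo9sdw  192.168.0.205  master  0           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/3 (0%)
"""

nodes = parse_top_node(SAMPLE)
print("free GPUs:", free_gpus(nodes))
```

In a notebook, the text could come from capturing the command output (for example `out = !arena top node` followed by `"\n".join(out)`) instead of the embedded sample.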


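6. Submit the training job with Arena. The `training-data` volume was created during the [shared storage setup](../docs/setup/SETUP_NAS.md).  
`--data=training-data:/training` mounts it at `/training` inside the training job.  
`/training/models/tensorflow-benchmarks` is where step 1 placed the source code.  
`/training/output/mpi-benchmarks` is where the training results are written, under the `output` directory created in step 3.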
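
The path mapping described above can be sketched in a few lines. This is an illustration only: it assumes that `${HOME}` is `/root` (as the `tree` output suggests) and that the volume is mounted at `/training`. The helper `to_container_path` is hypothetical.

```python
import posixpath


def to_container_path(host_path, host_root="/root", mount_point="/training"):
    """Translate a path under the data volume's host directory into the
    path the training job sees inside the container.

    host_root and mount_point are assumptions taken from this walkthrough.
    """
    host_path = posixpath.normpath(host_path)
    host_root = posixpath.normpath(host_root)
    if host_path == host_root:
        return mount_point
    prefix = host_root + "/"
    if not host_path.startswith(prefix):
        raise ValueError(f"{host_path} is not inside {host_root}")
    return posixpath.join(mount_point, host_path[len(prefix):])


print(to_container_path("/root/models/tensorflow-benchmarks"))
print(to_container_path("/root/output/mpi-benchmarks"))
```

Paths outside the volume raise `ValueError`, which is a reminder that only files placed in the data volume are visible to the job.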

In [5]:
# Set the Job Name
%env JOB_NAME=tf-mpi-benchmarks
%env USER_DATA_NAME=training-data
# Submit a training job 
# using code and data from Data Volume
!arena submit mpi \
             --name=$JOB_NAME \
             --workers=2 \
             --gpus=1 \
             --data=$USER_DATA_NAME:/training \
             --tensorboard \
             --image=uber/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5 \
             --logdir=/training/output/mpi-benchmarks \
             "mpirun python /training/models/tensorflow-benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 --batch_size 64     --variable_update horovod --train_dir=/training/output/mpi-benchmarks --summary_verbosity=3 --save_summaries_steps=10"

env: JOB_NAME=tf-mpi-benchmarks
configmap/tf-mpi-benchmarks-mpijob created
configmap/tf-mpi-benchmarks-mpijob labeled
service/tf-mpi-benchmarks-tensorboard created
deployment.extensions/tf-mpi-benchmarks-tensorboard created
mpijob.kubeflow.org/tf-mpi-benchmarks created
INFO[0004] The Job tf-mpi-benchmarks has been submitted successfully 
INFO[0004] You can run `arena get tf-mpi-benchmarks --type mpijob` to check the job status 


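> The `--logdir` option of the `arena` command tells TensorBoard which directory of the training job to read event files from.
> For the full list of options, see the [command-line documentation](https://github.com/kubeflow/arena/blob/master/docs/cli/arena_submit_tfjob.md).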
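
If you prefer to drive the submission from Python rather than a `!` shell escape, the same command can be assembled as an argument list. The sketch below uses only the flags that appear in the cell above. The function `build_submit_command` is a hypothetical helper, and the `/training` mount point is the one chosen in this walkthrough.

```python
import shlex


def build_submit_command(job_name, data_volume, workers, gpus,
                         image, logdir, training_command):
    """Assemble the `arena submit mpi` argument list used in this step.

    Only flags shown in this walkthrough are used; the data volume is
    mounted at /training, as in the original command.
    """
    return [
        "arena", "submit", "mpi",
        f"--name={job_name}",
        f"--workers={workers}",
        f"--gpus={gpus}",
        f"--data={data_volume}:/training",
        "--tensorboard",
        f"--image={image}",
        f"--logdir={logdir}",
        training_command,
    ]


cmd = build_submit_command(
    job_name="tf-mpi-benchmarks",
    data_volume="training-data",
    workers=2,
    gpus=1,
    image="uber/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5",
    logdir="/training/output/mpi-benchmarks",
    training_command=(
        "mpirun python /training/models/tensorflow-benchmarks/scripts/"
        "tf_cnn_benchmarks/tf_cnn_benchmarks.py --model resnet101 "
        "--batch_size 64 --variable_update horovod "
        "--train_dir=/training/output/mpi-benchmarks "
        "--summary_verbosity=3 --save_summaries_steps=10"
    ),
)

# Print a shell-safe rendering for review before running anything.
print(" ".join(shlex.quote(part) for part in cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would submit the job. Keeping the training command as a single list element avoids shell quoting problems.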

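7. Check the training status. Once the job moves from `Pending` to `Running`, you can view its logs and GPU utilization. The `-e` flag prints events, which helps diagnose why a job is `Pending`. An event such as `[Pulling] pulling image "uber/horovod:0.13.11-tf1.10.0-torch0.4.0-py3.5"` usually means the job is still `Pending` because the large container image is being pulled. In that case, rerun the command below until the status changes to `Running`.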

In [6]:
! arena get $JOB_NAME -e

STATUS: RUNNING
NAMESPACE: default
TRAINING DURATION: 57s

NAME               STATUS   TRAINER  AGE  INSTANCE                          NODE
tf-mpi-benchmarks  RUNNING  MPIJOB   57s  tf-mpi-benchmarks-launcher-ph6fr  192.168.0.208
tf-mpi-benchmarks  RUNNING  MPIJOB   57s  tf-mpi-benchmarks-worker-0        192.168.0.210
tf-mpi-benchmarks  RUNNING  MPIJOB   57s  tf-mpi-benchmarks-worker-1        192.168.0.209

Your tensorboard will be available on:
192.168.0.206:31129   

Events: 
No events for pending pod
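
Rerunning the command by hand can be replaced with a small polling loop. The following is a sketch under stated assumptions: it relies on the `STATUS:` line format shown above, and `parse_status` and `wait_until_running` are hypothetical helpers. The caller supplies `fetch`, a function that returns the text printed by `arena get`.

```python
import time


def parse_status(text):
    """Return the value of the `STATUS:` line in `arena get` output, or None."""
    for line in text.splitlines():
        if line.startswith("STATUS:"):
            return line.split(":", 1)[1].strip()
    return None


def wait_until_running(fetch, interval=10.0, max_attempts=30, sleep=time.sleep):
    """Call fetch() repeatedly until the job reports RUNNING.

    fetch is a zero-argument callable returning the command output as text.
    Returns True once RUNNING is seen, False if max_attempts is exhausted.
    """
    for _ in range(max_attempts):
        if parse_status(fetch()) == "RUNNING":
            return True
        sleep(interval)
    return False


sample = "STATUS: RUNNING\nNAMESPACE: default\nTRAINING DURATION: 57s\n"
print(parse_status(sample))
```

One possible `fetch` is `lambda: subprocess.run(["arena", "get", "tf-mpi-benchmarks", "-e"], capture_output=True, text=True).stdout`. The loop does not distinguish a failed job from a pending one, so a real script would likely want to stop early on failure states as well.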


![](3-1-tensorboard.jpg)

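8. Check the logs. Adjust the value of `--tail=` to control how many lines are shown; by default, all logs are displayed.
To follow the logs in real time, add the `-f` flag.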

In [7]:
! arena logs --tail=50 $JOB_NAME

2019-03-02T10:25:06.633100624Z 2019-03-02 10:25:06.632415: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] 0:   N 
2019-03-02T10:25:06.633242941Z 2019-03-02 10:25:06.632923: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1097] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15111 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-16GB, pci bus id: 0000:00:09.0, compute capability: 6.0)
2019-03-02T10:25:06.946553283Z I0302 10:25:06.947077 140068638594816 tf_logging.py:115] Starting standard services.
2019-03-02T10:25:06.94658461Z W0302 10:25:06.947337 140068638594816 tf_logging.py:125] Standard services need a 'logdir' passed to the SessionManager
2019-03-02T10:25:06.946588591Z I0302 10:25:06.947422 140068638594816 tf_logging.py:115] Starting queue runners.
2019-03-02T10:25:08.02147656Z I0302 10:25:08.020608 140189223290624 tf_logging.py:115] Running local_init_op.
2019-03-02T10:25:08.456852287Z I0302 10:25:08.455843 1401892232906

9. Check the GPU usage of the running training job.

In [8]:
! arena top job $JOB_NAME

INSTANCE NAME                     GPU(Device Index)  GPU(Duty Cycle)  GPU(Memory MiB)           STATUS   NODE
tf-mpi-benchmarks-launcher-s6x5c  N/A                N/A              N/A                       Running  192.168.0.208
tf-mpi-benchmarks-worker-0        0                  97%              15651.0MiB / 16276.2MiB   Running  192.168.0.210
tf-mpi-benchmarks-worker-1        0                  0%               15651.0MiB / 16276.2MiB   Running  192.168.0.208
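
The table above shows one worker at 97% duty cycle and another at 0%, which is worth a second look. The sketch below parses the worker rows and lists instances whose duty cycle falls below a threshold. It assumes the row layout shown above; `parse_top_job` and `idle_instances` are hypothetical helpers, and the 5% threshold is an arbitrary choice.

```python
import re

# Matches worker rows such as:
# tf-mpi-benchmarks-worker-0  0  97%  15651.0MiB / 16276.2MiB  Running  192.168.0.210
ROW = re.compile(
    r"^(?P<instance>\S+)\s+(?P<device>\d+)\s+(?P<duty>\d+)%\s+"
    r"(?P<used>[\d.]+)MiB\s*/\s*(?P<total>[\d.]+)MiB\s+"
    r"(?P<status>\S+)\s+(?P<node>\S+)\s*$"
)


def parse_top_job(text):
    """Parse GPU rows from `arena top job` output; rows showing N/A are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "instance": m.group("instance"),
                "device": int(m.group("device")),
                "duty": int(m.group("duty")),
                "used_mib": float(m.group("used")),
                "total_mib": float(m.group("total")),
                "status": m.group("status"),
                "node": m.group("node"),
            })
    return rows


def idle_instances(rows, threshold=5):
    """Return instance names whose GPU duty cycle is below threshold percent."""
    return [r["instance"] for r in rows if r["duty"] < threshold]


TOP_JOB_SAMPLE = """\
INSTANCE NAME                     GPU(Device Index)  GPU(Duty Cycle)  GPU(Memory MiB)           STATUS   NODE
tf-mpi-benchmarks-launcher-s6x5c  N/A                N/A              N/A                       Running  192.168.0.208
tf-mpi-benchmarks-worker-0        0                  97%              15651.0MiB / 16276.2MiB   Running  192.168.0.210
tf-mpi-benchmarks-worker-1        0                  0%               15651.0MiB / 16276.2MiB   Running  192.168.0.208
"""

print(idle_instances(parse_top_job(TOP_JOB_SAMPLE)))
```

A single reading is only a snapshot, so a low value may be transient. Sampling several times before drawing conclusions is usually safer.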


10. View training trends in TensorBoard.
Running `arena get ${JOB_NAME}` shows the cluster-internal address of TensorBoard, which is `192.168.0.206:31129` in this example. If you cannot reach TensorBoard directly from your laptop, consider running `sshuttle` on the laptop, for example `sshuttle -r root@41.82.59.51 192.168.0.0/16`, where `41.82.59.51` is the public IP of a cluster node that is reachable over SSH.

In [9]:
! arena get $JOB_NAME

STATUS: RUNNING
NAMESPACE: default
TRAINING DURATION: 1m

NAME               STATUS   TRAINER  AGE  INSTANCE                          NODE
tf-mpi-benchmarks  RUNNING  MPIJOB   1m   tf-mpi-benchmarks-launcher-ph6fr  192.168.0.208
tf-mpi-benchmarks  RUNNING  MPIJOB   1m   tf-mpi-benchmarks-worker-0        192.168.0.210
tf-mpi-benchmarks  RUNNING  MPIJOB   1m   tf-mpi-benchmarks-worker-1        192.168.0.209

Your tensorboard will be available on:
192.168.0.206:31129   
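
The TensorBoard address can also be pulled out of the command output programmatically. This is a minimal sketch that assumes the two-line format shown above and that TensorBoard is served over plain HTTP at that address. The helper `tensorboard_url` is hypothetical.

```python
def tensorboard_url(text):
    """Return an http:// URL built from the address that follows the
    'Your tensorboard will be available on:' line, or None if absent.

    The http scheme is an assumption; the output only gives host:port.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == "Your tensorboard will be available on:":
            for candidate in lines[i + 1:]:
                candidate = candidate.strip()
                if candidate:
                    return "http://" + candidate
    return None


GET_SAMPLE = """\
STATUS: RUNNING
NAMESPACE: default
TRAINING DURATION: 1m

Your tensorboard will be available on:
192.168.0.206:31129   
"""

print(tensorboard_url(GET_SAMPLE))
```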


11. Inspect the results produced by training. The results appear under `output`.

In [10]:
! tree -I ai-starter -L 3 ${HOME}

/root
|-- dataset
|-- models
|   `-- tensorflow-benchmarks
|       |-- LICENSE
|       |-- README.md
|       |-- bower_components
|       |-- dashboard_app
|       |-- index.html
|       |-- js
|       |-- scripts
|       |-- soumith_benchmarks.html
|       `-- tools
`-- output
    `-- mpi-benchmarks
        |-- checkpoint
        |-- events.out.tfevents.1551522695.tf-mpi-benchmarks-worker-0
        |-- graph.pbtxt
        |-- model.ckpt-110.data-00000-of-00001
        |-- model.ckpt-110.index
        `-- model.ckpt-110.meta

10 directories, 10 files
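
The checkpoint files in the listing encode the training step in their names (`model.ckpt-110.*`). As a small illustration, the sketch below finds the highest step among a list of file names. It assumes the `model.ckpt-<step>.index` naming seen above; `latest_checkpoint_step` is a hypothetical helper.

```python
import re

# Each checkpoint has one .index file, so counting those avoids duplicates.
CKPT_INDEX = re.compile(r"^model\.ckpt-(\d+)\.index$")


def latest_checkpoint_step(filenames):
    """Return the largest step number among checkpoint index files, or None."""
    steps = []
    for name in filenames:
        m = CKPT_INDEX.match(name)
        if m:
            steps.append(int(m.group(1)))
    return max(steps) if steps else None


# File names copied from the tree output above.
FILES = [
    "checkpoint",
    "events.out.tfevents.1551522695.tf-mpi-benchmarks-worker-0",
    "graph.pbtxt",
    "model.ckpt-110.data-00000-of-00001",
    "model.ckpt-110.index",
    "model.ckpt-110.meta",
]

print(latest_checkpoint_step(FILES))
```

On the notebook host, the list could come from `os.listdir(os.path.expanduser("~/output/mpi-benchmarks"))`.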


12. Delete the completed job.

In [11]:
! arena delete $JOB_NAME

service "tf-distributed-mnist-tensorboard" deleted
deployment.extensions "tf-distributed-mnist-tensorboard" deleted
tfjob.kubeflow.org "tf-distributed-mnist" deleted
configmap "tf-distributed-mnist-tfjob" deleted
INFO[0004] The Job tf-distributed-mnist has been deleted successfully 

> Note: the sample output above names resources of a different job (`tf-distributed-mnist`). For this job, expect resource names based on `tf-mpi-benchmarks`, matching those created in step 6.


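Congratulations! You have successfully run a training job with `arena` and can easily inspect it in TensorBoard.

To summarize, this walkthrough covered:
1. How to prepare code and place it in a data volume
2. How to reference the data volume from a distributed MPI training job and use the code and data in it
3. How to manage your distributed training jobs with arena.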