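# Getting Started with Cloud-Native AI - 2. Running Distributed MNIST
Next, we show how to submit, operate, and manage a distributed training job with arena. With arena, managing a distributed training job is as quick and convenient as managing a single-machine application.
In this example, we will demonstrate how to:

* Download and prepare the data
* Submit a distributed training job with Arena and check its status and logs
* View the training job in TensorBoard

> Prerequisite: complete the [shared storage setup](../docs/setup/SETUP_NAS.md) first. The current ${HOME} is the directory backed by the `training-data` data volume created there.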

1. Download the TensorFlow sample source code to ${HOME}/models

In [1]:
! if [ ! -d "${HOME}/models/tensorflow-sample-code" ] ; then \
  git clone "https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git" "${HOME}/models/tensorflow-sample-code"; \
fi

Cloning into '/root/models/tensorflow-sample-code'...
remote: Enumerating objects: 242, done.
remote: Counting objects: 100% (242/242), done.
remote: Compressing objects: 100% (112/112), done.
remote: Total 242 (delta 93), reused 242 (delta 93)
Receiving objects: 100% (242/242), 11.25 MiB | 0 bytes/s, done.
Resolving deltas: 100% (93/93), done.
Checking connectivity... done.


2. Download the MNIST data to ${HOME}/dataset/mnist

In [2]:
! mkdir -p ${HOME}/dataset/mnist && \
  cd ${HOME}/dataset/mnist && \
  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/t10k-images-idx3-ubyte.gz && \
  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/t10k-labels-idx1-ubyte.gz && \
  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/train-images-idx3-ubyte.gz && \
  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/train-labels-idx1-ubyte.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1610k    0 1610k    0     0  2453k      0 --:--:-- --:--:-- --:--:-- 2450k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  4542    0  4542    0     0  12110      0 --:--:-- --:--:-- --:--:-- 12144
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9680k    0 9680k    0     0  12.4M      0 --:--:-- --:--:-- --:--:-- 12.4M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 28881    0 28881    0     0  72028      0 --:--:-- --:--:-- --:--:-- 72022
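
Because `curl -O` saves whatever the server returns, including an error page, it can be worth confirming that each download is a real MNIST archive. The following is a minimal sketch and not part of the original notebook: it assumes the standard MNIST IDX layout, in which the decompressed file begins with a big-endian 32-bit magic number (2051 for image files, 2049 for label files). The helper name `idx_magic` is hypothetical.

```python
import gzip
import struct

# Magic numbers of the MNIST IDX format (assumption: standard MNIST files).
IMAGE_MAGIC = 2051
LABEL_MAGIC = 2049


def idx_magic(path):
    """Return the 32-bit big-endian magic number at the start of a gzipped IDX file."""
    with gzip.open(path, "rb") as f:
        header = f.read(4)
    if len(header) != 4:
        raise ValueError(f"{path}: file is too short to be an IDX file")
    return struct.unpack(">I", header)[0]
```

In this walkthrough, calling `idx_magic` on `train-images-idx3-ubyte.gz` would be expected to return `IMAGE_MAGIC`, and on `train-labels-idx1-ubyte.gz` to return `LABEL_MAGIC`. A file that is not gzip data at all makes `gzip.open(...).read` raise an error instead.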


3. Create the training output directory ${HOME}/output

In [3]:
! mkdir -p ${HOME}/output

4. Inspect the directory layout: `dataset` holds the data, `models` holds the model code, and `output` holds the training results.

In [4]:
! tree -I ai-starter -L 3 ${HOME}

/root
|-- dataset
|   `-- mnist
|       |-- t10k-images-idx3-ubyte.gz
|       |-- t10k-labels-idx1-ubyte.gz
|       |-- train-images-idx3-ubyte.gz
|       `-- train-labels-idx1-ubyte.gz
|-- models
|   `-- tensorflow-sample-code
|       |-- README.md
|       |-- data
|       |-- mnist-tf
|       |-- models
|       |-- mpijob
|       `-- tfjob
`-- output

10 directories, 5 files


5. Check available GPU resources

In [5]:
! arena top node

NAME                                   IPADDRESS      ROLE    GPU(Total)  GPU(Allocated)
cn-zhangjiakou.i-8vb2knpxzlk449e7lugx  192.168.0.209  <none>  1           0
cn-zhangjiakou.i-8vb2knpxzlk449e7lugy  192.168.0.210  <none>  1           0
cn-zhangjiakou.i-8vb2knpxzlk449e7lugz  192.168.0.208  <none>  1           0
cn-zhangjiakou.i-8vb7yuo831zjzijo9sdw  192.168.0.205  master  0           0
cn-zhangjiakou.i-8vbezxqzueo7662i0dbq  192.168.0.204  master  0           0
cn-zhangjiakou.i-8vbezxqzueo7681j4fav  192.168.0.206  master  0           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/3 (0%)  
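
The job submitted in the next step asks for 2 workers with `--gpus=1`, so it helps to confirm that enough GPUs are free. The sketch below is an illustration rather than an Arena feature: it assumes the `arena top node` table has exactly the five columns shown above and that `--gpus` is counted per worker. `free_gpus` is a hypothetical helper.

```python
# A few rows copied from the `arena top node` output above.
SAMPLE = """\
NAME                                   IPADDRESS      ROLE    GPU(Total)  GPU(Allocated)
cn-zhangjiakou.i-8vb2knpxzlk449e7lugx  192.168.0.209  <none>  1           0
cn-zhangjiakou.i-8vb2knpxzlk449e7lugy  192.168.0.210  <none>  1           0
cn-zhangjiakou.i-8vb2knpxzlk449e7lugz  192.168.0.208  <none>  1           0
cn-zhangjiakou.i-8vb7yuo831zjzijo9sdw  192.168.0.205  master  0           0
"""


def free_gpus(top_node_output):
    """Sum (total - allocated) GPUs over every node row of the table."""
    free = 0
    for line in top_node_output.splitlines():
        cols = line.split()
        # Node rows have five columns and end with two integers;
        # the header, separator, and summary lines do not match.
        if len(cols) == 5 and cols[3].isdigit() and cols[4].isdigit():
            free += int(cols[3]) - int(cols[4])
    return free


workers, gpus_per_worker = 2, 1  # values used by the submit command in step 6
enough = free_gpus(SAMPLE) >= workers * gpus_per_worker
```

This is only a cluster-wide count. Each worker must still fit on a single node, so a positive result does not guarantee that the scheduler can place every worker.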


6. Submit the training job with Arena. The `training-data` volume was created during the [shared storage setup](../docs/setup/SETUP_NAS.md).   
`--data=training-data:/training` mounts it at `/training` inside the training job. Under that mount, `/training/models/tensorflow-sample-code` holds the source code cloned in step 1, `/training/dataset/mnist` holds the data downloaded in step 2, and `/training/output` is the output directory created in step 3.

In [6]:
# Set the Job Name
%env JOB_NAME=tf-distributed-mnist
# Submit a training job 
# using code and data from Data Volume
!arena submit tf \
             --name=$JOB_NAME \
             --ps=1 \
             --workers=2 \
             --gpus=1 \
             --data=training-data:/training \
             --tensorboard \
             --psImage=tensorflow/tensorflow:1.5.0 \
             --image=tensorflow/tensorflow:1.5.0-gpu-py3 \
             --logdir=/training/output/distributed-mnist \
             "python /training/models/tensorflow-sample-code/tfjob/docker/v1alpha2/distributed-mnist/main.py --max_steps 10000 --data_dir /training/dataset/mnist --log_dir /training/output/distributed-mnist"

env: JOB_NAME=tf-distributed-mnist
configmap/tf-distributed-mnist-tfjob created
configmap/tf-distributed-mnist-tfjob labeled
service/tf-distributed-mnist-tensorboard created
deployment.extensions/tf-distributed-mnist-tensorboard created
tfjob.kubeflow.org/tf-distributed-mnist created
INFO[0004] The Job tf-distributed-mnist has been submitted successfully 
INFO[0004] You can run `arena get tf-distributed-mnist --type tfjob` to check the job status 


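> The `--logdir` option of the `arena` command tells `tensorboard` which directory of the training job to read events from.
> For the full list of options, see the [command-line documentation](https://github.com/kubeflow/arena/blob/master/docs/cli/arena_submit_tfjob.md).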
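
The path mapping behind `--data=training-data:/training` can be sketched as a simple prefix swap. This is an illustrative helper, not something Arena provides: it assumes the data volume is backed by `${HOME}` (which is `/root` in this run) and mounted at `/training`. `to_container_path` is a hypothetical name.

```python
from pathlib import PurePosixPath


def to_container_path(host_path, volume_root, mount_point="/training"):
    """Translate a path under the data volume into the path seen by the training job.

    Raises ValueError if host_path does not lie under volume_root.
    """
    relative = PurePosixPath(host_path).relative_to(volume_root)
    return str(PurePosixPath(mount_point) / relative)


# Directories prepared in steps 2 and 3, as seen from inside the job.
data_dir = to_container_path("/root/dataset/mnist", "/root")
output_dir = to_container_path("/root/output", "/root")
```

With the layout from step 4, `data_dir` is `/training/dataset/mnist` and `output_dir` is `/training/output`, which match the `--data_dir` and `--logdir` prefixes used in the submit command.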

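7. Check the training status. Once the job moves from `Pending` to `Running`, you can view its logs and GPU utilization. The `-e` flag lists events, which helps explain why a job is still `Pending`. A message such as `[Pulling] pulling image "tensorflow/tensorflow:1.5.0-gpu-py3"` usually means the job is waiting for a large container image to finish downloading. In that case, rerun the command below until the status becomes `Running`.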

In [7]:
! arena get $JOB_NAME -e

STATUS: PENDING
NAMESPACE: default
TRAINING DURATION: 0s

NAME                  STATUS   TRAINER  AGE  INSTANCE                       NODE
tf-distributed-mnist  PENDING  TFJOB    0s   tf-distributed-mnist-ps-0      N/A
tf-distributed-mnist  PENDING  TFJOB    0s   tf-distributed-mnist-worker-0  N/A
tf-distributed-mnist  PENDING  TFJOB    0s   tf-distributed-mnist-worker-1  N/A

Your tensorboard will be available on:
192.168.0.206:31963   

Events: 
INSTANCE  TYPE  AGE  MESSAGE
--------  ----  ---  -------
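
Rerunning the command by hand can be automated. The sketch below is one possible approach, assuming the `STATUS:` line format shown in the output above. `job_status`, `wait_until_running`, and `fetch_with_arena` are hypothetical helper names; only the `arena get <job> -e` invocation comes from this walkthrough.

```python
import subprocess
import time


def job_status(get_output):
    """Extract the value of the top-level 'STATUS:' line from `arena get` output."""
    for line in get_output.splitlines():
        if line.startswith("STATUS:"):
            return line.split(":", 1)[1].strip()
    return "UNKNOWN"


def wait_until_running(fetch, interval=10.0, max_tries=60):
    """Call fetch() repeatedly until the job reports RUNNING or tries run out."""
    for _ in range(max_tries):
        if job_status(fetch()) == "RUNNING":
            return True
        time.sleep(interval)
    return False


def fetch_with_arena(job_name):
    """Run `arena get <job> -e` and return its standard output as text."""
    result = subprocess.run(
        ["arena", "get", job_name, "-e"],
        capture_output=True, text=True, check=False,
    )
    return result.stdout
```

In the notebook this could be called as `wait_until_running(lambda: fetch_with_arena("tf-distributed-mnist"))`. Passing the fetch step as a function keeps the polling logic independent of the CLI, so it can be exercised with canned output.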
                         
                         
                         


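8. Check the logs. Adjust the value of `--tail=` to control how many lines are shown; by default, all logs are displayed.
To follow the logs in real time, add the `-f` flag.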

In [8]:
! arena logs --tail=50 $JOB_NAME

2019-02-26T07:28:59.062611786Z Instructions for updating:
2019-02-26T07:28:59.062616602Z Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
2019-02-26T07:28:59.102120053Z Instructions for updating:
2019-02-26T07:28:59.102123749Z Please write your own downloading logic.
2019-02-26T07:28:59.106955986Z Instructions for updating:
2019-02-26T07:28:59.106959188Z Please use tf.data to implement this functionality.
2019-02-26T07:28:59.705261581Z Instructions for updating:
2019-02-26T07:28:59.705266844Z Please use tf.data to implement this functionality.
2019-02-26T07:28:59.710131028Z Instructions for updating:
2019-02-26T07:28:59.710134306Z Please use tf.one_hot on tensors.
2019-02-26T07:28:59.775952959Z Instructions for updating:
2019-02-26T07:28:59.775956287Z Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
2019-02-26T07:29:00.062085896Z Instructions for updating:
2019-02-26T07:29:00.062089719Z 
2019-02-26T07:29:00

9. View the GPU usage of the running training job

In [9]:
! arena top job $JOB_NAME

INSTANCE NAME                  GPU(Device Index)  GPU(Duty Cycle)  GPU(Memory MiB)          STATUS   NODE
tf-distributed-mnist-ps-0      N/A                N/A              N/A                      Running  192.168.0.208
tf-distributed-mnist-worker-0  0                  9%               551.0MiB / 16276.2MiB    Running  192.168.0.210
tf-distributed-mnist-worker-1  0                  6%               1092.0MiB / 16276.2MiB   Running  192.168.0.208


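10. View training trends in TensorBoard. Open the address that `arena get` prints under "Your tensorboard will be available on" (in the run below, `192.168.0.206:30308`). If you cannot reach TensorBoard directly from your laptop, consider running `sshuttle` on the laptop, for example `sshuttle -r root@41.82.59.51 192.168.0.0/16`, where `41.82.59.51` is the public IP of a cluster node that is reachable over ssh.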

In [10]:
! arena get $JOB_NAME

STATUS: RUNNING
NAMESPACE: default
TRAINING DURATION: 45s

NAME                  STATUS   TRAINER  AGE  INSTANCE                       NODE
tf-distributed-mnist  RUNNING  TFJOB    45s  tf-distributed-mnist-ps-0      192.168.0.208
tf-distributed-mnist  RUNNING  TFJOB    45s  tf-distributed-mnist-worker-0  192.168.0.210
tf-distributed-mnist  RUNNING  TFJOB    45s  tf-distributed-mnist-worker-1  192.168.0.208

Your tensorboard will be available on:
192.168.0.206:30308   


![](2-1-tensorboard.jpg)
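
Since the TensorBoard port differs between runs, the address is best read from the command output rather than copied from an earlier run. A minimal sketch, assuming the output keeps the "Your tensorboard will be available on:" marker line shown above. `tensorboard_address` is a hypothetical helper.

```python
MARKER = "Your tensorboard will be available on:"


def tensorboard_address(get_output):
    """Return the first non-empty line after the marker line, or None if absent."""
    lines = get_output.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == MARKER:
            for following in lines[i + 1:]:
                if following.strip():
                    return following.strip()
    return None


# Tail of the `arena get` output from this step.
sample = "Your tensorboard will be available on:\n192.168.0.206:30308   \n"
address = tensorboard_address(sample)
```

Here `address` is `192.168.0.206:30308`, which can then be opened in a browser, directly or through the `sshuttle` tunnel described above.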

11. Inspect the results produced by training. The results are written under `output`.

In [11]:
! tree -I ai-starter -L 3 ${HOME}

/root
|-- dataset
|   `-- mnist
|       |-- t10k-images-idx3-ubyte.gz
|       |-- t10k-labels-idx1-ubyte.gz
|       |-- train-images-idx3-ubyte.gz
|       `-- train-labels-idx1-ubyte.gz
|-- models
|   `-- tensorflow-sample-code
|       |-- README.md
|       |-- data
|       |-- mnist-tf
|       |-- models
|       |-- mpijob
|       `-- tfjob
`-- output
    `-- distributed-mnist
        |-- test
        `-- train

13 directories, 5 files


12. Delete the finished job

In [13]:
! arena delete $JOB_NAME

service "tf-distributed-mnist-tensorboard" deleted
deployment.extensions "tf-distributed-mnist-tensorboard" deleted
tfjob.kubeflow.org "tf-distributed-mnist" deleted
configmap "tf-distributed-mnist-tfjob" deleted
INFO[0004] The Job tf-distributed-mnist has been deleted successfully 


Congratulations! You have successfully run a training job with `arena` and can easily check TensorBoard as well.

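To summarize, we hope this walkthrough showed you:
1. How to prepare code and data and place them in a data volume
2. How to reference the data volume in a distributed training job and use the code and data in it
3. How to manage your distributed training jobs with arena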

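The above is an example of doing machine learning in the cloud with `Arena`. You can modify the code in `${HOME}/models/tensorflow-sample-code/tfjob/docker/v1alpha2/distributed-mnist/main.py` and resubmit the job to iterate on your model.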