# Getting Started with Cloud-Native AI - 1. A First Run with MNIST

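In this example, we will show how to:

* Download and prepare the data
* Submit a single-node training job with Arena, then check its status and logs
* Inspect the training job with TensorBoard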

> Prerequisite: first complete the [shared storage configuration]() section of the documentation.

1. Download the TensorFlow sample source code to the ${HOME}/models directory

In [1]:
! git clone https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git ${HOME}/models/tensorflow-sample-code

Cloning into '/home/jovyan/models/tensorflow-sample-code'...
remote: Enumerating objects: 242, done.
remote: Counting objects: 100% (242/242), done.
remote: Compressing objects: 100% (112/112), done.
remote: Total 242 (delta 93), reused 242 (delta 93)
Receiving objects: 100% (242/242), 11.25 MiB | 0 bytes/s, done.
Resolving deltas: 100% (93/93), done.
Checking connectivity... done.


2. Download the MNIST data to ${HOME}/dataset/mnist

In [2]:
! mkdir -p ${HOME}/dataset/mnist && \
  cd ${HOME}/dataset/mnist && \
  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/t10k-images-idx3-ubyte.gz && \
  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/t10k-labels-idx1-ubyte.gz && \
  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/train-images-idx3-ubyte.gz && \
  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/train-labels-idx1-ubyte.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1610k    0 1610k    0     0  6076k      0 --:--:-- --:--:-- --:--:-- 6099k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  4542    0  4542    0     0  26221      0 --:--:-- --:--:-- --:--:-- 26254
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9680k    0 9680k    0     0  12.9M      0 --:--:-- --:--:-- --:--:-- 12.9M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 28881    0 28881    0     0   167k      0 --:--:-- --:--:-- --:--:--  168k
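A quick sanity check on the downloaded files can catch a failed download (for example, an HTML error page saved under a `.gz` name) before the training job runs. The following is a minimal sketch, assuming the four files are gzip-compressed IDX files as their names suggest; the helper names `read_idx_header` and `check_mnist_dir` are hypothetical and not part of the sample repository.

```python
import gzip
import struct
from pathlib import Path

# Magic numbers of the IDX format used by MNIST:
# 2051 for image files (idx3), 2049 for label files (idx1).
IDX_MAGIC = {"idx3": 2051, "idx1": 2049}


def read_idx_header(path):
    """Return (magic, item_count) from a gzip-compressed IDX file."""
    with gzip.open(path, "rb") as f:
        magic, count = struct.unpack(">II", f.read(8))
    return magic, count


def check_mnist_dir(data_dir):
    """Map each *-ubyte.gz file name to True if its IDX header looks valid."""
    results = {}
    for path in sorted(Path(data_dir).glob("*-ubyte.gz")):
        kind = "idx3" if "idx3" in path.name else "idx1"
        try:
            magic, _ = read_idx_header(path)
            results[path.name] = magic == IDX_MAGIC[kind]
        except (OSError, EOFError, struct.error):
            # Not a gzip file, truncated, or shorter than the 8-byte header.
            results[path.name] = False
    return results


if __name__ == "__main__":
    # Directory from step 2 of this walkthrough.
    print(check_mnist_dir(Path.home() / "dataset" / "mnist"))
```

Any file reported as `False` is worth downloading again before submitting the job.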


3. Check the available GPU resources

In [4]:
! arena top node

NAME                                IPADDRESS      ROLE    GPU(Total)  GPU(Allocated)
cn-hangzhou.i-bp10z0xdiqgv653v9h12  192.168.0.197  master  0           0
cn-hangzhou.i-bp13aeb3a5zfo852z812  192.168.0.198  master  0           0
cn-hangzhou.i-bp16qzvrrju4y4ynd012  192.168.0.196  master  0           0
cn-hangzhou.i-bp1a0lysmwctstugx212  192.168.0.200  <none>  1           0
cn-hangzhou.i-bp1a0lysmwctstugx212  192.168.0.201  <none>  1           0
cn-hangzhou.i-bp1a0lysmwctstugx212  192.168.0.199  <none>  1           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/3 (0%)  
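If you want to check free GPUs from a script instead of reading the table, the text output can be parsed. The sketch below assumes the five-column layout shown above (it is not a stable, documented format), and the helper names `parse_top_node` and `free_gpus` are hypothetical.

```python
def parse_top_node(text):
    """Parse `arena top node` table text into a list of per-node dicts."""
    nodes = []
    for line in text.splitlines():
        fields = line.split()
        # Data rows have exactly five columns and end with two integers;
        # the header, separator, and summary lines do not match.
        if len(fields) == 5 and fields[3].isdigit() and fields[4].isdigit():
            nodes.append({
                "name": fields[0],
                "ip": fields[1],
                "role": fields[2],
                "gpu_total": int(fields[3]),
                "gpu_allocated": int(fields[4]),
            })
    return nodes


def free_gpus(nodes):
    """Total number of GPUs that are not yet allocated."""
    return sum(n["gpu_total"] - n["gpu_allocated"] for n in nodes)


# A few rows copied from the output above.
SAMPLE = """\
NAME                                IPADDRESS      ROLE    GPU(Total)  GPU(Allocated)
cn-hangzhou.i-bp10z0xdiqgv653v9h12  192.168.0.197  master  0           0
cn-hangzhou.i-bp1a0lysmwctstugx212  192.168.0.200  <none>  1           0
cn-hangzhou.i-bp1a0lysmwctstugx212  192.168.0.201  <none>  1           0
Allocated/Total GPUs In Cluster:
0/2 (0%)
"""

if __name__ == "__main__":
    print(free_gpus(parse_top_node(SAMPLE)))  # prints 2
```

The job in the next step requests one GPU, so it can only be scheduled when at least one GPU is free.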


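4. Submit the training job with Arena. Here, `training-data` is the data volume created when [configuring shared storage]().   
`--data=training-data:/training` mounts it at the `/training` directory inside the training job. The subdirectory `/training/models/tensorflow-sample-code` is where step 1 cloned the source code, and the subdirectory `/training/dataset/mnist` is where step 2 downloaded the data.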

In [3]:
# Submit a training job 
# using code and data from NAS
!arena submit tf \
             --name=tf-mnist \
             --gpus=1 \
             --data=training-data:/training \
             --tensorboard \
             --image=tensorflow/tensorflow:1.11.0-gpu-py3 \
             "python /training/models/tensorflow-sample-code/tfjob/docker/mnist/main.py --max_steps 10000 --data_dir /training/dataset/mnist"

configmap/tf-mnist-tfjob created
configmap/tf-mnist-tfjob labeled
service/tf-mnist-tensorboard created
deployment.extensions/tf-mnist-tensorboard created
tfjob.kubeflow.org/tf-mnist created
INFO[0001] The Job tf-mnist has been submitted successfully 
INFO[0001] You can run `arena get tf-mnist --type tfjob` to check the job status 


> For the full list of parameters, see the [command-line documentation](https://github.com/kubeflow/arena/blob/master/docs/cli/arena_submit_tfjob.md)
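When the same job is submitted repeatedly with different settings, it can help to assemble the command as an argument list rather than one long shell string. This is only a sketch: `build_submit_command` is a hypothetical helper, and it uses only the flags that appear in the cell above.

```python
def build_submit_command(name, gpus, data_volume, mount_path, image,
                         command, tensorboard=True):
    """Build the `arena submit tf` argument list used in this walkthrough."""
    args = [
        "arena", "submit", "tf",
        f"--name={name}",
        f"--gpus={gpus}",
        # Mount the shared data volume at mount_path inside the job.
        f"--data={data_volume}:{mount_path}",
    ]
    if tensorboard:
        args.append("--tensorboard")
    args.append(f"--image={image}")
    # The training command is passed as a single final argument.
    args.append(command)
    return args


# The same values as the cell above.
cmd = build_submit_command(
    name="tf-mnist",
    gpus=1,
    data_volume="training-data",
    mount_path="/training",
    image="tensorflow/tensorflow:1.11.0-gpu-py3",
    command=("python /training/models/tensorflow-sample-code/tfjob/docker/"
             "mnist/main.py --max_steps 10000 --data_dir /training/dataset/mnist"),
)

if __name__ == "__main__":
    print(cmd)
    # To actually submit (requires arena on PATH):
    #   import subprocess; subprocess.run(cmd, check=True)
```

Passing a list avoids shell quoting problems with the training command, which contains spaces.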

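5. Check the status of the training job. Once the job status changes from `Pending` to `Running`, you can view the logs and GPU utilization. The `-e` flag is used here to make it easier to see why a job is still `Pending`.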

In [5]:
! arena get tf-mnist -e

NAME      STATUS   TRAINER  AGE  INSTANCE          NODE
tf-mnist  PENDING  TFJOB    0s   tf-mnist-chief-0  N/A

Your tensorboard will be available on:
192.168.0.199:30342   

Events: 
INSTANCE          TYPE     AGE  MESSAGE
--------          ----     ---  -------
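Rather than rerunning the cell by hand, the status column can be read from the output and polled until the job leaves `PENDING`. The sketch below assumes the table layout shown above; `job_status` and `wait_while_pending` are hypothetical helpers, and the caller supplies `fetch_output`, a function that returns the text of `arena get`.

```python
import time


def job_status(get_output, job_name):
    """Return the STATUS column for job_name, or None if the row is missing."""
    lines = get_output.splitlines()
    for i, line in enumerate(lines):
        if line.split()[:2] == ["NAME", "STATUS"]:
            # Rows directly below the header belong to the job table.
            for row in lines[i + 1:]:
                fields = row.split()
                if len(fields) >= 2 and fields[0] == job_name:
                    return fields[1]
    return None


def wait_while_pending(fetch_output, job_name, interval=5.0,
                       max_checks=60, sleep=time.sleep):
    """Poll until the job is no longer PENDING; return the last status seen."""
    status = None
    for _ in range(max_checks):
        status = job_status(fetch_output(), job_name)
        if status != "PENDING":
            return status
        sleep(interval)
    return status


# Table copied from the output above.
SAMPLE = """\
NAME      STATUS   TRAINER  AGE  INSTANCE          NODE
tf-mnist  PENDING  TFJOB    0s   tf-mnist-chief-0  N/A
"""

if __name__ == "__main__":
    print(job_status(SAMPLE, "tf-mnist"))  # prints PENDING
```

In a notebook, `fetch_output` could wrap `subprocess.run(["arena", "get", "tf-mnist"], capture_output=True, text=True).stdout`.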
                                                                                                      


6. Follow the logs in real time

In [38]:
# get the job logs
! arena logs -f --tail=50 tf-mnist 

2019-02-19T08:58:50.62548952Z Instructions for updating:
2019-02-19T08:58:50.625494225Z Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
2019-02-19T08:58:50.663583511Z Instructions for updating:
2019-02-19T08:58:50.663587321Z Please write your own downloading logic.
2019-02-19T08:58:50.667460936Z Instructions for updating:
2019-02-19T08:58:50.667464619Z Please use tf.data to implement this functionality.
2019-02-19T08:58:51.295033762Z Instructions for updating:
2019-02-19T08:58:51.295037969Z Please use tf.data to implement this functionality.
2019-02-19T08:58:51.300682416Z Instructions for updating:
2019-02-19T08:58:51.300685495Z Please use tf.one_hot on tensors.
2019-02-19T08:58:51.369275274Z Instructions for updating:
2019-02-19T08:58:51.3692787Z Please use alternatives such as official/mnist/dataset.py from tensorflow/models.
2019-02-19T08:58:51.649170871Z Instructions for updating:
2019-02-19T08:58:51.649178597Z 
2019-02-19T08:58:51.64

2019-02-19T09:00:49.842843734Z Accuracy at step 1510: 0.974
2019-02-19T09:00:49.842851131Z Accuracy at step 1520: 0.972
2019-02-19T09:00:49.842855009Z Accuracy at step 1530: 0.9738
2019-02-19T09:00:49.842858665Z Accuracy at step 1540: 0.9741
2019-02-19T09:00:49.842862454Z Accuracy at step 1550: 0.9751
2019-02-19T09:00:49.84286602Z Accuracy at step 1560: 0.9738
2019-02-19T09:00:49.842869709Z Accuracy at step 1570: 0.974
2019-02-19T09:00:49.842873354Z Accuracy at step 1580: 0.9737
2019-02-19T09:00:49.842877344Z Accuracy at step 1590: 0.9708
2019-02-19T09:00:49.842880702Z Adding run metadata for 1599
2019-02-19T09:00:49.842884419Z Accuracy at step 1600: 0.9706
2019-02-19T09:00:49.842893858Z Accuracy at step 1610: 0.9724
2019-02-19T09:00:49.84289822Z Accuracy at step 1620: 0.9696
2019-02-19T09:00:49.842902226Z Accuracy at step 1630: 0.9729
2019-02-19T09:00:49.842906168Z Accuracy at step 1640: 0.9754
2019-02-19T09:00:49.842909735Z Accuracy at step 1650: 0.9754
2019-02-19T09:
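The `Accuracy at step N: X` lines in this log can be collected into (step, accuracy) pairs if you want to track progress without TensorBoard. This is a minimal sketch that assumes the line format shown above; `parse_accuracy` is a hypothetical helper.

```python
import re

# Matches lines such as "... Accuracy at step 1510: 0.974".
ACCURACY_RE = re.compile(r"Accuracy at step (\d+): (\d+(?:\.\d+)?)")


def parse_accuracy(log_text):
    """Return a list of (step, accuracy) tuples found in the log text."""
    return [(int(m.group(1)), float(m.group(2)))
            for m in ACCURACY_RE.finditer(log_text)]


# Lines copied from the log output above.
SAMPLE = """\
2019-02-19T09:00:49.842843734Z Accuracy at step 1510: 0.974
2019-02-19T09:00:49.842851131Z Accuracy at step 1520: 0.972
2019-02-19T09:00:49.842855009Z Accuracy at step 1530: 0.9738
2019-02-19T09:00:49.842880702Z Adding run metadata for 1599
"""

if __name__ == "__main__":
    points = parse_accuracy(SAMPLE)
    print(points)  # [(1510, 0.974), (1520, 0.972), (1530, 0.9738)]
    print(max(points, key=lambda p: p[1]))  # (1510, 0.974)
```

Lines that do not report accuracy, such as the `Adding run metadata` entry, are skipped.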

7. View the real-time GPU usage of the training job

In [None]:
! arena top job tf-mnist 

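8. View the training trends in TensorBoard. You can reach TensorBoard at the address reported by `arena get` (`192.168.0.199:30342` in the step 5 output). If you cannot reach TensorBoard directly from your laptop, consider using `sshuttle`, for example: `sshuttle -r root@41.82.59.51 192.168.0.0/16`. Here `41.82.59.51` is the public IP of a node in the cluster, and that IP must be reachable over SSH.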

In [39]:
# show job detail
! arena get tf-mnist

STATUS: SUCCEEDED
NAMESPACE: default
TRAINING DURATION: 7m

NAME      STATUS     TRAINER  AGE  INSTANCE          NODE
tf-mnist  SUCCEEDED  TFJOB    8m   tf-mnist-chief-0  N/A


![](2-tensorboard.jpg)

9. Delete the finished job

In [6]:
# delete job
! arena delete tf-mnist

service "tf-mnist-tensorboard" deleted
deployment.extensions "tf-mnist-tensorboard" deleted
tfjob.kubeflow.org "tf-mnist" deleted
configmap "tf-mnist-tfjob" deleted
INFO[0001] The Job tf-mnist has been deleted successfully 


Congratulations! You have successfully run a training job with `arena`, and you can easily check it in TensorBoard as well.

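To summarize, we hope this demo has shown you:
1. How to prepare code and data and place them in a data volume
2. How to reference the data volume in a training job and use the code and data it contains
3. How to manage your training jobs with arena.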