# Getting Started with Cloud-Native AI - 1. A First Run with MNIST

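In this example, we will demonstrate how to:

* Download and prepare the data
* Submit a single-node training job with Arena, then check its status and logs
* Inspect the training job with TensorBoard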

> Prerequisite: complete the [shared storage setup](../docs/setup/SETUP_NAS.md) first. The current ${HOME} is the directory backed by the `training-data` data volume created there.

1. Download the TensorFlow sample source code into ${HOME}/models

In [1]:
! git clone https://code.aliyun.com/xiaozhou/tensorflow-sample-code.git ${HOME}/models/tensorflow-sample-code

Cloning into '/root/models/tensorflow-sample-code'...
remote: Enumerating objects: 242, done.
remote: Counting objects: 100% (242/242), done.
remote: Compressing objects: 100% (112/112), done.
remote: Total 242 (delta 93), reused 242 (delta 93)
Receiving objects: 100% (242/242), 11.25 MiB | 22.15 MiB/s, done.
Resolving deltas: 100% (93/93), done.
Checking connectivity... done.


2. Download the MNIST data into ${HOME}/dataset/mnist

In [2]:
! mkdir -p ${HOME}/dataset/mnist && \
  cd ${HOME}/dataset/mnist && \
  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/t10k-images-idx3-ubyte.gz && \
  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/t10k-labels-idx1-ubyte.gz && \
  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/train-images-idx3-ubyte.gz && \
  curl -O https://code.aliyun.com/xiaozhou/tensorflow-sample-code/raw/master/data/train-labels-idx1-ubyte.gz

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1610k    0 1610k    0     0  2432k      0 --:--:-- --:--:-- --:--:-- 2432k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  4542    0  4542    0     0  13465      0 --:--:-- --:--:-- --:--:-- 13477
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 9680k    0 9680k    0     0  12.7M      0 --:--:-- --:--:-- --:--:-- 12.7M
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 28881    0 28881    0     0  83500      0 --:--:-- --:--:-- --:--:-- 83713
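
Because `curl -O` saves whatever the server returns, it can be worth confirming that each download really is a gzip-compressed IDX file before training. The sketch below is not part of the original notebook. It assumes the standard IDX layout used by MNIST, where the file starts with two big-endian 32-bit integers (a magic number, 2051 for images or 2049 for labels, followed by the item count). The helper names `read_idx_header` and `check_mnist_file` are hypothetical.

```python
import gzip
import struct

# Magic numbers defined by the IDX format used for MNIST.
IMAGES_MAGIC = 2051  # idx3-ubyte files (images)
LABELS_MAGIC = 2049  # idx1-ubyte files (labels)


def read_idx_header(path):
    """Return (magic, item_count) read from a gzip-compressed IDX file."""
    with gzip.open(path, "rb") as f:
        header = f.read(8)
    if len(header) != 8:
        raise ValueError(f"{path}: file too short to be an IDX file")
    # Two big-endian unsigned 32-bit integers: magic number, then item count.
    magic, count = struct.unpack(">II", header)
    return magic, count


def check_mnist_file(path):
    """Return True if the magic number matches what the file name implies."""
    magic, _ = read_idx_header(path)
    expected = IMAGES_MAGIC if "images-idx3" in str(path) else LABELS_MAGIC
    return magic == expected
```

Calling `check_mnist_file` on each of the four files under `${HOME}/dataset/mnist` should return `True`. A file that is not gzip data (for example a saved error page) makes `gzip.open(...).read` raise an `OSError` instead.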


3. Create the training output directory ${HOME}/output

In [1]:
! mkdir -p ${HOME}/output

4. View the directory structure. `dataset` is the data directory, `models` is the model code directory, and `output` is the training output directory.

In [3]:
! tree -I ai-starter -L 3 ${HOME}

/root
|-- dataset
|   `-- mnist
|       |-- t10k-images-idx3-ubyte.gz
|       |-- t10k-labels-idx1-ubyte.gz
|       |-- train-images-idx3-ubyte.gz
|       `-- train-labels-idx1-ubyte.gz
|-- models
|   `-- tensorflow-sample-code
|       |-- README.md
|       |-- data
|       |-- mnist-tf
|       |-- models
|       |-- mpijob
|       `-- tfjob
`-- output

10 directories, 5 files


5. Check the available GPU resources

In [4]:
! arena top node

NAME                                 IPADDRESS      ROLE    GPU(Total)  GPU(Allocated)
cn-huhehaote.i-hp309790vg0alb65q123  192.168.0.116  master  0           0
cn-huhehaote.i-hp32to8ln1xdug4rk123  192.168.0.194  <none>  8           0
cn-huhehaote.i-hp32to8ln1xdug4rk123  192.168.0.195  <none>  8           0
cn-huhehaote.i-hp3b4qysu7phej5q2123  192.168.0.115  master  0           0
cn-huhehaote.i-hp3b4qysu7phen3sa123  192.168.0.118  <none>  8           0
cn-huhehaote.i-hp3b4qysu7phen3sa123  192.168.0.117  <none>  8           0
cn-huhehaote.i-hp3dc30s7ew8nbmtq123  192.168.0.114  master  0           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
0/32 (0%)  
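
If you want to make this check from a script rather than by eye, the table can be summed in a few lines. This is a minimal sketch that assumes the five-column layout shown above (the last two columns being total and allocated GPUs). Other Arena versions may print a different layout, and the function name `summarize_gpus` is hypothetical.

```python
def summarize_gpus(top_node_output):
    """Return (total, allocated) GPU counts summed over the node rows."""
    total = allocated = 0
    for line in top_node_output.splitlines():
        fields = line.split()
        # Node rows have five columns and end with two integers;
        # the header, separator, and summary lines do not.
        if len(fields) == 5 and fields[3].isdigit() and fields[4].isdigit():
            total += int(fields[3])
            allocated += int(fields[4])
    return total, allocated


sample = """\
NAME                                 IPADDRESS      ROLE    GPU(Total)  GPU(Allocated)
cn-huhehaote.i-hp309790vg0alb65q123  192.168.0.116  master  0           0
cn-huhehaote.i-hp32to8ln1xdug4rk123  192.168.0.194  <none>  8           0
cn-huhehaote.i-hp3b4qysu7phen3sa123  192.168.0.118  <none>  8           0
"""
total, allocated = summarize_gpus(sample)
print(f"{total - allocated} of {total} GPUs free")
```

The sample above is a three-row excerpt of the output. In a notebook cell, the real text could be captured with `out = !arena top node` and joined with newlines before parsing.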


6. Submit the training job with Arena. The `training-data` volume used here was created during the [shared storage setup](../docs/setup/SETUP_NAS.md).   
`--data=training-data:/training` mounts that volume at `/training` inside the training job. Under it, `/training/models/tensorflow-sample-code` is where step 1 cloned the source code, `/training/dataset/mnist` is where step 2 downloaded the data, and `/training/output` is the output directory created in step 3.
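
The mapping between notebook paths and job paths is a plain prefix substitution, which the following sketch illustrates. It assumes, as in the outputs of this walkthrough, that `${HOME}` is `/root` and that the volume is mounted at `/training`. The function name `to_container_path` is hypothetical.

```python
from pathlib import PurePosixPath


def to_container_path(host_path, volume_root="/root", mount_point="/training"):
    """Translate a path under the data volume into the path seen by the job."""
    # relative_to raises ValueError if host_path is not under volume_root.
    relative = PurePosixPath(host_path).relative_to(volume_root)
    return str(PurePosixPath(mount_point) / relative)


for path in ("/root/models/tensorflow-sample-code",
             "/root/dataset/mnist",
             "/root/output"):
    print(path, "->", to_container_path(path))
```

A path outside the volume, such as `/tmp/data`, raises `ValueError`, which is a useful reminder that only files under the data volume are visible to the training job.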

In [5]:
# Submit a training job 
# using code and data from NAS
!arena submit tf \
             --name=tf-mnist \
             --gpus=1 \
             --data=training-data:/training \
             --tensorboard \
             --image=tensorflow/tensorflow:1.11.0-gpu-py3 \
             --logdir=/training/output/mnist \
             "python /training/models/tensorflow-sample-code/tfjob/docker/mnist/main.py --max_steps 10000 --data_dir /training/dataset/mnist --log_dir /training/output/mnist"

configmap/tf-mnist-tfjob created
configmap/tf-mnist-tfjob labeled
service/tf-mnist-tensorboard created
deployment.extensions/tf-mnist-tensorboard created
tfjob.kubeflow.org/tf-mnist created
INFO[0002] The Job tf-mnist has been submitted successfully 
INFO[0002] You can run `arena get tf-mnist --type tfjob` to check the job status 


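> The `--logdir` option of `Arena` tells `tensorboard` which directory of the training job to read events from.
> For the full list of options, see the [command-line documentation](https://github.com/kubeflow/arena/blob/master/docs/cli/arena_submit_tfjob.md).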
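
When you resubmit the job with different settings, it can help to assemble the command from parameters instead of editing the cell by hand. The sketch below only builds the argument list and uses exactly the flags from the cell above. It does not call Arena, and the helper name `build_submit_command` is hypothetical. `shlex.join` requires Python 3.8 or later.

```python
import shlex


def build_submit_command(name, gpus, data, image, logdir, train_cmd):
    """Build the `arena submit tf` argument list used in this walkthrough."""
    return [
        "arena", "submit", "tf",
        f"--name={name}",
        f"--gpus={gpus}",
        f"--data={data}",
        "--tensorboard",
        f"--image={image}",
        f"--logdir={logdir}",
        train_cmd,  # the training command is passed as a single argument
    ]


cmd = build_submit_command(
    name="tf-mnist",
    gpus=1,
    data="training-data:/training",
    image="tensorflow/tensorflow:1.11.0-gpu-py3",
    logdir="/training/output/mnist",
    train_cmd=(
        "python /training/models/tensorflow-sample-code/tfjob/docker/mnist/main.py"
        " --max_steps 10000 --data_dir /training/dataset/mnist"
        " --log_dir /training/output/mnist"
    ),
)
# Print a copy-pasteable shell command for review before running it.
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` once you are satisfied with the printed command. Keeping `--logdir` and the script's `--log_dir` pointed at the same directory is what lets TensorBoard find the events.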

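7. Check the training status. Once the job status changes from `Pending` to `Running`, you can view its logs and GPU utilization. The `-e` flag prints events, which helps explain why a job is `Pending`. An event such as `[Pulling] pulling image "tensorflow/tensorflow:1.11.0-gpu-py3"` usually means the job is still `Pending` because the large container image is being pulled. In that case, rerun the command below until the status becomes `Running`.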

In [6]:
! arena get tf-mnist -e 2>&1

STATUS: RUNNING
NAMESPACE: default
TRAINING DURATION: 2m

NAME      STATUS   TRAINER  AGE  INSTANCE          NODE
tf-mnist  RUNNING  TFJOB    2m   tf-mnist-chief-0  192.168.0.195

Your tensorboard will be available on:
192.168.0.114:32033   

Events: 
No events for pending pod
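
Rerunning the cell by hand can be replaced with a small polling loop. The following is a sketch under the assumption that `arena get` prints a `STATUS:` line as in the output above. The function names, the 10-second interval, and the 10-minute timeout are illustrative choices, not Arena defaults.

```python
import re
import subprocess
import time


def parse_status(get_output):
    """Extract the value of the `STATUS:` line, or None if it is absent."""
    match = re.search(r"^STATUS:\s*(\S+)", get_output, flags=re.MULTILINE)
    return match.group(1) if match else None


def wait_until_running(job, interval=10, timeout=600):
    """Poll `arena get <job> -e` until the job reports RUNNING."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        result = subprocess.run(
            ["arena", "get", job, "-e"], capture_output=True, text=True
        )
        # Combine both streams, mirroring the `2>&1` in the cell above.
        status = parse_status(result.stdout + result.stderr)
        if status is not None and status.upper() == "RUNNING":
            return status
        time.sleep(interval)
    raise TimeoutError(f"job {job} did not reach RUNNING within {timeout}s")
```

Calling `wait_until_running("tf-mnist")` blocks until the job is running or raises `TimeoutError`. The loop does not handle failed jobs, so a real script would also want to stop on a terminal status.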


8. Check the logs in real time. Adjust the value of `--tail=` to control how many lines are shown; by default, all logs are shown.

In [7]:
! arena logs --tail=50 tf-mnist 

2019-02-24T08:25:26.053563882Z Accuracy at step 7080: 0.9813
2019-02-24T08:25:26.053566438Z Accuracy at step 7090: 0.9825
2019-02-24T08:25:26.053568961Z Adding run metadata for 7099
2019-02-24T08:25:26.053571498Z Accuracy at step 7100: 0.9802
2019-02-24T08:25:26.053574047Z Accuracy at step 7110: 0.9798
2019-02-24T08:25:26.053576564Z Accuracy at step 7120: 0.9817
2019-02-24T08:25:26.053579078Z Accuracy at step 7130: 0.9813
2019-02-24T08:25:26.05358162Z Accuracy at step 7140: 0.9806
2019-02-24T08:25:26.053584146Z Accuracy at step 7150: 0.9816
2019-02-24T08:25:26.053586671Z Accuracy at step 7160: 0.981
2019-02-24T08:25:26.05358922Z Accuracy at step 7170: 0.9812
2019-02-24T08:25:26.053591794Z Accuracy at step 7180: 0.9808
2019-02-24T08:25:26.053594316Z Accuracy at step 7190: 0.9792
2019-02-24T08:25:26.053596833Z Adding run metadata for 7199
2019-02-24T08:25:26.053599394Z Accuracy at step 7200: 0.9799
2019-02-24T08:25:26.053610985Z Accuracy at step 7210: 0.9803
2019-02-24T08
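
To follow accuracy without scanning the log by eye, the `Accuracy at step N: X` lines can be extracted with a regular expression. This sketch relies only on the line format visible in the log above, and `parse_accuracy` is a hypothetical helper name.

```python
import re

# Matches the accuracy lines printed by the training script in the log above.
ACCURACY_RE = re.compile(r"Accuracy at step (\d+): ([0-9.]+)")


def parse_accuracy(log_text):
    """Return a list of (step, accuracy) pairs found in the log text."""
    return [(int(step), float(acc)) for step, acc in ACCURACY_RE.findall(log_text)]


sample_log = """\
2019-02-24T08:25:26.053566438Z Accuracy at step 7090: 0.9825
2019-02-24T08:25:26.053568961Z Adding run metadata for 7099
2019-02-24T08:25:26.053571498Z Accuracy at step 7100: 0.9802
"""
points = parse_accuracy(sample_log)
best_step, best_acc = max(points, key=lambda p: p[1])
print(f"best accuracy {best_acc} at step {best_step}")
```

Lines that do not match, such as `Adding run metadata`, are skipped, so the function can be fed the full output of `arena logs`.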

9. Check the GPU usage of the running training job

In [8]:
! arena top job tf-mnist 

INSTANCE NAME     GPU(Device Index)  GPU(Duty Cycle)  GPU(Memory MiB)         STATUS   NODE
tf-mnist-chief-0  5                  0%               549.0MiB / 16276.2MiB   Running  192.168.0.194
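
The memory column can be turned into a usage fraction if you want to log it over time. This is a minimal sketch that assumes the `used MiB / total MiB` format shown in the row above. The helper name `parse_gpu_memory` is hypothetical.

```python
import re


def parse_gpu_memory(text):
    """Return (used_mib, total_mib) from a `top job` row, or None if not found."""
    match = re.search(r"([0-9.]+)MiB\s*/\s*([0-9.]+)MiB", text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


row = "tf-mnist-chief-0  5  0%  549.0MiB / 16276.2MiB   Running  192.168.0.194"
used, total = parse_gpu_memory(row)
print(f"GPU memory in use: {used / total:.1%}")
```

A duty cycle of 0% together with a small memory footprint, as in the row above, is a snapshot at one instant, so sampling several times gives a better picture of utilization.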


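10. View the training trends in TensorBoard. Use the address printed by `arena get` under `Your tensorboard will be available on:` (`192.168.0.114:32033` in the output below) to open TensorBoard. If you cannot reach TensorBoard directly from your laptop, consider running `sshuttle` on the laptop, for example `sshuttle -r root@41.82.59.51 192.168.0.0/16`, where `41.82.59.51` is the public IP of a cluster node that is reachable over ssh.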

In [9]:
# show job detail
! arena get tf-mnist

STATUS: RUNNING
NAMESPACE: default
TRAINING DURATION: 1m

NAME      STATUS   TRAINER  AGE  INSTANCE          NODE
tf-mnist  RUNNING  TFJOB    1m   tf-mnist-chief-0  192.168.0.195

Your tensorboard will be available on:
192.168.0.114:32033   


![](1-1-tensorboard.jpg)
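
Before deciding whether `sshuttle` is needed, you can test from your laptop whether the TensorBoard address accepts TCP connections. The sketch below uses only the Python standard library. The helper names and the 3-second timeout are illustrative choices.

```python
import socket


def split_endpoint(endpoint):
    """Split 'host:port' (as printed by `arena get`) into (host, port)."""
    host, _, port = endpoint.strip().rpartition(":")
    return host, int(port)


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

For example, `is_reachable(*split_endpoint("192.168.0.114:32033"))` returning `False` from a laptop outside the cluster network suggests that a tunnel such as the `sshuttle` command above is required. A successful TCP connection only shows that the port is open, not that TensorBoard has finished loading events.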

11. View the results produced by training. The training results have been written under `output`.

In [10]:
! tree -I ai-starter -L 3 ${HOME}

/root
|-- dataset
|   `-- mnist
|       |-- t10k-images-idx3-ubyte.gz
|       |-- t10k-labels-idx1-ubyte.gz
|       |-- train-images-idx3-ubyte.gz
|       `-- train-labels-idx1-ubyte.gz
|-- models
|   `-- tensorflow-sample-code
|       |-- README.md
|       |-- data
|       |-- mnist-tf
|       |-- models
|       |-- mpijob
|       `-- tfjob
`-- output
    `-- mnist
        |-- test
        `-- train

13 directories, 5 files
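
The `tree` output is limited to three levels, so it does not show the files inside `train` and `test`. TensorBoard reads event files whose names conventionally start with `events.out.tfevents.`, and the sketch below searches for them. The exact files written by `main.py` are not shown in this walkthrough, so treat the file-name pattern as an assumption. `find_event_files` is a hypothetical helper name.

```python
from pathlib import Path


def find_event_files(logdir):
    """Return a sorted list of TensorBoard event files found under logdir."""
    # rglob searches logdir and all of its subdirectories.
    return sorted(Path(logdir).rglob("events.out.tfevents.*"))
```

For example, `find_event_files(Path.home() / "output" / "mnist")` returning an empty list after training would suggest that `--log_dir` and the directory you are inspecting do not match.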


12. Delete the completed job

In [11]:
# delete job
! arena delete tf-mnist

service "tf-mnist-tensorboard" deleted
deployment.extensions "tf-mnist-tensorboard" deleted
tfjob.kubeflow.org "tf-mnist" deleted
configmap "tf-mnist-tfjob" deleted
INFO[0002] The Job tf-mnist has been deleted successfully 


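Congratulations! You have successfully run a training job with `arena` and can easily inspect it in TensorBoard.

To summarize, we hope this walkthrough showed you:
1. How to prepare code and data and place them in a data volume
2. How to reference the data volume in a training job and use the code and data in it
3. How to manage your training jobs with arena

This was an example of machine learning in the cloud with `Arena`. To iterate on the model, modify `${HOME}/models/tensorflow-sample-code/tfjob/docker/mnist/main.py` and resubmit the job.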