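# Getting Started with Cloud-Native AI - Submitting a BERT Training Job
BERT is a pre-trained language representation model for natural language processing (NLP) developed by Google. It achieved state-of-the-art results on 11 NLP tasks. Google open-sourced the BERT code in November 2018 and also released pre-trained models for multiple languages. With Arena, we can submit BERT training code and easily take advantage of this research.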

In this example, we will demonstrate how to:
* Use Arena to submit a BERT pretraining job, and check the job's status and logs.

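> Prerequisite: first complete the [shared storage setup](../docs/setup/SETUP_NAS.md) described in the documentation. The current `${HOME}` is the directory backing the `training-data` data volume created there.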

1. Download the BERT sample source code to the `${HOME}/models/bert` directory

In [2]:
! git clone "https://github.com/google-research/bert.git" "${HOME}/models/bert"

Cloning into '/root/models/bert'...
remote: Enumerating objects: 317, done.
remote: Total 317 (delta 0), reused 0 (delta 0), pack-reused 317
Receiving objects: 100% (317/317), 254.03 KiB | 149.00 KiB/s, done.
Resolving deltas: 100% (178/178), done.
Checking connectivity... done.


2. Download the pre-trained BERT model files (vocabulary, configuration, and checkpoint) that the pretraining job needs

In [5]:
! mkdir -p ${HOME}/dataset/bert
! cd ${HOME}/dataset/bert && \
    curl -O http://kubeflow.oss-cn-beijing.aliyuncs.com/uncased_L-12_H-768_A-12.zip && \
    unzip uncased_L-12_H-768_A-12.zip

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  388M  100  388M    0     0  12.6M      0  0:00:30  0:00:30 --:--:-- 12.2M
Archive:  uncased_L-12_H-768_A-12.zip
   creating: uncased_L-12_H-768_A-12/
  inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.meta  
  inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.data-00000-of-00001  
  inflating: uncased_L-12_H-768_A-12/vocab.txt  
  inflating: uncased_L-12_H-768_A-12/bert_model.ckpt.index  
  inflating: uncased_L-12_H-768_A-12/bert_config.json  


3. Create the output directory for training results, `${HOME}/output/bert`

In [19]:
! mkdir -p ${HOME}/output/bert

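4. Inspect the directory structure.
* `dataset/bert` is the data directory, which stores the data needed for training.
* `models/bert` is the model code directory, which stores the training code.
* `output/bert` is the results directory, which holds the trained model and checkpoints.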

In [21]:
! tree -I ai-starter -L 2 ${HOME}

/root
|-- dataset
|   `-- bert
|-- models
|   |-- bert
|   `-- tensorflow-benchmarks
`-- output
    `-- bert

7 directories, 0 files
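
Before submitting any job, it can help to confirm that the three directories really exist. The following is a minimal sketch and not part of the original notebook: the helper name `missing_dirs` is hypothetical, and it assumes the layout shown in the listing above.

```python
import os

# Sub-directories (relative to ${HOME}) that the later steps rely on.
EXPECTED_DIRS = ("dataset/bert", "models/bert", "output/bert")


def missing_dirs(home, expected=EXPECTED_DIRS):
    """Return the expected sub-directories that do not exist under `home`."""
    return [d for d in expected if not os.path.isdir(os.path.join(home, d))]


if __name__ == "__main__":
    missing = missing_dirs(os.path.expanduser("~"))
    if missing:
        print("missing directories: %s" % ", ".join(missing))
    else:
        print("all expected directories are present")
```

An empty result means steps 1 to 3 left the volume in the expected state.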


5. Check the available GPU resources. Before training starts, make sure there are enough idle GPUs.

In [22]:
! arena top node

NAME                                   IPADDRESS      ROLE    GPU(Total)  GPU(Allocated)
cn-zhangjiakou.i-8vb2knpxzlk449e7lugx  192.168.0.209  <none>  1           0
cn-zhangjiakou.i-8vb2knpxzlk449e7lugy  192.168.0.210  <none>  1           1
cn-zhangjiakou.i-8vb2knpxzlk449e7lugz  192.168.0.208  <none>  1           0
cn-zhangjiakou.i-8vb7yuo831zjzijo9sdw  192.168.0.205  master  0           0
cn-zhangjiakou.i-8vbezxqzueo7662i0dbq  192.168.0.204  master  0           0
cn-zhangjiakou.i-8vbezxqzueo7681j4fav  192.168.0.206  master  0           0
-----------------------------------------------------------------------------------------
Allocated/Total GPUs In Cluster:
1/3 (33%)  
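
If you want to check the number of idle GPUs programmatically, one option is to parse the table printed above. This is only a sketch: the helper `free_gpus` is hypothetical, and it assumes the five-column layout shown in this output, which may differ in other Arena versions.

```python
def free_gpus(top_node_output):
    """Sum GPU(Total) - GPU(Allocated) over the node rows of the table."""
    free = 0
    for line in top_node_output.splitlines():
        fields = line.split()
        # Node rows have exactly five columns and end with two integers;
        # the header, separator, and summary lines do not match this shape.
        if len(fields) == 5 and fields[3].isdigit() and fields[4].isdigit():
            free += int(fields[3]) - int(fields[4])
    return free


# A shortened copy of the output above, used as sample input.
sample = """\
NAME                                   IPADDRESS      ROLE    GPU(Total)  GPU(Allocated)
cn-zhangjiakou.i-8vb2knpxzlk449e7lugx  192.168.0.209  <none>  1           0
cn-zhangjiakou.i-8vb2knpxzlk449e7lugy  192.168.0.210  <none>  1           1
cn-zhangjiakou.i-8vb7yuo831zjzijo9sdw  192.168.0.205  master  0           0
"""
print(free_gpus(sample))  # prints 1
```

Both jobs in this example request `--gpus=1`, so at least one idle GPU is needed.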


6. Use Arena to submit a job that creates the BERT pretraining data, that is, the tfrecord file required for BERT pretraining.
Here `training-data` is the NAS storage claim created when [configuring shared storage](../docs/setup/SETUP_NAS.md).  
`--data=training-data:/training` mounts it at the `/training` directory of the training job.
* `/training/models/bert` is where the source code was cloned in step 1
* `/training/dataset/bert` is where the data was downloaded in step 2
* `/training/output/bert` is the training output directory created in step 3
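
The mapping between paths under `${HOME}` and paths inside the job can be sketched as follows. The helper `to_container_path` is hypothetical, and the defaults assume that `${HOME}` is `/root` (as in the `tree` output) and that it is the root of the `training-data` volume.

```python
import posixpath


def to_container_path(host_path, home="/root", mount="/training"):
    """Translate a path under `home` (the volume root) to its path in the job."""
    rel = posixpath.relpath(host_path, home)
    if rel == ".." or rel.startswith("../"):
        raise ValueError("%s is not under %s" % (host_path, home))
    return posixpath.normpath(posixpath.join(mount, rel))


for p in ("/root/models/bert", "/root/dataset/bert", "/root/output/bert"):
    print(p, "->", to_container_path(p))
```

Every path passed to the training scripts below uses the `/training/...` form, because that is what the job sees.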

In [9]:
%env PRETRAIN_DATA_JOB_NAME=bert-create-pretrain-data
!arena submit tf \
             --name=$PRETRAIN_DATA_JOB_NAME \
             --workers=1 \
             --gpus=1 \
             --data=training-data:/training \
             --image=tensorflow/tensorflow:1.11.0-gpu-py3 \
            "python3 /training/models/bert/create_pretraining_data.py \
            --input_file=/training/models/bert/sample_text.txt \
            --output_file=/training/output/bert/tf_examples.tfrecord \
            --vocab_file=/training/dataset/bert/uncased_L-12_H-768_A-12/vocab.txt \
            --do_lower_case=True \
            --max_seq_length=256 \
            --max_predictions_per_seq=39 \
            --masked_lm_prob=0.15 \
            --random_seed=12345 \
            --dupe_factor=5; python3 -c \"import os;fd=os.open('/training/output/bert/tf_examples.tfrecord',os.O_NONBLOCK);os.fsync(fd)\""


env: PRETRAIN_DATA_JOB_NAME=bert-create-pretrain-data
configmap/bert-create-pretrain-data-tfjob created
configmap/bert-create-pretrain-data-tfjob labeled
service/bert-create-pretrain-data-tensorboard created
deployment.extensions/bert-create-pretrain-data-tensorboard created
tfjob.kubeflow.org/bert-create-pretrain-data created
[36mINFO[0m[0004] The Job bert-create-pretrain-data has been submitted successfully 
[36mINFO[0m[0004] You can run `arena get bert-create-pretrain-data --type tfjob` to check the job status 


7. Check the status of the pretraining-data job. This step does not involve heavy computation, so the job completes quickly.

In [10]:
! arena get $PRETRAIN_DATA_JOB_NAME -e

STATUS: SUCCEEDED
NAMESPACE: default
TRAINING DURATION: 3s

NAME                       STATUS     TRAINER  AGE  INSTANCE                           NODE
bert-create-pretrain-data  SUCCEEDED  TFJOB    48s  bert-create-pretrain-data-chief-0  N/A

Your tensorboard will be available on:
192.168.0.206:31785   

Events: 
No events for pending pod
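
To wait for this job from a script rather than by eye, the `STATUS:` line of the output can be extracted. This is a sketch under the assumption that the output keeps the format shown above; `parse_status` is a hypothetical helper.

```python
import re


def parse_status(arena_get_output):
    """Return the value of the first `STATUS:` line, or None if there is none."""
    match = re.search(r"^STATUS:\s*(\S+)", arena_get_output, flags=re.MULTILINE)
    return match.group(1) if match else None


# The first lines of the output above, used as sample input.
sample = "STATUS: SUCCEEDED\nNAMESPACE: default\nTRAINING DURATION: 3s\n"
print(parse_status(sample))  # prints SUCCEEDED
```

In practice the input could come from `subprocess.run(["arena", "get", job_name], capture_output=True, text=True).stdout`, called in a loop with a short sleep until the status is no longer `RUNNING`.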


8. View the logs of the job that created the pretraining data


In [12]:
! arena logs --tail=30 $PRETRAIN_DATA_JOB_NAME


2019-03-01T03:19:57.848619386Z INFO:tensorflow:masked_lm_weights: 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 0.0
2019-03-01T03:19:57.848753719Z INFO:tensorflow:next_sentence_labels: 0
2019-03-01T03:19:57.849205981Z INFO:tensorflow:*** Example ***
2019-03-01T03:19:57.849406004Z INFO:tensorflow:tokens: [CLS] like most of [MASK] fellow gold - seekers [MASK] cass was super ##sti [MASK] . [SEP] basket on phil ##am ##mon ' s head , and tr ##otted [MASK] up a neighbouring street . phil ##am ##mon followed , half contempt ##uous , half wondering at [MASK] this [MASK] might be [MASK] which could feed the self - con ##ce ##it of anything so ab ##ject as his ragged little api ##sh guide ; but the novel roar and w ##hir ##l of the street [MASK] the perpetual [MASK] of busy faces [MASK] the line of cu ##rri ##cles , pal ##an ##quin ##s , [MASK] ass ##es [MASK] camel ##s , elephants , [MAS

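9. Submit the BERT pretraining job. It reads the `tf_examples.tfrecord` file generated in step 6 and writes its output to `/training/output/bert/pretraining_output`.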

In [15]:
%env PRETRAIN_JOB_NAME=bert-pretrain-data
! arena submit tf --name=$PRETRAIN_JOB_NAME \
                --gpus=1 \
                --workers=1 \
                --data=training-data:/training \
                --image=tensorflow/tensorflow:1.11.0-gpu-py3 \
                "python /training/models/bert/run_pretraining.py \
                --input_file=/training/output/bert/tf_examples.tfrecord \
                --output_dir=/training/output/bert/pretraining_output \
                --do_train=True \
                --do_eval=True \
                --bert_config_file=/training/dataset/bert/uncased_L-12_H-768_A-12/bert_config.json \
                --train_batch_size=16 \
                --max_seq_length=256 \
                --max_predictions_per_seq=39 \
                --num_train_steps=8000 \
                --num_warmup_steps=10 \
                --learning_rate=2e-5 \
                --save_checkpoints_steps=4000"

env: PRETRAIN_JOB_NAME=bert-pretrain-data
configmap/bert-pretrain-data-tfjob created
configmap/bert-pretrain-data-tfjob labeled
tfjob.kubeflow.org/bert-pretrain-data created
[36mINFO[0m[0003] The Job bert-pretrain-data has been submitted successfully 
[36mINFO[0m[0003] You can run `arena get bert-pretrain-data --type tfjob` to check the job status 


10. View the GPU usage of the running training job in real time

In [16]:
! arena top job $PRETRAIN_JOB_NAME

INSTANCE NAME               GPU(Device Index)  GPU(Duty Cycle)  GPU(Memory MiB)           STATUS   NODE
bert-pretrain-data-chief-0  0                  100%             15519.0MiB / 16276.2MiB   Running  192.168.0.210
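
The memory column can be turned into a utilization ratio with a few lines of Python. This sketch assumes the `used MiB / total MiB` format shown above; `memory_utilization` is a hypothetical helper.

```python
import re


def memory_utilization(mem_field):
    """Return used/total for a field such as '15519.0MiB / 16276.2MiB'."""
    used, total = (float(x) for x in re.findall(r"([\d.]+)MiB", mem_field))
    return used / total


print(round(memory_utilization("15519.0MiB / 16276.2MiB"), 3))  # prints 0.953
```

Note that TensorFlow 1.x reserves most of the GPU memory by default, so a high ratio here does not necessarily mean the model needs all of it.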


11. Check the status and instances of the pretraining job. In this example, we launched a single training instance.

In [17]:
! arena get $PRETRAIN_JOB_NAME

STATUS: RUNNING
NAMESPACE: default
TRAINING DURATION: 3m

NAME                STATUS   TRAINER  AGE  INSTANCE                    NODE
bert-pretrain-data  RUNNING  TFJOB    3m   bert-pretrain-data-chief-0  192.168.0.210


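12. View the training logs of the pretraining job. After a while, log lines containing `tensorflow:examples/sec` appear, which indicate that training has started and report the training speed.
Because BERT pretraining takes a very long time, you can add the `-f` flag to follow the logs in real time.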

In [18]:
! arena logs --tail=50 $PRETRAIN_JOB_NAME

2019-03-01T03:24:31.335467491Z INFO:tensorflow:  name = bert/encoder/layer_11/attention/self/query/kernel:0, shape = (768, 768)
2019-03-01T03:24:31.335472861Z INFO:tensorflow:  name = bert/encoder/layer_11/attention/self/query/bias:0, shape = (768,)
2019-03-01T03:24:31.335629243Z INFO:tensorflow:  name = bert/encoder/layer_11/attention/self/key/kernel:0, shape = (768, 768)
2019-03-01T03:24:31.335635168Z INFO:tensorflow:  name = bert/encoder/layer_11/attention/self/key/bias:0, shape = (768,)
2019-03-01T03:24:31.335790291Z INFO:tensorflow:  name = bert/encoder/layer_11/attention/self/value/kernel:0, shape = (768, 768)
2019-03-01T03:24:31.335795756Z INFO:tensorflow:  name = bert/encoder/layer_11/attention/self/value/bias:0, shape = (768,)
2019-03-01T03:24:31.335952831Z INFO:tensorflow:  name = bert/encoder/layer_11/attention/output/dense/kernel:0, shape = (768, 768)
2019-03-01T03:24:31.335958202Z INFO:tensorflow:  name = bert/encoder/layer_11/attention/output/dense/bias:0, shape = (768,)
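
Because the log is dominated by variable listings like the ones above, it can be convenient to keep only the throughput lines. The sketch below filters on the `examples/sec` marker mentioned in step 12. The second sample line is a made-up placeholder rather than real output, since the exact format of that log line is not shown in this notebook.

```python
def throughput_lines(log_text, marker="examples/sec"):
    """Return only the log lines that contain the throughput marker."""
    return [line for line in log_text.splitlines() if marker in line]


sample_log = "\n".join([
    # A real line from the output above (no throughput information).
    "INFO:tensorflow:  name = bert/encoder/layer_11/attention/self/query/bias:0, shape = (768,)",
    # Placeholder line, NOT real output: it only carries the marker.
    "INFO:tensorflow:examples/sec: <value>",
])
print(throughput_lines(sample_log))
```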


13. After training completes, we can delete the finished jobs to clean up the environment.

In [13]:
! arena delete $PRETRAIN_JOB_NAME
! arena delete $PRETRAIN_DATA_JOB_NAME

service "tf-distributed-mnist-tensorboard" deleted
deployment.extensions "tf-distributed-mnist-tensorboard" deleted
tfjob.kubeflow.org "tf-distributed-mnist" deleted
configmap "tf-distributed-mnist-tfjob" deleted
[36mINFO[0m[0004] The Job tf-distributed-mnist has been deleted successfully 


Congratulations! You have successfully run a training job with `arena`.

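To summarize, this demonstration showed how to submit a BERT pretraining job, which involves the following steps:
1. Prepare the training code and data, and place them in the data volume
2. Submit the data-processing job required for pretraining
3. Submit the pretraining job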
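
Both submissions in this walkthrough share the same shape, so they can be assembled from one helper. The following is a sketch rather than an official Arena client: `build_arena_submit` is a hypothetical function, and it uses only the flags that appear in the commands above.

```python
import shlex


def build_arena_submit(name, script, script_flags,
                       data="training-data:/training",
                       image="tensorflow/tensorflow:1.11.0-gpu-py3",
                       gpus=1, workers=1):
    """Return the argument list for an `arena submit tf` call."""
    # The training command is passed to arena as a single trailing argument.
    command = " ".join(
        ["python3", script] + ["--%s=%s" % kv for kv in script_flags.items()]
    )
    return [
        "arena", "submit", "tf",
        "--name=%s" % name,
        "--workers=%d" % workers,
        "--gpus=%d" % gpus,
        "--data=%s" % data,
        "--image=%s" % image,
        command,
    ]


args = build_arena_submit(
    "bert-create-pretrain-data",
    "/training/models/bert/create_pretraining_data.py",
    {"max_seq_length": 256, "max_predictions_per_seq": 39},
)
# Print a shell-safe version of the command for inspection.
print(" ".join(shlex.quote(a) for a in args))
```

The resulting list can be handed to `subprocess.run(args)` on a machine where `arena` is installed. The flag dictionary here is shortened for readability, so a real run needs the full set of script flags from steps 6 and 9.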