Autoscaling Experiment. #399

Closed

helinwang opened this issue Oct 18, 2017 · 39 comments
@helinwang
Collaborator

No description provided.

@Yancey1989
Collaborator

Yancey1989 commented Oct 24, 2017

@gongweibao
Collaborator

@Yancey1989
Collaborator

Yancey1989 commented Oct 26, 2017

Cluster Resources

CPU: 2348 cores
GPU: 0 GPU cards

NOTE: The performance of the HDFS VFS mount in the internal CPU cluster is not stable (sometimes ls takes 10+ seconds), which makes the results differ greatly between experiment runs. So I made this report using sleep 180 instead of python train.py train as the baseline.

TestCase1

Submit The Fault-tolerant Jobs

> paddlecloud submit -jobname $JOBNAME \
        -cpu 10 \
        -gpu 0 \
        -memory 8Gi \
        -parallelism 30 \
        -pscpu 6 \
        -pservers 10 \
        -psmemory 5Gi \
        -entry $ENTRY \
        -faulttolerant \
        ./mnist
  • AUTO_SCALING ON
> AUTO_SCALING=ON JOB_COUNT=8 ./run.sh start case1
| PASS | AVG RUNNING TIME | AVG PENDING TIME | JOB RUNNING TIME                | AVG CLUSTER CPU UTILS |
| ---- | ---------------- | ---------------- | ------------------------------- | --------------------- |
| 0    | 306              | 43               | 193,378,371,366,383,378,191,193 | 62.03                 |
| AVG  | 306              | 43               | N/A                             | 62.03                 |
  • AUTO_SCALING OFF
> JOB_COUNT=8 ./run.sh start case1
| PASS | AVG RUNNING TIME | AVG PENDING TIME | JOB RUNNING TIME                | AVG CLUSTER CPU UTILS |
| ---- | ---------------- | ---------------- | ------------------------------- | --------------------- |
| 0    | 250              | 64               | 420,191,194,193,377,190,220,222 | 61.49                 |
| AVG  | 250              | 64               | N/A                             | 61.49                 |

TestCase2

  • AUTO_SCALING ON

AUTO_SCALING=ON JOB_COUNT=3 ./run.sh start case2

TIME | NGINX PODS | RUNNING TRAINERS | CLUSTER CPU UTILS
1 144 0 18.40
4 200 15 31.94
8 262 15 39.86
12 335 45 61.97
17 400 52 76.66
53 391 52 76.83
56 302 52 65.46
60 238 52 57.28
63 210 52 53.71
67 202 52 55.75
70 200 52 55.49
83 191 52 54.34
86 121 52 45.40
89 110 52 43.99
92 102 52 42.97
95 100 52 42.72
116 180 52 52.94
120 199 52 55.37
127 200 52 55.49
141 200 61 59.33
148 268 61 68.02
152 339 61 77.09
156 385 61 82.96
205 385 49 77.85
209 385 41 74.45
213 385 46 76.58
217 390 46 77.21
221 400 46 78.49
260 400 24 69.12
264 400 38 75.09
327 400 29 71.25
395 400 24 69.12
399 400 14 64.86
450 400 0 58.90
| PASS | AVG RUNNING TIME | AVG PENDING TIME | JOB RUNNING TIME | AVG CLUSTER CPU UTILS |
| ---- | ---------------- | ---------------- | ---------------- | --------------------- |
| 0    | 341              | 2                | 205,438,382      | 67.55                 |
| AVG  | 341              | 2                | N/A              | 67.55                 |
  • AUTO_SCALING OFF

AUTO_SCALING=OFF JOB_COUNT=3 ./run.sh start case2

TIME | NGINX PODS | RUNNING TRAINERS | CLUSTER CPU UTILS
1 140 0 17.89
4 203 15 32.33
7 267 15 40.50
11 328 30 54.68
14 393 32 63.84
18 400 45 78.07
54 334 45 69.63
57 264 45 60.69
61 207 45 53.41
64 202 45 52.77
67 201 45 52.64
70 200 45 52.51
83 191 45 51.36
86 112 45 41.27
88 103 45 40.12
94 100 45 39.74
114 137 45 44.46
117 199 45 52.39
120 200 45 52.51
147 251 45 59.03
152 339 45 70.27
156 400 45 78.07
204 400 38 75.09
208 400 1 59.33
212 400 0 58.90
| PASS | AVG RUNNING TIME | AVG PENDING TIME | JOB RUNNING TIME | AVG CLUSTER CPU UTILS |
| ---- | ---------------- | ---------------- | ---------------- | --------------------- |
| 0    | 199              | 2                | 204,197,198      | 59.65                 |
| AVG  | 199              | 2                | N/A              | 59.65                 |

@helinwang
Collaborator Author

helinwang commented Oct 26, 2017

Great!!!

  1. Maybe we can change the "WAITING" in 0 mnist0 WAITING 5.60 0 0 to "NOT EXISTS" or "N/A"? To me, "waiting" sounds more like "pending", which we already have.

  2. Can you put a link to the trainer Python file in the experiment doc and keep the Python file up to date, so everyone can be on the same page?

  3. The experiment was only run once; we need to run it multiple times (e.g., 10 - 30 times) to make the measurements statistically sound. We probably need a script to run the experiment repeatedly (see the sketch after this list), since doing it manually takes too much effort. Also, the number of jobs running on a cluster with 2348 cores would probably be more than 2, so maybe each run should have more than 2 jobs (e.g., 5 - 20 jobs).

  4. Maybe after fixing all the bugs related to the experiment, the final experiment could be longer (e.g., 5 - 30 mins, by adding more passes); a typical machine learning task takes more than 220s :) .
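
For item 3, here is a rough sketch of such a driver in Python (illustrative only: the command line is copied from the case-1 example above and may need adapting, and the real metrics would still come from the experiment's own report; this only records wall-clock time per pass):

# Hypothetical driver: repeat the experiment several times and summarize wall-clock time.
import subprocess
import time

RUNS = 10
CMD = "AUTO_SCALING=ON JOB_COUNT=8 ./run.sh start case1"  # assumed command line

durations = []
for i in range(RUNS):
    start = time.time()
    subprocess.check_call(CMD, shell=True)   # run one experiment pass
    durations.append(time.time() - start)
    print "run %d took %.1f seconds" % (i, durations[-1])

mean = sum(durations) / len(durations)
std = (sum((d - mean) ** 2 for d in durations) / len(durations)) ** 0.5
print "runs: %d, mean: %.1fs, stddev: %.1fs" % (RUNS, mean, std)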

@Yancey1989
Collaborator

@helinwang I pushed a PR (#447) to reproduce the experiment. I think it fixes 1. and 2. of your comment above.

@helinwang
Collaborator Author

helinwang commented Oct 27, 2017

Thanks! @Yancey1989 For fault-tolerant mode, I get the error log below; it seems that the ETCD_IP environment variable is not set in fault-tolerant mode (with the newest paddlecloud built from the code in PR #447):

$ ./control_case1.sh start 1 ON
$ kc logs -f mnist0-trainer-w98q9
label selector: paddle-job-master=mnist0, desired: 1
Starting training job:  ..., num_gradient_servers: 2, trainer_id:  1, version: 
I1027 22:58:52.600109    28 Util.cpp:166] commandline:  --num_gradient_servers=2 --ports_num_for_sparse=1 --use_gpu=0 --trainer_id=1 --trainer_count=10 --num_passes=1 --ports_num=1 --port=7164 
[INFO 2017-10-27 22:58:52,696 layers.py:2556] output for __conv_pool_0___conv: c = 20, h = 24, w = 24, size = 11520
[INFO 2017-10-27 22:58:52,697 layers.py:2684] output for __conv_pool_0___pool: c = 20, h = 12, w = 12, size = 2880
[INFO 2017-10-27 22:58:52,698 layers.py:2556] output for __conv_pool_1___conv: c = 50, h = 8, w = 8, size = 3200
[INFO 2017-10-27 22:58:52,699 layers.py:2684] output for __conv_pool_1___pool: c = 50, h = 4, w = 4, size = 800
I1027 22:58:52.706857    28 GradientMachine.cpp:94] Initing parameters..
I1027 22:58:52.708870    28 GradientMachine.cpp:101] Init parameters done.
t=2017-10-27T22:58:57+0000 lvl=eror msg="Init etcd connection failed" error="dial tcp: lookup None on 11.1.0.10:53: no such host" stack="[github.com/PaddlePaddle/Paddle/go/pserver/client/etcd_client.go:145 c/cclient.go:129 _obj/_cgo_gotypes.go:109]"
t=2017-10-27T22:59:07+0000 lvl=eror msg="Init etcd connection failed" error="dial tcp: lookup None on 11.1.0.10:53: no such host" stack="[github.com/PaddlePaddle/Paddle/go/pserver/client/etcd_client.go:145 c/cclient.go:129 _obj/_cgo_gotypes.go:109]"

EDIT:

Now I no longer get the above error, but sometimes I get the following (also in fault-tolerant mode):

$ paddlecloud logs mnist0
==========================mnist0-trainer-7nz62==========================
[INFO 2017-10-27 23:53:57,573 layers.py:2556] output for __conv_pool_0___conv: c = 20, h = 24, w = 24, size = 11520
[INFO 2017-10-27 23:53:57,574 layers.py:2684] output for __conv_pool_0___pool: c = 20, h = 12, w = 12, size = 2880
[INFO 2017-10-27 23:53:57,575 layers.py:2556] output for __conv_pool_1___conv: c = 50, h = 8, w = 8, size = 3200
[INFO 2017-10-27 23:53:57,575 layers.py:2684] output for __conv_pool_1___pool: c = 50, h = 4, w = 4, size = 800
I1027 23:53:57.583065    29 GradientMachine.cpp:94] Initing parameters..
I1027 23:53:57.585075    29 GradientMachine.cpp:101] Init parameters done.
I1027 23:53:57.776031    29 NewRemoteParameterUpdater.cpp:130] NewRemoteParameterUpdater initialized
job returned 0...setting pod return message...
===============================
termination log wroted...
==========================mnist0-trainer-l21dr==========================
adam_beta1: 0.9
adam_beta2: 0.999
adam_epsilon: 1e-08
learning_rate_args: ""
async_lagged_grad_discard_ratio: 1.5
I1027 23:53:57.313676    28 NewRemoteParameterUpdater.cpp:125] paddle_begin_init_params done
I1027 23:53:57.313689    28 NewRemoteParameterUpdater.cpp:130] NewRemoteParameterUpdater initialized
job returned 0...setting pod return message...
===============================
termination log wroted...

I had to apply the following fix to make it train (but it no longer uses recordio):

--- a/doc/autoscale/experiment/mnist/train_ft.py
+++ b/doc/autoscale/experiment/mnist/train_ft.py
@@ -2,6 +2,7 @@ from PIL import Image
 import numpy as np
 import paddle.v2 as paddle
 import paddle.v2.dataset.common as common
+import paddle.v2.dataset.mnist as mnist
 import os
 import sys
 import glob
@@ -138,20 +139,14 @@ def main():
             if event.batch_id % 100 == 0:
                 print "Pass %d, Batch %d, Cost %f, %s" % (
                     event.pass_id, event.batch_id, event.cost, event.metrics)
-        if isinstance(event, paddle.event.EndPass):
-            result = trainer.test(
-                    reader=paddle.batch(
-                    cluster_reader_recordio(TRAINER_ID, TRAINER_COUNT, "test"),
-                    batch_size=2))
-            print "Test with Pass %d, Cost %f, %s\n" % (
-                event.pass_id, result.cost, result.metrics)
 
     trainer.train(
         reader=paddle.batch(
-            cluster_reader_recordio(TRAINER_ID, TRAINER_COUNT, "train"),
+            #cluster_reader_recordio(TRAINER_ID, TRAINER_COUNT, "train"),
+            mnist.train(),

Btw, the current fault-tolerant code does not use the master server. Maybe we can upload the recordio files to a public folder and use the master server in fault-tolerant training.

@helinwang
Collaborator Author

For documentation purposes, in case anyone else runs into this problem:

For a non-fault-tolerant job (./control_case1.sh start 1 OFF), I had to comment out the line below to make it start training reliably:

     elif sys.argv[1] == "train":
-        prepare_dataset()
+        #prepare_dataset()
         main()

And to make the trainer run longer, we need to do:

-        num_passes=1)
+        num_passes=100)

@Yancey1989
Collaborator

Yancey1989 commented Oct 28, 2017

Hi @helinwang @typhoonzero, I have updated PR #447, please review the code :) The updates are as follows:

  • Fix train_ft.py to fetch records from the master.
  • Support PASSES and JOB_COUNT arguments so the experiment can run for multiple passes (experiment passes, as distinct from training passes), for example:
    > PASSES=5 JOB_COUNT=10 ./control_case1.sh start
    The above command runs the experiment 5 times and submits 10 jobs each time.
  • Generate an experiment report after all passes are finished.

@Yancey1989
Collaborator

Remaining problems:

  1. Scale down does not work: Auto-scaling controller does not scale-down the Job #450.
  2. The experiment results are unstable because of the unstable HDFS mount.

@jacquesqiao
Member

Maybe we can add definitions for AVG_RUNNINT_TIME, AVG_PENDING_TIME, and JOB_RUNNING_TIME.
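
For example, plausible definitions (assuming per-job submit, start, and end timestamps are recorded; these are my assumptions, not necessarily the exact formulas used by the report script) could be written as:

% For jobs i = 1..N with submit time t_i^{sub}, start time t_i^{start}, and end time t_i^{end}:
\mathrm{JOB\_RUNNING\_TIME}_i = t_i^{end} - t_i^{start}
\mathrm{AVG\_RUNNING\_TIME} = \frac{1}{N} \sum_{i=1}^{N} \bigl( t_i^{end} - t_i^{start} \bigr)
\mathrm{AVG\_PENDING\_TIME} = \frac{1}{N} \sum_{i=1}^{N} \bigl( t_i^{start} - t_i^{sub} \bigr)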

@helinwang
Collaborator Author

helinwang commented Oct 30, 2017

Scale down does not work

@Yancey1989 I sent a PR to improve scale down: #456
(pushed to Docker Hub: helinwang/training-job-controller). But I could not test it because I got the error below after starting the autoscaler; it seems to be an RBAC-related issue, do you have any idea?

E1030 21:10:27.490697 1 reflector.go:201] github.com/PaddlePaddle/cloud/go/controller/controller.go:100 : Failed to list *api.TrainingJob: User "system:serviceaccount:helinwang-baidu-com:default" cannot list trainingjobs.paddlepaddle.org at the cluster scope. (get TrainingJobs.paddlepaddle.org)

controller.yaml:

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: training-job-controller
spec:
  replicas: 1
  template:
    metadata:
      labels:
        name: training-job-controller
    spec:
      containers:
      - name: training-job-controller
        image: helinwang/training-job-controller
        imagePullPolicy: Always
        command: ["/controller", "-logtostderr", "-log_level", "debug"]

The result of the experiment is unstable because of the unstable HDFS mount.

Maybe we can overcome this problem by running more experiment passes? Currently we run 5 experiment passes, and in each experiment pass 20 passes of the MNIST data are trained. Maybe we can reduce the 20 to a smaller number, such as 3, so we can increase the 5 experiment passes to maybe 20 experiment passes.

Another way is to run pods that copy the data to every node, and have the trainer and master use the host mount...


Another issue: the current train.py does not use cloud_reader (with the master server); we may want to use it. Otherwise test case 1 varies in more than just "autoscaling ON/OFF".
Edit: oh, my mistake, maybe we are no longer using train.py?

@Yancey1989
Collaborator

Yancey1989 commented Oct 31, 2017

I sent a PR to improve scale down: #456. But could not test it because I got this error after starting the autoscaler, seems to be an RBAC-related issue, do you have any idea?

The controller needs a ClusterRoleBinding to bind it to the ClusterRole (admin). In the internal CPU cluster, please submit it to the paddlecloud namespace.

Another way is to run pods that copy the data to every node and the trainer and master uses the host mount...

As discussed on Hi, producing the dataset in the Docker image is an easy way...

Edit: oh my mistake, maybe we are no longer using train.py?

Sure, the experiment does not use train.py; I will remove it.

@helinwang
Collaborator Author

helinwang commented Oct 31, 2017

Here is one run for test case 1 today:

PASSES=5 JOB_COUNT=1 FAULT_TOLERANT=ON ./control_case1.sh start

| PASS_NUM | AVG_RUNNINT_TIME | AVG_PENDING_TIME | JOB_RUNNING_TIME | CPU_UTILS |
| -------- | ---------------- | ---------------- | ---------------- | --------- |
| 0        | 2360             | 5                | 2360             | 13.69     |
| 1        | 320              | 5                | 320              | 14.54     |
| 2        | 1630             | 0                | 1630             | 13.91     |
| 3        | 4570             | 0                | 4570             | 14.80     |
| 4        | 5870             | 0                | 5870             | 14.94     |
| TOTALLY  | 2950             | 2                | N/A              | 14.38     |

@typhoonzero
Collaborator

A few questions:

case1: with autoscaling enabled, the average job running time got longer, the overall cluster CPU utilization did not rise noticeably, and it seems it never reached a very high level either. Is that because each pod requests a relatively large amount of CPU?

case2: with autoscaling disabled, in the last two rows the number of trainers also becomes very small. Is that because the jobs have finished? At that point, does Nginx have a large number of pending pods?

@helinwang
Collaborator Author

helinwang commented Oct 31, 2017

Same here, some questions:

case1: why is there pending time when autoscaling=ON?

case2: why does it take 450s when autoscaling=ON, but only 212s when it is OFF?

Also, since the total time in case2 is fixed (e.g., both 450s), should we consider adding a total trainer running time metric (the integral of the trainer count over time; the higher the better)?
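
A sketch of that metric, assuming the time series records the running-trainer count n(t_k) at sample times t_k:

\text{total trainer running time} = \int_{0}^{T} n(t)\, dt \;\approx\; \sum_{k} n(t_k)\,\bigl(t_{k+1} - t_k\bigr)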

@putcn

putcn commented Oct 31, 2017

Maybe AVG RUNNING TIME and AVG CPU UTIL cannot really reflect the overall performance improvement, because taking an average is very one-dimensional: adding a few values together and dividing by the count cannot represent the characteristics of the whole picture.
For case 1, could we also collect time-series CPU utilization for each pod and for the whole cluster, so that we can generate a figure like the following:
image
The upper and lower charts correspond to autoscaling on and off, respectively. The x-axis is time and the y-axis is CPU utilization.
Every line except the red one is the CPU utilization curve of a single pod; from these we can see how each pod was scheduled, which would show that with autoscaling on the pending times are very short.
The red curve represents the overall CPU utilization. Integrating it over time, i.e., computing the area enclosed by the red curve and the x-axis, gives a CPU utilization score. The higher the score, the more work the CPUs did; and for the same amount of work, the shorter the time, the higher the efficiency.
Not sure whether I explained this clearly...
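
A minimal sketch of how that CPU utilization score could be computed from such time-series samples (illustrative only; it assumes rows of (timestamp in seconds, overall CPU util in percent), not any specific log format):

# Compute the CPU utilization score as the area under the overall cluster
# CPU utilization curve, using the trapezoidal rule.
def cpu_util_score(samples):
    """samples: list of (t, util) tuples sorted by t."""
    score = 0.0
    for (t0, u0), (t1, u1) in zip(samples, samples[1:]):
        score += (u0 + u1) / 2.0 * (t1 - t0)  # trapezoid area of this interval
    return score

if __name__ == "__main__":
    demo = [(0, 20.0), (60, 55.0), (120, 80.0), (180, 78.0), (240, 40.0)]
    print "CPU utilization score: %.1f (percent * seconds)" % cpu_util_score(demo)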

@typhoonzero
Collaborator

@putcn this approach is very good, and intuitive enough. But it probably requires @Yancey1989 to also record the detailed numbers during each run.

@helinwang
Collaborator Author

helinwang commented Nov 1, 2017

@typhoonzero @Yancey1989 I modified the program so that it can output per-second data, and pushed it to @Yancey1989's PR: https://github.com/PaddlePaddle/cloud/pull/453/commits

@Yancey1989
Collaborator

Cool!! Thanks @helinwang @putcn

@Yancey1989
Collaborator

Yancey1989 commented Nov 1, 2017

Update PR #452

  1. There is no need to run python main.py print_info; the time-series data will be generated at ./out/mnist-case[1|2]-pass[0-9].log.
  2. The averaged data will be generated at ./out/mnist-case[1|2]-result.csv.
  3. Package the dataset and recordio files in the Docker image registry.baidu.com/paddlepaddle/paddlecloud-job:mnist.
  4. Added two columns to the time-series data: running trainers per job and CPU utils per job. @putcn could you help add these two columns to the figure?

BUG

  1. Also found a bug where the auto-scaling job hangs after it has been scaled up and then scaled down: Can not fetch new task after some trainers have been scaled down Paddle#5279

@putcn

putcn commented Nov 1, 2017

sure, will do

@helinwang
Collaborator Author

helinwang commented Nov 1, 2017

Updated on #453

  1. Changed case 2 so that it runs for around 10 min in each experiment, no matter how many jobs have been submitted. The experiment stops after Nginx is scaled back to 400.

  2. No longer rm ./out every time; each test will have a different output folder depending on the configuration.
    So now we can run multiple experiments in a loop:

    $ for i in `seq 1 2`; do echo pass $i; TAG=round_$i JOB_COUNT=5 ./run.sh start case2; done
    pass 1
    outputing output to folder: ./out/mnist-OFF-5-1-ON-400-case_case2-round_1
    
  3. Changed the master's default timeout duration from 20 min to 16 s, and chunks per task from 10 to 1. Pushed the image to registry.baidu.com/paddlepaddle/paddlecloud-job:mnist.

@helinwang
Collaborator Author

helinwang commented Nov 2, 2017

Remaining problem:

  1. The max cluster CPU utilization is capped at around 85%; it would be great if we could explain to readers why this happens.

  2. In case 1 there are many pending jobs. Maybe we need to be more aggressive about scaling down: now we only scale down when utilization reaches 100%, maybe we should scale down when it reaches 90% or 95% (a rough sketch follows below).
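
An illustrative sketch of the threshold idea (this is not the controller's actual Go code; the function, names, and numbers below are assumptions):

# Decide whether running jobs should give resources back so pending jobs can
# be scheduled, based on a target load threshold instead of 100%.
def should_scale_down(requested_cpu, total_cpu, pending_jobs, max_load_desired=0.95):
    load = float(requested_cpu) / total_cpu
    return pending_jobs > 0 and load >= max_load_desired

if __name__ == "__main__":
    # A 96%-loaded cluster never frees CPU for pending jobs with a 100% threshold,
    # but does with a 95% threshold.
    print should_scale_down(2254, 2348, pending_jobs=3, max_load_desired=1.00)  # False
    print should_scale_down(2254, 2348, pending_jobs=3, max_load_desired=0.95)  # True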

@helinwang
Collaborator Author

helinwang commented Nov 2, 2017

Test case 2 graphs:

The most important graphs are: # of Nginx pods, # of trainers, cluster CPU utils.

TODO: the trainers currently finish too early; we need to increase the number of training passes.

Autoscaling ON:
auto_on_0

Autoscaling OFF:
auto_off_0

@Yancey1989
Collaborator

TODO: the trainers currently finish too early; we need to increase the number of training passes.

It has now been changed to run 30 passes.

Test case 1 graphs:

Autoscaling OFF
case1-mnist-off-20-10-on-400-yx

Autoscaling ON
case1-mnist-on-20-1-on-400-yx

Test case 2 graphs:

Autoscaling OFF
case2-mnist-off-6-1-on-400-yx

Autoscaling ON
case2-mnist-on-12-1-on-400-yx

@Yancey1989
Collaborator

Hi @helinwang

The max cluster CPU utilization is capped at around 85%; it would be great if we could explain to readers why this happens.

I think the reason is the same as #465: there is a Calico pod requesting 250m CPU running on each node, and the trainer pods each request 5 CPUs in the experiment, so some CPU is always idle.
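
A rough back-of-the-envelope illustration (the 32-core node size is a made-up assumption, only to show the mechanism):

# Hypothetical node: 32 cores, one Calico pod requesting 0.25 CPU,
# trainer pods requesting 5 CPU each.
node_cpu = 32.0
calico_request = 0.25
trainer_request = 5.0

schedulable = node_cpu - calico_request             # 31.75 CPU left for trainers
trainers = int(schedulable // trainer_request)      # 6 trainers fit on this node
used = calico_request + trainers * trainer_request  # 30.25 CPU requested in total
print "max request ratio per node: %.1f%%" % (100.0 * used / node_cpu)  # ~94.5%

With other daemon pods and the pserver/master pods in the mix, the achievable ceiling drops further, which could explain the observed ~85% cap.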

In case 1 there are many pending jobs. Maybe we need to be more aggressive about scaling down: now we only scale down when utilization reaches 100%, maybe we should scale down when it reaches 90% or 95%.

I submitted a PR to fix this problem: #467

@Yancey1989
Collaborator

Update #453

  1. Run case1 for 10 minutes in each experiment, the same as case2.
  2. Upload the logs to the ./experiment/result folder.
  3. Update the controller Docker image to registry.baidu.com/paddlepaddle/controller:yx2, built from Fix scale up with no assigned node #467

@helinwang
Collaborator Author

helinwang commented Nov 2, 2017

Update:

  1. A convenient way to debug the controller is to start it locally: ../../../go/cmd/controller/controller -kubeconfig ~/.kube/config | tee clog.txt
  2. I started the controller locally and killed the one in the cluster; if it is needed again it has to be restarted (I was not sure which YAML it was started with, so I did not restart it).
  3. Commits pushed to Fix scale up with no assigned node #467:
    • Print less log output, fix a unit test
    • Increase max_load_desired from 0.9 to 0.97
    • Scale all jobs by target rather than by diff, fix TrainerJob possibly being nil
    • Fix a crash, add logs
  4. Merged Fix scale up with no assigned node #467 into Add testcase2 scripts #453, pushed to Add testcase2 scripts #453
  5. Built the latest controller from Add testcase2 scripts #453, pushed to registry.baidu.com/paddlepaddle/controller:yx2

@helinwang
Collaborator Author

helinwang commented Nov 2, 2017

Problems:

  1. Not very related to the experiment, but kubectl get events sometimes shows:

    2m 2m 1 mnist4-trainer-xlqk1 Pod Warning FailedMount kubelet, yq01-jpaas-paddle01-wrk18.yq01.baidu.com Unable to mount volumes for pod "mnist4-trainer-xlqk1_helinwang-baidu-com(e8f619c6-c024-11e7-aa74-6c92bf4727a8)": timeout expired waiting for volumes to attach/mount for pod "helinwang-baidu-com"/"mnist4-trainer-xlqk1". list of unattached/unmounted volumes=[public mulan default-token-hkc45]


wuyi added:

This may be due to the job still mounting HDFS into the trainers? @Yancey1989


The job always mounts the hostPath by default; it's a global configuration...

From yanxu

@helinwang
Collaborator Author

TODO:

  1. Let's always use the latest Add testcase2 scripts #453 for the experiment, as it contains changes to the experiment scripts and train_ft.py.
  2. Let's make our experiment command lines exactly the same:
    I have found that when PASSES=2, the experiment logs for PASS1 and PASS2 differ. So let's always use PASSES=1.
    • case 1:
      TAG=round_0 AUTO_SCALING=ON PASSES=1 JOB_COUNT=20 ./run.sh start case1
      TAG=round_1 AUTO_SCALING=ON PASSES=1 JOB_COUNT=20 ./run.sh start case1
      ...
      TAG=round_0 AUTO_SCALING=OFF PASSES=1 JOB_COUNT=20 ./run.sh start case1
      TAG=round_1 AUTO_SCALING=OFF PASSES=1 JOB_COUNT=20 ./run.sh start case1
    • case 2:
      TAG=round_0 AUTO_SCALING=ON PASSES=1 JOB_COUNT=5 ./run.sh start case2
      TAG=round_1 AUTO_SCALING=ON PASSES=1 JOB_COUNT=5 ./run.sh start case2
      ...
      TAG=round_0 AUTO_SCALING=OFF PASSES=1 JOB_COUNT=5 ./run.sh start case2
      TAG=round_1 AUTO_SCALING=OFF PASSES=1 JOB_COUNT=5 ./run.sh start case2

@helinwang
Collaborator Author

helinwang commented Nov 3, 2017

Latest Graphs:
case 2:
Autoscale OFF
1

Autoscale ON
2

case 1:
Autoscale ON
3

Autoscale OFF
4

@Yancey1989
Collaborator

Yancey1989 commented Nov 3, 2017

Update #453

  1. Package train_ft.py in the Docker image, and modify run.sh to use the new image to run the job.
  2. Fix the HDFS mount timeout.
  3. Push the time-series data under the log folder.

@helinwang
Collaborator Author

helinwang commented Nov 3, 2017

Update #453

  1. Plotter: support averaging inputs sampled at different timestamps (see the sketch below).
    Usage: DATA_MAX=550 DATA_PATHS='case2-mnist-ON*/*.log' python ../python/ploter.py
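
A minimal sketch of the averaging idea (not the actual ploter.py implementation; it assumes each log is a sequence of (timestamp, value) samples):

# Average several time series whose samples fall on different timestamps,
# by interpolating each series onto a common time grid first.
import numpy as np

def average_series(series_list, step=1.0, t_max=550.0):
    """series_list: list of [(t, value), ...] lists, each sorted by t."""
    grid = np.arange(0.0, t_max + step, step)
    interpolated = []
    for samples in series_list:
        t = np.array([s[0] for s in samples], dtype=float)
        v = np.array([s[1] for s in samples], dtype=float)
        interpolated.append(np.interp(grid, t, v))  # linear interpolation onto grid
    return grid, np.mean(interpolated, axis=0)

if __name__ == "__main__":
    run_a = [(0, 10.0), (100, 60.0), (550, 40.0)]
    run_b = [(0, 20.0), (90, 50.0), (540, 45.0)]
    grid, avg = average_series([run_a, run_b])
    print "average value at t=100s: %.1f" % avg[100]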

@helinwang
Collaborator Author

helinwang commented Nov 4, 2017

Known problem:

Sometimes case 2 gets stuck at:

waiting for collector exit, generated file ./out/case2-mnist-ON-5-1-ON-400-round_3/mnist-case1-pass0.csv
waiting for collector exit, generated file ./out/case2-mnist-ON-5-1-ON-400-round_3/mnist-case1-pass0.csv
waiting for collector exit, generated file ./out/case2-mnist-ON-5-1-ON-400-round_3/mnist-case1-pass0.csv
waiting for collector exit, generated file ./out/case2-mnist-ON-5-1-ON-400-round_3/mnist-case1-pass0.csv


@helinwang
Fixed this problem: we need to kill the trainingjob first, and then kill the job.

FROM yanxu

@helinwang
Collaborator Author

helinwang commented Nov 4, 2017

We will truncate the experiment at timestamp 550 (after that the jobs are killed, so the data is meaningless).
avg pending time is not affected.
avg cpu util is affected; it is computed with the following command lines:
case1:

$ cat case1-mnist-OFF-20-1-ON-400-round_*/mnist-case1-pass0.log|awk -F, '{if ($1<=550) {a=a+$2; b=b+1}} END {print a/b}'

case2:

$ cat case2-mnist-OFF-6-1-ON-400-round_*/mnist-case2.log|awk -F, '{if ($1<=550) {a=a+$2; b=b+1}} END {print a/b}'

@Yancey1989
Collaborator

Yancey1989 commented Nov 7, 2017

Pushed a PR: #470

  1. Fix a bug where the controller rewrote the node resources when executing a DryRun.
  2. Fix the long pending time with a workaround: shrink the time interval between submitting jobs.
  3. Fix Pods remaining after deleting the job with a workaround: kubectl delete pod `kubectl get pods | grep -v Terminating | awk '{print $1}'`; this is already added to the stop function in case1.sh and case2.sh.
  4. Rerun the experiment and update the log files: out/case1-mnist-ON-20-1-ON-400-round_{0-9}, out/case1-mnist-OFF-20-1-ON-400-round_{0-3}

@helinwang
Collaborator Author

Something seems to be wrong with the autoscaler's state; I have to kill its pod between every run of case 1 to see the expected experiment result.

for i in `seq 2 9`; do TAG=round_$i AUTO_SCALING=ON PASSES=1 JOB_COUNT=20 ./run.sh start case1; kc delete po `kc get pod --namespace paddlecloud|grep training |awk  '{print $1}'` --namespace paddlecloud; sleep 10; done

@typhoonzero
Collaborator

Something seems to be wrong with the autoscaler's state; I have to kill its pod between every run of case 1 to see the expected experiment result.

@Yancey1989 is that a bug in the case 1 job submission scripts, i.e., they don't delete jobs after they finish?

@Yancey1989
Collaborator

@typhoonzero, yes, I fixed it with a workaround, but I think it's a bug in the controller or the cloud server.
