Fault tolerant job init params error #340

wanghaoshuang opened this issue Aug 23, 2017 · 5 comments

wanghaoshuang commented Aug 23, 2017

When submitting a PaddleCloud fault-tolerant job, the following error occurs:

==========================train-trainer-zm4qs==========================
label selector: paddle-job-master=train, desired: 1
running pod list:  [('Running', '***')]
label selector: paddle-job=train, desired: 1
running pod list:  [('Running', '***')]
Starting training job:  /pfs/***/home/***/jobs/train, num_gradient_servers: 1, trainer_id:  0, version:
I0823 09:08:14.278625    21 Util.cpp:166] commandline:  --num_gradient_servers=1 --ports_num_for_sparse=1 --use_gpu=1 --trainer_id=0 --trainer_count=1 --num_passes=1 --ports_num=1 --port=7164
[INFO 2017-08-23 09:08:17,608 layers.py:2479] output for __conv_pool_0___conv: c = 20, h = 24, w = 24, size = 11520
[INFO 2017-08-23 09:08:17,613 layers.py:2604] output for __conv_pool_0___pool: c = 20, h = 12, w = 12, size = 2880
[INFO 2017-08-23 09:08:17,617 layers.py:2479] output for __conv_pool_1___conv: c = 50, h = 8, w = 8, size = 3200
[INFO 2017-08-23 09:08:17,620 layers.py:2604] output for __conv_pool_1___pool: c = 50, h = 4, w = 4, size = 800
I0823 09:08:17.634304    21 GradientMachine.cpp:85] Initing parameters..
I0823 09:08:17.644143    21 GradientMachine.cpp:92] Init parameters done.
time="2017-08-23T09:08:17Z" level=info msg="Connected to etcd: http://****
"
time="2017-08-23T09:08:17Z" level=info msg="Trying to acquire lock at /init_ps/lock."
time="2017-08-23T09:08:17Z" level=info msg="Successfully acquired lock at /init_ps/lock."
time="2017-08-23T09:08:17Z" level=info msg="Trainer selected."
I0823 09:08:17.654239    21 NewRemoteParameterUpdater.cpp:68] paddle_begin_init_params start
I0823 09:08:17.654561    21 NewRemoteParameterUpdater.cpp:71] old param config: name: "___conv_pool_0___conv.w0"
size: 500
initial_mean: 0
initial_std: 0.282842712474619
initial_strategy: 0
initial_smart: false
para_id: 0
*** Aborted at 1503479297 (unix time) try "date -d @1503479297" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGSEGV (@0x1020ea00750) received by PID 21 (TID 0x7f12cb950700) from PID 245368656; stack trace: ***
    @     0x7f124834e86d runtime.sigfwd

submit.sh:

paddlecloud submit \
-jobname train \
-cpu 1 \
-gpu 1 \
-memory 3Gi \
-parallelism 1 \
-pscpu 1 \
-pservers 1 \
-psmemory 1Gi \
-passes 1 \
-faulttolerant \
-entry "python train_ft.py train" ./recognize_digits/

After updating to the latest paddle:

==========================train-trainer-107vp==========================
label selector: paddle-job-master=train, desired: 1
current cnt: 0 sleep for 5 seconds...
label selector: paddle-job=train, desired: 1
Starting training job:  /***/home/***/jobs/train, num_gradient_servers: 1, trainer_id:  0, version:
I0823 10:59:35.705291    34 Util.cpp:166] commandline:  --num_gradient_servers=1 --ports_num_for_sparse=1 --use_gpu=1 --trainer_id=0 --trainer_count=1 --num_passes=1 --ports_num=1 --port=7164
[INFO 2017-08-23 10:59:39,575 layers.py:2479] output for __conv_pool_0___conv: c = 20, h = 24, w = 24, size = 11520
[INFO 2017-08-23 10:59:39,576 layers.py:2604] output for __conv_pool_0___pool: c = 20, h = 12, w = 12, size = 2880
[INFO 2017-08-23 10:59:39,577 layers.py:2479] output for __conv_pool_1___conv: c = 50, h = 8, w = 8, size = 3200
[INFO 2017-08-23 10:59:39,577 layers.py:2604] output for __conv_pool_1___pool: c = 50, h = 4, w = 4, size = 800
I0823 10:59:39.585695    34 GradientMachine.cpp:85] Initing parameters..
I0823 10:59:39.591514    34 GradientMachine.cpp:92] Init parameters done.
time="2017-08-23T10:59:39Z" level=info msg="Connected to etcd: http://***
"
time="2017-08-23T10:59:39Z" level=info msg="Trying to acquire lock at /init_ps/lock."
time="2017-08-23T10:59:39Z" level=info msg="Successfully acquired lock at /init_ps/lock."
time="2017-08-23T10:59:39Z" level=info msg="Trainer selected."
I0823 10:59:39.620086    34 NewRemoteParameterUpdater.cpp:68] paddle_begin_init_params start
E0823 10:59:39.620120    34 NewRemoteParameterUpdater.cpp:109] got unsupported v1 learning_rate_schedule config: poly, set to const
*** Aborted at 1503485979 (unix time) try "date -d @1503485979" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGSEGV (@0x1020ea00750) received by PID 34 (TID 0x7f59c6443700) from PID 245368656; stack trace: ***
    @     0x7f5971fa886d runtime.sigfwd
@typhoonzero (Collaborator)

This is a bug when parsing optimization configs. Were any core files generated? Can you find the full call stack using gdb and the core file?
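
For reference, a minimal sketch of getting the call stack with gdb, assuming core dumps are enabled in the trainer container and the core file is simply named "core" (the actual file name and python path may differ):

ulimit -c unlimited         # make sure core dumps are written before reproducing the crash
gdb /usr/bin/python core    # the crashing process is the python trainer
(gdb) bt                    # back trace of the crashing thread
(gdb) thread apply all bt   # back traces of all threads, useful for cgo/CUDA issues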

@wanghaoshuang (Author)

This is a PaddleCloud job. How should I get the core file from PaddleCloud?

@typhoonzero (Collaborator)

One way is to download the core file and then inspect it locally in a Docker container.
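
Roughly, assuming the job's Docker image is available locally (the image name below is a placeholder) and the core file has been downloaded into the current directory:

docker run --rm -it -v $PWD:/work <paddle-gpu-image-used-by-the-job> /bin/bash
# inside the container:
apt-get update && apt-get install -y gdb   # only if gdb is not already installed
gdb /usr/bin/python /work/core             # then use bt as above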

@Yancey1989 (Collaborator)

The core file is located under /pfs/dlnel/home/<your email>/jobs/<job-name>.

@typhoonzero (Collaborator)

typhoonzero commented Aug 30, 2017

The stack trace from the core file looks like this:

#0  0x00007f6f556427fb in runtime.sched_getaffinity () at /usr/local/go/src/runtime/sys_linux_amd64.s:519
#1  0x00007f6f55a0bc82 in encoding/gob.(*Encoder).encodeArray (enc=0x7f6f55a0c518 <encoding/gob.(*Encoder).encodeInterface+472>, b=0xc4202207e0, value=..., op=
    {void (struct encoding/gob.encInstr *, struct encoding/gob.encoderState *, reflect.Value)} 0xc4200cb808, elemIndir=1, length=140116168536416, helper=
    {void (struct encoding/gob.encoderState *, reflect.Value, bool *)} 0xc4200cb820) at /usr/local/go/src/encoding/gob/encode.go:348
#2  0x000000c4200cb878 in ?? ()
#3  0x00007f6f55a0c518 in encoding/gob.(*Encoder).encodeInterface (enc=0x7f6f567a68e0, b=0xc42021c820, iv=...) at /usr/local/go/src/encoding/gob/encode.go:406
#4  0x000000c42021a740 in ?? ()
#5  0x00007f6f567a68e0 in typerel.* () from /usr/local/lib/python2.7/dist-packages/py_paddle/_swig_paddle.so
#6  0x000000c42021c820 in ?? ()
#7  0x0000000000000099 in ?? ()
#8  0x0000000000000001 in ?? ()
#9  0x00007f6f567a68e0 in typerel.* () from /usr/local/lib/python2.7/dist-packages/py_paddle/_swig_paddle.so
#10 0x000000c42021c820 in ?? ()
#11 0x0000000000000099 in ?? ()
#12 0x0000000000000000 in ?? ()

Also, this problem only reproduces reliably when running on GPU; running on CPU works fine. The core dump is inside cgo's encoding/gob.encoderState.

This looks like a fairly tricky problem; perhaps there is some conflict between the cgo runtime and CUDA?
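
For reference, one way to confirm the CPU-only behaviour is to resubmit the same job without requesting a GPU, for example (assuming -gpu 0 requests a CPU-only run; the other flags are unchanged from submit.sh above, with a new job name to avoid clashing with the existing job):

paddlecloud submit \
-jobname train-cpu \
-cpu 1 \
-gpu 0 \
-memory 3Gi \
-parallelism 1 \
-pscpu 1 \
-pservers 1 \
-psmemory 1Gi \
-passes 1 \
-faulttolerant \
-entry "python train_ft.py train" ./recognize_digits/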
