Fault tolerant job init params error #340

wanghaoshuang opened this issue Aug 23, 2017 · 5 comments

wanghaoshuang commented Aug 23, 2017

When submitting a PaddleCloud fault-tolerant job, the following error occurs:

==========================train-trainer-zm4qs==========================
label selector: paddle-job-master=train, desired: 1
running pod list:  [('Running', '***')]
label selector: paddle-job=train, desired: 1
running pod list:  [('Running', '***')]
Starting training job:  /pfs/***/home/***/jobs/train, num_gradient_servers: 1, trainer_id:  0, version:
I0823 09:08:14.278625    21 Util.cpp:166] commandline:  --num_gradient_servers=1 --ports_num_for_sparse=1 --use_gpu=1 --trainer_id=0 --trainer_count=1 --num_passes=1 --ports_num=1 --port=7164
[INFO 2017-08-23 09:08:17,608 layers.py:2479] output for __conv_pool_0___conv: c = 20, h = 24, w = 24, size = 11520
[INFO 2017-08-23 09:08:17,613 layers.py:2604] output for __conv_pool_0___pool: c = 20, h = 12, w = 12, size = 2880
[INFO 2017-08-23 09:08:17,617 layers.py:2479] output for __conv_pool_1___conv: c = 50, h = 8, w = 8, size = 3200
[INFO 2017-08-23 09:08:17,620 layers.py:2604] output for __conv_pool_1___pool: c = 50, h = 4, w = 4, size = 800
I0823 09:08:17.634304    21 GradientMachine.cpp:85] Initing parameters..
I0823 09:08:17.644143    21 GradientMachine.cpp:92] Init parameters done.
time="2017-08-23T09:08:17Z" level=info msg="Connected to etcd: http://****
"
time="2017-08-23T09:08:17Z" level=info msg="Trying to acquire lock at /init_ps/lock."
time="2017-08-23T09:08:17Z" level=info msg="Successfully acquired lock at /init_ps/lock."
time="2017-08-23T09:08:17Z" level=info msg="Trainer selected."
I0823 09:08:17.654239    21 NewRemoteParameterUpdater.cpp:68] paddle_begin_init_params start
I0823 09:08:17.654561    21 NewRemoteParameterUpdater.cpp:71] old param config: name: "___conv_pool_0___conv.w0"
size: 500
initial_mean: 0
initial_std: 0.282842712474619
initial_strategy: 0
initial_smart: false
para_id: 0
*** Aborted at 1503479297 (unix time) try "date -d @1503479297" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGSEGV (@0x1020ea00750) received by PID 21 (TID 0x7f12cb950700) from PID 245368656; stack trace: ***
    @     0x7f124834e86d runtime.sigfwd

submit.sh:

paddlecloud submit \
-jobname train \
-cpu 1 \
-gpu 1 \
-memory 3Gi \
-parallelism 1 \
-pscpu 1 \
-pservers 1 \
-psmemory 1Gi \
-passes 1 \
-faulttolerant \
-entry "python train_ft.py train" ./recognize_digits/

After updating to the latest paddle:

==========================train-trainer-107vp==========================
label selector: paddle-job-master=train, desired: 1
current cnt: 0 sleep for 5 seconds...
label selector: paddle-job=train, desired: 1
Starting training job:  /***/home/***/jobs/train, num_gradient_servers: 1, trainer_id:  0, version:
I0823 10:59:35.705291    34 Util.cpp:166] commandline:  --num_gradient_servers=1 --ports_num_for_sparse=1 --use_gpu=1 --trainer_id=0 --trainer_count=1 --num_passes=1 --ports_num=1 --port=7164
[INFO 2017-08-23 10:59:39,575 layers.py:2479] output for __conv_pool_0___conv: c = 20, h = 24, w = 24, size = 11520
[INFO 2017-08-23 10:59:39,576 layers.py:2604] output for __conv_pool_0___pool: c = 20, h = 12, w = 12, size = 2880
[INFO 2017-08-23 10:59:39,577 layers.py:2479] output for __conv_pool_1___conv: c = 50, h = 8, w = 8, size = 3200
[INFO 2017-08-23 10:59:39,577 layers.py:2604] output for __conv_pool_1___pool: c = 50, h = 4, w = 4, size = 800
I0823 10:59:39.585695    34 GradientMachine.cpp:85] Initing parameters..
I0823 10:59:39.591514    34 GradientMachine.cpp:92] Init parameters done.
time="2017-08-23T10:59:39Z" level=info msg="Connected to etcd: http://***
"
time="2017-08-23T10:59:39Z" level=info msg="Trying to acquire lock at /init_ps/lock."
time="2017-08-23T10:59:39Z" level=info msg="Successfully acquired lock at /init_ps/lock."
time="2017-08-23T10:59:39Z" level=info msg="Trainer selected."
I0823 10:59:39.620086    34 NewRemoteParameterUpdater.cpp:68] paddle_begin_init_params start
E0823 10:59:39.620120    34 NewRemoteParameterUpdater.cpp:109] got unsupported v1 learning_rate_schedule config: poly, set to const
*** Aborted at 1503485979 (unix time) try "date -d @1503485979" if you are using GNU date ***
PC: @                0x0 (unknown)
*** SIGSEGV (@0x1020ea00750) received by PID 34 (TID 0x7f59c6443700) from PID 245368656; stack trace: ***
    @     0x7f5971fa886d runtime.sigfwd
@typhoonzero (Collaborator)

This is a bug when parsing optimization configs. Were any core files generated? Can you find the full call stack using gdb and the core file?
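
For reference, a minimal sketch of getting the call stack with gdb, assuming core dumps are enabled in the trainer container and the core file is simply named "core" (the actual file name and python path may differ):

ulimit -c unlimited         # make sure core dumps are written before reproducing the crash
gdb /usr/bin/python core    # the crashing process is the python trainer
(gdb) bt                    # back trace of the crashing thread
(gdb) thread apply all bt   # back traces of all threads, useful for cgo/CUDA issues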

@wanghaoshuang (Author)

This is a PaddleCloud job. How should I get the core file from PaddleCloud?

@typhoonzero (Collaborator)

One way is to download the core file and then inspect it locally in a Docker container.
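
Roughly, assuming the job's Docker image is available locally (the image name below is a placeholder) and the core file has been downloaded into the current directory:

docker run --rm -it -v $PWD:/work <paddle-gpu-image-used-by-the-job> /bin/bash
# inside the container:
apt-get update && apt-get install -y gdb   # only if gdb is not already installed
gdb /usr/bin/python /work/core             # then use bt as above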

@Yancey1989 (Collaborator)

The core file is located under /pfs/dlnel/home/<your email>/jobs/<job-name>.

@typhoonzero (Collaborator)

typhoonzero commented Aug 30, 2017

The stack trace from the core file looks like this:

#0  0x00007f6f556427fb in runtime.sched_getaffinity () at /usr/local/go/src/runtime/sys_linux_amd64.s:519
#1  0x00007f6f55a0bc82 in encoding/gob.(*Encoder).encodeArray (enc=0x7f6f55a0c518 <encoding/gob.(*Encoder).encodeInterface+472>, b=0xc4202207e0, value=..., op=
    {void (struct encoding/gob.encInstr *, struct encoding/gob.encoderState *, reflect.Value)} 0xc4200cb808, elemIndir=1, length=140116168536416, helper=
    {void (struct encoding/gob.encoderState *, reflect.Value, bool *)} 0xc4200cb820) at /usr/local/go/src/encoding/gob/encode.go:348
#2  0x000000c4200cb878 in ?? ()
#3  0x00007f6f55a0c518 in encoding/gob.(*Encoder).encodeInterface (enc=0x7f6f567a68e0, b=0xc42021c820, iv=...) at /usr/local/go/src/encoding/gob/encode.go:406
#4  0x000000c42021a740 in ?? ()
#5  0x00007f6f567a68e0 in typerel.* () from /usr/local/lib/python2.7/dist-packages/py_paddle/_swig_paddle.so
#6  0x000000c42021c820 in ?? ()
#7  0x0000000000000099 in ?? ()
#8  0x0000000000000001 in ?? ()
#9  0x00007f6f567a68e0 in typerel.* () from /usr/local/lib/python2.7/dist-packages/py_paddle/_swig_paddle.so
#10 0x000000c42021c820 in ?? ()
#11 0x0000000000000099 in ?? ()
#12 0x0000000000000000 in ?? ()

Also, this problem only reproduces reliably when running on GPU; running on CPU works fine. The core dump is inside cgo's encoding/gob.encoderState.

This looks like a fairly tricky problem; perhaps there is some conflict between the cgo runtime and CUDA?
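
For reference, one way to confirm the CPU-only behaviour is to resubmit the same job without requesting a GPU, for example (assuming -gpu 0 requests a CPU-only run; the other flags are unchanged from submit.sh above, with a new job name to avoid clashing with the existing job):

paddlecloud submit \
-jobname train-cpu \
-cpu 1 \
-gpu 0 \
-memory 3Gi \
-parallelism 1 \
-pscpu 1 \
-pservers 1 \
-psmemory 1Gi \
-passes 1 \
-faulttolerant \
-entry "python train_ft.py train" ./recognize_digits/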
