seq2seq model sparse update on cpu cluster error #1096

Closed
autoAlien opened this issue Jan 9, 2017 · 3 comments

Comments

@autoAlien

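The seq2seq model fails when using sparse update on a CPU cluster. Screenshot of the error message:
428c975e3d1f2f8f88236a76c
Model and job submission configuration:
The submit shell script passes this parameter: --ports_num_for_sparse=1
cluster_config includes this parameter: use_remote_sparse=True,
The optimization algorithm in settings is: learning_method=AdaGradOptimizer(),
In seq2seq_net.py, the embedding_layer in the gru_encoder_decoder function is set with param_attr=ParamAttr(name='_source_language_embedding', sparse_update=True))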
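
For readers unfamiliar with this setup: marking an embedding with `sparse_update=True` and training it with AdaGrad means, roughly, that only the embedding rows touched by a batch are updated. The following is a conceptual sketch in plain Python, not PaddlePaddle's implementation. The function name `sparse_adagrad_step`, the dict-based table, the learning rate, and the epsilon are all assumptions made for illustration.

```python
import math


def sparse_adagrad_step(table, accum, row_grads, lr=0.1, eps=1e-6):
    """Apply one AdaGrad step to only the rows present in row_grads.

    table     -- dict: row id -> list of floats (embedding rows)
    accum     -- dict: row id -> list of floats (running sum of squared gradients)
    row_grads -- dict: row id -> list of floats (gradients of the touched rows)
    """
    for row_id, grad in row_grads.items():
        row = table[row_id]
        # Accumulator state is created lazily, only for rows that are touched.
        acc = accum.setdefault(row_id, [0.0] * len(row))
        for j, g in enumerate(grad):
            acc[j] += g * g
            row[j] -= lr * g / (math.sqrt(acc[j]) + eps)
    return table


# Hypothetical 3-row embedding table; the batch only touches row 1.
table = {0: [1.0, 1.0], 1: [2.0, 2.0], 2: [3.0, 3.0]}
accum = {}
sparse_adagrad_step(table, accum, {1: [0.5, -0.5]})
```

Rows 0 and 2 are left untouched, and no accumulator state is allocated for them. That is the point of a sparse update when the vocabulary is large.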

@qingqing01
Contributor

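Please don't post screenshots; paste the text directly. There is too little information here. Perhaps you could provide a link to the job.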

@autoAlien
Author

Here is the job link: http://10.73.221.41:8920/fileview.html?path=/home/normandy/maybach/225298/
The error log is:
Tue Jan 10 13:13:16 2017[1,0]:PC: @ 0x702d7a ZNSt17_Function_handlerIFvPN6paddle9ParameterEEZNS0_15TrainerInternal13trainOneBatchElRKNS0_9DataBatchEPSt6vectorINS0_8ArgumentESaIS9_EEEUlS2_E_E9_M_invokeERKSt9_Any_dataS2
Tue Jan 10 13:13:16 2017[1,0]:*** SIGSEGV (@0x0) received by PID 26948 (TID 0x7ffd27cad780) from PID 0; stack trace: ***
Tue Jan 10 13:13:16 2017[1,0]: @ 0x7ffd27887160 (unknown)
Tue Jan 10 13:13:16 2017[1,0]: @ 0x702d7a ZNSt17_Function_handlerIFvPN6paddle9ParameterEEZNS0_15TrainerInternal13trainOneBatchElRKNS0_9DataBatchEPSt6vectorINS0_8ArgumentESaIS9_EEEUlS2_E_E9_M_invokeERKSt9_Any_dataS2
Tue Jan 10 13:13:16 2017[1,0]: @ 0x7040b1 paddle::TrainerInternal::trainOneBatch()
Tue Jan 10 13:13:16 2017[1,0]: @ 0x6fe850 paddle::Trainer::trainOneDataBatch()
Tue Jan 10 13:13:16 2017[1,0]: @ 0x7015ce paddle::Trainer::trainOnePass()
Tue Jan 10 13:13:16 2017[1,0]: @ 0x7028b5 paddle::Trainer::train()
Tue Jan 10 13:13:16 2017[1,0]: @ 0x5920aa main
Tue Jan 10 13:13:16 2017[1,0]: @ 0x7ffd266acbd5 __libc_start_main
Tue Jan 10 13:13:16 2017[1,0]: @ 0x59dcf5 (unknown)
Tue Jan 10 13:13:18 2017[1,0]:./train.sh: line 207: 26948 Segmentation fault PYTHONPATH=./paddle:$PYTHONPATH GLOG_logtostderr=0 GLOG_log_dir="./log" ./paddle_trainer --num_gradient_servers=${OMPI_COMM_WORLD_SIZE} --trainer_id=${OMPI_COMM_WORLD_RANK} --pservers=$ipstring --rdma_tcp=${rdma_tcp} --nics=${nics} ${train_arg} --config=conf/trainer_config.conf --save_dir=./${save_dir} ${extern_arg}
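
When reading a log like the one above, it can help to strip the timestamp and the `[1,0]:` rank prefix so that only the stack frames remain. This is a minimal sketch, assuming every line follows the `<timestamp>[<job>,<rank>]:` layout seen in this log. The helper `strip_prefix` and the regular expression are illustrative and not part of any PaddlePaddle tooling.

```python
import re

# Matches a prefix such as "Tue Jan 10 13:13:16 2017[1,0]:" (layout assumed from the log above).
PREFIX = re.compile(r"^\w{3} \w{3} +\d+ \d{2}:\d{2}:\d{2} \d{4}\[\d+,\d+\]:")


def strip_prefix(line):
    """Remove the timestamp and rank prefix, then trim surrounding whitespace."""
    return PREFIX.sub("", line, count=1).strip()


sample = [
    "Tue Jan 10 13:13:16 2017[1,0]: @ 0x7040b1 paddle::TrainerInternal::trainOneBatch()",
    "Tue Jan 10 13:13:16 2017[1,0]: @ 0x6fe850 paddle::Trainer::trainOneDataBatch()",
]
frames = [strip_prefix(line) for line in sample]
```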

@kuke
Contributor

kuke commented Jul 28, 2017

Closed because of inactivity

@kuke kuke closed this as completed Jul 28, 2017