Pipeline parallelism error log: header_size == rhs->blob_desc().ByteSizeOfBlobHeader()
Error report
#6226
Comments
This error is strange. But since your script runs in eval mode (no loss backward and no Optimizer), the configured set_gradient_accumulation_steps and stage id actually have no effect 😂
The exact cause of the error still needs further investigation.
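Why the accumulation setting is a no-op in eval mode can be shown with a framework-agnostic sketch (plain Python with illustrative names, not OneFlow APIs): gradient accumulation splits a batch into micro-batches and assembles one gradient before the optimizer step, so with no backward pass and no optimizer there is nothing for the setting to influence.

```python
# Illustrative sketch, NOT OneFlow code: train_step/eval_step are
# hypothetical names used only to demonstrate the point.

def train_step(w, batch, accumulation_steps):
    """One update: accumulate the gradient of loss = mean(w * x) over
    micro-batches (d loss / d w = mean(x)), then apply it once."""
    grad = 0.0
    micro = len(batch) // accumulation_steps
    for i in range(accumulation_steps):
        xs = batch[i * micro:(i + 1) * micro]
        grad += (sum(xs) / len(xs)) / accumulation_steps
    return w - 0.1 * grad

def eval_step(w, batch, accumulation_steps):
    """Forward only: the accumulation setting is never consulted."""
    return sum(w * x for x in batch) / len(batch)

batch = [1.0, 2.0, 3.0, 4.0]
# In eval mode the value of accumulation_steps changes nothing observable:
assert eval_step(0.5, batch, 1) == eval_step(0.5, batch, 4)
# In training it only changes how the same gradient is assembled:
assert train_step(0.5, batch, 1) == train_step(0.5, batch, 4)
```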
```
chengcheng@oneflow-21:~/debug $ ONEFLOW_TEST_DEVICE_NUM=2 python3 -m oneflow.distributed.launch --nproc_per_node 2 test_simple.py
graph input x.shape= oneflow.Size([16, 1, 28, 28])
graph input x.shape= oneflow.Size([16, 1, 28, 28])
cclog: in PushCb, input tensor blob: header_size = 32 header_shape = (16,1,28,28) aligned header size = 64
input regst blob: header_size = 32 header_shape = (16,1,28,28): aligned header size = 64
cclog: in PushCb, input tensor blob: header_size = 32 header_shape = (16,1,28,28) aligned header size = 64
input regst blob: header_size = 32 header_shape = (16,1,28,28): aligned header size = 64
cclog: in PushCb, input tensor blob: header_size = 32 header_shape = (16,1,28,28) aligned header size = 64
input regst blob: header_size = 32 header_shape = (16,1,28,28): aligned header size = 64
cclog: in PushCb, input tensor blob: header_size = 32 header_shape = (16,1,28,28) aligned header size = 64
input regst blob: header_size = 32 header_shape = (16,1,28,28): aligned header size = 64
cclog: in PushCb, input tensor blob: header_size = 32 header_shape = (16,1,28,28) aligned header size = 64
input regst blob: header_size = 32 header_shape = (16,1,28,28): aligned header size = 64
cclog: in PushCb, input tensor blob: header_size = 32 header_shape = (16,1,28,28) aligned header size = 64
input regst blob: header_size = 32 header_shape = (16,1,28,28): aligned header size = 64
cclog: in PushCb, input tensor blob: header_size = 8 header_shape = (0,) aligned header size = 64
input regst blob: header_size = 32 header_shape = (16,1,28,28): aligned header size = 64
F0910 15:31:40.086340 3750611 blob.cpp:62] Check failed: header_size == rhs->blob_desc().ByteSizeOfBlobHeader() (32 vs. 8)
*** Check failure stack trace: ***
    @     0x7fe94e5ba353  google::LogMessage::Fail()
    @     0x7fe94e5bf0fb  google::LogMessage::SendToLog()
    @     0x7fe94e5ba04f  google::LogMessage::Flush()
    @     0x7fe94e5ba87f  google::LogMessageFatal::~LogMessageFatal()
    @     0x7fe949d44669  oneflow::Blob::CopyHeaderFrom()
    @     0x7fe949491200  _ZNSt17_Function_handlerIFvlEZNK7oneflow2vm25RunLazyJobInstructionType15MakeJobInstanceEPNS2_11InstructionEEUllE_E9_M_invokeERKSt9_Any_dataOl
    @     0x7fe949490c50  oneflow::(anonymous namespace)::LazyJobInstance::PushBlobByOpName()
    @     0x7fe949b5253d  oneflow::(anonymous namespace)::InputKernel<>::ForwardDataContent()
    @     0x7fe949b52b3c  oneflow::Kernel::Forward()
    @     0x7fe949b52bfe  oneflow::Kernel::Launch()
    @     0x7fe94912877e  oneflow::Actor::AsyncLaunchKernel()
    @     0x7fe949193d0e  oneflow::NaiveActor::Act()
    @     0x7fe94912807e  oneflow::Actor::ActUntilFail()
    @     0x7fe949128ddb  oneflow::Actor::HandlerNormal()
    @     0x7fe949d60ec7  oneflow::Thread::PollMsgChannel()
    @     0x7fe949d61317  _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7oneflow6ThreadC4ERKNS3_8StreamIdEEUlvE_EEEEE6_M_runEv
    @     0x7fe94415fde4  (unknown)
    @     0x7fe98cc0c609  start_thread
    @     0x7fe98cd48293  clone
    @              (nil)  (unknown)
Killing subprocess 3750257
Killing subprocess 3750258
Traceback (most recent call last):
  File "/usr/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/chengcheng/oneflow/python/oneflow/distributed/launch.py", line 211, in <module>
    main()
  File "/home/chengcheng/oneflow/python/oneflow/distributed/launch.py", line 199, in main
    sigkill_handler(signal.SIGTERM, None)
  File "/home/chengcheng/oneflow/python/oneflow/distributed/launch.py", line 167, in sigkill_handler
    raise subprocess.CalledProcessError(
subprocess.CalledProcessError: Command '['/usr/bin/python3', '-u', 'test_simple.py']' died with <Signals.SIGABRT: 6>.
```
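The numbers in the CHECK failure are consistent with a header layout that stores one 64-bit integer per shape dimension (a sketch of this reading, not OneFlow's actual implementation):

```python
# Illustrative sketch, NOT OneFlow code: if a blob header stores one
# int64 per shape dimension, the sizes in the log fall out directly.
INT64_BYTES = 8

def header_size(shape):
    # one int64 per dimension
    return len(shape) * INT64_BYTES

def aligned(size, alignment=64):
    # round up to the alignment boundary
    return ((size + alignment - 1) // alignment) * alignment

assert header_size((16, 1, 28, 28)) == 32  # the expected input header
assert header_size((0,)) == 8              # the bogus header in the log
assert aligned(32) == 64 and aligned(8) == 64
# CopyHeaderFrom CHECKs that the two header sizes match; 32 != 8 aborts.
```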
By adding logging to the input push callback, I found that the data occasionally comes out as all zeros... and copying the input data then triggers the error.
If I print the x tensor on the Python side at every step, this problem goes away, but then it deadlocks, apparently stuck in the push callback's buffer.
Here a job instance is sent for every input and output, regardless of whether this rank holds a shard of that input. But an op whose input has no shard on this rank never creates an input kernel to consume the job instance from the buffer, so once the buffer fills up, everything blocks.
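The deadlock pattern described above can be reproduced in miniature with a bounded queue (plain Python sketch; the names are illustrative, not OneFlow's):

```python
# Sketch: the producer pushes a job instance for every input, but no
# consumer exists for inputs without a shard on this rank, so a bounded
# buffer eventually fills and further pushes block.
import queue

buffer = queue.Queue(maxsize=2)  # small bound to make the effect visible

# The producer side pushes unconditionally until the buffer is full:
buffer.put("job-instance-1")
buffer.put("job-instance-2")

# With no kernel consuming, the next push cannot proceed. A blocking
# put() would hang forever; put_nowait() surfaces it as queue.Full.
try:
    buffer.put_nowait("job-instance-3")
    blocked = False
except queue.Full:
    blocked = True

assert blocked  # the third push is stuck: this is the deadlock
```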
Repro code: running 200 iterations triggers the error; a smaller count (e.g. 100) does not. I suspect this is related to memory allocation and recycling.
Error message: