Control Graph / Session / Env's python c++ object destruction #5845
Please add the test script as well: provide a unit test that is guaranteed to fail on master but works with this PR.
This script is enough to trigger it:

    import os
    import unittest
    import numpy as np
    import oneflow as flow

    device = flow.device("cuda")
    linear = flow.nn.Linear(3, 8, False)
    linear = linear.to(device)
    input_arr = np.array(
        [
            [-0.94630778, -0.83378579, -0.87060891],
            [2.0289922, -0.28708987, -2.18369248],
            [0.35217619, -0.67095644, -1.58943879],
            [0.08086036, -1.81075924, 1.20752494],
            [0.8901075, -0.49976737, -1.07153746],
            [-0.44872912, -1.07275683, 0.06256855],
            [-0.22556897, 0.74798368, 0.90416439],
            [0.48339456, -2.32742195, -0.59321527],
        ],
        dtype=np.float32,
    )
    np_weight = np.ones((3, 8)).astype(np.float32)
    np_weight.fill(2.3)
    x = flow.tensor(input_arr, device=device)
    flow.nn.init.constant_(linear.weight, 2.3)
    of_eager_out = linear(x)
    np_out = np.matmul(input_arr, np_weight)
    assert(np.allclose(of_eager_out.numpy(), np_out, 1e-05, 1e-05))

    class LinearGraph(flow.nn.Graph):
        def __init__(self):
            super().__init__()
            self.my_linear = linear

        def build(self, x):
            return self.my_linear(x)

    # The graph is created and called at module top level.
    linear_g = LinearGraph()
    of_lazy_out = linear_g(x)
    assert(np.array_equal(of_lazy_out.numpy(), of_eager_out.numpy()))

Output when run on master:

    chengcheng@oneflow-21:~/debug/graph $ python3 test_graph_destruct.py
    F0812 14:21:07.513065 3473074 thread.cpp:57] Check failed: id2actor_ptr_.empty()
    *** Check failure stack trace: ***
    @ 0x7f18d2038643 google::LogMessage::Fail()
    @ 0x7f18d203d3eb google::LogMessage::SendToLog()
    @ 0x7f18d203833f google::LogMessage::Flush()
    @ 0x7f18d2038b6f google::LogMessageFatal::~LogMessageFatal()
    @ 0x7f18cd8dc8c6 oneflow::Thread::PollMsgChannel()
    @ 0x7f18cd8d9e08 _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7oneflow9CpuThreadC4ElEUlvE_EEEEE6_M_runEv
    @ 0x7f19b86eede4 (unknown)
    @ 0x7f19c2cf5609 start_thread
    @ 0x7f19c2e31293 clone
    @ (nil) (unknown)
    Aborted (core dumped)
In my testing, creating the graph at module top level is enough to trigger the bug, but wrapping it in a test case exposes other bugs as well. I ran it several times: about half the time it deadlocks, with the process spinning at 100% CPU, and the other half it crashes.

    import os
    import unittest
    import numpy as np
    import oneflow as flow
    import oneflow.unittest

    device = flow.device("cuda")
    linear = flow.nn.Linear(3, 8, False)
    linear = linear.to(device)
    input_arr = np.random.randn(8, 3).astype(np.float32)
    np_weight = np.ones((3, 8)).astype(np.float32)
    np_weight.fill(2.3)
    x = flow.tensor(input_arr, device=device)
    flow.nn.init.constant_(linear.weight, 2.3)
    of_eager_out = linear(x)
    np_out = np.matmul(input_arr, np_weight)
    assert(np.allclose(of_eager_out.numpy(), np_out, 1e-05, 1e-05))

    class LinearGraph(flow.nn.Graph):
        def __init__(self):
            super().__init__()
            self.my_linear = linear

        def build(self, x):
            return self.my_linear(x)

    # The graph is still created at module top level; only the call is inside the test.
    linear_g = LinearGraph()

    @unittest.skipIf(os.getenv("ONEFLOW_TEST_CPU_ONLY"), "only test cpu cases")
    @flow.unittest.skip_unless_1n1d()
    class TestLinearGraph(oneflow.unittest.TestCase):
        def test_linear_graph_gpu(test_case):
            of_lazy_out = linear_g(x)
            assert(np.array_equal(of_lazy_out.numpy(), of_eager_out.numpy()))
            # _test_linear_graph(test_case, flow.device("cuda"))

    if __name__ == "__main__":
        unittest.main()
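For context on why an object created at module top level can behave differently from one created inside a function, here is a minimal sketch in plain Python with no OneFlow involved. The class name `Tracked` and the helper `run_demo` are hypothetical. In CPython, an object bound to a module-level name is typically finalized only during interpreter shutdown, while a function-local object is finalized as soon as the function returns. Whether this is the exact mechanism behind the failure reported above is an assumption, not something confirmed in this thread.

```python
import subprocess
import sys
import textwrap

# A tiny script run in a fresh interpreter so that shutdown-time
# finalization can be observed from the outside.
SCRIPT = textwrap.dedent(
    """
    class Tracked:
        def __init__(self, name):
            self.name = name

        def __del__(self):
            print("destroy", self.name, flush=True)

    def scoped():
        local = Tracked("local")  # finalized when scoped() returns

    top_level = Tracked("top_level")  # finalized at interpreter shutdown
    scoped()
    print("end of script", flush=True)
    """
)


def run_demo():
    """Run SCRIPT in a child interpreter and return its output lines."""
    result = subprocess.run(
        [sys.executable, "-c", SCRIPT],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    for line in run_demo():
        print(line)
```

On CPython the function-local object is reported as destroyed before the script's last statement runs, and the module-level one only afterwards, while the interpreter is already shutting down.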
…neflow-Inc/oneflow into fea/destruct_session_and_graph
…fea/destruct_session_and_graph
This test now passes.
Removed the graph's reference to Session, and removed the sync in the Python graph's `__del__`.
    m.def("TryDestroyMultiClientSessionContext", &TryDestroyMultiClientSessionContext);
    m.def("MultiClientSessionContextAddCGraph", &MultiClientSessionContextAddCGraph);
    m.def("TryDestroyMultiClientSessionContext", &TryDestroyMultiClientSessionContext,
          py::call_guard<py::gil_scoped_release>());
Is this GIL guard really necessary?
I tried it; it is no longer necessary, so I have removed it.
    @@ -62,6 +67,8 @@ def _graph_proto(self):
            return self._job_proto

        def debug(self, mode: bool = True) -> None:
            if get_rank() != 0:
Should debug information be printed here?
What I mean is: even though it returns, should we tell the user that this is because the process is not rank 0? See whether it is needed; it is also fine to leave it out.
OK, I will have rank 0 print a note that only rank 0 prints.
When starting with the launcher, I found that every process was writing logs, so the logs got mixed together.
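The rank-0-only behavior discussed above can be sketched as follows. This is an illustration only: `DebugSwitch`, the injected `get_rank` callable, and the wording of the note are hypothetical stand-ins, not OneFlow's actual `Graph.debug` implementation.

```python
from typing import Callable


class DebugSwitch:
    """Hypothetical helper that enables debug output on rank 0 only."""

    def __init__(
        self,
        get_rank: Callable[[], int],
        sink: Callable[[str], None] = print,
    ) -> None:
        self._get_rank = get_rank  # assumed to return this process's rank
        self._sink = sink          # where messages go (print by default)
        self.enabled = False

    def debug(self, mode: bool = True) -> None:
        # Non-zero ranks return silently, so processes started by a
        # launcher do not interleave their logs.
        if self._get_rank() != 0:
            return
        self.enabled = mode
        if mode:
            # Rank 0 tells the user why the other ranks stay quiet.
            self._sink("Note: debug output is printed on rank 0 only.")
```

Passing the rank lookup in as a callable keeps the sketch independent of any real distributed runtime and makes both branches easy to exercise.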
Overall this looks fine now. There are a few small issues; just take care of them.
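Overall approach: use reference counting to control the destruction order of the Python objects. However, the timing of Python GC destruction is not strict, and it is not synchronous with the destruction of the C++ objects, so the C++ destructors perform a Check + Close:

Destruction control for Python objects: Env initialization; in addition, a global reference to Env is kept in env_util.

Destruction control for C++ objects: when a Python object is destructed, the C++ object cannot be taken offline synchronously. So when the C++ object is destructed:

Below are the destruction logs of the Python objects and the C++ objects: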
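As a rough illustration of the idea described above (this is not the log mentioned there, and not the PR's code), the following plain-Python sketch shows how a strong reference can order destruction and how a destructor can do a "check, then close" step. The class `Resource`, its fields, and the function `demo` are hypothetical, and the deterministic ordering relies on CPython's reference counting.

```python
class Resource:
    """Hypothetical stand-in for an object backed by native state."""

    def __init__(self, name, log, parent=None):
        self.name = name
        self._log = log
        # Holding a strong reference keeps the parent alive at least as
        # long as this object, so the parent is destroyed afterwards.
        self._parent = parent
        self._closed = False

    def close(self):
        # Idempotent close: safe to call explicitly and from __del__.
        if self._closed:
            return
        self._closed = True
        self._log.append(f"close {self.name}")

    def __del__(self):
        # "Check + Close": if nobody closed us explicitly, do it now.
        if not self._closed:
            self.close()
        self._log.append(f"destroy {self.name}")


def demo():
    log = []
    env = Resource("env", log)
    session = Resource("session", log, parent=env)
    del env      # env survives: session still references it
    assert log == []
    del session  # session goes first, then its parent env
    return log
```

On CPython, `demo()` records the session being closed and destroyed before the env, even though the `env` name was deleted first.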
Setting the current implementation aside for the moment, I think the more direct approach is:
About the current implementation
Is this the runtime or the env?
Could the discussion here be turned into a dedicated issue?
Overall approach