Control Graph / Session / Env's python c++ object destruction #5845

Merged
merged 41 commits into from Aug 16, 2021

Conversation

Contributor

@strint strint commented Aug 11, 2021

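Overall approach

  • When Env closes, it checks whether a Session still exists; if so, it calls that Session's Close method.
  • When Session closes, it uses a weak_ptr to check whether the graph still exists; if so, it calls the graph's Close method as well.
  • This is how the destruction order of the C++ objects is controlled.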
python graph destruct                         python session destruct                         python env destruct
         |                                                 |                                         |
c graph (delay) destruct                       c session context (delay) destruct               c  env (delay) destruct 
         |                                                 |                                         |
c graph close                 c session context close (do c graph close if needed )          c env close(do c session close if needed)
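
The close cascade in the diagram above can be sketched in Python. This is only an illustration under stated assumptions: `CGraph`, `CSession`, and `CEnv` are hypothetical stand-ins for the C++ objects, and Python's `weakref` plays the role of the `weak_ptr` mentioned in the description. It is not the code from this PR.

```python
import weakref


class CGraph:
    """Hypothetical stand-in for the C++ graph object."""

    def __init__(self, name, log):
        self.name = name
        self.closed = False
        self._log = log

    def close(self):
        # Closing twice is a no-op, so both the owner and the parent may call it.
        if not self.closed:
            self.closed = True
            self._log.append(f"graph {self.name} close")


class CSession:
    """Hypothetical stand-in for the C++ session context."""

    def __init__(self, log):
        self.closed = False
        self._log = log
        self._graphs = []  # weak references: the session never keeps a graph alive

    def add_graph(self, graph):
        self._graphs.append(weakref.ref(graph))

    def close(self):
        if self.closed:
            return
        for ref in self._graphs:
            graph = ref()  # None if the graph has already been destroyed
            if graph is not None:
                graph.close()
        self.closed = True
        self._log.append("session close")


class CEnv:
    """Hypothetical stand-in for the C++ env object."""

    def __init__(self, log):
        self.closed = False
        self._log = log
        self._sessions = []

    def add_session(self, session):
        self._sessions.append(weakref.ref(session))

    def close(self):
        if self.closed:
            return
        for ref in self._sessions:
            session = ref()
            if session is not None:
                session.close()
        self.closed = True
        self._log.append("env close")


def demo():
    log = []
    env = CEnv(log)
    session = CSession(log)
    env.add_session(session)
    graph = CGraph("LinearGraph_0", log)
    session.add_graph(graph)
    env.close()  # closing the env cascades down to the session and the graph
    return log
```

Calling `demo()` closes the graph first, then the session, then the env, even though only `env.close()` was called explicitly.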

@strint strint changed the title refer count del of session and env Delete session and env by reference counting Aug 11, 2021
@strint strint changed the title Delete session and env by reference counting Delete Session and Env by reference counting Aug 11, 2021
@oneflow-ci-bot oneflow-ci-bot removed their request for review August 11, 2021 18:06
@strint strint marked this pull request as ready for review August 11, 2021 18:08
@chengtbf
Contributor

Please add the test script as well: a unit test that is guaranteed to fail on master but works with this PR.

@chengtbf
Contributor

This script is enough to trigger it:

import os
import unittest
import numpy as np

import oneflow as flow

device = flow.device("cuda")
linear = flow.nn.Linear(3, 8, False)
linear = linear.to(device)
input_arr = np.array(
    [
        [-0.94630778, -0.83378579, -0.87060891],
        [2.0289922, -0.28708987, -2.18369248],
        [0.35217619, -0.67095644, -1.58943879],
        [0.08086036, -1.81075924, 1.20752494],
        [0.8901075, -0.49976737, -1.07153746],
        [-0.44872912, -1.07275683, 0.06256855],
        [-0.22556897, 0.74798368, 0.90416439],
        [0.48339456, -2.32742195, -0.59321527],
    ],
    dtype=np.float32,
)
np_weight = np.ones((3, 8)).astype(np.float32)
np_weight.fill(2.3)
x = flow.tensor(input_arr, device=device)
flow.nn.init.constant_(linear.weight, 2.3)
of_eager_out = linear(x)
np_out = np.matmul(input_arr, np_weight)
assert np.allclose(of_eager_out.numpy(), np_out, rtol=1e-05, atol=1e-05)

class LinearGraph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.my_linear = linear

    def build(self, x):
        return self.my_linear(x)

# Creating the graph at module (top-level) scope is what triggers the crash on master.
linear_g = LinearGraph()
of_lazy_out = linear_g(x)
assert np.array_equal(of_lazy_out.numpy(), of_eager_out.numpy())

Output when run on master:

chengcheng@oneflow-21:~/debug/graph $ python3 test_graph_destruct.py 
F0812 14:21:07.513065 3473074 thread.cpp:57] Check failed: id2actor_ptr_.empty() 
*** Check failure stack trace: ***
    @     0x7f18d2038643  google::LogMessage::Fail()
    @     0x7f18d203d3eb  google::LogMessage::SendToLog()
    @     0x7f18d203833f  google::LogMessage::Flush()
    @     0x7f18d2038b6f  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f18cd8dc8c6  oneflow::Thread::PollMsgChannel()
    @     0x7f18cd8d9e08  _ZNSt6thread11_State_implINS_8_InvokerISt5tupleIJZN7oneflow9CpuThreadC4ElEUlvE_EEEEE6_M_runEv
    @     0x7f19b86eede4  (unknown)
    @     0x7f19c2cf5609  start_thread
    @     0x7f19c2e31293  clone
    @              (nil)  (unknown)
Aborted (core dumped)

@chengtbf
Contributor

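Tested: creating a graph at module top level is enough to trigger the bug, but wrapping it in a test case exposes other bugs as well. Over several runs, about half the time it deadlocks with the process spinning at 100% CPU, and the other half it crashes.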

import os
import unittest
import numpy as np

import oneflow as flow
import oneflow.unittest

device = flow.device("cuda")
linear = flow.nn.Linear(3, 8, False)
linear = linear.to(device)
input_arr = np.random.randn(8, 3).astype(np.float32)
np_weight = np.ones((3, 8)).astype(np.float32)
np_weight.fill(2.3)
x = flow.tensor(input_arr, device=device)
flow.nn.init.constant_(linear.weight, 2.3)
of_eager_out = linear(x)
np_out = np.matmul(input_arr, np_weight)
assert np.allclose(of_eager_out.numpy(), np_out, rtol=1e-05, atol=1e-05)

class LinearGraph(flow.nn.Graph):
    def __init__(self):
        super().__init__()
        self.my_linear = linear

    def build(self, x):
        return self.my_linear(x)

linear_g = LinearGraph()


@unittest.skipIf(os.getenv("ONEFLOW_TEST_CPU_ONLY"), "only test cpu cases")
@flow.unittest.skip_unless_1n1d()
class TestLinearGraph(oneflow.unittest.TestCase):
    def test_linear_graph_gpu(test_case):
        of_lazy_out = linear_g(x)
        test_case.assertTrue(np.array_equal(of_lazy_out.numpy(), of_eager_out.numpy()))
        # _test_linear_graph(test_case, flow.device("cuda"))


if __name__ == "__main__":
    unittest.main()

@strint strint removed the request for review from oneflow-ci-bot August 13, 2021 13:54
@strint
Contributor Author

strint commented Aug 13, 2021

python -m unittest test_graph_session_env_destruct

This test now passes.

@strint strint requested review from oneflow-ci-bot and removed request for oneflow-ci-bot August 14, 2021 04:07
@strint
Contributor Author

strint commented Aug 16, 2021

F0816 14:07:56.225633 36856 global_process_ctx.cpp:29] Check failed: 'Global<ProcessCtx>::Get()' Must be non NULL 
*** Check failure stack trace: ***
    @     0x7f25b48d2a1c  google::LogMessage::Fail()
    @     0x7f25b48d2979  google::LogMessage::SendToLog()
    @     0x7f25b48d22e2  google::LogMessage::Flush()
    @     0x7f25b48d52cc  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f25b03fb63a  google::CheckNotNull<>()
    @     0x7f25b08d704e  oneflow::GlobalProcessCtx::Rank()
    @     0x7f25afc0cca2  oneflow::InstructionsBuilder::ComputeRankFrontSeqCallback()
    @     0x7f25b09703c7  _ZZN7oneflow2vm15MultiClientSyncEvENKUlPNS_19InstructionsBuilderEE_clES2_
    @     0x7f25b097132b  _ZNSt17_Function_handlerIFN7oneflow5MaybeIvvEEPNS0_19InstructionsBuilderEEZNS0_2vm15MultiClientSyncEvEUlS4_E_E9_M_invokeERKSt9_Any_dataOS4_
    @     0x7f25afc32cbf  std::function<>::operator()()
    @     0x7f25afc23238  oneflow::PhysicalRun()
    @     0x7f25b09707dd  oneflow::vm::MultiClientSync()
    @     0x7f25ae5b468f  _ZZL30OneflowApiPythonModule__LINE__RN8pybind117module_EENKUlvE_clEv
    @     0x7f25ae5b549c  _ZNO8pybind116detail15argument_loaderIJEE9call_implIvRZL30OneflowApiPythonModule__LINE__RNS_7module_EEUlvE_JENS_18gil_scoped_releaseEEET_OT0_NS0_14index_sequenceIJXspT1_EEEEOT2_
    @     0x7f25ae5b4f00  _ZNO8pybind116detail15argument_loaderIJEE4callIvNS_18gil_scoped_releaseERZL30OneflowApiPythonModule__LINE__RNS_7module_EEUlvE_EENSt9enable_ifIXsrSt7is_voidIT_E5valueENS0_9void_typeEE4typeEOT1_
    @     0x7f25ae5b4bf5  _ZZN8pybind1112cpp_function10initializeIZL30OneflowApiPythonModule__LINE__RNS_7module_EEUlvE_vJEJNS_4nameENS_5scopeENS_7siblingENS_10call_guardIJNS_18gil_scoped_releaseEEEEEEEvOT_PFT0_DpT1_EDpRKT2_ENKUlRNS_6detail13function_callEE1_clESO_
    @     0x7f25ae5b4c52  _ZZN8pybind1112cpp_function10initializeIZL30OneflowApiPythonModule__LINE__RNS_7module_EEUlvE_vJEJNS_4nameENS_5scopeENS_7siblingENS_10call_guardIJNS_18gil_scoped_releaseEEEEEEEvOT_PFT0_DpT1_EDpRKT2_ENUlRNS_6detail13function_callEE1_4_FUNESO_
    @     0x7f25ad9971d1  pybind11::cpp_function::dispatcher()
    @     0x559d6b0e8114  _PyMethodDef_RawFastCallKeywords
    @     0x559d6b0e8231  _PyCFunction_FastCallKeywords
    @     0x559d6b14ca5d  _PyEval_EvalFrameDefault
    @     0x559d6b0a273b  _PyFunction_FastCallDict
    @     0x559d6b0ffa15  slot_tp_finalize
    @     0x559d6b0aca1d  collect
    @     0x559d6b19620a  _PyGC_CollectNoFail
    @     0x559d6b126b1c  PyImport_Cleanup
    @     0x559d6b19c367  Py_FinalizeEx
    @     0x559d6b19c4c9  Py_Exit
    @     0x559d6b19c587  handle_system_exit
    @     0x559d6b19c622  PyErr_PrintEx
    @     0x559d6b19c845  pymain_run_module
    @     0x559d6b1aefcb  pymain_main
Aborted

Remove the Graph's reference to the Session, and remove the sync in the Python graph's del:

Traceback (most recent call last):
  File "/home/xuxiaoyu/anaconda3/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/xuxiaoyu/anaconda3/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/xuxiaoyu/anaconda3/lib/python3.7/unittest/__main__.py", line 18, in <module>
    main(module=None)
  File "/home/xuxiaoyu/anaconda3/lib/python3.7/unittest/main.py", line 100, in __init__
    self.parseArgs(argv)
  File "/home/xuxiaoyu/anaconda3/lib/python3.7/unittest/main.py", line 147, in parseArgs
    self.createTests()
  File "/home/xuxiaoyu/anaconda3/lib/python3.7/unittest/main.py", line 159, in createTests
    self.module)
  File "/home/xuxiaoyu/anaconda3/lib/python3.7/unittest/loader.py", line 220, in loadTestsFromNames
    suites = [self.loadTestsFromName(name, module) for name in names]
  File "/home/xuxiaoyu/anaconda3/lib/python3.7/unittest/loader.py", line 220, in <listcomp>
    suites = [self.loadTestsFromName(name, module) for name in names]
  File "/home/xuxiaoyu/anaconda3/lib/python3.7/unittest/loader.py", line 154, in loadTestsFromName
    module = __import__(module_name)
  File "/home/xuxiaoyu/oneflow/python/oneflow/test/graph/test_graph_session_env_destruct1.py", line 20, in <module>
    import oneflow as flow
  File "/home/xuxiaoyu/oneflow/python/oneflow/__init__.py", line 118, in <module>
    import oneflow.nn.image
  File "/home/xuxiaoyu/oneflow/python/oneflow/nn/__init__.py", line 16, in <module>
    from oneflow.nn.graph import Graph
  File "/home/xuxiaoyu/oneflow/python/oneflow/nn/graph.py", line 469
    class GraphConfig(FunctionConfig):
    ^
IndentationError: expected an indented block
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/home/xuxiaoyu/oneflow/python/oneflow/__init__.py", line 103, in _ExitOneFlow
    _SyncOnMasterFn()
  File "/home/xuxiaoyu/oneflow/python/oneflow/__init__.py", line 95, in _SyncOnMasterFn
    if not oneflow._oneflow_internal.IsEnvInited():
NameError: name 'oneflow' is not defined

m.def("TryDestroyMultiClientSessionContext", &TryDestroyMultiClientSessionContext);
m.def("MultiClientSessionContextAddCGraph", &MultiClientSessionContextAddCGraph);
m.def("TryDestroyMultiClientSessionContext", &TryDestroyMultiClientSessionContext,
py::call_guard<py::gil_scoped_release>());
Contributor

Is this GIL guard really required here?

Contributor Author

I tried it; it is no longer necessary, so I have removed it.

@@ -62,6 +67,8 @@ def _graph_proto(self):
        return self._job_proto

    def debug(self, mode: bool = True) -> None:
        if get_rank() != 0:
Contributor

Should a debug message be printed here?

Contributor

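That is, even though it returns, tell the user that it is because this process is not rank 0? See whether that is necessary; it is also fine to leave it out.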

Contributor Author

@strint strint Aug 16, 2021

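OK, I will have rank 0 print a note saying that only rank 0 prints.

When starting with the launcher, I found that every process was writing logs, so the logs got mixed together.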
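
A minimal sketch of the rank-0-only behaviour discussed in this thread. The names here are assumptions for illustration: `get_rank` below is a hypothetical stand-in that reads a `RANK` environment variable, `DebuggableGraph` is not the real `nn.Graph`, and the message text is made up.

```python
import os


def get_rank():
    # Hypothetical stand-in: read the process rank from a RANK environment variable.
    return int(os.environ.get("RANK", "0"))


class DebuggableGraph:
    def __init__(self):
        self.debug_enabled = False

    def debug(self, mode=True, rank=None):
        rank = get_rank() if rank is None else rank
        if rank != 0:
            # Non-zero ranks stay silent so logs from several processes do not interleave.
            return None
        self.debug_enabled = mode
        message = "Note: graph debug info is only printed on rank 0."
        print(message)
        return message
```

With this shape, a multi-process launch produces exactly one debug stream, and that stream says up front why the other ranks are quiet.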

Contributor

@chengtbf chengtbf left a comment

Overall this looks fine now~ There are a few small issues left; just take care of them~

@strint strint requested review from oneflow-ci-bot and removed request for oneflow-ci-bot August 16, 2021 15:49
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot August 16, 2021 16:19
@github-actions
Contributor

Speed stats:
GPU Name: GeForce GTX 1080 

PyTorch resnet50 time: 138.1ms (= 6906.7ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 127.7ms (= 6382.8ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
Relative speed: 1.08 (= 138.1ms / 127.7ms)

PyTorch resnet50 time: 84.4ms (= 4218.6ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 74.2ms (= 3711.2ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
Relative speed: 1.14 (= 84.4ms / 74.2ms)

PyTorch resnet50 time: 57.1ms (= 2853.1ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 47.6ms (= 2378.7ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
Relative speed: 1.20 (= 57.1ms / 47.6ms)

PyTorch resnet50 time: 49.6ms (= 2478.5ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 39.0ms (= 1948.0ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
Relative speed: 1.27 (= 49.6ms / 39.0ms)

PyTorch resnet50 time: 42.9ms (= 2142.7ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 37.7ms (= 1884.5ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
Relative speed: 1.14 (= 42.9ms / 37.7ms)

@oneflow-ci-bot oneflow-ci-bot removed their request for review August 16, 2021 17:26
@oneflow-ci-bot oneflow-ci-bot merged commit c89b3ff into master Aug 16, 2021
@oneflow-ci-bot oneflow-ci-bot deleted the fea/destruct_session_and_graph branch August 16, 2021 17:27
@strint
Contributor Author

strint commented Mar 4, 2022

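Overall approach

Reference counting controls the destruction order of the Python objects.

However, the timing of Python GC destruction is not strict, and it does not destroy the C++ objects synchronously, so the C++ destruction path performs a Check + Close:

  • When Env closes, it checks whether a Session still exists; if so, it calls that Session's Close method.
  • When Session closes, it uses a weak_ptr to check whether the graph still exists; if so, it calls the graph's Close method as well.
    This is how the destruction order of the C++ objects is controlled.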

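Destruction control for Python objects

Env is initialized, and a global reference to Env is additionally kept in env_util;
When Session is initialized, it references Env, and a global reference to Session is additionally kept in session_context;
When a Graph is built, it references Session;
When a Graph is destructed, the Session reference count drops, which may trigger Session destruction;
On OneFlow exit, the global Session reference kept in session_context is dropped, which may trigger Session destruction;
When Session is destructed, the Env reference count drops, which may trigger Env destruction;
On OneFlow exit, the global Env reference kept in env_util is dropped, which may trigger Env destruction;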
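
The reference chain above can be sketched in plain Python. This assumes CPython, where an object is finalized as soon as its last reference goes away; `PyEnv`, `PySession`, and `PyGraph` are hypothetical stand-ins, and the two local variables play the role of the global references kept in env_util and session_context.

```python
class PyEnv:
    def __init__(self, log):
        self._log = log

    def __del__(self):
        self._log.append("env destruct")


class PySession:
    def __init__(self, env, log):
        self._env = env  # the session keeps its env alive
        self._log = log

    def __del__(self):
        self._log.append("session destruct")


class PyGraph:
    def __init__(self, session, log):
        self._session = session  # the graph keeps its session alive
        self._log = log

    def __del__(self):
        self._log.append("graph destruct")


def demo():
    log = []
    global_env = PyEnv(log)                      # stands in for the env_util global
    global_session = PySession(global_env, log)  # stands in for the session_context global
    graph = PyGraph(global_session, log)

    del global_env      # "exit" drops the global env reference; the session still holds it
    del global_session  # the graph still holds the session
    assert log == []    # nothing has been destructed yet
    del graph           # last owner goes away: graph, then session, then env
    return log
```

The point of the sketch is that, whichever order the global references are dropped in, the graph is finalized before its session and the session before its env.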

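Destruction control for C++ objects

When a Python object is destructed, the corresponding C++ object cannot be taken offline synchronously. So when the C++ objects are destructed:
When Env is destructed, it checks whether the Session still exists; if so, it triggers the Session's Close method but does not destruct the object;
When Session closes, it checks whether the Graph's shared_ptr still exists; if so, it triggers the Graph's Close method, again without destructing the object;
C++ object destruction still relies on normal reference-counted destruction, but at destruction time it checks whether Close has already happened: if not, it calls Close first; if so, it only destructs.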
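
The "check whether Close already happened at destruction time" rule can be illustrated with a small Python analogue. `Resource` is a hypothetical class, and `__del__` stands in for the C++ destructor; the sketch assumes CPython's immediate finalization on `del`.

```python
class Resource:
    """Hypothetical object whose destructor closes it only if nobody did so earlier."""

    def __init__(self, name, log):
        self.name = name
        self._log = log
        self._closed = False

    def close(self):
        if self._closed:
            return False  # already closed by a parent; nothing to do
        self._closed = True
        self._log.append(f"{self.name} close")
        return True

    def __del__(self):
        # Destruction still follows normal reference counting; close() runs only if needed.
        self.close()
        self._log.append(f"{self.name} destruct")


def demo():
    log = []
    session = Resource("session", log)
    session.close()  # closed early by its parent (for example, the env)
    del session      # destructor sees it is already closed and only destructs
    graph = Resource("graph", log)
    del graph        # never closed explicitly, so the destructor closes it first
    return log
```

Each resource is closed exactly once, regardless of whether the parent or the destructor gets there first.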

Below is the destruction log of the Python and C++ objects:

e_s_g evn init
e_s_g py session init
test setup
Nothing happened because environment has been initialized
e_s_g try to create graph  LinearGraph_0
.
----------------------------------------------------------------------
Ran 1 test in 0.397s

OK
e_s_g try to del graph  LinearGraph_0  # py graph destruct
e_s_g py session close  # py session destruct
e_s_g py session try close  # c session destruct
E0814 16:15:30.509083 19034 multi_client_session_context.cpp:130] Try to delete multi client session context. after sync
E0814 16:15:30.510738 19034 multi_client_session_context.cpp:132]  graph count 1
E0814 16:15:30.510833 19034 multi_client_session_context.cpp:134] grap name LinearGraph_0 not cleand. # c session triggers c graph close
E0814 16:15:30.510886 19034 nn_graph.cpp:42] Try to delete c nn graph name LinearGraph_0.  # c graph close
E0814 16:15:30.517534 19034 nn_graph.cpp:47] Finish delete c nn graph name LinearGraph_0.
E0814 16:15:30.522147 19034 env_global_objects_scope.cpp:202] Try to delete env   # c env close
E0814 16:15:30.522207 19034 env_global_objects_scope.cpp:208] session expired 
e_s_g env close  # py env close

@jackalcooper
Collaborator

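Setting the current implementation aside for now, I think the more straightforward approach is:

  • session is a unique_ptr held by the runtime, and graph is a unique_ptr held by the session.
  • Members of runtime and graph can be shared_ptrs that share state with other parts of the system; that way, when there is a bug, it is clear which resource's lifetime is at fault.
  • Python obtains the ids of the graph and session, and uses those ids to query and change the state of the session and graph.

About the current implementation

  • Having session and graph be shared_ptrs is conceptually odd. We have no feature that takes a reference to a graph/session (something like calling a function). Their job should be to give all resources a well-defined lifetime, not to be shared_ptrs themselves and thereby have an ill-defined lifetime of their own. This is also why the various try-close code paths look like such an anti-pattern, and why interacting with the rest of the system so easily leads to bugs and deadlocks.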
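
The ownership model proposed in this comment can be sketched as an id-based registry. Everything here is hypothetical (`Runtime`, `Session`, `Graph`, and all method names): the runtime is the sole owner of sessions, each session is the sole owner of its graphs, and callers hold only integer ids.

```python
import itertools


class Graph:
    def __init__(self, name):
        self.name = name
        self.closed = False


class Session:
    def __init__(self):
        self._graphs = {}  # graph id -> Graph; the session is the only owner
        self._ids = itertools.count()

    def create_graph(self, name):
        graph_id = next(self._ids)
        self._graphs[graph_id] = Graph(name)
        return graph_id

    def graph_name(self, graph_id):
        return self._graphs[graph_id].name

    def close(self):
        # Graphs are closed and released before the session itself goes away.
        for graph in self._graphs.values():
            graph.closed = True
        self._graphs.clear()


class Runtime:
    def __init__(self):
        self._sessions = {}  # session id -> Session; the runtime is the only owner
        self._ids = itertools.count()

    def create_session(self):
        session_id = next(self._ids)
        self._sessions[session_id] = Session()
        return session_id

    def create_graph(self, session_id, name):
        return self._sessions[session_id].create_graph(name)

    def graph_name(self, session_id, graph_id):
        return self._sessions[session_id].graph_name(graph_id)

    def destroy_session(self, session_id):
        # Destruction order is decided in one place: graphs first, then the session.
        self._sessions.pop(session_id).close()
```

Because callers never hold the objects themselves, a stale id fails loudly with a `KeyError` instead of silently extending a lifetime.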

@chengtbf
Contributor

chengtbf commented Mar 4, 2022

  • session is a unique_ptr held by the runtime, and graph is a unique_ptr held by the session.

Here, is it the runtime or the env?

@chengtbf
Contributor

chengtbf commented Mar 4, 2022

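Destruction control for Python objects

Env is initialized, and a global reference to Env is additionally kept in env_util; when Session is initialized, it references Env, and a global reference to Session is additionally kept in session_context; when a Graph is built, it references Session; when a Graph is destructed, the Session reference count drops, which may trigger Session destruction; on OneFlow exit, the global Session reference kept in session_context is dropped, which may trigger Session destruction; when Session is destructed, the Env reference count drops, which may trigger Env destruction; on OneFlow exit, the global Env reference kept in env_util is dropped, which may trigger Env destruction;

  1. The direction for future refactoring is to merge Env and Session.
  2. "When a Graph is destructed, the Session reference count drops, which may trigger Session destruction": I think this is incorrect. Whether the Session is destructed should have nothing to do with whether any Graph currently exists, right? Even if all Graphs have been destructed, the Session still must not be destructed, because the user may create a new Graph the next moment.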

@yuanms2
Contributor

yuanms2 commented Mar 4, 2022

Could the discussion here be turned into a dedicated issue?
