-
Notifications
You must be signed in to change notification settings - Fork 656
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
NNGraph RunLazyJob check static input tensor meta #6243
NNGraph RunLazyJob check static input tensor meta #6243
Conversation
oneflow/core/framework/nn_graph.cpp
Outdated
DType dtype(tensor_meta.dtype()); | ||
std::string ret = "shape=" + tensor_meta.shape().ToString() + ", dtype=" + dtype.name(); | ||
if (mirrored_meta) { | ||
ret += ", device=" + mirrored_meta->device()->ToString(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里会遇到 SegmentFault,栈信息:
#0 0x00007fffb45de6ff in std::operator<< <char, std::char_traits<char>, std::allocator<char> > (
__str=<error reading variable: Cannot access memory at address 0x38>, __os=...)
at /usr/include/c++/9/bits/basic_string.h:6416
#1 oneflow::Device::ToString[abi:cxx11]() const (this=0x30)
at /home/chengcheng/oneflow/oneflow/core/framework/device.cpp:170
#2 0x00007fffb4636f94 in oneflow::(anonymous namespace)::TensorMetaToString (tensor_meta=...)
at /home/chengcheng/oneflow/oneflow/core/framework/tensor_meta.h:72
#3 0x00007fffb4637827 in oneflow::(anonymous namespace)::CheckStaticTensorMeta (tensor_meta=..., tensor=
std::shared_ptr<oneflow::one::Tensor> (use count 2, weak count 1) = {...})
at /home/chengcheng/oneflow/oneflow/core/framework/nn_graph.cpp:360
#4 0x00007fffb463826d in oneflow::RunLazyNNGraph (inputs=..., outputs=..., parameters=...,
nn_graph=std::shared_ptr<oneflow::NNGraph> (use count 2, weak count 1) = {...})
at /home/chengcheng/oneflow/oneflow/core/framework/nn_graph.cpp:385
#5 0x00007fffb371a72f in oneflow::<lambda(const oneflow::one::TensorTuple&, const oneflow::one::TensorTuple&, const oneflow::one::TensorTuple&, const std::shared_ptr<oneflow::NNGraph>&)>::operator() (
__closure=<optimized out>, nn_graph=std::shared_ptr<oneflow::NNGraph> (use count 2, weak count 1) = {...},
parameters=..., outputs=..., inputs=...)
at /home/chengcheng/oneflow/oneflow/api/python/framework/nn_graph.cpp:58
#6 pybind11::detail::argument_loader<oneflow::one::TensorTuple const&, oneflow::one::TensorTuple const&, oneflow::one::TensorTuple const&, std::shared_ptr<oneflow::NNGraph> const&>::call_impl<void, oneflow::OneflowApiPythonModule__LINE__(pybind11::module&)::<lambda(const oneflow::one::TensorTuple&, const oneflow::one::TensorTuple&, const oneflow::one::TensorTuple&, const std::shared_ptr<oneflow::NNGraph>&)>&, 0, 1, 2, 3, pybind11::detail::void_type> (f=..., this=0x7fffffffd260)
at /home/chengcheng/oneflow/build/_deps/pybind11-src/include/pybind11/cast.h:1189
#7 pybind11::detail::argument_loader<oneflow::one::TensorTuple const&, oneflow::one::TensorTuple const&, oneflow::one::TensorTuple const&, std::shared_ptr<oneflow::NNGraph> const&>::call<void, pybind11::detail::void_type, oneflow::OneflowApiPythonModule__LINE__(pybind11::module&)::<lambda(const oneflow::one::TensorTuple&, const oneflow::one::TensorTuple&, const oneflow::one::TensorTuple&, const std::shared_ptr<oneflow::NNGraph>&)>&> (f=...,
this=0x7fffffffd260) at /home/chengcheng/oneflow/build/_deps/pybind11-src/include/pybind11/cast.h:1166
#8 pybind11::cpp_function::<lambda(pybind11::detail::function_call&)>::operator() (this=0x0, call=...)
at /home/chengcheng/oneflow/build/_deps/pybind11-src/include/pybind11/pybind11.h:213
#9 pybind11::cpp_function::<lambda(pybind11::detail::function_call&)>::_FUN(pybind11::detail::function_call &)
() at /home/chengcheng/oneflow/build/_deps/pybind11-src/include/pybind11/pybind11.h:191
#10 0x00007fffb30e8e8a in pybind11::cpp_function::dispatcher (self=<optimized out>, args_in=0x7fff0009bbd0,
kwargs_in=0x0) at /home/chengcheng/oneflow/build/_deps/pybind11-src/include/pybind11/pybind11.h:791
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
不知道为什么无法访问 MirroredTensorMeta 的 Symbol<device>->ToString()
内存。 猜测是有可能 main 线程还不能直接读 tensor 的这个值?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
不知道为什么无法访问 MirroredTensorMeta 的
Symbol<device>->ToString()
内存。 猜测是有可能 main 线程还不能直接读 tensor 的这个值?
这里可能也是tensor_meta,是个纯基类的。没有device成员。
报错示例1:
报错示例2
|
Speed stats:
|
oneflow/core/framework/nn_graph.cpp
Outdated
reinterpret_cast<const one::MirroredTensorMeta*>(&tensor_meta); | ||
const one::ConsistentTensorMeta* consistent_meta = | ||
reinterpret_cast<const one::ConsistentTensorMeta*>(&tensor_meta); | ||
CHECK_OR_RETURN(mirrored_meta || consistent_meta); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liufengwei0103 复现方法就是这里,但是需要增加一行:
CHECK_OR_RETURN(!(mirrored_meta && consistent_meta));
用任意带 input、Variable 的 NNGraph 即可复现。发现同一个 tensor_meta 对象既可以 cast 成 Mirrored,又可以 cast 成 consistent meta
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
用这个 commit 吧,刚才那个是删掉以后的 str 低效判等版本
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liufengwei0103 复现方法就是这里,但是需要增加一行:
CHECK_OR_RETURN(!(mirrored_meta && consistent_meta));
用任意带 input、Variable 的 NNGraph 即可复现。发现同一个 tensor_meta 对象既可以 cast 成 Mirrored,又可以 cast 成 consistent meta
这儿咋不用dynamic_cast?
oneflow/core/framework/nn_graph.cpp
Outdated
for (const auto& input_tensor : input_tensors) { | ||
input_tensors_valid_.push_back(JUST(GetTensorValidInCurRank(input_tensor))); | ||
inputs_tensor_meta_.push_back(input_tensor->tensor_meta()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
我感觉出问题的可能是这句: 我需要保存 input tensor 的 meta 信息,但是又不能持有这个 input tensor(我不能在 NNGraph 的成员里包含 Tensor 的 shared ptr,因为会导致这个 tensor 一直不会被释放,生命周期无限延长),所以我只能 复制 一份 meta 出来。是不是这个 复制 操作,导致了 tensor meta 丢失了 local/consistent 信息? @liufengwei0103
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
我希望可以有 tensor meta 的 symbol,这样方便直接判等? @liufengwei0103
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
我感觉出问题的可能是这句: 我需要保存 input tensor 的 meta 信息,但是又不能持有这个 input tensor(我不能在 NNGraph 的成员里包含 Tensor 的 shared ptr,因为会导致这个 tensor 一直不会被释放,生命周期无限延长),所以我只能 复制 一份 meta 出来。是不是这个 复制 操作,导致了 tensor meta 丢失了 local/consistent 信息? @liufengwei0103
这里好像会发生子类到基类的隐式转换,丢失了子类的信息。是不是应该在基类里禁掉这样的构造函数,有这样的场景吗?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
我希望可以有 tensor meta 的 symbol,这样方便直接判等? @liufengwei0103
这里是说,想让tensor->tensor_meta() 返回一个Symbol<TensorMeta> 吗? @chengtbf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这里是说,想让tensor->tensor_meta() 返回一个Symbol<TensorMeta> 吗? @chengtbf
是的,直接返回 Symbol<TensorMeta>
,且这个 Symbol,是包含了 Device 或者 Placement + SBP 信息的,这样我好方便的知道这个 TensorMeta 是否是我想要的那个
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
另外,我觉得 TensorMeta 的基类就要提供 Device 和 Placement 的接口,子类分别实现,就像 Tensor 一样。
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
这样基类就不会丢失子类的信息了
…b.com/Oneflow-Inc/oneflow into dev_cc_check_static_input_tensor_meta
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
我这里正确性没问题了。可以额外做一次模型测试,实测下性能差异。
带 input 的么,目前好像还没有这样的模型库。我们的模型都是 graph 的 reader 作为输入。只有单测是带 input 的 |
Maybe<std::string> GetTensorMetaString(const std::shared_ptr<one::Tensor>& tensor) { | ||
std::string ret = "shape=" + tensor->shape()->ToString() + ", dtype=" + tensor->dtype()->name(); | ||
if (tensor->is_consistent()) { | ||
ret += ", placement=" + *JUST(PlacementToString(JUST(tensor->parallel_desc()))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tensor->nd_sbp是不是也需要check下
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
需要,这里漏掉了,我补充一下
的确是的,那我们先做正确性。后面发现有性能瓶颈再优化也可以 |
嗯嗯。 带 input 的单测,master 也应该比以前慢了不少,因为要显示的同步。 |
…b.com/Oneflow-Inc/oneflow into dev_cc_check_static_input_tensor_meta
Speed stats:
|
NNGraph 每次执行都检查 input tensor 的 meta 信息是否与编译时的 input tensor meta 一致。