NNGraph RunLazyJob check static input tensor meta #6243

Merged: 22 commits merged into master from dev_cc_check_static_input_tensor_meta on Sep 26, 2021

Conversation

chengtbf (Contributor)

On every execution, NNGraph checks whether each input tensor's meta information matches the input tensor meta recorded at compile time.
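
For orientation, here is a minimal, self-contained sketch of the idea behind the check (stand-in types and simplified error handling, not the actual OneFlow implementation): at compile time the graph records a string describing each input's meta, and every RunLazyNNGraph call rebuilds that string from the actual input and requires an exact match.

#include <stdexcept>
#include <string>
#include <vector>

// Stand-in for the per-input description recorded when the graph is compiled,
// e.g. "shape=(16,1,28,28), dtype=oneflow.float32, device=cuda:0".
struct StaticInputMeta {
  std::string meta_str;
};

// Sketch of the per-run check: the description rebuilt from the current input
// must match the one recorded on the first (compile-time) call.
void CheckStaticInputMeta(const std::vector<StaticInputMeta>& compiled,
                          const std::vector<std::string>& current_meta_strs) {
  if (compiled.size() != current_meta_strs.size()) {
    throw std::runtime_error("input count changed between compilation and this run");
  }
  for (size_t i = 0; i < compiled.size(); ++i) {
    if (compiled[i].meta_str != current_meta_strs[i]) {
      throw std::runtime_error("nn.Graph only accepts static input tensor meta; input "
                               + std::to_string(i) + " expected [" + compiled[i].meta_str
                               + "] but got [" + current_meta_strs[i] + "]");
    }
  }
}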

DType dtype(tensor_meta.dtype());
std::string ret = "shape=" + tensor_meta.shape().ToString() + ", dtype=" + dtype.name();
if (mirrored_meta) {
  ret += ", device=" + mirrored_meta->device()->ToString();
chengtbf (Contributor Author):

This hits a segmentation fault here; stack trace:

#0  0x00007fffb45de6ff in std::operator<< <char, std::char_traits<char>, std::allocator<char> > (
    __str=<error reading variable: Cannot access memory at address 0x38>, __os=...)
    at /usr/include/c++/9/bits/basic_string.h:6416
#1  oneflow::Device::ToString[abi:cxx11]() const (this=0x30)
    at /home/chengcheng/oneflow/oneflow/core/framework/device.cpp:170
#2  0x00007fffb4636f94 in oneflow::(anonymous namespace)::TensorMetaToString (tensor_meta=...)
    at /home/chengcheng/oneflow/oneflow/core/framework/tensor_meta.h:72
#3  0x00007fffb4637827 in oneflow::(anonymous namespace)::CheckStaticTensorMeta (tensor_meta=..., tensor=
    std::shared_ptr<oneflow::one::Tensor> (use count 2, weak count 1) = {...})
    at /home/chengcheng/oneflow/oneflow/core/framework/nn_graph.cpp:360
#4  0x00007fffb463826d in oneflow::RunLazyNNGraph (inputs=..., outputs=..., parameters=..., 
    nn_graph=std::shared_ptr<oneflow::NNGraph> (use count 2, weak count 1) = {...})
    at /home/chengcheng/oneflow/oneflow/core/framework/nn_graph.cpp:385
#5  0x00007fffb371a72f in oneflow::<lambda(const oneflow::one::TensorTuple&, const oneflow::one::TensorTuple&, const oneflow::one::TensorTuple&, const std::shared_ptr<oneflow::NNGraph>&)>::operator() (
    __closure=<optimized out>, nn_graph=std::shared_ptr<oneflow::NNGraph> (use count 2, weak count 1) = {...}, 
    parameters=..., outputs=..., inputs=...)
    at /home/chengcheng/oneflow/oneflow/api/python/framework/nn_graph.cpp:58
#6  pybind11::detail::argument_loader<oneflow::one::TensorTuple const&, oneflow::one::TensorTuple const&, oneflow::one::TensorTuple const&, std::shared_ptr<oneflow::NNGraph> const&>::call_impl<void, oneflow::OneflowApiPythonModule__LINE__(pybind11::module&)::<lambda(const oneflow::one::TensorTuple&, const oneflow::one::TensorTuple&, const oneflow::one::TensorTuple&, const std::shared_ptr<oneflow::NNGraph>&)>&, 0, 1, 2, 3, pybind11::detail::void_type> (f=..., this=0x7fffffffd260)
    at /home/chengcheng/oneflow/build/_deps/pybind11-src/include/pybind11/cast.h:1189
#7  pybind11::detail::argument_loader<oneflow::one::TensorTuple const&, oneflow::one::TensorTuple const&, oneflow::one::TensorTuple const&, std::shared_ptr<oneflow::NNGraph> const&>::call<void, pybind11::detail::void_type, oneflow::OneflowApiPythonModule__LINE__(pybind11::module&)::<lambda(const oneflow::one::TensorTuple&, const oneflow::one::TensorTuple&, const oneflow::one::TensorTuple&, const std::shared_ptr<oneflow::NNGraph>&)>&> (f=..., 
    this=0x7fffffffd260) at /home/chengcheng/oneflow/build/_deps/pybind11-src/include/pybind11/cast.h:1166
#8  pybind11::cpp_function::<lambda(pybind11::detail::function_call&)>::operator() (this=0x0, call=...)
    at /home/chengcheng/oneflow/build/_deps/pybind11-src/include/pybind11/pybind11.h:213
#9  pybind11::cpp_function::<lambda(pybind11::detail::function_call&)>::_FUN(pybind11::detail::function_call &)
    () at /home/chengcheng/oneflow/build/_deps/pybind11-src/include/pybind11/pybind11.h:191
#10 0x00007fffb30e8e8a in pybind11::cpp_function::dispatcher (self=<optimized out>, args_in=0x7fff0009bbd0, 
    kwargs_in=0x0) at /home/chengcheng/oneflow/build/_deps/pybind11-src/include/pybind11/pybind11.h:791

@chengtbf (Contributor Author) commented Sep 12, 2021:

I don't know why the memory behind MirroredTensorMeta's Symbol<device>->ToString() cannot be accessed. My guess is that maybe the main thread cannot read this value of the tensor directly yet?

Contributor:

> I don't know why the memory behind MirroredTensorMeta's Symbol<device>->ToString() cannot be accessed. My guess is that maybe the main thread cannot read this value of the tensor directly yet?

The tensor_meta here might also just be the pure base class, which has no device member.

@chengtbf (Contributor Author):

Error example 1:

Traceback (most recent call last):
  File "test_1d_simple.py", line 82, in <module>
    loss = graph_pipeline(x)
  File "/home/chengcheng/oneflow/python/oneflow/nn/graph/graph.py", line 251, in __call__
    return self._run(*args)
  File "/home/chengcheng/oneflow/python/oneflow/nn/graph/graph.py", line 593, in _run
    oneflow._oneflow_internal.nn.graph.RunLazyNNGraph(
oneflow._oneflow_internal.exception.CheckFailedException: 
  File "/home/chengcheng/oneflow/oneflow/core/framework/nn_graph.cpp", line 363, in RunLazyNNGraph
    Check failed: static_meta_str == tensor_meta_str 
  nn.Graph ONLY accepts static inputs tensor meta, please check whether your input tensor meta each step is the same as the input of first call graph. 
  The excepted tensor meta is : ( 
  shape=(16,1,28,28), dtype=oneflow.float32, placement=oneflow.placement(device_type="cuda", machine_device_ids={0 : [0]}, hierarchy=(1,)) 
) , but the actual tensor meta is : ( 
  shape=(16,1,28,28), dtype=oneflow.float32, device=cpu:0 
)

Error example 2:

Traceback (most recent call last):
  File "test_graph_buffer_limit.py", line 89, in test_graph_buffer_limit
    _test_graph_buffer_limit(test_case)
  File "test_graph_buffer_limit.py", line 81, in _test_graph_buffer_limit
    out = pp_g(x)
  File "/home/chengcheng/oneflow/python/oneflow/nn/graph/graph.py", line 251, in __call__
    return self._run(*args)
  File "/home/chengcheng/oneflow/python/oneflow/nn/graph/graph.py", line 593, in _run
    oneflow._oneflow_internal.nn.graph.RunLazyNNGraph(
oneflow._oneflow_internal.exception.CheckFailedException: 
  File "/home/chengcheng/oneflow/oneflow/core/framework/nn_graph.cpp", line 363, in RunLazyNNGraph
    Check failed: static_meta_str == tensor_meta_str 
  nn.Graph ONLY accepts static inputs tensor meta, please check whether your input tensor meta each step is the same as the input of first call graph. 
  The excepted tensor meta is : ( 
  shape=(16,10), dtype=oneflow.float32, placement=oneflow.placement(device_type="cuda", machine_device_ids={0 : [0]}, hierarchy=(1,)) 
) , but the actual tensor meta is : ( 
  shape=(4,4), dtype=oneflow.float32, placement=oneflow.placement(device_type="cuda", machine_device_ids={0 : [0]}, hierarchy=(1,)) 
)

Base automatically changed from dev_cc_nngraph_io_valid to master September 12, 2021 17:05
@chengtbf chengtbf marked this pull request as ready for review September 12, 2021 17:26
@chengtbf chengtbf added automerge and removed WIP work in progress labels Sep 12, 2021
@github-actions:

Speed stats:
GPU Name: GeForce GTX 1080 

OneFlow resnet50 time: 128.0ms (= 6398.4ms / 50, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 141.5ms (= 7072.9ms / 50, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.11 (= 141.5ms / 128.0ms)

OneFlow resnet50 time: 74.2ms (= 3708.0ms / 50, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 84.9ms (= 4244.6ms / 50, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.14 (= 84.9ms / 74.2ms)

OneFlow resnet50 time: 49.9ms (= 2493.1ms / 50, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 61.1ms (= 3055.2ms / 50, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.23 (= 61.1ms / 49.9ms)

OneFlow resnet50 time: 46.2ms (= 2307.6ms / 50, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 56.0ms (= 2798.5ms / 50, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.21 (= 56.0ms / 46.2ms)

OneFlow resnet50 time: 51.0ms (= 2548.4ms / 50, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 45.1ms (= 2253.2ms / 50, input_shape=[1, 3, 224, 224])
❌ Relative speed: 0.88 (= 45.1ms / 51.0ms)

OneFlow resnet50 time: 143.6ms (= 7180.3ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 159.8ms (= 7988.1ms / 50, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.11 (= 159.8ms / 143.6ms)

OneFlow resnet50 time: 92.2ms (= 4609.6ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 107.6ms (= 5379.2ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.17 (= 107.6ms / 92.2ms)

OneFlow resnet50 time: 72.6ms (= 3631.3ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 79.1ms (= 3953.7ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.09 (= 79.1ms / 72.6ms)

OneFlow resnet50 time: 70.1ms (= 3505.3ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 72.8ms (= 3638.5ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.04 (= 72.8ms / 70.1ms)

OneFlow resnet50 time: 72.7ms (= 3636.7ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 69.0ms (= 3449.9ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.95 (= 69.0ms / 72.7ms)

@oneflow-ci-bot oneflow-ci-bot removed their request for review September 21, 2021 11:46
const one::MirroredTensorMeta* mirrored_meta =
    reinterpret_cast<const one::MirroredTensorMeta*>(&tensor_meta);
const one::ConsistentTensorMeta* consistent_meta =
    reinterpret_cast<const one::ConsistentTensorMeta*>(&tensor_meta);
CHECK_OR_RETURN(mirrored_meta || consistent_meta);
chengtbf (Contributor Author):

@liufengwei0103 this is where to reproduce it, but one extra line is needed:

CHECK_OR_RETURN(!(mirrored_meta && consistent_meta));

Any NNGraph with an input and a Variable reproduces it. It turns out the same tensor_meta object can be cast to Mirrored meta and to consistent meta at the same time.

chengtbf (Contributor Author):

Please use this commit; the one just now was the version after the deletion, which falls back to the inefficient string-based equality check.

Contributor:

> @liufengwei0103 this is where to reproduce it, but one extra line is needed:
>
> CHECK_OR_RETURN(!(mirrored_meta && consistent_meta));
>
> Any NNGraph with an input and a Variable reproduces it. It turns out the same tensor_meta object can be cast to Mirrored meta and to consistent meta at the same time.

Why not use dynamic_cast here?
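
For reference, a self-contained sketch (with stand-in classes, not the actual OneFlow TensorMeta hierarchy) of why reinterpret_cast cannot tell the two subclasses apart while dynamic_cast can:

#include <cassert>

// Stand-ins for the tensor-meta hierarchy; the base must be polymorphic
// (have a virtual function) for dynamic_cast to inspect the runtime type.
struct TensorMetaBase { virtual ~TensorMetaBase() = default; };
struct MirroredMeta : TensorMetaBase {};
struct ConsistentMeta : TensorMetaBase {};

int main() {
  MirroredMeta m;
  const TensorMetaBase& meta = m;

  // reinterpret_cast never checks the dynamic type: both casts "succeed"
  // (yield non-null pointers), which is exactly the symptom described above.
  const MirroredMeta* r1 = reinterpret_cast<const MirroredMeta*>(&meta);
  const ConsistentMeta* r2 = reinterpret_cast<const ConsistentMeta*>(&meta);
  assert(r1 != nullptr && r2 != nullptr);  // dereferencing r2 would be undefined behavior

  // dynamic_cast consults the runtime type: only the correct downcast succeeds.
  const MirroredMeta* d1 = dynamic_cast<const MirroredMeta*>(&meta);
  const ConsistentMeta* d2 = dynamic_cast<const ConsistentMeta*>(&meta);
  assert(d1 != nullptr && d2 == nullptr);
  return 0;
}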

for (const auto& input_tensor : input_tensors) {
  input_tensors_valid_.push_back(JUST(GetTensorValidInCurRank(input_tensor)));
  inputs_tensor_meta_.push_back(input_tensor->tensor_meta());
chengtbf (Contributor Author):

I suspect the problem is this line: I need to save the input tensor's meta information, but I must not hold on to the input tensor itself (I can't keep a shared_ptr to the Tensor as an NNGraph member, because that would keep the tensor from ever being released and extend its lifetime indefinitely), so I can only copy the meta out. Could that copy be what loses the tensor meta's local/consistent information? @liufengwei0103

chengtbf (Contributor Author):

Could we have a Symbol for tensor meta, to make direct equality comparison easy? @liufengwei0103

Contributor:

> I suspect the problem is this line: I need to save the input tensor's meta information, but I must not hold on to the input tensor itself (I can't keep a shared_ptr to the Tensor as an NNGraph member, because that would keep the tensor from ever being released and extend its lifetime indefinitely), so I can only copy the meta out. Could that copy be what loses the tensor meta's local/consistent information? @liufengwei0103

It looks like an implicit derived-to-base conversion happens here, which drops the subclass information. Should such constructors be disabled in the base class, or is there a scenario that needs them?
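
For illustration, a self-contained sketch of this kind of slicing (again with stand-in classes, not the real TensorMeta types): copying a derived meta into a base-class value keeps only the base part, and one way to forbid that is to make the base class's copy operations protected so only subclasses can use them.

#include <memory>
#include <vector>

struct Device { int id = 0; };

struct TensorMetaBase {
  virtual ~TensorMetaBase() = default;
 protected:
  // Protected default/copy operations prevent accidental slicing through the base
  // while still letting subclasses copy themselves normally.
  TensorMetaBase() = default;
  TensorMetaBase(const TensorMetaBase&) = default;
  TensorMetaBase& operator=(const TensorMetaBase&) = default;
};

struct MirroredMetaSketch : TensorMetaBase {
  std::shared_ptr<Device> device = std::make_shared<Device>();
};

int main() {
  MirroredMetaSketch m;

  // std::vector<TensorMetaBase> by_value;  // storing base values would slice off `device`;
  // by_value.push_back(m);                 // with protected copy this no longer compiles

  // Holding pointers to the base keeps the dynamic type and the device member intact.
  std::vector<std::shared_ptr<const TensorMetaBase>> by_ptr;
  by_ptr.push_back(std::make_shared<MirroredMetaSketch>(m));
  return 0;
}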

Contributor:

> Could we have a Symbol for tensor meta, to make direct equality comparison easy? @liufengwei0103

Do you mean you want tensor->tensor_meta() to return a Symbol<TensorMeta>? @chengtbf

chengtbf (Contributor Author):

> Do you mean you want tensor->tensor_meta() to return a Symbol<TensorMeta>? @chengtbf

Yes, return a Symbol<TensorMeta> directly, and have that Symbol carry the Device or the Placement + SBP information, so I can easily tell whether this TensorMeta is the one I expect.

chengtbf (Contributor Author):

Also, I think the TensorMeta base class should expose Device and Placement interfaces that the subclasses implement separately, just like Tensor does.

chengtbf (Contributor Author):

That way the base class would no longer lose the subclass information.
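
A rough sketch of that interface idea (stand-in types, plain exceptions instead of OneFlow's Maybe<T>): the base class declares virtual device()/placement() accessors, each subclass overrides the one that applies, and callers work entirely through the base type without downcasting.

#include <memory>
#include <stdexcept>
#include <string>

struct Device    { std::string ToString() const { return "cuda:0"; } };
struct Placement { std::string ToString() const { return "placement(cuda, [0])"; } };

struct TensorMeta {
  virtual ~TensorMeta() = default;
  // Each accessor is only meaningful for one subclass; the other one reports an error,
  // similar to how Tensor exposes both local and consistent views behind one interface.
  virtual std::shared_ptr<const Device> device() const {
    throw std::runtime_error("not a local (mirrored) tensor meta");
  }
  virtual std::shared_ptr<const Placement> placement() const {
    throw std::runtime_error("not a consistent tensor meta");
  }
  virtual bool is_local() const = 0;
};

struct MirroredTensorMeta : TensorMeta {
  std::shared_ptr<const Device> dev = std::make_shared<Device>();
  std::shared_ptr<const Device> device() const override { return dev; }
  bool is_local() const override { return true; }
};

struct ConsistentTensorMeta : TensorMeta {
  std::shared_ptr<const Placement> pl = std::make_shared<Placement>();
  std::shared_ptr<const Placement> placement() const override { return pl; }
  bool is_local() const override { return false; }
};

// The description string can now be built from the base interface alone, without any casts.
std::string MetaToString(const TensorMeta& meta) {
  return meta.is_local() ? "device=" + meta.device()->ToString()
                         : "placement=" + meta.placement()->ToString();
}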

@strint (Contributor) left a comment:

Correctness looks fine to me now. We could additionally run a model test to measure the actual performance difference.

@chengtbf (Contributor Author):

> Correctness looks fine to me now. We could additionally run a model test to measure the actual performance difference.

One with inputs, you mean? It seems we don't have such a model repository yet; all of our models use the graph's reader as the input. Only the unit tests take explicit inputs.

Maybe<std::string> GetTensorMetaString(const std::shared_ptr<one::Tensor>& tensor) {
  std::string ret = "shape=" + tensor->shape()->ToString() + ", dtype=" + tensor->dtype()->name();
  if (tensor->is_consistent()) {
    ret += ", placement=" + *JUST(PlacementToString(JUST(tensor->parallel_desc())));
Contributor:

Doesn't tensor->nd_sbp also need to be checked?

chengtbf (Contributor Author):

Yes, it does; I missed it here and will add it.
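
For completeness, a self-contained sketch (stand-in types, not the real OneFlow accessors) of the missing piece: for a consistent tensor the description string would also fold in the SBP signature, so two inputs with the same shape/dtype/placement but different SBP no longer compare equal.

#include <string>
#include <vector>

// Stand-in for an SBP signature: one entry per hierarchy dimension,
// e.g. {"S(0)", "B"} for split-on-axis-0 over the first dimension and broadcast over the second.
using NdSbp = std::vector<std::string>;

std::string NdSbpToString(const NdSbp& nd_sbp) {
  std::string ret = "(";
  for (size_t i = 0; i < nd_sbp.size(); ++i) {
    if (i > 0) { ret += ", "; }
    ret += nd_sbp[i];
  }
  return ret + ")";
}

// Sketch of extending the meta description for consistent tensors.
std::string ConsistentMetaString(const std::string& shape, const std::string& dtype,
                                 const std::string& placement, const NdSbp& nd_sbp) {
  return "shape=" + shape + ", dtype=" + dtype + ", placement=" + placement +
         ", nd_sbp=" + NdSbpToString(nd_sbp);
}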

@oneflow-ci-bot oneflow-ci-bot removed their request for review September 26, 2021 04:03
@strint (Contributor) commented Sep 26, 2021:

> Correctness looks fine to me now. We could additionally run a model test to measure the actual performance difference.
>
> One with inputs, you mean? It seems we don't have such a model repository yet; all of our models use the graph's reader as the input. Only the unit tests take explicit inputs.

That's true. Let's get correctness done first; we can optimize later if a performance bottleneck shows up.

@chengtbf (Contributor Author):

> Correctness looks fine to me now. We could additionally run a model test to measure the actual performance difference.
>
> One with inputs, you mean? It seems we don't have such a model repository yet; all of our models use the graph's reader as the input. Only the unit tests take explicit inputs.
>
> That's true. Let's get correctness done first; we can optimize later if a performance bottleneck shows up.

Yeah. The unit tests with inputs should already be quite a bit slower on master than before too, because of the explicit synchronization.

@oneflow-ci-bot oneflow-ci-bot removed their request for review September 26, 2021 04:25
@github-actions:

Speed stats:
GPU Name: GeForce GTX 1080 

OneFlow resnet50 time: 136.2ms (= 6809.9ms / 50, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 140.7ms (= 7034.9ms / 50, input_shape=[16, 3, 224, 224])
❌ Relative speed: 1.03 (= 140.7ms / 136.2ms)

OneFlow resnet50 time: 78.1ms (= 3904.3ms / 50, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 83.4ms (= 4172.2ms / 50, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.07 (= 83.4ms / 78.1ms)

OneFlow resnet50 time: 51.8ms (= 2588.1ms / 50, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 60.3ms (= 3017.5ms / 50, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.17 (= 60.3ms / 51.8ms)

OneFlow resnet50 time: 51.6ms (= 2580.9ms / 50, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 51.1ms (= 2553.7ms / 50, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 0.99 (= 51.1ms / 51.6ms)

OneFlow resnet50 time: 44.1ms (= 2204.6ms / 50, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 41.3ms (= 2062.6ms / 50, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 0.94 (= 41.3ms / 44.1ms)

OneFlow resnet50 time: 100.7ms (= 5037.1ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 109.6ms (= 5481.9ms / 50, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.09 (= 109.6ms / 100.7ms)

OneFlow resnet50 time: 80.8ms (= 4039.2ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 76.7ms (= 3832.5ms / 50, input_shape=[4, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 0.95 (= 76.7ms / 80.8ms)

OneFlow resnet50 time: 76.5ms (= 3823.1ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 62.3ms (= 3115.3ms / 50, input_shape=[2, 3, 224, 224], ddp, world size=2)
❌ Relative speed: 0.81 (= 62.3ms / 76.5ms)

OneFlow resnet50 time: 71.4ms (= 3572.3ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 62.6ms (= 3132.2ms / 50, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 0.88 (= 62.6ms / 71.4ms)

@oneflow-ci-bot oneflow-ci-bot removed their request for review September 26, 2021 05:37
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot September 26, 2021 06:30
@oneflow-ci-bot oneflow-ci-bot merged commit bc72be1 into master Sep 26, 2021
@oneflow-ci-bot oneflow-ci-bot deleted the dev_cc_check_static_input_tensor_meta branch September 26, 2021 06:44

Successfully merging this pull request may close these issues.

nn.Graph needs a static shape/meta check on input tensors during dynamic execution
5 participants