Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove almost all usage of Blob in EagerBlobObject #7895

Merged
merged 42 commits into from
May 3, 2022

Conversation

daquexian
Copy link
Contributor

@daquexian daquexian commented Mar 25, 2022

改动:

  1. EagerBlobObject 不再是 BlobObject 的子类。删掉了 BlobObject 和另一个子类 LazyRefBlobObject,因为已经没有地方用到
  2. user_op::Tensor 和 user_op::TensorDesc 的数据来源改成了 EagerBlobObject 自己的 shape/data_type/is_dynamic 成员而不再是 Blob
  3. 删掉了 TensorView 指令,因为已经不需要操作 Blob
  4. 受到现实难度的限制,blob_ 目前仍然是 EagerBlobObject 的一个成员(但不再需要手动维护了),完全去掉要等下个 PR
  5. ShapeView/MutShapeView 不再以引用而是以值传递了,EagerBlobObject 的 shape_view() 方法会现场从 shape_ 构造出一个 ShapeView 对象。原因是
    1. 引用传递要求 EagerBlobObject 同时维护 shape_、shape_view_ 和 mut_shape_view_ 三个对象,而且只有上层代码才知道什么时候需要同步这三个对象的内容。
    2. 一般来讲 View 类型都是推荐用值传递的:https://stackoverflow.com/questions/27256377/c-view-types-pass-by-const-or-by-value
  6. 删除了 Shape 对象能在 O(1) 时间内得到 element count 的特性,这个特性阻碍了从 Shape 得到 MutShapeView,而且遍历一次 dim_vec 计算 element count 应该不会很耗时(实际情况还要跑 ci 看看)

Signed-off-by: daquexian <daquexian566@gmail.com>
Signed-off-by: daquexian <daquexian566@gmail.com>
oneflow/core/framework/nn_graph.h Outdated Show resolved Hide resolved
oneflow/core/framework/tensor_methods.cpp Outdated Show resolved Hide resolved
oneflow/core/job/runtime.cpp Outdated Show resolved Hide resolved
oneflow/core/register/register_manager.cpp Outdated Show resolved Hide resolved
oneflow/core/register/register_manager.cpp Outdated Show resolved Hide resolved
@github-actions
Copy link
Contributor

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

@Flowingsun007
Copy link
Contributor

Flowingsun007 commented Mar 29, 2022

swin性能测试

环境

数据

master分支@e62d826

200iter 显存(mb) 速度(s)
1n1g 5512*1 27
1n4g 5670*4 28
1n8g 5850*8 35
  • cpu 1n8g 1iter 145s

本分支@18215dc

200iter 显存(mb) 速度(s)
1n1g 5502*1 27
1n4g 5680*4 28
1n8g 5850*8 35
  • cpu 1n8g 1iter 142s

结论

基于swin测试发现:

  • 1.cuda下,此分支和master上速度、显存占用在单卡-多卡下都相近,几乎没有差异;
  • 2.猜测该分支改动带来的收益在cpu环境(view op)下较为明显,在cuda环境下几乎没有影响;后续需要根据德澎einsum的case:https://github.com/Oneflow-Inc/OneTeam/issues/1133#issue-1157969421 用nsight profile一下,才能准确看出优化程度。

@daquexian daquexian enabled auto-merge (squash) March 30, 2022 12:31
@@ -29,8 +29,8 @@ class BlobTensorView final : public Tensor {
explicit BlobTensorView(Blob* blob);
~BlobTensorView() = default;

const ShapeView& shape() const override;
MutShapeView* mut_shape() override;
ShapeView shape() const override;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这块为啥去掉const

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个是返回一个值,所以不用加 const,以及 ShapeView 本身就没有写数据的操作(和 MutShapeView 对应)

void InitNonPODTypeEagerBlobObjectIfNeed(MemoryAllocator* allocator,
vm::EagerBlobObject* eager_blob_object_ptr) {
if (eager_blob_object_ptr->data_type() == kOFRecord) {
int64_t elem_cnt = eager_blob_object_ptr->shape().elem_cnt();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

int64_t elem_cnt = eager_blob_object_ptr->shape().elem_cnt();可以提取到
L124之前

@github-actions
Copy link
Contributor

CI failed when running job: cuda-benchmark. PR label automerge has been removed

@github-actions
Copy link
Contributor

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

1 similar comment
@github-actions
Copy link
Contributor

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

Signed-off-by: daquexian <daquexian566@gmail.com>
@daquexian daquexian force-pushed the remove_blob_from_eager_blob_object branch from 11cb68c to d365d43 Compare April 28, 2022 08:53
Signed-off-by: daquexian <daquexian566@gmail.com>
@github-actions
Copy link
Contributor

Speed stats:

@github-actions
Copy link
Contributor

CI failed when running job: Build cpu. PR label automerge has been removed

@github-actions
Copy link
Contributor

CI failed when running job: cpu-misc. PR label automerge has been removed

@github-actions
Copy link
Contributor

Speed stats:

Signed-off-by: daquexian <daquexian566@gmail.com>
Signed-off-by: daquexian <daquexian566@gmail.com>
@github-actions
Copy link
Contributor

github-actions bot commented May 3, 2022

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/7895/

@daquexian daquexian merged commit 3a75d14 into master May 3, 2022
@daquexian daquexian deleted the remove_blob_from_eager_blob_object branch May 3, 2022 10:00
@lixinqi
Copy link
Contributor

lixinqi commented May 3, 2022

后来有没有查到一开始性能受损的原因?

@daquexian
Copy link
Contributor Author

后来有没有查到一开始性能受损的原因?

没有,可能是 speed test 退役、新版 benchmark 上线之后,就能通过了

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants