Graph rename v2 #9351

Merged
merged 84 commits into from
Nov 17, 2022
Conversation

@strint strint commented Nov 1, 2022

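This PR removes the attributes and config on Block.

  • 1. Completely avoids name collisions;
  • 2. Removes the block config.

Implementation:

| | Eager original | Proxy (base class: Proxy) | GraphBlock (base class: GraphBlock) |
| --- | --- | --- | --- |
| Function | Gives access to the original eager type | Proxies execution. The execution interface is the same as for Module and Tensor, but the behavior has changed: for example, it is lazy, and the ops that run may have been rewritten | Corresponds to one block of Graph code. Stores the information the graph needs to execute, such as name, scope, the lazy op or tensor, and the per-module graph optimization switches |
| Module | Module | ProxyModule, which holds a Module member and a GraphModule member | GraphModule |
| Tensor | Tensor | ProxyTensor, which holds a Tensor member and a GraphTensor member | GraphTensor |

Usage example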

from oneflow.nn.graph import GraphModule
import oneflow.nn as nn

class AGraph(nn.Graph):
    def __init__(self, module: nn.Module):
        super().__init__()

        self.m = module
        # self.m is a ProxyModule.
        # A ProxyModule has two main parts: the original module and a GraphModule.
        self.m.name  # by default, returns the eager module's name
        self.m.to(GraphModule).name  # returns the GraphModule's name
        self.m.to(nn.Module)  # returns the original nn.Module

        # How to reach the config on the GraphModule
        # (id and placement are placeholders for the caller's values)
        self.m.to(GraphModule).set_stage(id, placement)
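
To make the proxy layout above concrete, here is a minimal sketch in plain Python. It does not use OneFlow: `EagerModule`, the stand-in `GraphModule`, the `stage` field, and the body of `to()` are all assumptions made for illustration, not the code in this PR.

```python
class GraphModule:
    """Hypothetical stand-in for the graph-side half of a proxy."""

    def __init__(self, name):
        self.name = name
        self.stage = None

    def set_stage(self, stage_id, placement=None):
        self.stage = (stage_id, placement)


class EagerModule:
    """Hypothetical stand-in for an eager nn.Module."""

    def __init__(self):
        self.name = "eager_linear"


class ProxyModule:
    """Holds the eager original and its graph representation side by side."""

    def __init__(self, origin, graph_name):
        self._origin = origin
        self._graph = GraphModule(graph_name)

    def to(self, target_type):
        # Return whichever half matches the requested type.
        if isinstance(self._graph, target_type):
            return self._graph
        if isinstance(self._origin, target_type):
            return self._origin
        raise TypeError(f"cannot convert proxy to {target_type.__name__}")

    def __getattr__(self, attr):
        # Called only when normal lookup fails, so plain attribute access
        # falls back to the eager module.
        if attr.startswith("_"):
            raise AttributeError(attr)
        return getattr(self._origin, attr)


proxy = ProxyModule(EagerModule(), graph_name="m")
eager_name = proxy.name                  # resolved on the eager module
graph_name = proxy.to(GraphModule).name  # resolved on the graph half
proxy.to(GraphModule).set_stage(0)       # graph-only config stays on the graph half
```

Keeping graph-only settings behind an explicit `to(GraphModule)` call is what stops them from shadowing attributes the user defined on the eager module.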

Fix issue: #9193

This PR also supports property lookup when an nn.Module uses multiple inheritance.

Fix issue: #9345 and #9186
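
One way a proxy could resolve properties under multiple inheritance is to walk the wrapped object's full MRO instead of checking only its own class. The sketch below is an assumption about the general technique, not the PR's actual code; `Base`, `Mixin`, `Child`, and `Proxy` are hypothetical names.

```python
class Base:
    @property
    def scale(self):
        return self.factor * 2


class Mixin:
    pass


class Child(Mixin, Base):
    def __init__(self):
        self.factor = 3


class Proxy:
    def __init__(self, origin):
        self._origin = origin

    def __getattr__(self, attr):
        if attr.startswith("_"):
            raise AttributeError(attr)
        # Walk the whole MRO, not just type(origin).__dict__, so that
        # properties defined on any base class are found.
        for cls in type(self._origin).__mro__:
            member = cls.__dict__.get(attr)
            if isinstance(member, property):
                # Evaluate the getter against the proxy so that nested
                # lookups (self.factor here) also go through the proxy.
                return member.fget(self)
        return getattr(self._origin, attr)


scaled = Proxy(Child()).scale  # found on Base via the MRO
```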

@strint strint marked this pull request as ready for review November 1, 2022 20:58
@strint strint requested a review from doombeaker as a code owner November 2, 2022 15:06
@@ -15,3 +15,6 @@
"""
from .graph import Graph
from .block import Block
from .block_graph import BlockGraph
from .block_graph import ModuleGraph
from .block_graph import TensorGraph

Contributor Author

The types above are exported so that they can be used for conversion.

Contributor

Has everyone signed off on this naming? @leaves-zwx @CPFLAME @jackalcooper

My main concern is TensorGraph: should it be GraphTensor instead, or gtensor? On first hearing, TensorGraph does not make clear whether this is a Tensor, a Graph, or a Graph made up of Tensors.

The same applies to ModuleGraph versus GraphModule / gmodule.

Contributor Author

Origin Type: Module, Tensor
Graph Type: ModuleGraph, TensorGraph
Block Type: ModuleBlock, TensorBlock

A Block contains two parts, an Origin and a Graph.

That is the correspondence.

Contributor

This reads oddly. The usual convention is to put the modifier first.

Contributor Author

That works too. I will rename everything consistently as follows:

Origin Type: Module, Tensor
Graph Type: GraphModule, GraphTensor
Block Type: BlockModule, BlockTensor

r"""ModuleGraph is the graph representation of a nn.Module in a nn.Graph.

When an nn.Module is added into an nn.Graph, it is wrapped into a ModuleBlock. The ModuleBlock has a ModuleGraph inside it.
You can get and set the ModuleGraph to enable graph optimization on the nn.Module.

Contributor Author

This is the definition of ModuleGraph. The stage and activation checkpointing settings are now configured directly on ModuleGraph.

belonged_graph: weakref.ProxyTypes = None,
tensor_graph_type: BlockGraphType = BlockGraphType.NONE
):
super().__init__(prefix, name, belonged_graph, tensor_graph_type)

Contributor Author

TensorGraph is used only for conversion. It provides the base class methods and nothing else.

if lines is not None:
main_str += "\n " + "\n ".join(lines) + "\n"
main_str += ")"
return main_str

Contributor Author

Printed output of a ModuleGraph:

    (ModuleGraph:linears(activation_checkpointing=True, )): (
      (OPERATOR: linears-identity-0(_SeqGraph_0_input.0.0_2/out:(sbp=(B), size=(4, 10), dtype=(oneflow.float32))) -> (linears-identity-0/out_0:(sbp=(B), size=(4, 10), dtype=(oneflow.float32))), placement=(oneflow.placement(type="cpu", ranks=[0])))
    )
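
The printed form above can be approximated with a small formatter. This is a sketch only: `format_graph_block` and its parameters are hypothetical, and the option rendering is inferred from the trailing `, ` in the sample output rather than taken from the PR's code.

```python
def format_graph_block(kind, name, options, lines=None):
    # Render options as "key=value, " pairs, matching the trailing ", "
    # seen in the sample output.
    opts = "".join(f"{key}={value}, " for key, value in options.items())
    main_str = f"({kind}:{name}({opts})): ("
    if lines:
        # Child entries go on their own indented lines.
        main_str += "\n  " + "\n  ".join(lines) + "\n"
    main_str += ")"
    return main_str


text = format_graph_block(
    "ModuleGraph",
    "linears",
    {"activation_checkpointing": True},
    ["(OPERATOR: linears-identity-0)"],
)
```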

# The original data
self._oneflow_internal_origin__ = None
# The graph representation of the original data
self._oneflow_internal_blockgraph__ = None

Contributor Author

Block's data is split into two parts, the original eager part and the graph part, to avoid conflicts.
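
A minimal sketch of why this split avoids collisions, assuming the wrapper forwards ordinary attribute access to the eager original. The two slot names come from the diff above; `UserModule`, the simplified `Block`, and the forwarding logic are hypothetical and written in plain Python.

```python
from types import SimpleNamespace


class UserModule:
    """Hypothetical eager module whose own attributes are `name` and `config`."""

    def __init__(self):
        self.name = "user-chosen name"
        self.config = {"hidden": 8}


class Block:
    def __init__(self, origin, graph_part):
        # The wrapper's own state lives only under long, reserved names,
        # so it cannot clash with attributes defined by the user.
        self._oneflow_internal_origin__ = origin
        self._oneflow_internal_blockgraph__ = graph_part

    def __getattr__(self, attr):
        # Reached only when normal lookup fails; forward to the original.
        if attr.startswith("_oneflow_internal_"):
            raise AttributeError(attr)
        return getattr(self._oneflow_internal_origin__, attr)


block = Block(UserModule(), SimpleNamespace(name="graph.m"))
user_name = block.name                                  # the user's attribute
graph_name = block._oneflow_internal_blockgraph__.name  # the graph-side name
```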

self._type = BlockType.MODULE
self._is_executing_forward = False
super().__init__()
self._oneflow_internal_blockgraph__ = ModuleGraph(prefix, name, belonged_graph)

Contributor Author

ModuleBlock initialization: it holds a ModuleGraph.

def set_origin(self, origin):
self._origin = origin
def _oneflow_internal_blockgraph__set_origin(self, origin):
self._oneflow_internal_origin__ = origin

Contributor Author

ModuleBlock initialization: it holds the original Module.

@strint strint requested review from oneflow-ci-bot and removed request for oneflow-ci-bot November 17, 2022 08:50
@github-actions
Contributor

Speed stats:
GPU Name: GeForce GTX 1080

❌ OneFlow resnet50 time: 140.0ms (= 13998.8ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 160.6ms (= 16056.3ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 160.6ms / 140.0ms)

OneFlow resnet50 time: 85.2ms (= 8520.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 102.4ms (= 10241.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.20 (= 102.4ms / 85.2ms)

OneFlow resnet50 time: 57.8ms (= 11562.2ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 81.2ms (= 16235.2ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.40 (= 81.2ms / 57.8ms)

OneFlow resnet50 time: 44.5ms (= 8902.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 80.3ms (= 16064.0ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.80 (= 80.3ms / 44.5ms)

OneFlow resnet50 time: 40.2ms (= 8031.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 67.7ms (= 13546.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.69 (= 67.7ms / 40.2ms)
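
The figures in these reports follow a simple pattern: each per-iteration time is the total time divided by the iteration count, and the relative speed is the PyTorch time divided by the OneFlow time. The sketch below reproduces the first batch-size-16 entry above; the function names are hypothetical, and rounding before dividing is an assumption that happens to match the reported values.

```python
def per_iter_ms(total_ms, iterations):
    # Average time per iteration in milliseconds.
    return total_ms / iterations


def relative_speed(pytorch_ms, oneflow_ms):
    # Values above 1.0 mean OneFlow is faster than PyTorch.
    return pytorch_ms / oneflow_ms


oneflow_ms = round(per_iter_ms(13998.8, 100), 1)          # 140.0
pytorch_ms = round(per_iter_ms(16056.3, 100), 1)          # 160.6
speed = round(relative_speed(pytorch_ms, oneflow_ms), 2)  # 1.15
```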

@github-actions
Contributor

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9351/

@github-actions
Contributor

Speed stats:
GPU Name: GeForce GTX 1080

❌ OneFlow resnet50 time: 140.0ms (= 13996.2ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 163.6ms (= 16364.8ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.17 (= 163.6ms / 140.0ms)

OneFlow resnet50 time: 85.3ms (= 8526.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 101.6ms (= 10162.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.19 (= 101.6ms / 85.3ms)

OneFlow resnet50 time: 57.9ms (= 11587.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 86.7ms (= 17331.2ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.50 (= 86.7ms / 57.9ms)

OneFlow resnet50 time: 44.6ms (= 8915.6ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 72.2ms (= 14432.0ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.62 (= 72.2ms / 44.6ms)

OneFlow resnet50 time: 40.9ms (= 8181.1ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 76.3ms (= 15264.1ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.87 (= 76.3ms / 40.9ms)

@github-actions
Contributor

Speed stats:
GPU Name: GeForce GTX 1080

❌ OneFlow resnet50 time: 140.4ms (= 14036.6ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 162.7ms (= 16271.8ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 162.7ms / 140.4ms)

OneFlow resnet50 time: 85.6ms (= 8560.4ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 110.8ms (= 11076.3ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.29 (= 110.8ms / 85.6ms)

OneFlow resnet50 time: 57.6ms (= 11528.8ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 87.3ms (= 17458.5ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.51 (= 87.3ms / 57.6ms)

OneFlow resnet50 time: 44.5ms (= 8899.9ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 70.5ms (= 14108.0ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.59 (= 70.5ms / 44.5ms)

OneFlow resnet50 time: 39.4ms (= 7876.0ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 68.3ms (= 13656.9ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.73 (= 68.3ms / 39.4ms)

@github-actions
Contributor

CI failed when running job: cuda-module. PR label automerge has been removed

@strint strint requested review from oneflow-ci-bot and removed request for oneflow-ci-bot November 17, 2022 13:46
@github-actions
Contributor

Speed stats:
GPU Name: GeForce GTX 1080

❌ OneFlow resnet50 time: 140.6ms (= 14061.5ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 162.2ms (= 16222.6ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 162.2ms / 140.6ms)

OneFlow resnet50 time: 86.3ms (= 8634.4ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 102.2ms (= 10219.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.18 (= 102.2ms / 86.3ms)

OneFlow resnet50 time: 57.6ms (= 11522.7ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 79.7ms (= 15931.7ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.38 (= 79.7ms / 57.6ms)

OneFlow resnet50 time: 44.4ms (= 8884.2ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 70.7ms (= 14130.3ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.59 (= 70.7ms / 44.4ms)

OneFlow resnet50 time: 40.7ms (= 8139.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 77.4ms (= 15473.8ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.90 (= 77.4ms / 40.7ms)

@strint strint requested review from oneflow-ci-bot and removed request for oneflow-ci-bot November 17, 2022 14:05
@github-actions
Contributor

Speed stats:
GPU Name: GeForce GTX 1080

❌ OneFlow resnet50 time: 140.4ms (= 14044.8ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 161.7ms (= 16170.3ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 161.7ms / 140.4ms)

OneFlow resnet50 time: 85.4ms (= 8542.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 111.6ms (= 11155.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.31 (= 111.6ms / 85.4ms)

OneFlow resnet50 time: 57.9ms (= 11577.3ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 87.6ms (= 17515.4ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.51 (= 87.6ms / 57.9ms)

OneFlow resnet50 time: 44.5ms (= 8895.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 81.1ms (= 16216.2ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.82 (= 81.1ms / 44.5ms)

OneFlow resnet50 time: 41.5ms (= 8304.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 68.0ms (= 13605.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.64 (= 68.0ms / 41.5ms)

5 participants