better repr of nn.Graph for debug #5762

Merged
merged 33 commits into from
Aug 7, 2021


@strint strint commented Aug 5, 2021

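Adds Module- and Tensor-level debugging support while an nn.Graph is being built.

The input and output tensors, and especially the sbp and placement of consistent tensors, strongly affect how the graph executes, yet this information is hidden inside the tensor. This change surfaces it so that a graph is easier to debug.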

print(tensor._meta_repr())

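When an nn.Graph is printed, the tensor's actual data is not of interest and only its meta information is shown, so a built-in _meta_repr is provided for use by nn.Graph and internal code.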

tensor(flow.Size([10]),
       placement=oneflow.placement(device_type="cpu", machine_device_ids={0 : [0]}, hierarchy=(1,)),
       sbp=(oneflow.sbp.broadcast,), dtype=oneflow.float32)

tensor(flow.Size([3, 9]), device='cuda:0', dtype=oneflow.float32)
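
A minimal sketch of the idea behind a metadata-only repr, written in plain Python so it runs without OneFlow. `ToyTensor` and its `meta_repr` method are hypothetical stand-ins invented for illustration; they are not the OneFlow classes, and the exact output format of the real `_meta_repr()` is the one shown above, not the one produced here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ToyTensor:
    """Hypothetical stand-in for a tensor; not the OneFlow class."""

    shape: Tuple[int, ...]
    dtype: str
    device: Optional[str] = None            # set for local tensors
    placement: Optional[str] = None         # set for consistent tensors
    sbp: Optional[Tuple[str, ...]] = None   # set for consistent tensors

    def meta_repr(self) -> str:
        # Report only metadata; element values are never read.
        parts = [f"Size({list(self.shape)})"]
        if self.placement is not None:
            parts.append(f"placement={self.placement}")
            parts.append(f"sbp={self.sbp}")
        else:
            parts.append(f"device='{self.device}'")
        parts.append(f"dtype={self.dtype}")
        return "tensor(" + ", ".join(parts) + ")"


local = ToyTensor(shape=(3, 9), dtype="float32", device="cuda:0")
consistent = ToyTensor(
    shape=(10,), dtype="float32", placement="cpu:0", sbp=("broadcast",)
)
print(local.meta_repr())
print(consistent.meta_repr())
```

The point of the design is that the string depends only on shape, dtype, and location attributes, so it stays cheap and readable even for large tensors.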

print(repr(graph)) or print(graph)

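Printing after the graph has been instantiated but before it has been run

At this point there is no input or output information, only the structure of the modules.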

repr(alexnet_graph) before run: 
 (GRAPH:AlexNetGraph_0:AlexNetGraph): (
  (MODULE:alexnet:AlexNet()): (
    (MODULE:alexnet.features:Sequential()): (
      (MODULE:alexnet.features.0:Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))): (
        (PARAMETER:alexnet.features.0.weight:tensor(flow.Size([64, 3, 11, 11]), device='cuda:0', dtype=oneflow.float32,
               requires_grad=True)): ()
        (PARAMETER:alexnet.features.0.bias:tensor(flow.Size([64]), device='cuda:0', dtype=oneflow.float32,
               requires_grad=True)): ()
      )

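Turning on graph debug to print while the graph is being built

The build often crashes before the graph is complete; with debug on, you can see the repr up to the point of the crash.

  • graph.debug() turns the switch on
  • graph.debug(False) turns the switch off
  • The switch is off by default
  • It only takes effect during the compile phase of the graph's first run; when it is on, the corresponding tensor and module information is printed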
(GRAPH:AlexNetGraph_0:AlexNetGraph) start graph construting.
(INPUT:_AlexNetGraph_0-input_0:tensor(flow.Size([4, 3, 224, 224]), device='cuda:0', dtype=oneflow.float32))
(INPUT:_AlexNetGraph_0-input_1:tensor(flow.Size([4]), device='cuda:0', dtype=oneflow.int32))
(MODULE:alexnet:AlexNet())
(INPUT:_alexnet-input_0:tensor(flow.Size([4, 3, 224, 224]), device='cuda:0', is_lazy ='True',
       dtype=oneflow.float32))
(MODULE:alexnet.features:Sequential())
(INPUT:_alexnet.features-input_0:tensor(flow.Size([4, 3, 224, 224]), device='cuda:0', is_lazy ='True',
       dtype=oneflow.float32))
(MODULE:alexnet.features.0:Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2)))
(INPUT:_alexnet.features.0-input_0:tensor(flow.Size([4, 3, 224, 224]), device='cuda:0', is_lazy ='True',
       dtype=oneflow.float32))
(PARAMETER:alexnet.features.0.weight:tensor(flow.Size([64, 3, 11, 11]), device='cuda:0', dtype=oneflow.float32,
       requires_grad=True))
(PARAMETER:alexnet.features.0.bias:tensor(flow.Size([64]), device='cuda:0', dtype=oneflow.float32,
       requires_grad=True))
(OUTPUT:_alexnet.features.0-output_0:tensor(flow.Size([4, 64, 55, 55]), device='cuda:0', is_lazy ='True',
       dtype=oneflow.float32))
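
The switch behaviour described above can be illustrated with a small plain-Python sketch. `ToyGraph`, its `trace` list, and the rule that a module name ending in "broken" fails are all hypothetical and exist only for this example; the real nn.Graph builds from modules, not from a list of names. The sketch shows why printing during the build helps: the lines emitted before the failure are still available.

```python
class ToyGraph:
    """Hypothetical graph with a debug switch; not OneFlow's nn.Graph."""

    def __init__(self, name):
        self.name = name
        self._debug = False      # off by default
        self._compiled = False
        self.trace = []          # lines emitted while building

    def debug(self, mode=True):
        self._debug = mode

    def _emit(self, line):
        # Print immediately so the line survives a later crash.
        if self._debug:
            self.trace.append(line)
            print(line)

    def __call__(self, module_names):
        if self._compiled:
            return "ran"         # later runs skip the build and print nothing
        self._emit(f"(GRAPH:{self.name}) start graph construction.")
        for module_name in module_names:
            self._emit(f"(MODULE:{module_name})")
            if module_name.endswith("broken"):
                raise RuntimeError(f"build failed in {module_name}")
        self._compiled = True
        return "built"


graph = ToyGraph("ToyGraph_0")
graph.debug()
try:
    graph(["net", "net.features", "net.broken", "net.classifier"])
except RuntimeError as err:
    print("crashed:", err)
```

Here `net.classifier` is never reached, but the trace already names the module in which the build stopped.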

Printing after the graph has been run once

repr(alexnet_graph) after run: 
 (GRAPH:AlexNetGraph_0:AlexNetGraph): (
  (INPUT:_AlexNetGraph_0-input_0:tensor(flow.Size([4, 3, 224, 224]), device='cuda:0', dtype=oneflow.float32))
  (INPUT:_AlexNetGraph_0-input_1:tensor(flow.Size([4]), device='cuda:0', dtype=oneflow.int32))
  (MODULE:alexnet:AlexNet()): (
    (INPUT:_alexnet-input_0:tensor(flow.Size([4, 3, 224, 224]), device='cuda:0', is_lazy ='True',
           dtype=oneflow.float32))
    (MODULE:alexnet.features:Sequential()): (
      (INPUT:_alexnet.features-input_0:tensor(flow.Size([4, 3, 224, 224]), device='cuda:0', is_lazy ='True',
             dtype=oneflow.float32))
      (MODULE:alexnet.features.0:Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))): (
        (INPUT:_alexnet.features.0-input_0:tensor(flow.Size([4, 3, 224, 224]), device='cuda:0', is_lazy ='True',
               dtype=oneflow.float32))
        (PARAMETER:alexnet.features.0.weight:tensor(flow.Size([64, 3, 11, 11]), device='cuda:0', dtype=oneflow.float32,
               requires_grad=True)): ()
        (PARAMETER:alexnet.features.0.bias:tensor(flow.Size([64]), device='cuda:0', dtype=oneflow.float32,
               requires_grad=True)): ()
        (OUTPUT:_alexnet.features.0-output_0:tensor(flow.Size([4, 64, 55, 55]), device='cuda:0', is_lazy ='True',
               dtype=oneflow.float32))
      )

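_shallow_repr() for graph, block, and module

A repr that does not recurse into submodules. It exists so that debug output can be printed while the graph is being built, because the build often crashes before the graph is complete.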
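
As a rough illustration of the difference between a shallow and a recursive repr, here is a plain-Python sketch. `ToyBlock` and its `shallow_repr` method are hypothetical names for this example only; they mimic the nesting style of the output above but are not the OneFlow implementation.

```python
class ToyBlock:
    """Hypothetical block node; not OneFlow's Block class."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def shallow_repr(self):
        # One line for this node only; children are not visited.
        return f"(MODULE:{self.name})"

    def __repr__(self):
        # Full repr: this node's line followed by indented child reprs.
        if not self.children:
            return self.shallow_repr() + ": ()"
        lines = [self.shallow_repr() + ": ("]
        for child in self.children:
            for line in repr(child).splitlines():
                lines.append("  " + line)
        lines.append(")")
        return "\n".join(lines)


conv = ToyBlock("net.features.0")
features = ToyBlock("net.features", [conv])
net = ToyBlock("net", [features])

print(net.shallow_repr())  # safe to call while children are still being built
print(repr(net))           # walks the whole tree
```

The shallow form never touches a child, so it can be emitted as each module is entered, while the recursive form is only meaningful once the tree below the node exists.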

@strint strint requested review from leaves-zwx and chengtbf August 5, 2021 18:22
@strint strint marked this pull request as ready for review August 5, 2021 18:27
@strint strint requested review from oneflow-ci-bot and removed request for oneflow-ci-bot August 5, 2021 19:24
@oneflow-ci-bot oneflow-ci-bot removed their request for review August 5, 2021 21:41
@strint strint requested a review from oneflow-ci-bot August 6, 2021 03:11
@strint strint added this to the v0.5.0 milestone Aug 6, 2021
@strint strint requested review from oneflow-ci-bot and removed request for oneflow-ci-bot August 6, 2021 07:23
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot August 7, 2021 05:35

github-actions bot commented Aug 7, 2021

CI failed, removing label automerge

@github-actions github-actions bot removed the automerge label Aug 7, 2021
@oneflow-ci-bot oneflow-ci-bot removed their request for review August 7, 2021 07:19
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot August 7, 2021 08:11

github-actions bot commented Aug 7, 2021

Speed stats:
GPU Name: GeForce GTX 1080 

PyTorch resnet50 time: 144.2ms (= 7208.3ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 126.0ms (= 6298.4ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
Relative speed: 1.14 (= 144.2ms / 126.0ms)

PyTorch resnet50 time: 82.1ms (= 4105.6ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 72.7ms (= 3634.2ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
Relative speed: 1.13 (= 82.1ms / 72.7ms)

PyTorch resnet50 time: 57.1ms (= 2856.6ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 46.7ms (= 2332.6ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
Relative speed: 1.22 (= 57.1ms / 46.7ms)

PyTorch resnet50 time: 45.3ms (= 2265.4ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 43.3ms (= 2163.0ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
Relative speed: 1.05 (= 45.3ms / 43.3ms)

PyTorch resnet50 time: 40.4ms (= 2019.0ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 42.6ms (= 2131.4ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
Relative speed: 0.95 (= 40.4ms / 42.6ms)

@oneflow-ci-bot oneflow-ci-bot removed their request for review August 7, 2021 10:49
@strint strint requested review from oneflow-ci-bot and removed request for oneflow-ci-bot August 7, 2021 13:20
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot August 7, 2021 14:25
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot August 7, 2021 16:23

github-actions bot commented Aug 7, 2021

Speed stats:
GPU Name: GeForce GTX 1080 

PyTorch resnet50 time: 138.4ms (= 6918.2ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 126.0ms (= 6298.4ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
Relative speed: 1.10 (= 138.4ms / 126.0ms)

PyTorch resnet50 time: 82.1ms (= 4104.5ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 72.7ms (= 3636.8ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
Relative speed: 1.13 (= 82.1ms / 72.7ms)

PyTorch resnet50 time: 56.8ms (= 2841.8ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 47.3ms (= 2365.2ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
Relative speed: 1.20 (= 56.8ms / 47.3ms)

PyTorch resnet50 time: 47.7ms (= 2386.0ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 41.4ms (= 2069.9ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
Relative speed: 1.15 (= 47.7ms / 41.4ms)

PyTorch resnet50 time: 41.5ms (= 2075.2ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 48.8ms (= 2437.6ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
Relative speed: 0.85 (= 41.5ms / 48.8ms)

@oneflow-ci-bot oneflow-ci-bot removed their request for review August 7, 2021 19:07
@oneflow-ci-bot oneflow-ci-bot merged commit e964367 into master Aug 7, 2021
@oneflow-ci-bot oneflow-ci-bot deleted the fea/nn_graph/in_out_and_sbp_in_repr branch August 7, 2021 19:07
+ "-input_"
+ str(idx)
+ ":"
+ arg._meta_repr()

This does not handle the case where arg is None, or where it is not a Tensor.
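
One possible way to address this comment, sketched in plain Python. The helper name `describe_input`, the `FakeTensor` class, and the placeholder text for non-tensor arguments are assumptions made for illustration; this is not the code that was merged in the PR.

```python
def describe_input(block_name, idx, arg):
    """Build an INPUT line without assuming arg is a tensor (hypothetical helper)."""
    prefix = "(INPUT:_" + block_name + "-input_" + str(idx) + ":"
    if arg is None:
        return prefix + "None)"
    meta_repr = getattr(arg, "_meta_repr", None)
    if callable(meta_repr):
        # Tensor-like object: use its metadata-only repr.
        return prefix + meta_repr() + ")"
    # Anything else: record only the type, never the value.
    return prefix + "[non-tensor: " + type(arg).__name__ + "])"


class FakeTensor:
    """Stand-in that exposes _meta_repr the way the quoted line expects."""

    def _meta_repr(self):
        return "tensor(Size([4]))"


print(describe_input("net", 0, FakeTensor()))
print(describe_input("net", 1, None))
print(describe_input("net", 2, 3.5))
```

Checking for the method instead of the concrete type keeps the helper independent of any tensor class, at the cost of accepting any object that happens to define `_meta_repr`.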


4 participants