Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace src tick with with wait and send ids #5603

Merged
merged 37 commits into from
Jul 27, 2021
Merged

Conversation

leaves-zwx
Copy link
Contributor

@leaves-zwx leaves-zwx commented Jul 26, 2021

  • 修改 MultiClientAddWaitAndSendIds,让 wait_and_send_ids 不再连在 source tick 前面,而是直接替代 source tick 连在 tick 前面
  • 还原之前为 source tick 添加 optional input wait_in 的修改,source tick 从语义和用法上看最好别添加输入
  • 去掉 InputKernel 的 Forward 重载,它的空实现会阻止 ForwardDataContent 被调用
  • 修改 test_graph_relu.py 测试用例,这是一个只包含唯一一个 relu op 的 nn.graph

chengtbf and others added 29 commits July 23, 2021 12:20
@leaves-zwx leaves-zwx requested a review from lixinqi July 27, 2021 06:35
@leaves-zwx leaves-zwx marked this pull request as ready for review July 27, 2021 06:36
@chengtbf chengtbf requested a review from strint July 27, 2021 07:06
const auto& input_lbi = op.BnInOp2Lbi(ibn);
if (input_lbi.op_name() == src_op_name) {
tick_op_conf.CopyFrom(op.op_conf());
break;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个 break 不对吧,并不会阻止上面的 ForeEachOperator 的执行。 如果有多个 op 消费了 src_op_name 的 out 呢? 如何保证 tick_op_conf.CopyFrom 的唯一性呢? 我们会有多个 tick op 吗?

Copy link
Contributor Author

@leaves-zwx leaves-zwx Jul 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里的 src_tick_op -> tick_op 是唯一连接,所以后面检查了 tick_conf.tick_size() == 1,且 tick_conf.tick(0) == src_op_name/out

这里我改下 break,应该直接 return Maybe::Ok(); 了

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

直接 return 也不能解决问题。。。。因为 ForEach 仍然会执行后续的 Op,如果有另外一个 op 的 input lbi 里也有这个 src_op_name 怎么办呢?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

看了下 ForEachOperator 的实现,你说的对,它会一直遍历完所有的 operator,并没有提前终止的条件。

src_tick 目前的实现上不会直接连其他 op,它后面一定会跟一个 tick,见 CreateSourceTicksAndSrcSubsetTick

这里我换个方式实现,把检查挪到里面,然后在外面 copy 吧。

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

我们会有多个 tick 吗

@leaves-zwx leaves-zwx removed the request for review from oneflow-ci-bot July 27, 2021 07:31
const auto& input_lbi = op.BnInOp2Lbi(ibn);
if (input_lbi.op_name() == src_op_name) {
tick_op_conf.CopyFrom(op.op_conf());
return Maybe<void>::Ok();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里的 return 和 break 没有区别。但是改变不了上层的 ForEachOperator 的逻辑

@github-actions
Copy link
Contributor

Speed stats:
GPU Name: GeForce GTX 1080 

PyTorch resnet50 time: 140.9ms (= 7046.9ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 126.3ms (= 6316.7ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
Relative speed: 1.12 (= 140.9ms / 126.3ms)

PyTorch resnet50 time: 83.9ms (= 4197.0ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 75.0ms (= 3751.2ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
Relative speed: 1.12 (= 83.9ms / 75.0ms)

PyTorch resnet50 time: 58.8ms (= 2938.1ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 52.1ms (= 2603.9ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
Relative speed: 1.13 (= 58.8ms / 52.1ms)

PyTorch resnet50 time: 50.3ms (= 2515.5ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 49.5ms (= 2476.4ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
Relative speed: 1.02 (= 50.3ms / 49.5ms)

PyTorch resnet50 time: 43.6ms (= 2178.8ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 47.1ms (= 2353.2ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
Relative speed: 0.93 (= 43.6ms / 47.1ms)

@oneflow-ci-bot oneflow-ci-bot removed their request for review July 27, 2021 10:36
@oneflow-ci-bot oneflow-ci-bot self-requested a review July 27, 2021 11:12
@github-actions
Copy link
Contributor

Speed stats:
GPU Name: GeForce GTX 1080 

PyTorch resnet50 time: 140.2ms (= 7012.1ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 126.1ms (= 6307.5ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
Relative speed: 1.11 (= 140.2ms / 126.1ms)

PyTorch resnet50 time: 85.7ms (= 4283.2ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 73.6ms (= 3679.6ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
Relative speed: 1.16 (= 85.7ms / 73.6ms)

PyTorch resnet50 time: 56.5ms (= 2824.5ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 48.4ms (= 2420.6ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
Relative speed: 1.17 (= 56.5ms / 48.4ms)

PyTorch resnet50 time: 47.5ms (= 2377.3ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 50.5ms (= 2523.5ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
Relative speed: 0.94 (= 47.5ms / 50.5ms)

PyTorch resnet50 time: 42.9ms (= 2145.4ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 48.3ms (= 2416.6ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
Relative speed: 0.89 (= 42.9ms / 48.3ms)

@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot July 27, 2021 12:22
@github-actions
Copy link
Contributor

Speed stats:
GPU Name: GeForce GTX 1080 

PyTorch resnet50 time: 139.4ms (= 6972.2ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 126.1ms (= 6304.6ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
Relative speed: 1.11 (= 139.4ms / 126.1ms)

PyTorch resnet50 time: 83.5ms (= 4176.8ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 74.1ms (= 3706.5ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
Relative speed: 1.13 (= 83.5ms / 74.1ms)

PyTorch resnet50 time: 57.9ms (= 2893.2ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 58.7ms (= 2934.1ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
Relative speed: 0.99 (= 57.9ms / 58.7ms)

PyTorch resnet50 time: 47.7ms (= 2383.1ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 58.8ms (= 2940.0ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
Relative speed: 0.81 (= 47.7ms / 58.8ms)

PyTorch resnet50 time: 44.4ms (= 2217.7ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 53.9ms (= 2696.4ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
Relative speed: 0.82 (= 44.4ms / 53.9ms)

@oneflow-ci-bot oneflow-ci-bot removed their request for review July 27, 2021 13:36
@oneflow-ci-bot oneflow-ci-bot merged commit 308f365 into master Jul 27, 2021
@oneflow-ci-bot oneflow-ci-bot deleted the replace_src_tick branch July 27, 2021 13:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants