
flow.tmp_compute_stream #8866

Merged
merged 121 commits into from Aug 30, 2022

Conversation

lixinqi commented Aug 8, 2022

Expose the VM's worker threads to user code through the new API oneflow.asyncs.thread.
Usage example:

loss = model()
with flow.asyncs.thread(thread_global_id=2):
    write_metrics(loss)

After discussing this with Jianhao, Houjiang, and Xiaoyu, we decided not to expose concepts such as stream or StreamSet to the Python layer for now: they are too costly to explain and very easy for users to misunderstand, and simply exposing the thread concept already covers the vast majority of business needs.


lixinqi commented Aug 29, 2022

experimental

In torch, experimental utility classes usually go under utils.xxx and are promoted to a second-level or top-level namespace only once they have stabilized.

The earlier experimental namespace for experimental features has also already been removed.

I have never found a torch.experimental.xxx API. At most there are functorch.experimental.xxx and torchtext.experimental.xxx, and those look like libraries built on top of the pytorch framework rather than part of torch itself. Alternatively, experimental can still be found under some of torch's sub-namespaces:
https://pytorch.org/docs/stable/search.html?q=torch.experimental&check_keywords=yes&area=default#

Comment on lines 56 to 58
for tensor in tensors:
    test_case.assertEqual(tensor[0], 1)
    test_case.assertEqual(tensor[int(tensor.shape[0] / 2)], 1)
Contributor commented:

What exactly is being tested here? If it is communication correctness, that does not seem to be verified: even if the communication between ranks were misaligned, the tensor values would still be all ones.

lixinqi (author) replied:

The original intent here was to test for deadlock.
Perhaps I should go further and test the values as well.
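
A minimal sketch of such a value check, assuming each rank fills its buffer with a rank-dependent value so that a misrouted exchange cannot still look like success; do_exchange and expected_peer below are hypothetical placeholders for the communication op under test and its peer mapping, not code from this PR:

import oneflow as flow

rank = flow.env.get_rank()

# Send a value unique to this rank instead of all ones.
sent = flow.full((8,), float(rank + 1))

# do_exchange stands in for the collective under test; expected_peer maps
# this rank to the rank whose data it should receive.
received = do_exchange(sent)
expected = float(expected_peer(rank) + 1)

# Spot-check a few positions, mirroring the original assertions.
for i in (0, received.shape[0] // 2, received.shape[0] - 1):
    test_case.assertEqual(received[i], expected)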


Handled: 1dfa972

@github-actions

CI failed when running job: Build cpu. PR label automerge has been removed


lixinqi commented Aug 30, 2022

Alternative proposals and their justifications

Xinqi and Jianhao proposed flow.StreamSet / flow.stream_set

Xiaoyu: it should not go directly into the top-level oneflow namespace.
Houjiang: the word "Stream" is a poor fit: ordinary C++ developers will read it as an I/O stream, and CUDA developers will read it as a CUDA stream, but it is really neither.

Xinqi and Jianhao proposed flow.experimental.StreamSet / flow.experimental.stream_set

Xiaoyu: torch usually puts its not-yet-mature APIs under the torch.utils namespace.
Houjiang: the word "Stream" is a poor fit: ordinary C++ developers will read it as an I/O stream, and CUDA developers will read it as a CUDA stream, but it is really neither.

Xinqi proposed flow.vm.StreamSet / flow.vm.stream_set

Xinqi: I would rather not expose the vm concept; users fundamentally should not have to care about it.
Houjiang: the word "Stream" is a poor fit: ordinary C++ developers will read it as an I/O stream, and CUDA developers will read it as a CUDA stream, but it is really neither.

Xinqi proposed flow.worker.StreamSet / flow.worker.stream_set

Xinqi: users are unlikely to understand the worker semantics.
Xiaoyu: a sub-namespace should stand for a specific piece of functionality; worker is a poor fit.
Houjiang: the word "Stream" is a poor fit: ordinary C++ developers will read it as an I/O stream, and CUDA developers will read it as a CUDA stream, but it is really neither.

Xinqi proposed flow.worker_thread.StreamSet / flow.worker_thread.stream_set

Xinqi: worker_thread is too far removed from the user's business logic.
Xiaoyu: a sub-namespace should stand for a specific piece of functionality; worker_thread is a poor fit.
Houjiang: the word "Stream" is a poor fit: ordinary C++ developers will read it as an I/O stream, and CUDA developers will read it as a CUDA stream, but it is really neither.

Xiaoyu proposed flow.stream.StreamSet / flow.stream.stream_set

Houjiang: the word "Stream" is a poor fit: ordinary C++ developers will read it as an I/O stream, and CUDA developers will read it as a CUDA stream, but it is really neither.

Xinqi and Xiaoyu proposed flow.utils.StreamSet / flow.utils.stream_set

Houjiang: the word "Stream" is a poor fit: ordinary C++ developers will read it as an I/O stream, and CUDA developers will read it as a CUDA stream, but it is really neither.

Xiaoyu, Houjiang, and Jianhao proposed the new namespace flow.async

No objections. Xinqi: it looks like a replacement for flow.vm, but one that users can understand more easily.
We brainstormed which APIs could live under this namespace: flow.async.local_sync, flow.async.global_sync, …

Houjiang proposed flow.async.run

with flow.async.run(thread_global_id):
    pass

Xiaoyu: "run" is not quite the right word; the name should be a noun.

Xinqi proposed flow.async.run(flow.async.Fiber(thread_global_id))

Houjiang: Fiber is better than Stream, but it still evokes the familiar coroutine concept, which does not quite match what happens here.

Xinqi and Houjiang proposed flow.async.pipeline

Xinqi, Xiaoyu: pytorch has a similar concept, the Pipe module, but that is more of a composer that combines modules placed on two devices, similar to the Sequential module. flow.async.pipeline here is meant to express something closer to a micro-thread.

Houjiang proposed flow.asyncs.thread(thread_global_id)

Houjiang: don't expose the fiber concept for now; it is hard for users to understand. StreamSet can continue to exist at the C++ level, with the Python layer exporting only flow.asyncs.thread. Any problem users would solve with different fibers can always be solved with different thread_global_ids, and more reliably so, since distinct thread_global_ids guarantee that no two fibers ever contend for the same thread resource.

default_thread_id = 0          # main compute thread
decoder_async_thread_id = 1    # worker thread for data loading
loss_async_thread_id = 2       # worker thread for metrics

for i in epoch_iters:

    # Data loading is dispatched on worker thread 1 and overlaps with training.
    with flow.asyncs.thread(decoder_async_thread_id):
        data = get_data()

    loss = train(model, data)  # runs on the default worker thread

    # The metrics write is dispatched on worker thread 2 and does not block training.
    with flow.asyncs.thread(loss_async_thread_id):
        write_metric(loss)


lixinqi commented Aug 30, 2022

Decision principles

  1. Whether the API clearly exposes the core features of the eager runtime.
  2. Whether users can understand how to use the API as quickly as possible.


@github-actions

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8866/

@github-actions

Speed stats:
GPU Name: GeForce GTX 1080 

❌ OneFlow resnet50 time: 129.3ms (= 12926.8ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 142.9ms (= 14289.9ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.11 (= 142.9ms / 129.3ms)

OneFlow resnet50 time: 74.5ms (= 7451.9ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 85.4ms (= 8540.7ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.15 (= 85.4ms / 74.5ms)

OneFlow resnet50 time: 47.1ms (= 9410.3ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 57.9ms (= 11573.4ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.23 (= 57.9ms / 47.1ms)

OneFlow resnet50 time: 34.3ms (= 6859.3ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 42.9ms (= 8582.9ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.25 (= 42.9ms / 34.3ms)

OneFlow resnet50 time: 28.2ms (= 5635.2ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 38.1ms (= 7626.3ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.35 (= 38.1ms / 28.2ms)

OneFlow swin dataloader time: 0.268s (= 53.662s / 200, num_workers=1)
PyTorch swin dataloader time: 0.150s (= 30.005s / 200, num_workers=1)
Relative speed: 0.559 (= 0.150s / 0.268s)

OneFlow swin dataloader time: 0.077s (= 15.423s / 200, num_workers=4)
PyTorch swin dataloader time: 0.040s (= 8.082s / 200, num_workers=4)
Relative speed: 0.524 (= 0.040s / 0.077s)

OneFlow swin dataloader time: 0.040s (= 7.950s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.350s / 200, num_workers=8)
Relative speed: 0.547 (= 0.022s / 0.040s)

❌ OneFlow resnet50 time: 138.3ms (= 13829.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 165.0ms (= 16496.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.19 (= 165.0ms / 138.3ms)

OneFlow resnet50 time: 84.2ms (= 8417.6ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 101.7ms (= 10172.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.21 (= 101.7ms / 84.2ms)

OneFlow resnet50 time: 57.2ms (= 11435.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 77.7ms (= 15533.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.36 (= 77.7ms / 57.2ms)

OneFlow resnet50 time: 44.3ms (= 8866.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 68.8ms (= 13754.3ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.55 (= 68.8ms / 44.3ms)

OneFlow resnet50 time: 38.6ms (= 7725.9ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 77.0ms (= 15407.2ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.99 (= 77.0ms / 38.6ms)


@github-actions

Speed stats:
GPU Name: GeForce GTX 1080 

❌ OneFlow resnet50 time: 129.2ms (= 12916.2ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 150.0ms (= 15004.8ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.16 (= 150.0ms / 129.2ms)

OneFlow resnet50 time: 74.5ms (= 7451.5ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 83.8ms (= 8382.0ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.12 (= 83.8ms / 74.5ms)

OneFlow resnet50 time: 46.7ms (= 9344.2ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 58.6ms (= 11715.7ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.25 (= 58.6ms / 46.7ms)

OneFlow resnet50 time: 34.2ms (= 6838.0ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 44.2ms (= 8835.2ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.29 (= 44.2ms / 34.2ms)

OneFlow resnet50 time: 28.2ms (= 5638.8ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 39.7ms (= 7933.4ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.41 (= 39.7ms / 28.2ms)

OneFlow swin dataloader time: 0.262s (= 52.388s / 200, num_workers=1)
PyTorch swin dataloader time: 0.154s (= 30.826s / 200, num_workers=1)
Relative speed: 0.588 (= 0.154s / 0.262s)

OneFlow swin dataloader time: 0.074s (= 14.738s / 200, num_workers=4)
PyTorch swin dataloader time: 0.042s (= 8.399s / 200, num_workers=4)
Relative speed: 0.570 (= 0.042s / 0.074s)

OneFlow swin dataloader time: 0.039s (= 7.822s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.331s / 200, num_workers=8)
Relative speed: 0.554 (= 0.022s / 0.039s)

❌ OneFlow resnet50 time: 138.1ms (= 13805.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 168.1ms (= 16807.8ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.22 (= 168.1ms / 138.1ms)

OneFlow resnet50 time: 84.2ms (= 8421.9ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 107.3ms (= 10730.2ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.27 (= 107.3ms / 84.2ms)

OneFlow resnet50 time: 57.3ms (= 11463.6ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 78.0ms (= 15598.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.36 (= 78.0ms / 57.3ms)

OneFlow resnet50 time: 44.5ms (= 8899.0ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 69.5ms (= 13909.6ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.56 (= 69.5ms / 44.5ms)

OneFlow resnet50 time: 38.7ms (= 7745.8ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 74.5ms (= 14894.2ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.92 (= 74.5ms / 38.7ms)

lixinqi merged commit 4fefb3e into master Aug 30, 2022
lixinqi deleted the tmp_compute_stream_type_guard branch Aug 30, 2022 09:40