flow.tmp_compute_stream #8866
Conversation
…pi oneflow.stream to oneflow.experimental.stream_set
I have never found a torch.experimental.xxx API. At most I found functorch.experimental.xxx and torchtext.experimental.xxx, and those two look like libraries built on top of the PyTorch framework, not part of torch itself.
…c/oneflow into tmp_compute_stream_type_guard
```python
for tensor in tensors:
    test_case.assertEqual(tensor[0], 1)
    test_case.assertEqual(tensor[int(tensor.shape[0] / 2)], 1)
```
What is this meant to test? If it is communication correctness, that does not seem to be verified here: even if the communication across ranks were misaligned, the tensor values would still be all ones.
The original intent here was to test for deadlock.
Perhaps I should go further and test the values as well.
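A value-sensitive variant of the check could catch misaligned communication. This is only a sketch, not OneFlow test code: `make_payload` and the simulated exchange are hypothetical stand-ins. The idea is to fill each tensor with its source rank instead of all ones, so a shuffled source would fail the assertion:

```python
world_size = 4

def make_payload(src_rank, n=8):
    # Each rank's tensor is filled with its own rank id, not with 1s.
    return [src_rank] * n

# Pretend tensors[i] was received from rank i (e.g. via an all-gather).
tensors = [make_payload(r) for r in range(world_size)]

for src_rank, tensor in enumerate(tensors):
    # An all-ones payload would pass even if the sources were shuffled;
    # rank-valued payloads make any misalignment detectable.
    assert tensor[0] == src_rank
    assert tensor[len(tensor) // 2] == src_rank
```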
Handled in: 1dfa972
CI failed when running job: Build cpu. PR label automerge has been removed.
**Alternative proposals and their rationale**

- 新奇 and 建浩 proposed `flow.StreamSet` / `flow.stream_set`. 啸宇: these should not go directly into the top-level oneflow namespace.
- 新奇 and 建浩 proposed `flow.experimental.StreamSet` / `flow.experimental.stream_set`. 啸宇: torch generally puts its less mature APIs under the `torch.utils` namespace.
- 新奇 proposed `flow.vm.StreamSet` / `flow.vm.stream_set`. 新奇: reluctant to expose the vm concept at all; users fundamentally should not have to care about it.
- 新奇 proposed `flow.worker.StreamSet` / `flow.worker.stream_set`. 新奇: users cannot easily understand the worker semantics.
- 新奇 proposed `flow.worker_thread.StreamSet` / `flow.worker_thread.stream_set`. 新奇: worker_thread is too far removed from the user's business logic.
- 啸宇 proposed `flow.stream.StreamSet` / `flow.stream.stream_set`. 后江: the word "Stream" does not fit; ordinary C++ developers tend to read it as an I/O stream and CUDA developers as a CUDA stream, yet it is really neither.
- 新奇 and 啸宇 proposed `flow.utils.StreamSet` / `flow.utils.stream_set`. 后江: the same objection to the word "Stream" as above.
- 啸宇, 后江, and 建浩 proposed a new namespace, `flow.async`. No objections. 新奇: it looks like a replacement for `flow.vm`, but is easier for users to understand.
- 后江 proposed `flow.async.run`:

  ```python
  with flow.async.run(thread_global_id):
      pass
  ```

  啸宇: "run" is not a good fit; the name should be a noun.
- 新奇 proposed `flow.async.run(flow.async.Fiber(thread_global_id))`. 后江: Fiber is better than stream, but it still evokes the familiar coroutine concept, which does not quite match what we have here.
- 新奇 and 后江 proposed `flow.async.pipeline`. 新奇, 啸宇: PyTorch has a similar concept, the Pipe module, but that is more of a composer that combines modules placed on two devices, much like the Sequential module. `flow.async.pipeline` here is meant to express something closer to a micro-thread.
- 后江 proposed `flow.asyncs.thread(thread_global_id)`. 后江: do not expose the fiber concept for now; it is hard for users to understand. The C++-level StreamSet can continue to exist, and the Python layer only exports `flow.asyncs.thread`. Any problem a user would solve with different fibers can always be solved with different `thread_global_id`s, and more reliably, since distinct `thread_global_id`s can never have fibers contending for the same thread resources.

  ```python
  default_thread_id = 0
  decoder_async_thread_id = 1
  loss_async_thread_id = 2
  for i in epoch_iters:
      with flow.asyncs.thread(decoder_async_thread_id):
          data = get_data()
      loss = train(model)
      with flow.asyncs.thread(loss_async_thread_id):
          write_metric(loss)
  ```
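To illustrate the intended "one dedicated worker per id" semantics, here is a minimal Python sketch. All names (`asyncs_thread`, `submit`, the queue-per-id dispatch) are hypothetical; this is not OneFlow's implementation, just a model of why distinct `thread_global_id`s never contend for the same thread:

```python
import queue
import threading
from contextlib import contextmanager

# One dedicated worker thread per thread_global_id: work submitted under
# different ids lands on different queues and never shares a thread.
_workers = {}
_current = threading.local()

def _get_worker(thread_global_id):
    # Lazily start a worker thread that drains this id's queue.
    if thread_global_id not in _workers:
        q = queue.Queue()

        def loop():
            while True:
                fn = q.get()
                if fn is None:  # shutdown sentinel
                    break
                fn()

        threading.Thread(target=loop, daemon=True).start()
        _workers[thread_global_id] = q
    return _workers[thread_global_id]

@contextmanager
def asyncs_thread(thread_global_id):
    # Inside the block, submit() enqueues work onto this id's worker.
    prev = getattr(_current, "q", None)
    _current.q = _get_worker(thread_global_id)
    try:
        yield
    finally:
        _current.q = prev

def submit(fn):
    _current.q.put(fn)

# Demo: work runs asynchronously on worker 1, in FIFO order.
results = []
done = threading.Event()
with asyncs_thread(1):
    submit(lambda: results.append("decoded"))
    submit(done.set)
done.wait(timeout=5)
```

Because each id owns a single FIFO queue, ordering within one id is preserved while different ids proceed independently.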
Decision principles
Speed stats:

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8866/
Expose the vm's worker threads to user code via the api oneflow.async.thread.
Usage example:
After discussing with 建浩, 后江, and 啸宇, we decided for now not to expose concepts such as stream or StreamSet to the Python layer, because the cost of explaining them is too high and users very easily misunderstand them; simply exposing the thread concept is enough to cover the vast majority of business needs.
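The usage pattern from the discussion above can be sketched in a self-contained form. Here `flow.asyncs.thread` is stubbed with a no-op context manager, and `get_data`, `train`, and `write_metric` are placeholder helpers, so the example runs standalone; the real API would route the enclosed ops to the vm worker thread named by `thread_global_id`:

```python
from contextlib import contextmanager
from types import SimpleNamespace

@contextmanager
def _thread(thread_global_id):
    # Stub: the real version dispatches enclosed ops to this worker id.
    yield

# Stand-in for the oneflow module, exposing only the discussed API shape.
flow = SimpleNamespace(asyncs=SimpleNamespace(thread=_thread))

metrics = []

def get_data():
    return [1.0, 2.0]          # placeholder decoder output

def train(model):
    return sum(model)          # placeholder training step returning a loss

def write_metric(loss):
    metrics.append(loss)       # placeholder metric sink

default_thread_id = 0
decoder_async_thread_id = 1
loss_async_thread_id = 2

model = [0.5, 0.5]
for i in range(2):             # stand-in for epoch_iters
    with flow.asyncs.thread(decoder_async_thread_id):
        data = get_data()
    loss = train(model)
    with flow.asyncs.thread(loss_async_thread_id):
        write_metric(loss)
```

The point of the shape: data loading and metric writing sit on their own worker ids, so they can overlap with training without the user ever seeing a stream or fiber concept.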