
Fea/nn graph/warmup amp config #5969

Merged
merged 29 commits into master on Aug 21, 2021

Conversation

@strint (Contributor) commented Aug 19, 2021

  • WarmUpLR
    • eager
    • graph
  • amp & nn.Graph.config
  • flow.config

class WarmUpLR(WarmUpLrScheduler):
    def __init__(
        self,
        lrsch_or_optimizer,
@strint (Contributor, Author) commented Aug 19, 2021

https://pytorch.org/docs/master/generated/torch.optim.lr_scheduler.WarmUpLR.html#torch.optim.lr_scheduler.WarmUpLR

The same interface as in torch's development branch.

But this is an enhanced version of torch's: it can compose with an ordinary LrScheduler, and it supports both eager and graph modes.
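A minimal usage sketch of that composition (the module path flow.optim.lr_scheduler and the CosineDecayLR pairing are assumptions for illustration, not confirmed by this PR):

import oneflow as flow

model = flow.nn.Linear(4, 4)
optimizer = flow.optim.SGD(model.parameters(), lr=0.05)

# torch-style form: warm up the optimizer's lr directly
warmup = flow.optim.lr_scheduler.WarmUpLR(
    optimizer, warmup_factor=0.5, warmup_iters=4, warmup_method="linear"
)

# enhanced form (assumed): wrap an ordinary LrScheduler instead of the
# optimizer, so warmup hands off to cosine decay once it finishes
cosine = flow.optim.lr_scheduler.CosineDecayLR(optimizer, decay_steps=1000)
warmup_then_cosine = flow.optim.lr_scheduler.WarmUpLR(
    cosine, warmup_factor=0.5, warmup_iters=4, warmup_method="linear"
)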

@strint added this to the v0.5.0 milestone Aug 19, 2021

flow.backends.cudnn.set_reserved_mem_mbytes(1000)

flow.utils.load_library("")
@strint (Contributor, Author) commented:

An example of calling the config interfaces.

# amp
self.config.enable_amp(True)
grad_scaler = flow.nn.graph.amp.GradScaler(3000, 2.0, 0.5, 1000)
self.set_grad_scaler(grad_scaler)
@strint (Contributor, Author) commented:

An example of the amp calls.
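A minimal sketch of where these calls sit inside a training Graph (the surrounding class, optimizer wiring, and loss are illustrative assumptions; only enable_amp, GradScaler, and set_grad_scaler come from this PR):

import oneflow as flow

class TrainGraph(flow.nn.Graph):
    def __init__(self, model, optimizer):
        super().__init__()
        self.model = model
        self.add_optimizer(optimizer)  # assumed training-graph setup
        # amp: run forward/backward in mixed precision, with dynamic
        # loss scaling to keep fp16 gradients from underflowing
        self.config.enable_amp(True)
        grad_scaler = flow.nn.graph.amp.GradScaler(3000, 2.0, 0.5, 1000)
        self.set_grad_scaler(grad_scaler)

    def build(self, x, y):
        loss = ((self.model(x) - y) ** 2).mean()  # placeholder loss
        loss.backward()
        return loss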

        return self.config.proto

    @property
    def _optimization_conf_proto(self):
Contributor commented:

This is just for debug printing, right?

@strint (Author) replied:

Yes, it's for internal debugging. Methods prefixed with an underscore are private, internal-use-only methods with no stability guarantee.
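A hedged illustration of that internal-only use (the graph instance name is hypothetical):

# underscore prefix marks this as a private debug API; no stability guarantee
print(train_graph._optimization_conf_proto)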

@chengtbf (Contributor) left a comment:

@leaves-zwx (Wenxiao), please review the LR part; once it passes, this can be merged~


# amp
self.config.enable_amp(True)
grad_scaler = flow.amp.GradScaler(3000, 2.0, 0.5, 1000)
Contributor commented:

Keyword args should be provided here; otherwise no one knows what 3000, 2.0, and 1000 mean.

@strint (Author) replied:

This is aligned with the torch interface; when passing arguments, users may write arg=val or leave the names off.
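For reference, the keyword form (parameter names assumed to follow torch.cuda.amp.GradScaler, which this interface mirrors):

grad_scaler = flow.amp.GradScaler(
    init_scale=3000,       # starting loss-scale value
    growth_factor=2.0,     # scale multiplier after a stable run of steps
    backoff_factor=0.5,    # scale multiplier after an inf/nan gradient step
    growth_interval=1000,  # finite steps required before the scale grows
)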

Contributor replied:

Right, but I think our example code should spell this out. For reference:

# Assuming optimizer uses lr = 0.05 for all groups
# lr = 0.025    if epoch == 0
# lr = 0.03125  if epoch == 1
# lr = 0.0375   if epoch == 2
# lr = 0.04375  if epoch == 3
# lr = 0.05     if epoch >= 4
scheduler = WarmUpLR(self.opt, warmup_factor=0.5, warmup_iters=4, warmup_method="linear")
for epoch in range(100):
    train(...)
    validate(...)
    scheduler.step()
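The commented schedule is just linear interpolation from warmup_factor up to 1 over warmup_iters epochs; a quick sketch that reproduces those values:

base_lr, warmup_factor, warmup_iters = 0.05, 0.5, 4
for epoch in range(6):
    if epoch < warmup_iters:
        lr = base_lr * (warmup_factor + (1 - warmup_factor) * epoch / warmup_iters)
    else:
        lr = base_lr
    print(epoch, lr)  # 0.025, 0.03125, 0.0375, 0.04375, 0.05, 0.05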

Contributor added:

Users will try to imitate our test scripts too, so it's clearer to include the argument names.

@strint (Author) replied:

Good point; they've been added.

@strint requested review from oneflow-ci-bot and removed request for oneflow-ci-bot August 21, 2021 09:28
@strint requested review from oneflow-ci-bot and removed request for oneflow-ci-bot August 21, 2021 09:55
@oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot August 21, 2021 09:58
@github-actions commented:

Speed stats:
GPU Name: GeForce GTX 1080 

PyTorch resnet50 time: 140.1ms (= 7004.4ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 128.4ms (= 6419.8ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
Relative speed: 1.09 (= 140.1ms / 128.4ms)

PyTorch resnet50 time: 84.5ms (= 4225.2ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 74.6ms (= 3728.0ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
Relative speed: 1.13 (= 84.5ms / 74.6ms)

PyTorch resnet50 time: 57.6ms (= 2878.5ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 47.3ms (= 2365.6ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
Relative speed: 1.22 (= 57.6ms / 47.3ms)

PyTorch resnet50 time: 49.2ms (= 2460.5ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 41.4ms (= 2068.8ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
Relative speed: 1.19 (= 49.2ms / 41.4ms)

PyTorch resnet50 time: 44.7ms (= 2234.6ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 35.3ms (= 1766.1ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
Relative speed: 1.27 (= 44.7ms / 35.3ms)

@oneflow-ci-bot merged commit d4aae1b into master Aug 21, 2021
@oneflow-ci-bot deleted the fea/nn_graph/warmup_amp_config branch August 21, 2021 10:49
)

from oneflow.framework.config_util import (
    api_nccl_use_compute_stream as enable_use_compute_stream,
Contributor commented:

Isn't it already there? 😂 @strint @leaves-zwx

flow.boxing.nccl.enable_use_compute_stream()
