
Support weight_decay (L2 actually) #5587

Merged 3 commits into master from feat-SGD_weight_decay on Jul 23, 2021

Conversation

@wyg1997 (Contributor) commented on Jul 23, 2021:

SGD now supports the weight_decay parameter, and the documentation has been updated:

[screenshot: updated SGD docstring showing the new weight_decay parameter]

However, the underlying SGD computation is not yet aligned with torch; this still needs to be aligned at the kernel level in a follow-up.
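A minimal usage sketch of the new parameter, based on the signature in the diff further down; the exact module path (flow.optim.SGD) is assumed here:

```python
import oneflow as flow
import oneflow.nn as nn

model = nn.Linear(4, 2)

# weight_decay adds an L2 penalty term (weight_decay * w) to each
# parameter's gradient, mirroring PyTorch's SGD keyword argument.
optimizer = flow.optim.SGD(
    model.parameters(), lr=1e-3, momentum=0.9, weight_decay=1e-4
)

x = flow.randn(8, 4)
loss = model(x).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```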

@wyg1997 wyg1997 requested a review from BBuf July 23, 2021 12:09
@wyg1997 wyg1997 requested review from oneflow-ci-bot and removed request for oneflow-ci-bot July 23, 2021 12:18
@oneflow-ci-bot oneflow-ci-bot requested review from oneflow-ci-bot and removed request for oneflow-ci-bot July 23, 2021 12:51
@github-actions (Contributor) commented:

Speed stats:
GPU Name: GeForce GTX 1080 

PyTorch resnet50 time: 136.3ms (= 6814.0ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 123.9ms (= 6192.5ms / 50, input_shape=[16, 3, 224, 224], backward is enabled)
Relative speed: 1.10 (= 136.3ms / 123.9ms)

PyTorch resnet50 time: 83.7ms (= 4183.6ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 72.3ms (= 3613.2ms / 50, input_shape=[8, 3, 224, 224], backward is enabled)
Relative speed: 1.16 (= 83.7ms / 72.3ms)

PyTorch resnet50 time: 58.3ms (= 2913.6ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 52.0ms (= 2600.7ms / 50, input_shape=[4, 3, 224, 224], backward is enabled)
Relative speed: 1.12 (= 58.3ms / 52.0ms)

PyTorch resnet50 time: 47.4ms (= 2371.2ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 50.5ms (= 2523.7ms / 50, input_shape=[2, 3, 224, 224], backward is enabled)
Relative speed: 0.94 (= 47.4ms / 50.5ms)

PyTorch resnet50 time: 43.0ms (= 2148.8ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
OneFlow resnet50 time: 50.6ms (= 2527.7ms / 50, input_shape=[1, 3, 224, 224], backward is enabled)
Relative speed: 0.85 (= 43.0ms / 50.6ms)

@oneflow-ci-bot oneflow-ci-bot merged commit 5c7bab4 into master Jul 23, 2021
@oneflow-ci-bot oneflow-ci-bot deleted the feat-SGD_weight_decay branch July 23, 2021 13:52
@@ -60,16 +62,19 @@ def __init__(
parameters: Union[Iterator[Parameter], List[Dict]],
lr: float = 1e-3,
momentum: float = 0.0,
weight_decay: float = 0.0, # SGD's weight_decay actually does L2 Normalize
Contributor:
Is this L2 semantically consistent with adam's L2 or with adamW's? @leaves-zwx @strint @wyg1997 (That is: does it match the WeightDecay parameter in Optimizer, or the L2 semantics in VariableConf?) If it is the latter, then the SGD in the Lazy nn.Graph Optimizer will have to handle setting L2 on the Variable.

Note: the difference between the two is when the WeightDecay term is computed.

Contributor:
The weight_decay here (SGD) is L2 regularization: it is computed before the momentum update. If there were an SGDW (PyTorch does not implement one), its weight_decay would instead be applied after the momentum update (original-style weight_decay). If the momentum beta is 0 (SGD without momentum), the two are identical.
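To make the timing difference concrete, here is a minimal NumPy-style sketch of the two update rules described above (illustrative only, not OneFlow's actual kernel code):

```python
import numpy as np

def sgd_l2_step(w, grad, buf, lr=0.1, beta=0.9, wd=1e-4):
    # L2 regularization (PyTorch-style SGD): the decay term is folded
    # into the gradient BEFORE the momentum update.
    grad = grad + wd * w
    buf = beta * buf + grad
    return w - lr * buf, buf

def sgdw_step(w, grad, buf, lr=0.1, beta=0.9, wd=1e-4):
    # Decoupled weight decay (SGDW): momentum sees only the raw
    # gradient; the decay is applied to the weights AFTER the update.
    buf = beta * buf + grad
    return w - lr * buf - lr * wd * w, buf

# With beta == 0 both reduce to: w - lr * grad - lr * wd * w
```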

Contributor:
Then as I understand it, the WeightDecay in the PyTorch-aligned SGD Optimizer is actually not the WeightDecay in our Lazy OptimizerConf, but the L2 parameter of the Regularizer in the Lazy VariableOpConf? @leaves-zwx @strint If so, nn.Graph's SGD will have to handle writing this WeightDecay parameter into the corresponding VariableOpConf.

Contributor:
It feels like nn.Graph's add_optimizer will need to check the optimizer's type.

Contributor:
Yes. With weight_decay added to SGD here, we can build on this optimizer to do the Variable L2 PR next. add_optimizer will perceive the type indirectly, in that different optimizer types produce different data.
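A hedged sketch of the kind of type-aware dispatch being discussed; every helper name below (set_variable_l2, set_optimizer_weight_decay) is hypothetical and not OneFlow's actual nn.Graph API:

```python
# Hypothetical illustration of add_optimizer branching on optimizer type
# to decide where the decay term lands in the lazy graph configuration.
def add_optimizer(graph_conf, optimizer):
    for group in optimizer.param_groups:
        wd = group.get("weight_decay", 0.0)
        if isinstance(optimizer, SGD):
            # PyTorch-aligned SGD semantics: write wd as the L2
            # regularizer on each VariableOpConf.
            for param in group["params"]:
                graph_conf.set_variable_l2(param, wd)
        else:
            # Decoupled-decay optimizers keep OptimizerConf WeightDecay.
            graph_conf.set_optimizer_weight_decay(optimizer, wd)
```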
