Support weight_dacay(l2 actually) #5587
Conversation
```python
@@ -60,16 +62,19 @@ def __init__(
        parameters: Union[Iterator[Parameter], List[Dict]],
        lr: float = 1e-3,
        momentum: float = 0.0,
        weight_decay: float = 0.0,  # SGD's weight_decay actually performs L2 regularization
```
Is this L2 semantically consistent with Adam's L2, or with AdamW's? @leaves-zwx @strint @wyg1997 (That is: does it match the semantics of the WeightDecay parameter in Optimizer, or the L2 setting in VariableConf?) If it is the latter, then the Lazy nn.Graph Optimizer's SGD will have to handle setting L2 on the Variable.
Note: the difference between the two is *when* the weight decay is applied during the update.
The weight_decay here (SGD) is exactly L2 regularization: it is applied *before* the momentum update. If there were an SGDW (PyTorch does not implement one), its weight_decay would be applied *after* the momentum update (the original, decoupled weight decay). If the momentum beta is 0 (SGD without momentum), the two are identical.
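The timing difference above can be sketched in plain Python. This is a minimal illustration of the two update rules, not OneFlow's implementation; the function names `sgd_l2_step` and `sgdw_step` are hypothetical:

```python
def sgd_l2_step(p, grad, buf, lr, beta, wd):
    """SGD with L2 regularization: the decay term is folded into the
    gradient *before* the momentum update (PyTorch-style weight_decay)."""
    grad = grad + wd * p
    buf = beta * buf + grad
    return p - lr * buf, buf

def sgdw_step(p, grad, buf, lr, beta, wd):
    """SGDW (decoupled weight decay): the decay is applied to the
    parameter *after* the momentum update and never enters the buffer."""
    buf = beta * buf + grad
    return p - lr * buf - lr * wd * p, buf

# With momentum beta = 0 the two updates coincide:
p1, _ = sgd_l2_step(1.0, 0.5, 0.0, lr=0.1, beta=0.0, wd=0.01)
p2, _ = sgdw_step(1.0, 0.5, 0.0, lr=0.1, beta=0.0, wd=0.01)
assert abs(p1 - p2) < 1e-12

# With beta > 0 the momentum buffers diverge, because in the L2 form the
# decay term accumulates inside the buffer across steps:
_, buf_l2 = sgd_l2_step(1.0, 0.5, 0.2, lr=0.1, beta=0.9, wd=0.01)
_, buf_w = sgdw_step(1.0, 0.5, 0.2, lr=0.1, beta=0.9, wd=0.01)
assert buf_l2 != buf_w
```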
Then, as I understand it, the WeightDecay in the PyTorch-aligned SGD Optimizer is actually not the WeightDecay in our Lazy OptimizerConf, but the L2 parameter of the Regularizer in the Lazy VariableOpConf? @leaves-zwx @strint If so, nn.Graph's SGD will have to handle writing this weight_decay parameter into the corresponding VariableOpConf.
It seems nn.Graph's add_optimizer will need to check the optimizer's type.
Yes. With weight_decay added to SGD, we can build the Variable L2 PR on top of this optimizer. add_optimizer will perceive the type indirectly: different optimizer types produce different data.
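The routing being discussed could look roughly like the following. This is a hedged sketch only; the classes, field paths, and the `route_weight_decay` helper are all hypothetical and not OneFlow's actual API:

```python
class SGD:
    """Stand-in for an SGD optimizer whose weight_decay has L2 semantics."""
    def __init__(self, lr, momentum=0.0, weight_decay=0.0):
        self.lr, self.momentum, self.weight_decay = lr, momentum, weight_decay

class AdamW:
    """Stand-in for an optimizer with decoupled weight decay."""
    def __init__(self, lr, weight_decay=0.0):
        self.lr, self.weight_decay = lr, weight_decay

def route_weight_decay(optimizer):
    """Decide where weight_decay should land when the graph config is built:
    SGD's weight_decay is L2 regularization, so it belongs on the variable's
    regularizer config; a decoupled decay stays on the optimizer config."""
    if isinstance(optimizer, SGD):
        return {"variable_conf.regularizer.l2": optimizer.weight_decay}
    return {"optimizer_conf.weight_decay": optimizer.weight_decay}

assert route_weight_decay(SGD(0.1, weight_decay=0.01)) == {
    "variable_conf.regularizer.l2": 0.01
}
assert route_weight_decay(AdamW(0.001, weight_decay=0.05)) == {
    "optimizer_conf.weight_decay": 0.05
}
```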
SGD now supports the weight_decay parameter, and the documentation has been updated.
However, the underlying SGD computation is not yet aligned with torch's; the kernel needs to be aligned in a follow-up.