fix backward index for gamma beta #6149
Conversation
This finding is interesting. Can other colleagues help confirm it?
Looked at the git blame: 蔡晟航 migrated this from a version developed by another colleague, and a discrepancy from tf was already noticed back then: "current tests show that when epsilon is large, the error against tf also gets large, even reaching 1e-1; the old version behaves the same."
Updated what can be confirmed so far; guoran, yinggang, and depeng are helping to look into it.
On the Python side I accidentally changed the order in which gamma and beta are passed in. Although the inputs then no longer matched, the gradient updates became correct. We later traced it down: when calling layernorm_param_grad, gamma was being passed as beta. After fixing this, the results align with torch.
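A rough sketch of the kind of alignment check that catches this (shapes, tolerances, and the helper name are my own choices, not from this PR): it compares the weight/bias grads against torch, which is exactly what gets flipped when gamma and beta are swapped in layernorm_param_grad.

```python
import numpy as np
import oneflow as flow
import torch

x_np = np.random.randn(2, 3, 4).astype(np.float32)
g_np = np.random.randn(2, 3, 4).astype(np.float32)  # non-uniform upstream grad

def affine_grads(ln_cls, xp):
    # build a LayerNorm over the last two dims and backprop a weighted sum
    ln = ln_cls((3, 4), eps=1e-5, elementwise_affine=True)
    x = xp.tensor(x_np, requires_grad=True)
    (ln(x) * xp.tensor(g_np)).sum().backward()
    return ln.weight.grad.numpy(), ln.bias.grad.numpy()

of_w, of_b = affine_grads(flow.nn.LayerNorm, flow)
th_w, th_b = affine_grads(torch.nn.LayerNorm, torch)

# Before the fix, weight.grad matched torch's bias.grad and vice versa; both should now be True.
print(np.allclose(of_w, th_w, atol=1e-5), np.allclose(of_b, th_b, atol=1e-5))
```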
One more issue to record: the LayerNorm module's forward only uses the length of self.normalized_shape to compute norm_axis and never actually checks whether the input shape matches it. The CHECK only fails later in TensorDescInfer, and the error message there is not clear either. Something to take care of after this bug fix.
```diff
 ctx->has_normalized_diff = ctx->scale && inputs.at(0)->requires_grad();
 if (ctx->has_gamma_diff || ctx->has_normalized_diff) {
-  ctx->gamma_index = ctx->SaveTensorForBackward(inputs.at(gamma_index));
+  ctx->gamma_index = ctx->SaveTensorForBackward(inputs.at(1));  // save gamma.
 }
```
Does inputs.at(1) here count as hardcoding?
```diff
 if (ctx->has_gamma_diff) {
-  in_grads->at(ctx->has_beta_diff + 1) = results->at(ctx->has_beta_diff);
+  in_grads->at(1) =
```
Does at(1) count as hardcoding?
The inputs and outputs are now ordered, and that order is fixed when the op expr is built, so hardcoding the index here is fine and is also more efficient.
Recording a gradient issue: in some cases the layernorm output and gradient still do not align with torch (this happens on both GPU and CPU). Here is an example:

```python
import oneflow as flow
import torch
import numpy as np
x_np = np.array([[[[-1.83965693, -1.82964566]]]])
print(x_np.shape)
affine = False
device = "cuda"
normalized_shape = (1, 1, 2)
of_x_tensor = flow.Tensor(x_np).to(device)
of_x_tensor.requires_grad = True
of_layernorm = flow.nn.LayerNorm(normalized_shape, eps=1e-5, elementwise_affine=affine).to(device)
of_out = of_layernorm(of_x_tensor)
print("Oneflow out is: ", of_out)
of_out = of_out.sum()
of_out.backward()
print("Of x tensor grad is: ", of_x_tensor.grad)
torch_x_tensor = torch.Tensor(x_np).to(device)
torch_x_tensor.requires_grad = True
torch_layernorm = torch.nn.LayerNorm(normalized_shape, eps=1e-5, elementwise_affine=affine).to(device)
torch_out = torch_layernorm(torch_x_tensor)
print("Pytorch out is: ", torch_out)
torch_out = torch_out.sum()
torch_out.backward()
print("Torch x tensor grad is: ", torch_x_tensor.grad) 以下是一些实验和猜测: 实验1使用最naive的方式,用mean和var拼凑: import oneflow as flow
# import torch as flow
import numpy as np
device = "cpu"
eps = 1e-5
x_np = np.array([[[[-1.83965693, -1.82964566]]]])
of_x_tensor = flow.Tensor(x_np).to(device)
of_x_tensor.requires_grad = True
mean = flow.mean(of_x_tensor, dim=(2, 3), keepdim=True)
var = (flow.var(of_x_tensor, dim=(2, 3), keepdim=True, unbiased=False) + eps).rsqrt()
of_out = (of_x_tensor - mean) * var
print("X out is: ", of_out)
of_out = of_out.sum()
of_out.backward()
print("Grad is: ", of_x_tensor.grad) 按照拼凑的方式,有一些结论:
猜测是:
Experiment 2: compare against Paddle's implementation:

```python
import paddle
import numpy as np
x = np.array([[[[-1.83965693, -1.82964566]]]]).astype(np.float32)
x_tensor = paddle.to_tensor(x).cpu()
x_tensor.stop_gradient = False
layernorm = paddle.nn.LayerNorm(normalized_shape=(1, 1, 2), epsilon=1e-5)
out = layernorm(x_tensor)
print("Out is: ", out)
out = out.sum()
out.backward()
print("X grad is: ", x_tensor.grad.numpy())
"""
Out is: Tensor(shape=[1, 1, 1, 2], dtype=float32, place=CPUPlace, stop_gradient=False,
[[[[-0.84543723, 0.84541708]]]])
X grad is: [[[[-0.00143835 0.00143831]]]]
"""
```
This is because of the different LayerNorm algorithms in PyTorch and oneflow. The Welford algorithm in PyTorch needs higher float precision; after converting the torch tensor to float64, the results align with oneflow.
Thanks to 世杰 for the help. oneflow, lightseq, and deepx_core all use the relatively naive approach: accumulate sum and sum_square, synchronize at the end, and obtain the variance from D(X) = E(X^2) - E(X)^2. PyTorch instead uses the Welford online algorithm, which updates the mean and variance incrementally (https://changyaochen.github.io/welford/). With dtype torch.float64 the results can be aligned.
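A minimal sketch (my own illustration, not the actual kernel code) of the two formulas in float32, showing why the naive E(X^2) - E(X)^2 form is more sensitive to cancellation when the values are large relative to their spread:

```python
import numpy as np

def naive_var(x):
    # accumulate sum and sum of squares, then D(X) = E(X^2) - E(X)^2
    n = np.float32(x.size)
    s = np.float32(0.0)
    sq = np.float32(0.0)
    for v in x:
        s += v
        sq += v * v
    mean = s / n
    return sq / n - mean * mean

def welford_var(x):
    # online update of the mean and M2 (sum of squared deviations from the running mean)
    mean = np.float32(0.0)
    m2 = np.float32(0.0)
    for i, v in enumerate(x, start=1):
        delta = v - mean
        mean += delta / np.float32(i)
        m2 += delta * (v - mean)
    return m2 / np.float32(x.size)  # biased variance, as used for normalization

x = np.array([-1.83965693, -1.82964566], dtype=np.float32)
print("naive  :", naive_var(x))                  # subtracts two ~3.37 values, so cancellation hurts
print("welford:", welford_var(x))
print("ref f64:", np.var(x.astype(np.float64)))  # ~2.5e-05
```

This only illustrates the precision argument; the real kernels vectorize and fuse these reductions.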
Let's add a check at the very top of the Python LayerNorm forward:

```python
for i in range(0, len(self.normalized_shape)):
    if x.shape[i + self.begin_params_axis] != self.normalized_shape[i]:
        raise RuntimeError(
            f"Given normalized_shape={self.normalized_shape}, expected input with shape "
            f"[*, {self.normalized_shape[-1]}], but got input of size {x.shape}"
        )
```
CI failed, removing label automerge
CI failed, removing label automerge
Speed stats:
Problem description:
In Eager mode, the layernorm forward results match but the backward results do not, and the grads of weight and bias come out swapped relative to torch.
Initial fix:
The guess was that the functional layer assigned the gradients in the wrong order; indeed, the weight and bias gradients were assigned to each other, and this has been fixed.
On the relationship between layernorm and batchnorm
layernorm can be seen as a variant of batchnorm.
For an input of shape (b, c, t), say (3, 2, 2), layernorm over (c, t) is equivalent to reshaping the input to (1, b, c*t) and applying batchnorm(num_features=3).
But I don't think the two are strictly equivalent: the layernorm weight and bias here have shape (2, 2), while the bn weight has shape (3,).
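A small sketch of that reshape equivalence (my own check, written with torch and affine=False so the weight/bias shape mismatch above does not come into play), using b=3, c=2, t=2:

```python
import torch

b, c, t = 3, 2, 2
x = torch.randn(b, c, t)

# LayerNorm over the trailing (c, t) dims: each of the b samples is normalized over its c*t values.
ln = torch.nn.LayerNorm((c, t), elementwise_affine=False)
y_ln = ln(x)

# BatchNorm1d on the reshaped (1, b, c*t) tensor: in training mode each of the b "channels"
# is normalized over the same c*t values, with the same biased variance and eps.
bn = torch.nn.BatchNorm1d(num_features=b, affine=False)
y_bn = bn(x.reshape(1, b, c * t)).reshape(b, c, t)

print(torch.allclose(y_ln, y_bn, atol=1e-6))  # True: the normalization itself matches
```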
Reading material
cudnn documentation
batchnorm backpropagation derivation
Current progress:
So we can still consider the previous implementation correct; the problem is more likely in the eager/functional part.