Implement normalization methods (BatchNorm/LayerNorm/BatchRenorm) as functions in a common header file #5685

Closed
zhouxiao-coder opened this issue Nov 16, 2017 · 4 comments

zhouxiao-coder (Contributor) commented Nov 16, 2017

Current Status

In the new Paddle core, we currently have a batch_norm_op, which is essentially an implementation of the fused spatial batch normalization method. Like most other operators, it performs most of its computations directly inside the BatchNorm kernels.

Suggestion

I suggest we abstract away some of the normalization calculations and implement them as functions/functors in a common header file like normalization.h, so that existing code can be reused more easily (a rough sketch follows the reasons below).

Reasons

  1. Normalization methods are evolving rapidly; new variants of BatchNorm appear from time to time, e.g. layer normalization and batch renormalization, and they often share similar structures.
  2. Normalization layers are used repeatedly in deep architectures, so they have to be efficient. We may need to write many fused kernels to support this.
  3. RNN units also benefit from some normalization methods, so we should support them in C++ code.
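
To make the proposal concrete, here is a minimal, framework-agnostic sketch of what such a shared header could contain. All names and signatures (`ComputeMoments`, `Normalize`, the row-major `(rows x cols)` layout) are illustrative assumptions, not Paddle's existing API:

```cpp
// Sketch of a common "normalization.h": shared moment computation and the
// shared normalize step that BatchNorm, LayerNorm, and BatchRenorm all use.
#include <cmath>
#include <cstddef>
#include <vector>

struct Moments {
  std::vector<float> mean;
  std::vector<float> var;
};

// Compute mean/variance over the batch dimension (per column) for BatchNorm,
// or over the feature dimension (per row) for LayerNorm.
inline Moments ComputeMoments(const std::vector<float>& x, size_t rows,
                              size_t cols, bool per_column) {
  size_t stat_size = per_column ? cols : rows;
  size_t reduce_size = per_column ? rows : cols;
  Moments m{std::vector<float>(stat_size, 0.f),
            std::vector<float>(stat_size, 0.f)};
  for (size_t i = 0; i < rows; ++i)
    for (size_t j = 0; j < cols; ++j)
      m.mean[per_column ? j : i] += x[i * cols + j] / reduce_size;
  for (size_t i = 0; i < rows; ++i)
    for (size_t j = 0; j < cols; ++j) {
      size_t k = per_column ? j : i;
      float d = x[i * cols + j] - m.mean[k];
      m.var[k] += d * d / reduce_size;
    }
  return m;
}

// Shared step: y = (x - mean) / sqrt(var + eps) * gamma + beta.
// BatchNorm broadcasts the statistics over rows, LayerNorm over columns;
// gamma/beta are indexed per channel/feature (per column) in both cases.
inline void Normalize(const std::vector<float>& x, const Moments& m,
                      const std::vector<float>& gamma,
                      const std::vector<float>& beta, size_t rows, size_t cols,
                      bool per_column, float eps, std::vector<float>* y) {
  y->resize(rows * cols);
  for (size_t i = 0; i < rows; ++i)
    for (size_t j = 0; j < cols; ++j) {
      size_t k = per_column ? j : i;
      float x_hat = (x[i * cols + j] - m.mean[k]) / std::sqrt(m.var[k] + eps);
      (*y)[i * cols + j] = x_hat * gamma[j] + beta[j];
    }
}
```

With helpers like these, a fused batch_norm_op and a layer_norm_op would differ mainly in the reduction axis they pass in, and BatchRenorm could add its r/d correction on top of the same primitives.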
zhouxiao-coder self-assigned this Nov 16, 2017
lcy-seso (Contributor) commented Nov 16, 2017

Layer normalization just transposes the input of batch norm (the moving averages of mean and std are no longer needed). Is it possible to implement layer normalization simply by wrapping batch norm?
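
For a 2-D input of shape (batch, features), the equivalence described here can be illustrated with a small toy example (plain C++, not Paddle code); the per-feature statistics of the transposed input are exactly LayerNorm's per-sample statistics:

```cpp
// Toy illustration: LayerNorm statistics == BatchNorm statistics of x^T.
#include <cstdio>
#include <vector>

// Mean over the first dimension (the "batch" axis) for each column,
// i.e. what BatchNorm reduces over for an input of shape (rows, cols).
std::vector<float> ColumnMeans(const std::vector<std::vector<float>>& x) {
  std::vector<float> mean(x[0].size(), 0.f);
  for (const auto& row : x)
    for (size_t j = 0; j < row.size(); ++j) mean[j] += row[j] / x.size();
  return mean;
}

std::vector<std::vector<float>> Transpose(
    const std::vector<std::vector<float>>& x) {
  std::vector<std::vector<float>> t(x[0].size(), std::vector<float>(x.size()));
  for (size_t i = 0; i < x.size(); ++i)
    for (size_t j = 0; j < x[i].size(); ++j) t[j][i] = x[i][j];
  return t;
}

int main() {
  // x has shape (batch=2, features=3).
  std::vector<std::vector<float>> x = {{1, 2, 3}, {4, 5, 6}};
  // BatchNorm-style per-feature means: 2.5 3.5 4.5
  for (float m : ColumnMeans(x)) std::printf("%.1f ", m);
  std::printf("\n");
  // LayerNorm-style per-sample means = BatchNorm means of the transpose: 2.0 5.0
  for (float m : ColumnMeans(Transpose(x))) std::printf("%.1f ", m);
  std::printf("\n");
}
```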

lcy-seso (Contributor) commented

Another potentially useful normalization method I am interested in is weight normalization. As I understand it, most user-defined normalizations can be implemented by combining primitive arithmetic operators.

Implementing a particular normalization method as an independent operator helps improve computation and memory efficiency (in most cases, many intermediate results can be simplified by manually checking the formulas). But I have not checked all the potential normalization methods.

zhouxiao-coder (Contributor, Author) commented Nov 16, 2017

@lcy-seso

is it possible to implement layer normalization simply by wrapping batch norm?

Yes, we could, and that's what TensorFlow does in its repo: it wraps a non-fused BatchNorm implementation. There is a subtlety with LayerNorm: the estimated mean and variance have size equal to the batch size rather than the channel size, but TensorFlow sidesteps this by relying on broadcasting.

However, I don't think this is optimal. As you have already pointed out, non-fused BatchNorm is significantly slower, which makes a big difference if we want to use it in deep models.

Since one big advantage of LayerNorm is that it is directly usable in RNN units, it also makes sense to reuse code between a standalone LayerNorm layer and an "LSTMUnitsWithLayerNorm" operator.
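
As a rough illustration of that reuse, a single row-wise helper could serve both a standalone layer_norm op and an LSTM unit with layer normalization. The names below (`LayerNormalize`, the gate-preactivation call) are hypothetical, not existing Paddle symbols:

```cpp
// Hypothetical shared helper: normalize each row (one sample) of a
// (rows x cols) row-major matrix in place, with per-feature gamma/beta.
#include <cmath>
#include <vector>

inline void LayerNormalize(std::vector<float>* x, size_t rows, size_t cols,
                           const std::vector<float>& gamma,
                           const std::vector<float>& beta, float eps = 1e-5f) {
  for (size_t i = 0; i < rows; ++i) {
    float mean = 0.f, var = 0.f;
    for (size_t j = 0; j < cols; ++j) mean += (*x)[i * cols + j] / cols;
    for (size_t j = 0; j < cols; ++j) {
      float d = (*x)[i * cols + j] - mean;
      var += d * d / cols;
    }
    float inv_std = 1.f / std::sqrt(var + eps);
    for (size_t j = 0; j < cols; ++j)
      (*x)[i * cols + j] =
          ((*x)[i * cols + j] - mean) * inv_std * gamma[j] + beta[j];
  }
}

// Inside a hypothetical LSTM step, the same helper could normalize the four
// pre-activation gate blocks before the nonlinearities, e.g.:
//   LayerNormalize(&gate_preactivations, batch, 4 * hidden, gamma, beta);
```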

zhouxiao-coder (Contributor, Author) commented

Another potentially useful normalization method I am interested in is weight normalization.

I also looked at weight normalization. It seems simple enough to be implemented efficiently with basic operators, so I didn't mention it in the title. If it turns out to be necessary, we can also add it to the common header.
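
For reference, weight normalization (Salimans & Kingma, 2016) reparameterizes a weight matrix as w = g * v / ||v|| per output row, which indeed decomposes into square, reduce-sum, sqrt, and elementwise scaling. The sketch below is illustrative only, not an existing Paddle operator:

```cpp
// Weight normalization expressed with basic operations only.
// v: (out x in) row-major weight, g: per-output-row scale.
// Returns the reparameterized weight w with the same shape as v.
#include <cmath>
#include <vector>

inline std::vector<float> WeightNorm(const std::vector<float>& v,
                                     const std::vector<float>& g, size_t out,
                                     size_t in, float eps = 1e-12f) {
  std::vector<float> w(v.size());
  for (size_t i = 0; i < out; ++i) {
    float norm_sq = 0.f;  // reduce_sum(square(v_i)), a primitive reduction
    for (size_t j = 0; j < in; ++j) norm_sq += v[i * in + j] * v[i * in + j];
    float scale = g[i] / std::sqrt(norm_sq + eps);  // elementwise scale
    for (size_t j = 0; j < in; ++j) w[i * in + j] = v[i * in + j] * scale;
  }
  return w;
}
```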
