Design doc for Model average (renaming it to Parameter Average) #5137
Conversation
doc/design/parameter_average.md
Outdated
@@ -0,0 +1,70 @@
# Averaging Parameter Updates in PaddlePaddle
I think "Updates" should be dropped.
Done.
doc/design/parameter_average.md
Outdated
## Why Averaging
In large scale machine learning, it could take us a large number of iterations over the training data to achieve optimal values of parameters of our model. The size of training data could be very large for large scale training hence it is desirable if we can obtain the optimal parameters by going through the data in as few passes as we can.
Polyak and Juditsky (1992) showed that asymptotically the test performance of simple average of parameters obtained by Stochastic Gradient Descent (SGD) is as good as that of parameters which minimize the empirical cost.
It seems that this paragraph comes from the abstract of this paper by Wei Xu. I think we need to add a citation here or rephrase this paragraph.
Done.
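For reference, the "simple average" in the Polyak and Juditsky result is the running mean of the SGD iterates, which (in generic notation, not notation taken from the design doc) can be written as:

```latex
% Polyak-Ruppert averaging: \theta_t is the parameter vector after the
% t-th SGD update; the averaged parameters \bar{\theta}_T are the ones
% used at test time.
\bar{\theta}_T = \frac{1}{T} \sum_{t=1}^{T} \theta_t
```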
doc/design/parameter_average.md
Outdated
4. Perform testing and/or save the parameters.
5. Restore the values of the parameters once done.
### How to do implement Averaging of Parameter Updates in PaddlePaddle
Drop "do"
Drop "Updates"? -> see above
Done.
doc/design/parameter_average.md
Outdated
We can add the ParameterAverageOptimizer op to the graph through Python API. Using this approach, we manually add this op to the graph and direct the output of the optimizer op to this op during training.

**Advantages**:
- Allows for greater flexibility to the users of Paddle. Using this approach, the users can plug different optimizers into ParameterAverageOptimizer by passing in the optimizer to the op.
s/Paddle/PaddlePaddle/g
Done.
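To make the "plug in any optimizer" idea concrete, here is a minimal plain-NumPy sketch of the wrapping behavior; the class and method names are illustrative only, not the PaddlePaddle API:

```python
import numpy as np

class SGD:
    """A stand-in inner optimizer: w <- w - lr * grad."""
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, param, grad):
        return param - self.lr * grad

class ParameterAverageOptimizer:
    """Wraps any inner optimizer and maintains a running (Polyak) average
    of the parameter values it produces."""
    def __init__(self, inner):
        self.inner = inner
        self.avg = None
        self.count = 0

    def update(self, param, grad):
        new_param = self.inner.update(param, grad)
        self.count += 1
        if self.avg is None:
            self.avg = new_param.copy()
        else:
            # Incremental mean: avg += (x - avg) / n.
            self.avg += (new_param - self.avg) / self.count
        return new_param

# Train with the raw parameters; evaluate or checkpoint with opt.avg.
opt = ParameterAverageOptimizer(SGD(lr=0.1))
w = np.zeros(3)
for grad in (np.ones(3), 2 * np.ones(3)):
    w = opt.update(w, grad)
print(w, opt.avg)  # raw iterate [-0.3, ...] vs. averaged [-0.2, ...]
```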
doc/design/parameter_average.md
Outdated
The ParameterAverageOptimizer op can be like any other operator with its own CPU/GPU implementation either using Eigen or separate CPU and GPU kernels. As the initial implementation, we can implement the kernel using Eigen following the abstraction pattern implemented for [Operators](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/rmsprop_op.h).

The idea of building an op for averaging is in sync with the refactored Paddle philosophy of using operators to represent any computation unit. The way the op will be added to the computation graph will be decided by the [layer functions](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/python_api.md#layer-function) in Python API.
s/Paddle/PaddlePaddle/g
Done.
doc/design/parameter_average.md
Outdated
- the optimizer
- the window_size to keep the updates

The ParameterAverageOptimizer op can be like any other operator with its own CPU/GPU implementation either using Eigen or separate CPU and GPU kernels. As the initial implementation, we can implement the kernel using Eigen following the abstraction pattern implemented for [Operators](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/rmsprop_op.h).
We should also support the case where the trainer runs on GPU while ParameterAverageOptimizer runs on CPU, in order to save GPU memory.
Done.
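The point above can be sketched in a few lines: the averaging state lives in host memory, so the only per-step GPU cost is copying the updated parameter to the CPU. `fetch_param` below is a placeholder for a device-to-host copy, not a real PaddlePaddle call:

```python
import numpy as np

def update_host_average(avg_cpu, fetch_param, step):
    """Keep the running average in CPU memory; fetch_param() stands in for
    copying the freshly updated parameter from GPU to host, so no extra
    averaging buffer needs to be allocated on the GPU."""
    param_cpu = fetch_param()
    avg_cpu += (param_cpu - avg_cpu) / step  # incremental mean on the CPU
    return avg_cpu
```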
doc/design/parameter_average.md
Outdated
- Any optimizer (RMSProp, AdaGrad etc.)
- A window size. The op keeps accumulating updated parameter values over a window of N batches and takes an average. Move the averaged value to a buffer when window is full to avoid loss of precision.

Using the ParameterAverageOptimizer op, any user can add the operation to their computation graphs. However, this will require a lot of lines of code and we should design Python APIs that support averaging. As per the PaddlePaddle [Python API design](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/python_api.md), the layer functions are responsible for creating operators, operator parameters and variables. Since parameter averaging will be an operator, it makes sense to create it in the layer functions.
Shall we still do the actual computation in C++? I think the computational cost will be very high if we do it in Python.
Done.
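A rough NumPy sketch of the window-plus-buffer accumulation described in the diff above (the algorithm only, not the actual Eigen kernel):

```python
import numpy as np

class WindowedAverager:
    """Accumulates parameter values over windows of N batches; each full
    window's mean is moved into a buffer, which limits the round-off error
    that summing very many values into a single accumulator would cause."""
    def __init__(self, window_size):
        self.window_size = window_size
        self.window_sum = None
        self.in_window = 0
        self.buffer = []  # one mean per completed window

    def add(self, param):
        if self.window_sum is None:
            self.window_sum = np.zeros_like(param, dtype=np.float64)
        self.window_sum += param
        self.in_window += 1
        if self.in_window == self.window_size:
            # Window is full: move its mean into the buffer and reset.
            self.buffer.append(self.window_sum / self.window_size)
            self.window_sum[...] = 0.0
            self.in_window = 0

    def average(self):
        # Mean over completed windows (a trailing partial window is
        # ignored here for brevity).
        return np.mean(self.buffer, axis=0)
```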
LGTM
LGTM!
And thanks to @pengli09 for the review!