Design doc for Model average (renaming it to Parameter Average) #5137
Conversation
doc/design/parameter_average.md
Outdated
@@ -0,0 +1,70 @@
# Averaging Parameter Updates in PaddlePaddle
I think "Updates" should be dropped.
Done.
doc/design/parameter_average.md
Outdated
## Why Averaging
In large scale machine learning, it could take us a large number of iterations over the training data to achieve optimal values of parameters of our model. The size of training data could be very large for large scale training hence it is desirable if we can obtain the optimal parameters by going through the data in as few passes as we can.
Polyak and Juditsky (1992) showed that asymptotically the test performance of simple average of parameters obtained by Stochastic Gradient Descent (SGD) is as good as that of parameters which minimize the empirical cost.
It seems that this paragraph comes from the abstract of this paper by Wei Xu. I think we need to add a citation here or rephrase this paragraph.
Done.
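For reference, the "simple average" in the Polyak and Juditsky result is the running mean of the SGD iterates, which (in generic notation, not notation taken from the design doc) can be written as:

```latex
% Polyak-Ruppert averaging: \theta_t is the parameter vector after the
% t-th SGD update; the averaged parameters \bar{\theta}_T are the ones
% used at test time.
\bar{\theta}_T = \frac{1}{T} \sum_{t=1}^{T} \theta_t
```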
doc/design/parameter_average.md
Outdated
4. Perform testing and/or save the parameters.
5. Restore the values of the parameters once done.
### How to do implement Averaging of Parameter Updates in PaddlePaddle
Drop "do"
Drop "Updates"? -> see above
Done.
doc/design/parameter_average.md
Outdated
We can add the ParameterAverageOptimizer op to the graph through Python API. Using this approach, we manually add this op to the graph and direct the output of the optimizer op to this op during training.

**Advantages**:
- Allows for greater flexibility to the users of Paddle. Using this approach, the users can plug different optimizers into ParameterAverageOptimizer by passing in the optimizer to the op.
s/Paddle/PaddlePaddle/g
Done.
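To make the "plug in any optimizer" idea concrete, here is a minimal plain-NumPy sketch of the wrapping behavior; the class and method names are illustrative only, not the PaddlePaddle API:

```python
import numpy as np

class SGD:
    """A stand-in inner optimizer: w <- w - lr * grad."""
    def __init__(self, lr=0.01):
        self.lr = lr

    def update(self, param, grad):
        return param - self.lr * grad

class ParameterAverageOptimizer:
    """Wraps any inner optimizer and maintains a running (Polyak) average
    of the parameter values it produces."""
    def __init__(self, inner):
        self.inner = inner
        self.avg = None
        self.count = 0

    def update(self, param, grad):
        new_param = self.inner.update(param, grad)
        self.count += 1
        if self.avg is None:
            self.avg = new_param.copy()
        else:
            # Incremental mean: avg += (x - avg) / n.
            self.avg += (new_param - self.avg) / self.count
        return new_param

# Train with the raw parameters; evaluate or checkpoint with opt.avg.
opt = ParameterAverageOptimizer(SGD(lr=0.1))
w = np.zeros(3)
for grad in (np.ones(3), 2 * np.ones(3)):
    w = opt.update(w, grad)
print(w, opt.avg)  # raw iterate [-0.3, ...] vs. averaged [-0.2, ...]
```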
doc/design/parameter_average.md
Outdated
The ParameterAverageOptimizer op can be like any other operator with its own CPU/GPU implementation either using Eigen or separate CPU and GPU kernels. As the initial implementation, we can implement the kernel using Eigen following the abstraction pattern implemented for [Operators](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/rmsprop_op.h).

The idea of building an op for averaging is in sync with the refactored Paddle philosophy of using operators to represent any computation unit. The way the op will be added to the computation graph will be decided by the [layer functions](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/python_api.md#layer-function) in Python API.
s/Paddle/PaddlePaddle/g
Done.
doc/design/parameter_average.md
Outdated
- the optimizer
- the window_size to keep the updates

The ParameterAverageOptimizer op can be like any other operator with its own CPU/GPU implementation either using Eigen or separate CPU and GPU kernels. As the initial implementation, we can implement the kernel using Eigen following the abstraction pattern implemented for [Operators](https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/operators/rmsprop_op.h).
We should also support the case where the trainer runs on GPU while ParameterAverageOptimizer runs on CPU, in order to save GPU memory.
Done.
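The point above can be sketched in a few lines: the averaging state lives in host memory, so the only per-step GPU cost is copying the updated parameter to the CPU. `fetch_param` below is a placeholder for a device-to-host copy, not a real PaddlePaddle call:

```python
import numpy as np

def update_host_average(avg_cpu, fetch_param, step):
    """Keep the running average in CPU memory; fetch_param() stands in for
    copying the freshly updated parameter from GPU to host, so no extra
    averaging buffer needs to be allocated on the GPU."""
    param_cpu = fetch_param()
    avg_cpu += (param_cpu - avg_cpu) / step  # incremental mean on the CPU
    return avg_cpu
```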
doc/design/parameter_average.md
Outdated
- Any optimizer (RMSProp, AdaGrad etc.)
- A window size. The op keeps accumulating updated parameter values over a window of N batches and takes an average. Move the averaged value to a buffer when window is full to avoid loss of precision.

Using the ParameterAverageOptimizer op, any user can add the operation to their computation graphs. However, this will require a lot of lines of code and we should design Python APIs that support averaging. As per the PaddlePaddle [Python API design](https://github.com/PaddlePaddle/Paddle/blob/develop/doc/design/python_api.md), the layer functions are responsible for creating operators, operator parameters and variables. Since parameter averaging will be an operator, it makes sense to create it in the layer functions.
Shall we still do the actual computation in C++? I think the computational cost will be very high if we do it in Python.
Done.
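A rough NumPy sketch of the window-plus-buffer accumulation described in the diff above (the algorithm only, not the actual Eigen kernel):

```python
import numpy as np

class WindowedAverager:
    """Accumulates parameter values over windows of N batches; each full
    window's mean is moved into a buffer, which limits the round-off error
    that summing very many values into a single accumulator would cause."""
    def __init__(self, window_size):
        self.window_size = window_size
        self.window_sum = None
        self.in_window = 0
        self.buffer = []  # one mean per completed window

    def add(self, param):
        if self.window_sum is None:
            self.window_sum = np.zeros_like(param, dtype=np.float64)
        self.window_sum += param
        self.in_window += 1
        if self.in_window == self.window_size:
            # Window is full: move its mean into the buffer and reset.
            self.buffer.append(self.window_sum / self.window_size)
            self.window_sum[...] = 0.0
            self.in_window = 0

    def average(self):
        # Mean over completed windows (a trailing partial window is
        # ignored here for brevity).
        return np.mean(self.buffer, axis=0)
```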
LGTM
LGTM!
And thanks to @pengli09 for the review!