Auto pruning #2603
Conversation
There exists a bug; I am trying to fix it.
paddle/api/PaddleAPI.h
Outdated
@@ -880,6 +880,7 @@ class ParameterUpdater {
 * @param param
 */
void update(Parameter* param);
void preprocess(Parameter* param, size_t currentPass, size_t currentBatch);
Please add comments for all newly added functions in the .h file; the same applies below.
Done
for (size_t i = 0; i < para->getSize(); i++){
  std::cout << data[i] << " ";
}
*/
Please delete the commented-out code at lines 138-143.
Done
  sum_non += 1;
}
std::cout << "sum_non: " << sum_non << " " << para->getSize() << std::endl;
*/
Please delete the commented-out code at lines 169-178.
Done
@@ -139,6 +220,8 @@ static IParameterUpdaterHook *createImpl(
  auto &type = config.type();
  if (type == "pruning") {
    return new StaticPruningHook(config);
  } else if (type == "dpruning") {
Should `dpruning` be written out in full as `dynamic_pruning`, so users can understand it more easily?
Done
…efore fetched in paddle.v2.parameters.get(...)
void generateMask(Parameter *para) {
virtual void generateMask(Parameter *para, size_t nonZeroNum) {
Suggest passing `real sparsityRatio` as the second parameter.
OK, sounds good.
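The suggestion above (passing the sparsity ratio rather than a precomputed non-zero count) can be sketched as follows. This is a hedged illustration in plain C++, not Paddle's actual `Parameter`/`Vector` API; here `generateMask` works on a `std::vector<float>` and keeps the largest weights by absolute value:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical sketch: derive the non-zero count from the sparsity ratio
// inside the hook, instead of making callers precompute it.
std::vector<float> generateMask(const std::vector<float>& param,
                                float sparsityRatio) {
  const size_t n = param.size();
  // Number of weights to keep: the (1 - sparsityRatio) fraction with the
  // largest absolute values.
  const size_t nonZeroNum = static_cast<size_t>(n * (1.0f - sparsityRatio));
  std::vector<size_t> idx(n);
  for (size_t i = 0; i < n; ++i) idx[i] = i;
  // Partially order indices so the first nonZeroNum entries point at the
  // largest-magnitude weights.
  std::nth_element(idx.begin(), idx.begin() + nonZeroNum, idx.end(),
                   [&](size_t a, size_t b) {
                     return std::fabs(param[a]) > std::fabs(param[b]);
                   });
  std::vector<float> mask(n, 0.0f);
  for (size_t i = 0; i < nonZeroNum; ++i) mask[idx[i]] = 1.0f;
  return mask;
}
```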
: ParameterPruningHook() {
  this->upperBound_ = hookConfig.upper_bound();
  this->interPass_ = hookConfig.inter_pass();
  this->endPass_ = hookConfig.end_pass();
The naming of these parameters is very unintuitive; one has to guess what they mean. Suggestions:
- `upper_bound` -> `sparsity_upper_bound`.
- The `inter` in `inter_pass` presumably stands for `interval`; abbreviating it to `inter` obscures the meaning. Could it be changed to `sparsity_increasing_interval`, or something else?
- I understand the purpose of `endPass` is to compute the increment of `sparsityRatio` at each step, but setting an `endPass` parameter here is not a good fit:
  - Users already set `num_passes` once when training; setting it again here is cumbersome.
  - I also understand the outer `num_passes` cannot be reused directly, because users habitually set `num_passes` to a very large value and kill the job once training converges.

So please reconsider how the increment is configured.
`upper_bound` has been changed to `sparsity_upper_bound`, and `inter_pass` to `interval_pass`; overly long attribute names are not great for the user experience. I think `end_pass` is necessary: `num_passes` is the number of passes of the whole training run, while `end_pass` is the number of passes during which `sparsity_ratio` changes.
size_t nonZeroNum = para->getSize() * (1 - sparsityRatio);
this->generateMask(para, nonZeroNum);
std::cout << para->getName()
If the output is meant to inform the user, call the glog functions; otherwise delete it, along with the corresponding `#include <iostream>`.
Done
vec->copyFrom(*this->weightTemp_);
}

void handleBeforeFetch(Parameter *para) override {
- This should be called before each `forward` computation, right? The function name is not intuitive.
- All the added interfaces go through the API layer, controlling the training flow from the Python side. Could this instead be called from the `update` function in `ParameterUpdaterBase.h`?

55 // between startBatch() and finishBatch(), update() will be called
56 // by the trainer multiple times, each time for updating one Parameter
57 // with its gradient in PARAMETER_GRADIENT
58 void update(Parameter* para) {
59   SetDevice setDevice(para->getDeviceId());
60   para->updateHook();
61   this->updateImpl(para);
     // Call handleBeforeFetch(para) here
62 }
Done
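The reviewer's suggestion can be sketched in plain C++ (a hedged illustration in which `Param`, `updateImpl`, and `update` are simplified stand-ins for Paddle's `Parameter` and updater classes): apply the pruning mask right after the gradient update inside `update`, so no separate pre-fetch hook is needed.

```cpp
#include <cstddef>
#include <vector>

// Simplified stand-in for Paddle's Parameter: a dense value buffer plus a
// 0/1 pruning mask of the same size.
struct Param {
  std::vector<float> value;
  std::vector<float> mask;
};

// Plain SGD step, standing in for the updater's updateImpl.
void updateImpl(Param& p, const std::vector<float>& grad, float lr) {
  for (size_t i = 0; i < p.value.size(); ++i) p.value[i] -= lr * grad[i];
}

// update() per the suggestion: gradient step first, then re-sparsify —
// the moral equivalent of paraVec->dotMul(*maskVec_).
void update(Param& p, const std::vector<float>& grad, float lr) {
  updateImpl(p, grad, lr);
  for (size_t i = 0; i < p.value.size(); ++i) p.value[i] *= p.mask[i];
}
```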
auto &paraVec = para->getBuf(PARAMETER_VALUE);
weightTemp_->copyFrom(*paraVec);
paraVec->dotMul(*this->maskVec_);
}
My understanding of the whole dynamic pruning flow is:
- Every `inter_pass` passes, increase `sparsity_ratio` and save the parameters into `weightTemp`. During the next `inter_pass`, neither `weightTemp` nor `maskVec` changes.
- On every `backward` call, before the `parameter` is updated, the data in `weightTemp` is copied into `para->getBuf(PARAMETER_VALUE)`, which effectively uses `weightTemp` to compute the gradient update. Does this change the contents of `para->getBuf(PARAMETER_VALUE)`? And do the changed contents not need to be saved back into `weightTemp`?
- On every `forward` call, i.e. before the `parameter` is used, `handleBeforeFetch()` is called first to sparsify the parameters via `vec->dotMul(*maskVec_);`; the parameters it actually uses are the ones obtained by updating `weightTemp`.
- `preprocess(...)` is called on every forward, so `weightTemp` changes every time, while `maskVec` only changes once every `inter_pass`.
- `weightTemp` stores the parameters from before the current forward, without sparsification. During the forward, the parameters are first multiplied by the mask to sparsify them, then forward/backward runs. Once the gradients are computed, what we update are the un-sparsified parameters from before this forward, because we have no way to change the sparsity of the momentum.
- Since the momentum's sparsity cannot be changed, after each parameter update the parameters do not have the sparsity the mask represents. So `handleBeforeFetch()` is called from the Python API, in `parameter.to_tar()`, to compute `parameter * mask` first and then save the result.
- On point 1, I now understand `weightTemp`. But `preprocess` and `handleBeforeFetch()` are both called before every `forward`, and both execute `parameter * mask` to make the `parameter` sparse; these two should be mergeable.
- On points 2 and 3, we only actually need the dense `weightTemp` during the gradient update, i.e. in `updateImpl()`. Once the gradient update finishes, we can immediately do `parameter * mask`, making sparse the `parameter`'s normal state. That way there is no extra `parameter * mask` to run before the `forward` computation or before saving the parameters with `to_tar`. So please consider the suggestion made at line 166.
Yes, doing `parameter * mask` after `updateImpl()` guarantees that `parameter` stays sparse. That works, perfect!
fix format
fix format
fix compile bug on mac
this->updateImpl(para);
if (para->useGpu()) {
Why can the hook only be used in the GPU case? It seems calling `updateHook` after `updateImpl` would work better?
`updateImpl(para)` here only updates on GPU: https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/trainer/ThreadParameterUpdater.cpp#L85. On CPU the parameters are updated when `finishBatch()` is called. I am not sure why it was designed this way, but it took me quite a while to track this down...
if (para->useGpu()) {
maskVec_ = Vector::create(para->getSize(), para->useGpu());
maskVec_->copyFrom(*maskTemp);
this->maskVec_ = Vector::create(para->getSize(), para->useGpu());
The hook should exist for the whole training run, right? So shouldn't this use `resizeOrCreate`? That is, if `maskVec_` has already been `create`d, it should not be `create`d again. In static pruning, `generateMask` is called only once; in dynamic pruning, it is called multiple times.
OK, `resizeOrCreate` should be used here.
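The resize-or-create pattern agreed on above can be sketched like this (a hedged stand-in using `std::shared_ptr<std::vector<float>>` in place of Paddle's `VectorPtr`; the real `resizeOrCreate` differs in detail): the buffer is allocated on the first call and only resized on later calls, so dynamic pruning's repeated `generateMask` invocations do not re-create it.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical holder mimicking resizeOrCreate semantics for the mask:
// allocate once, reuse the same buffer on subsequent calls.
struct MaskHolder {
  std::shared_ptr<std::vector<float>> maskVec_;

  void resizeOrCreate(size_t n) {
    if (!maskVec_) {
      maskVec_ = std::make_shared<std::vector<float>>(n);  // first call only
    } else {
      maskVec_->resize(n);  // later calls reuse the existing buffer
    }
  }
};
```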
auto &paraVec = para->getBuf(PARAMETER_VALUE);
paraVec->dotMul(*maskVec_);
paraVec->dotMul(*this->maskVec_);
Just a question here: is `*this->maskVec_` equivalent to `*(this->maskVec_)`? Why add `this`?
Yes, `this` can be removed...
this->sparsityRatio_ = hookConfig.sparsity_ratio();
}

void init(Parameter *para) override {
  size_t initCount = this->initCount_.fetch_add(1);
  CHECK_EQ(initCount, 0UL) << "Currently the StaticPruningHook must invoke "
                              "in same ParamterUpdater";
invoke -> be invoked
same -> the same
updateThreadChecker_.check();
auto &vec = para->getBuf(PARAMETER_GRADIENT);
auto &vec = para->getBuf(PARAMETER_VALUE);
It looks like the original pruning never took effect at all, -____-
We do the gradient update first; because of momentum, params that were previously 0 may become nonzero again, so we then call `update()` to set those params back to 0.
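The momentum effect described above can be shown with a tiny numeric sketch (plain C++, not Paddle code; `State` and `sgdMomentumStep` are illustrative names): even with zero gradient, the momentum buffer pushes a pruned weight off zero, so the mask must be re-applied after the update.

```cpp
// Weight plus its momentum buffer.
struct State {
  float w;
  float v;
};

// One SGD-with-momentum step: v = mu * v + grad; w -= lr * v.
// The momentum term keeps moving w even when grad == 0.
void sgdMomentumStep(State& s, float grad, float lr, float mu) {
  s.v = mu * s.v + grad;
  s.w -= lr * s.v;
}
```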
@@ -207,6 +207,7 @@ void SgdThreadUpdater::finishBatch(real cost) {
  for (auto& para : parameters_) {
    int pid = para->getID();
    optimizers_[pid]->finishBatch();
    para->updateHook();
Why does `updateHook` need to be called here? My understanding is that it is enough to call `updateHook()` whenever `para->update()` is called?
The reason is that the CPU updates the weights (w = w - w_diff) in `finishBatch()`, while the GPU does it in `updateImpl()`; see https://github.com/NHZlX/Paddle/blob/7c9d5e5653155aa2a9106ba9621a1fa7f3b3bc0f/paddle/trainer/ThreadParameterUpdater.cpp#L202. So the updates have to be done separately.
// dynamic pruning's parameter, sparsity ratio will not change until 'pass %
// interval_pass == 0',
// the change in sparsity ratio is a log curve.
// More details can be found https://github.com/PaddlePaddle/Paddle/pull/2603
This could be changed to "see the implementation code for details"; better not to reference this PR.
ok
:param interval_pass: 'dynamic_pruning' hook parameters,
    sparsity ratio will not change until 'pass % interval_pass == 0', the change in sparsity ratio is a log curve.
    More details can be found https://github.com/PaddlePaddle/Paddle/pull/2603
Same here: do not reference the PR. You could list the formula for the computation here, in the overall introduction to dynamic pruning.
@Xreki ok
float), 'sparsity_upper_bound must be float type'
assert self.sparsity_upper_bound <= 1 and self.sparsity_upper_bound >= 0, 'sparsity_upper_bound must be a float between [0, 1] '

if self.interval_pass is not None:
For the dynamic pruning type, `interval_pass` and `end_pass` must be set, right? That should be checked here. Also, for the pruning type these parameters are not needed; print a warning that those settings have no effect and remind the user to use the dynamic pruning type.
`interval_pass` and `end_pass` can fall back to defaults, but a reminder can be added.
 *@param currentPass
 *@param currentBatch
 */
void preprocess(Parameter* param, size_t currentPass, size_t currentBatch);
For the parameter names, I think `passId` and `batchId` would be better.
Pruning
Principle:
A trained model has a large amount of parameter redundancy, and these redundant parameters' values are very small, so we can cut them off.
For every layer chosen to be pruned, we add a 0-1 mask of the same size as the layer's parameter, which decides which of the parameters participate in the forward process.
Assume the number of parameters in a layer is `M` and the current sparsity ratio is `current_spr`. We first order the parameters in that layer by absolute value, then pick the smallest `current_spr * M` of them and set the corresponding mask values to zero.
Paddle uses an automatic, gradual pruning approach. We use `interval_pass`, `sparsity_upper_bound`, and `end_pass` to control the process. The parameters are pruned every `interval_pass` passes (a pass represents an epoch) as the network is fine-tuned, gradually increasing the sparsity while allowing the network to recover from any pruning-induced loss in accuracy. The network finally reaches `sparsity_upper_bound` sparsity, and the whole process performs `end_pass / interval_pass` rounds of pruning.
We use a log function for the sparsity schedule. We prune the network more aggressively in the initial stage, when there are many redundant parameters, and gradually reduce the number of parameters being cut in the later stage, when fewer redundant parameters remain; this helps the network recover from the pruning-induced loss in accuracy.
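The log-shaped schedule described above can be sketched as follows. The text does not give the exact equation, so this is a hedged, illustrative formula (`sparsityAt` and the `log1p` normalization are assumptions): it rises steeply early, flattens later, and reaches `sparsity_upper_bound` at `end_pass`.

```cpp
#include <cmath>

// Illustrative log-curve schedule: aggressive pruning early, gentler later,
// reaching upperBound exactly at endPass. Not necessarily the PR's formula.
float sparsityAt(int pass, int endPass, float upperBound) {
  if (pass >= endPass) return upperBound;
  // log1p maps [0, endPass] onto [0, upperBound] with a concave curve.
  return upperBound * std::log1p(static_cast<float>(pass)) /
         std::log1p(static_cast<float>(endPass));
}
```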
Usage: