
Auto pruning #2603

Closed
wants to merge 41 commits into from

Conversation

Contributor

@NHZlX NHZlX commented Jun 26, 2017

Pruning

Principle:

A trained model contains a large amount of parameter redundancy, and these redundant parameters have very small values, so we can prune them away.

For every layer chosen to be pruned, we add a 0-1 mask of the same size as the layer's parameter; the mask decides which parameters participate in the forward computation.

Assume the number of parameters in a layer is M and the current sparsity ratio is current_spr. We first sort the layer's parameters by absolute value, then select the current_spr * M parameters with the smallest magnitudes and set the corresponding mask entries to zero.
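As a rough illustration only (the real hook is implemented in C++ inside Paddle, and the function name below is made up), the mask-generation step can be sketched in NumPy like this:

import numpy as np

def generate_mask(param, current_spr):
    # Zero out the current_spr fraction of weights with the smallest magnitude.
    flat = np.abs(param).ravel()
    num_pruned = int(current_spr * flat.size)       # smallest current_spr * M entries
    mask = np.ones(flat.size, dtype=param.dtype)
    if num_pruned > 0:
        mask[np.argsort(flat)[:num_pruned]] = 0.0   # indices of the smallest |w|
    return mask.reshape(param.shape)

# masked_param = param * generate_mask(param, current_spr=0.3)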

Paddle uses an automatic, gradual pruning approach, controlled by interval_pass, sparsity_upper_bound and end_pass.
The parameters are pruned every interval_pass passes (a pass is one epoch) while the network is fine-tuned, gradually increasing the sparsity and allowing the network to recover from any pruning-induced loss in accuracy. The network finally reaches sparsity_upper_bound sparsity, and the whole process performs end_pass / interval_pass pruning steps.

As sketched below, we use a log function for the sparsity schedule. We prune the network more aggressively in the initial stage, when there are many redundant parameters, and gradually reduce the number of parameters pruned per step in the later stage, when fewer redundant parameters remain; this helps the network recover from the pruning-induced loss in accuracy.
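The exact schedule is defined in the hook's C++ implementation; the sketch below only illustrates a log-shaped schedule with the properties described above (aggressive early, flattening later, reaching sparsity_upper_bound at end_pass, and only changing every interval_pass passes):

import math

def sparsity_at(pass_id, sparsity_upper_bound, interval_pass, end_pass):
    # Hold the ratio constant between multiples of interval_pass.
    step = min((pass_id // interval_pass) * interval_pass, end_pass)
    # log1p grows quickly at first and then flattens out.
    return sparsity_upper_bound * math.log1p(step) / math.log1p(end_pass)

# With sparsity_upper_bound=0.75, interval_pass=1, end_pass=3 this gives
# roughly 0.0, 0.375, 0.59, 0.75 for passes 0..3.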

Usage:

from paddle.v2.attr import  Hook
from paddle.v2.attr import  ParamAttr

# The default interval_pass value is 3; the default end_pass value is 60
pa = ParamAttr(update_hooks = Hook('dynamic_pruning', sparsity_upper_bound=0.75, interval_pass=1, end_pass=3))

# for conv layer 
paddle.layer.img_conv(input=input,
                      filter_size=3,
                      num_channels=32,
                      num_filters=64,
                      param_attr=pa,
                      act=paddle.activation.Relu())

# for fully connected layer
out = paddle.layer.fc(input=input,
                      size=102,
                      act=paddle.activation.Softmax(),
                      param_attr = pa)

@NHZlX NHZlX requested review from Xreki and hedaoyuan June 26, 2017 07:28
@NHZlX
Contributor Author

NHZlX commented Jul 3, 2017

There is a bug; I am trying to fix it.

@@ -880,6 +880,7 @@ class ParameterUpdater {
* @param param
*/
void update(Parameter* param);
void preprocess(Parameter* param, size_t currentPass, size_t currentBatch);
Contributor

Please add doc comments for the newly added functions in the .h files; the same applies to the ones below.

Contributor Author

Done

for (size_t i = 0; i < para->getSize(); i++){
std::cout << data[i] << " " ;
}
*/
Contributor

Please delete the commented-out code at lines 138-143.

Contributor Author

Done

sum_non += 1;
}
std::cout<<"sum_non: " <<sum_non << " " << para->getSize()<< std::endl;
*/
Contributor

Please delete the commented-out code at lines 169-178.

Contributor Author

Done

@@ -139,6 +220,8 @@ static IParameterUpdaterHook *createImpl(
auto &type = config.type();
if (type == "pruning") {
return new StaticPruningHook(config);
} else if (type == "dpruning") {
Contributor

Could dpruning be spelled out as dynamic_pruning, so that users understand it better?

Contributor Author

Done


void generateMask(Parameter *para) {
virtual void generateMask(Parameter *para, size_t nonZeroNum) {
Contributor

Suggest passing a real sparsityRatio as the second parameter instead.

Contributor Author

OK, sounds good.

: ParameterPruningHook() {
this->upperBound_ = hookConfig.upper_bound();
this->interPass_ = hookConfig.inter_pass();
this->endPass_ = hookConfig.end_pass();
Contributor

These parameter names are not intuitive; a reader has to guess what they mean. Suggestions:

  1. upper_bound -> sparsity_upper_bound
  2. The "inter" in inter_pass presumably stands for interval; abbreviating it to inter obscures the meaning. Could it be renamed to sparsity_increasing_interval, or something similar?
  3. I understand endPass is used to compute the increment of sparsityRatio at each step, but setting endPass here is not appropriate:
  • Users already set num_passes once when training; setting it again here is cumbersome.
  • I also understand we cannot directly reuse the externally set num_passes, because users tend to set num_passes to a very large value and kill the job once training converges.
    So please reconsider how the increment is configured.

Contributor Author

upper_bound has been renamed to sparsity_upper_bound and inter_pass to interval_pass; if the attribute name gets too long the user experience suffers. I think end_pass is necessary: num_passes is the number of passes for the whole training run, while end_pass is the number of passes during which sparsity_ratio keeps changing.


size_t nonZeroNum = para->getSize() * (1 - sparsityRatio);
this->generateMask(para, nonZeroNum);
std::cout << para->getName()
Contributor

If you need to print information for the user, call the glog functions; otherwise delete this, and also remove the corresponding #include <iostream>.

Contributor Author

Done

vec->copyFrom(*this->weightTemp_);
}

void handleBeforeFetch(Parameter *para) override {
Contributor

  1. This is presumably called before each forward computation, so the function name is not intuitive.
  2. All of the added interfaces go through the API and are called from the Python side to control the training flow. Could this instead be called inside the update function in ParameterUpdaterBase.h:
 55   // between startBatch() and finishBatch(), update() will be called
 56   // by the trainer multiple times, each time for updating one Parameter
 57   // with its gradient in PARAMETER_GRADIENT
 58   void update(Parameter* para) {
 59     SetDevice setDevice(para->getDeviceId());
 60     para->updateHook();
 61     this->updateImpl(para);
        // Call handleBeforeFetch(para) here
 62   }

Contributor Author

Done

auto &paraVec = para->getBuf(PARAMETER_VALUE);
weightTemp_->copyFrom(*paraVec);
paraVec->dotMul(*this->maskVec_);
}
Contributor

My understanding of the whole dynamic pruning flow is:

  1. Every inter_pass passes, increase sparsity_ratio and save the parameters into weightTemp; during the next inter_pass, neither weightTemp nor maskVec changes.
  2. Every time backward is called, before the parameter is updated, the data in weightTemp is copied into para->getBuf(PARAMETER_VALUE), which amounts to using weightTemp to compute the gradient update. Does this change the contents of para->getBuf(PARAMETER_VALUE)? Do the changes not need to be saved back into weightTemp?
  3. Every time forward is called, i.e. before the parameter is used, handleBeforeFetch() is called first to sparsify the parameter with vec->dotMul(*maskVec_); the parameters it actually uses are therefore the updated weightTemp parameters.

Contributor Author

  1. preprocess(...) is called on every forward, so weightTemp changes every time; maskVec only changes once every inter_pass passes.
  2. weightTemp stores the parameters from before the current forward, without sparsification. During the forward pass the parameters are first multiplied by the mask to sparsify them, then forward and backward run. Once the gradients are computed, what we update are the non-sparsified parameters from before this forward, because we have no way to change the sparsity of the momentum.
  3. Since the momentum's sparsity cannot be changed, the parameters after each update do not have the sparsity encoded by the mask. So handleBeforeFetch() is called from the Python API when parameter.to_tar() runs: parameter * mask is applied first and the result is then stored.

Contributor

@Xreki Xreki Jul 13, 2017

  • On point 1, I understand weightTemp now. But preprocess and handleBeforeFetch() are both called before every forward and both perform parameter * mask to make the parameter sparse; these two should be mergeable.
  • On points 2 and 3, we actually only need the dense weightTemp during the gradient update, i.e. in updateImpl(). After the gradient update we can apply parameter * mask right away, so that the parameter is sparse by default; then there is no need to apply parameter * mask again before the forward computation or before saving the parameters with to_tar. So please consider the suggestion made at line 166.

Contributor Author

Yes, applying parameter * mask after updateImpl() guarantees that the parameter stays sparse. That works, perfect!
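For reference, the update order agreed on above can be sketched as follows (a minimal NumPy stand-in for the C++ updater: plain SGD replaces updateImpl(), and the function name is hypothetical):

import numpy as np

def sgd_step_with_mask(param, grad, mask, weight_temp, lr=0.01):
    # All arguments are NumPy arrays of the same shape.
    param[:] = weight_temp      # 1. restore dense weights so the optimizer sees full values
    param -= lr * grad          # 2. gradient update (stand-in for updateImpl())
    weight_temp[:] = param      # 3. keep the new dense weights for the next step
    param *= mask               # 4. parameter * mask: the parameter stays sparse afterwards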

@Xreki Xreki added this to In Progress in Embedded and Mobile Deployment Jul 17, 2017
@hedaoyuan hedaoyuan moved this from In Progress to Model Compression in Embedded and Mobile Deployment Aug 3, 2017
@hedaoyuan
Contributor

@NHZlX @Xreki Are there any remaining issues with this PR? @NHZlX please see if you have time this week to refine this PR and get it merged.

this->updateImpl(para);
if (para->useGpu()) {
Contributor

Why can the hook only be used in the GPU case? It seems it would be better to call updateHook after updateImpl?

Contributor Author

updateImpl(para) here only updates the parameters on GPU (https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/trainer/ThreadParameterUpdater.cpp#L85); on CPU the parameters are updated when finishBatch() is called. I'm not sure why it was designed this way, but it really took me a long time to track this down...

if (para->useGpu()) {
maskVec_ = Vector::create(para->getSize(), para->useGpu());
maskVec_->copyFrom(*maskTemp);
this->maskVec_ = Vector::create(para->getSize(), para->useGpu());
Contributor

The hook should exist throughout the whole training process, shouldn't it? In other words, resizeOrCreate should be used here: if maskVec_ has already been created, there is no need to create it again. In static pruning, generateMask is called only once; in dynamic pruning, generateMask is called many times.

Contributor Author

OK, resizeOrCreate should be used here.

auto &paraVec = para->getBuf(PARAMETER_VALUE);
paraVec->dotMul(*maskVec_);
paraVec->dotMul(*this->maskVec_);
Contributor

Just a question here: is *this->maskVec_ equivalent to *(this->maskVec_)? Why add this?

Contributor Author

The this can be removed...

this->sparsityRatio_ = hookConfig.sparsity_ratio();
}

void init(Parameter *para) override {
size_t initCount = this->initCount_.fetch_add(1);
CHECK_EQ(initCount, 0UL) << "Currently the StaticPruningHook must invoke "
"in same ParamterUpdater";
Contributor

invoke -> be invoked
same -> the same

updateThreadChecker_.check();
auto &vec = para->getBuf(PARAMETER_GRADIENT);
auto &vec = para->getBuf(PARAMETER_VALUE);
Contributor

It looks like the original pruning never actually took effect, -____-

Contributor Author

We apply the gradient update first; because of momentum, entries of param that were previously zero may become nonzero again, so we then call update() to set them back to zero.

@@ -207,6 +207,7 @@ void SgdThreadUpdater::finishBatch(real cost) {
for (auto& para : parameters_) {
int pid = para->getID();
optimizers_[pid]->finishBatch();
para->updateHook();
Contributor

Why does updateHook need to be called here? My understanding is that calling updateHook() whenever para->update() is called would be enough?

Contributor Author

The reason is that on CPU the weight update (w = w - w_diff) happens in finishBatch, while on GPU it happens in updateImpl(), see https://github.com/NHZlX/Paddle/blob/7c9d5e5653155aa2a9106ba9621a1fa7f3b3bc0f/paddle/trainer/ThreadParameterUpdater.cpp#L202, so the two updates have to be handled separately.

// dynamic pruning's parameter, sparsity ratio will not change until 'pass %
// interval_pass == 0',
// the change in sparsity ratio is a log curve.
// More details can be found https://github.com/PaddlePaddle/Paddle/pull/2603
Contributor

This could be changed to "see the implementation code for details"; it's better not to reference this PR.

Contributor Author

ok


:param interval_pass: 'dynamic_pruning' hook parameters,
sparsity ratio will not change until 'pass % interval_pass == 0', the change in sparsity ratio is a log curve.
More details can be found https://github.com/PaddlePaddle/Paddle/pull/2603
Contributor

Same here, don't reference the PR. The computation formula could be listed here, in the overall introduction to dynamic pruning.

Contributor Author

@Xreki ok

float), 'sparisity_upper_bound must be float type'
assert self.sparsity_upper_bound <= 1 and self.sparsity_upper_bound >= 0, 'sparsity_upper_bound must be a float between [0, 1] '

if self.interval_pass is not None:
Contributor

For the dynamic pruning type, interval_pass and end_pass are required, right? There should be a check here. Also, for the pruning type these parameters are not needed; print a warning telling the user that these settings have no effect and suggest using the dynamic pruning type.

Contributor Author

interval_pass and end_pass can fall back to their defaults, but I can add a reminder.

*@param currentPass
*@param currentBatch
*/
void preprocess(Parameter* param, size_t currentPass, size_t currentBatch);
Contributor

For the parameter names, I think passId and batchId would be better.

@NHZlX NHZlX closed this Aug 9, 2018