
Add pruning possibilities at inner_product_layer #4294

Open · wants to merge 1 commit into master

Conversation

Caenorst

This adds the Deep Compression pruning feature to the InnerProduct layer, following http://arxiv.org/abs/1506.02626; it can also be used as a form of regularization, as in https://arxiv.org/abs/1602.07360.

To use it, just add the following in the prototxt:

layer {
  ...
  type: "InnerProduct"
  ...
  pruning_param {
    coeff: _here goes the pruning rate, between 0 and 1, with 0 = no pruning_
  }
  ...
}

@@ -915,6 +916,11 @@ message PowerParameter {
optional float shift = 3 [default = 0.0];
}

message PruningParameter {
// Pruning coefficient for deep compression
@seanbell Jun 12, 2016

It would be good to document what this parameter does here. The current comment adds no information. It looks like it's the fraction of weights to keep, sorted by absolute value?

@seanbell commented Jun 12, 2016

Thanks for the PR -- it looks like this is most of what is needed to implement weight compression.

It looks like the current PR doesn't actually compress any data. The weights are dropped at load time, not save time. Also, it looks like the set of pruned weights is fixed -- the mask never changes? Does that mean the intent is to only add this to a fully trained model at deploy time?

I don't quite understand the point of this PR. It looks like it will always be slower (from the extra mask multiply) and have a lower accuracy, but with no improvement in final model size (actual size on disk). It's missing the part where you make the model file smaller because some weights are now 0. Was that going to be in a separate PR?

@seanbell commented Jun 12, 2016

I don't see any unit tests.

@Caenorst (Author)

Thank you for your review,

Sorry for the messy PR, it's my first PR ever (so I'm not really confident with the process, nor with Git)... I will take more care next time, and add unit tests as well.

You are right about the parameter, I'm going to change the comment.

In the publications they get better accuracy at around ~70% pruning (it acts as a form of regularization). Also, at the end of the method we can use the sparsity to do the compression at deployment (with sparse GEMM), which is coming; should I have implemented it in the same PR?

@Caenorst (Author)

And yes, this is supposed to be used on an already trained model, and the masks don't change after the pruning during training, as there is no rule about how the pruning parameter is supposed to change.

@seanbell

which is coming; should I have implemented it in the same PR?

No, it's good to split up PRs in small units of functionality; that makes them easier to review.

@ajtulloch (Contributor)

In practice to run a sufficiently pruned model, it might be worth just giving the input W as three blobs for the CSR representation, and calling mkl_csrmm on CPU/cusparseScsrmm on GPU. That's what the deep compression papers do when reporting their speedups for inference time (Appendix A in https://arxiv.org/pdf/1510.00149.pdf).

You can do the sparsification + conversion from dense to sparse in a few lines of PyCaffe by masking the original W, calling scipy.sparse.csr_matrix, and extracting the CSR ndarrays from the resulting csr_matrix and passing them to your new SparseInnerProduct layer.
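
A minimal PyCaffe sketch of that conversion (a sketch only: the paths, the layer name 'ip1', the keep fraction and the variable names are illustrative assumptions, and "SparseInnerProduct" here means a layer like the one discussed later in this thread, not anything in mainline Caffe):

import numpy as np
import scipy.sparse as sp

import caffe

# Load a trained model (placeholder paths).
net = caffe.Net('deploy.prototxt', 'trained.caffemodel', caffe.TEST)

layer = 'ip1'          # hypothetical InnerProduct layer to sparsify
keep_fraction = 0.3    # keep the 30% largest-magnitude weights (illustrative)

W = net.params[layer][0].data
# Mask: zero out the smallest weights by absolute value.
threshold = np.percentile(np.abs(W), 100.0 * (1.0 - keep_fraction))
W_masked = np.where(np.abs(W) >= threshold, W, 0).astype(W.dtype)

# Dense -> CSR: these three arrays are what a SparseInnerProduct-style layer
# (or mkl_csrmm / cusparseScsrmm) would consume.
W_csr = sp.csr_matrix(W_masked)
vals = W_csr.data        # csrValA
col_ind = W_csr.indices  # csrColIndA
row_ptr = W_csr.indptr   # csrRowPtrA

print('kept %d of %d weights' % (W_csr.nnz, W.size))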

@Caenorst (Author)

So you think it's better to do the retraining directly on the sparse representation? I was afraid it wouldn't give a lot of flexibility for playing with pruning.

Also, I'd like to implement DropConnect later, and I thought it would fit well with the mask that is already there.

While the slowdown from the mask is almost unnoticeable (but it could be important with a not-sparse-enough model and csrmm), I thought about doing the conversion only for deploy.

I could do it one way or the other, depending on what the community prefers; let me know.

@ajtulloch (Contributor)

Ah, I was just talking about replicating the speedups of the model.

For replicating that paper, the way @songhan did it IIRC was adding a mask_ SyncedMemory of the same size as data_ to Blob, exposing it to pycaffe so Python can get/set the mask, and overriding Blob::Update to conditionally mask out updates. Then everything else works, IIRC.

@Caenorst (Author)

Oh, OK, got it. Is it important to replicate it that way? I don't feel confident about changing the whole Blob::Update just to add this feature to InnerProduct (I don't intend to add it to convolution layers, as the result is far less interesting).

@jpiabrantes

@ajtulloch I'm trying to implement what you suggested. cusparseScsrmm takes a bunch of arguments:

cusparseScsrmm(cusparseHandle_t handle,
 cusparseOperation_t transA,
 int m, 
 int n,
 int k,
 int nnz, 
 const float *alpha,
 const cusparseMatDescr_t descrA,
 const float *csrValA, 
 const int *csrRowPtrA, 
 const int *csrColIndA, 
 const float *B,
 int ldb, 
 const float *beta, 
 float *C, 
 int ldc)

I don't know what to use for handle, transA, ldb and ldc.

In my pycaffe code I have the following:

from scipy.sparse import csr_matrix

# `conv` holds the name of the layer being sparsified; add_blob is my helper
# that appends an ndarray as a new blob.
sparse = csr_matrix(net.params[conv][0].data)
add_blob(sparse.data)     # csrValA
add_blob(sparse.indices)  # csrColIndA
add_blob(sparse.indptr)   # csrRowPtrA

Could you give me an example of what to call in my SparseInnerProduct layer's .cu file?

P.S. What function would I call in cudnn_conv_layer.cu?

Thank you very much.

@beniz commented Jun 16, 2016

This may not be what you need here, but I've fixed and merged an old PR with SparseInnerProduct in it, along with a larger set of changes for sparse computations on both CPU and GPU, see https://github.com/beniz/caffe/blob/master_dd_integ_sparse/src/caffe/layers/sparse_inner_product_layer.cu

I'm interested in any potential improvement to this sparse layer, though we already use it fine in many tasks.

@ajtulloch (Contributor)

@beniz that's perfect, that's exactly what I'm talking about. Have you considered PR'ing that so it's more visible and maybe gets merged? It's a very useful layer, and that implementation looks really nice.

@beniz commented Jun 16, 2016

@ajtulloch I've tested the waters for a PR here #2364 (comment) but until now, no real feedback. For a full list of functionalities, see https://github.com/beniz/deepdetect/pull/142. The implementation originates from #2364 and @alemagnani, I've rebased and added support for MemorySparseDataLayer and batches.

I'd be happy to PR it, because my fear is that it interferes with later Caffe changes and takes time to maintain on our own branch, though we're committed to it, as this is going into production. I'm a bit too busy at the moment, so if this is helpful now and someone wants to PR it quickly, please do.

@Caenorst (Author) commented Jun 16, 2016

Are you sure we are talking about the same thing? If I understood well what you've done, the bottom is sparse, but deep compression is supposed to have sparse weights, isn't it?

But even if I'm right, it can be a nice inspiration for doing the speedup version.

EDIT:
I still think that doing the pruning offline with pycaffe is not so convenient, because there are a lot of tests to run to apply deep compression, and calculating the mask offline forces us to keep the mask in the prototxt (which doubles the size of the data in the layer), so I think it's easier to just have one option in the prototxt.
And the slowdown is negligible (if you don't use the pruning, it just adds one boolean check).

For the sparse representation it's another story; let me know if you disagree.

@beniz commented Jun 16, 2016

If I understood well what you've done, the bottom is sparse, but deep compression is supposed to have sparse weights, isn't it?

Correct, good point.

@Harick1 commented Jul 20, 2016

@beniz I saw your implementation of sparse matrix computation, that's perfect. But I have some questions.
Did you test the speed of the function caffe_cpu_csr_gemm(), and which BLAS library did you use?
I used MKL to test the speed, but I found that the sparse version was slower than the normal version. That's strange.
In my case, the input/output data is dense and the weights are sparse. Could you give me some advice?

@beniz commented Jul 20, 2016

@yeahkun I haven't tested against other BLAS libraries; I use OpenBLAS everywhere. Have you tested the speed of caffe_cpu_csr_gemm() alone? On general performance, see #2364 (comment) and #2364 (comment).

For sparse weights, as stated by @Caenorst you certainly need to modify the code because at the moment it is casting the bottom layer blobs into SparseBlob and using dense weights.

@Harick1 commented Jul 21, 2016

@beniz Actually, I have trained a sparse model according to the deep compression method mentioned by @Caenorst. But at inference time, I used the sparse matrix-matrix computation function in MKL (mkl_scsrmm()) to replace the original dense computation function cblas_sgemm(), and I found it's slower with mkl_scsrmm() (~0.19s, compared to cblas_sgemm(), ~0.14s). I also tested the speed of caffe_cpu_csr_gemm() (~0.2s).

I used GoogLeNet for all the above experiments, and the compression rate is about 30% for the sparse model.

@DanikKamilov commented Nov 26, 2016

Hi, I have an error: Error parsing text-format caffe.NetParameter: 109:17: Message type "caffe.LayerParameter" has no field named "pruning_param".
What should I do?

@mistiansen

Hi, is this usable?

@Caenorst (Author) commented Jun 2, 2017

Hi, it should still be usable, but please note that it has never been merged (I actually forgot that I had to do the unit tests - -'), so I guess you need to compile my own version.
@saiguruju: just add the following in your model prototxt:

layer {
  ...
  type: "InnerProduct"
  ...
  pruning_param {
    coeff: _here goes the pruning rate, between 0 and 1, with 0 = no pruning_
  }
  ...
}

@xizi commented Jul 17, 2017

@beniz hi, is there a tutorial on how to use the sparse matrix computation method?

@xizi commented Jul 17, 2017

@Caenorst hi, is there a detailed document on how to use the model pruning?

@zyclonb commented Oct 20, 2017

@yeahkun Does "trained a sparse model" mean first pruning a pre-trained model and then doing fine-tuning to recover the accuracy?

@zyclonb commented Oct 20, 2017

@Caenorst Thanks for sharing! Here is my understanding of your work. Please correct me if anything is misunderstood.

  1. To prune a pre-trained caffemodel is simply to mask small weights according to the pruning parameter. And the reason to do it online, instead of offline, is the flexibility to play with different pruning parameters when checking inference accuracy/performance.

  2. No modification to the original graph topology is involved in pruning.

  3. Thus no re-training/fine-tuning is needed for the pruned model.

@Caenorst (Author)

Hi @zyclonb,
given the original publication, you do need to fine-tune the network after pruning it. My feature allows you to do so.

I used an online approach because there is actually no reason to do it offline (it adds another manipulation for the user, and you would also have to store the mask in memory). You don't need to save the mask; you can directly set the weights to 0. Then, when you want to re-prune, the weights at 0 will obviously be pruned first.
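
For illustration, a rough PyCaffe sketch of what "directly set the weights to 0" means (the paths, the layer name 'fc6' and the coeff value are placeholder assumptions, not code from this PR):

import numpy as np

import caffe

net = caffe.Net('train_val.prototxt', 'trained.caffemodel', caffe.TEST)

layer = 'fc6'   # hypothetical InnerProduct layer
coeff = 0.7     # pruning rate, as in pruning_param { coeff: ... }

W = net.params[layer][0].data
# Zero the `coeff` fraction of weights with the smallest absolute value.
# Weights that are already 0 sort first, so re-pruning prunes them again.
threshold = np.percentile(np.abs(W), 100.0 * coeff)
W[np.abs(W) < threshold] = 0

# Save the pruned weights; no separate mask is stored anywhere.
net.save('pruned.caffemodel')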

@xizi
I don't know how to explain how to use the pruning method better than what I wrote earlier in the thread.
Of course you have to know a little about what this pruning method is, otherwise it makes no sense to try to use it.

Also, I haven't applied the sparse GEMM approach, but I believe it can be done separately.

@@ -389,6 +389,7 @@ message LayerParameter {
optional PoolingParameter pooling_param = 121;
optional PowerParameter power_param = 122;
optional PReLUParameter prelu_param = 131;
optional PruningParameter pruning_param = 148;
Member

Shouldn't it be 147? Since in the comment above you say that the next available ID is 148.

Author

Indeed, it should be 147.

@mristin commented May 14, 2018

Hi,
Could someone please tell me what the status of this pull request is? Is any progress planned? Should we pick it up (and if so, could you maybe point me to what remains to be done)?

Is there maybe a parallel and similar pull request? I would be interested to test the compression proposed by Han et al. in "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding", ICLR 2016.

I suppose this pull request would handle the "pruning" part?
