
Fft based convolutional layer #544

Closed
wants to merge 9 commits into from

Conversation

borisgin

I implemented a convolutional layer with an FFT-based Forward(). There is no FFT support in Backward() yet. The implementation is based on the FFTW3 library and was tested both with native FFTW3 and with MKL. In addition, it supports OpenMP to utilize all cores; this was tested with native gcc OpenMP and with MKL.

The current version is CPU-only. Is anybody interested in doing a CUDA version?

My impression, based on the current CPU implementation (FFT + OpenMP), is that an FFT-based convolutional layer makes sense only for large kernels (kernel_size / stride >= 7). There are more details on the benchmark below:
I modified net_speed_benchmark to test Forward() only, then I took the "examples/imagenet" topology and modified the first two convolutional layers:

  • batch = 128
  • stride = 1
  • kernel = {5, 7, 9, 11, 13, 15}
  • 10 forward iterations

For each kernel I slightly changed the crop size in the data layers to make the map size FFT-friendly (128, 256, ...). The results are below (time is seconds for 10 forward iterations):

    Layer  Kernel  Input            Output            base, sec  fft, sec
    conv1  15      128x3x242x242    128x96x228x228           79        28
    conv2  15      128x96x114x114   128x256x104x104         549       168
    conv1  13      128x3x244x244    128x96x232x232           58        30
    conv2  13      128x96x116x116   128x256x108x108         431       170
    conv1  11      128x3x246x246    128x96x236x236           44        28
    conv2  11      128x96x118x118   128x256x112x112         314       168
    conv1   9      128x3x248x248    128x96x240x240           33        29
    conv2   9      128x96x120x120   128x256x116x116         230       170
    conv1   7      128x3x250x250    128x96x244x244           23        29
    conv2   7      128x96x122x122   128x256x122x122         152       170
    conv1   5      128x3x252x252    128x96x248x248           16        28
    conv2   5      128x96x124x124   128x256x120x120          83       167
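
A back-of-the-envelope cost model consistent with these numbers (my framing, not part of the original benchmark): per image, direct convolution costs about

    T_direct ~ C_in * C_out * H_out * W_out * K^2

multiply-adds, while the FFT path costs about

    T_fft ~ (C_in + C_out) * O(H*W * log(H*W)) + C_in * C_out * O(H*W)

for the transforms plus the element-wise spectrum products, with no K^2 term. That is why the fft column stays nearly flat (~28 s for conv1, ~170 s for conv2) while the base column grows roughly quadratically with kernel size.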

@shelhamer
Member

Thanks for your work exploring FFT convolution for CNNs @borisgin. We've been meaning to do this for a while. @forresti, since you had an earlier interest in this, perhaps you could do a review.

To integrate this, a conv layer TYPE parameter should first be added for selecting BLAS vs. FFT convolution.
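
For context, a sketch of the kind of selector being asked for here, modeled on the Engine enum that Caffe's ConvolutionParameter later gained for cuDNN; the FFT value and the field number are illustrative, not from this PR:

message ConvolutionParameter {
  // ...
  enum Engine {
    DEFAULT = 0;
    CAFFE = 1;  // im2col + BLAS path
    FFT = 2;    // illustrative value for an FFT engine
  }
  optional Engine engine = 15 [default = DEFAULT];
}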

A complete implementation including GPU should match or improve on the performance reported here: http://benanne.github.io/2014/05/12/fft-convolutions-in-theano.html

p.s. Please rebase against BVLC/dev. Also note your Makefile.config should not be versioned.

@borisgin
Author

Sure, I will add an fft = on/off parameter to the layer description. Currently, by default, fft is switched on dynamically when kernel_size/stride > 5. It can be turned on and off manually on the fly using FFT_on() and FFT_off().
No problem with adding CUDA FFT as described in the link.

REBASE ISSUE:
I tried to pull BVLC/dev ( git clone https://github.com/BVLC/caffe/ -b dev ),
but during compilation I got the following error:
In file included from ./include/caffe/vision_layers.hpp:15:0,
from src/caffe/layer_factory.cpp:9:
./include/caffe/data_layers.hpp:11:18: fatal error: lmdb.h: No such file or directory
compilation terminated
When I use borisgin/dev everything is ok. Any suggestion for a resolution?

@sguada
Contributor

sguada commented Jun 26, 2014

We added new functionality and now you need to add the lmdb library.
Sergio

@borisgin
Author

Thanks! It worked. Did you observe any speed-up from replacing leveldb with lmdb?

@borisgin
Author

Hi, I added the lmdb lib and rebased the fft branch against BVLC/dev.

@luotao1

luotao1 commented Jul 1, 2014

How is the performance compared to your OpenMP version?

@borisgin
Author

borisgin commented Jul 1, 2014

All the data above is for Forward(), CPU only. The comparison is (MKL BLAS + OpenMP) vs. (FFT + OpenMP). You can disable OpenMP in Makefile.config.
I also added an "fft=0/1" parameter to the convolutional layer parameters so you can turn FFT on/off for each layer.
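
For illustration, a sketch of how such a per-layer switch might look in a prototxt; the exact field spelling in this branch may differ, and later in the thread this flag is superseded by the engine: FFT setting:

layers {
  name: "conv2"
  type: CONVOLUTION
  convolution_param {
    num_output: 256
    kernel_size: 15
    fft: 1  # hypothetical spelling of the per-layer on/off switch
  }
}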

@@ -195,16 +196,17 @@ ifeq ($(BLAS), mkl)
 COMMON_FLAGS += -DUSE_MKL
 MKL_DIR = /opt/intel/mkl
 BLAS_INCLUDE ?= $(MKL_DIR)/include
-BLAS_LIB ?= $(MKL_DIR)/lib $(MKL_DIR)/lib/intel64
+BLAS_LIB ?= $(MKL_DIR)/lib $(MKL_DIR)/lib/intel64 /opt/intel/composer_xe_2013_sp1.2.144/compiler/lib/intel64
@kloudkl
Contributor
It is more convenient to configure MKL_DIR as /opt/intel/composer_xe_2013_sp1.2.144/compiler in the Makefile.config.

@borisgin
Author

borisgin commented Jul 2, 2014

Pushed the code with the changes suggested by @kloudkl (many thanks for the code review :) ).
I accepted all the suggested changes except the FFT flag in Makefile.config: I think it is simpler to add FFTW3 as a dependency and always include fft.hpp and fft.cpp in the project.

@bhack
Contributor

bhack commented Jul 31, 2014

@soumith @borisgin It could be interesting to benchmark Caffe with this PR once there is GPU support.

@soumith

soumith commented Jul 31, 2014

@bhack I'd be happy to do so when it's ready.

@borisgin
Author

We restructured the FFT convolutional layer as a new engine for the convolutional layer, similar to cuDNN. To switch a convolutional layer to FFT, set engine: FFT in the layer description. For example:

layers {
  name: "conv1"
  type: CONVOLUTION
  ...
  convolution_param {
    num_output: 20
    ...
    engine: FFT
    ...
  }
}

File structure:

  • The FFT-based convolutional layer implementation is in the following 2 files: ./src/caffe/layers/conv_layer_fft.cpp and ./src/caffe/layers/conv_layer_fft.cu.
  • 3 additional files are added to connect Caffe to fftw3 and cufft: ./src/caffe/util/fft.cpp, ./src/caffe/util/fft.cu, and ./include/util/fft.hpp.
  • The corresponding tests are in ./src/caffe/test/test_convolution_layer_FFT.cpp.

The code is rebased and ready for merge.
Boris and Lior

@shelhamer
Member

@borisgin the FFT engine for convolution will be a great feature, but this is not properly rebased. The diff shows 196 files changed, and you are missing the Travis CI directory for the automatic PR testing as well as other changes.

Please rebase on the latest dev, or simply cherry-pick your additions as listed:

    FFT-based convolutional layer implementation is in 2 following files: ./src/caffe/layers/conv_layer_fft.cpp and ./src/caffe/layers/conv_layer_fft.cu;
    3 additional files are added to connect caffe to fftw3 and cufft: ./src/caffe/util/fft.cpp, ./src/caffe/util/fft.cu, ./include/util/fft.hpp
    corresponding tests are in ./src/caffe/test/test_convolution_layer_FFT.cpp.

and then push to this PR for review and merge.

    corresponding tests are in ./src/caffe/test/test_convolution_layer_FFT.cpp.

These should be rolled into the convolution layer tests, as was done with cuDNN.

As a last note, the FFT engine should be optional, just as cuDNN is, through the use of a compile-time flag and #ifdef guards.
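
A minimal sketch of the guard pattern being asked for, mirroring the cuDNN integration; the flag name USE_FFT and the -DUSE_FFT define are my assumptions, not something confirmed by this PR:

// Set from Makefile.config, e.g. COMMON_FLAGS += -DUSE_FFT (assumed name).
#ifdef USE_FFT
#include "caffe/util/fft.hpp"  // wrappers around fftw3/cufft

namespace caffe {
// FFT-engine declarations (e.g. a ConvolutionLayerFFT subclass overriding
// Forward_cpu/Forward_gpu) compile only when FFT support is built in.
}  // namespace caffe
#endif  // USE_FFT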

@borisgin
Author

borisgin commented Nov 4, 2014

The FFT PR was successfully merged with the dev branch.

@borisgin
Author

borisgin commented Nov 5, 2014

The FFT engine for the convolutional layer is implemented similarly to cuDNN, but it can be turned on for an individual layer by setting engine: FFT in the network configuration file.
To build Caffe with FFT, please uncomment the corresponding (FFT) line in Makefile.config.
For better performance on CPU we also added an OPENMP flag in Makefile.config.

INSTALLATION:
This branch requires cufft (if you use CUDA) and FFTW3. To install FFTW3:

  1. download the sources: http://www.fftw.org/download.html
  2. unzip
  3. ./configure --enable-float --disable-long-double --disable-quad-precision --enable-openmp --disable-fortran --disable-debug-malloc --enable-avx --enable-shared
  4. make
  5. sudo make install
  6. if there are problems with float128, just comment out the corresponding line in fftw3.h
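
After step 5, a quick sanity check that the single-precision library installed and links; this snippet is illustrative, not from the PR (sudo ldconfig refreshes the shared-library cache so the freshly installed libfftw3f is found):

sudo ldconfig
echo 'int main(void) { return 0; }' > fftw_check.c
gcc fftw_check.c -lfftw3f -o fftw_check && ./fftw_check && echo "FFTW3 float OK"

For the OpenMP-enabled build, link with -lfftw3f_omp -lfftw3f -fopenmp instead.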

@emasa

emasa commented Jan 23, 2015

Perhaps it is worth taking a look at Facebook's FFT-based CNN implementation, released as part of https://github.com/facebook/fbcunn.

@sunbaigui

Same suggestion as @emasa.
@borisgin @shelhamer
Recently Facebook open-sourced a project for speeding up convolution using FFT; it's much faster than cuFFT. Related information is below:
paper: http://arxiv.org/pdf/1412.7580.pdf
open source: https://github.com/facebook/fbcunn

@borisgin
Author

Hi,
We have looked at the FB paper, but we haven't compared the performance of our method vs. Facebook's yet. At the algorithm level they are very similar.

Our implementation is straightforward:

  1. do FFT of weights: weightFFT(f,g)                    // once per batch
  2. for all images n in the batch {
  3.    for all input features f {
  4.       inFFT(n,f) = FFT(input(n,f))
  5.       for all output features g {
  6.          outFFT(n,g) += inFFT(n,f) .* weightFFT(f,g)  // element-wise multiply-accumulate
  7.       }
  8.    }
  9.    for all output features g {
  10.      out(n,g) = IFFT(outFFT(n,g))
  11.   }
  12. } // end of n

Facebook's FFT-based implementation is similar, except that they replaced the element-wise multiply-accumulate with a transpose and SGEMM (cool trick!): after transposing the spectra, the accumulation over input features in step 6 becomes, for each frequency bin, a (batch x num_input) by (num_input x num_output) complex matrix product, so it maps onto batched GEMM. A self-contained sketch of steps 4, 6, and 10 for a single image/feature pair follows.
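
For readers who want the concrete form of steps 4, 6, and 10, here is a minimal single-image, single channel-pair sketch against the FFTW3 single-precision API. This is illustrative, not the PR's code: the kernel is assumed already zero-padded to the H x W map size, the result is a circular convolution (crop the valid region afterwards, and conjugate one spectrum if you want cross-correlation, as CNNs usually define convolution), and real code would create plans once and reuse them across the batch.

#include <fftw3.h>
#include <vector>

// Circular 2D convolution of an HxW input with an HxW zero-padded kernel.
void fft_conv2d(const float* input, const float* kernel_padded,
                float* output, int H, int W) {
  const int n_spec = H * (W / 2 + 1);  // size of the r2c half-spectrum
  fftwf_complex* in_spec =
      (fftwf_complex*) fftwf_malloc(sizeof(fftwf_complex) * n_spec);
  fftwf_complex* ker_spec =
      (fftwf_complex*) fftwf_malloc(sizeof(fftwf_complex) * n_spec);
  std::vector<float> in(input, input + H * W);
  std::vector<float> ker(kernel_padded, kernel_padded + H * W);

  // Steps 1 and 4: forward real-to-complex FFTs of kernel and input.
  fftwf_plan p_in =
      fftwf_plan_dft_r2c_2d(H, W, in.data(), in_spec, FFTW_ESTIMATE);
  fftwf_plan p_ker =
      fftwf_plan_dft_r2c_2d(H, W, ker.data(), ker_spec, FFTW_ESTIMATE);
  fftwf_execute(p_in);
  fftwf_execute(p_ker);

  // Step 6: element-wise complex multiply (the real layer accumulates this
  // across input features; a single pair is shown here).
  for (int i = 0; i < n_spec; ++i) {
    const float re =
        in_spec[i][0] * ker_spec[i][0] - in_spec[i][1] * ker_spec[i][1];
    const float im =
        in_spec[i][0] * ker_spec[i][1] + in_spec[i][1] * ker_spec[i][0];
    in_spec[i][0] = re;
    in_spec[i][1] = im;
  }

  // Step 10: inverse FFT; FFTW's c2r transform is unnormalized, so scale by H*W.
  fftwf_plan p_out =
      fftwf_plan_dft_c2r_2d(H, W, in_spec, output, FFTW_ESTIMATE);
  fftwf_execute(p_out);
  for (int i = 0; i < H * W; ++i) output[i] /= (float)(H * W);

  fftwf_destroy_plan(p_in);
  fftwf_destroy_plan(p_ker);
  fftwf_destroy_plan(p_out);
  fftwf_free(in_spec);
  fftwf_free(ker_spec);
}

Link with -lfftw3f; to use FFTW's OpenMP threads, additionally call fftwf_init_threads() and fftwf_plan_with_nthreads() before planning.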

@bhack
Contributor

bhack commented Jan 26, 2015

Why was this PR not merged?

@melgor
Copy link

melgor commented Feb 12, 2015

I think it is not merged because it needs the MKL library (it does not work with OpenBLAS or ATLAS).
I would like to use the CUDA version, but I do not have MKL, so it does not compile.

Edit: As was pointed out, the code does not require MKL. I had the wrong branch...

@ducha-aiki
Contributor

@melgor
It works with OpenBLAS as well.

@borisgin
Author

The code does not require MKL. It works with OpenBLAS and cuBLAS.

@dxj19831029

Scheduled for the next release?

@bhack
Contributor

bhack commented Jun 15, 2015

Happy pull request birthday @borisgin :)

@shelhamer
Member

While CPU execution can be further optimized, this PR is closed since it is against the deprecated dev branch. This branch was not merged at the time due to concerns about further complexity and dependencies. Thanks for your work @borisgin.

Note that cuDNN v3 includes an FFT convolution on the GPU.

@shelhamer closed this Aug 26, 2015
@bhack
Contributor

bhack commented Aug 26, 2015

@naibaf7 It also works on CPU; see #2610.

@outgrid

outgrid commented Jun 12, 2016

The original Caffe takes 2000 ms for forward computation, but the FFT-based Caffe takes 98000 ms (the kernel sizes are mostly 3x3; only one convolution kernel is 7x7, with stride 2).
It seems fbfft can properly handle the small-kernel situations. Did anyone implement fbfft or Winograd in Caffe?
