Add OpenMP support for CorrMM (simple paths) #3689
Conversation
Hmm, the failures here are interesting. @JesseLivezey, can you take a look and check that I'm doing the indexing into the new `col` buffer correctly?
```c
npy_intp col_dim[3];
col_dim[0] = (npy_intp)(batchSize);
col_dim[1] = (npy_intp)(topHeight * topWidth);
PyArrayObject* col = (PyArrayObject*)PyArray_EMPTY(2,
```
This is ok for testing, but it uses too much memory. You only need min(batch_size, number_of_threads) buffers. I think the maximum number of threads can be obtained with a call to OpenMP (omp_get_max_threads). This will need adjustments in the usage of col.
Maybe this will need to be aligned? I'm not sure; I don't think alignment is mandatory, but it might give a speed up.
I'm not able to see the problem by code review. Does this work on your machine? Do you use a parallel BLAS?
I'm still puzzled by the Travis error. If a few people test it locally and it works well, maybe we could just disable OpenMP on Travis?
```diff
 // First, im2col
 im2col((%(float_type)s*)PyArray_DATA(bottom) + n * bottom_stride, nChannels, bottomHeight,
-       bottomWidth, kH, kW, padH, padW, dH, dW, (%(float_type)s*)PyArray_DATA(col));
+       bottomWidth, kH, kW, padH, padW, dH, dW, (%(float_type)s*)PyArray_GETPTR3(col, colidx, 0, 0));
```
Could it be the PyArray_DATA or PyArray_GETPTR3 call that causes this problem? Could the _DATA call be moved outside the loop, and for GETPTR3, could the offset be computed manually? This is the only Python code in the parallel loop, so I don't see what else could cause the double free mentioned in the error message.
Technically, `PyArray_GETPTR3` should just be doing pointer arithmetic to move past the first `colidx` matrices, so we could, although it'll make the code much less readable. In fact, looking at the source, that's pretty much exactly what it does. `PyArray_BYTES` and `PyArray_STRIDES` both just do a pointer dereference, so I don't see how the problem can originate from there :/
.theanorc looks like …
I'm just calling …
Adding …
```python
# Add the -fopenmp flags
ret += super(BaseCorrMM, self).c_compile_args()
return ret
```
The method c_headers must also be updated to pick up the missing include added by the parent class.
This fixes the first problem!

```
error: omp_get_max_threads was not declared in this scope.
```
By code review I'm not able to find the problem. Can you comment out the Py_DECREF and see if that fixes it? It would help to know which object is causing the problem (this would probably cause a memory leak, so it isn't a long-term solution).
```c
    max = omp_get_max_threads();
#endif
    npy_intp col_dim[3];
    col_dim[0] = (npy_intp)(max);
```
Good catch! Following up in https://github.com/jonhoo/Theano/pull/1.
Sorry I've been so slow to follow up on this. I have been quite busy the past week. Will hopefully have time to look more closely at this on Thursday. @JesseLivezey, thanks, I'll have a look.
I think this PR is good to merge. But as we won't be available in case of problems and it wasn't tested on Mac/Windows, I will postpone the merge until we are back from holidays. Those problems should already be addressed by the super class OpenMPOp, but just in case. To test this, I ran the test_corr.py file with OMP_NUM_THREADS=1 and 2. Did someone test it with bigger matrices to see if there is a speed up? Just to be sure. Thanks
I ran some tests on convolutions similar to the first layer of ImageNet models, and also on some convolutions closer to 1D convolutions over a spectrogram. This was all done on a 4-core Xeon processor. For both sets of shapes, running with MKL_NUM_THREADS=4 (previous version) was faster than no parallelization, and running with OMP_NUM_THREADS=4 and MKL_NUM_THREADS=1 (new OpenMP version) was faster than just having MKL use 4 threads! I'll also try timing on 8- and 12-core processors today. The previous version (no OpenMP, MKL using all cores) didn't scale well past ~6 cores, so hopefully this will help.
It would be great to add this timing information to doc/tutorial/multi_cores.txt. Thanks.

On Mon, Dec 21, 2015 at 3:38 PM, Jesse Livezey notifications@github.com
I can make another PR to this PR with the timing + docs.
@jonhoo, @nouiz the test_corr.py tests DO NOT pass on OSX El Capitan 10.11.2 with Homebrew gcc 5.3.0 --without-multilib --enable-cxx. This is the current default installed with gcc through Homebrew. The default OSX clang compiler doesn't support OpenMP, but the tests pass; OpenMP just gets disabled. The tests DO pass on OSX (same machine as above) with Homebrew gcc48 4.8.4 --without-multilib --enable-cxx. The tests DO pass on Ubuntu with g++ version Ubuntu 4.8.4-2ubuntu1~14.04. Any ideas? I can try running some of the other OpenMP ops (elementwise, I guess) and see if they pass/fail the same way.
That's very interesting… I don't know what might be causing that. Can you try with gcc 5.x on Ubuntu?
Tests also fail on Ubuntu with g++ 5.3 and 4.9. Maybe something changed in OpenMP from 3.1 to 4.0? I think g++ 4.8 only supports 3.1, while 4.9+ supports 4.0. @nouiz, does Theano keep track of the g++ version? Elementwise tests seem okay for all versions.
As the ICML deadline has passed, we can try to merge stuff that isn't 100% sure. Theano just records the g++ version, but except in a few places, we don't use it. When we do use it, it is just to work around buggy versions. @JesseLivezey, which tests failed? Can you show the error? This PR would need a rebase. We are trying to clean things up to make a release; it would be great to have this in. Thanks
@nouiz With g++5 on OSX I get the following errors (I think the errors are the same on Ubuntu with g++ 4.9+). I looked into it a bit more a while ago, and it looked like when OMP_NUM_THREADS > 1 the computations were wrong. I think it might have been doing the computation for only the first element of the batch on every loop iteration, but I'd have to go back and look more carefully.
Sorry, no time. Thanks for investigating.

On Tue, Mar 8, 2016 at 9:24 PM, Jesse Livezey notifications@github.com
This was finished in #4591.
Adds OpenMP support for two of the three paths in CorrMM: the forward pass and backprop w.r.t. the inputs. This patch results in a ~5x performance gain for my MNIST-like CNN when run on an 80-core machine. The last path, backprop w.r.t. the weights, is slightly trickier because of the write-sharing of `weights`, and will be handled separately in #3653.