Segmentation fault when run test #3531

Open
xnming opened this issue Jan 8, 2016 · 4 comments

xnming commented Jan 8, 2016

I updated Caffe yesterday, and since then make runtest crashes with a segmentation fault. The environment is Ubuntu 14.04 with CUDA 7.5. The gdb output follows.

[----------] 12 tests from SGDSolverTest/2, where TypeParam = caffe::GPUDevice
[ RUN ] SGDSolverTest/2.TestLeastSquaresUpdate

Program received signal SIGSEGV, Segmentation fault.
__memmove_ssse3_back () at ../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S:1546
1546 ../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S: No such file or directory.
(gdb) bt
#0 __memmove_ssse3_back () at ../sysdeps/x86_64/multiarch/memcpy-ssse3-back.S:1546
#1 0x000000000052bef5 in std::__copy_move<false, true, std::random_access_iterator_tag>::__copy_m (__first=0xafa2ebc, __last=0xafa2eb8, __result=0xafa2eb8) at /usr/include/c++/4.8/bits/stl_algobase.h:372

#2 0x000000000052bea3 in std::__copy_move_a<false, int*, int*> (__first=0xafa2ebc, __last=0xafa2eb8, __result=0xafa2eb8) at /usr/include/c++/4.8/bits/stl_algobase.h:390
#3 0x00007ffff1b7917d in std::__copy_move_a2<false, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > > > (__first=..., __last=..., __result=...) at /usr/include/c++/4.8/bits/stl_algobase.h:428
#4 0x00007ffff1b782d0 in std::copy<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > > > (__first=..., __last=..., __result=...) at /usr/include/c++/4.8/bits/stl_algobase.h:460
#5 0x00007ffff1b703b4 in std::vector<int, std::allocator<int> >::erase (this=0x7fffffffcbc0, __position=...) at /usr/include/c++/4.8/bits/vector.tcc:138
#6 0x00007ffff1b6f3a9 in caffe::DevicePair::compute (devices=..., pairs=0x7fffffffd300) at src/caffe/parallel.cpp:178
#7 0x00007ffff1b72c81 in caffe::P2PSync::run (this=0xb9479a0, gpus=...) at src/caffe/parallel.cpp:386
#8 0x00000000008a2697 in caffe::GradientBasedSolverTestcaffe::GPUDevice::RunLeastSquaresSolver (this=0xac948f0, learning_rate=1, weight_decay=0, momentum=0, num_iters=1, iter_size=1, devices=3, snapshot=false, from_snapshot=0x0) at src/caffe/test/test_gradient_based_solver.cpp:208
#9 0x0000000000899db6 in caffe::GradientBasedSolverTestcaffe::GPUDevice::TestLeastSquaresUpdate (this=0xac948f0, learning_rate=1, weight_decay=0, momentum=0, iter_to_check=0) at src/caffe/test/test_gradient_based_solver.cpp:484
#10 0x0000000000892069 in caffe::SGDSolverTest_TestLeastSquaresUpdate_Testcaffe::GPUDevice::TestBody (this=0xac948f0) at src/caffe/test/test_gradient_based_solver.cpp:577
#11 0x00000000008d18bd in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (object=0xac948f0, method=&virtual testing::Test::TestBody(), location=0x98285b "the test body") at src/gtest/gtest-all.cpp:3393
#12 0x00000000008ccfc4 in testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=0xac948f0, method=&virtual testing::Test::TestBody(), location=0x98285b "the test body") at src/gtest/gtest-all.cpp:3429

#13 0x00000000008ba3db in testing::Test::Run (this=0xac948f0) at src/gtest/gtest-all.cpp:3465
#14 0x00000000008bab74 in testing::TestInfo::Run (this=0xf4aae0) at src/gtest/gtest-all.cpp:3641
#15 0x00000000008bb162 in testing::TestCase::Run (this=0xf4ac90) at src/gtest/gtest-all.cpp:3748
#16 0x00000000008bffec in testing::internal::UnitTestImpl::RunAllTests (this=0xe9ddc0) at src/gtest/gtest-all.cpp:5540
#17 0x00000000008d28f0 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0xe9ddc0, method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0x8bfd64 <testing::internal::UnitTestImpl::RunAllTests()>, location=0x983318 "auxiliary test code (environments or event listeners)") at src/gtest/gtest-all.cpp:3393
#18 0x00000000008cdbef in testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0xe9ddc0, method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0x8bfd64 <testing::internal::UnitTestImpl::RunAllTests()>, location=0x983318 "auxiliary test code (environments or event listeners)") at src/gtest/gtest-all.cpp:3429

#19 0x00000000008bed80 in testing::UnitTest::Run (this=0xd4b160 testing::UnitTest::GetInstance()::instance) at src/gtest/gtest-all.cpp:5177
#20 0x00000000004ab923 in main (argc=1, argv=0x7fffffffdec8) at src/caffe/test/test_caffe_main.cpp:39

I also looked at the code where the error occurs (src/caffe/parallel.cpp:178):
172 remaining_depth = ceil(log2(remaining.size()));
173 for (int d = 0; d < remaining_depth; ++d) {
174 for (int i = 0; i < remaining.size(); ++i) {
175 pairs->push_back(DevicePair(remaining[i], remaining[i + 1]));
176 DLOG(INFO) << "Remaining pair: " << remaining[i] << ":"
177 << remaining[i + 1];
178 remaining.erase(remaining.begin() + i + 1);
179 }

It seems that the remaining vector is empty when d > 0.

seanbell added the bug label Jan 8, 2016

mcculloh commented

This looks like it will walk off the end of the array:

for (int i = 0; i < remaining.size(); ++i) {
  pairs->push_back(DevicePair(remaining[i], remaining[i + 1]));

The loop also removes items from the vector while iterating over it. The outer loop repeats, and this causes memory errors whenever the vector's size is not a power of 2 at the start of a pass. Take 6 items (ceil(log2(6)) = 3): we go through the inner loop more than once, and only 3 items are left on the second pass. Once i reaches the last index, remaining[i + 1] reads past the end of the vector. That is undefined behavior, so the result depends on your compiler, but it usually just returns garbage data from beyond the internal array. The erase call at the end of the loop body is likewise undefined at that point, and it seems to be what causes the segfault.

I think setting up a loop invariant that steps through every other item in the vector and makes the pairs would be safer.


SvenTwo commented Jan 22, 2016

I have the same issue (on a regular PC, also Ubuntu 14). It happens when the test is run for 3 GPUs. If I change the test to stop at two devices, it runs fine.

I've attached a patch that fixes the problem for me. It fixes the loop index and simply runs the loop until it is finished instead of doing the log2 computation.

0001-Fix-crash-when-pairing-3-GPUs-without-P2P-access-git.patch.txt

Edit: I also opened pull request #3586.

SvenTwo pushed a commit to SvenTwo/caffe that referenced this issue Jan 22, 2016
hartb pushed a commit to ibmsoe/caffe that referenced this issue May 4, 2016
npanpaliya pushed a commit to ibmsoe/caffe that referenced this issue May 18, 2016
aaronpolhamus commented

You guys are wizards. This worked great.

npanpaliya pushed a commit to ibmsoe/caffe that referenced this issue Jul 11, 2016
chenyang-charles commented

Works fine on 14.04, thank you.

dllehr81 pushed a commit to dllehr81/caffe that referenced this issue Aug 26, 2016