Segmentation fault after make runtest on Ubuntu 14.04/ppc64le #3539

Closed
anuphalarnkar opened this issue Jan 12, 2016 · 5 comments

@anuphalarnkar

Hi,

I am working on a PowerPC (ppc64le) machine with Ubuntu 14.04 and 4 GPUs (all Tesla K80s).

make runtest fails with the following log:

@     0x100000040478 (unknown)
@     0x1000043f6c10 (unknown)
@     0x1000046daffc std::vector<>::erase()
@     0x1000046d95f0 caffe::DevicePair::compute()
@     0x1000046e0530 caffe::P2PSync<>::run()
@         0x1022b8ec caffe::GradientBasedSolverTest<>::RunLeastSquaresSolver()
@         0x10234154 caffe::GradientBasedSolverTest<>::TestLeastSquaresUpdate()
@         0x10234560 caffe::RMSPropSolverTest_TestRMSPropLeastSquaresUpdateWithWeightDecay_Test<>::TestBody()
@         0x105323a8 testing::internal::HandleExceptionsInMethodIfSupported<>()
@         0x10524940 testing::Test::Run()
@         0x10524a7c testing::TestInfo::Run()
@         0x10524c64 testing::TestCase::Run()
@         0x10529100 testing::internal::UnitTestImpl::RunAllTests()
@         0x105294a0 testing::UnitTest::Run()
@         0x1005bc58 main
@     0x100004d44d00 (unknown)
@     0x100004d44ef8 (unknown)
@                0x0 (unknown)

make: *** [runtest] Segmentation fault (core dumped)

I tried running "test_all.testbin" under GDB and got the following output:

Program received signal SIGSEGV, Segmentation fault.

__memcpy_ppc () at ../sysdeps/powerpc/powerpc64/memcpy.S:364

364 ../sysdeps/powerpc/powerpc64/memcpy.S: No such file or directory.

I then took a backtrace with gdb. Here are the results:

According to the lshw command, the GPU memory is mapped on the PCI bus at addresses starting with 0x3xxx xxxx xxxx.
The address of each call is shown on the left and the function on the right:
0x00003fffb35019f0 -> caffe::P2PSync::run
0x00003fffb34fb2b0 -> caffe::DevicePair::compute
0x00003fffb34fccbc -> std::vector<int, std::allocator<int> >::erase
0x00003fffb30a1068 -> __GI_memmove (dest=0x153ea198, src=, len=)

In the last function, __GI_memmove, I suspect the destination address is an offset within the GPU memory range. For instance, the final computed address could be 0x00003fff00000000 + 0x153ea198 = 0x00003fff153ea198.
However, I am unable to relate the 0x3fff prefix to any of the GPU cards on the PCI bus.
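
A quick consistency check on the address reasoning above, as a rough sketch only. The addresses are copied from the gdb backtrace in this report; the names `HIGH_FRAMES`, `TEST_BINARY_FRAMES`, `MEMMOVE_DEST` and `upper_bits` are made up for illustration. The idea is to compare the upper bits of each address: gdb lists the 0x3fff... frames as coming from libcaffe.so, which suggests that prefix belongs to shared-library code in the process rather than to a GPU range.

```python
# Addresses copied from the gdb backtrace in this report.
# Frames with the 0x3fff... prefix (gdb attributes #2-#4 to libcaffe.so).
HIGH_FRAMES = {
    "caffe::P2PSync::run": 0x00003FFFB35019F0,
    "caffe::DevicePair::compute": 0x00003FFFB34FB2B0,
    "std::vector::erase": 0x00003FFFB34FCCBC,
    "__GI_memmove": 0x00003FFFB30A1068,
}
# Two frames that sit inside the test binary itself.
TEST_BINARY_FRAMES = {
    "main": 0x000000001005C038,
    "RunLeastSquaresSolver": 0x0000000010235074,
}
# Destination pointer reported for __GI_memmove.
MEMMOVE_DEST = 0x153EA198


def upper_bits(addr: int) -> int:
    """Return everything above the low 32 bits of an address."""
    return addr >> 32


# The sum proposed in the report, reproduced as plain arithmetic.
guessed = 0x00003FFF00000000 + MEMMOVE_DEST
print(hex(guessed))

# Every high frame shares the same upper bits.
print({hex(upper_bits(a)) for a in HIGH_FRAMES.values()})

# The memmove destination has no upper bits set, like the test binary's frames.
print(hex(upper_bits(MEMMOVE_DEST)))
print({hex(upper_bits(a)) for a in TEST_BINARY_FRAMES.values()})
```

If that reading holds, dest=0x153ea198 may already be a complete address in the same low range as the test binary, rather than an offset to be added to 0x3fff00000000. This only checks the numbers in the report against each other; it does not show what dest actually points to.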

I have taken a backtrace using gdb and pasted it below for your reference.

(gdb) bt
#0 __memcpy_ppc () at ../sysdeps/powerpc/powerpc64/memcpy.S:364
#1 0x00003fffb30a1068 in __GI_memmove (dest=0x153ea198, src=, len=) at ../sysdeps/powerpc/memmove.c:54
#2 0x00003fffb34fccbc in std::vector<int, std::allocator<int> >::erase(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >) () from /root/anup/caffe/.build_release/test/../lib/libcaffe.so
#3 0x00003fffb34fb2b0 in caffe::DevicePair::compute(std::vector<int, std::allocator<int> >, std::vector<caffe::DevicePair, std::allocator<caffe::DevicePair> >*) () from /root/anup/caffe/.build_release/test/../lib/libcaffe.so
#4 0x00003fffb35019f0 in caffe::P2PSync::run(std::vector<int, std::allocator<int> > const&) () from /root/anup/caffe/.build_release/test/../lib/libcaffe.so
#5 0x0000000010235074 in caffe::GradientBasedSolverTestcaffe::GPUDevice::RunLeastSquaresSolver(float, float, float, int, int, int, bool, char const*) ()
#6 0x0000000010247414 in caffe::GradientBasedSolverTestcaffe::GPUDevice::TestLeastSquaresUpdate(float, float, float, int) ()
#7 0x00000000102488bc in caffe::SGDSolverTest_TestLeastSquaresUpdate_Testcaffe::GPUDevice::TestBody() ()
#8 0x000000001053ce68 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ()
#9 0x000000001052f400 in testing::Test::Run() ()
#10 0x000000001052f53c in testing::TestInfo::Run() ()
#11 0x000000001052f724 in testing::TestCase::Run() ()
#12 0x0000000010533bc0 in testing::internal::UnitTestImpl::RunAllTests() ()
#13 0x0000000010533f60 in testing::UnitTest::Run() ()
#14 0x000000001005c038 in main ()

Any inputs will be greatly appreciated.

Thanks in advance,
Anup Halarnkar

@seanbell

Do you think this is the same as #3531?

@cdluminate
Contributor

It is really interesting to see a real instance of Caffe running on a ppc64el machine.

Thank you for showing that Caffe should be portable to ppc64el, with or without CUDA.

@SvenTwo

SvenTwo commented Jan 22, 2016

Does this commit fix it for you? #3586

@anuphalarnkar
Author

Hey SvenTwo,
The commit in #3586 solved it for me. I had not applied the patch correctly earlier. I apologize for that and for wasting your time :(
Thanks,
Anup

@seanbell

seanbell commented Feb 5, 2016

@anuphalarnkar glad that you solved it. Please add a comment to #3586 indicating that it's fixed for you, so that the PR doesn't get buried.
