
Fix crash when pairing an odd number of devices without P2P (BVLC/github issue #3531) #3586

Closed
wants to merge 78 commits into from

Conversation

@SvenTwo SvenTwo commented Jan 22, 2016

Also simplify the code by not relying on the log2 computation to pre-estimate pairing counts. Just pair until one device remains.

@lukeyeager (Contributor)

/cc @thatguymike

@anuphalarnkar

Hi SvenTwo,

This patch did not fix the issue for me. While investigating NVIDIA P2P access further, I found some more information about my system; please see below:

root@fs3:/usr/local/cuda-7.5/samples/0_Simple/simpleP2P# ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 4

GPU0 = " Tesla K80" IS capable of Peer-to-Peer (P2P)
GPU1 = " Tesla K80" IS capable of Peer-to-Peer (P2P)
GPU2 = " Tesla K80" IS capable of Peer-to-Peer (P2P)
GPU3 = " Tesla K80" IS capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access...

Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU1) : No
Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU2) : No
Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU3) : No
Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU0) : No
Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU2) : No
Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU3) : No
Peer access from Tesla K80 (GPU2) -> Tesla K80 (GPU0) : No
Peer access from Tesla K80 (GPU2) -> Tesla K80 (GPU1) : No
Peer access from Tesla K80 (GPU2) -> Tesla K80 (GPU3) : No
Peer access from Tesla K80 (GPU3) -> Tesla K80 (GPU0) : No
Peer access from Tesla K80 (GPU3) -> Tesla K80 (GPU1) : No
Peer access from Tesla K80 (GPU3) -> Tesla K80 (GPU2) : No
Two or more GPUs with SM 2.0 or higher capability are required for ./simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.

From the output above, it appears that P2P between the GPUs is not supported, even though each GPU is individually P2P-capable. I then ran a command to check the topology:

root@fs3:/usr/local/cuda-7.5/samples/0_Simple/simpleP2P# nvidia-smi topo -m

        GPU0  GPU1  GPU2  GPU3  CPU Affinity
GPU0     X    PIX   SOC   SOC   0-7,72-127
GPU1    PIX    X    SOC   SOC   0-7,72-127
GPU2    SOC   SOC    X    PIX   8-71
GPU3    SOC   SOC   PIX    X    8-71

Legend:

X = Self
SOC = Path traverses a socket-level link (e.g. QPI)
PHB = Path traverses a PCIe host bridge
PXB = Path traverses multiple PCIe internal switches
PIX = Path traverses a PCIe internal switch

From the topology above, GPU0 and GPU1, as well as GPU2 and GPU3, are connected to each other via PIX; all other combinations go through SOC.

From these results I conclude:

  1. The failing test cases (the ones with segfaults and memory corruption) are the only ones that use P2P; the other test cases pass because they do not use P2P. Based on the simpleP2P results, P2P is not supported by this hardware configuration at all.

Query:
Even if P2P is not supported, shouldn't the driver fall back to default access? Is this supported by Caffe?

@SvenTwo (Author) commented Jan 27, 2016

I think Caffe pairs everything with P2P first and then uses a non-P2P fallback for the rest. The non-P2P fallback had a bug that caused list corruption, which this pull request fixes (at least on my machine).

Do you still get the exact same stack trace in your crash? What is the output of the pairing? I believe Caffe tells you exactly which devices it paired and whether it used P2P (run the failing test manually with --logtostderr to see the test's log output).

@anuphalarnkar

This is a gdb backtrace for your reference:

[----------] 12 tests from SGDSolverTest/2, where TypeParam = caffe::GPUDevice
[ RUN ] SGDSolverTest/2.TestLeastSquaresUpdate
[New Thread 0x3efffe0deea0 (LWP 73001)]
[New Thread 0x3efffd8deea0 (LWP 73002)]
[Thread 0x3efffd8deea0 (LWP 73002) exited]

Program received signal SIGSEGV, Segmentation fault.
__memcpy_ppc () at ../sysdeps/powerpc/powerpc64/memcpy.S:364
364 ../sysdeps/powerpc/powerpc64/memcpy.S: No such file or directory.
(gdb) bt
#0 __memcpy_ppc () at ../sysdeps/powerpc/powerpc64/memcpy.S:364
#1 0x00003fffb2e41068 in __GI_memmove (dest=0x19e32888, src=, len=) at ../sysdeps/powerpc/memmove.c:54
#2 0x00003fffb32dc6cc in std::vector<int, std::allocator<int> >::erase(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >) ()
from /root/anup/caffe/.build_release/test/../lib/libcaffe.so.1.0.0-rc3
#3 0x00003fffb32dacc0 in caffe::DevicePair::compute(std::vector<int, std::allocator<int> >, std::vector<caffe::DevicePair, std::allocator<caffe::DevicePair> >*) ()
from /root/anup/caffe/.build_release/test/../lib/libcaffe.so.1.0.0-rc3
#4 0x00003fffb32e13d0 in caffe::P2PSync::run(std::vector<int, std::allocator<int> > const&) ()
from /root/anup/caffe/.build_release/test/../lib/libcaffe.so.1.0.0-rc3
#5 0x0000000010254bc4 in caffe::GradientBasedSolverTest<caffe::GPUDevice>::RunLeastSquaresSolver(float, float, float, int, int, int, bool, char const*) ()
#6 0x0000000010267c14 in caffe::GradientBasedSolverTest<caffe::GPUDevice>::TestLeastSquaresUpdate(float, float, float, int) ()
#7 0x00000000102690bc in caffe::SGDSolverTest_TestLeastSquaresUpdate_Test<caffe::GPUDevice>::TestBody() ()
#8 0x00000000105f8ee8 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ()
#9 0x00000000105eb480 in testing::Test::Run() ()
#10 0x00000000105eb5bc in testing::TestInfo::Run() ()
#11 0x00000000105eb7a4 in testing::TestCase::Run() ()
#12 0x00000000105efc40 in testing::internal::UnitTestImpl::RunAllTests() ()
#13 0x00000000105effe0 in testing::UnitTest::Run() ()
#14 0x0000000010067418 in main ()

@SvenTwo (Author) commented Jan 29, 2016

You could compile a debug build (set DEBUG in Makefile.config). It may trigger an assertion earlier, e.g. if vectors are accessed out of bounds.

Otherwise, sorry, I have no idea. Maybe it's a different bug after all.

@anuphalarnkar

I enabled DEBUG in Makefile.config and captured another gdb trace; it is more detailed than the previous one:

[----------] 12 tests from SGDSolverTest/2, where TypeParam = caffe::GPUDevice
[ RUN ] SGDSolverTest/2.TestLeastSquaresUpdate
[New Thread 0x3efffe33eea0 (LWP 17147)]
[New Thread 0x3efffdb3eea0 (LWP 17148)]
[Thread 0x3efffdb3eea0 (LWP 17148) exited]

Program received signal SIGSEGV, Segmentation fault.
__memcpy_ppc () at ../sysdeps/powerpc/powerpc64/memcpy.S:364
364 ../sysdeps/powerpc/powerpc64/memcpy.S: No such file or directory.
(gdb) bt
#0 __memcpy_ppc () at ../sysdeps/powerpc/powerpc64/memcpy.S:364
#1 0x00003fffb29f1068 in __GI_memmove (dest=0x1ab1d7c8, src=, len=) at ../sysdeps/powerpc/memmove.c:54
#2 0x000000001011bfc8 in std::__copy_move<false, true, std::random_access_iterator_tag>::__copy_m (__first=0x1ab1d7cc, __last=0x1ab1d7c8, __result=0x1ab1d7c8)
at /usr/include/c++/4.8/bits/stl_algobase.h:372
#3 0x000000001011bf3c in std::__copy_move_a<false, int*, int*> (__first=0x1ab1d7cc, __last=0x1ab1d7c8, __result=0x1ab1d7c8)
at /usr/include/c++/4.8/bits/stl_algobase.h:390
#4 0x00003fffb30e8c98 in std::__copy_move_a2<false, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > > > (__first=32, __last=0, __result=0) at /usr/include/c++/4.8/bits/stl_algobase.h:428
#5 0x00003fffb30e7068 in std::copy<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > > > (__first=32, __last=0, __result=0) at /usr/include/c++/4.8/bits/stl_algobase.h:460
#6 0x00003fffb30dc64c in std::vector<int, std::allocator<int> >::erase (this=0x3fffffffdd60, __position=0) at /usr/include/c++/4.8/bits/vector.tcc:138
#7 0x00003fffb30daf50 in caffe::DevicePair::compute (devices=std::vector of length 3, capacity 3 = {...}, pairs=0x3fffffffe488) at src/caffe/parallel.cpp:178
#8 0x00003fffb30e002c in caffe::P2PSync::run (this=0x1a5a59c0, gpus=std::vector of length 3, capacity 4 = {...}) at src/caffe/parallel.cpp:386
#9 0x000000001030e69c in caffe::GradientBasedSolverTest<caffe::GPUDevice>::RunLeastSquaresSolver (this=0x1a5f5d50, learning_rate=1, weight_decay=0,
momentum=0, num_iters=1, iter_size=1, devices=3, snapshot=false, from_snapshot=0x0) at src/caffe/test/test_gradient_based_solver.cpp:208
#10 0x0000000010303978 in caffe::GradientBasedSolverTest<caffe::GPUDevice>::TestLeastSquaresUpdate (this=0x1a5f5d50, learning_rate=1, weight_decay=0,
momentum=0, iter_to_check=0) at src/caffe/test/test_gradient_based_solver.cpp:484
#11 0x00000000102f7be4 in caffe::SGDSolverTest_TestLeastSquaresUpdate_Test<caffe::GPUDevice>::TestBody (this=0x1a5f5d50)
at src/caffe/test/test_gradient_based_solver.cpp:577
#12 0x000000001071bcc4 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (object=0x1a5f5d50,
method=&virtual testing::Test::TestBody(), location=0x107dbe38 "the test body") at src/gtest/gtest-all.cpp:3393
#13 0x0000000010713f2c in testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=0x1a5f5d50, method=&virtual testing::Test::TestBody(),
location=0x107dbe38 "the test body") at src/gtest/gtest-all.cpp:3429
#14 0x00000000106f45d8 in testing::Test::Run (this=0x1a5f5d50) at src/gtest/gtest-all.cpp:3465
#15 0x00000000106f51cc in testing::TestInfo::Run (this=0x10b1edb0) at src/gtest/gtest-all.cpp:3641
#16 0x00000000106f5bec in testing::TestCase::Run (this=0x10b1ef60) at src/gtest/gtest-all.cpp:3748
#17 0x00000000106fd7e8 in testing::internal::UnitTestImpl::RunAllTests (this=0x10ad8a70) at src/gtest/gtest-all.cpp:5540
#18 0x000000001071d7c0 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x10ad8a70,
method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0x106fd40c <testing::internal::UnitTestImpl::RunAllTests()>,
location=0x107dc9c0 "auxiliary test code (environments or event listeners)") at src/gtest/gtest-all.cpp:3393
#19 0x00000000107153f8 in testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x10ad8a70,
method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0x106fd40c <testing::internal::UnitTestImpl::RunAllTests()>,
location=0x107dc9c0 "auxiliary test code (environments or event listeners)") at src/gtest/gtest-all.cpp:3429
#20 0x00000000106fbc44 in testing::UnitTest::Run (this=0x109d36b8 <testing::UnitTest::GetInstance()::instance>) at src/gtest/gtest-all.cpp:5177
#21 0x00000000100b12f4 in main (argc=1, argv=0x3ffffffff438) at src/caffe/test/test_caffe_main.cpp:39

@SvenTwo (Author) commented Feb 3, 2016

@anuphalarnkar Sorry, I have no idea. But you may find more hints in the log file of the test. E.g. if you run:

./build/test/test_all.testbin --gtest_filter="SGDSolverTest/2.TestLeastSquaresUpdate" --logtostderr

You should get all the standard caffe blabber, including the log messages about GPU pairing. It will be something like "parallel.cpp:390] GPUs pairs 0:1". You might be able to deduce which pairings cause the problem. (Use 2>filename.log to reroute it to a file)

I don't think it's related to this pull request though.

@anuphalarnkar

#3539 is fixed with this pull request.
Thanks!

@shelhamer (Member)

Resolved by the switch to new parallelism in #4563. Thanks for submitting a fix all the same!

@shelhamer closed this Apr 14, 2017