
Fix crash when pairing an odd number of devices without P2P (BVLC/github issue #3531) #3586

Closed
wants to merge 78 commits into from

Conversation

@SvenTwo SvenTwo commented Jan 22, 2016

Also simplify the code by not relying on the log2 computation to pre-estimate pairing counts. Just pair until one device remains.

@lukeyeager (Contributor)

/cc @thatguymike

@anuphalarnkar

Hi SvenTwo,

This patch did not fix the issue for me. While investigating NVIDIA P2P access further, I found some more information about my system; please see below:

root@fs3:/usr/local/cuda-7.5/samples/0_Simple/simpleP2P# ./simpleP2P
[./simpleP2P] - Starting...
Checking for multiple GPUs...
CUDA-capable device count: 4

GPU0 = " Tesla K80" IS capable of Peer-to-Peer (P2P)
GPU1 = " Tesla K80" IS capable of Peer-to-Peer (P2P)
GPU2 = " Tesla K80" IS capable of Peer-to-Peer (P2P)
GPU3 = " Tesla K80" IS capable of Peer-to-Peer (P2P)

Checking GPU(s) for support of peer to peer memory access...

Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU1) : No
Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU2) : No
Peer access from Tesla K80 (GPU0) -> Tesla K80 (GPU3) : No
Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU0) : No
Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU2) : No
Peer access from Tesla K80 (GPU1) -> Tesla K80 (GPU3) : No
Peer access from Tesla K80 (GPU2) -> Tesla K80 (GPU0) : No
Peer access from Tesla K80 (GPU2) -> Tesla K80 (GPU1) : No
Peer access from Tesla K80 (GPU2) -> Tesla K80 (GPU3) : No
Peer access from Tesla K80 (GPU3) -> Tesla K80 (GPU0) : No
Peer access from Tesla K80 (GPU3) -> Tesla K80 (GPU1) : No
Peer access from Tesla K80 (GPU3) -> Tesla K80 (GPU2) : No
Two or more GPUs with SM 2.0 or higher capability are required for ./simpleP2P.
Peer to Peer access is not available amongst GPUs in the system, waiving test.

From the output above, it appears that P2P between the GPUs is not supported, even though each GPU is individually P2P-capable. I then ran a command to check the topology:

root@fs3:/usr/local/cuda-7.5/samples/0_Simple/simpleP2P# nvidia-smi topo -m

        GPU0  GPU1  GPU2  GPU3  CPU Affinity
GPU0     X    PIX   SOC   SOC   0-7,72-127
GPU1    PIX    X    SOC   SOC   0-7,72-127
GPU2    SOC   SOC    X    PIX   8-71
GPU3    SOC   SOC   PIX    X    8-71

Legend:

X = Self
SOC = Path traverses a socket-level link (e.g. QPI)
PHB = Path traverses a PCIe host bridge
PXB = Path traverses multiple PCIe internal switches
PIX = Path traverses a PCIe internal switch

From the topology above, GPU0 and GPU1, as well as GPU2 and GPU3, are connected to each other via PIX; all other combinations go through SOC.

From these results I conclude:

  1. The failing test cases (the ones with segfaults and memory corruption) are the only ones that use P2P; the other test cases pass because they do not use P2P. Based on the simpleP2P results, P2P is not supported by this hardware configuration at all.

Query:
Even if P2P is not supported, shouldn't the driver fall back to default access? Is this supported by Caffe?

@SvenTwo (Author) commented Jan 27, 2016

I think Caffe pairs everything with P2P first and then uses a non-P2P fallback for the rest. The non-P2P fallback had a bug that caused list corruption, which this pull request fixes (at least on my machine).

Do you still get the exact same stack trace in your crash? What is the output of the pairing? I believe Caffe tells you exactly which devices it paired and whether it used P2P (run the failing test manually with --logtostderr to see the test's log output).

@anuphalarnkar

This is a gdb backtrace for your reference:

[----------] 12 tests from SGDSolverTest/2, where TypeParam = caffe::GPUDevice
[ RUN ] SGDSolverTest/2.TestLeastSquaresUpdate
[New Thread 0x3efffe0deea0 (LWP 73001)]
[New Thread 0x3efffd8deea0 (LWP 73002)]
[Thread 0x3efffd8deea0 (LWP 73002) exited]

Program received signal SIGSEGV, Segmentation fault.
__memcpy_ppc () at ../sysdeps/powerpc/powerpc64/memcpy.S:364
364 ../sysdeps/powerpc/powerpc64/memcpy.S: No such file or directory.
(gdb) bt
#0 __memcpy_ppc () at ../sysdeps/powerpc/powerpc64/memcpy.S:364
#1 0x00003fffb2e41068 in __GI_memmove (dest=0x19e32888, src=, len=) at ../sysdeps/powerpc/memmove.c:54
#2 0x00003fffb32dc6cc in std::vector<int, std::allocator<int> >::erase(__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >) ()
from /root/anup/caffe/.build_release/test/../lib/libcaffe.so.1.0.0-rc3
#3 0x00003fffb32dacc0 in caffe::DevicePair::compute(std::vector<int, std::allocator<int> >, std::vector<caffe::DevicePair, std::allocator<caffe::DevicePair> >*) ()
from /root/anup/caffe/.build_release/test/../lib/libcaffe.so.1.0.0-rc3
#4 0x00003fffb32e13d0 in caffe::P2PSync::run(std::vector<int, std::allocator<int> > const&) ()
from /root/anup/caffe/.build_release/test/../lib/libcaffe.so.1.0.0-rc3
#5 0x0000000010254bc4 in caffe::GradientBasedSolverTest<caffe::GPUDevice>::RunLeastSquaresSolver(float, float, float, int, int, int, bool, char const*) ()
#6 0x0000000010267c14 in caffe::GradientBasedSolverTest<caffe::GPUDevice>::TestLeastSquaresUpdate(float, float, float, int) ()
#7 0x00000000102690bc in caffe::SGDSolverTest_TestLeastSquaresUpdate_Test<caffe::GPUDevice>::TestBody() ()
#8 0x00000000105f8ee8 in void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) ()
#9 0x00000000105eb480 in testing::Test::Run() ()
#10 0x00000000105eb5bc in testing::TestInfo::Run() ()
#11 0x00000000105eb7a4 in testing::TestCase::Run() ()
#12 0x00000000105efc40 in testing::internal::UnitTestImpl::RunAllTests() ()
#13 0x00000000105effe0 in testing::UnitTest::Run() ()
#14 0x0000000010067418 in main ()

@SvenTwo (Author) commented Jan 29, 2016

You could compile a debug build (set DEBUG in Makefile.config). It may trigger an assertion earlier, e.g. if vectors are accessed out of bounds.

Otherwise, sorry, I have no idea. Maybe it's a different bug after all.

@anuphalarnkar

I enabled DEBUG in Makefile.config and captured another gdb trace; it is more detailed than the previous one:

[----------] 12 tests from SGDSolverTest/2, where TypeParam = caffe::GPUDevice
[ RUN ] SGDSolverTest/2.TestLeastSquaresUpdate
[New Thread 0x3efffe33eea0 (LWP 17147)]
[New Thread 0x3efffdb3eea0 (LWP 17148)]
[Thread 0x3efffdb3eea0 (LWP 17148) exited]

Program received signal SIGSEGV, Segmentation fault.
__memcpy_ppc () at ../sysdeps/powerpc/powerpc64/memcpy.S:364
364 ../sysdeps/powerpc/powerpc64/memcpy.S: No such file or directory.
(gdb) bt
#0 __memcpy_ppc () at ../sysdeps/powerpc/powerpc64/memcpy.S:364
#1 0x00003fffb29f1068 in __GI_memmove (dest=0x1ab1d7c8, src=, len=) at ../sysdeps/powerpc/memmove.c:54
#2 0x000000001011bfc8 in std::__copy_move<false, true, std::random_access_iterator_tag>::__copy_m (__first=0x1ab1d7cc, __last=0x1ab1d7c8, __result=0x1ab1d7c8)
at /usr/include/c++/4.8/bits/stl_algobase.h:372
#3 0x000000001011bf3c in std::__copy_move_a<false, int*, int*> (__first=0x1ab1d7cc, __last=0x1ab1d7c8, __result=0x1ab1d7c8)
at /usr/include/c++/4.8/bits/stl_algobase.h:390
#4 0x00003fffb30e8c98 in std::__copy_move_a2<false, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > > > (__first=32, __last=0, __result=0) at /usr/include/c++/4.8/bits/stl_algobase.h:428
#5 0x00003fffb30e7068 in std::copy<__gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > >, __gnu_cxx::__normal_iterator<int*, std::vector<int, std::allocator<int> > > > (__first=32, __last=0, __result=0) at /usr/include/c++/4.8/bits/stl_algobase.h:460
#6 0x00003fffb30dc64c in std::vector<int, std::allocator<int> >::erase (this=0x3fffffffdd60, __position=0) at /usr/include/c++/4.8/bits/vector.tcc:138
#7 0x00003fffb30daf50 in caffe::DevicePair::compute (devices=std::vector of length 3, capacity 3 = {...}, pairs=0x3fffffffe488) at src/caffe/parallel.cpp:178
#8 0x00003fffb30e002c in caffe::P2PSync::run (this=0x1a5a59c0, gpus=std::vector of length 3, capacity 4 = {...}) at src/caffe/parallel.cpp:386
#9 0x000000001030e69c in caffe::GradientBasedSolverTest<caffe::GPUDevice>::RunLeastSquaresSolver (this=0x1a5f5d50, learning_rate=1, weight_decay=0,
momentum=0, num_iters=1, iter_size=1, devices=3, snapshot=false, from_snapshot=0x0) at src/caffe/test/test_gradient_based_solver.cpp:208
#10 0x0000000010303978 in caffe::GradientBasedSolverTest<caffe::GPUDevice>::TestLeastSquaresUpdate (this=0x1a5f5d50, learning_rate=1, weight_decay=0,
momentum=0, iter_to_check=0) at src/caffe/test/test_gradient_based_solver.cpp:484
#11 0x00000000102f7be4 in caffe::SGDSolverTest_TestLeastSquaresUpdate_Test<caffe::GPUDevice>::TestBody (this=0x1a5f5d50)
at src/caffe/test/test_gradient_based_solver.cpp:577
#12 0x000000001071bcc4 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void> (object=0x1a5f5d50,
method=&virtual testing::Test::TestBody(), location=0x107dbe38 "the test body") at src/gtest/gtest-all.cpp:3393
#13 0x0000000010713f2c in testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void> (object=0x1a5f5d50, method=&virtual testing::Test::TestBody(),
location=0x107dbe38 "the test body") at src/gtest/gtest-all.cpp:3429
#14 0x00000000106f45d8 in testing::Test::Run (this=0x1a5f5d50) at src/gtest/gtest-all.cpp:3465
#15 0x00000000106f51cc in testing::TestInfo::Run (this=0x10b1edb0) at src/gtest/gtest-all.cpp:3641
#16 0x00000000106f5bec in testing::TestCase::Run (this=0x10b1ef60) at src/gtest/gtest-all.cpp:3748
#17 0x00000000106fd7e8 in testing::internal::UnitTestImpl::RunAllTests (this=0x10ad8a70) at src/gtest/gtest-all.cpp:5540
#18 0x000000001071d7c0 in testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x10ad8a70,
method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0x106fd40c <testing::internal::UnitTestImpl::RunAllTests()>,
location=0x107dc9c0 "auxiliary test code (environments or event listeners)") at src/gtest/gtest-all.cpp:3393
#19 0x00000000107153f8 in testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool> (object=0x10ad8a70,
method=(bool (testing::internal::UnitTestImpl::*)(testing::internal::UnitTestImpl * const)) 0x106fd40c <testing::internal::UnitTestImpl::RunAllTests()>,
location=0x107dc9c0 "auxiliary test code (environments or event listeners)") at src/gtest/gtest-all.cpp:3429
#20 0x00000000106fbc44 in testing::UnitTest::Run (this=0x109d36b8 <testing::UnitTest::GetInstance()::instance>) at src/gtest/gtest-all.cpp:5177
#21 0x00000000100b12f4 in main (argc=1, argv=0x3ffffffff438) at src/caffe/test/test_caffe_main.cpp:39

@SvenTwo (Author) commented Feb 3, 2016

@anuphalarnkar Sorry, I have no idea. But you may find more hints in the log file of the test. E.g. if you run:

./build/test/test_all.testbin --gtest_filter="SGDSolverTest/2.TestLeastSquaresUpdate" --logtostderr

You should get all the standard caffe blabber, including the log messages about GPU pairing. It will be something like "parallel.cpp:390] GPUs pairs 0:1". You might be able to deduce which pairings cause the problem. (Use 2>filename.log to reroute it to a file)

I don't think it's related to this pull request though.

@anuphalarnkar

#3539 is fixed with this pull request.
Thanks!

@shelhamer (Member)

Resolved by the switch to new parallelism in #4563. Thanks for submitting a fix all the same!

@shelhamer closed this Apr 14, 2017