
Member

abergeron commented Aug 21, 2017

Extracted from #485.

fix #497

DEF_PROC(ncclResult_t, ncclBcast, (void* buff, size_t count, ncclDataType_t datatype, int root, ncclComm_t comm, cudaStream_t stream ));
DEF_PROC(ncclResult_t, ncclAllGather, (const void* sendbuff, void* recvbuff, size_t sendcount, ncclDataType_t datatype, ncclComm_t comm, cudaStream_t stream));
// We don't need this but we use it as a sentinel to prevent nccl 1.0 from loading.
Member

Can you explain how this works and what error the user gets?

Member Author

The loader will try to find the ncclGroupStart symbol in the library, fail to find it, report that the symbol is missing, and abort the load.
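
For readers unfamiliar with the pattern, here is a minimal sketch of a sentinel-symbol check. It is not the libgpuarray loader: `lookup_fn`, `load_required`, and the two `fake_*_lookup` functions are hypothetical names made up for illustration, and the fake lookups only stand in for libraries with and without the group API. The sentinel is spelled `ncclGroupStart`, as in NCCL 2's public header. In a real loader the callback would presumably wrap something like `dlsym()`.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical lookup callback: returns the symbol's address or NULL.
 * A real loader would wrap dlsym() or GetProcAddress() here. */
typedef void *(*lookup_fn)(const char *name);

/* Symbols the wrapper requires. ncclGroupStart is never called; it is
 * listed only because NCCL 1.x does not export it. */
static const char *const required_symbols[] = {
    "ncclBcast",
    "ncclAllGather",
    "ncclGroupStart", /* sentinel: exported by NCCL 2.x only */
};

/* Resolve every required symbol. Returns 0 on success. On the first
 * missing symbol, writes a message into err and returns -1. */
static int load_required(lookup_fn lookup, char *err, size_t errlen) {
    size_t n = sizeof(required_symbols) / sizeof(required_symbols[0]);
    for (size_t i = 0; i < n; i++) {
        if (lookup(required_symbols[i]) == NULL) {
            snprintf(err, errlen, "missing symbol: %s", required_symbols[i]);
            return -1;
        }
    }
    return 0;
}

/* Illustrative stand-ins for two library versions. */
static int dummy;

/* Pretend NCCL 1.x: has the collectives but not the group API. */
static void *fake_nccl1_lookup(const char *name) {
    if (strcmp(name, "ncclBcast") == 0 || strcmp(name, "ncclAllGather") == 0)
        return &dummy;
    return NULL;
}

/* Pretend NCCL 2.x: everything above plus ncclGroupStart. */
static void *fake_nccl2_lookup(const char *name) {
    if (strcmp(name, "ncclGroupStart") == 0)
        return &dummy;
    return fake_nccl1_lookup(name);
}
```

Calling `load_required(fake_nccl1_lookup, err, sizeof err)` fails on the sentinel and leaves the message in `err`, while the same call with `fake_nccl2_lookup` succeeds. Whether that message ever reaches the user depends on the caller propagating it.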

Member

nouiz commented Aug 22, 2017 via email

nouiz mentioned this pull request Aug 22, 2017

Member

nouiz commented Aug 22, 2017

After an in-person discussion with @abergeron: he will change it to give a clear error when the NCCL that is available is not version 2.0.

Member

nouiz commented Aug 23, 2017

The error isn't passed up to the user:

DEVICE=cuda0 nosetests tests/collectives/test_collectives.py -s --pdb-failure --pdb

does not raise any error on this branch with NCCL 1 installed.

Member

nouiz commented Aug 23, 2017

Also, there is a segfault.

Member

nouiz commented Aug 23, 2017

~/repos/libgpuarray/pygpu$ DEVICE=cuda0 nosetests tests/collectives/test_collectives.py -s
*** Testing for GeForce GTX 750
mpi4py found: True
*** Collectives testing for GeForce GTX 750
F......*** Error in `/Tmp/lisa/os_v5/anaconda/bin/python': free(): invalid pointer: 0x00007f76f5e69bf8 ***
Aborted (core dumped)

Member

nouiz commented Aug 23, 2017

That was caused by my environment, which mixed pygpu and libgpuarray versions. So merging.

nouiz merged commit 351f359 into Theano:master Aug 23, 2017

Development

Successfully merging this pull request may close these issues.

nccl 2.0 support

3 participants