
fix shared memory bug #185

Merged
merged 3 commits into from
Mar 4, 2019
Conversation

rongou
Contributor

@rongou rongou commented Mar 1, 2019

When building the tree, connections are bidirectional, but the original code reused the same shared memory buffer for both directions, causing race conditions. This only happens when CUDA_VISIBLE_DEVICES restricts each process to a single GPU, so all connections go through shared memory.

For example, in the tree

0 <-> 1 <-> 2 <-> 3

ranks 1 and 2 would each use the same buffer for both directions of their link.

@sjeaugey

sjeaugey pushed a commit that referenced this pull request Mar 4, 2019
The shared memory filename was only based on the destination. While
this was OK for rings since only one rank would send data to a given
rank, it would crash with trees because they communicate in both
directions.

Co-authored-by: Rong Ou <rong.ou@gmail.com>