Repair nccl op test #8575

Merged
merged 12 commits into from
Mar 13, 2018
Conversation

QiJune
Member

@QiJune commented Feb 26, 2018

fix #8582

@@ -121,27 +134,11 @@ class NCCLTester : public ::testing::Test {
std::vector<p::DeviceContext *> dev_ctxs;
f::Scope g_scope;
std::mutex mu;
std::vector<int> gpu_list;
Contributor

Data members of classes should have a trailing underscore.
Refer: Variable Names

auto op = f::OpRegistry::CreateOp(*op1);
VLOG(1) << "invoke NCCLInitOp.";
op->Run(g_scope, cpu_place);
VLOG(1) << "NCCLInitOp finished.";
}

int GetGPUData(int gpu_id) { return gpu_id + 42; }

This function is necessary because simply setting GPU data = gpu_id won't expose the incorrect-linking error. More specifically, Paddle might be compiled with the NCCL 1.3 header while dynamically linked against NCCL2.so.

@@ -97,7 +108,7 @@ class NCCLTester : public ::testing::Test {
send_tensor->Resize(kDims);
send_tensor->mutable_data<T>(kDims, place);
Contributor

Maybe line 108 is unnecessary.

@@ -97,7 +108,7 @@ class NCCLTester : public ::testing::Test {
send_tensor->Resize(kDims);
send_tensor->mutable_data<T>(kDims, place);

-  std::vector<T> send_vector(f::product(kDims), gpu_id);
+  std::vector<T> send_vector(f::product(kDims), GetGPUData(gpu_id));
paddle::framework::TensorFromVector<T>(send_vector, *ctx, send_tensor);
ctx->Wait();
Contributor

Is it necessary to synchronize here? I think the copy will synchronize the GPU and CPU at line 179.


The copying is cudaMemcpyAsync. Looks like we need to add a wait to line 179...

template <>
void Copy<platform::CPUPlace, platform::CUDAPlace>(
platform::CPUPlace dst_place, void* dst, platform::CUDAPlace src_place,
const void* src, size_t num, cudaStream_t stream) {
platform::SetDeviceId(src_place.device);
platform::GpuMemcpyAsync(dst, src, num, cudaMemcpyDeviceToHost, stream);
}

Contributor

I don't think so, because the memory on the CPU side is pageable, so the copy doesn't return until it has completed.
The current CUDA Runtime documentation states, under Asynchronous (Memcpy):
- For transfers from device memory to pageable host memory, the function will return only once the copy has completed.

Member Author

👍

Contributor

@chengduoZH left a comment

LGTM!

@QiJune QiJune merged commit 7287630 into PaddlePaddle:develop Mar 13, 2018
Development

Successfully merging this pull request may close these issues.

nccl operator unit test is broken down