Repair nccl op test #8575

Merged
merged 12 commits into from
Mar 13, 2018
Conversation

QiJune
Member

@QiJune commented Feb 26, 2018

fix #8582

@@ -121,27 +134,11 @@ class NCCLTester : public ::testing::Test {
std::vector<p::DeviceContext *> dev_ctxs;
f::Scope g_scope;
std::mutex mu;
std::vector<int> gpu_list;
Contributor

Data members of classes should have a trailing underscore.
Refer: Variable Names

auto op = f::OpRegistry::CreateOp(*op1);
VLOG(1) << "invoke NCCLInitOp.";
op->Run(g_scope, cpu_place);
VLOG(1) << "NCCLInitOp finished.";
}

int GetGPUData(int gpu_id) { return gpu_id + 42; }

This function is necessary because simply setting GPU data = gpu_id won't expose the incorrect-linking error. More specifically, Paddle might be compiled with the NCCL 1.3 header while dynamically linked against NCCL2.so.

@@ -97,7 +108,7 @@ class NCCLTester : public ::testing::Test {
send_tensor->Resize(kDims);
send_tensor->mutable_data<T>(kDims, place);
Contributor

Maybe line 108 is unnecessary.

@@ -97,7 +108,7 @@ class NCCLTester : public ::testing::Test {
send_tensor->Resize(kDims);
send_tensor->mutable_data<T>(kDims, place);

-  std::vector<T> send_vector(f::product(kDims), gpu_id);
+  std::vector<T> send_vector(f::product(kDims), GetGPUData(gpu_id));
paddle::framework::TensorFromVector<T>(send_vector, *ctx, send_tensor);
ctx->Wait();
Contributor

Is it necessary to synchronize here? I think the copy will synchronize the GPU and CPU at line 179.


The copying is cudaMemcpyAsync. Looks like we need to add a wait to line 179...

template <>
void Copy<platform::CPUPlace, platform::CUDAPlace>(
platform::CPUPlace dst_place, void* dst, platform::CUDAPlace src_place,
const void* src, size_t num, cudaStream_t stream) {
platform::SetDeviceId(src_place.device);
platform::GpuMemcpyAsync(dst, src, num, cudaMemcpyDeviceToHost, stream);
}

Contributor

I don't think so, because the memory on the CPU side is pageable, so the copy doesn't return until it has completed.
The current CUDA Runtime documentation states, under Asynchronous (Memcpy):
- For transfers from device memory to pageable host memory, the function will return only once the copy has completed.

Member Author

👍

Contributor

@chengduoZH left a comment

LGTM!

@QiJune QiJune merged commit 7287630 into PaddlePaddle:develop Mar 13, 2018
Development

Successfully merging this pull request may close these issues.

nccl operator unit test is broken down