
Overlap of communication with computation does not seem to be happening, according to the GPU log #33

Closed
qjfytz opened this issue Jul 24, 2018 · 5 comments


@qjfytz

qjfytz commented Jul 24, 2018

We are using the apex extension with PyTorch 0.4.0. The system information is:
system: ubuntu 16.04.4
pytorch version: 0.4.0 with CUDA 9.1 and CUDNN 7.0.5
python version: 3.5.2
GPU: 8 × Tesla P100
NVIDIA driver: 390.46
Model: ResNet 50

We set shared_parameter=False to enable overlapping of communication with computation (we have read the source code and found that if the value is True, the communication is only done after all computation finishes). The message_size is reduced to 10^6. We ran 6 iterations and recorded the GPU log with the NVIDIA profiler (nvprof).
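For reference, here is a minimal sketch of the wrapper setup described above. The keyword names below (`shared_param`, corresponding to the `shared_parameter` flag mentioned here, and `message_size`) and the launch pattern are assumptions about the apex/PyTorch API of that era and may differ between versions:

```python
import argparse

import torch
import torch.distributed as dist
import torchvision
from apex.parallel import DistributedDataParallel as ApexDDP

# Assumes one process per GPU, launched so that --local_rank is passed to each process.
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend='nccl', init_method='env://')

model = torchvision.models.resnet50().cuda()
# shared_param=False: allreduce gradient buckets during the backward pass
# instead of once at the end; flush a bucket once roughly 1e6 gradient
# elements are ready.
model = ApexDDP(model, shared_param=False, message_size=1000000)
```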

However, we found from the GPU log that the overlap is not realized. The log of the 6th iteration is shown below. The first AllReduceKernel call comes only after the MaxPoolBackward call, which is at the end of the backward computation. We checked the other iterations and found the same behavior.

```
28.478387,0.475834,49,8,64,256,1,1,32,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void MaxPoolBackward<float, float>(int, float const *, long const *, int, int, int, int, int, int, int, int, int, int, int, int, int, int, float*)",291627

28.478871,0.138303,12544,1,1,512,1,1,10,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void kernelPointwiseApply3<ThresholdUpdateGradInput<float>, float, float, float, unsigned int, int=-2, int=-2, int=-2>(OffsetInfo<ThresholdUpdateGradInput<float>, float, unsigned int>, OffsetInfo<float, float, int=-2>, OffsetInfo<float, float, int=-2>, float, float)",291643

28.479021,0.007424,,,,,,,,,,0.001343,0.176630,"Device",,"Tesla P100-PCIE-16GB (0)","1","24","[CUDA memset]",291667

28.479047,0.197309,110,1,1,512,1,1,64,0.265625,24.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","24","void cudnn::detail::bn_bw_1C11_singleread<float, int=512, bool=1, int=1, int=2, int=14>(float, float, float, float, cudnnTensorStruct, float const *, cudnn::detail::bn_bw_1C11_singleread<float, int=512, bool=1, int=1, int=2, int=14>, float const , cudnn::detail::bn_bw_1C11_singleread<float, int=512, bool=1, int=1, int=2, int=14>, cudnnTensorStruct*, float const *, float*, float const *, float const , float const , float, cudnn::reduced_divisor, int, float*, cudnn::detail::bnBwPersistentState*, int, float, float, float, int, float, cudnnStatus_t*, bool)",291697

28.479262,0.002912,1,112,1,128,1,1,14,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","cudnn::maxwell::gemm::computeWgradOffsetsKernel(cudnn::maxwell::gemm::ComputeOffsetsParams)",291715
28.479274,0.008000,37,1,1,256,1,1,8,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void scalePackedTensor_kernel<float, float>(cudnnTensor4dStruct, float*, float)",291721
28.479295,0.004831,1,1,1,256,1,1,12,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","cudnn::maxwell::gemm::computeBOffsetsKernel(cudnn::maxwell::gemm::ComputeBOffsetsParams)",291726

28.479312,0.466554,2,1,112,128,1,1,128,10.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","maxwell_scudnn_128x64_stridedB_splitK_large_nn",291730

28.479791,0.007104,2,1,1,512,1,1,10,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void kernelPointwiseApply2<TensorAddOp<float>, float, float, unsigned int, int=-2, int=-2>(OffsetInfo<TensorAddOp<float>, float, unsigned int>, OffsetInfo<float, float, int=-2>, float, float)",291741

28.479806,0.043616,4000,1,1,512,1,1,10,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void kernelPointwiseApply2<TensorAddOp<float>, float, float, unsigned int, int=-2, int=-2>(OffsetInfo<TensorAddOp<float>, float, unsigned int>, OffsetInfo<float, float, int=-2>, float, float)",291754

28.479852,0.007136,,,,,,,,,,0.000046,0.006265,"Pinned","Device","Tesla P100-PCIE-16GB (0)","1","14","[CUDA memcpy HtoD]",291792

28.479865,0.005311,4,1,1,512,1,1,10,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void kernelPointwiseApply2<TensorAddOp<float>, float, float, unsigned int, int=-2, int=-2>(OffsetInfo<TensorAddOp<float>, float, unsigned int>, OffsetInfo<float, float, int=-2>, float, float)",291874

28.479879,0.049184,112,2,1,512,1,1,13,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","14","void CatArrayBatchedCopy<float, unsigned int, int=1>(float*, CatArrInputTensor<float, unsigned int>*, OutputTensorSizeStride<unsigned int, unsigned int=4>, int, unsigned int)",291807

28.479886,0.007520,4,1,1,512,1,1,10,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void kernelPointwiseApply2<TensorAddOp<float>, float, float, unsigned int, int=-2, int=-2>(OffsetInfo<TensorAddOp<float>, float, unsigned int>, OffsetInfo<float, float, int=-2>, float, float)",291889

28.479906,0.032896,2048,1,1,512,1,1,10,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void kernelPointwiseApply2<TensorAddOp<float>, float, float, unsigned int, int=-2, int=-2>(OffsetInfo<TensorAddOp<float>, float, unsigned int>, OffsetInfo<float, float, int=-2>, float, float)",291908

28.479940,4.409442,1,1,1,257,1,1,128,0.007812,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","14","void AllReduceKernel<int=256, int=8, FuncSum<float>, float>(KernelArgs<FuncSum<float>>)",291821
```

Could you please tell us the reason, or point out what we are doing wrong when using the apex extension? Thanks for your help.

@mcarilli
Contributor

I'll try to reproduce this behavior on our cluster. In the meantime, try running with the message size set to 1. This should cause the distributed wrapper to allreduce each parameter individually, as soon as its gradient is received. This may negatively affect performance, because it disables bucketing of transfers, but it should help force communication to overlap with computation (assuming everything is working correctly).
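For concreteness, a sketch of the suggested diagnostic configuration (same caveat as above: the `shared_param` and `message_size` names are assumed from the apex API of that era, and the process group is assumed to be initialized already):

```python
import torch
import torchvision
from apex.parallel import DistributedDataParallel as ApexDDP

def wrap_for_diagnostics(local_rank):
    """Wrap ResNet-50 with message_size=1, so each parameter is allreduced
    individually as soon as its gradient is produced. Any overlap with the
    remaining backward computation should then show up in the nvprof GPU
    trace. Bucketing is effectively disabled, so throughput may drop."""
    torch.cuda.set_device(local_rank)
    model = torchvision.models.resnet50().cuda()
    return ApexDDP(model, shared_param=False, message_size=1)
```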

@qjfytz
Author

qjfytz commented Jul 25, 2018

Thank you @mcarilli. I have tried setting message_size to 1, but still cannot see any overlap. AllReduceKernel and AllReduceKernelSmall are only called after all backward computation has finished. We look forward to your test results.

@mcarilli
Contributor

mcarilli commented Jul 25, 2018

Ok, I have an explanation. Any Pytorch version prior to pytorch/pytorch#7604 will not observe overlap of communication with computation, due to a bug in how Pytorch prioritized ops in the backwards pass. Evidently 0.4.0 does not contain the fix.

I was able to reproduce the behavior you observed (no overlap) in the upstream 0.4.0 container (pytorch/pytorch:0.4-cuda9-cudnn7-devel). I then tried in another container which had a much more recent version of Pytorch installed, and saw comms and computation overlapping nicely. If you try with the most recent version of Pytorch and the most recent version of Apex, it should work.

@mcarilli
Contributor

The latest stable release, PyTorch 0.4.1, should contain the fix from #7604, so it is also a viable option.
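As a quick sanity check before re-profiling, one could confirm that the installed build is at least 0.4.1; a rough sketch that only inspects the version string (it does not prove the fix is actually present):

```python
import torch

# PyTorch 0.4.1 and later should carry the backward-pass priority fix from
# pytorch/pytorch#7604; a 0.4.0 build will not overlap communication with
# computation. This is a crude lexicographic check on the version string.
print("PyTorch version:", torch.__version__)
if torch.__version__ < "0.4.1":
    raise RuntimeError(
        "PyTorch %s predates the fix from pytorch/pytorch#7604; "
        "upgrade to 0.4.1 or a newer build." % torch.__version__
    )
```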

@qjfytz
Author

qjfytz commented Jul 28, 2018

You are right, @mcarilli. We can see the overlap now. Thank you again. This issue can be closed.
