
Overlap of communication with computation does not seem to be happening, according to the GPU log #33

Closed
qjfytz opened this issue Jul 24, 2018 · 5 comments


@qjfytz

qjfytz commented Jul 24, 2018

We are using the apex extension with PyTorch 0.4.0. The system information is:
system: ubuntu 16.04.4
pytorch version: 0.4.0 with CUDA 9.1 and CUDNN 7.0.5
python version: 3.5.2
GPU: 8 × Tesla P100
NVIDIA driver: 390.46
Model: ResNet 50

We set shared_parameter=False to enable overlapping of communication with computation (we have read the source code and found that if the value is True, the communication is only done after all computation finishes). The message_size is reduced to 10^6. We ran 6 iterations and recorded the GPU log with the NVIDIA profiler (nvprof).
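For reference, here is a minimal sketch of the wrapper setup described above. The keyword names below (`shared_param`, corresponding to the `shared_parameter` flag mentioned here, and `message_size`) and the launch pattern are assumptions about the apex/PyTorch API of that era and may differ between versions:

```python
import argparse

import torch
import torch.distributed as dist
import torchvision
from apex.parallel import DistributedDataParallel as ApexDDP

# Assumes one process per GPU, launched so that --local_rank is passed to each process.
parser = argparse.ArgumentParser()
parser.add_argument('--local_rank', type=int, default=0)
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend='nccl', init_method='env://')

model = torchvision.models.resnet50().cuda()
# shared_param=False: allreduce gradient buckets during the backward pass
# instead of once at the end; flush a bucket once roughly 1e6 gradient
# elements are ready.
model = ApexDDP(model, shared_param=False, message_size=1000000)
```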

However, we found from the GPU log that the overlap is not realized. The log of the 6th iteration is shown below. The first AllReduceKernel call comes only after the MaxPoolBackward call, which is at the end of the backward computation. We checked the other iterations and found the same behavior.

```
28.478387,0.475834,49,8,64,256,1,1,32,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void MaxPoolBackward<float, float>(int, float const *, long const *, int, int, int, int, int, int, int, int, int, int, int, int, int, int, float*)",291627

28.478871,0.138303,12544,1,1,512,1,1,10,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void kernelPointwiseApply3<ThresholdUpdateGradInput<float>, float, float, float, unsigned int, int=-2, int=-2, int=-2>(OffsetInfo<ThresholdUpdateGradInput<float>, float, unsigned int>, OffsetInfo<float, float, int=-2>, OffsetInfo<float, float, int=-2>, float, float)",291643

28.479021,0.007424,,,,,,,,,,0.001343,0.176630,"Device",,"Tesla P100-PCIE-16GB (0)","1","24","[CUDA memset]",291667

28.479047,0.197309,110,1,1,512,1,1,64,0.265625,24.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","24","void cudnn::detail::bn_bw_1C11_singleread<float, int=512, bool=1, int=1, int=2, int=14>(float, float, float, float, cudnnTensorStruct, float const *, cudnn::detail::bn_bw_1C11_singleread<float, int=512, bool=1, int=1, int=2, int=14>, float const , cudnn::detail::bn_bw_1C11_singleread<float, int=512, bool=1, int=1, int=2, int=14>, cudnnTensorStruct*, float const *, float*, float const *, float const , float const , float, cudnn::reduced_divisor, int, float*, cudnn::detail::bnBwPersistentState*, int, float, float, float, int, float, cudnnStatus_t*, bool)",291697

28.479262,0.002912,1,112,1,128,1,1,14,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","cudnn::maxwell::gemm::computeWgradOffsetsKernel(cudnn::maxwell::gemm::ComputeOffsetsParams)",291715
28.479274,0.008000,37,1,1,256,1,1,8,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void scalePackedTensor_kernel<float, float>(cudnnTensor4dStruct, float*, float)",291721
28.479295,0.004831,1,1,1,256,1,1,12,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","cudnn::maxwell::gemm::computeBOffsetsKernel(cudnn::maxwell::gemm::ComputeBOffsetsParams)",291726

28.479312,0.466554,2,1,112,128,1,1,128,10.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","maxwell_scudnn_128x64_stridedB_splitK_large_nn",291730

28.479791,0.007104,2,1,1,512,1,1,10,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void kernelPointwiseApply2<TensorAddOp<float>, float, float, unsigned int, int=-2, int=-2>(OffsetInfo<TensorAddOp<float>, float, unsigned int>, OffsetInfo<float, float, int=-2>, float, float)",291741

28.479806,0.043616,4000,1,1,512,1,1,10,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void kernelPointwiseApply2<TensorAddOp<float>, float, float, unsigned int, int=-2, int=-2>(OffsetInfo<TensorAddOp<float>, float, unsigned int>, OffsetInfo<float, float, int=-2>, float, float)",291754

28.479852,0.007136,,,,,,,,,,0.000046,0.006265,"Pinned","Device","Tesla P100-PCIE-16GB (0)","1","14","[CUDA memcpy HtoD]",291792

28.479865,0.005311,4,1,1,512,1,1,10,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void kernelPointwiseApply2<TensorAddOp<float>, float, float, unsigned int, int=-2, int=-2>(OffsetInfo<TensorAddOp<float>, float, unsigned int>, OffsetInfo<float, float, int=-2>, float, float)",291874

28.479879,0.049184,112,2,1,512,1,1,13,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","14","void CatArrayBatchedCopy<float, unsigned int, int=1>(float*, CatArrInputTensor<float, unsigned int>*, OutputTensorSizeStride<unsigned int, unsigned int=4>, int, unsigned int)",291807

28.479886,0.007520,4,1,1,512,1,1,10,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void kernelPointwiseApply2<TensorAddOp<float>, float, float, unsigned int, int=-2, int=-2>(OffsetInfo<TensorAddOp<float>, float, unsigned int>, OffsetInfo<float, float, int=-2>, float, float)",291889

28.479906,0.032896,2048,1,1,512,1,1,10,0.000000,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","7","void kernelPointwiseApply2<TensorAddOp<float>, float, float, unsigned int, int=-2, int=-2>(OffsetInfo<TensorAddOp<float>, float, unsigned int>, OffsetInfo<float, float, int=-2>, float, float)",291908

28.479940,4.409442,1,1,1,257,1,1,128,0.007812,0.000000,,,,,"Tesla P100-PCIE-16GB (0)","1","14","void AllReduceKernel<int=256, int=8, FuncSum<float>, float>(KernelArgs<FuncSum<float>>)",291821
```

Could you please tell us the reason, or point out what we are doing wrong when using the apex extension? Thanks for your help.

@mcarilli
Contributor

I'll try to reproduce this behavior on our cluster. In the meantime, try running with the message size set to 1. This should cause the distributed wrapper to allreduce each parameter individually, as soon as its gradient is received. This may negatively affect performance, because it disables bucketing of transfers, but it should help force communication to overlap with computation (assuming everything is working correctly).
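For concreteness, a sketch of the suggested diagnostic configuration (same caveat as above: the `shared_param` and `message_size` names are assumed from the apex API of that era, and the process group is assumed to be initialized already):

```python
import torch
import torchvision
from apex.parallel import DistributedDataParallel as ApexDDP

def wrap_for_diagnostics(local_rank):
    """Wrap ResNet-50 with message_size=1, so each parameter is allreduced
    individually as soon as its gradient is produced. Any overlap with the
    remaining backward computation should then show up in the nvprof GPU
    trace. Bucketing is effectively disabled, so throughput may drop."""
    torch.cuda.set_device(local_rank)
    model = torchvision.models.resnet50().cuda()
    return ApexDDP(model, shared_param=False, message_size=1)
```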

@qjfytz
Author

qjfytz commented Jul 25, 2018

Thank you @mcarilli. I have tried setting message_size to 1, but still cannot see any overlap. AllReduceKernel and AllReduceKernelSmall are only called after all backward computation has finished. We look forward to your test results.

@mcarilli
Contributor

mcarilli commented Jul 25, 2018

Ok, I have an explanation. Any Pytorch version prior to pytorch/pytorch#7604 will not observe overlap of communication with computation, due to a bug in how Pytorch prioritized ops in the backwards pass. Evidently 0.4.0 does not contain the fix.

I was able to reproduce the behavior you observed (no overlap) in the upstream 0.4.0 container (pytorch/pytorch:0.4-cuda9-cudnn7-devel). I then tried in another container which had a much more recent version of Pytorch installed, and saw comms and computation overlapping nicely. If you try with the most recent version of Pytorch and the most recent version of Apex, it should work.

@mcarilli
Contributor

The latest stable release, PyTorch 0.4.1, should contain the fix from #7604, so it is also a viable option.
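As a quick sanity check before re-profiling, one could confirm that the installed build is at least 0.4.1; a rough sketch that only inspects the version string (it does not prove the fix is actually present):

```python
import torch

# PyTorch 0.4.1 and later should carry the backward-pass priority fix from
# pytorch/pytorch#7604; a 0.4.0 build will not overlap communication with
# computation. This is a crude lexicographic check on the version string.
print("PyTorch version:", torch.__version__)
if torch.__version__ < "0.4.1":
    raise RuntimeError(
        "PyTorch %s predates the fix from pytorch/pytorch#7604; "
        "upgrade to 0.4.1 or a newer build." % torch.__version__
    )
```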

@qjfytz
Author

qjfytz commented Jul 28, 2018

You are right, @mcarilli. We can see the overlap now. Thank you again. This issue can be closed.
