
Parallel all reduce communication and backprop #573

Closed · zhuzilin opened this issue Feb 25, 2022 · 6 comments
Labels: feature request (New feature or request)

@zhuzilin (Contributor)

Thank you for open-sourcing such a great repo for the community! Your work is really helping our team train large pretrained models :)

In our experiments, we found that when training a not-that-large model (e.g. 2.7B) with data parallelism, the scaling efficiency across multiple nodes is not good enough (under 70% for 2 nodes in our case). One reason is that currently the backward computation (the "BackwardPass" instruction) and the gradient communication (introduced by the "ReduceGrads" instruction) are executed sequentially. If we instead start the all-reduce for each gradient as soon as it is computed, we can overlap the backward computation with ReduceGrads and hide much of the cross-node communication cost.


We could use the backward hook mechanism in PyTorch for this optimization; there is an example of this pattern in the PyTorch source code.
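
To make the idea concrete, here is a minimal sketch of that hook-based overlap, assuming pure data parallelism, one backward pass per optimizer step, and an already-initialized torch.distributed process group. The helper names (attach_overlap_hooks, finish_grad_sync) are ours for illustration; this is not code from gpt-neox, DeeperSpeed, or PyTorch's DDP.

import torch
import torch.distributed as dist

def attach_overlap_hooks(model: torch.nn.Module):
    """Launch an async all-reduce for each gradient as soon as it is produced."""
    pending = []  # (param, reduced buffer, async work handle)
    world_size = dist.get_world_size()

    def make_hook(param):
        def hook(grad):
            buf = grad.detach().clone().div_(world_size)  # average rather than sum
            work = dist.all_reduce(buf, async_op=True)    # non-blocking; overlaps with the rest of backprop
            pending.append((param, buf, work))
        return hook

    for p in model.parameters():
        if p.requires_grad:
            p.register_hook(make_hook(p))
    return pending

def finish_grad_sync(pending):
    """Call after loss.backward(): wait for outstanding reductions, then install them."""
    for param, buf, work in pending:
        work.wait()
        param.grad = buf
    pending.clear()

In practice one would also bucket small gradients into larger flat buffers before reducing, as DistributedDataParallel does, to avoid issuing one collective per parameter.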

This optimization may only work for pure data parallelism, as the communication pattern is quite different under model parallelism or pipeline parallelism.

We'd love to help if you are interested in applying such an optimization to your project (gpt-neox or DeeperSpeed)~ Thank you again for your great contribution to the community!

P.S. We observed some behavior that differs from the comment here:

# Update 'is pipe parallel' flag
# if we set pipe_parallel_size to 0 or 1, GPT2ModelPipe.to_sequential() is called, and we run training with
# the sequential model without the PipelineModule wrapper to avoid the overhead it incurs
self.update_value("is_pipe_parallel", self.pipe_parallel_size >= 1)

  • In our experiment, the PipelineModule wrapper is used when pipe_parallel_size is set to 1, and the to_sequential() version is used only when pipe_parallel_size is set to 0;
  • The PipelineModule version is observably faster than the to_sequential() version.

I wonder if this is the expected behavior? Thank you.

@zhuzilin added the feature request (New feature or request) label Feb 25, 2022
@StellaAthena (Member)

Very interesting! Can you share the details of your cluster, namely the GPUs and interconnect being used, and the parallelism settings? I am surprised by this and want to do some experiments before making any changes.

re: your PS

I believe this was set up to do a comparative speed test of PP = 1 and sequential modeling, though I can’t find any records of the results of that testing. I’ll open a separate issue to test PP = 1 vs sequential so it doesn’t fall through the cracks again.

@EricHallahan I don’t suppose you recall or can find the results of this testing?

@EricHallahan (Contributor)

  • In our experiment, the PipelineModule wrapper is used when pipe_parallel_size is set to 1, and the to_sequential() version is used only when pipe_parallel_size is set to 0;
  • The PipelineModule version is observably faster than the to_sequential() version.

I wonder if this is the expected behavior? Thank you.

Yes, it is expected behavior. Setting pipe_parallel_size to 0 sets is_pipe_parallel to False and hence disables pipelining. This is needed because enabling pipelining has nontrivial memory overhead on some systems.
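
For reference, the flag logic quoted in the original post reduces to the comparison below (a paraphrase for illustration, not additional repo code):

def is_pipe_parallel(pipe_parallel_size: int) -> bool:
    # 0 -> GPT2ModelPipe.to_sequential() is used; >= 1 -> the PipelineModule wrapper is kept
    return pipe_parallel_size >= 1

assert is_pipe_parallel(0) is False  # sequential model, no pipeline engine
assert is_pipe_parallel(1) is True   # PP=1 still runs through the pipeline engine, with a single stage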

It is news to me that PipelineModule is faster than to_sequential(), but I do not personally recall any benchmarks or testing of this, @StellaAthena.

@zhuzilin (Contributor, Author)

@StellaAthena The testing cluster was 2 nodes, each with 4 V100s, and I was running with the default 2-7B.yml config and some custom data. The time per iteration increases from around 7 s/iter to 8 s/iter...

@reyoung
reyoung commented Feb 28, 2022

@StellaAthena Also, the NIC is 100G RDMA/RoCE.

@StellaAthena (Member)

StellaAthena commented Feb 28, 2022

@ShivanshuPurohit is going to look into this :)

@reyoung can you post whatever performance statistics you have with your 2-node cluster set-up? FLOPS, % comms, etc.?

@sdtblck (Contributor)

sdtblck commented Mar 1, 2022

Hey @zhuzilin, really interesting!

Firstly, w.r.t. the speed difference between pp=0 and pp=1, we also found something similar; see #269. Although maybe the speed difference isn't quite as stark as what you found, I'm not sure of the source of the difference.

W.r.t. the optimization, I see no reason this couldn't also work with MP and PP, and we'd be very interested in getting something like this implemented. I suspect it might not be so straightforward with deepspeed, though! Fundamentally, you're doing the same communication op with MP / PP; the group you're reducing within is just smaller (see the sketch after the list below). So I think this should definitely be possible, but I'm not yet certain how this optimization would interact with:

  1. Deepspeed. All training currently relies on the deepspeed engine, and they "handle" DP optimization for you. We would have to figure out how to fully handle this ourselves, or implement the optimization in deepspeed itself. (We're trying to remove our dependency on deepspeed and move to OSLO, but this will likely take a while.)
  2. ZeRO 1 / 2. This also ties in with the above, since these optimizers are implemented in deepspeed, but making this optimization compatible with the ZeRO 1 / 2 optimizers would likely require some more work.
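
To illustrate "the same communication op, just over a smaller group": under MP / PP, the per-gradient all-reduce from the sketch above would be issued over the data-parallel process group rather than the world group. The rank layout below is a hypothetical Megatron-style arrangement for illustration, not gpt-neox or deepspeed code.

import torch.distributed as dist

def build_data_parallel_group(model_parallel_size: int):
    """Create the process group of ranks that hold replicas of the same model shard."""
    world_size = dist.get_world_size()
    rank = dist.get_rank()
    my_group = None
    for shard in range(model_parallel_size):
        ranks = list(range(shard, world_size, model_parallel_size))
        group = dist.new_group(ranks)  # every rank must participate in creating every group
        if rank in ranks:
            my_group = group
    return my_group

# Inside the backward hook, the reduction then becomes:
#   dist.all_reduce(buf, group=dp_group, async_op=True)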
