Some question on PipeTransformer #3

Closed
Young768 opened this issue Jul 24, 2021 · 2 comments
Young768 commented Jul 24, 2021

Hello, thanks for the open-sourced project. I just ran some experiments and tried to understand your implementation.
Could you please help me interpret these logs? (This would help me understand your paper and code. Sorry, I haven't gone through the code yet, but I will.)
1.

_auto_balanced_elastic_partition(): {0: 4, 1: 2, 2: 3, 3: 3, 4: 3, 5: 3, 6: 4, 7: 5}
4189 2021-07-23,22:41:45.277 - {auto_pipe.py (141)} - _auto_balanced_elastic_partition(): {0: 10.194431999999999, 1: 7.087872, 2: 11.81184, 3: 9.451775999999999, 4: 11.81184, 5: 9.451775999999999, 6: 14.175744, 7: 11.890276}

Does a number such as 10.194431999999999 here represent parameter size? And what about the values in the first dict {0: 4, 1: 2, 2: 3, 3: 3, 4: 3, 5: 3, 6: 4, 7: 5}? For example, in 0: 4, what does the 4 mean on device 0?

2.

4189 2021-07-23,22:42:04.167 - {cv_trainer.py (76)} - train(): global_rank = 0. epoch = 0, batch index = 0/79
4189 2021-07-23,22:42:07.974 - {cv_trainer.py (92)} - train(): (epoch = 0) forward_time_per_batch = 3.789346694946289
4189 2021-07-23,22:42:09.849 - {cv_trainer.py (105)} - train(): global_rank = 0. data loading cost = 5.682324171066284
4189 2021-07-23,22:42:09.849 - {cv_trainer.py (109)} - train(): global_rank = 0. sample_num_throughput (images/second): 4466480
4189 2021-07-23,22:42:09.850 - {cv_trainer.py (112)} - train(): global_rank = 0. communication frequency (cross machine sync/second): 5065.584541
4189 2021-07-23,22:42:09.850 - {cv_trainer.py (122)} - train(): -------------------------------------
4189 2021-07-23,22:42:10.461 - {cv_trainer.py (72)} - train(): (epoch = 0) backwards_time_per_batch = 3.137924909591675
4189 2021-07-23,22:42:10.461 - {cv_trainer.py (74)} - train(): --------------global_rank = 0. Epoch 0, batch index 1 Statistics: 
4189 2021-07-23,22:42:10.461 - {cv_trainer.py (76)} - train(): global_rank = 0. epoch = 0, batch index = 1/79
4189 2021-07-23,22:42:10.706 - {cv_trainer.py (92)} - train(): (epoch = 0) forward_time_per_batch = 2.006029725074768
4189 2021-07-23,22:42:11.244 - {cv_trainer.py (109)} - train(): global_rank = 0. sample_num_throughput (images/second): 916
4189 2021-07-23,22:42:11.244 - {cv_trainer.py (112)} - train(): global_rank = 0. communication frequency (cross machine sync/second): 1.433845

What does communication frequency mean here? I observed that the number at the first batch is much bigger than the others, i.e., 5065.584541 here. Do you know why?

3.
After the first epoch, PipeTransformer obtains some frozen layers, and I saw some newly added ranks. From my understanding, if some layers are frozen, the number of ranks should be reduced. Why are there newly added ranks?

################################ Number of frozen layers: 6 
################################ Pipe length: 4/8 
################################ Newly added ranks: [1, 9] 

Furthermore, how should I interpret the frozen message here:

frozen_message = [6.0, 4.0, 14.175744, 1.0, 2.0, 1.0, 9.0, -1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

4. How does PipeTransformer execute inter-node communication? My machine has RDMA enabled, but I found that it uses TCP by default. Is there any way to enable RDMA?

Thanks


chaoyanghe (Contributor) commented Jul 25, 2021

Hi, thanks for following our work.

1. It is generated by the load balancer at

logging.info(balanced_sub_layer_distribution)

Check this function for details (a sketch of the idea follows below):

def generate_parameter_size_wise_balance(num_devices, param_list, num_frozen_layer):
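For intuition, here is a minimal hypothetical sketch (not the repo's implementation) of parameter-size-wise balancing: layers are assigned to devices so that each device ends up with roughly its fair share of the total parameter size. If the balancer distributes whole layers this way, the first logged dict would be the per-device layer count ({0: 4} → 4 layers on device 0) and the second the per-device total parameter size.

```python
# Hypothetical sketch (not the repo's code) of parameter-size-wise load
# balancing: walk the layers in order and assign each one to the current
# device until that device reaches its fair share of the total parameter
# size, then move on to the next device.
def balance_by_parameter_size(num_devices, param_sizes):
    target = sum(param_sizes) / num_devices          # fair share per device
    layer_counts = {d: 0 for d in range(num_devices)}
    size_per_device = {d: 0.0 for d in range(num_devices)}

    device = 0
    for size in param_sizes:
        # advance to the next device once the current one is "full",
        # keeping at least one layer per device
        if (size_per_device[device] + size > target
                and layer_counts[device] > 0
                and device < num_devices - 1):
            device += 1
        layer_counts[device] += 1
        size_per_device[device] += size
    return layer_counts, size_per_device


if __name__ == "__main__":
    # toy per-layer parameter sizes in millions (illustrative numbers only)
    sizes = [2.4, 2.4, 2.4, 3.0, 2.4, 2.4, 3.5, 2.4, 2.4, 2.4, 3.0, 2.4]
    counts, sizes_per_dev = balance_by_parameter_size(4, sizes)
    print(counts)          # layers per device, like the first logged dict
    print(sizes_per_dev)   # parameter size per device, like the second dict
```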

2. The communication frequency represents DP's frequency of synchronizing among all workers. I mainly used it for debugging, to check whether the frequency is stable. We need this because in our PipeTransformer design the number of workers in DP is dynamic: every time the number changes, DDP has to wait for CUDA context creation for the new workers before it can start synchronizing all workers, which makes the waiting time for the first batch longer than for the other batches. We put the time analysis for this step in the appendix. The first logged value is big because we deduct that waiting time from the elapsed time (via time_finish_prepare_ddp), making the measured time gap for the first batch very small. Check the details at the line below (a self-contained sketch follows it):

comm_freq = communication_count / (time.time() - time_finish_prepare_ddp)
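A minimal self-contained sketch of that measurement, with hypothetical stand-ins for the data loader and training step:

```python
import time

# Minimal sketch (hypothetical names, not the repo's exact code) of the
# frequency measurement. `time_finish_prepare_ddp` is stamped after DDP has
# created the process groups / CUDA contexts for the (possibly new) workers,
# so the long first-batch wait is excluded from the denominator. That is why
# the first logged value is huge: the elapsed time is still tiny.
time_finish_prepare_ddp = time.time()
communication_count = 0

for batch_idx in range(3):                 # stand-in for the real data loader
    time.sleep(0.1)                        # stand-in for forward/backward/step
    communication_count += 1               # one cross-machine sync per step
    comm_freq = communication_count / (time.time() - time_finish_prepare_ddp)
    print(f"batch {batch_idx}: comm_freq = {comm_freq:.2f} syncs/second")
```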

3. Please check our paper's AutoDP section. After the pipeline length is reduced, we add more pipelines to accelerate training. The newly added ranks are for these new pipelines. You can also read our blog for details:

https://github.com/Distributed-AI/PipeTransformer/blob/master/doc/pytorch_blog/PipeTransformer.md
(will be released soon after FB's Legal procedure)

frozen_message is the message passed between the newly added and old pipelines. Again, check the AutoDP section for details. The protocol of this message can be found at:

def _build_broad_cast_message(self, num_frozen_layers, pipe_len, max_parameter_per_gpu_at_beginning,
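Going by that signature and the sample frozen_message in the question, the message appears to be a fixed-length float tensor whose first slots carry num_frozen_layers (6.0), pipe_len (4.0), and max_parameter_per_gpu_at_beginning (14.175744). The sketch below is a hypothetical reconstruction of that pattern; the layout of the remaining slots (the 1.0, 2.0, 1.0, 9.0, -1 fields) is a guess and should be checked against _build_broad_cast_message.

```python
import torch

# Hypothetical reconstruction of the fixed-length broadcast message; the
# sample frozen_message above has 50 slots. Only the first three fields are
# implied by the function signature; the rank slots and -1 terminator are
# guesses, and the real layout should be checked in the repo.
MSG_LEN = 50

def build_broadcast_message(num_frozen_layers, pipe_len,
                            max_parameter_per_gpu_at_beginning,
                            newly_added_ranks):
    msg = torch.zeros(MSG_LEN)
    msg[0] = float(num_frozen_layers)                 # 6.0 in the sample
    msg[1] = float(pipe_len)                          # 4.0 in the sample
    msg[2] = max_parameter_per_gpu_at_beginning       # 14.175744 in the sample
    # assumed: newly added ranks, terminated by -1, rest zero-padded
    for i, rank in enumerate(newly_added_ranks):
        msg[3 + i] = float(rank)
    msg[3 + len(newly_added_ranks)] = -1.0
    return msg
```

Packing everything into one fixed-length float tensor is a natural choice here, since torch.distributed collectives require every participant to pass an identically shaped buffer.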

4. For inter-node communication, we still use PyTorch DDP. For tensor (gradient) synchronization we use the NCCL backend, and for the frozen_message we use the GLOO backend. Again, check the AutoDP section first to understand our design.
The blog above introduces which APIs are used for the different process groups. For RDMA, please set DDP's group to use the NCCL backend. You can refer to our code at the line below (a sketch of the pattern follows it):

self.active_process_group = dist.new_group(ranks=self.active_ranks, backend=Backend.NCCL,
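A minimal sketch of that two-group pattern (the rank lists and function name are illustrative; it assumes dist.init_process_group() was already called):

```python
import torch.distributed as dist
from torch.distributed import Backend

# Sketch of the two-group pattern described above: an NCCL group for GPU
# gradient all-reduce and a GLOO group for small CPU-side control messages
# such as frozen_message.
def build_process_groups(active_ranks, all_ranks):
    nccl_group = dist.new_group(ranks=active_ranks, backend=Backend.NCCL)
    gloo_group = dist.new_group(ranks=all_ranks, backend=Backend.GLOO)
    return nccl_group, gloo_group
```

NCCL itself picks up InfiniBand/RDMA when the hardware is available; environment variables such as NCCL_IB_DISABLE and NCCL_SOCKET_IFNAME control interface selection.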

Young768 (Author) commented

Thanks for the detailed reply!
