Some question on PipeTransformer #3

Closed
Young768 opened this issue Jul 24, 2021 · 2 comments
Young768 commented Jul 24, 2021

Hello, thanks for the open-sourced project. I just ran some experiments and tried to understand your implementation.
Could you please help me interpret these logs? (This would help me understand your paper and code. Sorry, I haven't gone through the code yet, but I will.)
1.

_auto_balanced_elastic_partition(): {0: 4, 1: 2, 2: 3, 3: 3, 4: 3, 5: 3, 6: 4, 7: 5}
4189 2021-07-23,22:41:45.277 - {auto_pipe.py (141)} - _auto_balanced_elastic_partition(): {0: 10.194431999999999, 1: 7.087872, 2: 11.81184, 3: 9.451775999999999, 4: 11.81184, 5: 9.451775999999999, 6: 14.175744, 7: 11.890276}

Does a number such as 10.194431999999999 here represent parameter size? And what about the values in the first dict {0: 4, 1: 2, 2: 3, 3: 3, 4: 3, 5: 3, 6: 4, 7: 5}? For example, in 0: 4, what does the 4 mean on device 0?

2.

4189 2021-07-23,22:42:04.167 - {cv_trainer.py (76)} - train(): global_rank = 0. epoch = 0, batch index = 0/79
4189 2021-07-23,22:42:07.974 - {cv_trainer.py (92)} - train(): (epoch = 0) forward_time_per_batch = 3.789346694946289
4189 2021-07-23,22:42:09.849 - {cv_trainer.py (105)} - train(): global_rank = 0. data loading cost = 5.682324171066284
4189 2021-07-23,22:42:09.849 - {cv_trainer.py (109)} - train(): global_rank = 0. sample_num_throughput (images/second): 4466480
4189 2021-07-23,22:42:09.850 - {cv_trainer.py (112)} - train(): global_rank = 0. communication frequency (cross machine sync/second): 5065.584541
4189 2021-07-23,22:42:09.850 - {cv_trainer.py (122)} - train(): -------------------------------------
4189 2021-07-23,22:42:10.461 - {cv_trainer.py (72)} - train(): (epoch = 0) backwards_time_per_batch = 3.137924909591675
4189 2021-07-23,22:42:10.461 - {cv_trainer.py (74)} - train(): --------------global_rank = 0. Epoch 0, batch index 1 Statistics: 
4189 2021-07-23,22:42:10.461 - {cv_trainer.py (76)} - train(): global_rank = 0. epoch = 0, batch index = 1/79
4189 2021-07-23,22:42:10.706 - {cv_trainer.py (92)} - train(): (epoch = 0) forward_time_per_batch = 2.006029725074768
4189 2021-07-23,22:42:11.244 - {cv_trainer.py (109)} - train(): global_rank = 0. sample_num_throughput (images/second): 916
4189 2021-07-23,22:42:11.244 - {cv_trainer.py (112)} - train(): global_rank = 0. communication frequency (cross machine sync/second): 1.433845

What does communication frequency mean here? I observed that the number at the first batch is much bigger than the others, i.e., 5065.584541 here. Do you know why?

3.
After the first epoch, PipeTransformer obtains some frozen layers, and I saw some newly added ranks. From my understanding, if some layers are frozen, the number of ranks should be reduced. Why are there newly added ranks?

################################ Number of frozen layers: 6 
################################ Pipe length: 4/8 
################################ Newly added ranks: [1, 9] 

Furthermore, how should I interpret the frozen message here:

frozen_message = [6.0, 4.0, 14.175744, 1.0, 2.0, 1.0, 9.0, -1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]

4. How does PipeTransformer execute inter-node communication? My machine has RDMA enabled, but I found that it uses TCP by default. Is there any way to enable RDMA?

Thanks


chaoyanghe (Contributor) commented Jul 25, 2021

Hi, thanks for following our work.

1. It is generated by the load balancer at

logging.info(balanced_sub_layer_distribution)

Check this function for details (a sketch of the idea follows below):

def generate_parameter_size_wise_balance(num_devices, param_list, num_frozen_layer):
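For intuition, here is a minimal hypothetical sketch (not the repo's implementation) of parameter-size-wise balancing: layers are assigned to devices so that each device ends up with roughly its fair share of the total parameter size. If the balancer distributes whole layers this way, the first logged dict would be the per-device layer count ({0: 4} → 4 layers on device 0) and the second the per-device total parameter size.

```python
# Hypothetical sketch (not the repo's code) of parameter-size-wise load
# balancing: walk the layers in order and assign each one to the current
# device until that device reaches its fair share of the total parameter
# size, then move on to the next device.
def balance_by_parameter_size(num_devices, param_sizes):
    target = sum(param_sizes) / num_devices          # fair share per device
    layer_counts = {d: 0 for d in range(num_devices)}
    size_per_device = {d: 0.0 for d in range(num_devices)}

    device = 0
    for size in param_sizes:
        # advance to the next device once the current one is "full",
        # keeping at least one layer per device
        if (size_per_device[device] + size > target
                and layer_counts[device] > 0
                and device < num_devices - 1):
            device += 1
        layer_counts[device] += 1
        size_per_device[device] += size
    return layer_counts, size_per_device


if __name__ == "__main__":
    # toy per-layer parameter sizes in millions (illustrative numbers only)
    sizes = [2.4, 2.4, 2.4, 3.0, 2.4, 2.4, 3.5, 2.4, 2.4, 2.4, 3.0, 2.4]
    counts, sizes_per_dev = balance_by_parameter_size(4, sizes)
    print(counts)          # layers per device, like the first logged dict
    print(sizes_per_dev)   # parameter size per device, like the second dict
```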

2. The communication frequency represents DP's frequency of synchronizing among all workers. I mainly used it for debugging, to check whether the frequency is stable. We need this because in our PipeTransformer design the number of workers in DP is dynamic: every time the number changes, DDP has to wait for CUDA context creation for the new workers before it can start synchronizing all workers, which makes the waiting time for the first batch longer than for the other batches. We put the time analysis for this step in the appendix. The first logged value is big because we deduct that waiting time from the elapsed time (via time_finish_prepare_ddp), making the measured time gap for the first batch very small. Check the details at the line below (a self-contained sketch follows it):

comm_freq = communication_count / (time.time() - time_finish_prepare_ddp)
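A minimal self-contained sketch of that measurement, with hypothetical stand-ins for the data loader and training step:

```python
import time

# Minimal sketch (hypothetical names, not the repo's exact code) of the
# frequency measurement. `time_finish_prepare_ddp` is stamped after DDP has
# created the process groups / CUDA contexts for the (possibly new) workers,
# so the long first-batch wait is excluded from the denominator. That is why
# the first logged value is huge: the elapsed time is still tiny.
time_finish_prepare_ddp = time.time()
communication_count = 0

for batch_idx in range(3):                 # stand-in for the real data loader
    time.sleep(0.1)                        # stand-in for forward/backward/step
    communication_count += 1               # one cross-machine sync per step
    comm_freq = communication_count / (time.time() - time_finish_prepare_ddp)
    print(f"batch {batch_idx}: comm_freq = {comm_freq:.2f} syncs/second")
```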

3. Please check our paper's AutoDP section. After the pipeline length is reduced, we add more pipelines to accelerate training. The newly added ranks are for these new pipelines. You can also read our blog for details:

https://github.com/Distributed-AI/PipeTransformer/blob/master/doc/pytorch_blog/PipeTransformer.md
(will be released soon after FB's Legal procedure)

frozen_message is the message passed between the newly added and old pipelines. Again, check the AutoDP section for details. The protocol of this message can be found at:

def _build_broad_cast_message(self, num_frozen_layers, pipe_len, max_parameter_per_gpu_at_beginning,
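Going by that signature and the sample frozen_message in the question, the message appears to be a fixed-length float tensor whose first slots carry num_frozen_layers (6.0), pipe_len (4.0), and max_parameter_per_gpu_at_beginning (14.175744). The sketch below is a hypothetical reconstruction of that pattern; the layout of the remaining slots (the 1.0, 2.0, 1.0, 9.0, -1 fields) is a guess and should be checked against _build_broad_cast_message.

```python
import torch

# Hypothetical reconstruction of the fixed-length broadcast message; the
# sample frozen_message above has 50 slots. Only the first three fields are
# implied by the function signature; the rank slots and -1 terminator are
# guesses, and the real layout should be checked in the repo.
MSG_LEN = 50

def build_broadcast_message(num_frozen_layers, pipe_len,
                            max_parameter_per_gpu_at_beginning,
                            newly_added_ranks):
    msg = torch.zeros(MSG_LEN)
    msg[0] = float(num_frozen_layers)                 # 6.0 in the sample
    msg[1] = float(pipe_len)                          # 4.0 in the sample
    msg[2] = max_parameter_per_gpu_at_beginning       # 14.175744 in the sample
    # assumed: newly added ranks, terminated by -1, rest zero-padded
    for i, rank in enumerate(newly_added_ranks):
        msg[3 + i] = float(rank)
    msg[3 + len(newly_added_ranks)] = -1.0
    return msg
```

Packing everything into one fixed-length float tensor is a natural choice here, since torch.distributed collectives require every participant to pass an identically shaped buffer.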

4. For inter-node communication, we still use PyTorch DDP. For tensor (gradient) synchronization we use the NCCL backend, and for the frozen_message we use the GLOO backend. Again, check the AutoDP section first to understand our design.
The blog above introduces which APIs are used for the different process groups. For RDMA, please set DDP's group to use the NCCL backend. You can refer to our code at the line below (a sketch of the pattern follows it):

self.active_process_group = dist.new_group(ranks=self.active_ranks, backend=Backend.NCCL,
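A minimal sketch of that two-group pattern (the rank lists and function name are illustrative; it assumes dist.init_process_group() was already called):

```python
import torch.distributed as dist
from torch.distributed import Backend

# Sketch of the two-group pattern described above: an NCCL group for GPU
# gradient all-reduce and a GLOO group for small CPU-side control messages
# such as frozen_message.
def build_process_groups(active_ranks, all_ranks):
    nccl_group = dist.new_group(ranks=active_ranks, backend=Backend.NCCL)
    gloo_group = dist.new_group(ranks=all_ranks, backend=Backend.GLOO)
    return nccl_group, gloo_group
```

NCCL itself picks up InfiniBand/RDMA when the hardware is available; environment variables such as NCCL_IB_DISABLE and NCCL_SOCKET_IFNAME control interface selection.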

Young768 (Author) commented

Thanks for the detailed reply!
