Some questions on PipeTransformer #3
Comments
Hi, thanks for following our work.
Please check this doc for details:
https://github.com/Distributed-AI/PipeTransformer/blob/master/doc/pytorch_blog/PipeTransformer.md
Thanks for the detailed reply!
Hello, thanks for the open-sourced project. I just ran some experiments and tried to understand your implementation.
Could you please help me explain these logs? (That would help me understand your paper and code. Sorry, I haven't gone through the code yet, but I will.)
1. Does a number such as 10.194431999999999 here represent the parameter size? And what about the values in the first dict {0: 4, 1: 2, 2: 3, 3: 3, 4: 3, 5: 3, 6: 4, 7: 5}? For example, in 0: 4, what does the 4 mean for device 0?
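One hedged reading of that dict, which is an assumption here and not confirmed against the PipeTransformer source: the key is a pipeline-stage (device) rank and the value is the number of transformer layers placed on that stage, in which case the values should sum to the number of partitioned layers:

```python
# Hypothetical reading, NOT verified against the PipeTransformer code:
# key = pipeline-stage (device) rank, value = number of layers on that stage.
partition = {0: 4, 1: 2, 2: 3, 3: 3, 4: 3, 5: 3, 6: 4, 7: 5}

total_layers = sum(partition.values())
print(total_layers)  # 27 layers in total, under this reading
```

If that reading is right, an uneven split like 4/2/3/... would reflect load balancing across stages rather than equal layer counts.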
2. What does "communication frequency" mean here? I did observe that the number for the first batch is much bigger than the others, i.e., 5065.584541 here. Do you know why?
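On the inflated first-batch number: a common (hedged) explanation for this pattern in general is one-time setup cost landing in the first measured step — context creation, memory-allocator warm-up, communicator initialization — which is why benchmarks usually discard warm-up iterations. A generic sketch of that effect; all names here are illustrative, not from PipeTransformer:

```python
import time

def timed_steps(step_fn, n_steps, warmup=1):
    """Time each step, then drop the first `warmup` measurements, whose
    cost is inflated by one-time setup (context creation, allocator
    warm-up, communicator initialization)."""
    times = []
    for _ in range(n_steps):
        t0 = time.perf_counter()
        step_fn()
        times.append(time.perf_counter() - t0)
    return times[warmup:]

# Toy step: the first call pays a one-time setup cost, like a first batch.
_state = {"initialized": False}
def step():
    if not _state["initialized"]:
        time.sleep(0.05)   # simulate expensive one-time initialization
        _state["initialized"] = True
    time.sleep(0.001)      # simulate the steady-state per-step work

steady = timed_steps(step, 5)
print(max(steady) < 0.05)  # True: steady-state steps are far cheaper than step 0
```

Whether this fully explains the 5065.584541 value depends on what PipeTransformer actually measures in that log line.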
3. After the first epoch, PipeTransformer freezes some layers, and I noticed some newly added ranks. From my understanding, if some layers are frozen, the number of ranks should be reduced, so why are there newly added ranks?
Furthermore, how should I interpret the frozen message here:
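For what it's worth, the PipeTransformer paper describes elastic pipelining: as layers freeze, each pipeline needs fewer stages, and the freed GPUs are used to launch additional pipeline replicas (wider data parallelism), so new ranks can join the data-parallel groups even though each individual pipeline shrinks. A toy sketch of that arithmetic — the function name is illustrative, and the real elasticity logic lives in the project's AutoPipe/AutoDP modules:

```python
def replicas_for(total_gpus, stages_per_pipeline):
    """Toy arithmetic: how many pipeline replicas fit on a fixed GPU budget.

    This only illustrates the idea; PipeTransformer's actual logic is
    more involved than an integer division.
    """
    return total_gpus // stages_per_pipeline

print(replicas_for(8, 4))  # before freezing: 2 replicas of a 4-stage pipeline
print(replicas_for(8, 2))  # after freezing shrinks the pipeline: 4 replicas
```

Under this view, the "newly added ranks" would be the processes spun up for the extra replicas, not extra pipeline stages.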
4. How does PipeTransformer perform inter-node communication? My machine has RDMA enabled, but I found that it uses TCP by default. Is there any way to enable RDMA?
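A hedged note on TCP vs. RDMA in PyTorch distributed generally: the Gloo backend is TCP-based, while the NCCL backend can use InfiniBand/RoCE (RDMA) transports when the hardware and drivers support them, controlled by standard NCCL environment variables. A sketch under those assumptions — the interface name is a placeholder for your NIC, and whether PipeTransformer exposes a backend switch is not confirmed here:

```python
import os

# Standard NCCL knobs (values illustrative, set before process-group init):
os.environ.setdefault("NCCL_IB_DISABLE", "0")        # "0" allows the InfiniBand/RDMA transport
os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # bootstrap interface; replace with your NIC

# Then initialize with the NCCL backend rather than Gloo, e.g.:
# import torch.distributed as dist
# dist.init_process_group(backend="nccl", init_method="env://")
```

If the project hard-codes the Gloo backend anywhere, that would force TCP regardless of these variables.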
Thanks