[NVIDIA] TE Integration #7229
base: develop
Conversation
Force-pushed from 33de6ce to 90eeef7
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

```
@@            Coverage Diff             @@
##           develop    #7229      +/-   ##
===========================================
+ Coverage    56.67%   58.28%    +1.60%
===========================================
  Files          588      580        -8
  Lines        89243    85655     -3588
===========================================
- Hits         50580    49922      -658
+ Misses       38663    35733     -2930
```

☔ View full report in Codecov by Sentry.
Force-pushed from 9d4ec48 to 17bfad2
paddlenlp/trainer/trainer.py (Outdated)

```
@@ -1823,7 +1839,8 @@ def training_step(self, model: nn.Layer, inputs: Dict[str, Union[paddle.Tensor,
inputs = self._prepare_inputs(inputs)

with self.autocast_smart_context_manager():
loss = self.compute_loss(model, inputs)
with TransformerEngineHelper.fp8_autocast(enabled=self.use_fp8):
```
How about moving `TransformerEngineHelper.fp8_autocast` into `autocast_smart_context_manager`?
Yes, I did consider that design. But while recently testing pipeline parallel + fp8 + gradient_accumulation, I found that fp8_autocast cannot wrap the outside of compute_loss; it raises an error, see PR#93. Pipeline parallel contains a for loop of this kind internally, and fp8_autocast has to go inside that for loop, not outside it. So I placed it here: paddlenlp/transformers/gpt/modeling.py#L656
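For reference, a minimal sketch of the placement constraint described above, using a toy micro-batch loop. The fp8_autocast placeholder, forward_step, and micro_batches below are illustrative stand-ins, not the actual Paddle pipeline-parallel internals or the TransformerEngineHelper API:

```python
# Sketch only: why the FP8 context should sit inside the micro-batch loop
# rather than around the whole pipeline schedule. Names are illustrative.
from contextlib import contextmanager

@contextmanager
def fp8_autocast(enabled=True):
    # Placeholder for TransformerEngineHelper.fp8_autocast: the real context
    # manager maintains amax/scale state that is updated per forward pass.
    yield

def forward_step(micro_batch):
    # Stand-in for one pipeline-parallel forward on a single micro-batch.
    return sum(micro_batch)

micro_batches = [[1.0, 2.0], [3.0, 4.0]]

# Working pattern: enter the FP8 context once per micro-batch, inside the loop.
losses = []
for micro_batch in micro_batches:
    with fp8_autocast(enabled=True):
        losses.append(forward_step(micro_batch))

# Problematic pattern (as reported above): a single FP8 context wrapping the
# whole schedule breaks under pipeline parallel + gradient accumulation.
# with fp8_autocast(enabled=True):
#     for micro_batch in micro_batches:
#         forward_step(micro_batch)
```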
paddlenlp/trainer/trainer.py (Outdated)

```
@@ -1886,7 +1903,8 @@ def training_pipeline_step(self, model: nn.Layer, inputs: Dict[str, Union[paddle
model.lr_scheduler = None

with self.autocast_smart_context_manager():
loss = model.forward_backward_pipeline(inputs, self.scaler if self.do_grad_scaling else None)
with TransformerEngineHelper.fp8_autocast(enabled=self.use_fp8, fp8_group=self.dp_group):
```
The same as above.
paddlenlp/te_utils/te_modeling.py (Outdated)

```
from .te_helper import TransformerEngineHelper


class GPTDecoderLayerWithNVTEBackend(nn.Layer):
```
The GPT model code should live under transformers/gpt.
Moved it into paddlenlp/transformers/gpt/modeling.py.
@DrownFish19 please take a look and review as well.
Force-pushed from 8205e73 to 9c7ef55
Force-pushed from c5f71a8 to 6fbe37a
Force-pushed from 842ab05 to b03295e
This Pull Request is stale because it has been open for 60 days with no activity.
Force-pushed from 6e1bc5e to cd3b8f7
Could this PR be split up? Merge the code changes first, and then decide later how to manage the docs and run scripts.
```
@@ -0,0 +1,533 @@
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
```
There is a lot of duplicated code. I suggest adding the TE support directly to the existing run_pretrain.py; see how Megatron-LM supports it: https://github.com/NVIDIA/Megatron-LM/blob/2b92e61dac1c5ff84629239198659b447da118e1/megatron/training/arguments.py#L579
This file is no longer needed; both GPT and LLaMA now run through the existing run_pretrain.py. I split that part out into this PR: #8228
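As a rough illustration of the suggestion above, here is a minimal sketch of exposing TE as pretraining arguments, loosely following the Megatron-LM flag pattern linked in the comment. The field names (transformer_impl, use_fp8) are hypothetical, not the PR's actual arguments, and the real run_pretrain.py wires such fields through PaddleNLP's own argument dataclasses:

```python
# Sketch only: hypothetical argument surface for a TE backend switch.
from dataclasses import dataclass, field

@dataclass
class ModelArguments:
    transformer_impl: str = field(
        default="local",
        metadata={"help": "Transformer implementation to use: 'local' or 'transformer_engine'."},
    )
    use_fp8: bool = field(
        default=False,
        metadata={"help": "Enable FP8 execution via Transformer Engine (requires transformer_impl='transformer_engine')."},
    )

# Toy usage: what a TE + FP8 run would select.
args = ModelArguments(transformer_impl="transformer_engine", use_fp8=True)
print(args)
```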
```
@@ -0,0 +1,203 @@
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.
```
For the checkpoint conversion, could the parts common to GPT and LLaMA be factored out and implemented under the paddlenlp/utils directory?
The checkpoint file naming rules and model parameter naming rules in paddlenlp have changed (a version I wrote earlier no longer works), and the parameter layout depends on many settings, such as TP parallelism, whether weights are fused, and, for LLaMA, the additional splitting of GQA parameters. So for now I only implemented working versions for GPT and LLaMA separately.
A more general version should be possible, but these details need to be sorted out first; that can be done later.
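A minimal sketch of what a shared converter core could look like, assuming a plain rename-plus-TP-shard pass. convert_state_dict, name_map, and col_parallel are hypothetical names, and a real implementation would also need to cover fused weights and LLaMA GQA splitting as noted above:

```python
# Sketch only: generic rename + tensor-parallel shard pass that a shared
# converter under paddlenlp/utils could build on. Names are illustrative.
import numpy as np

def convert_state_dict(src_state, name_map, tp_rank=0, tp_degree=1, col_parallel=()):
    """Rename tensors and keep this rank's shard for column-parallel weights."""
    dst_state = {}
    for src_name, tensor in src_state.items():
        dst_name = name_map.get(src_name, src_name)
        if dst_name in col_parallel and tp_degree > 1:
            # Column-parallel layers are sharded along the last axis per TP rank.
            tensor = np.split(tensor, tp_degree, axis=-1)[tp_rank]
        dst_state[dst_name] = tensor
    return dst_state

# Toy usage: rename one attention weight and shard it across 2 TP ranks.
src = {"decoder.layers.0.attn.qkv.weight": np.arange(32, dtype=np.float32).reshape(4, 8)}
name_map = {"decoder.layers.0.attn.qkv.weight": "gpt.decoder.layers.0.self_attn.qkv_proj.weight"}
shard0 = convert_state_dict(
    src, name_map, tp_rank=0, tp_degree=2,
    col_parallel={"gpt.decoder.layers.0.self_attn.qkv_proj.weight"},
)
print(shard0["gpt.decoder.layers.0.self_attn.qkv_proj.weight"].shape)  # (4, 4)
```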
```
--use_fused_rope 1 \
--fuse_attention_ffn 1 \
--bf16 \
--fp16_opt_level "O2" \
```
BF16 training does not enable master_grad (enabled with `--amp_master_grad true`)?
TE does not support this feature yet; we will add this argument once it is supported.
```
$recompute_flag \
$init_weight_flag \
$sp_flag \
--device "gpu"
```
Some of the framework-side overlap optimizations do not seem to be enabled. develop supports sharding gradient-communication overlap and P2P communication overlap (P2P communication overlap is not yet supported in 2.6). They can be enabled as follows:
--sharding_parallel_config "split_param enable_stage1_overlap" \
--pipeline_parallel_config "enable_sharding_comm_overlap enable_overlap_p2p_comm" \
Are optimizations such as the MP backward AllReduce overlap and gradient-accumulation fusion all baked into TE?
> develop supports sharding gradient-communication overlap and P2P communication overlap

OK, I will add these two optimization flags.

> Are optimizations such as the MP backward AllReduce overlap and gradient-accumulation fusion all baked into TE?

What does the MP backward AllReduce overlap refer to?
Gradient-accumulation fusion presumably depends on main_grad?
This Pull Request is stale because it has been open for 60 days with no activity.
PR types
New features
PR changes
Others
Description
Integrate NVIDIA Transformer Engine (TE) into PaddleNLP.