
[NVIDIA] TE Integration #7229

Open

wants to merge 6 commits into develop from jaywan/te_integration

Conversation

@Wong4j (Contributor) commented Oct 15, 2023

PR types

New features

PR changes

Others

Description

Integrate NVIDIA Transformer Engine (TE) into PaddleNLP.

@Wong4j Wong4j changed the title from "TE Integrtion" to "[NVIDIA] TE Integration" on Oct 15, 2023
codecov bot commented Oct 16, 2023

Codecov Report

Attention: Patch coverage is 38.46154%, with 80 lines in your changes missing coverage. Please review.

Project coverage is 58.28%. Comparing base (c1ccafa) to head (9c7ef55).
Report is 197 commits behind head on develop.

❗ The current head 9c7ef55 differs from the pull request's most recent head 5fcdcfb. Consider uploading reports for commit 5fcdcfb to get more accurate results.

Files                                         Patch %   Lines
paddlenlp/utils/transformer_engine_utils.py   33.33%    58 Missing ⚠️
paddlenlp/transformers/gpt/modeling.py        40.90%    13 Missing ⚠️
paddlenlp/transformers/gpt/modeling_pp.py     30.00%     7 Missing ⚠️
paddlenlp/trainer/trainer.py                  77.77%     2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #7229      +/-   ##
===========================================
+ Coverage    56.67%   58.28%   +1.60%     
===========================================
  Files          588      580       -8     
  Lines        89243    85655    -3588     
===========================================
- Hits         50580    49922     -658     
+ Misses       38663    35733    -2930     

☔ View full report in Codecov by Sentry.

paddlenlp/trainer/trainer.py (outdated review thread, resolved)
@Wong4j Wong4j force-pushed the jaywan/te_integration branch 2 times, most recently from 9d4ec48 to 17bfad2 Compare October 24, 2023 03:19
@@ -1823,7 +1839,8 @@ def training_step(self, model: nn.Layer, inputs: Dict[str, Union[paddle.Tensor,
         inputs = self._prepare_inputs(inputs)

         with self.autocast_smart_context_manager():
-            loss = self.compute_loss(model, inputs)
+            with TransformerEngineHelper.fp8_autocast(enabled=self.use_fp8):
Collaborator

How about moving TransformerEngineHelper.fp8_autocast into autocast_smart_context_manager?
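For reference, a minimal sketch of what that merge might look like, assuming a hypothetical trainer flag `use_fp8` and illustrative attribute names (`enable_autocast`, `amp_level`); `TransformerEngineHelper` refers to the wrapper added by this PR, and the AMP arguments are simplified:

```python
import contextlib

import paddle

# TransformerEngineHelper is the helper introduced in this PR
# (see paddlenlp/utils/transformer_engine_utils.py in the patch listing above).
from paddlenlp.utils.transformer_engine_utils import TransformerEngineHelper


@contextlib.contextmanager
def autocast_smart_context_manager(trainer):
    """Sketch only: nest AMP autocast and TE's FP8 autocast in one context manager."""
    with contextlib.ExitStack() as stack:
        # the existing mixed-precision context (arguments simplified for illustration)
        stack.enter_context(paddle.amp.auto_cast(enable=trainer.enable_autocast, level=trainer.amp_level))
        # layer the FP8 context on top only when the (hypothetical) use_fp8 flag is set
        if getattr(trainer, "use_fp8", False):
            stack.enter_context(TransformerEngineHelper.fp8_autocast(enabled=True))
        yield
```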

Contributor Author

Yes, I did consider that design. But while recently testing pipeline parallel + fp8 + gradient_accumulation, I found that fp8_autocast cannot wrap compute_loss from the outside; it raises an error, see PR#93. Pipeline parallelism contains a for loop of this kind, and fp8_autocast has to sit inside that for loop rather than outside it. So I put it here instead: paddlenlp/transformers/gpt/modeling.py#L656
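To make the constraint above concrete: the pipeline-parallel schedule runs one forward per micro-batch, so the FP8 context has to be entered inside that per-micro-batch loop rather than around the whole forward_backward_pipeline call. A schematic sketch (the loop and layer calls are simplified placeholders, not code from Paddle's pipeline engine; assume TransformerEngineHelper is imported as above):

```python
def forward_backward_pipeline_sketch(layers, micro_batches, use_fp8):
    """Schematic only: shows where fp8_autocast has to sit in a micro-batch loop."""
    losses = []
    for micro_batch in micro_batches:
        # fp8_autocast wraps each micro-batch forward individually ...
        with TransformerEngineHelper.fp8_autocast(enabled=use_fp8):
            hidden = micro_batch
            for layer in layers:
                hidden = layer(hidden)
        losses.append(hidden.mean())  # placeholder loss
    return losses


# ... whereas wrapping the entire pipeline step in trainer.py, i.e.
#     with TransformerEngineHelper.fp8_autocast(enabled=True):
#         model.forward_backward_pipeline(inputs, scaler)
# is the placement that triggered the error mentioned above.
```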

paddlenlp/trainer/trainer.py (outdated review thread, resolved)
@@ -1886,7 +1903,8 @@ def training_pipeline_step(self, model: nn.Layer, inputs: Dict[str, Union[paddle
         model.lr_scheduler = None

         with self.autocast_smart_context_manager():
-            loss = model.forward_backward_pipeline(inputs, self.scaler if self.do_grad_scaling else None)
+            with TransformerEngineHelper.fp8_autocast(enabled=self.use_fp8, fp8_group=self.dp_group):
Collaborator

The same as above.

from .te_helper import TransformerEngineHelper


class GPTDecoderLayerWithNVTEBackend(nn.Layer):
Collaborator

The GPT model code should live in transformers/gpt.

Contributor Author

Moved it into paddlenlp/transformers/gpt/modeling.py.

paddlenlp/te_utils/te_helper.py (outdated review thread, resolved)
@ZHUI (Collaborator) commented Nov 9, 2023

@DrownFish19 let's review this together.

This Pull Request is stale because it has been open for 60 days with no activity.

@github-actions github-actions bot added the stale label Mar 18, 2024
@Xreki (Contributor) left a comment

Could this PR be split up? Merge the code changes first, then decide how to manage the docs and run scripts afterwards.

@@ -0,0 +1,533 @@
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.

Contributor

There is a lot of duplicated code here. I suggest adding the TE support directly to the existing run_pretrain.py; Megatron-LM's approach is a useful reference: https://github.com/NVIDIA/Megatron-LM/blob/2b92e61dac1c5ff84629239198659b447da118e1/megatron/training/arguments.py#L579
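As a rough sketch of that direction (not code from this PR), the TE switch could be exposed through the dataclass-style arguments that run_pretrain.py already uses, similar to Megatron-LM's --transformer-impl/--fp8 arguments; the field names below are hypothetical:

```python
from dataclasses import dataclass, field


@dataclass
class TEArguments:
    """Hypothetical argument group; actual flag names would be settled in the follow-up PR."""

    transformer_backend: str = field(
        default="paddle",
        metadata={"help": "Transformer layer backend: 'paddle' or 'transformer_engine'."},
    )
    use_fp8: bool = field(
        default=False,
        metadata={"help": "Enable FP8 training via Transformer Engine's fp8_autocast."},
    )
```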

Contributor Author

This file is no longer needed; both GPT and LLaMA run through the existing run_pretrain.py. I split that out into PR #8228.

@@ -0,0 +1,203 @@
# Copyright (c) 2023 PaddlePaddle Authors. All Rights Reserved.

Contributor

For checkpoint conversion, could the parts common to GPT and LLaMA be factored out and implemented under the paddlenlp/utils directory?

Contributor Author

PaddleNLP has since changed its checkpoint file naming rules and model parameter naming rules (a version I wrote earlier no longer works), and the parameter layout depends on many settings, e.g. TP parallelism, whether weights are fused, and for LLaMA the additional splitting of GQA parameters. So for now I only implemented working versions separately for GPT and LLaMA.
A more general version should be possible, but those details need to be sorted out first; it can be done later.
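To make one of those cases concrete, here is a minimal hypothetical sketch (not code from this PR) of un-fusing a fused QKV weight and slicing it for tensor parallelism; shapes and names are illustrative, and a real converter would also have to handle GQA head grouping, biases, and the current parameter naming:

```python
import numpy as np


def split_fused_qkv_for_tp(fused_qkv: np.ndarray, hidden_size: int, tp_degree: int):
    """Illustrative only: split a fused [hidden, 3*hidden] QKV weight into
    per-rank Q/K/V shards along the output dimension (column parallel)."""
    q, k, v = np.split(fused_qkv, 3, axis=1)  # un-fuse QKV
    cols_per_rank = hidden_size // tp_degree
    shards = []
    for rank in range(tp_degree):
        cols = slice(rank * cols_per_rank, (rank + 1) * cols_per_rank)
        shards.append({"q": q[:, cols], "k": k[:, cols], "v": v[:, cols]})
    return shards


# Example: hidden_size=8, tp_degree=2 -> each rank holds an [8, 4] slice of Q, K and V.
shards = split_fused_qkv_for_tp(np.zeros((8, 24)), hidden_size=8, tp_degree=2)
assert shards[0]["q"].shape == (8, 4)
```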

--use_fused_rope 1 \
--fuse_attention_ffn 1 \
--bf16 \
--fp16_opt_level "O2" \
Contributor

BF16 training does not enable master_grad (it is enabled with --amp_master_grad true)?

Contributor Author

TE does not support this feature yet; I will add the flag once it is supported.

$recompute_flag \
$init_weight_flag \
$sp_flag \
--device "gpu"
Contributor

Some framework-level overlap optimizations do not seem to be enabled. develop supports overlapping sharding gradient communication and P2P communication (P2P communication overlap is not yet supported in the 2.6 release). They are enabled as follows:

--sharding_parallel_config "split_param enable_stage1_overlap" \
--pipeline_parallel_config "enable_sharding_comm_overlap enable_overlap_p2p_comm" \

Are optimizations such as the MP backward AllReduce overlap and gradient-accumulation fusion hard-wired inside TE?

Contributor Author

@Wong4j Wong4j Apr 3, 2024

> develop supports overlapping sharding gradient communication and P2P communication

OK, I will add these two optimization flags.

> Are optimizations such as the MP backward AllReduce overlap and gradient-accumulation fusion hard-wired inside TE?

What exactly does the MP backward AllReduce overlap refer to?
The gradient-accumulation fusion optimization presumably depends on main_grad?

@Wong4j Wong4j mentioned this pull request Apr 3, 2024
This Pull Request is stale because it has been open for 60 days with no activity.

@github-actions github-actions bot added the stale label Jul 15, 2024