Closed
Labels: question (Further information is requested)
Description
Hi!
I noticed in the TransformerEngine source code there are two environment variables related to LayerNorm/RMSNorm:
NVTE_FWD_LAYERNORM_SM_MARGIN
NVTE_BWD_LAYERNORM_SM_MARGIN
In NVIDIA’s submission for the MLPerf Training 4.1 results, both variables were set to 8. The comment in the config indicates that setting them can improve p2p overlap performance on H100 GPUs:
# source: https://github.com/mlcommons/training_results_v4.1/blob/8821c7037ffd06e3775398fd39361a4c591d2235/NVIDIA/benchmarks/gpt3/implementations/eos-dfw_n1452_ngc24.04_nemo/config_common.sh#L9
# This is to improve p2p overlap on H100, and it shouldn't affect A100:
export NVTE_FWD_LAYERNORM_SM_MARGIN=8
export NVTE_BWD_LAYERNORM_SM_MARGIN=8
Could you please clarify which type of P2P these variables affect (p2p in pipeline parallelism, p2p in context parallelism, or p2p in tp-overlap)? Additionally, would you mind providing some tuning recommendations for these parameters, specifically for H800 and H20 GPUs?
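For context, this is roughly how I would try different values in a launch script while waiting for guidance. It is only a sketch: `SM_MARGIN` is a helper name I made up (not a TransformerEngine setting), 8 is simply the value from the MLPerf config, and I am assuming (not confirming) that the margin is the number of SMs the norm kernels leave free for other work.

```shell
# SM_MARGIN is a hypothetical helper variable for sweeping one value;
# override it from the command line, e.g. SM_MARGIN=16 ./launch.sh
SM_MARGIN="${SM_MARGIN:-8}"   # default taken from the MLPerf Training 4.1 config

# Apply the same margin to the forward and backward LayerNorm/RMSNorm kernels.
# These must be exported before the training process starts.
export NVTE_FWD_LAYERNORM_SM_MARGIN="${SM_MARGIN}"
export NVTE_BWD_LAYERNORM_SM_MARGIN="${SM_MARGIN}"

# Print the effective settings so they show up in the job log.
echo "NVTE_FWD_LAYERNORM_SM_MARGIN=${NVTE_FWD_LAYERNORM_SM_MARGIN}"
echo "NVTE_BWD_LAYERNORM_SM_MARGIN=${NVTE_BWD_LAYERNORM_SM_MARGIN}"
```

Using a single value for both directions just mirrors the MLPerf config; I do not know whether forward and backward should be tuned separately.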
Thanks!