These scripts run recommended configurations for GPT, LLAMA2, and Nemotron pretraining and fine-tuning at various model sizes on A100 and H100 GPUs. For example, for GPT3 pretraining the following folders provide sample scripts:
- `a100`: scripts to run GPT pretraining on NVIDIA A100, in bf16 data type
- `h100`: scripts to run GPT pretraining on NVIDIA H100, in fp8 data type
To run these scripts, you must have access to the NeMo container (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo).
- Please sign in at NGC (user = ea-bignlp/ga-participants) to access the catalog.
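For reference, a minimal sketch of the standard NGC container pull flow; the exact image path and tag below are assumptions, so use the path and tag shown on the catalog page for your account:

```bash
# Log in to NGC's registry and pull the container.
# Username is literally "$oauthtoken"; the password is your NGC API key.
docker login nvcr.io
# Image path/tag below are placeholders -- copy the exact ones from the NGC catalog page.
docker pull nvcr.io/nvidia/nemo:<tag>
```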
Update the following bash variables in the example run scripts:
- `NEMO_MEGATRON_LAUNCHER_DIR`: the directory where this repository is located
- `DATA_DIR`: the directory of the dataset used for pretraining; by default this is `NEMO_MEGATRON_LAUNCHER_DIR/data`
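For example, a minimal sketch of how these variables might be set near the top of one of the example run scripts (paths are placeholders):

```bash
# Placeholder paths -- adjust to your environment.
NEMO_MEGATRON_LAUNCHER_DIR="/path/to/NeMo-Megatron-Launcher"   # where this repository is cloned
DATA_DIR="${NEMO_MEGATRON_LAUNCHER_DIR}/data"                  # pretraining dataset location (the default)
```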
- Enter your cluster environment settings at `config.yaml`.
- For `bcm`-type clusters, update the job name, partition, and account at `bcm.yaml`.
For testing performance with synthetic data on an interactive node, you need to add the following options to your bash script:

```
cluster_type=interactive \
++training.cluster_type=BCP \
training.model.data.data_impl="mock" \
training.model.data.data_prefix=[]
```
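As an illustration only, a run command with these overrides appended might look like the sketch below; the `launcher_scripts/main.py` entry point and the `training=gpt3/5b` / `stages=[training]` selections are assumptions here, so keep whatever command your run script already builds and just append the four overrides:

```bash
# Hypothetical launcher invocation with the synthetic-data overrides appended.
python3 ${NEMO_MEGATRON_LAUNCHER_DIR}/launcher_scripts/main.py \
    training=gpt3/5b \
    stages=[training] \
    cluster_type=interactive \
    ++training.cluster_type=BCP \
    training.model.data.data_impl="mock" \
    training.model.data.data_prefix=[]
```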
For further details see General Configuration
For a quick read on performance, check the `step_time_per_sec` value printed to the console during a run.
For more detail and plots, use TensorBoard or Weights and Biases with the results stored at `NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>`, which has the following structure:
- `NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>/<experiment_name>.yaml`: the config of the pretrained model
- `NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>/<jobname>_<experiment_name>.sh`: the autogenerated .sh file that was run
- `NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>/results/`: directory containing the per-rank logs and TensorBoard data
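For example, TensorBoard can be pointed directly at that directory (the experiment name below is a placeholder):

```bash
# Serve the TensorBoard event files written under the experiment's results directory.
tensorboard --logdir "${NEMO_MEGATRON_LAUNCHER_DIR}/results/<experiment_name>/results/" --port 6006
```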
For further details see Interpreting the Results
- The results in the table below show pre-training performance of various models on DGX H100, with FP8.
- Please refer to the MLCommons Training results for performance of GPT3-175B pre-training on large-scale H100 systems.
- To calculate Model TFLOPs, please see Appendix A in the paper.
| Model | #-GPUs | GBS | MBS | Sequence Length | TP | PP | CP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to train in days (10T tokens, 1K GPUs) |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT3-175B | 512 | 2048 | 1 | 2048 | 4 | 8 | 1 | 741 | 797* | 153 |
| GPT3-5B | 64 | 2048 | 4 | 2048 | 1 | 1 | 1 | 23117 | 755 | 5 |
| GPT3-20B | 64 | 256 | 2 | 2048 | 2 | 1 | 1 | 5611 | 719 | 20 |
| LLAMA2-7B | 8 | 128 | 1 | 4096 | 1 | 1 | 1 | 16154 | 744 | 7 |
| LLAMA2-13B | 16 | 128 | 1 | 4096 | 1 | 4 | 1 | 8344 | 727 | 14 |
| LLAMA2-70B | 64 | 128 | 1 | 4096 | 4 | 4 | 1 | 1659 | 737 | 68 |
| Nemotron-8B | 64 | 256 | 4 | 4096 | 2 | 1 | 1 | 11753 | 604 | 10 |
| Nemotron-22B | 64 | 256 | 2 | 4096 | 2 | 4 | 1 | 4113 | 536 | 27 |
| LLAMA3-8B | 8 | 128 | 1 | 8192 | 1 | 1 | 2 | 11879 | 688 | 10 |
| LLAMA3-70B | 64 | 128 | 1 | 8192 | 4 | 4 | 2 | 1444 | 695 | 78 |
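The last column can be sanity-checked from the throughput column: with 1K (1024) GPUs, estimated days ≈ 10×10^12 / (tokens/sec/GPU × 1024 × 86400). A small sketch of that arithmetic using the GPT3-175B row:

```bash
# Estimated training days = total tokens / (tokens/sec/GPU * number of GPUs * seconds per day)
# GPT3-175B row: 741 tokens/sec/GPU, 1024 GPUs, 10T tokens.
echo "scale=1; 10 * 10^12 / (741 * 1024 * 86400)" | bc
# -> 152.5, which rounds to the 153 days shown in the table.
```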
- The following table provides performance benchmarking of LLAMA2 models with SFT (supervised fine-tuning) and LoRA (Low-Rank Adaptation) on DGX H100, with FP8.
- For fine-tuning, we use the SQuAD-v1.1 dataset, and the inputs are packed to 4096 tokens.
- To calculate Model TFLOPs, please see Appendix A in the paper.
| Model | Mode | #-GPUs | GBS | MBS | Sequence Length | TP | PP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to complete in mins (10M tokens) |
|---|---|---|---|---|---|---|---|---|---|---|
| LLAMA2-7B | SFT | 8 | 32 | 1 | 4096 | 1 | 1 | 16891 | 673 | 1.2 |
| LLAMA2-13B | SFT | 8 | 32 | 1 | 4096 | 1 | 4 | 9384 | 726 | 2.2 |
| LLAMA2-70B | SFT | 16 | 32 | 1 | 4096 | 4 | 4 | 1739 | 717 | 6.0 |
| LLAMA2-7B | LoRA | 8 | 32 | 1 | 4096 | 1 | 1 | 23711 | 633 | 0.9 |
| LLAMA2-13B | LoRA | 8 | 32 | 1 | 4096 | 1 | 1 | 14499 | 751 | 1.4 |
| LLAMA2-70B | LoRA | 8 | 32 | 1 | 4096 | 2 | 4 | 2470 | 681 | 8.4 |