These scripts run recommended configurations for GPT, LLAMA2, and Nemotron pretraining and fine-tuning at various model sizes on A100 and H100 GPUs. For example, for GPT3 pretraining the following folders provide sample scripts:
- `a100`: scripts to run GPT pretraining on NVIDIA A100, in bf16 data type
- `h100`: scripts to run GPT pretraining on NVIDIA H100, in fp8 data type
To run these scripts, you must have access to the NeMo container (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo).
- Please sign in at NGC (user = ea-bignlp/ga-participants) to access the catalog.
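For reference, a minimal sketch of the standard NGC container pull flow; the exact image path and tag below are assumptions, so use the path and tag shown on the catalog page for your account:

```bash
# Log in to NGC's registry and pull the container.
# Username is literally "$oauthtoken"; the password is your NGC API key.
docker login nvcr.io
# Image path/tag below are placeholders -- copy the exact ones from the NGC catalog page.
docker pull nvcr.io/nvidia/nemo:<tag>
```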
Update the following bash variables in the example run scripts:
- `NEMO_MEGATRON_LAUNCHER_DIR`: the directory where this repository is located
- `DATA_DIR`: the directory of the dataset used for pretraining; by default this is `NEMO_MEGATRON_LAUNCHER_DIR/data`
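For example, a minimal sketch of how these variables might be set near the top of one of the example run scripts (paths are placeholders):

```bash
# Placeholder paths -- adjust to your environment.
NEMO_MEGATRON_LAUNCHER_DIR="/path/to/NeMo-Megatron-Launcher"   # where this repository is cloned
DATA_DIR="${NEMO_MEGATRON_LAUNCHER_DIR}/data"                  # pretraining dataset location (the default)
```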
- Enter your cluster environment settings at `config.yaml`.
- For `bcm`-type clusters, update the job name, partition, and account at `bcm.yaml`.
For testing performance with synthetic data on an interactive node, you need to add the following options to your bash script:

```
cluster_type=interactive \
++training.cluster_type=BCP \
training.model.data.data_impl="mock" \
training.model.data.data_prefix=[]
```
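As an illustration only, a run command with these overrides appended might look like the sketch below; the `launcher_scripts/main.py` entry point and the `training=gpt3/5b` / `stages=[training]` selections are assumptions here, so keep whatever command your run script already builds and just append the four overrides:

```bash
# Hypothetical launcher invocation with the synthetic-data overrides appended.
python3 ${NEMO_MEGATRON_LAUNCHER_DIR}/launcher_scripts/main.py \
    training=gpt3/5b \
    stages=[training] \
    cluster_type=interactive \
    ++training.cluster_type=BCP \
    training.model.data.data_impl="mock" \
    training.model.data.data_prefix=[]
```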
For further details see General Configuration
For a quick read on performance, check the `step_time_per_sec` value printed to the console during a run.
For more detail and plots, use TensorBoard or Weights and Biases with the results stored at `NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>`, which has the following structure:
- `NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>/<experiment_name>.yaml`: the config of the pretrained model
- `NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>/<jobname>_<experiment_name>.sh`: the autogenerated .sh file that was run
- `NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>/results/`: directory containing the per-rank logs and TensorBoard data
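For example, TensorBoard can be pointed directly at that directory (the experiment name below is a placeholder):

```bash
# Serve the TensorBoard event files written under the experiment's results directory.
tensorboard --logdir "${NEMO_MEGATRON_LAUNCHER_DIR}/results/<experiment_name>/results/" --port 6006
```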
For further details see Interpreting the Results
- The results in the table below show pre-training performance of various models on DGX H100, with FP8.
- Please refer to the MLCommons Training results for performance of GPT3-175B pre-training on large-scale H100 systems.
- To calculate Model TFLOPs, please see Appendix A in the paper.
| Model | #-GPUs | GBS | MBS | Sequence Length | TP | PP | CP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to train in days (10T tokens, 1K GPUs) |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT3-175B | 512 | 2048 | 1 | 2048 | 4 | 8 | 1 | 741 | 797* | 153 |
| GPT3-5B | 64 | 2048 | 4 | 2048 | 1 | 1 | 1 | 23117 | 755 | 5 |
| GPT3-20B | 64 | 256 | 2 | 2048 | 2 | 1 | 1 | 5611 | 719 | 20 |
| LLAMA2-7B | 8 | 128 | 1 | 4096 | 1 | 1 | 1 | 16154 | 744 | 7 |
| LLAMA2-13B | 16 | 128 | 1 | 4096 | 1 | 4 | 1 | 8344 | 727 | 14 |
| LLAMA2-70B | 64 | 128 | 1 | 4096 | 4 | 4 | 1 | 1659 | 737 | 68 |
| Nemotron-8B | 64 | 256 | 4 | 4096 | 2 | 1 | 1 | 11753 | 604 | 10 |
| Nemotron-22B | 64 | 256 | 2 | 4096 | 2 | 4 | 1 | 4113 | 536 | 27 |
| LLAMA3-8B | 8 | 128 | 1 | 8192 | 1 | 1 | 2 | 11879 | 688 | 10 |
| LLAMA3-70B | 64 | 128 | 1 | 8192 | 4 | 4 | 2 | 1444 | 695 | 78 |
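The last column can be sanity-checked from the throughput column: with 1K (1024) GPUs, estimated days ≈ 10×10^12 / (tokens/sec/GPU × 1024 × 86400). A small sketch of that arithmetic using the GPT3-175B row:

```bash
# Estimated training days = total tokens / (tokens/sec/GPU * number of GPUs * seconds per day)
# GPT3-175B row: 741 tokens/sec/GPU, 1024 GPUs, 10T tokens.
echo "scale=1; 10 * 10^12 / (741 * 1024 * 86400)" | bc
# -> 152.5, which rounds to the 153 days shown in the table.
```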
- The following table provides performance benchmarking of LLAMA2 models with SFT (supervised fine-tuning) and LoRA (Low-Rank Adaptation) on DGX H100, with FP8.
- For fine-tuning, we use the SQuAD-v1.1 dataset, and the inputs are packed to 4096 tokens.
- To calculate Model TFLOPs, please see Appendix A in the paper.
| Model | Mode | #-GPUs | GBS | MBS | Sequence Length | TP | PP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to complete in mins (10M tokens) |
|---|---|---|---|---|---|---|---|---|---|---|
| LLAMA2-7B | SFT | 8 | 32 | 1 | 4096 | 1 | 1 | 16891 | 673 | 1.2 |
| LLAMA2-13B | SFT | 8 | 32 | 1 | 4096 | 1 | 4 | 9384 | 726 | 2.2 |
| LLAMA2-70B | SFT | 16 | 32 | 1 | 4096 | 4 | 4 | 1739 | 717 | 6.0 |
| LLAMA2-7B | LoRA | 8 | 32 | 1 | 4096 | 1 | 1 | 23711 | 633 | 0.9 |
| LLAMA2-13B | LoRA | 8 | 32 | 1 | 4096 | 1 | 1 | 14499 | 751 | 1.4 |
| LLAMA2-70B | LoRA | 8 | 32 | 1 | 4096 | 2 | 4 | 2470 | 681 | 8.4 |