
Example scripts for pretraining and fine-tuning

These scripts run recommended configurations for GPT, LLAMA2, and Nemotron pretraining and fine-tuning across various model sizes on NVIDIA A100 and H100 GPUs. For GPT3 pretraining, for example, the following folders provide sample scripts:

  • a100 : Scripts to run GPT pretraining on NVIDIA A100, in the bf16 data type

  • h100 : Scripts to run GPT pretraining on NVIDIA H100, in the fp8 data type
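
Once the setup below is complete, running one of these configurations typically amounts to executing a script from the matching folder at the repository root. The file name here is a placeholder that only illustrates the pattern, not an actual path in this repository:

    cd <path_to_this_repository>
    # Hypothetical script name; pick the one matching your model size, GPU, and precision.
    bash h100/<gpt3_pretrain_script>.sh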

Setup

  1. To run these scripts, you must have access to the NeMo container (https://catalog.ngc.nvidia.com/orgs/nvidia/containers/nemo).

    • Please sign in at NGC (user = ea-bignlp/ga-participants) to access the catalog.
  2. Update the following bash variables in the example run scripts:

    • NEMO_MEGATRON_LAUNCHER_DIR : the directory where this repository is located

    • DATA_DIR : the directory of the dataset used for pretraining; by default this is NEMO_MEGATRON_LAUNCHER_DIR/data

  3. Enter your cluster environment settings in config.yaml.

    For bcm-type clusters, update the job name, partition, and account in bcm.yaml.

  4. For testing performance with synthetic data on an interactive node, add the following options to your bash script (the full setup is sketched after this list):

            cluster_type=interactive \
            ++training.cluster_type=BCP \
            training.model.data.data_impl="mock" \
            training.model.data.data_prefix=[]
    

For further details see General Configuration
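
As a quick reference, the sketch below strings the setup steps together for a hypothetical run on a bcm (Slurm) cluster. The container tag, paths, and script name are placeholders, and only the Hydra overrides repeated from step 4 are taken verbatim from this README:

    # Sketch only: <tag> and paths are placeholders, not values from this repository.

    # Step 1: pull the NeMo container from NGC.
    docker login nvcr.io                    # username: $oauthtoken, password: your NGC API key
    docker pull nvcr.io/nvidia/nemo:<tag>

    # Step 2: variables edited at the top of the chosen example run script.
    NEMO_MEGATRON_LAUNCHER_DIR=/path/to/this/repository
    DATA_DIR=${NEMO_MEGATRON_LAUNCHER_DIR}/data

    # Step 3: cluster settings (job name, partition, account) go in config.yaml and,
    # for bcm clusters, bcm.yaml; they are not set in the run script itself.

    # Step 4: for an interactive synthetic-data test, add these options to the launcher
    # invocation inside the script:
    #   cluster_type=interactive \
    #   ++training.cluster_type=BCP \
    #   training.model.data.data_impl="mock" \
    #   training.model.data.data_prefix=[]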

Results

For performance, the "step_time_per_sec" value printed to the console provides a quick way to gauge the performance of a workload.
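
For example, assuming the job's console output has been redirected to a log file (the file name below is hypothetical), the reported step timings can be filtered out with grep:

    # Hypothetical log file; point this at wherever your job's console output is written.
    grep "step_time_per_sec" <path_to_console_log>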

For more detail and visualization, you can use TensorBoard or Weights & Biases (a TensorBoard invocation is sketched below). To do so, use the results stored in NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>, which has the following structure:

  • NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>/<experiment_name>.yaml : The config of the pretrained model
  • NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>/<jobname>_<experiment_name>.sh : The autogenerated .sh file that was run
  • NEMO_MEGATRON_LAUNCHER_DIR/results/<experiment_name>/results/ : Directory containing per-rank logs and TensorBoard data.

For further details see Interpreting the Results
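
For example, TensorBoard can be pointed directly at an experiment's results directory (Weights & Biases is instead enabled through the training config; this sketch only covers TensorBoard):

    # Serve the event files written under the experiment's results/ directory.
    tensorboard --logdir ${NEMO_MEGATRON_LAUNCHER_DIR}/results/<experiment_name>/results --port 6006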

Benchmark performance numbers (pretraining)

  • The results in the table below show the pre-training performance of various models on DGX H100 with FP8 precision.
  • Please refer to MLCommons Training results for the performance of GPT3-175B pre-training on large-scale H100 systems.
  • To calculate Model TFLOPs, please see Appendix A in the paper.
| Model | #-GPUs | GBS | MBS | Sequence Length | TP | PP | CP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to train in days (10T tokens, 1K GPUs) |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT3-175B | 512 | 2048 | 1 | 2048 | 4 | 8 | 1 | 741 | 797* | 153 |
| GPT3-5B | 64 | 2048 | 4 | 2048 | 1 | 1 | 1 | 23117 | 755 | 5 |
| GPT3-20B | 64 | 256 | 2 | 2048 | 2 | 1 | 1 | 5611 | 719 | 20 |
| LLAMA2-7B | 8 | 128 | 1 | 4096 | 1 | 1 | 1 | 16154 | 744 | 7 |
| LLAMA2-13B | 16 | 128 | 1 | 4096 | 1 | 4 | 1 | 8344 | 727 | 14 |
| LLAMA2-70B | 64 | 128 | 1 | 4096 | 4 | 4 | 1 | 1659 | 737 | 68 |
| Nemotron-8B | 64 | 256 | 4 | 4096 | 2 | 1 | 1 | 11753 | 604 | 10 |
| Nemotron-22B | 64 | 256 | 2 | 4096 | 2 | 4 | 1 | 4113 | 536 | 27 |
| LLAMA3-8B | 8 | 128 | 1 | 8192 | 1 | 1 | 2 | 11879 | 688 | 10 |
| LLAMA3-70B | 64 | 128 | 1 | 8192 | 4 | 4 | 2 | 1444 | 695 | 78 |
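
The "Est. time to train" column appears to follow directly from the throughput column (10T tokens spread across 1K GPUs at the listed Tokens/sec/GPU). This derivation is an assumption rather than something stated in this README, but it reproduces the table closely; for example, for GPT3-5B:

    # 10T tokens / (23117 tokens/sec/GPU * 1000 GPUs * 86400 sec/day) -> about 5 days.
    awk 'BEGIN { printf "Est. days: %.1f\n", 10000000000000 / (23117 * 1000 * 86400) }'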

Benchmark performance numbers (finetuning)

  • The following table provides performance benchmarking of LLAMA2 models with SFT (supervised fine-tuning) and LoRA (Low-Rank Adaptation) on DGX H100 with FP8 precision.
  • For fine-tuning, we use the SQuAD-v1.1 dataset, and the inputs are packed to 4096 tokens.
  • To calculate Model TFLOPs, please see Appendix A in the paper.
| Model | Mode | #-GPUs | GBS | MBS | Sequence Length | TP | PP | Tokens / sec / GPU | Model TFLOP / sec / GPU | Est. time to complete in mins (10M tokens) |
|---|---|---|---|---|---|---|---|---|---|---|
| LLAMA2-7B | SFT | 8 | 32 | 1 | 4096 | 1 | 1 | 16891 | 673 | 1.2 |
| LLAMA2-13B | SFT | 8 | 32 | 1 | 4096 | 1 | 4 | 9384 | 726 | 2.2 |
| LLAMA2-70B | SFT | 16 | 32 | 1 | 4096 | 4 | 4 | 1739 | 717 | 6.0 |
| LLAMA2-7B | LoRA | 8 | 32 | 1 | 4096 | 1 | 1 | 23711 | 633 | 0.9 |
| LLAMA2-13B | LoRA | 8 | 32 | 1 | 4096 | 1 | 1 | 14499 | 751 | 1.4 |
| LLAMA2-70B | LoRA | 8 | 32 | 1 | 4096 | 2 | 4 | 2470 | 681 | 8.4 |
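
The "Est. time to complete" column appears to follow the same relationship, now with 10M tokens and the listed number of GPUs (again an assumption, not stated in this README). For the LLAMA2-70B SFT row:

    # 10M tokens / (1739 tokens/sec/GPU * 16 GPUs) seconds, converted to minutes -> about 6.0.
    awk 'BEGIN { printf "Est. minutes: %.1f\n", 10000000 / (1739 * 16) / 60 }'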