Evo2 Megatron Bridge Recipe Prototype by jstjohn · Pull Request #1357 · NVIDIA-BioNeMo/bionemo-framework

jstjohn · 2025-12-01T19:32:50Z

Description

Usage

Go to the recipe:

cd bionemo-recipes/recipes/evo2_megatron

Build the image and start a container:

docker build -t evo2_megatron .
docker run --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all -it evo2_megatron

NOTE: on some dual-A6000 (2x GPU) workstations, torchrun jobs freeze when both GPUs are used. If this happens, disable NCCL peer-to-peer transfers before launching:

export NCCL_P2P_DISABLE=1

Then execute the mock data example with:

torchrun --nproc-per-node 1 --no-python \
  train_evo2 --hf-tokenizer-model-path tokenizers/nucleotide_fast_tokenizer_256 \
  --model-size striped_hyena_1b_nv_parallel --max-steps 22 --eval-interval 10 \
  --eval-iters 3 --mock-data \
  --micro-batch-size 32 --global-batch-size 256 --seq-length 1024 \
  --tensor-model-parallel 1 \
  --use-precision-aware-optimizer --dataset-seed 33 \
  --seed 41 --ckpt-async-save --spike-no-more-embedding-init \
  --no-weight-decay-embeddings --cross-entropy-loss-fusion \
  --align-param-gather --overlap-param-gather --grad-reduce-in-fp32 \
  --decay-steps 100 --warmup-steps 10 \
  --mixed-precision-recipe bf16-mixed \
  --no-fp32-residual-connection --activation-checkpoint-recompute-num-layers 1 \
  --attention-dropout 0.001 --hidden-dropout 0.001 \
  --eod-pad-in-loss-mask --enable-preemption \
  --log-interval 5 --debug-ddp-parity-freq 10 \
  --wandb-project evo2-recipes-verification-tmp \
  --wandb-run-name tmp_workstation_run_mock_data \
  --result-dir tmp --no-renormalize-loss
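The batch-size flags in this command imply gradient accumulation. The following is a minimal sketch of that arithmetic, assuming the usual Megatron convention that the global batch size equals micro-batch size × data-parallel size × number of micro-batches. The function name is hypothetical and is not part of the recipe.

```python
def num_microbatches(global_batch_size: int, micro_batch_size: int,
                     data_parallel_size: int) -> int:
    """Micro-batches each data-parallel rank accumulates per optimizer step.

    Hypothetical helper for illustration; assumes
    global_batch = micro_batch * data_parallel_size * num_microbatches.
    """
    samples_per_pass = micro_batch_size * data_parallel_size
    if global_batch_size % samples_per_pass != 0:
        raise ValueError(
            "global batch size must be divisible by "
            "micro-batch size * data-parallel size"
        )
    return global_batch_size // samples_per_pass


# Values from the command above: a single process with tensor parallel
# size 1, so the data-parallel size is 1.
accum_steps = num_microbatches(
    global_batch_size=256, micro_batch_size=32, data_parallel_size=1
)
tokens_per_step = 256 * 1024  # global batch size * --seq-length

print(accum_steps)      # prints 8
print(tokens_per_step)  # prints 262144
```

Under these assumptions, each optimizer step on a single GPU accumulates gradients over 8 forward/backward passes of 32 sequences each.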

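The torchrun command above should produce output similar to the following. Note that this sample log reports 10 iterations and a global batch size of 128, so it appears to come from a run with slightly different settings: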

----------------------------------
Setting rerun_state_machine.current_iteration to 0...
Starting training loop at iteration 0
/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py:4879: UserWarning: barrier(): using the device under current context. You can specify `device_id` in `init_process_group` to mute this warning.
  warnings.warn(  # warn only once
Step Time : 7.04s GPU utilization: 24.7MODEL_TFLOP/s/GPU
Number of parameters in transformer layers in billions:  0.86
 [2025-12-05 00:25:41] iteration        1/      10 | consumed samples:          128 | elapsed time per iteration (ms): 7037.3 | learning rate: 3.000000E-05 | global batch size:   128 | lm loss: 6.700717E+00 | loss scale: 1.0 | grad norm: 117.718 | number of skipped iterations:   0 | number of nan iterations:   0 |
Number of parameters in embedding layers in billions: 0.00
Total number of parameters in billions: 0.86
Number of parameters in most loaded shard in billions: 0.8614
Theoretical memory footprints: weight and optimizer=9857.42 MB
[Rank 0] (after 1 iterations) memory (GB) | mem-allocated-gigabytes: 13.367 | mem-active-gigabytes: 13.367 | mem-inactive-gigabytes: 0.42641 | mem-reserved-gigabytes: 14.259 | mem-max-allocated-gigabytes: 13.367 | mem-max-active-gigabytes: 13.367 | mem-max-inactive-gigabytes: 0.43855 | mem-max-reserved-gigabytes: 14.259 | mem-alloc-retires: 0 | mem-allocated-count: 284
Step Time : 5.96s GPU utilization: 29.2MODEL_TFLOP/s/GPU
 [2025-12-05 00:25:47] iteration        2/      10 | consumed samples:          256 | elapsed time per iteration (ms): 5959.8 | learning rate: 6.000000E-05 | global batch size:   128 | lm loss: 6.705625E+00 | loss scale: 1.0 | grad norm: 117.618 | number of skipped iterations:   0 | number of nan iterations:   0 |
Step Time : 5.95s GPU utilization: 29.2MODEL_TFLOP/s/GPU
 [2025-12-05 00:25:53] iteration        3/      10 | consumed samples:          384 | elapsed time per iteration (ms): 5954.5 | learning rate: 9.000000E-05 | global batch size:   128 | lm loss: 8.673918E-02 | loss scale: 1.0 | grad norm: 5.777 | number of skipped iterations:   0 | number of nan iterations:   0 |
Step Time : 5.97s GPU utilization: 29.1MODEL_TFLOP/s/GPU
 [2025-12-05 00:25:59] iteration        4/      10 | consumed samples:          512 | elapsed time per iteration (ms): 5974.1 | learning rate: 1.200000E-04 | global batch size:   128 | lm loss: 7.124253E-03 | loss scale: 1.0 | grad norm: 0.827 | number of skipped iterations:   0 | number of nan iterations:   0 |
Step Time : 6.00s GPU utilization: 29.0MODEL_TFLOP/s/GPU
 [2025-12-05 00:26:05] iteration        5/      10 | consumed samples:          640 | elapsed time per iteration (ms): 5996.1 | learning rate: 1.500000E-04 | global batch size:   128 | lm loss: 1.208314E-03 | loss scale: 1.0 | grad norm: 0.130 | number of skipped iterations:   0 | number of nan iterations:   0 |
Step Time : 6.02s GPU utilization: 28.9MODEL_TFLOP/s/GPU
 [2025-12-05 00:26:11] iteration        6/      10 | consumed samples:          768 | elapsed time per iteration (ms): 6017.2 | learning rate: 1.800000E-04 | global batch size:   128 | lm loss: 4.079018E-04 | loss scale: 1.0 | grad norm: 0.041 | number of skipped iterations:   0 | number of nan iterations:   0 |
Step Time : 6.03s GPU utilization: 28.9MODEL_TFLOP/s/GPU
 [2025-12-05 00:26:17] iteration        7/      10 | consumed samples:          896 | elapsed time per iteration (ms): 6034.3 | learning rate: 2.100000E-04 | global batch size:   128 | lm loss: 1.331343E-04 | loss scale: 1.0 | grad norm: 0.010 | number of skipped iterations:   0 | number of nan iterations:   0 |
Step Time : 6.05s GPU utilization: 28.8MODEL_TFLOP/s/GPU
 [2025-12-05 00:26:23] iteration        8/      10 | consumed samples:         1024 | elapsed time per iteration (ms): 6045.5 | learning rate: 2.400000E-04 | global batch size:   128 | lm loss: 8.991118E-05 | loss scale: 1.0 | grad norm: 0.006 | number of skipped iterations:   0 | number of nan iterations:   0 |
Step Time : 6.07s GPU utilization: 28.7MODEL_TFLOP/s/GPU
 [2025-12-05 00:26:29] iteration        9/      10 | consumed samples:         1152 | elapsed time per iteration (ms): 6065.8 | learning rate: 2.700000E-04 | global batch size:   128 | lm loss: 7.406142E-05 | loss scale: 1.0 | grad norm: 0.005 | number of skipped iterations:   0 | number of nan iterations:   0 |
Step Time : 6.08s GPU utilization: 28.7MODEL_TFLOP/s/GPU
 [2025-12-05 00:26:35] iteration       10/      10 | consumed samples:         1280 | elapsed time per iteration (ms): 6078.1 | learning rate: 3.000000E-04 | global batch size:   128 | lm loss: 5.936620E-05 | loss scale: 1.0 | grad norm: 0.004 | number of skipped iterations:   0 | number of nan iterations:   0 |
[after training is done] datetime: 2025-12-05 00:26:35
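To check a run against this sample, the per-iteration lines can be parsed with a short script. This is a sketch only: the regular expression and the `parse_line` helper are assumptions based on the log format shown above, not part of the recipe, and the two sample lines are copied from the output with their trailing fields omitted.

```python
import math
import re

# Matches the "iteration N/M | ... | learning rate: X | ... | lm loss: Y" lines.
LINE_RE = re.compile(
    r"iteration\s+(\d+)/\s*(\d+)\s*\|"
    r".*?learning rate:\s*([0-9.Ee+-]+)\s*\|"
    r".*?lm loss:\s*([0-9.Ee+-]+)"
)


def parse_line(line: str):
    """Return iteration, total, lr and lm_loss for a log line, or None."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    return {
        "iteration": int(match.group(1)),
        "total": int(match.group(2)),
        "lr": float(match.group(3)),
        "lm_loss": float(match.group(4)),
    }


sample = [
    " [2025-12-05 00:25:41] iteration        1/      10 | consumed samples:          128 "
    "| elapsed time per iteration (ms): 7037.3 | learning rate: 3.000000E-05 "
    "| global batch size:   128 | lm loss: 6.700717E+00 |",
    "Step Time : 6.08s GPU utilization: 28.7MODEL_TFLOP/s/GPU",
    " [2025-12-05 00:26:35] iteration       10/      10 | consumed samples:         1280 "
    "| elapsed time per iteration (ms): 6078.1 | learning rate: 3.000000E-04 "
    "| global batch size:   128 | lm loss: 5.936620E-05 |",
]

# Keep only the lines that carry per-iteration metrics.
records = [r for r in map(parse_line, sample) if r is not None]

# The learning rate grows tenfold between iteration 1 and iteration 10,
# which is consistent with a linear warmup over 10 steps.
first, last = records[0], records[-1]
assert math.isclose(last["lr"] / first["lr"], last["iteration"] / first["iteration"])

for r in records:
    print(r["iteration"], r["lr"], r["lm_loss"])
```

The rapid drop in `lm loss` in the sample is expected with `--mock-data` and says nothing about accuracy on real data.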

Accuracy evaluation

The BF16 runs in this recipe are on par with the previous FP8 runs. However, there is a bug that causes the FP8 recipe here to underperform. In addition, the following two issues currently block FP8 use in practice: NVIDIA-NeMo/Megatron-Bridge#1730, NVIDIA-NeMo/Megatron-Bridge#1707

Performance Comparison

In throughput, both the BF16 and FP8 Megatron Bridge (MBridge) runs outperform the previous FP8 run in NeMo2.

Evo2 1B Run	Seconds per step (lower is better)	Tokens/sec/GPU	Global Batch Size	Number of GPUs	Vocab Size
MBridge BF16	6.10	26,859	960	48	256
MBridge FP8 (delayed)	5.38	30,453	960	48	256
MBridge FP8 (delayed)	5.39	30,397	960	48	512
NeMo2 FP8 (delayed)	6.18	26,511	960	48	512
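The two throughput columns are related by simple arithmetic. The sketch below assumes a sequence length of 8192 tokens; that value is not stated in this PR and is inferred only because it makes the step times and tokens/sec/GPU figures in the table agree. The function name is hypothetical.

```python
def tokens_per_sec_per_gpu(global_batch_size: int, seq_length: int,
                           step_seconds: float, num_gpus: int) -> float:
    """Tokens processed per second per GPU for one optimizer step."""
    return global_batch_size * seq_length / (step_seconds * num_gpus)


# ASSUMPTION: not given in the PR; inferred from the table values.
ASSUMED_SEQ_LENGTH = 8192

# Seconds per step copied from the table (global batch 960, 48 GPUs).
step_seconds_by_run = {
    "MBridge BF16": 6.10,
    "MBridge FP8 (delayed), vocab 256": 5.38,
    "MBridge FP8 (delayed), vocab 512": 5.39,
    "NeMo2 FP8 (delayed), vocab 512": 6.18,
}

for name, seconds in step_seconds_by_run.items():
    rate = tokens_per_sec_per_gpu(960, ASSUMED_SEQ_LENGTH, seconds, 48)
    print(f"{name}: {rate:,.0f} tokens/sec/GPU")

# Step-time ratio of the two FP8 runs that share a vocab size of 512.
speedup = (step_seconds_by_run["NeMo2 FP8 (delayed), vocab 512"]
           / step_seconds_by_run["MBridge FP8 (delayed), vocab 512"])
print(f"MBridge FP8 vs NeMo2 FP8 step-time ratio: {speedup:.2f}x")
```

Under that assumption the computed rates match the table to within the rounding of the reported step times, and the like-for-like FP8 comparison (vocab size 512) works out to roughly a 1.15x improvement in step time.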

Type of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Refactor
Documentation update
Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

ciflow:skip - Skip all CI tests for this PR
ciflow:notebooks - Run Jupyter notebooks execution tests for bionemo2
ciflow:slow - Run slow single GPU integration tests marked as @pytest.mark.slow for bionemo2
ciflow:all - Run all tests (unit tests, slow tests, and notebooks) for bionemo2. Use this label to force every bionemo2 test to run.
ciflow:all-recipes - Run tests for all recipes (under bionemo-recipes). Use this label to force the tests for every recipe to run.

Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline.

For more details, see CONTRIBUTING

Note

By default, only basic unit tests are run. Add the appropriate labels to enable additional test coverage.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI runs on NVIDIA's compute resources.

If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123).
If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.