Evo2 Megatron Bridge Recipe Prototype #1357

Merged: jstjohn merged 66 commits into main from jstjohn/evo2_megatron_bridge_recipe on Dec 19, 2025

jstjohn (Collaborator) commented Dec 1, 2025

Description

Usage

Go to the recipe:

cd bionemo-recipes/recipes/evo2_megatron

Build and run the image:

docker build -t evo2_megatron .
docker run --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all -it evo2_megatron

NOTE: Some users with 2x A6000 GPUs see both GPUs freeze during torchrun. If this happens, disable NCCL peer-to-peer transfers:

export NCCL_P2P_DISABLE=1

Then execute the mock data example with:

torchrun --nproc-per-node 1 --no-python train_evo2 \
  --hf-tokenizer-model-path tokenizers/nucleotide_fast_tokenizer_256 \
  --model-size striped_hyena_1b_nv_parallel \
  --max-steps 22 --eval-interval 10 --eval-iters 3 \
  --mock-data \
  --micro-batch-size 32 --global-batch-size 256 --seq-length 1024 \
  --tensor-model-parallel 1 \
  --use-precision-aware-optimizer \
  --dataset-seed 33 --seed 41 \
  --ckpt-async-save \
  --spike-no-more-embedding-init \
  --no-weight-decay-embeddings \
  --cross-entropy-loss-fusion \
  --align-param-gather --overlap-param-gather \
  --grad-reduce-in-fp32 \
  --decay-steps 100 --warmup-steps 10 \
  --mixed-precision-recipe bf16-mixed \
  --no-fp32-residual-connection \
  --activation-checkpoint-recompute-num-layers 1 \
  --attention-dropout 0.001 --hidden-dropout 0.001 \
  --eod-pad-in-loss-mask --enable-preemption \
  --log-interval 5 --debug-ddp-parity-freq 10 \
  --wandb-project evo2-recipes-verification-tmp \
  --wandb-run-name tmp_workstation_run_mock_data \
  --result-dir tmp --no-renormalize-loss
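
The batch-size flags in the command above imply gradient accumulation. As a rough sketch, assuming the usual Megatron-style relation (global batch = micro batch x data-parallel size x accumulation steps) and a data-parallel size of 1 for the single-process run, the number of micro-batches per optimizer step can be worked out as follows. The helper name is hypothetical and not part of the recipe.

```python
def grad_accumulation_steps(global_batch_size: int,
                            micro_batch_size: int,
                            data_parallel_size: int) -> int:
    """Micro-batches each data-parallel rank processes per optimizer step.

    Hypothetical helper for illustration; it is not part of train_evo2.
    """
    samples_per_pass = micro_batch_size * data_parallel_size
    if global_batch_size % samples_per_pass != 0:
        raise ValueError(
            "global batch size must be divisible by "
            "micro batch size * data-parallel size"
        )
    return global_batch_size // samples_per_pass


# Values from the command: --global-batch-size 256, --micro-batch-size 32,
# one process with --tensor-model-parallel 1 (assumed data-parallel size 1).
steps = grad_accumulation_steps(256, 32, 1)
print(f"micro-batches per optimizer step: {steps}")
```

With more GPUs and the same flags, the data-parallel size grows and the accumulation count shrinks accordingly.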

That should give something like the following:

----------------------------------
Setting rerun_state_machine.current_iteration to 0...
Starting training loop at iteration 0
/usr/local/lib/python3.12/dist-packages/torch/distributed/distributed_c10d.py:4879: UserWarning: barrier(): using the device under current context. You can specify `device_id` in `init_process_group` to mute this warning.
  warnings.warn(  # warn only once
Step Time : 7.04s GPU utilization: 24.7MODEL_TFLOP/s/GPU
Number of parameters in transformer layers in billions:  0.86
 [2025-12-05 00:25:41] iteration        1/      10 | consumed samples:          128 | elapsed time per iteration (ms): 7037.3 | learning rate: 3.000000E-05 | global batch size:   128 | lm loss: 6.700717E+00 | loss scale: 1.0 | grad norm: 117.718 | number of skipped iterations:   0 | number of nan iterations:   0 |
Number of parameters in embedding layers in billions: 0.00
Total number of parameters in billions: 0.86
Number of parameters in most loaded shard in billions: 0.8614
Theoretical memory footprints: weight and optimizer=9857.42 MB
[Rank 0] (after 1 iterations) memory (GB) | mem-allocated-gigabytes: 13.367 | mem-active-gigabytes: 13.367 | mem-inactive-gigabytes: 0.42641 | mem-reserved-gigabytes: 14.259 | mem-max-allocated-gigabytes: 13.367 | mem-max-active-gigabytes: 13.367 | mem-max-inactive-gigabytes: 0.43855 | mem-max-reserved-gigabytes: 14.259 | mem-alloc-retires: 0 | mem-allocated-count: 284
Step Time : 5.96s GPU utilization: 29.2MODEL_TFLOP/s/GPU
 [2025-12-05 00:25:47] iteration        2/      10 | consumed samples:          256 | elapsed time per iteration (ms): 5959.8 | learning rate: 6.000000E-05 | global batch size:   128 | lm loss: 6.705625E+00 | loss scale: 1.0 | grad norm: 117.618 | number of skipped iterations:   0 | number of nan iterations:   0 |
Step Time : 5.95s GPU utilization: 29.2MODEL_TFLOP/s/GPU
 [2025-12-05 00:25:53] iteration        3/      10 | consumed samples:          384 | elapsed time per iteration (ms): 5954.5 | learning rate: 9.000000E-05 | global batch size:   128 | lm loss: 8.673918E-02 | loss scale: 1.0 | grad norm: 5.777 | number of skipped iterations:   0 | number of nan iterations:   0 |
Step Time : 5.97s GPU utilization: 29.1MODEL_TFLOP/s/GPU
 [2025-12-05 00:25:59] iteration        4/      10 | consumed samples:          512 | elapsed time per iteration (ms): 5974.1 | learning rate: 1.200000E-04 | global batch size:   128 | lm loss: 7.124253E-03 | loss scale: 1.0 | grad norm: 0.827 | number of skipped iterations:   0 | number of nan iterations:   0 |
Step Time : 6.00s GPU utilization: 29.0MODEL_TFLOP/s/GPU
 [2025-12-05 00:26:05] iteration        5/      10 | consumed samples:          640 | elapsed time per iteration (ms): 5996.1 | learning rate: 1.500000E-04 | global batch size:   128 | lm loss: 1.208314E-03 | loss scale: 1.0 | grad norm: 0.130 | number of skipped iterations:   0 | number of nan iterations:   0 |
Step Time : 6.02s GPU utilization: 28.9MODEL_TFLOP/s/GPU
 [2025-12-05 00:26:11] iteration        6/      10 | consumed samples:          768 | elapsed time per iteration (ms): 6017.2 | learning rate: 1.800000E-04 | global batch size:   128 | lm loss: 4.079018E-04 | loss scale: 1.0 | grad norm: 0.041 | number of skipped iterations:   0 | number of nan iterations:   0 |
Step Time : 6.03s GPU utilization: 28.9MODEL_TFLOP/s/GPU
 [2025-12-05 00:26:17] iteration        7/      10 | consumed samples:          896 | elapsed time per iteration (ms): 6034.3 | learning rate: 2.100000E-04 | global batch size:   128 | lm loss: 1.331343E-04 | loss scale: 1.0 | grad norm: 0.010 | number of skipped iterations:   0 | number of nan iterations:   0 |
Step Time : 6.05s GPU utilization: 28.8MODEL_TFLOP/s/GPU
 [2025-12-05 00:26:23] iteration        8/      10 | consumed samples:         1024 | elapsed time per iteration (ms): 6045.5 | learning rate: 2.400000E-04 | global batch size:   128 | lm loss: 8.991118E-05 | loss scale: 1.0 | grad norm: 0.006 | number of skipped iterations:   0 | number of nan iterations:   0 |
Step Time : 6.07s GPU utilization: 28.7MODEL_TFLOP/s/GPU
 [2025-12-05 00:26:29] iteration        9/      10 | consumed samples:         1152 | elapsed time per iteration (ms): 6065.8 | learning rate: 2.700000E-04 | global batch size:   128 | lm loss: 7.406142E-05 | loss scale: 1.0 | grad norm: 0.005 | number of skipped iterations:   0 | number of nan iterations:   0 |
Step Time : 6.08s GPU utilization: 28.7MODEL_TFLOP/s/GPU
 [2025-12-05 00:26:35] iteration       10/      10 | consumed samples:         1280 | elapsed time per iteration (ms): 6078.1 | learning rate: 3.000000E-04 | global batch size:   128 | lm loss: 5.936620E-05 | loss scale: 1.0 | grad norm: 0.004 | number of skipped iterations:   0 | number of nan iterations:   0 |
[after training is done] datetime: 2025-12-05 00:26:35
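
To compare a run against the log above, the per-iteration lines can be parsed into numbers. The following is a minimal sketch, assuming the line layout shown in this log; the regular expression and function names are illustrative and not part of the recipe. It also checks the learning rate against a linear warmup, assuming a peak of 3e-4 (the value the log reaches at iteration 10) and the `--warmup-steps 10` setting from the command.

```python
import re

# One iteration line copied from the log above.
LOG_LINE = (
    " [2025-12-05 00:25:53] iteration        3/      10 | consumed samples:"
    "          384 | elapsed time per iteration (ms): 5954.5 | learning rate:"
    " 9.000000E-05 | global batch size:   128 | lm loss: 8.673918E-02 |"
    " loss scale: 1.0 | grad norm: 5.777 | number of skipped iterations:   0 |"
    " number of nan iterations:   0 |"
)

# Illustrative pattern; it assumes the field order shown in the log.
PATTERN = re.compile(
    r"iteration\s+(?P<step>\d+)/\s*(?P<total>\d+)"
    r".*?learning rate:\s*(?P<lr>[0-9.eE+-]+)"
    r".*?lm loss:\s*(?P<loss>[0-9.eE+-]+)"
    r".*?grad norm:\s*(?P<grad_norm>[0-9.eE+-]+)"
)


def parse_iteration_line(line):
    """Return step, total, lr, loss and grad norm, or None for other lines."""
    match = PATTERN.search(line)
    if match is None:
        return None
    return {
        "step": int(match["step"]),
        "total": int(match["total"]),
        "lr": float(match["lr"]),
        "loss": float(match["loss"]),
        "grad_norm": float(match["grad_norm"]),
    }


def linear_warmup_lr(step, peak_lr=3e-4, warmup_steps=10):
    """Learning rate under a plain linear warmup (assumed schedule shape)."""
    return peak_lr * min(step, warmup_steps) / warmup_steps


record = parse_iteration_line(LOG_LINE)
print(record)
print("matches linear warmup:",
      abs(record["lr"] - linear_warmup_lr(record["step"])) < 1e-9)
```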

Accuracy evaluation

BF16 training is on par with the previous FP8 runs. However, there is a bug that causes the FP8 recipe here to underperform. This is in addition to the following two issues, which currently also block FP8 use in practice: NVIDIA-NeMo/Megatron-Bridge#1730 and NVIDIA-NeMo/Megatron-Bridge#1707.
[Figure: training_loss_comparison]

Performance Comparison

Both the BF16 and FP8 MBridge runs have a lower step time than the previous FP8 run in NeMo2.

| Evo2 1B Run | Seconds per step (lower is better) | Tokens/sec/GPU | Global Batch Size | Number of GPUs | Vocab Size |
|---|---|---|---|---|---|
| MBridge BF16 | 6.10 | 26,859 | 960 | 48 | 256 |
| MBridge FP8 (delayed) | 5.38 | 30,453 | 960 | 48 | 256 |
| MBridge FP8 (delayed) | 5.39 | 30,397 | 960 | 48 | 512 |
| NeMo2 FP8 (delayed) | 6.18 | 26,511 | 960 | 48 | 512 |
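
The Tokens/sec/GPU column can be related to the step time with a short calculation. This is a sketch under one assumption: a sequence length of 8192 tokens, which is not stated in this description but reproduces the table values to within one token per second. The function name is illustrative.

```python
# Assumption: 8192 tokens per sample. This value is not given in the PR
# description; it is inferred because it reproduces the table.
SEQ_LENGTH = 8192


def tokens_per_sec_per_gpu(global_batch_size, seq_length, step_seconds, num_gpus):
    """Tokens processed per second per GPU for one optimizer step."""
    return global_batch_size * seq_length / (step_seconds * num_gpus)


# (run name, seconds per step) taken from the table; batch size 960 on 48 GPUs.
runs = [
    ("MBridge BF16", 6.10),
    ("MBridge FP8 (delayed), vocab 256", 5.38),
    ("MBridge FP8 (delayed), vocab 512", 5.39),
    ("NeMo2 FP8 (delayed)", 6.18),
]

for name, step_seconds in runs:
    throughput = tokens_per_sec_per_gpu(960, SEQ_LENGTH, step_seconds, 48)
    print(f"{name}: {throughput:,.0f} tokens/sec/GPU")

# Step-time ratio of the NeMo2 FP8 run to the MBridge FP8 run (vocab 256).
print(f"NeMo2 FP8 / MBridge FP8 step time: {6.18 / 5.38:.3f}")
```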

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

  • ciflow:skip - Skip all CI tests for this PR
  • ciflow:notebooks - Run Jupyter notebooks execution tests for bionemo2
  • ciflow:slow - Run slow single GPU integration tests marked as @pytest.mark.slow for bionemo2
  • ciflow:all - Run all tests (unit tests, slow tests, and notebooks) for bionemo2. This label can be used to enforce running tests for all bionemo2.
  • ciflow:all-recipes - Run tests for all recipes (under bionemo-recipes). This label can be used to enforce running tests for all recipes.

Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline.

For more details, see CONTRIBUTING

Note

By default, only basic unit tests are run. Add the appropriate labels to enable additional test coverage.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

  • If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
    automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
  • If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
    /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

…ridge recipe

Signed-off-by: John St John <jstjohn@nvidia.com>

copy-pr-bot Bot commented Dec 1, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

jstjohn (Collaborator, Author) commented Dec 4, 2025

Note: this currently depends on NVIDIA-NeMo/Megatron-Bridge#1594.

…t_mode

…tests

…ntially and separately deprecate the evo2 package.

pstjohn (Collaborator) left a comment

Approving, but please edit the README to indicate that this is still WIP, and maybe add a brief intro describing what the goal here will be.

@jstjohn jstjohn added this pull request to the merge queue Dec 19, 2025
Merged via the queue into main with commit 7caf18c Dec 19, 2025
18 checks passed
@jstjohn jstjohn deleted the jstjohn/evo2_megatron_bridge_recipe branch December 19, 2025 22:11