
Add opengenome2 llama3 recipe #1499

Open

savitha-eng wants to merge 37 commits into main from savitha/opengenome2-recipe-clean

Conversation

@savitha-eng
Collaborator

@savitha-eng savitha-eng commented Mar 5, 2026

Self-contained FSDP2 + TransformerEngine recipe for OpenGenome2 training, extracted from the generic llama3_native_te recipe with OG2-specific defaults:

  • FP32 master weights with MixedPrecisionPolicy (cast_forward_inputs=False)
  • Megatron-style scaled init for proj/fc2 layers
  • Spike-No-More embedding initialization (std=1.0)
  • Genomic masking for degenerate bases
  • Weight decay grouping (skip bias/1D params)
  • THD sequence packing with GQA
  • FP8 training with first/last layer BF16 override
  • RoPE fp32 assertion (from PR #1496: "Assert fp32 for rope embeddings, misc test fixes")
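The two initialization schemes above can be sketched as follows — a minimal illustration, not the recipe's actual code; the module-name suffixes (`proj`, `fc2`) and the 1/sqrt(2·num_layers) scaling follow the Megatron convention, and the std=1.0 embedding init follows the Spike-No-More recipe:

```python
import math

import torch
import torch.nn as nn


def init_weights(model: nn.Module, num_layers: int, hidden_size: int) -> None:
    """Illustrative init: Spike-No-More embeddings plus Megatron-style scaled init."""
    base_std = 1.0 / math.sqrt(hidden_size)
    # Megatron scales the init of residual-branch output projections by
    # 1/sqrt(2 * num_layers) so activation variance stays bounded with depth.
    scaled_std = base_std / math.sqrt(2.0 * num_layers)
    for name, module in model.named_modules():
        if isinstance(module, nn.Embedding):
            # Spike-No-More: large embedding std (1.0) to avoid early loss spikes.
            nn.init.normal_(module.weight, mean=0.0, std=1.0)
        elif isinstance(module, nn.Linear):
            std = scaled_std if name.endswith(("proj", "fc2")) else base_std
            nn.init.normal_(module.weight, mean=0.0, std=std)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
```

The actual recipe applies this inside the OG2-specific model class; the sketch only shows the variance bookkeeping.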

Description

Usage

TODO: Add code snippet

Type of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Refactor
  • Documentation update
  • Other (please describe):

CI Pipeline Configuration

Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.

  • ciflow:skip - Skip all CI tests for this PR
  • ciflow:notebooks - Run Jupyter notebooks execution tests for bionemo2
  • ciflow:slow - Run slow single GPU integration tests marked as @pytest.mark.slow for bionemo2
  • ciflow:all - Run all tests (unit tests, slow tests, and notebooks) for bionemo2. Use this label to force running all bionemo2 tests.
  • ciflow:all-recipes - Run tests for all recipes (under bionemo-recipes). Use this label to force running tests for every recipe.

Unit tests marked as @pytest.mark.multi_gpu or @pytest.mark.distributed are not run in the PR pipeline.

For more details, see CONTRIBUTING

Note

By default, only basic unit tests are run. Add the appropriate labels to enable additional test coverage.

Authorizing CI Runs

We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.

  • If a pull request is opened by a trusted user and contains only trusted changes, the pull request's code will
    automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
  • If a pull request is opened by an untrusted user or contains untrusted changes, an NVIDIA org member must leave an
    /ok to test comment on the pull request to trigger CI. This will need to be done for each new commit.

Triggering Code Rabbit AI Review

To trigger a code review from code rabbit, comment on a pull request with one of these commands:

See https://docs.coderabbit.ai/reference/review-commands for a full list of commands.

Pre-submit Checklist

  • I have tested these changes locally
  • I have updated the documentation accordingly
  • I have added/updated tests as needed
  • All existing tests pass successfully

Summary by CodeRabbit

Release Notes

  • New Features

    • Added the OpenGenome2 Llama 3 TransformerEngine training recipe, featuring FSDP2 distributed training, THD sequence packing, and context parallelism support.
    • Introduced comprehensive data loading and collation utilities for genomic sequences with optional degenerate base masking and case normalization.
    • Added performance logging, profiling, and validation frameworks for training workflows.
  • Bug Fixes

    • Improved checkpoint loading robustness with fallback handling for unavailable tensor quantization features.
  • Documentation

    • Added extensive README and configuration presets for various training scenarios and hardware setups.
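The degenerate-base masking and label uppercasing mentioned above can be sketched as follows, assuming an ord()-based nucleotide tokenizer (as the recipe's tokenizer README describes) and the usual -100 ignore index for cross-entropy; the function name is illustrative, not the recipe's actual API:

```python
import torch

# Canonical bases; IUPAC degenerate codes (N, R, Y, ...) are everything else.
CANONICAL_IDS = {ord(b) for b in "ACGTacgt"}


def mask_degenerate_bases(input_ids: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Set labels to -100 (ignored by cross-entropy) wherever the token is not A/C/G/T."""
    keep = torch.zeros_like(input_ids, dtype=torch.bool)
    for tok in CANONICAL_IDS:
        keep |= input_ids == tok
    return torch.where(keep, labels, torch.full_like(labels, -100))
```

In the recipe this would run inside the genomic collator, alongside case normalization of the label tokens.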

@copy-pr-bot

copy-pr-bot bot commented Mar 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Mar 5, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: d08512b5-4357-4f23-bce5-7fc7749eca43

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Introduces the OpenGenome2 Llama3 with TransformerEngine recipe, a self-contained training solution featuring FSDP2 distributed training, context parallelism, sequence packing, genomic data handling, and checkpoint management. Includes model implementations, data loaders, training scripts, comprehensive test suite, and supporting utilities. Also improves checkpoint robustness in the llama3_native_te recipe by deferring QuantizedTensor validation logic.

Changes

Cohort / File(s) Summary
Checkpoint & Distributed Config
bionemo-recipes/recipes/llama3_native_te/checkpoint.py, bionemo-recipes/recipes/opengenome2_llama_native_te/checkpoint.py, bionemo-recipes/recipes/opengenome2_llama_native_te/distributed_config.py
Robust import handling for QuantizedTensor with fallback validation; FSDP2-style checkpoint loading/saving with lenient load planner, async support, and dataloader state persistence; distributed config wrapper for rank/world-size management.
Model Implementations
bionemo-recipes/recipes/opengenome2_llama_native_te/modeling_llama_te.py, bionemo-recipes/recipes/opengenome2_llama_native_te/opengenome_modeling_llama_te.py
TransformerEngine-optimized Llama models with THD/BSHD input format support, rotary embeddings, and FP8/BF16 handling; OG2-specific model with Spike-No-More embedding init, Megatron-scaled weight init, and configurable FP8 layer boundaries.
Data & Collation
bionemo-recipes/recipes/opengenome2_llama_native_te/dataset.py, bionemo-recipes/recipes/opengenome2_llama_native_te/collator.py, bionemo-recipes/recipes/opengenome2_llama_native_te/opengenome_collator.py
Dataset tokenization and dataloader creation for BSHD/THD formats with sequence packing; context-parallel-aware batch distribution with per-rank sharding; genomic data collator for degenerate base masking and label uppercasing.
Training Scripts
bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2.py, bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2_cp.py
Full FSDP2 training workflows with gradient accumulation, mixed-precision autocasting, checkpoint resume/save logic; context-parallel variant with CP/DP device mesh and CP-aware data loading.
Utilities & Infrastructure
bionemo-recipes/recipes/opengenome2_llama_native_te/scheduler.py, bionemo-recipes/recipes/opengenome2_llama_native_te/optimizer.py, bionemo-recipes/recipes/opengenome2_llama_native_te/perf_logger.py, bionemo-recipes/recipes/opengenome2_llama_native_te/validation.py, bionemo-recipes/recipes/opengenome2_llama_native_te/fp8_debugging.py
Cosine annealing scheduler with warmup; Megatron-style weight decay grouping; distributed performance logging with wandb/tqdm/Nsight profiler integration; validation loop with per-batch loss aggregation; FP8 debugging stats initialization.
Hydra Configuration
bionemo-recipes/recipes/opengenome2_llama_native_te/hydra_config/*
Comprehensive training presets including L0 debug/sanity runs, context-parallel variants, OG2 7B GQA (standard/FP8/global-shuffle), with model architecture, optimizer, dataset, and logging overrides.
Model Configs & Tokenizers
bionemo-recipes/recipes/opengenome2_llama_native_te/model_configs/..., bionemo-recipes/recipes/opengenome2_llama_native_te/tokenizers/nucleotide_fast_tokenizer/*
JSON model configurations for Lingua-1B and Llama-3.2-1B architectures; nucleotide tokenizer with special tokens, word-level vocab, and template post-processor for DNA sequences.
Testing & Validation
bionemo-recipes/recipes/opengenome2_llama_native_te/tests/*, bionemo-recipes/recipes/opengenome2_llama_native_te/scripts/minimal_repro_thd_gqa_cp_nan.py
Conftest fixtures for distributed setup and cleanup; tests for dataset creation, genomic masking, and end-to-end training with various configurations; minimal NaN reproduction script for THD+GQA+CP edge case.
Containerization & Docs
bionemo-recipes/recipes/opengenome2_llama_native_te/.ruff.toml, bionemo-recipes/recipes/opengenome2_llama_native_te/Dockerfile, bionemo-recipes/recipes/opengenome2_llama_native_te/requirements.txt, bionemo-recipes/recipes/opengenome2_llama_native_te/README.md
Ruff linter config extending parent; NVIDIA PyTorch base image with pip caching and workspace setup; dependencies (datasets, hydra, torch, transformer_engine, etc.); comprehensive README covering features, installation, benchmarks, and usage.
Lepton Integration
bionemo-recipes/recipes/llama3_native_te/lepton_backup/og2_recipe_clean_cp2_llama3_code.yaml, bionemo-recipes/recipes/llama3_native_te/lepton_backup/submit_og2_lepton_eden.py
YAML job config for multi-node CP2 training with OG2 GQA overrides; Python script to resolve scheduling targets, build Hydra overrides, manage NFS mounts, and submit Lepton jobs with Git sync and WandB integration.
CI/Scripts
ci/scripts/check_copied_files.py
Extended file-sync mapping to include collator and modeling files from opengenome2_llama_native_te as destinations for copy verification.
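The cosine-annealing-with-warmup schedule listed under Utilities & Infrastructure typically looks like this minimal sketch (linear warmup, then cosine decay to a min-lr floor; the exact parameterization in scheduler.py may differ):

```python
import math


def lr_at_step(step: int, max_lr: float, min_lr: float, warmup_steps: int, total_steps: int) -> float:
    """Linear warmup to max_lr, then cosine decay to min_lr, held flat afterwards."""
    if step < warmup_steps:
        # Linear ramp; step 0 gets a non-zero LR of max_lr / warmup_steps.
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)  # clamp so LR stays at min_lr past total_steps
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

Wrapped in a `torch.optim.lr_scheduler.LambdaLR` (with the multiplier expressed relative to max_lr), this gives the standard warmup + cosine curve used by most Llama-style pretraining runs.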

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant Hydra as Hydra Config
    participant DistributedSetup as Distributed<br/>Setup
    participant DataLoader as DataLoader<br/>(CP-Aware)
    participant Model as NVLlamaForCausalLM<br/>(FSDP2)
    participant Optimizer as Optimizer &<br/>Scheduler
    participant Checkpoint as Checkpoint<br/>Manager
    participant Logger as PerfLogger &<br/>Validation

    User->>Hydra: train_fsdp2_cp.py with config
    Hydra->>DistributedSetup: Initialize ranks, device mesh (CP/DP)
    DistributedSetup->>DistributedSetup: Create process groups, set seeds
    
    DistributedSetup->>Model: Wrap NVLlama with FSDP2, attach CP group
    DistributedSetup->>DataLoader: Build CP-aware dataloader on rank 0
    DataLoader->>DataLoader: ContextParallelDataLoaderWrapper wraps collate_fn
    
    DistributedSetup->>Checkpoint: Load checkpoint or start fresh
    Checkpoint->>Model: Resume state_dict and optimizer state
    
    loop Training Loop (per step)
        DataLoader->>DataLoader: Next batch with sharded CP ranks
        DataLoader->>Model: Forward with THD/BSHD input, scattered to ranks
        Model->>Model: Transformer layers (TE + RoPE + FP8/BF16 if enabled)
        Model-->>Logger: Loss output
        Logger->>Logger: Accumulate micro-step metrics
        
        Model->>Optimizer: Backward pass (loss / grad_acc_steps)
        Optimizer->>Model: Gradient clip and optimize
        Optimizer->>Scheduler: Update learning rate
        
        Logger->>Logger: Log step metrics (every N steps)
        
        alt Checkpoint Interval
            Model->>Checkpoint: Save checkpoint (possibly async)
        end
    end
    
    Checkpoint->>Model: Save final model (FP32 on rank 0)
    Logger->>Logger: Finish wandb, profiler
    DistributedSetup->>DistributedSetup: Destroy process groups, cleanup

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~50 minutes

Possibly related PRs

  • Flatten esm-2 model package #1464: Modifies the same ci/scripts/check_copied_files.py SOURCE_TO_DESTINATION_MAP to add file synchronization entries for model and collator files.

Suggested reviewers

  • jomitchellnv
  • trvachov
  • jstjohn
  • jwilber
  • dorotat-nv

Poem

🐰 Behold the OpenGenome, now sharded and fast,
With TransformerEngine magic and context at last,
THD-packed sequences, FSDP2 dance,
DNA springs to life in the distributed trance!
✨ 🧬

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)

  • Description check — ❓ Inconclusive: The PR description includes a clear overview of the OG2-specific defaults and features being added, but the template sections (Type of changes, Pre-submit Checklist) are incomplete, with all checkboxes left unchecked and TODO placeholders remaining in the Usage section. Resolution: mark the appropriate Type of changes option (likely "New feature"), fill in the Usage code snippet, and verify the Pre-submit Checklist items before merging.
✅ Passed checks (2 passed)
  • Docstring Coverage — ✅ Passed: Docstring coverage is 97.83%, which is sufficient; the required threshold is 80.00%.
  • Title check — ✅ Passed: The title accurately summarizes the main change: adding a new OpenGenome2 Llama3 recipe with FSDP2 and TransformerEngine support.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

savitha-eng and others added 9 commits March 6, 2026 02:23
…d base files

- Restore modeling_llama_te.py and collator.py to CI-synced copies of upstream
- Create opengenome_modeling_llama_te.py with OG2-specific init (Megatron scaled, Spike-No-More, FP8 layer override, RoPE fixes)
- Rename genomic_dataset.py to opengenome_collator.py for consistency
- Extract weight decay grouping to optimizer.py
- Extract validation loop to validation.py
- Add gradient accumulation tests for BSHD and THD formats
- Register new files in CI copy-check sync lists

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
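The weight decay grouping extracted to optimizer.py in this commit is typically implemented along these lines — a sketch of the Megatron-style convention (no decay on biases or 1-D parameters such as norm weights), not the file's actual contents:

```python
import torch
from torch import nn


def get_param_groups(model: nn.Module, weight_decay: float) -> list[dict]:
    """Megatron-style grouping: no weight decay on biases or 1-D params (e.g. LayerNorm)."""
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        # 1-D tensors are norm scales/biases; decaying them hurts stability.
        if p.dim() == 1 or name.endswith(".bias"):
            no_decay.append(p)
        else:
            decay.append(p)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
```

The returned list feeds directly into `torch.optim.AdamW(get_param_groups(model, 0.1), lr=...)`, which honors per-group `weight_decay` overrides.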
- Add train_fsdp2_cp.py with FSDP2+CP support, keeping all OG2-specific
  features (FP32 master weights, Megatron scaled init, Spike-No-More,
  weight decay grouping)
- Add L0_sanity_cp.yaml hydra config for CP testing
- Add single-GPU CP test (cp_size=1) and 2-GPU CP tests (cp_size=2)
  for both BSHD and THD formats

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ive_te

The QuantizedTensor class moved between TE versions. Make the import
conditional so checkpoint.py works with both old and new TE releases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
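The conditional import described here might look like the following sketch; the exact module paths are assumptions, since QuantizedTensor has moved between TransformerEngine releases:

```python
# QuantizedTensor's import path has moved across TransformerEngine releases,
# so fall back gracefully when it is unavailable (paths below are illustrative).
try:
    from transformer_engine.pytorch.tensor import QuantizedTensor  # newer TE layout
except ImportError:
    try:
        from transformer_engine.pytorch.float8_tensor import Float8Tensor as QuantizedTensor  # older TE
    except ImportError:
        QuantizedTensor = None  # feature unavailable; skip quantized-tensor validation


def is_quantized(tensor) -> bool:
    """Return True only when TE quantized tensors are importable and the type matches."""
    return QuantizedTensor is not None and isinstance(tensor, QuantizedTensor)
```

Checkpoint validation then guards on `is_quantized(...)` instead of assuming the class exists, which is what lets checkpoint.py work under both old and new TE releases.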
The llama3 L0_sanity_cp hydra config has no validation key, so
validation.enabled override fails. Use ++ prefix for robustness and
remove validation_enabled from the llama3-code control experiment config.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The L0_sanity_cp base config has data_files=dlcm_sanity_dataset.parquet
which doesn't exist at the OG2 dataset path. Use Hydra delete syntax to
remove it so HF auto-discovers parquet files in the directory.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add scripts/minimal_repro_thd_gqa_cp_nan.py: single TransformerLayer
with synthetic data, runs MHA then GQA with THD+CP=2 to demonstrate
the NaN bug. Run with: torchrun --nproc_per_node=2 scripts/minimal_repro_thd_gqa_cp_nan.py

Also remove first-batch dtype/loss logging from train_fsdp2.py and
train_fsdp2_cp.py (not present in upstream llama3 code).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
__version__ is on transformer_engine, not transformer_engine.pytorch.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Use tiny config (hidden=384, 6 heads, 256 seq len) instead of 7B
- Fix cu_seqlens: pass FULL sequence cu_seqlens (TE divides internally)
- Fix hidden_states: use per-rank token count but full cu_seqlens
- Remove FP8 recipe (not needed when autocast disabled)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
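The cu_seqlens fix above — passing FULL-sequence offsets and letting TE divide across CP ranks internally — relies on the standard THD convention of cumulative sequence lengths, sketched here:

```python
import torch


def build_cu_seqlens(seq_lens: list[int]) -> torch.Tensor:
    """Cumulative sequence offsets for a THD-packed batch: [0, l0, l0+l1, ...].

    Note: TE's attention takes the full-sequence cu_seqlens even under context
    parallelism; only the hidden_states tensor is per-rank.
    """
    cu = torch.zeros(len(seq_lens) + 1, dtype=torch.int32)
    cu[1:] = torch.cumsum(torch.tensor(seq_lens, dtype=torch.int32), dim=0)
    return cu
```

So for two CP ranks over sequences of lengths [3, 5, 2], each rank holds 5 of the 10 packed tokens but still passes the full `[0, 3, 8, 10]` offsets.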
@savitha-eng savitha-eng marked this pull request as ready for review March 10, 2026 18:07
@savitha-eng savitha-eng marked this pull request as draft March 10, 2026 18:07
@savitha-eng
Collaborator Author

@coderabbitai review

@coderabbitai
Contributor

coderabbitai bot commented Mar 10, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 18

Note

Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.

♻️ Duplicate comments (1)
bionemo-recipes/recipes/llama3_native_te/lepton_backup/lepton_configs/og2_recipe_clean_cp2_llama3_code.yaml (1)

1-96: ⚠️ Potential issue | 🟡 Minor

This file appears to be a personal debugging config and should be removed.

This config contains:

  • User-specific hardcoded paths (/data/savithas/)
  • User-specific secret references (wandb.savithas, HUGGING_FACE_HUB_TOKEN.savithas)
  • Specific node exclusions for debugging
  • Comments indicating it's a "control experiment" for debugging NaN issues

This appears to be a personal debugging artifact that shouldn't be committed to the main branch.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@bionemo-recipes/recipes/llama3_native_te/lepton_backup/lepton_configs/og2_recipe_clean_cp2_llama3_code.yaml`
around lines 1 - 96, This file is a personal debugging config and should be
removed from the repo: delete or move the YAML (job_name
"og2-7b-cp2-llama3-code") out of the main branch and/or add it to .gitignore;
remove user-specific hardcoded paths (repo_root, code_path), secrets
(wandb_secret, hf_secret), checkpoint_dir, and node exclusions (exclude_nodes)
or replace them with CI-safe placeholders, and ensure git_branch and any
personal comments are not committed; if you need to keep it for local debugging,
put it in a personal fork or a local-only directory and document the allowed
storage location.
🟡 Minor comments (4)
bionemo-recipes/recipes/opengenome2_llama_native_te/fp8_debugging.py-26-27 (1)

26-27: ⚠️ Potential issue | 🟡 Minor

Avoid setting logger level at module import time.

Setting logger.setLevel(logging.INFO) at module level overrides any application-level logging configuration. This can cause unexpected behavior when users want different log levels.

Suggested fix: remove module-level setLevel
 logger = logging.getLogger(__name__)
-logger.setLevel(logging.INFO)

If INFO-level logging is needed for this module specifically, configure it in the application's logging setup or use a handler with the appropriate level.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/fp8_debugging.py` around
lines 26 - 27, The module currently forces a logging level at import by calling
logger.setLevel(logging.INFO); remove that module-level setLevel call so the
module uses the application/global logging configuration instead, and if
module-specific INFO behavior is required, configure it externally (in
application logging setup) or attach a handler with its own level to the logger
rather than calling logger.setLevel in this file; look for the logger variable
created with logging.getLogger(__name__) and delete the subsequent
logger.setLevel(logging.INFO) invocation.
bionemo-recipes/recipes/opengenome2_llama_native_te/README.md-9-12 (1)

9-12: ⚠️ Potential issue | 🟡 Minor

Use descriptive link text instead of "here".

The link text "here" is not descriptive, which impacts accessibility and SEO. Static analysis flagged this as MD059.

Suggested fix
 This folder is a self-contained training example. You can download a zipped directory of this recipe
-alone by clicking
-[here](https://download-directory.github.io?url=https://github.com/NVIDIA/bionemo-framework/tree/main/bionemo-recipes/recipes/opengenome2_llama_native_te&filename=opengenome2-llama-native-te).
+alone by downloading the [opengenome2-llama-native-te recipe directory](https://download-directory.github.io?url=https://github.com/NVIDIA/bionemo-framework/tree/main/bionemo-recipes/recipes/opengenome2_llama_native_te&filename=opengenome2-llama-native-te).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/README.md` around lines 9
- 12, Replace the non-descriptive link text "here" in the README.md snippet with
a meaningful, descriptive phrase (e.g., "download this recipe as a zipped
directory" or "download the opengenome2-llama-native-te recipe") so the anchor
text clearly describes the link target; update the anchor around the URL
https://download-directory.github.io?url=https://github.com/NVIDIA/bionemo-framework/tree/main/bionemo-recipes/recipes/opengenome2_llama_native_te&filename=opengenome2-llama-native-te
used in the README to use that descriptive text instead of "here" (locate the
anchor in the README.md block that contains the current "[here](...)").
bionemo-recipes/recipes/opengenome2_llama_native_te/collator.py-500-548 (1)

500-548: ⚠️ Potential issue | 🟡 Minor

Thread cleanup in __iter__ may leave dangling prefetch thread.

When __iter__ is called, it calls self.close() to clean up any existing prefetch thread, but close() uses a 10-second timeout. If the previous iteration didn't complete and the thread is stuck, this could lead to resource leaks. Additionally, daemon threads will be killed abruptly on process exit, potentially leaving incomplete operations.

Consider logging a warning if the thread join times out:

Suggested improvement
     def close(self):
         """Stop the prefetch thread. Must be called before destroy_process_group()."""
         if self._prefetch_thread is not None:
             self._prefetch_thread.join(timeout=10)
+            if self._prefetch_thread.is_alive():
+                logger.warning("Prefetch thread did not terminate within timeout")
             self._prefetch_thread = None
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/collator.py` around lines
500 - 548, The close() call in __iter__ may leave a stuck _prefetch_thread
running because close() joins with a 10s timeout and the thread is a daemon;
change the cleanup to (1) attempt a bounded join, (2) if join timed out detect
thread.is_alive(), log a warning including thread id/context, and do not drop
the reference (keep _prefetch_thread so future close() calls can retry), and (3)
stop making the thread a daemon (create it in _kick_prefetch without
daemon=True) so it won't be abruptly killed on exit; update __iter__, close,
_kick_prefetch, and _do_one_prefetch to use these checks and logging around
_prefetch_thread joins and timeouts.
bionemo-recipes/recipes/opengenome2_llama_native_te/checkpoint.py-272-279 (1)

272-279: ⚠️ Potential issue | 🟡 Minor

Potential race condition with async checkpoint save.

When async_save=True, the code awaits the previous future before starting a new one, but dcp_async_save returns immediately. If prune_checkpoints is called while the async save is still in flight (after this block), files could be deleted before they're fully written.

Suggested fix: await async save before pruning
     if async_save:
         # If we're using asynchronous checkpointing, make sure we only have one checkpoint future at a time.
         if "fsdp2" in _ckpt_futures and _ckpt_futures["fsdp2"] is not None:
             _ckpt_futures["fsdp2"].result()
 
         _ckpt_futures["fsdp2"] = dcp_async_save(state_dict, checkpoint_id=checkpoint_path, process_group=process_group)
+        # Wait for async save to complete before pruning to avoid deleting in-flight checkpoints
+        if max_checkpoints is not None:
+            _ckpt_futures["fsdp2"].result()
     else:
         dcp_save(state_dict, checkpoint_id=checkpoint_path, process_group=process_group)
 
-    if max_checkpoints is not None and dist_config.is_main_process():
+    if max_checkpoints is not None and dist_config.is_main_process() and not async_save:
         prune_checkpoints(ckpt_path, max_checkpoints)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/checkpoint.py` around
lines 272 - 279, When async_save is true the code starts a new dcp_async_save
and assigns its future to _ckpt_futures["fsdp2"] but prune_checkpoints may run
before that future completes, causing deletion of files still being written;
update the logic so prune_checkpoints awaits the outstanding future (check and
await _ckpt_futures["fsdp2"] if not None and not done) before removing old
checkpoints, or alternatively await the new future (returned by dcp_async_save)
prior to invoking prune_checkpoints; reference the symbols _ckpt_futures,
dcp_async_save, prune_checkpoints and the async_save branch to locate and change
the behavior.
🧹 Nitpick comments (16)
bionemo-recipes/recipes/opengenome2_llama_native_te/scripts/minimal_repro_thd_gqa_cp_nan.py (3)

193-194: Add return type annotation for completeness.

Suggested fix
-def main():
+def main() -> None:
     """Run THD+GQA+CP NaN reproduction tests."""
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@bionemo-recipes/recipes/opengenome2_llama_native_te/scripts/minimal_repro_thd_gqa_cp_nan.py`
around lines 193 - 194, The function main lacks a return type annotation; update
the def main() signature (the main function in this module) to include an
explicit return type (e.g., def main() -> None:) to indicate it does not return
a value and improve type clarity.

86-86: Add return type annotation.

The function returns bool but is missing the return type hint. For Pyright type checking compliance:

Suggested fix
-def run_test(num_kv_heads: int, num_attn_heads: int = 6, num_steps: int = 20):
+def run_test(num_kv_heads: int, num_attn_heads: int = 6, num_steps: int = 20) -> bool:
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@bionemo-recipes/recipes/opengenome2_llama_native_te/scripts/minimal_repro_thd_gqa_cp_nan.py`
at line 86, The function run_test(num_kv_heads: int, num_attn_heads: int = 6,
num_steps: int = 20) is missing a return type annotation; update its signature
to declare the return type as bool (-> bool) and ensure all return statements
within run_test produce boolean values so Pyright type checking passes.

46-83: Consider adding return type annotation and Google-style docstring sections.

The function logic is correct. For coding guideline compliance, consider adding:

  • A more specific return type (e.g., dict[str, torch.Tensor | int])
  • Google-style Args: and Returns: sections

Since this is a debugging script, this is a minor refinement.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@bionemo-recipes/recipes/opengenome2_llama_native_te/scripts/minimal_repro_thd_gqa_cp_nan.py`
around lines 46 - 83, Add an explicit, more specific return type to
create_mock_thd_batch (for example dict[str, torch.Tensor | int]) and expand the
existing docstring to Google style with Args: and Returns: sections describing
each parameter (num_sequences, seq_length, hidden_size, vocab_size, cp_size,
device) and the returned mapping keys ("hidden_states", "cu_seqlens",
"max_seqlen", "labels" and their types); keep the implementation unchanged but
update the function signature and docstring to match project doc/type
guidelines.
bionemo-recipes/recipes/opengenome2_llama_native_te/Dockerfile (2)

9-9: Consider adding a .dockerignore file.

COPY . . copies the entire build context. Without a .dockerignore, this may include unnecessary files (e.g., .git, __pycache__, test artifacts) that bloat the image and slow builds.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/Dockerfile` at line 9,
The Dockerfile uses COPY . . which will include the entire build context; add a
.dockerignore at the project root to exclude unnecessary files (e.g., .git,
.github, node_modules, __pycache__, *.pyc, tests/, .venv, build/, dist/, .env)
so image builds are smaller and faster; update repository by creating the
.dockerignore file with those entries and keep the Dockerfile unchanged (or
reference .dockerignore in docs) to ensure unwanted files are not copied during
the COPY . . step.

1-9: Consider adding a non-root user for improved security.

The Dockerfile runs as root, which is flagged by static analysis (Trivy DS-0002). For production or shared environments, running containers as a non-root user reduces the attack surface.

Example: Add non-root user
 # syntax=docker/dockerfile:1.4
 FROM nvcr.io/nvidia/pytorch:26.02-py3

 RUN --mount=type=cache,target=/root/.cache/pip \
     --mount=type=bind,source=requirements.txt,target=/requirements.txt \
     PIP_CONSTRAINT= pip install -r /requirements.txt

 WORKDIR /workspace/bionemo
 COPY . .
+
+# Create non-root user
+RUN useradd -m -u 1000 bionemo && chown -R bionemo:bionemo /workspace/bionemo
+USER bionemo
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/Dockerfile` around lines
1 - 9, The Dockerfile currently runs as root; create and switch to a non-root
user to address DS-0002: add steps to create a user/group (e.g.,
addgroup/adduser or groupadd/useradd) and set ownership of WORKDIR
(/workspace/bionemo) and copied files (referencing WORKDIR and COPY . .) before
switching with USER <username>; ensure any pip/cache directories used in RUN
(--mount paths) are accessible to that user and that subsequent RUN/CMD steps
run under the non-root user instead of root.
bionemo-recipes/recipes/opengenome2_llama_native_te/requirements.txt (1)

1-12: Consider pinning dependency versions for reproducibility.

Most dependencies are unpinned, which can lead to non-reproducible builds when package versions change. For a training recipe, reproducibility is important.

Additionally, the nvdlfw_inspect git dependency lacks a commit hash or tag, meaning builds may break or behave differently over time.

Example: Pin versions and add commit hash
-datasets
+datasets==2.18.0
 hydra-core
-torch
+torch>=2.2.0
 torchao!=0.14.0
 torchdata
 torchmetrics
 tqdm
 transformer_engine[pytorch]
 transformers
 wandb
 zstandard
-nvdlfw_inspect @ git+https://github.com/NVIDIA/nvidia-dlfw-inspect
+nvdlfw_inspect @ git+https://github.com/NVIDIA/nvidia-dlfw-inspect@<commit-hash>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/requirements.txt` around
lines 1 - 12, Pin all top-level dependencies in requirements.txt to specific,
tested versions (e.g., datasets, hydra-core, torch, torchdata, torchmetrics,
tqdm, transformer_engine[pytorch], transformers, wandb, zstandard) to ensure
reproducible installs, and replace the loose git URL for nvdlfw_inspect with a
reference that includes a stable tag or full commit hash (e.g., add
@<tag-or-commit> after the repo URL); update any version constraints (== or ~=)
to the exact versions used in testing and add a short comment noting why each
pin was chosen.
bionemo-recipes/recipes/opengenome2_llama_native_te/tokenizers/nucleotide_fast_tokenizer/README.md (1)

1-4: Consider adding a brief usage example.

The README provides minimal context. A short code snippet showing how to load and use the tokenizer would be helpful for users unfamiliar with the setup.

Example addition
 # A transformers-based tokenizer for DNA sequences
 
 This tokenizer is similar to the one used in the Evo-2 DNA model, and tokenizes DNA sequences using the `ord()`
 function.
+
+## Usage
+
+```python
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("path/to/nucleotide_fast_tokenizer")
+tokens = tokenizer("ACGTACGT")
+```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@bionemo-recipes/recipes/opengenome2_llama_native_te/tokenizers/nucleotide_fast_tokenizer/README.md`
around lines 1 - 4, The README lacks a usage example; add a short code snippet
demonstrating how to load and call the tokenizer using transformers'
AutoTokenizer (e.g.,
AutoTokenizer.from_pretrained("path/to/nucleotide_fast_tokenizer")) and a simple
example call like tokenizer("ACGTACGT") (or
tokenizer.encode/tokenizer.convert_tokens_to_ids as appropriate) so users can
see how to instantiate and tokenize DNA strings; include this under a “Usage” or
“Example” section in the README.md near the top description.
bionemo-recipes/recipes/opengenome2_llama_native_te/checkpoint.py (3)

111-115: Simplify boolean return.

The condition can be returned directly without the if-else.

Suggested simplification
 def should_save_checkpoint(step: int, save_every_n_steps: int) -> bool:
     """Determine if a checkpoint should be saved."""
-    if save_every_n_steps > 0 and step % save_every_n_steps == 0 and step > 0:
-        return True
-    return False
+    return save_every_n_steps > 0 and step % save_every_n_steps == 0 and step > 0
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/checkpoint.py` around
lines 111 - 115, The function should_save_checkpoint currently uses an
if/return/return pattern; simplify by returning the boolean expression directly:
replace the conditional block in should_save_checkpoint(step: int,
save_every_n_steps: int) -> bool with a single return of (save_every_n_steps > 0
and step % save_every_n_steps == 0 and step > 0) so behavior remains identical
but code is concise.

173-177: Use lazy % formatting in logger calls for consistency.

The logging calls use f-strings, but logger methods support lazy %-style formatting, which skips string interpolation entirely when the log level is disabled.

Suggested change
         if incompatible.missing_keys:
-            logger.warning(f"Missing keys when loading checkpoint: {incompatible.missing_keys}")
+            logger.warning("Missing keys when loading checkpoint: %s", incompatible.missing_keys)
         if incompatible.unexpected_keys:
-            logger.warning(f"Unexpected keys when loading checkpoint: {incompatible.unexpected_keys}")
+            logger.warning("Unexpected keys when loading checkpoint: %s", incompatible.unexpected_keys)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/checkpoint.py` around
lines 173 - 177, The logger.warning calls in the checkpoint loader use f-strings
causing eager interpolation; change them to lazy formatting by passing the
message template and the values as parameters (e.g., replace
logger.warning(f"Missing keys when loading checkpoint:
{incompatible.missing_keys}") with logger.warning("Missing keys when loading
checkpoint: %s", incompatible.missing_keys) and do the same for unexpected_keys)
so that logging in the block that checks incompatible (and its
missing_keys/unexpected_keys) uses logger.warning with format string +
arguments.

68-71: Consider using a more explicit type annotation for the module-level dict.

The mutable module-level dict is intentional for tracking async checkpoint futures, but the type annotation could be more specific.

Suggested improvement
-_ckpt_futures: dict = {}
+_ckpt_futures: dict[str, torch.futures.Future | None] = {}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/checkpoint.py` around
lines 68 - 71, The module-level variable _ckpt_futures currently has a generic
dict annotation; change it to an explicit typing annotation such as Dict[str,
asyncio.Future[Any]] (or the specific Future result type used by dcp_async_save)
so callers and linters know keys are strategy names and values are asyncio
Futures; add necessary imports (from typing import Dict, Any and import asyncio)
and update the declaration of _ckpt_futures accordingly, referencing the
_ckpt_futures symbol and the dcp_async_save-returned Future type to choose the
exact Future result type.
bionemo-recipes/recipes/opengenome2_llama_native_te/README.md (2)

58-62: Benchmark tables contain placeholder values.

The convergence benchmark table has placeholder "—" values that should be filled in before merging.

Would you like help filling in these benchmark values, or should this be tracked as a follow-up task?

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/README.md` around lines
58 - 62, The convergence benchmark table in README.md contains placeholder "—"
values for "This recipe (OG2 7B)" and "Megatron baseline (OG2 7B)"; replace
those placeholders with the actual measured metrics (Step / checkpoint,
Perplexity, Train loss, Test loss) collected from your training/evaluation runs,
or if metrics are not yet available, add a clear TODO note and link to the
tracking issue/experiment ID; update the two table rows ("This recipe (OG2 7B)"
and "Megatron baseline (OG2 7B)") accordingly so the README reflects real
results or a tracked follow-up.

88-102: Mermaid charts contain placeholder instructions.

The chart titles say "fill with your numbers", indicating these are placeholders. Consider one of the following:

  1. Filling in actual benchmark data
  2. Removing the charts until data is available
  3. Adding a note that these are example visualizations
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/README.md` around lines
88 - 102, The README contains two Mermaid charts whose titles include the
placeholder text "fill with your numbers" (the xychart-beta blocks titled "MFU
(%) — fill with your numbers" and "Step time (s) — fill with your numbers");
replace the placeholders by either (a) populating the bar arrays with actual
benchmark values and updating the title to reflect measured data, (b) removing
the two mermaid xychart-beta blocks entirely if no data is available, or (c)
keeping the charts but changing the title/text to clearly mark them as examples
(e.g., "Example — replace with your measurements") and adding a short caption
explaining they are illustrative only. Ensure the chosen change appears in the
README.md where those mermaid blocks are defined.
bionemo-recipes/recipes/opengenome2_llama_native_te/hydra_config/og2_7b_thd_gqa_global_shuffle.yaml (1)

43-56: Hardcoded data path may limit portability.

The path /data/opengenome2/parquet_split is hardcoded. Consider using a placeholder or environment variable reference for better portability across environments.

Suggested change
   load_dataset_kwargs:
-    path: "/data/opengenome2/parquet_split"
+    path: ${oc.env:OG2_DATA_PATH,/data/opengenome2/parquet_split}
     split: "train"
     streaming: true

Note: This uses OmegaConf's environment variable resolver, allowing users to override via OG2_DATA_PATH env var.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@bionemo-recipes/recipes/opengenome2_llama_native_te/hydra_config/og2_7b_thd_gqa_global_shuffle.yaml`
around lines 43 - 56, The dataset load path is hardcoded in
dataset.load_dataset_kwargs.path; change it to use OmegaConf's environment
variable resolver so users can override via OG2_DATA_PATH while keeping the
current path as the default. Update dataset.load_dataset_kwargs.path to
reference the env var (oc.env) with a fallback to the existing
"/data/opengenome2/parquet_split" so deployments can set OG2_DATA_PATH when
needed.
bionemo-recipes/recipes/opengenome2_llama_native_te/collator.py (2)

446-455: Variable shadowing: batch used in list comprehension shadows outer variable.

The variable batch in the list comprehension shadows the batch variable from line 405, which could cause confusion during debugging.

Suggested fix
         if self._is_cp_row_major:
             # Flattened mesh: [(cp0,tp0), (cp0,tp1), (cp1,tp0), (cp1,tp1)]
             # Output: [cp0, cp0, cp1, cp1]
-            combined_batch = [batch for batch in combined_batch for _ in range(self.tp_world_size)]
+            combined_batch = [shard for shard in combined_batch for _ in range(self.tp_world_size)]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/collator.py` around lines
446 - 455, The list comprehensions in the collator (inside the method using
self._is_cp_row_major) reuse the name "batch" which shadows the outer variable
from earlier in the function; update those comprehensions to use a distinct
inner variable name (e.g., inner_batch or item) instead of "batch" in the first
branch and similarly use a clear loop variable in the second branch so the outer
"batch" is not shadowed; adjust both expressions that build combined_batch to
use the new inner variable names (refer to self._is_cp_row_major,
self.tp_world_size, self.cp_world_size and the existing combined_batch
construction) so behavior is unchanged but variable shadowing is removed.

958-996: Consider adding sbhd format support or removing it from validation.

Line 958 validates that qvk_format is one of ["thd", "bshd", "sbhd"], but lines 993-994 raise an error for sbhd saying "Support not implemented yet". This is confusing: either remove sbhd from the valid formats or implement support.

Option 1: Remove sbhd from validation
-    if qvk_format not in ["thd", "bshd", "sbhd"]:
+    if qvk_format not in ["thd", "bshd"]:
         raise ValueError(f"Unsupported qvk_format: {qvk_format}!")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/collator.py` around lines
958 - 996, The validation allows "sbhd" but there is no handler for it, causing
a misleading validation; either remove "sbhd" from the allowed list next to
qvk_format or add explicit support by adding an elif branch for qvk_format ==
"sbhd" that processes input_ids_padded and labels_padded (similar to the "bshd"
branch) and/or calls a new helper like _process_tensor_sbhd (or re-use
_process_tensor_bshd if semantics match), and ensure the starting validation
list is updated to match the implemented branches (symbols to edit: qvk_format
validation list, the elif blocks handling "bshd" and add "sbhd", and implement
or reuse _process_tensor_sbhd/_process_tensor_bshd accordingly).
bionemo-recipes/recipes/opengenome2_llama_native_te/tests/test_train_two_gpu.py (1)

43-58: Handle TimeoutExpired explicitly and make the timeout configurable.

If subprocess.run hits the 240s limit, the test errors out before the captured stdout/stderr can be printed, which makes the likely failure mode much harder to debug. Two-GPU TE/FSDP2 startup can easily exceed four minutes on shared runners.

Proposed helper update
-def run_train_cmd(cmd, recipe_path):
+def run_train_cmd(cmd, recipe_path, timeout: int = 600):
     """Run a training command and check for errors."""
-    result = subprocess.run(
-        cmd,
-        check=False,
-        text=True,
-        stdout=subprocess.PIPE,
-        stderr=subprocess.PIPE,
-        timeout=240,
-        cwd=str(recipe_path),
-    )
+    try:
+        result = subprocess.run(
+            cmd,
+            check=False,
+            text=True,
+            stdout=subprocess.PIPE,
+            stderr=subprocess.PIPE,
+            timeout=timeout,
+            cwd=str(recipe_path),
+        )
+    except subprocess.TimeoutExpired as exc:
+        pytest.fail(
+            f"Command timed out after {timeout}s:\n{' '.join(cmd)}\n"
+            f"STDOUT:\n{exc.stdout or ''}\nSTDERR:\n{exc.stderr or ''}"
+        )
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@bionemo-recipes/recipes/opengenome2_llama_native_te/tests/test_train_two_gpu.py`
around lines 43 - 58, The helper run_train_cmd currently lets subprocess.run
raise TimeoutExpired (with a hardcoded 240s), so when the process times out you
lose stdout/stderr and useful diagnostics; update run_train_cmd to accept a
configurable timeout parameter (e.g., timeout_seconds with default 240) and wrap
the subprocess.run call in a try/except that catches subprocess.TimeoutExpired,
printing any available output (exc.stdout / exc.stderr or the partial result)
and the exception details before failing the test; also update callers to pass a
larger timeout where needed (or use an env var) so two-GPU TE/FSDP2 startup can
complete on slower runners.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In
`@bionemo-recipes/recipes/llama3_native_te/lepton_backup/submit_og2_lepton_eden.py`:
- Around line 155-168: The current construction of hydra_overrides and
hydra_args_formatted interpolates raw config values (cfg.wandb_project,
cfg.wandb_name, cfg.extra_hydra_overrides and any path fields like
dataset_path/code_path) into a bash -c script and is vulnerable to shell
injection and breaking on spaces/quotes; fix by shell-quoting/escaping every
external value before adding to hydra_overrides (use a safe escaper such as
Python's shlex.quote or JSON/string literal quoting) so that the entries added
in the block that appends wandb.project/wandb.name and the loop that extends
extra_hydra_overrides are each sanitized, and then build hydra_args_formatted
from those already-escaped entries (ensure you treat each
cfg.extra_hydra_overrides element as a single token and escape it as well).
- Around line 234-235: Replace the current unsafe invocations that expand
secrets on the command line: change the python -c "from huggingface_hub import
login; login(token='${{HF_TOKEN}}')" call to read HF_TOKEN from the environment
inside Python (e.g. import os; from huggingface_hub import login;
login(token=os.environ.get('HF_TOKEN')) — locate the login(token=...) snippet),
and remove the wandb login ${{WANDB_API_KEY}} command; instead rely on setting
WANDB_API_KEY in the process environment (W&B will pick it up) or use secure
stdin-based login if needed (avoid using the wandb login <token> argv form
referenced by the wandb login symbol).
- Around line 190-199: The follower ranks currently use a fixed sleep (sleep 30)
after the rank-0 creates /tmp/git_sync_complete_marker; change this to an active
wait that polls for the marker file and only proceeds once it exists to
guarantee all ranks use the same checkout. Update the follower branch (the else
block that references NODE_RANK and cd {repo_root}) to loop until test -f
/tmp/git_sync_complete_marker (with a short sleep between polls), then read/echo
the marker (git commit) and then cd {repo_root} before continuing; ensure the
rank-0 code still writes the marker with echo "$(git rev-parse HEAD)" >
/tmp/git_sync_complete_marker so the follower can verify it.
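
The shell-quoting fix above can be sketched with `shlex.quote`; `build_hydra_args` is a hypothetical helper, not the submit script's actual function:

```python
import shlex


def build_hydra_args(wandb_project: str, wandb_name: str, extra_overrides: list[str]) -> str:
    """Quote every externally supplied value so it survives `bash -c` intact."""
    overrides = [
        f"wandb.project={shlex.quote(wandb_project)}",
        f"wandb.name={shlex.quote(wandb_name)}",
    ]
    # Each extra override is treated as a single token and quoted whole,
    # so spaces or shell metacharacters cannot split it or inject commands.
    overrides.extend(shlex.quote(o) for o in extra_overrides)
    return " ".join(overrides)
```

With this in place, a value like `run with spaces` arrives as one argv token, and `a.b=1; rm -rf /` stays an inert quoted string rather than a second shell command.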

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/dataset.py`:
- Around line 297-312: The TokenPackingDataset is not told to account for
post-collation padding, so batches sized by TokenPackingDataset can exceed
token_micro_batch_size after DataCollatorWithFlattening pads to
pad_sequences_to_be_divisible_by; update the instantiation of
TokenPackingDataset (used to build train_dataloader) to pass the padding divisor
through (e.g. add
pad_sequences_to_be_divisible_by=pad_sequences_to_be_divisible_by or the
corresponding parameter name expected by TokenPackingDataset) so the dataset
computes budgets using the same divisor as
GenomicDataCollator/DataCollatorWithFlattening.
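
The divisor accounting described above amounts to ceiling-rounding each sequence length before budgeting; `padded_length` is an illustrative helper, not part of the recipe:

```python
def padded_length(seq_len: int, divisor: int) -> int:
    """Round a sequence length up to the next multiple of the padding divisor.

    If the packing dataset budgets with raw lengths while the collator pads each
    sequence to a multiple of `divisor`, the collated batch can exceed the token
    micro-batch size; budgeting with this rounded length keeps the two in sync.
    """
    return -(-seq_len // divisor) * divisor  # ceiling division, then scale back up
```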

In
`@bionemo-recipes/recipes/opengenome2_llama_native_te/hydra_config/defaults.yaml`:
- Around line 15-19: The defaults enable spike_no_more_embedding_init and
use_megatron_scaled_init while use_meta_device remains true, but
NVLlamaPreTrainedModel.init_empty_weights (and the comment) indicate meta-device
init breaks those distributions; fix by making the defaults consistent: either
set use_meta_device: false when spike_no_more_embedding_init or
use_megatron_scaled_init are true, or disable those init flags by default if
use_meta_device must stay true. Update the YAML keys
(spike_no_more_embedding_init, use_megatron_scaled_init, use_meta_device) so the
default recipe preserves the advertised OG2 initialization behavior.

In
`@bionemo-recipes/recipes/opengenome2_llama_native_te/hydra_config/og2_7b_thd_gqa.yaml`:
- Around line 105-112: Validation windowing overrides (validation.max_seq_length
and validation.stride) are ignored by the current trainer; update train_fsdp2.py
so that when building val_dataset_kwargs it copies validation.max_seq_length and
validation.stride (in addition to validation.data_path and
validation.micro_batch_size) into val_dataset_kwargs or otherwise applies them
to the validation dataset creation path (use the same keys/names expected by the
dataset loader and fall back to training windowing only if those validation
fields are null). Ensure the code references the existing val_dataset_kwargs
construction point and the validation.* config keys to avoid silent use of
training windowing during validation.

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/modeling_llama_te.py`:
- Around line 205-235: The code still dereferences input_ids when callers supply
inputs_embeds; update the branches that use input_ids (inside should_pack_inputs
and the InferenceParams length computation) to fall back to inputs_embeds when
input_ids is None: use inputs_embeds.size(1) (or .shape) for padded_seq_len and
any shape comparisons, and when computing lengths use attention_mask.shape ==
(inputs_embeds.shape) or use inputs_embeds.size(1) to derive sequence lengths;
adjust the conditionals around _unpad_input, the cu_seq_lens/max_length
assignments, and the lengths list passed to past_key_values.pre_step so they
work whether input_ids or inputs_embeds was provided (referencing
should_pack_inputs, _unpad_input, attention_mask, inputs_embeds, input_ids,
InferenceParams, and pre_step).

In
`@bionemo-recipes/recipes/opengenome2_llama_native_te/opengenome_modeling_llama_te.py`:
- Around line 346-350: The code assumes input_ids exists when should_pack_inputs
is true, causing AttributeError if callers passed inputs_embeds; update the
packing path to derive padded_seq_len and batch_size from inputs_embeds when
present (fall back to input_ids otherwise), e.g., use inputs_embeds.size(1) /
.shape or hidden_states.shape to compute padded_seq_len and batch_size before
calling _unpad_input; apply the same change to the other occurrence mentioned
(the cache-length computation block around lines 360-365) so all logic
referencing input_ids handles inputs_embeds or uses hidden_states as a fallback.
- Line 255: The code is using config.dtype (and passing strings) where PyTorch
expects a torch.dtype; update all usages to use config.torch_dtype (e.g.,
replace the dtype arg in the nn.Embedding creation at self.embed_tokens and the
dtype args passed to the TransformerEngine constructors referenced in this file)
and ensure the value is an actual torch.dtype (if configs may supply a string,
convert it to a torch.dtype via torch.<type> lookup before use). Also
standardize configs to use the HF key torch_dtype instead of the custom dtype so
subsequent code can rely on config.torch_dtype being present and correct.
- Around line 245-255: The model class NVLlamaModel lacks the embedding
accessors required by PreTrainedModel.tie_weights(), so implement
get_input_embeddings(self) to return self.embed_tokens and
get_output_embeddings(self) to return getattr(self, "lm_head", None) (and
optionally set_input_embeddings(self, new_emb) to assign self.embed_tokens) so
tie_word_embeddings actually ties lm_head to the token embeddings; apply the
same three accessor methods to the other related model class around lines
418-436 (the class that defines lm_head there) so the tying mechanism works for
both classes.

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/perf_logger.py`:
- Around line 118-129: The logging window miscounts tokens because
num_tokens/num_unpadded_tokens are only updated when step % logging_frequency ==
0 while running_loss and grad_acc_step_count update every step; change the
accounting so token counts are accumulated every step (same place running_loss
and grad_acc_step_count are incremented inside with torch.no_grad()) and keep
the existing log emission that checks step against logging_frequency (or
alternatively change the emit condition to (step+1) % logging_frequency == 0) so
the tokens, loss, and grad_acc_step_count reflect the same window; update
references to grad_acc_step_count, running_loss, num_tokens,
num_unpadded_tokens, logging_frequency, and step to be consistent across
accumulation and logging.
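
The windowing fix can be sketched as a small counter object that accumulates loss and token counts together on every step; `WindowCounters` and its method names are illustrative, not the recipe's actual PerfLogger API:

```python
class WindowCounters:
    """Accumulate loss and token counts in lockstep, emitting once per window."""

    def __init__(self, logging_frequency: int):
        self.logging_frequency = logging_frequency
        self._reset()

    def _reset(self):
        self.running_loss = 0.0
        self.num_tokens = 0
        self.grad_acc_step_count = 0

    def update(self, loss: float, tokens: int):
        # Token counts advance on every step, together with the loss, so the
        # emitted averages all describe the same window.
        self.running_loss += loss
        self.num_tokens += tokens
        self.grad_acc_step_count += 1

    def maybe_emit(self, step: int):
        if (step + 1) % self.logging_frequency != 0:
            return None
        metrics = {
            "loss": self.running_loss / self.grad_acc_step_count,
            "tokens": self.num_tokens,
        }
        self._reset()
        return metrics
```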

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/scheduler.py`:
- Around line 21-27: The get_cosine_annealing_schedule_with_warmup function must
validate inputs before computing the schedule: add guards that raise clear
ValueError(s) if num_decay_steps is 0 or negative (to avoid ZeroDivisionError in
the cosine branch) and if min_lr_ratio is out of the [0.0, 1.0] range (to
prevent negative or increasing LRs); place these checks at the start of
get_cosine_annealing_schedule_with_warmup and return/raise immediately with
descriptive messages referencing the offending parameter names.
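
A minimal sketch of the guarded schedule follows; the real function in scheduler.py may take different arguments (e.g. an optimizer), but the guards sit at the top either way:

```python
import math


def get_cosine_annealing_schedule_with_warmup(
    num_warmup_steps: int, num_decay_steps: int, min_lr_ratio: float = 0.0
):
    # Validate up front: a zero decay length divides by zero in the cosine
    # branch, and a ratio outside [0, 1] yields negative or increasing LRs.
    if num_decay_steps <= 0:
        raise ValueError(f"num_decay_steps must be > 0, got {num_decay_steps}")
    if not 0.0 <= min_lr_ratio <= 1.0:
        raise ValueError(f"min_lr_ratio must be in [0.0, 1.0], got {min_lr_ratio}")

    def lr_lambda(step: int) -> float:
        if step < num_warmup_steps:
            return step / max(1, num_warmup_steps)  # linear warmup
        progress = min(1.0, (step - num_warmup_steps) / num_decay_steps)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return min_lr_ratio + (1.0 - min_lr_ratio) * cosine

    return lr_lambda
```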

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/tests/conftest.py`:
- Around line 53-69: The autouse fixture device_mesh currently assumes CUDA
exists and unconditionally sets CUDA device and synchronizes; update it to first
query torch.cuda.is_available() and torch.cuda.device_count() and compare to
DistributedConfig().local_rank to decide whether CUDA GPU setup should run; only
call torch.cuda.set_device, torch.cuda.empty_cache, and torch.cuda.synchronize
when CUDA is available and local_rank < device_count, and ensure
torch.distributed.init_process_group / torch.distributed.destroy_process_group
are only invoked (or torn down) when initialized (or guarded) so CPU-only
environments skip GPU setup cleanly; reference the fixture name device_mesh, the
DistributedConfig usage, and the calls to torch.distributed.init_process_group,
torch.cuda.set_device, torch.cuda.empty_cache, and torch.cuda.synchronize when
implementing the guard.

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/tests/test_train.py`:
- Around line 59-76: The test test_sanity_convergence_fsdp2_te_thd is not
triggering the THD dataloader because main_fsdp2 selects create_thd_dataloader
based on use_sequence_packing, not on config_kwargs.attn_input_format; update
the test to set the configuration option that enables sequence packing (e.g.,
set use_sequence_packing=true or the exact hydra override key that main_fsdp2
checks) so main_fsdp2 will call create_thd_dataloader and exercise the THD path
rather than only setting config_kwargs.attn_input_format.

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2_cp.py`:
- Around line 146-154: Add a fail-fast guard that rejects using meta-device
initialization together with the custom init modes: when getattr(args,
"use_meta_device", False) is true and either getattr(args,
"spike_no_more_embedding_init", False) or getattr(args,
"use_megatron_scaled_init", False) is true, raise a clear exception (or call
parser.error) explaining that use_meta_device is incompatible with
spike_no_more_embedding_init and use_megatron_scaled_init; place this check
before you mutate config_kwargs (near where config_kwargs is prepared and where
spike_no_more_embedding_init / use_megatron_scaled_init are handled) so the bad
combination is detected early.
- Around line 124-128: The mesh construction using init_device_mesh can produce
an invalid shape when args.cp_size doesn't evenly divide dist_config.world_size;
validate args.cp_size before calling init_device_mesh by checking it's a
positive integer ≤ dist_config.world_size and that dist_config.world_size %
args.cp_size == 0, and raise a clear ValueError (or argparse/cli error)
indicating the mismatch (include values for dist_config.world_size and
args.cp_size) so the misconfiguration fails fast rather than causing a later
distributed init error.
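
A sketch of the fail-fast divisibility check, with a hypothetical helper name:

```python
def validate_cp_size(world_size: int, cp_size: int) -> int:
    """Fail fast on context-parallel sizes that cannot tile the world size."""
    if cp_size <= 0 or cp_size > world_size or world_size % cp_size != 0:
        raise ValueError(
            f"cp_size={cp_size} must be a positive divisor of world_size={world_size}; "
            f"the (dp, cp) device mesh cannot be built otherwise"
        )
    return world_size // cp_size  # the resulting data-parallel mesh dimension
```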

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2.py`:
- Around line 141-149: The code allows incompatible combinations where
use_meta_device breaks Megatron-scaled and Spike-No-More initializations; add a
hard validation that raises an error when use_meta_device is true and either
args.use_megatron_scaled_init or args.spike_no_more_embedding_init is set.
Specifically, in train_fsdp2.py (near where getattr(args,
"spike_no_more_embedding_init", ...) and getattr(args,
"use_megatron_scaled_init", ...) are handled), check getattr(args,
"use_meta_device", False) and if true and (getattr(args,
"use_megatron_scaled_init", False) or getattr(args,
"spike_no_more_embedding_init", False)) raise a ValueError (or parser.error)
with a clear message naming use_meta_device, use_megatron_scaled_init, and
spike_no_more_embedding_init so the process fails fast. Ensure this validation
runs before modifying config_kwargs or proceeding with initialization.
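
The validation can be sketched as below; `validate_init_flags` is an illustrative name, and in train_fsdp2.py the check would read the same flags off `args` before `config_kwargs` is touched:

```python
def validate_init_flags(
    use_meta_device: bool,
    use_megatron_scaled_init: bool,
    spike_no_more_embedding_init: bool,
) -> None:
    # Meta-device tensors are never materialized through the custom init code,
    # so this combination would silently train with default initialization.
    if use_meta_device and (use_megatron_scaled_init or spike_no_more_embedding_init):
        raise ValueError(
            "use_meta_device=True is incompatible with use_megatron_scaled_init "
            "and spike_no_more_embedding_init; disable meta-device init to use "
            "these initialization schemes."
        )
```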

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/validation.py`:
- Around line 73-88: The current try/except in the validation loop (around the
forward pass using transformer_engine.pytorch.autocast and model(**batch))
swallows all exceptions and can return misleading metrics when no batches
succeed; update the exception handling to re-raise critical errors or at minimum
raise an exception when num_evaluated == 0 after the loop. Specifically, inside
the except block for the forward pass in the validation function, do not just
logger.warning and continue—either re-raise the caught exception (raise) or set
a flag and after the loop check num_evaluated (or total_tokens) and raise a
RuntimeError indicating validation failed if zero batches were evaluated; ensure
this touches the same symbols: model(**batch), total_loss, total_tokens,
total_weighted_loss, and num_evaluated so metrics are not returned as
val_loss=0/val_ppl=1 when no successful forward occurred.
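
The zero-batch guard can be sketched as a finalizer over the loop's accumulators; `finalize_validation` is a hypothetical helper, not the module's actual API:

```python
import math


def finalize_validation(total_weighted_loss: float, total_tokens: int, num_evaluated: int):
    # Refuse to report val_loss=0 / val_ppl=1 when every batch failed.
    if num_evaluated == 0 or total_tokens == 0:
        raise RuntimeError("validation failed: no batches were evaluated successfully")
    val_loss = total_weighted_loss / total_tokens
    return {"val_loss": val_loss, "val_ppl": math.exp(val_loss)}
```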

---

Minor comments:
In `@bionemo-recipes/recipes/opengenome2_llama_native_te/checkpoint.py`:
- Around line 272-279: When async_save is true the code starts a new
dcp_async_save and assigns its future to _ckpt_futures["fsdp2"] but
prune_checkpoints may run before that future completes, causing deletion of
files still being written; update the logic so prune_checkpoints awaits the
outstanding future (check and await _ckpt_futures["fsdp2"] if not None and not
done) before removing old checkpoints, or alternatively await the new future
(returned by dcp_async_save) prior to invoking prune_checkpoints; reference the
symbols _ckpt_futures, dcp_async_save, prune_checkpoints and the async_save
branch to locate and change the behavior.

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/collator.py`:
- Around line 500-548: The close() call in __iter__ may leave a stuck
_prefetch_thread running because close() joins with a 10s timeout and the thread
is a daemon; change the cleanup to (1) attempt a bounded join, (2) if join timed
out detect thread.is_alive(), log a warning including thread id/context, and do
not drop the reference (keep _prefetch_thread so future close() calls can
retry), and (3) stop making the thread a daemon (create it in _kick_prefetch
without daemon=True) so it won't be abruptly killed on exit; update __iter__,
close, _kick_prefetch, and _do_one_prefetch to use these checks and logging
around _prefetch_thread joins and timeouts.

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/fp8_debugging.py`:
- Around line 26-27: The module currently forces a logging level at import by
calling logger.setLevel(logging.INFO); remove that module-level setLevel call so
the module uses the application/global logging configuration instead, and if
module-specific INFO behavior is required, configure it externally (in
application logging setup) or attach a handler with its own level to the logger
rather than calling logger.setLevel in this file; look for the logger variable
created with logging.getLogger(__name__) and delete the subsequent
logger.setLevel(logging.INFO) invocation.

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/README.md`:
- Around line 9-12: Replace the non-descriptive link text "here" in the
README.md snippet with a meaningful, descriptive phrase (e.g., "download this
recipe as a zipped directory" or "download the opengenome2-llama-native-te
recipe") so the anchor text clearly describes the link target; update the anchor
around the URL
https://download-directory.github.io?url=https://github.com/NVIDIA/bionemo-framework/tree/main/bionemo-recipes/recipes/opengenome2_llama_native_te&filename=opengenome2-llama-native-te
used in the README to use that descriptive text instead of "here" (locate the
anchor in the README.md block that contains the current "[here](...)").

---

Duplicate comments:
In
`@bionemo-recipes/recipes/llama3_native_te/lepton_backup/lepton_configs/og2_recipe_clean_cp2_llama3_code.yaml`:
- Around line 1-96: This file is a personal debugging config and should be
removed from the repo: delete or move the YAML (job_name
"og2-7b-cp2-llama3-code") out of the main branch and/or add it to .gitignore;
remove user-specific hardcoded paths (repo_root, code_path), secrets
(wandb_secret, hf_secret), checkpoint_dir, and node exclusions (exclude_nodes)
or replace them with CI-safe placeholders, and ensure git_branch and any
personal comments are not committed; if you need to keep it for local debugging,
put it in a personal fork or a local-only directory and document the allowed
storage location.

---

Nitpick comments:
In `@bionemo-recipes/recipes/opengenome2_llama_native_te/checkpoint.py`:
- Around line 111-115: The function should_save_checkpoint currently uses an
if/return/return pattern; simplify by returning the boolean expression directly:
replace the conditional block in should_save_checkpoint(step: int,
save_every_n_steps: int) -> bool with a single return of (save_every_n_steps > 0
and step % save_every_n_steps == 0 and step > 0) so behavior remains identical
but code is concise.
- Around line 173-177: The logger.warning calls in the checkpoint loader use
f-strings, which interpolate eagerly even when the warning would be
suppressed. Switch to lazy formatting by passing the template and values
separately: replace logger.warning(f"Missing keys when loading checkpoint:
{incompatible.missing_keys}") with logger.warning("Missing keys when loading
checkpoint: %s", incompatible.missing_keys), and do the same for
unexpected_keys in the block that checks incompatible.
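A small illustration of the lazy-formatting pattern; the helper name and key names here are hypothetical, not the checkpoint loader's actual API:

```python
import logging

logger = logging.getLogger("checkpoint_demo")


def log_incompatible_keys(missing_keys, unexpected_keys):
    """Log state-dict mismatches with lazy %s formatting instead of f-strings.

    The message is only interpolated when a handler actually emits the record,
    so suppressed warnings cost nothing.
    """
    if missing_keys:
        logger.warning("Missing keys when loading checkpoint: %s", missing_keys)
    if unexpected_keys:
        logger.warning("Unexpected keys when loading checkpoint: %s", unexpected_keys)
```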
- Around line 68-71: The module-level variable _ckpt_futures currently has a
generic dict annotation; give it an explicit type such as Dict[str,
asyncio.Future[Any]] (or the specific Future type returned by dcp_async_save)
so callers and linters know the keys are strategy names and the values are
futures. Add the necessary imports (from typing import Any, Dict and import
asyncio) and update the declaration accordingly.

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/collator.py`:
- Around line 446-455: The list comprehensions in the collator (inside the
method using self._is_cp_row_major) reuse the name "batch", shadowing the
outer variable defined earlier in the function. Rename the comprehension loop
variable to something distinct (e.g., inner_batch or item) in both branches
that build combined_batch (the self._is_cp_row_major branch and the
self.tp_world_size / self.cp_world_size branch) so the outer "batch" is no
longer shadowed; behavior is unchanged.
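A toy illustration of the rename (all names here are hypothetical, not the collator's real fields); the outer `batch` mapping stays visible while each comprehension uses its own `item` loop variable:

```python
# Outer variable that the original comprehensions shadowed by re-using "batch".
batch = {"input_ids": [[1, 2], [3, 4]]}

# Before: `[... for batch in micro_batches ...]` re-used the outer name.
# After: a distinct inner name keeps intent clear without changing behavior.
micro_batches = [{"input_ids": [[1, 2]]}, {"input_ids": [[3, 4]]}]
combined_batch = {
    key: [row for item in micro_batches for row in item[key]]
    for key in batch
}
```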
- Around line 958-996: The qvk_format validation accepts "sbhd" but no branch
handles it, so the validation is misleading. Either remove "sbhd" from the
allowed list next to qvk_format, or add an explicit elif qvk_format == "sbhd"
branch that processes input_ids_padded and labels_padded (mirroring the
"bshd" branch, via a new helper such as _process_tensor_sbhd, or reusing
_process_tensor_bshd if the semantics match). Either way, keep the validation
list in sync with the implemented branches.
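One way to keep the allowed-format list from drifting away from the implemented branches is to derive validation from a handler table; the names below (_FORMAT_HANDLERS, _process_bshd, process) are illustrative, not the collator's actual API:

```python
def _process_bshd(x):
    """Placeholder for the real bshd processing path."""
    return x


# The allowed formats ARE the keys of this table, so validation can never
# claim support for a format ("sbhd") that has no branch.
_FORMAT_HANDLERS = {"thd": lambda x: x, "bshd": _process_bshd}


def process(qvk_format: str, x):
    if qvk_format not in _FORMAT_HANDLERS:
        raise ValueError(
            f"qvk_format={qvk_format!r} not supported; "
            f"choose from {sorted(_FORMAT_HANDLERS)}"
        )
    return _FORMAT_HANDLERS[qvk_format](x)
```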

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/Dockerfile`:
- Line 9: The Dockerfile uses COPY . . which will include the entire build
context; add a .dockerignore at the project root to exclude unnecessary files
(e.g., .git, .github, node_modules, __pycache__, *.pyc, tests/, .venv, build/,
dist/, .env) so image builds are smaller and faster; update repository by
creating the .dockerignore file with those entries and keep the Dockerfile
unchanged (or reference .dockerignore in docs) to ensure unwanted files are not
copied during the COPY . . step.
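A starting point for the suggested .dockerignore, using the entries listed above (trim or extend to match what this recipe's build context actually contains):

```
# .dockerignore — keep the build context minimal for COPY . .
.git
.github
node_modules
__pycache__/
*.pyc
tests/
.venv
build/
dist/
.env
```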
- Around line 1-9: The Dockerfile currently runs as root; create and switch to a
non-root user to address DS-0002: add steps to create a user/group (e.g.,
addgroup/adduser or groupadd/useradd) and set ownership of WORKDIR
(/workspace/bionemo) and copied files (referencing WORKDIR and COPY . .) before
switching with USER <username>; ensure any pip/cache directories used in RUN
(--mount paths) are accessible to that user and that subsequent RUN/CMD steps
run under the non-root user instead of root.
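A sketch of the non-root pattern; the user/group names and IDs are placeholders, and the base image plus any RUN install steps from the real Dockerfile are omitted:

```dockerfile
WORKDIR /workspace/bionemo

# Create an unprivileged user and hand it the workdir and copied files.
RUN groupadd --gid 1000 bionemo \
    && useradd --uid 1000 --gid bionemo --create-home bionemo
COPY --chown=bionemo:bionemo . .

# All subsequent RUN/CMD steps execute as the non-root user.
USER bionemo
```

Any pip cache mounts used in earlier RUN steps must remain readable by this user.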

In
`@bionemo-recipes/recipes/opengenome2_llama_native_te/hydra_config/og2_7b_thd_gqa_global_shuffle.yaml`:
- Around line 43-56: The dataset load path is hardcoded in
dataset.load_dataset_kwargs.path; change it to use OmegaConf's environment
variable resolver so users can override via OG2_DATA_PATH while keeping the
current path as the default. Update dataset.load_dataset_kwargs.path to
reference the env var (oc.env) with a fallback to the existing
"/data/opengenome2/parquet_split" so deployments can set OG2_DATA_PATH when
needed.
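With OmegaConf's `oc.env` resolver, the override with fallback looks like this (OG2_DATA_PATH is the env-var name proposed in the comment above):

```yaml
dataset:
  load_dataset_kwargs:
    # Resolves $OG2_DATA_PATH at runtime, falling back to the current default.
    path: ${oc.env:OG2_DATA_PATH,/data/opengenome2/parquet_split}
```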

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/README.md`:
- Around line 58-62: The convergence benchmark table in README.md contains
placeholder "—" values for "This recipe (OG2 7B)" and "Megatron baseline (OG2
7B)"; replace those placeholders with the actual measured metrics (Step /
checkpoint, Perplexity, Train loss, Test loss) collected from your
training/evaluation runs, or if metrics are not yet available, add a clear TODO
note and link to the tracking issue/experiment ID; update the two table rows
("This recipe (OG2 7B)" and "Megatron baseline (OG2 7B)") accordingly so the
README reflects real results or a tracked follow-up.
- Around line 88-102: The README contains two Mermaid charts whose titles
include the placeholder text "fill with your numbers" (the xychart-beta blocks
titled "MFU (%) — fill with your numbers" and "Step time (s) — fill with your
numbers"); replace the placeholders by either (a) populating the bar arrays with
actual benchmark values and updating the title to reflect measured data, (b)
removing the two mermaid xychart-beta blocks entirely if no data is available,
or (c) keeping the charts but changing the title/text to clearly mark them as
examples (e.g., "Example — replace with your measurements") and adding a short
caption explaining they are illustrative only. Ensure the chosen change appears
in the README.md where those mermaid blocks are defined.

In `@bionemo-recipes/recipes/opengenome2_llama_native_te/requirements.txt`:
- Around line 1-12: Pin all top-level dependencies in requirements.txt
(datasets, hydra-core, torch, torchdata, torchmetrics, tqdm,
transformer_engine[pytorch], transformers, wandb, zstandard) to the exact
versions used in testing (== or ~=) so installs are reproducible, and replace
the loose git URL for nvdlfw_inspect with a reference that includes a stable
tag or full commit hash (e.g., @<tag-or-commit> after the repo URL); add a
short comment noting why each pin was chosen.
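The shape of the pinned file might look like the fragment below; every version and the repo URL are placeholders to be filled in with the values actually tested:

```
# Versions below are placeholders — replace with the exact versions used in CI.
datasets==<tested-version>
hydra-core==<tested-version>
torch==<tested-version>
transformers==<tested-version>

# Pin the git dependency to a tag or full commit hash for reproducibility:
nvdlfw_inspect @ git+<repo-url>@<tag-or-commit>
```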

In
`@bionemo-recipes/recipes/opengenome2_llama_native_te/scripts/minimal_repro_thd_gqa_cp_nan.py`:
- Around line 193-194: The function main lacks a return type annotation; update
the def main() signature (the main function in this module) to include an
explicit return type (e.g., def main() -> None:) to indicate it does not return
a value and improve type clarity.
- Line 86: The function run_test(num_kv_heads: int, num_attn_heads: int = 6,
num_steps: int = 20) is missing a return type annotation; update its signature
to declare the return type as bool (-> bool) and ensure all return statements
within run_test produce boolean values so Pyright type checking passes.
- Around line 46-83: Add an explicit, more specific return type to
create_mock_thd_batch (for example dict[str, torch.Tensor | int]) and expand the
existing docstring to Google style with Args: and Returns: sections describing
each parameter (num_sequences, seq_length, hidden_size, vocab_size, cp_size,
device) and the returned mapping keys ("hidden_states", "cu_seqlens",
"max_seqlen", "labels" and their types); keep the implementation unchanged but
update the function signature and docstring to match project doc/type
guidelines.
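A torch-free sketch of the requested signature and Google-style docstring; the real function would annotate its return as dict[str, torch.Tensor | int], and the list-based body here is only a stand-in so the shape of the change is clear:

```python
from typing import Dict, List, Union


def create_mock_thd_batch(
    num_sequences: int = 4,
    seq_length: int = 8,
) -> Dict[str, Union[List[int], int]]:
    """Create a mock packed (THD) batch description.

    Args:
        num_sequences: Number of packed sequences in the batch.
        seq_length: Length of each (equal-length) mock sequence.

    Returns:
        Mapping with "cu_seqlens" (cumulative sequence boundaries, including
        the leading 0) and "max_seqlen" (length of the longest sequence).
    """
    cu_seqlens = [i * seq_length for i in range(num_sequences + 1)]
    return {"cu_seqlens": cu_seqlens, "max_seqlen": seq_length}
```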

In
`@bionemo-recipes/recipes/opengenome2_llama_native_te/tests/test_train_two_gpu.py`:
- Around line 43-58: The helper run_train_cmd lets subprocess.run raise
TimeoutExpired with a hardcoded 240 s timeout, so on timeout the test loses
stdout/stderr and other diagnostics. Accept a configurable timeout parameter
(e.g., timeout_seconds with default 240), wrap the subprocess.run call in
try/except subprocess.TimeoutExpired, and print any available output
(exc.stdout / exc.stderr) and the exception details before failing the test.
Update callers to pass a larger timeout where needed (or read it from an env
var) so two-GPU TE/FSDP2 startup can complete on slower runners.

In
`@bionemo-recipes/recipes/opengenome2_llama_native_te/tokenizers/nucleotide_fast_tokenizer/README.md`:
- Around line 1-4: The README lacks a usage example; add a short code snippet
demonstrating how to load and call the tokenizer using transformers'
AutoTokenizer (e.g.,
AutoTokenizer.from_pretrained("path/to/nucleotide_fast_tokenizer")) and a simple
example call like tokenizer("ACGTACGT") (or
tokenizer.encode/tokenizer.convert_tokens_to_ids as appropriate) so users can
see how to instantiate and tokenize DNA strings; include this under a “Usage” or
“Example” section in the README.md near the top description.
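The kind of snippet the comment asks for might look like this; the path is a placeholder for the actual tokenizer directory, so it is illustrative rather than runnable as-is:

```python
from transformers import AutoTokenizer

# Path is a placeholder — point it at this nucleotide_fast_tokenizer directory.
tokenizer = AutoTokenizer.from_pretrained("path/to/nucleotide_fast_tokenizer")

# Tokenize a DNA string into input IDs.
print(tokenizer("ACGTACGT")["input_ids"])
```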

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 9d4c1dac-1fa0-48bc-a8fc-b774dff6b015

📥 Commits

Reviewing files that changed from the base of the PR and between 5676160 and 0ae0b8e.

⛔ Files ignored due to path filters (2)
  • bionemo-recipes/recipes/opengenome2_llama_native_te/dlcm_sanity_dataset.parquet is excluded by !**/*.parquet
  • bionemo-recipes/recipes/opengenome2_llama_native_te/test_genomic_sequences.parquet is excluded by !**/*.parquet
📒 Files selected for processing (42)
  • bionemo-recipes/recipes/llama3_native_te/checkpoint.py
  • bionemo-recipes/recipes/llama3_native_te/lepton_backup/lepton_configs/og2_recipe_clean_cp2_llama3_code.yaml
  • bionemo-recipes/recipes/llama3_native_te/lepton_backup/submit_og2_lepton_eden.py
  • bionemo-recipes/recipes/opengenome2_llama_native_te/.ruff.toml
  • bionemo-recipes/recipes/opengenome2_llama_native_te/Dockerfile
  • bionemo-recipes/recipes/opengenome2_llama_native_te/README.md
  • bionemo-recipes/recipes/opengenome2_llama_native_te/checkpoint.py
  • bionemo-recipes/recipes/opengenome2_llama_native_te/collator.py
  • bionemo-recipes/recipes/opengenome2_llama_native_te/dataset.py
  • bionemo-recipes/recipes/opengenome2_llama_native_te/distributed_config.py
  • bionemo-recipes/recipes/opengenome2_llama_native_te/fp8_debugging.py
  • bionemo-recipes/recipes/opengenome2_llama_native_te/fp8_debugging_stats.yaml
  • bionemo-recipes/recipes/opengenome2_llama_native_te/hydra_config/L0_debug_2layer.yaml
  • bionemo-recipes/recipes/opengenome2_llama_native_te/hydra_config/L0_sanity.yaml
  • bionemo-recipes/recipes/opengenome2_llama_native_te/hydra_config/L0_sanity_cp.yaml
  • bionemo-recipes/recipes/opengenome2_llama_native_te/hydra_config/defaults.yaml
  • bionemo-recipes/recipes/opengenome2_llama_native_te/hydra_config/og2_7b_thd_gqa.yaml
  • bionemo-recipes/recipes/opengenome2_llama_native_te/hydra_config/og2_7b_thd_gqa_fp8.yaml
  • bionemo-recipes/recipes/opengenome2_llama_native_te/hydra_config/og2_7b_thd_gqa_global_shuffle.yaml
  • bionemo-recipes/recipes/opengenome2_llama_native_te/model_configs/lingua-1B/config.json
  • bionemo-recipes/recipes/opengenome2_llama_native_te/model_configs/meta-llama/Llama-3.2-1B/config.json
  • bionemo-recipes/recipes/opengenome2_llama_native_te/modeling_llama_te.py
  • bionemo-recipes/recipes/opengenome2_llama_native_te/opengenome_collator.py
  • bionemo-recipes/recipes/opengenome2_llama_native_te/opengenome_modeling_llama_te.py
  • bionemo-recipes/recipes/opengenome2_llama_native_te/optimizer.py
  • bionemo-recipes/recipes/opengenome2_llama_native_te/perf_logger.py
  • bionemo-recipes/recipes/opengenome2_llama_native_te/requirements.txt
  • bionemo-recipes/recipes/opengenome2_llama_native_te/scheduler.py
  • bionemo-recipes/recipes/opengenome2_llama_native_te/scripts/minimal_repro_thd_gqa_cp_nan.py
  • bionemo-recipes/recipes/opengenome2_llama_native_te/tests/conftest.py
  • bionemo-recipes/recipes/opengenome2_llama_native_te/tests/test_dataset.py
  • bionemo-recipes/recipes/opengenome2_llama_native_te/tests/test_genomic_dataset.py
  • bionemo-recipes/recipes/opengenome2_llama_native_te/tests/test_train.py
  • bionemo-recipes/recipes/opengenome2_llama_native_te/tests/test_train_two_gpu.py
  • bionemo-recipes/recipes/opengenome2_llama_native_te/tokenizers/nucleotide_fast_tokenizer/README.md
  • bionemo-recipes/recipes/opengenome2_llama_native_te/tokenizers/nucleotide_fast_tokenizer/special_tokens_map.json
  • bionemo-recipes/recipes/opengenome2_llama_native_te/tokenizers/nucleotide_fast_tokenizer/tokenizer.json
  • bionemo-recipes/recipes/opengenome2_llama_native_te/tokenizers/nucleotide_fast_tokenizer/tokenizer_config.json
  • bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2.py
  • bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2_cp.py
  • bionemo-recipes/recipes/opengenome2_llama_native_te/validation.py
  • ci/scripts/check_copied_files.py

Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
Signed-off-by: savitha-eng <savithas@nvidia.com>
@savitha-eng savitha-eng changed the title Add opengenome2 llama3 recipe [WIP - Draft] Add opengenome2 llama3 recipe Mar 11, 2026
savitha-eng and others added 12 commits March 11, 2026 07:33
Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
Signed-off-by: savitha-eng <savithas@nvidia.com>
- Remove lepton_backup folder from llama3_native_te (tracked files)
- Revert checkpoint.py to main (QuantizedTensor/async FP8 fix belongs in separate PR)
- Set use_meta_device=false in OG2 defaults (incompatible with scaled init/Spike-No-More)
- Add hard validation for meta-device + scaled init/Spike-No-More in both training scripts
- Fail fast in validation.py when all batches fail instead of returning fake-good metrics

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@savitha-eng savitha-eng marked this pull request as ready for review March 11, 2026 19:50
@savitha-eng savitha-eng changed the title [WIP - Draft] Add opengenome2 llama3 recipe Add opengenome2 llama3 recipe Mar 11, 2026
savitha-eng and others added 4 commits March 11, 2026 20:03
Adds scripts/evaluate_fasta_lm_loss.py and scripts/metagenomics.fasta
for computing per-sequence log probabilities on a pre-sampled FASTA file.
This enables direct comparison across checkpoints using identical sequences.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add note explaining that test losses were computed using the FASTA eval
script on a shared set of pre-sampled sequences, with a matching script
used for the Megatron baseline. Add "Evaluating Checkpoints" usage section.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
@savitha-eng savitha-eng requested a review from pstjohn March 11, 2026 21:37
Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
@savitha-eng
Collaborator Author

/ok to test

@copy-pr-bot

copy-pr-bot bot commented Mar 11, 2026

/ok to test

@savitha-eng, there was an error processing your request: E1

See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/

Signed-off-by: savitha-eng <savithas@nvidia.com>
@savitha-eng
Collaborator Author

/ok to test ca7176f

savitha-eng and others added 2 commits March 11, 2026 22:28
Restore checkpoint.py to match origin/main which already includes the
QuantizedTensor async save guard, _extra_state filtering, and strict=False
for checkpoint loading (added in c157a41).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Sync modeling_llama_te.py with source (check_copied_files requirement)
- Replace hardcoded /data/savithas/ paths in eval script docstring
- Fix trailing whitespace in README

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@savitha-eng
Collaborator Author

/ok to test 44c3455
