Conversation
Self-contained FSDP2 + TransformerEngine recipe for OpenGenome2 training, extracted from the generic llama3_native_te recipe with OG2-specific defaults:

- FP32 master weights with MixedPrecisionPolicy (cast_forward_inputs=False)
- Megatron-style scaled init for proj/fc2 layers
- Spike-No-More embedding initialization (std=1.0)
- Genomic masking for degenerate bases
- Weight decay grouping (skip bias/1D params)
- THD sequence packing with GQA
- FP8 training with first/last layer BF16 override
- RoPE fp32 assertion (from PR #1496)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
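The "weight decay grouping (skip bias/1D params)" item above follows a common LLM-training convention: biases and other 1-D tensors (e.g. norm scales) are excluded from weight decay. A minimal sketch, assuming a PyTorch-style `named_parameters()` iterable; the helper name, the 0.1 default, and the exact skip rule are illustrative, not the recipe's actual code:

```python
def build_param_groups(named_parameters, weight_decay: float = 0.1):
    """Split params into decay / no-decay groups, skipping biases and 1-D tensors.

    `named_parameters` yields (name, param) pairs where `param.ndim` gives the
    tensor rank, as with torch.nn.Module.named_parameters(). The 0.1 default
    is a placeholder, not the recipe's configured value.
    """
    decay, no_decay = [], []
    for name, param in named_parameters:
        # Biases and 1-D tensors (e.g. LayerNorm scales) are excluded from decay.
        if param.ndim <= 1 or name.endswith(".bias"):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
```

The two groups would then be passed directly to the optimizer constructor, e.g. `torch.optim.AdamW(build_param_groups(model.named_parameters()))`.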
Important: Review skipped. Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configuration.

Configuration used: Path: .coderabbit.yaml | Review profile: CHILL | Plan: Pro
📝 Walkthrough

Introduces the OpenGenome2 Llama3 with TransformerEngine recipe, a self-contained training solution featuring FSDP2 distributed training, context parallelism, sequence packing, genomic data handling, and checkpoint management. Includes model implementations, data loaders, training scripts, a comprehensive test suite, and supporting utilities. Also improves checkpoint robustness in the llama3_native_te recipe by deferring QuantizedTensor validation logic.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor User
    participant Hydra as Hydra Config
    participant DistributedSetup as Distributed<br/>Setup
    participant DataLoader as DataLoader<br/>(CP-Aware)
    participant Model as NVLlamaForCausalLM<br/>(FSDP2)
    participant Optimizer as Optimizer &<br/>Scheduler
    participant Checkpoint as Checkpoint<br/>Manager
    participant Logger as PerfLogger &<br/>Validation

    User->>Hydra: train_fsdp2_cp.py with config
    Hydra->>DistributedSetup: Initialize ranks, device mesh (CP/DP)
    DistributedSetup->>DistributedSetup: Create process groups, set seeds
    DistributedSetup->>Model: Wrap NVLlama with FSDP2, attach CP group
    DistributedSetup->>DataLoader: Build CP-aware dataloader on rank 0
    DataLoader->>DataLoader: ContextParallelDataLoaderWrapper wraps collate_fn
    DistributedSetup->>Checkpoint: Load checkpoint or start fresh
    Checkpoint->>Model: Resume state_dict and optimizer state

    loop Training Loop (per step)
        DataLoader->>DataLoader: Next batch with sharded CP ranks
        DataLoader->>Model: Forward with THD/BSHD input, scattered to ranks
        Model->>Model: Transformer layers (TE + RoPE + FP8/BF16 if enabled)
        Model-->>Logger: Loss output
        Logger->>Logger: Accumulate micro-step metrics
        Model->>Optimizer: Backward pass (loss / grad_acc_steps)
        Optimizer->>Model: Gradient clip and optimize
        Optimizer->>Scheduler: Update learning rate
        Logger->>Logger: Log step metrics (every N steps)
        alt Checkpoint Interval
            Model->>Checkpoint: Save checkpoint (possibly async)
        end
    end

    Checkpoint->>Model: Save final model (FP32 on rank 0)
    Logger->>Logger: Finish wandb, profiler
    DistributedSetup->>DistributedSetup: Destroy process groups, cleanup
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 inconclusive)
✅ Passed checks (2 passed)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
bionemo-recipes/recipes/opengenome2_llama_native_te/dlcm_sanity_dataset.parquet
…d base files

- Restore modeling_llama_te.py and collator.py to CI-synced copies of upstream
- Create opengenome_modeling_llama_te.py with OG2-specific init (Megatron scaled, Spike-No-More, FP8 layer override, RoPE fixes)
- Rename genomic_dataset.py to opengenome_collator.py for consistency
- Extract weight decay grouping to optimizer.py
- Extract validation loop to validation.py
- Add gradient accumulation tests for BSHD and THD formats
- Register new files in CI copy-check sync lists

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add train_fsdp2_cp.py with FSDP2+CP support, keeping all OG2-specific features (FP32 master weights, Megatron scaled init, Spike-No-More, weight decay grouping)
- Add L0_sanity_cp.yaml hydra config for CP testing
- Add single-GPU CP test (cp_size=1) and 2-GPU CP tests (cp_size=2) for both BSHD and THD formats

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ive_te

The QuantizedTensor class moved between TE versions. Make the import conditional so checkpoint.py works with both old and new TE releases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
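The conditional import described in this commit can be sketched as a guarded fallback chain. Both module paths below are assumptions for illustration (the actual old/new locations should be taken from the TE changelog), and the helper name is hypothetical:

```python
# Version-tolerant import: fall back through candidate locations and degrade
# to None when the class (or transformer_engine itself) is unavailable.
try:
    from transformer_engine.pytorch.tensor import QuantizedTensor  # assumed newer path
except ImportError:
    try:
        from transformer_engine.pytorch.float8_tensor import QuantizedTensor  # assumed older path
    except ImportError:
        QuantizedTensor = None  # TE absent, or the class moved again


def is_quantized_tensor(obj) -> bool:
    """Checkpoint validation check that is safe under any TE version."""
    return QuantizedTensor is not None and isinstance(obj, QuantizedTensor)
```

Deferring the `isinstance` check behind a `None` guard is what lets checkpoint.py import cleanly even when neither path resolves.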
The llama3 L0_sanity_cp hydra config has no validation key, so the `validation.enabled` override fails. Use the ++ prefix for robustness and remove validation_enabled from the llama3-code control experiment config.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The L0_sanity_cp base config has data_files=dlcm_sanity_dataset.parquet, which doesn't exist at the OG2 dataset path. Use Hydra delete syntax to remove it so HF auto-discovers parquet files in the directory.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
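The two commits above lean on two different Hydra override forms; as a reference, their behavior differs like this (key paths below are illustrative, not necessarily the exact keys in these configs):

```text
++validation.enabled=false               # force-add/override: succeeds even if the key
                                         # is absent from the base config (a bare
                                         # validation.enabled=false would error out)
~dataset.load_dataset_kwargs.data_files  # delete: removes the key from the composed
                                         # config entirely
```

Note that on some shells the `~` form may need quoting to avoid tilde expansion.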
Add scripts/minimal_repro_thd_gqa_cp_nan.py: a single TransformerLayer with synthetic data that runs MHA then GQA with THD+CP=2 to demonstrate the NaN bug. Run with:

    torchrun --nproc_per_node=2 scripts/minimal_repro_thd_gqa_cp_nan.py

Also remove first-batch dtype/loss logging from train_fsdp2.py and train_fsdp2_cp.py (not present in upstream llama3 code).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
`__version__` is on `transformer_engine`, not `transformer_engine.pytorch`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
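As a side note to the fix above: when an attribute's location is uncertain across releases, `importlib.metadata` can read the installed version without touching package attributes at all. A hedged sketch — the distribution name `transformer-engine` is an assumption here:

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str):
    """Return the installed distribution's version string, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


# Per the fix above, the attribute lives on the top-level package
# (transformer_engine.__version__), not on transformer_engine.pytorch;
# this lookup sidesteps the attribute-location question entirely.
te_version = installed_version("transformer-engine")
```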
- Use tiny config (hidden=384, 6 heads, 256 seq len) instead of 7B
- Fix cu_seqlens: pass FULL sequence cu_seqlens (TE divides internally)
- Fix hidden_states: use per-rank token count but full cu_seqlens
- Remove FP8 recipe (not needed when autocast disabled)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
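The cu_seqlens fix above can be illustrated with synthetic numbers (a sketch, not the repro script's actual code): with context parallelism, TE expects cumulative offsets over the FULL packed sequences and divides by the CP size internally, while each rank's hidden_states tensor holds only its own token shard.

```python
def full_cu_seqlens(seq_lens):
    """Cumulative offsets over FULL (unsharded) sequence lengths, as TE expects."""
    cu = [0]
    for n in seq_lens:
        cu.append(cu[-1] + n)
    return cu


seq_lens = [256, 256]  # full lengths of two packed sequences (synthetic)
cp_size = 2

cu_seqlens = full_cu_seqlens(seq_lens)      # [0, 256, 512] -- NOT per-rank offsets
tokens_this_rank = sum(seq_lens) // cp_size  # this rank's hidden_states holds 256 tokens
```

Passing per-rank offsets here instead of the full ones is exactly the mismatch the commit fixes.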
Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
.../recipes/llama3_native_te/lepton_backup/lepton_configs/og2_recipe_clean_cp2_llama3_code.yaml
@coderabbitai review
✅ Actions performed: Review triggered.
Actionable comments posted: 18
Note: Due to the large number of review comments, Critical and Major severity comments were prioritized as inline comments.
♻️ Duplicate comments (1)
bionemo-recipes/recipes/llama3_native_te/lepton_backup/lepton_configs/og2_recipe_clean_cp2_llama3_code.yaml (1)
1-96: ⚠️ Potential issue | 🟡 Minor — This file appears to be a personal debugging config and should be removed.
This config contains:

- User-specific hardcoded paths (`/data/savithas/`)
- User-specific secret references (`wandb.savithas`, `HUGGING_FACE_HUB_TOKEN.savithas`)
- Specific node exclusions for debugging
- Comments indicating it's a "control experiment" for debugging NaN issues
This appears to be a personal debugging artifact that shouldn't be committed to the main branch.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/llama3_native_te/lepton_backup/lepton_configs/og2_recipe_clean_cp2_llama3_code.yaml` around lines 1 - 96, This file is a personal debugging config and should be removed from the repo: delete or move the YAML (job_name "og2-7b-cp2-llama3-code") out of the main branch and/or add it to .gitignore; remove user-specific hardcoded paths (repo_root, code_path), secrets (wandb_secret, hf_secret), checkpoint_dir, and node exclusions (exclude_nodes) or replace them with CI-safe placeholders, and ensure git_branch and any personal comments are not committed; if you need to keep it for local debugging, put it in a personal fork or a local-only directory and document the allowed storage location.
🟡 Minor comments (4)
bionemo-recipes/recipes/opengenome2_llama_native_te/fp8_debugging.py-26-27 (1)
26-27: ⚠️ Potential issue | 🟡 Minor — Avoid setting logger level at module import time.
Setting `logger.setLevel(logging.INFO)` at module level overrides any application-level logging configuration. This can cause unexpected behavior when users want different log levels.

Suggested fix: remove module-level setLevel

```diff
 logger = logging.getLogger(__name__)
-logger.setLevel(logging.INFO)
```

If INFO-level logging is needed for this module specifically, configure it in the application's logging setup or use a handler with the appropriate level.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_llama_native_te/fp8_debugging.py` around lines 26 - 27, The module currently forces a logging level at import by calling logger.setLevel(logging.INFO); remove that module-level setLevel call so the module uses the application/global logging configuration instead, and if module-specific INFO behavior is required, configure it externally (in application logging setup) or attach a handler with its own level to the logger rather than calling logger.setLevel in this file; look for the logger variable created with logging.getLogger(__name__) and delete the subsequent logger.setLevel(logging.INFO) invocation.

bionemo-recipes/recipes/opengenome2_llama_native_te/README.md-9-12 (1)
9-12: ⚠️ Potential issue | 🟡 Minor — Use descriptive link text instead of "here".
The link text "here" is not descriptive, which impacts accessibility and SEO. Static analysis flagged this as MD059.
Suggested fix
```diff
 This folder is a self-contained training example. You can download a zipped directory of this recipe
-alone by clicking
-[here](https://download-directory.github.io?url=https://github.com/NVIDIA/bionemo-framework/tree/main/bionemo-recipes/recipes/opengenome2_llama_native_te&filename=opengenome2-llama-native-te).
+alone by downloading the [opengenome2-llama-native-te recipe directory](https://download-directory.github.io?url=https://github.com/NVIDIA/bionemo-framework/tree/main/bionemo-recipes/recipes/opengenome2_llama_native_te&filename=opengenome2-llama-native-te).
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_llama_native_te/README.md` around lines 9 - 12, Replace the non-descriptive link text "here" in the README.md snippet with a meaningful, descriptive phrase (e.g., "download this recipe as a zipped directory" or "download the opengenome2-llama-native-te recipe") so the anchor text clearly describes the link target; update the anchor around the URL https://download-directory.github.io?url=https://github.com/NVIDIA/bionemo-framework/tree/main/bionemo-recipes/recipes/opengenome2_llama_native_te&filename=opengenome2-llama-native-te used in the README to use that descriptive text instead of "here" (locate the anchor in the README.md block that contains the current "[here](...)").

bionemo-recipes/recipes/opengenome2_llama_native_te/collator.py-500-548 (1)
500-548: ⚠️ Potential issue | 🟡 Minor — Thread cleanup in `__iter__` may leave dangling prefetch thread.

When `__iter__` is called, it calls `self.close()` to clean up any existing prefetch thread, but `close()` uses a 10-second timeout. If the previous iteration didn't complete and the thread is stuck, this could lead to resource leaks. Additionally, daemon threads will be killed abruptly on process exit, potentially leaving incomplete operations.

Consider logging a warning if the thread join times out:
Suggested improvement
```diff
 def close(self):
     """Stop the prefetch thread. Must be called before destroy_process_group()."""
     if self._prefetch_thread is not None:
         self._prefetch_thread.join(timeout=10)
+        if self._prefetch_thread.is_alive():
+            logger.warning("Prefetch thread did not terminate within timeout")
         self._prefetch_thread = None
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_llama_native_te/collator.py` around lines 500 - 548, The close() call in __iter__ may leave a stuck _prefetch_thread running because close() joins with a 10s timeout and the thread is a daemon; change the cleanup to (1) attempt a bounded join, (2) if join timed out detect thread.is_alive(), log a warning including thread id/context, and do not drop the reference (keep _prefetch_thread so future close() calls can retry), and (3) stop making the thread a daemon (create it in _kick_prefetch without daemon=True) so it won't be abruptly killed on exit; update __iter__, close, _kick_prefetch, and _do_one_prefetch to use these checks and logging around _prefetch_thread joins and timeouts.

bionemo-recipes/recipes/opengenome2_llama_native_te/checkpoint.py-272-279 (1)
272-279: ⚠️ Potential issue | 🟡 Minor — Potential race condition with async checkpoint save.
When `async_save=True`, the code awaits the previous future before starting a new one, but `dcp_async_save` returns immediately. If `prune_checkpoints` is called while the async save is still in flight (after this block), files could be deleted before they're fully written.

Suggested fix: await async save before pruning
```diff
 if async_save:
     # If we're using asynchronous checkpointing, make sure we only have one checkpoint future at a time.
     if "fsdp2" in _ckpt_futures and _ckpt_futures["fsdp2"] is not None:
         _ckpt_futures["fsdp2"].result()
     _ckpt_futures["fsdp2"] = dcp_async_save(state_dict, checkpoint_id=checkpoint_path, process_group=process_group)
+    # Wait for async save to complete before pruning to avoid deleting in-flight checkpoints
+    if max_checkpoints is not None:
+        _ckpt_futures["fsdp2"].result()
 else:
     dcp_save(state_dict, checkpoint_id=checkpoint_path, process_group=process_group)

-if max_checkpoints is not None and dist_config.is_main_process():
+if max_checkpoints is not None and dist_config.is_main_process() and not async_save:
     prune_checkpoints(ckpt_path, max_checkpoints)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_llama_native_te/checkpoint.py` around lines 272 - 279, When async_save is true the code starts a new dcp_async_save and assigns its future to _ckpt_futures["fsdp2"] but prune_checkpoints may run before that future completes, causing deletion of files still being written; update the logic so prune_checkpoints awaits the outstanding future (check and await _ckpt_futures["fsdp2"] if not None and not done) before removing old checkpoints, or alternatively await the new future (returned by dcp_async_save) prior to invoking prune_checkpoints; reference the symbols _ckpt_futures, dcp_async_save, prune_checkpoints and the async_save branch to locate and change the behavior.
🧹 Nitpick comments (16)
bionemo-recipes/recipes/opengenome2_llama_native_te/scripts/minimal_repro_thd_gqa_cp_nan.py (3)
193-194: Add return type annotation for completeness.

Suggested fix
```diff
-def main():
+def main() -> None:
     """Run THD+GQA+CP NaN reproduction tests."""
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_llama_native_te/scripts/minimal_repro_thd_gqa_cp_nan.py` around lines 193 - 194, The function main lacks a return type annotation; update the def main() signature (the main function in this module) to include an explicit return type (e.g., def main() -> None:) to indicate it does not return a value and improve type clarity.
86-86: Add return type annotation.

The function returns `bool` but is missing the return type hint. For Pyright type checking compliance:

Suggested fix
```diff
-def run_test(num_kv_heads: int, num_attn_heads: int = 6, num_steps: int = 20):
+def run_test(num_kv_heads: int, num_attn_heads: int = 6, num_steps: int = 20) -> bool:
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_llama_native_te/scripts/minimal_repro_thd_gqa_cp_nan.py` at line 86, The function run_test(num_kv_heads: int, num_attn_heads: int = 6, num_steps: int = 20) is missing a return type annotation; update its signature to declare the return type as bool (-> bool) and ensure all return statements within run_test produce boolean values so Pyright type checking passes.
46-83: Consider adding return type annotation and Google-style docstring sections.

The function logic is correct. For coding guideline compliance, consider adding:

- A more specific return type (e.g., `dict[str, torch.Tensor | int]`)
- Google-style `Args:` and `Returns:` sections

Since this is a debugging script, this is a minor refinement.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_llama_native_te/scripts/minimal_repro_thd_gqa_cp_nan.py` around lines 46 - 83, Add an explicit, more specific return type to create_mock_thd_batch (for example dict[str, torch.Tensor | int]) and expand the existing docstring to Google style with Args: and Returns: sections describing each parameter (num_sequences, seq_length, hidden_size, vocab_size, cp_size, device) and the returned mapping keys ("hidden_states", "cu_seqlens", "max_seqlen", "labels" and their types); keep the implementation unchanged but update the function signature and docstring to match project doc/type guidelines.

bionemo-recipes/recipes/opengenome2_llama_native_te/Dockerfile (2)
9-9: Consider adding a `.dockerignore` file.
`COPY . .` copies the entire build context. Without a `.dockerignore`, this may include unnecessary files (e.g., `.git`, `__pycache__`, test artifacts) that bloat the image and slow builds.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_llama_native_te/Dockerfile` at line 9, The Dockerfile uses COPY . . which will include the entire build context; add a .dockerignore at the project root to exclude unnecessary files (e.g., .git, .github, node_modules, __pycache__, *.pyc, tests/, .venv, build/, dist/, .env) so image builds are smaller and faster; update repository by creating the .dockerignore file with those entries and keep the Dockerfile unchanged (or reference .dockerignore in docs) to ensure unwanted files are not copied during the COPY . . step.
1-9: Consider adding a non-root user for improved security.

The Dockerfile runs as root, which is flagged by static analysis (Trivy DS-0002). For production or shared environments, running containers as a non-root user reduces the attack surface.
Example: Add non-root user
```diff
 # syntax=docker/dockerfile:1.4
 FROM nvcr.io/nvidia/pytorch:26.02-py3

 RUN --mount=type=cache,target=/root/.cache/pip \
     --mount=type=bind,source=requirements.txt,target=/requirements.txt \
     PIP_CONSTRAINT= pip install -r /requirements.txt

 WORKDIR /workspace/bionemo
 COPY . .
+
+# Create non-root user
+RUN useradd -m -u 1000 bionemo && chown -R bionemo:bionemo /workspace/bionemo
+USER bionemo
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_llama_native_te/Dockerfile` around lines 1 - 9, The Dockerfile currently runs as root; create and switch to a non-root user to address DS-0002: add steps to create a user/group (e.g., addgroup/adduser or groupadd/useradd) and set ownership of WORKDIR (/workspace/bionemo) and copied files (referencing WORKDIR and COPY . .) before switching with USER <username>; ensure any pip/cache directories used in RUN (--mount paths) are accessible to that user and that subsequent RUN/CMD steps run under the non-root user instead of root.

bionemo-recipes/recipes/opengenome2_llama_native_te/requirements.txt (1)
1-12: Consider pinning dependency versions for reproducibility.

Most dependencies are unpinned, which can lead to non-reproducible builds when package versions change. For a training recipe, reproducibility is important.
Additionally, the `nvdlfw_inspect` git dependency lacks a commit hash or tag, meaning builds may break or behave differently over time.

Example: Pin versions and add commit hash

```diff
-datasets
+datasets==2.18.0
 hydra-core
-torch
+torch>=2.2.0
 torchao!=0.14.0
 torchdata
 torchmetrics
 tqdm
 transformer_engine[pytorch]
 transformers
 wandb
 zstandard
-nvdlfw_inspect @ git+https://github.com/NVIDIA/nvidia-dlfw-inspect
+nvdlfw_inspect @ git+https://github.com/NVIDIA/nvidia-dlfw-inspect@<commit-hash>
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_llama_native_te/requirements.txt` around lines 1 - 12, Pin all top-level dependencies in requirements.txt to specific, tested versions (e.g., datasets, hydra-core, torch, torchdata, torchmetrics, tqdm, transformer_engine[pytorch], transformers, wandb, zstandard) to ensure reproducible installs, and replace the loose git URL for nvdlfw_inspect with a reference that includes a stable tag or full commit hash (e.g., add @<tag-or-commit> after the repo URL); update any version constraints (== or ~=) to the exact versions used in testing and add a short comment noting why each pin was chosen.

bionemo-recipes/recipes/opengenome2_llama_native_te/tokenizers/nucleotide_fast_tokenizer/README.md (1)
1-4: Consider adding a brief usage example.

The README provides minimal context. A short code snippet showing how to load and use the tokenizer would be helpful for users unfamiliar with the setup.
Example addition
````diff
 # A transformers-based tokenizer for DNA sequences

 This tokenizer is similar to the one used in the Evo-2 DNA model, and tokenizes DNA sequences using
 the `ord()` function.
+
+## Usage
+
+```python
+from transformers import AutoTokenizer
+
+tokenizer = AutoTokenizer.from_pretrained("path/to/nucleotide_fast_tokenizer")
+tokens = tokenizer("ACGTACGT")
+```
````

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_llama_native_te/tokenizers/nucleotide_fast_tokenizer/README.md` around lines 1 - 4, The README lacks a usage example; add a short code snippet demonstrating how to load and call the tokenizer using transformers' AutoTokenizer (e.g., AutoTokenizer.from_pretrained("path/to/nucleotide_fast_tokenizer")) and a simple example call like tokenizer("ACGTACGT") (or tokenizer.encode/tokenizer.convert_tokens_to_ids as appropriate) so users can see how to instantiate and tokenize DNA strings; include this under a "Usage" or "Example" section in the README.md near the top description.

bionemo-recipes/recipes/opengenome2_llama_native_te/checkpoint.py (3)
111-115: Simplify boolean return.

The condition can be returned directly without the if-else.
Suggested simplification
```diff
 def should_save_checkpoint(step: int, save_every_n_steps: int) -> bool:
     """Determine if a checkpoint should be saved."""
-    if save_every_n_steps > 0 and step % save_every_n_steps == 0 and step > 0:
-        return True
-    return False
+    return save_every_n_steps > 0 and step % save_every_n_steps == 0 and step > 0
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_llama_native_te/checkpoint.py` around lines 111 - 115, The function should_save_checkpoint currently uses an if/return/return pattern; simplify by returning the boolean expression directly: replace the conditional block in should_save_checkpoint(step: int, save_every_n_steps: int) -> bool with a single return of (save_every_n_steps > 0 and step % save_every_n_steps == 0 and step > 0) so behavior remains identical but code is concise.
173-177: Use lazy % formatting in logger calls for consistency.

The logging calls use f-strings, but logger methods support lazy formatting, which avoids string interpolation when the log level is disabled.
Suggested change
```diff
 if incompatible.missing_keys:
-    logger.warning(f"Missing keys when loading checkpoint: {incompatible.missing_keys}")
+    logger.warning("Missing keys when loading checkpoint: %s", incompatible.missing_keys)
 if incompatible.unexpected_keys:
-    logger.warning(f"Unexpected keys when loading checkpoint: {incompatible.unexpected_keys}")
+    logger.warning("Unexpected keys when loading checkpoint: %s", incompatible.unexpected_keys)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_llama_native_te/checkpoint.py` around lines 173 - 177, The logger.warning calls in the checkpoint loader use f-strings causing eager interpolation; change them to lazy formatting by passing the message template and the values as parameters (e.g., replace logger.warning(f"Missing keys when loading checkpoint: {incompatible.missing_keys}") with logger.warning("Missing keys when loading checkpoint: %s", incompatible.missing_keys) and do the same for unexpected_keys) so that logging in the block that checks incompatible (and its missing_keys/unexpected_keys) uses logger.warning with format string + arguments.
68-71: Consider using a more explicit type annotation for the module-level dict.

The mutable module-level dict is intentional for tracking async checkpoint futures, but the type annotation could be more specific.
Suggested improvement
```diff
-_ckpt_futures: dict = {}
+_ckpt_futures: dict[str, torch.futures.Future | None] = {}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_llama_native_te/checkpoint.py` around lines 68 - 71, The module-level variable _ckpt_futures currently has a generic dict annotation; change it to an explicit typing annotation such as Dict[str, asyncio.Future[Any]] (or the specific Future result type used by dcp_async_save) so callers and linters know keys are strategy names and values are asyncio Futures; add necessary imports (from typing import Dict, Any and import asyncio) and update the declaration of _ckpt_futures accordingly, referencing the _ckpt_futures symbol and the dcp_async_save-returned Future type to choose the exact Future result type.

bionemo-recipes/recipes/opengenome2_llama_native_te/README.md (2)
58-62: Benchmark tables contain placeholder values.

The convergence benchmark table has placeholder "—" values that should be filled in before merging.
Would you like help filling in these benchmark values, or should this be tracked as a follow-up task?
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_llama_native_te/README.md` around lines 58 - 62, The convergence benchmark table in README.md contains placeholder "—" values for "This recipe (OG2 7B)" and "Megatron baseline (OG2 7B)"; replace those placeholders with the actual measured metrics (Step / checkpoint, Perplexity, Train loss, Test loss) collected from your training/evaluation runs, or if metrics are not yet available, add a clear TODO note and link to the tracking issue/experiment ID; update the two table rows ("This recipe (OG2 7B)" and "Megatron baseline (OG2 7B)") accordingly so the README reflects real results or a tracked follow-up.
88-102: Mermaid charts contain placeholder instructions.

The chart titles say "fill with your numbers", indicating these are placeholders. Consider either:
- Filling in actual benchmark data
- Removing the charts until data is available
- Adding a note that these are example visualizations
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_llama_native_te/README.md` around lines 88 - 102, The README contains two Mermaid charts whose titles include the placeholder text "fill with your numbers" (the xychart-beta blocks titled "MFU (%) — fill with your numbers" and "Step time (s) — fill with your numbers"); replace the placeholders by either (a) populating the bar arrays with actual benchmark values and updating the title to reflect measured data, (b) removing the two mermaid xychart-beta blocks entirely if no data is available, or (c) keeping the charts but changing the title/text to clearly mark them as examples (e.g., "Example — replace with your measurements") and adding a short caption explaining they are illustrative only. Ensure the chosen change appears in the README.md where those mermaid blocks are defined.

bionemo-recipes/recipes/opengenome2_llama_native_te/hydra_config/og2_7b_thd_gqa_global_shuffle.yaml (1)
43-56: Hardcoded data path may limit portability.

The path `/data/opengenome2/parquet_split` is hardcoded. Consider using a placeholder or environment variable reference for better portability across environments.

Suggested change
```diff
 load_dataset_kwargs:
-  path: "/data/opengenome2/parquet_split"
+  path: ${oc.env:OG2_DATA_PATH,/data/opengenome2/parquet_split}
   split: "train"
   streaming: true
```

Note: This uses OmegaConf's environment variable resolver, allowing users to override via the `OG2_DATA_PATH` env var.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_llama_native_te/hydra_config/og2_7b_thd_gqa_global_shuffle.yaml` around lines 43 - 56, The dataset load path is hardcoded in dataset.load_dataset_kwargs.path; change it to use OmegaConf's environment variable resolver so users can override via OG2_DATA_PATH while keeping the current path as the default. Update dataset.load_dataset_kwargs.path to reference the env var (oc.env) with a fallback to the existing "/data/opengenome2/parquet_split" so deployments can set OG2_DATA_PATH when needed.

bionemo-recipes/recipes/opengenome2_llama_native_te/collator.py (2)
446-455: Variable shadowing: `batch` used in list comprehension shadows outer variable.

The variable `batch` in the list comprehension shadows the `batch` variable from line 405, which could cause confusion during debugging.

Suggested fix
```diff
 if self._is_cp_row_major:
     # Flattened mesh: [(cp0,tp0), (cp0,tp1), (cp1,tp0), (cp1,tp1)]
     # Output: [cp0, cp0, cp1, cp1]
-    combined_batch = [batch for batch in combined_batch for _ in range(self.tp_world_size)]
+    combined_batch = [shard for shard in combined_batch for _ in range(self.tp_world_size)]
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_llama_native_te/collator.py` around lines 446 - 455, The list comprehensions in the collator (inside the method using self._is_cp_row_major) reuse the name "batch" which shadows the outer variable from earlier in the function; update those comprehensions to use a distinct inner variable name (e.g., inner_batch or item) instead of "batch" in the first branch and similarly use a clear loop variable in the second branch so the outer "batch" is not shadowed; adjust both expressions that build combined_batch to use the new inner variable names (refer to self._is_cp_row_major, self.tp_world_size, self.cp_world_size and the existing combined_batch construction) so behavior is unchanged but variable shadowing is removed.
958-996: Consider adding `sbhd` format support or removing it from validation.

Line 958 validates that `qvk_format` is one of `["thd", "bshd", "sbhd"]`, but lines 993-994 raise an error for `sbhd` saying "Support not implemented yet". This is confusing: either remove `sbhd` from the valid formats or implement support.

Option 1: Remove sbhd from validation
```diff
-    if qvk_format not in ["thd", "bshd", "sbhd"]:
+    if qvk_format not in ["thd", "bshd"]:
         raise ValueError(f"Unsupported qvk_format: {qvk_format}!")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_llama_native_te/collator.py` around lines 958 - 996, The validation allows "sbhd" but there is no handler for it, causing a misleading validation; either remove "sbhd" from the allowed list next to qvk_format or add explicit support by adding an elif branch for qvk_format == "sbhd" that processes input_ids_padded and labels_padded (similar to the "bshd" branch) and/or calls a new helper like _process_tensor_sbhd (or re-use _process_tensor_bshd if semantics match), and ensure the starting validation list is updated to match the implemented branches (symbols to edit: qvk_format validation list, the elif blocks handling "bshd" and add "sbhd", and implement or reuse _process_tensor_sbhd/_process_tensor_bshd accordingly).

bionemo-recipes/recipes/opengenome2_llama_native_te/tests/test_train_two_gpu.py (1)
43-58: Handle `TimeoutExpired` explicitly and make the timeout configurable.

If `subprocess.run` hits the 240s limit, the test errors out before you print stdout/stderr, which makes the likely failure mode much harder to debug. Two-GPU TE/FSDP2 startup can easily exceed four minutes on shared runners.

Proposed helper update:

```diff
-def run_train_cmd(cmd, recipe_path):
+def run_train_cmd(cmd, recipe_path, timeout: int = 600):
     """Run a training command and check for errors."""
-    result = subprocess.run(
-        cmd,
-        check=False,
-        text=True,
-        stdout=subprocess.PIPE,
-        stderr=subprocess.PIPE,
-        timeout=240,
-        cwd=str(recipe_path),
-    )
+    try:
+        result = subprocess.run(
+            cmd,
+            check=False,
+            text=True,
+            stdout=subprocess.PIPE,
+            stderr=subprocess.PIPE,
+            timeout=timeout,
+            cwd=str(recipe_path),
+        )
+    except subprocess.TimeoutExpired as exc:
+        pytest.fail(
+            f"Command timed out after {timeout}s:\n{' '.join(cmd)}\n"
+            f"STDOUT:\n{exc.stdout or ''}\nSTDERR:\n{exc.stderr or ''}"
+        )
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_llama_native_te/tests/test_train_two_gpu.py` around lines 43 - 58, The helper run_train_cmd currently lets subprocess.run raise TimeoutExpired (with a hardcoded 240s), so when the process times out you lose stdout/stderr and useful diagnostics; update run_train_cmd to accept a configurable timeout parameter (e.g., timeout_seconds with default 240) and wrap the subprocess.run call in a try/except that catches subprocess.TimeoutExpired, printing any available output (exc.stdout / exc.stderr or the partial result) and the exception details before failing the test; also update callers to pass a larger timeout where needed (or use an env var) so two-GPU TE/FSDP2 startup can complete on slower runners.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In
`@bionemo-recipes/recipes/llama3_native_te/lepton_backup/submit_og2_lepton_eden.py`:
- Around line 155-168: The current construction of hydra_overrides and
hydra_args_formatted interpolates raw config values (cfg.wandb_project,
cfg.wandb_name, cfg.extra_hydra_overrides and any path fields like
dataset_path/code_path) into a bash -c script and is vulnerable to shell
injection and breaking on spaces/quotes; fix by shell-quoting/escaping every
external value before adding to hydra_overrides (use a safe escaper such as
Python's shlex.quote or JSON/string literal quoting) so that the entries added
in the block that appends wandb.project/wandb.name and the loop that extends
extra_hydra_overrides are each sanitized, and then build hydra_args_formatted
from those already-escaped entries (ensure you treat each
cfg.extra_hydra_overrides element as a single token and escape it as well).
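The quoting step this prompt asks for can be sketched with the standard library's `shlex.quote`; the variable names (`hydra_overrides`, `hydra_args_formatted`) follow the prompt, and the config values below are hypothetical stand-ins for `cfg.wandb_project` / `cfg.extra_hydra_overrides`:

```python
import shlex

# Hypothetical stand-ins for cfg.wandb_project / cfg.extra_hydra_overrides;
# the first contains hostile shell metacharacters to show why quoting matters.
wandb_project = "my project; rm -rf /"
extra_overrides = ["trainer.max_steps=10", "wandb.name=run 'A'"]

hydra_overrides = [f"wandb.project={wandb_project}", *extra_overrides]
# Quote each override once, then join; the result is safe to embed in `bash -c "..."`.
hydra_args_formatted = " ".join(shlex.quote(o) for o in hydra_overrides)
print(hydra_args_formatted)
```

`shlex.quote` leaves already-safe tokens (like `trainer.max_steps=10`) untouched and single-quotes everything else, so each override stays a single shell word.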
- Around line 234-235: Replace the current unsafe invocations that expand
secrets on the command line: change the python -c "from huggingface_hub import
login; login(token='${{HF_TOKEN}}')" call to read HF_TOKEN from the environment
inside Python (e.g. import os; from huggingface_hub import login;
login(token=os.environ.get('HF_TOKEN')) — locate the login(token=...) snippet),
and remove the wandb login ${{WANDB_API_KEY}} command; instead rely on setting
WANDB_API_KEY in the process environment (W&B will pick it up) or use secure
stdin-based login if needed (avoid using the wandb login <token> argv form
referenced by the wandb login symbol).
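A minimal sketch of the env-based pattern the prompt recommends; the `login` call is left commented out since it needs network access and a real token:

```python
import os

# Read the token from the process environment instead of interpolating it
# into a shell command line, where it would be visible in `ps` output and logs.
token = os.environ.get("HF_TOKEN")
if token:
    # from huggingface_hub import login
    # login(token=token)
    pass

# W&B needs no explicit login step: it reads WANDB_API_KEY from the
# environment on its own, so the `wandb login <token>` argv form can be dropped.
```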
- Around line 190-199: The follower ranks currently use a fixed sleep (sleep 30)
after the rank-0 creates /tmp/git_sync_complete_marker; change this to an active
wait that polls for the marker file and only proceeds once it exists to
guarantee all ranks use the same checkout. Update the follower branch (the else
block that references NODE_RANK and cd {repo_root}) to loop until test -f
/tmp/git_sync_complete_marker (with a short sleep between polls), then read/echo
the marker (git commit) and then cd {repo_root} before continuing; ensure the
rank-0 code still writes the marker with echo "$(git rev-parse HEAD)" >
/tmp/git_sync_complete_marker so the follower can verify it.
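The follower-side wait is a simple poll loop. The submit script emits it as bash, but the same logic can be sketched in Python (function name and timeouts are illustrative; the marker path comes from the prompt):

```python
import os
import time


def wait_for_marker(path, poll_interval=2.0, timeout=600.0):
    """Block until `path` exists, instead of a fixed `sleep 30`.

    Returns the marker's contents (the rank-0 git commit hash) so the
    follower can verify it checked out the same revision.
    """
    deadline = time.monotonic() + timeout
    while not os.path.exists(path):
        if time.monotonic() > deadline:
            raise TimeoutError(f"marker {path} never appeared")
        time.sleep(poll_interval)
    with open(path) as f:
        return f.read().strip()
```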
In `@bionemo-recipes/recipes/opengenome2_llama_native_te/dataset.py`:
- Around line 297-312: The TokenPackingDataset is not told to account for
post-collation padding, so batches sized by TokenPackingDataset can exceed
token_micro_batch_size after DataCollatorWithFlattening pads to
pad_sequences_to_be_divisible_by; update the instantiation of
TokenPackingDataset (used to build train_dataloader) to pass the padding divisor
through (e.g. add
pad_sequences_to_be_divisible_by=pad_sequences_to_be_divisible_by or the
corresponding parameter name expected by TokenPackingDataset) so the dataset
computes budgets using the same divisor as
GenomicDataCollator/DataCollatorWithFlattening.
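The budget mismatch comes from rounding each sequence up to the divisor after packing. A sketch of the padded-length arithmetic the dataset would need to use (helper name hypothetical):

```python
def padded_token_count(seq_lens, divisor):
    """Tokens a batch occupies after each sequence is padded up to `divisor`.

    TokenPackingDataset should size batches against this number rather than
    the raw sum, or the collator's padding can push a batch past
    token_micro_batch_size.
    """
    # -(-n // divisor) is ceiling division using only integer ops.
    return sum(-(-n // divisor) * divisor for n in seq_lens)
```

For example, two sequences of 100 and 130 tokens with a divisor of 64 occupy 128 + 192 = 320 tokens after padding, not the raw 230.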
In
`@bionemo-recipes/recipes/opengenome2_llama_native_te/hydra_config/defaults.yaml`:
- Around line 15-19: The defaults enable spike_no_more_embedding_init and
use_megatron_scaled_init while use_meta_device remains true, but
NVLlamaPreTrainedModel.init_empty_weights (and the comment) indicate meta-device
init breaks those distributions; fix by making the defaults consistent: either
set use_meta_device: false when spike_no_more_embedding_init or
use_megatron_scaled_init are true, or disable those init flags by default if
use_meta_device must stay true. Update the YAML keys
(spike_no_more_embedding_init, use_megatron_scaled_init, use_meta_device) so the
default recipe preserves the advertised OG2 initialization behavior.
In
`@bionemo-recipes/recipes/opengenome2_llama_native_te/hydra_config/og2_7b_thd_gqa.yaml`:
- Around line 105-112: Validation windowing overrides (validation.max_seq_length
and validation.stride) are ignored by the current trainer; update train_fsdp2.py
so that when building val_dataset_kwargs it copies validation.max_seq_length and
validation.stride (in addition to validation.data_path and
validation.micro_batch_size) into val_dataset_kwargs or otherwise applies them
to the validation dataset creation path (use the same keys/names expected by the
dataset loader and fall back to training windowing only if those validation
fields are null). Ensure the code references the existing val_dataset_kwargs
construction point and the validation.* config keys to avoid silent use of
training windowing during validation.
In `@bionemo-recipes/recipes/opengenome2_llama_native_te/modeling_llama_te.py`:
- Around line 205-235: The code still dereferences input_ids when callers supply
inputs_embeds; update the branches that use input_ids (inside should_pack_inputs
and the InferenceParams length computation) to fall back to inputs_embeds when
input_ids is None: use inputs_embeds.size(1) (or .shape) for padded_seq_len and
any shape comparisons, and when computing lengths use attention_mask.shape ==
(inputs_embeds.shape) or use inputs_embeds.size(1) to derive sequence lengths;
adjust the conditionals around _unpad_input, the cu_seq_lens/max_length
assignments, and the lengths list passed to past_key_values.pre_step so they
work whether input_ids or inputs_embeds was provided (referencing
should_pack_inputs, _unpad_input, attention_mask, inputs_embeds, input_ids,
InferenceParams, and pre_step).
In
`@bionemo-recipes/recipes/opengenome2_llama_native_te/opengenome_modeling_llama_te.py`:
- Around line 346-350: The code assumes input_ids exists when should_pack_inputs
is true, causing AttributeError if callers passed inputs_embeds; update the
packing path to derive padded_seq_len and batch_size from inputs_embeds when
present (fall back to input_ids otherwise), e.g., use inputs_embeds.size(1) /
.shape or hidden_states.shape to compute padded_seq_len and batch_size before
calling _unpad_input; apply the same change to the other occurrence mentioned
(the cache-length computation block around lines 360-365) so all logic
referencing input_ids handles inputs_embeds or uses hidden_states as a fallback.
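The fallback both prompts describe reduces to "derive the sequence length from whichever input is present". A torch-free sketch (a stub object stands in for a tensor, since both `input_ids` and `inputs_embeds` carry the sequence dimension at index 1):

```python
def padded_seq_len(input_ids=None, inputs_embeds=None):
    """Prefer input_ids, but fall back to inputs_embeds when callers pass
    embeddings directly; raise if neither is given."""
    ref = input_ids if input_ids is not None else inputs_embeds
    if ref is None:
        raise ValueError("either input_ids or inputs_embeds must be provided")
    return ref.size(1)  # sequence dim is index 1 for both inputs
```

The same `ref` object can then drive the `_unpad_input` call and the `cu_seq_lens` / `pre_step` length computations without ever dereferencing a `None` `input_ids`.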
- Line 255: The code is using config.dtype (and passing strings) where PyTorch
expects a torch.dtype; update all usages to use config.torch_dtype (e.g.,
replace the dtype arg in the nn.Embedding creation at self.embed_tokens and the
dtype args passed to the TransformerEngine constructors referenced in this file)
and ensure the value is an actual torch.dtype (if configs may supply a string,
convert it to a torch.dtype via torch.<type> lookup before use). Also
standardize configs to use the HF key torch_dtype instead of the custom dtype so
subsequent code can rely on config.torch_dtype being present and correct.
- Around line 245-255: The model class NVLlamaModel lacks the embedding
accessors required by PreTrainedModel.tie_weights(), so implement
get_input_embeddings(self) to return self.embed_tokens and
get_output_embeddings(self) to return getattr(self, "lm_head", None) (and
optionally set_input_embeddings(self, new_emb) to assign self.embed_tokens) so
tie_word_embeddings actually ties lm_head to the token embeddings; apply the
same three accessor methods to the other related model class around lines
418-436 (the class that defines lm_head there) so the tying mechanism works for
both classes.
In `@bionemo-recipes/recipes/opengenome2_llama_native_te/perf_logger.py`:
- Around line 118-129: The logging window miscounts tokens because
num_tokens/num_unpadded_tokens are only updated when step % logging_frequency ==
0 while running_loss and grad_acc_step_count update every step; change the
accounting so token counts are accumulated every step (same place running_loss
and grad_acc_step_count are incremented inside with torch.no_grad()) and keep
the existing log emission that checks step against logging_frequency (or
alternatively change the emit condition to (step+1) % logging_frequency == 0) so
the tokens, loss, and grad_acc_step_count reflect the same window; update
references to grad_acc_step_count, running_loss, num_tokens,
num_unpadded_tokens, logging_frequency, and step to be consistent across
accumulation and logging.
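The intended invariant is that every quantity in the logging window accumulates on every step and resets together on emit. A toy model of the fix (class and method names hypothetical, losses and token counts reduced to scalars):

```python
class LoggingWindow:
    """Token counts accumulate every step, exactly like running_loss, so an
    emitted window's loss and token totals cover the same set of steps."""

    def __init__(self, logging_frequency):
        self.freq = logging_frequency
        self._reset()

    def _reset(self):
        self.running_loss = 0.0
        self.num_tokens = 0
        self.steps_in_window = 0

    def record(self, step, loss, tokens):
        # Accumulate unconditionally -- not only when step % freq == 0.
        self.running_loss += loss
        self.num_tokens += tokens
        self.steps_in_window += 1
        if (step + 1) % self.freq == 0:
            out = (self.running_loss / self.steps_in_window, self.num_tokens)
            self._reset()
            return out
        return None
```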
In `@bionemo-recipes/recipes/opengenome2_llama_native_te/scheduler.py`:
- Around line 21-27: The get_cosine_annealing_schedule_with_warmup function must
validate inputs before computing the schedule: add guards that raise clear
ValueError(s) if num_decay_steps is 0 or negative (to avoid ZeroDivisionError in
the cosine branch) and if min_lr_ratio is out of the [0.0, 1.0] range (to
prevent negative or increasing LRs); place these checks at the start of
get_cosine_annealing_schedule_with_warmup and return/raise immediately with
descriptive messages referencing the offending parameter names.
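A sketch of the validated schedule factory; the signature is assumed from the prompt (the recipe's real function presumably also takes an optimizer and returns a `LambdaLR`), and the guards sit before any arithmetic so bad inputs fail fast:

```python
import math


def get_cosine_annealing_schedule_with_warmup(
    num_warmup_steps, num_decay_steps, min_lr_ratio=0.1
):
    """Return an LR multiplier: linear warmup, then cosine decay to min_lr_ratio."""
    if num_decay_steps <= 0:
        raise ValueError(f"num_decay_steps must be > 0, got {num_decay_steps}")
    if not 0.0 <= min_lr_ratio <= 1.0:
        raise ValueError(f"min_lr_ratio must be in [0.0, 1.0], got {min_lr_ratio}")

    def lr_lambda(step):
        if step < num_warmup_steps:
            return step / max(1, num_warmup_steps)  # linear warmup to 1.0
        progress = min(1.0, (step - num_warmup_steps) / num_decay_steps)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays 1 -> 0
        return min_lr_ratio + (1.0 - min_lr_ratio) * cosine

    return lr_lambda
```

With the guards in place, `num_decay_steps=0` raises instead of dividing by zero, and `min_lr_ratio` outside `[0, 1]` can no longer produce negative or increasing learning rates.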
In `@bionemo-recipes/recipes/opengenome2_llama_native_te/tests/conftest.py`:
- Around line 53-69: The autouse fixture device_mesh currently assumes CUDA
exists and unconditionally sets CUDA device and synchronizes; update it to first
query torch.cuda.is_available() and torch.cuda.device_count() and compare to
DistributedConfig().local_rank to decide whether CUDA GPU setup should run; only
call torch.cuda.set_device, torch.cuda.empty_cache, and torch.cuda.synchronize
when CUDA is available and local_rank < device_count, and ensure
torch.distributed.init_process_group / torch.distributed.destroy_process_group
are only invoked (or torn down) when initialized (or guarded) so CPU-only
environments skip GPU setup cleanly; reference the fixture name device_mesh, the
DistributedConfig usage, and the calls to torch.distributed.init_process_group,
torch.cuda.set_device, torch.cuda.empty_cache, and torch.cuda.synchronize when
implementing the guard.
In `@bionemo-recipes/recipes/opengenome2_llama_native_te/tests/test_train.py`:
- Around line 59-76: The test test_sanity_convergence_fsdp2_te_thd is not
triggering the THD dataloader because main_fsdp2 selects create_thd_dataloader
based on use_sequence_packing, not on config_kwargs.attn_input_format; update
the test to set the configuration option that enables sequence packing (e.g.,
set use_sequence_packing=true or the exact hydra override key that main_fsdp2
checks) so main_fsdp2 will call create_thd_dataloader and exercise the THD path
rather than only setting config_kwargs.attn_input_format.
In `@bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2_cp.py`:
- Around line 146-154: Add a fail-fast guard that rejects using meta-device
initialization together with the custom init modes: when getattr(args,
"use_meta_device", False) is true and either getattr(args,
"spike_no_more_embedding_init", False) or getattr(args,
"use_megatron_scaled_init", False) is true, raise a clear exception (or call
parser.error) explaining that use_meta_device is incompatible with
spike_no_more_embedding_init and use_megatron_scaled_init; place this check
before you mutate config_kwargs (near where config_kwargs is prepared and where
spike_no_more_embedding_init / use_megatron_scaled_init are handled) so the bad
combination is detected early.
- Around line 124-128: The mesh construction using init_device_mesh can produce
an invalid shape when args.cp_size doesn't evenly divide dist_config.world_size;
validate args.cp_size before calling init_device_mesh by checking it's a
positive integer ≤ dist_config.world_size and that dist_config.world_size %
args.cp_size == 0, and raise a clear ValueError (or argparse/cli error)
indicating the mismatch (include values for dist_config.world_size and
args.cp_size) so the misconfiguration fails fast rather than causing a later
distributed init error.
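The divisibility check is a few lines; a sketch of the guard to run before `init_device_mesh` (function name hypothetical, returning the implied data-parallel size as a convenience):

```python
def validate_cp_size(world_size, cp_size):
    """Fail fast on a cp_size that cannot shape the device mesh."""
    if not isinstance(cp_size, int) or cp_size < 1:
        raise ValueError(f"cp_size must be a positive integer, got {cp_size!r}")
    if cp_size > world_size or world_size % cp_size != 0:
        raise ValueError(
            f"cp_size={cp_size} must evenly divide world_size={world_size}"
        )
    return world_size // cp_size  # data-parallel dimension of the mesh
```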
In `@bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2.py`:
- Around line 141-149: The code allows incompatible combinations where
use_meta_device breaks Megatron-scaled and Spike-No-More initializations; add a
hard validation that raises an error when use_meta_device is true and either
args.use_megatron_scaled_init or args.spike_no_more_embedding_init is set.
Specifically, in train_fsdp2.py (near where getattr(args,
"spike_no_more_embedding_init", ...) and getattr(args,
"use_megatron_scaled_init", ...) are handled), check getattr(args,
"use_meta_device", False) and if true and (getattr(args,
"use_megatron_scaled_init", False) or getattr(args,
"spike_no_more_embedding_init", False)) raise a ValueError (or parser.error)
with a clear message naming use_meta_device, use_megatron_scaled_init, and
spike_no_more_embedding_init so the process fails fast. Ensure this validation
runs before modifying config_kwargs or proceeding with initialization.
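Both training scripts need the same fail-fast check; a sketch of the guard (function name hypothetical, flag names from the prompts):

```python
def check_init_flags(use_meta_device, use_megatron_scaled_init,
                     spike_no_more_embedding_init):
    """Reject the broken combination before config_kwargs is touched."""
    if use_meta_device and (use_megatron_scaled_init or spike_no_more_embedding_init):
        raise ValueError(
            "use_meta_device is incompatible with use_megatron_scaled_init and "
            "spike_no_more_embedding_init: meta-device initialization skips the "
            "custom weight-initialization distributions."
        )
```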
In `@bionemo-recipes/recipes/opengenome2_llama_native_te/validation.py`:
- Around line 73-88: The current try/except in the validation loop (around the
forward pass using transformer_engine.pytorch.autocast and model(**batch))
swallows all exceptions and can return misleading metrics when no batches
succeed; update the exception handling to re-raise critical errors or at minimum
raise an exception when num_evaluated == 0 after the loop. Specifically, inside
the except block for the forward pass in the validation function, do not just
logger.warning and continue—either re-raise the caught exception (raise) or set
a flag and after the loop check num_evaluated (or total_tokens) and raise a
RuntimeError indicating validation failed if zero batches were evaluated; ensure
this touches the same symbols: model(**batch), total_loss, total_tokens,
total_weighted_loss, and num_evaluated so metrics are not returned as
val_loss=0/val_ppl=1 when no successful forward occurred.
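A sketch of the loop-level change, with batch iteration and loss bookkeeping reduced to scalars (the real loop runs under `transformer_engine.pytorch.autocast` and calls `model(**batch)`):

```python
def run_validation(batches, forward):
    """Evaluate `forward` over `batches`, tolerating per-batch failures but
    refusing to report metrics when nothing succeeded."""
    total_loss, total_tokens, num_evaluated = 0.0, 0, 0
    for batch in batches:
        try:
            loss, tokens = forward(batch)
        except RuntimeError as exc:
            # logger.warning in the recipe; keep going on isolated failures.
            print(f"validation batch failed: {exc}")
            continue
        total_loss += loss * tokens
        total_tokens += tokens
        num_evaluated += 1
    if num_evaluated == 0:
        raise RuntimeError("validation failed: no batches evaluated successfully")
    return total_loss / total_tokens
```

With the post-loop check, a fully failing validation pass raises instead of silently reporting `val_loss=0` / `val_ppl=1`.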
---
Minor comments:
In `@bionemo-recipes/recipes/opengenome2_llama_native_te/checkpoint.py`:
- Around line 272-279: When async_save is true the code starts a new
dcp_async_save and assigns its future to _ckpt_futures["fsdp2"] but
prune_checkpoints may run before that future completes, causing deletion of
files still being written; update the logic so prune_checkpoints awaits the
outstanding future (check and await _ckpt_futures["fsdp2"] if not None and not
done) before removing old checkpoints, or alternatively await the new future
(returned by dcp_async_save) prior to invoking prune_checkpoints; reference the
symbols _ckpt_futures, dcp_async_save, prune_checkpoints and the async_save
branch to locate and change the behavior.
In `@bionemo-recipes/recipes/opengenome2_llama_native_te/collator.py`:
- Around line 500-548: The close() call in __iter__ may leave a stuck
_prefetch_thread running because close() joins with a 10s timeout and the thread
is a daemon; change the cleanup to (1) attempt a bounded join, (2) if join timed
out detect thread.is_alive(), log a warning including thread id/context, and do
not drop the reference (keep _prefetch_thread so future close() calls can
retry), and (3) stop making the thread a daemon (create it in _kick_prefetch
without daemon=True) so it won't be abruptly killed on exit; update __iter__,
close, _kick_prefetch, and _do_one_prefetch to use these checks and logging
around _prefetch_thread joins and timeouts.
In `@bionemo-recipes/recipes/opengenome2_llama_native_te/fp8_debugging.py`:
- Around line 26-27: The module currently forces a logging level at import by
calling logger.setLevel(logging.INFO); remove that module-level setLevel call so
the module uses the application/global logging configuration instead, and if
module-specific INFO behavior is required, configure it externally (in
application logging setup) or attach a handler with its own level to the logger
rather than calling logger.setLevel in this file; look for the logger variable
created with logging.getLogger(__name__) and delete the subsequent
logger.setLevel(logging.INFO) invocation.
In `@bionemo-recipes/recipes/opengenome2_llama_native_te/README.md`:
- Around line 9-12: Replace the non-descriptive link text "here" in the
README.md snippet with a meaningful, descriptive phrase (e.g., "download this
recipe as a zipped directory" or "download the opengenome2-llama-native-te
recipe") so the anchor text clearly describes the link target; update the anchor
around the URL
https://download-directory.github.io?url=https://github.com/NVIDIA/bionemo-framework/tree/main/bionemo-recipes/recipes/opengenome2_llama_native_te&filename=opengenome2-llama-native-te
used in the README to use that descriptive text instead of "here" (locate the
anchor in the README.md block that contains the current "[here](...)").
---
Duplicate comments:
In
`@bionemo-recipes/recipes/llama3_native_te/lepton_backup/lepton_configs/og2_recipe_clean_cp2_llama3_code.yaml`:
- Around line 1-96: This file is a personal debugging config and should be
removed from the repo: delete or move the YAML (job_name
"og2-7b-cp2-llama3-code") out of the main branch and/or add it to .gitignore;
remove user-specific hardcoded paths (repo_root, code_path), secrets
(wandb_secret, hf_secret), checkpoint_dir, and node exclusions (exclude_nodes)
or replace them with CI-safe placeholders, and ensure git_branch and any
personal comments are not committed; if you need to keep it for local debugging,
put it in a personal fork or a local-only directory and document the allowed
storage location.
---
Nitpick comments:
In `@bionemo-recipes/recipes/opengenome2_llama_native_te/checkpoint.py`:
- Around line 111-115: The function should_save_checkpoint currently uses an
if/return/return pattern; simplify by returning the boolean expression directly:
replace the conditional block in should_save_checkpoint(step: int,
save_every_n_steps: int) -> bool with a single return of (save_every_n_steps > 0
and step % save_every_n_steps == 0 and step > 0) so behavior remains identical
but code is concise.
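The simplified predicate the comment describes, as a single expression:

```python
def should_save_checkpoint(step: int, save_every_n_steps: int) -> bool:
    """Single-expression form of the predicate; behavior unchanged."""
    return save_every_n_steps > 0 and step > 0 and step % save_every_n_steps == 0
```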
- Around line 173-177: The logger.warning calls in the checkpoint loader use
f-strings causing eager interpolation; change them to lazy formatting by passing
the message template and the values as parameters (e.g., replace
logger.warning(f"Missing keys when loading checkpoint:
{incompatible.missing_keys}") with logger.warning("Missing keys when loading
checkpoint: %s", incompatible.missing_keys) and do the same for unexpected_keys)
so that logging in the block that checks incompatible (and its
missing_keys/unexpected_keys) uses logger.warning with format string +
arguments.
- Around line 68-71: The module-level variable _ckpt_futures currently has a
generic dict annotation; change it to an explicit typing annotation such as
Dict[str, asyncio.Future[Any]] (or the specific Future result type used by
dcp_async_save) so callers and linters know keys are strategy names and values
are asyncio Futures; add necessary imports (from typing import Dict, Any and
import asyncio) and update the declaration of _ckpt_futures accordingly,
referencing the _ckpt_futures symbol and the dcp_async_save-returned Future type
to choose the exact Future result type.
In `@bionemo-recipes/recipes/opengenome2_llama_native_te/collator.py`:
- Around line 446-455: The list comprehensions in the collator (inside the
method using self._is_cp_row_major) reuse the name "batch" which shadows the
outer variable from earlier in the function; update those comprehensions to use
a distinct inner variable name (e.g., inner_batch or item) instead of "batch" in
the first branch and similarly use a clear loop variable in the second branch so
the outer "batch" is not shadowed; adjust both expressions that build
combined_batch to use the new inner variable names (refer to
self._is_cp_row_major, self.tp_world_size, self.cp_world_size and the existing
combined_batch construction) so behavior is unchanged but variable shadowing is
removed.
- Around line 958-996: The validation allows "sbhd" but there is no handler for
it, causing a misleading validation; either remove "sbhd" from the allowed list
next to qvk_format or add explicit support by adding an elif branch for
qvk_format == "sbhd" that processes input_ids_padded and labels_padded (similar
to the "bshd" branch) and/or calls a new helper like _process_tensor_sbhd (or
re-use _process_tensor_bshd if semantics match), and ensure the starting
validation list is updated to match the implemented branches (symbols to edit:
qvk_format validation list, the elif blocks handling "bshd" and add "sbhd", and
implement or reuse _process_tensor_sbhd/_process_tensor_bshd accordingly).
In `@bionemo-recipes/recipes/opengenome2_llama_native_te/Dockerfile`:
- Line 9: The Dockerfile uses COPY . . which will include the entire build
context; add a .dockerignore at the project root to exclude unnecessary files
(e.g., .git, .github, node_modules, __pycache__, *.pyc, tests/, .venv, build/,
dist/, .env) so image builds are smaller and faster; update repository by
creating the .dockerignore file with those entries and keep the Dockerfile
unchanged (or reference .dockerignore in docs) to ensure unwanted files are not
copied during the COPY . . step.
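A starting point for the suggested `.dockerignore`, using the entries named in the comment (trim to taste; this is a sketch, not a vetted list for this repo):

```
.git
.github
node_modules
__pycache__
*.pyc
tests/
.venv
build/
dist/
.env
```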
- Around line 1-9: The Dockerfile currently runs as root; create and switch to a
non-root user to address DS-0002: add steps to create a user/group (e.g.,
addgroup/adduser or groupadd/useradd) and set ownership of WORKDIR
(/workspace/bionemo) and copied files (referencing WORKDIR and COPY . .) before
switching with USER <username>; ensure any pip/cache directories used in RUN
(--mount paths) are accessible to that user and that subsequent RUN/CMD steps
run under the non-root user instead of root.
In
`@bionemo-recipes/recipes/opengenome2_llama_native_te/hydra_config/og2_7b_thd_gqa_global_shuffle.yaml`:
- Around line 43-56: The dataset load path is hardcoded in
dataset.load_dataset_kwargs.path; change it to use OmegaConf's environment
variable resolver so users can override via OG2_DATA_PATH while keeping the
current path as the default. Update dataset.load_dataset_kwargs.path to
reference the env var (oc.env) with a fallback to the existing
"/data/opengenome2/parquet_split" so deployments can set OG2_DATA_PATH when
needed.
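OmegaConf's `oc.env` resolver takes an optional default after a comma, so the change could look like this (a sketch of the one affected key, keeping the current path as the fallback):

```yaml
dataset:
  load_dataset_kwargs:
    # Resolved at access time: uses $OG2_DATA_PATH when set, otherwise
    # falls back to the previously hardcoded default.
    path: ${oc.env:OG2_DATA_PATH,/data/opengenome2/parquet_split}
```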
In `@bionemo-recipes/recipes/opengenome2_llama_native_te/README.md`:
- Around line 58-62: The convergence benchmark table in README.md contains
placeholder "—" values for "This recipe (OG2 7B)" and "Megatron baseline (OG2
7B)"; replace those placeholders with the actual measured metrics (Step /
checkpoint, Perplexity, Train loss, Test loss) collected from your
training/evaluation runs, or if metrics are not yet available, add a clear TODO
note and link to the tracking issue/experiment ID; update the two table rows
("This recipe (OG2 7B)" and "Megatron baseline (OG2 7B)") accordingly so the
README reflects real results or a tracked follow-up.
- Around line 88-102: The README contains two Mermaid charts whose titles
include the placeholder text "fill with your numbers" (the xychart-beta blocks
titled "MFU (%) — fill with your numbers" and "Step time (s) — fill with your
numbers"); replace the placeholders by either (a) populating the bar arrays with
actual benchmark values and updating the title to reflect measured data, (b)
removing the two mermaid xychart-beta blocks entirely if no data is available,
or (c) keeping the charts but changing the title/text to clearly mark them as
examples (e.g., "Example — replace with your measurements") and adding a short
caption explaining they are illustrative only. Ensure the chosen change appears
in the README.md where those mermaid blocks are defined.
In `@bionemo-recipes/recipes/opengenome2_llama_native_te/requirements.txt`:
- Around line 1-12: Pin all top-level dependencies in requirements.txt to
specific, tested versions (e.g., datasets, hydra-core, torch, torchdata,
torchmetrics, tqdm, transformer_engine[pytorch], transformers, wandb, zstandard)
to ensure reproducible installs, and replace the loose git URL for
nvdlfw_inspect with a reference that includes a stable tag or full commit hash
(e.g., add @<tag-or-commit> after the repo URL); update any version constraints
(== or ~=) to the exact versions used in testing and add a short comment noting
why each pin was chosen.
In
`@bionemo-recipes/recipes/opengenome2_llama_native_te/scripts/minimal_repro_thd_gqa_cp_nan.py`:
- Around line 193-194: The function main lacks a return type annotation; update
the def main() signature (the main function in this module) to include an
explicit return type (e.g., def main() -> None:) to indicate it does not return
a value and improve type clarity.
- Line 86: The function run_test(num_kv_heads: int, num_attn_heads: int = 6,
num_steps: int = 20) is missing a return type annotation; update its signature
to declare the return type as bool (-> bool) and ensure all return statements
within run_test produce boolean values so Pyright type checking passes.
- Around line 46-83: Add an explicit, more specific return type to
create_mock_thd_batch (for example dict[str, torch.Tensor | int]) and expand the
existing docstring to Google style with Args: and Returns: sections describing
each parameter (num_sequences, seq_length, hidden_size, vocab_size, cp_size,
device) and the returned mapping keys ("hidden_states", "cu_seqlens",
"max_seqlen", "labels" and their types); keep the implementation unchanged but
update the function signature and docstring to match project doc/type
guidelines.
In
`@bionemo-recipes/recipes/opengenome2_llama_native_te/tests/test_train_two_gpu.py`:
- Around line 43-58: The helper run_train_cmd currently lets subprocess.run
raise TimeoutExpired (with a hardcoded 240s), so when the process times out you
lose stdout/stderr and useful diagnostics; update run_train_cmd to accept a
configurable timeout parameter (e.g., timeout_seconds with default 240) and wrap
the subprocess.run call in a try/except that catches subprocess.TimeoutExpired,
printing any available output (exc.stdout / exc.stderr or the partial result)
and the exception details before failing the test; also update callers to pass a
larger timeout where needed (or use an env var) so two-GPU TE/FSDP2 startup can
complete on slower runners.
In
`@bionemo-recipes/recipes/opengenome2_llama_native_te/tokenizers/nucleotide_fast_tokenizer/README.md`:
- Around line 1-4: The README lacks a usage example; add a short code snippet
demonstrating how to load and call the tokenizer using transformers'
AutoTokenizer (e.g.,
AutoTokenizer.from_pretrained("path/to/nucleotide_fast_tokenizer")) and a simple
example call like tokenizer("ACGTACGT") (or
tokenizer.encode/tokenizer.convert_tokens_to_ids as appropriate) so users can
see how to instantiate and tokenize DNA strings; include this under a “Usage” or
“Example” section in the README.md near the top description.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 9d4c1dac-1fa0-48bc-a8fc-b774dff6b015
⛔ Files ignored due to path filters (2)
- bionemo-recipes/recipes/opengenome2_llama_native_te/dlcm_sanity_dataset.parquet is excluded by !**/*.parquet
- bionemo-recipes/recipes/opengenome2_llama_native_te/test_genomic_sequences.parquet is excluded by !**/*.parquet
📒 Files selected for processing (42)
- bionemo-recipes/recipes/llama3_native_te/checkpoint.py
- bionemo-recipes/recipes/llama3_native_te/lepton_backup/lepton_configs/og2_recipe_clean_cp2_llama3_code.yaml
- bionemo-recipes/recipes/llama3_native_te/lepton_backup/submit_og2_lepton_eden.py
- bionemo-recipes/recipes/opengenome2_llama_native_te/.ruff.toml
- bionemo-recipes/recipes/opengenome2_llama_native_te/Dockerfile
- bionemo-recipes/recipes/opengenome2_llama_native_te/README.md
- bionemo-recipes/recipes/opengenome2_llama_native_te/checkpoint.py
- bionemo-recipes/recipes/opengenome2_llama_native_te/collator.py
- bionemo-recipes/recipes/opengenome2_llama_native_te/dataset.py
- bionemo-recipes/recipes/opengenome2_llama_native_te/distributed_config.py
- bionemo-recipes/recipes/opengenome2_llama_native_te/fp8_debugging.py
- bionemo-recipes/recipes/opengenome2_llama_native_te/fp8_debugging_stats.yaml
- bionemo-recipes/recipes/opengenome2_llama_native_te/hydra_config/L0_debug_2layer.yaml
- bionemo-recipes/recipes/opengenome2_llama_native_te/hydra_config/L0_sanity.yaml
- bionemo-recipes/recipes/opengenome2_llama_native_te/hydra_config/L0_sanity_cp.yaml
- bionemo-recipes/recipes/opengenome2_llama_native_te/hydra_config/defaults.yaml
- bionemo-recipes/recipes/opengenome2_llama_native_te/hydra_config/og2_7b_thd_gqa.yaml
- bionemo-recipes/recipes/opengenome2_llama_native_te/hydra_config/og2_7b_thd_gqa_fp8.yaml
- bionemo-recipes/recipes/opengenome2_llama_native_te/hydra_config/og2_7b_thd_gqa_global_shuffle.yaml
- bionemo-recipes/recipes/opengenome2_llama_native_te/model_configs/lingua-1B/config.json
- bionemo-recipes/recipes/opengenome2_llama_native_te/model_configs/meta-llama/Llama-3.2-1B/config.json
- bionemo-recipes/recipes/opengenome2_llama_native_te/modeling_llama_te.py
- bionemo-recipes/recipes/opengenome2_llama_native_te/opengenome_collator.py
- bionemo-recipes/recipes/opengenome2_llama_native_te/opengenome_modeling_llama_te.py
- bionemo-recipes/recipes/opengenome2_llama_native_te/optimizer.py
- bionemo-recipes/recipes/opengenome2_llama_native_te/perf_logger.py
- bionemo-recipes/recipes/opengenome2_llama_native_te/requirements.txt
- bionemo-recipes/recipes/opengenome2_llama_native_te/scheduler.py
- bionemo-recipes/recipes/opengenome2_llama_native_te/scripts/minimal_repro_thd_gqa_cp_nan.py
- bionemo-recipes/recipes/opengenome2_llama_native_te/tests/conftest.py
- bionemo-recipes/recipes/opengenome2_llama_native_te/tests/test_dataset.py
- bionemo-recipes/recipes/opengenome2_llama_native_te/tests/test_genomic_dataset.py
- bionemo-recipes/recipes/opengenome2_llama_native_te/tests/test_train.py
- bionemo-recipes/recipes/opengenome2_llama_native_te/tests/test_train_two_gpu.py
- bionemo-recipes/recipes/opengenome2_llama_native_te/tokenizers/nucleotide_fast_tokenizer/README.md
- bionemo-recipes/recipes/opengenome2_llama_native_te/tokenizers/nucleotide_fast_tokenizer/special_tokens_map.json
- bionemo-recipes/recipes/opengenome2_llama_native_te/tokenizers/nucleotide_fast_tokenizer/tokenizer.json
- bionemo-recipes/recipes/opengenome2_llama_native_te/tokenizers/nucleotide_fast_tokenizer/tokenizer_config.json
- bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2.py
- bionemo-recipes/recipes/opengenome2_llama_native_te/train_fsdp2_cp.py
- bionemo-recipes/recipes/opengenome2_llama_native_te/validation.py
- ci/scripts/check_copied_files.py
Signed-off-by: Savitha Srinivasan <savithas@nvidia.com>
Signed-off-by: savitha-eng <savithas@nvidia.com>
- Remove lepton_backup folder from llama3_native_te (tracked files)
- Revert checkpoint.py to main (QuantizedTensor/async FP8 fix belongs in a separate PR)
- Set use_meta_device=false in OG2 defaults (incompatible with scaled init/Spike-No-More)
- Add hard validation for meta-device + scaled init/Spike-No-More in both training scripts
- Fail fast in validation.py when all batches fail instead of returning fake-good metrics

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
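The hard validation for the meta-device conflict can be sketched as below. This is a hedged illustration, not the recipe's exact code: the config key names (`use_meta_device`, `use_scaled_init`, `use_spike_no_more`) are assumptions, and the recipe's Hydra config may spell them differently.

```python
# Hypothetical fail-fast check: meta-device tensors have no storage, so
# init schemes that write real weights (Megatron-style scaled init,
# Spike-No-More embedding init) cannot run against them.
def validate_init_config(cfg: dict) -> None:
    """Reject meta-device init combined with weight-writing init schemes."""
    if cfg.get("use_meta_device") and (
        cfg.get("use_scaled_init") or cfg.get("use_spike_no_more")
    ):
        raise ValueError(
            "use_meta_device=true is incompatible with scaled init / "
            "Spike-No-More: meta tensors cannot be initialized in place."
        )
```

Called once at startup in each training script, this turns a silent bad-initialization run into an immediate, explicit error.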
Adds scripts/evaluate_fasta_lm_loss.py and scripts/metagenomics.fasta for computing per-sequence log probabilities on a pre-sampled FASTA file. This enables direct comparison across checkpoints using identical sequences. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
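The per-sequence metric can be sketched as follows. This is a hypothetical helper for illustration only; the real evaluate_fasta_lm_loss.py may batch sequences, mask degenerate bases, or normalize by length, and its function names will differ.

```python
import torch
import torch.nn.functional as F


def sequence_log_prob(logits: torch.Tensor, input_ids: torch.Tensor) -> float:
    """Sum of log P(token_t | tokens_<t) for one tokenized sequence.

    logits: [seq_len, vocab] from a causal LM; input_ids: [seq_len].
    """
    # Position t's logits predict token t+1, so drop the final step
    # and shift the targets left by one.
    log_probs = F.log_softmax(logits[:-1].float(), dim=-1)
    targets = input_ids[1:].unsqueeze(-1)
    return log_probs.gather(-1, targets).sum().item()
```

Because every checkpoint is scored on the identical pre-sampled sequences, the summed log probabilities are directly comparable across runs (including the Megatron baseline).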
Add note explaining that test losses were computed using the FASTA eval script on a shared set of pre-sampled sequences, with a matching script used for the Megatron baseline. Add "Evaluating Checkpoints" usage section. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
/ok to test
@savitha-eng, there was an error processing your request. See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/1/
/ok to test ca7176f
Restore checkpoint.py to match origin/main which already includes the QuantizedTensor async save guard, _extra_state filtering, and strict=False for checkpoint loading (added in c157a41). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
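The `_extra_state` filtering mentioned above can be sketched as below. This is a hedged illustration under one assumption: TransformerEngine modules serialize FP8/quantization metadata under state-dict keys ending in `._extra_state`, which a plain (non-TE) load cannot interpret. The real checkpoint.py from c157a41 may filter differently.

```python
# Hypothetical sketch: drop TransformerEngine `_extra_state` entries
# (serialized FP8 metadata) from a state dict before loading it into a
# model that cannot consume them; pair with strict=False on load.
def drop_te_extra_state(state_dict: dict) -> dict:
    """Return a copy of state_dict without `._extra_state` keys."""
    return {
        k: v for k, v in state_dict.items() if not k.endswith("._extra_state")
    }
```

Combined with `load_state_dict(..., strict=False)`, this lets one checkpoint format round-trip between TE and non-TE model definitions.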
- Sync modeling_llama_te.py with source (check_copied_files requirement)
- Replace hardcoded /data/savithas/ paths in eval script docstring
- Fix trailing whitespace in README

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
/ok to test 44c3455
Self-contained FSDP2 + TransformerEngine recipe for OpenGenome2 training, extracted from the generic llama3_native_te recipe with OG2-specific defaults: FP32 master weights with MixedPrecisionPolicy (cast_forward_inputs=False), Megatron-style scaled init for proj/fc2 layers, Spike-No-More embedding initialization (std=1.0), genomic masking for degenerate bases, weight-decay grouping (skip bias/1D params), THD sequence packing with GQA, FP8 training with a first/last-layer BF16 override, and a RoPE fp32 assertion (from PR #1496).
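One of the OG2-specific defaults, weight-decay grouping, can be sketched as below. This is a minimal illustration under the stated rule (skip biases and 1-D params such as LayerNorm weights), not the recipe's exact optimizer.py; the function name and default decay value are assumptions.

```python
# Hypothetical sketch of weight-decay grouping: 1-D parameters (norm
# scales, biases) are excluded from decay, everything else decays.
def split_decay_groups(named_params, weight_decay: float = 0.1):
    decay, no_decay = [], []
    for name, param in named_params:
        if getattr(param, "requires_grad", True) is False:
            continue  # frozen params never enter the optimizer
        if getattr(param, "ndim", 2) < 2 or name.endswith(".bias"):
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]
```

The returned param groups would be passed straight to an optimizer constructor, e.g. `torch.optim.AdamW(split_decay_groups(model.named_parameters()), lr=...)`.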
Description
Usage
Type of changes
CI Pipeline Configuration
Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.
Unit tests marked as `@pytest.mark.multi_gpu` or `@pytest.mark.distributed` are not run in the PR pipeline. For more details, see CONTRIBUTING.
Note
By default, only basic unit tests are run. Add appropriate labels to enable an additional test coverage.
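The markers above are ordinary pytest marks; a test opts out of the default PR pipeline simply by carrying one. The test name below is hypothetical, for illustration only.

```python
import pytest


# Tests tagged multi_gpu (or distributed) are skipped in the default PR
# pipeline and only run when the matching CI label is applied.
@pytest.mark.multi_gpu
def test_fsdp2_cp_two_gpu_smoke():
    """Placeholder body for a multi-GPU smoke test."""
    ...
```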
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.
New commits will automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123). An authorized user can then comment /ok to test on the pull request to trigger CI; this will need to be done for each new commit.

Triggering Code Rabbit AI Review
To trigger a code review from code rabbit, comment on a pull request with one of these commands:
See https://docs.coderabbit.ai/reference/review-commands for a full list of commands.
Pre-submit Checklist