Conversation
Signed-off-by: Timur Rvachov <trvachov@nvidia.com>
|
Important Review skippedAuto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughAdds two new Mixtral native-TE recipes (mixtral_native_te, opengenome2_mixtral_native_te) implementing FSDP2 training with TransformerEngine support, sequence packing/CP/THD, MoE model implementations, distributed checkpointing, dataloaders/collators, perf logging, Hydra configs, Dockerfiles, and comprehensive tests. Changes
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Poem
✨ Finishing Touches🧪 Generate unit tests (beta)
|
|
@coderabbitai full review |
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 17
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
🟡 Minor comments (11)
bionemo-recipes/recipes/opengenome2_mixtral_native_te/optimizer.py-16-18 (1)
16-18:⚠️ Potential issue | 🟡 MinorRemove duplicate license header.
Lines 16-18 duplicate the SPDX copyright and license identifier already present in lines 1-14.
🧹 Proposed fix
# limitations under the License. -# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: LicenseRef-Apache2 - """Optimizer utilities for OpenGenome2 training."""🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/optimizer.py` around lines 16 - 18, Remove the duplicated SPDX header by deleting the repeated lines containing "SPDX-FileCopyrightText" and "SPDX-License-Identifier" that appear after the initial header; keep only the first occurrence at the top of the file so that the file contains a single SPDX copyright and license identifier (search for the strings "SPDX-FileCopyrightText" and "SPDX-License-Identifier" to locate the duplicate).bionemo-recipes/recipes/opengenome2_mixtral_native_te/model_configs/og2-small-mixtral/config.json-27-27 (1)
27-27:⚠️ Potential issue | 🟡 MinorUpdate transformers version to a stable release and pin in requirements.
The configuration references
"4.40.0.dev0", a development pre-release that is significantly outdated. Other models in the repo pin stable versions (4.51.3, 4.57.1, 4.57.3), and the latest stable is 5.4.0. Additionally,requirements.txtonly specifiestransformerswithout a version constraint, creating a mismatch. Pin a stable version in bothrequirements.txtand the config to ensure reproducibility and compatibility.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/model_configs/og2-small-mixtral/config.json` at line 27, The config key "transformers_version" currently set to "4.40.0.dev0" must be changed to a stable release and the same stable version must be pinned in requirements.txt; update the "transformers_version" value in config.json (og2-small-mixtral config) to a stable release string (e.g., "5.4.0" or the repo-consistent stable version you choose) and add an explicit version constraint for transformers in requirements.txt (e.g., transformers==5.4.0) so both files reference the identical pinned stable release for reproducibility.bionemo-recipes/recipes/mixtral_native_te/tests/test_lingua_small_mixtral.py-1-18 (1)
1-18:⚠️ Potential issue | 🟡 MinorRemove duplicate license header.
The license header appears twice: lines 1-14 and lines 16-17 (partial duplicate). This should be cleaned up.
🧹 Proposed fix
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-License-Identifier: LicenseRef-Apache2 # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: LicenseRef-Apache2 - import torch🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/mixtral_native_te/tests/test_lingua_small_mixtral.py` around lines 1 - 18, Remove the duplicated license header by deleting the second occurrence of the SPDX block (the lines containing "SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved." and "SPDX-License-Identifier: LicenseRef-Apache2") so only the single, full Apache-2.0 license header at the top of test_lingua_small_mixtral.py remains.bionemo-recipes/recipes/opengenome2_mixtral_native_te/hydra_config/og2_experiment.yaml-39-42 (1)
39-42:⚠️ Potential issue | 🟡 MinorHardcoded absolute path and cross-recipe directory reference.
The
data_filespath is hardcoded to an absolute workspace path and references theopengenome2_llama_native_terecipe directory, but this config lives inopengenome2_mixtral_native_te. This will break portability and may be a copy-paste error.Consider using a relative path or Hydra interpolation, and verify whether this should reference files within the current recipe directory instead.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/hydra_config/og2_experiment.yaml` around lines 39 - 42, The data_files field under load_dataset_kwargs currently uses a hardcoded absolute path pointing to the opengenome2_llama_native_te recipe (likely a copy/paste error); change it to a portable reference (use a relative path within the current recipe or a Hydra interpolation/macro) and ensure it points to the correct dataset for opengenome2_mixtral_native_te. Locate load_dataset_kwargs and replace the absolute path in data_files with either a relative path from the recipe directory or a Hydra variable (e.g., recipe-specific dataset path) so the config is portable and references the correct recipe’s dataset.bionemo-recipes/recipes/opengenome2_mixtral_native_te/tests/test_train_two_gpu.py-61-80 (1)
61-80:⚠️ Potential issue | 🟡 MinorRemove unused
tmp_pathparameter.The
tmp_pathfixture is declared but not used intest_multi_gpu_train_fsdp2. This creates unnecessary fixture setup.Proposed fix
`@requires_multi_gpu` -def test_multi_gpu_train_fsdp2(tmp_path, recipe_path): +def test_multi_gpu_train_fsdp2(recipe_path): """Test FSDP2 training on 2 GPUs.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/tests/test_train_two_gpu.py` around lines 61 - 80, The test function test_multi_gpu_train_fsdp2 declares an unused tmp_path fixture; remove tmp_path from the function signature so it becomes def test_multi_gpu_train_fsdp2(recipe_path): and adjust any references or imports accordingly to eliminate the unused fixture warning and unnecessary setup.bionemo-recipes/recipes/mixtral_native_te/tests/test_train_two_gpu.py-1-17 (1)
1-17:⚠️ Potential issue | 🟡 MinorRemove duplicate license header block.
Lines 1-14 and 16-17 contain duplicate SPDX copyright headers (both 2026).
Proposed fix
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-License-Identifier: LicenseRef-Apache2 # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: LicenseRef-Apache2 - import subprocess🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/mixtral_native_te/tests/test_train_two_gpu.py` around lines 1 - 17, The file test_train_two_gpu.py contains duplicated SPDX license header blocks (two identical copyright/SPDX sections at the top); remove the redundant block so only a single SPDX copyright and license block remains at the file head, preserving one complete header (the SPDX-FileCopyrightText and SPDX-License-Identifier lines) and deleting the duplicate lines to avoid repeated headers.bionemo-recipes/recipes/opengenome2_mixtral_native_te/scheduler.py-1-17 (1)
1-17:⚠️ Potential issue | 🟡 MinorRemove duplicate license header block.
Same issue as
opengenome2_mixtral_native_te/fp8_debugging.py- lines 1-14 and 16-17 contain duplicate SPDX copyright/license headers.Proposed fix
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-License-Identifier: LicenseRef-Apache2 # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: LicenseRef-Apache2 - import math🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/scheduler.py` around lines 1 - 17, The file contains two identical SPDX license header blocks duplicated back-to-back; remove the redundant block so only a single license header remains at the top of scheduler.py (keep one of the SPDX-FileCopyrightText/SPDX-License-Identifier header blocks and delete the other), ensuring there are no blank or stray duplicate license lines left.bionemo-recipes/recipes/opengenome2_mixtral_native_te/fp8_debugging.py-1-17 (1)
1-17:⚠️ Potential issue | 🟡 MinorRemove duplicate license header block.
Lines 1-14 and 16-17 contain duplicate SPDX copyright/license headers (2026 and 2025). Keep only one license block.
Proposed fix
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-License-Identifier: LicenseRef-Apache2 # # Licensed under the Apache License, Version 2.0 (the "License"); # you may not use this file except in compliance with the License. # You may obtain a copy of the License at # # http://www.apache.org/licenses/LICENSE-2.0 # # Unless required by applicable law or agreed to in writing, software # distributed under the License is distributed on an "AS IS" BASIS, # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved. -# SPDX-License-Identifier: LicenseRef-Apache2 - import logging🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/fp8_debugging.py` around lines 1 - 17, Remove the duplicated SPDX license header at the top of the file: keep a single license block (prefer the 2026 NVIDIA COPYRIGHT/SPDX block) and delete the extra/older duplicate (the 2025 block), ensuring only one SPDX-FileCopyrightText and one SPDX-License-Identifier remain and there are no extra blank lines between the retained header and the rest of the file.bionemo-recipes/recipes/opengenome2_mixtral_native_te/tests/test_dataset.py-231-233 (1)
231-233:⚠️ Potential issue | 🟡 Minor
seq1 != seq2is not a stable invariant here.With homopolymer inputs plus windowing, the first batch can legitimately contain two identical windows from the same source sequence. This makes the test flaky even when batching is correct.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/tests/test_dataset.py` around lines 231 - 233, The test currently assumes seq1 and seq2 (from batch["input_ids"][0] and [1]) must differ, but identical windows can legitimately occur; update the assertion to test that the entire batch is not composed of identical sequences instead of just the first two. Replace the seq1/seq2 equality check with a check over batch["input_ids"] (e.g., compute whether all rows are equal using torch.eq(batch["input_ids"], batch["input_ids"][0]).all(dim=1).all() or an equivalent) and assert that not every sequence in batch equals the first one (i.e., assert not all_identical) so the test allows occasional duplicate windows but still catches a batch of all-identical sequences.bionemo-recipes/recipes/mixtral_native_te/tests/test_train.py-51-52 (1)
51-52:⚠️ Potential issue | 🟡 MinorAlways run
_cleanup()from afinallyblock.Right now cleanup only runs on the success path. If
main_fsdp2()throws or the loss assertion fails, CUDA/process state leaks into the next test.Apply this pattern in both test bodies
- final_loss = main_fsdp2(sanity_config) - _cleanup() + try: + final_loss = main_fsdp2(sanity_config) + finally: + _cleanup()Also applies to: 72-73
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/mixtral_native_te/tests/test_train.py` around lines 51 - 52, Wrap the calls to main_fsdp2(...) and the subsequent _cleanup() in a try/finally so _cleanup() always runs even if main_fsdp2 raises or assertions fail; specifically update both test bodies that call main_fsdp2(sanity_config) (and the second call at the other test) to execute _cleanup() inside a finally block while keeping the assertions/assignments inside the try.bionemo-recipes/recipes/opengenome2_mixtral_native_te/perf_logger.py-121-128 (1)
121-128:⚠️ Potential issue | 🟡 MinorAccumulate token counts across the whole logging window.
step_timeis averaged acrosslogger.frequencysteps, butnum_tokens/num_unpadded_tokensare only recorded on the step that triggers logging. In packed or otherwise variable-length runs, that turns the throughput metrics into “last step tokens divided by window-average time” instead of actual window throughput.Suggested fix
with torch.no_grad(): self.grad_acc_step_count += 1 self.running_loss += outputs.loss - if step % self.logging_frequency == 0 and step > 0: - self.num_tokens += batch["input_ids"].numel() - # Use attention_mask to count unpadded tokens (works for both BSHD and THD) - if "attention_mask" in batch: - self.num_unpadded_tokens += batch["attention_mask"].sum() - else: - # Fallback for pure sequence packing with no padding: all tokens are unpadded - self.num_unpadded_tokens += batch["input_ids"].numel() + self.num_tokens += batch["input_ids"].numel() + # Use attention_mask to count unpadded tokens (works for both BSHD and THD) + if "attention_mask" in batch: + self.num_unpadded_tokens += batch["attention_mask"].sum() + else: + # Fallback for pure sequence packing with no padding: all tokens are unpadded + self.num_unpadded_tokens += batch["input_ids"].numel() ... - step_time = (now - self.previous_step_time) / self.logging_frequency + window_time = now - self.previous_step_time + step_time = window_time / self.logging_frequency self.previous_step_time = now ... - self.metrics["train/tokens_per_second_per_gpu"].update(self.num_tokens / step_time) - self.metrics["train/unpadded_tokens_per_second_per_gpu"].update(self.num_unpadded_tokens / step_time) + self.metrics["train/tokens_per_second_per_gpu"].update(self.num_tokens / window_time) + self.metrics["train/unpadded_tokens_per_second_per_gpu"].update( + self.num_unpadded_tokens / window_time + )Also applies to: 163-180
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/perf_logger.py` around lines 121 - 128, The throughput counters are only updated on the logging step, producing incorrect window throughput; move the increments for self.num_tokens and self.num_unpadded_tokens out of the logging-only branch so they run every step (use batch["input_ids"].numel() for num_tokens and batch["attention_mask"].sum() if present else input_ids.numel() for num_unpadded_tokens), keep the existing attention_mask fallback logic, and then at the logging point (where step_time is averaged using self.logging_frequency) compute throughput using the accumulated counters and reset them after logging; apply the same change to the other similar block referenced (lines ~163-180) so both windows accumulate tokens across the full logging window.
🧹 Nitpick comments (14)
bionemo-recipes/recipes/opengenome2_mixtral_native_te/tokenizers/nucleotide_fast_tokenizer/README.md (1)
1-4: Consider expanding the documentation.The README is quite minimal and could benefit from additional information:
- Usage example: Show how to load and use the tokenizer
- Reference: Provide a link or citation for the Evo-2 DNA model
- Technical details: Explain what tokenizing with
ord()means (e.g., "converts each nucleotide character to its ASCII integer value")- File references: Link to or mention the tokenizer configuration files in this directory (
tokenizer.json,tokenizer_config.json, etc.)- Requirements: List any dependencies needed to use this tokenizer
📝 Example expansion
# A transformers-based tokenizer for DNA sequences -This tokenizer is similar to the one used in the Evo-2 DNA model, and tokenizes DNA sequences using the `ord()` -function. +This tokenizer is similar to the one used in the [Evo-2 DNA model](https://example.com/evo2), and tokenizes DNA sequences by converting each nucleotide character to its ASCII integer value using Python's `ord()` function. + +## Files + +- `tokenizer.json`: Tokenizer vocabulary and configuration +- `tokenizer_config.json`: Tokenizer settings +- `special_tokens_map.json`: Special token definitions + +## Usage + +```python +from transformers import AutoTokenizer + +tokenizer = AutoTokenizer.from_pretrained("path/to/this/directory") +sequence = "ATCGATCG" +tokens = tokenizer(sequence) +``` + +## Requirements + +- transformers>=4.x.x🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/tokenizers/nucleotide_fast_tokenizer/README.md` around lines 1 - 4, Expand the README to include a brief usage example showing how to load the tokenizer with transformers (e.g., mention AutoTokenizer.from_pretrained and calling the tokenizer on a DNA string), add a short technical note explaining that "tokenizing with ord()" converts each nucleotide character to its ASCII integer value, add a reference/link or citation to the Evo-2 DNA model, list the local files (tokenizer.json, tokenizer_config.json, and any vocab/mer files) and their purpose, and add a Requirements section with required packages and minimum versions (e.g., transformers>=4.x.x).bionemo-recipes/recipes/mixtral_native_te/distributed_config.py (2)
21-21: Unused logger declaration.The
loggeris declared but never used in this module.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/mixtral_native_te/distributed_config.py` at line 21, The module defines an unused logger variable named logger in distributed_config.py; either remove that declaration to eliminate the unused symbol or replace usages of print/raise (or add appropriate log messages) with logger calls (e.g., logger.debug/info/warning/error) in functions that perform configuration/initialization in this module so the logger is actually used and lint warnings go away.
39-40: Unused private fields_master_addrand_master_port.These fields are initialized but never accessed within the class or exposed via properties. If they're intended for future use, consider documenting that intent; otherwise, they can be removed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/mixtral_native_te/distributed_config.py` around lines 39 - 40, The private dataclass fields _master_addr and _master_port are created but never used; either remove these unused fields from the class (delete _master_addr and _master_port declarations) or make them visible/used by adding properties or using them where the class needs MASTER_ADDR/MASTER_PORT (e.g., replace direct os.environ access with self._master_addr / self._master_port or add public properties master_addr/master_port that return those values). Update any dependent code to reference the chosen properties if you expose them.bionemo-recipes/recipes/mixtral_native_te/hydra_config/L2_lingua_small_mixtral.yaml (1)
39-39: Consider using scientific notation foreps.
0.00000001is equivalent to1e-8but harder to read and verify. Scientific notation would improve clarity.✨ Suggested change
- eps: 0.00000001 + eps: 1.0e-8🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/mixtral_native_te/hydra_config/L2_lingua_small_mixtral.yaml` at line 39, Replace the hard-to-read decimal literal for the epsilon parameter by using scientific notation: locate the eps key (eps:) in the L2_lingua_small_mixtral.yaml hydra config and change its value from 0.00000001 to 1e-8 so the intent and magnitude are clearer.bionemo-recipes/recipes/opengenome2_mixtral_native_te/Dockerfile (1)
1-9: Consider adding a non-root user for improved security posture.The container runs as root by default. While this is common for NVIDIA GPU containers (due to device access requirements), consider whether a non-root user could be used at runtime for defense-in-depth. If root is intentionally required for GPU/CUDA operations in this context, this can be accepted.
🛡️ Optional: Add non-root user if device access permits
WORKDIR /workspace/bionemo COPY . . + +# Optional: Run as non-root if GPU access permits +# RUN useradd -m -s /bin/bash bionemo +# USER bionemo🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/Dockerfile` around lines 1 - 9, The Dockerfile currently runs as root; add a non-root user and switch to it after installation to improve security: create a user and group (e.g., USER bionemo), chown the WORKDIR (/workspace/bionemo) and any cache directories used during build (refer to the RUN with --mount and the WORKDIR/COPY steps) to that user, then use USER bionemo before continuing runtime steps; ensure any GPU/device access or runtime requirements that need root are validated and, if still required, document why the container must remain root.bionemo-recipes/recipes/mixtral_native_te/requirements.txt (1)
1-12: Same dependency pinning concerns as the other recipe.This file is identical to
opengenome2_mixtral_native_te/requirements.txt. The same recommendations apply: consider pinning versions and using a commit hash for the Git dependency.If both recipes share the same dependencies, consider extracting a shared requirements file to reduce duplication.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/mixtral_native_te/requirements.txt` around lines 1 - 12, The requirements.txt for mixtral_native_te repeats the same unpinned deps as opengenome2_mixtral_native_te and uses a Git URL without a commit hash (nvdlfw_inspect @ git+https://github.com/NVIDIA/nvidia-dlfw-inspect), which risks unreproducible installs; update bionemo-recipes/recipes/mixtral_native_te/requirements.txt to pin exact versions for all packages (e.g., datasets==x.y.z, torch==x.y.z, etc.), change the Git dependency to reference a specific commit/tag, and if both recipes truly share identical deps extract a shared requirements file and have both recipes reference it to avoid duplication.bionemo-recipes/recipes/opengenome2_mixtral_native_te/requirements.txt (2)
1-12: Consider pinning dependency versions for reproducibility.The requirements file lacks version pins, which can lead to non-reproducible builds. For production/research recipes, consider pinning major dependencies (especially
torch,transformers,transformer_engine) to ensure consistent environments.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/requirements.txt` around lines 1 - 12, The requirements file lists many floating dependencies (datasets, hydra-core, torch, torchao, torchdata, torchmetrics, tqdm, transformer_engine[pytorch], transformers, wandb, zstandard, and the git nvdlfw_inspect) which risks non-reproducible environments; update requirements.txt to pin stable versions (use exact == or compatible ~= pins) for at least torch, transformers, transformer_engine[pytorch], hydra-core and torchdata, and add a comment or separate constraints/lockfile (e.g., requirements.lock or pip-tools/poetry lock) to capture transitive versions so CI and research runs reproduce the same environment and include a pinned reference for the git dependency (commit hash) instead of a branch.
12-12: Pin the Git dependency to a specific commit or tag.The
nvdlfw_inspectdependency is installed from the GitHub main branch without a commit hash or tag, creating reproducibility risk. While the GitHub repository appears to be private (API access restricted), pinning to a specific commit hash would improve build consistency. If the repository does not provide public releases or tags, consider pinning to a known-working commit hash.♻️ Proposed fix: Pin to a specific commit
-nvdlfw_inspect @ git+https://github.com/NVIDIA/nvidia-dlfw-inspect +nvdlfw_inspect @ git+https://github.com/NVIDIA/nvidia-dlfw-inspect@<commit_hash>🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/requirements.txt` at line 12, The Git dependency for nvdlfw_inspect in requirements.txt is unpinned; update the nvdlfw_inspect entry to reference a specific commit hash or tag (e.g., convert "nvdlfw_inspect @ git+https://github.com/NVIDIA/nvidia-dlfw-inspect" to include "@<commit-hash>" or a release tag) so installations are reproducible; ensure the chosen commit/tag is verified to work with the project and document the selected hash/tag in the repo or PR description.bionemo-recipes/recipes/mixtral_native_te/Dockerfile (1)
1-9: Consider adding a non-root user for improved security posture.Same consideration as the other recipe Dockerfile: the container runs as root by default. If root is required for GPU/CUDA operations, this is acceptable; otherwise, consider adding a non-root user for defense-in-depth.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/mixtral_native_te/Dockerfile` around lines 1 - 9, Add a non-root user to the Dockerfile to improve security: create a user and group (e.g., bionemo), set its home and shell, chown the WORKDIR (/workspace/bionemo) and any copied files after COPY . . to that user, and switch to that user with USER before running non-root processes (while keeping RUN pip install and other build steps as root if necessary for CUDA/GPU setup). Ensure the Dockerfile updates reference WORKDIR, COPY, RUN pip install and place USER after all build steps that require root so the container runs as the non-root account by default.bionemo-recipes/recipes/mixtral_native_te/fp8_debugging.py (2)
56-56: Consider lazy % formatting for logging.Using f-strings in logging statements evaluates the string even if the log level is disabled. Prefer lazy %-style formatting:
- logger.info(f"Logging FP8 stats to {fp8_log_dir}") + logger.info("Logging FP8 stats to %s", fp8_log_dir)🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/mixtral_native_te/fp8_debugging.py` at line 56, The logging call in fp8_debugging.py uses an f-string which forces evaluation even when the logger level would skip it; change the logger.info call that references fp8_log_dir to use lazy %-style (or the logger's parameterized) formatting, e.g. replace logger.info(f"Logging FP8 stats to {fp8_log_dir}") with logger.info("Logging FP8 stats to %s", fp8_log_dir) so the string is only formatted when the message will actually be emitted.
29-63: Consider consolidating duplicated FP8 debugging code.This implementation is nearly identical to
bionemo-recipes/recipes/opengenome2_mixtral_native_te/fp8_debugging.py. Consider extracting to a shared utility module.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/mixtral_native_te/fp8_debugging.py` around lines 29 - 63, The function initialize_fp8_debugging duplicates code present in another recipe; extract the shared logic into a single utility module (e.g., a new module exposing initialize_fp8_debugging or init_fp8_debugging) and have both recipe files import and call that utility; ensure the extracted function keeps the same signature (dist_config, enabled, fp8_stats_file, fp8_log_dir, fp8_enabled), moves the Path creation, fp8_enabled check, logger.info, transformer_engine/debug_api usage, and necessary imports (Path, logger, transformer_engine, debug_api) into the utility, and update both original files to remove the duplicated implementation and replace it with an import and a single call to the shared initialize_fp8_debugging function.bionemo-recipes/recipes/mixtral_native_te/tests/test_train_two_gpu.py (1)
49-63: Consider adding module docstring and datacenter hardware marker.The opengenome2 version includes a module docstring explaining test purpose and a
requires_datacenter_hardwaremarker. Consider adding these for consistency if the same hardware requirements apply.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/mixtral_native_te/tests/test_train_two_gpu.py` around lines 49 - 63, Add a module-level docstring at the top of the test file describing the purpose of these multi-GPU integration tests and their hardware requirements, and also add the pytest marker requires_datacenter_hardware to the test (e.g., decorate test_multi_gpu_train_fsdp2 or the module with `@pytest.mark.requires_datacenter_hardware`) so the test runner can skip it on non-datacenter setups; ensure you import pytest if not present and keep the existing `@requires_multi_gpu` decorator on test_multi_gpu_train_fsdp2.bionemo-recipes/recipes/opengenome2_mixtral_native_te/scheduler.py (1)
24-47: Consider consolidating duplicated scheduler code.This scheduler implementation is nearly identical to
bionemo-recipes/recipes/mixtral_native_te/scheduler.py. Consider extracting to a shared utility module to avoid maintaining duplicate code.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/scheduler.py` around lines 24 - 47, The get_cosine_annealing_schedule_with_warmup function is duplicated across two schedulers; extract this implementation into a shared utility (e.g., a common scheduler module) and have both scheduler modules import and call that single function instead of keeping two copies; update the callers to import get_cosine_annealing_schedule_with_warmup and remove the duplicated local definition so maintenance and bug fixes are centralized.bionemo-recipes/recipes/opengenome2_mixtral_native_te/fp8_debugging.py (1)
32-58: Add Google-style docstring with Args documentation.The docstring is minimal compared to the version in
mixtral_native_te/fp8_debugging.py, which documents all parameters. For consistency and maintainability, consider adding full Args documentation.Proposed docstring
def initialize_fp8_debugging( dist_config: DistributedConfig, enabled: bool, fp8_stats_file: str, fp8_log_dir: str | os.PathLike, fp8_enabled: bool, ) -> None: - """Initialize FP8 statistics logging.""" + """Initialize FP8 statistics logging. + + Args: + dist_config: The distributed configuration. + enabled: Whether to enable FP8 debugging. + fp8_stats_file: The file containing the FP8 stats config. + fp8_log_dir: The directory to log the FP8 stats to. + fp8_enabled: Whether FP8 autocast is enabled. + """ if not enabled:🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/fp8_debugging.py` around lines 32 - 58, The initialize_fp8_debugging function has a minimal docstring; update it to a Google-style docstring that documents all parameters (dist_config: DistributedConfig, enabled: bool, fp8_stats_file: str, fp8_log_dir: str | os.PathLike, fp8_enabled: bool) and the return value/behavior (None, early return when enabled is False, and raised ValueError when fp8_enabled is False) so it matches the more complete docstring in mixtral_native_te/fp8_debugging.py; place the new docstring immediately under the def line and include brief descriptions for each Arg and the Raises and Returns sections.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@bionemo-recipes/recipes/mixtral_native_te/collator.py`:
- Around line 550-554: The close() method currently clears self._prefetch_thread
even if join(timeout=10) timed out, risking teardown while the prefetch thread
may still be inside scatter_object_list(); change the logic to only clear
self._prefetch_thread when the thread has actually stopped (e.g., call
join(timeout=10) then check self._prefetch_thread.is_alive() or join() return
and only set self._prefetch_thread = None if not alive); if the thread is still
alive, either raise/log a warning or leave the handle intact so callers can
retry shutdown safely, and ensure any shutdown signal used by the prefetch loop
is set before joining so scatter_object_list() can exit promptly.
- Around line 416-418: When building shifted labels for causal LM (inside the
is_causal_lm branch where labels = pad(...); batch["labels"] = labels[...,
1:].contiguous()), mask out targets that cross packed THD sequence boundaries so
predictions do not supervise the next packed sequence: compute a boundary mask
for the shifted positions (e.g., boundary_mask = batch["thd_boundary_mask"][...,
1:] or derive it from whatever THD boundary tensor is provided), set labels[...,
1:][boundary_mask] = -100 to ignore those cross-sequence targets, then assign
batch["labels"] = labels[..., 1:].contiguous(); keep references to is_causal_lm
and batch["labels"] when implementing the change.
In `@bionemo-recipes/recipes/mixtral_native_te/modeling_mixtral_te.py`:
- Around line 572-594: The code assumes input_ids exists when should_pack_inputs
or when building lengths for InferenceParams; update the logic in the block
around should_pack_inputs, _unpad_input, and the past_key_values handling to use
inputs_embeds when input_ids is None: compute padded_seq_len and batch_size from
inputs_embeds.size() if input_ids is None, and when computing lengths for
InferenceParams derive lengths from attention_mask shape or inputs_embeds.shape
(fallback) rather than input_ids.shape; ensure references to input_ids.size(1)
and input_ids.shape[0] are replaced with conditional expressions that prefer
input_ids but use inputs_embeds when available before calling _unpad_input or
past_key_values.pre_step.
- Around line 156-172: init_empty_weights assumes embed_tokens and rotary_emb
live on self.model, but NVMixtralModel places them directly on self; update
init_empty_weights to look up embed_tokens and rotary_emb from either self or
self.model (e.g., embed = getattr(self, "embed_tokens", None) or getattr(self,
"model", None) and getattr(self.model, "embed_tokens", None)), then operate on
that found object (call to_empty, apply(self._init_weights) on embed if
present). Similarly resolve config from self.config or self.model.config for
building LlamaRotaryEmbedding and assign the resulting inv_freq to the located
rotary_emb.inv_freq after moving to cuda. Ensure you keep the existing
NVMixtralSparseMoeBlock _restack_from_views loop unchanged.
In `@bionemo-recipes/recipes/mixtral_native_te/perf_logger.py`:
- Around line 155-159: The code only updates self.min_loss on logging
boundaries, so if training ends mid-window min_loss can remain inf/stale; locate
the PerfLogger method(s) using self.running_loss, self.grad_acc_step_count and
the logging block (the if that checks step % self.logging_frequency == 0) and
add a finalization step before returning (or at end of an epoch) that checks if
self.grad_acc_step_count > 0, computes avg_loss = self.running_loss /
self.grad_acc_step_count and updates self.min_loss =
torch.minimum(self.min_loss, avg_loss); apply the same fix to the other
identical block referenced (the one around lines 201-207).
In `@bionemo-recipes/recipes/mixtral_native_te/tests/test_train.py`:
- Around line 19-24: The current test_set_seed() is being collected as a test
and may run out of order; convert it into an autouse pytest fixture (e.g.,
rename or keep name but decorate with `@pytest.fixture`(autouse=True,
scope="function" or "module") and ensure pytest is imported) so RNG state is
seeded before every relevant test; update the function signature accordingly
(remove assertions inside it), seed torch, random, numpy if used, and remove the
leading test_ prefix if you prefer to avoid accidental collection without the
decorator — reference the test_set_seed function in test_train.py to locate and
change it.
In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/checkpoint.py`:
- Around line 67-70: get_latest_checkpoint() can pick up a newly-created
step_{step} directory before its async upload (started by save_checkpoint_fsdp2
via dcp_async_save) finishes, causing resumption from a partial checkpoint and
failure in dcp_load(). Fix by tracking and checking _ckpt_futures for the
strategy: when save_checkpoint_fsdp2 creates step_{step} and kicks off
dcp_async_save, store the returned Future in _ckpt_futures[strategy]; update
get_latest_checkpoint() (and any resume logic) to ignore step_* directories
whose corresponding Future exists and is not done (or await the Future before
considering that step done); also ensure any shutdown/restart paths wait for
pending _ckpt_futures to complete before allowing resume.
In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/collator.py`:
- Around line 550-554: The close() method currently clears self._prefetch_thread
even if join(timeout=10) timed out and the background worker may still be inside
scatter_object_list(), so change close() to call join(timeout=10) then check if
self._prefetch_thread.is_alive() and only clear the handle (set to None) when
the thread is no longer alive; if it is still alive, do not drop the
reference—either log/warn and return or raise an error so teardown does not
proceed while scatter_object_list() may still be running, and keep the unique
symbols _prefetch_thread, close(), and scatter_object_list() in mind when making
the change.
- Around line 416-418: The current shift of labels for causal LM (inside the
block using self.is_causal_lm and batch["labels"]) can create cross-sequence
targets for packed THD batches; before doing labels = pad(...) and
batch["labels"] = labels[...,1:].contiguous(), detect THD sequence boundaries
and set the label at the first token of each subsequent packed sequence to -100
so it does not contribute to loss. Concretely: compute a boundary mask
(preferably from batch["sequence_ids"] by comparing sequence_ids[...,1:] !=
sequence_ids[...,:-1], or if sequence_ids is not available derive it from
attention transitions using batch["attention_mask"]), apply that mask to
labels[...,1:] and set those positions to -100, then continue with the existing
pad/shift logic (referencing self.is_causal_lm and batch["labels"] to locate the
code).
In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/dataset.py`:
- Around line 296-303: The token-packing budget is computed from raw sequence
lengths but the collator pads sequences up to pad_sequences_to_be_divisible_by,
so batches can exceed token_micro_batch_size; fix by ensuring the same
pad_sequences_to_be_divisible_by is forwarded to the packing logic so packing
budgets account for post-collation padding. Concretely, when creating
TokenPackingDataset (or wherever token packing is computed) use the
pad_sequences_to_be_divisible_by value (the same one passed into
DataCollatorWithFlattening) to round each sequence length up to the next
multiple before summing against token_micro_batch_size (or add a parameter like
padding_granularity passed from DataCollatorWithFlattening/GenomicDataCollator
into the TokenPackingDataset), and apply the same change for the second
occurrence at lines 307-311 so packing and collator use identical padding rules.
In
`@bionemo-recipes/recipes/opengenome2_mixtral_native_te/hydra_config/og2_small_thd_moe.yaml`:
- Around line 23-27: The recipe currently hard-codes a cluster-local dataset
path in load_dataset_kwargs (keys: path, data_files, split, streaming); replace
the literal /data/opengenome2/... value with a portable required override or
parameter reference (e.g., set data_files to ??? to force callers to provide it,
or read it from an environment/CLI variable) and ensure the same change is
applied to the corresponding block at lines 51-55 so the preset remains portable
across environments.
- Around line 44-49: The checkpoint block sets ckpt_dir: null while also
enabling save_final_model, resume_from_checkpoint, and save_every_n_steps, which
is invalid; update the ckpt_dir to a valid directory path (e.g., a string like
"checkpoints/run_name") so checkpoints can be written and resumed, or if you
intend not to persist checkpoints, set resume_from_checkpoint: false and
save_final_model: false (and/or remove save_every_n_steps) to make the config
consistent; change the ckpt_dir, resume_from_checkpoint, save_final_model, or
save_every_n_steps fields accordingly to resolve the conflict.
In
`@bionemo-recipes/recipes/opengenome2_mixtral_native_te/opengenome_collator.py`:
- Around line 53-57: The current mask uses (not_dna | is_control) which
incorrectly sets control tags to -100; change the predicate so only labels that
are neither DNA nor control get masked. Locate dna_tokens_tensor,
control_tensor, not_dna and is_control and replace the assignment logic to
compute a mask that is true when a label is not in dna_tokens_tensor AND not in
control_tensor (e.g., mask = not_dna & ~is_control) and then set labels[mask &
(labels != -100)] = -100 so control tags are preserved.
In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/tests/conftest.py`:
- Around line 53-69: The fixture device_mesh currently has autouse=True which
forces CUDA/distributed setup for every test; remove autouse=True so it only
runs when requested, keep scope="session" if shared across tests, and then
update tests that need it (test_train.py and test_train_two_gpu.py) to
explicitly request it either by adding device_mesh as a test parameter or by
using `@pytest.mark.usefixtures`("device_mesh"); ensure the fixture name
(device_mesh) and any use of DistributedConfig/local_rank remain unchanged so
the existing init logic is reused only for tests that opt in.
In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/validation.py`:
- Line 40: run_validation() currently calls model.eval() (lines showing
model.eval() at 40 and 104) but never restores the previous training mode on
error; wrap the evaluation block so you capture the prior state (e.g.,
was_training = model.training), set model.eval(), then in a finally clause
restore the state (e.g., model.train(was_training)) to ensure training/eval mode
is returned regardless of exceptions; apply this pattern in the run_validation()
function around both evaluation sections referencing model.eval().
- Around line 73-82: The RuntimeError raised when num_evaluated == 0 causes some
ranks to skip the subsequent collectives (torch.distributed.barrier and
all_reduce) and can deadlock; instead, replace the immediate raise with a
per-rank failure indicator (e.g., local_failed = 1 if num_evaluated == 0 else 0)
and participate in the same barrier/all_reduce path using the existing
loss_tensor/all_reduce so all ranks synchronize; after the collective(s) check
the aggregated result (e.g., sum of num_evaluated or sum of local_failed) and
only then raise the RuntimeError (including combined context such as
total_evaluated and involved ranks) or handle the failure, ensuring symbols
num_evaluated, num_batches, dist_config.rank, torch.distributed.barrier,
loss_tensor, and torch.distributed.all_reduce are used to locate and implement
the change.
---
Minor comments:
In
`@bionemo-recipes/recipes/mixtral_native_te/tests/test_lingua_small_mixtral.py`:
- Around line 1-18: Remove the duplicated license header by deleting the second
occurrence of the SPDX block (the lines containing "SPDX-FileCopyrightText:
Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved." and
"SPDX-License-Identifier: LicenseRef-Apache2") so only the single, full
Apache-2.0 license header at the top of test_lingua_small_mixtral.py remains.
In `@bionemo-recipes/recipes/mixtral_native_te/tests/test_train_two_gpu.py`:
- Around line 1-17: The file test_train_two_gpu.py contains duplicated SPDX
license header blocks (two identical copyright/SPDX sections at the top); remove
the redundant block so only a single SPDX copyright and license block remains at
the file head, preserving one complete header (the SPDX-FileCopyrightText and
SPDX-License-Identifier lines) and deleting the duplicate lines to avoid
repeated headers.
In `@bionemo-recipes/recipes/mixtral_native_te/tests/test_train.py`:
- Around line 51-52: Wrap the calls to main_fsdp2(...) and the subsequent
_cleanup() in a try/finally so _cleanup() always runs even if main_fsdp2 raises
or assertions fail; specifically update both test bodies that call
main_fsdp2(sanity_config) (and the second call at the other test) to execute
_cleanup() inside a finally block while keeping the assertions/assignments
inside the try.
In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/fp8_debugging.py`:
- Around line 1-17: Remove the duplicated SPDX license header at the top of the
file: keep a single license block (prefer the 2026 NVIDIA COPYRIGHT/SPDX block)
and delete the extra/older duplicate (the 2025 block), ensuring only one
SPDX-FileCopyrightText and one SPDX-License-Identifier remain and there are no
extra blank lines between the retained header and the rest of the file.
In
`@bionemo-recipes/recipes/opengenome2_mixtral_native_te/hydra_config/og2_experiment.yaml`:
- Around line 39-42: The data_files field under load_dataset_kwargs currently
uses a hardcoded absolute path pointing to the opengenome2_llama_native_te
recipe (likely a copy/paste error); change it to a portable reference (use a
relative path within the current recipe or a Hydra interpolation/macro) and
ensure it points to the correct dataset for opengenome2_mixtral_native_te.
Locate load_dataset_kwargs and replace the absolute path in data_files with
either a relative path from the recipe directory or a Hydra variable (e.g.,
recipe-specific dataset path) so the config is portable and references the
correct recipe’s dataset.
In
`@bionemo-recipes/recipes/opengenome2_mixtral_native_te/model_configs/og2-small-mixtral/config.json`:
- Line 27: The config key "transformers_version" currently set to "4.40.0.dev0"
must be changed to a stable release and the same stable version must be pinned
in requirements.txt; update the "transformers_version" value in config.json
(og2-small-mixtral config) to a stable release string (e.g., "5.4.0" or the
repo-consistent stable version you choose) and add an explicit version
constraint for transformers in requirements.txt (e.g., transformers==5.4.0) so
both files reference the identical pinned stable release for reproducibility.
In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/optimizer.py`:
- Around line 16-18: Remove the duplicated SPDX header by deleting the repeated
lines containing "SPDX-FileCopyrightText" and "SPDX-License-Identifier" that
appear after the initial header; keep only the first occurrence at the top of
the file so that the file contains a single SPDX copyright and license
identifier (search for the strings "SPDX-FileCopyrightText" and
"SPDX-License-Identifier" to locate the duplicate).
In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/perf_logger.py`:
- Around line 121-128: The throughput counters are only updated on the logging
step, producing incorrect window throughput; move the increments for
self.num_tokens and self.num_unpadded_tokens out of the logging-only branch so
they run every step (use batch["input_ids"].numel() for num_tokens and
batch["attention_mask"].sum() if present else input_ids.numel() for
num_unpadded_tokens), keep the existing attention_mask fallback logic, and then
at the logging point (where step_time is averaged using self.logging_frequency)
compute throughput using the accumulated counters and reset them after logging;
apply the same change to the other similar block referenced (lines ~163-180) so
both windows accumulate tokens across the full logging window.
In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/scheduler.py`:
- Around line 1-17: The file contains two identical SPDX license header blocks
duplicated back-to-back; remove the redundant block so only a single license
header remains at the top of scheduler.py (keep one of the
SPDX-FileCopyrightText/SPDX-License-Identifier header blocks and delete the
other), ensuring there are no blank or stray duplicate license lines left.
In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/tests/test_dataset.py`:
- Around line 231-233: The test currently assumes seq1 and seq2 (from
batch["input_ids"][0] and [1]) must differ, but identical windows can
legitimately occur; update the assertion to test that the entire batch is not
composed of identical sequences instead of just the first two. Replace the
seq1/seq2 equality check with a check over batch["input_ids"] (e.g., compute
whether all rows are equal using torch.eq(batch["input_ids"],
batch["input_ids"][0]).all(dim=1).all() or an equivalent) and assert that not
every sequence in batch equals the first one (i.e., assert not all_identical) so
the test allows occasional duplicate windows but still catches a batch of
all-identical sequences.
In
`@bionemo-recipes/recipes/opengenome2_mixtral_native_te/tests/test_train_two_gpu.py`:
- Around line 61-80: The test function test_multi_gpu_train_fsdp2 declares an
unused tmp_path fixture; remove tmp_path from the function signature so it
becomes def test_multi_gpu_train_fsdp2(recipe_path): and adjust any references
or imports accordingly to eliminate the unused fixture warning and unnecessary
setup.
---
Nitpick comments:
In `@bionemo-recipes/recipes/mixtral_native_te/distributed_config.py`:
- Line 21: The module defines an unused logger variable named logger in
distributed_config.py; either remove that declaration to eliminate the unused
symbol or replace usages of print/raise (or add appropriate log messages) with
logger calls (e.g., logger.debug/info/warning/error) in functions that perform
configuration/initialization in this module so the logger is actually used and
lint warnings go away.
- Around line 39-40: The private dataclass fields _master_addr and _master_port
are created but never used; either remove these unused fields from the class
(delete _master_addr and _master_port declarations) or make them visible/used by
adding properties or using them where the class needs MASTER_ADDR/MASTER_PORT
(e.g., replace direct os.environ access with self._master_addr /
self._master_port or add public properties master_addr/master_port that return
those values). Update any dependent code to reference the chosen properties if
you expose them.
In `@bionemo-recipes/recipes/mixtral_native_te/Dockerfile`:
- Around line 1-9: Add a non-root user to the Dockerfile to improve security:
create a user and group (e.g., bionemo), set its home and shell, chown the
WORKDIR (/workspace/bionemo) and any copied files after COPY . . to that user,
and switch to that user with USER before running non-root processes (while
keeping RUN pip install and other build steps as root if necessary for CUDA/GPU
setup). Ensure the Dockerfile updates reference WORKDIR, COPY, RUN pip install
and place USER after all build steps that require root so the container runs as
the non-root account by default.
In `@bionemo-recipes/recipes/mixtral_native_te/fp8_debugging.py`:
- Line 56: The logging call in fp8_debugging.py uses an f-string which forces
evaluation even when the logger level would skip it; change the logger.info call
that references fp8_log_dir to use lazy %-style (or the logger's parameterized)
formatting, e.g. replace logger.info(f"Logging FP8 stats to {fp8_log_dir}") with
logger.info("Logging FP8 stats to %s", fp8_log_dir) so the string is only
formatted when the message will actually be emitted.
- Around line 29-63: The function initialize_fp8_debugging duplicates code
present in another recipe; extract the shared logic into a single utility module
(e.g., a new module exposing initialize_fp8_debugging or init_fp8_debugging) and
have both recipe files import and call that utility; ensure the extracted
function keeps the same signature (dist_config, enabled, fp8_stats_file,
fp8_log_dir, fp8_enabled), moves the Path creation, fp8_enabled check,
logger.info, transformer_engine/debug_api usage, and necessary imports (Path,
logger, transformer_engine, debug_api) into the utility, and update both
original files to remove the duplicated implementation and replace it with an
import and a single call to the shared initialize_fp8_debugging function.
In
`@bionemo-recipes/recipes/mixtral_native_te/hydra_config/L2_lingua_small_mixtral.yaml`:
- Line 39: Replace the hard-to-read decimal literal for the epsilon parameter by
using scientific notation: locate the eps key (eps:) in the
L2_lingua_small_mixtral.yaml hydra config and change its value from 0.00000001
to 1e-8 so the intent and magnitude are clearer.
In `@bionemo-recipes/recipes/mixtral_native_te/requirements.txt`:
- Around line 1-12: The requirements.txt for mixtral_native_te repeats the same
unpinned deps as opengenome2_mixtral_native_te and uses a Git URL without a
commit hash (nvdlfw_inspect @
git+https://github.com/NVIDIA/nvidia-dlfw-inspect), which risks unreproducible
installs; update bionemo-recipes/recipes/mixtral_native_te/requirements.txt to
pin exact versions for all packages (e.g., datasets==x.y.z, torch==x.y.z, etc.),
change the Git dependency to reference a specific commit/tag, and if both
recipes truly share identical deps extract a shared requirements file and have
both recipes reference it to avoid duplication.
In `@bionemo-recipes/recipes/mixtral_native_te/tests/test_train_two_gpu.py`:
- Around line 49-63: Add a module-level docstring at the top of the test file
describing the purpose of these multi-GPU integration tests and their hardware
requirements, and also add the pytest marker requires_datacenter_hardware to the
test (e.g., decorate test_multi_gpu_train_fsdp2 or the module with
`@pytest.mark.requires_datacenter_hardware`) so the test runner can skip it on
non-datacenter setups; ensure you import pytest if not present and keep the
existing `@requires_multi_gpu` decorator on test_multi_gpu_train_fsdp2.
In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/Dockerfile`:
- Around line 1-9: The Dockerfile currently runs as root; add a non-root user
and switch to it after installation to improve security: create a user and group
(e.g., USER bionemo), chown the WORKDIR (/workspace/bionemo) and any cache
directories used during build (refer to the RUN with --mount and the
WORKDIR/COPY steps) to that user, then use USER bionemo before continuing
runtime steps; ensure any GPU/device access or runtime requirements that need
root are validated and, if still required, document why the container must
remain root.
In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/fp8_debugging.py`:
- Around line 32-58: The initialize_fp8_debugging function has a minimal
docstring; update it to a Google-style docstring that documents all parameters
(dist_config: DistributedConfig, enabled: bool, fp8_stats_file: str,
fp8_log_dir: str | os.PathLike, fp8_enabled: bool) and the return value/behavior
(None, early return when enabled is False, and raised ValueError when
fp8_enabled is False) so it matches the more complete docstring in
mixtral_native_te/fp8_debugging.py; place the new docstring immediately under
the def line and include brief descriptions for each Arg and the Raises and
Returns sections.
In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/requirements.txt`:
- Around line 1-12: The requirements file lists many floating dependencies
(datasets, hydra-core, torch, torchao, torchdata, torchmetrics, tqdm,
transformer_engine[pytorch], transformers, wandb, zstandard, and the git
nvdlfw_inspect) which risks non-reproducible environments; update
requirements.txt to pin stable versions (use exact == or compatible ~= pins) for
at least torch, transformers, transformer_engine[pytorch], hydra-core and
torchdata, and add a comment or separate constraints/lockfile (e.g.,
requirements.lock or pip-tools/poetry lock) to capture transitive versions so CI
and research runs reproduce the same environment and include a pinned reference
for the git dependency (commit hash) instead of a branch.
- Line 12: The Git dependency for nvdlfw_inspect in requirements.txt is
unpinned; update the nvdlfw_inspect entry to reference a specific commit hash or
tag (e.g., convert "nvdlfw_inspect @
git+https://github.com/NVIDIA/nvidia-dlfw-inspect" to include "@<commit-hash>"
or a release tag) so installations are reproducible; ensure the chosen
commit/tag is verified to work with the project and document the selected
hash/tag in the repo or PR description.
In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/scheduler.py`:
- Around line 24-47: The get_cosine_annealing_schedule_with_warmup function is
duplicated across two schedulers; extract this implementation into a shared
utility (e.g., a common scheduler module) and have both scheduler modules import
and call that single function instead of keeping two copies; update the callers
to import get_cosine_annealing_schedule_with_warmup and remove the duplicated
local definition so maintenance and bug fixes are centralized.
In
`@bionemo-recipes/recipes/opengenome2_mixtral_native_te/tokenizers/nucleotide_fast_tokenizer/README.md`:
- Around line 1-4: Expand the README to include a brief usage example showing
how to load the tokenizer with transformers (e.g., mention
AutoTokenizer.from_pretrained and calling the tokenizer on a DNA string), add a
short technical note explaining that "tokenizing with ord()" converts each
nucleotide character to its ASCII integer value, add a reference/link or
citation to the Evo-2 DNA model, list the local files (tokenizer.json,
tokenizer_config.json, and any vocab/mer files) and their purpose, and add a
Requirements section with required packages and minimum versions (e.g.,
transformers>=4.x.x).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
| # Tracks in-flight async checkpoint futures keyed by strategy name (e.g. "fsdp2"). | ||
| # Each entry holds the Future returned by dcp_async_save so we can await it before starting | ||
| # the next async save or before shutting down. | ||
| _ckpt_futures: dict = {} |
There was a problem hiding this comment.
Don't let an in-flight async save become the "latest" resumable checkpoint.
This file still treats the highest step_* directory as loadable immediately, even though save_checkpoint_fsdp2() can leave the last save in flight and return. If the run stops during that async write, the next resume will pick a partial checkpoint directory and fail while loading it.
Also applies to: 67-85, 355-363
| if self.is_causal_lm: | ||
| labels = torch.nn.functional.pad(batch["labels"], (0, 1), value=-100) | ||
| batch["labels"] = labels[..., 1:].contiguous() |
There was a problem hiding this comment.
Mask THD sequence boundaries before building shift_labels.
For packed THD batches, labels[..., 1:] makes the last prediction of one sequence target the first token of the next packed sequence. That reintroduces cross-sequence loss even though the attention path keeps the sequences separate.
Suggested fix
if self.is_causal_lm:
labels = torch.nn.functional.pad(batch["labels"], (0, 1), value=-100)
- batch["labels"] = labels[..., 1:].contiguous()
+ shifted_labels = labels[..., 1:].contiguous()
+ if self.qkv_format == "thd":
+ cu_seqlens = batch.get("cu_seq_lens_q_padded", batch["cu_seq_lens_q"])
+ shifted_labels[..., cu_seqlens[1:-1] - 1] = -100
+ batch["labels"] = shifted_labels🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/mixtral_native_te/collator.py` around lines 416 -
418, When building shifted labels for causal LM (inside the is_causal_lm branch
where labels = pad(...); batch["labels"] = labels[..., 1:].contiguous()), mask
out targets that cross packed THD sequence boundaries so predictions do not
supervise the next packed sequence: compute a boundary mask for the shifted
positions (e.g., boundary_mask = batch["thd_boundary_mask"][..., 1:] or derive
it from whatever THD boundary tensor is provided), set labels[...,
1:][boundary_mask] = -100 to ignore those cross-sequence targets, then assign
batch["labels"] = labels[..., 1:].contiguous(); keep references to is_causal_lm
and batch["labels"] when implementing the change.
| def close(self): | ||
| """Stop the prefetch thread. Must be called before destroy_process_group().""" | ||
| if self._prefetch_thread is not None: | ||
| self._prefetch_thread.join(timeout=10) | ||
| self._prefetch_thread = None |
There was a problem hiding this comment.
close() shouldn't drop the thread handle while scatter may still be blocked.
If join(timeout=10) times out, the prefetch thread can still be inside scatter_object_list(), but the wrapper clears _prefetch_thread anyway. That makes teardown look safe when a background collective is still live.
Suggested fix
def close(self):
"""Stop the prefetch thread. Must be called before destroy_process_group()."""
if self._prefetch_thread is not None:
self._prefetch_thread.join(timeout=10)
+ if self._prefetch_thread.is_alive():
+ raise RuntimeError("Prefetch thread did not stop before destroy_process_group().")
self._prefetch_thread = None📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def close(self): | |
| """Stop the prefetch thread. Must be called before destroy_process_group().""" | |
| if self._prefetch_thread is not None: | |
| self._prefetch_thread.join(timeout=10) | |
| self._prefetch_thread = None | |
| def close(self): | |
| """Stop the prefetch thread. Must be called before destroy_process_group().""" | |
| if self._prefetch_thread is not None: | |
| self._prefetch_thread.join(timeout=10) | |
| if self._prefetch_thread.is_alive(): | |
| raise RuntimeError("Prefetch thread did not stop before destroy_process_group().") | |
| self._prefetch_thread = None |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/mixtral_native_te/collator.py` around lines 550 -
554, The close() method currently clears self._prefetch_thread even if
join(timeout=10) timed out, risking teardown while the prefetch thread may still
be inside scatter_object_list(); change the logic to only clear
self._prefetch_thread when the thread has actually stopped (e.g., call
join(timeout=10) then check self._prefetch_thread.is_alive() or join() return
and only set self._prefetch_thread = None if not alive); if the thread is still
alive, either raise/log a warning or leave the handle intact so callers can
retry shutdown safely, and ensure any shutdown signal used by the prefetch loop
is set before joining so scatter_object_list() can exit promptly.
| def init_empty_weights(self): | ||
| """Handles moving the model from the meta device to the cuda device and initializing the weights.""" | ||
| for module in self.modules(): | ||
| if hasattr(module, "reset_parameters"): | ||
| module.reset_parameters() | ||
|
|
||
| # After reset_parameters materializes GroupedLinear views on CUDA, | ||
| # re-stack them into the authoritative stacked parameters. | ||
| for module in self.modules(): | ||
| if isinstance(module, NVMixtralSparseMoeBlock): | ||
| module._restack_from_views() | ||
|
|
||
| self.model.embed_tokens.to_empty(device="cuda") | ||
| self.model.embed_tokens.apply(self._init_weights) | ||
|
|
||
| self.model.rotary_emb.inv_freq = LlamaRotaryEmbedding(config=self.model.config).inv_freq.to("cuda") | ||
|
|
There was a problem hiding this comment.
init_empty_weights() only works on the LM wrapper right now.
This helper reaches through self.model.*, but NVMixtralModel inherits NVMixtralPreTrainedModel directly and keeps embed_tokens / rotary_emb on self. Calling NVMixtralModel.init_empty_weights() will raise AttributeError.
Suggested fix
def init_empty_weights(self):
"""Handles moving the model from the meta device to the cuda device and initializing the weights."""
+ base_model = self.model if hasattr(self, "model") else self
+
for module in self.modules():
if hasattr(module, "reset_parameters"):
module.reset_parameters()
...
- self.model.embed_tokens.to_empty(device="cuda")
- self.model.embed_tokens.apply(self._init_weights)
+ base_model.embed_tokens.to_empty(device="cuda")
+ base_model.embed_tokens.apply(self._init_weights)
- self.model.rotary_emb.inv_freq = LlamaRotaryEmbedding(config=self.model.config).inv_freq.to("cuda")
+ base_model.rotary_emb.inv_freq = LlamaRotaryEmbedding(config=base_model.config).inv_freq.to("cuda")
self.tie_weights()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/mixtral_native_te/modeling_mixtral_te.py` around
lines 156 - 172, init_empty_weights assumes embed_tokens and rotary_emb live on
self.model, but NVMixtralModel places them directly on self; update
init_empty_weights to look up embed_tokens and rotary_emb from either self or
self.model (e.g., embed = getattr(self, "embed_tokens", None) or getattr(self,
"model", None) and getattr(self.model, "embed_tokens", None)), then operate on
that found object (call to_empty, apply(self._init_weights) on embed if
present). Similarly resolve config from self.config or self.model.config for
building LlamaRotaryEmbedding and assign the resulting inv_freq to the located
rotary_emb.inv_freq after moving to cuda. Ensure you keep the existing
NVMixtralSparseMoeBlock _restack_from_views loop unchanged.
| if should_pack_inputs: | ||
| assert attention_mask is not None, "Attention mask is required when packing BSHD inputs." | ||
| batch_size = hidden_states.size(0) | ||
| padded_seq_len = input_ids.size(1) | ||
| hidden_states, indices, cu_seqlens, max_seqlen, _ = _unpad_input(hidden_states, attention_mask) | ||
| kwargs["cu_seq_lens_q"] = kwargs["cu_seq_lens_k"] = cu_seqlens | ||
| kwargs["max_length_q"] = kwargs["max_length_k"] = max_seqlen | ||
|
|
||
| if self.config.attn_input_format == "thd" and hidden_states.dim() == 3 and hidden_states.size(0) == 1: | ||
| hidden_states = hidden_states.squeeze(0) | ||
|
|
||
| if self.config.attn_input_format == "bshd" and attention_mask is not None and attention_mask.dim() == 2: | ||
| # Convert HF mask (1=attend, 0=pad) to TE boolean mask (True=masked, False=attend) | ||
| attention_mask = ~attention_mask[:, None, None, :].bool() | ||
|
|
||
| if isinstance(past_key_values, InferenceParams): | ||
| lengths = ( | ||
| attention_mask.sum(dim=1).tolist() | ||
| if attention_mask.shape == input_ids.shape | ||
| else [1] * input_ids.shape[0] | ||
| ) | ||
| past_key_values.pre_step(OrderedDict(zip(list(range(len(lengths))), lengths))) | ||
|
|
There was a problem hiding this comment.
The THD/inference path still assumes input_ids exists.
The signature allows inputs_embeds, but this block still uses input_ids.size(1) while packing and input_ids.shape[0] when preparing InferenceParams. An embeds-only call passes the initial XOR check and then crashes here before the first layer runs.
Suggested fix
- if should_pack_inputs:
+ batch_size = input_ids.shape[0] if input_ids is not None else inputs_embeds.shape[0]
+ padded_seq_len = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1]
+
+ if should_pack_inputs:
assert attention_mask is not None, "Attention mask is required when packing BSHD inputs."
- batch_size = hidden_states.size(0)
- padded_seq_len = input_ids.size(1)
hidden_states, indices, cu_seqlens, max_seqlen, _ = _unpad_input(hidden_states, attention_mask)
kwargs["cu_seq_lens_q"] = kwargs["cu_seq_lens_k"] = cu_seqlens
kwargs["max_length_q"] = kwargs["max_length_k"] = max_seqlen
...
if isinstance(past_key_values, InferenceParams):
lengths = (
attention_mask.sum(dim=1).tolist()
- if attention_mask.shape == input_ids.shape
- else [1] * input_ids.shape[0]
+ if attention_mask is not None and attention_mask.dim() == 2 and attention_mask.shape[0] == batch_size
+ else [1] * batch_size
)
past_key_values.pre_step(OrderedDict(zip(list(range(len(lengths))), lengths)))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/mixtral_native_te/modeling_mixtral_te.py` around
lines 572 - 594, The code assumes input_ids exists when should_pack_inputs or
when building lengths for InferenceParams; update the logic in the block around
should_pack_inputs, _unpad_input, and the past_key_values handling to use
inputs_embeds when input_ids is None: compute padded_seq_len and batch_size from
inputs_embeds.size() if input_ids is None, and when computing lengths for
InferenceParams derive lengths from attention_mask shape or inputs_embeds.shape
(fallback) rather than input_ids.shape; ensure references to input_ids.size(1)
and input_ids.shape[0] are replaced with conditional expressions that prefer
input_ids but use inputs_embeds when available before calling _unpad_input or
past_key_values.pre_step.
| checkpoint: | ||
| ckpt_dir: null | ||
| save_final_model: true | ||
| resume_from_checkpoint: true | ||
| save_every_n_steps: 1_000 | ||
| async_save: true |
There was a problem hiding this comment.
Checkpointing is enabled without a destination directory.
ckpt_dir: null conflicts with save_final_model: true, resume_from_checkpoint: true, and save_every_n_steps. A direct run cannot both resume and write checkpoints with this preset as written.
Suggested config fix
checkpoint:
- ckpt_dir: null
- save_final_model: true
- resume_from_checkpoint: true
+ ckpt_dir: ???
+ save_final_model: true
+ resume_from_checkpoint: false
save_every_n_steps: 1_000
async_save: true📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| checkpoint: | |
| ckpt_dir: null | |
| save_final_model: true | |
| resume_from_checkpoint: true | |
| save_every_n_steps: 1_000 | |
| async_save: true | |
| checkpoint: | |
| ckpt_dir: ??? | |
| save_final_model: true | |
| resume_from_checkpoint: false | |
| save_every_n_steps: 1_000 | |
| async_save: true |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@bionemo-recipes/recipes/opengenome2_mixtral_native_te/hydra_config/og2_small_thd_moe.yaml`
around lines 44 - 49, The checkpoint block sets ckpt_dir: null while also
enabling save_final_model, resume_from_checkpoint, and save_every_n_steps, which
is invalid; update the ckpt_dir to a valid directory path (e.g., a string like
"checkpoints/run_name") so checkpoints can be written and resumed, or if you
intend not to persist checkpoints, set resume_from_checkpoint: false and
save_final_model: false (and/or remove save_every_n_steps) to make the config
consistent; change the ckpt_dir, resume_from_checkpoint, save_final_model, or
save_every_n_steps fields accordingly to resolve the conflict.
| dna_tokens_tensor = torch.tensor(self.dna_tokens, device=labels.device) | ||
| control_tensor = torch.tensor(self.control_tags, device=labels.device) | ||
| not_dna = ~torch.isin(labels, dna_tokens_tensor) | ||
| is_control = torch.isin(labels, control_tensor) | ||
| labels[(not_dna | is_control) & (labels != -100)] = -100 |
There was a problem hiding this comment.
Preserve control tags when masking labels.
This predicate masks every configured control tag because is_control is ORed into the write mask. That turns control markers into -100 instead of only masking non-DNA, non-control labels.
Suggested fix
- not_dna = ~torch.isin(labels, dna_tokens_tensor)
- is_control = torch.isin(labels, control_tensor)
- labels[(not_dna | is_control) & (labels != -100)] = -100
+ is_dna = torch.isin(labels, dna_tokens_tensor)
+ is_control = torch.isin(labels, control_tensor)
+ labels[(~is_dna & ~is_control) & (labels != -100)] = -100🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/opengenome_collator.py`
around lines 53 - 57, The current mask uses (not_dna | is_control) which
incorrectly sets control tags to -100; change the predicate so only labels that
are neither DNA nor control get masked. Locate dna_tokens_tensor,
control_tensor, not_dna and is_control and replace the assignment logic to
compute a mask that is true when a label is not in dna_tokens_tensor AND not in
control_tensor (e.g., mask = not_dna & ~is_control) and then set labels[mask &
(labels != -100)] = -100 so control tags are preserved.
| @pytest.fixture(scope="session", autouse=True) | ||
| def device_mesh(): | ||
| """Create a re-usable torch process group for testing.""" | ||
| dist_config = DistributedConfig() | ||
| device = torch.device(f"cuda:{dist_config.local_rank}") | ||
| torch.distributed.init_process_group(backend="cpu:gloo,cuda:nccl", device_id=device) | ||
| torch.cuda.set_device(dist_config.local_rank) | ||
|
|
||
| with ( | ||
| mock.patch("torch.distributed.init_process_group", return_value=None), | ||
| mock.patch("torch.distributed.destroy_process_group", return_value=None), | ||
| ): | ||
| yield | ||
|
|
||
| torch.distributed.destroy_process_group() | ||
| torch.cuda.empty_cache() | ||
| torch.cuda.synchronize() |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First, find the actual file mentioned in the review
find . -name "conftest.py" -path "*opengenome2_mixtral_native_te*" 2>/dev/nullRepository: NVIDIA/bionemo-framework
Length of output: 141
🏁 Script executed:
#!/bin/bash
# List the repository structure to understand the layout
ls -la bionemo-recipes/recipes/ 2>/dev/null | head -20Repository: NVIDIA/bionemo-framework
Length of output: 1340
🏁 Script executed:
#!/bin/bash
# Check if the file exists and read the relevant lines
if [ -f "bionemo-recipes/recipes/opengenome2_mixtral_native_te/tests/conftest.py" ]; then
echo "=== File found, reading lines 40-80 ==="
sed -n '40,80p' "bionemo-recipes/recipes/opengenome2_mixtral_native_te/tests/conftest.py" | cat -n
else
echo "File not found at expected path"
find . -type f -name "conftest.py" 2>/dev/null | grep -i opengenome
fiRepository: NVIDIA/bionemo-framework
Length of output: 1243
🏁 Script executed:
#!/bin/bash
# Find all test files in the directory
find bionemo-recipes/recipes/opengenome2_mixtral_native_te/tests -name "test_*.py" -o -name "*_test.py"Repository: NVIDIA/bionemo-framework
Length of output: 299
🏁 Script executed:
#!/bin/bash
# Get more context about the test files and conftest
ls -la bionemo-recipes/recipes/opengenome2_mixtral_native_te/tests/Repository: NVIDIA/bionemo-framework
Length of output: 445
🏁 Script executed:
#!/bin/bash
# Check if there are dataset-only tests mentioned in the review
grep -r "dataset" bionemo-recipes/recipes/opengenome2_mixtral_native_te/tests/ --include="*.py" | head -20Repository: NVIDIA/bionemo-framework
Length of output: 2608
🏁 Script executed:
#!/bin/bash
# Look at the beginning of conftest.py to understand imports and DistributedConfig
head -52 bionemo-recipes/recipes/opengenome2_mixtral_native_te/tests/conftest.py | cat -nRepository: NVIDIA/bionemo-framework
Length of output: 1946
🏁 Script executed:
#!/bin/bash
# Check the test_dataset.py file to see if it uses the device_mesh fixture or depends on distributed setup
head -50 bionemo-recipes/recipes/opengenome2_mixtral_native_te/tests/test_dataset.py | cat -nRepository: NVIDIA/bionemo-framework
Length of output: 2170
🏁 Script executed:
#!/bin/bash
# Check if test_dataset.py has any pytest markers for skipping or conditional execution
grep -n "pytest.mark\|skipif\|xfail" bionemo-recipes/recipes/opengenome2_mixtral_native_te/tests/test_dataset.pyRepository: NVIDIA/bionemo-framework
Length of output: 154
🏁 Script executed:
#!/bin/bash
# Check what DistributedConfig does
find . -name "distributed_config.py" -path "*opengenome2_mixtral*" -exec cat {} \;Repository: NVIDIA/bionemo-framework
Length of output: 1587
🏁 Script executed:
#!/bin/bash
# Check if test_train.py actually uses device_mesh or needs it
grep -n "device_mesh" bionemo-recipes/recipes/opengenome2_mixtral_native_te/tests/test_train.pyRepository: NVIDIA/bionemo-framework
Length of output: 50
🏁 Script executed:
#!/bin/bash
# Check if test_dataset.py uses torch.distributed or device_mesh
grep -n "torch.distributed\|device_mesh\|distributed\|cuda" bionemo-recipes/recipes/opengenome2_mixtral_native_te/tests/test_dataset.py | head -20Repository: NVIDIA/bionemo-framework
Length of output: 1300
🏁 Script executed:
#!/bin/bash
# Check if there's any conditional logic in conftest.py that might skip the device_mesh fixture
wc -l bionemo-recipes/recipes/opengenome2_mixtral_native_te/tests/conftest.pyRepository: NVIDIA/bionemo-framework
Length of output: 142
🏁 Script executed:
#!/bin/bash
# Read the entire conftest.py to check for any skip/conditional logic
cat bionemo-recipes/recipes/opengenome2_mixtral_native_te/tests/conftest.pyRepository: NVIDIA/bionemo-framework
Length of output: 2246
🏁 Script executed:
#!/bin/bash
# Check if test_train.py actually uses the device_mesh fixture or torch.distributed
grep -n "device_mesh" bionemo-recipes/recipes/opengenome2_mixtral_native_te/tests/test_train.pyRepository: NVIDIA/bionemo-framework
Length of output: 50
🏁 Script executed:
#!/bin/bash
# Check if torch.cuda is available or if tests have skip markers for CUDA requirements
grep -n "cuda\|skipif\|xfail" bionemo-recipes/recipes/opengenome2_mixtral_native_te/tests/test_train.py | head -20Repository: NVIDIA/bionemo-framework
Length of output: 276
🏁 Script executed:
#!/bin/bash
# Check which tests actually import or use the device_mesh fixture or torch.distributed
grep -r "device_mesh\|torch.distributed" bionemo-recipes/recipes/opengenome2_mixtral_native_te/tests/ --include="*.py" | grep -v "Binary"Repository: NVIDIA/bionemo-framework
Length of output: 742
🏁 Script executed:
#!/bin/bash
# Verify if torch.cuda operations would fail on CPU-only systems by checking torch version and error handling
# Look for any try-except or skipif markers around CUDA calls
grep -B5 -A5 "torch.cuda\|torch.distributed" bionemo-recipes/recipes/opengenome2_mixtral_native_te/tests/conftest.pyRepository: NVIDIA/bionemo-framework
Length of output: 731
Scope device_mesh to tests that actually need distributed CUDA state.
The session-scoped autouse fixture unconditionally initializes CUDA device and distributed process group for every test in the directory before any mocking occurs. test_dataset.py does not use distributed features and will fail on CPU-only runners at setup time. Remove autouse=True and have only test_train.py and test_train_two_gpu.py explicitly require this fixture.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/tests/conftest.py`
around lines 53 - 69, The fixture device_mesh currently has autouse=True which
forces CUDA/distributed setup for every test; remove autouse=True so it only
runs when requested, keep scope="session" if shared across tests, and then
update tests that need it (test_train.py and test_train_two_gpu.py) to
explicitly request it either by adding device_mesh as a test parameter or by
using `@pytest.mark.usefixtures`("device_mesh"); ensure the fixture name
(device_mesh) and any use of DistributedConfig/local_rank remain unchanged so
the existing init logic is reused only for tests that opt in.
| dist_config: DistributedConfig, | ||
| ) -> dict: | ||
| """Run validation and aggregate metrics across ranks.""" | ||
| model.eval() |
There was a problem hiding this comment.
Restore the previous train/eval mode in a finally.
run_validation() is called from a try/except in the training loop. If an exception escapes this function, the model stays in eval() mode and subsequent training continues with train-time behavior disabled.
Suggested fix
- model.eval()
+ was_training = model.training
+ model.eval()
+ try:
+ ...
+ finally:
+ model.train(was_training)
-
- model.train()Also applies to: 104-104
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/validation.py` at line
40, run_validation() currently calls model.eval() (lines showing model.eval() at
40 and 104) but never restores the previous training mode on error; wrap the
evaluation block so you capture the prior state (e.g., was_training =
model.training), set model.eval(), then in a finally clause restore the state
(e.g., model.train(was_training)) to ensure training/eval mode is returned
regardless of exceptions; apply this pattern in the run_validation() function
around both evaluation sections referencing model.eval().
| if num_evaluated == 0: | ||
| raise RuntimeError(f"All {num_batches} validation batches failed on rank {dist_config.rank}") | ||
|
|
||
| torch.distributed.barrier() | ||
|
|
||
| loss_tensor = torch.tensor( | ||
| [total_loss, float(total_tokens), float(num_evaluated), total_weighted_loss], | ||
| device=device, | ||
| ) | ||
| torch.distributed.all_reduce(loss_tensor) |
There was a problem hiding this comment.
Keep all ranks on the same collective path.
A rank that raises on Line 74 skips the collective path, while healthier ranks still enter the barrier() / all_reduce() on Lines 76-82. That can deadlock validation as soon as an eval shard is empty, or all batches fail, on only part of the world.
Suggested fix
- if num_evaluated == 0:
- raise RuntimeError(f"All {num_batches} validation batches failed on rank {dist_config.rank}")
-
- torch.distributed.barrier()
+ success = torch.tensor([1 if num_evaluated > 0 else 0], device=device, dtype=torch.int32)
+ torch.distributed.all_reduce(success, op=torch.distributed.ReduceOp.MIN)
+ if success.item() == 0:
+ raise RuntimeError("Validation failed on at least one rank")
loss_tensor = torch.tensor(
[total_loss, float(total_tokens), float(num_evaluated), total_weighted_loss],
device=device,
)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if num_evaluated == 0: | |
| raise RuntimeError(f"All {num_batches} validation batches failed on rank {dist_config.rank}") | |
| torch.distributed.barrier() | |
| loss_tensor = torch.tensor( | |
| [total_loss, float(total_tokens), float(num_evaluated), total_weighted_loss], | |
| device=device, | |
| ) | |
| torch.distributed.all_reduce(loss_tensor) | |
| success = torch.tensor([1 if num_evaluated > 0 else 0], device=device, dtype=torch.int32) | |
| torch.distributed.all_reduce(success, op=torch.distributed.ReduceOp.MIN) | |
| if success.item() == 0: | |
| raise RuntimeError("Validation failed on at least one rank") | |
| loss_tensor = torch.tensor( | |
| [total_loss, float(total_tokens), float(num_evaluated), total_weighted_loss], | |
| device=device, | |
| ) | |
| torch.distributed.all_reduce(loss_tensor) |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/validation.py` around
lines 73 - 82, The RuntimeError raised when num_evaluated == 0 causes some ranks
to skip the subsequent collectives (torch.distributed.barrier and all_reduce)
and can deadlock; instead, replace the immediate raise with a per-rank failure
indicator (e.g., local_failed = 1 if num_evaluated == 0 else 0) and participate
in the same barrier/all_reduce path using the existing loss_tensor/all_reduce so
all ranks synchronize; after the collective(s) check the aggregated result
(e.g., sum of num_evaluated or sum of local_failed) and only then raise the
RuntimeError (including combined context such as total_evaluated and involved
ranks) or handle the failure, ensuring symbols num_evaluated, num_batches,
dist_config.rank, torch.distributed.barrier, loss_tensor, and
torch.distributed.all_reduce are used to locate and implement the change.
| if args.fp8_stats_config.enabled: | ||
| initialize_fp8_debugging(dist_config, **args.fp8_stats_config, fp8_enabled=args.fp8_config.enabled) | ||
|
|
||
| device_mesh = init_device_mesh("cuda", mesh_shape=(dist_config.world_size,), mesh_dim_names=("dp",)) |
There was a problem hiding this comment.
No ep demo? we have the tests for that in the models/ directory
Signed-off-by: Timur Rvachov <trvachov@nvidia.com>
Signed-off-by: Timur Rvachov <trvachov@nvidia.com>
|
@coderabbitai some things should be fix -- can you re-review? and check the comments you posted from before? |
|
/ok to test c2fd1d5 |
|
✅ Actions performedFull review triggered. |
There was a problem hiding this comment.
Actionable comments posted: 18
♻️ Duplicate comments (16)
bionemo-recipes/recipes/opengenome2_mixtral_native_te/validation.py (2)
40-104:⚠️ Potential issue | 🟠 MajorRestore the previous train/eval mode in a
finally.If any exception escapes the loop or the later collectives, the model stays in
eval()and the next training step silently runs with train-time behavior disabled.Possible fix
- model.eval() + was_training = model.training + model.eval() + try: + ... + finally: + model.train(was_training) ... - model.train()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/validation.py` around lines 40 - 104, Wrap the validation logic so the model's original training/eval state is always restored in a finally block: record the prior state with model.training before calling model.eval(), then run the loop, StopIteration handling, the num_evaluated check, barrier and all_reduce calls (the code that references model, model.eval(), num_evaluated, torch.distributed.barrier(), loss_tensor, and torch.distributed.all_reduce) inside a try, and in finally call model.train() only if the original state indicated training was True (or call model.train()/model.eval() to restore the saved state). This guarantees model.train() (or restoring the saved state) runs even if exceptions escape the loop or collectives.
73-82:⚠️ Potential issue | 🔴 CriticalKeep all ranks on the same collective path.
Line 73 raises before the barrier/all-reduce path, so one empty or fully failed validation shard can leave the healthier ranks hanging in the collectives.
Possible fix
- if num_evaluated == 0: - raise RuntimeError(f"All {num_batches} validation batches failed on rank {dist_config.rank}") - - torch.distributed.barrier() + local_success = torch.tensor([int(num_evaluated > 0)], device=device, dtype=torch.int32) + torch.distributed.all_reduce(local_success, op=torch.distributed.ReduceOp.MIN) + if local_success.item() == 0: + raise RuntimeError("Validation failed on at least one rank")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/validation.py` around lines 73 - 82, The code currently raises RuntimeError when num_evaluated == 0 before participating in the collective operations, which can deadlock other ranks; instead, do not raise early—create the loss_tensor containing [total_loss, total_tokens, num_evaluated, total_weighted_loss] on every rank, call torch.distributed.barrier() and torch.distributed.all_reduce(loss_tensor) so all ranks follow the same collective path, then check the aggregated num_evaluated (from loss_tensor[2] after reduction) and raise the RuntimeError if the global count is zero (you can raise on all ranks or only rank 0 but only after the all_reduce). Ensure you reference and update the existing variables loss_tensor, num_evaluated, total_loss, total_tokens, total_weighted_loss and the barrier/all_reduce calls when making this change.bionemo-recipes/recipes/opengenome2_mixtral_native_te/collator.py (2)
550-554:⚠️ Potential issue | 🟠 MajorDon't drop
_prefetch_threadwhile scatter may still be blocked.If
join(timeout=10)times out, teardown clears the only handle to a live background collective. That makes shutdown look safe when a thread is still insidescatter_object_list().Possible fix
def close(self): """Stop the prefetch thread. Must be called before destroy_process_group().""" if self._prefetch_thread is not None: self._prefetch_thread.join(timeout=10) - self._prefetch_thread = None + if self._prefetch_thread.is_alive(): + raise RuntimeError("Prefetch thread did not stop before destroy_process_group().") + self._prefetch_thread = None🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/collator.py` around lines 550 - 554, close() currently clears the _prefetch_thread reference even if join(timeout=10) times out, which can drop the only handle to a thread that may be blocked inside scatter_object_list(); change close() so that after calling self._prefetch_thread.join(timeout=10) you check self._prefetch_thread.is_alive() and only set self._prefetch_thread = None once the thread is confirmed stopped (or loop/wait/retry join with a bounded backoff), and if the thread remains alive emit a warning/error and avoid dropping the reference so the background collective cannot be lost; use the existing _prefetch_thread and scatter_object_list() symbols to locate where to implement this behavior.
416-418:⚠️ Potential issue | 🟠 MajorMask THD sequence boundaries before building shifted labels.
For packed THD batches, the last token of one sequence currently predicts the first token of the next packed sequence. That reintroduces cross-sequence loss even though the attention path keeps the sequences separate.
Possible fix
if self.is_causal_lm: labels = torch.nn.functional.pad(batch["labels"], (0, 1), value=-100) - batch["labels"] = labels[..., 1:].contiguous() + shifted_labels = labels[..., 1:].contiguous() + if self.qkv_format == "thd": + cu_seqlens = batch.get("cu_seq_lens_q_padded", batch["cu_seq_lens_q"]) + shifted_labels[..., cu_seqlens[1:-1] - 1] = -100 + batch["labels"] = shifted_labels🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/collator.py` around lines 416 - 418, The shifted-label logic in collator.py (inside the self.is_causal_lm branch that pads and then shifts batch["labels"]) must mask boundaries between packed THD sequences so the last token of one sequence does not predict the first token of the next; compute a boundary mask from the per-token sequence identifier array (e.g., batch["sequence_ids"] or equivalent per-token segment id available in the batch) as boundary_mask = batch["sequence_ids"][...,1:] != batch["sequence_ids"][...,:-1], then after padding but before assigning batch["labels"] = labels[...,1:].contiguous(), set labels[...,1:][boundary_mask] = -100 (or use the label ignore value) so cross-sequence positions are excluded from loss; update the code around self.is_causal_lm handling of batch["labels"] accordingly.bionemo-recipes/recipes/mixtral_native_te/modeling_mixtral_te.py (2)
162-178:⚠️ Potential issue | 🟠 Major
init_empty_weights()still only works on the LM wrapper.
NVMixtralModelkeepsembed_tokensandrotary_embonself, not onself.model, so calling this helper on the base model still raisesAttributeError.Possible fix
def init_empty_weights(self): """Handles moving the model from the meta device to the cuda device and initializing the weights.""" + base_model = self.model if hasattr(self, "model") else self + for module in self.modules(): if hasattr(module, "reset_parameters"): module.reset_parameters() ... - self.model.embed_tokens.to_empty(device="cuda") - self.model.embed_tokens.apply(self._init_weights) + base_model.embed_tokens.to_empty(device="cuda") + base_model.embed_tokens.apply(self._init_weights) - self.model.rotary_emb.inv_freq = LlamaRotaryEmbedding(config=self.model.config).inv_freq.to("cuda") + base_model.rotary_emb.inv_freq = LlamaRotaryEmbedding(config=base_model.config).inv_freq.to("cuda")🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/mixtral_native_te/modeling_mixtral_te.py` around lines 162 - 178, init_empty_weights assumes embed_tokens and rotary_emb live on self.model but NVMixtralModel stores them on self; change init_empty_weights to first prefer attributes on self (check hasattr(self, "embed_tokens") and hasattr(self, "rotary_emb")) and fall back to self.model when absent, then perform the same to_empty, apply(self._init_weights) and rotary inv_freq assignment on the selected object(s); keep the existing modules() loops (including NVMixtralSparseMoeBlock._restack_from_views) unchanged but ensure the rotary LlamaRotaryEmbedding uses the correct config source (self.config if present else self.model.config).
586-607:⚠️ Potential issue | 🟠 MajorDerive packing/cache metadata without assuming
input_idssurvives.This block still dereferences
input_idsforpadded_seq_lenand batch size, and after Line 599 the BSHD mask is 4-D so theattention_mask.shape == input_ids.shapebranch never fires.inputs_embedsprefills still crash, and cached BSHD prefills collapse prompt lengths to1.Possible fix
+ batch_size = input_ids.shape[0] if input_ids is not None else inputs_embeds.shape[0] + padded_seq_len = input_ids.shape[1] if input_ids is not None else inputs_embeds.shape[1] + prompt_lengths = attention_mask.sum(dim=1).tolist() if attention_mask is not None and attention_mask.dim() == 2 else None + if should_pack_inputs: assert attention_mask is not None, "Attention mask is required when packing BSHD inputs." - batch_size = hidden_states.size(0) - padded_seq_len = input_ids.size(1) hidden_states, indices, cu_seqlens, max_seqlen, _ = _unpad_input(hidden_states, attention_mask) kwargs["cu_seq_lens_q"] = kwargs["cu_seq_lens_k"] = cu_seqlens kwargs["max_length_q"] = kwargs["max_length_k"] = max_seqlen ... if isinstance(past_key_values, InferenceParams): - lengths = ( - attention_mask.sum(dim=1).tolist() - if attention_mask.shape == input_ids.shape - else [1] * input_ids.shape[0] - ) + lengths = prompt_lengths if prompt_lengths is not None else [1] * batch_size past_key_values.pre_step(OrderedDict(zip(list(range(len(lengths))), lengths)))🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/mixtral_native_te/modeling_mixtral_te.py` around lines 586 - 607, The code currently dereferences input_ids and compares attention_mask to input_ids shape which fails for packed/BSHD flows; change logic in the should_pack_inputs block and the InferenceParams lengths calculation to derive batch_size and padded_seq_len from hidden_states (use hidden_states.size(0) and hidden_states.size(1) or cu_seqlens/max_seqlen returned by _unpad_input) and compute prompt lengths from attention_mask by summing over the sequence dimension(s) (handle both 2D HF masks and 4D BSHD masks by reducing across the last dim), and ensure lengths are computed before you convert HF mask to TE boolean; update the past_key_values.pre_step call to use these derived lengths (not input_ids) and keep kwargs["cu_seq_lens_q"]/["cu_seq_lens_k"] and kwargs["max_length_q"]/["max_length_k"] set from _unpad_input results so inputs_embeds prefill and cached BSHD prefills use correct prompt lengths.bionemo-recipes/recipes/opengenome2_mixtral_native_te/hydra_config/og2_small_thd_moe.yaml (2)
44-49:⚠️ Potential issue | 🟠 MajorThe checkpoint block is internally inconsistent.
ckpt_dir: nulldisables checkpoint I/O, sosave_final_model,resume_from_checkpoint, andsave_every_n_stepsdo nothing here. Either require a directory or turn those flags off.Suggested fix
checkpoint: - ckpt_dir: null + ckpt_dir: ??? save_final_model: true - resume_from_checkpoint: true + resume_from_checkpoint: false save_every_n_steps: 1_000 async_save: true🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/hydra_config/og2_small_thd_moe.yaml` around lines 44 - 49, The checkpoint block is inconsistent: ckpt_dir is set to null while save_final_model, resume_from_checkpoint, and save_every_n_steps are enabled; update the checkpoint configuration in the checkpoint block to be consistent by either (A) providing a valid directory path for ckpt_dir so checkpoint I/O is enabled and the options save_final_model, resume_from_checkpoint, save_every_n_steps, and async_save take effect, or (B) disabling the unused options (set save_final_model: false, resume_from_checkpoint: false, and save_every_n_steps: null/0) if you intend to keep ckpt_dir: null; adjust the checkpoint keys in the YAML (ckpt_dir, save_final_model, resume_from_checkpoint, save_every_n_steps, async_save) accordingly.
23-27:⚠️ Potential issue | 🟠 MajorDon't hard-code cluster-local data paths in the checked-in preset.
These
/data/opengenome2/...values only work on one filesystem. Keep them as required overrides or parameterize them so the preset stays portable.Suggested fix
load_dataset_kwargs: path: json - data_files: /data/opengenome2/json/pretraining_or_both_phases/metagenomes/data_metagenomics_train_*.jsonl.gz + data_files: ??? split: train streaming: true @@ - data_path: /data/opengenome2/json/pretraining_or_both_phases/metagenomes/data_metagenomics_valid_chunk1.jsonl.gz + data_path: ???Also applies to: 51-55
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/hydra_config/og2_small_thd_moe.yaml` around lines 23 - 27, The preset currently hard-codes cluster-local dataset paths in load_dataset_kwargs (the data_files value), which makes the checked-in og2_small_thd_moe.yaml non-portable; change the hard-coded path to a configurable placeholder or parameter (e.g., ${DATA_PATH} or a required override) and update load_dataset_kwargs.data_files (and the similar entries at the other occurrence around lines 51-55) to reference that parameter or an environment variable so callers must supply the actual filesystem path at runtime instead of relying on /data/opengenome2/....bionemo-recipes/recipes/opengenome2_mixtral_native_te/opengenome_collator.py (1)
52-57:⚠️ Potential issue | 🟠 MajorPreserve control tags when masking labels.
This predicate still masks every configured control tag because
is_controlis ORed into the write mask. Control markers end up as-100instead of remaining trainable.Suggested fix
- not_dna = ~torch.isin(labels, dna_tokens_tensor) - is_control = torch.isin(labels, control_tensor) - labels[(not_dna | is_control) & (labels != -100)] = -100 + is_dna = torch.isin(labels, dna_tokens_tensor) + is_control = torch.isin(labels, control_tensor) + labels[(~is_dna & ~is_control) & (labels != -100)] = -100🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/opengenome_collator.py` around lines 52 - 57, The masking logic in mask_degenerate_bases currently ORs in is_control so configured control tags get masked; update the predicate used to set -100 so control tags are preserved by only masking labels that are non-DNA and not a control. Concretely, in the block using dna_tokens_tensor, control_tensor, not_dna and is_control, change the mask construction to exclude is_control (i.e., require not_dna AND NOT is_control) before applying the (labels != -100) check so control markers remain trainable.bionemo-recipes/recipes/opengenome2_mixtral_native_te/dataset.py (1)
296-311:⚠️ Potential issue | 🟠 MajorStill unresolved: keep the token budget in sync with padding granularity.
The collator can pad each packed sequence up to
pad_sequences_to_be_divisible_by, butTokenPackingDatasetis still budgeting batches from raw lengths. A batch that looks full to the packer can therefore exceedtoken_micro_batch_sizeafter collation.Suggested fix
TokenPackingDataset( tokenized_dataset, max_tokens_per_batch=token_micro_batch_size, split_samples=split_samples_in_token_packing, + pad_sequences_to_be_divisible_by=pad_sequences_to_be_divisible_by, ),🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/dataset.py` around lines 296 - 311, TokenPackingDataset is budgeting batches from raw token lengths while GenomicDataCollator/DataCollatorWithFlattening pads each packed sequence to pad_sequences_to_be_divisible_by, so batches can exceed token_micro_batch_size after collation; fix by passing pad_sequences_to_be_divisible_by into TokenPackingDataset (or adding a parameter like pad_divisor) and change its batching logic to compute effective length = ceil(raw_length / pad_divisor) * pad_divisor when summing tokens for max_tokens_per_batch (token_micro_batch_size), ensuring TokenPackingDataset honors the same padding granularity used by GenomicDataCollator.bionemo-recipes/recipes/mixtral_native_te/perf_logger.py (1)
201-210:⚠️ Potential issue | 🟠 MajorStill unresolved: finalize the tail window and end debug state on every rank.
finish()still drops the last partial loss window, so short runs can returninfor stalemin_loss, and the early return means worker ranks never calldebug_api.end_debug().Suggested fix
def finish(self): """Finish the logger and close the progress bar.""" + if self.grad_acc_step_count > 0: + avg_loss = self.running_loss / self.grad_acc_step_count + self.min_loss = torch.minimum(self.min_loss, avg_loss) + + if self.fp8_stats_enabled and debug_api is not None: + debug_api.end_debug() + if not self._dist_config.is_main_process(): return wandb.finish() self._progress_bar.close() - - if self.fp8_stats_enabled: - debug_api.end_debug()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/mixtral_native_te/perf_logger.py` around lines 201 - 210, The finish() method currently returns early for non-main ranks, causing the final (partial) loss window to be dropped and worker ranks to never call debug_api.end_debug(); modify finish() so that finalizing the tail loss window and calling debug_api.end_debug() always run on every rank (i.e., move/ensure calls that finalize the partial loss window and the debug_api.end_debug() invocation outside the is_main_process() guard), while keeping wandb.finish() and self._progress_bar.close() restricted to the main process; use the existing symbols finish(), _dist_config.is_main_process(), fp8_stats_enabled, and debug_api.end_debug() and any existing loss-window flush/finalize routine to implement this.bionemo-recipes/recipes/mixtral_native_te/dataset.py (1)
251-266:⚠️ Potential issue | 🟠 MajorKeep the token budget aligned with the padding granularity.
The collator can pad each packed sequence up to
pad_sequences_to_be_divisible_by, butTokenPackingDatasetis still budgeting batches from raw lengths. A batch that looks full to the packer can therefore exceedtoken_micro_batch_sizeafter collation.Suggested fix
TokenPackingDataset( tokenized_dataset, max_tokens_per_batch=token_micro_batch_size, split_samples=split_samples_in_token_packing, + pad_sequences_to_be_divisible_by=pad_sequences_to_be_divisible_by, ),🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/mixtral_native_te/dataset.py` around lines 251 - 266, TokenPackingDataset is budgeting batches from raw lengths but collation pads up to pad_sequences_to_be_divisible_by, so compute an adjusted max_tokens_per_batch that is the largest multiple of pad_sequences_to_be_divisible_by not exceeding token_micro_batch_size and pass that to TokenPackingDataset; specifically, before creating the dataloader, calculate pad = pad_sequences_to_be_divisible_by or 1, then adjusted_max = (token_micro_batch_size // pad) * pad (fall back to token_micro_batch_size if pad is 0/None), and replace max_tokens_per_batch=token_micro_batch_size with max_tokens_per_batch=adjusted_max in the TokenPackingDataset constructor to keep the token budget aligned with DataCollatorWithFlattening padding.bionemo-recipes/recipes/mixtral_native_te/collator.py (2)
416-418:⚠️ Potential issue | 🟠 MajorTHD shift_labels crosses packed sequence boundaries.
For causal LM with THD-packed batches,
labels[..., 1:]makes the last prediction of one sequence supervise the first token of the next packed sequence, introducing cross-sequence loss leakage.The boundary positions (derived from
cu_seq_lens_q_padded) should be masked to-100after the shift:Suggested fix
if self.is_causal_lm: labels = torch.nn.functional.pad(batch["labels"], (0, 1), value=-100) - batch["labels"] = labels[..., 1:].contiguous() + shifted_labels = labels[..., 1:].contiguous() + if self.qkv_format == "thd": + cu_seqlens = batch.get("cu_seq_lens_q_padded", batch["cu_seq_lens_q"]) + # Mask the last position of each sequence (which now targets the next sequence's first token) + shifted_labels[..., cu_seqlens[1:-1] - 1] = -100 + batch["labels"] = shifted_labels🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/mixtral_native_te/collator.py` around lines 416 - 418, The current causal-LM label shift (in the is_causal_lm branch where you pad batch["labels"] and then do labels[..., 1:]) leaks targets across packed sequence boundaries; after shifting you must mask boundary positions to -100 using the packed-sequence lengths. Modify the block that pads and shifts labels (the code using batch["labels"] and labels[..., 1:]) to also read batch["cu_seq_lens_q_padded"] (or the equivalent cu_seq_lens array), compute the absolute indices of sequence ends in the packed tensor, and set those positions in labels to -100 so the last token of one packed sequence does not supervise the first token of the next. Ensure you update batch["labels"] with the masked, contiguous tensor.
550-554:⚠️ Potential issue | 🟠 Major
close()clears thread handle even when join times out.If
join(timeout=10)returns while the prefetch thread is still blocked insidescatter_object_list(), setting_prefetch_thread = Nonemakes teardown appear complete when a distributed collective may still be live. This can cause hangs or crashes whendestroy_process_group()is called.Suggested fix
def close(self): """Stop the prefetch thread. Must be called before destroy_process_group().""" if self._prefetch_thread is not None: self._prefetch_thread.join(timeout=10) + if self._prefetch_thread.is_alive(): + logger.warning("Prefetch thread did not stop within timeout; distributed collective may still be live.") + # Don't clear the handle so callers know the thread is still active + return self._prefetch_thread = None🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/mixtral_native_te/collator.py` around lines 550 - 554, The close() method currently clears self._prefetch_thread even if join(timeout=10) times out, which can hide a still-running thread (e.g., blocked in scatter_object_list()) and lead to destroy_process_group() races; change close() to call self._prefetch_thread.join(timeout=10) and then check self._prefetch_thread.is_alive() — only set self._prefetch_thread = None if not alive, otherwise log/error/raise (or return False) to indicate teardown didn’t complete so callers won’t call destroy_process_group() prematurely; reference close(), _prefetch_thread, join(timeout=10), scatter_object_list(), and destroy_process_group() in your change.bionemo-recipes/recipes/opengenome2_mixtral_native_te/checkpoint.py (1)
67-70:⚠️ Potential issue | 🟠 MajorAsync checkpoint directories can be picked up before save completes.
get_latest_checkpoint()selects the higheststep_*directory without verifying that the async save (dcp_async_save) has completed. Sincesave_checkpoint_fsdp2()creates thestep_{step}directory before initiating the async save and returns immediately, a crash during the async write leaves a partial checkpoint that the next run will attempt to load, causingdcp_load()to fail.Consider one of these mitigations:
- Write a completion marker file (e.g.,
.complete) after the async future completes, and haveget_latest_checkpoint()ignore directories without it.- Wait for any pending
_ckpt_futures["fsdp2"]at the start ofget_latest_checkpoint()before scanning directories.Also applies to: 89-107
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/checkpoint.py` around lines 67 - 70, get_latest_checkpoint() currently can pick up a step_* directory before its async write finishes (save_checkpoint_fsdp2 starts dcp_async_save and returns immediately), so change get_latest_checkpoint() to first check _ckpt_futures for a pending future for the relevant strategy (e.g., _ckpt_futures.get("fsdp2")) and await it (with try/except to catch and log errors) before scanning directories; after awaiting, remove/clear the entry in _ckpt_futures so subsequent calls don’t wait on a completed future. This ensures any in-flight dcp_async_save (from save_checkpoint_fsdp2 which uses dcp_async_save) completes (or fails) before get_latest_checkpoint() selects a checkpoint.bionemo-recipes/recipes/mixtral_native_te/checkpoint.py (1)
45-48:⚠️ Potential issue | 🟠 MajorAsync checkpoint directories can be picked up before save completes.
Same issue as in the OpenGenome2 variant:
get_latest_checkpoint()finds the higheststep_*directory without verifying the async save has finished. A crash duringdcp_async_saveleaves a partial checkpoint that the next run will try to load.Consider writing a completion marker after the async future completes, or waiting for pending futures before scanning directories.
Also applies to: 67-85
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/mixtral_native_te/checkpoint.py` around lines 45 - 48, get_latest_checkpoint() can pick up incomplete async saves because _ckpt_futures stores in-flight futures from dcp_async_save; modify the flow so either (A) after each dcp_async_save future in _ckpt_futures completes you atomically write a completion marker (e.g. "<strategy>.ckpt_complete" or touch a ".done" file) next to the step_* directory and only consider directories with that marker in get_latest_checkpoint(), or (B) before scanning directories in get_latest_checkpoint() await any pending futures in _ckpt_futures for the relevant strategy to ensure saves finished; update the code paths that add/remove entries in _ckpt_futures so the marker is written and futures are cleared when the future resolves, and change get_latest_checkpoint() to filter by the marker or wait on _ckpt_futures accordingly.
🧹 Nitpick comments (11)
bionemo-recipes/recipes/opengenome2_mixtral_native_te/optimizer.py (1)
26-55: LGTM with a minor docstring suggestion.The weight decay filtering logic correctly follows the Megatron pattern (excluding biases, 1D params like LayerNorm, and optionally embeddings). The implementation is clean and well-logged.
Consider expanding the docstring to document the return structure more explicitly:
📝 Optional: Enhanced docstring
def get_parameter_groups_with_weight_decay( model: torch.nn.Module, weight_decay: float, skip_embeddings: bool = False, ) -> list[dict]: - """Create optimizer parameter groups with Megatron-style weight decay filtering.""" + """Create optimizer parameter groups with Megatron-style weight decay filtering. + + Args: + model: The model whose parameters to group. + weight_decay: Weight decay value for the decay group. + skip_embeddings: If True, exclude embedding parameters from weight decay. + + Returns: + A list of two dicts: [{"params": [...], "weight_decay": weight_decay}, + {"params": [...], "weight_decay": 0.0}]. + """🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/optimizer.py` around lines 26 - 55, The docstring for get_parameter_groups_with_weight_decay is minimal; update it to explicitly describe the function behavior and return structure: state that it splits model.named_parameters() into two groups (decay and no_decay) based on biases, 1-D params, and optional embedding names (skip_embeddings flag), and that it returns a list of two dicts of the form {"params": List[Tensor], "weight_decay": float} with the first dict using the provided weight_decay and the second using 0.0; mention the meaning of the skip_embeddings argument and that parameters with requires_grad==False are ignored to help future readers and callers.bionemo-recipes/recipes/opengenome2_mixtral_native_te/requirements.txt (1)
1-12: Consider version pinning for reproducible builds.Most dependencies lack version constraints, which could lead to build inconsistencies over time. The
torchao!=0.14.0exclusion suggests awareness of version-specific issues. Consider pinning versions or adding version ranges for critical dependencies liketorch,transformers, andtransformer_engine[pytorch]to ensure reproducible builds.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/requirements.txt` around lines 1 - 12, The requirements file currently lists many packages without pinned versions (e.g., torch, transformers, transformer_engine[pytorch], torchao), which risks non-reproducible builds; update requirements.txt to pin or constrain key packages by adding explicit versions or compatible version ranges (for example pin torch to a tested X.Y.Z or a range compatible with CUDA, pin transformers to a known working release, specify transformer_engine[pytorch] version that matches the pinned torch, and replace the torchao!=0.14.0 exclusion with an explicit allowed version or range), and ensure any git-installed packages (nvdlfw_inspect) are pinned to a commit or tag for reproducibility.bionemo-recipes/recipes/opengenome2_mixtral_native_te/scheduler.py (1)
21-44: Near-duplicate code with minimal docstring.This scheduler implementation is nearly identical to
bionemo-recipes/recipes/mixtral_native_te/scheduler.py. Consider:
- Extracting this to a shared module to avoid duplication
- Expanding the docstring to match the more detailed version in the other file
The same edge case with
num_warmup_steps=0applies here as noted for the other scheduler file.📝 Proposed docstring expansion
def get_cosine_annealing_schedule_with_warmup( optimizer, num_warmup_steps=2_000, num_decay_steps=500_000, min_lr_ratio=0.0, last_epoch=-1, ): - """Cosine annealing scheduler with warmup.""" + """Cosine annealing scheduler with warmup. + + The learning rate is linearly warmed up from 0 to max_lr over num_warmup_steps, + then follows a cosine annealing schedule from max_lr to min_lr over num_decay_steps. + After warmup_steps + decay_steps, the learning rate remains at min_lr. + + Args: + optimizer: The optimizer to schedule. + num_warmup_steps: Number of warmup steps. + num_decay_steps: Number of decay steps after warmup. + min_lr_ratio: Minimum learning rate as a ratio of the initial learning rate. + last_epoch: The index of the last epoch. Default: -1. + + Returns: + A LambdaLR scheduler configured with the cosine warmup schedule. + """As per coding guidelines, Python files should follow Google-style docstrings (pydocstyle convention).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/scheduler.py` around lines 21 - 44, The get_cosine_annealing_schedule_with_warmup function duplicates logic from mixtral_native_te/scheduler.py and has a weak docstring and the same num_warmup_steps=0 edge case; refactor by moving the common implementation (lr_lambda and scheduler creation using LambdaLR) into a shared utility module (e.g., a scheduler factory) and import it from both recipe packages, expand the docstring to Google-style describing parameters, return value and edge cases, and fix the warmup edge-case in lr_lambda so it correctly handles num_warmup_steps == 0 (use strict comparisons and ensure division by zero cannot occur) while keeping references to the existing symbols get_cosine_annealing_schedule_with_warmup, lr_lambda, LambdaLR and optimizer to locate and replace the code.bionemo-recipes/recipes/opengenome2_mixtral_native_te/Dockerfile (1)
1-9: Consider adding a non-root user for improved security posture.The container runs as root by default. While this is common for GPU training workloads, adding a non-root user would improve security. If root is required for training, document the reasoning.
Additionally, consider reordering to optimize layer caching: move
COPY . .beforepip installwould currently invalidate the pip cache on code changes. However, since requirements.txt is bind-mounted, the current order is actually correct for this pattern.🛡️ Optional: Add non-root user
# syntax=docker/dockerfile:1.4 FROM nvcr.io/nvidia/pytorch:26.02-py3 +# Create non-root user (optional, may require adjustments for GPU access) +ARG UID=1000 +ARG GID=1000 +RUN groupadd -g ${GID} bionemo && useradd -m -u ${UID} -g ${GID} bionemo + RUN --mount=type=cache,target=/root/.cache/pip \ --mount=type=bind,source=requirements.txt,target=/requirements.txt \ PIP_CONSTRAINT= pip install -r /requirements.txt WORKDIR /workspace/bionemo COPY . . + +# Uncomment to run as non-root (if GPU access permits) +# USER bionemo🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/Dockerfile` around lines 1 - 9, Add a non-root user in the Dockerfile to improve security: create a dedicated user/group (e.g., via adduser/addgroup or useradd), chown the WORKDIR (/workspace/bionemo) and copied files (after COPY . .) to that user, and switch to it with USER before running the workload; if root is required for GPU setup, add a brief comment in the Dockerfile explaining why and keep the default. Keep the current RUN installation order (requirements bind-mount) as-is to preserve pip caching semantics.bionemo-recipes/recipes/mixtral_native_te/requirements.txt (1)
1-11: Consider pinning core package versions (or using a constraints file).This will make recipe environments deterministic and easier to debug across CI/runs.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/mixtral_native_te/requirements.txt` around lines 1 - 11, The requirements file lists unpinned core packages (datasets, hydra-core, torch, torchao, torchdata, torchmetrics, tqdm, transformer_engine[pytorch], transformers, wandb, zstandard) which makes builds non-deterministic; pin explicit versions for these packages or add and reference a constraints file to freeze transitive dependency versions. Update requirements.txt to either include exact version specifiers (e.g., package==x.y.z) for each core dependency or create a constraints file (constraints.txt) with concrete pins and install with pip using -c constraints.txt so CI and local runs reproduce identical environments; ensure you pick compatible torch/torchvision/transformers combinations and document the chosen versions.bionemo-recipes/recipes/mixtral_native_te/Dockerfile (1)
4-6: Make the requirements bind mount path explicit to this recipe.
source=requirements.txtdepends on the build context. If the build runs from repo root, this can bind the wrong file (or fail). Use the recipe-local path explicitly to avoid context-dependent behavior.Proposed change
RUN --mount=type=cache,target=/root/.cache/pip \ - --mount=type=bind,source=requirements.txt,target=/requirements.txt \ + --mount=type=bind,source=bionemo-recipes/recipes/mixtral_native_te/requirements.txt,target=/requirements.txt \ PIP_CONSTRAINT= pip install -r /requirements.txt🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/mixtral_native_te/Dockerfile` around lines 4 - 6, The bind mount in the RUN line (RUN --mount=type=bind,source=requirements.txt,target=/requirements.txt) uses a context-dependent bare file name; change the bind mount source to an explicit recipe-local requirements file (i.e., replace source=requirements.txt with an explicit recipe-local path) so the build always mounts the correct requirements file for the mixtral_native_te recipe.bionemo-recipes/recipes/mixtral_native_te/tests/test_train.py (1)
59-77: Consider adding docstring for consistency.The
test_sanity_convergence_fsdp2_te_thdtest lacks a docstring whiletest_sanity_convergence_fsdp2_te_bshd_grad_acchas one. For consistency, add a brief docstring documenting the THD attention format being tested.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/mixtral_native_te/tests/test_train.py` around lines 59 - 77, Add a short docstring to the test function test_sanity_convergence_fsdp2_te_thd describing that it verifies convergence using the THD attention input format (matching the style of test_sanity_convergence_fsdp2_te_bshd_grad_acc); update the function definition to include a one-line string like "Verify sanity convergence when using THD attention input format." placed immediately after the def line so linters and test docs remain consistent.bionemo-recipes/recipes/opengenome2_mixtral_native_te/README.md (1)
54-54: Consider documentingHF_TOKENrequirement.The README shows
hf downloadcommands for the OpenGenome2 dataset and references Hugging Face dataset access, but only mentionsWANDB_KEY. If the dataset or tokenizer requires authentication, addHF_TOKENto the documentation for consistency with themixtral_native_teREADME.📝 Suggested update
-Use `WANDB_KEY` for Weights & Biases logging. +Use `WANDB_KEY` for Weights & Biases logging and `HF_TOKEN` for Hugging Face dataset access.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/README.md` at line 54, The README currently only references WANDB_KEY for logging but omits HF_TOKEN required for Hugging Face-authenticated downloads; update the README (same section that mentions `WANDB_KEY` and the `hf download` commands) to document that users must set an `HF_TOKEN` environment variable (or provide an equivalent auth method) when downloading the OpenGenome2 dataset or tokenizer, give a short example export line (`export HF_TOKEN=...`) and reference the existing `mixtral_native_te` README as a template for wording and placement.bionemo-recipes/recipes/mixtral_native_te/tests/test_train_two_gpu.py (1)
46-60: Consider adding explicit success assertion for clarity.The test relies on
run_train_cmdnot callingpytest.fail()to indicate success. For readability, consider returning a boolean or the result fromrun_train_cmdand adding an explicit assertion.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/mixtral_native_te/tests/test_train_two_gpu.py` around lines 46 - 60, The test test_multi_gpu_train_fsdp2 currently assumes run_train_cmd will raise on failure; change it to explicitly assert success by capturing and asserting the return value from run_train_cmd (or modify run_train_cmd to return a boolean/result object) and then add an assertion like assert result (or assert result.success) after the call; reference the run_train_cmd invocation in test_multi_gpu_train_fsdp2 so the test clearly fails on a false/failed return instead of relying solely on exceptions.bionemo-recipes/recipes/mixtral_native_te/tests/test_lingua_small_mixtral.py (2)
38-42: Test assertions could be more specific about scheduler behavior.The loop runs 3 steps and only checks
lr > 0. Consider adding a check that the LR is actually changing (warmup behavior) or that it matches expected values at specific steps for more robust golden value testing.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/mixtral_native_te/tests/test_lingua_small_mixtral.py` around lines 38 - 42, The test currently only asserts optimizer.param_groups[0]["lr"] > 0 after three steps; capture the initial_lr before the loop (optimizer.param_groups[0]["lr']), record lr at each step (via optimizer.param_groups[0]["lr"]) and add assertions that the LR changes as expected (e.g., lr after step 1 != initial_lr and lr after step 3 matches a known golden value or monotonic change for warmup/decay), referencing optimizer and scheduler to verify scheduler behavior rather than a simple >0 check.
36-36: Consider convertinglr_scheduler_kwargsto a native dict.Line 29 converts
config.adamw_kwargsusingOmegaConf.to_container(...)before passing toAdamW, but line 36 passesconfig.lr_scheduler_kwargsdirectly. Ifget_cosine_annealing_schedule_with_warmupdoesn't handle OmegaConf objects gracefully, this could cause issues. For consistency and safety:♻️ Suggested fix
- scheduler = get_cosine_annealing_schedule_with_warmup(optimizer, **config.lr_scheduler_kwargs) + scheduler = get_cosine_annealing_schedule_with_warmup( + optimizer, **OmegaConf.to_container(config.lr_scheduler_kwargs, resolve=True) + )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@bionemo-recipes/recipes/mixtral_native_te/tests/test_lingua_small_mixtral.py` at line 36, The call to get_cosine_annealing_schedule_with_warmup(optimizer, **config.lr_scheduler_kwargs) may receive an OmegaConf container; convert config.lr_scheduler_kwargs to a native dict (similar to how config.adamw_kwargs is converted with OmegaConf.to_container) before unpacking so the scheduler function receives plain Python types; update the code around scheduler = get_cosine_annealing_schedule_with_warmup to pass OmegaConf.to_container(config.lr_scheduler_kwargs, resolve=True) (or equivalent) when unpacking.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 07e77bc0-95a5-4999-9c86-9672aa60708a
⛔ Files ignored due to path filters (1)
bionemo-recipes/recipes/mixtral_native_te/dlcm_sanity_dataset.parquetis excluded by!**/*.parquet
📒 Files selected for processing (51)
bionemo-recipes/models/mixtral/modeling_mixtral_te.pybionemo-recipes/recipes/mixtral_native_te/Dockerfilebionemo-recipes/recipes/mixtral_native_te/README.mdbionemo-recipes/recipes/mixtral_native_te/checkpoint.pybionemo-recipes/recipes/mixtral_native_te/collator.pybionemo-recipes/recipes/mixtral_native_te/dataset.pybionemo-recipes/recipes/mixtral_native_te/distributed_config.pybionemo-recipes/recipes/mixtral_native_te/fp8_debugging.pybionemo-recipes/recipes/mixtral_native_te/hydra_config/L0_sanity.yamlbionemo-recipes/recipes/mixtral_native_te/hydra_config/L2_lingua_small_mixtral.yamlbionemo-recipes/recipes/mixtral_native_te/hydra_config/defaults.yamlbionemo-recipes/recipes/mixtral_native_te/model_configs/lingua-small-mixtral/config.jsonbionemo-recipes/recipes/mixtral_native_te/modeling_mixtral_te.pybionemo-recipes/recipes/mixtral_native_te/perf_logger.pybionemo-recipes/recipes/mixtral_native_te/requirements.txtbionemo-recipes/recipes/mixtral_native_te/scheduler.pybionemo-recipes/recipes/mixtral_native_te/tests/conftest.pybionemo-recipes/recipes/mixtral_native_te/tests/test_lingua_small_mixtral.pybionemo-recipes/recipes/mixtral_native_te/tests/test_train.pybionemo-recipes/recipes/mixtral_native_te/tests/test_train_two_gpu.pybionemo-recipes/recipes/mixtral_native_te/train_fsdp2.pybionemo-recipes/recipes/opengenome2_mixtral_native_te/Dockerfilebionemo-recipes/recipes/opengenome2_mixtral_native_te/README.mdbionemo-recipes/recipes/opengenome2_mixtral_native_te/checkpoint.pybionemo-recipes/recipes/opengenome2_mixtral_native_te/collator.pybionemo-recipes/recipes/opengenome2_mixtral_native_te/dataset.pybionemo-recipes/recipes/opengenome2_mixtral_native_te/distributed_config.pybionemo-recipes/recipes/opengenome2_mixtral_native_te/fp8_debugging.pybionemo-recipes/recipes/opengenome2_mixtral_native_te/hydra_config/L0_sanity.yamlbionemo-recipes/recipes/opengenome2_mixtral_native_te/hydra_config/defaults.yamlbionemo-recipes/recipes/opengenome2_mixtral_native_te/hydra_config/og2_experiment.yamlbionemo-recipes/recipes/opengenome2_mixtral_native_te/hydra_config/og2_small_thd_moe.yamlbionemo-recipes/recipes/opengenome2_mixtral_native_te/model_configs/og2-small-mixtral/config.jsonbionemo-recipes/recipes/opengenome2_mixtral_native_te/modeling_mixtral_te.pybionemo-recipes/recipes/opengenome2_mixtral_native_te/opengenome_collator.pybionemo-recipes/recipes/opengenome2_mixtral_native_te/optimizer.pybionemo-recipes/recipes/opengenome2_mixtral_native_te/perf_logger.pybionemo-recipes/recipes/opengenome2_mixtral_native_te/requirements.txtbionemo-recipes/recipes/opengenome2_mixtral_native_te/scheduler.pybionemo-recipes/recipes/opengenome2_mixtral_native_te/test_genomic_sequences.jsonlbionemo-recipes/recipes/opengenome2_mixtral_native_te/tests/conftest.pybionemo-recipes/recipes/opengenome2_mixtral_native_te/tests/test_dataset.pybionemo-recipes/recipes/opengenome2_mixtral_native_te/tests/test_train.pybionemo-recipes/recipes/opengenome2_mixtral_native_te/tests/test_train_two_gpu.pybionemo-recipes/recipes/opengenome2_mixtral_native_te/tokenizers/nucleotide_fast_tokenizer/README.mdbionemo-recipes/recipes/opengenome2_mixtral_native_te/tokenizers/nucleotide_fast_tokenizer/special_tokens_map.jsonbionemo-recipes/recipes/opengenome2_mixtral_native_te/tokenizers/nucleotide_fast_tokenizer/tokenizer.jsonbionemo-recipes/recipes/opengenome2_mixtral_native_te/tokenizers/nucleotide_fast_tokenizer/tokenizer_config.jsonbionemo-recipes/recipes/opengenome2_mixtral_native_te/train_fsdp2.pybionemo-recipes/recipes/opengenome2_mixtral_native_te/validation.pyci/scripts/check_copied_files.py
| debug_api.initialize( | ||
| config_file=fp8_stats_file, | ||
| feature_dirs=[te_features_dir], | ||
| log_dir=fp8_log_dir.as_posix(), | ||
| default_logging_enabled=True, | ||
| ) |
There was a problem hiding this comment.
Validate FP8 stats config path before debug API init.
Failing early with FileNotFoundError gives a clearer error than deferring failure into debug_api.initialize.
Proposed change
def initialize_fp8_debugging(
@@
if not fp8_enabled:
raise ValueError(
"fp8_stats_config.enabled is true but fp8_config.enabled is false, "
"please set fp8_config.enabled to true in the config if you wish to collect FP8 stats"
)
+ if not Path(fp8_stats_file).is_file():
+ raise FileNotFoundError(f"FP8 stats config not found: {fp8_stats_file}")
fp8_log_dir = Path(fp8_log_dir) / f"rank_{dist_config.rank}"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/mixtral_native_te/fp8_debugging.py` around lines 58 -
63, The code calls debug_api.initialize with config_file=fp8_stats_file without
verifying the file exists; before calling debug_api.initialize (in the same
scope where fp8_stats_file, te_features_dir, and fp8_log_dir are defined) check
that fp8_stats_file refers to an existing file (e.g., using
pathlib.Path(fp8_stats_file).is_file()) and raise a clear FileNotFoundError (or
log and exit) if missing so initialization fails fast with a helpful message
rather than letting debug_api.initialize surface a later error.
| checkpoint: | ||
| ckpt_dir: null | ||
| save_final_model: true | ||
| resume_from_checkpoint: true | ||
| save_every_n_steps: 10_000 | ||
| async_save: true |
There was a problem hiding this comment.
Checkpointing is enabled, but this preset has nowhere to write.
With ckpt_dir: null, the final-save, periodic-save, and resume flags are dead config. A direct run of this preset cannot resume or emit checkpoints as written.
Suggested fix
checkpoint:
- ckpt_dir: null
+ ckpt_dir: ???
save_final_model: true
- resume_from_checkpoint: true
+ resume_from_checkpoint: false
save_every_n_steps: 10_000
async_save: true🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@bionemo-recipes/recipes/mixtral_native_te/hydra_config/L2_lingua_small_mixtral.yaml`
around lines 47 - 52, The checkpoint block enables saving/resuming
(save_final_model, resume_from_checkpoint, save_every_n_steps, async_save) but
ckpt_dir is null so nothing will be written or resumed; set ckpt_dir to a valid
path or to a hydra/runtime output subdir (e.g., a relative "checkpoints"
directory or an interpolated hydra output variable) or add logic to disable
checkpointing when ckpt_dir is null by setting save_final_model and
resume_from_checkpoint to false—update the checkpoint stanza
(checkpoint.ckpt_dir or the related flags) accordingly to ensure checkpoints are
actually written or that saving/resuming is disabled.
| import nvdlfw_inspect.api as debug_api | ||
| import nvtx | ||
| import torch | ||
| import torchmetrics | ||
| import wandb | ||
| from distributed_config import DistributedConfig |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
find . -name "perf_logger.py" -type fRepository: NVIDIA/bionemo-framework
Length of output: 575
🏁 Script executed:
git ls-files | grep -E "(perf_logger|entrypoint)" | head -20Repository: NVIDIA/bionemo-framework
Length of output: 694
🏁 Script executed:
find . -path "*/mixtral_native_te/*" -type f | head -20Repository: NVIDIA/bionemo-framework
Length of output: 1385
🏁 Script executed:
cat -n ./bionemo-recipes/recipes/mixtral_native_te/perf_logger.py | head -100Repository: NVIDIA/bionemo-framework
Length of output: 4815
🏁 Script executed:
cat -n ./bionemo-recipes/recipes/mixtral_native_te/perf_logger.py | sed -n '85,105p'Repository: NVIDIA/bionemo-framework
Length of output: 1102
🏁 Script executed:
# Search for the training entrypoint to see how it handles nvdlfw_inspect
find ./bionemo-recipes/recipes/mixtral_native_te -name "*.py" -type f -exec grep -l "nvdlfw_inspect" {} \;Repository: NVIDIA/bionemo-framework
Length of output: 308
🏁 Script executed:
grep -n "nvdlfw_inspect" ./bionemo-recipes/recipes/mixtral_native_te/train_fsdp2.py | head -20Repository: NVIDIA/bionemo-framework
Length of output: 109
🏁 Script executed:
grep -n "debug_api" ./bionemo-recipes/recipes/mixtral_native_te/perf_logger.py | head -20Repository: NVIDIA/bionemo-framework
Length of output: 246
🏁 Script executed:
head -50 ./bionemo-recipes/recipes/mixtral_native_te/train_fsdp2.py | grep -A 5 -B 5 "nvdlfw_inspect"Repository: NVIDIA/bionemo-framework
Length of output: 313
🏁 Script executed:
cat ./bionemo-recipes/recipes/mixtral_native_te/fp8_debugging.py | head -40Repository: NVIDIA/bionemo-framework
Length of output: 1321
🏁 Script executed:
cat -n ./bionemo-recipes/recipes/mixtral_native_te/train_fsdp2.py | sed -n '20,50p'Repository: NVIDIA/bionemo-framework
Length of output: 1388
🏁 Script executed:
# Check if debug_api is used conditionally in train_fsdp2.py
grep -n -B 5 -A 5 "debug_api" ./bionemo-recipes/recipes/mixtral_native_te/train_fsdp2.py | head -40Repository: NVIDIA/bionemo-framework
Length of output: 839
🏁 Script executed:
# Check if there's a try-except pattern for nvdlfw_inspect in train_fsdp2.py
grep -n -B 3 -A 3 "try\|except\|ImportError" ./bionemo-recipes/recipes/mixtral_native_te/train_fsdp2.py | grep -A 3 -B 3 "nvdlfw_inspect"Repository: NVIDIA/bionemo-framework
Length of output: 50
🏁 Script executed:
# Let's look at the full import section of train_fsdp2.py
cat -n ./bionemo-recipes/recipes/mixtral_native_te/train_fsdp2.py | sed -n '1,80p'Repository: NVIDIA/bionemo-framework
Length of output: 3997
🏁 Script executed:
# Check where debug_api is actually used in train_fsdp2.py
grep -n "debug_api\." ./bionemo-recipes/recipes/mixtral_native_te/train_fsdp2.pyRepository: NVIDIA/bionemo-framework
Length of output: 125
🏁 Script executed:
# Check where debug_api is actually used in perf_logger.py
grep -n "debug_api\." ./bionemo-recipes/recipes/mixtral_native_te/perf_logger.pyRepository: NVIDIA/bionemo-framework
Length of output: 204
🏁 Script executed:
# Check if there's any documentation or requirements about nvdlfw_inspect being optional
cat ./bionemo-recipes/recipes/mixtral_native_te/requirements.txtRepository: NVIDIA/bionemo-framework
Length of output: 261
🏁 Script executed:
# Check the fp8_debugging.py file to see all usages of debug_api
cat -n ./bionemo-recipes/recipes/mixtral_native_te/fp8_debugging.pyRepository: NVIDIA/bionemo-framework
Length of output: 2668
🏁 Script executed:
# Check other perf_logger files to see if they have the same pattern
grep -n "nvdlfw_inspect" ./bionemo-recipes/recipes/*/perf_logger.pyRepository: NVIDIA/bionemo-framework
Length of output: 699
🏁 Script executed:
# Check how fp8_stats_config is defined - is it always present?
grep -n "fp8_stats_config" ./bionemo-recipes/recipes/mixtral_native_te/hydra_config/*.yaml | head -20Repository: NVIDIA/bionemo-framework
Length of output: 159
🏁 Script executed:
# Check if fp8_debugging.py is only called conditionally
grep -n "initialize_fp8_debugging" ./bionemo-recipes/recipes/mixtral_native_te/train_fsdp2.pyRepository: NVIDIA/bionemo-framework
Length of output: 232
🏁 Script executed:
# Look at how debug_api usage is guarded in perf_logger.py
sed -n '145,160p' ./bionemo-recipes/recipes/mixtral_native_te/perf_logger.pyRepository: NVIDIA/bionemo-framework
Length of output: 702
🏁 Script executed:
# Check line 210 context to see if debug_api.end_debug() is guarded
sed -n '205,215p' ./bionemo-recipes/recipes/mixtral_native_te/perf_logger.pyRepository: NVIDIA/bionemo-framework
Length of output: 286
🏁 Script executed:
# Look at the opengenome2 perf_logger to see how they handle the optional import
cat -n ./bionemo-recipes/recipes/opengenome2_mixtral_native_te/perf_logger.py | sed -n '1,35p'Repository: NVIDIA/bionemo-framework
Length of output: 1391
🏁 Script executed:
# Check the full usage of debug_api in opengenome2_mixtral_native_te to understand the pattern
grep -n "debug_api" ./bionemo-recipes/recipes/opengenome2_mixtral_native_te/perf_logger.pyRepository: NVIDIA/bionemo-framework
Length of output: 275
Make nvdlfw_inspect optional at import time, but required when FP8 stats are enabled.
The module currently imports nvdlfw_inspect.api unconditionally at line 20, which causes import failure if the package is not installed—even when fp8_stats_config.enabled is false. All actual uses of debug_api (lines 153, 210) are already guarded by the fp8_stats_enabled flag. Other recipes in the codebase (e.g., opengenome2_mixtral_native_te) already implement the correct optional import pattern.
Suggested fix
-import nvdlfw_inspect.api as debug_api
+try:
+ import nvdlfw_inspect.api as debug_api
+except ImportError:
+ debug_api = NoneThen add this check in __init__ after line 93:
# Whether to step debug_api.step() after each step
self.fp8_stats_enabled = args.fp8_stats_config.enabled
+ if self.fp8_stats_enabled and debug_api is None:
+ raise RuntimeError("fp8_stats_config.enabled requires nvdlfw_inspect to be installed.")Also applies to: 92-94
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/mixtral_native_te/perf_logger.py` around lines 20 -
25, Make the nvdlfw_inspect import optional: wrap "import nvdlfw_inspect.api as
debug_api" in a try/except ImportError and set debug_api = None on failure, so
the module can import when FP8 stats are disabled; then add a runtime guard in
the perf logger initializer (the __init__ where fp8_stats_config.enabled is
read) that raises a clear ImportError if fp8_stats_config.enabled is true but
debug_api is None. Ensure all references to debug_api (used when
fp8_stats_config.enabled is true) remain unchanged so they still rely on the
runtime check.
| with torch.no_grad(): | ||
| self.grad_acc_step_count += 1 | ||
| self.running_loss += outputs.loss | ||
|
|
||
| if step % self.logging_frequency == 0 and step > 0: | ||
| self.num_tokens += batch["input_ids"].numel() | ||
| # Use attention_mask to count unpadded tokens (works for both BSHD and THD) | ||
| if "attention_mask" in batch: | ||
| self.num_unpadded_tokens += batch["attention_mask"].sum() | ||
| else: | ||
| # Fallback for pure sequence packing with no padding: all tokens are unpadded | ||
| self.num_unpadded_tokens += batch["input_ids"].numel() |
There was a problem hiding this comment.
Count tokens on every micro-step.
step_time spans the whole logging window, but num_tokens and num_unpadded_tokens are only accumulated on the boundary step. That underreports throughput by roughly logger.frequency.
Suggested fix
with torch.no_grad():
self.grad_acc_step_count += 1
self.running_loss += outputs.loss
- if step % self.logging_frequency == 0 and step > 0:
- self.num_tokens += batch["input_ids"].numel()
- # Use attention_mask to count unpadded tokens (works for both BSHD and THD)
- if "attention_mask" in batch:
- self.num_unpadded_tokens += batch["attention_mask"].sum()
- else:
- # Fallback for pure sequence packing with no padding: all tokens are unpadded
- self.num_unpadded_tokens += batch["input_ids"].numel()
+ self.num_tokens += batch["input_ids"].numel()
+ # Use attention_mask to count unpadded tokens (works for both BSHD and THD)
+ if "attention_mask" in batch:
+ self.num_unpadded_tokens += batch["attention_mask"].sum()
+ else:
+ # Fallback for pure sequence packing with no padding: all tokens are unpadded
+ self.num_unpadded_tokens += batch["input_ids"].numel()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/mixtral_native_te/perf_logger.py` around lines 109 -
120, The current token counters (self.num_tokens and self.num_unpadded_tokens)
are only incremented when the logging boundary is hit, causing underreported
throughput; move the token counting into every micro-step inside the training
loop so they are incremented on every call where self.grad_acc_step_count and
self.running_loss are updated (i.e., within the same with torch.no_grad() block
that updates self.grad_acc_step_count and self.running_loss), always increment
self.num_tokens by batch["input_ids"].numel() and increment
self.num_unpadded_tokens by batch["attention_mask"].sum() when "attention_mask"
exists or by batch["input_ids"].numel() as the fallback, leaving the existing
conditional that checks logging_frequency and step only for emitting logs, not
for accumulating token counts.
| def finish(self): | ||
| """Finish the logger and close the progress bar.""" | ||
| if not self._dist_config.is_main_process(): | ||
| return | ||
|
|
||
| wandb.finish() | ||
| self._progress_bar.close() | ||
|
|
||
| if self.fp8_stats_enabled and HAS_NVDLFW_INSPECT: | ||
| debug_api.end_debug() |
There was a problem hiding this comment.
Call end_debug() on every rank.
initialize_fp8_debugging() creates rank-scoped outputs, but finish() returns before cleanup on non-zero ranks. Those ranks never finalize their FP8 debug logs.
Possible fix
def finish(self):
"""Finish the logger and close the progress bar."""
+ if self.fp8_stats_enabled and HAS_NVDLFW_INSPECT:
+ debug_api.end_debug()
+
if not self._dist_config.is_main_process():
return
-
- wandb.finish()
- self._progress_bar.close()
-
- if self.fp8_stats_enabled and HAS_NVDLFW_INSPECT:
- debug_api.end_debug()
+ wandb.finish()
+ self._progress_bar.close()📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| def finish(self): | |
| """Finish the logger and close the progress bar.""" | |
| if not self._dist_config.is_main_process(): | |
| return | |
| wandb.finish() | |
| self._progress_bar.close() | |
| if self.fp8_stats_enabled and HAS_NVDLFW_INSPECT: | |
| debug_api.end_debug() | |
| def finish(self): | |
| """Finish the logger and close the progress bar.""" | |
| if self.fp8_stats_enabled and HAS_NVDLFW_INSPECT: | |
| debug_api.end_debug() | |
| if not self._dist_config.is_main_process(): | |
| return | |
| wandb.finish() | |
| self._progress_bar.close() |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/perf_logger.py` around
lines 230 - 239, finish() currently returns early for non-main ranks and thus
never calls end_debug(); change finish() so that when self.fp8_stats_enabled and
HAS_NVDLFW_INSPECT it always calls debug_api.end_debug() on every rank (call
end_debug() before the early return or perform the check outside the
is_main_process branch), while preserving wandb.finish() and
self._progress_bar.close() only for the main process (keep the existing
_dist_config.is_main_process() gate for those). Also ensure the call uses the
same debug_api.end_debug() symbol and that no other main-only cleanup is moved
for non-main ranks.
| # TODO(@jomitchell): Delete once https://nvbugspro.nvidia.com/bug/5458694 is fixed. | ||
| requires_datacenter_hardware = pytest.mark.skipif( | ||
| not torch.cuda.is_available() | ||
| or not any( | ||
| gpu_name in torch.cuda.get_device_name(0).upper() for gpu_name in ["H100", "H200", "B100", "B200", "B300"] | ||
| ), | ||
| reason="Test requires datacenter hardware (H100, H200, B100, B200, B300)", | ||
| ) |
There was a problem hiding this comment.
Attach the datacenter-hardware guard to the actual tests.
requires_datacenter_hardware is defined for the known NVBug, but neither test uses it. On any 2-GPU non-datacenter host these tests still run and hit the unsupported path.
Suggested fix
-@requires_multi_gpu
+@requires_multi_gpu
+@requires_datacenter_hardware
def test_multi_gpu_train_fsdp2(tmp_path, recipe_path):
@@
-@requires_multi_gpu
+@requires_multi_gpu
+@requires_datacenter_hardware
def test_multi_gpu_train_fsdp2_with_checkpointing(tmp_path, recipe_path):Also applies to: 61-84
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In
`@bionemo-recipes/recipes/opengenome2_mixtral_native_te/tests/test_train_two_gpu.py`
around lines 33 - 40, The datacenter-hardware guard variable
requires_datacenter_hardware is defined but never applied to the tests, so add
the marker decorator to each test that needs datacenter GPUs (e.g., annotate the
test function(s) like test_train_two_gpu and the other test in this file with
`@requires_datacenter_hardware`) so they are skipped on non-datacenter hosts;
ensure the decorator is placed directly above the test function definitions (or
applied via pytest.mark.usefixtures/parametrization if that pattern is used).
| dist_config = DistributedConfig() | ||
| logger.info("Initializing distributed training: %s", dist_config) | ||
| device = torch.device(f"cuda:{dist_config.local_rank}") | ||
| torch.distributed.init_process_group(backend="cpu:gloo,cuda:nccl", device_id=device) | ||
| torch.cuda.set_device(dist_config.local_rank) |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
fd -t f "train_fsdp2.py" --exec wc -l {}Repository: NVIDIA/bionemo-framework
Length of output: 463
🏁 Script executed:
cat -n bionemo-recipes/recipes/opengenome2_mixtral_native_te/train_fsdp2.py | sed -n '85,110p'Repository: NVIDIA/bionemo-framework
Length of output: 1516
🏁 Script executed:
cat -n bionemo-recipes/recipes/opengenome2_mixtral_native_te/train_fsdp2.py | sed -n '340,360p'Repository: NVIDIA/bionemo-framework
Length of output: 786
🏁 Script executed:
cat -n bionemo-recipes/recipes/opengenome2_mixtral_native_te/train_fsdp2.py | sed -n '86,360p'Repository: NVIDIA/bionemo-framework
Length of output: 13268
🏁 Script executed:
rg "destroy_process_group|init_process_group" bionemo-recipes/recipes/opengenome2_mixtral_native_te/train_fsdp2.pyRepository: NVIDIA/bionemo-framework
Length of output: 202
🏁 Script executed:
rg "def " bionemo-recipes/recipes/opengenome2_mixtral_native_te/train_fsdp2.py -nRepository: NVIDIA/bionemo-framework
Length of output: 150
Put the distributed teardown in a finally block.
Any exception after init_process_group() at line 95 skips perf_logger.finish(), the async checkpoint wait, and destroy_process_group(). In a distributed run this can strand communicators and background checkpoint work for the rest of the job.
Additionally, the rank-local torch.distributed.barrier() called in the validation exception handler at line 331 can deadlock if exceptions occur on different ranks in code paths outside the validation try-except block.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/train_fsdp2.py` around
lines 92 - 96, Wrap the distributed initialization and the main
training/validation body that follows
DistributedConfig()/torch.distributed.init_process_group(...) in a try/finally
so that perf_logger.finish(), the async checkpoint wait, and
torch.distributed.destroy_process_group() always run in the finally block;
ensure torch.cuda.set_device(dist_config.local_rank) remains in the try before
training starts. Also change the rank-local torch.distributed.barrier() used in
the validation exception handler to a safe global barrier or guard it so it only
runs if the process group is initialized to avoid rank-mismatch deadlocks
(reference functions/objects: DistributedConfig, init_process_group,
perf_logger.finish, async checkpoint wait code, destroy_process_group, and
torch.distributed.barrier).
| perf_logger.log_step( | ||
| step=step, | ||
| grad_norm=total_norm, | ||
| lr=optimizer.param_groups[0]["lr"], | ||
| ) | ||
|
|
||
| if ckpt_path and should_save_checkpoint(step, args.checkpoint.save_every_n_steps): | ||
| save_checkpoint_fsdp2( | ||
| model=model, | ||
| optimizer=optimizer, | ||
| scheduler=scheduler, | ||
| ckpt_path=ckpt_path, | ||
| step=step, | ||
| epoch=epoch, | ||
| dist_config=dist_config, | ||
| dataloader=train_dataloader if args.dataset.use_stateful_dataloader else None, | ||
| process_group=device_mesh.get_group("dp"), | ||
| max_checkpoints=args.checkpoint.max_checkpoints, | ||
| async_save=args.checkpoint.async_save, | ||
| ) | ||
|
|
There was a problem hiding this comment.
Advance step before scheduling checkpoints and validation.
step is only incremented after these side effects run, so save_every_n_steps=100 actually fires after 101 optimizer updates, and the recorded checkpoint/validation step drifts by one.
Suggested fix
- perf_logger.log_step(
- step=step,
- grad_norm=total_norm,
- lr=optimizer.param_groups[0]["lr"],
- )
+ step += 1
+ perf_logger.log_step(
+ step=step,
+ grad_norm=total_norm,
+ lr=optimizer.param_groups[0]["lr"],
+ )
if ckpt_path and should_save_checkpoint(step, args.checkpoint.save_every_n_steps):
save_checkpoint_fsdp2(
@@
- if val_enabled and val_dataloader is not None and step > 0 and step % val_config.eval_interval == 0:
+ if val_enabled and val_dataloader is not None and step % val_config.eval_interval == 0:
try:
val_metrics = run_validation(
@@
- step += 1
if step >= args.num_train_steps:
breakAlso applies to: 333-335
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/train_fsdp2.py` around
lines 297 - 317, The code currently logs and triggers save/validation using the
pre-incremented variable `step`, causing off-by-one behavior; move the `step`
increment to occur before calling `perf_logger.log_step`,
`should_save_checkpoint(...)`, `save_checkpoint_fsdp2(...)` and the validation
scheduling logic so that these side effects use the updated step count (update
the same `step` variable used throughout the loop), i.e., increment `step`
immediately after the optimizer update and before invoking
`perf_logger.log_step`, `should_save_checkpoint`, `save_checkpoint_fsdp2`, and
the corresponding validation functions referenced around those lines (e.g., the
block that calls `perf_logger.log_step` and the save/validation calls at lines
~333-335).
| if val_enabled and val_dataloader is not None and step > 0 and step % val_config.eval_interval == 0: | ||
| try: | ||
| val_metrics = run_validation( | ||
| model=model, | ||
| val_dataloader=val_dataloader, | ||
| num_batches=val_config.num_batches, | ||
| device=device, | ||
| dist_config=dist_config, | ||
| ) | ||
| perf_logger.log_validation(step, val_metrics) | ||
| except Exception as e: | ||
| logger.error(f"Validation failed at step {step}: {e}") | ||
| torch.distributed.barrier() |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
# First, let's find and examine the file in question
find . -type f -name "train_fsdp2.py" | head -5Repository: NVIDIA/bionemo-framework
Length of output: 368
🏁 Script executed:
# Also look for the file pattern if it exists in a different location
fd "train_fsdp2.py"Repository: NVIDIA/bionemo-framework
Length of output: 427
🏁 Script executed:
# Examine the file at the specified lines with context
sed -n '310,340p' ./bionemo-recipes/recipes/opengenome2_mixtral_native_te/train_fsdp2.pyRepository: NVIDIA/bionemo-framework
Length of output: 1453
🏁 Script executed:
# Also check the full function to understand the broader context
wc -l ./bionemo-recipes/recipes/opengenome2_mixtral_native_te/train_fsdp2.pyRepository: NVIDIA/bionemo-framework
Length of output: 142
🏁 Script executed:
# Search for run_validation function definition
rg "def run_validation" ./bionemo-recipes/recipes/opengenome2_mixtral_native_te/Repository: NVIDIA/bionemo-framework
Length of output: 157
🏁 Script executed:
# Also search in the entire recipes directory for this function
rg "def run_validation" ./bionemo-recipes/Repository: NVIDIA/bionemo-framework
Length of output: 245
🏁 Script executed:
# Check all barrier calls in this file
rg "torch\.distributed\.barrier" ./bionemo-recipes/recipes/opengenome2_mixtral_native_te/train_fsdp2.py -nRepository: NVIDIA/bionemo-framework
Length of output: 123
🏁 Script executed:
# Examine the run_validation function
cat -n ./bionemo-recipes/recipes/opengenome2_mixtral_native_te/validation.py | head -100Repository: NVIDIA/bionemo-framework
Length of output: 4223
🏁 Script executed:
# Check file size first
wc -l ./bionemo-recipes/recipes/opengenome2_mixtral_native_te/validation.pyRepository: NVIDIA/bionemo-framework
Length of output: 141
🏁 Script executed:
# Read the rest of validation.py
sed -n '100,113p' ./bionemo-recipes/recipes/opengenome2_mixtral_native_te/validation.pyRepository: NVIDIA/bionemo-framework
Length of output: 400
🏁 Script executed:
# Check if there are other similar files with same patterns
cat ./bionemo-recipes/recipes/opengenome2_llama_native_te/validation.py | head -100 | tail -30Repository: NVIDIA/bionemo-framework
Length of output: 1295
🏁 Script executed:
# Check if other similar files have the same pattern
for file in ./bionemo-recipes/recipes/*/train_fsdp2.py; do
echo "=== $file ==="
rg -A 5 "except Exception" "$file" | grep -A 5 "barrier"
doneRepository: NVIDIA/bionemo-framework
Length of output: 743
🏁 Script executed:
# Also check if the suggested fix is already applied or if there are alternatives
rg "raise" ./bionemo-recipes/recipes/opengenome2_mixtral_native_te/train_fsdp2.py -B 2 -A 2 | grep -A 4 "Validation failed"Repository: NVIDIA/bionemo-framework
Length of output: 50
🏁 Script executed:
# Get broader context around the validation block
sed -n '305,345p' ./bionemo-recipes/recipes/opengenome2_mixtral_native_te/train_fsdp2.pyRepository: NVIDIA/bionemo-framework
Length of output: 1863
🏁 Script executed:
# Check what happens after the validation block
sed -n '330,345p' ./bionemo-recipes/recipes/opengenome2_mixtral_native_te/train_fsdp2.pyRepository: NVIDIA/bionemo-framework
Length of output: 591
🏁 Script executed:
# Verify the imports and confirm torch.distributed functions are being used
head -50 ./bionemo-recipes/recipes/opengenome2_mixtral_native_te/train_fsdp2.py | grep -E "import|torch"Repository: NVIDIA/bionemo-framework
Length of output: 680
🏁 Script executed:
# Check if there's any prior synchronization or error handling we're missing
sed -n '1,50p' ./bionemo-recipes/recipes/opengenome2_mixtral_native_te/train_fsdp2.pyRepository: NVIDIA/bionemo-framework
Length of output: 1691
🏁 Script executed:
# Verify the exact line numbers one more time and check the full context of validation loop
sed -n '280,350p' ./bionemo-recipes/recipes/opengenome2_mixtral_native_te/train_fsdp2.py | cat -nRepository: NVIDIA/bionemo-framework
Length of output: 3475
Re-raise validation exceptions to avoid distributed deadlock.
The current exception handler calls torch.distributed.barrier() only on the rank that throws, while other ranks continue through run_validation() and reach different collective operations (another barrier() at line 76, then all_reduce() at line 82 in validation.py). This asymmetry causes ranks to block at different synchronization points, creating a deadlock.
Suggested fix
- except Exception as e:
- logger.error(f"Validation failed at step {step}: {e}")
- torch.distributed.barrier()
+ except Exception:
+ logger.exception("Validation failed at step %s", step)
+ raise📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if val_enabled and val_dataloader is not None and step > 0 and step % val_config.eval_interval == 0: | |
| try: | |
| val_metrics = run_validation( | |
| model=model, | |
| val_dataloader=val_dataloader, | |
| num_batches=val_config.num_batches, | |
| device=device, | |
| dist_config=dist_config, | |
| ) | |
| perf_logger.log_validation(step, val_metrics) | |
| except Exception as e: | |
| logger.error(f"Validation failed at step {step}: {e}") | |
| torch.distributed.barrier() | |
| if val_enabled and val_dataloader is not None and step > 0 and step % val_config.eval_interval == 0: | |
| try: | |
| val_metrics = run_validation( | |
| model=model, | |
| val_dataloader=val_dataloader, | |
| num_batches=val_config.num_batches, | |
| device=device, | |
| dist_config=dist_config, | |
| ) | |
| perf_logger.log_validation(step, val_metrics) | |
| except Exception: | |
| logger.exception("Validation failed at step %s", step) | |
| raise |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@bionemo-recipes/recipes/opengenome2_mixtral_native_te/train_fsdp2.py` around
lines 319 - 331, The except block inside the validation trigger in
train_fsdp2.py currently logs the exception and calls
torch.distributed.barrier() only on the catching rank, which can cause a
distributed deadlock; change the handler to first log the error, call
torch.distributed.barrier() to synchronize all ranks, and then re-raise the
exception (raise) so the error propagates and all processes exit consistently;
update the except block around run_validation(...) in the training loop to use
logger.error(...), torch.distributed.barrier(), then raise.
|
/ok to test 5cc5473 |
Signed-off-by: Timur Rvachov <trvachov@nvidia.com>
|
/ok to test 47cddb3 |
|
no multiple-GPU FSDP tests were ran with original mixtral model, so this needs to be "merged first" before we make the recipes (or incorporated into recipe): #1544 |
…reshold - _sync_expert_views: use gate_up_w.shape[0]/down_w.shape[0] instead of self.num_local_experts to correctly iterate over locally-sharded experts when FSDP2 shards stacked expert weights along dim 0 before init_empty_weights - _restack_from_views: handle DTensor params from FSDP2 by working with local shard and reconstructing DTensor after initialization - test_train.py: bump bshd loss threshold from 8.0 to 8.5 to match thd test, avoiding flaky failures when loss hovers near the boundary Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: svc-bionemo <267129667+svc-bionemo@users.noreply.github.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com> Signed-off-by: svc-bionemo <267129667+svc-bionemo@users.noreply.github.com>
…apping - Apply FSDP2 DTensor fix to bionemo-recipes/models/mixtral/modeling_mixtral_te.py (source) - Add mixtral modeling file to check_copied_files SOURCE_TO_DESTINATION_MAP - Recipe file now gets copied-file banner via check_copied_files --fix Signed-off-by: svc-bionemo <267129667+svc-bionemo@users.noreply.github.com>
… for sparse checkout CI uses sparse-checkout, so each recipe job only has its own directory. The opengenome2_mixtral_native_te tests were importing modeling_mixtral_te from the shared mixtral_native_te recipe path, which does not exist in the sparse checkout. Fix: - Copy modeling_mixtral_te.py to opengenome2_mixtral_native_te recipe root - Register the copy in check_copied_files.py source-destination map - Update test imports to use local recipe root instead of shared path Signed-off-by: svc-bionemo <267129667+svc-bionemo@users.noreply.github.com>
…rs 5.4+ transformers >= 5.4 checks cache.is_compileable in generate(). The custom HFInferenceParams class (TE-based cache) did not implement this attribute, causing AttributeError during test_generate_with_cache tests. Set is_compileable = False since this cache type is not compatible with torch.compile generate(). Tested locally: - models/mixtral: 52 passed, 3 skipped, 26 xfailed (3 local-only OOM on 32GB GPU, pass on CI L4) - recipes/mixtral_native_te: 7 passed - recipes/opengenome2_mixtral_native_te: 20 passed Signed-off-by: svc-bionemo <267129667+svc-bionemo@users.noreply.github.com>
- Add 2D device mesh (dp, ep) to both mixtral_native_te and opengenome2_mixtral_native_te train_fsdp2.py scripts - Call model.model.set_ep_groups() before fully_shard() to convert expert weights to DTensors with Shard(0) on the ep dimension - Pass expert_parallel_size to NVMixtralConfig so num_local_experts is set correctly per rank (num_experts // ep_size) - Add clip_grad_norm_ep_aware() helper that handles DTensor expert weight gradients on different device meshes (avoids aten.stack mesh mismatch error in torch.nn.utils.clip_grad_norm_) - Add expert_parallel_size config field to both defaults.yaml files - Update L0_sanity.yaml in both recipes for EP=2 on 2-GPU setup, W&B run names agent1-opengenome2 and agent1-lingua, project swarm-mixtral-development Validated on 2x H100 (CUDA_VISIBLE_DEVICES=2,3): - OG2: loss 5.37→1.22, W&B run agent1-opengenome2 - Lingua: loss 11.8→8.5, W&B run agent1-lingua Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…support for Mixtral recipes Adds EP test configs (EP1/2/4/8) for both Lingua and OpenGenome2 Mixtral recipes, fused all-to-all token dispatch modules, 8x7B model configs, and EP-aware gradient clipping in train_fsdp2.py. Validated on 8x H100 80GB: - 8x1B: all EP degrees (1,2,4,8) pass for both datasets - 8x7B: OG2 EP2 passes (seq_len=2048), Lingua OOMs (seq_len=4096) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
/ok to test f280bac |
Description
Provides mixtral recipes to train on OpenGEnome2 and Lingua datasets
Usage
Type of changes
CI Pipeline Configuration
Configure CI behavior by applying the relevant labels. By default, only basic unit tests are run.
Unit tests marked as
@pytest.mark.multi_gpuor@pytest.mark.distributedare not run in the PR pipeline.For more details, see CONTRIBUTING
Note
By default, only basic unit tests are run. Add appropriate labels to enable an additional test coverage.
Authorizing CI Runs
We use copy-pr-bot to manage authorization of CI
runs on NVIDIA's compute resources.
automatically be copied to a pull-request/ prefixed branch in the source repository (e.g. pull-request/123)
/ok to testcomment on the pull request to trigger CI. This will need to be done for each new commit.Triggering Code Rabbit AI Review
To trigger a code review from code rabbit, comment on a pull request with one of these commands:
See https://docs.coderabbit.ai/reference/review-commands for a full list of commands.
Pre-submit Checklist
Summary by CodeRabbit
New Features
Documentation
Tests