only xfail thd tests if cuda arch is unsupported by pstjohn · Pull Request #1118 · NVIDIA-BioNeMo/bionemo-framework

pstjohn · 2025-09-05T15:21:07Z

Apparently we have xfail set to strict in the bionemo pyproject.toml which can get picked up in these runs?

Let's only xfail these if the cuda arch is not sm_100

Summary by CodeRabbit

Tests
- Made GPU-dependent tests conditionally xfail based on CUDA compute capability to avoid false negatives on certain devices.
- Added a required import and updated test annotations with a reference to a known NVIDIA issue for traceability.
- Removed a flaky masking assertion in one test (now noted with a TODO) to improve CI reliability; no changes to runtime behavior or public APIs.

Signed-off-by: Peter St. John <pstjohn@nvidia.com>

coderabbitai · 2025-09-05T15:21:14Z

Walkthrough

Tests were adjusted: two unconditional xfail markers were replaced with conditional xfail checks based on CUDA device capability and torch was imported. A separate test removed a flaky assertion about MLM masking and left a TODO. No public APIs or test bodies’ core logic were altered.

Changes

Cohort / File(s)	Summary of Changes
Tests: Conditional xfail by CUDA capability `recipes/esm2_native_te_nvfsdp_thd/test_train.py`	Added `import torch`. Replaced unconditional `xfail` decorators on `test_main_invocation` and `test_main_invocation_ddp` with conditional xfail markers that evaluate `torch.cuda.get_device_capability() == (12, 0)` and updated the reason to reference `nvbugs/5458694`. Test bodies and Hydra/config usage unchanged.
Tests: Relaxed/flaky MLM masking assertion `recipes/esm2_native_te_nvfsdp_thd/test_thd_format.py`	Removed the assertion that enforced masked token count > 0 for `mlm_probability > 0`. The former check is commented out and a TODO added noting flakiness and that a larger batch is needed to reliably test masking. The `mlm_probability == 0.0` label assertion remains.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

I hop through tests with careful might,
Torch in paw to check GPU light,
If twelve-point-zero whispers "stay",
We xfail gently, skip the fray.
A masked token lost in the throng—I'll fix it before long. 🐇✨

✨ Finishing Touches

📝 Generate Docstrings

🧪 Generate unit tests

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

Visit our Status Page to check the current availability of CodeRabbit.
Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 2

🧹 Nitpick comments (1)

recipes/esm2_native_te_nvfsdp_thd/test_train.py (1)
19-19: DRY the condition and harden against unexpected CUDA init errors

Define a small helper (or module-level constant) once and reuse in both decorators. This also lets you catch rare initialization errors safely.

Example snippet to add near the imports (outside the shown ranges):
def _needs_xfail_thd() -> bool:
    try:
        if not torch.cuda.is_available():
            return True
        cc = torch.cuda.get_device_capability()
        return cc != (10, 0)
    except Exception:
        # Any CUDA init/driver error => conservatively xfail
        return True
Then:
@pytest.mark.xfail(condition=_needs_xfail_thd(), reason="CUDNN padded packed sequences not supported on all hardware currently (nvbugs/5458694).")
Verification: please run these two tests on (a) a CPU-only runner and (b) an sm_100 GPU host to confirm xfail-only-on-unsupported-arch behavior. I can push a patch if you prefer.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e213d40 and 409a24a.

📒 Files selected for processing (1)

recipes/esm2_native_te_nvfsdp_thd/test_train.py (2 hunks)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: changed-files

Signed-off-by: Peter St. John <pstjohn@nvidia.com>

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

recipes/esm2_native_te_nvfsdp_thd/test_thd_format.py (2)

229-233: Reintroduce a deterministic, non-flaky masking check instead of commenting it out

Small batches make masking probabilistic and flaky. Make the check deterministic and increase sample size to reduce variance.

-        # TODO: This is a very flaky test with such a small input batch, we should make it larger if we want to ensure a
-        # token is masked
-        # else: # Some masking should occur masked_count = (sample["labels"] != -100).sum() assert
-        #     masked_count > 0, f"With mlm_probability={mlm_prob}, some tokens should be masked"
+        else:
+            # Reduce stochastic variance and make RNG deterministic for this assertion
+            big_batch = batch * 64  # ~1K tokens; keeps runtime low while avoiding flakiness
+            with torch.random.fork_rng(enabled=True):
+                torch.manual_seed(0)
+                big_sample = data_collator(big_batch)
+            masked_count = (big_sample["labels"] != -100).sum().item()
+            assert masked_count > 0, f"With mlm_probability={mlm_prob}, some tokens should be masked"

120-128: Optional: seed RNG locally around the ratio check to further reduce variance

The masking ratio assertion can still be sensitive to RNG. Consider seeding locally to keep tests reproducible without affecting global state.

with torch.random.fork_rng(enabled=True):
    torch.manual_seed(0)
    sample = data_collator(batch)
    labels = sample["labels"]
    input_ids = sample["input_ids"]
    masked_positions = (labels != -100).sum()
    total_positions = labels.numel()
    masking_ratio = masked_positions.float() / total_positions

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

MCP integration is disabled by default for public repositories
Jira integration is disabled by default for public repositories
Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 409a24a and b0d8293.

📒 Files selected for processing (2)

recipes/esm2_native_te_nvfsdp_thd/test_thd_format.py (1 hunks)
recipes/esm2_native_te_nvfsdp_thd/test_train.py (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

recipes/esm2_native_te_nvfsdp_thd/test_train.py

only xfail thd tests if cuda arch is unsupported

409a24a

Signed-off-by: Peter St. John <pstjohn@nvidia.com>

pstjohn requested review from cspades, dorotat-nv, jomitchellnv, jstjohn, jwilber and trvachov as code owners September 5, 2025 15:21

coderabbitai Bot reviewed Sep 5, 2025

View reviewed changes

Comment thread recipes/esm2_native_te_nvfsdp_thd/test_train.py

Comment thread recipes/esm2_native_te_nvfsdp_thd/test_train.py

fix thd tests

b0d8293

Signed-off-by: Peter St. John <pstjohn@nvidia.com>

coderabbitai Bot reviewed Sep 5, 2025

View reviewed changes

pstjohn enabled auto-merge September 5, 2025 16:23

pstjohn disabled auto-merge September 5, 2025 16:23

pstjohn merged commit b3638d9 into NVIDIA-BioNeMo:main Sep 5, 2025
15 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

only xfail thd tests if cuda arch is unsupported#1118

only xfail thd tests if cuda arch is unsupported#1118
pstjohn merged 2 commits into
NVIDIA-BioNeMo:mainfrom
pstjohn:pstjohn/xfail-if-in-thd-recipe

pstjohn commented Sep 5, 2025 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Sep 5, 2025 •

edited

Loading

Chat

Support

CodeRabbit Commands (Invoked using PR/Issue comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Status, Documentation and Community

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

pstjohn commented Sep 5, 2025 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Sep 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Chat

Support

CodeRabbit Commands (Invoked using PR/Issue comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Status, Documentation and Community

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

pstjohn commented Sep 5, 2025 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Sep 5, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)