Skip to content

only xfail thd tests if cuda arch is unsupported#1118

Merged
pstjohn merged 2 commits into
NVIDIA-BioNeMo:mainfrom
pstjohn:pstjohn/xfail-if-in-thd-recipe
Sep 5, 2025
Merged

only xfail thd tests if cuda arch is unsupported#1118
pstjohn merged 2 commits into
NVIDIA-BioNeMo:mainfrom
pstjohn:pstjohn/xfail-if-in-thd-recipe

Conversation

@pstjohn
Copy link
Copy Markdown
Collaborator

@pstjohn pstjohn commented Sep 5, 2025

Apparently we have xfail set to strict in the bionemo pyproject.toml which can get picked up in these runs?

Let's only xfail these if the cuda arch is not sm_100

Summary by CodeRabbit

  • Tests
    • Made GPU-dependent tests conditionally xfail based on CUDA compute capability to avoid false negatives on certain devices.
    • Added a required import and updated test annotations with a reference to a known NVIDIA issue for traceability.
    • Removed a flaky masking assertion in one test (now noted with a TODO) to improve CI reliability; no changes to runtime behavior or public APIs.

Signed-off-by: Peter St. John <pstjohn@nvidia.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Sep 5, 2025

Walkthrough

Tests were adjusted: two unconditional xfail markers were replaced with conditional xfail checks based on CUDA device capability and torch was imported. A separate test removed a flaky assertion about MLM masking and left a TODO. No public APIs or test bodies’ core logic were altered.

Changes

Cohort / File(s) Summary of Changes
Tests: Conditional xfail by CUDA capability
recipes/esm2_native_te_nvfsdp_thd/test_train.py
Added import torch. Replaced unconditional xfail decorators on test_main_invocation and test_main_invocation_ddp with conditional xfail markers that evaluate torch.cuda.get_device_capability() == (12, 0) and updated the reason to reference nvbugs/5458694. Test bodies and Hydra/config usage unchanged.
Tests: Relaxed/flaky MLM masking assertion
recipes/esm2_native_te_nvfsdp_thd/test_thd_format.py
Removed the assertion that enforced masked token count > 0 for mlm_probability > 0. The former check is commented out and a TODO added noting flakiness and that a larger batch is needed to reliably test masking. The mlm_probability == 0.0 label assertion remains.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

I hop through tests with careful might,
Torch in paw to check GPU light,
If twelve-point-zero whispers "stay",
We xfail gently, skip the fray.
A masked token lost in the throng—I'll fix it before long. 🐇✨

✨ Finishing Touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.

Support

Need help? Create a ticket on our support page for assistance with any issues or questions.

CodeRabbit Commands (Invoked using PR/Issue comments)

Type @coderabbitai help to get the list of available commands.

Other keywords and placeholders

  • Add @coderabbitai ignore or @coderabbit ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Status, Documentation and Community

  • Visit our Status Page to check the current availability of CodeRabbit.
  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (1)
recipes/esm2_native_te_nvfsdp_thd/test_train.py (1)

19-19: DRY the condition and harden against unexpected CUDA init errors

Define a small helper (or module-level constant) once and reuse in both decorators. This also lets you catch rare initialization errors safely.

Example snippet to add near the imports (outside the shown ranges):

def _needs_xfail_thd() -> bool:
    try:
        if not torch.cuda.is_available():
            return True
        cc = torch.cuda.get_device_capability()
        return cc != (10, 0)
    except Exception:
        # Any CUDA init/driver error => conservatively xfail
        return True

Then:

@pytest.mark.xfail(condition=_needs_xfail_thd(), reason="CUDNN padded packed sequences not supported on all hardware currently (nvbugs/5458694).")

Verification: please run these two tests on (a) a CPU-only runner and (b) an sm_100 GPU host to confirm xfail-only-on-unsupported-arch behavior. I can push a patch if you prefer.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e213d40 and 409a24a.

📒 Files selected for processing (1)
  • recipes/esm2_native_te_nvfsdp_thd/test_train.py (2 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: changed-files

Comment thread recipes/esm2_native_te_nvfsdp_thd/test_train.py
Comment thread recipes/esm2_native_te_nvfsdp_thd/test_train.py
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
recipes/esm2_native_te_nvfsdp_thd/test_thd_format.py (2)

229-233: Reintroduce a deterministic, non-flaky masking check instead of commenting it out

Small batches make masking probabilistic and flaky. Make the check deterministic and increase sample size to reduce variance.

-        # TODO: This is a very flaky test with such a small input batch, we should make it larger if we want to ensure a
-        # token is masked
-        # else: # Some masking should occur masked_count = (sample["labels"] != -100).sum() assert
-        #     masked_count > 0, f"With mlm_probability={mlm_prob}, some tokens should be masked"
+        else:
+            # Reduce stochastic variance and make RNG deterministic for this assertion
+            big_batch = batch * 64  # ~1K tokens; keeps runtime low while avoiding flakiness
+            with torch.random.fork_rng(enabled=True):
+                torch.manual_seed(0)
+                big_sample = data_collator(big_batch)
+            masked_count = (big_sample["labels"] != -100).sum().item()
+            assert masked_count > 0, f"With mlm_probability={mlm_prob}, some tokens should be masked"

120-128: Optional: seed RNG locally around the ratio check to further reduce variance

The masking ratio assertion can still be sensitive to RNG. Consider seeding locally to keep tests reproducible without affecting global state.

with torch.random.fork_rng(enabled=True):
    torch.manual_seed(0)
    sample = data_collator(batch)
    labels = sample["labels"]
    input_ids = sample["input_ids"]
    masked_positions = (labels != -100).sum()
    total_positions = labels.numel()
    masking_ratio = masked_positions.float() / total_positions
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

💡 Knowledge Base configuration:

  • MCP integration is disabled by default for public repositories
  • Jira integration is disabled by default for public repositories
  • Linear integration is disabled by default for public repositories

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 409a24a and b0d8293.

📒 Files selected for processing (2)
  • recipes/esm2_native_te_nvfsdp_thd/test_thd_format.py (1 hunks)
  • recipes/esm2_native_te_nvfsdp_thd/test_train.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • recipes/esm2_native_te_nvfsdp_thd/test_train.py

@pstjohn pstjohn enabled auto-merge September 5, 2025 16:23
@pstjohn pstjohn disabled auto-merge September 5, 2025 16:23
@pstjohn pstjohn merged commit b3638d9 into NVIDIA-BioNeMo:main Sep 5, 2025
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant