only xfail thd tests if cuda arch is unsupported #1118
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
Walkthrough
Tests were adjusted: two unconditional xfail markers were replaced with conditional xfail checks based on CUDA device capability.
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
Actionable comments posted: 2
🧹 Nitpick comments (1)
recipes/esm2_native_te_nvfsdp_thd/test_train.py (1)
19-19: DRY the condition and harden against unexpected CUDA init errors
Define a small helper (or module-level constant) once and reuse it in both decorators. This also lets you catch rare initialization errors safely.
Example snippet to add near the imports (outside the shown ranges):
    def _needs_xfail_thd() -> bool:
        try:
            if not torch.cuda.is_available():
                return True
            cc = torch.cuda.get_device_capability()
            return cc != (10, 0)
        except Exception:
            # Any CUDA init/driver error => conservatively xfail
            return True

Then:

    @pytest.mark.xfail(
        condition=_needs_xfail_thd(),
        reason="CUDNN padded packed sequences not supported on all hardware currently (nvbugs/5458694).",
    )

Verification: please run these two tests on (a) a CPU-only runner and (b) an sm_100 GPU host to confirm xfail-only-on-unsupported-arch behavior. I can push a patch if you prefer.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
recipes/esm2_native_te_nvfsdp_thd/test_train.py(2 hunks)
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
Actionable comments posted: 0
🧹 Nitpick comments (2)
recipes/esm2_native_te_nvfsdp_thd/test_thd_format.py (2)
229-233: Reintroduce a deterministic, non-flaky masking check instead of commenting it out
Small batches make masking probabilistic and flaky. Make the check deterministic and increase the sample size to reduce variance.

    -    # TODO: This is a very flaky test with such a small input batch, we should make it larger if we want to ensure a
    -    # token is masked
    -    # else:  # Some masking should occur masked_count = (sample["labels"] != -100).sum() assert
    -    # masked_count > 0, f"With mlm_probability={mlm_prob}, some tokens should be masked"
    +    else:
    +        # Reduce stochastic variance and make RNG deterministic for this assertion
    +        big_batch = batch * 64  # ~1K tokens; keeps runtime low while avoiding flakiness
    +        with torch.random.fork_rng(enabled=True):
    +            torch.manual_seed(0)
    +            big_sample = data_collator(big_batch)
    +        masked_count = (big_sample["labels"] != -100).sum().item()
    +        assert masked_count > 0, f"With mlm_probability={mlm_prob}, some tokens should be masked"
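As a rough illustration of why the small batch is flaky, the sketch below estimates how often a collator would mask no tokens at all. It assumes each token is masked independently with probability `mlm_probability`, which is a simplification (special-token handling in the real collator is not modeled). The helper name `prob_no_token_masked` and the token counts 16 and 1024 are hypothetical, picked to roughly match the "`batch * 64`, ~1K tokens" remark in the suggestion.

```python
def prob_no_token_masked(num_tokens: int, mlm_probability: float) -> float:
    """Chance that zero tokens get masked, assuming independent per-token masking.

    Simplified model for illustration only; it is not the collator's actual logic.
    """
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    if not 0.0 <= mlm_probability <= 1.0:
        raise ValueError("mlm_probability must be in [0, 1]")
    # Each token independently escapes masking with probability (1 - p).
    return (1.0 - mlm_probability) ** num_tokens


if __name__ == "__main__":
    # Hypothetical sizes: a small batch vs. the enlarged batch from the suggestion.
    for n in (16, 1024):
        print(f"{n} tokens: P(no token masked) = {prob_no_token_masked(n, 0.15):.3e}")
```

Under these assumptions, a 16-token batch at `mlm_probability=0.15` masks nothing roughly 7% of the time, which is enough to fail `masked_count > 0` regularly, while at about 1K tokens the probability is negligible.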
120-128: Optional: seed RNG locally around the ratio check to further reduce variance
The masking ratio assertion can still be sensitive to RNG. Consider seeding locally to keep tests reproducible without affecting global state.

    with torch.random.fork_rng(enabled=True):
        torch.manual_seed(0)
        sample = data_collator(batch)
    labels = sample["labels"]
    input_ids = sample["input_ids"]
    masked_positions = (labels != -100).sum()
    total_positions = labels.numel()
    masking_ratio = masked_positions.float() / total_positions
📒 Files selected for processing (2)
recipes/esm2_native_te_nvfsdp_thd/test_thd_format.py (1 hunks)
recipes/esm2_native_te_nvfsdp_thd/test_train.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- recipes/esm2_native_te_nvfsdp_thd/test_train.py
Apparently we have xfail set to strict in the bionemo pyproject.toml which can get picked up in these runs?
Let's only xfail these if the cuda arch is not sm_100
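For context, a minimal sketch of why strict xfail matters here. With pytest's `xfail_strict` option enabled, a test marked xfail that unexpectedly passes is reported as a failure, so an unconditional xfail would break runs on hardware where the THD tests do pass. The code below is a toy model of that reporting rule, not pytest's implementation; the names `should_xfail_thd`, `strict_outcome`, and `SUPPORTED_CAPABILITY` are hypothetical, and treating `(10, 0)` as the only supported capability is an assumption taken from the sm_100 remark above.

```python
from typing import Optional, Tuple

# Assumption from the PR description: only sm_100, i.e. capability (10, 0), is supported.
SUPPORTED_CAPABILITY: Tuple[int, int] = (10, 0)


def should_xfail_thd(capability: Optional[Tuple[int, int]]) -> bool:
    """Return True when the THD tests are expected to fail.

    `capability` is the (major, minor) CUDA compute capability, or None when
    no CUDA device is available.
    """
    return capability != SUPPORTED_CAPABILITY


def strict_outcome(marked_xfail: bool, test_passed: bool) -> str:
    """Toy model of how a test is reported when strict xfail is enabled."""
    if not marked_xfail:
        return "passed" if test_passed else "failed"
    # Under strict xfail, an unexpected pass is treated as a failure.
    return "failed" if test_passed else "xfailed"


if __name__ == "__main__":
    # Unconditional xfail on supported hardware: the test passes, so the run fails.
    print(strict_outcome(marked_xfail=True, test_passed=True))
    # Conditional xfail on supported hardware: the marker is off, the test just passes.
    print(strict_outcome(should_xfail_thd((10, 0)), test_passed=True))
    # Conditional xfail on unsupported hardware: the expected failure is tolerated.
    print(strict_outcome(should_xfail_thd((9, 0)), test_passed=False))
```

Making the marker conditional keeps the expected-failure bookkeeping on unsupported architectures without turning a genuine pass on sm_100 into a strict-xfail failure.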