[logging] Wrap `run` in try/except + surface exceptions to wandb if supported by erictang000 · Pull Request #1706 · NovaSky-AI/SkyRL

erictang000 · 2026-05-26T20:59:41Z

Summary

Add Tracking.log_exception: prints the traceback via loguru and, when wandb is configured, logs an error/tracebacks wandb.Table row then finishes the run to flush the async upload. Idempotent.
Wrap BasePPOExp.run and the SFT entrypoint in try/except so OOMs and other crashes raised inside actor init / training are routed through the tracker instead of only landing in Ray worker logs.
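
The two changes above can be sketched roughly as follows. This is an illustrative stand-in, not the code from the PR: `SketchTracker` and `run_with_tracking` are hypothetical names, an in-memory list replaces the `error/tracebacks` wandb.Table, a boolean replaces finishing the wandb run, and re-raising after logging is an assumption about the wrapper's behavior.

```python
import traceback


class SketchTracker:
    """Hypothetical stand-in for Tracking; records rows instead of calling wandb."""

    def __init__(self):
        self.rows = []                  # stands in for the error/tracebacks table
        self.finished = False           # stands in for finishing the wandb run
        self._exception_logged = False  # guard that makes log_exception idempotent

    def log_exception(self, e, step=None):
        # A second call is a no-op, so nested handlers cannot double-log.
        if self._exception_logged:
            return
        self._exception_logged = True
        # Format the passed exception and keep only the last 10000 characters.
        tb_str = "".join(
            traceback.format_exception(type(e), e, e.__traceback__)
        )[-10000:]
        self.rows.append({"step": step, "traceback": tb_str})
        self.finished = True


def run_with_tracking(tracker, run_fn, step=0):
    """Call run_fn; route any exception through the tracker, then re-raise (assumed)."""
    try:
        return run_fn()
    except Exception as e:
        tracker.log_exception(e, step=step)
        raise
```

Calling `run_with_tracking(tracker, fn)` with a function that raises `MemoryError` would leave one row in `tracker.rows` and still propagate the error to the caller.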

Test plan

Trigger an OOM / actor-init failure and confirm an error/tracebacks row appears in wandb

…upported

Add `Tracking.log_exception`, which prints the traceback via loguru and, when wandb is configured, logs it to an `error/tracebacks` wandb.Table then finishes the run to flush the async upload.

Wrap `BasePPOExp.run` and the SFT entrypoint in try/except so OOMs and other crashes raised inside actor init/training are routed through the tracker instead of only landing in Ray worker logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request adds robust exception and traceback logging to tracking backends (such as Weights & Biases) during training setup and execution failures (e.g., OOMs) in both main_base.py and main_sft.py. It introduces a new log_exception method in the tracking utility to handle formatting and uploading tracebacks. The reviewer identified a critical issue where trainer.tracker is not yet initialized when worker setup fails in SFT, which prevents exceptions from being logged to wandb, and suggested reordering the initialization steps. Additionally, the reviewer recommended formatting the passed exception object directly using traceback.format_exception rather than relying on the global traceback.format_exc().

gemini-code-assist · 2026-05-26T21:00:33Z

+        if trainer.tracker is not None:
+            trainer.tracker.log_exception(e, step=trainer.global_step)
+        else:
+            logger.error(f"SFT setup failed before tracker was initialized:\n{e}")


In SFTTrainer.setup() (defined in skyrl/train/sft_trainer.py), self._init_tracker() is called after self._init_workers(). Since actor initialization (which is highly prone to OOMs and other startup crashes) happens inside _init_workers(), any failure there will occur before trainer.tracker is initialized. As a result, trainer.tracker will be None, and the exception will not be logged to wandb.

To fix this, we should initialize the tracker before initializing the workers in SFTTrainer.setup(). Since _init_tracker() only depends on configuration and has no dependency on the workers, this is safe to reorder.
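
A minimal sketch of the suggested ordering, assuming a trainer shaped like the one described in this comment. `SketchSFTTrainer`, its `fail_workers` flag, and the list used as a tracker are hypothetical; only the method names `setup`, `_init_tracker`, and `_init_workers` come from the review text.

```python
class SketchSFTTrainer:
    """Hypothetical trainer illustrating tracker-before-workers ordering."""

    def __init__(self, fail_workers=False):
        self.tracker = None
        self.global_step = 0
        self._fail_workers = fail_workers  # simulates an actor-init crash

    def _init_tracker(self):
        # Assumed to depend only on configuration, never on the workers.
        self.tracker = []

    def _init_workers(self):
        if self._fail_workers:
            raise RuntimeError("simulated actor-init OOM")

    def setup(self):
        # Tracker first, so a failure in _init_workers can still be reported.
        self._init_tracker()
        self._init_workers()
```

With this ordering, a failure inside `_init_workers()` leaves `trainer.tracker` populated, so the `if trainer.tracker is not None` branch in the entrypoint is the one that runs.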

gemini-code-assist · 2026-05-26T21:00:33Z

+        if self._exception_logged:
+            return
+        self._exception_logged = True
+        tb_str = traceback.format_exc()[-10000:]


Using traceback.format_exc() retrieves the traceback of the currently active exception from the global state. Since the exception object e is already explicitly passed to log_exception, it is more robust and idiomatic to format e directly using traceback.format_exception. This avoids relying on the global exception context, which can sometimes be lost or cleared if other operations are performed before this call.

Suggested change

-        tb_str = traceback.format_exc()[-10000:]
+        tb_str = "".join(traceback.format_exception(type(e), e, e.__traceback__))[-10000:]
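
The difference is easiest to see when the formatting happens after the `except` block has exited. The snippet below is a small standalone illustration using only the standard library; `capture` is a hypothetical helper, not code from the PR.

```python
import traceback


def capture():
    """Raise and catch an error, returning the exception object itself."""
    try:
        raise ValueError("boom")
    except ValueError as e:
        return e  # the returned object still carries e.__traceback__


err = capture()

# No exception is being handled here, so format_exc() has nothing to report.
from_global = traceback.format_exc()

# Formatting the object directly works regardless of the current context.
from_object = "".join(
    traceback.format_exception(type(err), err, err.__traceback__)
)[-10000:]
```

Here `from_object` contains the `ValueError` traceback, while `from_global` does not mention it, because `format_exc()` only sees the exception currently being handled.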

gemini-code-assist Bot reviewed May 26, 2026


ab51d5d

erictang000 merged commit 8e1a7da into main May 28, 2026
3 checks passed

erictang000 deleted the log_ooms_to_wandb branch May 28, 2026 01:32

erictang000 mentioned this pull request May 28, 2026

add validation sample logging #1713 (Open, 2 tasks)

erictang000 merged 2 commits into main from log_ooms_to_wandb
