[logging] Wrap run in try/except + surface exceptions to wandb if supported#1706
Conversation
…upported Add `Tracking.log_exception`, which prints the traceback via loguru and, when wandb is configured, logs it to an `error/tracebacks` wandb.Table then finishes the run to flush the async upload. Wrap `BasePPOExp.run` and the SFT entrypoint in try/except so OOMs and other crashes raised inside actor init/training are routed through the tracker instead of only landing in Ray worker logs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request adds robust exception and traceback logging to tracking backends (such as Weights & Biases) during training setup and execution failures (e.g., OOMs) in both main_base.py and main_sft.py. It introduces a new log_exception method in the tracking utility to handle formatting and uploading tracebacks. The reviewer identified a critical issue where trainer.tracker is not yet initialized when worker setup fails in SFT, which prevents exceptions from being logged to wandb, and suggested reordering the initialization steps. Additionally, the reviewer recommended formatting the passed exception object directly using traceback.format_exception rather than relying on the global traceback.format_exc().
| if trainer.tracker is not None: | ||
| trainer.tracker.log_exception(e, step=trainer.global_step) | ||
| else: | ||
| logger.error(f"SFT setup failed before tracker was initialized:\n{e}") |
There was a problem hiding this comment.
In SFTTrainer.setup() (defined in skyrl/train/sft_trainer.py), self._init_tracker() is called after self._init_workers(). Since actor initialization (which is highly prone to OOMs and other startup crashes) happens inside _init_workers(), any failure there will occur before trainer.tracker is initialized. As a result, trainer.tracker will be None, and the exception will not be logged to wandb.
To fix this, we should initialize the tracker before initializing the workers in SFTTrainer.setup(). Since _init_tracker() only depends on configuration and has no dependency on the workers, this is safe to reorder.
| if self._exception_logged: | ||
| return | ||
| self._exception_logged = True | ||
| tb_str = traceback.format_exc()[-10000:] |
There was a problem hiding this comment.
Using traceback.format_exc() retrieves the traceback of the currently active exception from the global state. Since the exception object e is already explicitly passed to log_exception, it is more robust and idiomatic to format e directly using traceback.format_exception. This avoids relying on the global exception context, which can sometimes be lost or cleared if other operations are performed before this call.
| tb_str = traceback.format_exc()[-10000:] | |
| tb_str = "".join(traceback.format_exception(type(e), e, e.__traceback__))[-10000:] |
Summary
Tracking.log_exception: prints the traceback via loguru and, when wandb is configured, logs anerror/tracebackswandb.Tablerow then finishes the run to flush the async upload. Idempotent.BasePPOExp.runand the SFT entrypoint in try/except so OOMs and other crashes raised inside actor init / training are routed through the tracker instead of only landing in Ray worker logs.Test plan
error/tracebacksrow appears in wandb