Skip to content

[logging] Wrap run in try/except + surface exceptions to wandb if supported#1706

Merged
erictang000 merged 2 commits into
mainfrom
log_ooms_to_wandb
May 28, 2026
Merged

[logging] Wrap run in try/except + surface exceptions to wandb if supported#1706
erictang000 merged 2 commits into
mainfrom
log_ooms_to_wandb

Conversation

@erictang000
Copy link
Copy Markdown
Collaborator

Summary

  • Add Tracking.log_exception: prints the traceback via loguru and, when wandb is configured, logs an error/tracebacks wandb.Table row then finishes the run to flush the async upload. Idempotent.
  • Wrap BasePPOExp.run and the SFT entrypoint in try/except so OOMs and other crashes raised inside actor init / training are routed through the tracker instead of only landing in Ray worker logs.

Test plan

  • Trigger an OOM / actor-init failure and confirm an error/tracebacks row appears in wandb

…upported

Add `Tracking.log_exception`, which prints the traceback via loguru and, when
wandb is configured, logs it to an `error/tracebacks` wandb.Table then finishes
the run to flush the async upload. Wrap `BasePPOExp.run` and the SFT entrypoint
in try/except so OOMs and other crashes raised inside actor init/training are
routed through the tracker instead of only landing in Ray worker logs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request adds robust exception and traceback logging to tracking backends (such as Weights & Biases) during training setup and execution failures (e.g., OOMs) in both main_base.py and main_sft.py. It introduces a new log_exception method in the tracking utility to handle formatting and uploading tracebacks. The reviewer identified a critical issue where trainer.tracker is not yet initialized when worker setup fails in SFT, which prevents exceptions from being logged to wandb, and suggested reordering the initialization steps. Additionally, the reviewer recommended formatting the passed exception object directly using traceback.format_exception rather than relying on the global traceback.format_exc().

Comment thread skyrl/train/main_sft.py
Comment on lines +47 to +50
if trainer.tracker is not None:
trainer.tracker.log_exception(e, step=trainer.global_step)
else:
logger.error(f"SFT setup failed before tracker was initialized:\n{e}")
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

In SFTTrainer.setup() (defined in skyrl/train/sft_trainer.py), self._init_tracker() is called after self._init_workers(). Since actor initialization (which is highly prone to OOMs and other startup crashes) happens inside _init_workers(), any failure there will occur before trainer.tracker is initialized. As a result, trainer.tracker will be None, and the exception will not be logged to wandb.

To fix this, we should initialize the tracker before initializing the workers in SFTTrainer.setup(). Since _init_tracker() only depends on configuration and has no dependency on the workers, this is safe to reorder.

if self._exception_logged:
return
self._exception_logged = True
tb_str = traceback.format_exc()[-10000:]
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using traceback.format_exc() retrieves the traceback of the currently active exception from the global state. Since the exception object e is already explicitly passed to log_exception, it is more robust and idiomatic to format e directly using traceback.format_exception. This avoids relying on the global exception context, which can sometimes be lost or cleared if other operations are performed before this call.

Suggested change
tb_str = traceback.format_exc()[-10000:]
tb_str = "".join(traceback.format_exception(type(e), e, e.__traceback__))[-10000:]

@erictang000 erictang000 merged commit 8e1a7da into main May 28, 2026
3 checks passed
@erictang000 erictang000 deleted the log_ooms_to_wandb branch May 28, 2026 01:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant