Route string/bool values to Pluto config, handle unmapped types #120
asaiacai wants to merge 5 commits into
WandbRunWrapper.log() pre-filtered logged values and only forwarded
Python int/float, torch scalar tensors, and wandb media to Pluto.
String values had no branch (_convert_wandb_to_pluto returns None for
str), so they were silently dropped.
Concretely, a checkpoint-metadata logger that does:
    wandb.log({'checkpoint/step': step,        # int -> reached Pluto
               'checkpoint/epoch': epoch,      # int -> reached Pluto
               'checkpoint/r2_path': r2_path,  # str -> DROPPED
               'checkpoint/local_path': path}, # str -> DROPPED
              step=step)
lost the two string paths while the numeric step/epoch came through
fine. Those string paths are exactly what a resume flow needs to know
which object to stage.
Now str/bool values route to update_config() (latest-wins, queryable
via get_run().config), mirroring where wandb places loose strings (run
summary/overview). A resume skill can read the most recent
checkpoint/r2_path off the run config instead of scanning storage.
Also forward numpy scalars (np.generic, excluding np.bool_) as metrics
via .item(). This is defensive hardening, not part of the above bug:
the shim was stricter than Pluto's own log(), which already accepts
anything exposing .item(). Frameworks that log np.int64/np.float32
were dropped here even though Pluto core would have kept them.
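The str/bool routing described in this commit message can be sketched roughly as below. This is a minimal illustration, not the shim's actual code: `split_payload` and the example paths are hypothetical names chosen for the sketch, and media handling is left out.

```python
from typing import Any, Dict, Tuple


def split_payload(data: Dict[str, Any]) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split one log payload into (metrics, config). Hypothetical helper."""
    metrics: Dict[str, Any] = {}
    config: Dict[str, Any] = {}
    for key, value in data.items():
        # bool must be checked before int/float: bool is a subclass of int.
        if isinstance(value, (bool, str)):
            config[key] = value
        elif isinstance(value, (int, float)):
            metrics[key] = value
    return metrics, config


# Example payload shaped like the checkpoint logger above (paths are made up).
metrics, config = split_payload({
    'checkpoint/step': 100,
    'checkpoint/epoch': 3,
    'checkpoint/r2_path': 'r2://bucket/ckpt-100',
    'checkpoint/local_path': '/tmp/ckpt-100',
})
```

With this split, the numeric step/epoch stay time-series metrics while both paths end up in the config dict, which is what a resume flow would read.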
Generalize the shim's numeric forwarding to accept any value exposing a callable .item() returning a number, replacing the narrower torch-tensor + numpy-generic checks. This mirrors Pluto's own log() (op._process_log_item_sync), which already forwards anything with .item().
Motivation: a checkpoint logger annotated as log_checkpoint(step: int, epoch: int, ...) can still pass a non-plain-int at runtime (np.int64, a 0-d tensor, a framework scalar). The old shim dropped those even though Pluto core would have kept them, so a logged 'epoch' could silently never reach Pluto.
A failing .item() (multi-element array) is treated as not-a-scalar and ignored, same as Pluto would fail it. bool/str remain routed to config.
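A duck-typed scalar check along these lines might look like the sketch below. It is an illustration under assumptions, not the PR's `_as_scalar_number`: the function name, the choice to reject a bool returned by `.item()`, and the `FakeScalar`/`FakeArray` stand-ins (used so the example runs without numpy or torch) are all hypothetical.

```python
from typing import Any, Optional, Union

Number = Union[int, float]


def as_scalar_number(value: Any) -> Optional[Number]:
    """Return a plain number if value exposes a working .item(), else None."""
    item = getattr(value, 'item', None)
    if not callable(item):
        return None
    try:
        result = item()
    except Exception:
        # e.g. a multi-element array: not a scalar, so not a metric.
        return None
    # Assumption: a bool result is not a metric (bools are routed to config).
    if isinstance(result, bool) or not isinstance(result, (int, float)):
        return None
    return result


class FakeScalar:
    """Stand-in for np.int64 or a 0-d tensor (hypothetical)."""

    def __init__(self, v: Number) -> None:
        self._v = v

    def item(self) -> Number:
        return self._v


class FakeArray:
    """Stand-in for a multi-element array whose .item() fails (hypothetical)."""

    def item(self) -> Number:
        raise ValueError('can only convert an array of size 1 to a Python scalar')


epoch = as_scalar_number(FakeScalar(7))    # plain int
not_scalar = as_scalar_number(FakeArray()) # None, so the value is not a metric
```

Checking for a callable `.item()` instead of specific types is what keeps the shim from being stricter than the core it forwards to.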
The shim only forwarded recognized types (numbers, wandb media, str/bool) and silently dropped everything else: dicts, None, raw/multi-element tensors, numpy arrays, unconvertible wandb media (Html/Object3D/...), custom objects. Silent drops are what made missing data (e.g. a whole checkpoint metadata call) so hard to diagnose: no error, no warning, just absent.
Add a 'preserve-what-we-can, otherwise fail loud' fallback in _handle_unforwardable():
- JSON-serializable leftovers (nested dicts/lists of primitives, None, etc.) are preserved as Pluto config, mirroring how wandb keeps loose values in the run summary.
- Anything else warns ONCE per key (WARNING, naming key + type) instead of vanishing. The value still reaches W&B; only the Pluto copy is dropped, and now visibly.
Never raises; wandb behavior is unaffected.
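The fallback described in this commit message could be sketched as follows. This is a simplified illustration and not the PR's `_handle_unforwardable()`: the `FallbackRouter` class, its attribute names, and the logger name are assumptions made for the example.

```python
import json
import logging
from typing import Any, Dict, Set

logger = logging.getLogger('shim_sketch')


def is_json_serializable(value: Any) -> bool:
    """True if json.dumps accepts the value as-is."""
    try:
        json.dumps(value)
        return True
    except (TypeError, ValueError):
        return False


class FallbackRouter:
    """Hypothetical holder for the warn-once state."""

    def __init__(self) -> None:
        self.warned_keys: Set[str] = set()

    def handle(self, key: str, value: Any, config: Dict[str, Any]) -> None:
        # Preserve what we can: JSON-serializable leftovers become config.
        if is_json_serializable(value):
            config[key] = value
            return
        # Otherwise be loud, but only once per key, and never raise.
        if key not in self.warned_keys:
            self.warned_keys.add(key)
            logger.warning('no Pluto mapping for %r (type %s)',
                           key, type(value).__name__)


router = FallbackRouter()
config: Dict[str, Any] = {}
router.handle('meta/info', {'kind': 'resume', 'attempt': 3}, config)
router.handle('note', None, config)
router.handle('blob', object(), config)  # warns
router.handle('blob', object(), config)  # silent: key already warned
```

Tracking warned keys in a set is what keeps a value logged every step from flooding the log output.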
The previous change warned users (WARNING log) when a value had no Pluto mapping. But that's a gap in OUR type coverage, not a user error; people migrating away from wandb shouldn't be nagged about types only the package maintainers can fix.
Reroute the signal: genuinely unforwardable values (non-JSON-serializable, no metric/media mapping) now emit a Sentry telemetry alert via the SDK's isolated client (pluto/sentry.py), once per key, grouped by type name so we can see which unhandled types show up in the wild and add coverage. The local log drops to debug. JSON-serializable leftovers still fall back to config silently.
Sentry honors PLUTO_DISABLE_TELEMETRY and swallows all errors, so this never affects the user or wandb.
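The once-per-key, never-raising telemetry path can be illustrated with the sketch below. It does not use the real pluto/sentry.py client: `send` is an injected callable standing in for it, the message format is made up, and treating any non-empty `PLUTO_DISABLE_TELEMETRY` value as "disabled" is an assumption about how the variable is interpreted.

```python
import logging
import os
from typing import Any, Callable, List, Set

logger = logging.getLogger('shim_sketch')


def telemetry_disabled() -> bool:
    # Assumption: any non-empty value of the variable disables telemetry.
    return bool(os.environ.get('PLUTO_DISABLE_TELEMETRY'))


class UnforwardableReporter:
    """Hypothetical reporter: debug log locally, alert maintainers once per key."""

    def __init__(self, send: Callable[[str], None]) -> None:
        self._send = send  # stand-in for the SDK's telemetry client
        self._reported_keys: Set[str] = set()

    def report(self, key: str, value: Any) -> None:
        if key in self._reported_keys:
            return
        self._reported_keys.add(key)
        type_name = type(value).__name__
        logger.debug('no Pluto mapping for %r (type %s)', key, type_name)
        if telemetry_disabled():
            return
        try:
            # Group by type name so unhandled types can be counted.
            self._send(f'unforwardable wandb value: {type_name}')
        except Exception:
            pass  # telemetry must never affect the user or wandb


sent: List[str] = []
reporter = UnforwardableReporter(sent.append)
for _ in range(3):
    reporter.report('blob', object())  # at most one alert for this key
```

Swallowing every exception from the sender is deliberate here: a telemetry failure is less important than leaving the training loop untouched.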
Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.
Reviewed by Cursor Bugbot for commit 3d52d9c. Configure here.
    logger.debug(
        f'pluto.compat.wandb: Failed to sync string/bool '
        f'values to Pluto config: {e}'
    )
Pluto log failure skips config
Medium Severity
update_config runs in the same outer try as pluto_run.log. If metric/media logging raises, execution jumps to the broad handler and string or bool config updates from the same wandb.log call are never sent to Pluto.
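One way to address this is to give the metric call and the config call their own try blocks, sketched below. The `forward` function and the `FlakyRun` stand-in are hypothetical and only illustrate the control flow; they are not the shim's code.

```python
import logging
from typing import Any, Dict

logger = logging.getLogger('shim_sketch')


def forward(run: Any, metrics: Dict[str, Any], config: Dict[str, Any]) -> None:
    """Send metrics and config independently so one failure cannot skip the other."""
    if metrics:
        try:
            run.log(metrics)
        except Exception as e:
            logger.debug('metric forwarding failed: %s', e)
    if config:
        try:
            run.update_config(config)
        except Exception as e:
            logger.debug('config forwarding failed: %s', e)


class FlakyRun:
    """Hypothetical stand-in whose log() always fails."""

    def __init__(self) -> None:
        self.config: Dict[str, Any] = {}

    def log(self, data: Dict[str, Any]) -> None:
        raise RuntimeError('metric backend down')

    def update_config(self, cfg: Dict[str, Any]) -> None:
        self.config.update(cfg)


run = FlakyRun()
forward(run, {'loss': 0.5}, {'phase': 'train'})  # config still arrives
```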
    try:
        json.dumps(value)
        return True
    except (TypeError, ValueError):
Config gate rejects OmegaConf values
Medium Severity
_is_json_serializable uses plain json.dumps, while update_config normalizes via to_native_config (including OmegaConf). Logged DictConfig or nested OmegaConf nodes fail the JSON check and are treated as unforwardable instead of being stored as config.
Code Review
This pull request enhances the W&B compatibility shim by implementing robust value routing to Pluto. Specifically, it routes string and boolean values to the run configuration, supports numpy scalars as metrics, and provides fallback handling for unforwardable values (either saving them as config if JSON-serializable or reporting them to Sentry). The reviewer's feedback identifies a critical performance bottleneck where logging string or boolean values at every step triggers redundant, synchronous configuration updates. The reviewer suggests caching the last logged configuration values to only sync actual changes, and provides a unit test to verify this optimization.
    # Keys we've already warned about being unforwardable to Pluto, so a
    # value logged every step warns once rather than spamming the logs.
    self._unforwardable_warned: set = set()
Performance Bottleneck: Redundant Config Updates
Logging string or boolean values (such as training phase, status, or checkpoint paths) at every step is a very common pattern in machine learning training loops. Currently, every call to log() containing a string or boolean will trigger a call to self._pluto_run.update_config(pluto_config).
If the sync process is disabled or not yet initialized, update_config performs a synchronous, blocking HTTP POST request to the server. Even when the sync process is enabled, it triggers a synchronous write to the local SQLite database. Doing this at every single step will severely degrade training performance due to network or disk I/O bottlenecks.
To prevent this, we should cache the last logged config values and only send updates when a value actually changes.
    # Keys we've already warned about being unforwardable to Pluto, so a
    # value logged every step warns once rather than spamming the logs.
    self._unforwardable_warned: set = set()
    # Cache of the last logged config values to avoid redundant updates.
    self._last_logged_config: Dict[str, Any] = {}
      for key, value in data.items():
    -     if isinstance(value, (int, float)):
    +     if isinstance(value, bool):
    +         # bool is a subclass of int, but Pluto drops bool
    +         # metrics; surface it as config so it isn't lost.
    +         pluto_config[key] = value
    +     elif isinstance(value, (int, float)):
              pluto_data[key] = value
    -     elif _is_torch_tensor_scalar(value):
    -         pluto_data[key] = value.item()
    +     elif (num := _as_scalar_number(value)) is not None:
    +         pluto_data[key] = num
    +     elif isinstance(value, str):
    +         pluto_config[key] = value
Only add string and boolean values to pluto_config if they have actually changed from their previously logged values.
    for key, value in data.items():
        if isinstance(value, bool):
            # bool is a subclass of int, but Pluto drops bool
            # metrics; surface it as config so it isn't lost.
            if self._last_logged_config.get(key) != value:
                pluto_config[key] = value
        elif isinstance(value, (int, float)):
            pluto_data[key] = value
        elif (num := _as_scalar_number(value)) is not None:
            pluto_data[key] = num
        elif isinstance(value, str):
            if self._last_logged_config.get(key) != value:
                pluto_config[key] = value
    if pluto_config:
        try:
            self._pluto_run.update_config(pluto_config)
        except Exception as e:
            logger.debug(
                f'pluto.compat.wandb: Failed to sync string/bool '
                f'values to Pluto config: {e}'
            )
Update the local config cache self._last_logged_config once the config has been successfully synced.
    if pluto_config:
        try:
            self._pluto_run.update_config(pluto_config)
            self._last_logged_config.update(pluto_config)
        except Exception as e:
            logger.debug(
                f'pluto.compat.wandb: Failed to sync string/bool '
                f'values to Pluto config: {e}'
            )
    if _is_json_serializable(value):
        pluto_config[key] = value
        return
Only add fallback JSON-serializable values to pluto_config if they have changed from their previously logged values.
    if _is_json_serializable(value):
        if self._last_logged_config.get(key) != value:
            pluto_config[key] = value
        return
    def test_json_serializable_unmapped_value_falls_back_to_config():
        """A dict/None with no metric mapping is preserved as config, not dropped."""
        wrapper, pluto_run = _make_wrapper()

        wrapper.log({'meta/info': {'kind': 'resume', 'attempt': 3}, 'note': None})

        cfg = pluto_run.update_config.call_args.args[0]
        assert cfg['meta/info'] == {'kind': 'resume', 'attempt': 3}
        assert cfg['note'] is None
        assert not pluto_run.log.called  # no numeric metrics in this call
Add a unit test to verify that redundant config updates are skipped to avoid performance bottlenecks.
    def test_log_skips_redundant_config_updates():
        """Verify that redundant config updates are skipped to avoid performance bottlenecks."""
        wrapper, pluto_run = _make_wrapper()

        # First log: config is updated
        wrapper.log({'phase': 'train', 'loss': 0.5})
        assert pluto_run.update_config.call_count == 1

        # Second log with same config value: update_config should NOT be called again
        pluto_run.update_config.reset_mock()
        wrapper.log({'phase': 'train', 'loss': 0.4})
        assert pluto_run.update_config.call_count == 0

        # Third log with changed config value: update_config should be called
        wrapper.log({'phase': 'val', 'loss': 0.3})
        assert pluto_run.update_config.call_count == 1
        assert pluto_run.update_config.call_args.args[0] == {'phase': 'val'}
Minimal end-to-end script mirroring linum's WandbLogger.log_checkpoint: numeric step/epoch (as np.int64) plus string r2_path/local_path logged in one wandb.log call, then read back from Pluto. Asserts step/epoch land as metrics and the paths land in config. Fails on pre-fix builds, passes on this branch. Uses wandb mode=disabled so no W&B account is needed.
8864c4c to 3d52d9c
Three fixes from PR #120 review (Cursor Bugbot + Gemini):
1. Config no longer skipped on metric-log failure. update_config() shared the outer try with pluto_run.log(); a metric/media logging exception jumped to the handler and dropped str/bool config from the same call. Metrics and config now send in independent try blocks.
2. OmegaConf values are storable as config. The fallback gate used plain json.dumps, but update_config normalizes via to_native_config (which deep-converts DictConfig/ListConfig). A logged OmegaConf node was wrongly treated as unforwardable (dropped + Sentry-alerted). _config_storable_value now mirrors update_config's normalization, so OmegaConf is kept while tensors/ndarrays/custom objects still fall through to Sentry.
3. Skip redundant config writes. Logging an unchanged str/bool/config value every step re-triggered update_config (a SQLite write) each step. Dedup against the last synced value via self._last_logged_config, updated only on a successful update_config. Uses a _MISSING sentinel so a first-time None is still sent (None != missing).
Tests: dedup skip/change behavior, OmegaConf-node fallback.
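The dedup in fix 3, including why a sentinel is needed, can be sketched as below. `ConfigDeduper` and its method names are hypothetical; only the idea of comparing against the last synced value with a `_MISSING` default comes from the commit message.

```python
from typing import Any, Callable, Dict, List

# Sentinel distinct from every real value, so "never synced" differs from None.
_MISSING = object()


class ConfigDeduper:
    """Hypothetical helper: only forward config keys whose value changed."""

    def __init__(self, update_config: Callable[[Dict[str, Any]], None]) -> None:
        self._update_config = update_config
        self._last_synced: Dict[str, Any] = {}

    def sync(self, candidate: Dict[str, Any]) -> Dict[str, Any]:
        changed = {
            key: value
            for key, value in candidate.items()
            # With .get(key) alone, a first-time None would look unchanged.
            if self._last_synced.get(key, _MISSING) != value
        }
        if changed:
            self._update_config(changed)
            # Only reached if update_config did not raise.
            self._last_synced.update(changed)
        return changed


calls: List[Dict[str, Any]] = []
deduper = ConfigDeduper(calls.append)
deduper.sync({'note': None})                    # sent: first time, even though None
deduper.sync({'note': None})                    # skipped: unchanged
deduper.sync({'phase': 'train', 'note': None})  # only 'phase' is sent
```

Updating the cache after the write, rather than before, means a failed write is retried on the next log call instead of being silently treated as synced.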


Description
This PR improves how the wandb compatibility shim forwards logged values to Pluto by implementing proper value routing and handling for unmapped types.
Key Changes
Value Routing for Pluto:
- Numbers and values exposing .item() (numpy/torch/etc.) → Pluto metrics (time-series)
- Strings and bools → Pluto config (latest-wins, queryable via get_run().config). This enables the /resume-crashed-run use case where checkpoint paths and other string metadata need to be readable from Pluto
Improved Scalar Detection:
- Replaced _is_torch_tensor_scalar() with _as_scalar_number() that mirrors Pluto core's own op._process_log_item_sync
- Accepts any value with a callable .item() method (numpy scalars, 0-d arrays, custom tensor-like objects)
Unforwardable Value Handling:
- Added _handle_unforwardable() method to gracefully handle values with no metric/media/config mapping
Config Synchronization:
- Added pluto_config dict to collect string/bool/JSON-serializable values
- Calls pluto_run.update_config() after logging metrics
Testing
Added comprehensive test coverage in test_wandb_compat.py:
- test_log_routes_strings_to_config_not_metrics: Verifies strings land in config, not metrics
- test_log_forwards_numpy_scalars_as_metrics: Ensures numpy scalars reach Pluto
- test_log_forwards_any_item_scalar_like_pluto_core: Guards against being stricter than Pluto core
- test_log_does_not_treat_failing_item_as_scalar: Handles non-scalars with .item() gracefully
- test_unforwardable_value_alerts_sentry_once_not_user: Verifies Sentry alerting and deduplication
- test_json_serializable_unmapped_value_falls_back_to_config: Confirms dict/None fallback to config
Tested (run the relevant ones):
- bash format.sh
https://claude.ai/code/session_01JzBDybRDXt1TLnsQAxhKqz
Note
Medium Risk
Changes dual-logging routing in a hot path (wandb.log); mis-routing could affect resume/metadata in Pluto while wandb stays correct, but failures are contained in try/except and covered by new unit tests.
Overview
The wandb→Pluto shim WandbRunWrapper.log now splits each wandb.log payload before forwarding: numbers (including any scalar with a working .item(), not only torch) go to pluto_run.log, while str, bool, and JSON-serializable leftovers go to pluto_run.update_config (latest-wins), so checkpoint paths and similar metadata are readable from Pluto config (e.g. resume flows).
Values that are not metrics, media, or JSON config are handled by new _handle_unforwardable: serializable dicts/None/primitive lists are kept as config; otherwise the Pluto copy is skipped, debug is logged once per key, and Sentry gets a one-time maintainer alert. wandb behavior is unchanged.