Fix VLM metric structured output #594

Merged
davidberenstein1957 merged 1 commit into feat/metrics-vlm-support from davidberenstein1957/vlm-metrics-review
Mar 24, 2026

@davidberenstein1957
Member

Summary

  • fall back to cpu during metric construction for stateful VLM metrics that only run on CPU
  • route transformers structured outputs through outlines for existing pydantic schemas
  • add focused regressions plus a manual VLM smoke script
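
The fallback in the first bullet can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `resolve_metric_device` and `StubVLMMetric` are hypothetical names, and only the `runs_on` attribute and the fall-back-to-`cpu` rule are taken from the change itself.

```python
from typing import Sequence


def resolve_metric_device(requested: str, runs_on: Sequence[str]) -> str:
    """Return the device a metric should be built on (hypothetical helper)."""
    # Drop an optional index suffix, e.g. "cuda:0" -> "cuda".
    base = requested.split(":", 1)[0]
    # Fall back only when the request is unsupported and cpu is supported.
    if base not in runs_on and "cpu" in runs_on:
        return "cpu"
    return requested


class StubVLMMetric:
    """Stand-in for a CPU-only stateful metric (assumption, not a pruna class)."""

    runs_on = ["cpu"]


cpu_only = resolve_metric_device("cuda:0", StubVLMMetric.runs_on)
unchanged = resolve_metric_device("cuda:0", ["cuda", "cpu"])
```

In this sketch a metric that lists neither the requested device nor `cpu` keeps the requested device, so any resulting error still surfaces in the metric's own constructor.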

Verification

  • uv run --extra dev pytest -q tests/evaluation/test_task.py::test_vlm_metrics_fallback_to_cpu_on_auto_device tests/evaluation/test_vlm_metrics.py::test_transformers_generate_routes_pydantic_response_format_to_outlines tests/evaluation/test_vlm_metrics.py::test_transformers_outlines_result_serialization tests/evaluation/test_vlm_metrics.py::test_evaluation_agent_update_stateful_metrics_with_stub_vlm
  • uv run --extra dev python scripts/smoke_vlm_metrics.py --stub

@davidberenstein1957 davidberenstein1957 merged commit 20f59c9 into feat/metrics-vlm-support Mar 24, 2026
1 check passed

@cursor cursor bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix prepared a fix for the issue found in the latest run.

  • ✅ Fixed: Registry crashes when metric_device is None
    • I added a guard in MetricRegistry so device normalization/fallback only runs when a stateful metric device is provided, allowing None to pass through to StatefulMetric initialization as before.

Or push these changes by commenting:

@cursor push b308d27d2e
Preview (b308d27d2e)
diff --git a/src/pruna/evaluation/metrics/registry.py b/src/pruna/evaluation/metrics/registry.py
--- a/src/pruna/evaluation/metrics/registry.py
+++ b/src/pruna/evaluation/metrics/registry.py
@@ -137,9 +137,10 @@
         elif isclass(metric_cls):
             if issubclass(metric_cls, StatefulMetric):
                 metric_device = stateful_metric_device if stateful_metric_device else device
-                requested_device, _ = split_device(device_to_string(metric_device), strict=False)
-                if requested_device not in metric_cls.runs_on and "cpu" in metric_cls.runs_on:
-                    metric_device = "cpu"
+                if metric_device is not None:
+                    requested_device, _ = split_device(device_to_string(metric_device), strict=False)
+                    if requested_device not in metric_cls.runs_on and "cpu" in metric_cls.runs_on:
+                        metric_device = "cpu"
                 kwargs["device"] = metric_device
             elif issubclass(metric_cls, BaseMetric):
                 kwargs["device"] = inference_device if inference_device else device

             if issubclass(metric_cls, StatefulMetric):
-                kwargs["device"] = stateful_metric_device if stateful_metric_device else device
+                metric_device = stateful_metric_device if stateful_metric_device else device
+                requested_device, _ = split_device(device_to_string(metric_device), strict=False)

Registry crashes when metric_device is None

High Severity

device_to_string(metric_device) raises ValueError when metric_device is None. This happens when MetricRegistry.get_metric is called for a StatefulMetric subclass without providing device, stateful_metric_device, or inference_device kwargs. The previous code simply passed None through to the metric constructor, which handled it gracefully via set_to_best_available_device(None). At least one existing caller (base_tester.py) invokes get_metric(metric) with no device args for arbitrary metrics.
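
The failure mode can be reproduced in isolation with a small sketch. It assumes a stub `device_to_string_stub` that rejects `None` in the way described above. The real pruna helpers are not imported, and both `pick_device_*` functions are illustrative names only.

```python
from typing import List, Optional


def device_to_string_stub(device: Optional[str]) -> str:
    """Stub mimicking the reported behaviour: None raises ValueError."""
    if device is None:
        raise ValueError("device must not be None")
    return str(device)


def pick_device_unguarded(device: Optional[str], runs_on: List[str]) -> Optional[str]:
    # Normalizes unconditionally, so a missing device raises.
    requested = device_to_string_stub(device).split(":", 1)[0]
    return "cpu" if requested not in runs_on and "cpu" in runs_on else device


def pick_device_guarded(device: Optional[str], runs_on: List[str]) -> Optional[str]:
    # Skips normalization for None so the metric constructor can choose a device.
    if device is None:
        return None
    return pick_device_unguarded(device, runs_on)


try:
    pick_device_unguarded(None, ["cpu"])
    crashed = False
except ValueError:
    crashed = True
```

With the guard in place, `None` passes through untouched while an explicit but unsupported device still falls back to `cpu`.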
