fix(llm): ensure second LLM gets independent metrics and telemetry#1997
Conversation
When `LLM.model_copy()` was called (e.g., to create a condenser LLM from an agent LLM), the `_metrics` and `_telemetry` private attributes were shared between the original and copied LLM instances. This caused metrics to be double-counted when both LLMs were used, as they were updating the same `Metrics` object.

This fix overrides `model_copy()` to create fresh `Metrics` and `Telemetry` instances for the copied LLM, ensuring each LLM tracks its own usage independently.

Fixes #418

Co-authored-by: openhands <openhands@all-hands.dev>
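The double-counting can be illustrated with plain Python (a minimal sketch using stand-in `Metrics`/`LLM` classes, not the SDK's types): a shallow copy shares the mutable metrics object, so costs recorded through either instance accumulate in both.

```python
import copy

class Metrics:
    """Stand-in for the SDK's Metrics accumulator (hypothetical)."""
    def __init__(self):
        self.accumulated_cost = 0.0

    def add_cost(self, cost: float) -> None:
        self.accumulated_cost += cost

class LLM:
    """Stand-in LLM; Pydantic's model_copy() shallow-copies private attributes."""
    def __init__(self):
        self._metrics = Metrics()

agent_llm = LLM()
condenser_llm = copy.copy(agent_llm)  # shallow copy, like model_copy()

agent_llm._metrics.add_cost(1.0)      # cost recorded by the agent LLM
condenser_llm._metrics.add_cost(2.0)  # cost recorded by the condenser LLM

# Both instances hold the SAME Metrics object, so each reports the combined total.
print(agent_llm._metrics.accumulated_cost)      # 3.0
print(condenser_llm._metrics.accumulated_cost)  # 3.0
```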
all-hands-bot
left a comment
🟡 Taste Rating: Acceptable
Core logic solves a real bug, but violates parameter semantics and bundles unrelated changes.
VERDICT: ✅ Worth merging after addressing the critical parameter inconsistency
KEY INSIGHT: You fixed the symptom (shared metrics) but ignored the deep parameter, creating an API that lies about what it does—document this choice or respect the parameter.
See inline comments for specific issues. The metrics fix is sound, but the implementation needs clarity around parameter semantics and change scope.
Replace the `model_copy` override with a `__copy__` implementation for cleaner separation of concerns. Pydantic's `model_copy()` uses `__copy__` internally, so this achieves the same result without overriding a Pydantic method.

Changes:
- Implement `__copy__` to reset `_metrics` and `_telemetry` to `None`
- Change the `metrics` and `telemetry` properties to use lazy initialization
- Remove unused `Mapping` import

This approach is cleaner because:
1. `__copy__` is a standard Python protocol, not Pydantic-specific
2. Lazy initialization in the properties centralizes the creation logic
3. Less likely to break with future Pydantic updates

Co-authored-by: openhands <openhands@all-hands.dev>
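A plain-Python sketch of that shape (stand-in classes, not the SDK's actual implementation): `__copy__` nulls the private attribute, and the property recreates it lazily on first access.

```python
import copy

class Metrics:
    """Stand-in metrics object (hypothetical)."""
    def __init__(self):
        self.accumulated_cost = 0.0

class LLM:
    def __init__(self):
        self._metrics = None

    def __copy__(self):
        # Shallow-copy all state, then reset the private attribute so the
        # copy lazily creates its own Metrics instead of sharing one.
        new = type(self).__new__(type(self))
        new.__dict__.update(self.__dict__)
        new._metrics = None
        return new

    @property
    def metrics(self) -> Metrics:
        # Lazy initialization: create on first access if missing.
        if self._metrics is None:
            self._metrics = Metrics()
        return self._metrics

original = LLM()
original.metrics.accumulated_cost = 1.0
copied = copy.copy(original)
print(copied.metrics is original.metrics)  # False: independent metrics
```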
Could we use the `LLMRegistry`?
@OpenHands can you think about this #1997 (comment) and come up with a better solution for this?
I'm on it! xingyaoww can track my progress at all-hands.dev
This change addresses the review feedback to use `LLMRegistry` for ensuring metrics independence instead of overriding `__copy__` in the `LLM` class.

Changes:
- Remove the `__copy__` override from the `LLM` class
- Add a `reset_metrics()` method to `LLM` for resetting metrics/telemetry
- Add `_ensure_independent_metrics()` to `LLMRegistry` that detects and fixes shared metrics when LLMs are registered
- Update tests to verify the registry-based approach

The registry is the right place for this logic because:
1. It's the central point where LLMs are registered for a conversation
2. It has visibility into all registered LLMs
3. It can detect shared metrics by tracking metrics object IDs

This approach fixes the issue where `model_copy()` creates LLMs with shared metrics (due to Pydantic's shallow copy of private attributes), which caused metrics to be double-counted when both the original and copied LLM were used.

Fixes #418
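The registry-based detection described above can be sketched with stand-in classes (names mirror the PR's description; this is not the SDK's actual code): the registry tracks `id()` values of registered metrics objects and resets any LLM that arrives sharing one.

```python
import copy

class Metrics:
    """Stand-in metrics object (hypothetical)."""
    def __init__(self):
        self.accumulated_cost = 0.0

class LLM:
    def __init__(self, usage_id: str):
        self.usage_id = usage_id
        self._metrics = Metrics()

    @property
    def metrics(self) -> Metrics:
        return self._metrics

    def reset_metrics(self) -> None:
        self._metrics = Metrics()

class LLMRegistry:
    """Sketch of registry-side detection of shared Metrics objects."""
    def __init__(self):
        self._llms: dict[str, LLM] = {}
        self._metrics_ids: set[int] = set()

    def add(self, llm: LLM) -> None:
        # If this Metrics object is already tracked, another registered LLM
        # shares it (e.g., via model_copy), so give this one a fresh instance.
        if id(llm.metrics) in self._metrics_ids:
            llm.reset_metrics()
        self._metrics_ids.add(id(llm.metrics))
        self._llms[llm.usage_id] = llm

registry = LLMRegistry()
agent = LLM("agent")
condenser = copy.copy(agent)   # shallow copy: shares agent's Metrics
condenser.usage_id = "condenser"

registry.add(agent)
registry.add(condenser)        # shared metrics detected and reset
print(condenser.metrics is agent.metrics)  # False
```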
I've addressed @enyst's feedback to use the `LLMRegistry`.

Summary of Changes

Approach

Instead of modifying copy behavior at the LLM level, the fix is now implemented in the `LLMRegistry`.
Changes Made
Tests:
All tests pass (586 passed in the LLM test suite).
Looks like there are a few issues preventing this PR from being merged!
If you'd like me to help, just leave a comment. Feel free to include any additional details that might help me get this PR into a better state.
all-hands-bot
left a comment
🟡 Acceptable - Pragmatic fix for real bug
Context: All previous review threads are outdated - they commented on an old implementation that overrode model_copy(). This new approach is fundamentally different and better: lazy initialization + registry-based detection.
Why this works: Instead of fighting Pydantic's shallow-copy behavior, you work with it. Metrics are created on-demand, and the registry detects/fixes shared instances. Clean delegation of responsibility.
Review Feedback
🟡 Lazy Initialization Trade-off (llm.py, metrics/telemetry properties)
Before: Assertion failed immediately if metrics wasn't initialized (fail-fast)
After: Silently creates new metrics (hides potential bugs)
This is intentional for your use case (reset_metrics + lazy recreation), but worth noting: you're trading fail-fast for convenience. If someone expects metrics to be pre-initialized and finds None, they now get a fresh object instead of an error.
Not a blocker - the approach works for your design.
🟡 Unbounded Set Growth (llm_registry.py, _metrics_ids)
_metrics_ids grows forever. Each registered LLM adds an ID that's never removed, even if the LLM is later unregistered.
Impact: Probably negligible (registries are typically small), but technically a memory leak.
Fix if needed: Track an LLM-to-metrics-id mapping in a dict instead of a set, and clean up entries on removal. But honestly, I'd ship this as-is unless you expect hundreds of registrations.
🟢 Side Effect in add() (llm_registry.py)
add() now modifies the LLM being added (via reset_metrics()). Registries typically don't mutate their inputs.
But: You documented this clearly in the docstring, and it solves a real problem. Pragmatism wins over purity here. Just noting it for future maintainers.
🟢 Good: Tests Prove the Fix (test_llm_registry.py)
Your tests actually exercise the real scenario:
- Create the original LLM
- Copy it with `model_copy()`
- Register both
- Verify metrics are independent
This is exactly what was needed. Well done.
Verdict
✅ Approve - Fixes issue #418 pragmatically without over-engineering. The suggestions above are minor and don't block merging.
Key insight: When you can't change library behavior (Pydantic shallow copy), design around it rather than fighting it.
@OpenHands can you review the diff against main and understand the purpose of this PR, then help me reproduce the original issue. I believe this PR contains a test case that will reproduce the original issue. Can you first git reset to the original commit where no fix is applied and make sure the test fails, then reset to the latest state of this PR and rerun the same test? Make sure that this PR actually fixes the original issue.
I'm on it! xingyaoww can track my progress at all-hands.dev |
Summary: PR #1997 Verification Complete

I successfully verified that this PR fixes issue #418 (condensation breaks metrics accumulation) by reproducing the original issue and confirming the fix works.

What I Did
Conclusion

This PR correctly fixes the original issue. When
all-hands-bot
left a comment
🟡 Taste Rating: Acceptable - But Incomplete Fix
This fixes the reported bug (#418) in the specific scenario (agent + condenser using registry), but the solution is a pragmatic workaround rather than a root cause fix.
Critical Issues
🟠 Incomplete Fix - Registry-Dependent Solution
File: openhands-sdk/openhands/sdk/llm/llm.py (lazy metrics initialization)
The core problem: model_copy() still creates shared metrics. You've added a registry-layer patch to detect and fix it, but this means:
```python
# THIS STILL BREAKS (shared metrics):
original = LLM(model="gpt-4", api_key=SecretStr("test"))
copied = original.model_copy(update={"usage_id": "copied"})
original.metrics.add_cost(1.0)
print(copied.metrics.accumulated_cost)  # 1.0 - STILL SHARED!
```

Anyone using `model_copy()` outside the registry still gets shared metrics. The bug lives at the LLM level, not the registry level.
Simpler alternative:
```python
def model_copy(self, *, update=None, deep=False):
    copied = super().model_copy(update=update, deep=deep)
    copied._metrics = None
    copied._telemetry = None
    return copied
```

This is 4 lines and fixes the bug at the source for all cases, not just registry usage.
🟠 Registry Side Effects - Surprising Behavior
File: openhands-sdk/openhands/sdk/llm/llm_registry.py (lines 80-105, _ensure_independent_metrics)
registry.add() now mutates the LLM by calling reset_metrics(). This violates the principle of least surprise:
```python
llm = create_my_llm()
llm.metrics.add_cost(5.0)  # Track some usage
registry.add(llm)          # SURPRISE! Your metrics just got reset
```

A registry's `add()` method should register the object, not modify it. This is a leaky abstraction - the registry is now responsible for fixing copy-related bugs in `LLM`.
Improvement Opportunities
🟡 Lazy Init Removes Safety Guarantees
File: openhands-sdk/openhands/sdk/llm/llm.py (lines 507-514)
Changed from assertions (fail-fast) to lazy initialization (create-on-demand):
```python
# Old: Explicit failure if not initialized
assert self._metrics is not None, "Metrics should be initialized"

# New: Silently create if missing
if self._metrics is None:
    self._metrics = Metrics(...)
```

This hides bugs. If there's a code path where metrics aren't properly initialized, you'll never know - they'll just be created lazily with potentially incorrect state.
🟡 Memory Leak Potential
File: openhands-sdk/openhands/sdk/llm/llm_registry.py (line 50)
```python
self._metrics_ids: set[int] = set()
```

This set grows forever. In a long-lived registry:
- Add 1000 LLMs → track 1000 IDs
- LLMs get garbage collected → IDs remain in the set
- No cleanup mechanism exists
Not critical (IDs are small ints), but it's sloppy.
🟡 Missing Test for Failure Case
File: tests/sdk/llm/test_llm_registry.py
Your tests verify the registry fixes shared metrics, but don't test that the bug still exists outside the registry. Consider adding:
```python
def test_model_copy_without_registry_still_shares_metrics():
    """Document that the underlying bug exists if registry isn't used."""
    from pydantic import SecretStr

    original = LLM(model="gpt-4o", api_key=SecretStr("test-key"))
    copied = original.model_copy(update={"usage_id": "copied"})

    # Without registry, metrics are still shared (the incomplete fix)
    assert original.metrics is copied.metrics
    original.metrics.add_cost(1.0)
    assert copied.metrics.accumulated_cost == 1.0  # Bug still exists
```

If you're shipping an incomplete fix, at least document the limitation explicitly.
Verdict
🟡 Worth merging - Fixes the reported bug in practice, but the design is compromised
KEY INSIGHT: You've built a registry-layer detection system to patch a problem that could be fixed in 4 lines at the LLM level with a model_copy() override. The tests pass and it works for the agent+condenser use case, but you've traded simplicity for avoiding a straightforward model_copy() override.
Ugh, OH fights with itself, and the last comment sounds quite bad, if it's really the case. 😓 Please let me take a step back here. I feel I have a question:
If so, the way forward here is to load the default profile for 'different_usage_id' as long as the user hasn't specified another. That has to pass through the LLMRegistry, it seems to me. And that won't be a model_copy, just initialize a new LLM from some profile. 🤔
enyst
left a comment
IMHO this is actually better than overriding model_copy for a situation that is a bit of a temporary shortcut in the first place...
Unless you hate it, I'd go with it, and refactor later - this way, only llm and registry will need changes, but they do anyway; and we're not fighting Pydantic.
@OpenHands, please clone OpenHands/Openhands-CLI and OpenHands/OpenHands, and search for all occurrences of model_copy for the LLM class (including in the current repo) -- Tell me, how will this PR impact that code? Will it introduce some weird behaviors?
I'm on it! xingyaoww can track my progress at all-hands.dev |
Analysis: Impact of PR #1997 on `model_copy()`
Summary
This PR fixes issue #418 where condensation breaks the metrics accumulation.
Root Cause
When `LLM.model_copy()` was called (e.g., to create a condenser LLM from an agent LLM), the `_metrics` and `_telemetry` private attributes were shared between the original and copied LLM instances. This is because Pydantic's `model_copy()` does a shallow copy of private attributes by default.

This caused metrics to be double-counted when both LLMs were used, as they were updating the same `Metrics` object. The visualizer would then display incorrect metrics in the "Condensation" cell block.

Fix

This PR overrides `model_copy()` in the `LLM` class to create fresh `Metrics` and `Telemetry` instances for the copied LLM, ensuring each LLM tracks its own usage independently.

Testing
Added 3 new tests to verify:
- `test_llm_model_copy_creates_independent_metrics` - Verifies that copied LLMs have separate metrics objects
- `test_llm_model_copy_creates_independent_telemetry` - Verifies that copied LLMs have separate telemetry objects
- `test_llm_model_copy_preserves_configuration` - Verifies that the LLM configuration is preserved when copying

Fixes #418
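The first of those tests can be sketched with stand-in classes (the real tests exercise the SDK's `LLM`; the `model_copy()` here is a hypothetical stand-in for the overridden version):

```python
import copy

class Metrics:
    """Stand-in metrics object (hypothetical)."""
    def __init__(self):
        self.accumulated_cost = 0.0

class LLM:
    def __init__(self):
        self._metrics = Metrics()

    def model_copy(self):
        # Stand-in for the overridden model_copy(): the copy gets fresh Metrics.
        new = copy.copy(self)
        new._metrics = Metrics()
        return new

    @property
    def metrics(self) -> Metrics:
        return self._metrics

def test_llm_model_copy_creates_independent_metrics():
    original = LLM()
    copied = original.model_copy()
    assert copied.metrics is not original.metrics
    original.metrics.accumulated_cost += 1.0
    assert copied.metrics.accumulated_cost == 0.0

test_llm_model_copy_creates_independent_metrics()
```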
Agent Server images for this PR
• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server
Variants & Base Images
- `eclipse-temurin:17-jdk`
- `nikolaik/python-nodejs:python3.12-nodejs22`
- `golang:1.21-bookworm`

Pull (multi-arch manifest)
```shell
# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:b05e9e0-python
```

Run
All tags pushed for this build
About Multi-Architecture Support
- The default tag (`b05e9e0-python`) is a multi-arch manifest supporting both amd64 and arm64
- Architecture-specific tags (e.g., `b05e9e0-python-amd64`) are also available if needed