feat: add message trace support for LLM generation #272
Conversation
Add support for capturing full conversation traces during LLM generation, enabling debugging and fine-tuning dataset creation.

Changes:
- Add `with_trace` field to `LLMTextColumnConfig` for per-column trace control
- Add `debug_override_save_all_column_traces` to `RunConfig` for global trace enablement
- Introduce `ChatMessage` dataclass for structured message representation
- Update `ModelFacade.generate()` to return the full message trace
- Rename trace column postfix from `__reasoning_trace` to `__trace`
- Add comprehensive traces documentation

Traces capture system/user/assistant messages in order, giving visibility into the full generation conversation, including correction retries.
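For orientation, a minimal sketch of how a column might opt into traces. Only `with_trace`, `debug_override_save_all_column_traces`, and the `__trace` column postfix come from this PR; the import path and the other constructor arguments are illustrative assumptions:

```python
# Hypothetical import path; field names other than with_trace and
# debug_override_save_all_column_traces are assumptions for illustration.
from data_designer.config import LLMTextColumnConfig, RunConfig

# Per-column opt-in: the output dataset gains a "summary__trace" column
# holding the ordered system/user/assistant messages for each row.
column = LLMTextColumnConfig(
    name="summary",
    prompt="Summarize: {{ document }}",
    with_trace=True,
)

# Global debug override: save traces for every LLM column,
# regardless of each column's with_trace setting.
run = RunConfig(debug_override_save_all_column_traces=True)
```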
Greptile Overview

Greptile Summary

This PR adds comprehensive message trace support for LLM generation, enabling users to capture the full conversation history (system/user/assistant messages with reasoning content) during generation. This is valuable for debugging, prompt iteration, and understanding model behavior.

Key Changes
Critical Issue Found

Logic bug reported in the `curr_num_correction_steps` handling in `facade.py` (see the review comment and rebuttal below).

Additional Notes
Confidence Score: 2/5
Important Files Changed
Sequence Diagram

```mermaid
sequenceDiagram
participant User
participant LLMGenerator as LLM Column Generator
participant Facade as ModelFacade
participant LiteLLM as LiteLLM Router
participant Dataset
User->>LLMGenerator: generate(data)
LLMGenerator->>LLMGenerator: render prompt template
LLMGenerator->>Facade: generate(prompt, parser, system_prompt)
Note over Facade: Build initial messages
Facade->>Facade: prompt_to_messages()
Facade->>Facade: Create ChatMessage objects
loop Until valid or max retries
Facade->>Facade: completion(messages)
Facade->>LiteLLM: completion(model, message_dicts)
LiteLLM-->>Facade: ModelResponse
Facade->>Facade: Extract content + reasoning_content
Facade->>Facade: Append ChatMessage.as_assistant()
alt Parser succeeds
Facade->>Facade: parser(response)
Facade-->>LLMGenerator: (parsed_obj, trace)
else Parser fails
Facade->>Facade: Check retry limits
alt Can retry
Facade->>Facade: Append ChatMessage.as_user(error)
Note over Facade: Continue loop for correction
else Max retries reached
Facade-->>LLMGenerator: Raise GenerationValidationFailureError
end
end
end
alt with_trace enabled or debug override
LLMGenerator->>LLMGenerator: Serialize trace to dicts
LLMGenerator->>Dataset: Store {name}__trace column
end
LLMGenerator->>Dataset: Store {name} column
LLMGenerator-->>User: data with generated columns
```
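The "Serialize trace to dicts" step near the bottom of the diagram could look roughly like this sketch. Using `dataclasses.asdict` as the serializer is an assumption; the `{name}__trace` column name comes from the PR:

```python
from dataclasses import asdict

def serialize_trace(trace: list["ChatMessage"]) -> list[dict]:
    """Convert a list[ChatMessage] trace into plain dicts so it can be
    stored alongside the generated value in the "{name}__trace" column."""
    return [asdict(message) for message in trace]
```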
Claude is happy with the extraction @nabinchha 👇

Review: feat/traces Branch vs feat/mcp

Summary of Extraction

The `feat/traces` branch extracts the message-trace functionality from `feat/mcp` into a standalone PR, leaving the MCP/tool-calling work behind.
What feat/traces Includes

1. ChatMessage Dataclass (utils.py)

New `@dataclass` (imports shown here for completeness):

```python
from dataclasses import dataclass, field
from typing import Any, Literal


@dataclass
class ChatMessage:
    role: Literal["user", "assistant", "system", "tool"]
    content: str | list[dict[str, Any]] = ""
    reasoning_content: str | None = None
    tool_calls: list[dict[str, Any]] = field(default_factory=list)
    tool_call_id: str | None = None
```

Includes factory methods: `as_user`, `as_assistant`, `as_system`, `as_tool`.
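As a quick usage sketch: the `as_assistant` keyword arguments appear in the diff below, but the `as_system`/`as_user` signatures (content as the first positional argument) are assumptions:

```python
messages = [
    ChatMessage.as_system("You are a helpful assistant."),
    ChatMessage.as_user("Summarize this document in one paragraph."),
]

# Assistant turns can carry optional reasoning alongside the visible content.
messages.append(
    ChatMessage.as_assistant(content="Here is the summary...", reasoning_content="The document covers...")
)
```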
2. Updated ModelFacade.generate() (facade.py)

3. Column Config Changes (column_configs.py)
4. RunConfig Addition (run_config.py)
5. Generator Updates (llm_completion.py)
6. Documentation (traces.md)

New comprehensive trace documentation.
What Remains in feat/mcp (Not in feat/traces)

- MCP Infrastructure
- CLI Extensions
- ModelFacade Tool Calling
- Column Config Extensions
- Extended Documentation
Compatibility Analysis

feat/traces is Forward-Compatible

The trace-related changes in `feat/traces` are a subset of what `feat/mcp` builds on. This means `feat/mcp` can be rebased onto `feat/traces` after merge, with only the overlapping files needing reconciliation.

Differences to Reconcile After Merge
Review Findings

Correctness
Potential Issues to Verify
Documentation Quality
@nabinchha – all tutorial notebooks and example recipes ran successfully
```python
reasoning_trace = reasoning_trace.strip()

messages.append(ChatMessage.as_assistant(content=response, reasoning_content=reasoning_trace or None))
curr_num_correction_steps += 1
```
`curr_num_correction_steps` is incremented unconditionally before attempting parsing, causing an off-by-one error. On the first attempt, this becomes 1 immediately, so when `max_correction_steps=0`, the condition on line 222 (`curr_num_correction_steps <= max_correction_steps`) evaluates to `1 <= 0` = False, incorrectly skipping correction.
Move the increment after the try/except block, only incrementing when retrying:
```diff
- curr_num_correction_steps += 1
  messages.append(ChatMessage.as_assistant(content=response, reasoning_content=reasoning_trace or None))
```
Your boss disagrees:
I'll search for the code mentioned in this feedback to evaluate its accuracy.
[2 tools called]
Let me also look for any tests related to this function to understand the intended behavior.
[2 tools called]
Based on my analysis, the feedback is incorrect. The current implementation is correct and the tests confirm this.
Analysis
Let me trace through the current code behavior:
Current code logic (lines 198-235 in facade.py):
```python
curr_num_correction_steps = 0
while True:
    completion_response = self.completion(...)  # Make LLM call
    curr_num_correction_steps += 1              # Incremented BEFORE parsing
    try:
        output_obj = parser(response)
        break
    except ParserException as exc:
        if curr_num_correction_steps <= max_correction_steps:
            # Attempt correction
            ...
```

Test expectations from test_facade.py (lines 66-75):
```python
@pytest.mark.parametrize(
    "max_correction_steps,max_conversation_restarts,total_calls",
    [
        (0, 0, 1),  # 0 corrections, 0 restarts → 1 call
        (5, 0, 6),  # 5 corrections, 0 restarts → 6 calls (initial + 5)
        (0, 5, 6),  # 0 corrections, 5 restarts → 6 calls (1 per restart)
    ],
)
```

Tracing `max_correction_steps=5`:
- Call 1: `curr=1`, fails, `1 <= 5` → correct
- Call 2: `curr=2`, fails, `2 <= 5` → correct
- ...
- Call 5: `curr=5`, fails, `5 <= 5` → correct
- Call 6: `curr=6`, fails, `6 <= 5` → stop (raises error)
- Total: 6 calls (matches test ✓)
Tracing `max_correction_steps=0`:

- Call 1: `curr=1`, fails, `1 <= 0` → skip correction (go to restart or fail)
- Total: 1 call per restart window (matches test ✓)
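To make the counting concrete, here is a standalone simulation of the quoted loop with a parser that always fails. The restart handling is an assumption based on the test semantics, not the exact `facade.py` code:

```python
def total_calls(max_correction_steps: int, max_conversation_restarts: int) -> int:
    """Count LLM calls made with an always-failing parser under the
    increment-before-check logic quoted above."""
    calls = 0
    for _restart in range(max_conversation_restarts + 1):  # initial window + restarts
        curr_num_correction_steps = 0
        while True:
            calls += 1                       # one completion call per iteration
            curr_num_correction_steps += 1   # incremented BEFORE parsing
            # parsing always fails in this worst case
            if curr_num_correction_steps <= max_correction_steps:
                continue                     # correction retry within this window
            break                            # budget exhausted: restart or raise
    return calls

assert total_calls(0, 0) == 1  # matches (0, 0, 1)
assert total_calls(5, 0) == 6  # matches (5, 0, 6)
assert total_calls(0, 5) == 6  # matches (0, 5, 6)
```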
Why the feedback is wrong
The feedback claims "incorrectly skipping correction" when `max_correction_steps=0`. However:

- `max_correction_steps=0` semantically means "no corrections allowed"; skipping is the CORRECT behavior
- The tests explicitly verify this: `(0, 5, 6)` expects 6 total calls with 0 corrections and 5 restarts (1 call per window, no corrections)
- `max_correction_steps=N` means N corrections after the initial attempt; with N=5, we get 6 total calls (initial + 5 corrections)
The variable `curr_num_correction_steps` tracks the number of attempts in the current conversation window, and the check `curr_num_correction_steps <= max_correction_steps` correctly allows exactly `max_correction_steps` additional attempts after the initial one.
Conclusion
The current implementation is correct - there is no off-by-one error. The feedback author appears to have missed that `max_correction_steps=0` should disable corrections entirely, which is exactly what the current code does.
Resolve conflicts between feat/traces PR (#272) and feat/mcp:

- column_configs.py: Keep both `with_trace` and `tool_alias` fields
- run_config.py: Include tool messages in trace description
- facade.py: Keep MCP tool calling loop with trace support
- traces.md: Include tool use examples alongside basic traces
- columns.md: Document both trace and `tool_alias` features
📋 Summary
Add message trace support for LLM generation, allowing users to capture the full conversation history (system/user/assistant messages) during generation. This is essential for debugging, understanding model behavior, and iterating on prompts. Changes pulled from feat/mcp.
🔄 Changes
✨ Added
- `docs/concepts/traces.md` documentation explaining trace functionality
- `ChatMessage` dataclass in `engine/models/utils.py` for representing conversation messages, with factory methods (`as_user`, `as_assistant`, `as_system`, `as_tool`)
- `with_trace` field on `LLMTextColumnConfig` to enable per-column traces
- `debug_override_save_all_column_traces` field on `RunConfig` for global trace enablement
🔧 Changed

- `REASONING_TRACE_COLUMN_POSTFIX` → `TRACE_COLUMN_POSTFIX` (`__reasoning_trace` → `__trace`)
- `ModelFacade.generate()` now returns `tuple[Any, list[ChatMessage]]` instead of `tuple[Any, str | None]` to provide full conversation history
- `ModelFacade.completion()` now accepts `list[ChatMessage]` instead of `list[dict]`
- `llm_completion.py` generator updated to save traces when enabled via column config or run config
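A consumer-side sketch of the new `generate()` return shape, as described below. The argument names follow the sequence diagram above; everything else is illustrative:

```python
# Before this PR: (parsed_obj, reasoning_trace: str | None)
# After this PR:  (parsed_obj, trace: list[ChatMessage])
parsed_obj, trace = facade.generate(prompt, parser=parser, system_prompt=system_prompt)

# The trace is the full ordered conversation, including any
# correction-retry user messages appended during validation.
for message in trace:
    print(f"[{message.role}] {message.content}")
```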
🔍 Attention Areas

- `facade.py` - Signature change: `generate()` now returns the full message trace instead of just reasoning content
- `utils.py` - New `ChatMessage` dataclass that replaces dict-based message handling

🤖 Generated with AI