
fix(helpers): correct LLM-output parsing (post-v0.0.1) #5

Merged
will-fawcett-trillium merged 3 commits into main from fix/helpers-llm-parsing on May 1, 2026

fix(helpers): correct LLM-output parsing (post-v0.0.1)#5
will-fawcett-trillium merged 3 commits intomainfrom
fix/helpers-llm-parsing

Conversation

@will-fawcett
Collaborator

@will-fawcett will-fawcett commented Apr 28, 2026

Summary

Fixes four real correctness bugs in `src/eve_mcp/server/helpers.py` surfaced by the post-v0.0.1 code review, all concentrated in the LLM-output-parsing layer used by `extract_factuality_issues` and `assess_factuality_issue`. Smoke tests against EVE didn't catch these because the existing tests exercised only the happy path with synthetic, well-formed inputs.

1. `extract_tag` was greedy — captured across fences

```python
pattern = rf"```{tag}(.*)```" # greedy
```

With input containing two ```json fences, the regex captured everything from the first opening fence to the last closing fence — yielding a string with intervening prose and other fenced blocks embedded.

Fix: `(.*?)` (non-greedy). Regression test added.
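A minimal sketch of the fixed helper (the real `extract_tag` lives in `src/eve_mcp/server/helpers.py`; this stand-in assumes a `re.DOTALL` search and strips surrounding whitespace):

```python
import re

def extract_tag(text: str, tag: str) -> str:
    # Non-greedy (.*?) stops at the first closing fence instead of
    # spanning to the last one.
    match = re.search(rf"```{tag}(.*?)```", text, re.DOTALL)
    if match is None:
        raise ValueError(f"no ```{tag} fence found")
    return match.group(1).strip()

two_fences = '```json\n{"a": 1}\n```\nintervening prose\n```json\n{"b": 2}\n```'
print(extract_tag(two_fences, "json"))  # first block only: {"a": 1}
```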

2. `_assess_factuality_issue` answer cleanup left tags in the answer

The original code:
```python
code_recommendations = extract_xml_tag(r["answer"], "CODE_RECOMMENDATIONS")
if "```json" in code_recommendations:
    # Rebinds the name to the unwrapped inner content...
    code_recommendations = extract_tag(code_recommendations, "json")
# ...so this replace removes only the inner content, not the full block.
r["answer"] = r["answer"].replace(code_recommendations, "")
```

When the recommendations contained a ```json fence, `code_recommendations` was rebound to the unwrapped inner content. The `.replace()` then matched only that inner content as a substring inside the original fence, leaving the wrapping `<CODE_RECOMMENDATIONS>`, the ```json fence itself, and the closing tag in `answer`.

Fix: capture the original (pre-unwrap) inner text, splice the full `<CODE_RECOMMENDATIONS>...</CODE_RECOMMENDATIONS>` block out of the answer, then unwrap the inner fence for the structured `code_recommendations` field.
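A sketch of the splice-then-unwrap shape of the fix; `strip_block_and_unwrap` is an illustrative name, not the real signature inside `_assess_factuality_issue`:

```python
import re

def strip_block_and_unwrap(answer: str) -> tuple[str, str]:
    """Remove the full <CODE_RECOMMENDATIONS> block from the answer first,
    then unwrap any inner ```json fence for the structured field."""
    block = re.search(
        r"<CODE_RECOMMENDATIONS>(.*?)</CODE_RECOMMENDATIONS>", answer, re.DOTALL
    )
    if block is None:
        return answer, ""
    # Splice the whole tagged block (tags and fence included) out first.
    cleaned = answer.replace(block.group(0), "").strip()
    inner = block.group(1).strip()
    fence = re.search(r"```json(.*?)```", inner, re.DOTALL)
    if fence is not None:
        inner = fence.group(1).strip()
    return cleaned, inner

answer = (
    'Looks fine.\n<CODE_RECOMMENDATIONS>\n```json\n{"fix": "x"}\n```\n'
    '</CODE_RECOMMENDATIONS>'
)
cleaned, recs = strip_block_and_unwrap(answer)
```

Doing the splice before the unwrap is what keeps the wrapping tags and fence out of the cleaned answer.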

3. `_assess_factuality_issue` raised when the LLM omitted recommendations

The prompt explicitly says "If you have recommendations to fix or update the code, add a JSON object inside an xml tag…" — recommendations are optional. But `extract_xml_tag` was called unconditionally, so a successful assessment with no recommendations surfaced as a tool error to the MCP host. Now caught and returns `code_recommendations: ""`.
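The shape of the fix, with `NoTagFoundError` and `extract_xml_tag` (the project's helpers, per the PR) sketched as minimal stand-ins:

```python
import re

# Stand-ins for the project's helpers; names from the PR, bodies assumed.
class NoTagFoundError(Exception):
    pass

def extract_xml_tag(text: str, tag: str) -> str:
    match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
    if match is None:
        raise NoTagFoundError(tag)
    return match.group(1).strip()

def recommendations_or_empty(answer: str) -> str:
    try:
        return extract_xml_tag(answer, "CODE_RECOMMENDATIONS")
    except NoTagFoundError:
        # Recommendations are optional per the prompt; absence is not an error.
        return ""
```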

4. Prompt typos + invalid example JSON

Both prompts had typos ("asessment", "udpate", "attemps", "You task"). The assess prompt also showed the model an example schema using single-quoted "JSON" with no commas between objects — invalid JSON. LLMs that follow the example faithfully produced strings the downstream JSON parser would reject. Rewritten to be valid double-quoted JSON. Similar fixes to the extract prompt (implicit string concatenation, missing array commas).
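The exact prompt schema isn't reproduced here, but the failure mode is easy to demonstrate: Python's `json` module rejects single-quoted, comma-less object text outright (field names below are illustrative):

```python
import json

# Shaped like the old example schema: single quotes, no comma between pairs.
bad = "{'issue': 'typo' 'severity': 'low'}"
try:
    json.loads(bad)
    raise AssertionError("expected a parse failure")
except json.JSONDecodeError:
    pass  # a model copying this shape yields unparseable output

# The rewritten example shows valid double-quoted JSON instead.
good = '{"issue": "typo", "severity": "low"}'
assert json.loads(good) == {"issue": "typo", "severity": "low"}
```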

Tests

  • `TestExtractTag` — added `test_does_not_span_multiple_fences` (regression for bug 1).
  • `TestAssessFactualityIssue` — rewrote assertions: the existing `test_strips_recommendations_block_from_answer` was pinning the broken behavior (`assert "<CODE_RECOMMENDATIONS>" in result["answer"]`), so it's been flipped to assert the cleaned shape. Added `test_returns_empty_recommendations_when_tag_absent` (bug 3) and `test_keeps_malformed_inner_fence_as_is` (defensive).

Coverage stays at 100%.

Test plan

  • `poetry run pre-commit run --all-files`
  • CI matrix
  • Manual smoke: `assess_factuality_issue` with a real GEE script after v0.0.1 lands

Base automatically changed from tests/helpers-coverage to main April 28, 2026 13:00
@will-fawcett-trillium marked this pull request as ready for review April 28, 2026 13:21
Four related bugs in the prompt-and-parse layer surfaced by the
post-v0.0.1 deep review:

1. extract_tag used a greedy `(.*)` which captured across multiple
   ``` fences, returning a string spanning intervening prose. Switch
   to non-greedy `(.*?)` so it stops at the first closing fence.

2. _assess_factuality_issue cleaned up the answer by string-replacing
   `code_recommendations` after potentially unwrapping its inner
   ```json fence, which left the wrapping <CODE_RECOMMENDATIONS> tags
   (and the fence) behind in the answer body. Refactor to splice the
   full <CODE_RECOMMENDATIONS>...</CODE_RECOMMENDATIONS> block out of
   the answer first, then unwrap the inner fence for the structured
   output field.

3. _assess_factuality_issue raised NoTagFoundError when the LLM
   omitted the <CODE_RECOMMENDATIONS> tag, even though the prompt
   explicitly says recommendations are optional. Catch that case and
   return an empty `code_recommendations` field.

4. Both prompts had typos ("asessment", "udpate", "attemps", "You
   task") and the assess prompt's example JSON used single quotes
   plus comma-less object separators, so the model was being shown
   invalid JSON as the schema. Fix typos and rewrite the example to
   be valid double-quoted JSON. Also fix the extract prompt's
   implicit string concatenation and missing array commas.

Tests updated:
- TestAssessFactualityIssue: assertions now check that the entire
  <CODE_RECOMMENDATIONS> block (including tags) is removed from
  `answer`. Added cases for the no-recs path and the malformed
  inner-fence path.
- TestExtractTag: added a regression test for the greedy-match bug
  using a string with two ```json fences.

Coverage stays at 100%.
@will-fawcett force-pushed the fix/helpers-llm-parsing branch from 217c614 to f4811a9 April 28, 2026 13:31
Collaborator

@r-spiewak left a comment
I didn't test it, but there are some minor additions to your tests.

Comment thread: tests/test_server/test_helpers.py
@will-fawcett-trillium merged commit f47a1b2 into main May 1, 2026
4 checks passed
@will-fawcett-trillium deleted the fix/helpers-llm-parsing branch May 1, 2026 14:33