fix(helpers): correct LLM-output parsing (post-v0.0.1) #5
Merged
will-fawcett-trillium merged 3 commits into main on May 1, 2026
Conversation
Four related bugs in the prompt-and-parse layer surfaced by the
post-v0.0.1 deep review:
1. extract_tag used a greedy `(.*)` which captured across multiple
``` fences, returning a string spanning intervening prose. Switch
to non-greedy `(.*?)` so it stops at the first closing fence.
2. _assess_factuality_issue cleaned up the answer by string-replacing
`code_recommendations` after potentially unwrapping its inner
```json fence, which left the wrapping <CODE_RECOMMENDATIONS> tags
(and the fence) behind in the answer body. Refactor to splice the
full <CODE_RECOMMENDATIONS>...</CODE_RECOMMENDATIONS> block out of
the answer first, then unwrap the inner fence for the structured
output field.
3. _assess_factuality_issue raised NoTagFoundError when the LLM
omitted the <CODE_RECOMMENDATIONS> tag, even though the prompt
explicitly says recommendations are optional. Catch that case and
return an empty `code_recommendations` field.
4. Both prompts had typos ("asessment", "udpate", "attemps", "You
task") and the assess prompt's example JSON used single quotes
plus comma-less object separators, so the model was being shown
invalid JSON as the schema. Fix typos and rewrite the example to
be valid double-quoted JSON. Also fix the extract prompt's
implicit string concatenation and missing array commas.
Tests updated:
- TestAssessFactualityIssue: assertions now check that the entire
<CODE_RECOMMENDATIONS> block (including tags) is removed from
`answer`. Added cases for the no-recs path and the malformed
inner-fence path.
- TestExtractTag: added a regression test for the greedy-match bug
using a string with two ```json fences.
Coverage stays at 100%.
217c614 to f4811a9
r-spiewak requested changes on Apr 30, 2026
r-spiewak (Collaborator) left a comment:
I didn't test it, but there are some minor additions to your tests.
r-spiewak approved these changes on May 1, 2026
Summary
Fixes four real correctness bugs in `src/eve_mcp/server/helpers.py` surfaced by the post-v0.0.1 code review. All four are concentrated in the LLM-output-parsing layer used by `extract_factuality_issues` and `assess_factuality_issue`. Smoke tests against EVE didn't catch these because the existing tests only exercised the happy path with synthetic, well-formed inputs.
1. `extract_tag` was greedy — captured across fences
```python
pattern = rf"```{tag}(.*)```" # greedy
```
With input containing two ```json fences, the regex captured everything from the first opening fence to the last closing fence — yielding a string with intervening prose and other fenced blocks embedded.
Fix: `(.*?)` (non-greedy). Regression test added.
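Roughly what the corrected helper looks like. This is a sketch only; the real signature, stripping behaviour, and error type live in `helpers.py`:
```python
import re

class NoTagFoundError(ValueError):
    """Raised when no matching fenced block is found (name mirrors the PR's error)."""

FENCE = "`" * 3  # written this way so the example itself stays renderable

def extract_tag(text: str, tag: str) -> str:
    """Return the contents of the first fenced block labelled `tag`."""
    # Non-greedy (.*?) stops at the first closing fence instead of the last one.
    match = re.search(rf"{FENCE}{tag}(.*?){FENCE}", text, flags=re.DOTALL)
    if match is None:
        raise NoTagFoundError(f"no fenced {tag} block found")
    return match.group(1).strip()
```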
2. `_assess_factuality_issue` answer cleanup left tags in the answer
The original code:
```python
code_recommendations = extract_xml_tag(r["answer"], "CODE_RECOMMENDATIONS")
if "```json" in code_recommendations:
    code_recommendations = extract_tag(code_recommendations, "json")
r["answer"] = r["answer"].replace(code_recommendations, "")
```
When the recommendations contained a ```json fence, `code_recommendations` was rebound to the unwrapped inner content. The `.replace()` then matched only that inner content as a substring inside the original fence, leaving the wrapping `<CODE_RECOMMENDATIONS>`, the ```json fence itself, and the closing tag in `answer`.
Fix: capture the original (pre-unwrap) inner text, splice the full `<CODE_RECOMMENDATIONS>...</CODE_RECOMMENDATIONS>` block out of the answer, then unwrap the inner fence for the structured `code_recommendations` field.
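A minimal sketch of the refactored cleanup, assuming `extract_xml_tag` returns the tag's inner text verbatim and the splice is done by replacing the full tagged block (the exact implementation is in `helpers.py`):
```python
inner = extract_xml_tag(r["answer"], "CODE_RECOMMENDATIONS")

# Splice the whole tagged block (tags, fence, and contents) out of the answer first.
full_block = f"<CODE_RECOMMENDATIONS>{inner}</CODE_RECOMMENDATIONS>"
r["answer"] = r["answer"].replace(full_block, "").strip()

# Only then unwrap the inner fenced JSON for the structured output field.
json_fence = "`" * 3 + "json"  # spelled out so this example stays renderable
code_recommendations = inner if json_fence not in inner else extract_tag(inner, "json")
```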
3. `_assess_factuality_issue` raised when the LLM omitted recommendations
The prompt explicitly says "If you have recommendations to fix or update the code, add a JSON object inside an xml tag…" — recommendations are optional. But `extract_xml_tag` was called unconditionally, so a successful assessment with no recommendations surfaced as a tool error to the MCP host. Now caught and returns `code_recommendations: ""`.
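The guard is roughly the following (a sketch; `NoTagFoundError` is the error named in the commit message):
```python
try:
    inner = extract_xml_tag(r["answer"], "CODE_RECOMMENDATIONS")
except NoTagFoundError:
    # Recommendations are optional per the prompt, so a missing tag is not an error.
    r["code_recommendations"] = ""
else:
    ...  # splice the block out of the answer and unwrap the inner fence as above
```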
4. Prompt typos + invalid example JSON
Both prompts had typos ("asessment", "udpate", "attemps", "You task"). The assess prompt also showed the model an example schema using single-quoted "JSON" with no commas between objects — invalid JSON. LLMs that follow the example faithfully produced strings the downstream JSON parser would reject. Rewritten to be valid double-quoted JSON. Similar fixes to the extract prompt (implicit string concatenation, missing array commas).
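For illustration only (the field names below are hypothetical, not the actual prompt schema), the difference is roughly:
```python
import json

# The kind of "JSON" the old example showed: single quotes, no commas between objects.
bad_example = "[{'file': 'a.py' 'fix': '...'} {'file': 'b.py' 'fix': '...'}]"

# The kind of example the rewritten prompt shows: valid, double-quoted JSON.
good_example = '[{"file": "a.py", "fix": "..."}, {"file": "b.py", "fix": "..."}]'

json.loads(good_example)    # parses cleanly
# json.loads(bad_example)   # raises json.JSONDecodeError
```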
Tests
Coverage stays at 100%.
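A sketch of the greedy-match regression test (the test name and fixture are illustrative; the real assertions live in `TestExtractTag`):
```python
def test_extract_tag_stops_at_first_closing_fence():
    fence = "`" * 3
    text = (
        f'{fence}json\n{{"a": 1}}\n{fence}\n'
        "intervening prose the greedy pattern used to swallow\n"
        f'{fence}json\n{{"b": 2}}\n{fence}\n'
    )
    # Only the first fenced block's contents should come back.
    assert extract_tag(text, "json") == '{"a": 1}'
```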
Test plan