fix(helpers): correct LLM-output parsing (post-v0.0.1) #5
Merged
will-fawcett-trillium merged 3 commits into main on May 1, 2026
Conversation
Four related bugs in the prompt-and-parse layer surfaced by the
post-v0.0.1 deep review:
1. extract_tag used a greedy `(.*)` which captured across multiple
``` fences, returning a string spanning intervening prose. Switch
to non-greedy `(.*?)` so it stops at the first closing fence.
2. _assess_factuality_issue cleaned up the answer by string-replacing
`code_recommendations` after potentially unwrapping its inner
```json fence, which left the wrapping <CODE_RECOMMENDATIONS> tags
(and the fence) behind in the answer body. Refactor to splice the
full <CODE_RECOMMENDATIONS>...</CODE_RECOMMENDATIONS> block out of
the answer first, then unwrap the inner fence for the structured
output field.
3. _assess_factuality_issue raised NoTagFoundError when the LLM
omitted the <CODE_RECOMMENDATIONS> tag, even though the prompt
explicitly says recommendations are optional. Catch that case and
return an empty `code_recommendations` field.
4. Both prompts had typos ("asessment", "udpate", "attemps", "You
task") and the assess prompt's example JSON used single quotes
plus comma-less object separators, so the model was being shown
invalid JSON as the schema. Fix typos and rewrite the example to
be valid double-quoted JSON. Also fix the extract prompt's
implicit string concatenation and missing array commas.
Tests updated:
- TestAssessFactualityIssue: assertions now check that the entire
<CODE_RECOMMENDATIONS> block (including tags) is removed from
`answer`. Added cases for the no-recs path and the malformed
inner-fence path.
- TestExtractTag: added a regression test for the greedy-match bug
using a string with two ```json fences.
Coverage stays at 100%.
217c614 to f4811a9
r-spiewak requested changes on Apr 30, 2026
r-spiewak (Collaborator) left a comment:
I didn't test it, but there are some minor additions to your tests.
r-spiewak approved these changes on May 1, 2026
Summary
Fixes four real correctness bugs in `src/eve_mcp/server/helpers.py` surfaced by the post-v0.0.1 code review. All four are concentrated in the LLM-output-parsing layer used by `extract_factuality_issues` and `assess_factuality_issue`. Smoke tests against EVE didn't catch these because the existing tests only exercised the happy path with synthetic, well-formed inputs.
1. `extract_tag` was greedy — captured across fences
```python
pattern = rf"```{tag}(.*)```" # greedy
```
With input containing two ```json fences, the regex captured everything from the first opening fence to the last closing fence — yielding a string with intervening prose and other fenced blocks embedded.
Fix: `(.*?)` (non-greedy). Regression test added.
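Roughly what the corrected helper looks like. This is a sketch only; the real signature, stripping behaviour, and error type live in `helpers.py`:
```python
import re

class NoTagFoundError(ValueError):
    """Raised when no matching fenced block is found (name mirrors the PR's error)."""

FENCE = "`" * 3  # written this way so the example itself stays renderable

def extract_tag(text: str, tag: str) -> str:
    """Return the contents of the first fenced block labelled `tag`."""
    # Non-greedy (.*?) stops at the first closing fence instead of the last one.
    match = re.search(rf"{FENCE}{tag}(.*?){FENCE}", text, flags=re.DOTALL)
    if match is None:
        raise NoTagFoundError(f"no fenced {tag} block found")
    return match.group(1).strip()
```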
2. `_assess_factuality_issue` answer cleanup left tags in the answer
The original code:
```python
code_recommendations = extract_xml_tag(r["answer"], "CODE_RECOMMENDATIONS")
if "```json" in code_recommendations:
    code_recommendations = extract_tag(code_recommendations, "json")
r["answer"] = r["answer"].replace(code_recommendations, "")
```
When the recommendations contained a ```json fence, `code_recommendations` was rebound to the unwrapped inner content. The `.replace()` then matched only that inner content as a substring inside the original fence, leaving the wrapping `<CODE_RECOMMENDATIONS>`, the ```json fence itself, and the closing tag in `answer`.
Fix: capture the original (pre-unwrap) inner text, splice the full `<CODE_RECOMMENDATIONS>...</CODE_RECOMMENDATIONS>` block out of the answer, then unwrap the inner fence for the structured `code_recommendations` field.
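A minimal sketch of the refactored cleanup, assuming `extract_xml_tag` returns the tag's inner text verbatim and the splice is done by replacing the full tagged block (the exact implementation is in `helpers.py`):
```python
inner = extract_xml_tag(r["answer"], "CODE_RECOMMENDATIONS")

# Splice the whole tagged block (tags, fence, and contents) out of the answer first.
full_block = f"<CODE_RECOMMENDATIONS>{inner}</CODE_RECOMMENDATIONS>"
r["answer"] = r["answer"].replace(full_block, "").strip()

# Only then unwrap the inner fenced JSON for the structured output field.
json_fence = "`" * 3 + "json"  # spelled out so this example stays renderable
code_recommendations = inner if json_fence not in inner else extract_tag(inner, "json")
```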
3. `_assess_factuality_issue` raised when the LLM omitted recommendations
The prompt explicitly says "If you have recommendations to fix or update the code, add a JSON object inside an xml tag…" — recommendations are optional. But `extract_xml_tag` was called unconditionally, so a successful assessment with no recommendations surfaced as a tool error to the MCP host. Now caught and returns `code_recommendations: ""`.
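The guard is roughly the following (a sketch; `NoTagFoundError` is the error named in the commit message):
```python
try:
    inner = extract_xml_tag(r["answer"], "CODE_RECOMMENDATIONS")
except NoTagFoundError:
    # Recommendations are optional per the prompt, so a missing tag is not an error.
    r["code_recommendations"] = ""
else:
    ...  # splice the block out of the answer and unwrap the inner fence as above
```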
4. Prompt typos + invalid example JSON
Both prompts had typos ("asessment", "udpate", "attemps", "You task"). The assess prompt also showed the model an example schema using single-quoted "JSON" with no commas between objects — invalid JSON. LLMs that follow the example faithfully produced strings the downstream JSON parser would reject. Rewritten to be valid double-quoted JSON. Similar fixes to the extract prompt (implicit string concatenation, missing array commas).
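For illustration only (the field names below are hypothetical, not the actual prompt schema), the difference is roughly:
```python
import json

# The kind of "JSON" the old example showed: single quotes, no commas between objects.
bad_example = "[{'file': 'a.py' 'fix': '...'} {'file': 'b.py' 'fix': '...'}]"

# The kind of example the rewritten prompt shows: valid, double-quoted JSON.
good_example = '[{"file": "a.py", "fix": "..."}, {"file": "b.py", "fix": "..."}]'

json.loads(good_example)    # parses cleanly
# json.loads(bad_example)   # raises json.JSONDecodeError
```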
Tests
Coverage stays at 100%.
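A sketch of the greedy-match regression test (the test name and fixture are illustrative; the real assertions live in `TestExtractTag`):
```python
def test_extract_tag_stops_at_first_closing_fence():
    fence = "`" * 3
    text = (
        f'{fence}json\n{{"a": 1}}\n{fence}\n'
        "intervening prose the greedy pattern used to swallow\n"
        f'{fence}json\n{{"b": 2}}\n{fence}\n'
    )
    # Only the first fenced block's contents should come back.
    assert extract_tag(text, "json") == '{"a": 1}'
```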
Test plan