Conversation
Pull request overview
This PR changes LLM error handling so failed LLM calls are retried and, if they still fail, raise a dedicated exception that causes conversation generation/judging to be skipped rather than polluting transcripts/evaluations with error text.
Changes:
- Add `_run_with_retry` + `LLMGenerationFailed` to standardize retries and failure signaling across LLM clients (sketched below).
- Update LLM client implementations and runners to raise/propagate `LLMGenerationFailed` and skip saving/judging on failures.
- Update unit/integration tests to assert retry behavior and the new exception-driven control flow.
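For orientation, here is a minimal sketch of what that retry core might look like. The names (`MAX_LLM_RETRIES`, `_run_with_retry`, `LLMGenerationFailed`, no-retry substrings) come from this PR's summary; the signature, backoff, and error-matching details are assumptions, not the actual code in `llm_clients/llm_interface.py`.

```python
# Hedged sketch of the retry core; details are illustrative, not the real implementation.
import asyncio

MAX_LLM_RETRIES = 3  # per the description: at most 3 attempts per LLM call


class LLMGenerationFailed(Exception):
    """Raised when an LLM call still fails after exhausting its retries."""


async def _run_with_retry(call, *, no_retry_substrings=(), max_retries=MAX_LLM_RETRIES):
    """Run an async LLM call, retrying transient failures.

    `call` is a zero-argument coroutine factory. Errors whose message contains
    any of `no_retry_substrings` (e.g. invalid-model errors) are not retried.
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return await call()
        except Exception as exc:  # real clients would catch provider-specific errors
            last_error = exc
            if any(s in str(exc) for s in no_retry_substrings):
                break  # permanent error: don't waste further attempts
            if attempt < max_retries:
                await asyncio.sleep(2 ** attempt)  # simple backoff (assumption)
    raise LLMGenerationFailed(f"LLM call failed after {attempt} attempt(s)") from last_error
```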
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/unit/llm_clients/test_openai_llm.py | Updates OpenAI tests to expect retries and LLMGenerationFailed. |
| tests/unit/llm_clients/test_ollama_llm.py | Updates Ollama tests to assert exceptions instead of error strings. |
| tests/unit/llm_clients/test_llm_interface.py | Adds direct unit coverage for _run_with_retry semantics. |
| tests/unit/llm_clients/test_helpers.py | Adds reusable assert_llm_generation_failed and retry call count helper. |
| tests/unit/llm_clients/test_gemini_llm.py | Updates Gemini tests for exception-based failure handling. |
| tests/unit/llm_clients/test_endpoint_llm.py | Updates endpoint client tests to expect LLMGenerationFailed. |
| tests/unit/llm_clients/test_claude_llm.py | Updates Claude tests for retry + exception behavior. |
| tests/unit/llm_clients/test_base_llm.py | Updates shared base tests to assert LLMGenerationFailed on failures. |
| tests/unit/llm_clients/test_azure_llm.py | Updates Azure tests for exception-based failures and retry behavior. |
| tests/unit/generate_conversations/test_conversation_simulator.py | Expects simulator to propagate LLMGenerationFailed for LLM errors. |
| tests/mocks/mock_llm.py | Makes mock LLM raise LLMGenerationFailed for simulated failures. |
| tests/integration/test_conversation_runner.py | Ensures failed conversations are skipped and transcripts aren’t saved. |
| llm_clients/openai_llm.py | Uses _run_with_retry and adds OpenAI no-retry substrings. |
| llm_clients/ollama_llm.py | Uses _run_with_retry and adds Ollama no-retry substrings. |
| llm_clients/llm_interface.py | Introduces retry core (MAX_LLM_RETRIES, _run_with_retry, LLMGenerationFailed). |
| llm_clients/gemini_llm.py | Switches Gemini calls to _run_with_retry. |
| llm_clients/endpoint_llm.py | Switches endpoint calls to _run_with_retry (start + chat). |
| llm_clients/claude_llm.py | Switches Claude calls to _run_with_retry. |
| llm_clients/azure_llm.py | Switches Azure calls to _run_with_retry and preserves 404-specific messaging. |
| llm_clients/__init__.py | Exports LLMGenerationFailed from package top-level. |
| judge/runner.py | Improves completion summary output and reports skipped evaluations. |
| generate.py | Prints generated-vs-skipped conversation counts. |
| generate_conversations/runner.py | Skips entire conversations on LLMGenerationFailed and reports skip counts. |
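The runner-side control flow described in the last row could look roughly like the sketch below. Only `LLMGenerationFailed` (exported at the package top level in this PR) and the asyncio-queue worker pattern are taken from the PR; `generate_conversation`, `save_transcript`, and the counters dict are hypothetical stand-ins.

```python
# Illustrative only: how a queue-based conversation runner might skip failed conversations.
import asyncio

from llm_clients import LLMGenerationFailed  # top-level export added in this PR


async def generate_conversation(spec):
    """Hypothetical stand-in for the conversation simulator call."""
    raise LLMGenerationFailed(f"simulated failure for {spec!r}")


def save_transcript(transcript):
    """Hypothetical stand-in for transcript persistence."""


async def worker(queue: asyncio.Queue, counters: dict) -> None:
    """Drain conversation specs from the queue, skipping any that fail."""
    while True:
        spec = await queue.get()
        try:
            transcript = await generate_conversation(spec)
        except LLMGenerationFailed:
            counters["skipped"] += 1  # no transcript is saved for this conversation
        else:
            save_transcript(transcript)
            counters["generated"] += 1
        finally:
            queue.task_done()
```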
refactor conversation generate runner to utilize asyncio queue
Merged in #128 to keep conversation generation consistent with judging in terms of using asyncio's Queue.
Could you merge main into that to highlight the changes?
emily-vanark
left a comment
I have some questions/suggestions (may just be the jet lag talking), but overall looks good... all the "not live" tests pass; however, the test_pipeline_error_handling live test failed. Could we look into that before merging, please?
tests/integration/test_scoring.py:684: in test_pipeline_error_handling
await self.run_generate_cli(
tests/integration/test_scoring.py:148: in run_generate_cli
raise RuntimeError(f"Generated directory {conv_dir} is not valid")
E RuntimeError: Generated directory /var/folders/ng/2g94p1rd0j31qh3jz4jl378w0000gp/T/vera_integration_test_mzsvi1ds/conversations/p_invalid_model_name__a_claude_opus_4_1_20250805__t6__r1__20260407_211520/ (1 skipped) is not valid
During handling of the above exception, another exception occurred:
tests/integration/test_scoring.py:683: in test_pipeline_error_handling
with pytest.raises(RuntimeError, match="Generate CLI failed"):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E AssertionError: Regex pattern did not match.
E Expected regex: 'Generate CLI failed'
E Actual message: 'Generated directory /var/folders/ng/2g94p1rd0j31qh3jz4jl378w0000gp/T/vera_integration_test_mzsvi1ds/conversations/p_invalid_model_name__a_claude_opus_4_1_20250805__t6__r1__20260407_211520/ (1 skipped) is not valid'
Fixed! 84d3708
sator-labs
left a comment
LGTM modulo minor comments
I agree!
Description
Before, if an LLM call returned an error, that error would make it into the conversation and skew the remaining conversation and evaluation.
This PR introduces a retry method that retries a failed LLM call up to 3 times, unless the error matches a no-retry string, and skips the conversation altogether if the call still fails.
The same applies to judging: if a judging LLM call errors, we skip judging that conversation.
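As a rough illustration of the judging side (not the actual `judge/runner.py` code), the control flow could look like this; `judge_one` and the summary line are placeholders, while `LLMGenerationFailed` is the exception introduced by this PR.

```python
# Illustrative sketch of skip-and-report behavior for judging.
from llm_clients import LLMGenerationFailed  # as exported in this PR


async def judge_all(conversations, judge_one):
    """Judge each conversation, skipping any whose judge LLM call fails."""
    results, skipped = [], 0
    for conv in conversations:
        try:
            results.append(await judge_one(conv))
        except LLMGenerationFailed:
            skipped += 1  # skipped rather than recorded with error text
    print(f"Judged {len(results)} conversation(s); skipped {skipped} due to LLM failures")
    return results
```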
Follow-up work
If 80% of conversations are error-free and we want the remaining 20%, we should be able to point generate.py at the existing folder and have it complete the remaining 20% without rerunning the other 80%.
Same for judging.
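One possible shape for that resume behavior, purely as a sketch (the output layout and naming scheme here are assumptions, not part of this PR): filter out specs that already have a saved transcript before enqueueing.

```python
# Hypothetical resume filter for generate.py; path layout and naming are assumptions.
from pathlib import Path


def pending_specs(specs, output_dir: Path):
    """Yield only the specs that don't already have a saved transcript on disk."""
    for spec in specs:
        transcript_path = output_dir / f"{spec}.json"  # hypothetical naming scheme
        if not transcript_path.exists():
            yield spec  # only these would be re-enqueued for generation
```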