skip conversation after retry#126

Merged
jgieringer merged 20 commits into v1.1 from fail-after-retry on Apr 14, 2026

Conversation

@jgieringer
Collaborator

Description

Previously, if an LLM call returned an error, that error text made it into the conversation and skewed the rest of the conversation and its evaluation.

This PR introduces a retry method that retries the LLM call up to 3 times, unless the error matches a no-retry string, and skips the conversation altogether if the call still fails.

The same applies to judging: if a judging LLM call errors, we skip judging that conversation.

Follow-up work

If 80% of conversations complete error-free and we want the remaining 20%, we should be able to point generate.py at the existing folder and have it complete the remaining 20% without rerunning the other 80%.
The same goes for judging.
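A minimal sketch of that resume behavior, assuming each completed conversation is saved as a JSON transcript named after its conversation ID (the filenames and the `generate_conversation` callback here are hypothetical, not the repo's actual layout):

```python
from pathlib import Path

def resume_generation(out_dir, conversation_ids, generate_conversation):
    """Generate only conversations whose transcript file is missing;
    already-saved ones are left untouched."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    done = {p.stem for p in out.glob("*.json")}   # transcripts already on disk
    pending = [cid for cid in conversation_ids if cid not in done]
    for cid in pending:
        (out / f"{cid}.json").write_text(generate_conversation(cid))
    return len(done), len(pending)
```

The same pattern would apply to judging: scan for existing judgment files and only evaluate the conversations that lack one.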

@jgieringer jgieringer marked this pull request as ready for review March 30, 2026 23:50
@jgieringer jgieringer requested a review from Copilot March 30, 2026 23:50
@jgieringer jgieringer requested review from nz-1 and sator-labs March 30, 2026 23:55
Contributor

Copilot AI left a comment


Pull request overview

This PR changes LLM error handling so failed LLM calls are retried and, if they still fail, raise a dedicated exception that causes conversation generation/judging to be skipped rather than polluting transcripts/evaluations with error text.

Changes:

  • Add _run_with_retry + LLMGenerationFailed to standardize retries and failure signaling across LLM clients.
  • Update LLM client implementations and runners to raise/propagate LLMGenerationFailed and skip saving/judging on failures.
  • Update unit/integration tests to assert retry behavior and the new exception-driven control flow.

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `tests/unit/llm_clients/test_openai_llm.py` | Updates OpenAI tests to expect retries and `LLMGenerationFailed`. |
| `tests/unit/llm_clients/test_ollama_llm.py` | Updates Ollama tests to assert exceptions instead of error strings. |
| `tests/unit/llm_clients/test_llm_interface.py` | Adds direct unit coverage for `_run_with_retry` semantics. |
| `tests/unit/llm_clients/test_helpers.py` | Adds reusable `assert_llm_generation_failed` and retry call count helper. |
| `tests/unit/llm_clients/test_gemini_llm.py` | Updates Gemini tests for exception-based failure handling. |
| `tests/unit/llm_clients/test_endpoint_llm.py` | Updates endpoint client tests to expect `LLMGenerationFailed`. |
| `tests/unit/llm_clients/test_claude_llm.py` | Updates Claude tests for retry + exception behavior. |
| `tests/unit/llm_clients/test_base_llm.py` | Updates shared base tests to assert `LLMGenerationFailed` on failures. |
| `tests/unit/llm_clients/test_azure_llm.py` | Updates Azure tests for exception-based failures and retry behavior. |
| `tests/unit/generate_conversations/test_conversation_simulator.py` | Expects simulator to propagate `LLMGenerationFailed` for LLM errors. |
| `tests/mocks/mock_llm.py` | Makes mock LLM raise `LLMGenerationFailed` for simulated failures. |
| `tests/integration/test_conversation_runner.py` | Ensures failed conversations are skipped and transcripts aren't saved. |
| `llm_clients/openai_llm.py` | Uses `_run_with_retry` and adds OpenAI no-retry substrings. |
| `llm_clients/ollama_llm.py` | Uses `_run_with_retry` and adds Ollama no-retry substrings. |
| `llm_clients/llm_interface.py` | Introduces retry core (`MAX_LLM_RETRIES`, `_run_with_retry`, `LLMGenerationFailed`). |
| `llm_clients/gemini_llm.py` | Switches Gemini calls to `_run_with_retry`. |
| `llm_clients/endpoint_llm.py` | Switches endpoint calls to `_run_with_retry` (start + chat). |
| `llm_clients/claude_llm.py` | Switches Claude calls to `_run_with_retry`. |
| `llm_clients/azure_llm.py` | Switches Azure calls to `_run_with_retry` and preserves 404-specific messaging. |
| `llm_clients/__init__.py` | Exports `LLMGenerationFailed` from package top-level. |
| `judge/runner.py` | Improves completion summary output and reports skipped evaluations. |
| `generate.py` | Prints generated-vs-skipped conversation counts. |
| `generate_conversations/runner.py` | Skips entire conversations on `LLMGenerationFailed` and reports skip counts. |
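Based on the summary above, the runner-side control flow presumably looks something like the sketch below: the runner catches the exception per conversation, saves nothing for failed ones, and reports a generated-vs-skipped count. This is an illustration, not the PR's actual code:

```python
class LLMGenerationFailed(Exception):
    """Signals that an LLM call failed even after retries."""

def run_conversations(jobs, simulate):
    """Run each conversation job; skip (don't save) any that raise
    LLMGenerationFailed, and report generated-vs-skipped counts."""
    generated, skipped = [], 0
    for job in jobs:
        try:
            generated.append(simulate(job))
        except LLMGenerationFailed:
            skipped += 1  # no partial or error-laden transcript is saved
    print(f"Generated {len(generated)} conversations ({skipped} skipped)")
    return generated, skipped
```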


Collaborator

@nz-1 nz-1 left a comment


LGTM! All non-live tests passed on my end. Left one comment about also logging failure cases, so we have a consistent log cycle, for consideration.

@jgieringer
Collaborator Author

Merged in #128 to keep conversation generation consistent with judging in terms of using asyncio's Queue.
This branch still holds its value in that there's no direct redundancy/conflict between queue workers and invoking retry logic. The queue refactor in generate handles job scheduling/concurrency, while retry in LLMInterface handles single-call resiliency. They are layered, not duplicated 🚀
cc @sator-labs
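The layering described above can be illustrated with a small sketch: queue workers own job scheduling and concurrency, while per-call retry lives inside the client. All names here (`call_llm_with_retry`, `worker`, `run`) are illustrative, not the repo's actual API:

```python
import asyncio

class LLMGenerationFailed(Exception):
    pass

async def call_llm_with_retry(prompt, attempt_fn, max_retries=3):
    # Single-call resiliency: retry lives inside the client layer.
    for _ in range(max_retries):
        try:
            return await attempt_fn(prompt)
        except Exception:
            continue
    raise LLMGenerationFailed(prompt)

async def worker(queue, results, attempt_fn):
    # Scheduling/concurrency layer: workers pull jobs but never retry.
    while True:
        prompt = await queue.get()
        try:
            results.append(await call_llm_with_retry(prompt, attempt_fn))
        except LLMGenerationFailed:
            pass  # conversation is skipped; the worker moves on
        finally:
            queue.task_done()

async def run(prompts, attempt_fn, n_workers=2):
    queue, results = asyncio.Queue(), []
    for p in prompts:
        queue.put_nowait(p)
    workers = [asyncio.create_task(worker(queue, results, attempt_fn))
               for _ in range(n_workers)]
    await queue.join()
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results
```

Because the two concerns never overlap, a retried call looks like one slow job to the queue, and a skipped conversation looks like a completed job.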

@jgieringer jgieringer requested a review from sator-labs April 3, 2026 00:10
@sator-labs
Collaborator

could you merge main into that to highlight the changes?

Collaborator

@emily-vanark emily-vanark left a comment


I have some questions/suggestions (may just be the jet lag talking), but overall looks good. All the non-live tests pass; however, the test_pipeline_error_handling live test failed. Could we look into that before merging, please?

tests/integration/test_scoring.py:684: in test_pipeline_error_handling
    await self.run_generate_cli(
tests/integration/test_scoring.py:148: in run_generate_cli
    raise RuntimeError(f"Generated directory {conv_dir} is not valid")
E   RuntimeError: Generated directory /var/folders/ng/2g94p1rd0j31qh3jz4jl378w0000gp/T/vera_integration_test_mzsvi1ds/conversations/p_invalid_model_name__a_claude_opus_4_1_20250805__t6__r1__20260407_211520/ (1 skipped) is not valid

During handling of the above exception, another exception occurred:
tests/integration/test_scoring.py:683: in test_pipeline_error_handling
    with pytest.raises(RuntimeError, match="Generate CLI failed"):
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E   AssertionError: Regex pattern did not match.
E     Expected regex: 'Generate CLI failed'
E     Actual message: 'Generated directory /var/folders/ng/2g94p1rd0j31qh3jz4jl378w0000gp/T/vera_integration_test_mzsvi1ds/conversations/p_invalid_model_name__a_claude_opus_4_1_20250805__t6__r1__20260407_211520/ (1 skipped) is not valid'

@jgieringer
Collaborator Author

> however the test_pipeline_error_handling live test failed. Could we look into that before merging, please?

Fixed! 84d3708

@jgieringer jgieringer requested a review from emily-vanark April 10, 2026 01:37
Collaborator

@sator-labs sator-labs left a comment


LGTM modulo minor comments

Comment thread generate_conversations/runner.py
Comment thread llm_clients/llm_interface.py Outdated
Comment thread llm_clients/llm_interface.py Outdated
@emily-vanark
Collaborator

> however the test_pipeline_error_handling live test failed. Could we look into that before merging, please?
>
> Fixed! 84d3708

I agree!

@jgieringer jgieringer merged commit 4bd2346 into v1.1 Apr 14, 2026
