Conversation
Pull request overview
This PR changes LLM error handling so failed LLM calls are retried and, if they still fail, raise a dedicated exception that causes conversation generation/judging to be skipped rather than polluting transcripts/evaluations with error text.
Changes:
- Add `_run_with_retry` + `LLMGenerationFailed` to standardize retries and failure signaling across LLM clients (sketched below).
- Update LLM client implementations and runners to raise/propagate `LLMGenerationFailed` and skip saving/judging on failures.
- Update unit/integration tests to assert retry behavior and the new exception-driven control flow.
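For orientation, here is a minimal sketch of what that retry core might look like. The names (`MAX_LLM_RETRIES`, `_run_with_retry`, `LLMGenerationFailed`, no-retry substrings) come from this PR's summary; the signature, backoff, and error-matching details are assumptions, not the actual code in `llm_clients/llm_interface.py`.

```python
# Hedged sketch of the retry core; details are illustrative, not the real implementation.
import asyncio

MAX_LLM_RETRIES = 3  # per the description: at most 3 attempts per LLM call


class LLMGenerationFailed(Exception):
    """Raised when an LLM call still fails after exhausting its retries."""


async def _run_with_retry(call, *, no_retry_substrings=(), max_retries=MAX_LLM_RETRIES):
    """Run an async LLM call, retrying transient failures.

    `call` is a zero-argument coroutine factory. Errors whose message contains
    any of `no_retry_substrings` (e.g. invalid-model errors) are not retried.
    """
    last_error = None
    for attempt in range(1, max_retries + 1):
        try:
            return await call()
        except Exception as exc:  # real clients would catch provider-specific errors
            last_error = exc
            if any(s in str(exc) for s in no_retry_substrings):
                break  # permanent error: don't waste further attempts
            if attempt < max_retries:
                await asyncio.sleep(2 ** attempt)  # simple backoff (assumption)
    raise LLMGenerationFailed(f"LLM call failed after {attempt} attempt(s)") from last_error
```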
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tests/unit/llm_clients/test_openai_llm.py | Updates OpenAI tests to expect retries and LLMGenerationFailed. |
| tests/unit/llm_clients/test_ollama_llm.py | Updates Ollama tests to assert exceptions instead of error strings. |
| tests/unit/llm_clients/test_llm_interface.py | Adds direct unit coverage for _run_with_retry semantics. |
| tests/unit/llm_clients/test_helpers.py | Adds reusable assert_llm_generation_failed and retry call count helper. |
| tests/unit/llm_clients/test_gemini_llm.py | Updates Gemini tests for exception-based failure handling. |
| tests/unit/llm_clients/test_endpoint_llm.py | Updates endpoint client tests to expect LLMGenerationFailed. |
| tests/unit/llm_clients/test_claude_llm.py | Updates Claude tests for retry + exception behavior. |
| tests/unit/llm_clients/test_base_llm.py | Updates shared base tests to assert LLMGenerationFailed on failures. |
| tests/unit/llm_clients/test_azure_llm.py | Updates Azure tests for exception-based failures and retry behavior. |
| tests/unit/generate_conversations/test_conversation_simulator.py | Expects simulator to propagate LLMGenerationFailed for LLM errors. |
| tests/mocks/mock_llm.py | Makes mock LLM raise LLMGenerationFailed for simulated failures. |
| tests/integration/test_conversation_runner.py | Ensures failed conversations are skipped and transcripts aren’t saved. |
| llm_clients/openai_llm.py | Uses _run_with_retry and adds OpenAI no-retry substrings. |
| llm_clients/ollama_llm.py | Uses _run_with_retry and adds Ollama no-retry substrings. |
| llm_clients/llm_interface.py | Introduces retry core (MAX_LLM_RETRIES, _run_with_retry, LLMGenerationFailed). |
| llm_clients/gemini_llm.py | Switches Gemini calls to _run_with_retry. |
| llm_clients/endpoint_llm.py | Switches endpoint calls to _run_with_retry (start + chat). |
| llm_clients/claude_llm.py | Switches Claude calls to _run_with_retry. |
| llm_clients/azure_llm.py | Switches Azure calls to _run_with_retry and preserves 404-specific messaging. |
| llm_clients/__init__.py | Exports LLMGenerationFailed from package top-level. |
| judge/runner.py | Improves completion summary output and reports skipped evaluations. |
| generate.py | Prints generated-vs-skipped conversation counts. |
| generate_conversations/runner.py | Skips entire conversations on LLMGenerationFailed and reports skip counts. |
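The runner-side control flow described in the last row could look roughly like the sketch below. Only `LLMGenerationFailed` (exported at the package top level in this PR) and the asyncio-queue worker pattern are taken from the PR; `generate_conversation`, `save_transcript`, and the counters dict are hypothetical stand-ins.

```python
# Illustrative only: how a queue-based conversation runner might skip failed conversations.
import asyncio

from llm_clients import LLMGenerationFailed  # top-level export added in this PR


async def generate_conversation(spec):
    """Hypothetical stand-in for the conversation simulator call."""
    raise LLMGenerationFailed(f"simulated failure for {spec!r}")


def save_transcript(transcript):
    """Hypothetical stand-in for transcript persistence."""


async def worker(queue: asyncio.Queue, counters: dict) -> None:
    """Drain conversation specs from the queue, skipping any that fail."""
    while True:
        spec = await queue.get()
        try:
            transcript = await generate_conversation(spec)
        except LLMGenerationFailed:
            counters["skipped"] += 1  # no transcript is saved for this conversation
        else:
            save_transcript(transcript)
            counters["generated"] += 1
        finally:
            queue.task_done()
```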
refactor conversation generate runner to utilize asyncio queue
Merged in #128 to keep conversation generation consistent with judging in terms of using asyncio's Queue.
Could you merge main into that to highlight the changes?
emily-vanark
left a comment
I have some questions/suggestions (may just be the jet lag talking), but overall looks good... all the "not live" tests pass; however, the test_pipeline_error_handling live test failed. Could we look into that before merging, please?
tests/integration/test_scoring.py:684: in test_pipeline_error_handling
await self.run_generate_cli(
tests/integration/test_scoring.py:148: in run_generate_cli
raise RuntimeError(f"Generated directory {conv_dir} is not valid")
E RuntimeError: Generated directory /var/folders/ng/2g94p1rd0j31qh3jz4jl378w0000gp/T/vera_integration_test_mzsvi1ds/conversations/p_invalid_model_name__a_claude_opus_4_1_20250805__t6__r1__20260407_211520/ (1 skipped) is not valid
During handling of the above exception, another exception occurred:
tests/integration/test_scoring.py:683: in test_pipeline_error_handling
with pytest.raises(RuntimeError, match="Generate CLI failed"):
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
E AssertionError: Regex pattern did not match.
E Expected regex: 'Generate CLI failed'
E Actual message: 'Generated directory /var/folders/ng/2g94p1rd0j31qh3jz4jl378w0000gp/T/vera_integration_test_mzsvi1ds/conversations/p_invalid_model_name__a_claude_opus_4_1_20250805__t6__r1__20260407_211520/ (1 skipped) is not valid'
Fixed! 84d3708
sator-labs
left a comment
LGTM modulo minor comments
I agree!
Description
Before, if an LLM call returned an error, that error would make it into the conversation and skew the remaining conversation and evaluation.
This PR introduces a retry method that retries a failed LLM call up to 3 times, unless the error matches a no-retry string, and skips the conversation altogether if the call still fails.
The same applies to judging: if a judging LLM call errors, we skip judging that conversation.
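As a rough illustration of the judging side (not the actual `judge/runner.py` code), the control flow could look like this; `judge_one` and the summary line are placeholders, while `LLMGenerationFailed` is the exception introduced by this PR.

```python
# Illustrative sketch of skip-and-report behavior for judging.
from llm_clients import LLMGenerationFailed  # as exported in this PR


async def judge_all(conversations, judge_one):
    """Judge each conversation, skipping any whose judge LLM call fails."""
    results, skipped = [], 0
    for conv in conversations:
        try:
            results.append(await judge_one(conv))
        except LLMGenerationFailed:
            skipped += 1  # skipped rather than recorded with error text
    print(f"Judged {len(results)} conversation(s); skipped {skipped} due to LLM failures")
    return results
```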
Follow-up work
If 80% of conversations are error-free and we want the remaining 20%, we should be able to point generate.py at the existing folder and have it complete the remaining 20% without rerunning the other 80%.
Same for judging.
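One possible shape for that resume behavior, purely as a sketch (the output layout and naming scheme here are assumptions, not part of this PR): filter out specs that already have a saved transcript before enqueueing.

```python
# Hypothetical resume filter for generate.py; path layout and naming are assumptions.
from pathlib import Path


def pending_specs(specs, output_dir: Path):
    """Yield only the specs that don't already have a saved transcript on disk."""
    for spec in specs:
        transcript_path = output_dir / f"{spec}.json"  # hypothetical naming scheme
        if not transcript_path.exists():
            yield spec  # only these would be re-enqueued for generation
```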