Chat services return success with empty response when LLM output is unparseable or empty #484

@elias-ba

Description

What's broken

Both workflow_chat and global_chat can return HTTP 200 with response: "" (empty string) when the LLM produces output that the service can't use. There is no legitimate "empty response" path, so this silently degrades the contract: callers see a success and have no way to distinguish "the model said nothing useful" from "the model's normal answer".

In services/workflow_chat/workflow_chat.py, split_format_yaml initialises output_text = "" and only fills it from response_data.get("text", "") after JSON parsing succeeds. If the LLM returns text that isn't valid JSON (or JSON without a text field), the except branch logs an error and output_text stays empty. The wrapper still returns {"response": "", ...} with status 200.
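A condensed sketch of that path and the proposed change (the real body of split_format_yaml differs in detail, and the ApolloError stub here is illustrative):

```python
import json


class ApolloError(Exception):
    """Illustrative stand-in for the service's real ApolloError."""


def split_format_yaml(llm_output: str) -> str:
    """Condensed version of the parsing path described above.

    Today the except branch only logs, so output_text can leave this
    function empty; raising instead closes the 200-with-"" hole.
    """
    output_text = ""
    try:
        response_data = json.loads(llm_output)
        output_text = response_data.get("text", "")
    except json.JSONDecodeError:
        # Current behaviour: log and fall through with output_text == "".
        # Proposed behaviour: surface the failure to the caller.
        raise ApolloError("LLM output was not valid JSON")
    if not output_text:
        raise ApolloError("LLM output had no 'text' field")
    return output_text
```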

In services/global_chat/planner.py, final_text is initialised to "" and only filled by _extract_text(response) on end_turn, on the max_tool_calls exit, or on an unexpected stop_reason. If the model produces a response with no text blocks (only tool_use or thinking) and the loop exits without end_turn, _extract_text returns an empty string. The service logs Loop exited without end_turn but still returns 200 with response: "".
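To illustrate why a tool-only response yields an empty string, here is a minimal sketch of the extraction step (Block is an illustrative stand-in for an Anthropic-style content block, not the service's actual type):

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Illustrative stand-in for an Anthropic-style content block."""
    type: str
    text: str = ""


def extract_text(content: list) -> str:
    # Only "text" blocks contribute; tool_use and thinking blocks carry
    # no user-visible text, so a response with only those yields "".
    return "".join(b.text for b in content if b.type == "text")
```

When the loop then exits without end_turn, that empty string is exactly what gets wrapped in the 200 response.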

How it surfaced

In Lightning, the AI Assistant background worker treats Apollo's 200 as success and tries to persist the assistant turn. The ChatMessage changeset requires content (1 to 10,000 characters), so the insert fails with content: "can't be blank". The user-side message stays stuck in :processing and the user sees no response and no error indication.

Surfaced as Sentry alert LIGHTNING-1MP and tracked downstream at OpenFn/lightning#4710.

What to fix

When the LLM output can't be parsed (workflow_chat) or produces no text content (global_chat), the service should raise ApolloError (or include an explicit error signal in the response body), not return 200 with empty text.

Both services already use ApolloError for other failure modes (auth, rate limit, connection), and Lightning's handle_error_response already routes those into the error tuple cleanly, so the fix should fit the existing shape without new contracts.
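Both return paths could share a small guard along these lines (a sketch only; the real ApolloError signature and status code are assumptions, and the actual fix may sit inside each service rather than in a shared helper):

```python
class ApolloError(Exception):
    """Stand-in for the project's error type; the real class presumably
    carries a status that Lightning's handle_error_response maps."""

    def __init__(self, status: int, message: str):
        self.status = status
        self.message = message
        super().__init__(message)


def require_text(text: str, source: str) -> str:
    """Raise instead of letting an empty LLM answer ride out as a 200."""
    if not text or not text.strip():
        raise ApolloError(502, f"{source}: LLM produced no usable text")
    return text
```

Calling require_text(output_text, "workflow_chat") / require_text(final_text, "global_chat") just before building the success payload would turn today's silent empty responses into errors that flow through the existing error-handling path.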
