fix(max): Handle async query and recursion limit errors #30233
PR Summary
This PR improves error handling for async queries and recursion limits in the PostHog AI assistant, focusing on user experience and robustness.
- Added user-friendly error message when recursion limit (48 steps) is reached in assistant.py, preventing crashes
- Implemented unified error handling between sync/async query execution with exponential backoff polling in query_executor/nodes.py
- Added timeout handling for long-running queries with a 726s maximum wait time
- Added comprehensive test coverage for both sync and async error scenarios in test_assistant.py
- Improved conversation state management with proper locking and state saving during cancellation
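As a rough illustration of the polling behaviour summarised above, here is a minimal sketch of exponential backoff under a fixed wait budget. It is not the code from query_executor/nodes.py: the `poll_with_backoff` helper, the `complete` status field, and the delay parameters are all assumptions made for the example; only the 726s ceiling comes from the summary.

```python
import time
from typing import Callable

MAX_WAIT_SECONDS = 726.0  # maximum wait time quoted in the PR summary


def poll_with_backoff(
    fetch_status: Callable[[], dict],
    *,
    initial_delay: float = 0.1,   # assumed starting delay
    factor: float = 2.0,          # assumed growth factor
    max_delay: float = 10.0,      # assumed cap on a single sleep
    max_wait: float = MAX_WAIT_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call fetch_status() until it reports completion or max_wait elapses.

    `fetch_status` is a hypothetical callable returning a dict with a
    boolean "complete" key; the real status object may look different.
    """
    deadline = clock() + max_wait
    delay = initial_delay
    while True:
        status = fetch_status()
        if status.get("complete"):
            return status
        remaining = deadline - clock()
        if remaining <= 0:
            raise TimeoutError(f"Query did not finish within {max_wait:g}s")
        # Never sleep past the deadline, and grow the delay geometrically.
        sleep(min(delay, remaining))
        delay = min(delay * factor, max_delay)
```

Injecting `sleep` and `clock` keeps the helper testable without real waiting, which matters in a test suite where async execution is not available.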
5 file(s) reviewed, 2 comment(s)
if error_message := query_status.get("error_message"):
    raise APIException(error_message)
raise Exception("Query failed")
results_response = query_status["results"]
Why doesn't the node return the exception to the tool call? I think it makes more sense than appending a failure message to the conversation. Failure messages are not associated with a specific tool call, but tool calls are important for orchestration.
In the SQL case, we want to continue the loop to correct runtime errors, but the current approach stops the generation.
What do you mean by returning the exception to the tool call?
Instead of raising the exception, why don't we return the exception to the tool that initiated the insight generation?
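To make the suggestion concrete, here is a minimal sketch of what returning the error to the initiating tool call could look like. The `ToolResult` type and `run_insight_tool` function are hypothetical names invented for illustration and are not taken from the PostHog codebase.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToolResult:
    """Hypothetical result record tied to the tool call that produced it."""
    tool_call_id: str
    content: str
    is_error: bool = False


def run_insight_tool(tool_call_id: str, execute_query: Callable[[], Any]) -> ToolResult:
    """Run the query and report failure as a tool result instead of raising."""
    try:
        results = execute_query()
    except Exception as err:
        # The error stays associated with the tool call, so the caller
        # can decide whether to retry or correct the query.
        return ToolResult(tool_call_id, f"Query failed: {err}", is_error=True)
    return ToolResult(tool_call_id, str(results))
```

With this shape, a SQL runtime error would come back as an error result for that specific call, and the loop could continue rather than stop generation.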
Coming back to this post-offsite:
Hmm, it seems much more versatile to let root handle these errors, as they can be arbitrary. Some are going to be basic and fixable type mismatches, especially in SQL, but other HogQL or ClickHouse errors are not that solvable and would still have to bubble up to root after a couple of retries. Hence going for a general approach here.
I think it's still worth providing HogQL/ClickHouse errors. HogQL errors are usually about syntax (not very descriptive, though) but have information about unsupported functions (since HogQL is only a subset of ClickHouse SQL). ClickHouse errors are trickier as they can be generic (memory or timeout), but they're sometimes helpful (type mismatch). Let's ship this PR as is, and we can revisit better exception handling later.
This is a PAIN to test because Celery doesn't work in our tests, hence tests behave differently from prod.
Problem
Async query errors were handled correctly in tests, but not in prod. I've unified the logic, so it now applies the same way regardless of whether the query executes sync or async (prod is async, tests are not).
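One way to picture the unification, as a sketch only: both paths are reduced to the same status shape, and a single function turns that status into results or an error. The `QueryError` class, the `run_sync` wrapper, and the `error`/`complete` keys are assumptions for this example; `error_message` and `results` are the keys visible in the diff excerpt.

```python
from typing import Any, Callable


class QueryError(Exception):
    """Hypothetical error type raised for any failed query."""


def results_from_status(status: dict) -> Any:
    """Single place where a finished query status becomes results or an error."""
    if status.get("error"):
        raise QueryError(status.get("error_message") or "Query failed")
    return status["results"]


def run_sync(execute: Callable[[], Any]) -> dict:
    """Wrap synchronous execution so failures use the async status shape."""
    try:
        return {"complete": True, "error": False, "results": execute()}
    except Exception as err:
        return {"complete": True, "error": True, "error_message": str(err)}
```

A sync caller would use `results_from_status(run_sync(execute))`, while an async caller would pass the polled status straight to `results_from_status`, so tests and prod exercise the same error branch.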
We also didn't have a human-friendly error on the recursion limit being hit.
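A minimal sketch of the idea, assuming the assistant streams messages from a step-limited runner: the limit error is caught at the streaming boundary and replaced with a readable message. `RecursionLimitReached`, `stream_with_limit`, `stream_assistant`, and the message wording are all hypothetical; only the 48-step limit comes from the PR summary.

```python
from typing import Iterable, Iterator

RECURSION_LIMIT = 48  # step limit quoted in the PR summary

# Hypothetical wording; the actual message in the PR is not shown here.
FRIENDLY_MESSAGE = (
    "I've reached the maximum number of steps for this request. "
    "Please try narrowing down your question."
)


class RecursionLimitReached(Exception):
    """Stand-in for whatever error the graph runner raises at its limit."""


def stream_with_limit(steps: Iterable[str], limit: int = RECURSION_LIMIT) -> Iterator[str]:
    """Yield steps, raising once more than `limit` steps have been requested."""
    for count, step in enumerate(steps, start=1):
        if count > limit:
            raise RecursionLimitReached(f"Exceeded {limit} steps")
        yield step


def stream_assistant(steps: Iterable[str], limit: int = RECURSION_LIMIT) -> Iterator[str]:
    """Stream steps, ending with a friendly message instead of crashing."""
    try:
        yield from stream_with_limit(steps, limit)
    except RecursionLimitReached:
        yield FRIENDLY_MESSAGE
```

Messages produced before the limit still reach the user; only the crash is replaced.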
Changes
Both classes of errors are now fixed.
How did you test this code?
Recursion limit got a test. The query execution errors were actually tested already, but the difference is that tests can't use async query execution – so I've ensured those tests continue to pass, and the fix is actually focused on unifying the logic of async vs sync running.