Skip to content

fix(max): Handle async query and recursion limit errors#30233

Merged
Twixes merged 4 commits intomasterfrom
max-improve-err-handling
Apr 1, 2025
Merged

fix(max): Handle async query and recursion limit errors#30233
Twixes merged 4 commits intomasterfrom
max-improve-err-handling

Conversation

@Twixes
Copy link
Member

@Twixes Twixes commented Mar 20, 2025

Problem

Async query errors were handled correctly in tests, but not in prod. I've unified the logic, so that it applies the same now regardless of the query executing sync or async (prod is async, tests not).
We also didn't have a human-friendly error on the recursion limit being hit.

Changes

Both classes of errors are now fixed.

How did you test this code?

Recursion limit got a test. The query execution errors were actually tested already, but the difference is that tests can't use async query execution – so I've ensured those tests continue to pass, and the fix is actually focused on unifying the logic of async vs sync running.

@Twixes Twixes requested a review from skoob13 March 20, 2025 16:03
Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Summary

This PR improves error handling for async queries and recursion limits in the PostHog AI assistant, focusing on user experience and robustness.

  • Added user-friendly error message when recursion limit (48 steps) is reached in assistant.py, preventing crashes
  • Implemented unified error handling between sync/async query execution with exponential backoff polling in query_executor/nodes.py
  • Added timeout handling for long-running queries with a 726s maximum wait time
  • Added comprehensive test coverage for both sync and async error scenarios in test_assistant.py
  • Improved conversation state management with proper locking and state saving during cancellation

5 file(s) reviewed, 2 comment(s)
Edit PR Review Bot Settings | Greptile

if error_message := query_status.get("error_message"):
raise APIException(error_message)
raise Exception("Query failed")
results_response = query_status["results"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why doesn't the node return the exception to the tool call? I think it makes more sense than appending a failure message to the conversation. Failure messages are not associated with a specific tool call, but tool calls are important for orchestration.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the SQL case, we want to continue the loop to correct runtime errors, but the current approach stops the generation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by returning the exception to the tool call?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of raising the exception, why don't we return the exception to the tool that initiated the insight generation?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Coming back to this post-offsite:
Hmm, it seems much more versatile to let root handle these errors, as they can be arbitrary – while some are going to be basic and fixable type mismatches, especially in SQL, other HogQL or ClickHouse errors are not that solvable and would still have to bubble up to root after a couple retries. Hence going for a general approach here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's still worth providing HogQL/ClickHouse errors. HogQL errors are usually about syntax (not very descriptive, though) but have information about unsupported functions (since HogQL is only a subset of ClickHouse SQL). ClickHouse errors are trickier as they can be generic (memory or timeout), but they're sometimes helpful (type mismatch). Let's ship this PR as is, and we can revisit better exception handling later.

@Twixes Twixes requested a review from skoob13 March 31, 2025 21:48
Twixes added 3 commits April 1, 2025 00:03
This is a PAIN to test because Celery doesn't work in our tests, hence tests behave differently from prod.
@Twixes Twixes force-pushed the max-improve-err-handling branch from 2f1f9ef to d1ab285 Compare March 31, 2025 22:04
@Twixes Twixes enabled auto-merge (squash) April 1, 2025 17:45
@Twixes Twixes merged commit 1de2c52 into master Apr 1, 2025
94 of 95 checks passed
@Twixes Twixes deleted the max-improve-err-handling branch April 1, 2025 18:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants