Checkpointing Robustness #486

Merged
merged 1 commit into from Jan 12, 2024

jacksonrnewhouse
Contributor

This makes Arroyo more robust to transient failures when talking to the state backend, namely S3. Previously, unwraps in the main control loop could panic and result in the job being suspended. We fix this with two changes. First, the controller now checks that a subset of States have an active Receiver. If not, it will be restarted. Second, more of the StateBackend methods now return Result<()>, which lets us handle these errors.
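
As a rough illustration of the two changes described above, here is a minimal, self-contained sketch using only the Rust standard library. It is not the code from this PR: `Backend`, `FlakyBackend`, `CheckpointError`, `ControlMessage`, `Action`, `receiver_is_active`, and `control_step` are all hypothetical names, and the choice to map a backend error to a restart is an assumption made for the example.

```rust
use std::sync::mpsc::Sender;

/// Hypothetical error for a failed write to the state backend.
#[derive(Debug, PartialEq)]
pub struct CheckpointError(pub String);

/// Hypothetical stand-in for a state backend whose methods return
/// Result instead of panicking on failure.
pub trait Backend {
    fn write_checkpoint(&mut self, epoch: u32) -> Result<(), CheckpointError>;
}

/// Test double that fails a fixed number of times, then succeeds.
pub struct FlakyBackend {
    pub failures_left: u32,
}

impl Backend for FlakyBackend {
    fn write_checkpoint(&mut self, epoch: u32) -> Result<(), CheckpointError> {
        if self.failures_left > 0 {
            self.failures_left -= 1;
            return Err(CheckpointError(format!("transient failure at epoch {}", epoch)));
        }
        Ok(())
    }
}

/// Hypothetical message sent from the controller to a worker.
pub enum ControlMessage {
    Ping,
}

/// What the control loop should do next (assumed for this sketch).
#[derive(Debug, PartialEq)]
pub enum Action {
    Continue,
    Restart,
}

/// Liveness check: with std::sync::mpsc, `send` returns Err once the
/// Receiver has been dropped, so a failed ping means nobody is listening.
pub fn receiver_is_active(tx: &Sender<ControlMessage>) -> bool {
    tx.send(ControlMessage::Ping).is_ok()
}

/// One iteration of a control loop that handles errors rather than
/// calling unwrap(). Restarting on any error is an assumption here.
pub fn control_step<B: Backend>(
    backend: &mut B,
    epoch: u32,
    tx: &Sender<ControlMessage>,
) -> Action {
    if !receiver_is_active(tx) {
        return Action::Restart;
    }
    match backend.write_checkpoint(epoch) {
        Ok(()) => Action::Continue,
        Err(err) => {
            eprintln!("checkpoint for epoch {} failed: {:?}", epoch, err);
            Action::Restart
        }
    }
}
```

The point of the sketch is that both failure modes become ordinary values the loop can act on: a dropped Receiver is detected explicitly, and a backend failure arrives as an `Err` rather than a panic from `unwrap()`.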

There are still a number of places where checkpointing errors are fatal, but that is preferable to them reporting as "Running" yet making no progress.

state: add Result<> return type to StateBackend methods.
@jacksonrnewhouse jacksonrnewhouse merged commit 8692233 into master Jan 12, 2024
8 checks passed

2 participants