Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pg sources: subsource health status + snapshot resumption test #18782

Merged

Conversation

sploiselle
Copy link
Contributor

@sploiselle sploiselle commented Apr 16, 2023

As part of #18769, we discovered that the cluster test workflow_pg_snapshot_resumption was not working as intended, so I've refactored it to ensure that the source is down (and that the failure was caused by the failpoint).

I also needed to refactor how multi-output sources propagate their health statuses to the health operator––there were two issues previously:

  • The health operator halted as soon as it saw a halt, which precluded writing other statuses to persist. Solved this by moving the halting outside of the loop.
  • The health operator only kept the latest state for each output index, which introduced a kind of "race" condition with PG sources where it was possible to receive a "Running" status ordered after a failure. This new design ensures that any failure will get propagated to persist––if the source keeps running its status will get corrected in a following batch.

In service of the above, I also refactored some of the way source_reader_pipeline handles buffering.

Motivation

This PR refactors existing code.

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • This PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way) and therefore is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:
    • Resolve an issue where some Postgres source health status updates did not propagate to the source's subsources (i.e. the sources built from the Postgres tables).

@sploiselle sploiselle force-pushed the multi-output-health-refactor branch 2 times, most recently from 9cb0775 to 38195bb Compare April 16, 2023 12:34
@sploiselle sploiselle marked this pull request as ready for review April 16, 2023 13:49
@sploiselle sploiselle requested a review from a team April 16, 2023 13:49
@sploiselle sploiselle changed the title Multi output health refactor pg sources: subsource health status + snapshot resumption test Apr 16, 2023
Propagating irrecoverable/indefinite errors from the source
impl to the health operator was too optimistic about the errors
making it all the way to the health status shard.

We instead want to propagate errors only once per occurrence from
the source reader pipeline and instead handle determining what
to do with those values within the health_operator itself.

The changes in the health operator make errors more noticeable,
while still letting indefinite errors get cleared from being
reported.

Additionally, when encountering a halting error, we still let
all other updates that occurred in parallel to be written; we
previously crashed the process too eagerly.
@sploiselle
Copy link
Contributor Author

@aljoscha @petrosagg I know there was more conversation about this in Slack but as long as we intend to include definite errors in this collection, I believe the demuxing done previously is necessary, and if that sticks around then we should also apply this patch. Let me know if there are additional concerns or anything I can do to help with the review.

Copy link
Contributor

@aljoscha aljoscha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks very good to me! I agree that we should fix this, even if we decide not to surface definite errors in the status collections.

@sploiselle sploiselle merged commit 02ac7eb into MaterializeInc:main Apr 19, 2023
@sploiselle sploiselle deleted the multi-output-health-refactor branch April 19, 2023 13:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants