pg sources: demux subsource errors #18539

sploiselle · 2023-03-31T18:47:42Z

Previously, an error on any subsource produced an error that prevented all subsources from being used. We can instead de-multiplex subsource errors from the replication stream into n error streams (just as we do for the ok streams). This is a huge UX win as it means one mistake doesn't wedge the entire source.

Additionally, we can include the health of the subsources in the mz_internal.mz_source_status_history table.
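As a rough illustration of the demultiplexing described above, here is a minimal sketch in plain Rust, outside of any dataflow framework. `Row`, `SourceError`, `SubsourceOutput`, and `demux` are hypothetical names invented for this example and are not the types used in the PR; the sketch only shows that an error tagged with one output index leaves the other outputs readable.

```rust
use std::collections::BTreeMap;

/// Hypothetical stand-in for a decoded row.
pub type Row = String;
/// Hypothetical stand-in for a per-table replication error.
pub type SourceError = String;

/// The ok and error updates collected for a single subsource.
#[derive(Debug, Default, PartialEq)]
pub struct SubsourceOutput {
    pub oks: Vec<Row>,
    pub errs: Vec<SourceError>,
}

impl SubsourceOutput {
    /// In this sketch a subsource is readable while it has no errors of its own.
    pub fn is_readable(&self) -> bool {
        self.errs.is_empty()
    }
}

/// Split one multiplexed stream of `(output index, result)` pairs into
/// per-subsource ok and error collections.
pub fn demux(
    stream: impl IntoIterator<Item = (usize, Result<Row, SourceError>)>,
) -> BTreeMap<usize, SubsourceOutput> {
    let mut outputs: BTreeMap<usize, SubsourceOutput> = BTreeMap::new();
    for (output, update) in stream {
        let entry = outputs.entry(output).or_default();
        match update {
            Ok(row) => entry.oks.push(row),
            Err(err) => entry.errs.push(err),
        }
    }
    outputs
}
```

With this shape, an error on output 2 only marks output 2 as unreadable; output 1 keeps serving its rows.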

Motivation

This PR adds a known-desirable feature. Closes #17490

Checklist

This PR has adequate test coverage / QA involvement has been duly considered.
This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
This PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way) and therefore is tagged with a T-proto label.
If this PR will require changes to cloud orchestration, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
This PR includes the following user-facing behavior changes:
- Allow reads from all non-errored tables replicated from PostgreSQL into Materialize. Previously, if any table encountered an error during replication, Materialize prohibited reads from all tables replicated via the same source.

sploiselle · 2023-04-03T23:27:05Z

@petrosagg Ready for review; will fixup failing doctests in next push

philip-stoev · 2023-04-05T09:32:32Z

Do you mind if I test this tomorrow? Thank you.

sploiselle · 2023-04-05T13:02:30Z

@philip-stoev no problem

philip-stoev

Apologies for the delay and thank you for refactoring the tests. I think they cover all the scenarios that are relevant.

petrosagg · 2023-04-11T17:54:25Z

src/storage/src/source/types.rs

+    ///
+    /// This should be 0 to output an error for the primary source; however, you will need to
+    /// determine a numbering scheme to output errors for subsources.
+    pub output: usize,


I think we want to go the other way with where the output index lives. The only reason output is a member of SourceMessage is that this error demuxing work had not been done. Now that it is done, SourceRender::render should instead return a collection of type Collection<G, (usize, Result<SourceMessage<Self::Key, Self::Value>, SourceReaderError>), Diff>. Similarly, for the health status updates we should return a Stream<G, (usize, HealthStatusUpdate)>.

This makes partitioning into N separate streams trivial and also removes the notion of multiple outputs from these structs, which don't need to know about outputs at all.
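To make the point about the index living outside the payload concrete, here is a minimal sketch using plain vectors rather than timely streams. The `partition` helper is hypothetical, and the three types below are simplified stand-ins that only borrow the names mentioned in this comment; their real definitions are not shown in this thread.

```rust
/// Simplified stand-in; the real SourceMessage is generic over key and value.
#[derive(Debug, PartialEq)]
pub struct SourceMessage {
    pub value: String,
}

/// Simplified stand-in for the reader error type.
#[derive(Debug, PartialEq)]
pub struct SourceReaderError(pub String);

/// Simplified stand-in for a health update.
#[derive(Debug, PartialEq)]
pub enum HealthStatusUpdate {
    Running,
    Stalled(String),
}

/// Route `(output index, payload)` pairs into `num_outputs` buckets.
/// The payload type never needs to know which output it belongs to.
/// Panics if an index is out of range, which this sketch treats as a bug.
pub fn partition<T>(
    num_outputs: usize,
    updates: impl IntoIterator<Item = (usize, T)>,
) -> Vec<Vec<T>> {
    let mut outputs: Vec<Vec<T>> = (0..num_outputs).map(|_| Vec::new()).collect();
    for (output, payload) in updates {
        outputs[output].push(payload);
    }
    outputs
}
```

Because the index is carried next to the payload, the same helper serves both `(usize, Result<SourceMessage, SourceReaderError>)` and `(usize, HealthStatusUpdate)` without either type growing an `output` field.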

sploiselle · 2023-04-13T03:13:05Z

@petrosagg From our discussions:

- Moved the health operator construction into build_ingestion_dataflow; tysm for your help with Scope
- Addressed pg sources: demux subsource errors #18539 (comment) in the last two commits

I left this work in new commits rather than aggressively pushing them down to simplify review, but if there's anything to go back through CI for I'll tidy things up.

petrosagg

Looks great! Thanks for pushing this through. I left a comment that I think we should address so that we don't accidentally release something that puts more pressure on CRDB, but it should be an easy change.

petrosagg · 2023-04-13T11:31:17Z

src/storage/src/render/mod.rs

                        source_data,
                        storage_state,
                        metrics,
                    )
                };
                tokens.push(token);
+
+                let health_token = crate::source::health_operator(


You can delay rendering the health operator until after this for loop. Here you can just collect all the health streams in a vector:

health_streams.push(health);

And then render a single health operator with the combined stream:

use timely::dataflow::operators::Concatenate;

if let Some(health) = health_streams.pop() {
    let combined_health = health.concatenate(health_streams);
    // A single operator for all the data
    let health_token = crate::source::health_operator(..., combined_health);
}

This is desirable because each health operator competes with every other one to compare-and-append to the shard, so it's beneficial to have as few of them as possible. In fact, in a future refactor we should make it a single one per clusterd, but for now we should at least keep their number the same.

sploiselle · 2023-04-14

Ty for this pointer; this actually means we don't need to partition the stream at all: we can pass it through and "demux" it in the operator itself by keeping the output_index intact. This makes the operator more complex because it maintains state for each output, but I don't think it's too bad. I'm going to merge this if it's green and can refactor later if it's too ugly to bear.
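A minimal sketch of what "demuxing in the operator itself" could look like, assuming a single operator that receives `(output_index, update)` pairs and keeps per-output state. `HealthOperator`, its `process` method, and the simplified `HealthStatusUpdate` enum are hypothetical illustrations; in particular, the "only write when the status changes" rule is an assumption made for the example, not something stated in this thread.

```rust
use std::collections::HashMap;

/// Simplified stand-in for a health update.
#[derive(Clone, Debug, PartialEq)]
pub enum HealthStatusUpdate {
    Running,
    Stalled(String),
}

/// One operator for the whole source; state is tracked per output index,
/// so the stream never has to be partitioned.
#[derive(Default)]
pub struct HealthOperator {
    last_reported: HashMap<usize, HealthStatusUpdate>,
}

impl HealthOperator {
    /// Consume a batch from the combined stream and return the updates that
    /// would be written out (assumption: only statuses that changed).
    pub fn process(
        &mut self,
        batch: impl IntoIterator<Item = (usize, HealthStatusUpdate)>,
    ) -> Vec<(usize, HealthStatusUpdate)> {
        let mut to_write = Vec::new();
        for (output, update) in batch {
            if self.last_reported.get(&output) != Some(&update) {
                self.last_reported.insert(output, update.clone());
                to_write.push((output, update));
            }
        }
        to_write
    }
}
```

The cost noted above is visible here: the operator now owns a map keyed by output index instead of a single status.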

CLAassistant · 2023-04-14T19:08:45Z

All committers have signed the CLA.

It looks like toxiproxy gets us to issue SuspendAndRestart commands based on the health reporting for sources. Now that subsources report their own health, we might issue this command for them. Add a check that simply prevents this from causing a panic, even though there is no actual work to be done here because subsources' tokens belong to their primary source.
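The kind of guard described here might look roughly like the following sketch. `GlobalId`, `Token`, and `suspend_and_restart` are hypothetical names for illustration, and treating "drop the token" as the suspend step is an assumption; the point is only that a missing token is handled as a no-op rather than unwrapped.

```rust
use std::collections::HashMap;

/// Hypothetical identifier for a source or subsource.
pub type GlobalId = u64;

/// Hypothetical handle whose drop shuts down a running source (assumption).
pub struct Token;

/// Handle a suspend-and-restart request for `id`.
/// Returns true if a token was found and dropped, false if there was
/// nothing to do (for example, because `id` names a subsource whose
/// token is owned by its primary source).
pub fn suspend_and_restart(tokens: &mut HashMap<GlobalId, Token>, id: GlobalId) -> bool {
    match tokens.remove(&id) {
        Some(token) => {
            // Dropping the token stands in for suspending the source.
            drop(token);
            true
        }
        // No token under this id: do nothing instead of panicking.
        None => false,
    }
}
```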

sploiselle requested review from petrosagg and a team March 31, 2023 18:47

sploiselle requested a review from a team as a code owner March 31, 2023 18:47

sploiselle force-pushed the pg-subsource-errors branch 6 times, most recently from 85c20f3 to 79fd9d7 April 3, 2023 22:30

sploiselle mentioned this pull request Apr 4, 2023

storage/sources/postgres: detect dropped tables #18519

Closed

philip-stoev self-requested a review April 5, 2023 09:32

philip-stoev approved these changes Apr 10, 2023

petrosagg reviewed Apr 11, 2023

sploiselle force-pushed the pg-subsource-errors branch 2 times, most recently from 6248513 to 26bda01 April 13, 2023 02:17

petrosagg approved these changes Apr 13, 2023

sploiselle force-pushed the pg-subsource-errors branch 2 times, most recently from 4cc5a05 to 3912047 April 14, 2023 19:08

sploiselle requested a review from benesch as a code owner April 14, 2023 19:08

sploiselle requested a review from a team April 14, 2023 19:08

sploiselle requested review from a team as code owners April 14, 2023 19:08

Sean Loiselle added 2 commits April 14, 2023 15:19

storage: demux error collections

9e1a9bc

storage: render source w/ demux err stream

0555081

Sean Loiselle added 20 commits April 14, 2023 15:19

storage: replace RawSourceCreationConfig::num_outputs with SourceExports

ce26669

in preparation for creating health status shards for subsources, we need the source export information.

sql: make subsources eligible for source status history

adf9fc4

storage: construct health check for all source exports

13307c6

storage: make subsource statuses usable

64b83ed

pg sources: produce inline errors during snapshot/replication

ddd143a

pg sources: allow sending errors via InternalMessage::Value

6554f08

pg sources: refactor table compatibility to error subsources

a0a5514

pg sources: generate inline errors during snapshots

9034715

pg sources: produce inline errors during replication

4cc3b04

pg sources: improve logging/status propagating for subsources

f4b135b

- Propagate primary source health status to subsources
- Log warnings on subsource errors

pg sources: test demuxed error streams

625fd4c

test: test demuxed errors in pg-cdc-resumption

43ad027

pg sources: add tests for unsupported operations

78dcb21

pg sources: test adversarial schema change

389df4a

pg sources: document source table cleanup

d0f5d46

adapter: add subsources to mz_source_statuses

6aad25c

storage: move health operator construction into build_ingestion_dataflow

d269ca2

storage: obviate need for output_index in SourceMessage/Error

531bc5c

storage: remove output index from source messages

cdb791f

sploiselle force-pushed the pg-subsource-errors branch 2 times, most recently from b14941b to b3e79bc April 14, 2023 19:42

storage: demultiplex health_operator in stream

820391f

Prior commits demultiplexed the health operator by means of partitioning the stream. Ultimately, we only want to create a single health operator per source, so instead only logically demultiplex the stream's outputs in the operator itself.

sploiselle force-pushed the pg-subsource-errors branch from b3e79bc to 820391f April 14, 2023 20:11

sploiselle enabled auto-merge (rebase) April 14, 2023 20:12

sploiselle merged commit b1f1f16 into MaterializeInc:main Apr 14, 2023

sploiselle deleted the pg-subsource-errors branch April 14, 2023 20:56

materialize-bot mentioned this pull request Apr 21, 2023

release: v0.52.0 required reviews #18907

Closed

12 tasks

morsapaes mentioned this pull request Apr 27, 2023

doc/user: polish v0.52 release notes #19000

Merged

morsapaes mentioned this pull request May 25, 2023

doc: Postgres direct source integration guides #19453

Merged
