
storage: real-time recency MVP #25195

Merged · 1 commit · May 24, 2024

Conversation

sploiselle
Contributor

@sploiselle sploiselle commented Feb 13, 2024

MVP for real-time recency. Once this merges, it should be appropriate for private preview.

Motivation

This PR adds a known-desirable feature. MaterializeInc/database-issues#4632

Checklist

  • This PR has adequate test coverage / QA involvement has been duly considered.
  • This PR has an associated up-to-date design doc, is a design doc (template), or is sufficiently small to not require a design.
  • If this PR evolves an existing $T ⇔ Proto$T mapping (possibly in a backwards-incompatible way), then it is tagged with a T-proto label.
  • If this PR will require changes to cloud orchestration or tests, there is a companion cloud PR to account for those changes that is tagged with the release-blocker label (example).
  • This PR includes the following user-facing behavior changes:

@sploiselle
Contributor Author

@MaterializeInc/testing Could use y'all's expertise in developing a testing suite against this prototype of real-time recency. The idea behind RTR is that when you issue a RTR query, we wait until we've ingested all data that is visible in the external system at the "same" physical time that the query gets issued.

Ideally, we would:

  1. Introduce a delay in ingestion, simulating ingestion lag (but not a transient network error from the broker)
  2. Produce Kafka data
  3. Issue a non-RTR query and show the data is not present
  4. Issue a RTR query and show that the data is present.

The delay we introduce would need to be long enough that we can convince ourselves that we're not just "getting lucky."

Which testing framework would you recommend for this? My guess is that we'd instrument the Kafka source with a failpoint to introduce the latency, and then everything else could be done in testdrive.

One small wrinkle with that approach is that I would ultimately like to query multiple Kafka sources, each with its own distinct latency, and getting that wired up with failpoints seems fussy, though maybe it's doable?

@def-
Contributor

def- commented Feb 13, 2024

For the ingestion lag, Toxiproxy is good; we already use it in some tests, and you can make ingestion as slow as you want.
Testdrive seems fine for the queries. I think the kafka-resumption tests in test/kafka-resumption/mzcompose.py come relatively close to this.

> One small wrinkle with that approach is that I would ultimately like to query multiple Kafka sources, each with its own distinct latency, and getting that wired up with failpoints seems fussy, though maybe it's doable?

This is doable with Toxiproxy: you can have multiple Kafka sources, each with its own delay; see for example workflow_sink_networking.

If this is too cumbersome, I can also write a test for you.
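
For reference, the core of the Toxiproxy approach is just two HTTP calls per source. A minimal sketch of the idea (not taken from this PR's tests; the admin address, ports, proxy names, and delays below are made-up assumptions):

import requests

TOXIPROXY_API = "http://toxiproxy:8474"  # assumed address of the Toxiproxy admin API

def add_lagged_kafka_proxy(name: str, listen_port: int, upstream: str, latency_ms: int) -> None:
    # Create a proxy that the source's broker address points at.
    requests.post(
        f"{TOXIPROXY_API}/proxies",
        json={"name": name, "listen": f"0.0.0.0:{listen_port}", "upstream": upstream},
    ).raise_for_status()
    # Attach a latency toxic: everything flowing through this proxy is delayed.
    requests.post(
        f"{TOXIPROXY_API}/proxies/{name}/toxics",
        json={"type": "latency", "attributes": {"latency": latency_ms}},
    ).raise_for_status()

# Two sources, each behind its own proxy with its own delay.
add_lagged_kafka_proxy("kafka_src_1", 19092, "kafka:9092", latency_ms=2_000)
add_lagged_kafka_proxy("kafka_src_2", 19093, "kafka:9092", latency_ms=10_000)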

@sploiselle sploiselle force-pushed the rtr-prototype branch 2 times, most recently from 34cbeae to 51f68da Compare February 16, 2024 13:24
@sploiselle
Contributor Author

@def- This is in a shape where we can begin testing it. The RTR operation itself is expensive, so opening 100 (1000? 10?) clients doing simultaneous RTR queries might fall over in some unexpected way. The most meaningful tests will ensure we uphold our semantics, but performance is a secondary concern.

I'm going to work on seeing if there's a way to make this the default for testing queries from Kafka sources but that seems a little tricky.
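
For illustration, a throwaway harness for the concurrency question might look like the sketch below. The connection details, the real_time_recency session-variable name, and the queried object are assumptions, not the actual test setup; the real tests would live in mzcompose/parallel-workload.

import time
from concurrent.futures import ThreadPoolExecutor

import psycopg2

def run_rtr_query(_: int) -> float:
    conn = psycopg2.connect(host="localhost", port=6875, user="materialize", dbname="materialize")
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("SET real_time_recency = true")            # assumed session-variable name
        start = time.monotonic()
        cur.execute("SELECT count(*) FROM kafka_source_tbl")   # assumed object name
        cur.fetchall()
        return time.monotonic() - start

with ThreadPoolExecutor(max_workers=100) as pool:
    latencies = list(pool.map(run_rtr_query, range(100)))
print(f"slowest of {len(latencies)} concurrent RTR queries: {max(latencies):.2f}s")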

@sploiselle
Contributor Author

@nrainer-materialize Here's the real-time recency work we chatted about expanding the tests of. I believe @def- had some outstanding tests, but am not sure exactly what he'd planned or had in mind.

@nrainer-materialize
Contributor

> @nrainer-materialize Here's the real-time recency work we chatted about expanding the tests of. I believe @def- had some outstanding tests, but am not sure exactly what he'd planned or had in mind.

Thanks! I will have a look and work on it next week.

@sploiselle
Contributor Author

@rjobanp You have bandwidth/interest in adding a MySQL real-time recency test akin to the Kafka and PG ones? I'm sure you're much defter than I with their dialect of SQL, so thought I'd ask. nbd if you're bandwidth constrained.

@rjobanp
Contributor

rjobanp commented May 3, 2024

@sploiselle based on the pg-rtr test it should be almost identical for you to add a mysql one -- the only change is that generate_series doesn't exist in MySQL, but you can use the @i:=@i+1 pattern we have here: https://github.com/rjobanp/materialize/blob/766595589acdf619512ac8057b5df5cfb7c40187/test/mysql-cdc/mzcompose.py#L164

@nrainer-materialize
Contributor

nrainer-materialize commented May 7, 2024

Oh, I do get some errors for #25566 in https://buildkite.com/materialize/nightly/builds/7683, though I haven't had the time to take a deeper look yet.

@nrainer-materialize
Contributor

nrainer-materialize commented May 9, 2024

Alright. Regarding testing:

1) RTR enabled everywhere

#25463, originally by Dennis, enables RTR in all of CI. That PR is a one-shot test and is not supposed to be merged.

Tests pipeline

Nightly

  • The feature benchmark fails in scenario StartupLoaded, which is three times slower. However, I am actually surprised that we don't see more performance regressions here, given the testdrive build timeouts in the tests pipeline.
  • Testdrive builds time out here as well, although I had increased the timeouts for nightly too.
  • Four parallel workload tests fail with timed out before ingesting the source's visible frontier when real-time-recency query issued.

2) Additional tests

#25566, originally by Dennis, adds additional RTR tests and extends existing test frameworks (data-ingest, parallel-workload). That branch is based on this PR's branch. All additional commits of that branch should be cherry-picked to this branch (except for 719ac30, which was cherry-picked from main and could cause conflicts).

Added tests in tests pipeline

Of the added tests, some do not pass but presumably should. They are in the resumption workflow in kafka-rtr, manipulate the connection, fail with BrokerTransportFailure (Local: Broker transport failure), and were disabled with 8071425. Consequently, the build step with the remaining workflow passes.

Nightly

The nightly pipeline (nearly) passes after adjusting some timeouts.

Summary

To sum up, there still seem to be some issues:

  • The feature comes with a major performance impact.
  • Postgres-CDC and MySQL-CDC results appear to no longer be correct when using this feature. (This turned out to be an issue in the tests, fixed with "test: adjust existing tests w/ correct semantics discovered by RTR" #27063.)
  • When run concurrently (parallel-workload), the error timed out before ingesting the source's visible frontier when real-time-recency query issued may occur.
  • Resumption tests with this feature were not successful.

Let me know if you have any questions!

@sploiselle
Contributor Author

@nrainer-materialize Thank you SO much for the detailed reporting here. The nightly timeouts make sense to me, but I'm pretty spooked by the result mismatches.

@nrainer-materialize
Contributor

@sploiselle, please let me know if you need further support or want me to retest something. Thank you.

@sploiselle
Contributor Author

@nrainer-materialize Will do! I'm making my way through the feedback you provided and so far all of the issues have been either issues of scale or subtle issues with the tests themselves. Planning a detailed accounting of what's to be done, but I'm feeling good that this prototype is in the right shape.

@sploiselle sploiselle left a comment (Contributor Author)

A bit of mess to clean up from integrating QA's tests, as well as notes to tighten up.

src/adapter/src/coord/sequencer/inner.rs (outdated, resolved)
misc/python/materialize/parallel_workload/action.py (outdated, resolved)
misc/python/materialize/parallel_workload/action.py (outdated, resolved)
misc/python/materialize/data_ingest/workload.py (outdated, resolved)
src/adapter/src/coord/message_handler.rs (outdated, resolved)
src/storage-controller/Cargo.toml (outdated, resolved)
(connection.clone(), *remap_collection_id)
}

// These internal sources will never support RTR.
Contributor Author

Never is too strong a word; maybe they should support RTR?

Member

And lo, it came to pass...

Contributor

We should produce an error here as there are no recency guarantees we can offer

@sploiselle sploiselle marked this pull request as ready for review May 14, 2024 23:15
@sploiselle sploiselle requested review from a team May 14, 2024 23:15
@sploiselle sploiselle requested review from a team as code owners May 14, 2024 23:15
@sploiselle sploiselle requested a review from maddyblue May 14, 2024 23:15

shepherdlybot bot commented May 14, 2024

Risk Score: 83/100 · Bug Hotspots: 8 · Resilience Coverage: 66%

Mitigations

Completing required mitigations increases Resilience Coverage.

  • (Required) Code Review 🔍 Detected
  • (Required) Feature Flag
  • (Required) Integration Test 🔍 Detected
  • (Required) Observability 🔍 Detected
  • (Required) QA Review 🔍 Detected
  • (Required) Run Nightly Tests
  • Unit Test
Risk Summary:

The risk score for this pull request is high at 83, indicating a significant likelihood of introducing a bug. This assessment is driven by predictors such as the average line count and the number of executable lines within files. There are 8 modified files that are known hotspots for bugs, which increases the risk. Historically, pull requests with similar characteristics are 124% more likely to cause a bug compared to the repository's baseline. While the repository's predicted bug trend is decreasing, which is a positive sign, the observed bug trend remains steady.

Note: The risk score is not based on semantic analysis but on historical predictors of bug occurrence in the repository. The attributes above were deemed the strongest predictors based on that history. Predictors and the score may change as the PR evolves in code, time, and review activity.

Bug Hotspots:

File Percentile
../inner/peek.rs 94
../src/lib.rs 91
../session/vars.rs 97
../src/coord.rs 99
../sources/mysql.rs 95
../sequencer/inner.rs 98
../coord/message_handler.rs 96
../src/lib.rs 98

@sploiselle
Contributor Author

> The feature comes with a major performance impact.

The performance impact mentioned here is largely (entirely?) related to latency. We expect the latency of RTR queries to exceed that of non-RTR queries by an amount commensurate with the speed at which we ingest data and propagate the ingestion through the entire system.

The prototype itself doesn't introduce much undue latency to the queries (i.e. envd cannot determine that a value has been ingested much more quickly than by listening to the remap shard).

> When run concurrently (parallel-workload), the error timed out before ingesting the source's visible frontier when real-time-recency query issued may occur.

The scalability of the prototype is not great! We open a separate connection to the upstream object for each query. I expect the number of concurrent queries we support to be quite low, and this is something we can both get customer feedback on and easily improve. (e.g. an easy win is to allow only one outstanding RTR query per connection object and stash all queries waiting on that object in a queue; once the current query returns, we can issue another RTR timestamp fetch, and the timestamp returned is valid for all queued queries).
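
To make that queuing idea concrete, here is a rough Python sketch (the real implementation would live in envd's Rust code; fetch_external_frontier stands in for the expensive upstream round trip):

import queue
import threading

class RtrTimestampBatcher:
    # One instance per upstream connection object. At most one frontier fetch is in
    # flight at a time; queries that arrive while a fetch is running are queued and
    # all share the timestamp of the next fetch, which is recent enough for each of
    # them because they were issued before that fetch started.

    def __init__(self, fetch_external_frontier):
        self._fetch = fetch_external_frontier          # the expensive upstream round trip
        self._pending: queue.SimpleQueue = queue.SimpleQueue()
        threading.Thread(target=self._run, daemon=True).start()

    def timestamp_for_query(self):
        # Called once per RTR query; blocks until a fresh-enough timestamp is known.
        reply: queue.SimpleQueue = queue.SimpleQueue()
        self._pending.put(reply)
        return reply.get()

    def _run(self):
        while True:
            # Take everyone waiting right now as one batch...
            batch = [self._pending.get()]
            while True:
                try:
                    batch.append(self._pending.get_nowait())
                except queue.Empty:
                    break
            # ...do a single fetch, and hand its result to the whole batch.
            ts = self._fetch()
            for reply in batch:
                reply.put(ts)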

> Resumption tests with this feature were not successful.

The resumption tests are, unfortunately, pathologically structured and can never succeed. One of the tenets of the current prototype is that we use the same connection object as the source itself (i.e. it must have the same parameters, etc.). The current design of the resumption tests introduces a partition (of sorts) between envd and the source's upstream object, and then issues a RTR query. Unfortunately, this partition not only affects the ingestion, but also impedes our ability to determine the RTR timestamp (which is why all of the queries fail).

A trickier design here would be to issue the RTR query, then introduce the network issue, and then fix it. The timing of all of this seems very racy, though, so I'm not sure exactly how we'd want to instrument it.
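
As a rough illustration of that trickier ordering (the connection details, real_time_recency variable name, source name, and Toxiproxy proxy name are all assumptions, and the sleep-based synchronization is exactly the racy part):

import threading
import time

import psycopg2
import requests

TOXIPROXY_API = "http://toxiproxy:8474"

def set_proxy_enabled(name: str, enabled: bool) -> None:
    # Toxiproxy lets you switch a proxy off and back on to simulate a partition and its repair.
    requests.post(f"{TOXIPROXY_API}/proxies/{name}", json={"enabled": enabled}).raise_for_status()

def issue_rtr_query(results: list) -> None:
    conn = psycopg2.connect(host="localhost", port=6875, user="materialize", dbname="materialize")
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("SET real_time_recency = true")            # assumed session-variable name
        cur.execute("SELECT count(*) FROM kafka_source_tbl")   # assumed object name
        results.append(cur.fetchall())

results: list = []
t = threading.Thread(target=issue_rtr_query, args=(results,))
t.start()

time.sleep(5)                               # hope envd's frontier fetch already returned (the racy part)
set_proxy_enabled("kafka_src_1", False)     # now partition the ingestion path
time.sleep(10)
set_proxy_enabled("kafka_src_1", True)      # heal it; ingestion resumes and catches up
t.join()
assert results, "the RTR query should return once ingestion catches back up"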

@sploiselle
Contributor Author

@MaterializeInc/storage This is ready for a code review. Note that this doesn't include a mysql-rtr test, though it probably should. QA tested that the existing tests do pass with RTR enabled and this is only moving to PrPr, so I didn't sweat it too profusely.

@nrainer-materialize
Contributor

> @MaterializeInc/storage This is ready for a code review. Note that this doesn't include a mysql-rtr test, though it probably should. QA tested that the existing tests do pass with RTR enabled and this is only moving to PrPr, so I didn't sweat it too profusely.

I can add some if we consider that important!

@nrainer-materialize
Contributor

> @MaterializeInc/storage This is ready for a code review. Note that this doesn't include a mysql-rtr test, though it probably should. QA tested that the existing tests do pass with RTR enabled and this is only moving to PrPr, so I didn't sweat it too profusely.

@sploiselle: I added mysql-rtr tests with 06b63ed.

@sploiselle
Contributor Author

@nrainer-materialize tysm!

@sploiselle sploiselle changed the title [DNR] storage: real-time recency prototype storage: real-time recency MVP May 17, 2024
@@ -2016,22 +2016,27 @@ impl Coordinator {
// data. We "cheat" a little bit and filter out any IDs that aren't
// user objects because we know they are not a RTR source.
let mut to_visit = VecDeque::from_iter(source_ids.filter(GlobalId::is_user));
let mut visited = BTreeSet::new();
// If none of the sources are user objects, we don't support RTR.
Member

Suggested change:
- // If none of the sources are user objects, we don't support RTR.
+ // If none of the sources are user objects, we don't need to provide a RTR timestamp.

What happens in the caller though? The query returns immediately? Or times out because it can't get a timestamp to wait for?

Member

Do we have a test case for this? If not, could one be added easily?

Contributor Author

> What happens in the caller though? The query returns immediately? Or times out because it can't get a timestamp to wait for?

Their query is not influenced by any external system's frontier, so the RTR machinery just isn't involved in determining a timestamp. We can test this by querying something that doesn't depend on a user source and showing that it returns values; it certainly doesn't hurt to demonstrate.

@@ -59,8 +59,10 @@ def workflow_simple(c: Composition) -> None:
)


# TODO: All failure modes fail with: timed out before ingesting the source's visible frontier when real-time-recency query issued. input_1 failed to ingest data up to the real-time recency point
def workflow_resumption(c: Composition) -> None:
# It is impossible for this test to succeed because the network failures
Member

More accurate to say that this can't pass without making the RTR-related connections robust to network failure?
Is this something that would only make sense to re-visit in the context of the work to add HA/"always on" behavior? Or resuming these broken connections could add value even before HA work is done?
If this is correct, should we open an issue to work on this at some point, and add it to the mega tracker?

Contributor Author

For this to work, we would have to do something other than blithely partition envd away from the external system. If you partition envd away from the external system, you cannot reach out to it to determine its frontier. For this test to work, we would need to do something trickier, like only partitioning envd away from the external system after the thread performing the RTR determination has received a response from the query that determines the external frontier.

Contributor

What is this meant to test though? If it can never be fixed do we need this function in the codebase?

Contributor Author

@MaterializeInc/testing I'm going to remove these functions because, as written, they cannot succeed, but I will leave a TODO for @def- to investigate a way to make them work, with a link to the commits in #25566 that have the original tests.

Contributor

Fine with me.

@bosconi bosconi requested a review from a team May 19, 2024 23:49
INSERT INTO table_a SELECT 1,2 FROM generate_series(1, 100);
INSERT INTO table_b SELECT 1,2 FROM generate_series(1, 100);

> SELECT sum < 4000207 FROM sum;
Contributor

This has the potential to flake, right? It relies on this query executing fast enough before the inserts in pg above are ingested. Maybe we should only test that the correct result is served on the first try

Contributor Author

INSERT INTO table_a SELECT 1,2 FROM mysql.time_zone t1, mysql.time_zone t2 LIMIT 100;
INSERT INTO table_b SELECT 1,2 FROM mysql.time_zone t1, mysql.time_zone t2 LIMIT 100;

> SELECT sum < 4000207 FROM sum;
Contributor

Same concern here for flakes

Contributor Author

t.join()


def workflow_multithreaded(c: Composition) -> None:
Contributor

What are we trying to test with this multithreaded setup? Can we record the intention in the doc comment for this workflow?

Contributor Author

@MaterializeInc/testing PTAL

test/kafka-rtr/multithreaded/verify-rtr.td (outdated, resolved)
test/kafka-rtr/multithreaded/mz-setup.td (outdated, resolved)

async fn decode_remap_data_until_geq_external_frontier<
ExternalFrontier: SourceTimestamp,
T: Timestamp + Lattice + Codec64 + From<EpochMillis>,
Contributor

The logic in the body does not work for any Lattice timestamp. We either need a TotalOrder here or (my preference) just make this specific over MzTimestamp

Contributor Author

Do you have pointers for concretizing this to mz_repr::Timestamp? The fact that the storage controller is made available through a trait makes this hard to do.

) -> BoxFuture<'static, T> {
// Dummy implementation
Box::pin(async { T::minimum() })
ids: BTreeSet<GlobalId>,
Contributor

The doc comment still refers to this argument as source_ids. Does the name change imply that this can contain more than just sources?

Contributor Author

Clarified the comment


src/storage-types/src/sources.rs (outdated, resolved)
src/storage-types/src/sources.rs (outdated, resolved)
@maddyblue maddyblue left a comment (Contributor)

Adapter parts lgtm.

src/adapter/src/coord/message_handler.rs (outdated, resolved)
src/adapter/src/coord/sequencer/inner.rs (outdated, resolved)
@sploiselle sploiselle force-pushed the rtr-prototype branch 3 times, most recently from ed0804d to aead2fd Compare May 21, 2024 05:28
Comment on lines +55 to +69
match connection {
GenericSourceConnection::Kafka(kafka) => {
let external_frontier = kafka
.fetch_write_frontier(&config)
.await
.map_err(StorageError::Generic)?;

decode_remap_data_until_geq_external_frontier(
id,
external_frontier,
as_of,
remap_subscribe,
)
.await
}
Contributor

Not covered by any test according to the coverage report

Contributor Author

Is it possible the coverage report is mistaken? The Kafka RTR tests should be using this.

Contributor

Yes, that's definitely an option! The report is not 100% accurate.

Comment on lines +84 to +96
GenericSourceConnection::MySql(my_sql) => {
let external_frontier = my_sql
.fetch_write_frontier(&config)
.await
.map_err(StorageError::Generic)?;

decode_remap_data_until_geq_external_frontier(
id,
external_frontier,
as_of,
remap_subscribe,
)
.await
Contributor

Not covered by any test according to the coverage report

Contributor Author

@@ -120,6 +129,61 @@ impl<C: ConnectionAccess> KafkaSourceConnection<C> {
}
}

impl KafkaSourceConnection {
Contributor

Class not covered by any test according to the coverage report?

Contributor Author

Can you double-check my thinking here that these should be covered by the Kafka RTR tests? Glad to hop on a Zoom to hammer this out.

@@ -135,6 +135,36 @@ pub static MYSQL_PROGRESS_DESC: Lazy<RelationDesc> = Lazy::new(|| {
.with_column("transaction_id", ScalarType::UInt64.nullable(true))
});

impl MySqlSourceConnection {
Contributor

Class not covered by any test according to the coverage report?
Or should this be covered by the MySQL tests that I added?

Contributor Author

I inadvertently lost the diff that contained your tests; recovered it and they're back in this PR.

@sploiselle sploiselle force-pushed the rtr-prototype branch 2 times, most recently from b362abb to 0840961 Compare May 21, 2024 14:11
@sploiselle
Contributor Author

@petrosagg Addressed all of your feedback modulo erroring if you have RTR enabled and query a load generator source. I think the ergonomics of that are less friendly than they could be (we let you query everything else that doesn't actually support RTR; making this an exception feels odd to me). I'd prefer to educate users about those semantics in documentation.

Also gave up on semantic commits and everything is just in one commit now. Apologies if that introduces any overhead for you.

@petrosagg petrosagg left a comment (Contributor)

Looks good, thanks for addressing the comments


@sploiselle sploiselle merged commit a1d2529 into MaterializeInc:main May 24, 2024
79 checks passed
@bosconi bosconi added the T-recency Theme: recency / Real-Time Recency (RTR) label Jun 3, 2024