-
Notifications
You must be signed in to change notification settings - Fork 454
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
panic when transactional workload is mixed with concurrent DDL: cannot fail to peek: CollectionMissing or Dataflow requested as_of not >= since #19743
Comments
To reproduce, run the attached file in multiple concurrent connections:
The way the queries was generated was fairly primitive, so the file contains many queries that will error out for various reasons. With a more targeted workload the panics are very likely to happen much faster. Please let me know if a better test case is required. |
The "Dataflow oneshot-select-t1065 requested as_of (Antichain { elements: [1686041434420] }) not >= since" error is happening because a txn is started, it gets a timestamp, then the second query in the txn runs but the since has advanced beyond the query timestamp. This shouldn't happen due to read holds, so something might be going on there. u208 is a table. Here's the determine_timestamp from a select that's the 2nd select in a txn (no AS OF anywhere). See how the since is greater than the timestamp.
|
It is not yet clear if this is related to the other panic, which is caused by an unknown collection id. It is possible they are the same bug! One theory is that a collection is being destroyed and created (the creation happens at the higher since), and so sometimes this manifests as a since problem, sometimes as a unknown id problem depending on where in the race condition it hits. |
@mjibson can you give me a diff of whatever |
https://github.com/mjibson/materialize/tree/gh19743 is that branch |
Summary:
Here are the logs (gh19743.log) and here is the linear narration.
As the story begins,
conn 14 starts a transaction
The closest timestamp I see after conn14's txn begins is:
other collections are at 5468 though
Other txns are choosing 1686526195468 shen selecting from
conn8 selects from
conn14 selects from
conn 20 selects from
conn14 selects from
|
Maybe the transaction should not even be allowed to see tables that were created after it started? Or error our immediately when attempting to SELECT from such a table? |
Transactions should already be doing this! Will investigate. |
Current story:
This is happening in create_dataflows: https://github.com/MaterializeInc/materialize/blob/505338f15ddce774f7a2d83b1ddf6a824cf6a2b6/src/compute-client/src/controller/instance.rs#LL617C1-L617C2
This seems to be a problem in I tested some theories above about creating indexes or tables while another txn had started then querying them, but we have existing code protecting against that and everything I tried worked as expected (the txn failed with timedomain errors). |
I'm missing a lot of context here, but these are my initial thoughts:
|
I'm confused about how the transaction in connection 1 is able to reference the new Is it possible that moving work off the main coordinator thread has created a race here? Something like: Objects X and Y exist in the same schema.
|
This hasn't happened yet for SELECTs, which still happen in a single coord message. |
Ugh. I sorted the order of events incorrectly. The CREATE OR REPLACE executed before the transaction started. |
Ah, I see. Just to be absolutely certain I have the order right, can you post a corrected order of events? |
the |
The |
There are no CREATE queries from any connection after the transaction starts. |
Here's my guess: materialize/src/storage-client/src/controller.rs Line 2107 in ad6d319
|
I'm not able to reproduce this by hand. That combined with the other panic here suggests the read hold stuff is a red herring. I'm still lost! |
Just jotting this down for my own memory: Matt and I looked at this today and was saw that some sources that should have been considered during timestamp determination were not. |
Looking into the above question about correctly involving some sources: we just hadn't read everything yet. We were looking at the id bundle for the incoming query, not the id bundle for the timedomain saved in the transaction. The transaction's id bundle is correct, and it does seem to be correctly setting IDs during first statement, then correctly referencing that during later queries in the txn. |
Think I found it: timestamp determination is done with initial select sources, not full timedomain sources. So subsequent queries in the txn haven't had their sinces considered. This got broken within the last few months. Should have a fix and full explanation out soon. |
This got broke in #19026 |
Unmarking as a release blocker because the fix will take a bit longer, this doesn't result in catalog corruption, and it occurs only rarely. |
As follow ups, per our team retrospective, I filed: |
… areas (#20406) As a follow up to #19743, we want to be more prescriptive about what types of external contributions we'll encourage, and more explicit guidelines for picking up an issue. Surfaces has decided to disallow external contributions to the coordinator for two reasons: * The cost of educating and performing extra-thorough reviews is not worth the benefit-cost/risk tradeoff, given the complexity and importance of this code * Major coordinator changes should only be done as part of a more focused, planned body of work, again given the nuances. One-off changes picked up at random increase the risk of missed problems. Questions for reviewer: * Any other suggestions for both the "areas that are best suited for external contributions" and "ask first" sections? I picked persist there arbitrarily, and can remove if folks disagree. This is one of a series of follow up items from the above issue.
What version of Materialize are you using?
v0.58.0-dev (e015c3e)
What is the issue?
When running a concurrent workload that mixes transactional statements, SELECTs and DDLs, the coordinator will panic either like this:
or like this:
The text was updated successfully, but these errors were encountered: