sql: Add design doc for addressing the "optimizer/customer trade-off problem" #35441
mgree wants to merge 5 commits into MaterializeInc:main
Conversation
5801a05 to 7fbd51c
| - Feature flags apply to entire environments, not individual clusters. If different teams need different flags, they are out of luck. |
| - Exponentially many configurations---we can't test every combination of flags, and flags interact. |
| - Unknown support windows. |
Isn't this missing something: not everything is feature-flaggable (or if it is, it might be VERY hard)? Especially since the intro motivated this with the repr type work.
ggevay left a comment:
I have a lot of questions
| ```sql |
| ALTER CLUSTER foo FREEZE; |
| ALTER CLUSTER foo UNFREEZE; |
So this means to pin a plan, I freeze the cluster? As soon as I unfreeze, the plan immediately goes to the latest LIR plan?
> As soon as I unfreeze, the plan immediately goes to the latest LIR plan?
I'm not sure, but I would imagine that the typical procedure for moving away from a pinned plan would be to do either a blue-green deploy or an ALTER MV-like workflow, so that you can see whether the new plan would work well. We should think more about the details of this, e.g., what it would look like when the "ALTER MV-like workflow" for moving to a newer plan needs to happen at the cluster level.
There's a similar question here about what happens during a blue/green dbt deploy. We spin up and spin down new clusters during a blue/green deploy, which means that your plan pinning would disappear!
Is there a way for us to specify the plan version explicitly? Something like `CREATE CLUSTER WITH OPTIMIZER PLAN v1`, or something like that?
I'm not sure I understand: if a blue/green deploy is to change plans, why would you blue/green deploy something that was frozen? You're not allowed to change anything!
Ah, maybe you want to change something just in one small dataflow of a cluster, but I think a blue-green operates at the whole cluster level, so it would blow away all the pinned plans. Is this the issue, @maheshwarip? If yes, then this is indeed a problem, because it means the splitting workaround (the "Mitigation: use MVs to separate the units you care about.") for the query change issue would not actually work, right?
The mitigation would work for all of the upstream and downstream frozen clusters. But a downside to plan pinning is that it is very binary: you either have pinned the exact plan, or you are at the mercy of the optimizer.
| No changes can be made to `foo`: no new dataflows, no removals. |
| It will not be part of this work, but it seems sensible to limit other actions on frozen clusters, e.g., you may only run fast-path `SELECT`s and `SUBSCRIBE`s (with the possible exception of queries that touch introspection sources). |
| ### Why at the cluster level? |
mtabebe left a comment:
I think this is a much stronger document and proposal. Nice work
ggevay left a comment:
I'm liking this new version quite a bit! My biggest concern with plan pinning was
> Any changes to the plan and you lose your pin.
(I'd formulate it as "Any changes to the query and you lose your pin.")
But
> Mitigation: use MVs to separate the units you care about.
might be enough of a mitigation.
| If we freeze `C3`, we certainly can't make changes to `C3`. |
| What about `V2` (which is inlined into the definition of `MV2`)? |
| What about `MV1` (which is read from persist)? |
| What about `S1`? |
I'd say the only optimizer-relevant question here is a potential change in V2. This is because changes in MV1 and S1 would change the persist-level input of MV2, so no direct effect on the plan of MV2. (And avoiding downstream surprises is an active topic of discussion in the ALTER MV workstream.)
For V2, actually, our current ALTER VIEW can't change the view's query (can only change the name and owner), so we don't have an issue with this for now.
Okay, this makes sense to me! It's worth confirming that with how our customers are using blue/green deploys, they'll get a sensible error if they try to modify V2.
| - Who flips the bits? If it's us: high support burden. If it's someone else: what if they break things? |
| - Unknown support windows. |
| ### `plan-pinning` |
A thing that is unclear to me with plan-pinning is how will users know when they'll have to migrate away from an old pinned plan? With optimizer-versions, this is more clear: We can warn them some time before the scheduled removal of an old optimizer version, and it should be easy even for self-managed users to see whether they'll be affected or not. But with plan-pinning, a breaking change that would make a pinned plan impossible to render anymore would not have an obvious warning before it.
To solve this, we might want to add some code ahead of a breaking change that hunts down the pinned plans that would be affected (without breaking them yet) and warns the user loudly that they should try migrating away from the pinned plan, also telling them when the actual breaking change is scheduled.
Makes sense. Per @antiguru, though, it may be hard to know that we're making a breaking change.
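The "hunt down affected pinned plans and warn loudly" idea above could be sketched roughly as below. Everything here is hypothetical (`PinnedPlanInfo`, `warnings_for_breaking_change`, and the `affected_by_change` predicate are not Materialize APIs), and as noted, detecting that a change is breaking is itself the hard part:

```rust
// Hypothetical sketch: ahead of a scheduled breaking change, scan the
// pinned plans for affected ones and collect loud warnings, without
// breaking anything yet. None of these names are real Materialize APIs.

struct PinnedPlanInfo {
    cluster: String,
    // Whether this pinned plan uses whatever the breaking change removes.
    // Per the thread, computing this bit is the hard part.
    affected_by_change: bool,
}

fn warnings_for_breaking_change(pinned: &[PinnedPlanInfo], removal: &str) -> Vec<String> {
    pinned
        .iter()
        .filter(|p| p.affected_by_change)
        .map(|p| {
            format!(
                "WARNING: cluster `{}` has a pinned plan that will stop rendering in {}; \
                 please migrate away from it (e.g., via a blue/green deploy)",
                p.cluster, removal
            )
        })
        .collect()
}
```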
| Pros: |
| + Ties in neatly with related ideas of "production clusters", guarantees, and auto-scaling. |
Yes, I feel more and more momentum building lately behind the "production clusters" idea.
| + Ties in neatly with related ideas of "production clusters", guarantees, and auto-scaling. |
| + Ties in neatly with related ideas of "DDIR" or some other stable, low-level interface. |
| + Offers the most reliable possible experience---a fixed LIR plan would be stable even if bugfixes in MIR cause queries to change. |
Well, optimizer-versions would be even more reliable, in that it would also solve the query-tweaking problem, i.e., when a tiny query tweak causes a big plan change due to the tweaked query being planned with different optimizer code.
If you change a query, all bets are off. I look forward to having a more "continuous" optimizer, but I don't think any of the proposals here directly address that.
We're not going to be able to completely eliminate the possibility of regressions (e.g., a change in TD or DD will affect everyone!).
| There are two closely related but not identical problems: |
| 1. **`our-bad`** The MZ optimizer changed and it broke on redeploy. |
| 2. **`your-bad`** You changed something and it broke in staging. |
Well, it's debatable: when a user tweaks a query and gets a big plan change, not because the change is big at the SQL level or because of a discontinuity in optimizer code, but simply because different optimizer code is now running, is that our fault (optimizer code change) or their fault (query change)?
But the "Mitigation: use MVs to separate the units you care about." might mitigate this problem enough.
antiguru left a comment:
Leaving some comments. I'm not yet convinced that durably recording LIR is a good idea!
| Cons: |
| - Code duplication. (Somewhat mitigated by `git subtree`.) |
Please do not use subtrees; it's too much of a headache.
| Cons: |
| - Code duplication. (Somewhat mitigated by `git subtree`.) |
| - We do not know what kind of support window we will want, and may get backed into things we end up disliking. |
That's a product question, not an engineering problem I think!
Maybe so---but it's a real con either way! We have not historically been good at managing this (e.g., current EDJ/VOJ situation).
| Cons: |
| - LIR is not a stable interface. DDIR does not actually exist. |
I think we're treating LIR as the stable boundary between optimizer and rendering. It can change, but that requires adjustment in all optimizer versions. Changes to rendering are not part of this design doc (but essentially suffer from the same problems explained here).
| Cons: |
| - LIR is not a stable interface. DDIR does not actually exist. |
Well, DDIR doesn't exist, so hard to say, but I think this:
> When evolving it, we can just add a new field that does a new thing.
is potentially dangerous: LIR should have less special-casing in the future, not more. The fact that it has special-casing doesn't justify piling more onto it, but should rather motivate figuring out what's wrong.
Overall, I'd say this design should stop at LIR, especially given we're not going to invest in a DDIR right now.
| + Ties in neatly with related ideas of "DDIR" or some other stable, low-level interface. |
| + Offers the most reliable possible experience---a fixed LIR plan would be stable even if bugfixes in MIR cause queries to change. |
| Cons: |
Another con: more durable state.
| + The finest-grained control. |
| + Avoids/defers the need to have smart query planning. |
| Cons: |
Another con: it feels like a trap-door; undoing hints is hard once people adopt them.
| ## Solution Proposal |
| We propose using **`plan-pinning`**. |
I think we shouldn't underestimate the burden to commit to a durable LIR format. What's the process of evolving it? How do we detect changes? How do we prevent silent incompatibilities?
LIR depends on large parts of Mz's implementation, relation expressions, scalars, etc. and I'm not sure all parts have committed to a stable representation.
This feels like the thorniest question here. LIR has a relatively large surface (e.g., including MirScalarExpr, Row).
The last big changes to LIR that (a) had to happen and (b) would have been painful for migrations were in February 2026 and then... November 2024. So we likely have some time to think.
We ripped out the protobuf, which seemed like the simplest way to think about migrations. Wrong sometimes, but simple!
If we're truly "best effort", something like bincode will suffice... but then tiny changes at a distance will lead to losing pinned plans. What's the right compromise?
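One shape the "version numbers up front" compromise could take, sketched under the assumption that the LIR encoding itself stays opaque: prefix every durably stored pinned plan with a format version, and have decoding either accept a known version or cleanly return nothing, losing the pin rather than misreading it. `PinnedBlob`, `encode_pinned`, and `decode_pinned` are illustrative names, not Materialize APIs:

```rust
// Hypothetical version-tagged envelope for durably stored pinned plans.
// The payload's own encoding (protobuf, bincode, ...) is left unspecified;
// the envelope only guarantees we can tell which format we're looking at.

const CURRENT_VERSION: u32 = 1;

struct PinnedBlob {
    version: u32,
    payload: Vec<u8>, // the serialized LIR plan
}

fn encode_pinned(payload: &[u8]) -> Vec<u8> {
    let mut out = CURRENT_VERSION.to_le_bytes().to_vec();
    out.extend_from_slice(payload);
    out
}

/// Best effort: an unknown or truncated envelope means the pin is lost,
/// but we fail cleanly instead of rendering a misinterpreted plan.
fn decode_pinned(bytes: &[u8]) -> Option<PinnedBlob> {
    if bytes.len() < 4 {
        return None;
    }
    let version = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]);
    if version > CURRENT_VERSION {
        return None; // written by a newer format we don't understand
    }
    // A real migration would dispatch on `version` here to upgrade old payloads.
    Some(PinnedBlob { version, payload: bytes[4..].to_vec() })
}
```

The `match`-on-version dispatch point is where the customized deserialization for migrations (mentioned in the quoted doc text below) would live.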
| ALTER CLUSTER foo UNFREEZE; |
| ``` |
| We will store the LIR for all of the dataflows on `foo`, and automatically use those LIR plans on reboot. |
Updated the doc. The catalog seems like the right place.
| The environment and organization level is far too coarse. |
| Going finer, the replica and dataflow levels are too fine---freezing these but not the rest of the cluster seems like a recipe for a mismatch with dependencies. |
| ## Open questions |
One more open question for plan-pinning: What do we do with non-LIR EXPLAINs (e.g., EXPLAIN OPTIMIZED PLAN FOR MATERIALIZED VIEW ...) for objects that only have a pinned plan? If we really only save the LIR plan, then we'll lose the ability to do HIR/MIR EXPLAINs on these objects. Maybe we could additionally best-effort save the MIR/HIR plans as well, where by "best effort" I mean just discarding them when there is a deserialization failure, so that at least most of the time we retain the ability to show HIR/MIR plans?
I'm fine with best effort here, but... we're already storing these in the catalog... couldn't we continue to do so? (A breaking change in MIR would force a catalog migration anyway, right?)
We are not storing them in the durable catalog. Instead, we recompute them from the SQL at every system startup. But the recomputation wouldn't give back those plans that would correspond to the pinned plans.
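The "best effort" suggestion above could look something like the following, assuming the HIR/MIR plans are stored as opaque blobs next to the pinned LIR plan. The names and types here are all hypothetical, not Materialize APIs; the point is only the failure mode: a stale sidecar blob degrades EXPLAIN output, never the pinned plan itself:

```rust
// Sketch of best-effort sidecar plans: on startup, an HIR/MIR blob that
// fails to deserialize is simply discarded, and the object keeps working
// off its pinned LIR plan with reduced EXPLAIN coverage.

enum ExplainAvailability {
    Full,    // HIR/MIR decoded fine; all EXPLAIN stages available
    LirOnly, // sidecar plans were stale/unreadable; only LIR EXPLAIN works
}

fn load_sidecar_plans(
    hir: Result<Vec<u8>, String>,
    mir: Result<Vec<u8>, String>,
) -> ExplainAvailability {
    match (hir, mir) {
        // Both sidecar blobs decoded: full EXPLAIN support.
        (Ok(_), Ok(_)) => ExplainAvailability::Full,
        // Best effort: any deserialization failure only degrades EXPLAIN.
        _ => ExplainAvailability::LirOnly,
    }
}
```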
Thanks, all (@mtabebe @maheshwarip @antiguru @ggevay) for the feedback! I left comments and updated the document.
| When we want to make a breaking change, we will likely have to customize deserialization to support a migration. |
| This will be painful the first time. |
| It will be less painful if we bake in version numbers up front. |
| LIR will be stored in the catalog. |
| It could go in `CatalogPlans` (all currently keyed by global ID), in `CatalogState` (alongside per-cluster metadata), or as a sidecar (like the `ExpressionCacheHandle`). |
My weakly held opinion is that it doesn't need to be in the catalog, but that is due to my pre-conceived notion of what overhead on the catalog is. In general, it doesn't need to have the same strong consistency, etc.
I'm not sure where else makes sense to store things, though... could just be my ignorance about the storage set up! If it's not in the catalog, where is it? A persist shard? Where/how do we keep track of that shard?
When we make changes to the optimizer, we hope that everyone will have a good time (more freshness, less memory usage, etc.). But not every customer's workload is the same, and there's the potential for a customer to have a bad time.
It would be good to ensure that people can continue to have the kind of time they're having, even as we make changes to the optimizer.
This design doc (rendered) explores the space of solutions, and proposes a particular one: plan pinning by way of freezing clusters.