
Add design doc for proposed changes to mz_internal #17836

Merged: 8 commits into MaterializeInc:main on Apr 19, 2023

Conversation

@umanwizard (Contributor) commented Feb 26, 2023

See rendered doc here for details.

@benesch (Member) left a comment


Really excited to see this work happening, @umanwizard!

Comment on lines 154 to 164
* All per-worker relations should have a corresponding "global" view
where the results are aggregated across workers in the way most
likely to be useful (for example, summing, taking the minimum, or
restricting to worker 0). Because these global views are more likely
to be useful than the underlying sources, they should have the basic
name (e.g., `mz_arrangement_sharing`). The underlying source should
have a name with `per_worker` inserted (e.g.,
`mz_per_worker_arrangement_sharing`). Furthermore, the per-worker
variant should be clearly less prominent in documentation (either at
the bottom of the page, or in a "see more" section that has to be
clicked to expand).
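For illustration, the proposed pairing might look roughly like the following sketch, reusing the example names from the doc (the per-worker column layout here is an assumption, not an actual catalog definition):

```sql
-- Hypothetical sketch: the per-worker source carries a worker_id column,
-- and the "global" view aggregates it away by summing across workers.
CREATE VIEW mz_arrangement_sharing AS
SELECT operator_id, sum(count) AS count
FROM mz_per_worker_arrangement_sharing
GROUP BY operator_id;
```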
Member

Strong agree on this.

Comment on lines 176 to 180
* All per-replica sources and views should have their names prepended
with some string (perhaps `clusterd_`, though I'm open to
suggestions) that makes it obvious which ones they are. Furthermore,
the limitations of per-replica relations should be clearly
documented.
Member

"clusterd" is not a user facing concept (outside of architecture discussion), but "cluster" is! Would strongly endorse calling these something with cluster in the name.

Cluster unification really did us a solid here, since we no longer need to figure out how to disambiguate between "storage" introspection views and "compute" introspection views.

Contributor

This is tough. I would have intuitively suggested that we prefix with cluster_replica, to signify that these relations are implicitly scoped by the targeted replica. However, we already have a bunch of mz_cluster_replica_* relations that are written by the controller and expose the global view (with a replica_id column):

@teskje (Contributor) commented Feb 27, 2023

FWIW, here is some prior discussion about the mz_cluster_replica_* naming in Slack.

Contributor Author

@benesch, I think it's fine to expose non-user-facing concepts in internal stuff. I mean, dataflow channels and operators aren't a user-facing concept either :)

I liked clusterd because it precisely describes where the logging happens and where the relations actually live (as opposed to being persisted). I'd also be okay with cluster_replica (and renaming all the stuff that already has it).

Contributor

It would be great to find something other than cluster_replica. mz_cluster_replicas is already a stable system table that we cannot rename, so using the very similar mz_cluster_replica as an introspection source prefix would be confusing.

Some brainstorm ideas, none of them great:

  • clusterd. Already mentioned above.
  • cluster_side. To avoid the internal name.
  • replica_side. A bit more specific/correct.
  • replica. Maybe harder to confuse with mz_cluster_replicas than cluster_replica would be, but still quite confusing. It's not obvious that leaving out the "cluster" would have significance.
  • compute. We will use the introspection sources for storage dataflows too, so this would not be referring to COMPUTE the layer, but to our computation fabric.
  • computation. Less loaded alternative to the above.
  • Using a prefix different from mz_, strawman: log_.
  • Using a schema different from mz_internal, strawman: mz_log.

Contributor

I like the idea of having special query syntax too. I think the reason we didn't really consider it before was that people pointed out that changing the SELECT syntax is scary. I don't remember the details though. I think @sploiselle might?

Another thing to consider is that SUBSCRIBEs can also target replicas, so if we introduce dedicated syntax, that needs to apply to subscribes as well.

For me, the value of having a specific prefix for log sources and views over them is not so much that you can immediately see that they are replica-specific, but rather that they were collected on the replica side, rather than on the environmentd side. That's a subtle difference, since the latter applies even if we lift the replica-targeting restriction in the future.

My thought was that we might want to collect the same information on both sides and need a way to express the distinction to the user. For example, today we have mz_cluster_replica_frontiers and mz_compute_frontiers, and it's not really clear how they differ. That said, I'm not sure mz_compute_frontiers actually provides much additional value over mz_cluster_replica_frontiers, so maybe we should just remove it and resolve the ambiguity that way? If it is never the case that we want to collect the same information from both the replicas and environmentd, then a per-replica prefix for relation names is not needed either.

Aside: another confusing point is that the terms "logs" and "sources" are often used interchangeably in this context (whereas "sources" from Storage's perspective are a rather different thing)

I've thought that maybe we should have an mz_logs table that's different from mz_sources. But I think right now the distinction between sources and tables is pretty muddled anyway, so there is opportunity for coming up with a holistic design separately.

Member

> @benesch, I think it's fine to expose non-user-facing concepts in internal stuff. I mean, dataflow channels and operators aren't a user-facing concept either :)

Dataflows, channels, and operators are user facing, in my view. Only for the most advanced of users, but they are concepts that must be understood to make sense of the introspection relations.

By contrast, I would like for clusterd to never leak into the user interface at all. If storaged or computed had leaked into the interface, for example, that would have sunk cluster unification.

> One idea I had was to simply refuse to allow replica-directed queries without special syntax, e.g. something like SELECT IN REPLICA (default.r1) * FROM mz_internal.mz_dataflow_channel_operators. This would also allow us to get rid of the cluster_replica session variable.
>
> Another reason why I like this idea is that replica-directed queries are one of the very few ways of introducing non-determinism in Materialize (the only other one I can think of is non-materializable functions). I.e., if the state of persist and catalogs, the selected timestamps, and the query issued are all held constant, the results are fully determined. However, replica-directed queries are an exception to this rule so it makes sense to force them to be queried with different syntax.

I wrote up thoughts about this in the past: #15195 (comment)

IMO, introducing special syntax specifically for replica-targeted queries would be a mistake. There are plenty of other session variables that quietly impact how your query works (e.g., database, search_path, cluster). I don't see why cluster_replica is more special than those session variables. The argument that it introduces non-determinism doesn't resonate for me, given that (as you say) SELECT mz_version() is also nondeterministic, and we don't make you use special syntax to type that.

As mentioned in the comment I linked above, no objection to extending all SQL statements to support setting arbitrary session variables.
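For reference, replica-directed introspection queries today go through session variables rather than dedicated syntax; a minimal sketch (the cluster and replica names are only examples):

```sql
-- Target a specific replica for subsequent introspection queries.
SET cluster = default;
SET cluster_replica = r1;

-- This now reads the targeted replica's introspection data.
SELECT * FROM mz_internal.mz_dataflow_operators LIMIT 10;
```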

> Not sure how good of an argument that is, but typing queries against introspection sources is already annoying because you have to prefix all of them with mz_internal. every time.

As of recently, you can say set search_path = mz_catalog, mz_internal, public, and you'll no longer need to prefix all mz_internal tables with the schema name.

> Aside: another confusing point is that the terms "logs" and "sources" are often used interchangeably in this context (whereas "sources" from Storage's perspective are a rather different thing)
>
> I've thought that maybe we should have an mz_logs table that's different from mz_sources. But I think right now the distinction between sources and tables is pretty muddled anyway, so there is opportunity for coming up with a holistic design separately.

Totally agree that we should separate the concepts of "logs" and "sources".

I've never liked the term "log" though. It gets confused with what most people think of as a log: lines of text written to stdout and stderr. As unwieldy as "introspection" is, it's free of baggage.

Member

The best idea I have is a prefix or suffix like active_replica. For example: mz_active_replica_dataflow_operators. It's wordy, though...

To come clean: pre-cluster unification I was opposed to ever using the term "replica" alone (vs. "cluster replica"). I've softened my stance there, since we no longer think we'll need to distinguish between "storage replicas" and "cluster replicas".

Contributor

> By contrast, I would like for clusterd to never leak into the user interface at all. If storaged or computed had leaked into the interface, for example, that would have sunk cluster unification.

Assuming that "user interface" here only concerns the introspection sources, why would that have been an issue? Backwards compatibility shouldn't be a concern, since we declare all the introspection sources and views as unstable.

> As unwieldy as "introspection" is, it's free of baggage.

Although we now have the mz_introspection cluster, which uses the term for "running SHOW commands and the like". Which also means that adding a separate mz_introspection schema for introspection sources and views to live in would be confusing.

Maybe there is no need to assign a specific prefix to introspection sources. AFAICS, there are two use-cases for a special prefix:

  1. Telling users which relations are per-replica and which are not. I think we can solve this with documentation, like having a separate section on the mz_internal catalog docs page that lists all the introspection sources. And hopefully we will end up with a solution that removes per-replica relations at some point.
  2. Differentiating between sources that provide the same/similar data but collected at different places, for example mz_compute_frontiers (collected on the replica) vs. mz_cluster_replica_frontiers (collected by the controller). I think in many of these cases, we can just get rid of one of the two relations. If there is a controller-exported relation exposing the same data as a replica-exported one, the replica-exported one is probably not necessary. There will of course be exceptions to this rule, but maybe those can be solved on a case-by-case basis.
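To make the distinction in point 2 concrete, a rough sketch of the two query shapes (the column names here are assumptions and may not match the actual catalog):

```sql
-- Replica-exported: implicitly scoped to whichever replica the session targets.
SELECT export_id, time
FROM mz_internal.mz_compute_frontiers;

-- Controller-exported: covers all replicas, with an explicit replica_id column.
SELECT replica_id, export_id, time
FROM mz_internal.mz_cluster_replica_frontiers;
```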

Member

>> By contrast, I would like for clusterd to never leak into the user interface at all. If storaged or computed had leaked into the interface, for example, that would have sunk cluster unification.
>
> Assuming that "user interface" here only concerns the introspection sources, why would that have been an issue? Backwards compatibility shouldn't be a concern, since we declare all the introspection sources and views as unstable.

Ah, sorry, I meant if the terms "storaged" or "computed" had leaked into any of the stable, public relations in mz_catalog.


(Not quite sure about this -- it's tricky to balance ease-of-use with
not blowing up the set of documented relations)

Member

I believe @frankmcsherry has a laundry list of controller-maintained introspection views that would be nice to have. Is this design document the right place to begin gathering those, or are those best left to a future design document?

Contributor Author

Yes, I'd love to have the laundry list!

Contributor Author

One thing Frank mentioned on Slack:

> In particular, it is hard to get end-to-end lag (from sources all the way through to indexes) though "easy" to get lag for one dataflow (its inputs to outputs).

We should definitely have this, if we can think of a good way to actually compute it: you need to know the transitive dependencies of each dataflow, which is fundamentally recursive, and we don't (yet) have recursive queries outside of unsafe mode. I'll check whether it's easy to get transitive dependencies in the code.
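If recursive queries do become generally available, the recursive piece could look roughly like this sketch. It assumes a dependency relation along the lines of mz_object_dependencies(object_id, referenced_object_id) and the WITH MUTUALLY RECURSIVE syntax; both are assumptions here, not settled decisions:

```sql
-- Hypothetical sketch: compute the transitive closure of object dependencies,
-- the fundamentally recursive step needed for end-to-end lag.
WITH MUTUALLY RECURSIVE
    reach (object_id text, dep_id text) AS (
        SELECT object_id, referenced_object_id
        FROM mz_internal.mz_object_dependencies
        UNION
        SELECT r.object_id, d.referenced_object_id
        FROM reach AS r
        JOIN mz_internal.mz_object_dependencies AS d ON r.dep_id = d.object_id
    )
SELECT * FROM reach;
```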

@teskje (Contributor) left a comment

Thanks for writing this up!


@umanwizard (Contributor Author)
I'm now thinking about whether we should also expose per-operator frontiers. I'm not sure why we don't have them; maybe the data volume was thought to be too high?

@teskje (Contributor) commented Feb 28, 2023

> I'm now thinking about whether we should also expose per-operator frontiers. I'm not sure why we don't have them; maybe the data volume was thought to be too high?

We have mz_dataflow_operator_reachability, which I think is basically the same as operator frontiers? We don't have documentation for it, so I'm not entirely sure, but at least the amount of data should be similar to that of operator frontier tracking.

@vmarcos (Contributor) left a comment

Thanks for taking this on! Overall, the suggestions here are really great and will help us a lot.


#### More views enriched with operator names, dataflow names, etc.

(Not quite sure about this -- it's tricky to balance ease-of-use with not blowing up the set of documented relations)
Contributor

Perhaps a start would be to co-design these additional views with the external introspection initiative? That way, the scope could be restricted to views that would make implementing external introspection significantly easier. If we think these would be optimized for UI use and might exceed the scope of mz_internal, one could think of separating them out in a different schema, for example.

Contributor Author

Indeed! I plan to continue updating this PR if we discover any possible new views through that effort.


## Open questions

* Are there further views that could be useful but that we don't have
Contributor

When calculating frontier lag, we usually have to write some tedious joins between mz_compute_frontiers and mz_compute_import_frontiers. Perhaps a ready-made view there would be pretty useful?
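A rough sketch of what such a ready-made view could look like (the view name is hypothetical and the column names are assumptions):

```sql
-- Hypothetical helper view: pair each export's output frontier with the
-- frontiers of its imports; the gap between the two is the frontier lag.
CREATE VIEW mz_compute_frontier_lags AS
SELECT
    f.export_id,
    i.import_id,
    f.time AS export_frontier,
    i.time AS import_frontier
FROM mz_internal.mz_compute_frontiers AS f
JOIN mz_internal.mz_compute_import_frontiers AS i
    ON f.export_id = i.export_id;
```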

@vmarcos (Contributor) commented Feb 28, 2023

Ah, one other comment that crossed my mind: Our treatment of SUBSCRIBEs in introspection sources / views could probably be improved. For example, in mz_compute_frontiers et al., we show SUBSCRIBEs, though the documentation is inconsistent about it. Also, I am not sure whether we have a list of the SUBSCRIBEs that are running along with their SQL statements? We can look at dataflows and do some string matching, which is not ideal (but that is what our documentation suggests here).
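For context, the string-matching workaround looks roughly like this (the dataflow naming convention assumed here may differ from what the docs describe):

```sql
-- Find dataflows whose names suggest they back a SUBSCRIBE.
SELECT id, name
FROM mz_internal.mz_dataflows
WHERE name LIKE '%subscribe%';
```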

@teskje (Contributor) commented Feb 28, 2023

> Our treatment of SUBSCRIBEs in introspection sources / views could probably be improved.

Maybe we should just add an mz_subscribes table? Subscribes don't have names, but we could put in (time, user, SQL) instead, to make them identifiable.

umanwizard and others added 2 commits February 28, 2023 09:18
Co-authored-by: Marcos Antonio Vaz Salles <vmarcos@materialize.com>
@teskje (Contributor) commented Mar 7, 2023

An mz_subscriptions table listing all active subscribes exists now: #17906.

@teskje (Contributor) commented Mar 10, 2023

> We have mz_dataflow_operator_reachability, which I think is basically the same as operator frontiers? We don't have documentation for it, so I'm not entirely sure, but at least the amount of data should be similar to that of operator frontier tracking.

I've checked this: mz_dataflow_operator_reachability doesn't give us operator frontiers directly. It gives us the capabilities held by each operator output and input. Frontiers can be derived from that, if you know the structure of the dataflow graph and the path summaries between individual operators. But a) we don't know the path summaries and b) we would surely not want to perform the work of deriving frontiers ourselves, even if we had all required information.

I think timely's reachability tracker could be made to log frontier updates as well as capability updates. If we want to show operator frontiers to users, this is the only feasible way I can think of.

@antiguru (Member) left a comment

This looks great!

I'm not sure we actually want to log per-operator frontiers. It's potentially a lot of data and we should first determine what the cost would be in relation to the benefit it would bring us.

  * Duration intervals vs nanoseconds timestamps
  * Histogram sum fields
  * _raw variants
@umanwizard umanwizard enabled auto-merge (squash) April 19, 2023 14:24
@umanwizard umanwizard merged commit 4f6e48a into MaterializeInc:main Apr 19, 2023