
Develop coherent strategy for exposing historical data in internal tables #18823

Status: Open
umanwizard opened this issue Apr 18, 2023 · 9 comments
Labels: C-refactoring Category: replacing or reorganizing code

Comments

@umanwizard
Contributor

It has become clear that the pattern of using internal tables as a historical record of events is common in practice. For example, the historical record of resource usage metrics, a log of sink and source errors, and a log of cluster status changes (OOMs, crashes, etc.) either are already available in the web console or will be soon.

This historical data is exposed in at least two ways today:

  1. Only the most recent value is kept in the current version of the table, and the table's retention window is increased to the desired value (currently 30 days). Users must use time-travel queries (SELECT ... AS OF, SUBSCRIBE ... AS OF, etc.) and treat the differential-dataflow timestamp as the timestamp of the event or measurement.
  2. The historical log of values is available in the current version of the table, and the timestamp is stored as a field of the data. The user may use normal default-timestamp queries to inspect the historical record.
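To make the contrast concrete, the two access patterns look roughly like this (a hedged sketch: the table and column names are hypothetical, not actual catalog objects, and the AS OF timestamp is illustrative):

```sql
-- Approach 1: the table holds only the latest values; history is
-- reached by time travel, and the event time is the differential
-- dataflow timestamp rather than a column.
SELECT * FROM mz_internal.cluster_statuses AS OF '2023-04-01 00:00:00';

-- Approach 2: the full log is stored as ordinary rows with the event
-- time as a column, so plain SQL can filter and bucket it.
SELECT date_trunc('hour', occurred_at) AS hour, count(*) AS events
FROM mz_internal.cluster_status_history
GROUP BY 1
ORDER BY 1;
```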

The main advantages of the second approach are that it requires no knowledge of exotic features and that it allows further SQL transformation of the result set, timestamps included (e.g., bucketing), whereas the first approach requires such logic to be done on the client. The main advantage of the first approach is that it doesn't require ad-hoc compaction/cleanup logic, since Differential already does this for us. Furthermore, any machinery developed to support this approach can be straightforwardly extended to help real customers/users whose workflows depend on historical queries. One possible mitigation of the difficulties with approach 1 is a CHANGES operator that would surface the changes to a relation as a relation that could then be further transformed; I am not sure whether we will ever have this feature or how difficult it would be to implement.
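For illustration only, the CHANGES operator mentioned above might be used like this (purely hypothetical syntax; no such operator exists in Materialize today):

```sql
-- Hypothetical: expose the change stream of a relation as a relation,
-- so history can be shaped with ordinary SQL instead of client logic.
SELECT date_trunc('hour', change_ts) AS hour, count(*) AS status_changes
FROM CHANGES(mz_internal.cluster_statuses)
GROUP BY 1;
```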

This issue is intended to centralize discussion about these two methods; the ultimate goal is to settle on one to be used throughout the codebase.

@umanwizard umanwizard added the C-refactoring Category: replacing or reorganizing code label Apr 18, 2023
@umanwizard umanwizard self-assigned this Apr 18, 2023
@frankmcsherry
Contributor

Some relevant background: the data warehousing space has (probably many) existing idioms, one of which is Kimball's dimensional modeling. Its idioms for slowly changing dimensions have the benefit that we wouldn't be inventing things from scratch; they could potentially be converted into DD-style collections, and they likely resemble the idioms users already expect.

@chuck-alt-delete
Contributor

chuck-alt-delete commented Apr 18, 2023

This is highly relevant to customer use cases as well. Customers often want to analyze historical data.

For option 1, we don't currently expose the ability to customize compaction to users, so AS OF isn't currently a viable option. We can discuss the pros and cons of allowing users to enable historical analysis using AS OF with custom compaction.

For option 2, I'd say customers are most comfortable with this since it's how things work in other databases. However, we would need an easy way for users to handle "compaction" themselves in this case.
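One way users could handle that "compaction" under option 2 is Materialize's temporal-filter idiom with mz_now() (a hedged sketch: the history table and column names are hypothetical, and this trims what the view exposes rather than reclaiming storage in the underlying collection):

```sql
-- Keep only the last 30 days of the log visible; rows fall out of the
-- view's output as mz_now() advances past their retention window.
CREATE MATERIALIZED VIEW recent_source_errors AS
SELECT *
FROM source_error_history
WHERE mz_now() <= occurred_at + INTERVAL '30 days';
```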

@umanwizard
Contributor Author

> This is highly relevant to customer use cases as well

I think if we ever do support user-configurable compaction (and @chaas's team does want to support it eventually), then we don't need to be opinionated here: users are free to use whatever approach they want. So the question here is more about how we choose to use the system ourselves. (Obviously, though, our experience can and should inform product decisions about how to improve things for users too.)

@benesch
Member

benesch commented Apr 18, 2023

Copying in my comment from Slack:

I like option 1, and think we should figure out how to make it ergonomic.

@matthelm

Both of these methods seem to have their own merits. Option 1 seems more immediately useful to me as an analytics person, but as more of the ecosystem evolves toward streaming I could imagine Option 2 also having a role. Ideally we don't close the door on either.

  • Option 1 would certainly be useful for analytics, where you typically want the current state of a SQL query DAG, but once in a while you're digging deeper (e.g., an exec asking why the numbers changed) and need to see what was seen at some time in the past. This option is nice because it adds zero complexity until you're ready to use it. It isn't really doable with any existing mutable SQL database, so to get this today an analytics person has to emulate the experience. It'd be great if Materialize could make it happen automagically. I could imagine this becoming part of the SQL standard and BI vendors adding support for the new keyword.
  • Option 2 seems more like an implementation detail of Materialize, and it certainly requires the user to know about TVCs. I can imagine it being helpful for systems downstream of Materialize that want to consume CDCv2 en masse rather than having to SUBSCRIBE. That said, I don't have a specific use case in mind today.

@chuck-alt-delete
Contributor

We don’t have to be opinionated in the sense of restricting users, but AS OF avoids a footgun users will run into if they choose option 2, which is that it is not possible (convenient?) today to expire data from sources. Option 1 handles expiration automatically through compaction.

@matthelm

A few follow-up items on Option 1 that were discussed in Slack:

  • The ergonomics of AS OF could get tricky if, for whatever reason, we were to start batching changes. We currently process changes for each logical timestamp and this is one more reason to keep doing so.
  • I expect users will want newly created materialized views to backfill such that a user could drop and recreate a materialized view and assuming the new query was equivalent still be able to see how the query progresses over logical time.
  • With that in mind, I think it'll be important to educate users that AS OF isn't based on the processing time of the materialized view but rather on the logical timestamp of the source data. I think it's best to represent this feature as the state of a materialized view at any point in logical time, rather than as a lookup of the table's state at a wall-clock time.

@frankmcsherry
Contributor

frankmcsherry commented Apr 20, 2023

The difference that stands out to me between options 1 and 2 is that (1) does not allow historical modification and (2) does. At least, that is the difference in what they can communicate (otherwise they seem pretty similar). The flip side of the coin is that (1) communicates completeness for a time, and (2) does not.

@RobinClowers
Contributor

My experience building the source overview page in the console is relevant here. Because Materialize does not produce a diff when there is no change in a row, we had to synthesize points that represent a steady state (usually 0 bytes/rows ingested, but not always). This is further complicated by having to figure out exactly which timestamp to assign to those points so they properly align with other points. All of this would be avoided with option 2, where we just write down the values at each collection interval and know exactly when the collection happened.
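Under option 2, the synthesis described above reduces to ordinary SQL (a hedged sketch with hypothetical table and column names): generate the expected collection instants and left-join the recorded measurements, defaulting gaps to zero.

```sql
-- One point per minute over the last hour; minutes with no recorded
-- measurement become explicit zero rows instead of missing points.
SELECT grid.ts, coalesce(m.bytes_ingested, 0) AS bytes_ingested
FROM generate_series(
       now() - INTERVAL '1 hour', now(), INTERVAL '1 minute') AS grid(ts)
LEFT JOIN source_metrics m ON m.collected_at = grid.ts;
```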


6 participants