-
Notifications
You must be signed in to change notification settings - Fork 459
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Develop coherent strategy for exposing historical data in internal tables #18823
Comments
Some relevant background: the data warehousing space has (probably many) existing idioms, one of which is Kimball's dimensional modeling. There are idioms for slowly changing dimensions that have the benefits that we wouldn't be inventing things from scratch. Potentially, they would be convertible into DD style collections. Additionally, they probably resemble idioms that we would be expecting from users. |
This is highly relevant to customer use cases as well. Customers often want to analyze historical data. For option 1, we don't currently expose the ability to customize compaction to users, so For option 2, I'd say customers are most comfortable with this since it's how things work in other databases. However, we would need an easy way for users to handle "compaction" themselves in this case. |
I think if we ever do support user-configurable compaction (and @chaas 's team does want to support it eventually), then we don't need to be opinionated here: users are free to use whatever approach they want. So the question here is more about how we choose to use the system ourselves. (Obviously, though, our experiences can and should inform product decisions about what we should improve in order to improve the experience for users too). |
Copying in my comment from Slack:
|
Both of these methods seem to have their own merits. Option 1 seems more immediately useful to me as a Analytics person, but as more of the ecosystem evolves to streaming I could imagine Option 2 also having a role. Ideally we don't close the door on either.
|
We don’t have to be opinionated in the sense of restricting users, but |
A few follow-up items on Option 1 that were discussed in Slack
|
The difference that stands out to me about options 1 and 2 is that (1.) does not allow historical modification and (2.) does allow historical modification. At least, that is the difference in what they can communicate (otherwise they seem pretty similar). Flip side of the coin is that (1.) communicates completeness for a time, and (2.) does not. |
My experience building the source overview page in console is relevant here. Because materialize does not produce a diff when there is no change in a row, we had to synthesize points that represent a steady state (usually 0 bytes / rows ingested, but not always). This is further complicated by having to figure out exactly what timestamp we should assign to those points so they would properly align with other points. All of this would be avoided with option 2, where we just write down the values at each collection interval, and we know exactly when the collection happened. |
It has become clear that the pattern of using internal tables as a historical record of events is common in practice. For example, the historical record of resource usage metrics, a log of sink and source errors, and a log of cluster status changes (OOMs, crashes, etc.) either are already available in the web console or will be soon.
This historical data is exposed in at least two ways today:
SELECT ... AS OF
,SUBSCRIBE ... AS OF
, etc.). The user should consider the differential-dataflow timestamp field to be the timestamp of the event or measurement.The main advantages of the second approach are that it does not require knowledge of exotic features in order to use, and it also allows further SQL transformation of the result set including timestamps (e.g., bucketing), whereas the first approach requires such logic to be done on the client. The main advantage of the first approach is that it doesn't require ad-hoc compaction/cleanup logic (since Differential already does this for us). Furthermore, any machinery developed to support this approach can be straightforwardly extended to help real customers/users that have workflows that depend on historical queries. One possible mitigation of the difficulties with approach 1 is a
CHANGES
operator that would surface the changes to a relation as a relation that could then be further transformed. I am not sure whether we will ever have this feature or how difficult it is to implement.This issue is intended to centralize discussion about these two methods; the ultimate goal is to settle on one to be used throughout the codebase.
The text was updated successfully, but these errors were encountered: