Compaction seems to be broken #2170
Reading some data from files, with tailing set to true, we end up in a state with a materialized tripdata collection whose footprint remains steady over time. This is with DIFFERENTIAL_EAGER_MERGE=1000 set, so in principle this should be compacting down to the limiting size given enough time. For some reason this isn't happening. A quick println! check indicates that the coordinator is providing AllowCompaction messages for the relevant arrangement, u15. When we check out the arrangement information, what we see suggests that physical compaction continues to happen, but logical compaction (where times would align and we would collapse down to just 3105 records) doesn't seem to occur.
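For context, here is a minimal standalone sketch (not Materialize's coordinator code) of what an AllowCompaction message ultimately drives in differential dataflow: the two compaction frontiers on a trace handle. This assumes a recent differential-dataflow API, where the frontiers are set via set_logical_compaction and set_physical_compaction (older versions call them advance_by and distinguish_since); the collection and timestamps here are made up for illustration.

```rust
use differential_dataflow::input::Input;
use differential_dataflow::operators::arrange::ArrangeByKey;
use differential_dataflow::trace::TraceReader;
use timely::progress::frontier::AntichainRef;

fn main() {
    timely::execute_directly(move |worker| {
        // Build a dataflow with an arranged collection, keeping its trace handle.
        let (mut input, mut trace) = worker.dataflow::<u64, _, _>(|scope| {
            let (handle, coll) = scope.new_collection::<(String, u64), isize>();
            let arranged = coll.arrange_by_key();
            (handle, arranged.trace)
        });

        // Load some data at ten distinct timestamps.
        for t in 0..10u64 {
            input.insert(("key".to_string(), t));
            input.advance_to(t + 1);
            input.flush();
            worker.step();
        }

        // Logical compaction permits times in the trace to be advanced to the
        // frontier (aligning them so records can consolidate); physical
        // compaction permits batches behind the frontier to be merged.
        trace.set_logical_compaction(AntichainRef::new(&[10]));
        trace.set_physical_compaction(AntichainRef::new(&[10]));
        worker.step();
    });
}
```

The symptom reported above is that advancing these frontiers kept merging batches, but the record count never collapsed to the consolidated size.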
Comments

More detail on this. Compaction in materialized isn't broken. What is going on is that differential has a hard time automatically deciding between two reasonable choices:

1. Doing no work when no new data arrive (staying quiescent).
2. Continuing to compact traces even when no new data arrive.

The former is popular and was requested. The latter is what we should be doing here, at least if we want the footprint to drop once the compaction window catches up to the moment we stopped adding data. It is a bit tricky for differential to automatically sort this out. It could just keep running hot, continually compacting traces; this is fine, but a weird experience on personal computers. It could have specialized logic for total orders, where this is easier to see; that means a lot more abstraction breaking. Probably the compaction machinery should just be more "user programmable". There are some exacerbating factors in
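To make "times would align and we would collapse down" concrete, here is a toy, self-contained sketch of logical compaction for totally ordered times (my illustration, not differential's implementation): every time behind the compaction frontier is advanced to the frontier, after which updates at formerly distinct times can consolidate and cancel.

```rust
use std::collections::HashMap;

// Toy logical compaction for totally ordered (u64) times: advance every time
// behind `frontier` up to `frontier`, then consolidate by summing the diffs
// of identical (data, time) pairs and dropping zeros.
fn logically_compact(
    updates: Vec<(&'static str, u64, i64)>,
    frontier: u64,
) -> Vec<(&'static str, u64, i64)> {
    let mut totals: HashMap<(&'static str, u64), i64> = HashMap::new();
    for (data, time, diff) in updates {
        let advanced = time.max(frontier); // times behind the frontier align on it
        *totals.entry((data, advanced)).or_insert(0) += diff;
    }
    let mut out: Vec<_> = totals
        .into_iter()
        .filter(|&(_, diff)| diff != 0) // fully cancelled updates vanish
        .map(|((d, t), diff)| (d, t, diff))
        .collect();
    out.sort();
    out
}

fn main() {
    // An insert at time 3, retracted at time 7.
    let updates = vec![("row", 3, 1), ("row", 7, -1)];
    // Frontier 5: the times stay distinct, so both records survive.
    assert_eq!(logically_compact(updates.clone(), 5).len(), 2);
    // Frontier 8: both times align on 8 and the records cancel.
    assert_eq!(logically_compact(updates, 8).len(), 0);
}
```

This is why the footprint only drops once the compaction window catches up to the last change: until then, an insert and its retraction sit at distinct times and cannot cancel.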
Putting a single record in reduces the size to the compacted state. Not a great solution, but this might also be something that we only expect in some cases (worst case: lots of changes all of a sudden, and then quiet; like when loading static data).
Filed #2310 to track the issue that we may want to use fewer timestamps when loading data. That isn't the only potential resolution to this issue, and wouldn't resolve all cases, but it would probably mitigate the most obvious ones.
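Concretely, the fewer-timestamps idea might look like the following sketch (my reading of the suggestion, not the actual #2310 change): load static data at a single timestamp and advance the input clock once, rather than once per record, so the resulting trace holds one distinct time to compact instead of thousands.

```rust
use differential_dataflow::input::Input;

fn main() {
    timely::execute_directly(move |worker| {
        let mut input = worker.dataflow::<u64, _, _>(|scope| {
            let (handle, coll) = scope.new_collection::<String, isize>();
            coll.inspect(|x| println!("{:?}", x));
            handle
        });

        // All records land at the session's current time (0) ...
        for i in 0..1000 {
            input.insert(format!("record-{}", i));
        }
        // ... and the clock is downgraded once for the whole load, leaving a
        // single distinct timestamp for compaction to deal with.
        input.advance_to(1);
        input.flush();
        worker.step();
    });
}
```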
I hit this yesterday when restarting from a persisted Kafka upsert source. Just to make sure I clearly understand the issue here: the main reason we don't want to do logical compaction over and over again is that we would then continually be updating times and potentially consolidating records, and most likely, after some time, that would just be wasted cycles, right? It's not so much "logical compaction is expensive" as "at some point, the data in the batch are fully logically compacted, and after that, doing more logical compaction is kind of silly" -- do I have that right?

It seems like when there are single-dimensional times, you can detect whether a batch is fully logically compacted by looking at the number of distinct times. I dunno yet how that generalizes to partially ordered times, and I know you said it "could have specialized logic for total orders, where this is easier to see", at the cost of more abstraction breaking. I'm not really advocating that we should do this, just want to double-check my understanding.
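That detection test, sketched for totally ordered times (my illustration; the original comment trails off, so "distinct times" is my completion of it): logical compaction at a frontier advances every time strictly behind it, so a batch is fully compacted exactly when it holds zero distinct times behind the frontier.

```rust
use std::collections::HashSet;

// For totally ordered times, logical compaction at `frontier` advances every
// time strictly behind it up to the frontier. A batch is therefore fully
// logically compacted exactly when it contains no distinct times behind the
// frontier: re-running compaction on it would be a no-op.
fn fully_compacted(times: &[u64], frontier: u64) -> bool {
    let behind: HashSet<u64> =
        times.iter().copied().filter(|&t| t < frontier).collect();
    behind.is_empty()
}

fn main() {
    assert!(!fully_compacted(&[3, 7, 9], 8)); // 3 and 7 would still advance
    assert!(fully_compacted(&[8, 8, 9], 8));  // advancing changes nothing
    println!("ok");
}
```

For partially ordered times the analogous test would be whether advancing each time by the frontier (along the lines of differential's Lattice::advance_by) is a no-op; whether that is cheap to detect is exactly the open question above.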