Compaction seems to be broken #2170

Open · frankmcsherry opened this issue Mar 2, 2020 · 3 comments

Labels: A-dataflow (Area: dataflow), C-bug (Category: something is broken), T-memory (Theme: memory)

@frankmcsherry (Contributor) commented Mar 2, 2020

Reading some data from files, with tailing set to true, we end up in a state with a materialized tripdata collection where

materialize=> select * from mz_catalog.mz_records_per_dataflow_global;
 id  |                                name                                | records 
-----+--------------------------------------------------------------------+---------
 779 | Dataflow: materialize.public.tripdata_primary_idx                  | 1829830
 197 | Dataflow: mz_catalog.mz_dataflow_names_primary_idx                 |    1016
 316 | Dataflow: mz_catalog.mz_records_per_dataflow_primary_idx           |     595
 520 | Dataflow: mz_catalog.mz_perf_peek_durations_core_primary_idx       |       0
 485 | Dataflow: mz_catalog.mz_perf_arrangement_records_primary_idx       |     974
 375 | Dataflow: mz_catalog.mz_perf_dependency_frontiers_primary_idx      |     208
 567 | Dataflow: mz_catalog.mz_perf_peek_durations_bucket_primary_idx     |       0
 167 | Dataflow: mz_catalog.mz_addresses_with_unit_length_primary_idx     |    1368
 355 | Dataflow: mz_catalog.mz_records_per_dataflow_global_primary_idx    |     504
 239 | Dataflow: mz_catalog.mz_dataflow_operator_dataflows_primary_idx    |    1684
 287 | Dataflow: mz_catalog.mz_records_per_dataflow_operator_primary_idx  |     895
 724 | Dataflow: mz_catalog.mz_perf_peek_durations_aggregates_primary_idx |       0
(12 rows)

Time: 1.829 ms
materialize=> select count(*) from (select distinct * from tripdata);
 count 
-------
  3105
(1 row)

which remains steady with time. This is with DIFFERENTIAL_EAGER_MERGE=1000 set, so in principle this should compact down to the limiting size given enough time. For some reason this isn't happening. Some quick println! debugging indicates that the coordinator is providing AllowCompaction messages for the relevant arrangement u15.

When we check out the arrangement information

materialize=> select * from mz_arrangement_sizes where operator = 855;
 operator | worker | records | batches 
----------+--------+---------+---------
      855 |      0 | 1829830 |       2
(1 row)

Time: 1.046 ms
materialize=> 

which suggests that physical compaction continues to happen, but logical compaction (where times would align and we would collapse down to just 3105 records) doesn't seem to occur.
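
For intuition, here is a minimal sketch of what logical compaction would accomplish here, written as plain Rust over toy ((key, val), time, diff) tuples rather than differential's actual trace and batch types: once the compaction frontier passes the load timestamps, the times align and updates of the same row consolidate (or cancel).

```rust
use std::collections::HashMap;

/// Advance every time that is not beyond `frontier` up to `frontier`, then
/// consolidate: updates whose times become equal sum their diffs, and rows
/// whose diffs cancel drop out entirely. A toy version of what logical
/// compaction does to the contents of a trace.
fn logically_compact(
    updates: Vec<((String, String), u64, i64)>, // ((key, val), time, diff)
    frontier: u64,
) -> Vec<((String, String), u64, i64)> {
    let mut consolidated: HashMap<((String, String), u64), i64> = HashMap::new();
    for ((key, val), time, diff) in updates {
        let advanced = time.max(frontier); // times at or before the frontier all become the frontier
        *consolidated.entry(((key, val), advanced)).or_insert(0) += diff;
    }
    let mut result: Vec<_> = consolidated
        .into_iter()
        .filter(|&(_, diff)| diff != 0)
        .map(|((kv, time), diff)| (kv, time, diff))
        .collect();
    result.sort();
    result
}

fn main() {
    // The same row loaded at three different timestamps occupies three records in
    // the trace; once the compaction frontier passes those times, the times align
    // and the three updates consolidate into one record with multiplicity 3.
    let updates = vec![
        (("k1".into(), "v1".into()), 3, 1),
        (("k1".into(), "v1".into()), 5, 1),
        (("k1".into(), "v1".into()), 7, 1),
        (("k2".into(), "v2".into()), 6, 1),
    ];
    let compacted = logically_compact(updates, 10);
    assert_eq!(compacted.len(), 2); // four updates -> two distinct rows
    assert_eq!(compacted[0].2, 3);  // ("k1", "v1") now carries diff 3
}
```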

@frankmcsherry frankmcsherry added C-bug Category: something is broken A-dataflow Area: dataflow labels Mar 2, 2020
@frankmcsherry frankmcsherry self-assigned this Mar 2, 2020
@frankmcsherry (Contributor, Author) commented Mar 14, 2020

More detail on this.

Compaction in materialized isn't broken. What is going on is that differential has a hard time automatically deciding between two reasonable choices:

  1. "you have advanced time but haven't added any data; I should go quiet" and
  2. "you have advanced time but haven't added any data; I should apply logical compaction to otherwise physically compact traces".

The former is popular and was requested. The latter is what we should be doing here, at least if we want the footprint to drop once the compaction window catches up to the moment we stopped adding data.

It is a bit tricky for differential to sort this out automatically. It could just keep running hot, continually compacting traces; this works, but it is a weird experience on personal computers. It could have specialized logic for total orders, where this is easier to see, but that means a lot more abstraction breaking. Probably the compaction machinery should just be more "user programmable".
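
As a rough illustration of what more "user programmable" compaction could look like from the outside, here is a sketch in which the holder of a trace handle drives the compaction frontier itself once it knows no more data is coming. It assumes differential-dataflow's `arrange_by_self` and the `TraceReader` methods `set_logical_compaction` / `set_physical_compaction` (named `advance_by` / `distinguish_since` in older releases); it is not how materialized actually drives compaction, just an illustration of choice 2 above.

```rust
use differential_dataflow::input::InputSession;
use differential_dataflow::operators::arrange::ArrangeBySelf;
use differential_dataflow::trace::TraceReader;
use timely::progress::frontier::AntichainRef;

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        let mut input = InputSession::<u64, String, isize>::new();

        // Arrange the collection and keep a read handle to its trace.
        let mut trace = worker.dataflow(|scope| {
            input.to_collection(scope).arrange_by_self().trace
        });

        // Load "static" data at increasing timestamps, as materialized does.
        for t in 0..1_000u64 {
            input.advance_to(t);
            input.insert(format!("row-{}", t % 10)); // only 10 distinct rows
        }
        input.advance_to(1_000);
        input.flush();

        // Choice 2 from above: even though no new data arrives, keep telling the
        // trace that times before 1_000 need not be distinguished, so ongoing
        // merge work can fold the 1_000 updates down to the 10 distinct rows.
        for _ in 0..100 {
            trace.set_logical_compaction(AntichainRef::new(&[1_000]));
            trace.set_physical_compaction(AntichainRef::new(&[1_000]));
            worker.step();
        }
    })
    .expect("timely computation failed");
}
```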

There are some exacerbating factors in materialized:

  1. We have a 60-second compaction window by default. This means there is plenty of time for the collection to reach "physically compacted" form (one non-empty batch) before we would notice the benefits of logical compaction. Reducing this window made the issue start to go away. It may not be as important to have such a window if we have more reliable timestamp advances in our input streams.

  2. We load static data (pre-existing Kafka streams and files) with increasing timestamps, to give an "interactive" feel. This results in some wasted computation (see "Create sources 'as of' a timestamp/frontier", #1210) but also a lot of uncompacted changes. Had the sources been loaded at a single time, there would be no need for compaction (but also no fun partial / incorrect answers streaming in). Resolving #1210 and getting folks to use it would mitigate a fair bit of the pain here.

  3. We don't allow users to control the compaction frontier. This is totally something that could be exposed, either explicitly (e.g. to "pin" a view at a logical time to ensure it stays available for queries at that time) or implicitly (if we cannot select a timestamp for a query due to tight compaction, we can stop compacting each input arrangement until the upper bounds of each align and allow the query).

Putting a single record in drops the footprint down to the compacted size. Not a great solution, but this may also be something we only expect in certain cases (worst case: lots of changes all of a sudden, and then quiet, as when loading static data).

@frankmcsherry (Contributor, Author) commented:

Filed #2310 to track the issue that we may want to use fewer timestamps in loading data. That isn't the only potential resolution to this issue, and wouldn't resolve all cases, but it would probably mitigate the most obvious cases.
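
As a toy illustration of why fewer load timestamps helps (plain Rust, illustrative numbers only): before any compaction, a trace holds roughly one update per distinct (row, time) pair, so loading everything at a single timestamp leaves far less for compaction to do in the first place.

```rust
use std::collections::HashSet;

/// Toy model: the number of updates a trace must hold before any compaction
/// is the number of distinct (row, time) pairs it has absorbed.
fn uncompacted_updates(rows: &[&str], times: &[u64]) -> usize {
    rows.iter()
        .zip(times.iter())
        .map(|(row, time)| (row.to_string(), *time))
        .collect::<HashSet<_>>()
        .len()
}

fn main() {
    let rows = ["a", "a", "b", "a", "b"]; // five loaded rows, two distinct
    // Loading each row at its own timestamp leaves five updates for compaction to chew through...
    assert_eq!(uncompacted_updates(&rows, &[1, 2, 3, 4, 5]), 5);
    // ...while loading everything at a single timestamp leaves only two to begin with.
    assert_eq!(uncompacted_updates(&rows, &[1, 1, 1, 1, 1]), 2);
}
```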

@frankmcsherry frankmcsherry added the T-memory Theme: memory label Sep 3, 2020
@frankmcsherry frankmcsherry removed their assignment May 13, 2021
@awang awang added this to Icebox in Compute via automation Jun 30, 2021
@awang awang removed the G-dataflow label Jun 30, 2021
@uce uce added this to the Later milestone Jul 19, 2021
@ruchirK (Contributor) commented Dec 4, 2021

I hit this yesterday when restarting from a persisted Kafka upsert source. Just to make sure I clearly understand the issue here:

The main reason we don't want to do logical compaction over and over again is that we would then continually be updating times and potentially consolidating records, and most likely, after some time, that would just be wasted cycles, right? It's not so much "logical compaction is expensive" as "at some point, the data in the batch are fully logically compacted, and after that, doing more logical compaction is kind of silly" -- do I have that right?

It seems like when there are single-dimensional times, you can detect whether a batch is fully logically compacted by comparing the number of distinct (key, val) pairs with the number of distinct ((key, val), time) triples -- if there are more of the latter than the former, then there's potentially some compaction we could still do to consolidate things down.
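
A sketch of that check over plain update tuples (not the actual batch/cursor API), for totally ordered times:

```rust
use std::collections::HashSet;

/// Heuristic from above: if a batch holds more distinct ((key, val), time)
/// triples than distinct (key, val) pairs, advancing the compaction frontier
/// past those times could still consolidate records away.
fn compaction_could_help(updates: &[((String, String), u64, i64)]) -> bool {
    let keyed: HashSet<_> = updates.iter().map(|((k, v), _, _)| (k, v)).collect();
    let timed: HashSet<_> = updates.iter().map(|((k, v), t, _)| (k, v, t)).collect();
    timed.len() > keyed.len()
}

fn main() {
    // The same row present at two different times: compaction could still collapse these.
    let pending = vec![
        (("k".to_string(), "v".to_string()), 1, 1),
        (("k".to_string(), "v".to_string()), 2, 1),
    ];
    assert!(compaction_could_help(&pending));

    // Already fully compacted: one time per (key, val), nothing more to gain.
    let compacted = vec![(("k".to_string(), "v".to_string()), 2, 2)];
    assert!(!compaction_could_help(&compacted));
}
```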

I don't know yet how that generalizes to partially ordered times, and I know you said that specialized logic for total orders (where this is easier to see) would mean a lot more abstraction breaking. I'm not really advocating that we should do this; I just want to double-check that:

  1. if we had a heuristic for "could doing logical compaction lead to data reduction?", it would be easier for the system to know what to do, and

  2. the heuristic mentioned above is a valid one for totally ordered times.
