persist: compaction does not work correctly for certain (non uncommon) ingestion patterns #15093
To reproduce, clone the gh15093 branch and let it complete. Then see that the S3 usage keeps climbing and never comes back down ... I gave it 1 hour.
The size of the data in reality:
So 100MB for real.
I tried a simpler workload that simply inserts some keys and then deletes them all. This is also not getting cleaned up. No extra mzcompose is required, just put this in a testdrive file. Storaged will spin at 100% for some time before settling down. The blob storage, however, will settle at 2.7GB for 10M messages, even though the source is now empty:
If I understand correctly, the old data may get removed eventually, but this is contingent on receiving even more messages from the source. That, however, is not a given -- the source may not receive any more messages in the future, yet it still needs to compact and clean up existing data for GDPR compliance.
CC'ing @hlburak due to the compliance issue.
I believe I found the issue! I created a model of Philip's problematic ingestion behavior using pure differential-dataflow, to see how Spine handles it compared to persist. The ingestion pattern is modeled after continually ingesting upserts for the same keys.
Results for Spine: as expected, DD/Spine is dealing with this just fine! We keep around only a bounded number of updates; they get consolidated down. This is important for later.
Results for persist:
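A minimal differential-dataflow sketch of this kind of model (the crate versions, key count, and round count are illustrative assumptions, not the actual model from this comment): each round retracts the previous value for every key and inserts a new one, and an arrangement asks Spine to maintain the history.

```rust
// Assumes timely = "0.12" and differential-dataflow = "0.12".
use differential_dataflow::input::InputSession;
use differential_dataflow::operators::arrange::ArrangeByKey;
use timely::dataflow::operators::Probe;

fn main() {
    timely::execute_from_args(std::env::args(), |worker| {
        let mut input = InputSession::<u64, (u64, u64), isize>::new();
        let mut probe = timely::dataflow::ProbeHandle::new();

        worker.dataflow::<u64, _, _>(|scope| {
            // Arranging the collection makes Spine responsible for maintaining
            // (and compacting) the update history.
            input
                .to_collection(scope)
                .arrange_by_key()
                .stream
                .probe_with(&mut probe);
        });

        let keys: u64 = 1_000;
        for round in 0..100u64 {
            for key in 0..keys {
                if round > 0 {
                    // Retract the value written in the previous round ...
                    input.update((key, round - 1), -1);
                }
                // ... and insert the new one: an upsert in (data, diff) form.
                input.update((key, round), 1);
            }
            input.advance_to(round + 1);
            input.flush();
            // Run the worker until the arrangement has caught up with this round.
            while probe.less_than(input.time()) {
                worker.step();
            }
        }
    })
    .expect("timely computation failed");
}
```

With Spine driving compaction, the retraction/insertion pairs for past rounds consolidate away as the frontier advances, which is the behavior the persist side was failing to match.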
I also think that this issue is the root cause of a number of other issues:
@philip-stoev I changed the title to reflect the findings more explicitly, I hope that's okay.
@aljoscha No problem. Now that I know the operation of the entire thing is very sensitive to the shape of the workload, once you have a fix out I will test with many more workload shapes than I initially envisioned.
It should not be sensitive to the workload. Once we fix that, all workloads should just work! 🤞
@danhhz and I talked briefly about this; a sketch for an immediate fix is:
#15375 and #15356 will land in this release and should improve compaction performance for upserts, though we don't have a benchmark to specifically measure this yet. In other general compaction news, we've also merged in:
This still leaves us with several categories of work to continue on. The biggest ones that come to mind are: better handling of empty batches, fetching blobs from S3 in parallel (and eventually all blobs, not just the initial stage), and more comprehensive/structured timeouts.
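For the "fetching blobs from S3 in parallel" item, the usual shape of that change is to drive the fetches through a bounded-concurrency stream instead of awaiting each one in turn. A minimal sketch, with a hypothetical `fetch_blob` helper standing in for the real S3 read path:

```rust
use futures::stream::{self, StreamExt};

// Placeholder for an actual S3 GET; the real signature lives in persist's blob layer.
async fn fetch_blob(key: String) -> Vec<u8> {
    println!("fetching {key}");
    Vec::new()
}

// Issue up to `limit` fetches concurrently rather than one at a time.
async fn fetch_all(keys: Vec<String>, limit: usize) -> Vec<Vec<u8>> {
    stream::iter(keys)
        .map(|key| fetch_blob(key))
        .buffer_unordered(limit) // at most `limit` requests in flight
        .collect()
        .await
}

#[tokio::main]
async fn main() {
    let keys = (0..8).map(|i| format!("blob-{i}")).collect();
    let blobs = fetch_all(keys, 4).await;
    println!("fetched {} blobs", blobs.len());
}
```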
another compaction perf fix #15575
#15575 has been merged, is on prod, and has massively reduced the time we're spending per compaction. This will help compaction keep pace with writes, and greatly improves the odds that our compaction results merge cleanly into state. #15732 has additionally been merged but is not yet released. This one greatly improves efficacy as well, identifying more opportunities to compact and improving the odds of compaction applying successfully. With these changes, I tried running Philip's original test of gh15093. After running, we found:
And then 20 min later:
We're no longer leaking bytes! The data didn't consolidate out particularly much, but eh... problem for another day (tomorrow? 😄). The amount we can logically compact is a function of the shard's since being downgraded, so that's no longer just the domain of persist.
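To make the "function of the shard's since" point concrete, here is a toy model (a hypothetical helper, not persist's implementation) of logical compaction followed by consolidation: an insert and its later retraction can only cancel once the since has advanced past both of their timestamps, because only then are they forwarded to the same time.

```rust
// (data, time, diff) triples, where `data` stands in for a full key/value row.
fn forward_and_consolidate(updates: &mut Vec<(String, u64, i64)>, since: u64) {
    for (_data, time, _diff) in updates.iter_mut() {
        // Logical compaction: forward every update before `since` up to `since`.
        if *time < since {
            *time = since;
        }
    }
    // Physical consolidation: identical (data, time) pairs sum their diffs and
    // disappear when the sum is zero.
    updates.sort();
    let mut consolidated: Vec<(String, u64, i64)> = Vec::new();
    for (data, time, diff) in updates.drain(..) {
        match consolidated.last_mut() {
            Some((d, t, r)) if *d == data && *t == time => *r += diff,
            _ => consolidated.push((data, time, diff)),
        }
    }
    consolidated.retain(|(_, _, r)| *r != 0);
    *updates = consolidated;
}

fn main() {
    // An upsert of key k: "k=1" inserted at t=1, then retracted at t=5 and
    // replaced by "k=2".
    let mut updates = vec![
        ("k=1".to_string(), 1, 1),
        ("k=1".to_string(), 5, -1),
        ("k=2".to_string(), 5, 1),
    ];
    // With since = 3 the retraction cannot cancel the original insert:
    // they still sit at different (forwarded) times, so 3 updates remain.
    forward_and_consolidate(&mut updates, 3);
    assert_eq!(updates.len(), 3);
    // Once the since advances past both timestamps, the insert/retract pair
    // cancels and only the current value survives.
    forward_and_consolidate(&mut updates, 10);
    assert_eq!(updates.len(), 1);
}
```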
This currently does not compact:
I gave it more than 1 hour, but the reported consumption remains at 2MB:
Create various database objects and check the storage usage reported for them in the mz_storage_usage table. Scenarios involving upsert and deletions have been disabled due to MaterializeInc#15093 Relates to MaterializeInc/cloud#3737
@aljoscha do you still have a copy of https://github.com/aljoscha/materialize/tree/gh15093 sitting around somewhere?
I guess you meant https://github.com/philip-stoev/materialize/tree/gh15093? You can also use the test case from #15093 (comment) as well as the disabled parts of test/storage-usage/mzcompose.py, currently in main.
nope! turns out paul had a copy of aljoscha's branch, which had a tighter repro of the same badness you were seeing: https://github.com/MaterializeInc/materialize/compare/main...pH14:materialize:gh15093-aljoscha?expand=1
We were accidentally serializing all compaction work in the PersistCompactionWorker task. The intent was to limit concurrency of the actual work using a semaphore, enqueue requests in a channel leading into that semaphore, and drop requests when the channel fills up (i.e. both the semaphore and the channel are full). This commit changes the impl to match that intent. Touches MaterializeInc#15093
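A minimal sketch of the intended pattern described in that commit message (hypothetical types, not the actual PersistCompactionWorker code): a bounded channel feeds a worker loop, a semaphore caps how many compactions run at once, and `try_send` drops new requests once both are full.

```rust
use std::sync::Arc;
use tokio::sync::{mpsc, Semaphore};

struct CompactReq {
    shard: String, // hypothetical request payload
}

async fn run_compaction(req: CompactReq) {
    // Placeholder for the actual compaction work.
    println!("compacting {}", req.shard);
}

fn spawn_worker(concurrency: usize, queue_depth: usize) -> mpsc::Sender<CompactReq> {
    let (tx, mut rx) = mpsc::channel::<CompactReq>(queue_depth);
    let semaphore = Arc::new(Semaphore::new(concurrency));
    tokio::spawn(async move {
        while let Some(req) = rx.recv().await {
            // Acquire a permit before spawning, so at most `concurrency`
            // compactions run concurrently; the channel buffers the rest.
            let permit = Arc::clone(&semaphore)
                .acquire_owned()
                .await
                .expect("semaphore closed");
            tokio::spawn(async move {
                run_compaction(req).await;
                drop(permit);
            });
        }
    });
    tx
}

#[tokio::main]
async fn main() {
    let tx = spawn_worker(4, 16);
    for i in 0..32 {
        // `try_send` drops the request when the queue is full instead of
        // blocking the caller, matching the "drop requests" intent.
        let _ = tx.try_send(CompactReq { shard: format!("s{i}") });
    }
    tokio::time::sleep(std::time::Duration::from_secs(1)).await;
}
```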
okay, aljoscha's sim is pretty happy on top of #16784, so moving on to philip's simpler testdrive repro #15093 (comment). i ran it, but lopping one zero off the end to start (seems to strike a nice balance of quick iteration and fidelity to the original bugs). once things settle into a state where the source is just appending empty batches, this is what our spine looks like:
ignore the format, it's something i've invented that only exists locally. the way to read this is that we're continually hitting the empty batch optimization and squashing stuff into the big batch.
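A toy model of the empty-batch optimization being referred to (hypothetical types, not Spine's actual data structures): an append that carries no updates simply extends the time range covered by the most recent batch instead of adding a new entry.

```rust
#[derive(Debug)]
struct Batch {
    lower: u64,
    upper: u64,
    len: usize, // number of updates in the batch
}

fn append(spine: &mut Vec<Batch>, batch: Batch) {
    if batch.len == 0 {
        if let Some(last) = spine.last_mut() {
            // Empty batch: no new data, just extend the covered time range.
            debug_assert_eq!(last.upper, batch.lower);
            last.upper = batch.upper;
            return;
        }
    }
    spine.push(batch);
}

fn main() {
    let mut spine = vec![Batch { lower: 0, upper: 10, len: 1_000 }];
    // An idle source keeps appending empty batches just to advance the
    // frontier; these all get squashed into the existing batch.
    for t in 1..=5u64 {
        append(&mut spine, Batch { lower: 10 * t, upper: 10 * (t + 1), len: 0 });
    }
    assert_eq!(spine.len(), 1);
    println!("{spine:?}");
}
```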
so close! here you see the idle merge effort fueling the big batch and then it compacts! but when it compacts, we end up with 2,000,000 updates instead of everything consolidating out. I suspect the since isn't advanced far enough to allow them to consolidate.
In differential dataflow, idle_merge_effort configures an arrangement to introduce extra fuel on every operator invocation. This causes data to continue compacting, even when new updates are not being introduced. Persist doesn't have anything that directly corresponds to operator invocations, but we do the best we can to match the spirit of idle_merge_effort. Specifically, we introduce the additional fuel on each compare_and_append call. In practice, compare_and_append gets called on a regular cadence to advance the frontier, so even if no updates are being added, we'll eventually compact things down. This does mean, however, that tuning the constant involved will likely be quite different. Touches MaterializeInc#15093 Touches MaterializeInc#16607
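A toy model of the fuel accounting described in that commit message (hypothetical types and constants, not persist's implementation): each compare_and_append call contributes a fixed amount of fuel, and pending merge work burns it down, so even frontier-only appends eventually finish outstanding compactions.

```rust
struct PendingMerge {
    remaining_work: usize, // e.g. number of updates left to merge
}

struct Machine {
    idle_merge_effort: usize,
    pending: Vec<PendingMerge>,
}

impl Machine {
    /// Models the fuel bookkeeping on every append, including empty appends
    /// that only advance the upper.
    fn compare_and_append(&mut self) {
        let mut fuel = self.idle_merge_effort;
        self.pending.retain_mut(|merge| {
            if fuel == 0 {
                return true;
            }
            // Spend fuel on this merge; drop it once its work is done.
            let spent = fuel.min(merge.remaining_work);
            merge.remaining_work -= spent;
            fuel -= spent;
            merge.remaining_work > 0
        });
    }
}

fn main() {
    let mut machine = Machine {
        idle_merge_effort: 1_000,
        pending: vec![PendingMerge { remaining_work: 10_000 }],
    };
    // Even with zero-update appends (pure frontier advancement), the pending
    // merge completes after enough calls.
    for _ in 0..10 {
        machine.compare_and_append();
    }
    assert!(machine.pending.is_empty());
}
```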
What version of Materialize are you using?
1a7ceca86c5624e20100803f19f10bab8b49111e
How did you install Materialize?
Docker image
What is the issue?
If one performs repeated upsert updates over 100MB worth of data, the total amount of data stored in S3 continues to grow and never goes down, regardless of how much time it is given. In addition to being a general inefficiency, this would be a problem for GDPR compliance.
This is also true if the source is left idle for a long time.
Relevant log output
No response