persist-client: remove sink_correction_peak metrics #27341
Merged
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
The
SinkMetrics::update_sink_correction_peak_metrics
method assumed "quiescence", i.e. the temporary ceasing of updates to sink metrics while it was executing. While the method was correct under that assumption, the calling code unfortunately was not able to uphold it. Specifically,step_or_park
is invoked per worker, so whenupdate_sink_correction_peak_metrics
is invoked it can only assume that the current worker won't update the sink metrics, but it cannot assume anything about other workers.As a result of the violated assumption, "Negative aggregate length for persist sink correction" could be logged even when nothing was actually wrong with the persist sink implementation.
This is solved here by removing the problematic peak metrics entirely. Stopping all workers regularly to update them is too costly and updating them without quiescence can lead to incorrect metric values, reducing the metrics' usefullness. The current believe is that the remaining sink correction metrics should be sufficient to monitor persist_sink memory usage in production.
This is an alternative to #27306, following discussion on that PR.
Motivation
Fixes https://github.com/MaterializeInc/database-issues/issues/8072
Checklist
$T ⇔ Proto$T
mapping (possibly in a backwards-incompatible way), then it is tagged with aT-proto
label.