persist-client: remove sink_correction_peak metrics #27341

Merged
merged 1 commit into from
May 29, 2024

Contributor

@teskje teskje commented May 29, 2024

The SinkMetrics::update_sink_correction_peak_metrics method assumed "quiescence", i.e. the temporary ceasing of updates to sink metrics while it was executing. While the method was correct under that assumption, the calling code unfortunately was not able to uphold it. Specifically, step_or_park is invoked per worker, so when update_sink_correction_peak_metrics is invoked it can only assume that the current worker won't update the sink metrics, but it cannot assume anything about other workers.

As a result of the violated assumption, "Negative aggregate length for persist sink correction" could be logged even when nothing was actually wrong with the persist sink implementation.
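
A minimal sketch of how such a spurious negative aggregate could arise without quiescence. It assumes, which the description above does not confirm, that the aggregate length is derived from two monotonic counters (insertions and deletions) read one after the other. All names (`CorrectionCounters`, `aggregate_len`, `between_reads`) are hypothetical, and the concurrent worker is simulated deterministically by a closure that runs between the two reads:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for sink metrics shared by all workers:
/// two monotonic counters that any worker may bump at any time.
struct CorrectionCounters {
    insertions: AtomicU64,
    deletions: AtomicU64,
}

impl CorrectionCounters {
    fn new() -> Self {
        Self {
            insertions: AtomicU64::new(0),
            deletions: AtomicU64::new(0),
        }
    }

    fn insert(&self, n: u64) {
        self.insertions.fetch_add(n, Ordering::Relaxed);
    }

    fn delete(&self, n: u64) {
        self.deletions.fetch_add(n, Ordering::Relaxed);
    }

    /// Computes `insertions - deletions` using two separate loads.
    /// `between_reads` models whatever other workers do between the loads;
    /// under quiescence it would be a no-op.
    fn aggregate_len(&self, between_reads: impl FnOnce(&Self)) -> i64 {
        let ins = self.insertions.load(Ordering::Relaxed) as i64;
        between_reads(self);
        let del = self.deletions.load(Ordering::Relaxed) as i64;
        ins - del
    }
}

/// No other worker touches the counters during the computation.
fn quiescent_len() -> i64 {
    let c = CorrectionCounters::new();
    c.insert(5);
    c.delete(5);
    c.aggregate_len(|_| {})
}

/// Another worker inserts and deletes 3 updates between the two loads.
/// The true length is 0 before and after, but the reader sees 5 - 8.
fn racy_len() -> i64 {
    let c = CorrectionCounters::new();
    c.insert(5);
    c.delete(5);
    c.aggregate_len(|c| {
        c.insert(3);
        c.delete(3);
    })
}

fn main() {
    println!("quiescent: {}", quiescent_len());
    println!("racy: {}", racy_len());
}
```

In this sketch the counters themselves are never wrong. Only the reader's snapshot is inconsistent, which matches the observation that the error could be logged even though the sink was behaving correctly.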

This is solved here by removing the problematic peak metrics entirely. Stopping all workers regularly to update them is too costly, and updating them without quiescence can lead to incorrect metric values, reducing the metrics' usefulness. The current belief is that the remaining sink correction metrics should be sufficient to monitor persist_sink memory usage in production.

This is an alternative to #27306, following discussion on that PR.

Motivation

  • This PR fixes a recognized bug.

Fixes https://github.com/MaterializeInc/database-issues/issues/8072

Checklist

@teskje teskje force-pushed the remove-correction-peak-metrics branch from f3e7eda to f202b2c Compare May 29, 2024 08:15
@teskje teskje force-pushed the remove-correction-peak-metrics branch from f202b2c to 7046b63 Compare May 29, 2024 08:52
@teskje teskje marked this pull request as ready for review May 29, 2024 09:46
@teskje teskje requested a review from a team as a code owner May 29, 2024 09:46
@teskje teskje requested review from a team, antiguru and bkirwi May 29, 2024 09:46
Member

@antiguru antiguru left a comment

Thanks, looks all good to me

@teskje
Contributor Author

teskje commented May 29, 2024

TFTRs!

@teskje teskje merged commit 51a124b into MaterializeInc:main May 29, 2024
76 checks passed
@teskje teskje deleted the remove-correction-peak-metrics branch May 29, 2024 16:08
3 participants