Postgres Performance Optimization: Cache baseline metrics and apply updates incrementally #17554
Conversation
I'm a bit concerned about the increased memory impact of `_baseline_metrics`, because the size of this dict is not bounded by `pg_stat_statements.max`. In a high-cardinality database with lots of evictions in `pg_stat_statements`, the size of `_baseline_metrics` could grow quickly.
Re the unbounded growth due to query eviction, I added an expiry time (currently set to 10 minutes) past which the baseline will be re-fetched: decf7e8.
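A minimal sketch of the expiry idea described in the comment above (hypothetical names, not the PR's actual code; the real change is in the linked commit): the baseline cache records when it was last populated, and is rebuilt from a full `pg_stat_statements` snapshot once it is older than the TTL, which also drops entries for evicted queryids.

```python
import time

BASELINE_TTL_S = 600  # 10 minutes, matching the expiry mentioned above


class BaselineCache:
    """Hypothetical sketch: full-baseline cache with time-based expiry."""

    def __init__(self, ttl_s=BASELINE_TTL_S):
        self._ttl_s = ttl_s
        self._baseline = {}  # queryid -> metrics row
        self._fetched_at = None

    def is_expired(self):
        # Never populated, or older than the TTL.
        return self._fetched_at is None or (time.time() - self._fetched_at) > self._ttl_s

    def refresh(self, full_rows):
        # Re-populate from a full pg_stat_statements snapshot. Queryids
        # evicted since the last refresh are dropped here, which is what
        # keeps the cache size bounded over time.
        self._baseline = {row["queryid"]: row for row in full_rows}
        self._fetched_at = time.time()


cache = BaselineCache()
assert cache.is_expired()  # nothing fetched yet
cache.refresh([{"queryid": 1, "calls": 10}])
assert not cache.is_expired()  # freshly populated
```

The key point is that expiry bounds staleness rather than size directly: the cache can still grow between refreshes, but every refresh discards rows for evicted queries.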
@lu-zhengda I tweaked the implementation slightly. I realized that aggregating the metric data by query signature in
Test Results: 296 tests, 256 ✅, 2m 0s ⏱️. For more details on these failures, see this check. Results for commit f7b78bd.
@@ -77,6 +77,93 @@ def test_dbm_enabled_config(integration_check, dbm_instance, dbm_enabled_key, db
    assert check._config.dbm_enabled == dbm_enabled

@requires_over_10
Note: this test is failing on PG 9. We may have to skip this optimization on that version. It seems that not all of the expected rows are returned by the QUERY_CALLS_QUERY, which prevents correct functioning.
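One way the version gating mentioned above could look (a hypothetical sketch, not the PR's actual code): compare the server's numeric version, as reported by `SHOW server_version_num`, against the Postgres 10 threshold and fall back to full fetches on older servers.

```python
# Hypothetical sketch of gating the optimization by server version:
# on Postgres < 10, fall back to fetching the full pg_stat_statements
# dataset on every run instead of taking the incremental path.

PG10_VERSION_NUM = 100000  # server_version_num for 10.0


def use_incremental_path(server_version_num):
    # server_version_num is the integer from SHOW server_version_num,
    # e.g. 90624 for 9.6.24 and 100023 for 10.23.
    return server_version_num >= PG10_VERSION_NUM


assert not use_incremental_path(90624)   # 9.6.x: use the full fetch
assert use_incremental_path(100023)      # 10.x and later: incremental
```

Using the numeric form avoids parsing version strings, and the pre-10 two-part numbering (9.6 → 90600) still compares correctly against the 100000 cutoff.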
LGTM. Minor change on the config default.
What does this PR do?
#17187 was reverted because we observed that it could lead to inflated metrics. The root cause is that `compute_derivative_rows` currently assumes that all rows in `pg_stat_statements` are passed as input. It does not compute correct results when only a subset of `pg_stat_statements` is passed in, because that can lead to taking the diff of two different `pg_stat_statements` rows, which produces incoherent results.

The approach taken here is to introduce a `baseline_metrics` cache, which holds one entry for each `pg_stat_statements` row and is populated once from the full `pg_stat_statements` dataset. On subsequent check runs, only queries that have been called since the previous run are returned, and `baseline_metrics` is updated for just those queryids.

This allows the full set of normalized rows to be constructed with one new function: `_apply_deltas`. It runs before `compute_derivative_rows` and uses `baseline_metrics` to reconstruct the full set of metrics rows. This relies on the fact that a query that hasn't been called between check runs has no change in metrics, so we can reuse the cached values.

Motivation
Additional Notes
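The reconstruction step described in the PR summary can be sketched as follows (a simplified illustration with hypothetical names, not the PR's exact `_apply_deltas` code): rows returned this run overwrite their cached baseline entries, and the merged dict yields the full row set to hand to the derivative computation.

```python
def apply_deltas(baseline_metrics, returned_rows):
    """Sketch: merge rows for recently-called queries into the cached
    baseline, then return the full reconstructed row set.

    baseline_metrics: dict of queryid -> metrics row (full snapshot)
    returned_rows: rows only for queries called since the previous run
    """
    for row in returned_rows:
        # A query that was called since the last run gets its fresh row.
        baseline_metrics[row["queryid"]] = row
    # Queries not called since the last run have unchanged cumulative
    # counters, so their cached rows are still valid as-is.
    return list(baseline_metrics.values())


baseline = {
    1: {"queryid": 1, "calls": 10},
    2: {"queryid": 2, "calls": 5},
}
# Only query 1 was called since the previous run.
full_rows = apply_deltas(baseline, [{"queryid": 1, "calls": 12}])
assert {r["queryid"]: r["calls"] for r in full_rows} == {1: 12, 2: 5}
```

Because `pg_stat_statements` counters are cumulative, reusing an untouched cached row is exactly equivalent to having re-read it, which is why the derivative computation downstream sees a coherent full snapshot.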
Review checklist (to be filled by reviewers)
- Add the `qa/skip-qa` label if the PR doesn't need to be tested during QA.
- Add a `backport/<branch-name>` label to the PR and it will automatically open a backport PR once this one is merged.