Description
We're seeing the following errors resulting from queries that Datadog runs against one of our Postgres databases:
function pg_last_xlog_receive_location() does not exist at character 35
It looks like the Datadog agent sometimes runs a pre-PG10 query (one calling the old pg_xlog_* functions, which PostgreSQL 10 renamed to pg_wal_*) against a PG10+ database.
Output of the info page
● datadog-agent.service - Datadog Agent
Loaded: loaded (/lib/systemd/system/datadog-agent.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2019-08-15 20:07:29 UTC; 2min 26s ago
Main PID: 10238 (agent)
Tasks: 70 (limit: 4915)
Memory: 231.4M
CPU: 22.802s
CGroup: /system.slice/datadog-agent.service
├─10238 /opt/datadog-agent/bin/agent/agent run -p /opt/datadog-agent/run/agent.pid
└─10408 java -Xmx200m -Xms50m -classpath /opt/datadog-agent/bin/agent/dist/jmx/jmxfetch.jar org.datadog.jmxfetch.App --ipc_host localhost --ipc_port 5001 --check_period 15000 --thread_pool_size 3 --collection_timeout 60 --reconnection_timeout 10 --reconnection_thread_pool_size 3 --log_level INFO --reporter
Additional environment details (Operating System, Cloud provider, etc):
We're running the datadog-agent 1:6.13.0-1 Debian package on Debian Stretch on a VM within AWS. The agent runs on the host directly, but the Postgres databases it monitors run within Docker containers.
Steps to reproduce the issue:
I haven't built an isolated test case, but this seemed to work in our environment:
- Have the Datadog agent monitor multiple Postgres databases of different versions (9.6, 10, and 11). Configure it so that the configuration for a v9.6 database is immediately followed by the configuration for a v11 database in /etc/datadog-agent/conf.d/postgres.d/conf.yaml. If you flip the order (so that the v9.6 database follows the v11 database in conf.yaml), the problem goes away, at least for the v11 database.
- Watch the logs of the v11 database for errors.
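Concretely, the instance ordering that triggered the problem looks roughly like this (the hosts, ports, and credentials below are placeholders, not our real values):

```yaml
init_config:

instances:
    # v9.6 database listed first...
  - host: localhost
    port: 5433
    username: datadog
    password: <PASSWORD>

    # ...immediately followed by the v11 database, which then logs the errors
  - host: localhost
    port: 5434
    username: datadog
    password: <PASSWORD>
```

Swapping the two instance entries made the errors stop for the v11 database.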
Describe the results you received:
I'm seeing this in the logs for the Postgres 11 database:
2019-08-15 20:22:31 UTC [8802-121] staff@xref ERROR: function pg_last_xlog_receive_location() does not exist at character 35
2019-08-15 20:22:31 UTC [8802-122] staff@xref HINT: No function matches the given name and argument types. You might need to add explicit type casts.
2019-08-15 20:22:31 UTC [8802-123] staff@xref STATEMENT:
SELECT abs(pg_xlog_location_diff(pg_last_xlog_receive_location(), pg_last_xlog_replay_location())) AS replication_delay_bytes, CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0 ELSE GREATEST (0, EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())) END, abs(pg_xlog_location_diff(pg_last_xlog_receive_location(), pg_last_xlog_replay_location())) AS replication_delay_bytes_dup
WHERE (SELECT pg_is_in_recovery())
2019-08-15 20:22:46 UTC [8802-124] staff@xref ERROR: function pg_last_xlog_receive_location() does not exist at character 35
2019-08-15 20:22:46 UTC [8802-125] staff@xref HINT: No function matches the given name and argument types. You might need to add explicit type casts.
2019-08-15 20:22:46 UTC [8802-126] staff@xref STATEMENT:
SELECT abs(pg_xlog_location_diff(pg_last_xlog_receive_location(), pg_last_xlog_replay_location())) AS replication_delay_bytes, CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0 ELSE GREATEST (0, EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())) END, abs(pg_xlog_location_diff(pg_last_xlog_receive_location(), pg_last_xlog_replay_location())) AS replication_delay_bytes_dup
WHERE (SELECT pg_is_in_recovery())
Describe the results you expected:
No query errors.
Additional information you deem important (e.g. issue happens only occasionally):
I dug through the postgres.py code a bit and added some debugging information to our local copy. This Postgres check has different queries for different versions of Postgres, and I can see that it picks the right query (one that doesn't call pg_last_xlog_receive_location()) for the PG11 database. I'm wondering whether some state is leaking between the checks for different databases, so that the query built for one (previous) database is sometimes executed against another.
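For illustration, here is a minimal sketch of the kind of state leak suspected above. This is NOT the actual postgres.py code; the class, method, and query names are hypothetical. If a version-specific query were lazily cached on the class rather than on each instance, whichever database is checked first would fix the query for every other configured database — matching the symptom where the v9.6 entry that precedes the v11 entry determines which query the v11 database receives:

```python
# Hypothetical stand-ins for the two version-specific queries.
XLOG_QUERY = "SELECT ... pg_last_xlog_receive_location() ..."  # pre-PG10
WAL_QUERY = "SELECT ... pg_last_wal_receive_lsn() ..."         # PG10+


class PostgresCheck:
    # BUG: a class-level attribute is shared by ALL check instances.
    replication_query = None

    def __init__(self, version):
        self.version = version

    def get_replication_query(self):
        # Lazy "cache": the first instance to run populates the shared
        # slot, and every later instance reuses it regardless of its
        # own server version.
        if PostgresCheck.replication_query is None:
            if self.version >= 10:
                PostgresCheck.replication_query = WAL_QUERY
            else:
                PostgresCheck.replication_query = XLOG_QUERY
        return PostgresCheck.replication_query


pg96 = PostgresCheck(9.6)
pg11 = PostgresCheck(11)

pg96.get_replication_query()         # the 9.6 instance happens to run first...
print(pg11.get_replication_query())  # ...and the v11 instance inherits the pre-PG10 query
```

Storing the chosen query on `self` (or keying a cache by server version) would avoid this; I haven't confirmed that this is what postgres.py actually does, it's just the failure mode the ordering behavior suggests.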
We're seeing these errors very frequently, although I'm not sure whether they occur on every check run.