pg_last_xlog_receive_location() is called for Postgres 10/11 databases #4384

@alexadriaanse

Description

We're seeing the following errors resulting from queries that Datadog runs against one of our Postgres databases:

function pg_last_xlog_receive_location() does not exist at character 35

It looks like the Datadog Agent sometimes runs a pre-PG10 query against a PG10+ database.

Output of the info page

● datadog-agent.service - Datadog Agent
   Loaded: loaded (/lib/systemd/system/datadog-agent.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2019-08-15 20:07:29 UTC; 2min 26s ago
 Main PID: 10238 (agent)
    Tasks: 70 (limit: 4915)
   Memory: 231.4M
      CPU: 22.802s
   CGroup: /system.slice/datadog-agent.service
           ├─10238 /opt/datadog-agent/bin/agent/agent run -p /opt/datadog-agent/run/agent.pid
           └─10408 java -Xmx200m -Xms50m -classpath /opt/datadog-agent/bin/agent/dist/jmx/jmxfetch.jar org.datadog.jmxfetch.App --ipc_host localhost --ipc_port 5001 --check_period 15000 --thread_pool_size 3 --collection_timeout 60 --reconnection_timeout 10 --reconnection_thread_pool_size 3 --log_level INFO --reporter

Additional environment details (Operating System, Cloud provider, etc):
We're running the datadog-agent 1:6.13.0-1 Debian package on Debian Stretch on a VM within AWS. The agent runs on the host directly, but the Postgres databases it monitors run within Docker containers.

Steps to reproduce the issue:
I haven't built an isolated test case, but this seemed to work in our environment:

  1. Have the Datadog Agent monitor multiple Postgres databases of different versions (9.6, 10, and 11). Configure it so that the configuration for a v9.6 database is immediately followed by the configuration for a v11 database in /etc/datadog-agent/conf.d/postgres.d/conf.yaml. If you flip the order (so that the v9.6 database follows the v11 database in conf.yaml), the problem goes away, at least for the v11 database.
  2. Watch the logs of the v11 database for errors.
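The instance ordering described in step 1 can be sketched as follows (a hypothetical conf.yaml fragment; the hostnames, port, and username are placeholders, not our actual configuration):

```yaml
init_config:

instances:
  # v9.6 instance listed first -- this ordering triggers the errors
  - host: pg96.example.internal
    port: 5432
    username: datadog
  # v11 instance immediately after -- this is the database that logs
  # "function pg_last_xlog_receive_location() does not exist"
  - host: pg11.example.internal
    port: 5432
    username: datadog
```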

Describe the results you received:
I'm seeing this in the logs for the Postgres 11 database:

2019-08-15 20:22:31 UTC [8802-121] staff@xref ERROR:  function pg_last_xlog_receive_location() does not exist at character 35
2019-08-15 20:22:31 UTC [8802-122] staff@xref HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
2019-08-15 20:22:31 UTC [8802-123] staff@xref STATEMENT:
	SELECT abs(pg_xlog_location_diff(pg_last_xlog_receive_location(), pg_last_xlog_replay_location())) AS replication_delay_bytes, CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0 ELSE GREATEST (0, EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())) END, abs(pg_xlog_location_diff(pg_last_xlog_receive_location(), pg_last_xlog_replay_location())) AS replication_delay_bytes_dup
	 WHERE (SELECT pg_is_in_recovery())
2019-08-15 20:22:46 UTC [8802-124] staff@xref ERROR:  function pg_last_xlog_receive_location() does not exist at character 35
2019-08-15 20:22:46 UTC [8802-125] staff@xref HINT:  No function matches the given name and argument types. You might need to add explicit type casts.
2019-08-15 20:22:46 UTC [8802-126] staff@xref STATEMENT:
	SELECT abs(pg_xlog_location_diff(pg_last_xlog_receive_location(), pg_last_xlog_replay_location())) AS replication_delay_bytes, CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0 ELSE GREATEST (0, EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())) END, abs(pg_xlog_location_diff(pg_last_xlog_receive_location(), pg_last_xlog_replay_location())) AS replication_delay_bytes_dup
	 WHERE (SELECT pg_is_in_recovery())

Describe the results you expected:
No query errors.

Additional information you deem important (e.g. issue happens only occasionally):
I dug through the postgres.py code a bit and added some debugging information to our local copy. This Postgres check has different queries for different versions of Postgres, and I can see that it picks the right query (one that doesn't call pg_last_xlog_receive_location()) for the PG11 database. I'm wondering if some variable is leaking between multiple database checks, so that the query intended for the wrong (previous) database is at times executed against the database in question.
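The kind of leak I suspect could look something like the sketch below. This is not the actual postgres.py code; the class and attribute names are hypothetical, and it only illustrates how caching a version-specific query at the class level (shared across all configured instances) rather than the instance level would reproduce the observed symptom:

```python
# Hypothetical sketch: a version-specific query cached on the class,
# so state leaks between check instances for different databases.

class PostgresCheck:
    # Class-level cache, shared by every instance -- the suspected leak.
    replication_query = None

    def __init__(self, version):
        self.version = version  # major version of the monitored database

    def get_replication_query(self):
        # Bug: once any instance populates the cache, all instances reuse it,
        # regardless of their own database's version.
        if PostgresCheck.replication_query is None:
            if self.version >= 10:
                PostgresCheck.replication_query = (
                    "SELECT pg_last_wal_receive_lsn() ...")
            else:
                PostgresCheck.replication_query = (
                    "SELECT pg_last_xlog_receive_location() ...")
        return PostgresCheck.replication_query

# With a 9.6 instance configured (and run) first, the v11 instance
# inherits the pre-PG10 query:
first = PostgresCheck(9.6)
second = PostgresCheck(11)
first.get_replication_query()
print("xlog" in second.get_replication_query())  # True: pre-10 query leaks
```

Storing the query on `self` instead of the class (or keying the cache by version) would avoid this; that matches the observation that flipping the instance order in conf.yaml changes which query the v11 database receives.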

We're seeing these errors very frequently, although I'm not sure if they happen every time the database is checked.
