Client_addr on pg_catalog.pg_stat_replication wrong ip address - istio enabled #1629

Please, answer some short questions which should help us to understand your problem / question better?

- **Which image of the operator are you using?** e.g. registry.opensource.zalan.do/acid/postgres-operator:v1.6.3
- **Where do you run it - cloud or metal?** Kubernetes PKS
- **Are you running Postgres Operator in production?** yes
- **Type of issue?** Question / Bug

While trying to do an in-cluster upgrade from PGVERSION 12 to PGVERSION 13, I discovered that the members' IPs are not correctly reported in pg_catalog.pg_stat_replication.

While running **python3 /scripts/inplace_upgrade.py 3** (three-node cluster), I get the following error message:

```
2021-09-27 14:58:37,457 inplace_upgrade INFO: No PostgreSQL configuration items changed, nothing to reload.
2021-09-27 14:58:37,500 inplace_upgrade WARNING: Kubernetes RBAC doesn't allow GET access to the 'kubernetes' endpoint in the 'default' namespace. Disabling 'bypass_api_service'.
2021-09-27 14:58:37,504 inplace_upgrade INFO: establishing a new patroni connection to the postgres cluster
2021-09-27 14:58:37,561 inplace_upgrade ERROR: Member hco-pg-1-1 is not streaming from the primary
```

After debugging, I discovered that in pg_catalog.pg_stat_replication, client_addr is 127.0.0.6 for both replicas that stream from the primary:

```
postgres=# SELECT * from pg_catalog.pg_stat_replication;
 pid | usesysid | usename | application_name | client_addr | client_hostname | client_port |         backend_start         | backend_xmin |   state   |  sent_lsn  | write_lsn  | flush_lsn  | replay_lsn |    write_lag    |    flush_lag    |   replay_lag    | sync_priority | sync_state |          reply_time
-----+----------+---------+------------------+-------------+-----------------+-------------+-------------------------------+--------------+-----------+------------+------------+------------+------------+-----------------+-----------------+-----------------+---------------+------------+-------------------------------
 894 |    16637 | standby | hco-pg-1-2       | 127.0.0.6   |                 |       40175 | 2021-09-27 13:40:26.305049+00 |              | streaming | 9/4E027518 | 9/4E027518 | 9/4E027518 | 9/4E027518 | 00:00:00.002132 | 00:00:00.002812 | 00:00:00.002913 |             0 | async      | 2021-09-27 15:09:14.095679+00
 886 |    16637 | standby | hco-pg-1-1       | 127.0.0.6   |                 |       36155 | 2021-09-27 13:40:05.528001+00 |              | streaming | 9/4E027518 | 9/4E027518 | 9/4E027518 | 9/4E027518 | 00:00:00.001441 | 00:00:00.002128 | 00:00:00.002146 |             0 | async      | 2021-09-27 15:09:14.09543+00
(2 rows)
```

Cluster looks like this:

```
root@hco-pg-1-0:/home/postgres# patronictl list
+ Cluster: hco-pg-1 (6995167694694490192) ----+----+-----------+
| Member     | Host       | Role    | State   | TL | Lag in MB |
+------------+------------+---------+---------+----+-----------+
| hco-pg-1-0 | 11.32.16.6 | Leader  | running | 11 |           |
| hco-pg-1-1 | 11.32.16.7 | Replica | running | 11 |         0 |
| hco-pg-1-2 | 11.32.16.9 | Replica | running | 11 |         0 |
+------------+------------+---------+---------+----+-----------+
```

After debugging /scripts/inplace_upgrade.py, I found that in the code below, **ip = member.conn_kwargs().get('host')** retrieves the correct replica IP, but the subsequent lookup of the replication lag by that IP, **lag = streaming.get(ip)**, returns None. The keys of the streaming dict are the client_addr values from pg_catalog.pg_stat_replication, and for both replicas that value is 127.0.0.6, so the replica IP never matches.

```python
    def ensure_replicas_state(self, cluster):
        """
        This method checks the status of all replicas and also tries to open connections
        to all of them and puts into the `self.replica_connections` dict for a future usage.
        """
        self.replica_connections = {}
        # Map client_addr -> replay lag in bytes for every row in pg_stat_replication.
        streaming = {a: l for a, l in self.postgresql.query(
            ("SELECT client_addr, pg_catalog.pg_{0}_{1}_diff(pg_catalog.pg_current_{0}_{1}(),"
             " COALESCE(replay_{1}, '0/0'))::bigint FROM pg_catalog.pg_stat_replication")
            .format(self.postgresql.wal_name, self.postgresql.lsn_name))}
        print("Streaming: ", streaming)

        def ensure_replica_state(member):
            # The lookup key is the member's host as known to Patroni (the pod IP).
            ip = member.conn_kwargs().get('host')
            lag = streaming.get(ip)
            if lag is None:
                return logger.error('Member %s is not streaming from the primary', member.name)
            if lag > 16*1024*1024:
                return logger.error('Replication lag %s on member %s is too high', lag, member.name)
```
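
The mismatch can be reproduced without a database. The following is a minimal sketch, not the script's own code: the rows and pod IPs are copied from the outputs above, the lag values are placeholder zeros, and the helper names (`lag_by_client_addr`, `lag_by_application_name`, `not_streaming`) are hypothetical. It also shows one possible workaround, keying the lookup by application_name instead of client_addr, which assumes that application_name always equals the Patroni member name, as it does in the query output above.

```python
# Rows mimic pg_stat_replication from this issue:
# (application_name, client_addr, lag_bytes). Lag values are placeholders.
ROWS = [
    ("hco-pg-1-2", "127.0.0.6", 0),
    ("hco-pg-1-1", "127.0.0.6", 0),
]

# Replicas as reported by `patronictl list`: member name -> pod IP.
MEMBERS = {"hco-pg-1-1": "11.32.16.7", "hco-pg-1-2": "11.32.16.9"}


def lag_by_client_addr(rows):
    """Same keying as the quoted script: lag indexed by client_addr."""
    return {addr: lag for _, addr, lag in rows}


def lag_by_application_name(rows):
    """Hypothetical alternative: lag indexed by application_name."""
    return {name: lag for name, _, lag in rows}


def not_streaming(members, streaming, key):
    """Return the sorted member names whose lookup key is missing from `streaming`."""
    return sorted(name for name, ip in members.items()
                  if streaming.get(key(name, ip)) is None)


by_addr = lag_by_client_addr(ROWS)
by_name = lag_by_application_name(ROWS)

# Both rows share 127.0.0.6, so the dict collapses to one entry
# and neither pod IP is found in it.
print(by_addr)
print(not_streaming(MEMBERS, by_addr, key=lambda name, ip: ip))

# Keyed by application_name, both replicas are found.
print(not_streaming(MEMBERS, by_name, key=lambda name, ip: name))
```

With client_addr as the key, both replicas are reported as not streaming; with application_name as the key, none are. Whether that keying is acceptable for the real script is an open question, since it relies on the replicas connecting with their member name as application_name.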

My question is whether this happens because we use Istio sidecar injection (Envoy proxy) for our Zalando Postgres clusters or whether we have some other issue, and how we can solve it.

Thank you!
/Cristi Vlad 
