Unable to do standby to primary and primary to standby (node rejoin/failover/switchover) #721

wasiualhasib · 2021-10-06T16:23:07Z

Node1:pgPrimary:192.168.1.11
Node2:pgStandby1:192.168.1.12
Node3:pgStandby2:192.168.1.13
Node4:pgWitness:192.168.1.14
@ibarwick
@fanf2
@martinmarques

Here node1 was primary where node2 and node3 as standby node and witness node was node4. It was working fine.
Next I stop node1 in that case node2 became new primary and node3 is standby as a replica with node2 upstream as expected.
Next when I stop node2 in that case node3 became new primary as expected.
Now I want to back node1 as a primary and node2 and node3 standby in that case I could not do that. Below issue I found in that case. I have tried through node rejoin but in that with --dry-run it say it is ok but without --dry-run showing error. Those error given below.

Note: I am also giving may configuration of those postgresql.conf and repmgr.conf with this issue.

I want to make node1 as a primary and replica should be node2 and node3 . Also need to update those changes data to new primary.

This is the view from node1:

WARNING: following issues were detected

node "PG-Node2" (ID: 2) is registered as standby but running as primary
node "PG-Node-Witness" (ID: 4) reports a different upstream (reported: "PG-Node3", expected "PG-Node1")
`

This is the view from Node2:
`
[postgres@pgStandby1 pg_wal]$ /usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf cluster show

WARNING: following issues were detected

node "PG-Node1" (ID: 1) is running but the repmgr node record is inactive
node "PG-Node-Witness" (ID: 4) reports a different upstream (reported: "PG-Node3", expected "PG-Node2")
`

===================
This is the view from Node3:

WARNING: following issues were detected

node "PG-Node1" (ID: 1) is running but the repmgr node record is inactive
node "PG-Node2" (ID: 2) is running but the repmgr node record is inactive
node "PG-Node3" (ID: 3) is registered as primary but running as standby
`
===========================
This is the view from Node4:

WARNING: following issues were detected

node "PG-Node1" (ID: 1) is running but the repmgr node record is inactive
node "PG-Node2" (ID: 2) is running but the repmgr node record is inactive
node "PG-Node3" (ID: 3) is registered as primary but running as standby
`
====================
This is the view of postgresql.conf file of node1,node2 and node3

`
listen_addresses = '*'
max_wal_senders = 10
max_replication_slots = 10
wal_level = 'replica'
hot_standby = on
archive_mode = on
archive_command = 'test ! -f /mnt/pg_log_archive/%f && cp %p /mnt/pg_log_archive/%f'
shared_preload_libraries = 'repmgr'

`

This is the view of postgresql.conf file of node4 witness node

listen_addresses = '*' shared_preload_libraries = 'repmgr'

=====================================
Node 1: repmgr.conf
`
node_id=1
node_name='PG-Node1'
conninfo='host=192.168.1.11 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/pgsql/12/data'
failover='automatic'
promote_command='/usr/pgsql-12/bin/repmgr standby promote -f /etc/repmgr/12/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-12/bin/repmgr standby follow -f /etc/repmgr/12/repmgr.conf --log-to-file --upstream-node-id=%n'
priority=100
monitor_interval_secs=2
connection_check_type='ping'
reconnect_attempts=4
reconnect_interval=8
primary_visibility_consensus=true
standby_disconnect_on_failover=true

repmgrd_service_start_command='sudo /usr/bin/systemctl start repmgr12.service'
repmgrd_service_stop_command='sudo /usr/bin/systemctl stop repmgr12.service'

service_start_command='sudo /usr/bin/systemctl start postgresql-12.service'
service_stop_command='sudo /usr/bin/systemctl stop postgresql-12.service'
service_restart_command='sudo /usr/bin/systemctl restart postgresql-12.service'
service_reload_command='sudo /usr/bin/systemctl reload postgresql-12.service'

monitoring_history=yes
log_status_interval=60

`

Node 2: repmgr.conf

`
node_id=2
node_name='PG-Node2'
conninfo='host=192.168.1.12 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/pgsql/12/data'
failover='automatic'
promote_command='/usr/pgsql-12/bin/repmgr standby promote -f /etc/repmgr/12/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-12/bin/repmgr standby follow -f /etc/repmgr/12/repmgr.conf --log-to-file --upstream-node-id=%n'
priority=60
monitor_interval_secs=2
connection_check_type='ping'
reconnect_attempts=4
reconnect_interval=8
primary_visibility_consensus=true
standby_disconnect_on_failover=true

repmgrd_service_start_command='sudo /usr/bin/systemctl start repmgr12.service'
repmgrd_service_stop_command='sudo /usr/bin/systemctl stop repmgr12.service'

service_start_command='sudo /usr/bin/systemctl start postgresql-12.service'
service_stop_command='sudo /usr/bin/systemctl stop postgresql-12.service'
service_restart_command='sudo /usr/bin/systemctl restart postgresql-12.service'
service_reload_command='sudo /usr/bin/systemctl reload postgresql-12.service'

monitoring_history=yes
log_status_interval=60

`

Node3 : repmgr.conf

`
node_id=3
node_name='PG-Node3'
conninfo='host=192.168.1.13 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/pgsql/12/data'
failover='automatic'
promote_command='/usr/pgsql-12/bin/repmgr standby promote -f /etc/repmgr/12/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-12/bin/repmgr standby follow -f /etc/repmgr/12/repmgr.conf --log-to-file --upstream-node-id=%n'
priority=40
monitor_interval_secs=2
connection_check_type='ping'
reconnect_attempts=4
reconnect_interval=8
primary_visibility_consensus=true
standby_disconnect_on_failover=true

repmgrd_service_start_command='sudo /usr/bin/systemctl start repmgr12.service'
repmgrd_service_stop_command='sudo /usr/bin/systemctl stop repmgr12.service'

service_start_command='sudo /usr/bin/systemctl start postgresql-12.service'
service_stop_command='sudo /usr/bin/systemctl stop postgresql-12.service'
service_restart_command='sudo /usr/bin/systemctl restart postgresql-12.service'
service_reload_command='sudo /usr/bin/systemctl reload postgresql-12.service'

monitoring_history=yes
log_status_interval=60
`

Node4: repmgr.conf

`
node_id=4
node_name='PG-Node-Witness'
conninfo='host=192.168.1.14 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/pgsql/12/data'
primary_visibility_consensus=true

repmgrd_service_start_command='sudo /usr/bin/systemctl start repmgr12.service'
repmgrd_service_stop_command='sudo /usr/bin/systemctl stop repmgr12.service'

service_start_command='sudo /usr/bin/systemctl start postgresql-12.service'
service_stop_command='sudo /usr/bin/systemctl stop postgresql-12.service'
service_restart_command='sudo /usr/bin/systemctl restart postgresql-12.service'
service_reload_command='sudo /usr/bin/systemctl reload postgresql-12.service'

`

Node Rejoin issue step by step:

Step 1: Run node rejoin command to join with node1:192.168.1.11
`
[postgres@pgStandby1 pg_wal]$ /usr/pgsql-12/bin/repmgr node rejoin -f /etc/repmgr/12/repmgr.conf -d 'host=192.168.1.11 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.local.conf,postgresql.conf --verbose --dry-run

NOTICE: using provided configuration file "/etc/repmgr/12/repmgr.conf"
ERROR: database is still running in state "in production"
HINT: "repmgr node rejoin" cannot be executed on a running node
`

Step 2: Standby1 stopped

[postgres@pgStandby1 pg_wal]$ systemctl stop postgresql-12.service

Step3: Now run node rejoin command again with --dry-run and showing below message.
`
[postgres@pgStandby1 pg_wal]$ /usr/pgsql-12/bin/repmgr node rejoin -f /etc/repmgr/12/repmgr.conf -d 'host=192.168.1.11 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.local.conf,postgresql.conf --verbose --dry-run

NOTICE: using provided configuration file "/etc/repmgr/12/repmgr.conf"
INFO: replication connection to the rejoin target node was successful
INFO: local and rejoin target system identifiers match
DETAIL: system identifier is 7015769547447271848
INFO: prerequisites for using pg_rewind are met
WARNING: specified file "/var/lib/pgsql/12/data/postgresql.local.conf" not found, skipping
INFO: file "postgresql.conf" would be copied to "/tmp/repmgr-config-archive-PG-Node2/postgresql.conf"
INFO: 1 files would have been copied to "/tmp/repmgr-config-archive-PG-Node2"
WARNING: unable to delete directory "/tmp/repmgr-config-archive-PG-Node2"
DETAIL: Directory not empty
HINT: directory may need to be manually removed
INFO: pg_rewind would now be executed
DETAIL: pg_rewind command is:
/usr/pgsql-12/bin/pg_rewind -D '/var/lib/pgsql/12/data' --source-server='host=192.168.1.11 user=repmgr dbname=repmgr connect_timeout=2'
INFO: prerequisites for executing NODE REJOIN are met
`
Step4: Now run node rejoin command again without --dry-run and showing below error message

'[postgres@pgStandby1 pg_wal]$ /usr/pgsql-12/bin/repmgr node rejoin -f /etc/repmgr/12/repmgr.conf -d 'host=192.168.1.11 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.local.conf,postgresql.conf --verbose
NOTICE: using provided configuration file "/etc/repmgr/12/repmgr.conf"
INFO: prerequisites for using pg_rewind are met
WARNING: specified file "/var/lib/pgsql/12/data/postgresql.local.conf" not found, skipping
INFO: 1 files copied to "/tmp/repmgr-config-archive-PG-Node2"
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "/usr/pgsql-12/bin/pg_rewind -D '/var/lib/pgsql/12/data' --source-server='host=192.168.1.11 user=repmgr dbname=repmgr connect_timeout=2'"
ERROR: pg_rewind execution failed
DETAIL: pg_rewind: servers diverged at WAL location 0/D000000 on timeline 1
pg_rewind: error: could not open file "/var/lib/pgsql/12/data/pg_wal/00000002000000000000000D": No such file or directory
pg_rewind: fatal: could not find previous WAL record at 0/D000028
'

As per my understanding wal not found but it is not available there. Not found on the standby(Primary). If it is the issue of wal then how can I retrive that wal.

I have given every section from my side. Several time i tried i could not do that. Anyone can help me. I need exeact solution.

The text was updated successfully, but these errors were encountered:

tiagoscan · 2022-01-26T14:34:27Z

Hi @wasiualhasib
You need to delete folder "main" in your Node1, and execute the command "standby clone" from your new primary, also execute "standby register" in Node1.

After this, execute "standby switchover" in Node1, check output "cluster show" and you'l see Node1 set like primary.

wasiualhasib changed the title ~~Unable to do standby to primary and primary to standby.~~ Unable to do standby to primary and primary to standby (node rejoin/failover/switchover) Oct 6, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unable to do standby to primary and primary to standby (node rejoin/failover/switchover) #721

Unable to do standby to primary and primary to standby (node rejoin/failover/switchover) #721

wasiualhasib commented Oct 6, 2021 •

edited

Loading

tiagoscan commented Jan 26, 2022

Unable to do standby to primary and primary to standby (node rejoin/failover/switchover) #721

Unable to do standby to primary and primary to standby (node rejoin/failover/switchover) #721

Comments

wasiualhasib commented Oct 6, 2021 • edited Loading

`

`

`

tiagoscan commented Jan 26, 2022

wasiualhasib commented Oct 6, 2021 •

edited

Loading