Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unable to do standby to primary and primary to standby (node rejoin/failover/switchover) #721

Open
wasiualhasib opened this issue Oct 6, 2021 · 1 comment

Comments

@wasiualhasib
Copy link

wasiualhasib commented Oct 6, 2021

Node1:pgPrimary:192.168.1.11
Node2:pgStandby1:192.168.1.12
Node3:pgStandby2:192.168.1.13
Node4:pgWitness:192.168.1.14
@ibarwick
@fanf2
@martinmarques

  1. Here node1 was primary where node2 and node3 as standby node and witness node was node4. It was working fine.
  2. Next I stop node1 in that case node2 became new primary and node3 is standby as a replica with node2 upstream as expected.
  3. Next when I stop node2 in that case node3 became new primary as expected.
  4. Now I want to back node1 as a primary and node2 and node3 standby in that case I could not do that. Below issue I found in that case. I have tried through node rejoin but in that with --dry-run it say it is ok but without --dry-run showing error. Those error given below.

Note: I am also giving may configuration of those postgresql.conf and repmgr.conf with this issue.

I want to make node1 as a primary and replica should be node2 and node3 . Also need to update those changes data to new primary.

This is the view from node1:

`
[postgres@pgPrimary data]$ /usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-----------------+---------+----------------------+------------+----------+----------+----------+---------------------------------------------------------------
1 | PG-Node1 | primary | * running | | default | 100 | 1 | host=192.168.1.11 user=repmgr dbname=repmgr connect_timeout=2
2 | PG-Node2 | standby | ! running as primary | | default | 60 | 3 | host=192.168.1.12 user=repmgr dbname=repmgr connect_timeout=2
3 | PG-Node3 | standby | running | | default | 40 | 3 | host=192.168.1.13 user=repmgr dbname=repmgr connect_timeout=2
4 | PG-Node-Witness | witness | * running | ! PG-Node3 | default | 0 | n/a | host=192.168.1.14 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected

  • node "PG-Node2" (ID: 2) is registered as standby but running as primary
  • node "PG-Node-Witness" (ID: 4) reports a different upstream (reported: "PG-Node3", expected "PG-Node1")
    `

This is the view from Node2:
`
[postgres@pgStandby1 pg_wal]$ /usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf cluster show

ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-----------------+---------+-----------+------------+----------+----------+----------+---------------------------------------------------------------
1 | PG-Node1 | primary | ! running | | default | 100 | 1 | host=192.168.1.11 user=repmgr dbname=repmgr connect_timeout=2
2 | PG-Node2 | primary | * running | | default | 60 | 3 | host=192.168.1.12 user=repmgr dbname=repmgr connect_timeout=2
3 | PG-Node3 | standby | running | | default | 40 | 3 | host=192.168.1.13 user=repmgr dbname=repmgr connect_timeout=2
4 | PG-Node-Witness | witness | * running | ! PG-Node3 | default | 0 | n/a | host=192.168.1.14 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected

  • node "PG-Node1" (ID: 1) is running but the repmgr node record is inactive
  • node "PG-Node-Witness" (ID: 4) reports a different upstream (reported: "PG-Node3", expected "PG-Node2")
    `

===================
This is the view from Node3:

`[postgres@pgStandby2 data]$ /usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-----------------+---------+----------------------+----------+----------+----------+----------+---------------------------------------------------------------
1 | PG-Node1 | primary | ! running | | default | 100 | 1 | host=192.168.1.11 user=repmgr dbname=repmgr connect_timeout=2
2 | PG-Node2 | primary | ! running | | default | 60 | 3 | host=192.168.1.12 user=repmgr dbname=repmgr connect_timeout=2
3 | PG-Node3 | primary | ! running as standby | | default | 40 | 3 | host=192.168.1.13 user=repmgr dbname=repmgr connect_timeout=2
4 | PG-Node-Witness | witness | * running | PG-Node3 | default | 0 | n/a | host=192.168.1.14 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected

  • node "PG-Node1" (ID: 1) is running but the repmgr node record is inactive
  • node "PG-Node2" (ID: 2) is running but the repmgr node record is inactive
  • node "PG-Node3" (ID: 3) is registered as primary but running as standby
    `
    ===========================
    This is the view from Node4:

`
[osboxes@pgWitness ~]$ /usr/pgsql-12/bin/repmgr -f /etc/repmgr/12/repmgr.conf cluster show
ID | Name | Role | Status | Upstream | Location | Priority | Timeline | Connection string
----+-----------------+---------+----------------------+----------+----------+----------+----------+---------------------------------------------------------------
1 | PG-Node1 | primary | ! running | | default | 100 | 1 | host=192.168.1.11 user=repmgr dbname=repmgr connect_timeout=2
2 | PG-Node2 | primary | ! running | | default | 60 | 3 | host=192.168.1.12 user=repmgr dbname=repmgr connect_timeout=2
3 | PG-Node3 | primary | ! running as standby | | default | 40 | 3 | host=192.168.1.13 user=repmgr dbname=repmgr connect_timeout=2
4 | PG-Node-Witness | witness | * running | PG-Node3 | default | 0 | n/a | host=192.168.1.14 user=repmgr dbname=repmgr connect_timeout=2

WARNING: following issues were detected

  • node "PG-Node1" (ID: 1) is running but the repmgr node record is inactive
  • node "PG-Node2" (ID: 2) is running but the repmgr node record is inactive
  • node "PG-Node3" (ID: 3) is registered as primary but running as standby
    `
    ====================
    This is the view of postgresql.conf file of node1,node2 and node3

`
listen_addresses = '*'
max_wal_senders = 10
max_replication_slots = 10
wal_level = 'replica'
hot_standby = on
archive_mode = on
archive_command = 'test ! -f /mnt/pg_log_archive/%f && cp %p /mnt/pg_log_archive/%f'
shared_preload_libraries = 'repmgr'

`

This is the view of postgresql.conf file of node4 witness node

listen_addresses = '*' shared_preload_libraries = 'repmgr'

=====================================
Node 1: repmgr.conf
`
node_id=1
node_name='PG-Node1'
conninfo='host=192.168.1.11 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/pgsql/12/data'
failover='automatic'
promote_command='/usr/pgsql-12/bin/repmgr standby promote -f /etc/repmgr/12/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-12/bin/repmgr standby follow -f /etc/repmgr/12/repmgr.conf --log-to-file --upstream-node-id=%n'
priority=100
monitor_interval_secs=2
connection_check_type='ping'
reconnect_attempts=4
reconnect_interval=8
primary_visibility_consensus=true
standby_disconnect_on_failover=true

repmgrd_service_start_command='sudo /usr/bin/systemctl start repmgr12.service'
repmgrd_service_stop_command='sudo /usr/bin/systemctl stop repmgr12.service'

service_start_command='sudo /usr/bin/systemctl start postgresql-12.service'
service_stop_command='sudo /usr/bin/systemctl stop postgresql-12.service'
service_restart_command='sudo /usr/bin/systemctl restart postgresql-12.service'
service_reload_command='sudo /usr/bin/systemctl reload postgresql-12.service'

monitoring_history=yes
log_status_interval=60

`

Node 2: repmgr.conf

`
node_id=2
node_name='PG-Node2'
conninfo='host=192.168.1.12 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/pgsql/12/data'
failover='automatic'
promote_command='/usr/pgsql-12/bin/repmgr standby promote -f /etc/repmgr/12/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-12/bin/repmgr standby follow -f /etc/repmgr/12/repmgr.conf --log-to-file --upstream-node-id=%n'
priority=60
monitor_interval_secs=2
connection_check_type='ping'
reconnect_attempts=4
reconnect_interval=8
primary_visibility_consensus=true
standby_disconnect_on_failover=true

repmgrd_service_start_command='sudo /usr/bin/systemctl start repmgr12.service'
repmgrd_service_stop_command='sudo /usr/bin/systemctl stop repmgr12.service'

service_start_command='sudo /usr/bin/systemctl start postgresql-12.service'
service_stop_command='sudo /usr/bin/systemctl stop postgresql-12.service'
service_restart_command='sudo /usr/bin/systemctl restart postgresql-12.service'
service_reload_command='sudo /usr/bin/systemctl reload postgresql-12.service'

monitoring_history=yes
log_status_interval=60

`

Node3 : repmgr.conf

`
node_id=3
node_name='PG-Node3'
conninfo='host=192.168.1.13 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/pgsql/12/data'
failover='automatic'
promote_command='/usr/pgsql-12/bin/repmgr standby promote -f /etc/repmgr/12/repmgr.conf --log-to-file'
follow_command='/usr/pgsql-12/bin/repmgr standby follow -f /etc/repmgr/12/repmgr.conf --log-to-file --upstream-node-id=%n'
priority=40
monitor_interval_secs=2
connection_check_type='ping'
reconnect_attempts=4
reconnect_interval=8
primary_visibility_consensus=true
standby_disconnect_on_failover=true

repmgrd_service_start_command='sudo /usr/bin/systemctl start repmgr12.service'
repmgrd_service_stop_command='sudo /usr/bin/systemctl stop repmgr12.service'

service_start_command='sudo /usr/bin/systemctl start postgresql-12.service'
service_stop_command='sudo /usr/bin/systemctl stop postgresql-12.service'
service_restart_command='sudo /usr/bin/systemctl restart postgresql-12.service'
service_reload_command='sudo /usr/bin/systemctl reload postgresql-12.service'

monitoring_history=yes
log_status_interval=60
`

Node4: repmgr.conf

`
node_id=4
node_name='PG-Node-Witness'
conninfo='host=192.168.1.14 user=repmgr dbname=repmgr connect_timeout=2'
data_directory='/var/lib/pgsql/12/data'
primary_visibility_consensus=true

repmgrd_service_start_command='sudo /usr/bin/systemctl start repmgr12.service'
repmgrd_service_stop_command='sudo /usr/bin/systemctl stop repmgr12.service'

service_start_command='sudo /usr/bin/systemctl start postgresql-12.service'
service_stop_command='sudo /usr/bin/systemctl stop postgresql-12.service'
service_restart_command='sudo /usr/bin/systemctl restart postgresql-12.service'
service_reload_command='sudo /usr/bin/systemctl reload postgresql-12.service'

`

Node Rejoin issue step by step:

Step 1: Run node rejoin command to join with node1:192.168.1.11
`
[postgres@pgStandby1 pg_wal]$ /usr/pgsql-12/bin/repmgr node rejoin -f /etc/repmgr/12/repmgr.conf -d 'host=192.168.1.11 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.local.conf,postgresql.conf --verbose --dry-run

NOTICE: using provided configuration file "/etc/repmgr/12/repmgr.conf"
ERROR: database is still running in state "in production"
HINT: "repmgr node rejoin" cannot be executed on a running node
`

Step 2: Standby1 stopped

[postgres@pgStandby1 pg_wal]$ systemctl stop postgresql-12.service

Step3: Now run node rejoin command again with --dry-run and showing below message.
`
[postgres@pgStandby1 pg_wal]$ /usr/pgsql-12/bin/repmgr node rejoin -f /etc/repmgr/12/repmgr.conf -d 'host=192.168.1.11 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.local.conf,postgresql.conf --verbose --dry-run

NOTICE: using provided configuration file "/etc/repmgr/12/repmgr.conf"
INFO: replication connection to the rejoin target node was successful
INFO: local and rejoin target system identifiers match
DETAIL: system identifier is 7015769547447271848
INFO: prerequisites for using pg_rewind are met
WARNING: specified file "/var/lib/pgsql/12/data/postgresql.local.conf" not found, skipping
INFO: file "postgresql.conf" would be copied to "/tmp/repmgr-config-archive-PG-Node2/postgresql.conf"
INFO: 1 files would have been copied to "/tmp/repmgr-config-archive-PG-Node2"
WARNING: unable to delete directory "/tmp/repmgr-config-archive-PG-Node2"
DETAIL: Directory not empty
HINT: directory may need to be manually removed
INFO: pg_rewind would now be executed
DETAIL: pg_rewind command is:
/usr/pgsql-12/bin/pg_rewind -D '/var/lib/pgsql/12/data' --source-server='host=192.168.1.11 user=repmgr dbname=repmgr connect_timeout=2'
INFO: prerequisites for executing NODE REJOIN are met
`
Step4: Now run node rejoin command again without --dry-run and showing below error message

'[postgres@pgStandby1 pg_wal]$ /usr/pgsql-12/bin/repmgr node rejoin -f /etc/repmgr/12/repmgr.conf -d 'host=192.168.1.11 dbname=repmgr user=repmgr' --force-rewind --config-files=postgresql.local.conf,postgresql.conf --verbose
NOTICE: using provided configuration file "/etc/repmgr/12/repmgr.conf"
INFO: prerequisites for using pg_rewind are met
WARNING: specified file "/var/lib/pgsql/12/data/postgresql.local.conf" not found, skipping
INFO: 1 files copied to "/tmp/repmgr-config-archive-PG-Node2"
NOTICE: executing pg_rewind
DETAIL: pg_rewind command is "/usr/pgsql-12/bin/pg_rewind -D '/var/lib/pgsql/12/data' --source-server='host=192.168.1.11 user=repmgr dbname=repmgr connect_timeout=2'"
ERROR: pg_rewind execution failed
DETAIL: pg_rewind: servers diverged at WAL location 0/D000000 on timeline 1
pg_rewind: error: could not open file "/var/lib/pgsql/12/data/pg_wal/00000002000000000000000D": No such file or directory
pg_rewind: fatal: could not find previous WAL record at 0/D000028
'

As per my understanding wal not found but it is not available there. Not found on the standby(Primary). If it is the issue of wal then how can I retrive that wal.

I have given every section from my side. Several time i tried i could not do that. Anyone can help me. I need exeact solution.

@wasiualhasib wasiualhasib changed the title Unable to do standby to primary and primary to standby. Unable to do standby to primary and primary to standby (node rejoin/failover/switchover) Oct 6, 2021
@tiagoscan
Copy link

Hi @wasiualhasib
You need to delete folder "main" in your Node1, and execute the command "standby clone" from your new primary, also execute "standby register" in Node1.

After this, execute "standby switchover" in Node1, check output "cluster show" and you'l see Node1 set like primary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants