**Which example are you working with?**
I created a cluster with 1 primary and 1 replica, with synchronous replication enabled (sync level set to `remote-apply`).
I set this up based on one of my previous questions: #2034.
The cluster had been working for some days. For some reason, the pods were restarted, and from that point on we no longer had a primary, only 2 replicas:
```
$ pgo -n cfae-bit-cbcd show cluster pg-jose-benchmark-c4-remote-apply-35b24287

cluster : pg-jose-benchmark-c4-remote-apply-35b24287 (crunchy-postgres-ha:centos7-11.10-4.5.1)
    pod : pg-jose-benchmark-c4-remote-apply-35b24287-869c9986fb-w4w7z (Running) on worker12.ccp02.adr.admin.ch (2/2) (replica)
        pvc: pg-jose-benchmark-c4-remote-apply-35b24287 (80Gi)
    pod : pg-jose-benchmark-c4-remote-apply-35b24287-ocul-6967cbf64dvsqgp (Running) on worker01.ccp02.adr.admin.ch (2/2) (replica)
        pvc: pg-jose-benchmark-c4-remote-apply-35b24287-ocul (80Gi)
    resources : CPU: 1 Memory: 4Gi
    limits : CPU: 1 Memory: 4Gi
    deployment : pg-jose-benchmark-c4-remote-apply-35b24287
    deployment : pg-jose-benchmark-c4-remote-apply-35b24287-backrest-shared-repo
    deployment : pg-jose-benchmark-c4-remote-apply-35b24287-ocul
    service : pg-jose-benchmark-c4-remote-apply-35b24287 - ClusterIP (10.98.71.131) - Ports (9187/TCP, 2022/TCP, 5432/TCP)
    service : pg-jose-benchmark-c4-remote-apply-35b24287-np - ClusterIP (10.102.204.208) - Ports (5432/TCP)
    service : pg-jose-benchmark-c4-remote-apply-35b24287-replica - ClusterIP (10.102.174.187) - Ports (9187/TCP, 2022/TCP, 5432/TCP)
    pgreplica : pg-jose-benchmark-c4-remote-apply-35b24287-ocul
    labels : cpu-in-milli-cpu=1000 pgo-version=4.5.1 name=pg-jose-benchmark-c4-remote-apply-35b24287 pg-cluster=pg-jose-benchmark-c4-remote-apply-35b24287 space-guid=guid-b1569ac2-a5bc-4d01-9037-b5e7dda1cb43 storage-config=storage-80 pgo-osb-instance=15b9d562-a960-4410-920a-ce484cda7c93 service-type=ClusterIP autofail=true crunchy-pgbadger=false crunchy-pgha-scope=pg-jose-benchmark-c4-remote-apply-35b24287 memory-in-gb=4 pgo-backrest=true pgouser=admin sync-replication=true workflowid=a392296e-1842-4e87-9b7d-910893dbd18b crunchy-postgres-exporter=true deployment-name=pg-jose-benchmark-c4-remote-apply-35b24287-ocul org-guid=guid-7751d33a-be22-4539-8629-231bb99ef686 pg-pod-anti-affinity=
```
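The absence of a primary can also be confirmed from the Patroni role labels on the pods. A minimal sketch, assuming the `role=master`/`role=replica` labelling that PGO 4.5 applies to database pods:

```
# List the cluster's database pods together with their Patroni role label
kubectl -n cfae-bit-cbcd get pods \
  -l pg-cluster=pg-jose-benchmark-c4-remote-apply-35b24287 -L role

# Selecting role=master should return nothing in this broken state
kubectl -n cfae-bit-cbcd get pods \
  -l pg-cluster=pg-jose-benchmark-c4-remote-apply-35b24287,role=master
```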
The `*-config` ConfigMap:
```
$ kubectl get configmap pg-jose-benchmark-c4-remote-apply-35b24287-config -o yaml
apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    config: '{"postgresql":{"parameters":{"archive_command":"source /opt/cpm/bin/pgbackrest/pgbackrest-set-env.sh && pgbackrest archive-push \"%p\"","archive_mode":true,"archive_timeout":"1800","checkpoint_completion_target":0.9,"checkpoint_timeout":"15min","effective_cache_size":"3GB","effective_io_concurrency":300,"huge_pages":"off","log_connections":"off","log_directory":"pg_log","log_disconnections":"off","log_duration":"off","log_min_duration_statement":60000,"log_min_messages":"WARNING","log_statement":"none","logging_collector":"off","maintenance_work_mem":"320MB","max_connections":100,"max_parallel_workers":1,"max_wal_senders":10,"max_wal_size":"1024MB","max_worker_processes":1,"min_wal_size":"512MB","shared_buffers":"1024MB","shared_preload_libraries":"pgaudit.so,pg_stat_statements.so,pgnodemx.so","superuser_reserved_connections":3,"synchronous_commit":"remote_apply","synchronous_standby_names":"*","temp_buffers":"8MB","unix_socket_directories":"/tmp,/crunchyadm","wal_keep_segments":130,"wal_level":"logical","work_mem":"32MB"},"recovery_conf":{"restore_command":"source /opt/cpm/bin/pgbackrest/pgbackrest-set-env.sh && pgbackrest archive-get %f \"%p\""},"use_pg_rewind":true},"synchronous_mode":true,"tags":{}}'
    history: '[[1,4596957336,"no recovery target specified","2020-12-15T10:49:37.113733+00:00"]]'
    initialize: "6904548279672279201"
  creationTimestamp: "2020-12-10T08:53:39Z"
  labels:
    crunchy-pgha-scope: pg-jose-benchmark-c4-remote-apply-35b24287
    vendor: crunchydata
  name: pg-jose-benchmark-c4-remote-apply-35b24287-config
  namespace: cfae-bit-cbcd
  resourceVersion: "23801950"
  selfLink: /api/v1/namespaces/cfae-bit-cbcd/configmaps/pg-jose-benchmark-c4-remote-apply-35b24287-config
  uid: 7d4844e5-92a5-45ff-a918-59dc0afc51f8
```
It contains the settings required for synchronous commit (a way to cross-check the live values is sketched after this list):
- `"synchronous_commit": "remote_apply"`
- `"synchronous_standby_names": "*"`
- `"synchronous_mode": true`
Looking at the `*-sync` ConfigMap, we can see that the leader annotation points to a non-existing pod:
```
$ kubectl describe configmap pg-jose-benchmark-c4-remote-apply-35b24287-sync
Name:         pg-jose-benchmark-c4-remote-apply-35b24287-sync
Namespace:    cfae-bit-cbcd
Labels:       crunchy-pgha-scope=pg-jose-benchmark-c4-remote-apply-35b24287
              vendor=crunchydata
Annotations:  leader: pg-jose-benchmark-c4-remote-apply-35b24287-ocul-6967cbf64df2rp9

Data
====
Events:  <none>
```
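To pin down the inconsistency, one can compare the member names recorded in the `*-sync` ConfigMap with the pods that actually exist, and ask Patroni itself for its view of the cluster. A hedged sketch, assuming the Patroni-on-Kubernetes DCS layout used by PGO 4.x; `<replica-pod>` is a placeholder, and `patronictl` may need an explicit `-c <patroni-config>` depending on the image:

```
# The leader/sync member names Patroni recorded in the -sync ConfigMap
kubectl -n cfae-bit-cbcd get configmap \
  pg-jose-benchmark-c4-remote-apply-35b24287-sync \
  -o jsonpath='{.metadata.annotations}'

# The pods that actually exist for this cluster
kubectl -n cfae-bit-cbcd get pods \
  -l pg-cluster=pg-jose-benchmark-c4-remote-apply-35b24287

# Patroni's own view of the members, from inside one of the replicas
kubectl -n cfae-bit-cbcd exec -it <replica-pod> -c database -- patronictl list
```

In my case the `leader` annotation names a pod that no longer exists, while both live pods report themselves as replicas.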
I followed the https://info.crunchydata.com/blog/synchronous-replication-in-the-postgresql-operator-for-kubernetes-guarding-against-transactions-loss documentation, created several clusters, and performed the same failover tests as described; everything was always fine. However, this case occurred without any manual intervention, and I do not see how to proceed to get the cluster up again.
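For reference, the failover tests I ran were along these lines (a sketch; `pgo failover` with `--query`/`--target` is the PGO 4.5 CLI used for the tests, though in the current state there may be no valid target to promote):

```
# List candidate failover targets for the cluster
pgo -n cfae-bit-cbcd failover pg-jose-benchmark-c4-remote-apply-35b24287 --query

# Manually fail over to a chosen target deployment
pgo -n cfae-bit-cbcd failover pg-jose-benchmark-c4-remote-apply-35b24287 \
  --target=pg-jose-benchmark-c4-remote-apply-35b24287-ocul
```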
Thanks,
José
**Please tell us about your environment:**
- Operating System: SUSE Linux Enterprise Server 15 SP1
- Where is this running: Local
- Storage being used: NFS
- Container Image Tag: crunchy-postgres-ha:centos7-11.10-4.5.1
- PostgreSQL Version: 11.10
- Platform: Kubernetes 1.17.13