
Cluster with 2 replicas and no primary anymore #2132

@jose-joye

Description

**Which example are you working with?**
I have created a cluster with 1 primary and 1 replica and enabled synchronous replication (sync level set to remote-apply).
I set this up based on one of my previous questions: #2034.

The cluster had been working for some days. For some reason, the pods were restarted, and from that point on we no longer had a primary, only 2 replicas (a cross-check via the Patroni role labels is sketched after the output below):

$ pgo -n cfae-bit-cbcd show cluster pg-jose-benchmark-c4-remote-apply-35b24287

cluster : pg-jose-benchmark-c4-remote-apply-35b24287 (crunchy-postgres-ha:centos7-11.10-4.5.1)
        pod : pg-jose-benchmark-c4-remote-apply-35b24287-869c9986fb-w4w7z (Running) on worker12.ccp02.adr.admin.ch (2/2) (replica)
                pvc: pg-jose-benchmark-c4-remote-apply-35b24287 (80Gi)
        pod : pg-jose-benchmark-c4-remote-apply-35b24287-ocul-6967cbf64dvsqgp (Running) on worker01.ccp02.adr.admin.ch (2/2) (replica)
                pvc: pg-jose-benchmark-c4-remote-apply-35b24287-ocul (80Gi)
        resources : CPU: 1 Memory: 4Gi
        limits : CPU: 1 Memory: 4Gi
        deployment : pg-jose-benchmark-c4-remote-apply-35b24287
        deployment : pg-jose-benchmark-c4-remote-apply-35b24287-backrest-shared-repo
        deployment : pg-jose-benchmark-c4-remote-apply-35b24287-ocul
        service : pg-jose-benchmark-c4-remote-apply-35b24287 - ClusterIP (10.98.71.131) - Ports (9187/TCP, 2022/TCP, 5432/TCP)
        service : pg-jose-benchmark-c4-remote-apply-35b24287-np - ClusterIP (10.102.204.208) - Ports (5432/TCP)
        service : pg-jose-benchmark-c4-remote-apply-35b24287-replica - ClusterIP (10.102.174.187) - Ports (9187/TCP, 2022/TCP, 5432/TCP)
        pgreplica : pg-jose-benchmark-c4-remote-apply-35b24287-ocul
        labels : cpu-in-milli-cpu=1000 pgo-version=4.5.1 name=pg-jose-benchmark-c4-remote-apply-35b24287 pg-cluster=pg-jose-benchmark-c4-remote-apply-35b24287 space-guid=guid-b1569ac2-a5bc-4d01-9037-b5e7dda1cb43 storage-config=storage-80 pgo-osb-instance=15b9d562-a960-4410-920a-ce484cda7c93 service-type=ClusterIP autofail=true crunchy-pgbadger=false crunchy-pgha-scope=pg-jose-benchmark-c4-remote-apply-35b24287 memory-in-gb=4 pgo-backrest=true pgouser=admin sync-replication=true workflowid=a392296e-1842-4e87-9b7d-910893dbd18b crunchy-postgres-exporter=true deployment-name=pg-jose-benchmark-c4-remote-apply-35b24287-ocul org-guid=guid-7751d33a-be22-4539-8629-231bb99ef686 pg-pod-anti-affinity=
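
To cross-check what Patroni itself reports, the role labels it maintains on the pods can be listed directly. A minimal sketch, assuming the role label with master/replica values that stock PGO 4.x deployments use for their service selectors:

# Show the cluster pods together with the Patroni-managed role label
# (the label key "role" is an assumption based on stock PGO 4.x deployments)
$ kubectl -n cfae-bit-cbcd get pods -l pg-cluster=pg-jose-benchmark-c4-remote-apply-35b24287 -L role

If both pods show role=replica there as well, it matches what pgo show cluster reports above.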

The *-config configMap:

$ kubectl get configmap pg-jose-benchmark-c4-remote-apply-35b24287-config -o yaml
apiVersion: v1
kind: ConfigMap
metadata:
  annotations:
    config: '{"postgresql":{"parameters":{"archive_command":"source /opt/cpm/bin/pgbackrest/pgbackrest-set-env.sh
      && pgbackrest archive-push \"%p\"","archive_mode":true,"archive_timeout":"1800","checkpoint_completion_target":0.9,"checkpoint_timeout":"15min","effective_cache_size":"3GB","effective_io_concurrency":300,"huge_pages":"off","log_connections":"off","log_directory":"pg_log","log_disconnections":"off","log_duration":"off","log_min_duration_statement":60000,"log_min_messages":"WARNING","log_statement":"none","logging_collector":"off","maintenance_work_mem":"320MB","max_connections":100,"max_parallel_workers":1,"max_wal_senders":10,"max_wal_size":"1024MB","max_worker_processes":1,"min_wal_size":"512MB","shared_buffers":"1024MB","shared_preload_libraries":"pgaudit.so,pg_stat_statements.so,pgnodemx.so","superuser_reserved_connections":3,"synchronous_commit":"remote_apply","synchronous_standby_names":"*","temp_buffers":"8MB","unix_socket_directories":"/tmp,/crunchyadm","wal_keep_segments":130,"wal_level":"logical","work_mem":"32MB"},"recovery_conf":{"restore_command":"source
      /opt/cpm/bin/pgbackrest/pgbackrest-set-env.sh && pgbackrest archive-get %f \"%p\""},"use_pg_rewind":true},"synchronous_mode":true,"tags":{}}'
    history: '[[1,4596957336,"no recovery target specified","2020-12-15T10:49:37.113733+00:00"]]'
    initialize: "6904548279672279201"
  creationTimestamp: "2020-12-10T08:53:39Z"
  labels:
    crunchy-pgha-scope: pg-jose-benchmark-c4-remote-apply-35b24287
    vendor: crunchydata
  name: pg-jose-benchmark-c4-remote-apply-35b24287-config
  namespace: cfae-bit-cbcd
  resourceVersion: "23801950"
  selfLink: /api/v1/namespaces/cfae-bit-cbcd/configmaps/pg-jose-benchmark-c4-remote-apply-35b24287-config
  uid: 7d4844e5-92a5-45ff-a918-59dc0afc51f8

It contains the necessary settings for synchronous commit (a quick way to read them back on a running pod is sketched after this list):

  • "synchronous_commit":"remote_apply"
  • "synchronous_standby_names":"*"
  • "synchronous_mode":true

Looking at the *-sync ConfigMap, we can see that the leader annotation points to a pod that no longer exists (see the comparison sketched after the output):

$ kubectl describe  configmap pg-jose-benchmark-c4-remote-apply-35b24287-sync
Name:         pg-jose-benchmark-c4-remote-apply-35b24287-sync
Namespace:    cfae-bit-cbcd
Labels:       crunchy-pgha-scope=pg-jose-benchmark-c4-remote-apply-35b24287
              vendor=crunchydata
Annotations:  leader: pg-jose-benchmark-c4-remote-apply-35b24287-ocul-6967cbf64df2rp9

Data
====
Events:  <none>
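
The stale reference can be confirmed by pulling the leader annotation out of the -sync ConfigMap and comparing it with the pods that actually exist (plain kubectl, nothing PGO-specific):

# Extract the leader annotation from the -sync ConfigMap ...
$ kubectl -n cfae-bit-cbcd get configmap pg-jose-benchmark-c4-remote-apply-35b24287-sync -o jsonpath='{.metadata.annotations.leader}'
# ... and compare it with the pod names that are actually running
$ kubectl -n cfae-bit-cbcd get pods -l pg-cluster=pg-jose-benchmark-c4-remote-apply-35b24287 -o name

The annotation still names the old pod (suffix 6967cbf64df2rp9), while the -ocul pod that is actually running ends in 6967cbf64dvsqgp.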

I have followed the https://info.crunchydata.com/blog/synchronous-replication-in-the-postgresql-operator-for-kubernetes-guarding-against-transactions-loss documentation, created several clusters, and performed the same failover tests as described, and everything was always OK. However, this case occurred without any manual intervention, and I do not see how to proceed to get the cluster up again.
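
For reference, the kind of manual failover I tested earlier was along these lines (flags as in the PGO 4.5 CLI; --query lists the candidate replicas first):

# List failover candidates, then promote one (command shape per the PGO 4.5 CLI)
$ pgo failover pg-jose-benchmark-c4-remote-apply-35b24287 --query -n cfae-bit-cbcd
$ pgo failover pg-jose-benchmark-c4-remote-apply-35b24287 --target=pg-jose-benchmark-c4-remote-apply-35b24287-ocul -n cfae-bit-cbcd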

Thanks,
José

Please tell us about your environment:

  • Operating System: SUSE Linux Enterprise Server 15 SP1
  • Where is this running: Local
  • Storage being used: NFS
  • Container Image Tag: crunchy-postgres-ha:centos7-11.10-4.5.1
  • PostgreSQL Version: 11.10
  • Platform: Kubernetes 1.17.13
