Add PostgreSQL support: discovery, failure detection, and recovery #70
renecannao merged 3 commits into master from
Conversation
Add ProviderType config field ("mysql" default, "postgresql") to gate all
PostgreSQL-specific code paths. When ProviderType is "postgresql":
- Initialize PostgreSQLProvider in ContinuousDiscovery
- ReadTopologyInstance and ReadTopologyInstanceBufferable delegate to
ReadPostgreSQLTopologyInstance, which connects to PG, queries
pg_is_in_recovery(), pg_stat_replication, pg_stat_wal_receiver,
and maps PG concepts to orchestrator's Instance struct
- WAL LSN is mapped to BinlogCoordinates, standby status to
replication thread states, conninfo to MasterKey
The MySQL path is completely unaffected when ProviderType is empty or "mysql".
Closes #65
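The WAL-LSN-to-coordinates mapping described above can be sketched with a small parser. A PostgreSQL LSN's two hex halves map naturally onto a (file, position)-style coordinate pair; `parsePgLSN` is an illustrative helper under that assumption, not the PR's actual code:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePgLSN splits a PostgreSQL LSN such as "16/B374D848" into its two
// 32-bit halves, which can then stand in for a binlog (file, position)
// pair. Name and mapping are illustrative, not taken from the PR.
func parsePgLSN(lsn string) (hi, lo uint64, err error) {
	parts := strings.Split(lsn, "/")
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("malformed LSN: %q", lsn)
	}
	if hi, err = strconv.ParseUint(parts[0], 16, 32); err != nil {
		return 0, 0, err
	}
	if lo, err = strconv.ParseUint(parts[1], 16, 32); err != nil {
		return 0, 0, err
	}
	return hi, lo, nil
}

func main() {
	hi, lo, err := parsePgLSN("16/B374D848")
	fmt.Println(hi, lo, err) // prints the two halves as decimal integers
}
```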
Add PostgreSQL-specific analysis codes: DeadPrimary, DeadPrimaryAndSomeStandbys, StandbyNotReplicating, AllStandbyNotReplicating, and UnreachablePrimary.
- GetReplicationAnalysis delegates to GetPostgreSQLReplicationAnalysis when ProviderType is "postgresql"
- The PG analysis queries the orchestrator backend database for primary/standby topology state and classifies failures based on primary reachability and standby replication status
The MySQL analysis path is completely unchanged.
Closes #66
Implement PostgreSQL-specific topology operations:
- PostgreSQLPromoteStandby: calls pg_promote() and waits for the instance to exit recovery mode
- PostgreSQLReconfigureStandby: updates primary_conninfo via ALTER SYSTEM and reloads configuration
- PostgreSQLGetBestStandbyForPromotion: selects the best standby based on check validity, lag, and LSN position

Wire recovery into the dispatcher (getCheckAndRecoverFunction):
- DeadPrimary/DeadPrimaryAndSomeStandbys -> checkAndRecoverDeadPrimary
- UnreachablePrimary, AllStandbyNotReplicating, StandbyNotReplicating are handled as generic (non-actionable) problems for now

checkAndRecoverDeadPrimary follows the same pattern as checkAndRecoverDeadMaster: register recovery, run pre-failover hooks, select and promote best standby, reconfigure remaining standbys, run post-failover hooks.
Closes #67
Code Review
This pull request introduces support for PostgreSQL as a database provider in Orchestrator. It adds the necessary configuration, data access objects for topology analysis and instance discovery, and logic for handling PostgreSQL-specific failover scenarios, such as promoting a standby to primary. The review comments highlight potential concerns regarding the reliability of port discovery for replicas, the use of sleep-based heuristics for WAL receiver reconnection, and the error handling strategy when reconfiguring standbys during failover.
```go
	hostname = clientAddr.String
}
if hostname != "" {
	port := config.Config.DefaultInstancePort
```
```go
_, _ = db.Exec("SELECT pg_wal_replay_pause()")
time.Sleep(100 * time.Millisecond)
_, _ = db.Exec("SELECT pg_wal_replay_resume()")
```
```go
if err := inst.PostgreSQLReconfigureStandby(&replica.Key, &promotedInstance.Key); err != nil {
	AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("error reconfiguring standby %+v: %+v (continuing with others)", replica.Key, err))
	topologyRecovery.LostReplicas.AddKey(replica.Key)
}
```
Pull request overview
This PR adds first-class PostgreSQL streaming replication support to orchestrator, gated behind a new ProviderType configuration flag so the existing MySQL behavior remains the default.
Changes:
- Introduces `ProviderType` (`mysql` default, `postgresql` option) and initializes the PostgreSQL provider in the discovery loop.
- Adds PostgreSQL discovery and topology mapping (primary/standby roles, WAL/LSN mapping) plus PG-specific replication analysis codes and analysis logic.
- Implements PostgreSQL recovery flow for dead primaries (standby promotion + reconfigure remaining standbys) and wires PG analysis codes into the recovery dispatcher.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| go/logic/topology_recovery.go | Routes new PostgreSQL analysis codes into recovery/emergent-operation dispatch. |
| go/logic/topology_recovery_postgresql.go | Adds PostgreSQL-specific dead-primary recovery implementation (promote + reconfigure). |
| go/logic/orchestrator.go | Initializes the configured provider (PostgreSQL when enabled) at continuous discovery startup. |
| go/inst/instance_topology_postgresql.go | Implements pg_promote() promotion, standby reconfiguration, and “best standby” selection logic. |
| go/inst/instance_dao.go | Switches discovery path to PostgreSQL-specific reader when ProviderType=postgresql. |
| go/inst/instance_dao_postgresql.go | Adds PostgreSQL topology discovery and maps PG concepts into the Instance model. |
| go/inst/instance_dao_postgresql_test.go | Adds unit tests for PG parsing helpers and ProviderType default. |
| go/inst/analysis.go | Adds PostgreSQL-specific analysis code constants. |
| go/inst/analysis_test.go | Adds tests to ensure PG analysis codes are unique and don’t collide with MySQL codes. |
| go/inst/analysis_dao.go | Delegates replication analysis to a PG-specific implementation when enabled. |
| go/inst/analysis_dao_postgresql.go | Implements PostgreSQL replication failure detection using orchestrator backend state. |
| go/config/config.go | Adds ProviderType config field with default mysql. |
```go
if port == 0 {
	port = config.Config.DefaultInstancePort
}
instance.MasterKey = InstanceKey{
	Hostname: host,
```
This fallback uses config.Config.DefaultInstancePort when primary_conninfo omits port. DefaultInstancePort defaults to 3306 (MySQL), so PostgreSQL standbys may be associated with the wrong primary port (typically 5432), breaking discovery/recovery. Introduce/use a PostgreSQL-specific default port and use it here instead of DefaultInstancePort.
```go
port := config.Config.DefaultInstancePort
// client_port in pg_stat_replication is the ephemeral port, not the PG listen port.
// We use DefaultInstancePort instead.
replicaKey := InstanceKey{Hostname: hostname, Port: port}
instance.AddReplicaKey(&replicaKey)
```
Replica keys discovered via pg_stat_replication are recorded using config.Config.DefaultInstancePort (defaults to 3306). This will persist incorrect ports for PostgreSQL replicas and prevent subsequent connections. Use a PostgreSQL port configuration/default (typically 5432) instead of DefaultInstancePort for PostgreSQL topologies.
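A provider-aware fallback along the lines the reviewer suggests could look like the sketch below. The constant and function names, and the hypothetical configured-port parameter, are illustrative; the PR itself does not define them:

```go
package main

import "fmt"

// Well-known default listen ports for each provider.
const (
	defaultMySQLPort      = 3306
	defaultPostgreSQLPort = 5432
)

// defaultPortFor returns configuredPort if set, otherwise a default that
// matches the provider: 5432 for PostgreSQL topologies instead of the
// MySQL-oriented DefaultInstancePort. Names here are hypothetical.
func defaultPortFor(providerType string, configuredPort int) int {
	if configuredPort != 0 {
		return configuredPort
	}
	if providerType == "postgresql" {
		return defaultPostgreSQLPort
	}
	return defaultMySQLPort
}

func main() {
	fmt.Println(defaultPortFor("postgresql", 0)) // prints 5432
	fmt.Println(defaultPortFor("mysql", 0))      // prints 3306
	fmt.Println(defaultPortFor("postgresql", 6432)) // prints 6432
}
```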
```go
instance.IsUpToDate = true
instance.IsRecentlyChecked = true
instance.LastSeenTimestamp = time.Now().Format("2006-01-02 15:04:05")
```
Before writing to the backend (WriteInstance), instance.ClusterName is never set in the PostgreSQL discovery path. This can leave cluster_name empty and break cluster grouping/analysis/recovery behavior that expects a stable cluster name (MySQL discovery falls back to instance.Key.StringCode()). Consider setting instance.ClusterName (at least for primaries) or reusing ReadInstanceClusterAttributes prior to WriteInstance().
Suggested change:

```go
// Ensure ClusterName is always populated before persisting the instance.
// This provides a stable fallback cluster identifier for PostgreSQL
// discovery, matching the expectation that cluster_name is never empty.
if instance.ClusterName == "" {
	instance.ClusterName = instance.Key.StringCode()
}
```
```go
// Primary is unreachable
if a.CountReplicas == 0 {
	// No standbys at all - nothing to fail over to
	a.Analysis = DeadMasterWithoutReplicas
```
When the primary is unreachable and CountReplicas == 0, this uses the MySQL analysis code DeadMasterWithoutReplicas. Since recovery dispatch is analysis-code driven, this can route PostgreSQL incidents through MySQL recovery logic. Use a PostgreSQL-specific analysis code here (or add one for "dead primary without standbys").
Suggested change:

```diff
-a.Analysis = DeadMasterWithoutReplicas
+a.Analysis = DeadPrimary
```
```go
// All standbys are also unreachable
a.Analysis = DeadPrimaryAndSomeStandbys
a.Description = "Primary is unreachable and none of its standbys are reachable"
```
This branch sets DeadPrimaryAndSomeStandbys when CountValidReplicas == 0 (description indicates none of the standbys are reachable). Because DeadPrimaryAndSomeStandbys is wired to checkAndRecoverDeadPrimary, orchestrator will repeatedly attempt an automated failover even though no standby is eligible, generating noisy failed recoveries. Consider using a non-recoverable analysis code for this case or introducing a distinct code like "DeadPrimaryAndAllStandbys" and not triggering automated recovery when nothing can be promoted.
Suggested change:

```diff
-// All standbys are also unreachable
-a.Analysis = DeadPrimaryAndSomeStandbys
-a.Description = "Primary is unreachable and none of its standbys are reachable"
+// Standbys exist, but none are reachable, so there is no promotion candidate.
+// Use a non-recoverable analysis to avoid repeated automated failover attempts.
+a.Analysis = DeadMasterWithoutReplicas
+a.Description = "Primary is unreachable and has no reachable standbys"
```
```go
// Update primary_conninfo using ALTER SYSTEM
_, err = db.Exec(fmt.Sprintf("ALTER SYSTEM SET primary_conninfo = '%s'", newConnInfo))
if err != nil {
	return log.Errore(fmt.Errorf("PostgreSQLReconfigureStandby: ALTER SYSTEM failed on %+v: %v", *standbyKey, err))
}
```
ALTER SYSTEM SET primary_conninfo is built via string interpolation with newConnInfo wrapped in single quotes. If any field (notably PostgreSQLTopologyPassword) contains a single quote, this becomes invalid SQL and breaks automated recovery; it also introduces an avoidable SQL-injection surface via configuration. Use a parameterized query (e.g. ... primary_conninfo = $1) or safely quote/escape the value.
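One caveat on the parameterized-query suggestion: ALTER SYSTEM is a utility statement, and PostgreSQL generally does not accept bind parameters (`$1`) in utility statements, so safe literal quoting is the practical fix. A sketch of such quoting, mirroring the rules used by lib/pq's `pq.QuoteLiteral` (single quotes doubled; values containing backslashes use the `E''` form with doubled backslashes):

```go
package main

import (
	"fmt"
	"strings"
)

// quoteLiteral escapes a string for embedding as a PostgreSQL string
// literal. This is an illustrative reimplementation of lib/pq's
// pq.QuoteLiteral, not code from the PR.
func quoteLiteral(literal string) string {
	literal = strings.Replace(literal, `'`, `''`, -1)
	if strings.Contains(literal, `\`) {
		literal = strings.Replace(literal, `\`, `\\`, -1)
		return ` E'` + literal + `'`
	}
	return `'` + literal + `'`
}

func main() {
	conninfo := `host=p1 user=repl password=o'brien`
	fmt.Println("ALTER SYSTEM SET primary_conninfo =" + quoteLiteral(conninfo))
}
```

With this in place, a conninfo containing a single quote (e.g. in the password) produces valid SQL instead of breaking automated recovery.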
```go
// Try just host
hostMatch := regexp.MustCompile(`host=(\S+)`).FindStringSubmatch(conninfo)
if len(hostMatch) >= 2 {
	return hostMatch[1], 0
}
```
parseConnInfo compiles a new regexp on every call for the host-only fallback. This is avoidable overhead in a hot discovery path. Move that regexp to a package-level var (similar to connInfoHostPortRegexp) and reuse it.
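The suggested fix is to hoist the pattern into a package-level `regexp.MustCompile`, which compiles once at init time. A self-contained sketch (this `parseConnInfo` is an illustrative reimplementation, not the PR's exact helper):

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Package-level regexps are compiled once instead of on every call,
// avoiding repeated compilation in the hot discovery path.
var (
	connInfoHostPortRegexp = regexp.MustCompile(`host=(\S+)\s.*port=(\d+)`)
	connInfoHostRegexp     = regexp.MustCompile(`host=(\S+)`)
)

// parseConnInfo extracts host and port from a primary_conninfo string,
// falling back to port 0 when only the host is present.
func parseConnInfo(conninfo string) (host string, port int) {
	if m := connInfoHostPortRegexp.FindStringSubmatch(conninfo); len(m) >= 3 {
		port, _ = strconv.Atoi(m[2])
		return m[1], port
	}
	if m := connInfoHostRegexp.FindStringSubmatch(conninfo); len(m) >= 2 {
		return m[1], 0
	}
	return "", 0
}

func main() {
	fmt.Println(parseConnInfo("host=primary1 port=5432 user=repl")) // prints "primary1 5432"
	fmt.Println(parseConnInfo("host=primary1 user=repl"))           // prints "primary1 0"
}
```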
Summary
Implements PostgreSQL streaming replication support across three phases:

Phase A (Discovery): Adds a `ProviderType` config field (`"mysql"` default, `"postgresql"` option). When set to `"postgresql"`, the discovery loop connects to PostgreSQL instances via `pg_is_in_recovery()`, `pg_stat_replication`, and `pg_stat_wal_receiver`, mapping PG concepts (WAL LSN, standby/primary roles, WAL receiver status) to orchestrator's Instance struct. The MySQL discovery path is completely unaffected.

Phase B (Failure Detection): Adds PostgreSQL-specific analysis codes (`DeadPrimary`, `DeadPrimaryAndSomeStandbys`, `StandbyNotReplicating`, `AllStandbyNotReplicating`, `UnreachablePrimary`). `GetReplicationAnalysis` delegates to a PG-specific implementation that queries the orchestrator backend database for primary/standby topology state.

Phase C (Recovery): Implements `PostgreSQLPromoteStandby` (via `pg_promote()`), `PostgreSQLReconfigureStandby` (via `ALTER SYSTEM SET primary_conninfo`), and `PostgreSQLGetBestStandbyForPromotion`. Wires PG analysis codes into the recovery dispatcher so `DeadPrimary` triggers automated failover with standby promotion and reconfiguration.

All changes are gated behind `ProviderType == "postgresql"` -- the MySQL path is never executed when this is set, and the default remains `"mysql"`.
Closes #65, #66, #67
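For illustration, a minimal configuration fragment enabling the PostgreSQL path might look like the following. `ProviderType` and `PostgreSQLTopologyPassword` appear in this PR; `PostgreSQLTopologyUser` is an assumed companion field shown only as a sketch:

```json
{
  "ProviderType": "postgresql",
  "PostgreSQLTopologyUser": "orchestrator",
  "PostgreSQLTopologyPassword": "secret"
}
```

Omitting `ProviderType` (or setting it to `"mysql"`) leaves the existing MySQL behavior in place.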
Test plan
- `go build -o /dev/null ./go/cmd/orchestrator`
- `go test ./go/inst/... -vet=off -v`