
Add PostgreSQL support: discovery, failure detection, and recovery#70

Merged
renecannao merged 3 commits into master from issue64-postgresql-support
Apr 3, 2026

Conversation

@renecannao

Summary

Implements PostgreSQL streaming replication support across three phases:

  • Phase A (Discovery): Adds a ProviderType config field ("mysql" by default, "postgresql" as the alternative). When set to "postgresql", the discovery loop connects to PostgreSQL instances and queries pg_is_in_recovery(), pg_stat_replication, and pg_stat_wal_receiver, mapping PG concepts (WAL LSN, standby/primary roles, WAL receiver status) onto orchestrator's Instance struct. The MySQL discovery path is completely unaffected.

  • Phase B (Failure Detection): Adds PostgreSQL-specific analysis codes (DeadPrimary, DeadPrimaryAndSomeStandbys, StandbyNotReplicating, AllStandbyNotReplicating, UnreachablePrimary). GetReplicationAnalysis delegates to a PG-specific implementation that queries the orchestrator backend database for primary/standby topology state.

  • Phase C (Recovery): Implements PostgreSQLPromoteStandby (via pg_promote()), PostgreSQLReconfigureStandby (via ALTER SYSTEM SET primary_conninfo), and PostgreSQLGetBestStandbyForPromotion. Wires PG analysis codes into the recovery dispatcher so DeadPrimary triggers automated failover with standby promotion and reconfiguration.

All changes are gated behind ProviderType == "postgresql" -- the MySQL path is never executed when this is set, and the default remains "mysql".

Closes #65, #66, #67

Test plan

  • Build verification: go build -o /dev/null ./go/cmd/orchestrator
  • Unit tests pass: go test ./go/inst/... -vet=off -v
  • New tests for PG utility functions (LSN parsing, conninfo parsing, version extraction)
  • New tests for PG analysis codes (uniqueness, no collision with MySQL codes)
  • Config default test (ProviderType defaults to "mysql")
  • Integration test with actual PostgreSQL primary/standby topology (requires PG cluster)

Commit messages

Add ProviderType config field ("mysql" default, "postgresql") to gate all
PostgreSQL-specific code paths. When ProviderType is "postgresql":

- Initialize PostgreSQLProvider in ContinuousDiscovery
- ReadTopologyInstance and ReadTopologyInstanceBufferable delegate to
  ReadPostgreSQLTopologyInstance, which connects to PG, queries
  pg_is_in_recovery(), pg_stat_replication, pg_stat_wal_receiver,
  and maps PG concepts to orchestrator's Instance struct
- WAL LSN is mapped to BinlogCoordinates, standby status to
  replication thread states, conninfo to MasterKey

The MySQL path is completely unaffected when ProviderType is empty or "mysql".

Closes #65
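
As a rough illustration of the WAL-LSN mapping mentioned above: a pg_lsn text value such as `16/B374D848` is two 32-bit hex halves that can be packed into one comparable integer. A minimal sketch under that assumption (parsePgLSN is an illustrative name, not necessarily the PR's actual helper):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePgLSN converts a pg_lsn text value ("high/low", both hex)
// into a single uint64 so LSNs can be compared numerically.
func parsePgLSN(lsn string) (uint64, error) {
	parts := strings.Split(lsn, "/")
	if len(parts) != 2 {
		return 0, fmt.Errorf("malformed LSN %q", lsn)
	}
	high, err := strconv.ParseUint(parts[0], 16, 32)
	if err != nil {
		return 0, err
	}
	low, err := strconv.ParseUint(parts[1], 16, 32)
	if err != nil {
		return 0, err
	}
	return high<<32 | low, nil
}

func main() {
	v, _ := parsePgLSN("16/B374D848")
	fmt.Printf("%#x\n", v) // 0x16b374d848
}
```

Packing to a single integer makes replication-lag and promotion comparisons a plain numeric comparison rather than string handling.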

Add PostgreSQL-specific analysis codes: DeadPrimary,
DeadPrimaryAndSomeStandbys, StandbyNotReplicating,
AllStandbyNotReplicating, and UnreachablePrimary.

GetReplicationAnalysis delegates to GetPostgreSQLReplicationAnalysis
when ProviderType is "postgresql". The PG analysis queries the
orchestrator backend database for primary/standby topology state
and classifies failures based on primary reachability and standby
replication status.

The MySQL analysis path is completely unchanged.

Closes #66
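
The classification described above can be sketched as a pure function over primary reachability and standby replication counts. The constant names match this PR's description; classify itself is an illustrative reduction, not the PR's actual backend-query-driven code:

```go
package main

import "fmt"

// AnalysisCode mirrors orchestrator's analysis-code type.
type AnalysisCode string

// Names follow the PR description; string values are illustrative.
const (
	DeadPrimary                AnalysisCode = "DeadPrimary"
	DeadPrimaryAndSomeStandbys AnalysisCode = "DeadPrimaryAndSomeStandbys"
	StandbyNotReplicating      AnalysisCode = "StandbyNotReplicating"
	AllStandbyNotReplicating   AnalysisCode = "AllStandbyNotReplicating"
	UnreachablePrimary         AnalysisCode = "UnreachablePrimary"
	NoProblem                  AnalysisCode = "NoProblem"
)

// classify picks an analysis code from the two signals the commit
// message names: primary reachability and standby replication status.
func classify(primaryReachable bool, standbys, replicating int) AnalysisCode {
	switch {
	case !primaryReachable && standbys > 0:
		return DeadPrimary
	case !primaryReachable:
		return UnreachablePrimary
	case standbys > 0 && replicating == 0:
		return AllStandbyNotReplicating
	case replicating < standbys:
		return StandbyNotReplicating
	}
	return NoProblem
}

func main() {
	fmt.Println(classify(false, 2, 0)) // DeadPrimary
	fmt.Println(classify(true, 3, 2))  // StandbyNotReplicating
}
```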

Implement PostgreSQL-specific topology operations:
- PostgreSQLPromoteStandby: calls pg_promote() and waits for the
  instance to exit recovery mode
- PostgreSQLReconfigureStandby: updates primary_conninfo via
  ALTER SYSTEM and reloads configuration
- PostgreSQLGetBestStandbyForPromotion: selects the best standby
  based on check validity, lag, and LSN position

Wire recovery into the dispatcher (getCheckAndRecoverFunction):
- DeadPrimary/DeadPrimaryAndSomeStandbys -> checkAndRecoverDeadPrimary
- UnreachablePrimary, AllStandbyNotReplicating, StandbyNotReplicating
  are handled as generic (non-actionable) problems for now

checkAndRecoverDeadPrimary follows the same pattern as
checkAndRecoverDeadMaster: register recovery, run pre-failover hooks,
select and promote best standby, reconfigure remaining standbys,
run post-failover hooks.

Closes #67
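
The promotion-candidate selection can be sketched as follows. Standby is a pared-down stand-in for orchestrator's Instance, and the rule shown (check validity first, then the most advanced LSN) follows the commit description; field names and tie-breaking details are assumptions:

```go
package main

import "fmt"

// Standby is an illustrative stand-in for orchestrator's Instance.
type Standby struct {
	Name           string
	LastCheckValid bool   // discovery check recently succeeded
	ReplayLSN      uint64 // already parsed from pg_lsn text
}

// bestStandbyForPromotion skips standbys whose last check failed and
// returns the survivor with the most advanced replay LSN, if any.
func bestStandbyForPromotion(standbys []Standby) (Standby, bool) {
	var best Standby
	found := false
	for _, s := range standbys {
		if !s.LastCheckValid {
			continue
		}
		if !found || s.ReplayLSN > best.ReplayLSN {
			best, found = s, true
		}
	}
	return best, found
}

func main() {
	candidates := []Standby{
		{"standby1", true, 0x16B374D848},
		{"standby2", false, 0x16B374FFFF}, // ahead, but last check failed
		{"standby3", true, 0x16B374D000},
	}
	if best, ok := bestStandbyForPromotion(candidates); ok {
		fmt.Println(best.Name) // standby1
	}
}
```

The boolean return matters for the dispatcher: when no valid candidate exists, recovery should abort rather than promote blindly.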

Copilot AI review requested due to automatic review settings April 3, 2026 07:05
@renecannao renecannao merged commit 51ff6f6 into master Apr 3, 2026
6 of 7 checks passed

@gemini-code-assist bot left a comment


Code Review

This pull request introduces support for PostgreSQL as a database provider in Orchestrator. It adds the necessary configuration, data access objects for topology analysis and instance discovery, and logic for handling PostgreSQL-specific failover scenarios, such as promoting a standby to primary. The review comments highlight potential concerns regarding the reliability of port discovery for replicas, the use of sleep-based heuristics for WAL receiver reconnection, and the error handling strategy when reconfiguring standbys during failover.

hostname = clientAddr.String
}
if hostname != "" {
port := config.Config.DefaultInstancePort


Severity: medium

The use of config.Config.DefaultInstancePort for all replicas in pg_stat_replication might be incorrect if replicas are running on different ports. Consider if there is a way to reliably determine the actual port, or if this assumption is acceptable for the current implementation.

Comment on lines +124 to +126
_, _ = db.Exec("SELECT pg_wal_replay_pause()")
time.Sleep(100 * time.Millisecond)
_, _ = db.Exec("SELECT pg_wal_replay_resume()")


Severity: medium

The use of time.Sleep to force a WAL receiver reconnect is a heuristic that might be flaky. Consider if there is a more robust way to trigger a reconnection, such as using pg_reload_conf() or checking if the configuration change is sufficient on its own.

Comment on lines +88 to +91
if err := inst.PostgreSQLReconfigureStandby(&replica.Key, &promotedInstance.Key); err != nil {
AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("error reconfiguring standby %+v: %+v (continuing with others)", replica.Key, err))
topologyRecovery.LostReplicas.AddKey(replica.Key)
}


Severity: medium

If PostgreSQLReconfigureStandby fails for a replica, the current implementation logs the error and continues, but it might be safer to mark the entire recovery as failed or at least explicitly handle the failure of critical replicas.


coderabbitai bot commented Apr 3, 2026

Warning

Rate limit exceeded

@renecannao has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 10 minutes and 58 seconds before requesting another review.




📥 Commits

Reviewing files that changed from the base of the PR and between 314d850 and 306e8a1.

📒 Files selected for processing (12)
  • go/config/config.go
  • go/inst/analysis.go
  • go/inst/analysis_dao.go
  • go/inst/analysis_dao_postgresql.go
  • go/inst/analysis_test.go
  • go/inst/instance_dao.go
  • go/inst/instance_dao_postgresql.go
  • go/inst/instance_dao_postgresql_test.go
  • go/inst/instance_topology_postgresql.go
  • go/logic/orchestrator.go
  • go/logic/topology_recovery.go
  • go/logic/topology_recovery_postgresql.go



Copilot AI left a comment


Pull request overview

This PR adds first-class PostgreSQL streaming replication support to orchestrator, gated behind a new ProviderType configuration flag so the existing MySQL behavior remains the default.

Changes:

  • Introduces ProviderType (mysql default, postgresql option) and initializes the PostgreSQL provider in the discovery loop.
  • Adds PostgreSQL discovery and topology mapping (primary/standby roles, WAL/LSN mapping) plus PG-specific replication analysis codes and analysis logic.
  • Implements PostgreSQL recovery flow for dead primaries (standby promotion + reconfigure remaining standbys) and wires PG analysis codes into the recovery dispatcher.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.

Summary per file:
  • go/logic/topology_recovery.go: Routes new PostgreSQL analysis codes into recovery/emergent-operation dispatch.
  • go/logic/topology_recovery_postgresql.go: Adds PostgreSQL-specific dead-primary recovery implementation (promote + reconfigure).
  • go/logic/orchestrator.go: Initializes the configured provider (PostgreSQL when enabled) at continuous discovery startup.
  • go/inst/instance_topology_postgresql.go: Implements pg_promote() promotion, standby reconfiguration, and "best standby" selection logic.
  • go/inst/instance_dao.go: Switches discovery path to PostgreSQL-specific reader when ProviderType=postgresql.
  • go/inst/instance_dao_postgresql.go: Adds PostgreSQL topology discovery and maps PG concepts into the Instance model.
  • go/inst/instance_dao_postgresql_test.go: Adds unit tests for PG parsing helpers and ProviderType default.
  • go/inst/analysis.go: Adds PostgreSQL-specific analysis code constants.
  • go/inst/analysis_test.go: Adds tests to ensure PG analysis codes are unique and don't collide with MySQL codes.
  • go/inst/analysis_dao.go: Delegates replication analysis to a PG-specific implementation when enabled.
  • go/inst/analysis_dao_postgresql.go: Implements PostgreSQL replication failure detection using orchestrator backend state.
  • go/config/config.go: Adds ProviderType config field with default mysql.


Comment on lines +179 to +183
if port == 0 {
port = config.Config.DefaultInstancePort
}
instance.MasterKey = InstanceKey{
Hostname: host,

Copilot AI Apr 3, 2026


This fallback uses config.Config.DefaultInstancePort when primary_conninfo omits port. DefaultInstancePort defaults to 3306 (MySQL), so PostgreSQL standbys may be associated with the wrong primary port (typically 5432), breaking discovery/recovery. Introduce/use a PostgreSQL-specific default port and use it here instead of DefaultInstancePort.

Copilot uses AI. Check for mistakes.
Comment on lines +244 to +248
port := config.Config.DefaultInstancePort
// client_port in pg_stat_replication is the ephemeral port, not the PG listen port.
// We use DefaultInstancePort instead.
replicaKey := InstanceKey{Hostname: hostname, Port: port}
instance.AddReplicaKey(&replicaKey)

Copilot AI Apr 3, 2026


Replica keys discovered via pg_stat_replication are recorded using config.Config.DefaultInstancePort (defaults to 3306). This will persist incorrect ports for PostgreSQL replicas and prevent subsequent connections. Use a PostgreSQL port configuration/default (typically 5432) instead of DefaultInstancePort for PostgreSQL topologies.
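
One possible shape for that suggestion, sketched below. PostgreSQLDefaultPort is a hypothetical config field introduced for illustration, not one this PR adds:

```go
package main

import "fmt"

// Config is a cut-down stand-in for orchestrator's config struct;
// PostgreSQLDefaultPort is a hypothetical addition.
type Config struct {
	DefaultInstancePort   int
	PostgreSQLDefaultPort int
	ProviderType          string
}

// instancePort falls back to a PostgreSQL-appropriate port (5432)
// instead of DefaultInstancePort (3306) when the provider is postgresql.
func (c Config) instancePort() int {
	if c.ProviderType == "postgresql" {
		if c.PostgreSQLDefaultPort != 0 {
			return c.PostgreSQLDefaultPort
		}
		return 5432
	}
	return c.DefaultInstancePort
}

func main() {
	cfg := Config{DefaultInstancePort: 3306, ProviderType: "postgresql"}
	fmt.Println(cfg.instancePort()) // 5432
}
```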

instance.IsUpToDate = true
instance.IsRecentlyChecked = true
instance.LastSeenTimestamp = time.Now().Format("2006-01-02 15:04:05")


Copilot AI Apr 3, 2026


Before writing to the backend (WriteInstance), instance.ClusterName is never set in the PostgreSQL discovery path. This can leave cluster_name empty and break cluster grouping/analysis/recovery behavior that expects a stable cluster name (MySQL discovery falls back to instance.Key.StringCode()). Consider setting instance.ClusterName (at least for primaries) or reusing ReadInstanceClusterAttributes prior to WriteInstance().

Suggested change
+ // Ensure ClusterName is always populated before persisting the instance.
+ // This provides a stable fallback cluster identifier for PostgreSQL
+ // discovery, matching the expectation that cluster_name is never empty.
+ if instance.ClusterName == "" {
+     instance.ClusterName = instance.Key.StringCode()
+ }

// Primary is unreachable
if a.CountReplicas == 0 {
// No standbys at all - nothing to fail over to
a.Analysis = DeadMasterWithoutReplicas

Copilot AI Apr 3, 2026


When the primary is unreachable and CountReplicas == 0, this uses the MySQL analysis code DeadMasterWithoutReplicas. Since recovery dispatch is analysis-code driven, this can route PostgreSQL incidents through MySQL recovery logic. Use a PostgreSQL-specific analysis code here (or add one for "dead primary without standbys").

Suggested change
- a.Analysis = DeadMasterWithoutReplicas
+ a.Analysis = DeadPrimary

Comment on lines +148 to +150
// All standbys are also unreachable
a.Analysis = DeadPrimaryAndSomeStandbys
a.Description = "Primary is unreachable and none of its standbys are reachable"

Copilot AI Apr 3, 2026


This branch sets DeadPrimaryAndSomeStandbys when CountValidReplicas == 0 (description indicates none of the standbys are reachable). Because DeadPrimaryAndSomeStandbys is wired to checkAndRecoverDeadPrimary, orchestrator will repeatedly attempt an automated failover even though no standby is eligible, generating noisy failed recoveries. Consider using a non-recoverable analysis code for this case or introducing a distinct code like "DeadPrimaryAndAllStandbys" and not triggering automated recovery when nothing can be promoted.

Suggested change
- // All standbys are also unreachable
- a.Analysis = DeadPrimaryAndSomeStandbys
- a.Description = "Primary is unreachable and none of its standbys are reachable"
+ // Standbys exist, but none are reachable, so there is no promotion candidate.
+ // Use a non-recoverable analysis to avoid repeated automated failover attempts.
+ a.Analysis = DeadMasterWithoutReplicas
+ a.Description = "Primary is unreachable and has no reachable standbys"

Comment on lines +111 to +115
// Update primary_conninfo using ALTER SYSTEM
_, err = db.Exec(fmt.Sprintf("ALTER SYSTEM SET primary_conninfo = '%s'", newConnInfo))
if err != nil {
return log.Errore(fmt.Errorf("PostgreSQLReconfigureStandby: ALTER SYSTEM failed on %+v: %v", *standbyKey, err))
}

Copilot AI Apr 3, 2026


ALTER SYSTEM SET primary_conninfo is built via string interpolation with newConnInfo wrapped in single quotes. If any field (notably PostgreSQLTopologyPassword) contains a single quote, this becomes invalid SQL and breaks automated recovery; it also introduces an avoidable SQL-injection surface via configuration. Use a parameterized query (e.g. ... primary_conninfo = $1) or safely quote/escape the value.
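
On the escaping route: in PostgreSQL string literals (with the default standard_conforming_strings=on), doubling embedded single quotes is the standard quoting rule. A minimal sketch of that fix; note that, as far as I know, utility statements such as ALTER SYSTEM generally do not accept bind parameters, so quoting rather than a $1 placeholder may be the practical option here:

```go
package main

import (
	"fmt"
	"strings"
)

// quoteLiteral renders a value as a single-quoted SQL string literal by
// doubling embedded single quotes. This is a sketch of the reviewer's
// suggestion, not the PR's code; it assumes standard_conforming_strings=on
// (the PostgreSQL default), under which backslashes are literal.
func quoteLiteral(s string) string {
	return "'" + strings.ReplaceAll(s, "'", "''") + "'"
}

func main() {
	conninfo := "host=primary1 port=5432 password=it's-secret"
	fmt.Println("ALTER SYSTEM SET primary_conninfo = " + quoteLiteral(conninfo))
}
```

A driver-provided quoting helper, where available, would be preferable to a hand-rolled one.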

Comment on lines +275 to +279
// Try just host
hostMatch := regexp.MustCompile(`host=(\S+)`).FindStringSubmatch(conninfo)
if len(hostMatch) >= 2 {
return hostMatch[1], 0
}

Copilot AI Apr 3, 2026


parseConnInfo compiles a new regexp on every call for the host-only fallback. This is avoidable overhead in a hot discovery path. Move that regexp to a package-level var (similar to connInfoHostPortRegexp) and reuse it.
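
A sketch of that refactor, with the regexps hoisted to package scope so they compile once at init. The patterns here extract host and port independently rather than reproducing the PR's exact expressions, so treat the names and regexes as illustrative:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Compiled once at package init instead of on every call,
// as the review suggests.
var (
	connInfoHostRe = regexp.MustCompile(`host=(\S+)`)
	connInfoPortRe = regexp.MustCompile(`port=(\d+)`)
)

// parseConnInfo pulls host and port out of a primary_conninfo string.
// Port 0 signals "not specified" so callers can apply a default.
func parseConnInfo(conninfo string) (host string, port int) {
	if m := connInfoHostRe.FindStringSubmatch(conninfo); len(m) >= 2 {
		host = m[1]
	}
	if m := connInfoPortRe.FindStringSubmatch(conninfo); len(m) >= 2 {
		port, _ = strconv.Atoi(m[1])
	}
	return host, port
}

func main() {
	h, p := parseConnInfo("host=primary1 port=5433 user=repl")
	fmt.Println(h, p) // primary1 5433
}
```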



Development

Successfully merging this pull request may close these issues.

Phase A: Wire provider into discovery loop

2 participants