
Add PostgreSQL support: discovery, failure detection, and recovery#70

Merged
renecannao merged 3 commits into master from issue64-postgresql-support
Apr 3, 2026

Conversation

@renecannao

Summary

Implements PostgreSQL streaming replication support across three phases:

  • Phase A (Discovery): Adds a ProviderType config field ("mysql" by default, "postgresql" as the alternative). When set to "postgresql", the discovery loop connects to PostgreSQL instances and queries pg_is_in_recovery(), pg_stat_replication, and pg_stat_wal_receiver, mapping PG concepts (WAL LSN, standby/primary roles, WAL receiver status) onto orchestrator's Instance struct. The MySQL discovery path is completely unaffected.

  • Phase B (Failure Detection): Adds PostgreSQL-specific analysis codes (DeadPrimary, DeadPrimaryAndSomeStandbys, StandbyNotReplicating, AllStandbyNotReplicating, UnreachablePrimary). GetReplicationAnalysis delegates to a PG-specific implementation that queries the orchestrator backend database for primary/standby topology state.

  • Phase C (Recovery): Implements PostgreSQLPromoteStandby (via pg_promote()), PostgreSQLReconfigureStandby (via ALTER SYSTEM SET primary_conninfo), and PostgreSQLGetBestStandbyForPromotion. Wires PG analysis codes into the recovery dispatcher so DeadPrimary triggers automated failover with standby promotion and reconfiguration.

All changes are gated behind ProviderType == "postgresql" -- the MySQL path is never executed when this is set, and the default remains "mysql".

Closes #65, #66, #67

Test plan

  • Build verification: go build -o /dev/null ./go/cmd/orchestrator
  • Unit tests pass: go test ./go/inst/... -vet=off -v
  • New tests for PG utility functions (LSN parsing, conninfo parsing, version extraction)
  • New tests for PG analysis codes (uniqueness, no collision with MySQL codes)
  • Config default test (ProviderType defaults to "mysql")
  • Integration test with actual PostgreSQL primary/standby topology (requires PG cluster)

Commit messages

Add ProviderType config field ("mysql" default, "postgresql") to gate all
PostgreSQL-specific code paths. When ProviderType is "postgresql":

- Initialize PostgreSQLProvider in ContinuousDiscovery
- ReadTopologyInstance and ReadTopologyInstanceBufferable delegate to
  ReadPostgreSQLTopologyInstance, which connects to PG, queries
  pg_is_in_recovery(), pg_stat_replication, pg_stat_wal_receiver,
  and maps PG concepts to orchestrator's Instance struct
- WAL LSN is mapped to BinlogCoordinates, standby status to
  replication thread states, conninfo to MasterKey

The MySQL path is completely unaffected when ProviderType is empty or "mysql".

Closes #65
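
As a rough illustration of the WAL-LSN mapping mentioned above: a pg_lsn text value such as `16/B374D848` is two 32-bit hex halves that can be packed into one comparable integer. A minimal sketch under that assumption (parsePgLSN is an illustrative name, not necessarily the PR's actual helper):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parsePgLSN converts a pg_lsn text value ("high/low", both hex)
// into a single uint64 so LSNs can be compared numerically.
func parsePgLSN(lsn string) (uint64, error) {
	parts := strings.Split(lsn, "/")
	if len(parts) != 2 {
		return 0, fmt.Errorf("malformed LSN %q", lsn)
	}
	high, err := strconv.ParseUint(parts[0], 16, 32)
	if err != nil {
		return 0, err
	}
	low, err := strconv.ParseUint(parts[1], 16, 32)
	if err != nil {
		return 0, err
	}
	return high<<32 | low, nil
}

func main() {
	v, _ := parsePgLSN("16/B374D848")
	fmt.Printf("%#x\n", v) // 0x16b374d848
}
```

Packing to a single integer makes replication-lag and promotion comparisons a plain numeric comparison rather than string handling.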

Add PostgreSQL-specific analysis codes: DeadPrimary,
DeadPrimaryAndSomeStandbys, StandbyNotReplicating,
AllStandbyNotReplicating, and UnreachablePrimary.

GetReplicationAnalysis delegates to GetPostgreSQLReplicationAnalysis
when ProviderType is "postgresql". The PG analysis queries the
orchestrator backend database for primary/standby topology state
and classifies failures based on primary reachability and standby
replication status.

The MySQL analysis path is completely unchanged.

Closes #66
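
The classification described above can be sketched as a pure function over primary reachability and standby replication counts. The constant names match this PR's description; classify itself is an illustrative reduction, not the PR's actual backend-query-driven code:

```go
package main

import "fmt"

// AnalysisCode mirrors orchestrator's analysis-code type.
type AnalysisCode string

// Names follow the PR description; string values are illustrative.
const (
	DeadPrimary                AnalysisCode = "DeadPrimary"
	DeadPrimaryAndSomeStandbys AnalysisCode = "DeadPrimaryAndSomeStandbys"
	StandbyNotReplicating      AnalysisCode = "StandbyNotReplicating"
	AllStandbyNotReplicating   AnalysisCode = "AllStandbyNotReplicating"
	UnreachablePrimary         AnalysisCode = "UnreachablePrimary"
	NoProblem                  AnalysisCode = "NoProblem"
)

// classify picks an analysis code from the two signals the commit
// message names: primary reachability and standby replication status.
func classify(primaryReachable bool, standbys, replicating int) AnalysisCode {
	switch {
	case !primaryReachable && standbys > 0:
		return DeadPrimary
	case !primaryReachable:
		return UnreachablePrimary
	case standbys > 0 && replicating == 0:
		return AllStandbyNotReplicating
	case replicating < standbys:
		return StandbyNotReplicating
	}
	return NoProblem
}

func main() {
	fmt.Println(classify(false, 2, 0)) // DeadPrimary
	fmt.Println(classify(true, 3, 2))  // StandbyNotReplicating
}
```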

Implement PostgreSQL-specific topology operations:
- PostgreSQLPromoteStandby: calls pg_promote() and waits for the
  instance to exit recovery mode
- PostgreSQLReconfigureStandby: updates primary_conninfo via
  ALTER SYSTEM and reloads configuration
- PostgreSQLGetBestStandbyForPromotion: selects the best standby
  based on check validity, lag, and LSN position

Wire recovery into the dispatcher (getCheckAndRecoverFunction):
- DeadPrimary/DeadPrimaryAndSomeStandbys -> checkAndRecoverDeadPrimary
- UnreachablePrimary, AllStandbyNotReplicating, StandbyNotReplicating
  are handled as generic (non-actionable) problems for now

checkAndRecoverDeadPrimary follows the same pattern as
checkAndRecoverDeadMaster: register recovery, run pre-failover hooks,
select and promote best standby, reconfigure remaining standbys,
run post-failover hooks.

Closes #67
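
The promotion-candidate selection can be sketched as follows. Standby is a pared-down stand-in for orchestrator's Instance, and the rule shown (check validity first, then the most advanced LSN) follows the commit description; field names and tie-breaking details are assumptions:

```go
package main

import "fmt"

// Standby is an illustrative stand-in for orchestrator's Instance.
type Standby struct {
	Name           string
	LastCheckValid bool   // discovery check recently succeeded
	ReplayLSN      uint64 // already parsed from pg_lsn text
}

// bestStandbyForPromotion skips standbys whose last check failed and
// returns the survivor with the most advanced replay LSN, if any.
func bestStandbyForPromotion(standbys []Standby) (Standby, bool) {
	var best Standby
	found := false
	for _, s := range standbys {
		if !s.LastCheckValid {
			continue
		}
		if !found || s.ReplayLSN > best.ReplayLSN {
			best, found = s, true
		}
	}
	return best, found
}

func main() {
	candidates := []Standby{
		{"standby1", true, 0x16B374D848},
		{"standby2", false, 0x16B374FFFF}, // ahead, but last check failed
		{"standby3", true, 0x16B374D000},
	}
	if best, ok := bestStandbyForPromotion(candidates); ok {
		fmt.Println(best.Name) // standby1
	}
}
```

The boolean return matters for the dispatcher: when no valid candidate exists, recovery should abort rather than promote blindly.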

Copilot AI review requested due to automatic review settings April 3, 2026 07:05
@renecannao renecannao merged commit 51ff6f6 into master Apr 3, 2026
6 of 7 checks passed

@gemini-code-assist bot left a comment


Code Review

This pull request introduces support for PostgreSQL as a database provider in Orchestrator. It adds the necessary configuration, data access objects for topology analysis and instance discovery, and logic for handling PostgreSQL-specific failover scenarios, such as promoting a standby to primary. The review comments highlight potential concerns regarding the reliability of port discovery for replicas, the use of sleep-based heuristics for WAL receiver reconnection, and the error handling strategy when reconfiguring standbys during failover.

hostname = clientAddr.String
}
if hostname != "" {
port := config.Config.DefaultInstancePort


Severity: medium

The use of config.Config.DefaultInstancePort for all replicas in pg_stat_replication might be incorrect if replicas are running on different ports. Consider if there is a way to reliably determine the actual port, or if this assumption is acceptable for the current implementation.

Comment on lines +124 to +126
_, _ = db.Exec("SELECT pg_wal_replay_pause()")
time.Sleep(100 * time.Millisecond)
_, _ = db.Exec("SELECT pg_wal_replay_resume()")


Severity: medium

The use of time.Sleep to force a WAL receiver reconnect is a heuristic that might be flaky. Consider if there is a more robust way to trigger a reconnection, such as using pg_reload_conf() or checking if the configuration change is sufficient on its own.

Comment on lines +88 to +91
if err := inst.PostgreSQLReconfigureStandby(&replica.Key, &promotedInstance.Key); err != nil {
AuditTopologyRecovery(topologyRecovery, fmt.Sprintf("error reconfiguring standby %+v: %+v (continuing with others)", replica.Key, err))
topologyRecovery.LostReplicas.AddKey(replica.Key)
}


Severity: medium

If PostgreSQLReconfigureStandby fails for a replica, the current implementation logs the error and continues, but it might be safer to mark the entire recovery as failed or at least explicitly handle the failure of critical replicas.


coderabbitai bot commented Apr 3, 2026

Warning

Rate limit exceeded

@renecannao has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 10 minutes and 58 seconds before requesting another review.




📥 Commits

Reviewing files that changed from the base of the PR and between 314d850 and 306e8a1.

📒 Files selected for processing (12)
  • go/config/config.go
  • go/inst/analysis.go
  • go/inst/analysis_dao.go
  • go/inst/analysis_dao_postgresql.go
  • go/inst/analysis_test.go
  • go/inst/instance_dao.go
  • go/inst/instance_dao_postgresql.go
  • go/inst/instance_dao_postgresql_test.go
  • go/inst/instance_topology_postgresql.go
  • go/logic/orchestrator.go
  • go/logic/topology_recovery.go
  • go/logic/topology_recovery_postgresql.go



Copilot AI left a comment


Pull request overview

This PR adds first-class PostgreSQL streaming replication support to orchestrator, gated behind a new ProviderType configuration flag so the existing MySQL behavior remains the default.

Changes:

  • Introduces ProviderType (mysql default, postgresql option) and initializes the PostgreSQL provider in the discovery loop.
  • Adds PostgreSQL discovery and topology mapping (primary/standby roles, WAL/LSN mapping) plus PG-specific replication analysis codes and analysis logic.
  • Implements PostgreSQL recovery flow for dead primaries (standby promotion + reconfigure remaining standbys) and wires PG analysis codes into the recovery dispatcher.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 7 comments.

Summary per file:
  • go/logic/topology_recovery.go: Routes new PostgreSQL analysis codes into recovery/emergent-operation dispatch.
  • go/logic/topology_recovery_postgresql.go: Adds PostgreSQL-specific dead-primary recovery implementation (promote + reconfigure).
  • go/logic/orchestrator.go: Initializes the configured provider (PostgreSQL when enabled) at continuous discovery startup.
  • go/inst/instance_topology_postgresql.go: Implements pg_promote() promotion, standby reconfiguration, and "best standby" selection logic.
  • go/inst/instance_dao.go: Switches discovery path to PostgreSQL-specific reader when ProviderType=postgresql.
  • go/inst/instance_dao_postgresql.go: Adds PostgreSQL topology discovery and maps PG concepts into the Instance model.
  • go/inst/instance_dao_postgresql_test.go: Adds unit tests for PG parsing helpers and ProviderType default.
  • go/inst/analysis.go: Adds PostgreSQL-specific analysis code constants.
  • go/inst/analysis_test.go: Adds tests to ensure PG analysis codes are unique and don't collide with MySQL codes.
  • go/inst/analysis_dao.go: Delegates replication analysis to a PG-specific implementation when enabled.
  • go/inst/analysis_dao_postgresql.go: Implements PostgreSQL replication failure detection using orchestrator backend state.
  • go/config/config.go: Adds ProviderType config field with default mysql.


Comment on lines +179 to +183
if port == 0 {
port = config.Config.DefaultInstancePort
}
instance.MasterKey = InstanceKey{
Hostname: host,

Copilot AI Apr 3, 2026


This fallback uses config.Config.DefaultInstancePort when primary_conninfo omits port. DefaultInstancePort defaults to 3306 (MySQL), so PostgreSQL standbys may be associated with the wrong primary port (typically 5432), breaking discovery/recovery. Introduce/use a PostgreSQL-specific default port and use it here instead of DefaultInstancePort.

Copilot uses AI. Check for mistakes.
Comment on lines +244 to +248
port := config.Config.DefaultInstancePort
// client_port in pg_stat_replication is the ephemeral port, not the PG listen port.
// We use DefaultInstancePort instead.
replicaKey := InstanceKey{Hostname: hostname, Port: port}
instance.AddReplicaKey(&replicaKey)

Copilot AI Apr 3, 2026


Replica keys discovered via pg_stat_replication are recorded using config.Config.DefaultInstancePort (defaults to 3306). This will persist incorrect ports for PostgreSQL replicas and prevent subsequent connections. Use a PostgreSQL port configuration/default (typically 5432) instead of DefaultInstancePort for PostgreSQL topologies.
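
One possible shape for that suggestion, sketched below. PostgreSQLDefaultPort is a hypothetical config field introduced for illustration, not one this PR adds:

```go
package main

import "fmt"

// Config is a cut-down stand-in for orchestrator's config struct;
// PostgreSQLDefaultPort is a hypothetical addition.
type Config struct {
	DefaultInstancePort   int
	PostgreSQLDefaultPort int
	ProviderType          string
}

// instancePort falls back to a PostgreSQL-appropriate port (5432)
// instead of DefaultInstancePort (3306) when the provider is postgresql.
func (c Config) instancePort() int {
	if c.ProviderType == "postgresql" {
		if c.PostgreSQLDefaultPort != 0 {
			return c.PostgreSQLDefaultPort
		}
		return 5432
	}
	return c.DefaultInstancePort
}

func main() {
	cfg := Config{DefaultInstancePort: 3306, ProviderType: "postgresql"}
	fmt.Println(cfg.instancePort()) // 5432
}
```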

instance.IsUpToDate = true
instance.IsRecentlyChecked = true
instance.LastSeenTimestamp = time.Now().Format("2006-01-02 15:04:05")


Copilot AI Apr 3, 2026


Before writing to the backend (WriteInstance), instance.ClusterName is never set in the PostgreSQL discovery path. This can leave cluster_name empty and break cluster grouping/analysis/recovery behavior that expects a stable cluster name (MySQL discovery falls back to instance.Key.StringCode()). Consider setting instance.ClusterName (at least for primaries) or reusing ReadInstanceClusterAttributes prior to WriteInstance().

Suggested change
+ // Ensure ClusterName is always populated before persisting the instance.
+ // This provides a stable fallback cluster identifier for PostgreSQL
+ // discovery, matching the expectation that cluster_name is never empty.
+ if instance.ClusterName == "" {
+     instance.ClusterName = instance.Key.StringCode()
+ }

// Primary is unreachable
if a.CountReplicas == 0 {
// No standbys at all - nothing to fail over to
a.Analysis = DeadMasterWithoutReplicas

Copilot AI Apr 3, 2026


When the primary is unreachable and CountReplicas == 0, this uses the MySQL analysis code DeadMasterWithoutReplicas. Since recovery dispatch is analysis-code driven, this can route PostgreSQL incidents through MySQL recovery logic. Use a PostgreSQL-specific analysis code here (or add one for "dead primary without standbys").

Suggested change
- a.Analysis = DeadMasterWithoutReplicas
+ a.Analysis = DeadPrimary

Comment on lines +148 to +150
// All standbys are also unreachable
a.Analysis = DeadPrimaryAndSomeStandbys
a.Description = "Primary is unreachable and none of its standbys are reachable"

Copilot AI Apr 3, 2026


This branch sets DeadPrimaryAndSomeStandbys when CountValidReplicas == 0 (description indicates none of the standbys are reachable). Because DeadPrimaryAndSomeStandbys is wired to checkAndRecoverDeadPrimary, orchestrator will repeatedly attempt an automated failover even though no standby is eligible, generating noisy failed recoveries. Consider using a non-recoverable analysis code for this case or introducing a distinct code like "DeadPrimaryAndAllStandbys" and not triggering automated recovery when nothing can be promoted.

Suggested change
- // All standbys are also unreachable
- a.Analysis = DeadPrimaryAndSomeStandbys
- a.Description = "Primary is unreachable and none of its standbys are reachable"
+ // Standbys exist, but none are reachable, so there is no promotion candidate.
+ // Use a non-recoverable analysis to avoid repeated automated failover attempts.
+ a.Analysis = DeadMasterWithoutReplicas
+ a.Description = "Primary is unreachable and has no reachable standbys"

Comment on lines +111 to +115
// Update primary_conninfo using ALTER SYSTEM
_, err = db.Exec(fmt.Sprintf("ALTER SYSTEM SET primary_conninfo = '%s'", newConnInfo))
if err != nil {
return log.Errore(fmt.Errorf("PostgreSQLReconfigureStandby: ALTER SYSTEM failed on %+v: %v", *standbyKey, err))
}

Copilot AI Apr 3, 2026


ALTER SYSTEM SET primary_conninfo is built via string interpolation with newConnInfo wrapped in single quotes. If any field (notably PostgreSQLTopologyPassword) contains a single quote, this becomes invalid SQL and breaks automated recovery; it also introduces an avoidable SQL-injection surface via configuration. Use a parameterized query (e.g. ... primary_conninfo = $1) or safely quote/escape the value.
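
On the escaping route: in PostgreSQL string literals (with the default standard_conforming_strings=on), doubling embedded single quotes is the standard quoting rule. A minimal sketch of that fix; note that, as far as I know, utility statements such as ALTER SYSTEM generally do not accept bind parameters, so quoting rather than a $1 placeholder may be the practical option here:

```go
package main

import (
	"fmt"
	"strings"
)

// quoteLiteral renders a value as a single-quoted SQL string literal by
// doubling embedded single quotes. This is a sketch of the reviewer's
// suggestion, not the PR's code; it assumes standard_conforming_strings=on
// (the PostgreSQL default), under which backslashes are literal.
func quoteLiteral(s string) string {
	return "'" + strings.ReplaceAll(s, "'", "''") + "'"
}

func main() {
	conninfo := "host=primary1 port=5432 password=it's-secret"
	fmt.Println("ALTER SYSTEM SET primary_conninfo = " + quoteLiteral(conninfo))
}
```

A driver-provided quoting helper, where available, would be preferable to a hand-rolled one.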

Comment on lines +275 to +279
// Try just host
hostMatch := regexp.MustCompile(`host=(\S+)`).FindStringSubmatch(conninfo)
if len(hostMatch) >= 2 {
return hostMatch[1], 0
}

Copilot AI Apr 3, 2026


parseConnInfo compiles a new regexp on every call for the host-only fallback. This is avoidable overhead in a hot discovery path. Move that regexp to a package-level var (similar to connInfoHostPortRegexp) and reuse it.
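
A sketch of that refactor, with the regexps hoisted to package scope so they compile once at init. The patterns here extract host and port independently rather than reproducing the PR's exact expressions, so treat the names and regexes as illustrative:

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// Compiled once at package init instead of on every call,
// as the review suggests.
var (
	connInfoHostRe = regexp.MustCompile(`host=(\S+)`)
	connInfoPortRe = regexp.MustCompile(`port=(\d+)`)
)

// parseConnInfo pulls host and port out of a primary_conninfo string.
// Port 0 signals "not specified" so callers can apply a default.
func parseConnInfo(conninfo string) (host string, port int) {
	if m := connInfoHostRe.FindStringSubmatch(conninfo); len(m) >= 2 {
		host = m[1]
	}
	if m := connInfoPortRe.FindStringSubmatch(conninfo); len(m) >= 2 {
		port, _ = strconv.Atoi(m[1])
	}
	return host, port
}

func main() {
	h, p := parseConnInfo("host=primary1 port=5433 user=repl")
	fmt.Println(h, p) // primary1 5433
}
```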



Development

Successfully merging this pull request may close these issues.

Phase A: Wire provider into discovery loop

2 participants