feat(postgresql): add test_replication script for replication topology by renecannao · Pull Request #123 · ProxySQL/dbdeployer

renecannao · 2026-05-14T11:50:06Z

Summary

Closes #120.

PostgreSQL replication deployments now get a test_replication script in the topology directory, alongside the existing check_replication and check_recovery. Mirrors the MySQL test_replication template but uses PostgreSQL primitives.

What it does

When you run it from the topology dir (postgresql_repl_<port>/):

Creates a test table dbdeployer_test_replication on the primary.
Inserts 20 rows.
Captures the primary's pg_current_wal_lsn() and row count.
Briefly sleeps to give streaming replication a moment to catch up.
For each replica: asserts pg_is_in_recovery() = t, pg_last_wal_replay_lsn() matches the primary's LSN, and count(*) matches.
Reports tap-style ok / not ok per assertion; exits 1 on any failure.

Files

providers/postgresql/scripts.go — new GenerateTestReplicationScript(opts, numReplicas) function. Embeds an ok_equal / test_summary helper pair (same shape as the MySQL template).
cmd/replication.go — generates test_replication next to the existing check_replication / check_recovery writes in deployReplicationNonMySQL.
providers/postgresql/postgresql_test.go — TestGenerateTestReplicationScript covers expected substrings and bound checks on numReplicas (no replica3 references for a 2-replica deploy).

Reference

Sample script in the issue body was provided by Ronald Bradford (issue author). Implementation follows it closely, adapted to the actual PG provider's per-sandbox use script layout (./primary/use, ./replicaN/use) rather than the MySQL-style ./m / ./n1 shortcuts that the PG topology doesn't create.

Test plan

go build ./... clean.
go test ./providers/postgresql/... — TestGenerateTestReplicationScript and existing tests pass.
Generated script bash -n syntax-checks cleanly.
Visually inspected the generated output for a 2-replica deploy — each replica section properly numbered, sandbox dir interpolated, LSN/recovery/row-count assertions all present.

Note on timing flakiness

PostgreSQL streaming replication is asynchronous; a one-second sleep between primary writes and replica reads is typically enough but can flake on a busy system. The Ronald-provided reference uses sleep 0.5 and the same approach. A more robust variant would poll-with-timeout for the replica's LSN to match the primary's — happy to fold that in as a follow-up if this turns out flaky in practice.

Summary by CodeRabbit

New Features
- PostgreSQL replication topologies now automatically generate a test script that validates streaming replication health by verifying each replica's recovery status, WAL LSN synchronization with the primary, and data consistency.
Tests
- Added comprehensive test coverage for the replication validation script generator.

Closes #120. PostgreSQL replication deployments now get a `test_replication` script in the topology directory, alongside the existing `check_replication` and `check_recovery` scripts. It mirrors the MySQL `test_replication` template (sandbox/templates/replication/test_replication.gotxt) but uses PostgreSQL primitives: - pg_current_wal_lsn() on the primary to capture WAL position - pg_last_wal_replay_lsn() on each replica to compare - pg_is_in_recovery() to confirm each replica is actually a standby Flow: create a test table on the primary, insert 20 rows, capture LSN + row count, then for each replica assert recovery mode + LSN match + row count match. Reports tap-style ok/not ok per assertion and exits 1 on any failure. Generated via a new `GenerateTestReplicationScript(opts, numReplicas)` in providers/postgresql/scripts.go, wired into deployReplicationNonMySQL in cmd/replication.go. Sample reference script was provided by Ronald Bradford in the issue.

coderabbitai · 2026-05-14T11:50:18Z

📝 Walkthrough

Walkthrough

This PR adds a new GenerateTestReplicationScript function that generates a Bash script to validate PostgreSQL streaming replication by verifying recovery state, WAL LSN synchronization, and row counts across primary and replica instances, and integrates it into the replication deployment command.

Changes

PostgreSQL Replication Test Script

Layer / File(s)	Summary
Replication test script generator `providers/postgresql/scripts.go`, `providers/postgresql/postgresql_test.go`	`GenerateTestReplicationScript` constructs a Bash script that creates a test table on the primary, captures WAL LSN and row count, then iterates over replicas checking recovery state, LSN match, and row-count match. Tests validate script content includes required environment setup and correct replica references.
Command integration for topology deployment `cmd/replication.go`	The `deployReplicationNonMySQL` command now generates and installs the `test_replication` script for PostgreSQL replication topologies by creating `ScriptOptions` and writing the generated script to the topology directory.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related issues

test_replication for PostgreSQL #120: Changes directly implement the test_replication script generation feature requested in this issue.

Possibly related PRs

ProxySQL/dbdeployer#54: Both PRs extend PostgreSQL replication capabilities by modifying cmd/replication.go and adding script generation in providers/postgresql/scripts.go.

Poem

🐰 A rabbit built a script so fine,
To test if primary and replicas align!
LSN and rows must match just so,
PostgreSQL replication verified—go, go, go! 🗄️

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title directly and accurately describes the main change: adding a test_replication script for PostgreSQL replication topology. It is concise, clear, and specific.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/postgresql-test-replication-script

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

gemini-code-assist

Code Review

This pull request introduces a test_replication script for PostgreSQL sandboxes, enabling automated verification of replication state and data consistency. The implementation includes the script generation logic, integration into the deployment command, and unit tests. Feedback suggests refactoring cmd/replication.go to reuse existing configuration options and improving the reliability of the replication check by using PostgreSQL's LSN comparison operators instead of string equality to account for background WAL activity.

gemini-code-assist · 2026-05-14T11:51:55Z

+	testReplOpts := postgresql.ScriptOptions{
+		SandboxDir: topologyDir,
+		BinDir:     binDir,
+		LibDir:     libDir,
+		Port:       primaryPort,
+	}
+	testReplScript := postgresql.GenerateTestReplicationScript(testReplOpts, len(replicaPorts))


The testReplOpts variable is redundant. You can reuse the existing scriptOpts (defined on line 141) by setting its SandboxDir field. This makes the code cleaner and avoids duplicating the configuration for BinDir, LibDir, and Port.

scriptOpts.SandboxDir = topologyDir testReplScript := postgresql.GenerateTestReplicationScript(scriptOpts, len(replicaPorts))

gemini-code-assist · 2026-05-14T11:51:55Z

+    REPLICA_RECS=$(./replica%d/use -Atqc 'SELECT count(*) FROM dbdeployer_test_replication')
+    REPLICA_IS_REPLICA=$(./replica%d/use -Atqc 'SELECT pg_is_in_recovery()')
+    ok_equal "$REPLICA_IS_REPLICA" "t" "replica %d is in recovery mode"
+    ok_equal "$REPLICA_LSN" "$PRIMARY_LSN" "replica %d LSN matches primary"


Comparing LSNs as strings for exact equality is fragile in PostgreSQL. Even in an idle sandbox, background processes (like the checkpointer or autovacuum) can advance the WAL LSN on the primary between the time the inserts finish and the LSN is captured, or during the sleep period. A more robust approach would be to use PostgreSQL's LSN comparison operator to ensure the replica has replayed at least up to the primary's LSN: SELECT pg_last_wal_replay_lsn() >= '$PRIMARY_LSN'::pg_lsn. This would return 't' if the replica is caught up or ahead, avoiding false negatives.

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@providers/postgresql/scripts.go`:
- Around line 126-139: The test currently does a fixed "sleep 1" and then strict
string equality between REPLICA_LSN and PRIMARY_LSN (variables used in the
replica loop and the ok_equal assertions), which is flaky; replace that with a
polling loop per replica inside the numReplicas loop that repeatedly queries
pg_last_wal_replay_lsn() (via ./replica%d/use) and checks whether the replay LSN
is >= PRIMARY_LSN (or use pg_lsn comparison/CAST) until either the condition is
true or a short timeout (e.g., several seconds) elapses, then assert
success/failure; remove the fixed sleep and change the ok_equal LSN assertion to
assert that the poll succeeded (and keep the row-count and recovery-mode checks
as before).

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 89a3d65a-c752-488e-87f7-ba6b8d925370

📥 Commits

Reviewing files that changed from the base of the PR and between b738c7d and fb4ffbd.

📒 Files selected for processing (3)

cmd/replication.go
providers/postgresql/postgresql_test.go
providers/postgresql/scripts.go

coderabbitai · 2026-05-14T11:53:41Z

+# Allow a brief window for streaming replication to catch up
+sleep 1
+
+`, opts.SandboxDir))
+
+	for n := 1; n <= numReplicas; n++ {
+		b.WriteString(fmt.Sprintf(`echo "# Testing replica %d"
+if [ -x ./replica%d/use ]; then
+    REPLICA_LSN=$(./replica%d/use -Atqc 'SELECT pg_last_wal_replay_lsn()')
+    REPLICA_RECS=$(./replica%d/use -Atqc 'SELECT count(*) FROM dbdeployer_test_replication')
+    REPLICA_IS_REPLICA=$(./replica%d/use -Atqc 'SELECT pg_is_in_recovery()')
+    ok_equal "$REPLICA_IS_REPLICA" "t" "replica %d is in recovery mode"
+    ok_equal "$REPLICA_LSN" "$PRIMARY_LSN" "replica %d LSN matches primary"
+    ok_equal "$REPLICA_RECS" "$PRIMARY_RECS" "replica %d row count matches primary"


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

PostgreSQL standby validation best practice: should replication catch-up checks use pg_last_wal_replay_lsn() >= ''::pg_lsn with polling timeout instead of fixed sleep + strict equality?

💡 Result:

Yes—best practice is not to rely on fixed sleeps plus strict equality checks. Prefer waiting semantics (deadline/polling with timeout) against LSN replay progress, typically using a “wait until LSN is reached” mechanism; if you must poll, use a >= comparison rather than ==. Rationale (grounded in PostgreSQL behavior and docs): 1) Replay progresses in steps and may temporarily not exactly match your target LSN. Using a threshold check like pg_last_wal_replay_lsn >= target is robust against minor timing differences, and polling with a timeout avoids both indefinite hangs and coarse fixed sleeps. The WAL replay position on the standby is specifically exposed via pg_last_wal_replay_lsn as the “most recent WAL location replayed” by a streaming standby [1]. 2) PostgreSQL provides a built-in SQL facility designed for this exact pattern: WAIT FOR, which “waits until the specified lsn is reached” and supports timeout. It also clarifies standby modes: standby_replay waits for replay (applied to the database) which matches your use case (“catch-up” before validation) [2]. On timeout it errors unless NO_THROW is set [2]. 3) Replication catch-up is about lags and states rather than instantaneous equality. The documentation for warm standby explains the system transitions from “catchup” until the lag between standby and primary reaches zero for the first time, after which it becomes real-time streaming; this naturally implies using lag/position thresholds and waits rather than strict equality at an arbitrary sampling time [3]. Concrete recommendation: - If you can, use WAIT FOR with standby_replay and a TIMEOUT rather than custom loops. Example (conceptual): store the target LSN on the primary client side, then on the standby: WAIT FOR <target_lsn> IN standby_replay TIMEOUT '' (syntax/interval depends on your PostgreSQL version/client usage) [2]. - If you cannot use WAIT FOR, implement polling on the standby replay LSN with a timeout and a >= comparison: - compute target_lsn (from the primary, e.g., primary current WAL LSN) - loop: SELECT pg_last_wal_replay_lsn; succeed when replay_lsn >= target_lsn - abort when deadline exceeded This aligns with how pg_last_wal_replay_lsn represents “replayed” progress [1] and with “wait until reached” semantics [2]. So to your exact question: - Should you use pg_last_wal_replay_lsn >= ''::pg_lsn with polling timeout instead of fixed sleep + strict equality? Yes: >= + timeout polling (or ideally WAIT FOR standby_replay + TIMEOUT) is the safer, more correct best practice than == with fixed sleep [2][1].

Citations:

1: https://pgpedia.info/p/pg_last_wal_replay_lsn.html

2: http://www.postgresql.org/docs/devel/sql-wait-for.html

3: https://postgresql.org/docs/15/warm-standby.html

🏁 Script executed:

# First, check if the file exists and examine its structure head -20 providers/postgresql/scripts.go

Repository: ProxySQL/dbdeployer

Length of output: 363

🏁 Script executed:

# Check for the GenerateTestReplicationScript function and the specific code section sed -n '64,150p' providers/postgresql/scripts.go | head -100

Repository: ProxySQL/dbdeployer

Length of output: 3266

🏁 Script executed:

# Verify copyright header requirement for .go files per guidelines head -1 providers/postgresql/scripts.go

Repository: ProxySQL/dbdeployer

Length of output: 82

Use polling with >= comparison instead of fixed sleep plus strict LSN equality to avoid flaky replica validation.

Fixed sleep 1 followed by strict string equality on LSN (REPLICA_LSN == PRIMARY_LSN) causes false failures when replay progresses or lags naturally occur. PostgreSQL replication best practice is to poll with a timeout using pg_last_wal_replay_lsn() >= target_lsn comparison, or use WAIT FOR with timeout if available. This ensures the test tolerates normal replication timing variations.

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@providers/postgresql/scripts.go` around lines 126 - 139, The test currently does a fixed "sleep 1" and then strict string equality between REPLICA_LSN and PRIMARY_LSN (variables used in the replica loop and the ok_equal assertions), which is flaky; replace that with a polling loop per replica inside the numReplicas loop that repeatedly queries pg_last_wal_replay_lsn() (via ./replica%d/use) and checks whether the replay LSN is >= PRIMARY_LSN (or use pg_lsn comparison/CAST) until either the condition is true or a short timeout (e.g., several seconds) elapses, then assert success/failure; remove the fixed sleep and change the ok_equal LSN assertion to assert that the poll succeeded (and keep the row-count and recovery-mode checks as before).

gemini-code-assist Bot reviewed May 14, 2026

View reviewed changes

coderabbitai Bot reviewed May 14, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(postgresql): add test_replication script for replication topology#123

feat(postgresql): add test_replication script for replication topology#123
renecannao wants to merge 1 commit into
masterfrom
feat/postgresql-test-replication-script

renecannao commented May 14, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 14, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 14, 2026

Uh oh!

gemini-code-assist Bot May 14, 2026

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

renecannao commented May 14, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What it does

Files

Reference

Test plan

Note on timing flakiness

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related issues

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

renecannao commented May 14, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 14, 2026 •

edited

Loading