feat(postgresql): add test_replication script for replication topology#123
feat(postgresql): add test_replication script for replication topology#123renecannao wants to merge 1 commit into
Conversation
Closes #120. PostgreSQL replication deployments now get a `test_replication` script in the topology directory, alongside the existing `check_replication` and `check_recovery` scripts. It mirrors the MySQL `test_replication` template (sandbox/templates/replication/test_replication.gotxt) but uses PostgreSQL primitives: - pg_current_wal_lsn() on the primary to capture WAL position - pg_last_wal_replay_lsn() on each replica to compare - pg_is_in_recovery() to confirm each replica is actually a standby Flow: create a test table on the primary, insert 20 rows, capture LSN + row count, then for each replica assert recovery mode + LSN match + row count match. Reports tap-style ok/not ok per assertion and exits 1 on any failure. Generated via a new `GenerateTestReplicationScript(opts, numReplicas)` in providers/postgresql/scripts.go, wired into deployReplicationNonMySQL in cmd/replication.go. Sample reference script was provided by Ronald Bradford in the issue.
📝 WalkthroughWalkthroughThis PR adds a new ChangesPostgreSQL Replication Test Script
Estimated code review effort🎯 2 (Simple) | ⏱️ ~12 minutes Possibly related issues
Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Code Review
This pull request introduces a test_replication script for PostgreSQL sandboxes, enabling automated verification of replication state and data consistency. The implementation includes the script generation logic, integration into the deployment command, and unit tests. Feedback suggests refactoring cmd/replication.go to reuse existing configuration options and improving the reliability of the replication check by using PostgreSQL's LSN comparison operators instead of string equality to account for background WAL activity.
| testReplOpts := postgresql.ScriptOptions{ | ||
| SandboxDir: topologyDir, | ||
| BinDir: binDir, | ||
| LibDir: libDir, | ||
| Port: primaryPort, | ||
| } | ||
| testReplScript := postgresql.GenerateTestReplicationScript(testReplOpts, len(replicaPorts)) |
There was a problem hiding this comment.
The testReplOpts variable is redundant. You can reuse the existing scriptOpts (defined on line 141) by setting its SandboxDir field. This makes the code cleaner and avoids duplicating the configuration for BinDir, LibDir, and Port.
scriptOpts.SandboxDir = topologyDir
testReplScript := postgresql.GenerateTestReplicationScript(scriptOpts, len(replicaPorts))| REPLICA_RECS=$(./replica%d/use -Atqc 'SELECT count(*) FROM dbdeployer_test_replication') | ||
| REPLICA_IS_REPLICA=$(./replica%d/use -Atqc 'SELECT pg_is_in_recovery()') | ||
| ok_equal "$REPLICA_IS_REPLICA" "t" "replica %d is in recovery mode" | ||
| ok_equal "$REPLICA_LSN" "$PRIMARY_LSN" "replica %d LSN matches primary" |
There was a problem hiding this comment.
Comparing LSNs as strings for exact equality is fragile in PostgreSQL. Even in an idle sandbox, background processes (like the checkpointer or autovacuum) can advance the WAL LSN on the primary between the time the inserts finish and the LSN is captured, or during the sleep period. A more robust approach would be to use PostgreSQL's LSN comparison operator to ensure the replica has replayed at least up to the primary's LSN: SELECT pg_last_wal_replay_lsn() >= '$PRIMARY_LSN'::pg_lsn. This would return 't' if the replica is caught up or ahead, avoiding false negatives.
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@providers/postgresql/scripts.go`:
- Around line 126-139: The test currently does a fixed "sleep 1" and then strict
string equality between REPLICA_LSN and PRIMARY_LSN (variables used in the
replica loop and the ok_equal assertions), which is flaky; replace that with a
polling loop per replica inside the numReplicas loop that repeatedly queries
pg_last_wal_replay_lsn() (via ./replica%d/use) and checks whether the replay LSN
is >= PRIMARY_LSN (or use pg_lsn comparison/CAST) until either the condition is
true or a short timeout (e.g., several seconds) elapses, then assert
success/failure; remove the fixed sleep and change the ok_equal LSN assertion to
assert that the poll succeeded (and keep the row-count and recovery-mode checks
as before).
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 89a3d65a-c752-488e-87f7-ba6b8d925370
📒 Files selected for processing (3)
cmd/replication.goproviders/postgresql/postgresql_test.goproviders/postgresql/scripts.go
| # Allow a brief window for streaming replication to catch up | ||
| sleep 1 | ||
|
|
||
| `, opts.SandboxDir)) | ||
|
|
||
| for n := 1; n <= numReplicas; n++ { | ||
| b.WriteString(fmt.Sprintf(`echo "# Testing replica %d" | ||
| if [ -x ./replica%d/use ]; then | ||
| REPLICA_LSN=$(./replica%d/use -Atqc 'SELECT pg_last_wal_replay_lsn()') | ||
| REPLICA_RECS=$(./replica%d/use -Atqc 'SELECT count(*) FROM dbdeployer_test_replication') | ||
| REPLICA_IS_REPLICA=$(./replica%d/use -Atqc 'SELECT pg_is_in_recovery()') | ||
| ok_equal "$REPLICA_IS_REPLICA" "t" "replica %d is in recovery mode" | ||
| ok_equal "$REPLICA_LSN" "$PRIMARY_LSN" "replica %d LSN matches primary" | ||
| ok_equal "$REPLICA_RECS" "$PRIMARY_RECS" "replica %d row count matches primary" |
There was a problem hiding this comment.
🧩 Analysis chain
🌐 Web query:
PostgreSQL standby validation best practice: should replication catch-up checks use pg_last_wal_replay_lsn() >= ''::pg_lsn with polling timeout instead of fixed sleep + strict equality?
💡 Result:
Yes—best practice is not to rely on fixed sleeps plus strict equality checks. Prefer waiting semantics (deadline/polling with timeout) against LSN replay progress, typically using a “wait until LSN is reached” mechanism; if you must poll, use a >= comparison rather than ==. Rationale (grounded in PostgreSQL behavior and docs): 1) Replay progresses in steps and may temporarily not exactly match your target LSN. Using a threshold check like pg_last_wal_replay_lsn >= target is robust against minor timing differences, and polling with a timeout avoids both indefinite hangs and coarse fixed sleeps. The WAL replay position on the standby is specifically exposed via pg_last_wal_replay_lsn as the “most recent WAL location replayed” by a streaming standby [1]. 2) PostgreSQL provides a built-in SQL facility designed for this exact pattern: WAIT FOR, which “waits until the specified lsn is reached” and supports timeout. It also clarifies standby modes: standby_replay waits for replay (applied to the database) which matches your use case (“catch-up” before validation) [2]. On timeout it errors unless NO_THROW is set [2]. 3) Replication catch-up is about lags and states rather than instantaneous equality. The documentation for warm standby explains the system transitions from “catchup” until the lag between standby and primary reaches zero for the first time, after which it becomes real-time streaming; this naturally implies using lag/position thresholds and waits rather than strict equality at an arbitrary sampling time [3]. Concrete recommendation: - If you can, use WAIT FOR with standby_replay and a TIMEOUT rather than custom loops. Example (conceptual): store the target LSN on the primary client side, then on the standby: WAIT FOR <target_lsn> IN standby_replay TIMEOUT '' (syntax/interval depends on your PostgreSQL version/client usage) [2]. - If you cannot use WAIT FOR, implement polling on the standby replay LSN with a timeout and a >= comparison: - compute target_lsn (from the primary, e.g., primary current WAL LSN) - loop: SELECT pg_last_wal_replay_lsn; succeed when replay_lsn >= target_lsn - abort when deadline exceeded This aligns with how pg_last_wal_replay_lsn represents “replayed” progress [1] and with “wait until reached” semantics [2]. So to your exact question: - Should you use pg_last_wal_replay_lsn >= ''::pg_lsn with polling timeout instead of fixed sleep + strict equality? Yes: >= + timeout polling (or ideally WAIT FOR standby_replay + TIMEOUT) is the safer, more correct best practice than == with fixed sleep [2][1].
Citations:
- 1: https://pgpedia.info/p/pg_last_wal_replay_lsn.html
- 2: http://www.postgresql.org/docs/devel/sql-wait-for.html
- 3: https://postgresql.org/docs/15/warm-standby.html
🏁 Script executed:
# First, check if the file exists and examine its structure
head -20 providers/postgresql/scripts.goRepository: ProxySQL/dbdeployer
Length of output: 363
🏁 Script executed:
# Check for the GenerateTestReplicationScript function and the specific code section
sed -n '64,150p' providers/postgresql/scripts.go | head -100Repository: ProxySQL/dbdeployer
Length of output: 3266
🏁 Script executed:
# Verify copyright header requirement for .go files per guidelines
head -1 providers/postgresql/scripts.goRepository: ProxySQL/dbdeployer
Length of output: 82
Use polling with >= comparison instead of fixed sleep plus strict LSN equality to avoid flaky replica validation.
Fixed sleep 1 followed by strict string equality on LSN (REPLICA_LSN == PRIMARY_LSN) causes false failures when replay progresses or lags naturally occur. PostgreSQL replication best practice is to poll with a timeout using pg_last_wal_replay_lsn() >= target_lsn comparison, or use WAIT FOR with timeout if available. This ensures the test tolerates normal replication timing variations.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@providers/postgresql/scripts.go` around lines 126 - 139, The test currently
does a fixed "sleep 1" and then strict string equality between REPLICA_LSN and
PRIMARY_LSN (variables used in the replica loop and the ok_equal assertions),
which is flaky; replace that with a polling loop per replica inside the
numReplicas loop that repeatedly queries pg_last_wal_replay_lsn() (via
./replica%d/use) and checks whether the replay LSN is >= PRIMARY_LSN (or use
pg_lsn comparison/CAST) until either the condition is true or a short timeout
(e.g., several seconds) elapses, then assert success/failure; remove the fixed
sleep and change the ok_equal LSN assertion to assert that the poll succeeded
(and keep the row-count and recovery-mode checks as before).
Summary
Closes #120.
PostgreSQL replication deployments now get a
test_replicationscript in the topology directory, alongside the existingcheck_replicationandcheck_recovery. Mirrors the MySQLtest_replicationtemplate but uses PostgreSQL primitives.What it does
When you run it from the topology dir (
postgresql_repl_<port>/):dbdeployer_test_replicationon the primary.pg_current_wal_lsn()and row count.pg_is_in_recovery()=t,pg_last_wal_replay_lsn()matches the primary's LSN, andcount(*)matches.ok/not okper assertion; exits 1 on any failure.Files
providers/postgresql/scripts.go— newGenerateTestReplicationScript(opts, numReplicas)function. Embeds anok_equal/test_summaryhelper pair (same shape as the MySQL template).cmd/replication.go— generatestest_replicationnext to the existingcheck_replication/check_recoverywrites indeployReplicationNonMySQL.providers/postgresql/postgresql_test.go—TestGenerateTestReplicationScriptcovers expected substrings and bound checks onnumReplicas(no replica3 references for a 2-replica deploy).Reference
Sample script in the issue body was provided by Ronald Bradford (issue author). Implementation follows it closely, adapted to the actual PG provider's per-sandbox
usescript layout (./primary/use,./replicaN/use) rather than the MySQL-style./m/./n1shortcuts that the PG topology doesn't create.Test plan
go build ./...clean.go test ./providers/postgresql/...—TestGenerateTestReplicationScriptand existing tests pass.bash -nsyntax-checks cleanly.Note on timing flakiness
PostgreSQL streaming replication is asynchronous; a one-second sleep between primary writes and replica reads is typically enough but can flake on a busy system. The Ronald-provided reference uses
sleep 0.5and the same approach. A more robust variant would poll-with-timeout for the replica's LSN to match the primary's — happy to fold that in as a follow-up if this turns out flaky in practice.Summary by CodeRabbit
New Features
Tests