feat(postgresql): add test_replication script for replication topology #123

Open
renecannao wants to merge 1 commit into master from feat/postgresql-test-replication-script

renecannao commented May 14, 2026

Summary

Closes #120.

PostgreSQL replication deployments now get a test_replication script in the topology directory, alongside the existing check_replication and check_recovery. Mirrors the MySQL test_replication template but uses PostgreSQL primitives.

What it does

When you run it from the topology dir (postgresql_repl_<port>/):

  1. Creates a test table dbdeployer_test_replication on the primary.
  2. Inserts 20 rows.
  3. Captures the primary's pg_current_wal_lsn() and row count.
  4. Briefly sleeps to give streaming replication a moment to catch up.
  5. For each replica: asserts pg_is_in_recovery() = t, pg_last_wal_replay_lsn() matches the primary's LSN, and count(*) matches.
  6. Reports TAP-style ok / not ok per assertion; exits 1 on any failure.
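
To make the reporting contract in step 6 concrete, here is a minimal Go sketch of the same idea: one ok / not ok line per assertion and a non-zero exit code once anything has failed. The `tally` type and its method names are hypothetical and only mirror the shape of the Bash ok_equal / test_summary helpers; the generated script's actual helper bodies may differ.

```go
package main

import "fmt"

// tally is a hypothetical stand-in for the script's ok_equal / test_summary
// pair: it counts passed and failed assertions.
type tally struct {
	pass, fail int
}

// okEqual compares got with want, records the result, and returns the
// TAP-style line that would be printed for this assertion.
func (t *tally) okEqual(got, want, msg string) string {
	if got == want {
		t.pass++
		return fmt.Sprintf("ok - %s", msg)
	}
	t.fail++
	return fmt.Sprintf("not ok - %s (got %q, expected %q)", msg, got, want)
}

// exitCode returns 1 if any assertion failed and 0 otherwise.
func (t *tally) exitCode() int {
	if t.fail > 0 {
		return 1
	}
	return 0
}

func main() {
	tl := &tally{}
	fmt.Println(tl.okEqual("t", "t", "replica 1 is in recovery mode"))
	fmt.Println(tl.okEqual("19", "20", "replica 1 row count matches primary"))
	fmt.Printf("# passed: %d, failed: %d, exit code: %d\n", tl.pass, tl.fail, tl.exitCode())
}
```

A single failed assertion is enough to turn the exit code to 1, which is what lets a CI job treat the whole script as failed.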

Files

  • providers/postgresql/scripts.go: new GenerateTestReplicationScript(opts, numReplicas) function. Embeds an ok_equal / test_summary helper pair (same shape as the MySQL template).
  • cmd/replication.go: generates test_replication next to the existing check_replication / check_recovery writes in deployReplicationNonMySQL.
  • providers/postgresql/postgresql_test.go: TestGenerateTestReplicationScript covers expected substrings and bound checks on numReplicas (no replica3 references for a 2-replica deploy).
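
As an illustration of the bound check on numReplicas, the sketch below emits one Bash section per replica and confirms that a 2-replica deploy never mentions replica3. The function name `replicaChecks` is hypothetical and the emitted section is deliberately cut down to the row-count assertion; this is not the code in providers/postgresql/scripts.go.

```go
package main

import (
	"fmt"
	"strings"
)

// replicaChecks emits one Bash section per replica, numbered 1..numReplicas.
// Hypothetical helper: only the row-count assertion is shown.
func replicaChecks(numReplicas int) string {
	var b strings.Builder
	for n := 1; n <= numReplicas; n++ {
		fmt.Fprintf(&b, "echo \"# Testing replica %d\"\n", n)
		fmt.Fprintf(&b, "REPLICA_RECS=$(./replica%d/use -Atqc 'SELECT count(*) FROM dbdeployer_test_replication')\n", n)
		fmt.Fprintf(&b, "ok_equal \"$REPLICA_RECS\" \"$PRIMARY_RECS\" \"replica %d row count matches primary\"\n", n)
	}
	return b.String()
}

func main() {
	script := replicaChecks(2)
	fmt.Print(script)
	// The bound check: a 2-replica deploy must not reference a third replica.
	fmt.Println("mentions replica3:", strings.Contains(script, "replica3"))
}
```

Looping from 1 to numReplicas inclusive keeps the section numbers aligned with the ./replicaN/use directories that the PG topology creates.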

Reference

Sample script in the issue body was provided by Ronald Bradford (issue author). Implementation follows it closely, adapted to the actual PG provider's per-sandbox use script layout (./primary/use, ./replicaN/use) rather than the MySQL-style ./m / ./n1 shortcuts that the PG topology doesn't create.

Test plan

  • go build ./... clean.
  • go test ./providers/postgresql/... passes, including TestGenerateTestReplicationScript and the existing tests.
  • Generated script bash -n syntax-checks cleanly.
  • Visually inspected the generated output for a 2-replica deploy — each replica section properly numbered, sandbox dir interpolated, LSN/recovery/row-count assertions all present.

Note on timing flakiness

PostgreSQL streaming replication is asynchronous; a one-second sleep between primary writes and replica reads is typically enough but can flake on a busy system. The reference script Ronald provided takes the same approach with sleep 0.5. A more robust variant would poll with a timeout until each replica's LSN matches the primary's; happy to fold that in as a follow-up if this turns out flaky in practice.

Summary by CodeRabbit

  • New Features

    • PostgreSQL replication topologies now automatically generate a test script that validates streaming replication health by verifying each replica's recovery status, WAL LSN synchronization with the primary, and data consistency.
  • Tests

    • Added comprehensive test coverage for the replication validation script generator.

Closes #120.

PostgreSQL replication deployments now get a `test_replication` script
in the topology directory, alongside the existing `check_replication`
and `check_recovery` scripts. It mirrors the MySQL
`test_replication` template (sandbox/templates/replication/test_replication.gotxt)
but uses PostgreSQL primitives:

- pg_current_wal_lsn() on the primary to capture WAL position
- pg_last_wal_replay_lsn() on each replica to compare
- pg_is_in_recovery() to confirm each replica is actually a standby

Flow: create a test table on the primary, insert 20 rows, capture LSN
+ row count, then for each replica assert recovery mode + LSN match +
row count match. Reports TAP-style ok/not ok per assertion and exits 1
on any failure.

Generated via a new `GenerateTestReplicationScript(opts, numReplicas)`
in providers/postgresql/scripts.go, wired into deployReplicationNonMySQL
in cmd/replication.go. Sample reference script was provided by Ronald
Bradford in the issue.
coderabbitai Bot commented May 14, 2026

Walkthrough

This PR adds a new GenerateTestReplicationScript function that generates a Bash script to validate PostgreSQL streaming replication by verifying recovery state, WAL LSN synchronization, and row counts across primary and replica instances, and integrates it into the replication deployment command.

Changes

PostgreSQL Replication Test Script

| Layer / File(s) | Summary |
| --- | --- |
| Replication test script generator (providers/postgresql/scripts.go, providers/postgresql/postgresql_test.go) | GenerateTestReplicationScript constructs a Bash script that creates a test table on the primary, captures WAL LSN and row count, then iterates over replicas checking recovery state, LSN match, and row-count match. Tests validate script content includes required environment setup and correct replica references. |
| Command integration for topology deployment (cmd/replication.go) | The deployReplicationNonMySQL command now generates and installs the test_replication script for PostgreSQL replication topologies by creating ScriptOptions and writing the generated script to the topology directory. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

  • ProxySQL/dbdeployer#54: Both PRs extend PostgreSQL replication capabilities by modifying cmd/replication.go and adding script generation in providers/postgresql/scripts.go.

Poem

🐰 A rabbit built a script so fine,
To test if primary and replicas align!
LSN and rows must match just so,
PostgreSQL replication verified—go, go, go! 🗄️

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly and accurately describes the main change: adding a test_replication script for PostgreSQL replication topology. It is concise, clear, and specific. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


gemini-code-assist Bot left a comment

Code Review

This pull request introduces a test_replication script for PostgreSQL sandboxes, enabling automated verification of replication state and data consistency. The implementation includes the script generation logic, integration into the deployment command, and unit tests. Feedback suggests refactoring cmd/replication.go to reuse existing configuration options and improving the reliability of the replication check by using PostgreSQL's LSN comparison operators instead of string equality to account for background WAL activity.

Comment thread cmd/replication.go
Comment on lines +157 to +163

	testReplOpts := postgresql.ScriptOptions{
		SandboxDir: topologyDir,
		BinDir:     binDir,
		LibDir:     libDir,
		Port:       primaryPort,
	}
	testReplScript := postgresql.GenerateTestReplicationScript(testReplOpts, len(replicaPorts))
medium

The testReplOpts variable is redundant. You can reuse the existing scriptOpts (defined on line 141) by setting its SandboxDir field. This makes the code cleaner and avoids duplicating the configuration for BinDir, LibDir, and Port.

	scriptOpts.SandboxDir = topologyDir
	testReplScript := postgresql.GenerateTestReplicationScript(scriptOpts, len(replicaPorts))

Comment thread providers/postgresql/scripts.go

	REPLICA_RECS=$(./replica%d/use -Atqc 'SELECT count(*) FROM dbdeployer_test_replication')
	REPLICA_IS_REPLICA=$(./replica%d/use -Atqc 'SELECT pg_is_in_recovery()')
	ok_equal "$REPLICA_IS_REPLICA" "t" "replica %d is in recovery mode"
	ok_equal "$REPLICA_LSN" "$PRIMARY_LSN" "replica %d LSN matches primary"
medium

Comparing LSNs as strings for exact equality is fragile in PostgreSQL. Even in an idle sandbox, background processes (like the checkpointer or autovacuum) can advance the WAL LSN on the primary between the time the inserts finish and the LSN is captured, or during the sleep period. A more robust approach would be to use PostgreSQL's LSN comparison operator to ensure the replica has replayed at least up to the primary's LSN: SELECT pg_last_wal_replay_lsn() >= '$PRIMARY_LSN'::pg_lsn. This would return 't' if the replica is caught up or ahead, avoiding false negatives.
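
To show why ordering matters more than string equality here, the sketch below parses the textual pg_lsn form (two hexadecimal numbers separated by a slash) into a 64-bit position and compares numerically. It is a client-side illustration only; the suggestion above does the comparison inside PostgreSQL with the pg_lsn operator. The names `parseLSN` and `caughtUp` are hypothetical and not part of the PR.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLSN converts the textual pg_lsn form "hi/lo" (two hexadecimal
// numbers) into a single 64-bit WAL position.
func parseLSN(s string) (uint64, error) {
	hi, lo, ok := strings.Cut(strings.TrimSpace(s), "/")
	if !ok {
		return 0, fmt.Errorf("invalid LSN %q: missing '/'", s)
	}
	h, err := strconv.ParseUint(hi, 16, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid LSN %q: %w", s, err)
	}
	l, err := strconv.ParseUint(lo, 16, 32)
	if err != nil {
		return 0, fmt.Errorf("invalid LSN %q: %w", s, err)
	}
	return h<<32 | l, nil
}

// caughtUp reports whether the replica has replayed at least up to the
// position captured on the primary.
func caughtUp(replicaLSN, primaryLSN string) (bool, error) {
	r, err := parseLSN(replicaLSN)
	if err != nil {
		return false, err
	}
	p, err := parseLSN(primaryLSN)
	if err != nil {
		return false, err
	}
	return r >= p, nil
}

func main() {
	primary := "0/3000148"
	replica := "0/30001B8" // replica has replayed slightly past the captured LSN
	ok, err := caughtUp(replica, primary)
	// String equality says "mismatch" even though the replica is caught up.
	fmt.Println(replica == primary, ok, err)
}
```

Plain string ordering would not be a safe substitute either: "0/A000000" sorts after "0/10000000" lexically even though it is the smaller position.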


coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@providers/postgresql/scripts.go`:
- Around line 126-139: The test currently does a fixed "sleep 1" and then strict
string equality between REPLICA_LSN and PRIMARY_LSN (variables used in the
replica loop and the ok_equal assertions), which is flaky; replace that with a
polling loop per replica inside the numReplicas loop that repeatedly queries
pg_last_wal_replay_lsn() (via ./replica%d/use) and checks whether the replay LSN
is >= PRIMARY_LSN (or use pg_lsn comparison/CAST) until either the condition is
true or a short timeout (e.g., several seconds) elapses, then assert
success/failure; remove the fixed sleep and change the ok_equal LSN assertion to
assert that the poll succeeded (and keep the row-count and recovery-mode checks
as before).

📥 Commits

Reviewing files that changed from the base of the PR and between b738c7d and fb4ffbd.

📒 Files selected for processing (3)
  • cmd/replication.go
  • providers/postgresql/postgresql_test.go
  • providers/postgresql/scripts.go

Comment on lines +126 to +139

	# Allow a brief window for streaming replication to catch up
	sleep 1

	`, opts.SandboxDir))

		for n := 1; n <= numReplicas; n++ {
			b.WriteString(fmt.Sprintf(`echo "# Testing replica %d"
	if [ -x ./replica%d/use ]; then
	    REPLICA_LSN=$(./replica%d/use -Atqc 'SELECT pg_last_wal_replay_lsn()')
	    REPLICA_RECS=$(./replica%d/use -Atqc 'SELECT count(*) FROM dbdeployer_test_replication')
	    REPLICA_IS_REPLICA=$(./replica%d/use -Atqc 'SELECT pg_is_in_recovery()')
	    ok_equal "$REPLICA_IS_REPLICA" "t" "replica %d is in recovery mode"
	    ok_equal "$REPLICA_LSN" "$PRIMARY_LSN" "replica %d LSN matches primary"
	    ok_equal "$REPLICA_RECS" "$PRIMARY_RECS" "replica %d row count matches primary"
⚠️ Potential issue | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🌐 Web query:

PostgreSQL standby validation best practice: should replication catch-up checks use pg_last_wal_replay_lsn() >= ''::pg_lsn with polling timeout instead of fixed sleep + strict equality?

💡 Result:

Yes: best practice is not to rely on fixed sleeps plus strict equality checks. Prefer waiting semantics (deadline/polling with timeout) against LSN replay progress, typically using a “wait until LSN is reached” mechanism; if you must poll, use a >= comparison rather than ==.

Rationale (grounded in PostgreSQL behavior and docs):

1) Replay progresses in steps and may temporarily not exactly match your target LSN. Using a threshold check like pg_last_wal_replay_lsn >= target is robust against minor timing differences, and polling with a timeout avoids both indefinite hangs and coarse fixed sleeps. The WAL replay position on the standby is specifically exposed via pg_last_wal_replay_lsn as the “most recent WAL location replayed” by a streaming standby [1].

2) PostgreSQL provides a built-in SQL facility designed for this exact pattern: WAIT FOR, which “waits until the specified lsn is reached” and supports timeout. It also clarifies standby modes: standby_replay waits for replay (applied to the database) which matches your use case (“catch-up” before validation) [2]. On timeout it errors unless NO_THROW is set [2].

3) Replication catch-up is about lags and states rather than instantaneous equality. The documentation for warm standby explains the system transitions from “catchup” until the lag between standby and primary reaches zero for the first time, after which it becomes real-time streaming; this naturally implies using lag/position thresholds and waits rather than strict equality at an arbitrary sampling time [3].

Concrete recommendation:

- If you can, use WAIT FOR with standby_replay and a TIMEOUT rather than custom loops. Example (conceptual): store the target LSN on the primary client side, then on the standby: WAIT FOR <target_lsn> IN standby_replay TIMEOUT '' (syntax/interval depends on your PostgreSQL version/client usage) [2].
- If you cannot use WAIT FOR, implement polling on the standby replay LSN with a timeout and a >= comparison:
  - compute target_lsn (from the primary, e.g., primary current WAL LSN)
  - loop: SELECT pg_last_wal_replay_lsn; succeed when replay_lsn >= target_lsn
  - abort when deadline exceeded

  This aligns with how pg_last_wal_replay_lsn represents “replayed” progress [1] and with “wait until reached” semantics [2].

So to your exact question: should you use pg_last_wal_replay_lsn >= ''::pg_lsn with polling timeout instead of fixed sleep + strict equality? Yes: >= + timeout polling (or ideally WAIT FOR standby_replay + TIMEOUT) is the safer, more correct best practice than == with fixed sleep [2][1].

🏁 Script executed:

# First, check if the file exists and examine its structure
head -20 providers/postgresql/scripts.go

Repository: ProxySQL/dbdeployer

Length of output: 363


🏁 Script executed:

# Check for the GenerateTestReplicationScript function and the specific code section
sed -n '64,150p' providers/postgresql/scripts.go | head -100

Repository: ProxySQL/dbdeployer

Length of output: 3266


🏁 Script executed:

# Verify copyright header requirement for .go files per guidelines
head -1 providers/postgresql/scripts.go

Repository: ProxySQL/dbdeployer

Length of output: 82


Use polling with >= comparison instead of fixed sleep plus strict LSN equality to avoid flaky replica validation.

A fixed sleep 1 followed by strict string equality on the LSN (REPLICA_LSN == PRIMARY_LSN) can fail spuriously when the replica has replayed past the captured LSN or is briefly lagging behind it. PostgreSQL replication best practice is to poll with a timeout using a pg_last_wal_replay_lsn() >= target_lsn comparison, or use WAIT FOR with timeout if available. This ensures the test tolerates normal replication timing variations.
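
One possible shape for that change, sketched as a Go helper that emits the per-replica Bash fragment. The function name `pollReplicaSnippet`, the 0.2-second poll interval, and the CAUGHT_UP / DEADLINE variable names are assumptions for illustration; the fragment relies on the ./replicaN/use wrapper, the ok_equal helper, and the PRIMARY_LSN variable that the generated script already uses.

```go
package main

import "fmt"

// pollReplicaSnippet returns a Bash fragment that waits until replica n has
// replayed at least $PRIMARY_LSN, giving up after timeoutSecs seconds.
// Hypothetical helper: names and the poll interval are illustrative only.
func pollReplicaSnippet(n, timeoutSecs int) string {
	return fmt.Sprintf(`CAUGHT_UP=f
DEADLINE=$((SECONDS + %d))
while [ "$SECONDS" -lt "$DEADLINE" ]; do
    CAUGHT_UP=$(./replica%d/use -Atqc "SELECT pg_last_wal_replay_lsn() >= '$PRIMARY_LSN'::pg_lsn")
    [ "$CAUGHT_UP" = "t" ] && break
    sleep 0.2
done
ok_equal "$CAUGHT_UP" "t" "replica %d replayed up to the primary LSN"
`, timeoutSecs, n, n)
}

func main() {
	// Print the fragment for replica 1 with a 10-second timeout.
	fmt.Print(pollReplicaSnippet(1, 10))
}
```

The comparison runs inside PostgreSQL, so the script only has to check for `t`. If the query fails or returns NULL (for example because the node is not a standby), CAUGHT_UP never becomes `t` and the assertion fails once the deadline passes.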


Development

Successfully merging this pull request may close these issues.

test_replication for PostgreSQL
