jepsen-powersync

Testing PowerSync with Jepsen for Causal Consistency, Atomic transactions, and Strong Convergence.

PowerSync is a full-featured active/active sync service for PostgreSQL and local SQLite3 clients. It offers a rich API for developers to configure and define the sync behavior.

Our primary goal is to test the sync algorithm and its core implementation, and to develop best practices for

  • Causal Consistency
    • read your writes
    • monotonic reads and writes
    • writes follow reads
    • happens-before relationships
  • Atomic transactions
  • Strong Convergence

Operating under

  • normal environmental conditions
  • environmental failures
  • diverse user behavior
  • random property based conditions and behavior

Safety First

The initial implementations of the PowerSync client and backend connector take a safety-first bias:

  • stay as close as possible to direct SQLite3/PostgreSQL transactions
  • replicate at this transaction level
  • favor consistency and full client syncing over immediate performance
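
The examples below operate on a single max-write-wins table, mww. A sketch of its PowerSync schema, with column names inferred from the queries that follow (the repository's actual schema may differ):

// assumed PowerSync schema for the max-write-wins (mww) table;
// column names are inferred from the queries below
const schema = Schema([
  Table('mww', [
    Column.integer('k'),
    Column.integer('v'),
  ]),
]);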

Clients do straightforward SQL transactions:

await db.writeTransaction((tx) async {
  // SQLTransactions.readAll
  final select = await tx.getAll('SELECT k,v FROM mww ORDER BY k;');
  
  // SQLTransactions.writeSome
  final update = await tx.execute(
    "UPDATE mww SET v = ${kv.value} WHERE k = ${kv.key} RETURNING *;",
  );
});

backend replication is transaction based:

CrudTransaction? crudTransaction = await db.getNextCrudTransaction();
if (crudTransaction == null) return;

await pg.runTx((tx) async {
  for (final crudEntry in crudTransaction.crud) {
    // max write wins, so GREATEST() value of v
    final patchV = crudEntry.opData!['v'] as int;
    final patch = await tx.execute(
      'UPDATE mww SET v = GREATEST($patchV, mww.v) WHERE id = \'${crudEntry.id}\' RETURNING *',
    );
  }
});

// mark the transaction as complete once the upload has been committed
await crudTransaction.complete();

Progressive Test Enhancement

✔️ Single user, generic SQLite3 db, no PowerSync

  • tests the tests
  • demonstrates test fidelity, i.e. that the tests accurately represent the database
  • 🗸 as expected, tests show total availability with strict serializability

✔️ Multi user, generic SQLite3 shared db, no PowerSync

  • tests the tests
  • demonstrates test fidelity, i.e. that the tests accurately represent the database
  • 🗸 as expected, tests show total availability with strict serializability

✔️ Single user, PowerSync db, local only

  • expectation is total availability and strict serializability
  • SQLite3 is tight, and using the PowerSync APIs should reflect that
  • 🗸 as expected, tests show total availability with strict serializability

✔️ Single user, PowerSync db, with replication

  • expectation is total availability and strict serializability
  • SQLite3 is tight, and using the PowerSync APIs should reflect that
  • 🗸 as expected, tests show total availability with strict serializability

✔️ Multi user, PowerSync db, with replication, using the getCrudBatch() backend connector

  • expectation is
    • read committed (vs Causal Consistency)
    • non-atomic transactions with intermediate reads
    • strong convergence
  • 🗸 as expected, tests show read committed, non-atomic transactions with intermediate reads, and all replicated databases strongly converge

✔️ Multi user, PowerSync db, with replication, using the newly developed Causal getNextCrudTransaction() backend connector

  • expectation is
    • Atomic transactions
    • Causal Consistency
    • Strong Convergence
  • 🗸 as expected, tests show Atomic transactions with Causal Consistency, and all replicated databases strongly converge

✔️ Multi user, Active/Active PostgreSQL/SQLite3, with replication, using the newly developed Causal getNextCrudTransaction() backend connector

  • mix of clients, some PostgreSQL, some PowerSync
  • expectation is
    • Atomic transactions
    • Causal Consistency
    • Strong Convergence
  • 🗸 as expected, tests show Atomic transactions with Causal Consistency, and all replicated databases strongly converge

Clients

The client will be a simple Dart CLI PowerSync implementation.

Clients are expected to have total sticky availability:

  • throughout the session, each client uses the
    • same API
    • same connection
  • clients are expected to be totally available (liveness) unless explicitly stopped, killed, or paused

Observe and interact with the database:

  • PowerSyncDatabase driver
    • single db.writeTransaction() with multiple SQL statements
      • tx.getAll('SELECT')
      • tx.execute('UPDATE')
  • PostgreSQL driver
    • most used Dart pub.dev driver
    • single pg.runTx() with multiple SQL statements
      • tx.execute('SELECT')
      • tx.execute('UPDATE')
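
A sketch of the PostgreSQL client transaction, assuming package:postgres's runTx and the same kv key/value as the PowerSync writeTransaction example above:

// PostgreSQL client transaction (sketch): same reads/writes as the PowerSync client
await pg.runTx((tx) async {
  // SQLTransactions.readAll
  final select = await tx.execute('SELECT k,v FROM mww ORDER BY k;');

  // SQLTransactions.writeSome
  final update = await tx.execute(
    "UPDATE mww SET v = ${kv.value} WHERE k = ${kv.key} RETURNING *;",
  );
});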

The client will expose an endpoint for SQL transactions and PowerSyncDatabase API calls:

  • HTTP for Jepsen
  • Isolate ReceivePort for Dart Fuzzer
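
A minimal sketch of the HTTP endpoint, assuming JSON ops shaped like the SQL txn messages in the logs later in this README (serveHttp and the port are illustrative, not the repository's actual implementation):

import 'dart:convert';
import 'dart:io';

// accept a JSON op over HTTP, execute it, echo it back with type: ok
Future<void> serveHttp() async {
  final server = await HttpServer.bind(InternetAddress.anyIPv4, 8080);
  await for (final request in server) {
    final op = jsonDecode(await utf8.decoder.bind(request).join())
        as Map<String, dynamic>;
    // ... execute op['value'] as a db.writeTransaction() or PowerSync API call ...
    request.response
      ..headers.contentType = ContentType.json
      ..write(jsonEncode({...op, 'type': 'ok'}))
      ..close();
  }
}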

Tests can use a mix of clients:

  • active/active, replicate central PostgreSQL and local client activities
    • some clients read/write the PostgreSQL database
    • some clients read/write local SQLite3 databases
  • replicate central PostgreSQL activity to local clients
    • some clients read/write the PostgreSQL database
    • some clients only read local SQLite3 databases
  • replicate local clients to local clients
    • some clients only read the PostgreSQL database (to check for consistency)
    • some clients read/write local SQLite3 databases

No-Fault Environments

PowerSync tests 100% successfully in no-fault environments using a test matrix of

  • 5-10 clients
  • 10-50 transactions/second
  • active/active, simultaneous PostgreSQL and/or local SQLite3 client transactions
  • local SQLite3 client transactions
  • complex transactions that read 100 keys and write 4 keys in a single transaction
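
A sketch of one such complex transaction, reusing the writeTransaction pattern from above (the key and value ranges are illustrative):

import 'dart:math';

// complex transaction (sketch): read all 100 keys, write 4 random keys
final rng = Random();
await db.writeTransaction((tx) async {
  final reads = await tx.getAll('SELECT k,v FROM mww ORDER BY k;');
  for (var i = 0; i < 4; i++) {
    await tx.execute(
      'UPDATE mww SET v = ${rng.nextInt(10000)} WHERE k = ${rng.nextInt(100)} RETURNING *;',
    );
  }
});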

Faults

Local-first (LoFi) and distributed systems live in a rough-and-tumble environment.

Successful applications, i.e. applications that receive a meaningful amount or duration of use, will be exposed to faults. Reality is like that.

Faults are applied

  • to random clients
  • at random intervals
  • for random durations

Even during faults, we still expect

  • total sticky availability (unless the client has been explicitly stopped/paused/killed)
  • Atomic transactions
  • Causal Consistency
  • Strong Convergence

Disconnect / Connect

Use the PowerSyncDatabase API to repeatedly and randomly disconnect and connect clients to the sync service.

await db.disconnect();
...
await db.connect(connector: connector);

Orderly
  • repeatedly
    • wait a random interval
    • 1 to all clients are randomly disconnected
    • wait a random interval
    • connect disconnected clients
  • at the end of the test connect any disconnected clients
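
A minimal sketch of the orderly loop, assuming clients is the list of PowerSyncDatabase instances, connector their backend connector, testRunning the test's liveness flag, and rng the Random instance from the sketch above:

// orderly disconnect/connect nemesis (sketch)
while (testRunning) {
  await Future.delayed(Duration(seconds: 1 + rng.nextInt(10)));
  // disconnect 1 to all clients, chosen at random
  final disconnected = (clients.toList()..shuffle(rng))
      .take(1 + rng.nextInt(clients.length))
      .toList();
  for (final db in disconnected) {
    await db.disconnect();
  }
  await Future.delayed(Duration(seconds: 1 + rng.nextInt(10)));
  // reconnect every client that was disconnected
  for (final db in disconnected) {
    await db.connect(connector: connector);
  }
}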

Example of 3 clients being disconnected for ~1.6s:

2025-04-26 03:37:10,938{GMT}	INFO	[jepsen worker nemesis] jepsen.util: :nemesis	:info	:disconnect-orderly	{"n1" :disconnected, "n4" :disconnected, "n6" :disconnected}
...
2025-04-26 03:37:12,517{GMT}	INFO	[jepsen worker nemesis] jepsen.util: :nemesis	:info	:connect-orderly	{"n1" :connected, "n4" :connected, "n6" :connected}

Random
  • repeatedly
    • wait a random interval
    • 1 to all clients are randomly disconnected
    • wait a random interval
    • 1 to all clients are randomly connected
  • at the end of the test connect any disconnected clients

Example of 3 clients being disconnected, waiting ~2.5s, then connecting 2 clients:

2025-04-26 03:37:14,623{GMT}	INFO	[jepsen worker nemesis] jepsen.util: :nemesis	:info	:disconnect-random	{"n1" :disconnected, "n2" :disconnected, "n3" :disconnected}
2025-04-26 03:37:17,193{GMT}	INFO	[jepsen worker nemesis] jepsen.util: :nemesis	:info	:connect-random	{"n1" :connected, "n4" :connected}

Upload Queue Count
  • repeatedly
    • wait a random interval
    • query the upload queue count
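
The query itself is a single PowerSyncDatabase API call:

// query and log a client's upload queue count
final stats = await db.getUploadQueueStats();
print('upload-queue-count: ${stats.count}');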

Example of observing differing queue counts for disconnected/connected clients:

2025-04-26 03:38:02,041{GMT}	INFO	[jepsen worker nemesis] jepsen.util: :nemesis	:info	:upload-queue-count	{"n2" 0, "n3" 0, "n4" 0, "n5" 0, "n6" 0}
...
2025-04-26 03:38:02,362{GMT}	INFO	[jepsen worker nemesis] jepsen.util: :nemesis	:info	:disconnect-orderly	{"n5" :disconnected, "n6" :disconnected}
...
2025-04-26 03:38:07,807{GMT}	INFO	[jepsen worker nemesis] jepsen.util: :nemesis	:info	:upload-queue-count	{"n2" 0, "n3" 0, "n4" 0, "n5" 28, "n6" 28}

Impact on Consistency/Correctness

The update to powersync: 1.12.3, PR #267, shows significant improvements when fuzzing disconnect/connect.

See issue #253 comment.

The new release eliminates all occurrences of

  • multiple calls to BackendConnector.uploadData() for the same transaction id
  • SyncStatus.lastSyncedAt goes backwards in time
  • reads that appear to go back in time, i.e. reads of previous versions
  • SELECT * reads that return [], i.e. an empty database

Although less frequent than before, and requiring

  • more demanding transaction rates
  • longer total run times
  • more occurrences of disconnect/connect

it is still possible to fuzz into a state where

  • UploadQueueStats.count appears to be stuck for a single client
  • which leads to incomplete replication for that client and divergent final reads

The bug is hard enough to reproduce, due to its lower frequency and longer run times, that GitHub Actions are a more appropriate test bed.

As the error behavior usually (always?) presents as a single stuck transaction, it is theorized that

  • replication occasionally gets stuck
  • sometimes the test ends during this stuck phase
  • sometimes the test is ongoing and replication is restarted by further transactions

Stop / Start

Use the PowerSyncDatabase API to repeatedly and randomly stop and start clients.

await db.close();
isolate.kill(); // isolate is the client's Isolate handle
...
// note: create db reusing existing SQLite3 files
isolate = await Isolate.spawn(...);
db = PowerSyncDatabase(...);
await db.initialize();
await db.connect(connector: connector);
await db.waitForFirstSync();

  • repeatedly
    • wait a random interval
    • 1 to all clients are randomly closed then stopped
    • wait a random interval
    • clients that were closed/stopped are restarted reusing existing SQLite3 files
  • at the end of the test restart any clients that are currently closed/stopped reusing existing SQLite3 files

Sample Jepsen log

:nemesis	:info	:stop-nodes	nil
:nemesis	:info	:stop-nodes	{"n10" :stopped, "n3" :stopped, "n7" :stopped}
...
:nemesis	:info	:start-nodes	nil
:nemesis	:info	:start-nodes	{"n1" :already-running, "n10" :started, "n2" :already-running, "n3" :started, "n4" :already-running, "n5" :already-running, "n6" :already-running, "n7" :started, "n8" :already-running, "n9" :already-running}

Impact on Consistency/Correctness

In a small fraction, ~0.2%, of the tests, the UploadQueueStats.count is stuck at the end of the test, preventing Strong Convergence.

Similar to disconnect/connect, see above.

See issue #253 comment for similar behavior but with stop/start.


Network Partition

Use iptables to partition client hosts from the PowerSync sync service and PostgreSQL hosts.

// host is the PowerSync sync service host or the PostgreSQL host

// inbound
await Process.run('/usr/sbin/iptables', ['-A', 'INPUT', '-s', host, '-j', 'DROP', '-w']);

// outbound
await Process.run('/usr/sbin/iptables', ['-A', 'OUTPUT', '-d', host, '-j', 'DROP', '-w']);

// bidirectional
await Process.run('/usr/sbin/iptables', ['-A', 'INPUT', '-s', host, '-j', 'DROP', '-w']);
await Process.run('/usr/sbin/iptables', ['-A', 'OUTPUT', '-d', host, '-j', 'DROP', '-w']);

// heal network: flush rules and delete user-defined chains
await Process.run('/usr/sbin/iptables', ['-F', '-w']);
await Process.run('/usr/sbin/iptables', ['-X', '-w']);

  • repeatedly
    • wait a random interval
    • all clients for powersync_fuzz, 1 to all clients for Jepsen, are randomly partitioned from the PowerSync sync service and PostgreSQL hosts
    • wait a random interval
    • all client networks are healed
  • at the end of the test ensure all client networks are healed

Example of the partitioning nemesis from powersync_fuzz.log:

$ grep nemesis powersync_fuzz.log 
[2025-05-08 02:58:38.540582] [main] [INFO] nemesis: partition: start listening to stream of partition messages
[2025-05-08 02:58:44.308471] [main] [INFO] nemesis: partition: starting: bidirectional
[2025-05-08 02:58:44.367180] [main] [INFO] nemesis: partition: current: bidirectional hosts: {powersync, pg-db}
[2025-05-08 02:58:46.957552] [main] [INFO] nemesis: partition: starting: none
[2025-05-08 02:58:47.031792] [main] [INFO] nemesis: partition: current: none hosts: {}
[2025-05-08 02:58:49.190407] [main] [INFO] nemesis: partition: starting: outbound
[2025-05-08 02:58:49.219263] [main] [INFO] nemesis: partition: current: outbound hosts: {powersync, pg-db}
[2025-05-08 02:58:55.021681] [main] [INFO] nemesis: partition: starting: none
[2025-05-08 02:58:55.128247] [main] [INFO] nemesis: partition: current: none hosts: {}
...
[2025-05-08 03:00:04.102570] [main] [INFO] nemesis: partition: starting: inbound
[2025-05-08 03:00:04.120517] [main] [INFO] nemesis: partition: current: inbound hosts: {powersync, pg-db}
...
[2025-05-08 03:00:18.591748] [main] [INFO] nemesis: partition: stop listening to stream of partition messages
[2025-05-08 03:00:20.257646] [main] [INFO] nemesis: partition: starting: none
[2025-05-08 03:00:20.282057] [main] [INFO] nemesis: partition: current: none hosts: {}

Impact on Consistency/Correctness

Client <SyncStatus error: null>

Unexpectedly, there are often no errors in the client logs

$ grep -P 'error: (?!null)' powersync_fuzz.log 
$

even when there are consistency errors.

Uploading Stops

Clients can end the test with a large number of transactions stuck in the upload queue, as reported by UploadQueueStats.count:

[2025-05-03 04:33:14.152256] [ps-8] [SEVERE] UploadQueueStats.count appears to be stuck at 120 after waiting for 11s
[2025-05-03 04:33:14.152278] [ps-8] [SEVERE] 	db.closed: false
[2025-05-03 04:33:14.152288] [ps-8] [SEVERE] 	db.connected: true
[2025-05-03 04:33:14.152299] [ps-8] [SEVERE] 	db.currentStatus: SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-05-03 04:32:12.685048, hasSynced: true, error: null>
[2025-05-03 04:33:14.152527] [ps-8] [SEVERE] 	db.getUploadQueueStats(): UploadQueueStats<count: 120 size: 2.34375kB>

preventing Strong Convergence in ~20% of the tests.

Clients appear to enter, and get stuck in, a SyncStatus<downloading: true> state after the partition is healed, and BackendConnector.uploadData() is never called.
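
A sketch of the kind of check behind the stuck-queue log lines above, assuming the count is re-polled after a wait and flagged if unchanged and non-zero (the 11s wait mirrors the log):

// flag an upload queue whose count is unchanged and non-zero after a wait (sketch)
final before = (await db.getUploadQueueStats()).count;
await Future.delayed(const Duration(seconds: 11));
final after = (await db.getUploadQueueStats()).count;
if (after != 0 && after == before) {
  print('UploadQueueStats.count appears to be stuck at $after');
}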

Replication Stops

Clients can end the test with divergent final reads, i.e. not Strongly Convergent, in ~10% of the tests.

# observe client 5 write, then read, then upload {0: 1965}  
[2025-05-03 04:33:02.952207] [ps-5] [FINE] SQL txn: response: {clientNum: 5, clientType: ps, f: txn, id: 197, type: ok, value: [{f: readAll, v: {0: 1651, ...}}, {f: writeSome, v: {0: 1965, ...}}]}
[2025-05-03 04:33:03.902696] [ps-5] [FINE] SQL txn: response: {clientNum: 5, clientType: ps, f: txn, id: 198, type: ok, value: [{f: readAll, v: {0: 1965, ...}}, {f: writeSome, v: {...}}]}
[2025-05-03 04:33:03.940824] [ps-5] [FINER] uploadData: call: 68 txn: 198 patch: {0: 1965} 

# observe write in PostgreSQL
2025-05-03 04:33:03.937 UTC [144] LOG:  statement: BEGIN ISOLATION LEVEL REPEATABLE READ;
...
2025-05-03 04:33:03.940 UTC [144] LOG:  execute s/811/p/811: UPDATE mww SET v = GREATEST(1965, mww.v) WHERE id = '0' RETURNING *
...
2025-05-03 04:33:03.941 UTC [144] LOG:  statement: COMMIT;

# but write is missing in most client final reads
[2025-05-03 04:33:45.722564] [main] [SEVERE] Divergent final reads!:
[2025-05-03 04:33:45.722652] [main] [SEVERE] pg: {0: 1965, ...}
[2025-05-03 04:33:45.722717] [main] [SEVERE] ps-1 {0: 1734, ...}
[2025-05-03 04:33:45.723038] [main] [SEVERE] ps-2 {0: 1386, ...}
[2025-05-03 04:33:45.723083] [main] [SEVERE] ps-3 {0: 1760, ...}
[2025-05-03 04:33:45.723125] [main] [SEVERE] ps-4 {0: 1932, ...}
[2025-05-03 04:33:45.723201] [main] [SEVERE] ps-8 {0: 1566, ...}
[2025-05-03 04:33:45.723260] [main] [SEVERE] ps-9 {0: 1313, ...}
[2025-05-03 04:33:45.722793] [main] [SEVERE] ps-10 {0: 1769, ...}

At some point, individual clients appear to go into, and remain in, a SyncStatus.downloading: true state, but there is no replication from the PowerSync sync service

[2025-05-03 04:33:05.960464] [ps-9] [FINEST] SyncStatus<connected: true connecting: false downloading: true uploading: false lastSyncedAt: 2025-05-03 04:31:57.783573, hasSynced: true, error: null>
...
[2025-05-03 04:33:15.299509] [ps-9] [FINE] database api: request: {clientNum: 9, f: api, id: 2, type: invoke, value: {f: downloadingWait, v: {}}}
[2025-05-03 04:33:15.299528] [ps-9] [FINE] database api: downloadingWait: waiting on SyncStatus.downloading: true...
...
[2025-05-03 04:33:44.328266] [ps-9] [FINE] database api: downloadingWait: waiting on SyncStatus.downloading: true...
[2025-05-03 04:33:45.329282] [ps-9] [WARNING] database api: downloadingWait: waited for SyncStatus.downloading: false 31 times every 1000ms
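
A sketch of the downloadingWait polling seen above, assuming db.currentStatus is checked once per 1000ms:

// poll SyncStatus.downloading once per second until it clears (sketch)
var waits = 0;
while (db.currentStatus.downloading) {
  waits++;
  await Future.delayed(const Duration(milliseconds: 1000));
}
print('downloadingWait: waited for SyncStatus.downloading: false $waits times every 1000ms');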

Client Pause

In powersync_fuzz, use

final Capability resumeCapability = isolate.pause(); // isolate is the client's Isolate handle
...
// ok to resume an unpaused client
isolate.resume(resumeCapability);

In Jepsen, use

$ grepkill stop powersync_http
...
# ok to cont an unpaused client
$ grepkill cont powersync_http

  • repeatedly
    • wait a random interval
    • 1 to all clients are randomly paused
    • wait a random interval
    • resume all clients
  • at the end of the test resume all clients

Impact on Consistency/Correctness

TODO


Client Kill

In powersync_fuzz, use

isolate.kill(); // isolate is the client's Isolate handle
...
// note: create db reusing existing SQLite3 files
isolate = await Isolate.spawn(...);
db = PowerSyncDatabase(...);
await db.initialize();
await db.connect(connector: connector);
await db.waitForFirstSync();

In Jepsen, use

$ grepkill kill powersync_http
...
$ ./powersync_http

  • repeatedly
    • wait a random interval
    • 1 to a majority of clients are randomly killed
    • wait a random interval
    • start clients that were killed reusing existing SQLite3 files
  • at the end of the test start any clients that are killed reusing existing SQLite3 files


Impact on Consistency/Correctness

TODO


GitHub Actions

There's a suite of GitHub Actions with an action for every type of fault.

Each action uses a test matrix for

  • number of clients
  • rate of transactions
  • fault characteristics
  • and other common configuration options

Oddly, GitHub Actions can fail

  • pulling Docker images from the GitHub Container Registry
  • building images
  • pausing in the middle of a test run and timing out

Ignorable failure status messages

  • "GitHub Actions has encountered an internal error when running your job."
  • "The action '5c-20tps-100s...' has timed out after 25 minutes."
  • "Process completed with exit code 255."

These failures are Microsoft resource allocation and infrastructure issues and are not related to the tests. When this happens, the action fails with an exit code of 255 and "no logs available" log file contents.


Not Testing

  • auth - using a permissive JWT
  • sync filtering - using SELECT *
  • Byzantine faults - natural faults only, not malicious behavior


Conflict Resolution

The maximum value written for a key wins:

  • SQLite3
    • MAX()
  • PostgreSQL
    • GREATEST()
    • repeatable read isolation
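
For illustration, a hedged sketch of how MAX() can express the same rule on the SQLite3 side (newV and k stand in for the incoming write):

// max write wins on SQLite3 (sketch): MAX() mirrors PostgreSQL's GREATEST()
await db.writeTransaction((tx) async {
  await tx.execute('UPDATE mww SET v = MAX($newV, v) WHERE k = $k RETURNING *;');
});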

The conflict/merge algorithm isn't important to the test. It just needs to be

  • consistently applied
  • consistently replicated

Development

Public GitHub repository

  • docs
  • samples/demos
  • actions that run tests in a CI/CD fashion

Development Logbook.

GitHub Actions.

Docker environment to run tests.

Non-Jepsen, Dart CLI fuzzer instructions.
