Persistence of synced data #865

@samwillis

Description

A common request, and something we want to develop in time, is a persistence layer for TanStack DB. This would allow users to persist synced data onto the user’s device, providing faster startup and potentially offline support.

This issue is intended as an opening to the discussion: laying out the problem space and a few initial ideas. We will follow up with a more detailed plan once we’ve done some research and experiments.

See also the older issue on the wider topic of "offline support": #82

Requirements

  1. A pluggable persistence API that allows multiple alternative backends, such as IndexedDB or OPFS in the browser, or SQLite in a React Native app. OPFS + SQLite-WASM is now a real option for high-performance browser persistence, but comes with a different set of constraints to IDB. (A rough adapter sketch follows this list.)
  2. Support for larger-than-memory data, with paging in/out of the persistence layer depending on the queries being executed.
  3. In the browser we need to safely handle multiple tabs/windows syncing to the same persistence layer.
  4. Access to the persisted data should be possible while the app is offline.
  5. Schema evolution - persisted stores need a version number and upgrade path. We should assume schemas (and internal encodings) will evolve over time and provide hooks to migrate, or to wipe/resync safely when migration isn’t possible.
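
To make the pluggable-API requirement concrete, here is a rough TypeScript sketch of what an adapter interface could look like. None of these names exist in TanStack DB today; they are purely illustrative, and the real design would need to account for paging, multi-tab coordination and versioning as discussed below.

```ts
// Purely illustrative shape for a pluggable persistence adapter.
// None of these names exist in TanStack DB today.

interface PersistedRow<T> {
  key: string
  value: T
}

interface PersistenceAdapter<T = unknown> {
  /** Persisted format version, used for migrations (requirement 5). */
  readonly version: number

  /** Open (and create or upgrade, if needed) the underlying store for a collection. */
  open(collectionId: string): Promise<void>

  /** Load rows for a coarse key range the backend can serve natively (requirement 2). */
  loadRange(range?: { start?: string; end?: string; limit?: number }): Promise<PersistedRow<T>[]>

  /** Apply a batch of upserts/deletes atomically. */
  applyChanges(
    changes: Array<
      | { type: 'upsert'; key: string; value: T }
      | { type: 'delete'; key: string }
    >,
  ): Promise<void>

  /** Wipe everything, e.g. when a migration is not possible. */
  clear(): Promise<void>

  close(): Promise<void>
}
```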

Relationship to offline transactions

We already have an early offline mutation persister in @tanstack/offline-transactions:
https://github.com/TanStack/db/tree/main/packages/offline-transactions

This is intentionally treated as a separate problem from "persisting synced data". That said, once we do have persisted collections, queries will need to compose:

  • state loaded from the persisted store plus
  • any pending offline mutations from the offline transaction log

So persistence can’t assume “disk == canonical local view”. The in-memory view is: persisted base + pending local delta, and that delta may exist even if the main persistence layer is not enabled for a given collection. This distinction is important to keep in mind in the design.
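
As a purely hypothetical sketch of that composition (none of these names are real APIs), the view a query sees is the persisted base with the pending delta overlaid on top:

```ts
// Hypothetical sketch: the collection's in-memory view is the persisted base
// with any pending offline mutations overlaid on top. All names are placeholders.

type Row = { id: string; [k: string]: unknown }

type PendingMutation =
  | { type: 'insert' | 'update'; id: string; value: Row }
  | { type: 'delete'; id: string }

function materializeView(
  persistedBase: Map<string, Row>,
  pendingDelta: PendingMutation[],
): Map<string, Row> {
  const view = new Map(persistedBase)
  for (const mutation of pendingDelta) {
    if (mutation.type === 'delete') view.delete(mutation.id)
    else view.set(mutation.id, mutation.value)
  }
  return view
}
```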

Issues with larger-than-memory data and paging

A core motivation for persistence is not only to provide offline support, but also to enable very large datasets that are persisted to a non-memory-backed store. We would then need to support paging in/out of the persistence layer depending on the queries being executed.

There is significant complexity in implementing this, and it needs to be carefully considered in the initial design. We likely need some form of indexing that enables fetching only the rows or pages that are needed. There are two strategies to consider:

  1. Row-level paging.
    We can use the facilities of the underlying persistence layer to handle paging. Both IndexedDB and SQLite can support row-level filtering and pagination. We can use this to only load the rows that are needed.

    It’s worth noting that we likely can’t just naïvely forward predicates to the persistence layer: it may evaluate them differently from the DB query engine. We would therefore need to forward a coarser predicate, over-fetch, and still filter in memory (see the sketch after this list).

  2. Page-level paging.
    Rather than storing DB rows as rows in the underlying persistence layer, we could construct pages and store them as rows/blobs in the persistence layer. This can be more efficient as we are loading much larger chunks of data at once, and aligns with how many durable stores handle eviction + range scans. (jlongster.com)

    It does, however, require a more complex implementation. We need to manage those pages, remove dead rows, and vacuum/compact pages when they are no longer needed. Local-first clients and browser-SQLite VFS projects are useful references for this kind of design. (doc.replicache.dev)
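
To illustrate the over-fetch point in strategy 1, here is a hedged sketch against IndexedDB: only a coarse, index-friendly key range is pushed down to the store, and the exact predicate from the query engine is re-applied in memory. The function and parameter names are hypothetical.

```ts
// Hypothetical row-level paging: push a coarse, index-friendly bound to IndexedDB,
// then re-apply the exact query predicate in memory.

function loadMatchingRows<T>(
  db: IDBDatabase,
  storeName: string,
  coarseRange: IDBKeyRange,             // what the store can evaluate natively
  exactPredicate: (row: T) => boolean,  // what the DB query engine actually means
): Promise<T[]> {
  return new Promise((resolve, reject) => {
    const results: T[] = []
    const request = db
      .transaction(storeName, 'readonly')
      .objectStore(storeName)
      .openCursor(coarseRange)

    request.onsuccess = () => {
      const cursor = request.result
      if (!cursor) return resolve(results)
      const row = cursor.value as T
      // In-memory filter over the over-fetched rows.
      if (exactPredicate(row)) results.push(row)
      cursor.continue()
    }
    request.onerror = () => reject(request.error)
  })
}
```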

Making a decision on this strategy is core to the design of the persistence layer, and likely requires testing and experimentation to get right.

Issues with multiple tabs/windows syncing to the same persistence layer

In the browser there would be a single persisted store that is shared by all tabs/windows. If each tab/window syncs into this shared store independently, we can run into consistency and correctness issues:

  • one tab may be further forward in time than another tab
  • two tabs could be syncing the same collection and both write to the store

The likely solution is to ensure that only one tab/window/process is responsible for syncing to the store (single-writer, multi-reader). There are two ways we could achieve this:

  • Use a leader election algorithm to determine which tab/window/process is responsible for syncing to the store.

    • We have some experience with leader elections from our work on PGlite, and understand this brings a lot of complexity.
    • The leader election can be achieved using the Web Locks API, but we then need to consider background tabs becoming throttled (which can happen aggressively on mobile devices). (greenvitriol.com) A minimal sketch follows this list.
  • Use a SharedWorker to run the sync process. This is conceptually a simpler solution: we know we only have one process syncing to the store, but it brings a few more problems to solve:

    • We would need a way to communicate between the main thread and the shared worker, particularly for query-driven sync. Predicates will need to be serialised and sent to the shared worker, and synced data needs to be pushed back to all listening tabs/windows.
    • Shared workers are not yet supported on Android Chrome. This isn’t necessarily a blocker, as most mobile browsers don’t have multiple visible tabs/windows at once. We can fall back to locking background tabs and only the foreground one syncing to the store. (MDN Web Docs)
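
A minimal sketch of the Web Locks approach, assuming a hypothetical startSyncingToStore() entry point: whichever tab acquires the lock becomes the single writer, the lock is released automatically when that tab closes, and a waiting tab then takes over.

```ts
// Minimal single-writer election with the Web Locks API.
// startSyncingToStore is a placeholder for the real sync entry point.

declare function startSyncingToStore(): Promise<void>

async function electSyncLeader(collectionId: string) {
  await navigator.locks.request(`tanstack-db-sync:${collectionId}`, async () => {
    // We hold the lock (and leadership) for as long as this callback's promise is pending.
    await startSyncingToStore()
    // Keep holding the lock until the tab/page is closed.
    await new Promise<never>(() => {})
  })
}
```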

Issues with schema evolution

Migrations are a core part of doing persistence safely. We need to consider whether this is in scope for the persistence layer, or whether it can be pushed back to the application. Possible solutions:

  • We support schema evolution (a sketch follows this list), and so would need to have:
    • a way to version the persisted store
    • a way to migrate the persisted store to the new schema
  • We provide hooks for the application to handle schema evolution, and it implements the migration itself on application startup.
  • The alternative is to truncate the local data and resync from the server on every schema change. This makes things much simpler, but at the cost of a potentially large resync for large applications.
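
A rough sketch of how the first two options could compose, with all names hypothetical: the store records a format version, the application registers step migrations, and anything without an upgrade path falls back to wipe-and-resync.

```ts
// Hypothetical startup flow: compare the persisted format version with the current
// one, apply registered step migrations where possible, otherwise wipe and resync.

interface VersionedStore {
  readVersion(): Promise<number>
  writeVersion(v: number): Promise<void>
  clear(): Promise<void> // wipe local data; the collection will resync from the server
}

type Migration = (store: VersionedStore) => Promise<void>

async function openPersistedStore(
  store: VersionedStore,
  currentVersion: number,
  migrations: Map<number, Migration>, // keyed by the version they migrate *from*
) {
  let version = await store.readVersion()
  while (version < currentVersion) {
    const migrate = migrations.get(version)
    if (!migrate) {
      // No upgrade path: truncate local data and let the collection resync.
      await store.clear()
      version = currentVersion
      break
    }
    await migrate(store)
    version += 1
  }
  await store.writeVersion(version)
}
```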

Next steps

We intend to work on persistence after v1 of DB. It’s a significant undertaking and something only a smaller portion of users need right now. This issue serves as a tracking issue for the design and development of the persistence layer, and so is a good place to discuss, prototype and experiment with ideas.

Alternative for now

Many of the collections support some level of persistence or caching themselves:

  • QueryCollection can use TanStack Query’s native persistQueryClient utility to persist the query client to localStorage or other storage backends. (tanstack.com) See the example after this list.
  • ElectricCollection can load synced data directly from the browser cache. Shape logs can be aggressively cached so you don’t go back to the server on subsequent page loads. (electric-sql.com)
  • RXDBCollection wraps RxDB which has its own persistence layer (including OPFS/IDB backends and multi-tab behaviour). (rxdb.info)
  • PowerSyncCollection uses SQLite for persistence in the browser. OPFS-backed SQLite is particularly relevant to this discussion. (powersync.com)
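
For example, using TanStack Query’s framework-agnostic persistence utilities (a minimal sketch; check the TanStack Query docs for the exact packages used in your setup):

```ts
// Persist the TanStack Query cache backing a QueryCollection to localStorage.
import { QueryClient } from '@tanstack/query-core'
import { persistQueryClient } from '@tanstack/query-persist-client-core'
import { createSyncStoragePersister } from '@tanstack/query-sync-storage-persister'

const queryClient = new QueryClient()

const persister = createSyncStoragePersister({
  storage: window.localStorage,
})

persistQueryClient({
  queryClient,
  persister,
  maxAge: 1000 * 60 * 60 * 24, // drop the cached state after 24 hours
})
```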

Research / related work

Useful pointers for people who want to dig in or share experience:

Implementations to look at

  • Replicache — local-first client DB with durable persistence, mutation logging, and a lot of practical experience around storage tradeoffs. (doc.replicache.dev)
  • Zero — query-driven sync with a client-side persistent cache; worth comparing how they reuse local data for future queries. (zero.rocicorp.dev)
  • RxDB — pluggable browser persistence with multiple backends including OPFS, plus cross-tab sync patterns. (rxdb.info)
  • PowerSync / SQLite-WASM on the web — current state of OPFS vs IDB, and the constraints of different SQLite VFS layers. (powersync.com)
  • absurd-sql / wa-sqlite / other SQLite VFS work — prior art on treating IndexedDB/OPFS as a block device, and the real-world performance implications. (powersync.com)
  • ElectricSQL client caching — shape-log-based persistence/caching and chunking strategies over append-only logs. (electric-sql.com)

Research / background

  • Modern KV / LSM-style storage and compaction (useful framing for log-structured paging, indexes, and vacuuming). (Medium)
  • Browser persistence constraints and IndexedDB performance notes — practical pitfalls, limits, and why some projects build custom VFS layers. (rxdb.info)
