
Multistore Data Architecture

Andy Theuninck edited this page Oct 9, 2015 · 6 revisions

This page includes brainstorming about how best to structure data across multiple stores. The single guiding principle here is probably that distributed writes are really, really hard.

Idea One: Using nothing but SQL

In this structure, writes simply are not distributed. One database is designated the master and all writes are sent to that database. As many slave nodes as are necessary may be added at any location to handle read queries. At least one slave node per location is necessary to avoid complete outages during cross-site connectivity failure.
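As a concrete illustration of the read/write split this requires, routing could live in one thin helper that the rest of the application calls. The DSNs and function below are hypothetical, not part of any existing codebase:

```python
# Sketch of application-level read/write routing: one master for all
# writes, any local replica for reads. All names here are invented.
import random

MASTER_DSN = "mysql://master.example.com/core"   # all writes go here
REPLICA_DSNS = [                                 # local read-only slaves
    "mysql://replica1.store.local/core",
    "mysql://replica2.store.local/core",
]

def pick_dsn(is_write: bool) -> str:
    """Writes always target the master; reads use any local replica."""
    if is_write:
        return MASTER_DSN
    return random.choice(REPLICA_DSNS)
```

A helper like this is also where the "throw everything at the master" fallback would live: during a partition, reads could keep using local replicas while writes fail fast.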

Pros:

  • ACID. Data cannot diverge. The master database is always correct.
  • No new services to learn and maintain.
  • Easy to understand and reason about.

Cons:

  • Not partition tolerant. Losing connection to the master database means no writes.
  • Developers must consciously choose read and write database connections at the application level.
    • Throwing everything at the master database would work, albeit with some performance penalty.
    • Microservices in front of tables/entities could shield developers from this decision-making, but such services don't exist currently and would take a while to build.

Idea Two: Message Queue

In this structure, each store has its own read/write database. Write operations are written to the local SQL database and then published to a message queue. Other stores subscribe to the message queue and on receiving a message perform the corresponding write on their own database. A GUID/UUID column would be added to every table to ensure the publisher and subscriber(s) are manipulating the same record.
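A minimal sketch of this write-then-publish flow, using an in-memory queue as a stand-in for a real message broker; the table name, column names, and helper functions are invented for illustration:

```python
# Each store writes locally, tags the record with a UUID, and publishes
# the write so other stores can replay it against their own database.
import queue
import uuid

message_queue = queue.Queue()   # stand-in for a real message broker
local_db = {}                   # stand-in for this store's SQL database

def local_write(table, record):
    """Write locally, tag with a UUID, then publish for other stores."""
    record = dict(record, uuid=str(uuid.uuid4()))
    local_db.setdefault(table, []).append(record)
    message_queue.put({"table": table, "record": record})
    return record["uuid"]

def apply_remote(remote_db, message):
    """Subscriber side: replay the write against another store's database."""
    remote_db.setdefault(message["table"], []).append(message["record"])

# One write at store A propagates to store B via the queue.
uid = local_write("transactions", {"upc": "0001234", "total": 4.99})
store_b = {}
apply_remote(store_b, message_queue.get())
```

Because the UUID is assigned once by the publisher and carried in the message, both stores end up holding the same identifier for the same logical record.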

Pros:

  • Partition tolerant. Losing connection to the queue does not inhibit local operations.
  • Auditable by GUID/UUID. Discrepancies can exist but are easily discovered.
  • Probably more transparent to developers. Publishing could happen automatically in many cases.

Cons:

  • No concurrency controls. Near-simultaneous writes from multiple locations have unpredictable results.
    • INSERT-only tables would be immune to this effect. Transaction-logging tables should definitely carry that restriction.
    • Hard to say what the real-world impact is; my hunch is it's fairly minor.
  • No canonically correct data. Discrepancies must be resolved manually.
  • The message queue itself and/or subscriber services introduce additional failure points.
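The auditability point above can be made concrete: since every record carries a UUID, a periodic audit job could diff the UUID sets two stores hold for a given table. This is a hypothetical sketch, with shortened UUIDs for readability:

```python
# UUID-based audit: records that failed to propagate show up as UUIDs
# present in one store's table but missing from the other's.
def audit(table, store_a_uuids, store_b_uuids):
    """Return the UUIDs present in one store but missing from the other."""
    a, b = set(store_a_uuids), set(store_b_uuids)
    return {"missing_from_b": a - b, "missing_from_a": b - a}

report = audit(
    "transactions",
    ["c1d2", "e3f4", "a5b6"],   # UUIDs at store A (shortened for clarity)
    ["c1d2", "a5b6"],           # store B never received e3f4
)
```

A discrepancy report like this doesn't say which copy is correct, only that the stores disagree; resolution would still be manual, as noted above.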

Dismissed Ideas

The following options have already been dismissed:

  • Multi-master SQL replication
    • Scaling with integer primary keys sucks. Unless you know the maximum number of nodes from the outset, adding a new node means redoing the starting points and offsets for every table.
    • Circular replication fails badly with 3+ nodes. The loss of any single node breaks the circle and replication among remaining nodes no longer works correctly.
  • Using something other than SQL as the primary data store
    • Far too many unknowns with performance and failure conditions
    • Too big of a refactor to get there
  • Publishing to a permanent distributed data store and syncing back to SQL
    • Using a message queue simplifies the subscriber end. It can be event driven rather than having a periodic synchronization process or a separate message from the publisher to the subscribers alerting them to new data in the distributed store.
    • If a distributed data store turns out to be useful in and of itself (i.e., supports certain kinds of queries better than SQL), the data store could be another subscriber to the queue.
  • Manual synchronization - i.e., the application itself reads and writes to other stores and/or sends other stores various messages via web service calls
    • Requires the most developer overhead by far to keep track of various ways writes are distributed
    • Still has concurrency issues. The longer the delay before a write is propagated the more likely it will conflict with another operation.
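For reference, the integer-primary-key problem from the multi-master bullet above can be shown numerically, assuming MySQL-style auto_increment_increment/auto_increment_offset settings:

```python
# With N nodes, node k generates keys k, k+N, k+2N, ... so the increment
# must be fixed at N up front. Adding an (N+1)th node without raising the
# increment on every existing node produces key collisions.
def keys_for_node(offset, increment, count):
    """First `count` auto-increment values a node would generate."""
    return [offset + i * increment for i in range(count)]

# Three nodes configured for exactly three nodes: no collisions.
three_nodes = [keys_for_node(k, 3, 4) for k in (1, 2, 3)]
# Naively adding a fourth node with the same increment collides with node 1.
fourth_node = keys_for_node(4, 3, 4)
```

Avoiding the collision means changing the increment (and effectively the key layout) on every node, which is exactly the "redoing the starting points and offsets for every table" cost described above.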