======================================== Multi-store database design
Working through scenarios and usages for shuttling different types of data amongst stores. Kind of thinking out loud.
======================================== Transactional data
Transaction data is generated at the lanes and in a few edge cases on store servers. However, by definition, transaction data generated at one store will not ever conflict with transaction data generated at another store.
- A store server periodically pushes transaction data to N other stores where N is between 1 and the total number of stores in the organization
- A store server periodically pulls transaction data from N other stores where N is between 1 and the total number of stores in the organization.
This could effectively be a mesh where each store server [eventually] contains all transaction data for all stores. Otherwise, the architecture could designate a specific master/HQ server that collects all transaction data from all stores. Push/pull probably does not matter all that much.
======================================== Operational data
Operational data is persistent and may be edited at an organization or store level. Multi-master SQL replication is likely not a good option beyond 2 store nodes (if even then). Key collisions are difficult to manage and cleanup after replication errors result in questionable synchronization.
Unlike transaction data, I don't think there's a single solution here. Different tables very likely merit different approaches depending how often they change and whether data needs to vary per store.
- In a given table, one record exists for each store.
- Minimal collisions. Changes made at one store should not conflict with changes made at another store
- Syncable. Each store is authoritative for its own records. If there is ever any doubt about the correct value for a given store, all stores can reload the value from that store server.
- Completely network-resilient. Losing communication with other stores is merely a nuisance and does not corrupt data or halt operations.
- Directly modifying the table in SQL does not add extra problems
- Generates large tables at scale with lots of duplication
- Updates are expensive. Each store must push a change to every other store (or each store must pull periodically from every other store).
- Consistency is a challenge. If two stores update a given logical object (same item, same cashier, etc), the object is simply incosistent whether intentionally or not. Allowing an "update all stores" mechanism breaks the no-collision model. Two stores issuing an "update all" request simultaneously could create difficult/impossible to resolve inconsistencies. Auditing for unexpected inconsistencies would have to happen and be resolved at a per-store level.
- In a given table, only one record exists for each logical item
(e.g. item, cashier, etc). This table is only editable at a
single store (i.e., master/HQ).
- No collisions. All edits take place via master/HQ server.
- Syncable. The master/HQ table can always be copied down to a store or lane and the fresh copy is guaranteed correct.
- Table size is minimal regardless of scale
- Consistency is easy to enforce
- Directly modifying the table in SQL does not add extra problems
- Editing goes away in a network outage. Stores could edit their local copy of the data in a pinch, but once HQ comes back online and resyncing any local edits would be overwritten
- Consistency is mandatory. There is no mechanism for any data to vary on a per-store basis.
- In a given table, HQ maintains one record for each store (including
itself). The other stores have two records in the table (theirs plus
HQ). HQ may edit all records; stores may edit their own records
and request updates to their own records at HQ.
- Minimal collisions while allowing store-specific records.
- Syncable. HQ maintains the authoritative copy of each record.
- Consistency can be monitored and fixed from HQ.
- Allows store-level variations
- Possibly more network resilient if a store queues change requests and submits them to HQ once connectivity is restored.
- Sending change requests to HQ could be relatively slow. It would probably have to be one record at a time.
- HQ would have to review store-level changes periodically.
- Change requests would likely get lost/dropped occasionally at scale.
- Complicated to implement and ensure change requests get queued / sent properly
- Requires an entirely different editing interface at HQ to deal with editing multiple records per logical object.
- Build a whole REST API in front of the table. All requests to add, view, or
update an object are brokered through the API. Editing interfaces do not
ever query the table in the database directly.
- Protects us from what we don't know. Walling the database off on the other side of the API means we can make architecture revisions without changing any code that interacts with the public side of the API.
- Decoupling dependencies is never a bad idea. This feels like a really smart approach.
- Versioning the API could be incredibly useful in adding features without updating every last piece of legacy code. GET /api/v1/item/4011 might return something different than GET /api/v2/item/4011
- Insanely complicated - as in re-write every aspect of Fannie that interacts with this table.
- Doesn't actually solve any multi-store data management issues; it just hides those problems on the other side of the API. It's the "toss everything in the closet" method of cleaning the house.
Membership is the most straightforward piece of data. Membership must be centrally managed - at least amongst any set of stores sharing the same set of members. The challenges here are strictly operational stuff, not data design. Users at the store level that want to manage member data need access to [organizational] HQ or need to submit some kind of request to staff at HQ.
Products are the least straightforward in that some degree of store-level variation is pretty much guaranteed and the exact level of centralized management could also vary with size. I think it would be reasonable to focus on a heavily centralized model first where items are edited primarily at HQ but with one record per item per store to allow per-store overrides. Users would interact with HQ - even when editing a store specific product record - and data would flow strictly one direction from HQ down to stores.
The other big product-related piece is batches. Exporting and importing batches should be extremely robust. In the de-centralized model, store servers would be set up like current single store operation while the organizational HQ could distribute both regular and sale price batches to stores with the individual stores responsible for applying those batches. If fully centralized and fully de-centralized models are both supported, it should be possible to setup more hybrid arrangements by combining the two models in different ways.
I think more-or-less all other operational data (excepting members) should follow the same path as products. If products are managed at HQ, then so are cashiers, departments, tenders, etc. If products are managed at the store level, then so are cashiers, departments, tenders, etc. The upside as I see it is the users are always interacting with the same backend server. Asking users to interact with an HQ server for some operations and their local server for others (and keep track of which is which) seems unreasonable and error prone.
======================================== Proposed Solution
One store database server must be designated HQ. All store web servers will be configured to use the HQ database server. This requires cross-site connectivity, obviously, but eliminates all problems related to data inconsistency. The other downsides are load and latency. The volume of traffic the HQ database has to handle increases with every store. Latency could growth both as a direct result of load as well as due to link speed between sites. To address latency, load, and (to a lesser extent) resiliency each store should provide its own database server that replicates the HQ database. Code will need to be re-written to distinguish between read queries and write queries. In theory this isn't even strictly multistore. Anyone could add read-only nodes replicating the primary database to spread out requests.
Each store database server should probably have one non-replicated database that's strictly an accumulator for lane data (e.g., dtransactions). That way lanes can always publish transaction data to the local server with minimal latency. A periodic task to push or pull data from the accumulator and into the HQ database would get transaction data to its final destination without adding any human-facing lag.
- DB1 is the primary database server at store 1.
- WEB1 is running Office at store 1. It's configured to use DB1 and does not have any read-only servers.
- DB2 is the database server at store 2. It replicates DB1 (master-slave).
- WEB2 is running Office at store 2. It's also configured to use DB1 but lists DB2 as a read-only server.
Why not master-master replication? It doesn't scale particularly well beyond two nodes. The standard auto_increment hack doesn't scale nicely. Adding a new node to an existing system causes all sorts of havoc in ID numbering. Circular replication also creates N points of failure (if any single node dies, the whole system is kaput).
Why not cluster/galera? Primarily because it's a bigger leap in complexity of setup. Cluster would almost certainly work but synchronous writes across nodes in different locations is still going to be slow. An administrator who's familiar with cluster seems like a prerequisite to even try it, where as coping with reads vs writes in code can be done without bringing in any new expertise.
Why not sync everything manually? There's too much to keep track of. Coding that solution would be a huge undertaking and explaining it to an end-user admin would be even worse.