Description
Following up on #7572 and #10407, I've found that Elasticsearch will lose inserted documents even in the event of a node hiccup due to garbage collection, swapping, disk failure, IO panic, virtual machine pauses, VM migration, etc. https://gist.github.com/aphyr/b8c98e6149bc66a2d839 shows a log where we pause an elasticsearch primary via SIGSTOP and SIGCONT. Even though no operations can take place against the suspended node during this time, and a new primary for the cluster comes to power, it looks like the old primary is still capable of acking inserts which are not replicated to the new primary--somewhere right before or right after the pause. The result is the loss of ~10% of acknowledged inserts.
You can replicate these results with Jepsen (commit e331ff3578), by running lein test :only elasticsearch.core-test/create-pause
in the elasticsearch
directory.
Looking through the Elasticsearch cluster state code (which I am by no means qualified to understand or evaluate), I get the... really vague, probably incorrect impression that Elasticsearch might make a couple assumptions:
- Primaries are considered authoritative "now", without a logical clock that identifies what "now" means.
- Operations like "insert a document" don't... seem... to carry a logical clock with them allowing replicas to decide whether or not the operation supercedes their state, which means that messages delayed in flight can show up and cause interesting things to happen.
Are these at all correct? Have you considered looking in to an epoch/term/generation scheme? If primaries are elected uniquely for a certain epoch, you can tag each operation with that epoch and use it to reject invalid requests from the logical past--invariants around advancing the epoch, in turn, can enforce the logical monotonicity of operations. It might make it easier to tamp down race conditions like this.