Skip to content

A VM pause (due to GC, high IO load, etc) can cause the loss of inserted documents #10426

Closed
@aphyr

Description

@aphyr

Following up on #7572 and #10407, I've found that Elasticsearch will lose inserted documents even in the event of a node hiccup due to garbage collection, swapping, disk failure, IO panic, virtual machine pauses, VM migration, etc. https://gist.github.com/aphyr/b8c98e6149bc66a2d839 shows a log where we pause an elasticsearch primary via SIGSTOP and SIGCONT. Even though no operations can take place against the suspended node during this time, and a new primary for the cluster comes to power, it looks like the old primary is still capable of acking inserts which are not replicated to the new primary--somewhere right before or right after the pause. The result is the loss of ~10% of acknowledged inserts.

You can replicate these results with Jepsen (commit e331ff3578), by running lein test :only elasticsearch.core-test/create-pause in the elasticsearch directory.

Looking through the Elasticsearch cluster state code (which I am by no means qualified to understand or evaluate), I get the... really vague, probably incorrect impression that Elasticsearch might make a couple assumptions:

  1. Primaries are considered authoritative "now", without a logical clock that identifies what "now" means.
  2. Operations like "insert a document" don't... seem... to carry a logical clock with them allowing replicas to decide whether or not the operation supercedes their state, which means that messages delayed in flight can show up and cause interesting things to happen.

Are these at all correct? Have you considered looking in to an epoch/term/generation scheme? If primaries are elected uniquely for a certain epoch, you can tag each operation with that epoch and use it to reject invalid requests from the logical past--invariants around advancing the epoch, in turn, can enforce the logical monotonicity of operations. It might make it easier to tamp down race conditions like this.

Metadata

Metadata

Assignees

Labels

:Distributed Indexing/CRUDA catch all label for issues around indexing, updating and getting a doc by id. Not search.>bugresiliency

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions