A VM pause (due to GC, high IO load, etc) can cause the loss of inserted documents #10426

aphyr opened this issue Apr 4, 2015 · 10 comments

aphyr commented Apr 4, 2015

Following up on #7572 and #10407, I've found that Elasticsearch will lose inserted documents even in the event of a node hiccup due to garbage collection, swapping, disk failure, IO panic, virtual machine pauses, VM migration, etc. https://gist.github.com/aphyr/b8c98e6149bc66a2d839 shows a log where we pause an elasticsearch primary via SIGSTOP and SIGCONT. Even though no operations can take place against the suspended node during this time, and a new primary for the cluster comes to power, it looks like the old primary is still capable of acking inserts which are not replicated to the new primary--somewhere right before or right after the pause. The result is the loss of ~10% of acknowledged inserts.

You can replicate these results with Jepsen (commit e331ff3578), by running lein test :only elasticsearch.core-test/create-pause in the elasticsearch directory.
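
To make the failure mode concrete, here is a minimal sketch, in Java rather than Jepsen's actual Clojure nemesis, of what the pause amounts to: SIGSTOP freezes the target process (much like a very long GC pause or VM migration), and SIGCONT resumes it with its old view of the cluster. The pid argument and the 30-second duration are illustrative assumptions.

// Illustrative sketch only, not Jepsen's nemesis code. Assumes a POSIX
// system where the standard `kill` utility is on the PATH.
public class PauseNemesis {
    static void signal(String sig, long pid) throws Exception {
        new ProcessBuilder("kill", "-" + sig, Long.toString(pid))
                .inheritIO().start().waitFor();
    }

    public static void main(String[] args) throws Exception {
        long pid = Long.parseLong(args[0]); // pid of the primary's JVM
        signal("STOP", pid);                // node is frozen; it cannot observe the new master election
        Thread.sleep(30_000);               // hold the pause for an arbitrary 30 seconds
        signal("CONT", pid);                // node resumes, still believing it is the primary
    }
}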

Looking through the Elasticsearch cluster state code (which I am by no means qualified to understand or evaluate), I get the... really vague, probably incorrect impression that Elasticsearch might make a couple assumptions:

  1. Primaries are considered authoritative "now", without a logical clock that identifies what "now" means.
  2. Operations like "insert a document" don't... seem... to carry a logical clock with them allowing replicas to decide whether or not the operation supersedes their state, which means that messages delayed in flight can show up and cause interesting things to happen.

Are these at all correct? Have you considered looking into an epoch/term/generation scheme? If primaries are elected uniquely for a certain epoch, you can tag each operation with that epoch and use it to reject invalid requests from the logical past--invariants around advancing the epoch, in turn, can enforce the logical monotonicity of operations. It might make it easier to tamp down race conditions like this.
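
To make the suggestion concrete, here is a minimal sketch of the term check being proposed (the names are illustrative, not Elasticsearch code): each replication request carries the term of the primary that issued it, and a replica refuses anything stamped with an older term.

// Minimal sketch of the epoch/term idea; not Elasticsearch's implementation.
class TermCheckingReplica {
    private long highestTermSeen = 0;

    // Returns true if the write was applied, false if it was rejected because
    // it came from a primary elected in an older term (the logical past).
    synchronized boolean applyIndexOp(long primaryTerm, String docId, String source) {
        if (primaryTerm < highestTermSeen) {
            return false; // sender has been deposed; refuse instead of acking
        }
        highestTermSeen = primaryTerm;
        // ... apply the write to the local shard copy here ...
        return true;
    }
}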

bleskes commented Apr 4, 2015

Thx @aphyr. In general, I can give a quick answer to one part while we research the rest:

Have you considered looking into an epoch/term/generation scheme?

This is indeed the current plan.

bleskes commented Apr 10, 2015

We have made some effort to reproduce this failure. In general, we see GC as just another disruption that can happen, the same way we view network issues and file corruptions. If anyone is interested in the work we do there, the org.elasticsearch.test.disruption package and DiscoveryWithServiceDisruptionsTests are a good place to look.

In the Jepsen runs that failed for us, Jepsen created an index and then paused the JVM of the master node, to which the primary of one of the index shards happened to be allocated. At the time the JVM was paused, no other replica of this shard was fully initialized after the initial creation. Because the master JVM was paused, the other nodes elected another master, but that cluster had no copies left for that specific shard. This left the cluster in a red state. When the node is unpaused it rejoins the cluster. The shard is not re-allocated because we require a quorum of copies to assign a primary (in order to make sure we do not reuse an out-of-date copy). As such, the cluster stays red and all the data previously indexed into this shard is not available for searches.
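
As a rough illustration of the allocation rule described above (the names and the exact quorum formula here are assumptions, not Elasticsearch's actual allocator), a shard copy is only promoted to primary when a quorum of the configured copies is available, so a single possibly out-of-date copy cannot silently become primary on its own:

// Sketch of "require a quorum of copies before assigning a primary";
// the majority formula is an assumption used for illustration.
class PrimaryAllocationSketch {
    static boolean canAssignPrimary(int availableCopies, int configuredCopies) {
        int quorum = configuredCopies / 2 + 1; // majority of all configured copies
        return availableCopies >= quorum;
    }

    public static void main(String[] args) {
        // 1 primary + 1 replica configured = 2 copies; only the rejoining
        // node's copy is available, so the shard stays unassigned (red).
        System.out.println(canAssignPrimary(1, 2)); // false
    }
}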

When we changed Jepsen to wait for all replicas to be assigned before starting the nemesis, the failure doesn't happen anymore. This change, and some other improvements, are part of this PR to Jepsen.

That said, because a GC pause and an unresponsive network are similar in nature, there is still a small window in which documents can be lost; this is captured by #7572 and documented on the resiliency status page.

@aphyr can you confirm whether, with the changes in the PR, you see the same behavior?

aphyr commented Apr 15, 2015

Thanks for this, @bleskes! I have been super busy with a few other issues but this is the last one I have to clear before talks go! I'll take a look tomorrow morning. :)

bleskes commented Apr 21, 2015

@aphyr re our previous discussion of:

Have you considered looking into an epoch/term/generation scheme?
This is indeed the current plan.

If you're curious - I've opened a (high-level) issue describing our current thinking - see #10708.

aphyr commented Apr 28, 2015

I've merged your PR, and can confirm that ES still drops documents when a primary process is paused.

{:valid? false,
 :lost "#{1761}",
 :recovered
 "#{0 2..3 8 30 51 73 97 119 141 165 187 211 233 257 279 302 324 348 371 394 436 457 482 504 527 550 572 597 619 642 664 688 711 734 758 781 804 827 850 894 911 934 957 979 1003 1025 1049 1071 1092 1117 1138 1163 1185 1208 1230 1253 1277 1299 1342 1344 1350 1372 1415 1439 1462 1485 1508 1553 1576 1599 1623 1645 1667 1690 1714 1736 1779 1803 1825 1848 1871 1893 1917 1939 1964 1985 2010 2031 2054 2077 2100 2123 2146 2169 2192}",
 :ok "#{0..1344 1346..1392 1394..1530 1532..1760 1762..2203}",
 :recovered-frac 24/551,
 :unexpected-frac 0,
 :unexpected "#{}",
 :lost-frac 1/2204,
 :ok-frac 550/551}

dakrone commented Apr 28, 2015

@aphyr thanks for running it! I think the PR helps rule out the index not being in a green state before the test starts as one cause of document loss (though not the only cause). I will keep running the test with additional logging to try to reproduce the failure you see.

colings86 added the :Distributed/CRUD label Apr 24, 2018
@elasticmachine

Pinging @elastic/es-distributed

edeak commented Jan 7, 2020

any updates on this?

@matthiasg

@edeak indeed! A little unnerving to see this issue still open (either it is done and leaving it open was just an oversight, or they did not fix it?). Either way, a shame given all the work that @aphyr put into this.

ywelsch commented Feb 13, 2020

The issues found here were caused by problems in both the data replication subsystem and the cluster coordination subsystem, on which data replication also relies for correctness. All known issues in this area relating to this problem have since been fixed. As part of the sequence numbers effort, we've introduced primary terms, which allow rejecting invalid requests from the logical past. With the new cluster coordination subsystem introduced in ES 7 (#32006), the remaining known coordination-level issues ("Repeated network partitions can cause cluster state updates to be lost") have been fixed as well.
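
As a rough illustration of how primary terms combine with sequence numbers (illustrative types only, not the actual implementation), every operation can be ordered by the pair (primary term, sequence number), so a write stamped with a deposed primary's term is recognizably from the logical past:

// Illustrative Java 16+ record, not Elasticsearch code: the term takes
// precedence, so operations from an older, deposed primary's term sort
// before, and can be rejected in favor of, operations from the current term.
record OpId(long primaryTerm, long seqNo) implements Comparable<OpId> {
    @Override
    public int compareTo(OpId other) {
        int byTerm = Long.compare(primaryTerm, other.primaryTerm);
        return byTerm != 0 ? byTerm : Long.compare(seqNo, other.seqNo);
    }
}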

ywelsch closed this as completed Feb 13, 2020