index corruption through delete_by_query of child documents #5828

Closed
morus opened this issue Apr 16, 2014 · 2 comments

morus commented Apr 16, 2014

We experienced index corruption (indices with UNASSIGNED shards) during incremental indexing.

The index contains one main document type and three other document types for child documents.

The corruption is related to delete-by-query requests for the child documents. We had cases where the index was recovered on ES restart and cases where ES could not fix the index anymore. This might be related to the number of replicas (in the cases where recovery worked we only had one copy of the index on one instance), but we did not do an exhaustive analysis of that.
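
For illustration, such a request against the ES 1.x REST API has roughly the following shape (a Python-over-HTTP sketch, not the actual request from this report; the index, type, and query are placeholders, and the has_parent query is only an assumption suggested by the parent/child code path in the stack trace below):

```python
# Hypothetical sketch of a server-side delete-by-query on a child type (ES 1.x).
# All names and the query itself are placeholders, not the reporter's data.
import json
import requests

ES = "http://localhost:9200"

resp = requests.delete(
    "%s/myindex/child_doc/_query" % ES,  # ES 1.x delete-by-query endpoint
    data=json.dumps({
        "query": {
            # A parent/child query here exercises DeleteByQueryWrappingFilter,
            # the class that shows up in the stack trace below.
            "has_parent": {
                "parent_type": "main_doc",
                "query": {"term": {"status": "obsolete"}},
            }
        }
    }),
)
print(resp.status_code, resp.text)
```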

After we changed the delete-by-query requests into a client-side delete by query (search for the child documents, then send bulk requests deleting them by document id), the issues stopped.
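
Roughly, the client-side replacement looks like the following (a minimal Python-over-HTTP sketch rather than the Ruby client actually used; names and the query are placeholders, a real run would page through hits with scan/scroll, and it assumes the child documents are routed by their parent id, so each delete carries the parent value):

```python
# Client-side delete by query: search for child documents, then bulk-delete by id.
# Index/type names and the query are placeholders.
import json
import requests

ES = "http://localhost:9200"
INDEX = "myindex"
CHILD_TYPE = "child_doc"

# 1) Find the child documents that should be removed.
search_body = {
    "query": {"term": {"status": "obsolete"}},  # placeholder query
    "fields": ["_parent"],   # parent id, needed to route the deletes
    "size": 1000,            # use scan/scroll for larger result sets
}
hits = requests.post(
    "%s/%s/%s/_search" % (ES, INDEX, CHILD_TYPE),
    data=json.dumps(search_body),
).json()["hits"]["hits"]

# 2) Delete them by document id in a single bulk request.
bulk_lines = []
for hit in hits:
    bulk_lines.append(json.dumps({
        "delete": {
            "_index": INDEX,
            "_type": CHILD_TYPE,
            "_id": hit["_id"],
            "_parent": hit["fields"]["_parent"],  # route to the right shard
        }
    }))
if bulk_lines:
    requests.post("%s/_bulk" % ES, data="\n".join(bulk_lines) + "\n")
```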

The Elasticsearch version is 1.0.2; the client operates through the Ruby bindings using HTTP (which should not matter). The OS is Linux. The index has 6 shards and ~12 million documents (1.7 million "main" documents, the rest are child documents of three different types) in ~7.3 GB. The server has 16 GB of memory; ES runs with an 8 GB heap and 65535 file descriptors.

Sorry, we could not try ES 1.1 because of an error regarding empty geo points (it seems to be fixed in master but is not released yet).

Part of the problem might be that we are not very strict about ensuring referential integrity between parent and child documents. My perhaps naive expectation was that this should not matter. So there might be child documents naming a parent where the parent document does not exist.

When we ran the incremental updates against a partial index containing only a handful of documents (while the incremental stream covered changes to all data), the issue did not show up. So it is either related to the index size or, more probably, to whether the delete by query actually finds something to delete.

We only had one process working on the updates, strictly sequentially, so it should not be a race condition between several changes happening at the same time. There were parallel updates to other indices, though.

The error itself is not too enlightening (at least for a pure ES user): the indexer dies from an HTTP timeout in an indexing request.

The ES log shows an error stating "failed to merge" together with "org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed"; see the full stack trace below. Raising the ES log level to debug did not provide any additional information about why the index reader was closed.

I'm afraid I cannot provide a full sample showing how to reproduce the problem.

best
Morus

PS: The initial error messages look like this:
[WARN ][index.merge.scheduler ] [pjpp-production master] [candidates_v0004][5] failed to merge
org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
at org.apache.lucene.index.IndexReader.ensureOpen(IndexReader.java:252)
at org.apache.lucene.index.CompositeReader.getContext(CompositeReader.java:102)
at org.apache.lucene.index.CompositeReader.getContext(CompositeReader.java:56)
at org.apache.lucene.index.IndexReader.leaves(IndexReader.java:502)
at org.elasticsearch.index.search.child.DeleteByQueryWrappingFilter.contains(DeleteByQueryWrappingFilter.java:122)
at org.elasticsearch.index.search.child.DeleteByQueryWrappingFilter.getDocIdSet(DeleteByQueryWrappingFilter.java:81)
at org.elasticsearch.common.lucene.search.ApplyAcceptedDocsFilter.getDocIdSet(ApplyAcceptedDocsFilter.java:45)
at org.apache.lucene.search.ConstantScoreQuery$ConstantWeight.scorer(ConstantScoreQuery.java:142)
at org.apache.lucene.search.FilteredQuery$RandomAccessFilterStrategy.filteredScorer(FilteredQuery.java:533)
at org.apache.lucene.search.FilteredQuery$1.scorer(FilteredQuery.java:133)
at org.apache.lucene.search.QueryWrapperFilter$1.iterator(QueryWrapperFilter.java:59)
at org.apache.lucene.index.BufferedUpdatesStream.applyQueryDeletes(BufferedUpdatesStream.java:546)
at org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:284)
at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:3844)
at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:3806)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3659)
at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
at org.apache.lucene.index.TrackingConcurrentMergeScheduler.doMerge(TrackingConcurrentMergeScheduler.java:107)
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
[WARN ][index.engine.internal ] [pjpp-production master] [candidates_v0004][5] failed engine
org.apache.lucene.index.MergePolicy$MergeException: org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
at org.elasticsearch.index.merge.scheduler.ConcurrentMergeSchedulerProvider$CustomConcurrentMergeScheduler.handleMergeException(ConcurrentMergeSchedulerProvider.java:109)
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
at org.apache.lucene.index.IndexReader.ensureOpen(IndexReader.java:252)
at org.apache.lucene.index.CompositeReader.getContext(CompositeReader.java:102)
at org.apache.lucene.index.CompositeReader.getContext(CompositeReader.java:56)
at org.apache.lucene.index.IndexReader.leaves(IndexReader.java:502)
at org.elasticsearch.index.search.child.DeleteByQueryWrappingFilter.contains(DeleteByQueryWrappingFilter.java:122)
at org.elasticsearch.index.search.child.DeleteByQueryWrappingFilter.getDocIdSet(DeleteByQueryWrappingFilter.java:81)
at org.elasticsearch.common.lucene.search.ApplyAcceptedDocsFilter.getDocIdSet(ApplyAcceptedDocsFilter.java:45)
at org.apache.lucene.search.ConstantScoreQuery$ConstantWeight.scorer(ConstantScoreQuery.java:142)
at org.apache.lucene.search.FilteredQuery$RandomAccessFilterStrategy.filteredScorer(FilteredQuery.java:533)
at org.apache.lucene.search.FilteredQuery$1.scorer(FilteredQuery.java:133)
at org.apache.lucene.search.QueryWrapperFilter$1.iterator(QueryWrapperFilter.java:59)
at org.apache.lucene.index.BufferedUpdatesStream.applyQueryDeletes(BufferedUpdatesStream.java:546)
at org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:284)
at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:3844)
at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:3806)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3659)
at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
at org.apache.lucene.index.TrackingConcurrentMergeScheduler.doMerge(TrackingConcurrentMergeScheduler.java:107)
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)

martijnvg self-assigned this Apr 16, 2014

martijnvg (Member) commented

This is indeed an issue and I can see how this can fail shards. Thanks for reporting this bug!

martijnvg (Member) commented

I don't see how this bug can be fixed easily, so for now I recommend not using any parent/child query or filter (has_child, has_parent, top_children) in the delete by query API.
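
For clarity, a delete-by-query body containing any of the following query shapes falls under that recommendation (type names here are placeholders):

```python
# Query shapes to keep out of delete-by-query bodies for now (placeholder type names).
AVOID = [
    {"has_child": {"type": "child_doc", "query": {"match_all": {}}}},
    {"has_parent": {"parent_type": "main_doc", "query": {"match_all": {}}}},
    {"top_children": {"type": "child_doc", "query": {"match_all": {}}}},
]
# i.e. do not send {"query": <any of the above>} to the /{index}/{type}/_query
# endpoint until the fix is released.
```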

s1monw added the blocker label Apr 24, 2014
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Apr 28, 2014
It wasn't properly implemented and could lead to a shard being failed and not able to recover.

Closes elastic#5828
martijnvg added a commit that referenced this issue Apr 28, 2014
It wasn't properly implemented and could lead to a shard being failed and not able to recover.

Closes #5828 #5916
martijnvg added a commit that referenced this issue Apr 28, 2014
It wasn't properly implemented and could lead to a shard being failed and not able to recover.

Closes #5828 #5916
martijnvg added a commit that referenced this issue Apr 28, 2014
It wasn't properly implemented and could lead to a shard being failed and not able to recover.

Closes #5828 #5916
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
It wasn't properly implemented and could lead to a shard being failed and not able to recover.

Closes elastic#5828 elastic#5916
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
It wasn't properly implemented and could lead to a shard being failed and not able to recover.

Closes elastic#5828 elastic#5916