index corruption through delete_by_query of child documents #5828

Closed
morus opened this issue Apr 16, 2014 · 2 comments

morus commented Apr 16, 2014

We experienced index corruption (indices with UNASSIGNED shards) during incremental indexing.

The index contains one main document type and three other document types for child documents.

The corruption is related to delete-by-query requests for the child documents. We had cases where the index was recovered on ES restart and cases where ES could not fix the index anymore. This might be related to the number of replicas (in the cases where recovery worked we only had one copy of the index on one instance), but we did not do an exhaustive analysis of that.
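
For illustration, such a request against the ES 1.x REST API has roughly the following shape (a Python-over-HTTP sketch, not the actual request from this report; the index, type, and query are placeholders, and the has_parent query is only an assumption suggested by the parent/child code path in the stack trace below):

```python
# Hypothetical sketch of a server-side delete-by-query on a child type (ES 1.x).
# All names and the query itself are placeholders, not the reporter's data.
import json
import requests

ES = "http://localhost:9200"

resp = requests.delete(
    "%s/myindex/child_doc/_query" % ES,  # ES 1.x delete-by-query endpoint
    data=json.dumps({
        "query": {
            # A parent/child query here exercises DeleteByQueryWrappingFilter,
            # the class that shows up in the stack trace below.
            "has_parent": {
                "parent_type": "main_doc",
                "query": {"term": {"status": "obsolete"}},
            }
        }
    }),
)
print(resp.status_code, resp.text)
```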

After we changed the delete-by-query requests into a client-side delete by query (search for the child documents, then send bulk requests deleting them by document id), the issues stopped.
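
Roughly, the client-side replacement looks like the following (a minimal Python-over-HTTP sketch rather than the Ruby client actually used; names and the query are placeholders, a real run would page through hits with scan/scroll, and it assumes the child documents are routed by their parent id, so each delete carries the parent value):

```python
# Client-side delete by query: search for child documents, then bulk-delete by id.
# Index/type names and the query are placeholders.
import json
import requests

ES = "http://localhost:9200"
INDEX = "myindex"
CHILD_TYPE = "child_doc"

# 1) Find the child documents that should be removed.
search_body = {
    "query": {"term": {"status": "obsolete"}},  # placeholder query
    "fields": ["_parent"],   # parent id, needed to route the deletes
    "size": 1000,            # use scan/scroll for larger result sets
}
hits = requests.post(
    "%s/%s/%s/_search" % (ES, INDEX, CHILD_TYPE),
    data=json.dumps(search_body),
).json()["hits"]["hits"]

# 2) Delete them by document id in a single bulk request.
bulk_lines = []
for hit in hits:
    bulk_lines.append(json.dumps({
        "delete": {
            "_index": INDEX,
            "_type": CHILD_TYPE,
            "_id": hit["_id"],
            "_parent": hit["fields"]["_parent"],  # route to the right shard
        }
    }))
if bulk_lines:
    requests.post("%s/_bulk" % ES, data="\n".join(bulk_lines) + "\n")
```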

The Elasticsearch version is 1.0.2; the client operates through the Ruby bindings using HTTP (which should not matter). The OS is Linux. The index has 6 shards and ~12 million documents (1.7 million "main" documents, the rest are child documents of three different types) in ~7.3 GB. The server has 16 GB of memory; ES runs with an 8 GB heap and 65535 file descriptors.

Sorry, we could not try ES 1.1 because of an error regarding empty geo points (it seems to be fixed in master but is not released yet).

Part of the problem might be that we are not very strict about ensuring referential integrity between parent and child documents. My perhaps naive expectation was that this should not matter. So there might be child documents naming a parent where the parent document does not exist.

When we ran the incremental updates against a partial index containing only a handful of documents (while the incremental stream covered changes to all data), the issue did not show up. So it is either related to the index size or, more probably, to whether the delete by query actually finds something to delete.

We only had one process working on the updates, strictly sequentially, so it should not be a race condition between several changes happening at the same time. There were parallel updates to other indices, though.

The error itself is not too enlightening (at least for a pure ES user): the indexer dies from an HTTP timeout in an indexing request.

The ES log shows an error stating "failed to merge" together with "org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed"; see the full stack trace below. Raising the ES log level to debug did not provide any additional information about why the index reader was closed.

I'm afraid I cannot provide a full sample showing how to reproduce the problem.

best
Morus

PS: The initial error messages look like this:
[WARN ][index.merge.scheduler ] [pjpp-production master] [candidates_v0004][5] failed to merge
org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
at org.apache.lucene.index.IndexReader.ensureOpen(IndexReader.java:252)
at org.apache.lucene.index.CompositeReader.getContext(CompositeReader.java:102)
at org.apache.lucene.index.CompositeReader.getContext(CompositeReader.java:56)
at org.apache.lucene.index.IndexReader.leaves(IndexReader.java:502)
at org.elasticsearch.index.search.child.DeleteByQueryWrappingFilter.contains(DeleteByQueryWrappingFilter.java:122)
at org.elasticsearch.index.search.child.DeleteByQueryWrappingFilter.getDocIdSet(DeleteByQueryWrappingFilter.java:81)
at org.elasticsearch.common.lucene.search.ApplyAcceptedDocsFilter.getDocIdSet(ApplyAcceptedDocsFilter.java:45)
at org.apache.lucene.search.ConstantScoreQuery$ConstantWeight.scorer(ConstantScoreQuery.java:142)
at org.apache.lucene.search.FilteredQuery$RandomAccessFilterStrategy.filteredScorer(FilteredQuery.java:533)
at org.apache.lucene.search.FilteredQuery$1.scorer(FilteredQuery.java:133)
at org.apache.lucene.search.QueryWrapperFilter$1.iterator(QueryWrapperFilter.java:59)
at org.apache.lucene.index.BufferedUpdatesStream.applyQueryDeletes(BufferedUpdatesStream.java:546)
at org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:284)
at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:3844)
at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:3806)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3659)
at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
at org.apache.lucene.index.TrackingConcurrentMergeScheduler.doMerge(TrackingConcurrentMergeScheduler.java:107)
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)
[WARN ][index.engine.internal ] [pjpp-production master] [candidates_v0004][5] failed engine
org.apache.lucene.index.MergePolicy$MergeException: org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
at org.elasticsearch.index.merge.scheduler.ConcurrentMergeSchedulerProvider$CustomConcurrentMergeScheduler.handleMergeException(ConcurrentMergeSchedulerProvider.java:109)
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:518)
Caused by: org.apache.lucene.store.AlreadyClosedException: this IndexReader is closed
at org.apache.lucene.index.IndexReader.ensureOpen(IndexReader.java:252)
at org.apache.lucene.index.CompositeReader.getContext(CompositeReader.java:102)
at org.apache.lucene.index.CompositeReader.getContext(CompositeReader.java:56)
at org.apache.lucene.index.IndexReader.leaves(IndexReader.java:502)
at org.elasticsearch.index.search.child.DeleteByQueryWrappingFilter.contains(DeleteByQueryWrappingFilter.java:122)
at org.elasticsearch.index.search.child.DeleteByQueryWrappingFilter.getDocIdSet(DeleteByQueryWrappingFilter.java:81)
at org.elasticsearch.common.lucene.search.ApplyAcceptedDocsFilter.getDocIdSet(ApplyAcceptedDocsFilter.java:45)
at org.apache.lucene.search.ConstantScoreQuery$ConstantWeight.scorer(ConstantScoreQuery.java:142)
at org.apache.lucene.search.FilteredQuery$RandomAccessFilterStrategy.filteredScorer(FilteredQuery.java:533)
at org.apache.lucene.search.FilteredQuery$1.scorer(FilteredQuery.java:133)
at org.apache.lucene.search.QueryWrapperFilter$1.iterator(QueryWrapperFilter.java:59)
at org.apache.lucene.index.BufferedUpdatesStream.applyQueryDeletes(BufferedUpdatesStream.java:546)
at org.apache.lucene.index.BufferedUpdatesStream.applyDeletesAndUpdates(BufferedUpdatesStream.java:284)
at org.apache.lucene.index.IndexWriter._mergeInit(IndexWriter.java:3844)
at org.apache.lucene.index.IndexWriter.mergeInit(IndexWriter.java:3806)
at org.apache.lucene.index.IndexWriter.merge(IndexWriter.java:3659)
at org.apache.lucene.index.ConcurrentMergeScheduler.doMerge(ConcurrentMergeScheduler.java:405)
at org.apache.lucene.index.TrackingConcurrentMergeScheduler.doMerge(TrackingConcurrentMergeScheduler.java:107)
at org.apache.lucene.index.ConcurrentMergeScheduler$MergeThread.run(ConcurrentMergeScheduler.java:482)

martijnvg self-assigned this Apr 16, 2014

martijnvg (Member) commented

This is indeed an issue and I can see how this can fail shards. Thanks for reporting this bug!

martijnvg (Member) commented

I don't see how this bug can be fixed easily, so for now I recommend not using any parent/child query or filter (has_child, has_parent, top_children) in the delete by query API.
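
For clarity, a delete-by-query body containing any of the following query shapes falls under that recommendation (type names here are placeholders):

```python
# Query shapes to keep out of delete-by-query bodies for now (placeholder type names).
AVOID = [
    {"has_child": {"type": "child_doc", "query": {"match_all": {}}}},
    {"has_parent": {"parent_type": "main_doc", "query": {"match_all": {}}}},
    {"top_children": {"type": "child_doc", "query": {"match_all": {}}}},
]
# i.e. do not send {"query": <any of the above>} to the /{index}/{type}/_query
# endpoint until the fix is released.
```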

s1monw added the blocker label Apr 24, 2014
martijnvg added a commit to martijnvg/elasticsearch that referenced this issue Apr 28, 2014
It wasn't properly implemented and could lead to a shard being failed and not able to recover.

Closes elastic#5828
martijnvg added a commit that referenced this issue Apr 28, 2014
It wasn't properly implemented and could lead to a shard being failed and not able to recover.

Closes #5828 #5916
martijnvg added a commit that referenced this issue Apr 28, 2014
It wasn't properly implemented and could lead to a shard being failed and not able to recover.

Closes #5828 #5916
martijnvg added a commit that referenced this issue Apr 28, 2014
It wasn't properly implemented and could lead to a shard being failed and not able to recover.

Closes #5828 #5916
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
It wasn't properly implemented and could lead to a shard being failed and not able to recover.

Closes elastic#5828 elastic#5916
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
It wasn't properly implemented and could lead to a shard being failed and not able to recover.

Closes elastic#5828 elastic#5916