NPE due to delete-by-query with parent/child when upgrading from 1.1.1 to 1.3.x #8031

Closed
polyfractal opened this issue Oct 8, 2014 · 6 comments

@polyfractal
Contributor

An NPE was encountered when upgrading from 1.1.1 to 1.3.4. During the rolling upgrade, a background cron tried to execute a delete-by-query which included a parent/child query. This was allowed in 1.1.1, but disabled in later versions.

This caused a delete-by-query to queue up in the translog of a 1.1.1 node. Before the translog was cleared, the shard tried to move to a 1.3.4 node, which caused an NPE. The shards repeatedly failed recovery and kept bouncing around the cluster. Because allocation filtering was being used to migrate data from old to new nodes, the cluster tried to recover the shards only on 1.3.4 nodes, leading to continuous failure.

The situation eventually resolved itself, likely because a background flush cleared out the translog and allowed the recovery to finally proceed normally.

Stack trace (sanitized to remove sensitive names/ips):


[2014-10-08 21:43:26,881][WARN ][indices.cluster          ] [prod-1.3.4] [my_index][6] failed to start shard
org.elasticsearch.indices.recovery.RecoveryFailedException: [my_index][6]: Recovery failed from [prod-1.1.1][YhcqkTzLTGSF8dyKAQPRBQ][prod-1.1.1.localdomain][inet[...]]{aws_availability_zone=us-east-1e, max_local_storage_nodes=1} into [prod-1.3.4][0cRcLbzTTAm15PMu_R_U2w][prod-1.3.4.localdomain][inet[prod-1.3.4.localdomain/...]]{aws_availability_zone=us-east-1e, max_local_storage_nodes=1}
    at org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:306)
    at org.elasticsearch.indices.recovery.RecoveryTarget.access$200(RecoveryTarget.java:65)
    at org.elasticsearch.indices.recovery.RecoveryTarget$2.run(RecoveryTarget.java:175)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.RemoteTransportException: [prod-1.1.1][inet[/...]][index/shard/recovery/startRecovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: [my_index][6] Phase[2] Execution failed
    at org.elasticsearch.index.engine.internal.InternalEngine.recover(InternalEngine.java:1109)
    at org.elasticsearch.index.shard.service.InternalIndexShard.recover(InternalIndexShard.java:627)
    at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:117)
    at org.elasticsearch.indices.recovery.RecoverySource.access$1600(RecoverySource.java:61)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:337)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:323)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:270)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.RemoteTransportException: [prod-1.3.4][inet[/...]][index/shard/recovery/translogOps]
Caused by: org.elasticsearch.index.query.QueryParsingException: [my_index] Failed to parse
    at org.elasticsearch.index.query.IndexQueryParserService.parseQuery(IndexQueryParserService.java:330)
    at org.elasticsearch.index.shard.service.InternalIndexShard.prepareDeleteByQuery(InternalIndexShard.java:449)
    at org.elasticsearch.index.shard.service.InternalIndexShard.performRecoveryOperation(InternalIndexShard.java:780)
    at org.elasticsearch.indices.recovery.RecoveryTarget$TranslogOperationsRequestHandler.messageReceived(RecoveryTarget.java:431)
    at org.elasticsearch.indices.recovery.RecoveryTarget$TranslogOperationsRequestHandler.messageReceived(RecoveryTarget.java:410)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
    at org.elasticsearch.index.query.QueryParserUtils.ensureNotDeleteByQuery(QueryParserUtils.java:36)
    at org.elasticsearch.index.query.HasParentFilterParser.parse(HasParentFilterParser.java:52)
    at org.elasticsearch.index.query.QueryParseContext.executeFilterParser(QueryParseContext.java:302)
    at org.elasticsearch.index.query.QueryParseContext.parseInnerFilter(QueryParseContext.java:283)
    at org.elasticsearch.index.query.NotFilterParser.parse(NotFilterParser.java:63)
    at org.elasticsearch.index.query.QueryParseContext.executeFilterParser(QueryParseContext.java:302)
    at org.elasticsearch.index.query.QueryParseContext.parseInnerFilter(QueryParseContext.java:283)
    at org.elasticsearch.index.query.FilteredQueryParser.parse(FilteredQueryParser.java:74)
    at org.elasticsearch.index.query.QueryParseContext.parseInnerQuery(QueryParseContext.java:239)
    at org.elasticsearch.index.query.IndexQueryParserService.innerParse(IndexQueryParserService.java:342)
    at org.elasticsearch.index.query.IndexQueryParserService.parse(IndexQueryParserService.java:268)
    at org.elasticsearch.index.query.IndexQueryParserService.parse(IndexQueryParserService.java:263)
    at org.elasticsearch.index.query.IndexQueryParserService.parseQuery(IndexQueryParserService.java:314)
    ... 8 more
@martijnvg
Member

This is bad. First of all, the actual exception should be a QueryParsingException with a message saying that p/c queries are unsupported in the delete by query api. Second, I think the translog should just skip an operation if it fails with a QueryParsingException.
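
The first point can be sketched as follows. This is only an illustration, assuming the NPE comes from dereferencing a request context that is absent during recovery (the trace does not show which reference was null). `DeleteByQueryGuardSketch`, `RequestContext`, `QueryParsingFailure` and the `"delete_by_query"` source label are hypothetical stand-ins, not the Elasticsearch classes or the actual fix.

```java
// Hypothetical stand-ins only; none of these are real Elasticsearch classes.
final class DeleteByQueryGuardSketch {

    /** Stand-in for a parse failure that callers can treat as a bad request. */
    static final class QueryParsingFailure extends RuntimeException {
        QueryParsingFailure(String message) {
            super(message);
        }
    }

    /** Stand-in for per-request state; a null reference models "no context available". */
    static final class RequestContext {
        private final String source;

        RequestContext(String source) {
            this.source = source;
        }

        String source() {
            return source;
        }
    }

    // Assumed label for requests that arrive through the delete by query api.
    static final String DELETE_BY_QUERY_SOURCE = "delete_by_query";

    /** Rejects p/c queries with a descriptive failure instead of letting an NPE escape. */
    static void ensureNotDeleteByQuery(String queryName, RequestContext context) {
        if (context == null) {
            // Explicit check: a missing context becomes a parse failure, not an NPE.
            throw new QueryParsingFailure("[" + queryName + "] requires a request context");
        }
        if (DELETE_BY_QUERY_SOURCE.equals(context.source())) {
            throw new QueryParsingFailure(
                    "[" + queryName + "] query and filter unsupported in delete by query api");
        }
    }

    public static void main(String[] args) {
        try {
            ensureNotDeleteByQuery("has_parent", null);
        } catch (QueryParsingFailure e) {
            System.out.println(e.getMessage());
        }
    }
}
```

With a guard of this shape, the caller sees one exception type for both the missing context and the unsupported api case, which is what makes a uniform "skip this operation" policy possible.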

@s1monw
Contributor

s1monw commented Oct 9, 2014

@martijnvg can we somehow reproduce this with a bwc test? just curious.... I think we should work on something with @dakrone to be able to skip individual operations in the translog... might even be a standalone tool? @dakrone any ideas?

@martijnvg
Member

@s1monw I'm sure that this can be reproduced in a bwc test :)

@clintongormley

@martijnvg assigned this to you, but perhaps @dakrone is the person best placed to look at this?

@martijnvg
Member

This issue is less severe than I initially thought. What it boils down to is that any delete by query translog operation with a p/c query is just ignored, while all other translog operations are executed successfully and the shard gets assigned.

The NPE is annoying (and I will fix it), but it gets wrapped by a QueryParsingException (in IndexQueryParserService#parseQuery(...) line 370), and because of this LocalIndexShardGateway#recover(...) at line 276 ignores the delete by query operation. The status of a QueryParsingException is treated as a bad request, so the idea here is to ignore it.
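
The wrap-then-ignore behaviour described here can be sketched roughly as below. It is a simplified model, not the code in LocalIndexShardGateway: `TranslogReplaySketch`, `Operation`, `OperationFailure` and `Status` are hypothetical names, and the translog is reduced to a list of operations applied in order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of translog replay; none of these are real Elasticsearch classes.
final class TranslogReplaySketch {

    enum Status { BAD_REQUEST, INTERNAL_SERVER_ERROR }

    /** Failure carrying a status, so the replay loop can decide whether to skip. */
    static final class OperationFailure extends RuntimeException {
        final Status status;

        OperationFailure(Status status, String message, Throwable cause) {
            super(message, cause);
            this.status = status;
        }
    }

    /** A single replayable operation; it records what it applied. */
    interface Operation {
        void apply(List<String> applied);
    }

    static Operation index(String id) {
        return applied -> applied.add("index:" + id);
    }

    /** Wraps any runtime failure from parsing (including an NPE) as a bad request. */
    static Operation deleteByQuery(Runnable parser) {
        return applied -> {
            try {
                parser.run();
            } catch (RuntimeException e) {
                throw new OperationFailure(Status.BAD_REQUEST, "Failed to parse", e);
            }
            applied.add("delete_by_query");
        };
    }

    /** Replays all operations, skipping only those that fail as a bad request. */
    static int replay(List<Operation> operations, List<String> applied) {
        int skipped = 0;
        for (Operation operation : operations) {
            try {
                operation.apply(applied);
            } catch (OperationFailure e) {
                if (e.status != Status.BAD_REQUEST) {
                    throw e; // anything else still fails the recovery
                }
                skipped++;
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        List<String> applied = new ArrayList<>();
        List<Operation> operations = Arrays.asList(
                index("1"),
                deleteByQuery(() -> { throw new NullPointerException(); }),
                index("2"));
        int skipped = replay(operations, applied);
        System.out.println(applied + " skipped=" + skipped);
    }
}
```

Only the unparseable delete by query is dropped; the surrounding operations are still applied, and a failure with any other status still aborts the replay.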

@martijnvg
Member

I opened this PR for the NPE during recovery: #8177

martijnvg added a commit that referenced this issue Oct 22, 2014
…uery parse exception.

Also added a bwc test that runs a delete by query with a has_child query and verifies that only that operation is ignored when recovering from disk during an upgrade.

Closes #8031
Closes #8177
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
…uery parse exception.

Also added a bwc test that runs a delete by query with a has_child query and verifies that only that operation is ignored when recovering from disk during an upgrade.

Closes elastic#8031
Closes elastic#8177