NPE due to delete-by-query with parent/child when upgrading from 1.1.1 to 1.3.x #8031

polyfractal · 2014-10-08T23:51:59Z

An NPE was encountered when upgrading from 1.1.1 to 1.3.4. During the rolling upgrade, a background cron tried to execute a delete-by-query which included a parent/child query. This was allowed in 1.1.1, but disabled in later versions.

This caused a delete-by-query to queue up in the translog of a 1.1.1 node. Before the translog was cleared, the shard tried to move to a 1.3.4 node, which caused an NPE. The shards repeatedly failed recovery and kept bouncing around the cluster. Because allocation filtering was being used to migrate data from the old nodes to the new ones, the cluster tried to recover the shards only on 1.3.4 nodes, leading to continuous failure.

The situation eventually resolved itself, likely because a background flush cleared out the translog and allowed the recovery to finally proceed normally.

Stack trace (sanitized to remove sensitive names/ips):


[2014-10-08 21:43:26,881][WARN ][indices.cluster          ] [prod-1.3.4] [my_index][6] failed to start shard
org.elasticsearch.indices.recovery.RecoveryFailedException: [my_index][6]: Recovery failed from [prod-1.1.1][YhcqkTzLTGSF8dyKAQPRBQ][prod-1.1.1.localdomain][inet[...]]{aws_availability_zone=us-east-1e, max_local_storage_nodes=1} into [prod-1.3.4][0cRcLbzTTAm15PMu_R_U2w][prod-1.3.4.localdomain][inet[prod-1.3.4.localdomain/...]]{aws_availability_zone=us-east-1e, max_local_storage_nodes=1}
    at org.elasticsearch.indices.recovery.RecoveryTarget.doRecovery(RecoveryTarget.java:306)
    at org.elasticsearch.indices.recovery.RecoveryTarget.access$200(RecoveryTarget.java:65)
    at org.elasticsearch.indices.recovery.RecoveryTarget$2.run(RecoveryTarget.java:175)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.RemoteTransportException: [prod-1.1.1][inet[/...]][index/shard/recovery/startRecovery]
Caused by: org.elasticsearch.index.engine.RecoveryEngineException: [my_index][6] Phase[2] Execution failed
    at org.elasticsearch.index.engine.internal.InternalEngine.recover(InternalEngine.java:1109)
    at org.elasticsearch.index.shard.service.InternalIndexShard.recover(InternalIndexShard.java:627)
    at org.elasticsearch.indices.recovery.RecoverySource.recover(RecoverySource.java:117)
    at org.elasticsearch.indices.recovery.RecoverySource.access$1600(RecoverySource.java:61)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:337)
    at org.elasticsearch.indices.recovery.RecoverySource$StartRecoveryTransportRequestHandler.messageReceived(RecoverySource.java:323)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:270)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.transport.RemoteTransportException: [prod-1.3.4][inet[/...]][index/shard/recovery/translogOps]
Caused by: org.elasticsearch.index.query.QueryParsingException: [my_index] Failed to parse
    at org.elasticsearch.index.query.IndexQueryParserService.parseQuery(IndexQueryParserService.java:330)
    at org.elasticsearch.index.shard.service.InternalIndexShard.prepareDeleteByQuery(InternalIndexShard.java:449)
    at org.elasticsearch.index.shard.service.InternalIndexShard.performRecoveryOperation(InternalIndexShard.java:780)
    at org.elasticsearch.indices.recovery.RecoveryTarget$TranslogOperationsRequestHandler.messageReceived(RecoveryTarget.java:431)
    at org.elasticsearch.indices.recovery.RecoveryTarget$TranslogOperationsRequestHandler.messageReceived(RecoveryTarget.java:410)
    at org.elasticsearch.transport.netty.MessageChannelHandler$RequestHandler.run(MessageChannelHandler.java:275)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
    at org.elasticsearch.index.query.QueryParserUtils.ensureNotDeleteByQuery(QueryParserUtils.java:36)
    at org.elasticsearch.index.query.HasParentFilterParser.parse(HasParentFilterParser.java:52)
    at org.elasticsearch.index.query.QueryParseContext.executeFilterParser(QueryParseContext.java:302)
    at org.elasticsearch.index.query.QueryParseContext.parseInnerFilter(QueryParseContext.java:283)
    at org.elasticsearch.index.query.NotFilterParser.parse(NotFilterParser.java:63)
    at org.elasticsearch.index.query.QueryParseContext.executeFilterParser(QueryParseContext.java:302)
    at org.elasticsearch.index.query.QueryParseContext.parseInnerFilter(QueryParseContext.java:283)
    at org.elasticsearch.index.query.FilteredQueryParser.parse(FilteredQueryParser.java:74)
    at org.elasticsearch.index.query.QueryParseContext.parseInnerQuery(QueryParseContext.java:239)
    at org.elasticsearch.index.query.IndexQueryParserService.innerParse(IndexQueryParserService.java:342)
    at org.elasticsearch.index.query.IndexQueryParserService.parse(IndexQueryParserService.java:268)
    at org.elasticsearch.index.query.IndexQueryParserService.parse(IndexQueryParserService.java:263)
    at org.elasticsearch.index.query.IndexQueryParserService.parseQuery(IndexQueryParserService.java:314)
    ... 8 more

martijnvg · 2014-10-09T07:28:21Z

This is bad. First of all, the actual exception should be a QueryParsingException with a message saying that p/c queries are unsupported in the delete by query api. Second, I think the translog should just skip an operation if it fails with a QueryParsingException.

s1monw · 2014-10-09T16:06:23Z

@martijnvg can we somehow reproduce this with a bwc test? just curious.... I think we should work on something with @dakrone to be able to skip individual operations in the translog... might even be a standalone tool? @dakrone any ideas?

martijnvg · 2014-10-09T16:07:08Z

@s1monw I'm sure that this can be reproduced in a bwc test :)

clintongormley · 2014-10-15T19:55:47Z

@martijnvg assigned this to you, but perhaps @dakrone is the person best placed to look at this?

martijnvg · 2014-10-21T12:15:02Z

This issue is less severe than I initially thought. What it boils down to is that any delete by query translog operation with a p/c query is just ignored, but all the other translog operations are executed successfully and the shard gets assigned.

The NPE is annoying (and I will fix it), but it gets wrapped in a QueryParsingException (in IndexQueryParserService#parseQuery(...) line 370), and because of this, LocalIndexShardGateway#recover(...) at line 276 ignores the delete by query operation. The status of a QueryParsingException is treated as a bad request, so the idea here is to ignore it.
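
For readers following along, the skip-on-bad-request behaviour described above can be sketched roughly as follows. This is a minimal illustration, not the actual Elasticsearch code: `TranslogReplaySketch`, `BadRequestException`, `Operation` and `replay` are hypothetical names, and the real logic in LocalIndexShardGateway#recover(...) may differ in detail.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins only; none of these are Elasticsearch classes.
class TranslogReplaySketch {

    /** Stand-in for an exception whose status maps to "bad request". */
    static class BadRequestException extends RuntimeException {
        BadRequestException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    /** One replayable translog operation; parseable=false simulates a p/c delete-by-query. */
    static final class Operation {
        final String name;
        final boolean parseable;

        Operation(String name, boolean parseable) {
            this.name = name;
            this.parseable = parseable;
        }

        void apply() {
            if (!parseable) {
                // Mimics the NPE being wrapped in a parse failure.
                throw new BadRequestException("Failed to parse " + name, new NullPointerException());
            }
        }
    }

    /** Replays all operations, skipping (and recording) those that fail as bad requests. */
    static List<String> replay(List<Operation> ops, List<String> skipped) {
        List<String> applied = new ArrayList<>();
        for (Operation op : ops) {
            try {
                op.apply();
                applied.add(op.name);
            } catch (BadRequestException e) {
                // Only this one operation is dropped; replay continues with the next.
                skipped.add(op.name);
            }
        }
        return applied;
    }

    public static void main(String[] args) {
        List<String> skipped = new ArrayList<>();
        List<String> applied = replay(Arrays.asList(
                new Operation("index doc 1", true),
                new Operation("delete-by-query has_parent", false),
                new Operation("index doc 2", true)), skipped);
        System.out.println("applied=" + applied + " skipped=" + skipped);
    }
}
```

Any other exception type would still propagate out of `replay` in this sketch, which corresponds to a recovery failure rather than a skipped operation.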

martijnvg · 2014-10-21T12:21:09Z

I opened this PR for the NPE during recovery: #8177
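
Going by the PR title ("Check if there is a search context, otherwise throw a query parse exception"), the guard could look roughly like the sketch below. It is a simplified illustration under assumptions, not the code from #8177: `SearchContextGuardSketch`, `ParseException`, `CURRENT_SOURCE` and the `"delete_by_query"` marker are hypothetical stand-ins for the real search context and exception types.

```java
// Hypothetical stand-ins only; this is not the code from the PR.
class SearchContextGuardSketch {

    /** Stand-in for a query parsing exception. */
    static class ParseException extends RuntimeException {
        ParseException(String message) {
            super(message);
        }
    }

    // Stand-in for a per-thread search context. It is assumed to be unset (null)
    // when a query is parsed outside a search request, e.g. during translog replay.
    static final ThreadLocal<String> CURRENT_SOURCE = new ThreadLocal<>();

    static final String DELETE_BY_QUERY = "delete_by_query";

    /** Rejects parent/child queries in delete-by-query without dereferencing a null context. */
    static void ensureNotDeleteByQuery(String queryName) {
        String source = CURRENT_SOURCE.get();
        if (source == null) {
            // Previously this path would have hit a NullPointerException.
            throw new ParseException("[" + queryName + "] requires a search context");
        }
        if (DELETE_BY_QUERY.equals(source)) {
            throw new ParseException("[" + queryName + "] unsupported in delete by query api");
        }
    }

    public static void main(String[] args) {
        try {
            ensureNotDeleteByQuery("has_parent"); // no context set on this thread
        } catch (ParseException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The point of the explicit null check is that callers get a parse exception carrying a readable message directly, instead of an NPE that only becomes a parse failure after being wrapped further up the stack.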

…uery parse exception. Also added a bwc test that runs a delete by query with a has_child query and verifies that only that operation is ignored when recovering from disk during a upgrade. Closes #8031 Closes #8177

clintongormley added v1.4.0 blocker >bug labels Oct 15, 2014

clintongormley assigned martijnvg Oct 15, 2014

martijnvg mentioned this issue Oct 21, 2014

Check if there is a search context, otherwise throw a query parse exception. #8177

Closed

martijnvg closed this as completed in 319878e Oct 22, 2014

martijnvg removed blocker v1.4.0 labels Oct 29, 2014
