NPE due to delete-by-query with parent/child when upgrading from 1.1.1 to 1.3.x #8031
This is bad. First of all, the actual exception should be a
@martijnvg can we somehow reproduce this with a bwc test? Just curious... I think we should work on something with @dakrone to be able to skip individual operations in the translog... maybe even as a standalone tool? @dakrone any ideas?
@s1monw I'm sure that this can be reproduced in a bwc test :)
@martijnvg assigned this to you, but perhaps @dakrone is the person best placed to look at this?
This issue is less severe than I initially thought. What it boils down to is that any delete-by-query translog operation with a p/c query is simply ignored, while all other translog operations are executed successfully and the shard gets assigned. The NPE is annoying (and I will fix it), but it gets wrapped in a QueryParsingException (in IndexQueryParserService#parseQuery(...), line 370), and because of this, LocalIndexShardGateway#recover(...) at line 276 ignores the delete-by-query operation. The status of a QueryParsingException is treated as a bad request, so the idea here is to ignore it.
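To make the behaviour described above concrete, here is a minimal, self-contained sketch of the idea: an operation whose failure maps to a client-error (4xx) status is skipped during replay, while the remaining operations are applied. This is not the actual Elasticsearch code. `TranslogReplaySketch`, `StatusException`, `Operation`, `replay` and `parseParentChildQuery` are hypothetical names invented for illustration, and the numeric status 400 is an assumption standing in for "bad request".

```java
import java.util.ArrayList;
import java.util.List;

public class TranslogReplaySketch {

    /** Hypothetical stand-in for an exception that carries a REST-style status. */
    static class StatusException extends RuntimeException {
        final int status;

        StatusException(String message, int status, Throwable cause) {
            super(message, cause);
            this.status = status;
        }
    }

    /** Hypothetical translog operation: a label plus the action to replay. */
    static class Operation {
        final String name;
        final Runnable action;

        Operation(String name, Runnable action) {
            this.name = name;
            this.action = action;
        }
    }

    /** Replays all operations and returns the names of those that were applied. */
    static List<String> replay(List<Operation> operations) {
        List<String> applied = new ArrayList<>();
        for (Operation op : operations) {
            try {
                op.action.run();
                applied.add(op.name);
            } catch (StatusException e) {
                // Assumption: 4xx means the operation itself is bad, so skip it
                // and keep recovering; anything else still fails the recovery.
                if (e.status >= 400 && e.status < 500) {
                    continue;
                }
                throw e;
            }
        }
        return applied;
    }

    /** Simulates the NPE being wrapped in a parse exception with status 400. */
    static void parseParentChildQuery() {
        try {
            Object missingContext = null;
            missingContext.toString(); // throws NullPointerException
        } catch (NullPointerException npe) {
            throw new StatusException("failed to parse query", 400, npe);
        }
    }

    public static void main(String[] args) {
        List<Operation> ops = new ArrayList<>();
        ops.add(new Operation("index doc 1", () -> {}));
        ops.add(new Operation("delete-by-query has_child",
                TranslogReplaySketch::parseParentChildQuery));
        ops.add(new Operation("index doc 2", () -> {}));
        System.out.println(replay(ops)); // prints [index doc 1, index doc 2]
    }
}
```

In this sketch only the delete-by-query entry is dropped. An exception with a non-4xx status would still propagate and fail the replay.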
I opened this PR for the NPE during recovery: #8177
…uery parse exception. Also added a bwc test that runs a delete by query with a has_child query and verifies that only that operation is ignored when recovering from disk during an upgrade. Closes elastic#8031 Closes elastic#8177
An NPE was encountered when upgrading from 1.1.1 to 1.3.4. During the rolling upgrade, a background cron tried to execute a delete-by-query which included a parent/child query. This was allowed in 1.1.1, but disabled in later versions.
This caused a delete-by-query to queue up in the translog of a 1.1.1 node. Before the translog was cleared, the shard tried to move to a 1.3.4 node, which caused an NPE. The shards repeatedly failed recovery and kept bouncing around the cluster. Because allocation filtering was being used to migrate data from the old nodes to the new ones, the cluster tried to recover the shards only on 1.3.4 nodes, leading to continuous failure.
The situation eventually resolved itself, likely because a background flush cleared out the translog and allowed the recovery to finally proceed normally.
Stack trace (sanitized to remove sensitive names/ips):