
Failed to deserialize response when using Top Hits aggregation with search_type "count" #9294

Closed
mitallast opened this issue Jan 14, 2015 · 8 comments


@mitallast

Hi there, I have a lot of error messages in logs like this:

org.elasticsearch.transport.RemoteTransportException: Failed to deserialize response of type [org.elasticsearch.search.query.QuerySearchResult]
Caused by: org.elasticsearch.transport.TransportSerializationException: Failed to deserialize response of type [org.elasticsearch.search.query.QuerySearchResult]
    at org.elasticsearch.transport.netty.MessageChannelHandler.handleResponse(MessageChannelHandler.java:152)
    at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:127)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.IndexOutOfBoundsException: Invalid combined index of 1471746, maximum is 64403
    at org.elasticsearch.common.netty.buffer.SlicedChannelBuffer.<init>(SlicedChannelBuffer.java:46)
    at org.elasticsearch.common.netty.buffer.HeapChannelBuffer.slice(HeapChannelBuffer.java:201)
    at org.elasticsearch.transport.netty.ChannelBufferStreamInput.readBytesReference(ChannelBufferStreamInput.java:56)
    at org.elasticsearch.common.io.stream.StreamInput.readText(StreamInput.java:222)
    at org.elasticsearch.common.io.stream.HandlesStreamInput.readSharedText(HandlesStreamInput.java:69)
    at org.elasticsearch.search.SearchShardTarget.readFrom(SearchShardTarget.java:103)
    at org.elasticsearch.search.SearchShardTarget.readSearchShardTarget(SearchShardTarget.java:87)
    at org.elasticsearch.search.internal.InternalSearchHits.readFrom(InternalSearchHits.java:217)
    at org.elasticsearch.search.internal.InternalSearchHits.readFrom(InternalSearchHits.java:203)
    at org.elasticsearch.search.internal.InternalSearchHits.readSearchHits(InternalSearchHits.java:197)
    at org.elasticsearch.search.aggregations.metrics.tophits.InternalTopHits.readFrom(InternalTopHits.java:137)
    at org.elasticsearch.search.aggregations.metrics.tophits.InternalTopHits$1.readResult(InternalTopHits.java:50)
    at org.elasticsearch.search.aggregations.metrics.tophits.InternalTopHits$1.readResult(InternalTopHits.java:46)
    at org.elasticsearch.search.aggregations.InternalAggregations.readFrom(InternalAggregations.java:190)
    at org.elasticsearch.search.aggregations.InternalAggregations.readAggregations(InternalAggregations.java:172)
    at org.elasticsearch.search.aggregations.bucket.terms.LongTerms.readFrom(LongTerms.java:148)
    at org.elasticsearch.search.aggregations.bucket.terms.LongTerms$1.readResult(LongTerms.java:48)
    at org.elasticsearch.search.aggregations.bucket.terms.LongTerms$1.readResult(LongTerms.java:44)
    at org.elasticsearch.search.aggregations.InternalAggregations.readFrom(InternalAggregations.java:190)
    at org.elasticsearch.search.aggregations.InternalAggregations.readAggregations(InternalAggregations.java:172)
    at org.elasticsearch.search.query.QuerySearchResult.readFromWithId(QuerySearchResult.java:175)
    at org.elasticsearch.search.query.QuerySearchResult.readFrom(QuerySearchResult.java:162)
    at org.elasticsearch.transport.netty.MessageChannelHandler.handleResponse(MessageChannelHandler.java:150)
    ... 23 more

The query uses search_type "count", a lot of filters, and multiple aggregations, including a top_hits agg.
If I retry the same query, I get a correct response with no error messages in the log 99.9% of the time.
If I don't use the top_hits agg, or use the standard search type "query and fetch", there is no error message.

My ES test cluster runs on CentOS 6.4 (Xeon, 20 GB memory) with Oracle Java 1.8.0_25 and Elasticsearch 1.4.2, two instances with a 10 GB heap each.

@clintongormley

Hi @mitallast

Thanks for reporting. Could you provide the smallest recreation of the problem that you can come up with? That will help with debugging.

thanks

@martijnvg sounds like it could be a problem with top-hits?

@martijnvg
Member

@clintongormley Yes, this does look like a bug in top_hits agg.
@mitallast A recreation would be helpful

@mitallast
Author

I'm still trying to reproduce this as an integration test, so far without success.
On our test cluster, however, the bug can also be reproduced manually, and it looks like this:

2 nodes running 1.4.2 on CentOS with Oracle JVM 8; the first node holds shards 1/4/5/7, the second holds 0/2/3/6 (8 shards total, 0 replicas).
Query:

/index_name/type_name/_search?search_type=count&routing=3016
{
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "clients": {
      "terms": {
        "field": "client_id",
        "size": 5
      },
      "aggregations": {
        "best": {
          "top_hits": {
            "size": 10
          }
        }
      }
    }
  }
}

Error output:

{
error: SearchPhaseExecutionException[Failed to execute phase [query], all shards failed; shardFailures {[pBrJ6zkHS8CKfAopFnmKUg][online][7]: RemoteTransportException[Failed to deserialize response of type [org.elasticsearch.search.query.QuerySearchResult]]; nested: TransportSerializationException[Failed to deserialize response of type [org.elasticsearch.search.query.QuerySearchResult]]; nested: IndexOutOfBoundsException[Invalid index: 214 - Bytes needed: 1470744, maximum is 48585]; }]
status: 500
}

I get this error only if the query is executed on the second node; on the first node everything is OK.
According to the search shards API, routing value 3016 maps to shard 7. The second node does not hold shard 7, but the first does.

Any ideas?

@mitallast
Author

Sorry, I can't share our index data due to NDA restrictions, and I haven't found any test data that reproduces the issue.

I reproduced it in the IntelliJ IDEA debugger and tried to find where the mistake is.

It seems that SearchShardTarget#readFrom reads two extra bytes compared to what SearchShardTarget#writeTo writes.

See the screenshots:
https://www.dropbox.com/s/7ukfm9ozlg0rxzb/%D0%A1%D0%BA%D1%80%D0%B8%D0%BD%D1%88%D0%BE%D1%82%202015-01-19%2019.52.16.png?dl=0
https://www.dropbox.com/s/bux416lf3gnzfpg/%D0%A1%D0%BA%D1%80%D0%B8%D0%BD%D1%88%D0%BE%D1%82%202015-01-19%2020.01.28.png?dl=0

The int length at line 221 should be 22, but ChannelBufferStreamInput.buffer.readerIndex = 439; it looks like it should be 437 (4 bytes for the int, with the value 22 appearing as 0-0-0-22 in a byte-as-decimal view). Maybe something writes with writeVInt but reads with readInt, or something like that.

Look:
https://www.dropbox.com/s/gnuntjuamw1drjp/%D0%A1%D0%BA%D1%80%D0%B8%D0%BD%D1%88%D0%BE%D1%82%202015-01-19%2020.02.29.png?dl=0
Earlier in the stack, InternalSearchHits#readFrom (lines 207-210) parses correct values for totalHits, maxScore, size and lookupSize.
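
The kind of mismatch suspected here can be sketched outside Elasticsearch. The following is a minimal illustration using plain java.io (the class and the `readVInt` helper are hypothetical stand-ins, not the real Elasticsearch stream classes): when one side writes a length as a fixed 4-byte int and the other side reads it back with a variable-length encoding, the reader index desynchronizes from the writer's byte layout.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch, not Elasticsearch code: fixed-width write vs varint read.
public class VarintMismatch {

    // Varint reader in the style of a readVInt: 7 bits of payload per byte,
    // high bit set means "more bytes follow".
    static int readVInt(ByteArrayInputStream in) {
        int b = in.read();
        int value = b & 0x7F;
        int shift = 7;
        while ((b & 0x80) != 0) {
            b = in.read();
            value |= (b & 0x7F) << shift;
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(22); // fixed width: bytes 0, 0, 0, 22 on the wire

        ByteArrayInputStream in = new ByteArrayInputStream(buf.toByteArray());
        int length = readVInt(in); // consumes only the first byte (0x00)
        System.out.println("decoded length = " + length);               // 0, not 22
        System.out.println("bytes consumed = " + (4 - in.available())); // 1, not 4
        // From this point on, every read is offset from where the writer put
        // its data, which eventually surfaces as an IndexOutOfBoundsException
        // with a nonsense "combined index", as in the stack trace above.
    }
}
```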

@mitallast
Author

I may have found the mistake:

org.elasticsearch.search.SearchShardTarget has these methods, which are symmetric and correct:

    @Override
    public void readFrom(StreamInput in) throws IOException {
        if (in.readBoolean()) {
            nodeId = in.readSharedText();
        }
        index = in.readSharedText();
        shardId = in.readVInt();
    }

    @Override
    public void writeTo(StreamOutput out) throws IOException {
        if (nodeId == null) {
            out.writeBoolean(false);
        } else {
            out.writeBoolean(true);
            out.writeSharedText(nodeId);
        }
        out.writeSharedText(index);
        out.writeVInt(shardId);
    }

But when ES executes an uncached query for the first time, the argument to writeTo is a BytesStreamOutput, whose parent class StreamOutput contains:

    public void writeText(Text text) throws IOException {
        if (!text.hasBytes()) {
            final String string = text.string();
            spare.copyChars(string);
            writeInt(spare.length());
            write(spare.bytes(), 0, spare.length());
        } else {
            BytesReference bytes = text.bytes();
            writeInt(bytes.length());
            bytes.writeTo(this);
        }
    }

    public void writeSharedText(Text text) throws IOException {
        writeText(text);
    }

This means that org.elasticsearch.common.io.stream.StreamOutput#writeSharedText is just a proxy to org.elasticsearch.common.io.stream.StreamOutput#writeText.

But when the response is read on the other node, a HandlesStreamInput is used, with this overriding implementation:

    @Override
    public Text readSharedText() throws IOException {
        byte b = in.readByte();
        if (b == 0) {
            int handle = in.readVInt();
            Text s = in.readText();
            handlesText.put(handle, s);
            return s;
        } else if (b == 1) {
            return handlesText.get(in.readVInt());
        } else if (b == 2) {
            return in.readText();
        } else {
            throw new IOException("Expected handle header, got [" + b + "]");
        }
    }

These two implementations are not byte-compatible with each other.
Maybe HandlesStreamInput should not override readSharedText, but only readSharedString?
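
The asymmetry can be demonstrated outside Elasticsearch. Below is a minimal sketch with plain java.io and hypothetical stand-in methods (not the real StreamOutput / HandlesStreamInput): the writer emits [4-byte length][utf-8 bytes] with no handle header, while the reader's b == 0 branch first consumes a header byte and a one-byte varint handle, so the length is assembled two bytes past where it was written.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Sketch, not Elasticsearch code: handle-less writer vs handle-expecting reader.
public class SharedTextMismatch {

    // Writer side, in the spirit of StreamOutput#writeText (writeSharedText
    // just delegates to it): fixed 4-byte length, then the raw bytes.
    static byte[] writeSharedText(String text) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
        return buf.toByteArray();
    }

    // Reader side, following the b == 0 branch of
    // HandlesStreamInput#readSharedText; returns the length it
    // believes the text payload has.
    static int misreadLength(DataInputStream in) throws IOException {
        byte b = in.readByte();  // first length byte (0) is taken as the header
        if (b != 0) throw new IOException("Expected handle header, got [" + b + "]");
        int handle = in.read();  // second length byte (0) becomes the varint "handle"
        return in.readInt();     // length is now read 2 bytes too far into the stream
    }

    public static void main(String[] args) throws IOException {
        byte[] wire = writeSharedText("node-1"); // wire: 0, 0, 0, 6, 'n', 'o', 'd', 'e', '-', '1'
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire));
        // The length is assembled from the bytes [0, 6, 'n', 'o'] instead of
        // [0, 0, 0, 6], producing the same kind of huge bogus "Bytes needed"
        // value as in the error output above:
        System.out.println(misreadLength(in)); // 421487
    }
}
```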

@thomas11
Contributor

Unfortunately I couldn't reproduce it either, but I can confirm the issue exactly as mitallast described. Not setting search_type to "count" makes it disappear.

@s1monw
Copy link
Contributor

s1monw commented Feb 11, 2015

@mitallast you are absolutely right. I found the same issue on a different occasion and must have missed this issue. We fixed this recently and removed read/writeSharedText and friends from master completely. For 1.x and the upcoming 1.4.3 release, this is fixed by #9500

@s1monw
Contributor

s1monw commented Feb 11, 2015

> Unfortunately I couldn't reproduce it either, but I can confirm the issue exactly as mitallast described. Not setting search_type to "count" makes it disappear.

yeah we don't use the query cache if search_type is not count :)
