
Failed to deserialize response when using Top Hits aggregation with search_type "count" #9294

Closed
mitallast opened this issue Jan 14, 2015 · 8 comments


@mitallast

Hi there, I have a lot of error messages in logs like this:

org.elasticsearch.transport.RemoteTransportException: Failed to deserialize response of type [org.elasticsearch.search.query.QuerySearchResult]
Caused by: org.elasticsearch.transport.TransportSerializationException: Failed to deserialize response of type [org.elasticsearch.search.query.QuerySearchResult]
    at org.elasticsearch.transport.netty.MessageChannelHandler.handleResponse(MessageChannelHandler.java:152)
    at org.elasticsearch.transport.netty.MessageChannelHandler.messageReceived(MessageChannelHandler.java:127)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline$DefaultChannelHandlerContext.sendUpstream(DefaultChannelPipeline.java:791)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:296)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.unfoldAndFireMessageReceived(FrameDecoder.java:462)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.callDecode(FrameDecoder.java:443)
    at org.elasticsearch.common.netty.handler.codec.frame.FrameDecoder.messageReceived(FrameDecoder.java:303)
    at org.elasticsearch.common.netty.channel.SimpleChannelUpstreamHandler.handleUpstream(SimpleChannelUpstreamHandler.java:70)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:564)
    at org.elasticsearch.common.netty.channel.DefaultChannelPipeline.sendUpstream(DefaultChannelPipeline.java:559)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:268)
    at org.elasticsearch.common.netty.channel.Channels.fireMessageReceived(Channels.java:255)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.read(NioWorker.java:88)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.process(AbstractNioWorker.java:108)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioSelector.run(AbstractNioSelector.java:318)
    at org.elasticsearch.common.netty.channel.socket.nio.AbstractNioWorker.run(AbstractNioWorker.java:89)
    at org.elasticsearch.common.netty.channel.socket.nio.NioWorker.run(NioWorker.java:178)
    at org.elasticsearch.common.netty.util.ThreadRenamingRunnable.run(ThreadRenamingRunnable.java:108)
    at org.elasticsearch.common.netty.util.internal.DeadLockProofWorker$1.run(DeadLockProofWorker.java:42)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)
Caused by: java.lang.IndexOutOfBoundsException: Invalid combined index of 1471746, maximum is 64403
    at org.elasticsearch.common.netty.buffer.SlicedChannelBuffer.<init>(SlicedChannelBuffer.java:46)
    at org.elasticsearch.common.netty.buffer.HeapChannelBuffer.slice(HeapChannelBuffer.java:201)
    at org.elasticsearch.transport.netty.ChannelBufferStreamInput.readBytesReference(ChannelBufferStreamInput.java:56)
    at org.elasticsearch.common.io.stream.StreamInput.readText(StreamInput.java:222)
    at org.elasticsearch.common.io.stream.HandlesStreamInput.readSharedText(HandlesStreamInput.java:69)
    at org.elasticsearch.search.SearchShardTarget.readFrom(SearchShardTarget.java:103)
    at org.elasticsearch.search.SearchShardTarget.readSearchShardTarget(SearchShardTarget.java:87)
    at org.elasticsearch.search.internal.InternalSearchHits.readFrom(InternalSearchHits.java:217)
    at org.elasticsearch.search.internal.InternalSearchHits.readFrom(InternalSearchHits.java:203)
    at org.elasticsearch.search.internal.InternalSearchHits.readSearchHits(InternalSearchHits.java:197)
    at org.elasticsearch.search.aggregations.metrics.tophits.InternalTopHits.readFrom(InternalTopHits.java:137)
    at org.elasticsearch.search.aggregations.metrics.tophits.InternalTopHits$1.readResult(InternalTopHits.java:50)
    at org.elasticsearch.search.aggregations.metrics.tophits.InternalTopHits$1.readResult(InternalTopHits.java:46)
    at org.elasticsearch.search.aggregations.InternalAggregations.readFrom(InternalAggregations.java:190)
    at org.elasticsearch.search.aggregations.InternalAggregations.readAggregations(InternalAggregations.java:172)
    at org.elasticsearch.search.aggregations.bucket.terms.LongTerms.readFrom(LongTerms.java:148)
    at org.elasticsearch.search.aggregations.bucket.terms.LongTerms$1.readResult(LongTerms.java:48)
    at org.elasticsearch.search.aggregations.bucket.terms.LongTerms$1.readResult(LongTerms.java:44)
    at org.elasticsearch.search.aggregations.InternalAggregations.readFrom(InternalAggregations.java:190)
    at org.elasticsearch.search.aggregations.InternalAggregations.readAggregations(InternalAggregations.java:172)
    at org.elasticsearch.search.query.QuerySearchResult.readFromWithId(QuerySearchResult.java:175)
    at org.elasticsearch.search.query.QuerySearchResult.readFrom(QuerySearchResult.java:162)
    at org.elasticsearch.transport.netty.MessageChannelHandler.handleResponse(MessageChannelHandler.java:150)
    ... 23 more

The query uses search_type "count", a lot of filters, and multiple aggregations, including a top_hits agg.
If I retry the same query, I get a correct response with no error messages in the log 99.9% of the time.
If I don't use the top_hits agg, or use the standard search type "query and fetch", there is no error message.

My ES test cluster runs on CentOS 6.4 (Xeon, 20 GB memory) with Oracle Java 1.8.0_25 and Elasticsearch 1.4.2, two instances with a 10 GB heap each.

@clintongormley

Hi @mitallast

Thanks for reporting. Could you provide the smallest recreation of the problem that you can come up with? That will help with debugging.

thanks

@martijnvg sounds like it could be a problem with top-hits?

@martijnvg
Member

@clintongormley Yes, this does look like a bug in top_hits agg.
@mitallast A recreation would be helpful

@mitallast
Author

I'm still trying to reproduce this as an integration test, so far without success.
On our test cluster, however, the bug can also be reproduced manually, and it looks like this:

2 nodes running 1.4.2 on CentOS with Oracle JVM 8; the first node holds shards 1/4/5/7, the second holds 0/2/3/6 (8 shards total, 0 replicas).
Query:

/index_name/type_name/_search?search_type=count&routing=3016
{
  "query": {
    "match_all": {}
  },
  "aggregations": {
    "clients": {
      "terms": {
        "field": "client_id",
        "size": 5
      },
      "aggregations": {
        "best": {
          "top_hits": {
            "size": 10
          }
        }
      }
    }
  }
}

Error output:

{
error: SearchPhaseExecutionException[Failed to execute phase [query], all shards failed; shardFailures {[pBrJ6zkHS8CKfAopFnmKUg][online][7]: RemoteTransportException[Failed to deserialize response of type [org.elasticsearch.search.query.QuerySearchResult]]; nested: TransportSerializationException[Failed to deserialize response of type [org.elasticsearch.search.query.QuerySearchResult]]; nested: IndexOutOfBoundsException[Invalid index: 214 - Bytes needed: 1470744, maximum is 48585]; }]
status: 500
}

I get this error only if the query is executed on the second node; on the first node everything is OK.
According to the search shards API, routing value 3016 maps to shard 7. The second node does not hold shard 7, but the first does.

Any ideas?

@mitallast
Author

Sorry, I can't share our index data due to NDA restrictions, and I haven't found any test data that reproduces the issue.

I reproduced it in the IntelliJ IDEA debugger and tried to find where the mistake is.

It seems that SearchShardTarget#readFrom reads two extra bytes compared to what SearchShardTarget#writeTo writes.

See the screenshots:
https://www.dropbox.com/s/7ukfm9ozlg0rxzb/%D0%A1%D0%BA%D1%80%D0%B8%D0%BD%D1%88%D0%BE%D1%82%202015-01-19%2019.52.16.png?dl=0
https://www.dropbox.com/s/bux416lf3gnzfpg/%D0%A1%D0%BA%D1%80%D0%B8%D0%BD%D1%88%D0%BE%D1%82%202015-01-19%2020.01.28.png?dl=0

The int length at line 221 should be 22, but ChannelBufferStreamInput.buffer.readerIndex = 439; it looks like it should be 437 (4 bytes for the int, with the value 22 appearing as 0-0-0-22 in a byte-as-decimal view). Maybe something writes with writeVInt but reads with readInt, or something like that.

Look:
https://www.dropbox.com/s/gnuntjuamw1drjp/%D0%A1%D0%BA%D1%80%D0%B8%D0%BD%D1%88%D0%BE%D1%82%202015-01-19%2020.02.29.png?dl=0
Earlier in the stack, InternalSearchHits#readFrom (lines 207-210) parses correct values for totalHits, maxScore, size and lookupSize.
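
The kind of mismatch suspected here can be sketched outside Elasticsearch. The following is a minimal illustration using plain java.io (the class and the `readVInt` helper are hypothetical stand-ins, not the real Elasticsearch stream classes): when one side writes a length as a fixed 4-byte int and the other side reads it back with a variable-length encoding, the reader index desynchronizes from the writer's byte layout.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch, not Elasticsearch code: fixed-width write vs varint read.
public class VarintMismatch {

    // Varint reader in the style of a readVInt: 7 bits of payload per byte,
    // high bit set means "more bytes follow".
    static int readVInt(ByteArrayInputStream in) {
        int b = in.read();
        int value = b & 0x7F;
        int shift = 7;
        while ((b & 0x80) != 0) {
            b = in.read();
            value |= (b & 0x7F) << shift;
            shift += 7;
        }
        return value;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(22); // fixed width: bytes 0, 0, 0, 22 on the wire

        ByteArrayInputStream in = new ByteArrayInputStream(buf.toByteArray());
        int length = readVInt(in); // consumes only the first byte (0x00)
        System.out.println("decoded length = " + length);               // 0, not 22
        System.out.println("bytes consumed = " + (4 - in.available())); // 1, not 4
        // From this point on, every read is offset from where the writer put
        // its data, which eventually surfaces as an IndexOutOfBoundsException
        // with a nonsense "combined index", as in the stack trace above.
    }
}
```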

@mitallast
Author

I may have found the mistake:

org.elasticsearch.search.SearchShardTarget has these methods, which are symmetric and correct:

    @Override
    public void readFrom(StreamInput in) throws IOException {
        if (in.readBoolean()) {
            nodeId = in.readSharedText();
        }
        index = in.readSharedText();
        shardId = in.readVInt();
    }

    @Override
    public void writeTo(StreamOutput out) throws IOException {
        if (nodeId == null) {
            out.writeBoolean(false);
        } else {
            out.writeBoolean(true);
            out.writeSharedText(nodeId);
        }
        out.writeSharedText(index);
        out.writeVInt(shardId);
    }

But when ES executes an uncached query for the first time, the argument to writeTo is a BytesStreamOutput, whose parent class StreamOutput contains:

    public void writeText(Text text) throws IOException {
        if (!text.hasBytes()) {
            final String string = text.string();
            spare.copyChars(string);
            writeInt(spare.length());
            write(spare.bytes(), 0, spare.length());
        } else {
            BytesReference bytes = text.bytes();
            writeInt(bytes.length());
            bytes.writeTo(this);
        }
    }

    public void writeSharedText(Text text) throws IOException {
        writeText(text);
    }

This means that org.elasticsearch.common.io.stream.StreamOutput#writeSharedText is just a proxy to org.elasticsearch.common.io.stream.StreamOutput#writeText.

But when the response is read on the other node, a HandlesStreamInput is used, with this overriding implementation:

    @Override
    public Text readSharedText() throws IOException {
        byte b = in.readByte();
        if (b == 0) {
            int handle = in.readVInt();
            Text s = in.readText();
            handlesText.put(handle, s);
            return s;
        } else if (b == 1) {
            return handlesText.get(in.readVInt());
        } else if (b == 2) {
            return in.readText();
        } else {
            throw new IOException("Expected handle header, got [" + b + "]");
        }
    }

These two implementations are not byte-compatible with each other.
Maybe HandlesStreamInput should not override readSharedText, but only readSharedString?
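
The asymmetry can be demonstrated outside Elasticsearch. Below is a minimal sketch with plain java.io and hypothetical stand-in methods (not the real StreamOutput / HandlesStreamInput): the writer emits [4-byte length][utf-8 bytes] with no handle header, while the reader's b == 0 branch first consumes a header byte and a one-byte varint handle, so the length is assembled two bytes past where it was written.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Sketch, not Elasticsearch code: handle-less writer vs handle-expecting reader.
public class SharedTextMismatch {

    // Writer side, in the spirit of StreamOutput#writeText (writeSharedText
    // just delegates to it): fixed 4-byte length, then the raw bytes.
    static byte[] writeSharedText(String text) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        out.writeInt(bytes.length);
        out.write(bytes);
        return buf.toByteArray();
    }

    // Reader side, following the b == 0 branch of
    // HandlesStreamInput#readSharedText; returns the length it
    // believes the text payload has.
    static int misreadLength(DataInputStream in) throws IOException {
        byte b = in.readByte();  // first length byte (0) is taken as the header
        if (b != 0) throw new IOException("Expected handle header, got [" + b + "]");
        int handle = in.read();  // second length byte (0) becomes the varint "handle"
        return in.readInt();     // length is now read 2 bytes too far into the stream
    }

    public static void main(String[] args) throws IOException {
        byte[] wire = writeSharedText("node-1"); // wire: 0, 0, 0, 6, 'n', 'o', 'd', 'e', '-', '1'
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire));
        // The length is assembled from the bytes [0, 6, 'n', 'o'] instead of
        // [0, 0, 0, 6], producing the same kind of huge bogus "Bytes needed"
        // value as in the error output above:
        System.out.println(misreadLength(in)); // 421487
    }
}
```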

@thomas11
Contributor

Unfortunately I couldn't reproduce it either, but I can confirm the issue exactly as mitallast described. Not setting search_type to "count" makes it disappear.

@s1monw
Copy link
Contributor

s1monw commented Feb 11, 2015

@mitallast you are absolutely right. I found the same issue on a different occasion and must have missed this issue. We fixed this recently and removed read/writeSharedText and friends from master completely. For 1.x and the upcoming 1.4.3 release, this is fixed by #9500

@s1monw
Contributor

s1monw commented Feb 11, 2015

> Unfortunately I couldn't reproduce it either, but I can confirm the issue exactly as mitallast described. Not setting search_type to "count" makes it disappear.

yeah we don't use the query cache if search_type is not count :)
