Duplicate results when using a Scrolled Search #60

Closed
Edward-Francis opened this issue Mar 5, 2015 · 10 comments

Comments

@Edward-Francis

When using the scroll search I am getting duplicate results. I am expecting 1 document to be returned with my query, but it returns either one, two or three documents. The documents returned are exactly the same and have the same ID.

This is what I am doing:

my $scroll = $es->scroll_helper(
    index => 'my_index',
    type  => 'my_type',
    size  => 1000,
    query => $query,
);

while ( $scroll->refill_buffer ) {

    push @results, $scroll->drain_buffer;

}

Response:

[  
    {   "_id": "AUvp7na9A9AKOp2XfDLp",
        "_index": "my_index",
        "_score": "undef",
        "_type": "my_type"
    },
    {   "_id": "AUvp7na9A9AKOp2XfDLp",
        "_index": "my_index",
        "_score": "undef",
        "_type": "my_type"
    },
    {   "_id": "AUvp7na9A9AKOp2XfDLp",
        "_index": "my_index",
        "_score": "undef",
        "_type": "my_type"
    }
]

We currently have 2 clusters on different versions - the code works as expected on Elasticsearch 1.1.1 but not on Elasticsearch 1.4.4.

@clintongormley
Contributor

Hi @Edward-Francis

I think you may be running into this bug in Elasticsearch elastic/elasticsearch#8788

Is this data you have indexed newly on 1.4.4, or are they documents that you indexed on some older version?

@Edward-Francis
Author

Hi @clintongormley,

The index is brand new on both versions.

I don't seem to be able to replicate the problem using curl alone — only when I use Search::Elasticsearch.

@clintongormley
Contributor

@Edward-Francis Please could you do the following:

Run this query and send the output:

GET _search?explain&pretty
{
  "query": {
    "term": {
      "_id": "AUvp7na9A9AKOp2XfDLp"
    }
  }
}

Turn on trace logging, run your scroll request (which generates duplicates) and send me the logs, eg:

my $es = Search::Elasticsearch->new( trace_to => ['File', 'output.log'] );

thanks

@Edward-Francis
Author

Hi @clintongormley,

I've had to change the id because that id has been lost during re-indexing.

I've noticed that when I remove the sort clause it seems to be fine. Could this be the cause?

The curl:

ed@w-play-dev-es-1:~$ curl -XGET 'http://localhost:9200/_search?explain&pretty' -d '{ "query" : { "term" : { "_id": "AUv-9FpveML5spcUZJ81" } } }'
{
  "took" : 14,
  "timed_out" : false,
  "_shards" : {
    "total" : 102,
    "successful" : 102,
    "failed" : 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [ {
      "_shard" : 0,
      "_node" : "0m5tq6RYTBaY_oLtaMuJgA",
      "_index" : "my_index",
      "_type" : "my_type",
      "_id" : "AUv-9FpveML5spcUZJ81",
      "_score" : 1.0,
      "_source":{"name":null,"location_id":"217"},
      "_explanation" : {
        "value" : 1.0,
        "description" : "ConstantScore(_uid:my_type#AUv-9FpveML5spcUZJ81), product of:",
        "details" : [ {
          "value" : 1.0,
          "description" : "boost"
        }, {
          "value" : 1.0,
          "description" : "queryNorm"
        } ]
      }
    } ]
  }
}

And the logging:

[Mon Mar  9 15:15:43 2015] # Request to: http://w-play-dev-es-2:9200
curl -XGET 'http://localhost:9200/my_index/my_type/_search?pretty=1&scroll=1m&size=1000' -d '
{
   "sort" : [
      {
         "name" : {
            "order" : "asc"
         }
      }
   ],
   "query" : {
      "term" : {
         "property_id" : 914662
      }
   }
}
'

[Mon Mar  9 15:15:43 2015] # Response: 200, Took: 12 ms
# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "name" : null,
#                "location_id" : "217"
#             },
#             "sort" : [
#                null
#             ],
#             "_score" : null,
#             "_index" : "my_index",
#             "_id" : "AUv-9FpveML5spcUZJ81",
#             "_type" : "my_type"
#          }
#       ],
#       "max_score" : null,
#       "total" : 1
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "_scroll_id" : "cXVlcnlUaGVuRmV0Y2g7NTsyMzEyMDpsVnZnMWxGM1RoLTVNQXVPYm05TGFBOzIzMTIzOmxWdmcxbEYzVGgtNU1BdU9ibTlMYUE7MTY1MTA6MG01dHE2UllUQmFZX29MdGFNdUpnQTsyMzEyMjpsVnZnMWxGM1RoLTVNQXVPYm05TGFBOzIzMTIxOmxWdmcxbEYzVGgtNU1BdU9ibTlMYUE7MDs=",
#    "took" : 4
# }

[Mon Mar  9 15:15:43 2015] # Request to: http://w-play-dev-es-1:9200
curl -XGET 'http://localhost:9200/_search/scroll?pretty=1&scroll=1m' -d '
cXVlcnlUaGVuRmV0Y2g7NTsyMzEyMDpsVnZnMWxGM1RoLTVNQXVPYm05TGFBOzIzMTIzOmxWdmcxbEYzVGgtNU1BdU9ibTlMYUE7MTY1MTA6MG01dHE2UllUQmFZX29MdGFNdUpnQTsyMzEyMjpsVnZnMWxGM1RoLTVNQXVPYm05TGFBOzIzMTIxOmxWdmcxbEYzVGgtNU1BdU9ibTlMYUE7MDs='

[Mon Mar  9 15:15:43 2015] # Response: 200, Took: 5 ms
# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "name" : null,
#                "location_id" : "217"
#             },
#             "sort" : [
#                null
#             ],
#             "_score" : null,
#             "_index" : "my_index",
#             "_id" : "AUv-9FpveML5spcUZJ81",
#             "_type" : "my_type"
#          }
#       ],
#       "max_score" : null,
#       "total" : 1
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "_scroll_id" : "cXVlcnlUaGVuRmV0Y2g7NTsyMzEyMDpsVnZnMWxGM1RoLTVNQXVPYm05TGFBOzIzMTIzOmxWdmcxbEYzVGgtNU1BdU9ibTlMYUE7MTY1MTA6MG01dHE2UllUQmFZX29MdGFNdUpnQTsyMzEyMjpsVnZnMWxGM1RoLTVNQXVPYm05TGFBOzIzMTIxOmxWdmcxbEYzVGgtNU1BdU9ibTlMYUE7MDs=",
#    "took" : 4
# }

[Mon Mar  9 15:15:43 2015] # Request to: http://w-play-dev-es-3:9200
curl -XGET 'http://localhost:9200/_search/scroll?pretty=1&scroll=1m' -d '
cXVlcnlUaGVuRmV0Y2g7NTsyMzEyMDpsVnZnMWxGM1RoLTVNQXVPYm05TGFBOzIzMTIzOmxWdmcxbEYzVGgtNU1BdU9ibTlMYUE7MTY1MTA6MG01dHE2UllUQmFZX29MdGFNdUpnQTsyMzEyMjpsVnZnMWxGM1RoLTVNQXVPYm05TGFBOzIzMTIxOmxWdmcxbEYzVGgtNU1BdU9ibTlMYUE7MDs='

[Mon Mar  9 15:15:43 2015] # Response: 200, Took: 4 ms
# {
#    "hits" : {
#       "hits" : [],
#       "max_score" : null,
#       "total" : 1
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "_scroll_id" : "cXVlcnlUaGVuRmV0Y2g7NTsyMzEyMDpsVnZnMWxGM1RoLTVNQXVPYm05TGFBOzIzMTIzOmxWdmcxbEYzVGgtNU1BdU9ibTlMYUE7MTY1MTA6MG01dHE2UllUQmFZX29MdGFNdUpnQTsyMzEyMjpsVnZnMWxGM1RoLTVNQXVPYm05TGFBOzIzMTIxOmxWdmcxbEYzVGgtNU1BdU9ibTlMYUE7MDs=",
#    "took" : 2
# }

@Edward-Francis
Author

@clintongormley, any idea on this?

@clintongormley
Contributor

Hi @Edward-Francis

I'm unable to replicate this locally on 1.4.3 or 1.5.0.

You're sorting on a null value, which makes me wonder if you're running into something related to elastic/elasticsearch#9157 . This was fixed in 1.4.3, but maybe there is another hidden bug here.

Either way, the problem is in Elasticsearch, not in the Perl API. Please could you open a ticket there instead (and it'd be great if you could provide a full recreation, if possible).

thanks

@clintongormley
Contributor

@Edward-Francis are you by any chance using logstash in your cluster? See elastic/elasticsearch#10244 (comment) for the reason I ask

@Edward-Francis
Author

@clintongormley: I've written a little script to test this issue — it works as expected on v1.1.1 but not on v1.4.4. But I still can't replicate it using curl alone.

We are also using Logstash on the clusters.

use v5.16.3;
use strict;
use warnings;

use DDP;
use Search::Elasticsearch;

my $es = Search::Elasticsearch->new(
    nodes => [
        'w-play-dev-es-1:9200', 'w-play-dev-es-2:9200',
        'w-play-dev-es-3:9200',
    ],
    trace_to => ['File', '/tmp/es_output'],
);

my $index = 'my_index';
my $type  = 'my_type';

eval { $es->indices->delete( index => $index ) };

$es->indices->create(
    index => $index,
    body  => {
        mappings => {
            $type => {
                properties => {
                    id => { type => 'integer' },
                    (
                        map { $_ => { type => 'string', index => 'not_analyzed' } }
                            qw/name email country city/
                    ),
                }
            }
        }
    }
);

for ( data() ) {

    $es->index(
        index => $index,
        type  => $type,
        id    => $_->{id},
        body  => $_,
    );

}

sleep(1);

for ( 1 .. 5 ) {

    say "------";

    my $scroll = $es->scroll_helper(
        index => $index,
        type  => $type,
        size  => 500,
        body  => query(),
    );

    say "Total hits: " . $scroll->total;

    my @results;

    while ( $scroll->refill_buffer ) {

        push @results, $scroll->drain_buffer;

    }

    say "Total results: " . scalar @results;
}


sub query {
    return {
        query => { term => { id => 3 } },
        sort  => [ { city => { order => 'asc' } } ],
    };
}

sub data {
    return (
        {   id      => 1,
            name    => "Christopher Schmidt",
            email   => "cschmidt0\@meetup.com",
            country => "Cameroon",
            city    => "Douala"
        },
        {   id      => 2,
            name    => "Gloria Banks",
            email   => "gbanks1\@joomla.org",
            country => "Argentina",
            city    => "Libertador General San Martín"
        },
        {   id      => 3,
            name    => "Elizabeth Shaw",
            email   => "eshaw2\@huffingtonpost.com",
            country => "Armenia"
        },
        {   id      => 4,
            name    => "Anna Fisher",
            email   => "afisher3\@wikipedia.org",
            country => "Indonesia"
        },
        {   id      => 5,
            name    => "Nicholas Ford",
            email   => "nford4\@trellian.com",
            country => "Yemen"
        },
        {   id      => 6,
            name    => "Terry Sanders",
            email   => "tsanders5\@cnn.com",
            country => "China"
        },
        {   id      => 7,
            name    => "Susan Shaw",
            email   => "sshaw6\@nba.com",
            country => "Russia",
            city    => "Kislovodsk"
        },
        {   id      => 8,
            name    => "Sara Flores",
            email   => "sflores7\@nytimes.com",
            country => "Brazil",
            city    => "Arapongas"
        },
        {   id      => 9,
            name    => "Mark White",
            email   => "mwhite8\@statcounter.com",
            country => "China",
            city    => "Bayan Hure"
        },
        {   id      => 10,
            name    => "Cynthia Medina",
            email   => "cmedina9\@miitbeian.gov.cn",
            country => "Russia"
        }
    );
}

@clintongormley
Contributor

Hi @Edward-Francis

Many thanks for the recreation. The problem is indeed caused by the older version of Elasticsearch that Logstash embeds, i.e. the same as elastic/elasticsearch#10244

If you change the protocol setting for your Elasticsearch output to transport or http, then the problem goes away.
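(For anyone landing here later: the setting in question lives in Logstash's elasticsearch output block. A minimal sketch, assuming a Logstash 1.4-era config — the host value is a placeholder:)

```
output {
  elasticsearch {
    host     => "localhost"
    protocol => "http"    # or "transport"; avoids the embedded node client
  }
}
```

The default "node" protocol joins the cluster with Logstash's bundled (older) Elasticsearch client, which is what triggers the duplicate scroll results described above.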

@Edward-Francis
Author

Ah great - thank you!
