Sampler Aggregator cannot be used in terms agg order #10719

Closed

colings86 opened this issue Apr 22, 2015 · 4 comments
@colings86 (Contributor)

Sampler Aggregator is a single-bucket aggregator, but if you try to use it as part of the order in a terms aggregation it fails. Below is a Sense script to reproduce the issue:

POST test/doc/1
{"color":"YELLOW","date":1500000009,"weight":105}

POST test/doc/2
{"color":"YELLOW","date":1500000008,"weight":104}

POST test/doc/3
{"color":"YELLOW","date":1500000007,"weight":103}

POST test/doc/11
{"color":"RED","date":1500000009,"weight":205}

GET test/doc/_search
{
  "size": 0,
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "date": {
        "order": "desc"
      }
    }
  ],
  "aggregations": {
    "distinctColors": {
      "terms": {
        "field": "color",
        "size": 1,
        "order": {
          "sample>max_weight.value": "asc"
        }
      },
      "aggregations": {
        "sample": {
          "sampler": {
            "shard_size": 1
          },
          "aggs": {
            "max_weight": {
              "max": {
                "field": "weight"
              }
            }
          }
        }
      }
    }
  }
}

The search request throws an ArrayStoreException:

org.elasticsearch.transport.RemoteTransportException: [Stallior][inet[/192.168.0.7:9300]][indices:data/read/search[phase/query]]
Caused by: java.lang.ArrayStoreException
    at java.lang.System.arraycopy(Native Method)
    at org.elasticsearch.search.aggregations.support.AggregationPath.subPath(AggregationPath.java:191)
    at org.elasticsearch.search.aggregations.support.AggregationPath.validate(AggregationPath.java:307)
    at org.elasticsearch.search.aggregations.bucket.terms.InternalOrder.validate(InternalOrder.java:145)
    at org.elasticsearch.search.aggregations.bucket.terms.InternalOrder.validate(InternalOrder.java:138)
    at org.elasticsearch.search.aggregations.bucket.terms.TermsAggregator.<init>(TermsAggregator.java:141)
    at org.elasticsearch.search.aggregations.bucket.terms.AbstractStringTermsAggregator.<init>(AbstractStringTermsAggregator.java:39)
    at org.elasticsearch.search.aggregations.bucket.terms.GlobalOrdinalsStringTermsAggregator.<init>(GlobalOrdinalsStringTermsAggregator.java:75)
    at org.elasticsearch.search.aggregations.bucket.terms.TermsAggregatorFactory$ExecutionMode$2.create(TermsAggregatorFactory.java:70)
    at org.elasticsearch.search.aggregations.bucket.terms.TermsAggregatorFactory.doCreateInternal(TermsAggregatorFactory.java:223)
    at org.elasticsearch.search.aggregations.support.ValuesSourceAggregatorFactory.createInternal(ValuesSourceAggregatorFactory.java:57)
    at org.elasticsearch.search.aggregations.AggregatorFactory.create(AggregatorFactory.java:95)
    at org.elasticsearch.search.aggregations.AggregatorFactories.createTopLevelAggregators(AggregatorFactories.java:69)
    at org.elasticsearch.search.aggregations.AggregationPhase.preProcess(AggregationPhase.java:77)
    at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:96)
    at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:296)
    at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:307)
    at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:422)
    at org.elasticsearch.search.action.SearchServiceTransportAction$SearchQueryTransportHandler.messageReceived(SearchServiceTransportAction.java:1)
    at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:340)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)

If you debug at org.elasticsearch.search.aggregations.support.AggregationPath.subPath(AggregationPath.java:191), you can see that the aggregator being tested is of type AggregatorFactory$1, which wraps the SamplerAggregator. This wrapper is created by the asMultiBucketAggregator(this, context, parent) call in SamplerAggregator$Factory.createInternal(...).
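
To illustrate the wrapping (a hypothetical, simplified sketch, not the actual Elasticsearch source): asMultiBucketAggregator wraps a single-bucket aggregator so that each parent bucket ordinal gets its own fresh delegate, and each delegate only ever collects into ordinal 0. Something in this spirit:

// Hypothetical, simplified sketch: not the actual Elasticsearch source.
// A single-bucket aggregator that ignores parentBucketOrdinal gets wrapped
// so that each parent ordinal receives its own fresh delegate instance.
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

class MultiBucketWrapper {

    interface SingleBucketAggregator {
        void collect(int doc, long bucketOrdinal);
    }

    private final Supplier<SingleBucketAggregator> factory; // e.g. creates a SamplerAggregator
    private final List<SingleBucketAggregator> perOrdinal = new ArrayList<>();

    MultiBucketWrapper(Supplier<SingleBucketAggregator> factory) {
        this.factory = factory;
    }

    void collect(int doc, long parentBucketOrdinal) {
        // One delegate per parent ordinal, created lazily.
        while (perOrdinal.size() <= parentBucketOrdinal) {
            perOrdinal.add(factory.get());
        }
        // Each delegate only ever sees ordinal 0: the reason wrapping is needed.
        perOrdinal.get((int) parentBucketOrdinal).collect(doc, 0);
    }
}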

The reason SamplerAggregator has to be wrapped is that its collectors do not take the parentBucketOrdinal into account.

We should update the SamplerAggregator (including the Diversity parts) to collect documents for each parentBucketOrdinal so that it no longer needs to be wrapped and can be used in ordering like the other single-bucket aggregators.
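
A minimal sketch of that direction, assuming a bounded, score-based sample kept per ordinal; class and method names here are hypothetical, not the eventual implementation:

// Hypothetical sketch only: keep a separate bounded sample per
// parentBucketOrdinal so no per-bucket wrapper aggregators are needed.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class PerOrdinalSampler {

    static final class Entry {
        final int doc;
        final float score;
        Entry(int doc, float score) { this.doc = doc; this.score = score; }
    }

    private final int shardSize; // max docs sampled per bucket, as in the sampler's shard_size
    private final List<PriorityQueue<Entry>> samples = new ArrayList<>();

    PerOrdinalSampler(int shardSize) {
        this.shardSize = shardSize;
    }

    void collect(int doc, float score, long parentBucketOrdinal) {
        while (samples.size() <= parentBucketOrdinal) {
            // Min-heap on score: the weakest sampled doc is evicted first.
            samples.add(new PriorityQueue<>(Comparator.comparingDouble((Entry e) -> e.score)));
        }
        PriorityQueue<Entry> sample = samples.get((int) parentBucketOrdinal);
        sample.add(new Entry(doc, score));
        if (sample.size() > shardSize) {
            sample.poll(); // keep only the top shardSize docs for this bucket
        }
    }
}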

@markharwood (Contributor)

@jpountz, can you confirm my assumption: are the parent bucket IDs that aggs are asked to collect on compact and ascending (0, 1, 2, 3, ...), or do I have to allow for very sparse values (7, 10342, ...)?
This dictates whether I use a map or an array in my sampler collection, and also whether I should in turn rebase the IDs of the buckets that survive the "best docs" selection process.
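
For illustration, the two storage options being weighed, with hypothetical names (a growable array for dense ordinals versus a map for sparse ones):

// Hypothetical illustration of the map-vs-array trade-off above.
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

class OrdinalStorage {

    // Option A: compact, ascending ordinals (0, 1, 2, ...) allow a growable
    // array indexed directly by ordinal; O(1) lookups, no boxing.
    private long[] dense = new long[8];

    void incrementDense(long ord) {
        if (ord >= dense.length) {
            dense = Arrays.copyOf(dense, Math.max((int) ord + 1, dense.length * 2));
        }
        dense[(int) ord]++;
    }

    // Option B: very sparse ordinals (7, 10342, ...) would force a map,
    // paying hashing plus Long boxing on every collect.
    private final Map<Long, Long> sparse = new HashMap<>();

    void incrementSparse(long ord) {
        sparse.merge(ord, 1L, Long::sum);
    }
}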

@jpountz (Contributor) commented Apr 23, 2015

@markharwood Indeed, they are fine to use as array indices. However, I'm confused about why you mention "surviving" buckets, as the sampler aggregator should not filter buckets? My assumption was that it would just compute a different sample on each bucket?

@markharwood (Contributor)

> My assumption was that it would just compute a different sample on each bucket?

My bad. You are correct.
On a separate point: when replaying the deferred collection(s), I need to replay collects in docID order along with the choice of bucket ID, and there may be more than one bucket per doc ID. A convenient way of doing this that avoids extra object allocations is to take the ScoreDocs produced from each of the samples, sneak the bucket ID into the "shardIndex" int value they hold, and then sort them for replay. A bit hacky (casting long bucket IDs to ints), but should be OK?
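
A sketch of that trick, using Lucene's real ScoreDoc fields (doc, score, shardIndex) but a hypothetical prepareReplay helper:

// Sketch of the shardIndex trick described above. Assumes the bucket IDs fit
// in an int (hence the "hacky" caveat about casting long ordinals down).
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

import org.apache.lucene.search.ScoreDoc;

class DeferredReplay {

    // Flatten the per-bucket samples into one list sorted by docID; the
    // bucket ID rides along in ScoreDoc.shardIndex, so no new objects are
    // allocated per collected doc.
    static List<ScoreDoc> prepareReplay(List<ScoreDoc[]> perBucketSamples) {
        List<ScoreDoc> all = new ArrayList<>();
        for (int bucket = 0; bucket < perBucketSamples.size(); bucket++) {
            for (ScoreDoc sd : perBucketSamples.get(bucket)) {
                sd.shardIndex = bucket; // smuggle the bucket ID into shardIndex
                all.add(sd);
            }
        }
        // Replay must proceed in docID order; the same doc may appear once
        // per bucket, so ties on doc are expected and harmless.
        all.sort(Comparator.comparingInt((ScoreDoc sd) -> sd.doc));
        return all;
    }
}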

@jpountz (Contributor) commented Apr 23, 2015

This hack sounds OK to me; if you have more than Integer.MAX_VALUE buckets to collect for such an aggregator, you will have other issues anyway.
