API: Add response filtering with `filter_path` parameter #10980

tlrx · 2015-05-05T12:18:56Z

This change adds a new "_path" parameter that can be used to filter and reduce the responses returned by the REST API of elasticsearch.

For example, returning only the shards that failed to be optimized:

curl -XPOST 'localhost:9200/beer/_optimize?_path=_shards.failed'
{"_shards":{"failed":0}}%

It supports multiple filters (separated by a comma):

curl -XGET 'localhost:9200/_mapping?pretty&_path=*.mappings.*.properties.name,*.mappings.*.properties.title'

It also supports the YAML response format. Here it returns only the _id field of a newly indexed document:

curl -XPOST 'localhost:9200/library/book?_path=_id' -d '---hello:\n  world: 1\n'

---
_id: "AU0j64-b-stVfkvus5-A"

It also supports wildcards. Here it returns only the host name of every nodes in the cluster:

curl -XGET 'http://localhost:9200/_nodes/stats?_path=nodes.*.host*'
{"nodes":{"lvJHed8uQQu4brS-SXKsNA":{"host":"portable"}}}

And "**" can be used to include sub fields without knowing the exact path. Here it returns only the Lucene version of every segment:

curl 'http://localhost:9200/_segments?pretty&_path=indices.**.version'
{
  "indices" : {
    "beer" : {
      "shards" : {
        "0" : [ {
          "segments" : {
            "_0" : {
              "version" : "5.2.0"
            },
            "_1" : {
              "version" : "5.2.0"
            }
          }
        } ]
      }
    }
  }
}

Note that elasticsearch sometimes returns directly the raw value of a field, like the "_source" field. If you want to filter the response that include _source fields like in Search or Get responses , you should consider using the already existing "fields" parameter:

 curl -XGET 'localhost:9200/beer/tasty/3672?pretty&fields=category'
 {
   "_index" : "beer",
   "_type" : "tasty",
   "_id" : "3672",
   "_version" : 1,
   "found" : true,
   "fields" : {
     "category" : [ "German Lager" ]
   }
 }

The "_path" parameter can be used to further reduce the result:

curl -XGET 'localhost:9200/beer/tasty/3672?pretty&fields=category&_path=fields.category'
{
  "fields" : {
    "category" : [ "German Lager" ]
  }
}

Closes #7401

tlrx · 2015-05-05T12:32:34Z

Note:

if the provided filter specified in _path does not match something, an empty response is returned
if the API returns an error, the error is not filtered even if the _path is specified

clintongormley · 2015-05-05T12:35:20Z

if the provided filter specified in _path does not match something, an empty response is returned

We should return an empty JSON object rather than an empty string, otherwise the client needs to check whether JSON has been returned.

tlrx · 2015-05-05T12:47:25Z

@clintongormley thanks, I updated the code to reflect your last comment.

tlrx · 2015-05-05T13:06:04Z

@spinscale Can you please have a look? Thanks

spinscale · 2015-05-05T15:47:54Z

src/main/java/org/elasticsearch/common/xcontent/json/JsonXContent.java

+
+    @Override
+    public XContentGenerator createGenerator(OutputStream os, String[] filters) throws IOException {
+        if ((filters == null) || (filters.length == 0)) {


use CollectionUtils.isEmpty()?

spinscale · 2015-05-06T07:36:10Z

A couple of things

First, I really would like to see some simple benchmarking, so we have some ballpark numbers, how this affects performance. Especially because write operations that create sub-hierarchies also create new objects.

Minor thing: FilteringJsonGenerator.parse() - why not use an array instead of XContentFilter and next (like your own custom linked list).

One other thing, I know that jackson has some filtering capabiltiies built-in, I have no ideas if we can make use of them though, see here, here and here

tlrx · 2015-05-06T07:55:24Z

One other thing, I know that jackson has some filtering capabiltiies built-in, I have no ideas if we can make use of them though, see here, here and here

That's the first thing we checked before starting working on this pull request. The current filtering capabilities of Jackson (version <2.5.1) can be used only with jackson-databinding lib and apply only to POJOs but not when you use jackson to generate JSON in streaming mode.

Jackson 2.6 will integrate an interesting filtering feature that will work with JSON streaming (see release note here and test class here). It currently uses JSONPointer to filter properties but I think we will be able to implement our own TokenFilter later, once version 2.6 is released.

tlrx · 2015-05-07T11:03:04Z

First, I really would like to see some simple benchmarking, so we have some ballpark numbers, how this affects performance. Especially because write operations that create sub-hierarchies also create new objects.

I added the FilteringJsonGeneratorBenchmark and put some numbers here (not very readable - I apologize for that).

There are no real surprise in the benchmark:

when writing a small number of documents (less than 100) with a small number of fields each(1-10), the filtering adds an overhead that makes it less efficient than what we have today
when writing a larger amount of data (ie large number of small docs, small number of large docs) there's a point where it becomes more efficient to check if fields must be included and skip them rather than writing more data. This point varies depending on the document size and the selectivity of the filter, but in this test it is in between 1-10%.

Minor thing: FilteringJsonGenerator.parse() - why not use an array instead of XContentFilter and next (like your own custom linked list).

There are room for improvements like this one (I'm also thinking of not creating FilterContext for sub fields when a parent field matches the end of a filter), but it might add complexity to the code and I'm not sure if it will be really more efficient.

I'm currently trying some of these improvement to see if they are pertinent or not.

tlrx · 2015-05-07T14:50:27Z

Thinking of improvements again, we should be able to reuse the objects instead of creating new ones for every sub fields. I'll try to improve that.

clintongormley · 2015-05-08T07:21:29Z

when writing a small number of documents (less than 100) with a small number of fields each(1-10), the filtering adds an overhead that makes it less efficient than what we have today

I assume that the code path (and performance) remains the same as today if no _path is specified?

tlrx · 2015-05-08T07:45:49Z

@clintongormley yes. Branching is done here where we fall back to the normal XContent generator.

spinscale · 2015-05-08T15:24:20Z

src/main/java/org/elasticsearch/common/xcontent/support/filtering/FilterContext.java

+    private List<XContentFilter> matchings;
+
+    /**
+     * Flag to indicate if the field/property must be write


... written

spinscale · 2015-05-08T16:05:09Z

added some absolutely minor comments...

First, this doesnt compile using mvn, due to the benchmark class and some locale, but not really an issue.

Few last things:

the kind-of linkedlist data structure in the FilterContext, which could possibly be replaced by a simple array?
This seems the only part, where we use ** as 'greedy' wildcard in Elasticsearch, not sure if it is a problem, just to be aware
No documentation so far?

Maybe we should also note somewhere to switch to the jackson filter features with the 2.6 release, I'd hate to forget that...

LGTM apart from that

rjernst · 2015-05-08T16:57:14Z

I'm late to the party here, but I have a simple question. Why is the parameter _path and not path? Why use the underscore when we control all the url params anyways? And all the other examples I know of (eg fields, pretty, human) do not use underscore?

clintongormley · 2015-05-08T17:00:37Z

@rjernst only because it works well with the existing _source filtering. I kinda like the underscore because it is a "global" param (mind you, so is pretty).

clintongormley · 2015-05-08T17:24:19Z

This seems the only part, where we use ** as 'greedy' wildcard in Elasticsearch, not sure if it is a problem, just to be aware

Elsewhere we use * as a greedy wildcard, which I think is problematic. We took the ** syntax from ant, and I'd actually be tempted to reuse it elsewhere.

kimchy · 2015-05-08T17:25:13Z

actually, since this is a "rest" level parameter, and for those, we have pretty and human, I think we should just call it path, or maybe be more explicit and call it filter_path

rjernst · 2015-05-08T17:26:57Z

+1 on filter_path

kimchy · 2015-05-09T21:53:24Z

here is another suggestion, calling it filter_path_include, in the future we might want to support filter_path_exclude.

The reason I am raising it is that today, we support source level include/exclude when fetching source documents. With this infrastructure, if we eventually expand it to support include_s_/exclude_s_, we can also support zero copy (as in, not create a map of maps/lists representation of source) when someone asks for source includes/includes (we can just create a bytes based filtering generator, and copy structure from the parser right into it).

I helped a fellow on IRC today where the loading and parsing of source include was the perf bottleneck (from 1ms it got up to 170ms for ~1000 docs). I think this will help a lot there eventually.

tlrx · 2015-05-18T16:05:38Z

@spinscale I rebased and updated the code according to your last comments. Can you have a look and do some manual testing please? Thanks a lot :)

Note: latest benchmark numbers are here.

spinscale · 2015-05-18T16:25:57Z

we may also not want to have manual testing but also add that parameter (plus tests) to our REST tests

tlrx · 2015-05-19T10:17:25Z

@spinscale I just added the REST tests. The parameter is still named _path until we reach an agreement on the final name.

spinscale · 2015-05-19T15:31:08Z

rest-api-spec/api/nodes.stats.json

@@ -56,6 +56,10 @@
          "options" : ["node", "indices", "shards"],
          "default" : "node"
        },
+        "_path": {


I think the _path needs to be added to all the methods supporting it, to make sure the clients are aware

No, like pretty and _source, we can just support this parameter everywhere, so don't add it to the rest spec.

spinscale · 2015-05-19T16:06:58Z

left one comment, apart from that it feels like we are close to getting this in, when we get the naming stuff resolved... /cc @clintongormley

clintongormley · 2015-05-20T18:31:27Z

The _source parameter is essentially a synonym for _source_include, so I'd go with filter_path for now. We can always add filter_path_include|exclude later on.

tlrx · 2015-05-20T18:39:51Z

@clintongormley sounds good to me, thanks

tlrx · 2015-05-22T11:41:24Z

@spinscale Updated with your last comment :)

clintongormley · 2015-05-25T10:06:16Z

@spinscale @kimchy does this need another review, or are we good to go?

spinscale · 2015-05-26T06:18:24Z

LGTM on my side

This change adds a new "filter_path" parameter that can be used to filter and reduce the responses returned by the REST API of elasticsearch. For example, returning only the shards that failed to be optimized: ``` curl -XPOST 'localhost:9200/beer/_optimize?filter_path=_shards.failed' {"_shards":{"failed":0}}% ``` It supports multiple filters (separated by a comma): ``` curl -XGET 'localhost:9200/_mapping?pretty&filter_path=*.mappings.*.properties.name,*.mappings.*.properties.title' ``` It also supports the YAML response format. Here it returns only the `_id` field of a newly indexed document: ``` curl -XPOST 'localhost:9200/library/book?filter_path=_id' -d '---hello:\n world: 1\n' --- _id: "AU0j64-b-stVfkvus5-A" ``` It also supports wildcards. Here it returns only the host name of every nodes in the cluster: ``` curl -XGET 'http://localhost:9200/_nodes/stats?filter_path=nodes.*.host*' {"nodes":{"lvJHed8uQQu4brS-SXKsNA":{"host":"portable"}}} ``` And "**" can be used to include sub fields without knowing the exact path. Here it returns only the Lucene version of every segment: ``` curl 'http://localhost:9200/_segments?pretty&filter_path=indices.**.version' { "indices" : { "beer" : { "shards" : { "0" : [ { "segments" : { "_0" : { "version" : "5.2.0" }, "_1" : { "version" : "5.2.0" } } } ] } } } } ``` Note that elasticsearch sometimes returns directly the raw value of a field, like the _source field. If you want to filter _source fields, you should consider combining the already existing _source parameter (see Get API for more details) with the filter_path parameter like this: ``` curl -XGET 'localhost:9200/_search?pretty&filter_path=hits.hits._source&_source=title' { "hits" : { "hits" : [ { "_source":{"title":"Book elastic#2"} }, { "_source":{"title":"Book elastic#1"} }, { "_source":{"title":"Book elastic#3"} } ] } } ```

clintongormley · 2015-05-26T12:22:19Z

w00t

rashidkpc · 2015-06-02T18:09:45Z

WIN Thanks so much everyone for making this happen. This is huge for Kibana!

Related: elastic/elasticsearch#10980, elastic/elasticsearch@ce63590

dkroehan · 2016-02-26T13:34:47Z

Is this already supported via the Java API?

tlrx · 2016-02-29T08:53:24Z

@dkroehan The filter_path parameter is used as the REST level (like the pretty parameter) to render JSON responses and therefore it is not part of the Java API.

If you need to render XContent object instances using filtering in Java, you can use the XContent.builder(XContent xContent, String[] filters) method

nmors · 2016-03-08T06:26:28Z

are excludes supported yet, or just includes?

tlrx · 2016-03-08T07:43:05Z

@nmors Using the REST API and the filter_path parameter, only includes are supported. Using XContent filtering in Java, excludes are supported but you can't mix them both.

nmors · 2016-03-08T10:15:07Z

Thanks, I'm actually using the javascript client. I'm using the filterPath parameter like so (I guess it's a wrapper for filter_path), all is working fine! It would be good to be able to use excludes though, can be very helpful for me as it is web traffic that I would like to keep small. I've managed to cut the response size in half by using the following:

  index: my_index,
  type: my_type,
  body: q,
  filterPath: [
    "aggregations.my_agg.buckets.**.value",
    "aggregations.my_agg.buckets.**.key"
  ]

tlrx added v2.0.0-beta1 review :Core/Infra/REST API REST infrastructure and utilities v1.6.0 labels May 5, 2015

clintongormley added the release highlight label May 5, 2015

spinscale reviewed May 5, 2015
View reviewed changes

spinscale reviewed May 8, 2015
View reviewed changes

spinscale reviewed May 19, 2015
View reviewed changes

clintongormley mentioned this pull request May 25, 2015

Standardise wildcard matching #11334

Closed

tlrx force-pushed the response-filtering-with-path-parameter branch from 234c3b6 to ce63590 Compare May 26, 2015 12:07

tlrx merged commit ce63590 into elastic:master May 26, 2015

kevinkluge removed the review label May 26, 2015

clintongormley changed the title ~~API: Add response filtering with _path parameter~~ API: Add response filtering with filter_path parameter May 27, 2015

clintongormley added the >feature label May 29, 2015

karmi added a commit to elastic/elasticsearch-ruby that referenced this pull request Jun 5, 2015

[API] Added the filter_path parameter

6547bb8

Related: elastic/elasticsearch#10980, elastic/elasticsearch@ce63590

tlrx mentioned this pull request Sep 4, 2015

Update to Jackson 2.6.2 #13344

Merged

tlrx mentioned this pull request Oct 30, 2015

Filter path refactoring #14390

Merged

tlrx deleted the response-filtering-with-path-parameter branch May 19, 2016 09:52

wnojopra mentioned this pull request Apr 8, 2020

Short response for Bulk API #2661

Closed

API: Add response filtering with filter_path parameter #10980

API: Add response filtering with filter_path parameter #10980

Conversation

tlrx commented May 5, 2015

tlrx commented May 5, 2015

clintongormley commented May 5, 2015

tlrx commented May 5, 2015

tlrx commented May 5, 2015

spinscale May 5, 2015

Choose a reason for hiding this comment

tlrx May 6, 2015

Choose a reason for hiding this comment

spinscale commented May 6, 2015

tlrx commented May 6, 2015

tlrx commented May 7, 2015

tlrx commented May 7, 2015

clintongormley commented May 8, 2015

tlrx commented May 8, 2015

spinscale May 8, 2015

Choose a reason for hiding this comment

spinscale commented May 8, 2015

rjernst commented May 8, 2015

clintongormley commented May 8, 2015

clintongormley commented May 8, 2015

kimchy commented May 8, 2015

rjernst commented May 8, 2015

kimchy commented May 9, 2015

tlrx commented May 18, 2015

spinscale commented May 18, 2015

tlrx commented May 19, 2015

spinscale May 19, 2015

Choose a reason for hiding this comment

clintongormley May 22, 2015

Choose a reason for hiding this comment

spinscale commented May 19, 2015

clintongormley commented May 20, 2015

tlrx commented May 20, 2015

tlrx commented May 22, 2015

clintongormley commented May 25, 2015

spinscale commented May 26, 2015

clintongormley commented May 26, 2015

rashidkpc commented Jun 2, 2015

dkroehan commented Feb 26, 2016

tlrx commented Feb 29, 2016

nmors commented Mar 8, 2016

tlrx commented Mar 8, 2016

nmors commented Mar 8, 2016

API: Add response filtering with `filter_path` parameter #10980

API: Add response filtering with `filter_path` parameter #10980