Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: Add response filtering with filter_path parameter #10980

Merged
merged 1 commit into from May 26, 2015

Conversation

tlrx
Copy link
Member

@tlrx tlrx commented May 5, 2015

This change adds a new "_path" parameter that can be used to filter and reduce the responses returned by the REST API of elasticsearch.

For example, returning only the shards that failed to be optimized:

curl -XPOST 'localhost:9200/beer/_optimize?_path=_shards.failed'
{"_shards":{"failed":0}}%

It supports multiple filters (separated by a comma):

curl -XGET 'localhost:9200/_mapping?pretty&_path=*.mappings.*.properties.name,*.mappings.*.properties.title'

It also supports the YAML response format. Here it returns only the _id field of a newly indexed document:

curl -XPOST 'localhost:9200/library/book?_path=_id' -d '---hello:\n  world: 1\n'

---
_id: "AU0j64-b-stVfkvus5-A"

It also supports wildcards. Here it returns only the host name of every nodes in the cluster:

curl -XGET 'http://localhost:9200/_nodes/stats?_path=nodes.*.host*'
{"nodes":{"lvJHed8uQQu4brS-SXKsNA":{"host":"portable"}}}

And "**" can be used to include sub fields without knowing the exact path. Here it returns only the Lucene version of every segment:

curl 'http://localhost:9200/_segments?pretty&_path=indices.**.version'
{
  "indices" : {
    "beer" : {
      "shards" : {
        "0" : [ {
          "segments" : {
            "_0" : {
              "version" : "5.2.0"
            },
            "_1" : {
              "version" : "5.2.0"
            }
          }
        } ]
      }
    }
  }
}

Note that elasticsearch sometimes returns directly the raw value of a field, like the "_source" field. If you want to filter the response that include _source fields like in Search or Get responses , you should consider using the already existing "fields" parameter:

 curl -XGET 'localhost:9200/beer/tasty/3672?pretty&fields=category'
 {
   "_index" : "beer",
   "_type" : "tasty",
   "_id" : "3672",
   "_version" : 1,
   "found" : true,
   "fields" : {
     "category" : [ "German Lager" ]
   }
 }

The "_path" parameter can be used to further reduce the result:

curl -XGET 'localhost:9200/beer/tasty/3672?pretty&fields=category&_path=fields.category'
{
  "fields" : {
    "category" : [ "German Lager" ]
  }
}

Closes #7401

@tlrx tlrx added v2.0.0-beta1 review :Core/Infra/REST API REST infrastructure and utilities v1.6.0 labels May 5, 2015
@tlrx
Copy link
Member Author

tlrx commented May 5, 2015

Note:

  • if the provided filter specified in _path does not match something, an empty response is returned
  • if the API returns an error, the error is not filtered even if the _path is specified

@clintongormley
Copy link

if the provided filter specified in _path does not match something, an empty response is returned

We should return an empty JSON object rather than an empty string, otherwise the client needs to check whether JSON has been returned.

@tlrx
Copy link
Member Author

tlrx commented May 5, 2015

@clintongormley thanks, I updated the code to reflect your last comment.

@tlrx
Copy link
Member Author

tlrx commented May 5, 2015

@spinscale Can you please have a look? Thanks


@Override
public XContentGenerator createGenerator(OutputStream os, String[] filters) throws IOException {
if ((filters == null) || (filters.length == 0)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use CollectionUtils.isEmpty()?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok

@spinscale
Copy link
Contributor

A couple of things

First, I really would like to see some simple benchmarking, so we have some ballpark numbers, how this affects performance. Especially because write operations that create sub-hierarchies also create new objects.

Minor thing: FilteringJsonGenerator.parse() - why not use an array instead of XContentFilter and next (like your own custom linked list).

One other thing, I know that jackson has some filtering capabiltiies built-in, I have no ideas if we can make use of them though, see here, here and here

@tlrx
Copy link
Member Author

tlrx commented May 6, 2015

One other thing, I know that jackson has some filtering capabiltiies built-in, I have no ideas if we can make use of them though, see here, here and here

That's the first thing we checked before starting working on this pull request. The current filtering capabilities of Jackson (version <2.5.1) can be used only with jackson-databinding lib and apply only to POJOs but not when you use jackson to generate JSON in streaming mode.

Jackson 2.6 will integrate an interesting filtering feature that will work with JSON streaming (see release note here and test class here). It currently uses JSONPointer to filter properties but I think we will be able to implement our own TokenFilter later, once version 2.6 is released.

@tlrx
Copy link
Member Author

tlrx commented May 7, 2015

First, I really would like to see some simple benchmarking, so we have some ballpark numbers, how this affects performance. Especially because write operations that create sub-hierarchies also create new objects.

I added the FilteringJsonGeneratorBenchmark and put some numbers here (not very readable - I apologize for that).

There are no real surprise in the benchmark:

  • when writing a small number of documents (less than 100) with a small number of fields each(1-10), the filtering adds an overhead that makes it less efficient than what we have today
  • when writing a larger amount of data (ie large number of small docs, small number of large docs) there's a point where it becomes more efficient to check if fields must be included and skip them rather than writing more data. This point varies depending on the document size and the selectivity of the filter, but in this test it is in between 1-10%.

Minor thing: FilteringJsonGenerator.parse() - why not use an array instead of XContentFilter and next (like your own custom linked list).

There are room for improvements like this one (I'm also thinking of not creating FilterContext for sub fields when a parent field matches the end of a filter), but it might add complexity to the code and I'm not sure if it will be really more efficient.

I'm currently trying some of these improvement to see if they are pertinent or not.

@tlrx
Copy link
Member Author

tlrx commented May 7, 2015

Thinking of improvements again, we should be able to reuse the objects instead of creating new ones for every sub fields. I'll try to improve that.

@clintongormley
Copy link

when writing a small number of documents (less than 100) with a small number of fields each(1-10), the filtering adds an overhead that makes it less efficient than what we have today

I assume that the code path (and performance) remains the same as today if no _path is specified?

@tlrx
Copy link
Member Author

tlrx commented May 8, 2015

@clintongormley yes. Branching is done here where we fall back to the normal XContent generator.

private List<XContentFilter> matchings;

/**
* Flag to indicate if the field/property must be write
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... written

@spinscale
Copy link
Contributor

added some absolutely minor comments...

First, this doesnt compile using mvn, due to the benchmark class and some locale, but not really an issue.

Few last things:

  • the kind-of linkedlist data structure in the FilterContext, which could possibly be replaced by a simple array?
  • This seems the only part, where we use ** as 'greedy' wildcard in Elasticsearch, not sure if it is a problem, just to be aware
  • No documentation so far?

Maybe we should also note somewhere to switch to the jackson filter features with the 2.6 release, I'd hate to forget that...

LGTM apart from that

@rjernst
Copy link
Member

rjernst commented May 8, 2015

I'm late to the party here, but I have a simple question. Why is the parameter _path and not path? Why use the underscore when we control all the url params anyways? And all the other examples I know of (eg fields, pretty, human) do not use underscore?

@clintongormley
Copy link

@rjernst only because it works well with the existing _source filtering. I kinda like the underscore because it is a "global" param (mind you, so is pretty).

@clintongormley
Copy link

This seems the only part, where we use ** as 'greedy' wildcard in Elasticsearch, not sure if it is a problem, just to be aware

Elsewhere we use * as a greedy wildcard, which I think is problematic. We took the ** syntax from ant, and I'd actually be tempted to reuse it elsewhere.

@kimchy
Copy link
Member

kimchy commented May 8, 2015

actually, since this is a "rest" level parameter, and for those, we have pretty and human, I think we should just call it path, or maybe be more explicit and call it filter_path

@rjernst
Copy link
Member

rjernst commented May 8, 2015

+1 on filter_path

@kimchy
Copy link
Member

kimchy commented May 9, 2015

here is another suggestion, calling it filter_path_include, in the future we might want to support filter_path_exclude.

The reason I am raising it is that today, we support source level include/exclude when fetching source documents. With this infrastructure, if we eventually expand it to support include_s_/exclude_s_, we can also support zero copy (as in, not create a map of maps/lists representation of source) when someone asks for source includes/includes (we can just create a bytes based filtering generator, and copy structure from the parser right into it).

I helped a fellow on IRC today where the loading and parsing of source include was the perf bottleneck (from 1ms it got up to 170ms for ~1000 docs). I think this will help a lot there eventually.

@tlrx
Copy link
Member Author

tlrx commented May 18, 2015

@spinscale I rebased and updated the code according to your last comments. Can you have a look and do some manual testing please? Thanks a lot :)

Note: latest benchmark numbers are here.

@spinscale
Copy link
Contributor

we may also not want to have manual testing but also add that parameter (plus tests) to our REST tests

@tlrx
Copy link
Member Author

tlrx commented May 19, 2015

@spinscale I just added the REST tests. The parameter is still named _path until we reach an agreement on the final name.

@@ -56,6 +56,10 @@
"options" : ["node", "indices", "shards"],
"default" : "node"
},
"_path": {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the _path needs to be added to all the methods supporting it, to make sure the clients are aware

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, like pretty and _source, we can just support this parameter everywhere, so don't add it to the rest spec.

@spinscale
Copy link
Contributor

left one comment, apart from that it feels like we are close to getting this in, when we get the naming stuff resolved... /cc @clintongormley

@clintongormley
Copy link

The _source parameter is essentially a synonym for _source_include, so I'd go with filter_path for now. We can always add filter_path_include|exclude later on.

@tlrx
Copy link
Member Author

tlrx commented May 20, 2015

@clintongormley sounds good to me, thanks

@tlrx
Copy link
Member Author

tlrx commented May 22, 2015

@spinscale Updated with your last comment :)

@clintongormley
Copy link

@spinscale @kimchy does this need another review, or are we good to go?

@spinscale
Copy link
Contributor

LGTM on my side

This change adds a new "filter_path" parameter that can be used to filter and reduce the responses returned by the REST API of elasticsearch.

For example, returning only the shards that failed to be optimized:
```
curl -XPOST 'localhost:9200/beer/_optimize?filter_path=_shards.failed'
{"_shards":{"failed":0}}%
```

It supports multiple filters (separated by a comma):
```
curl -XGET 'localhost:9200/_mapping?pretty&filter_path=*.mappings.*.properties.name,*.mappings.*.properties.title'
```

It also supports the YAML response format. Here it returns only the `_id` field of a newly indexed document:
```
curl -XPOST 'localhost:9200/library/book?filter_path=_id' -d '---hello:\n  world: 1\n'
---
_id: "AU0j64-b-stVfkvus5-A"
```

It also supports wildcards. Here it returns only the host name of every nodes in the cluster:
```
curl -XGET 'http://localhost:9200/_nodes/stats?filter_path=nodes.*.host*'
{"nodes":{"lvJHed8uQQu4brS-SXKsNA":{"host":"portable"}}}
```

And "**" can be used to include sub fields without knowing the exact path. Here it returns only the Lucene version of every segment:
```
curl 'http://localhost:9200/_segments?pretty&filter_path=indices.**.version'
{
  "indices" : {
    "beer" : {
      "shards" : {
        "0" : [ {
          "segments" : {
            "_0" : {
              "version" : "5.2.0"
            },
            "_1" : {
              "version" : "5.2.0"
            }
          }
        } ]
      }
    }
  }
}
```

Note that elasticsearch sometimes returns directly the raw value of a field, like the _source field. If you want to filter _source fields, you should consider combining the already existing _source parameter (see Get API for more details) with the filter_path parameter like this:

```
curl -XGET 'localhost:9200/_search?pretty&filter_path=hits.hits._source&_source=title'
{
  "hits" : {
    "hits" : [ {
      "_source":{"title":"Book elastic#2"}
    }, {
      "_source":{"title":"Book elastic#1"}
    }, {
      "_source":{"title":"Book elastic#3"}
    } ]
  }
}
```
@tlrx tlrx force-pushed the response-filtering-with-path-parameter branch from 234c3b6 to ce63590 Compare May 26, 2015 12:07
@tlrx tlrx merged commit ce63590 into elastic:master May 26, 2015
@kevinkluge kevinkluge removed the review label May 26, 2015
@clintongormley
Copy link

w00t

@clintongormley clintongormley changed the title API: Add response filtering with _path parameter API: Add response filtering with filter_path parameter May 27, 2015
@rashidkpc
Copy link

WIN Thanks so much everyone for making this happen. This is huge for Kibana!

karmi added a commit to elastic/elasticsearch-ruby that referenced this pull request Jun 5, 2015
@tlrx tlrx mentioned this pull request Sep 4, 2015
@tlrx tlrx mentioned this pull request Oct 30, 2015
@dkroehan
Copy link

Is this already supported via the Java API?

@tlrx
Copy link
Member Author

tlrx commented Feb 29, 2016

@dkroehan The filter_path parameter is used as the REST level (like the pretty parameter) to render JSON responses and therefore it is not part of the Java API.

If you need to render XContent object instances using filtering in Java, you can use the XContent.builder(XContent xContent, String[] filters) method

@nmors
Copy link

nmors commented Mar 8, 2016

are excludes supported yet, or just includes?

@tlrx
Copy link
Member Author

tlrx commented Mar 8, 2016

@nmors Using the REST API and the filter_path parameter, only includes are supported. Using XContent filtering in Java, excludes are supported but you can't mix them both.

@nmors
Copy link

nmors commented Mar 8, 2016

Thanks, I'm actually using the javascript client. I'm using the filterPath parameter like so (I guess it's a wrapper for filter_path), all is working fine! It would be good to be able to use excludes though, can be very helpful for me as it is web traffic that I would like to keep small. I've managed to cut the response size in half by using the following:

  index: my_index,
  type: my_type,
  body: q,
  filterPath: [
    "aggregations.my_agg.buckets.**.value",
    "aggregations.my_agg.buckets.**.key"
  ]

@tlrx tlrx deleted the response-filtering-with-path-parameter branch May 19, 2016 09:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add response_transform_script to allow xpath style selection of response elements
10 participants