
Field stats filter #11187

Closed
rashidkpc opened this issue May 15, 2015 · 10 comments

Comments

@rashidkpc

Originally discussed in #10523

Would it be possible to filter the fields/indices we get back based on a filter? Perhaps I want back all of the indices with @timestamp fields between 2014-01-01 and 2015-01-01:

curl -XGET "http://localhost:9200/_field_stats?fields=@timestamp&min=2014-01-01&max=2015-01-01"

The above is probably not the right syntax since it would only allow for a single field. I'm not sure if we want a post body here or some other syntax?

This would be incredibly useful for Kibana and let us get rid of some serious bugs that occur for users with large numbers of indices. In fact, we could entirely get rid of our notion of timestamped indices as we'd be able to reliably sort an index list based on time. Users would only need to know their wildcard pattern! It would also allow for weird indexing strategies where maybe they're indexing weekly at one point and they need to step up to daily, something that isn't currently possible. Heck, you wouldn't even need timestamped indices, you could just increase a counter!
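One hypothetical way around the single-field limitation mentioned above would be to move the filter into a request body, e.g. (illustrative sketch only, not an existing API — field name and bounds taken from the example request above):

curl -XPOST "http://localhost:9200/_field_stats" -d '{
   "fields" : {
      "@timestamp" : {
         "min" : "2014-01-01",
         "max" : "2015-01-01"
      }
   }
}'

A body like this would allow one min/max pair per field, which a flat query string cannot express cleanly.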

@clintongormley

@rashidkpc do you need to know which indices match for any reason other than to reduce the number of indices you query? I ask because I don't think you should need to do this at all. Elasticsearch should do this for you at search time. As @mikemccand said in #5829 (comment) :

I think higher level optimizations could be very worthwhile, e.g. for time-based indices, knowing that a given index won't have any hits because there is a top-level range filter, should be a big speed up in many cases ...

@martijnvg
Member

I think higher level optimizations could be very worthwhile, e.g. for time-based indices, knowing that a given index won't have any hits because there is a top-level range filter, should be a big speed up in many cases ...

We should definitely implement this. In order to do so, we need to keep track of the min and max value per field per shard, so that we can skip entire shards. However, even with this big improvement in place, we will still send the shard-level requests to the nodes that hold the shards. Kibana and other apps can do the next best thing by not sending requests at all to nodes that hold indices outside the timestamp range, and a stats filter can help them do this.
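The shard-skipping idea could be sketched as (pseudocode, assuming per-shard min/max values are tracked as described above):

for each shard in shards:
    # no document in this shard can match the top-level range filter
    if shard.max[field] < range.gte or shard.min[field] >= range.lt:
        skip(shard)
    else:
        search(shard)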

Maybe the right syntax would be:

curl -XGET "http://localhost:9200/_field_stats?fields=@timestamp&min.@timestamp=2014-01-01&max.@timestamp=2015-01-01"

So the field name is included in the min and max params. I'd prefer this over adding parameters to the request body for this particular filter option.
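For example, filtering on two fields with that syntax might look like this (illustrative only; the second field `bytes` and its bound are hypothetical, assuming the min./max. parameters can be repeated per field):

curl -XGET "http://localhost:9200/_field_stats?fields=@timestamp,bytes&min.@timestamp=2014-01-01&max.@timestamp=2015-01-01&min.bytes=0"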

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue May 20, 2015
Field stats filter allows filtering out field stats results for indices that have no field values
in the defined range. This is useful, for example, to find all indices that have logs within a
certain date range. This option is only useful if the `level` option is set to `indices`.

The following request only returns field stats for indices that have `_timestamp` date values between
the defined range. The response format remains the same compared to if no field stats filtering is enabled.

curl -XPOST 'http://localhost:9200/_field_stats?level=indices' -d '{
   "fields" : {
      "_timestamp" : {
         "gte" : "2014-01-01T00:00:00.000Z",
         "lt" : "2015-01-01T00:00:00.000Z"
      }
   }
}'

Closes elastic#11187
@clintongormley

Just been talking to @martijnvg and @jpountz about the field-stats filtering PR in #11259. I'm not liking the API at all, it feels like the wrong solution.

The intention of this change is to reduce the amount of work that Elasticsearch has to do to fulfill a search request, by only querying indices that contain documents that could possibly match certain conditions (eg a timestamp range). However, the proposed solution requires a round trip to all indices to retrieve the field stats data before executing the query, so all shards get hit anyway.

While you could cache this information, Kibana is supposed to take new data into account, so any caching would interfere with this process.

The problem with running a search request on all indices at the moment is that we don't have the optimizations in place to abort the search request as early as possible if no data can possibly match. Even if we use the pre-flight check suggested in #5829 (comment) we need other optimizations such as only loading global ordinals if we actually need them.

@rashidkpc I think that today Kibana queries each index in turn, starting with the most recent and backfilling data. Is that correct? If we implemented the optimizations suggested above, would you still want to do that in the same way and, if so, why?

If this is still a requirement, then I think we should look at a different API which just returns index names, rather than tacking this on to the field stats API, or do you need the field stats as well? Could you describe your "dream process" in more detail?

@clintongormley

@rashidkpc and I had a chat about how Kibana would use this API:

Regardless of whether Elasticsearch uses the min/max field stats internally, Kibana would still query indices individually to show results for recent indices more quickly and with less concurrent load on ES. Kibana wouldn't cache the field stats information (as it needs to be aware of any new indices/data as it becomes available), so the API needs to be fast.

A user might select a time range, so Kibana would use the field stats API to show data from each index within the time range in turn. Then, if a user selects a field for filtering, Kibana would show the min/max values available for that field based on the selected time range.

In summary, the API should return:

  • a list of indices
  • field stats for a list of fields, per index
  • possibly filtered by a constraint on a field (eg timestamp)

A question that remains is: would we ever need to filter on more than one field and, if so, how should the constraints be combined: with AND or OR? That said, we couldn't come up with a use case for filtering on multiple fields, so let's worry about that later and for the moment just AND the constraints (ie all must apply).

I think it is worth separating out the list of fields that should be returned from the constraints, eg:

curl -XPOST "http://localhost:9200/_field_stats?level=indices" -d'
{
  "fields": [
    "foo",
    "bar"
  ],
  "constraints": {
    "_timestamp.max": {
      "gte": "2014-01-01T00:00:00.000Z"
    },
    "_timestamp.min": {
      "lt": "2015-01-01T00:00:00.000Z"
    }
  }
}'

Should it be constraints, filters, restrictions, where? I'm avoiding filters because this option doesn't accept the query DSL.
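For context, with level=indices the field stats response groups results per index, so a filtered response might look roughly like this (shape based on the field stats response format; index names and values are illustrative, and other per-field statistics are omitted):

{
   "_shards" : { "total" : 2, "successful" : 2, "failed" : 0 },
   "indices" : {
      "logstash-2014.06.01" : {
         "fields" : {
            "foo" : {
               "min_value" : 1,
               "max_value" : 990
            },
            "bar" : {
               "min_value" : 12,
               "max_value" : 42
            }
         }
      }
   }
}

Indices whose stats fail the constraints would simply be absent from the indices map.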

@martijnvg
Member

@clintongormley I like restrictions, or maybe just restrict? I think that the restrictions should be embedded in the field, so it is clear which restriction belongs to which field. Something like this:

{
  "fields": {
    "foo" : {
       "restrict" : {
          "_timestamp.max": {
              "gte": "2014-01-01T00:00:00.000Z"
          }
       }
    },
    "bar" : {
      "restrict" : {
         "_timestamp.min": {
            "lt": "2015-01-01T00:00:00.000Z"
         }
      }
    }
  }
}

@clintongormley

@martijnvg the constraints should be separate from the fields list, because they are two separate concerns:

  • which fields should I filter by
  • which fields should I return

@martijnvg
Member

Ok, I see, makes sense.

@bleskes
Contributor

bleskes commented May 29, 2015

Reading this, I got a bit confused about what we are filtering on. I'm afraid people will very easily interpret the restrictions/constraints as applying to the data (i.e., query DSL), but I understand this is not what we aim for, due to the performance implications. Rather, the idea is to filter the indices used for the stats (right?). I wonder if we should name it something that implies that, like indices_filter or indices_constraints, and have it in a structure that allows future extension. Something like:

POST /_field_stats?level=indices
{
  "fields": [
    "foo",
    "bar"
  ],
  "indices_constraints": {
    "field_range": {
      "_timestamp": {
        "max": {
          "gte": "2014-01-01T00:00:00.000Z"
        },
        "min": {
          "lt": "2015-01-01T00:00:00.000Z"
        }
      }
    }
  }
}

@martijnvg
Member

@bleskes Yes, the idea is to filter on indices. I like your syntax, because it is descriptive. It is a bit more verbose, but I think that is okay.

@clintongormley

I like indices_constraints, although I think it should be index_constraints. I don't think we need the field_range layer: all we can do is filter on the values of the stats for each field. So in summary:

POST /_field_stats?level=indices
{
  "fields": [
    "foo",
    "bar"
  ],
  "index_constraints": {
    "_timestamp": {
      "max": {
        "gte": "2014-01-01T00:00:00.000Z"
      },
      "min": {
        "lt": "2015-01-01T00:00:00.000Z"
      }
    }
  }
}
