
Field stats filter #11187

Closed
rashidkpc opened this issue May 15, 2015 · 10 comments

Comments

@rashidkpc

Originally discussed in #10523

Would it be possible to filter the fields/indices we get back based on a filter? Perhaps I want back all of the indices with @timestamp fields between 2014-01-01 and 2015-01-01:

curl -XGET "http://localhost:9200/_field_stats?fields=@timestamp&min=2014-01-01&max=2015-01-01"

The above is probably not the right syntax since it would only allow for a single field. I'm not sure if we want a post body here or some other syntax?

This would be incredibly useful for Kibana and let us get rid of some serious bugs that occur for users with large numbers of indices. In fact, we could entirely get rid of our notion of timestamped indices as we'd be able to reliably sort an index list based on time. Users would only need to know their wildcard pattern! It would also allow for weird indexing strategies where maybe they're indexing weekly at one point and they need to step up to daily, something that isn't currently possible. Heck, you wouldn't even need timestamped indices, you could just increase a counter!
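One hypothetical way around the single-field limitation mentioned above would be to move the filter into a request body, e.g. (illustrative sketch only, not an existing API — field name and bounds taken from the example request above):

curl -XPOST "http://localhost:9200/_field_stats" -d '{
   "fields" : {
      "@timestamp" : {
         "min" : "2014-01-01",
         "max" : "2015-01-01"
      }
   }
}'

A body like this would allow one min/max pair per field, which a flat query string cannot express cleanly.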

@clintongormley

@rashidkpc do you need to know which indices match for any reason other than to reduce the number of indices you query? I ask because I don't think you should need to do this at all. Elasticsearch should do this for you at search time. As @mikemccand said in #5829 (comment) :

I think higher level optimizations could be very worthwhile, e.g. for time-based indices, knowing that a given index won't have any hits because there is a top-level range filter, should be a big speed up in many cases ...

@martijnvg
Member

I think higher level optimizations could be very worthwhile, e.g. for time-based indices, knowing that a given index won't have any hits because there is a top-level range filter, should be a big speed up in many cases ...

We should definitely implement this. In order to do so, we need to keep track of the min and max value per field per shard, so that we can skip entire shards. However, even with this big improvement in place, we will still send the shard-level requests to the nodes that hold the shards. Kibana and other apps can do the next best thing by not sending requests at all to nodes that hold indices outside the timestamp range, and a stats filter can help them do this.
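The shard-skipping idea could be sketched as (pseudocode, assuming per-shard min/max values are tracked as described above):

for each shard in shards:
    # no document in this shard can match the top-level range filter
    if shard.max[field] < range.gte or shard.min[field] >= range.lt:
        skip(shard)
    else:
        search(shard)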

Maybe the right syntax would be:

curl -XGET "http://localhost:9200/_field_stats?fields=@timestamp&min.@timestamp=2014-01-01&max.@timestamp=2015-01-01"

So the field name is included in the min and max params. I'd prefer this over adding parameters to the request body for this particular filter option.
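For example, filtering on two fields with that syntax might look like this (illustrative only; the second field `bytes` and its bound are hypothetical, assuming the min./max. parameters can be repeated per field):

curl -XGET "http://localhost:9200/_field_stats?fields=@timestamp,bytes&min.@timestamp=2014-01-01&max.@timestamp=2015-01-01&min.bytes=0"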

martijnvg added a commit to martijnvg/elasticsearch that referenced this issue May 20, 2015
Field stats filter allows filtering out field stats results for indices that have no field values
in the defined range. This is useful, for example, to find all indices that have logs within a
certain date range. This option is only useful if the `level` option is set to `indices`.

The following request only returns field stats for indices that have `_timestamp` date values between
the defined range. The response format remains the same compared to if no field stats filtering is enabled.

curl -XPOST 'http://localhost:9200/_field_stats?level=indices' -d '{
   "fields" : {
      "_timestamp" : {
         "gte" : "2014-01-01T00:00:00.000Z",
         "lt" : "2015-01-01T00:00:00.000Z"
      }
   }
}'

Closes elastic#11187
@clintongormley

Just been talking to @martijnvg and @jpountz about the field-stats filtering PR in #11259. I'm not liking the API at all, it feels like the wrong solution.

The intention of this change is to reduce the amount of work that Elasticsearch has to do to fulfill a search request, by only querying indices that contain documents that could possibly match certain conditions (eg a timestamp range). However, the proposed solution requires a round trip to all indices to retrieve the field stats data before executing the query, so all shards get hit anyway.

While you could cache this information, Kibana is supposed to take new data into account, so any caching would interfere with this process.

The problem with running a search request on all indices at the moment is that we don't have the optimizations in place to abort the search request as early as possible if no data can possibly match. Even if we use the pre-flight check suggested in #5829 (comment) we need other optimizations such as only loading global ordinals if we actually need them.

@rashidkpc I think that today Kibana queries each index in turn, starting with the most recent and backfilling data. Is that correct? If we implemented the optimizations suggested above, would you still want to do that in the same way and, if so, why?

If this is still a requirement, then I think we should look at a different API which just returns index names, rather than tacking this on to the field stats API, or do you need the field stats as well? Could you describe your "dream process" in more detail?

@clintongormley

@rashidkpc and I had a chat about how Kibana would use this API:

Regardless of whether Elasticsearch uses the min/max field stats internally, Kibana would still query indices individually to show results for recent indices more quickly and with less concurrent load on ES. Kibana wouldn't cache the field stats information (as it needs to be aware of any new indices/data as it becomes available), so the API needs to be fast.

A user might select a time range, so Kibana would use the field stats API to show data from each index within the time range in turn. Then, if a user selects a field for filtering, Kibana would show the min/max values available for that field based on the selected time range.

In summary, the API should return:

  • a list of indices
  • field stats for a list of fields, per index
  • possibly filtered by a constraint on a field (eg timestamp)

A question that remains is: would we ever need to filter on more than one field and, if so, how should the constraints be combined: with AND or OR? That said, we couldn't come up with a use case for filtering on multiple fields, so let's worry about that later and for the moment just AND the constraints (ie all must apply).

I think it is worth separating out the list of fields that should be returned from the constraints, eg:

curl -XPOST "http://localhost:9200/_field_stats?level=indices" -d'
{
  "fields": [
    "foo",
    "bar"
  ],
  "constraints": {
    "_timestamp.max": {
      "gte": "2014-01-01T00:00:00.000Z"
    },
    "_timestamp.min": {
      "lt": "2015-01-01T00:00:00.000Z"
    }
  }
}'

Should it be constraints, filters, restrictions, where? I'm avoiding filters because this option doesn't accept the query DSL.
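For context, with level=indices the field stats response groups results per index, so a filtered response might look roughly like this (shape based on the field stats response format; index names and values are illustrative, and other per-field statistics are omitted):

{
   "_shards" : { "total" : 2, "successful" : 2, "failed" : 0 },
   "indices" : {
      "logstash-2014.06.01" : {
         "fields" : {
            "foo" : {
               "min_value" : 1,
               "max_value" : 990
            },
            "bar" : {
               "min_value" : 12,
               "max_value" : 42
            }
         }
      }
   }
}

Indices whose stats fail the constraints would simply be absent from the indices map.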

@martijnvg
Member

@clintongormley I like restrictions, or maybe just restrict? I think that the restrictions should be embedded in the field, so it is clear which restriction belongs to which field. Something like this:

{
  "fields": {
    "foo" : {
       "restrict" : {
          "_timestamp.max": {
              "gte": "2014-01-01T00:00:00.000Z"
          }
       }
    },
    "bar" : {
      "restrict" : {
         "_timestamp.min": {
            "lt": "2015-01-01T00:00:00.000Z"
         }
      }
    }
  }
}

@clintongormley

@martijnvg the constraints should be separate from the fields list, because they are two separate concerns:

  • which fields should I filter by
  • which fields should I return

@martijnvg
Member

Ok, I see, makes sense.

@bleskes
Contributor

bleskes commented May 29, 2015

Reading this, I got a bit confused about what we are filtering on. I'm afraid people will very easily interpret the restrictions/constraints as applying to the data (i.e., query DSL), but I understand this is not what we aim for, due to the performance implications. Rather, the idea is to filter the indices used for the stats (right?). I wonder if we should name it something that implies that, like indices_filter or indices_constraints, and have it in a structure that allows future extension. Something like:

POST /_field_stats?level=indices
{
  "fields": [
    "foo",
    "bar"
  ],
  "indices_constraints": {
    "field_range": {
      "_timestamp": {
        "max": {
          "gte": "2014-01-01T00:00:00.000Z"
        },
        "min": {
          "lt": "2015-01-01T00:00:00.000Z"
        }
      }
    }
  }
}

@martijnvg
Member

@bleskes Yes, the idea is to filter on indices. I like your syntax, because it is descriptive. It is a bit more verbose, but I think that is okay.

@clintongormley

I like indices_constraints, although I think it should be index_constraints. I don't think we need the field_range layer: all we can do is filter on the values of the stats for each field. So in summary:

POST /_field_stats?level=indices
{
  "fields": [
    "foo",
    "bar"
  ],
  "index_constraints": {
    "_timestamp": {
      "max": {
        "gte": "2014-01-01T00:00:00.000Z"
      },
      "min": {
        "lt": "2015-01-01T00:00:00.000Z"
      }
    }
  }
}
