
[ML] Data frame transform silently fails if using min(timestamp) #39974

Closed
sophiec20 opened this issue Mar 12, 2019 · 7 comments · Fixed by #40220
@sophiec20 (Contributor)

Found in master:

    "version" : {
      "number" : "8.0.0-SNAPSHOT",
      "build_flavor" : "default",
      "build_type" : "tar",
      "build_hash" : "4957cad",
      "build_date" : "2019-03-11T15:48:39.514013Z",
      "build_snapshot" : true,
      ...
    }

When creating a data frame that uses min and max against a date field, the data frame is not populated; however, the response from _stats implies it has worked. From a user perspective, the data frame silently fails to populate (although an error is logged on the server).

#DELETE _data_frame/transforms/farequote-a
#DELETE df-farequote-a
PUT _data_frame/transforms/farequote-a
{
  "source": "farequote-*",
  "dest": "df-farequote-a",
  "pivot": {
	  "group_by": { 
	    "airline": { "terms": { "field": "airline" }}
	  },
    "aggregations": {
	    "max_responsetime": { "max": { "field": "responsetime" }},
	    "mean_responsetime": { "avg": { "field": "responsetime" }},
	    "min_time": { "min": { "field": "@timestamp"}},
	    "max_time": { "max": { "field": "@timestamp"}}
    }
  }
}

POST _data_frame/transforms/farequote-a/_start
GET _data_frame/transforms/farequote-a/_stats
POST _data_frame/transforms/farequote-a/_stop
GET df-farequote-a/_search

_stats returns the following, which is the same as a successful data frame, i.e. one without min_time and max_time:

{
  "count" : 1,
  "transforms" : [
    {
      "id" : "farequote-a",
      "state" : {
        "transform_state" : "stopped",
        "current_position" : {
          "airline" : "VRD"
        },
        "generation" : 1
      },
      "stats" : {
        "pages_processed" : 2,
        "documents_processed" : 86274,
        "documents_indexed" : 19,
        "trigger_count" : 1,
        "index_time_in_ms" : 281,
        "index_total" : 1,
        "index_failures" : 0,
        "search_time_in_ms" : 6,
        "search_total" : 2,
        "search_failures" : 0
      }
    }
  ]
}

Error in log:

[2019-03-12T18:39:33,771][WARN ][o.e.x.c.i.AsyncTwoPhaseIndexer] [node1] Error while attempting to bulk index documents: failure in bulk execution:
@sophiec20 added the >bug, :ml Machine learning, and v8.0.0 labels on Mar 12, 2019
@elasticmachine (Collaborator)

Pinging @elastic/ml-core

@benwtrent self-assigned this on Mar 12, 2019
@benwtrent (Member) commented Mar 12, 2019

Even though we allow the destination mapping type for max and min to match the source mapping type (e.g. max on a field of type date gets a mapping entry of type date in the destination index), we still regretfully assume that an aggregation result of NumericMetricsAggregation.SingleValue is numeric-only when parsing in AggregationResultUtils.extractCompositeAggregationResults.
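
Roughly speaking, the extraction does something like this (a hypothetical sketch, not the exact source; agg, aggName, and document stand in for the real locals):

    // Hypothetical sketch of the numeric-only assumption: every
    // SingleValue result is written as a raw double, even when the
    // destination mapping for the field is date.
    if (agg instanceof NumericMetricsAggregation.SingleValue) {
        NumericMetricsAggregation.SingleValue singleValue =
            (NumericMetricsAggregation.SingleValue) agg;
        document.put(aggName, singleValue.value()); // always a double, e.g. 1.552428E12 for max on @timestamp
    }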

Working on a fix, as this could easily cause issues with other, more complicated aggregations (not to mention when we start supporting pipeline aggregations).

@benwtrent (Member)

Did some local testing; it turns out that getValueAsString() works for letting a field be indexed into a date mapping type. But we cannot do that for all results, as the document _source would look incorrect if the user was expecting a numerical value in the field. For NumericMetricsAggregation.SingleValue results, we need to determine whether we want a string or a number before indexing.
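
For illustration (a sketch, assuming maxAgg is the parsed single-value max aggregation over the date field):

    // Sketch: two ways to read the same single-value result.
    double epochMillis = maxAgg.value();          // raw double, e.g. 1.552428E12
    String formatted = maxAgg.getValueAsString(); // field-formatted, e.g. "2019-03-12T18:39:33.771Z"
    // The formatted string indexes cleanly into a date mapping; the raw
    // double does not. But a numeric field should get the double, not a string.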

@sophiec20 (Contributor, Author)

As the mappings are created correctly as date in the data frame, and the _preview endpoint returns the aggregation, it seems a shame not to capture this.

Contemplating use cases, I suspect this is something that will be useful, for example:

  • For customers who have interacted in the last 30 days, tell me who the biggest-ever spenders are.
  • For customers acquired in Feb 2019, tell me who is most active on the platform.

It also provides a method by which to retire or exclude non-relevant data from the data frame.

There are other ways to answer these questions, for example with support of scripts, but this seems neatest.

@benwtrent (Member)

@sophiec20 yeah, since the mapping is date but we are trying to push in a double, it cannot be converted appropriately.

When determining the data to push to the new index, we should not lose the formatting implied by the originally aggregated type.

This, to me, is bigger than just dates: the same issue would arise with other aggregations (max on an IP address, for instance).

@sophiec20 added the :ml/Transform label on Mar 14, 2019
@hendrikmuhs (Contributor)

@benwtrent

The idea of using getValueAsString() seems good to me. When we create the bulk index request we turn everything into a string anyway.

When double values are turned into strings they are written in scientific notation if the value is 10**7 or larger. Parsing scientific notation works fine everywhere except in the date parser.
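
For example, in plain Java:

    // Double.toString switches to scientific notation once the magnitude
    // reaches 10^7, which the date parser then rejects.
    System.out.println(Double.toString(9999999.0));       // "9999999.0"
    System.out.println(Double.toString(10000000.0));      // "1.0E7"
    System.out.println(Double.toString(1552428000000.0)); // "1.552428E12" -- not a valid epoch-millis string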

Interestingly, getValueAsString() uses the following implementation:

    public String getValueAsString() {
        if (valueAsString != null) {
            return valueAsString;
        } else {
            return Double.toString(value);
        }
    }

Aggregations on dates seem like a very old issue: #6812

@hendrikmuhs (Contributor)

For the record:

No matter which you choose, getValueAsString() or value(), you end up with a problem: either numbers become strings or dates become invalid. We need to know whether a value needs to be quoted or not. There is no generic solution; we need to consult the mapping and handle numbers and dates differently.
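
A minimal sketch of that mapping-aware handling (hypothetical; fieldTypeMap, a lookup from destination field name to mapping type, is an assumed helper, not necessarily how the fix implements it):

    // Hypothetical: consult the destination mapping to decide between
    // the formatted string and the raw double.
    String mappingType = fieldTypeMap.get(aggName);
    if ("date".equals(mappingType)) {
        document.put(aggName, singleValue.getValueAsString()); // formatted date string
    } else {
        document.put(aggName, singleValue.value());            // plain double for numeric fields
    }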

Solution: #40220
