
ignore_malformed to support ignoring JSON objects ingested into fields of the wrong type #12366

Open · samcday opened this issue Jul 21, 2015 · 71 comments
Labels
>enhancement · help wanted (adoptme) · :Search/Mapping (Index mappings, including merging and defining field types) · Team:Search (Meta label for search team)

Comments

@samcday commented Jul 21, 2015

Indexing a document with an object value for a field that has already been mapped as a string causes a MapperParsingException, even if index.mapping.ignore_malformed has been enabled.

Reproducible test case

On Elasticsearch 1.6.0:

$ curl -XPUT localhost:9200/broken -d'{"settings":{"index.mapping.ignore_malformed": true}}'
{"acknowledged":true}

$ curl -XPOST localhost:9200/broken/type -d '{"test":"a string"}'
{"_index":"broken","_type":"type","_id":"AU6wNDGa_qDGqxty2Dvw","_version":1,"created":true}

$ curl -XPOST localhost:9200/broken/type -d '{"test":{"nested":"a string"}}'
{"error":"MapperParsingException[failed to parse [test]]; nested: ElasticsearchIllegalArgumentException[unknown property [nested]]; ","status":400}

$ curl localhost:9200/broken/_mapping
{"broken":{"mappings":{"type":{"properties":{"test":{"type":"string"}}}}}}

Expected behaviour

When index.mapping.ignore_malformed is enabled, indexing a document that supplies an object where Elasticsearch expects a string should not fail the whole document. Instead, the invalid object field should be ignored.

@clintongormley added the discuss and :Search/Mapping (Index mappings, including merging and defining field types) labels Jul 23, 2015
@clintongormley

+1

@andrestc (Contributor)

While working on this issue, I found that it fails for other types too, but for a different reason. For example, with an integer field:

$ curl -XPUT localhost:9200/broken -d'{"settings":{"index.mapping.ignore_malformed": true}}'
{"acknowledged":true}

$ curl -XPOST localhost:9200/broken/type -d '{"test2": 10}'
{"_index":"broken","_type":"type","_id":"AU6wNDGa_qDGqxty2Dvw","_version":1,"created":true}

$ curl -XPOST localhost:9200/broken/type -d '{"test2":{"nested": 20}}'
[elasticsearch] [2015-09-26 02:20:23,380][DEBUG][action.index             ] [Tyrant] [broken][1], node[7WAPN-92TAeuFYbRLVqf8g], [P], v[2], s[STARTED], a[id=WlYpBZ6vTXS-4WMvAypeTA]: Failed to execute [index {[broken][type][AVAIGFNQZ9WMajLk5l0S], source[{"test2":{"nested":1}}]}]
[elasticsearch] MapperParsingException[failed to parse]; nested: IllegalArgumentException[Malformed content, found extra data after parsing: END_OBJECT];
[elasticsearch]     at org.elasticsearch.index.mapper.DocumentParser.innerParseDocument(DocumentParser.java:157)
[elasticsearch]     at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:77)
[elasticsearch]     at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:319)
[elasticsearch]     at org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:475)
[elasticsearch]     at org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:466)
[elasticsearch]     at org.elasticsearch.action.support.replication.TransportReplicationAction.prepareIndexOperationOnPrimary(TransportReplicationAction.java:1053)
[elasticsearch]     at org.elasticsearch.action.support.replication.TransportReplicationAction.executeIndexRequestOnPrimary(TransportReplicationAction.java:1061)
[elasticsearch]     at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:170)
[elasticsearch]     at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.performOnPrimary(TransportReplicationAction.java:580)
[elasticsearch]     at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase$1.doRun(TransportReplicationAction.java:453)
[elasticsearch]     at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
[elasticsearch]     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[elasticsearch]     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[elasticsearch]     at java.lang.Thread.run(Thread.java:745)
[elasticsearch] Caused by: java.lang.IllegalArgumentException: Malformed content, found extra data after parsing: END_OBJECT
[elasticsearch]     at org.elasticsearch.index.mapper.DocumentParser.innerParseDocument(DocumentParser.java:142)
[elasticsearch]     ... 13 more

That's happening because, unlike in the string case, we do handle ignoreMalformed for the numeric types; but when we throw the exception here, we haven't parsed the field's object through XContentParser.Token.END_OBJECT, and that comes back to bite us later, here.

So, I think two things must be done:
(1) Honor the ignoreMalformed setting in StringFieldMapper, which is not happening today (hence the originally reported issue).
(2) Parse until the end of the current object before throwing IllegalArgumentException("unknown property [" + currentFieldName + "]"); in the Mapper classes, to prevent the exception I reported from happening. Or maybe just ignore this exception in innerParseDocument when ignoreMalformed is set?
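
A rough sketch of what (2) could look like - purely illustrative, not the actual DocumentParser code, and the ignoreMalformed plumbing here is assumed:

// Illustrative sketch only, not the real Elasticsearch code. On an unknown
// property, consume the rest of the offending object first, so the parser
// ends up past its END_OBJECT instead of leaving trailing tokens behind.
if (token == XContentParser.Token.START_OBJECT) {
    parser.skipChildren(); // advances the parser past the matching END_OBJECT
    if (ignoreMalformed.value() == false) {
        throw new IllegalArgumentException("unknown property [" + currentFieldName + "]");
    }
    // with ignore_malformed set, drop just this field and keep parsing
}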

Does this make sense, @clintongormley? I'll happily send a PR for this.

@clintongormley

Ah - I just realised that the original post refers to a string field, which doesn't support ignore_malformed...

@andrestc I agree with your second point, but I'm unsure about the first...

@rjernst what do you think?

@rjernst (Member) commented Nov 9, 2015

Sorry for the delayed response, I lost this one in email.

@clintongormley I think it is probably worth making the behavior consistent, and it does seem to me that finding an object where a specific piece of data is expected constitutes "malformed" data.

@andrestc A PR would be great.

@abulhol commented Dec 21, 2015

I want to upvote this issue!
I have fields in my JSON that are objects, but when they are empty, they contain an empty string, i.e. "" (this is the result of an XML2JSON parser). Now when I add a document where this is the case, I get a

 MapperParsingException[object mapping for [xxx] tried to parse field [xxx] as object, but found a concrete value]

This is not at all what I would expect from the documentation https://www.elastic.co/guide/en/elasticsearch/reference/2.0/ignore-malformed.html; please improve the documentation or fix the behavior (preferred!).

@clintongormley "i just realised that the original post refers to a string field, which doesn't support ignore_malformed..." Why should string fields not support ignore_malformed?

@megastef

+1

I think much more could be done, e.g. set the field to a default value and add an annotation to the document, so users can see what went wrong. In my case, all documents from Apache logs with "-" in the (integer) size field were dropped. I could tell you 100 stories about why Elasticsearch won't accept documents from real data sources ... (just to mention one more: #3714)

I think this problem could be handled much better:

  1. if a type error appears, try to convert the value (as an optional server/index setting). Correct JSON has numbers without quotes, but some sources put numbers in quotes as strings; in that case the string could be converted to an integer (see the sketch at the end of this comment).
  2. if the type does not fit, take a default value for the type (0, null) - or ignore the field as you do today, which is very bad if it is a larger object ...
  3. add a comment field like "_es_error_report: MapperParsingException: ....".
    That way users can see that something went wrong. Today, data just disappears when it fails to be indexed or a field is ignored. The sysadmin might see the error message in some logs, but users just wonder why the data in Elasticsearch is incomplete, and they might have no access to the Elasticsearch logs. In my case I missed all Apache messages with status code 500 and size "-" instead of 0 - which is really bad - and depends on the log parser ...

A good example is Logsene: it adds error annotations to failed documents together with the string version of the original source document (@sematext can catch Elasticsearch errors during the indexing process). So at least Logsene users can see failed index operations and the original document in their UI or in Kibana. Thanks to this feature I'm able to report this issue to you.

It would be nice if such improvements were available out of the box for all Elasticsearch users.
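
(For what it's worth, something close to point 1 already exists for numeric fields via the coerce mapping parameter, which is on by default; a minimal sketch, with a hypothetical index name:)

# hypothetical index; coerce is true by default for numeric types
PUT my-index
{
  "mappings": {
    "properties": {
      "size": { "type": "integer", "coerce": true }
    }
  }
}

# the quoted "42" is coerced to the integer 42; a value like "-" is
# still rejected unless ignore_malformed is also set on the field
POST my-index/_doc
{ "size": "42" }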

@abulhol commented Feb 1, 2016

any news here?

@balooka commented Mar 22, 2016

I wish to upvote this issue too.
My understanding of ignore_malformed's purpose is to not lose events, even if you might lose some of their content.
An issue similar to what has been described here is occurring in my current situation. Although it has been identified and multiple mid-term approaches are being looked into - in our case the issue relates to multiple sources sending similar events, so options like splitting the events into separate mappings, or cleaning up the events before they reach Elasticsearch, could work - I would have liked a short-term approach similar to the ignore_malformed functionality to be available in the meantime.

@BeccaMG commented May 3, 2016

Same problem with dates.

When adding an object with a field of type "date" that is empty, which in my DB is represented as "" (empty string), I get this error:

[DEBUG][action.admin.indices.mapping.put] [x] failed to put mappings on indices [[all]], type [seedMember]
java.lang.IllegalArgumentException: mapper [nms_recipient.birthDate] of different type, current_type [string], merged_type [date]

@satazor commented May 6, 2016

Same problem here. I'm using the ELK stack, where people may use the same properties with different types. I don't need those properties to be searchable, but I don't want to lose the entire event either. I thought ignore_malformed would do that, but apparently it doesn't work in all cases.

@jarlelin

We are having issues with this same feature. We have documents that sometimes decide to have objects inside something that was intended to hold strings. We would like to not lose the whole document just because one node of the data is malformed.

This is the behaviour I expected to get from setting ignore_malformed on the properties, and I would applaud such a feature.

@DaTebe commented Aug 26, 2016

Hey, I have the same problem. Is there any solution (even if it is a bit hacky) out there?

@goodfella1408 commented Sep 9, 2016

Facing this in Elasticsearch 2.3.1. Until this bug is fixed, we should at least have a list of the bad fields inside the mapper_parsing_exception error so that the app can choose to remove them. Currently there is no standard field in the error through which these keys can be retrieved -

"error":{"type":"mapper_parsing_exception","reason":"object mapping for [A.B.C.D] tried to parse field [D] as object, but found a concrete value"}}

The app would have to parse the reason string and extract A.B.C.D, which will fail if the error doc format changes. Additionally, the mapper_parsing_exception error itself must be using different formats for different parsing error scenarios, all of which would need to be handled by the app.

@BeccaMG commented Sep 14, 2016

I used a workaround for this matter following the recommendations from Elasticsearch forums and official documentation.

Declaring the mapping of the objects you want to index (if you know it), with ignore_malformed on dates and numbers, should do the trick. Tricky fields that could hold either string or nested content can simply be declared as object.
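
A minimal sketch of such a mapping (hypothetical index and field names, typeless syntax as in recent versions):

# hypothetical: ignore_malformed on dates and numbers, plain object for the rest
PUT my-index
{
  "mappings": {
    "properties": {
      "birthDate": { "type": "date", "ignore_malformed": true },
      "size": { "type": "integer", "ignore_malformed": true },
      "payload": { "type": "object" }
    }
  }
}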

@derEremit commented Sep 15, 2016

For usage as a real log stash, I would say something like #12366 (comment) is a must-have!
I can get accustomed to losing indexed fields, but losing whole log entries is a no-go for ELK from my perspective.

@micpotts

Bumping: this issue is preventing a number of my messages from being processed successfully, as a field object is returned as an empty string in rare cases.

@patrick-oyst

Bump, this is proving to be an extremely tedious (non) feature to work around.

@patrick-oyst commented Jan 19, 2017

I've found a way around this, but it comes at a cost. It could be worth it for those like me who want to avoid intervening directly in the data flow (like checking and fixing the log line yourself before sending it to ES) in the short term: set the enabled setting of your field to false. This will make the field non-searchable, though. That isn't too big an issue in my context, because the reason this field is so unpredictable is the same reason I need ignore_malformed to begin with, so it's not a particularly useful field to search on anyway, and you still have access to the data when you find the document via another field. Incidentally, this solves both situations: writing an object to a non-object field and vice versa.
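
A minimal sketch of that workaround (hypothetical index and field names):

# hypothetical: enabled: false keeps the field in _source but skips
# parsing and indexing, so it tolerates objects and scalars alike
PUT my-index
{
  "mappings": {
    "properties": {
      "unpredictable": {
        "type": "object",
        "enabled": false
      }
    }
  }
}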

Hope this helps. It certainly saved me a lot of trouble...

@jarlelin commented Jan 19, 2017 via email

@robinjha

+1

@clintongormley removed the good first issue (low hanging fruit) label Jan 26, 2017
@senseysensor

+1

1 similar comment
@marcovdkuur

+1

@EricMCornelius

Now that #29494 has been merged, is the intention to pick this issue back up for application to multiple types (specifically object vs. primitive mismatches)?

Curious what the direction is.

@jpountz (Contributor) commented May 23, 2018

I think so. I think I was the main person who voiced concerns about increasing the scope of ignore_malformed, which are now mitigated with #29494.

@schourode

@clintongormley In one of your comments, you mention

[…] the case where the same field name may be used with an object and (eg) a string. This would be easy to detect in (eg) an ingest pipeline and easy to fix by renaming (eg) the string field to my_field.string or similar, which would be compatible with the object mapping.

Can I ask you to elaborate a bit on how you would detect and fix this in an ingest pipeline? I cannot seem to find any processor that allows me to test whether a field is an object or a string. Would I have to resort to scripting?

@clintongormley

@schourode Yes you would have to use scripting.
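
For example, a minimal pipeline along those lines (hypothetical pipeline and field names; the script wraps a string value in an object so it no longer conflicts with the object mapping):

# hypothetical pipeline: wrap conflicting string values into an object field
PUT _ingest/pipeline/wrap-string-conflicts
{
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "if (ctx.my_field instanceof String) { ctx.my_field = ['string': ctx.my_field]; }"
      }
    }
  ]
}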

@cafuego (Contributor) commented Mar 4, 2019

In case anyone else runs into this issue as well, here is what I used. If the field json.error (supposed to be an object) is a text string, it's moved to the errormessage field; if it's an object, it remains unchanged.

curl -H 'Content-type: application/json' -s --write-out "%{response_code}" -o reindex.txt -XPOST -u user:pass http://127.0.0.1:9200/_reindex -d '{
  "conflicts": "proceed",
  "source": {
    "index": "terrible-rubbish"
  },
  "dest": {
    "index": "so-shiny"
  },
  "script": {
    "source": "if (ctx._source.json != null && ctx._source.json.error != null && ctx._source.json.error instanceof String) { ctx._source.errormessage = ctx._source.json.remove(\"error\"); }",
    "lang": "painless"
  }
}'

You need to check whether each parent object element is null, or you'll get an error if you hit an index entry where it is absent.

@Mekk commented Apr 20, 2019

I was referred here after raising #41372

Please, please, consider which options actual users have.

If „dirty” data is allowed to enter ES (and preferably flagged somehow), I can inspect it, analyze it, find it to test with, and count it. And I can see that it exists - with the full power of Kibana at my disposal.

If „dirty” data is rejected, I must visit the ES logs, with those horrible Java stacktraces, to find a cryptic error message about bulk post rejects. In most cases I don't even have a clue which data caused the problem or what the problem really is (see my #41372 for an example error; good luck guessing why it happened).

Regarding data loss: you fear business decisions made on data with a field missing? I can make those business decisions on a database that is missing 20% of its records entirely, because they were rejected (perhaps due to a minor field that is irrelevant in most cases). And unless I am the ES sysadmin, I won't even know (with dirty data I have a good chance of noticing problematic records while exploring, and I can even run sanity queries).

From ELK's own field: Logstash does a very good thing with its _grokparsefailure tags (which can be further improved to differentiate between rules with custom tags). Something is wrong? I see the records with those tags, can inspect them, count them, and analyze the situation.

@graphaelli (Member)

One issue to consider during implementation if this does get addressed: dynamic templates currently accept this setting even though it is rejected when directly mapping a field.

Mapping definition for [ignore_bool] has unsupported parameters: [ignore_malformed : true]:

PUT test
{
  "mappings": {
    "properties": {
      "ignore_bool": {
        "type": "boolean",
        "ignore_malformed": true
      }
    }
  }
}

ok:

PUT test
{
  "mappings": {
    "dynamic_templates": [
      {
        "labels": {
          "path_match": "ignore_bools.*",
          "match_mapping_type": "string",
          "mapping": {
            "type": "boolean",
            "ignore_malformed": true
          }
        }
      }
    ]
  }
}

The resulting fields are created without issue also.

worldjoe added a commit to worldjoe/elasticsearch that referenced this issue Jun 11, 2019
worldjoe added a commit to worldjoe/elasticsearch that referenced this issue Jun 11, 2019
worldjoe added a commit to worldjoe/elasticsearch that referenced this issue Jun 11, 2019
@rjernst added the Team:Search (Meta label for search team) label May 4, 2020
@tinarooot commented Dec 15, 2020

I'm using Spring Cloud Alibaba integrated with Elasticsearch and get the following error when creating an index:

ES version 7.6.2
Spring Boot 2.3.4
Spring Cloud Alibaba 2.2.3
Spring Cloud Hoxton.SR8

ElasticsearchStatusException[Elasticsearch exception [type=parse_exception, reason=Failed to parse content to map]]; nested: ElasticsearchException[Elasticsearch exception [type=json_parse_exception, reason=Invalid UTF-8 start byte 0xb5 at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@762d957b; line: 1, column: 86]]];

@tmenier commented Dec 3, 2021

@jpountz Are you the decision maker on this? You've argued against it repeatedly, which I disagree with, but if it is what it is then maybe this issue should be closed as a won't-fix? Or am I wrong and it's really still under consideration?

@javanna removed this from Search & Aggs in Background tasks Aug 2, 2022
@javanna changed the title from "Invalid mapping case not handled by index.mapping.ignore_malformed" to "ignore_malformed to support ignoring JSON objects submitted to fields of the wrong type" Aug 22, 2022
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-search (Team:Search)

@javanna changed the title from "ignore_malformed to support ignoring JSON objects submitted to fields of the wrong type" to "ignore_malformed to support ignoring JSON objects ingested into fields of the wrong type" Aug 22, 2022
@felixbarny (Member)

In Elastic Observability, we're working on making log ingestion more resilient. In that context, we've discussed how to deal with object/scalar conflicts more gracefully and whether it makes sense to prioritize this issue or whether there are other alternatives.

A little while ago, Elasticsearch introduced the subobjects mapping parameter. Setting that parameter to false allows documents to contain, for example, both a foo and a foo.bar field. Instead of ignoring one of these fields (what's proposed in this issue), both fields can be indexed successfully.
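
A minimal sketch (hypothetical index name):

# hypothetical: with subobjects disabled, foo and foo.bar coexist as leaf fields
PUT logs-example
{
  "mappings": {
    "subobjects": false,
    "properties": {
      "foo": { "type": "keyword" },
      "foo.bar": { "type": "keyword" }
    }
  }
}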

Therefore, we're considering making subobjects: false the default in the built-in index template for logs-*-* in the future. After #97972 has been implemented, this will be a backwards-compatible change, as Elasticsearch will then be able to accept both nested and flattened keys. One caveat of this mapping parameter, however, is that it doesn't support the nested field type.

With that in mind, are there still use cases for supporting ignore_malformed for objects?

@EricMCornelius

@felixbarny - what's the implication there for any existing Painless scripts etc. that currently rely on iterating over sub-objects?

I hadn't noticed the new setting before, so I'm just taking a cursory glance, but are we now expecting all source documents to be flattened everywhere?

That sounds like a massive efficiency hit when you need to do selective source document filtering, not to mention a fair amount of data bloat on the wire with deeply nested prefixes being repeated?

@felixbarny (Member)

Hey Eric,

After #97972 has been implemented, the _source does not need to change; it's just about how the documents are mapped internally. However, as explained in the docs for the object field type, at the Lucene level all fields are stored as a flat key/value mapping anyway. In summary, subobjects: false affects neither the _source of the documents nor how they're stored; it's just a different way in which they're mapped and parsed.

If you're unsure whether your source documents contain nested or flattened fields, you can use the field API in Painless scripts, which is able to access fields in either notation. We're also working on adding support for accessing dotted fields in ingest processors: #96648.
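
A sketch of the field API usage (assuming a hypothetical index where foo.bar is mapped, whether the source documents used nested or dotted notation):

# hypothetical search: field() resolves foo.bar regardless of source notation
GET logs-example/_search
{
  "script_fields": {
    "bar": {
      "script": {
        "source": "field('foo.bar').get('')"
      }
    }
  }
}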

But again, you don't need to change the structure of your documents when sending them to Elasticsearch. The idea is that dotted and nested fields are treated equally in all places.

Having said that, in OpenTelemetry all attributes are by definition flat key/value pairs. As we continue to improve support for OpenTelemetry, we may map OTel data with flattened keys.

selective source document filtering

I'd assume that source filtering using wildcards would still work as expected.

fair amount of data bloat on the wire with deeply nested prefixes being repeated

That's fair. But I'd expect compression to mostly take care of that anyway?
