
ignore_malformed to support ignoring JSON objects ingested into fields of the wrong type #12366

Open · samcday opened this issue Jul 21, 2015 · 71 comments
Labels
>enhancement · help wanted (adoptme) · :Search/Mapping (Index mappings, including merging and defining field types) · Team:Search (Meta label for search team)

Comments

@samcday commented Jul 21, 2015

Indexing a document with an object value for a field that has already been mapped as a string causes a MapperParsingException, even if index.mapping.ignore_malformed has been enabled.

Reproducible test case

On Elasticsearch 1.6.0:

$ curl -XPUT localhost:9200/broken -d'{"settings":{"index.mapping.ignore_malformed": true}}'
{"acknowledged":true}

$ curl -XPOST localhost:9200/broken/type -d '{"test":"a string"}'
{"_index":"broken","_type":"type","_id":"AU6wNDGa_qDGqxty2Dvw","_version":1,"created":true}

$ curl -XPOST localhost:9200/broken/type -d '{"test":{"nested":"a string"}}'
{"error":"MapperParsingException[failed to parse [test]]; nested: ElasticsearchIllegalArgumentException[unknown property [nested]]; ","status":400}

$ curl localhost:9200/broken/_mapping
{"broken":{"mappings":{"type":{"properties":{"test":{"type":"string"}}}}}}

Expected behaviour

When index.mapping.ignore_malformed is enabled, indexing a document that supplies an object where Elasticsearch expects a string should not fail the whole document. Instead, the invalid object field should be ignored.

@clintongormley added the discuss and :Search/Mapping (Index mappings, including merging and defining field types) labels Jul 23, 2015
@clintongormley

+1

@andrestc (Contributor)

While working on this issue, I found that it fails for other types too, but for a different reason. For example, with an integer field:

$ curl -XPUT localhost:9200/broken -d'{"settings":{"index.mapping.ignore_malformed": true}}'
{"acknowledged":true}

$ curl -XPOST localhost:9200/broken/type -d '{"test2": 10}'
{"_index":"broken","_type":"type","_id":"AU6wNDGa_qDGqxty2Dvw","_version":1,"created":true}

$ curl -XPOST localhost:9200/broken/type -d '{"test2":{"nested": 20}}'
[elasticsearch] [2015-09-26 02:20:23,380][DEBUG][action.index             ] [Tyrant] [broken][1], node[7WAPN-92TAeuFYbRLVqf8g], [P], v[2], s[STARTED], a[id=WlYpBZ6vTXS-4WMvAypeTA]: Failed to execute [index {[broken][type][AVAIGFNQZ9WMajLk5l0S], source[{"test2":{"nested":1}}]}]
[elasticsearch] MapperParsingException[failed to parse]; nested: IllegalArgumentException[Malformed content, found extra data after parsing: END_OBJECT];
[elasticsearch]     at org.elasticsearch.index.mapper.DocumentParser.innerParseDocument(DocumentParser.java:157)
[elasticsearch]     at org.elasticsearch.index.mapper.DocumentParser.parseDocument(DocumentParser.java:77)
[elasticsearch]     at org.elasticsearch.index.mapper.DocumentMapper.parse(DocumentMapper.java:319)
[elasticsearch]     at org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:475)
[elasticsearch]     at org.elasticsearch.index.shard.IndexShard.prepareCreate(IndexShard.java:466)
[elasticsearch]     at org.elasticsearch.action.support.replication.TransportReplicationAction.prepareIndexOperationOnPrimary(TransportReplicationAction.java:1053)
[elasticsearch]     at org.elasticsearch.action.support.replication.TransportReplicationAction.executeIndexRequestOnPrimary(TransportReplicationAction.java:1061)
[elasticsearch]     at org.elasticsearch.action.index.TransportIndexAction.shardOperationOnPrimary(TransportIndexAction.java:170)
[elasticsearch]     at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase.performOnPrimary(TransportReplicationAction.java:580)
[elasticsearch]     at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryPhase$1.doRun(TransportReplicationAction.java:453)
[elasticsearch]     at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
[elasticsearch]     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
[elasticsearch]     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
[elasticsearch]     at java.lang.Thread.run(Thread.java:745)
[elasticsearch] Caused by: java.lang.IllegalArgumentException: Malformed content, found extra data after parsing: END_OBJECT
[elasticsearch]     at org.elasticsearch.index.mapper.DocumentParser.innerParseDocument(DocumentParser.java:142)
[elasticsearch]     ... 13 more

That's happening because, unlike in the string case, we do handle ignoreMalformed for the numeric types; but when we throw the exception here, we haven't parsed the field's object through XContentParser.Token.END_OBJECT, and that comes back to bite us later, here.

So, I think two things must be done:
(1) Honor the ignoreMalformed setting in StringFieldMapper, which is not happening today (hence the originally reported issue).
(2) Parse until the end of the current object before throwing IllegalArgumentException("unknown property [" + currentFieldName + "]"); in the Mapper classes, to prevent the exception I reported from happening. Or maybe just ignore this exception in innerParseDocument when ignoreMalformed is set?
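
A rough sketch of what (2) could look like - purely illustrative, not the actual DocumentParser code, and the ignoreMalformed plumbing here is assumed:

// Illustrative sketch only, not the real Elasticsearch code. On an unknown
// property, consume the rest of the offending object first, so the parser
// ends up past its END_OBJECT instead of leaving trailing tokens behind.
if (token == XContentParser.Token.START_OBJECT) {
    parser.skipChildren(); // advances the parser past the matching END_OBJECT
    if (ignoreMalformed.value() == false) {
        throw new IllegalArgumentException("unknown property [" + currentFieldName + "]");
    }
    // with ignore_malformed set, drop just this field and keep parsing
}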

Does this make sense, @clintongormley? I'll happily send a PR for this.

@clintongormley

Ah - I just realised that the original post refers to a string field, which doesn't support ignore_malformed...

@andrestc I agree with your second point, but I'm unsure about the first...

@rjernst what do you think?

@rjernst (Member) commented Nov 9, 2015

Sorry for the delayed response, I lost this one in email.

@clintongormley I think it is probably worth making the behavior consistent, and it does seem to me that finding an object where a specific piece of data is expected constitutes "malformed" data.

@andrestc A PR would be great.

@abulhol commented Dec 21, 2015

I want to upvote this issue!
I have fields in my JSON that are objects, but when they are empty, they contain an empty string, i.e. "" (this is the result of an XML2JSON parser). Now when I add a document where this is the case, I get a

 MapperParsingException[object mapping for [xxx] tried to parse field [xxx] as object, but found a concrete value]

This is not at all what I would expect from the documentation https://www.elastic.co/guide/en/elasticsearch/reference/2.0/ignore-malformed.html; please improve the documentation or fix the behavior (preferred!).

@clintongormley "i just realised that the original post refers to a string field, which doesn't support ignore_malformed..." Why should string fields not support ignore_malformed?

@megastef

+1

I think much more could be done, e.g. set the field to a default value and add an annotation to the document, so users can see what went wrong. In my case, all documents from Apache logs with "-" in the (integer) size field were dropped. I could tell you 100 stories about why Elasticsearch won't accept documents from real data sources ... (just to mention one more: #3714)

I think this problem could be handled much better:

  1. if a type error appears, try to convert the value (as an optional server/index setting). Correct JSON has numbers without quotes, but some sources put numbers in quotes as strings; in that case the string could be converted to an integer (see the sketch at the end of this comment).
  2. if the type does not fit, take a default value for the type (0, null) - or ignore the field as you do today, which is very bad if it is a larger object ...
  3. add a comment field like "_es_error_report: MapperParsingException: ....".
    That way users can see that something went wrong. Today, data just disappears when it fails to be indexed or a field is ignored. The sysadmin might see the error message in some logs, but users just wonder why the data in Elasticsearch is incomplete, and they might have no access to the Elasticsearch logs. In my case I missed all Apache messages with status code 500 and size "-" instead of 0 - which is really bad - and depends on the log parser ...

A good example is Logsene: it adds error annotations to failed documents together with the string version of the original source document (@sematext can catch Elasticsearch errors during the indexing process). So at least Logsene users can see failed index operations and the original document in their UI or in Kibana. Thanks to this feature I'm able to report this issue to you.

It would be nice if such improvements were available out of the box for all Elasticsearch users.
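
(For what it's worth, something close to point 1 already exists for numeric fields via the coerce mapping parameter, which is on by default; a minimal sketch, with a hypothetical index name:)

# hypothetical index; coerce is true by default for numeric types
PUT my-index
{
  "mappings": {
    "properties": {
      "size": { "type": "integer", "coerce": true }
    }
  }
}

# the quoted "42" is coerced to the integer 42; a value like "-" is
# still rejected unless ignore_malformed is also set on the field
POST my-index/_doc
{ "size": "42" }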

@abulhol commented Feb 1, 2016

any news here?

@balooka commented Mar 22, 2016

I wish to upvote this issue too.
My understanding of ignore_malformed's purpose is to not lose events, even if you might lose some of their content.
An issue similar to what has been described here is occurring in my current situation. Although it has been identified and multiple mid-term approaches are being looked into - in our case the issue relates to multiple sources sending similar events, so options like splitting the events into separate mappings, or cleaning up the events before they reach Elasticsearch, could work - I would have liked a short-term approach similar to the ignore_malformed functionality to be available in the meantime.

@BeccaMG commented May 3, 2016

Same problem with dates.

When adding an object with a field of type "date" that is empty, which in my DB is represented as "" (empty string), I get this error:

[DEBUG][action.admin.indices.mapping.put] [x] failed to put mappings on indices [[all]], type [seedMember]
java.lang.IllegalArgumentException: mapper [nms_recipient.birthDate] of different type, current_type [string], merged_type [date]

@satazor commented May 6, 2016

Same problem here. I'm using the ELK stack, where people may use the same properties with different types. I don't need those properties to be searchable, but I don't want to lose the entire event either. I thought ignore_malformed would do that, but apparently it doesn't work in all cases.

@jarlelin

We are having issues with this same feature. We have documents that sometimes decide to have objects inside something that was intended to hold strings. We would like to not lose the whole document just because one node of the data is malformed.

This is the behaviour I expected to get from setting ignore_malformed on the properties, and I would applaud such a feature.

@DaTebe commented Aug 26, 2016

Hey, I have the same problem. Is there any solution (even if it is a bit hacky) out there?

@goodfella1408 commented Sep 9, 2016

Facing this in Elasticsearch 2.3.1. Until this bug is fixed, we should at least have a list of the bad fields inside the mapper_parsing_exception error so that the app can choose to remove them. Currently there is no standard field in the error through which these keys can be retrieved -

"error":{"type":"mapper_parsing_exception","reason":"object mapping for [A.B.C.D] tried to parse field [D] as object, but found a concrete value"}}

The app would have to parse the reason string and extract A.B.C.D, which will fail if the error doc format changes. Additionally, the mapper_parsing_exception error itself must be using different formats for different parsing error scenarios, all of which would need to be handled by the app.

@BeccaMG commented Sep 14, 2016

I used a workaround for this matter following the recommendations from Elasticsearch forums and official documentation.

Declaring the mapping of the objects you want to index (if you know it), with ignore_malformed on dates and numbers, should do the trick. Tricky fields that could hold either string or nested content can simply be declared as object.
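
A minimal sketch of such a mapping (hypothetical index and field names, typeless syntax as in recent versions):

# hypothetical: ignore_malformed on dates and numbers, plain object for the rest
PUT my-index
{
  "mappings": {
    "properties": {
      "birthDate": { "type": "date", "ignore_malformed": true },
      "size": { "type": "integer", "ignore_malformed": true },
      "payload": { "type": "object" }
    }
  }
}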

@derEremit commented Sep 15, 2016

For usage as a real log stash, I would say something like #12366 (comment) is a must-have!
I can get accustomed to losing indexed fields, but losing whole log entries is a no-go for ELK from my perspective.

@micpotts

Bumping: this issue is preventing a number of my messages from being processed successfully, as a field object is returned as an empty string in rare cases.

@patrick-oyst

Bump, this is proving to be an extremely tedious (non) feature to work around.

@patrick-oyst commented Jan 19, 2017

I've found a way around this, but it comes at a cost. It could be worth it for those like me who want to avoid intervening directly in the data flow (like checking and fixing the log line yourself before sending it to ES) in the short term: set the enabled setting of your field to false. This will make the field non-searchable, though. That isn't too big an issue in my context, because the reason this field is so unpredictable is the same reason I need ignore_malformed to begin with, so it's not a particularly useful field to search on anyway, and you still have access to the data when you find the document via another field. Incidentally, this solves both situations: writing an object to a non-object field and vice versa.
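
A minimal sketch of that workaround (hypothetical index and field names):

# hypothetical: enabled: false keeps the field in _source but skips
# parsing and indexing, so it tolerates objects and scalars alike
PUT my-index
{
  "mappings": {
    "properties": {
      "unpredictable": {
        "type": "object",
        "enabled": false
      }
    }
  }
}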

Hope this helps. It certainly saved me a lot of trouble...

@jarlelin commented Jan 19, 2017 via email

@robinjha

+1

@clintongormley removed the good first issue (low hanging fruit) label Jan 26, 2017
@senseysensor

+1

1 similar comment
@marcovdkuur

+1

@EricMCornelius

Now that #29494 has been merged, is the intention to pick this issue back up for application to multiple types (specifically object vs. primitive mismatches)?

Curious what the direction is.

@jpountz (Contributor) commented May 23, 2018

I think so. I think I was the main person who voiced concerns about increasing the scope of ignore_malformed, which are now mitigated with #29494.

@schourode

@clintongormley In one of your comments, you mention

[…] the case where the same field name may be used with an object and (eg) a string. This would be easy to detect in (eg) an ingest pipeline and easy to fix by renaming (eg) the string field to my_field.string or similar, which would be compatible with the object mapping.

Can I ask you to elaborate a bit on how you would detect and fix this in an ingest pipeline? I cannot seem to find any processor that allows me to test whether a field is an object or a string. Would I have to resort to scripting?

@clintongormley

@schourode Yes you would have to use scripting.
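
For example, a minimal pipeline along those lines (hypothetical pipeline and field names; the script wraps a string value in an object so it no longer conflicts with the object mapping):

# hypothetical pipeline: wrap conflicting string values into an object field
PUT _ingest/pipeline/wrap-string-conflicts
{
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "if (ctx.my_field instanceof String) { ctx.my_field = ['string': ctx.my_field]; }"
      }
    }
  ]
}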

@cafuego (Contributor) commented Mar 4, 2019

In case anyone else runs into this issue as well, here is what I used. If the field json.error (supposed to be an object) is a text string, it's moved to the errormessage field; if it's an object, it remains unchanged.

curl -H 'Content-type: application/json' -s --write-out "%{response_code}" -o reindex.txt -XPOST -u user:pass http://127.0.0.1:9200/_reindex -d '{
  "conflicts": "proceed",
  "source": {
    "index": "terrible-rubbish"
  },
  "dest": {
    "index": "so-shiny"
  },
  "script": {
    "source": "if (ctx._source.json != null && ctx._source.json.error != null && ctx._source.json.error instanceof String) { ctx._source.errormessage = ctx._source.json.remove(\"error\"); }",
    "lang": "painless"
  }
}'

You need to check whether each parent object element is null, or you'll get an error if you hit an index entry where it is absent.

@Mekk commented Apr 20, 2019

I was referred here after raising #41372

Please, please, consider which options actual users have.

If „dirty” data is allowed to enter ES (and preferably flagged somehow), I can inspect it, analyze it, find it to test with, and count it. And I can see that it exists - with the full power of Kibana at my disposal.

If „dirty” data is rejected, I must visit the ES logs, with those horrible Java stacktraces, to find a cryptic error message about bulk post rejects. In most cases I don't even have a clue which data caused the problem or what the problem really is (see my #41372 for an example error; good luck guessing why it happened).

Regarding data loss: you fear business decisions made on data with a field missing? I can make those business decisions on a database that is missing 20% of its records entirely, because they were rejected (perhaps due to a minor field that is irrelevant in most cases). And unless I am the ES sysadmin, I won't even know (with dirty data I have a good chance of noticing problematic records while exploring, and I can even run sanity queries).

From ELK's own field: Logstash does a very good thing with its _grokparsefailure tags (which can be further improved to differentiate between rules with custom tags). Something is wrong? I see the records with those tags, can inspect them, count them, and analyze the situation.

@graphaelli (Member)

One issue to consider during implementation if this does get addressed: dynamic templates currently accept this setting even though it is rejected when directly mapping a field.

Mapping definition for [ignore_bool] has unsupported parameters: [ignore_malformed : true]:

PUT test
{
  "mappings": {
    "properties": {
      "ignore_bool": {
        "type": "boolean",
        "ignore_malformed": true
      }
    }
  }
}

ok:

PUT test
{
  "mappings": {
    "dynamic_templates": [
      {
        "labels": {
          "path_match": "ignore_bools.*",
          "match_mapping_type": "string",
          "mapping": {
            "type": "boolean",
            "ignore_malformed": true
          }
        }
      }
    ]
  }
}

The resulting fields are created without issue also.

worldjoe added a commit to worldjoe/elasticsearch that referenced this issue Jun 11, 2019
worldjoe added a commit to worldjoe/elasticsearch that referenced this issue Jun 11, 2019
worldjoe added a commit to worldjoe/elasticsearch that referenced this issue Jun 11, 2019
@rjernst added the Team:Search (Meta label for search team) label May 4, 2020
@tinarooot commented Dec 15, 2020

I'm using Spring Cloud Alibaba integrated with Elasticsearch and get the following error when creating an index:

ES version 7.6.2
Spring Boot 2.3.4
Spring Cloud Alibaba 2.2.3
Spring Cloud Hoxton.SR8

ElasticsearchStatusException[Elasticsearch exception [type=parse_exception, reason=Failed to parse content to map]]; nested: ElasticsearchException[Elasticsearch exception [type=json_parse_exception, reason=Invalid UTF-8 start byte 0xb5 at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@762d957b; line: 1, column: 86]]];

@tmenier commented Dec 3, 2021

@jpountz Are you the decision maker on this? You've argued against it repeatedly, which I disagree with, but if it is what it is then maybe this issue should be closed as a won't-fix? Or am I wrong and it's really still under consideration?

@javanna removed this from Search & Aggs in Background tasks Aug 2, 2022
@javanna changed the title from "Invalid mapping case not handled by index.mapping.ignore_malformed" to "ignore_malformed to support ignoring JSON objects submitted to fields of the wrong type" Aug 22, 2022
@elasticsearchmachine (Collaborator)

Pinging @elastic/es-search (Team:Search)

@javanna changed the title from "ignore_malformed to support ignoring JSON objects submitted to fields of the wrong type" to "ignore_malformed to support ignoring JSON objects ingested into fields of the wrong type" Aug 22, 2022
@felixbarny (Member)

In Elastic Observability, we're working on making log ingestion more resilient. In that context, we've discussed how to deal with object/scalar conflicts more gracefully and whether it makes sense to prioritize this issue or whether there are other alternatives.

A little while ago, Elasticsearch introduced the subobjects mapping parameter. Setting that parameter to false allows documents to contain, for example, both a foo and a foo.bar field. Instead of ignoring one of these fields (what's proposed in this issue), both fields can be indexed successfully.
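
A minimal sketch (hypothetical index name):

# hypothetical: with subobjects disabled, foo and foo.bar coexist as leaf fields
PUT logs-example
{
  "mappings": {
    "subobjects": false,
    "properties": {
      "foo": { "type": "keyword" },
      "foo.bar": { "type": "keyword" }
    }
  }
}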

Therefore, we're considering making subobjects: false the default in the built-in index template for logs-*-* in the future. After #97972 has been implemented, this will be a backwards-compatible change, as Elasticsearch will then be able to accept both nested and flattened keys. One caveat of this mapping parameter, however, is that it doesn't support the nested field type.

With that in mind, are there still use cases for supporting ignore_malformed for objects?

@EricMCornelius

@felixbarny - what's the implication there for any existing Painless scripts etc. that currently rely on iterating over sub-objects?

I hadn't noticed the new setting before, so I'm just taking a cursory glance, but are we now expecting all source documents to be flattened everywhere?

That sounds like a massive efficiency hit when you need to do selective source document filtering, not to mention a fair amount of data bloat on the wire with deeply nested prefixes being repeated?

@felixbarny (Member)

Hey Eric,

After #97972 has been implemented, the _source does not need to change; it's just about how the documents are mapped internally. However, as explained in the docs for the object field type, at the Lucene level all fields are stored as a flat key/value mapping anyway. In summary, subobjects: false affects neither the _source of the documents nor how they're stored; it's just a different way in which they're mapped and parsed.

If you're unsure whether your source documents contain nested or flattened fields, you can use the field API in Painless scripts, which is able to access fields in either notation. We're also working on adding support for accessing dotted fields in ingest processors: #96648.
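
A sketch of the field API usage (assuming a hypothetical index where foo.bar is mapped, whether the source documents used nested or dotted notation):

# hypothetical search: field() resolves foo.bar regardless of source notation
GET logs-example/_search
{
  "script_fields": {
    "bar": {
      "script": {
        "source": "field('foo.bar').get('')"
      }
    }
  }
}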

But again, you don't need to change the structure of your documents when sending them to Elasticsearch. The idea is that dotted and nested fields are treated equally in all places.

Having said that, in OpenTelemetry all attributes are by definition flat key/value pairs. As we continue to improve support for OpenTelemetry, we may map OTel data with flattened keys.

selective source document filtering

I'd assume that source filtering using wildcards would still work as expected.

fair amount of data bloat on the wire with deeply nested prefixes being repeated

That's fair. But I'd expect compression to mostly take care of that anyway?
