Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bad map keys can not be unmarshaled #523

Closed
mensinda opened this issue May 3, 2022 · 6 comments · Fixed by #531
Closed

Bad map keys can not be unmarshaled #523

mensinda opened this issue May 3, 2022 · 6 comments · Fixed by #531

Comments

@mensinda
Copy link
Contributor

mensinda commented May 3, 2022

Specifically map keys like 000 and 111 will be marshaled as <000> and <111>, which can no longer be unmarshaled. Jackson should make sure to escape those keys correctly.

Example:

package it;

import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.dataformat.xml.XmlMapper;

import java.util.Map;

public class BadMap {
    public static class DTO {
        public Map<String, String> badMap = Map.of("000", "foo", "111", "bar");
    }

    public static void main(String ... args) throws JsonProcessingException {
        DTO dto = new DTO();

        XmlMapper mapper = new XmlMapper();

        final String res = mapper.writeValueAsString(dto);

        // <DTO><badMap><000>foo</000><111>bar</111></badMap></DTO>
        System.out.println(res);

        // ERROR!
        // com.fasterxml.jackson.core.JsonParseException: Unexpected character '0' (code 48) in content after '<' (malformed start element?).
        mapper.readValue(res, DTO.class);
    }
}

jackson version: 2.13.2

@cowtowncoder
Copy link
Member

cowtowncoder commented May 4, 2022

Unfortunately it is not possible to escape element names in XML (unlike Character Data content) so it is not clear what could and should be done.
I agree that producing non-well-formed XML is not a good thing to do.

One thing that would be possible would be to use some sort of name-mangling to automatically change names to perhaps:

  1. Prepend a character -- underscore? (_) -- if the first character is valid XML character except as the initial one
  2. For characters that are never valid, replace with some placeholder (once again, underscore?)

Unfortunately the actual rules for XML name character validity are VERY complicated (I know, wrote 2 XML parsers, Woodstox and Aalto :) ) so often times it is easier to focus on low-hanging fruits.

But if this was done, output would at least be well-formed.

EDIT: just realized that this is focused on java.util.Map handling and I conflated some issues with that of POJO value handling. So this comment in particular is bit off. Hopefully my later comment makes more sense.

@mensinda
Copy link
Contributor Author

mensinda commented May 5, 2022

Alternatively, instead of trying to escape the keys, maybe maps could always be generated like this:

<badMap>
  <item>
    <key>000</key>
    <value>foo</value>
  </item>
  <item>
    <key>111</key>
    <value>bar</value>
  </item>
</badMap>

Of course, this is FAR more verbose than the original syntax, but it should work. Alternatively, maybe XML attributes could also be used like in the following example this?

<badMap>
   <item name="000">foo</item>
   <item name="111">bar</item>
</badMap>

Guaranted, I don't know the internals of Jackson, so I am not sure how easy that would be.

@mensinda
Copy link
Contributor Author

mensinda commented May 5, 2022

I do realise that both options I have presented here are breaking format changes, which might be problematic. If this is indeed a problem, maybe this could be toggled via a ToXmlGenerator.Feature. Alternatively, a mix of my second example and the existing generated code could be used:

<badMap>
   <legalKey>Hello World!</legalKey>
   <item name="item">I guess this would be a bit awkward but I guess it works...</item>
   <item name="000">foo</item>
</badMap>

I am not a huge fan of escaping / mangling characters since the main use case for me is serializing / deserializing Objects from and to a human readable format.

@cowtowncoder
Copy link
Member

Alas, even if Map structure was changed -- which of course would be major incompatibility as you pointed out, unless defined as opt-in... plus, Map handling is not format-specific so tricky to actually implement -- we still have the same problem with just POJO property names; @JsonProperty may be used to indicate name like 111.

So I don't think there is any way around either name-mangling or erroring out.
I agree with you on everything above; not a big fan of name-mangling. But also don't see ways around it, aside form just failing on possible error.

@mensinda
Copy link
Contributor Author

mensinda commented May 6, 2022

So I don't think there is any way around either name-mangling or erroring out.

My problem is, that I don't see a good way to reliably name-mangle something like 00 foo <N/A> $$$ bar="Hello World!" in a way that is 100% reversible (and yes we might have stuff like this in there because of user input).

But also don't see ways around it, aside form just failing on possible error.

While this is certainly not ideal it is still far better than producing invalid XML.

@JsonProperty may be used to indicate name like 111

Yeah, I didn't think about that one.

So I don't think there is any way around either name-mangling or erroring out.

Here I would disagree, especially since I don't see how name-mangling could be made reversible and that there is the option of adding a new flag to opt into new behavior (WRITE_NULLS_AS_XSI_NIL for instance).

To solve this, my proposal would be:

  1. To error out by default when any key or @JsonProperty is provided that can not fit inside a XML tag. As I mentioned, this should ideally happen regardless, since it is always a good idea to fail early.

  2. To provide a flag ToXmlGenerator.Feature (or something, you probably know better here) to escape all invalid XML tag names with attribute strings:

    <!-- Without escaping
    <00 foo <N/A> $$$ bar="Hello World!">oh no</00 foo <N/A> $$$ bar="Hello World!">
    -->
    <!-- With escaping -->
    <complexTag name="00 foo <N/A> $$$ bar=&quot;Hello World!&quot;">works</complexTag>
    <complexTag name="complexTag">Not ideal but what can you do...</complexTag>
  3. Optionally, provide an additional flag to always generate maps this way for consistancy.

This way, the resulting XML is still always easily readable, the serialization / deserialization process is 100% reversible, and jackson-dataformat-xml is 100% backwards compatible by default. However, I don't know how much work this would be to implement since I am not familiar with the code base.


If all this is not an option, than maybe instead of name mangling this could be done?

  • Always escape leading _ as __
  • If there would be invalid chars in the tag, encode the tag as base32 (or hex, base64 also has probelematic chars) and emit _GAYCAZTPN4QDYTRPIE7CAJBEEQQGEYLSHUREQZLMNRXSAV3POJWGIIJC (trailing = replaced with _)

@cowtowncoder
Copy link
Member

Correct: there is no way to make name-mangling reversible. For POJOs this is non-optimal but doable.
But not so with Maps. So the solution would need to be different for Maps -- probably changing serialization structure.

But while both use logical structure of (JSON) Object for serialization, their handling is very different within jackson-databind; and making it possible to (some degree) use different handling.

And this is where the challenge wrt Maps comes from: Map serialization/deserialization is handled by jackson-databind, in format-agnostic way -- it simply processes streaming level tokens/events, as if all content had logical structure of JSON.
It is of course possible to override many things by format module adding handlers, including custom (de)serializers (via XmlModule). But that in itself adds complexity.

It might be possible to define new JsonFormat.Shape option for Map handling: JsonFormat.Shape.ARRAY. This could perhaps be used to solve the issue, but I think that might already run into trouble wrt XML's not having native Array type (meaning that streaming token handling might become non-trivial).

mensinda added a commit to mensinda/jackson-dataformat-xml that referenced this issue Jun 2, 2022
This commit introduces the `PROCESS_ESCAPED_MALFORMED_TAGS` and
`ESCAPE_MALFORMED_TAGS` features that control whether invalid
tag names will be escaped with an attribute.

fixes FasterXML#523
fixes FasterXML#524
mensinda added a commit to mensinda/jackson-dataformat-xml that referenced this issue Jun 22, 2022
This commit adds an extendable `XmlTagProcessor` that is used for
escaping invalid characters in XML tag names.

fixes FasterXML#523
fixes FasterXML#524
cowtowncoder pushed a commit that referenced this issue Sep 12, 2022
This commit adds an extendable `XmlTagProcessor` that is used for
escaping invalid characters in XML tag names.

fixes #523
fixes #524
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants