Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Decoding special whitespace characters #219

Closed
wooj2 opened this issue Jul 6, 2021 · 12 comments · Fixed by #222
Closed

Decoding special whitespace characters #219

wooj2 opened this issue Jul 6, 2021 · 12 comments · Fixed by #222
Assignees
Labels
question Further information is requested

Comments

@wooj2
Copy link
Contributor

wooj2 commented Jul 6, 2021

Hi,

I'm running into an issue where I think decoding is happening improperly with special characters.

For example:

Assuming the following structure:

struct SimpleScalar: Equatable, Decodable {
    public let stringValue: String?
}

I believe I should be able to write the following code:

func decodeScalar_withSpaces() {
    let sourceXML = """
    <SimpleScalar>
        <stringValue>escaped data: &amp;lt;&#xD;&#10;</stringValue>
    </SimpleScalar>
    """

    let decoder = XMLDecoder()
    let decoded = try! decoder.decode(SimpleScalar.self, from: Data(sourceXML.utf8))

    assert(decoded.stringValue == "escaped data: &lt;\r\n")
}

However, the assert fails because this string is decoded as escaped data: &lt; instead of the expected value of escaped data: &lt;\r\n.

  1. Is this a bug in the way we are handling these special characters?
  2. I've attempted to remedy this by turning trimValueWhitespace to false (which does allow the test to pass), but this doesn't seem to be the correct path to resolving this issue. Does anyone have any advice on making this work without having to set to trimValueWhitespace to false? I have some other cases where turning trimValueWhitespace to false will make other things difficult for my use case.
@MaxDesiatov
Copy link
Collaborator

I think setting trimValueWhitespace to false is the way to go here. What exactly is preventing you from using that property? Would you be able to provide a test case? If you aren't satisfied with the behavior of trimValueWhitespace, how exactly would you like it to be changed, or what other configuration would you like to be introduced?

@MaxDesiatov MaxDesiatov added the question Further information is requested label Jul 7, 2021
@wooj2
Copy link
Contributor Author

wooj2 commented Jul 9, 2021

Hi Max,

Thanks for getting back to me! Sorry for the delay-- I've been feeling under the weather for the last couple of days.

So, to give you some context, we are generating custom decoder (and encoder) logic, because we have particular requirements about how dates/lists/maps (and nested lists and maps) are serialized and deserialized. This is not ideal, but seems required given how many requirements we have with regards to how data can be serialized/deserialized.

That being said, in the following example, if we used the auto-synthesized decoder logic for the following example, the problem goes away, however, as mentioned before, we are unable to use the auto-synthesized decoder logic due to the reason mentioned above.

So, to reproduce this issue, you can create a new project, via:

swift package init --type executable

Then, import:

import XMLCoder
import Foundation

Then, assume you have a type called XmlListsOutputResponse, with the custom decoder logic (which is generated):

struct XmlListsOutputResponse: Decodable {
    public let nestedStringList: [[String]]?
    
    enum CodingKeys: String, CodingKey {
        case nestedStringList
    }
    
    public init (from decoder: Decoder) throws {
        let containerValues = try decoder.container(keyedBy: CodingKeys.self)
        if containerValues.contains(.nestedStringList) {
            struct KeyVal0{struct member{}}
            let nestedStringListWrappedContainer = containerValues.nestedContainerNonThrowable(keyedBy: CollectionMemberCodingKey<KeyVal0.member>.CodingKeys.self, forKey: .nestedStringList)
            if let nestedStringListWrappedContainer = nestedStringListWrappedContainer {
                let nestedStringListContainer = try nestedStringListWrappedContainer.decodeIfPresent([[String]].self, forKey: .member)
                var nestedStringListBuffer:[[String]]? = nil
                if let nestedStringListContainer = nestedStringListContainer {
                    nestedStringListBuffer = [[String]]()
                    var listBuffer0: [String]? = nil
                    for listContainer0 in nestedStringListContainer {
                        listBuffer0 = [String]()
                        for stringContainer1 in listContainer0 {
                                listBuffer0?.append(stringContainer1)
                        }
                        if let listBuffer0 = listBuffer0 {
                            nestedStringListBuffer?.append(listBuffer0)
                        }
                    }
                }
                nestedStringList = nestedStringListBuffer
            } else {
                nestedStringList = []
            }
        } else {
            nestedStringList = nil
        }
    }
}

Notice that this decoder logic has a few utilities/extensions to handle our requirements, so you'll need to copy and paste this into your project:

extension KeyedDecodingContainer where K : CodingKey {
    public func nestedContainerNonThrowable<NestedKey>(keyedBy type: NestedKey.Type, forKey key: KeyedDecodingContainer<K>.Key) -> KeyedDecodingContainer<NestedKey>? where NestedKey : CodingKey {
        do {
            return try nestedContainer(keyedBy: type, forKey: key)
        } catch {
            return nil
        }
    }
}

public struct CollectionMemberCodingKey<CustomMemberName> {
    public enum CodingKeys: String, CodingKey {
        case member

        public var rawValue: String {
            switch self {
            case .member: return customMemberName()
            }
        }

        func customMemberName() -> String {
            return String(describing: CustomMemberName.self)
        }
    }
}

These utilities/extensions are needed because "member" may not always be used as a key, and we need the ability to decode on a nested member.

Finally, to exercise, this code, we can have the following code:

func test_decodeNestedStringList() {
    let sourceXML = """
        <XmlListsOutputResponse>
            <nestedStringList>
                <member>
                    <member>foo</member>
                    <member>bar</member>
                </member>
                <member>
                    <member>baz</member>
                    <member>qux</member>
                </member>
            </nestedStringList>
        </XmlListsOutputResponse>
        """
    let decoder = XMLDecoder()
    decoder.trimValueWhitespaces = false
    let decoded = try! decoder.decode(XmlListsOutputResponse.self, from: Data(sourceXML.utf8))
    assert(decoded.nestedStringList![0][0] == "foo")
    assert(decoded.nestedStringList![0][1] == "bar")
    assert(decoded.nestedStringList![1][0] == "baz")
    assert(decoded.nestedStringList![1][1] == "qux")
}

test_decodeNestedStringList()

The expected behavior for running this code is that it runs successfuly, however, with trimValueWhitespaces = false, the asserts fail. Digging into why this happens, when trimValueWhitespaces is set to false, our decoder logic gets whitespaces that are embedded between the nested member. For example:

<member>
    <member>

Instead of decoded.nestedStringList![0][0] equaling "foo", decoded.nestedStringList![0][0] is equal to something like \n <multiple white spaces here>. This whitespace is not expected, and I'm unable to come up with decoder logic to handle this.

Perhaps you (or someone out there) can think of some decoder logic that would handle whitespaces? Considering that the auto-synthesized decoder is able to handle this (magically?!), I'm guessing there is a way, but I was unable to come up with decoder logic which handles this.

In terms of "how exactly would you like it to be changed" -- It seems hard to say right now. If there's some way to change our decoder logic, then, great, we can pursue this approach! Another option could be changing the behavior of trimValueWhitespaces = false to not return back \n. <multiple white spaces here>, in this nested case (although, I'm not sure if this is possible or not).

@MaxDesiatov
Copy link
Collaborator

MaxDesiatov commented Jul 9, 2021

Are places where you want whitespaces to be preserved and to be ignored predefined in some way? Maybe we could come up with something like trimValueWhitespaces, maybe a property wrapper that would be applied not to the whole decoder, but hierarchically, to properties you attribute with it?

@wooj2
Copy link
Contributor Author

wooj2 commented Jul 9, 2021

Thanks for getting back to me so quickly!

If I understand correctly, you're suggesting something like this:

struct XmlListsOutputResponse: Decodable {

    @TrimValueWhitespaces(shouldTrim: true)
    public let nestedStringList: [[String]]?

    public init (from decoder: Decoder) throws {
               .... insert impl here...
    }
}

I'm not sure if that would work, because, in our case of XmlListsOutputResponse, we could have whitespace as a value for one of nested members. For example:

func test_decodeNestedStringList() {
    let sourceXML = """
        <XmlListsOutputResponse>
            <nestedStringList>
                <member>
                    <member>foo &amp;lt;&#xD;&#10;</member>
                    <member>bar</member>
                </member>
                <member>
                    <member>baz</member>
                    <member>qux &amp;lt;&#xD;&#10;</member>
                </member>
            </nestedStringList>
        </XmlListsOutputResponse>
        """
    let decoder = XMLDecoder()
    decoder.trimValueWhitespaces = false
    let decoded = try! decoder.decode(XmlListsOutputResponse.self, from: Data(sourceXML.utf8))
    assert(decoded.nestedStringList![0][0] == "foo &lt;\r\n")
    assert(decoded.nestedStringList![0][1] == "bar")
    assert(decoded.nestedStringList![1][0] == "baz")
    assert(decoded.nestedStringList![1][1] == "qux &lt;\r\n")
}

I think it is impossible to know whether the xml payload will have these special characters ahead of time, so it doesn't seem like we can add a property wrapper to the member. Or, perhaps you are suggesting something else?

@MaxDesiatov
Copy link
Collaborator

Yes, a property wrapper to any member you'd like to ignore or not ignore whitespaces is what I was suggesting.

What about some configurable whitespace detection? Or would you be able to detect whitespace combinations not suitable for you in the decoder, with a regex or something like that?

@wooj2
Copy link
Contributor Author

wooj2 commented Jul 9, 2021

I think it would be difficult and/or brittle to come up with a regex that would work reliably. For example, if we skipped/detected any time we saw \n <n-number of white spaces>, how would we know that someone didn't encode that value into the inner member versus whitespace that we want to ignore?

Furthermore, the decoder implementation defines that there is a container with a key member. Within this container, we are saying we should decode another item member, which has a value. So, at the time we call decodeIfPresent (in the decoder implementation) it seems odd that one of these decoded values is the whitespace that is present between these containers, right?

Taking a step back, would you happen to have any advice on the way we are generating our decoder? Given that the Apple supplied auto-synthesized decoder works (when trimValueWhitesapces == false), do you know of any way we can get a hold of that implementation? The above implementation of the decoder (in my previous post) is only my best guess as to how they are decoding this data structure.

@MaxDesiatov
Copy link
Collaborator

Given that the Apple supplied auto-synthesized decoder works (when trimValueWhitesapces == false), do you know of any way we can get a hold of that implementation?

What prevents you from not overriding the decoding initializer and then using the auto-synthesized one instead? Or is there anything you need to customize there? What are the customizations then?

@wooj2
Copy link
Contributor Author

wooj2 commented Jul 15, 2021

One of many reasons why the auto-synthesized encoder/decoder logic does not work is that lists and maps can be serialized deserialized in different ways. For example, a struct like the following:

struct MyStruct {
    let wrappedList: [String]
    let flattenedList: [String]
}

And adding "example1", "example2", and "example3" strings into both of the lists, would need to be serialized in the following manner:

<MyStruct>
    <flattenedList>example1</flattenedList>
    <flattenedList>example2</flattenedList>
    <flattenedList>example3</flattenedList>
    <wrappedList>
        <member>example1</member>
        <member>example2</member>
        <member>example3</member>
    </wrappedList>
</MyStruct>

There are other examples that have to do with specific date/time formatting, as well as error types (I can post some other examples, but not sure how relevant it is).

Taking a step back, I just had another idea. It seems that this is only a problem when the xml is pretty-printed. Is it possible remove the pretty-printed-ness of the data before the data is parsed? If we remove the pretty-printed-ness of the data and set trimValueWhitesapces = false, I think we should be able to work around this issue?

@wooj2
Copy link
Contributor Author

wooj2 commented Jul 15, 2021

I started playing around with the parser, and using the SimpleScalar example above (where we have some data that looks like <stringValue>escaped data: &amp;lt;&#xD;&#10;</stringValue>, I'm seeing that the process() function in XMLStackParser located here is called in succession with the following values:

1: "\n        "
2: "escaped data: "
3: "&"
4: "lt;"
5: "\r"
6: "\n"
7: "\n    "

It's probably my misunderstanding of how XML and XML parsing works, but it's odd to me that it this function is being called in a way where "escaped data: " is being called separately from "&lt; ". Is there any chance that there is a bug in the way the parser is tokenizing this xml?

If you'd like to reproduce, you should be able to copy and paste the following code into a new project:

struct SimpleScalar: Equatable, Decodable {
    public let stringValue: String?
}
func test_escapableCharactersv2() {
    let sourceXML = """
            <SimpleScalar>
                <stringValue>escaped data: &amp;lt;&#xD;&#10;</stringValue>
            </SimpleScalar>
        """

    let decoder = XMLDecoder()
    decoder.trimValueWhitespaces = true
    let decoded = try! decoder.decode(SimpleScalar.self, from: Data(sourceXML.utf8))
    assert(decoded.stringValue == "escaped data: &lt;\r\n")
}
test_escapableCharactersv2()

Thanks again for all the help!!!

@MaxDesiatov
Copy link
Collaborator

If you think the bug is in tokenization, that won't be an easy fix I'm afraid. More like a very complicated fix where we'd have to write our own parser/tokenizer. We're using XMLParser class and XMLParserDelegate protocol from Foundation.

In a way, that may make your debugging process easier. You could try passing your XML directly to that parser and check how it tokenizes and parses things. If that leads to some reproducible results, it will be much easier to pinpoint the source of the issues either in Foundation's parser, or our library. I may not have much time in the next few days, but let me know if you find anything. Otherwise I could check it out this weekend or next week.

@MaxDesiatov
Copy link
Collaborator

If it's some unexpected behavior in Foundation's tokenizer/parser, hopefully it's customizable enough that we won't need a rewrite. But that remains to be seen.

@wooj2
Copy link
Contributor Author

wooj2 commented Jul 27, 2021

Hey Max!,

I'm not entirely sure if this is a valid approach, but I can effectively work around my issue by enabling trimValueWhitespaces = false and patching the code in this draft PR.

The problem I was observing was that when turning on trimValueWhitespaces = false, and attempting to decode a payload like:

        <XmlListsOutputResponse>
            <nestedStringList>
                <member>
                    <member>foo</member>
                    <member>bar</member>
                </member>
                <member>
                    <member>baz</member>
                    <member>qux</member>
                </member>
            </nestedStringList>
        </XmlListsOutputResponse>

.. i noticed that, while calling transformToBoxTree on the root node, and then calling merge to merge all the elements into KeyedStorage, I was seeing that there was an instance of XMLCoderElement having elements that had the following contents:

(lldb) po element.elements
▿ 5 elements
  ▿ 0 : XMLCoderElement
    - key : ""
    ▿ stringValue : Optional<String>
      - some : "\n        "
    - elements : 0 elements
    - attributes : 0 elements
    - containsTextNodes : false
  ▿ 1 : XMLCoderElement
    - key : "member"
    - stringValue : nil
    ▿ elements : 1 element
      ▿ 0 : XMLCoderElement
        - key : ""
        ▿ stringValue : Optional<String>
          - some : "foo"
        - elements : 0 elements
        - attributes : 0 elements
        - containsTextNodes : false
    - attributes : 0 elements
    - containsTextNodes : true
  ▿ 2 : XMLCoderElement
    - key : ""
    ▿ stringValue : Optional<String>
      - some : "\n        "
    - elements : 0 elements
    - attributes : 0 elements
    - containsTextNodes : false
  ▿ 3 : XMLCoderElement
    - key : "member"
    - stringValue : nil
    ▿ elements : 1 element
      ▿ 0 : XMLCoderElement
        - key : ""
        ▿ stringValue : Optional<String>
          - some : "bar"
        - elements : 0 elements
        - attributes : 0 elements
        - containsTextNodes : false
    - attributes : 0 elements
    - containsTextNodes : true
  ▿ 4 : XMLCoderElement
    - key : ""
    ▿ stringValue : Optional<String>
      - some : "\n    "
    - elements : 0 elements
    - attributes : 0 elements
    - containsTextNodes : false

Notice that element 0, 2 and 4, are just whitespace characters, which are elements outside of the of the inner <member> element. To remedy this, I've come up with a prototype to filter these elements out. I'm not sure if this is a good approach, so I'm seeking advice as to what you think.

#221

Note that the approach in PR 211 removes these elements in KeyedStorage. This might not be ideal, so another approach would be to remove these elements in the XMLStackParser's parser delegate, implemented here.

@MaxDesiatov MaxDesiatov linked a pull request Jul 29, 2021 that will close this issue
MaxDesiatov pushed a commit that referenced this issue Jul 30, 2021
This updates the `XMLStackParser` to accept a parameter called `removeWhitespaceElements`.

The purpose of the `XMLStackParser` is to call the XML parser and create a tree of `XMLCoderElement` representing the structure of the parsed XML.

Assuming that XMLStackParser has `trimValueWhitespaces` set to `false`, when attempting to parse a nested data structure like the following:
```xml
<SomeType>
    <nestedStringList>
        <member>
            <member>foo</member>
            <member>bar</member>
        </member>
        <member>
            <member>baz</member>
            <member>qux</member>
        </member>
    </nestedStringList>
</SomeType>
```

... then there will multiple `XMLCoderElement`s in the tree which will have `elements` set to elements that are either:
a.  Purely whitespaced elements or
b.  The child elements

These purely whitespaced elements are problematic for users who are implementing custom `Decoder` logic, as they are interpreted as regular child elements.  Therefore, setting `removeWhitespaceElements` to `true` while `trimValueWhitespaces` is set to `false`, will remove these whitespace elements during the construction of the `XMLCoderElement` tree.

An in-depth analysis of the original problem can be found [here](#219).

For historical purposes, a separate approach was implemented.  It uses a similar algorithm in a different part of the code. #221
spotlightishere pushed a commit to WiiLink24/XMLCoder that referenced this issue Aug 6, 2021
This updates the `XMLStackParser` to accept a parameter called `removeWhitespaceElements`.

The purpose of the `XMLStackParser` is to call the XML parser and create a tree of `XMLCoderElement` representing the structure of the parsed XML.

Assuming that XMLStackParser has `trimValueWhitespaces` set to `false`, when attempting to parse a nested data structure like the following:
```xml
<SomeType>
    <nestedStringList>
        <member>
            <member>foo</member>
            <member>bar</member>
        </member>
        <member>
            <member>baz</member>
            <member>qux</member>
        </member>
    </nestedStringList>
</SomeType>
```

... then there will multiple `XMLCoderElement`s in the tree which will have `elements` set to elements that are either:
a.  Purely whitespaced elements or
b.  The child elements

These purely whitespaced elements are problematic for users who are implementing custom `Decoder` logic, as they are interpreted as regular child elements.  Therefore, setting `removeWhitespaceElements` to `true` while `trimValueWhitespaces` is set to `false`, will remove these whitespace elements during the construction of the `XMLCoderElement` tree.

An in-depth analysis of the original problem can be found [here](CoreOffice#219).

For historical purposes, a separate approach was implemented.  It uses a similar algorithm in a different part of the code. CoreOffice#221
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
question Further information is requested
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants