Decoding special whitespace characters #219
Comments
I think setting |
Hi Max, thanks for getting back to me! Sorry for the delay; I've been feeling under the weather for the last couple of days. To give you some context, we generate custom decoder (and encoder) logic because we have particular requirements about how dates, lists, and maps (including nested lists and maps) are serialized and deserialized. This is not ideal, but it seems necessary given how many such requirements we have. In the following example the problem goes away if we use the auto-synthesized decoder logic, but, as mentioned, we cannot use it for the reason above. To reproduce this issue, you can create a new project via:
Then, import:
import XMLCoder
import Foundation
Then, assume you have the following type:
struct XmlListsOutputResponse: Decodable {
    public let nestedStringList: [[String]]?
    enum CodingKeys: String, CodingKey {
        case nestedStringList
    }
    public init (from decoder: Decoder) throws {
        let containerValues = try decoder.container(keyedBy: CodingKeys.self)
        if containerValues.contains(.nestedStringList) {
            struct KeyVal0{struct member{}}
            let nestedStringListWrappedContainer = containerValues.nestedContainerNonThrowable(keyedBy: CollectionMemberCodingKey<KeyVal0.member>.CodingKeys.self, forKey: .nestedStringList)
            if let nestedStringListWrappedContainer = nestedStringListWrappedContainer {
                let nestedStringListContainer = try nestedStringListWrappedContainer.decodeIfPresent([[String]].self, forKey: .member)
                var nestedStringListBuffer:[[String]]? = nil
                if let nestedStringListContainer = nestedStringListContainer {
                    nestedStringListBuffer = [[String]]()
                    var listBuffer0: [String]? = nil
                    for listContainer0 in nestedStringListContainer {
                        listBuffer0 = [String]()
                        for stringContainer1 in listContainer0 {
                            listBuffer0?.append(stringContainer1)
                        }
                        if let listBuffer0 = listBuffer0 {
                            nestedStringListBuffer?.append(listBuffer0)
                        }
                    }
                }
                nestedStringList = nestedStringListBuffer
            } else {
                nestedStringList = []
            }
        } else {
            nestedStringList = nil
        }
    }
}
Notice that this decoder logic relies on a few utilities/extensions to handle our requirements, so you'll need to copy and paste these into your project:
extension KeyedDecodingContainer where K : CodingKey {
    public func nestedContainerNonThrowable<NestedKey>(keyedBy type: NestedKey.Type, forKey key: KeyedDecodingContainer<K>.Key) -> KeyedDecodingContainer<NestedKey>? where NestedKey : CodingKey {
        do {
            return try nestedContainer(keyedBy: type, forKey: key)
        } catch {
            return nil
        }
    }
}
public struct CollectionMemberCodingKey<CustomMemberName> {
    public enum CodingKeys: String, CodingKey {
        case member
        public var rawValue: String {
            switch self {
            case .member: return customMemberName()
            }
        }
        func customMemberName() -> String {
            return String(describing: CustomMemberName.self)
        }
    }
}
These utilities/extensions are needed because "member" may not always be used as a key, and we need the ability to decode on a nested member. Finally, to exercise this code, we can use the following:
func test_decodeNestedStringList() {
let sourceXML = """
<XmlListsOutputResponse>
<nestedStringList>
<member>
<member>foo</member>
<member>bar</member>
</member>
<member>
<member>baz</member>
<member>qux</member>
</member>
</nestedStringList>
</XmlListsOutputResponse>
"""
let decoder = XMLDecoder()
decoder.trimValueWhitespaces = false
let decoded = try! decoder.decode(XmlListsOutputResponse.self, from: Data(sourceXML.utf8))
assert(decoded.nestedStringList![0][0] == "foo")
assert(decoded.nestedStringList![0][1] == "bar")
assert(decoded.nestedStringList![1][0] == "baz")
assert(decoded.nestedStringList![1][1] == "qux")
}
test_decodeNestedStringList()
The expected behavior is that this code runs successfully; however, with
Instead of Perhaps you (or someone out there) can think of some decoder logic that would handle whitespaces? Considering that the auto-synthesized decoder is able to handle this (magically?!), I'm guessing there is a way, but I was unable to come up with decoder logic which handles this. In terms of "how exactly would you like it to be changed" -- It seems hard to say right now. If there's some way to change our decoder logic, then, great, we can pursue this approach! Another option could be changing the behavior of |
Are places where you want whitespaces to be preserved and to be ignored predefined in some way? Maybe we could come up with something like |
Thanks for getting back to me so quickly! If I understand correctly, you're suggesting something like this:
struct XmlListsOutputResponse: Decodable {
@TrimValueWhitespaces(shouldTrim: true)
public let nestedStringList: [[String]]?
public init (from decoder: Decoder) throws {
.... insert impl here...
}
}
I'm not sure if that would work, because, in our case of
func test_decodeNestedStringList() {
let sourceXML = """
<XmlListsOutputResponse>
<nestedStringList>
<member>
<member>foo &lt;
 </member>
<member>bar</member>
</member>
<member>
<member>baz</member>
<member>qux &lt;
 </member>
</member>
</nestedStringList>
</XmlListsOutputResponse>
"""
let decoder = XMLDecoder()
decoder.trimValueWhitespaces = false
let decoded = try! decoder.decode(XmlListsOutputResponse.self, from: Data(sourceXML.utf8))
assert(decoded.nestedStringList![0][0] == "foo <\r\n")
assert(decoded.nestedStringList![0][1] == "bar")
assert(decoded.nestedStringList![1][0] == "baz")
assert(decoded.nestedStringList![1][1] == "qux <\r\n")
}
I think it is impossible to know ahead of time whether the XML payload will have these special characters, so it doesn't seem like we can add a property wrapper to the member. Or perhaps you are suggesting something else? |
Yes, a property wrapper on any member for which you'd like to ignore or preserve whitespace is what I was suggesting. What about some configurable whitespace detection? Or would you be able to detect the whitespace combinations that don't suit you in the decoder, with a regex or something like that? |
I think it would be difficult and/or brittle to come up with a regex that would work reliably. For example, if we skipped/detected any time we saw Furthermore, the decoder implementation defines that there is a container with a key Taking a step back, would you happen to have any advice on the way we are generating our decoder? Given that the Apple supplied auto-synthesized decoder works (when |
What prevents you from dropping the custom decoding initializer and using the auto-synthesized one instead? Or is there anything you need to customize there? If so, what are the customizations? |
One of many reasons why the auto-synthesized encoder/decoder logic does not work is that lists and maps can be serialized and deserialized in different ways. For example, take a struct like the following:
struct MyStruct {
let wrappedList: [String]
let flattenedList: [String]
}
Adding the strings "example1", "example2", and "example3" to both lists would need to be serialized in the following manner:
<MyStruct>
<flattenedList>example1</flattenedList>
<flattenedList>example2</flattenedList>
<flattenedList>example3</flattenedList>
<wrappedList>
<member>example1</member>
<member>example2</member>
<member>example3</member>
</wrappedList>
</MyStruct>
There are other examples that have to do with specific date/time formatting, as well as error types (I can post some other examples, but I'm not sure how relevant they are). Taking a step back, I just had another idea. It seems that this is only a problem when the XML is pretty-printed. Is it possible to remove the pretty-printing from the data before it is parsed? If we remove the pretty-printing and set |
I started playing around with the parser, and using the
It's probably my misunderstanding of how XML and XML parsing work, but it's odd to me that this function is being called in a way where "escaped data: " is delivered separately from "< ". Is there any chance that there is a bug in the way the parser is tokenizing this XML? If you'd like to reproduce, you should be able to copy and paste the following code into a new project:
struct SimpleScalar: Equatable, Decodable {
public let stringValue: String?
}
func test_escapableCharactersv2() {
let sourceXML = """
<SimpleScalar>
<stringValue>escaped data: &lt;
 </stringValue>
</SimpleScalar>
"""
let decoder = XMLDecoder()
decoder.trimValueWhitespaces = true
let decoded = try! decoder.decode(SimpleScalar.self, from: Data(sourceXML.utf8))
assert(decoded.stringValue == "escaped data: <\r\n")
}
test_escapableCharactersv2()
Thanks again for all the help! |
If you think the bug is in tokenization, that won't be an easy fix I'm afraid. More like a very complicated fix where we'd have to write our own parser/tokenizer. We're using In a way, that may make your debugging process easier. You could try passing your XML directly to that parser and check how it tokenizes and parses things. If that leads to some reproducible results, it will be much easier to pinpoint the source of the issues either in Foundation's parser, or our library. I may not have much time in the next few days, but let me know if you find anything. Otherwise I could check it out this weekend or next week. |
If it's some unexpected behavior in Foundation's tokenizer/parser, hopefully it's customizable enough that we won't need a rewrite. But that remains to be seen. |
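The check suggested above could look roughly like the following. This is only a sketch, assuming the parser in question is Foundation's `XMLParser` (the name is elided in the thread). The `CharacterChunkLogger` class and `characterChunks(in:)` helper are hypothetical names made up for illustration, and the `&#xD;&#xA;` character references are an assumed way of encoding the `\r\n` from the test case. It records every `parser(_:foundCharacters:)` callback so you can see how the text of a single element is split into chunks.

```swift
import Foundation
#if canImport(FoundationXML)
import FoundationXML  // XMLParser lives here on non-Apple platforms
#endif

// Hypothetical helper: collects every chunk of character data the parser reports.
final class CharacterChunkLogger: NSObject, XMLParserDelegate {
    private(set) var chunks: [String] = []

    // XMLParser may call this several times for the text of one element.
    func parser(_ parser: XMLParser, foundCharacters string: String) {
        chunks.append(string)
    }
}

// Hypothetical helper: parses `xml` and returns the reported chunks in order.
func characterChunks(in xml: String) -> [String] {
    let parser = XMLParser(data: Data(xml.utf8))
    let logger = CharacterChunkLogger()
    parser.delegate = logger
    parser.parse()
    return logger.chunks
}

// Assumed input: the escaped "<" followed by a CR/LF pair written as character references.
let xml = "<SimpleScalar><stringValue>escaped data: &lt;&#xD;&#xA;</stringValue></SimpleScalar>"
let chunks = characterChunks(in: xml)
for (index, chunk) in chunks.enumerated() {
    // debugDescription makes "\r" and "\n" visible in the output.
    print(index, chunk.debugDescription)
}
```

If the text arrives in more than one chunk, any trimming applied per chunk rather than to the joined string could behave differently from what the element's full value would suggest, which would be worth ruling out before suspecting the tokenizer itself.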
Hey Max! I'm not entirely sure if this is a valid approach, but I can effectively work around my issue by enabling The problem I was observing was that when turning on
<XmlListsOutputResponse>
<nestedStringList>
<member>
<member>foo</member>
<member>bar</member>
</member>
<member>
<member>baz</member>
<member>qux</member>
</member>
</nestedStringList>
</XmlListsOutputResponse>
... I noticed that, while calling
Notice that elements 0, 2, and 4 are just whitespace characters, which are elements outside of the inner Note that the approach in PR 211 removes these elements in |
This updates the `XMLStackParser` to accept a parameter called `removeWhitespaceElements`. The purpose of the `XMLStackParser` is to call the XML parser and create a tree of `XMLCoderElement` representing the structure of the parsed XML.

Assuming that `XMLStackParser` has `trimValueWhitespaces` set to `false`, when attempting to parse a nested data structure like the following:

```xml
<SomeType>
    <nestedStringList>
        <member>
            <member>foo</member>
            <member>bar</member>
        </member>
        <member>
            <member>baz</member>
            <member>qux</member>
        </member>
    </nestedStringList>
</SomeType>
```

... then there will be multiple `XMLCoderElement`s in the tree whose `elements` are either:
a. Purely whitespace elements, or
b. The child elements

These purely whitespace elements are problematic for users who are implementing custom `Decoder` logic, as they are interpreted as regular child elements. Therefore, setting `removeWhitespaceElements` to `true` while `trimValueWhitespaces` is set to `false` will remove these whitespace elements during the construction of the `XMLCoderElement` tree.

An in-depth analysis of the original problem can be found [here](#219). For historical purposes, a separate approach was implemented. It uses a similar algorithm in a different part of the code. #221
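To make the idea concrete, here is a minimal, self-contained sketch of that kind of filtering. It does not use XMLCoder's actual types: `Node` is a hypothetical stand-in for a parsed tree element, and the rule shown (drop a whitespace-only text child only when its parent also has real element children) is an assumption about the intent, not the library's implementation.

```swift
// Hypothetical stand-in for a parsed XML tree node; not XMLCoder's XMLCoderElement.
struct Node: Equatable {
    var key: String          // an empty key marks a text node in this sketch
    var text: String?
    var children: [Node] = []

    // True for text nodes that contain nothing but whitespace (including newlines).
    var isWhitespaceOnlyText: Bool {
        guard key.isEmpty, let text = text else { return false }
        return text.allSatisfy { $0.isWhitespace }
    }
}

// Recursively drops whitespace-only text nodes that sit between element siblings.
// Text inside a leaf element is left untouched, so values such as "bar \n" survive.
func removingWhitespaceElements(_ node: Node) -> Node {
    var result = node
    let hasElementChildren = node.children.contains { !$0.key.isEmpty }
    result.children = node.children
        .filter { !(hasElementChildren && $0.isWhitespaceOnlyText) }
        .map(removingWhitespaceElements)
    return result
}

// A pretty-printed <member> list as it might look with trimming disabled.
let inner = Node(key: "member", text: nil, children: [
    Node(key: "", text: "\n    "),
    Node(key: "member", text: nil, children: [Node(key: "", text: "foo")]),
    Node(key: "", text: "\n    "),
    Node(key: "member", text: nil, children: [Node(key: "", text: "bar \n")]),
    Node(key: "", text: "\n"),
])
let cleaned = removingWhitespaceElements(inner)
print(cleaned.children.map { $0.key })
```

In this sketch the indentation between the two `member` elements is removed, while the trailing whitespace inside the second value is preserved because that text node has no element siblings.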
Hi,
I'm running into an issue where I think decoding is happening improperly with special characters.
For example:
Assuming the following structure:
I believe I should be able to write the following code:
However, the assert fails because this string is decoded as `escaped data: <` instead of the expected value of `escaped data: <\r\n`.
Setting `trimValueWhitespaces` to false does allow the test to pass, but this doesn't seem to be the correct path to resolving this issue. Does anyone have any advice on making this work without having to set `trimValueWhitespaces` to false? I have some other cases where setting `trimValueWhitespaces` to false will make other things difficult for my use case.