Serialization fails with UTF-8 with BOM #1168
Comments
A BOM is a series of bytes that indicates the charset needed to decode bytes into a string. By the time the representation reaches a string, the BOM should have been parsed, used, and thus removed. I do not believe this to be the responsibility of the library, because the input is broken. A BOM that prefixes character data is not useful; it's just a malformed JSON string, as the library correctly indicates.
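The point above can be sketched in plain Java (JDK only, so it runs without the serialization dependency; `stripBom` is a hypothetical helper name, not part of any library discussed here): once bytes have been decoded to a string, a leading U+FEFF carries no encoding information, and removing it is the caller's job.

```java
public class BomStrip {
    // Hypothetical helper: after decoding, a leading U+FEFF is no longer
    // an encoding marker, just a stray character that makes the JSON
    // malformed, so the caller drops it before handing the text to a parser.
    static String stripBom(String s) {
        return s.startsWith("\uFEFF") ? s.substring(1) : s;
    }

    public static void main(String[] args) {
        String withBom = "\uFEFF{\"value\":\"test\"}";
        System.out.println(stripBom(withBom)); // {"value":"test"}
    }
}
```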
@JakeWharton is right; however, my example is a bit too "constructed". In my real-life application we receive an InputStream. Since we transitioned from Jackson serialization, this conversion was not necessary before:

data class Data(@JsonProperty("value") val value: String)

@Test
fun `test jackson deserialization of UTF-8 with BOM`() {
    val mapper = jacksonObjectMapper()
    val data = Data("test")
    val jsonData = mapper.writeValueAsString(data)
    val jsonDataWithBOM = "\uFEFF" + jsonData // simulates UTF-8 with BOM
    val inputStream = jsonDataWithBOM.toByteArray().inputStream()
    Assert.assertEquals(data, mapper.readValue<Data>(inputStream))
}

In my opinion I might be missing something when converting the InputStream. My only "knowledge" about these input streams is that they hold JSON data. Many thanks and best regards,
From what I've read just now, the BOM declares the encoding of the rest of the stream. When you call the deserializer with an already decoded String, that decoding step has already happened, so the BOM no longer carries any information and should be stripped by the caller before parsing.
@bartvanheukelom this is, of course, an intermediate solution to this problem (which we already implement); however, it feels like this should be handled by the serialization framework, as it is handled by others such as Jackson. If this is only my opinion, feel free to close the issue.
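One shape such an intermediate solution can take (a sketch in plain Java with only the JDK; `skipUtf8Bom` is a hypothetical name, not part of any library here) is a stream wrapper that drops a UTF-8 BOM before the bytes ever reach the decoder or deserializer:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

public class BomAwareRead {
    // Hypothetical helper: peek at the first three bytes and consume them
    // only if they are the UTF-8 BOM (0xEF 0xBB 0xBF); otherwise push them
    // back so the caller sees the stream unchanged.
    static InputStream skipUtf8Bom(InputStream in) throws IOException {
        PushbackInputStream pb = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = pb.read(head, 0, 3);
        boolean bom = n == 3 && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB && (head[2] & 0xFF) == 0xBF;
        if (!bom && n > 0) pb.unread(head, 0, n);
        return pb;
    }

    public static void main(String[] args) throws IOException {
        byte[] bom = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};
        byte[] json = "{\"value\":\"test\"}".getBytes(StandardCharsets.UTF_8);
        byte[] withBom = new byte[bom.length + json.length];
        System.arraycopy(bom, 0, withBom, 0, bom.length);
        System.arraycopy(json, 0, withBom, bom.length, json.length);

        InputStream clean = skipUtf8Bom(new ByteArrayInputStream(withBom));
        String text = new String(clean.readAllBytes(), StandardCharsets.UTF_8);
        System.out.println(text); // {"value":"test"}
    }
}
```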
Neither Gson nor Moshi handle it, and I believe their decision is correct. A BOM is a document-level prefix, and serialization may be done on parts of a document. For example, if you have a newline-delimited series of JSON values (such as the Asciinema format), you would handle the BOM on the first line (as it would be a prefix) but then fail to handle it on subsequent lines, potentially using the wrong decoding. The BOM in that situation applies to the whole document, not just the first line, and needs to be applied by the person calling into the serialization library with string data, not by the library itself.
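The newline-delimited case can be illustrated with a small JDK-only Java sketch (variable names are illustrative, not from any library): the BOM is handled once at the document boundary, after which each line is an independent JSON value with no BOM of its own.

```java
import java.util.List;

public class NdjsonBom {
    public static void main(String[] args) {
        // A newline-delimited document: the BOM prefixes the whole document,
        // not each value, so it is stripped exactly once before splitting.
        String document = "\uFEFF{\"a\":1}\n{\"a\":2}";
        String noBom = document.startsWith("\uFEFF")
                ? document.substring(1) : document;
        List<String> lines = List.of(noBom.split("\n"));
        // Each element is now clean JSON any parser can accept; stripping
        // per line instead would only ever have fixed the first one.
        System.out.println(lines); // [{"a":1}, {"a":2}]
    }
}
```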
The "correct" way to convert a byte input to a reader is to use an InputStreamReader combined with a Charset (such as UTF-8). To test it, the BOM needs to be prepended to the byte array, not the string. Of course, at that point you are mainly testing the implementation of the Charset.
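A small JDK-only Java sketch of exactly that test: the BOM is prepended at the byte level, and the stream is decoded through an InputStreamReader with an explicit Charset. The JDK's UTF-8 decoder does not consume the BOM, so it surfaces as the character U+FEFF at the start of the decoded text, which is why the caller still has to deal with it.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class CharsetBom {
    public static void main(String[] args) throws IOException {
        // BOM prepended to the byte array, as it would arrive on the wire.
        byte[] bytes = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, '{', '}'};
        InputStreamReader reader = new InputStreamReader(
                new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = reader.read()) != -1) sb.append((char) c);
        // UTF-8 decoding keeps the BOM: it comes through as U+FEFF.
        System.out.println(sb.charAt(0) == '\uFEFF'); // true
        System.out.println(sb.length()); // 3
    }
}
```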
I agree with Jake that BOM removal shouldn't be part of string encoding/decoding. If kotlinx.serialization ever provides conversion from ByteArray/InputStream, then it will be its responsibility.
Describe the bug

kotlinx.serialization cannot deserialize UTF-8 with BOM strings.

To Reproduce

Expected behavior

This test should not throw any SerializationException.
Environment

- 1.4.10
- 1.0.0
- 6.1.1