Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Serialization fails with UTF-8 with BOM #1168

Closed
FlorianObermayer opened this issue Oct 28, 2020 · 7 comments
Closed

Serialization fails with UTF-8 with BOM #1168

FlorianObermayer opened this issue Oct 28, 2020 · 7 comments

Comments

@FlorianObermayer
Copy link

Describe the bug

kotlinx.serialization cannot deserialize UTF-8 with BOM strings.

To Reproduce

package com.example.myapplication

import kotlinx.serialization.Serializable
import kotlinx.serialization.SerializationException
import kotlinx.serialization.json.Json
import org.junit.Test

class ExampleUnitTest {

    /**
    Throws:
    kotlinx.serialization.json.internal.JsonDecodingException: Unexpected JSON token at offset 0: Expected '{, kind: CLASS'
    JSON input: {}

    at kotlinx.serialization.json.internal.JsonExceptionsKt.JsonDecodingException(JsonExceptions.kt:24)
    at kotlinx.serialization.json.internal.JsonExceptionsKt.JsonDecodingException(JsonExceptions.kt:32)
    at kotlinx.serialization.json.internal.JsonReader.fail(JsonReader.kt:333)
    at kotlinx.serialization.json.internal.StreamingJsonDecoder.beginStructure(StreamingJsonDecoder.kt:39)
    at com.example.myapplication.ExampleUnitTest$test deserialization of UTF-8 with BOM$Data$$serializer.deserialize(ExampleUnitTest.kt)
    at com.example.myapplication.ExampleUnitTest$test deserialization of UTF-8 with BOM$Data$$serializer.deserialize(ExampleUnitTest.kt:12)
    at kotlinx.serialization.json.internal.PolymorphicKt.decodeSerializableValuePolymorphic(Polymorphic.kt:63)
    at kotlinx.serialization.json.internal.StreamingJsonDecoder.decodeSerializableValue(StreamingJsonDecoder.kt:33)
    at kotlinx.serialization.json.Json.decodeFromString(Json.kt:85)
    at com.example.myapplication.ExampleUnitTest.test deserialization of UTF-8 with BOM(ExampleUnitTest.kt:16)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
    at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
    at com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:33)
    at com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:230)
    at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:58)
     */
    @Test(expected = SerializationException::class)
    fun `test deserialization of UTF-8 with BOM`() {
        @Serializable
        data class Data(val value: String)

        val jsonData = Json.encodeToString(Data.serializer(), Data("test"))

        val jsonDataWithBOM = jsonData.prependIndent("\uFEFF") // simulates UTF-8 with BOM
        Json.decodeFromString(Data.serializer(), jsonDataWithBOM) // fails with SerializationException
    }
}

Expected behavior

This test should not throw any SerializationException

Environment

  • Kotlin version: 1.4.10
  • Library version: 1.0.0
  • Kotlin platforms: JVM
  • Gradle version: 6.1.1
@JakeWharton
Copy link
Contributor

A BOM is a series of bytes in order to indicate the charset needed to decode bytes into a string. By the time the representation reaches a string, the BOM should have been parsed, used, and thus removed. I do not believe this to be the responsibility of the library because the input is broken. A BOM that prefixes character data is not useful–it's just a malformed JSON string as the library correctly indicates.

@FlorianObermayer
Copy link
Author

@JakeWharton is right, however my example is a bit too "constructed".

In my real-life application we receive an InputStream. Due to the lack of deserialization of ByteArray or InputStream we have to convert it to a string: inputStream.readBytes().toString(Charsets.UTF-8)

Since we transitioned from Jackson Serialization, this conversion was not necessary before:

    data class Data(@JsonProperty("value") val value: String)

    @Test
    fun `test jackson deserialization of UTF-8 with BOM`() {
        val mapper = jacksonObjectMapper()
        val data = Data("test")
        val jsonData = mapper.writeValueAsString(data)
        val jsonDataWithBOM = jsonData.prependIndent("\uFEFF") // simulates UTF-8 with BOM
        val inputStream = jsonDataWithBOM.toByteArray().inputStream()
        Assert.assertEquals(data, mapper.readValue<Data>(inputStream))
    }

In my opinion kotlinx.serialization lacks a Json::decodeFromByteArray(deserializer: DeserializationStrategy<T>, bytes : ByteArray) method.

I might miss something when converting InputStream to String though. Any suggestions? Removing the BOM "by hand" feels like an unnecessary hack.

My only "knowledge" about these input streams is that they hold UTF-8 encoded data, (with or without the BOM).

Many thanks and best regards,
Florian

@bartvanheukelom
Copy link

From what I've read just now, the BOM declares the encoding of the rest of the stream. When you call bytes.toString(Charsets.UTF-8), you're declaring that the encoding is UTF-8 and hence no BOM is needed.
Correct would be to either:

  • read the first byte and determine the charset from that, then use that to convert the bytes to a string
  • or if you explicitly only want to support UTF-8, peek at the first byte. remove it if it's a UTF-8 BOM, throw an error if it indicates another charset

@FlorianObermayer
Copy link
Author

@bartvanheukelom this is, of course, an intermediate solution to this problem (which we already implement), however it feels like this should be handled by the serialization framework (as it is handled by others such as Jackson).

If this is only my opinion, feel free to close the issue.

@JakeWharton
Copy link
Contributor

Neither Gson nor Moshi handle it and I believe their decision is correct. A BOM is a document-level prefix and serialization may be done on parts of a document.

For example, if you have a newline-delimited series of JSON values (such as the Asciinema format) you would handle the BOM on the first line (as it would be a prefix) but then fail to handle it on subsequent lines potentially using the wrong decoding. The BOM in that situation applies to the whole document, not just the first line, and needs to be applied by the person calling into the serialization library with string data, not by the library itself.

@pdvrieze
Copy link
Contributor

The "correct" way to handle byte inputs conversion to reader is to use an InputStream reader combined with a Charset (like UTF-8). To test it, the BOM needs to be prepended to the byte array, not the string. Of course there you are mainly testing the implementation of Charset.

@sandwwraith
Copy link
Member

I agree with Jake that BOM removal shouldn't be the part of string encoding/decoding. When kotlinx.serialization will provide conversion from ByteArray/InputStream, then it would be its responsibility.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants