Serialization fails with UTF-8 with BOM #1168

FlorianObermayer · 2020-10-28T12:51:00Z

Describe the bug

kotlinx.serialization cannot deserialize UTF-8 with BOM strings.

To Reproduce

package com.example.myapplication

import kotlinx.serialization.Serializable
import kotlinx.serialization.SerializationException
import kotlinx.serialization.json.Json
import org.junit.Test

class ExampleUnitTest {

    /**
    Throws:
    kotlinx.serialization.json.internal.JsonDecodingException: Unexpected JSON token at offset 0: Expected '{, kind: CLASS'
    JSON input: {}

    at kotlinx.serialization.json.internal.JsonExceptionsKt.JsonDecodingException(JsonExceptions.kt:24)
    at kotlinx.serialization.json.internal.JsonExceptionsKt.JsonDecodingException(JsonExceptions.kt:32)
    at kotlinx.serialization.json.internal.JsonReader.fail(JsonReader.kt:333)
    at kotlinx.serialization.json.internal.StreamingJsonDecoder.beginStructure(StreamingJsonDecoder.kt:39)
    at com.example.myapplication.ExampleUnitTest$test deserialization of UTF-8 with BOM$Data$$serializer.deserialize(ExampleUnitTest.kt)
    at com.example.myapplication.ExampleUnitTest$test deserialization of UTF-8 with BOM$Data$$serializer.deserialize(ExampleUnitTest.kt:12)
    at kotlinx.serialization.json.internal.PolymorphicKt.decodeSerializableValuePolymorphic(Polymorphic.kt:63)
    at kotlinx.serialization.json.internal.StreamingJsonDecoder.decodeSerializableValue(StreamingJsonDecoder.kt:33)
    at kotlinx.serialization.json.Json.decodeFromString(Json.kt:85)
    at com.example.myapplication.ExampleUnitTest.test deserialization of UTF-8 with BOM(ExampleUnitTest.kt:16)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
    at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
    at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
    at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
    at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
    at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
    at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:68)
    at com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:33)
    at com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:230)
    at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:58)
     */
    @Test(expected = SerializationException::class)
    fun `test deserialization of UTF-8 with BOM`() {
        @Serializable
        data class Data(val value: String)

        val jsonData = Json.encodeToString(Data.serializer(), Data("test"))

        val jsonDataWithBOM = jsonData.prependIndent("\uFEFF") // simulates UTF-8 with BOM
        Json.decodeFromString(Data.serializer(), jsonDataWithBOM) // fails with SerializationException
    }
}

Expected behavior

This test should not throw any SerializationException

Environment

Kotlin version: 1.4.10
Library version: 1.0.0
Kotlin platforms: JVM
Gradle version: 6.1.1

The text was updated successfully, but these errors were encountered:

JakeWharton · 2020-10-28T13:16:22Z

A BOM is a series of bytes in order to indicate the charset needed to decode bytes into a string. By the time the representation reaches a string, the BOM should have been parsed, used, and thus removed. I do not believe this to be the responsibility of the library because the input is broken. A BOM that prefixes character data is not useful–it's just a malformed JSON string as the library correctly indicates.

FlorianObermayer · 2020-10-28T16:28:25Z

@JakeWharton is right, however my example is a bit too "constructed".

In my real-life application we receive an InputStream. Due to the lack of deserialization of ByteArray or InputStream we have to convert it to a string: inputStream.readBytes().toString(Charsets.UTF-8)

Since we transitioned from Jackson Serialization, this conversion was not necessary before:

    data class Data(@JsonProperty("value") val value: String)

    @Test
    fun `test jackson deserialization of UTF-8 with BOM`() {
        val mapper = jacksonObjectMapper()
        val data = Data("test")
        val jsonData = mapper.writeValueAsString(data)
        val jsonDataWithBOM = jsonData.prependIndent("\uFEFF") // simulates UTF-8 with BOM
        val inputStream = jsonDataWithBOM.toByteArray().inputStream()
        Assert.assertEquals(data, mapper.readValue<Data>(inputStream))
    }

In my opinion kotlinx.serialization lacks a Json::decodeFromByteArray(deserializer: DeserializationStrategy<T>, bytes : ByteArray) method.

I might miss something when converting InputStream to String though. Any suggestions? Removing the BOM "by hand" feels like an unnecessary hack.

My only "knowledge" about these input streams is that they hold UTF-8 encoded data, (with or without the BOM).

Many thanks and best regards,
Florian

bartvanheukelom · 2020-11-06T16:56:30Z

From what I've read just now, the BOM declares the encoding of the rest of the stream. When you call bytes.toString(Charsets.UTF-8), you're declaring that the encoding is UTF-8 and hence no BOM is needed.
Correct would be to either:

read the first byte and determine the charset from that, then use that to convert the bytes to a string
or if you explicitly only want to support UTF-8, peek at the first byte. remove it if it's a UTF-8 BOM, throw an error if it indicates another charset

FlorianObermayer · 2020-11-17T11:08:57Z

@bartvanheukelom this is, of course, an intermediate solution to this problem (which we already implement), however it feels like this should be handled by the serialization framework (as it is handled by others such as Jackson).

If this is only my opinion, feel free to close the issue.

JakeWharton · 2020-11-17T14:36:34Z

Neither Gson nor Moshi handle it and I believe their decision is correct. A BOM is a document-level prefix and serialization may be done on parts of a document.

For example, if you have a newline-delimited series of JSON values (such as the Asciinema format) you would handle the BOM on the first line (as it would be a prefix) but then fail to handle it on subsequent lines potentially using the wrong decoding. The BOM in that situation applies to the whole document, not just the first line, and needs to be applied by the person calling into the serialization library with string data, not by the library itself.

pdvrieze · 2020-11-17T16:01:01Z

The "correct" way to handle byte inputs conversion to reader is to use an InputStream reader combined with a Charset (like UTF-8). To test it, the BOM needs to be prepended to the byte array, not the string. Of course there you are mainly testing the implementation of Charset.

sandwwraith · 2020-11-20T19:36:41Z

I agree with Jake that BOM removal shouldn't be the part of string encoding/decoding. When kotlinx.serialization will provide conversion from ByteArray/InputStream, then it would be its responsibility.

sandwwraith closed this as completed Nov 20, 2020

sandwwraith mentioned this issue Jan 31, 2022

Empty json and Serializable class with no properties or only @EncodedDefault properties #1850

Closed

sandwwraith mentioned this issue Jul 5, 2022

UTF8 Boom and integer boolean #1981

Closed

AzimMuradov mentioned this issue Oct 2, 2023

Support encodings other than UTF-8 and support BOM handling Kotlin/kotlinx-io#226

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Serialization fails with UTF-8 with BOM #1168

Serialization fails with UTF-8 with BOM #1168

FlorianObermayer commented Oct 28, 2020

JakeWharton commented Oct 28, 2020

FlorianObermayer commented Oct 28, 2020

bartvanheukelom commented Nov 6, 2020

FlorianObermayer commented Nov 17, 2020

JakeWharton commented Nov 17, 2020

pdvrieze commented Nov 17, 2020

sandwwraith commented Nov 20, 2020

Serialization fails with UTF-8 with BOM #1168

Serialization fails with UTF-8 with BOM #1168

Comments

FlorianObermayer commented Oct 28, 2020

JakeWharton commented Oct 28, 2020

FlorianObermayer commented Oct 28, 2020

bartvanheukelom commented Nov 6, 2020

FlorianObermayer commented Nov 17, 2020

JakeWharton commented Nov 17, 2020

pdvrieze commented Nov 17, 2020

sandwwraith commented Nov 20, 2020