Use Encoder instead of Encoding in CsvParser #2106

Rob-Hague · 2022-12-26T18:57:13Z

closes #2088

This improves the ByteCount logic when consuming surrogate characters, because an Encoder maintains state across calls. Previously, the Encoding would by default return the byte count after replacing each surrogate character with U+FFFD REPLACEMENT CHARACTER, because a surrogate character on its own is invalid. For example, when reading the UTF-16 surrogate pair \uD83D\uDE17, which corresponds to the Unicode character U+1F617, the (UTF-8) ByteCount would be 6 (the byte count of the sequence \uFFFD\uFFFD, at 3 bytes each) instead of the correct value of 4 UTF-8 bytes.
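A minimal sketch of the difference, outside of CsvParser. It feeds the surrogate pair one char at a time, first through Encoding.UTF8 and then through an Encoder obtained from it. The variable names are illustrative, and the use of Encoder.GetBytes with a scratch buffer is an assumption made for this sketch (it is the call that carries state between invocations), not necessarily what this PR does internally.

```csharp
using System;
using System.Text;

// The surrogate pair for U+1F617, fed one UTF-16 code unit at a time.
char[] chars = { '\uD83D', '\uDE17' };

// Stateless: each call sees a lone surrogate and counts its U+FFFD replacement.
int statelessCount = 0;
foreach (char c in chars)
{
    statelessCount += Encoding.UTF8.GetByteCount(new[] { c });
}

// Stateful: the Encoder holds the high surrogate until the low surrogate arrives.
Encoder encoder = Encoding.UTF8.GetEncoder();
byte[] scratch = new byte[Encoding.UTF8.GetMaxByteCount(1)];
int statefulCount = 0;
for (int i = 0; i < chars.Length; i++)
{
    statefulCount += encoder.GetBytes(chars, i, 1, scratch, 0, flush: false);
}

Console.WriteLine($"Encoding, char by char: {statelessCount}"); // 6, as described above
Console.WriteLine($"Encoder, char by char:  {statefulCount}");  // 4, as described above
```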

An encoder with a custom EncoderReplacementFallback could easily require a larger buffer, for example if its replacement string is "{LONG REPLACEMENT STRING}", so the upper bound of 16 bytes is not correct. Note that in this case CsvParser.ByteCount would be nonsensical for input requiring the fallback, and configuration.Encoding probably does not match the actual encoding, e.g. we are reading from a UTF-16 byte stream containing non-ASCII characters and we have set configuration.Encoding to ASCII. Nevertheless, the safeguard is more sensible to have than not.
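To illustrate the buffer-size concern, here is a small sketch of an ASCII encoding configured with that replacement string. The choice of "é" as the unencodable input and the "?" decoder fallback are assumptions for the example; the point is only that a single char can then need more than 16 bytes.

```csharp
using System;
using System.Text;

// ASCII encoding whose encoder fallback substitutes a long replacement string.
Encoding ascii = Encoding.GetEncoding(
    "us-ascii",
    new EncoderReplacementFallback("{LONG REPLACEMENT STRING}"),
    new DecoderReplacementFallback("?"));

// 'é' is not representable in ASCII, so the fallback string is encoded instead.
int byteCount = ascii.GetByteCount("é");

// The replacement string is 25 ASCII characters, so this should exceed 16.
Console.WriteLine($"Bytes for one unencodable char: {byteCount}");
```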

JoshClose · 2024-01-25T23:24:49Z

I'm going to wait on this for now. I believe the SIMD code will completely change how this works. I will be counting blocks of bytes at a time instead of single characters.

Rob-Hague mentioned this pull request Dec 26, 2022

ByteCount fails to count surrogate characters properly. #2088

Open

JoshClose mentioned this pull request Jan 25, 2024

Count surrogate pairs. #2090

Draft
