Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Encoder instead of Encoding in CsvParser #2106

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

Rob-Hague
Copy link
Contributor

closes #2088

This improves the ByteCount logic when consuming surrogate characters because
the Encoder maintains state across calls. Previously the default behaviour was
that the Encoding would return the byte count having replaced the surrogate
character with U+FFFD REPLACEMENT CHARACTER because on its own a surrogate
character is invalid.

For example, when reading the UTF-16 surrogate pair \uD83D\uDE17 corresponding
to the Unicode character U+1F617, the (UTF-8) ByteCount would be 6 (the byte
count of the sequence \uFFFD\uFFFD) instead of the correct value of 4 UTF-8
bytes.
An encoder with a custom EncoderReplacementFallback could easily require
a larger buffer, for example if its replacement string is "{LONG REPLACEMENT STRING}".
So the upper-bound of 16 bytes is not correct. Note that in this case
CsvParser.ByteCount would be nonsensical for input requiring the fallback, and
configuration.Encoding probably does not match the actual encoding, e.g we are reading
from a UTF-16 byte stream containing non-ASCII characters and we have set
configuration.Encoding to ASCII.

Nevertheless the safeguard is more sensible to have than not.
@JoshClose
Copy link
Owner

I'm going to wait on this for now. I believe the SIMD code will completely change how this works. I will be counting blocks of bytes at a time instead of single characters.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ByteCount fails to count surrogate characters properly.
2 participants