Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

read(io, Char) doesn't match collect(string) for malformed UTF-8 #50532

Closed
stevengj opened this issue Jul 12, 2023 · 2 comments · Fixed by #50552
Closed

read(io, Char) doesn't match collect(string) for malformed UTF-8 #50532

stevengj opened this issue Jul 12, 2023 · 2 comments · Fixed by #50552
Labels
domain:io Involving the I/O subsystem: libuv, read, write, etc. domain:strings "Strings!" kind:bug Indicates an unexpected problem or unintended behavior

Comments

@stevengj
Copy link
Member

stevengj commented Jul 12, 2023

The following mismatch seems undesirable to me: the same data "\xfc\xa8" is treated as 2 (malformed) characters for collect but as only 1 character for read:

julia> s = "\xfc\xa8"
"\xfc\xa8"

julia> io = IOBuffer(s);

julia> read(io, Char)
'\xfc\xa8': Malformed UTF-8 (category Ma: Malformed, bad data)

julia> collect(s)
2-element Vector{Char}:
 '\xfc': Malformed UTF-8 (category Ma: Malformed, bad data)
 '\xa8': Malformed UTF-8 (category Ma: Malformed, bad data)

cc @StefanKarpinski, the guru of malformed Char, who wrote this read code and the string iteration in #24999 — is this intentional?

@stevengj stevengj added domain:io Involving the I/O subsystem: libuv, read, write, etc. domain:strings "Strings!" labels Jul 12, 2023
@StefanKarpinski
Copy link
Sponsor Member

This definitely seems wrong—we should be consistent about what we consider to be a character, even for invalid data. I can dig into why there's a discrepancy...

@StefanKarpinski
Copy link
Sponsor Member

Yeah, ok, read(io, Char) is just buggy when there's a lead byte with too many leading ones, so I'll fix that.

@stevengj stevengj added the kind:bug Indicates an unexpected problem or unintended behavior label Jul 14, 2023
StefanKarpinski added a commit that referenced this issue Jul 17, 2023
Fixes #50532. The `read(io, Char)` method didn't correctly handle the
case where the lead byte starts with too many leading ones; this fix
makes it handle that case correctly, which makes `read(io, Char)` match
`collect(s)` in its interpretation of what a character is in all invalid
cases. Also fix and test `read(::File, Char)` which has the same bug.
KristofferC pushed a commit that referenced this issue Jul 17, 2023
Fixes #50532. The `read(io, Char)` method didn't correctly handle the
case where the lead byte starts with too many leading ones; this fix
makes it handle that case correctly, which makes `read(io, Char)` match
`collect(s)` in its interpretation of what a character is in all invalid
cases. Also fix and test `read(::File, Char)` which has the same bug.

(cherry picked from commit ffe1a07)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
domain:io Involving the I/O subsystem: libuv, read, write, etc. domain:strings "Strings!" kind:bug Indicates an unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants