export and document transcode #17323
@@ -121,8 +121,21 @@ end

# transcoding between data in UTF-8 and UTF-16 for Windows APIs

"""
    transcode(T, src)
needs rst signature so genstdlib can fill it in
Right, I was meaning to do that and forgot.
fixed.
Updated to include very basic UTF-32 conversions, especially for systems where
transcode{T<:Union{Int32,UInt32}}(::Type{T}, src::String) = T[T(c) for c in src]
transcode{T<:Union{Int32,UInt32}}(::Type{T}, src) = transcode(T, transcode(String, src))
transcode{S<:Union{Int32,UInt32}}(T, src::Vector{S}) = transcode(T, transcode(String, src))
transcode{S<:Union{Int32,UInt32}}(::Type{String}, src::Vector{S}) = string(map(Char, src)...)
does this splat every character?
Yes, is that a big problem? `string` has an optimized method for `string(Char...)` that I wanted to exploit.
Or should I create an `IOBuffer` and just `print` the characters one-by-one to it?
The splatting would still have to go through dispatch for all arguments to determine it can use that method, I think
Splatting performance confuses me. I feel like there should be an optimized path for `foo(array...)` when `array::Vector{T}` and `foo` has a `foo(T...)` method.
Anyway, I'll change it to use an `IOBuffer`.
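For illustration, the `IOBuffer` approach discussed in this thread could look roughly like the sketch below. It is written in current Julia syntax and is not the code from the PR; the helper name `codeunits32_to_string` is hypothetical.

```julia
# Hypothetical helper (not part of Base): build a String from UTF-32 code
# units by printing each character to an IOBuffer, instead of splatting
# the whole vector into `string(chars...)`.
function codeunits32_to_string(src::Vector{<:Union{Int32,UInt32}})
    io = IOBuffer()
    for c in src
        print(io, Char(c))   # writes the UTF-8 encoding of one code point
    end
    return String(take!(io))
end

s = codeunits32_to_string(UInt32[0x48, 0x69])
println(s)
```

The loop avoids dispatching on a call with one argument per character, which is the concern raised about splatting.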
x-ref #17218 which we probably should have merged
In StringEncodings.jl, I have followed the naming from e.g. Python and used I think it would make sense to use this convention, and keep
@nalimilan, why? I'd strongly prefer having fewer symbols if the meaning is clear, as I think it is here.
Okay, method ambiguities should be fixed.
needs rebase (probably trivial end-of-file), and failing commits should be squashed out
Dunno, it feels more natural to me to use different functions for string -> bytes and bytes -> string. That approach is also used in the corresponding Rust package. OTOH, I couldn't find a language/package where Note that the difference between the methods in this PR and those in StringEncodings.jl is that the latter take an encoding argument, since the integer size cannot be used to carry that information in all cases. This means the encoding argument would be in one case (bytes -> string) the source encoding, and in the other one the target encoding (string -> bytes).
"""
    transcode(T, src)

Convert string data between Unicode encodings. `src` is either a
I don't think we usually use double space between sentences? (Here and below.)
Doesn't the Markdown formatter ignore extra spaces when it formats the result?
Yes, it just ignores extra whitespace inside paragraphs. Shouldn't be causing any problems, though we should probably use consistent formatting throughout.
IOW: while I agree this feels natural here, where the scope is quite limited, I'm afraid it wouldn't be very natural when extending it to any encoding. Whether Base and StringEncodings should be completely consistent is up for debate, though.
Part of the difference with Python and Rust, I think, is that those are OOP languages and the methods in that case are "owned" by an object. In that case it is more natural to have
6b31fe3 to 9016ffc
I merged #17218 so This change significantly expands on the notion of For example, the current If we add More broadly, this raises an issue about errors and generic functions. Currently, we allow any method of a function to raise any error. This makes it very hard to write generic code in a way that doesn't produce unexpected errors. At some point we'll want the errors that a function can throw to be part of the signature of the function in some sense so that we can know when generic code handles all the errors it can legitimately encounter. If we don't define encoding error types in Base, then it is in principle impossible to do this for the transcode{S<:AbstractString}(::Type{S}, src::AbstractString) throws EncodingError Of course, that's not the whole signature since it only captures transcoding between strings.
@StefanKarpinski, the basic motivation for I really don't understand your point about exceptions. What about that is specific to We should treat this like any other method: we document the behavior in Base (we don't throw exceptions for converting between Unicode encodings), and we support
@StefanKarpinski, I think you mean that you merged #17218. That adds a docstring for
9016ffc to d57d866
Ok, ignore the issue about generic functions and exceptions if that doesn't bother you. I completely understand the motivation for adding convenience methods to
I think it should only raise an exception if it is impossible to losslessly represent the source data in the destination encoding.
What does it mean to "losslessly represent" invalid string data? Does that mean that the round trip transcoding is guaranteed to be the identity? There are byte sequences which are invalid UTF-8 but which the current algorithm can round trip safely through UTF-16 anyway. There are other cases where UTF-8-like data can include non-canonical encodings that are useful and common (CESU-8, Modified UTF-8 and overlong encodings in general), but which do not round trip through UTF-16, although they still encode the same sequence of code points. Is that ok? In still other cases, it's often reasonable to interpret invalid UTF-8 as Latin-1 and encode it as UTF-16 accordingly, but this doesn't round-trip, of course. Any sequence of UInt16 code points can be transcoded to UTF-8 and round-tripped back to UTF-16 identically, even invalid ones.
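The last claim in the comment above (that any sequence of `UInt16` code units round-trips through UTF-8, even an invalid one) can be checked with a short sketch. It assumes that the `transcode` in current Julia still behaves the way this comment describes; the input containing an unpaired surrogate is an illustrative choice, not taken from the thread.

```julia
# UTF-16 code units containing an unpaired high surrogate (0xd800),
# which makes the sequence invalid UTF-16.
u16 = UInt16[0x0041, 0xd800, 0x0042]

u8 = transcode(UInt8, u16)      # to UTF-8-like bytes
back = transcode(UInt16, u8)    # and back to 16-bit code units

# Prints whether the round trip reproduced the original code units.
println(back == u16)
```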
I would think that as long as it encodes to the same sequence of code points and/or it round-trips, it should be accepted.
How about something like:
And then we leave it up to packages to decide on a reasonable meaning of "reasonably". (It's not like any other functions in Base impose formal constraints on user methods.)
Added a clarification to the docs that this function is supposed to handle even invalid data.
to return a `Vector{UIntXX}` of UTF-`XX` data. Only conversion
to or from UTF-8 is currently supported.
to return a `Vector{UIntXX}` of UTF-`XX` data. (The alias `Cwchar_t`
also be used as the integer type for converting `wchar_t*` strings
can be used
So here's another issue: as we've discussed before, we probably want encodings to be first class and write something like
If more encodings are needed, they can always be added later, no? (Probably this is best left to a package...I can't see supporting Latin1 or EBCDIC or Windows-1252 in Base.) As time goes on, it seems like UTF-xx will be the only encodings that matter, so it is not crazy to use
Here's a possible policy on raising transcoding errors, and it's simple: never return an invalid object of the requested type. Since This raises another question: should we change
I like
We went through the argument about edit: I also like referring to the container type rather than the element type, echoes of #11557
Since
transcode{T<:Union{UInt8,UInt16,UInt32,Int32}}(::Type{T}, src::Vector{T}) = src
transcode{T<:Union{Int32,UInt32}}(::Type{T}, src::String) = T[T(c) for c in src]
transcode{T<:Union{Int32,UInt32}}(::Type{T}, src::Vector{UInt8}) = transcode(T, String(src))
function transcode{S<:Union{Int32,UInt32}}(::Type{UInt8}, src::Vector{S})
What's the rationale for supporting signed 32-bit code units (`Int32`) but only unsigned code units for others?
`Cwchar_t` can be `Int32`, so we need to support the signed type for UTF-32.
I suppose it wouldn't hurt to support the signed types `Int8` and `Int16` as well.
Ah, I was arguing for not supporting `Int32` either but the point about `Cwchar_t` makes sense. Let's leave it as is. I'd like to keep this minimal until we find we really need to do otherwise.
Ok, I'm convinced. @stevengj, will you rebase and merge? Or I can do it since I merged the conflicting changes. Just let me know.
76f789f to e48811c
Rebased; click "merge" when green.
!#$&-ing
…ing, x) and transcode(T, ::String) convenience methods
e48811c to d10a1e8
Can we not run CI again for a NEWS conflict fix?
Okay.
* export and document transcode from JuliaLang#16974, add transcode(String, x) and transcode(T, ::String) convenience methods
* docs
* support UTF-32 in transcode
* don't use splatting for UTF-32 to String conversion
* typo
* eliminate method ambiguities
* re-run genstdlib
* doc clarification
* typo
This exports and documents the `transcode` function from #16974. It also adds convenience methods `transcode(String, x) == String(transcode(UInt8, x))` (an extremely common pattern in our own code) and `transcode(T, s::String) == transcode(T, s.data)` (since we'd prefer to avoid requiring that users muck with `s.data`).

I feel like this is a necessary prerequisite to #16590, in order to remove the need for the `UTF16String` and `UTF32String` types when calling C APIs.

For the same reason, I'd also prefer to add support for transcoding to/from UTF-32, for calling C APIs on systems where `Cwchar_t == Int32`. I'm a little confused about how that is supposed to work in #16590. It seems like a pretty basic implementation like `transcode(::Type{UInt32}, s::String) = map(UInt32, s)` should be fine.
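As a rough usage sketch of the convenience methods described above, written against the `transcode` function as it exists in current Julia rather than the exact code in this PR (the sample string is arbitrary):

```julia
s = "héllo, wörld"

u16  = transcode(UInt16, s)     # Vector{UInt16} of UTF-16 code units
u32  = transcode(UInt32, s)     # Vector{UInt32}, one entry per character
wide = transcode(Cwchar_t, s)   # element type matches the platform's wchar_t

# Converting back yields the original string in each case.
println(transcode(String, u16) == s)
println(transcode(String, u32) == s)
println(transcode(String, wide) == s)
```

Using `Cwchar_t` as the target type lets the same call produce a buffer suitable for `wchar_t*` C APIs whether the platform's `wchar_t` is 16 or 32 bits wide.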