Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remove UTF-16 and UTF-32 stuff #16590

Merged
merged 7 commits into from
Jul 11, 2016
Merged

remove UTF-16 and UTF-32 stuff #16590

merged 7 commits into from
Jul 11, 2016

Conversation

StefanKarpinski
Copy link
Member

@StefanKarpinski StefanKarpinski commented May 25, 2016

Part of #16107.

@@ -908,4 +908,11 @@ function detect_ambiguities(mods...; imported::Bool=false)
collect(ambs)
end

immutable GenericString <: AbstractString
string::AbstractString
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is test/ not good enough for this?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because packages are going to want to use this type as well. Otherwise how do you test that your interfaces work with string types that are not String once that's the only standard string type?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also plan on making this type "lumpy" – i.e. each character will take a random number of bytes. That way tests that use this type will really put code through its paces in terms of assumptions about encodings and such.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is intended to be used by packages, it'd be good to provide a docstring, because I imagine the appearance of this type in this file would appear strange to the casual - doesn't even necessarily have to be in the manual, but that'd be nice too.

@@ -151,14 +151,6 @@ end
@test lcfirst("")==""
@test lcfirst("*")=="*"

#more String tests
@test convert(String, UInt8[32,107,75], "*") == " kK"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what were these 3-arg convert methods doing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The third argument was a replacement string for invalid UTF-8 data. This is not a way you're supposed to be able to call convert. I'll have to add a deprecation for it before this PR is done.

@tknopp
Copy link
Contributor

tknopp commented May 26, 2016

Are these things already covered in some package?

@StefanKarpinski
Copy link
Member Author

Are these things already covered in some package?

Not yet, I'll have to create that package. I will do so before merging this change.

@tknopp
Copy link
Contributor

tknopp commented May 26, 2016

One question would be if it makes sense to have some sort of package that acts as a "quarantine zone" where all removed stuff from Base is moved to (without any extra reorganization and so on, just copy paste). Then if someone thinks that the stuff is important and want to make a dedicated package he/she can pick up the code and take over maintainership.
The advantage is that one decouples the removal and the point where a (maybe broader) package is created.

@StefanKarpinski
Copy link
Member Author

That's not a bad idea, @tknopp.

@tknopp
Copy link
Contributor

tknopp commented May 26, 2016

Ok. I created a new issue #16598 so that this is not highjacked by this broader idea.

include("unicode/utf8.jl")
include("unicode/utf16.jl")
include("unicode/utf32.jl")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@StefanKarpinski: Why is include("unicode/utf32.jl") not removed here? It looks that the entire file is removed later

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That file isn't actually deleted here yet – there are a bunch of method definitions in it that weren't actually specific to the UTF32String type. For now I'm focusing on deleting things and keeping the tests working rather than moving things around as well. Once everything is pared down and working, then reorganization can be done.

@tkelman
Copy link
Contributor

tkelman commented May 26, 2016

There is https://github.com/nalimilan/StringEncodings.jl which has a fair amount of content in it already and might make a decent home for these types and operations, vs the alternatives of standalone package (revive https://github.com/nolta/UTF16.jl as a package, or make a new UTF16andUTF32.jl for both?) or catch-all ArtistFormerlyKnownAsBase.jl collection. Thoughts, @nalimilan?

@StefanKarpinski
Copy link
Member Author

StringEncodings seems like a good home. Does that sound good, @nalimilan?

@nalimilan
Copy link
Member

Yes, I think we'll need these functions there. The idea is to support all encodings under a common EncodedString{enc} type, which methods to handle all conversions, either written in pure Julia or falling back to using iconv. I would welcome a PR adding this code.

@stevengj
Copy link
Member

stevengj commented Jun 3, 2016

Since we gave up on replacing string with String, doesn't that clear the way to have String(d::Vector{UInt16}) = String(utf16to8(d)) ? It seems pretty consistent with saying UInt8 data is UTF-8 in the string constructor.

@vtjnash
Copy link
Member

vtjnash commented Jun 3, 2016

Since we gave up on replacing string with String, doesn't that clear the way to have String(d::Vector{UInt16}) = String(utf16to8(d)) ? It seems pretty consistent with saying UInt8 data is UTF-8 in the string constructor.

Not entirely, since String(Vector{UInt8}) means "interpret these bytes as utf-8", not convert these bytes to utf-8 (which might theoretically differ in handling of invalid codepoints, for example). For comparison, UTF32String(Vector{UInt8}) used to mean "interpret these bytes as utf-32", not convert from utf-8 to utf-32.

I'm fairly strongly against giving our arrays any string-like properties.

@nalimilan
Copy link
Member

I agree with @vtjnash. For anything other than UTF-8, I think we'd better require specifying the original encoding. In StringEncodings.jl's framework, that would be something like String(d, enc"UTF-16") (it is currently decode(d, enc"UTF-16"), but only accepts d::Vector{UInt8}).

@stevengj
Copy link
Member

stevengj commented Jun 3, 2016

My motivation here is that making a String from UTF-16 will be fairly common, while no other 16-bit string encoding is widespread. It makes sense to have this conversion be easily accessible (without hunting down some other function...currently an unexported function in Base). Nor is String(::Vector{UInt16}) particularly ambiguous — there seems to be no other sensible meaning for such a constructor.

@nalimilan
Copy link
Member

Yeah, there are certainly arguments for this. OTOH, Latin-1/ISO-8859-1 will also be a very common need, and we won't be able to support it with this approach. Not sure.

@stevengj
Copy link
Member

stevengj commented Jun 3, 2016

@nalimilan, since String(::Vector{UInt8}) already assumes UTF-8 (rather than Latin-1 or some other pre-Unicode 8-bit encoding), that kind of proves the point: the String constructors expect standard Unicode encodings, and for any other encoding we'll need some separate function (probably in an external package). Having String(::Vector{UInt16}) assume UTF-16 is consistent with this.

@tkelman
Copy link
Contributor

tkelman commented Jun 4, 2016

It looks like LLVM has a set of unicode conversion functions in it - http://llvm.org/docs/doxygen/html/ConvertUTF_8h_source.html. Have those always been there?

@StefanKarpinski
Copy link
Member Author

My current thinking is this:

  • String(data::Vector{UInt8})
    • construct String with bytes as its data; takes ownership of bytes.
  • convert(::Type{String}, str::AbstractString)
    • construct String representing the same character sequence as str.
  • String(str::AbstractString) = convert(String, str)
  • convert(::Type{String}, data::Vector{UInt8})
    • copy UTF-8 data and create a String from it; equivalent to String(copy(data)).
  • convert(::Type{String}, data::Vector{UInt16})
    • transcode UTF-16 data to UTF-8 and create a String from it.
  • convert(::Type{String}, data::Vector{UInt32})
    • transcode UTF-32 data to UTF-8 and create a String from it.
  • convert(::Type{Vector{UInt8}}, str::String)
    • copy UTF-8 data from str and return as a byte vector; equivalent to copy(str.data).
  • convert(::Type{Vector{UInt16}}, str::String)
    • transcode UTF-8 data from str to UTF-16 as a vector of two-byte code units.
  • convert(::Type{Vector{UInt32}}, str::String)
    • transcode UTF-8 data from str to UTF-32 as a vector of four-byte code units.

This way String(data) is only used in cases where the data is taken ownership of by the resulting String object, which can only happen when data is Vector{UInt8}, while convert is used when the string data must be a copy.

@stevengj
Copy link
Member

stevengj commented Jun 4, 2016

@StefanKarpinski, there are lots of places in the code that use String(d[1:n]). When the arraypocolypse happens and d[1:n] returns a view, this will presumably have to make a copy instead of taking ownership.

Anyway, since T(x) falls back to convert, by providing the the convert methods aren't you supplying String(...) methods too?

@StefanKarpinski
Copy link
Member Author

Anyway, since T(x) falls back to convert, by providing the the convert methods aren't you supplying String(...) methods too?

Yes, this is annoying. So what do you suggest?

@stevengj
Copy link
Member

stevengj commented Jun 4, 2016

My suggestion is just to accept that String(data) will do conversions from UTF-16 and UTF-32 arrays.

@StefanKarpinski
Copy link
Member Author

The danger with that is that someone will write generic code where the data is converted and therefore copied and works for UInt16 and UInt32 arrays but then when passed a UInt8 array it takes ownership and the code breaks. But maybe that's an acceptable risk.

@nalimilan
Copy link
Member

I'd say it's worth the risk, anyway UTF-8 is quite common, so it's not like it would only break in a corner case. Better always have String call the fallback to convert and keep methods consistent. Anyway, in general, convert does not offer any guarantees as regards aliasing (#12441).

Likewise, I wonder whether convert(::Type{Vector{UInt8}}, str::String) should make a copy or not. Maybe better avoiding copies, and look for a more general solution for when you want to mutate the result of a conversion.

@stevengj
Copy link
Member

stevengj commented Jun 4, 2016

Or we can just document that s.data is the raw UInt8 array, and caution people not to modify it. In practice, tons of packages seem to use this anyway.

@ivarne
Copy link
Member

ivarne commented Jun 5, 2016

I really dislike String(::Vector{Uint8}) taking ownership. It has similar pittfalls as pointer arithmetic and gives hard to debug errors. It also only solves a subset of the "Use this block of memory as a UTF8 string" problem.

How about making it a method of reinterpret instead? (which is actually what it does)

@tkelman
Copy link
Contributor

tkelman commented Jul 6, 2016

Merging things without fixing docs at the same time is a really bad habit that has led to documentation being incorrect or missing on master for months (ahem iterator traits). We need to stop doing that, rushing towards an RC or not. Anyone who wants to help can propose the doc updates for this, but this isn't done or ready until the docs and news are prepared.

@tkelman
Copy link
Contributor

tkelman commented Jul 6, 2016

I don't think that merging a big PR like this immediately before the RC is a great idea.

It certainly isn't. There are three steps towards RC1 - feature freeze, branch release-0.5, then RC1. These are separate events that I don't think we should do at the same time, and we haven't formally done even the first step yet because the milestone still has breaking changes left on it. This PR should be merged as soon as it's ready, including doc news updates.

StefanKarpinski and others added 5 commits July 5, 2016 20:37
reverse() for GenericString/AbstractString returns a RevString, whose
indexing behavior is very different from a reverse()'d String which
is returned for String. Thus, calling reverseind() on the underlying String
object is not correct for GenericString. Add a generic but O(n) method for
AbstractString and use it for GenericString.
since their code in base has been removed
@tkelman
Copy link
Contributor

tkelman commented Jul 6, 2016

There, doc update done. Now just NEWS if someone wants to propose a wording for that.

UTF-32 encodings. Additional discussion of other encodings and how to
implement support for them is beyond the scope of this document for
Julia uses UTF-8 encoding by default, and support for new encodings can
be added by packages. Additional discussion of other encodings and how
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't it say that the UTF-16 and UTF-32 encodings are supported by the LegacyStrings package, since the latter is a quasi-official package mentioned in the deprecation warnings?

@stevengj
Copy link
Member

stevengj commented Jul 7, 2016

Is there a documented way to convert String to/from UTF-16 data for calling Windows APIs? (This functionality is still in Base.)

@tkelman
Copy link
Contributor

tkelman commented Jul 7, 2016

#16974, which is still totally undocumented.

@stevengj
Copy link
Member

stevengj commented Jul 7, 2016

This PR has

function cconvert(::Type{Cwstring}, s::AbstractString)
    v = transcode(Cwchar_t, String(s).data)
    !isempty(v) && v[end] == 0 || push!(v, 0)
    return v
end

but where is the transcode method for when Cwchar_t == Int32?

@StefanKarpinski
Copy link
Member Author

These are valid points but irrelevant to this PR, which is strictly about deleting UTF-16 and UTF-32 string types and functions.

@StefanKarpinski StefanKarpinski merged commit 9a223c8 into master Jul 11, 2016
@StefanKarpinski StefanKarpinski deleted the sk/highlander5 branch July 11, 2016 10:35
@stevengj
Copy link
Member

stevengj commented Jul 11, 2016

They are not irrelevant, because if you delete the UTF-16/32 support without documenting and fixing transcode, then people ccalling wchar_t* interfaces are left without any documented way to do their conversions.

@tkelman
Copy link
Contributor

tkelman commented Jul 11, 2016

NEWS updates still badly needed for this and many other breaking changes so people know how to deal with them

stevengj added a commit to JuliaLang/Compat.jl that referenced this pull request Jul 11, 2016
tkelman added a commit to JuliaLang/Compat.jl that referenced this pull request Jul 12, 2016
Returns `true` if the given value is valid for its type, which currently can be one of
`Char`, `String`, `UTF16String`, or `UTF32String`.
Returns `true` if the given value is valid for its type, which currently can be either
`Char` or `String`.
Copy link
Contributor

@yuyichao yuyichao Jul 17, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This method is deleted for String which breaks IJulia. Is it intentional? (If it is meant to be removed, the doc should be too.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If IJulia needs UTF-8 validation, it may need to use LegacyStrings. Otherwise we could have an isvalid(String) method that returns true if the String is valid UTF-8 and false otherwise. The tricky thing is that there are a few different versions of what could be considered valid:

  • encoding sanity (leading byte followed by the corresponding number of trailing bytes)
  • invalid code points (surrogates)
  • overlong encodings

And of course, these are not mutually exclusive – each string can exhibit any subset of these issues.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@StefanKarpinski, Base still includes UTF-8 validation: isvalid(String, "foo") works, and calls the C function u8_isvalid.

You just removed isvalid(::String). This seems a bit arbitrary and probably a mistake? Maybe we should have a fallback method isvalid{T}(x::T) = isvalid(T, x) ?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's just put that method back then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
strings "Strings!" unicode Related to unicode characters and encodings
Projects
None yet
Development

Successfully merging this pull request may close these issues.