remove UTF-16 and UTF-32 stuff #16590

StefanKarpinski · 2016-05-25T22:26:34Z

Part of #16107.

yuyichao · 2016-05-25T22:29:57Z

base/test.jl

@@ -908,4 +908,11 @@ function detect_ambiguities(mods...; imported::Bool=false)
    collect(ambs)
 end

+immutable GenericString <: AbstractString
+    string::AbstractString


Why is test/ not good enough for this?

Because packages are going to want to use this type as well. Otherwise how do you test that your interfaces work with string types that are not String once that's the only standard string type?

I also plan on making this type "lumpy" – i.e. each character will take a random number of bytes. That way tests that use this type will really put code through its paces in terms of assumptions about encodings and such.

If this is intended to be used by packages, it'd be good to provide a docstring, because I imagine the appearance of this type in this file would seem strange to the casual reader. It doesn't even necessarily have to be in the manual, but that'd be nice too.
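As a minimal sketch of how a package might use such a type in its tests: the snippet below assumes a current Julia 1.x release, where `GenericString` is exported by the `Test` standard library rather than defined in `base/test.jl` as in this diff. The function `count_vowels` is a hypothetical stand-in for a package function.

```julia
using Test

# Hypothetical function under test: it should accept any AbstractString,
# not only the built-in String type.
count_vowels(s::AbstractString) = count(c -> c in "aeiou", s)

plain   = "banana"
generic = GenericString(plain)   # wraps `plain` in a non-String AbstractString

@test !(generic isa String)
@test generic isa AbstractString
@test count_vowels(generic) == count_vowels(plain)
```

If `count_vowels` were declared with `s::String` instead, the last test would fail with a `MethodError`, which is the kind of hidden assumption this type is meant to expose.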

tkelman · 2016-05-25T23:29:48Z

test/strings/basic.jl

@@ -151,14 +151,6 @@ end
 @test lcfirst("")==""
 @test lcfirst("*")=="*"

-#more String tests
-@test convert(String, UInt8[32,107,75], "*") == " kK"


what were these 3-arg convert methods doing?

The third argument was a replacement string for invalid UTF-8 data. This is not a way you're supposed to be able to call convert. I'll have to add a deprecation for it before this PR is done.

tknopp · 2016-05-26T10:14:18Z

Are these things already covered in some package?

StefanKarpinski · 2016-05-26T13:35:01Z

> Are these things already covered in some package?

Not yet, I'll have to create that package. I will do so before merging this change.

tknopp · 2016-05-26T15:31:53Z

One question is whether it makes sense to have some sort of package that acts as a "quarantine zone" to which everything removed from Base is moved (without any extra reorganization, just copy and paste). Then, if someone thinks the code is important and wants to make a dedicated package, he/she can pick up the code and take over maintainership.
The advantage is that one decouples the removal and the point where a (maybe broader) package is created.

StefanKarpinski · 2016-05-26T15:43:46Z

That's not a bad idea, @tknopp.

tknopp · 2016-05-26T16:04:18Z

Ok. I created a new issue #16598 so that this is not hijacked by this broader idea.

tknopp · 2016-05-26T16:05:21Z

base/unicode.jl

 include("unicode/utf8.jl")
-include("unicode/utf16.jl")
 include("unicode/utf32.jl")


@StefanKarpinski: Why is include("unicode/utf32.jl") not removed here? It looks like the entire file is removed later.

That file isn't actually deleted here yet – there are a bunch of method definitions in it that weren't actually specific to the UTF32String type. For now I'm focusing on deleting things and keeping the tests working rather than moving things around as well. Once everything is pared down and working, then reorganization can be done.

tkelman · 2016-05-26T17:53:13Z

There is https://github.com/nalimilan/StringEncodings.jl which has a fair amount of content in it already and might make a decent home for these types and operations, vs the alternatives of standalone package (revive https://github.com/nolta/UTF16.jl as a package, or make a new UTF16andUTF32.jl for both?) or catch-all ArtistFormerlyKnownAsBase.jl collection. Thoughts, @nalimilan?

StefanKarpinski · 2016-05-26T18:00:35Z

StringEncodings seems like a good home. Does that sound good, @nalimilan?

nalimilan · 2016-05-26T18:07:18Z

Yes, I think we'll need these functions there. The idea is to support all encodings under a common EncodedString{enc} type, with methods to handle all conversions, either written in pure Julia or falling back to using iconv. I would welcome a PR adding this code.

stevengj · 2016-06-03T01:21:25Z

Since we gave up on replacing string with String, doesn't that clear the way to have String(d::Vector{UInt16}) = String(utf16to8(d)) ? It seems pretty consistent with saying UInt8 data is UTF-8 in the string constructor.

vtjnash · 2016-06-03T04:49:20Z

> Since we gave up on replacing string with String, doesn't that clear the way to have String(d::Vector{UInt16}) = String(utf16to8(d)) ? It seems pretty consistent with saying UInt8 data is UTF-8 in the string constructor.

Not entirely, since String(Vector{UInt8}) means "interpret these bytes as utf-8", not "convert these bytes to utf-8" (which might theoretically differ in handling of invalid codepoints, for example). For comparison, UTF32String(Vector{UInt8}) used to mean "interpret these bytes as utf-32", not "convert from utf-8 to utf-32".

I'm fairly strongly against giving our arrays any string-like properties.
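To make the "interpret" versus "convert" distinction concrete, here is a small sketch. It assumes a current Julia 1.x release, where `transcode` is exported and documented (that happened after this PR, in #17323), so it is not the code under review here.

```julia
s = "añ"                         # 'ñ' takes two bytes in UTF-8, one code unit in UTF-16

utf8  = Vector{UInt8}(s)         # copy of the UTF-8 code units
utf16 = transcode(UInt16, s)     # the same text converted to UTF-16 code units

# Transcoding back converts between encodings and recovers the original text.
roundtrip = transcode(String, utf16)

println(utf8)
println(utf16)
println(roundtrip == s)
```

The two arrays have different lengths and element types for the same text, which is why handing a `Vector{UInt16}` to a constructor that merely reinterprets bytes would not be equivalent to transcoding it.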

nalimilan · 2016-06-03T07:56:31Z

I agree with @vtjnash. For anything other than UTF-8, I think we'd better require specifying the original encoding. In StringEncodings.jl's framework, that would be something like String(d, enc"UTF-16") (it is currently decode(d, enc"UTF-16"), but only accepts d::Vector{UInt8}).

stevengj · 2016-06-03T11:41:51Z

My motivation here is that making a String from UTF-16 will be fairly common, while no other 16-bit string encoding is widespread. It makes sense to have this conversion be easily accessible (without hunting down some other function...currently an unexported function in Base). Nor is String(::Vector{UInt16}) particularly ambiguous — there seems to be no other sensible meaning for such a constructor.

nalimilan · 2016-06-03T12:14:06Z

Yeah, there are certainly arguments for this. OTOH, Latin-1/ISO-8859-1 will also be a very common need, and we won't be able to support it with this approach. Not sure.

stevengj · 2016-06-03T16:20:37Z

@nalimilan, since String(::Vector{UInt8}) already assumes UTF-8 (rather than Latin-1 or some other pre-Unicode 8-bit encoding), that kind of proves the point: the String constructors expect standard Unicode encodings, and for any other encoding we'll need some separate function (probably in an external package). Having String(::Vector{UInt16}) assume UTF-16 is consistent with this.

tkelman · 2016-06-04T14:47:09Z

It looks like LLVM has a set of unicode conversion functions in it - http://llvm.org/docs/doxygen/html/ConvertUTF_8h_source.html. Have those always been there?

StefanKarpinski · 2016-06-04T18:29:16Z

My current thinking is this:

String(data::Vector{UInt8})
- construct String with bytes as its data; takes ownership of bytes.
convert(::Type{String}, str::AbstractString)
- construct String representing the same character sequence as str.
String(str::AbstractString) = convert(String, str)
convert(::Type{String}, data::Vector{UInt8})
- copy UTF-8 data and create a String from it; equivalent to String(copy(data)).
convert(::Type{String}, data::Vector{UInt16})
- transcode UTF-16 data to UTF-8 and create a String from it.
convert(::Type{String}, data::Vector{UInt32})
- transcode UTF-32 data to UTF-8 and create a String from it.
convert(::Type{Vector{UInt8}}, str::String)
- copy UTF-8 data from str and return as a byte vector; equivalent to copy(str.data).
convert(::Type{Vector{UInt16}}, str::String)
- transcode UTF-8 data from str to UTF-16 as a vector of two-byte code units.
convert(::Type{Vector{UInt32}}, str::String)
- transcode UTF-8 data from str to UTF-32 as a vector of four-byte code units.

This way String(data) is only used in cases where the data is taken ownership of by the resulting String object, which can only happen when data is Vector{UInt8}, while convert is used when the string data must be a copy.
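The ownership distinction can be sketched as follows. This assumes the behavior of current Julia 1.x releases, where `String(::Vector{UInt8})` empties the vector it consumes; the 0.5-era semantics being designed in this thread may have differed in detail.

```julia
# Assumes Julia 1.x semantics, not necessarily those of the 0.5 development branch.
bytes = UInt8[0x68, 0x69]          # UTF-8 bytes for "hi"

copied = String(copy(bytes))       # only the temporary copy is consumed
len_after_copy = length(bytes)     # `bytes` is still intact here

owned = String(bytes)              # `bytes` itself is handed over to the String
len_after_take = length(bytes)     # the original vector has been emptied

println((copied, len_after_copy, owned, len_after_take))
```

Generic code that keeps using `bytes` after constructing the string works in the first case and silently sees an empty vector in the second, which is the hazard raised in the replies that follow.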

stevengj · 2016-06-04T18:48:49Z

@StefanKarpinski, there are lots of places in the code that use String(d[1:n]). When the arraypocalypse happens and d[1:n] returns a view, this will presumably have to make a copy instead of taking ownership.

Anyway, since T(x) falls back to convert, by providing the convert methods aren't you supplying String(...) methods too?

StefanKarpinski · 2016-06-04T18:49:56Z

> Anyway, since T(x) falls back to convert, by providing the convert methods aren't you supplying String(...) methods too?

Yes, this is annoying. So what do you suggest?

stevengj · 2016-06-04T18:51:35Z

My suggestion is just to accept that String(data) will do conversions from UTF-16 and UTF-32 arrays.

StefanKarpinski · 2016-06-04T18:53:43Z

The danger with that is that someone will write generic code where the data is converted and therefore copied and works for UInt16 and UInt32 arrays but then when passed a UInt8 array it takes ownership and the code breaks. But maybe that's an acceptable risk.

nalimilan · 2016-06-04T20:20:08Z

I'd say it's worth the risk, anyway UTF-8 is quite common, so it's not like it would only break in a corner case. Better always have String call the fallback to convert and keep methods consistent. Anyway, in general, convert does not offer any guarantees as regards aliasing (#12441).

Likewise, I wonder whether convert(::Type{Vector{UInt8}}, str::String) should make a copy or not. Maybe it's better to avoid copies and look for a more general solution for when you want to mutate the result of a conversion.

stevengj · 2016-06-04T20:26:51Z

Or we can just document that s.data is the raw UInt8 array, and caution people not to modify it. In practice, tons of packages seem to use this anyway.

ivarne · 2016-06-05T19:54:46Z

I really dislike String(::Vector{UInt8}) taking ownership. It has similar pitfalls to pointer arithmetic and gives hard-to-debug errors. It also only solves a subset of the "Use this block of memory as a UTF-8 string" problem.

How about making it a method of reinterpret instead? (which is actually what it does)

tkelman · 2016-07-06T02:56:11Z

Merging things without fixing docs at the same time is a really bad habit that has led to documentation being incorrect or missing on master for months (ahem iterator traits). We need to stop doing that, rushing towards an RC or not. ~~Anyone who wants to help can propose the doc updates for this, but~~ this isn't done or ready until the ~~docs and~~ news are prepared.

tkelman · 2016-07-06T03:02:19Z

> I don't think that merging a big PR like this immediately before the RC is a great idea.

It certainly isn't. There are three steps towards RC1 - feature freeze, branch release-0.5, then RC1. These are separate events that I don't think we should do at the same time, and we haven't formally done even the first step yet because the milestone still has breaking changes left on it. This PR should be merged as soon as it's ready, including ~~doc~~ news updates.

reverse() for GenericString/AbstractString returns a RevString, whose indexing behavior is very different from a reverse()'d String which is returned for String. Thus, calling reverseind() on the underlying String object is not correct for GenericString. Add a generic but O(n) method for AbstractString and use it for GenericString.


tkelman · 2016-07-06T03:39:17Z

There, doc update done. Now just NEWS if someone wants to propose a wording for that.

stevengj · 2016-07-06T12:36:40Z

doc/manual/strings.rst

-UTF-32 encodings. Additional discussion of other encodings and how to
-implement support for them is beyond the scope of this document for
+Julia uses UTF-8 encoding by default, and support for new encodings can
+be added by packages. Additional discussion of other encodings and how


Shouldn't it say that the UTF-16 and UTF-32 encodings are supported by the LegacyStrings package, since the latter is a quasi-official package mentioned in the deprecation warnings?

stevengj · 2016-07-07T18:41:45Z

Is there a documented way to convert String to/from UTF-16 data for calling Windows APIs? (This functionality is still in Base.)

tkelman · 2016-07-07T18:46:46Z

#16974, which is still totally undocumented.

stevengj · 2016-07-07T19:20:41Z

This PR has

function cconvert(::Type{Cwstring}, s::AbstractString)
    v = transcode(Cwchar_t, String(s).data)
    !isempty(v) && v[end] == 0 || push!(v, 0)
    return v
end

but where is the transcode method for when Cwchar_t == Int32?
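For reference, the conversion this `cconvert` method relies on can be sketched in current Julia 1.x, where `transcode` accepts `Cwchar_t` as a target type on all platforms. The helper name `to_wide` is hypothetical, and this is not the code from the PR.

```julia
# Hypothetical helper: build a NUL-terminated wchar_t buffer from a string.
# Cwchar_t is UInt16 on Windows and Int32 on most other platforms.
function to_wide(s::AbstractString)
    v = transcode(Cwchar_t, String(s))   # UTF-8 -> platform wide-character units
    push!(v, 0)                          # append the terminating NUL
    return v
end

w = to_wide("abc")

# Drop the terminator and convert back to check the round trip.
back = transcode(String, w[1:end-1])

println(eltype(w))
println(back)
```

On a platform where `Cwchar_t == Int32` this depends on a `transcode` method for `Int32`, which is the gap the question above points at.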

StefanKarpinski · 2016-07-11T10:34:56Z

These are valid points but irrelevant to this PR, which is strictly about deleting UTF-16 and UTF-32 string types and functions.

stevengj · 2016-07-11T12:48:00Z

They are not irrelevant, because if you delete the UTF-16/32 support without documenting and fixing transcode, then people ccalling wchar_t* interfaces are left without any documented way to do their conversions.

tkelman · 2016-07-11T13:57:05Z

NEWS updates still badly needed for this and many other breaking changes so people know how to deal with them


yuyichao · 2016-07-17T19:25:00Z

base/docs/helpdb/Base.jl

-Returns `true` if the given value is valid for its type, which currently can be one of
-`Char`, `String`, `UTF16String`, or `UTF32String`.
+Returns `true` if the given value is valid for its type, which currently can be either
+`Char` or `String`.


This method is deleted for String which breaks IJulia. Is it intentional? (If it is meant to be removed, the doc should be too.)

If IJulia needs UTF-8 validation, it may need to use LegacyStrings. Otherwise we could have an isvalid(String) method that returns true if the String is valid UTF-8 and false otherwise. The tricky thing is that there are a few different versions of what could be considered valid:

- encoding sanity (leading byte followed by the corresponding number of trailing bytes)
- invalid code points (surrogates)
- overlong encodings

And of course, these are not mutually exclusive – each string can exhibit any subset of these issues.

@StefanKarpinski, Base still includes UTF-8 validation: isvalid(String, "foo") works, and calls the C function u8_isvalid.

You just removed isvalid(::String). This seems a bit arbitrary and probably a mistake? Maybe we should have a fallback method isvalid{T}(x::T) = isvalid(T, x) ?

Yes, let's just put that method back then.
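A rough sketch of the proposed fallback and of the three kinds of invalid data listed above, assuming a current Julia 1.x release. The name `check_valid` is hypothetical and is used so that the example does not add methods to `Base.isvalid`; the byte sequences are illustrative inputs, not taken from the PR's tests.

```julia
# Hypothetical helper mirroring the proposed fallback `isvalid{T}(x::T) = isvalid(T, x)`,
# written in current `where` syntax and under a new name so Base is not modified.
check_valid(x::T) where {T} = isvalid(T, x)

truncated = UInt8[0xe2, 0x82]        # first two bytes of a three-byte sequence
surrogate = UInt8[0xed, 0xa0, 0x80]  # would encode the surrogate U+D800
overlong  = UInt8[0xc0, 0x80]        # two-byte encoding of U+0000

for (name, data) in [("truncated", truncated), ("surrogate", surrogate), ("overlong", overlong)]
    println(name, " => ", isvalid(String, data))
end

println(check_valid("foo"))
```

A single Boolean answer cannot say which of the three problems occurred, which is part of what makes the definition of "valid" tricky here.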


yuyichao reviewed May 25, 2016

StefanKarpinski force-pushed the sk/highlander5 branch from 120617b to ad08728 Compare May 25, 2016 22:44

tkelman reviewed May 25, 2016

tknopp mentioned this pull request May 26, 2016

RFC: BaseQuarantine.jl package #16598

Closed

tknopp reviewed May 26, 2016

ararslan mentioned this pull request Jun 2, 2016

String constructor is bad #16713

Closed

vtjnash force-pushed the sk/highlander5 branch from 8a36482 to 0fc2d1f Compare July 5, 2016 22:39

StefanKarpinski and others added 5 commits July 5, 2016 20:37

remove UTF-16 and UTF-32 string types and functions

3ad3784

Delete the now-unused UTF_ERR constants

593c5de

Doc update for utf16 and utf32 removal

1c2e5b5

Delete test/unicode/types.jl and test/unicode/checkstring.jl

d7b8361

since their code in base has been removed

tkelman force-pushed the sk/highlander5 branch from 52cface to d7b8361 Compare July 6, 2016 03:38

One more doc update about utf-16 etc in calling-c-and-fortran-code

145dd58

stevengj reviewed Jul 6, 2016

Add link to LegacyStrings.jl in doc/manual/strings.rst

3098e6c

stevengj mentioned this pull request Jul 7, 2016

export and document transcode #17323

Merged

StefanKarpinski merged commit 9a223c8 into master Jul 11, 2016

StefanKarpinski deleted the sk/highlander5 branch July 11, 2016 10:35

aviks mentioned this pull request Jul 11, 2016

Gallium fails to precompile JuliaDebug/Gallium.jl#133

Closed

stevengj added a commit to JuliaLang/Compat.jl that referenced this pull request Jul 11, 2016

fix tests to work with JuliaLang/julia#16590, eliminate deprecation w…

400c3b3

…arnings for unsafe_string

stevengj mentioned this pull request Jul 11, 2016

fix tests to work with JuliaLang/julia#16590 JuliaLang/Compat.jl#249

Merged

tkelman added a commit to JuliaLang/Compat.jl that referenced this pull request Jul 12, 2016

Merge pull request #249 from JuliaLang/fix-string-tests

8e815ff

fix tests to work with JuliaLang/julia#16590

yuyichao reviewed Jul 17, 2016

yuyichao mentioned this pull request Jul 17, 2016

isvalid(::String) is not defined but documented. #17467

Closed

damiendr mentioned this pull request Sep 16, 2016

Types wrongly redefined in early 0.5.0-dev JuliaStrings/LegacyStrings.jl#6

Closed

dpsanders pushed a commit to dpsanders/Compat.jl that referenced this pull request Feb 1, 2017

fix tests to work with JuliaLang/julia#16590, eliminate deprecation w…

f7e9d7c

…arnings for unsafe_string
