remove UTF-16 and UTF-32 stuff #16590
@@ -908,4 +908,11 @@ function detect_ambiguities(mods...; imported::Bool=false)
    collect(ambs)
end

immutable GenericString <: AbstractString
    string::AbstractString
Why is test/ not good enough for this?
Because packages are going to want to use this type as well. Otherwise how do you test that your interfaces work with string types that are not `String` once that's the only standard string type?
I also plan on making this type "lumpy" – i.e. each character will take a random number of bytes. That way tests that use this type will really put code through its paces in terms of assumptions about encodings and such.
If this is intended to be used by packages, it'd be good to provide a docstring, because I imagine the appearance of this type in this file would seem strange to the casual reader. It doesn't even necessarily have to be in the manual, but that'd be nice too.
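As a rough illustration of the testing use case described above, the sketch below wraps a `String` in `GenericString` and checks that a function written against `AbstractString` still behaves the same. It assumes a current Julia release, where `GenericString` is provided by the `Test` standard library rather than by the file touched in this diff; `count_vowels` is a hypothetical package function invented for the example.

```julia
using Test  # assumption: GenericString is exported by the Test stdlib in current Julia

# Hypothetical package function that claims to accept any AbstractString.
count_vowels(s::AbstractString) = count(c -> c in "aeiou", s)

plain = "encodings"
generic = GenericString(plain)   # same contents, but not a String

@assert generic isa AbstractString
@assert !(generic isa String)
@assert count_vowels(generic) == count_vowels(plain)
```

A method that was accidentally restricted to `String` would raise a `MethodError` on the `GenericString` argument, which is the kind of hidden assumption this type is meant to expose.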
120617b to ad08728
@@ -151,14 +151,6 @@ end
@test lcfirst("")==""
@test lcfirst("*")=="*"

#more String tests
@test convert(String, UInt8[32,107,75], "*") == " kK"
what were these 3-arg convert methods doing?
The third argument was a replacement string for invalid UTF-8 data. This is not a way you're supposed to be able to call `convert`. I'll have to add a deprecation for it before this PR is done.
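To make the removed behaviour concrete, here is a minimal sketch of substituting a replacement string for invalid UTF-8 data. `replace_invalid` is a hypothetical helper, not a Base function, and it is not the implementation that was deleted. It assumes a current Julia release, where iterating a `String` that contains invalid data yields characters for which `isvalid` returns `false`. Whether the original method replaced each byte or each malformed sequence is not stated in this thread, so the granularity here is an assumption.

```julia
# Hypothetical helper: rebuild a string from raw bytes, writing `replacement`
# wherever the decoded character is not valid.
function replace_invalid(bytes::Vector{UInt8}, replacement::AbstractString)
    io = IOBuffer()
    # copy() because String(::Vector{UInt8}) may take ownership of its argument
    for c in String(copy(bytes))
        isvalid(c) ? print(io, c) : print(io, replacement)
    end
    return String(take!(io))
end

# Same input as the deleted test: all bytes are valid, so nothing is replaced.
@assert replace_invalid(UInt8[32, 107, 75], "*") == " kK"
```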
Are these things already covered in some package?
Not yet, I'll have to create that package. I will do so before merging this change.
One question would be whether it makes sense to have some sort of package that acts as a "quarantine zone" where all stuff removed from Base is moved (without any extra reorganization and so on, just copy and paste). Then if someone thinks that the stuff is important and wants to make a dedicated package, he/she can pick up the code and take over maintainership.
That's not a bad idea, @tknopp.
Ok. I created a new issue #16598 so that this is not hijacked by this broader idea.
include("unicode/utf8.jl")
include("unicode/utf16.jl")
include("unicode/utf32.jl")
@StefanKarpinski: Why is `include("unicode/utf32.jl")` not removed here? It looks like the entire file is removed later.
That file isn't actually deleted here yet – there are a bunch of method definitions in it that weren't actually specific to the UTF32String type. For now I'm focusing on deleting things and keeping the tests working rather than moving things around as well. Once everything is pared down and working, then reorganization can be done.
There is https://github.com/nalimilan/StringEncodings.jl which has a fair amount of content in it already and might make a decent home for these types and operations, vs the alternatives of standalone package (revive https://github.com/nolta/UTF16.jl as a package, or make a new UTF16andUTF32.jl for both?) or catch-all
StringEncodings seems like a good home. Does that sound good, @nalimilan?
Yes, I think we'll need these functions there. The idea is to support all encodings under a common
Since we gave up on replacing
Not entirely, since I'm fairly strongly against giving our arrays any string-like properties.
I agree with @vtjnash. For anything other than UTF-8, I think we'd better require specifying the original encoding. In StringEncodings.jl's framework, that would be something like
My motivation here is that making a
Yeah, there are certainly arguments for this. OTOH, Latin-1/ISO-8859-1 will also be a very common need, and we won't be able to support it with this approach. Not sure.
@nalimilan, since
It looks like LLVM has a set of unicode conversion functions in it - http://llvm.org/docs/doxygen/html/ConvertUTF_8h_source.html. Have those always been there?
My current thinking is this:
This way
@StefanKarpinski, there are lots of places in the code that use Anyway, since
Yes, this is annoying. So what do you suggest?
My suggestion is just to accept that
The danger with that is that someone will write generic code where the data is converted and therefore copied and works for UInt16 and UInt32 arrays but then when passed a UInt8 array it takes ownership and the code breaks. But maybe that's an acceptable risk.
I'd say it's worth the risk, anyway UTF-8 is quite common, so it's not like it would only break in a corner case. Better always have Likewise, I wonder whether
Or we can just document that
I really dislike How about making it a method of
Merging things without fixing docs at the same time is a really bad habit that has led to documentation being incorrect or missing on master for months (ahem iterator traits). We need to stop doing that, rushing towards an RC or not.
It certainly isn't. There are three steps towards RC1 - feature freeze, branch release-0.5, then RC1. These are separate events that I don't think we should do at the same time, and we haven't formally done even the first step yet because the milestone still has breaking changes left on it. This PR should be merged as soon as it's ready, including
reverse() for GenericString/AbstractString returns a RevString, whose indexing behavior is very different from a reverse()'d String which is returned for String. Thus, calling reverseind() on the underlying String object is not correct for GenericString. Add a generic but O(n) method for AbstractString and use it for GenericString.
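The generic O(n) approach mentioned in this commit message can be sketched roughly as follows. This is an illustration written against current Julia, not the method added in the commit: `generic_reverseind` is a hypothetical name, and it calls `reverse(s)` itself instead of working with `RevString`. The idea is to count characters rather than rely on any particular byte layout.

```julia
# Hypothetical sketch: map index i in reverse(s) to the index of the same
# character in s by counting characters (linear time, encoding-agnostic).
function generic_reverseind(s::AbstractString, i::Integer)
    r = reverse(s)
    k = length(r, 1, i)              # ordinal position of the character at r[i]
    n = length(s)                    # total number of characters
    return nextind(s, 0, n - k + 1)  # index in s of the (n - k + 1)-th character
end

s = "ñab"            # 'ñ' occupies two code units in UTF-8
r = reverse(s)
@assert all(r[i] == s[generic_reverseind(s, i)] for i in eachindex(r))
```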
since their code in base has been removed
There, doc update done. Now just NEWS if someone wants to propose a wording for that.
UTF-32 encodings. Additional discussion of other encodings and how to
implement support for them is beyond the scope of this document for
Julia uses UTF-8 encoding by default, and support for new encodings can
be added by packages. Additional discussion of other encodings and how
Shouldn't it say that the UTF-16 and UTF-32 encodings are supported by the LegacyStrings package, since the latter is a quasi-official package mentioned in the deprecation warnings?
Is there a documented way to convert
#16974, which is still totally undocumented.
This PR has
    function cconvert(::Type{Cwstring}, s::AbstractString)
        v = transcode(Cwchar_t, String(s).data)
        !isempty(v) && v[end] == 0 || push!(v, 0)
        return v
    end
but where is the
These are valid points but irrelevant to this PR, which is strictly about deleting UTF-16 and UTF-32 string types and functions.
They are not irrelevant, because if you delete the
NEWS updates still badly needed for this and many other breaking changes so people know how to deal with them
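For readers trying to follow the `transcode` call quoted in the `cconvert` method above, here is a small round-trip sketch. It assumes the `transcode` signatures available in current Julia (a target code-unit type or `String` as the first argument), which may differ from what existed when this PR was open.

```julia
s = "naïve 😀"

# UTF-8 String -> vector of UTF-16 code units
utf16 = transcode(UInt16, s)
@assert utf16 isa Vector{UInt16}

# A character outside the Basic Multilingual Plane needs a surrogate pair.
@assert length(transcode(UInt16, "😀")) == 2

# UTF-16 code units -> UTF-8 String again
@assert transcode(String, utf16) == s
```

The quoted `cconvert` method additionally appends a trailing zero so that the buffer can be handed to C as a NUL-terminated wide string.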
…arnings for unsafe_string
fix tests to work with JuliaLang/julia#16590
Returns `true` if the given value is valid for its type, which currently can be one of
`Char`, `String`, `UTF16String`, or `UTF32String`.
Returns `true` if the given value is valid for its type, which currently can be either
`Char` or `String`.
This method is deleted for `String`, which breaks IJulia. Is it intentional? (If it is meant to be removed, the doc should be too.)
If IJulia needs UTF-8 validation, it may need to use LegacyStrings. Otherwise we could have an `isvalid(String)` method that returns true if the `String` is valid UTF-8 and false otherwise. The tricky thing is that there are a few different versions of what could be considered valid:
- encoding sanity (leading byte followed by the corresponding number of trailing bytes)
- invalid code points (surrogates)
- overlong encodings
And of course, these are not mutually exclusive – each string can exhibit any subset of these issues.
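The three categories listed above can be made concrete with one byte sequence each. The sketch assumes current Julia, where `isvalid` on a `String` rejects all three kinds; the byte values are standard textbook examples, not data taken from this PR.

```julia
# Encoding sanity: a three-byte leading byte followed by only one continuation byte.
truncated = String(UInt8[0xe2, 0x82])

# Invalid code point: the UTF-8-style encoding of the surrogate U+D800.
surrogate = String(UInt8[0xed, 0xa0, 0x80])

# Overlong encoding: NUL written with two bytes instead of one.
overlong = String(UInt8[0xc0, 0x80])

@assert !isvalid(truncated)
@assert !isvalid(surrogate)
@assert !isvalid(overlong)
@assert isvalid("foo")
```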
@StefanKarpinski, `Base` still includes UTF-8 validation: `isvalid(String, "foo")` works, and calls the C function `u8_isvalid`. You just removed `isvalid(::String)`. This seems a bit arbitrary and probably a mistake? Maybe we should have a fallback method `isvalid{T}(x::T) = isvalid(T, x)`?
Yes, let's just put that method back then.
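The fallback proposed in this thread forwards the one-argument form to the two-argument form using the value's own type. A sketch in current Julia syntax (`where {T}` instead of the `isvalid{T}` form used at the time) might look like the following; `check_valid` is a hypothetical name chosen so the example does not redefine anything in `Base`.

```julia
# Hypothetical stand-in for the proposed fallback: dispatch on the value's type
# and delegate to the two-argument isvalid(T, x).
check_valid(x::T) where {T} = isvalid(T, x)

@assert check_valid("foo")                   # forwards to isvalid(String, "foo")
@assert check_valid('a')                     # forwards to isvalid(Char, 'a')
@assert !check_valid(String(UInt8[0xff]))    # a lone 0xff byte is not valid UTF-8
```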
…arnings for unsafe_string
Part of #16107.