Add UTF encoding validity functions #11575

ScottPJones · 2015-06-04T16:48:00Z

This introduces the Base.check_string function, with methods for handling UTF-8, UTF-16, and UTF-32 as vectors of UInt8, UInt16, and UInt32, respectively, as well as an AbstractString version that operates on Unicode characters (making sure that they are all valid code points: 0 <= ch < 0xd800 or 0xe000 <= ch <= 0x10ffff).
There are options to accept or reject things like Modified UTF-8 encoding, CESU-8 encoding, or "overly long" encodings. (See utfcheck.jl for documentation.)
These methods either throw a UnicodeError, with information about which character was invalid and its position in the input, or return a tuple containing: the number of logical characters in the string; a bit flag indicating what types of data were found (e.g. all ASCII, all Latin1, surrogates present, overlong characters present); the number of characters that would take 2 bytes to encode in UTF-8; the number that would take 3 bytes; and the number that would take 4 bytes.
This information is designed to be useful for future conversion routines, because it allows one to calculate the exact size needed to represent a valid string in UTF-8, UTF-16, or UTF-32, and also indicates whether an optimized "widening" or "narrowing" conversion can be done.
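To make the size calculation concrete, here is a minimal sketch of how the per-width counts determine the exact output sizes. It is not the PR's check_string: the names count_utf8_widths, utf8_bytes, and utf16_units are hypothetical, the flags value is omitted, and the code uses current Julia 1.x syntax rather than the 0.4-era API.

```julia
# Hypothetical helper (not the PR's check_string): classify code points by
# how many bytes each would need in UTF-8.
function count_utf8_widths(codepoints)
    total = 0
    num2byte = 0
    num3byte = 0
    num4byte = 0
    for cp in codepoints
        ch = UInt32(cp)
        # Reject surrogates and anything above the last code point, U+10FFFF
        if 0xd800 <= ch <= 0xdfff || ch > 0x10ffff
            throw(ArgumentError("invalid code point 0x$(string(ch, base = 16))"))
        end
        total += 1
        if ch < 0x80
            # ASCII: one byte, already covered by `total`
        elseif ch < 0x800
            num2byte += 1
        elseif ch < 0x10000
            num3byte += 1
        else
            num4byte += 1
        end
    end
    return (total, num2byte, num3byte, num4byte)
end

# Every character costs one byte, plus 1, 2, or 3 extra bytes for the wider classes.
utf8_bytes(total, n2, n3, n4) = total + n2 + 2 * n3 + 3 * n4

# Only 4-byte characters need a surrogate pair, i.e. one extra 16-bit unit.
utf16_units(total, n2, n3, n4) = total + n4
```

For the string "aé€😀" this gives the counts (4, 1, 1, 1), so a converter could allocate exactly 10 bytes for UTF-8 or 5 code units for UTF-16 before writing anything.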

ScottPJones · 2015-06-04T16:58:22Z

Prepare your 🍅s! I'm off for dinner, please let me know what you think of this and #11573

tkelman · 2015-06-04T17:00:43Z

base/sysimg.jl

@@ -84,6 +84,8 @@ include("iterator.jl")
 include("osutils.jl")

 # strings & printing
+include("utferror.jl")
+include("utftypes.jl")


you'll need to include("utfcheck.jl") here too

Hmm... somehow that got lost... was definitely part of my source locally...

tkelman · 2015-06-05T11:02:42Z

Please rebase relative to latest master now that #11573 is merged

ScottPJones · 2015-06-05T11:09:08Z

Of course! Thanks very much. Should I squash everything also?

tkelman · 2015-06-05T11:13:40Z

It's only 2 commits here, and rebasing would remove 1 since that's now on master.

If you want to amend your commit message while rebasing to make it a little more descriptive that would be welcomed.

tkelman · 2015-06-05T13:15:42Z

As I mentioned at #11551 (comment), I think this code is essentially ready now and good to merge, assuming the tests pass.

ScottPJones · 2015-06-05T13:20:52Z

Is that what you wanted, as far as the commit message? Thanks!

tkelman · 2015-06-05T13:24:50Z

The indentation is a little odd, but it's fine. Nice and descriptive, I like that part.

ScottPJones · 2015-06-06T05:08:11Z

Ah, I see what happened to the indentation... I updated that directly on GitHub.
The test failure is the dreaded OOM, not the fault of this change, so hopefully somebody can merge this.
Thanks!

vtjnash · 2015-06-06T06:28:24Z

base/utfcheck.jl

+            elseif ch < 0x800
+                num2byte += 1
+                flags |= UTF_UNICODE2
+            elseif T != Vector{UInt16} && ch > 0x0ffff


is the type check here actually necessary? if T==UInt16, then typemax(T) == 0xffff && ch > 0xffff should be false. no?

It's checking against Vector{UInt16}, not UInt16, so I don't think Julia can figure it out by itself.

julia> function t16(x)
           local a::UInt32
           @inbounds a, i = next(x, 1)
           if a > 0xffff
               return 1
           else
               return 2
           end
       end
t16 (generic function with 1 method)

julia> code_native(t16,Tuple{Vector{UInt16}})
        .section        __TEXT,__text,regular,pure_instructions
Filename: none
Source line: 4
        pushq   %rbp
        movq    %rsp, %rbp
        movl    $2, %eax
Source line: 4
        popq    %rbp
        ret

julia> code_llvm(t16,Tuple{Vector{UInt16}})

define i64 @julia_t16_20941(%jl_value_t*) {
top:
  ret i64 2
}

that looks pretty successful to me.

Thanks! I changed that. I know that at one point (before Jeff put in the change that fixed the big performance regression compared to 0.3.x that I'd seen) it seemed not to get that right (unless my eyes were playing tricks on me!), but I hadn't thought to try again. (Also, my test wasn't the same as yours; it was using an AbstractString.)
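The point of this exchange can be sketched in current Julia as follows. This is an illustration only, with hypothetical names (above_bmp, count_above_bmp), not code from the PR: because each method is specialized on the element type, the comparison against 0xffff needs no explicit type check, and for UInt16 elements it can never be true.

```julia
# Hypothetical helper: true when `ch` lies above the Basic Multilingual Plane.
above_bmp(ch::Unsigned) = ch > 0xffff

# Generic over the element type; Julia compiles a separate specialization
# for Vector{UInt16} and Vector{UInt32}.
function count_above_bmp(dat::Vector{T}) where {T<:Unsigned}
    n = 0
    for ch in dat
        # For T == UInt16 this condition is always false, so the compiler
        # is free to drop the branch, as the code_llvm output above suggests.
        if above_bmp(ch)
            n += 1
        end
    end
    return n
end
```

Calling count_above_bmp on a Vector{UInt16} always returns 0, while a Vector{UInt32} containing 0x00010000 is counted as expected.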

ScottPJones · 2015-06-07T01:19:05Z

The only failures are due to the OOM problems affecting everybody.
I believe I've addressed the remaining issues. Anything else preventing this from going in?
Thanks!

Added new `convert` methods that use the `check_string` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575

tkelman · 2015-06-07T11:16:13Z

base/utfcheck.jl

+end
+
+function check_string(dat::Vector{UInt8}, len = sizeof(dat), pos = 0 ; options::Integer=0)
+" Validates and calculates number of characters in a UTF-8 encoded vector of UInt8


I believe @nalimilan said this in one of the other PRs, but the doc string convention is more likely to be docs immediately preceding the function, not just inside the function. Not positive though.

I hadn't seen that, but why introduce a different convention from what's very commonly in use (i.e. Python documentation)? I'd heard that void strings (which I took to mean any string, inside or outside of a method, that wasn't used as an argument, returned, or assigned to a variable, since it wasn't specified at all) would in the future be picked out for documentation.
Since it seems there is a lot of overlap between Python and Julia programmers (all the SciPy, NumPy, etc. people), I think it would be a very good thing to be consistent with that, if there is no compelling reason not to.

Compelling reason being Julia has multiple dispatch, with multiple method implementations for the same generic function but different argument types, which Python doesn't have. That and the @doc macro works for docstring-before-function, so I believe that's the direction that things are going.
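For readers unfamiliar with the convention being discussed, here is a minimal sketch in current Julia of a docstring placed before the definition, with one generic function and two methods. The function name describe is hypothetical and exists only for this illustration.

```julia
"""
    describe(x)

Return a short description of `x`. (Hypothetical function, for illustration only.)
"""
function describe end

# Each method of the same generic function can carry its own docstring,
# which is why the docs attach to a definition rather than live inside a body.
"Describe an integer."
describe(x::Integer) = "integer $x"

"Describe a string."
describe(x::AbstractString) = "string of $(length(x)) characters"
```

With multiple dispatch there is no single function body to hold "the" docstring, so attaching the text to each definition (or to the generic function itself) keeps the mapping unambiguous.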

The use of the @doc macro apparently will no longer be necessary (or even the doc string prefix) if the latest version of Docile were put in Base. I'm not sure why multiple dispatch feels like a compelling reason to you... This doesn't remove the ability to (later in the build process, at least) use @doc doc"""...""" -> syntax before the first function of a set of methods...
I think you are going to see this style whether you like it or not... just this week I had to tell a Python programmer learning Julia that he had to move his comments out of the function and use @doc doc"""...""" -> in order to get Help to pick it up. Allowing this style will make a lot of people moving from Python happy (or less confused, at least)

People coming from Doxygen, Javadoc or gtk-doc are used to having docs before the body of the function, so you can't please everybody. Anyway, didn't you support a solution similar to Doxygen? :-p

doxygen doesn't have anything to do with it... you can use doxygen with Python, which has the docstring inside the function.
Also, I'm not saying that the default shouldn't be to have the documentation before... but why couldn't both be allowed?
About doxygen - I just like having things documented, and using doxygen pushes that a bit, but I've never been particularly attached to the syntax, just to having the necessary information preserved to make cross-references, etc. Since nobody here seems to like the doxygen syntax, and somebody on another thread talked about parsing the comments to build the metadata, (which could then be used to produce output for doxygen), I decided that that was the better solution for now.

Can we just pick one convention, and use it without arguing please? Docstrings still need some work, but for code outside of base, docstring before function has been implemented, docstring inside of function hasn't. If someone wants to make a PR to support docstrings inside of functions in Docile or the base docs system, then that convention could be an alternate to accommodate people coming from python. For now, wouldn't it be easier to just make this change than waste more bandwidth resisting trivial changes that multiple reviewers are asking you to make?

I'm fine with changing it... I was responding via my phone until a couple of hours ago... out enjoying the day... I'm back at the laptop, doing it now.

Updated comment to go before function, not indented by 4

ScottPJones · 2015-06-08T08:09:47Z

I believe I've handled everything... please take a look!

Rewrote a number of the conversions between ASCIIString, UTF8String, and UTF16String.
Rewrote length() for UTF16String().
Improved reverse() for UTF16String().
Added over 150 lines of testing code to detect the above conversion problems
Added (in a gist) code to show other conversion problems not yet fixed: https://gist.github.com/ScottPJones/4e6e8938f0559998f9fc
Added (in a gist) code to benchmark the performance, to ensure that adding the extra validity checking did not adversely affect performance (in fact, performance was greatly improved). https://gist.github.com/ScottPJones/79ed895f05f85f333d84
Updated based on review comments
Changes to error handling and check_string
Rebased against JuliaLang#11575
Updated comment to go before function, not indented by 4
Updated to use unsafe_checkstring
Removed redundant argument documentation

nalimilan · 2015-06-15T14:46:57Z

Ah, something that @mbauman's comment on the other thread reminded me of: to check bounds in checkstring, you can call checkbounds, which is slightly shorter.
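As a rough illustration of the suggestion, the sketch below validates a subrange with checkbounds before scanning it. It uses current Julia; the function name count_ascii and its pos/endpos arguments are hypothetical and are not the PR's checkstring signature.

```julia
# Hypothetical wrapper: validate a subrange of `dat` once, then scan it
# without per-element bounds checks.
function count_ascii(dat::Vector{UInt8}, pos::Integer = 1, endpos::Integer = length(dat))
    # Throws a BoundsError unless every index in pos:endpos is valid for dat
    checkbounds(dat, pos:endpos)
    n = 0
    @inbounds for i in pos:endpos
        # A Bool adds as 0 or 1
        n += dat[i] < 0x80
    end
    return n
end
```

Doing the check once up front is what makes the @inbounds loop safe; a call such as count_ascii(UInt8[0x41], 1, 5) fails with a BoundsError instead of reading past the end of the vector.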

ScottPJones · 2015-06-15T14:51:13Z

I'm already changing it - it doesn't really matter that it's shorter though, it is pretty important that it is correct! 😀 Thanks!

ScottPJones · 2015-06-15T15:17:06Z

@nalimilan It turns out there would still be problems using checkbounds, so I think for now this should just be left as is, and a separate PR done to deal with bounds checking issues throughout base.
So, anything left to do?

ScottPJones · 2015-06-15T17:10:30Z

@mbauman Could you please take a look at my use of next(), start(), endof(), pos, and endpos in this code? The back and forth review suggestions are driving me a bit crazy, and I'd like some sort of definitive answer as to what should be done. I've just been following patterns I've seen elsewhere in the code, but if that code is not correct, then I'm lost...

mbauman · 2015-06-15T17:22:04Z

Yes, I'm sorry, I was missing the context of this PR, and was answering all your questions (both here and in #11713) in the general case for all AbstractArrays. Your uses of indexing and iteration over arrays here look like they're entirely restricted to Vector, and in that case, the index and the iteration state happen to be the same. So everything here looks to be correct (albeit dependent upon the internal Array iteration implementation).

I think that's just fine for internal code like this, but this is outside of my purview and I won't be helping to maintain this code directly. Edit: I will be working more on array indexing, and I do not foresee the implementation of iteration for Array changing anytime soon.
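For contrast with the Vector-only code discussed here, the following sketch shows the general-case pattern, which asks the array for its own indices instead of assuming that index and iteration state coincide. It is written for current Julia, where start/next/done have been replaced by iterate; the function name first_non_ascii is hypothetical.

```julia
# Hypothetical helper: return the index of the first byte >= 0x80, or nothing.
# eachindex works for any AbstractVector, not only Vector.
function first_non_ascii(dat::AbstractVector{UInt8})
    for i in eachindex(dat)
        if dat[i] >= 0x80
            return i
        end
    end
    return nothing
end
```

The same function accepts a plain Vector{UInt8} or a view into one, and the returned index is always valid for the array that was passed in.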

ScottPJones · 2015-06-15T17:25:24Z

@mbauman Thanks very much. I'll be looking forward to your planned Interfaces manual page!

Adds `check_string` function, which checks a vector of bytes, 16-bit or 32-bit words, or an AbstractString for validity, either for UTF-8, UTF-16, or UTF-32 encoding.
By default, `Modified UTF-8 (long \0 encoding)` and `CESU-8 (surrogate pairs encoded as 2 UTF-8 3-byte sequences)` are allowed, but other over long encoded sequences are not allowed, but this can be changed by the keyword options argument.
Add unit tests of all the errors found by `check_string`
Updated documentation to not use doxygen tags.
Move documentation strings from line after to line before
Add testing of valid strings
Improve/consolidate documentation
Add bounds checking
Change name to unsafe_checkstring, warn that doesn't check bounds
Add checkstring, which does check bounds
Add tests of bounds checking
Change order of start/end positions
Update bounds checking tests
Change 1 to start(dat)
Use checkbounds()
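The two non-standard encodings named in this commit message can be illustrated with a small sketch. It is an illustration in current Julia under stated assumptions, not code from the PR: the names MODIFIED_UTF8_NUL, encode3, and cesu8 are hypothetical.

```julia
# Modified UTF-8 encodes U+0000 as this overlong 2-byte pair instead of 0x00.
const MODIFIED_UTF8_NUL = UInt8[0xc0, 0x80]

# Hypothetical helper: encode a 16-bit value in 0x0800:0xffff as a
# 3-byte UTF-8-style sequence (lead byte 1110xxxx, two continuation bytes).
function encode3(u::UInt16)
    return UInt8[0xe0 | (u >> 12), 0x80 | ((u >> 6) & 0x3f), 0x80 | (u & 0x3f)]
end

# Hypothetical helper: CESU-8 splits a supplementary code point into a UTF-16
# surrogate pair and encodes each surrogate as its own 3-byte sequence.
function cesu8(cp::UInt32)
    0x10000 <= cp <= 0x10ffff || throw(ArgumentError("expected a supplementary code point"))
    v = cp - 0x10000
    hi = UInt16(0xd800 + (v >> 10))     # high (lead) surrogate
    lo = UInt16(0xdc00 + (v & 0x3ff))   # low (trail) surrogate
    return vcat(encode3(hi), encode3(lo))
end
```

For U+1F600, standard UTF-8 uses the 4 bytes F0 9F 98 80, whereas this CESU-8 sketch yields the 6 bytes ED A0 BD ED B8 80. A strict validator rejects the latter, which is why accepting it has to be an explicit option.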

ScottPJones · 2015-06-15T21:06:58Z

OK, I've changed this according to the very last round of suggestions in #11713.
Thanks everybody for your time and thorough reviews!
Anything left to fix?

tkelman · 2015-06-17T04:03:53Z

Unless anyone objects strongly, I'll merge this in a day or two so we can continue on to the actual bug fixes. This part of the code looks pretty much done to me, so for the sake of future unicode bugfix backporting if nothing else let's get it on master.

Add UTF encoding validity functions

ScottPJones · 2015-06-19T01:03:46Z

🎉 Thank you very much, @tkelman! I'll go rebase the rest of my PRs now...
Thanks to everybody for the very thorough reviews, and all the time you spent on them. I'm very impressed with and grateful for all the thought, great suggestions, and help from people!

Rewrote a number of the conversions between ASCIIString, UTF8String, and UTF16String.
Rewrote length() for UTF16String().
Improved reverse() for UTF16String().
Added over 150 lines of testing code to detect the above conversion problems
Added (in a gist) code to show other conversion problems not yet fixed: https://gist.github.com/ScottPJones/4e6e8938f0559998f9fc
Added (in a gist) code to benchmark the performance, to ensure that adding the extra validity checking did not adversely affect performance (in fact, performance was greatly improved). https://gist.github.com/ScottPJones/79ed895f05f85f333d84
Updated based on review comments
Changes to error handling and check_string
Rebased against JuliaLang#11575
Updated comment to go before function, not indented by 4
Updated to use unsafe_checkstring
Removed redundant argument documentation
Fix AbstractVector{UInt16} conversion
Remove support for converting Vector{UInt16} to UTF8String
Add Unicode validation function and fix UTF-16 conversion bugs

Added new `convert` methods that use the `checkstring` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575

Added new `convert` methods that use the `checkstring` function to validate input
Added tests for many sorts of valid/invalid data
Depends on PR JuliaLang#11551 and JuliaLang#11575
Updated to use unsafe_checkstring, fix comments
Remove conversions from Vector{UInt32}
Move some code from utf32.jl to utf16.jl and utf8.jl, hopefully more logical

tkelman added the domain:unicode Related to unicode characters and encodings label Jun 4, 2015

tkelman reviewed Jun 4, 2015

ScottPJones mentioned this pull request Jun 5, 2015

Fix #10959 bugs with UTF-16 conversions #11551

Merged

ScottPJones force-pushed the spj/checkstring branch 2 times, most recently from 6714044 to f1c1583 Compare June 5, 2015 10:43

ScottPJones force-pushed the spj/checkstring branch from f1c1583 to 79f2618 Compare June 5, 2015 12:28

vtjnash reviewed Jun 6, 2015

ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jun 6, 2015

Rebase against JuliaLang#11575

9f26d58

ScottPJones force-pushed the spj/checkstring branch from 79f2618 to da43b2a Compare June 6, 2015 20:49

ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jun 7, 2015

Rebase against JuliaLang#11575

ce5e7c8

ScottPJones mentioned this pull request Jun 7, 2015

Fix #10959 bugs with UTF-32 conversions #11607

Merged

tkelman reviewed Jun 7, 2015

ScottPJones added a commit to ScottPJones/julia that referenced this pull request Jun 8, 2015

Rebase against JuliaLang#11575

70bf53c

Updated comment to go before function, not indented by 4

tkelman mentioned this pull request Jun 8, 2015

Add check_string function that is more generic, thanks to Encodings quinnj/Strings.jl#2

Open

ScottPJones force-pushed the spj/checkstring branch from c457bfb to e462158 Compare June 15, 2015 21:05

tkelman added a commit that referenced this pull request Jun 19, 2015

Merge pull request #11575 from ScottPJones/spj/checkstring

bbb8764

Add UTF encoding validity functions

tkelman merged commit bbb8764 into JuliaLang:master Jun 19, 2015

ScottPJones deleted the spj/checkstring branch June 19, 2015 01:03

ScottPJones mentioned this pull request Jun 23, 2015

abbreviated syntax for docstrings #8965

Closed
