improve character category predicates #5939

stevengj · 2014-02-24T17:43:14Z

As @jiahao suggested in #5576, it might be worthwhile to use utf8proc (which we are shipping with Julia anyway) to provide functions like isalnum, isalpha, iscntrl, isdigit, isgraph, islower, isprint, ispunct, isspace, isupper, and possibly isblank in string.jl. The reason is that utf8proc seems to be more up-to-date on the Unicode standard than libc, and is unhampered by legacy issues (e.g. isblank returns false for a non-breaking space, apparently for legacy reasons).

utf8proc's results are also locale-independent. This may be a plus or a minus; I don't really understand how the locale affects the results of the abovementioned predicates in libc.

The text was updated successfully, but these errors were encountered:

JeffBezanson · 2014-02-24T17:48:52Z

We should definitely do this. All those predicates have canonical answers for unicode characters and shouldn't depend on locale.
One example is that the libc function isalpha(c) is defined to be equivalent to isupper(c) || islower(c), but in unicode there are letters that are neither lower- nor upper-case. The "C" locale selects the legacy behavior. So libc seems pretty bogus here.

jiahao · 2014-02-24T17:56:44Z

I'd be up for putting in some effort to make this happen for 0.4. Is it fair to say that things aren't so badly broken Unicode-wide that all this needs to happen in 0.3?

stevengj · 2014-02-24T18:04:37Z

I think that this particular issue can wait until 0.4.

stevengj · 2014-02-24T18:13:40Z

One difficult question is whether to even keep some of these functions, or to replace them with more Unicode-centric names and concepts.

stevengj · 2014-06-25T20:11:45Z

Note that these functions are all broken on Windows because the isw* functions take a single wchar_t argument, which is only 16 bits on Windows (and hence can only handle the BMP).

catawbasam · 2014-08-18T02:06:09Z

In follow-up to @JeffBezanson's comments in #8012, benchmarks comparing islower(Char) and islower(String) are posted at http://nbviewer.ipython.org/gist/catawbasam/28153e91774992d6482b

They compare 3 approaches:

the current Base methods (which is inaccurate on Windows)
methods based on utf8proc
methods based on PCRE

Findings for test functions, as tested:

For islower(Char), Base was fastest, utf8proc was moderately slower and PCRE was unacceptable.
For islower(String), Base was slowest, utf8proc was moderately faster and PCRE was much faster.

For my purposes, performance of islower(ByteString) and islower(SubString{ByteString}) are most important, so I find the PCRE approach attractive.

Comments and suggestions for improving either the methods or the tests would be welcome.

stevengj · 2014-08-18T13:37:51Z

But do utf8proc and PCRE always give the same answers? Is PCRE Unicode-7 aware?

Also, the string benchmark is probably misleading. If you want to optimize this function, don't use all (a higher-order function, where things probably don't get inlined). Use a straight for char in string loop.

catawbasam · 2014-08-18T17:06:03Z

"Is PCRE Unicode-7 aware?"
Per http://www.pcre.org/changelog.txt PCRE was updated to Unicode 6.3.3 in December 2013, so it is presumably not yet Unicode-7 aware.

In briefly reviewing http://www.unicode.org/charts/PDF/Unicode-7.0/ and http://www.unicode.org/charts/case/ I did not find any Unicode-7 character codes that had lowercase/uppercase. If you have any test cases available, I'd be happy to try them out.

Regarding all vs for loop, the utf8proc version is already using a for loop. However, I did add a tweaked version of the Base.islower(String) call to ccall(:iswlower with a for loop, and as you suggested, it was indeed faster than the current Base.islower(). The new variant is called islower_loop in the notebook, and the gist is updated with those results.

catawbasam · 2014-08-18T17:14:14Z

Seems like a utf8proc version of islower(String) might be faster if the string were passed via ccall and the conversion to character codes was done on that side. Unfortunately I don't know C very well, so I'm not in a position to do that test.

stevengj · 2014-08-18T17:34:26Z

Try adding the following to libmojibake (formerly utf8proc):

#include "mojibake.h"

/* note: assumes str points to a valid NUL-terminated UTF8 string */
int mojibake_islower(const uint8_t *str)
{
     uint8_t c = *str;
     while (c) {
          int length = utf8proc_utf8class[c];
          int32_t uc;
          switch (length) {
              case 1:
                   uc = str[0];
                   break;
              case 2:
                   uc = ((str[0] & 0x1F) <<  6) + (str[1] & 0x3F);
                   break;
              case 3:
                   uc = ((str[0] & 0x0F) << 12) + ((str[1] & 0x3F) <<  6)
                        + (str[2] & 0x3F);
                   break;
              case 4:
                   uc = ((str[0] & 0x07) << 18) + ((str[1] & 0x3F) << 12)
                        + ((str[2] & 0x3F) <<  6) + (str[3] & 0x3F);
                   break;
          }
          if (utf8proc_get_property(uc)->category != UTF8PROC_CATEGORY_LL)
               return 0;
          str += length;
          c = *str;
     }
     return 1;
}

stevengj · 2014-08-18T17:38:26Z

I don't think we would want islower to ever disagree for a Char or a String, so we need to use the same implementation for both. (Is islower for strings a function whose performance is critical?)

JeffBezanson · 2014-08-18T17:53:54Z

The libmojibake approach is generally preferred, since it is already a deep dependency of the parser. Although PCRE is very important, I'd rather not dig in deeper with it. At a certain point, it becomes reasonable to want to use just the language with minimal library dependencies.

catawbasam · 2014-08-18T18:23:10Z

Makes sense. I'll plan on trying out mojibake_islower in the next day or so.

re "Is islower for strings a function whose performance is critical?"
Not particularly. It's just that, in contrast to Julia's excellent performance on numerical tasks, text handling times aren't always especially good compared to Python's. It would be a shame to give up any further ground.

catawbasam · 2014-08-19T03:03:00Z

Updated test results are up at http://nbviewer.ipython.org/gist/catawbasam/28153e91774992d6482b.
islower_utf8proc() is faster in this version, and mojibake_islower(String) is faster yet for ByteStrings.

Changes:

added mojibake_islower(ByteString) based on a C function provided by @stevengj
Revised islower_utf8proc() to use a hacked version of category_code(c) which omits the line cat == 0 ? 30 : cat. It isn't needed by islower() and adds quite a bit of time.
Updated tests of islower(Char) to use a randomly shuffled array of Char.
dropped PCRE as it does not work well with Char and libmojibake is the preferred path.

Updated Findings for test functions, as tested:

For Chars run against shuffled Char arrays, islower_base was just a hair faster than islower_utf8proc
For Strings, islower_utf8proc was 3-4x faster than islower_base
The discrepency between islower(Char) and islower(String) is probably due to the inefficiency of the all() function used in the Base version of islower(String).
mojibake_islower(ByteString) provided another 3x speed boost for ByteStrings.

ivarne · 2014-08-19T14:02:48Z

This might be totally obvious to you, but it seems to me like mojibake_islower should take an explicit length, so that SubString can use the same implementation.

catawbasam · 2014-08-19T14:46:39Z

Looks to me like we might be able to forgo specialized C methods like mojibake_islower altogether if we move c > 0x10FFFF && return 0x0000 and cat == 0 ? 30 : cat from category_code() in utf8proc.jl to utf8proc_get_property on the C side, and still get pretty good performance.

That would keep libmojibake nice and slim and minimize calls on @stevengj 's time.

stevengj · 2014-08-19T17:46:14Z

One of those checks can be eliminated completely by JuliaStrings/utf8proc#15. As for the other check, moving it into utf8proc_get_property is probably not a good idea, because that is an inner-loop function in the normalization code.

Also, realize that I got the 3x speed boot in mojibake_islower in part by completely removing all validity checks. Honestly, I don't think that is worth it to get a factor of 3 in a function that is probably not performance-critical anyway. (In what real applications is the inner loop a check of whether some string is lower-case?)

catawbasam · 2014-08-19T18:39:25Z

"Honestly, I don't think that is worth it to get a factor of 3 in a function that is probably not performance-critical anyway"
Agreed.

catawbasam · 2014-08-20T23:30:17Z

A draft cut at character category predicates based on libmojibake (as-is) is posted at https://gist.github.com/catawbasam/3ab68615b4c78a5a49b1
[nbviewer link: http://nbviewer.ipython.org/gist/catawbasam/3ab68615b4c78a5a49b1]
The definitions draw from similar functions in Haskell, Go and perl.

The functions are intended to be non-breaking, with one minor exception -- isspace() returns true for U+0085 (newline), U+00A0 (non-breaking space). This follows the Go version.

If the approach looks promising, I could put together a pull request.

[pao: add nbviewer link]

stevengj · 2014-08-21T02:02:20Z

@catawbasam, this seems reasonable to me.

StefanKarpinski · 2014-08-21T13:04:41Z

Seems good although I feel like some investigation of the non-breaking space issue is warranted. Would the desired behavior of that character be precisely to look like a space but not act like one?

catawbasam · 2014-08-21T14:57:06Z

Yes, it looks like a space (with one exception) but behaves differently.
The 2 contexts where I've seen non-breaking space used are:

In word processors, where you sometimes want a visible space, but you don't want a line break at that space. For example you might use it to control orphaning.
In HTML it can be used to avoid whitespace collapse.

Just reviewed Wikipedia at http://en.wikipedia.org/wiki/Non-breaking_space, and there are several variants of non-breaking space. All are in category ZS, so isspace can be slightly simplified to:

function isspace_moji(c::Char)
    return  c in (' ','\t','\n','\r','\f','\v', NEL) || 
             category_code_assigned(c)==UTF8PROC_CATEGORY_ZS
end

The one non-breakable space that looks marginal is word-joiner U+2060: "The word-joiner does not normally produce any space but prohibits a line break on either side of it." It is in category Zs, so it would be selected as true under the proposed definition. Worth adding a clause to exclude it?

stevengj · 2014-08-21T15:31:23Z

What do other languages do with U+2060?

catawbasam · 2014-08-21T17:50:37Z

Oops, made a mistake when checking the code categories -- U+2060 is in Cf: Other, Format,
so there is no issue with it.

julia> int(category_code_assigned(char(0x002060))) 
27

stevengj · 2014-09-13T15:14:37Z

Closed by #8233.

stevengj added feature and removed feature labels Feb 24, 2014

JeffBezanson added the unicode label Feb 24, 2014

stevengj added the unicode label Feb 24, 2014

stevengj mentioned this issue Jun 25, 2014

inconsistent results from isprint('\t') #7416

Closed

This was referenced Aug 13, 2014

islower(Char) incorrect on Windows #7988

Closed

PCRE based islower() isupper()? #8012

Closed

JeffBezanson changed the title ~~use utf8proc's Unicode character categories for isspace etc.~~ improve character category predicates Aug 15, 2014

catawbasam mentioned this issue Aug 24, 2014

char predicates using libmojibake #8110

Closed

catawbasam mentioned this issue Sep 4, 2014

moji char predicates #8233

Merged

stevengj closed this as completed Sep 13, 2014

JeffBezanson mentioned this issue Oct 21, 2014

REPL show utf-8 string maybe wrong #8747

Closed

stevengj mentioned this issue Jul 23, 2017

Replace isnumber(), etc. with a single function #14347

Open

Seelengrab mentioned this issue May 12, 2024

Ensure non-ASCII decimal digits are also isdigit #54447

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

improve character category predicates #5939

improve character category predicates #5939

stevengj commented Feb 24, 2014

JeffBezanson commented Feb 24, 2014

jiahao commented Feb 24, 2014

stevengj commented Feb 24, 2014

stevengj commented Feb 24, 2014

stevengj commented Jun 25, 2014

catawbasam commented Aug 18, 2014

stevengj commented Aug 18, 2014

catawbasam commented Aug 18, 2014

catawbasam commented Aug 18, 2014

stevengj commented Aug 18, 2014

stevengj commented Aug 18, 2014

JeffBezanson commented Aug 18, 2014

catawbasam commented Aug 18, 2014

catawbasam commented Aug 19, 2014

ivarne commented Aug 19, 2014

catawbasam commented Aug 19, 2014

stevengj commented Aug 19, 2014

catawbasam commented Aug 19, 2014

catawbasam commented Aug 20, 2014

stevengj commented Aug 21, 2014

StefanKarpinski commented Aug 21, 2014

catawbasam commented Aug 21, 2014

stevengj commented Aug 21, 2014

catawbasam commented Aug 21, 2014

stevengj commented Sep 13, 2014