improve character category predicates #5939
Comments
We should definitely do this. All of those predicates have canonical answers for Unicode characters and shouldn't depend on locale.
I'd be up for putting in some effort to make this happen for 0.4. Is it fair to say that things aren't so badly broken, Unicode-wise, that all this needs to happen in 0.3?
I think that this particular issue can wait until 0.4.
One difficult question is whether to even keep some of these functions, or to replace them with more Unicode-centric names and concepts.
Note that these functions are all broken on Windows because the libc wide-character functions there only operate on 16-bit wchar_t values.
In follow-up to @JeffBezanson's comments in #8012, benchmarks comparing islower(Char) and islower(String) are posted at http://nbviewer.ipython.org/gist/catawbasam/28153e91774992d6482b. They compare three approaches:
Findings for test functions, as tested:
For my purposes, the performance of islower(ByteString) and islower(SubString{ByteString}) is most important, so I find the PCRE approach attractive. Comments and suggestions for improving either the methods or the tests would be welcome.
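For orientation, a minimal sketch (not the posted notebook) of the kind of Char-level, libc-backed predicate and string-level timing loop being compared; the islower_libc and count_lower names and the toy corpus are made up for illustration.

```julia
# Hypothetical benchmark sketch, mirroring the libc-backed Char predicate
# Base used at the time, plus a simple string-level counting loop to time.
islower_libc(c::Char) = ccall(:iswlower, Cint, (Cwchar_t,), c) != 0
# Note: Cwchar_t is only 16 bits on Windows, which is why the libc-based
# predicates break there for characters outside the BMP.

function count_lower(s::AbstractString)
    n = 0
    for c in s
        n += islower_libc(c)   # Bool promotes to 0 or 1
    end
    return n
end

s = repeat("Hello, Wörld! ", 10_000)   # toy corpus
@time count_lower(s)
```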
But do utf8proc and PCRE always give the same answers? Is PCRE Unicode-7 aware? Also, the string benchmark is probably misleading. If you want to optimize this function, don't use
"Is PCRE Unicode-7 aware?" In briefly reviewing http://www.unicode.org/charts/PDF/Unicode-7.0/ and http://www.unicode.org/charts/case/ I did not find any Unicode-7 character codes that had lowercase/uppercase. If you have any test cases available, I'd be happy to try them out. Regarding |
Seems like a utf8proc version of
Try adding the following to libmojibake (formerly utf8proc):
I don't think we would want
The libmojibake approach is generally preferred, since it is already a deep dependency of the parser. Although PCRE is very important, I'd rather not dig deeper into it. At a certain point, it becomes reasonable to want to use just the language with minimal library dependencies.
Makes sense. I'll plan on trying that out. Re: "Is islower for strings a function whose performance is critical?"
Updated test results are up at http://nbviewer.ipython.org/gist/catawbasam/28153e91774992d6482b. Changes:
Updated findings for test functions, as tested:
This might be totally obvious to you, but it seems to me like
Looks to me like we might be able to forgo specialized C methods along those lines. That would keep libmojibake nice and slim and minimize calls on @stevengj's time.
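A rough sketch of that idea, assuming a single general-category lookup is exposed by libmojibake/utf8proc (the utf8proc_category entry point in current utf8proc, reachable through Julia's bundled runtime) and that the predicates themselves live in Julia; the _uc names are hypothetical.

```julia
# Sketch only: one C-level category query, all predicates in Julia.
# Category codes (Lu=1, Ll=2, Lt=3, Lm=4, Lo=5, Nd=9) are copied from
# current utf8proc headers; verify against utf8proc.h before relying on them.
category(c::Char) = ccall(:utf8proc_category, Cint, (UInt32,), UInt32(c))

isupper_uc(c::Char) = category(c) == 1              # Lu
islower_uc(c::Char) = category(c) == 2              # Ll
isalpha_uc(c::Char) = 1 <= category(c) <= 5         # Lu, Ll, Lt, Lm, Lo
isdigit_uc(c::Char) = category(c) == 9              # Nd

islower_uc('ß'), isupper_uc('ß')   # (true, false), regardless of locale
```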
One of those checks can be eliminated completely by JuliaStrings/utf8proc#15. As for the other check, moving it into Also, realize that I got the 3x speed boost in
"Honestly, I don't think that is worth it to get a factor of 3 in a function that is probably not performance-critical anyway" |
A draft cut at character category predicates based on libmojibake (as-is) is posted at https://gist.github.com/catawbasam/3ab68615b4c78a5a49b1. The functions are intended to be non-breaking, with one minor exception -- isspace() returns true for U+0085 (newline) and U+00A0 (non-breaking space). This follows the Go version. If the approach looks promising, I could put together a pull request. [pao: add nbviewer link]
@catawbasam, this seems reasonable to me.
Seems good, although I feel like some investigation of the non-breaking space issue is warranted. Would the desired behavior of that character be precisely to look like a space but not act like one?
Yes, it looks like a space (with one exception) but behaves differently.
Just reviewed Wikipedia at http://en.wikipedia.org/wiki/Non-breaking_space, and there are several variants of non-breaking space. All are in category Zs, so they would all count as spaces under the proposed definition.
The one non-breakable space that looks marginal is word-joiner U+2060: "The word-joiner does not normally produce any space but prohibits a line break on either side of it." It is in category Zs, so it would be selected as true under the proposed definition. Worth adding a clause to exclude it?
What do other languages do with U+2060?
Oops, made a mistake when checking the character categories -- U+2060 is in Cf (Other, Format), so a Zs-based definition would not select it after all.
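As a quick check of the categories under discussion, one might query utf8proc's category-string lookup directly (utf8proc_category_string exists in current utf8proc and is assumed reachable through Julia's runtime; the catstr helper is hypothetical).

```julia
# Report the Unicode general category of the codepoints discussed above.
catstr(c::Char) =
    unsafe_string(ccall(:utf8proc_category_string, Cstring, (UInt32,), UInt32(c)))

catstr('\u00A0')   # "Zs" -- NO-BREAK SPACE
catstr('\u202F')   # "Zs" -- NARROW NO-BREAK SPACE
catstr('\u2060')   # "Cf" -- WORD JOINER, so a Zs-based isspace excludes it
```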
Closed by #8233. |
As @jiahao suggested in #5576, it might be worthwhile to use utf8proc (which we are shipping with Julia anyway) to provide functions like isalnum, isalpha, iscntrl, isdigit, isgraph, islower, isprint, ispunct, isspace, isupper, and possibly isblank in string.jl. The reason is that utf8proc seems to be more up-to-date on the Unicode standard than libc, and is unhampered by legacy issues (e.g. isblank returns false for a non-breaking space, apparently for legacy reasons). utf8proc's results are also locale-independent. This may be a plus or a minus; I don't really understand how the locale affects the results of the abovementioned predicates in libc.
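To make the isblank/non-breaking-space point concrete, a hedged comparison of libc's iswblank against a plain general-category test; the helper names are hypothetical, and the utf8proc_category_string call is the same assumption as above.

```julia
# libc's iswblank is locale-shaped and, as noted above, reports false for
# NO-BREAK SPACE; a category-based test treats everything in Zs (plus tab)
# as blank. Sketch only, not Base's actual definition.
isblank_libc(c::Char) = ccall(:iswblank, Cint, (Cwchar_t,), c) != 0
catstr(c::Char) =
    unsafe_string(ccall(:utf8proc_category_string, Cstring, (UInt32,), UInt32(c)))
isblank_uc(c::Char) = c == '\t' || catstr(c) == "Zs"

nbsp = '\u00A0'
isblank_libc(nbsp)   # false in typical locales (the legacy behavior mentioned)
isblank_uc(nbsp)     # true: NO-BREAK SPACE is in category Zs
```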