uppercase/lowercase functions are not portable? #11471

stevengj · 2015-05-28T14:21:03Z

Since @ScottPJones was mentioning upper/lowercase functions recently, I took a quick look at them and I noticed that we are calling towupper and towlower, which are C99 functions that accept wchar_t arguments.

Unfortunately, this means that they are broken on Windows (where wchar_t is 16 bits) for any character outside the BMP. Even on other platforms with a 32-bit wchar_t, they are going to return different results on different systems, and many systems will have out-of-date Unicode tables. They are also locale-dependent; I'm not sure if this is desirable for us.
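To make the 16-bit concern concrete, here is a minimal sketch in C of what happens to a code point outside the BMP when it has to fit into a single 16-bit unit. The function names are hypothetical and are not part of any library; the example code point U+10428 (DESERET SMALL LETTER LONG I) is chosen only for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* What is left of a code point if only its low 16 bits are kept,
   as happens when it is forced into a 16-bit character type. */
uint16_t truncate_to_16(uint32_t codepoint) {
    return (uint16_t)(codepoint & 0xFFFFu);
}

/* Split a non-BMP code point into its UTF-16 surrogate pair.
   Neither half is a character on its own, so a per-unit case
   function cannot map it. */
void to_surrogates(uint32_t codepoint, uint16_t *hi, uint16_t *lo) {
    assert(codepoint > 0xFFFFu && codepoint <= 0x10FFFFu);
    uint32_t v = codepoint - 0x10000u;
    *hi = (uint16_t)(0xD800u + (v >> 10));   /* high (lead) surrogate */
    *lo = (uint16_t)(0xDC00u + (v & 0x3FFu)); /* low (trail) surrogate */
}
```

For U+10428, truncation yields 0x0428, an unrelated BMP code point, and the surrogate pair is 0xD801 0xDC28. Either way, a function that sees one 16-bit unit at a time never sees the actual character.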

utf8proc has up-to-date upper/lower/titlecase mapping data already in its "database" (generated from http://www.unicode.org/Public/UNIDATA/UnicodeData.txt), so maybe we should just add a utf8proc_toupper function (etc.) to utf8proc to make this accessible. Then we could call that (probably plus a check for the common case of ASCII codepoints).
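As a rough sketch of the shape such a function could take (an ASCII fast path followed by a table lookup), the following C code uses a tiny hand-written table as a stand-in for utf8proc's generated data. The names `demo_toupper`, `case_pair`, and `demo_table` are hypothetical, and this is not utf8proc's actual implementation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a generated case-mapping table. */
typedef struct {
    int32_t lower;
    int32_t upper;
} case_pair;

static const case_pair demo_table[] = {
    { 0x00E9,  0x00C9  }, /* LATIN SMALL LETTER E WITH ACUTE */
    { 0x03B1,  0x0391  }, /* GREEK SMALL LETTER ALPHA */
    { 0x10428, 0x10400 }, /* DESERET SMALL LETTER LONG I (non-BMP) */
};

/* Locale-independent uppercase mapping on a 32-bit code point. */
int32_t demo_toupper(int32_t c) {
    /* Common case: ASCII needs no table lookup. */
    if (c >= 0 && c < 0x80)
        return (c >= 'a' && c <= 'z') ? c - 32 : c;

    /* Fallback: linear search of the stand-in table. */
    for (size_t i = 0; i < sizeof demo_table / sizeof demo_table[0]; i++) {
        if (demo_table[i].lower == c)
            return demo_table[i].upper;
    }
    return c; /* no mapping known: return the input unchanged */
}
```

Because the argument is a 32-bit integer rather than a `wchar_t`, the result does not depend on the platform's wide-character width or on the current locale.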

The text was updated successfully, but these errors were encountered:

ScottPJones · 2015-05-28T14:49:00Z

Yep... although I wouldn't solve this with utf8proc... I'd intended to tackle this shortly (it's something else we need for our project, and I'd prefer to use Julia as much as possible).

JeffBezanson · 2015-05-28T14:56:16Z

> although I wouldn't solve this with utf8proc

Why not? If utf8proc does this wrong, we should just fix it instead of re-implementing.

ScottPJones · 2015-05-28T14:58:12Z

Because, I believe it can better be implemented in pure Julia (sorry, now I'm a convert!)

jiahao · 2015-05-28T14:59:46Z

Essentially a duplicate of #7848

The Unicode standard defines upper-lower case mappings on a per character basis, but a correct transformation must necessarily take into account the locale. (See the infamous Turkish i - İ, ı - I pairings as the most egregious examples.)
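The Turkish case can be illustrated with a small C sketch that covers only the letters involved. The functions `upper_i` and `lower_i` and the `turkic` flag are hypothetical names for illustration; a real implementation would consult full Unicode data and a proper locale identifier.

```c
#include <stdbool.h>
#include <stdint.h>

/* Uppercase mapping for the two lowercase letters involved.
   'turkic' selects the Turkish/Azerbaijani convention. */
int32_t upper_i(int32_t c, bool turkic) {
    if (c == 0x0069)                      /* i (dotted) */
        return turkic ? 0x0130 : 0x0049;  /* capital I with dot above, or plain I */
    if (c == 0x0131)                      /* dotless i */
        return 0x0049;                    /* plain I under both conventions */
    return c;
}

/* Lowercase mapping for plain capital I only. */
int32_t lower_i(int32_t c, bool turkic) {
    if (c == 0x0049)                      /* I */
        return turkic ? 0x0131 : 0x0069;  /* dotless i, or dotted i */
    return c;
}
```

The same input code point gives two different answers depending on the flag, which is why a single per-character table cannot serve both conventions.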

ScottPJones · 2015-05-28T15:09:18Z

I don't think this really is a duplicate of #7848, and should be reopened.
If you look at ICU (probably the most widely used library for this sort of stuff), you can do either locale dependent or locale independent case mappings...
I think that in Base, Julia should give consistent results on all platforms, using locale independent mappings.
I think fancy locale dependent mappings, as discussed in #7848, belong in a package (I'll have to look, they may already be handled by the ICU.jl package).

jiahao · 2015-05-28T15:12:03Z

Did you read the discussion in #7848? The conclusion was exactly what you said, that we should have a locale independent choice in Base and have the locale specific choices not in base.

ScottPJones · 2015-05-28T15:14:36Z

I missed the last couple of comments (I'm in a meeting right now!) Sorry, my fault totally... The only difference is that I think any locale-dependent mappings should be implemented by extending the Base methods in a package...

ScottPJones · 2015-05-28T15:17:05Z

Also, I think @stevengj's point was something else, and this may still need to be open... it was about characters outside the BMP not being handled on Windows (even in a locale-independent fashion), simply due to the way the functions are implemented...

jiahao · 2015-05-28T15:20:06Z

If it really makes you happier, I'll close #7848 in favor of this issue.

ScottPJones · 2015-05-28T15:31:32Z

Actually, I don't think #7848 needs to be closed either 😀
IMO, this one is about non-BMP case mapping being broken on Windows (good catch, @stevengj),
and #7848 is about the need for both locale-independent mappings (in Base) and locale-dependent mappings (hopefully in a package outside of Base).

stevengj · 2015-05-28T21:22:32Z

Pure Julia code for this has its pluses and minuses. On the minus side, it is more work: this information is already in utf8proc, and adding a couple of functions to expose it to Julia is literally something like 6 lines of C code. Also on the minus side, having it in utf8proc makes it available to non-Julia users. On the plus side, writing code in Julia is more fun, and potentially allows for more optimizations (e.g. inlining).

For my own part, I tend to default to the path of least work.

ScottPJones · 2015-05-28T21:56:06Z

About the minuses... more work, well, maybe, but it's something that I'd like to do when I want to take a break from other stuff 😀 (i.e. the Julia is fun part), plus doing it in Julia helps improve my Julia skills, and finally, I'm not talking at all about removing anything from utf8proc, so utf8proc is still available to non-Julia users (how many are there?)
About the pluses... I've been thinking about ways to make the size of the data structures much smaller,
and to write the code to generate the data structures (probably into a C source code file) in Julia also...
[I posted on julia-users, https://groups.google.com/forum/#!topic/julia-users/CyrLc_E0dac, to try to get some help to learn the best "julian" way of doing that, but nobody responded... :sad:]

stevengj · 2015-05-29T17:28:53Z

Actually, it looks like the towupper functions etcetera take a wint_t, even on Windows, which in my understanding is at least 32 bits wide. In which case the Julia ccall is wrong on Windows.

But I'm confused, because we have test coverage of lowercase etc... if we are passing the wrong type on Windows, how could we not have noticed? Does sizeof(wint_t) == 2 on Windows?
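One way to answer the width question on a given toolchain is to ask the compiler directly. This is a minimal C sketch with hypothetical helper names (`wchar_bits`, `wint_bits`); it reports whatever the local platform defines and does not by itself say what any particular Windows compiler returns.

```c
#include <limits.h>
#include <stddef.h>
#include <wchar.h>

/* Width in bits of the platform's wchar_t. */
size_t wchar_bits(void) {
    return sizeof(wchar_t) * CHAR_BIT;
}

/* Width in bits of the platform's wint_t, the argument type of towupper. */
size_t wint_bits(void) {
    return sizeof(wint_t) * CHAR_BIT;
}
```

Printing both values from a small `main` on each target platform would show whether the type used in the Julia `ccall` matches the C prototype there.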

stevengj · 2015-05-30T03:34:58Z

Sorry, I didn't realize that "(for #....)" would close an issue on GitHub; I thought it had to be "fixes" or "closes" or similar.

stevengj · 2015-05-30T04:14:13Z

By the way, shouldn't ucfirst be using titlecase rather than uppercase for the first letter?
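The difference between the two mappings shows up for digraph characters, which have three distinct case forms. The C sketch below uses a small hand-written table for three Latin digraphs; `case_triple`, `demo_upper`, and `demo_title` are hypothetical names, and the table is a stand-in for full Unicode data.

```c
#include <stddef.h>
#include <stdint.h>

/* Lowercase, uppercase, and titlecase forms of one character. */
typedef struct {
    int32_t lower;
    int32_t upper;
    int32_t title;
} case_triple;

static const case_triple digraphs[] = {
    { 0x01C6, 0x01C4, 0x01C5 }, /* dz with caron: small, capital DZ, capital D + small z */
    { 0x01C9, 0x01C7, 0x01C8 }, /* lj: small, capital LJ, capital L + small j */
    { 0x01CC, 0x01CA, 0x01CB }, /* nj: small, capital NJ, capital N + small j */
};

/* Look up a lowercase digraph; returns NULL if it is not in the table. */
static const case_triple *find_lower(int32_t c) {
    for (size_t i = 0; i < sizeof digraphs / sizeof digraphs[0]; i++) {
        if (digraphs[i].lower == c)
            return &digraphs[i];
    }
    return NULL;
}

/* Uppercase: every letter of the digraph is capitalized. */
int32_t demo_upper(int32_t c) {
    const case_triple *t = find_lower(c);
    if (t) return t->upper;
    return (c >= 'a' && c <= 'z') ? c - 32 : c;
}

/* Titlecase: only the first letter of the digraph is capitalized. */
int32_t demo_title(int32_t c) {
    const case_triple *t = find_lower(c);
    if (t) return t->title;
    return (c >= 'a' && c <= 'z') ? c - 32 : c;
}
```

For most letters the two functions agree, so the distinction only matters for a word that begins with one of these characters.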

ScottPJones · 2015-05-30T06:59:50Z

Yes about using titlecase

ScottPJones · 2015-05-30T07:31:59Z

BTW, could you reopen #7848, which really is separate, and isn't addressed by your nice fix to this issue?

jiahao · 2015-05-30T10:52:59Z

I don't see the need. We already decided that locale-specific transformations should not belong in base, which is the remaining part of that issue.

nalimilan · 2015-06-03T06:22:31Z

@stevengj I'd have preferred we wait for a true utf8proc release before relying on it. Now I need to package this development version to build the nightlies...

tkelman · 2015-06-03T07:09:22Z

There is a tag https://github.com/JuliaLang/utf8proc/releases/tag/1.3-dev1 but it's got a -dev1 marker on it. Should we promote that to 1.3.0?

nalimilan · 2015-06-03T10:05:31Z

Well, not before we sort out JuliaStrings/utf8proc#42. :-)

tkelman · 2015-06-03T10:10:28Z

It would also be friendlier to distro packagers to put this change (#11493) in under a conditional utf8proc version number check.

stevengj added the domain:unicode label (related to Unicode characters and encodings) May 28, 2015

jiahao closed this as completed May 28, 2015

jiahao mentioned this issue May 28, 2015

lowercase(String) doesn't handle case transformation contexts correctly #7848

Closed

jiahao reopened this May 28, 2015

stevengj added a commit to JuliaStrings/utf8proc that referenced this issue May 29, 2015

add toupper/tolower functions (for JuliaLang/julia#11471)

7e53895

stevengj mentioned this issue May 29, 2015

add toupper/tolower functions JuliaStrings/utf8proc#40

Merged

stevengj added a commit to JuliaStrings/utf8proc that referenced this issue May 29, 2015

add toupper/tolower functions (for JuliaLang/julia#11471)

36d4201

stevengj added a commit to JuliaStrings/utf8proc that referenced this issue May 30, 2015

add toupper/tolower functions (for JuliaLang/julia#11471)

a8fb4b1

stevengj closed this as completed in JuliaStrings/utf8proc#40 May 30, 2015

stevengj reopened this May 30, 2015

stevengj added a commit to stevengj/julia that referenced this issue May 30, 2015

switch to utf8proc's portable, up-to-date, upper/lowercase functions (f…

37ec3d6

…ixes JuliaLang#11471)

stevengj mentioned this issue May 30, 2015

switch to utf8proc's upper/lowercase functions #11493

Merged

stevengj added a commit to stevengj/julia that referenced this issue May 30, 2015

switch to utf8proc's portable, up-to-date, upper/lowercase functions (f…

89061b2

…ixes JuliaLang#11471)

stevengj added a commit to stevengj/julia that referenced this issue May 31, 2015

switch to utf8proc's portable, up-to-date, upper/lowercase functions (f…

becc6eb

…ixes JuliaLang#11471)

stevengj added a commit to stevengj/julia that referenced this issue Jun 1, 2015

switch to utf8proc's portable, up-to-date, upper/lowercase functions (f…

2945c84

…ixes JuliaLang#11471)

stevengj closed this as completed in #11493 Jun 1, 2015

ScottPJones mentioned this issue Jun 3, 2015

RFC: Roadmap for improving string support in Julia #11558

Closed

mbauman pushed a commit to mbauman/julia that referenced this issue Jun 6, 2015

switch to utf8proc's portable, up-to-date, upper/lowercase functions (f…

07dbeb0

…ixes JuliaLang#11471)

tkelman pushed a commit to tkelman/julia that referenced this issue Jun 6, 2015

switch to utf8proc's portable, up-to-date, upper/lowercase functions (f…

982ba88

…ixes JuliaLang#11471)

stevengj mentioned this issue May 18, 2017

Make ucfirst a method of uppercase #21910

Open
