Julia doesn't like Pizza #3721

Keno · 2013-07-15T19:37:36Z

julia> x = '\U1f355'
'\U1f355'

julia> charwidth(x)
0


staticfloat · 2013-07-15T19:54:52Z

I nominate this for most interesting issue subject.

In other news, this looks like something to take up with wcwidth; it doesn't even think pizza is printable:

$ cat test.c
#include <wchar.h>
#include <stdio.h>

int main( void ) {
    wchar_t pizza = 0x1f355;
    printf("%d\n", wcwidth(pizza) );
    return 0;
}

$  gcc -o test test.c && ./test
-1

We don't allow negative lengths, so we clip to 0, which seems pretty reasonable to me
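The clipping described above can be sketched in a few lines. This is a minimal illustration, not Julia's actual source; the helper name `clip_width` is hypothetical.

```c
/* Clip a raw wcwidth()-style result to a non-negative column count.
 * wcwidth() returns -1 for code points it does not consider printable;
 * mapping that to 0 columns avoids negative lengths.
 * The name clip_width is an assumption made for this sketch. */
int clip_width(int raw_width) {
    return raw_width < 0 ? 0 : raw_width;
}
```

A caller would pass the libc result through it, for example `clip_width(wcwidth(pizza))`, which yields 0 whenever libc reports -1.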

StefanKarpinski · 2013-07-15T21:28:43Z

This prints just fine on my terminal and looks to have a width of two. Not sure how we should handle this given that the C function is wrong.

jiahao · 2013-11-05T16:37:41Z

Revisiting this, it appears to work on OSX Mavericks 10.9, although it reports a charwidth() of 1, so the ending single quote that wraps the character is overlaid on some rounds of pepperoni.

JeffBezanson · 2013-11-06T03:27:30Z

Unsurprisingly, Apple seems to be the first to update their Unicode tables. I think our only options here are to rely on whatever libc is available, or use our own Unicode tables. Not sure we want to get into all that.
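For readers wondering what "our own Unicode tables" might look like, here is a rough sketch of a range table with a binary search. The struct, the function name `table_charwidth`, and the three table rows are illustrative assumptions; a real table would be generated from Unicode data and cover far more ranges.

```c
#include <stddef.h>
#include <stdint.h>

/* One row of a hypothetical width table: an inclusive code point range
 * and the number of terminal columns assigned to it. */
struct width_range {
    uint32_t first;
    uint32_t last;
    int width;
};

/* Illustrative entries only, sorted by code point. */
static const struct width_range width_table[] = {
    { 0x0300,  0x036F,  0 },  /* combining diacritical marks */
    { 0x4E00,  0x9FFF,  2 },  /* CJK unified ideographs */
    { 0x1F355, 0x1F355, 2 },  /* SLICE OF PIZZA, width 2 as seen in a terminal above */
};

/* Binary search the table; code points not listed default to width 1. */
int table_charwidth(uint32_t cp) {
    size_t lo = 0;
    size_t hi = sizeof width_table / sizeof width_table[0];
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < width_table[mid].first) {
            hi = mid;
        } else if (cp > width_table[mid].last) {
            lo = mid + 1;
        } else {
            return width_table[mid].width;
        }
    }
    return 1;
}
```

The trade-off is the one raised in the comment: the answer no longer depends on the system libc, but someone has to keep the table current with each Unicode release.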

jiahao · 2013-11-06T03:46:31Z

Is it possible to test some characters during build time and emit runtime warnings if charwidths are known to be wrong?
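One way such a check could look is to probe a handful of code points with known widths and count the disagreements. Everything here is hypothetical: the probe list, the expected widths, and the stand-in `ascii_only_width` function that imitates an outdated libc.

```c
#include <stddef.h>
#include <stdint.h>

/* A code point paired with the width this sketch assumes is correct. */
struct width_probe {
    uint32_t codepoint;
    int expected;
};

/* Hypothetical probe list; the expected values are assumptions. */
static const struct width_probe probes[] = {
    { 0x0041,  1 },  /* LATIN CAPITAL LETTER A */
    { 0x4E2D,  2 },  /* a CJK ideograph */
    { 0x1F355, 2 },  /* SLICE OF PIZZA */
};

/* Count how many probes the given width function gets wrong. */
size_t count_width_mismatches(int (*width_fn)(uint32_t)) {
    size_t mismatches = 0;
    for (size_t i = 0; i < sizeof probes / sizeof probes[0]; i++) {
        if (width_fn(probes[i].codepoint) != probes[i].expected) {
            mismatches++;
        }
    }
    return mismatches;
}

/* Stand-in for an outdated libc: ASCII is width 1, everything else is -1. */
int ascii_only_width(uint32_t cp) {
    return cp < 0x80 ? 1 : -1;
}
```

A nonzero result from `count_width_mismatches` would be the trigger for the warning; with the stand-in above, the two non-ASCII probes are the ones that disagree.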

StefanKarpinski · 2013-11-06T06:21:20Z

Eh, what's the point? That's just a warning that people are going to ignore or re-open this issue. If the system libc has the wrong character width for something, then you get mangled garbage. Get a better OS.

timholy · 2013-11-06T12:07:37Z

The fact that this even exists as a character is a clear sign that unicode allows too many bits :-).

quinnj · 2014-06-23T12:57:28Z

What's the status here? @Keno?

JeffBezanson · 2014-06-23T13:45:31Z

Looks like somebody submitted a patch to glibc yesterday to update their unicode data:
https://sourceware.org/ml/libc-alpha/2014-06/msg00585.html

quinnj · 2014-08-21T14:01:16Z

Did the bump from libutf8proc to libmojibake solve this? #7917.

jakebolewski · 2014-08-21T14:17:40Z

I don't think the char width problem has been solved yet. @jiahao went through all the codepoints and computed the correct widths, but this information has not made it into libmojibake.

quinnj · 2014-08-29T14:16:58Z

ping @jiahao

jiahao · 2014-08-29T15:01:27Z

The last time I discussed this with @stevengj, we weren't entirely settled on whether the new charwidth function should be submitted as an entirely new function to JuliaStrings/utf8proc#2 or as a patch to Julia's existing charwidth.

A correct charwidth would be useful to projects other than Julia, but it would mean significant new functionality to libmojibake and it would cease to be just a lightly updated fork of a minimal Unicode handling library.

JeffBezanson · 2014-08-29T16:42:00Z

I really think it makes sense for it to be in libmojibake. It's in line with the other functionality in there, and won't bloat the library by a large percentage.

stevengj · 2014-08-29T20:24:15Z

I don't actually care much either way, but I'm fine with putting it in libmojibake.

stevengj · 2014-11-21T15:22:11Z

The advantage of putting it in Julia (replacing src/support/wcwidth.c) is that it will still work if someone is using the system utf8proc (which seems not unlikely on e.g. Fedora, especially if the Unicode-7 support in libmojibake gets folded upstream).

JeffBezanson · 2014-11-21T16:15:35Z

Ok, then let's drop it in as our wcwidth.c to fix this issue, and possibly move it to libmojibake later.


stevengj · 2015-03-12T20:10:47Z

utf8proc now includes an up-to-date utf8proc_charwidth function based on @jiahao's analysis (JuliaStrings/utf8proc#27), so we can fix this issue by upgrading to the latest utf8proc and using this function instead of wcwidth. (utf8proc's charwidth for U+1f355 is 2.)

tkelman · 2015-03-12T22:52:19Z

We should probably turn the name back to utf8proc here. I'd also slightly prefer going back to using a tarball for it (once 1.2.0 is tagged) rather than a submodule.

nalimilan · 2015-03-13T13:35:47Z

+1

stevengj · 2015-03-13T13:54:56Z

I slightly prefer submodules, since the tarballs tend to leave a bunch of old versions littering the deps/ directory when the version is upgraded. Submodules are also somewhat more flexible, since we can link a pre-release version if there is an urgent need (e.g. a bugfix).

tkelman · 2015-03-14T00:33:37Z

Submodules tend to confuse newcomers when versions are upgraded, introducing confusing diffs after they git pull when we change the submodule. It's also a bit messier for packagers who want to use system versions. It should be possible to set UTF8PROC_VERSION to a non-release sha for testing, and github should just make the right tarball for us. Either way though.


gnachman · 2015-04-01T05:36:57Z

@stevengj iTerm2 does need to interoperate with anything you can ssh, telnet, etc. to, not to mention Julia, so I'm open to giving users a way to opt in to a more sensible wcwidth(). I don't use wcwidth() on the client so I could use utf8proc_charwidth in the right circumstances. Since AFAIK only Julia departs from the standard, there'd need to be a new escape sequence to tell the terminal emulator to switch character-width lookup tables.

OTOH, since Julia is the black sheep in this regard, it probably makes the most sense for Julia to print a space after characters that it treats as wide but wcwidth does not. And deal with cursor movement across them correctly, etc. That'll work with every terminal out there. If a window gets resized it won't wrap correctly, though. Terminal and iTerm2 will both refuse to "break" a fullwidth character into two half-width pieces, choosing instead to move the whole thing to the start of the next line, but that's a small price to pay.
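The padding idea above amounts to emitting the difference between the two width estimates as spaces. A minimal sketch, assuming both widths are already non-negative; the function name `padding_after` is hypothetical.

```c
/* Number of spaces to print after a character so the cursor lands where
 * the application expects, given the width the terminal will advance by.
 * Assumes both widths are non-negative column counts.
 * padding_after is a name invented for this sketch. */
int padding_after(int app_width, int terminal_width) {
    int pad = app_width - terminal_width;
    return pad > 0 ? pad : 0;
}
```

For the pizza case, an application width of 2 against a terminal width of 1 gives one space of padding; when the two agree, nothing is emitted.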

Keno · 2015-04-01T05:58:21Z

@gnachman If I print a space next to a character, is there a chance the space will get drawn on top of it during a redraw? I think I've seen that behavior in my experiments.

gnachman · 2015-04-01T06:18:13Z

@Keno Yes, that can happen. I'm working on a fix to that issue in my refactor_drawing branch. Feel free to try it if you're feeling brave :). I expect to merge it into master in a week or two. Terminal.app doesn't have that issue, so that approach is safe to use.

JeffBezanson · 2015-04-01T17:00:56Z

It's not our preference to depart from standards here. Just looking at that glyph, clearly somebody thinks it is double-width. As @stevengj said there is no clear standard.

stevengj · 2015-04-01T18:42:06Z

wcwidth is only "standard" in the sense that it is used by many programs; it is not consistent even between MacOS versions, much less across operating systems, and is invariably out of date.

Note that UAX#11 provides a clear standard for a subset of Unicode, and wcwidth as of MacOS 10.10.2 does not conform to Unicode 7 in the sense that it reports -1 (not printable / not recognized) for many of the characters listed in UAX#11 as having width 1 or 2 (narrow/wide).

jiahao · 2015-04-02T04:16:53Z

@gnachman The rationale and details of the analyses used to justify Julia's implementation are explained in JuliaStrings/utf8proc#2 and JuliaStrings/utf8proc#27 and in this notebook, which amongst other things details the exact discrepancies between my system wcwidth and the analysis. In all the cases I examined, I could not find a reason to justify the system answer over the analysis outlined in the issues and notebook.

As @JeffBezanson and @stevengj have already stated, there is no standard governing character widths, and so it is not possible to characterize Julia as "departing from the standard". On the contrary, it appears that not enough thought has gone into any other implementation for the purpose of determining character widths.

To illustrate our reasoning, consider the pizza character U+1F355. The relevant entry in EastAsianWidth.txt is:

1F330..1F37D;N # So [78] CHESTNUT..FORK AND KNIFE WITH PLATE

which assigns it the "neutral" category (not "narrow", which is coded as "Na"). Thus it falls into the nebulous category where UAX 11 has essentially nothing to say because the character doesn't exist in legacy East Asian encodings. (UAX 11 even says in its Scope not to consider it an authoritative source on character widths, but rather that "The East_Asian_Width is an informative property... the guidelines on use of this property should be considered recommendations based on a particular legacy practice that may be overridden by implementations as necessary.")
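To make the data line quoted above concrete, here is a small sketch that parses one EastAsianWidth.txt entry into a range and a property code. The struct and the function name `parse_eaw_line` are assumptions of this example, and it only handles the `first..last;property` and `first;property` shapes (with optional spaces around the semicolon).

```c
#include <stdio.h>

/* Parsed form of one data line from EastAsianWidth.txt. */
struct eaw_entry {
    unsigned long first;   /* first code point in the range */
    unsigned long last;    /* last code point (equals first for single entries) */
    char property[3];      /* one of A, F, H, N, Na, W */
};

/* Parse "XXXX..YYYY;P" or "XXXX;P"; returns 1 on success, 0 otherwise.
 * Comment lines and blank lines fail to match and return 0. */
int parse_eaw_line(const char *line, struct eaw_entry *out) {
    if (sscanf(line, "%lx..%lx ; %2[AFHNWa]",
               &out->first, &out->last, out->property) == 3) {
        return 1;
    }
    if (sscanf(line, "%lx ; %2[AFHNWa]", &out->first, out->property) == 2) {
        out->last = out->first;
        return 1;
    }
    return 0;
}
```

Fed the line quoted above, this yields the range 1F330..1F37D with property "N"; U+1F355 lies inside that range, which spans the 78 code points noted in the entry's comment.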

In the absence of a clear standard, the best I could come up with is to look at a font that actually bothered to provide a glyph for that code point, hence settling on Unifont, which provides this glyph:

Note that the character width assigned by inspecting the advance width from Unifont agrees with the eyeball comparison of the reference glyph in the Unicode character charts (pdf).

Superimposed for reference is a square box. I do not see any reason why this should be 'narrow' instead of 'fullwidth'.

gnachman · 2015-04-02T06:30:36Z

@jiahao, I wasn't criticizing your work. The informal agreement between client and server, which as you note is underspecified, is what is rickety. Your work is really valuable--I wish it (or something like it) were widely adopted.

I had believed that EastAsianWidth.txt was "the standard", but I'm persuaded that there isn't really one at all. AFAIK most apps treat N as narrow, but it leads to the problems described in this bug.

timholy · 2015-04-02T16:16:50Z

It sounds like this should be reported upstream/more widely, if it hasn't been already.

stevengj · 2015-04-02T17:27:28Z

Unfortunately, it seems like the only upstream that can really fix this is libc, in order to fix wcwidth. I don't know where to file this kind of low-level bug report with Apple (??), and Microsoft is hopeless because of their wchar_t size, but it would be worthwhile for someone to check utf8proc against the latest GNU libc and file a bug report for discrepancies where libc is clearly wrong.

jiahao · 2015-04-02T17:32:38Z

glibc#4335

vtjnash · 2016-03-14T20:26:55Z

Julia has the right (most updated) char widths, so it's up to the user to demand that their terminal emulators display them properly. Most likely, that'll happen gradually, with various companies (#7267) lagging more or less behind the standards committee.

Keno · 2016-06-24T04:30:33Z

Britain may be leaving the EU, but Unicode 9 came out and fixed this issue for us, so overall it's a pretty good day:

Keno · 2016-06-24T04:36:04Z

iTerm2 PR: gnachman/iTerm2#294

JeffBezanson mentioned this issue Jan 31, 2014

RFC: export utf8proc Unicode transformation functionality in Julia #5576

Merged

timholy mentioned this issue Mar 3, 2014

drop dimensions indexed with a scalar? #5949

Closed

stevengj referenced this issue Jan 4, 2015

latex symbols: fix overbrace and underbrace code points, and update unicode tables in doc

dc247a8

stevengj mentioned this issue Mar 28, 2015

update utf8proc to 1.2 #10654

Closed

stevengj added a commit to stevengj/julia that referenced this issue Mar 28, 2015

replace wcwidth by utf8proc_charwidth (fixes JuliaLang#3721, closes JuliaLang#6939)

40d5719

stevengj added a commit to stevengj/julia that referenced this issue Mar 28, 2015

replace wcwidth by utf8proc_charwidth (fixes JuliaLang#3721, closes JuliaLang#6939)

7d908f9

stevengj mentioned this issue Mar 28, 2015

update utf8proc, replace wcwidth #10659

Merged

stevengj added a commit to stevengj/julia that referenced this issue Mar 29, 2015

replace wcwidth by utf8proc_charwidth (fixes JuliaLang#3721, closes JuliaLang#6939)

6bf5e61

jiahao mentioned this issue Apr 2, 2015

Emoji REPL completion #10709

Merged

stevengj mentioned this issue Apr 20, 2015

erroneous use of charwidth in rpad? #10825

Closed

stevengj mentioned this issue Aug 10, 2015

julia-mode: incorrect indent with unicode #12527

Closed

ridiculousfish mentioned this issue Oct 2, 2015

Cursor positioning is wrong when "Treat ambiguous-width chars as double width" is checked fish-shell/fish-shell#2451

Closed

JeffBezanson added the domain:unicode Related to unicode characters and encodings label Jan 28, 2016

vtjnash closed this as completed Mar 14, 2016

vtjnash added the status:won't change Indicates that work won't continue on an issue or pull request label Mar 14, 2016

Keno removed the status:won't change Indicates that work won't continue on an issue or pull request label Jun 24, 2016

stevengj mentioned this issue Aug 30, 2016

set east asian neutral width to 1 JuliaStrings/utf8proc#83

Closed

stevengj mentioned this issue Jan 4, 2017

Compact printing of Array{Char}s? #19845

Closed

fornwall mentioned this issue Apr 6, 2017

Take (xft) font(s) into consideration? termux/wcwidth#1

Closed

stevengj mentioned this issue Sep 22, 2017

Bowtie symbol isn't a valid character. #23820

Open

stevengj mentioned this issue May 3, 2018

Case conversion test fails on Alpine Linux JuliaStrings/utf8proc#127

Closed

stevengj mentioned this issue Jul 30, 2018

clearer parse errors for invalid characters #28339

Closed

yurivish mentioned this issue Nov 25, 2020

Julia doesn't like Potatoes JuliaWeb/HTTP.jl#626

Closed

Keno added the 😃🍕 and other emoji label Mar 17, 2021

stevengj mentioned this issue Aug 4, 2024

add rtruncate, ltruncate, ctruncate for truncating strings #55351

Merged
