WIP: charwidth function #27

stevengj · 2015-03-08T21:32:05Z

This PR fixes #2 by following the algorithm suggested by @jiahao: it uses character widths from GNU unifont, plus a few exceptions from UAX 11 and some control characters.

One difference from @jiahao's script is that I divide unifont advance widths by 512 before incorporating the UAX 11 data, whereas @jiahao's notebook did the division afterwards, which seemed like a bug(??).

Also, I'm not quite sure what we should do about non-printable characters. Should we return -1 for non-printable characters like wcwidth? If so, what algorithm for "printable" should we use? @jiahao's notebook had one algorithm, but Julia's isprint function subsequently adopted a different algorithm. Right now, I am returning 0; my feeling is that if the caller wants to know some more specific notion of "printable," they can always inspect the unicode category code provided by utf8proc.

stevengj · 2015-03-08T21:37:01Z

data/charwidths.jl

+        elseif state==:readdata #Encoding: 65538 -1 2, Width: 1024
+            contains(line, "Encoding:") && (codepoint = int(split(line)[3]))
+            contains(line, "Width:") && (width = int(split(line)[2]))
+            if codepoint!=nothing && width!=nothing && codepoint >= 0


@jiahao, I added the codepoint>=0 check here since the sfd file seems to have some codepoint -1 entries in the beginning.

nalimilan · 2015-03-08T21:45:00Z

Maybe return 0 rather than -1 for non-printable characters? That way, people won't subtract 1 to the total width of a string without noticing if it accidentally contains such a character. Ideally they should check and remove them, but it is generally easier to compute the width without modifying the string, and skip non-printable chars only in the code that actually does the printing.

stevengj · 2015-03-08T23:17:51Z

@nalimilan, currently I'm returning 0 for non-printables. The main argument for returning -1 would be to make it a drop-in replacement for wcwidth. Of course, we could just add a utf8proc_wcwidth that did this, but we still have to decide what "printable" means. (It would also be good to understand the original rationale for this behavior of wcwidth.)

stevengj · 2015-03-08T23:18:29Z

I just noticed the Python-based wcwidth repo on github; it would be good to compare to their behavior.

stevengj · 2015-03-09T00:12:42Z

@jiahao, I noticed that your algorithm seems to assign a nonzero width to category Mn (non-spacing combining characters). e.g. U+0300 (combining grave accent) is assigned a width of 1. That seems weird to me...shouldn't we assign these a width of zero, since in practice their width is subsumed by the width of the character they are combined with?

stevengj · 2015-03-09T01:00:30Z

@staticfloat, Travis seems to be failing occasionally because the Unifont downloads are stalling. Is there some problem with large (few MB) downloads in Travis? If so, how do the julia Travis builds work?

staticfloat · 2015-03-09T01:21:24Z

We've got some caching infrastructure that we use to make sure we get consistent results with as high uptime as possible. The basic formula is to change your request URLs to something like https://cache.e.ip.saba.us/http://www.unicode.org/Public/UNIDATA/UnicodeData.txt. The caching server is fairly picky about what it will cache, so I could add this to the whitelist and allow you to query the caching server for those files.

stevengj · 2015-03-09T02:43:44Z

@staticfloat, it would be great if you added the unifont .ttf files we are using to the filter (with wildcards so that we can change versions without problems), since those seem to be the problem cases.

staticfloat · 2015-03-09T04:49:55Z

Alright, it should be live now. Try changing the URLs and let me know.

stevengj · 2015-03-09T12:32:16Z

Thanks @staticfloat, ~~it seems to work on my machine; hopefully Travis will succeed now.~~ Oops, no, the file seems to be truncated to 502 bytes (no wonder it downloaded so quickly). Is there something I need to do differently?

stevengj · 2015-03-09T17:27:09Z

(Another weird thing: why does the Travis badge for the master branch say "build:failing" when it is only an unmerged PR that has failed?)

stevengj · 2015-03-09T17:36:30Z

Ah, it looks like I need to pass --location to curl so that it follows redirects.

tkelman · 2015-03-09T23:25:44Z

(Another weird thing: why does the Travis badge for the master branch say "build:failing" when it is only an unmerged PR that has failed?)

I think the markdown for the badge link is currently set to "most recent build for project," it needs some extra arguments to be specifically for master

stevengj · 2015-03-11T17:03:43Z

@jiahao, it looks like the problem is the UAX11 code. That assigns a width "1" to category "A" (ambiguous), which includes the Mn non-spacing combining marks in EastAsianWidth.txt. Like I mentioned above, I'm not sure how your UAX11 code worked at all in your notebook, because you divided the widths by 512 after the UAX11 code, which would have set all of those widths to zero. Maybe I just shouldn't check UAX11 at all, or should check that before Unifont (so that Unifont can override UAX11)?

stevengj · 2015-03-11T18:53:52Z

Okay, just pushed another update. Now Unifont takes precedence over UAX11, and I added a check. Unifont includes PUA glyphs from the ConScript registry (e.g. Klingon, Elvish, etcetera), but I'm skeptical that we should return nonzero widths for those as they aren't part of the Unicode standard, so I changed the algorithm to return zero for these (category Co); we can always add them back later.

It now agrees with the MacOS 10.10.2 wcwidth function in all cases where wcwidth returns a nonnegative result. (wcwidth returns -1 for many codepoints that we recognize as printable, however.)

stevengj · 2015-03-11T20:21:41Z

@staticfloat, I'm still getting Travis timeouts on the downloads. Am I using the caching server incorrectly somehow?

stevengj · 2015-03-12T02:56:02Z

data/charwidths.jl

+#0x002003 ' ' category: Zs name: EM SPACE/
+CharWidths[0x2001]=2
+CharWidths[0x2003]=2
+


@jiahao, in a monospaced font, like in a terminal, aren't em and en the same?

jiahao · 2015-03-12T04:03:45Z

@stevengj thanks for taking the lead on actually making this happen.

Dividing by 512: probably a silly bug. I may have tinkered with the notebook since the version I posted.

Category Mn having width 1: I think you're right, they should have width 0. The oversight did occur to me at some point but I don't think I updated the gist since I realized the mistake.

The 'ambiguous' category was difficult for me to figure out. Some characters like ❶ (U+02776 DINGBAT NEGATIVE CIRCLED DIGIT ONE) I have only ever seen in printed Chinese literature, where it is typeset as fullwidth. Others like ¸ (U+00B8 CEDILLA, a spacing character not to be confused with ̧ U+0327 COMBINING CEDILLA) seem silly to treat as ambiguous. I guess if ❶ were to ever show up in non-East Asian text the rule is to assign it as narrow? UAX 11 §4.2 Ambiguous Characters is really, well, ambiguous. Or one can simply adopt the recommendation of §5 which suggests either a) always take Unifont's size (as the canonical monospaced font) as correct, or b) pretending that they are always narrow, are both acceptable:

When processing or displaying data... Ambiguous characters behave like wide or narrow characters depending on the context (language tag, script identification, associated font, source of data, or explicit markup; all can provide the context). If the context cannot be established reliably, they should be treated as narrow characters by default.

em vs en: The 'em' is defined to be equal to the current point size and the 'en' is defined to be half an 'em'. Given that the bounding box for Unifont was 1024 x 512 for most of the ordinary narrow characters in Fontforge, I assigned anything defined to be 'en'-sized as width 1 and anything 'em'-sized as width 2.

staticfloat · 2015-03-12T07:46:56Z

@stevengj I think I forgot to enable a certain fix on the caching server, it's been updated now, and I've re-run the jobs which seemed to work better.

stevengj · 2015-03-12T12:15:22Z

Thanks! @jiahao, ~~can you take a look at the patch? There are some changes from your algorithm.~~ Thanks for your comments above.

stevengj · 2015-03-12T15:49:46Z

@jiahao, Unifont gives width=1 for four "W" codepoints U+2329, U+232a, U+fe33, U+fe34 and one "F" codepoint U+ff3d according to EastAsianWidth.txt. I guess we want UAX11 to "win" in these cases, right?

(Unifont gives width 1 for all codepoints listed as narrow or halfwidth, so there is no conflict there.)

I think I'll let Unifont "win" for the ambiguous and neutral cases.

jiahao · 2015-03-12T16:26:51Z

Yes I recall seeing some discrepancies in the Unifont W characters having width 1. I don't remember the details but I think the Unifont bounding boxes were quite clearly erroneous. (Some of them are clearly fullwidth Chinese characters.)

jiahao · 2015-03-12T16:29:18Z

Letting Unifont decide neutral and ambiguous seems better to me too, especially since anyone who actually wants monospaced Unicode will have little choice other than Unifont in practice.

WIP: charwidth function

stevengj force-pushed the charwidth branch from 63ba255 to 9896839 Compare March 8, 2015 21:33

stevengj reviewed Mar 8, 2015
View reviewed changes

stevengj force-pushed the charwidth branch from 9896839 to 1d07802 Compare March 8, 2015 23:15

stevengj force-pushed the charwidth branch from 1d07802 to 2a40628 Compare March 9, 2015 00:05

stevengj force-pushed the charwidth branch from 2a40628 to 1167448 Compare March 9, 2015 12:31

stevengj force-pushed the charwidth branch from 1167448 to f6d46cf Compare March 9, 2015 21:15

stevengj force-pushed the charwidth branch from f6d46cf to b429b69 Compare March 11, 2015 18:48

stevengj mentioned this pull request Mar 11, 2015

wrong width for U+00AD jquast/wcwidth#8

Open

stevengj reviewed Mar 12, 2015
View reviewed changes

stevengj force-pushed the charwidth branch from b429b69 to bf58768 Compare March 12, 2015 14:40

fix #2: add charwidth function

a4c84d2

stevengj force-pushed the charwidth branch from bf58768 to a4c84d2 Compare March 12, 2015 16:10

stevengj added a commit that referenced this pull request Mar 12, 2015

Merge pull request #27 from JuliaLang/charwidth

128c04e

WIP: charwidth function

stevengj merged commit 128c04e into master Mar 12, 2015

stevengj mentioned this pull request Mar 12, 2015

Julia doesn't like Pizza JuliaLang/julia#3721

Closed

stevengj deleted the charwidth branch June 27, 2015 14:07

stevengj mentioned this pull request Aug 30, 2016

set east asian neutral width to 1 #83

Closed

stevengj mentioned this pull request May 4, 2018

charwidth=1 for soft hyphen and unassigned codepoints #135

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP: charwidth function #27

WIP: charwidth function #27

stevengj commented Mar 8, 2015

stevengj Mar 8, 2015

nalimilan commented Mar 8, 2015

stevengj commented Mar 8, 2015

stevengj commented Mar 8, 2015

stevengj commented Mar 9, 2015

stevengj commented Mar 9, 2015

staticfloat commented Mar 9, 2015

stevengj commented Mar 9, 2015

staticfloat commented Mar 9, 2015

stevengj commented Mar 9, 2015

stevengj commented Mar 9, 2015

stevengj commented Mar 9, 2015

tkelman commented Mar 9, 2015

stevengj commented Mar 11, 2015

stevengj commented Mar 11, 2015

stevengj commented Mar 11, 2015

stevengj Mar 12, 2015

jiahao commented Mar 12, 2015

staticfloat commented Mar 12, 2015

stevengj commented Mar 12, 2015

stevengj commented Mar 12, 2015

jiahao commented Mar 12, 2015

jiahao commented Mar 12, 2015

WIP: charwidth function #27

WIP: charwidth function #27

Conversation

stevengj commented Mar 8, 2015

stevengj Mar 8, 2015

Choose a reason for hiding this comment

nalimilan commented Mar 8, 2015

stevengj commented Mar 8, 2015

stevengj commented Mar 8, 2015

stevengj commented Mar 9, 2015

stevengj commented Mar 9, 2015

staticfloat commented Mar 9, 2015

stevengj commented Mar 9, 2015

staticfloat commented Mar 9, 2015

stevengj commented Mar 9, 2015

stevengj commented Mar 9, 2015

stevengj commented Mar 9, 2015

tkelman commented Mar 9, 2015

stevengj commented Mar 11, 2015

stevengj commented Mar 11, 2015

stevengj commented Mar 11, 2015

stevengj Mar 12, 2015

Choose a reason for hiding this comment

jiahao commented Mar 12, 2015

staticfloat commented Mar 12, 2015

stevengj commented Mar 12, 2015

stevengj commented Mar 12, 2015

jiahao commented Mar 12, 2015

jiahao commented Mar 12, 2015