Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: charwidth function #27

Merged
merged 1 commit into from
Mar 12, 2015
Merged

WIP: charwidth function #27

merged 1 commit into from
Mar 12, 2015

Conversation

stevengj
Copy link
Member

@stevengj stevengj commented Mar 8, 2015

This PR fixes #2 by following the algorithm suggested by @jiahao: it uses character widths from GNU unifont, plus a few exceptions from UAX 11 and some control characters.

One difference from @jiahao's script is that I divide unifont advance widths by 512 before incorporating the UAX 11 data, whereas @jiahao's notebook did the division afterwards, which seemed like a bug(??).

Also, I'm not quite sure what we should do about non-printable characters. Should we return -1 for non-printable characters like wcwidth? If so, what algorithm for "printable" should we use? @jiahao's notebook had one algorithm, but Julia's isprint function subsequently adopted a different algorithm. Right now, I am returning 0; my feeling is that if the caller wants to know some more specific notion of "printable," they can always inspect the unicode category code provided by utf8proc.

elseif state==:readdata #Encoding: 65538 -1 2, Width: 1024
contains(line, "Encoding:") && (codepoint = int(split(line)[3]))
contains(line, "Width:") && (width = int(split(line)[2]))
if codepoint!=nothing && width!=nothing && codepoint >= 0
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jiahao, I added the codepoint>=0 check here since the sfd file seems to have some codepoint -1 entries in the beginning.

@nalimilan
Copy link
Member

Maybe return 0 rather than -1 for non-printable characters? That way, people won't subtract 1 to the total width of a string without noticing if it accidentally contains such a character. Ideally they should check and remove them, but it is generally easier to compute the width without modifying the string, and skip non-printable chars only in the code that actually does the printing.

@stevengj
Copy link
Member Author

stevengj commented Mar 8, 2015

@nalimilan, currently I'm returning 0 for non-printables. The main argument for returning -1 would be to make it a drop-in replacement for wcwidth. Of course, we could just add a utf8proc_wcwidth that did this, but we still have to decide what "printable" means. (It would also be good to understand the original rationale for this behavior of wcwidth.)

@stevengj
Copy link
Member Author

stevengj commented Mar 8, 2015

I just noticed the Python-based wcwidth repo on github; it would be good to compare to their behavior.

@stevengj
Copy link
Member Author

stevengj commented Mar 9, 2015

@jiahao, I noticed that your algorithm seems to assign a nonzero width to category Mn (non-spacing combining characters). e.g. U+0300 (combining grave accent) is assigned a width of 1. That seems weird to me...shouldn't we assign these a width of zero, since in practice their width is subsumed by the width of the character they are combined with?

@stevengj
Copy link
Member Author

stevengj commented Mar 9, 2015

@staticfloat, Travis seems to be failing occasionally because the Unifont downloads are stalling. Is there some problem with large (few MB) downloads in Travis? If so, how do the julia Travis builds work?

@staticfloat
Copy link
Contributor

We've got some caching infrastructure that we use to make sure we get consistent results with as high uptime as possible. The basic formula is to change your request URLs to something like https://cache.e.ip.saba.us/http://www.unicode.org/Public/UNIDATA/UnicodeData.txt. The caching server is fairly picky about what it will cache, so I could add this to the whitelist and allow you to query the caching server for those files.

@stevengj
Copy link
Member Author

stevengj commented Mar 9, 2015

@staticfloat, it would be great if you added the unifont .ttf files we are using to the filter (with wildcards so that we can change versions without problems), since those seem to be the problem cases.

@staticfloat
Copy link
Contributor

Alright, it should be live now. Try changing the URLs and let me know.

@stevengj
Copy link
Member Author

stevengj commented Mar 9, 2015

Thanks @staticfloat, it seems to work on my machine; hopefully Travis will succeed now. Oops, no, the file seems to be truncated to 502 bytes (no wonder it downloaded so quickly). Is there something I need to do differently?

@stevengj
Copy link
Member Author

stevengj commented Mar 9, 2015

(Another weird thing: why does the Travis badge for the master branch say "build:failing" when it is only an unmerged PR that has failed?)

@stevengj
Copy link
Member Author

stevengj commented Mar 9, 2015

Ah, it looks like I need to pass --location to curl so that it follows redirects.

@tkelman
Copy link
Contributor

tkelman commented Mar 9, 2015

(Another weird thing: why does the Travis badge for the master branch say "build:failing" when it is only an unmerged PR that has failed?)

I think the markdown for the badge link is currently set to "most recent build for project," it needs some extra arguments to be specifically for master

@stevengj
Copy link
Member Author

@jiahao, it looks like the problem is the UAX11 code. That assigns a width "1" to category "A" (ambiguous), which includes the Mn non-spacing combining marks in EastAsianWidth.txt. Like I mentioned above, I'm not sure how your UAX11 code worked at all in your notebook, because you divided the widths by 512 after the UAX11 code, which would have set all of those widths to zero. Maybe I just shouldn't check UAX11 at all, or should check that before Unifont (so that Unifont can override UAX11)?

@stevengj
Copy link
Member Author

Okay, just pushed another update. Now Unifont takes precedence over UAX11, and I added a check. Unifont includes PUA glyphs from the ConScript registry (e.g. Klingon, Elvish, etcetera), but I'm skeptical that we should return nonzero widths for those as they aren't part of the Unicode standard, so I changed the algorithm to return zero for these (category Co); we can always add them back later.

It now agrees with the MacOS 10.10.2 wcwidth function in all cases where wcwidth returns a nonnegative result. (wcwidth returns -1 for many codepoints that we recognize as printable, however.)

@stevengj
Copy link
Member Author

@staticfloat, I'm still getting Travis timeouts on the downloads. Am I using the caching server incorrectly somehow?

#0x002003 ' ' category: Zs name: EM SPACE/
CharWidths[0x2001]=2
CharWidths[0x2003]=2

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jiahao, in a monospaced font, like in a terminal, aren't em and en the same?

@jiahao
Copy link
Collaborator

jiahao commented Mar 12, 2015

@stevengj thanks for taking the lead on actually making this happen.

Dividing by 512: probably a silly bug. I may have tinkered with the notebook since the version I posted.

Category Mn having width 1: I think you're right, they should have width 0. The oversight did occur to me at some point but I don't think I updated the gist since I realized the mistake.

The 'ambiguous' category was difficult for me to figure out. Some characters like ❶ (U+02776 DINGBAT NEGATIVE CIRCLED DIGIT ONE) I have only ever seen in printed Chinese literature, where it is typeset as fullwidth. Others like ¸ (U+00B8 CEDILLA, a spacing character not to be confused with ̧ U+0327 COMBINING CEDILLA) seem silly to treat as ambiguous. I guess if ❶ were to ever show up in non-East Asian text the rule is to assign it as narrow? UAX 11 §4.2 Ambiguous Characters is really, well, ambiguous. Or one can simply adopt the recommendation of §5 which suggests either a) always take Unifont's size (as the canonical monospaced font) as correct, or b) pretending that they are always narrow, are both acceptable:

When processing or displaying data... Ambiguous characters behave like wide or narrow characters depending on the context (language tag, script identification, associated font, source of data, or explicit markup; all can provide the context). If the context cannot be established reliably, they should be treated as narrow characters by default.

em vs en: The 'em' is defined to be equal to the current point size and the 'en' is defined to be half an 'em'. Given that the bounding box for Unifont was 1024 x 512 for most of the ordinary narrow characters in Fontforge, I assigned anything defined to be 'en'-sized as width 1 and anything 'em'-sized as width 2.

@staticfloat
Copy link
Contributor

@stevengj I think I forgot to enable a certain fix on the caching server, it's been updated now, and I've re-run the jobs which seemed to work better.

@stevengj
Copy link
Member Author

Thanks! @jiahao, can you take a look at the patch? There are some changes from your algorithm. Thanks for your comments above.

@stevengj
Copy link
Member Author

@jiahao, Unifont gives width=1 for four "W" codepoints U+2329, U+232a, U+fe33, U+fe34 and one "F" codepoint U+ff3d according to EastAsianWidth.txt. I guess we want UAX11 to "win" in these cases, right?

(Unifont gives width 1 for all codepoints listed as narrow or halfwidth, so there is no conflict there.)

I think I'll let Unifont "win" for the ambiguous and neutral cases.

@jiahao
Copy link
Collaborator

jiahao commented Mar 12, 2015

Yes I recall seeing some discrepancies in the Unifont W characters having width 1. I don't remember the details but I think the Unifont bounding boxes were quite clearly erroneous. (Some of them are clearly fullwidth Chinese characters.)

@jiahao
Copy link
Collaborator

jiahao commented Mar 12, 2015

Letting Unifont decide neutral and ambiguous seems better to me too, especially since anyone who actually wants monospaced Unicode will have little choice other than Unifont in practice.

stevengj added a commit that referenced this pull request Mar 12, 2015
@stevengj stevengj merged commit 128c04e into master Mar 12, 2015
@stevengj stevengj deleted the charwidth branch June 27, 2015 14:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add charwidth property
5 participants