Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle Unicode font filenames correctly/Fix crashing MacOSX backend #2433

Merged
merged 7 commits into from Sep 24, 2013

Conversation

mdboom
Copy link
Member

@mdboom mdboom commented Sep 18, 2013

This is an alternative to #2431.

First, when handling the output of fontconfig (fc-match and fc-list), the bulk of the output is in ascii, so it is maintained, parsed and handled as bytes. Only once the filename part of the output is extracted do we decode it using sys.getfilesystemencoding().

Behind that bug, if you actually try to use a font file with non-ascii characters in the path:

  1. Every time we passed the path to FT2Font, we were calling str() on it. This was required when FT2Font did not handle Unicode paths, prior to f4adec. However, then and now, it just serves to raise an exception when provided with a non-ascii path.

  2. Within FT2Font, that unicode string was encoded for the purpose of displaying error messages. This could fail both on encoding and displaying it. Encoding it using "unicode_escape" ensures that we'll have something printable in any environment.

  3. ttconv does not handle Unicode filenames. ttconv is in the process of being eliminated for MEP14, so I didn't want to invest a lot of effort fixing that. Instead, since it is only called from two places, the path is encoded back to filesystemencoding on the Python side before calling into it. This doesn't change the API to ttconv.

This will require a manual backport to 1.3.x if we want that -- things are different enough there that this won't really work or apply.

This approach is not necessarily "best practice" for POSIX platforms, which is to keep paths as byte strings everywhere. I tried that first, but there are too many assumptions throughout the code that file paths are unicode (particularly post f4adec with the move to six and from __future__ import unicode_literals). The downside of that is pretty minor -- if a path contains a sequence that does not decode using the filesystemencoding, such as an invalid utf-8 surrogate pair, it will not be openable by matplotlib. I'd consider that a pathological, even if technically legal, case, however, so I don't think we should worry too much about it. Keeping the paths as unicode objects does make the processing of them much easier, particularly with six and Python 3.

…ntconfig in

the filesystemencoding.  Fix a number of buglets that did not allow
fonts with non-ascii filenames to be opened.  This involves taking
advantage of code already in FT2Font to use unicode filenames, and
encoding the filenames before passing into ttconv.
# The bulk of the output from fc-list is ascii, so we keep the
# result in bytes and parse it as bytes, until we extract the
# filename, which is in sys.filesystemencoding().
for line in output.split(b'\n'):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can b'\n' or b':' be part of a bytestring representation of a unicode filename itself? If so, this will fail. I don't see the rationale for doing all this splitting on bytestrings instead of converting to unicode, doing the string-based processing, and then converting back to bytes before returning from this function.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The need to do this is because if the file system encoding is something that isn't byte-based, such as utf-16be on Windows, then you have a combination of the ascii output from fontconfig with the embedded filename path in utf-16.

It is possible that '\n' could be in the filename, but again, that's kind of a degenerate case. It would not appear unexpectedly as a utf-8 surrogate pair (since utf-8 ensures that the control characters always remain control characters). It could appear in utf-16 as an upper byte, I suppose.

The colon is indeed an issue, however.

Looking at the fc-list manpage further, I think we can provide it with a custom output format that would only contain the filepath (which is the only part we care about anyway). That would simplify this...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a new solution to this using the --format argument to fontconfig.

@mdboom
Copy link
Member Author

mdboom commented Sep 18, 2013

Travis is very unhappy -- looking into it.

@mdehoon
Copy link
Contributor

mdehoon commented Sep 20, 2013

The fix for the Mac OS X backend in src/_macosx.m looks fine. We may want to raise an Exception if the result of setfont is not NULL, just to be sure that we don't get a hard crash if for some reason setfont fails to find the font.

@mdboom
Copy link
Member Author

mdboom commented Sep 20, 2013

@mdehoon: I'll add the setfont NULL check. Thanks for the tip.

@mdboom
Copy link
Member Author

mdboom commented Sep 23, 2013

I have added the NULL check on setfont.

@mdehoon
Copy link
Contributor

mdehoon commented Sep 24, 2013

That looks fine to me. You may consider to skip the call to CGContextSelectFont(cr, name, size, kCGEncodingMacRoman) if font is NULL.

mdboom added a commit that referenced this pull request Sep 24, 2013
Handle Unicode font filenames correctly/Fix crashing MacOSX backend
@mdboom mdboom merged commit b060f7c into matplotlib:master Sep 24, 2013
@mdboom mdboom deleted the unicode-font-filenames branch August 7, 2014 13:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants