PDF backend + TeX renders Unicode BOM as visible junk characters on Python 3 #4199

Closed
jadrian opened this issue Mar 8, 2015 · 6 comments

@jadrian

jadrian commented Mar 8, 2015

  • uname -a = Linux helix 3.13.0-46-generic #77-Ubuntu SMP Mon Mar 2 18:23:39 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
  • matplotlib.__version__ = 1.3.1

MWE:

import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.rcParams['text.usetex'] = True

fig, ax = plt.subplots(1, 1)
ax.plot([5, 8, 3], label=r"90{\textdegree}")
ax.set_ylabel(r"$\Delta$AIC ({\texttimes}1000)")
ax.legend()
fig.savefig("plt_bom_mwe.pdf")

I'm using LaTeX syntax to add text to a figure, including some special symbols. When I generate a PDF from the figure using Python 3, I get weird characters at the beginning or in the middle of this text. (I see these extra characters regardless of the program or OS with which I view the PDF; the generated PDF looks the same in Mac OS X's Preview app as in evince on Ubuntu.) The following figure was converted by ImageMagick from the resulting PDF.
(image: plt_bom_mwe_py3_before)
The extra characters are always the same, "þÿ": a lowercase thorn followed by a lowercase y with diaeresis. These are at Unicode code points U+00FE and U+00FF respectively, and the bytes 0xFE 0xFF make up the UTF-16 big-endian byte-order mark.
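
For anyone who wants to check that correspondence, here is a minimal sketch using only the standard library. The variable names are made up for illustration; the only real API used is `codecs.BOM_UTF16_BE`.

```python
import codecs

# The UTF-16 big-endian byte-order mark, as the standard library defines it.
bom = codecs.BOM_UTF16_BE

# Numeric values of the two BOM bytes.
bom_byte_values = list(bom)

# Code points of the two junk characters (thorn, y with diaeresis),
# written as escapes so the snippet does not depend on source encoding.
junk_code_points = [ord(c) for c in "\u00fe\u00ff"]

# Both lists hold the same two numbers, 0xFE and 0xFF.
print(bom_byte_values == junk_code_points)
```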

There's a section of backend_pdf.py that reads:

# Unicode strings are encoded in UTF-16BE with byte-order mark.
elif isinstance(obj, str):
    try:
        # But maybe it's really ASCII?
        s = obj.encode('ASCII')
        return pdfRepr(s)
    except UnicodeEncodeError:
        s = codecs.BOM_UTF16_BE + obj.encode('UTF-16BE')  # <<<<
        return pdfRepr(s)

If I edit the line marked by # <<<< above to omit the prepended UTF16 BOM:

        s = obj.encode('UTF-16BE')  # <<<<

... then I get exactly the figure I expect:
(image: plt_bom_mwe_py3_after)
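
To make the difference between the two variants concrete, here is a rough standalone sketch of just the encoding step. The function names `encode_with_bom` and `encode_without_bom` are hypothetical, and the sketch leaves out the `pdfRepr` call entirely, so it only shows the bytes each variant would produce, not what matplotlib writes into the PDF.

```python
import codecs


def encode_with_bom(text):
    """Encoding step of the quoted branch: ASCII if possible, else BOM + UTF-16BE."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        return codecs.BOM_UTF16_BE + text.encode("utf-16-be")


def encode_without_bom(text):
    """Same as above, but with the BOM omitted as in the proposed edit."""
    try:
        return text.encode("ascii")
    except UnicodeEncodeError:
        return text.encode("utf-16-be")


# A sample string with one non-ASCII character (the degree sign).
sample = "90\u00b0"
with_bom = encode_with_bom(sample)
without_bom = encode_without_bom(sample)

print(with_bom)
print(without_bom)
```

The two results differ only in the two leading bytes 0xFE 0xFF, and pure-ASCII input takes the first branch in both variants, so it is unaffected by the change.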

Again, I only see this bug when running Python 3. If I run the above MWE with Python 2 (and the same matplotlib version, 1.3.1), then I get the expected output both before and after my proposed change.

I'll also submit a pull request, but I have no idea about the potential wider impact of a change like this on other systems or use cases; this is my first time delving into matplotlib's guts, and the above Unicode-handling lines were added way back in November 2009, in revision 974b360.

@jadrian
Author

jadrian commented Mar 8, 2015

Just a thought: it might be the case that some other system considers the Unicode text that LaTeX generates to be UTF-8 instead of UTF-16. That might explain why the BOM byte sequence would get interpreted as two 8-bit visible characters rather than a 16-bit invisible BOM. The PDF backend seems pretty consistent on the UTF-16 thing, so I don't know where else in the stack this mismatch might occur.
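
One way to probe this hypothesis is to decode the two BOM bytes under a few candidate encodings and see which ones yield "þÿ". This is only a sketch: the helper `try_decode` and the choice of encodings are assumptions for illustration, and it says nothing about which component in the LaTeX/PDF stack actually performs the interpretation.

```python
BOM_BYTES = b"\xfe\xff"


def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


# Decode the same two bytes under several candidate encodings.
results = {
    enc: try_decode(BOM_BYTES, enc)
    for enc in ("utf-8", "latin-1", "cp1252", "utf-16")
}

for enc, text in results.items():
    print(enc, repr(text))
```

In this check, strict UTF-8 rejects the bytes outright, the single-byte encodings Latin-1 and Windows-1252 both give the thorn and y-with-diaeresis pair, and the `utf-16` codec consumes the bytes as a BOM and returns an empty string.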

@tacaswell
Member

See http://en.wikipedia.org/wiki/Byte_order_mark

Something else seems fishy, as the characters you are getting match the Windows encoding, but you are on Linux.

I cannot reproduce this issue using 1.4.3 + py3.4.

@tacaswell
Member

The upgrade between 1.3.1 and 1.4.x probably matters quite a bit in this case: for the 1.4 series we switched from 2to3 to six for Python 2/3 compatibility, and as part of the conversion we cleaned up a large number of Unicode-related issues.

tacaswell added this to the next point release milestone Mar 8, 2015

@jadrian
Author

jadrian commented Mar 8, 2015

Righto, thanks for looking into this. As I mentioned, the characters I saw are also at codepoints FE and FF in Unicode, so I don't think Windows encoding is necessary to explain the bug I saw, just a naive UTF-8 interpretation that never deals with multi-byte characters. Thanks for explaining about the upgrades in 1.4.x and for considering my pull request. I agree that there's no reason to merge it in if the bugs can't be confirmed with 1.4.x; I'll just keep the patch on my own system until I have cause to upgrade to 1.4.x.

@tacaswell
Member

I am also a tad worried that this is an issue with the interaction with LaTeX on your system.

Could you please try with 1.4.3? The easiest way to test is probably to install via Anaconda. I would like to get this sorted before it drifts off into our huge backlog of issues, and I don't feel comfortable closing it on just my inability to reproduce.

@mdboom
Member

mdboom commented Mar 9, 2015

I can reproduce this on 1.3.1, but not 1.4.0 (which makes sense, since there was a big push for better Python 3 compatibility in that revision). So I'm going to close this as "already fixed", since the 1.3 branch is no longer maintained.

mdboom closed this as completed Mar 9, 2015