PDF backend + TeX renders Unicode BOM as visible junk characters on Python 3 #4199
Comments
Just a thought: it might be the case that some other system considers the Unicode text that LaTeX generates to be UTF-8 instead of UTF-16. That might explain why the BOM byte sequence would get interpreted as two 8-bit visible characters rather than a 16-bit invisible BOM. The PDF backend seems pretty consistent on the UTF-16 thing, so I don't know where else in the stack this mismatch might occur.
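To make the decoder-mismatch idea concrete, here is a minimal sketch that decodes the same BOM-prefixed bytes three ways. The sample character "α" and the choice of Latin-1 as the "one byte per character" reading are assumptions for illustration only; nothing here identifies which component in the PDF/LaTeX stack actually misreads the bytes.

```python
import codecs

# "α" (U+03B1) encoded as UTF-16 big-endian, with the BOM in front.
payload = codecs.BOM_UTF16_BE + "α".encode("utf-16-be")

# A UTF-16 decoder uses the BOM to pick the byte order and then drops it.
as_utf16 = payload.decode("utf-16")

# A one-byte-per-character decoder (Latin-1 here) turns every byte,
# including the two BOM bytes, into its own character.
as_latin1 = payload.decode("latin-1")

# A strict UTF-8 decoder rejects the data, since 0xFE is never valid in UTF-8.
try:
    payload.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(repr(as_utf16))
print(repr(as_latin1))
print(utf8_ok)
```

If this reading is right, the visible junk would come from a single-byte interpretation of the bytes rather than from a strict UTF-8 decoder, which would refuse them outright.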
See http://en.wikipedia.org/wiki/Byte_order_mark. Something else seems fishy, as the characters you are getting match the Windows encoding, but you are on Linux. I cannot reproduce this issue using 1.4.3 + py3.4.
The upgrade between 1.3.1 and 1.4.x probably matters quite a bit in this case as for the 1.4 series we switched from using 2to3 to |
Righto, thanks for looking into this. As I mentioned, the characters I saw are also at codepoints FE and FF in Unicode, so I don't think Windows encoding is necessary to explain the bug I saw, just a naive UTF-8 interpretation that never deals with multi-byte characters. Thanks for explaining about the upgrades in 1.4.x and for considering my pull request. I agree that there's no reason to merge it in if the bugs can't be confirmed with 1.4.x; I'll just keep the patch on my own system until I have cause to upgrade to 1.4.x.
I am also a tad worried that this is an issue with the interaction with LaTeX on your system. Could you please try with 1.4.3? The easiest way to test is probably to install via Anaconda. I would like to get this sorted before it drifts off into our huge backlog of issues, and I don't feel comfortable closing it just on my inability to reproduce.
I can reproduce this on 1.3.1, but not 1.4.0 (which makes sense, since there was a big push for better Python 3 compatibility in that revision). So I'm going to close this as "already fixed", since the 1.3 branch is no longer maintained.
uname -a
= Linux helix 3.13.0-46-generic #77-Ubuntu SMP Mon Mar 2 18:23:39 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
matplotlib.__version__
= 1.3.1
MWE:
I'm using LaTeX syntax to add text to a figure, including some special symbols. When I generate a PDF from the figure using Python 3, I get weird characters at the beginning or in the middle of this text. (I see these extra characters regardless of the program or OS with which I view the PDF; the generated PDF looks the same when viewed in Mac OS X's Preview app as in evince on Ubuntu.) The following figure was converted by ImageMagick from the resulting PDF.
The extra characters are always the same, "þÿ": a lowercase thorn character followed by a lowercase y with diaeresis. These are at Unicode codepoints U+00FE and U+00FF respectively, and the byte values FE FF are exactly the bytes that make up the big-endian UTF-16 byte-order mark.
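The correspondence can be checked with a short sketch using only the Python 3 standard library: it takes each byte of the big-endian UTF-16 BOM and looks up the character at the same codepoint.

```python
import codecs
import unicodedata

bom = codecs.BOM_UTF16_BE  # the two bytes FE FF

# Treat each BOM byte as a codepoint and look up the character's name.
chars = [chr(byte) for byte in bom]
names = [unicodedata.name(ch) for ch in chars]

for byte, ch, name in zip(bom, chars, names):
    print(hex(byte), ch, name)

# The BOM itself is the single character U+FEFF encoded as UTF-16BE.
bom_matches = "\ufeff".encode("utf-16-be") == bom
print(bom_matches)
```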
There's a section of backend_pdf.py that reads:
If I edit the line marked by # <<<< above to omit the prepended UTF-16 BOM:
... then I get exactly the figure I expect:
Again, I only see this bug when running Python 3. If I run the above MWE with Python 2 (and the same matplotlib version, 1.3.1), then I get the expected output both before and after my proposed change.
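As a rough illustration of what prepending versus omitting the BOM does to the output bytes, here is a minimal sketch. The helpers `encode_text` and `as_single_bytes` are hypothetical and are not matplotlib's code; the Latin-1 "viewer" is likewise an assumption standing in for whatever component reads the bytes one at a time.

```python
import codecs


def encode_text(s, with_bom=True):
    """Hypothetical helper: UTF-16BE bytes, optionally prefixed with the BOM."""
    body = s.encode("utf-16-be")
    return codecs.BOM_UTF16_BE + body if with_bom else body


def as_single_bytes(data):
    """Assumed consumer: reads one byte per character (Latin-1).

    NUL bytes are dropped only to keep the printed result readable.
    """
    return data.decode("latin-1").replace("\x00", "")


with_bom = as_single_bytes(encode_text("abc"))
without_bom = as_single_bytes(encode_text("abc", with_bom=False))

print(with_bom)
print(without_bom)
```

Under these assumptions the BOM shows up as two leading visible characters, and dropping it removes them, which matches the behaviour described above. It does not explain why Python 2 and Python 3 differ.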
I'll also submit a pull request, but I have no idea about the potential wider impact of a change like this on other systems or use cases; this is my first time delving into matplotlib's guts, and the above Unicode-handling lines were added way back in November 2009, in revision 974b360.