PDF backend + TeX renders Unicode BOM as visible junk characters on Python 3 #4199

jadrian · 2015-03-08T03:56:15Z

uname -a = Linux helix 3.13.0-46-generic #77-Ubuntu SMP Mon Mar 2 18:23:39 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
matplotlib.__version__ = 1.3.1

MWE:

import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.rcParams['text.usetex'] = True

fig, ax = plt.subplots(1, 1)
ax.plot([5, 8, 3], label=r"90{\textdegree}")
ax.set_ylabel(r"$\Delta$AIC ({\texttimes}1000)")
ax.legend()
fig.savefig("plt_bom_mwe.pdf")

I'm using LaTeX syntax to add text to a figure, including some special symbols. When I generate a PDF from the figure using Python 3, I get weird characters at the beginning or in the middle of this text. (I see these extra characters regardless of the program or OS with which I view the PDF; the generated PDF looks the same when viewed in Mac OS X's preview app as in evince on Ubuntu.) The following figure was converted by imagemagick from the resulting PDF.

The extra characters are always the same, "þÿ": a lowercase thorn character followed by a lowercase y with diaeresis. These are at Unicode codepoints FE and FF respectively, which are the bytes that make up the Unicode byte-order mark.
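The correspondence between those two characters and the BOM bytes can be checked quickly. This is a minimal sketch using only the standard library; the variable names are illustrative.

```python
import codecs

# The big-endian UTF-16 byte-order mark, as prepended by backend_pdf.py.
bom = codecs.BOM_UTF16_BE

# The two junk characters seen in the PDF.
junk = "þÿ"
code_points = [hex(ord(ch)) for ch in junk]

# Latin-1 maps each byte to the code point with the same value,
# so decoding the BOM one byte at a time yields the junk characters.
as_single_bytes = bom.decode("latin-1")

print(bom)
print(code_points)
print(as_single_bytes == junk)
```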

There's a section of backend_pdf.py that reads:

# Unicode strings are encoded in UTF-16BE with byte-order mark.
elif isinstance(obj, str):
    try:
        # But maybe it's really ASCII?
        s = obj.encode('ASCII')
        return pdfRepr(s)
    except UnicodeEncodeError:
        s = codecs.BOM_UTF16_BE + obj.encode('UTF-16BE')  # <<<<
        return pdfRepr(s)
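To see what bytes that branch produces, here is a standalone sketch. It assumes a hypothetical helper `encode_text` that mirrors the quoted logic but leaves out the `pdfRepr` wrapping; it is not matplotlib's API. The sample string "90°" is chosen only for illustration, since this report does not establish which strings actually reach this branch in the TeX path.

```python
import codecs


def encode_text(obj, with_bom=True):
    """Hypothetical stand-in for the str branch quoted above."""
    try:
        # Pure-ASCII text is passed through unchanged.
        return obj.encode("ascii")
    except UnicodeEncodeError:
        # Non-ASCII text is encoded as UTF-16BE, optionally with a BOM.
        body = obj.encode("utf-16-be")
        return codecs.BOM_UTF16_BE + body if with_bom else body


ascii_bytes = encode_text("90")
with_bom = encode_text("90\u00b0")
without_bom = encode_text("90\u00b0", with_bom=False)

print(ascii_bytes)
print(with_bom)
print(without_bom)
```

The only difference between the last two results is the leading `b"\xfe\xff"` pair.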

If I edit the line marked by # <<<< above to omit the prepended UTF16 BOM:

        s = obj.encode('UTF-16BE')  # <<<<

... then I get exactly the figure I expect:

Again, I only see this bug when running Python 3. If I run the above MWE with Python 2 (and the same matplotlib version, 1.3.1), then I get the expected output both before and after my proposed change.

I'll also submit a pull request, but I have no idea about the potential wider impact of a change like this on other systems or use cases; this is my first time delving into matplotlib's guts, and the above Unicode-handling lines were added way back in November 2009, in revision 974b360.

jadrian · 2015-03-08T04:04:23Z

Just a thought: it might be the case that some other system considers the Unicode text that LaTeX generates to be UTF-8 instead of UTF-16. That might explain why the BOM byte sequence would get interpreted as two 8-bit visible characters rather than a 16-bit invisible BOM. The PDF backend seems pretty consistent on the UTF-16 thing, so I don't know where else in the stack this mismatch might occur.
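One way to probe that guess is to decode the same bytes under different assumptions. This is a sketch using only the standard library; the sample bytes are an assumption (the BOM followed by UTF-16BE for "°"), not data captured from the generated PDF.

```python
# BOM (FE FF) followed by U+00B0 DEGREE SIGN encoded as UTF-16BE.
sample = b"\xfe\xff\x00\xb0"

# A BOM-aware UTF-16 decoder consumes the mark and returns only the text.
as_utf16 = sample.decode("utf-16")

# A one-byte-per-character decoder turns the mark into two visible characters.
as_latin1 = sample.decode("latin-1")

# A strict UTF-8 decoder rejects the bytes: 0xFE is never valid in UTF-8.
try:
    sample.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(repr(as_utf16))
print(repr(as_latin1))
print(utf8_ok)
```

Since strict UTF-8 decoding fails on these bytes, a byte-per-character reading (Latin-1 or similar) would account for "þÿ" more directly than UTF-8 would. Whether that is what actually happens in the TeX or PDF path here is not confirmed.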

tacaswell · 2015-03-08T18:01:42Z

See http://en.wikipedia.org/wiki/Byte_order_mark

Something else seems fishy, as the characters you are getting match the Windows encoding, but you are on Linux.

I cannot reproduce this issue using 1.4.3 + py3.4.

tacaswell · 2015-03-08T18:04:27Z

The upgrade between 1.3.1 and 1.4.x probably matters quite a bit in this case: for the 1.4 series we switched from 2to3 to six for Python 2/3 compatibility and, as part of the conversion, cleaned up a large number of Unicode-related issues.

jadrian · 2015-03-08T19:09:24Z

Righto, thanks for looking into this. As I mentioned, the characters I saw are also at codepoints FE and FF in Unicode, so I don't think Windows encoding is necessary to explain the bug I saw, just a naive UTF-8 interpretation that never deals with multi-byte characters. Thanks for explaining about the upgrades in 1.4.x and for considering my pull request. I agree that there's no reason to merge it in if the bugs can't be confirmed with 1.4.x; I'll just keep the patch on my own system until I have cause to upgrade to 1.4.x.

tacaswell · 2015-03-08T19:19:10Z

I am also a tad worried that this is an issue with the interaction with LaTeX on your system.

Could you please try with 1.4.3? The easiest way to test is probably to install via anaconda. I would like to get this sorted before it drifts off to our huge backlog of issues, and I don't feel comfortable closing it on just my inability to reproduce.

mdboom · 2015-03-09T13:49:17Z

I can reproduce this on 1.3.1, but not 1.4.0 (which makes sense, since there was a big push for better Python 3 compatibility in that revision). So I'm going to close this as "already fixed", since the 1.3 branch is no longer maintained.

jadrian mentioned this issue Mar 8, 2015

Removed BOM from Unicode text #4200

Closed

tacaswell added this to the next point release milestone Mar 8, 2015

tacaswell added the status: needs confirmation label Mar 8, 2015

mdboom closed this as completed Mar 9, 2015
