py3 text_to_bytes(unicode_string) results in mojibake #154

Closed
escapewindow opened this Issue Feb 25, 2016 · 13 comments

Projects

None yet

2 participants

@escapewindow

https://simple.wikipedia.org/wiki/Mojibake

In Python3, PGPMessage.new(unicode_string) munges the string even before any encryption happens. This is because PGPObject.text_to_bytes() is munging the unicode. This results in an undecodable byte string.

The currently Py3 workaround is to PGPMessage.new(unicode_bytearray).

This issue tracks fixing PGPMessage.new(unicode_string).

See #153 for STR tests.

@escapewindow

In the passing tests (all py2 and py3 with a bytearray passed):

  • i.bit_length() is 8, 1, 31
  • int_byte_len is 1, 1, 4

In the failing tests (py3 with a unicode_string passed):

  • i.bit_length() is 17, 8, 1, 31
  • int_byte_len is 3, 1, 1, 4
@escapewindow

I was going to SWAG and say the 17 is a unicode escape character.
However, with some debug prints, I found it's 128169, or http://www.fileformat.info/info/unicode/char/1F4A9/index.htm (which is what \U0001F4A9 is).

So py2 must be splitting the poop emoji into < 256 ints, and py3 blobs it into 128169, and it's getting munged somewhere in PGPObject.int_to_bytes().

@escapewindow

This fixes py3, but assumes utf-8 encoding: https://gist.github.com/escapewindow/5051b00196d656af9e13

@escapewindow

This also fixes it, falling back to the above code if the original bytearray doesn't decode() properly.
However, this also needs to know the encoding, so I'm not sure how you get around that.
Maybe PGPMessage.new() should take an encoding, defaulting to utf-8, and pass that to text_to_bytes()?

https://gist.github.com/escapewindow/435a5bd1f16783edfd11

@escapewindow

https://gist.github.com/escapewindow/17ff4517f85dfe0cd20e fixes it pretty cleanly, but I'm sure there was a reason it was written the way it was originally. The fact that the tests are broken well before test_06_unicode.py is even run hampers my ability to test around this.

@Commod0re Commod0re self-assigned this Feb 25, 2016
@Commod0re Commod0re added the bug label Feb 25, 2016
@Commod0re Commod0re added this to the 0.4.0 milestone Feb 25, 2016
@Commod0re
Contributor

Interesting find. I'm taking a look at these fixes and test cases now.

The reason why the current code doesn't accept an encoding parameter is because RFC 4880 expects text to be UTF-8, and clearly I didn't test that conversion thoroughly enough, as you noticed with the lack of explicit unicode tests.

I'm not sure that accepting an encoding parameter is the right fix since there's no way to store that in an OpenPGP-compliant way, so an encoding parameter would also have to be added to other methods that use text data, which adds opportunity to get incorrect return values from things like signature verification.

I think the right fix is to correctly translate text into equivalent UTF-8 before loading it into a bytearray, so that we're always looking at UTF-8 as expected by the spec.

Thanks for finding this, and all the detail, by the way!

@Commod0re
Contributor

While looking at this, I remembered why it was written how it was - I wanted it to not fail when the encoding was weird, but that isn't the best behavior.

In this case, as PGPy isn't a character encoding detection library (and reliably detecting charsets is non-trivial), I think the "right" approach here is to do the following:

  • Expect that text data given to PGPy when creating objects is UTF-8 (or ASCII, which is compatible anyway) and raise an exception if that isn't the case
  • Respect the "Charset" armor header (see RFC 4880 6.2 - Forming ASCII Armor) when present. Internally, PGPy will still store text data as UTF-8 bytes, but should encode to the charset given if that header is present.
@escapewindow

That sounds good to me. Thanks for taking a look!

@Commod0re
Contributor

Should be fixed now as of 137c5d4.

Here's what I ended up doing (and this information will be in the documentation with the 0.4.0 release as well):

  • Added two new keyword arguments to PGPMessage.new: encoding and format. Setting encoding will add a "Charset" header to the ASCII armor of the message (unless it is a cleartext message, in which case the Charset header will appear in the signature block).
  • If binary data is given when creating a PGPMessage, it will be left as-is unless format is explicitly set to 'u' or 't', or if cleartext is True.
  • if cleartext is True or format is set to 'u' or 't' and encoding is set to something other than utf-8, PGPMessage will try to transcode the binary data given from the specified encoding to UTF-8.

Other than that, it should behave as it did before but without munging anything. Let me know if you find any cases where it still breaks :)

@Commod0re Commod0re closed this Feb 27, 2016
@Commod0re Commod0re added a commit that referenced this issue Feb 27, 2016
@Commod0re Commod0re - fixed improper munging of non-unicode inputs - #154
 - added additional unicode tests - #153
 - fixed mixin/base class ordering of PGPObject subclasses
137c5d4
@escapewindow

Thanks!
When do you plan on releasing 0.4.0?

@Commod0re
Contributor

Soon™

I don't have an exact timeline I'm committed to just yet, but I am feeling like the codebase for 0.4.0 is already just about feature complete. I'm planning on pushing a couple of feature issues to the following release since this one has been delayed more than enough as it is. Other than that, it'll mainly be a matter of filling in holes in unit test coverage, updating documentation, and just playing with to try to make sure the interface still feels good and doesn't break too badly on weird stuff.

I feel like I can probably reasonably do that within the next couple of weeks, but it'll largely depend on how busy I end up being with tasks not related to PGPy.

@escapewindow

I feel like I can probably reasonably do that within the next couple of weeks, but it'll largely depend on how busy I end up being with tasks not related to PGPy.

Looks like very busy :)

@Commod0re
Contributor

Yeah, I was hopeful, but we are finishing up a release of another piece of software as well, so I haven't had a chance to work on wrapping this one up yet :/

@escapewindow escapewindow referenced this issue in taskcluster/taskcluster-client.py May 4, 2016
Closed

pgpy 0.4.0 ! #53

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment