Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: blindly change all encoding to UTF-8 #224

Closed
wants to merge 1 commit into from

Conversation

anarcat
Copy link
Contributor

@anarcat anarcat commented Nov 8, 2017

My public key doesn't properly decode when loaded with PGPy:

In [38]: pgpy.PGPKey.from_file('anarcat.gpg')[0].userids[0].name
Out[38]: 'Antoine Beaupré'

By switching to UTF-8 everywhere, it gets properly parsed. Using
Latin-1 is not a right default: it is europe-centric and will
necessarily break whenever another language than english or some
european countries is used (e.g. any asian language will fail). As you
can see, even french accents fail, if they are (properly) encoded in
UTF-8.

This is a stopgap measure: I am not sure how to decode user
identifiers properly, nor if this touches on too much stuff to fix the
actual problem. But I would assert that using UTF-8 is a better
default than latin-1.

My public key doesn't properly decode when loaded with PGPy:

In [38]: pgpy.PGPKey.from_file('anarcat.gpg')[0].userids[0].name
Out[38]: 'Antoine Beaupré'

By switching to UTF-8 everywhere, it gets properly parsed. Using
Latin-1 is not a right default: it is europe-centric and will
necessarily break whenever another language than english or some
european countries is used (e.g. any asian language will fail). As you
can see, even french accents fail, if they are (properly) encoded in
UTF-8.

This is a stopgap measure: I am not sure how to decode user
identifiers properly, nor if this touches on too much stuff to fix the
actual problem. But I would assert that using UTF-8 is a better
default than latin-1.
@J08nY
Copy link
Contributor

J08nY commented Nov 8, 2017

RFC 4880 indeed specifies the User ID packet contains the identifier in UTF-8:
https://tools.ietf.org/html/rfc4880#section-5.11

However this PR contains changes that go much further and I believe may break some other stuff, so maybe just changing the stuff RFC4880 explicitly specifies as UTF-8 should be done instead.

@Commod0re
Copy link
Contributor

the reason it is currently latin-1 (which in hindsight was probably not the best workaround) is because although RFC 4880 stipulates UTF-8, it has not always been this way:

  • RFC 1991 stipulated that the contents should be printable ASCII characters
  • RFC 2440 allowed for non-text values

I have encountered some UserIDs in the wild that are likely legacy ones that have been carried forward, which use extended ASCII characters like \xe3 and \xe9 that are not valid utf-8 characters (and do not decode correctly with 'ascii' but do with 'latin-1'). I'm going to close this merge request as I am working on a fix that is more correct and more lenient

@Commod0re Commod0re closed this Nov 8, 2017
@anarcat
Copy link
Contributor Author

anarcat commented Nov 9, 2017

right. UserIDs are a pain in the butt. i would recommend trying UTF-8 and if that fails, fallback to latin-1 or something.

@anarcat anarcat deleted the utf8 branch November 9, 2017 15:24
@anarcat
Copy link
Contributor Author

anarcat commented Nov 9, 2017

oh, and for what it's worth, the Haskell equivalent of us (hopenpgp) considers invalid UTF-8 to be broken packets and they are just ignored.

@Commod0re
Copy link
Contributor

I spent some time playing with it last night and the approach I've settled on is too try utf-8, if that fails, fallback to the charmap decoder. With the legacy UserIDs in my large test keyring this seems to produce the most consistently correct looking results without breaking anything.

I think usability edges out principle in this case

@anarcat
Copy link
Contributor Author

anarcat commented Nov 10, 2017

I think usability edges out principle in this case

agreed. but utf-8 should be our first bet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants