UnicodeDecodeError with call to tidy_document #11

Open
jezdez opened this issue Jun 17, 2015 · 5 comments

Comments

@jezdez

jezdez commented Jun 17, 2015

This may seem a bit simplistic, but I couldn't figure out how to reproduce this manually, so maybe you have an idea how to fix the following traceback.

Traceback (most recent call last):
File "_ctypes/callbacks.c", line 314, in 'calling callback function'
File "/home/vagrant/src/vendor/src/pytidylib/tidylib/sink.py", line 79, in put_byte
write_func(byte.decode('utf-8'))
File "/home/vagrant/env/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError

Sorry for the odd formatting but that's all I got from a Celery task that runs a call to tidy_document: https://github.com/mozilla/kuma/blob/8693de789413ef81e74da6d1f02aa39421eb611b/kuma/wiki/helpers.py#L92-L95

Others seem to have a similar problem and have worked around it: https://github.com/1flow/python-ftr/blob/90a2108c5ee005f1bf66dbe8cce68f2b7051b839/ftr/extractor.py#L146-L154

Do you know what's causing this?

@jezdez changed the title from "UnicodeDecodeError with" to "UnicodeDecodeError with call to tidy_document" on Jun 17, 2015
jezdez added a commit to mdn/kuma that referenced this issue Jun 17, 2015
This basically works around the upstream issue in difflib, which is documented in the many 3rd party bug reports:
- https://bugzilla.redhat.com/show_bug.cgi?id=1221169
- frescobaldi/frescobaldi#674
- https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=784940
- https://bugzilla.mozilla.org/show_bug.cgi?id=1149667

This also manually encodes the content passed to pytidylib as utf-8 in case something went wrong. See countergram/pytidylib#11 for more info and https://github.com/1flow/python-ftr/commit/ed8015ec9161d86e1b502122b640a1490b546d46 for the inspiration for this workaround.
@countergram
Owner

Sorry, I lost track of this. I do have an idea of what could be happening, but it would be easier to confirm with a sample input document that triggers the error, since none of the existing tests catch it.

@jezdez
Author

jezdez commented Jul 14, 2015

@countergram The value in question is 0xc3 0xa9, which should be é. For some reason it stumbles over it though. Here's a better traceback:

Traceback (most recent call last):
  File "_ctypes/callbacks.c", line 314, in 'calling callback function'
  File "/home/vagrant/src/vendor/src/pytidylib/tidylib/sink.py", line 79, in put_byte
    write_func(byte.decode('utf-8'))
  File "/home/vagrant/env/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc3 in position 0: unexpected end of data
Traceback (most recent call last):
  File "_ctypes/callbacks.c", line 314, in 'calling callback function'
  File "/home/vagrant/src/vendor/src/pytidylib/tidylib/sink.py", line 79, in put_byte
    write_func(byte.decode('utf-8'))
  File "/home/vagrant/env/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: invalid start byte

FWIW, this also happens outside of Vagrant, in production.

Ignoring those characters "fixed" the issue for me in local testing FWIW: https://gist.github.com/jezdez/579e6a30d85c2ced042a
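
The two tracebacks above are what you would expect if the two bytes of é reach the callback separately and each one is decoded on its own. The following is a minimal sketch of that behavior in Python 3, not pytidylib's actual code: the function names are made up for illustration, and using an incremental decoder is just one possible alternative, not a confirmed fix.

```python
import codecs

# The two-byte UTF-8 encoding of "é" mentioned above: 0xc3 0xa9.
DATA = "é".encode("utf-8")


def decode_per_byte(raw):
    """Decode every byte on its own and collect the error reasons.

    Hypothetical helper that mimics a callback receiving one byte at a time.
    """
    reasons = []
    for i in range(len(raw)):
        try:
            raw[i:i + 1].decode("utf-8")
        except UnicodeDecodeError as exc:
            reasons.append(exc.reason)
    return reasons


def decode_incrementally(raw):
    """Feed the same single bytes to an incremental decoder.

    The decoder buffers an incomplete sequence until the rest arrives.
    """
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = [decoder.decode(raw[i:i + 1]) for i in range(len(raw))]
    parts.append(decoder.decode(b"", final=True))
    return "".join(parts)


print(decode_per_byte(DATA))
print(decode_incrementally(DATA))
```

Note that decoding each byte separately with errors="ignore" avoids the exception but yields an empty string for both bytes, so the character is silently dropped rather than preserved.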

@riklaunim

Have it too

@phep

phep commented Feb 18, 2016

Yes, I've got it too since it migrated into Debian, where it breaks the rawdog RSS feed reader.

Traceback (most recent call last):
  File "_ctypes/callbacks.c", line 314, in 'calling callback function'
  File "/usr/lib/python2.7/dist-packages/tidylib/sink.py", line 79, in put_byte
    write_func(byte.decode('utf-8'))
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xa9 in position 0: invalid start byte

@atsampson

The cause of this for rawdog was that libtidy 0.99 used ASCII as its default encoding, whereas libtidy 5 uses UTF-8. rawdog relied on the default and didn't expect to get a UTF-8 encoded result. I fixed this in rawdog 2.22 by explicitly specifying the input and output encodings, so it works the same way on all versions. So it's not pytidylib's fault; rawdog was relying on a libtidy default that changed between versions.

I'm not sure the problem the original poster is seeing is the same thing, though...
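
For anyone hitting the rawdog variant of this, pinning the encodings instead of relying on libtidy's default might look roughly like the sketch below. This is not rawdog's actual code: the helper names are hypothetical, and it assumes the installed pytidylib passes the standard tidy options `input-encoding` and `output-encoding` through unchanged.

```python
def tidy_options_with_explicit_encoding(encoding="utf8", extra=None):
    """Build a tidy options dict that never relies on libtidy's default encoding.

    Hypothetical helper; "input-encoding" and "output-encoding" are tidy's
    own option names, and "utf8" is tidy's spelling of UTF-8.
    """
    options = dict(extra or {})
    # Set the encodings last so caller-supplied extras cannot unset them.
    options["input-encoding"] = encoding
    options["output-encoding"] = encoding
    return options


def tidy_with_explicit_encoding(html):
    """Run tidy with pinned encodings (needs pytidylib and libtidy installed)."""
    from tidylib import tidy_document  # imported lazily; optional dependency

    return tidy_document(html, options=tidy_options_with_explicit_encoding())


print(tidy_options_with_explicit_encoding(extra={"numeric-entities": 1}))
```

Building the options in one place means the same call behaves identically whether the system ships libtidy 0.99 or libtidy 5.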
