UnicodeDecodeError with call to tidy_document #11
This basically works around the upstream issue in difflib, which is documented in many third-party bug reports:
- https://bugzilla.redhat.com/show_bug.cgi?id=1221169
- frescobaldi/frescobaldi#674
- https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=784940
- https://bugzilla.mozilla.org/show_bug.cgi?id=1149667
This also manually encodes the content passed to pytidylib as UTF-8 in case something went wrong. See countergram/pytidylib#11 for more info and https://github.com/1flow/python-ftr/commit/ed8015ec9161d86e1b502122b640a1490b546d46 for the inspiration for this workaround.
Sorry, I lost track of this. I do have an idea of what could be happening, but it would be easier with a sample input document that triggers the error, since none of the existing tests catch it.
@countergram The value in question is
FWIW, this also happens outside of Vagrant, in production. Ignoring those characters "fixed" the issue for me in local testing: https://gist.github.com/jezdez/579e6a30d85c2ced042a
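For readers who want to see what "ignoring those characters" can look like, here is a minimal sketch. It is not the content of the gist above; the helper name `decode_lenient` is hypothetical, and it assumes the input arrives as bytes that are mostly valid UTF-8.

```python
def decode_lenient(raw, encoding="utf-8"):
    """Return text, dropping any bytes that are invalid in `encoding`.

    Hypothetical helper for illustration; not part of pytidylib.
    """
    if isinstance(raw, str):
        # Already decoded text: nothing to do.
        return raw
    # errors="ignore" silently discards undecodable bytes instead of raising
    # UnicodeDecodeError, so data is lost where the input was malformed.
    return raw.decode(encoding, errors="ignore")


if __name__ == "__main__":
    # \xff is never valid in UTF-8, so it is dropped from the result.
    print(decode_lenient(b"caf\xc3\xa9 \xff ok"))
```

This hides the symptom rather than the cause, which is presumably why "fixed" is in quotes above.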
Have it too.
Yes, I got it too since it migrated into Debian, where it breaks the rawdog RSS feed reader.
The cause of this for rawdog was that libtidy 0.99 used ASCII as its default encoding, and libtidy 5 uses UTF-8. rawdog relied on the default and didn't expect to get a UTF-8 encoded result. I fixed this in rawdog 2.22 by explicitly specifying the input and output encodings, so it works the same way on all versions. So it's not pytidylib's fault, it's a tidy bug. I'm not sure the problem the original poster is seeing is the same thing, though...
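A rough sketch of what specifying the encodings explicitly could look like with pytidylib is below. This is not the rawdog 2.22 change itself; the names `ENCODING_OPTIONS`, `build_options`, `to_utf8_bytes`, and `tidy_utf8` are hypothetical. It assumes pytidylib and libtidy are installed and uses tidy's `input-encoding` and `output-encoding` options with the value `utf8`.

```python
# Tidy's encoding options, pinned so the result does not depend on the
# default of whichever libtidy version happens to be installed.
ENCODING_OPTIONS = {"input-encoding": "utf8", "output-encoding": "utf8"}


def build_options(extra_options=None):
    """Merge caller options with the pinned encodings (hypothetical helper)."""
    options = dict(extra_options or {})
    # Apply the encodings last so they cannot be overridden by accident.
    options.update(ENCODING_OPTIONS)
    return options


def to_utf8_bytes(html):
    """Encode text as UTF-8; pass bytes through unchanged."""
    return html.encode("utf-8") if isinstance(html, str) else html


def tidy_utf8(html, extra_options=None):
    """Run tidy with explicit UTF-8 input and output encodings."""
    # Imported here so the helpers above work without libtidy installed.
    from tidylib import tidy_document

    return tidy_document(to_utf8_bytes(html), options=build_options(extra_options))
```

Calling `tidy_utf8("<p>café</p>")` should then behave the same under libtidy 0.99 and libtidy 5, assuming the input really is UTF-8.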
This may seem a bit simplistic, but I couldn't figure out how to reproduce this manually, so maybe you have an idea of how to fix the following traceback.
Sorry for the odd formatting, but that's all I got from a Celery task that calls tidy_document: https://github.com/mozilla/kuma/blob/8693de789413ef81e74da6d1f02aa39421eb611b/kuma/wiki/helpers.py#L92-L95
Others seem to have a similar problem and have worked around it: https://github.com/1flow/python-ftr/blob/90a2108c5ee005f1bf66dbe8cce68f2b7051b839/ftr/extractor.py#L146-L154
Do you know what's causing this?