UnicodeDecodeError: 'ascii' codec can't decode . . . #42

Open
quietlyconfident opened this Issue Dec 10, 2015 · 5 comments

quietlyconfident commented Dec 10, 2015

When I use python-pdfkit with HTML content that contains certain non-ASCII characters, it fails with one of these errors if the HTML content is loaded into memory:

File ". . . /pdfkit.py", line 100, in to_pdf
    input = self.source.to_s().encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 64: ordinal not in range(128)

or

File ". . ./pdfkit.py", line 102, in to_pdf
    input = self.source.source.read().encode('utf-8')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 64: ordinal not in range(128)

However, python-pdfkit works fine when it is given just a filename, and so does wkhtmltopdf.

I think python-pdfkit is doing something unsafe with strings; perhaps it should assume that the input is just bytes.
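For readers on Python 3, here is a minimal sketch of what the tracebacks above suggest is happening. In Python 2, calling `.encode('utf-8')` on a byte string first decodes it implicitly as ASCII, and that implicit decode is the step that raises. `encode_like_python2` and `to_utf8_bytes` are hypothetical names used only for illustration; they are not part of python-pdfkit.

```python
def encode_like_python2(data):
    """Emulate Python 2's str.encode('utf-8') on a byte string.

    Python 2 decoded the bytes implicitly as ASCII and then encoded the
    result. The decode step fails on any byte >= 0x80.
    """
    return data.decode("ascii").encode("utf-8")


def to_utf8_bytes(source):
    """Hypothetical guard: encode text, pass bytes through untouched."""
    if isinstance(source, bytes):
        return source
    return source.encode("utf-8")


# HTML that is already UTF-8 encoded bytes, as when read from a file.
html = u"<p>caf\u00e9\u00a0au lait</p>".encode("utf-8")

try:
    encode_like_python2(html)
except UnicodeDecodeError as exc:
    error_message = str(exc)
else:
    error_message = None

print(error_message)            # same kind of 'ascii' codec error as above
print(to_utf8_bytes(html) == html)
```

The guard is one possible shape for a fix, assuming the library only needs UTF-8 bytes to hand to wkhtmltopdf.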

python-pdfkit error demo.zip

debaetsr commented Dec 23, 2015

I also have problems when the source is already in UTF-8 (encoding UTF-8 to UTF-8 gives weird results).

Removing the encode call works for me. My HTML source files are in UTF-8, as we have many accents in Belgium.

I assume it's the programmer's job to ensure correct encoding before calling the library, so they can be in complete control of what to do if unsupported characters occur.
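As an illustration of how encoding twice can give "weird results", here is a small sketch. It assumes the already-encoded bytes get misread as Latin-1 somewhere before the second encode; that is only one possible cause and is not confirmed for python-pdfkit.

```python
text = u"caf\u00e9"

# Encoding once gives the expected UTF-8 bytes.
once = text.encode("utf-8")

# Misreading those bytes as Latin-1 and encoding again turns each
# non-ASCII character into two wrong characters.
twice = once.decode("latin-1").encode("utf-8")

print(repr(once))
print(repr(twice))
print(twice.decode("utf-8"))
```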

Regards,
Ruben

patrickyan commented Apr 29, 2016

@debaetsr how did you fix the problem?

alanhamlett (Collaborator) commented May 4, 2017

This should be fixed on the master branch by #81 and will be released in the next version.

gbrowdy commented Jun 5, 2017

The commit you referenced deals with decoding, whereas the problem stated here (which I am also having) is with the encode call in the to_pdf method.

alexandrezia commented Jun 3, 2018

I'm also having this issue.
All my HTML files are already UTF-8, as they are in Portuguese.
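A possible caller-side workaround, sketched under the assumption that the failure comes from passing already-encoded bytes into the library: decode the file to text first, so that any later `.encode('utf-8')` operates on text rather than bytes. `read_html_as_text` is a hypothetical helper, not a python-pdfkit function, and I have not verified this against every python-pdfkit version.

```python
import io
import os
import tempfile


def read_html_as_text(path, encoding="utf-8"):
    """Hypothetical helper: return the file's contents as decoded text."""
    with io.open(path, "r", encoding=encoding) as handle:
        return handle.read()


# Write a small UTF-8 file to stand in for a real HTML document.
fd, path = tempfile.mkstemp(suffix=".html")
try:
    with os.fdopen(fd, "wb") as handle:
        handle.write(u"<p>Ol\u00e1, mundo</p>".encode("utf-8"))
    text = read_html_as_text(path)
finally:
    os.remove(path)

# Encoding decoded text does not need an implicit ASCII decode.
payload = text.encode("utf-8")
print(repr(payload))
```

`io.open` is used because it accepts an `encoding` argument on both Python 2 and Python 3.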
