Filter invalid characters to avoid PCDATA errors #12

Merged
merged 1 commit into from
Jun 8, 2017

neiljp
Contributor

@neiljp neiljp commented Jun 3, 2017

I made these changes to resolve what I believe is an issue similar to #6.

@neiljp neiljp mentioned this pull request Jun 3, 2017
@JoshData
Owner

JoshData commented Jun 3, 2017

Hey thanks. I'm away for a few days - I'll look more closely when I get back.

@JoshData
Owner

JoshData commented Jun 6, 2017

Before I merge, can you say something about where you think these characters are coming from?

@neiljp
Contributor Author

neiljp commented Jun 7, 2017

What do you mean by "coming from"?

I've not found a definitive reference for what is or is not valid, but my understanding is that XML, or the library being used (or maybe the encoding?), simply does not support certain characters.

I followed a simplified form of an empirical solution that I found here:
https://stackoverflow.com/questions/8888628/how-should-i-deal-with-an-xmlsyntaxerror-in-pythons-lxml-while-parsing-a-large
However, if there is a better approach, such as clarifying the expected character set or explicitly checking against the XML/PDF standard, that would also be great.
I ran into what may be a similar approach here:
neitanod/forceutf8#39
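
For readers who want a concrete picture of this kind of filter, here is a minimal sketch. It is not the code merged in this pull request; the function name `strip_invalid_xml_chars` is hypothetical, and the character ranges are taken from the `Char` production of the XML 1.0 specification, which allows tab, line feed, carriage return, and the ranges U+0020-U+D7FF, U+E000-U+FFFD, and U+10000-U+10FFFF.

```python
import re

# Matches any character that falls OUTSIDE the XML 1.0 "Char" production.
# Allowed: tab (U+0009), LF (U+000A), CR (U+000D), U+0020-U+D7FF,
# U+E000-U+FFFD and U+10000-U+10FFFF. Everything else is matched.
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text: str) -> str:
    """Return text with characters that XML 1.0 disallows removed.

    Hypothetical helper for illustration; the real patch may differ.
    """
    return _INVALID_XML_CHARS.sub("", text)


# Example: a vertical tab (code point 11) is dropped, ordinary whitespace stays.
cleaned = strip_invalid_xml_chars("end of line\x0b\n")
```

Removing the characters outright is only one possible choice; replacing them with a space or U+FFFD would preserve word boundaries, at the cost of changing the text that later gets compared.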

@JoshData
Owner

JoshData commented Jun 7, 2017

So. Your patch is making a change at the point where pdftotext's output becomes the input for lxml's parser. lxml is saying that there are some bytes in the file that aren't valid XML. My question is, why would pdftotext create invalid XML? Under what circumstances does this happen? Or, is pdftotext operating correctly and maybe the problem is in how lxml is interpreting the bytes?

@neiljp
Contributor Author

neiljp commented Jun 7, 2017

Ah, good question. Does this help?

It's difficult to know whether these characters are in the original PDF, but I have confirmed that they appear in the plain-text output of pdftotext (without the -bbox option). So it could be that pdftotext doesn't check whether the "words" it generates contain only characters that are valid under the XHTML standard.

In this case, my editor (vim) shows a '^K' character at the end of some lines in both the .txt and .html files; this is presumably the 'value 11' that lxml complains about.

I can only see it being an issue with lxml if there is an encoding mismatch, but even that wouldn't explain why some sources suggest that certain characters are simply invalid in XML. If that can be confirmed, it should rule out lxml as being at fault.

So, unless it's an encoding mismatch somehow, lxml is operating correctly, and this patch is a workaround for pdftotext not filtering "invalid" XML characters when generating the bbox XHTML. Certain files work fine, as you know, so it could be that pdftotext doesn't know to take into account that those characters might exist (whether the PDF is valid or not!).
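
The behaviour described above can be reproduced in a small, self-contained sketch. It uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml (an assumption made so the example runs without extra packages), and the sample document is invented for illustration rather than taken from real pdftotext output. Code point 11 is the vertical tab, which vim displays as `^K`.

```python
import xml.etree.ElementTree as ET

# Code point 11 is the vertical tab (0x0B), displayed as ^K in vim.
VT = chr(11)

# Invented stand-in for a fragment of pdftotext's XHTML output.
doc = "<html><body><word>end of line" + VT + "</word></body></html>"


def parses(xml_text: str) -> bool:
    """Return True if the text is accepted by the XML parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


raw_ok = parses(doc)                        # document with the vertical tab
filtered_ok = parses(doc.replace(VT, ""))   # same document with it removed
```

If the parser rejects the raw document and accepts the filtered one, that is consistent with the character itself being invalid XML rather than with an encoding mismatch.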

@JoshData
Owner

JoshData commented Jun 8, 2017

Ok. I'll merge. Thanks.

@JoshData JoshData merged commit f9fe805 into JoshData:master Jun 8, 2017
neiljp added a commit to neiljp/pdf-diff that referenced this pull request Jul 11, 2017
Inspired by comment by wookayin in JoshData#12.