Filter invalid characters to avoid PCDATA errors #12

neiljp · 2017-06-03T01:18:00Z

I made these changes to resolve what I believe is an issue similar to #6.

JoshData · 2017-06-03T11:53:08Z

Hey thanks. I'm away for a few days - I'll look more closely when I get back.

JoshData · 2017-06-06T13:19:34Z

Before I merge, can you say something about where you think these characters are coming from?

neiljp · 2017-06-07T00:57:08Z

What do you mean by coming from?

I've not found a definitive reference for what is or is not valid, but my understanding is that XML itself, or the library being used (or maybe the encoding?), simply does not support certain characters.

I followed a simplified form of an empirical solution which I found here:
https://stackoverflow.com/questions/8888628/how-should-i-deal-with-an-xmlsyntaxerror-in-pythons-lxml-while-parsing-a-large
However, if there is a better approach, such as clarifying the expected character set or explicitly checking against the XML/PDF standard, that would also be great.
I ran into what may be a similar approach elsewhere, here:
neitanod/forceutf8#39
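
For reference, the XML 1.0 specification does define the allowed characters in its `Char` production: tab, line feed, carriage return, and the ranges U+0020 to U+D7FF, U+E000 to U+FFFD, and U+10000 to U+10FFFF. Below is a minimal sketch of a filter built on that production. It is not the code in this patch: the name `strip_invalid_xml_chars` is hypothetical, and it assumes the pdftotext output has already been decoded to a Python 3 `str`.

```python
import re

# Matches any character that falls OUTSIDE the XML 1.0 "Char" production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Return `text` with characters that XML 1.0 disallows removed.

    Hypothetical helper for illustration; assumes `text` is a decoded str.
    """
    return _INVALID_XML_CHARS.sub("", text)


if __name__ == "__main__":
    # chr(11) is the vertical tab that vim displays as ^K.
    sample = "end of line" + chr(11) + "\n"
    print(repr(strip_invalid_xml_chars(sample)))
```

Dropping the characters is the simplest choice; replacing them with a space or U+FFFD instead would preserve character offsets if those matter downstream.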

JoshData · 2017-06-07T12:16:38Z

So. Your patch is making a change at the point where pdftotext's output becomes the input for lxml's parser. lxml is saying that there are some bytes in the file that aren't valid XML. My question is, why would pdftotext create invalid XML? Under what circumstances does this happen? Or, is pdftotext operating correctly and maybe the problem is in how lxml is interpreting the bytes?

neiljp · 2017-06-07T17:40:39Z

Ah, good question. Does this help?

It's difficult to know whether these characters are in the original PDF, but I have confirmed that they appear in the plain-text output of pdftotext (without the -bbox option). So it could be that pdftotext doesn't check whether the "words" it generates contain only characters that are valid in XHTML.

In this case, my editor (vim) shows a '^K' character at the end of some lines in both the .txt and .html files. That is presumably the 'value 11' that lxml complains about.

I can only see this being an lxml problem if there is an encoding mismatch, and even that wouldn't explain why some sources say certain characters are simply invalid in XML. If that can be confirmed, it should rule out lxml as being at fault.

So, unless it's an encoding mismatch somehow, lxml is operating correctly, and this patch is a workaround for pdftotext not filtering "invalid" XML characters when generating the bbox XHTML. Certain files work fine, as you know, so it could be that pdftotext doesn't know to take into account that those characters might exist (whether the PDF is valid or not!).
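
One way to check the "simply invalid XML" reading without involving lxml at all: the sketch below tests character value 11 against the XML 1.0 `Char` ranges and then feeds it to Python's standard-library parser (`xml.etree.ElementTree`, which is backed by expat rather than libxml2). The helper names `is_valid_xml_char` and `parses` are hypothetical, and the `<p>` snippet is made up for illustration rather than taken from real pdftotext output.

```python
import xml.etree.ElementTree as ET


def is_valid_xml_char(ch):
    """True if `ch` is allowed by the XML 1.0 "Char" production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def parses(text):
    """True if the standard-library XML parser accepts `text`."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    vertical_tab = chr(11)  # displayed as ^K in vim
    print("allowed by XML 1.0:", is_valid_xml_char(vertical_tab))
    print("parses with ^K:    ", parses("<p>word" + vertical_tab + "</p>"))
    print("parses without ^K: ", parses("<p>word</p>"))
```

If a second, independent parser also rejects the character, that points at the input rather than at lxml.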

JoshData · 2017-06-08T21:45:39Z

Ok. I'll merge. Thanks.

Inspired by comment by wookayin in JoshData#12.

Filter invalid characters to avoid PCDATA errors

59e3fa5

neiljp mentioned this pull request Jun 3, 2017

lxml error #6

Open

JoshData merged commit f9fe805 into JoshData:master Jun 8, 2017

neiljp added a commit to neiljp/pdf-diff that referenced this pull request Jul 11, 2017

Add specific check for python 3+.

9897f28

Inspired by comment by wookayin in JoshData#12.

neiljp added a commit to neiljp/pdf-diff that referenced this pull request Jul 11, 2017

Add specific check for python 3+.

ed92fc8

Inspired by comment by wookayin in JoshData#12.

neiljp mentioned this pull request Jul 11, 2017

Useability changes (python3 check, import removal, stderr usage) #16

Merged
