Filter invalid characters to avoid PCDATA errors #12
Hey, thanks. I'm away for a few days - I'll look more closely when I get back.
Before I merge, can you say something about where you think these characters are coming from?
What do you mean by "coming from"? I haven't found a definitive reference for what is or isn't valid, but my understanding is that XML, or the library being used (or maybe the encoding?), simply does not support certain characters. I followed a simplified form of an empirical solution that I found here:
So. Your patch is making a change at the point where
Ah, good question. Does this help? It's difficult to know whether these characters are in the original PDF, but I have confirmed that they appear in the plain-text output of pdftotext (without the -bbox option). So it could be that pdftotext doesn't check whether the "words" it generates contain only characters that are allowed in XHTML. In this case, my editor (vim) shows a '^K' character at the end of some lines in both the .txt and .html files, which is presumably the 'value 11' that lxml complains about. The only way I can see this being lxml's fault is an encoding mismatch, but even that wouldn't explain why some sources suggest that certain characters are simply invalid in XML. If that can be confirmed, it should rule out lxml as being at fault. So, unless there is an encoding mismatch somehow, lxml is operating correctly, and this patch is a workaround for pdftotext not filtering "invalid" XML characters when it generates the bbox XHTML. Certain files work fine, as you know, so it could be that pdftotext simply doesn't account for the possibility that those characters exist (whether the PDF is valid or not!).
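For reference, the XML 1.0 specification does restrict which characters may appear in a document: tab, line feed, carriage return, and the ranges U+0020 to U+D7FF, U+E000 to U+FFFD and U+10000 to U+10FFFF. Character 11 (vertical tab, displayed as ^K) falls outside that set, which is consistent with the lxml error. The following is only a minimal sketch of a filter built on that rule, not necessarily the code in this patch; the name `strip_invalid_xml_chars` is hypothetical.

```python
import re

# Matches any character that is NOT allowed by the XML 1.0 "Char" production:
# tab, LF, CR, U+0020-U+D7FF, U+E000-U+FFFD, U+10000-U+10FFFF.
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def strip_invalid_xml_chars(text):
    """Return text with characters that XML 1.0 forbids removed.

    Hypothetical helper for illustration; it drops the offending
    characters rather than replacing them.
    """
    return _INVALID_XML_CHARS.sub("", text)


if __name__ == "__main__":
    # A "word" ending in a vertical tab (value 11), as seen in the output.
    print(repr(strip_invalid_xml_chars("word\x0b")))
```

Applying a filter like this to the pdftotext output before handing it to the XML parser would remove the control characters while leaving ordinary whitespace and non-ASCII text untouched.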
Ok. I'll merge. Thanks.
Inspired by comment by wookayin in JoshData#12.
I made these changes to resolve what I believe was a similar issue to #6.