Also skip single CR chars in Buffer#read #364
It is valid to delimit the stream content from the stream/endstream keywords by only a CR instead of CRLF or LF. pdf-reader would crash in some cases until this change.
thanks ❤️
Wow, that was fast ;-) Always a pleasure!
It was an easy review with such a focused change and great spec.
The PDF spec does NOT allow a bare CR between the stream token and its data. This is because the data may start with LF: how could you tell whether that LF is the second half of a CRLF or the first byte of the data? See 7.3.8.1 NOTE 2 in the PDF spec. However, it looks like the order of checks in the code ensures that a bare CR is only accepted if the next character is not LF. That is still not allowed by the spec, but at least the patch should not result in losing the first byte of the data. It might be better to make this permissive behaviour optional, so invalid PDFs can be detected.
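To make the order of checks concrete, here is a minimal sketch in Ruby. The helper name `skip_stream_eol` is hypothetical and this is not pdf-reader's actual `Buffer#read`; it only illustrates how CRLF can be tried before a bare CR so that an LF following a CR is always treated as part of the end-of-line marker, and no data byte is dropped otherwise.

```ruby
require "stringio"

# Hypothetical helper, NOT pdf-reader's real implementation.
# Consumes the end-of-line marker that follows the `stream` keyword
# and returns the number of bytes skipped (0, 1 or 2).
def skip_stream_eol(io)
  first = io.read(1)
  case first
  when "\n"
    1                                   # LF alone: allowed by the spec
  when "\r"
    second = io.read(1)
    return 2 if second == "\n"          # CRLF: allowed by the spec
    # Bare CR: not allowed by the spec, accepted here permissively.
    # Put back the byte we peeked at so no stream data is lost.
    io.seek(-1, IO::SEEK_CUR) unless second.nil?
    1
  else
    # No EOL marker at all: rewind and skip nothing.
    io.seek(-1, IO::SEEK_CUR) unless first.nil?
    0
  end
end

io = StringIO.new("\rBT /F1 12 Tf ET")
skip_stream_eol(io)  # => 1
io.read              # => "BT /F1 12 Tf ET"
```

Because CRLF is checked before the bare CR case, data that genuinely starts with LF after a lone CR would be read as CRLF plus data, which is exactly the ambiguity the spec avoids by forbidding a lone CR.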
Damn, you are right. I didn't check the spec, I just had a PDF from the wild that opened perfectly in every viewer I had within reach but would crash pdf-reader. Not sure yet what to do now.
Note that the test file (content_stream_cr_only.pdf) uses LF for EOL apart from after the stream and endstream tokens. However, it does open OK in the viewers that I tried, so it looks like they are quite permissive. I suggest making this permissive behaviour optional if possible.
It does not. It uses bare
That's why I didn't even consider it to be invalid.
Possible, of course. But at a price, of course. We would need to drag this option value along all the way from the top to the bottom, wouldn't we? Or introduce some global state. Do you have a concrete use case for detecting invalid PDF files that are perfectly viewable by all readers?
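For a sense of what dragging the option from top to bottom could look like, here is a rough Ruby sketch. All class names (`SketchReader`, `SketchParser`, `SketchBuffer`) and the `strict:` keyword are hypothetical and are not part of pdf-reader's API; the point is only that every layer has to accept the flag and forward it.

```ruby
require "stringio"

# Hypothetical classes for illustration only, not pdf-reader's API.
class SketchBuffer
  def initialize(io, strict: false)
    @io = io
    @strict = strict
  end

  # Skips the EOL after `stream`; strict mode rejects a bare CR.
  def skip_stream_eol
    byte = @io.read(1)
    return if byte == "\n"
    if byte == "\r"
      nxt = @io.read(1)
      return if nxt == "\n"
      @io.seek(-1, IO::SEEK_CUR) if nxt   # put back the peeked byte
      raise ArgumentError, "bare CR after stream keyword" if @strict
      return
    end
    @io.seek(-1, IO::SEEK_CUR) if byte    # no EOL marker: rewind
  end

  def rest
    @io.read
  end
end

class SketchParser
  def initialize(io, strict: false)
    @buffer = SketchBuffer.new(io, strict: strict)  # forwarded once
  end

  def stream_data
    @buffer.skip_stream_eol
    @buffer.rest
  end
end

class SketchReader
  def initialize(io, strict: false)
    @parser = SketchParser.new(io, strict: strict)  # forwarded again
  end

  def stream_data
    @parser.stream_data
  end
end

SketchReader.new(StringIO.new("\rDATA")).stream_data  # => "DATA"
```

With `strict: true` the same input raises `ArgumentError` in this sketch. The alternative mentioned above, global state, would avoid the forwarding but make the behaviour depend on something outside the object being constructed.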
Is it possible to publish the name of the software that created the PDF? Also, if the intention is to provide a test case that mimics the private file, maybe that should use CR everywhere too.
My general approach to this library has been "If Acrobat can open it, then we probably should too". Nearly every bug report is a version of "I can open this PDF in Acrobat/Evince/Preview, but the gem raises an exception", and I can't bring myself to be pure enough to reply with "I implement the spec, so I won't fix your exception". An opt-in strict parsing mode might be interesting? I wonder about the use case for it, other than validating a PDF against the spec. Updating the comments in
The original file seems to have been created by "HP Scan". This is the As the EOL character is always only CR, I suppose this was done on a Mac (?). Some more explanations: I think updating the comments is a good idea. The strict-parsing option still lacks a proper use case from my perspective. At least for the systems where I use pdf-reader, the input PDF files always come from customers who don't have a clue what the creating software does, let alone the ability to alter it. So my goal is just to make things work for them whenever I can.
It is valid to delimit the stream content from the stream/endstream keywords with only a CR instead of CRLF or LF. pdf-reader would crash in some cases until this change, as shown by the test case.