Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Also skip single CR chars in Buffer#read #364

Merged
merged 1 commit into from Sep 21, 2021
Merged

Conversation

sbilharz
Copy link
Contributor

It is valid to delimit the stream content from the stream/endstream keywords with only a CR instead of CRLF or LF. pdf-reader would crash in some cases until this change, as shown by the test case.

It is valid to delimit the stream content from the stream/endstream
keywords by only a CR instead of CRLF or LF. pdf-reader would crash in
some cases until this change.
@yob yob merged commit 6cc5c66 into yob:main Sep 21, 2021
@yob
Copy link
Owner

yob commented Sep 21, 2021

thanks ❤️

@sbilharz sbilharz deleted the stream-cr-only branch September 21, 2021 12:26
@sbilharz
Copy link
Contributor Author

Wow, that was fast ;-)

Always a pleasure!

@yob
Copy link
Owner

yob commented Sep 21, 2021

It was an easy review with such a focused change and great spec.

@sebbASF
Copy link
Contributor

sebbASF commented Oct 21, 2021

The PDF spec does NOT allow a bare CR between the stream token and its data.

This is because the data may start with LF: how can you tell whether the LF is part of CRLF or the start of the data?

See 7.3.8.1 NOTE 2 in the PDF spec at:
https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf

However, it looks like the order of checks in the code will ensure that a bare CR is only accepted if the next character is not LF. This is not allowed by the spec, but at least the patch should not result in losing the first byte of the data.

It might be better to make this permissive behaviour optional, so invalid PDFs can be detected.

@sbilharz
Copy link
Contributor Author

Damn, you are right. I didn't check the spec, I just had a PDF from the wild that was perfectly openable by all viewers I had in my reach but would crash pdf-reader. Not sure what to do now yet.

@sebbASF
Copy link
Contributor

sebbASF commented Oct 21, 2021

Note that the test file (content_stream_cr_only.pdf) uses LF for EOL apart from after the stream and endstream tokens.
This seems very unlikely in the wild - does your failing example have this characteristic? Is it public?

However it does open OK in the viewers that I tried, so it looks like they are quite permissive.

I suggest making this permissive behaviour optional if possible.

@sbilharz
Copy link
Contributor Author

Note that the test file (content_stream_cr_only.pdf) uses LF for EOL apart from after the stream and endstream tokens.
This seems very unlikely in the wild - does your failing example have this characteristic? Is it public?

It does not. It uses bare \rs as EOL everywhere. Unfortunately it isn't public so I can't share it. That's why I built the test file where I just wanted to quickly reproduce the error which occurred exactly when parsing a stream, because the /Length attribute ignored the LF char but pdf-reader didn't.

However it does open OK in the viewers that I tried, so it looks like they are quite permissive.

That's why I didn't even consider it to be invalid.

I suggest making this permissive behaviour optional if possible.

Possible, of course. But at a price, of course. We would need to drag this option value along all from the top to the bottom, wouldn't we? Or introduce some global state.

Do you have a concrete use case to detect invalid PDF files that are perfectly viewable by all readers?

@sebbASF
Copy link
Contributor

sebbASF commented Oct 21, 2021

Is is possible to publish the name of the software that created the PDF?

Also, if the intention is to provide a test case to mimic the private file, maybe that should use CR everywhere too.
[AFAICT a bare CR is valid as an EOL everywhere -- except after stream, where it is ambiguous.]

@yob
Copy link
Owner

yob commented Oct 22, 2021

My general approach to this library has been "If acrobat can open it, then we probably should too".

Nearly every bug report is a version of "I can open this PDF in acrobat/evince/preview, but the gem raises an exception", and I can't bring myself to be pure enough to reply with "I implement the spec, so I won't fix your exception".

An opt-in strict parsing mode might be interesting? I wonder at the use case for it, other than validating a PDF against the spec.

Updating the comments in lib/pdf/reader/buffer.rb might be interesting, to clarify which behaviour is off-spec but included for compatibility.

@sbilharz
Copy link
Contributor Author

Is is possible to publish the name of the software that created the PDF?

The original file seems to be created by "HP Scan". This is the /Info object from the trailer:
{:Keywords=>"", :Creator=>"HP Scan", :CreationDate=>"D:20210912101258-08'00'", :ModDate=>"D:20210912101258-08'00'", :Author=>"", :Producer=>"HP Scan Extended Application", :Title=>"", :Subject=>""}

As the EOL character is always only CR I suppose this was done on a Mac (?)

Some more explanations:
The file contains several scanned images that couldn't be opened by pdf-reader because of the bare CRs after the stream keywords. They would not be skipped and thus the last byte of the actual stream would not be read, which lead to the following error when the parser tried to look for the endstream keyword next:
PDF::Reader::MalformedPDFError (PDF malformed, expected 'endstream' but found 0 instead)
I am not sure why it says found 0 as the last byte of the stream is actually 0xD9 but I suppose that is a different issue. The /Length provided in the dictionary for the stream is set correctly (assuming the CR is skipped).

I think updating the comments is a good idea. The strict-parsing option still lacks a proper use case from my perspective. At least for the systems where I use pdf-reader, the input PDF files always come from customers who don't have a clue what the creating software does, let alone have the ability to alter it. So my goal is to just make things work for them whenever I can.

@sebbASF sebbASF mentioned this pull request Oct 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants