Also skip single CR chars in Buffer#read #364

sbilharz · 2021-09-21T11:31:09Z

It is valid to delimit the stream content from the stream/endstream keywords with only a CR instead of CRLF or LF. pdf-reader would crash in some cases until this change, as shown by the test case.

It is valid to delimit the stream content from the stream/endstream keywords by only a CR instead of CRLF or LF. pdf-reader would crash in some cases until this change.

yob · 2021-09-21T12:15:00Z

thanks ❤️

sbilharz · 2021-09-21T12:29:56Z

Wow, that was fast ;-)

Always a pleasure!

yob · 2021-09-21T12:46:05Z

It was an easy review with such a focused change and great spec.

sebbASF · 2021-10-21T09:06:53Z

The PDF spec does NOT allow a bare CR between the stream token and its data.

This is because the data may start with LF: how can you tell whether the LF is part of CRLF or the start of the data?

See 7.3.8.1 NOTE 2 in the PDF spec at:
https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf

However, it looks like the order of checks in the code will ensure that a bare CR is only accepted if the next character is not LF. This is not allowed by the spec, but at least the patch should not result in losing the first byte of the data.

It might be better to make this permissive behaviour optional, so invalid PDFs can be detected.

sbilharz · 2021-10-21T09:40:17Z

Damn, you are right. I didn't check the spec, I just had a PDF from the wild that was perfectly openable by all viewers I had in my reach but would crash pdf-reader. Not sure what to do now yet.

sebbASF · 2021-10-21T09:49:26Z

Note that the test file (content_stream_cr_only.pdf) uses LF for EOL apart from after the stream and endstream tokens.
This seems very unlikely in the wild - does your failing example have this characteristic? Is it public?

However it does open OK in the viewers that I tried, so it looks like they are quite permissive.

I suggest making this permissive behaviour optional if possible.

sbilharz · 2021-10-21T15:45:31Z

Note that the test file (content_stream_cr_only.pdf) uses LF for EOL apart from after the stream and endstream tokens.
This seems very unlikely in the wild - does your failing example have this characteristic? Is it public?

It does not. It uses bare \rs as EOL everywhere. Unfortunately it isn't public so I can't share it. That's why I built the test file where I just wanted to quickly reproduce the error which occurred exactly when parsing a stream, because the /Length attribute ignored the LF char but pdf-reader didn't.

However it does open OK in the viewers that I tried, so it looks like they are quite permissive.

That's why I didn't even consider it to be invalid.

I suggest making this permissive behaviour optional if possible.

Possible, of course. But at a price, of course. We would need to drag this option value along all from the top to the bottom, wouldn't we? Or introduce some global state.

Do you have a concrete use case to detect invalid PDF files that are perfectly viewable by all readers?

sebbASF · 2021-10-21T18:59:27Z

Is is possible to publish the name of the software that created the PDF?

Also, if the intention is to provide a test case to mimic the private file, maybe that should use CR everywhere too.
[AFAICT a bare CR is valid as an EOL everywhere -- except after stream, where it is ambiguous.]

yob · 2021-10-22T02:55:38Z

My general approach to this library has been "If acrobat can open it, then we probably should too".

Nearly every bug report is a version of "I can open this PDF in acrobat/evince/preview, but the gem raises an exception", and I can't bring myself to be pure enough to reply with "I implement the spec, so I won't fix your exception".

An opt-in strict parsing mode might be interesting? I wonder at the use case for it, other than validating a PDF against the spec.

Updating the comments in lib/pdf/reader/buffer.rb might be interesting, to clarify which behaviour is off-spec but included for compatibility.

sbilharz · 2021-10-22T09:37:26Z

Is is possible to publish the name of the software that created the PDF?

The original file seems to be created by "HP Scan". This is the /Info object from the trailer:
{:Keywords=>"", :Creator=>"HP Scan", :CreationDate=>"D:20210912101258-08'00'", :ModDate=>"D:20210912101258-08'00'", :Author=>"", :Producer=>"HP Scan Extended Application", :Title=>"", :Subject=>""}

As the EOL character is always only CR I suppose this was done on a Mac (?)

Some more explanations:
The file contains several scanned images that couldn't be opened by pdf-reader because of the bare CRs after the stream keywords. They would not be skipped and thus the last byte of the actual stream would not be read, which lead to the following error when the parser tried to look for the endstream keyword next:
PDF::Reader::MalformedPDFError (PDF malformed, expected 'endstream' but found 0 instead)
I am not sure why it says found 0 as the last byte of the stream is actually 0xD9 but I suppose that is a different issue. The /Length provided in the dictionary for the stream is set correctly (assuming the CR is skipped).

I think updating the comments is a good idea. The strict-parsing option still lacks a proper use case from my perspective. At least for the systems where I use pdf-reader, the input PDF files always come from customers who don't have a clue what the creating software does, let alone have the ability to alter it. So my goal is to just make things work for them whenever I can.

Also skip single CR chars in Buffer#read

0050215

It is valid to delimit the stream content from the stream/endstream keywords by only a CR instead of CRLF or LF. pdf-reader would crash in some cases until this change.

yob merged commit 6cc5c66 into yob:main Sep 21, 2021

sbilharz deleted the stream-cr-only branch September 21, 2021 12:26

sebbASF mentioned this pull request Oct 22, 2021

Why OK to allow CR alone #369

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Also skip single CR chars in Buffer#read #364

Also skip single CR chars in Buffer#read #364

sbilharz commented Sep 21, 2021

yob commented Sep 21, 2021

sbilharz commented Sep 21, 2021

yob commented Sep 21, 2021

sebbASF commented Oct 21, 2021

sbilharz commented Oct 21, 2021

sebbASF commented Oct 21, 2021

sbilharz commented Oct 21, 2021

sebbASF commented Oct 21, 2021

yob commented Oct 22, 2021

sbilharz commented Oct 22, 2021

Also skip single CR chars in Buffer#read #364

Also skip single CR chars in Buffer#read #364

Conversation

sbilharz commented Sep 21, 2021

yob commented Sep 21, 2021

sbilharz commented Sep 21, 2021

yob commented Sep 21, 2021

sebbASF commented Oct 21, 2021

sbilharz commented Oct 21, 2021

sebbASF commented Oct 21, 2021

sbilharz commented Oct 21, 2021

sebbASF commented Oct 21, 2021

yob commented Oct 22, 2021

sbilharz commented Oct 22, 2021