"Uncaught (in promise) Error: Invalid Stream!" #12

Closed
vitaly-zdanevich opened this issue Jul 5, 2018 · 8 comments

@vitaly-zdanevich

at error (pdf-lib.js:17961)
at parseStream (pdf-lib.js:30748)
at parseStream$1 (pdf-lib.js:30777)
at parseDictOrStream (pdf-lib.js:30820)
at parseIndirectObj (pdf-lib.js:30853)
at parseBodySection (pdf-lib.js:31079)
at parseDocument (pdf-lib.js:31133)
at PDFParser.parse (pdf-lib.js:31162)
at Function.PDFDocumentFactory.load (pdf-lib.js:31272)
at fetch.then.then.buf (account.js:5)

My code:

fetch('receipt.pdf')
    .then(resp => resp.arrayBuffer())
    .then(buf => {
        PDFLib.PDFDocumentFactory.load(new Uint8Array(buf));
    })

receipt.pdf

@Hopding
Owner

Hopding commented Jul 5, 2018

I was able to reproduce this parsing error using the receipt.pdf file you provided. Looking into the cause of it (seems to be a bug in the pdf-lib parser).

@Hopding
Owner

Hopding commented Jul 5, 2018

I found the cause of the issue and will publish an update containing a fix.

The receipt.pdf document you provided actually does not conform to the PDF specification. It contains a content stream that's missing the \n character before its endstream keyword. pdf-lib relies on this character to find the end of content streams when parsing (see here):

// Locate the end of the stream
const endstreamIdx =
  arrayIndexOf(trimmed, '\nendstream') ||
  arrayIndexOf(trimmed, '\rendstream');
if (!endstreamIdx && endstreamIdx !== 0) error('Invalid Stream!');

Adding another keyword string results in a successful parsing of the document:

// Locate the end of the stream
const endstreamIdx =
  arrayIndexOf(trimmed, '\nendstream') ||
  arrayIndexOf(trimmed, '\rendstream') ||
  arrayIndexOf(trimmed, 'endstream');
if (!endstreamIdx && endstreamIdx !== 0) error('Invalid Stream!');
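For readers who want to experiment with the idea, here is a minimal, self-contained sketch of the same fallback search. It is not pdf-lib's actual code: `indexOfBytes`, `findEndstream`, and `toBytes` are hypothetical names, and returning -1 for "not found" is an assumption made here so that a match at index 0 needs no special handling.

```javascript
// Hypothetical helper: index of the first occurrence of an ASCII keyword
// in a byte array, or -1 if the keyword does not occur.
function indexOfBytes(bytes, keyword) {
  const needle = Array.from(keyword, (ch) => ch.charCodeAt(0));
  for (let i = 0; i + needle.length <= bytes.length; i++) {
    let matched = true;
    for (let j = 0; j < needle.length; j++) {
      if (bytes[i + j] !== needle[j]) {
        matched = false;
        break;
      }
    }
    if (matched) return i;
  }
  return -1;
}

// Try the spec-conforming terminators first, then fall back to the bare
// keyword so that streams without an end-of-line marker still parse.
function findEndstream(bytes) {
  for (const keyword of ['\nendstream', '\rendstream', 'endstream']) {
    const idx = indexOfBytes(bytes, keyword);
    if (idx !== -1) return idx;
  }
  throw new Error('Invalid Stream!');
}

// Hypothetical helper for building test input from an ASCII string.
const toBytes = (str) => Uint8Array.from(str, (ch) => ch.charCodeAt(0));

// Conforming stream: the match starts at the newline before the keyword.
console.log(findEndstream(toBytes('abc\nendstream\nendobj')));
// Non-conforming stream: the match starts at the keyword itself.
console.log(findEndstream(toBytes('abc%endstream\nendobj')));
```

Keeping the bare `endstream` search last means conforming files still resolve to the end-of-line marker, so that marker is not mistaken for stream data.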

It's not actually uncommon for PDFs in the wild to fail to conform exactly to the specification, but that doesn't mean pdf-lib should fail to parse them. Unfortunately, it's impossible to predict all the ways that they will differ from the spec, so the only way I can identify and fix these problems is when issues like this are filed by pdf-lib users. So, thank you for filing this issue 😄. Hopefully over time as pdf-lib is used on a wider variety of documents these parsing issues can be ironed out completely.


Here is the section of the receipt.pdf document that is missing the end-of-line marker:

stream
xúÌ]Kè„∏�æ˚WË�†Ÿ,æ
... a bunch more binary data ...
7�W˛b˛c˝ãî˝GX[˜∑ú^Ûe˜�%�endstream
endobj

Manually editing the PDF file in a text editor and adding a newline as follows also fixes the problem:

stream
xúÌ]Kè„∏�æ˚WË�†Ÿ,æ
... a bunch more binary data ...
7�W˛b˛c˝ãî˝GX[˜∑ú^Ûe˜�%�
endstream
endobj

Also, if you're interested, here's the relevant section of the PDF 1.7 Specification which specifies that a \n or \r character should precede the endstream keyword:

7.3.8.1 General

A stream object, like a string object, is a sequence of bytes. Furthermore, a stream may be of unlimited length, whereas a string shall be subject to an implementation limit. For this reason, objects with potentially large amounts of data, such as images and page descriptions, shall be represented as streams.

A stream shall consist of a dictionary followed by zero or more bytes bracketed between the keywords stream (followed by newline) and endstream.

EXAMPLE:

...dictionary... 
stream
...Zero or more bytes... 
endstream

All streams shall be indirect objects (see 7.3.10, "Indirect Objects") and the stream dictionary shall be a direct object. The keyword stream that follows the stream dictionary shall be followed by an end-of-line marker consisting of either a CARRIAGE RETURN and a LINE FEED or just a LINE FEED, and not by a CARRIAGE RETURN alone. The sequence of bytes that make up a stream lie between the end-of-line marker following the stream keyword and the endstream keyword; the stream dictionary specifies the exact number of bytes. There should be an end-of-line marker after the data and before endstream; this marker shall not be included in the stream length. There shall not be any extra bytes, other than white space, between endstream and endobj.
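The passage above also says the stream dictionary gives the exact byte count, which suggests a complementary check that does not depend on searching for a keyword. The following is only a sketch under assumptions: `checkStreamEnd` and `asciiBytes` are hypothetical names, and the caller is assumed to already know the offset where the stream data begins and the declared length. It is not how pdf-lib is implemented.

```javascript
const CR = 0x0d;
const LF = 0x0a;

// Hypothetical check: given the offset of the first data byte and the
// declared length, confirm that `endstream` follows the data, and report
// whether the recommended end-of-line marker is present.
function checkStreamEnd(bytes, dataStart, length) {
  let pos = dataStart + length;
  let eolBeforeEndstream = false;

  // Accept CR, LF, or CR LF between the data and the keyword.
  if (bytes[pos] === CR) { pos++; eolBeforeEndstream = true; }
  if (bytes[pos] === LF) { pos++; eolBeforeEndstream = true; }

  const keyword = 'endstream';
  for (let i = 0; i < keyword.length; i++) {
    if (bytes[pos + i] !== keyword.charCodeAt(i)) {
      throw new Error('Declared length does not lead to endstream');
    }
  }
  return { dataEnd: dataStart + length, eolBeforeEndstream };
}

// Hypothetical helper for building test input from an ASCII string.
const asciiBytes = (str) => Uint8Array.from(str, (ch) => ch.charCodeAt(0));

// 'stream\n' occupies bytes 0-6, so the 5 data bytes start at offset 7.
const conforming = asciiBytes('stream\nHELLO\nendstream\nendobj');
const missingEol = asciiBytes('stream\nHELLOendstream\nendobj');

console.log(checkStreamEnd(conforming, 7, 5).eolBeforeEndstream);
console.log(checkStreamEnd(missingEol, 7, 5).eolBeforeEndstream);
```

A check like this only works when the declared length is trustworthy, so a keyword search remains a useful fallback for files where it is wrong.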

@Hopding
Owner

Hopding commented Jul 6, 2018

@vitaly-zdanevich I've just published version 0.2.1-rc1. It contains a fix for this parsing bug, as well as a new method: PDFPage.normalizeCTM.

This new method essentially resets the graphics state of a PDFPage to its default state. This can be useful if you're modifying a PDF that modifies the graphics state and doesn't clean up its changes. I noticed that this was the case with the receipt.pdf that you shared, so if you're planning on modifying that PDF, you may wish to call this new method:

import { PDFDocumentFactory, PDFDocumentWriter, drawText } from 'pdf-lib';

const pdfDoc = PDFDocumentFactory.load(existingPdfDocBytes);
const [helveticaFont] = pdfDoc.embedStandardFont('Helvetica');

const pages = pdfDoc.getPages();
const page  = pages[0];

page
  .addFontDictionary('Helvetica', helveticaFont)
  .normalizeCTM(pdfDoc);

const contentStream = pdfDoc.createContentStream(
  drawText('This text was added to the PDF with JavaScript!', {
    x: 25,
    y: 25,
    size: 24,
    font: 'Helvetica',
    colorRgb: [0.95, 0.26, 0.21],
  }),
);

page.addContentStreams(pdfDoc.register(contentStream));

const pdfBytes = PDFDocumentWriter.saveToBytes(pdfDoc);

You can install this new version with npm:

npm install pdf-lib@0.2.1-rc1

It's also available on unpkg:

@Hopding
Owner

Hopding commented Jul 6, 2018

I do want to mention that I don't think it's desirable to have to manually call PDFPage.normalizeCTM. I'd like for it to happen automatically. But I need to think about how to do that, and I wanted to get a fix released quickly.

@vitaly-zdanevich
Author

Thank you, the PDF now loads without errors.

@Hopding
Owner

Hopding commented Jul 6, 2018

@vitaly-zdanevich Do you mind if I add the receipt.pdf file you shared to the repo to run tests against? (I assume not, since you shared it in this thread, but I just want to be sure 😄)

Hopding added a commit that referenced this issue Jul 7, 2018
@Hopding
Owner

Hopding commented Jul 7, 2018

Version 0.2.1 is now published. It contains the parsing fix, as well as automated content stream normalization (calling normalizeCTM manually is no longer necessary).

You can install this new version with npm:

npm install pdf-lib@0.2.1

It's also available on unpkg:

@Hopding Hopding closed this as completed Jul 7, 2018
@vitaly-zdanevich
Author

Yes, you can use this PDF. It is from Stripe, a popular payment gateway.
