"Uncaught (in promise) Error: Invalid Stream!" #12

vitaly-zdanevich · 2018-07-05T16:15:19Z

at error (pdf-lib.js:17961)
at parseStream (pdf-lib.js:30748)
at parseStream$1 (pdf-lib.js:30777)
at parseDictOrStream (pdf-lib.js:30820)
at parseIndirectObj (pdf-lib.js:30853)
at parseBodySection (pdf-lib.js:31079)
at parseDocument (pdf-lib.js:31133)
at PDFParser.parse (pdf-lib.js:31162)
at Function.PDFDocumentFactory.load (pdf-lib.js:31272)
at fetch.then.then.buf (account.js:5)

My code:

fetch('receipt.pdf')
    .then(resp => resp.arrayBuffer())
    .then(buf => {
        PDFLib.PDFDocumentFactory.load(new Uint8Array(buf));
    })

receipt.pdf

Hopding · 2018-07-05T21:51:07Z

I was able to reproduce this parsing error using the receipt.pdf file you provided. I'm looking into the cause, which appears to be a bug in the pdf-lib parser.

Hopding · 2018-07-05T22:33:53Z

I found the cause of the issue and will publish an update containing a fix.

The receipt.pdf document you provided actually does not conform to the PDF specification. It contains a content stream that's missing the end-of-line character (\n or \r) before its endstream keyword. pdf-lib relies on this character to find the end of content streams when parsing (see here):

// Locate the end of the stream
const endstreamIdx =
  arrayIndexOf(trimmed, '\nendstream') ||
  arrayIndexOf(trimmed, '\rendstream');
if (!endstreamIdx && endstreamIdx !== 0) error('Invalid Stream!');

Adding a third search string, the bare endstream keyword with no preceding end-of-line character, lets the document parse successfully:

// Locate the end of the stream
const endstreamIdx =
  arrayIndexOf(trimmed, '\nendstream') ||
  arrayIndexOf(trimmed, '\rendstream') ||
  arrayIndexOf(trimmed, 'endstream');
if (!endstreamIdx && endstreamIdx !== 0) error('Invalid Stream!');
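
For readers who want to try the fallback outside of pdf-lib, here is a minimal, self-contained sketch. The `arrayIndexOf` below is a hypothetical stand-in for pdf-lib's internal helper (its real implementation and return value on a miss are not shown in this thread), and `findEndstream` and `toBytes` are names made up for illustration.

```javascript
// Hypothetical stand-in for pdf-lib's internal arrayIndexOf helper.
// Returns the index of the first occurrence of `target` (an ASCII string)
// in `bytes`, or undefined when it is absent. This is an assumption, not
// the library's actual code.
function arrayIndexOf(bytes, target) {
  const codes = Array.from(target, (ch) => ch.charCodeAt(0));
  for (let i = 0; i + codes.length <= bytes.length; i++) {
    let matched = true;
    for (let j = 0; j < codes.length; j++) {
      if (bytes[i + j] !== codes[j]) {
        matched = false;
        break;
      }
    }
    if (matched) return i;
  }
  return undefined;
}

// Try the markers preceded by an end-of-line character first, then fall
// back to the bare keyword for files like receipt.pdf.
function findEndstream(bytes) {
  for (const marker of ['\nendstream', '\rendstream', 'endstream']) {
    const idx = arrayIndexOf(bytes, marker);
    if (idx !== undefined) return { idx, marker };
  }
  throw new Error('Invalid Stream!');
}

// Helper for building test input from an ASCII string (illustration only).
const toBytes = (text) => Uint8Array.from(text, (ch) => ch.charCodeAt(0));

const wellFormed = findEndstream(toBytes('abc\nendstream'));
const missingEol = findEndstream(toBytes('abc%endstream'));
console.log(wellFormed.idx, missingEol.idx);
```

The sketch checks each result against `undefined` explicitly instead of chaining with `||`, so a match at index 0 is never mistaken for a miss.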

It's not actually uncommon for PDFs in the wild to fail to conform exactly to the specification, but that doesn't mean pdf-lib should fail to parse them. Unfortunately, it's impossible to predict all the ways that they will differ from the spec, so the only way I can identify and fix these problems is when issues like this are filed by pdf-lib users. So, thank you for filing this issue 😄. Hopefully over time as pdf-lib is used on a wider variety of documents these parsing issues can be ironed out completely.

Here is the section of the receipt.pdf document that is missing the end-of-line marker:

stream
xúÌ]Kè„∏�æ˚WË�†Ÿ,æ
... a bunch more binary data ...
7�W˛b˛c˝ãî˝GX[˜∑ú^Ûe˜�%�endstream
endobj

Manually editing the PDF file in a text editor and adding a newline as follows also fixes the problem:

stream
xúÌ]Kè„∏�æ˚WË�†Ÿ,æ
... a bunch more binary data ...
7�W˛b˛c˝ãî˝GX[˜∑ú^Ûe˜�%�
endstream
endobj
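
If you want to check whether other PDFs have the same quirk before editing them by hand, a rough scan like the one below can help. It is only a sketch: `findBareEndstreams` is a made-up name, and the scan is naive, so it can report false positives when compressed stream data happens to contain the bytes of the word endstream.

```javascript
// Return the byte offsets of every `endstream` keyword that is NOT
// immediately preceded by a line feed or carriage return.
// Naive scan: it does not parse the PDF, it only looks at raw bytes.
function findBareEndstreams(bytes) {
  const keyword = Array.from('endstream', (ch) => ch.charCodeAt(0));
  const LF = 0x0a;
  const CR = 0x0d;
  const offsets = [];
  for (let i = 0; i + keyword.length <= bytes.length; i++) {
    const matches = keyword.every((code, j) => bytes[i + j] === code);
    if (!matches) continue;
    const prev = i > 0 ? bytes[i - 1] : undefined;
    if (prev !== LF && prev !== CR) offsets.push(i);
  }
  return offsets;
}

// Small in-memory sample: the first stream lacks the end-of-line marker,
// the second one has it.
const sampleText =
  'stream\nabc%endstream\nendobj\nstream\nxyz\nendstream\nendobj';
const sampleBytes = Uint8Array.from(sampleText, (ch) => ch.charCodeAt(0));
const bareOffsets = findBareEndstreams(sampleBytes);
console.log(bareOffsets);
```

In Node.js the same function should accept the Buffer returned by `fs.readFileSync`, since a Buffer is a Uint8Array.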

Also, if you're interested, here's the relevant section of the PDF 1.7 Specification which specifies that a \n or \r character should precede the endstream keyword:

7.3.8.1 General

A stream object, like a string object, is a sequence of bytes. Furthermore, a stream may be of unlimited length, whereas a string shall be subject to an implementation limit. For this reason, objects with potentially large amounts of data, such as images and page descriptions, shall be represented as streams.

A stream shall consist of a dictionary followed by zero or more bytes bracketed between the keywords stream (followed by newline) and endstream.

EXAMPLE:
...dictionary...
stream
...Zero or more bytes...
endstream

All streams shall be indirect objects (see 7.3.10, "Indirect Objects") and the stream dictionary shall be a direct object. The keyword stream that follows the stream dictionary shall be followed by an end-of-line marker consisting of either a CARRIAGE RETURN and a LINE FEED or just a LINE FEED, and not by a CARRIAGE RETURN alone. The sequence of bytes that make up a stream lie between the end-of-line marker following the stream keyword and the endstream keyword; the stream dictionary specifies the exact number of bytes. There should be an end-of-line marker after the data and before endstream; this marker shall not be included in the stream length. There shall not be any extra bytes, other than white space, between endstream and endobj.
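
Because the stream dictionary gives the exact byte count, the end of the data can also be located from that length rather than by searching for a keyword. The sketch below illustrates the idea under stated assumptions: the caller already knows the offset where the stream data starts and the declared length, and `classifyStreamEnd` is a hypothetical name. This is not how pdf-lib does it.

```javascript
// Classify what follows a stream's data, given the offset where the data
// starts and the byte count declared in the stream dictionary.
// Returns 'eol-present', 'missing-eol', or 'length-mismatch'.
function classifyStreamEnd(bytes, dataStart, length) {
  const keyword = 'endstream';
  const startsWithKeyword = (offset) =>
    offset + keyword.length <= bytes.length &&
    Array.from(keyword).every(
      (ch, j) => bytes[offset + j] === ch.charCodeAt(0),
    );

  const dataEnd = dataStart + length;
  // Keyword directly after the data: the end-of-line marker is missing.
  if (startsWithKeyword(dataEnd)) return 'missing-eol';

  // Otherwise skip one end-of-line marker (CR, LF, or CR LF) and retry.
  let cursor = dataEnd;
  if (bytes[cursor] === 0x0d) cursor++;
  if (bytes[cursor] === 0x0a) cursor++;
  if (cursor > dataEnd && startsWithKeyword(cursor)) return 'eol-present';

  // Neither layout matched, so the declared length is probably wrong.
  return 'length-mismatch';
}

const asBytes = (text) => Uint8Array.from(text, (ch) => ch.charCodeAt(0));

// 'stream\n' occupies offsets 0-6, so the data starts at offset 7.
const withEol = classifyStreamEnd(asBytes('stream\nabc\nendstream'), 7, 3);
const withoutEol = classifyStreamEnd(asBytes('stream\nabcendstream'), 7, 3);
console.log(withEol, withoutEol);
```

A length-based check like this only works when the declared length is trustworthy, which is not guaranteed for PDFs in the wild, so a keyword search remains a useful fallback.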

Hopding · 2018-07-06T00:39:23Z

@vitaly-zdanevich I've just published version 0.2.1-rc1. It contains a fix for this parsing bug, as well as a new method: PDFPage.normalizeCTM.

This new method essentially resets the graphics state of a PDFPage to its default state. This can be useful if you're modifying a PDF that modifies the graphics state and doesn't clean up its changes. I noticed that this was the case with the receipt.pdf that you shared, so if you're planning on modifying that PDF, you may wish to call this new method:

import { PDFDocumentFactory, PDFDocumentWriter, drawText } from 'pdf-lib';

const pdfDoc = PDFDocumentFactory.load(existingPdfDocBytes);
const [helveticaFont] = pdfDoc.embedStandardFont('Helvetica');

const pages = pdfDoc.getPages();
const page  = pages[0];

page
  .addFontDictionary('Helvetica', helveticaFont)
  .normalizeCTM(pdfDoc);

const contentStream = pdfDoc.createContentStream(
  drawText('This text was added to the PDF with JavaScript!', {
    x: 25,
    y: 25,
    size: 24,
    font: 'Helvetica',
    colorRgb: [0.95, 0.26, 0.21],
  }),
);

page.addContentStreams(pdfDoc.register(contentStream));

const pdfBytes = PDFDocumentWriter.saveToBytes(pdfDoc);

You can install this new version with npm:

npm install pdf-lib@0.2.1-rc1

It's also available on unpkg:

Hopding · 2018-07-06T01:57:12Z

I do want to mention that I don't think it's desirable to have to manually call PDFPage.normalizeCTM. I'd like for it to happen automatically. But I need to think about how to do that, and I wanted to get a fix released quickly.

vitaly-zdanevich · 2018-07-06T07:35:41Z

Thank you, the PDF now loads without errors.

Hopding · 2018-07-06T21:10:41Z

@vitaly-zdanevich Do you mind if I add the receipt.pdf file you shared to the repo to run tests against? (I assume not, since you shared it in this thread, but I just want to be sure 😄)

Hopding · 2018-07-07T20:59:38Z

Version 0.2.1 is now published. It contains the parsing fix, as well as automated content stream normalization (calling normalizeCTM manually is no longer necessary).

You can install this new version with npm:

npm install pdf-lib@0.2.1

It's also available on unpkg:

vitaly-zdanevich · 2018-07-08T10:06:15Z

Yes, you can use this PDF. It comes from Stripe, a popular payment gateway.

Hopding added a commit that referenced this issue Jul 7, 2018

Add integration test for #12 fix

ef0dbb0

Hopding closed this as completed Jul 7, 2018
