How to Decode Content Streams (for Text/Paragraph Parsing)? #296

cshenks · 2019-12-31T00:18:34Z

Following up on #137, I would also like to use pdf-lib to extract and modify the text content of PDFs. I've been looking into traversing the structure tree to identify paragraphs. I've been able to accomplish this, but when I reach a structure element dictionary whose kids array contains references to portions of a page content stream, I've been unable to figure out how to convert that portion of the content stream into readable text. Is this doable?

const traverseStructTree = (root) => {
  const kidsRef = root.dict.get(PDFName.of('K'));
  const structElementType = root.dict.get(PDFName.of('S'));
  const paragraphType = PDFName.of('P');
  if (structElementType === paragraphType) {
    console.log("Paragraph", root);
    const page = root.context.lookup(root.dict.get(PDFName.of('Pg')));
    const contents = page.Contents();
    const markedContentIdentifer = kidsRef;
    console.log(contents, markedContentIdentifer);
    // How to extract text based on content identifier?
  }
  let node;
  if (!kidsRef || kidsRef instanceof PDFNumber) return;


  if (kidsRef instanceof PDFRef) {
    node = root.context.lookup(kidsRef);
    traverseStructTree(node);
  } else if (kidsRef instanceof PDFArray) {
    for (let idx = 0, len = kidsRef.size(); idx < len; idx++) {
      const nodeRef = kidsRef.get(idx);
      node = root.context.lookup(nodeRef);
      if (!(node instanceof PDFDict)) return;
      traverseStructTree(node);
    }
  }
};

const structTreeRoot = pdfDoc.catalog.lookup(PDFName.of('StructTreeRoot'));


Hopding · 2019-12-31T01:02:34Z

@cshenks Can you please provide an example document I can run this script on?

cshenks · 2019-12-31T20:35:13Z

@Hopding here's the document I've been using: f1099msc.pdf. This script only handles marked content identifiers that are specified by a PDFNumber.

Hopding · 2020-01-01T13:25:43Z

@cshenks

The reason the content streams are not legible is that they are all encoded. So if you want to process their contents, you'll first need to decode them. Fortunately, pdf-lib already contains the code needed to do this (it's required to parse object streams and xref streams). The specific function you'll need to use is decodePDFRawStream:

pdf-lib/src/core/streams/decode.ts

Lines 48 to 69 in 9535e35

    
export const decodePDFRawStream = ({ dict, contents }: PDFRawStream) => {
  let stream: StreamType = new Stream(contents);
  const Filter = dict.lookup(PDFName.of('Filter'));
  const DecodeParms = dict.lookup(PDFName.of('DecodeParms'));
  if (Filter instanceof PDFName) {
    stream = decodeStream(stream, Filter, DecodeParms);
  } else if (Filter instanceof PDFArray) {
    for (let idx = 0, len = Filter.size(); idx < len; idx++) {
      stream = decodeStream(
        stream,
        Filter.lookup(idx, PDFName),
        DecodeParms && (DecodeParms as PDFArray).lookup(idx),
      );
    }
  } else if (!!Filter) {
    throw new UnexpectedObjectTypeError([PDFName, PDFArray], Filter);
  }
  return stream;
};

However, this function is not exported as it has only been used internally up to now. So if you're using the UMD modules, you can't really access it. But if you're using the NPM package, you can pull it out of pdf-lib/es/core/streams/decode or pdf-lib/cjs/core/streams/decode.
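
To make the filter handling in decodePDFRawStream more concrete, here is a minimal, self-contained sketch of the same idea: the Filter entry names either one decoder or an ordered list of decoders, and each decoder is applied to the output of the previous one. The names `Decoder`, `decoders`, `applyFilters`, `asciiHexDecode`, `toBytes`, and `toText` are hypothetical and exist only for this illustration; they are not pdf-lib APIs. Only the ASCIIHexDecode filter is implemented, because it is short enough to show in full; real content streams most commonly use FlateDecode instead.

```typescript
type Decoder = (input: Uint8Array) => Uint8Array;

// ASCIIHexDecode: pairs of hex digits become bytes, non-hex characters such
// as whitespace are skipped, and '>' marks the end of the data. An odd
// trailing digit is padded with '0'.
const asciiHexDecode: Decoder = (input) => {
  let hex = '';
  for (let i = 0; i < input.length; i++) {
    const ch = String.fromCharCode(input[i]);
    if (ch === '>') break;
    if (/[0-9a-fA-F]/.test(ch)) hex += ch;
  }
  if (hex.length % 2 === 1) hex += '0';
  const out: number[] = [];
  for (let i = 0; i < hex.length; i += 2) {
    out.push(parseInt(hex.slice(i, i + 2), 16));
  }
  return new Uint8Array(out);
};

// Hypothetical lookup table from filter name to decoder (illustration only).
const decoders: Record<string, Decoder> = { ASCIIHexDecode: asciiHexDecode };

// Apply a single filter name, or an array of names in order.
const applyFilters = (
  data: Uint8Array,
  filter?: string | string[],
): Uint8Array => {
  const names =
    filter === undefined ? [] : Array.isArray(filter) ? filter : [filter];
  return names.reduce((bytes, name) => {
    const decode = decoders[name];
    if (!decode) throw new Error(`Unsupported filter: ${name}`);
    return decode(bytes);
  }, data);
};

const toBytes = (s: string): Uint8Array => {
  const bytes = new Uint8Array(s.length);
  for (let i = 0; i < s.length; i++) bytes[i] = s.charCodeAt(i);
  return bytes;
};

const toText = (bytes: Uint8Array): string => {
  let s = '';
  for (let i = 0; i < bytes.length; i++) s += String.fromCharCode(bytes[i]);
  return s;
};

console.log(toText(applyFilters(toBytes('42 54 >'), 'ASCIIHexDecode'))); // BT
```

With an array of names, the output of the first decoder feeds the second, which mirrors the loop over Filter in the snippet above.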

I modified your example using decodePDFRawStream to convert the contents streams into strings. Then, for each marked content identifier, I generate and apply a regex to find the section of the content stream that corresponds to it:

import fs from 'fs';

import {
  arrayAsString,
  PDFArray,
  PDFDict,
  PDFDocument,
  PDFName,
  PDFNumber,
  PDFPageLeaf,
  PDFRawStream,
  PDFRef,
} from 'pdf-lib';

// Note that this little guy isn't really accessible in the UMD modules, as he
// is not exported to the root, as of `pdf-lib@1.3.0`. But perhaps this will
// change in the next release.
import { decodePDFRawStream } from 'pdf-lib/cjs/core/streams/decode';

const markedContentRegex = (mcid: number) =>
  new RegExp(`<<[^]*\\/MCID[\\0\\t\\n\\f\\r\\ ]*${mcid}[^]*>>[^]*BDC([^]*)EMC`);

const extractMarkedContent = (mcid: number, contentStream: string) => {
  const regex = markedContentRegex(mcid);
  const res = contentStream.match(regex);
  return res?.[1];
};

const traverseStructTree = (root: PDFDict) => {
  const kidsRef = root.get(PDFName.of('K'));
  const structElementType = root.get(PDFName.of('S'));
  const paragraphType = PDFName.of('P');

  if (structElementType === paragraphType) {
    // TODO: What if this isn't a `PDFPageLeaf`?
    const page = root.lookup(PDFName.of('Pg')) as PDFPageLeaf;

    // TODO: What if this isn't a `PDFRawStream`?
    const contents = page.Contents() as PDFRawStream;

    // TODO: What if this isn't a `PDFNumber`?
    const markedContentIdentifer = kidsRef! as PDFNumber;
    const mcid = markedContentIdentifer.value();

    console.log(`------- Marked Content (id=${mcid}) --------`);
    const decodedBytes = decodePDFRawStream(contents).decode();
    const decodedString = arrayAsString(decodedBytes);
    const content = extractMarkedContent(mcid, decodedString);
    console.log(content);
    console.log(`-------- End (id=${mcid}) ---------`);
    console.log();
  }

  let node;
  if (!kidsRef || kidsRef instanceof PDFNumber) return;

  if (kidsRef instanceof PDFRef) {
    node = root.context.lookup(kidsRef, PDFDict);
    traverseStructTree(node);
  } else if (kidsRef instanceof PDFArray) {
    for (let idx = 0, len = kidsRef.size(); idx < len; idx++) {
      const nodeRef = kidsRef.get(idx);
      node = root.context.lookup(nodeRef);
      if (!(node instanceof PDFDict)) return;
      traverseStructTree(node);
    }
  }
};

(async () => {
  const pdfDoc = await PDFDocument.load(fs.readFileSync('f1099msc.pdf'));

  const structTreeRoot = pdfDoc.catalog.lookup(
    PDFName.of('StructTreeRoot'),
    PDFDict,
  );

  traverseStructTree(structTreeRoot);
})();

Running this will output the following:

------- Marked Content (id=0) --------
 
0 -1.869 TD
(See IRS Publications 1141, 1167, and 1179 for more information about pri\
nting these tax )Tj
0 -1.2 TD
(forms.)Tj

-------- End (id=0) ---------

------- Marked Content (id=1) --------
 
0 -1.869 TD
(See IRS Publications 1141, 1167, and 1179 for more information about pri\
nting these tax )Tj
0 -1.2 TD
(forms.)Tj

-------- End (id=1) ---------

...

------- Marked Content (id=10) --------
 
/T1_0 1 Tf
0 -1.275 TD
(Need help? )Tj
/T1_1 1 Tf
(If you have questions about reporting)Tj
/T1_0 1 Tf
( )Tj
/T1_1 1 Tf
(on )Tj
0 -1.075 TD
(Form 1099-MISC, call the information reporting)Tj
/T1_0 1 Tf
( )Tj
/T1_1 1 Tf
T*
(customer service site toll free at 866-455-7438)Tj
/T1_0 1 Tf
( )Tj
/T1_1 1 Tf
(or )Tj
T*
(304-263-8700 \(not toll free\). Persons with a hearing or )Tj
T*
(speech disability with access to TTY/TDD)Tj
/T1_0 1 Tf
( )Tj
/T1_1 1 Tf
(equipment )Tj
T*
(can call 304-579-4827 \(not toll free\). )Tj

-------- End (id=10) ---------

To obtain sentences/paragraphs of text, you'll need to parse and process the graphics operators in the marked content streams.
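
As a rough starting point for that processing, the sketch below pulls the literal strings (the `( ... )` operands) out of a marked content chunk and unescapes them. It is only a sketch under strong assumptions: the fonts use a simple, ASCII-compatible single-byte encoding, as they appear to in this document; every literal string in the chunk is text to be shown; and hex strings, inline images, and text positioning are ignored. `extractLiteralStrings` and `ESCAPES` are hypothetical names, not part of pdf-lib.

```typescript
// Single-character escapes that may follow a backslash in a literal string.
const ESCAPES: Record<string, string> = {
  n: '\n', r: '\r', t: '\t', b: '\b', f: '\f',
  '(': '(', ')': ')', '\\': '\\',
};

const extractLiteralStrings = (chunk: string): string[] => {
  const strings: string[] = [];
  let i = 0;
  while (i < chunk.length) {
    if (chunk[i] !== '(') { i++; continue; }
    let depth = 1; // balanced, unescaped parentheses may nest inside a string
    let text = '';
    i++;
    while (i < chunk.length && depth > 0) {
      const ch = chunk[i];
      if (ch === '\\') {
        const next = chunk[i + 1];
        if (next === '\r' || next === '\n') {
          // Backslash + end-of-line is a line continuation: drop both.
          i += next === '\r' && chunk[i + 2] === '\n' ? 3 : 2;
        } else if (next !== undefined && /[0-7]/.test(next)) {
          // Octal escape with one to three digits, e.g. \101 for "A".
          const octal = /^[0-7]{1,3}/.exec(chunk.slice(i + 1, i + 4))![0];
          text += String.fromCharCode(parseInt(octal, 8) & 0xff);
          i += 1 + octal.length;
        } else {
          // Known escape, or an unknown one where the backslash is dropped.
          text += next === undefined ? '' : ESCAPES[next] ?? next;
          i += 2;
        }
      } else if (ch === '(') {
        depth++; text += ch; i++;
      } else if (ch === ')') {
        depth--; if (depth > 0) text += ch; i++;
      } else {
        text += ch; i++;
      }
    }
    strings.push(text);
  }
  return strings;
};

// The first marked content chunk from the output above.
const chunk = [
  '0 -1.869 TD',
  '(See IRS Publications 1141, 1167, and 1179 for more information about pri\\',
  'nting these tax )Tj',
  '0 -1.2 TD',
  '(forms.)Tj',
].join('\n');

console.log(extractLiteralStrings(chunk).join(''));
```

Joining the strings without separators works for this chunk because the line breaks are expressed through TD operators rather than characters; a fuller implementation would track the text positioning operators to decide where lines and words end.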

The above example is written in TypeScript. I also created a working NPM script you can use: extract-marked-content.zip

There are a couple of important things to note about this script/example:

It's tailored to the specific PDF you shared. There are a number of scenarios that are not handled properly, such as when the page has an array of content streams rather than a single one.
Not all PDFs have a structure tree. So this approach will not generalize to arbitrary PDF documents.
As noted above, the example uses an unexported function. However, I can see the value of exporting this as content stream decoding is a fairly common task. So I'll probably export it in the next pdf-lib release.
Finally, and this is probably the most important caveat, the specific PDF you shared appears to use fonts with simple ASCII character encodings. This makes extracting text strings from the text drawing operators delightfully easy to do. But this absolutely does not generalize. Many documents encode text in far more obscure ways that require you to decode them using cmaps. Obviously, this is not handled in the above example.
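
One more scenario worth keeping in mind concerns the regex itself. The pattern in the example uses greedy `[^]*` matches and has no digit boundary after the MCID, so when a stream contains several marked-content sequences it can capture a different sequence than the one requested, and `/MCID 1` also matches `/MCID 10`. Below is a minimal sketch of a stricter matcher. `extractMarkedContentStrict` is a hypothetical name, and the sketch assumes that the property list is an inline dictionary with no nested dictionaries or hex strings, that marked-content sequences are not nested, and that the text `EMC` does not occur inside a string operand.

```typescript
const extractMarkedContentStrict = (
  mcid: number,
  contentStream: string,
): string | undefined => {
  const regex = new RegExp(
    // Inline property dictionary; [^<>] keeps the match inside one dictionary.
    `<<[^<>]*\\/MCID[\\0\\t\\n\\f\\r ]*${mcid}(?![0-9])[^<>]*>>` +
      // BDC must directly follow the dictionary; the lazy group stops at the
      // first EMC instead of the last one in the stream.
      `[\\0\\t\\n\\f\\r ]*BDC([^]*?)EMC`,
  );
  return contentStream.match(regex)?.[1];
};

// Hypothetical content stream with three marked-content sequences.
const sampleStream = [
  '/P <</MCID 1 >>BDC',
  '(one)Tj',
  'EMC',
  '/P <</MCID 10 >>BDC',
  '(ten)Tj',
  'EMC',
  '/P <</MCID 0 >>BDC',
  '(zero)Tj',
  'EMC',
].join('\n');

console.log(extractMarkedContentStrict(10, sampleStream));
```

A regex remains a shortcut here; tokenizing the content stream and tracking BDC/EMC pairs would be more robust for documents that break the assumptions above.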

I hope this helps. Please let me know if you have any additional questions!

Hopding · 2020-01-01T17:39:20Z

Version 1.3.1 is now published. It contains an exported version of decodePDFRawStream. The full release notes are available here.

You can install this new version with npm:

npm install pdf-lib@1.3.1

It's also available on unpkg and jsDelivr.

vegarringdal · 2020-07-17T13:10:19Z

@Hopding

Do you have any plans to add a simple function to get page content objects?
Something like the objects we have under annotations; I love that we have those.
It would be so useful if we could get that from content too :-)

cyrusho100 · 2022-05-10T09:57:26Z

@Hopding
After finding the desired structure element with the above code, is it possible to get the index of the page on which the element is located if the page entry is missing from the structure element of the PDF?
sample.pdf

Hopding closed this as completed Jan 1, 2020

Hopding changed the title from "Text/Paragraph Parsing" to "How to Decode Content Streams (for Text/Paragraph Parsing)?" Jan 1, 2020

kimmobrunfeldt mentioned this issue May 14, 2020

Converting matching colors to spot colors #445

Closed

Hopding mentioned this issue Sep 20, 2020

Check PDF pages for equality #462

Closed
