How to Decode Content Streams (for Text/Paragraph Parsing)? #296

Closed
cshenks opened this issue Dec 31, 2019 · 6 comments

Comments

@cshenks
Contributor

cshenks commented Dec 31, 2019

Following up on #137, I would also like to use pdf-lib to extract and modify the text content of PDFs. I've been looking into traversing the structure tree to identify paragraphs, and I've been able to do that. However, when I reach a structure element dictionary whose kids array contains references to portions of a page content stream, I haven't been able to figure out how to convert that portion of the content stream into readable text. Is this doable?

const traverseStructTree = (root) => {
  const kidsRef = root.dict.get(PDFName.of('K'));
  const structElementType = root.dict.get(PDFName.of('S'));
  const paragraphType = PDFName.of('P');
  if (structElementType === paragraphType) {
    console.log("Paragraph", root);
    const page = root.context.lookup(root.dict.get(PDFName.of('Pg')));
    const contents = page.Contents();
    const markedContentIdentifier = kidsRef;
    console.log(contents, markedContentIdentifier);
    // How to extract text based on content identifier?
  }
  let node;
  if (!kidsRef || kidsRef instanceof PDFNumber) return;


  if (kidsRef instanceof PDFRef) {
    node = root.context.lookup(kidsRef);
    traverseStructTree(node);
  } else if (kidsRef instanceof PDFArray) {
    for (let idx = 0, len = kidsRef.size(); idx < len; idx++) {
      const nodeRef = kidsRef.get(idx);
      node = root.context.lookup(nodeRef);
      // Skip non-dictionary kids (e.g. integer MCIDs) instead of aborting the loop.
      if (!(node instanceof PDFDict)) continue;
      traverseStructTree(node);
    }
  }
};

const structTreeRoot = pdfDoc.catalog.lookup(PDFName.of('StructTreeRoot'));
@Hopding
Owner

Hopding commented Dec 31, 2019

@cshenks Can you please provide an example document I can run this script on?

@cshenks
Contributor Author

cshenks commented Dec 31, 2019

@Hopding here's the document I've been using: f1099msc.pdf. This script only handles marked content identifiers that are specified by a PDFNumber.

@Hopding
Owner

Hopding commented Jan 1, 2020

@cshenks

The reason the content streams are not legible is that they are all encoded. So if you want to process their contents, you'll first need to decode them. Fortunately, pdf-lib already contains the code needed to do this (it's required to parse object streams and xref streams). The specific function you'll need to use is decodePDFRawStream:

export const decodePDFRawStream = ({ dict, contents }: PDFRawStream) => {
  let stream: StreamType = new Stream(contents);
  const Filter = dict.lookup(PDFName.of('Filter'));
  const DecodeParms = dict.lookup(PDFName.of('DecodeParms'));
  if (Filter instanceof PDFName) {
    stream = decodeStream(stream, Filter, DecodeParms);
  } else if (Filter instanceof PDFArray) {
    for (let idx = 0, len = Filter.size(); idx < len; idx++) {
      stream = decodeStream(
        stream,
        Filter.lookup(idx, PDFName),
        DecodeParms && (DecodeParms as PDFArray).lookup(idx),
      );
    }
  } else if (!!Filter) {
    throw new UnexpectedObjectTypeError([PDFName, PDFArray], Filter);
  }
  return stream;
};

However, this function is not exported as it has only been used internally up to now. So if you're using the UMD modules, you can't really access it. But if you're using the NPM package, you can pull it out of pdf-lib/es/core/streams/decode or pdf-lib/cjs/core/streams/decode.

I modified your example using decodePDFRawStream to convert the contents streams into strings. Then, for each marked content identifier, I generate and apply a regex to find the section of the content stream that corresponds to it:

import fs from 'fs';

import {
  arrayAsString,
  PDFArray,
  PDFDict,
  PDFDocument,
  PDFName,
  PDFNumber,
  PDFPageLeaf,
  PDFRawStream,
  PDFRef,
} from 'pdf-lib';

// Note that this little guy isn't really accessible in the UMD modules, as he
// is not exported to the root, as of `pdf-lib@1.3.0`. But perhaps this will
// change in the next release.
import { decodePDFRawStream } from 'pdf-lib/cjs/core/streams/decode';

const markedContentRegex = (mcid: number) =>
  new RegExp(`<<[^]*\\/MCID[\\0\\t\\n\\f\\r\\ ]*${mcid}[^]*>>[^]*BDC([^]*)EMC`);

const extractMarkedContent = (mcid: number, contentStream: string) => {
  const regex = markedContentRegex(mcid);
  const res = contentStream.match(regex);
  return res?.[1];
};

const traverseStructTree = (root: PDFDict) => {
  const kidsRef = root.get(PDFName.of('K'));
  const structElementType = root.get(PDFName.of('S'));
  const paragraphType = PDFName.of('P');

  if (structElementType === paragraphType) {
    // TODO: What if this isn't a `PDFPageLeaf`?
    const page = root.lookup(PDFName.of('Pg')) as PDFPageLeaf;

    // TODO: What if this isn't a `PDFRawStream`?
    const contents = page.Contents() as PDFRawStream;

    // TODO: What if this isn't a `PDFNumber`?
    const markedContentIdentifier = kidsRef! as PDFNumber;
    const mcid = markedContentIdentifier.value();

    console.log(`------- Marked Content (id=${mcid}) --------`);
    const decodedBytes = decodePDFRawStream(contents).decode();
    const decodedString = arrayAsString(decodedBytes);
    const content = extractMarkedContent(mcid, decodedString);
    console.log(content);
    console.log(`-------- End (id=${mcid}) ---------`);
    console.log();
  }

  let node;
  if (!kidsRef || kidsRef instanceof PDFNumber) return;

  if (kidsRef instanceof PDFRef) {
    node = root.context.lookup(kidsRef, PDFDict);
    traverseStructTree(node);
  } else if (kidsRef instanceof PDFArray) {
    for (let idx = 0, len = kidsRef.size(); idx < len; idx++) {
      const nodeRef = kidsRef.get(idx);
      node = root.context.lookup(nodeRef);
      // Skip non-dictionary kids (e.g. integer MCIDs) instead of aborting the loop.
      if (!(node instanceof PDFDict)) continue;
      traverseStructTree(node);
    }
  }
};

(async () => {
  const pdfDoc = await PDFDocument.load(fs.readFileSync('f1099msc.pdf'));

  const structTreeRoot = pdfDoc.catalog.lookup(
    PDFName.of('StructTreeRoot'),
    PDFDict,
  );

  traverseStructTree(structTreeRoot);
})();

Running this will output the following:

------- Marked Content (id=0) --------
 
0 -1.869 TD
(See IRS Publications 1141, 1167, and 1179 for more information about pri\
nting these tax )Tj
0 -1.2 TD
(forms.)Tj

-------- End (id=0) ---------

------- Marked Content (id=1) --------
 
0 -1.869 TD
(See IRS Publications 1141, 1167, and 1179 for more information about pri\
nting these tax )Tj
0 -1.2 TD
(forms.)Tj

-------- End (id=1) ---------

...

------- Marked Content (id=10) --------
 
/T1_0 1 Tf
0 -1.275 TD
(Need help? )Tj
/T1_1 1 Tf
(If you have questions about reporting)Tj
/T1_0 1 Tf
( )Tj
/T1_1 1 Tf
(on )Tj
0 -1.075 TD
(Form 1099-MISC, call the information reporting)Tj
/T1_0 1 Tf
( )Tj
/T1_1 1 Tf
T*
(customer service site toll free at 866-455-7438)Tj
/T1_0 1 Tf
( )Tj
/T1_1 1 Tf
(or )Tj
T*
(304-263-8700 \(not toll free\). Persons with a hearing or )Tj
T*
(speech disability with access to TTY/TDD)Tj
/T1_0 1 Tf
( )Tj
/T1_1 1 Tf
(equipment )Tj
T*
(can call 304-579-4827 \(not toll free\). )Tj

-------- End (id=10) ---------

To obtain sentences/paragraphs of text, you'll need to parse and process the text operators (such as Tf, TD, T*, and Tj) in the marked content streams.
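As a rough illustration of that processing step, here is a minimal sketch that pulls the literal strings out of a decoded marked content section and joins them. It is not part of pdf-lib; `unescapeLiteral` and `extractText` are hypothetical names. It assumes the simple single-byte encoding seen in the output above, ignores hex strings (`<...>`) and comments, and ignores positioning operators entirely, so it only works when the strings already carry their own spaces.

```typescript
// Hypothetical helper: resolve backslash escapes inside the body of a PDF
// literal string (the text between the outer parentheses).
const unescapeLiteral = (body: string): string => {
  const simple: { [key: string]: string } = {
    n: '\n', r: '\r', t: '\t', b: '\b', f: '\f',
  };
  let out = '';
  for (let i = 0; i < body.length; i++) {
    const ch = body.charAt(i);
    if (ch !== '\\') {
      out += ch;
      continue;
    }
    const next = body.charAt(++i);
    if (next === '\n') continue; // backslash + newline is a line continuation
    if (next === '\r') {
      if (body.charAt(i + 1) === '\n') i++;
      continue;
    }
    if (simple[next] !== undefined) {
      out += simple[next];
    } else if (/[0-7]/.test(next)) {
      // Octal escape: up to three digits.
      let octal = next;
      while (octal.length < 3 && /[0-7]/.test(body.charAt(i + 1))) {
        octal += body.charAt(++i);
      }
      out += String.fromCharCode(parseInt(octal, 8) & 0xff);
    } else {
      out += next; // \( \) \\ and unknown escapes yield the character itself
    }
  }
  return out;
};

// Hypothetical helper: concatenate every literal string operand found in
// `content`, in order. Nested, balanced parentheses are respected.
const extractText = (content: string): string => {
  let text = '';
  let i = 0;
  while (i < content.length) {
    if (content.charAt(i) !== '(') {
      i++;
      continue;
    }
    let depth = 1;
    let j = i + 1;
    for (; j < content.length; j++) {
      const c = content.charAt(j);
      if (c === '\\') j++; // skip the escaped character
      else if (c === '(') depth++;
      else if (c === ')' && --depth === 0) break;
    }
    text += unescapeLiteral(content.slice(i + 1, j));
    i = j + 1;
  }
  return text;
};

const sample =
  '0 -1.869 TD\n(See IRS Publications 1141, 1167, and 1179 for more ' +
  'information about pri\\\nnting these tax )Tj\n0 -1.2 TD\n(forms.)Tj\n';
console.log(extractText(sample));
```

For the `id=0` section shown above, this prints the sentence with the `pri\` / `nting` line continuation joined back into "printing". Documents that rely on TD/T* offsets or TJ kerning gaps to separate words would need those operators interpreted as well.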

The above example is written in TypeScript. I also created a working NPM script you can use: extract-marked-content.zip


There are a few important things to note about this script/example:

  • It's tailored to the specific PDF you shared. There are a number of scenarios it does not handle properly, such as when the page has an array of content streams rather than a single one.
  • Not all PDFs have a structure tree. So this approach will not generalize to arbitrary PDF documents.
  • As noted above, the example uses an unexported function. However, I can see the value of exporting this as content stream decoding is a fairly common task. So I'll probably export it in the next pdf-lib release.
  • Finally, and this is probably the most important caveat, the specific PDF you shared appears to use fonts with simple ASCII character encodings. This makes extracting text strings from the text drawing operators delightfully easy to do. But this absolutely does not generalize. Many documents encode text in far more obscure ways that require you to decode them using CMaps. Obviously, this is not handled in the above example.
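One more scenario worth flagging is a property of the regex itself. The `[^]*` groups are greedy and nothing follows `${mcid}` to mark the end of the number, so a lookup for MCID 1 can also match `/MCID 10`, and the capture can span more than one BDC ... EMC pair. A slightly stricter variant might look like the sketch below. It assumes the property dictionary contains no nested dictionaries or `>` characters, that marked content sections are not nested, and that `EMC` never appears inside a string; a real tokenizer would be needed to lift those assumptions.

```typescript
// Sketch of a stricter pattern (same helper names as the example above):
// - `[^>]*` keeps the match inside a single `<< ... >>` property dictionary
// - `(?!\d)` stops MCID 1 from matching MCID 10, 11, ...
// - `[^]*?` is lazy, so the capture ends at the first EMC
const markedContentRegex = (mcid: number): RegExp =>
  new RegExp(`<<[^>]*\\/MCID\\s*${mcid}(?!\\d)[^>]*>>\\s*BDC([^]*?)EMC`);

const extractMarkedContent = (
  mcid: number,
  contentStream: string,
): string | undefined => {
  const res = contentStream.match(markedContentRegex(mcid));
  return res?.[1];
};

// A made-up content stream with three marked content sections.
const demoStream = [
  '/P <</MCID 1 >>BDC',
  '(one)Tj',
  'EMC',
  '/P <</MCID 10 >>BDC',
  '(ten)Tj',
  'EMC',
  '/P <</MCID 0 >>BDC',
  '(zero)Tj',
  'EMC',
].join('\n');

console.log(extractMarkedContent(10, demoStream));
```

With the made-up `demoStream`, each of the ids 0, 1, and 10 yields only its own section, and an id that is not present yields `undefined`.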

I hope this helps. Please let me know if you have any additional questions!

Hopding closed this as completed Jan 1, 2020
Hopding changed the title from "Text/Paragraph Parsing" to "How to Decode Content Streams (for Text/Paragraph Parsing)?" Jan 1, 2020
@Hopding
Owner

Hopding commented Jan 1, 2020

Version 1.3.1 is now published. It contains an exported version of decodePDFRawStream. The full release notes are available here.

You can install this new version with npm:

npm install pdf-lib@1.3.1

It's also available on unpkg and jsDelivr.
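With the function exported, the multi-stream case mentioned earlier (a page whose contents is an array of streams) could be handled roughly as sketched below. The helper is deliberately independent of pdf-lib so it can run on its own. `decodeContentStreams` is a hypothetical name, and the `decode` callback is an assumption standing in for something like `(s) => decodePDFRawStream(s).decode()`. Resolving each entry of the array to a stream object is left to the caller. The streams are joined with a newline because the parts of a content array are treated as one concatenated stream and may only be split at token boundaries.

```typescript
// Hypothetical helper: decode every content stream of a page and join the
// results into one string. `decode` is supplied by the caller; with pdf-lib
// it would presumably wrap `decodePDFRawStream(stream).decode()`.
const decodeContentStreams = <S>(
  streams: S[],
  decode: (stream: S) => Uint8Array,
): string => {
  const parts: string[] = [];
  for (const stream of streams) {
    const bytes = decode(stream);
    let part = '';
    // Map each byte to one character, which is enough for operator parsing.
    for (let i = 0; i < bytes.length; i++) {
      part += String.fromCharCode(bytes[i]);
    }
    parts.push(part);
  }
  // Whitespace between parts keeps the last token of one stream from fusing
  // with the first token of the next.
  return parts.join('\n');
};

// Stand-in "streams": plain byte arrays, with a trivial decoder.
const fakeStreams: number[][] = [
  [66, 84], // "BT"
  [69, 84], // "ET"
];
const joined = decodeContentStreams(
  fakeStreams,
  (codes) => new Uint8Array(codes),
);
console.log(JSON.stringify(joined)); // "BT\nET"
```

A single content stream is just the one-element case, so the traversal code would not need a separate branch for it.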

@vegarringdal

@Hopding

Do you have any plans to add a simple function to get page content objects?
Objects like the ones we have under annotations; I love that we have those.
It would be so useful if we could get that from content too :-)

@cyrusho100

cyrusho100 commented May 10, 2022

@Hopding
After finding the desired structure element with the above code, is it possible to get the index of the page on which the element is located if the page entry (Pg) is missing from the structure element?
sample.pdf
