Text & Layout Analysis Feasibility #3

knowtheory · 2019-05-16T19:33:58Z

pdf.js affords us access to PDF internals. Unfortunately for us, PDF internals could mean many things. In the best case, it means a digitally native PDF where each page has instructions on how to render text and where to place it on the page. In the worst case, it could mean a biiiiiiiiig image file.

Either way, we can use pdf.js's main API to render a page to a Canvas element. In the worst case, being able to dump a page into pixels on a Canvas means that we can rely on tesseract.js to analyze a page's layout and text from Canvas data.

In the best case, pdf.js's APIs and users' documents willing, we may be able to iterate through the instructions in a page and identify the position of text elements directly.
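
As a rough sketch of what "identify the position of text elements" could look like: the interface below is an assumption modeled on the items in pdf.js's `getTextContent()` output (`str`, `transform`, `width`, `height`, `fontName`), and `boundingBox` and `toTopLeftOrigin` are hypothetical helper names. It only handles unrotated, horizontal text.

```typescript
// Assumed shape, modeled on the text items returned by pdf.js's getTextContent().
interface PdfTextItem {
  str: string;
  transform: number[]; // [a, b, c, d, e, f]; e and f are the x/y translation
  width: number;
  height: number;
  fontName: string;
}

// Box in PDF user space, where the origin is the bottom-left corner of the page.
interface BBox { left: number; bottom: number; right: number; top: number; }

// Box in canvas-style coordinates, where the origin is the top-left corner.
interface CanvasBox { left: number; top: number; width: number; height: number; }

// Hypothetical helper: assumes unrotated text. `bottom` is the text baseline,
// so descenders fall slightly outside the box.
function boundingBox(item: PdfTextItem): BBox {
  const x = item.transform[4];
  const y = item.transform[5];
  return { left: x, bottom: y, right: x + item.width, top: y + item.height };
}

// Hypothetical helper: flips the y axis for a page rendered at scale 1.
function toTopLeftOrigin(box: BBox, pageHeight: number): CanvasBox {
  return {
    left: box.left,
    top: pageHeight - box.top,
    width: box.right - box.left,
    height: box.top - box.bottom,
  };
}
```

Flipping the y axis matters as soon as these boxes are drawn over a page rendered to a Canvas, since PDF coordinates grow upward and canvas coordinates grow downward.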

knowtheory · 2019-05-23T18:44:55Z

Alright. We have learned many things.

With pdf.js we can definitely access the position information for all of the text in a document (and we have this working within a demo). PDFs contain some font information and it should be feasible to use pdf.js to perform some light comparative analysis to identify a text hierarchy within a document.
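
To make "light comparative analysis" a bit more concrete, here is one possible heuristic, not what the demo does: treat the font size that covers the most characters as body text, and treat every larger size as a candidate heading level. It assumes each item's `height` approximates its font size; `profileFontSizes` is a hypothetical name.

```typescript
// Assumed subset of a pdf.js text item; `height` is taken as a proxy for font size.
interface SizedTextItem { str: string; height: number; fontName: string; }

interface FontProfile {
  bodySize: number;       // the size that covers the most characters
  headingSizes: number[]; // sizes larger than body text, largest first
}

// Hypothetical helper: a simple heuristic, not a full hierarchy detector.
function profileFontSizes(items: SizedTextItem[]): FontProfile {
  const charsBySize: Record<string, number> = {};
  for (const item of items) {
    // Round to one decimal so near-identical sizes land in the same bucket.
    const key = String(Math.round(item.height * 10) / 10);
    charsBySize[key] = (charsBySize[key] || 0) + item.str.length;
  }
  const sizes = Object.keys(charsBySize).map(Number);
  if (sizes.length === 0) return { bodySize: 0, headingSizes: [] };

  const bodySize = sizes.reduce((best, size) =>
    charsBySize[String(size)] > charsBySize[String(best)] ? size : best
  );
  const headingSizes = sizes.filter((size) => size > bodySize).sort((a, b) => b - a);
  return { bodySize, headingSizes };
}
```

Counting characters rather than items keeps a page full of short headings from being mistaken for body text.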

Additionally, pdf.js will allow us to identify and extract any Legislative XML embedded in a draft PDF.

Unfortunately, Tesseract removed its font detection capabilities in the move to 4.0. That means for PDFs which are just straight up images, the best case scenario is just raw text + position information.

For planning purposes, all of the above means that there are two and a half branches to plan for:

Worst case scenario: A PDF that's just images
Better scenario: A digitally native PDF
(really 2.5) Best scenario: A digitally native PDF + Legislative XML
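
For planning, the branches above could be picked with something like the sketch below. The inputs are assumptions: `textItemCount` stands for the number of text items pdf.js finds across the document, and `hasEmbeddedXml` for whether any Legislative XML was found. How either fact is actually gathered is left open, and all names here are hypothetical.

```typescript
type Scenario = "image-only" | "digitally-native" | "digitally-native-with-xml";

// Hypothetical summary of what we learned about a PDF after opening it.
interface PdfFacts {
  textItemCount: number;   // text items found across the document
  hasEmbeddedXml: boolean; // whether Legislative XML was found in the PDF
}

// Hypothetical helper mapping those facts onto the two and a half branches.
function chooseScenario(facts: PdfFacts): Scenario {
  // No text items at all: assume the pages are images and OCR is required.
  if (facts.textItemCount === 0) return "image-only";
  return facts.hasEmbeddedXml ? "digitally-native-with-xml" : "digitally-native";
}
```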

In the first scenario, Tesseract.js plus 10 MB of data must be downloaded, and the PDF must be rendered and OCR'd. OCR will provide the raw text plus bounding boxes for the text, lines, and paragraphs.
We'll need to time this to assess how long extracting the text will take.

In the remaining scenarios, pdf.js can provide bounding boxes for text objects in the PDF. These text objects contain some font data; however, they are not as consistent as the lines recognized by Tesseract (for example, small caps may be divided up between an item for the larger capital letter and an item for the remaining letters in a word/line). Some additional reconstruction will be necessary.
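
One way the reconstruction might go, as a sketch only: walk the fragments in reading order and merge any fragment that sits on the same baseline as the previous one and starts right where it ends. The `Fragment` shape, the `stitchFragments` name, and both tolerance defaults are assumptions that would need tuning against real legislative PDFs.

```typescript
// Hypothetical flattened view of a text item: position, size, and string.
interface Fragment { str: string; x: number; y: number; width: number; height: number; }

// Merges horizontally adjacent fragments that share a baseline.
// Assumes `fragments` is already in reading order.
function stitchFragments(
  fragments: Fragment[],
  maxGap: number = 1,              // assumed tolerance, in PDF units
  baselineTolerance: number = 0.5  // assumed tolerance, in PDF units
): Fragment[] {
  const merged: Fragment[] = [];
  for (const frag of fragments) {
    const prev = merged.length > 0 ? merged[merged.length - 1] : undefined;
    if (prev !== undefined) {
      const gap = frag.x - (prev.x + prev.width);
      const sameBaseline = Math.abs(frag.y - prev.y) <= baselineTolerance;
      if (sameBaseline && Math.abs(gap) <= maxGap) {
        // Extend the previous fragment; keep the taller height so the
        // leading capital still determines the size of the stitched word.
        prev.str += frag.str;
        prev.width = frag.x + frag.width - prev.x;
        prev.height = Math.max(prev.height, frag.height);
        continue;
      }
    }
    merged.push({ ...frag }); // copy so the input array is not mutated
  }
  return merged;
}
```

Keeping the maximum height means a stitched small-caps word reports the size of its leading capital, which is probably the size a reader would perceive.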

Lastly, and this is most likely extra credit, it may be possible to cross-reference Legislative XML, where available, with the text extracted from pdf.js directly to help reconstruct what users see when they read the PDF.

knowtheory · 2019-06-04T07:11:39Z

Conclusions

Punt on Tesseract.js for now.
pdf.js internals are messy.
Congress's Office of Legislative Counsel has guidelines for legislation, including formatting guidelines.
We can at least assess what fonts and font sizes each text item has.
Legislative PDFs have an awful implementation of small caps, which just uses all caps, breaks up words, and shrinks the font height. This will require stitching lines back together.
It should be possible to implement Thomas Breuel's algorithm for meaningful whitespace detection (overview, the details of the algorithm).
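
For the whitespace point, here is a simplified sketch in the spirit of Breuel's branch-and-bound search for maximal empty rectangles. It is not a full implementation: it returns only the single largest empty rectangle, uses plain area as the quality measure, and stands in a sorted array for a priority queue. The names `Rect` and `largestWhitespace` are hypothetical, and the obstacles would be text bounding boxes.

```typescript
interface Rect { x0: number; y0: number; x1: number; y1: number; }

const area = (r: Rect): number => Math.max(0, r.x1 - r.x0) * Math.max(0, r.y1 - r.y0);
const overlaps = (a: Rect, b: Rect): boolean =>
  a.x0 < b.x1 && b.x0 < a.x1 && a.y0 < b.y1 && b.y0 < a.y1;

// Best-first search: a bound's area is an upper limit on any empty rectangle
// inside it, so the first obstacle-free bound taken from the queue is the largest.
function largestWhitespace(page: Rect, obstacles: Rect[]): Rect {
  const queue = [{ bound: page, inside: obstacles.filter((o) => overlaps(o, page)) }];
  while (queue.length > 0) {
    queue.sort((a, b) => area(b.bound) - area(a.bound)); // stand-in for a priority queue
    const next = queue.shift();
    if (!next) break;
    const bound = next.bound;
    const inside = next.inside;
    if (inside.length === 0) return bound;

    // Pivot on the obstacle nearest the center of the bound.
    const cx = (bound.x0 + bound.x1) / 2;
    const cy = (bound.y0 + bound.y1) / 2;
    const dist = (o: Rect): number =>
      Math.abs((o.x0 + o.x1) / 2 - cx) + Math.abs((o.y0 + o.y1) / 2 - cy);
    const pivot = inside.reduce((best, o) => (dist(o) < dist(best) ? o : best));

    // Any empty rectangle must lie entirely on one side of the pivot.
    const candidates: Rect[] = [
      { x0: pivot.x1, y0: bound.y0, x1: bound.x1, y1: bound.y1 }, // larger x
      { x0: bound.x0, y0: bound.y0, x1: pivot.x0, y1: bound.y1 }, // smaller x
      { x0: bound.x0, y0: pivot.y1, x1: bound.x1, y1: bound.y1 }, // larger y
      { x0: bound.x0, y0: bound.y0, x1: bound.x1, y1: pivot.y0 }, // smaller y
    ];
    for (const sub of candidates) {
      if (area(sub) > 0) {
        queue.push({ bound: sub, inside: inside.filter((o) => overlaps(o, sub)) });
      }
    }
  }
  // The obstacles cover the whole page: return an empty rectangle.
  return { x0: page.x0, y0: page.y0, x1: page.x0, y1: page.y0 };
}
```

On a two-column page the largest rectangle tends to be either a margin or the gutter between columns, which is why this kind of search is a plausible first step toward column detection.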

It should be feasible to carry all of these points forward into a layout analysis engine.

knowtheory mentioned this issue May 16, 2019

Editable Draft MVP #1

knowtheory changed the title ~~Layout Analysis~~ Text & Layout Analysis Feasibility May 23, 2019

knowtheory closed this as completed Jun 4, 2019
