Text & Layout Analysis Feasibility #3

Closed
knowtheory opened this issue May 16, 2019 · 2 comments

Comments

@knowtheory
Collaborator

pdf.js gives us access to PDF internals. Unfortunately for us, "PDF internals" could mean many things. In the best case, it means a digitally native PDF where each page carries instructions for how to render text and where to place it on the page. In the worst case, it could mean a biiiiiiiiig image file.

Either way, we can use pdf.js's main API to render a page to a Canvas element. In the worst case, being able to dump a page into pixels on a Canvas means that we can rely on tesseract.js to analyze a page's layout and text from Canvas data.

In the best case, pdf.js's APIs and users' documents willing, we may be able to iterate through the instructions in a page and identify the position of text elements directly.
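As a rough sketch of what "identify the position of text elements directly" could look like: pdf.js's `page.getTextContent()` resolves to an object whose `items` each carry a `str`, a six-number `transform` matrix, and a `width` and `height`. The helper names `itemToBox` and `pageToBoxes` below are hypothetical, and the code assumes unrotated text, so treat it as an illustration rather than a finished extractor.

```javascript
// Hypothetical helper: turn one pdf.js text item into a bounding box.
// Assumes the item shape produced by page.getTextContent():
//   { str, transform: [a, b, c, d, e, f], width, height, fontName }
// where (e, f) is the text origin in PDF user space (y grows upward).
// Assumption: the text is not rotated or skewed.
function itemToBox(item) {
  const x = item.transform[4];
  const y = item.transform[5];
  return {
    text: item.str,
    left: x,
    bottom: y,
    right: x + item.width,
    top: y + item.height,
  };
}

// Hypothetical helper: keep only items that carry visible text.
function pageToBoxes(textContent) {
  return textContent.items
    .filter((item) => typeof item.str === "string" && item.str.trim() !== "")
    .map(itemToBox);
}
```

In a browser this would be fed with something like `pageToBoxes(await page.getTextContent())`, where `page` comes from `pdf.getPage(n)`.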

knowtheory mentioned this issue May 16, 2019
11 tasks
knowtheory changed the title from "Layout Analysis" to "Text & Layout Analysis Feasibility" May 23, 2019
@knowtheory
Collaborator Author

Alright. We have learned many things.

With pdf.js we can definitely access the position information for all of the text in a document (and we have this working within a demo). PDFs contain some font information and it should be feasible to use pdf.js to perform some light comparative analysis to identify a text hierarchy within a document.
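One way the "light comparative analysis" might go, sketched under assumptions: estimate each item's font size from the vertical scale of its `transform`, treat the size that covers the most characters as body text, and rank larger sizes as candidate heading levels. The function names `fontSizeOf` and `inferHierarchy` are made up for this example, and real documents will likely need more signals (font name, weight, position) than size alone.

```javascript
// Approximate the rendered font size of a pdf.js text item from the
// vertical scale of its transform matrix, rounded to one decimal place.
function fontSizeOf(item) {
  const c = item.transform[2];
  const d = item.transform[3];
  return Math.round(Math.hypot(c, d) * 10) / 10;
}

// Hypothetical heuristic: the size covering the most characters is body
// text; any larger size is a heading level, biggest first.
function inferHierarchy(items) {
  const charsBySize = new Map();
  for (const item of items) {
    const size = fontSizeOf(item);
    charsBySize.set(size, (charsBySize.get(size) || 0) + item.str.length);
  }

  let bodySize = null;
  let bodyChars = -1;
  for (const [size, chars] of charsBySize) {
    if (chars > bodyChars) {
      bodySize = size;
      bodyChars = chars;
    }
  }

  const headingSizes = [...charsBySize.keys()]
    .filter((size) => bodySize !== null && size > bodySize)
    .sort((p, q) => q - p);

  return { bodySize, headingSizes };
}
```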

Additionally, pdf.js will allow us to identify and extract any LegislativeXML embedded in a draft PDF.

Unfortunately, Tesseract removed its font detection capabilities in the move to 4.0. That means that for PDFs which are just straight-up images, the best case scenario is raw text + position information.


For planning purposes, all of the above means that there are two and a half branches to plan for:

  1. Worst case scenario: A PDF that's just images
  2. Better scenario: A digitally native PDF
  3. (really 2.5) Best scenario: A digitally native PDF + LegislativeXML
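The branching above could be captured in a small dispatcher. This is only a sketch: `classifyDocument` and its two inputs are hypothetical, and it assumes the caller has already counted the non-empty text items pdf.js found and separately determined whether embedded LegislativeXML is present.

```javascript
// Hypothetical dispatcher for the three planning branches.
// Inputs are assumed to be computed elsewhere:
//   textItemCount     - number of non-empty text items pdf.js reported
//   hasLegislativeXml - whether embedded LegislativeXML was found
function classifyDocument({ textItemCount, hasLegislativeXml }) {
  if (textItemCount === 0) {
    return "image-only"; // scenario 1: render to canvas, then OCR
  }
  if (hasLegislativeXml) {
    return "native-with-xml"; // scenario 2.5: pdf.js text plus XML
  }
  return "native"; // scenario 2: pdf.js text items only
}
```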

In the first scenario, Tesseract.js + 10 MB of data must be downloaded, and the PDF must be rendered and OCR'd. OCR will provide the raw text plus bounding boxes for the text, lines, and paragraphs.
We'll need to time this to assess how long extracting the text will take.

In the remaining scenarios, pdf.js can provide bounding boxes for text objects in the PDF. These text objects contain some font data; however, they are not as consistent as the lines recognized by Tesseract (for example, small caps may be split into one item for the larger capital letter and another for the remaining letters in a word/line). Some additional reconstruction will be necessary.
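A minimal sketch of that reconstruction, assuming pdf.js-style items (`str`, `transform`, `width`) and assuming that the fragments of a small-caps word share a baseline and sit almost flush against each other. The name `stitchFragments` and the `maxGap` threshold are invented for this example; real data may need a gap threshold scaled to the font size.

```javascript
// Hypothetical stitcher: merge text items that share a baseline and are
// horizontally adjacent, e.g. "S" (large) followed by "ECTION" (shrunk).
// Baselines are compared after rounding to the nearest PDF unit.
function stitchFragments(items, maxGap = 1) {
  const rows = items.map((item) => ({
    text: item.str,
    left: item.transform[4],
    right: item.transform[4] + item.width,
    row: Math.round(item.transform[5]),
  }));

  // Top of the page first (larger y), then left to right within a row.
  rows.sort((p, q) => q.row - p.row || p.left - q.left);

  const merged = [];
  for (const fragment of rows) {
    const last = merged[merged.length - 1];
    const joins =
      last !== undefined &&
      last.row === fragment.row &&
      fragment.left - last.right <= maxGap;

    if (joins) {
      last.text += fragment.text;
      last.right = Math.max(last.right, fragment.right);
    } else {
      merged.push({ ...fragment });
    }
  }
  return merged;
}
```

Font size is deliberately ignored when deciding whether to merge, since the size change is exactly what splits a small-caps word in the first place.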

Lastly, and this is most likely extra credit, it may be possible to cross-reference LegislativeXML, where available, with the text extracted from pdf.js directly to help reconstruct what users see when they read the PDF.

@knowtheory
Collaborator Author

Conclusions

  1. Punt on Tesseract.js for now.
  2. pdf.js internals are messy.
  3. Congress's Office of Legislative Counsel has guidelines for legislation, including formatting guidelines.
  4. We can at least assess what fonts and font sizes each text item has.
  5. Legislative PDFs have an awful implementation of SmallCaps, which just uses ALLCAPS and then breaks up words and shrinks the font height. This will require stitching lines back together.
  6. It should be possible to implement Thomas Breuel's algorithm for meaningful white space detection (overview, the details of the algorithm)

It should be feasible to carry all of these points forward into a layout analysis engine.
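To give a feel for the whitespace detection in point 6, here is a simplified sketch in the spirit of Breuel's branch-and-bound approach: repeatedly take the largest candidate rectangle, and if it still contains text boxes, pick one as a pivot and split the candidate into the four regions left of, right of, below, and above that pivot. This version returns only the single largest empty rectangle and uses a linear scan instead of a priority queue; the names `largestWhitespace`, `area`, and `overlaps` are made up here, and it is not a full implementation of the published algorithm.

```javascript
// Rectangles are { x0, y0, x1, y1 } with x0 <= x1 and y0 <= y1.
const area = (r) => Math.max(0, r.x1 - r.x0) * Math.max(0, r.y1 - r.y0);
const overlaps = (p, q) =>
  p.x0 < q.x1 && q.x0 < p.x1 && p.y0 < q.y1 && q.y0 < p.y1;

// Simplified sketch: find the largest empty rectangle inside `bounds`
// that overlaps none of the `obstacles` (text bounding boxes).
// Returns null when the obstacles leave no empty area.
function largestWhitespace(bounds, obstacles) {
  const queue = [
    { rect: bounds, inside: obstacles.filter((o) => overlaps(o, bounds)) },
  ];

  while (queue.length > 0) {
    // Best-first: take the candidate with the largest area (upper bound).
    let best = 0;
    for (let i = 1; i < queue.length; i++) {
      if (area(queue[i].rect) > area(queue[best].rect)) best = i;
    }
    const [{ rect, inside }] = queue.splice(best, 1);
    if (inside.length === 0) return rect;

    // Pivot: the obstacle whose center is closest to the candidate's center.
    const cx = (rect.x0 + rect.x1) / 2;
    const cy = (rect.y0 + rect.y1) / 2;
    const dist = (o) =>
      Math.hypot((o.x0 + o.x1) / 2 - cx, (o.y0 + o.y1) / 2 - cy);
    const pivot = inside.reduce((p, q) => (dist(p) <= dist(q) ? p : q));

    // Any empty rectangle must lie entirely on one side of the pivot.
    const parts = [
      { x0: rect.x0, y0: rect.y0, x1: pivot.x0, y1: rect.y1 },
      { x0: pivot.x1, y0: rect.y0, x1: rect.x1, y1: rect.y1 },
      { x0: rect.x0, y0: rect.y0, x1: rect.x1, y1: pivot.y0 },
      { x0: rect.x0, y0: pivot.y1, x1: rect.x1, y1: rect.y1 },
    ];
    for (const part of parts) {
      if (area(part) > 0) {
        queue.push({
          rect: part,
          inside: inside.filter((o) => overlaps(o, part)),
        });
      }
    }
  }
  return null;
}
```

On a page with two text columns, the rectangle this finds would be the gutter between them, which is the kind of "meaningful" whitespace a layout engine could use as a column separator.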
