Text & Layout Analysis Feasibility #3
Comments
Alright. We have learned many things. With [...] Additionally, [...] Unfortunately, Tesseract, in its move to 4.0, removed its font detection capabilities. That means that for PDFs which are just straight-up images, the best-case scenario is raw text plus position information.
For planning purposes, all of the above means that there are two and a half branches to plan for:
In the first scenario, [...] In the remaining scenarios, [...] Lastly, and this is most likely extra credit, it may be possible to cross-reference Legislative XML, where available, with the text extracted from [...]
Conclusions
It should be feasible to carry all of these points forward into a layout analysis engine.
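One way to carry both branches forward is to normalize their output into a single record before any layout analysis happens. The sketch below is only an illustration under stated assumptions: `TextBox`, `fromPdfItem`, and `fromOcrWord` are hypothetical names, the `OcrWord` shape (text plus a pixel bounding box) is an assumption about what the OCR step hands back, and the PDF conversion ignores rotation and descenders.

```typescript
// Hypothetical common record for one piece of positioned text.
// Coordinates use a top-left origin, in the units of whichever source produced them.
interface TextBox {
  text: string;
  x: number;
  y: number;
  width: number;
  height: number;
  source: "pdf" | "ocr";
}

// Subset of the fields pdf.js reports for a text item from getTextContent().
interface PdfTextItem {
  str: string;
  transform: number[]; // [a, b, c, d, e, f]; e and f are the x/y translation
  width: number;
  height: number;
}

// Assumed shape of an OCR word: text plus a pixel bounding box.
interface OcrWord {
  text: string;
  bbox: { x0: number; y0: number; x1: number; y1: number };
}

// PDF user space has its origin at the bottom-left, so flip y using the page height.
// Treating transform[5] as the baseline and adding the item height is an approximation.
function fromPdfItem(item: PdfTextItem, pageHeight: number): TextBox {
  const x = item.transform[4] ?? 0;
  const baseline = item.transform[5] ?? 0;
  return {
    text: item.str,
    x,
    y: pageHeight - baseline - item.height,
    width: item.width,
    height: item.height,
    source: "pdf",
  };
}

// OCR boxes are already top-left based, so only width and height need deriving.
function fromOcrWord(word: OcrWord): TextBox {
  const { x0, y0, x1, y1 } = word.bbox;
  return { text: word.text, x: x0, y: y0, width: x1 - x0, height: y1 - y0, source: "ocr" };
}
```

Keeping a `source` tag on every box would let a later stage treat OCR positions as less reliable than native ones, or reconcile the two when both exist for the same page.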
pdf.js affords us access to PDF internals. Unfortunately for us, PDF internals could mean many things. In the best case, it means a digitally native PDF where each page carries instructions on how to render text and where to position it on the page. In the worst case, it could mean one biiiiiiiiig image file.
Either way, we can use pdf.js's main API to render a page to a Canvas element. In the worst case, being able to dump a page into pixels on a Canvas means that we can rely on tesseract.js to analyze a page's layout and text from Canvas data.
In the best case, pdf.js's APIs and users' documents willing, we may be able to iterate through the instructions in a page and identify the position of text elements directly.
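As a rough sketch of the best-case branch: pdf.js exposes a page's text through `page.getTextContent()`, whose items carry a `str` and a six-element `transform` with the x/y translation in the last two slots. The helpers below work on plain objects of that shape so they can be tried without loading a PDF. `hasTextLayer` and `groupIntoLines` are hypothetical names, the 2-unit baseline tolerance is an arbitrary assumption, and real documents with rotated or multi-column text would need more care.

```typescript
// Minimal shape of a pdf.js text item, as far as this sketch needs it.
interface PositionedItem {
  str: string;
  transform: number[]; // [a, b, c, d, e, f]; e = x, f = baseline y (origin bottom-left)
}

// A page with no non-blank text items is treated as image-only,
// which would send it down the render-to-Canvas and OCR branch instead.
function hasTextLayer(items: PositionedItem[]): boolean {
  return items.some((item) => item.str.trim().length > 0);
}

// Group items whose baselines sit within `tolerance` units of each other into lines.
// Lines come back top of page first; each line is read left to right.
function groupIntoLines(items: PositionedItem[], tolerance = 2): string[] {
  const sorted = items
    .filter((item) => item.str.trim().length > 0)
    .map((item) => ({ str: item.str, x: item.transform[4] ?? 0, y: item.transform[5] ?? 0 }))
    .sort((a, b) => b.y - a.y || a.x - b.x); // larger y is higher on the page

  const lines: { y: number; parts: { str: string; x: number }[] }[] = [];
  for (const item of sorted) {
    const last = lines[lines.length - 1];
    if (last && Math.abs(last.y - item.y) <= tolerance) {
      last.parts.push(item);
    } else {
      lines.push({ y: item.y, parts: [item] });
    }
  }

  // Re-sort within each line, since small baseline differences can disturb x order.
  return lines.map((line) =>
    line.parts
      .sort((a, b) => a.x - b.x)
      .map((part) => part.str)
      .join(" ")
  );
}
```

In practice the items would come from something like `(await page.getTextContent()).items`, and a page for which `hasTextLayer` returns false would be rendered to a Canvas and handed to tesseract.js.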