Add HOCRDocPreprocessor and HocrVisualParser #519

HiromuHota · 2020-09-30T00:09:25Z

Description of the problems or issues

Is your pull request related to a problem? Please describe.

This is the second patch, following #518.

Does your pull request fix any issue?

N/A.

Description of the proposed changes

Add HOCRDocPreprocessor and HocrVisualParser

Test plan

I added a few real hOCR example files.

Checklist

I have updated the documentation accordingly.
I have added tests to cover my changes.
All new and existing tests passed.
I have updated the CHANGELOG.rst accordingly.

codecov-commenter · 2020-09-30T00:39:31Z

Codecov Report

Merging #519 into master will increase coverage by 0.22%.
The diff coverage is 90.96%.

@@            Coverage Diff             @@
##           master     #519      +/-   ##
==========================================
+ Coverage   85.81%   86.03%   +0.22%     
==========================================
  Files          90       92       +2     
  Lines        4582     4769     +187     
  Branches      852      896      +44     
==========================================
+ Hits         3932     4103     +171     
- Misses        467      475       +8     
- Partials      183      191       +8

Flag	Coverage Δ
#unittests	`86.03% <90.96%> (+0.22%)`	⬆️

Flags with carried forward coverage won't be shown.

Impacted Files	Coverage Δ
setup.py	`0.00% <ø> (ø)`
.../fonduer/parser/visual_parser/pdf_visual_parser.py	`84.76% <ø> (ø)`
...duer/parser/preprocessors/hocr_doc_preprocessor.py	`86.66% <86.66%> (ø)`
...fonduer/parser/visual_parser/hocr_visual_parser.py	`95.23% <95.23%> (ø)`
src/fonduer/parser/preprocessors/__init__.py	`100.00% <100.00%> (ø)`
src/fonduer/parser/visual_parser/__init__.py	`100.00% <100.00%> (ø)`
src/fonduer/parser/parser.py	`93.39% <0.00%> (+0.23%)`	⬆️

HiromuHota · 2020-10-01T16:39:08Z

Unfortunately, this PR does not fix #12. Moreover, #12 cannot be fixed, by its nature.
I'd recommend the hOCR format if visual information is to be parsed, because bboxes and words are perfectly linked in hOCR.
If hOCR cannot be used for any reason, use PdfVisualParser as a last resort.
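
To illustrate what "perfectly linked" means here: in hOCR, each word is typically a span with class ocrx_word whose title attribute carries that word's bbox, so no separate alignment step between text and coordinates is needed. The snippet below is a minimal sketch using only the Python standard library; it is not how HocrVisualParser is implemented. The sample markup and the names SAMPLE_HOCR, HocrWordExtractor, and extract_words are made up for illustration.

```python
import re
from html.parser import HTMLParser

# Hypothetical hOCR fragment (not taken from the test files in this PR).
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 600 800'>
  <span class='ocr_line' title='bbox 36 92 300 116'>
    <span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
    <span class='ocrx_word' title='bbox 110 92 180 116; x_wconf 93'>world</span>
  </span>
</div>
"""

# Matches "bbox x0 y0 x1 y1" inside an hOCR title attribute.
BBOX_RE = re.compile(r"bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)")


class HocrWordExtractor(HTMLParser):
    """Collect (word, bbox) pairs from ocrx_word spans."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None      # bbox of the ocrx_word span currently open
        self._in_word = False  # True while inside an ocrx_word span

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            match = BBOX_RE.search(attr_map.get("title") or "")
            self._bbox = tuple(int(v) for v in match.groups()) if match else None
            self._in_word = True

    def handle_endtag(self, tag):
        # Simplification: assumes ocrx_word spans contain no nested spans.
        if tag == "span":
            self._in_word = False

    def handle_data(self, data):
        text = data.strip()
        if self._in_word and text:
            self.words.append((text, self._bbox))


def extract_words(hocr):
    """Return a list of (word, (x0, y0, x1, y1)) tuples."""
    parser = HocrWordExtractor()
    parser.feed(hocr)
    parser.close()
    return parser.words


for word, bbox in extract_words(SAMPLE_HOCR):
    print(word, bbox)
```

With a PDF, by contrast, the words and their coordinates come from a separate file and have to be matched back to the HTML text, which is where the linkage can break.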

senwu · 2020-10-03T00:08:27Z

Can we get all conflicts resolved first? Thanks!


HiromuHota · 2020-10-03T00:52:38Z

I rebased on the master branch and resolved the conflicts. Thanks.

lukehsiao

Just a few nits.

docs/user/parser.rst

tests/data/hocr_simple/md.hocr

tests/parser/test_preprocessor.py

HiromuHota · 2020-10-05T21:38:01Z

@lukehsiao thank you so much for reviewing such a big PR, and for your comments.
Please review my edits and replies. Feel free to ask for any further clarification.

lukehsiao

Thanks!

I think @senwu wanted to take a look too, so I'll wait for him. But LGTM.

senwu

LGTM

senwu · 2020-10-06T23:30:44Z

Let's plan to have a tutorial about this PR. This is a really awesome improvement!!! Thoughts? @HiromuHota @lukehsiao

HiromuHota · 2020-10-07T00:08:42Z

I'll definitely update the existing tutorials.
Rather than adding a new tutorial, I'd like to either

pick one of the existing ones and replace HTMLDocPreprocessor and PdfVisualParser with HOCRDocPreprocessor and HocrVisualParser, if it is just a drop-in replacement,
or enrich the "Parsing Documents into the Data Model" section of intro/Intro_Data_Model.ipynb.

senwu · 2020-10-07T18:41:24Z

That works as well. We need to show two things: 1) some basics about what's in it and how to use it, and 2) an end-to-end run with high quality.

HiromuHota · 2020-10-24T01:05:48Z

An update: I've started replacing HTML with hOCR (html -> pdfy -> pdf -> pdftotree -> hocr) in the Wikipedia tutorial.
With a minor change to the parser, all cells run without error (apart from #527) up until the LF part, which needs a major change.

HiromuHota mentioned this pull request Sep 30, 2020

Native support for hOCR #509

Closed

7 tasks

HiromuHota marked this pull request as ready for review September 30, 2020 18:06

HiromuHota added the enhancement label Sep 30, 2020

senwu requested review from lukehsiao and senwu September 30, 2020 23:27

Hiromu Hota added 12 commits October 2, 2020 17:49

Add HOCRDocPreprocessor and HocrVisualParser

30c315a

Update docs

5b8f0d1

"sep" is not reliable due to spaCy's unexpected sentencizing.

12f7fea

The text `This is a sentence. "This" is another sentence.` is split into the following two sentences: `This is a sentence.` and `" This" is another sentence.`

Clear the hocr specific html_attrs before being persisted into database

7b15417

Add a test that fails

8a1edc8

Add a safeguard for multi-words in ocrx_word

22513bc

Fix style check errors

b381d2a

Correctly strip extra whitespaces

ee8ee0a

Add test_parse_hocr_with_tables

d51bba5

Refactor how to clean up text nodes

2c287a5

Remove comments at preprocessing

5f99338

Update docs

28fb864

HiromuHota force-pushed the fix/476_2 branch from a55d375 to 28fb864 Compare October 3, 2020 00:50

lukehsiao approved these changes Oct 3, 2020


Hiromu Hota added 2 commits October 5, 2020 11:18

Add PDF--Convert-->HTML as a possible pipeline

fd88351

Add more assertion checks to the tests

d0c96fb

lukehsiao approved these changes Oct 6, 2020


senwu approved these changes Oct 6, 2020


senwu merged commit b44cdcd into HazyResearch:master Oct 6, 2020

HiromuHota deleted the fix/476_2 branch October 6, 2020 23:21

HiromuHota mentioned this pull request Oct 6, 2020

Support hOCR #476

Closed

HiromuHota mentioned this pull request Oct 6, 2020

Integrate new parser to support pdftotree output #3

Closed

4 tasks
