Add multiline Japanese strings support to HocrVisualParser() to fix #534 and redo #537 #540
Codecov Report
@@            Coverage Diff             @@
##           master     #540      +/-   ##
==========================================
+ Coverage   86.02%   86.07%   +0.04%
==========================================
  Files          92       92
  Lines        4773     4775       +2
  Branches      899      899
==========================================
+ Hits         4106     4110       +4
+ Misses        476      475       -1
+ Partials      191      190       -1
It seems clear that the behavior is changed (e.g., that RuntimeError will no longer be raised). I don't know if there will be any issues for downstream applications with this change, but I'm inclined to trust your judgement, since you're probably the one who has used it more than average.
Can you manually rebase this on the master branch so there aren't any conflicts?
I'll let @senwu chime in also.
@lukehsiao Thank you for the review.
@lukehsiao I have rebased on my local master, but I could not figure out the reason for the conflicts. The code on the rebase source and destination is the same, but the rebase command still shows conflict messages.
@YasushiMiyata I'll take a look, one sec.
@YasushiMiyata I'm able to resolve the rebase after "skipping" two commits and resolving the conflicts on another two. I can't force push to your PR, though, so I don't think there is an easy way for me to manually rebase this PR.
Those were the steps I took to get to a state rebased on |
@YasushiMiyata, maybe you could add me as a collaborator on your repo, and let me force push so I can fix the rebase? This is what the process looks like for me: https://asciinema.org/a/x45UXBrTBJdQLMhSDZnR6sIUP In your case you might just want to
I pushed a master-rebased version to: https://github.com/YasushiMiyata/fonduer/commits/YasushiMiyata-master. In particular, note that it's just two clean commits off of 5ab8e9c. Does that help?
@lukehsiao, I added you to my repository as a collaborator. Thank you for the explanation with the video.
Maybe it's the cause of this error. I think the original cause is possibly that I merged or rebased in the wrong order.
Thank you! I will check it.
@lukehsiao, I have tried rebasing and got the following result without errors.
@lukehsiao, sorry to keep bothering you. I have rebased on my repository successfully and pushed, but some checks fail with errors unrelated to the code... Meanwhile, my rebase and push process is here:
This doesn't look like a correct rebase. There should be only 2 commits, I believe. Right now you have 12 kind of messy ones. Maybe it's easier to just open a new PR? You'll probably also want to sync your fork with this repo so you don't run into issues like this in the future. I'd sync each time a PR is merged.
I also think it's better to start this PR over. I'll close this PR. Thanks a lot!
Description of the problems or issues
Is your pull request related to a problem? Please describe.
See #534.
This request redoes #537, which first required fixing #538 (fixed by #539).
Does your pull request fix any issue?
See #534.
Description of the proposed changes
In the case of multi-line Japanese strings such as 'AAAA\nBBBB', spacy[ja] sometimes generates tokens that span the line break, e.g. ['AAA', 'AB', 'B', 'BB']. This proposal defines the bbox of 'AB' as that of a multi-line word: left is the minimum left of ['A', 'B'], top is the top of 'A', right is the maximum right of ['A', 'B'], and bottom is the bottom of 'B'.
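The bbox rule described here can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the description, not the code in this PR: the `Box` type and the `merge_multiline_bbox` function are hypothetical names, and the sketch assumes the per-character boxes are given in reading order with the origin at the top-left of the page.

```python
from typing import NamedTuple, Sequence


class Box(NamedTuple):
    """Hypothetical bounding box in page coordinates (origin at top-left)."""

    left: int
    top: int
    right: int
    bottom: int


def merge_multiline_bbox(char_boxes: Sequence[Box]) -> Box:
    """Return one bbox for a token whose characters span several lines.

    Assumes char_boxes is in reading order, so the first box belongs to the
    first line of the token and the last box to its final line.
    """
    if not char_boxes:
        raise ValueError("a token needs at least one character box")
    return Box(
        left=min(b.left for b in char_boxes),    # leftmost edge of any character
        top=char_boxes[0].top,                   # top of the first character
        right=max(b.right for b in char_boxes),  # rightmost edge of any character
        bottom=char_boxes[-1].bottom,            # bottom of the last character
    )


# Token 'AB': 'A' ends the first line, 'B' starts the second line.
box_a = Box(left=90, top=10, right=100, bottom=20)
box_b = Box(left=10, top=25, right=20, bottom=35)
merged = merge_multiline_bbox([box_a, box_b])
```

With these made-up coordinates the merged box covers both lines: it starts at the left edge of 'B', ends at the right edge of 'A', and runs from the top of 'A' to the bottom of 'B'.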
Test plan
This is caused by Japanese morphological analysis, so I have added Japanese test data to 'tests/data/hocr_simple/japan.hocr' and test code to 'tests/parser/test_parser.py::test_parse_hocr'.