Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add multiline Japanese strings support to HocrVisualParser() to fix #534 and redo #537 #540

Closed
wants to merge 12 commits into from

Conversation

YasushiMiyata
Copy link
Contributor

@YasushiMiyata YasushiMiyata commented May 5, 2021

Description of the problems or issues

Is your pull request related to a problem? Please describe.
See #534.
This request redoes #537, which needs prior fixing #538 (fixed by #539).

Does your pull request fix any issue.
See #534

Description of the proposed changes

In case of multi line Japanese strings 'AAAA\nBBBB', spacy[ja] sometimes generates tokens ['AAA', 'AB', 'B', 'BB']. Proposal defines bbox of 'AB' as a multi line word (i.e. left is min left of ['A','B'], top is the top of 'A', right is max right of ['A','B'] and bottom is the bottom of 'B').

Test plan

This is cause of Japanese morphological analysis. So, I have added Japanese test data to 'tests/data/hocr_simple/japan.hocr' and test code to 'tests/parser/test_parser.py::test_parse_hocr'

Checklist

  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • I have updated the CHANGELOG.rst accordingly.

@YasushiMiyata YasushiMiyata marked this pull request as ready for review May 5, 2021 23:17
@codecov-commenter
Copy link

codecov-commenter commented May 5, 2021

Codecov Report

Merging #540 (5d666fa) into master (5ab8e9c) will increase coverage by 0.04%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #540      +/-   ##
==========================================
+ Coverage   86.02%   86.07%   +0.04%     
==========================================
  Files          92       92              
  Lines        4773     4775       +2     
  Branches      899      899              
==========================================
+ Hits         4106     4110       +4     
+ Misses        476      475       -1     
+ Partials      191      190       -1     
Flag Coverage Δ
unittests 86.07% <100.00%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...fonduer/parser/visual_parser/hocr_visual_parser.py 97.64% <100.00%> (+2.46%) ⬆️

Copy link
Contributor

@lukehsiao lukehsiao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems clear that the behavior is changed (e.g., that RuntimeError will no longer be raised). I don't know if there will be any issues for downstream applications with this change, but I'm inclined to trust your judgement, since you're probably the one who has used it more than average.

Can you manually rebase this on the master branch so there aren't any conflicts?

I'll let @senwu chime in also.

@YasushiMiyata
Copy link
Contributor Author

YasushiMiyata commented May 7, 2021

@lukehsiao Thank you for review. Do you mean "manually rebase" is rebasing branch to master in my local or remote repository, or in this upstream? About local or remote, I will check. About upstream, I have no write access.
I failed manual rebase because the variable name change (d -> candidate) in #549 does not rebase to my local well. I am now fixing.

@YasushiMiyata
Copy link
Contributor Author

@lukehsiao I have rebase on my local master, but I could not get the reason of conflicts. All codes of rebase original and destination are the same but rebase command shows conflict messages.
I have tried git rebase --skip after some commands of resolving conflicts like git am --show-current-patch, git rebase --continue and so on. It is maybe because there are some cached data remains...

@lukehsiao
Copy link
Contributor

@YasushiMiyata I'll take a look, one sec.

@lukehsiao
Copy link
Contributor

lukehsiao commented May 7, 2021

@YasushiMiyata I'm able to resolve the rebase after "skipping" two commits and resolving the conflicts on another two. I can't force push to your PR, though, so I don't think there is an easy way for me to manually rebase this PR.

git checkout -b YasushiMiyata-master master
git pull https://github.com/YasushiMiyata/fonduer.git master
git rebase master
<resolve conflict with tests/test_postgres.py>
git add -u
git rebase --continue
git rebase --skip
<resolve conflict with test_parser>
git add -u
git rebase --continue
git rebase --skip

Those were the steps I took to get to a state rebased on master. Can you give something similar a try?

@YasushiMiyata
Copy link
Contributor Author

YasushiMiyata commented May 7, 2021

@lukehsiao
I did rebase with --continue and --skip and got following results.
I think it would be better to start over this pull request with fork process.
Would you reject this pull request if this process is inappropriate?

$git rebase --continue
Applying: Fix incorrect assertion in test_parser.py
No changes - did you forget to use 'git add'?
If there is nothing left to stage, chances are that something else
already introduced the same changes; you might want to skip this patch.

git rebase --skip
Applying: Fix test_parser.py assertion depended on python version.
Using index info to reconstruct a base tree...
M tests/parser/test_parser.py
Falling back to patching base and 3-way merge...
No changes -- Patch already applied.
Applying: Change variable name 'd' to 'candidate'
Using index info to reconstruct a base tree...
M tests/test_postgres.py
Falling back to patching base and 3-way merge...
No changes -- Patch already applied.
Applying: Add multiline Japanese strings support to HocrVisualParser() to fix #534 and redo #537
Applying: Add CHANGELOG of #540

@YasushiMiyata
Copy link
Contributor Author

@lukehsiao
I have tried rebase once again, and complete with no force push (no --skip command).
Sorry to bother you, but would you mind re-checking to be able to merge?
The rebase failer seems my mistaking of rebase target.

@lukehsiao
Copy link
Contributor

lukehsiao commented May 7, 2021

@YasushiMiyata, maybe you could add me as a collaborator on your repo, and let me force push so I can fix the rebase?

This is what the process looks like for me:

https://asciinema.org/a/x45UXBrTBJdQLMhSDZnR6sIUP

In your case you might just want to git pull --rebase on this repo's master branch to get yours up to date?

@lukehsiao lukehsiao added this to the v0.8.4 milestone May 7, 2021
lukehsiao pushed a commit that referenced this pull request May 9, 2021
@lukehsiao
Copy link
Contributor

lukehsiao commented May 9, 2021

I pushed a master-rebased version to:

https://github.com/YasushiMiyata/fonduer/commits/YasushiMiyata-master.

In particular, note that it's just two clean commits off of 5ab8e9c.

Does that help?

@YasushiMiyata
Copy link
Contributor Author

@lukehsiao, I add you to my repository as a collaborator.

Thank you for the explanation with video.
In the last view, there are same logs but different ID like followings:

28e4d94 - (HEAD -> YasushiMiyata-master) Fix incorrect assertion in test_parser.py
ce9f946 - Fix test code test_postgres.py::test_cand_gen_cascading_delete (Fix #538)
4e7af04 - Fix test code test_postgres.py::test_cand_gen_cascading_delete (Fix #538)
41f34f1 - Fix incorrect assertion in test_parser.py

Maybe, it's cause of this error. I think that original cause is possibly my merge or rebase in the wrong oder.

@YasushiMiyata
Copy link
Contributor Author

Thank you! I check it.

@YasushiMiyata
Copy link
Contributor Author

YasushiMiyata commented May 9, 2021

@lukehsiao, I have tried rebasing and got following result without errors.
Is it okey? Can't i do anything?

(fonduer) (base) miyata@miyata-VB:~/work/fonduer/git/0426/fonduer$ git pull --rebase
remote: Enumerating objects: 2, done.
remote: Counting objects: 100% (2/2), done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 2 (delta 0), reused 1 (delta 0), pack-reused 0
Unpacking objects: 100% (2/2), done.
From https://github.com/YasushiMiyata/fonduer
 * [new branch]      YasushiMiyata-master -> origin/YasushiMiyata-master
Already up to date.
Current branch master is up to date.
(fonduer) (base) miyata@miyata-VB:~/work/fonduer/git/0426/fonduer$ git push
Everything up-to-date

@YasushiMiyata
Copy link
Contributor Author

@lukehsiao, sorry to keep bothering you. I have rebased on my repository successfully and pushed. But some checks end with error other than codes... Meanwhile, my rebase and push process is here:

git rebase upstream/master
<remove conflicts>
git add ./tests/test_postgres.py 
git rebase --continue
<remove conflicts>
git add ./tests/parser/test_parser.py 
git rebase --continue
 ... No changes - did you forget to use 'git add'? ...
git rebase --skip
<remove conflicts>
git add tests/test_postgres.py 
git rebase --continue
<Applying...>
git rebase upstream/master
<Applying...>
git pull --rebase
git push

@lukehsiao
Copy link
Contributor

lukehsiao commented May 10, 2021

This doesn't look like a correct rebase. There should be only 2 commits, I believe. Right now you have 12 kind of messy ones. Maybe it's easier to just open a new PR?

You'll probably also want to sync your fork with this repo so you don't run into issues like this in the future as well? I'd sync each time a PR is merged.

@YasushiMiyata
Copy link
Contributor Author

I also think it's better way to start over this PR. I close this PR. Thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants