Add word level element "prediction" for Page XML datasets #172

Closed
wants to merge 6 commits into from

maxnth
Member

@maxnth maxnth commented May 19, 2020

Adds the flag --pagexml_word_level to predict.py, which enables the generation of Word elements when predicting on Page XML datasets.

"Words" are generated from the positions returned by the prediction.
Every character sequence delimited by spaces is classified as a "word".
All (non-leading/non-trailing) spaces are also classified as "words" and saved as Word elements.
As far as I know, the PAGE XML schema doesn't clearly specify whether whitespace should be represented as Word elements, so this is probably a rather subjective decision.
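
To make the grouping rule above concrete, here is a minimal sketch. It is not the code from this PR: the function name group_into_words and the (character, start, end) tuple layout are assumptions made for illustration, and merging consecutive spaces into one segment is also an assumption, since the description does not say how runs of spaces are handled.

```python
from typing import List, Tuple

# Hypothetical record layout: (text, start_x, end_x). The real structure
# returned by the prediction may differ.
Segment = Tuple[str, int, int]


def group_into_words(positions: List[Segment]) -> List[Segment]:
    """Group single-character positions into word and space segments.

    Runs of non-space characters form one segment, runs of spaces form
    their own segment, and leading/trailing space segments are dropped.
    """
    segments: List[Segment] = []
    for char, start, end in positions:
        is_space = char == " "
        # Extend the previous segment if it is of the same kind
        # (space vs. non-space); otherwise start a new one.
        if segments and segments[-1][0].endswith(" ") == is_space:
            text, seg_start, _ = segments[-1]
            segments[-1] = (text + char, seg_start, end)
        else:
            segments.append((char, start, end))
    # Leading and trailing spaces are not emitted as "words".
    if segments and segments[0][0].startswith(" "):
        segments.pop(0)
    if segments and segments[-1][0].endswith(" "):
        segments.pop()
    return segments
```

Each returned segment would then correspond to one Word element, with its horizontal extent taken from the first and last character it covers.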

@maxnth maxnth requested review from ChWick, andbue and chreul May 19, 2020 10:55
@maxnth maxnth changed the title Add word level element prediction for Page XML datasets Add word level element "prediction" for Page XML datasets May 19, 2020
calamari_ocr/ocr/datasets/pagexml_dataset/dataset.py Outdated
new_word = False

for pos, entry in enumerate(positions[1:]):
    if entry[0] == " ":
Member

Suggestion: if entry[0] in whitespace with from string import whitespace – you never know what's in the codec...
One might even consider letting the user supply a list of word-breaking characters on the command line.

Member Author

from string import whitespace was added in 050276e and is used when the user selects whitespace as the word boundary.
The default mode for --pagexml_word_boundary was set to unicode, which uses the Unicode word boundaries.
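
For illustration, the whitespace mode with a configurable set of break characters could look roughly like the sketch below. The function split_on_breaks and its break_chars parameter are hypothetical and not taken from commit 050276e; the unicode mode is not shown here.

```python
from string import whitespace
from typing import Iterable, List


def split_on_breaks(text: str, break_chars: Iterable[str] = whitespace) -> List[str]:
    """Split text into words at any character contained in break_chars.

    Hypothetical helper: break_chars defaults to string.whitespace but
    could be filled from a user-supplied command-line value.
    """
    breaks = set(break_chars)
    words: List[str] = []
    current: List[str] = []
    for char in text:
        if char in breaks:
            # A break character closes the word collected so far.
            if current:
                words.append("".join(current))
                current = []
        else:
            current.append(char)
    if current:
        words.append("".join(current))
    return words
```

Note that string.whitespace only contains the ASCII whitespace characters, so a no-break space (U+00A0), for example, does not split a word in this mode unless it is passed in explicitly.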

calamari_ocr/ocr/datasets/pagexml_dataset/dataset.py Outdated
@mittagessen
mittagessen commented May 19, 2020 via email

@maxnth maxnth closed this Aug 2, 2021
@ChWick ChWick deleted the feature/pageXML_word_level branch September 9, 2021 10:01
3 participants