Add word level element "prediction" for Page XML datasets #172

maxnth · 2020-05-19T10:53:54Z

Adds the flag --pagexml_word_level to predict.py which enables the generation of Word elements when predicting on Page XML datasets.

"Words" are generated from the positions returned by the prediction.
All character sequences separated by spaces are classified as a "word".
All (non-leading/non-trailing) spaces are also classified as "words" and saved as Word elements.
As far as I know, the PAGE XML schema doesn't clearly specify whether whitespace should be stored in Word elements, so this is a fairly subjective choice.
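
A minimal sketch of the grouping described above, not the code from this PR. It assumes a simplified, hypothetical input of `(char, x_start, x_end)` tuples per predicted character; the function name `split_words` and the tuple layout are illustrative only and do not reflect Calamari's actual position objects.

```python
from itertools import groupby


def split_words(chars):
    """Group per-character positions into word and whitespace segments.

    `chars` is a list of (char, x_start, x_end) tuples (hypothetical layout).
    Returns a list of (text, x_start, x_end) segments.
    """
    # Drop leading and trailing spaces; they do not become Word elements.
    start, end = 0, len(chars)
    while start < end and chars[start][0] == " ":
        start += 1
    while end > start and chars[end - 1][0] == " ":
        end -= 1

    segments = []
    # Consecutive non-space characters form one word; consecutive
    # inner spaces form one whitespace "word".
    for _is_space, group in groupby(chars[start:end], key=lambda c: c[0] == " "):
        group = list(group)
        text = "".join(c[0] for c in group)
        segments.append((text, group[0][1], group[-1][2]))
    return segments
```

Each returned segment spans from the start of its first character to the end of its last one, which is the extent a Word element would need to cover.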

calamari_ocr/ocr/datasets/pagexml_dataset/dataset.py

andbue · 2020-05-19T11:45:40Z

calamari_ocr/ocr/datasets/pagexml_dataset/dataset.py

+        new_word = False
+
+        for pos, entry in enumerate(positions[1:]):
+            if entry[0] == " ":


Suggestion: `if entry[0] in whitespace` with `from string import whitespace` – you never know what's in the codec...
One might even consider letting the user supply a list of word-breaking characters on the command line.

maxnth · 2020-05-20

`from string import whitespace` was added in 050276e and is used when the user selects whitespace as the word boundary.
The default mode for `--pagexml_word_boundary` was set to unicode, which uses the Unicode word boundaries.

calamari_ocr/ocr/datasets/pagexml_dataset/dataset.py

mittagessen · 2020-05-19T12:31:17Z

* andbue :: 2020-05-19 14:21 Tue:

> Suggestion: `if entry[0] in whitespace` with `from string import whitespace` – you never know what's in the codec... One might even consider giving the user the possibility to supply a list of word breaking characters on the command line.

`string.whitespace` contains only ASCII whitespace, so it won't handle all the Unicode whitespace characters out there. There is a Unicode algorithm for word breaking [0] that is implemented in the `regex` package. It is far from perfect, but it is better than breaking at every whitespace character, since some of the more esoteric ones are sometimes used for typographic purposes.

[0] https://unicode.org/reports/tr29/#Word_Boundary_Rules
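
To illustrate the gap between the two whitespace checks, here is a small stdlib-only sketch. The helper names are hypothetical, and `str.isspace()` is used only as a stand-in for "knows about Unicode whitespace"; it is not the UAX #29 word-breaking algorithm mentioned above.

```python
import string

NBSP = "\u00a0"      # NO-BREAK SPACE
EM_SPACE = "\u2003"  # EM SPACE


def is_ascii_whitespace(ch):
    """True if the single character `ch` is in string.whitespace (ASCII only)."""
    return len(ch) == 1 and ch in string.whitespace


def is_unicode_whitespace(ch):
    """True if the single character `ch` is whitespace according to str.isspace()."""
    return len(ch) == 1 and ch.isspace()
```

With these helpers, an ordinary space passes both checks, while `NBSP` and `EM_SPACE` pass only the Unicode-aware one. Whether such a character should actually end a word is a separate question, which is why a plain whitespace test in either form can split in the wrong places.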

add word level element prediction for Page XML datasets

640c53b

maxnth requested review from ChWick, andbue and chreul May 19, 2020 10:55

maxnth changed the title ~~Add word level element prediction for Page XML datasets~~ Add word level element "prediction" for Page XML datasets May 19, 2020

andbue requested changes May 19, 2020


maxnth added 5 commits May 19, 2020 14:53

remove unnecessary enumerate

5005a88

add ns declaration during Word element creation

97cd61b

add unicode word boundaries for Page XML datasets

050276e

fix dependencies for gitlab runner

91185c6

add and modify constraints for empty positions and words

fc27f21

kba mentioned this pull request Apr 28, 2021

PAGE without words OCR-D/page-to-alto#5

Open

maxnth closed this Aug 2, 2021

ChWick deleted the feature/pageXML_word_level branch September 9, 2021 10:01


maxnth commented May 19, 2020

andbue May 19, 2020

maxnth May 20, 2020

mittagessen commented May 19, 2020 via email
