Add word level element "prediction" for Page XML datasets #172
new_word = False

for pos, entry in enumerate(positions[1:]):
    if entry[0] == " ":
Suggestion: `if entry[0] in whitespace` with `from string import whitespace`, since you never know what's in the codec...
One might even consider giving the user the possibility to supply a list of word breaking characters on the command line.
`from string import whitespace` was added in 050276e for the case where the user selects `whitespace` as the word boundary.
The default mode for `--pagexml_word_boundary` was set to `unicode`, which uses the Unicode word boundaries.
* andbue :: 2020-05-19 14:21 Tue:
Suggestion: `if entry[0] in whitespace` with `from string import whitespace` – you never know what's in the codec...
One might even consider giving the user the possibility to supply a list of word breaking characters on the command line.
string.whitespace is only ASCII whitespace so it won't deal with all the
Unicode whitespaces out there.
There is a Unicode algorithm for word breaking [0] that is implemented
in the regex package. It is far from perfect but better than breaking at
whitespace as some of the esoteric ones are sometimes used for
typographic purposes.
[0] https://unicode.org/reports/tr29/#Word_Boundary_Rules
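
To illustrate the first point, here is a minimal sketch using only the standard library. It compares membership in `string.whitespace` with `str.isspace()`, which follows the Unicode character database. The sample characters and the helper name `whitespace_coverage` are chosen for illustration only and are not part of this PR; the TR29 word breaking from the regex package is not shown here.

```python
import string

# A few space characters that can show up in transcriptions (illustrative selection).
CANDIDATES = {
    "SPACE": " ",
    "NO-BREAK SPACE": "\u00a0",
    "THIN SPACE": "\u2009",
    "IDEOGRAPHIC SPACE": "\u3000",
}


def whitespace_coverage(chars):
    """Map each name to (is in string.whitespace, str.isspace() result)."""
    return {
        name: (ch in string.whitespace, ch.isspace())
        for name, ch in chars.items()
    }


coverage = whitespace_coverage(CANDIDATES)
for name, (ascii_ws, unicode_ws) in coverage.items():
    print(f"{name:18} string.whitespace={ascii_ws!s:5} str.isspace()={unicode_ws}")
```

Only the plain ASCII space is found in `string.whitespace`; the other three are recognised by `str.isspace()` alone.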
Adds the flag `--pagexml_word_level` to `predict.py`, which enables the generation of `Word` elements when predicting on Page XML datasets.

"Words" are generated from the `positions` returned by the prediction.

All character sequences separated by spaces are classified as a "word".
All (non-leading/non-trailing) spaces are also classified as "words" and saved as `Word` elements.

As far as I know the PAGE XML schema doesn't have any clear decisions on whether whitespaces should be declared in `Word` elements, so this is probably a very subjective decision.
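
The grouping described above could be sketched roughly as follows. This is not the code from the PR: the `(char, start, end)` tuple layout and the function name `group_words` are assumptions made for illustration, and a run of consecutive spaces is treated here as a single "word", which the description leaves open.

```python
from itertools import groupby


def group_words(positions):
    """Group per-character positions into words and space "words".

    `positions` is assumed to be a list of (char, start, end) tuples
    (hypothetical layout). Returns a list of (text, start, end) tuples.
    """
    # Trim leading and trailing spaces, which are not emitted as words.
    first, last = 0, len(positions)
    while first < last and positions[first][0] == " ":
        first += 1
    while last > first and positions[last - 1][0] == " ":
        last -= 1

    words = []
    # Consecutive characters of the same kind (space / non-space) form one word.
    for _, run in groupby(positions[first:last], key=lambda p: p[0] == " "):
        run = list(run)
        text = "".join(char for char, _, _ in run)
        words.append((text, run[0][1], run[-1][2]))
    return words


example = [(" ", 0, 5), ("a", 5, 10), ("b", 10, 15),
           (" ", 15, 20), ("c", 20, 25), (" ", 25, 30)]
result = group_words(example)
print(result)
```

Each returned tuple carries the start of its first character and the end of its last one, which is the kind of extent a `Word` element would need.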