# PAGE: How to add TextEquiv and consistency rules (OCR-D/assets#16) #82

## Conversation
Thanks for going forward with this! IMO the overall direction of the language is right, but there are a few points with room for improvement:
- When we attach text recognition results, we might have multiple variants/hypotheses. Annotating those should be encouraged. Each one should go into a separate `TextEquiv` with ascending `@index` and (then mandatory) descending `@conf` (see the sketch after this list).
- Since this is the intended semantics of the `TextEquiv` sequence, we should forbid other uses within one PAGE. So either there is only one `TextEquiv`, or a list of alternative hypotheses with differentiating attributes.
- I find NBSP is a good idea for this corner case. The second sentence could be made clearer, though, by inserting "at the start or end of some `TextEquiv`" before the comma.
- Terminologically, I think we should stick to "recognized text" or "text results" throughout and ditch "text equivalence".
- I would try to avoid the possessive form of XML identifiers. How about: "The text of each `<pg:Word>` must be equal to the texts of all `<pg:Glyph>` contained by it, concatenated directly."
- Maybe (just to be as clear as possible) one should even mention that concatenation adds strings in between but never leading (first) or trailing (last).
- There might be valid or even necessary exceptions to this general principle in corner cases. Before we make this a hard requirement, we should search for those cases in our data semi-interactively. See this comment.
- As far as I understood @tboenig in this comment, failing the consistency assertion would not allow a processor to stop, but require it to proceed with the content of the top-most compliant hierarchy level.
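A minimal sketch of such an annotation (element and attribute names as in the PAGE schema; the ids, words, and confidence values are invented for illustration):

```xml
<pg:Word id="w_1">
  <!-- top-ranked hypothesis: lowest @index, highest @conf -->
  <pg:TextEquiv index="1" conf="0.92">
    <pg:Unicode>Foot</pg:Unicode>
  </pg:TextEquiv>
  <!-- alternative hypothesis: higher @index, lower @conf -->
  <pg:TextEquiv index="2" conf="0.13">
    <pg:Unicode>Fool</pg:Unicode>
  </pg:TextEquiv>
</pg:Word>
```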
@bertsky All valid and helpful, thank you. I will address them ASAP and update the PR.
Stopping in a loud way seems less error-prone to me than silently relying on a convention. But I don't feel strongly about that, as long as the convention is clear.
What does that mean? How do you determine which one is "compliant" if the line text and the concatenated word texts differ? @tboenig Can you clarify OCR-D/assets#12 (comment)?
Same here.
Sorry, I misrepresented the rule stated by Matthias. I think he really meant using the lowest available hierarchy level. This may be difficult for the processor, though (it might require re-building the text for the needed level from a lower level).
Every time there is a change at a higher level, it has to be transferred to the lower levels. There are two cases that make this difficult:
- the letters 'rn' at line level become 'm', which means that two glyphs have to be combined into one glyph;
- the character 'm' becomes 'rn' at line or word level, meaning that a glyph must be split.
The same problem can also occur at the word level, so that a word has to be split or combined.
A simple solution could be to add an index to the `TextEquiv` in each of the above cases and thus mark it as independent of the text results from the lower levels. This would mean that all text results with the same index have to match the rules mentioned earlier in the discussion (see the sketch below). But I don't know enough about the PAGE XML schema to say whether that is allowed/possible.
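A hypothetical sketch of that idea (whether the schema permits it was exactly the open question; ids and words invented):

```xml
<pg:Word id="w_2">
  <!-- index 1: original recognition, consistent with the glyphs below -->
  <pg:TextEquiv index="1"><pg:Unicode>rn</pg:Unicode></pg:TextEquiv>
  <!-- index 2: later word-level correction ('rn' merged into 'm');
       no glyph carries index 2, so it would be exempt from the
       glyph-level consistency rule -->
  <pg:TextEquiv index="2"><pg:Unicode>m</pg:Unicode></pg:TextEquiv>
  <pg:Glyph id="g_1">
    <pg:TextEquiv index="1"><pg:Unicode>r</pg:Unicode></pg:TextEquiv>
  </pg:Glyph>
  <pg:Glyph id="g_2">
    <pg:TextEquiv index="1"><pg:Unicode>n</pg:Unicode></pg:TextEquiv>
  </pg:Glyph>
</pg:Word>
```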
page.md (outdated):

> **Example:** `<pg:Word>` has text `Fool` but contains `<pg:Glyph>` whose text results, when concatenated, form the text `Foot`. The processor must proceed as if the `<pg:Word>` had the text `Foot`.
This means that the text results of the glyphs will overwrite all other results.
Question: why save a `TextEquiv` for each level if the `TextEquiv`s are overwritten by the deeper level anyway?
What will happen if multiple `TextEquiv`s exist? Must the indexes match?
Different segmentations are an issue, cf. #72, but this proposal here is simpler.
First of all, it allows us to identify inconsistencies between the levels.
Secondly, if there are inconsistencies, it's always the lowest level that "wins".
IMHO, this should not happen at all: a processor should be expected to either produce consistent `TextEquiv` on all levels (processor changed words -> adapt lines and blocks) or delete those `TextEquiv` that cannot be kept consistent (processor changed words -> delete glyphs).
> Why save a `TextEquiv` for each level if the `TextEquiv`s are overwritten by the deeper level anyway?
Convenience, I suppose. Since enforcing these consistency rules requires generating and comparing text anyway, we could probably prune the upper levels after processing and repopulate them before processing.
> What will happen if multiple `TextEquiv`s exist? Must the indexes match?
Yes, that's a good point. I'll add a note on that.
> What will happen if multiple `TextEquiv`s exist? Must the indexes match?
Updated. @VolkerHartmann, please have a look at f94c9bb.
@VolkerHartmann Can you please elaborate on what you mean by that? I thought PAGE annotations are never changed (except perhaps for dev mode with our new rollback operation, where it would be overwritten). So far I was told that processors can only add new annotations.
@kba I am quite surprised by the new consistency rules for multiple results in f94c9bb. In my understanding, higher levels necessarily need more alternatives to account for the same information, because they are combinations of their sub-elements' alternatives. Even if the combination is not expanded in full, e.g. a …
I'm struggling with that part, as you noticed, but wanted to propose something lest we forget this additional source of inconsistencies. I'd be happy about better/cleaner wording and less convoluted rules. "Same cardinality of multiple text equivs" throughout only makes sense when combining results, not for alternatives. Would it be acceptable if consistency was restricted to the "first" text result? This would leave the processor the freedom to handle alternatives as it pleases, and we wouldn't need to define it (and implement it, which I imagine to be quite hairy considering all the edge cases and different interpretations of what multiple text results may mean). Essentially, we'd only ensure consistency for the "canonical" text results.
> If multiple …

I'd recommend against that. That would shift the burden of checking consistency to the consuming processor again (at least for those processors that do consume multiple results / alternatives). Nevertheless, I completely agree something must be stated about this in the spec, and hopefully also enforced by the validator in core. I think that most of your original idea can be saved if you replace the notion of identical cardinality by subsumption: a level's text result must be derivable as a concatenation of one alternative from each of its sub-elements. A back-tracking sketch:

```python
def validate_level(text, subs):
    if not subs:
        return not text  # end of recursion: consistent only if the text is fully consumed
    subtexts = subs[0].get_TextEquiv()
    for subtext in subtexts:
        prefix = subtext.get_Unicode()
        # for Word/TextLine/TextRegion: append space/newline delimiter to prefix here
        if text.startswith(prefix):
            if validate_level(text[len(prefix):], subs[1:]):
                return True  # consistent
            # otherwise backtrack to the next alternative
    return False  # inconsistent

# e.g. at some Word element word:
texts = word.get_TextEquiv()
glyphs = word.get_Glyph()
for text in texts:
    if not validate_level(text.get_Unicode(), glyphs):
        raise Exception("word %s alternative '%s' (index %d) is inconsistent w.r.t. its constituent glyphs"
                        % (word.get_id(), text.get_Unicode(), text.get_index()))
```
page.md:

> ## AlternativeImage for derived images
>
> To encode images derived from the original image, the `<pc:AlternativeImage>` should be used. Its `filename` attribute should reference the URL of the derived image.
"should" or "MUST"?
MUST
Ah, now I remember why it is `SHOULD` and not `MUST`: we do not use this mechanism at the moment. Rather, we explicitly demand input file groups on call to a processor. So there is little benefit in setting it in the PAGE XML file.
If we demand it, we must implement it...
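For reference, a minimal sketch of the mechanism under discussion (file names and paths invented; `imageFilename` and `filename` as in the PAGE schema):

```xml
<pc:Page imageFilename="OCR-D-IMG/page_0001.tif">
  <!-- derived (e.g. binarized) version of the original page image -->
  <pc:AlternativeImage filename="OCR-D-IMG-BIN/page_0001.bin.png"/>
</pc:Page>
```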
Still no ideas about the problem with multiple (alternative) results? No opinions on my proposal?
Indeed, I am currently leaning towards the "consistency checked for first alternative only" approach, but as this is a fairly tricky issue I want to make sure that I am not overlooking anything. I will share my opinion by Tuesday next week at the latest.
For lack of a better idea, I'd opt for the way currently proposed by @kba, i.e. we only concern ourselves with the consistency between the upper level and the top-ranked (index position = 1) alternative result on the lower level. Anything else would too quickly escalate into gazillions of combinations which would be impossible to keep consistent. Even using subsumption instead, as proposed here, I cannot currently see how this would reduce the scope to keep track of (but I will be happily enlightened).

What this entails, however, is that any alternatives present on a lower level that are not consistent with the upper level MUST have an index position greater than 1. To illustrate: … In this case, the …

Needless to say, we will strive for a clear explanation (ideally illustrated by examples) asap. Last but not least, pinging @finkf again since this will likely also impact the work on ocrd-postcorrection - and I believe this issue may have come up already earlier in the development of PoCoTo?
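One possible reading of that rule, as a hedged sketch (ids and words invented; `@index` taken here to be 1-based, matching "index position = 1" above):

```xml
<pg:Word id="w_3">
  <pg:TextEquiv index="1"><pg:Unicode>Foot</pg:Unicode></pg:TextEquiv>
  <pg:Glyph id="g_1">
    <pg:TextEquiv index="1"><pg:Unicode>F</pg:Unicode></pg:TextEquiv>
  </pg:Glyph>
  <pg:Glyph id="g_2">
    <pg:TextEquiv index="1"><pg:Unicode>o</pg:Unicode></pg:TextEquiv>
  </pg:Glyph>
  <pg:Glyph id="g_3">
    <pg:TextEquiv index="1"><pg:Unicode>o</pg:Unicode></pg:TextEquiv>
  </pg:Glyph>
  <pg:Glyph id="g_4">
    <!-- the index-1 glyph readings concatenate to 'Foot', matching the word -->
    <pg:TextEquiv index="1"><pg:Unicode>t</pg:Unicode></pg:TextEquiv>
    <!-- a reading inconsistent with the word text must sit at an index > 1 -->
    <pg:TextEquiv index="2"><pg:Unicode>l</pg:Unicode></pg:TextEquiv>
  </pg:Glyph>
</pg:Word>
```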
We need to be more precise here: IINM, "keeping track of" in that context can only mean validating some element's `TextEquiv` against its constituent (i.e. lower) elements' `TextEquiv`s, should such exist. There is no need to go into a full combination of all `TextEquiv`s of all its sub-elements here (or with an even larger scope, which then indeed would be gazillions): we already know the result text we are looking for. We can do that search with a simple back-tracking recursion, as illustrated above, in a function that can be used for every level (it only needs to know which element delimiters the respective level requires).

The complexity (and, practically, the stack depth) of that recursion grows with the number of sub-elements times their average ambiguity. Even in the worst case (one full TextLine of very short Words with lots of alternative readings, or equivalently one very long Word with lots of alternative Glyph readings), that problem will be computationally small, IMO. And that cost will be paid regardless of whether we validate consistency in the workspace or in the consumer itself, because any sensible processor that does take alternatives and does consider multiple hierarchy levels (and thus necessitates consistency checking in the first place) will rely on alternatives subsumption.

So to me this is actually a configuration issue: workflows that have such processors would need that kind of validation at some point, but workflows without producers of ambiguity or without multi-level consumers would not. If my assessment is correct, always validating all levels but merely for index 1 is both too much and too little at the same time.
One could always use an empty Unicode element to represent the empty string. Or one could use a special character like …
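For the first option, a sketch (prefix as used elsewhere in this thread):

```xml
<!-- an empty Unicode element representing the empty string -->
<pg:TextEquiv index="1">
  <pg:Unicode/>
</pg:TextEquiv>
```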
@bertsky OK, that makes it clear, thanks for elaborating. However, when we discussed this again in the wider OCR-D coordination group, we felt that the back-tracking recursion approach would introduce considerable complexity in the code with regard to maintenance and testing, compared to the added benefit, and that we would thus prefer to proceed with the implementation of the "consistency checked for first alternative only" method instead, to ensure at least a practical level of consistency. We hope this is acceptable for you and of course remain open to further discussions.
@cneud Sure, no problem. After all, I can still do it in the processor. Hopefully, those consistency problems can be avoided by design anyway once we extend PAGE with graph-based or matrix-based alternatives.
Sorry to bring this up so late, but now I can see where this is going. I don't think this consistency-for-index-1 principle works just yet:

- Why again tie index 1 of the containing and the contained element together? As mentioned earlier, that seems overly restrictive (as it did when tying each index `n` together). Take Tesseract for example: it needs to be able to output a Word which seems more likely from its dictionary or language model, and which typically is not identical to the concatenation of its 1-best Glyphs.
- In case of failure, why does it require using the lowest annotation level globally (on the document level), instead of the lower of the 2 levels in question, for that particular spot within the document?
- If validation were a yes-no step provided by core, who would assist the processor in determining at which place to re-combine texts? On the other hand, if validation delivered a structured result, why not just apply the repair right away?
- Lastly, why not just repeat the module causing the error by the general means of the quality control module?
I see your point. In such an (inconsistent) case, what would you recommend the source of truth should be, the word or the glyphs? Your example seems particular to the word-glyph relation, or do you expect a TextLine and its constituent Words / a TextRegion and its constituent TextLines to diverge because of a language model/post-correction?
That was not the intended meaning. I had in mind an algorithm like this:

```python
for elem in page.get_Page().get_TextRegion():  # or Word or Glyph, nested
    text = elem.get_TextEquiv()[0].get_Unicode()
    if not page.isConsistent(elem):
        text = page.concatenateConsistently(elem)  # construct from the lowest consistent level at this particular place in the document
    # work with text
```
You mean make consistency errors fatal?
Keep in mind that the origin of this proposal about consistency was to spot incorrect order of words in a line. But if strict consistency rules restrict valid use cases too much, might it not be best to just scrap them?
I guess it all depends on what you ask for: if I need Glyph-level annotation and ask Tesseract to recognize one TextLine at a time, accessing results with … A consumer that needs Glyphs would naturally like to ignore inconsistent higher levels, but a consumer that needs Words or TextLines would rather like the input repaired from the next-lower consistent (but not necessarily the lowest) level. So maybe we should first look at the use-cases of producing text annotation on multiple levels again (please cmiiw): …
Case 3 really is a configuration issue and should probably be dealt with differently (e.g. by calling the producer exactly once per required level; see #77). Coming back to our Tesseract example: …
Thus, I was probably wrong, and the "index 1 stays index 1" principle can be upheld sensibly. Even for processors using alternatives, if configured appropriately, the producer and consumer will agree on the hierarchy level they operate on (i.e. provide/expect alternatives on), and the other levels will merely contain one reading.
Oh, I see. So your intended meaning was an element-local, next-consistent-level backup. But then the wording on L119-120 should IMO be changed to: … Regarding the actual code: how do you get a consistent (repaired) TextRegion text from the page element in the 4th line? You probably meant it this way:

```python
if elem.isConsistent():
    text = elem.get_TextEquiv()[0].get_Unicode()
else:
    # construct from a lower level at this particular place in the document;
    # probably needs to be recursive to find the next-highest lower level
    # which is still consistent in itself
    text = elem.concatenateConsistently()
# work with text
```
You are absolutely right. Because inconsistency could be unavoidable in some use cases, while repairs remain possible, this should not be blocked but repaired.
Should be mostly obvious, but feedback appreciated on wording and on handling whitespace (the current proposal is to use NBSP, since most string-stripping algorithms do not include it by default and yet it is Unicode `WHITESP=Y`). @bertsky @finkf
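A sketch of that convention (hypothetical content; the NBSP is written as a character reference for visibility):

```xml
<!-- leading whitespace encoded as NBSP (U+00A0) so that naive
     stripping does not silently drop it -->
<pg:TextEquiv index="1">
  <pg:Unicode>&#xA0;word</pg:Unicode>
</pg:TextEquiv>
```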