# PAGE: How to add TextEquiv and consistency rules (OCR-D/assets#16) #82

## Conversation
Thanks for going forward with this! IMO the overall direction of the language is right, but there are a few points with room for improvement:
- When we attach text recognition results, we might have multiple variants/hypotheses. Annotating those should be encouraged. Each one should go into a separate `TextEquiv` with ascending `@index` and (then mandatory) descending `@conf` (see the sketch after this list).
- Since this is the intended semantics of the `TextEquiv` sequence, we should forbid other uses within one PAGE. So either there is only one `TextEquiv`, or a list of alternative hypotheses with differentiating attributes.
- I find NBSP is a good idea for this corner case. The second sentence could be made clearer, though, by inserting "at the start or end of some `TextEquiv`" before the comma.
- Terminologically, I think we should stick to "recognized text" or "text results" throughout and ditch "text equivalence".
- I would try to avoid the possessive form of XML identifiers. How about: "The text of each `<pg:Word>` must be equal to the texts of all `<pg:Glyph>` contained by it, concatenated directly."
- Maybe (just to be as clear as possible) one should even mention that concatenation adds strings in between but never leading (first) or trailing (last).
- There might be valid or even necessary exceptions to this general principle in corner cases. Before we make this a hard requirement, we should search for those cases in our data semi-interactively. See this comment.
- As far as I understood @tboenig in this comment, failing the consistency assertion would not allow a processor to stop, but require it to proceed with the content of the top-most compliant hierarchy level.
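A minimal sketch of such an annotation (element and attribute names as in the PAGE schema; the ids, words, and confidence values are invented for illustration):

```xml
<pg:Word id="w_1">
  <!-- top-ranked hypothesis: lowest @index, highest @conf -->
  <pg:TextEquiv index="1" conf="0.92">
    <pg:Unicode>Foot</pg:Unicode>
  </pg:TextEquiv>
  <!-- alternative hypothesis: higher @index, lower @conf -->
  <pg:TextEquiv index="2" conf="0.13">
    <pg:Unicode>Fool</pg:Unicode>
  </pg:TextEquiv>
</pg:Word>
```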
@bertsky All valid and helpful, thank you. I will address them ASAP and update the PR.
Stopping in a loud way seems less error-prone to me than silently relying on a convention. But I don't feel strongly about that, as long as the convention is clear.
What does that mean? How do you determine which one is "compliant" if the line text and the concatenated word texts differ? @tboenig Can you clarify OCR-D/assets#12 (comment)?
Same here.
Sorry, I misrepresented the rule stated by Matthias. I think he really meant using the lowest available hierarchy level. This may be difficult for the processor, though (it might require re-building the text for the needed level from a lower level).
Every time there is a change at a higher level, it has to be transferred to the lower levels. There are two cases that make this difficult:
- the letters 'rn' at line level become 'm', which means that two glyphs have to be combined into one glyph;
- the character 'm' becomes 'rn' at line or word level, meaning that a glyph must be split.
The same problem can also occur at the word level, so that a word has to be split or combined.
A simple solution could be to add an index to the `TextEquiv` in each of the above cases and thus mark it as independent of the text results from the lower levels. This would mean that all text results with the same index have to match the rules mentioned earlier in the discussion (see the sketch below). But I don't know enough about the PAGE XML schema to say whether that is allowed/possible.
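A hypothetical sketch of that idea (whether the schema permits it was exactly the open question; ids and words invented):

```xml
<pg:Word id="w_2">
  <!-- index 1: original recognition, consistent with the glyphs below -->
  <pg:TextEquiv index="1"><pg:Unicode>rn</pg:Unicode></pg:TextEquiv>
  <!-- index 2: later word-level correction ('rn' merged into 'm');
       no glyph carries index 2, so it would be exempt from the
       glyph-level consistency rule -->
  <pg:TextEquiv index="2"><pg:Unicode>m</pg:Unicode></pg:TextEquiv>
  <pg:Glyph id="g_1">
    <pg:TextEquiv index="1"><pg:Unicode>r</pg:Unicode></pg:TextEquiv>
  </pg:Glyph>
  <pg:Glyph id="g_2">
    <pg:TextEquiv index="1"><pg:Unicode>n</pg:Unicode></pg:TextEquiv>
  </pg:Glyph>
</pg:Word>
```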
page.md (outdated):

> **Example:** `<pg:Word>` has text `Fool` but contains `<pg:Glyph>` whose text results, when concatenated, form the text `Foot`. The processor must proceed as if the `<pg:Word>` had the text `Foot`.
This means that the text results of the glyphs will overwrite all other results.
Question: why save a `TextEquiv` for each level if the `TextEquiv`s are overwritten by the deeper level anyway?
What will happen if multiple `TextEquiv`s exist? Must the indexes match?
Different segmentations are an issue, cf. #72, but this proposal here is simpler.
First of all, it allows us to identify inconsistencies between the levels.
Secondly, if there are inconsistencies, it's always the lowest level that "wins".
IMHO, this should not happen at all: a processor should be expected to either produce consistent `TextEquiv` on all levels (processor changed words -> adapt lines and blocks) or delete those `TextEquiv` that cannot be kept consistent (processor changed words -> delete glyphs).
> Why save a `TextEquiv` for each level if the `TextEquiv`s are overwritten by the deeper level anyway?
Convenience, I suppose. Since enforcing these consistency rules requires generating and comparing text anyway, we could probably prune the upper levels after processing and repopulate them before processing.
> What will happen if multiple `TextEquiv`s exist? Must the indexes match?
Yes, that's a good point. I'll add a note on that.
> What will happen if multiple `TextEquiv`s exist? Must the indexes match?
Updated. @VolkerHartmann, please have a look at f94c9bb.
@VolkerHartmann Can you please elaborate on what you mean by that? I thought PAGE annotations are never changed (except perhaps for dev mode with our new rollback operation, where it would be overwritten). So far I was told that processors can only add new annotations.
@kba I am quite surprised by the new consistency rules for multiple results in f94c9bb. In my understanding, higher levels necessarily need more alternatives to account for the same information, because they are combinations of their sub-elements' alternatives. Even if the combination is not expanded in full, e.g. a …
I'm struggling with that part, as you noticed, but wanted to propose something lest we forget this additional source of inconsistencies. I'd be happy about better/cleaner wording and less convoluted rules. "Same cardinality of multiple text equivs" throughout only makes sense when combining results, not for alternatives. Would it be acceptable if consistency was restricted to the "first" text result? This would leave the processor the freedom to handle alternatives as it pleases, and we wouldn't need to define it (and implement it, which I imagine to be quite hairy considering all the edge cases and different interpretations of what multiple text results may mean). Essentially, we'd only ensure consistency for the "canonical" text results.
> If multiple …

I'd recommend against that. That would shift the burden of checking consistency to the consuming processor again (at least for those processors that do consume multiple results / alternatives). Nevertheless, I completely agree something must be stated about this in the spec, and hopefully also enforced by the validator in core. I think that most of your original idea can be saved if you replace the notion of identical cardinality by subsumption: a level's text result must be derivable as a concatenation of one alternative from each of its sub-elements. A back-tracking sketch:

```python
def validate_level(text, subs):
    if not subs:
        return not text  # end of recursion: consistent only if the text is fully consumed
    subtexts = subs[0].get_TextEquiv()
    for subtext in subtexts:
        prefix = subtext.get_Unicode()
        # for Word/TextLine/TextRegion: append space/newline delimiter to prefix here
        if text.startswith(prefix):
            if validate_level(text[len(prefix):], subs[1:]):
                return True  # consistent
            # otherwise backtrack to the next alternative
    return False  # inconsistent

# e.g. at some Word element word:
texts = word.get_TextEquiv()
glyphs = word.get_Glyph()
for text in texts:
    if not validate_level(text.get_Unicode(), glyphs):
        raise Exception("word %s alternative '%s' (index %d) is inconsistent w.r.t. its constituent glyphs"
                        % (word.get_id(), text.get_Unicode(), text.get_index()))
```
page.md:

> ## AlternativeImage for derived images
>
> To encode images derived from the original image, the `<pc:AlternativeImage>` should be used. Its `filename` attribute should reference the URL of the derived image.
"should" or "MUST"?
MUST
Ah, now I remember why it is `SHOULD` and not `MUST`: we do not use this mechanism at the moment. Rather, we explicitly demand input file groups on call to a processor. So there is little benefit in setting it in the PAGE XML file.
If we demand it, we must implement it...
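For reference, a minimal sketch of the mechanism under discussion (file names and paths invented; `imageFilename` and `filename` as in the PAGE schema):

```xml
<pc:Page imageFilename="OCR-D-IMG/page_0001.tif">
  <!-- derived (e.g. binarized) version of the original page image -->
  <pc:AlternativeImage filename="OCR-D-IMG-BIN/page_0001.bin.png"/>
</pc:Page>
```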
Still no ideas about the problem with multiple (alternative) results? No opinions on my proposal?
Indeed, I am currently leaning towards the "consistency checked for first alternative only" approach, but as this is a fairly tricky issue I want to make sure that I am not overlooking anything. I will share my opinion by Tuesday next week at the latest.
For lack of a better idea, I'd opt for the way currently proposed by @kba, i.e. we only concern ourselves with the consistency between the upper level and the top-ranked (index position = 1) alternative result on the lower level. Anything else would too quickly escalate into gazillions of combinations which would be impossible to keep consistent. Even using subsumption instead, as proposed here, I cannot currently see how this would reduce the scope to keep track of (but I will be happily enlightened).

What this entails, however, is that any alternatives present on a lower level that are not consistent with the upper level MUST have an index position greater than 1. To illustrate: … In this case, the …

Needless to say, we will strive for a clear explanation (ideally illustrated by examples) asap. Last but not least, pinging @finkf again since this will likely also impact the work on ocrd-postcorrection - and I believe this issue may have come up already earlier in the development of PoCoTo?
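One possible reading of that rule, as a hedged sketch (ids and words invented; `@index` taken here to be 1-based, matching "index position = 1" above):

```xml
<pg:Word id="w_3">
  <pg:TextEquiv index="1"><pg:Unicode>Foot</pg:Unicode></pg:TextEquiv>
  <pg:Glyph id="g_1">
    <pg:TextEquiv index="1"><pg:Unicode>F</pg:Unicode></pg:TextEquiv>
  </pg:Glyph>
  <pg:Glyph id="g_2">
    <pg:TextEquiv index="1"><pg:Unicode>o</pg:Unicode></pg:TextEquiv>
  </pg:Glyph>
  <pg:Glyph id="g_3">
    <pg:TextEquiv index="1"><pg:Unicode>o</pg:Unicode></pg:TextEquiv>
  </pg:Glyph>
  <pg:Glyph id="g_4">
    <!-- the index-1 glyph readings concatenate to 'Foot', matching the word -->
    <pg:TextEquiv index="1"><pg:Unicode>t</pg:Unicode></pg:TextEquiv>
    <!-- a reading inconsistent with the word text must sit at an index > 1 -->
    <pg:TextEquiv index="2"><pg:Unicode>l</pg:Unicode></pg:TextEquiv>
  </pg:Glyph>
</pg:Word>
```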
We need to be more precise here: IINM, "keeping track of" in that context can only mean validating some element's `TextEquiv` against its constituent (i.e. lower) elements' `TextEquiv`s, should such exist. There is no need to go into a full combination of all `TextEquiv`s of all its sub-elements here (or with an even larger scope, which then indeed would be gazillions): we already know the result text we are looking for. We can do that search with a simple back-tracking recursion, as illustrated above, in a function that can be used for every level (it only needs to know which element delimiters the respective level requires).

The complexity (and, practically, the stack depth) of that recursion grows with the number of sub-elements times their average ambiguity. Even in the worst case (one full TextLine of very short Words with lots of alternative readings, or equivalently one very long Word with lots of alternative Glyph readings), that problem will be computationally small, IMO. And that cost will be paid regardless of whether we validate consistency in the workspace or in the consumer itself, because any sensible processor that does take alternatives and does consider multiple hierarchy levels (and thus necessitates consistency checking in the first place) will rely on alternatives subsumption.

So to me this is actually a configuration issue: workflows that have such processors would need that kind of validation at some point, but workflows without producers of ambiguity or without multi-level consumers would not. If my assessment is correct, always validating all levels but merely for index 1 is both too much and too little at the same time.
One could always use an empty Unicode element to represent the empty string. Or one could use a special character like …
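For the first option, a sketch (prefix as used elsewhere in this thread):

```xml
<!-- an empty Unicode element representing the empty string -->
<pg:TextEquiv index="1">
  <pg:Unicode/>
</pg:TextEquiv>
```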
@bertsky OK, that makes it clear, thanks for elaborating. However, when we discussed this again in the wider OCR-D coordination group, we felt that the back-tracking recursion approach would introduce considerable complexity in the code with regard to maintenance and testing, compared to the added benefit, and that we would thus prefer to proceed with the implementation of the "consistency checked for first alternative only" method instead, to ensure at least a practical level of consistency. We hope this is acceptable for you and of course remain open to further discussions.
@cneud Sure, no problem. After all, I can still do it in the processor. Hopefully, those consistency problems can be avoided by design anyway once we extend PAGE with graph-based or matrix-based alternatives.
Sorry to bring this up so late, but now I can see where this is going. I don't think this consistency-for-index-1 principle works just yet:

- Why again tie index 1 of the containing and the contained element together? As mentioned earlier, that seems overly restrictive (as it did when tying each index `n` together). Take Tesseract for example: it needs to be able to output a Word which seems more likely from its dictionary or language model, and which typically is not identical to the concatenation of its 1-best Glyphs.
- In case of failure, why does it require using the lowest annotation level globally (on the document level), instead of the lower of the 2 levels in question, for that particular spot within the document?
- If validation were a yes-no step provided by core, who would assist the processor in determining at which place to re-combine texts? On the other hand, if validation delivered a structured result, why not just apply the repair right away?
- Lastly, why not just repeat the module causing the error by the general means of the quality control module?
I see your point. In such an (inconsistent) case, what would you recommend the source of truth should be, the word or the glyphs? Your example seems particular to the word-glyph relation, or do you expect a TextLine and its constituent Words / a TextRegion and its constituent TextLines to diverge because of a language model/post-correction?
That was not the intended meaning. I had in mind an algorithm like this:

```python
for elem in page.get_Page().get_TextRegion():  # or Word or Glyph, nested
    text = elem.get_TextEquiv()[0].get_Unicode()
    if not page.isConsistent(elem):
        text = page.concatenateConsistently(elem)  # construct from the lowest consistent level at this particular place in the document
    # work with text
```
You mean make consistency errors fatal?
Keep in mind that the origin of this proposal about consistency was to spot incorrect order of words in a line. But if strict consistency rules restrict valid use cases too much, might it not be best to just scrap them?
I guess it all depends on what you ask for: if I need Glyph-level annotation and ask Tesseract to recognize one TextLine at a time, accessing results with … A consumer that needs Glyphs would naturally like to ignore inconsistent higher levels, but a consumer that needs Words or TextLines would rather like the input repaired from the next-lower consistent (but not necessarily the lowest) level. So maybe we should first look at the use-cases of producing text annotation on multiple levels again (please cmiiw): …
Case 3 really is a configuration issue and should probably be dealt with differently (e.g. by calling the producer exactly once per required level; see #77). Coming back to our Tesseract example: …
Thus, I was probably wrong, and the "index 1 stays index 1" principle can be upheld sensibly. Even for processors using alternatives, if configured appropriately, the producer and consumer will agree on the hierarchy level they operate on (i.e. provide/expect alternatives on), and the other levels will merely contain one reading.
Oh, I see. So your intended meaning was an element-local, next-consistent-level backup. But then the wording on L119-120 should IMO be changed to: … Regarding the actual code: how do you get a consistent (repaired) TextRegion text from the page element in the 4th line? You probably meant it this way:

```python
if elem.isConsistent():
    text = elem.get_TextEquiv()[0].get_Unicode()
else:
    # construct from a lower level at this particular place in the document;
    # probably needs to be recursive to find the next-highest lower level
    # which is still consistent in itself
    text = elem.concatenateConsistently()
# work with text
```
You are absolutely right. Because inconsistency could be unavoidable in some use cases, while repairs remain possible, this should not be blocked but repaired.
Should be mostly obvious, but feedback appreciated on wording and on handling whitespace (the current proposal is to use NBSP, since most string-stripping algorithms do not include it by default and yet it is Unicode `WHITESP=Y`). @bertsky @finkf
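A sketch of that convention (hypothetical content; the NBSP is written as a character reference for visibility):

```xml
<!-- leading whitespace encoded as NBSP (U+00A0) so that naive
     stripping does not silently drop it -->
<pg:TextEquiv index="1">
  <pg:Unicode>&#xA0;word</pg:Unicode>
</pg:TextEquiv>
```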