
PAGE: How to add TextEquiv and consistency rules, OCR-D/assets#16 #82

Merged (17 commits), Dec 13, 2018

Conversation

@kba (Member) commented Oct 9, 2018

Should be mostly obvious, but feedback appreciated on wording and handling whitespace (current proposal is to use nbsp since most string stripping algos do not include it by default and yet it is Unicode WHITESP=Y).

@bertsky @finkf

@bertsky (Collaborator) left a comment

Thanks for going forward with this! IMO the overall direction of the language is right, but there are a few points with room for improvement:

  • When we attach text recognition results, we might have multiple variants/hypotheses. Annotating those should be encouraged. Each one should go into a separate TextEquiv with ascending @index and (then mandatory) descending @conf (see the sketch after this list).
  • Since this is the intended semantics of the TextEquiv sequence, we should forbid other uses within one PAGE. So either there is only one TextEquiv, or a list of alternative hypotheses with differentiating attributes.
  • I find NBSP is a good idea for this corner case. The second sentence could be made more clear though by inserting "at the start or end of some T" before the comma.
  • Terminologically, I think we should stick to "recognized text" or "text results" throughout and ditch "text equivalence".
  • I would try to avoid the possessive form of XML identifiers. How about: "The text of each <pg:Word> must be equal to the texts of all <pg:Glyph> contained by it, concatenated directly."
  • Maybe (just to be as clear as possible) one should even mention that concatenation adds delimiter strings in between but never leading (before the first) or trailing (after the last).
  • There might be valid or even necessary exceptions to this general principle in corner cases. Before we make this a hard requirement, we should search for those cases in our data semi-interactively. See this comment
  • As far as I understood @tboenig in this comment, failing the consistency assertion would not allow a processor to stop, but require it to proceed with the content of the top-most compliant hierarchy level.
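
A minimal sketch of the ordering constraint from the first point above, assuming the same generateDS-style PAGE accessors that appear in the code further down in this thread (`get_index()`/`get_conf()` for `@index`/`@conf`); the helper name is made up:

```python
def check_alternatives(text_equivs):
    """Check that TextEquiv alternatives have strictly ascending @index and
    non-increasing @conf (assumes both attributes are set on every entry)."""
    indexes = [te.get_index() for te in text_equivs]
    confs = [te.get_conf() for te in text_equivs]
    ascending = all(a < b for a, b in zip(indexes, indexes[1:]))
    descending = all(a >= b for a, b in zip(confs, confs[1:]))
    return ascending and descending

# e.g. for some Word element `word`:
# assert check_alternatives(word.get_TextEquiv())
```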

@kba (Member, Author) commented Oct 9, 2018

@bertsky All valid and helpful, thank you. I will address them ASAP and update the PR.

As far as I understood @tboenig in this comment, failing the consistency assertion would not allow a processor to stop, but require it to proceed with the content of the top-most compliant hierarchy level.

Stopping in a loud way seems less error-prone than silently relying on a convention to me. But I don't feel strongly about that, as long as the convention is clear.

require it to proceed with the content of the top-most compliant hierarchy level.

What does that mean, how do you determine which one is "compliant" if line text and concatenated words differ? @tboenig Can you clarify OCR-D/assets#12 (comment)?

@bertsky (Collaborator) commented Oct 9, 2018

Stopping in a loud way seems less error-prone than silently relying on a convention to me.

Same here.

require it to proceed with the content of the top-most compliant hierarchy level.

What does that mean, how do you determine which one is "compliant" if line text and concatenated words differ?

Sorry, I mis-represented the rule stated by Matthias. I think he really meant using the lowest available hierarchy level. This may be difficult for the processor, though (it might require re-building the text for the needed level from a lower level).

@VolkerHartmann (Contributor) left a comment

Every time there is a change at a higher level, it has to be transferred to the lower levels. There are two cases that make this difficult:

  1. the letters 'rn' on line level become 'm', which means that two glyphs have to be combined into one glyph.
  2. the character 'm' becomes 'rn' at line or word level, meaning that a glyph must be split.

The same problem can also happen on the word level, so that a word has to be split or combined. A simple solution could be to add an index to the text equivalence in each of the above cases and thus mark it as independent of the text equivalence from the lower levels. This means that all text equivalences with the same index have to match the rules mentioned in the discussion before. But I don't know enough about the PAGE XML schema to say if that's allowed/possible.

page.md (outdated)

**Example:** `<pg:Word>` has text `Fool` but contains `<pg:Glyph>` whose text results, when concatenated, form the text `Foot`. The processor must proceed as if the `<pg:Word>` had the text `Foot`.
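
For illustration, a rough sketch of this fallback behaviour, using the same accessor style as the code snippets further down in this thread (not part of the proposed spec text; assumes a single TextEquiv per element):

```python
# proceed with the concatenated lower-level text if the two levels disagree
word_text = word.get_TextEquiv()[0].get_Unicode()            # e.g. 'Fool'
glyph_text = ''.join(glyph.get_TextEquiv()[0].get_Unicode()  # e.g. 'Foot'
                     for glyph in word.get_Glyph())
if word_text != glyph_text:
    word_text = glyph_text  # the lower level "wins": proceed as if the Word read 'Foot'
```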
A Contributor commented:

This means that the text-equivalents of glyphs will overwrite all other results.
Question: Why save a textequiv for each level if the textequivs are overwritten by the deeper level anyway?
What will happen if multiple textequivs exist?
Must the indexes match?

Member Author (@kba) replied:

Different segmentations are an issue, cf. #72, but this proposal here is simpler.

First of all, it allows us to identify inconsistencies between the levels.

Secondly, if there are inconsistencies, it's always the lowest level that "wins".

IMHO, this should not happen at all and a processor should be expected to either produce consistent TextEquiv on all levels (processor changed words -> adapt lines, blocks) or delete those TextEquiv that cannot be kept consistent (processor changed words -> delete glyphs).

Question: Why save a textequiv for each level if the textequivs are overwritten by the deeper level anyway?

Convenience I suppose. Since enforcing these consistency rules requires generating and comparing text anyway, we could probably prune the upper levels after processing and repopulate them before processing.

What will happen if multiple textequivs exist? Must the indexes match?

Yes, that's a good point, I'll add a note on that.

Member Author (@kba) replied:

What will happen if multiple textequivs exist? Must the indexes match?

Updated, @VolkerHartmann please have a look at f94c9bb

@bertsky (Collaborator) commented Oct 31, 2018

Every time there is a change

@VolkerHartmann Can you please elaborate on what you mean by that? I thought PAGE annotations are never changed (except perhaps for dev mode with our new rollback operation, where it would be overwritten). So far I was told that processors can only add new annotations.

Add a NOTE how to handle multiple textequivs by index.

@kba I am quite surprised by the new consistency rules for multiple results in f94c9bb. In my understanding, higher levels necessarily need more alternatives to account for the same information, because they are combinations of their sub-elements' alternatives. Even if the combination is not expanded in full, e.g. a Word does not get all possible TextEquiv from its constituent Glyphs TextEquivs but only those which can be found in some dictionary or generated by some morphological rule, it would be overly restrictive to demand the same number of alternatives as its constituents. Also, the current formulation implies that all elements (not just the most ambiguous) need the same number of alternatives. Is that even realistic?

@kba (Member, Author) commented Oct 31, 2018

I am quite surprised by the new consistency rules for multiple results in f94c9bb. In my understanding, higher levels necessarily need more alternatives to account for the same information, because they are combinations of their sub-elements' alternatives. Even if the combination is not expanded in full, e.g. a Word does not get all possible TextEquiv from its constituent Glyphs TextEquivs but only those which can be found in some dictionary or generated by some morphological rule, it would be overly restrictive to demand the same number of alternatives as its constituents. Also, the current formulation implies that all elements (not just the most ambiguous) need the same number of alternatives. Is that even realistic?

I'm struggling with that part as you noticed but wanted to propose something lest we forget this additional source of inconsistencies. I'd be happy about better/cleaner wording and less convoluted rules. "Same cardinality of multiple text equivs" throughout only makes sense when combining results, not for alternatives.

Would it be acceptable if consistency was restricted to the "first" text result? This would leave the processor the freedom to handle alternatives as it pleases and we wouldn't need to define it (and implement it, which I imagine to be quite hairy considering all the edge cases and different interpretations of what multiple text results may mean). Essentially, we'd only ensure consistency for the "canonical" text results.
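
As a rough sketch of what such a restriction could look like (the helper names, the delimiter mapping and the assumption that the top-ranked alternative comes first in document order are mine, not part of the proposal):

```python
# assumed delimiters, following the concatenation rules discussed above (not normative)
LEVEL_DELIMITERS = {
    'Word': '',          # Glyph texts concatenate directly
    'TextLine': ' ',     # Word texts are joined by a single space
    'TextRegion': '\n',  # TextLine texts are joined by a newline
}

def first_text(elem):
    """The 'canonical' result: the first TextEquiv in document order."""
    return elem.get_TextEquiv()[0].get_Unicode()

def consistent_first_result(elem, children, level):
    """Check consistency only between the first text results of `elem` and its `children`."""
    return first_text(elem) == LEVEL_DELIMITERS[level].join(first_text(c) for c in children)

# e.g. consistent_first_result(word, word.get_Glyph(), 'Word')
```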

@bertsky (Collaborator) commented Oct 31, 2018

"Same cardinality of multiple text equivs" throughout makes only sense when combining results, not for alternatives.

If multiple TextEquiv results are not alternative results, what else are they? I get the feeling I am missing something important in the whole discussion...

Would it be acceptable if consistency was restricted to the "first" text result?

I'd recommend against that. That would shift the burden of checking consistency to the consuming processor again (at least for those processors that do consume multiple results / alternatives).

Nevertheless, I completely agree something must be stated about this in the spec, and hopefully also enforced by the validator in core.

I think that most of your original idea can be saved if you replace the notion of identical cardinality by subsumption: a level's TextEquivs must each be subsumed in the concatenated combinations of its subordinate level's TextEquivs. This is easy to check without expanding the full combination (which could be practically infeasible), for instance like this (untested):

def validate_level(text, subs):
    # Check whether `text` can be composed from one TextEquiv alternative of
    # each sub-element in `subs`, by back-tracking recursion.
    if not subs:
        return not text # end of recursion: all of `text` must have been consumed
    subtexts = subs[0].get_TextEquiv()
    for subtext in subtexts:
        prefix = subtext.get_Unicode()
        # for Word/TextLine/TextRegion: append space/newline delimiter to prefix here
        if text.startswith(prefix):
            if validate_level(text[len(prefix):], subs[1:]):
                return True # consistent
        pass # backtrack to next alternative
    return False # inconsistent
# e.g. at some Word element word:
texts = word.get_TextEquiv()
glyphs = word.get_Glyph()
for text in texts:
    if not validate_level(text.get_Unicode(), glyphs):
        raise Exception("word %s alternative '%s' (index %d) is inconsistent w.r.t. its constituent glyphs"
                        % (word.get_id(), text.get_Unicode(), text.get_index()))


## AlternativeImage for derived images

To encode images derived from the original image, the `<pc:AlternativeImage>` should be used. Its `filename` attribute should reference the URL of the derived image.
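
A hypothetical sketch of how a processor could attach such a derived image, assuming the generateDS-based PAGE bindings (as in the snippets above) expose an `AlternativeImageType` class and an `add_AlternativeImage` method on the Page element; the import path, variable names and file URL are illustrative only:

```python
from ocrd_models.ocrd_page import AlternativeImageType  # path may differ by core version

# `pcgts` is assumed to be a parsed PAGE document
pcgts.get_Page().add_AlternativeImage(
    AlternativeImageType(filename='OCR-D-IMG-BIN/FILE_0001_BIN.png',  # made-up URL
                         comments='binarized'))
```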
A Member commented:

"should" or "MUST"?

Member Author (@kba) replied:

MUST

Member Author (@kba) replied:

Ah, now I remember why it is SHOULD and not MUST: We do not use this mechanism at the moment. Rather we explicitly demand input file groups on call to a processor. So there is little benefit in setting it in the PAGE XML file.

If we demand it, we must implement it...

@bertsky (Collaborator) commented Nov 22, 2018

Still no ideas about the problem with multiple (alternative) results? No opinions on my proposal?

@kba (Member, Author) commented Nov 23, 2018

Still no ideas about the problem with multiple (alternative) results? No opinions on my proposal?

IIUC @wrznr @cneud you prefer a "consistency checked for first alternative only" approach?

@cneud (Member) commented Nov 23, 2018

Indeed, I am currently leaning towards the "consistency checked for first alternative only" approach, but this being a fairly tricky issue, I want to make sure that I am not overlooking anything. I will utter my opinion by Tuesday next week at the latest.

@cneud (Member) commented Nov 27, 2018

For lack of a better idea, I'd opt for the way currently proposed by @kba, i.e. we only concern ourselves with the consistency between the upper level and the top-ranked (index position = 1) alternative result on the lower level. Anything else would too quickly escalate into gazillions of combinations which will be impossible to keep consistent.

Even using subsumption instead as proposed here, I cannot currently see how this will reduce the scope to keep track of (but will be happily enlightened).

What this entails however, is that any alternatives present on a lower level that are not consistent with the upper level MUST have an index position greater than 1.

To illustrate:
format
  f  f
  o  o
  r  r
  m  r
     n

In this case, the n - since/while not included in the index = 1 variant - MUST have index > 1.

Needless to say, we will strive for a clear explanation (ideally illustrated by examples) asap.

Last but not least, pinging @finkf again since this will likely also impact the work on ocrd-postcorrection - and I believe this issue may have come up already earlier in the development of PoCoTo?

@bertsky (Collaborator) commented Nov 27, 2018

Even using subsumption instead as proposed here, I cannot currently see how this will reduce the scope to keep track of (but will be happily enlightened).

We need to be more precise here: IINM keeping track of in that context can only mean validating some element's TextEquiv against its constituent (i.e. lower) elements' TextEquivs, should such exist. There is no need to go into a full combination of all TextEquivs of all its sub-elements here (or with even larger scope, which then indeed would be gazillions): we already know the result text we are looking for. We can do that search by a simple back-tracking recursion, as illustrated, in a function that can be used for every level (it only needs to know what element delimiters the respective level requires). The complexity (and practically, stack depth) of that recursion grows with the number of sub-elements times their average ambiguity. Even in the worst case, one full TextLine of very short Words with lots of alternative readings, or equivalently one very long Word with lots of alternative Glyph readings, that problem will be small computationally IMO.

And that cost will be paid for regardless of whether we validate consistency in the workspace or in the consumer itself: Because any sensible processor that does take alternatives and that does consider multiple hierarchy levels (and thus necessitates consistency checking in the first place) will rely on alternatives subsumption.

So to me this is actually a configurational issue: Workflows that have such processors would need that kind of validation at some point, but workflows without producers of ambiguity or without multi-level consumers would not.

If my assessment is correct, always validating levels but merely validating index 1 is both too much and too little at the same time.

@finkf commented Nov 28, 2018

To illustrate:
format
f f
o o
r r
m r
  n

In this case, the n - since/while not included in the index = 1 variant - MUST have index > 1.

One could always use an empty Unicode element to represent the empty string. Or one could use a special character like _ or ~ for all text levels.

@cneud (Member) commented Nov 30, 2018

@bertsky OK, that makes it clear, thanks for elaborating.

However, when we discussed this again in the wider OCR-D coordination group, we felt that the back-tracking recursion approach would introduce considerable complexity in the code with regard to maintenance and testing compared to the added benefit, and that we would thus prefer to proceed with the implementation of the "consistency checked for first alternative only" method instead, to ensure at least a practical level of consistency.

We hope this is acceptable for you and do of course remain open to further discussions.

@bertsky (Collaborator) commented Nov 30, 2018

@cneud sure, no problem. After all, I can still do it in the processor.

Hopefully, those consistency problems can already be avoided by design anyway when we extend PAGE with graph-based or matrix-based alternatives.

@bertsky (Collaborator) left a comment

Sorry to bring this up so late, but now I can see where this is going. I don't think this consistency-for-index-1 principle works just yet:

  • Why again tie index 1 of the containing and the contained element together? As mentioned earlier, that seems overly restrictive (as it did when tying each index n together). Take Tesseract for example: it needs to be able to output a Word which seems more likely from its dictionary or language model, which typically is not identical to the concatenation of its 1-best Glyphs, respectively.
  • In case of failure, why does it require to use the lowest annotation level globally (on the document level), instead of the lower of the 2 levels in question, for that particular spot within the document?
  • If validation would be a yes-no step provided by core, who would assist the processor in determining which place to re-combine texts at? On the other hand, if validation would deliver a structured result, why not just apply the repair already?
  • Lastly, why not just repeat the module causing the error by the general means of the quality control module?

@kba (Member, Author) commented Dec 5, 2018

Take Tesseract for example: it needs to be able to output a Word which seems more likely from its dictionary or language model, which typically is not identical to the concatenation of its 1-best Glyphs, respectively.

I see your point. In such an (inconsistent) case, what would you recommend the source of truth should be, the word or the glyphs? Your example seems particular to the word-glyph relation or do you expect TextLine and containing Words / TextRegion and containing TextLines to diverge because of language model/post-correction?

In case of failure, why does it require to use the lowest annotation level globally (on the document level), instead of the lower of the 2 levels in question, for that particular spot within the document?

That was not the intended meaning. I had in mind an algorithm like this:

  for elem in page.get_Page().get_TextRegion():  # or Word or Glyph nested
    text = elem.get_TextEquiv().get_Unicode()
    if not page.isConsistent(elem):
      text = page.concatenateConsistently(elem)  # construct from lowest consistent level at this particular place in the document
    # work with text

Lastly, why not just repeat the module causing the error by the general means of the quality control module?

You mean make consistency errors fatal?

As far as I understood @tboenig in this comment, failing the consistency assertion would not allow a processor to stop, but require it to proceed with the content of the top-most compliant hierarchy level.

Keep in mind that the origin of this proposal about consistency was to spot incorrect order of words in a line. But if strict consistency rules would restrict valid use cases too much, might it not be best to just scrap them?

@bertsky (Collaborator) commented Dec 5, 2018

In such an (inconsistent) case, what would you recommend the source of truth should be, the word or the glyphs? Your example seems particular to the word-glyph relation or do you expect TextLine and containing Words / TextRegion and containing TextLines to diverge because of language model/post-correction?

I guess it all depends on what you ask for: If I need Glyph level annotation and ask Tesseract to recognize one TextLine at a time, accessing results with GetIterator and GetChoiceIterator, it will give index 1 consistency on all 3 levels. But if I access results with GetBestLSTMChoices (or some future low-level routine), it will usually be inconsistent by the index 1 principle. Likewise if I ask to recognize each level separately (external layout analysis).

A consumer that needs Glyphs would naturally like to ignore inconsistent higher levels, but a consumer that needs Words or TextLines would rather like the input repaired from the next-lower consistent (but not necessarily the lowest) level.

So maybe we should first look at the use-cases of producing text annotation on multiple levels again (please cmiiw):

  1. readability: providing a TextEquiv index 1 on the levels higher than current helps humans quickly navigate through PAGE-XML and comes without extra effort. Here index 1 consistency should be enforced.
  2. avoid losing granularity: providing a TextEquiv index 1 on the levels lower than current allows to also pass on all features which are available with higher resolution (in the input already or just now; like coordinates, font features etc), so each output is self-contained and no re-alignment of annotations becomes necessary. This case also comes with little extra cost, and index 1 consistency is appropriate here as well.
  3. "one size fits all": providing TextEquivs (possibly with multiple indexes) on multiple levels can seem useful if one does not know what level consumers will expect, or if one knows the consumers to expect different levels. This can be expensive though, and index 1 consistency is overly restrictive here.

Case 3 really is a configuration issue and should probably be dealt with differently (e.g. by calling the producer exactly once per required level; see #77).

Coming back to our Tesseract example:

  • GetBestLSTMChoices with case 1: To abide by the index 1 consistency principle, we should not use the result from GetUTF8Text on the higher levels – throwing away all lexicon, LM and heuristic knowledge going into it –, but instead only concatenate from each lower level, respectively. Or avoid higher level annotation for humans, or avoid that function completely.
  • external layout, for example calling with a different model/language/engine on a low-confidence substring: we must ensure in the wrapper that higher levels are recombined consistently and lower levels use GetIterator.

Thus, probably I was wrong, and the "index 1 stays index 1" principle can be upheld sensibly. Even for processors using alternatives, if configured appropriately, the producer and consumer will agree on the hierarchy level they operate (provide/expect alternatives) on, and the other levels will merely contain one reading.

In case of failure, why does it require to use the lowest annotation level globally (on the document level), instead of the lower of the 2 levels in question, for that particular spot within the document?

That was not the intended meaning. I had in mind an algorithm like this:

Oh, I see. So your intended meaning was element-local, next consistent level backup. But then the wording on L119-120 should IMO be changed to:
If any of these assertions fail for an element, a processor must proceed with the text results at the lower level.

Regarding the actual code, how do you get a consistent (repaired) TextRegion text from the page element in the 4th line? You probably meant it this way:

    if elem.isConsistent():
        text = elem.get_TextEquiv().get_Unicode()
    else:
        # construct from lower level at this particular place in the document;
        # probably needs to be recursive to find the next-highest lower level which is still consistent in itself
        text = elem.concatenateConsistently()
    # work with text

You mean make consistency errors fatal?

As far as I understood @tboenig in this comment, failing the consistency assertion would not allow a processor to stop, but require it to proceed with the content of the top-most compliant hierarchy level.

Keep in mind that the origin of this proposal about consistency was to spot incorrect order of words in a line. But if strict consistency rules would restrict valid use cases too much, might it not be best to just scrap them?

You are absolutely right. Because inconsistency could be unavoidable in some use cases, and repairs still possible, this should not be blocked but repaired.
