Developing a Disambiguator

A disambiguator is useful for a language when the tagger creates many interpretations for a token and grammar rules become very complex because the same set of exceptions must be used everywhere to disambiguate part-of-speech tags.

The disambiguator can be rule-based, as it is for French or English, or it can implement a completely different (for example, statistical) scheme. Note that you cannot simply adapt existing disambiguators, even rule-based ones, as they are designed to make taggers robust. Robustness means that a good tagger should ignore small grammatical problems when tagging. However, we want to recognize such problems rather than hide them from linguistic processing. Still, I found that even automatically created rules (such as those generated by training a Brill tagger for English) can be a source of inspiration.

Note that, in contrast to XML grammar rules, the order of disambiguation rules is important (like Brill tagger rules, they are cascaded). They are applied in the order in which they appear in the file, so you can use a step-by-step strategy and build on the results of previous rules in the rules that follow.

The rule-based disambiguator can also be used to add additional markup and simplify error-matching rules. For example, you can conditionally mark up some punctuation or phrases. It's also useful to mark up tokens that you would otherwise have to match with lengthy regular-expression disjunctions (word1|word2...|wordn), if these disjunctions appear in multiple rules. This is more efficient in terms of processing speed and makes the rules more understandable to a human being.
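
For instance, instead of repeating a long disjunction of month names in several rules, you could mark those tokens up once and match the new tag elsewhere. A minimal sketch (the MONTH tag is hypothetical, and the add action used here is described later in this document):

    <rule name="mark month names" id="MONTH_MARKUP">
      <pattern>
        <token regexp="yes">January|February|March|April</token>
      </pattern>
      <!-- add a hypothetical MONTH reading; later rules can match
           postag="MONTH" instead of repeating the word list -->
      <disambig action="add"><wd pos="MONTH"/></disambig>
    </rule>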

To create a new disambiguator

  • in src/main/java:
    • Create a new package org.languagetool.tagging.disambiguation.rules.xx
    • Create a new rule disambiguator class in the new package, containing:
    package org.languagetool.tagging.disambiguation.rules.xx;
    import java.io.IOException;
    import org.languagetool.AnalyzedSentence;
    import org.languagetool.language.Yyyyyy;
    import org.languagetool.tagging.disambiguation.Disambiguator;
    import org.languagetool.tagging.disambiguation.rules.XmlRuleDisambiguator;
    
    public class YyyyyyRuleDisambiguator implements Disambiguator {
      
      private final Disambiguator disambiguator = new XmlRuleDisambiguator(new Yyyyyy());
    
      @Override
      public final AnalyzedSentence disambiguate(AnalyzedSentence input)
          throws IOException {
        return disambiguator.disambiguate(input);
      }
    
    }

where:

xx is the two-letter language code

Yyyyyy is the language name

  • Open org.languagetool.language.Yyyyyy.java and override getDisambiguator():
    // field in the language class, initialized lazily
    private Disambiguator disambiguator;

    @Override
    public final Disambiguator getDisambiguator() {
      if (disambiguator == null) {
        disambiguator = new YyyyyyRuleDisambiguator();
      }
      return disambiguator;
    }
  • in src/main/resources:
    • Create the file org/languagetool/resource/xx/disambiguation.xml
    • Populate it
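
The disambiguation file is a plain XML document with rule elements inside a top-level rules element. A minimal skeleton might look like this (a sketch; copy the exact header and schema reference from the disambiguation.xml of an existing language):

    <?xml version="1.0" encoding="UTF-8"?>
    <rules lang="xx" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
           xsi:noNamespaceSchemaLocation="../disambiguation.xsd">
      <!-- disambiguation rules go here; they are applied in the order they appear -->
    </rules>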

XML syntax

The rule-based XML disambiguator uses a syntax very similar to that of the grammar XML rules. For example:

    <rule name="determiner + verb/NN -> NN" id="DT_VB_NN">
      <pattern>
    	<token postag="DT"><exception postag="PDT" /></token>
    	<marker>
    	  <and>
    	    <token postag="VB" />
    	    <token postag="NN" ><exception negate_pos="yes" postag="VB|NN" postag_regexp="yes"/></token>
    	  </and>
        </marker>
      </pattern>
      <disambig postag="NN" />
    </rule>

The only new element here is disambig. It simply assigns a new POS tag to the word being disambiguated. Note that I am using a trick: the rule applies only to words that have both NN and VB tags - in English, there are many words that are far more ambiguous and require much more complex rules. Without the trick, the disambiguation rule could do more harm than good - it would garble the tagger output. This is a constant danger when writing disambiguator rules.

Note that by default disambig is applied to a single token, which is selected with the <marker> ... </marker> elements inside the pattern element. However, you can use the action attribute to select more tokens for unification or for adding new interpretations.

The possible values of the action attribute are:

  • replace - the default one, assumed in the above example,
  • filter - used for filtering single tokens,
  • filterall - used for filtering multiple tokens by using the postags given in the rule,
  • unify - used for unification of groups of tokens,
  • remove - used for removing single tokens,
  • add - used for adding interpretations,
  • immunize - used to mark the tokens as immunized, i.e., never matched by any rule,
  • ignore_spelling - used to mark tokens that should not be marked as misspelled.

Filtering tags

Instead of assigning a single tag, as above, you can select an already existing interpretation (this also retains the old lemma, which is overwritten in the case of a simple assignment as above):

    <rule name="his (noun/prep) + noun -> his (prep)" id="HIS_NN_PRP">
      <pattern>
        <marker>
          <token>his</token>
        </marker>
        <token postag="NN.*" postag_regexp="yes" />
      </pattern>
      <disambig><match no="1" postag="PRP\$" postag_regexp="yes" /></disambig>
    </rule>

In this case, we select an existing interpretation (and only that interpretation) from the set of previous interpretations.

You can also assign a lemma if there are multiple interpretations and you don't want to pick just the first one as supplied by the tagger (this is the default behavior):

    <rule name="Don't|do|don/vb ->don't/vb" id="DONT_VB">
      <pattern>
        <marker>
          <token>don</token>
        </marker>
    	<token>'</token>
    	<token>t</token>
      </pattern>
      <disambig><match no="1" postag="VBP">do</match></disambig>
    </rule>

In this case, the contracted form of "do" is assigned the proper lemma and form tag. All other interpretations are discarded.

There is another, shorter syntax that you might use for simple forms of filtering:

    <rule name="his (noun/prep) + noun -> his (prep)" id="HIS_NN_PRP">
      <pattern>
        <marker>
          <token>his</token>
        </marker>
        <token postag="NN.*" postag_regexp="yes" />
      </pattern>
      <disambig action="filter" postag="PRP\$" />
    </rule>

It is exactly equivalent to the first example. Note that you cannot specify a lemma this way, so you need the full syntax for that. Note also that if "his" is not tagged as PRP$, this action is not executed. In other words, this disambiguator action presupposes that the new tag matches a POS tag already found on the given token. To add new interpretations or to replace existing ones, use the actions add or replace, respectively.

Note: it is also possible to filter out interpretations that you need to remove by using a regular expression with negative lookahead, a trick that enables negation in regular-expression syntax. For example, this rule will remove all interpretations equivalent to PRP$ from the token "his":

    <rule name="his (noun/prep) + noun -> his (prep)" id="HIS_NN_PRP">
      <pattern>
        <marker>
          <token>his</token>
        </marker>
        <token postag="NN.*" postag_regexp="yes" />
      </pattern>
      <disambig><match no="1" postag="^(?!PRP\$).*$" postag_regexp="yes" /></disambig>
    </rule>

Since you specify the filter as a regular expression, you can remove multiple interpretations at once.

Filtering multiple tokens

The action filterall works the following way: once a pattern matches, every token inside the <marker> element is filtered by its corresponding postag. So if a pattern like "determiner + noun + adjective, masculine singular (with its exceptions)" is matched, all other readings (pronoun, verb, etc.) are removed.

For example:

    <rule>
      <pattern>
        <marker>
          <token postag="D[^R].[MC][SN0].*" postag_regexp="yes"/>
          <token postag="N.[MC][SN0].*" postag_regexp="yes"/>
          <token postag="A..[MC][SN0].*|V.P..SM|PX.[MC][SN0].*" postag_regexp="yes"><exception postag="V[MA]IP3S0" postag_regexp="yes"/></token>
        </marker>
      </pattern>
      <disambig action="filterall"/>
    </rule>

This would be equivalent to three rules (one for every token) like this:

    <rule>
      <pattern>
        <marker>
          <token postag="D[^R].[MC][SN0].*" postag_regexp="yes"/>
        </marker>
        <token postag="N.[MC][SN0].*" postag_regexp="yes"/>
        <token postag="A..[MC][SN0].*|V.P..SM|PX.[MC][SN0].*" postag_regexp="yes"/>
      </pattern>
      <disambig action="filter" postag="D.*"/>
    </rule>

Using unification

Before using unification, you need to define features and equivalences of features, as described in Using unification. In the disambiguator file, you add the same unification block as in the rules file (the syntax is the same). Then, in the rule, you can keep only the unified tokens, that is, tokens that share the same features. For example, take a simple agreement rule from the Polish disambiguator:

    <rule name="unifikacja przymiotnika z rzeczownikiem" id="unify_adj_subst">
      <pattern>
        <marker>
          <unify> 
          <feature id="number"><feature id="gender"><feature id="case">
            <token postag="adj.*" postag_regexp="yes"><exception negate_pos="yes" postag_regexp="yes" postag="adj.*"/></token>
            <token postag="subst.*" postag_regexp="yes"><exception negate_pos="yes" postag_regexp="yes" postag="subst.*"/></token>
          </unify>
        </marker>
      </pattern>
      <disambig action="unify"/>
    </rule>

It uses unification on three features (defined earlier in the file): number, gender, and case. Note that I am using a trick (see Tips and tricks) to make sure that only words tagged exclusively as adjectives or substantives are unified (otherwise the rule is too greedy).

There are several important restrictions: you cannot use two unify blocks in the disambiguator file; only one unify sequence per pattern is allowed. Moreover, the tokens selected with <marker> ... </marker> must be exactly the unified sequence. Of course, there may be more tokens in the rule, but they cannot be selected with <marker> ... </marker> if the disambiguator is supposed to unify the sequence of tokens.

Removing only some interpretations

Sometimes, instead of filtering, you might want to remove only one interpretation from the token. You can do this in the following way:

    <rule name="mają to nie maić" id="MAJA_MAIC">
      <pattern>
        <token>mają</token>
      </pattern>
      <disambig action="remove"><wd lemma="maić" pos="verb:fin:pl:ter:imperf">mają</wd></disambig>
    </rule>

The above code removes one interpretation of the word "mają": the one with the POS tag "verb:fin:pl:ter:imperf", token text "mają", and lemma "maić". You can supply only some of the three parameters; all supplied parameters must match, so the fewer parameters you supply, the more interpretations are removed.
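
For example, a variant of the rule above that supplies only the pos parameter would remove every interpretation of "mają" with that POS tag, regardless of its lemma or token text (a sketch):

    <rule name="mają - remove by POS only" id="MAJA_POS_ONLY">
      <pattern>
        <token>mają</token>
      </pattern>
      <!-- only pos is given, so lemma and token text are not checked -->
      <disambig action="remove"><wd pos="verb:fin:pl:ter:imperf"/></disambig>
    </rule>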

Adding completely new readings

Adding new readings can be useful for marking up groups, such as noun groups or multi-word expressions. You can add a single reading, or many readings to a whole sequence (for example, a start mark, an "inside" mark, and an end mark).

For example:

    <rule name="ciemku" id="ciemku">
      <pattern>
        <token>ciemku</token>
      </pattern>
      <disambig action="add"><wd lemma="po ciemku" pos="adjp">ciemku</wd></disambig>    
    </rule>

The number of wd elements must match the number of tokens selected with <marker> ... </marker>.
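
For example, a sketch of a rule that marks a two-token phrase with begin and end readings (the adjp:begin and adjp:end tags are hypothetical; check your tagset for suitable group markers):

    <rule name="po ciemku" id="PO_CIEMKU">
      <pattern>
        <marker>
          <token>po</token>
          <token>ciemku</token>
        </marker>
      </pattern>
      <!-- two tokens are selected, so exactly two wd elements are required -->
      <disambig action="add">
        <wd lemma="po ciemku" pos="adjp:begin">po</wd>
        <wd lemma="po ciemku" pos="adjp:end">ciemku</wd>
      </disambig>
    </rule>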

Adding only POS tags or tokens

You can also add just POS tags without having to specify the lemmas or tokens being added. This is especially useful if you're tagging tokens matched by regular expressions or POS tags, so you don't actually know in advance which token you will find. You can add a POS tag simply by supplying the wd element without the lemma attribute or without textual content:

    <rule name="uppercase tag" id="UPTAG">
      <pattern case_sensitive="yes">
        <token regexp="yes">\p{Lu}+</token>
      </pattern>
      <disambig action="add"><wd pos="UP"/></disambig>    
    </rule>

In the above example, I only added the UP tag to uppercase words; the lemma is assumed to be equal to the token content, and the content of the token is not changed. So if the word was "Smiths", it would be tagged as "UP", and the lemma would be "Smiths" (although in other readings it could be "Smith").

If you omit only the token text, it will be set equal to the token matched by the current rule (rather than being empty).

Immunizing words from matching

Sometimes a string of tokens raises false alarms in many rules even when the words are correctly tagged, and adding an exception to all those rules would be overkill (for example, for an idiomatic phrase). You can then immunize the tokens by using the action immunize:

    <pattern>
      <token>dla</token>
      <marker>
        <token>Windows</token>
      </marker>
    </pattern>
    <disambig action="immunize"/>

In the above pattern, only the word "Windows" will be immunized. This way, no XML rule will match it. Java rules can ignore immunization - it's up to their authors to respect it.

Ignoring in spell-checking rules

You can mark up tokens as spelled correctly. For example, you can use a regular expression to make the spell checker accept all Roman numerals:

    <pattern case_sensitive="yes">
       <token regexp="yes">(?:M*(?:D?C{0,3}|C[DM])(?:L?X{0,3}|X[LC])(?:V?I{0,3}|I[VX]))</token>
    </pattern>
    <disambig action="ignore_spelling"/>

There are also very rare words that are not used anywhere except in some specific contexts. We want the spell checker to complain about them in all but these special contexts. For example, in Polish, the word ząb cannot be used in its genitive form zębu unless it appears in the specific context where one speaks of a kind of corn. The rule looks like this:

    <rule id="KONSKI_ZAB" name="końskiego zębu - dobra pisownia">
      <pattern>
        <token>końskiego</token>
        <marker>
          <token>zębu</token>
        </marker>      
      </pattern>
      <disambig action="ignore_spelling"/>      
    </rule>

Possible strategies of disambiguation

  • Remove very rare but possible POS tag interpretations, possibly ignoring the context (a greedy strategy which must be evaluated on a corpus)
  • Remove only those ambiguities that actually create false alarms, judging by the number of false alarms they cause
  • Remove some ambiguities one at a time, starting from very general and safe rules and ending with very specific ones
  • Remember that you don't have to disambiguate everything in one go. The rules are a cascade, so you can write very small disambiguation rules that then trigger yet other rules
    • Start from very safe and general rules. For example, adjusting the case of a noun after a preposition is usually quite safe. Likewise, in most languages it's impossible for a verb to occur after a preposition, so if a token has an ambiguous interpretation that includes a noun or adjective reading and a verb reading, you can safely assume that the verb reading cannot occur there (see the sketch after this list).
    • You can also tag numbers, abbreviations, and proper names at a later stage and use them in disambiguation. For example, abbreviations may be discovered heuristically as sequences of non-vowel alphabetic characters followed by a dot as the next token. Numbers > 2 can occur before plural nouns, but not really before verbs (remember that dates may!). Words unrecognized by the tagger that start with an uppercase letter but are not at the beginning of a sentence may be proper names (especially if they occur in a sequence of two such words -- these may be foreign proper names).
  • If you have access to a manually tagged corpus that uses the same or a similar tagset, it's very useful for checking the phrases that you want to disambiguate
  • Another good strategy is to generate a list of ambiguous words (possibly sorted by frequency in a dictionary), and then to go through it manually from the top:
    • If one of the POS tags is extremely rare, simply remove that interpretation.
    • If two (or more) POS tags can really appear, write a rule that matches that exact word (not the POS tag), and run it on a corpus.
      • Try to see whether you can spot differences. Usually, the surrounding words (in idiomatic phrases) or other POS tags constrain the possible interpretations.
      • Write the disambiguation rule and see how it matches your set of sentences. Then run all the rules on a corpus to see whether you get false alarms in other rules.
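
As an illustration of the "safe and general" strategy above, here is a sketch of a rule that removes the verb reading of a noun/verb-ambiguous token after a preposition (Penn-style tags are assumed; adapt them to your tagset):

    <rule name="preposition + noun/verb -> noun" id="IN_NO_VB">
      <pattern>
        <token postag="IN"/>
        <marker>
          <!-- apply only to tokens that really have both readings -->
          <and>
            <token postag="NN"/>
            <token postag="VB"/>
          </and>
        </marker>
      </pattern>
      <!-- a verb cannot follow a preposition, so drop the VB reading -->
      <disambig action="remove"><wd pos="VB"/></disambig>
    </rule>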

Testing disambiguation rules

The best way to test disambiguation rules is to run LanguageTool on a medium-sized corpus (comparable to the Brown Corpus for English) and see whether previous false alarms are now fixed and no new false alarms are created. Otherwise, it's very hard to predict the impact of disambiguation rules.

You can test disambiguation rules in a fashion similar to the way grammar rules are tested. Let's look at an example.

    <example type="ambiguous" inputform="What[what/WDT,what/WP,what/UH]" outputform="What[what/WDT]"><marker>What</marker> kind of bread is this?</example>
    <example type="untouched">What are you doing?</example>

In the above snippet, we declare that the sentence "What are you doing?" should be left untouched, i.e., unchanged by the disambiguation rule, contrary to the ambiguous sentence, which will be processed. Using the marker element, we select the token that will be changed. The attribute inputform specifies the input forms of the token, in word[lemma/POS] format. The outputform is, of course, what the disambiguation rule should produce.
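
The example elements are placed inside the rule element itself, next to the pattern and disambig elements. A sketch based on the DT_VB_NN rule from above (the exact input forms depend on your tagger's output):

    <rule name="determiner + verb/NN -> NN" id="DT_VB_NN">
      <pattern>
        <token postag="DT"/>
        <marker>
          <and>
            <token postag="VB"/>
            <token postag="NN"/>
          </and>
        </marker>
      </pattern>
      <disambig postag="NN"/>
      <!-- the ambiguous example must match the pattern; the untouched one must not -->
      <example type="ambiguous" inputform="walk[walk/NN,walk/VB]" outputform="walk[walk/NN]">Take a <marker>walk</marker> in the park.</example>
      <example type="untouched">They walk home.</example>
    </rule>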

Note also that in verbose mode (-v on the command line), LT will display a log of all disambiguator actions for a given sentence.