# Morphological analysis with FSTs
The following is a brief, basic tutorial on how to construct a morphological analyzer for a language using finite-state techniques. A small toy grammar of English noun and verb inflection is built step by step to illustrate overall design issues. While the grammar is small, much larger grammars can be built using the same design principles. Basic familiarity with regular expressions and foma is assumed, as outlined in the Getting started with foma page.

## Definition
Since a "morphological analyzer" could mean any number of things, let's first settle on a task description and define what the morphological analyzer is supposed to accomplish. In this implementation, a morphological analyzer is taken to be a black box (as in Fig. 1), implemented here as a finite-state transducer, that translates a word form (such as runs) into a string that represents its morphological makeup, such as run+V+3P+Sg: a verb in the third person singular present tense.



Naturally, if the word form is ambiguous (as runs is), the job of the analyzer is to output all tag sequences consistent with the grammar and the input word. In the above example, the transducer should perhaps also output run+N+Pl, or some similar sequence to convey the possibility of a noun reading of runs. Since finite-state transducers are inherently bidirectional devices, i.e. we can run a transducer in the inverse direction as well as the forward direction, the same FST, once we've built it, can serve both as a generator and an analyzer. The standard practice is to build morphological transducers so that the input (or domain) side is the analysis side, and the output (or range) side contains the word forms.
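The bidirectionality described above can be sketched in plain Python. This is not an FST: the relation is simply stored as a list of pairs, and the names `RELATION`, `generate`, and `analyze` are made up for illustration. It only shows that one relation answers both questions, and that an ambiguous form such as runs yields several analyses.

```python
# A toy relation between analysis strings and surface forms, stored as pairs.
# This is only a stand-in for a transducer: a real FST encodes the relation
# as a graph rather than as an explicit list.
RELATION = [
    ("run+V+3P+Sg", "runs"),
    ("run+N+Pl", "runs"),
    ("run+V", "run"),
    ("run+N+Sg", "run"),
]


def generate(analysis):
    """Analysis side -> surface side (the relation read forwards)."""
    return sorted({s for a, s in RELATION if a == analysis})


def analyze(surface):
    """Surface side -> analysis side (the same relation read in reverse)."""
    return sorted({a for a, s in RELATION if s == surface})


print(generate("run+N+Pl"))  # ['runs']
print(analyze("runs"))       # ['run+N+Pl', 'run+V+3P+Sg']
```

Nothing is duplicated between the two functions except the direction in which the pairs are read, which is the property the transducer provides for free.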

In real life, morphological analyzers tend to provide much more detailed information than this. Figure 2 shows a more elaborate analyzer's output for Basque with the input word maiatzaren, together with an illustration of how a feature matrix can be recovered from the string output of the analyzer.
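As a rough illustration of recovering a feature matrix from an analysis string, the sketch below uses the English tags from this tutorial rather than the Basque ones from the figure. The feature names and values in `TAG_FEATURES` are assumptions chosen for readability, not part of any tag set defined here.

```python
# Hypothetical mapping from tags to (feature, value) pairs.
TAG_FEATURES = {
    "N": ("pos", "noun"),
    "V": ("pos", "verb"),
    "3P": ("person", 3),
    "Sg": ("number", "singular"),
    "Pl": ("number", "plural"),
}


def to_features(analysis):
    """Turn a string such as 'run+V+3P+Sg' into a flat feature dictionary."""
    lemma, *tags = analysis.split("+")
    features = {"lemma": lemma}
    for tag in tags:
        name, value = TAG_FEATURES[tag]
        features[name] = value
    return features


print(to_features("run+V+3P+Sg"))
```

An unknown tag raises a `KeyError` here, which is a deliberate choice for a sketch: a silent fallback would hide mismatches between the analyzer's tag set and the table.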



The goal, then, is to build a finite-state transducer that accomplishes this string-to-string mapping of analyses to surface forms and vice versa.

## Design
The construction of the final transducer will be broken down into two large components:

- A lexicon/morphotactics part
- A phonological/morphophonological/orthographic part

### The lexicon
The first component, which we call the lexicon component, will be a transducer that:

- Accepts as input only the valid stems/lemmas of the language, followed by only a legal sequence of tags.
- Produces as output from these, an intermediate form, where the tags are replaced by the morphemes that they correspond to.
- May produce additional symbols in the output, such as special symbols that serve to mark the presence of morpheme boundaries.

For example, in the analyzer about to be constructed, the lexicon component FST will perform the following mappings:

```
c a t +N +Pl      w a t c h +N +Pl      w a t c h +V +3P +Sg     (input side)
c a t ^  s        w a t c h ^  s        w a t c h ^  s           (output side)
```

There are two things to note here. The first is that we are using the symbol ^ to mark a morpheme boundary. The second is that while each letter in the stem is represented by its own symbol (`w`, `a`, `t`, `c`, `h`, etc.), each complete tag is a single multicharacter symbol (`+N`, `+Pl`, etc.). The spaces in the example above show the symbol boundaries. Figure 3 shows what a lexicon transducer that encoded only these three words would look like. Naturally, the lexicon described below has more features and more entries.
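The distinction between single-letter symbols and multicharacter symbols can be made concrete with a small plain-Python sketch. The helper names `tokenize` and `to_intermediate` and the `SUFFIXES` table are hypothetical, and the sketch covers only the three mappings shown above. Unlike a real lexicon transducer, it does not check that the stem is a valid lemma.

```python
import re

# Tags used in this tutorial, each treated as one indivisible symbol.
MULTICHAR = ["+N", "+V", "+3P", "+Sg", "+Pl", "+Past", "+PastPart", "+PresPart"]

# Try longer tags first so that +PastPart is not split into +Past + "Part".
_SYMBOL = re.compile(
    "|".join(re.escape(t) for t in sorted(MULTICHAR, key=len, reverse=True)) + "|."
)


def tokenize(analysis):
    """Split an analysis string into letter symbols and multicharacter tags."""
    return _SYMBOL.findall(analysis)


# Tag sequences mapped to the morphemes they stand for (only the cases above).
SUFFIXES = {
    ("+N", "+Pl"): ["^", "s"],
    ("+V", "+3P", "+Sg"): ["^", "s"],
}


def to_intermediate(analysis):
    """Replace the tag sequence by boundary marker plus suffix symbols."""
    symbols = tokenize(analysis)
    stem = [s for s in symbols if not s.startswith("+")]
    tags = tuple(s for s in symbols if s.startswith("+"))
    return stem + SUFFIXES[tags]


print(tokenize("watch+V+3P+Sg"))         # ['w', 'a', 't', 'c', 'h', '+V', '+3P', '+Sg']
print(to_intermediate("watch+V+3P+Sg"))  # ['w', 'a', 't', 'c', 'h', '^', 's']
```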


The part that accomplishes this, the lexicon transducer, will be written in a formalism called lexc. While it is possible to construct the lexicon transducer through regular expressions in foma, the lexc formalism is better suited to lexicon construction and to expressing morphotactics.
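The lexc source itself is not reproduced in this section, so the following is only a guess at what such a file could look like, reconstructed from the stems and tags that appear in the analyzer's output. The continuation-class names (`Ninf`, `Vinf`) and the file name `english_sketch.lexc` are assumptions. The Python wrapper merely writes the text to a temporary file so that it could be handed to a lexc compiler.

```python
import tempfile
from pathlib import Path

# Assumed lexc source: stems continue to an inflection lexicon, where each
# tag sequence on the analysis side is paired with "^" plus a suffix.
LEXC_SOURCE = """\
Multichar_Symbols +N +V +Sg +Pl +3P +Past +PastPart +PresPart

LEXICON Root
Noun ;
Verb ;

LEXICON Noun
cat   Ninf ;
city  Ninf ;
fox   Ninf ;
panic Ninf ;
try   Ninf ;
watch Ninf ;

LEXICON Verb
beg   Vinf ;
fox   Vinf ;
make  Vinf ;
panic Vinf ;
try   Vinf ;
watch Vinf ;

LEXICON Ninf
+N+Sg:0   # ;
+N+Pl:^s  # ;

LEXICON Vinf
+V:0             # ;
+V+3P+Sg:^s      # ;
+V+Past:^ed      # ;
+V+PastPart:^ed  # ;
+V+PresPart:^ing # ;
"""

# Write the sketch to a scratch location rather than over any real grammar.
path = Path(tempfile.gettempdir()) / "english_sketch.lexc"
path.write_text(LEXC_SOURCE, encoding="utf-8")
print(path)
```

The `Root` lexicon lists the entry points, each stem names the lexicon it continues to, and `#` marks the end of a word. Entries of the form `upper:lower` pair the analysis side with the intermediate form, with `0` standing for the empty string.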

```python
import hfst
import fstutils as fst
```

```python
help(fst.remove_epsilons)
```

```
Help on function remove_epsilons in module fstutils:

remove_epsilons(string, epsilon='@_EPSILON_SYMBOL_@')
    Removes the epsilon transitions from the string along a path from hfst.

    Args:
        string (str): The string (e.g. input path, output form) from which the epsilons should be deleted.
        epsilon (str, optional):  The epsilon string to remove. Defaults to the default setting in hfst,
        '@_EPSILON_SYMBOL_@'. Pass this only if you've redefined the epsilon symbol string in hfst.

    Returns:
        str: The desired string, without epsilons
```



```python
defs = fst.Definitions({
    "V": "[a|i|e|o|u]",
})
```

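```python
# Lexicon transducer: maps analyses (panic+V+Past) to intermediate forms (panic^ed).
grammar = hfst.compile_lexc_file('english.lexc')

# Alternation rules, each conditioned on the morpheme boundary "^".
consonantduplication = hfst.regex(defs.replace('g -> g g || _ "^" [i n g | e d]'))  # beg^ed -> begg^ed
edeletion = hfst.regex('e -> 0 || _ "^" [ i n g | e d ]')  # make^ing -> mak^ing
einsertion = hfst.regex('[..] -> e || s | z | x | c h | s h _ "^" s')  # watch^s -> watche^s
yreplacement = hfst.regex('y -> i e || _ "^" s ,, y -> i || _ "^" e d')  # try^s -> trie^s, try^ed -> tri^ed
kinsertion = hfst.regex(defs.replace('[..] -> k || V c _ "^" [e d | i n g]'))  # panic^ed -> panick^ed
cleanup = hfst.regex('"^" -> 0')  # delete the boundary markers

# compose() modifies `grammar` in place. Re-running only the calls below, without
# first recompiling `grammar` from english.lexc, composes the rules onto an
# already-composed transducer and produces a needlessly large FST.
grammar.compose(consonantduplication)
grammar.compose(einsertion)
grammar.compose(edeletion)
grammar.compose(yreplacement)
grammar.compose(kinsertion)
grammar.compose(cleanup)
```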
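To see what each rule contributes, the cascade can be imitated with ordinary regular-expression substitutions applied in the same order as the compositions. This is a deliberately simplified stand-in, not the hfst implementation: `re.sub` rewrites a single string in one direction, whereas the composed transducer encodes the whole relation and can also be run in reverse. The names `RULES` and `apply_rules` are made up for this sketch.

```python
import re

# (name, pattern, replacement), listed in the order the rules are composed.
RULES = [
    ("consonant duplication", r"g(?=\^(?:ing|ed))", "gg"),
    ("e-insertion", r"(ch|sh|s|z|x)\^s", r"\1e^s"),
    ("e-deletion", r"e(?=\^(?:ing|ed))", ""),
    ("y-replacement before s", r"y(?=\^s)", "ie"),
    ("y-replacement before ed", r"y(?=\^ed)", "i"),
    ("k-insertion", r"([aeiou]c)(?=\^(?:ed|ing))", r"\1k"),
    ("cleanup", r"\^", ""),
]


def apply_rules(intermediate):
    """Rewrite an intermediate form such as 'panic^ed' into a surface form."""
    form = intermediate
    for _name, pattern, replacement in RULES:
        form = re.sub(pattern, replacement, form)
    return form


for word in ["beg^ing", "watch^s", "make^ing", "city^s", "try^ed", "panic^ed"]:
    print(word, "->", apply_rules(word))
```

Each intermediate form is touched by only one alternation rule before the boundary marker is removed. Note that make^ed comes out as maked: the toy grammar has no treatment of irregular verbs.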

```python
fst.lookup(grammar, 'panic+V+Past')
```

```
'panicked'
```

```python
print(fst.pairs(grammar))
```

```
beg+V:
 beg
beg+V+3P+Sg:
 begs
beg+V+Past:
 begged
beg+V+PastPart:
 begged
beg+V+PresPart:
 begging
cat+N+Sg:
 cat
cat+N+Pl:
 cats
city+N+Pl:
 cities
city+N+Sg:
 city
fox+N+Sg:
 fox
fox+V:
 fox
fox+V+Past:
 foxed
fox+V+PastPart:
 foxed
fox+V+PresPart:
 foxing
fox+N+Pl:
 foxes
fox+V+3P+Sg:
 foxes
make+V+Past:
 maked
make+V+PastPart:
 maked
make+V+PresPart:
 making
make+V:
 make
make+V+3P+Sg:
 makes
panic+N+Sg:
 panic
panic+N+Pl:
 panics
panic+V:
 panic
panic+V+3P+Sg:
 panics
panic+V+Past:
 panicked
panic+V+PastPart:
 panicked
panic+V+PresPart:
 panicking
try+V+Past:
 tried
try+V+PastPart:
 tried
try+N+Pl:
 tries
try+V+3P+Sg:
 tries
try+N+Sg:
 try
try+V:
 try
try+V+PresPart:
 trying
watch+N+Sg:
 watch
watch+V:
 watch
watch+V+Past:
 watched
watch+V+PastPart:
 watched
watch+V+PresPart:
 watching
watch+N+Pl:
 watches
watch+V+3P+Sg:
 watches
```

