# Morphological analysis with FSTs
This is a brief, basic tutorial on constructing a **morphological analyzer** for a language using finite-state techniques. A toy grammar of English noun and verb inflection is built step by step to illustrate the overall design. Although the grammar is small, much larger grammars can be built on the same design principles. The tutorial uses the [Helsinki Finite-State Transducer toolkit](http://hfst.github.io/).

In [1]:
import hfst
import fstutils as fst

In [2]:
help(fst.remove_epsilons)

Help on function remove_epsilons in module fstutils:

remove_epsilons(string, epsilon='@_EPSILON_SYMBOL_@')
    Removes the epsilon transitions from the string along a path from hfst.
    
    Args:
        string (str): The string (e.g. input path, output form) from which the epsilons should be deleted.
        epsilon (str, optional):  The epsilon string to remove. Defaults to the default setting in hfst,
        '@_EPSILON_SYMBOL_@'. Pass this only if you've redefined the epsilon symbol string in hfst.
    
    Returns:
        str: The desired string, without epsilons



# Design
The construction of the final transducer is broken down into two large components:

- A lexicon
- Alternation rules

We combine the lexicon FST and the FSTs that encode the alternation rules into one large transducer that behaves like a cascade. This single transducer has the same effect as giving an input to the lexicon transducer, feeding its output into the first rule transducer, feeding that output into the next rule transducer, and so on.
The cascade is built with the composition operator (`.o.`) of the regular-expression language. Suppose we have the lexicon transducer in an FST named `Lexicon` and the alternation rules as FSTs named `Rule1`, ..., `RuleN`. We can issue the regular expression
```
Lexicon .o. Rule1 .o. Rule2 .o. ... .o. RuleN ;
```
and produce a single transducer that is the composite of the different rule transducers and the lexicon transducer.

In [3]:
defs = fst.Definitions({
    "V": "[a|i|e|o|u]",
})

In [4]:
grammar = hfst.compile_lexc_file('english.lexc')

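# Double g before the morpheme boundary "^" when -ing or -ed follows (beg^ing -> begg^ing).
consonantduplication = hfst.regex(defs.replace('g -> g g || _ "^" [i n g | e d]'))
# Delete e before the boundary when -ing or -ed follows (make^ing -> mak^ing).
edeletion = hfst.regex('e -> 0 || _ "^" [ i n g | e d ]')
# Insert e after s, z, x, ch or sh when the suffix -s follows (fox^s -> foxe^s).
einsertion = hfst.regex('[..] -> e || s | z | x | c h | s h _ "^" s')
# Rewrite y as ie before -s and as i before -ed (try^s -> trie^s, try^ed -> tri^ed).
yreplacement = hfst.regex('y -> i e || _ "^" s ,, y -> i || _ "^" e d')
# Insert k after a vowel plus c when -ed or -ing follows (panic^ed -> panick^ed).
kinsertion = hfst.regex(defs.replace('[..] -> k || V c _ "^" [e d | i n g]'))
# Remove the morpheme boundary marker once every rule has applied.
cleanup = hfst.regex('"^" -> 0')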

# Careful: compose() modifies the transducer in place. Rerunning the compose calls
# without first rebuilding `grammar` from scratch composes the rules onto an
# already-composed transducer and produces needlessly large FSTs.

grammar.compose(consonantduplication)
grammar.compose(einsertion)
grammar.compose(edeletion)
grammar.compose(yreplacement)
grammar.compose(kinsertion)
grammar.compose(cleanup)
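
To make the cascade idea concrete, the same sequence of rewrites can be imitated in plain Python, with each rule written as an ordinary regular-expression substitution and applied in the order of the `compose` calls. This is only a sketch and not how HFST works internally: it runs in the generation direction only, and it assumes the lexicon emits intermediate forms such as `panic^ed`, which is a guess because `english.lexc` is not shown here. The names `RULES` and `cascade` are hypothetical.

```python
import re
from functools import reduce

# Each rule rewrites an intermediate form in which "^" marks the morpheme boundary.
# The order matches the compose() calls in the cell above.
RULES = [
    # consonant duplication: g -> gg before ^ing / ^ed
    lambda s: re.sub(r"g(?=\^(?:ing|ed))", "gg", s),
    # e-insertion: add e after s, z, x, ch, sh when ^s follows
    lambda s: re.sub(r"(ch|sh|s|z|x)(?=\^s)", r"\1e", s),
    # e-deletion: drop e before ^ing / ^ed
    lambda s: re.sub(r"e(?=\^(?:ing|ed))", "", s),
    # y-replacement: y -> ie before ^s, y -> i before ^ed
    lambda s: re.sub(r"y(?=\^s)", "ie", s),
    lambda s: re.sub(r"y(?=\^ed)", "i", s),
    # k-insertion: add k after vowel + c before ^ed / ^ing
    lambda s: re.sub(r"(?<=[aeiou]c)(?=\^(?:ed|ing))", "k", s),
    # cleanup: remove the boundary marker
    lambda s: s.replace("^", ""),
]


def cascade(form, rules=RULES):
    """Feed the output of each rule into the next one, left to right."""
    return reduce(lambda current, rule: rule(current), rules, form)


if __name__ == "__main__":
    for form in ["panic^ed", "beg^ing", "fox^s", "city^s"]:
        print(form, "->", cascade(form))
```

Composing the transducers ahead of time gives the same input/output behaviour as this step-by-step pipeline, but packs it into a single machine that can also be applied in the opposite direction.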

In [5]:
fst.lookup(grammar, 'panic+V+Past')

'panicked'

In [6]:
print(fst.pairs(grammar))

beg+V:
 beg
beg+V+3P+Sg:
 begs
beg+V+Past:
 begged
beg+V+PastPart:
 begged
beg+V+PresPart:
 begging
cat+N+Sg:
 cat
cat+N+Pl:
 cats
city+N+Pl:
 cities
city+N+Sg:
 city
fox+N+Sg:
 fox
fox+V:
 fox
fox+V+Past:
 foxed
fox+V+PastPart:
 foxed
fox+V+PresPart:
 foxing
fox+N+Pl:
 foxes
fox+V+3P+Sg:
 foxes
make+V+Past:
 maked
make+V+PastPart:
 maked
make+V+PresPart:
 making
make+V:
 make
make+V+3P+Sg:
 makes
panic+N+Sg:
 panic
panic+N+Pl:
 panics
panic+V:
 panic
panic+V+3P+Sg:
 panics
panic+V+Past:
 panicked
panic+V+PastPart:
 panicked
panic+V+PresPart:
 panicking
try+V+Past:
 tried
try+V+PastPart:
 tried
try+N+Pl:
 tries
try+V+3P+Sg:
 tries
try+N+Sg:
 try
try+V:
 try
try+V+PresPart:
 trying
watch+N+Sg:
 watch
watch+V:
 watch
watch+V+Past:
 watched
watch+V+PastPart:
 watched
watch+V+PresPart:
 watching
watch+N+Pl:
 watches
watch+V+3P+Sg:
 watches
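
The listing above pairs each analysis with its surface form. Read in the other direction, the same relation maps a surface form to every analysis that produces it, which is what a morphological analyzer does. A transducer supports that direction natively; the sketch below only imitates it with a dictionary built from a few pairs copied from the listing. `PAIRS`, `build_analyzer` and `analyze` are hypothetical names, not part of `hfst` or `fstutils`.

```python
from collections import defaultdict

# A few (analysis, surface form) pairs copied from the listing above.
PAIRS = [
    ("fox+N+Pl", "foxes"),
    ("fox+V+3P+Sg", "foxes"),
    ("panic+V+Past", "panicked"),
    ("panic+V+PastPart", "panicked"),
    ("try+N+Sg", "try"),
    ("try+V", "try"),
    ("city+N+Pl", "cities"),
]


def build_analyzer(pairs):
    """Invert the relation: map each surface form to all of its analyses."""
    index = defaultdict(list)
    for analysis, surface in pairs:
        index[surface].append(analysis)
    return dict(index)


ANALYSES = build_analyzer(PAIRS)


def analyze(surface):
    """Return every known analysis of a surface form (empty list if unknown)."""
    return ANALYSES.get(surface, [])


if __name__ == "__main__":
    print(analyze("foxes"))
```

Ambiguous forms such as `foxes` come back with more than one analysis, and forms outside the toy lexicon come back with none.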

