# Morphology: the internal linguistic structure of words

In addition to consisting of phones (in speech) and characters (in writing), words can often be decomposed into smaller, but still meaning-bearing units, so-called __morphemes__. By definition, 

> A __morpheme__ is the smallest meaningful unit in a language. ([Wikipedia: Morpheme](https://en.wikipedia.org/wiki/Morpheme))

Although English is not a morphologically rich language, many English words consist of several morphemes, e.g. the word

> unbearables = un + bear + able + s

Several useful distinctions can be made among morphemes:

+ __bound__ vs __free (unbound)__ morphemes: While free morphemes (like "bear") can stand alone as independent words, bound morphemes (like "un-" and "-s") can only form words together with other morphemes.
+ __affixes__ vs __roots__: Root morphemes (of which there is usually only one in a word) are the main parts of the word with the most specific semantic content ("bear" in the example), around which the other morphemes, the affixes, are placed ("un-", "-able" and "-s"). Most but not all roots are free: the root in "sociology" is "socio", but it is bound, i.e., it cannot stand alone.

## Affixes

Affixes can be further grouped according to their (typically positional) relation to the morpheme(s) to which they are attached:

|  Affix type | Relation to the other morphemes|Example   | 
|---|---|---|
| prefix  | precedes   |  un-, anti- | 
| suffix  | follows  | -s, -ing  | 
| infix  | inserted inside  |  "I've gone to Singabloodypore!" | 
| circumfix| around | ge..t in German, as in gespielt|
| internal stem change | changes | e.g. kitaab 'book', kutub 'books' in Arabic. English remnant: swim -> swam|

The above list is far from complete: in certain languages there are other affix types, such as reduplication (of a morpheme) and tone/pitch changes.

### Inflectional vs derivational affixes

A crucial distinction is between inflectional and derivational affixes:

+ __inflectional affixes__ create new forms of the _same word_: they express grammatical categories such as person, tense, number etc. Inflections never change the POS category of a word (!!). English examples are the plural "-(e)s", the 3rd person singular "-s" and the progressive "-ing".
+ __derivational affixes__, on the other hand, __form new words__, e.g., the "-able" suffix can form adjectives from verbs, like "bear + able".
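
The distinctions introduced so far can be made concrete with a small data structure. The following is only an illustrative sketch: the `Morpheme` class, its field names and the hand-made segmentation of "unbearables" are assumptions introduced for this example, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Morpheme:
    form: str
    kind: str      # "root", "derivational" or "inflectional"
    is_free: bool  # can the morpheme stand alone as a word?


# Hand-made segmentation of the running example "unbearables".
UNBEARABLES = [
    Morpheme("un", "derivational", is_free=False),
    Morpheme("bear", "root", is_free=True),
    Morpheme("able", "derivational", is_free=False),
    Morpheme("s", "inflectional", is_free=False),
]


def surface(morphemes):
    """Concatenate the morphemes into the written word form."""
    return "".join(m.form for m in morphemes)


def strip_inflection(morphemes):
    """Drop inflectional affixes, keeping the root and derivational affixes."""
    return [m for m in morphemes if m.kind != "inflectional"]
```

Here `surface(strip_inflection(UNBEARABLES))` yields "unbearable", i.e., the word without its inflection. Plain concatenation only works because this particular word involves no spelling changes at morpheme boundaries.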

## Stem and lemma (in theory)

+ The __stem__ of a word consists of the base part of the word that is common to all inflected forms. As a consequence, the stem is often not a meaningful word, e.g., the stem of "produced" is "produc" (because of "producing" etc.) (the example is from the [Wikipedia Lemma entry](https://en.wikipedia.org/wiki/Lemma)).
+ A __lemma__, in contrast, is always a complete word, namely, the uninflected base form of inflected forms. Continuing the example, the lemma of "produced" is "produce".
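
The difference between the two notions can be illustrated with a few lines of Python. This is a minimal sketch under strong assumptions: the form list and the lemma dictionary are hand-written, and the longest common prefix is used as a stand-in for the stem, which is only adequate for purely suffixing paradigms like this one.

```python
from os.path import commonprefix

# Hand-listed inflected forms of one verb (illustrative only).
FORMS = ["produce", "produces", "produced", "producing"]

# Toy lemma dictionary: every form maps to the uninflected base form.
LEMMAS = {form: "produce" for form in FORMS}


def stem_of(forms):
    """Longest common prefix of the forms, used here as a stand-in for the stem."""
    return commonprefix(forms)
```

Note that `stem_of(FORMS)` is not itself among the listed word forms, whereas the lemma is.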

## Morphologically rich(er) languages

English has only a handful of inflectional affixes (eight or nine, depending on how they are counted), and, moreover, an English word cannot contain more than one of them -- this is why the PTB POS tag set can cover the inflectional variants of POS categories without introducing a huge number of additional POS tags. This also means that POS taggers using these tags actually perform both POS and morphological analysis in one go.

While there are languages with an even lower morphemes/word ratio and fewer inflections than English (as an extreme case, so-called purely isolating languages contain only one-morpheme words), a lot of languages contain a rich set of inflectional morphemes that convey grammatical information about case, person, number etc., plus an arsenal of derivational ones. An extreme, anecdotal Turkish example:

<a href="https://qph.fs.quoracdn.net/main-qimg-c8e8c80c6f2ac4c1ed249c6cb8211a12-c"><img src="https://drive.google.com/uc?export=view&id=1vfUYFf0DQXiscioy5KkxcqfN8hgFSu1o"></a>

It is obvious that the syntactic and semantic analysis of languages of this type is impossible without somehow dealing with the complex internal structure of words, and this structure is too complicated to be handled with simple atomic tags like the PTB POS-tags for English.

For many NLP tasks the sheer size of the vocabularies in a standard size corpus makes it necessary to reduce the words to their stems or lemmas.

## Compounds

An important complicating factor in the morphology of many languages is the presence of __compound words__, that is, words built up from two or more individual words. For instance German is (in)famous for its many compounds, e.g.

> Schadenfreude, Weltschmerz, Zeitgeist, Wanderlust

etc. Productive compounding makes the task of morphological analysis considerably more complicated (for starters, these words contain more than one root morpheme).

## Morphological analysis tasks

+ __Decide whether a string is a well formed word__: This is an "odd one out" here, but, in fact, for many morphologically rich languages simple list lookup is not feasible, since word formation is recursive. The task itself is crucial for spell-checking.
+ __Stemming__: determine the stem of input words.
+ __Lemmatization__: determine the lemma of input words.
+ __Morphological tagging__: tag input words according to the grammatical information (case, person etc.) that their inflections express. The tags here are typically structured, e.g., sets of attribute-value pairs.
+ __Morphological segmentation__: segment the input words into morphemes
+ __Full morphological analysis__: segment the words into morphemes and categorize each morpheme according to type, and grammatical information they convey (in the case of inflectional morphemes). Full morphological analysis often includes lemmatization and is frequently limited to analyzing the inflectional morphemes.

### Context dependence

Note that, similarly to POS-tags, in many cases the morphological structure of words is ambiguous without its context, e.g., only context can decide whether

>the "-s" suffix in "chairs" is a plural or a 3rd person suffix

In these cases analyzers that work on individual words can only offer _alternatives,_ and further, context-dependent methods are needed for choosing between them.
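
A word-level analyzer of this kind can be sketched as follows. The lexicon, the function name and the feature strings are invented for illustration; the point is only that the return value is a list of alternatives rather than a single analysis.

```python
# Toy lexicon: base form -> set of possible POS categories (illustrative only).
LEXICON = {"chair": {"NOUN", "VERB"}, "cat": {"NOUN"}, "sing": {"VERB"}}


def analyze(word):
    """Return every analysis of an '-s' form that the toy lexicon licenses."""
    analyses = []
    if word.endswith("s") and word[:-1] in LEXICON:
        base = word[:-1]
        if "NOUN" in LEXICON[base]:
            analyses.append((base, "NOUN", "Number=Plur"))
        if "VERB" in LEXICON[base]:
            analyses.append((base, "VERB", "Number=Sing|Person=3"))
    return analyses
```

For "chairs" the function returns two analyses, for "cats" only one; choosing between the two readings of "chairs" is left to a context-aware component.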

## Data sets

Since in the case of morphologically rich languages syntactic analysis requires morphological information, __treebanks__ (like PTB) usually contain some morphological annotations for their language. 

As for a multilingual, morphologically oriented data set, currently by far the most important one is the [Universal Dependencies corpus](https://universaldependencies.org/) (UD) which is also the most commonly used data set for evaluation. 

There are several online query services for UD corpora, e.g. [grew-match](http://match.grew.fr/).
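
UD treebanks are distributed in the CoNLL-U format, in which every token occupies one line with ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The following sketch shows how the lemma and the attribute-value morphological features of a token could be read from such a line. The helper name and the sample line are hand-written for this example and are not taken from an actual treebank; multiword-token and comment lines are not handled.

```python
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def parse_token_line(line):
    """Split one CoNLL-U token line into a dict; FEATS becomes a nested dict."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != len(FIELDS):
        raise ValueError("a CoNLL-U token line must have 10 columns")
    token = dict(zip(FIELDS, columns))
    feats = token["feats"]
    # "_" marks an empty column; otherwise features look like "Case=Dat|Number=Plur"
    token["feats"] = {} if feats == "_" else dict(
        pair.split("=", 1) for pair in feats.split("|")
    )
    return token


# Hand-written sample line for a German dative plural noun.
SAMPLE = "3\tHäusern\tHaus\tNOUN\t_\tCase=Dat|Gender=Neut|Number=Plur\t1\tobl\t_\t_"
```

Calling `parse_token_line(SAMPLE)` gives access to the lemma via `["lemma"]` and to individual features via, e.g., `["feats"]["Case"]`.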

# Rule-based methods

## Stemming

Although in theory stemming is not necessarily far removed from lemmatization, in computational linguistics stemmers are typically crude heuristic rule sequences that remove (or sometimes replace) affixes when certain pattern matching conditions hold, and thereby arrive at a basic "core" of the word. The removed affixes are not only inflectional, and the result is frequently not a full word. Also, the algorithms are typically fully deterministic: no alternative stemmings are produced. The following examples from [Manning et al. (2008): Introduction to Information Retrieval](https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html) show the output of three different English stemmers on an example:

<a href="https://nlp.stanford.edu/IR-book/html/htmledition/img102.png"><img src="https://drive.google.com/uc?export=view&id=1moQLHDSzCQhd7FAZ8IPVNSE4_oIYWNJ2" width="600px"></a>

The most influential English stemming algorithm has been the [Porter stemmer](https://snowballstem.org/algorithms/) (1980), which consists of rules like the following ones:

<a href="https://carlmorphet.files.wordpress.com/2013/12/porter-stemming-overview.png?w=594"><img src="https://drive.google.com/uc?export=view&id=1-7IT4EE9Q4zyxYPSc979h1ty3Ixjgk7S" width="600px"></a>

The algorithm applies the rule with the longest matching pattern in each group, and has some heuristics for measuring whether a word is long enough to apply a rule (otherwise the ending might not be a suffix but part of the stem).
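
The longest-match principle and the length heuristic can be sketched in a few lines. This is not the full Porter stemmer: it covers only the four rules of its step 1a plus the "(m>0) EED -> EE" rule, and the measure function is simplified by assuming that only "a", "e", "i", "o", "u" are vowels (the original algorithm also treats some occurrences of "y" as vowels). The function and variable names are made up for this example.

```python
VOWELS = set("aeiou")  # simplification: Porter also treats some "y"s as vowels


def measure(stem):
    """Count vowel-to-consonant switches, i.e. m in the pattern [C](VC)^m[V]."""
    m, prev_is_vowel = 0, False
    for ch in stem:
        is_vowel = ch in VOWELS
        if prev_is_vowel and not is_vowel:
            m += 1
        prev_is_vowel = is_vowel
    return m


def apply_group(word, rules):
    """Apply the rule with the longest matching suffix, if its condition holds.

    Each rule is a (suffix, replacement, min_measure) triple.
    """
    matching = [rule for rule in rules if word.endswith(rule[0])]
    if not matching:
        return word
    suffix, replacement, min_measure = max(matching, key=lambda rule: len(rule[0]))
    stem = word[: len(word) - len(suffix)]
    # If the condition of the longest-matching rule fails, no other rule is tried.
    return stem + replacement if measure(stem) >= min_measure else word


STEP_1A = [("sses", "ss", 0), ("ies", "i", 0), ("ss", "ss", 0), ("s", "", 0)]
EED_RULE = [("eed", "ee", 1)]
```

With these rules "caresses" is reduced to "caress" and "ponies" to "poni", while "caress" is left intact because the "ss" rule wins over the shorter "s" rule. The measure condition lets "agreed" become "agree" but blocks the same rule for "feed", where the remaining stem "f" is too short.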

The Porter stemmer was later reimplemented in Snowball, a special language for describing "suffix stripping grammars", and stemmers for other European languages like German, French etc. were also implemented in it -- see the [homepage](https://snowballstem.org/algorithms/) for details.

### Trade-offs

Stemmers like the Porter stemmer are crude algorithms that work with a few manually engineered heuristic rules. They know only about suffixes to remove and have no dictionary. As a consequence, they are quick and can work on any word of the target language, but are more error-prone than the dictionary-based full morphological analyzers, which can check whether the lemma / stem they arrive at actually exists.

## FSA-based morphological analysis

### Defining wordhood with regular expressions
We have already mentioned that even the seemingly simplest task on our list -- deciding whether a string is a word of a language -- can be tricky, especially for morphologically rich languages like Turkish or Hungarian. Considering the _recursivity_ of some morphological constructs, e.g., compounding, it is natural to turn (again) to regular expressions and try to define the words of the language with an -- obviously huge -- regular expression pattern.

By doing so, we actually

+ assume, by definition, that the words of the language together constitute a [__regular language__](https://en.wikipedia.org/wiki/Regular_language), and also
+ assume (by [Kleene's equivalence theorem](https://www.cs.odu.edu/~toida/nerzic/390teched/regular/fa/kleene-1.html)) that there is a __finite state machine acceptor__ which accepts exactly the well formed words of the language in question.
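
As a toy illustration of the idea, the following sketch defines a miniature "language" with Python's `re` module. The root list and the pattern are invented for this example: any non-empty sequence of roots (a crude stand-in for recursive compounding), optionally followed by the suffix "s", counts as a word.

```python
import re

# Invented toy root inventory.
ROOTS = ["car", "cat", "dog", "house", "boat"]

# One or more roots (compounding via the Kleene plus), then an optional "-s".
WORD_PATTERN = re.compile(r"(?:%s)+s?" % "|".join(ROOTS))


def is_word(string):
    """True iff the whole string matches the toy wordhood pattern."""
    return WORD_PATTERN.fullmatch(string) is not None
```

The pattern accepts "cats" and "houseboats" but rejects "catz". It also accepts nonsense like "catcatcat", which shows that a realistic pattern needs far more restrictive rules than this one.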

### FSA acceptors

Finite state acceptors are finite state machines that consume an input sequence of characters and produce only a binary output, by "accepting" or "rejecting" the input. They differ from the basic FSAs we discussed earlier by having

+ an explicit __start state__,
+ a set of designated __accepting states__, and
+ their transitions labeled with symbols from a finite alphabet, or with the empty string.

By definition, an FSA acceptor accepts a string iff it has a sequence of consecutive transitions which
+ starts from the start state,
+ ends in an accepting state,
+ and the concatenation of the transition labels is the string in question.

For instance, the 

<a href="http://drive.google.com/uc?export=view&id=1G-bhTE9Xo9-kuPrrg-GDL5sAkC2MrPID"><img src="https://drive.google.com/uc?export=view&id=1YS0jn438yJq664pm4x_3Zq6fVkPd-CZJ" width="450px"></a>

acceptor accepts the words "car", "cars", "cat" and "cats". It also illustrates that there can be several acceptors accepting the same set of strings, with hugely different complexities, since the following acceptor is equivalent:

<a href="http://drive.google.com/uc?export=view&id=18eBjTTiSjAJt1S9f9YnYC1HNOnppA7us"><img src="https://drive.google.com/uc?export=view&id=1w6KgbYePcL8kPIvQQ-VbS6jOJkN_tXeB" width="450px"></a>

(The two FSA examples are from [Miriam Butt and Tina Bögel's Finite State Morphology Tutorial](https://ling.sprachwiss.uni-konstanz.de/pages/home/boegel/Dateien/CLT09_tutorial.pdf))

This example also shows a (primitive) example of handling a common suffix ("-s") and a dictionary of base forms together.
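
An acceptor for the same four words can be written down directly as a transition table. The sketch below is a deterministic acceptor without empty-string transitions, so it is a special case of the general definition above; the state numbering is arbitrary and need not coincide with either figure.

```python
# (state, symbol) -> next state; the shared final "-s" arc handles both plurals.
TRANSITIONS = {
    (0, "c"): 1,
    (1, "a"): 2,
    (2, "r"): 3,
    (2, "t"): 3,
    (3, "s"): 4,
}
START = 0
ACCEPTING = {3, 4}


def accepts(string):
    """Follow the transitions symbol by symbol; accept iff we end in an accepting state."""
    state = START
    for symbol in string:
        state = TRANSITIONS.get((state, symbol))
        if state is None:  # no transition for this symbol: reject
            return False
    return state in ACCEPTING
```

Deciding acceptance takes a single pass over the input, one table lookup per character.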

One important advantage of using regexps/FSA acceptors to represent a language's lexicon is that there are very efficient algorithms to simplify/minimize acceptors and also to decide whether they accept a string or not.  

Although regular expressions/FSA acceptors could be sufficient to represent the __lexicon__ of languages (even if they are morphologically rich), there is no easy way to recover the _analysis_ of strings.
FSA variants that can produce _structured output_ while consuming their input are much more practical for morphological analysis.

### FSA transducers

From a technical point of view, the only difference between acceptors and transducers is that transducers add _outputs_ to transitions: in addition to consuming an input symbol, transitions can output symbols from an output alphabet as well, which makes it possible to output a morphological analysis while consuming the input string.

Transducers have two important properties that make them very convenient for developing morphological analyzers: they can be 

+ __composed__: if $f$ and $g$ are transducers, then there is a composition transducer $f \circ g$ which maps exactly those inputs and outputs to each other that can be produced by first "running" $g$ on the input and then $f$ on the output of $g$, and
+ __inverted__: by simply swapping the input and output labels on each transition of a transducer ("running it backward") we can produce a transducer which generates the original input from the original transducer's output.

As a result, morphological analyzers can be built by designing a cascade of transducers that generate actual words from morphological descriptions, composing them into a single transducer, and then inverting the result.

The first component of a generating cascade can be a "lexicon" along the following lines (the edge label notation is &lt;INPUT:OUTPUT&gt;, "^" stands for morpheme boundary and single symbols indicate identical input and output):

<a href="https://fomafst.github.io/englex.png"><img src="https://drive.google.com/uc?export=view&id=18vDnhIQ-rth05HzFjsXbk8JroBwUfwAn" width="100%"></a>

(Example image source: [FOMA morphological analysis tutorial](https://fomafst.github.io/morphtut.html))

This example illustrates nicely how the lexicon can generate word forms from morphological descriptions:

+ "make+V+PresPart" $\rightarrow$ "make^ing"
+ "watch+N+Pl" $\rightarrow$ "watch^s"

Of course, these lexicon outputs are only intermediary representations that have to be further processed by other transducers that insert an "e" into "watch^s" to produce the correct "watches" etc. All in all, inverting the final composed transducer results in a single transducer performing morphological analysis, and, as a bonus, we also produced a functional __generator__, which generates words from morphological descriptions. A nice consequence of this generation ability is that it is easy to generate __lemmas,__ so these analyzers can also act as __lemmatizers__.
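
The generate-then-invert idea can be tried out on a deliberately tiny scale. The sketch below is not how FOMA, HFST or OpenFST work internally: the `FST` class is a hypothetical toy, composition is only simulated at the level of input-output relations (by feeding the outputs of one stage into the next) instead of constructing a single composed machine, and the lexicon knows just one noun.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FST:
    transitions: tuple    # entries: (source state, input label, output label, target state)
    accepting: frozenset  # accepting states; state 0 is always the start state

    def apply(self, text):
        """Return every output string produced on an accepting path for text."""
        results, agenda = set(), [(0, 0, "")]
        while agenda:
            state, pos, out = agenda.pop()
            if pos == len(text) and state in self.accepting:
                results.add(out)
            for src, inp, outp, dst in self.transitions:
                # "" is the empty label; the search assumes no cycles of ""-input arcs
                if src == state and text.startswith(inp, pos):
                    agenda.append((dst, pos + len(inp), out + outp))
        return results

    def invert(self):
        """Swap input and output labels on every transition."""
        return FST(tuple((s, o, i, d) for s, i, o, d in self.transitions), self.accepting)


def apply_cascade(text, *fsts):
    """Relation-level composition: feed each output of one stage into the next."""
    current = {text}
    for fst in fsts:
        current = {out for s in current for out in fst.apply(s)}
    return current


# Toy lexicon for a single noun: "cat+N+Sg" -> "cat", "cat+N+Pl" -> "cat^s"
LEXICON = FST(
    ((0, "c", "c", 1), (1, "a", "a", 2), (2, "t", "t", 3),
     (3, "+N", "", 4), (4, "+Sg", "", 5), (4, "+Pl", "^s", 5)),
    frozenset({5}),
)

# Toy spelling stage: copy letters, realize a final "^s" as "s"
SPELLING = FST(
    tuple((0, ch, ch, 0) for ch in "acst") + ((0, "^s", "s", 1),),
    frozenset({0, 1}),
)


def generate(description):
    return apply_cascade(description, LEXICON, SPELLING)


def analyze(word):
    # inverse of the cascade: the inverted stages applied in reverse order
    return apply_cascade(word, SPELLING.invert(), LEXICON.invert())
```

With these definitions `generate("cat+N+Pl")` produces "cats" and `analyze("cats")` recovers "cat+N+Pl", so the same two transducers serve as generator and analyzer. A word whose stem is not in the toy lexicon, such as "dog", receives no analysis at all.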

__Assessment__

As single-word (context independent) analyzers, FSA transducer-based solutions still represent the state of the art for morphologically rich languages, with a number of notable limitations:

+ __Undergeneration__: if a stem is missing from the lexicon then the system is unable to analyze it correctly. This means that _unknown words_ cannot be analyzed.
+ __Overgeneration__: if the rules are not restrictive enough then the system can generate superfluous, incorrect analyses.
+ __Context independence__: if there is more than one alternative then FSAs can only list them -- disambiguation needs to be done with other, additional methods.

__See also__

+ A good introduction to the transducer-based morphology methodology, on which this discussion was partly based, is the [FOMA Morphological Analysis Tutorial](https://fomafst.github.io/morphtut.html) of Mans Hulden.
+ Eisenstein's excellent [NLP book](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf) has a good discussion of FSAs and morphology in Chapter 9.
+ Important open source FSA libraries: [OpenFST](http://www.openfst.org/twiki/bin/view/FST/WebHome), [HFST](http://hfst.github.io/).

# Hybrid systems

As we have seen, one of the weaknesses of rule-based morphological tools is that they are not context-aware. A possible solution is to combine an ML-based, context-aware sequence tagging model (e.g., a POS-tagger) with a rule-based morphological tool (lemmatizer, analyzer etc.), either

+ to disambiguate, i.e., to choose between the alternative analyses produced by the rule-based tool,
+ or -- more intricately -- to drive the rule-based analysis process by feeding the tagger's output to it as input.

[spaCy's lemmatizer module](https://spacy.io/api/lemmatizer) is a good example of the second strategy: it is a rule-based algorithm (relying on lexical lookup-tables and suffix-removal rules) which uses POS- and morphological tags as input.
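
The second strategy can be sketched in plain Python. The tables below are invented toy data and the function is not spaCy's implementation; the sketch only illustrates how a POS tag, assumed to come from a context-aware tagger, selects which exceptions and suffix rules the rule-based component applies.

```python
# Toy tables, not spaCy's actual data.
EXCEPTIONS = {("VERB", "was"): "be", ("NOUN", "mice"): "mouse"}
SUFFIX_RULES = {
    "NOUN": [("ies", "y"), ("s", "")],
    "VERB": [("ing", ""), ("ed", ""), ("s", "")],
}


def lemmatize(word, pos):
    """Look the word up in the exception table, then try the POS-specific suffix rules."""
    if (pos, word) in EXCEPTIONS:
        return EXCEPTIONS[(pos, word)]
    for suffix, replacement in SUFFIX_RULES.get(pos, []):
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: len(word) - len(suffix)] + replacement
    return word
```

The same string gets different lemmas depending on the tag: "meeting" stays "meeting" as a NOUN but becomes "meet" as a VERB. Unlike a dictionary-based analyzer, this toy version never checks whether the resulting lemma is an existing word.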

# ML-based approaches

The context sensitivity of morphological attributes and the problem FSA-based approaches have with unseen words have led to an increased interest in developing alternative, classical ML and DNN-based morphological tools in the last few years.

## Morphological tagging

### As multiclass sequence tagging

The simplest approach (as in the case of PTB POS tags), has been to see all possible POS + morphological tag combinations as atomic tags and use simply a multiclass sequence tagging model. This means that, e.g., the labelings

> [POS=NOUN,CASE=Dat,NUM=Sg] and [POS=NOUN,CASE=Dat,NUM=Pl]

are treated as two atomic tags without acknowledging that they differ only regarding their NUM attribute.
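
In code, the atomic treatment amounts to flattening every attribute-value bundle into one opaque label and numbering the labels. The helper names below are made up for this sketch, and the attribute names simply follow the example above.

```python
def atomic_label(tag):
    """Flatten an attribute-value tag into a single string label."""
    return "|".join(f"{key}={value}" for key, value in sorted(tag.items()))


def build_label_index(tags):
    """Assign an integer class id to every distinct atomic label."""
    labels = sorted({atomic_label(tag) for tag in tags})
    return {label: class_id for class_id, label in enumerate(labels)}


DATIVE_SG = {"POS": "NOUN", "CASE": "Dat", "NUM": "Sg"}
DATIVE_PL = {"POS": "NOUN", "CASE": "Dat", "NUM": "Pl"}
LABEL_INDEX = build_label_index([DATIVE_SG, DATIVE_PL])
```

The two tags end up as two unrelated class ids: nothing in the output space of the classifier records that they share their POS and CASE values.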

In the beginning, standard ML sequence tagging approaches were used (see, e.g. [MarMoT](http://cistern.cis.lmu.de/marmot/), which is a CRF-based, manually feature-engineered tagger from 2013), but current solutions are DNN-based.
Naturally, word-level modelling/embedding is even more important for this task than for basic POS-tagging, so the typical models are __hierarchical__, with elaborate word-level and subword/character-level modeling. A relatively simple example: the paper [An Extensive Empirical Evaluation of Character-Based Morphological Tagging for 14 Languages (2017)](https://www.aclweb.org/anthology/E17-1048.pdf) examines the performance of combining character-level LSTMs and (highway) CNNs with token-level biLSTMs:

<a href="https://d3i71xaburhd42.cloudfront.net/2c4588f53a562385086c74205fcdfa853ac03aa7/3-Figure1-1.png"><img src="https://drive.google.com/uc?export=view&id=1e2M-JEkEp2Y2lzLICLaUwaEROr1mriso" width=700px></a>

(Image from the paper, LUT stands for learned "lookup tables" for character embedding)

and they registered significant improvements compared to MarMoT for certain languages:

<a href="https://d3i71xaburhd42.cloudfront.net/2c4588f53a562385086c74205fcdfa853ac03aa7/9-Table3-1.png"><img src="https://drive.google.com/uc?export=view&id=1IVTto9MVIAg4tOmeC0BK1an9tSW7-Cte" width=600px></a>

(Image from the paper)

### Utilizing the compositional structure of tags

Treating composite tags as atomic has obvious problems: the large number of classes leads to performance and data sparsity problems, and it ignores the compositional structure of morphological tags. The most recent neural taggers, accordingly, try to model the internal structure of morphological tags, keeping the hierarchical subword-level-word-level approach. Modeling choices include:

+ Training different classifiers for each morphological tag category (POS, PERSON, NUMBER, CASE etc.)
+ Hierarchical modeling, in which POS probabilities are used as input for individual classifiers for tag categories
+ Seq2Seq: Generate the sequence of morphological key-value tags with a seq2seq model from the token's embedding

<a href="https://d3i71xaburhd42.cloudfront.net/33d06410f29d12f001bdcd4c830c4af6a533278d/4-Figure1-1.png"><img src="https://drive.google.com/uc?export=view&id=111n1_vKX61KVI9kVhdXQbV9Ha_9Z7TbF"></a>

(Image from the paper [Tkachenko & Sirts (2018): Modeling Composite Labels for Neural Morphological Tagging](https://arxiv.org/pdf/1810.08815.pdf), (d) is the simple multiple classification baseline)

The best results have been reported using seq2seq methods, see [Tkachenko & Sirts (2018): Modeling Composite Labels for Neural Morphological Tagging](https://arxiv.org/pdf/1810.08815.pdf).
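
To see why decomposing the tags helps with the number of classes, consider the first modeling choice listed above (one classifier per category). The sketch below only prepares the training targets and counts classes; the tag inventory is invented, and the placeholder value for non-applicable categories is an assumption of this example rather than something prescribed by the cited papers.

```python
from itertools import product

CATEGORIES = ("POS", "CASE", "NUM")
ABSENT = "_"  # placeholder value for categories that do not apply

# Invented toy inventory: nouns and adjectives have case and number, verbs only number.
TAGS = [
    {"POS": pos, "CASE": case, "NUM": num}
    for pos, case, num in product(("NOUN", "ADJ"), ("Nom", "Acc", "Dat", "Gen"), ("Sg", "Pl"))
] + [{"POS": "VERB", "NUM": num} for num in ("Sg", "Pl")]


def per_category_targets(tag):
    """One target value per category, with a placeholder for missing categories."""
    return {cat: tag.get(cat, ABSENT) for cat in CATEGORIES}


def class_counts(tags):
    """Number of atomic tag classes vs. number of values per category."""
    atomic = {tuple(sorted(tag.items())) for tag in tags}
    per_category = {
        cat: len({per_category_targets(tag)[cat] for tag in tags}) for cat in CATEGORIES
    }
    return len(atomic), per_category
```

Even in this small inventory the atomic approach needs 18 classes, while the per-category classifiers need only 3, 5 and 2 output values. The gap grows quickly as more categories and values are added, since the atomic label set grows multiplicatively.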

Due to the relative sparsity of data, there is research activity into transfer learning methods, see, e.g., [CMU-01 at the SIGMORPHON 2019 Shared Task on Crosslinguality and Context in Morphology (2019)](https://arxiv.org/pdf/1907.10129.pdf).

## Lemmatization

Lemmatization is a word-level sequence-to-sequence task: based on its form and context the input word has to be mapped onto its lemma. Two main approaches have emerged for modeling the mapping:

### Edit-tree classification

This approach collects the "edit-trees" between full words and their lemmas from the training data. An edit tree consists of a tree of edit operations to get the lemma:

<a href="https://d3i71xaburhd42.cloudfront.net/bb24d3c66fd7924b219c410d6fab296c9db738cc/2-Figure1-1.png"><img src="https://drive.google.com/uc?export=view&id=1ecmup44KDs_Z567YIADqa3F6gxQtwHl3" width="400px"></a>

(Image source: [Joint Lemmatization and Morphological Tagging with Lemming](https://pdfs.semanticscholar.org/ab0b/9ed83cbb618505353542e9aea8e002026285.pdf))

The tree on the left consists of the concrete edits, and the tree on the right is the abstract edit tree, which registers only the length of the prefix and suffix before and after the longest common substring.

The lemmatizer, in turn, is implemented as a context-dependent classifier, where the classes correspond to the collected edit trees. The choice of classification models ranges from log-linear models (multiclass logistic regression) (see [Müller et al. (2015) Joint Lemmatization and Morphological Tagging with LEMMING](https://ryancotterell.github.io/papers/mueller+al.emnlp15.pdf)) to hierarchical neural models similar to the ones used for morphological tagging (see, e.g., [Chakrabarty et al. (2017) Context Sensitive Lemmatization Using Two Successive Bidirectional Gated Recurrent Networks](https://pdfs.semanticscholar.org/6aed/32124e761167332f1175909c6b0864e54bb3.pdf)).
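
The construction and application of abstract edit trees can be sketched as follows. This is a simplified reading of the idea, not the LEMMING implementation: the tuple-based tree encoding and the function names are assumptions of this example, and the longest common substring is found with the standard library's `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def build_edit_tree(form, lemma):
    """Recursively align form and lemma around their longest common substring."""
    matcher = SequenceMatcher(None, form, lemma, autojunk=False)
    match = matcher.find_longest_match(0, len(form), 0, len(lemma))
    if match.size == 0:
        return ("sub", form, lemma)  # leaf: replace form by lemma
    left = build_edit_tree(form[:match.a], lemma[:match.b])
    right = build_edit_tree(form[match.a + match.size:], lemma[match.b + match.size:])
    # inner node: lengths of the material before and after the common substring
    return ("match", match.a, len(form) - match.a - match.size, left, right)


def apply_edit_tree(tree, word):
    """Apply an edit tree to a word; return None if the tree does not fit."""
    if tree[0] == "sub":
        _, source, target = tree
        return target if word == source else None
    _, prefix_len, suffix_len, left, right = tree
    if len(word) < prefix_len + suffix_len:
        return None
    split = len(word) - suffix_len
    new_prefix = apply_edit_tree(left, word[:prefix_len])
    new_suffix = apply_edit_tree(right, word[split:])
    if new_prefix is None or new_suffix is None:
        return None
    return new_prefix + word[prefix_len:split] + new_suffix
```

Because inner nodes store only lengths, a tree built from the German pair "umgeschaut" / "umschauen" also maps the unseen form "angeschaut" to "anschauen". This ability to generalize is what makes edit trees usable as classes, while a tree that does not fit a word (the `None` case) can simply be excluded from the candidates for that word.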

### Neural seq2seq models

These approaches use standard LSTM-based seq2seq models (in certain cases with attention) that operate at the character level and produce a lemma from an input consisting of a word and its context. See, e.g., [Bergmanis & Goldwater: Context Sensitive Neural Lemmatization with Lematus (2018)](https://www.aclweb.org/anthology/N18-1126.pdf).

# Unsupervised stemming

There are a lot of languages for which morphologically annotated corpora do not exist. A morphological task which can be solved relatively successfully using unannotated training data is stemming: there are various algorithms that collect suffixes and/or prefixes in the corpus and cluster them using some similarity based heuristics, see, e.g. the [GRAS algorithm](https://www.researchgate.net/profile/Jiaul_Paik/publication/220515896_GRAS_An_Effective_and_Efficient_Stemming_Algorithm_for_Information_Retrieval/links/56f22e6008aed354e56fcd90.pdf).
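
The first step of such approaches, collecting candidate suffixes from raw word lists, can be sketched as below. This is only loosely inspired by the kind of suffix-pair statistics these algorithms rely on and is not the GRAS algorithm itself; the minimum prefix length, the function name and the word list are arbitrary choices made for this example.

```python
from collections import Counter
from itertools import combinations
from os.path import commonprefix


def suffix_pair_counts(words, min_prefix=3):
    """Count the suffix pairs of all word pairs that share a long enough prefix."""
    counts = Counter()
    for first, second in combinations(sorted(set(words)), 2):
        prefix = commonprefix([first, second])
        if len(prefix) >= min_prefix:
            counts[(first[len(prefix):], second[len(prefix):])] += 1
    return counts


WORDS = ["walk", "walks", "walked", "talk", "talks", "talked", "jump", "jumps"]
```

Suffix pairs that recur across many word pairs, such as the empty suffix paired with "s", are good candidates for genuine morphological alternations. A clustering step can then group the words connected by such frequent pairs into stem classes.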