# What are parts of speech?

A __syntactic theory__ for a language is a theoretical characterization of the language's well formed sentences. Although there is a plethora of syntactic theories, basically all of them rely on some notion of "parts of speech" (POS for short), i.e., basic syntactic roles that expressions can have in well formed sentences. For instance, in the standard constituency analysis of the English sentence

> John hit the ball.

<a href="https://upload.wikimedia.org/wikipedia/commons/5/54/Parse_tree_1.jpg"><img src="https://drive.google.com/uc?export=view&id=1-e4Vqo7pMZdCx0der8WxyaD-VUw-tY6s"></a>

(Image source: [Wikipedia: Parse tree](https://en.wikipedia.org/wiki/Parse_tree))

"John" and "ball" are nouns, "hit" is a verb, and "the" is a determiner -- these are all part-of-speech categorizations that form an important part of the full syntactic analysis.

__Context dependence__

It is important to note that the part of speech role played by an expression can be context dependent. For instance, in contrast to the previous sentence, "hit" in the sentence

> His first song was a huge hit in Europe.

is a noun. 

__Theory and language dependence__

Although the full list of POS categories is language and syntactic theory dependent, some parts of speech are pretty much universal, e.g. the categories __noun__, __verb__, __adjective__ and __adverb__ can be found in almost all languages and syntactic theories.

__Open vs closed POS categories__

- Closed POS categories, e.g., the category of determiners in English, consist of relatively small sets of words, and these sets do not change easily: it's a rare phenomenon that a new determiner is added to a language.
- Open POS categories, on the other hand, like that of English verbs, contain a large number of words and new members are added on a daily basis.

A strongly related distinction is that between __function words__ and __content words__: while words belonging to open POS categories are content words in the sense that they typically have a more or less well characterizable lexical semantic content on their own (many verbs refer to actions, many proper nouns to individuals etc.), closed POS categories contain words without much independent semantic content -- their semantics is closely tied to their semantic function within sentences.

__Why are POS categories useful?__

Determining the POS category of each word in a text is the first linguistic analysis step after tokenization/sentence segmentation: it is necessary for all later stages, i.e. full syntactic analysis, semantics etc.

In addition to being an important part of syntax, part of speech information contains useful information about a word's
- __distributional properties__, i.e., in which context the word can occur, e.g., in English nouns can be preceded by a determiner, but not verbs, and, especially for content words, about its 
- __semantics__, e.g., verbs frequently (but not exclusively) refer to some sort of actions/events while nouns frequently refer to participants of these actions/events.

This type of information is routinely exploited in NLP applications, e.g. in information retrieval.

# POS tagsets

In NLP, POS categories are typically encoded with shorthands, so-called POS tags.

## Penn Treebank tagset

A historically very influential English POS tagset was developed for the [Penn Treebank (PTB) project](https://catalog.ldc.upenn.edu/LDC99T42) (1989-1996):


|Number|Tag|Description|
|--- |--- |--- |
|1.|CC|Coordinating conjunction|
|2.|CD|Cardinal number|
|3.|DT|Determiner|
|4.|EX|Existential there|
|5.|FW|Foreign word|
|6.|IN|Preposition or subordinating conjunction|
|7.|JJ|Adjective|
|8.|JJR|Adjective, comparative|
|9.|JJS|Adjective, superlative|
|10.|LS|List item marker|
|11.|MD|Modal|
|12.|NN|Noun, singular or mass|
|13.|NNS|Noun, plural|
|14.|NNP|Proper noun, singular|
|15.|NNPS|Proper noun, plural|
|16.|PDT|Predeterminer|
|17.|POS|Possessive ending|
|18.|PRP|Personal pronoun|
|19.|PRP\$|Possessive pronoun|
|20.|RB|Adverb|
|21.|RBR|Adverb, comparative|
|22.|RBS|Adverb, superlative|
|23.|RP|Particle|
|24.|SYM|Symbol|
|25.|TO|to|
|26.|UH|Interjection|
|27.|VB|Verb, base form|
|28.|VBD|Verb, past tense|
|29.|VBG|Verb, gerund or present participle|
|30.|VBN|Verb, past participle|
|31.|VBP|Verb, non-3rd person singular present|
|32.|VBZ|Verb, 3rd person singular present|
|33.|WDT|Wh-determiner|
|34.|WP|Wh-pronoun|
|35.|WP\$|Possessive wh-pronoun|
|36.|WRB|Wh-adverb|

(Source: [Penn Treebank P.O.S. Tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html))

Since it was designed specifically for English, the tagset is very fine grained, and some of the tags encode _morphological_ information (which we will talk about later) in addition to the basic POS category, e.g. the JJ tags for adjectives indicate the _degree_ of the adjective, and VB tags indicate tense and person.

## Universal Dependencies POS tagset

Recently, the [Universal Dependencies treebank project](https://universaldependencies.org/introduction.html) (2014) has developed a cross-linguistic set of POS tags. The tagset consists only of 17 POS categories, but word annotations can contain additional language-specific morphological tags, and, therefore, can reach the level of detail that is provided by more fine-grained POS tagsets like the PTB tagset.

The Universal Dependencies (UD) POS tagset contains the following tags:

__Open class tags__

|Tag|Description|Examples|
|-- |-- |-- |
|ADJ|adjective|big, old, green, African, first|
|ADV|adverb|very, well, exactly, tomorrow, where, here, somewhere|
|INTJ| interjection|psst, ouch, bravo, hello|
|NOUN|noun|girl, cat, tree, air|
|PROPN|proper noun|Mary, John, London, NATO|
|VERB|verb|run, eat, runs, ate|

__Closed class tags__

|Tag|Description|Examples|
|--|--|--|
|ADP|adposition|in, to, during|
|AUX|auxiliary|has, is (as in "He is a teacher."), should, was, must|
|CCONJ|coordinating conjunction|and, or, but|
|DET|determiner|a, an, the, this, which, any, no (as in "I have no car.")|
|NUM|numeral|0,1,2,one,two|
|PART|particle|not, 's (as in "Andrew's table")|
|PRON|pronoun|I, myself, who|
|SCONJ|subordinating conjunction|that, if|

__Other tags__

|Tag|Description|Examples|
|--|--|--|
|PUNCT|punctuation|.  ,  ;|
|SYM|symbol|\$,  §, ©, 😝|
|X|other|for unanalyzable elements, as in "And then he just xfgh pdl jklw"|

(the examples are from the official [UD tagset description](https://universaldependencies.org/u/pos/all.html))

### See also

- [Detailed official explanation of all UD POS categories with more examples](https://universaldependencies.org/u/pos/all.html)
- Paper about the UD treebank project: [Joakim Nivre et al, Universal Dependencies v1: A Multilingual Treebank Collection (2016)](http://www.lrec-conf.org/proceedings/lrec2016/pdf/348_Paper.pdf)

# The POS-tagging task

The POS-tagging task is simply to assign correct tags from a given POS tagset to each word/token in a tokenized (and possibly sentence-segmented) input text. This _sequence labeling_ task is naturally supervised: POS-taggers are trained and evaluated on already tagged text corpora. If the tokenized text is also sentence segmented then there can be special sentence start and sentence end tokens (e.g., &lt;s&gt;, &lt;/s&gt;) indicating sentence boundaries.

## Performance metrics

The most common performance metric for POS-tagging is __accuracy__.
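As a minimal sketch (the function name `tagging_accuracy` and the toy tag sequences are illustrative assumptions, not part of any library), token-level accuracy is the fraction of tokens that received the correct tag:

```python
def tagging_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


# Toy example: "John hit the ball ." with one tagging mistake on "hit".
gold = ["PROPN", "VERB", "DET", "NOUN", "PUNCT"]
pred = ["PROPN", "NOUN", "DET", "NOUN", "PUNCT"]
print(tagging_accuracy(gold, pred))  # 4 of the 5 tokens are correct
```

Note that accuracy is computed over tokens, not sentences, so a tagger with high accuracy can still make at least one mistake in a sizable share of sentences.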

## Corpora

Treebanks automatically contain POS-information, so the standard POS-data sets usually coincide with the standard treebank data sets for a language (if they exist). 

+ In the case of English, the Wall Street Journal part of the [Penn Treebank](https://catalog.ldc.upenn.edu/LDC99T42) is the most common POS data set on which models are evaluated. Unfortunately, the PTB can only be obtained from the Linguistic Data Consortium and is not free.

+ An important free multilingual POS data set is provided by the [Universal Dependencies project](https://universaldependencies.org/), which contains corpora (of varying size) for more than 70 languages.

# POS tagging with classic sequence labeling ML methods

As we have seen, POS-tagging is a __sequence labeling__ or sequence tagging task: given a $\langle w_1,\dots,w_n\rangle$ __sequence of word "observations"__, we have to find the correct $\langle t_1,\dots, t_n \rangle$ __sequence of tags__, classifying each element in the input sequence.

## Baseline: most frequent tag

If we disregard context then our task boils down to a simple classification: label a $w$ word with the correct POS-tag. Given a labeled training corpus, the simplest possible strategy is to predict for any $w$ word the tag with which it is most frequently associated in the corpus. Somewhat surprisingly, this primitive baseline can achieve over 90% accuracy in certain settings: e.g., [Jurafsky and Martin](https://web.stanford.edu/~jurafsky/slp3/8.pdf) report that

>If we train on the WSJ training corpus and test on sections 22-24 of the same corpus the most-frequent-tag baseline achieves an accuracy of 92.34%.

Naturally, this baseline does not address two crucial problems:

+ __Context dependence__: In real life, POS-tags are context dependent (if they were not, the baseline accuracy would be 100% on the corpus it was trained on).
+ __Unknown words__: The baseline has no strategy to deal with words that do not occur in the training corpus.
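The baseline and both of its weaknesses can be illustrated with a short sketch. The toy corpus, the function names and the fallback for unknown words (predicting the overall most frequent tag) are assumptions made for this example only:

```python
from collections import Counter, defaultdict


def train_most_frequent_tag(tagged_sentences):
    """Return (word -> most frequent tag, overall most frequent tag)."""
    per_word = defaultdict(Counter)
    overall = Counter()
    for sentence in tagged_sentences:
        for word, tag in sentence:
            per_word[word][tag] += 1
            overall[tag] += 1
    lexicon = {w: c.most_common(1)[0][0] for w, c in per_word.items()}
    default_tag = overall.most_common(1)[0][0]
    return lexicon, default_tag


def predict(words, lexicon, default_tag):
    # Assumption: unknown words fall back to the globally most frequent tag.
    return [lexicon.get(w, default_tag) for w in words]


toy_corpus = [
    [("John", "PROPN"), ("hit", "VERB"), ("the", "DET"), ("ball", "NOUN")],
    [("the", "DET"), ("song", "NOUN"), ("was", "AUX"), ("a", "DET"), ("hit", "NOUN")],
    [("they", "PRON"), ("hit", "VERB"), ("walls", "NOUN")],
]
lexicon, default_tag = train_most_frequent_tag(toy_corpus)
print(predict(["a", "huge", "hit"], lexicon, default_tag))
```

In this toy run "hit" is tagged as a verb although it is a noun in "a huge hit" (context dependence), and the unseen word "huge" simply receives the fallback tag (unknown words).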

## HMM-based tagging

A way more sophisticated approach is to build a probabilistic model to assign probabilities to possible __$\langle t_1,\dots, t_n \rangle$ tag sequences__ for the input sequence. Of course, in order to make this feasible, we have to make some independence assumptions, but we also want to capture dependencies, e.g. regularities like 

> determiners are frequently followed by nouns but very rarely by verbs

or

> prepositions are frequently followed by determiners or nouns

and a relatively obvious, and __historically important__ approach has been to work with a __hidden Markov model (HMM)__. 

In contrast to the ASR scenario, here the hidden states of the HMM will be the class labels (POS-tags), and the observable states will be words:

<a href="https://d1m75rqqgidzqn.cloudfront.net/wp-data/2020/04/16134154/pos2.png"><img src="https://drive.google.com/uc?export=view&id=1I1Sn6QVQfOKAogrGSm8zqP-mcMzYeTcT" width="400px"></a>

(Image source: [Part of Speech (POS) tagging with Hidden Markov Model](https://www.mygreatlearning.com/blog/pos-tagging/))

Since HMMs are __generative__ (as opposed to __discriminative__) models, it is worth thinking about this POS-tagging model in terms of language generation. What our HMM says is that a text can be generated by

1. generating a __sequence of POS-classes according to the transition probabilities__ (a "syntactic skeleton")
2. generating a __word for each chosen POS-class according to the emission probabilities__.

### Learning

Again, unlike in ASR, the training corpus gives us full __access to the hidden states (POS-tags)__, so given a large enough corpus, we can perform the MLE of transition and emission probabilities simply by __counting__:

__Transition probability:__
$$
\hat P(t_j \mid t_i) = \frac{Count(t_i, t_j)}{Count(t_i)}
$$

__Emission probability:__
$$
\hat P(w_k \mid t_i) = \frac{Count(w_k, t_i)}{Count(t_i)}
$$
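The counting procedure can be sketched as follows. The function name, the `-START-` symbol for the sentence-initial state and the toy corpus are assumptions of this example; since no end symbol is used here, the transition denominator counts how often a tag occurs as the *previous* tag:

```python
from collections import Counter


def estimate_hmm(tagged_sentences, start="-START-"):
    """Relative-frequency (MLE) estimates of transition and emission probabilities."""
    transition_counts = Counter()  # (previous tag, tag) pairs
    emission_counts = Counter()    # (tag, word) pairs
    prev_counts = Counter()        # how often a tag occurs as "previous tag"
    tag_counts = Counter()         # how often a tag occurs at all
    for sentence in tagged_sentences:
        prev = start
        for word, tag in sentence:
            transition_counts[(prev, tag)] += 1
            prev_counts[prev] += 1
            emission_counts[(tag, word)] += 1
            tag_counts[tag] += 1
            prev = tag
    transitions = {(p, t): c / prev_counts[p] for (p, t), c in transition_counts.items()}
    emissions = {(t, w): c / tag_counts[t] for (t, w), c in emission_counts.items()}
    return transitions, emissions


toy_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
    [("dogs", "NOUN"), ("bark", "VERB")],
]
transitions, emissions = estimate_hmm(toy_corpus)
print(transitions[("DET", "NOUN")], emissions[("NOUN", "dog")])
```

Any pair that never occurs in the corpus is simply missing from the two dictionaries, i.e., it implicitly gets probability zero.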


A serious complication is that some combinations can be missing from the corpus. In these cases __smoothing methods__ can be used, which will be discussed in some detail later when we talk about language modeling.

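Decoding means finding the most likely tag sequence for a sentence:

$$\hat{T}=\underset{t_1,\dots,t_N}{\operatorname{argmax}} \prod\nolimits_{n=1}^N P(w_n \mid t_n)P(t_n \mid t_{n-1})$$

where $t_0$ is the sentence start symbol.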
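The maximization can be carried out efficiently with the Viterbi algorithm. The following is a minimal sketch for a first-order HMM; the probability tables are made-up toy values, and the `floor` parameter, which gives unseen events a tiny probability, is a crude stand-in for real smoothing (both are assumptions of this example):

```python
import math


def viterbi(words, tags, transitions, emissions, start="-START-", floor=1e-12):
    """Most probable tag sequence for a non-empty sentence, computed in log space."""
    def log_p(table, key):
        # Unseen events get a tiny floor probability instead of real smoothing.
        return math.log(table.get(key, floor))

    # best[i][t]: log-probability of the best path ending in tag t at position i
    best = [{t: log_p(transitions, (start, t)) + log_p(emissions, (t, words[0]))
             for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] + log_p(transitions, (p, t)))
            best[i][t] = (best[i - 1][prev] + log_p(transitions, (prev, t))
                          + log_p(emissions, (t, words[i])))
            back[i][t] = prev
    # Follow the back-pointers from the best final tag.
    last = max(tags, key=lambda t: best[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]


tags = ["DET", "NOUN", "VERB"]
transitions = {("-START-", "DET"): 0.8, ("-START-", "NOUN"): 0.2,
               ("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.1,
               ("NOUN", "VERB"): 0.6, ("NOUN", "NOUN"): 0.4,
               ("VERB", "DET"): 0.7, ("VERB", "NOUN"): 0.3}
emissions = {("DET", "the"): 1.0, ("NOUN", "dog"): 0.5,
             ("NOUN", "hit"): 0.5, ("VERB", "hit"): 1.0}
print(viterbi(["the", "hit"], tags, transitions, emissions))
print(viterbi(["the", "dog", "hit"], tags, transitions, emissions))
```

With these toy numbers the same word "hit" is decoded as a noun after a determiner and as a verb after a noun, which is exactly the kind of context dependence the most-frequent-tag baseline cannot capture.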

### Unknown words

How can we handle unknown words with this type of model? If __no emission probabilities__ are allocated to unknown words then the __probability__ of any observation sequence containing one will be __zero__, so there will __not be any difference__ between the probability of __tag sequences__. This means that we have to allocate __some emission probability__ mass to __unknown words__. The simplest way to do this is (again) smoothing: tags (for open classes) should have some probability of "emitting" unknown words, which can be handled by introducing a new symbol, e.g. UNK, into the model with smoothed probabilities.

#### Features for unknown words

A radically better approach is to recognize that even __unknown/unseen words__ can __have features__ that are useful for predicting the correct label and, crucially, _can_ be learned from the training corpus. E.g., even if we have never seen the word

> supercalifragilisticexpialidocious 

we can have the educated guess that there is a high probability that it is an adjective, based on its ending. This leads to the thought of adding unknown word features to the model, e.g.

+ word suffix (e.g., the last 4 characters)

Instead of emitting UNK, the model can now emit and handle symbols like UNK_with_suffix:ious, UNK_with_suffix:cely etc., and the ratio of the emission probabilities for these alternatives can be learned from the training corpus.
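A minimal sketch of the symbol mapping is shown below. The function name `emission_symbol` and the lowercasing of the suffix are assumptions of this example; to estimate the emission probabilities of the suffix symbols, one common option (also an assumption here) is to apply the same mapping to rare training words before counting.

```python
def emission_symbol(word, vocabulary, suffix_len=4):
    """Map a word to the symbol the HMM emits: itself if known, else a suffix class."""
    if word in vocabulary:
        return word
    return "UNK_with_suffix:" + word[-suffix_len:].lower()


vocabulary = {"the", "song", "was", "delicious"}
print(emission_symbol("song", vocabulary))
print(emission_symbol("supercalifragilisticexpialidocious", vocabulary))
```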

Variants of this strategy actually work quite well: the most important example is the [TnT](http://www.coli.uni-saarland.de/~thorsten/tnt/) POS-tagger, which used a second-order Markov model and handled unknown words based on their suffixes, achieving 96.7% accuracy on the PTB (see the [paper](https://arxiv.org/pdf/cs/0003055) for details).

#### The problem of generative modeling

An important problem with the above strategy for unknown words is that it is __hard to add other types of features__ (e.g. whether the word is capitalized or not) because, as HMMs are __generative models__, this would require modeling the probabilistic interdependencies between these features. This motivates switching to __discriminative models__, which model the $P(T \mid W)$ conditional probabilities instead of the full $P(T, W)$ joint distribution.

## Tagging with a maximum entropy Markov model (MEMM)
Based on: multinomial logistic regression, also known as maximum entropy classification or multiclass logistic regression

- Based on the idea of __logistic regression as a discriminative model__
- But logistic regression isn’t a __sequence model__: it assigns a __class__ to a __single observation__
- Solution: turn logistic regression into a __discriminative sequence__ model
- Run it on successive words: use the __class assigned to the prior word as a feature__ in the classification of the next word
- Apply logistic regression for __each possible outcome class__
- Choose the __most likely class__

When we apply logistic regression in this way (to a multi-class output), it’s called the __Maximum Entropy Markov model__ or __MEMM__.

Maximum entropy: the model distributes the probability mass over the labels using a softmax function, and among all distributions that satisfy the feature constraints observed in the training data, this form is the one with the largest entropy. For reference, a distribution that puts all probability mass ($P=1$) on a single label has the smallest entropy, whilst a uniform distribution has the largest.
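The two extreme cases of entropy mentioned above can be checked numerically with a small sketch (the helper name `entropy` and the four-label example are assumptions made for illustration):

```python
import math


def entropy(distribution):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return sum(-p * math.log2(p) for p in distribution if p > 0)


print(entropy([1.0, 0.0, 0.0, 0.0]))      # all mass on one label
print(entropy([0.25, 0.25, 0.25, 0.25]))  # uniform over four labels
```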


Keeping the Markov chain model for the tags but conditioning on the __input sequence ("changing the direction of the arrows")__ we get a probabilistic model that does not require us to model the distribution of the input: 



<a href="http://drive.google.com/uc?export=view&id=1WKQNlnpkcUljDCdxdJL7RuA_2uZZS2U3"><img src="https://drive.google.com/uc?export=view&id=1yzkwMAjdseQW29wKDRTv955ngb_UQyLU" width="450"></a>

(Image source: [HMM, MEMM, and CRF: A comparative analysis](https://medium.com/@Alibaba_Cloud/hmm-memm-and-crf-a-comparative-analysis-of-statistical-modeling-methods-49fc32a73586))

### Feature templates

__Switch to discriminative modeling__:
- In contrast to our HMM-based solution, can use __all types of useful features__ about the input elements without having to model their interactions

For POS-tagging, like in many other sequence labeling scenarios, it is customary to actually condition only on __local features__ of the element to be labeled by considering only a __context window__. 

The following is a typical POS-tagging feature template for an individual input word $x_i$:

- Elements in a context window around $x_i$, e.g. $\langle x_{i-1}, x_{i}, x_{i+1} \rangle$ (using the presence of concrete words as features is called __lexicalization__)
- __Suffixes__ (of a fixed length) of the context window's elements
- __Prefixes__ (of a fixed length) of the context window's elements
- __Capitalization__ information of the context window's elements
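A feature template of this kind can be sketched as a function that returns a set of sparse binary features. The feature names, the padding symbols `-START-`/`-END-`, the affix length and the inclusion of the previous tag (which the MEMM uses as a feature) are all choices made for this example only:

```python
def extract_features(words, i, prev_tag, affix_len=3):
    """Sparse binary features for position i of a sentence (a list of strings)."""
    padded = ["-START-"] + list(words) + ["-END-"]
    features = {"prev_tag=" + prev_tag}
    for offset in (-1, 0, 1):
        token = padded[i + 1 + offset]
        name = f"w[{offset:+d}]"
        features.add(f"{name}={token.lower()}")              # lexicalized feature
        features.add(f"{name}.suffix={token[-affix_len:]}")
        features.add(f"{name}.prefix={token[:affix_len]}")
        features.add(f"{name}.is_capitalized={token[:1].isupper()}")
    return features


feats = extract_features(["John", "hit", "the", "ball"], 1, "PROPN")
for feature in sorted(feats):
    print(feature)
```

Each distinct string corresponds to one dimension of the one-hot encoded feature vector, which is why the number of features grows with the vocabulary.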



<a href="http://drive.google.com/uc?export=view&id=1ZRaHHiP2vPbbLNna8SpK5Rz7lOpWJkxg"><img src="https://drive.google.com/uc?export=view&id=1txD5NRw22UmTw8O4LVBQhf-J-kN7RibI" width="450"></a>

__WARNING__: A typical feature template like the above can actually produce __thousands of features__ because the categoricals have to be one-hot encoded. In addition to performance implications this also has consequences for data sparsity: smoothing becomes very important.

### Modeling the probabilities

Let the sequence of words be $W = w_{1}^{n}$ and the sequence of tags $T = t_{1}^{n}$. In an HMM, to compute the tag sequence that maximizes $P(T|W)$, we rely on Bayes' rule and the likelihood $P(W|T)$:

$ \hat{T}= \underset{T}{\operatorname{argmax}}\,P(T|W)$

$ \hat{T}= \underset{T}{\operatorname{argmax}}\,P(W|T)P(T)$

$ \hat{T}= \underset{T}{\operatorname{argmax}}\,\prod_{i}P(word_i|tag_{i})\prod_{i}P(tag_i|tag_{i-1})$

In an MEMM, by contrast, we compute the posterior $P(T|W)$ directly, training it to discriminate among the possible tag sequences:

$ \hat{T}= \underset{T}{\operatorname{argmax}}\,P(T|W)$

$ \hat{T}= \underset{T}{\operatorname{argmax}}\,\prod_{i}P(tag_i|word_{i}, tag_{i-1})$




#### Recap - logistic regression
We want the output to be a probability between 0 and 1:

$0\leq h_{\theta}(x) \leq 1$

$h_{\theta}(x)=g(\theta^T x)$

and as activation we take the sigmoid or logistic function

$g(z)=\frac{1}{1+e^{-z}}$

<a href="https://miro.medium.com/max/728/1*Xu7B5y9gp0iL5ooBj7LtWw.png"><img src="https://drive.google.com/uc?export=view&id=1QZ_UdtUIkpJnbtDlWvsdj4J0HdHEIW8s" width="450"></a>

$h_{\theta}(x)=\frac{1}{1+e^{-\theta^T x}}$

This can be illustrated as the activation of a single neuron:

<a href="https://miro.medium.com/max/1001/0*rtJ7w5lrNFwW1Per.jpg"><img src="https://drive.google.com/uc?export=view&id=1NrwjXk_HsuflYK4CDBYcmq23IpILTDV8" width="450"></a>

#### Maximum Entropy Regression
The main difference is that we now have multiple output classes:

<a href="https://i.stack.imgur.com/0rewJ.png"><img src="https://drive.google.com/uc?export=view&id=1t6M-DaC_ZzYgwJAFcZFVpSMlP96oUaXF" width="450"></a>


This is equivalent to:
- __Neural network__ with a __multi-class__ output 
- __Without a hidden layer__ 
- With a __softmax transformation__ over the output, which takes over the role of the __sigmoid activation__ of binary logistic regression

Note: except for the softmax function, it is very close to a linear perceptron with multiple classes as output.



The reason for taking the __softmax function__ is to normalize the output in terms of forming a probability distribution with probability mass 1.

Works as follows:

Let $y \in \{1,2,\dots,c\}$ be the class labels

and let $w(1), w(2),\dots,w(c)$ be the corresponding weight vectors, one per class. The predicted probability of class $i$ for the input $X_n$ is

$\hat{p}(y=i \mid X_n)=  \frac{\exp(w^T(i) X_n)}{\sum\limits_{j=1}^c \exp(w^T(j) X_n)}$

where $\exp(w^T(i) X_n)$

is the score for the particular label, also known as the scoring function,

and $\sum\limits_{j=1}^c \exp(w^T(j) X_n)$

is the partition function or normalizer: it simply adds up the scores of all classes to ensure that the total probability mass is 1.

#### Loss function

Let $y_n=i$ be the true label of the input $X_n$.

In logistic regression we take the log-loss:

$L(w)=-\log \big(\hat{p} (y_n|X_n)\big)$

which, for our function, translates into:

$L(w)=-\log \bigg(  \frac{\exp(w^T(i) X_n)}{\sum\limits_{j=1}^c \exp(w^T(j) X_n)}\bigg)$
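The softmax normalization and the log-loss can be sketched in a few lines of plain Python. The function names, the toy weight vectors and the input are assumptions of this example; subtracting the maximum score before exponentiating is a standard numerical-stability trick that does not change the result:

```python
import math


def softmax(scores):
    """Turn arbitrary real-valued class scores into a probability distribution."""
    shift = max(scores)                      # subtracting the max avoids overflow
    exps = [math.exp(s - shift) for s in scores]
    partition = sum(exps)                    # the normalizer (partition function)
    return [e / partition for e in exps]


def class_scores(weights, x):
    """One linear score per class; weights holds one weight vector per class."""
    return [sum(w_k * x_k for w_k, x_k in zip(w, x)) for w in weights]


def log_loss(weights, x, true_class):
    """Negative log-probability that the model assigns to the true class."""
    return -math.log(softmax(class_scores(weights, x))[true_class])


weights = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # three classes, two features
x = [2.0, 1.0]
probs = softmax(class_scores(weights, x))
print(probs, log_loss(weights, x, 0))
```

The loss is small when the true class already receives most of the probability mass and grows without bound as its probability approaches zero.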

#### Transferring Maximum Entropy Regression into a Maximum Entropy Markov Model

The above general formula can be adjusted to __account for the feature template__ as well as the __time dependence__.

Using a feature template function $f$ which generates local features for the index $i$ from $X$ and $y_{i-1}$, with __multiclass logistic regression__ (this is where the "maximum entropy" in MEMM comes from) we can model each individual $P(y_i \mid y_{i-1}, X)$.

The most likely sequence of tags is then computed by combining these features of the input word $w_i$, its neighbors within $l$ words $w_{i-l}^{i+l}$, and the previous $k$ tags $t_{i-k}^{i-1}$ as follows (using $\theta$ to refer to feature weights instead of $w$ to avoid confusion with $w$ meaning words):

$ \hat{T}= \underset{T}{\operatorname{argmax}}\,P(T|W)$

$ \hat{T}= \underset{T}{\operatorname{argmax}}\,\prod_{i}P(t_i \mid w_{i-l}^{i+l}, t_{i-k}^{i-1})$

$
\hat{T} = \underset{T}{\operatorname{argmax}}\,\prod_{i} \frac{ \exp\big(\sum\limits_{j} \theta_{j} f_j(t_{i}, w_{i-l}^{i+l}, t_{i-k}^{i-1})\big)}{\sum\limits_{t' \in tagset }\exp\big(\sum\limits_{j} \theta_{j} f_j(t', w_{i-l}^{i+l}, t_{i-k}^{i-1})\big)}
$

where the $\theta_j$ ($j=1,\dots,K$) are the learned weights of the features $f_j$ provided by the feature template.

### Higher order MEMM variants 
As with HMMs, it is straightforward to produce higher-order MEMM variants, at the expense of costlier inference and training.

### Inference

+ Greedy search is the simplest method and can work reasonably well, but
+ similarly to the case of HMMs, the most probable $Y$ sequence for a given $X$ input can be precisely calculated with the Viterbi algorithm.
+ If even Viterbi is too slow then beam search can be used.
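Greedy search, the simplest of these options, can be sketched as follows. The `local_prob` callback stands for the trained local model $P(y_i \mid y_{i-1}, X)$; here it is replaced by a hypothetical hand-written lookup table, so the numbers are purely illustrative:

```python
def greedy_decode(words, tags, local_prob, start="-START-"):
    """Pick the locally most probable tag at each position, left to right."""
    predicted = []
    prev = start
    for i in range(len(words)):
        # Each decision is final: later evidence cannot revise earlier tags.
        prev = max(tags, key=lambda t: local_prob(t, prev, words, i))
        predicted.append(prev)
    return predicted


# Hypothetical stand-in for a trained MEMM: P(tag | previous tag, current word).
TOY_TABLE = {
    ("-START-", "the"): {"DET": 0.9, "NOUN": 0.05, "VERB": 0.05},
    ("DET", "hit"): {"DET": 0.0, "NOUN": 0.8, "VERB": 0.2},
    ("NOUN", "hit"): {"DET": 0.0, "NOUN": 0.3, "VERB": 0.7},
}


def toy_local_prob(tag, prev_tag, words, i):
    return TOY_TABLE.get((prev_tag, words[i]), {}).get(tag, 0.0)


print(greedy_decode(["the", "hit"], ["DET", "NOUN", "VERB"], toy_local_prob))
```

Greedy decoding needs only one pass over the sentence, but, unlike Viterbi, it is not guaranteed to find the most probable sequence.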

### Learning

Since we managed to reduce our original sequence labeling problem to a simple linear classification, we can choose from a variety of learning algorithms, most importantly, we can use

+ gradient descent,
+ quasi-Newton methods like (L)BFGS
+ (as an approximation of the gradient) perceptron

#### Structured perceptron

Since in practice it works very well and (as we will see), its use can be generalized to other "structured prediction" settings, the application of the perceptron algorithm is worth detailing a bit. First, recall the perceptron update rule for multiclass classification:

For each $\langle \mathbf x, y\rangle$ data point, if the $\hat y$ prediction with the current weight is incorrect, then:
1. Update the weights for the correct class: $\mathbf w_y \leftarrow \mathbf w_y + \eta \mathbf x$
2. Update the weights for the incorrect class: $\mathbf w_{\hat{y}} \leftarrow \mathbf w_{\hat{y}} - \eta \mathbf x$

where $\eta$ is the learning rate.

This can be adapted to the sequence labeling setting as follows: For each data point $\langle X, Y\rangle$ consisting of an input sequence and its correct label sequence:

1. Calculate the $\hat Y=\langle \hat y_1,\dots, \hat y_N \rangle $  most probable predicted label  sequence using the current weights, e.g., by Viterbi.
2. Apply the above perceptron rule for each sequence index $i \in \{1,\dots, N\}$ using  $\langle f(\hat y_{i-1}, X, i), y_i \rangle$ as the data point and $\hat y_i$ as the predicted label.
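The two steps can be sketched with a deliberately tiny model. Everything specific here is an assumption made for illustration: the three-feature template, greedy decoding in place of Viterbi, the fixed number of epochs and the toy corpus.

```python
from collections import defaultdict


def features(words, i, prev_tag):
    """Hypothetical, deliberately tiny feature template."""
    return ["bias", "word=" + words[i].lower(), "prev_tag=" + prev_tag]


def score(weights, tag, feats):
    return sum(weights[tag][f] for f in feats)


def decode(weights, tags, words, start="-START-"):
    """Greedy left-to-right decoding (Viterbi could be used instead)."""
    predicted, prev = [], start
    for i in range(len(words)):
        feats = features(words, i, prev)
        prev = max(tags, key=lambda t: score(weights, t, feats))
        predicted.append(prev)
    return predicted


def train(tagged_sentences, tags, epochs=5, eta=1.0, start="-START-"):
    weights = {t: defaultdict(float) for t in tags}
    for _ in range(epochs):
        for sentence in tagged_sentences:
            words = [w for w, _ in sentence]
            gold = [t for _, t in sentence]
            predicted = decode(weights, tags, words, start)   # step 1
            for i, (y, y_hat) in enumerate(zip(gold, predicted)):
                if y == y_hat:
                    continue
                # Step 2: features use the *predicted* previous tag.
                prev_hat = predicted[i - 1] if i > 0 else start
                for f in features(words, i, prev_hat):
                    weights[y][f] += eta       # reward the correct class
                    weights[y_hat][f] -= eta   # penalize the predicted class
    return weights


tags = ["DET", "NOUN", "VERB"]
toy_corpus = [
    [("the", "DET"), ("dog", "NOUN"), ("barks", "VERB")],
    [("the", "DET"), ("cat", "NOUN"), ("sleeps", "VERB")],
]
weights = train(toy_corpus, tags)
print(decode(weights, tags, ["the", "dog", "barks"]))
```

On this toy corpus the updates stop after a few epochs because both training sentences are decoded correctly; on real data one would typically also shuffle the sentences between epochs.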

As mentioned, in practice, structured perceptron works very well for optimizing MEMMs, especially the averaged variants.


### MEMM problems and their solutions

Unfortunately, MEMMs have some important limitations:
+ __One-directionality__: Although indirectly Viterbi makes it possible for the next tag(s) to have some influence on the current one, this information cannot be part of the features on which the linear classifier is trained.
+ __The so-called label bias problem__: Since outgoing state transition probabilities are normalized (add up to 1 since they are conditional probabilities), labels with few outgoing transitions (1 in the extreme case) are preferred to those with a large number of outgoing edges and can be chosen as part of the most probable sequence disregarding observations.



<a href="http://www.davidsbatista.net/assets/images/2017-11-13-Label_Bias_Problem.png"><img src="https://drive.google.com/uc?export=view&id=1Y1wwj_IbgRI8KIIpMCWJT4Sqb3HyHUJ0" width="800"></a>

In an MEMM, normalization is done locally, per state: the probabilities of the transitions going out from a state have to sum to 1, so these transitions compete only against each other, as opposed to all the other transitions in the model. As a consequence, states with a single outgoing transition effectively ignore their observations, and the model is biased towards states with fewer outgoing arcs.

__Solutions to achieve bidirectionality__
+ Making multiple one-directional passes: starting from the second pass later labels can also be used from the previous pass.
+ Using both a forward and a backward MEMM, and choosing the higher scoring label-sequence when decoding -- either tag-by-tag or from the two entire Viterbi decoded label sequences.
+ Finally, using bidirectional probabilistic models is also a possibility:
    + The Stanford POS tagger uses a special, bidirectional MEMM version, see [Toutanova et al.:  Feature-rich part-of-speech tagging with a cyclic dependency network](https://www.aclweb.org/anthology/N03-1033.pdf)
    + A more principled solution is to use a Conditional Random Fields model (CRF), to which we will dedicate a separate section.

### See also
- For some more details on MEMM-based taggers see [chapter 8 of Jurafsky and Martin](https://web.stanford.edu/~jurafsky/slp3/8.pdf), on which the present discussion was partly based.
- The first version of spaCy used MEMMs and averaged perceptrons for most of its NLP models, see [A Good Part-of-Speech Tagger in about 200 Lines of Python](https://explosion.ai/blog/part-of-speech-pos-tagger-in-python) for some details.

## Linear chain conditional random fields (CRFs)

The critical difference between CRF and MEMM is that the latter uses __per-state exponential models__ for the conditional probabilities of next states given the current state, whereas CRF uses a __single exponential model to determine the joint probability of the entire sequence__ of labels, given the observation sequence. Therefore, in CRF, the weights of different features in different states compete against each other.

This means that in MEMMs there is a separate model for computing the probability of the next state, given the current state and the observation. On the other hand, a __CRF computes all state transitions globally, in a single model__.

In contrast to HMMs and MEMMs, linear-chain CRFs are undirected graphical models with the following structure:

<a href="http://www.davidsbatista.net/assets/images/2017-11-13-Conditional_Random_Fields.png"><img src="https://drive.google.com/uc?export=view&id=1qxU2orll4pNMczGWxS0MOMORY_ViMbMi" width="450"></a>

(Image source: [Conditional Random Fields for Sequence Prediction](http://www.davidsbatista.net/blog/2017/11/13/Conditional_Random_Fields/))

Similarly to MEMMs, CRFs are discriminative models that condition on the input, but they are
+ undirected, solving the one-directionality problem
+ allow observations to "modulate" the local probability mass for state transitions, solving the label bias problem.

<a href="http://drive.google.com/uc?export=view&id=18t3BCoHhsZpdeUI7qy-bnuoweNoQjtzu"><img src="https://drive.google.com/uc?export=view&id=1vcAg3QMZON52d3aNp0WAQa0vhTgalhHq" width="450"></a>

#### Modeling Linear CRF

Suppose we want to classify a whole sequence and apply regular (multiclass logistic regression) classification independently at each position. Then:

$p(\bar{y}|x)= \prod_{k=1}^{n} p({y_k}|x_k)=\prod_{k=1}^{n}\bigg( \frac{\exp(a_k x_k)}{z(x_k)}\bigg)$

where $a_k$ denotes the weight vector belonging to the label $y_k$ and $z(x_k)$ is the normalizer at position $k$. This tells us that the __probability of a sequence of labels__, given that they are independent, is simply the __product of the individual label probabilities__.

The __product of the exponentials__ is the __exponential of the sum__, so

$p(\bar{y}|x)=\frac {\exp \big(\sum\limits_{k=1}^n a_k x_k\big)} {\prod\limits_{k=1}^{n}z(x_k)}$

We __adjust__ this to __obtain sequence classification__ that also takes into account the interdependence between neighboring labels:

$p(\bar{y}|x)=\frac{\exp \big(\sum\limits_{k=1}^n a_k x_k + \sum\limits_{k=1}^n V_{y_k,y_{k-1}} \big)}{Z(x)}$

where $\sum\limits_{k=1}^n a_k x_k$ describes how likely each $y_k$ is given the input,

and $V$ is a matrix whose entry $V_{y_k,y_{k-1}}$ is the score of $y_{k-1}$ being followed by $y_k$, expressing the model's preference for particular pairs of adjacent labels. Because the labels are no longer independent, the normalizer $Z(x)$ has to be computed over whole label sequences instead of position by position.

__This can be slightly rewritten to point out the general structure:__

Define

$F(y,x)= \sum\limits_{k=1}^n a_k x_k +\sum\limits_{k=1}^n V_{y_k,y_{k-1}}$ as the scoring function

and

$Z(x)= \sum\limits_{y'} \exp\big(F(y',x)\big)$ as the partition function, where the sum runs over all possible label sequences $y'$ of length $n$,

so,

$P(y \mid x)=\frac {\exp(F(y,x))} {Z(x)}$

which is another way in which these types of models are frequently presented.
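For very short sequences the globally normalized probability can be computed by brute force, which makes the role of the partition function concrete. In the sketch below the weights `a`, the transition matrix `V` and the inputs are made-up toy values, labels are encoded as integers, and no transition score is added at the first position (all assumptions of this example). Practical implementations compute the partition function with the forward algorithm instead of enumerating all sequences.

```python
import math
from itertools import product


def sequence_score(labels, inputs, a, V):
    """F(y, x): per-position scores a[y_k] . x_k plus transition scores V[y_k][y_{k-1}]."""
    total = 0.0
    for k, (y, x) in enumerate(zip(labels, inputs)):
        total += sum(w * v for w, v in zip(a[y], x))
        if k > 0:
            total += V[y][labels[k - 1]]
    return total


def crf_probability(labels, inputs, a, V):
    """P(y | x) with the partition function computed by brute-force enumeration."""
    n_labels = len(a)
    Z = sum(math.exp(sequence_score(candidate, inputs, a, V))
            for candidate in product(range(n_labels), repeat=len(inputs)))
    return math.exp(sequence_score(labels, inputs, a, V)) / Z


a = [[1.0, 0.0], [0.0, 1.0]]       # one weight vector per label
V = [[0.5, -0.5], [-0.5, 0.5]]     # V[current][previous]: prefers repeated labels
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(crf_probability((0, 0, 0), inputs, a, V))
print(crf_probability((0, 1, 0), inputs, a, V))
```

With these toy values the sequence (0, 0, 0) is more probable than (0, 1, 0), even though label 1 has the higher local score at the middle position: the transition scores outweigh the local evidence.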

### Inference

Calculating the most probable label sequence for a given $X$ input sequence can be done analogously to MEMMs, i.e., with Viterbi or beam search --  the computation of the normalizer factor is not required as it is the same for all alternatives.

### Learning

The methods used for MEMMs (GD variants, quasi-Newton and structured perceptron) can all be used, but global normalization makes the training significantly slower than in the case of MEMMs, as computing the normalizer requires computing the sum of scores for all possible label variations for the input sequences in the data set.

### Higher order variants 
Similarly to HMMs and MEMMs, there are higher-order CRF variants, see, e.g., [Cuong et al. (2014): Conditional Random Field with High-order Dependencies for Sequence Labeling and Segmentation](http://www.jmlr.org/papers/volume15/cuong14a/cuong14a.pdf).

### See also

+ The paper which introduced CRFs: [Lafferty et al. (2001): Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data](http://repository.upenn.edu/cgi/viewcontent.cgi?article=1162&context=cis_papers)
+ A detailed introduction to CRFs: [Sutton and McCallum: An introduction to CRFs](https://homepages.inf.ed.ac.uk/csutton/publications/crftut-fnt.pdf)
+ The most important CRF implementation is the [CRFSuite](http://www.chokkan.org/software/crfsuite/)
+ CRFSuite has a Python wrapper as well, see [python-crfsuite](https://python-crfsuite.readthedocs.io/en/latest/)

# Neural methods

## The standard: word embeddings +  RNN(s) + softmax

Starting from the early 2000s, neural network-based solutions started to appear and eventually outperformed the traditional ML-based, manually feature engineered approaches.

The most common, standard neural architecture has been to use RNNs (mostly LSTM variants) to classify the tokenized and embedded input text token by token with a softmax output layer, which can be trained, e.g., with the usual cross-entropy loss:

<a href="https://miro.medium.com/max/1481/1*wf9iOTO853P5ewjPX079RQ.png"><img src="https://drive.google.com/uc?export=view&id=1q42Ab7-I-d5_zqAIXNwdvCMluAeu7Kr6" width="800"></a>

(Image source: [Taming LSTMs](https://towardsdatascience.com/taming-lstms-variable-sized-mini-batches-and-why-pytorch-is-good-for-your-health-61d35642972e))

__Word embeddings__

Word embeddings, used as the input of the RNN layer, are __dense representations__ of the __input tokens__ that are typically produced in an unsupervised manner. The details will be discussed later, but it is worth noting that they play the __role of the manually engineered feature vectors__ of the old ML models, although -- at least in the simplest case -- they are context independent, characterizing the token in general, without reflecting its actual context in the input sequence.

__Bidirectional RNN layers__

One-directionality is an important limitation of individual RNN-layers: the output is based on a one-sided context only. A standard solution to this problem is to use __bidirectional RNN layers__ (or bi-RNNs),  which combine the output of a forward and backward processing RNN at each position:

<a href="https://www.i2tutorials.com/wp-content/media/2019/05/Deep-Dive-into-Bidirectional-LSTM-i2tutorials.jpg"><img src="https://drive.google.com/uc?export=view&id=1_1QDv9PB190BW9janTjzqdMdvT4sLt2g" width="700px"></a>

(Image source: [Deep dive into bidirectional LSTM](https://www.i2tutorials.com/technology/deep-dive-into-bidirectional-lstm/))
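
The following sketch shows the idea with a deliberately tiny scalar RNN (no gates, one-dimensional inputs and states, hand-picked weights, all of which are simplifying assumptions). It demonstrates that changing the last input leaves the forward state at the first position untouched, while the bidirectional representation of that position does change.

```python
import math

def run_rnn(inputs, w_in=0.8, w_rec=0.5):
    """Minimal scalar Elman-style RNN: h_t = tanh(w_in * x_t + w_rec * h_{t-1})."""
    h, states = 0.0, []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states

def run_birnn(inputs):
    """Pair the forward state with the backward state at every position."""
    forward = run_rnn(inputs)
    # The backward RNN has its own weights and reads the sequence reversed;
    # its states are reversed again so that they line up with the positions.
    backward = run_rnn(inputs[::-1], w_in=0.6, w_rec=0.4)[::-1]
    return list(zip(forward, backward))

seq_a = [1.0, 0.5, -1.0]
seq_b = [1.0, 0.5, 2.0]  # differs from seq_a only at the last position

fwd_a, fwd_b = run_rnn(seq_a), run_rnn(seq_b)
bi_a, bi_b = run_birnn(seq_a), run_birnn(seq_b)

print(fwd_a[0] == fwd_b[0])  # forward state at position 0 sees no right context
print(bi_a[0] == bi_b[0])    # the bidirectional pair at position 0 does
```

In practice the two states are vectors and are usually concatenated, but the alignment logic is the same.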

Naturally, bidirectional RNNs can be stacked similarly to one-directional ones:

<a href="https://d3i71xaburhd42.cloudfront.net/828dbeb7cf922dc9b6657dd169b8d26d2b58eedb/3-Figure1-1.png"><img src="https://drive.google.com/uc?export=view&id=1dnSzM0YKQOXZJ3zJy2Liwnsu9f9nqu0l" width="500px"></a>

(Image source: [A Long Short-Term Memory Model for Answer Sentence Selection in Question Answering](https://www.semanticscholar.org/paper/A-Long-Short-Term-Memory-Model-for-Answer-Sentence-Wang-Nyberg/828dbeb7cf922dc9b6657dd169b8d26d2b58eedb))

## CNN-based alternatives

In the last few years, CNN-based sequence labeling architectures have also played an important role. Their performance can be competitive with that of RNNs, and they are typically faster because convolutions can be computed in parallel across positions. spaCy 2.x versions, for instance, use dilated 1D convolutions (see https://github.com/explosion/spaCy/issues/1057 for some details):


<a href="https://d3i71xaburhd42.cloudfront.net/1478075a10ea2e0a1b5bdc170468dcea81e6fcb2/2-Figure1-1.png"><img src="https://drive.google.com/uc?export=view&id=1gCEkzatLbCo7YVWEpe_FBdTg6SAxKH_W"></a>

(Image source: [Strubell et al. (2017): Fast and Accurate Entity Recognition with Iterated Dilated Convolutions](https://arxiv.org/pdf/1702.02098.pdf))

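Dilated convolutions give an especially large context window for classification: if the dilation factor is doubled at each layer, the window grows exponentially with the number of layers. In theory, RNNs are not limited to a context window at all, but in practice truncated backpropagation through time limits the context they learn to use as well.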
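
The size of the context window (receptive field) of stacked 1D convolutions can be computed directly: each layer with kernel size `k` and dilation `d` widens the window by `(k - 1) * d` positions. The helper below is a small illustrative sketch; the kernel size and the dilation schedule are example values, not those of any particular model.

```python
def receptive_field(kernel_size, dilations):
    """Number of input positions that can influence one output position
    after stacking 1D convolutions with the given dilation factors."""
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field

plain = receptive_field(3, [1, 1, 1, 1])    # four ordinary conv layers
dilated = receptive_field(3, [1, 2, 4, 8])  # dilation doubles at each layer

print(plain, dilated)  # 9 31
```

With the same number of layers and parameters, the dilated stack covers 31 tokens instead of 9.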

## Neural-CRF hybrids

The RNN- and CNN-based architectures mentioned so far __do not explicitly model the interaction between tags__ -- the final tagging decisions are made independently from each other (although, of course, on the basis of context-dependent representations). This fact led to the emergence of hybrid models that contain an __explicit "inference" layer on top of the RNN or CNN layers that represents the label interactions__, the __typical choice being a CRF__.

From the point of view of accuracy, __neural-CRF hybrids__ represent the state of the art in many sequence labeling tasks: adding a CRF top layer to existing models typically improves results a bit, at the cost of slower training and inference. In fact, hybrid sequence taggers are so common that there is even a dedicated platform to build them: [NCRF++](https://github.com/jiesutd/NCRFpp). NCRF++ supports building models that follow this general schema:

<a href="https://d3i71xaburhd42.cloudfront.net/67d40c3f7470287a3bccfccdb506bcb6d522ac8c/2-Figure2-1.png"><img src="https://drive.google.com/uc?export=view&id=1kT4wv0rSyCBEu_FASkBZZf2GbdrhYHxS" width="750px"></a>

(Image from the NCRF++ paper: [NCRF++: An Open-source Neural Sequence Labeling Toolkit](https://arxiv.org/pdf/1806.05626.pdf))
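
To illustrate what the CRF layer adds at prediction time, the sketch below combines per-token scores (as a neural encoder would produce them) with tag-to-tag transition scores and finds the best tag sequence with Viterbi decoding. It covers decoding only, not CRF training, and all scores and tag names are invented for the example; it is not the NCRF++ implementation.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag index sequence.

    emissions[t][j]   : score of tag j at position t (e.g. from an RNN/CNN)
    transitions[i][j] : score of tag i being followed by tag j
    """
    n_tags = len(emissions[0])
    best = list(emissions[0])  # best score of any path ending in each tag
    backpointers = []
    for scores in emissions[1:]:
        new_best, pointers = [], []
        for j in range(n_tags):
            prev = max(range(n_tags), key=lambda i: best[i] + transitions[i][j])
            pointers.append(prev)
            new_best.append(best[prev] + transitions[prev][j] + scores[j])
        best = new_best
        backpointers.append(pointers)
    # Recover the best path by following the backpointers from the end
    tag = max(range(n_tags), key=lambda j: best[j])
    path = [tag]
    for pointers in reversed(backpointers):
        tag = pointers[tag]
        path.append(tag)
    return path[::-1]

TAGS = ["DET", "NOUN", "VERB"]  # hypothetical toy tag set

# Hypothetical scores for a two-token input such as "the hit"
emissions = [
    [3.0, 0.1, 0.1],  # first token: clearly a determiner
    [0.0, 1.0, 1.2],  # second token: the encoder slightly prefers VERB
]
transitions = [
    [0.0, 1.0, -2.0],  # from DET: NOUN is likely, VERB is unlikely
    [0.0, 0.0, 0.0],   # from NOUN
    [0.0, 0.0, 0.0],   # from VERB
]

independent = [TAGS[max(range(len(TAGS)), key=row.__getitem__)] for row in emissions]
joint = [TAGS[i] for i in viterbi(emissions, transitions)]
print(independent)  # ['DET', 'VERB']
print(joint)        # ['DET', 'NOUN']
```

The independent per-token decision picks `VERB` for the second token, while the joint decoding prefers `NOUN` because the transition scores penalize a verb directly after a determiner.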

## The role of subword/character based embeddings

Although our discussion so far has concentrated on token/word-level models, it is very important to keep in mind that the performance of these models can only be as good as the token-level representations (embeddings) they rely on, especially for unseen/unknown words, for which there is no token-level information in the training data. Modeling tokens on a subword (character, morpheme etc.) level is therefore a crucial part of neural sequence tagging -- we will return to this problem in the word embedding class.
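
One common way of exposing subword information, used here purely as an illustration, is to decompose each word into character n-grams with boundary markers. The function name, the n-gram lengths and the example words below are assumptions of this sketch.

```python
def char_ngrams(word, n_min=3, n_max=4):
    """Character n-grams of a word padded with the boundary markers '<' and '>'."""
    padded = f"<{word}>"
    return {
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    }

seen = char_ngrams("tagging")      # assume this word occurred in training
unseen = char_ngrams("retagging")  # assume this one did not

shared = seen & unseen
print(sorted(shared))
```

Even though "retagging" was never seen as a whole token, it shares n-grams such as `tag`, `ing` and the word-final `ng>` with a known word, so a model working on these units still has useful evidence (for instance, the `-ing` suffix) for its part of speech.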

## See also

+ The textbooks by [Jurafsky & Martin](https://web.stanford.edu/~jurafsky/slp3/9.pdf) and [Eisenstein](https://github.com/jacobeisenstein/gt-nlp-class/blob/master/notes/eisenstein-nlp-notes.pdf) both contain good introductory chapters on neural sequence tagging.
+ The paper [Bai et al. (2018): An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling](https://arxiv.org/pdf/1803.01271.pdf) is a useful comparison of the performance of CNNs and RNNs in sequence modeling.

# When will we reach 100% accuracy?

The current state of the art in POS tagging on the industry-standard PTB/WSJ data set is (according to [NLP progress](https://github.com/sebastianruder/NLP-progress/blob/master/english/part-of-speech_tagging.md)) 97.96%, but the TnT HMM tagger was already at 96.7% in 2000 (!), so progress has been far from fast. In a very interesting paper entitled [Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?](https://nlp.stanford.edu/pubs/CICLing2011-manning-tagging.pdf), Christopher D. Manning conducts an error analysis and discusses reasons why we will probably __not__ reach 100%.

In addition to the fact that the standard data sets actually contain incorrect annotations in certain cases, and in general interannotator agreement itself seems to be around 97%, the most important barrier might be __theoretical__: Manning asks the question 

> Are part-of-speech labels well-defined discrete properties enabling us to assign each word a single symbolic label?

and suggests that the answer is not necessarily affirmative.